# Beautiful Soup Basics
04 April 2019

### Import modules

In [156]:
import requests
import bs4
from bs4 import BeautifulSoup

### Scrape this site's home page and create a soup object from its html

In [157]:
# Store the url in a variable
url = 'https://rakeshbhatia.github.io/notes/'

# Get the site content using requests
r = requests.get(url)

# Get the response body as decoded text
content = r.text

# Parse the HTML text into a BeautifulSoup object
soup = BeautifulSoup(content, 'html.parser')

print(soup)

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<!-- Begin Jekyll SEO tag v2.5.0 -->
<title>Python &amp; Data Science | Data Science for Truth</title>
<meta content="Jekyll v3.7.4" name="generator">
<meta content="Python &amp; Data Science" property="og:title"/>
<meta content="en_US" property="og:locale"/>
<link href="https://rakeshbhatia.github.io/notes/" rel="canonical"/>
<meta content="https://rakeshbhatia.github.io/notes/" property="og:url"/>
<meta content="Data Science for Truth" property="og:site_name"/>
<script type="application/ld+json">
{"@type":"WebSite","url":"https://rakeshbhatia.github.io/notes/","name":"Data Science for Truth","headline":"Python &amp; Data Science","@context":"http://schema.org"}</script>
<!-- End Jekyll SEO tag -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="#157878" name="theme-color"/>
<link href="/notes/assets/css/style.css?v=6ec90d8ec648d62ec1680b0da961df765d98d8c4" rel="stylesheet"/>
</met
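
The fetch-and-parse steps above can be wrapped in a small helper. This is only a sketch: the function name `fetch_soup` and the 10-second default timeout are assumptions made for illustration, not part of the original notebook. It adds a timeout and an HTTP status check so that a failed request raises an error instead of silently parsing an error page.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    `fetch_soup` and the default timeout are illustrative choices.
    """
    # Fail fast instead of hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)

    # Raise requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()

    # Parse the decoded response body with Python's built-in HTML parser
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_soup('https://rakeshbhatia.github.io/notes/')` should then give a soup object like the one created above, assuming the site is reachable.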

### Website title

In [158]:
# View the soup object's title tag
soup.title

<title>Python &amp; Data Science | Data Science for Truth</title>

### Contents of title tag
* `.contents` returns a list of the tag's direct children

In [159]:
soup.title.contents

['Python & Data Science | Data Science for Truth']
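
The title tag has a single child, so its `.contents` list has one item. As a rough illustration of the general case, the sketch below uses a made-up HTML snippet (not taken from the site) in which a paragraph mixes text with a nested tag, so `.contents` holds several children of different types.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = '<p>Hello <b>bold</b> world</p>'
demo_soup = BeautifulSoup(html, 'html.parser')

# .contents lists the direct children: text nodes and tags, in document order
children = demo_soup.p.contents
print(len(children))                                 # 3
print([type(child).__name__ for child in children])  # ['NavigableString', 'Tag', 'NavigableString']
```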

### String inside title tag

In [160]:
# View the string contained in the title tag
soup.title.string

'Python & Data Science | Data Science for Truth'

### Parent of the title tag

In [161]:
soup.title.parent.name

'head'

### First paragraph tag

In [162]:
# View the soup object's first paragraph tag
soup.p

<p>I am a data scientist who is fascinated with solving challenging data-oriented problems across a wide variety of fields. I enjoy seeking the truth, revealing the truth, and searching for hidden truths in data. Check out my technical notes on python and data science below!</p>

### String inside paragraph tag
* Note: `.string` returns `None` when the tag has more than one child, for example text mixed with a nested tag
    * In that case, extract the string from the tag that immediately encloses it

In [163]:
soup.p.string

'I am a data scientist who is fascinated with solving challenging data-oriented problems across a wide variety of fields. I enjoy seeking the truth, revealing the truth, and searching for hidden truths in data. Check out my technical notes on python and data science below!'
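
To make the behavior of `.string` concrete, here is a minimal sketch on a made-up snippet (the markup is an assumption for illustration). It compares a paragraph with a single string child against one that also contains a nested tag, and shows `get_text()` as one way to collect all the text.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = '<div><p>Only text</p><p>Text with <em>nested</em> tag</p></div>'
demo_soup = BeautifulSoup(html, 'html.parser')

first, second = demo_soup.find_all('p')

print(first.string)       # 'Only text': the tag has exactly one string child
print(second.string)      # None: the tag has several children
print(second.em.string)   # 'nested': ask the tag that directly encloses the string
print(second.get_text())  # 'Text with nested tag': all text joined together
```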

### First link tag

In [164]:
soup.a

<a class="btn" href="https://github.com/rakeshbhatia/notes">View on GitHub</a>
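
Once you have a link tag, its attributes can be read like dictionary entries. The sketch below parses a copy of the tag shown above from a string so that it runs without a network request; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# A copy of the first link tag, parsed from a string so the example runs offline
html = '<a class="btn" href="https://github.com/rakeshbhatia/notes">View on GitHub</a>'
demo_soup = BeautifulSoup(html, 'html.parser')
link = demo_soup.a

print(link['href'])       # 'https://github.com/rakeshbhatia/notes'
print(link.get('class'))  # ['btn']: class is multi-valued, so it comes back as a list
print(link.get('title'))  # None: .get() avoids a KeyError for a missing attribute
print(link.get_text())    # 'View on GitHub'
```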

### Find all link tags and print the first two
* The `find_all()` method searches the parse tree for all elements with a given tag name
* Found tags are returned as a list of `Tag` objects
    * Even if only a single tag is found, `find_all()` still returns a list containing that one `Tag`; use `find()` to get a single `Tag` directly

In [165]:
soup.find_all('a')[0:2]

[<a class="btn" href="https://github.com/rakeshbhatia/notes">View on GitHub</a>,
 <a href="https://rakeshbhatia.github.io/notes/python/if_else">If Else</a>]
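
The difference between `find_all()` and `find()` is easiest to see on a page with exactly one match. The following sketch uses a made-up snippet with a single link (an assumption for illustration) and also shows a common follow-up step, collecting the `href` of every link.

```python
from bs4 import BeautifulSoup

# Made-up markup with a single link, for illustration only
html = '<ul><li><a href="/notes/python/if_else">If Else</a></li></ul>'
demo_soup = BeautifulSoup(html, 'html.parser')

links = demo_soup.find_all('a')        # a list-like ResultSet, even for one match
print(len(links))                      # 1

first_link = demo_soup.find('a')       # a single Tag
missing = demo_soup.find_all('table')  # an empty list when nothing matches
print(demo_soup.find('table'))         # None when nothing matches

# Collect the href attribute of every link found
hrefs = [a['href'] for a in links]
print(hrefs)                           # ['/notes/python/if_else']
```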

### Make our soup object's content more readable
* `prettify()` returns the markup as a string, with each tag and string on its own line and indented by nesting depth

In [166]:
pretty_soup = soup.prettify()
print(pretty_soup)

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <!-- Begin Jekyll SEO tag v2.5.0 -->
  <title>
   Python &amp; Data Science | Data Science for Truth
  </title>
  <meta content="Jekyll v3.7.4" name="generator">
   <meta content="Python &amp; Data Science" property="og:title"/>
   <meta content="en_US" property="og:locale"/>
   <link href="https://rakeshbhatia.github.io/notes/" rel="canonical"/>
   <meta content="https://rakeshbhatia.github.io/notes/" property="og:url"/>
   <meta content="Data Science for Truth" property="og:site_name"/>
   <script type="application/ld+json">
    {"@type":"WebSite","url":"https://rakeshbhatia.github.io/notes/","name":"Data Science for Truth","headline":"Python &amp; Data Science","@context":"http://schema.org"}
   </script>
   <!-- End Jekyll SEO tag -->
   <meta content="width=device-width, initial-scale=1" name="viewport"/>
   <meta content="#157878" name="theme-color"/>
   <link href="/notes/assets/css/style.css?v=6ec90d8ec648d
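
As a smaller illustration of what `prettify()` does, the sketch below runs it on a one-line made-up snippet (an assumption for illustration). It compares the result with `str()`, which keeps the markup compact.

```python
from bs4 import BeautifulSoup

# Made-up one-line markup for illustration only
html = '<p>Hi <b>there</b></p>'
demo_soup = BeautifulSoup(html, 'html.parser')

compact = str(demo_soup)       # markup without added whitespace
pretty = demo_soup.prettify()  # a str with added line breaks and indentation

print(compact)
print(pretty)
```

Because `prettify()` inserts whitespace, it is best suited to inspecting a document by eye rather than producing markup to save or serve.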