# Beautiful Soup Basics

Let's demonstrate the basics of Beautiful Soup by scraping the homepage of this website. Note: the site has been updated since this notebook was written, so the scraped content shown here will not reflect later changes. That doesn't matter here, since this notebook is for demonstration purposes only.

![notes_website_home.png](attachment:notes_website_home.png)

### Import modules

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup

### Scrape this site's home page and create a soup object from its html

In [2]:
# Store the url in a variable
url = 'https://rakeshbhatia.github.io/notes/'

# Get the site content using requests
r = requests.get(url)

# Extract text from the content
content = r.text

# Convert html text content into a beautiful soup object
soup = BeautifulSoup(content, 'html.parser')

print(soup)

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<!-- Begin Jekyll SEO tag v2.5.0 -->
<title>Python • Data Science • Machine Learning | Data Science for Truth</title>
<meta content="Jekyll v3.8.5" name="generator">
<meta content="Python • Data Science • Machine Learning" property="og:title"/>
<meta content="en_US" property="og:locale"/>
<link href="https://rakeshbhatia.github.io/notes/" rel="canonical"/>
<meta content="https://rakeshbhatia.github.io/notes/" property="og:url"/>
<meta content="Data Science for Truth" property="og:site_name"/>
<script type="application/ld+json">
{"headline":"Python • Data Science • Machine Learning","@type":"WebSite","url":"https://rakeshbhatia.github.io/notes/","name":"Data Science for Truth","@context":"http://schema.org"}</script>
<!-- End Jekyll SEO tag -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="#157878" name="theme-color"/>
<link href="/notes/assets/css/style.css?v=c85d4a303ad9a8ef6d7a3
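The fetch-and-parse step above can be wrapped in a small helper. The following is only a sketch: `fetch_html` and `sample_html` are hypothetical names that do not appear in the original notebook, and the timeout value is an arbitrary assumption. The parsing part is shown on an inline HTML string so it can be tried without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Offline stand-in for a downloaded page (made-up markup, not the real site)
sample_html = "<html><head><title>Sample page</title></head><body><p>Hi</p></body></html>"

# The same constructor call used above works on any HTML string
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(sample_soup.title.string)
```

With a network connection, `BeautifulSoup(fetch_html(url), 'html.parser')` should give the same kind of soup object as the cell above.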

### Website title

In [3]:
# View the soup object's title tag
soup.title

<title>Python • Data Science • Machine Learning | Data Science for Truth</title>

### Contents of title tag
* `.contents` returns a list of the tag's direct children

In [4]:
soup.title.contents

['Python • Data Science • Machine Learning | Data Science for Truth']

### String inside title tag

In [5]:
# View the string contained in the title tag
soup.title.string

'Python • Data Science • Machine Learning | Data Science for Truth'
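The title tag has only one child, so its `.contents` list has a single item. As a rough illustration on made-up markup (the `sample_soup` name and the HTML string are assumptions, not part of the scraped page), a tag that mixes text and nested tags yields one list item per direct child:

```python
from bs4 import BeautifulSoup

# Made-up markup: a div holding text, a nested tag, and more text
sample_soup = BeautifulSoup("<div>Some <b>bold</b> text</div>", "html.parser")

# .contents lists the direct children in document order
children = sample_soup.div.contents

# Text children are NavigableString objects, nested tags are Tag objects
kinds = [type(child).__name__ for child in children]

print(children)
print(kinds)

# A NavigableString can be converted to a plain Python string with str()
bold_text = str(sample_soup.b.string)
```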

### Parent of the title tag

In [6]:
soup.title.parent.name

'head'
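Besides `.parent`, a tag also exposes `.parents`, which walks up through every ancestor. A minimal sketch on made-up markup (`sample_soup` and the HTML string are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up markup with the same head/title nesting as the page above
html = "<html><head><title>Sample page</title></head><body></body></html>"
sample_soup = BeautifulSoup(html, "html.parser")

# .parent is the immediately enclosing tag
direct_parent = sample_soup.title.parent.name

# .parents iterates over every ancestor, ending at the soup object itself,
# whose name is the special value '[document]'
ancestors = [parent.name for parent in sample_soup.title.parents]

print(direct_parent)
print(ancestors)
```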

### First paragraph tag

In [7]:
# View the soup object's first paragraph tag
soup.p

<p>Hello! I’m Rakesh Bhatia. I enjoy searching for hidden truths in data, which inspired me to create this site with a variety of technical notes on python, data science, machine learning, and more. Check out all my technical notes below!</p>

### String inside paragraph tag
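* Note: `.string` only returns a value when the tag has a single child (a string, or a single nested tag that itself has a `.string`)
    * If the tag contains nested tags alongside its own text, `.string` returns `None`, and the string must be extracted from the tag that immediately encloses it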

In [8]:
soup.p.string

'Hello! I’m Rakesh Bhatia. I enjoy searching for hidden truths in data, which inspired me to create this site with a variety of technical notes on python, data science, machine learning, and more. Check out all my technical notes below!'
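To see how `.string` behaves with nested tags, here is a small sketch on made-up markup (the HTML string and variable names are assumptions). It also uses `get_text()`, which joins all the text inside a tag, as one possible alternative when `.string` is `None`:

```python
from bs4 import BeautifulSoup

# Made-up markup: the first paragraph mixes text with a nested link,
# the second paragraph contains only a single nested tag
html = "<p>Check out <a href='#'>my notes</a> below!</p><p><b>Only child</b></p>"
sample_soup = BeautifulSoup(html, "html.parser")
first, second = sample_soup.find_all("p")

# More than one child: .string is None
mixed_string = first.string

# get_text() concatenates every string inside the tag, nested or not
mixed_text = first.get_text()

# A single nested tag: .string falls through to the child's string
nested_string = second.string

print(mixed_string)
print(mixed_text)
print(nested_string)
```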

### First link tag

In [9]:
soup.a

<a class="btn" href="https://github.com/rakeshbhatia/notes">View on GitHub</a>
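A tag's attributes can be read much like dictionary entries. The sketch below re-parses the link markup shown above as a standalone string, so the variable names are assumptions rather than part of the original notebook:

```python
from bs4 import BeautifulSoup

# The first link from the output above, parsed on its own
html = '<a class="btn" href="https://github.com/rakeshbhatia/notes">View on GitHub</a>'
link = BeautifulSoup(html, "html.parser").a

# Index a tag like a dictionary to read an attribute
href = link["href"]

# class can hold several values, so it comes back as a list
classes = link["class"]

# .get() returns None for a missing attribute instead of raising KeyError
missing = link.get("id")

# The visible link text
text = link.get_text()

print(href, classes, missing, text)
```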

### Find all link tags and print the first two
* The `find_all()` method enables you to search the DOM tree for any desired elements by their tag
* Found tags are returned as a list of `Tag` objects
    * The result is always a list: if only a single tag is found, it has one element, and if nothing is found, it is empty
    * To get a single `Tag` object (or `None` when there is no match), use `find()` instead

In [10]:
soup.find_all('a')[0:2]

[<a class="btn" href="https://github.com/rakeshbhatia/notes">View on GitHub</a>,
 <a href="https://rakeshbhatia.github.io/notes/content/python/sets">Sets</a>]
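The difference between `find_all()` and `find()` can be checked on a small example. This sketch parses just the two links shown above as a standalone string; the variable names are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# The two links from the output above, parsed on their own
html = (
    '<a class="btn" href="https://github.com/rakeshbhatia/notes">View on GitHub</a>'
    '<a href="https://rakeshbhatia.github.io/notes/content/python/sets">Sets</a>'
)
sample_soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet, even for a single match
all_links = sample_soup.find_all("a")
single_match = sample_soup.find_all("a", class_="btn")

# limit stops the search after the given number of matches
limited = sample_soup.find_all("a", limit=1)

# find() returns the first matching Tag, or None if nothing matches
first_link = sample_soup.find("a")
no_match = sample_soup.find("table")

# A common follow-up: collect the href of every link
hrefs = [a["href"] for a in all_links]
print(hrefs)
```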

### Make our soup object's content more readable
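* `prettify()` returns a string with each tag and each string on its own line, indented by nesting depth, so the layout may differ from the site's original HTML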

In [11]:
pretty_soup = soup.prettify()
print(pretty_soup)

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <!-- Begin Jekyll SEO tag v2.5.0 -->
  <title>
   Python • Data Science • Machine Learning | Data Science for Truth
  </title>
  <meta content="Jekyll v3.8.5" name="generator">
   <meta content="Python • Data Science • Machine Learning" property="og:title"/>
   <meta content="en_US" property="og:locale"/>
   <link href="https://rakeshbhatia.github.io/notes/" rel="canonical"/>
   <meta content="https://rakeshbhatia.github.io/notes/" property="og:url"/>
   <meta content="Data Science for Truth" property="og:site_name"/>
   <script type="application/ld+json">
    {"headline":"Python • Data Science • Machine Learning","@type":"WebSite","url":"https://rakeshbhatia.github.io/notes/","name":"Data Science for Truth","@context":"http://schema.org"}
   </script>
   <!-- End Jekyll SEO tag -->
   <meta content="width=device-width, initial-scale=1" name="viewport"/>
   <meta content="#157878" name="theme-color"/>
   <link href
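As a small check of what `prettify()` produces, the sketch below runs it on made-up one-line markup (the HTML string and variable names are assumptions). The result is a plain string rather than a soup object, and the added line breaks and indentation are not present in the input:

```python
from bs4 import BeautifulSoup

# Made-up markup written on a single line
html = "<html><head><title>Sample page</title></head></html>"
sample_soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str, not a soup object
pretty = sample_soup.prettify()
print(pretty)

# Ignoring indentation, each tag and each string sits on its own line
stripped_lines = [line.strip() for line in pretty.splitlines()]
```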