### BeautifulSoup
The library we will use to easily parse HTML is called BeautifulSoup.<br>
The BeautifulSoup documentation is here: https://beautiful-soup-4.readthedocs.io/en/latest/ <br>
The website we will scrape in this part is https://quotes.toscrape.com/

In [None]:
pip install beautifulsoup4

In [1]:
from bs4 import BeautifulSoup as bs
import urllib.request

In [2]:
html = urllib.request.urlopen("https://quotes.toscrape.com/").read().decode('utf-8')

In [3]:
# BeautifulSoup takes a string, parses out the document structure,
# and turns it into a BeautifulSoup object.
soup = bs(html, "html.parser")
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"/>
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<div class="row">
<div class="col-md-8">
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
<span>by <small class="author" itemprop="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="

In [17]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

Some notes on terminology:
<br> <br>
You will notice that the HTML is broken into sections like \<title>Quotes to Scrape\</title> or \<p>\<a href="/login">Login\</a>\</p>
<br> These are called tags. In the examples above, we have a "title" tag and an "a" tag nested inside a "p" tag. <br><br>
A tag can also have attributes, which are listed after the tag name but inside the <>. Example: the "a" tag has an "href" attribute whose value is "/login".
<br><br> The part between > and < can be another tag or text. (It could also be other things, like images, but here we will only deal with the text part of the HTML.)
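
To make the terminology concrete, here is a minimal sketch that pulls each of those pieces out of a single tag. The `snippet` string and the `demo` and `link` names are made up for illustration; the snippet is typed by hand rather than fetched from the site.

```python
from bs4 import BeautifulSoup

# A small hand-written snippet (an assumption for illustration, not fetched from the site).
snippet = '<p><a href="/login">Login</a></p>'
demo = BeautifulSoup(snippet, "html.parser")

link = demo.find("a")
print(link.name)         # the tag name
print(link.attrs)        # all attributes, as a dict
print(link["href"])      # the value of one attribute
print(link.get_text())   # the text between > and <
print(link.parent.name)  # the name of the enclosing tag
```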

### Parsing the soup

The most useful function for finding what you want is soup.find_all(). It returns a list of every tag that matches the filters you give it.

In [5]:
# Let's get the quotes
soup.find_all("span")  # Returns a list of every <span> tag on the page.

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span>by <small class="author" itemprop="author">J.K. Rowling</small>
 <a href="/author/J-K-Rowling">(about)</a>
 </span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span>by <small class="author" itemprop="author">Albert Einstein</small>
 <a href="/author/Albert-Einstein">(about)</a>
 </span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,

In [6]:
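# Let's narrow down our search to spans with class="text".
# The keyword is class_ (trailing underscore) because class is a reserved word in Python.
soup.find_all("span", class_="text")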

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.

In [8]:
# Several attribute filters can be combined; a tag must match all of them.
soup.find_all("span", itemprop="text", class_="text")

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>,
 <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>,
 <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>,
 <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>,
 <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>,
 <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>,
 <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.
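
The same filter can be written in a couple of other ways. The sketch below, run on a hand-written `snippet` (an assumption standing in for the real page), compares keyword arguments, an `attrs` dictionary, and a CSS selector passed to `select()`. All three should pick out the same tag here.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one quote on the page (illustrative only).
snippet = """
<div class="quote">
  <span class="text" itemprop="text">First quote</span>
  <span>by <small class="author" itemprop="author">Someone</small></span>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# 1. Keyword arguments, as in the cell above.
by_keywords = demo.find_all("span", class_="text", itemprop="text")

# 2. A dictionary of attributes; handy when an attribute name is not a valid keyword.
by_attrs = demo.find_all("span", attrs={"class": "text", "itemprop": "text"})

# 3. A CSS selector.
by_css = demo.select("span.text[itemprop='text']")

for result in (by_keywords, by_attrs, by_css):
    print([tag.get_text() for tag in result])
```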

In [9]:
# There is also soup.find()
soup.find("span")  # Returns only the first matching tag.

<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
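
One difference worth keeping in mind is what the two functions return when nothing matches. The sketch below uses a hand-written one-line document (the `demo` name and its contents are assumptions for illustration) to show that `find()` gives back `None` while `find_all()` gives back an empty list, so the result of `find()` is worth checking before calling methods on it.

```python
from bs4 import BeautifulSoup

# Hand-written document with a single span and no table (illustrative only).
demo = BeautifulSoup('<span class="text">A quote</span>', "html.parser")

missing_one = demo.find("table")      # nothing matches
missing_all = demo.find_all("table")  # nothing matches

print(missing_one)       # None
print(len(missing_all))  # 0

# Guard before using the result of find(), since None has no get_text().
first_span = demo.find("span")
text = first_span.get_text() if first_span is not None else ""
print(text)
```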

In [10]:
# Another useful function is tag.get_text()
for quote in soup.find_all("span",class_="text"):
    print(quote.get_text())
    print("------")

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
------
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
------
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
------
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
------
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
------
“Try not to become a man of success. Rather become a man of value.”
------
“It is better to be hated for what you are than to be loved for what you are not.”
------
“I have not failed. I've just found 10,000 ways that won't work.”
------
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
------
“A day without sunshine is like, you know, night.”
------


In [11]:
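# No parser is named here, so BeautifulSoup picks one itself (and warns about it).
# The <html><body> wrapper in the output below was added by the parser it picked.
p = bs("<p>Hello world!</p>")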
p

<html><body><p>Hello world!</p></body></html>

In [12]:
p.find("p")

<p>Hello world!</p>

In [13]:
p.get_text()

'Hello world!'
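
When a tag contains other tags, get_text() joins all the nested text, including line breaks from the source. As a rough sketch, the example below uses a hand-written copy of one author span (the `snippet` and variable names are assumptions for illustration) and compares the default output with `get_text(" ", strip=True)`, which strips each piece of text and joins the pieces with a single space.

```python
from bs4 import BeautifulSoup

# Hand-written copy of one author span, line breaks included (illustrative only).
snippet = (
    '<span>by <small class="author">Albert Einstein</small>\n'
    '<a href="/author/Albert-Einstein">(about)</a>\n'
    '</span>'
)
demo = BeautifulSoup(snippet, "html.parser")
span = demo.find("span")

raw = span.get_text()                  # keeps the line breaks from the HTML
clean = span.get_text(" ", strip=True) # strips each piece, joins with a space

print(repr(raw))
print(repr(clean))
```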

In [19]:
# How would we get the authors?
# Try your own code here
for author in soup.find_all(class_="author"):
    print(author.get_text())
    print("-----------")

Albert Einstein
-----------
J.K. Rowling
-----------
Albert Einstein
-----------
Jane Austen
-----------
Marilyn Monroe
-----------
Albert Einstein
-----------
André Gide
-----------
Thomas A. Edison
-----------
Eleanor Roosevelt
-----------
Steve Martin
-----------


In [29]:
# What about the tags?
for tags in soup.find_all("div",class_="tags"):
    print(tags)
    print(tags.get_text())
    print("-------------------")

<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>

            Tags:
            
change
deep-thoughts
thinking
world

-------------------
<div class="tags">
            Tags:
            <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>

            Tags:
            
abilities
choices

-------------------
<div class="tags">
            Tags:
            <meta class="keywords" content="inspirational,life,live,miracle,miracles" itemprop="keywords"/>
<a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
<a class="

In [32]:
display(soup.find("div",class_="tags"))
display(soup.find("div",class_="tags").find("meta"))
display(soup.find("div",class_="tags").find("meta")["content"])

<div class="tags">
            Tags:
            <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>

<meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>

'change,deep-thoughts,thinking,world'

In [33]:
for tags in soup.find_all("div",class_="tags"):
    print(tags.find("meta")["content"])
    print("------------------------")

change,deep-thoughts,thinking,world
------------------------
abilities,choices
------------------------
inspirational,life,live,miracle,miracles
------------------------
aliteracy,books,classic,humor
------------------------
be-yourself,inspirational
------------------------
adulthood,success,value
------------------------
life,love
------------------------
edison,failure,inspirational,paraphrased
------------------------
misattributed-eleanor-roosevelt
------------------------
humor,obvious,simile
------------------------
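
The three loops above collect quotes, authors, and tags separately. One possible way to keep them together, sketched here on a hand-written `sample` that mirrors the structure of one quote block (the sample, the `records` list, and its dictionary keys are all assumptions for illustration), is to loop over each `div` with class "quote" and search inside it.

```python
from bs4 import BeautifulSoup

# Hand-written sample mirroring one quote block on the page (illustrative only).
sample = """
<div class="quote">
  <span class="text" itemprop="text">“A day without sunshine is like, you know, night.”</span>
  <span>by <small class="author" itemprop="author">Steve Martin</small></span>
  <div class="tags">
    <meta class="keywords" content="humor,obvious,simile" itemprop="keywords"/>
  </div>
</div>
"""
demo = BeautifulSoup(sample, "html.parser")

records = []
for block in demo.find_all("div", class_="quote"):
    # Searching from `block` only looks inside that one quote.
    records.append({
        "text": block.find("span", class_="text").get_text(),
        "author": block.find(class_="author").get_text(),
        "tags": block.find("meta", class_="keywords")["content"].split(","),
    })

for record in records:
    print(record)
```

Running the same loop on `soup` instead of `demo` should give one record per quote on the page, assuming the page keeps this structure.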


In [23]:
pp = bs('<b class="tags">Type b</b><a class="tags">Type a</a>')
pp

<html><body><b class="tags">Type b</b><a class="tags">Type a</a></body></html>

In [26]:
pp.find_all(class_="tags")

[<b class="tags">Type b</b>, <a class="tags">Type a</a>]
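
As the output shows, filtering on class_ alone matches any tag name. A short sketch of how to narrow that down follows; the `demo` document is hand-written, and the extra "big" class is an assumption added to show that class_ matches a tag when any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

# Hand-written document; the second tag carries two classes (illustrative only).
demo = BeautifulSoup(
    '<b class="tags">Type b</b><a class="tags big">Type a</a>', "html.parser"
)

any_tag = demo.find_all(class_="tags")      # matches both <b> and <a>
only_a = demo.find_all("a", class_="tags")  # tag name and class must both match

print([tag.name for tag in any_tag])
print([tag.name for tag in only_a])
```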