# Parse HTML with BeautifulSoup

## What you will learn in this course 🧐🧐

The web is full of HTML pages that contain a lot of insightful data. What if you could extract it to build your own custom dataset? The opportunities would be endless! In this course, we will cover:

* What `BeautifulSoup` is
* How to select a parser to read a webpage's content
* How to select any HTML element within a webpage

Let's get started with the `BeautifulSoup` library to crawl websites and harvest their data.

Before we start, make sure that `BeautifulSoup` is installed in your environment by running:

In [1]:
# Use '!' only if you are installing directly from your notebook. 
# '!' sign tells Jupyter Notebook to interpret the following code as bash code (what you use in your terminal)
!pip install beautifulsoup4



## Read content with BeautifulSoup 📰📰

First of all, to read HTML content, you will need to _parse_ your data using the library. This is done very simply as follows:

In [2]:
# Import BeautifulSoup
from bs4 import BeautifulSoup

# Instantiate the BeautifulSoup class
soup = BeautifulSoup("<html>data</html>", "html.parser") # Here we used an HTML Parser
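The markup does not have to be a string literal: `BeautifulSoup` also accepts an open file handle. Here is a minimal sketch, assuming a page saved on disk. The file name `page.html` and its content are made up for this illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical file name and content, created only for this illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    page_path = Path(tmp_dir) / "page.html"
    page_path.write_text(
        "<html><head><title>Saved page</title></head></html>", encoding="utf-8"
    )

    # BeautifulSoup reads the markup from the open file handle
    with open(page_path, encoding="utf-8") as page_file:
        file_soup = BeautifulSoup(page_file, "html.parser")

print(file_soup.title)
```

The parsed tree is kept in memory, so `file_soup` can still be used once the file is closed.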

If you are dealing with more complex content, other _parsers_ are available, including one for XML. Be careful, however: every parser except Python's built-in `html.parser` has to be installed with `pip` first. These are the ones you can choose from:

<table>
  <tr>
      <td>Parser</td>
      <td>Typical usage</td>
      <td>Advantages</td>
      <td>Disadvantages</td>
  </tr>
  <tr>
      <td>Python’s html.parser</td>
      <td>BeautifulSoup(markup, "html.parser")</td>
      <td>
         <ul>
            <li>Batteries included</li>
            <li>Decent speed</li>
            <li>Lenient (as of Python 2.7.3 and 3.2.2)</li>
         </ul>
      </td>
      <td>
         <ul>
            <li>Not very lenient (before Python 2.7.3 or 3.2.2)</li>
         </ul>
      </td>
  </tr>
  <tr>
      <td>lxml’s HTML parser</td>
      <td>BeautifulSoup(markup, "lxml")</td>
      <td>
         <ul>
            <li>Very fast</li>
            <li>Lenient</li>
         </ul>
      </td>
      <td>
         <ul>
            <li>External C dependency</li>
         </ul>
      </td>
  </tr>
  <tr>
      <td>lxml’s XML parser</td>
      <td>BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")</td>
      <td>
         <ul>
            <li>Very fast</li>
            <li>The only currently supported XML parser</li>
         </ul>
      </td>
      <td>
         <ul>
            <li>External C dependency</li>
         </ul>
      </td>
   </tr>
   <tr>
      <td>html5lib</td>
      <td>BeautifulSoup(markup, "html5lib")</td>
      <td>
         <ul>
            <li>Extremely lenient</li>
            <li>Parses pages the same way a web browser does</li>
            <li>Creates valid HTML5</li>
         </ul>
      </td>
   <td>
      <ul>
         <li>Very slow</li>
         <li>External Python dependency</li>
      </ul>
   </td>
</tr>
</table>

To install these _parsers_, run:

In [3]:
# Use '!' only if you are installing directly from your notebook. 
# '!' sign tells Jupyter Notebook to interpret the following code as bash code (what you use in your terminal)
!pip install lxml
!pip install html5lib

Collecting lxml
  Downloading lxml-4.6.2-cp38-cp38-manylinux1_x86_64.whl (5.4 MB)
Installing collected packages: lxml
Successfully installed lxml-4.6.2
Collecting html5lib
  Downloading html5lib-1.1-py2.py3-none-any.whl (112 kB)
Installing collected packages: html5lib
Successfully installed html5lib-1.1
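If your code has to run on machines where you are not sure which parsers are installed, one possible approach is to try them in order of preference. The sketch below is only an illustration: `make_soup` and its `preferred_parsers` argument are names invented here, not part of `BeautifulSoup`. It relies on the `FeatureNotFound` exception that `bs4` raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred_parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    Hypothetical helper written for this example.
    """
    for parser_name in preferred_parsers:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # This parser is not installed in the current environment
            continue
    raise RuntimeError("None of the requested parsers is available")


soup_example, parser_used = make_soup("<p>Hello</p>")
print(parser_used, soup_example.p.get_text())
```

Keep in mind that parsers can build slightly different trees from the same invalid markup, so pinning a single parser is usually the safer choice when results must be reproducible.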


## Explore content with BeautifulSoup

The following HTML document will be used for the rest of the course:

In [4]:
HTML_DOC = """
<html>
  <head><title>The Dormouse's story</title></head>
  <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
      and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
  </body>
</html>
"""

Let's create a new instance of `BeautifulSoup`. Here we won't need to import this class again since we did it above. 

In [5]:
# Create a new instance of BeautifulSoup 
soup = BeautifulSoup(HTML_DOC, 'html.parser')

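In [10]:
# Without parentheses, the method is not called: the notebook displays the bound method
# together with the document it belongs to. Use print(soup.prettify()) to get an indented string.
soup.prettify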

<bound method Tag.prettify of 
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
      Once upon a time there were three little sisters; and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
      <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
      and they lived at the bottom of a well.
    </p>
<p class="story">...</p>
</body>
</html>
>
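The output above starts with `<bound method Tag.prettify of` because the method was referenced but not called. Here is a small sketch of the difference, using a shortened document so that it runs on its own (`small_soup` and `pretty` are names chosen for this example):

```python
from bs4 import BeautifulSoup

small_soup = BeautifulSoup(
    "<p class='title'><b>The Dormouse's story</b></p>", "html.parser"
)

# Without parentheses we only get a reference to the method
print(callable(small_soup.prettify))

# With parentheses the method runs and returns a string,
# with each tag and each piece of text on its own indented line
pretty = small_soup.prettify()
print(type(pretty))
print(pretty)
```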

## Find HTML content using HTML tag name 🏷️

You can access an element by using its HTML tag name as an attribute. This returns the first tag with that name:

In [11]:
# Find <head> tag contained within HTML_DOC
soup.head

<head><title>The Dormouse's story</title></head>

In [12]:
# Find <title> tag contained within HTML_DOC
soup.title

<title>The Dormouse's story</title>

In [13]:
soup.head.title

<title>The Dormouse's story</title>
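Two details are worth keeping in mind with this shortcut: it only returns the first matching tag, and it returns `None` when no tag matches. A minimal sketch, using a shortened document so that it is self-contained (`nav_soup` is a name chosen for this example):

```python
from bs4 import BeautifulSoup

doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title">First paragraph</p><p class="story">Second paragraph</p></body></html>"""
nav_soup = BeautifulSoup(doc, "html.parser")

# Only the first <p> is returned, even though the document contains two
first_paragraph = nav_soup.p
print(first_paragraph)

# There is no <table> in the document, so the result is None
missing = nav_soup.table
print(missing)

# Check for None before chaining, otherwise you get an AttributeError
if nav_soup.table is not None:
    print(nav_soup.table.name)
```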

## Find an element's children and parents 👩‍👧

You can navigate from an element to its children, its descendants, or its parents using loops:

In [40]:
# Get all direct children

for child in soup.body.children:
    print(child)



<p class="title"><b>The Dormouse's story</b></p>


<p class="story">
      Once upon a time there were three little sisters; and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
      <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
      and they lived at the bottom of a well.
    </p>


<p class="story">...</p>




In [42]:
# Get all descendants (children, grandchildren, and so on)

for child in soup.body.descendants:
    print(child)
    
display(list(soup.body.descendants))



<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">
      Once upon a time there were three little sisters; and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
      <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
      and they lived at the bottom of a well.
    </p>

      Once upon a time there were three little sisters; and their names were
      
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
,
      
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and
      
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
      and they lived at the bottom of a well.
    


<p class="story">...</p>
...




['\n',
 <p class="title"><b>The Dormouse's story</b></p>,
 <b>The Dormouse's story</b>,
 "The Dormouse's story",
 '\n',
 <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
       <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
       <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
       and they lived at the bottom of a well.
     </p>,
 '\n      Once upon a time there were three little sisters; and their names were\n      ',
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 'Elsie',
 ',\n      ',
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 'Lacie',
 ' and\n      ',
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
 'Tillie',
 ';\n      and they lived at the bottom of a well.\n    ',
 '\n',
 <p class="story">...</p>,
 '...',
 '\n']
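As the outputs above show, `.children` and `.descendants` yield text nodes (including the `'\n'` strings between tags) as well as tags. If you only want the tags, one option is to filter with `isinstance`. This is a sketch on a shortened document, with variable names chosen for the example:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = """<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a id="link1">Elsie</a></p>
</body>"""
tree_soup = BeautifulSoup(doc, "html.parser")

# Keep only Tag objects and skip the strings in between
tag_children = [
    child.name for child in tree_soup.body.children if isinstance(child, Tag)
]
tag_descendants = [
    node.name for node in tree_soup.body.descendants if isinstance(node, Tag)
]

print(tag_children)     # ['p', 'p']
print(tag_descendants)  # ['p', 'b', 'p', 'a']
```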

In [43]:
# Find the first <a> tag within HTML_DOC
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [14]:
# link.parents is a generator (more info => https://wiki.python.org/moin/Generators)
# We can use Python's built-in list() function to see what it yields
list(link.parents)

[<p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
       <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
       <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
       and they lived at the bottom of a well.
     </p>,
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
       <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
       <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
       and they lived at the bottom of a well.
     </p>
 <p class="story">...</p>
 </body>,
 <html>
 <head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></

In [15]:
# Let's now loop through all the parents of link
# Let's also use the built-in enumerate() function to get the index of each iteration
for i, parent in enumerate(link.parents):
    # if we have no more parents
    if parent is None:
        print(parent)
    else:
        print("Parent {} is: {}".format(i, parent.name)) # parent.name will give only the element name as output

Parent 0 is: p
Parent 1 is: body
Parent 2 is: html
Parent 3 is: [document]
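Since `.parents` goes from the closest ancestor up to the document itself, it can be used to rebuild the path that leads to an element. The snippet below is a sketch on a shortened document, and the `path` format (`html > body > p > a`) is just a convention chosen for this example:

```python
from bs4 import BeautifulSoup

doc = "<html><body><p class='story'><a id='link1'>Elsie</a></p></body></html>"
path_soup = BeautifulSoup(doc, "html.parser")
link = path_soup.a

# From the closest ancestor to the farthest: p, body, html, [document]
ancestor_names = [parent.name for parent in link.parents]

# Reverse the order, drop the top-level "[document]" entry and add the tag itself
path = " > ".join(list(reversed(ancestor_names))[1:] + [link.name])
print(path)  # html > body > p > a
```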


## Find an element's siblings 🎎 🎎

You can select an element's next siblings with the `.next_siblings` attribute:

In [18]:
# Iterate over everything that follows the first <a> tag at the same level (tags and text)
for sibling in soup.a.next_siblings:
    print(sibling)

,
      
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and
      
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
;
      and they lived at the bottom of a well.
    


Conversely, you can select an element's previous siblings with the `.previous_siblings` attribute:

In [20]:
# Let's use .find() method to select <a> tag with id="link3"
# Then display all its previous siblings
for sibling in soup.find(id="link3").previous_siblings:
    print(sibling)

 and
      
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
,
      
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

      Once upon a time there were three little sisters; and their names were
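The sibling generators return text nodes as well as tags, which is why commas and line breaks show up in the outputs above. When you only want sibling tags, the `find_next_sibling()`, `find_next_siblings()` and `find_previous_sibling()` methods accept the same filters as `.find()`. A sketch on a shortened document:

```python
from bs4 import BeautifulSoup

doc = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>;</p>'
sib_soup = BeautifulSoup(doc, "html.parser")
first_link = sib_soup.find(id="link1")

# The very next sibling is a piece of text, not a tag
print(repr(first_link.next_sibling))  # ', '

# Skip the text and jump directly to the next <a> tag
next_tag = first_link.find_next_sibling("a")
print(next_tag["id"])  # link2

# All the following <a> siblings
following_ids = [tag["id"] for tag in first_link.find_next_siblings("a")]
print(following_ids)  # ['link2', 'link3']

# Same idea going backwards from the last link
previous_tag = sib_soup.find(id="link3").find_previous_sibling("a")
print(previous_tag["id"])  # link2
```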
      


## Find all items that meet a specific condition ❓

`BeautifulSoup` provides a very handy method, `.find_all()`, which fetches all the elements of an HTML page that meet certain criteria. For example:

In [44]:
# Select all elements named title
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [45]:
# Select all p elements with the class "title"
soup.find_all("p", "title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [46]:
# Select all <a> tags
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [24]:
# Select all elements with id="link2"
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [25]:
# Select the first string that contains "sisters"
# (.find() stops at the first match, use .find_all() to get every match)
## Here we use the re package, which handles regular expressions
## More info here => https://docs.python.org/3/library/re.html
import re

soup.find(string=re.compile("sisters"))

'\n      Once upon a time there were three little sisters; and their names were\n      '
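The filters can also be combined. The following sketch, written on a shortened document so that it runs on its own, shows a few more ways to call `.find_all()`: the `class_` keyword, the `limit` argument, and a regular expression applied to an attribute. Variable names are chosen for this example.

```python
import re

from bs4 import BeautifulSoup

doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
filter_soup = BeautifulSoup(doc, "html.parser")

# class is a reserved word in Python, hence the trailing underscore
by_class = filter_soup.find_all("a", class_="sister")

# Stop after the first two matches
first_two = filter_soup.find_all("a", limit=2)

# Keep the links whose href ends with /elsie or /tillie
by_href = filter_soup.find_all("a", href=re.compile(r"/(elsie|tillie)$"))

print(len(by_class))                  # 3
print([a["id"] for a in first_two])   # ['link1', 'link2']
print([a["id"] for a in by_href])     # ['link1', 'link3']

# .find() returns the first match directly instead of a list
print(filter_soup.find("a") is first_two[0])  # True
```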

## Find elements using CSS 🧑‍🎨🧑‍🎨

Finally, content can be found via CSS selectors, by using the `.select()` method:

In [27]:
# Select all <a> tags with class of "sister"
soup.select("a.sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [28]:
# Select all <a> with id="link1"
soup.select("a#link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

For deeply nested elements, you can specify a path, just like you would do in plain CSS:

In [29]:
# Select all <a> with id="link1" that are contained within <p> with class="story"
## NB: in this document, this returns the same result as soup.select("a#link1")
soup.select("p.story a#link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
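Other CSS selector features work too. Here is a short sketch on a shortened document showing `.select_one()`, which returns the first match instead of a list, together with the child combinator `>` and an attribute selector. Variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

doc = """<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body>"""
css_soup = BeautifulSoup(doc, "html.parser")

# First match only (None when nothing matches)
first_sister = css_soup.select_one("a.sister")
nothing = css_soup.select_one("table")

# <a> tags that are direct children of <p class="story">
direct_links = css_soup.select("p.story > a")

# <a> tags whose href attribute ends with "tillie"
tillie = css_soup.select('a[href$="tillie"]')

print(first_sister["id"])                # link1
print(nothing)                           # None
print([a["id"] for a in direct_links])   # ['link1', 'link2', 'link3']
print([a["id"] for a in tillie])         # ['link3']
```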

## Extract text 📃📃

You can use `.get_text()`, or its shorthand property `.text`, to extract the text enclosed in an HTML tag:

In [17]:
# Get the text between the tags
display(soup.title.text)
type(soup.title.text)

"The Dormouse's story"

str

In [23]:
# .string returns the tag's single child string as a NavigableString object
display(soup.title.string)
type(soup.title.string)

"The Dormouse's story"

bs4.element.NavigableString

In [19]:
# Get all the text of the document
soup.get_text()

"\n\nThe Dormouse's story\n\nThe Dormouse's story\n\n      Once upon a time there were three little sisters; and their names were\n      Elsie,\n      Lacie and\n      Tillie;\n      and they lived at the bottom of a well.\n    \n...\n\n\n"

In [22]:
# Join the text pieces with a space and strip the whitespace around each piece
soup.get_text(" ", strip=True)

"The Dormouse's story The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie , Lacie and Tillie ;\n      and they lived at the bottom of a well. ..."

In [27]:
# Collect every string of the document, first as NavigableString objects, then as plain str
strings_element = [element for element in soup.strings]
str_element = [str(element) for element in soup.strings]
display(strings_element)
display(str_element)

# Each item yielded by soup.strings is a NavigableString
for element in soup.strings:
    display(type(element))

# str() converts each item to a built-in Python string
for element in soup.strings:
    display(type(str(element)))

['\n',
 '\n',
 "The Dormouse's story",
 '\n',
 '\n',
 "The Dormouse's story",
 '\n',
 '\n      Once upon a time there were three little sisters; and their names were\n      ',
 'Elsie',
 ',\n      ',
 'Lacie',
 ' and\n      ',
 'Tillie',
 ';\n      and they lived at the bottom of a well.\n    ',
 '\n',
 '...',
 '\n',
 '\n',
 '\n']

['\n',
 '\n',
 "The Dormouse's story",
 '\n',
 '\n',
 "The Dormouse's story",
 '\n',
 '\n      Once upon a time there were three little sisters; and their names were\n      ',
 'Elsie',
 ',\n      ',
 'Lacie',
 ' and\n      ',
 'Tillie',
 ';\n      and they lived at the bottom of a well.\n    ',
 '\n',
 '...',
 '\n',
 '\n',
 '\n']

bs4.element.NavigableString

[... the same output repeats for each of the 19 strings ...]

str

[... the same output repeats for each of the 19 strings ...]

In [15]:
# Access the first item of the list that soup.select("a#link1") returns
## Use get_text() to get only the text part
soup.select("a#link1")[0].get_text()

'Elsie'

In [31]:
# Let's use a list comprehension to get the text of every <a> tag with class="sister"
[a.get_text() for a in soup.select("a.sister")]

['Elsie', 'Lacie', 'Tillie']
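When the whitespace-only strings get in the way, the `.stripped_strings` generator is a convenient alternative to `.strings`: it strips each string and skips the ones that end up empty. A minimal sketch on a shortened document (the text was slightly adapted for this example):

```python
from bs4 import BeautifulSoup

doc = """<p class="story">
  Once upon a time there were three little sisters:
  <a id="link1">Elsie</a>,
  <a id="link2">Lacie</a> and
  <a id="link3">Tillie</a>.
</p>"""
text_soup = BeautifulSoup(doc, "html.parser")

# Each string is stripped and whitespace-only strings are dropped
clean_strings = list(text_soup.stripped_strings)
print(clean_strings)

# The pieces can then be joined into a single line of text
one_line = " ".join(clean_strings)
print(one_line)
```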

## Extract a property 🔗🔗

Sometimes, it is extremely useful to extract the value of a property (an HTML attribute) of a tag. For example, you might want to extract all the URLs of a given webpage. You can list a tag's attributes with `.attrs`, read one with dictionary-style indexing, or use the `.get()` method:

In [28]:
#Access all attributes of a tag
soup.p.attrs

{'class': ['title']}

In [31]:
soup.p['class']

['title']

In [35]:
#Get links and attributes
display(soup.a)
display(soup.a.attrs)
display(soup.a["href"])

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

'http://example.com/elsie'

In [32]:
# Extract the href property from the <a> tag with id="link1"
soup.select("a#link1")[0].get('href')

'http://example.com/elsie'
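The difference between `tag["href"]` and `tag.get("href")` matters when a property may be missing: indexing raises a `KeyError`, while `.get()` returns `None`. The sketch below collects every link of a small document and turns relative URLs into absolute ones with `urljoin` from the standard library. The document and `BASE_URL` are assumptions made up for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

doc = """<p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="/lacie" id="link2">Lacie</a>
<a id="link3">Tillie</a>
</p>"""
attr_soup = BeautifulSoup(doc, "html.parser")

# Hypothetical address of the page the links were found on
BASE_URL = "http://example.com/"

# .get() returns None for the <a> tag that has no href
hrefs = [a.get("href") for a in attr_soup.find_all("a")]
print(hrefs)  # ['http://example.com/elsie', '/lacie', None]

# href=True keeps only the tags that define the attribute,
# so indexing with ["href"] is safe here
absolute_urls = [
    urljoin(BASE_URL, a["href"]) for a in attr_soup.find_all("a", href=True)
]
print(absolute_urls)  # ['http://example.com/elsie', 'http://example.com/lacie']
```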

## Resources 📚 📚

- <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup documentation</a>
- <a href="https://pypi.org/project/beautifulsoup4/" target="_blank">pip install beautifulsoup4</a> 
- <a href="https://docs.python.org/3/library/re.html" target="_blank">re</a>
- <a href="https://en.wikipedia.org/wiki/Web_scraping" target="_blank">Web Scraping</a>