# Session 1 - BeautifulSoup


Beautiful Soup is a Python library for parsing HTML and XML, which makes it easy to extract data from the messy web.  
In this session I'm going to introduce BeautifulSoup: we will learn its basic functionality and how to use it with regex. At the end we will build our first web scraper, to get the all-time top 100 movies for a given genre.  
`BeautifulSoup` supports several parsers, including third-party ones, so make sure you have the lxml and html5lib parsers installed.
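
The parsers differ in how they repair broken markup, and only `html.parser` ships with Python itself. The sketch below is a quick check, assuming a made-up `broken_html` snippet, of which of the three parsers are installed and how each one closes the unclosed tags. `FeatureNotFound` is the exception `bs4` raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical snippet with two unclosed tags (not taken from any real page).
broken_html = "<p>Unclosed paragraph<b>bold text"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        results[parser] = None

for parser, markup in results.items():
    print(f"{parser:12} -> {markup if markup is not None else 'not installed'}")
```

If a parser shows up as not installed, `pip install lxml html5lib` should fix it.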

In [44]:
import requests
from bs4 import BeautifulSoup as bs
import re

In [45]:
# Making the Soup
# Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; 
# it helps format and organize the messy web by fixing bad HTML 
# and presenting us with easily traversable Python objects representing XML structures.

# define the link you want to target
link = 'http://pythonscraping.com/pages/page2.html'

html = requests.get(link)
soup = bs(html.text, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div class="body" id="fakeLatin">
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



`.get_text()` strips all tags from the document.

In [46]:
print(soup.get_text().strip())

A Useful Page


An Interesting Title

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
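
The blank lines in the output above come from the whitespace between tags. As a small sketch on a hand-written document (the markup here is an assumption, modelled on the page above), `get_text()` also accepts a `separator` and a `strip` argument that give a tidier result:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page above, so no network access is needed.
html = """
<html>
 <head><title>A Useful Page</title></head>
 <body>
  <h1>An Interesting Title</h1>
  <div>Lorem ipsum</div>
 </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops the whitespace-only ones;
# separator is placed between the fragments that remain.
text = soup.get_text(separator=" | ", strip=True)
print(text)  # A Useful Page | An Interesting Title | Lorem ipsum
```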


However, if you want to process the soup and extract specific data, it is much easier to work with BeautifulSoup's own object types: `Tag`, `NavigableString`, `BeautifulSoup` and `Comment`.

In [50]:
tag = soup.div
print(type(tag))
print('Tag name: ', tag.name)

<class 'bs4.element.Tag'>
Tag name:  div
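
The other three object types can be seen in the same way. This is a minimal sketch on a one-line document written for the purpose (the markup and the variable names are assumptions, not part of the page above):

```python
from bs4 import BeautifulSoup

# Tiny hand-written document containing text and an HTML comment.
markup = "<div id='note'>Hello<!-- hidden remark --></div>"
soup = BeautifulSoup(markup, "html.parser")

div = soup.div
# The soup itself, the tag, and the tag's two children.
nodes = [soup, div] + list(div.children)

for node in nodes:
    print(f"{type(node).__name__:16} {str(node)!r}")
```

The text inside the tag is a `NavigableString`, and the comment is a `Comment`, which is a special kind of `NavigableString`.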


In [51]:
# We can change a tag's name.
# The change is reflected in the HTML markup generated by bs
tag.name = "blockquote"
print('Tag name: ', tag.name)

Tag name:  blockquote


In [54]:
# to change the tag id
print('Old tag id: ', tag['id'])
tag['id'] = 'Latin'
print('New tag id: ', tag['id'])

Old tag id:  fakeLatin
New tag id:  Latin


In [55]:
soup.blockquote

<blockquote class="body" id="Latin">
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</blockquote>
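
A tag's attributes behave like a dictionary, so besides changing a value we can also add and delete attributes. A short sketch, using a shortened copy of the `div` above as an assumed input:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="body" id="fakeLatin">Lorem ipsum</div>', "html.parser")
tag = soup.div

tag["lang"] = "la"       # add a new attribute
del tag["id"]            # remove an existing one

# tag["id"] would now raise KeyError; .get() returns None instead.
missing = tag.get("id")

print(tag.attrs)
print(tag)
```

Using `.get()` is the safer choice when scraping pages where an attribute may or may not be present.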

---
Let's try the lxml parser.

In [73]:
link = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = requests.get(link)
soup = bs(html.text, 'lxml')

In [75]:
print(soup.title)
print(type(soup.title))
print(soup.p)

None
<class 'NoneType'>
<p></p>
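
The `None` above means that no `<title>` tag was found. Chaining another attribute onto a missing tag, for example `soup.title.string`, raises an `AttributeError`, so it is worth checking first. A minimal sketch on an assumed title-less document:

```python
from bs4 import BeautifulSoup

# Hand-written document without a <title>, similar to the page above.
soup = BeautifulSoup("<html><body><h1>War and Peace</h1></body></html>", "html.parser")

title_tag = soup.title   # None, because there is no <title> tag
title = title_tag.string if title_tag is not None else "(no title)"

heading = soup.h1.string  # safe here, because <h1> exists

print(title)
print(heading)
```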


In [79]:
soup.find_all(['h1', 'h2','h3','h4','h5','h6'])

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
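
Instead of listing every heading level, `find_all()` also accepts a compiled regular expression, which is matched against the tag names. A sketch of the same search on a small hand-written document (the markup is an assumption):

```python
import re
from bs4 import BeautifulSoup

html = "<html><head></head><body><h1>War and Peace</h1><h2>Chapter 1</h2><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# ^h[1-6]$ matches h1..h6 only; without the anchors, "html" and "head" would match too.
headings = soup.find_all(re.compile(r"^h[1-6]$"))
heading_names = [h.name for h in headings]

print(headings)
```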

In [83]:
soup.span.attrs

{'class': ['red']}

In [84]:
soup.span['class']

['red']
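
Note that `class` comes back as a list, because an HTML element can carry several classes at once, while single-valued attributes such as `id` come back as plain strings. A quick sketch on an assumed `span`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="red bold" id="quote">Well, Prince</span>', "html.parser")
span = soup.span

classes = span["class"]   # list with one entry per class
span_id = span["id"]      # plain string

print(classes)
print(span_id)
```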

In [86]:
# Get the text contained in the node element by using the string attribute
soup.span.string

"Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news."

In [87]:
soup.h1

<h1>War and Peace</h1>

In [88]:
soup.h1.string

'War and Peace'
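
One caveat: `.string` only works when a tag has a single child. If the tag mixes text and other tags, `.string` is `None` and `get_text()` is the better choice. A minimal sketch on an assumed paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

single = soup.b.string          # <b> has exactly one child, a string
mixed = soup.p.string           # <p> has two children, so this is None
full_text = soup.p.get_text()   # concatenates all the text inside <p>

print(repr(single), repr(mixed), repr(full_text))
```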

In [99]:
nameList = soup.find_all(class_='green')

In [103]:
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
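
Some names in the output above are split over two lines, which suggests that the text inside those tags contains a line break. The sketch below, on a short hand-written fragment (an assumption, not the real page), normalizes the whitespace and counts how often each name appears:

```python
from collections import Counter
from bs4 import BeautifulSoup

html = """
<p><span class="green">Anna
Pavlovna</span> greeted <span class="green">the prince</span>.
<span class="red">"How do you do?"</span> asked <span class="green">the prince</span>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# split() + join() collapses line breaks and repeated spaces into single spaces.
names = [" ".join(span.get_text().split())
         for span in soup.find_all("span", class_="green")]

counts = Counter(names)
print(counts.most_common())
```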


---

In [106]:
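# The 'xml' parser is provided by lxml, so lxml must be installed.
# It treats the page as XML, which is why the output starts with an XML declaration.
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
soup = bs(html.text, 'xml')
soup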

<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
<style>
img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
</style>
</head>
<body>
<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;">
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br>
123 Main St.<br>
Abuja, Nigeria
</br>We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</br>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your he

In [109]:
for tag in soup.find_all('img'):
    print(tag['src'])

../img/gifts/logo.jpg
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
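
The `src` values are relative paths, so they cannot be downloaded as they are. A possible next step, sketched here on a shortened hand-written copy of the markup, is to keep only the product images with a regex filter and turn them into absolute URLs with `urljoin` from the standard library. The names `page_url`, `product_images` and `absolute_urls` are just illustrative.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = "http://www.pythonscraping.com/pages/page3.html"

# Shortened, hand-written stand-in for the page's <img> tags.
html = """
<img src="../img/gifts/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regex as an attribute value keeps only the tags whose src matches it,
# which skips logo.jpg here.
product_images = soup.find_all("img", src=re.compile(r"img\d+\.jpg$"))

# Resolve each relative path against the URL of the page it came from.
absolute_urls = [urljoin(page_url, img["src"]) for img in product_images]

for url in absolute_urls:
    print(url)
```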


---
That was a brief overview of the BeautifulSoup module; we have barely scratched the surface.
It's a very rich module that lets you extract whatever information you are looking for, no matter how messy the HTML code is. Because, indeed, the web is a messy place.

## References

1. Mitchell, R. 2018. Web Scraping with Python: Collecting More Data from the Modern Web. O’Reilly Media.
2. BeautifulSoup official documentation