# Lesson 2

## Starting with Beautiful Soup

The first step in web scraping is parsing: turning the raw HTML into a structure that both we and the Python interpreter can work with. This is sometimes called "creating the soup", since the convention is to store the parsed HTML in a variable called soup.

You can walk down the "HTML tree of tags" from a parent tag to the tags nested inside it using parent.children.

The most popular method in Beautiful Soup is find_all(). It returns a list of all the elements with a given tag name.

If you want to get the value of an attribute, you can use element.get(attributeName). It's especially useful for getting links, since the target of a link is the value of its href attribute. You can also use element[attributeName], which gives the same result when the attribute exists but raises a KeyError when it doesn't, whereas get() returns None.
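
A minimal sketch of how the two access styles behave when an attribute is missing. The two-link snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second link has no href attribute.
snippet = '<a href="http://example.com/elsie">Elsie</a><a>No link here</a>'
links = BeautifulSoup(snippet, "html.parser").find_all("a")

# .get() returns None when the attribute is missing.
hrefs = [link.get("href") for link in links]
print(hrefs)  # ['http://example.com/elsie', None]

# Square brackets raise KeyError for a missing attribute.
try:
    links[1]["href"]
    missing_raised = False
except KeyError:
    missing_raised = True
    print("the second <a> tag has no href attribute")
```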

get_text() extracts the text inside a tag, without the markup.

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Loading the library

In [1]:
from bs4 import BeautifulSoup

## Loading the data

In [2]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>

<p class="story">...</p>
</html>
"""

## Getting the elements from the html (parsing)

In [3]:
soup = BeautifulSoup(html_doc, 'html.parser')

## Displaying the contents of our new "soup"

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



## Basic tree navigation

In [14]:
print("The html code of the document's title is: ",soup.title)
print()

print("The name of the title tag is: ",soup.title.name)
print()

print("The content of the html title is: ",soup.title.string)
print()

print("The name of the 'parent' element of the document's title is: ",soup.title.parent.name)
print()

print("The first paragraph of the html is: ",soup.p)
print()

print("The class of the first paragraph is: ",soup.p["class"])


The html code of the document's title is:  <title>The Dormouse's story</title>

The name of the title tag is:  title

The content of the html title is:  The Dormouse's story

The name of the 'parent' element of the document's title is:  head

The first paragraph of the html is:  <p class="title"><b>The Dormouse's story</b></p>

The class of the first paragraph is:  ['title']
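
The cell above moves between single tags. As a rough sketch of walking the tree with .children, here is a cut-down copy of the story paragraph; the variable names story and child_tags are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>"""
story = BeautifulSoup(html, "html.parser").p

# .children yields the direct children of a tag: nested tags and plain text nodes.
# Keeping only Tag instances drops the text nodes between the links.
child_tags = [child.name for child in story.children if isinstance(child, Tag)]
print(child_tags)

# .parent goes the other way, from a tag to the element that contains it.
parent_name = story.a.parent.name
print(parent_name)
```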


## Getting all the elements with a given "tag"

In [15]:
soup.find_all("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [17]:
for i in soup.find_all("p"):
    print("The paragraph is: ",i)
    print()

The paragraph is:  <p class="title"><b>The Dormouse's story</b></p>

The paragraph is:  <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

The paragraph is:  <p class="story">...</p>



## Getting the value of an attribute of a "tag"

In [21]:
for a_tag in soup.find_all("a"):
    print("The link is: ",a_tag.get("href"))

The link is:  http://example.com/elsie
The link is:  http://example.com/lacie
The link is:  http://example.com/tillie


## Extracting the text content of the html elements

In [22]:
print(soup.get_text())



The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...




In [23]:
for tag in soup.find_all('a'):
    print(tag.get_text())

Elsie
Lacie
Tillie


## CSS selectors

CSS is a way to control how the content of an html tag is displayed. A CSS selector is the pattern that says which elements a style applies to.

Beautiful Soup also allows us to access html elements with those same CSS selectors, through the select() method.

Let's understand how they work. Go to the following webpage and try up to level 6.

https://flukeout.github.io/



In [27]:
a_tags = soup.select("a")
a_tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [26]:
for a_tag in soup.select('a'):
    print(a_tag.get_text())

Elsie
Lacie
Tillie


In [28]:
# Getting the elements with the "class" 'title'
soup.select(".title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [29]:
# Getting the elements with the "class" 'sister'
soup.select(".sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [30]:
# Getting the "text" of the second paragraph with the class "story"
soup.select("p.story")[1].get_text()

'...'
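
A few more selector forms, sketched on a cut-down copy of the story paragraph. The name demo_soup is illustrative, and which selectors you need depends on the page.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
demo_soup = BeautifulSoup(html, "html.parser")

# "#link2" matches the element whose id is "link2".
# select_one() returns the first match (or None) instead of a list.
second = demo_soup.select_one("#link2")
print(second.get_text())

# "p.story > a" matches <a> tags that are direct children of <p class="story">.
names = [a.get_text() for a in demo_soup.select("p.story > a")]
print(names)

# An attribute selector: <a> tags whose href ends with "elsie".
elsie = demo_soup.select('a[href$="elsie"]')
print(len(elsie))
```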

## Activity

Consider the following html code:

html_doc2 = """<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>"""

Write code to print the following contents (not including the html tags, only human-readable text): 

1. The text in the "p" paragraphs. 
2. The names of all the places. 
3. The content (name and fact) of all the cities (only cities, not countries!) 
4. The names (not facts!) of all the cities (not countries!)

You will have to find out for yourself how to get elements by class. You can use either find_all() or select(). Chances are you will stumble upon this post on Stack Overflow: https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class
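
If the search gets stuck, here is a minimal sketch of both approaches on a cut-down version of the earlier Dormouse markup (not the activity's html). The names demo_soup, by_find_all and by_select are illustrative.

```python
from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a> and
<a class="sister" id="link2">Lacie</a></p>"""
demo_soup = BeautifulSoup(html, "html.parser")

# find_all: "class" is a reserved word in Python, so the keyword argument is class_.
by_find_all = [tag.get_text() for tag in demo_soup.find_all("a", class_="sister")]

# select: a CSS selector, where a leading dot means "class".
by_select = [tag.get_text() for tag in demo_soup.select("a.sister")]

print(by_find_all)
print(by_select)
```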

# Lesson 3

## Scraping the contents of the IMDb movie database

Let's inspect the web page:

https://www.imdb.com/chart/top

Let's grab the selector of the first movie:

In [32]:
#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(1) > td.titleColumn

## Loading more libraries

In [33]:
import requests
import pandas as pd

## Storing the URL

In [38]:
url = "https://www.imdb.com/chart/top"

## Getting the html code of the web page

In [39]:
response = requests.get(url)
response.status_code # 200 status code means OK!

200
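
Checking the status code by eye works in a notebook. In a script it helps to fail loudly instead. A minimal sketch, assuming a helper named fetch_html (a hypothetical name, not part of requests) and an arbitrary 10-second timeout:

```python
import requests

def fetch_html(url, timeout=10):
    """Return the raw body of `url`, raising requests.HTTPError on a 4xx/5xx status."""
    # timeout stops the call from hanging forever if the server never answers
    response = requests.get(url, timeout=timeout)
    # raise_for_status() does nothing for successful responses and raises otherwise
    response.raise_for_status()
    return response.content

# Usage (needs network access):
# html = fetch_html("https://www.imdb.com/chart/top")
```

The returned bytes can be passed to BeautifulSoup in the same way as response.content.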

## Parsing the html code

In [40]:
soup = BeautifulSoup(response.content, "html.parser")
soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb Top 250 - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>


## Retrieving the desired info from the Soup.

In [45]:
#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(1) > td.titleColumn

soup.select("tbody")

[<tbody class="lister-list">
 <tr>
 <td class="posterColumn">
 <span data-value="1" name="rk"></span>
 <span data-value="9.222807081600635" name="ir"></span>
 <span data-value="7.791552E11" name="us"></span>
 <span data-value="2308301" name="nv"></span>
 <span data-value="-1.7771929183993649" name="ur"></span>
 <a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
 </a> </td>
 <td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>
 <td class="ratingColumn imdbRating">
 <strong title="9.2 based on 2,308,301 user ratings">9.2</strong>
 </td>
 <td class="ratingColumn">
 <div class="seen-widget seen-widget-tt0111161 pending" data-titleid="tt0111161">
 <di

Too much info. We would like to be more "selective".

In [46]:
soup.select("td.titleColumn") # all the info about all the movies

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>, <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
 <span class="secondaryInfo">(1972)</span>
 </td>, <td class="titleColumn">
       3.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">The Godfather: Part II</a>
 <span class="secondaryInfo">(1974)</span>
 </td>, <td class="titleColumn">
       4.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>
 <span class="secondaryInfo">(2008)</span>
 </td>, <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Angry Men</a>
 <span class="secondaryInfo">(195

Let's narrow the selection down even more.

In [47]:
soup.select("td.titleColumn a") # all elements containing movie titles

[<a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>,
 <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>,
 <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">The Godfather: Part II</a>,
 <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>,
 <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Angry Men</a>,
 <a href="/title/tt0108052/" title="Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes">Schindler's List</a>,
 <a href="/title/tt0167260/" title="Peter Jackson (dir.), Elijah Wood, Viggo Mortensen">The Lord of the Rings: The Return of the King</a>,
 <a href="/title/tt0110912/" title="Quentin Tarantino (dir.), John Travolta, Uma Thurman">Pulp Fiction</a>,
 <a href="/title/tt0060196/" title="Sergio Leone (dir.), Clint Eastwood, Eli Wal

Now let's get the content.

In [48]:
# we can use .get_text() to extract the content of the tags we selected
# we'll need to do it to each tag with a for loop: here we do it to the first one
soup.select("td.titleColumn a")[0]
soup.select("td.titleColumn a")[0].get_text()

'The Shawshank Redemption'

In [50]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:
print("The director and stars: ",soup.select("td.titleColumn a")[0]["title"])
print("The link: ",soup.select("td.titleColumn a")[0]["href"])

The director and stars:  Frank Darabont (dir.), Tim Robbins, Morgan Freeman
The link:  /title/tt0111161/
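
The href above is relative to the site root. A short sketch of turning it into a full URL with the standard library; page_url and movie_url are illustrative names.

```python
from urllib.parse import urljoin

page_url = "https://www.imdb.com/chart/top"   # the page we scraped
relative_href = "/title/tt0111161/"           # the href taken from the <a> tag

# urljoin resolves the relative link against the page it was found on
movie_url = urljoin(page_url, relative_href)
print(movie_url)  # https://www.imdb.com/title/tt0111161/
```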


In [51]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# we also specify the parent tag and its class, which is the same we used before
# the years are inside parentheses, but we'll take care of that later
soup.select("td.titleColumn span.secondaryInfo")[0].get_text()

'(1994)'

## Activity

Collect the following info from the movies:

* Title
* Ranking
* Year
* Rating

# Lesson 4

In [97]:
#initialize empty lists
title = []
dir_stars = []
year = []
ratings = []

Getting the total number of movies

In [98]:
# run each selector once and reuse the results instead of re-querying inside the loop
title_tags = soup.select("td.titleColumn a")
year_tags = soup.select("td.titleColumn span.secondaryInfo")
rating_tags = soup.select("td.imdbRating strong") # the rating cell seen in the tbody output
num_iter = len(title_tags)

# iterate through the result sets and retrieve all the data
for i in range(num_iter):
    title.append(title_tags[i].get_text())
    dir_stars.append(title_tags[i]["title"])
    year.append(year_tags[i].get_text())
    ratings.append(float(rating_tags[i].get_text()))
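
Indexing separate result lists only lines up if every movie has every field. A sketch of an alternative that reads one table row at a time and also picks up the ranking, assuming each movie sits in its own tr as in the tbody output shown earlier. The two rows are trimmed by hand from that output (the second rating cell is abbreviated), and sample_soup and records are hypothetical names.

```python
from bs4 import BeautifulSoup

# Two sample rows, cut down from the markup printed by soup.select("tbody").
sample_html = """
<table><tbody class="lister-list">
<tr>
 <td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>
 <td class="ratingColumn imdbRating">
 <strong title="9.2 based on 2,308,301 user ratings">9.2</strong>
 </td>
</tr>
<tr>
 <td class="titleColumn">
      2.
      <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
 <span class="secondaryInfo">(1972)</span>
 </td>
 <td class="ratingColumn imdbRating">
 <strong>9.1</strong>
 </td>
</tr>
</tbody></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for row in sample_soup.select("tbody.lister-list tr"):
    title_cell = row.select_one("td.titleColumn")
    link = title_cell.select_one("a")
    records.append({
        # the first piece of text in the cell is the ranking, e.g. "1."
        "ranking": int(next(title_cell.stripped_strings).rstrip(".")),
        "title": link.get_text(),
        "dir_stars": link["title"],
        "year": row.select_one("span.secondaryInfo").get_text(),
        "rating": float(row.select_one("td.imdbRating strong").get_text()),
    })

print(records[0])
```

A list of dictionaries like records can be passed straight to pd.DataFrame, with each key becoming a column.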

Checking our data

In [99]:
print(title)
print(dir_stars)
print(year)
print(ratings)

['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II', 'The Dark Knight', '12 Angry Men', "Schindler's List", 'The Lord of the Rings: The Return of the King', 'Pulp Fiction', 'The Good, the Bad and the Ugly', 'The Lord of the Rings: The Fellowship of the Ring', 'Fight Club', 'Forrest Gump', 'Inception', 'The Lord of the Rings: The Two Towers', 'Star Wars: Episode V - The Empire Strikes Back', 'The Matrix', 'Goodfellas', "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Se7en', 'Life Is Beautiful', 'City of God', 'The Silence of the Lambs', "It's a Wonderful Life", 'Star Wars: Episode IV - A New Hope', 'Saving Private Ryan', 'Spirited Away', 'The Green Mile', 'Interstellar', 'Parasite', 'Léon: The Professional', 'The Usual Suspects', 'Harakiri', 'The Lion King', 'Back to the Future', 'The Pianist', 'Terminator 2: Judgment Day', 'American History X', 'Modern Times', 'Psycho', 'Gladiator', 'City Lights', 'The Departed', 'The Intouchables', 'Whiplash', 'The Prestige', '

Cleaning the "year" list

In [100]:
import re

In [101]:
p = re.compile(r"\d+") # raw string, so that \d is not treated as a string escape
result = p.search("(1994)")
print(result.group(0))

1994


In [102]:
year = [int(p.search(x).group(0)) for x in year]
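
Note that search() returns None when a string contains no digits, and calling .group(0) on None raises an AttributeError. A small sketch of a more defensive variant; parse_year and year_pattern are hypothetical names, and the "n/a" input is made up to show the fallback.

```python
import re

year_pattern = re.compile(r"\d{4}")  # four consecutive digits

def parse_year(text):
    """Return the first four-digit number in `text` as an int, or None if there is none."""
    match = year_pattern.search(text)
    return int(match.group(0)) if match else None

print(parse_year("(1994)"))  # 1994
print(parse_year("n/a"))     # None
```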

## Constructing the dataframe

In [103]:
# each list becomes a column
movies = pd.DataFrame({"title":title,
                       "dir_stars":dir_stars,
                       "year":year,
                       "rating": ratings
                      })

movies.head()

| | title | dir_stars | year | rating |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 1994 | 9.2 |
| 1 | The Godfather | Francis Ford Coppola (dir.), Marlon Brando, Al... | 1972 | 9.1 |
| 2 | The Godfather: Part II | Francis Ford Coppola (dir.), Al Pacino, Robert... | 1974 | 9.0 |
| 3 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... | 2008 | 9.0 |
| 4 | 12 Angry Men | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 1957 | 8.9 |


## Cleaning the dataframe

We would like to have the column "dir_stars" split in two. 

* Director
* Stars

Let's do it.

In [104]:
movies.dir_stars.str.split(", ", expand=True)

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Frank Darabont (dir.) | Tim Robbins | Morgan Freeman |
| 1 | Francis Ford Coppola (dir.) | Marlon Brando | Al Pacino |
| 2 | Francis Ford Coppola (dir.) | Al Pacino | Robert De Niro |
| 3 | Christopher Nolan (dir.) | Christian Bale | Heath Ledger |
| 4 | Sidney Lumet (dir.) | Henry Fonda | Lee J. Cobb |
| ... | ... | ... | ... |
| 245 | Gillo Pontecorvo (dir.) | Brahim Hadjadj | Jean Martin |
| 246 | James Cameron (dir.) | Arnold Schwarzenegger | Linda Hamilton |
| 247 | Zaza Urushadze (dir.) | Lembit Ulfsak | Elmo Nüganen |
| 248 | Nuri Bilge Ceylan (dir.) | Haluk Bilginer | Melisa Sözen |


In [105]:
movies[['Director','star_1','star_2']] = movies.dir_stars.str.split(", ", expand=True)
movies.head()

| | title | dir_stars | year | rating | Director | star_1 | star_2 |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 1994 | 9.2 | Frank Darabont (dir.) | Tim Robbins | Morgan Freeman |
| 1 | The Godfather | Francis Ford Coppola (dir.), Marlon Brando, Al... | 1972 | 9.1 | Francis Ford Coppola (dir.) | Marlon Brando | Al Pacino |
| 2 | The Godfather: Part II | Francis Ford Coppola (dir.), Al Pacino, Robert... | 1974 | 9.0 | Francis Ford Coppola (dir.) | Al Pacino | Robert De Niro |
| 3 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... | 2008 | 9.0 | Christopher Nolan (dir.) | Christian Bale | Heath Ledger |
| 4 | 12 Angry Men | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 1957 | 8.9 | Sidney Lumet (dir.) | Henry Fonda | Lee J. Cobb |
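
The split above produces three columns because every entry lists exactly two stars. If the goal is literally two columns, director and stars, one option is to split only at the first comma with n=1. A rough sketch on a two-row sample; the frame sample and the column name Stars are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({"dir_stars": [
    "Frank Darabont (dir.), Tim Robbins, Morgan Freeman",
    "Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb",
]})

# n=1 limits the split to the first ", ", so everything after the director stays together
parts = sample["dir_stars"].str.split(", ", n=1, expand=True)
sample["Director"] = parts[0]
sample["Stars"] = parts[1]

print(sample[["Director", "Stars"]])
```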


Dropping the "(dir.)" from the "Director" column.

In [107]:
movies['Director'] = movies['Director'].str.replace(" (dir.)", "", regex=False) # regex=False: match the text literally
movies.head()

| | title | dir_stars | year | rating | Director | star_1 | star_2 |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 1994 | 9.2 | Frank Darabont | Tim Robbins | Morgan Freeman |
| 1 | The Godfather | Francis Ford Coppola (dir.), Marlon Brando, Al... | 1972 | 9.1 | Francis Ford Coppola | Marlon Brando | Al Pacino |
| 2 | The Godfather: Part II | Francis Ford Coppola (dir.), Al Pacino, Robert... | 1974 | 9.0 | Francis Ford Coppola | Al Pacino | Robert De Niro |
| 3 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... | 2008 | 9.0 | Christopher Nolan | Christian Bale | Heath Ledger |
| 4 | 12 Angry Men | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 1957 | 8.9 | Sidney Lumet | Henry Fonda | Lee J. Cobb |


Dropping the "dir_stars" column

In [108]:
movies.drop(columns="dir_stars",inplace=True)
movies.head()

| | title | year | rating | Director | star_1 | star_2 |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 9.2 | Frank Darabont | Tim Robbins | Morgan Freeman |
| 1 | The Godfather | 1972 | 9.1 | Francis Ford Coppola | Marlon Brando | Al Pacino |
| 2 | The Godfather: Part II | 1974 | 9.0 | Francis Ford Coppola | Al Pacino | Robert De Niro |
| 3 | The Dark Knight | 2008 | 9.0 | Christopher Nolan | Christian Bale | Heath Ledger |
| 4 | 12 Angry Men | 1957 | 8.9 | Sidney Lumet | Henry Fonda | Lee J. Cobb |


In [110]:
# reorder the columns so the names come before the numeric fields
movies = movies[['title', 'Director', 'star_1', 'star_2', 'year','rating']]
movies.head()

| | title | Director | star_1 | star_2 | year | rating |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Morgan Freeman | 1994 | 9.2 |
| 1 | The Godfather | Francis Ford Coppola | Marlon Brando | Al Pacino | 1972 | 9.1 |
| 2 | The Godfather: Part II | Francis Ford Coppola | Al Pacino | Robert De Niro | 1974 | 9.0 |
| 3 | The Dark Knight | Christopher Nolan | Christian Bale | Heath Ledger | 2008 | 9.0 |
| 4 | 12 Angry Men | Sidney Lumet | Henry Fonda | Lee J. Cobb | 1957 | 8.9 |
