# Intro to web scraping

The first step in web scraping is to identify a website and download its HTML.

Real HTML from websites tends to be long and a bit too chaotic for a total beginner. Here we start with a dummy HTML document and learn the basics of extracting information with BeautifulSoup.

In [1]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [2]:
html_doc

'\n<!DOCTYPE html>\n<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n\n<p class="story">...</p>\n</html>\n'

In [3]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [4]:
# parse the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

In [5]:
soup


<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

#### accessing single elements

We can access an HTML tag by appending a dot `.` and the tag's name to the soup object, e.g.:

* title
* body
* p
* a

If the tag appears more than once, **only the first instance is retrieved**.



In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



In [7]:
soup.title

<title>The Dormouse's story</title>

In [9]:
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [10]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

#### finding all elements of a tag with find_all()

To retrieve every element with a given tag, use `find_all()`; `find()` returns only the first match. To narrow the search to elements with a particular attribute (`id`, `class`), pass a dictionary as the second argument to `find_all()`. If a value in that dictionary is a list of classes, elements that carry any one of the listed classes are matched.
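
As a minimal sketch of the attribute dictionary, the snippet below uses a made-up HTML fragment (the `snippet`, `demo`, and class names are assumptions for illustration, not part of this lesson's documents). It shows a single class value, a list of class values, and, for comparison, a CSS selector that requires both classes at once.

```python
from bs4 import BeautifulSoup

# hypothetical fragment, used only for this illustration
snippet = """
<p class="intro">first</p>
<p class="note">second</p>
<p class="intro note">third</p>
<p id="last">fourth</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

# a single class value matches every <p> that carries that class
intro = [p.get_text() for p in demo.find_all("p", {"class": "intro"})]

# a list of values matches elements carrying ANY of the listed classes
either = [p.get_text() for p in demo.find_all("p", {"class": ["intro", "note"]})]

# to require BOTH classes, a CSS selector is one option
both = [p.get_text() for p in demo.select("p.intro.note")]

# attributes other than class are filtered the same way
by_id = [p.get_text() for p in demo.find_all("p", {"id": "last"})]

print(intro)   # ['first', 'third']
print(either)  # ['first', 'second', 'third']
print(both)    # ['third']
print(by_id)   # ['fourth']
```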

In [13]:
soup.find("p") # find the first element with a given tag (a, div, span, ...)

<p class="title"><b>The Dormouse's story</b></p>

In [12]:
soup.find_all("p") # to find all elements of a given nature

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [11]:
len(soup.find_all("p"))

3

In [14]:
soup.find_all("p")[-1]

<p class="story">...</p>

In [22]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can narrow the search by passing additional tag attributes, such as `class`, in a dictionary.

In [26]:
soup.find_all("p", {"class":"story"})

[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [27]:
soup.find_all("p", {"class":"story"})[0]

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

In [28]:
soup.find_all("p", {"class":"story"})[-1]

<p class="story">...</p>

In [29]:
soup


<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [30]:
soup.find_all("a", {"id":"link2"})

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

#### Using css selectors

Let's first learn the syntax of CSS selectors by playing this game: https://flukeout.github.io/

Everyone should reach level 6!

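To specify a parent-child hierarchy, we can use the `>` combinator:

`soup.select("tag1 > tag2")` selects every `tag2` that is a direct child of `tag1`. With a space instead (`"tag1 tag2"`), it selects every `tag2` nested at any depth inside `tag1`.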
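
The difference between the two combinators can be sketched on a small made-up fragment (the `snippet` and `demo` names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# hypothetical fragment: one link directly under <div>, one nested inside a <p>
snippet = """
<div>
  <a href="#direct">direct child</a>
  <p><a href="#nested">nested deeper</a></p>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# ">" keeps only the <a> tags whose parent is the <div> itself
children = [a.get_text() for a in demo.select("div > a")]

# a space keeps every <a> found anywhere inside the <div>
descendants = [a.get_text() for a in demo.select("div a")]

print(children)     # ['direct child']
print(descendants)  # ['direct child', 'nested deeper']
```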

In [31]:
soup


<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [36]:
# get all sister names
for i in soup.select("p > a"):
    print(i.get_text())

Elsie
Lacie
Tillie


In [40]:
soup.select("a#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [42]:
soup.select("a")[0]

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [41]:
soup.select("p.title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [32]:
soup.select("p > b")

[<b>The Dormouse's story</b>]

In [33]:
soup.select("p > b")[0].get_text()

"The Dormouse's story"

In [34]:
type(soup.select("p > b")[0].get_text())

str

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to a single element, while `select()` returns a list that may hold several elements. It's common to iterate through the output of `select()`.
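
A small sketch of that point, on a made-up fragment (`snippet`, `demo`, and `items` are illustrative names): calling `get_text()` on the whole result fails, iterating works, and `select_one()` is an option when only the first match is needed.

```python
from bs4 import BeautifulSoup

# hypothetical fragment, used only for this illustration
snippet = "<ul><li> apples </li><li> pears </li></ul>"
demo = BeautifulSoup(snippet, "html.parser")

items = demo.select("li")  # a list-like result with two elements

# the result set itself has no get_text(): this raises AttributeError
try:
    items.get_text()
    failed = False
except AttributeError:
    failed = True

# iterate to get the text of each element; strip=True trims the whitespace
texts = [li.get_text(strip=True) for li in items]

# select_one() returns a single element (the first match), so get_text() works directly
first = demo.select_one("li").get_text(strip=True)

print(failed)  # True
print(texts)   # ['apples', 'pears']
print(first)   # apples
```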

In [None]:
[elem.get_text().replace("\n"," ") for elem in soup.select("p")]

### Your turn:

Write code to print the following contents of the `geography` document defined further down (not including the HTML tags, only human-readable text):

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [52]:
# 1. every "fun fact" lives in a <p> tag
for i in soup.select("p"):
    print(i.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [54]:
# 2. every place name lives in an <h2> tag
for i in soup.select("h2"):
    print(i.get_text())

London
Paris
Spain


In [57]:
# 3.
for i in soup.select(".city"):
    print(i.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [62]:
# 4.

for i in soup.select(".city > h2"):  # ".city h2" also works: it matches h2 elements among all descendants, not just direct children
    print(i.get_text())

London
Paris


In [59]:
soup.select(".city")[0]

<div class="city">
<h2>London</h2>
<p>London is the most popular tourist destination in the world.</p>
</div>

In [45]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [46]:
soup = BeautifulSoup(geography,'html.parser')

In [None]:
city_list = []
for x in soup.find_all("p"):
    city_list.append(x.get_text())

In [None]:
city_list

In [None]:
[elem.get_text() for elem in soup.find_all("p")]

In [None]:
soup

In [None]:
soup.find_all("div", {"class":"city"})

In [None]:
for elem in soup.find_all("div", {"class":"city"}):
    #print(elem.find_all("h2")[0].get_text())
    #print(elem.find_all("p")[0].get_text())
    print("{}: facts: {}".format(elem.find_all("h2")[0].get_text(),elem.find_all("p")[0].get_text()))

In [None]:
[elem.get_text() for elem in soup.select("h2")]

In [None]:
soup = BeautifulSoup(geography, 'html.parser')

In [None]:
# 1. All the "fun facts" using .find_all()


First, get the tags that contain the text you want.

Then get the text inside them.

In [None]:
# 2. The names of all the places.

In [None]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)


In [None]:
for elem in soup.find_all("div", {"class":"city"}):
    #print(elem.h2.get_text() + ': ' + elem.p.get_text())
    print(elem.h2.get_text() + ': ' + ' '.join(elem.p.get_text().split()[1:]))
    #print(' '.join(elem.p.get_text().split()[1:]))

In [None]:
# 4. The names (not facts!) of all the cities (not countries!)


## Use case: IMDb top charts

Let's go to https://www.imdb.com/chart/top, where we'll see the top 250 movies according to IMDb ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- IMDb rating

- Director & main stars (they appear when you hover over the title)

Our objective is to scrape this information and store it in a pandas dataframe. We will proceed in steps:

1.
* Store the titles inside a list of titles
* Store the release year inside a list of years
* Store the rating inside another list
* Store the director and main stars into another list

2.
* Create a dictionary in which the keys will contain the column names of the dataframe and the values will be the lists created before

3.
* Create the dataframe from the dictionary
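
The three steps above can be sketched with placeholder data before any scraping is done. Every value and every `demo_` name below is made up for illustration; the real lists come from the scraped page.

```python
import pandas as pd

# step 1: one list per field (placeholder values, not real IMDb data)
demo_titles = ["Movie A", "Movie B"]
demo_years = [1999, 2005]
demo_ratings = [8.1, 7.4]
demo_dir_stars = [
    "Director A (dir.), Star A1, Star A2",
    "Director B (dir.), Star B1, Star B2",
]

# step 2: keys are the column names, values are the lists
demo_columns = {
    "title": demo_titles,
    "year": demo_years,
    "rating": demo_ratings,
    "dir_stars": demo_dir_stars,
}

# step 3: build the dataframe from the dictionary
demo_df = pd.DataFrame(demo_columns)
print(demo_df.shape)  # (2, 4)
```

All lists must have the same length, one entry per movie, or `pd.DataFrame` raises a `ValueError`.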


In [63]:
# 1. import libraries
import requests # to download html code
from bs4 import BeautifulSoup # to navigate through the html code
import pandas as pd
import numpy as np
import re

In [69]:
# 2. find url and store it in a variable
#url = "https://www.imdb.com/chart/top"
url = "https://www.imdb.com/search/title/?title_type=feature&sort=user_rating,desc"

In [70]:
# 3. download html with a get request. Use the function requests.get() and store the output in response
response = requests.get(url)
# a 200 status code means OK!
print(response.status_code)

200


In [71]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.text, 'html.parser')
# 4.2. check that the html code looks like it should
print(soup.prettify())

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   Feature Film
(Sorted by IMDb Rating Descending) - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
  </script>
  <link href="https://www.imdb.com/search/title/?title_type=feature" rel="canonical"/>
  <meta content="http://www

In [82]:
# 5. retrieve/extract the desired info (here, you'll paste the "Selector" you copied before to get the element that belongs to the top movie)

# on the search results page, each title is a link inside an <h3> tag
text = soup.select("h3 > a")
print(len(text))
print(text[0].get_text())

50
Erumbu


Let's start creating the list of titles.

In [86]:
titles = []

In [87]:
for i in text:
    print(i.get_text())
    titles.append(i.get_text())

Erumbu
Nee Jathaga
Heel'D
What is Art
Paper Line
Neon Bleed
Road King
Rudy
RI$E
Simón
Kaya Palat
Beega
Famous
Melody Drama
Abort
Potentially Dangerous
Stolen Dough
Important in the Life
Caralique
Unveiled
Juliet
Unbeatable Fighter
The Fragile King
Apple Cinema
Breaking the Dwarf Wall
Song of the Fly
Efunsetan Aniwura
Fight Back!
A1 Quality Media Presents Innocence
Decent Reflection
Am Rande der Zeiten
Jeta
Sisters and the Shrink 2
Shift-e Nimeh Shab
Mr. Local Man
Matriarch
Sospeso
Bandu Boxer
Sadguru
Praveena
Iron Rule
Omr-e Setare
Flames of Wrath
Tolou Dar Shab
Turvo
Prince Oak Oakleyski Starring Supremacy
Poets Are the Destroyers
Zhuchok
El Pirata
Elmar


In [88]:
len(titles)

50

In [None]:
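# alternative for the https://www.imdb.com/chart/top layout, where each title sits under <td class="titleColumn">
titles = [elem.get_text() for elem in soup.select("td.titleColumn a")]
len(titles)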

Now the lists of years.

In [117]:
years = []


In [119]:
#soup.find_all("span",{"class":"lister-item-year"})
text = soup.select("h3 > span.lister-item-year")
#soup.select("h3 > span.lister-item-year.text-muted.unbold")

for i in text:
    #print(i.get_text())
    years.append(i.get_text())

In [None]:
years = [int(re.sub(r"\D","",elem.get_text())) for elem in soup.find_all("span",{"class":"secondaryInfo"})]
years

Now the ratings.

In [113]:
#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(1) > td.ratingColumn.imdbRating
ratings = []


In [116]:
text = soup.select("div > strong")
#soup.select("h3 > span.lister-item-year.text-muted.unbold")

for i in text:
    # print(i.get_text())
    ratings.append(i.get_text())

In [None]:
ratings = [float(elem.get_text()) for elem in soup.select("strong")]
ratings

Now the directors.

In [129]:
#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(1) > td.titleColumn > a
directors = []


In [130]:
text = soup.select("p > a:nth-child(1)")

for i in text:
    # print(i.get_text())
    directors.append(i.get_text())

In [None]:
[elem.select("a") for elem in soup.find_all("td", {"class":"titleColumn"})]

In [None]:
soup.select("td.titleColumn a")[0]['title']

In [None]:
soup.select("td.titleColumn a")[0]['title'].split(" (dir.)")[0]

In [None]:
directors = [elem['title'].split(" (dir.)")[0] for elem in soup.select("td.titleColumn a")]
directors

Finally, the stars.

In [131]:
stars = []


In [132]:
text = soup.select("p > a:nth-child(n+2)")

for i in text:
    # print(i.get_text())
    stars.append(i.get_text())

In [None]:
star1 = [elem['title'].split(",")[1][1:] for elem in soup.select("td.titleColumn a")]
star1

In [None]:
star2 = [elem['title'].split(",")[2][1:] for elem in soup.select("td.titleColumn a")]
star2

In [None]:
imdb_df = pd.DataFrame({'titles':titles,'rating':ratings,'director':directors, 'star1':star1, 'star2':star2 })

In [None]:
imdb_df

The selector we copied is long and ugly, isn't it? And it selects only one movie, while we want to collect data from all of them. Going from that particular selector to one that's more "general" and "elegant" is the actual work the web scraper needs to do.

In this case, we can play around a bit with different tags and classes until we notice that all the information about the movies is under the tag `<td class="titleColumn">`. We're lucky that there's not much "trash" under this tag, just the info we need.

In [None]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:

# instead of ["title"] we could use .get("title"): choose whatever you prefer
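
As a sketch of the two ways to read an attribute, the fragment below imitates the `td.titleColumn` structure described above; the link, names, and `title` text are placeholders, not real IMDb data.

```python
from bs4 import BeautifulSoup

# hypothetical fragment shaped like the structure described above
snippet = (
    '<td class="titleColumn">'
    '<a href="/title/x/" title="Director A (dir.), Star A1, Star A2">Movie A</a>'
    "</td>"
)
demo = BeautifulSoup(snippet, "html.parser")
link = demo.select_one("td.titleColumn a")

# dictionary-style access and .get() return the same value for an existing attribute
by_key = link["title"]
by_get = link.get("title")

# they differ for a missing attribute: .get() returns None, ["..."] raises KeyError
missing = link.get("id")
try:
    link["id"]
    raised = False
except KeyError:
    raised = True

print(by_key)   # Director A (dir.), Star A1, Star A2
print(missing)  # None
print(raised)   # True
```

`.get()` tends to be the safer choice when some tags on a page may lack the attribute.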

In [None]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# we also specify the parent tag and its class, which is the same we used before
# the years are inside parentheses, but we'll take care of that later


#### Building the dataframe

In [None]:
# Create a list for each of the variables you're scraping


In [137]:
# Each list becomes a dataframe column
# We want a dataframe in which the columns are: title, year, rating, director, stars
# start with the first three; the remaining lists can be added the same way
df = pd.DataFrame({'title': titles, 'year': years, 'rating': ratings})

ValueError: All arrays must be of the same length
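
This error means the lists passed to `pd.DataFrame` differ in length. A quick way to find the odd one out is to compare the lengths before building the dataframe. The sketch below uses placeholder lists (all names and values are made up); one possible cause to check, stated here as an assumption rather than a diagnosis of the run above, is a cell that appends to an existing list and was executed more than once.

```python
# placeholder lists of unequal length, standing in for the scraped ones
demo_columns = {
    "title": ["Movie A", "Movie B", "Movie C"],
    "year": ["(1999)", "(2005)", "(2010)"],
    "rating": ["8.1", "7.4"],
}

# length of every list, keyed by its future column name
lengths = {name: len(values) for name, values in demo_columns.items()}
print(lengths)  # {'title': 3, 'year': 3, 'rating': 2}

# more than one distinct length means pd.DataFrame would raise the ValueError
mismatched = len(set(lengths.values())) > 1
print(mismatched)  # True
```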

Unfortunately, the elements of the stars column are lists. We would like to have two columns: one for the first star and another for the second star.

In [134]:
# assumes each element of stars is a list of names for one movie
star1 = [elem[0] for elem in stars]
star2 = [elem[1] for elem in stars]

pd.DataFrame({'title': titles, 'year': years, 'director': directors, 'star1': star1, 'star2': star2, 'ratings': ratings})

ValueError: All arrays must be of the same length

#### Cleaning the data

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the year out of the parentheses: we know we can totally do that with regex, but string methods such as str.replace() might be simpler to use.

- Split dir_stars into 3 columns, one for each person: "director", "star_1", "star_2". This could have been done by filtering when extracting the data from the html document, but it looks easier afterwards:

    - The "(dir.)" pattern can be totally removed
    - We can split the string at each comma
    
- Change the data type of the year column to integer.
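
One possible way to carry out these cleaning steps with pandas string methods is sketched below. The dataframe `raw` and all of its values are placeholders, and the sketch assumes every `dir_stars` entry holds exactly one director and two stars.

```python
import pandas as pd

# placeholder data in the shape described above (not real IMDb data)
raw = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "year": ["(1999)", "(2005)"],
    "dir_stars": [
        "Director A (dir.), Star A1, Star A2",
        "Director B (dir.), Star B1, Star B2",
    ],
})

# take the year out of the parentheses, then convert it to integer
raw["year"] = (
    raw["year"]
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .astype(int)
)

# remove "(dir.)" and split at each comma into one column per person
people = (
    raw["dir_stars"]
    .str.replace(" (dir.)", "", regex=False)
    .str.split(", ", expand=True)
)
people.columns = ["director", "star_1", "star_2"]

# replace the combined column with the three new ones
clean = pd.concat([raw.drop(columns="dir_stars"), people], axis=1)
print(clean)
```

Passing `regex=False` makes `str.replace` treat the parentheses and the dot literally, so no escaping is needed.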

In [None]:
# year out of the parentheses


In [None]:
# remove "(dir.)"


In [None]:
# a column for each person


In [None]:
# year column to integer


In [None]:
url = "https://www.billboard.com/charts/hot-100/"

In [None]:
response = requests.get(url)

In [None]:
hot100 = BeautifulSoup(response.content, 'html.parser')

In [None]:
hot100

In [None]:
for row in hot100.select("li.o-chart-results-list__item"):
    # select() returns a list, so call get_text() on each span separately
    for span in row.select("span"):
        print(span.get_text(strip=True))

In [None]:
hot100.select("span.c_label")

In [None]:
soup