# Web scraping one page using Beautiful Soup

### Tools for scraping 

+ https://www.crummy.com/software/BeautifulSoup/bs4/doc/  (this is what we will use in lectures)

+ https://scrapy.org/

+ https://selenium-python.readthedocs.io/



## Dormouse HTML Code 


In [1]:
#create the variable

html_doc ="""
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [2]:
# Install first if needed: conda install -c anaconda beautifulsoup4

# Import the needed libraries

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd


In [3]:
# parse (create) the soup 
soup_mouse = BeautifulSoup(html_doc, 'html.parser')

In [4]:
soup_mouse


<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [5]:
# prettify the soup 
print(soup_mouse.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



## Option 1 - using beautiful soup the "HTML" way  

In [6]:
# using basic tree navigation to access single elements
soup_mouse.title

<title>The Dormouse's story</title>

In [7]:
soup_mouse.title.string

"The Dormouse's story"

In [8]:
soup_mouse.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [9]:
soup_mouse.p

<p class="title"><b>The Dormouse's story</b></p>

In [10]:
# find elements of the tag using find_all()

In [11]:
p_tags = soup_mouse.find_all('p')

In [12]:
for p in p_tags:
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [13]:
soup_mouse.title.parent.string

"The Dormouse's story"

In [14]:
a_tags = soup_mouse.find_all('a')

In [15]:
a_tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [16]:
for atag in a_tags:
    print(atag.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [17]:
soup_mouse.text.count('were')

2

In [18]:
re.findall(r'\w+', requests.get('https://www.ironhack.com/en').text).count('data')

1725
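
The count above runs over the raw HTML, so it also picks up matches inside attribute names and scripts. A minimal sketch of counting only the visible text instead, using a small made-up snippet (the `html` string and the variable names are illustrative, not taken from the Ironhack page):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: "data" occurs in an attribute, in the text and in a script
html = '<div data-id="1"><p>Open data beats closed data.</p><script>var data = 1;</script></div>'

# Counting on the raw HTML also matches attribute names and script code
raw_count = re.findall(r'\w+', html).count('data')

# Drop <script>/<style> tags, then count on the remaining text only
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all(['script', 'style']):
    tag.decompose()
visible_count = re.findall(r'\w+', soup.get_text()).count('data')

print(raw_count, visible_count)
```

Which of the two counts is the "right" one depends on the question being asked; the point is only that they can differ.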

## Option 2 - using beautiful soup the "CSS" way

As we will be using CSS selectors, let's first learn their syntax by playing this game: https://flukeout.github.io/

Everyone should reach level 12!
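
As a quick reference, here is a sketch of a few selector forms from the game applied with `select()`. The `html` string is a shortened, illustrative copy of the Dormouse document, and the variable names are made up for this example:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the Dormouse document
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('a')                  # tag name
by_class = soup.select('.sister')          # class
by_id = soup.select('#link1')              # id
descendants = soup.select('p.title b')     # <b> anywhere inside <p class="title">
children = soup.select('p.story > a')      # <a> directly inside <p class="story">
by_attr = soup.select('a[href$="lacie"]')  # href attribute ending in "lacie"

print([a.get_text() for a in by_attr])
```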

In [19]:
# using select()
soup_mouse.select('#link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to a single element, while `select()` always returns a list of elements (possibly empty, possibly with just one item). It's therefore common to iterate through the output of `select()`.
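
A minimal sketch of that difference, on a made-up two-link snippet. It also uses `select_one()`, which returns the first match (or `None`) instead of a list:

```python
from bs4 import BeautifulSoup

# Illustrative snippet with two matching elements
html = '<p><a class="sister" id="link1">Elsie</a> <a class="sister" id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

sisters = soup.select('.sister')            # list-like result, even for one match
names = [a.get_text() for a in sisters]     # so we iterate over it

# Calling get_text() on the whole result fails
try:
    sisters.get_text()
    failed = False
except AttributeError:
    failed = True

first = soup.select_one('.sister')          # a single element
missing = soup.select_one('.brother')       # None when nothing matches

print(names, failed, first.get_text(), missing)
```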

In [20]:
soup_mouse.select('.sister')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [21]:
for a in soup_mouse.select('a'):
    print(a.get_text())

Elsie
Lacie
Tillie


Useful links for the lecture:

+ https://www.w3schools.com/cssref/css_selectors.asp
+ https://www.w3schools.com/tags/default.asp
+ https://www.w3schools.com/css/css_syntax.ASP
+ https://www.imdb.com/chart/top/

In [22]:
for p in soup_mouse.select('p.story'):
    print(p.get_text())

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [23]:
print(soup_mouse.select('p.story')[1].get_text())

...


## Activity 

Write code to extract and print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts"

2. The names of all the places

3. The content (name and fact) of all the cities (only cities, not countries) 

4. The names (not facts!) of all the cities (not countries)


In [24]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [25]:
soup_geo = BeautifulSoup(geography, 'html.parser')

In [26]:
# 1. All the "fun facts"
fun_facts = soup_geo.find_all('p')
for fact in fun_facts:
    print(fact.string)



London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [27]:
for fact in fun_facts:
    print(fact.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.
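
Both `.string` and `get_text()` give the same result here because each `<p>` contains nothing but text. A small sketch of where they differ; the nested `<b>` tag is an assumption added for illustration and is not in the activity HTML:

```python
from bs4 import BeautifulSoup

# Illustrative snippet: the <p> holds text plus a nested <b> tag
html = '<div><h2>London</h2><p>London is <b>very</b> popular.</p></div>'
soup = BeautifulSoup(html, 'html.parser')

h2_string = soup.h2.string   # the tag has a single text child, so .string returns it
p_string = soup.p.string     # several children, so .string is None
p_text = soup.p.get_text()   # get_text() joins all the nested text

print(h2_string, p_string, p_text)
```

When in doubt, `get_text()` is the safer choice for extracting human-readable text.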


example:
    

**Paris was originally a Roman City called Lutetia**


In [28]:
# 2. The names of all the places.
cities = soup_geo.find_all('div')
for city in cities:
    print(city.find('h2').string)



London
Paris
Spain


In [29]:
for i in soup_geo.find_all('h2'):
    print(i.get_text())


London
Paris
Spain


example: 

**Paris**

In [30]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)
divs = soup_geo.find_all('div', class_='city')  # filter on the class to skip countries
for div in divs:
    print(div.find('h2').string)
    print(div.find('p').string)


London
London is the most popular tourist destination in the world.
Paris
Paris was originally a Roman City called Lutetia.


In [31]:
for i in soup_geo.find_all('div', {'class':'city'}):
    print(i.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



example: 
    
**Paris**

**Paris was originally a Roman City called Lutetia.**

In [32]:
# 4. The names (not facts!) of all the cities (not countries!)

for i in soup_geo.find_all('div', {'class':'city'}):
    print(i.h2.get_text())

London
Paris
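
The same four tasks can also be sketched the "CSS" way with `select()`. This is one possible solution, not the only one; the HTML is a condensed copy of the `geography` string so that the block runs on its own:

```python
from bs4 import BeautifulSoup

# Condensed copy of the activity HTML
geography = """
<div class="city"><h2>London</h2><p>London is the most popular tourist destination in the world.</p></div>
<div class="city"><h2>Paris</h2><p>Paris was originally a Roman City called Lutetia.</p></div>
<div class="country"><h2>Spain</h2><p>Spain produces 43,8% of all the world's Olive Oil.</p></div>
"""
soup_geo = BeautifulSoup(geography, 'html.parser')

# 1. All the "fun facts"
fun_facts = [p.get_text() for p in soup_geo.select('p')]
# 2. The names of all the places
place_names = [h2.get_text() for h2 in soup_geo.select('h2')]
# 3. Name and fact of the cities only
city_content = [(d.h2.get_text(), d.p.get_text()) for d in soup_geo.select('div.city')]
# 4. Names of the cities only
city_names = [h2.get_text() for h2 in soup_geo.select('div.city h2')]

print(city_names)
```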


# Scraping the IMDB top 250

Let's go to https://www.imdb.com/chart/top, where we'll see the top 250 movies according to IMDb ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- IMDb rating

- Director & main stars (they appear when you hover over the title)

Our objective is going to be to scrape this information and store it in a pandas dataframe.

In [33]:
# 1. import libraries: BeautifulSoup, requests, pandas (already imported above)


# 2. find the url and store it in a variable
url = "https://www.imdb.com/chart/top"

# 3. download html with a get request
response = requests.get(url)

In [34]:
#check response status code 
response.status_code


200
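
A status code of 200 means the request succeeded. Some sites answer scripted requests with an error code instead, so a slightly more defensive download can help. A minimal sketch: the `fetch_html` helper, the `User-Agent` string and the timeout value are illustrative assumptions, not part of the original notebook.

```python
import requests

# Assumed browser-like header; adjust or drop it as needed
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; course-scraper/0.1)"}

def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on 4xx/5xx responses."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for error status codes
    return response.text

# Usage (needs network access):
# html = fetch_html("https://www.imdb.com/chart/top")
```

Whether a given site accepts the request still depends on that site; this only makes failures visible early instead of parsing an error page by accident.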

In [58]:
#parse and store the contents of the url call
soup_imdb = BeautifulSoup(response.content, 'html.parser')
#print(soup_imdb.prettify())


In [36]:
#prettify the soup 


### Query the soup to get movie title, actors, director, year 


In [38]:
soup_imdb.select('td.titleColumn a')[4].text

'Die zwölf Geschworenen'

In [40]:
soup_imdb.select('td.titleColumn a')[4]['title']

'Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb'

In [None]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:

# instead of ["title"] we could use .get("title"): choose whatever you prefer
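
The two ways of reading an attribute behave differently when the attribute is missing. A short sketch on a single made-up cell that mimics the structure queried above:

```python
from bs4 import BeautifulSoup

# Illustrative cell; this link deliberately has no href attribute
html = '<td class="titleColumn"><a title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">Die zwölf Geschworenen</a></td>'
link = BeautifulSoup(html, 'html.parser').a

people = link['title']        # dictionary-style access
same = link.get('title')      # equivalent when the attribute exists
missing = link.get('href')    # .get() returns None for a missing attribute

try:
    link['href']              # [] raises KeyError for a missing attribute
    raised = False
except KeyError:
    raised = True

print(people, missing, raised)
```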

In [41]:
soup_imdb.select('td.titleColumn span.secondaryInfo')[5].text

'(1993)'

In [None]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# we also specify the parent tag and its class, which is the same we used before
# the years are inside parentheses, but we'll take care of that later

### Once we have a method working for one movie, we can apply it for all the movies

- loop through movies
- pick up title, director, actors, year
- store in a list
- for example:

`movie_lst = soup_imdb.select("td.titleColumn a")`

`yr_lst = soup_imdb.select("td.titleColumn span.secondaryInfo")`
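
An alternative to keeping separate lists is to walk over the cells once and read every field from the same cell, so the values of one movie always stay together. A minimal sketch on a two-row mock table; the HTML below only mimics the `td.titleColumn` structure used in this notebook and is not the live IMDb page:

```python
from bs4 import BeautifulSoup

# Mock table mimicking the structure queried in this notebook
html = """
<table>
<tr><td class="titleColumn">1. <a title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Die Verurteilten</a> <span class="secondaryInfo">(1994)</span></td></tr>
<tr><td class="titleColumn">5. <a title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">Die zwölf Geschworenen</a> <span class="secondaryInfo">(1957)</span></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

records = []
for cell in soup.select('td.titleColumn'):
    link = cell.select_one('a')                      # title text + people in "title"
    year_tag = cell.select_one('span.secondaryInfo') # year in parentheses
    records.append({
        'title': link.get_text(),
        'dir_stars': link['title'],
        'year': year_tag.get_text(),
    })

print(records[0])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`.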

In [43]:
# install tqdm first if needed: conda install -c conda-forge tqdm
from tqdm.notebook import tqdm

In [47]:
title = []
dir_stars = []
year = []
len_movies = len(soup_imdb.select('td.titleColumn a'))
len_movies

250

In [48]:
# run each selector once instead of once per loop iteration
links = soup_imdb.select('td.titleColumn a')
years = soup_imdb.select('td.titleColumn span.secondaryInfo')
for i in tqdm(range(len_movies)):
    title.append(links[i].text)
    dir_stars.append(links[i]['title'])
    year.append(years[i].text)

  0%|          | 0/250 [00:00<?, ?it/s]

In [49]:
movies_top_250 = pd.DataFrame({'title':title, 'dir_stars': dir_stars, 'year':year})

In [50]:
movies_top_250.head()

| | title | dir_stars | year |
|---|---|---|---|
| 0 | Die Verurteilten | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | (1994) |
| 1 | Der Pate | Francis Ford Coppola (dir.), Marlon Brando, Al... | (1972) |
| 2 | Der Pate 2 | Francis Ford Coppola (dir.), Al Pacino, Robert... | (1974) |
| 3 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... | (2008) |
| 4 | Die zwölf Geschworenen | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | (1957) |


### Cleaning / Wrangling steps for the scraped data 

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the year out of the parentheses: regex would work, but string methods such as `str.strip()` or `str.replace()` are simpler to use.

- Split dir_stars into 3 columns, one for each person: "director", "star_1", "star_2". This could have been done by filtering when extracting the data from the html document, but it looks easier afterwards:

    - The "(dir.)" pattern can be removed
    - We can split the string at each comma
    
- Change the data type of the year column to integer.
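
The three steps above can also be sketched with pandas string methods directly on the data frame. This is one possible approach under the assumption that every `dir_stars` value has exactly one director and two stars; the two sample rows are copied from the output above:

```python
import pandas as pd

# Two sample rows in the same shape as the scraped data
df = pd.DataFrame({
    'title': ['Die Verurteilten', 'Die zwölf Geschworenen'],
    'dir_stars': ['Frank Darabont (dir.), Tim Robbins, Morgan Freeman',
                  'Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb'],
    'year': ['(1994)', '(1957)'],
})

# Strip the parentheses and convert the year to integer
df['year'] = df['year'].str.strip('()').astype(int)

# Remove the "(dir.)" marker, then split on ', ' into three columns
people = (df['dir_stars']
          .str.replace(' (dir.)', '', regex=False)
          .str.split(', ', expand=True))
df[['director', 'star_1', 'star_2']] = people
df = df.drop(columns='dir_stars')

print(df)
```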


In [53]:
year_clean = [yr.strip('()') for yr in year]

In [64]:
director = []
star_1 = []
star_2 = []
for movie in dir_stars:
    # split on ', ' so the star names carry no leading space
    split_list = movie.split(', ')
    director.append(split_list[0].replace(' (dir.)', ''))
    star_1.append(split_list[1])
    star_2.append(split_list[2])

### Create data frame from results and preview 

In [65]:
movies_new = pd.DataFrame({'title':title, 'director': director, 'star_1':star_1, 'star_2':star_2})

In [66]:
movies_new.head()

| | title | director | star_1 | star_2 |
|---|---|---|---|---|
| 0 | Die Verurteilten | Frank Darabont | Tim Robbins | Morgan Freeman |
| 1 | Der Pate | Francis Ford Coppola | Marlon Brando | Al Pacino |
| 2 | Der Pate 2 | Francis Ford Coppola | Al Pacino | Robert De Niro |
| 3 | The Dark Knight | Christopher Nolan | Christian Bale | Heath Ledger |
| 4 | Die zwölf Geschworenen | Sidney Lumet | Henry Fonda | Lee J. Cobb |
