# Web Scraping multiple pages 

We have practiced web scraping when all the information we wanted was on a single table of a site. What happens when we want to scrape information from multiple pages?

## First example - IMDB 

Go to https://www.imdb.com/search/title/ and enter the following parameters, leaving all other fields blank or at their default values:

- Title Type: Feature film

- Release date: From 1990 to 1992

- User Rating: 7.5 to "-"

The page you get should be familiar. There's a list of movies, and each movie has its title, release year, crew, etc. You could inspect the page and build the code to collect the data.

Note that the query returns hundreds of movies, while each page shows only 50 of them (you can change the settings to get up to 250 movies per page, but that still won't be the complete list).

One way to automate multi-page web scraping is to look at the URLs.

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,

Note what the URL looks like after you scroll down and click on "Next":

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt

Can you see the pattern?

Our search options are in the parameters title_type, release_date and user_rating. Then we have the start parameter, which increases in steps of 50 (1, 51, 101, ...), and the ref_ parameter, which takes the value "adv_nxt".
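
As a small illustration of that pattern, the URL can be assembled from its parameters instead of being concatenated by hand. This is only a sketch: `BASE_URL` and `build_search_url` are names made up for this example, and the parameter values are simply the ones visible in the URLs above.

```python
from urllib.parse import urlencode

# Base address of the search page, taken from the URLs shown above.
BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(start):
    """Return the search URL for the page whose first movie is number `start`."""
    params = {
        "title_type": "feature",
        "release_date": "1990-01-01,1992-12-31",
        "user_rating": "7.5,",
        "start": start,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the commas as they are instead of encoding them as %2C
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_search_url(51))
```

Keeping the parameters in a dictionary makes it easy to change a single search option later without touching the rest of the string.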

In [1]:
#  import libraries
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [2]:
#  url: this time, start with the 'second' page
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt"

In [4]:
# download html with a request, check response code 
response = requests.get(url)
response.status_code

200

In [14]:
#  parse html (create the 'soup')
soup_imdb = BeautifulSoup(response.content, 'html.parser')

# check that the html code looks as expected 
# print(soup_imdb.prettify())


Now we have to build a list of start values that increases in steps of 50, up to the total number of movies we want to scrape.

In [15]:
# define iterations 
iter = range(1, 537, 50)

In [16]:
# check the iterations work
iter

range(1, 537, 50)
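
The stop value 537 above is typed in by hand. A minimal sketch of how the same start values could be derived from the total number of results, assuming you read that total off the search page yourself (`page_starts` is a hypothetical helper, not part of any library):

```python
def page_starts(total_results, per_page=50):
    """Return the `start` value of every results page: 1, 51, 101, ..."""
    # total_results + 1 so that a movie sitting exactly at the stop value
    # still gets its own page
    return list(range(1, total_results + 1, per_page))


starts = page_starts(537)
print(starts)
```

With 537 results and 50 movies per page this gives 11 pages, the last one starting at 501.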

In [17]:
# create the url string for the page search, populate with the iterations
for i in iter:
    start_at = str(i)
    url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=' + start_at + '&ref_=adv_nxt'
    print(url)

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=1&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=151&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=201&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=251&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=301&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating

In [None]:
# test the urls 


### Respectful scraping:

Before starting with the actual scraping, though, there's something we need to note when sending automated requests to websites: it's good practice to let a few seconds pass in between requests. 

Some pages don't like being scraped and will block your IP if they detect you are sending automated requests. Others might have a small server for the traffic they handle, and sending too many requests might crash the site.

The sleep function from the time module will help us with that.

In [18]:
from time import sleep

#simple example 
for i in range(5):
    print(i)
    sleep(3)



0
1
2
3
4


In [19]:
# To make it more "human", we can randomize the waiting time:
from random import randint

In [20]:
for i in range(5):
    print(i)
    wait_time = randint(1, 4)
    print('sleep time', wait_time)
    sleep(wait_time)

0
sleep time 4
1
sleep time 4
2
sleep time 2
3
sleep time 1
4
sleep time 4
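
The same idea can be wrapped in a small reusable function. This is just a sketch: `polite_pause` is a made-up name, and it uses `random.uniform` instead of `randint` so that the waiting time does not have to be a whole number of seconds.

```python
from random import uniform
from time import sleep


def polite_pause(min_seconds=1.0, max_seconds=4.0, sleeper=sleep):
    """Wait a random, fractional number of seconds and return the time waited."""
    wait_time = uniform(min_seconds, max_seconds)
    sleeper(wait_time)
    return wait_time


# Very short bounds here so that the demonstration finishes quickly.
waited = polite_pause(0.01, 0.02)
print('sleep time', round(waited, 3))
```

The `sleeper` argument is only there so the function can be tried out without actually waiting, for example with `polite_pause(sleeper=lambda seconds: None)`.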


### Assembling the script to send and store multiple requests

In [21]:
pages = []
for i in iter:
    start_at = str(i)
    url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=' + start_at + '&ref_=adv_nxt'
    response = requests.get(url)
    print(response.status_code)
    pages.append(response)
    wait_time = randint(1, 4)
    sleep(wait_time)
    

200
200
200
200
200
200
200
200
200
200
200


In [22]:
pages

[<Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>]

In [24]:
#BeautifulSoup(pages[2].content, 'html.parser')

Note: if you print the object pages after running the code above, you'll just see the response codes, but the HTML of each page is still accessible through the content attribute and you can parse it the same way as before.

### Build code to collect the relevant information from the Request 

This is what we need:

##### Parse just the first page, for testing purposes
- soup=BeautifulSoup(pages[0].content, "html.parser")

##### title and synopsis

- soup.select("div.lister-item-content > h3 > a")
- soup.select("div.lister-item-content > p:nth-child(4)")

#### titles

In [36]:
# Parse just the first page, for testing purposes
soup = BeautifulSoup(pages[0].content, 'html.parser')
# Paste the selector of the first movie title copied from Chrome Dev Tools,
# then trim it so that it matches every title on the page
soup.select('h3 > a')


[<a href="/title/tt0103064/">Terminator 2: Tag der Abrechnung</a>,
 <a href="/title/tt0099685/">GoodFellas - Drei Jahrzehnte in der Mafia</a>,
 <a href="/title/tt0099674/">Der Pate 3</a>,
 <a href="/title/tt0105236/">Reservoir Dogs: Wilde Hunde</a>,
 <a href="/title/tt0102926/">Das Schweigen der Lämmer</a>,
 <a href="/title/tt0104257/">Eine Frage der Ehre</a>,
 <a href="/title/tt0104691/">Der letzte Mohikaner</a>,
 <a href="/title/tt0100802/">Total Recall - Die totale Erinnerung</a>,
 <a href="/title/tt0101507/">Boyz n the Hood - Jungs im Viertel</a>,
 <a href="/title/tt0105695/">Erbarmungslos</a>,
 <a href="/title/tt0099785/">Kevin - Allein zu Haus</a>,
 <a href="/title/tt0104952/">Mein Vetter Winnie</a>,
 <a href="/title/tt0099348/">Der mit dem Wolf tanzt</a>,
 <a href="/title/tt0103074/">Thelma &amp; Louise</a>,
 <a href="/title/tt0105323/">Der Duft der Frauen</a>,
 <a href="/title/tt0099810/">Jagd auf Roter Oktober</a>,
 <a href="/title/tt0099487/">Edward mit den Scherenhänden</a>,

#### synopsis

In [53]:
# Paste the selector of the first movie synopsis copied from Chrome Dev Tools and trim it
soup.select('p:nth-child(4)')[0].get_text().strip()


'A cyborg, identical to the one who failed to kill Sarah Connor, must now protect her ten year old son, John Connor, from a more advanced and powerful cyborg.'

### combine all the code 

There are many approaches to do this. The one we'll follow is: 

- Loop through the pages we collected, parse them ("create the soup") and store the parsed pages in a list. 

- For each parsed page, select the "blocks of HTML elements" that contain all the information of each movie (the title, the synopsis and other stuff). 

- For each one of the "blocks" we collected in the previous step: 

    - Get the movie titles and store them in a list 

    - Get the synopsis and store it in a list
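
The steps above can be tried out on a tiny hand-written page before running them on the real responses. Everything in `sample_html` is invented for the illustration (it only imitates the structure the selectors expect), and `extract_movies` is a hypothetical helper. The sketch uses `select_one`, which returns `None` instead of raising an error when a block has no matching element.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one results page (hypothetical content).
sample_html = """
<div class="lister-item-content">
  <h3><span>1.</span> <a href="/title/tt0000001/">First Movie</a></h3>
  <p>PG</p>
  <div>rating</div>
  <p>  A short synopsis.  </p>
</div>
<div class="lister-item-content">
  <h3><span>2.</span> <a href="/title/tt0000002/">Second Movie</a></h3>
</div>
"""


def extract_movies(page_soup):
    """Return (titles, synopses) for every movie block in a parsed page."""
    titles_found, synopses_found = [], []
    for block in page_soup.select("div.lister-item-content"):
        title_tag = block.select_one("h3 > a")
        synopsis_tag = block.select_one("p:nth-child(4)")
        # Store None when a block lacks the element, so both lists stay aligned
        titles_found.append(title_tag.get_text() if title_tag else None)
        synopses_found.append(
            synopsis_tag.get_text().strip() if synopsis_tag else None
        )
    return titles_found, synopses_found


titles_demo, synopses_demo = extract_movies(BeautifulSoup(sample_html, "html.parser"))
print(titles_demo)
print(synopses_demo)
```

Storing `None` for a missing piece keeps the two lists the same length, which matters once they become columns of a dataframe.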

In [54]:
titles = []
synopsis = []
pages_parsed = []
for page in pages:
    # parse each stored response and keep the soup for later use
    page_soup = BeautifulSoup(page.content, 'html.parser')
    pages_parsed.append(page_soup)
    # each 'div.lister-item-content' block holds the information of one movie
    for movie in page_soup.select('div.lister-item-content'):
        titles.append(movie.select('h3 > a')[0].get_text())
        synopsis.append(movie.select('p:nth-child(4)')[0].get_text().strip())


In [44]:
# check the output and identify any wrangling steps we missed
len(titles), len(synopsis)


(537, 537)

In [55]:
#titles

In [57]:
#synopsis

-----------

## 2nd example - Scraping presidents

Our objective is to create a dataframe with information about the presidents of the United States. To do this, we will go through 5 steps:

1. Scrape this [list of presidents of the United States](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States).


In [59]:
# 1. import libraries (already done at the top of the notebook)

# 2. find url and store it in a variable
url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
# 3. download html with a get request
response = requests.get(url)
response.status_code

# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, 'html.parser')
# 4.2. check that the html code looks like it should
#soup

2. Collect all the links to the Wikipedia page of each president.


In [70]:
presidents = []
# walk through table rows 1 to 83 (nth-child is 1-based, so 0 would match nothing)
for i in range(1, 84):
    presidents += soup.select('tbody > tr:nth-child(' + str(i) + ') > td:nth-child(4) > b > a')

In [71]:
presidents

[<a href="/wiki/George_Washington" title="George Washington">George Washington</a>,
 <a href="/wiki/John_Adams" title="John Adams">John Adams</a>,
 <a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>,
 <a href="/wiki/James_Madison" title="James Madison">James Madison</a>,
 <a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>,
 <a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>,
 <a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>,
 <a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>,
 <a href="/wiki/William_Henry_Harrison" title="William Henry Harrison">William Henry Harrison</a>,
 <a href="/wiki/John_Tyler" title="John Tyler">John Tyler</a>,
 <a href="/wiki/James_K._Polk" title="James K. Polk">James K. Polk</a>,
 <a href="/wiki/Zachary_Taylor" title="Zachary Taylor">Zachary Taylor</a>,
 <a href="/wiki/Millard_Fillmore" title="Millard Fillmore">Millard Fillmore</a>,
 

In [72]:
# we can access the links searching for the attribute "href"
# in each element
presidents[40]['href']

'/wiki/George_H._W._Bush'
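
The href values are relative paths, so they need the site address in front of them before they can be requested. One possible way to do that, sketched here with the standard library's `urljoin` (the names `WIKI_BASE` and `absolute_url` are made up for this example):

```python
from urllib.parse import urljoin

WIKI_BASE = "https://en.wikipedia.org"


def absolute_url(href, base=WIKI_BASE):
    """Turn a relative link such as '/wiki/John_Adams' into a full URL."""
    return urljoin(base, href)


print(absolute_url("/wiki/George_H._W._Bush"))
```

Unlike plain string concatenation, `urljoin` leaves a link untouched if it already is a complete URL.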

In [75]:
# Now, we just assemble a new request to the link
url = 'https://en.wikipedia.org' + presidents[0]['href']
# send request
response = requests.get(url)
response

# parse & store html
soup = BeautifulSoup(response.content, 'html.parser')

In [77]:
#soup.find('table', {'class':'infobox vcard'})

3. Scrape the Wikipedia page of each president.


In this step we could very well store the whole Wikipedia page for each president, or just the tiny, final pieces of information. Storing the infoboxes is a middle ground (we don't have too much noise but retain the flexibility of deciding later which specific elements to extract).

When sending multiple requests, remember to be respectful by spacing them a few seconds apart. We will also print the status code of each response to check that everything is going well:

In [79]:
# request the page of each president and keep only its infobox
pres_soups = []
for pres in presidents:
    # send request
    url = 'https://en.wikipedia.org' + pres['href']
    response = requests.get(url)
    print(pres.get_text(), response.status_code)
    # parse & store html
    soup = BeautifulSoup(response.content, 'html.parser')
    pres_soups.append(soup.find('table', {'class':'infobox vcard'}))
    # respectful nap:
    wait_time = randint(1, 2)
    sleep(wait_time)
 

George Washington 200
John Adams 200
Thomas Jefferson 200
James Madison 200
James Monroe 200
John Quincy Adams 200
Andrew Jackson 200
Martin Van Buren 200
William Henry Harrison 200
John Tyler 200
James K. Polk 200
Zachary Taylor 200
Millard Fillmore 200
Franklin Pierce 200
James Buchanan 200
Abraham Lincoln 200
Andrew Johnson 200
Ulysses S. Grant 200
Rutherford B. Hayes 200
James A. Garfield 200
Chester A. Arthur 200
Grover Cleveland 200
Benjamin Harrison 200
Grover Cleveland 200
William McKinley 200
Theodore Roosevelt 200
William Howard Taft 200
Woodrow Wilson 200
Warren G. Harding 200
Calvin Coolidge 200
Herbert Hoover 200
Franklin D. Roosevelt 200
Harry S. Truman 200
Dwight D. Eisenhower 200
John F. Kennedy 200
Lyndon B. Johnson 200
Richard Nixon 200
Gerald Ford 200
Jimmy Carter 200
Ronald Reagan 200
George H. W. Bush 200
Bill Clinton 200
George W. Bush 200
Barack Obama 200
Donald Trump 200
Joe Biden 200
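
The loop above prints each status code but stores the result regardless of it. A possible variation, sketched under the assumption that only responses with status 200 are worth keeping: `fetch_all` is a hypothetical helper that takes the request function and the pause function as arguments, and `FakeResponse` and `fake_fetch` are stand-ins so the sketch runs without touching the network.

```python
from collections import namedtuple


def fetch_all(urls, fetch, pause):
    """Fetch every URL; return (successful responses by URL, failed URLs)."""
    ok, failed = {}, []
    for url in urls:
        response = fetch(url)
        if response.status_code == 200:
            ok[url] = response
        else:
            # remember what went wrong so the URL can be retried later
            failed.append((url, response.status_code))
        pause()  # respectful nap between requests
    return ok, failed


# Stand-ins for requests.get and its response, for demonstration only.
FakeResponse = namedtuple("FakeResponse", "status_code content")


def fake_fetch(url):
    status = 404 if url.endswith("/missing") else 200
    return FakeResponse(status, b"<html></html>")


ok_demo, failed_demo = fetch_all(
    ["https://example.org/a", "https://example.org/missing"],
    fetch=fake_fetch,
    pause=lambda: None,
)
print(list(ok_demo), failed_demo)
```

In the notebook itself the call could look like `fetch_all(urls, requests.get, lambda: sleep(randint(1, 2)))`.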


4. Find and store information about each president.


We extracted the 'infoboxes': now it's time to extract specific information from them. First test what we can get from a single president and then assemble a loop for all of them.

Here, we will use [the string argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) of the find function, since Wikipedia's tags and classes are not always helpful for locating elements. The string argument allows us to locate an element by its actual text content.
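
A minimal sketch of the string argument on a hand-written table. The HTML below is invented and only imitates the shape of an infobox, and `linked_value` is a hypothetical helper: it looks up a header cell by its text, moves up to the row, and reads the first link in that row.

```python
from bs4 import BeautifulSoup

# Invented stand-in for an infobox (not real Wikipedia markup).
infobox_html = """
<table class="infobox vcard">
  <tr><th>Political party</th><td><a href="/wiki/Example_Party">Example Party</a></td></tr>
  <tr><th>Born</th><td><span class="bday">1800-01-01</span></td></tr>
</table>
"""
infobox = BeautifulSoup(infobox_html, "html.parser")


def linked_value(box, label):
    """Return the text of the first link in the row whose header reads `label`."""
    header = box.find("th", string=label)
    if header is None:
        return None  # the box has no row with that label
    link = header.parent.find("a")
    return link.get_text() if link else None


print(linked_value(infobox, "Political party"))
print(linked_value(infobox, "Children"))
```

Returning `None` for a missing row avoids the error that chaining `.parent` onto a failed `find` would raise.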

In [85]:
#Birthday
pres_soups[40].find('span', {'class':'bday'}).get_text()
#Political party
pres_soups[12].find('th', string = 'Political party').parent.find('a').get_text()
#Number of sons/daughters
pres_soups[12].find('th', string = 'Children').parent.find_all('li')
# collect with a loop 
name=[]
dob=[]
party=[]
children=[]

for presi in pres_soups:
    name.append(presi.find("div",{"class":"fn"}).get_text())
    dob.append(presi.find("span",{"class":"bday"}).get_text())
    party.append(presi.find("th",string="Political party").parent.find("a").get_text())
    try:
        children.append(len(presi.find("th",string="Children").parent.find_all("li")))
    except AttributeError:
        # no "Children" row in the infobox: find() returned None
        children.append(0)

5. Organize the information in a dataframe where we have each president as a row and each variable we collected as a column.

In [88]:
pres_df = pd.DataFrame({'name':name, 'dob':dob, 'party':party, 'children':children})

In [89]:
pres_df.head()

|   | name | dob | party | children |
|---|------|-----|-------|----------|
| 0 | George Washington | 1732-02-22 | Independent | 0 |
| 1 | John Adams | 1735-10-30 | Pro-Administration | 0 |
| 2 | Thomas Jefferson | 1743-04-13 | Democratic-Republican | 6 |
| 3 | James Madison | 1751-03-16 | Democratic-Republican | 0 |
| 4 | James Monroe | 1758-04-28 | Democratic-Republican | 0 |
