# Web Scraping multiple pages 

We have practiced web scraping when all the information we wanted was on a single table of a site. What happens when we want to scrape information from multiple pages?

## First example - IMDB 

Go to https://www.imdb.com/search/title/ and enter the following parameters, leaving all other fields blank or at their default values:

- Title Type: Feature film

- Release date: From 1990 to 1992

- User Rating: 7.5 to "-"

The page you get should look familiar: a list of movies, each with its title, release year, crew, etc. You could inspect the page and build the code to collect the data.

Note that the query returns hundreds of movies, but each page shows only 50 of them (you can change the settings to show up to 250 movies per page, but that still won't be the complete list).

One way to automate multi-page web scraping is to look at the URLs. This search produces the following URL:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,

Now scroll down and click on "Next". The URL becomes:

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt

Can you see the pattern?

Our search options are stored in the parameters title_type, release_date and user_rating. Then we have the start parameter, which increases in steps of 50, and the ref_ parameter, which takes the value "adv_nxt".
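As a minimal sketch of this pattern, the same URL can be assembled from a dictionary of parameters with urlencode from Python's standard library instead of by hand. The helper name build_search_url is an illustrative assumption, and safe="," is passed so that the commas in the date range and the rating stay unescaped, as in the URLs shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(start):
    """Return the search URL for the results page that begins at movie number `start`."""
    params = {
        "title_type": "feature",
        "release_date": "1990-01-01,1992-12-31",
        "user_rating": "7.5,",
        "start": start,
        "ref_": "adv_nxt",
    }
    # safe="," keeps the commas literal instead of encoding them as %2C
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_search_url(51))
```

Changing only the start argument then gives the URL of any other results page.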

In [26]:
#  import libraries
from bs4 import BeautifulSoup
import requests

In [27]:
#  url: this time, start with the 'second' page
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt"

In [28]:
# download html with a request, check response code 
response=requests.get(url)
response.status_code

200

In [29]:
#  parse html (create the 'soup')
soup=BeautifulSoup(response.content, "html.parser")

# check that the html code looks as expected 


Now we need a sequence of start values that increases in steps of 50, up to the total number of movies we want to scrape.

In [30]:
# define iterations 
iterations = range(1,537,50)

In [31]:
# check the iterations work
iterations

range(1, 537, 50)
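Displaying a range object only echoes its definition. A quick way to see the actual start values is to convert it to a list, as in this small sketch (the variable name starts is just for illustration):

```python
# Same range as above: start at 1, stop before 537, step by 50
iterations = range(1, 537, 50)

# Converting to a list shows every value the loop will visit
starts = list(iterations)
print(starts)       # [1, 51, 101, 151, 201, 251, 301, 351, 401, 451, 501]
print(len(starts))  # 11 pages in total
```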

In [32]:
# create the url string for the page search, populate with the iterations
for i in iterations:
    start_at =str(i)
    url="https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=" + start_at + "&ref_=adv_nxt"
    print(url)

https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=1&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=151&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=201&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=251&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=301&ref_=adv_nxt
https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating

In [33]:
# test the urls 


### Respectful scraping:

Before starting with the actual scraping, though, there's something we need to note when sending automated requests to websites: it's good practice to let a few seconds pass in between requests. 

Some pages don't like being scraped and will block your IP if they detect you are sending automated requests. Others might have a small server for the traffic they handle, and sending too many requests might crash the site.

The sleep function from the time module will help us with that.

In [34]:
from time import sleep

#simple example 
for i in range(5):
    print(i)
    sleep(3)



0
1
2
3
4


In [35]:
# To make it more "human", we can randomize the waiting time:
from random import randint



In [36]:
for i in range(5):
    print(i)
    wait_time=randint(1,4)
    print("i will sleep for..." +str(wait_time) +" seconds now")
    sleep(wait_time)

0
i will sleep for...1 seconds now
1
i will sleep for...2 seconds now
2
i will sleep for...4 seconds now
3
i will sleep for...4 seconds now
4
i will sleep for...3 seconds now


### Assembling the script to send and store multiple requests

Ingredients for our multi-page scraper:

- iterations
- url list with iterations
- sleepy time + random gaps (to look human)

In [37]:
pages = []

for i in iterations:
    # assemble the url
    start_at = str(i)
    url = "https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=" + start_at + "&ref_=adv_nxt"
    # download html with a get request
    response = requests.get(url)
    # monitor the status code of each page
    print("status=" + str(response.status_code))
    # store the response in a list
    pages.append(response)
    # respectful nap time
    wait_time = randint(1, 4)
    print("i will sleep for..." + str(wait_time) + " seconds now")
    sleep(wait_time)


status=200
i will sleep for...4 seconds now
status=200
i will sleep for...2 seconds now
status=200
i will sleep for...3 seconds now
status=200
i will sleep for...4 seconds now
status=200
i will sleep for...2 seconds now
status=200
i will sleep for...4 seconds now
status=200
i will sleep for...4 seconds now
status=200
i will sleep for...1 seconds now
status=200
i will sleep for...4 seconds now
status=200
i will sleep for...3 seconds now
status=200
i will sleep for...4 seconds now
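Every request above came back with status 200, but that is not guaranteed. The following is a hedged sketch of one way to keep track of failed pages. The names fetch_all, FakeResponse and fake_fetch and the example.com URLs are hypothetical and only there so the block runs offline. In the notebook you would pass requests.get as the fetch argument and leave pause at its default.

```python
from random import randint
from time import sleep


def fetch_all(urls, fetch, min_wait=1, max_wait=4, pause=sleep):
    """Call fetch(url) for each URL, keep responses with status 200, record the rest."""
    ok_pages = []
    failed = []
    for url in urls:
        response = fetch(url)
        if response.status_code == 200:
            ok_pages.append(response)
        else:
            failed.append((url, response.status_code))
        # respectful nap between requests
        pause(randint(min_wait, max_wait))
    return ok_pages, failed


class FakeResponse:
    """Stand-in for a requests response, used only in this offline demo."""

    def __init__(self, status_code):
        self.status_code = status_code


def fake_fetch(url):
    # Pretend the second page is missing
    return FakeResponse(404 if url.endswith("start=51") else 200)


demo_urls = ["https://example.com/?start=1", "https://example.com/?start=51"]
# pause is replaced by a no-op so the demo does not actually wait
demo_pages, demo_failed = fetch_all(demo_urls, fake_fetch, pause=lambda seconds: None)
print(len(demo_pages), demo_failed)
```

Passing the fetch function as an argument keeps the loop itself free of network calls, which makes it easy to try out before sending real requests.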


In [38]:
BeautifulSoup(pages[0].content,"html.parser")


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Feature Film,
Released between 1990-01-01 and 1992-12-31,
User Rating at least 7.5
(Sorted by Popularity Ascending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
 

Note: if you print the object pages after running the code above, you'll just see the response codes (for example <Response [200]>), but the HTML is still accessible and you can parse it the same way as before.

### Build code to collect the relevant information from the Request 

This is what we need:

##### Parse just the first page, for testing purposes
- soup=BeautifulSoup(pages[0].content, "html.parser")

##### title and synopsis

- soup.select("div.lister-item-content > h3 > a")
- soup.select("div.lister-item-content > p:nth-child(4)")

#### titles

In [39]:
# Parse just the first page, for testing purposes

soup=BeautifulSoup(pages[0].content,"html.parser")

# Paste the Selector from the first movie title copied from Chrome Dev Tools

soup.select("#main > div > div.lister.list.detail.sub-list > div > div:nth-child(1) > div.lister-item-content > h3 > a")

# Trim the selection
soup.select("h3 > a")

[<a href="/title/tt0103064/">Terminator 2: El juicio final</a>,
 <a href="/title/tt0099685/">Uno de los nuestros</a>,
 <a href="/title/tt0099674/">El padrino: Parte III</a>,
 <a href="/title/tt0105236/">Reservoir Dogs</a>,
 <a href="/title/tt0102926/">El silencio de los corderos</a>,
 <a href="/title/tt0104257/">Algunos hombres buenos</a>,
 <a href="/title/tt0104691/">El último mohicano</a>,
 <a href="/title/tt0100802/">Desafío total</a>,
 <a href="/title/tt0101507/">Los chicos del barrio</a>,
 <a href="/title/tt0105695/">Sin perdón</a>,
 <a href="/title/tt0099785/">Solo en casa</a>,
 <a href="/title/tt0104952/">Mi primo Vinny</a>,
 <a href="/title/tt0099348/">Bailando con lobos</a>,
 <a href="/title/tt0103074/">Thelma &amp; Louise</a>,
 <a href="/title/tt0105323/">Esencia de mujer</a>,
 <a href="/title/tt0099810/">La caza del Octubre Rojo</a>,
 <a href="/title/tt0099487/">Eduardo Manostijeras</a>,
 <a href="/title/tt0103639/">Aladdín</a>,
 <a href="/title/tt0101414/">La bella y la bes

#### synopsis

In [40]:
# Selector for the synopsis, trimmed from the one copied from Chrome Dev Tools
soup.select("p:nth-child(4)")

[<p class="text-muted">
 A cyborg, identical to the one who failed to kill Sarah Connor, must now protect her ten year old son, John Connor, from a more advanced and powerful cyborg.</p>,
 <p class="text-muted">
 The story of <a href="/name/nm1453737">Henry Hill</a> and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.</p>,
 <p class="text-muted">
 Follows Michael Corleone, now in his 60s, as he seeks to free his family from crime and find a suitable successor to his empire.</p>,
 <p class="text-muted">
 When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.</p>,
 <p class="text-muted">
 A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.</p>,
 <p class="text-muted">
 Military lawyer Lieute

In [41]:
# Trim the selection


### combine all the code 

There are many approaches to do this. The one we'll follow is: 

- Loop through the pages we collected, parse them ("create the soup") and store the parsed pages in a list. 

- For each parsed page, select the "blocks of HTML elements" that contain all the information of each movie (the title, the synopsis and other stuff). 

- For each one of the "blocks" we collected in the previous step: 

    - Get the movie titles and store them in a list 

    - Get the synopsis and store them in a list

In [42]:
titles =[]
synopsis=[]
pages_parsed=[]

for i in range(len(pages)):
    pages_parsed.append(BeautifulSoup(pages[i].content,"html.parser"))
    movies_html=pages_parsed[i].select("div.lister-item-content")
    #for each movie, store title and synopsis into the lists
    for j in range(len(movies_html)):
        titles.append(movies_html[j].select("h3 > a")[0].get_text())
        synopsis.append(movies_html[j].select("p:nth-child(4)")[0].get_text())

# check output
print(len(titles))
print(len(synopsis))


537
537
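The block-by-block approach can be tried offline on a small hand-written snippet. The HTML below only imitates the structure used in this section (it is not real IMDb output, and the titles and ids are made up). The sketch also shows one possible way to guard against a movie block that has no synopsis, using select_one, which returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the page structure; the second block has no synopsis
html = """
<div class="lister-item-content">
  <h3><a href="/title/tt0000001/">First Movie</a></h3>
  <p class="text-muted">PG | 100 min</p>
  <div class="ratings-bar">8.0</div>
  <p class="text-muted">
  A short synopsis.</p>
</div>
<div class="lister-item-content">
  <h3><a href="/title/tt0000002/">Second Movie</a></h3>
  <p class="text-muted">R | 90 min</p>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")

demo_titles = []
demo_synopsis = []

# Loop over the movie blocks directly instead of over their indexes
for block in demo_soup.select("div.lister-item-content"):
    title_tag = block.select_one("h3 > a")
    synopsis_tag = block.select_one("p:nth-child(4)")
    # get_text(strip=True) also removes the leading newline of the synopsis
    demo_titles.append(title_tag.get_text(strip=True) if title_tag else None)
    demo_synopsis.append(synopsis_tag.get_text(strip=True) if synopsis_tag else None)

print(demo_titles)
print(demo_synopsis)
```

Storing None for a missing element keeps both lists the same length, so they can still be combined row by row later.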


In [43]:
len(pages)

11

In [44]:
titles[0:15]

['Terminator 2: El juicio final',
 'Uno de los nuestros',
 'El padrino: Parte III',
 'Reservoir Dogs',
 'El silencio de los corderos',
 'Algunos hombres buenos',
 'El último mohicano',
 'Desafío total',
 'Los chicos del barrio',
 'Sin perdón',
 'Solo en casa',
 'Mi primo Vinny',
 'Bailando con lobos',
 'Thelma & Louise',
 'Esencia de mujer']

In [45]:
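# note: the slice 5:5 is empty, so this returns []; use synopsis[0:5] to preview the first five entries
synopsis[5:5]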

[]

In [46]:
# check the output and identify any wrangling steps we missed 

In [47]:
# strip the \n from the synopsis
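A minimal sketch of that wrangling step, assuming the synopsis strings start with a newline as in the HTML shown earlier. The two strings below are shortened, made-up examples rather than the scraped data.

```python
# Made-up examples of raw synopsis strings with leading newlines and stray spaces
raw_synopsis = ["\nA cyborg must protect a boy.", "\n    Follows Michael Corleone.  "]

# str.strip() removes whitespace (including \n) from both ends of each string
clean_synopsis = [text.strip() for text in raw_synopsis]
print(clean_synopsis)
```

With the real data, the same list comprehension would be applied to the synopsis list.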

-----------

## 2nd example - Scraping presidents

Our objective is to create a dataframe with information about the presidents of the United States. To do this, we will go through 5 steps:

1. Scrape this [list of presidents of the United States](https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States).


In [48]:
# 1. import libraries
import pandas as pd

# 2. find url and store it in a variable
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"

# 3. download html with a get request
response = requests.get(url)
response.status_code


# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")

# 4.2. check that the html code looks like it should
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of presidents of the United States - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"ff652a79-4b87-4282-9b32-da99fad7f61b","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_presidents_of_the_United_States","wgTitle":"List of presidents of the United States","wgCurRevisionId":1030362347,"wgRevisionId":1030362347,"wgArticleId":19908980,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia semi-protected pages","Articles with short description","Short description is

In [49]:
# copy selector: #mw-content-text > div.mw-parser-output > table.wikitable >
# nixon:
# tbody > tr:nth-child(66) > td:nth-child(4) > b > a
# george bush:
# tbody > tr:nth-child(66) > td:nth-child(4) > b > a

2. Collect all the links to the Wikipedia page of each president.


In [50]:
presidents = []

for i in range(84):
    presidents = presidents + soup.select("tbody > tr:nth-child(" +str(i)+") > td:nth-child(4) > b > a")
    
presidents

[<a href="/wiki/George_Washington" title="George Washington">George Washington</a>,
 <a href="/wiki/John_Adams" title="John Adams">John Adams</a>,
 <a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>,
 <a href="/wiki/James_Madison" title="James Madison">James Madison</a>,
 <a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>,
 <a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a>,
 <a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a>,
 <a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>,
 <a href="/wiki/William_Henry_Harrison" title="William Henry Harrison">William Henry Harrison</a>,
 <a href="/wiki/John_Tyler" title="John Tyler">John Tyler</a>,
 <a href="/wiki/James_K._Polk" title="James K. Polk">James K. Polk</a>,
 <a href="/wiki/Zachary_Taylor" title="Zachary Taylor">Zachary Taylor</a>,
 <a href="/wiki/Millard_Fillmore" title="Millard Fillmore">Millard Fillmore</a>,
 

In [2]:
# we can access the links searching for the attribute "href"
# in each element
presidents[45]["href"]



In [52]:
# Now, we just assemble a new request to the link
url = "https://en.wikipedia.org" + presidents[0]["href"]

# send request
response = requests.get(url)
response.status_code

# parse & store html
soup = BeautifulSoup(response.content, "html.parser")
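The link is assembled above by string concatenation, which works because every href in this table starts with /wiki/. As a slightly more general alternative (a sketch, not what this notebook itself uses), urljoin from the standard library resolves a relative link against a base URL.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org"

# hrefs as they appear in the list of presidents collected above
hrefs = ["/wiki/George_Washington", "/wiki/John_Adams"]

# urljoin combines the base with each relative link
full_urls = [urljoin(base, href) for href in hrefs]
print(full_urls)
```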

In [53]:
soup.find("table", {"class":"infobox vcard"})

<table class="infobox vcard"><tbody><tr><th class="infobox-above" colspan="2" style="font-size: 100%;"><div class="fn" style="display:inline-block; font-size:125%;">George Washington</div></th></tr><tr><td class="infobox-image" colspan="2"><a class="image" href="/wiki/File:Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" title="Head and shoulders portrait of George Washington"><img alt="Head and shoulders portrait of George Washington" data-file-height="5615" data-file-width="4626" decoding="async" height="267" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/220px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/330px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portr

3. Scrape the Wikipedia page of each president.


In this step we could very well store the whole Wikipedia page for each president, or just the tiny, final pieces of information. Storing the infoboxes is a middle ground (we don't have too much noise but retain the flexibility of deciding later which specific elements to extract).

When sending multiple requests, remember to be respectful by spacing the requests a few seconds from each other. We will also print the status code of each response to monitor that everything is going well:

In [55]:
presi_soups=[]

for presi in presidents:
    # send request
    url="https://en.wikipedia.org"+presi["href"]
    response=requests.get(url)
    print(presi.get_text(), response.status_code)
    # parse & store html
    soup=BeautifulSoup(response.content, "html.parser")
    presi_soups.append(soup.find("table", {"class":"infobox vcard"}))
    # respectful nap:
    wait_time=randint(1,2)
    print("I will sleep now for..."+str(wait_time)+"secs")
    sleep(wait_time)

George Washington 200
I will sleep now for...1secs
John Adams 200
I will sleep now for...1secs
Thomas Jefferson 200
I will sleep now for...2secs
James Madison 200
I will sleep now for...2secs
James Monroe 200
I will sleep now for...1secs
John Quincy Adams 200
I will sleep now for...1secs
Andrew Jackson 200
I will sleep now for...2secs
Martin Van Buren 200
I will sleep now for...2secs
William Henry Harrison 200
I will sleep now for...1secs
John Tyler 200
I will sleep now for...2secs
James K. Polk 200
I will sleep now for...2secs
Zachary Taylor 200
I will sleep now for...1secs
Millard Fillmore 200
I will sleep now for...1secs
Franklin Pierce 200
I will sleep now for...2secs
James Buchanan 200
I will sleep now for...1secs
Abraham Lincoln 200
I will sleep now for...1secs
Andrew Johnson 200
I will sleep now for...1secs
Ulysses S. Grant 200
I will sleep now for...1secs
Rutherford B. Hayes 200
I will sleep now for...2secs
James A. Garfield 200
I will sleep now for...1secs
Chester A. Arthur 20

4. Find and store information about each president.


We extracted the 'infoboxes': now it's time to extract specific information from them. First test what we can get from a single president, and then assemble a loop for all of them.

Here, we will use [the string argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) in the find function, since Wikipedia's tags and classes are not always helpful for locating elements. The string argument allows us to locate elements by their actual content.
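To see how the string argument behaves, here is a small offline sketch on a hand-written table that imitates an infobox. The party name, link and children are invented for illustration. find returns None when no element has the requested text, which is why the last lookup needs a guard.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for an infobox (not real Wikipedia content)
html = """
<table class="infobox vcard">
  <tr><th>Political party</th><td><a href="/wiki/Example_Party">Example Party</a></td></tr>
  <tr><th>Children</th><td><ul><li>Ann</li><li>Bob</li></ul></td></tr>
</table>
"""

box = BeautifulSoup(html, "html.parser")

# Locate the header cell by its text, then move up to the row and read the link
party_row = box.find("th", string="Political party").parent
party_name = party_row.find("a").get_text()

# Count the list items in the "Children" row, if that row exists
children_header = box.find("th", string="Children")
n_children = len(children_header.parent.find_all("li")) if children_header else 0

# A header that is not in the table gives None instead of raising an error
spouse_header = box.find("th", string="Spouse")

print(party_name, n_children, spouse_header)
```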

In [58]:
#Birthday
presi_soups[45].find("span",{"class":"bday"}).get_text()

#Political party
presi_soups[45].find("th",string="Political party").parent.find("a").get_text()

#Number of sons/daughters
len(presi_soups[45].find("th",string="Children").parent.find_all("li"))


4

In [65]:
# collect with a loop - presidents name, their birthday , political party, no of children 
name=[]
dob=[]
party=[]
children=[]

for presi in presi_soups:
    name.append(presi.find("div",{"class":"fn"}).get_text())
    dob.append(presi.find("span",{"class":"bday"}).get_text())
    party.append(presi.find("th",string="Political party").parent.find("a").get_text())
    try:
        children.append(len(presi.find("th",string="Children").parent.find_all("li")))
    except AttributeError:
        # find() returned None: this infobox has no "Children" row
        children.append(0)

5. Organize the information in a dataframe where we have each president as a row and each variable we collected as a column.

In [70]:
presidents_data = pd.DataFrame({"name":name, "birthday":dob, "party":party, "noofchild":children})
presidents_data

| | name | birthday | party | noofchild |
|---|---|---|---|---|
| 0 | George Washington | 1732-02-22 | Independent | 0 |
| 1 | John Adams | 1735-10-30 | Pro-Administration | 0 |
| 2 | Thomas Jefferson | 1743-04-13 | Democratic-Republican | 6 |
| 3 | James Madison | 1751-03-16 | Democratic-Republican | 0 |
| 4 | James Monroe | 1758-04-28 | Democratic-Republican | 0 |
| 5 | John Quincy Adams | 1767-07-11 | Federalist | 4 |
| 6 | Andrew Jackson | 1767-03-15 | Democratic-Republican | 0 |
| 7 | Martin Van Buren | 1782-12-05 | Democratic-Republican | 0 |
| 8 | William Henry Harrison | 1773-02-09 | Democratic-Republican | 0 |
| 9 | John Tyler | 1790-03-29 | Independent | 0 |
