# Scraping data for over 2000 movies
It's essential to identify the goal of our scraping right from the beginning. Writing a scraping script can take a lot of time, especially if we want to scrape more than one web page. We want to avoid spending hours writing a script which scrapes data we won't actually need.


## Working out which pages to scrape
Once we've established our goal, we then need to identify an efficient set of pages to scrape.

We want to find **a combination of pages** that requires a relatively small number of requests. A request is what happens whenever we access a web page. We 'request' the content of a page from the server. The more requests we make, the longer our script will need to run, and the greater the strain on the server.

One way to get all the data we need is to **compile a list of movie names**, and use it to access the web page of each movie on both the IMDB and Metacritic websites.

Since we want to get over 2000 ratings from both IMDB and Metacritic, we'll have to make at least 4000 requests. If we make one request per second, our script will need a little over an hour to make 4000 requests. Because of this, it's worth trying to identify more efficient ways of obtaining our data.

If we explore the IMDB website, we can discover a way to halve the number of requests. Metacritic scores are shown on the IMDB movie page, so we can scrape both ratings with a single request.

If we investigate the IMDB site further, we can discover the page shown below. It contains all the data we need for 50 movies. Given our aim, this means we'll only have to make about 40 requests (2000 movies at 50 per page), which is 100 times fewer than our first option. Let's explore this last option further.
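As a quick check of the arithmetic above, the sketch below compares the three options. The one-request-per-second pace is the same simplifying assumption the text uses, and the dictionary labels are only illustrative names.

```python
import math

MOVIES = 2000              # target number of movies from the text
MOVIES_PER_PAGE = 50       # movies listed on one IMDB search page
SECONDS_PER_REQUEST = 1    # assumed pace, as in the text

# Number of requests each strategy would need
requests_needed = {
    "movie page on both IMDB and Metacritic": MOVIES * 2,
    "movie page on IMDB only": MOVIES,
    "IMDB search pages (50 movies each)": math.ceil(MOVIES / MOVIES_PER_PAGE),
}

for strategy, count in requests_needed.items():
    minutes = count * SECONDS_PER_REQUEST / 60
    print(f"{strategy}: {count} requests, about {minutes:.1f} minutes")
```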

## Identifying the URL structure
In the following code cell we will:

- Import the get() function from the requests module.
- Assign the address of the web page to a variable named url.
- Request the content of the web page from the server by using get(), and store the server’s response in the variable response.
- Print a small part of response's content by accessing its .text attribute (response is now a Response object).

In [5]:
import requests
from requests import get

In [2]:
# page = requests.get("https://www.imdb.com/search/title?release_date=2017-01-01,2017-12-31&sort=num_votes,desc&page=1&ref_=adv_prv")

In [3]:
# page

<Response [200]>

In [6]:
# page.content
url = "https://www.imdb.com/search/title?release_date=2017-01-01,2017-12-31&sort=num_votes,desc&page=1&ref_=adv_prv"

In [7]:
response = get(url)
print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"
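The URL assigned to url above carries all of its filters in the query string. As a small sketch (using only the standard library's urllib.parse, which the original notebook does not use), we can split it into its parameters to see which parts we might later vary:

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://www.imdb.com/search/title?release_date=2017-01-01,2017-12-31"
       "&sort=num_votes,desc&page=1&ref_=adv_prv")

parts = urlsplit(url)           # scheme, host, path, query, fragment
params = parse_qs(parts.query)  # maps each parameter name to a list of values

print(parts.path)
for key, values in params.items():
    print(key, "=", values[0])
```

The release_date and page parameters are the ones that change from one results page to another.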


## Using BeautifulSoup to parse the HTML content
In the following code cell we will:

- Import the BeautifulSoup class from the bs4 package.
- Parse the response body by creating a BeautifulSoup object from response.content, and assign this object to html_soup. The 'html.parser' argument indicates that we want to do the parsing using Python’s built-in HTML parser.

In [10]:
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.content, 'html.parser')

Before extracting the 50 div containers, we need to figure out what distinguishes them from other div elements on that page. Often, the distinctive mark resides in the **class attribute**. If you inspect the HTML lines of the containers of interest, you'll notice that the class attribute has two values: lister-item and mode-advanced. This combination is unique to these div containers. We can see that's true by doing a quick search (Ctrl + F). We have 50 such containers, so we expect to see only 50 matches:



In [11]:
movie_containers = html_soup.find_all('div',class_="lister-item mode-advanced")

In [12]:
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50
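How class_ matching behaves is easier to see on a toy example. The HTML below is made up for illustration (it is not IMDB's markup). It suggests that passing a two-class string to class_ matches the exact attribute string, while a CSS selector passed to select() matches any element carrying both classes, whatever their order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for demonstrating the matching rules
html = """
<div class="lister-item mode-advanced">A</div>
<div class="mode-advanced lister-item">B</div>
<div class="lister-item mode-simple">C</div>
<div class="lister-item mode-advanced extra">D</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match on the full class string
exact = [div.text for div in soup.find_all("div", class_="lister-item mode-advanced")]

# CSS selector: both classes present, in any order, extra classes allowed
both = [div.text for div in soup.select("div.lister-item.mode-advanced")]

print(exact)
print(both)
```

On the IMDB page used here the two approaches should select the same 50 containers, since the text notes the class combination is unique to them.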


Now we'll select only the first container, and extract, by turn, each item of interest:

- The name of the movie.
- The year of release.
- The IMDB rating.
- The Metascore.
- The number of votes.

In [14]:
# movie_containers

### Extracting the data for a single movie

In [16]:
first_movie = movie_containers[0]
# first_movie
first_movie.div

<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt3315342"></div>
</div>

In [17]:
first_movie.a

<a href="/title/tt3315342/?ref_=adv_li_i"> <img alt="Logan" class="loadlate" data-tconst="tt3315342" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzc5MTU4N2EtYTkyMi00NjdhLTg3NWEtMTY4OTEyMzJhZTAzXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB499613450_.png" width="67"/>
</a>

In [18]:
first_movie.h3

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt3315342/?ref_=adv_li_tt">Logan</a>
<span class="lister-item-year text-muted unbold">(2017)</span>
</h3>

In [20]:
first_movie.h3.a

<a href="/title/tt3315342/?ref_=adv_li_tt">Logan</a>

In [21]:
# now it is all a matter of accessing the text from 
# within that <a> tag:
first_name = first_movie.h3.a.text
first_name

'Logan'

## The year of the movie's release

In [33]:
# This data is stored within the <span> tag
first_movie.span
# it returns the wrong line because Dot notation 
# will only access the first span element.

<span class="lister-item-index unbold text-primary">1.</span>

In [29]:
first_year = first_movie.h3.find('span',class_="lister-item-year text-muted unbold")

In [30]:
first_year

<span class="lister-item-year text-muted unbold">(2017)</span>

In [32]:
first_year = first_year.text
first_year

'(2017)'
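The year is still a string wrapped in parentheses, and the sample output further down contains at least one value with an extra prefix, '(I) (2017)'. One possible cleanup, which is not part of the original notebook, is to pull out the four-digit year with a regular expression. The helper name parse_year is hypothetical.

```python
import re

def parse_year(text):
    """Return the four-digit year found inside parentheses, or None."""
    match = re.search(r"\((\d{4})\)", text)
    return int(match.group(1)) if match else None

print(parse_year("(2017)"))
print(parse_year("(I) (2017)"))
print(parse_year("no year here"))
```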

## IMDB rating

In [34]:
first_movie.strong

<strong>8.1</strong>

In [35]:
# Great! We'll access the text, 
# convert it to the float type, 
#and assign it to the variable first_imdb:
first_imdb = float(first_movie.strong.text)
first_imdb

8.1

## Metascore 

In [52]:
first_mscore = first_movie.find('span',class_="metascore favorable")
# Note that if you copy-paste those values 
# from DevTools' tab, 
# there will be two white space characters 
# between metascore and favorable. 
# Make sure there will be only one whitespace 
# character when you pass the values as arguments 
# to the class_ parameter. Otherwise, find() 
# won't find anything.

In [53]:
first_mscore = int(first_mscore.text)
print(first_mscore)

77


## Number of votes

In [59]:
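first_vote = first_movie.find('span',name_="nv")
# This does not work: name_ is not a special keyword like class_.
# BeautifulSoup treats it as a filter on an HTML attribute literally
# called "name_", so find() returns None. Passing name="nv" fails too,
# because name is the parameter that already receives the tag name ('span').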

In [64]:
first_vote = first_movie.find('span',attrs={'name':'nv'})

In [67]:
int(first_vote.text)
# int() fails with "invalid literal" because of the comma in '498,637'

ValueError: invalid literal for int() with base 10: '498,637'

We could use .text to access the <span> tag's content. It would be better, though, to read the value of the **data-value attribute**. That way we can convert the extracted data point to an int without having to strip a comma.

**You can treat a Tag object just like a dictionary.** The HTML attributes are the dictionary's keys, and the values of the HTML attributes are the dictionary's values. This is how we can access the value of the data-value attribute:

In [68]:
first_vote['data-value']

'498637'
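The dictionary-like behaviour of a Tag can be tried on a single element. The snippet below is a minimal sketch on a hand-written <span> that imitates the votes element shown above; it is not fetched from IMDB.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the votes <span>
snippet = '<span name="nv" data-value="498637">498,637</span>'
tag = BeautifulSoup(snippet, "html.parser").span

print(tag.attrs)           # all HTML attributes as a dictionary
print(tag["data-value"])   # dictionary-style access; raises KeyError if absent
print(tag.get("missing"))  # .get() returns None instead of raising
print(tag.name, tag["name"])  # tag name vs. the HTML attribute called "name"
```

Note the difference between tag.name (the tag's own name, 'span') and tag["name"] (the HTML attribute, 'nv'), which is why the attrs argument was needed in the find() call above.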

In [69]:
# Let's convert that value to an integer, and assign it to first_votes:
first_votes = int(first_vote['data-value'])

# The script for a single page

In [75]:
# The 23rd movie does not have a Metascore
twenty_third_movie_mscore = movie_containers[22].find('div', class_ = 'ratings-metascore')
type(twenty_third_movie_mscore)

NoneType

In the next code block we:

- Declare some list variables to have something to store the extracted data in.
- Loop through each container in movie_containers (the variable which contains all the 50 movie containers).
- Extract the data points of interest only if the container has a Metascore.

In [79]:
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

for container in movie_containers:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        name = container.h3.a.text
        # collect the names of the movies that have a Metascore in the list names
        names.append(name)
        
        year = container.h3.find('span',class_="lister-item-year text-muted unbold").text
        years.append(year)
        
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        
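        # Query container (not first_movie); otherwise Logan's Metascore (77)
        # is repeated for every movie, as in the sample output below.
        # Matching on 'metascore' alone also covers scores whose second
        # class is not 'favorable'.
        meta = int(container.find('span',class_="metascore").text)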
        metascores.append(meta)
        
        vote = int(container.find('span',attrs={'name':'nv'})['data-value'])
        votes.append(vote)
        

Pandas makes it easy for us to see whether we've scraped our data successfully.

In [80]:
import pandas as pd
test_df = pd.DataFrame({'movie':names,
                       'year':years,
                       'imdb':imdb_ratings,
                       'metascores':metascores,
                       'vote':votes})

print(test_df.info())
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 5 columns):
imdb          47 non-null float64
metascores    47 non-null int64
movie         47 non-null object
vote          47 non-null int64
year          47 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.9+ KB
None


| | imdb | metascores | movie | vote | year |
|---|---|---|---|---|---|
| 0 | 8.1 | 77 | Logan | 498637 | (2017) |
| 1 | 7.5 | 77 | Wonder Woman | 428133 | (2017) |
| 2 | 8.0 | 77 | Dunkirk | 408325 | (2017) |
| 3 | 7.3 | 77 | Star Wars: Episode VIII - The Last Jedi | 403952 | (2017) |
| 4 | 7.7 | 77 | Guardians of the Galaxy Vol. 2 | 399168 | (2017) |
| 5 | 7.9 | 77 | Thor: Ragnarok | 368439 | (2017) |
| 6 | 7.5 | 77 | Spider-Man: Homecoming | 345751 | (2017) |
| 7 | 7.7 | 77 | Get Out | 322071 | (I) (2017) |
| 8 | 8.1 | 77 | Blade Runner 2049 | 318579 | (2017) |
| 9 | 7.7 | 77 | Baby Driver | 314816 | (2017) |


In [81]:
# Ask the server for English-language content; passed to get() in the multi-page script
headers = {"Accept-Language":"en-US,en;q=0.5"}

# The script for multiple pages

We'll build upon our one-page script by doing three more things:

- Making all the requests we want from within the loop.
- Controlling the loop's rate to avoid bombarding the server with requests.
- Monitoring the loop while it runs.

We'll scrape the first 4 pages of each year in the interval 2000-2017. 4 pages for each of the 18 years makes for a total of 72 pages. Each page has 50 movies, so we'll scrape data for 3600 movies at most. But not all the movies have a Metascore, so the number will be lower than that. Even so, we are still very likely to get data for over 2000 movies.
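The page and movie counts above, plus a rough running time, can be worked out in a few lines. The 8 to 15 second pause is the one used in the final script later in this notebook; the estimate ignores the time the requests themselves take, so treat it as a lower-bound sketch rather than a measurement.

```python
YEARS = range(2000, 2018)   # 2000 through 2017
PAGES_PER_YEAR = 4
MOVIES_PER_PAGE = 50
PAUSE_LOW, PAUSE_HIGH = 8, 15   # seconds, as passed to randint() later on

total_pages = len(YEARS) * PAGES_PER_YEAR
max_movies = total_pages * MOVIES_PER_PAGE

# randint(8, 15) picks each integer from 8 to 15 with equal probability,
# so the average pause is the midpoint of the interval.
mean_pause = (PAUSE_LOW + PAUSE_HIGH) / 2
expected_minutes = total_pages * mean_pause / 60

print(total_pages, "pages,", max_movies, "movies at most")
print(f"about {expected_minutes:.1f} minutes spent pausing on average")
```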

### Changing the URL's parameters

As we are making the requests, we'll have to vary the values of only two parameters of the URL: release_date and page. Let's prepare the values we'll need for the forthcoming loop. In the next code cell we will:

- Create a list called pages, and populate it with the strings corresponding to the first 4 pages.
- Create a list called years_url and populate it with the strings corresponding to the years 2000-2017.

In [82]:
pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2000,2018)]
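To see what these two lists produce, the sketch below builds every URL up front with itertools.product. The final script concatenates the same pieces inside nested loops instead; generating the list first is just one way to inspect the URLs before any request is made. The names BASE_URL and urls are illustrative.

```python
from itertools import product

pages = [str(i) for i in range(1, 5)]
years_url = [str(i) for i in range(2000, 2018)]

BASE_URL = "https://www.imdb.com/search/title"

# One URL per (year, page) combination, in the same order as nested loops
urls = [
    f"{BASE_URL}?release_date={year}&sort=num_votes,desc&page={page}"
    for year, page in product(years_url, pages)
]

print(len(urls))
print(urls[0])
print(urls[-1])
```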

### Controlling the crawl-rate

We'll control the loop's rate by using the sleep() function from Python's time module. sleep() will pause the execution of the loop for a specified number of seconds.

To mimic human behavior, we'll vary the waiting time between requests by using the randint() function from Python's random module. randint() randomly generates an integer within a specified interval, with both endpoints included.

In [83]:
from time import sleep
from random import randint

for i in range(0,5):
    print('Blah')
    sleep(randint(1,4))

Blah
Blah
Blah
Blah
Blah


### Monitoring the loop as it's still going

Given that we're scraping 72 pages, it would be nice if we could find a way to monitor the scraping process as it's still going. This feature is definitely optional, but it can be very helpful in the testing and debugging process. Also, the greater the number of pages, the more helpful the monitoring becomes. If you are going to scrape hundreds or thousands of web pages in a single code run, I would say that this feature becomes a must.

For our script, we'll make use of this feature, and monitor the following parameters:

- The frequency (speed) of requests, so we make sure our program is not overloading the server.
- The number of requests, so we can halt the loop in case the number of expected requests is exceeded.
- The status code of our requests, so we make sure the server is sending back the proper responses.

To get a frequency value we'll divide the number of requests by the time elapsed since the first request. This is similar to computing the speed of a car - we divide the distance by the time taken to cover that distance. Let's experiment with this monitoring technique at a small scale first. In the following code cell we will:

- Set a starting time using the time() function from the time module, and assign the value to start_time.
- Assign 0 to the variable requests which we'll use to count the number of requests.
- Start a loop, and then with each iteration:
  - Simulate a request.
  - Increment the number of requests by 1.
  - Pause the loop for a time interval between 1 and 3 seconds.
  - Calculate the elapsed time since the first request, and assign the value to elapsed_time.
  - Print the number of requests and the frequency.

In [84]:
from time import time
start_time = time()
requests = 0

for _ in range(5):
    requests += 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'
          .format(requests, requests/elapsed_time))

Request: 1; Frequency: 0.3327792439508427 requests/s
Request: 2; Frequency: 0.3329667313023041 requests/s
Request: 3; Frequency: 0.33292013694614664 requests/s
Request: 4; Frequency: 0.3632065287573407 requests/s
Request: 5; Frequency: 0.38409532053059553 requests/s
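The same bookkeeping can be wrapped in a small helper. This is a sketch rather than the notebook's own approach: the class name RequestMonitor and the injectable clock argument are assumptions made so the logic can be checked without waiting. The counter is called request_count because assigning to requests, as the cell above does, rebinds the name of the imported requests module (harmless here only because get() was imported directly).

```python
from time import time

class RequestMonitor:
    """Counts requests and reports the average rate since creation."""

    def __init__(self, clock=time):
        self._clock = clock          # function returning the current time in seconds
        self._start = clock()
        self.request_count = 0

    def record(self):
        """Register one request and return requests per second so far."""
        self.request_count += 1
        elapsed = self._clock() - self._start
        return self.request_count / elapsed if elapsed > 0 else float("inf")

# Demonstration with a fake clock: start at 0 s, then readings at 2 s and 5 s
ticks = iter([0.0, 2.0, 5.0])
monitor = RequestMonitor(clock=lambda: next(ticks))
print(monitor.record())  # 1 request in 2 seconds
print(monitor.record())  # 2 requests in 5 seconds
```

In real use the default clock (time) applies, and record() would be called right after each get().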


Since we're going to make 72 requests, the output will get untidy as it accumulates. To avoid that, we'll clear the output after each iteration and replace it with information about the most recent request. To do that we'll use the **clear_output()** function from IPython's display module. We'll set the wait parameter of clear_output() to True, which delays clearing the current output until new output is available to replace it.

In [85]:
from IPython.display import clear_output

start_time = time()
requests = 0

for _ in range(5):
    # A request would go here
    requests += 1
    sleep(randint(1,3))
    current_time = time()
    elapsed_time = current_time - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)

Request: 5; Frequency: 0.4532834384484193 requests/s


To monitor the status code we'll set the program to warn us if something is off. A successful request is indicated by a status code of 200. We'll use the warn() function from the warnings module to issue a warning if the status code is not 200.

In [86]:
from warnings import warn
warn("Warning Simulation")
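Applied to status codes, the check could look like the sketch below. The helper name check_status is hypothetical, and the status codes are passed in by hand so that no real request is needed; warnings.catch_warnings(record=True) is used only to capture the warning for inspection.

```python
import warnings

def check_status(status_code, request_number):
    """Warn about any non-200 status code; return True when the request succeeded."""
    if status_code != 200:
        warnings.warn(f"Request: {request_number}; Status code: {status_code}")
        return False
    return True

# Capture warnings instead of printing them, to inspect what was raised
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_status(200, 1)
    bad = check_status(503, 2)

print(ok, bad, len(caught))
print(caught[0].message)
```

Unlike an exception, a warning does not stop the loop, which is why it suits long scraping runs where one bad response should not discard the pages already collected.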

  


# Piecing everything together

In the following code cell, we start by:

- Redeclaring the list variables so they become empty again.
- Preparing the monitoring of the loop.

Then, we'll:

- Loop through the years_url list to vary the release_date parameter of the URL.
- For each element in years_url, loop through the pages list to vary the page parameter of the URL.
- Make the GET requests within the pages loop (and give the headers parameter the right value to make sure we get only English content).
- Pause the loop for a time interval between 8 and 15 seconds.
- Monitor each request as discussed before.
- Throw a warning for non-200 status codes.
- Break the loop if the number of requests is greater than expected.
- Convert the response's HTML content to a BeautifulSoup object.
- Extract all movie containers from this BeautifulSoup object.
- Loop through all these containers.
- Extract the data if a container has a Metascore.

In [88]:
# redeclaring the lists to store data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Preparing the monitoring of the loop
start_time = time()
requests = 0

# for every year in the interval 2000 - 2017
for year_url in years_url:
    
    # for every page in the interval 1-4
    for page in pages:
        
        # make a get request
        response = get('https://www.imdb.com/search/title?release_date='+year_url+\
                       '&sort=num_votes,desc&page='+page,headers = headers)
        # pause the loop
        sleep(randint(8,15))
        
        # monitor the requests
        requests +=1
        elapsed_time = time()-start_time
        print('Request:{}; Frequency:{} requests/s'.\
             format(requests,requests/elapsed_time))
        clear_output(wait=True)
        
        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(requests, response.status_code))

        # Break the loop if the number of requests is greater than expected
        # (note: break exits only the inner loop over pages)
        if requests > 72:
            warn('Number of requests was greater than expected.')  
            break 

        # Parse the content of the request with BeautifulSoup
        page_html = BeautifulSoup(response.text, 'html.parser')

        # Select all the 50 movie containers from a single page
        mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

        # For every movie of these 50
        for container in mv_containers:
            # If the movie has a Metascore, then:
            if container.find('div', class_ = 'ratings-metascore') is not None:

                # Scrape the name
                name = container.h3.a.text
                names.append(name)

                # Scrape the year 
                year = container.h3.find('span', class_ = 'lister-item-year').text
                years.append(year)

                # Scrape the IMDB rating
                imdb = float(container.strong.text)
                imdb_ratings.append(imdb)

                # Scrape the Metascore
                m_score = container.find('span', class_ = 'metascore').text
                metascores.append(int(m_score))

                # Scrape the number of votes
                vote = container.find('span', attrs = {'name':'nv'})['data-value']
                votes.append(int(vote))
    

Request:72; Frequency:0.0018382522520500166 requests/s
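One detail of the script above is that break leaves only the inner loop over pages, so the outer loop over years would carry on with the next year. If the intent is to stop the whole crawl at a request limit, one possible variant is to put the loops in a function and return from it. The sketch below uses a hypothetical crawl() helper and a stand-in fetch callback instead of real requests, and it checks the limit before each request rather than after.

```python
def crawl(years, pages, max_requests, fetch):
    """Call fetch(year, page) for each combination, stopping at max_requests."""
    request_count = 0
    for year in years:
        for page in pages:
            if request_count >= max_requests:
                # Returning leaves both loops at once; a bare break would
                # only leave the inner loop over pages.
                return request_count
            fetch(year, page)
            request_count += 1
    return request_count

# Stand-in for a real request: just record which (year, page) was visited
visited = []
made = crawl(["2000", "2001"], ["1", "2", "3"], max_requests=4,
             fetch=lambda year, page: visited.append((year, page)))

print(made)
print(visited)
```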
