# Get Movie Ratings by Web Scraping (Beautiful Soup)

Read my Medium article about it [here](https://medium.com/@shanyitan/automated-web-scraping-using-beautifulsoup-for-dummies-free-python-code-41925125774e).

Next, you can get the EDA of this dataset here (link to be updated)

## Using Requests to download page

To start scraping a web page, we first need to download it using the Python ``requests`` library. The library makes a ``GET`` request to a web server, which sends back the HTML contents of the given web page. ``GET`` is just one of several types of requests we can make with ``requests``.

#### Install ``requests`` library

Run the code below to install it. Voila!

In [3]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [1]:
from requests import get

# Request the content of the web page from the server by using get()
# and store the server’s response in the variable response.
response = get('http://www.imdb.com/search/title?release_date=2019&sort=num_votes,desc&page=1')

# print a small part of response's content by accessing its .text attribute
# (response is now a Response object).
print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"


As you can see from the first line of ``response.text``, the server sent us an HTML document.

## Using BeautifulSoup to parse the HTML content

To parse our HTML document and extract the data we want from it, we’ll use a Python module called BeautifulSoup. In the following code cell we will:

- Import the ``BeautifulSoup`` class from the package ``bs4``.
- Parse ``response.text`` by creating a BeautifulSoup object, and assign this object to ``html_soup``.

#### Install ``BeautifulSoup`` library

Run the code below to install it. Easy peasy.

In [1]:
pip install BeautifulSoup4

Note: you may need to restart the kernel to use updated packages.


In [2]:
from bs4 import BeautifulSoup

# The 'html.parser' argument indicates that we want to do the
# parsing using Python’s built-in HTML parser.
html_soup = BeautifulSoup(response.text, 'html.parser')

type(html_soup)

bs4.BeautifulSoup

## Understanding the HTML structure

Before you get all hyped up for web scraping, you need to understand the HTML of the website you want to scrape. Take note that every website has a different structure.

 <img src="Images/movie ratings 1.jpg" width ="500" height=500 >

1. Right click on the website
2. Left click on ``Inspect``
3. Turn on the hover cursor button on top left.

Each movie is in a ``div`` tag with the class ``lister-item mode-advanced``. Let’s use the ``find_all()`` method to extract all the ``div`` containers that have a class attribute of ``lister-item mode-advanced``:

In [3]:
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50


As shown, there are 50 containers, which means 50 movies are listed on each page.
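
A detail worth knowing about the ``class_`` argument used above: when the value contains a space, BeautifulSoup compares it against the whole class attribute as one string, so the order of the class names matters. The sketch below illustrates this on a made-up three-line page (the HTML and the movie names are assumptions for illustration, not IMDb's markup) and contrasts it with a CSS selector via ``select()``, which only requires both classes to be present.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; only the class names mimic the real one.
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="mode-advanced lister-item"><h3><a>Movie B</a></h3></div>
<div class="lister-item mode-simple"><h3><a>Movie C</a></h3></div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Exact string match on the whole class attribute: order matters.
exact = soup.find_all('div', class_='lister-item mode-advanced')

# CSS selector: both classes must be present, in any order.
by_css = soup.select('div.lister-item.mode-advanced')

exact_names = [d.a.text for d in exact]
css_names = [d.a.text for d in by_css]
print(exact_names)  # ['Movie A']
print(css_names)    # ['Movie A', 'Movie B']
```

For the IMDb page in this tutorial both approaches should give the same 50 containers, since every container uses the same class order.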

<img src="Images/movie ratings 2.jpg" width ="500" height=500 >

Now we’ll select only the first container, and extract, by turn, each item of interest:

- The name of the movie.
- The year of release.
- The IMDB rating.
- The Metascore.
- The directors.
- The number of votes.
- The gross.
    
Let's get started with the ``first_movie``

In [4]:
# Store the content of the first container in the first_movie variable
first_movie = movie_containers[0]

In [5]:
first_movie

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt4154796"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt4154796/"> <img alt="Avengers: Endgame" class="loadlate" data-tconst="tt4154796" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt4154796/">Avengers: Endgame</a>
<span class="lister-item-year text-muted unbold">(2019)</span>
</h3>
<p class="text-muted">
<span class="certificate">P13</span>
<span class="ghost">|</span>
<span class="runtime">181 min</span>
<span class="ghost">|</span>
<span class="genre">
Action, A

## Scraping Data

From the ``first_movie`` HTML which we stored, we are going to use ``find()`` and ``find_all()`` together with string slicing to work out the magic.

### The name of the movie.

In [6]:
first_movie.h3.a.text

'Avengers: Endgame'

### The year of release.

In [7]:
first_movie.h3.find('span', class_ = 'lister-item-year text-muted unbold').text

'(2019)'
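
The year comes back as a string wrapped in parentheses, and some titles carry a prefix such as ``(I) (2019)`` (you can see a few in the final table at the end). If you want a plain integer, one possible clean-up is a small regular expression. This is only a sketch; ``parse_year`` is a helper name made up for this example.

```python
import re

def parse_year(raw):
    """Return the four-digit year found in a string like '(2019)' or '(I) (2019)', or None."""
    match = re.search(r'\((\d{4})\)', raw)
    return int(match.group(1)) if match else None

print(parse_year('(2019)'))      # 2019
print(parse_year('(I) (2019)'))  # 2019
```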

### The IMDB rating.

In [8]:
first_movie.strong.text

'8.5'

### The Metascore.

In [10]:
int(first_movie.find('span', class_ = 'metascore favorable').text)

78

### Directors

This one is more complicated because the same ``p`` tag contains both **Directors** and **Stars**. So I used splitting and slicing to extract only the directors. You may use the same logic to extract the stars as well.

In [11]:
first_movie.find('p', class_ = '')

<p class="">
    Directors:
<a href="/name/nm0751577/">Anthony Russo</a>, 
<a href="/name/nm0751648/">Joe Russo</a>
<span class="ghost">|</span> 
    Stars:
<a href="/name/nm0000375/">Robert Downey Jr.</a>, 
<a href="/name/nm0262635/">Chris Evans</a>, 
<a href="/name/nm0749263/">Mark Ruffalo</a>, 
<a href="/name/nm1165110/">Chris Hemsworth</a>
</p>

In [12]:
# use slicer [2:-2] to select only directors' names
a = first_movie.find('p', class_ = '').text.split('Stars')[0].split('\n')[2:-2]

# join the string together
a = ''.join(a)

a

'Anthony Russo, Joe Russo'
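
The slicing above depends on the exact line breaks in the HTML. As an alternative sketch (not the approach used in the rest of this notebook), we can walk through the children of the ``p`` tag and switch from directors to stars when we reach the ``ghost`` separator. The HTML below is a shortened copy of the output shown earlier, and ``split_credits`` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

credits_html = """<p class="">
    Directors:
<a href="/name/nm0751577/">Anthony Russo</a>,
<a href="/name/nm0751648/">Joe Russo</a>
<span class="ghost">|</span>
    Stars:
<a href="/name/nm0000375/">Robert Downey Jr.</a>,
<a href="/name/nm0262635/">Chris Evans</a>
</p>"""

def split_credits(p_tag):
    """Return (directors, stars) as two lists of names."""
    directors, stars = [], []
    current = directors
    for child in p_tag.children:
        if not isinstance(child, Tag):
            continue  # skip plain text such as 'Directors:' and commas
        if child.name == 'span' and 'ghost' in child.get('class', []):
            current = stars  # everything after the separator is a star
        elif child.name == 'a':
            current.append(child.get_text(strip=True))
    return directors, stars

p_tag = BeautifulSoup(credits_html, 'html.parser').find('p')
directors_list, stars_list = split_credits(p_tag)
print(', '.join(directors_list))  # Anthony Russo, Joe Russo
```

This assumes the separator is always a ``span`` with the class ``ghost``, as in the container shown above. A movie with no separator would put every name in the first list.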

### The number of votes

In [30]:
first_movie.find_all('span', attrs = {'name':'nv'})[0]['data-value']

'636582'

### Gross

In [31]:
first_movie.find_all('span', attrs = {'name':'nv'})[1]['data-value']

'858,373,000'
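
Note that both values are strings, and the gross contains thousands separators. If you need numbers later on, a conversion along these lines could work. It is a sketch under the assumption that the values look like the two shown above; ``to_int_or_none`` is a made-up helper name, and ``'-'`` is the placeholder this notebook uses later for a missing gross.

```python
def to_int_or_none(raw):
    """Convert '858,373,000' to 858373000; return None for placeholders such as '-'."""
    cleaned = raw.replace(',', '').strip()
    return int(cleaned) if cleaned.isdigit() else None

print(to_int_or_none('636582'))       # 636582
print(to_int_or_none('858,373,000'))  # 858373000
print(to_int_or_none('-'))            # None
```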

### Ensure you get the right data in English

As a side note, if you run the code from a country where English is not the main language, it’s very likely that you’ll get some of the movie names translated into the main language of that country.

Most likely, this happens because the server infers your location from your IP address. Even if you are located in a country where English is the main language, you may still get translated content. This may happen if you’re using a VPN while you’re making the GET requests.

If you run into this issue, pass the following values to the headers parameter of the get() function:

In [44]:
headers = {"Accept-Language": "en-US, en;q=0.5"}

This tells the server something like “I want the linguistic content in American English (en-US). If en-US is not available, then other types of English (en) would be fine too (but not as much as en-US).” The q parameter indicates the degree to which we prefer a certain language. If it is not specified, the value is set to 1 by default, as in the case of en-US. You can read more about this here.
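
To make the role of ``q`` concrete, here is a small sketch that splits the header value into (language, weight) pairs, using 1 as the default weight. It is a simplified reading of the header for illustration only (``parse_accept_language`` is a made-up name), not how any particular server parses it.

```python
def parse_accept_language(header):
    """Split an Accept-Language value into (language, q) pairs, highest q first."""
    prefs = []
    for part in header.split(','):
        piece = part.strip()
        if not piece:
            continue
        lang, _, params = piece.partition(';')
        q = 1.0  # default weight when no q parameter is given
        params = params.strip()
        if params.startswith('q='):
            q = float(params[2:])
        prefs.append((lang.strip(), q))
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

prefs = parse_accept_language("en-US, en;q=0.5")
print(prefs)  # [('en-US', 1.0), ('en', 0.5)]
```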

## Changing the URL’s parameters

The URLs follow a certain logic as the web pages change. As we are making the requests, we’ll have to vary the values of only two parameters of the URL:
- ``release_date`` : Create a list called **years_url** and populate it with the strings corresponding to the years 2000-2017.
- ``page`` : Create a list called **pages**, and populate it with the strings corresponding to the first 4 pages.


In [45]:
pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2000,2018)]
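
To see what these two lists turn into, the sketch below builds every URL up front with the standard library instead of string concatenation. The base URL and the ``sort`` value are taken from the request made earlier; ``build_urls`` is a hypothetical helper name, and no request is sent here.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = 'http://www.imdb.com/search/title'

def build_urls(years, pages):
    """Return one search URL per (year, page) combination."""
    urls = []
    for year, page in product(years, pages):
        params = {'release_date': year, 'sort': 'num_votes,desc', 'page': page}
        # safe=',' keeps the comma in the sort value unescaped
        urls.append(BASE_URL + '?' + urlencode(params, safe=','))
    return urls

pages = [str(i) for i in range(1, 5)]
years_url = [str(i) for i in range(2000, 2018)]
urls = build_urls(years_url, pages)
print(len(urls))  # 72
print(urls[0])
```

With 18 years and 4 pages we end up with 72 URLs, which is also the number of requests to expect for this range.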

## Controlling the crawl-rate

Why do so?
- We are less likely to get our IP address banned if we avoid hammering the server with tens of requests per second.
- We avoid disrupting and overloading the server, so that it can still respond to other users’ requests.

We need 2 functions:
1. ``sleep()``: Controls the loop’s rate. It pauses the execution of the loop for a specified number of seconds.
2. ``randint()``: Randomly generates an integer within a specified interval. To mimic human behavior, we’ll use it to vary the waiting time between requests.

In [46]:
from time import sleep
from random import randint
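
Here is a minimal sketch of how the two functions can be combined into one helper. ``polite_pause`` and its ``sleeper`` parameter are names invented for this example; the defaults of 8 and 15 seconds match the interval used in the final loop of this notebook. Passing a stand-in for ``sleep`` lets us try the helper without actually waiting.

```python
from random import randint
from time import sleep

def polite_pause(min_seconds=8, max_seconds=15, sleeper=sleep):
    """Sleep for a random whole number of seconds and return the chosen delay."""
    delay = randint(min_seconds, max_seconds)  # both endpoints are inclusive
    sleeper(delay)
    return delay

# Demonstration with a stand-in sleeper so the example finishes instantly.
recorded = []
delay = polite_pause(1, 3, sleeper=recorded.append)
print(delay, recorded)
```

In the real loop you would simply call ``polite_pause()`` after each request.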

## Monitoring the loop as it’s still going

Monitoring is very helpful in the testing and debugging process, especially if you are going to scrape hundreds or thousands of web pages in a single code run. Here are the parameters that we are going to monitor:

1. The **frequency (speed) of requests**: make sure our program is not overloading the server.

    ``Frequency value = the number of requests / the time elapsed since the first request.``

2. The **number of requests**: can halt the loop in case the number of expected requests is exceeded.
3. The **status code of our requests**: make sure the server is sending back the proper responses.

Let’s experiment with this monitoring technique at a small scale first.

In [48]:
from time import time

# Set a starting time using the time() function from the time module, and assign the value to start_time.
start_time = time()

# Assign 0 to the variable request, which we’ll use to count the number of requests.
request = 0

#Start a loop, and then with each iteration:
    #- Simulate a request.
    #- Increment the number of requests by 1.
    #- Pause the loop for a time interval between 1 and 3 seconds.
    #- Calculate the elapsed time since the first request, and assign the value to elapsed_time.
    #- Print the number of requests and the frequency.
    
for _ in range(5):
    request += 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(request, request/elapsed_time))

Request: 1; Frequency: 0.9985465701078157 requests/s
Request: 2; Frequency: 0.4996984632467085 requests/s
Request: 3; Frequency: 0.4282898904634903 requests/s
Request: 4; Frequency: 0.39972687310084826 requests/s
Request: 5; Frequency: 0.454000337412322 requests/s


#### Clear_output()

Since we’re going to make more than 5 requests, our work will look a bit untidy as the output accumulates. To avoid that, we’ll clear the output after each iteration, and replace it with information about the most recent request.

How:
1. Use the ``clear_output()`` function from IPython’s ``core.display`` module.
2. Set the ``wait`` parameter of ``clear_output()`` to ``True`` so that the current output is only replaced once some new output appears.

In [49]:
from IPython.core.display import clear_output

start_time = time()
request = 0

for _ in range(5):
    request += 1
    sleep(randint(1,3))
    current_time = time()
    elapsed_time = current_time - start_time
    print('Request: {}; Frequency: {} requests/s'.format(request, request/elapsed_time))
    clear_output(wait = True)

Request: 5; Frequency: 0.5537280012707625 requests/s


### Warnings

To monitor the status code we’ll set the program to warn us if there’s something off. A successful request is indicated by a status code of 200. We’ll use the ``warn()`` function from the warnings module to throw a warning if the status code is not 200.

We chose a warning over breaking the loop because there’s a good possibility we’ll scrape enough data, even if some of the requests fail. We will only break the loop if the number of requests is greater than expected.

In [50]:
from warnings import warn
warn("Warning Simulation")
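
As a sketch of how the status check could look in isolation, the block below uses a stand-in object in place of a real ``Response`` (``FakeResponse`` and ``check_status`` are names invented for this example) and captures the warning so we can inspect it.

```python
import warnings
from collections import namedtuple

# Stand-in for a requests Response; only status_code is used here.
FakeResponse = namedtuple('FakeResponse', ['status_code'])

def check_status(response, request_number):
    """Warn (without raising) when the status code is not 200; return True on success."""
    if response.status_code != 200:
        warnings.warn('Request: {}; Status code: {}'.format(request_number, response.status_code))
        return False
    return True

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    ok_first = check_status(FakeResponse(200), 1)
    ok_second = check_status(FakeResponse(404), 2)

print(ok_first, ok_second, len(caught))  # True False 1
print(caught[0].message)                 # Request: 2; Status code: 404
```

Because a warning does not stop the program, the loop keeps going after a failed request, which is the behavior we want here.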

  


## Piece everything together

Phew~ Tough work is done, now let’s piece together everything we’ve done so far.

1. **Import** necessary libraries
2. **Re-declare the list variables** so they become empty again.
3. Prepare the loop.
4. **Loop through the years_url list** in the interval 2010-2019 and **loop through the pages list** in the interval 1-4.
5. Make the **GET requests** within the pages loop
6. Give the **headers** parameter the right value to make sure we get only English content.
7. Pause the loop for a **time interval** between 8 and 15 seconds.
8. Throw a **warning for non-200 status codes**.
9. Break the loop if the **number of requests is greater than expected**.
10. Convert the response’s HTML content to a BeautifulSoup object.
11. Extract all **movie containers** from this BeautifulSoup object.
12. **Loop through** all these containers.
13. Extract the data if a container has a **Metascore**.
14. Extract the gross if a container has a **Gross** value, or else append ``"-"``.

In [24]:
from requests import get
from bs4 import BeautifulSoup
from time import time, sleep
from random import randint
from IPython.core.display import clear_output
from warnings import warn
headers = {"Accept-Language": "en-US, en;q=0.5"}

# Redeclare the lists to store data in
names = []
years = []
imdb_ratings = []
metascores = []
directors = []
votes = []
gross = []

# Prepare the loop
start_time = time()
request = 0

pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(2010,2020)]

# For every year in the interval 2010-2019
for year_url in years_url:

    # For every page in the interval 1-4
    for page in pages:

        # Make a get request
        # Example: https://www.imdb.com/search/title/?release_date=2019&sort=num_votes,desc&page=1
        response = get('http://www.imdb.com/search/title?release_date='+year_url+'&sort=num_votes,desc&page='+page, headers = headers)

        # Pause the loop
        sleep(randint(8,15))

        # Monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(request, request/elapsed_time))
        clear_output(wait = True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))

        # Break the loop if the number of requests is greater than expected
        # 4 pages * 10 years = 40 requests
        if request > 40:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        page_html = BeautifulSoup(response.text, 'html.parser')

        # Select all the 50 movie containers from a single page
        mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

        # For every movie of these 50
        for container in mv_containers:
            # If the movie has a Metascore, then:
            if container.find('div', class_ = 'ratings-metascore') is not None:

                # Scrape the name
                name = container.h3.a.text
                names.append(name)

                # Scrape the year
                year = container.h3.find('span', class_ = 'lister-item-year').text
                years.append(year)

                # Scrape the IMDB rating
                imdb = float(container.strong.text)
                imdb_ratings.append(imdb)

                # Scrape the Metascore
                m_score = container.find('span', class_ = 'metascore').text
                metascores.append(int(m_score))
                
                # Scrape the directors
                director = ''.join(container.find('p', class_ = '').text.split('Stars')[0].split('\n')[2:-2])
                directors.append(director)

                # Scrape the number of votes
                vote = container.find_all('span', attrs = {'name':'nv'})[0]['data-value']
                votes.append(int(vote))
                
                # If the movie has a Gross, then:
                if len(container.find_all('span', attrs = {'name':'nv'})) >= 2:
                    
                    # Scrape the gross
                    gross_value = container.find_all('span', attrs = {'name':'nv'})[1]['data-value']
                    gross.append(gross_value)
                    
                else:
                    gross.append("-")        
                

Request:40; Frequency: 0.054821971856423284 requests/s
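
One thing to keep in mind about the loop above: ``break`` only leaves the innermost ``for`` loop (the pages loop). In this run the limit of 40 was never exceeded, so it made no difference, but if it had been, the years loop would have moved on to the next year and sent one more request before breaking again. A possible way to stop both loops is a flag, sketched below with the requests simulated (``crawl`` and ``max_requests`` are names made up for this example).

```python
def crawl(years, pages, max_requests):
    """Visit (year, page) pairs, stopping both loops once max_requests is exceeded."""
    visited = []
    requests_made = 0
    stop = False
    for year in years:
        for page in pages:
            requests_made += 1
            if requests_made > max_requests:
                stop = True
                break          # leaves the pages loop only
            visited.append((year, page))
        if stop:
            break              # leaves the years loop as well
    return visited

visited = crawl(['2018', '2019'], ['1', '2', '3'], max_requests=4)
print(visited)
# [('2018', '1'), ('2018', '2'), ('2018', '3'), ('2019', '1')]
```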


## Transform the scraped data into CSV

In the next code block we:
- Merge the data into a pandas DataFrame.
- Print some information about the newly created DataFrame.
- Show the last 10 entries.

In [28]:
import pandas as pd
movie_ratings = pd.DataFrame({'movie': names,
'year': years,
'imdb': imdb_ratings,
'metascore': metascores,
'directors': directors,
'votes': votes,
'gross($)': gross,
})
print(movie_ratings.info())
movie_ratings.tail(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1788 entries, 0 to 1787
Data columns (total 7 columns):
movie        1788 non-null object
year         1788 non-null object
imdb         1788 non-null float64
metascore    1788 non-null int64
directors    1788 non-null object
votes        1788 non-null int64
gross($)     1788 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 97.9+ KB
None


| | movie | year | imdb | metascore | directors | votes | gross($) |
|---|---|---|---|---|---|---|---|
| 1778 | Rocketman | (I) (2019) | 7.4 | 69 | Dexter Fletcher | 85926 | 96368160 |
| 1779 | How to Train Your Dragon: The Hidden World | (2019) | 7.5 | 71 | Dean DeBlois | 82861 | 160799505 |
| 1780 | Men in Black: International | (2019) | 5.6 | 38 | F. Gary Gray | 80819 | 79800736 |
| 1781 | Murder Mystery | (2019) | 6.0 | 38 | Kyle Newacheck | 77429 | - |
| 1782 | Ford v Ferrari | (2019) | 8.3 | 81 | James Mangold | 75836 | - |
| 1783 | 6 Underground | (2019) | 6.1 | 41 | Michael Bay | 74433 | - |
| 1784 | Yesterday | (III) (2019) | 6.9 | 55 | Danny Boyle | 72179 | 73286650 |
| 1785 | Pet Sematary | (2019) | 5.8 | 57 | Kevin Kölsch, Dennis Widmyer | 64555 | 54724696 |
| 1786 | Escape Room | (I) (2019) | 6.3 | 48 | Adam Robitel | 64168 | 57005601 |
| 1787 | Polar | (I) (2019) | 6.3 | 19 | Jonas Åkerlund | 63653 | - |


In [29]:
movie_ratings.to_csv('movie_ratings_raw.csv')
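
By default ``to_csv()`` also writes the DataFrame index as a first column without a name, which shows up as ``Unnamed: 0`` when the file is read back. If you don't need the index, passing ``index=False`` is one option. The sketch below shows the difference on a two-row DataFrame, writing to in-memory buffers instead of files so nothing is saved to disk.

```python
import io
import pandas as pd

# Tiny sample using two rows from the table above.
sample = pd.DataFrame({
    'movie': ['Rocketman', 'Polar'],
    'imdb': [7.4, 6.3],
})

with_index = io.StringIO()
sample.to_csv(with_index)                  # index written as an unnamed first column
without_index = io.StringIO()
sample.to_csv(without_index, index=False)  # index left out

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)     # ['Unnamed: 0', 'movie', 'imdb']
print(cols_without)  # ['movie', 'imdb']
```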