# imdb_movie_scraping

The dataset is scraped from the IMDB website using BeautifulSoup. This tutorial is a more advanced version of my previous notebook and assumes that you know the basics of HTML and BeautifulSoup. For a more basic overview, please look at the previous notebooks.

In [40]:
from requests import get #the package which fetches the HTML doc from the url for us
import re

Say we want to scrape a list of 1000 movies. Fetching each movie separately would mean sending 1000 requests to the website; assuming each request takes 1 second, that adds up to 1000 seconds. Exploring the website a bit reveals a novel way to make the scraping much faster.

In [2]:
url = "http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1"

# Inspecting the link

Exploring the IMDB site for a while, we find that the advanced search feature lists the top movies for a given time frame at 50 movies per page. This cuts the number of requests by a factor of 50, since we can extract 50 movies per request.
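
The saving can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the 1 second per request is the same assumption as above, and the variable names are made up for illustration.

```python
import math

movies_wanted = 1000        # size of the list we want to scrape
seconds_per_request = 1     # assumed average time for one request
movies_per_page = 50        # movies listed on one advanced-search page

# One request per movie versus one request per search page.
requests_per_movie = movies_wanted
requests_per_page = math.ceil(movies_wanted / movies_per_page)

print(requests_per_movie * seconds_per_request)  # 1000 seconds
print(requests_per_page * seconds_per_request)   # 20 seconds
```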

Next, let's explore the URL to get a better understanding of what is happening on each request. The link has the following elements:

- *release_date*: this takes the value of the year we are interested in. (2017 in our case)
- *sort*: this takes in the value by which we want to sort our list. (num_votes,desc in our case, desc suggests descending order)
- *page*: this takes the page number we are interested in. (1 in our case)

Further, when you click on the Next link, we get an additional element in the URL:
"http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=2&ref_=adv_nxt"

- *ref_*: this indicates whether we moved to the next or the previous page.
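
To see these elements without reading the URL by eye, the query string can be split with Python's standard `urllib.parse` module. This is a small sketch and not part of the scraping script itself.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=2&ref_=adv_nxt"

# parse_qs maps every parameter name to a list of its values
query = parse_qs(urlsplit(url).query)

for name, values in query.items():
    print(name, "=", values[0])
```

Note that the comma inside the `sort` value is kept as part of the value; only `&` separates parameters.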

In [3]:
response = get(url)
print(response.text[:500]) #accessing the .text attribute of response





<!DOCTYPE html>
<html
xmlns:og="http://ogp.me/ns#"
xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">
            <script type="text/javascript">var ue_t0=window.ue_t0||+new Date();</script>
            <script type="text/javascript">
                var ue_mid = "A1EVAM02EL8SFB"; 
                var


# Understanding the HTML structure of the page

As we can see, the get() method pulls up the HTML document for the URL. To pull out the data we want, we need to closely inspect the HTML content of the page. We can turn to Chrome's developer tools for this: press Ctrl+Shift+I, or simply right-click the element you want to study and select Inspect to jump to its markup. (This was done in Chrome but should work in other browsers as well.)

We observe that there is a div tag for each movie. Thus, we can simply loop through the 50 div tags, one per movie, using BeautifulSoup.

In [4]:
from bs4 import BeautifulSoup as bs # importing BeautifulSoup as bs

html_soup = bs(response.text, "html.parser") #using python's built in HTML parser
type(html_soup)

bs4.BeautifulSoup

Before extracting the 50 div containers, we need to figure out what sets them apart from the other div tags on the page. While exploring the HTML content we find that their class attribute has two values, "lister-item" and "mode-advanced". This combination is unique to the movie div containers.

In [5]:
movie_containers = html_soup.find_all("div", class_ = "lister-item mode-advanced") #find_all() used to find all the tags
print(type(movie_containers))
len(movie_containers)

<class 'bs4.element.ResultSet'>


50

find_all() returned a ResultSet object of length 50, one entry for each movie we are interested in.
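
How the two-value class match behaves can be tried on a tiny hand-written snippet. The HTML below is a made-up stand-in, not real IMDB markup, and the `select()` call is shown only as an alternative that ignores the order of the class names.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the search page; not real IMDB markup.
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="lister-item mode-advanced"><h3><a>Movie B</a></h3></div>
<div class="lister-item mode-simple"><h3><a>Movie C</a></h3></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A class_ string containing a space is compared with the exact attribute value
exact = soup.find_all("div", class_="lister-item mode-advanced")

# A CSS selector requires both classes but does not care about their order
by_selector = soup.select("div.lister-item.mode-advanced")

titles = [div.h3.a.text for div in exact]
print(titles)
```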

Now, we'll select the containers one by one and extract the elements of interest, such as:

- The name of movie.
- The year of release.
- The IMDB rating.
- The metascore.
- The number of votes.
- And so on (certificate, runtime, genre, director, cast and gross value).

In [19]:
#list to store scraped value data in:
movie_names = []
year_release = []
imdb_ratings = []
metascores = []
votes = []
movie_description = []
certificate = []
runtime = []
genre = []
director_name = []
star_cast = []
gross_value = []

#extract data from individual movie container
for container in movie_containers:
    
   
    #if movie has Metascore, then extract:
    if container.find("div", class_ = "ratings-metascore") is not None:
        
                #the movie_name
                name = str(container.h3.a.text)
                #The year of release
                release = container.find("span", class_ = "lister-item-year text-muted unbold").text
                #The ratings for the movie
                ratings = float(container.strong.text)
                #metascore for the movie
                meta = container.find("span", class_ = "metascore").text
                #The number of votes
                vote = container.find("span", attrs = {"name":"nv"})['data-value']
                #the certificate for the movie
                if container.find("span", class_ = "certificate") is not None:
                    certi = container.find("span", class_ = "certificate").text
                else:
                    certi = None
                 #The runtime for the movie   
                if container.find("span", class_ ="runtime") is not None:
                    run = container.find("span", class_ ="runtime").text
                else:
                    run = None
                 #The genre of the movie   
                if container.find("span", class_ ="genre") is not None:
                    gen = container.find("span", class_ ="genre").text
                else:
                    gen = None   
                
                #fetching all <p> tags
                content = container.find_all("p")
                
                #The movie description
                if content[1] is not None:
                    desc = content[1].text
                else:
                    desc = None
                
                #subsetting all the <a> tags in 3rd <p> tag
                content_2 = content[2].find_all("a")
                
                #the director
                if content_2[0] is not None:
                    director = content_2[0].text
                else:
                    director = None
                
                #the gross value
                if len(container.find_all("span", attrs = {"name":"nv"})) >= 2:
                    gross = container.find_all("span", attrs = {"name":"nv"})[1]['data-value']
                else:
                    gross = None
                    
                 #extracting artists names
                if content_2[1] is not None:
                    temp = []
                    for i in range(len(content_2)-1):
                        temp.append(content_2[i].text)
                else:
                    for i in range(len(content_2)-1):
                        temp.append(None)
                        
                #Cleaning the data using regular Expressions:
                
                name = str(name)
                desc = str(re.findall(r"[^\\r\n].+",desc))
                gen = re.findall(r"([^\\r\\n\s,][a-zA-Z]+)",gen)
                
                if (gross != None):
                    gross = int(gross.replace(",",""))
                
                if (run != None):
                    run = int(re.findall(r"[0-9].+[^a-zA-Z-]", run)[0])
                    
                if (release != None):
                    release = str(re.findall(r"[0-9].+[^a-zA-Z-]",release))
                    
                for i in range(len(content_2)-1):
                    for j in range(len(gen)):
                        movie_names.append(str(name))
                        year_release.append(release)
                        imdb_ratings.append(ratings)
                        metascores.append(int(meta))
                        votes.append(int(vote))
                        movie_description.append(desc)
                        gross_value.append(gross)
                        runtime.append(run)
                        certificate.append(str(certi))
                        director_name.append(str(director))
                        star_cast.append(str(temp[i]))
                        genre.append(str(gen[j]))
        
    
        
                

In [20]:
import pandas as pd

test_df = pd.DataFrame({"movie_names":movie_names,
                        "year_release":year_release,
                        "imdb_ratings":imdb_ratings,
                        "metscores":metascores,
                        "votes":votes,
                        "movie_description":movie_description,
                        "certificate":certificate,
                        "runtime":runtime,
                        "genre":genre,
                        "director_name": director_name,
                        "star_cast": star_cast,
                        "gross_value":gross_value
                       })

In [39]:
test_df.head(2)

Unnamed: 0,certificate,director_name,genre,gross_value,imdb_ratings,metscores,movie_description,movie_names,runtime,star_cast,votes,year_release
0,R,James Mangold,Action,226277068.0,8.1,77,"[""In the near future, a weary Logan cares for ...",Logan,137,James Mangold,452557,['2017)']
1,R,James Mangold,Drama,226277068.0,8.1,77,"[""In the near future, a weary Logan cares for ...",Logan,137,James Mangold,452557,['2017)']
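
The cleaning step in the loop leaves some residue in the output above; for example, the year shows up as ['2017)']. A possible alternative is to search for the digits directly and to split the genre text on commas. This is only a sketch: the three raw strings are illustrative guesses at what the tags return, not captured output.

```python
import re

# Assumed raw values, shaped like the text the container tags might return.
raw_release = "(2017)"
raw_runtime = "137 min"
raw_genre = "\nAction, Drama, Sci-Fi            "

# Take the first four-digit run as the year
year = int(re.search(r"\d{4}", raw_release).group())

# Take the first run of digits as the runtime in minutes
minutes = int(re.search(r"\d+", raw_runtime).group())

# Split on commas and trim whitespace, which keeps hyphenated names whole
genres = [g.strip() for g in raw_genre.split(",")]

print(year, minutes, genres)
```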


## Everything went just as expected!

As a side note, if you run the code in a country where English is not the main language, it is very likely that you will get the movie names translated into the main language of that country. To avoid this, pass headers = {"Accept-Language": "en-US, en;q=0.5"} as an argument to get().
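
A minimal sketch of how the header is attached, using the `Request` class from the requests package to prepare the request without actually sending it:

```python
from requests import Request

url = "http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1"
headers = {"Accept-Language": "en-US, en;q=0.5"}

# Prepare the request without sending it, to inspect what would go out
prepared = Request("GET", url, headers=headers).prepare()
print(prepared.headers["Accept-Language"])

# In the scraping script the equivalent call would be:
# response = get(url, headers=headers)
```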

# Script for multiple pages

Building a script to scrape multiple pages is a bit more challenging. We will build upon our old script by adding three more things:

- Making all the requests we want from within the loop.
- Controlling the loop's rate to avoid bombarding the server with requests.
- Monitoring the loop while it is in progress.

We'll scrape the first 10 pages for each of the years 2017 and 2018.

# Changing URL parameters

As described before, the URL changes in a predictable way as we move from one results page to the next.

As we are making requests, we'll only have to vary the values of two parameters of the URL: "release_date" and "page".

In [23]:
pages = [str(i) for i in range(1,11)] #creating list of strings corresponding to pages 1-10
years = [str(i) for i in range(2017,2019)] #creating list of strings corresponding to years 2017-2018
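
As an alternative to string concatenation, the URL for each year and page combination can be assembled with the standard library. This is a sketch: `BASE_URL` and `build_url` are hypothetical names introduced here, and only two years and two pages are used to keep the output short.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "http://www.imdb.com/search/title"  # hypothetical constant

def build_url(year, page):
    """Return the search URL for one year and one results page."""
    params = {"release_date": year, "sort": "num_votes,desc", "page": page}
    # safe="," keeps the comma in the sort value unescaped
    return BASE_URL + "?" + urlencode(params, safe=",")

# Every (year, page) combination, year varying slowest
urls = [build_url(y, p) for y, p in product(["2017", "2018"], ["1", "2"])]

for u in urls:
    print(u)
```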

# Controlling the crawl rate

If we avoid flooding the server with tens of requests per second, we are much less likely to have our IP address banned. We also avoid disrupting the activity of the website we scrape by allowing the server to respond to other users' requests too.

We'll control the loop's rate by using the sleep() function from Python's "time" module. This pauses the execution of the loop for a specified number of seconds.

To mimic human behaviour and make our requests look less automated, we will vary the waiting time between requests by using the randint() function from Python's "random" module.



In [24]:
from time import sleep
from random import randint
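
The two functions can be combined into a small helper. This is a sketch: `polite_pause` is a hypothetical name, and the `sleep_fn` parameter exists only so the demonstration below can run without really waiting.

```python
from random import randint
from time import sleep

def polite_pause(low=8, high=15, sleep_fn=sleep):
    """Sleep for a random whole number of seconds between low and high, inclusive."""
    delay = randint(low, high)
    sleep_fn(delay)
    return delay

# Demonstration with a stand-in for sleep, so nothing actually waits
recorded = []
delays = [polite_pause(sleep_fn=recorded.append) for _ in range(5)]
print(delays)
```

Inside the scraping loop the call would simply be `polite_pause()`, which falls back to the real sleep().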

# Monitoring the loop as it's still going

Given that we have so many pages to scan through, it's better to have a way to monitor them while we are looping through them. This step is completely optional but is very helpful while debugging. If you are looping through, say, 100+ pages, I'd say this is a must-have feature.

For our script, we'll make sure to use this feature and measure the following parameters:

- The frequency of requests, just to make sure we are not overloading the server.

- The number of requests, so we can halt the loop in case the expected number of requests is exceeded.

- The status code of our requests, so we make sure the server is sending back the correct responses.
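
The three checks above can be sketched in isolation before they are wired into the scraping loop. `check_request` and its `max_requests` parameter are hypothetical names; the limit of 200 is just an example value.

```python
from time import time
from warnings import warn

def check_request(count, status_code, start_time, max_requests=200):
    """Report progress for one request; return False when the loop should stop."""
    elapsed = max(time() - start_time, 1e-9)  # guard against division by zero
    print("Request: {}, Frequency: {:.3f} requests/s".format(count, count / elapsed))

    # warn about responses that are not OK
    if status_code != 200:
        warn("Request: {}, Status Code: {}".format(count, status_code))

    # signal the caller to stop once the request budget is exceeded
    if count > max_requests:
        warn("Number of requests was greater than expected.")
        return False
    return True

start = time()
keep_going = check_request(1, 200, start)
```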

In [26]:
from IPython.display import clear_output
from time import time
from warnings import warn #needed for the warn() calls in the loop below

#redeclaring the variables

movie_names = []
year_release = []
imdb_ratings = []
metascores = []
votes = []
movie_description = []
certificate = []
runtime = []
genre = []
director_name = []
star_cast = []
gross_value = []

#preparing the monitor of the loop
start_time = time()
requests = 0


    
#for every year in the interval 2017-2018
for year in years:

    #for every page in the interval 1-10
    for page in pages:

        #make a get request
        response = get("http://www.imdb.com/search/title?release_date=" + year + "&sort=num_votes,desc&page=" + page)

        #pause the loop
        sleep(randint(8,15))

        #monitor the requests
        requests += 1
        sleep(randint(1,3))
        elapsed_time = time() - start_time
        print("Request: {}, Frequency: {} requests/s".format(requests, requests/elapsed_time))
        print(page, year)
        clear_output(wait = True)

        #throw a warning for non-200 status codes
        if response.status_code != 200:
            warn("Request: {}, Status Code: {} ".format(requests, response.status_code))


        #break out of the page loop if the number of requests is greater than expected
        if requests > 200:
            warn("Number of requests was greater than expected.")
            break

        #parse the content through the html.parser using BeautifulSoup
        html_page = bs(response.text, "html.parser")

        #select all 50 movie container for a single page
        containers = html_page.find_all("div", class_ = "lister-item mode-advanced")

        #for every movie of the 50 movies
        for container in containers:

            #if movie has Metascore, then extract:
            if container.find("div", class_ = "ratings-metascore") is not None:

                #the movie_name
                name = str(container.h3.a.text)
                #The year of release
                release = container.find("span", class_ = "lister-item-year text-muted unbold").text
                #The ratings for the movie
                ratings = float(container.strong.text)
                #metascore for the movie
                meta = container.find("span", class_ = "metascore").text
                #The number of votes
                vote = container.find("span", attrs = {"name":"nv"})['data-value']
                #the certificate for the movie
                if container.find("span", class_ = "certificate") is not None:
                    certi = container.find("span", class_ = "certificate").text
                else:
                    certi = None
                 #The runtime for the movie   
                if container.find("span", class_ ="runtime") is not None:
                    run = container.find("span", class_ ="runtime").text
                else:
                    run = None
                 #The genre of the movie   
                if container.find("span", class_ ="genre") is not None:
                    gen = container.find("span", class_ ="genre").text
                else:
                    gen = None   
                
                #fetching all <p> tags
                content = container.find_all("p")
                
                #The movie description
                if content[1] is not None:
                    desc = content[1].text
                else:
                    desc = None
                
                #subsetting all the <a> tags in 3rd <p> tag
                content_2 = content[2].find_all("a")
                
                #the director
                if content_2[0] is not None:
                    director = content_2[0].text
                else:
                    director = None
                
                #the gross value
                if len(container.find_all("span", attrs = {"name":"nv"})) >= 2:
                    gross = container.find_all("span", attrs = {"name":"nv"})[1]['data-value']
                else:
                    gross = None
                    
                 #extracting artists names
                if content_2[1] is not None:
                    temp = []
                    for i in range(len(content_2)-1):
                        temp.append(content_2[i].text)
                else:
                    for i in range(len(content_2)-1):
                        temp.append(None)
                        
                #Cleaning the data using regular Expressions:
                
                name = str(name)
                desc = str(re.findall(r"[^\\r\n].+",desc))
                #temp = re.findall(r"'([^']*)'",temp)
                gen = re.findall(r"([^\\r\\n\s,][a-zA-Z]+)",gen)
                
                if (gross != None):
                    gross = int(gross.replace(",",""))
                
                if (run != None):
                    run = int(re.findall(r"[0-9].+[^a-zA-Z-]", run)[0])
                    
                if (release != None):
                    release = str(re.findall(r"[0-9].+[^a-zA-Z-]",release))
                 
                #storing data in the list objects
                for i in range(len(content_2)-1):
                    for j in range(len(gen)):
                        movie_names.append(str(name))
                        year_release.append(release)
                        imdb_ratings.append(ratings)
                        metascores.append(int(meta))
                        votes.append(int(vote))
                        movie_description.append(desc)
                        gross_value.append(gross)
                        runtime.append(run)
                        certificate.append(str(certi))
                        director_name.append(str(director))
                        star_cast.append(str(temp[i]))
                        genre.append(str(gen[j]))

Request: 20, Frequency: 0.06919363152835704 requests/s
10 2018


In [41]:
#storing scraped data into a data frame.

imdb_movie_dataset = pd.DataFrame({"movie_names":movie_names,
                        "year_release":year_release,
                        "imdb_ratings":imdb_ratings,
                        "metscores":metascores,
                        "votes":votes,
                        "movie_description":movie_description,
                        "certificate":certificate,
                        "runtime":runtime,
                        "genre":genre,
                        "director_name": director_name,
                        "star_cast": star_cast,
                        "gross_value":gross_value
                       })

imdb_movie_dataset.head(3)

Unnamed: 0,certificate,director_name,genre,gross_value,imdb_ratings,metscores,movie_description,movie_names,runtime,star_cast,votes,year_release
0,R,James Mangold,Action,226277068.0,8.1,77,"[""In the near future, a weary Logan cares for ...",Logan,137,James Mangold,452598,['2017)']
1,R,James Mangold,Drama,226277068.0,8.1,77,"[""In the near future, a weary Logan cares for ...",Logan,137,James Mangold,452598,['2017)']
2,R,James Mangold,Sci,226277068.0,8.1,77,"[""In the near future, a weary Logan cares for ...",Logan,137,James Mangold,452598,['2017)']


In [30]:
imdb_movie_dataset.to_csv("imdb_movie_dataset.csv", encoding = 'UTF-8') #stores the DataFrame as a csv file

In [33]:
imdb_movie_dataset.shape # checking the shape of the cleaned dataset.

(2887, 12)

In [36]:
#extracting movie names for further scraping
movies = imdb_movie_dataset['movie_names']
movies.values

array(['Logan', 'Logan', 'Logan', ..., 'Zhuo yao ji 2', 'Zhuo yao ji 2',
       'Zhuo yao ji 2'], dtype=object)

In [37]:
movies.to_csv("movies.csv", encoding = 'UTF-8') #saving the names of movies as a csv file

# Success!

We have successfully completed the scraping and cleaning of the data. In the next notebook we will store this dataset in a PostgreSQL database.