# Web-Scraping the IMDb page for other information

In [52]:
import time
import random
import warnings
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

We will use the normal distribution as the benchmark for the best movie recommender. Why? A skewed distribution (to the left or to the right) would indicate a bias in rating behavior. Given enough data, the average movie should land close to 3 on a scale of 1-5, so that is what we will look for when plotting.
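As a rough illustration of what "skewed" means here, the sketch below computes a simple moment-based skewness for two lists of ratings. The ratings are made up for illustration and the helper name `sample_skewness` is hypothetical; it is not part of the original analysis.

```python
def sample_skewness(values):
    """Moment-based skewness: 0 for symmetric data, negative for a long
    left tail, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical ratings on a 1-5 scale, invented for this example only.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]       # centered on 3
top_heavy = [2, 3, 4, 4, 4, 5, 5, 5, 5, 5]    # most ratings pile up near 5

print(sample_skewness(symmetric))
print(sample_skewness(top_heavy))
```

A value near zero is what the benchmark above would expect; a clearly negative value, as for the second list, suggests raters lean toward high scores.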

We will use Requests and BeautifulSoup for web scraping. Our goal is to gather information from multiple websites and evaluate which one is the most reliable source of movie ratings.

In [53]:
#scrolling to the next 50 movies (each page)
link = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=0'
response = requests.get(link)
print(response.text[0:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"


If we open the website and inspect the page, we'll see that every movie sits in its own `<div>` element: everything belonging to one movie is nested within `<div class="lister-item mode-advanced">`. There should be about 50 of these containers on each page.

We need to use BeautifulSoup to help parse the html. Documentation can be found on:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup

In [54]:
html_soup = BeautifulSoup(response.text,"html.parser")
#type(html_soup)
#The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
#Ie: looking through the "div" containers with class attribute of descendants being "lister-item mode-advanced"
total_containers = html_soup.find_all("div",class_="lister-item mode-advanced")
print(len(total_containers)) # a list of 50 containers

50
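To see what the `class_` filter matches without hitting the network, here is a small sketch on hand-written HTML. The snippet is only loosely modelled on the listing page and the movie names are placeholders. It compares the exact-string match used above with a CSS selector, which may be more forgiving if the class order ever differs.

```python
from bs4 import BeautifulSoup

# Hand-written HTML, an assumption for illustration; not copied from IMDb.
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="mode-advanced lister-item"><h3><a>Movie B</a></h3></div>
<div class="lister-item mode-simple"><h3><a>Movie C</a></h3></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A multi-class string passed to class_ is compared against the exact attribute value.
exact = soup.find_all("div", class_="lister-item mode-advanced")
# A CSS selector requires both classes but ignores the order they appear in.
by_selector = soup.select("div.lister-item.mode-advanced")

exact_titles = [div.h3.a.text for div in exact]
selector_titles = [div.h3.a.text for div in by_selector]
print(exact_titles)
print(selector_titles)
```

Either approach works for the page as scraped here, since the count above came back as the expected 50.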


# Structure of the HTML page

In [55]:
first_movie = total_containers[0]
#first_movie
first_title = first_movie.h3.a.text
first_title

first_year = first_movie.h3.find("span",class_="lister-item-year text-muted unbold").text
first_year = first_year[1:-1] #remove parenthesis
first_year

first_imdb = first_movie.find("div", class_="ratings-bar").strong
first_imdb.text

first_metascore = first_movie.find("span", class_="metascore").text
#matching on "metascore" alone is enough, no need for the full class "metascore favorable"
#not every movie has this tag; when it is missing, find() returns None and .text raises an error
first_metascore

first_vote = first_movie.find_all("span", attrs={"name":"nv"})[0]["data-value"]
first_vote

first_gross = first_movie.find_all("span",attrs={"name":"nv"})[1]["data-value"]
first_gross

'226,277,068'

In [56]:
#first page run-through
def web_scrap(total_containers):
    imdb_data = {"name":[],"year":[],"imdb":[],"metascore":[],"vote":[],"gross":[]}
    #since a missing metascore or gross gave us errors, we only keep movies for which every field is available.
    #this will also save us a lot of data cleaning time later
    for movie in total_containers:
        if (movie.find('div', class_='ratings-metascore') is None) or ((len(movie.find_all("span", attrs={"name":"nv"})) < 2)):
            continue

        title = movie.h3.a.text
        imdb_data["name"].append(title)

        year = movie.h3.find("span",class_="lister-item-year text-muted unbold").text
        imdb_data["year"].append(year)

        imdb = movie.find("div", class_="ratings-bar").strong.text
        imdb_data["imdb"].append(float(imdb))

        metascore = movie.find("span", class_="metascore").text
        imdb_data["metascore"].append(int(metascore))

        vote = movie.find_all("span", attrs={"name":"nv"})[0]["data-value"]
        vote_ = vote.replace(",","") #parsing string "1,000" to 1000
        imdb_data["vote"].append(int(vote_))

        gross = movie.find_all("span", attrs={"name":"nv"})[1]["data-value"]
        gross_ = gross.replace(",","") #parsing string "1,000" to 1000
        imdb_data["gross"].append(int(gross_))
    return imdb_data

imdb_data = web_scrap(total_containers)
#imdb_data

In [57]:
#helper function to modify the year column
def year_mod(row):
    year = row.split("(")[-1] #will result in "2017)"
    return year[:-1]

#create a dataframe from the dictionary output of server request
def df_transform(imdb_data):
    temp_df = pd.DataFrame({'movie': imdb_data["name"],
                           'year': imdb_data["year"],
                           'imdb': imdb_data["imdb"],
                           'metascore': imdb_data["metascore"],
                           'votes': imdb_data["vote"],
                           'gross':imdb_data["gross"]})
    print(temp_df.info())
    return temp_df


temp_df = df_transform(imdb_data)
temp_year = temp_df["year"].apply(year_mod)
temp_df["year"] = temp_year
temp_df.head(5)
#web scraping officially worked!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 6 columns):
gross        46 non-null int64
imdb         46 non-null float64
metascore    46 non-null int64
movie        46 non-null object
votes        46 non-null int64
year         46 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 2.2+ KB
None


Unnamed: 0,gross,imdb,metascore,movie,votes,year
0,226277068,8.1,77,Logan,510160,2017
1,412563408,7.5,76,Wonder Woman,437645,2017
2,188373161,8.0,94,Dunkirk,419778,2017
3,620181382,7.2,85,Star Wars: Episode VIII - The Last Jedi,416461,2017
4,389813101,7.7,67,Guardians of the Galaxy Vol. 2,412289,2017
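The `year_mod` helper above relies on the year being the last parenthesised group in the text. As an alternative sketch, a regular expression can pull out the first four-digit number and return it as an integer. The function name `extract_year` and the sample strings are assumptions for illustration; the page may not contain all of these forms.

```python
import re

def extract_year(text):
    """Return the first standalone four-digit number in text as an int, or None."""
    match = re.search(r"\b(\d{4})\b", text)
    return int(match.group(1)) if match else None

# Hypothetical year strings, chosen to show the edge cases.
samples = ["(2017)", "(I) (2017)", "(2016 TV Movie)", ""]
years = [extract_year(s) for s in samples]
print(years)
```

Returning an integer (or `None`) would also let the `year` column be numeric rather than `object`, at the cost of handling the missing case explicitly.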


# Putting it all together

In [58]:
def web_scrap_final(year_, pages_):
    years_req = [str(i) for i in range(year_, 2019)] #scraping every year up to and including 2018
    total_pages_req = [str(i) for i in range(0, pages_)] #up to 50 movies per page, before incomplete rows are dropped

    start_time = time.time()
    total_requests = 0
    imdb_total = {"name":[],"year":[],"imdb":[],"metascore":[],"vote":[],"gross":[]}
    for i in years_req:
        for j in total_pages_req:
            link = "http://www.imdb.com/search/title?release_date="+i+"&sort=num_votes,desc&page="+j
            response = requests.get(link)
            html_soup = BeautifulSoup(response.text,"html.parser")
            total_movies = html_soup.find_all('div', class_ = 'lister-item mode-advanced') #should be 50 items per page

            time.sleep(random.randint(1,4)) #crucial: pause between requests so we don't flood the server
            total_requests += 1
            total_duration = time.time() - start_time
            print("Request: ",total_requests, "Average sec per request: ", total_duration/total_requests)

            #good response = 200, we don't want to raise an exception, as some of the data will still be useful
            if response.status_code != 200:
                warnings.warn("Couldn't get a good response: error {}".format(response.status_code))

            #consolidating dict, appending each page to the end of the previous requests
            page_data = web_scrap(total_movies) #parse the page once instead of once per key
            for key in imdb_total:
                imdb_total[key] += page_data[key]
    return imdb_total

imdb_total = web_scrap_final(2017,5)
df = df_transform(imdb_total)
temp_year = df["year"].apply(year_mod)
df["year"] = temp_year
df.to_csv("scrapped_test.csv")
df.head(10)

Request:  1 Average sec per request:  5.4967029094696045
Request:  2 Average sec per request:  5.474834561347961
Request:  3 Average sec per request:  5.761972983678182
Request:  4 Average sec per request:  5.598272740840912
Request:  5 Average sec per request:  5.124214220046997
Request:  6 Average sec per request:  5.098747690518697
Request:  7 Average sec per request:  5.236070428575788
Request:  8 Average sec per request:  5.215654730796814
Request:  9 Average sec per request:  5.194841437869602
Request:  10 Average sec per request:  5.169923710823059
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 6 columns):
gross        287 non-null int64
imdb         287 non-null float64
metascore    287 non-null int64
movie        287 non-null object
votes        287 non-null int64
year         287 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 13.5+ KB
None


Unnamed: 0,gross,imdb,metascore,movie,votes,year
0,226277068,8.1,77,Logan,510160,2017
1,412563408,7.5,76,Wonder Woman,437645,2017
2,188373161,8.0,94,Dunkirk,419778,2017
3,620181382,7.2,85,Star Wars: Episode VIII - The Last Jedi,416461,2017
4,389813101,7.7,67,Guardians of the Galaxy Vol. 2,412289,2017
5,315058289,7.9,74,Thor: Ragnarok,388630,2017
6,334201140,7.5,73,Spider-Man: Homecoming,359216,2017
7,176040665,7.7,84,Get Out,333607,2017
8,92054159,8.1,81,Blade Runner 2049,332597,2017
9,107825862,7.6,86,Baby Driver,325563,2017
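The random pause and the status check in `web_scrap_final` can be separated from the parsing. The sketch below shows one way to do that, with the fetch step passed in as a callable so it runs offline. `fetch_politely`, `fake_fetch`, and the example.com URLs are all hypothetical stand-ins, not part of the original notebook.

```python
import random
import time

def fetch_politely(urls, fetch, min_delay=1.0, max_delay=4.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch is any callable returning (status_code, body). Pages with a
    non-200 status are recorded and skipped rather than raising.
    """
    pages, failures = [], []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))  # pause between requests only
        status, body = fetch(url)
        if status == 200:
            pages.append(body)
        else:
            failures.append((url, status))
    return pages, failures

# Offline stand-in for a real HTTP call, so the sketch needs no network access.
def fake_fetch(url):
    return (404, "") if url.endswith("page=2") else (200, "<html>" + url + "</html>")

delays = []
pages, failures = fetch_politely(
    ["http://example.com/?page=1", "http://example.com/?page=2", "http://example.com/?page=3"],
    fake_fetch,
    sleep=delays.append,  # record the delays instead of actually sleeping
)
print(len(pages), failures, len(delays))
```

With a real fetcher, something like `lambda url: (lambda r: (r.status_code, r.text))(requests.get(url))` could be passed in place of `fake_fetch`.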


It looks like we kept about 28 movies per page on average (287 movies from 10 requests). So if we want about 2,000 movies, we need roughly 70 requests, about 7x the current run. There are two obvious ways to get there: request more pages per year, or cover more years. We'll do a little of both.
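The estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes the yield per request seen in the 2017-2018 run carries over to other years and to deeper pages, which may not hold.

```python
import math

movies_kept = 287      # rows in the DataFrame from the run above
requests_made = 10     # 2 years x 5 pages
target_movies = 2000

kept_per_request = movies_kept / requests_made
requests_needed = math.ceil(target_movies / kept_per_request)

# A larger run covering 2005-2018 inclusive with 15 pages per year.
planned_requests = len(range(2005, 2019)) * 15
expected_movies = round(planned_requests * kept_per_request)

print(kept_per_request, requests_needed, planned_requests, expected_movies)
```

Under that assumption, 14 years at 15 pages each comfortably exceeds the 2,000-movie target.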

In [59]:
imdb_total = web_scrap_final(2005,15) #from 2005 through 2018, 15 pages per year.
df = df_transform(imdb_total)
temp_year = df["year"].apply(year_mod)
df["year"] = temp_year
df.to_csv("imdb_scrapped.csv")
df.head(10)

Request:  1 Average sec per request:  5.457341909408569
Request:  2 Average sec per request:  4.446001052856445
Request:  3 Average sec per request:  4.760790665944417
Request:  4 Average sec per request:  5.183119058609009
Request:  5 Average sec per request:  4.920573616027832
Request:  6 Average sec per request:  4.685903827349345
Request:  7 Average sec per request:  4.667958157403128
Request:  8 Average sec per request:  4.864551872014999
Request:  9 Average sec per request:  4.790406571494208
Request:  10 Average sec per request:  4.7264656066894535
Request:  11 Average sec per request:  4.6579622788862745
Request:  12 Average sec per request:  4.7119930783907575
Request:  13 Average sec per request:  4.736654318295992
Request:  14 Average sec per request:  4.841744644301278
Request:  15 Average sec per request:  4.736825593312582
Request:  16 Average sec per request:  4.67471744120121
Request:  17 Average sec per request:  4.777819829828599
Request:  18 Average sec per request: 

Request:  145 Average sec per request:  4.702341538462146
Request:  146 Average sec per request:  4.698559334833328
Request:  147 Average sec per request:  4.699916307618018
Request:  148 Average sec per request:  4.688404879054508
Request:  149 Average sec per request:  4.677290268392371
Request:  150 Average sec per request:  4.680028266906739
Request:  151 Average sec per request:  4.6681649748063245
Request:  152 Average sec per request:  4.66234795513906
Request:  153 Average sec per request:  4.673315805547378
Request:  154 Average sec per request:  4.68511332629563
Request:  155 Average sec per request:  4.698854406418339
Request:  156 Average sec per request:  4.690900781215766
Request:  157 Average sec per request:  4.695922777151606
Request:  158 Average sec per request:  4.698834177813953
Request:  159 Average sec per request:  4.693961666814936
Request:  160 Average sec per request:  4.690630862116814
Request:  161 Average sec per request:  4.698386149376816
Request:  162 A

Unnamed: 0,gross,imdb,metascore,movie,votes,year
0,206852432,8.3,70,Batman Begins,1145123,2005
1,70511035,8.2,62,V for Vendetta,919884,2005
2,74103820,8.0,74,Sin City,704534,2005
3,380262555,7.6,68,Star Wars: Episode III - Revenge of the Sith,614004,2005
4,290013036,7.7,81,Harry Potter and the Goblet of Fire,460844,2005
5,186336279,6.5,55,Mr. & Mrs. Smith,400277,2005
6,234280354,6.5,73,War of the Worlds,375174,2005
7,206459076,6.7,72,Charlie and the Chocolate Factory,372191,2005
8,218080025,7.2,81,King Kong,355427,2005
9,109449237,7.1,73,The 40-Year-Old Virgin,350530,2005


Voila! Now we can bring this data back to the original analysis.