## IMDb.com Web Scraper


##### Tom Keith - https://github.com/tomkeith

---


### IMDb URL Structure

IMDb has a clean URL structure that lends itself to scraping. Taking Star Wars as an example: `www.imdb.com/title/tt0076759/`. The IMDb ID (here `tt0076759`) is all that is needed to fetch the page.
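As a minimal sketch of the URL pattern described above, the hypothetical helper `build_title_url` (not part of the notebook) turns an ID into a title-page URL. The check that an ID is `tt` followed by at least seven digits is an assumption about the ID format, not something the notebook enforces.

```python
import re

IMDB_TITLE_URL = "https://www.imdb.com/title/"
# Assumed ID shape: "tt" followed by at least seven digits.
IMDB_ID_PATTERN = re.compile(r"tt\d{7,}")


def build_title_url(imdb_id):
    """Return the title-page URL for an IMDb ID, or raise ValueError."""
    if not IMDB_ID_PATTERN.fullmatch(imdb_id):
        raise ValueError(f"not an IMDb title ID: {imdb_id!r}")
    return f"{IMDB_TITLE_URL}{imdb_id}/"


print(build_title_url("tt0076759"))
# https://www.imdb.com/title/tt0076759/
```

Rejecting malformed IDs before any request is sent keeps obvious typos out of the failure list.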

IMDb IDs can be sourced from the [IMDb open datasets](https://www.imdb.com/interfaces/ "IMDb open datasets") where these unique IDs are represented in the `tconst` column.
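A rough sketch of pulling IDs out of such a dataset file follows. It assumes the title file is tab-separated with `tconst`, `titleType` and `startYear` columns and uses `\N` for missing values. The inline sample is an illustrative stand-in trimmed to three columns, and `movie_ids_for_year` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

# Illustrative stand-in for the dataset file, trimmed to three columns.
# The last row uses a placeholder ID to show a non-movie entry.
SAMPLE_TSV = (
    "tconst\ttitleType\tstartYear\n"
    "tt0010323\tmovie\t1920\n"
    "tt0107290\tmovie\t1993\n"
    "tt0000000\ttvEpisode\t\\N\n"
)


def movie_ids_for_year(handle, year):
    """Yield the tconst of every row that is a movie released in `year`."""
    # QUOTE_NONE: treat quote characters inside titles as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["titleType"] == "movie" and row["startYear"] == str(year):
            yield row["tconst"]


ids_1920 = list(movie_ids_for_year(io.StringIO(SAMPLE_TSV), 1920))
print(ids_1920)  # ['tt0010323']
```

For the real file, the same function can be given an opened file handle instead of the `io.StringIO` sample.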

### Why scrape IMDb if there is an open dataset?

IMDb's open dataset lacked some key features needed for my Movie Genre Prediction project. Most notably, it listed only 3 genres per title (the first three alphabetically), whereas IMDb.com can list 1-7 genres. Additionally, IMDb's open data has no text data (for example, a plot summary), which I also needed for NLP.

Those reasons are the inspiration behind creating this scraper.

### Movie Posters

The function `imdb_scrape` takes an optional second parameter, `save_image` (boolean, default `False`), which saves the movie poster to the `posters/` folder.
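Each poster file is named after the IMDb ID plus the extension taken from the image URL. A minimal sketch of that naming step, written as a hypothetical helper `poster_path` (not part of the notebook), which also tolerates a query string after the file name:

```python
from pathlib import Path
from urllib.parse import urlparse


def poster_path(imdb_id, img_url, folder="posters"):
    """Build the local path for a poster: <folder>/<imdb_id><extension>."""
    # Take the extension from the URL path only, ignoring any ?query part.
    suffix = Path(urlparse(img_url).path).suffix
    return Path(folder) / f"{imdb_id}{suffix}"


# The example.com URL is a placeholder, not a real poster address.
target = poster_path("tt0076759", "https://example.com/img/poster.jpg?size=large")
print(target.as_posix())  # posters/tt0076759.jpg
```

Checking `target.exists()` before downloading gives the same skip-if-present behaviour the notebook relies on.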

#### Notes

The main function `imdb_scrape` is meant to be run in a loop. **This notebook is not meant to be run all at once.** Rather, the main cell (the one that is not a function) is meant to be manually updated before each run. See the notes before that cell.

---

In [1]:
import pandas as pd
import numpy as np

import csv
import sys

import requests
from bs4 import BeautifulSoup
    
from PIL import Image
from io import BytesIO

import re
import json

import time
import random
from pprint import pprint

In [2]:
def to_numeric(string, num_type='int'):
    '''
    Function to strip all non-numeric characters from string and return int or float
    INPUT - String to convert
          - num_type: either 'int' or 'float'
    OUTPUT - int or float type (returns original string if neither specified)
    '''
    if num_type == 'float':
        # Keep the decimal point so '8.1' stays 8.1 rather than becoming 81.0
        x = float( re.sub("[^0-9.]", "", string ) )
    elif num_type == 'int':
        x = int( re.sub("[^0-9]", "", string ) )
    else:
        x = string
    return x


def savePoster(imdb_id, img_url):
    '''
    Function that fetches the poster image from the provided url
    and saves it under the provided id (corresponding with IMDb).
    Won't replace (or even fetch) if the file already exists.
    
    INPUT:  id from imdb, url where to find image
    OUTPUT: boolean flag, True if saved, False if the file already existed.
    '''
    import os
    
    # Get file extension
    ext = img_url.split('.')[-1]
    
    # Check to see if I already have it
    if os.path.isfile(f'posters/{imdb_id}.{ext}'):
        return False
    
    # Make sure the target folder exists before saving
    os.makedirs('posters', exist_ok=True)
    
    # Get image data, and save it as imdb_id
    response = requests.get(img_url)
    img = Image.open(BytesIO(response.content))
    img.save(f'posters/{imdb_id}.{ext}')
    
    return True

def concatenate_list_data(my_list):
    '''
    Joins the string form of every element in my_list into one string.
    '''
    return ''.join(str(element) for element in my_list)

def time_since(start_time):
    '''
    Simple timer calculating time difference between
    start_time input parameter, and now
    
    INPUT: timestamp of starting time
    OUTPUT: string ' 2m45s'
    '''
    # Work in whole seconds so the seconds field can never round up to 60
    elapsed = int(time.time() - start_time)
    mins, secs = divmod(elapsed, 60)
    return f'{mins:2d}m{secs:2d}s'

In [3]:
def imdb_scrape(imdb_id, save_image=False, debug=False):
    '''
    Function which scrapes IMDb using an IMDb ID such as 'tt0107290'. Second parameter
    saves the movie poster (in the posters/ folder). Third parameter prints the result.
    
    This function is meant to be used in a loop. As such, the print outputs may lack
    meaning if used outside of the cells below.
    
    INPUT:  - ID of movie to scrape from IMDB e.g. "tt0076759"
            - boolean to save the movie poster or not (default False)
            - boolean to print result (default False)
           
    OUTPUT: Dictionary of various scraped information.
    
             {'tconst':imdb_id, 'title':'',     'release_year':'',     'release_date':'',
              'MPAA':'',        'genre':[],     'runtime':'',          'poster_url':'',
              'plot_short':'',  'plot_long':'', 'imdb_rating':'',      'num_imdb_votes':'',
              'metacritic':'',  'num_user_reviews':'',                 'num_critic_reviews':''
             }
    '''
    # Target datapoints to scrape (with provided imdb_id)
    imdb_info_dict = {'tconst':imdb_id,'title':'',    'release_year':'',      'release_date':'',
                      'MPAA':'',       'genre':[],    'runtime':'',           'poster_url':'',
                      'plot_short':'', 'plot_long':'', 'imdb_rating':'',      'num_imdb_votes':'',
                      'metacritic':'', 'num_user_reviews':'',                 'num_critic_reviews':''
                     }
    imdb_info_dict['tconst'] = imdb_id
    
    imdb_base_url = 'https://www.imdb.com/title/'
    print(f'{imdb_id.ljust(10)} ', end='')
    
    # Main content - build URL, and soup content
    imdb_full_url = imdb_base_url + imdb_id
    r = requests.get(imdb_full_url).content
    soup = BeautifulSoup(r, 'html.parser')
    print(f'[x]   ', end='')
    
    # Code from js section has json variables
    json_dict = json.loads( str( soup.findAll('script', {'type':'application/ld+json'})[0].text ))

    # Info - Movie title, year, parental content rating, poster url
    imdb_info_dict['title'] = json_dict['name']
    if 'contentRating' in json_dict:
        imdb_info_dict['MPAA'] = json_dict['contentRating'] 
    imdb_info_dict['poster_url'] = json_dict['image']
    imdb_info_dict['release_year'] = int( soup.find('span', {'id':'titleYear'}).a.text )
    imdb_info_dict['runtime'] = to_numeric( soup.find('time')['datetime'] )

    # Release date (from top header)
    date_string = soup.find('div', {'class':'title_wrapper'}).findAll('a')[-1].text.split(' (')[0]
    imdb_info_dict['release_date'] = date_string
    
    # Genres (up to 7)
    imdb_info_dict['genre'] = json_dict['genre']

    # Ratings - IMDb rating (and vote count), Metacritic
    imdb_info_dict['imdb_rating'] = float( json_dict['aggregateRating']['ratingValue'] )
    imdb_info_dict['num_imdb_votes'] = json_dict['aggregateRating']['ratingCount']

    # Metacritic score, if there is one
    metacritic_div = soup.find('div', {'class':'metacriticScore'})
    if metacritic_div is not None:
        imdb_info_dict['metacritic'] = int( metacritic_div.span.text )

    # Reviews - Number of critic and public reviews (different than ratings/votes)
    num_review_list = soup.findAll('div',{'class':'titleReviewBarItem titleReviewbarItemBorder'})
    if num_review_list != []:
        reviews = num_review_list[0].findAll('a')
        if len(reviews) > 1:
            imdb_info_dict['num_critic_reviews'] = to_numeric( reviews[1].text )
        if len(reviews) > 0:
            imdb_info_dict['num_user_reviews'] = to_numeric( reviews[0].text )

    # Plots - long and short versions
    imdb_info_dict['plot_short'] = soup.find('div',{'class':'summary_text'}).text.strip()
    if 'Add a Plot' in imdb_info_dict['plot_short']:
        imdb_info_dict['plot_short'] = ''
    storyline_p = soup.find('div',{'id':'titleStoryLine'}).div.p
    if storyline_p is not None:
        imdb_info_dict['plot_long'] = storyline_p.span.text.strip()
    
    # Plot output
    print(f'[x]   ', end='')

    if save_image:
        img_status = savePoster(imdb_id, imdb_info_dict['poster_url'])
        if img_status:
            print(f'[x]   ', end='')
        else:
            print(f'[ ]   ', end='')
    else:
        print(f'N/A   ', end='')
    
    print(f"{(imdb_info_dict['title']+' ('+str(imdb_info_dict['release_year'])+')')[:100]:100} ", end='')
    time.sleep(random.randint(1,10) / 100)
    
    print('')
    if debug:
        pprint(imdb_info_dict)
    return imdb_info_dict

---
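Most fields in `imdb_scrape` come from the page's JSON-LD `<script>` block instead of the visible HTML. The sketch below shows that extraction step on an inline sample page, so it runs without network access. It uses only the standard library, whereas the notebook itself uses BeautifulSoup. The sample HTML, `JsonLdExtractor`, `extract_json_ld` and `as_list` are illustrative names, and the idea that a single genre may arrive as a plain string is an assumption.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_json_ld(html):
    """Return each JSON-LD block on the page as a parsed Python object."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [json.loads(block) for block in parser.blocks]


def as_list(value):
    """Wrap a lone value in a list so 'genre' always has one shape."""
    return value if isinstance(value, list) else [value]


# Minimal sample page; values mirror the Jurassic Park output in this notebook.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Movie", "name": "Jurassic Park", "contentRating": "PG-13",
 "genre": ["Action", "Adventure", "Sci-Fi", "Thriller"]}
</script>
</head><body></body></html>
"""

movie = extract_json_ld(SAMPLE_HTML)[0]
print(movie["name"])                   # Jurassic Park
print(movie.get("contentRating", ""))  # PG-13
print(as_list(movie["genre"]))         # ['Action', 'Adventure', 'Sci-Fi', 'Thriller']
```

Using `.get` with a default for optional keys such as `contentRating` avoids a `KeyError` on pages that omit them.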

### Sample use of scraper (Jurassic Park)

In [4]:
# Scrape Jurassic Park, do not save the poster, print result
try:
    imdb_scrape('tt0107290', False, True)
except Exception:
    print('error')

tt0107290  [x]   [x]   N/A   Jurassic Park (1993)                                                                                 
{'MPAA': 'PG-13',
 'genre': ['Action', 'Adventure', 'Sci-Fi', 'Thriller'],
 'imdb_rating': 8.1,
 'metacritic': 68,
 'num_critic_reviews': 355,
 'num_imdb_votes': 820733,
 'num_user_reviews': 1085,
 'plot_long': 'Huge advancements in scientific technology have enabled a mogul '
              'to create an island full of living dinosaurs. John Hammond has '
              'invited four individuals, along with his two grandchildren, to '
              'join him at Jurassic Park. But will everything go according to '
              'plan? A park employee attempts to steal dinosaur embryos, '
              'critical security systems are shut down and it now becomes a '
              'race for survival with dinosaurs roaming freely over the '
              'island.',
 'plot_short': 'A pragmatic paleontologist visiting an almost complete theme '
               'park

The above is just an example of the information scraped for 1 movie. Time to automate this!

---

### Automation with Loops

Below are two cells. The first scrapes all the movies from the csv (using the IDs) for one provided year, while the second scrapes every year in a *range* of years.

After each year, the scraped content (a list of dictionaries) is converted to a DataFrame and saved as a .tsv, one file per year (for example, `imdb_scrape_2001.tsv`). We should end up with 100 .tsv files.

**Why by year?** It seemed like a good way to break down my scraping into manageable chunks.
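Once the per-year files exist, they need to be stitched back together. The following is a sketch of that round trip using only the standard `csv` module (the notebook itself writes the files with pandas). `write_year` and `load_all_years` are hypothetical helpers, and the records are placeholders, not real scrape results. One point worth noting: a list such as `genre` is stored as its text form, so it has to be parsed back into a list on load.

```python
import ast
import csv
import glob
import os
import tempfile


def write_year(folder, year, movies):
    """Write one year's list of dicts to <folder>/imdb_scrape_<year>.tsv."""
    path = os.path.join(folder, f"imdb_scrape_{year}.tsv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(movies[0]),
                                delimiter="\t", quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(movies)
    return path


def load_all_years(folder):
    """Read every imdb_scrape_<year>.tsv in `folder` back into one list."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "imdb_scrape_*.tsv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                # A list is stored as its text form, e.g. "['Drama', 'Horror']".
                if row["genre"].startswith("["):
                    row["genre"] = ast.literal_eval(row["genre"])
                rows.append(row)
    return rows


# Placeholder records, not real scrape results.
with tempfile.TemporaryDirectory() as folder:
    write_year(folder, 2001, [{"tconst": "id-a", "genre": ["Drama", "Horror"]}])
    write_year(folder, 2002, [{"tconst": "id-b", "genre": ["Comedy"]}])
    combined = load_all_years(folder)

print(len(combined))         # 2
print(combined[0]["genre"])  # ['Drama', 'Horror']
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is all a stored list needs.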

---

In [5]:
import csv
import sys

# Need to provide a csv of IMDb IDs to scrape
movie_df = pd.read_csv('imdb_movie_list.csv')

failed_list = []

yr = 1920
movies_in_year = movie_df[(movie_df['year'] == yr)]

print('--------------------------------------------------------------------------')    
print(f'Scraping movies: {len(movies_in_year)}   Year: {yr}')
print('')
print('Count   tconst     Get   Parse Img   Title')

start_time = time.time()
annual_movie_list = []
fails=0

for i, tconst in enumerate(movies_in_year['tconst'].values):
    print(f'{i+1:5d}   ', end = '')
    try:
        scraped_movie_info = imdb_scrape(tconst, True) #Change to false to NOT save movie poster
        annual_movie_list.append(scraped_movie_info)
    except Exception:
        print(f'--------- FAILED ---------- FAILED ---------- FAILED ----------')
        failed_list.append(tconst)
        fails+=1

print(f'Movies scraped: {len(annual_movie_list)}   Fails: {fails}   ', end='')

my_df = pd.DataFrame(annual_movie_list)
my_df.to_csv(f'rawdata/imdb_scrape_{yr}.tsv', sep='\t', quoting=csv.QUOTE_ALL, index=False)
print('\n')
print(f'Saved: imdb_scrape_{yr}.tsv     ', end='')
print(f'Time taken: {time_since(start_time)}')
print('')

--------------------------------------------------------------------------
Scraping movies: 11   Year: 1920

Count   tconst     Get   Parse Img   Title
    1   tt0010323  [x]   [x]   [x]   Das Cabinet des Dr. Caligari (1920)                                                                  
    2   tt0011237  [x]   [x]   [x]   Der Golem, wie er in die Welt kam (1920)                                                             
    3   tt0011841  [x]   [x]   [x]   Way Down East (1920)                                                                                 
    4   tt0011130  [x]   [x]   [x]   Dr. Jekyll and Mr. Hyde (1920)                                                                       
    5   tt0011870  [x]   [x]   [x]   Within Our Gates (1920)                                                                              
    6   tt0011439  [x]   [x]   [x]   The Mark of Zorro (1920)                                                                             
    7   tt0011

---

The previous cell only did one year. The below cell can do a range of years.

In [6]:
import csv
import sys

#failed_list = [] # Run this once to initialize, then comment out to append continuously

# Need to provide a csv of IMDb IDs to scrape
movie_df = pd.read_csv('imdb_movie_list.csv')

years_to_scrape_list = range(1920,1925) #[1973, 1972, 1971, 1970]
for yr in years_to_scrape_list:
    movies_in_year = movie_df[(movie_df['year'] == yr)]

    print('--------------------------------------------------------------------------')
    print(f'Scraping movies: {len(movies_in_year)}   Year: {yr}')
    print('')
    print('Count   tconst     Get   Parse Img   Title')
    start_time = time.time()
    annual_movie_list = []

    fails=0
    for i, tconst in enumerate(movies_in_year['tconst'].values):
        print(f'{i+1:5d}   ', end = '')
        try:
            scraped_movie_info = imdb_scrape(tconst, True) #Change to false to NOT save movie poster
            annual_movie_list.append(scraped_movie_info)
        except Exception:
            print(f'--------- FAILED ---------- FAILED ---------- FAILED ----------')
            failed_list.append(tconst)
            fails+=1

    print(f'Movies scraped: {len(annual_movie_list)}   Fails: {fails}   ', end='')

    my_df = pd.DataFrame(annual_movie_list)
    my_df.to_csv(f'rawdata/imdb_scrape_{yr}.tsv', sep='\t', quoting=csv.QUOTE_ALL, index=False)
    print('\n')
    print(f'Saved: imdb_scrape_{yr}.tsv     ', end='')
    print(f'Time taken: {time_since(start_time)}')
    print('')

--------------------------------------------------------------------------
Scraping movies: 11   Year: 1920

Count   tconst     Get   Parse Img   Title
    1   tt0010323  [x]   [x]   [ ]   Das Cabinet des Dr. Caligari (1920)                                                                  
    2   tt0011237  [x]   [x]   [ ]   Der Golem, wie er in die Welt kam (1920)                                                             
    3   tt0011841  [x]   [x]   [ ]   Way Down East (1920)                                                                                 
    4   tt0011130  [x]   [x]   [ ]   Dr. Jekyll and Mr. Hyde (1920)                                                                       
    5   tt0011870  [x]   [x]   [ ]   Within Our Gates (1920)                                                                              
    6   tt0011439  [x]   [x]   [ ]   The Mark of Zorro (1920)                                                                             
    7   tt0011

    7   tt0015174  [x]   [x]   [x]   Die Nibelungen: Kriemhilds Rache (1924)                                                              
    8   tt0014945  [x]   [x]   [x]   Girl Shy (1924)                                                                                      
    9   tt0014972  [x]   [x]   [x]   He Who Gets Slapped (1924)                                                                           
   10   tt0014646  [x]   [x]   [x]   Aelita (1924)                                                                                        
   11   tt0015202  [x]   [x]   [x]   Orlacs Hände (1924)                                                                                  
   12   tt0015016  [x]   [x]   [x]   The Iron Horse (1924)                                                                                
   13   tt0014586  [x]   [x]   [x]   Das Wachsfigurenkabinett (1924)                                                                      
   14   tt0015136  [x]   [x

---

#### Optional

The IDs of failed scrapes are tracked in `failed_list`. They are kept for reference in case you need those movies scraped later.
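One possible way to give the failed IDs a second pass is sketched below. `retry_failed` is a hypothetical helper, not part of the notebook. It takes the scraping function as an argument, so it is demonstrated here with a stand-in scraper and placeholder IDs.

```python
import time


def retry_failed(failed_ids, scrape_fn, pause=1.0, sleep=time.sleep):
    """Re-run scrape_fn on each failed ID; return (results, still_failed)."""
    results, still_failed = [], []
    for imdb_id in failed_ids:
        try:
            results.append(scrape_fn(imdb_id))
        except Exception:
            still_failed.append(imdb_id)
        sleep(pause)  # brief pause between requests
    return results, still_failed


# Stand-in scraper for demonstration: succeeds for one ID, fails for the other.
def fake_scrape(imdb_id):
    if imdb_id == "id-bad":
        raise ValueError("page could not be parsed")
    return {"tconst": imdb_id}


recovered, remaining = retry_failed(["id-ok", "id-bad"], fake_scrape, pause=0)
print(recovered)  # [{'tconst': 'id-ok'}]
print(remaining)  # ['id-bad']
```

In the notebook this could be called as `recovered, failed_list = retry_failed(failed_list, imdb_scrape)`, so that only the IDs that fail twice remain in the list.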

In [7]:
len(failed_list)

0

In [8]:
failed_list

[]