In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Introduction

The Internet Movie Database, IMDb.com, is the world's largest source of movie data.  It provides data on nearly 4,000,000 films, with dates ranging from 1874 to the present.  This provides a rich opportunity to explore any number of film-related hypotheses or inquiries.  However, it is difficult to extract and use this data for analytical purposes, because IMDb explicitly prohibits scraping its pages, and it does not provide a free API. Third-party alternatives like the Open Movie Database (OMDb API) exist that allow users to query IMDb data, but these are highly constrained in the breadth and depth of queries that can be run. For its part, IMDb only provides "a way access IMDb locally by holding copies of the data directly on your system" by accessing "a subset of the IMDb plain text data files... from our FTP sites."

<img src="IMDB_FTP_description.png" style="width: 600px;"/>

In this tutorial, we will access IMDb data, freely and legally, using the "ftp.fu-berlin.de (Germany)" link on IMDb's Alternative Interfaces page.  Clicking on this link takes us here:

<img src="PDS_tutorial_file_list.png" style="height: 400px;"/>

This page provides a list of all freely available data files from IMDb.  It has files for lists of movies, actors, actresses, directors, cinematographers, etc.  For our purposes, we will work with two of these files: directors and cinematographers.  By the end of this tutorial, we will have shown how to:

1. Locate and download specific IMDb.com text files containing the data of interest
2. Process these text files into a suitable format (.csv) for loading into a simple SQL database and/or Pandas DataFrames
3. Load the data from each text file into its own table in our database
4. Develop a script for querying the database that can handle misspellings and disambiguation of names and titles
5. Answer questions about collaboration, like "Which cinematographer did director Woody Allen work with the most?"

Readers of this tutorial should subsequently be able to select any of the IMDb data files, preprocess and load them into a database, and generate their own cinematic queries to learn about collaboration networks in film.


## Installing and importing the necessary libraries

In order to complete the tasks in this tutorial, we'll require Pandas, sqlite3, and csv.  The sqlite3 and csv modules are part of Python's standard library, so they need no separate installation.  If you don't already have Pandas installed, you can get it with the following command in your machine's command terminal:

\$ conda install pandas

Then, to begin our code, we will need to import each of these libraries as follows:

In [2]:
import pandas as pd
import sqlite3
import csv

## Downloading the text files from IMDb

Now, it's time to download the text files as discussed in the introduction.  We'll want to navigate to the IMDb Alternative Interfaces page (http://www.imdb.com/interfaces), and click on "ftp.fu-berlin.de."  This takes us to the repository of freely available IMDb files.  We'll want to download "directors.list.gz" and "cinematographers.list.gz".  After unzipping them, we'll want to save their .list files in the same directory as our notebook.  These .list files can be opened and edited using a standard text editor.  Let's open each of them to see what data they contain and in what format.
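If you would rather unzip the downloaded files from Python than with a desktop tool, the standard library's gzip module can do it.  This is a minimal sketch; the helper name `gunzip` is our own and is not part of any library:

```python
import gzip
import shutil


def gunzip(gz_path, out_path):
    """Decompress gz_path into out_path, streaming so the whole file is never held in memory."""
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, calling `gunzip('directors.list.gz', 'directors.list')` from the notebook's directory would write directors.list next to the notebook.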

In directors.list, we first encounter some summary information at the top of the file, including a usage policy.  We'll want to simply delete this.  Below that, we encounter a key that contains important information about the data in the file:

<img src="PDS_tutorial_directors_key.png" style="width: 600px;"/>

We'll want to save this to another location, as it will be a useful reference tool later.  Then, we'll remove it from the document, along with everything else that precedes the start of the actual data:

#### <center>Top of file before:</center>
<img src="PDS_tutorial_directors_before.png" style="height: 400px;"/>

#### <center>Top of file after:</center>
<img src="PDS_tutorial_directors_after.png" style="height: 400px;"/>

Finally, we must go to the bottom of the file and remove the concluding description text (non-data) that also occurs there:

#### <center>Bottom of file before:</center>
<img src="bottom_before.png" style="height: 400px;"/>

#### <center>Bottom of file after:</center>
<img src="bottom_after.png" style="height: 400px;"/>

Now, upon looking at the data, we see that the three core pieces of information provided are the director's name, the names of the films he/she directed, and their years of release.  This is the information we'll concern ourselves with for the purposes of this tutorial.  We can also immediately see some potential problems: some films have unknown release years, shown by "(????)", and some directors' names include punctuation and non-English symbols that could make them difficult to process.  We can also see that, since most directors have made multiple films, there are multiple rows of films associated with a director, whose name appears only on the first of those rows:

<img src="director_sample.png" style="height: 300px;"/>

Fortunately for us, the cinematographers.list file is formatted exactly the same as the directors.list file, except of course that the cinematographer's name is included instead of the director's.  In fact, this is the case for all people files in the IMDb Alternative Interfaces repository (actors, producers, screenwriters, etc.).  This means that, after (quickly) manually removing the tops and bottoms of any of these files, we can feed them to a single processing script.
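The manual trimming described above could also be scripted.  The following is a minimal sketch, assuming each file has one recognisable line that ends the header and one that begins the footer.  The function name `trim_list_lines` and both marker arguments are hypothetical; inspect your own copy of the file to choose suitable markers before relying on it:

```python
def trim_list_lines(lines, header_end_marker, footer_start_marker):
    """Return only the lines strictly between the two marker lines.

    Both markers are assumptions supplied by the caller: a prefix of the
    last header line and a prefix of the first footer line.
    """
    start = None
    end = len(lines)
    for idx, line in enumerate(lines):
        if start is None:
            # Still in the header: look for the line that ends it
            if line.startswith(header_end_marker):
                start = idx + 1
        elif line.startswith(footer_start_marker):
            # First footer line found: data stops just before it
            end = idx
            break
    if start is None:
        raise ValueError("header marker not found")
    return lines[start:end]
```

If the footer marker never appears, everything after the header is kept, so it is worth checking the last few returned lines by eye.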

## Processing the .list files into .csv files

In this section, we're going to implement all the processing necessary to convert the raw data in an IMDb .list file to a proper, usable .csv file.

First, let's get some sense of what our raw data looks like if we simply read it as is, using Python's csv library to print the first 19 rows (along with the length of each row):


In [75]:
index = 1

with open('directors.list', 'rb') as test_file:
    csv_reader = csv.reader(test_file, delimiter='\t', quotechar='|')
    for row in csv_reader:
        print row
        print len(row)
        index += 1
        if index == 20:
            break

["'Abd Al-Hamid, Ja'far", 'A Two Hour Delay (2001)']
2
['', '', '', 'Badgeless sur la Croisette (2012) {{SUSPENDED}}']
4
['', '', '', 'Just Outside the Frame: The Profilmic Event and Beyond (2008)']
4
['', '', '', 'Mesocafe (2009) {{SUSPENDED}}']
4
['', '', '', 'Mesocaf\xe9 (2011)']
4
[]
0
["'D.J'Arlia, Domenic", "She'll Never Know (2012)"]
2
[]
0
["'Dada' Pecori, Diego", 'Adam (????)  (attached)']
2
['', '', '', 'Cantarella (2011)']
4
['', '', '', 'Makhno Beer (2010)']
4
[]
0
["'Kid Niagara' Kallet, Harry", 'Drug Demon Romance (2012)  (co-director)']
2
[]
0
["'Kusare, Mak (I)", 'Baby Beautiful (2013/II)']
2
['', '', '', 'Comrade (2008)']
4
[]
0
["'Kusare, Mak (II)", 'A Play Called a Temple Made of Clay (2014)']
2
[]
0


Several issues jump out immediately:

1. The director's first and last names are included together in the format "last, first".  We will need to split up this name string if we want to separate first and last name.
2. The year is included in the same string as the film title.
3. For some films, other special comments (e.g. "(co-director)" or "{{SUSPENDED}}") are also included at the end of the string containing the film title and year.
4. Some lines have empty elements for the director's name; this occurs due to the formatting of the input file.  Importantly, such rows contain three empty strings, followed by the string containing the title, year, etc. of the film.
5. Some lines are blank, because the input document uses empty lines to separate directors.
6. When lines that include directors' names are returned as lists (using tab-delimited csv reading), these lists vary in length from 2 to 4, with some of them including empty elements.

In the code we write to process this data contained in the input file, we'll want to have separate elements for first and last names of directors, for better querying.  We'll also want to separate the film titles from their release dates.
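To make the two splitting requirements concrete, here is a small standalone sketch that uses a regular expression for the title and year.  It is an illustration only, not the approach taken in the `text_to_csv` function of this tutorial, and the names `split_name`, `split_title_year` and `TITLE_YEAR` are our own:

```python
import re

# A title, then the first parenthesised group that looks like a year:
# four digits or "????", optionally followed by a "/II"-style suffix.
TITLE_YEAR = re.compile(r'^(?P<title>.*?) \((?P<year>\d{4}|\?{4})(?:/[IVXLC]+)?\)')


def split_name(raw_name):
    """Split "last, first" on the first ", " only; return None if there is no ", "."""
    last, sep, first = raw_name.partition(', ')
    return (last, first) if sep else None


def split_title_year(raw_title):
    """Return (title, year), with year None when it is "????"; return None if nothing matches."""
    match = TITLE_YEAR.match(raw_title)
    if match is None:
        return None
    year = match.group('year')
    return match.group('title'), (None if year == '????' else year)
```

Applied to strings from the sample above, `split_title_year('Baby Beautiful (2013/II)')` gives `('Baby Beautiful', '2013')` and `split_title_year('Adam (????)  (attached)')` gives `('Adam', None)`.  Trailing comments such as "(co-director)" are simply ignored.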

We'll create a function that writes our raw input data to a properly formatted .csv; that function will address these known issues and meet the requirements specified above.

In [5]:
#Maintain a list of valid positions (i.e. positions for which we currently have data files available)
positions_list = ["directors", "cinematographers"]

#Write a function to process the raw .list files after the manual preprocessing described above has been completed.
def text_to_csv(input_filepath, output_filepath, position):
    """
    Inputs:
        input_filepath (str) : the filepath to the .list file that contains the relevant IMDb data
        output_filepath (str) : the filepath to the empty .csv file to which output will be written
        position (str) : category of the position for the current list (e.g. "directors" or "cinematographers")
    Outputs:
        nothing is directly returned from this function; the result is that the output csv file is populated with the data
    """
    
    #First, we check to make sure that the filepaths and position are both valid.
    #If the filepath is not valid, print an error message and exit the function
    try:
        input_file = open(input_filepath, 'rb')
        #Recall that our input data file is (very messily) tab delimited
        csv_reader = csv.reader(input_file, delimiter='\t')
    except IOError:
        print "Input file not found; please provide a valid input filepath"
        return
    
    try:
        output_file = open(output_filepath, 'wb')
        csv_writer = csv.writer(output_file, delimiter='|')
    except IOError:
        print "Output file not found; please provide a valid output filepath"
        return    
    
    #If the position is not valid, print an error message and exit the function
    if position not in positions_list:
        print "Position not valid; please enter one of these valid positions:"
        for pos in positions_list:
            print pos
        return
    
    #Now we begin our main goal for this function: converting the raw data file into a valid .csv file.
    #That .csv file will later be read into a Pandas dataframe.  This raw text processing represents the
    #bulk of the difficult work in this tutorial.  We will need to account for many potential edge cases
    #of text formatting, while capturing all of the valid data in the file.
    #Rather than editing in place on the raw input file, we will write to a new, empty .csv file.
    
    i = 0
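    #Use both files as context managers, so the output file is flushed and closed once we finish writing
    with input_file, output_file: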
        
        #write headers to first row of output file
        csv_writer.writerow(['last_name','first_name','title','year'])
        
        last_name = None
        first_name = None
        
        for row in csv_reader:
            
            #1. skip over empty lines, and note that this means we're about to see a new director
            if len(row) == 0:
                continue
            
            #2. for rows containing a person's name...
            elif(row[0] != ''):
                #...remove empty elements if there are any. This should result in a list of length 2,
                #of the form:    ["last_name, first_name", "film_title (YYYY) ..."]
                row = [x for x in row if x != '']
                #if the length of the row is still not 2, we've hit an anomaly in the data; just skip the row
                #to avoid corrupting our output file
                if len(row) != 2:
                    continue
                #Otherwise, we parse the row into last_name, first_name, title, and release_year
                else:
                    name = row[0].split(', ')
                    #If the name doesn't split into exactly a last and a first name, we've hit an anomaly.
                    #Rather than dropping all of the films associated with this person,
                    #for now we'll substitute '!' flags as name placeholders.
                    if len(name) != 2:
                        name = ['!','!']
                    #Now, parse the last name, first name, film title, and year
                    last_name = name[0]
                    first_name = name[1]
                    title_and_year = row[1].split(' (')
                    title = title_and_year[0]
                    year = title_and_year[1][0:4]
                    parsed_row = [last_name, first_name, title, year]
                    #We're done parsing the row! Write it to the output file
                    csv_writer.writerow(parsed_row)
                               
            #3. If the line is not empty and doesn't begin with a name, then we know we're on a line within a given
            #person's block of films, so we want to make sure we associate that line with that person
            else:
                #All such lines should be lists of length 4. If we see an anomaly, simply skip the line.
                if len(row) != 4:
                    continue
                #We know that the first three elements of the line will be empty strings, so ignore them.
                #We just want the title and date
                title_and_year = row[3].split(' (')
                title = title_and_year[0]
                year = title_and_year[1][0:4]
                #Year should be numeric; if it's not, we've hit an anomaly. Skip the row and continue
                if not year.isdigit():
                    continue
                parsed_row = [last_name, first_name, title, year]
                #We're done parsing the row! Write it to the output file
                csv_writer.writerow(parsed_row)
            i += 1
    
    print "total rows: " + str(i)
    return

In [6]:
#Let's test out the code we've written so far. 
print "Directors"
text_to_csv('directors.list','directors_data.csv','directors')
print "Cinematographers"
text_to_csv('cinematographers.list', 'cinematographers_data.csv','cinematographers')

Directors
total rows: 2600086
Cinematographers
total rows: 1358342


## Reading our csv files into Pandas DataFrames

Now let's load the data from the csv files we generated into separate Pandas DataFrames:

In [7]:
#We'll write a simple function to convert our csv files to Pandas DataFrames,
#and to pickle them so we don't have to reload them again later

def csv_to_df(input_filename, pickle_filename):
    df = pd.read_csv(input_filename, sep='|')
    df.to_pickle(pickle_filename)
    return df

#Load the dataframes
directors_df = csv_to_df("directors_data.csv", "directors_pickle.pkl")
cinematographers_df = csv_to_df("cinematographers_data.csv", "cinematographers_pickle.pkl")
    

In [8]:
#Let's do a quick check of the DataFrames to make sure they match our expectations:
print "Directors: shape and head"
print directors_df.shape
print directors_df.head()
print ""
print "Cinematographers: shape and head"
print cinematographers_df.shape
print cinematographers_df.head()

Directors: shape and head
(2600086, 4)
       last_name first_name  \
0  'Abd Al-Hamid     Ja'far   
1  'Abd Al-Hamid     Ja'far   
2  'Abd Al-Hamid     Ja'far   
3  'Abd Al-Hamid     Ja'far   
4  'Abd Al-Hamid     Ja'far   

                                               title  year  
0                                   A Two Hour Delay  2001  
1                         Badgeless sur la Croisette  2012  
2  Just Outside the Frame: The Profilmic Event an...  2008  
3                                           Mesocafe  2009  
4                                           Mesocafé  2011  

Cinematographers: shape and head
(1358342, 4)
       last_name first_name                             title  year
0  'Cali' Quiroz      Oscar  Manchild: The Schea Cotton Story  2016
1   'Chito' Ríos     Daniel                    About the Dead  2016
2  'Dada' Pecori      Diego                       Makhno Beer  2010
3        'Kusare    Mak (I)                    Baby Beautiful  2013
4          'Lang     

## Loading IMDb data into sqlite3

Using Pandas DataFrames to hold our data is probably fine if we only want to use a couple of the shorter files from IMDb.  However, if we wanted to pull data on a dozen different positions (e.g. actors, actresses, producers, etc.), we'd probably be better off storing it in a simple database, for more efficient joins and queries.  Now, we'll write a function to load data from any IMDb personnel file into its own table in sqlite3:

In [9]:
def load_film_data(connection, filepath, table_name):
    """ Load IMDb data in the files as tables into an in-memory SQLite database
    Input:
        connection (sqlite3.Connection) : database connection
        filepath (str) : path to input csv file
        table_name (str) : name of the table in our database
    Output:
        None (result is that the given file is uploaded to a table in the database)
    """
    #Initialize the cursor using the connection given in the function parameters 
    cursor = connection.cursor()
    
    #Write the initial sql command to create the table, given the table_name in the function parameters
    execution_str = "CREATE TABLE " + table_name + " (last_name TEXT, first_name TEXT, title TEXT, year TEXT)"
    cursor.execute(execution_str)
    
    #Now that we've created the table, we're prepared to load the data from our csv file into it
    #We must specify that our delimiter is "|" (the default delimiter is a comma)
    with open(filepath,'rb') as film_file:
        dict_reader = csv.DictReader(film_file, delimiter='|')
        to_db = [(i['last_name'], i['first_name'],i['title'],i['year']) for i in dict_reader]

    #Finally, we load our data by executing a sql statement to populate the given table
    cursor.executemany("INSERT INTO " + table_name + " (last_name, first_name, title, year) VALUES (?, ?, ?, ?);", to_db)
    connection.commit()
    
    return

Now, let's load our directors and cinematographers data into sqlite tables, using the function we wrote above:

In [10]:
#First, remember to open a connection to the database
con = sqlite3.connect(":memory:")
con.text_factory = str
#load the directors and cinematographers csv files into the database
load_film_data(con, "directors_data.csv", "directors")
load_film_data(con, "cinematographers_data.csv", "cinematographers")
cursor = con.cursor()
#Finally, let's check to make sure our function calls executed properly by printing the table names in our database:
for row in cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table';"):
    print row

('directors',)
('cinematographers',)
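Since the queries later in this tutorial join tables on title and year, an index on those two columns may speed the joins up on files of this size.  The sketch below is an optional extra, not something the rest of the tutorial depends on; the helper name `add_join_indexes` and the index naming scheme are our own assumptions:

```python
import sqlite3


def add_join_indexes(connection, table_names):
    """Create a (title, year) index on each named table.

    Table names are interpolated into the SQL, so only plain
    alphanumeric names are accepted.
    """
    cursor = connection.cursor()
    for name in table_names:
        if not name.isalnum():
            raise ValueError("unexpected table name: %r" % name)
        cursor.execute(
            "CREATE INDEX IF NOT EXISTS idx_%s_title_year ON %s (title, year)" % (name, name)
        )
    connection.commit()
```

With the connection opened above, this would be called as `add_join_indexes(con, ["directors", "cinematographers"])`.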


## The fun part: querying our film database for cinematic insights

We're finally ready to start querying!  Now, it's important to remember why we went to all this trouble; if we wanted to ask questions like "Who was director X's cinematographer on film Y," or, "Which films did director X do between years Y and Z?", we could have done that quickly and easily enough on IMDb.com.

But we're interested in *collaboration over time*.  We want to ask questions like, "Which cinematographer did director X collaborate with most often (during years Y through Z)?", or, "Which directors worked with the same cinematographers on more than X% of the films they made?"

So, let's ask them, and answer them.  We'll now write and execute functions that let us answer collaboration questions like the ones above.

We'll start by creating a function that returns all the films associated with a given person, for use in later functions:

In [11]:
def get_films_and_years(first_name, last_name, position, cursor):
    """ Retrieves all films and dates associated with the person given in the input
    Input:
        first_name (str): first name of person
        last_name (str): last name of person
        position (str): position category of main person (so we can look up in the proper database table)
        cursor (sqlite3.Cursor): cursor object to query the database
    Output:
        titles_and_dates (dict): dictionary of film titles (keys) and dates (values) for the person in the input
    """
    #Get the films and years for a given person (position, first_name, last_name).
    #To avoid ambiguities with middle names, numerals, etc., we match names with a trailing wildcard.
    #Note that the table name is interpolated directly into the SQL, so position must be a trusted value.
    query_string = "SELECT title, year FROM %s WHERE %s LIKE ? AND %s LIKE ?"
    results = cursor.execute(query_string % (position, 'last_name', 'first_name'), (last_name+"%", first_name+"%"))
    
    titles_and_dates = {}
    
    for row in results:
        titles_and_dates[row[0]] = row[1]
    
    return titles_and_dates

Let's test our function for one of history's most lauded film directors, Satyajit Ray:

In [12]:
sr_films = get_films_and_years('Satyajit', 'Ray', 'directors', cursor)

#Let's sort the results from most recent to earliest
import operator
sorted_sr_films = sorted(sr_films.items(), key=operator.itemgetter(1), reverse=True)
for film in sorted_sr_films:
    print film
print "Total films: " + str(len(sorted_sr_films))

('Agantuk', '1991')
('Shakha Proshakha', '1990')
('Ganashatru', '1989')
('Sukumar Ray', '1987')
('Ghare-Baire', '1984')
('Pikoor Diary', '1981')
('Sadgati', '1981')
('Heerak Rajar Deshe', '1980')
('Joi Baba Felunath', '1979')
('Shatranj Ke Khilari', '1977')
('Bala', '1976')
('Jana Aranya', '1976')
('Sonar Kella', '1974')
('Ashani Sanket', '1973')
('The Inner Eye', '1972')
('Seemabaddha', '1971')
('Sikkim', '1971')
('Aranyer Din Ratri', '1970')
('Pratidwandi', '1970')
('Goopy Gyne Bagha Byne', '1969')
('Chiriyakhana', '1967')
('Nayak', '1966')
('Two', '1965')
('Mahapurush', '1965')
('Kapurush', '1965')
('Charulata', '1964')
('Mahanagar', '1963')
('Kanchenjungha', '1962')
('Abhijaan', '1962')
('Teen Kanya', '1961')
('Rabindranath Tagore', '1961')
('Devi', '1960')
('Apur Sansar', '1959')
('Parash Pathar', '1958')
('Jalsaghar', '1958')
('Aparajito', '1956')
('Pather Panchali', '1955')
Total films: 37
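The lookup above only succeeds when the name is spelled the way IMDb spells it.  One way to tolerate misspellings is to suggest close matches from a list of known names, using difflib from the standard library.  This is a minimal sketch under the assumption that the candidate names have already been pulled into a Python list (for instance with a SELECT DISTINCT query); the function name `suggest_names` is our own:

```python
import difflib


def suggest_names(query, known_names, limit=3, cutoff=0.6):
    """Return up to `limit` known names that most closely resemble `query`, ignoring case."""
    # Map lower-cased names back to their original spelling
    lowered = {}
    for name in known_names:
        lowered.setdefault(name.lower(), name)
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=cutoff)
    return [lowered[m] for m in matches]
```

A call such as `suggest_names("Satyajeet Ray", names)` could then offer the closest spellings for the user to pick from.  Comparing against every name in a table with millions of rows may be slow, so narrowing the candidates first (for example by first letter) would likely be worthwhile.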


## Looking at collaborations

Now that we can pull films and dates associated with a person, we can also look to see what other people (in what other positions) that person collaborated with.  Let's write a function that does this:

In [13]:
def get_collaborators(first_name, last_name, main, collab, cursor):
    """
    Input:
        first_name (str): first name of person
        last_name (str): last name of person
        main (str): position category of main person
        collab (str): position category of collaborator
        cursor (sqlite3.Cursor): cursor object to query the database
    Output:
        collab_list (list): list of 4-tuples of form (collaborator first name, collaborator last name, film title, year)
    """
    
    inner_join_query = """
    SELECT %s.last_name AS collab_last, %s.first_name AS collab_first, %s.title, %s.year 
    FROM %s INNER JOIN %s 
    ON %s.title = %s.title AND %s.year = %s.year
    WHERE %s.first_name=? AND %s.last_name=?
    """
    
    var_tuple = (collab, collab, collab, collab, main, collab, main, collab, main, collab, main, main)
    val_tuple = (first_name, last_name)
    
    collab_result = cursor.execute(inner_join_query % var_tuple, val_tuple)

    collab_list = []
    
    for row in collab_result:
        collab_list.append(row)
    
    return collab_list

Let's test out our newest function on Satyajit Ray again, to find all the cinematographers he collaborated with, and on what films:

In [14]:
test = get_collaborators('Satyajit', 'Ray', 'directors', 'cinematographers', cursor)
#Let's sort our results from most recent to oldest
test_sorted = sorted(test, key=lambda x: x[3], reverse=True)
for collab in test_sorted:
    print collab

('Raha', 'Barun (I)', 'Agantuk', '1991')
('Raha', 'Barun (I)', 'Shakha Proshakha', '1990')
('Raha', 'Barun (I)', 'Ganashatru', '1989')
('Roy', 'Soumendu', 'Ghare-Baire', '1984')
('Roy', 'Soumendu', 'Sadgati', '1981')
('Roy', 'Soumendu', 'Heerak Rajar Deshe', '1980')
('Roy', 'Soumendu', 'Joi Baba Felunath', '1979')
('Roy', 'Soumendu', 'Shatranj Ke Khilari', '1977')
('Roy', 'Soumendu', 'Jana Aranya', '1976')
('Roy', 'Soumendu', 'Sonar Kella', '1974')
('Roy', 'Soumendu', 'Ashani Sanket', '1973')
('Roy', 'Soumendu', 'The Inner Eye', '1972')
('Roy', 'Soumendu', 'Seemabaddha', '1971')
('Roy', 'Soumendu', 'Sikkim', '1971')
('Roy', 'Soumendu', 'Aranyer Din Ratri', '1970')
('Bose', 'Purnendu', 'Pratidwandi', '1970')
('Roy', 'Soumendu', 'Pratidwandi', '1970')
('Roy', 'Soumendu', 'Goopy Gyne Bagha Byne', '1969')
('Roy', 'Soumendu', 'Chiriyakhana', '1967')
('Mitra', 'Subrata', 'Nayak', '1966')
('Roy', 'Soumendu', 'Kapurush', '1965')
('Roy', 'Soumendu', 'Mahapurush', '1965')
('Mitra', 'Subrata', 'C
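Questions restricted to a span of years ("during years Y through Z") can be answered by filtering the list that `get_collaborators` returns.  A minimal sketch, assuming the 4-tuple layout (last name, first name, title, year) documented in that function; the helper name `collabs_between` is our own:

```python
def collabs_between(collab_list, start_year, end_year):
    """Keep the (last, first, title, year) tuples whose year lies in [start_year, end_year]."""
    kept = []
    for row in collab_list:
        year = row[3]
        # Years are stored as text; skip anything non-numeric such as "????"
        if year.isdigit() and start_year <= int(year) <= end_year:
            kept.append(row)
    return kept
```

For instance, `collabs_between(test, 1970, 1980)` would keep only the collaborations from that decade.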

## Finding top collaborators and frequency of collaboration

So, just from a glance at the results above, it looks like Satyajit Ray had a few favorite cinematographers that he worked with consistently throughout his career.  It'd be great to have a simple function that lists a given person's unique collaborators, along with how often they worked together.  Let's write a function that can tell us this:

In [15]:
def rank_collaborators(first_name, last_name, main, collab, cursor):
    """ Get collaborators ranked by frequency of collaboration, with ability to see films & dates of collaboration
    Input:
        first_name (str): first name of person
        last_name (str): last name of person
        main (str): position category of main person
        collab (str): position category of collaborator
        cursor (sqlite3.Cursor): cursor object to query the database
    Output:
        (collab_dict, collabs_and_counts) (tuple) : dict and list containing collaboration data
    """
    
    #First, we get the complete list of collaborators, using the function we just wrote
    collabs = get_collaborators(first_name, last_name, main, collab, cursor)
    
    #In order to maintain the first and last names of collaborators without using concatenation
    #or creating a new class, we can use namedtuple from Python's collections module:
    from collections import namedtuple
    Collaborator = namedtuple("Collaborator", ["last_name", "first_name"])
    
    #Now, we'll populate a dictionary with keys as Collaborators, and values as lists of 2-tuples of the form (title, year)
    collab_dict = {}
    
    #We iterate over the collaborations and update our dictionary accordingly
    for item in collabs:
        c = Collaborator(last_name=item[0], first_name=item[1])
        if c in collab_dict:
            collab_dict[c].append((item[2], item[3]))
        else:
            collab_dict[c] = [(item[2], item[3])]
    
    #Finally, we return a sorted list of collaborators (sorted on frequency of collaboration)
    collabs_and_counts = []
    for k in sorted(collab_dict, key=lambda k: len(collab_dict[k]), reverse=True):
        collabs_and_counts.append((k, len(collab_dict[k])))

    return (collab_dict, collabs_and_counts)

Let's test out our newest function!  First, let's look at rankings, by collaboration frequency, of Satyajit Ray's cinematographers:

In [17]:
test_dict, test_ranked = rank_collaborators('Satyajit', 'Ray', 'directors', 'cinematographers', cursor)
for item in test_ranked:
    print item

(Collaborator(last_name='Roy', first_name='Soumendu'), 20)
(Collaborator(last_name='Mitra', first_name='Subrata'), 10)
(Collaborator(last_name='Raha', first_name='Barun (I)'), 3)
(Collaborator(last_name='Bose', first_name='Purnendu'), 1)


Interesting: it seems Satyajit Ray preferred to work with the cinematographer Soumendu Roy more than any other.  Which films did they work on together?  We can easily use our function to see that as well:

In [18]:
ray_roy_films = test_dict[('Roy','Soumendu')]
for film in sorted(ray_roy_films, key=lambda x: x[1], reverse=True):
    print film

('Ghare-Baire', '1984')
('Sadgati', '1981')
('Heerak Rajar Deshe', '1980')
('Joi Baba Felunath', '1979')
('Shatranj Ke Khilari', '1977')
('Jana Aranya', '1976')
('Sonar Kella', '1974')
('Ashani Sanket', '1973')
('The Inner Eye', '1972')
('Seemabaddha', '1971')
('Sikkim', '1971')
('Aranyer Din Ratri', '1970')
('Pratidwandi', '1970')
('Goopy Gyne Bagha Byne', '1969')
('Chiriyakhana', '1967')
('Kapurush', '1965')
('Mahapurush', '1965')
('Abhijaan', '1962')
('Rabindranath Tagore', '1961')
('Teen Kanya', '1961')
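The counting done in Python by `rank_collaborators` can also be pushed into the database with GROUP BY.  The following is a sketch of that alternative, not a replacement for the function above; the name `rank_collaborators_sql` and the `ALLOWED_TABLES` whitelist are our own, and the table and column names are the ones created earlier in this tutorial:

```python
import sqlite3

# Table names are interpolated into the SQL, so restrict them to known tables
ALLOWED_TABLES = ("directors", "cinematographers")


def rank_collaborators_sql(first_name, last_name, main, collab, cursor):
    """Return [(collab_last, collab_first, n_films), ...], most frequent collaborator first."""
    if main not in ALLOWED_TABLES or collab not in ALLOWED_TABLES:
        raise ValueError("unknown table name")
    query = """
    SELECT c.last_name, c.first_name, COUNT(*) AS n_films
    FROM {main} AS m INNER JOIN {collab} AS c
    ON m.title = c.title AND m.year = c.year
    WHERE m.first_name = ? AND m.last_name = ?
    GROUP BY c.last_name, c.first_name
    ORDER BY n_films DESC, c.last_name ASC
    """.format(main=main, collab=collab)
    return cursor.execute(query, (first_name, last_name)).fetchall()
```

This returns only the counts; the per-collaborator film lists still require `rank_collaborators` or a second query.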


## Where to go from here

Having reached the end of this tutorial, we now know how to extract, format, cleanse and use the freely available data from IMDb.com, in the absence of a nice API.  Better yet, we've seen how to leverage it to gain insights far beyond what we could get on IMDb.com; particularly, we've seen how to explore basic collaboration between people in the film industry.  Yet, this opens up many new questions, like:

- Which positions tend to collaborate together most frequently?
- Could we represent collaboration in the film industry using graphs?
- Can we learn which collaboration patterns tend to yield the best outcomes, in terms of profits and/or awards?

Anyone who has completed this tutorial should be able to move on to tackle those more complex questions.
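As a starting point for the graph question, collaborations can be held in a plain dictionary of dictionaries, with each edge weighted by the number of shared films.  This is only a sketch of one possible representation; the function name `build_collab_graph` and the use of full-name strings as node labels are our own assumptions:

```python
from collections import defaultdict


def build_collab_graph(pairs):
    """Build an undirected weighted graph from (person_a, person_b) pairs.

    Returns {person: {neighbour: number_of_shared_films}}; each pair
    in the input stands for one film the two people share.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for a, b in pairs:
        graph[a][b] += 1
        graph[b][a] += 1
    # Convert to plain dicts so missing keys raise instead of being created silently
    return {person: dict(neighbours) for person, neighbours in graph.items()}
```

The pairs could be generated from the output of `get_collaborators`, one pair per returned row.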

Let's conclude, for fun, by generating the favorite cinematographers of some of the world's greatest directors:


In [19]:
print "Woody Allen's top cinematographers:"
a_dict, a_ranked = rank_collaborators('Woody', 'Allen', 'directors', 'cinematographers', cursor)
for item in a_ranked[:5]:
    print item[0][1], item[0][0], item[1]

print    
    
print "Ingmar Bergman's top cinematographers:"
b_dict, b_ranked = rank_collaborators('Ingmar', 'Bergman', 'directors', 'cinematographers', cursor)
for item in b_ranked[:5]:
    print item[0][1], item[0][0], item[1]

print    
    
print "Andrei Tarkovsky's top cinematographers:"
c_dict, c_ranked = rank_collaborators('Andrei', 'Tarkovsky', 'directors', 'cinematographers', cursor)
for item in c_ranked[:5]:
    print item[0][1], item[0][0], item[1]

print    
    
print "Akira Kurosawa's top cinematographers:"
d_dict, d_ranked = rank_collaborators('Akira', 'Kurosawa', 'directors', 'cinematographers', cursor)
for item in d_ranked[:5]:
    print item[0][1], item[0][0], item[1]
    
print
    
print "Werner Herzog's top cinematographers:"
e_dict, e_ranked = rank_collaborators('Werner', 'Herzog', 'directors', 'cinematographers', cursor)
for item in e_ranked[:5]:
    print item[0][1], item[0][0], item[1]

Woody Allen's top cinematographers:
Eigil Bryld 36
Carlo Di Palma 12
Gordon (I) Willis 8
Darius Khondji 5
Sven Nykvist 4

Ingmar Bergman's top cinematographers:
Sven Nykvist 58
Gunnar Fischer 12
Göran Strindberg 4
Raymond Wemmenlöv 3
Hilding Bladh 3

Andrei Tarkovsky's top cinematographers:
Vadim Yusov 4
Georgi (I) Rerberg 2
Leonid Kalashnikov 1
Lev Bunin 1
Ernst Yakovlev 1

Akira Kurosawa's top cinematographers:
Asakazu Nakai 11
Takao (I) Saitô 9
Takeo (I) Itô 5
Shôji Ueda 5
Kazuo (I) Miyagawa 2

Werner Herzog's top cinematographers:
Peter Zeitlinger 51
Dave (II) Roberson 32
Jörg Schmidt-Reitwein 17
Karl Kofler 16
Thomas Mauch 10
