## Based on the film database created in the last step, I want to know the IMDB ratings of the Netflix original films.

Below is the list of tables and the relevant columns in each.

### imdb_basics - Contains the following information for titles:
- tconst (string) - alphanumeric unique identifier of the title.
- titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc).
- primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release.
- originalTitle (string) - original title, in the original language.
- startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year.
- runtimeMinutes - primary runtime of the title, in minutes.
- genres (string array) - includes up to three genres associated with the title.

### imdb_ratings - Contains the IMDb rating and votes information for titles:
- tconst (string) - alphanumeric unique identifier of the title.
- averageRating - weighted average of all the individual user ratings.
- numVotes - number of votes the title has received.

### netflix - Contains the Netflix original film list (web-scraped from Wikipedia):
- title (string) - official Netflix title
- releaseDate (string) - release date in '{month} {day}, {year}' format (e.g. 'October 16, 2015')
- genre - main genre
- runtime - runtime of the film in '{hour} h {minute} min' format (e.g. '1 h 40 min')
- language - main language of the film
- filmType - type of the film (e.g. feature film, documentary, special, etc)
- runtime_min - runtime converted to minutes
- releaseYear - release year (YYYY), used below to match against the IMDB startYear

Connect to the films database and check each table.

In [1]:
import pandas as pd
import os
import sqlite3 

# Connect to SQLite database 
conn = sqlite3.connect('films.db') 

# Create a cursor object 
cur = conn.cursor() 

In [2]:
# Check if basics dataframe was saved correctly by loading and pulling up first 10 entries
query = """
        SELECT *
        FROM imdb_basics
        LIMIT 10;
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres']
('tt0000001', 'short', 'Carmencita', 'Carmencita', 0, 1894, None, 1, 'Documentary,Short')
('tt0000002', 'short', 'Le clown et ses chiens', 'Le clown et ses chiens', 0, 1892, None, 5, 'Animation,Short')
('tt0000003', 'short', 'Poor Pierrot', 'Pauvre Pierrot', 0, 1892, None, 5, 'Animation,Comedy,Romance')
('tt0000004', 'short', 'Un bon bock', 'Un bon bock', 0, 1892, None, 12, 'Animation,Short')
('tt0000005', 'short', 'Blacksmith Scene', 'Blacksmith Scene', 0, 1893, None, 1, 'Short')
('tt0000006', 'short', 'Chinese Opium Den', 'Chinese Opium Den', 0, 1894, None, 1, 'Short')
('tt0000007', 'short', 'Corbett and Courtney Before the Kinetograph', 'Corbett and Courtney Before the Kinetograph', 0, 1894, None, 1, 'Short,Sport')
('tt0000008', 'short', 'Edison Kinetoscopic Record of a Sneeze', 'Edison Kinetoscopic Record of a Sneeze', 0, 1894, None, 1, 'Documentary,Short')
('tt00

In [3]:
# Check if ratings dataframe was saved correctly by loading and pulling up first 10 entries
query = """
        SELECT *
        FROM imdb_ratings
        LIMIT 10;
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['tconst', 'averageRating', 'numVotes']
('tt0000001', 5.7, 2133)
('tt0000002', 5.5, 289)
('tt0000003', 6.4, 2169)
('tt0000004', 5.3, 184)
('tt0000005', 6.2, 2896)
('tt0000006', 5.0, 208)
('tt0000007', 5.3, 901)
('tt0000008', 5.4, 2280)
('tt0000009', 5.3, 220)
('tt0000010', 6.8, 7871)
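
The cells in this notebook repeat the same execute-then-print pattern. As a convenience, that pattern could be wrapped in a small helper. The sketch below is not part of the original notebook: `run_query` is a hypothetical name, and the in-memory database is a throwaway stand-in seeded with three rows copied from the output above so that the example runs on its own.

```python
import sqlite3

def run_query(conn, sql, params=()):
    """Run a query and return (column_names, rows)."""
    cur = conn.execute(sql, params)
    names = [col[0] for col in cur.description]
    return names, cur.fetchall()

# Throwaway in-memory database with a few rows copied from the output above
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE imdb_ratings (tconst TEXT, averageRating REAL, numVotes INTEGER)")
demo.executemany(
    "INSERT INTO imdb_ratings VALUES (?, ?, ?)",
    [("tt0000001", 5.7, 2133), ("tt0000002", 5.5, 289), ("tt0000003", 6.4, 2169)],
)

names, rows = run_query(demo, "SELECT * FROM imdb_ratings ORDER BY tconst LIMIT 2;")
print(names)
for row in rows:
    print(row)
```

Against the real database, the same call would take `conn` in place of `demo`.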


In [4]:
# We can join these two tables on 'tconst' which is a unique key for each title.
# Using INNER JOIN as we only want entries that exist in both tables - titles that have ratings.
query = """
        SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
        FROM imdb_basics AS b
        INNER JOIN imdb_ratings AS r
        ON b.tconst = r.tconst
        LIMIT 10;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'averageRating']
('tt0000001', 'Carmencita', 1894, 1, 5.7)
('tt0000002', 'Le clown et ses chiens', 1892, 5, 5.5)
('tt0000003', 'Poor Pierrot', 1892, 5, 6.4)
('tt0000004', 'Un bon bock', 1892, 12, 5.3)
('tt0000005', 'Blacksmith Scene', 1893, 1, 6.2)
('tt0000006', 'Chinese Opium Den', 1894, 1, 5.0)
('tt0000007', 'Corbett and Courtney Before the Kinetograph', 1894, 1, 5.3)
('tt0000008', 'Edison Kinetoscopic Record of a Sneeze', 1894, 1, 5.4)
('tt0000009', 'Miss Jerry', 1894, 45, 5.3)
('tt0000010', 'Leaving the Factory', 1895, 1, 6.8)
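
The join above relies on `tconst` identifying exactly one row in each table. One quick way to check that assumption is to compare the total row count with the number of distinct keys. The sketch below uses a toy in-memory table, with one duplicate key added deliberately to show a failing case; `count_duplicate_keys` is a hypothetical helper name, not something from the original notebook.

```python
import sqlite3

def count_duplicate_keys(conn, table, key="tconst"):
    """Return total rows minus distinct keys (0 means the key is unique)."""
    # Table and column names cannot be bound as parameters, so pass trusted names only.
    sql = f"SELECT COUNT(*) - COUNT(DISTINCT {key}) FROM {table};"
    return conn.execute(sql).fetchone()[0]

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE imdb_ratings (tconst TEXT, averageRating REAL, numVotes INTEGER)")
demo.executemany(
    "INSERT INTO imdb_ratings VALUES (?, ?, ?)",
    [("tt0000001", 5.7, 2133), ("tt0000002", 5.5, 289)],
)
unique_case = count_duplicate_keys(demo, "imdb_ratings")

# Add a deliberately duplicated key to show what a failure looks like
demo.execute("INSERT INTO imdb_ratings VALUES ('tt0000002', 5.5, 289)")
duplicate_case = count_duplicate_keys(demo, "imdb_ratings")
print(unique_case, duplicate_case)
```

A non-zero result on either IMDB table would mean the inner join can multiply rows.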


In [5]:
# Read in netflix data and list the column names and the first 10 entries
query = """
        SELECT *
        FROM netflix
        LIMIT 10;
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear']
('Beasts of No Nation', 'October 16, 2015', 'War drama', '2 h 17 min', 'English', 'Feature films', 137, 2015)
('The Ridiculous 6', 'December 11, 2015', 'Western comedy', '2 h', 'English', 'Feature films', 120, 2015)
("Pee-wee's Big Holiday", 'March 18, 2016', 'Adventure comedy', '1 h 30 min', 'English', 'Feature films', 90, 2016)
('Special Correspondents', 'April 29, 2016', 'Satire', '1 h 41 min', 'English', 'Feature films', 101, 2016)
('The Do-Over', 'May 27, 2016', 'Action comedy', '1 h 48 min', 'English', 'Feature films', 108, 2016)
('The Fundamentals of Caring', 'June 24, 2016', 'Comedy drama', '1 h 37 min', 'English', 'Feature films', 97, 2016)
('Brahman Naman', 'July 7, 2016', 'Sex comedy', '1 h 35 min', 'English', 'Feature films', 95, 2016)
('Rebirth', 'July 15, 2016', 'Thriller', '1 h 40 min', 'English', 'Feature films', 100, 2016)
('Tallulah', 'July 29, 2016', 'Comedy drama', '1 

Let's join netflix and IMDB data by matching the title and release year.

In [6]:
# Match title and year between netflix and IMDB data
# Only consider IMDB entries more recent than 2014, since our Netflix films are from 2015-2024
query = """
        SELECT i.tconst, n.title, n.releaseDate, i.startYear, n.runtime_min, i.runtimeMinutes
        FROM netflix AS n
        LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
        LIMIT 10;
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['tconst', 'title', 'releaseDate', 'startYear', 'runtime_min', 'runtimeMinutes']
('tt1365050', 'Beasts of No Nation', 'October 16, 2015', 2015, 137, 137)
('tt2479478', 'The Ridiculous 6', 'December 11, 2015', 2015, 120, 119)
('tt0837156', "Pee-wee's Big Holiday", 'March 18, 2016', 2016, 90, 90)
('tt4181052', 'Special Correspondents', 'April 29, 2016', 2016, 101, 101)
('tt4769836', 'The Do-Over', 'May 27, 2016', 2016, 108, 108)
('tt2452386', 'The Fundamentals of Caring', 'June 24, 2016', 2016, 97, 97)
('tt5240748', 'Brahman Naman', 'July 7, 2016', 2016, 95, 95)
('tt4902716', 'Rebirth', 'July 15, 2016', 2016, 100, 100)
('tt5798216', 'Rebirth', 'July 15, 2016', 2016, 100, 23)
('tt1639084', 'Tallulah', 'July 29, 2016', 2016, 111, 111)


In the result above, 'Rebirth' appears twice with the same release date because two different IMDB entries joined to the same Netflix entry. Comparing the runtimes, only one of them (100 min) is the right match; the other runs 23 min. One fix would be to require an exact runtime match in the join. However, 'The Ridiculous 6' runs 120 min in the netflix table and 119 min in the IMDB data, so an exact match would drop valid pairs. A better approach is to filter with WHERE so that only entries with *close enough* runtimes are kept. In the cell below, we keep entries whose IMDB runtime is within 3 minutes of the Netflix runtime. Because the comparison uses strict inequalities, the two runtimes may differ by at most 2 minutes.

In [7]:
# Join on title and year, keep rows where IMDB runtime is strictly within 3 min of netflix runtime
# Check the entries 'The Ridiculous 6' and 'Rebirth'
query = """
        SELECT i.tconst, n.title, n.releaseDate, i.startYear, n.runtime_min, i.runtimeMinutes
        FROM netflix AS n
        LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
        WHERE ( i.runtimeMinutes < n.runtime_min +3 AND i.runtimeMinutes > n.runtime_min -3 )
        AND (n.title = 'The Ridiculous 6' OR n.title = 'Rebirth');
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['tconst', 'title', 'releaseDate', 'startYear', 'runtime_min', 'runtimeMinutes']
('tt2479478', 'The Ridiculous 6', 'December 11, 2015', 2015, 120, 119)
('tt4902716', 'Rebirth', 'July 15, 2016', 2016, 100, 100)
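
Strict inequalities with `+3` and `-3` accept a difference of at most 2 minutes when runtimes are integers. If an inclusive tolerance is what is intended, `ABS(...) <= tol` states it directly. The following is a minimal sketch on a toy in-memory table holding the three IMDB candidates seen above; the flattened `candidates` table and the `tol` variable are assumptions made for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE candidates (tconst TEXT, title TEXT, runtime_min INTEGER, runtimeMinutes INTEGER)"
)
demo.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?, ?)",
    [
        ("tt2479478", "The Ridiculous 6", 120, 119),
        ("tt4902716", "Rebirth", 100, 100),
        ("tt5798216", "Rebirth", 100, 23),
    ],
)

tol = 3  # inclusive tolerance in minutes
kept = demo.execute(
    """
    SELECT tconst, title
    FROM candidates
    WHERE ABS(runtimeMinutes - runtime_min) <= ?
    ORDER BY tconst;
    """,
    (tol,),
).fetchall()
print(kept)
```

Rows with a NULL runtime evaluate to NULL in the comparison and are dropped, which is the same behavior as the range filter in the cell above.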


Each of the two films now returns a single entry. Note that because the WHERE clause filters on IMDB columns, Netflix films without an acceptable IMDB candidate drop out of the result even though we used a LEFT JOIN. To list those unmatched films, let's nest the query above as a subquery, left join it back to the original netflix table on title, and keep only the rows where no match was found.

In [11]:
# List non-matches
query = """
        SELECT *
        FROM netflix AS og_n
        LEFT JOIN (SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014) 
                                AS i
                    ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
                    WHERE (i.runtimeMinutes < n.runtime_min +3 AND i.runtimeMinutes > n.runtime_min -3)
                    ) 
                    AS matched
        ON og_n.Title = matched.Title
        WHERE matched.Title IS NULL
        LIMIT 10;
        """

out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear', 'title']
('Mascots', 'October 13, 2016', 'Mockumentary', '1 h 35 min', 'English', 'Feature films', 95, 2016, None)
('7 años', 'October 28, 2016', 'Drama', '1 h 17 min', 'Spanish', 'Feature films', 77, 2016, None)
('Mercy', 'November 22, 2016', 'Thriller', '1 h 27 min', 'English', 'Feature films', 87, 2016, None)
('Imperial Dreams', 'February 3, 2017', 'Drama', '1 h 26 min', 'English', 'Feature films', 86, 2017, None)
("Girlfriend's Day", 'February 14, 2017', 'Comedy drama', '1 h 10 min', 'English', 'Feature films', 70, 2017, None)
("I Don't Feel at Home in This World Anymore", 'February 24, 2017', 'Thriller drama', '1 h 36 min', 'English', 'Feature films', 96, 2017, None)
('Burning Sands', 'March 10, 2017', 'Drama', '1 h 42 min', 'English', 'Feature films', 102, 2017, None)
('Tramps', 'April 21, 2017', 'Romantic comedy', '1 h 23 min', 'English', 'Feature films', 83, 2017, None)
('The Meye
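
The `LEFT JOIN ... IS NULL` pattern above is an anti-join. The same idea can also be written with `NOT EXISTS`, which some readers find easier to follow. Here is a minimal sketch on toy in-memory tables, using two Netflix rows and two IMDB rows copied from outputs in this notebook. The simplified `imdb` table stands in for the joined basics/ratings subquery and is an assumption for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE netflix (title TEXT, releaseYear INTEGER, runtime_min INTEGER);
    CREATE TABLE imdb (tconst TEXT, primaryTitle TEXT, startYear INTEGER, runtimeMinutes INTEGER);
    INSERT INTO netflix VALUES ('Tallulah', 2016, 111), ('Mascots', 2016, 95);
    INSERT INTO imdb VALUES ('tt1639084', 'Tallulah', 2016, 111), ('tt4936176', 'Mascots', 2016, 89);
    """
)

# Keep Netflix rows for which no IMDB row satisfies the matching conditions
non_matches = demo.execute(
    """
    SELECT n.title
    FROM netflix AS n
    WHERE NOT EXISTS (
        SELECT 1
        FROM imdb AS i
        WHERE LOWER(i.primaryTitle) = LOWER(n.title)
          AND i.startYear = n.releaseYear
          AND i.runtimeMinutes < n.runtime_min + 3
          AND i.runtimeMinutes > n.runtime_min - 3
    );
    """
).fetchall()
print(non_matches)
```

One difference in design: the subquery here is correlated with the Netflix row itself rather than with its title alone, so two Netflix films sharing a title would be evaluated separately.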

Let's check if some of these exist in the IMDB table.

In [12]:
query = """
        SELECT *
        FROM imdb_basics AS b
        INNER JOIN imdb_ratings AS r
        ON b.tconst = r.tconst
        WHERE b.primaryTitle = 'Mascots';
        """
for row in cur.execute(query): 
    print(row) 

('tt1956565', 'short', 'Mascots', 'Mascots', 0, 2011, None, 21, 'Drama,Family,Short', 'tt1956565', 7.7, 9)
('tt4936176', 'movie', 'Mascots', 'Mascots', 0, 2016, None, 89, 'Comedy', 'tt4936176', 5.8, 8267)


IMDB does have a 2016 'Mascots' entry, but its runtime is 89 min against 95 min in the netflix table, a 6-minute gap that the filter rejects. It seems we need to be more lenient when filtering on runtime.
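
To choose a tolerance less arbitrarily, one option is to list the runtime gap for every title-and-year match and see where plausible matches stop. Below is a minimal sketch on toy rows taken from the outputs in this notebook; the flattened `pairs` table is a hypothetical stand-in for the joined query.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE pairs (title TEXT, runtime_min INTEGER, runtimeMinutes INTEGER)")
demo.executemany(
    "INSERT INTO pairs VALUES (?, ?, ?)",
    [
        ("Beasts of No Nation", 137, 137),
        ("The Ridiculous 6", 120, 119),
        ("Mascots", 95, 89),
        ("Rebirth", 100, 23),
    ],
)

# Absolute runtime gap per candidate pair, smallest first
gaps = demo.execute(
    """
    SELECT title, ABS(runtimeMinutes - runtime_min) AS gap
    FROM pairs
    ORDER BY gap;
    """
).fetchall()
for title, gap in gaps:
    print(f"{gap:3d}  {title}")
```

In this toy sample, the plausible matches sit at 6 minutes or less, while the wrong 'Rebirth' candidate is 77 minutes off.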

Let's run the original query again with a more lenient filter for runtime and then check some more entries.

In [13]:
# List non-matches with runtime +-10
query = """
        SELECT *
        FROM netflix AS og_n
        LEFT JOIN (SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014) 
                                AS i
                    ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
                    WHERE (i.runtimeMinutes <= n.runtime_min +10 AND i.runtimeMinutes >= n.runtime_min -10)
                    ) 
                    AS matched
        ON og_n.Title = matched.Title
        WHERE matched.Title IS NULL
        LIMIT 10;
        """
# Print column names and rows
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear', 'title']
('7 años', 'October 28, 2016', 'Drama', '1 h 17 min', 'Spanish', 'Feature films', 77, 2016, None)
('Imperial Dreams', 'February 3, 2017', 'Drama', '1 h 26 min', 'English', 'Feature films', 86, 2017, None)
('Tramps', 'April 21, 2017', 'Romantic comedy', '1 h 23 min', 'English', 'Feature films', 83, 2017, None)
('The Meyerowitz Stories (New and Selected)', 'October 13, 2017', 'Comedy drama', '1 h 52 min', 'English', 'Feature films', 112, 2017, None)
('My Own Man', 'March 6, 2015', None, '1 h 21 min', 'English', 'Documentaries', 81, 2015, None)
('The Other One: The Long Strange Trip of Bob Weir', 'May 22, 2015', None, '1 h 23 min', 'English', 'Documentaries', 83, 2015, None)
('What Happened, Miss Simone?', 'June 26, 2015', None, '1 h 24 min', 'English', 'Documentaries', 84, 2015, None)
('Tig', 'July 17, 2015', None, '1 h 20 min', 'English', 'Documentaries', 80, 2015, None)
("Winter 

Check if IMDB has an entry for '7 años' as primaryTitle.

In [14]:
query = """
        SELECT *
        FROM imdb_basics AS b
        INNER JOIN imdb_ratings AS r
        ON b.tconst = r.tconst
        WHERE b.primaryTitle = '7 años';
        """
for row in cur.execute(query): 
    print(row) 

No matches? We might have better luck with originalTitle than primaryTitle since the title isn't in English.

In [15]:
query = """
        SELECT *
        FROM imdb_basics AS b
        INNER JOIN imdb_ratings AS r
        ON b.tconst = r.tconst
        WHERE b.originalTitle = '7 años';
        """
for row in cur.execute(query): 
    print(row) 

('tt5517438', 'movie', '7 Years', '7 años', 0, 2016, None, 77, 'Drama', 'tt5517438', 6.7, 5424)


So some of the Netflix films are listed under their original-language title rather than the more popular title. We should therefore also compare the Netflix title to the IMDB originalTitle whenever the primary and original titles differ.

In [16]:
# List non-matches with runtime +-10 and original title matching
# Join on primary title and on original title separately, then combine the two result sets with UNION.
# UNION ALL would give the same list of non-matches: the second branch only considers entries whose
# primary and original titles differ, and duplicate titles in the matched set do not change which films come out unmatched.
query = """
        SELECT *
        FROM netflix AS og_n
        LEFT JOIN (SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014) 
                                AS i
                    ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
                    WHERE (i.runtimeMinutes <= n.runtime_min +10 AND i.runtimeMinutes >= n.runtime_min -10)
                    UNION
                    SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.originalTitle, b.startYear, b.runtimeMinutes, r.averageRating
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014 AND LOWER(primaryTitle) != LOWER(originalTitle) )
                                AS i
                    ON LOWER(n.Title) = LOWER(i.originalTitle) AND n.releaseYear = i.startYear
                    WHERE (i.runtimeMinutes <= n.runtime_min +10 AND i.runtimeMinutes >= n.runtime_min -10)
                    )
                    AS matched
        ON og_n.Title = matched.Title
        WHERE matched.Title IS NULL
        LIMIT 10;
        """
# Print column names and rows
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear', 'title']
('Imperial Dreams', 'February 3, 2017', 'Drama', '1 h 26 min', 'English', 'Feature films', 86, 2017, None)
('Tramps', 'April 21, 2017', 'Romantic comedy', '1 h 23 min', 'English', 'Feature films', 83, 2017, None)
('My Own Man', 'March 6, 2015', None, '1 h 21 min', 'English', 'Documentaries', 81, 2015, None)
('The Other One: The Long Strange Trip of Bob Weir', 'May 22, 2015', None, '1 h 23 min', 'English', 'Documentaries', 83, 2015, None)
('What Happened, Miss Simone?', 'June 26, 2015', None, '1 h 24 min', 'English', 'Documentaries', 84, 2015, None)
('Tig', 'July 17, 2015', None, '1 h 20 min', 'English', 'Documentaries', 80, 2015, None)
("Winter on Fire: Ukraine's Fight for Freedom", 'October 9, 2015', None, '1 h 31 min', 'Ukrainian', 'Documentaries', 91, 2015, None)
('My Beautiful Broken Brain', 'March 18, 2016', None, '1 h 31 min', 'English', 'Documentaries', 91, 2016, None)
('J
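
The primary-or-original title match can also be expressed as a single join with an `OR` condition instead of two queries combined with UNION. This is a minimal sketch on toy in-memory tables, with rows copied from outputs in this notebook. The simplified `imdb` table is an assumption standing in for the joined basics/ratings subquery.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE netflix (title TEXT, releaseYear INTEGER, runtime_min INTEGER);
    CREATE TABLE imdb (tconst TEXT, primaryTitle TEXT, originalTitle TEXT,
                       startYear INTEGER, runtimeMinutes INTEGER);
    INSERT INTO netflix VALUES ('7 años', 2016, 77), ('Tallulah', 2016, 111);
    INSERT INTO imdb VALUES
        ('tt5517438', '7 Years', '7 años', 2016, 77),
        ('tt1639084', 'Tallulah', 'Tallulah', 2016, 111);
    """
)

# Match on either title column, plus release year and a 10-minute runtime tolerance
matches = demo.execute(
    """
    SELECT n.title, i.tconst
    FROM netflix AS n
    INNER JOIN imdb AS i
        ON (LOWER(n.title) = LOWER(i.primaryTitle)
            OR LOWER(n.title) = LOWER(i.originalTitle))
       AND n.releaseYear = i.startYear
       AND ABS(i.runtimeMinutes - n.runtime_min) <= 10
    ORDER BY i.tconst;
    """
).fetchall()
print(matches)
```

One caveat: SQLite's built-in `LOWER()` folds only ASCII characters by default, so titles that differ only in the case of an accented letter would still fail to match under either formulation.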

Check for "Tramps" in IMDB

In [17]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'Tramps';
        """
for row in cur.execute(query): 
    print(row) 

# Look for film 'Tramps'
query = """
        SELECT *
        FROM imdb_basics 
        WHERE primaryTitle = 'Tramps' AND startYear > 2000;
        """
for row in cur.execute(query): 
    print(row) 

('Tramps', 'April 21, 2017', 'Romantic comedy', '1 h 23 min', 'English', 'Feature films', 83, 2017)
('tt2184609', 'tvEpisode', 'Tramps', 'Tramps', 0, 2012, None, 12, 'Action,Comedy')
('tt4991512', 'movie', 'Tramps', 'Tramps', 0, 2016, None, 82, 'Adventure,Romance')
('tt9049178', 'movie', 'Tramps', 'Csavargók', 0, 2018, None, 62, 'Documentary')


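The 2016 movie (82 min) looks like the right match for the Netflix entry (83 min, released in 2017), but the release years differ by one. It seems better to be more lenient on release year as well. Let's check some more entries.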

In [18]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'Imperial Dreams';
        """
for row in cur.execute(query): 
    print(row) 

# Look for film 'Imperial Dreams'
query = """
        SELECT *
        FROM imdb_basics 
        WHERE primaryTitle = 'Imperial Dreams' AND startYear > 2000;
        """
for row in cur.execute(query): 
    print(row) 

('Imperial Dreams', 'February 3, 2017', 'Drama', '1 h 26 min', 'English', 'Feature films', 86, 2017)
('tt3331028', 'movie', 'Imperial Dreams', 'Imperial Dreams', 0, 2014, None, 87, 'Drama')


In [20]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'My Own Man';
        """
for row in cur.execute(query): 
    print(row) 

query = """
        SELECT *
        FROM imdb_basics 
        WHERE primaryTitle = 'My Own Man' AND startYear > 2000;
        """
for row in cur.execute(query): 
    print(row) 

('My Own Man', 'March 6, 2015', None, '1 h 21 min', 'English', 'Documentaries', 81, 2015)
('tt3356434', 'movie', 'My Own Man', 'My Own Man', 0, 2014, None, 82, 'Comedy,Documentary,Drama')
('tt3432784', 'video', 'My Own Man', 'My Own Man', 0, 2014, None, 4, 'Music,Short')


In [21]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'What Happened, Miss Simone?';
        """
for row in cur.execute(query): 
    print(row) 

query = """
        SELECT *
        FROM imdb_basics 
        WHERE primaryTitle = 'What Happened, Miss Simone?';
        """
for row in cur.execute(query): 
    print(row) 

('What Happened, Miss Simone?', 'June 26, 2015', None, '1 h 24 min', 'English', 'Documentaries', 84, 2015)
('tt4284010', 'movie', 'What Happened, Miss Simone?', 'What Happened, Miss Simone?', 0, 2015, None, 101, 'Biography,Documentary,Music')


In [22]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'Tig';
        """
for row in cur.execute(query): 
    print(row) 

query = """
        SELECT *
        FROM imdb_basics 
        WHERE primaryTitle = 'Tig';
        """
for row in cur.execute(query): 
    print(row) 

('Tig', 'July 17, 2015', None, '1 h 20 min', 'English', 'Documentaries', 80, 2015)
('tt0835135', 'tvEpisode', 'Tig', 'Tig', 0, 2004, None, 22, 'Comedy,Documentary')
('tt32029337', 'tvEpisode', 'Tig', 'Tig', 0, 2015, None, None, 'Comedy,Drama')
('tt3986532', 'movie', 'Tig', 'Tig', 0, 2015, None, 95, 'Biography,Documentary')


In [23]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'Winter on Fire: Ukraine''s Fight for Freedom';
        """
for row in cur.execute(query): 
    print(row) 

query = """
        SELECT *
        FROM imdb_basics 
        WHERE primaryTitle = 'Winter on Fire: Ukraine''s Fight for Freedom';
        """
for row in cur.execute(query): 
    print(row) 

("Winter on Fire: Ukraine's Fight for Freedom", 'October 9, 2015', None, '1 h 31 min', 'Ukrainian', 'Documentaries', 91, 2015)
('tt4908644', 'movie', "Winter on Fire: Ukraine's Fight for Freedom", "Winter on Fire: Ukraine's Fight for Freedom", 0, 2015, None, 102, 'Documentary,History,War')


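Some IMDB entries that look like the right match have a very different runtime from the Netflix one (for example, 101 min vs 84 min for 'What Happened, Miss Simone?' and 95 min vs 80 min for 'Tig'). Let's ignore these for now and stick to the criteria below for matching films between the Netflix and IMDB data.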
- match on Netflix title to IMDB primaryTitle
- if IMDB primaryTitle is different from originalTitle, match Netflix title to IMDB originalTitle
- limit results to where IMDB runtime is Netflix runtime +- 10 min
- limit results to where IMDB startYear is Netflix releaseYear +- 1 year
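
The four criteria above can be stated as a single predicate, which makes it easy to test individual pairs by hand. This is a sketch in plain Python rather than the SQL used in the notebook. `is_match` and the dictionary layout are assumptions for illustration, and Python's `str.lower()` handles non-ASCII letters, so edge cases may differ slightly from SQLite's `LOWER()`.

```python
def is_match(netflix, imdb, runtime_tol=10, year_tol=1):
    """Apply the four matching criteria to one Netflix row and one IMDB row (dicts)."""
    title = netflix["title"].lower()
    primary = imdb["primaryTitle"].lower()
    original = imdb["originalTitle"].lower()
    # Criteria 1 and 2: primary title, or original title when it differs from the primary
    title_ok = title == primary or (original != primary and title == original)
    # A missing IMDB runtime or year can never satisfy the range checks
    if imdb["runtimeMinutes"] is None or imdb["startYear"] is None:
        return False
    # Criterion 3: runtime within +- runtime_tol minutes
    runtime_ok = abs(imdb["runtimeMinutes"] - netflix["runtime_min"]) <= runtime_tol
    # Criterion 4: release year within +- year_tol years
    year_ok = abs(imdb["startYear"] - netflix["releaseYear"]) <= year_tol
    return title_ok and runtime_ok and year_ok

# Pairs taken from the lookups above
tramps_n = {"title": "Tramps", "runtime_min": 83, "releaseYear": 2017}
tramps_i = {"primaryTitle": "Tramps", "originalTitle": "Tramps", "startYear": 2016, "runtimeMinutes": 82}
tig_n = {"title": "Tig", "runtime_min": 80, "releaseYear": 2015}
tig_i = {"primaryTitle": "Tig", "originalTitle": "Tig", "startYear": 2015, "runtimeMinutes": 95}

print(is_match(tramps_n, tramps_i))  # True: year off by 1, runtime off by 1
print(is_match(tig_n, tig_i))        # False: runtime off by 15 minutes
```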

In [25]:
# Count non-matches with runtime +-10 and original title matching and release year more lenient +-1
query = """
        SELECT COUNT(*)
        FROM netflix AS og_n
        LEFT JOIN (SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014) 
                                AS i
                    ON LOWER(n.Title) = LOWER(i.primaryTitle)
                    WHERE (i.runtimeMinutes <= n.runtime_min +10 AND i.runtimeMinutes >= n.runtime_min -10)
                    AND i.startYear <= n.releaseYear + 1 AND i.startYear >= n.releaseYear - 1
                    UNION
                    SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.originalTitle, b.startYear, b.runtimeMinutes, r.averageRating
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014 AND LOWER(primaryTitle) != LOWER(originalTitle) )
                                AS i
                    ON LOWER(n.Title) = LOWER(i.originalTitle)
                    WHERE (i.runtimeMinutes <= n.runtime_min +10 AND i.runtimeMinutes >= n.runtime_min -10)
                    AND i.startYear <= n.releaseYear + 1 AND i.startYear >= n.releaseYear - 1
                    )
                    AS matched
        ON og_n.Title = matched.Title
        WHERE matched.Title IS NULL;
        """
# Print the count
out = cur.execute(query)
for row in out: 
    print(row) 

(140,)


In [26]:
# Print the number of total Netflix films
query = """
        SELECT COUNT(*)
        FROM netflix;
        """
# Print output of the query
for row in cur.execute(query): 
    print(row) 

(1413,)


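Currently, 140 of the 1413 Netflix films in the data (about 9.9%) have no match. There is definitely room to reduce this number. For instance, the subquery still requires startYear > 2014, so a 2015 Netflix film such as 'My Own Man' cannot match its 2014 IMDB entry despite the +-1 year rule. Let's move on for now, though, and save the query results to a dataframe and then to a csv file for further analysis.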

In [28]:
# Select matched Netflix films with their IMDB rating and votes (runtime +-10, release year +-1, primary or original title)
query = """
        SELECT title, releaseDate, genre, language, filmType, runtime_min AS runtime, tconst, averageRating, numVotes
        FROM netflix AS n
        LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating, r.numVotes
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle)
        WHERE (i.runtimeMinutes <= n.runtime_min +10 AND i.runtimeMinutes >= n.runtime_min -10)
        AND i.startYear <= n.releaseYear + 1 AND i.startYear >= n.releaseYear - 1
        UNION
        SELECT title, releaseDate, genre, language, filmType, runtime_min AS runtime, tconst, averageRating, numVotes
        FROM netflix AS n
        LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.originalTitle, b.startYear, b.runtimeMinutes, r.averageRating, r.numVotes
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014 AND LOWER(primaryTitle) != LOWER(originalTitle) )
                    AS i
        ON LOWER(n.Title) = LOWER(i.originalTitle)
        WHERE (i.runtimeMinutes <= n.runtime_min +10 AND i.runtimeMinutes >= n.runtime_min -10)
        AND i.startYear <= n.releaseYear + 1 AND i.startYear >= n.releaseYear - 1;
        """
# Load data into Pandas DataFrame
df = pd.read_sql_query(query, conn)
df.head()

,title,releaseDate,genre,language,filmType,runtime,tconst,averageRating,numVotes
0,#realityhigh,"September 8, 2017",Teen comedy,English,Feature films,99,tt6119504,5.1,7071
1,(Un)lucky Sisters,"August 30, 2024",Comedy,Spanish,Feature films,83,tt33054529,5.2,825
2,10 Days of a Bad Man,"August 18, 2023",Drama,Turkish,Feature films,124,tt24852002,6.1,3170
3,10 Days of a Curious Man,"November 7, 2024",Drama,Turkish,Feature films,110,tt28713370,5.8,1119
4,10 Days of a Good Man,"March 3, 2023",Drama,Turkish,Feature films,124,tt23334464,6.5,6031


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1324 entries, 0 to 1323
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          1324 non-null   object 
 1   releaseDate    1324 non-null   object 
 2   genre          1017 non-null   object 
 3   language       1324 non-null   object 
 4   filmType       1324 non-null   object 
 5   runtime        1324 non-null   int64  
 6   tconst         1324 non-null   object 
 7   averageRating  1324 non-null   float64
 8   numVotes       1324 non-null   int64  
dtypes: float64(1), int64(2), object(6)
memory usage: 93.2+ KB
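
The result has 1324 rows, while 1413 - 140 = 1273 Netflix entries had at least one match. That suggests roughly 50 rows come from films that joined to more than one IMDB entry. Before analysis it may be worth listing those duplicates. Below is a minimal sketch on a toy in-memory table: the `matched` table and its two 'Example Film' rows are hypothetical, made up to show the pattern. With the DataFrame above, `df[df.duplicated(subset=['title', 'releaseDate'], keep=False)]` would be the pandas counterpart.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE matched (title TEXT, releaseDate TEXT, tconst TEXT, numVotes INTEGER)")
demo.executemany(
    "INSERT INTO matched VALUES (?, ?, ?, ?)",
    [
        ("#realityhigh", "September 8, 2017", "tt6119504", 7071),
        ("Example Film", "January 1, 2020", "tt_hypothetical_1", 5000),  # made-up row
        ("Example Film", "January 1, 2020", "tt_hypothetical_2", 12),    # made-up row
    ],
)

# Netflix films that appear more than once in the matched result
duplicates = demo.execute(
    """
    SELECT title, releaseDate, COUNT(*) AS n_matches
    FROM matched
    GROUP BY title, releaseDate
    HAVING COUNT(*) > 1;
    """
).fetchall()
print(duplicates)
```

How to resolve a duplicate is a judgment call. Keeping the candidate with the most votes or the closest runtime are two possible heuristics, neither of which the original notebook applies.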


Save out to a csv file.

In [30]:
df.to_csv('netflix_w_ratings_IMDB.csv',encoding='utf-8-sig',index=False) # using utf-8-sig for special characters in the title

In [31]:
# Close connection to SQLite database 
conn.close() 