## Based on the film database created in the last step, I want to know the IMDB ratings of the Netflix original films.

Below is the list of tables and the relevant columns in each.

### imdb_basics - Contains the following information for titles:
- tconst (string) - alphanumeric unique identifier of the title.
- titleType (string) – the type/format of the title (e.g. movie, short,
tvseries, tvepisode, video, etc).
- primaryTitle (string) – the more popular title / the title used by
the filmmakers on promotional materials at the point of release.
- originalTitle (string) - original title, in the original language.
- startYear (YYYY) – represents the release year of a title. In the
case of TV Series, it is the series start year.
- runtimeMinutes – primary runtime of the title, in minutes.
- genres (string array) – includes up to three genres associated with
the title.

### imdb_ratings – Contains the IMDb rating and votes information for titles:
- tconst (string) - alphanumeric unique identifier of the title.
- averageRating – weighted average of all the individual user ratings.
- numVotes - number of votes the title has received.

### netflix – Contains the Netflix original film list (webscraped from Wikipedia):
- title (string) - official Netflix title
- releaseDate (string) - release date in '{month} {day}, {year}' format (e.g. 'October 16, 2015') 
- genre - main genre
- runtime - runtime of the film in '{hour} h {minute} min' format (e.g. '1 h 40 min')
- language - main language of the film
- filmType - type of the film (e.g. feature film, documentary, special, etc)
- runtime_min - runtime converted to minutes
- releaseYear - release year taken from releaseDate (e.g. 2015)
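
As a rough illustration of how the derived columns relate to the scraped strings, here is a minimal sketch. The helper names `runtime_to_minutes` and `release_year` are hypothetical and are not taken from the previous step; the sketch only assumes the two string formats listed above.

```python
import re


def runtime_to_minutes(runtime):
    """Convert a string such as '1 h 40 min' or '2 h' to total minutes.

    Hypothetical helper: returns None when the string does not follow the
    '{hour} h {minute} min' pattern.
    """
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", runtime)
    if match is None or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


def release_year(release_date):
    """Take the trailing four-digit year from '{month} {day}, {year}'."""
    return int(release_date.strip()[-4:])


print(runtime_to_minutes("2 h 17 min"))    # 137
print(release_year("October 16, 2015"))    # 2015
```

The hour and minute parts are both optional in the pattern so that values such as '2 h' still parse.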

Connect to the films database and check each table.

In [1]:
import pandas as pd
import os
import sqlite3 

# Connect to SQLite database 
conn = sqlite3.connect('films.db') 

# Create a cursor object 
cur = conn.cursor() 

In [2]:
# Check if basics dataframe was saved correctly by loading and pulling up first 10 entries
query = """
        SELECT *
        FROM imdb_basics
        LIMIT 10;
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres']
('tt0000001', 'short', 'Carmencita', 'Carmencita', 0, 1894, None, 1, 'Documentary,Short')
('tt0000002', 'short', 'Le clown et ses chiens', 'Le clown et ses chiens', 0, 1892, None, 5, 'Animation,Short')
('tt0000003', 'short', 'Poor Pierrot', 'Pauvre Pierrot', 0, 1892, None, 5, 'Animation,Comedy,Romance')
('tt0000004', 'short', 'Un bon bock', 'Un bon bock', 0, 1892, None, 12, 'Animation,Short')
('tt0000005', 'short', 'Blacksmith Scene', 'Blacksmith Scene', 0, 1893, None, 1, 'Short')
('tt0000006', 'short', 'Chinese Opium Den', 'Chinese Opium Den', 0, 1894, None, 1, 'Short')
('tt0000007', 'short', 'Corbett and Courtney Before the Kinetograph', 'Corbett and Courtney Before the Kinetograph', 0, 1894, None, 1, 'Short,Sport')
('tt0000008', 'short', 'Edison Kinetoscopic Record of a Sneeze', 'Edison Kinetoscopic Record of a Sneeze', 0, 1894, None, 1, 'Documentary,Short')
('tt00

In [3]:
# Check if ratings dataframe was saved correctly by loading and pulling up first 10 entries
query = """
        SELECT *
        FROM imdb_ratings
        LIMIT 10;
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['tconst', 'averageRating', 'numVotes']
('tt0000001', 5.7, 2133)
('tt0000002', 5.5, 289)
('tt0000003', 6.4, 2169)
('tt0000004', 5.3, 184)
('tt0000005', 6.2, 2896)
('tt0000006', 5.0, 208)
('tt0000007', 5.3, 901)
('tt0000008', 5.4, 2280)
('tt0000009', 5.3, 220)
('tt0000010', 6.8, 7871)
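
The cells in this notebook repeat the same "execute, read the column names from `cur.description`, print the rows" steps. A small helper can wrap them, as sketched below. `run_query` is a hypothetical name, and the in-memory table is only a stand-in for `films.db` so that the snippet runs on its own.

```python
import sqlite3


def run_query(conn, query, params=()):
    """Run a query and return (column_names, rows). Hypothetical helper."""
    cur = conn.execute(query, params)
    names = [col[0] for col in cur.description]
    return names, cur.fetchall()


# Tiny in-memory stand-in for films.db, using two rows shown above
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE imdb_ratings (tconst TEXT, averageRating REAL, numVotes INTEGER)"
)
demo.executemany(
    "INSERT INTO imdb_ratings VALUES (?, ?, ?)",
    [("tt0000001", 5.7, 2133), ("tt0000002", 5.5, 289)],
)

names, rows = run_query(demo, "SELECT * FROM imdb_ratings LIMIT ?", (10,))
print(names)
for row in rows:
    print(row)
```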


In [4]:
# We can join these two tables on 'tconst' which is a unique key for each title.
# Using INNER JOIN as we only want entries that exist in both tables - titles that have ratings.
query = """
        SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
        FROM imdb_basics AS b
        INNER JOIN imdb_ratings AS r
        ON b.tconst = r.tconst
        LIMIT 10;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['tconst', 'primaryTitle', 'startYear', 'runtimeMinutes', 'averageRating']
('tt0000001', 'Carmencita', 1894, 1, 5.7)
('tt0000002', 'Le clown et ses chiens', 1892, 5, 5.5)
('tt0000003', 'Poor Pierrot', 1892, 5, 6.4)
('tt0000004', 'Un bon bock', 1892, 12, 5.3)
('tt0000005', 'Blacksmith Scene', 1893, 1, 6.2)
('tt0000006', 'Chinese Opium Den', 1894, 1, 5.0)
('tt0000007', 'Corbett and Courtney Before the Kinetograph', 1894, 1, 5.3)
('tt0000008', 'Edison Kinetoscopic Record of a Sneeze', 1894, 1, 5.4)
('tt0000009', 'Miss Jerry', 1894, 45, 5.3)
('tt0000010', 'Leaving the Factory', 1895, 1, 6.8)


In [5]:
# Read in netflix data and list the column names and the first 10 entries
query = """
        SELECT *
        FROM netflix
        LIMIT 10;
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear']
('Beasts of No Nation', 'October 16, 2015', 'War drama', '2 h 17 min', 'English', 'Feature films', 137, 2015)
('The Ridiculous 6', 'December 11, 2015', 'Western comedy', '2 h', 'English', 'Feature films', 120, 2015)
("Pee-wee's Big Holiday", 'March 18, 2016', 'Adventure comedy', '1 h 30 min', 'English', 'Feature films', 90, 2016)
('Special Correspondents', 'April 29, 2016', 'Satire', '1 h 41 min', 'English', 'Feature films', 101, 2016)
('The Do-Over', 'May 27, 2016', 'Action comedy', '1 h 48 min', 'English', 'Feature films', 108, 2016)
('The Fundamentals of Caring', 'June 24, 2016', 'Comedy drama', '1 h 37 min', 'English', 'Feature films', 97, 2016)
('Brahman Naman', 'July 7, 2016', 'Sex comedy', '1 h 35 min', 'English', 'Feature films', 95, 2016)
('Rebirth', 'July 15, 2016', 'Thriller', '1 h 40 min', 'English', 'Feature films', 100, 2016)
('Tallulah', 'July 29, 2016', 'Comedy drama', '1 

Let's join netflix and IMDB data by matching the title and release year.

In [6]:
# Match title and year between netflix and IMDB data
# Only keep IMDB entries more recent than 2014, since our netflix films are from 2015-2024
query = """
        SELECT i.tconst, n.title, n.releaseDate, i.startYear, n.runtime_min, i.runtimeMinutes
        FROM netflix AS n
        LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
        LIMIT 10;
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['tconst', 'title', 'releaseDate', 'startYear', 'runtime_min', 'runtimeMinutes']
('tt1365050', 'Beasts of No Nation', 'October 16, 2015', 2015, 137, 137)
('tt2479478', 'The Ridiculous 6', 'December 11, 2015', 2015, 120, 119)
('tt0837156', "Pee-wee's Big Holiday", 'March 18, 2016', 2016, 90, 90)
('tt4181052', 'Special Correspondents', 'April 29, 2016', 2016, 101, 101)
('tt4769836', 'The Do-Over', 'May 27, 2016', 2016, 108, 108)
('tt2452386', 'The Fundamentals of Caring', 'June 24, 2016', 2016, 97, 97)
('tt5240748', 'Brahman Naman', 'July 7, 2016', 2016, 95, 95)
('tt4902716', 'Rebirth', 'July 15, 2016', 2016, 100, 100)
('tt5798216', 'Rebirth', 'July 15, 2016', 2016, 100, 23)
('tt1639084', 'Tallulah', 'July 29, 2016', 2016, 111, 111)


In the above result, there are two 'Rebirth' entries with the same release date, because two different IMDB entries joined to the same netflix entry. Comparing the runtimes (100 and 23 minutes in IMDB against 100 minutes in netflix), only one of them is the right match. One way to solve this is to require the runtime to be an exact match. However, for 'The Ridiculous 6', the runtime is 120 minutes in the netflix table and 119 in the IMDB data, so joining on an exact runtime match would likely lose valid matches. Another approach is to filter with WHERE so that we only keep the entries that have *close enough* runtimes between the two tables. In the cell below, let's keep only the entries whose IMDB runtime is less than 3 minutes away from the netflix runtime.

In [7]:
# Join on title and year, keep rows whose IMDB runtime is less than 3 min away from the netflix runtime
# Check the entries 'The Ridiculous 6' and 'Rebirth'
query = """
        SELECT i.tconst, n.title, n.releaseDate, i.startYear, n.runtime_min, i.runtimeMinutes
        FROM netflix AS n
        LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
        WHERE ( i.runtimeMinutes < n.runtime_min +3 AND i.runtimeMinutes > n.runtime_min -3 )
        AND (n.title = 'The Ridiculous 6' OR n.title = 'Rebirth');
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['tconst', 'title', 'releaseDate', 'startYear', 'runtime_min', 'runtimeMinutes']
('tt2479478', 'The Ridiculous 6', 'December 11, 2015', 2015, 120, 119)
('tt4902716', 'Rebirth', 'July 15, 2016', 2016, 100, 100)
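
The runtime tolerance can also be written as a single `ABS(...)` comparison with the tolerance passed as a parameter, which makes it easy to compare different thresholds. The sketch below is not the notebook's query. It uses a simplified in-memory `imdb` table (an assumption, standing in for the joined basics/ratings subquery) filled with the three IMDB rows shown in the output above.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE netflix (title TEXT, releaseYear INTEGER, runtime_min INTEGER);
    CREATE TABLE imdb (tconst TEXT, primaryTitle TEXT, startYear INTEGER,
                       runtimeMinutes INTEGER);
    INSERT INTO netflix VALUES ('The Ridiculous 6', 2015, 120), ('Rebirth', 2016, 100);
    INSERT INTO imdb VALUES
        ('tt2479478', 'The Ridiculous 6', 2015, 119),
        ('tt4902716', 'Rebirth', 2016, 100),
        ('tt5798216', 'Rebirth', 2016, 23);
""")


def match_within(conn, tolerance):
    """Return title/year matches whose runtimes differ by at most `tolerance` minutes."""
    query = """
        SELECT i.tconst, n.title, n.runtime_min, i.runtimeMinutes
        FROM netflix AS n
        INNER JOIN imdb AS i
        ON LOWER(n.title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
        WHERE ABS(i.runtimeMinutes - n.runtime_min) <= ?
        ORDER BY i.tconst;
    """
    return conn.execute(query, (tolerance,)).fetchall()


exact = match_within(demo, 0)   # exact runtime match drops 'The Ridiculous 6'
close = match_within(demo, 3)   # a small tolerance keeps it and still drops the 23-min entry
print(exact)
print(close)
```

Note that a WHERE condition on the IMDB columns discards the unmatched rows of a LEFT JOIN, so the sketch uses INNER JOIN to make that explicit.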


Each of these two titles now returns a single entry, so great! Now, let's check the netflix films that didn't match any IMDB entry. Let's nest the above query and left join it to the original netflix table to list the films that it does not return.

In [8]:
# List non-matches
query = """
        SELECT *
        FROM netflix AS og_n
        LEFT JOIN (SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014) 
                                AS i
                    ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
                    WHERE (i.runtimeMinutes < n.runtime_min +3 AND i.runtimeMinutes > n.runtime_min -3)
                    ) 
                    AS matched
        ON og_n.Title = matched.Title
        WHERE matched.Title IS NULL;
        """

out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear', 'title']
('Mascots', 'October 13, 2016', 'Mockumentary', '1 h 35 min', 'English', 'Feature films', 95, 2016, None)
('7 años', 'October 28, 2016', 'Drama', '1 h 17 min', 'Spanish', 'Feature films', 77, 2016, None)
('Mercy', 'November 22, 2016', 'Thriller', '1 h 27 min', 'English', 'Feature films', 87, 2016, None)
('Imperial Dreams', 'February 3, 2017', 'Drama', '1 h 26 min', 'English', 'Feature films', 86, 2017, None)
("Girlfriend's Day", 'February 14, 2017', 'Comedy drama', '1 h 10 min', 'English', 'Feature films', 70, 2017, None)
("I Don't Feel at Home in This World Anymore", 'February 24, 2017', 'Thriller drama', '1 h 36 min', 'English', 'Feature films', 96, 2017, None)
('Burning Sands', 'March 10, 2017', 'Drama', '1 h 42 min', 'English', 'Feature films', 102, 2017, None)
('Tramps', 'April 21, 2017', 'Romantic comedy', '1 h 23 min', 'English', 'Feature films', 83, 2017, None)
('The Meye

Let's check if some of these exist in the IMDB table.

In [9]:
query = """
        SELECT *
        FROM imdb_basics AS b
        INNER JOIN imdb_ratings AS r
        ON b.tconst = r.tconst
        WHERE b.primaryTitle = 'Mascots';
        """
for row in cur.execute(query): 
    print(row) 

('tt1956565', 'short', 'Mascots', 'Mascots', 0, 2011, None, 21, 'Drama,Family,Short', 'tt1956565', 7.7, 9)
('tt4936176', 'movie', 'Mascots', 'Mascots', 0, 2016, None, 89, 'Comedy', 'tt4936176', 5.8, 8267)


Since there is an IMDB entry with a 2016 release year, and its runtime (89 min) is 6 minutes shorter than the netflix runtime (95 min), it seems like we need to be more lenient when filtering on the runtime.

Let's run the original query again with a more lenient filter for runtime and then check some more entries.

In [11]:
# List non-matches with runtime +-10
query = """
        SELECT *
        FROM netflix AS og_n
        LEFT JOIN (SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes, r.averageRating
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014) 
                                AS i
                    ON LOWER(n.Title) = LOWER(i.primaryTitle) AND n.releaseYear = i.startYear
                    WHERE (i.runtimeMinutes <= n.runtime_min +10 AND i.runtimeMinutes >= n.runtime_min -10)
                    ) 
                    AS matched
        ON og_n.Title = matched.Title
        WHERE matched.Title IS NULL;
        """
# 
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear', 'title']
('7 años', 'October 28, 2016', 'Drama', '1 h 17 min', 'Spanish', 'Feature films', 77, 2016, None)
('Imperial Dreams', 'February 3, 2017', 'Drama', '1 h 26 min', 'English', 'Feature films', 86, 2017, None)
('Tramps', 'April 21, 2017', 'Romantic comedy', '1 h 23 min', 'English', 'Feature films', 83, 2017, None)
('The Meyerowitz Stories (New and Selected)', 'October 13, 2017', 'Comedy drama', '1 h 52 min', 'English', 'Feature films', 112, 2017, None)
('My Own Man', 'March 6, 2015', None, '1 h 21 min', 'English', 'Documentaries', 81, 2015, None)
('The Other One: The Long Strange Trip of Bob Weir', 'May 22, 2015', None, '1 h 23 min', 'English', 'Documentaries', 83, 2015, None)
('What Happened, Miss Simone?', 'June 26, 2015', None, '1 h 24 min', 'English', 'Documentaries', 84, 2015, None)
('Tig', 'July 17, 2015', None, '1 h 20 min', 'English', 'Documentaries', 80, 2015, None)
("Winter 
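
The nested `LEFT JOIN ... IS NULL` pattern used above to list non-matches can also be expressed with `NOT EXISTS`, which some readers find easier to follow. This is a sketch under assumptions rather than the notebook's query: the simplified `imdb` table stands in for the joined basics/ratings subquery, and the rows are the 'Rebirth' and 'Mascots' values printed earlier.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE netflix (title TEXT, releaseYear INTEGER, runtime_min INTEGER);
    CREATE TABLE imdb (tconst TEXT, primaryTitle TEXT, startYear INTEGER,
                       runtimeMinutes INTEGER);
    INSERT INTO netflix VALUES ('Rebirth', 2016, 100), ('Mascots', 2016, 95);
    INSERT INTO imdb VALUES
        ('tt4902716', 'Rebirth', 2016, 100),
        ('tt4936176', 'Mascots', 2016, 89);
""")

# Keep the netflix rows for which no IMDB row passes the title/year/runtime checks
UNMATCHED = """
    SELECT n.title, n.releaseYear, n.runtime_min
    FROM netflix AS n
    WHERE NOT EXISTS (
        SELECT 1
        FROM imdb AS i
        WHERE LOWER(i.primaryTitle) = LOWER(n.title)
          AND i.startYear = n.releaseYear
          AND ABS(i.runtimeMinutes - n.runtime_min) <= :tol
    )
    ORDER BY n.title;
"""

unmatched_3 = demo.execute(UNMATCHED, {"tol": 3}).fetchall()
unmatched_10 = demo.execute(UNMATCHED, {"tol": 10}).fetchall()
print(unmatched_3)
print(unmatched_10)
```

With a tolerance of 3 minutes 'Mascots' is reported as unmatched. With 10 minutes it is not.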

In [None]:
query = """
        SELECT *
        FROM imdb_basics AS b
        INNER JOIN imdb_ratings AS r
        ON b.tconst = r.tconst
        WHERE b.primaryTitle = '7 años';
        """
for row in cur.execute(query): 
    print(row) 

No matches? We might have better luck with originalTitle than primaryTitle since the title isn't in English.

In [None]:
query = """
        SELECT *
        FROM imdb_basics AS b
        INNER JOIN imdb_ratings AS r
        ON b.tconst = r.tconst
        WHERE b.originalTitle = '7 años';
        """
for row in cur.execute(query): 
    print(row) 

So, some of the netflix films might use the title in the original language instead of the more popular title. It seems like we need to also compare the netflix title to the original title whenever the primary and original titles differ from each other.

In [None]:
# List non-matches with runtime +-10 and original title matching
# Joining on primary title and original title separately and use UNION to combine the joined tables
# Note that UNION ALL should yield the same results since when we're joining on original title, we're only subsetting the entries where
# the primary title does not match the original title.
query = """
        SELECT *
        FROM netflix AS og_n
        LEFT JOIN (SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT *
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014) 
                                AS i
                    ON LOWER(n.Title) = LOWER(i.primaryTitle) AND SUBSTR(n.releaseDate,-4,4) = i.startYear
                    WHERE (CAST(i.runtimeMinutes as INTEGER) <= n.runtime_min +10 AND CAST(i.runtimeMinutes as INTEGER) >= n.runtime_min -10)
                    UNION
                    SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT *
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014 AND LOWER(primaryTitle) != LOWER(originalTitle) )
                                AS i
                    ON LOWER(n.Title) = LOWER(i.originalTitle) AND SUBSTR(n.releaseDate,-4,4) = i.startYear
                    WHERE (CAST(i.runtimeMinutes as INTEGER) <= n.runtime_min +10 AND CAST(i.runtimeMinutes as INTEGER) >= n.runtime_min -10)
                    )
                    AS matched
        ON og_n.Title = matched.Title
        WHERE matched.Title IS NULL;
        """
# 
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 
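
An alternative to combining two joins with UNION is a single join whose condition accepts either title column. The sketch below illustrates that idea only. The IMDB rows are hypothetical (the `demo...` identifiers and the English title are invented for the example), and the simplified `imdb` table stands in for the joined basics/ratings subquery.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE netflix (title TEXT, releaseYear INTEGER, runtime_min INTEGER);
    CREATE TABLE imdb (tconst TEXT, primaryTitle TEXT, originalTitle TEXT,
                       startYear INTEGER, runtimeMinutes INTEGER);
    INSERT INTO netflix VALUES ('7 años', 2016, 77), ('Rebirth', 2016, 100);
    -- Hypothetical IMDB rows: identifiers and the English title are made up
    INSERT INTO imdb VALUES
        ('demo001', 'Seven Years (demo)', '7 años', 2016, 77),
        ('demo002', 'Rebirth', 'Rebirth', 2016, 100);
""")

# Match the netflix title against either the primary or the original title.
# DISTINCT guards against a row qualifying through both columns at once.
query = """
    SELECT DISTINCT n.title, i.tconst
    FROM netflix AS n
    INNER JOIN imdb AS i
    ON (LOWER(n.title) = LOWER(i.primaryTitle)
        OR LOWER(n.title) = LOWER(i.originalTitle))
       AND n.releaseYear = i.startYear
    WHERE ABS(i.runtimeMinutes - n.runtime_min) <= 10
    ORDER BY i.tconst;
"""
matches = demo.execute(query).fetchall()
print(matches)
```

One caveat: SQLite's built-in `LOWER()` only folds ASCII letters by default, so titles that differ only in the case of accented characters would still fail to match.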

In [None]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'Tramps';
        """
for row in cur.execute(query): 
    print(row) 

# Look for IMDB entries titled 'Tramps' released after 2000
query = """
        SELECT *
        FROM imdb_basics 
        WHERE primaryTitle = 'Tramps' AND startYear > 2000;
        """
for row in cur.execute(query): 
    print(row) 

The release year may also differ slightly between the two sources, so should we be more lenient on the release year as well (+-1 year)?

In [None]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'The Other One: The Long Strange Trip of Bob Weir';
        """
for row in cur.execute(query): 
    print(row) 

# Look up the IMDB entry directly by its tconst
query = """
        SELECT *
        FROM imdb_basics 
        WHERE tconst = 'tt3692768';
        """
for row in cur.execute(query): 
    print(row) 

In [None]:
# List non-matches with runtime +-10 and original title matching and release year more lenient +-1
query = """
        SELECT *
        FROM netflix AS og_n
        LEFT JOIN (SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT *
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014) 
                                AS i
                    ON LOWER(n.Title) = LOWER(i.primaryTitle)
                    WHERE (CAST(i.runtimeMinutes as INTEGER) <= n.runtime_min +10 AND CAST(i.runtimeMinutes as INTEGER) >= n.runtime_min -10)
                    AND i.startYear <= SUBSTR(n.releaseDate,-4,4) + 1 AND i.startYear >= SUBSTR(n.releaseDate,-4,4) - 1
                    UNION
                    SELECT n.title
                    FROM netflix AS n
                    LEFT JOIN (SELECT *
                                FROM imdb_basics AS b
                                INNER JOIN imdb_ratings AS r
                                ON b.tconst = r.tconst
                                WHERE startYear > 2014 AND LOWER(primaryTitle) != LOWER(originalTitle) )
                                AS i
                    ON LOWER(n.Title) = LOWER(i.originalTitle)
                    WHERE (CAST(i.runtimeMinutes as INTEGER) <= n.runtime_min +10 AND CAST(i.runtimeMinutes as INTEGER) >= n.runtime_min -10)
                    AND i.startYear <= SUBSTR(n.releaseDate,-4,4) + 1 AND i.startYear >= SUBSTR(n.releaseDate,-4,4) - 1
                    )
                    AS matched
        ON og_n.Title = matched.Title
        WHERE matched.Title IS NULL;
        """
# 
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

In [None]:
# Print netflix entry for comparison
query = """
        SELECT *
        FROM netflix
        WHERE title = 'Tig';
        """
for row in cur.execute(query): 
    print(row) 

# Look for IMDB entries titled 'Tig' released after 2000
query = """
        SELECT *
        FROM imdb_basics 
        WHERE primaryTitle = 'Tig' AND startYear > 2000;
        """
for row in cur.execute(query): 
    print(row) 

In [None]:
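# Runtime match with year filter - CURRENT BEST
# NOTE: the strict < / > comparisons below admit runtimes within +-2 min and only an exact year match;
# use <= / >= for inclusive +-3 min and +-1 year windows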
query = """
        SELECT COUNT(*)
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2000) 
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle)
        WHERE (CAST(i.runtimeMinutes as INTEGER) < n.runtime_min +3 AND CAST(i.runtimeMinutes as INTEGER) > n.runtime_min -3)
            AND i.startYear < SUBSTR(n.releaseDate,-4,4) + 1 AND i.startYear > SUBSTR(n.releaseDate,-4,4) - 1;
        """
# 
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 
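
Because years are integers, `startYear < year + 1 AND startYear > year - 1` is only true when `startYear` equals `year`, so it does not widen the window. The sketch below contrasts the strict and inclusive forms on made-up rows. The 'Demo Film' table contents and the `demo...` identifiers are hypothetical.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE netflix (title TEXT, releaseDate TEXT);
    CREATE TABLE imdb (tconst TEXT, primaryTitle TEXT, startYear INTEGER);
    INSERT INTO netflix VALUES ('Demo Film', 'April 21, 2017');
    INSERT INTO imdb VALUES ('demo2016', 'Demo Film', 2016),
                            ('demo2017', 'Demo Film', 2017),
                            ('demo2019', 'Demo Film', 2019);
""")

BASE = """
    SELECT i.tconst
    FROM netflix AS n
    INNER JOIN imdb AS i ON n.title = i.primaryTitle
    WHERE {condition}
    ORDER BY i.tconst;
"""

# Last four characters of releaseDate, cast explicitly to an integer year
year = "CAST(SUBSTR(n.releaseDate, -4, 4) AS INTEGER)"
strict = f"i.startYear < {year} + 1 AND i.startYear > {year} - 1"
inclusive = f"i.startYear BETWEEN {year} - 1 AND {year} + 1"

strict_hits = [row[0] for row in demo.execute(BASE.format(condition=strict))]
inclusive_hits = [row[0] for row in demo.execute(BASE.format(condition=inclusive))]
print(strict_hits)      # ['demo2017']
print(inclusive_hits)   # ['demo2016', 'demo2017']
```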

In [None]:
# Runtime match with year filter (strict < / > against year +- 1, i.e. exact year only) - CURRENT BEST
# Now join imdb titles separately for primary and original titles
# Matching on original title if the primary title is different from the original title = gives us 25 rows
query = """
        SELECT *
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2000 AND LOWER(primaryTitle) != LOWER(originalTitle) )
                    AS i
        ON LOWER(n.Title) = LOWER(i.originalTitle)
        WHERE (CAST(i.runtimeMinutes as INTEGER) < n.runtime_min +3 AND CAST(i.runtimeMinutes as INTEGER) > n.runtime_min -3)
            AND i.startYear < SUBSTR(n.releaseDate,-4,4) + 1 AND i.startYear > SUBSTR(n.releaseDate,-4,4) - 1;
        """
# 
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

In [None]:
# Runtime match with year filter (strict < / > against year +- 1, i.e. exact year only) - CURRENT BEST
# Now join imdb titles separately for primary and original titles
# Use union to combine films matched by the primary and the original titles -- takes forever to run, so let's just do count in the next cell
query = """
        SELECT *
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2000 )
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle)
        WHERE (CAST(i.runtimeMinutes as INTEGER) < n.runtime_min +3 AND CAST(i.runtimeMinutes as INTEGER) > n.runtime_min -3)
            AND i.startYear < SUBSTR(n.releaseDate,-4,4) + 1 AND i.startYear > SUBSTR(n.releaseDate,-4,4) - 1
        UNION
        SELECT *
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2000 AND LOWER(primaryTitle) != LOWER(originalTitle) )
                    AS i
        ON LOWER(n.Title) = LOWER(i.originalTitle)
        WHERE (CAST(i.runtimeMinutes as INTEGER) < n.runtime_min +3 AND CAST(i.runtimeMinutes as INTEGER) > n.runtime_min -3)
            AND i.startYear < SUBSTR(n.releaseDate,-4,4) + 1 AND i.startYear > SUBSTR(n.releaseDate,-4,4) - 1;
        """
# 
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

In [None]:
# Runtime match with year filter (strict < / > against year +- 1, i.e. exact year only) - CURRENT BEST
# Now join imdb titles separately for primary and original titles
# Use union to combine films matched by the primary and the original titles
query = """
        WITH combined AS
        (SELECT *
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2000 )
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle)
        WHERE (CAST(i.runtimeMinutes as INTEGER) < n.runtime_min +3 AND CAST(i.runtimeMinutes as INTEGER) > n.runtime_min -3)
            AND i.startYear < SUBSTR(n.releaseDate,-4,4) + 1 AND i.startYear > SUBSTR(n.releaseDate,-4,4) - 1
        UNION ALL
        SELECT *
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2000 AND LOWER(primaryTitle) != LOWER(originalTitle) )
                    AS i
        ON LOWER(n.Title) = LOWER(i.originalTitle)
        WHERE (CAST(i.runtimeMinutes as INTEGER) < n.runtime_min +3 AND CAST(i.runtimeMinutes as INTEGER) > n.runtime_min -3)
            AND i.startYear < SUBSTR(n.releaseDate,-4,4) + 1 AND i.startYear > SUBSTR(n.releaseDate,-4,4) - 1)
        SELECT COUNT(*)
        FROM combined;
        """
# 
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

In [None]:
# Runtime match
query = """
        SELECT title, releaseDate, genre, language, filmType, runtime_min AS runtime, tconst, averageRating, numVotes
        FROM netflix AS n
        LEFT JOIN (SELECT b.tconst, titleType, primaryTitle, isAdult, startYear, runtimeMinutes, averageRating, numVotes
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2000) 
                    AS i
        ON LOWER(n.Title) = LOWER(i.primaryTitle) AND substr(n.releaseDate,-4,4) = i.startYear
        WHERE (CAST(i.runtimeMinutes as INTEGER) < n.runtime_min +3 AND CAST(i.runtimeMinutes as INTEGER) > n.runtime_min -3);
        """
# Load data into Pandas DataFrame
df = pd.read_sql_query(query, conn)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# runtime mismatch
query = """
        SELECT i.tconst, n.title, n.releaseDate, i.startYear, n.runtime_min, CAST(i.runtimeMinutes as INTEGER)
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON n.Title = i.primaryTitle AND substr(n.releaseDate,-4,4) = i.startYear
        WHERE (CAST(i.runtimeMinutes as INTEGER) >= n.runtime_min +10 OR CAST(i.runtimeMinutes as INTEGER) <= n.runtime_min -10);
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

In [None]:
# Match title between netflix and IMDB data
query = """
        SELECT COUNT(*)
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON n.Title = i.primaryTitle AND substr(n.releaseDate,-4,4) = i.startYear;
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

In [None]:
# non match on title and year
query = """
        SELECT COUNT(*)
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    ) 
                    AS i
        ON n.Title = i.primaryTitle
        WHERE averageRating IS NULL;
        """
# AND substr(n.releaseDate,-4,4) = i.startYear
for row in cur.execute(query): 
    print(row) 

In [None]:
# Match title between netflix and IMDB data
query = """
        SELECT i.tconst, n.title, n.releaseDate, i.startYear, n.genre, i.genres, i.averageRating, i.numVotes
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON n.Title = i.primaryTitle 
        LIMIT 10;
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

There are multiple items with the same title in the IMDB data that seem to be joining to the same Netflix title, so let's join conjunctively on both title and year.

In [None]:
# TODO: have a duplicate checking code here
# check if the preceding title is the same -- looking for duplicates
# Match title between netflix and IMDB data
query = """
        SELECT title, releaseDate, filmType, tconst, titleType
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON n.title = i.primaryTitle AND substr(n.releaseDate,-4,4) = i.startYear;
        """
out = cur.execute(query)

# Get column names to display with the output
names = list(map(lambda x: x[0], cur.description))
print(names)

# Print output of the query
for row in out: 
    print(row) 
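
One possible way to handle the duplicate-checking TODO is to group the joined rows by Netflix title and keep the groups with more than one IMDB match. This is a sketch under assumptions, not the notebook's final code: the simplified `imdb` table stands in for the joined basics/ratings subquery, and the rows are the 'Rebirth' and 'Tallulah' entries printed earlier.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE netflix (title TEXT, releaseYear INTEGER);
    CREATE TABLE imdb (tconst TEXT, primaryTitle TEXT, startYear INTEGER);
    INSERT INTO netflix VALUES ('Rebirth', 2016), ('Tallulah', 2016);
    INSERT INTO imdb VALUES
        ('tt4902716', 'Rebirth', 2016),
        ('tt5798216', 'Rebirth', 2016),
        ('tt1639084', 'Tallulah', 2016);
""")

# Netflix titles that join to more than one IMDB entry
query = """
    SELECT n.title, n.releaseYear, COUNT(*) AS n_matches,
           GROUP_CONCAT(i.tconst) AS tconsts
    FROM netflix AS n
    INNER JOIN imdb AS i
    ON n.title = i.primaryTitle AND n.releaseYear = i.startYear
    GROUP BY n.title, n.releaseYear
    HAVING COUNT(*) > 1;
"""
duplicates = demo.execute(query).fetchall()
for row in duplicates:
    print(row)
```

The `tconsts` column lists the competing IMDB identifiers so each duplicate can be inspected by hand.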

In [None]:
query = """
        SELECT *
        FROM imdb_basics
        WHERE primaryTitle == 'Rebirth' AND startYear = 2016;
        """
out = cur.execute(query)

# Get column names to display with the output
names = list(map(lambda x: x[0], cur.description))
print(names)

# Print output of the query
for row in out: 
    print(row) 

In [None]:
query = """
        SELECT *
        FROM netflix
        WHERE title == 'Rebirth';
        """
# Print output of the query
for row in cur.execute(query): 
    print(row) 

In [None]:
# Match title between netflix and IMDB data
query = """
        SELECT i.titleType, n.title, n.releaseDate, i.startYear, n.genre, i.genres, i.averageRating, i.numVotes
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE startYear > 2014) 
                    AS i
        ON n.title = i.primaryTitle AND substr(n.releaseDate,-4,4) = i.startYear
        LIMIT 10;
        """
out = cur.execute(query)

# Get column names to display with the output
names = list(map(lambda x: x[0], cur.description))
print(names)

# Print output of the query
for row in out: 
    print(row) 

Even with the title and year match filter, there are still two IMDB items titled 'Rebirth' that are joined to the same item in the Netflix data. One is a 'tvEpisode' and the other is a 'movie'. Out of curiosity, let's list all the IMDB items that have the title 'Rebirth'.

In [None]:
query = """
        SELECT *
        FROM imdb_basics
        WHERE primaryTitle = 'Rebirth';
        """
for row in cur.execute(query): 
    print(row) 

So.. a LOT. From the second-to-last query, it seems like most matches have 'movie' as their titleType in IMDB (which makes sense since we're looking at Netflix original *films*), so let's limit the IMDB data to entries with 'movie' as titleType.

In [None]:
# Match title between netflix and IMDB data
query = """
        SELECT i.titleType, n.title, n.releaseDate, i.startYear, n.genre, i.genres, i.averageRating, i.numVotes
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE b.startYear > 2014 AND b.titleType = 'movie') 
                    AS i
        ON LOWER(n.title) = LOWER(i.primaryTitle) AND SUBSTR(n.releaseDate,-4,4) = i.startYear
        LIMIT 20;
        """
out = cur.execute(query)

# Get column names to display with the output
names = list(map(lambda x: x[0], cur.description))
print(names)

# Print output of the query
for row in out: 
    print(row) 
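
Restricting IMDB to 'movie' removes the 'tvEpisode' duplicate, but two different movies with the same title and year would still produce duplicate rows in the join. A minimal sketch of how one might check for that is shown below. It runs on a toy in-memory database whose rows and `tt_demo…` identifiers are entirely made up, and it leaves out the `imdb_ratings` join for brevity.

```python
import sqlite3

# Toy stand-ins for the real tables; every row here is invented.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE netflix (title TEXT, releaseDate TEXT);
    CREATE TABLE imdb_basics (tconst TEXT, titleType TEXT,
                              primaryTitle TEXT, startYear INTEGER);
    INSERT INTO netflix VALUES
        ('Film A', 'July 1, 2016'),
        ('Film B', 'May 1, 2017');
    INSERT INTO imdb_basics VALUES
        ('tt_demo1', 'movie',     'Film A', 2016),
        ('tt_demo2', 'movie',     'Film A', 2016),
        ('tt_demo3', 'tvEpisode', 'Film A', 2016),
        ('tt_demo4', 'movie',     'Film B', 2017);
""")

# Netflix titles that match more than one IMDB movie on title + year.
query = """
        SELECT n.title, COUNT(*) AS n_matches
        FROM netflix AS n
        INNER JOIN imdb_basics AS b
        ON LOWER(n.title) = LOWER(b.primaryTitle)
           AND CAST(SUBSTR(n.releaseDate, -4, 4) AS INTEGER) = b.startYear
        WHERE b.titleType = 'movie'
        GROUP BY n.title, n.releaseDate
        HAVING COUNT(*) > 1;
        """
duplicates = demo.execute(query).fetchall()
print(duplicates)
demo.close()
```

On the real tables, any title returned by this kind of query would need a tie-breaker (for example, the entry with more votes) before its rating can be trusted.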

Check how many films were not found in IMDB and see if there's any pattern to those.

In [None]:
# Count the Netflix films that have no title+year match among IMDB movies
query = """
        SELECT COUNT(*)
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE b.startYear > 2014 AND b.titleType = 'movie') 
                    AS i
        ON LOWER(n.title) = LOWER(i.primaryTitle) AND SUBSTR(n.releaseDate,-4,4) = i.startYear
        WHERE i.titleType IS NULL;
        """
out = cur.execute(query)
# Print output of the query
for row in out: 
    print(row) 

306 of the 1413 Netflix films (about 22%) going unmatched is a lot. Let's look at the titles to see what might be going on.

In [None]:
# Get the titles of Netflix films that didn't have a title+year match in IMDB
query = """
        SELECT n.title, n.releaseDate
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE b.startYear > 2014 AND b.titleType = 'movie') 
                    AS i
        ON LOWER(n.title) = LOWER(i.primaryTitle) AND SUBSTR(n.releaseDate,-4,4) = i.startYear
        WHERE i.titleType IS NULL;
        """
out = cur.execute(query)
# Print output of the query
for row in out: 
    print(row) 

A lot of these titles contain punctuation or connectives that can be written more than one way (e.g. colons, periods, ampersands, 'and'). Some are part of a series (e.g. 'Fear Street Part 1: 1994'), which might have a titleType other than 'movie' or be listed under a different name in IMDB. Let's try to find these films in the IMDB dataset with a more lenient match.

In [None]:
query = """
        SELECT *
        FROM (SELECT *
                FROM imdb_basics AS b
                INNER JOIN imdb_ratings AS r
                ON b.tconst = r.tconst) 
                AS i
        WHERE i.primaryTitle LIKE '%Rodney King%';
        """
out = cur.execute(query)
# Print output of the query
for row in out: 
    print(row) 
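
Rather than probing one title at a time, a more lenient match could compare normalized titles on both sides of the join. The sketch below is one possible approach, not something the notebook already does: `norm_title` is a hypothetical helper registered with `sqlite3`'s `create_function`, and the two tables hold made-up rows in an in-memory database.

```python
import re
import sqlite3

def norm_title(title):
    """Lowercase, map '&' to 'and', replace punctuation with spaces, squeeze spaces."""
    if title is None:
        return None
    text = title.lower().replace("&", " and ")
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

# Toy stand-ins for the real tables; every row here is invented.
demo = sqlite3.connect(":memory:")
demo.create_function("norm_title", 1, norm_title)
demo.executescript("""
    CREATE TABLE netflix (title TEXT);
    CREATE TABLE imdb_basics (primaryTitle TEXT);
    INSERT INTO netflix VALUES ('Night & Day: A Demo'), ('Unmatched Demo');
    INSERT INTO imdb_basics VALUES ('Night and Day - A Demo'), ('Something Else');
""")

# Join on the normalized form instead of the raw title.
query = """
        SELECT n.title, b.primaryTitle
        FROM netflix AS n
        INNER JOIN imdb_basics AS b
        ON norm_title(n.title) = norm_title(b.primaryTitle);
        """
rows = demo.execute(query).fetchall()
print(rows)
demo.close()
```

Calling a Python function inside a join against the full IMDB table is likely to be slow, so on real data it may be better to store the normalized title in its own column once and join on that.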

In [None]:
# Get the Netflix titles that are still unmatched when IMDB is limited to everything except 'tvEpisode'
query = """
        SELECT n.title
        FROM netflix AS n
        LEFT JOIN (SELECT *
                    FROM imdb_basics AS b
                    INNER JOIN imdb_ratings AS r
                    ON b.tconst = r.tconst
                    WHERE b.startYear > 2014 AND b.titleType != 'tvEpisode') 
                    AS i
        ON LOWER(n.title) = LOWER(i.primaryTitle) AND SUBSTR(n.releaseDate,-4,4) = i.startYear
        WHERE i.titleType IS NULL;
        """
out = cur.execute(query)
# Print output of the query
for row in out: 
    print(row) 
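
The same LEFT JOIN is repeated in several cells above. One way to avoid the repetition is to keep it in a common table expression (CTE) and append different final SELECTs to it, which also makes it easy to look for patterns such as which filmType values go unmatched most often. The sketch below assumes the column names listed at the top of this notebook. The `MATCHED_CTE` name and all rows are made up for illustration.

```python
import sqlite3

# Toy stand-ins for the real tables; every row here is invented.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE netflix (title TEXT, releaseDate TEXT, filmType TEXT);
    CREATE TABLE imdb_basics (tconst TEXT, titleType TEXT,
                              primaryTitle TEXT, startYear INTEGER);
    CREATE TABLE imdb_ratings (tconst TEXT, averageRating REAL, numVotes INTEGER);
    INSERT INTO netflix VALUES
        ('Demo Feature', 'March 3, 2018',   'feature film'),
        ('Demo Doc',     'June 9, 2019',    'documentary'),
        ('Demo Special', 'January 2, 2020', 'special');
    INSERT INTO imdb_basics VALUES
        ('tt_demo1', 'movie',     'Demo Feature', 2018),
        ('tt_demo2', 'tvSpecial', 'Demo Special', 2020);
    INSERT INTO imdb_ratings VALUES ('tt_demo1', 6.5, 100), ('tt_demo2', 7.0, 50);
""")

# The shared join, written once; titleType is NULL for unmatched Netflix films.
MATCHED_CTE = """
    WITH matched AS (
        SELECT n.title, n.releaseDate, n.filmType, i.titleType, i.averageRating
        FROM netflix AS n
        LEFT JOIN (SELECT b.titleType, b.primaryTitle, b.startYear, r.averageRating
                   FROM imdb_basics AS b
                   INNER JOIN imdb_ratings AS r
                   ON b.tconst = r.tconst
                   WHERE b.startYear > 2014 AND b.titleType = 'movie') AS i
        ON LOWER(n.title) = LOWER(i.primaryTitle)
           AND CAST(SUBSTR(n.releaseDate, -4, 4) AS INTEGER) = i.startYear
    )
"""

# Reuse 1: how many Netflix films have no match.
unmatched_count = demo.execute(
    MATCHED_CTE + "SELECT COUNT(*) FROM matched WHERE titleType IS NULL"
).fetchone()[0]

# Reuse 2: unmatched films broken down by Netflix filmType.
by_type = demo.execute(
    MATCHED_CTE + """
    SELECT filmType, COUNT(*)
    FROM matched
    WHERE titleType IS NULL
    GROUP BY filmType
    ORDER BY filmType
    """
).fetchall()

print(unmatched_count, by_type)
demo.close()
```

A temporary view would serve the same purpose. The CTE string is used here only because it keeps the whole query visible in one place.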

In [None]:
# Close connection to SQLite database 
conn.close() 