### Before combining all the data, I want to check how many Netflix titles can be matched against the Rotten Tomatoes and Letterboxd data, separately.

Below are the fields of the tables saved in the database.

### netflix – Contains the Netflix original film list (webscraped from Wikipedia):
- title (string) - official Netflix title
- releaseDate (string) - release date in '{month} {day}, {year}' format (e.g. 'October 16, 2015') 
- genre - main genre
- runtime - runtime of the film in '{hour} h {minute} min' format (e.g. '1 h 40 min')
- language - main language of the film
- filmType - type of the film (e.g. feature film, documentary, special, etc)
- runtime_min - runtime converted to minutes
- releaseYear - release year as an integer (e.g. 2015)
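
As an illustration of how `runtime_min` relates to `runtime`, here is a minimal sketch of the conversion. The helper name `runtime_to_minutes` and the regular expression are assumptions made for this example, not necessarily the code that was used to build the table.

```python
import re

def runtime_to_minutes(runtime):
    """Convert strings like '1 h 40 min', '2 h' or '45 min' to total minutes.

    Hypothetical helper: returns None when the string has no recognizable parts.
    """
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", runtime)
    if match is None or not any(match.groups()):
        return None
    # A missing hour or minute part counts as zero.
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return 60 * hours + minutes

print(runtime_to_minutes("2 h 17 min"))  # 137
print(runtime_to_minutes("2 h"))         # 120
```

Both values agree with the `runtime` / `runtime_min` pairs that appear in the join output further down ('2 h 17 min' with 137, '2 h' with 120).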

### rotten_tomatoes
- id (string) - Unique identifier for each movie
- title - The title of the movie
- audienceScore - The average score given by regular viewers (0-100)
- tomatoMeter - The percentage of positive reviews from professional critics (0-100)
- rating - The movie's age-based classification (e.g., G, PG, PG-13, R)
- ratingContents - Content leading to the rating classification
- releaseDateTheaters - The date the movie was released in theaters
- releaseDateStreaming - The date the movie became available for streaming
- runtimeMinutes - The duration of the movie in minutes
- genre - The movie's genre(s)
- originalLanguage - The original language of the movie
- director - The movie's director
- writer - The writer(s) responsible for the movie's screenplay
- boxOffice - The movie's total box office revenue
- distributor - The company responsible for distributing the movie
- soundMix - The audio format(s) used in the movie
- releaseYearEarlier - The year the movie was released, either in theaters or on a streaming service, whichever is earlier

### letterboxd
- id - unique identifier for each movie
- name - title of the movie
- date - release year of the movie
- minute - runtime in minutes
- rating - rating on Letterboxd

In [10]:
import pandas as pd
import os
import sqlite3 

# Connect to SQLite database 
conn = sqlite3.connect('films.db') 

# Create a cursor object 
cur = conn.cursor() 

In [17]:
# Check that the rotten_tomatoes table was saved correctly by pulling up the first 10 entries
query = """
        SELECT *
        FROM rotten_tomatoes
        LIMIT 10;
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['id', 'title', 'audienceScore', 'tomatoMeter', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix', 'releaseYearEarlier']
('space-zombie-bingo', 'Space Zombie Bingo!', 50, None, None, None, None, '2018-08-25', 75, 'Comedy, Horror, Sci-fi', 'English', 'George Ormrod', 'George Ormrod,John Sabotta', None, None, None, 2018.0)
('love_lies', 'Love, Lies', 43, None, None, None, None, None, 120, 'Drama', 'Korean', 'Park Heung-Sik,Heung-Sik Park', 'Ha Young-Joon,Jeon Yun-su,Song Hye-jin', None, None, None, None)
('the_sore_losers_1997', 'Sore Losers', 60, None, None, None, None, '2020-10-23', 90, 'Action, Mystery & thriller', 'English', 'John Michael McCarthy', 'John Michael McCarthy', None, None, None, 2020.0)
('dinosaur_island_2002', 'Dinosaur Island', 70, None, None, None, None, '2017-03-27', 80, 'Fantasy, Adventure, Animation', 'English', 'Will Meugniot', 'John

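Let's try joining the netflix and rotten_tomatoes tables on title, compared case-insensitively. I also add a runtime constraint up front: the two runtimes must differ by less than 3 minutes, which should filter out different films that happen to share a title.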

In [22]:
query = """
        SELECT *
        FROM netflix AS n
        INNER JOIN rotten_tomatoes AS r
        ON LOWER(n.title) = LOWER(r.title)
        WHERE ( r.runtimeMinutes < n.runtime_min +3 AND r.runtimeMinutes > n.runtime_min -3 )
        LIMIT 10;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear', 'id', 'title', 'audienceScore', 'tomatoMeter', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix', 'releaseYearEarlier']
('Beasts of No Nation', 'October 16, 2015', 'War drama', '2 h 17 min', 'English', 'Feature films', 137, 2015, 'beasts_of_no_nation', 'Beasts of No Nation', 92, 91, None, None, '2015-10-16', '2017-02-06', 136, 'War, Drama', 'English', 'Cary Joji Fukunaga', 'Cary Joji Fukunaga', '$83.9K', 'Bleecker Street Media, Netflix', None, 2015.0)
('The Ridiculous 6', 'December 11, 2015', 'Western comedy', '2 h', 'English', 'Feature films', 120, 2015, 'the_ridiculous_6', 'The Ridiculous 6', 35, 0, None, None, None, '2017-04-03', 119, 'Comedy, Western', 'English', 'Frank Coraci', 'Tim Herlihy,Adam Sandler', None, None, None, 2017.0)
("Pee-wee's Big Holida
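
Because `SELECT *` returns two `title` and two `genre` columns, the joined rows are a little hard to read. A possible variant, sketched here on a small in-memory toy database, selects aliased columns and writes the runtime window as `ABS(...) < 3`, which is equivalent to the pair of inequalities above. The toy rows, the `toy_other_cut` id and the alias names are made up for illustration.

```python
import sqlite3

# Toy stand-in for films.db, with only the columns the join needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix (title TEXT, runtime_min INTEGER);
    CREATE TABLE rotten_tomatoes (id TEXT, title TEXT, runtimeMinutes INTEGER);
    INSERT INTO netflix VALUES
        ('Beasts of No Nation', 137),
        ('The Ridiculous 6', 120);
    INSERT INTO rotten_tomatoes VALUES
        ('beasts_of_no_nation', 'Beasts of No Nation', 136),
        ('the_ridiculous_6', 'The Ridiculous 6', 119),
        ('toy_other_cut', 'the ridiculous 6', 95);
""")

query = """
    SELECT n.title AS netflix_title, r.id AS rt_id, n.runtime_min, r.runtimeMinutes
    FROM netflix AS n
    INNER JOIN rotten_tomatoes AS r
        ON LOWER(n.title) = LOWER(r.title)
    WHERE ABS(r.runtimeMinutes - n.runtime_min) < 3
    ORDER BY n.title;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The 95-minute toy row shares a title with 'The Ridiculous 6' but is rejected by the runtime window, which is the kind of false match the constraint is meant to catch.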

The first 10 entries look good. Let's check the entire query result for duplicates, i.e. multiple rows of the rotten_tomatoes table joined to the same entry in the netflix table.

In [24]:
# print the number of rows in the results and the unique number of titles to see how many of the query results contain duplicate titles
query = """
        SELECT COUNT(*), COUNT(DISTINCT n.title)
        FROM netflix AS n
        INNER JOIN rotten_tomatoes AS r
        ON LOWER(n.title) = LOWER(r.title)
        WHERE ( r.runtimeMinutes < n.runtime_min +3 AND r.runtimeMinutes > n.runtime_min -3 );
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['COUNT(*)', 'COUNT(DISTINCT n.title)']
(733, 702)
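
The join returns 733 rows for 702 distinct titles, so 733 - 702 = 31 rows are additional matches on a title that is already matched. To see which titles are affected, one option is to group the joined rows by title and keep the groups with more than one row. The sketch below shows this on invented toy data ('Film A', 'Film B'); in the real query the same `GROUP BY ... HAVING` clause would go after the `WHERE` clause.

```python
import sqlite3

# Toy data: 'Film A' matches two Rotten Tomatoes rows, 'Film B' matches one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix (title TEXT, runtime_min INTEGER);
    CREATE TABLE rotten_tomatoes (id TEXT, title TEXT, runtimeMinutes INTEGER);
    INSERT INTO netflix VALUES ('Film A', 100), ('Film B', 90);
    INSERT INTO rotten_tomatoes VALUES
        ('film_a_2016', 'Film A', 100),
        ('film_a_1999', 'Film A', 101),
        ('film_b', 'Film B', 90);
""")

query = """
    SELECT n.title, COUNT(*) AS n_matches
    FROM netflix AS n
    INNER JOIN rotten_tomatoes AS r
        ON LOWER(n.title) = LOWER(r.title)
    WHERE ABS(r.runtimeMinutes - n.runtime_min) < 3
    GROUP BY n.title
    HAVING COUNT(*) > 1
    ORDER BY n_matches DESC;
"""
multi_matched = conn.execute(query).fetchall()
print(multi_matched)  # [('Film A', 2)]
```

Grouping by `n.title` treats two Netflix films with the same title as one group, which is the same simplification that `COUNT(DISTINCT n.title)` makes.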


In [25]:
# repeat the count with an added release-year constraint: the Rotten Tomatoes year must be within 1 year of the Netflix release year
query = """
        SELECT COUNT(*), COUNT(DISTINCT n.title)
        FROM netflix AS n
        INNER JOIN rotten_tomatoes AS r
        ON LOWER(n.title) = LOWER(r.title)
        WHERE ( r.runtimeMinutes < n.runtime_min +3 AND r.runtimeMinutes > n.runtime_min -3 )
        AND r.releaseYearEarlier <= n.releaseYear + 1 AND r.releaseYearEarlier >= n.releaseYear - 1;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['COUNT(*)', 'COUNT(DISTINCT n.title)']
(689, 679)
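
One thing to keep in mind when reading these counts: in SQL a comparison with NULL is never true, so the year constraint also removes every Rotten Tomatoes row whose `releaseYearEarlier` is missing, not only rows with a clearly different year. The sketch below demonstrates this on invented toy data; it does not say how many of the rows dropped above fall into each case.

```python
import sqlite3

# Toy data: one row with a close year, one with a missing year, one far off.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE netflix (title TEXT, runtime_min INTEGER, releaseYear INTEGER);
    CREATE TABLE rotten_tomatoes (
        id TEXT, title TEXT, runtimeMinutes INTEGER, releaseYearEarlier REAL
    );
    INSERT INTO netflix VALUES ('Film A', 100, 2016);
    INSERT INTO rotten_tomatoes VALUES
        ('film_a_2016', 'Film A', 100, 2017.0),
        ('film_a_unknown', 'Film A', 101, NULL),
        ('film_a_1999', 'Film A', 99, 1999.0);
""")

# Shared join and runtime window, reused by both queries below.
base = """
    FROM netflix AS n
    INNER JOIN rotten_tomatoes AS r ON LOWER(n.title) = LOWER(r.title)
    WHERE ABS(r.runtimeMinutes - n.runtime_min) < 3
"""
# Rows that pass the year window (a NULL year makes this condition NULL, not true).
kept = conn.execute(
    "SELECT r.id" + base + " AND ABS(r.releaseYearEarlier - n.releaseYear) <= 1"
).fetchall()
# Rows that passed the runtime window but have no year to compare.
null_year = conn.execute(
    "SELECT r.id" + base + " AND r.releaseYearEarlier IS NULL"
).fetchall()

print(kept)       # [('film_a_2016',)]
print(null_year)  # [('film_a_unknown',)]
```

Running the `IS NULL` variant against the real tables would show how many runtime-matched rows are lost only because the year is missing.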


In [21]:
# Look up every rotten_tomatoes entry titled 'Rebirth' to see how one title can match several rows
query = """
        SELECT *
        FROM rotten_tomatoes
        WHERE title = 'Rebirth';
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['id', 'title', 'audienceScore', 'tomatoMeter', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix', 'releaseYearEarlier']
('project_rebirth_2010', 'Rebirth', 79, 92, None, None, '2011-08-26', '2011-09-06', 104, 'Documentary', 'English', 'James Whitaker', None, '$12.3K', 'Oscilloscope Pictures', None, 2011.0)
('yokame_no_semi', 'Rebirth', 69, None, None, None, None, None, 147, 'Drama', 'Japanese', 'Izuru Narushima', 'Satoko Okudera', None, None, 'Dolby Digital', None)
('yokame_no_semi', 'Rebirth', 69, None, None, None, None, None, 147, 'Drama', 'Japanese', 'Izuru Narushima', 'Satoko Okudera', None, None, 'Dolby Digital', None)
('rebirth_2016', 'Rebirth', 27, 43, None, None, None, '2017-05-22', 100, 'Mystery & thriller', 'English', 'Karl Mueller', 'Karl Mueller', None, None, None, 2017.0)
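
The output shows two separate effects: three different films share the title 'Rebirth', and the `yokame_no_semi` row appears twice. The second effect can be checked directly by counting rows per `id`. The sketch below rebuilds just those four rows (id, title, runtime) in an in-memory table; using `SELECT DISTINCT *` to drop exact duplicates is one possible approach, not necessarily the one to apply to the full table.

```python
import sqlite3

# The four 'Rebirth' rows from the output above, reduced to three columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rotten_tomatoes (id TEXT, title TEXT, runtimeMinutes INTEGER);
    INSERT INTO rotten_tomatoes VALUES
        ('project_rebirth_2010', 'Rebirth', 104),
        ('yokame_no_semi', 'Rebirth', 147),
        ('yokame_no_semi', 'Rebirth', 147),
        ('rebirth_2016', 'Rebirth', 100);
""")

# ids that occur on more than one row
repeated_ids = conn.execute("""
    SELECT id, COUNT(*) AS n_rows
    FROM rotten_tomatoes
    GROUP BY id
    HAVING COUNT(*) > 1;
""").fetchall()

# exact duplicate rows collapsed into one
deduplicated = conn.execute(
    "SELECT DISTINCT * FROM rotten_tomatoes ORDER BY id;"
).fetchall()

print(repeated_ids)       # [('yokame_no_semi', 2)]
print(len(deduplicated))  # 3
```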
