### Before combining all the data, I want to check how many Netflix films can be matched in the Rotten Tomatoes and Letterboxd data, separately.

Below are the fields of the tables saved in the database.

### netflix – Contains the Netflix original film list (webscraped from Wikipedia):
- title (string) - official Netflix title
- releaseDate (string) - release date in '{month} {day}, {year}' format (e.g. 'October 16, 2015') 
- genre - main genre
- runtime - runtime of the film in '{hour} h {minute} min' format (e.g. '1 h 40 min')
- language - main language of the film
- filmType - type of the film (e.g. feature film, documentary, special, etc)
- runtime_min - runtime converted to minutes
- releaseYear - release year of the film (appears in the query outputs further down)
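As a rough sketch of how the derived fields could be computed from the string fields above: the helper names `runtime_to_minutes` and `release_year` are my own and are not taken from the scraping code, and the date parsing assumes an English locale.

```python
import re
from datetime import datetime

def runtime_to_minutes(runtime: str) -> int:
    """Convert a string such as '1 h 40 min' or '58 min' to total minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", runtime)
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognized runtime format: {runtime!r}")
    # A missing hour or minute part counts as zero
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

def release_year(release_date: str) -> int:
    """Extract the year from a '{month} {day}, {year}' string."""
    return datetime.strptime(release_date, "%B %d, %Y").year

print(runtime_to_minutes("1 h 40 min"))  # 100
print(release_year("October 16, 2015"))  # 2015
```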

### rotten_tomatoes
- id (string) - Unique identifier for each movie
- title - The title of the movie
- audienceScore - The average score given by regular viewers (0-100)
- tomatoMeter - The percentage of positive reviews from professional critics (0-100)
- rating - The movie's age-based classification (e.g., G, PG, PG-13, R)
- ratingContents - Content leading to the rating classification
- releaseDateTheaters - The date the movie was released in theaters
- releaseDateStreaming - The date the movie became available for streaming
- runtimeMinutes - The duration of the movie in minutes
- genre - The movie's genre(s)
- originalLanguage - The original language of the movie
- director - The movie's director
- writer - The writer(s) responsible for the movie's screenplay
- boxOffice - The movie's total box office revenue
- distributor - The company responsible for distributing the movie
- soundMix - The audio format(s) used in the movie
- releaseYearEarlier - The year the movie was released, either in theaters or on a streaming service, whichever is earlier
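A minimal pandas sketch of how `releaseYearEarlier` could be derived from the two date columns. This is only my assumption about the derivation, not necessarily how the column was built; the three sample rows reuse dates that appear in the query outputs in this notebook.

```python
import pandas as pd

rt = pd.DataFrame({
    "title": ["Cargo", "Space Zombie Bingo!", "Love, Lies"],
    "releaseDateTheaters": ["2019-05-30", None, None],
    "releaseDateStreaming": ["2018-02-09", "2018-08-25", None],
})

# Missing dates become NaT, so the extracted years are floats with NaN
theater_year = pd.to_datetime(rt["releaseDateTheaters"]).dt.year
streaming_year = pd.to_datetime(rt["releaseDateStreaming"]).dt.year

# Row-wise minimum skips NaN, so a single available date is still used
rt["releaseYearEarlier"] = pd.concat([theater_year, streaming_year], axis=1).min(axis=1)
print(rt[["title", "releaseYearEarlier"]])
```

The float result (e.g. `2018.0`) and the missing value for films with no dates are consistent with what the table shows.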

### letterboxd
- id - unique identifier for each movie
- name - title of the movie
- date - release year of the movie
- minute - runtime in minutes
- rating - rating on letterboxd

In [3]:
import pandas as pd
import os
import sqlite3 

# Connect to SQLite database 
conn = sqlite3.connect('films.db') 

# Create a cursor object 
cur = conn.cursor() 

In [4]:
# Check that the rotten_tomatoes table was saved correctly by pulling up the first 10 entries
query = """
        SELECT *
        FROM rotten_tomatoes
        LIMIT 10;
        """
# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)
for row in out: 
    print(row) 

['id', 'title', 'audienceScore', 'tomatoMeter', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix', 'releaseYearEarlier']
('space-zombie-bingo', 'Space Zombie Bingo!', 50, None, None, None, None, '2018-08-25', 75, 'Comedy, Horror, Sci-fi', 'English', 'George Ormrod', 'George Ormrod,John Sabotta', None, None, None, 2018.0)
('love_lies', 'Love, Lies', 43, None, None, None, None, None, 120, 'Drama', 'Korean', 'Park Heung-Sik,Heung-Sik Park', 'Ha Young-Joon,Jeon Yun-su,Song Hye-jin', None, None, None, None)
('the_sore_losers_1997', 'Sore Losers', 60, None, None, None, None, '2020-10-23', 90, 'Action, Mystery & thriller', 'English', 'John Michael McCarthy', 'John Michael McCarthy', None, None, None, 2020.0)
('dinosaur_island_2002', 'Dinosaur Island', 70, None, None, None, None, '2017-03-27', 80, 'Fantasy, Adventure, Animation', 'English', 'Will Meugniot', 'John
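Since the same execute-and-print pattern is repeated in every cell, it could be wrapped in a small helper. The sketch below uses a hypothetical `run_query` function and an in-memory demo database so it runs on its own; with the real data, `conn` would be the connection to `films.db`.

```python
import sqlite3

def run_query(conn, query, params=()):
    """Run a query, print the column names and rows, and return both."""
    cur = conn.execute(query, params)
    names = [col[0] for col in cur.description]
    rows = cur.fetchall()
    print(names)
    for row in rows:
        print(row)
    return names, rows

# Tiny stand-in table with a subset of the rotten_tomatoes columns
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE rotten_tomatoes (id TEXT, title TEXT, audienceScore INTEGER)")
demo.execute("INSERT INTO rotten_tomatoes VALUES ('love_lies', 'Love, Lies', 43)")

names, rows = run_query(demo, "SELECT * FROM rotten_tomatoes LIMIT 10;")
```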

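Let's try joining the netflix and rotten_tomatoes tables on title, starting with the loosest constraints to see how many matches we can get. The query below joins on the lowercased title and the release year, and lists the Netflix films that match more than one Rotten Tomatoes entry.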

TODO: add a unique id to netflix films, since there are 3 pairs of films with the same titles.
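One possible way to handle this TODO, sketched on an in-memory copy: SQLite's built-in `rowid` can be materialized as an `id` column. The table name `netflix_with_id` is hypothetical, and the demo table only carries a few of the netflix columns.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE netflix (title TEXT, releaseDate TEXT, runtime_min INTEGER)")
demo.executemany(
    "INSERT INTO netflix VALUES (?, ?, ?)",
    [("Monster", "May 7, 2021", 99), ("Monster", "May 16, 2024", 84)],
)

# Copy the table, exposing the implicit rowid as an explicit id column
demo.execute("""
    CREATE TABLE netflix_with_id AS
    SELECT rowid AS id, * FROM netflix
""")

rows = demo.execute(
    "SELECT id, title, releaseDate FROM netflix_with_id ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

With an `id` like this, two films sharing a title can be grouped and joined separately.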

In [44]:
query = """
        SELECT *
        FROM netflix AS n
        INNER JOIN rotten_tomatoes AS r
        ON LOWER(n.title) = LOWER(r.title) AND n.releaseYear = r.releaseYearEarlier
        GROUP BY n.id, n.releaseYear
        HAVING COUNT(*) > 1;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['id', 'title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear', 'id', 'title', 'audienceScore', 'tomatoMeter', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix', 'releaseYearEarlier']
(118, 'Paradox', 'March 23, 2018', 'Musical / Western / Fantasy', '1 h 13 min', 'English', 'Feature films', 73, 2018, 'paradox', 'Paradox', 34, 50, None, None, None, '2018-05-08', 101, 'Action', 'Chinese', 'Wilson Yip', 'Nick Cheuk,Lai-yin Leung', None, None, None, 2018.0)
(132, 'Cargo', 'May 18, 2018', 'Drama / Horror', '1 h 44 min', 'English', 'Feature films', 104, 2018, 'cargo_2017', 'Cargo', None, 60, None, None, '2019-05-30', '2018-02-09', 112, 'Drama', 'English', 'Kareem Mortimer', 'Kareem Mortimer', None, 'Artist Rights Distribution', None, 2018.0)
(155, 'The Angel', 'September 14, 2018', 'Spy thriller', '1 h 54 min', 'English'

Let's add the runtime (±3 minutes) and release year (±1 year) constraints.

In [24]:
query = """
        SELECT COUNT(DISTINCT n.title) AS n_unique_titles, COUNT(*) AS n_total
        FROM netflix AS n
        INNER JOIN rotten_tomatoes AS r
        ON LOWER(n.title) = LOWER(r.title)
        WHERE ( r.runtimeMinutes <= n.runtime_min +3 AND r.runtimeMinutes >= n.runtime_min -3 )
        AND r.releaseYearEarlier <= n.releaseYear + 1 AND r.releaseYearEarlier >= n.releaseYear - 1;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['n_unique_titles', 'n_total']
(697, 707)
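To make it easier to try different windows, the tolerances in the query above could be passed as named parameters and written with `ABS()`. This is a sketch on made-up rows ('Example Film', 'Other Film'), not on the real tables; `runtime_tol` and `year_tol` are parameter names I chose.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE netflix (title TEXT, runtime_min INTEGER, releaseYear INTEGER);
    CREATE TABLE rotten_tomatoes (title TEXT, runtimeMinutes INTEGER, releaseYearEarlier REAL);
    -- made-up rows for illustration only
    INSERT INTO netflix VALUES ('Example Film', 100, 2018), ('Other Film', 90, 2020);
    INSERT INTO rotten_tomatoes VALUES
        ('example film', 102, 2019.0),
        ('Example Film', 120, 2018.0),
        ('Other Film', 90, 2015.0);
""")

query = """
    SELECT COUNT(DISTINCT n.title) AS n_unique_titles, COUNT(*) AS n_total
    FROM netflix AS n
    INNER JOIN rotten_tomatoes AS r
    ON LOWER(n.title) = LOWER(r.title)
    WHERE ABS(r.runtimeMinutes - n.runtime_min) <= :runtime_tol
    AND ABS(r.releaseYearEarlier - n.releaseYear) <= :year_tol;
"""

# Same windows as the cell above: runtime within 3 minutes, year within 1
counts = demo.execute(query, {"runtime_tol": 3, "year_tol": 1}).fetchone()
print(counts)
```

Rows with a missing runtime or year are still dropped, because a comparison against NULL is never true.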


In [27]:
query = """
        SELECT title, COUNT(title)
        FROM netflix
        GROUP BY title
        HAVING COUNT(title) > 1;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['title', 'COUNT(title)']
('Monster', 2)
('Noise', 2)
('The Killer', 2)


In [28]:
query = """
        SELECT *
        FROM netflix
        WHERE title = 'Monster';
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear']
('Monster', 'May 7, 2021', 'Drama', '1 h 39 min', 'English', 'Feature films', 99, 2021)
('Monster', 'May 16, 2024', 'Thriller', '1 h 24 min', 'No dialogue', 'Feature films', 84, 2024)


In [24]:
query = """
        SELECT *
        FROM netflix AS n
        INNER JOIN rotten_tomatoes AS r
        ON LOWER(n.title) = LOWER(r.title)
        WHERE ( r.runtimeMinutes <= n.runtime_min +3 AND r.runtimeMinutes >= n.runtime_min -3 )
        AND r.releaseYearEarlier <= n.releaseYear + 1 AND r.releaseYearEarlier >= n.releaseYear - 1;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 



In [17]:
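# print entries in the netflix table that don't have a match in the rotten_tomatoes table
# (caveat: a matched film whose audienceScore is NULL is also listed, since the filter tests j.audienceScore)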
query = """
        SELECT *
        FROM netflix AS n_all
        LEFT JOIN (
            SELECT n.title, r.audienceScore
            FROM netflix AS n
            INNER JOIN rotten_tomatoes AS r
            ON LOWER(n.title) = LOWER(r.title)
            WHERE ( r.runtimeMinutes <= n.runtime_min +3 AND r.runtimeMinutes >= n.runtime_min -3 )
            AND r.releaseYearEarlier <= n.releaseYear + 2 AND r.releaseYearEarlier >= n.releaseYear - 2
            ) AS j
        ON n_all.title = j.title
        WHERE j.audienceScore IS NULL;
        """

# print column names
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in out: 
    print(row) 

['title', 'releaseDate', 'genre', 'runtime', 'language', 'filmType', 'runtime_min', 'releaseYear', 'title', 'audienceScore']
('Special Correspondents', 'April 29, 2016', 'Satire', '1 h 41 min', 'English', 'Feature films', 101, 2016, None, None)
('The Do-Over', 'May 27, 2016', 'Action comedy', '1 h 48 min', 'English', 'Feature films', 108, 2016, None, None)
('The Fundamentals of Caring', 'June 24, 2016', 'Comedy drama', '1 h 37 min', 'English', 'Feature films', 97, 2016, None, None)
('True Memoirs of an International Assassin', 'November 11, 2016', 'Action comedy', '1 h 38 min', 'English', 'Feature films', 98, 2016, None, None)
('Spectral', 'December 9, 2016', 'Science fiction / Action', '1 h 48 min', 'English', 'Feature films', 108, 2016, None, None)
('Burning Sands', 'March 10, 2017', 'Drama', '1 h 42 min', 'English', 'Feature films', 102, 2017, None, None)
('Deidra & Laney Rob a Train', 'March 17, 2017', 'Comedy crime', '1 h 32 min', 'English', 'Feature films', 92, 2017, None, None)
