### Download and add Rotten Tomatoes dataset to the database

In [1]:
# Install dependencies as needed:
# !pip install --upgrade kagglehub
import kagglehub

# Download latest version
path = kagglehub.dataset_download("andrezaza/clapper-massive-rotten-tomatoes-movies-and-reviews",force_download=True)

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/andrezaza/clapper-massive-rotten-tomatoes-movies-and-reviews?dataset_version_number=4...


100%|████████████████████████████████████████| 152M/152M [00:06<00:00, 24.4MB/s]

Extracting files...





Path to dataset files: /Users/sunyoungpark/.cache/kagglehub/datasets/andrezaza/clapper-massive-rotten-tomatoes-movies-and-reviews/versions/4


In [2]:
# Check that the files were downloaded to that path
# on macOS/Linux
!ls -alh $path
# on Windows (fails with "command not found" on macOS/Linux)
!dir $path 

total 860608
drwxr-xr-x@ 4 sunyoungpark  staff   128B Mar  5 13:27 .
drwxr-xr-x@ 3 sunyoungpark  staff    96B Mar  5 13:27 ..
-rw-r--r--@ 1 sunyoungpark  staff   392M Mar  5 13:27 rotten_tomatoes_movie_reviews.csv
-rw-r--r--@ 1 sunyoungpark  staff    17M Mar  5 13:27 rotten_tomatoes_movies.csv
zsh:1: command not found: dir


## Using the _movies.csv file that contains basic information and ratings for each movie.

### rotten_tomatoes_movies.csv - Each record represents a movie available on Rotten Tomatoes and includes all fields available on the website
*Note that the 'rating' field here refers to the age-based rating (e.g. G), not the rating given by viewers.

- id - Unique identifier for each movie
- title - The title of the movie
- audienceScore - The average score given by regular viewers (0-100)
- tomatoMeter - The percentage of positive reviews from professional critics (0-100)
- rating - The movie's age-based classification (e.g., G, PG, PG-13, R)
- ratingContents - Content leading to the rating classification
- releaseDateTheaters - The date the movie was released in theaters
- releaseDateStreaming - The date the movie became available for streaming
- runtimeMinutes - The duration of the movie in minutes
- genre - The movie's genre(s)
- originalLanguage - The original language of the movie
- director - The movie's director
- writer - The writer(s) responsible for the movie's screenplay
- boxOffice - The movie's total box office revenue
- distributor - The company responsible for distributing the movie
- soundMix - The audio format(s) used in the movie

Load the CSV file into a DataFrame and then write it to a SQLite table.

In [7]:
import pandas as pd
import os
import sqlite3 
import numpy as np

In [4]:
# Load CSV file into a DataFrame
df_rt = pd.read_csv(os.path.join(path,"rotten_tomatoes_movies.csv"))
df_rt = df_rt.convert_dtypes()
df_rt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143258 entries, 0 to 143257
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   id                    143258 non-null  string
 1   title                 142891 non-null  string
 2   audienceScore         73248 non-null   Int64 
 3   tomatoMeter           33877 non-null   Int64 
 4   rating                13991 non-null   string
 5   ratingContents        13991 non-null   string
 6   releaseDateTheaters   30773 non-null   string
 7   releaseDateStreaming  79420 non-null   string
 8   runtimeMinutes        129431 non-null  Int64 
 9   genre                 132175 non-null  string
 10  originalLanguage      129400 non-null  string
 11  director              139041 non-null  string
 12  writer                90116 non-null   string
 13  boxOffice             14743 non-null   string
 14  distributor           23001 non-null   string
 15  soundMix         

In [5]:
df_rt.head()

Unnamed: 0,id,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,releaseDateStreaming,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix
0,space-zombie-bingo,Space Zombie Bingo!,50.0,,,,,2018-08-25,75,"Comedy, Horror, Sci-fi",English,George Ormrod,"George Ormrod,John Sabotta",,,
1,the_green_grass,The Green Grass,,,,,,2020-02-11,114,Drama,English,Tiffany Edwards,Tiffany Edwards,,,
2,love_lies,"Love, Lies",43.0,,,,,,120,Drama,Korean,"Park Heung-Sik,Heung-Sik Park","Ha Young-Joon,Jeon Yun-su,Song Hye-jin",,,
3,the_sore_losers_1997,Sore Losers,60.0,,,,,2020-10-23,90,"Action, Mystery & thriller",English,John Michael McCarthy,John Michael McCarthy,,,
4,dinosaur_island_2002,Dinosaur Island,70.0,,,,,2017-03-27,80,"Fantasy, Adventure, Animation",English,Will Meugniot,John Loy,,,


Clean the data and add a more informative column that can be used to match entries between datasets.
1. Add a releaseYear column derived from releaseDateTheaters and releaseDateStreaming; if both exist and differ, use the earlier one.
2. Remove entries that have neither audienceScore nor tomatoMeter.
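The two steps above can also be sketched without a placeholder year. This is a minimal alternative, not the approach the cells below take: it runs on a made-up `toy` frame that only borrows the column names, parses both date columns with `errors="coerce"`, and takes the row-wise minimum.

```python
import pandas as pd

# Toy frame with the same column names as the Rotten Tomatoes file (values are made up)
toy = pd.DataFrame(
    {
        "releaseDateTheaters": ["2018-06-01", None, None, "2002-08-27"],
        "releaseDateStreaming": ["2018-08-21", "2020-02-11", None, "2001-01-15"],
        "audienceScore": [65, None, 43, None],
        "tomatoMeter": [69, None, None, 80],
    }
)

# Step 1: parse both date columns; anything that is not YYYY-MM-DD becomes NaT
dates = toy[["releaseDateTheaters", "releaseDateStreaming"]].apply(
    pd.to_datetime, format="%Y-%m-%d", errors="coerce"
)
# The row-wise minimum skips NaT, so a single available date is used as-is
toy["releaseYearEarlier"] = dates.min(axis=1).dt.year.astype("Int64")

# Step 2: how="all" drops a row only when BOTH scores are missing
toy = toy.dropna(how="all", subset=["audienceScore", "tomatoMeter"])
print(toy[["releaseYearEarlier", "audienceScore", "tomatoMeter"]])
```

Casting to the nullable `Int64` dtype keeps the year as an integer while still allowing missing values, which avoids both the float years and the 10000 placeholder.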

In [8]:
# 1. Convert release date to release year and add as a new column
# loading datetime package to make extracting year easy
import datetime

In [43]:
# First converting releaseDateStreaming to release year
# Set values that don't follow the 'YYYY-MM-DD' format to nan
tmp = np.where(df_rt['releaseDateStreaming'].fillna('nan').str.len() == 10, df_rt['releaseDateStreaming'],np.nan)
# Convert values to datetime data type
tmp = pd.to_datetime(tmp,format='ISO8601')
# Extract year from datetime object
tmp = tmp.year
# Convert year values to integers, using 10000 as a placeholder for missing years, and add as a new column
df_rt['releaseYearStreaming'] = tmp.fillna(10000).astype(int)
# Compare the original column and the new column
df_rt[['releaseDateStreaming','releaseYearStreaming']].head()

Unnamed: 0,releaseDateStreaming,releaseYearStreaming
0,2018-08-25,2018
1,2020-02-11,2020
2,,10000
3,2020-10-23,2020
4,2017-03-27,2017


In [44]:
# Now doing the same for releaseDateTheaters
# Set values that don't follow the 'YYYY-MM-DD' format to nan
tmp = np.where(df_rt['releaseDateTheaters'].fillna('nan').str.len() == 10, df_rt['releaseDateTheaters'],np.nan)
# Convert values to datetime data type
tmp = pd.to_datetime(tmp,format='ISO8601')
# Extract year from datetime object
tmp = tmp.year
# Convert year values to integers, using 10000 as a placeholder for missing years, and add as a new column
df_rt['releaseYearTheaters'] = tmp.fillna(10000).astype(int)
# Compare the original column and the new column
df_rt[['releaseDateTheaters','releaseYearTheaters']]

Unnamed: 0,releaseDateTheaters,releaseYearTheaters
0,,10000
1,,10000
2,,10000
3,,10000
4,,10000
...,...,...
143253,2002-08-27,2002
143254,,10000
143255,,10000
143256,,10000


In [47]:
# Grab the lower value of the two - if both are NA, the value will be 10000
tmp_min = np.where(df_rt['releaseYearStreaming']>df_rt['releaseYearTheaters'],df_rt['releaseYearTheaters'],df_rt['releaseYearStreaming'])
df_rt['releaseYearEarlier'] = np.where(tmp_min==10000,np.nan,tmp_min)
# df_rt['releaseYearEarlier'] = df_rt['releaseYearEarlier'].astype('int64')
df_rt[['releaseYearStreaming','releaseYearTheaters','releaseYearEarlier']]

Unnamed: 0,releaseYearStreaming,releaseYearTheaters,releaseYearEarlier
0,2018,10000,2018.0
1,2020,10000,2020.0
2,10000,10000,
3,2020,10000,2020.0
4,2017,10000,2017.0
...,...,...,...
143253,10000,2002,2002.0
143254,10000,10000,
143255,10000,10000,
143256,2006,10000,2006.0


In [48]:
df_rt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143258 entries, 0 to 143257
Data columns (total 19 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    143258 non-null  string 
 1   title                 142891 non-null  string 
 2   audienceScore         73248 non-null   Int64  
 3   tomatoMeter           33877 non-null   Int64  
 4   rating                13991 non-null   string 
 5   ratingContents        13991 non-null   string 
 6   releaseDateTheaters   30773 non-null   string 
 7   releaseDateStreaming  79420 non-null   string 
 8   runtimeMinutes        129431 non-null  Int64  
 9   genre                 132175 non-null  string 
 10  originalLanguage      129400 non-null  string 
 11  director              139041 non-null  string 
 12  writer                90116 non-null   string 
 13  boxOffice             14743 non-null   string 
 14  distributor           23001 non-null   string 
 15  

In [55]:
# 2. Remove entries that have neither audienceScore nor tomatoMeter (how='all' drops a row only if both are NA)
df_rt=df_rt.dropna(axis='index',how='all',subset=['audienceScore','tomatoMeter'])
df_rt.head()

Unnamed: 0,id,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,releaseDateStreaming,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix,releaseYearStreaming,releaseYearTheaters,releaseYearEarlier
0,space-zombie-bingo,Space Zombie Bingo!,50,,,,,2018-08-25,75,"Comedy, Horror, Sci-fi",English,George Ormrod,"George Ormrod,John Sabotta",,,,2018,10000,2018.0
2,love_lies,"Love, Lies",43,,,,,,120,Drama,Korean,"Park Heung-Sik,Heung-Sik Park","Ha Young-Joon,Jeon Yun-su,Song Hye-jin",,,,10000,10000,
3,the_sore_losers_1997,Sore Losers,60,,,,,2020-10-23,90,"Action, Mystery & thriller",English,John Michael McCarthy,John Michael McCarthy,,,,2020,10000,2020.0
4,dinosaur_island_2002,Dinosaur Island,70,,,,,2017-03-27,80,"Fantasy, Adventure, Animation",English,Will Meugniot,John Loy,,,,2017,10000,2017.0
5,adrift_2018,Adrift,65,69.0,PG-13,"['Injury Images', 'Brief Drug Use', 'Thematic ...",2018-06-01,2018-08-21,120,"Adventure, Drama, Romance",English,Baltasar Kormákur,"Aaron Kandell,Jordan Kandell,David Branson Smith",$31.4M,STX Films,,2018,2018,2018.0


In [None]:
# Drop the temporary columns created
df_rt = df_rt.drop(columns = ['releaseYearStreaming','releaseYearTheaters'])

In [60]:
# Check dataframe for the last time before saving to the database
df_rt.head()

Unnamed: 0,id,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,releaseDateStreaming,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix,releaseYearEarlier
0,space-zombie-bingo,Space Zombie Bingo!,50,,,,,2018-08-25,75,"Comedy, Horror, Sci-fi",English,George Ormrod,"George Ormrod,John Sabotta",,,,2018.0
2,love_lies,"Love, Lies",43,,,,,,120,Drama,Korean,"Park Heung-Sik,Heung-Sik Park","Ha Young-Joon,Jeon Yun-su,Song Hye-jin",,,,
3,the_sore_losers_1997,Sore Losers,60,,,,,2020-10-23,90,"Action, Mystery & thriller",English,John Michael McCarthy,John Michael McCarthy,,,,2020.0
4,dinosaur_island_2002,Dinosaur Island,70,,,,,2017-03-27,80,"Fantasy, Adventure, Animation",English,Will Meugniot,John Loy,,,,2017.0
5,adrift_2018,Adrift,65,69.0,PG-13,"['Injury Images', 'Brief Drug Use', 'Thematic ...",2018-06-01,2018-08-21,120,"Adventure, Drama, Romance",English,Baltasar Kormákur,"Aaron Kandell,Jordan Kandell,David Branson Smith",$31.4M,STX Films,,2018.0


In [61]:
# Connect to SQLite database 
conn = sqlite3.connect('films.db') 

# Create a cursor object 
cur = conn.cursor() 

# Write the data to a sqlite table 
df_rt.to_sql('rotten_tomatoes', conn, if_exists='replace', index=False) 

76802

In [63]:
# Check if the dataframe was saved correctly by loading and pulling up first 10 entries
query = """
        SELECT *
        FROM rotten_tomatoes
        LIMIT 10;
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in cur.execute(query): 
    print(row) 

['id', 'title', 'audienceScore', 'tomatoMeter', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'runtimeMinutes', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix', 'releaseYearEarlier']
('space-zombie-bingo', 'Space Zombie Bingo!', 50, None, None, None, None, '2018-08-25', 75, 'Comedy, Horror, Sci-fi', 'English', 'George Ormrod', 'George Ormrod,John Sabotta', None, None, None, 2018.0)
('love_lies', 'Love, Lies', 43, None, None, None, None, None, 120, 'Drama', 'Korean', 'Park Heung-Sik,Heung-Sik Park', 'Ha Young-Joon,Jeon Yun-su,Song Hye-jin', None, None, None, None)
('the_sore_losers_1997', 'Sore Losers', 60, None, None, None, None, '2020-10-23', 90, 'Action, Mystery & thriller', 'English', 'John Michael McCarthy', 'John Michael McCarthy', None, None, None, 2020.0)
('dinosaur_island_2002', 'Dinosaur Island', 70, None, None, None, None, '2017-03-27', 80, 'Fantasy, Adventure, Animation', 'English', 'Will Meugniot', 'John

In [64]:
# Close connection to SQLite database 
conn.close() 
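
The write-and-check round trip above can also be done with pandas on both sides. The following is a small sketch under stated assumptions: the `toy` frame and its values are made up, and an in-memory database (`":memory:"`) stands in for films.db so that nothing on disk is touched.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Made-up rows standing in for the cleaned Rotten Tomatoes frame
toy = pd.DataFrame(
    {
        "id": ["movie_a", "movie_b"],
        "audienceScore": pd.array([50, None], dtype="Int64"),
        "releaseYearEarlier": [2018.0, None],
    }
)

# closing() guarantees conn.close() runs even if a statement raises
with closing(sqlite3.connect(":memory:")) as conn:
    toy.to_sql("rotten_tomatoes", conn, if_exists="replace", index=False)
    # Read the table back into a DataFrame instead of iterating over a cursor
    check = pd.read_sql_query("SELECT * FROM rotten_tomatoes LIMIT 10;", conn)

print(check)
```

Reading the rows back as a DataFrame keeps the column names attached to the values, so the separate lookup through `cur.description` is not needed.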

### Download and add Letterboxd dataset to the database

In [65]:
# Download latest version
path = kagglehub.dataset_download("gsimonx37/letterboxd",path='movies.csv')

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/gsimonx37/letterboxd?dataset_version_number=2&file_name=movies.csv...


100%|████████████████████████████████████████| 106M/106M [00:04<00:00, 25.2MB/s]

Extracting zip of movies.csv...





Path to dataset files: /Users/sunyoungpark/.cache/kagglehub/datasets/gsimonx37/letterboxd/versions/2/movies.csv


### movies.csv
- id - movie identifier (primary key)
- name - the name of the film
- date - year of release of the film
- tagline - the slogan of the film
- description - description of the film
- minute - movie duration (in minutes)
- rating - rating (0-5 scale)

In [66]:
# Load CSV file into a DataFrame
df_lb = pd.read_csv(path)
df_lb = df_lb.convert_dtypes()
df_lb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941597 entries, 0 to 941596
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   id           941597 non-null  Int64  
 1   name         941587 non-null  string 
 2   date         849684 non-null  Int64  
 3   tagline      139387 non-null  string 
 4   description  780785 non-null  string 
 5   minute       760027 non-null  Int64  
 6   rating       90999 non-null   Float64
dtypes: Float64(1), Int64(3), string(3)
memory usage: 53.9 MB


In [67]:
df_lb.head()

Unnamed: 0,id,name,date,tagline,description,minute,rating
0,1000001,Barbie,2023,She's everything. He's just Ken.,Barbie and Ken are having the time of their li...,114,3.86
1,1000002,Parasite,2019,Act like you own the place.,"All unemployed, Ki-taek's family takes peculia...",133,4.56
2,1000003,Everything Everywhere All at Once,2022,The universe is so much bigger than you realize.,An aging Chinese immigrant is swept up in an i...,140,4.3
3,1000004,Fight Club,1999,Mischief. Mayhem. Soap.,A ticking-time-bomb insomniac and a slippery s...,139,4.27
4,1000005,La La Land,2016,Here's to the fools who dream.,"Mia, an aspiring actress, serves lattes to mov...",129,4.09


Not much to adjust here apart from converting the id to a string and dropping rows with a missing rating.
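
As a side note, a variant of this cleanup can keep the id in the nullable `string` dtype rather than `object`. The sketch below uses a made-up `toy` frame with only three of the Letterboxd columns; it is an illustration of the idea, not the code used in the following cells.

```python
import pandas as pd

# Made-up rows that borrow the Letterboxd column names
toy = pd.DataFrame(
    {
        "id": [1000001, 1000002, 1000003],
        "name": ["Film A", "Film B", "Film C"],
        "rating": [3.86, None, 4.3],
    }
).convert_dtypes()

# astype("string") keeps id consistent with the other text columns
toy["id"] = toy["id"].astype("string")
# Keep only the rows that have a rating
toy = toy.dropna(subset=["rating"])
print(toy.dtypes)
```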

In [69]:
df_lb['id']=df_lb['id'].astype(str)
df_lb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941597 entries, 0 to 941596
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   id           941597 non-null  object 
 1   name         941587 non-null  string 
 2   date         849684 non-null  Int64  
 3   tagline      139387 non-null  string 
 4   description  780785 non-null  string 
 5   minute       760027 non-null  Int64  
 6   rating       90999 non-null   Float64
dtypes: Float64(1), Int64(2), object(1), string(3)
memory usage: 53.0+ MB


In [72]:
df_lb=df_lb.dropna(axis='index',subset='rating')
df_lb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 90999 entries, 0 to 166579
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           90999 non-null  object 
 1   name         90999 non-null  string 
 2   date         90999 non-null  Int64  
 3   tagline      38424 non-null  string 
 4   description  90403 non-null  string 
 5   minute       90475 non-null  Int64  
 6   rating       90999 non-null  Float64
dtypes: Float64(1), Int64(2), object(1), string(3)
memory usage: 5.8+ MB


In [75]:
# Drop the long text columns that are not needed in the database
df_lb = df_lb.drop(columns = ['tagline','description'])

In [73]:
# Connect to SQLite database 
conn = sqlite3.connect('films.db') 

# Create a cursor object 
cur = conn.cursor() 

# Write the data to a sqlite table 
df_lb.to_sql('letterboxd', conn, if_exists='replace', index=False) 

90999

In [77]:
# Check if the dataframe was saved correctly by loading and pulling up first 10 entries
query = """
        SELECT *
        FROM letterboxd
        LIMIT 10;
        """
out = cur.execute(query)
names = list(map(lambda x: x[0], cur.description))
print(names)

for row in cur.execute(query): 
    print(row) 

['id', 'name', 'date', 'minute', 'rating']
('1000001', 'Barbie', 2023, 114, 3.86)
('1000002', 'Parasite', 2019, 133, 4.56)
('1000003', 'Everything Everywhere All at Once', 2022, 140, 4.3)
('1000004', 'Fight Club', 1999, 139, 4.27)
('1000005', 'La La Land', 2016, 129, 4.09)
('1000006', 'Oppenheimer', 2023, 181, 4.23)
('1000007', 'Interstellar', 2014, 169, 4.35)
('1000008', 'Joker', 2019, 122, 3.85)
('1000009', 'Dune', 2021, 155, 3.9)
('1000010', 'Pulp Fiction', 1994, 154, 4.26)


In [78]:
# Close connection to SQLite database 
conn.close()
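
With both tables in place, the release year column can serve as a matching key alongside the title. The query below is only a sketch of that idea: it assumes exact title equality between `title` and `name`, and it runs against an in-memory database with made-up rows rather than films.db.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    # Minimal stand-ins for the two tables (only the columns used in the join)
    conn.execute(
        "CREATE TABLE rotten_tomatoes (title TEXT, audienceScore INTEGER, releaseYearEarlier REAL)"
    )
    conn.execute("CREATE TABLE letterboxd (name TEXT, date INTEGER, rating REAL)")
    conn.executemany(
        "INSERT INTO rotten_tomatoes VALUES (?, ?, ?)",
        [("Film A", 65, 2018.0), ("Film B", 43, None)],
    )
    conn.executemany(
        "INSERT INTO letterboxd VALUES (?, ?, ?)",
        [("Film A", 2018, 3.5), ("Film B", 2016, 4.0)],
    )

    # Match on title and year; rows with a missing year never match
    matches = conn.execute(
        """
        SELECT rt.title, rt.audienceScore, lb.rating
        FROM rotten_tomatoes AS rt
        JOIN letterboxd AS lb
          ON rt.title = lb.name
         AND rt.releaseYearEarlier = lb.date
        """
    ).fetchall()

print(matches)
```

Requiring the year to match helps separate films that share a title, at the cost of dropping entries whose release year is missing in either table.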