# __FINAL PROJECT PHASE II__

# __RESEARCH QUESTION:__

What set of criteria is most important to obtain the most viewership on Netflix for movies? Are we able to accurately predict viewership according to various observed criteria, including ratings, runtime, global availability, and genre?

### Importing:

In [31]:
#imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb
from sklearn.linear_model import LinearRegression, LogisticRegression
import re

In [32]:
pip install pandas openpyxl

Note: you may need to restart the kernel to use updated packages.


### Data Overview & Sources:

Seven data tables were collected. The first five were taken from Netflix regarding semi-annual engagement reports starting from the first half of 2023 to the first half of 2025. Each report contains 2 tabs, Shows and Films, and their respective data (i.e. runtime, viewership, global availability). Meanwhile, the IMDb movie/shows data table displays all the movies and shows, each having their own identification tag. The IMDb rating table references these tags to provide each movie/show with their respective ratings and number of votes. The following two IMDb data tables were combined to give us an extensive IMDb table to cross reference with the Netflix reports.

Source for Netflix Engagement Report First Half 2023: https://about.netflix.com/en/news/what-we-watched-a-netflix-engagement-report

Source for Netflix Engagement Report Second Half 2023: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2023

Source for Netflix Engagement Report First Half 2024: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2024

Source for Netflix Engagement Report Second Half 2024: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2024

Source for Netflix Engagement Report First Half 2025: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2025

Source of IMDb Ratings for Movies/Shows: https://datasets.imdbws.com/title.ratings.tsv.gz

Source of IMDb Movie/Show Titles: https://datasets.imdbws.com/title.basics.tsv.gz




# __Data Collection & Cleaning:__

In [33]:
#import the dataset with all ratings from IMBb. 
#note that each rating has an ID, not the show/movie title
ratings_df= pd.read_table("title.ratings.tsv")

#import the dataset with show/movie (episode) title given the ID
titles_df= pd.read_table("title.basics.tsv")

#import the dataset with show/movie parent ID (for series titles), given the episode ID
series_title_id_df = pd.read_table("title.episode.tsv")

In [57]:
#perform an SQL join to obtain a dataframe with the rating, number of votes for that rating,
#title of the show/move, and the show/movie genre
merged_ratings_df= duckdb.sql("""SELECT r.tconst, r.averageRating, r.numVotes, t.originalTitle AS Title, t.genres
FROM ratings_df r, titles_df t
WHERE r.tconst=t.tconst""").df()

merged_ratings_df.head(10)

Unnamed: 0,tconst,averageRating,numVotes,Title,genres
0,tt0636799,6.5,61,The Will/Deja Vu/The Prediction,"Comedy,Drama,Romance"
1,tt0636800,7.2,155,Third Wheel/Grandmother's Day/Second String Mom,"Comedy,Drama,Romance"
2,tt0636801,7.0,151,Till Death Do Us Part-Maybe/Locked Away/Chubs,"Comedy,Drama,Romance"
3,tt0636802,6.8,95,"The Tomorrow Lady/Father, Dear Father/Still Life","Comedy,Drama,Romance"
4,tt0636803,7.6,109,Tony and Julie/Separate Beds/America's Sweetheart,"Comedy,Drama,Romance"
5,tt0636804,7.1,146,The Minister and the Stripper/Her Own Two Feet...,"Comedy,Drama,Romance"
6,tt0636805,7.1,110,The Trigamist/Jealousy/From Here to Maternity,"Comedy,Drama,Romance"
7,tt0636806,7.2,77,Trouble in Paradise/No More Mister Nice Guy/Th...,"Comedy,Drama,Romance"
8,tt0636807,7.3,77,Two Grapes on the Vine/Aunt Sylvia/Deductible ...,"Comedy,Drama,Romance"
9,tt0636808,6.8,112,The Duel/Two for Julie/Aunt Hilly,"Comedy,Drama,Romance"


In [35]:
#import 1st half of 2023 netflix data
all_jan_jun_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx",
                                  sheet_name="Engagement",
                                  skiprows=5)

#import 2nd half of 2023 netflix data
movies_jul_dec_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jul-Dec.xlsx",
                                  sheet_name="Film",
                                  skiprows=5)

#import the 1st half of 2024 netflix data
movies_jan_jun_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jan-Jun.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)

#import the 2nd half of 2024 netflix data
movies_jul_dec_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jul-Dec.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)

#import the 1st half of 2025 netflix data
movies_jan_jun_2025= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun.xlsx",
                                   sheet_name= "Movies",
                                   skiprows=5)

In [38]:
#write code to combine all these dataframes into 1 Netflix dataframe.
#be careful that some of the column names are different in different years and depending on whether 
#it's a show or movie. (The 1st half of 2023 doesn't have as many columns)

jul_dec_2023= movies_jul_dec_2023

netflix2023= duckdb.sql("""SELECT 
                    Title, 
                    "Available Globally?" AS Global, 
                    "Release Date" AS Release_Date, 
                    "Hours Viewed" AS Hours_Viewed, 
                    Runtime, Views
                FROM jul_dec_2023 
                UNION 
                SELECT
                    Title, 
                    "Available Globally?" AS Global, 
                    "Release Date" AS Release_Date, 
                    "Hours Viewed" AS Hours_Viewed, 
                    NULL AS Runtime, 
                    NULL AS Views
                FROM all_jan_jun_2023""").df()

print(len(all_jan_jun_2023) + len(shows_jul_dec_2023) + len(movies_jul_dec_2023))
print(netflix2023.shape)


jan_jun_2024= duckdb.sql("""SELECT
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views
                            FROM movies_jan_jun_2024""").df()

jul_dec_2024= duckdb.sql("""SELECT
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views
                            FROM movies_jul_dec_2024""").df()

netflix2024= duckdb.sql("""SELECT * FROM jan_jun_2024 UNION ALL SELECT * FROM jul_dec_2024""").df()
print(len(shows_jan_jun_2024) + len(movies_jan_jun_2024) + len(shows_jul_dec_2024) + len(movies_jul_dec_2024))
print(netflix2024.shape)


netflix2025= duckdb.sql("""SELECT
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views
                            FROM movies_jan_jun_2025""").df()

print(len(shows_jan_jun_2025) + len(movies_jan_jun_2025))
print(netflix2025.shape)

netflix_df= duckdb.sql("""SELECT * FROM netflix2023 UNION SELECT * FROM netflix2024 UNION SELECT * FROM netflix2025""").df()

print(netflix_df.shape)

34208
(27609, 6)
31724
(18040, 6)
16182
(8674, 6)
(48231, 6)


In [39]:
mod_title= netflix_df['Title']
mod_title= mod_title.str.replace(r" \/\/.*", "", regex=True)

netflix_df['Title']= mod_title
netflix_df.head()

Unnamed: 0,Title,Global,Release_Date,Hours_Viewed,Runtime,Views
0,Exterritorial,Yes,2025-04-30,159000000,1:49,87500000
1,Counterattack,Yes,2025-02-28,101000000,1:25,71300000
2,Despicable Me 3,No,,73600000,1:30,49100000
3,Bad Influence,Yes,2025-05-09,78600000,1:47,44100000
4,K.O.,Yes,2025-06-06,63100000,1:27,43500000


In [69]:
#left join Netflix and ratings datasets
#we want to keep all the Netflix observations, or else the dataset would shrink too much
#limitation that only exact titles are matched

netflix_ratings_df= duckdb.sql("""SELECT *
FROM netflix_df
INNER JOIN merged_ratings_df
Title ON netflix_df.Title= merged_ratings_df.Title""").df()

#n.Title, n.Global, n.Release_Date, n.Hours_Viewed, n.Runtime, n.Views, m.averageRating, m.numVotes, m.genres
FROM netflix_df n

print(netflix_ratings_df.shape)
netflix_ratings_df.head(n=1000)
#netflix_ratings_extended.head()

#netflix_ratings_df.to_excel('output_dataframe.xlsx', index=False)

BinderException: Binder Error: Referenced table "merged_ratings_df" not found!
Candidate tables: "netflix_df"

# __Data Description:__


Write Description here 

# __Data Limitations:__


write limitations here

# __Exploratory Data Analysis:__


Start analysis here 

# __Questions for Reviewers:__


Put questions here