# __FINAL PROJECT PHASE II__

# __RESEARCH QUESTION:__

Which criteria matter most for maximizing movie viewership on Netflix? Can we accurately predict viewership from observed criteria such as ratings, global availability, and genre?

### Importing:

In [31]:
#imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb
from sklearn.linear_model import LinearRegression, LogisticRegression
import re

In [32]:
pip install pandas openpyxl

Note: you may need to restart the kernel to use updated packages.


### Data Overview & Sources:

Seven data tables were collected. The first five are Netflix's semi-annual engagement reports, covering the first half of 2023 through the first half of 2025. Each report contains two tabs, Shows and Films, with their respective data (e.g., runtime, viewership, global availability). The IMDb titles table lists every movie and show, each with its own identification tag (`tconst`). The IMDb ratings table references these tags to give each movie/show its average rating and number of votes. These two IMDb tables were combined into one extensive IMDb table to cross-reference with the Netflix reports.
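
To make the tag-based combination concrete, here is a minimal sketch of joining a titles table to a ratings table on the shared `tconst` identifier. The rows are made-up stand-ins, not real IMDb data, and `pandas.merge` is used here only for illustration; the notebook itself performs the join with DuckDB.

```python
import pandas as pd

# Toy stand-ins for title.basics and title.ratings (all values are made up).
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "originalTitle": ["Example Film", "Example Show", "Unrated Film"],
    "genres": ["Drama", "Comedy,Short", "Action"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "averageRating": [6.5, 7.2],
    "numVotes": [80, 323],
})

# Inner join on the shared identifier: only titles that have a rating survive.
# validate="one_to_one" raises if either table repeats a tconst.
imdb = ratings.merge(basics, on="tconst", how="inner", validate="one_to_one")
print(imdb)
```

In this toy example the unrated title drops out, which is also what an inner join on the real tables would do to titles without ratings.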

Source for Netflix Engagement Report First Half 2023: https://about.netflix.com/en/news/what-we-watched-a-netflix-engagement-report

Source for Netflix Engagement Report Second Half 2023: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2023

Source for Netflix Engagement Report First Half 2024: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2024

Source for Netflix Engagement Report Second Half 2024: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2024

Source for Netflix Engagement Report First Half 2025: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2025

Source of IMDb Ratings for Movies/Shows: https://datasets.imdbws.com/title.ratings.tsv.gz

Source of IMDb Movie/Show Titles: https://datasets.imdbws.com/title.basics.tsv.gz




# __Data Collection & Cleaning:__

In [33]:
#import the dataset with all ratings from IMDb.
#note that each rating has an ID, not the show/movie title
ratings_df= pd.read_table("title.ratings.tsv")

#import the dataset with show/movie (episode) title given the ID
titles_df= pd.read_table("title.basics.tsv")

#import the dataset with show/movie parent ID (for series titles), given the episode ID
series_title_id_df = pd.read_table("title.episode.tsv")

In [84]:
#perform an SQL join to obtain a dataframe with the rating, number of votes for that rating,
#title of the show/movie, and the show/movie genre
merged_ratings_df= duckdb.sql("""SELECT r.tconst, r.averageRating, r.numVotes, t.originalTitle AS Title, t.genres
FROM ratings_df r
JOIN titles_df t ON r.tconst = t.tconst""").df()

#drop the duplicate titles to prevent duplicate rows when merging with Netflix dataframe
merged_ratings_df = merged_ratings_df.drop_duplicates(subset=['Title'])

#print the merged dataframe (pandas truncates the middle rows)
print(merged_ratings_df)

             tconst  averageRating  numVotes  \
0         tt0213621            6.5         8   
1         tt0213622            6.2        79   
2         tt0213623            7.2       323   
3         tt0213627            6.7       123   
4         tt0213628            6.8        60   
...             ...            ...       ...   
1626652  tt33297927            7.0         6   
1626653  tt33297972            5.8        39   
1626657   tt6308732            6.3        13   
1626658   tt6308734            6.9        17   
1626659   tt6308738            5.4        12   

                                            Title                   genres  
0                                     Do Ladkiyan                    Drama  
1                            The Domineering Male             Comedy,Short  
2                                   Donovan Quick                    Drama  
3                                    Däumlienchen  Animation,Fantasy,Short  
4       
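
Dropping duplicate titles keeps whichever row happens to come first, so an obscure title could win over a well-known one with the same name. One possible refinement, sketched below on made-up rows, is to keep the entry with the most votes for each title. Treating the most-voted entry as the best match is our assumption, not something the data guarantees.

```python
import pandas as pd

# Made-up rows: two different IMDb entries share the title "Same Name".
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "Title": ["Same Name", "Same Name", "Other Title"],
    "averageRating": [5.1, 8.3, 4.2],
    "numVotes": [12, 700000, 90000],
})

# Sort so the most-voted entry comes first, then keep one row per title.
deduped = (
    toy.sort_values("numVotes", ascending=False)
       .drop_duplicates(subset=["Title"], keep="first")
       .sort_values("tconst")
       .reset_index(drop=True)
)
print(deduped)
```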

In [35]:
#import 1st half of 2023 netflix data
all_jan_jun_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx",
                                  sheet_name="Engagement",
                                  skiprows=5)

#import 2nd half of 2023 netflix data
movies_jul_dec_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jul-Dec.xlsx",
                                  sheet_name="Film",
                                  skiprows=5)

#import the 1st half of 2024 netflix data
movies_jan_jun_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jan-Jun.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)

#import the 2nd half of 2024 netflix data
movies_jul_dec_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jul-Dec.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)

#import the 1st half of 2025 netflix data
movies_jan_jun_2025= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun.xlsx",
                                   sheet_name= "Movies",
                                   skiprows=5)

In [72]:
#combine the half-year dataframes into one Netflix dataframe.
#column names differ between reports and between the show and movie tabs, so each
#SELECT renames its columns to a common schema. The 1st half of 2023 has fewer columns,
#so its Runtime and Views are filled with NULL.

jul_dec_2023= movies_jul_dec_2023

netflix2023= duckdb.sql("""SELECT 
                    Title, 
                    "Available Globally?" AS Global, 
                    "Release Date" AS Release_Date, 
                    "Hours Viewed" AS Hours_Viewed, 
                    Runtime, Views
                FROM jul_dec_2023 
                UNION 
                SELECT
                    Title, 
                    "Available Globally?" AS Global, 
                    "Release Date" AS Release_Date, 
                    "Hours Viewed" AS Hours_Viewed, 
                    NULL AS Runtime, 
                    NULL AS Views
                FROM all_jan_jun_2023""").df()

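#sanity check: total rows in the source sheets vs. rows in the combined frame
#(the shows_* frames are assumed to be loaded from the reports' show tabs in a cell not shown here)
#UNION also drops exact duplicate rows, so the combined frame can be smaller than the sum
print(len(all_jan_jun_2023) + len(shows_jul_dec_2023) + len(movies_jul_dec_2023))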
print(netflix2023.shape)


jan_jun_2024= duckdb.sql("""SELECT
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views
                            FROM movies_jan_jun_2024""").df()

jul_dec_2024= duckdb.sql("""SELECT
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views
                            FROM movies_jul_dec_2024""").df()

netflix2024= duckdb.sql("""SELECT * FROM jan_jun_2024 UNION ALL SELECT * FROM jul_dec_2024""").df()
print(len(shows_jan_jun_2024) + len(movies_jan_jun_2024) + len(shows_jul_dec_2024) + len(movies_jul_dec_2024))
print(netflix2024.shape)


netflix2025= duckdb.sql("""SELECT
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views
                            FROM movies_jan_jun_2025""").df()

print(len(shows_jan_jun_2025) + len(movies_jan_jun_2025))
print(netflix2025.shape)

netflix_df= duckdb.sql("""SELECT * FROM netflix2023 UNION SELECT * FROM netflix2024 UNION SELECT * FROM netflix2025""").df()

print(netflix_df)

34208
(27609, 6)
31724
(18040, 6)
16182
(8674, 6)
                              Title Global Release_Date  Hours_Viewed Runtime  \
0                          Ad Vitam    Yes   2025-01-10     114000000    1:38   
1                   Despicable Me 2     No         None      75000000    1:38   
2      PAW Patrol: The Mighty Movie     No         None      50700000    1:28   
3                       Rebel Ridge    Yes   2024-09-06      68500000    2:12   
4                            Norbit     No         None      33600000    1:43   
...                             ...    ...          ...           ...     ...   
48226                         Zenek     No         None        100000    None   
48227                     Zeroville     No         None        100000    None   
48228                 Ziarno prawdy     No         None        100000    None   
48229                   Çiçek Abbas     No         None        100000    None   
48230            Üç Harfliler: Adak     No         None    
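
The cell above mixes `UNION`, which removes exact duplicate rows, with `UNION ALL`, which keeps them. That is consistent with the printed shapes: 27,609 + 18,040 + 8,674 = 54,323 rows go in, but the combined frame has 48,231. The sketch below shows the same distinction with a pandas analog on made-up rows. The `Period` column is a hypothetical addition, not part of the notebook, illustrating one way to keep a title's repeat appearances distinguishable.

```python
import pandas as pd

# Made-up half-year frames; the first lacks Runtime and Views, like the 1st half of 2023.
first_half = pd.DataFrame({"Title": ["Film A"], "Hours_Viewed": [100]})
second_half = pd.DataFrame({
    "Title": ["Film B", "Film C"], "Hours_Viewed": [200, 300],
    "Runtime": ["1:30", "2:00"], "Views": [133, 150],
})
later_half = pd.DataFrame({
    "Title": ["Film B", "Film D"], "Hours_Viewed": [200, 400],
    "Runtime": ["1:30", "1:45"], "Views": [133, 229],
})
frames = [first_half, second_half, later_half]

# UNION ALL analog: stack every row; missing columns are filled with NaN.
union_all = pd.concat(frames, ignore_index=True)

# UNION analog: identical rows collapse, so Film B's second appearance disappears.
union_dedup = union_all.drop_duplicates().reset_index(drop=True)

# Hypothetical Period tag: rows from different reports are no longer identical.
periods = ["2023H1", "2023H2", "2024H1"]
tagged = pd.concat(
    [f.assign(Period=p) for f, p in zip(frames, periods)], ignore_index=True
).drop_duplicates()

print(len(union_all), len(union_dedup), len(tagged))
```

Whether repeat appearances should be kept or collapsed depends on whether the analysis treats each report as a separate observation.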

In [73]:
#strip everything from " //" onward so each title keeps only the part before the separator
netflix_df['Title']= netflix_df['Title'].str.replace(r" //.*", "", regex=True)
netflix_df.head()

| | Title | Global | Release_Date | Hours_Viewed | Runtime | Views |
|---|---|---|---|---|---|---|
| 0 | Ad Vitam | Yes | 2025-01-10 | 114000000 | 1:38 | 69800000 |
| 1 | Despicable Me 2 | No | | 75000000 | 1:38 | 45900000 |
| 2 | PAW Patrol: The Mighty Movie | No | | 50700000 | 1:28 | 34600000 |
| 3 | Rebel Ridge | Yes | 2024-09-06 | 68500000 | 2:12 | 31100000 |
| 4 | Norbit | No | | 33600000 | 1:43 | 19600000 |
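
As a quick check of the title cleanup, the sketch below applies a similar pattern to made-up titles. It assumes the ` // ` separator is followed by an alternate form of the title that we want to discard. The extra `.str.strip()` is our addition to remove stray padding.

```python
import pandas as pd

# Made-up titles: one with a " // " suffix, one plain, one with stray whitespace.
titles = pd.Series([
    "Example Film // Película de ejemplo",
    "Plain Title",
    "  Padded Title  ",
])

# Remove the separator and everything after it, then trim leftover whitespace.
cleaned = titles.str.replace(r"\s*//.*$", "", regex=True).str.strip()
print(cleaned.tolist())
```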


In [85]:
#left join the Netflix and ratings datasets
#a left join keeps every Netflix observation; an inner join would shrink the dataset too much
#limitation: only exact title matches receive a rating

netflix_ratings_df= duckdb.sql("""SELECT *
FROM netflix_df
LEFT JOIN merged_ratings_df
ON netflix_df.Title= merged_ratings_df.Title""").df()

print(netflix_ratings_df.shape) #yes, the size is the same as the netflix only dataframe
print(netflix_ratings_df) 

#save this as an intermediate dataset in our final submission
#used this to troubleshoot the merge being the wrong size
#netflix_ratings_df.to_csv('my_dataframe.csv', index=False) 

(48231, 11)
                                                Title Global Release_Date  \
0                        A.I. Artificial Intelligence     No         None   
1                                            Big Eden     No         None   
2                                                 JFK     No         None   
3                                 The Black Godfather    Yes   2019-06-07   
4                                   Operación Camarón     No         None   
...                                               ...    ...          ...   
48226                    Mapado 2: Back to the Island     No         None   
48227                                 My First Client     No         None   
48228              The Mole Song: Hong Kong Capriccio     No         None   
48229  Why Men Don't Listen and Women Can't Read Maps     No         None   
48230                      You Shine in the Moonlight     No         None   

       Hours_Viewed Runtime   Views      tconst  averageRating 
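
Because only exact titles match, it may help to measure how many Netflix rows actually receive a rating. The sketch below does this on made-up rows using the `indicator` option of `pandas.merge`, then retries with a normalized join key. The `key` column and the case-folding rule are hypothetical choices of ours, not part of the notebook.

```python
import pandas as pd

# Made-up rows: one exact match, one that differs only by case, one with no match.
netflix = pd.DataFrame({
    "Title": ["Example Film", "example show", "Unmatched Title"],
    "Views": [100, 50, 10],
})
imdb = pd.DataFrame({
    "Title": ["Example Film", "Example Show"],
    "averageRating": [6.5, 7.2],
})

# Exact-title left join; _merge records whether each Netflix row found a partner.
exact = netflix.merge(imdb, on="Title", how="left", indicator=True)
match_rate = (exact["_merge"] == "both").mean()

# Hypothetical normalized key: case-folded, trimmed title.
def with_key(frame):
    return frame.assign(key=frame["Title"].str.casefold().str.strip())

imdb_keyed = with_key(imdb).rename(columns={"Title": "imdb_Title"})
normalized = with_key(netflix).merge(imdb_keyed, on="key", how="left")

print(f"exact match rate: {match_rate:.2f}")
print(normalized["averageRating"].notna().sum(), "of", len(normalized), "rows rated")
```

Both joins keep the Netflix row count unchanged, which is the property the shape check in the cell above relies on.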

# __Data Description:__


Write Description here 

# __Data Limitations:__


write limitations here

# __Exploratory Data Analysis:__


### Movie Viewership based on Ratings:

In [None]:
#planned: linear regression (x = ratings, y = viewership), plus a bar graph of average views at each rating
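
A minimal sketch of the planned analysis follows, run on synthetic data rather than the merged Netflix/IMDb frame. Regressing the base-10 log of views instead of raw views is our assumption, motivated by how skewed view counts tend to be. On the real data, rows without a rating would need to be dropped first, since the left join leaves them empty.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: log10(views) rises by 0.3 per rating point, plus noise.
rng = np.random.default_rng(0)
n = 200
rating = rng.uniform(3, 9, n)
log_views = 4.0 + 0.3 * rating + rng.normal(0, 0.5, n)
df = pd.DataFrame({"averageRating": rating, "Views": 10 ** log_views})

# Fit log10(Views) ~ averageRating.
X = df[["averageRating"]]
y = np.log10(df["Views"])
model = LinearRegression().fit(X, y)
slope = model.coef_[0]
r2 = model.score(X, y)

# Average views at each rating, rounded to the nearest whole point.
avg_views = df.groupby(df["averageRating"].round())["Views"].mean()

print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
print(avg_views)
# avg_views.plot.bar() would draw these averages as the bar graph (requires matplotlib).
```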

### Movie Viewership based on Global Availability:

In [None]:
#planned: logistic regression (x = global availability, y = viewership)
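
Logistic regression models a categorical outcome, so the sketch below swaps the roles in the plan above: it predicts global availability (Yes/No) from viewership rather than the reverse. That swap is our assumption about how the model could be set up, not a settled design, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: globally available titles get more views on average.
rng = np.random.default_rng(1)
n = 400
is_global = rng.integers(0, 2, n)
log_views = 5.5 + 0.8 * is_global + rng.normal(0, 0.6, n)
df = pd.DataFrame({
    "Global": np.where(is_global == 1, "Yes", "No"),
    "Views": 10 ** log_views,
})

# Predict the binary Global flag from log10(Views).
X = np.log10(df[["Views"]])
y = (df["Global"] == "Yes").astype(int)
clf = LogisticRegression().fit(X, y)
coef = clf.coef_[0][0]
accuracy = clf.score(X, y)

print(f"coefficient on log10(Views) = {coef:.3f}, training accuracy = {accuracy:.3f}")
```

If the goal is instead to explain viewership by availability, a linear regression of log views on a 0/1 `Global` indicator would be the closer fit.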

### Movie Viewership over Time:

# __Questions for Reviewers:__


Put questions here