# __FINAL PROJECT PHASE II__

# __RESEARCH QUESTION:__

Which criteria matter most for maximizing viewership on Netflix? Which type of media on Netflix draws more viewership, shows or movies? Can we accurately predict viewership and ratings from observed criteria such as country of origin, global availability, and genre?

### Importing:

In [1]:
#imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb

In [2]:
pip install pandas openpyxl

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


### Data Overview & Sources:

Seven data tables were collected. The first five are Netflix's semi-annual engagement reports, covering the first half of 2023 through the first half of 2025. From the second half of 2023 onward, each report contains two tabs, one for shows and one for films, with their respective data (e.g., runtime, viewership, global availability); the first-half 2023 report has a single tab with fewer columns. The remaining two tables come from IMDb. The IMDb titles table lists every movie and show, each with its own identification tag. The IMDb ratings table references these tags to give each movie or show its average rating and number of votes. We combined these two IMDb tables into one extensive IMDb table to cross-reference with the Netflix reports.

Source for Netflix Engagement Report First Half 2023: https://about.netflix.com/en/news/what-we-watched-a-netflix-engagement-report

Source for Netflix Engagement Report Second Half 2023: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2023

Source for Netflix Engagement Report First Half 2024: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2024

Source for Netflix Engagement Report Second Half 2024: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2024

Source for Netflix Engagement Report First Half 2025: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2025

Source of IMDb Ratings for Movies/Shows: https://datasets.imdbws.com/title.ratings.tsv.gz

Source of IMDb Movie/Show Titles: https://datasets.imdbws.com/title.basics.tsv.gz
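
To make the relationship between the two IMDb tables concrete, here is a minimal sketch of how ratings rows can be matched to title rows through the shared `tconst` tag. It uses only the Python standard library and a few made-up rows, so it runs without the downloaded files; the sample IDs, titles, and the helper names `read_tsv` and `join_on_tconst` are illustrative assumptions, not part of the project code. It also shows one way to handle the `\N` marker that the IMDb files use for missing fields.

```python
import csv
import io

# Made-up sample rows in the IMDb TSV layout (IDs and values are illustrative only).
RATINGS_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.5\t120\n"
    "tt0000002\t6.1\t45\n"
)
BASICS_TSV = (
    "tconst\ttitleType\toriginalTitle\tgenres\n"
    "tt0000001\tmovie\tExample Film\tDrama,Romance\n"
    "tt0000002\ttvSeries\tExample Show\t\\N\n"
    "tt0000003\tmovie\tUnrated Film\tComedy\n"
)


def read_tsv(text):
    """Parse tab-separated text into dict rows, mapping the IMDb missing marker to None."""
    # QUOTE_NONE: titles may contain quote characters that should be read literally.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{k: (None if v == r"\N" else v) for k, v in row.items()} for row in reader]


def join_on_tconst(ratings, basics):
    """Inner join: keep one output row per rating whose tconst has a title row."""
    titles_by_id = {row["tconst"]: row for row in basics}
    joined = []
    for rating in ratings:
        title = titles_by_id.get(rating["tconst"])
        if title is not None:
            joined.append({
                "averageRating": float(rating["averageRating"]),
                "numVotes": int(rating["numVotes"]),
                "originalTitle": title["originalTitle"],
                "genres": title["genres"],
            })
    return joined


merged = join_on_tconst(read_tsv(RATINGS_TSV), read_tsv(BASICS_TSV))
for row in merged:
    print(row)
```

In this toy data, the title without a rating row (`tt0000003`) drops out of the result, and the missing genre comes through as `None` rather than as the literal text `\N`.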




In [3]:
#import the dataset with all ratings from IMDb.
#note that each rating has an ID, not the show/movie title
ratings_df= pd.read_table("title.ratings.tsv")

#import the dataset with show/movie title given the ID
titles_df= pd.read_table("title.basics.tsv")

In [4]:
#perform an SQL join to obtain a dataframe with the rating, number of votes for that rating,
#title of the show/movie, and the show/movie genre
merged_ratings_df= duckdb.sql("""SELECT r.averageRating, r.numVotes, t.originalTitle, t.genres
FROM ratings_df r
JOIN titles_df t ON r.tconst=t.tconst""").df()

print(merged_ratings_df)

         averageRating  numVotes                               originalTitle  \
0                  8.3        15                               Episode #3.26   
1                  8.4        13                               Episode #3.27   
2                  6.5        23                          The Bigger Picture   
3                  8.3        14                               Episode #3.28   
4                  6.6        24                                 A Sober Way   
...                ...       ...                                         ...   
1626655            5.5        65  Karsten og Petra - Gullringen fra Atlantis   
1626656            7.6         6               Im Frühling auf dem Peloponnes   
1626657            8.0         7                     Noah: The New Year 2023   
1626658            9.0         6                               Do Not Answer   
1626659            8.1       421                                    Mithunam   

                        genres  
0    

In [5]:
#import 1st half of 2023 netflix data
all_jan_jun_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx",
                                  sheet_name="Engagement",
                                  skiprows=5)

#import 2nd half of 2023 netflix data
shows_jul_dec_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jul-Dec.xlsx",
                                  sheet_name="TV",
                                  skiprows=5)
movies_jul_dec_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jul-Dec.xlsx",
                                  sheet_name="Film",
                                  skiprows=5)

#import the 1st half of 2024 netflix data
shows_jan_jun_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jan-Jun.xlsx",
                                   sheet_name= "TV",
                                   skiprows=5)
movies_jan_jun_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jan-Jun.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)

#import the 2nd half of 2024 netflix data
shows_jul_dec_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jul-Dec.xlsx",
                                   sheet_name= "TV",
                                   skiprows=5)
movies_jul_dec_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jul-Dec.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)


#import the 1st half of 2025 netflix data
shows_jan_jun_2025= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun.xlsx",
                                   sheet_name= "Shows",
                                   skiprows=5)
movies_jan_jun_2025= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun.xlsx",
                                   sheet_name= "Movies",
                                   skiprows=5)

In [6]:
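#create a new column with the media Type (whether show or movie)
#assign 0 to be show and 1 to be movie
#the 1st half of 2023 report has a single sheet that mixes shows and movies,
#so its Type is unknown and set to NaN
all_jan_jun_2023["Type"]= np.nan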

shows_jul_dec_2023["Type"]= 0
movies_jul_dec_2023["Type"]= 1

shows_jan_jun_2024["Type"]= 0
movies_jan_jun_2024["Type"]= 1

shows_jul_dec_2024["Type"]= 0
movies_jul_dec_2024["Type"]= 1

shows_jan_jun_2025["Type"]= 0
movies_jan_jun_2025["Type"]= 1
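
The combination step that follows stacks these tagged tables with SQL `UNION` and `UNION ALL`, and the difference between the two matters when the same title appears in more than one report. The sketch below illustrates it on made-up rows. It uses the standard-library `sqlite3` module instead of DuckDB purely so that it runs without extra packages; the two operators follow the same standard SQL semantics in both. The table names and the `Period` labels are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE first_half  (Title TEXT, Hours_Viewed INTEGER, Type INTEGER);
    CREATE TABLE second_half (Title TEXT, Hours_Viewed INTEGER, Type INTEGER);
    -- Made-up rows: 'Example Film' has identical values in both periods.
    INSERT INTO first_half  VALUES ('Example Film', 100000, 1), ('Example Show', 900000, 0);
    INSERT INTO second_half VALUES ('Example Film', 100000, 1), ('Other Show', 500000, 0);
""")

# UNION removes rows that are identical in every column.
union_rows = con.execute(
    "SELECT * FROM first_half UNION SELECT * FROM second_half"
).fetchall()

# UNION ALL keeps every row from both inputs.
union_all_rows = con.execute(
    "SELECT * FROM first_half UNION ALL SELECT * FROM second_half"
).fetchall()

# Adding a (hypothetical) period label makes otherwise identical rows distinct,
# so they survive even a plain UNION.
tagged_rows = con.execute("""
    SELECT *, '2023H1' AS Period FROM first_half
    UNION
    SELECT *, '2023H2' AS Period FROM second_half
""").fetchall()

print(len(union_rows))      # 3
print(len(union_all_rows))  # 4
print(len(tagged_rows))     # 4
```

This may be relevant to the output further down: the three yearly row counts printed there add up to 82,114, while the printed index of the final table ends at 73998, which would be consistent with the last `UNION` dropping rows that are identical across reports. Whether those rows should be kept (for example with `UNION ALL` or a period column) depends on how the analysis treats a title that shows up in several periods.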

In [18]:
#combine all these dataframes into 1 Netflix dataframe.
#some of the column names differ between years and between shows and movies
#(the 1st half of 2023 doesn't have as many columns), so the columns are selected and renamed explicitly.
#note: UNION drops rows that are exact duplicates, while UNION ALL keeps every row.

jul_dec_2023= duckdb.sql("""SELECT * FROM shows_jul_dec_2023 
                        UNION 
                        SELECT * FROM movies_jul_dec_2023""")

netflix2023= duckdb.sql("""SELECT 
                    Title, 
                    "Available Globally?" AS Global, 
                    "Release Date" AS Release_Date, 
                    "Hours Viewed" AS Hours_Viewed, 
                    Runtime, Views, Type 
                FROM jul_dec_2023 
                UNION 
                SELECT
                    Title, 
                    "Available Globally?" AS Global, 
                    "Release Date" AS Release_Date, 
                    "Hours Viewed" AS Hours_Viewed, 
                    NULL AS Runtime, 
                    NULL AS Views, 
                    Type 
                FROM all_jan_jun_2023""").df()

print(len(all_jan_jun_2023) + len(shows_jul_dec_2023) + len(movies_jul_dec_2023))
print(netflix2023.shape)


jan_jun_2024= duckdb.sql("""SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM shows_jan_jun_2024
                            UNION 
                            SELECT  
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM movies_jan_jun_2024""").df()

jul_dec_2024= duckdb.sql("""SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM shows_jul_dec_2024
                            UNION 
                            SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM movies_jul_dec_2024""").df()

netflix2024= duckdb.sql("""SELECT * FROM jan_jun_2024 UNION ALL SELECT * FROM jul_dec_2024""").df()
print(len(shows_jan_jun_2024) + len(movies_jan_jun_2024) + len(shows_jul_dec_2024) + len(movies_jul_dec_2024))
print(netflix2024.shape)


netflix2025= duckdb.sql("""SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM shows_jan_jun_2025
                            UNION 
                            SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM movies_jan_jun_2025""").df()

print(len(shows_jan_jun_2025) + len(movies_jan_jun_2025))
print(netflix2025.shape)

netflix_df= duckdb.sql("""SELECT * FROM netflix2023 UNION SELECT * FROM netflix2024 UNION SELECT * FROM netflix2025""").df()

print(netflix_df)

34208
(34208, 7)
31724
(31724, 7)
16182
(16182, 7)
                                                   Title Global Release_Date  \
0                                                 Nonnas    Yes   2025-05-09   
1                                               Carry-On    Yes   2024-12-13   
2                                                   K.O.    Yes   2025-06-06   
3                       A Widow's Game // La viuda negra    Yes   2025-05-30   
4                                              The Union    Yes   2024-08-16   
...                                                  ...    ...          ...   
73994                                 Love Jacked (2018)     No         None   
73995  Peter Bell II: The Hunt for the Czar Crown // ...     No         None   
73996                                       Senario Lagi     No         None   
73997                                        Cube (1997)     No         None   
73998                                   Remember Baghdad     No      