# __FINAL PROJECT PHASE II__

# __RESEARCH QUESTION:__

Which criteria matter most for maximizing viewership on Netflix? Which type of media on Netflix draws more viewership, shows or movies? Can we accurately predict viewership and ratings from observed criteria such as country of origin, global availability, and genre?

### Importing:

In [1]:
#imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb

In [2]:
pip install pandas openpyxl

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


### Data Overview & Sources:

Seven data tables were collected. The first five are Netflix's semi-annual engagement reports, covering the first half of 2023 through the first half of 2025. From the second half of 2023 onward, each report contains two tabs, one for shows and one for films, with their respective data (e.g., runtime, viewership, global availability); the first-half 2023 report has a single tab with fewer columns. The IMDb titles table lists every movie and show, each with its own identifier (`tconst`). The IMDb ratings table references these identifiers to give each title its average rating and number of votes. These two IMDb tables were combined into one extensive IMDb table to cross-reference with the Netflix reports.

Source for Netflix Engagement Report First Half 2023: https://about.netflix.com/en/news/what-we-watched-a-netflix-engagement-report

Source for Netflix Engagement Report Second Half 2023: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2023

Source for Netflix Engagement Report First Half 2024: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2024

Source for Netflix Engagement Report Second Half 2024: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2024

Source for Netflix Engagement Report First Half 2025: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2025

Source of IMDb Ratings for Movies/Shows: https://datasets.imdbws.com/title.ratings.tsv.gz

Source of IMDb Movie/Show Titles: https://datasets.imdbws.com/title.basics.tsv.gz
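The report columns used in this notebook include no IMDb identifier, so cross-referencing the Netflix reports with IMDb means matching on title text. The sketch below shows one possible way to build matching keys. It assumes that a ` // ` separator in a Netflix title divides two alternative titles and that a trailing four-digit year in parentheses is a disambiguator rather than part of the title. Both assumptions come from a handful of titles in the reports, and `title_keys` is a hypothetical helper, not part of the project code.

```python
import re

# Matches a trailing year in parentheses, e.g. " (2015)" (assumed disambiguator).
YEAR_SUFFIX = re.compile(r"\s*\((?:19|20)\d{2}\)\s*$")


def title_keys(netflix_title):
    """Return lowercase candidate keys for matching a Netflix title to IMDb."""
    keys = []
    # Assumption: " // " separates two alternative titles for the same entry.
    for part in netflix_title.split(" // "):
        cleaned = YEAR_SUFFIX.sub("", part).strip().lower()
        if cleaned and cleaned not in keys:
            keys.append(cleaned)
    return keys


print(title_keys("Lucca's World // Los dos hemisferios de Lucca"))
print(title_keys("Sicario (2015)"))
```

Each key could then be compared against lowercased `primaryTitle` and `originalTitle` values from IMDb. Exact text matching will still miss some titles and may pair unrelated titles that share a name, so match rates are worth checking before any modeling.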




# __Data Collection & Cleaning:__

In [3]:
#import the dataset with all ratings from IMDb.
#note that each rating has an ID, not the show/movie title
ratings_df= pd.read_table("title.ratings.tsv")

#import the dataset with show/movie (episode) title given the ID
titles_df= pd.read_table("title.basics.tsv")

#import the dataset with show/movie parent ID (for series titles), given the episode ID
series_title_id_df = pd.read_table("title.episode.tsv")

In [33]:
#perform an SQL join to obtain a dataframe with the rating, number of votes for that rating,
#title of the show/movie, and the show/movie genre
merged_ratings_df= duckdb.sql("""SELECT r.tconst, r.averageRating, r.numVotes, t.originalTitle, t.primaryTitle, t.genres
FROM ratings_df r
JOIN titles_df t ON r.tconst=t.tconst""").df()

merged_ratings_df.head(10)

Unnamed: 0,tconst,averageRating,numVotes,originalTitle,primaryTitle,genres
0,tt11950630,6.8,60,Das Schützenfest,Das Schützenfest,"Biography,Drama,History"
1,tt11950780,6.6,56,Alte Schuld und alte Liebe,Alte Schuld und alte Liebe,"Biography,Drama,History"
2,tt11950782,8.1,9,"Let's Go, PoPiPa","Let's Go, PoPiPa","Animation,Comedy,Drama"
3,tt11950794,6.3,7,Episode #2.7,Episode #2.7,"Comedy,Drama"
4,tt11950836,6.0,929,Lúa vermella,Red Moon Tide,"Drama,Fantasy,Horror"
5,tt11950864,8.2,4638,Astrid et Raphaëlle,Astrid et Raphaëlle,"Crime,Drama,Thriller"
6,tt11950874,8.5,6,Fire and Water,Fire and Water,News
7,tt11950876,8.5,61,Ten Final Pushes,Ten Final Pushes,"Action,Adventure,Animation"
8,tt11950878,8.7,71,100 Spectacular Dances,100 Spectacular Dances,"Action,Adventure,Animation"
9,tt11950884,7.2,86,Blade of the Immortal,Blade of the Immortal,"Action,Adventure,Animation"
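The joined table includes individual episodes (for example, the `Episode #2.7` row), while the Netflix reports list shows and films. One possible refinement is to keep only title-level records using the `titleType` column of `title.basics`. The sketch below uses Python's built-in `sqlite3` with a few made-up rows so that it runs without the IMDb files. The set of `titleType` values kept is an assumption, not a choice made in this project.

```python
import sqlite3

# Tiny in-memory stand-ins for title.basics and title.ratings (made-up rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE titles_df (tconst TEXT, titleType TEXT, primaryTitle TEXT)")
con.execute("CREATE TABLE ratings_df (tconst TEXT, averageRating REAL, numVotes INTEGER)")
con.executemany(
    "INSERT INTO titles_df VALUES (?, ?, ?)",
    [
        ("tt0000001", "movie", "Example Film"),
        ("tt0000002", "tvSeries", "Example Show"),
        ("tt0000003", "tvEpisode", "Episode #2.7"),
    ],
)
con.executemany(
    "INSERT INTO ratings_df VALUES (?, ?, ?)",
    [("tt0000001", 7.1, 1200), ("tt0000002", 8.0, 5400), ("tt0000003", 6.3, 7)],
)

# Same join as in the notebook, restricted to title-level records.
# Assumption: these four titleType values are the ones worth keeping.
TITLE_LEVEL_QUERY = """
SELECT r.tconst, r.averageRating, r.numVotes, t.primaryTitle, t.titleType
FROM ratings_df AS r
JOIN titles_df AS t ON r.tconst = t.tconst
WHERE t.titleType IN ('movie', 'tvMovie', 'tvSeries', 'tvMiniSeries')
ORDER BY r.tconst
"""
title_level_rows = con.execute(TITLE_LEVEL_QUERY).fetchall()
for row in title_level_rows:
    print(row)
```

In the notebook, the same query text could be passed to `duckdb.sql(...)` against the real `ratings_df` and `titles_df` dataframes.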


In [5]:
#import 1st half of 2023 netflix data
all_jan_jun_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx",
                                  sheet_name="Engagement",
                                  skiprows=5)

#import 2nd half of 2023 netflix data
shows_jul_dec_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jul-Dec.xlsx",
                                  sheet_name="TV",
                                  skiprows=5)
movies_jul_dec_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jul-Dec.xlsx",
                                  sheet_name="Film",
                                  skiprows=5)

#import the 1st half of 2024 netflix data
shows_jan_jun_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jan-Jun.xlsx",
                                   sheet_name= "TV",
                                   skiprows=5)
movies_jan_jun_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jan-Jun.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)

#import the 2nd half of 2024 netflix data
shows_jul_dec_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jul-Dec.xlsx",
                                   sheet_name= "TV",
                                   skiprows=5)
movies_jul_dec_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jul-Dec.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)


#import the 1st half of 2025 netflix data
shows_jan_jun_2025= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun.xlsx",
                                   sheet_name= "Shows",
                                   skiprows=5)
movies_jan_jun_2025= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun.xlsx",
                                   sheet_name= "Movies",
                                   skiprows=5)

In [6]:
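#create a new column with the media Type (whether show or movie)
#assign 0 to be show and 1 to be movie
#the 1st half of 2023 report is read from a single sheet that doesn't separate shows from movies, so Type is left missing
all_jan_jun_2023["Type"]= np.nan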

shows_jul_dec_2023["Type"]= 0
movies_jul_dec_2023["Type"]= 1

shows_jan_jun_2024["Type"]= 0
movies_jan_jun_2024["Type"]= 1

shows_jul_dec_2024["Type"]= 0
movies_jul_dec_2024["Type"]= 1

shows_jan_jun_2025["Type"]= 0
movies_jan_jun_2025["Type"]= 1

In [28]:
#combine all these dataframes into 1 Netflix dataframe.
#be careful that some of the column names are different in different years and depending on whether 
#it's a show or movie. (The 1st half of 2023 doesn't have as many columns)

jul_dec_2023= duckdb.sql("""SELECT * FROM shows_jul_dec_2023 
                        UNION 
                        SELECT * FROM movies_jul_dec_2023""")

netflix2023= duckdb.sql("""SELECT 
                    Title, 
                    "Available Globally?" AS Global, 
                    "Release Date" AS Release_Date, 
                    "Hours Viewed" AS Hours_Viewed, 
                    Runtime, Views, Type 
                FROM jul_dec_2023 
                UNION 
                SELECT
                    Title, 
                    "Available Globally?" AS Global, 
                    "Release Date" AS Release_Date, 
                    "Hours Viewed" AS Hours_Viewed, 
                    NULL AS Runtime, 
                    NULL AS Views, 
                    Type 
                FROM all_jan_jun_2023""").df()

print(len(all_jan_jun_2023) + len(shows_jul_dec_2023) + len(movies_jul_dec_2023))
print(netflix2023.shape)


jan_jun_2024= duckdb.sql("""SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM shows_jan_jun_2024
                            UNION 
                            SELECT  
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM movies_jan_jun_2024""").df()

jul_dec_2024= duckdb.sql("""SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM shows_jul_dec_2024
                            UNION 
                            SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM movies_jul_dec_2024""").df()

netflix2024= duckdb.sql("""SELECT * FROM jan_jun_2024 UNION ALL SELECT * FROM jul_dec_2024""").df()
print(len(shows_jan_jun_2024) + len(movies_jan_jun_2024) + len(shows_jul_dec_2024) + len(movies_jul_dec_2024))
print(netflix2024.shape)


netflix2025= duckdb.sql("""SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM shows_jan_jun_2025
                            UNION 
                            SELECT 
                                Title, 
                                "Available Globally?" AS Global, 
                                "Release Date" AS Release_Date, 
                                "Hours Viewed" AS Hours_Viewed, 
                                Runtime, Views, Type
                            FROM movies_jan_jun_2025""").df()

print(len(shows_jan_jun_2025) + len(movies_jan_jun_2025))
print(netflix2025.shape)

netflix_df= duckdb.sql("""SELECT * FROM netflix2023 UNION SELECT * FROM netflix2024 UNION SELECT * FROM netflix2025""").df()

netflix_df.head(10)

34208
(34208, 7)
31724
(31724, 7)
16182
(16182, 7)


Unnamed: 0,Title,Global,Release_Date,Hours_Viewed,Runtime,Views,Type
0,Lucca's World // Los dos hemisferios de Lucca,Yes,2025-01-31,46600000,1:37,28800000,1.0
1,Fear Street: Prom Queen,Yes,2025-05-23,34400000,1:30,22900000,1.0
2,Unexplainable // Inexplicável,No,2025-04-16,32900000,1:55,17200000,1.0
3,Pushpa 2: The Rule (Reloaded Version),No,,63700000,3:44,17100000,1.0
4,Sicario (2015),No,,28200000,2:01,14000000,1.0
5,Aquaman and the Lost Kingdom,No,,28100000,2:04,13600000,1.0
6,The Most Beautiful Girl in the World,Yes,2025-02-14,27000000,2:03,13200000,1.0
7,Spider-Man: Into the Spider-Verse,No,,25000000,1:57,12800000,1.0
8,Sniper: Assassin's End,No,,19500000,1:35,12300000,1.0
9,Now You See Me,No,,23300000,1:55,12200000,1.0
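The matching row counts printed above indicate that `UNION` dropped no rows for these reports. In general, though, `UNION` removes duplicate rows while `UNION ALL` keeps every row, and the combined table carries no column recording which report a row came from. The sketch below illustrates both points with Python's built-in `sqlite3` and made-up rows. The `Period` column and its labels are hypothetical additions, not part of the project code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE first_half (Title TEXT, Hours_Viewed INTEGER)")
con.execute("CREATE TABLE second_half (Title TEXT, Hours_Viewed INTEGER)")
# The same made-up title with identical values in two reporting periods.
con.execute("INSERT INTO first_half VALUES ('Example Show', 1000000)")
con.execute("INSERT INTO second_half VALUES ('Example Show', 1000000)")

# UNION collapses the two identical rows into one.
union_rows = con.execute(
    "SELECT * FROM first_half UNION SELECT * FROM second_half"
).fetchall()

# UNION ALL keeps both rows.
union_all_rows = con.execute(
    "SELECT * FROM first_half UNION ALL SELECT * FROM second_half"
).fetchall()

# A literal period label keeps the source report identifiable after combining.
labelled_rows = con.execute("""
    SELECT Title, Hours_Viewed, '2024H1' AS Period FROM first_half
    UNION ALL
    SELECT Title, Hours_Viewed, '2024H2' AS Period FROM second_half
""").fetchall()

print(len(union_rows), len(union_all_rows), len(labelled_rows))
```

The same SQL pattern should carry over to the `duckdb.sql(...)` queries in this notebook. A period label would also make it possible to compare a title's viewership across reports.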


# __Data Description:__


Write Description here 

# __Data Limitations:__


write limitations here

# __Exploratory Data Analysis:__


Start analysis here 

# __Questions for Reviewers:__


Put questions here