# __FINAL PROJECT PHASE II__

# __RESEARCH QUESTION:__

Which criteria matter most for maximizing viewership on Netflix? Which type of media on Netflix is more successful in terms of viewership: shows or movies? Can we accurately predict viewership and ratings from observed criteria such as country of origin and global availability?

### Importing:

In [6]:
#imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import duckdb

In [13]:
%pip install pandas openpyxl

Note: you may need to restart the kernel to use updated packages.


### Data Overview & Sources:

Seven data tables were collected. The first five are Netflix's semi-annual engagement reports, covering the first half of 2023 through the first half of 2025. Each report from the second half of 2023 onward contains two tabs, one for shows and one for films, with their respective data (e.g., runtime, viewership, global availability); the first-half 2023 report has a single tab. The remaining two tables come from IMDb. The IMDb titles table lists all movies and shows, each with its own identification tag, and the IMDb ratings table references these tags to give each movie/show its average rating and number of votes. These two IMDb tables were combined into one extensive IMDb table to cross-reference with the Netflix reports.

Source for Netflix Engagement Report First Half 2023: https://about.netflix.com/en/news/what-we-watched-a-netflix-engagement-report

Source for Netflix Engagement Report Second Half 2023: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2023

Source for Netflix Engagement Report First Half 2024: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2024

Source for Netflix Engagement Report Second Half 2024: https://about.netflix.com/en/news/what-we-watched-the-second-half-of-2024

Source for Netflix Engagement Report First Half 2025: https://about.netflix.com/en/news/what-we-watched-the-first-half-of-2025

Source of IMDb Ratings for Movies/Shows: https://datasets.imdbws.com/title.ratings.tsv.gz

Source of IMDb Movie/Show Titles: https://datasets.imdbws.com/title.basics.tsv.gz




In [7]:
#import the dataset with all ratings from IMDb.
#note that each rating has an ID, not the show/movie title
ratings_df = pd.read_table("title.ratings.tsv")

#import the dataset with show/movie title given the ID
titles_df = pd.read_table("title.basics.tsv")
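
One detail worth checking when loading the IMDb files: the IMDb TSV exports use the literal string `\N` to mark a missing field, which pandas does not treat as missing by default. The following is a minimal sketch on made-up rows (the `sample_tsv` text and `sample_titles_df` name are illustrative, not part of the project data) showing how `na_values` could be used so those fields load as `NaN`.

```python
import io

import pandas as pd

# Toy stand-in for a few columns of title.basics.tsv (rows are made up).
sample_tsv = (
    "tconst\toriginalTitle\tgenres\n"
    "tt9999991\tToy Title A\tAction,Adventure\n"
    "tt9999992\tToy Title B\t\\N\n"
)

# Treat the literal string \N as a missing value while parsing.
sample_titles_df = pd.read_table(io.StringIO(sample_tsv), na_values="\\N")

print(sample_titles_df["genres"].isna().tolist())
```

The same `na_values="\\N"` argument could be passed to the two `pd.read_table` calls above if missing genres or years need to be handled as `NaN` later on.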

In [9]:
#perform an SQL join to obtain a dataframe with the rating, number of votes for that rating,
#title of the show/movie, and the show/movie genre
merged_ratings_df= duckdb.sql("""SELECT r.averageRating, r.numVotes, t.originalTitle, t.genres
FROM ratings_df r
JOIN titles_df t ON r.tconst = t.tconst""").df()

print(merged_ratings_df)

         averageRating  numVotes                             originalTitle  \
0                  8.1        10                               Removal Van   
1                  6.5        19                              Episode #1.1   
2                  6.4         7                             Episode #1.10   
3                  7.9         9                              Episode #4.1   
4                  6.2         7                              Episode #1.3   
...                ...       ...                                       ...   
1626655            8.6       133                       Frensham Great Pond   
1626656            9.4        21                           Nee vente nennu   
1626657            6.3        13         Chris Pratt/Olivia Munn/She & Him   
1626658            6.9        17  Bruno Mars/Jennifer Lawrence/T.J. Miller   
1626659            5.4        12              Katie Holmes/Seth MacFarlane   

                             genres  
0        Action,Adventure
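
The printed result includes rows such as `Episode #1.1`, so individual episodes are mixed in with movies and series. Below is a rough sketch, on toy data, of the same inner join written with `DataFrame.merge`, followed by one possible way to keep only movie- and series-level rows. It assumes `title.basics.tsv` carries a `titleType` column with values like `movie`, `tvSeries`, and `tvEpisode`; the `toy_*` names and the choice of which types to keep are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for ratings_df and titles_df (values are made up).
toy_ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [8.1, 6.5, 7.9],
    "numVotes": [10, 19, 9],
})
toy_titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt4"],
    "titleType": ["movie", "tvEpisode", "tvSeries"],
    "originalTitle": ["Toy Title A", "Episode #1.1", "Toy Title D"],
    "genres": ["Action,Adventure", "Drama", "Comedy"],
})

# Inner join on the shared ID, equivalent to the SQL join on tconst.
toy_merged = toy_ratings.merge(toy_titles, on="tconst", how="inner")

# Assumed filter: keep only title types likely to appear in the Netflix reports.
keep_types = ["movie", "tvSeries"]
toy_filtered = toy_merged[toy_merged["titleType"].isin(keep_types)]

print(toy_filtered[["averageRating", "numVotes", "originalTitle", "genres"]])
```

Only IDs present in both tables survive the join, and the filter then removes the episode row.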

In [19]:
#import 1st half of 2023 netflix data
all_jan_jun_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx",
                                  sheet_name="Engagement",
                                  skiprows=5)

#import 2nd half of 2023 netflix data
shows_jul_dec_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jul-Dec.xlsx",
                                  sheet_name="TV",
                                  skiprows=5)
movies_jul_dec_2023= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2023Jul-Dec.xlsx",
                                  sheet_name="Film",
                                  skiprows=5)

#import the 1st half of 2024 netflix data
shows_jan_jun_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jan-Jun.xlsx",
                                   sheet_name= "TV",
                                   skiprows=5)
movies_jan_jun_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jan-Jun.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)

#import the 2nd half of 2024 netflix data
shows_jul_dec_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jul-Dec.xlsx",
                                   sheet_name= "TV",
                                   skiprows=5)
movies_jul_dec_2024= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2024Jul-Dec.xlsx",
                                   sheet_name= "Film",
                                   skiprows=5)


#import the 1st half of 2025 netflix data
shows_jan_jun_2025= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun.xlsx",
                                   sheet_name= "Shows",
                                   skiprows=5)
movies_jan_jun_2025= pd.read_excel("What_We_Watched_A_Netflix_Engagement_Report_2025Jan-Jun.xlsx",
                                   sheet_name= "Movies",
                                   skiprows=5)

      Unnamed: 0                                           Title  \
0            NaN                                          Damsel   
1            NaN                                            Lift   
2            NaN  Society of the Snow // La sociedad de la nieve   
3            NaN                    Under Paris // Sous la Seine   
4            NaN                     The Super Mario Bros. Movie   
...          ...                                             ...   
9355         NaN                               أصحاب ...ولا أعزّ   
9356         NaN                                       두근두근 내 인생   
9357         NaN                                            레드슈즈   
9358         NaN                                        아이 캔 스피크   
9359         NaN                                              표적   

     Available Globally? Release Date  Hours Viewed Runtime      Views  
0                    Yes   2024-03-08     263700000    1:50  143800000  
1                    Yes   2024-01-12

In [26]:
#create a new column with the media Type (whether show or movie)
#assign 0 to be show and 1 to be movie
#the 1st half of 2023 report is read from a single sheet, so its Type is left missing
all_jan_jun_2023["Type"]= np.nan

shows_jul_dec_2023["Type"]= 0
movies_jul_dec_2023["Type"]= 1

shows_jan_jun_2024["Type"]= 0
movies_jan_jun_2024["Type"]= 1

shows_jul_dec_2024["Type"]= 0
movies_jul_dec_2024["Type"]= 1

shows_jan_jun_2025["Type"]= 0
movies_jan_jun_2025["Type"]= 1

Unnamed: 0.1,Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Type
0,,The Night Agent: Season 1,Yes,2023-03-23,812100000,
1,,Ginny & Georgia: Season 2,Yes,2023-01-05,665100000,
2,,The Glory: Season 1 // 더 글로리: 시즌 1,Yes,2022-12-30,622800000,
3,,Wednesday: Season 1,Yes,2022-11-23,507700000,
4,,Queen Charlotte: A Bridgerton Story,Yes,2023-05-04,503000000,
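
The Netflix titles shown above carry extras such as `: Season 1` and an alternate-language title after ` // `, which would keep them from matching IMDb titles exactly. Here is a minimal sketch of one way to build a match key before cross-referencing; `normalize_title` and `match_key` are hypothetical names, and the cleaning rules are assumptions based only on the handful of titles visible in the outputs.

```python
import re

import pandas as pd


def normalize_title(raw_title: str) -> str:
    """Hypothetical helper: reduce a Netflix report title to a lowercase match key."""
    # Keep only the part before " // " (an alternate-language title follows it).
    primary = raw_title.split(" // ")[0]
    # Drop a trailing ": Season N" suffix if present.
    primary = re.sub(r":\s*Season\s+\d+$", "", primary)
    return primary.strip().casefold()


toy_netflix = pd.DataFrame({"Title": [
    "The Night Agent: Season 1",
    "The Glory: Season 1 // 더 글로리: 시즌 1",
    "Society of the Snow // La sociedad de la nieve",
]})
toy_netflix["match_key"] = toy_netflix["Title"].map(normalize_title)

print(toy_netflix["match_key"].tolist())
```

Whether the part before or after ` // ` lines up with IMDb's `originalTitle` would need to be checked against the data. Titles without a season suffix, such as `Queen Charlotte: A Bridgerton Story`, pass through unchanged apart from lowercasing.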


In [None]:
#write code to combine all these dataframes into 1 Netflix dataframe.
#be careful that some of the column names are different in different years and depending on whether 
#it's a show or movie. (The 1st half of 2023 doesn't have as many columns)
#Netflix_df= _____


#remove the first column because it is empty (due to the 1st column of the Excel being blank)
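
A possible starting point for the combining step described in the comments above, shown on toy frames rather than the real reports. `combine_reports`, the `Period` column, and the optional `rename_map` argument are hypothetical; the actual column-name differences between reports would have to be filled in after inspecting each file.

```python
import numpy as np
import pandas as pd


def combine_reports(frames_by_period, rename_map=None):
    """Stack per-period report frames into one frame with a shared set of columns."""
    tagged = []
    for period, frame in frames_by_period.items():
        # Optionally harmonize column names that differ between reports.
        frame = frame.rename(columns=rename_map or {})
        # Drop the empty "Unnamed: ..." column created by the blank first Excel column.
        is_unnamed = frame.columns.astype(str).str.startswith("Unnamed")
        frame = frame.loc[:, ~is_unnamed].copy()
        # Record which report each row came from (assumed to be useful later).
        frame["Period"] = period
        tagged.append(frame)
    # concat aligns on column names; columns missing from a report become NaN.
    return pd.concat(tagged, ignore_index=True, sort=False)


# Toy frames mimicking the differing column sets (values are made up).
toy_2023_h1 = pd.DataFrame({
    "Unnamed: 0": [np.nan],
    "Title": ["Toy Show: Season 1"],
    "Hours Viewed": [100],
})
toy_2024_h1_movies = pd.DataFrame({
    "Unnamed: 0": [np.nan],
    "Title": ["Toy Movie"],
    "Hours Viewed": [50],
    "Runtime": ["1:50"],
    "Views": [27],
    "Type": [1],
})

toy_netflix_df = combine_reports({"2023H1": toy_2023_h1, "2024H1": toy_2024_h1_movies})
print(toy_netflix_df)
```

With the real data, the dictionary would hold the nine frames loaded earlier, and rows from the first half of 2023 would show `NaN` in the columns that report lacks.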