# <u>**Streaming Service Comparison**</u>

### **Objective:**
- Determine which streaming platform hosts the majority of content I enjoy so that I can pare down the services to which I subscribe. 

### **Data Sources:**
- [Netflix via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
- [Hulu via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
- [Prime via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset/data)
- [AppleTV via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)

## **Data Collection & Loading**

### **Import pandas, NumPy, Matplotlib, WordCloud, and PIL**

In [1]:
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

### **Data Load**

In [2]:
# Load each file into its own DataFrame.
# TODO: wrap this in a reusable function.
apple = pd.read_csv("AppleTV.csv")
hulu = pd.read_csv("Hulu.csv")
netflix = pd.read_csv("Netflix.csv")
prime = pd.read_csv("Prime.csv")
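
The planned loader function could look something like the sketch below. It is only an illustration: the `SERVICE_FILES` mapping and the `load_services` name are hypothetical, and it assumes the four CSV files sit in a single directory under the file names used in the cell above.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping of service label -> CSV file name
# (file names copied from the load cell above).
SERVICE_FILES = {
    "AppleTV": "AppleTV.csv",
    "Hulu": "Hulu.csv",
    "Netflix": "Netflix.csv",
    "Prime": "Prime.csv",
}


def load_services(data_dir=".", files=SERVICE_FILES):
    """Read each CSV in `files` from `data_dir` and return a dict of DataFrames keyed by service."""
    return {
        service: pd.read_csv(Path(data_dir) / file_name)
        for service, file_name in files.items()
    }
```

With this in place, `frames = load_services()` would give `frames["Hulu"]`, `frames["Prime"]`, and so on, instead of four separate variables.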

In [3]:
display(apple)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113403.0,
1,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
2,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1238903.0,
3,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,477481.0,
4,Metropolis,movie,"Drama, Sci-Fi",1927.0,tt0017136,8.3,192285.0,
...,...,...,...,...,...,...,...,...
18203,They Took Our Child: We Got Her Back,tv,"Crime, Documentary",2015.0,tt5056408,8.6,31.0,
18204,,tv,,,,,,
18205,,tv,,2025.0,,,,
18206,,tv,Comedy,2018.0,,,,


In [4]:
# Display information about the dataset (i.e., row counts, column counts, column names, datatypes, # of non-null rows)
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18208 entries, 0 to 18207
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               17643 non-null  object 
 1   type                18208 non-null  object 
 2   genres              17549 non-null  object 
 3   releaseYear         18179 non-null  float64
 4   imdbId              16727 non-null  object 
 5   imdbAverageRating   16314 non-null  float64
 6   imdbNumVotes        16314 non-null  float64
 7   availableCountries  84 non-null     object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


In [5]:
# Drop the availableCountries column: only 84 of 18,208 rows are non-null, so it adds little value
apple.drop("availableCountries",axis=1,inplace=True)
apple.columns

Index(['title', 'type', 'genres', 'releaseYear', 'imdbId', 'imdbAverageRating',
       'imdbNumVotes'],
      dtype='object')

In [6]:
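# Cast releaseYear and imdbNumVotes to integers, then display only the data types
# Note: both columns contain NaN, which np.int64 cannot represent; in this run those
# values end up as -9223372036854775808 (visible in later outputs)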

apple[["releaseYear","imdbNumVotes"]] = apple[["releaseYear","imdbNumVotes"]].apply(np.int64)
display(apple.dtypes)

  arr = np.asarray(values, dtype=dtype)


title                 object
type                  object
genres                object
releaseYear            int64
imdbId                object
imdbAverageRating    float64
imdbNumVotes           int64
dtype: object
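
Both columns contain missing values, and a plain `np.int64` cast has no way to represent them. One possible alternative, sketched here on a small made-up frame rather than the real data, is pandas' nullable `Int64` dtype, which keeps missing entries as `<NA>`. The `demo` frame and `int_cols` list are illustrative names only.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same two columns and some missing values.
demo = pd.DataFrame(
    {
        "releaseYear": [1995.0, np.nan, 2025.0],
        "imdbNumVotes": [113403.0, 31.0, np.nan],
    }
)

int_cols = ["releaseYear", "imdbNumVotes"]

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>
# instead of being forced into an ordinary integer.
demo[int_cols] = demo[int_cols].astype("Int64")

print(demo.dtypes)
print(demo)
```

Summary statistics such as `describe()` skip `<NA>` values, so the minimum and mean would no longer be distorted by placeholder integers.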

In [7]:
# Rename Columns
apple.rename(
    columns={
        "title": "Title",
        "type": "Type",
        "genres": "Genres",
        "releaseYear": "Release Year",
        "imdbId": "IMDb ID",
        "imdbAverageRating": "IMDb Average Rating",
        "imdbNumVotes": "IMDb Num Votes",
    },
    inplace=True,
)
apple.columns

Index(['Title', 'Type', 'Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes'],
      dtype='object')

In [None]:
# Count the rows with a missing title
apple["Title"].isna().sum()

np.int64(565)

In [29]:
# Drop the rows containing null values in the Title column
apple = apple.dropna(subset=["Title"])
apple["Title"].isna().sum()


np.int64(0)

In [30]:
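# Split the comma-separated Genres string into one column per genre.
# Splitting on a comma plus any following whitespace avoids leading spaces (e.g. " Romance").
apple[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5","Genre 6"]] = apple["Genres"].str.split(r",\s*", expand=True, regex=True)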

display(apple)



Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Genre 1,Genre 2,Genre 3,Genre 4,Genre 5,Genre 6
0,Four Rooms,movie,Comedy,1995,tt0113101,6.7,113403,Comedy,,,,,
1,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,Drama,Romance,,,,
2,American Beauty,movie,Drama,1999,tt0169547,8.3,1238903,Drama,,,,,
3,Citizen Kane,movie,"Drama, Mystery",1941,tt0033467,8.3,477481,Drama,Mystery,,,,
4,Metropolis,movie,"Drama, Sci-Fi",1927,tt0017136,8.3,192285,Drama,Sci-Fi,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
18188,Das Boot - Die komplette TV-Serie,tv,"Drama, War",1985,tt30970892,8.8,198,Drama,War,,,,
18194,The Amazon Review Killer,tv,Documentary,2024,tt33342317,6.4,64,Documentary,,,,,
18198,Conspirators,tv,Mystery,2025,tt35075008,5.5,44,Mystery,,,,,
18200,Extraordinary World,tv,Documentary,2025,tt34888944,,-9223372036854775808,Documentary,,,,,
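
Hard-coding six genre column names only works while the longest genre list in the data has exactly six entries. A possible variation, shown on a made-up frame, derives the column names from the split result and trims whitespace around each genre. The `demo` and `genre_cols` names are illustrative, and the `regex=True` argument assumes a reasonably recent pandas release.

```python
import pandas as pd

# Made-up sample mirroring the Title/Genres columns.
demo = pd.DataFrame(
    {
        "Title": ["Four Rooms", "Forrest Gump", "Untitled"],
        "Genres": ["Comedy", "Drama, Romance", None],
    }
)

# Split on a comma followed by optional whitespace so no genre keeps a leading space.
genre_cols = demo["Genres"].str.split(r",\s*", expand=True, regex=True)

# Name the columns after however many parts the split actually produced.
genre_cols.columns = [f"Genre {i + 1}" for i in range(genre_cols.shape[1])]

demo = demo.join(genre_cols)
print(demo)
```

Rows with no genres keep missing values in every genre column.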


In [9]:
# View top 5 rows in the hulu df
hulu.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986.0,tt0092149,7.5,7727.0,
2,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
3,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
4,My Life Without Me,movie,"Drama, Romance",2003.0,tt0314412,7.4,26140.0,


In [10]:
# Add a serviceName column to the hulu DataFrame with the same value in every row, then display the bottom two rows
hulu["serviceName"] = "Hulu"
hulu.tail(2)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
10201,Cardcaptor Sakura: Clear Card Arc,tv,"Adventure, Animation, Comedy",2018.0,tt6279576,7.7,879.0,,Hulu
10202,Faces of Music,tv,Documentary,2025.0,tt35460841,,,,Hulu


In [11]:
# View top 5 rows in the netflix df
netflix.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1238903.0,
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003.0,tt0266697,8.2,1236086.0,
3,Jarhead,movie,"Biography, Drama, War",2005.0,tt0418763,7.0,213667.0,
4,Unforgiven,movie,"Drama, Western",1992.0,tt0105695,8.2,448833.0,


In [12]:
# Add a serviceName column to the netflix DataFrame with the same value in every row, then display the bottom two rows
netflix["serviceName"] = "Netflix"
netflix.tail(2)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
20734,American Manhunt: O.J. Simpson,tv,Documentary,2025.0,tt35456246,7.4,1148.0,,Netflix
20735,Devil's Diner,tv,"Drama, Horror",2025.0,tt35557166,7.2,289.0,,Netflix


In [13]:
# View top 5 rows of prime df
prime.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113403.0,
2,Judgment Night,movie,"Action, Crime, Drama",1993.0,tt0107286,6.6,19627.0,
3,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
4,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,477481.0,


In [14]:
# Add a serviceName column to the prime DataFrame with the same value in every row, then display the bottom two rows
prime["serviceName"] = "Prime"
prime.tail(2)

# TODO: replace the per-DataFrame cells above with one step that adds serviceName to every DataFrame.

# Idea: keep a dictionary of {service name: DataFrame} and use each key to populate that DataFrame's serviceName column.

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
70053,Ryan Trahan,tv,Reality-TV,2019.0,tt28487606,8.2,42.0,,Prime
70054,Sivarapalli,tv,Comedy,2025.0,tt31914057,7.4,120.0,,Prime
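
The dictionary idea noted in the comments above might be sketched as follows. The tiny frames are stand-ins for the loaded `apple`, `hulu`, `netflix`, and `prime` DataFrames, and the `frames` name and the "AppleTV" label are assumptions for illustration.

```python
import pandas as pd

# Stand-in frames; in the notebook these would be the DataFrames loaded from the CSV files.
frames = {
    "AppleTV": pd.DataFrame({"title": ["Four Rooms"]}),
    "Hulu": pd.DataFrame({"title": ["Ariel", "Forrest Gump"]}),
    "Netflix": pd.DataFrame({"title": ["American Beauty"]}),
    "Prime": pd.DataFrame({"title": ["Judgment Night"]}),
}

# Use each dictionary key as the constant serviceName value for its DataFrame.
for service, df in frames.items():
    df["serviceName"] = service

print(frames["Hulu"])
```

Because the loop covers every entry, no service is left without a `serviceName` column.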


In [15]:
# Stack the DataFrames row-wise with concat.
# Note: apple's columns were renamed above, so they do not line up with the original column names in the other three DataFrames.
streaming = pd.concat([apple, hulu, netflix, prime], axis = 0)


In [16]:
streaming.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119202 entries, 0 to 70054
Data columns (total 22 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Title                17643 non-null   object 
 1   Type                 18208 non-null   object 
 2   Genres               17549 non-null   object 
 3   Release Year         18208 non-null   float64
 4   IMDb ID              16727 non-null   object 
 5   IMDb Average Rating  16314 non-null   float64
 6   IMDb Num Votes       18208 non-null   float64
 7   Genre 1              17549 non-null   object 
 8   Genre 2              11705 non-null   object 
 9   Genre 3              6670 non-null    object 
 10  Genre 4              23 non-null      object 
 11  Genre 5              3 non-null       object 
 12  Genre 6              1 non-null       object 
 13  title                98045 non-null   object 
 14  type                 100994 non-null  object 
 15  genres               98
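
The 22 columns reported here come from apple's renamed columns ("Title", "Release Year", ...) sitting next to the other frames' original names ("title", "releaseYear", ...). One way to line them up, sketched on made-up frames, is to apply the same rename mapping to every frame before concatenating. `COLUMN_NAMES`, `apple_demo`, and `hulu_demo` are illustrative names, and the "AppleTV" label is an assumption.

```python
import pandas as pd

# Same mapping as the rename cell, plus nothing new; keys that are absent are simply ignored.
COLUMN_NAMES = {
    "title": "Title",
    "type": "Type",
    "genres": "Genres",
    "releaseYear": "Release Year",
    "imdbId": "IMDb ID",
    "imdbAverageRating": "IMDb Average Rating",
    "imdbNumVotes": "IMDb Num Votes",
}

# Made-up frames: one already renamed (like apple), one with the original names (like hulu).
apple_demo = pd.DataFrame({"Title": ["Four Rooms"], "Release Year": [1995]})
hulu_demo = pd.DataFrame(
    {"title": ["Ariel"], "releaseYear": [1988.0], "serviceName": ["Hulu"]}
)

# The apple frame never received a serviceName column, so add one here.
apple_demo["serviceName"] = "AppleTV"

# Rename every frame with the same mapping, then stack them with a fresh index.
frames = [df.rename(columns=COLUMN_NAMES) for df in (apple_demo, hulu_demo)]
streaming_demo = pd.concat(frames, ignore_index=True)

print(streaming_demo)
```

Using `ignore_index=True` also avoids repeated index labels across the stacked frames.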

In [17]:
streaming.describe()

Unnamed: 0,Release Year,IMDb Average Rating,IMDb Num Votes,releaseYear,imdbAverageRating,imdbNumVotes
count,18208.0,16314.0,18208.0,100736.0,89195.0,89195.0
mean,-1.469012e+16,6.381574,-9.594171e+17,2009.132951,6.110641,19369.37
std,3.678099e+17,1.162677,2.815851e+18,17.791458,1.308682,96161.32
min,-9.223372e+18,1.3,-9.223372e+18,1902.0,1.0,5.0
25%,2000.0,5.7,90.0,2006.0,5.3,129.0
50%,2014.0,6.5,847.0,2016.0,6.3,616.0
75%,2020.0,7.2,6243.25,2020.0,7.0,3575.0
max,2025.0,9.6,2348885.0,2026.0,10.0,3002274.0


In [18]:
streaming.duplicated().value_counts()

False    116987
True       2215
Name: count, dtype: int64

In [19]:
streaming[streaming.duplicated(keep=False)]

Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Genre 1,Genre 2,Genre 3,...,Genre 6,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
3497,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995.0,tt0112120,8.5,5.196000e+03,Biography,Documentary,History,...,,,,,,,,,,
13091,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023.0,,,-9.223372e+18,Documentary,Crime,,...,,,,,,,,,,
13093,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023.0,,,-9.223372e+18,Documentary,Crime,,...,,,,,,,,,,
13980,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995.0,tt0112120,8.5,5.196000e+03,Biography,Documentary,History,...,,,,,,,,,,
14499,,tv,,2007.0,,,-9.223372e+18,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70037,,,,,,,,,,,...,,,tv,,2021.0,,,,,Prime
70040,,,,,,,,,,,...,,,tv,,,,,,,Prime
70041,,,,,,,,,,,...,,,tv,Documentary,2025.0,,,,,Prime
70047,,,,,,,,,,,...,,,tv,,2024.0,,,,,Prime
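
Many of the rows flagged above have no title or IMDb ID at all, so they match each other on missing values alone. A possible refinement, shown on a made-up frame, counts a row as a duplicate only when it repeats an IMDb ID within the same service. The `demo`, `has_id`, `dupe_mask`, and `deduped` names are illustrative.

```python
import pandas as pd

# Made-up sample: one true repeat (tt1 on Prime) and two rows with no identifying data.
demo = pd.DataFrame(
    {
        "title": ["A", "A", "B", None, None],
        "imdbId": ["tt1", "tt1", "tt2", None, None],
        "serviceName": ["Prime", "Prime", "Prime", "Prime", "Prime"],
    }
)

# Only rows that actually carry an IMDb ID can be confirmed as duplicates.
has_id = demo["imdbId"].notna()

# Flag the second and later occurrences of the same ID within the same service.
dupe_mask = has_id & demo.duplicated(subset=["imdbId", "serviceName"], keep="first")

deduped = demo[~dupe_mask]
print(dupe_mask.sum(), len(deduped))
```

Rows without an ID are left in place here so they can be reviewed or dropped as a separate decision.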


In [20]:
streaming["title"].isnull().value_counts()


title
False    98045
True     21157
Name: count, dtype: int64

In [21]:
streaming["imdbId"].isnull().value_counts()

imdbId
False    92024
True     27178
Name: count, dtype: int64

In [22]:
# # Assign dataset names & combine to read in as separate data frames

# list_of_names = ['Netflix','Hulu','Prime','AppleTV']

# # Created an empty list into which I can place the datasets
# combined_list = []

# # Loop over the names and append each dataset to the empty list created above
# for i in range(len(list_of_names)):
#     temp_df = pd.read_csv(list_of_names[i]+".csv")
#     combined_list.append(temp_df)

### **Initial Checks**

In [23]:
# # Looking at file content of list in index 0 - Netflix - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[0].info()
# combined_list[0].head(5)



In [24]:
# # Looking at file content of list in index 1 - Hulu - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[1].info()
# combined_list[1].head(5)

In [25]:
# # Looking at file content of list in index 2 - Prime - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[2].info()
# combined_list[2].head(5)

In [26]:
# # Looking at file content of list in index 3 - AppleTV - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[3].info()
# combined_list[3].head(5)

### Initial Review (02/10/2025)
- The "availableCountries" column is almost entirely null (only 84 non-null rows in the AppleTV data), so it will not be useful going forward
- No dataset has a column for the source of the data, so this will need to be added

### Secondary Review (02/12/2025)
- Each dataset contains an "imdbId" column with the listing's IMDb ID, which is a string; this may come in handy 
- There are 3,514 missing titles
    - Prime = 2234
    - AppleTV = 651
    - Hulu = 2885
    - Netflix = 629
- There are 2,590 duplicated rows; many come from rows with missing titles, though several are repeated title names
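
Per-service counts like the ones in this review can be recomputed directly from the combined data, which makes them easy to recheck as the cleaning steps change. The sketch below uses a made-up frame and assumes every row carries a `serviceName` value; `demo`, `missing_titles`, and `duplicate_rows` are illustrative names.

```python
import pandas as pd

# Made-up sample standing in for the combined streaming data.
demo = pd.DataFrame(
    {
        "title": ["Ariel", None, "Ariel", None, "Jarhead"],
        "imdbId": ["tt0094675", None, "tt0094675", None, "tt0418763"],
        "serviceName": ["Hulu", "Hulu", "Hulu", "Prime", "Netflix"],
    }
)

# Missing titles per service: sum the boolean mask within each service group.
missing_titles = demo["title"].isna().groupby(demo["serviceName"]).sum()

# Fully duplicated rows per service (second and later occurrences only).
duplicate_rows = demo.duplicated().groupby(demo["serviceName"]).sum()

print(missing_titles)
print(duplicate_rows)
```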