# <u>**Streaming Service Comparison**</u>

### **Objective:**
- Determine which streaming platform hosts the majority of content I enjoy so that I can pare down the services to which I subscribe. 

### **Data Sources:**
- [Netflix via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
- [Hulu via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
- [Prime via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset/data)
- [AppleTV via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)

## **Data Collection & Loading**

### **Import Pandas, Numpy, Matplotlib, Wordcloud, and PIL**

In [1]:
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

### **Data Load**

In [2]:
# Load each file into its own DataFrame.
# TODO: decide whether to wrap this in a helper function.
apple = pd.read_csv("AppleTV.csv")
hulu = pd.read_csv("Hulu.csv")
netflix = pd.read_csv("Netflix.csv")
prime = pd.read_csv("Prime.csv")
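
The four `read_csv` calls above could be folded into one helper. The following is a minimal sketch, assuming the CSV files sit in the working directory under the names used above; `load_services` and `SERVICE_FILES` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd


def load_services(paths):
    """Read each CSV in `paths` (service name -> file path) into a DataFrame."""
    return {name: pd.read_csv(path) for name, path in paths.items()}


# File names taken from the cell above; adjust if the files live elsewhere.
SERVICE_FILES = {
    "AppleTV": "AppleTV.csv",
    "Hulu": "Hulu.csv",
    "Netflix": "Netflix.csv",
    "Prime": "Prime.csv",
}

# Uncomment once the CSV files are in place:
# frames = load_services(SERVICE_FILES)
# apple, hulu = frames["AppleTV"], frames["Hulu"]
```

Keeping the frames in a dictionary keyed by service name makes it possible to apply the same cleaning step to each one in a loop.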

### **Apple**   

In [3]:
# Preview the first three rows of each dataset
print(apple.head(3))
print(hulu.head(3))
print(netflix.head(3))
print(prime.head(3))

             title   type          genres  releaseYear     imdbId  \
0       Four Rooms  movie          Comedy       1995.0  tt0113101   
1     Forrest Gump  movie  Drama, Romance       1994.0  tt0109830   
2  American Beauty  movie           Drama       1999.0  tt0169547   

   imdbAverageRating  imdbNumVotes availableCountries  
0                6.7      113403.0                NaN  
1                8.8     2348885.0                NaN  
2                8.3     1238903.0                NaN  
                 title   type                  genres  releaseYear     imdbId  \
0                Ariel  movie  Comedy, Crime, Romance       1988.0  tt0094675   
1  Shadows in Paradise  movie    Comedy, Drama, Music       1986.0  tt0092149   
2         Forrest Gump  movie          Drama, Romance       1994.0  tt0109830   

   imdbAverageRating  imdbNumVotes availableCountries  
0                7.4        8949.0                NaN  
1                7.5        7727.0                NaN  
2     

In [4]:
# Display information about the dataset (i.e., row counts, column counts, column names, datatypes, # of non-null rows)
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18208 entries, 0 to 18207
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               17643 non-null  object 
 1   type                18208 non-null  object 
 2   genres              17549 non-null  object 
 3   releaseYear         18179 non-null  float64
 4   imdbId              16727 non-null  object 
 5   imdbAverageRating   16314 non-null  float64
 6   imdbNumVotes        16314 non-null  float64
 7   availableCountries  84 non-null     object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


In [5]:
apple.describe()

Unnamed: 0,releaseYear,imdbAverageRating,imdbNumVotes
count,18179.0,16314.0,16314.0
mean,2006.822432,6.381574,25548.55
std,18.41046,1.162677,100828.8
min,1902.0,1.3,5.0
25%,2001.0,5.7,201.0
50%,2014.0,6.5,1234.0
75%,2020.0,7.2,8106.5
max,2025.0,9.6,2348885.0


In [6]:
apple.select_dtypes("object").describe()

Unnamed: 0,title,type,genres,imdbId,availableCountries
count,17643,18208,17549,16727,84
unique,16986,2,798,16724,14
top,Teenage Mutant Ninja Turtles,movie,Drama,tt0112120,US
freq,5,14005,1590,2,42


In [7]:
# Drop the availableCountries column: only 84 of 18,208 rows are non-null, so it adds no value
apple.drop("availableCountries",axis=1,inplace=True)
apple.columns

Index(['title', 'type', 'genres', 'releaseYear', 'imdbId', 'imdbAverageRating',
       'imdbNumVotes'],
      dtype='object')
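
The decision to drop `availableCountries` rests on its share of missing values (84 non-null out of 18,208 rows, roughly 99.5% null). A small sketch of how that check could be automated is shown here; `sparse_columns` and the 0.95 threshold are illustrative assumptions, and `demo` is a stand-in frame rather than the real data.

```python
import pandas as pd


def sparse_columns(df, max_null_fraction=0.95):
    """Return the names of columns whose share of nulls exceeds the threshold."""
    null_fraction = df.isna().mean()
    return null_fraction[null_fraction > max_null_fraction].index.tolist()


# Tiny stand-in frame; in the notebook this would be the freshly loaded data.
demo = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "availableCountries": [None, None, None, None],
    }
)
to_drop = sparse_columns(demo)
demo = demo.drop(columns=to_drop)
```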

In [8]:
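# Change releaseYear and imdbNumVotes to integers, then display only the data types
# Note: both columns contain NaN, which int64 cannot represent; the cast emits the warning below
# and turns missing values into -9223372036854775808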

apple[["releaseYear","imdbNumVotes"]] = apple[["releaseYear","imdbNumVotes"]].apply(np.int64)
display(apple.dtypes)

  arr = np.asarray(values, dtype=dtype)


title                 object
type                  object
genres                object
releaseYear            int64
imdbId                object
imdbAverageRating    float64
imdbNumVotes           int64
dtype: object
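
Because `releaseYear` and `imdbNumVotes` contain missing values, a plain `int64` cast cannot keep them as missing. One alternative is pandas' nullable integer dtype, `"Int64"` (capital I). This is a sketch on a stand-in frame named `demo`, not the cell that produced the output above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "releaseYear": [1995.0, 2023.0, np.nan],
        "imdbNumVotes": [113403.0, np.nan, 64.0],
    }
)

# "Int64" is the nullable integer dtype: whole numbers are stored as integers
# and missing values stay missing (<NA>) instead of becoming a sentinel.
demo = demo.astype({"releaseYear": "Int64", "imdbNumVotes": "Int64"})
```

With this dtype, `isna()` still reports the missing rows after the conversion.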

In [9]:
# Rename Columns
apple.rename(columns={"title":"Title","type":"Type","genres":"Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
# Add a column that identifies the streaming service
apple["Service Name"] = "AppleTV"
apple.columns

Index(['Title', 'Type', 'Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes', 'Service Name'],
      dtype='object')

In [10]:
# Determine how many titles contain null values
apple["Title"].isna().sum()

np.int64(565)

In [11]:
# Drop the rows containing null values in the Title column
apple = apple.dropna(subset=["Title"])
apple["Title"].isna().sum()


np.int64(0)

In [12]:
apple.duplicated().value_counts()

False    17641
True         2
Name: count, dtype: int64

In [13]:
apple[apple.duplicated(keep=False)]

Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
3497,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995,tt0112120,8.5,5196,AppleTV
13091,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023,,,-9223372036854775808,AppleTV
13093,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023,,,-9223372036854775808,AppleTV
13980,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995,tt0112120,8.5,5196,AppleTV


In [14]:
# drop_duplicates() returns a new DataFrame, so reassign to keep the result
apple = apple.drop_duplicates()
apple


Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
0,Four Rooms,movie,Comedy,1995,tt0113101,6.7,113403,AppleTV
1,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,AppleTV
2,American Beauty,movie,Drama,1999,tt0169547,8.3,1238903,AppleTV
3,Citizen Kane,movie,"Drama, Mystery",1941,tt0033467,8.3,477481,AppleTV
4,Metropolis,movie,"Drama, Sci-Fi",1927,tt0017136,8.3,192285,AppleTV
...,...,...,...,...,...,...,...,...
18188,Das Boot - Die komplette TV-Serie,tv,"Drama, War",1985,tt30970892,8.8,198,AppleTV
18194,The Amazon Review Killer,tv,Documentary,2024,tt33342317,6.4,64,AppleTV
18198,Conspirators,tv,Mystery,2025,tt35075008,5.5,44,AppleTV
18200,Extraordinary World,tv,Documentary,2025,tt34888944,,-9223372036854775808,AppleTV
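
`drop_duplicates()` returns a new DataFrame and leaves the original untouched unless the result is assigned. The sketch below illustrates the difference on a three-row stand-in frame (`demo`, `rows_before`, and `rows_after` are illustrative names).

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Title": ["Four Rooms", "The Gutsy Frog", "The Gutsy Frog"],
        "IMDb ID": ["tt0113101", "tt4658636", "tt4658636"],
    }
)

demo.drop_duplicates()  # result is discarded; `demo` still has 3 rows
rows_before = len(demo)

demo = demo.drop_duplicates()  # reassign to keep the de-duplicated frame
rows_after = len(demo)
```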


In [15]:
# Split Genres into one column per genre; splitting on ", " avoids leading spaces in Genre 2 onward
apple[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5","Genre 6"]] = apple["Genres"].str.split(", ",expand=True)

display(apple)



Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name,Genre 1,Genre 2,Genre 3,Genre 4,Genre 5,Genre 6
0,Four Rooms,movie,Comedy,1995,tt0113101,6.7,113403,AppleTV,Comedy,,,,,
1,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,AppleTV,Drama,Romance,,,,
2,American Beauty,movie,Drama,1999,tt0169547,8.3,1238903,AppleTV,Drama,,,,,
3,Citizen Kane,movie,"Drama, Mystery",1941,tt0033467,8.3,477481,AppleTV,Drama,Mystery,,,,
4,Metropolis,movie,"Drama, Sci-Fi",1927,tt0017136,8.3,192285,AppleTV,Drama,Sci-Fi,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18188,Das Boot - Die komplette TV-Serie,tv,"Drama, War",1985,tt30970892,8.8,198,AppleTV,Drama,War,,,,
18194,The Amazon Review Killer,tv,Documentary,2024,tt33342317,6.4,64,AppleTV,Documentary,,,,,
18198,Conspirators,tv,Mystery,2025,tt35075008,5.5,44,AppleTV,Mystery,,,,,
18200,Extraordinary World,tv,Documentary,2025,tt34888944,,-9223372036854775808,AppleTV,Documentary,,,,,
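
Wide `Genre 1` ... `Genre 6` columns are handy for display, but counting titles per genre is often easier in long form. Here is a sketch of that alternative using `str.split` followed by `explode`, on a stand-in frame; the names `genres_long` and `genre_counts` are assumptions for illustration.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Title": ["Four Rooms", "Forrest Gump", "Citizen Kane"],
        "Genres": ["Comedy", "Drama, Romance", "Drama, Mystery"],
    }
)

# One row per (title, genre) pair; the ", " separator avoids leading spaces.
genres_long = demo.assign(Genre=demo["Genres"].str.split(", ")).explode("Genre")

# Number of titles listed under each genre.
genre_counts = genres_long["Genre"].value_counts()
```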


### **Hulu**

In [16]:
# View top 5 rows in the hulu df
hulu.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986.0,tt0092149,7.5,7727.0,
2,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
3,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
4,My Life Without Me,movie,"Drama, Romance",2003.0,tt0314412,7.4,26140.0,


In [17]:
hulu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10203 entries, 0 to 10202
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               9552 non-null   object 
 1   type                10203 non-null  object 
 2   genres              9853 non-null   object 
 3   releaseYear         10166 non-null  float64
 4   imdbId              9138 non-null   object 
 5   imdbAverageRating   8830 non-null   float64
 6   imdbNumVotes        8830 non-null   float64
 7   availableCountries  44 non-null     object 
dtypes: float64(3), object(5)
memory usage: 637.8+ KB


In [18]:
hulu.describe()

Unnamed: 0,releaseYear,imdbAverageRating,imdbNumVotes
count,10166.0,8830.0,8830.0
mean,2011.744344,6.569921,35362.48
std,13.952201,1.055329,145658.7
min,1929.0,1.8,5.0
25%,2009.0,6.0,139.0
50%,2016.0,6.7,921.0
75%,2021.0,7.3,7385.75
max,2025.0,9.5,3002274.0


In [19]:
hulu.select_dtypes("object").describe()

Unnamed: 0,title,type,genres,imdbId,availableCountries
count,9552,10203,9853,9138,44
unique,9342,2,664,9134,3
top,Prey,movie,Drama,tt1945851,JP
freq,4,6019,1031,2,28


In [20]:
# Drop the availableCountries column: only 44 of 10,203 rows are non-null, so it adds no value
hulu.drop("availableCountries",axis=1,inplace=True)
hulu.columns

Index(['title', 'type', 'genres', 'releaseYear', 'imdbId', 'imdbAverageRating',
       'imdbNumVotes'],
      dtype='object')

In [21]:
# Change releaseYear and imdbNumVotes to integers, then display only the data types
# Note: both columns contain NaN, which int64 cannot represent, hence the warning below

hulu[["releaseYear","imdbNumVotes"]] = hulu[["releaseYear","imdbNumVotes"]].apply(np.int64)
display(hulu.dtypes)

  arr = np.asarray(values, dtype=dtype)


title                 object
type                  object
genres                object
releaseYear            int64
imdbId                object
imdbAverageRating    float64
imdbNumVotes           int64
dtype: object

In [22]:
# Rename Columns
hulu.rename(columns={"title":"Title","type":"Type","genres":"Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
# Add column that identifies the streaming service
hulu["Service Name"] = "Hulu"
hulu.columns

Index(['Title', 'Type', 'Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes', 'Service Name'],
      dtype='object')

In [23]:
# Determine how many titles contain null values
hulu["Title"].isna().sum()

np.int64(651)

In [24]:
# Drop the rows containing null values in the Title column
hulu = hulu.dropna(subset=["Title"])
hulu["Title"].isna().sum()

np.int64(0)

In [25]:
hulu.duplicated().value_counts()

False    9550
True        2
Name: count, dtype: int64

In [26]:
hulu[hulu.duplicated(keep=False)]

Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
6811,The Gutsy Frog,tv,Comedy,2015,tt4658636,7.1,17,Hulu
7193,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Hulu
7194,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Hulu
8646,The Gutsy Frog,tv,Comedy,2015,tt4658636,7.1,17,Hulu


In [27]:
# drop_duplicates() returns a new DataFrame, so reassign to keep the result
hulu = hulu.drop_duplicates()
hulu


Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
0,Ariel,movie,"Comedy, Crime, Romance",1988,tt0094675,7.4,8949,Hulu
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986,tt0092149,7.5,7727,Hulu
2,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,Hulu
3,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997,tt0119116,7.6,521815,Hulu
4,My Life Without Me,movie,"Drama, Romance",2003,tt0314412,7.4,26140,Hulu
...,...,...,...,...,...,...,...,...
10194,Scam Goddess,tv,"Crime, Talk-Show",2025,tt32821425,6.7,58,Hulu
10196,AQUARION Myth of Emotions,tv,"Action, Animation, Fantasy",2024,tt30644518,3.5,17,Hulu
10199,They Took Our Child: We Got Her Back,tv,"Crime, Documentary",2015,tt5056408,8.6,31,Hulu
10201,Cardcaptor Sakura: Clear Card Arc,tv,"Adventure, Animation, Comedy",2018,tt6279576,7.7,879,Hulu


In [None]:
# Obtain the maximum number of elements in the Genres column (the number of columns in the result)
hulu["Genres"].str.split(", ",expand=True)


Unnamed: 0,0,1,2,3,4,5,6
0,Comedy,Crime,Romance,,,,
1,Comedy,Drama,Music,,,,
2,Drama,Romance,,,,,
3,Action,Adventure,Sci-Fi,,,,
4,Drama,Romance,,,,,
...,...,...,...,...,...,...,...
10194,Crime,Talk-Show,,,,,
10196,Action,Animation,Fantasy,,,,
10199,Crime,Documentary,,,,,
10201,Adventure,Animation,Comedy,,,,
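
The expanded split above has seven columns (0 through 6), so the longest Hulu entry lists seven genres. Rather than typing the column names by hand, they could be generated from that width. This is a sketch on a stand-in frame; `split` and `max_genres` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Genres": ["Comedy, Crime, Romance", "Drama, Romance", None]}
)

split = demo["Genres"].str.split(", ", expand=True)
max_genres = split.shape[1]  # the widest row decides the number of columns

# Name the columns "Genre 1" ... "Genre N" instead of typing them out.
split.columns = [f"Genre {i}" for i in range(1, max_genres + 1)]
demo = demo.join(split)
```

The same lines would then work unchanged for a dataset whose longest entry has six genres or seven.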


In [35]:
# Split the Genres column into one new column per genre, up to the maximum of seven found above
hulu[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5","Genre 6","Genre 7"]] = hulu["Genres"].str.split(", ",expand=True)

apple.head(2)

Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name,Genre 1,Genre 2,Genre 3,Genre 4,Genre 5,Genre 6
0,Four Rooms,movie,Comedy,1995,tt0113101,6.7,113403,AppleTV,Comedy,,,,,
1,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,AppleTV,Drama,Romance,,,,


In [None]:
# View top 5 rows in the netflix df
netflix.head()

In [None]:
# Add a "Service Name" column to the netflix DataFrame with a default value for every row, then display the bottom two rows
netflix["Service Name"] = "Netflix"
netflix.tail(2)

In [None]:
# View top 5 rows of prime df
prime.head()

In [None]:
# Add a "Service Name" column to the prime DataFrame with a default value for every row, then display the bottom two rows
prime["Service Name"] = "Prime"
prime.tail(2)

# TODO: use a map or loop to create the "Service Name" column and populate it in each DataFrame

# Idea: build a dictionary keyed by DataFrame name and use each key to populate the column
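
The dictionary idea in the notes above might look like the following sketch. The `frames` dictionary and its stand-in contents are assumptions; in the notebook the values would be `apple`, `hulu`, `netflix`, and `prime`.

```python
import pandas as pd

# Stand-in frames; in the notebook these would be the four loaded datasets.
frames = {
    "AppleTV": pd.DataFrame({"title": ["Four Rooms"]}),
    "Hulu": pd.DataFrame({"title": ["Ariel", "Prey"]}),
}

# The dictionary key doubles as the label written into every row.
for service_name, df in frames.items():
    df["Service Name"] = service_name
```

This replaces one assignment per service with a single loop, and the label can never drift from the key it was loaded under.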

In [None]:
# Concatenate the DataFrames row-wise (stack them end to end); this only lines up if all four use the same column names
streaming = pd.concat([apple, hulu, netflix, prime], axis=0, ignore_index=True)
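
Row-wise concatenation only stacks columns whose names match exactly. At this point `apple` and `hulu` carry the renamed columns while `netflix` and `prime` still use the raw names, so they would need the same rename first. Here is a sketch of the effect on two stand-in frames; `RENAMES` is a shortened, illustrative version of the mapping used earlier.

```python
import pandas as pd

# Subset of the rename mapping applied to apple and hulu above.
RENAMES = {"title": "Title", "imdbId": "IMDb ID"}

cleaned = pd.DataFrame({"Title": ["Four Rooms"], "IMDb ID": ["tt0113101"]})
raw = pd.DataFrame({"title": ["Ariel"], "imdbId": ["tt0094675"]})

# Without renaming, concat keeps both spellings and fills the gaps with NaN.
mismatched = pd.concat([cleaned, raw], ignore_index=True)

# Renaming first lines the columns up.
aligned = pd.concat([cleaned, raw.rename(columns=RENAMES)], ignore_index=True)
```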


In [None]:
streaming.info()

In [None]:
streaming.describe()

In [None]:
streaming.duplicated().value_counts()

In [None]:
streaming[streaming.duplicated(keep=False)]

In [None]:
streaming["title"].isnull().value_counts()


In [None]:
streaming["imdbId"].isnull().value_counts()

In [None]:
# # Assign dataset names & combine to read in as separate data frames

# list_of_names = ['Netflix','Hulu','Prime','AppleTV']

# # Created an empty list into which I can place the datasets
# combined_list = []

# # Loop over the names and append each dataset to the empty list created above
# for i in range(len(list_of_names)):
#     temp_df = pd.read_csv(list_of_names[i]+".csv")
#     combined_list.append(temp_df)

### **Initial Checks**

In [None]:
# # Looking at file content of list in index 0 - Netflix - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[0].info()
# combined_list[0].head(5)



In [None]:
# # Looking at file content of list in index 1 - Hulu - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[1].info()
# combined_list[1].head(5)

In [None]:
# # Looking at file content of list in index 2 - Prime - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[2].info()
# combined_list[2].head(5)

In [None]:
# # Looking at file content of list in index 3 - AppleTV - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[3].info()
# combined_list[3].head(5)

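### Initial Review (02/10/2025)
- The "availableCountries" column is almost entirely null (for example, only 84 of 18,208 AppleTV rows and 44 of 10,203 Hulu rows are non-null), so it will not be useful going forward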
- No dataset has a column for the source of the data, so this will need to be added

### Secondary Review (02/12/2025)
- Each dataset contains an "imdbId" column with the listing's IMDb ID, which is a string; this may come in handy 
- There are 3,514 missing titles
    - Prime = 2234
    - AppleTV = 651
    - Hulu = 2885
    - Netflix = 629
- There are 2,590 duplicated rows; many of them are rows with missing titles, though several are duplicated titles
    - Apple = 2
    - Hulu = 651


### Breakdown:
#### Apple:
- 565 Missing Titles
- 2 Duplicated Rows
#### Hulu:
- 651 Missing Titles
- 2 Duplicated Rows
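
The per-service figures in this breakdown could be produced in one pass instead of being copied from individual cells. The following is a sketch, assuming each frame has a `Title` column; `quality_summary` is a hypothetical helper and `demo` is stand-in data. In the notebook the missing-title counts were taken before the null rows were dropped, so the helper would need to run at that stage to reproduce them.

```python
import pandas as pd


def quality_summary(frames, title_col="Title"):
    """Count missing titles and fully duplicated rows for each service."""
    rows = {
        name: {
            "Missing Titles": int(df[title_col].isna().sum()),
            "Duplicated Rows": int(df.duplicated().sum()),
        }
        for name, df in frames.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Stand-in data; pass the real frames, e.g. {"AppleTV": apple, "Hulu": hulu}.
demo = {
    "AppleTV": pd.DataFrame({"Title": ["A", None, "B", "B"]}),
    "Hulu": pd.DataFrame({"Title": ["C", "D"]}),
}
summary = quality_summary(demo)
```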