# <u>**Streaming Service Comparison**</u>

### **Objective:**
- Determine which streaming platform hosts the majority of content I enjoy so that I can pare down the services to which I subscribe. 

### **Data Sources:**
- [Netflix via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
- [Hulu via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
- [Prime via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset/data)
- [AppleTV via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)

## <u>**Data Collection & Loading**</u>


### **Import glob, pandas, NumPy, Matplotlib, wordcloud, and PIL**

In [36]:
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

### **Data Load**

In [37]:
# Load each file separately.
# Plan: replace these calls with a single loader function.
apple = pd.read_csv("AppleTV.csv")
hulu = pd.read_csv("Hulu.csv")
netflix = pd.read_csv("Netflix.csv")
prime = pd.read_csv("Prime.csv")
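
The comment above plans a function for this step. A minimal sketch of what that could look like, assuming every CSV is named after its service (as in the calls above) and lives in one folder. The names `load_service_frames` and `data_dir` are hypothetical, and the demo writes tiny placeholder files to a temporary folder so the block runs without the Kaggle downloads.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_service_frames(service_names, data_dir="."):
    """Read <data_dir>/<name>.csv for each name and return {name: DataFrame}."""
    frames = {}
    for name in service_names:
        frames[name] = pd.read_csv(Path(data_dir) / f"{name}.csv")
    return frames


# Demo with placeholder files (not the real Kaggle data).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["AppleTV", "Hulu"]:
        pd.DataFrame({"title": [f"{name} sample"], "type": ["movie"]}).to_csv(
            Path(tmp) / f"{name}.csv", index=False
        )
    demo_frames = load_service_frames(["AppleTV", "Hulu"], data_dir=tmp)

print({name: df.shape for name, df in demo_frames.items()})
```

With the real files in the working directory, `load_service_frames(["AppleTV", "Hulu", "Netflix", "Prime"])` would stand in for the four separate `pd.read_csv` calls.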

In [38]:
# View top 5 rows in the apple df
apple.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113403.0,
1,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
2,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1238903.0,
3,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,477481.0,
4,Metropolis,movie,"Drama, Sci-Fi",1927.0,tt0017136,8.3,192285.0,


In [41]:
# Add a serviceName column to the apple DataFrame with a default value for every row, then display the bottom two rows
apple["serviceName"] = "AppleTV"
apple.tail(2)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
18206,,tv,Comedy,2018.0,,,,,AppleTV
18207,,tv,Documentary,2021.0,,,,,AppleTV


In [42]:
# View top 5 rows in the hulu df
hulu.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986.0,tt0092149,7.5,7727.0,
2,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
3,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
4,My Life Without Me,movie,"Drama, Romance",2003.0,tt0314412,7.4,26140.0,


In [None]:
# Add a serviceName column to the hulu DataFrame with a default value for every row, then display the bottom two rows
hulu["serviceName"] = "Hulu"
hulu.tail(2)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
10201,Cardcaptor Sakura: Clear Card Arc,tv,"Adventure, Animation, Comedy",2018.0,tt6279576,7.7,879.0,,Hulu
10202,Faces of Music,tv,Documentary,2025.0,tt35460841,,,,Hulu


In [None]:
# View top 5 rows in the netflix df
netflix.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1238903.0,
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003.0,tt0266697,8.2,1236086.0,
3,Jarhead,movie,"Biography, Drama, War",2005.0,tt0418763,7.0,213667.0,
4,Unforgiven,movie,"Drama, Western",1992.0,tt0105695,8.2,448833.0,


In [None]:
# Add a serviceName column to the netflix DataFrame with a default value for every row, then display the bottom two rows
netflix["serviceName"] = "Netflix"
netflix.tail(2)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
20734,American Manhunt: O.J. Simpson,tv,Documentary,2025.0,tt35456246,7.4,1148.0,,Netflix
20735,Devil's Diner,tv,"Drama, Horror",2025.0,tt35557166,7.2,289.0,,Netflix


In [None]:
# View top 5 rows of prime df
prime.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113403.0,
2,Judgment Night,movie,"Action, Crime, Drama",1993.0,tt0107286,6.6,19627.0,
3,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
4,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,477481.0,


In [None]:
# Add a serviceName column to the prime DataFrame with a default value for every row, then display the bottom two rows
prime["serviceName"] = "Prime"
prime.tail(2)

# TODO: use a mapping to create the serviceName column and fill every row with the DataFrame's name

# Idea: build a dictionary keyed by service name and use each key to populate the column

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
70053,Ryan Trahan,tv,Reality-TV,2019.0,tt28487606,8.2,42.0,,Prime
70054,Sivarapalli,tv,Comedy,2025.0,tt31914057,7.4,120.0,,Prime
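
The dictionary idea noted in the cell above could replace the four near-identical labeling cells. This is only a sketch on placeholder frames: `frames` and `labeled` are hypothetical names, and the two one-row DataFrames stand in for the real datasets.

```python
import pandas as pd

# Placeholder frames standing in for the loaded datasets.
frames = {
    "AppleTV": pd.DataFrame({"title": ["Four Rooms"], "type": ["movie"]}),
    "Hulu": pd.DataFrame({"title": ["Ariel"], "type": ["movie"]}),
}

# The dictionary key doubles as the label, so one pass tags every frame.
# assign() returns a new DataFrame and leaves the original untouched.
labeled = {name: df.assign(serviceName=name) for name, df in frames.items()}

print(labeled["Hulu"])
```

Using `assign` rather than in-place column assignment keeps the raw frames unchanged, which can help when a cell is re-run.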


In [None]:
# Concatenate the DataFrames row-wise (stack them end to end), since they all share the same column names
streaming = pd.concat([apple, hulu, netflix, prime], axis = 0)


In [43]:
streaming.info()

<class 'pandas.core.frame.DataFrame'>
Index: 119202 entries, 0 to 70054
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   title               115688 non-null  object 
 1   type                119202 non-null  object 
 2   genres              115732 non-null  object 
 3   releaseYear         118915 non-null  float64
 4   imdbId              108751 non-null  object 
 5   imdbAverageRating   105509 non-null  float64
 6   imdbNumVotes        105509 non-null  float64
 7   availableCountries  668 non-null     object 
 8   serviceName         119202 non-null  object 
dtypes: float64(3), object(6)
memory usage: 9.1+ MB
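
The `Index: 119202 entries, 0 to 70054` line above shows that the combined frame kept each dataset's original row labels, so the same label appears more than once. A small sketch of the difference `ignore_index=True` makes, using two placeholder frames (`a` and `b` are illustrative, not the real data):

```python
import pandas as pd

a = pd.DataFrame({"title": ["Four Rooms", "Forrest Gump"], "serviceName": "AppleTV"})
b = pd.DataFrame({"title": ["Ariel"], "serviceName": "Hulu"})

# Default behaviour: each frame's own labels are carried over.
kept = pd.concat([a, b], axis=0)

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for `info()` or `describe()`, but label-based lookups such as `streaming.loc[0]` would return one row per dataset, so a unique index may be worth considering before filtering.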


In [21]:
streaming.describe()

Unnamed: 0,releaseYear,imdbAverageRating,imdbNumVotes
count,118915.0,105509.0,105509.0
mean,2008.779733,6.152533,20324.8
std,17.906711,1.290906,96922.97
min,1902.0,1.0,5.0
25%,2005.0,5.4,137.0
50%,2016.0,6.3,682.0
75%,2020.0,7.1,4046.0
max,2026.0,10.0,3002274.0


In [22]:
streaming.duplicated().value_counts()

False    117004
True       2198
Name: count, dtype: int64

In [24]:
streaming[streaming.duplicated(keep=False)]

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,serviceName
3497,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995.0,tt0112120,8.5,5196.0,GB,AppleTV
13091,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023.0,,,,,AppleTV
13093,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023.0,,,,,AppleTV
13980,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995.0,tt0112120,8.5,5196.0,GB,AppleTV
14499,,tv,,2007.0,,,,,AppleTV
...,...,...,...,...,...,...,...,...,...
70037,,tv,,2021.0,,,,,Prime
70040,,tv,,,,,,,Prime
70041,,tv,Documentary,2025.0,,,,,Prime
70047,,tv,,2024.0,,,,,Prime
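
The two calls above count different things: `duplicated()` flags only the repeats after the first occurrence, while `duplicated(keep=False)` flags every member of a duplicate group. Rows with missing values can also match each other, because pandas treats missing values in the same position as equal when comparing rows. A toy illustration (the `toy` frame is made up for the example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Ariel", "Ariel", np.nan, np.nan, "Jarhead"],
        "type": ["movie", "movie", "tv", "tv", "movie"],
        "serviceName": ["Hulu", "Hulu", "Prime", "Prime", "Netflix"],
    }
)

repeats_only = toy.duplicated()            # default keep="first"
all_members = toy.duplicated(keep=False)   # every row in a duplicate group

print(repeats_only.sum(), all_members.sum())

# drop_duplicates() removes exactly the rows flagged by duplicated().
deduped = toy.drop_duplicates()
print(len(deduped))
```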


In [34]:
streaming["title"].isnull().value_counts()


title
False    115688
True       3514
Name: count, dtype: int64

In [35]:
streaming["imdbId"].isnull().value_counts()

imdbId
False    108751
True      10451
Name: count, dtype: int64

In [None]:
# # Assign dataset names & combine to read in as separate data frames

# list_of_names = ['Netflix','Hulu','Prime','AppleTV']

# # Created an empty list into which I can place the datasets
# combined_list = []

# # Used a for loop to read each dataset and append it to the empty list created above
# for i in range(len(list_of_names)):
#     temp_df = pd.read_csv(list_of_names[i]+".csv")
#     combined_list.append(temp_df)

### **Initial Checks**

In [None]:
# # Looking at file content of list in index 0 - Netflix - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[0].info()
# combined_list[0].head(5)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20736 entries, 0 to 20735
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               20107 non-null  object 
 1   type                20736 non-null  object 
 2   genres              20391 non-null  object 
 3   releaseYear         20702 non-null  float64
 4   imdbId              19238 non-null  object 
 5   imdbAverageRating   19049 non-null  float64
 6   imdbNumVotes        19049 non-null  float64
 7   availableCountries  167 non-null    object 
dtypes: float64(3), object(5)
memory usage: 1.3+ MB


Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1238903.0,
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003.0,tt0266697,8.2,1236086.0,
3,Jarhead,movie,"Biography, Drama, War",2005.0,tt0418763,7.0,213667.0,
4,Unforgiven,movie,"Drama, Western",1992.0,tt0105695,8.2,448833.0,


In [None]:
# # Looking at file content of list in index 1 - Hulu - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[1].info()
# combined_list[1].head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10203 entries, 0 to 10202
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               9552 non-null   object 
 1   type                10203 non-null  object 
 2   genres              9853 non-null   object 
 3   releaseYear         10166 non-null  float64
 4   imdbId              9138 non-null   object 
 5   imdbAverageRating   8830 non-null   float64
 6   imdbNumVotes        8830 non-null   float64
 7   availableCountries  44 non-null     object 
dtypes: float64(3), object(5)
memory usage: 637.8+ KB


Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986.0,tt0092149,7.5,7727.0,
2,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
3,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
4,My Life Without Me,movie,"Drama, Romance",2003.0,tt0314412,7.4,26140.0,


In [None]:
# # Looking at file content of list in index 2 - Prime - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[2].info()
# combined_list[2].head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70055 entries, 0 to 70054
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               68386 non-null  object 
 1   type                70055 non-null  object 
 2   genres              67939 non-null  object 
 3   releaseYear         69868 non-null  float64
 4   imdbId              63648 non-null  object 
 5   imdbAverageRating   61316 non-null  float64
 6   imdbNumVotes        61316 non-null  float64
 7   availableCountries  373 non-null    object 
dtypes: float64(3), object(5)
memory usage: 4.3+ MB


Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113403.0,
2,Judgment Night,movie,"Action, Crime, Drama",1993.0,tt0107286,6.6,19627.0,
3,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
4,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,477481.0,


In [None]:
# # Looking at file content of list in index 3 - AppleTV - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[3].info()
# combined_list[3].head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18208 entries, 0 to 18207
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               17643 non-null  object 
 1   type                18208 non-null  object 
 2   genres              17549 non-null  object 
 3   releaseYear         18179 non-null  float64
 4   imdbId              16727 non-null  object 
 5   imdbAverageRating   16314 non-null  float64
 6   imdbNumVotes        16314 non-null  float64
 7   availableCountries  84 non-null     object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113403.0,
1,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
2,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1238903.0,
3,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,477481.0,
4,Metropolis,movie,"Drama, Sci-Fi",1927.0,tt0017136,8.3,192285.0,


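### Initial Review (02/10/2025)
- The "availableCountries" column is almost entirely empty (668 non-null values out of 119,202 rows across the four datasets), so it will add little to the analysis going forward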
- No dataset has a column for the source of the data, so this will need to be added
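
One way to back up the first observation is to measure how populated each column is before deciding what to drop. A rough sketch on made-up data; the 0.5 cutoff and the names `non_null_share`, `sparse_cols`, and `trimmed` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Ariel", "Four Rooms", "Jarhead", "Unforgiven"],
        "availableCountries": [np.nan, np.nan, "GB", np.nan],
    }
)

# Share of rows holding a value in each column (1.0 means fully populated).
non_null_share = toy.notna().mean()
print(non_null_share)

# Hypothetical cutoff: drop columns populated in fewer than half of the rows.
sparse_cols = non_null_share[non_null_share < 0.5].index
trimmed = toy.drop(columns=sparse_cols)
print(trimmed.columns.tolist())
```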

### Secondary Review (02/12/2025)
- Each dataset contains an "imdbId" column holding the listing's IMDb ID as a string; this may come in handy as a key for matching the same title across services
- There are 3,514 missing titles
    - Prime = 1,669
    - AppleTV = 565
    - Hulu = 651
    - Netflix = 629
- There are 2,198 duplicated rows (per `streaming.duplicated()`); many of them come from rows with missing titles, though several are repeats of the same title
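
A per-service breakdown of missing titles like the one listed above can be computed in one step once the "serviceName" column exists. A minimal sketch on made-up rows (`toy` and `missing_by_service` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Ariel", np.nan, np.nan, "Jarhead", np.nan],
        "serviceName": ["Hulu", "Hulu", "Prime", "Netflix", "Prime"],
    }
)

# isna() gives True for missing titles; summing within each service counts them.
missing_by_service = toy["title"].isna().groupby(toy["serviceName"]).sum()
print(missing_by_service)
```

On the combined frame, the same expression with `streaming` in place of `toy` should yield per-service counts that add up to the overall total from `streaming["title"].isnull().value_counts()`.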