# <u>**Streaming Service Comparison**</u>

### **Objective:**
- Determine which streaming platform hosts the majority of content I enjoy so that I can pare down the services to which I subscribe. 

### **Data Sources:**
- [Netflix via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
- [Hulu via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
- [Prime via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset/data)
- [AppleTV via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)

## **Data Collection & Loading**

### **Import pandas, NumPy, Matplotlib, WordCloud, and PIL**

In [1]:
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

### **Data Load**

In [2]:
# Load each service's CSV file into its own DataFrame.
# TODO: decide whether to wrap this in a reusable loading function.
apple = pd.read_csv("AppleTV.csv")
hulu = pd.read_csv("Hulu.csv")
netflix = pd.read_csv("Netflix.csv")
prime = pd.read_csv("Prime.csv")
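
The four `read_csv` calls could be wrapped in a small helper, as the comment in the cell considers. The sketch below is one possible shape rather than the notebook's actual code: `load_services` and `demo_sources` are hypothetical names, and in-memory CSV text stands in for the real files so the example runs on its own.

```python
import io

import pandas as pd


def load_services(sources):
    """Read each CSV source and return a {service name: DataFrame} dict.

    `sources` maps a service name to anything pd.read_csv accepts
    (a file path such as "AppleTV.csv" or a file-like object).
    """
    return {name: pd.read_csv(src) for name, src in sources.items()}


# Demo input: in-memory CSV text instead of the real files (assumption for illustration).
demo_sources = {
    "AppleTV": io.StringIO("title,type\nFour Rooms,movie\n"),
    "Hulu": io.StringIO("title,type\nAriel,movie\n"),
}

frames = load_services(demo_sources)
print({name: df.shape for name, df in frames.items()})
```

With the real data, the dictionary values would be the file names used above, for example `{"AppleTV": "AppleTV.csv", "Hulu": "Hulu.csv"}`.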

### **Apple**   

In [3]:
# Preview the first three rows of each DataFrame
print(apple.head(3))
print(hulu.head(3))
print(netflix.head(3))
print(prime.head(3))

             title   type          genres  releaseYear     imdbId  \
0       Four Rooms  movie          Comedy       1995.0  tt0113101   
1     Forrest Gump  movie  Drama, Romance       1994.0  tt0109830   
2  American Beauty  movie           Drama       1999.0  tt0169547   

   imdbAverageRating  imdbNumVotes availableCountries  
0                6.7      113403.0                NaN  
1                8.8     2348885.0                NaN  
2                8.3     1238903.0                NaN  
                 title   type                  genres  releaseYear     imdbId  \
0                Ariel  movie  Comedy, Crime, Romance       1988.0  tt0094675   
1  Shadows in Paradise  movie    Comedy, Drama, Music       1986.0  tt0092149   
2         Forrest Gump  movie          Drama, Romance       1994.0  tt0109830   

   imdbAverageRating  imdbNumVotes availableCountries  
0                7.4        8949.0                NaN  
1                7.5        7727.0                NaN  
2     

In [4]:
# Display information about the dataset (i.e., row counts, column counts, column names, data types, and number of non-null rows)
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18208 entries, 0 to 18207
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               17643 non-null  object 
 1   type                18208 non-null  object 
 2   genres              17549 non-null  object 
 3   releaseYear         18179 non-null  float64
 4   imdbId              16727 non-null  object 
 5   imdbAverageRating   16314 non-null  float64
 6   imdbNumVotes        16314 non-null  float64
 7   availableCountries  84 non-null     object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


In [5]:
apple.describe()

Unnamed: 0,releaseYear,imdbAverageRating,imdbNumVotes
count,18179.0,16314.0,16314.0
mean,2006.822432,6.381574,25548.55
std,18.41046,1.162677,100828.8
min,1902.0,1.3,5.0
25%,2001.0,5.7,201.0
50%,2014.0,6.5,1234.0
75%,2020.0,7.2,8106.5
max,2025.0,9.6,2348885.0


In [6]:
apple.select_dtypes("object").describe()

Unnamed: 0,title,type,genres,imdbId,availableCountries
count,17643,18208,17549,16727,84
unique,16986,2,798,16724,14
top,Teenage Mutant Ninja Turtles,movie,Drama,tt0112120,US
freq,5,14005,1590,2,42


In [7]:
# Drop the availableCountries column: only 84 of 18,208 rows are non-null, so it adds little value
apple.drop("availableCountries",axis=1,inplace=True)
apple.columns

Index(['title', 'type', 'genres', 'releaseYear', 'imdbId', 'imdbAverageRating',
       'imdbNumVotes'],
      dtype='object')

In [8]:
# Convert releaseYear and imdbNumVotes to integers, then display only the data types
# Caveat: int64 cannot hold NaN, so missing values become -9223372036854775808

apple[["releaseYear","imdbNumVotes"]] = apple[["releaseYear","imdbNumVotes"]].apply(np.int64)
display(apple.dtypes)

  arr = np.asarray(values, dtype=dtype)


title                 object
type                  object
genres                object
releaseYear            int64
imdbId                object
imdbAverageRating    float64
imdbNumVotes           int64
dtype: object
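
The cast above runs on columns that still contain missing values, and the `-9223372036854775808` entries that appear in later outputs are what `NaN` turns into under a plain `int64` cast. A minimal sketch of one alternative, assuming pandas' nullable `Int64` dtype is acceptable for the rest of the analysis, keeps the gaps as `<NA>` instead. The small `df` below is demo data, not the real dataset.

```python
import numpy as np
import pandas as pd

# Demo frame with one missing value per column (illustrative data only).
df = pd.DataFrame(
    {
        "releaseYear": [1995.0, np.nan],
        "imdbNumVotes": [113403.0, np.nan],
    }
)

# "Int64" (capital I) is pandas' nullable integer dtype: NaN becomes <NA>
# rather than being cast to the smallest representable int64.
converted = df[["releaseYear", "imdbNumVotes"]].astype("Int64")

print(converted.dtypes)
print(converted)
```

Summary statistics such as `describe()` skip `<NA>` values, so the missing entries would no longer distort means and minimums.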

In [9]:
# Rename Columns
apple.rename(columns={"title":"Title","type":"Type","genres":"Combined Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
# Add a column that identifies the streaming service
apple["Service Name"] = "AppleTV"
apple.columns

Index(['Title', 'Type', 'Combined Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes', 'Service Name'],
      dtype='object')

In [10]:
# Determine how many titles contain null values
apple["Title"].isna().sum()

np.int64(565)

In [11]:
# Drop the rows containing null values in the Title column
apple = apple.dropna(subset=["Title"])
apple["Title"].isna().sum()


np.int64(0)

In [12]:
apple.duplicated().value_counts()

False    17641
True         2
Name: count, dtype: int64

In [13]:
apple[apple.duplicated(keep=False)]

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
3497,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995,tt0112120,8.5,5196,AppleTV
13091,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023,,,-9223372036854775808,AppleTV
13093,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023,,,-9223372036854775808,AppleTV
13980,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995,tt0112120,8.5,5196,AppleTV


In [14]:
# Preview the DataFrame without exact duplicate rows (the result is not assigned, so apple is unchanged)
apple.drop_duplicates()


Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
0,Four Rooms,movie,Comedy,1995,tt0113101,6.7,113403,AppleTV
1,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,AppleTV
2,American Beauty,movie,Drama,1999,tt0169547,8.3,1238903,AppleTV
3,Citizen Kane,movie,"Drama, Mystery",1941,tt0033467,8.3,477481,AppleTV
4,Metropolis,movie,"Drama, Sci-Fi",1927,tt0017136,8.3,192285,AppleTV
...,...,...,...,...,...,...,...,...
18188,Das Boot - Die komplette TV-Serie,tv,"Drama, War",1985,tt30970892,8.8,198,AppleTV
18194,The Amazon Review Killer,tv,Documentary,2024,tt33342317,6.4,64,AppleTV
18198,Conspirators,tv,Mystery,2025,tt35075008,5.5,44,AppleTV
18200,Extraordinary World,tv,Documentary,2025,tt34888944,,-9223372036854775808,AppleTV


In [15]:
apple[apple.duplicated(subset=["IMDb ID","Title","Release Year"], keep=False)]


Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
3497,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995,tt0112120,8.5,5196,AppleTV
10658,Berlin 1945,movie,"Documentary, History, War",2020,tt12264166,8.2,352,AppleTV
12065,Last Call,movie,Documentary,2021,,,-9223372036854775808,AppleTV
12188,Last Call,movie,,2021,,,-9223372036854775808,AppleTV
13091,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023,,,-9223372036854775808,AppleTV
13093,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023,,,-9223372036854775808,AppleTV
13337,De olhos abertos,movie,Animation,2023,,,-9223372036854775808,AppleTV
13564,De olhos abertos,movie,"Drama, War",2023,,,-9223372036854775808,AppleTV
13748,The Life and Murder of Nicole Brown Simpson,movie,"Biography, Crime, Documentary",2024,tt32267612,7.4,191,AppleTV
13980,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995,tt0112120,8.5,5196,AppleTV


In [16]:
apple = apple.drop_duplicates(subset=["IMDb ID","Title","Release Year"], keep="first")
rows = apple[apple["Title"] == "The Life and Murder of Nicole Brown Simpson"]
display(rows)

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
13748,The Life and Murder of Nicole Brown Simpson,movie,"Biography, Crime, Documentary",2024,tt32267612,7.4,191,AppleTV
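
The two `duplicated` calls above differ only in `keep`: `keep=False` flags every member of a duplicate group for inspection, while `drop_duplicates(..., keep="first")` retains the first row of each group. Missing `IMDb ID` values are treated as equal to each other, which is why rows such as "Last Call" match even though their genres differ. The sketch below illustrates this on a three-row demo frame modeled on the output above.

```python
import numpy as np
import pandas as pd

# Demo rows modeled on the output above (two rows share Title and
# Release Year and both lack an IMDb ID).
df = pd.DataFrame(
    {
        "Title": ["Last Call", "Last Call", "Berlin 1945"],
        "Release Year": [2021, 2021, 2020],
        "IMDb ID": [np.nan, np.nan, "tt12264166"],
        "Combined Genres": ["Documentary", None, "Documentary, History, War"],
    }
)

key = ["IMDb ID", "Title", "Release Year"]

# keep=False marks every row that belongs to a duplicate group.
flagged = df.duplicated(subset=key, keep=False)

# keep="first" retains the first row of each group and drops the rest.
deduped = df.drop_duplicates(subset=key, keep="first")

print(flagged.tolist())
print(deduped)
```

Because only the first row survives, the genres recorded on a dropped row are lost; that may matter when the duplicates disagree, as they do for "De olhos abertos".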


In [17]:
# Split Combined Genres into one column per genre; the regex also drops the space after each comma
apple[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5","Genre 6"]] = apple["Combined Genres"].str.split(r",\s*", expand=True, regex=True)

display(apple)



Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name,Genre 1,Genre 2,Genre 3,Genre 4,Genre 5,Genre 6
0,Four Rooms,movie,Comedy,1995,tt0113101,6.7,113403,AppleTV,Comedy,,,,,
1,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,AppleTV,Drama,Romance,,,,
2,American Beauty,movie,Drama,1999,tt0169547,8.3,1238903,AppleTV,Drama,,,,,
3,Citizen Kane,movie,"Drama, Mystery",1941,tt0033467,8.3,477481,AppleTV,Drama,Mystery,,,,
4,Metropolis,movie,"Drama, Sci-Fi",1927,tt0017136,8.3,192285,AppleTV,Drama,Sci-Fi,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18188,Das Boot - Die komplette TV-Serie,tv,"Drama, War",1985,tt30970892,8.8,198,AppleTV,Drama,War,,,,
18194,The Amazon Review Killer,tv,Documentary,2024,tt33342317,6.4,64,AppleTV,Documentary,,,,,
18198,Conspirators,tv,Mystery,2025,tt35075008,5.5,44,AppleTV,Mystery,,,,,
18200,Extraordinary World,tv,Documentary,2025,tt34888944,,-9223372036854775808,AppleTV,Documentary,,,,,
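
Fixed `Genre N` columns require knowing the maximum number of genres per title in advance, and that number differs between services. A minimal alternative sketch, assuming a long format (one row per title and genre) suits the later analysis, uses `explode`. The frame below is demo data, and `long` and `counts` are hypothetical names.

```python
import pandas as pd

# Demo data: the last row has no genres at all.
df = pd.DataFrame(
    {
        "Title": ["Forrest Gump", "Four Rooms", "American Beauty", "Untitled"],
        "Combined Genres": ["Drama, Romance", "Comedy", "Drama", None],
    }
)

# Split on ", " so no genre keeps a leading space, then emit one row per genre.
long = (
    df.assign(Genre=df["Combined Genres"].str.split(", "))
    .explode("Genre")
    .dropna(subset=["Genre"])
)

# Counting genres is now a single value_counts call.
counts = long["Genre"].value_counts()
print(counts)
```

This form avoids hard-coding the number of genre columns and makes per-genre counts straightforward.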


### **Hulu**

In [18]:
# View top 5 rows in the hulu df
hulu.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986.0,tt0092149,7.5,7727.0,
2,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
3,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
4,My Life Without Me,movie,"Drama, Romance",2003.0,tt0314412,7.4,26140.0,


In [19]:
hulu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10203 entries, 0 to 10202
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               9552 non-null   object 
 1   type                10203 non-null  object 
 2   genres              9853 non-null   object 
 3   releaseYear         10166 non-null  float64
 4   imdbId              9138 non-null   object 
 5   imdbAverageRating   8830 non-null   float64
 6   imdbNumVotes        8830 non-null   float64
 7   availableCountries  44 non-null     object 
dtypes: float64(3), object(5)
memory usage: 637.8+ KB


In [20]:
hulu.describe()

Unnamed: 0,releaseYear,imdbAverageRating,imdbNumVotes
count,10166.0,8830.0,8830.0
mean,2011.744344,6.569921,35362.48
std,13.952201,1.055329,145658.7
min,1929.0,1.8,5.0
25%,2009.0,6.0,139.0
50%,2016.0,6.7,921.0
75%,2021.0,7.3,7385.75
max,2025.0,9.5,3002274.0


In [21]:
hulu.select_dtypes("object").describe()

Unnamed: 0,title,type,genres,imdbId,availableCountries
count,9552,10203,9853,9138,44
unique,9342,2,664,9134,3
top,Prey,movie,Drama,tt1945851,JP
freq,4,6019,1031,2,28


In [22]:
# Drop the availableCountries column: only 44 of 10,203 rows are non-null, so it adds little value
hulu.drop("availableCountries",axis=1,inplace=True)
hulu.columns

Index(['title', 'type', 'genres', 'releaseYear', 'imdbId', 'imdbAverageRating',
       'imdbNumVotes'],
      dtype='object')

In [23]:
# Convert releaseYear and imdbNumVotes to integers, then display only the data types
# Caveat: int64 cannot hold NaN, so missing values become -9223372036854775808

hulu[["releaseYear","imdbNumVotes"]] = hulu[["releaseYear","imdbNumVotes"]].apply(np.int64)
display(hulu.dtypes)

  arr = np.asarray(values, dtype=dtype)


title                 object
type                  object
genres                object
releaseYear            int64
imdbId                object
imdbAverageRating    float64
imdbNumVotes           int64
dtype: object

In [24]:
# Rename Columns
hulu.rename(columns={"title":"Title","type":"Type","genres":"Combined Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
# Add column that identifies the streaming service
hulu["Service Name"] = "Hulu"
hulu.columns

Index(['Title', 'Type', 'Combined Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes', 'Service Name'],
      dtype='object')

In [25]:
# Determine how many titles contain null values
hulu["Title"].isna().sum()

np.int64(651)

In [26]:
# Drop the rows containing null values in the Title column
hulu = hulu.dropna(subset=["Title"])
hulu["Title"].isna().sum()

np.int64(0)

In [27]:
hulu.duplicated().value_counts()

False    9550
True        2
Name: count, dtype: int64

In [28]:
hulu[hulu.duplicated(keep=False)]

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
6811,The Gutsy Frog,tv,Comedy,2015,tt4658636,7.1,17,Hulu
7193,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Hulu
7194,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Hulu
8646,The Gutsy Frog,tv,Comedy,2015,tt4658636,7.1,17,Hulu


In [29]:
hulu.drop_duplicates()


Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
0,Ariel,movie,"Comedy, Crime, Romance",1988,tt0094675,7.4,8949,Hulu
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986,tt0092149,7.5,7727,Hulu
2,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,Hulu
3,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997,tt0119116,7.6,521815,Hulu
4,My Life Without Me,movie,"Drama, Romance",2003,tt0314412,7.4,26140,Hulu
...,...,...,...,...,...,...,...,...
10194,Scam Goddess,tv,"Crime, Talk-Show",2025,tt32821425,6.7,58,Hulu
10196,AQUARION Myth of Emotions,tv,"Action, Animation, Fantasy",2024,tt30644518,3.5,17,Hulu
10199,They Took Our Child: We Got Her Back,tv,"Crime, Documentary",2015,tt5056408,8.6,31,Hulu
10201,Cardcaptor Sakura: Clear Card Arc,tv,"Adventure, Animation, Comedy",2018,tt6279576,7.7,879,Hulu


In [30]:
hulu[hulu.duplicated(subset=["IMDb ID","Title","Release Year"], keep=False)]


Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
2608,Panzer World Galient,movie,"Action, Animation, Fantasy",1984,tt1945851,7.4,28,Hulu
5480,Shylock's Children,movie,"Crime, Drama",2023,tt21944604,6.4,127,Hulu
6811,The Gutsy Frog,tv,Comedy,2015,tt4658636,7.1,17,Hulu
7193,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Hulu
7194,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Hulu
8110,Panzer World Galient,tv,"Action, Animation, Fantasy",1984,tt1945851,7.4,28,Hulu
8646,The Gutsy Frog,tv,Comedy,2015,tt4658636,7.1,17,Hulu
9339,Shylock's Children,tv,"Crime, Drama",2023,tt21944604,6.4,127,Hulu


In [31]:
hulu = hulu.drop_duplicates(subset=["IMDb ID","Title","Release Year"], keep="first")
rows = hulu[hulu["Title"] == "Panzer World Galient"]
display(rows)

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
2608,Panzer World Galient,movie,"Action, Animation, Fantasy",1984,tt1945851,7.4,28,Hulu


In [32]:
# Split Combined Genres to find the maximum number of genres per title
hulu["Combined Genres"].str.split(", ",expand=True)


Unnamed: 0,0,1,2,3,4,5,6
0,Comedy,Crime,Romance,,,,
1,Comedy,Drama,Music,,,,
2,Drama,Romance,,,,,
3,Action,Adventure,Sci-Fi,,,,
4,Drama,Romance,,,,,
...,...,...,...,...,...,...,...
10194,Crime,Talk-Show,,,,,
10196,Action,Animation,Fantasy,,,,
10199,Crime,Documentary,,,,,
10201,Adventure,Animation,Comedy,,,,
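
Reading the maximum off the width of the displayed frame works, but the number can also be computed and used to build the column names. The sketch below is an assumption about how that might look; `max_genres` and `genre_columns` are hypothetical names and the series is demo data.

```python
import pandas as pd

# Demo data: one missing entry to mirror the nulls in the real column.
genres = pd.Series(["Comedy, Crime, Romance", "Drama, Romance", None])

# Each comma separates two genres, so the genre count is commas + 1.
# max() skips the NaN produced by the missing entry.
max_genres = int(genres.str.count(",").max()) + 1

# Build "Genre 1" ... "Genre N" instead of typing the list by hand.
genre_columns = [f"Genre {i}" for i in range(1, max_genres + 1)]

parts = genres.str.split(", ", expand=True)
parts.columns = genre_columns
print(parts)
```

The same two lines would then work unchanged for every service, whatever its maximum turns out to be.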


In [33]:
# Split Combined Genres into one column per genre (seven is the maximum found above); the regex also drops the space after each comma
hulu[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5","Genre 6","Genre 7"]] = hulu["Combined Genres"].str.split(r",\s*", expand=True, regex=True)

hulu.head(2)

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name,Genre 1,Genre 2,Genre 3,Genre 4,Genre 5,Genre 6,Genre 7
0,Ariel,movie,"Comedy, Crime, Romance",1988,tt0094675,7.4,8949,Hulu,Comedy,Crime,Romance,,,,
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986,tt0092149,7.5,7727,Hulu,Comedy,Drama,Music,,,,


### **Netflix**

In [34]:
# View top 5 rows in the netflix df
netflix.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1238903.0,
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,521815.0,
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003.0,tt0266697,8.2,1236086.0,
3,Jarhead,movie,"Biography, Drama, War",2005.0,tt0418763,7.0,213667.0,
4,Unforgiven,movie,"Drama, Western",1992.0,tt0105695,8.2,448833.0,


In [35]:
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20736 entries, 0 to 20735
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               20107 non-null  object 
 1   type                20736 non-null  object 
 2   genres              20391 non-null  object 
 3   releaseYear         20702 non-null  float64
 4   imdbId              19238 non-null  object 
 5   imdbAverageRating   19049 non-null  float64
 6   imdbNumVotes        19049 non-null  float64
 7   availableCountries  167 non-null    object 
dtypes: float64(3), object(5)
memory usage: 1.3+ MB


In [36]:
netflix.describe()

Unnamed: 0,releaseYear,imdbAverageRating,imdbNumVotes
count,20702.0,19049.0,19049.0
mean,2013.218288,6.397438,31661.72
std,14.337227,1.096381,121274.1
min,1913.0,1.2,5.0
25%,2012.0,5.7,325.0
50%,2018.0,6.5,1572.0
75%,2022.0,7.2,10006.0
max,2025.0,9.8,3002274.0


In [37]:
netflix.select_dtypes("object").describe()

Unnamed: 0,title,type,genres,imdbId,availableCountries
count,20107,20736,20391,19238,167
unique,19334,2,855,19235,103
top,Perfect Strangers,movie,Comedy,tt6446792,"AD, AE, AG, AL, AO, AR, AT, AU, AZ, BA, BB, BE..."
freq,5,15762,1754,2,11


In [38]:
# Drop the availableCountries column: only 167 of 20,736 rows are non-null, so it adds little value
netflix.drop("availableCountries",axis=1,inplace=True)
netflix.columns


Index(['title', 'type', 'genres', 'releaseYear', 'imdbId', 'imdbAverageRating',
       'imdbNumVotes'],
      dtype='object')

In [39]:
# Convert releaseYear and imdbNumVotes to integers, then display only the data types
# Caveat: int64 cannot hold NaN, so missing values become -9223372036854775808

netflix[["releaseYear","imdbNumVotes"]] = netflix[["releaseYear","imdbNumVotes"]].apply(np.int64)
display(netflix.dtypes)


  arr = np.asarray(values, dtype=dtype)


title                 object
type                  object
genres                object
releaseYear            int64
imdbId                object
imdbAverageRating    float64
imdbNumVotes           int64
dtype: object

In [40]:
# Rename Columns
netflix.rename(columns={"title":"Title","type":"Type","genres":"Combined Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
# Add column that identifies the streaming service
netflix["Service Name"] = "Netflix"
netflix.columns


Index(['Title', 'Type', 'Combined Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes', 'Service Name'],
      dtype='object')

In [41]:
# Determine how many titles contain null values
netflix["Title"].isna().sum()


np.int64(629)

In [42]:
# Drop the rows containing null values in the Title column
netflix = netflix.dropna(subset=["Title"])
netflix["Title"].isna().sum()


np.int64(0)

In [43]:
netflix.duplicated().value_counts()

False    20107
Name: count, dtype: int64

In [44]:
netflix[netflix.duplicated(keep=False)]

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name


In [45]:
netflix.drop_duplicates()

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
0,American Beauty,movie,Drama,1999,tt0169547,8.3,1238903,Netflix
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997,tt0119116,7.6,521815,Netflix
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003,tt0266697,8.2,1236086,Netflix
3,Jarhead,movie,"Biography, Drama, War",2005,tt0418763,7.0,213667,Netflix
4,Unforgiven,movie,"Drama, Western",1992,tt0105695,8.2,448833,Netflix
...,...,...,...,...,...,...,...,...
20724,Grisaia Phantom Trigger,tv,"Action, Animation, Drama",2025,tt33503859,4.7,32,Netflix
20725,Legend of Princess Chang-Ge,tv,"Action, Adventure, Animation",2024,tt32561196,,-9223372036854775808,Netflix
20733,Ms. Rachel,tv,Family,2025,tt35406274,8.0,23,Netflix
20734,American Manhunt: O.J. Simpson,tv,Documentary,2025,tt35456246,7.4,1148,Netflix


In [46]:
netflix[netflix.duplicated(subset=["IMDb ID","Title","Release Year"], keep=False)]

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
6318,Gekijouban Yowamushi pedaru,movie,"Animation, Comedy, Romance",2015,tt4982154,6.7,160,Netflix
7219,Miniforce - New Heroes Rise,movie,"Action, Adventure, Animation",2016,tt6446792,6.1,55,Netflix
9350,The Best of the Wiggles,movie,"Music, Family",2018,,,-9223372036854775808,Netflix
10361,Lucid,movie,,2019,,,-9223372036854775808,Netflix
10441,True and the Rainbow Kingdom,movie,"Adventure, Animation, Comedy",2017,tt5607658,7.0,574,Netflix
10874,Amr's in Trouble,movie,"Romance, Comedy",2019,,,-9223372036854775808,Netflix
11217,Lucid,movie,Drama,2019,,,-9223372036854775808,Netflix
13522,The Best of the Wiggles,movie,,2018,,,-9223372036854775808,Netflix
14327,Amr's in Trouble,movie,,2019,,,-9223372036854775808,Netflix
14945,De olhos abertos,movie,Animation,2023,,,-9223372036854775808,Netflix


In [47]:
netflix = netflix.drop_duplicates(subset=["IMDb ID","Title","Release Year"], keep="first")
rows = netflix[netflix["Title"] == "Gekijouban Yowamushi pedaru"]
display(rows)

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
6318,Gekijouban Yowamushi pedaru,movie,"Animation, Comedy, Romance",2015,tt4982154,6.7,160,Netflix


In [48]:
# Split Combined Genres to find the maximum number of genres per title
netflix["Combined Genres"].str.split(", ",expand=True)


Unnamed: 0,0,1,2,3,4
0,Drama,,,,
1,Action,Adventure,Sci-Fi,,
2,Action,Crime,Thriller,,
3,Biography,Drama,War,,
4,Drama,Western,,,
...,...,...,...,...,...
20724,Action,Animation,Drama,,
20725,Action,Adventure,Animation,,
20733,Family,,,,
20734,Documentary,,,,


In [49]:
# Split Combined Genres into one column per genre (five is the maximum found above); the regex also drops the space after each comma
netflix[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5"]] = netflix["Combined Genres"].str.split(r",\s*", expand=True, regex=True)

netflix.head(2)


Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name,Genre 1,Genre 2,Genre 3,Genre 4,Genre 5
0,American Beauty,movie,Drama,1999,tt0169547,8.3,1238903,Netflix,Drama,,,,
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997,tt0119116,7.6,521815,Netflix,Action,Adventure,Sci-Fi,,


### **Prime**

In [50]:
# View top 5 rows of prime df
prime.head()

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8949.0,
1,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113403.0,
2,Judgment Night,movie,"Action, Crime, Drama",1993.0,tt0107286,6.6,19627.0,
3,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
4,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,477481.0,


In [51]:
prime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70055 entries, 0 to 70054
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               68386 non-null  object 
 1   type                70055 non-null  object 
 2   genres              67939 non-null  object 
 3   releaseYear         69868 non-null  float64
 4   imdbId              63648 non-null  object 
 5   imdbAverageRating   61316 non-null  float64
 6   imdbNumVotes        61316 non-null  float64
 7   availableCountries  373 non-null    object 
dtypes: float64(3), object(5)
memory usage: 4.3+ MB


In [52]:
prime.describe()

Unnamed: 0,releaseYear,imdbAverageRating,imdbNumVotes
count,69868.0,61316.0,61316.0
mean,2007.542494,5.955402,13247.37
std,18.939153,1.370233,75539.37
min,1902.0,1.0,5.0
25%,2004.0,5.1,101.0
50%,2015.0,6.1,446.0
75%,2019.0,7.0,2331.0
max,2026.0,10.0,3002274.0


In [53]:
prime.select_dtypes("object").describe()

Unnamed: 0,title,type,genres,imdbId,availableCountries
count,68386,70055,67939,63648,373
unique,63384,2,1465,63625,97
top,Monster,movie,Drama,tt0468988,US
freq,9,60901,6743,2,110


In [54]:
# Drop the availableCountries column: only 373 of 70,055 rows are non-null, so it adds little value
prime.drop("availableCountries",axis=1,inplace=True)
prime.columns


Index(['title', 'type', 'genres', 'releaseYear', 'imdbId', 'imdbAverageRating',
       'imdbNumVotes'],
      dtype='object')

In [55]:
# Convert releaseYear and imdbNumVotes to integers, then display only the data types
# Caveat: int64 cannot hold NaN, so missing values become -9223372036854775808
prime[["releaseYear","imdbNumVotes"]] = prime[["releaseYear","imdbNumVotes"]].apply(np.int64)
display(prime.dtypes)


  arr = np.asarray(values, dtype=dtype)


title                 object
type                  object
genres                object
releaseYear            int64
imdbId                object
imdbAverageRating    float64
imdbNumVotes           int64
dtype: object

In [56]:
# Rename Columns
prime.rename(columns={"title":"Title","type":"Type","genres":"Combined Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
# Add column that identifies the streaming service
prime["Service Name"] = "Prime"
prime.columns


Index(['Title', 'Type', 'Combined Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes', 'Service Name'],
      dtype='object')

In [57]:
# "Service Name" was already added in the previous cell; the notes below are kept for future function building.
# prime["Service Name"] = "Prime"
# prime.tail(2)

# Idea: use a map-style function to create the column and fill every row with the name of its DataFrame.

# Idea: build a dictionary of DataFrame names and use each name to populate the column.
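
The dictionary idea in these notes could be sketched as follows. This is an illustration under assumptions, not the notebook's code: `clean_service`, `RENAMES`, and `raw` are hypothetical names, the steps mirror the per-service cells above, and the second demo row is made up to exercise the null-title drop.

```python
import numpy as np
import pandas as pd

# Same renames the per-service cells apply by hand.
RENAMES = {
    "title": "Title",
    "type": "Type",
    "genres": "Combined Genres",
    "releaseYear": "Release Year",
    "imdbId": "IMDb ID",
    "imdbAverageRating": "IMDb Average Rating",
    "imdbNumVotes": "IMDb Num Votes",
}


def clean_service(df, service_name):
    """Apply the shared cleaning steps and tag rows with the service name."""
    out = df.drop(columns=["availableCountries"], errors="ignore")
    out = out.rename(columns=RENAMES)
    out["Service Name"] = service_name
    out = out.dropna(subset=["Title"])
    return out.drop_duplicates(
        subset=["IMDb ID", "Title", "Release Year"], keep="first"
    )


# Demo input keyed by service name; the dictionary key fills "Service Name".
raw = {
    "Hulu": pd.DataFrame(
        {
            "title": ["Ariel", None],  # second row is a made-up placeholder
            "type": ["movie", "movie"],
            "genres": ["Comedy, Crime, Romance", "Drama"],
            "releaseYear": [1988.0, 2000.0],
            "imdbId": ["tt0094675", None],
            "imdbAverageRating": [7.4, np.nan],
            "imdbNumVotes": [8949.0, np.nan],
            "availableCountries": [None, None],
        }
    ),
}

cleaned = {name: clean_service(df, name) for name, df in raw.items()}
print(cleaned["Hulu"])
```

Keeping the steps in one function means a change to the cleaning logic applies to all four services at once.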

In [58]:
# Determine how many titles contain null values
prime["Title"].isna().sum()


np.int64(1669)

In [59]:
# Drop the rows containing null values in the Title column
prime = prime.dropna(subset=["Title"])
prime["Title"].isna().sum()


np.int64(0)

In [60]:
prime.duplicated().value_counts()

False    68382
True         4
Name: count, dtype: int64

In [61]:
prime[prime.duplicated(keep=False)]

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
25185,Burn,movie,,2014,,,-9223372036854775808,Prime
35993,Belle,movie,"Fantasy, Horror",2023,tt12373754,5.4,168,Prime
42922,Burn,movie,,2014,,,-9223372036854775808,Prime
45350,Belle,movie,"Fantasy, Horror",2023,tt12373754,5.4,168,Prime
50542,Flower Boy,movie,,2021,,,-9223372036854775808,Prime
51337,Flower Boy,movie,,2021,,,-9223372036854775808,Prime
63744,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Prime
63745,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Prime


In [62]:
prime.drop_duplicates()

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
0,Ariel,movie,"Comedy, Crime, Romance",1988,tt0094675,7.4,8949,Prime
1,Four Rooms,movie,Comedy,1995,tt0113101,6.7,113403,Prime
2,Judgment Night,movie,"Action, Crime, Drama",1993,tt0107286,6.6,19627,Prime
3,Forrest Gump,movie,"Drama, Romance",1994,tt0109830,8.8,2348885,Prime
4,Citizen Kane,movie,"Drama, Mystery",1941,tt0033467,8.3,477481,Prime
...,...,...,...,...,...,...,...,...
70046,Breaking Pickleball,tv,Documentary,2024,tt32306198,9.3,76,Prime
70050,Abracadavers,tv,"Action, Comedy, Sci-Fi",2019,tt8386640,8.2,40,Prime
70051,Tom Green Country,tv,Reality,2025,tt32832543,8.1,72,Prime
70053,Ryan Trahan,tv,Reality-TV,2019,tt28487606,8.2,42,Prime


In [63]:
prime[prime.duplicated(subset=["IMDb ID","Title","Release Year"], keep=False)]

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name
830,Moses the Lawgiver,movie,Drama,1974,tt0072547,6.1,591,Prime
3339,Ferrari,movie,Drama,2003,tt0346946,7.0,925,Prime
6850,Icon,movie,"Action, Crime, Thriller",2005,tt0424176,5.6,1085,Prime
7782,Gone But Not Forgotten,movie,"Crime, Drama, Mystery",2005,tt0384736,6.1,732,Prime
7809,Category 7: The End of the World,movie,"Action, Adventure, Drama",2005,tt0468988,4.5,3044,Prime
...,...,...,...,...,...,...,...,...
68195,Gone But Not Forgotten,tv,"Crime, Drama, Mystery",2005,tt0384736,6.1,732,Prime
68300,Shylock's Children,tv,"Crime, Drama",2023,tt21944604,6.4,127,Prime
68510,Love Kills,tv,Horror,2023,tt16431352,6.3,11,Prime
68609,Living with the Dead,tv,"Crime, Drama, Mystery",2002,tt0289652,7.0,1496,Prime


In [65]:
# prime = prime.drop_duplicates(subset=["IMDb ID","Title","Release Year"], keep="first")
# rows = prime[prime["Title"] == "Panzer World Galient"]
# display(rows)

In [66]:
# Split Combined Genres to find the maximum number of genres per title
prime["Combined Genres"].str.split(", ",expand=True)


Unnamed: 0,0,1,2,3,4,5
0,Comedy,Crime,Romance,,,
1,Comedy,,,,,
2,Action,Crime,Drama,,,
3,Drama,Romance,,,,
4,Drama,Mystery,,,,
...,...,...,...,...,...,...
70046,Documentary,,,,,
70050,Action,Comedy,Sci-Fi,,,
70051,Reality,,,,,
70053,Reality-TV,,,,,


In [67]:
# Split Combined Genres into one column per genre (six is the maximum found above); the regex also drops the space after each comma
prime[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5","Genre 6"]] = prime["Combined Genres"].str.split(r",\s*", expand=True, regex=True)
prime.head(2)

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name,Genre 1,Genre 2,Genre 3,Genre 4,Genre 5,Genre 6
0,Ariel,movie,"Comedy, Crime, Romance",1988,tt0094675,7.4,8949,Prime,Comedy,Crime,Romance,,,
1,Four Rooms,movie,Comedy,1995,tt0113101,6.7,113403,Prime,Comedy,,,,,


In [68]:
# Concatenate the four DataFrames row-wise; they share column names, so the rows stack into a single DataFrame
streaming = pd.concat([apple, hulu, netflix, prime], axis = 0)
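
Two behaviors of `pd.concat` are worth keeping in mind here. Columns that exist in only some inputs (such as `Genre 7`, which only Hulu has) are filled with `NaN` for the other frames, and each frame keeps its original index labels unless `ignore_index=True` is passed. The sketch below shows both on two tiny demo frames. Using `ignore_index=True` is an option, not what the cell above does.

```python
import pandas as pd

# Demo frames: b has one more genre column than a.
a = pd.DataFrame({"Title": ["A"], "Genre 1": ["Drama"]})
b = pd.DataFrame({"Title": ["B"], "Genre 1": ["Comedy"], "Genre 2": ["Crime"]})

# Default behavior: index labels from each input are kept, so 0 appears twice.
stacked = pd.concat([a, b], axis=0)

# With ignore_index=True the result gets a fresh 0..n-1 index.
combined = pd.concat([a, b], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(combined)
```

Repeated index labels matter mainly for label-based lookups such as `streaming.loc[0]`, which would return one row per service.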


In [69]:
streaming.info()

<class 'pandas.core.frame.DataFrame'>
Index: 115671 entries, 0 to 70054
Data columns (total 15 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Title                115671 non-null  object 
 1   Type                 115671 non-null  object 
 2   Combined Genres      113456 non-null  object 
 3   Release Year         115671 non-null  int64  
 4   IMDb ID              108741 non-null  object 
 5   IMDb Average Rating  105499 non-null  float64
 6   IMDb Num Votes       115671 non-null  int64  
 7   Service Name         115671 non-null  object 
 8   Genre 1              113456 non-null  object 
 9   Genre 2              75644 non-null   object 
 10  Genre 3              44152 non-null   object 
 11  Genre 4              117 non-null     object 
 12  Genre 5              26 non-null      object 
 13  Genre 6              4 non-null       object 
 14  Genre 7              1 non-null       object 
dtypes: float64(1), int64(2), object(12)

In [70]:
streaming.describe()

Unnamed: 0,Release Year,IMDb Average Rating,IMDb Num Votes
count,115671.0,105499.0,115671.0
mean,-7335894000000000.0,6.152437,-8.110947e+17
std,2.600163e+17,1.290911,2.612128e+18
min,-9.223372e+18,1.0,-9.223372e+18
25%,2005.0,5.4,78.0
50%,2015.0,6.3,504.0
75%,2020.0,7.1,3266.0
max,2026.0,10.0,3002274.0
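
The minimum of -9.223372e+18 equals the smallest value an int64 can hold, which usually means missing values were cast to integers at some earlier step; that sentinel is what drags the mean and standard deviation so far off. A possible fix, sketched here on a toy Series and assuming the sentinel really does stand for "missing", is to switch to pandas' nullable `Int64` dtype and mask the sentinel.

```python
import numpy as np
import pandas as pd

INT64_MIN = np.iinfo(np.int64).min  # -9223372036854775808

# Toy vote counts with one sentinel value standing in for a missing entry
votes = pd.Series([8949, INT64_MIN, 113403], dtype="int64")

# Convert to the nullable integer dtype, then mark the sentinel as missing
cleaned = votes.astype("Int64").mask(votes == INT64_MIN)

print(cleaned.isna().sum())
print(cleaned.mean())
```

With the sentinel masked, summary statistics are computed only over the real values.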


In [71]:
streaming.duplicated().value_counts()

False    115667
True          4
Name: count, dtype: int64

In [72]:
streaming[streaming.duplicated(keep=False)]

Unnamed: 0,Title,Type,Combined Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes,Service Name,Genre 1,Genre 2,Genre 3,Genre 4,Genre 5,Genre 6,Genre 7
25185,Burn,movie,,2014,,,-9223372036854775808,Prime,,,,,,,
35993,Belle,movie,"Fantasy, Horror",2023,tt12373754,5.4,168,Prime,Fantasy,Horror,,,,,
42922,Burn,movie,,2014,,,-9223372036854775808,Prime,,,,,,,
45350,Belle,movie,"Fantasy, Horror",2023,tt12373754,5.4,168,Prime,Fantasy,Horror,,,,,
50542,Flower Boy,movie,,2021,,,-9223372036854775808,Prime,,,,,,,
51337,Flower Boy,movie,,2021,,,-9223372036854775808,Prime,,,,,,,
63744,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Prime,Crime,Documentary,,,,,
63745,Scorned: Love Kills,tv,"Crime, Documentary",2012,tt2287041,6.9,304,Prime,Crime,Documentary,,,,,
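
The two cells above report different counts (4 versus 8) because `duplicated()` flags only the later copies by default, while `keep=False` flags every row in a duplicate group. The sketch below reproduces that on a made-up frame and shows `drop_duplicates` as one way to keep a single copy of each.

```python
import pandas as pd

# Illustrative frame with two duplicated pairs and one unique row
df = pd.DataFrame({
    "Title": ["Belle", "Burn", "Belle", "Burn", "Ariel"],
    "Release Year": [2023, 2014, 2023, 2014, 1988],
})

later_copies = df.duplicated()            # flags only the second and later occurrences
all_copies = df.duplicated(keep=False)    # flags every row that has a twin
deduped = df.drop_duplicates(keep="first")

print(later_copies.sum(), all_copies.sum(), len(deduped))
```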


In [74]:
streaming["Title"].isnull().value_counts()


Title
False    115671
Name: count, dtype: int64

In [75]:
streaming["IMDb ID"].isnull().value_counts()

IMDb ID
False    108741
True       6930
Name: count, dtype: int64
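
Instead of checking one column at a time, `isna().sum()` returns the null count for every column in one call. This is a small sketch on a made-up three-row frame, not output from the real data.

```python
import pandas as pd

# Illustrative frame; the None entries stand in for missing values
df = pd.DataFrame({
    "Title": ["Burn", "Belle", "Flower Boy"],
    "IMDb ID": [None, "tt12373754", None],
    "IMDb Average Rating": [None, 5.4, None],
})

null_counts = df.isna().sum()
print(null_counts)
```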

In [None]:
# # Assign dataset names & combine to read in as separate data frames

# list_of_names = ['Netflix','Hulu','Prime','AppleTV']

# # Created an empty list into which I can place the datasets
# combined_list = []

# # Loop over the names and append each loaded dataset to the empty list created above
# for i in range(len(list_of_names)):
#     temp_df = pd.read_csv(list_of_names[i]+".csv")
#     combined_list.append(temp_df)

### **Initial Checks**

In [None]:
# # Looking at file content of list in index 0 - Netflix - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[0].info()
# combined_list[0].head(5)



In [None]:
# # Looking at file content of list in index 1 - Hulu - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[1].info()
# combined_list[1].head(5)

In [None]:
# # Looking at file content of list in index 2 - Prime - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[2].info()
# combined_list[2].head(5)

In [None]:
# # Looking at file content of list in index 3 - AppleTV - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[3].info()
# combined_list[3].head(5)

### Initial Review (02/10/2025)
- The "availableCountries" column will not provide much data going forward
- No dataset has a column for the source of the data, so this will need to be added
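
One way to add the missing source column is to tag each frame as it is loaded. The sketch below assumes a column called "Service Name" (the name used elsewhere in this notebook) and reads from in-memory text so it runs without the CSV files; the rows and the `raw_files` dictionary are placeholders.

```python
import io
import pandas as pd

# In-memory stand-ins for the CSV files; the real files would be read from disk
raw_files = {
    "Hulu": io.StringIO("title,type\nShow A,movie\n"),
    "Netflix": io.StringIO("title,type\nShow B,tv\n"),
}

frames = []
for service, handle in raw_files.items():
    frame = pd.read_csv(handle)
    frame["Service Name"] = service  # tag every row with its source
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```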

### Secondary Review (02/12/2025)
- Each dataset contains an "imdbId" column with the listing's IMDb ID, which is a string; this may come in handy 
- There are 3,514 missing titles
    - Prime = 2234
    - AppleTV = 651
    - Hulu = 2885
    - Netflix = 629
- There are 2,590 duplicated rows; many of them are rows with missing titles, though several are repeated title names
    - Apple = 2
    - Hulu = 651
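
Per-service tallies like these can be produced in one pass instead of being counted separately for each frame. The following is a rough sketch on made-up rows; the helper column names `missing_title` and `duplicated_row` are invented for illustration.

```python
import pandas as pd

# Illustrative rows; None marks a missing title
df = pd.DataFrame({
    "Title": ["A", None, "A", "B", None, "C"],
    "Service Name": ["AppleTV", "AppleTV", "AppleTV", "Hulu", "Hulu", "Hulu"],
})

summary = (
    df.assign(
        missing_title=df["Title"].isna(),   # True where the title is missing
        duplicated_row=df.duplicated(),     # True for second and later copies
    )
    .groupby("Service Name")[["missing_title", "duplicated_row"]]
    .sum()
)
print(summary)
```

Summing the boolean flags within each group gives the counts per service.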


### Breakdown:
#### Apple:
- 565 Missing Titles
- 2 Duplicated Rows
#### Hulu:
- 651 Missing Titles
- 2 Duplicated Rows