# <u>**Streaming Service Comparison**</u>

### **Objective:**
- Determine which streaming platform hosts the majority of content I enjoy so that I can pare down the services to which I subscribe. 

### **Data Sources:**
- [Netflix via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
- [Hulu via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
- [Prime via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset/data)
- [AppleTV via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)

## **Data Collection & Loading**

### **Import Pandas, NumPy, Matplotlib, WordCloud, and PIL**

In [1]:
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

### **Data Load**

In [2]:
# Load each file separately.
# TODO: decide whether to wrap this in a function.
apple = pd.read_csv("AppleTV.csv")
hulu = pd.read_csv("Hulu.csv")
netflix = pd.read_csv("Netflix.csv")
prime = pd.read_csv("Prime.csv")
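
If the loading step is wrapped in a function, one possible shape is sketched below. The function name `load_services`, the `data_dir` parameter, and the dictionary keys are illustrative assumptions; only the CSV file names come from the cell above.

```python
from pathlib import Path

import pandas as pd


def load_services(files: dict[str, str], data_dir: str = ".") -> dict[str, pd.DataFrame]:
    """Read one CSV per service and return the frames keyed by service name."""
    frames = {}
    for service, filename in files.items():
        frames[service] = pd.read_csv(Path(data_dir) / filename)
    return frames


# File names taken from the loading cell; the keys are hypothetical labels.
SERVICE_FILES = {
    "AppleTV": "AppleTV.csv",
    "Hulu": "Hulu.csv",
    "Netflix": "Netflix.csv",
    "Prime": "Prime.csv",
}

# Example use (assumes the CSV files sit in the working directory):
# frames = load_services(SERVICE_FILES)
# apple, hulu = frames["AppleTV"], frames["Hulu"]
```

Keeping the frames in a dictionary makes it easy to loop over every service later instead of repeating the same cell four times.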

In [3]:
#display(apple)
apple.head(3)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113403.0,
1,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2348885.0,
2,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1238903.0,


In [4]:
# Display information about the dataset (i.e., row counts, column counts, column names, datatypes, # of non-null rows)
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18208 entries, 0 to 18207
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               17643 non-null  object 
 1   type                18208 non-null  object 
 2   genres              17549 non-null  object 
 3   releaseYear         18179 non-null  float64
 4   imdbId              16727 non-null  object 
 5   imdbAverageRating   16314 non-null  float64
 6   imdbNumVotes        16314 non-null  float64
 7   availableCountries  84 non-null     object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


In [5]:
apple.describe()

Unnamed: 0,releaseYear,imdbAverageRating,imdbNumVotes
count,18179.0,16314.0,16314.0
mean,2006.822432,6.381574,25548.55
std,18.41046,1.162677,100828.8
min,1902.0,1.3,5.0
25%,2001.0,5.7,201.0
50%,2014.0,6.5,1234.0
75%,2020.0,7.2,8106.5
max,2025.0,9.6,2348885.0


In [6]:
apple.select_dtypes("object").describe()

Unnamed: 0,title,type,genres,imdbId,availableCountries
count,17643,18208,17549,16727,84
unique,16986,2,798,16724,14
top,Teenage Mutant Ninja Turtles,movie,Drama,tt0112120,US
freq,5,14005,1590,2,42


In [7]:
# Drop the availableCountries column: with only 84 non-null rows it adds little value
apple.drop("availableCountries",axis=1,inplace=True)
apple.columns

Index(['title', 'type', 'genres', 'releaseYear', 'imdbId', 'imdbAverageRating',
       'imdbNumVotes'],
      dtype='object')

In [8]:
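# Change releaseYear and imdbNumVotes to integers, then display only the data types.
# Note: NaN cannot be stored as int64, so missing values come out as -9223372036854775808
# and NumPy emits a cast warning.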

apple[["releaseYear","imdbNumVotes"]] = apple[["releaseYear","imdbNumVotes"]].apply(np.int64)
display(apple.dtypes)

  arr = np.asarray(values, dtype=dtype)


title                 object
type                  object
genres                object
releaseYear            int64
imdbId                object
imdbAverageRating    float64
imdbNumVotes           int64
dtype: object
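
Because both columns contain missing values, a plain `int64` cast cannot represent them. A minimal sketch of one alternative, assuming pandas' nullable `Int64` dtype is acceptable for this analysis, is shown on a made-up toy frame (the values are placeholders, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the AppleTV data; the values are made up.
toy = pd.DataFrame(
    {
        "releaseYear": [1995.0, np.nan, 2023.0],
        "imdbNumVotes": [113403.0, 5196.0, np.nan],
    }
)

# "Int64" (capital I) is pandas' nullable integer dtype: missing values
# stay missing (<NA>) instead of being cast to an arbitrary integer.
toy = toy.astype({"releaseYear": "Int64", "imdbNumVotes": "Int64"})

print(toy.dtypes)
print(toy)
```

With this dtype, `isna()` and `dropna()` keep working on the converted columns.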

In [9]:
# Rename Columns
apple.rename(columns={"title":"Title","type":"Type","genres":"Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
apple.columns

Index(['Title', 'Type', 'Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes'],
      dtype='object')

In [10]:
# Determine how many titles contain null values
apple["Title"].isna().sum()

np.int64(565)

In [11]:
# Drop the rows containing null values in the Title column
apple = apple.dropna(subset=["Title"])
apple["Title"].isna().sum()


np.int64(0)

In [12]:
apple.duplicated().value_counts()

False    17641
True         2
Name: count, dtype: int64

In [13]:
apple[apple.duplicated(keep=False)]

Unnamed: 0,Title,Type,Genres,Release Year,IMDb ID,IMDb Average Rating,IMDb Num Votes
3497,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995,tt0112120,8.5,5196
13091,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023,,,-9223372036854775808
13093,The Kidnap & Murder of Lynda Spence,movie,"Documentary, Crime",2023,,,-9223372036854775808
13980,A Personal Journey with Martin Scorsese Throug...,movie,"Biography, Documentary, History",1995,tt0112120,8.5,5196
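
The difference between inspecting duplicates and removing them can be sketched on a small made-up frame. The titles and IDs below are placeholders; the point is that `duplicated(keep=False)` flags every copy, while `drop_duplicates()` keeps the first row of each group, and rows whose missing values line up are treated as duplicates of each other.

```python
import numpy as np
import pandas as pd

# Placeholder rows: two exact pairs, one of which has a missing IMDb ID.
toy = pd.DataFrame(
    {
        "Title": ["Film A", "Film B", "Film B", "Film A"],
        "IMDb ID": ["tt0000001", np.nan, np.nan, "tt0000001"],
    }
)

# keep=False flags every member of a duplicate group, which helps inspection.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first row of each group; NaN compares as equal here.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(all_copies), len(deduped))
```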


In [14]:
apple_deduped = apple.drop_duplicates()

In [None]:
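# apple["Genres"].value_counts()
# Split the comma-separated Genres string into one column per genre,
# naming the columns from however many genres the longest entry has.
genre_cols = apple["Genres"].str.split(r",\s*", expand=True, regex=True)
genre_cols.columns = [f"Genre {i + 1}" for i in range(genre_cols.shape[1])]
apple = apple.join(genre_cols)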

display(apple)
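
For counting how often each genre appears, a long format (one row per title and genre) may be easier to work with than several wide genre columns. The sketch below is one possible approach rather than the notebook's method; it reuses the three rows shown in the `apple.head(3)` output, and the `Genre` column name is an assumption.

```python
import pandas as pd

# Rows copied from the head() output earlier in the notebook.
toy = pd.DataFrame(
    {
        "Title": ["Four Rooms", "Forrest Gump", "American Beauty"],
        "Genres": ["Comedy", "Drama, Romance", "Drama"],
    }
)

# Split on the comma, give each genre its own row, then strip stray whitespace.
long_genres = (
    toy.assign(Genre=toy["Genres"].str.split(","))
    .explode("Genre")
    .assign(Genre=lambda d: d["Genre"].str.strip())
)

genre_counts = long_genres["Genre"].value_counts()
print(genre_counts)
```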



In [None]:
# View top 5 rows in the hulu df
hulu.head()

In [None]:
# Add a serviceName column to the hulu df with a default value for each row & display the bottom two rows
hulu["serviceName"] = "Hulu"
hulu.tail(2)

In [None]:
# View top 5 rows in the netflix df
netflix.head()

In [None]:
# Add a serviceName column to the netflix df with a default value for each row & display the bottom two rows
netflix["serviceName"] = "Netflix"
netflix.tail(2)

In [None]:
# View top 5 rows of prime df
prime.head()

In [None]:
# Add a serviceName column to the prime df with a default value for each row & display the bottom two rows
prime["serviceName"] = "Prime"
prime.tail(2)

# TODO: use a map-style helper to create the serviceName column and fill every row with the DataFrame's name

# TODO: build a dictionary of the DataFrame names and use each name to populate the column
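
The dictionary idea in the notes above could look roughly like this. It is a sketch under assumptions: the frames are tiny placeholders with made-up titles, and the dictionary keys are the labels that would be written into `serviceName`.

```python
import pandas as pd

# Tiny placeholder frames; in the notebook these would be apple, hulu, netflix, prime.
frames = {
    "AppleTV": pd.DataFrame({"title": ["Four Rooms"]}),
    "Hulu": pd.DataFrame({"title": ["Example Show"]}),
    "Netflix": pd.DataFrame({"title": ["Example Film"]}),
    "Prime": pd.DataFrame({"title": ["Another Film"]}),
}

# Add the same serviceName column to every frame, using the dictionary key as the value.
# assign() returns a new frame, so the originals are left untouched.
tagged = {name: df.assign(serviceName=name) for name, df in frames.items()}

print(tagged["Hulu"])
```

Doing this in one pass also ensures no service is skipped when the column is added.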

In [None]:
# Concatenate the DataFrames row-wise (axis=0); this relies on all four sharing the same
# column names, so apple's renamed columns need to match the others first
streaming = pd.concat([apple, hulu, netflix, prime], axis = 0)
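
`pd.concat` aligns frames by column name, so a frame whose columns were renamed ends up in separate, mostly empty columns. A minimal sketch of lining the names up first, using made-up rows and a hypothetical `to_raw` mapping:

```python
import pandas as pd

# Made-up rows: "Title"/"Type" mimic the renamed AppleTV columns,
# while the other frame keeps the original lowercase names.
apple_toy = pd.DataFrame({"Title": ["Four Rooms"], "Type": ["movie"], "serviceName": ["AppleTV"]})
hulu_toy = pd.DataFrame({"title": ["Example Show"], "type": ["tv"], "serviceName": ["Hulu"]})

# Map the display names back to the raw names so both frames share one schema.
to_raw = {"Title": "title", "Type": "type"}
apple_toy = apple_toy.rename(columns=to_raw)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
streaming_toy = pd.concat([apple_toy, hulu_toy], axis=0, ignore_index=True)
print(streaming_toy)
```

The alternative is to postpone the cosmetic renaming until after the frames are combined.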


In [None]:
streaming.info()

In [None]:
streaming.describe()

In [None]:
streaming.duplicated().value_counts()

In [None]:
streaming[streaming.duplicated(keep=False)]

In [None]:
streaming["title"].isnull().value_counts()


In [None]:
streaming["imdbId"].isnull().value_counts()

In [None]:
# # Assign dataset names & combine to read in as separate data frames

# list_of_names = ['Netflix','Hulu','Prime','AppleTV']

# # Created an empty list into which I can place the datasets
# combined_list = []

# # Used a loop to append the datasets into the empty list I created above
# for i in range(len(list_of_names)):
#     temp_df = pd.read_csv(list_of_names[i]+".csv")
#     combined_list.append(temp_df)

### **Initial Checks**

In [None]:
# # Looking at file content of list in index 0 - Netflix - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[0].info()
# combined_list[0].head(5)



In [None]:
# # Looking at file content of list in index 1 - Hulu - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[1].info()
# combined_list[1].head(5)

In [None]:
# # Looking at file content of list in index 2 - Prime - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[2].info()
# combined_list[2].head(5)

In [None]:
# # Looking at file content of list in index 3 - AppleTV - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[3].info()
# combined_list[3].head(5)

### Initial Review (02/10/2025)
- The "availableCountries" column is almost entirely null (only 84 of 18,208 AppleTV rows are populated), so it will not be useful going forward
- No dataset has a column for the source of the data, so this will need to be added

### Secondary Review (02/12/2025)
- Each dataset contains an "imdbId" column holding the listing's IMDb ID as a string; this may come in handy as a key for matching the same listing across services
- There are 3,514 missing titles
    - Prime = 2234
    - AppleTV = 651
    - Hulu = 2885
    - Netflix = 629
- There are 2,590 duplicated rows; many come from rows with missing titles, though several are genuinely repeated titles
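
The per-service counts in this review can be recomputed from the combined frame. The sketch below assumes the combined data has `title`, `imdbId`, and `serviceName` columns; the rows are invented for illustration and do not reproduce the figures above.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the combined streaming frame.
streaming_toy = pd.DataFrame(
    {
        "title": ["Film A", np.nan, "Film B", np.nan, np.nan, "Film A"],
        "imdbId": ["tt0000001", np.nan, "tt0000002", np.nan, np.nan, "tt0000001"],
        "serviceName": ["Netflix", "Netflix", "Hulu", "Prime", "Prime", "Netflix"],
    }
)

# Count missing titles per service.
missing_by_service = (
    streaming_toy["title"].isna().groupby(streaming_toy["serviceName"]).sum()
)

# Count fully duplicated rows, and how many of those also lack a title.
dup_mask = streaming_toy.duplicated(keep=False)
dup_missing_title = (dup_mask & streaming_toy["title"].isna()).sum()

print(missing_by_service)
print(dup_mask.sum(), dup_missing_title)
```

Comparing `missing_by_service.sum()` with the overall null count is a quick check that the per-service numbers add up to the total.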