# <u>**Streaming Service Comparison**</u>

### **Objective:**
- Determine which streaming platform hosts the majority of content I enjoy so that I can pare down the services to which I subscribe. 

### **Data Sources:**
- [Netflix via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
- [Hulu via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
- [Prime via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset/data)
- [AppleTV via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)

## **Data Collection & Loading**

### **Import Pandas, Numpy, Matplotlib, Wordcloud, and PIL**

In [None]:
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

### **Data Load**

In [None]:
# Load each file separately.
# TODO: decide whether to wrap these calls in a loader function
apple = pd.read_csv("AppleTV.csv")
hulu = pd.read_csv("Hulu.csv")
netflix = pd.read_csv("Netflix.csv")
prime = pd.read_csv("Prime.csv")
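
If the loader function mentioned in the comment above is ever written, it could look roughly like the sketch below. The helper name `load_services` and its `data_dir` parameter are hypothetical, and the sketch assumes each file is named `<service>.csv`, as in the cell above.

```python
from pathlib import Path

import pandas as pd


# Hypothetical helper: the name and signature are illustrative, not part of the notebook.
def load_services(names, data_dir="."):
    """Read <name>.csv for each service and return a dict of DataFrames keyed by name."""
    frames = {}
    for name in names:
        frames[name] = pd.read_csv(Path(data_dir) / f"{name}.csv")
    return frames


# Usage, assuming the four CSV files sit next to the notebook:
# frames = load_services(["AppleTV", "Hulu", "Netflix", "Prime"])
# apple, hulu = frames["AppleTV"], frames["Hulu"]
```

Keeping the frames in a dictionary keyed by service name also makes it easy to loop over them later instead of repeating each step per service.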

### **Apple**   

In [None]:
# Preview the first three rows of each dataset
print(apple.head(3))
print(hulu.head(3))
print(netflix.head(3))
print(prime.head(3))

In [None]:
# Display information about the dataset (i.e., row counts, column counts, column names, datatypes, # of non-null rows)
apple.info()

In [None]:
apple.describe()

In [None]:
apple.select_dtypes("object").describe()

In [None]:
# Drop availableCountries Column - no value with such a low non-null number
apple.drop("availableCountries",axis=1,inplace=True)
apple.columns

In [None]:
# Change releaseYear, imdbNumVotes to integers
# Display only the data types

# The nullable "Int64" dtype keeps any missing values as <NA> instead of failing on NaN
apple = apple.astype({"releaseYear": "Int64", "imdbNumVotes": "Int64"})
display(apple.dtypes)
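
A note on the integer conversion: a plain `astype("int64")` raises when a column contains NaN, while pandas' nullable `"Int64"` dtype (capital I) stores missing values as `<NA>`. The sketch below shows the difference on made-up values. It is only an illustration and assumes the real columns hold whole numbers plus possibly some NaN.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real columns; the values are made up.
demo = pd.DataFrame({
    "releaseYear": [2019.0, np.nan],
    "imdbNumVotes": [1200.0, 35.0],
})

# Cast both columns to the nullable integer dtype; NaN becomes <NA>.
demo = demo.astype({"releaseYear": "Int64", "imdbNumVotes": "Int64"})
print(demo.dtypes)
print(demo)
```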

In [None]:
# Rename Columns
apple.rename(columns={"title":"Title","type":"Type","genres":"Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
# Add a column that identifies the streaming service
apple["Service Name"] = "AppleTV"
apple.columns

In [None]:
# Determine how many titles contain null values
apple["Title"].isna().sum()

In [None]:
# Drop the rows containing null values in the Title column
apple = apple.dropna(subset=["Title"])
apple["Title"].isna().sum()


In [None]:
apple.duplicated().value_counts()

In [None]:
apple[apple.duplicated(keep=False)]

In [None]:
# drop_duplicates returns a new DataFrame, so assign it back to keep the result
apple = apple.drop_duplicates()
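
`drop_duplicates` returns a new DataFrame rather than changing the original, so the result only sticks if it is assigned back (or `inplace=True` is passed). A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy frame with one exact duplicate row; the values are made up.
demo = pd.DataFrame({"Title": ["A", "A", "B"], "Type": ["movie", "movie", "tv"]})

demo.drop_duplicates()          # result is discarded, demo is unchanged
rows_before = len(demo)

demo = demo.drop_duplicates()   # reassigning keeps the de-duplicated frame
rows_after = len(demo)

print(rows_before, rows_after)
```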


In [None]:
#apple.genres.value_counts()
# apple['genres'].str.split(',', expand=True)
apple[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5","Genre 6"]] = apple["Genres"].str.split(',',expand=True)

display(apple)
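
Splitting into a fixed set of "Genre N" columns only works when no title has more genres than there are columns. An alternative, sketched below on made-up data, is to split and `explode` so each genre gets its own row, which makes counting straightforward. This is an illustration, not what the notebook currently does. It assumes genres are comma-separated.

```python
import pandas as pd

# Toy data; the comma separator is an assumption about the Genres strings.
demo = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genres": ["Drama, Comedy", "Drama", None],
})

genre_counts = (
    demo["Genres"]
    .str.split(",")   # list of genres per title (missing stays missing)
    .explode()        # one row per (title, genre) pair
    .str.strip()      # remove stray spaces around each genre
    .dropna()         # ignore titles with no genre listed
    .value_counts()
)
print(genre_counts)
```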



### **Hulu**

In [None]:
# View top 5 rows in the hulu df
hulu.head()

In [None]:
hulu.info()

In [None]:
hulu.describe()

In [None]:
hulu.select_dtypes("object").describe()

In [None]:
# Drop availableCountries Column - no value with such a low non-null number
hulu.drop("availableCountries",axis=1,inplace=True)
hulu.columns

In [None]:
# Change releaseYear, imdbNumVotes to integers
# Display only the data types

# The nullable "Int64" dtype keeps any missing values as <NA> instead of failing on NaN
hulu = hulu.astype({"releaseYear": "Int64", "imdbNumVotes": "Int64"})
display(hulu.dtypes)

In [None]:
# Rename Columns
hulu.rename(columns={"title":"Title","type":"Type","genres":"Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"},inplace = True)
# Add column that identifies the streaming service
hulu["Service Name"] = "Hulu"
hulu.columns

In [None]:
# Determine how many titles contain null values
hulu["Title"].isna().sum()

In [None]:
# Drop the rows containing null values in the Title column
hulu = hulu.dropna(subset=["Title"])
hulu["Title"].isna().sum()

In [None]:
hulu.duplicated().value_counts()

In [None]:
hulu[hulu.duplicated(keep=False)]

In [None]:
# drop_duplicates returns a new DataFrame, so assign it back to keep the result
hulu = hulu.drop_duplicates()
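
The Apple and Hulu cells repeat the same cleaning steps, so they could be folded into one helper and reused for Netflix and Prime. The sketch below is one possible shape. The names `clean_service` and `COLUMN_NAMES` are hypothetical, and the nullable `"Int64"` cast is an assumption about how missing years and vote counts should be handled.

```python
import pandas as pd

# Same renames as the Apple and Hulu cells; the constant name is hypothetical.
COLUMN_NAMES = {
    "title": "Title",
    "type": "Type",
    "genres": "Genres",
    "releaseYear": "Release Year",
    "imdbId": "IMDb ID",
    "imdbAverageRating": "IMDb Average Rating",
    "imdbNumVotes": "IMDb Num Votes",
}


# Hypothetical helper mirroring the per-service steps above.
def clean_service(df, service_name):
    """Return a cleaned copy of one service's DataFrame."""
    cleaned = (
        df.drop(columns="availableCountries", errors="ignore")
        .rename(columns=COLUMN_NAMES)
        .dropna(subset=["Title"])      # remove rows with no title
        .drop_duplicates()             # remove exact duplicate rows
        .astype({"Release Year": "Int64", "IMDb Num Votes": "Int64"})
    )
    # Tag every row with the service it came from
    return cleaned.assign(**{"Service Name": service_name})


# Usage, assuming the raw DataFrames are loaded as above:
# hulu = clean_service(hulu, "Hulu")
```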


In [None]:
# hulu[["Genre 1","Genre 2","Genre 3","Genre 4","Genre 5","Genre 6"]] = hulu["Genres"].str.split(',',expand=True)

# display(hulu)

# count = hulu.count()
# print(count)
# column_Counts = hulu["Genres"].value_counts()
# print(column_Counts)
hulu["Genres"].max()


In [None]:
# View top 5 rows in the netflix df
netflix.head()

In [None]:
# Add a column that identifies the streaming service & display the bottom two rows
netflix["Service Name"] = "Netflix"
netflix.tail(2)

In [None]:
# View top 5 rows of prime df
prime.head()

In [None]:
# Add a column that identifies the streaming service & display the bottom two rows
prime["Service Name"] = "Prime"
prime.tail(2)

# Idea: use a map function to create the column and populate every row of each df with that df's name

# Idea: create a dictionary keyed by df name and use the key to populate the column
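
The dictionary idea noted above could be sketched as follows. The toy frames stand in for the real DataFrames. Only the pattern (dictionary key becomes the "Service Name" value) is the point.

```python
import pandas as pd

# Toy stand-ins for the real DataFrames; the values are made up.
frames = {
    "AppleTV": pd.DataFrame({"title": ["A"]}),
    "Hulu": pd.DataFrame({"title": ["B", "C"]}),
}

# The dictionary key becomes the value of the new column for every row.
tagged = [df.assign(**{"Service Name": name}) for name, df in frames.items()]
combined = pd.concat(tagged, ignore_index=True)
print(combined)
```

`assign` returns a copy, so the original frames are left untouched.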

In [None]:
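# Netflix and Prime still have the original column names, so apply the same renames
# used for Apple and Hulu; otherwise concat creates separate "title" and "Title" columns
column_names = {"title":"Title","type":"Type","genres":"Genres","releaseYear":"Release Year","imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"}
netflix = netflix.rename(columns=column_names)
prime = prime.rename(columns=column_names)

# Stack the four DataFrames on top of each other now that they share column names
streaming = pd.concat([apple, hulu, netflix, prime], axis = 0)


In [None]:
streaming.info()

In [None]:
streaming.describe()

In [None]:
streaming.duplicated().value_counts()

In [None]:
streaming[streaming.duplicated(keep=False)]

In [None]:
streaming["Title"].isnull().value_counts()


In [None]:
streaming["IMDb ID"].isnull().value_counts()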

In [None]:
# # Assign dataset names & combine to read in as separate data frames

# list_of_names = ['Netflix','Hulu','Prime','AppleTV']

# # Created an empty list into which I can place the datasets
# combined_list = []

# # Loop over the names and append each dataset to the empty list created above
# for i in range(len(list_of_names)):
#     temp_df = pd.read_csv(list_of_names[i]+".csv")
#     combined_list.append(temp_df)

### **Initial Checks**

In [None]:
# # Looking at file content of list in index 0 - Netflix - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[0].info()
# combined_list[0].head(5)



In [None]:
# # Looking at file content of list in index 1 - Hulu - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[1].info()
# combined_list[1].head(5)

In [None]:
# # Looking at file content of list in index 2 - Prime - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[2].info()
# combined_list[2].head(5)

In [None]:
# # Looking at file content of list in index 3 - AppleTV - to determine column names, non-null counts, datatypes, and what the top five rows look like
# combined_list[3].info()
# combined_list[3].head(5)

### Initial Review (02/10/2025)
- The "availableCountries" column has too few non-null values to be useful going forward
- No dataset has a column for the source of the data, so this will need to be added

### Secondary Review (02/12/2025)
- Each dataset contains an "imdbId" column with the listing's IMDb ID, which is a string; this may come in handy 
- There are 3,514 missing titles
    - Prime = 2234
    - AppleTV = 651
    - Hulu = 2885
    - Netflix = 629
- There are 2,590 duplicated rows; many come from rows with missing titles, though several are repeated title names
    - Apple = 2
    - Hulu = 651
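
Per-service counts like the ones listed above can be computed in one pass by grouping on "Service Name". The sketch below uses a made-up combined frame, so its numbers are illustrative only and unrelated to the figures in this review.

```python
import pandas as pd

# Toy combined frame; the values are made up and only illustrate the calculation.
streaming_demo = pd.DataFrame({
    "Title": ["A", None, "B", "B", None],
    "Service Name": ["AppleTV", "AppleTV", "Hulu", "Hulu", "Prime"],
})

service = streaming_demo["Service Name"]

# Summing booleans per service counts the True values.
missing_titles = streaming_demo["Title"].isna().groupby(service).sum()
duplicate_rows = streaming_demo.duplicated().groupby(service).sum()

summary = pd.DataFrame({
    "Missing Titles": missing_titles,
    "Duplicate Rows": duplicate_rows,
})
print(summary)
```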


### Breakdown:
#### Apple
- 565 Missing Titles
- 2 Duplicate Rows
#### Hulu
- 651 Missing Titles
- 2 Duplicate Rows