# <u>**Streaming Service Comparison**</u>

### **Objective:**
- Determine which streaming platform hosts the majority of content I enjoy so that I can pare down the services to which I subscribe. 

### **Data Sources:**
- [Netflix via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
- [Hulu via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
- [Prime via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset/data)
- [AppleTV via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)

## <u>**Data Collection & Loading**</u>


### **Import Pandas, NumPy, Matplotlib, Wordcloud, and PIL**

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

### **Data Load**

In [None]:
# Dataset names; each one matches a CSV file in the working directory
list_of_names = ['Netflix', 'Hulu', 'Prime', 'AppleTV']

# Empty list that will hold one DataFrame per dataset
combined_list = []

# Loop over the names, read each CSV into its own DataFrame, and append it to the list
for name in list_of_names:
    temp_df = pd.read_csv(name + ".csv")
    combined_list.append(temp_df)

### **Initial Checks**

In [None]:
# Inspect the Netflix DataFrame (index 0): column names, non-null counts, dtypes, and the first five rows
combined_list[0].info()
combined_list[0].head(5)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20736 entries, 0 to 20735
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               20107 non-null  object 
 1   type                20736 non-null  object 
 2   genres              20391 non-null  object 
 3   releaseYear         20702 non-null  float64
 4   imdbId              19238 non-null  object 
 5   imdbAverageRating   19049 non-null  float64
 6   imdbNumVotes        19049 non-null  float64
 7   availableCountries  167 non-null    object 
dtypes: float64(3), object(5)
memory usage: 1.3+ MB


| | title | type | genres | releaseYear | imdbId | imdbAverageRating | imdbNumVotes | availableCountries |
|---|---|---|---|---|---|---|---|---|
| 0 | American Beauty | movie | Drama | 1999.0 | tt0169547 | 8.3 | 1238903.0 | |
| 1 | The Fifth Element | movie | Action, Adventure, Sci-Fi | 1997.0 | tt0119116 | 7.6 | 521815.0 | |
| 2 | Kill Bill: Vol. 1 | movie | Action, Crime, Thriller | 2003.0 | tt0266697 | 8.2 | 1236086.0 | |
| 3 | Jarhead | movie | Biography, Drama, War | 2005.0 | tt0418763 | 7.0 | 213667.0 | |
| 4 | Unforgiven | movie | Drama, Western | 1992.0 | tt0105695 | 8.2 | 448833.0 | |


In [None]:
# Inspect the Hulu DataFrame (index 1): column names, non-null counts, dtypes, and the first five rows
combined_list[1].info()
combined_list[1].head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10203 entries, 0 to 10202
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               9552 non-null   object 
 1   type                10203 non-null  object 
 2   genres              9853 non-null   object 
 3   releaseYear         10166 non-null  float64
 4   imdbId              9138 non-null   object 
 5   imdbAverageRating   8830 non-null   float64
 6   imdbNumVotes        8830 non-null   float64
 7   availableCountries  44 non-null     object 
dtypes: float64(3), object(5)
memory usage: 637.8+ KB


| | title | type | genres | releaseYear | imdbId | imdbAverageRating | imdbNumVotes | availableCountries |
|---|---|---|---|---|---|---|---|---|
| 0 | Ariel | movie | Comedy, Crime, Romance | 1988.0 | tt0094675 | 7.4 | 8949.0 | |
| 1 | Shadows in Paradise | movie | Comedy, Drama, Music | 1986.0 | tt0092149 | 7.5 | 7727.0 | |
| 2 | Forrest Gump | movie | Drama, Romance | 1994.0 | tt0109830 | 8.8 | 2348885.0 | |
| 3 | The Fifth Element | movie | Action, Adventure, Sci-Fi | 1997.0 | tt0119116 | 7.6 | 521815.0 | |
| 4 | My Life Without Me | movie | Drama, Romance | 2003.0 | tt0314412 | 7.4 | 26140.0 | |


In [16]:
# Inspect the Prime DataFrame (index 2): column names, non-null counts, dtypes, and the first five rows
combined_list[2].info()
combined_list[2].head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70055 entries, 0 to 70054
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               68386 non-null  object 
 1   type                70055 non-null  object 
 2   genres              67939 non-null  object 
 3   releaseYear         69868 non-null  float64
 4   imdbId              63648 non-null  object 
 5   imdbAverageRating   61316 non-null  float64
 6   imdbNumVotes        61316 non-null  float64
 7   availableCountries  373 non-null    object 
dtypes: float64(3), object(5)
memory usage: 4.3+ MB


| | title | type | genres | releaseYear | imdbId | imdbAverageRating | imdbNumVotes | availableCountries |
|---|---|---|---|---|---|---|---|---|
| 0 | Ariel | movie | Comedy, Crime, Romance | 1988.0 | tt0094675 | 7.4 | 8949.0 | |
| 1 | Four Rooms | movie | Comedy | 1995.0 | tt0113101 | 6.7 | 113403.0 | |
| 2 | Judgment Night | movie | Action, Crime, Drama | 1993.0 | tt0107286 | 6.6 | 19627.0 | |
| 3 | Forrest Gump | movie | Drama, Romance | 1994.0 | tt0109830 | 8.8 | 2348885.0 | |
| 4 | Citizen Kane | movie | Drama, Mystery | 1941.0 | tt0033467 | 8.3 | 477481.0 | |


In [17]:
# Inspect the AppleTV DataFrame (index 3): column names, non-null counts, dtypes, and the first five rows
combined_list[3].info()
combined_list[3].head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18208 entries, 0 to 18207
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               17643 non-null  object 
 1   type                18208 non-null  object 
 2   genres              17549 non-null  object 
 3   releaseYear         18179 non-null  float64
 4   imdbId              16727 non-null  object 
 5   imdbAverageRating   16314 non-null  float64
 6   imdbNumVotes        16314 non-null  float64
 7   availableCountries  84 non-null     object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


| | title | type | genres | releaseYear | imdbId | imdbAverageRating | imdbNumVotes | availableCountries |
|---|---|---|---|---|---|---|---|---|
| 0 | Four Rooms | movie | Comedy | 1995.0 | tt0113101 | 6.7 | 113403.0 | |
| 1 | Forrest Gump | movie | Drama, Romance | 1994.0 | tt0109830 | 8.8 | 2348885.0 | |
| 2 | American Beauty | movie | Drama | 1999.0 | tt0169547 | 8.3 | 1238903.0 | |
| 3 | Citizen Kane | movie | Drama, Mystery | 1941.0 | tt0033467 | 8.3 | 477481.0 | |
| 4 | Metropolis | movie | Drama, Sci-Fi | 1927.0 | tt0017136 | 8.3 | 192285.0 | |
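
The four checks above repeat the same calls on each dataset. As a possible shortcut, the sketch below puts the share of missing values per column into one table, with one row per dataset. The helper name `null_share` and the small stand-in frames are hypothetical and are not part of the notebook; in the notebook itself the inputs would presumably be `combined_list` and `list_of_names`.

```python
import pandas as pd

def null_share(frames, names):
    """Fraction of missing values per column, one row per dataset (hypothetical helper)."""
    # isna().mean() gives the share of nulls in each column of a frame
    return pd.DataFrame(
        {name: df.isna().mean() for name, df in zip(names, frames)}
    ).T

# Tiny stand-in frames so the sketch runs without the Kaggle CSVs (not real data)
demo_frames = [
    pd.DataFrame({"title": ["A", None], "availableCountries": [None, None]}),
    pd.DataFrame({"title": ["B", "C"], "availableCountries": ["US", None]}),
]
summary = null_share(demo_frames, ["Netflix", "Hulu"])
print(summary.round(2))
```

Calling `null_share(combined_list, list_of_names)` in the notebook should give a comparable table for the real data, which makes sparse columns easy to spot side by side.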


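### **Initial Review**
- The "availableCountries" column is almost entirely null in every dataset (for example, only 167 non-null values out of 20,736 rows for Netflix), so it will add little to the analysis going forward
- No dataset has a column identifying which platform the data came from, so a source column will need to be added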
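
One way to act on both review points is sketched below: drop the sparse column, label each frame with its platform, and stack the results. This is only an illustration under assumptions, not the notebook's actual next step. The helper name `tag_and_combine`, the column name `source`, and the stand-in frames are all hypothetical.

```python
import pandas as pd

def tag_and_combine(frames, names, drop_cols=("availableCountries",)):
    """Add a 'source' column to each frame, drop sparse columns, and stack them.

    Hypothetical helper: the 'source' column name is an assumption.
    """
    tagged = []
    for name, df in zip(names, frames):
        # errors="ignore" keeps this safe if a frame lacks the column
        cleaned = df.drop(columns=list(drop_cols), errors="ignore")
        tagged.append(cleaned.assign(source=name))
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(tagged, ignore_index=True)

# Tiny stand-in frames so the sketch runs without the Kaggle CSVs (not real data)
demo_frames = [
    pd.DataFrame({"title": ["American Beauty"], "availableCountries": [None]}),
    pd.DataFrame({"title": ["Ariel", "Forrest Gump"], "availableCountries": [None, None]}),
]
demo = tag_and_combine(demo_frames, ["Netflix", "Hulu"])
print(demo)
```

With the real data, `tag_and_combine(combined_list, list_of_names)` would return a single DataFrame in which every row carries its platform name, which is the shape needed to compare services later on.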