# <u>**Streaming Service Comparison**</u>

### **Objective:**
- Determine which streaming platform hosts the majority of content I enjoy so that I can pare down the services to which I subscribe. 

### **Data Sources:**
- [Netflix via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-netflix-dataset)
- [Hulu via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-hulu-dataset)
- [Prime via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-amazon-prime-dataset/data)
- [AppleTV via Kaggle.com](https://www.kaggle.com/datasets/octopusteam/full-apple-tv-dataset)

## **Data Collection & Loading**

### **Import Pandas, NumPy, Matplotlib, Wordcloud, and PIL**

In [1]:
import glob
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import wordcloud as wc
from PIL import Image

### **Data Load**

In [2]:
# Load in each file separately. 
apple = pd.read_csv("AppleTV.csv")
hulu = pd.read_csv("Hulu.csv")
netflix = pd.read_csv("Netflix.csv")
prime = pd.read_csv("Prime.csv")
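
Because the four files share one layout, the loads could also be driven from a mapping of service name to file name. The sketch below is only one way to do it: `SERVICE_FILES` and `load_catalogs` are hypothetical names that do not appear elsewhere in this notebook, and the file names are assumed to be the ones used in the cell above. Files that are not found are skipped with a message rather than raising.

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping of service name -> CSV file name (names taken from the cell above).
SERVICE_FILES = {
    "AppleTV": "AppleTV.csv",
    "Hulu": "Hulu.csv",
    "Netflix": "Netflix.csv",
    "Prime": "Prime.csv",
}


def load_catalogs(files, base_dir="."):
    """Read every CSV that exists under base_dir into a dict keyed by service name."""
    frames = {}
    for service, filename in files.items():
        path = Path(base_dir) / filename
        if not path.exists():
            # Assumption: a missing file should be reported, not treated as fatal.
            print(f"Skipping {service}: {path} not found")
            continue
        frames[service] = pd.read_csv(path)
    return frames


catalogs = load_catalogs(SERVICE_FILES)
for service, frame in catalogs.items():
    print(service, frame.shape)
```

Keeping the frames in a dict means the service name travels with the data, which helps later when a source column is needed.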

### **Preliminary View**

In [3]:
# View the head of each dataframe

# Collect the dataframes loaded above into a list
dataframes = [apple, hulu, netflix, prime]
# Display the head of each dataframe separately
for df in dataframes:
    display(df.head())

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113546.0,
1,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2354158.0,
2,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1241156.0,
3,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,478085.0,
4,Metropolis,movie,"Drama, Sci-Fi",1927.0,tt0017136,8.3,192628.0,


Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8991.0,
1,Shadows in Paradise,movie,"Comedy, Drama, Music",1986.0,tt0092149,7.5,7792.0,
2,Finding Nemo,movie,"Adventure, Animation, Comedy",2003.0,tt0266543,8.2,1149529.0,
3,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2354158.0,
4,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,522699.0,


Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,American Beauty,movie,Drama,1999.0,tt0169547,8.3,1241156.0,
1,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,522699.0,
2,Kill Bill: Vol. 1,movie,"Action, Crime, Thriller",2003.0,tt0266697,8.2,1238778.0,
3,Jarhead,movie,"Biography, Drama, War",2005.0,tt0418763,7.0,214024.0,
4,Unforgiven,movie,"Drama, Western",1992.0,tt0105695,8.2,449594.0,


Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries
0,Ariel,movie,"Comedy, Crime, Romance",1988.0,tt0094675,7.4,8991.0,
1,Four Rooms,movie,Comedy,1995.0,tt0113101,6.7,113546.0,
2,Judgment Night,movie,"Action, Crime, Drama",1993.0,tt0107286,6.6,19686.0,
3,Forrest Gump,movie,"Drama, Romance",1994.0,tt0109830,8.8,2354158.0,
4,Citizen Kane,movie,"Drama, Mystery",1941.0,tt0033467,8.3,478085.0,


### **Initial Insights**
- Each dataframe contains the same column headings
- None of the dataframes contain a column for the source of the dataframe
- All numeric columns are stored as floats, although two of them hold whole numbers
    - Change ***releaseYear*** and ***imdbNumVotes*** to integers
- The majority of the titles displayed contain multiple genres in the ***genres*** column
- The ***imdbId*** values appear to match across dataframes for the same title
    - *American Beauty* in the 1st and 3rd dataframes
    - *Forrest Gump* in the 1st, 2nd and 4th dataframes
- Of all the rows displayed, none include ***availableCountries*** data
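
The observation about matching ***imdbId*** values can be checked mechanically. The sketch below is a minimal illustration that uses only the (title, imdbId) pairs visible in the `head()` output above, assuming the display order follows `dataframes = [apple, hulu, netflix, prime]`. The names `heads`, `tagged`, and `shared` are illustrative, not part of the notebook.

```python
import pandas as pd

# (title, imdbId) pairs copied from the head() output above, keyed by service.
# Assumption: the four outputs appear in the order apple, hulu, netflix, prime.
heads = {
    "AppleTV": [("Four Rooms", "tt0113101"), ("Forrest Gump", "tt0109830"),
                ("American Beauty", "tt0169547"), ("Citizen Kane", "tt0033467"),
                ("Metropolis", "tt0017136")],
    "Hulu": [("Ariel", "tt0094675"), ("Shadows in Paradise", "tt0092149"),
             ("Finding Nemo", "tt0266543"), ("Forrest Gump", "tt0109830"),
             ("The Fifth Element", "tt0119116")],
    "Netflix": [("American Beauty", "tt0169547"), ("The Fifth Element", "tt0119116"),
                ("Kill Bill: Vol. 1", "tt0266697"), ("Jarhead", "tt0418763"),
                ("Unforgiven", "tt0105695")],
    "Prime": [("Ariel", "tt0094675"), ("Four Rooms", "tt0113101"),
              ("Judgment Night", "tt0107286"), ("Forrest Gump", "tt0109830"),
              ("Citizen Kane", "tt0033467")],
}

# Stack the pairs into one frame, tagging each row with its service.
tagged = pd.concat(
    [pd.DataFrame(rows, columns=["title", "imdbId"]).assign(service=name)
     for name, rows in heads.items()],
    ignore_index=True,
)

# For each IMDb ID, build a comma-separated list of the services that carry it.
services_per_id = tagged.groupby("imdbId")["service"].agg(lambda s: ", ".join(sorted(s)))

# Keep only the IDs that appear on more than one service.
shared = services_per_id[services_per_id.str.contains(", ", regex=False)]
print(shared)
```

Matching on ***imdbId*** rather than on title avoids false matches between different works that share a name, though rows with a missing ID would need separate handling.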


In [4]:
# Using the same list from above, display the info for each dataframe to confirm column names and data types
# (info() prints its report directly and returns None, so display() also echoes a "None" line)
for dframe in dataframes:
    display(dframe.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18281 entries, 0 to 18280
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               17700 non-null  object 
 1   type                18281 non-null  object 
 2   genres              17607 non-null  object 
 3   releaseYear         18249 non-null  float64
 4   imdbId              16775 non-null  object 
 5   imdbAverageRating   16363 non-null  float64
 6   imdbNumVotes        16363 non-null  float64
 7   availableCountries  82 non-null     object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10259 entries, 0 to 10258
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               9610 non-null   object 
 1   type                10259 non-null  object 
 2   genres              9911 non-null   object 
 3   releaseYear         10221 non-null  float64
 4   imdbId              9193 non-null   object 
 5   imdbAverageRating   8885 non-null   float64
 6   imdbNumVotes        8885 non-null   float64
 7   availableCountries  43 non-null     object 
dtypes: float64(3), object(5)
memory usage: 641.3+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20877 entries, 0 to 20876
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               20225 non-null  object 
 1   type                20877 non-null  object 
 2   genres              20528 non-null  object 
 3   releaseYear         20842 non-null  float64
 4   imdbId              19356 non-null  object 
 5   imdbAverageRating   19166 non-null  float64
 6   imdbNumVotes        19166 non-null  float64
 7   availableCountries  166 non-null    object 
dtypes: float64(3), object(5)
memory usage: 1.3+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70294 entries, 0 to 70293
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   title               68616 non-null  object 
 1   type                70294 non-null  object 
 2   genres              68180 non-null  object 
 3   releaseYear         70107 non-null  float64
 4   imdbId              63868 non-null  object 
 5   imdbAverageRating   61528 non-null  float64
 6   imdbNumVotes        61528 non-null  float64
 7   availableCountries  360 non-null    object 
dtypes: float64(3), object(5)
memory usage: 4.3+ MB


None
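
The non-null counts above can be turned into a share of missing values per column, which makes the gaps easier to compare across services. The sketch below simply re-enters a few of the counts from the `info()` output, again assuming the outputs follow the order apple, hulu, netflix, prime. The names `totals`, `non_null`, and `missing_pct` are illustrative.

```python
import pandas as pd

# Row totals copied from the RangeIndex lines of the info() output above.
totals = pd.Series({"AppleTV": 18281, "Hulu": 10259, "Netflix": 20877, "Prime": 70294})

# Non-null counts for a few columns, copied from the same output.
non_null = pd.DataFrame({
    "title": {"AppleTV": 17700, "Hulu": 9610, "Netflix": 20225, "Prime": 68616},
    "imdbId": {"AppleTV": 16775, "Hulu": 9193, "Netflix": 19356, "Prime": 63868},
    "imdbAverageRating": {"AppleTV": 16363, "Hulu": 8885, "Netflix": 19166, "Prime": 61528},
    "availableCountries": {"AppleTV": 82, "Hulu": 43, "Netflix": 166, "Prime": 360},
})

# Percentage of rows with a missing value, per service and column.
missing_pct = ((1 - non_null.div(totals, axis=0)) * 100).round(1)
print(missing_pct)
```

On the loaded dataframes themselves, `dframe.isna().mean()` would give the same share directly as a fraction.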

In [5]:
# Change releaseYear and imdbNumVotes to integers.
# Both columns contain missing values (see the info() output above), which a plain
# np.int64 cast cannot represent, so use pandas' nullable "Int64" dtype instead.
for dframe in dataframes:
    dframe[["releaseYear","imdbNumVotes"]] = dframe[["releaseYear","imdbNumVotes"]].astype("Int64")
    # Display the data types of the dataframe just converted to confirm the change
    display(dframe.dtypes)

title                  object
type                   object
genres                 object
releaseYear             Int64
imdbId                 object
imdbAverageRating     float64
imdbNumVotes            Int64
availableCountries     object
dtype: object

title                  object
type                   object
genres                 object
releaseYear             Int64
imdbId                 object
imdbAverageRating     float64
imdbNumVotes            Int64
availableCountries     object
dtype: object

title                  object
type                   object
genres                 object
releaseYear             Int64
imdbId                 object
imdbAverageRating     float64
imdbNumVotes            Int64
availableCountries     object
dtype: object

title                  object
type                   object
genres                 object
releaseYear             Int64
imdbId                 object
imdbAverageRating     float64
imdbNumVotes            Int64
availableCountries     object
dtype: object
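
Both ***releaseYear*** and ***imdbNumVotes*** contain missing values according to the `info()` output, and NaN has no representation in a plain `int64` column. The sketch below shows, on a toy frame with made-up gaps, how pandas' nullable `Int64` dtype keeps the missing entries as `<NA>` while storing the rest as integers. The frame `toy` and the list `int_cols` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical gaps) with one missing release year and one missing vote count.
toy = pd.DataFrame({
    "releaseYear": [1995.0, np.nan, 1999.0],
    "imdbNumVotes": [113546.0, 2354158.0, np.nan],
})

int_cols = ["releaseYear", "imdbNumVotes"]
for col in int_cols:
    # "Int64" (capital I) is the nullable integer dtype; NaN is converted to pd.NA.
    toy[col] = toy[col].astype("Int64")

print(toy.dtypes)
print(toy)
```

The cast assumes every non-missing value is a whole number, which holds for years and vote counts.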

In [6]:
# Rename the columns of each dataframe (availableCountries keeps its original name)
new_names = {"title":"Title","type":"Type","genres":"Combined Genres","releaseYear":"Release Year",
             "imdbId":"IMDb ID","imdbAverageRating":"IMDb Average Rating","imdbNumVotes":"IMDb Num Votes"}
for dframe in dataframes:
    dframe.rename(columns=new_names, inplace=True)
# Check the result on the last dataframe in the list
dframe.columns



Index(['Title', 'Type', 'Combined Genres', 'Release Year', 'IMDb ID',
       'IMDb Average Rating', 'IMDb Num Votes', 'availableCountries'],
      dtype='object')

In [None]:
# Label each dataframe with its streaming service.
# A first attempt compared whole dataframes (`if value == apple:`) and raised
# "ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
# because == on dataframes is elementwise. Pairing each dataframe with its name avoids the comparison.
dataframes = [apple, hulu, netflix, prime]
service_names = ["AppleTV", "Hulu", "Netflix", "Prime"]

for dframe, name in zip(dataframes, service_names):
    # Assigning a scalar fills the whole column with the service name
    dframe["Service Name"] = name
    display(dframe.head())

# https://www.geeksforgeeks.org/python-creating-a-pandas-dataframe-column-based-on-a-given-condition/?ref=lbp

In [None]:
# # Add a column to each dataframe to identify the data source
# apple["Service Name"] = "AppleTV"
# hulu["Service Name"] = "Hulu"
# netflix["Service Name"] = "Netflix"
# prime["Service Name"] = "Prime"

# df.columns
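
Once each dataframe carries a ***Service Name*** column, the catalogs could be stacked into a single table for side-by-side comparison. The sketch below is one possible way to do this, shown on toy stand-ins built from a few of the rows displayed earlier. The names `labelled` and `combined` are illustrative, and using `assign` is a design choice made here so that the original frames are not modified.

```python
import pandas as pd

# Toy stand-ins (small subsets of the rows shown earlier), using the renamed columns.
apple = pd.DataFrame({"Title": ["Four Rooms", "Forrest Gump"],
                      "IMDb ID": ["tt0113101", "tt0109830"]})
hulu = pd.DataFrame({"Title": ["Ariel", "Forrest Gump"],
                     "IMDb ID": ["tt0094675", "tt0109830"]})

labelled = {"AppleTV": apple, "Hulu": hulu}

# assign() returns a labelled copy of each frame; concat stacks the copies row-wise.
combined = pd.concat(
    [frame.assign(**{"Service Name": name}) for name, frame in labelled.items()],
    ignore_index=True,
)

print(combined)
print(combined["Service Name"].value_counts())
```

From a table like this, counting rows per service for a chosen list of titles or genres would address the stated objective directly.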