# Comparison of content between Netflix and Amazon Prime
## 1. Introduction
The online streaming domain has been heating up with the entry of Disney+, Apple, HBO Max and NBC Peacock. However, Netflix remains the biggest player in the market with Amazon Prime Video trailing behind it. 

In this project I compare the content available on Amazon Prime Video and Netflix, with Disney+ data gathered alongside, to gain insight into this online streaming war.

In [1]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import seaborn as sns
import json
import time

## 2. Data Gathering
The catalogs for both Amazon Prime Video and Netflix were scraped from reelgood.com, an online streaming aggregator that lets one browse all the online content in one place. <br>
The genres of the TV shows and movies were scraped from Finder.com, a service used for comparing products such as credit cards and mortgages.

In [62]:
# Extracting the content from reelgood.com using Beautiful Soup

def reel_good_scraping(total_content,url_base):
    """ Extracting the content from reelgood.com using Beautiful Soup
    Input: total number of titles in the catalog (int), used as the upper bound of the page offset, and base URL
    Output: Pandas Dataframe"""
    
    start= time.time()
    frames= [] # one dataframe per successfully scraped page

    for page in range(0,total_content,50): # offsets of all the pages in the website, in steps of 50

        print(page, end=',') # for telling the status of the current iteration

        time.sleep(np.random.randint(5,25)) # random pause of 5-24 s between requests

        try:

            # URL for the reelgood website
            url = url_base +str(page)

            # Extracting the HTML elements with Beautiful soup
            response= requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser' )

            # Finding the title cells and the detail cells once per page
            # (four detail cells per title: year, age group, imdb, rotten tomatoes)
            title_cells= soup.find_all('td', class_="css-1u7zfla e126mwsw1")
            detail_cells= soup.find_all('td', class_="css-1u11l3y")

            # Initiating empty lists to make the dataframe
            title= []
            year = []
            age_group= []
            imdb= []
            rt=[]

            for i, cell in enumerate(title_cells):
                # extracting the title from the soup element
                title.append(cell.find('a').contents[0])

                # extracting the year information from the soup element
                year.append(detail_cells[4*i].contents[0])

                # extracting the age group detail from the soup element
                age_group.append(detail_cells[4*i+1].contents[0])

                # extracting the imdb rating from the soup element
                imdb.append(detail_cells[4*i+2].contents[0])

                # extracting the rotten tomatoes rating from the soup element
                rt.append(detail_cells[4*i+3].contents[0])

            # forming a dataframe for the page only after every row has been parsed
            frames.append(pd.DataFrame({'title':title,'year':year,'age_group':age_group, 'imdb':imdb,'rotten_tomato':rt} ))

        except Exception: # skip the page on any request or parsing error, but still allow Ctrl+C
            print('Error on page:',page)
            continue

    end= time.time()
     
    print(round(end-start,0),'s')
    
    # combining the pages; an empty dataframe is returned if every page failed
    if not frames:
        return pd.DataFrame(columns=['title','year','age_group','imdb','rotten_tomato'])
    return pd.concat(frames)
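
To make the pagination concrete, the sketch below lists the URLs that the scraping loop would request for a given catalog size. `page_urls` is a hypothetical helper that is not part of the notebook, and its default step of 50 simply mirrors the `range` step used in the scraper.

```python
def page_urls(total_content, url_base, page_size=50):
    """Return the paginated URLs for a catalog (hypothetical helper).

    page_size=50 mirrors the step of the scraping loop; it is an assumption
    about how far the site's offset parameter advances per page."""
    return [url_base + str(offset) for offset in range(0, total_content, page_size)]


urls = page_urls(801, 'https://reelgood.com/source/disney_plus?offset=')
print(len(urls))   # number of requests the scraper would make
print(urls[0])     # first page
print(urls[-1])    # last page
```

For the Disney+ call this yields the offsets 0 to 800, which matches the progress line printed by the scraper.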

In [23]:
# scraping the Netflix content
df= reel_good_scraping(5801,'https://reelgood.com/source/netflix?offset=')

#Exporting the data to local hard drive
df.to_csv(r'C:\Users\srini\Projects\Online Streaming\netflix_shows.csv', index=False)

In [None]:
# scraping the Prime Video content
df= reel_good_scraping(15651,'https://reelgood.com/source/amazon?offset=')

#Exporting the data to local hard drive
df.to_csv(r'C:\Users\srini\Projects\Online Streaming\amazon.csv', index=False)

In [63]:
# scraping the Disney Plus content
df= reel_good_scraping(801,'https://reelgood.com/source/disney_plus?offset=')

#Exporting the data to local hard drive
df.to_csv(r'C:\Users\srini\Projects\Online Streaming\disney_plus.csv', index=False)

0,50,100,150,200,250,300,350,400,450,500,550,600,650,700,750,800,429.0 s


In [28]:
#Returns the movie/tv show genre and other details from finder.com

def genre_extract(url):
    """ Returns the movie/tv show genre and other details from finder.com
    args- url of finder.com
    output: dataframe with the movie/tv show information"""
    response= requests.get(url)
    return pd.read_html(response.content)[0]

In [29]:
# Extracting information for Netflix TV shows
df_netflix_tv= genre_extract('https://www.finder.com/netflix-tv-shows')

# Extracting information for Netflix movies
df_netflix_movie = genre_extract('https://www.finder.com/netflix-movies')

# Extracting information for Amazon Movies
df_amazon_movie = genre_extract('https://www.finder.com/amazon-prime-movies')

# Extracting information for Amazon TV shows
df_amazon_tv = genre_extract('https://www.finder.com/amazon-prime-tv-shows')

In [81]:
# Extracting information for Disney+ shows
df_disney_shows = genre_extract('https://www.finder.com/complete-list-disney-plus-movies-tv-shows-exclusives')

## 3. Data Wrangling
### 3.1 Merging dataframes
Combining the Netflix, Amazon and Disney+ catalogs into a common dataframe to help with analysis.

In [273]:
# Retrieving the data from local hard drive
df_netflix= pd.read_csv(r'C:\Users\srini\Projects\Online Streaming\netflix_shows.csv')
df_amazon= pd.read_csv(r'C:\Users\srini\Projects\Online Streaming\amazon.csv')
df_disney = pd.read_csv(r'C:\Users\srini\Projects\Online Streaming\disney_plus.csv')

In [274]:
df_netflix.head()

Unnamed: 0,title,year,age_group,imdb,rotten_tomato
0,Breaking Bad,2008,18+,9.5,96%
1,Inception,2010,13+,8.8,87%
2,Back to the Future,1985,7+,8.5,96%
3,The Matrix,1999,18+,8.7,88%
4,The Silence of the Lambs,1991,18+,8.6,96%


In [275]:
# Adding a column to indicate the streaming platform
df_netflix['streaming']= 'Netflix'
df_amazon['streaming']= 'Amazon'
df_disney['streaming']= 'Disney+'

In [276]:
df_amazon.head()

Unnamed: 0,title,year,age_group,imdb,rotten_tomato,streaming
0,The Silence of the Lambs,1991,18+,8.6,96%,Amazon
1,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,Amazon
2,The Pianist,2002,18+,8.5,95%,Amazon
3,The Avengers,2012,13+,8.0,92%,Amazon
4,Knives Out,2019,13+,7.9,97%,Amazon


In [277]:
# combining the dataframes
df= pd.concat([df_netflix, df_amazon, df_disney])
df.head()

Unnamed: 0,title,year,age_group,imdb,rotten_tomato,streaming
0,Breaking Bad,2008,18+,9.5,96%,Netflix
1,Inception,2010,13+,8.8,87%,Netflix
2,Back to the Future,1985,7+,8.5,96%,Netflix
3,The Matrix,1999,18+,8.7,88%,Netflix
4,The Silence of the Lambs,1991,18+,8.6,96%,Netflix


In [278]:
df.shape

(22313, 6)

### 3.2 Finding duplicates


In [279]:
# finding duplicate values
df.duplicated().sum()

2

In [280]:
# displaying the duplicate rows
df[df.duplicated()]

Unnamed: 0,title,year,age_group,imdb,rotten_tomato,streaming
3938,El día menos pensado,2020,,7.3,,Netflix
4612,Lucid Dream,2017,,6.1,,Netflix


In [281]:
# removing the duplicate values
df.drop_duplicates(inplace= True)

In [282]:
# Checking
df.duplicated().sum()

0

### 3.3 Resetting the index
Since we concatenated 3 dataframes we need to remove the duplicate indices.

In [283]:
df.reset_index(inplace= True)
df.head(1)

Unnamed: 0,index,title,year,age_group,imdb,rotten_tomato,streaming
0,0,Breaking Bad,2008,18+,9.5,96%,Netflix


In [284]:
df.drop(columns='index', inplace= True)
df.head(1)

Unnamed: 0,title,year,age_group,imdb,rotten_tomato,streaming
0,Breaking Bad,2008,18+,9.5,96%,Netflix
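
As a sketch of an alternative, `reset_index(drop=True)` renumbers the rows without first creating an `index` column that then has to be dropped. The two small dataframes below are made-up stand-ins for the platform catalogs.

```python
import pandas as pd

# Toy stand-ins for two platform catalogs (illustrative data only)
a = pd.DataFrame({'title': ['Breaking Bad', 'Inception']})
b = pd.DataFrame({'title': ['The Pianist', 'Knives Out']})

combined = pd.concat([a, b])                 # index repeats: 0, 1, 0, 1
combined = combined.reset_index(drop=True)   # index becomes 0, 1, 2, 3; no extra column
print(combined)
```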


### 3.4 Changing data type for the Rotten Tomatoes column
Converting the Rotten Tomatoes column from percentage strings (e.g. '96%') to float values.

In [285]:
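def rt_float_extract(x):
    """Function to extract the number from the Rotten Tomatoes column
    Input: single Rotten Tomatoes rating value, e.g. '96%'
    Output: float value (NaN if the rating is missing or malformed)"""
    try:
        # stripping the percent sign instead of slicing, so that '100%' and '5%' are parsed correctly
        temp= float(str(x).rstrip('%'))
    except ValueError:
        temp= np.nan
    return temp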

In [286]:
# extracting the digits from the Rotten Tomatoes column
df.rotten_tomato= df.rotten_tomato.apply(lambda x: rt_float_extract(x) )

In [287]:
df.head(2)

Unnamed: 0,title,year,age_group,imdb,rotten_tomato,streaming
0,Breaking Bad,2008,18+,9.5,96.0,Netflix
1,Inception,2010,13+,8.8,87.0,Netflix
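
A vectorized version of the same conversion can be sketched with pandas string methods. The ratings below are made-up values chosen to cover one-, two- and three-digit percentages plus a missing entry.

```python
import pandas as pd

# Illustrative ratings, including a missing value
ratings = pd.Series(['96%', '100%', '5%', None])

# Strip the trailing percent sign, then convert; anything unparsable becomes NaN
as_float = pd.to_numeric(ratings.str.rstrip('%'), errors='coerce')
print(as_float.tolist())
```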


### 3.5 Cleaning Netflix data from finder.com
- Extract whether a title is a Netflix original from the Title column, where 'Original' is appended to the title text. <br>
- Drop the 'Watch it' column from the TV show and movie dataframes. <br>
- Clean the titles in the TV show dataframe so that they do not contain the Season and Original info. <br>

In [82]:
df_netflix_movie.head()

Unnamed: 0,Title,Year of release,Runtime (mins),Genres,Watch it
0,#Rucker50,2016,56,Basketball Movies,Watch now
1,#Selfie,2014,125,Comedies,Watch now
2,#Selfie 69,2016,119,Comedies,Watch now
3,#cats_the_mewvie,2020,90,Canadian Movies,Watch now
4,#realityhighOriginal,2017,99,Comedies,Watch now


In [92]:
# Finding which titles are Netflix original movies
df_netflix_movie['original']= df_netflix_movie.Title.apply(lambda x: 'Original' in x)

In [95]:
# Finding which titles are Netflix original TV shows
df_netflix_tv['original']= df_netflix_tv.Title.apply(lambda x: 'Original' in x)

In [98]:
# dropping the Watch it column
df_netflix_tv.drop(columns='Watch it', inplace= True)
df_netflix_tv.head(1)

Unnamed: 0,Title,Year of release,Genres,original
0,100 HumansOriginalSeason 1 (8 episodes),2020,Science & Nature Docs Social & Cultural Docs D...,True


In [100]:
# dropping the Watch it column
df_netflix_movie.drop(columns='Watch it', inplace= True)
df_netflix_movie.head(1)

Unnamed: 0,Title,Year of release,Runtime (mins),Genres,original
0,#Rucker50,2016,56,Basketball Movies,False


In [106]:
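# Function to return the title of the TV show/movie with the Original, Season and Collection information removed
def title_cleaning(x):
    """Returns the part of the title before the first matching keyword (keywords are checked in list order).
    Note: returns None when the title contains none of the keywords."""
    for i in ['Original','Season','Collection']:
        if i in x:
            return x.split(i)[0]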

In [190]:
# making a new column with the cleaned title
df_netflix_tv['Title']= df_netflix_tv.Title.apply(lambda x: title_cleaning(x))
df_netflix_tv.head(1)

Unnamed: 0,Title,Year of release,Genres,original
0,100 Humans,2020,Science & Nature Docs Social & Cultural Docs D...,True
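
As a sketch of an alternative, a single regular expression can cut the title at the earliest marker and leave titles without any marker unchanged. `clean_title` and `MARKERS` are hypothetical names, not part of the notebook, and the behaviour differs slightly from `title_cleaning`: the cut happens at whichever keyword appears first in the string, and the result is stripped of surrounding whitespace.

```python
import re

# The three markers that finder.com appends to titles, per the cells above
MARKERS = re.compile(r'Original|Season|Collection')


def clean_title(raw_title):
    """Cut the title at the first marker; titles without a marker are returned as-is."""
    return MARKERS.split(raw_title, maxsplit=1)[0].strip()


print(clean_title('100 HumansOriginalSeason 1 (8 episodes)'))
print(clean_title('#Rucker50'))
```

A title that legitimately contains one of the marker words would still be truncated, so the result is worth spot-checking.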


### 3.6 Adding TV vs Movie column in the finder.com dataframes
- We need to add a column indicating whether the content is a TV show or a movie

In [194]:
df_netflix_tv['type']='TV'
df_netflix_movie['type']='Movie'
df_amazon_movie['type']='Movie'
df_amazon_tv['type']='TV'

In [203]:
# Since Disney+ does not have a separate column for movie/tv type we can extract that info by checking if the Title has the
# 'Season' keyword in it

df_disney_shows['type'] =df_disney_shows.Title.apply(lambda x: 'TV' if 'Season' in x else 'Movie')
df_disney_shows.head(3)

Unnamed: 0,Title,Year of release,Genres,type
0,Marvel Studios' Avengers: Endgame,2019,Superhero Fantasy Action-Adventure Science Fic...,Movie
1,Marvel Studios' Captain Marvel,2019,Superhero Action-Adventure Science Fiction,Movie
2,Marvel Studios' Iron Man 3,2013,Superhero Action-Adventure Science Fiction,Movie


### 3.7 Concatenating the dataframes
- We need to add a column for streaming service before merging the dataframes.

In [204]:
# adding a column for the streaming service provider before merging the dataframes

df_netflix_tv['streaming']='Netflix'
df_netflix_movie['streaming']='Netflix'
df_amazon_movie['streaming']='Amazon'
df_amazon_tv['streaming']= 'Amazon'
df_disney_shows['streaming']='Disney'

In [206]:
# Merging the dataframes together

df_genre =pd.concat([df_netflix_tv,df_netflix_movie,df_amazon_movie,df_amazon_tv,df_disney_shows], sort= False)
df_genre.shape

(15271, 7)

In [207]:
df_genre.head(2)

Unnamed: 0,Title,Year of release,Genres,original,type,streaming,Runtime (mins)
0,100 Humans,2020.0,Science & Nature Docs Social & Cultural Docs D...,True,TV,Netflix,
1,100% Hotter,2017.0,Reality TV Shows Makeover Reality TV British T...,False,TV,Netflix,


In [226]:
# resetting the index
df_genre.reset_index(inplace= True)
df_genre.drop(columns='index', inplace= True)

### 3.8 Modifying the genre column
The genre labels used for Netflix, Disney+ and Amazon differ and need to be standardized.

In [148]:
# Creating a master list that has all the different genres in all the dataframes
genre_master= set()

def genre_cleaning(x):
    """function to add the genre words of one row to the master set of genre categories"""
    # set is used to prevent duplicates; str() turns a missing value into the word 'nan'
    genre_master.update(word.lower() for word in str(x).split(' '))

In [208]:
# Creating a master list that has all the different genres in all the dataframes

df_genre.Genres.apply(lambda x: genre_cleaning(x));

In [209]:
len(genre_master)

221
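
The token collection can be illustrated on a few values. The sketch below uses two genre strings that appear in the Netflix movie table plus a missing value, and shows why the word 'nan' ends up in the master set.

```python
# Two genre strings from the table above and one missing value
genre_strings = ['Basketball Movies', 'Comedies', float('nan')]

tokens = set()
for genres in genre_strings:
    # str() converts the missing value to 'nan', which is then kept as a token
    tokens.update(word.lower() for word in str(genres).split(' '))

print(sorted(tokens))
```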

In [167]:
# Exporting the genre to csv file for easier analysis
with open(r'C:\Users\srini\Projects\Online Streaming\genre_master.csv', 'w') as file:
    file.write(str(genre_master))

In the CSV file I cleaned the genre words and grouped them into 20 categories for easier analysis. For example, action, action-adventure and adventure were grouped into the action genre.

In [210]:
# Creating a dictionary to group all similar types of genre categories together

genre_dic= {'action':['adventures','moviesaction','action-adventure','survival','adventure','action'],\
'sports': ['baseball','fitness','martial','wrestling','sports','boxing','basketball'],\
'thrillers_horror': ['thrillers','b-horror','thriller','dark','horror'],\
'comedy':['comedies','comic','moviescomedy','comics','sitcoms','comedy','stand-up'],\
'romantic':['romance','romantic','moviesromance'],'drama':[ 'crime','psychological','drama','tvdrama','police/cop',\
'k-dramas','tales','soap','musical','courtroom','wedding','melodrama','medical','teen','social','cult','survival',\
'moviesdrama','dark','family','dramas','fiction','reality','silent'], \
'others': ['novels','buddy','irreverent','independent','adult','lgbtq','parody','period','book','campy'],\
'misc':['incorrect','&','talk','age','movies','real','moviesprime','variety','books','films','tv','channels','/','country',\
'film','on','nan','release','together','new','procedural', 'for','show','competition','video','and','middle','light','of',\
 'issue','watch','prime','pieces','shows','series','a','coming','true','features','mecha','based','noir'],
'travel_life':[ 'travel','lifestyle','food','makeover','life','home','arts','nature','world','cultural','garden'],
'spiritual':[ 'spirituality','faith','spiritual'],\
'documentary_edu':['docuseries','docs','documentaries','documentary','historical','anthology','tvscience','biographical',\
 'animals','tvdocumentary','moviesdocumentary','mockumentaries','disaster','science','ecology'],\
'music':['hip-hop','music','opera','musical','concert','concerts','dance'],\
'military_political':[ 'military','spy/espionage','political','police/cop','politically'],\
'kids':['animal','webtoon','education',"kids'",'creature','disney','cartoons','children','animation','kids'],\
'anime':['shounen','anime','manga','seinen','animated','animation'],\
'mystery':['tvmystery','mysteries','moviesmystery','mystery'],\
'popular':['popular','favorites','moviestop-rated','top-rated','tvpopular','classic'],\
'game':['gamers','game'], 'fantasy':['fantasy','superhero','alien','cyborg','cyberpunk','sci-fi'],\
'regional':['israeli','mexican','british','african','zealand','dutch','polish','japanese','filipino','irish','thai',\
'romanian','k-dramas','austrian','international','spanish','latin','malaysian','swedish','australian','danish',\
 'hindi-language','russian','belgian','asian','colombian','korean','western','taiwanese','chinese','indian',\
'bengali-language','german','american','bollywood','westerns','french','western/folk','canadian','singaporean','italian',\
 'eastern','finnish','scandinavian','argentinian','telugu-language','brazilian','chilean']         
}

In [216]:
# Listing the genre categories
genre_dic.keys()

dict_keys(['action', 'sports', 'thrillers_horror', 'comedy', 'romantic', 'drama', 'others', 'misc', 'travel_life', 'spiritual', 'documentary_edu', 'music', 'military_political', 'kids', 'anime', 'mystery', 'popular', 'game', 'fantasy', 'regional'])
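
Because some words appear under more than one category (for example 'musical' under both drama and music), it can help to invert the dictionary into a word-to-categories lookup. The sketch below does this on a small made-up excerpt; `genre_groups`, `word_to_groups` and `groups_for` are hypothetical names.

```python
from collections import defaultdict

# Small illustrative excerpt in the same shape as the genre dictionary
genre_groups = {
    'drama': ['dramas', 'musical', 'crime'],
    'music': ['music', 'musical', 'concert'],
    'regional': ['korean', 'british'],
}

# Invert: each genre word maps to the set of categories that list it
word_to_groups = defaultdict(set)
for group, words in genre_groups.items():
    for word in words:
        word_to_groups[word].add(group)


def groups_for(tokens):
    """Return every category matched by a list of genre words."""
    found = set()
    for token in tokens:
        found |= word_to_groups.get(token, set())
    return found


print(groups_for(['tv', 'dramas', 'korean']))
print(groups_for(['musical']))
```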

In [229]:
# Formatting the Genres column to lowercase and splitting it on spaces into a list of words
df_genre['Genres']= df_genre['Genres'].apply(lambda x: str(x).lower().split(' ') )

In [265]:
# Adding new columns for each of the genre categories
for i in genre_dic.keys():
    df_genre[i]= None

# Function that changes the corresponding genre column value to 1 for each genre found in the Genres column

def genre_col_func(genre_data):
    """Function that changes the corresponding genre column value to 1 for each genre found in the
    Genres column. Relies on the global row counter `index`, which must be set to 0 before applying.
    Output: None"""
    global index
    
    for gd in genre_data: # Iterating through the Genres list for a row
        for gd_key in genre_dic.keys():  # Checking which key has its value matching with Genre word
            if gd in genre_dic[gd_key]:
                df_genre.loc[index, gd_key]=1    # updating the respective Genre column value to 1 if a match is found
    index +=1

In [266]:
index=0
df_genre.Genres.apply(lambda x: genre_col_func(x) );
df_genre.head()

Unnamed: 0,Title,Year of release,Genres,original,type,streaming,Runtime (mins),action,sports,thrillers_horror,...,documentary_edu,music,military_political,kids,anime,mystery,popular,game,fantasy,regional
0,100 Humans,2020.0,"[science, &, nature, docs, social, &, cultural...",True,TV,Netflix,,,,,...,1.0,,,,,,,,,
1,100% Hotter,2017.0,"[reality, tv, shows, makeover, reality, tv, br...",False,TV,Netflix,,,,,...,,,,,,,,,,1.0
2,12 Years Promise,2014.0,"[tv, comedies, tv, dramas, romantic, tv, comed...",False,TV,Netflix,,,,,...,,,,,,,,,,1.0
3,13 Reasons Why,2019.0,"[tv, mysteries, tv, dramas, crime, tv, dramas,...",True,TV,Netflix,,,,,...,,,,,,1.0,,,,
4,13 Reasons Why: Beyond the Reasons,2019.0,[docuseries],True,TV,Netflix,,,,,...,1.0,,,,,,,,,
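
The same flags can be computed without a global row counter by building one column per category. This is a sketch on toy data, not the notebook's method; `genre_groups` is a shortened, made-up excerpt of the genre dictionary and `toy` stands in for `df_genre`.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative excerpt of the genre dictionary
genre_groups = {
    'drama': ['dramas', 'crime'],
    'regional': ['korean', 'british'],
    'documentary_edu': ['docuseries', 'docs'],
}

# Toy frame with an already tokenized Genres column
toy = pd.DataFrame({
    'title': ['Show A', 'Show B', 'Show C'],
    'Genres': [['tv', 'dramas', 'korean', 'tv', 'shows'], ['docuseries'], ['tv', 'shows']],
})

for group, words in genre_groups.items():
    vocab = set(words)
    # 1.0 if any token of the row belongs to the category, NaN otherwise
    toy[group] = toy['Genres'].apply(
        lambda tokens, vocab=vocab: 1.0 if vocab.intersection(tokens) else np.nan
    )

print(toy.drop(columns='Genres'))
```

Since each column is derived from its own row's tokens, the result does not depend on the dataframe index running from 0 upward.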


In [288]:
# Creating a backup
df_copy= df.copy()
df_genre_copy= df_genre.copy()

In [347]:
df_genre= df_genre_copy

### 3.9 Minor cleaning
- Renaming a few columns in df_genre <br>
- Dropping the year column from df_genre <br>

In [348]:
# Renaming a few columns in df_genre
df_genre.rename(columns={'Runtime (mins)':'runtime','Title':'title', 'Year of release':'year'}, inplace= True)
df_genre.head()

Unnamed: 0,title,year,Genres,original,type,streaming,runtime,action,sports,thrillers_horror,...,documentary_edu,music,military_political,kids,anime,mystery,popular,game,fantasy,regional
0,100 Humans,2020.0,"[science, &, nature, docs, social, &, cultural...",True,TV,Netflix,,,,,...,1.0,,,,,,,,,
1,100% Hotter,2017.0,"[reality, tv, shows, makeover, reality, tv, br...",False,TV,Netflix,,,,,...,,,,,,,,,,1.0
2,12 Years Promise,2014.0,"[tv, comedies, tv, dramas, romantic, tv, comed...",False,TV,Netflix,,,,,...,,,,,,,,,,1.0
3,13 Reasons Why,2019.0,"[tv, mysteries, tv, dramas, crime, tv, dramas,...",True,TV,Netflix,,,,,...,,,,,,1.0,,,,
4,13 Reasons Why: Beyond the Reasons,2019.0,[docuseries],True,TV,Netflix,,,,,...,1.0,,,,,,,,,


In [349]:
# dropping year column from df_genre as df has that info
df_genre.drop(columns='year', inplace= True)

### 3.10 Missing values

In [350]:
# Rows with missing titles
df_genre[df_genre.title.isna()]

Unnamed: 0,title,Genres,original,type,streaming,runtime,action,sports,thrillers_horror,comedy,...,documentary_edu,music,military_political,kids,anime,mystery,popular,game,fantasy,regional
38,,"[tv, variety, &, talk, shows, korean, tv, show...",False,TV,Netflix,,,,,,...,,,,,,,,,,1
72,,"[tv, horror]",False,TV,Netflix,,,,1,,...,,,,,,,,,,
108,,"[tv, dramas, romantic, tv, dramas, chinese, tv...",False,TV,Netflix,,,,,,...,,,,,,,,,,1
178,,"[anime, series, anime, fantasy, anime, japanes...",False,TV,Netflix,,,,,,...,,,,,1,,,,1,1
188,,"[action, anime, anime, series, anime, fantasy,...",False,TV,Netflix,,1,,,,...,,,,,1,,,,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1731,,"[hip-hop, tv, dramas, crime, tv, dramas, tv, s...",False,TV,Netflix,,,,,,...,,1,,,,,,,,
1751,,"[tv, dramas, latin, american, tv, shows, argen...",False,TV,Netflix,,,,,,...,,,,,,,,,,1
1753,,"[drama, anime, anime, series, anime, fantasy, ...",False,TV,Netflix,,,,,,...,,,,,1,,,,1,1
1755,,"[tv, dramas, crime, tv, dramas, korean, tv, sh...",False,TV,Netflix,,,,1,,...,,,,,,,,,,1


In [351]:
df_genre.dropna(subset=['title'], inplace= True)
df_genre.isna().sum()

title                     0
Genres                    0
original               9714
type                      0
streaming                 0
runtime               11486
action                12470
sports                15093
thrillers_horror      12853
comedy                13403
romantic              14045
drama                  7978
others                14903
misc                   2588
travel_life           14910
spiritual             15190
documentary_edu       12713
music                 15011
military_political    15078
kids                  13011
anime                 14548
mystery               13258
popular               14121
game                  15176
fantasy               14651
regional              13376
dtype: int64

### 3.11 Cleaning Title for Disney TV shows

In [352]:
# cleaning disney title column
df_genre.loc[df_genre.query('streaming=="Disney" and type=="TV"').index, 'title']= \
df_genre.query('streaming=="Disney" and type=="TV"').title.apply(lambda x: x.split('Season')[0])

In [353]:
df_genre.query('streaming=="Disney" and type=="TV"').title

14258                                Marvel's Hero Project
14260                                    Marvel's Runaways
14262                                Marvel's Agent Carter
14263                                    Marvel's Inhumans
14264                     Marvel's Rocket & Groot (Shorts)
                               ...                        
15255                                  Disney Prop Culture
15256                                           Shop Class
15257                           Disney Stuck In The Middle
15258    Walt Disney Animation Studios: Short Circuit E...
15259                                           Zenimation
Name: title, Length: 247, dtype: object

### 3.12 Adding Regional Language section
- Adding a 'country' column listing the regional or language keywords (e.g. korean, hindi-language) found in each title's genres

In [372]:
df_genre['country']= df_genre['Genres'].apply(lambda x: str([y for y in genre_dic['regional'] if (y in x)])[1:-1])
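
As a sketch of a variant, joining the matches with `', '.join` produces a plain comma-separated string, whereas slicing the string form of a list keeps the quote characters around each word. `regional_words` is a made-up excerpt of the regional list and `regional_tags` is a hypothetical helper.

```python
# Illustrative excerpt of the regional keyword list
regional_words = ['korean', 'british', 'japanese']


def regional_tags(tokens):
    """Return the regional keywords present in a row's genre tokens, comma-separated."""
    return ', '.join(word for word in regional_words if word in tokens)


print(regional_tags(['tv', 'dramas', 'korean', 'tv', 'shows']))
print(regional_tags(['anime', 'japanese', 'british']))
print(repr(regional_tags(['docuseries'])))   # empty string when nothing matches
```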

## 4. Exploratory Data Analysis
We will be visualizing the data to find any trends in it.