# Anitej Isaac Sharma

## Research question/interests

Broadly, I want to examine how the presence of particular directors and actors on each streaming platform affects user engagement and satisfaction, which director and actor pairings are most common across platforms, and how these pairings vary by genre. I will close with recommendations, drawn from these insights, that can inform content acquisition and marketing strategies for each platform. The following questions guide the analysis:

- Which director(s) have directed the most movies across the four platforms? Who has directed the most TV shows?
- Which directors have directed movies and TV shows that are available on all four streaming platforms? Which directors appear on only one of the streaming platforms?
- Are there any TV shows or movies across the platforms that share the same cast as another TV show or movie?
- Which name appears most frequently in the casts on each platform? Which appears least frequently?
- For each director, which genres have they directed movies and TV shows in? Which genre is most prominent for each director?
- For each cast member, which genres have they appeared in? Which genre is most prominent for each cast member?

To answer these questions, I plan to filter and analyze each dataset, using data visualization and regression analysis to identify patterns and trends in user engagement and satisfaction associated with specific directors and actors. I will also examine the most common director and actor pairings across platforms and use them to suggest how each platform can tailor its content offerings to its users' needs and interests.

My exploration will be accompanied by charts comparing variables within the datasets, specific to directors and casts, to support a detailed analysis.
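
The director and actor pairings mentioned above could be counted roughly as follows. This is only a sketch: it assumes each row is one title, that the `director` and `cast` columns hold comma-separated names, and the function name `director_actor_pairs`, the `titles` column, and the toy names are made up for illustration.

```python
import pandas as pd

def director_actor_pairs(df):
    """Count how many titles each (director, cast member) pair shares."""
    pairs = df.dropna(subset=["director", "cast"]).copy()
    for col in ("director", "cast"):
        # Split the comma-separated string, give each name its own row, trim spaces.
        pairs[col] = pairs[col].str.split(",")
        pairs = pairs.explode(col, ignore_index=True)
        pairs[col] = pairs[col].str.strip()
    return (
        pairs.groupby(["director", "cast"])
        .size()
        .rename("titles")
        .reset_index()
        .sort_values("titles", ascending=False, ignore_index=True)
    )

# Toy rows with invented names, only to show the shape of the result.
toy_df = pd.DataFrame({
    "director": ["Ana Ruiz", "Ana Ruiz, Bo Chen", None],
    "cast": ["Xi Lee, Yo Kim", "Xi Lee", "Xi Lee"],
})
toy_pairs = director_actor_pairs(toy_df)
```

Rows with a missing director or cast are dropped, so on a platform with many missing values the pair counts would cover only a small share of titles.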

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
hulu_df = pd.read_csv("../data/raw/hulu_titles.csv")
amazon_df = pd.read_csv("../data/raw/amazon_prime_titles.csv")
netflix_df = pd.read_csv("../data/raw/netflix_titles.csv")
disney_df = pd.read_csv("../data/raw/disney_plus_titles.csv")
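
Because all four files share the same columns, cross-platform questions may be easier to answer from one stacked table. The sketch below uses small toy frames in place of the real ones; the `platform` column and the `combined` name are assumptions introduced here, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the four frames loaded above; the real ones come from the CSV files.
frames = {
    "Hulu": pd.DataFrame({"title": ["Gaia", "Settlers"], "type": ["Movie", "Movie"]}),
    "Amazon Prime": pd.DataFrame({"title": ["Monster Maker"], "type": ["Movie"]}),
    "Netflix": pd.DataFrame({"title": ["Ganglands", "Kota Factory"], "type": ["TV Show", "TV Show"]}),
    "Disney+": pd.DataFrame({"title": ["Ernest Saves Christmas"], "type": ["Movie"]}),
}

# Tag each frame with its platform, then stack them into one table.
combined = pd.concat(
    [df.assign(platform=name) for name, df in frames.items()],
    ignore_index=True,
)

# Number of titles per platform in the combined table.
titles_per_platform = combined.groupby("platform").size()
```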

In [None]:
hulu_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...
1,s2,Movie,Silent Night,,,,"October 23, 2021",2020,,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r..."
2,s3,Movie,The Marksman,,,,"October 23, 2021",2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...
3,s4,Movie,Gaia,,,,"October 22, 2021",2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...
4,s5,Movie,Settlers,,,,"October 22, 2021",2021,,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...


In [None]:
amazon_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...


In [None]:
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [None]:
disney_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",,"November 26, 2021",2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",,"November 26, 2021",1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,"November 26, 2021",2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",,"November 26, 2021",2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!"
4,s5,TV Show,The Beatles: Get Back,,"John Lennon, Paul McCartney, George Harrison, ...",,"November 25, 2021",2021,,1 Season,"Docuseries, Historical, Music",A three-part documentary from Peter Jackson ca...


# Dropping the Duplicates

In [None]:
# Remove any duplicates from the datasets
disney_df.drop_duplicates(inplace=True)
hulu_df.drop_duplicates(inplace=True)
netflix_df.drop_duplicates(inplace=True)
amazon_df.drop_duplicates(inplace=True)

# Shape of the Data Sets

In [None]:
# Check the shape of each dataset
print("Disney Plus dataset shape:", disney_df.shape)
print("Hulu dataset shape:", hulu_df.shape)
print("Netflix dataset shape:", netflix_df.shape)
print("Amazon Prime dataset shape:", amazon_df.shape)

Disney Plus dataset shape: (1450, 12)
Hulu dataset shape: (3073, 12)
Netflix dataset shape: (8807, 12)
Amazon Prime dataset shape: (9668, 12)


# Counting the number of missing values in each column of the Four Data Sets

In [None]:
missing_values_df = pd.concat([disney_df.isnull().sum(), hulu_df.isnull().sum(),
                               netflix_df.isnull().sum(), amazon_df.isnull().sum()], axis=1)
missing_values_df.columns = ["Disney+", "Hulu", "Netflix", "Amazon Prime"]
missing_values_df

Unnamed: 0,Disney+,Hulu,Netflix,Amazon Prime
show_id,0,0,0,0
type,0,0,0,0
title,0,0,0,0
director,473,3070,2634,2082
cast,190,3073,825,1233
country,219,1453,831,8996
date_added,3,28,10,9513
release_year,0,0,0,0
rating,3,520,4,337
duration,0,479,3,0
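
Raw counts are hard to compare because the datasets differ in size, so a percentage view may be more useful. A minimal sketch, assuming a dict of named frames; `missing_share` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_share(frames):
    """Percentage of missing values per column, one output column per named frame."""
    return pd.concat(
        {name: df.isnull().mean().mul(100).round(1) for name, df in frames.items()},
        axis=1,
    )

# Toy frames with made-up values, only to show the shape of the result.
toy_frames = {
    "Platform A": pd.DataFrame({"director": ["x", None, None, "y"],
                                "cast": ["p", "q", None, "r"]}),
    "Platform B": pd.DataFrame({"director": [None, None],
                                "cast": [None, "s"]}),
}
missing_pct = missing_share(toy_frames)
```

Applied to the counts above, this would show that about 99.9% of Hulu's director values are missing (3,070 of 3,073 rows) and about 98.4% of Amazon Prime's `date_added` values (9,513 of 9,668 rows).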


In [None]:
missing_values_df.to_csv("../data/processed/missing_values_df.csv", index=True)

# Which director(s) have directed the most movies across the four platforms? Who has directed the most TV shows?

## Netflix

In [None]:
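# Collect every director credited on a Netflix title, one list entry per credit.
netflix_directors = []
for i in netflix_df['director']:
    # Missing directors are stored as NaN (a float), so keep only string entries.
    if isinstance(i, str):
        # Titles with several directors list them comma-separated.
        # Names are not stripped, so with a ", " separator every name
        # after the first keeps a leading space.
        for j in i.split(","):
            netflix_directors.append(j)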

In [None]:
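from collections import Counter

# Count each name once up front instead of rescanning the list for every entry.
name_counts = Counter(netflix_directors)
# One count per list entry, aligned with netflix_directors.
appearances = [name_counts[i] for i in netflix_directors]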

In [None]:
data = {"Director": netflix_directors, "Appearances": appearances}
netflix_director_appearances = pd.DataFrame(data)
netflix_director_appearances = netflix_director_appearances.drop_duplicates().reset_index()
netflix_director_appearances = netflix_director_appearances.drop('index', axis=1)

In [None]:
maxnum = netflix_director_appearances['Appearances'].max()
row_indices = []
counter = 0
for i in netflix_director_appearances['Appearances']:
    if i == maxnum:
        row_indices.append(counter)
    counter += 1

In [None]:
netflix_director_appearances.loc[row_indices]

Unnamed: 0,Director,Appearances
282,Rajiv Chilaka,22


In [None]:
netflix_director_appearances.to_csv("../data/processed/netflix_director_appearances.csv", index=False)
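
The same tally can be sketched with pandas string methods. This is an alternative under stated assumptions, not the notebook's method: `director_counts` is a hypothetical helper, the toy names are invented, and because it strips whitespace around each name its counts may differ from the loop above, which keeps a leading space on every name after the first.

```python
import pandas as pd

def director_counts(df, column="director"):
    """Return one row per director with the number of titles they are credited on."""
    names = (
        df[column]
        .dropna()          # skip titles with no director listed
        .str.split(",")    # comma-separated credits -> list of names
        .explode()         # one row per credited name
        .str.strip()       # remove the space left after each comma
    )
    names = names[names != ""]
    return (
        names.value_counts()
        .rename_axis("Director")
        .reset_index(name="Appearances")
    )

# Toy frame with invented names.
toy_df = pd.DataFrame({"director": ["Ana Ruiz", "Ana Ruiz, Bo Chen", None, "Bo Chen , Ana Ruiz"]})
toy_counts = director_counts(toy_df)
```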

## Amazon Prime

In [None]:
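# Collect every director credited on an Amazon Prime title, one list entry per credit.
amazon_directors = []
for i in amazon_df['director']:
    # Missing directors are stored as NaN (a float), so keep only string entries.
    if isinstance(i, str):
        # Names are not stripped, so with a ", " separator every name
        # after the first keeps a leading space.
        for j in i.split(","):
            amazon_directors.append(j)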

In [None]:
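from collections import Counter

# Count each name once up front instead of rescanning the list for every entry.
name_counts = Counter(amazon_directors)
# One count per list entry, aligned with amazon_directors.
appearances = [name_counts[i] for i in amazon_directors]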

In [None]:
data = {"Director": amazon_directors, "Appearances": appearances}
amazon_director_appearances = pd.DataFrame(data)
amazon_director_appearances = amazon_director_appearances.drop_duplicates().reset_index()
amazon_director_appearances = amazon_director_appearances.drop('index', axis=1)

In [None]:
maxnum = amazon_director_appearances['Appearances'].max()
row_indices = []
counter = 0
for i in amazon_director_appearances['Appearances']:
    if i == maxnum:
        row_indices.append(counter)
    counter += 1

In [None]:
amazon_director_appearances.loc[row_indices]

Unnamed: 0,Director,Appearances
24,Mark Knight,113


In [None]:
amazon_director_appearances.to_csv("../data/processed/amazon_director_appearances.csv", index=False)

## Disney Plus

In [None]:
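# Collect every director credited on a Disney Plus title, one list entry per credit.
disney_directors = []
for i in disney_df['director']:
    # Missing directors are stored as NaN (a float), so keep only string entries.
    if isinstance(i, str):
        # Names are not stripped, so with a ", " separator every name
        # after the first keeps a leading space.
        for j in i.split(","):
            disney_directors.append(j)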

In [None]:
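from collections import Counter

# Count each name once up front instead of rescanning the list for every entry.
name_counts = Counter(disney_directors)
# One count per list entry, aligned with disney_directors.
appearances = [name_counts[i] for i in disney_directors]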

In [None]:
data = {"Director": disney_directors, "Appearances": appearances}
disney_director_appearances = pd.DataFrame(data)
disney_director_appearances = disney_director_appearances.drop_duplicates().reset_index()
disney_director_appearances = disney_director_appearances.drop('index', axis=1)

In [None]:
maxnum = disney_director_appearances['Appearances'].max()
row_indices = []
counter = 0
for i in disney_director_appearances['Appearances']:
    if i == maxnum:
        row_indices.append(counter)
    counter += 1

In [None]:
disney_director_appearances.loc[row_indices]

Unnamed: 0,Director,Appearances
197,Jack Hannah,17


In [None]:
disney_director_appearances.to_csv("../data/processed/disney_director_appearances.csv", index=False)

## Hulu

In [None]:
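# Collect the non-missing Hulu director entries as they are (no splitting here).
hulu_directors = []
for i in hulu_df['director']:
    # Missing directors are stored as NaN (a float), so keep only string entries.
    if isinstance(i, str):
        hulu_directors.append(i)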

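Only three Hulu rows have a non-missing director value (3,070 of 3,073 are missing, per the table above), and those entries contain more than just the directors' names, so we set the names manually.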

In [None]:
hulu_directors[0] = "Jennifer Kent"
hulu_directors[1] = "Gigi Saul Guerrero"
hulu_directors[2] = "Alex Winter"

In [None]:
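from collections import Counter

# Count each name once up front instead of rescanning the list for every entry.
name_counts = Counter(hulu_directors)
# One count per list entry, aligned with hulu_directors.
appearances = [name_counts[i] for i in hulu_directors]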

In [None]:
data = {"Director": hulu_directors, "Appearances": appearances}
hulu_director_appearances = pd.DataFrame(data)
hulu_director_appearances = hulu_director_appearances.drop_duplicates().reset_index()
hulu_director_appearances = hulu_director_appearances.drop('index', axis=1)

In [None]:
maxnum = hulu_director_appearances['Appearances'].max()
row_indices = []
counter = 0
for i in hulu_director_appearances['Appearances']:
    if i == maxnum:
        row_indices.append(counter)
    counter += 1

In [None]:
hulu_director_appearances.loc[row_indices]

Unnamed: 0,Director,Appearances
0,Jennifer Kent,1
1,Gigi Saul Guerrero,1
2,Alex Winter,1


In [None]:
hulu_director_appearances.to_csv("../data/processed/hulu_director_appearances.csv", index=False)
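
The question also asks about TV shows separately, while the counts above pool movies and TV shows. One way to split them is to group on the `type` column, sketched below. `top_directors_by_type` is a hypothetical helper, the toy names are invented, and names are stripped here, so results may differ slightly from the unstripped loops above.

```python
import pandas as pd

def top_directors_by_type(df):
    """Return the most-credited director(s) for each value of `type`."""
    exploded = (
        df.dropna(subset=["director"])
        .assign(director=lambda d: d["director"].str.split(","))
        .explode("director", ignore_index=True)
        .assign(director=lambda d: d["director"].str.strip())
    )
    counts = (
        exploded.groupby(["type", "director"])
        .size()
        .rename("Appearances")
        .reset_index()
    )
    # Keep every director tied for the maximum within each type.
    is_top = counts["Appearances"] == counts.groupby("type")["Appearances"].transform("max")
    return counts[is_top].sort_values(["type", "director"]).reset_index(drop=True)

# Toy frame with invented names.
toy_df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "director": ["Ana Ruiz", "Ana Ruiz, Bo Chen", "Bo Chen", None, "Bo Chen"],
})
toy_top = top_directors_by_type(toy_df)
```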

# Which directors have directed movies and TV shows that are available on all four streaming platforms? Which directors appear on only one of the streaming platforms?

## Directors who've appeared on only ONE platform

### Netflix

In [None]:
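# Names credited on any other platform; a set makes each membership test fast.
other_directors = set(amazon_directors) | set(disney_directors) | set(hulu_directors)
# Keep Netflix credits whose name appears on no other platform.
netflix_one = [i for i in netflix_directors if i not in other_directors]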

In [None]:
netflix_one = {"Directors": netflix_one}
netflix_one = pd.DataFrame(netflix_one).drop_duplicates().reset_index()
netflix_one = pd.DataFrame(netflix_one).drop('index', axis=1)
netflix_one

Unnamed: 0,Directors
0,Kirsten Johnson
1,Julien Leclercq
2,Mike Flanagan
3,Robert Cullen
4,José Luis Ucha
...,...
4332,Ivona Juka
4333,Mu Chu
4334,Chandra Prakash Dwivedi
4335,Majid Al Ansari


In [None]:
netflix_one.to_csv("../data/processed/netflix_one.csv", index=False)

### Amazon Prime

In [None]:
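# Names credited on any other platform; a set makes each membership test fast.
other_directors = set(netflix_directors) | set(disney_directors) | set(hulu_directors)
# Keep Amazon Prime credits whose name appears on no other platform.
amazon_one = [i for i in amazon_directors if i not in other_directors]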

In [None]:
amazon_one = {"Directors": amazon_one}
amazon_one = pd.DataFrame(amazon_one).drop_duplicates().reset_index()
amazon_one = pd.DataFrame(amazon_one).drop('index', axis=1)
amazon_one

Unnamed: 0,Directors
0,Don McKellar
1,Sonia Anderson
2,Giles Foster
3,Paul Weiland
4,Fran Strine
...,...
5542,Kristi Jacobson
5543,Lori Silverbush
5544,John-Paul Davidson
5545,Stephen Warbeck


In [None]:
amazon_one.to_csv("../data/processed/amazon_one.csv", index=False)

### Disney Plus

In [None]:
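# Names credited on any other platform; a set makes each membership test fast.
other_directors = set(netflix_directors) | set(amazon_directors) | set(hulu_directors)
# Keep Disney Plus credits whose name appears on no other platform.
disney_one = [i for i in disney_directors if i not in other_directors]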

In [None]:
disney_one = {"Directors": disney_one}
disney_one = pd.DataFrame(disney_one).drop_duplicates().reset_index()
disney_one = pd.DataFrame(disney_one).drop('index', axis=1)
disney_one

Unnamed: 0,Directors
0,Alonso Ramirez Ramos
1,Dave Wasson
2,John Cherry
3,Karen Disher
4,Hamish Hamilton
...,...
454,Hollingsworth Morse
455,Dave Michener
456,Gavin Hood
457,Dexter Fletcher


In [None]:
disney_one.to_csv("../data/processed/disney_one.csv", index=False)

### Hulu

In [None]:
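# Names credited on any other platform; a set makes each membership test fast.
other_directors = set(netflix_directors) | set(amazon_directors) | set(disney_directors)
# Keep Hulu credits whose name appears on no other platform.
hulu_one = [i for i in hulu_directors if i not in other_directors]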

In [None]:
hulu_one = {"Directors": hulu_one}
hulu_one = pd.DataFrame(hulu_one).drop_duplicates().reset_index()
hulu_one = pd.DataFrame(hulu_one).drop('index', axis=1)
hulu_one

Unnamed: 0,Directors
0,Jennifer Kent


In [None]:
hulu_one.to_csv("../data/processed/hulu_one.csv", index=False)
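
The four per-platform cells above repeat the same logic, which could be folded into one helper built on set difference. A sketch only: `exclusive_directors` and the toy names are hypothetical, and names are stripped before comparison, which the loops above do not do.

```python
def exclusive_directors(platform_directors):
    """Map each platform to the sorted names that appear on no other platform."""
    # Normalise whitespace and deduplicate each platform's names.
    cleaned = {p: {n.strip() for n in names} for p, names in platform_directors.items()}
    result = {}
    for platform, names in cleaned.items():
        # Union of the names on every other platform.
        others = set().union(*(v for k, v in cleaned.items() if k != platform))
        result[platform] = sorted(names - others)
    return result

# Toy input with invented names; note the leading space on " Bo Chen".
toy_lists = {
    "Netflix": ["Ana Ruiz", " Bo Chen"],
    "Amazon Prime": ["Bo Chen", "Cy Park"],
    "Disney+": ["Dee Fox"],
    "Hulu": ["Ana Ruiz"],
}
toy_exclusive = exclusive_directors(toy_lists)
```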

## Directors who've appeared on ALL platforms

In [None]:
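# Keep Netflix names that also appear in the director lists of all three other platforms.
amazon_set, disney_set, hulu_set = set(amazon_directors), set(disney_directors), set(hulu_directors)
allInOne = [i for i in netflix_directors if i in amazon_set and i in disney_set and i in hulu_set]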

In [None]:
allInOne

[]

So no director appears in the director lists of all four platforms. This result is heavily constrained by Hulu, where only three rows have a director value (3,070 of 3,073 are missing).
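
Since an all-platform overlap depends so heavily on the sparsest dataset, it may be more informative to count how many platforms each director appears on. A minimal sketch, assuming a dict of per-platform name lists; `platform_coverage` and the toy names are hypothetical.

```python
from collections import Counter

def platform_coverage(platform_directors):
    """Count, for each director, the number of platforms they appear on."""
    coverage = Counter()
    for names in platform_directors.values():
        # A set ensures each director is counted once per platform.
        coverage.update({n.strip() for n in names})
    return coverage

# Toy input with invented names.
toy_lists = {
    "Netflix": ["Ana Ruiz", "Bo Chen", "Ana Ruiz"],
    "Amazon Prime": ["Ana Ruiz", "Cy Park"],
    "Disney+": ["Ana Ruiz", " Bo Chen"],
    "Hulu": ["Ana Ruiz"],
}
toy_coverage = platform_coverage(toy_lists)

# Directors present on every platform, and those present on at least two.
on_all = sorted(n for n, c in toy_coverage.items() if c == len(toy_lists))
on_two_or_more = sorted(n for n, c in toy_coverage.items() if c >= 2)
```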

# Are there any TV shows and movies across all the platforms which have the same cast as other TV show(s) or movie(s)?

### First, we're going to look at overlapping directors and cast within individual platforms.

### Netflix