# Exploring Actor Graph + EDA
Here I explore the Netflix titles dataset, then build and explore graphs of the actors who worked together in TV Shows and Movies.

![Graph image](https://sailab.diism.unisi.it/gnn/_images/intro.gif)

# Table of Contents
* [Getting ready](#Getting-ready)
* [A. Start of analysis](#A)
    - [A.1 High level view of TV and Movies data](#A1)
    - [A.2 Breakdown comparison of TV Shows and Movies](#A2)
    - [A.3 Breakdown of durations for TV Shows and Movies](#A3)
    - [A.4 View as Violin plots](#A4)
    - [A.5 Breakdown of contents and Actors](#A5)
* [B. Graph Network analysis](#B)
    - [B.1 Generating graphs and calculating Degrees](#B1)
    - [B.2 Viewing connection between number of Movies and degree of nodes](#B2)
    - [B.3 Viewing the connection between number of TV Shows and degree of nodes](#B3)
    - [B.4 Correlation within graph](#B4)
    - [B.5 Shortest path between actors](#B5)
    - [B.6 Longest path](#B6)
    - [B.7 Shortest path for TV Shows](#B7)
    - [B.8 Longest path for TV Shows](#B8)
    - [B.9 Viewing subgraphs for Movies](#B9)
    - [B.10 Example of interesting subgraph](#B10)
    - [B.11 Viewing subgraphs for TV Shows](#B11)
    - [B.12 Viewing an interesting subgraph](#B12)
    - [B.13 Egographs for Movies](#B13)
    - [B.14 Viewing egograph for midway actor in Movies](#B14)
    - [B.15 Egograph for TV Shows](#B15)
    - [B.16 Viewing egograph for midway actor in TV shows](#B16)

<a id="Getting-ready"></a>
# Getting ready...

# Importing required packages

In [None]:
import numpy as np
import pandas as pd
import os
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx 
import itertools
from itertools import combinations
import random

# Load Dataset

In [None]:
df = pd.read_csv("../input/netflix-shows/netflix_titles.csv")

# Basic cleansing
The show ID is not required here, and I only consider TV shows and movies released after 1980 (which is the majority). I then split the dataset up into Movies and TV shows separately for processing.

In [None]:
df.drop(columns = ['show_id'], inplace = True)
df = df[df['release_year'] > 1980]
df_TV = df[df['type'] == 'TV Show']
df_Movies = df[df['type'] == 'Movie']

# Processing
This function cleanses the Movies and TV datasets in the same way. It calculates the number of years between a Movie/TV show's release and the year it was added to Netflix, maps ratings to audience types (see below), and counts the number of countries listed for each title. These may all be useful fields.

Rather than considering individual ratings, I map them to audience types, since there is overlap between certain ratings.

In [None]:
def process_df(df, rep_pattern, variable):
    new_df = df.copy()
    new_df['duration'] = df['duration'].replace(rep_pattern, regex = True)
    new_df['duration'] = new_df['duration'].astype('int64')
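    # Parse the date (stripping any stray whitespace first) and keep only the year
    new_df['date_added'] = pd.to_datetime(new_df['date_added'].str.strip(), errors = 'coerce').dt.year
    new_df['years_from_release_to_add'] = new_df['date_added'] - new_df['release_year']
    # Assign the filled column back rather than filling in place on a column selection
    new_df['years_from_release_to_add'] = new_df['years_from_release_to_add'].fillna(new_df['years_from_release_to_add'].mode()[0])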
    new_df['audience'] = df['rating'].replace(rating_to_audience_map)
    new_df.fillna('Unknown', inplace = True)
    new_df.drop(columns = ['date_added', 'type'], inplace = True)
    new_df.rename(columns = {'duration':f'duration_{variable}'}, inplace = True)
    new_df['number_of_countries'] = new_df.country.str.count(',') + 1
    
    return new_df

rating_to_audience_map = {'G': 'Child','NC-17': 'Adult','NR': 'Unrated',
                      'PG': 'Child','PG-13': 'Older Child','R': 'Adult',
                      'TV-14': 'Older Child','TV-G': 'Child','TV-MA': 'Adult',
                      'TV-PG': 'Child','TV-Y': 'Child','TV-Y7': 'Child',
                      'TV-Y7-FV': 'Child','UR': 'Unrated'}

df_TV_new = process_df(df_TV, {' Season': '', 's': ''}, 'TV_seasons')
df_Movies_new = process_df(df_Movies, {' min': ''}, 'Movies_mins')
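
To make the derived fields concrete, here is a minimal plain-Python sketch of what the processing computes for a single title. The helper name `derive_fields` and the sample row are hypothetical and only for illustration; the sketch also skips the missing-value handling done by `process_df`.

```python
import re

def derive_fields(row):
    """Return the derived fields for one title (hypothetical helper, illustration only)."""
    # "90 min" -> 90 and "2 Seasons" -> 2: keep only the leading integer.
    duration = int(re.match(r"\d+", row["duration"]).group())
    # Years between the release and the year the title was added to Netflix.
    years_to_add = row["year_added"] - row["release_year"]
    # Countries are comma-separated, so the count is the number of commas plus one.
    number_of_countries = row["country"].count(",") + 1
    return {
        "duration": duration,
        "years_from_release_to_add": years_to_add,
        "number_of_countries": number_of_countries,
    }

# Made-up row, not taken from the dataset.
sample = {
    "duration": "2 Seasons",
    "year_added": 2020,
    "release_year": 2018,
    "country": "United States, Canada",
}
derived = derive_fields(sample)
print(derived)
```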

# Rejoining datasets for overall view
Once cleansed and processed, I join the two datasets again for some graphs later.

In [None]:
df_TV_overall = df_TV_new.copy()
df_TV_overall['type'] = 'TV Show'

df_Movies_overall = df_Movies_new.copy()
df_Movies_overall['type'] = 'Movie'

df_Overall = pd.concat([df_TV_overall, df_Movies_overall])

<a id="A"></a>
# A. Start of Analysis!

<a id="A1"></a>
# A.1 TV Shows vs Movies
In this dataset (titles released after 1980), there are significantly more Movies than TV Shows on Netflix.

In [None]:
sns.set_theme(style="darkgrid")
g = sns.countplot(x = 'type', data = df_Overall, palette = ['b', 'g'], alpha = 0.5)

<a id="A2"></a>
# A.2 TV Shows and Movies broken down
There are a few findings here:
* Of the movies released, most are aimed at Adults, although Older Children (pre-teens and teens) and Children combined outnumber Adults.
* Most Movies and TV Shows are released in a single country.
* The number of TV Shows and Movies released per year has risen steeply, roughly exponentially, over the past 40 years.
* The number of Movies released seems to be slowing, whilst TV Shows continue to grow.
* Many titles are added to Netflix in the same year they are released, or within a year of release.

In [None]:
def histplots(x, i, j):
    g = sns.histplot(x = x, data = df_Overall, hue = 'type', palette = ['b','g'], ax = ax[i,j], alpha=0.5)
    
    return g

cols = ['audience', 'number_of_countries', 'release_year', 'years_from_release_to_add']

_, ax = plt.subplots(2,2, figsize = (12,8))

for index, cols_ in enumerate(cols):
    # Fill the 2x2 grid row by row: row = index // 2, column = index % 2
    histplots(cols_, index // 2, index % 2)

In [None]:
def general_plot(plot_type, i):
    x = 'number_of_countries' if i == 0 else 'audience'
    if plot_type == 'violin':
        g = sns.violinplot(y = cols[index], x = x, data = df, ax = ax[index, i], color = color[index])
        # Every second collection is a violin body; make those semi-transparent
        for violin in g.collections[::2]:
            violin.set_alpha(0.6)
    elif plot_type == 'bar':
        g = sns.barplot(y = cols[index], x = x, data = df, color = color[index], ax = ax[index, i], alpha=0.5)
    
    ax[index, 1].set_ylabel('')
    ax[index, 1].set(yticklabels=[])
    ax[0, i].set_xlabel('')
    
    return g

df_list = [df_Movies_new, df_TV_new]
cols = ['duration_Movies_mins','duration_TV_seasons']
color = ['g', 'b']

<a id="A3"></a>
# A.3 Breakdown of durations
There are a few findings:
* The duration of movies shows little dependence on the number of countries they are released in.
* Interestingly, TV shows tend to have more seasons on average if released in 4 countries, and fewer beyond that. However, there is also great variability for 4 countries.
* Movies tend to be longest for Older Children.
* Unrated TV shows have the most seasons on average, albeit with huge variation. There are also very few of these, as seen above, so this is not very representative.

In [None]:
fig, ax = plt.subplots(2,2, figsize = (10,8))

for index, df in enumerate(df_list):
    for i in [0,1]: general_plot('bar', i)

<a id="A4"></a>
# A.4 Representing as Violin plots
This reinforces the above message, i.e. most movies are roughly 100 minutes in duration, with varying ranges.

In [None]:
fig, ax = plt.subplots(2, 2, figsize = (10,8))

for index, df in enumerate(df_list):
    for i in [0,1]: general_plot('violin', i)

In [None]:
def frequency_plot(df, col, ax, color, ylabel):
    general_list = list(df[col])
    general_list = [element for item in general_list for element in item.split(', ')]
    
    # Count occurrences and name the count column explicitly (its default name differs between pandas versions)
    general_df = pd.DataFrame(general_list, columns = [col]).value_counts().reset_index(name = ylabel)

    general_df = general_df[general_df[col] != 'Unknown']

    general_df_top = general_df.head(20)

    g = sns.barplot(x = col, y = ylabel, data = general_df_top, ax = ax, color = color, alpha=0.5)
    g = g.set_xticklabels(g.get_xticklabels(), rotation=90)
    
    return g, general_df

<a id="A5"></a>
# A.5 Breaking down content and Actors data
Here are a few findings:
* For both movies and TV shows, the US has by far the most content.
* Second is India for Movies, whilst the UK is second for TV Shows.
* Takahiro Sakurai has the most TV shows.
* Anupam Kher has the most movies.

In [None]:
fig, ax = plt.subplots(2, 2, figsize = (15,25))

returnplot = frequency_plot(df_TV_new, 'country', ax = ax[0,0], color = 'b', ylabel = 'Frequency (TV shows)')
returnplot, cast_TV_df = frequency_plot(df_TV_new, 'cast', ax = ax[0,1], color = 'b', ylabel = 'Frequency (TV shows)')
returnplot = frequency_plot(df_Movies_new, 'country', ax = ax[1,0], color = 'g', ylabel = 'Frequency (Movies)')
returnplot, cast_Movies_df = frequency_plot(df_Movies_new, 'cast', ax = ax[1,1], color = 'g', ylabel = 'Frequency (Movies)')
    
_ = ax[0, 0].set_xlabel('')
_ = ax[0, 1].set_xlabel('')
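
The counting step behind these frequency plots can be sketched in plain Python, under the assumption that each cell holds names separated by `', '` and that missing values were filled with `'Unknown'`, as in the processing above. The cast strings here are made up.

```python
from collections import Counter

# Made-up cast strings in the same comma-separated format as the 'cast' column.
casts = ["Actor A, Actor B", "Actor A, Actor C", "Unknown"]

# Split every cell into individual names, drop the placeholder, and count.
counts = Counter(
    name
    for cast in casts
    for name in cast.split(", ")
    if name != "Unknown"
)

# The most frequent name and how often it appears.
top = counts.most_common(1)
print(top)
```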

<a id="B"></a>
# B. Starting Graph Network analysis

In [None]:
def calc_degs(df):
    # Build every pair of actors that appear together in a title
    combinations_list = []
    for cast in df['cast']:
        combinations_list.append(list(itertools.combinations(cast.split(", "), 2)))

    flat_list = [item for sublist in combinations_list for item in sublist]
    
    # Undirected graph: one edge per pair of actors who shared a title
    G = nx.Graph()
    G.add_edges_from(flat_list)
  
    # One row per actor with their number of distinct co-stars
    degrees = pd.DataFrame(list(G.degree), columns = ['cast', 'degree'])
    
    return G, degrees

<a id="B1"></a>
# B.1 Generating graphs and calculating Degrees
Here I generate the graphs for the movies and TV shows.

The graphs are undirected, since an edge simply records that two actors worked on the same movie/TV show. The graphs therefore capture who has worked with whom.

I also calculate the degree of nodes/vertices (in the context of an undirected graph). The degree is the number of edges connected to a single node.

In [None]:
df_list = [df_Movies_new, df_TV_new]
cast_dfs = [cast_Movies_df, cast_TV_df]
df_results = []
graph = []
for index, df in enumerate(df_list):
    G, degrees = calc_degs(df)
    results_ = cast_dfs[index].merge(degrees, left_on = 'cast', right_on = 'cast')
    results_ = results_.sort_values('degree', ascending = False).reset_index(drop = True)
    
    df_results.append(results_)
    graph.append(G)
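
As a small worked example of how cast lists turn into edges and degrees, here is a plain-Python sketch that does not use networkx. The cast strings are made up, and the dictionary of neighbour sets is only a stand-in for the graph object used in the notebook.

```python
from collections import defaultdict
from itertools import combinations

# Made-up cast strings in the same comma-separated format as the 'cast' column.
casts = ["A, B, C", "A, D", "E"]

neighbours = defaultdict(set)
for cast in casts:
    # Every pair of actors in the same title becomes an undirected edge.
    for u, v in combinations(cast.split(", "), 2):
        neighbours[u].add(v)
        neighbours[v].add(u)

# Degree = number of distinct co-stars; sets ignore repeated collaborations.
degree = {actor: len(others) for actor, others in neighbours.items()}
print(degree)
```

Note that "E" never enters the graph: a title with a single listed actor produces no pairs, so it adds no edges.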

In [None]:
def pair_plot(df, color):
    g = sns.pairplot(df, kind='reg', diag_kind = 'kde', corner = True,
                 diag_kws = {'color':color[0]},
                 plot_kws={'line_kws':{'color':color},
                           'scatter_kws': {'color':color,'alpha': 0.5}})
    
    return g

<a id="B2"></a>
# B.2 Viewing connection between number of Movies and degree of nodes
There is a clear correlation between the number of movies and the degree for an actor. 

In real terms, this means that the more movies an actor is present in, the more connections they form.

In [None]:
_ = pair_plot(df_results[0], color[0])

<a id="B3"></a>
# B.3 Viewing the connection between number of TV Shows and degree of nodes
Once again, the same correlation exists here as with Movies.

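One possible reading is that forging relationships with other actors leads to more acting contracts, so larger casts would benefit individual actors. The correlation alone does not show this, though: every additional title adds co-stars, so the degree grows with the number of titles regardless.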

In [None]:
_ = pair_plot(df_results[1], color[1])

<a id="B4"></a>
# B.4 Correlation within graph
This coefficient measures the degree assortativity of a graph: the Pearson correlation between the degrees of the nodes at either end of each edge. A positive value means well-connected actors tend to be connected to other well-connected actors.

This shows that for TV shows, actors are more likely to be connected to actors of similar degree than for Movies. A strange but interesting difference in behaviour between the two modes of content.

In [None]:
graph_vals = ['Movie','TV Show']
corr = []
for index, g in enumerate(graph):
    r = nx.degree_pearson_correlation_coefficient(g)
    corr.append(r)
    
corr_df = pd.DataFrame({'Dataset':graph_vals, 'Pearson Correlation Coefficient':corr})

g = sns.barplot(x = 'Dataset', y = 'Pearson Correlation Coefficient', data = corr_df, palette = color, alpha=0.5)
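
To illustrate what this coefficient measures, here is a plain-Python sketch that computes the Pearson correlation of end-point degrees by hand. The function name `degree_correlation` is hypothetical, and the sketch assumes a simple undirected graph given as an edge list without duplicates.

```python
from math import sqrt

def degree_correlation(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count each undirected edge in both directions so the measure is symmetric.
    xs = [degree[u] for u, v in edges] + [degree[v] for u, v in edges]
    ys = [degree[v] for u, v in edges] + [degree[u] for u, v in edges]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# A star: one hub connected to three leaves, so high degree always meets low degree.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
r_star = degree_correlation(star)
print(r_star)
```

The star is the extreme disassortative case, with a coefficient of -1. The sketch divides by zero if every node has the same degree, since the degrees then have no variance.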

In [None]:
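def get_actor_tuple(df, particular_actor_index, G = None):
    # Use the graph that matches the results table; the TV table needs the TV graph
    if G is None:
        G = graph[1] if df is df_results[1] else graph[0]

    particular_actor = df.iloc[particular_actor_index]['cast']
    actor_list = list(df[df.cast != particular_actor]['cast'])

    tuple_list = []
    for actor in actor_list:
        tuple_list.append((particular_actor, actor))

    # Shortest-path length from the chosen actor to every reachable actor
    paths = {}
    for nodes in tuple_list:
        try:
            paths[nodes] = nx.shortest_path_length(G, *nodes)
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            # Skip actors that cannot be reached from the chosen actor
            pass
        
    return paths

In [None]:
def plot_graph_path(paths, path_type, color, G = None):
    # Colour convention used in this notebook: 'lightblue' for TV Shows, 'lightgreen' for Movies
    if G is None:
        G = graph[1] if color == 'lightblue' else graph[0]

    if path_type == 'min':
        path = min(paths.items(), key=lambda x: x[1])[0]
    else:
        path = max(paths.items(), key=lambda x: x[1])[0]
        
    induced_path = nx.shortest_path(G, *path)

    sG = nx.subgraph(G, induced_path)

    pos = nx.spring_layout(sG, scale=20, k=3/np.sqrt(G.order()))
    nx.draw(sG, pos, node_color=color, 
            with_labels=True, 
            node_size=1500,
            arrowsize=20)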

<a id="B5"></a>
# B.5 Shortest path between actors
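Here I look at the "top" actor (the first row of the results, ordered by degree), which is Anupam Kher. Changing the index passed to `get_actor_tuple` would give the connections for a different actor.

The closest actor found for Anupam Kher is Shah Rukh Khan. Several actors can tie at the same distance, in which case the first one is shown.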

In [None]:
paths_movie = get_actor_tuple(df_results[0], 0)
plot_graph_path(paths_movie, 'min', 'lightgreen')

<a id="B6"></a>
# B.6 Longest path
This shows the actor furthest from Anupam Kher, i.e. the longest of the shortest paths to any other reachable actor. It is much longer!

In [None]:
plot_graph_path(paths_movie, 'max', 'lightgreen')
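
The "nearest" and "furthest" actors are both defined through shortest-path lengths from one source. As a rough sketch of that idea, here is a plain-Python breadth-first search on a made-up graph; the function name `bfs_distances` and the adjacency dictionary are hypothetical, and the notebook itself relies on networkx for this.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Hop counts from source to every reachable node in an unweighted graph."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances

# Made-up graph: a chain A - B - C - D plus a separate pair X - Y.
adjacency = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"],
    "X": ["Y"], "Y": ["X"],
}

distances = bfs_distances(adjacency, "A")
del distances["A"]  # ignore the source itself

nearest = min(distances, key=distances.get)
farthest = max(distances, key=distances.get)
print(nearest, farthest, distances)
```

Nodes in a different connected component (here X and Y) never receive a distance, which mirrors the pairs that are skipped when no path exists.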

<a id="B7"></a>
# B.7 Shortest path for TV Shows
Again I choose the top actor and compare the graph for all other actors.

The closest connection is Yuichi Nakamura.

In [None]:
paths_tv = get_actor_tuple(df_results[1], 0)
plot_graph_path(paths_tv, 'min', 'lightblue')

<a id="B8"></a>
# B.8 Longest path for TV Shows
The path from the same top actor to the actor furthest away is again much longer!

In [None]:
plot_graph_path(paths_tv, 'max', 'lightblue')

In [None]:
def plot_subgraph(graph, graph_index, color, dorandom):
    subgraphs = [graph.subgraph(c) for c in nx.connected_components(graph)]

    subgraphslist = []
    for subgraph in subgraphs:
        if subgraph.number_of_nodes() > 10 and subgraph.number_of_nodes() <= 50:
            subgraphslist.append(subgraph)
    
    if dorandom: graph_index = random.sample(range(len(subgraphslist)),1)[0]
    
    nx.draw(subgraphslist[graph_index], node_color=color, 
            with_labels=True, 
            node_size=1500,
            arrowsize=20)

<a id="B9"></a>
# B.9 Viewing subgraphs for Movies
I break down the graph into its connected components, and keep only those with more than 10 and at most 50 nodes.

This is a randomised viewer of the subgraphs.

In [None]:
# graph_index is ignored when dorandom is True, so pass None
plot_subgraph(graph[0], None, 'lightgreen', True)
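
The splitting and size filtering can be sketched without networkx as follows. The function name `connected_components` and the toy adjacency dictionary are hypothetical, and the size thresholds are shrunk to suit the toy data (the notebook uses 10 and 50).

```python
def connected_components(adjacency):
    """Return the connected components of an undirected graph as a list of sets."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from start.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node])
        seen |= component
        components.append(component)
    return components

# Made-up graph with three components: {1, 2, 3}, {4, 5} and the isolated node 6.
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}

components = connected_components(adjacency)
# Keep only components with more than 1 and at most 3 nodes.
kept = [c for c in components if 1 < len(c) <= 3]
print(kept)
```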

<a id="B10"></a>
# B.10 Example of interesting subgraph
This is a densely connected subgraph - there are others and functions can be used to find these.

In [None]:
plot_subgraph(graph[0],3,'lightgreen', False)

<a id="B11"></a>
# B.11 Viewing subgraphs for TV Shows
This is again a randomised viewer of the subgraphs for TV shows.

In [None]:
# graph_index is ignored when dorandom is True, so pass None
plot_subgraph(graph[1], None, 'lightblue', True)

<a id="B12"></a>
# B.12 Viewing an interesting subgraph
I chose this subgraph as an interesting example of a more dense subgraph.

In [None]:
plot_subgraph(graph[1],23,'lightblue',False)

In [None]:
def plot_egograph(type_, graph_index):
    if type_ == 'Movies':
        actor = df_results[0].iloc[graph_index]['cast']
        egograph = nx.ego_graph(graph[0], n = actor)
        color = 'lightgreen'
    else:
        actor = df_results[1].iloc[graph_index]['cast']
        egograph = nx.ego_graph(graph[1], n = actor)
        color = 'lightblue'        
    
    r = nx.degree_pearson_correlation_coefficient(egograph)

    graph_ = nx.draw(egograph, node_color=color, 
                with_labels=True, 
                node_size=1500,
                arrowsize=20)
    
    return graph_, r, actor

<a id="B13"></a>
# B.13 Egographs for Movies
The egograph looks at the subgraph centred on an actor within a particular radius. I use the default of radius = 1.

This is for the actor Sameer Dharkmadhikari. The degree correlation within this egograph is low.

In [None]:
egograph, r, actor = plot_egograph('Movies', 5000)

print(r)
print(actor)
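
For intuition about what a radius-1 egograph contains, here is a plain-Python sketch on a made-up edge list. The function name `ego_edges` is hypothetical; the notebook itself uses `nx.ego_graph`.

```python
def ego_edges(edges, centre):
    """Edges of the radius-1 egograph: the centre, its neighbours, and all ties among them."""
    neighbours = {v for u, v in edges if u == centre} | {u for u, v in edges if v == centre}
    members = neighbours | {centre}
    # Keep every edge whose two end points both belong to the egograph.
    return [(u, v) for u, v in edges if u in members and v in members]

# Made-up edge list: A worked with B and C, who also worked together; D and E are further out.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

ego = ego_edges(edges, "A")
print(ego)
```

The edge between B and C is kept even though it does not touch A, because both of its end points are neighbours of the centre.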

<a id="B14"></a>
# B.14 Viewing egograph for midway actor in Movies
Below is the egograph for the actor in the middle of the list of all Movie actors (ordered by degree of nodes).

In [None]:
# Midpoint of the Movies results (index 0), not the TV results
index = df_results[0].shape[0] // 2

egograph, r, actor = plot_egograph('Movies', index)

print(r)
print(actor)

<a id="B15"></a>
# B.15 Egograph for TV Shows
I choose an interesting egograph; in this case it is for Cedric the Entertainer (what a name!).

In [None]:
egograph, r, actor = plot_egograph('TV Show', 2000)

print(r)
print(actor)

<a id="B16"></a>
# B.16 Viewing egograph for midway actor in TV shows
Below is the egograph for the actor in the middle of the list of all TV Show actors (ordered by degree of nodes).

In [None]:
index = df_results[1].shape[0] // 2

egograph, r, actor = plot_egograph('TV Show', index)

print(r)
print(actor)