<h1>EDA of Netflix Data</h1>

<p>
Hello everyone, this is my first notebook on Kaggle and here I attempt to answer the following questions and visualize the relevant data. <br><br>
    1. What content is available in various countries and what are the popular categories in Netflix's top 20 countries? <br>
    2. Has Netflix of late been focusing on movies instead of TV shows? <br>

In addition, the notebook contains a recommender based on the text descriptions of the shows and a network analysis of top actors and directors.</p>



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv',parse_dates=['date_added'])

In [None]:
df.head()

In [None]:
df.info()

<h2>Country-wise content availability and popular categories</h2>

In order to know which categories are available in various countries, we create a dataframe with countries as indices and categories as columns. But before we can do that, we need unique lists of categories and countries.

In [None]:
#Removing nulls from country column (assignment form; chained inplace fillna is deprecated in recent pandas)
df['country'] = df['country'].fillna('')

#Creating country and category lists
countries_list = set()
for country in df['country'].unique():
    for substr in country.strip().split(','):
        countries_list.add(substr.strip())
if '' in countries_list:
    countries_list.remove('')

categories_list = set()
for category in df['listed_in'].unique():
    for substr in category.strip().split(','):
        categories_list.add(substr.strip())
if '' in categories_list:
    categories_list.remove('')
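
The same unique lists can also be built without explicit loops. Below is a minimal sketch on a made-up three-row frame; the `toy` data and the `unique_items` helper are illustrative names and not part of the dataset or of this notebook.

```python
import pandas as pd

# Toy stand-in for the Netflix data (values are made up for illustration)
toy = pd.DataFrame({
    'country': ['India, United States', None, 'India'],
    'listed_in': ['Dramas, Comedies', "Kids' TV", 'Dramas'],
})

def unique_items(series):
    """Split a comma-separated column and return the set of distinct, non-empty items."""
    items = series.fillna('').str.split(',').explode().str.strip()
    return set(items[items != ''])

print(sorted(unique_items(toy['country'])))    # ['India', 'United States']
print(sorted(unique_items(toy['listed_in'])))  # ['Comedies', 'Dramas', "Kids' TV"]
```

`explode` turns each list produced by `str.split` into separate rows, so the whitespace stripping and the empty-string filter apply to every item at once.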

Now we create the dataframe and populate it:

In [None]:
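country_category_df = pd.DataFrame(index=sorted(countries_list),columns=sorted(categories_list))
for country in countries_list:
    # Literal (non-regex) substring match, computed once per country
    in_country = df['country'].str.contains(country, regex=False)
    for category in categories_list:
        # Number of titles whose country and category strings both match
        country_category_df.loc[country, category] = \
        int((in_country & df['listed_in'].str.contains(category, regex=False)).sum())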
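
Note that `str.contains` matches substrings, so a short label such as 'TV Shows' is also found inside 'Korean TV Shows' or 'Romantic TV Shows'. The sketch below shows one possible exact-match alternative that splits both columns into one row per item and cross-tabulates them with `pd.crosstab`. The `toy` frame and the `split_column` helper are made up for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset
toy = pd.DataFrame({
    'country': ['South Korea', 'South Korea, Japan', 'Japan'],
    'listed_in': ['Korean TV Shows, Romantic TV Shows', 'TV Shows', 'Anime Series'],
})

def split_column(frame, column):
    """Return a copy with one row per comma-separated item in `column`."""
    out = frame.copy()
    out[column] = out[column].fillna('').str.split(',')
    out = out.explode(column, ignore_index=True)
    out[column] = out[column].str.strip()
    return out[out[column] != ''].reset_index(drop=True)

# One row per (country, category) pair, then count exact label matches
long_form = split_column(split_column(toy, 'country'), 'listed_in')
exact_counts = pd.crosstab(long_form['country'], long_form['listed_in'])
print(exact_counts)

# Substring matching finds 'TV Shows' in every row that mentions any '... TV Shows' label
substring_count = int(toy['listed_in'].str.contains('TV Shows', regex=False).sum())
print(substring_count)  # 2, although only one title is listed exactly as 'TV Shows'
```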

With the dataframe populated, we can see how many titles of each category are available in each country.

In [None]:
country_category_df

In [None]:
country_category_df.loc['Denmark'].sort_values(ascending=False)

Let's add a 'Total' column to this dataframe so that we can pick the top 20 countries. (Note: 'Total' is not the number of shows in a country; it is the total number of category occurrences in the 'listed_in' column, so a title listed under several categories is counted several times. It only serves to rank the countries.)

In [None]:
country_category_df['Total'] = country_category_df.sum(axis=1)
top_20 = country_category_df['Total'].sort_values(ascending=False).head(20)
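
To illustrate why 'Total' overstates the number of titles, here is a small sketch with hypothetical counts: two titles, one listed under Dramas and Comedies and the other under Dramas and International Movies. The numbers are made up and `toy_totals` is an illustrative name.

```python
import pandas as pd

# Hypothetical per-category counts for one country, built from only 2 titles
toy_totals = pd.DataFrame(
    {'Comedies': [1], 'Dramas': [2], 'International Movies': [1]},
    index=['Denmark'],
)
toy_totals['Total'] = toy_totals.sum(axis=1)
print(toy_totals.loc['Denmark', 'Total'])  # 4 category occurrences from 2 titles
```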

In [None]:
#Creating a dataframe for top 20 countries to show the 3 most popular categories
popular_categories = pd.DataFrame(index=top_20.index,columns=['Most popular','2nd Most popular','3rd Most popular'])
for country in popular_categories.index:
    # Drop the helper 'Total' column so that only real categories are ranked
    ranked = country_category_df.loc[country].drop('Total').sort_values(ascending=False).index
    popular_categories.loc[country,'Most popular'] = ranked[0]
    popular_categories.loc[country,'2nd Most popular'] = ranked[1]
    popular_categories.loc[country,'3rd Most popular'] = ranked[2]

In [None]:
popular_categories
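
The three most popular categories per country can also be picked with `Series.nlargest`. This is a sketch on made-up counts, assuming a numeric frame; the frame built in this notebook has object dtype, so it would probably need `.astype(int)` first. `toy_category_counts` and `top_categories` are illustrative names.

```python
import pandas as pd

# Made-up counts for two countries (illustrative only)
toy_category_counts = pd.DataFrame(
    {'Dramas': [30, 5], 'Comedies': [20, 9], 'Anime Series': [1, 40], 'Documentaries': [10, 2]},
    index=['India', 'Japan'],
)

def top_categories(row, n=3):
    """Return the names of the n largest entries of a row, largest first."""
    return list(row.nlargest(n).index)

# result_type='expand' spreads each returned list over separate columns
top3 = toy_category_counts.apply(top_categories, axis=1, result_type='expand')
top3.columns = ['Most popular', '2nd Most popular', '3rd Most popular']
print(top3)
```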

Let's visualize the above data using nested pie charts:

In [None]:
labels = sorted(list(set(popular_categories['Most popular'].values)))
sizes = [list(popular_categories['Most popular'].values).count(label) for label in labels]
group_names = labels
group_size = sizes
subgroup_names = popular_categories.sort_values(by='Most popular').index
subgroup_size = np.ones(20)
txt = 'In most of the countries, Movies are the most popular category. Only Japan, South Korea and Taiwan are exceptions with TV Shows taking the top spot' 
#Set colors
a, b, c = [plt.cm.Blues, plt.cm.Reds, plt.cm.Greens]

#Outer ring 
fig, ax = plt.subplots(figsize=(6,6))
fig.suptitle('Most popular category',fontsize=20)
fig.text(.5, .05, txt, ha='center')
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=1.3, labels=group_names, colors=[a(0.6), b(0.6), c(0.6)] )
plt.setp(mypie, width=0.3, edgecolor='white')

sub_colors = [tuple(a(0.4))]+[tuple(b(0.4)) for i in range(16)]+[tuple(c(0.4)) for i in range(3)]

#Inner ring
mypie2, _ = ax.pie(subgroup_size, radius=1.3-0.3,rotatelabels=True, labels=subgroup_names, labeldistance=0.7, textprops = dict(rotation_mode = 'anchor', va='center', ha='center'), colors=sub_colors)
plt.setp(mypie2, width=0.4, edgecolor='white')
plt.margins(0,0)
plt.show()

In [None]:
labels = sorted(list(set(popular_categories['2nd Most popular'].values)))
sizes = [list(popular_categories['2nd Most popular'].values).count(label) for label in labels]
group_names = labels
group_size = sizes
subgroup_names = popular_categories.sort_values(by='2nd Most popular').index
subgroup_size = np.ones(20)
txt = 'The field for the 2nd most popular category is more diversified with the inclusion of Dramas and Comedies. 2nd spot is also taken by TV Shows in Japan, South Korea and Taiwan'
#Set colors
a, b, c, d, e, f = [plt.cm.Blues, plt.cm.Reds, plt.cm.Greens, plt.cm.Purples, plt.cm.Oranges, plt.cm.Greys]

#Outer ring 
fig, ax = plt.subplots(figsize=(6,6))
fig.suptitle('2nd Most popular category',fontsize=20)
fig.text(.5, .05, txt, ha='center')
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=1.3, labels=group_names, colors=[a(0.6), b(0.6), c(0.6), d(0.6), e(0.6), f(0.6)] )
plt.setp(mypie, width=0.3, edgecolor='white')

sub_colors = [tuple(a(0.4))]+[tuple(b(0.4)) for i in range(3)]+[tuple(c(0.4)) for i in range(11)]+ \
            [tuple(d(0.4)) for i in range(3)]+[tuple(e(0.4))]+[tuple(f(0.4))]

#Inner ring
mypie2, _ = ax.pie(subgroup_size, radius=1.3-0.3,rotatelabels=True, labels=subgroup_names, labeldistance=0.7, textprops = dict(rotation_mode = 'anchor', va='center', ha='center'), colors=sub_colors)
plt.setp(mypie2, width=0.4, edgecolor='white')
plt.margins(0,0)
plt.show()

In [None]:
labels = sorted(list(set(popular_categories['3rd Most popular'].values)))
sizes = [list(popular_categories['3rd Most popular'].values).count(label) for label in labels]
group_names = labels
group_size = sizes
subgroup_names = popular_categories.sort_values(by='3rd Most popular').index
subgroup_size = np.ones(20)
txt = 'More diversification can be seen with Action & Adventure, Romantic TV Shows, Anime Series, British and Korean TV Shows'
#Set colors
a, b, c, d, e, f, g, h, i = [plt.cm.Blues, plt.cm.Reds, plt.cm.Greens, plt.cm.Purples, plt.cm.Oranges, plt.cm.Greys, plt.cm.copper, plt.cm.cool, plt.cm.Wistia]

#Outer ring 
fig, ax = plt.subplots(figsize=(6,6))
fig.suptitle('3rd Most popular category',fontsize=20)
fig.text(.5, .05, txt, ha='center')
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=1.3, labels=group_names, colors=[a(0.6), b(0.6), c(0.6), d(0.6), e(0.6), f(0.6),g(0.6), h(0.6), i(0.6),] )
plt.setp(mypie, width=0.3, edgecolor='white')

sub_colors = [tuple(a(0.4)) for i in range(2)]+[tuple(b(0.4))]+[tuple(c(0.4))]+ \
            [tuple(d(0.4)) for i in range(3)]+[tuple(e(0.4)) for i in range(9)]+[tuple(f(0.4))]+ \
            [tuple(g(0.8))]+[tuple(h(0.5))]+[tuple(i(0.4))]

#Inner ring
mypie2, _ = ax.pie(subgroup_size, radius=1.3-0.3,rotatelabels=True, labels=subgroup_names, labeldistance=0.7, textprops = dict(rotation_mode = 'anchor', va='center', ha='center'), colors=sub_colors)
plt.setp(mypie2, width=0.4, edgecolor='white')
plt.margins(0,0)
plt.show()
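
The `sub_colors` lists in the three charts above hardcode how many countries fall into each group, so they would need editing whenever the data changes. A possible way to derive them from the group sizes is sketched below; `ring_colors` is a hypothetical helper and not part of matplotlib.

```python
import matplotlib.pyplot as plt

def ring_colors(group_sizes, cmaps, outer_shade=0.6, inner_shade=0.4):
    """Build outer and inner ring colors so each inner wedge matches its group."""
    outer = [cmap(outer_shade) for cmap in cmaps[:len(group_sizes)]]
    inner = []
    for size, cmap in zip(group_sizes, cmaps):
        # One lighter wedge per member of the group
        inner.extend([cmap(inner_shade)] * size)
    return outer, inner

# Group sizes of the first chart: 1, 16 and 3 countries
toy_cmaps = [plt.cm.Blues, plt.cm.Reds, plt.cm.Greens]
outer, inner = ring_colors([1, 16, 3], toy_cmaps)
print(len(outer), len(inner))  # 3 20
```

In the cells above, `group_size` could be passed as `group_sizes`, and the two returned lists used as the `colors` arguments of the outer and inner `ax.pie` calls.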

<h2>Movies vs TV Shows over the years</h2>

In [None]:
years = list(sorted(set(df['date_added'].dropna().dt.year)))

In [None]:
yearly_movie_count = [len(df[(df['date_added'].dt.year==year) & (df['type']=='Movie')]) for year in years]

In [None]:
yearly_tvshow_count = [len(df[(df['date_added'].dt.year==year) & (df['type']=='TV Show')]) for year in years]
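
The yearly counts can also be computed in a single step with `pd.crosstab`, which makes it easy to add the share of movies per year as well. A minimal sketch on made-up rows; `toy_titles` and the 'Movie share' column are illustrative names.

```python
import pandas as pd

# Made-up rows; the last one has no date and is excluded below
toy_titles = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie', 'Movie', 'TV Show'],
    'date_added': pd.to_datetime(['2018-01-05', '2018-03-01', '2019-06-10', '2019-07-04', None]),
})

dated = toy_titles.dropna(subset=['date_added'])
year = dated['date_added'].dt.year.rename('year')

# Rows are years, columns are types, cells are numbers of titles added
yearly = pd.crosstab(year, dated['type'])
yearly['Movie share'] = yearly['Movie'] / (yearly['Movie'] + yearly['TV Show'])
print(yearly)
```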

Let's plot a line graph to compare the number of movies and TV shows added since 2008:

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
fig.suptitle('Movies vs TV Shows', fontsize=20)
line1, = ax.plot(years, yearly_movie_count)
line2, = ax.plot(years, yearly_tvshow_count)
ax.set_xlabel('Year', fontsize=15)
ax.set_ylabel('No. of Movies/Shows', fontsize=15)
ax.legend((line1, line2), ('Movies', 'TV Shows'), loc='upper left')
plt.show()

The above graph clearly shows that Netflix has been focusing heavily on movies in the last few years.

<h2>Recommender based on text features using NLTK and KMeans clustering</h2>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans

We will use the text in the description column to create the recommender.

In [None]:
docs = df['description']

First, we clean the data by dropping tokens that contain non-alphabetic characters (such as numbers), removing names, and lemmatizing words to reduce inflected forms to a base form. Then we apply the TF-IDF vectorizer before feeding the data to the KMeans clustering algorithm:

In [None]:
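# Lowercase the name list so that it matches the lowercased descriptions below
all_names = set(name.lower() for name in names.words())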
lemmatizer = WordNetLemmatizer()
data_cleaned = []
for doc in docs:
    doc = doc.lower()
    doc_cleaned = ' '.join(lemmatizer.lemmatize(word) for word in doc.split() if word.isalpha() and word not in all_names)
    data_cleaned.append(doc_cleaned)

In [None]:
tfidf_vector = TfidfVectorizer(stop_words='english',max_features=None,max_df=0.5,min_df=2)
data_tfidf = tfidf_vector.fit_transform(data_cleaned)
terms = tfidf_vector.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
k=100 #No. of clusters
kmeans = KMeans(n_clusters=k, random_state=42)

In [None]:
kmeans.fit(data_tfidf)

In [None]:
clusters = kmeans.labels_
cluster_label = {i: df.index[clusters==i] for i in range(k)}  # row labels of the titles in each cluster
centroids = kmeans.cluster_centers_

The function below accepts a show_id and prints the cluster keywords and the list of similar content.

In [None]:
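def get_similar_content(show_id):
    # Look up the row label for this show_id
    matches = df.index[df['show_id']==show_id]
    if len(matches) == 0:
        print('No title found for show_id %s' % show_id)
        return
    idx = matches[0]
    for key, value in cluster_label.items():
        if idx in value:
            print('Keywords: ')
            # The 10 highest-weighted terms of the cluster centroid
            for ind in centroids[key].argsort()[-10:]:
                print(' %s' % terms[ind].title(), end="")
            print()
            # value holds row labels, so select with .loc
            print(df.loc[value, ['title','type','release_year','rating']])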

In [None]:
get_similar_content('s741') # Here, it appears to be doing a decent job of clustering kids programs

In [None]:
get_similar_content('s16') # The keywords suggest content could be for mature audiences and this can be verified
                           # in the ratings columns where most are TV-MA

In [None]:
get_similar_content('s6412') # These appear to be documentaries

Overall, the recommender seems to be doing ok but with a lot of room for improvement.

<h2>Network Analysis of top actors and directors</h2>

I will be doing network analysis at the country level because doing this on the whole data will make the graph messy and not reveal any useful patterns. I will create a function that accepts the country name and plots the network graph of top 50 actors/directors. It will also print the actor/director with most connections.

In [None]:
import networkx as nx

In [None]:
# Removing nulls in the relevant columns (assignment form; chained inplace fillna is deprecated in recent pandas)
df['director'] = df['director'].fillna(' ')
df['cast'] = df['cast'].fillna(' ')
df['country'] = df['country'].fillna(' ')

We will select the top actors and directors based on their number of appearances in the data. I will only consider names with at least 2 words, because single-word names are likely to have inflated counts (e.g. an actor called 'Ram' would also be counted for 'Ramlal Singh' and 'Ram Charan Tej').
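
The inflated counts come from substring matching. The sketch below contrasts it with an exact comparison of whole names on three made-up cast strings; 'Asha Rao' is an invented name and `appears_in` is a hypothetical helper.

```python
import pandas as pd

# Made-up cast strings for illustration
toy_cast = pd.Series(['Ramlal Singh, Asha Rao', 'Ram Charan Tej', 'Ram, Asha Rao'])

# Substring matching: 'Ram' is found in all three rows
substring_hits = int(toy_cast.str.contains('Ram', regex=False).sum())

def appears_in(cast_string, name):
    """True if `name` is exactly one of the comma-separated names."""
    return name in [part.strip() for part in cast_string.split(',')]

# Exact matching: only the row that lists 'Ram' as a whole name counts
exact_hits = sum(appears_in(cast, 'Ram') for cast in toy_cast)

print(substring_hits, exact_hits)  # 3 1
```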

I will be using the networkx module to add the nodes for actors and directors. Then, whenever an actor and a director work together, we add an edge.
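
As a small illustration of the idea, the sketch below builds such a graph from three made-up titles and finds the most connected node. The names in `toy_credits` are placeholders.

```python
import networkx as nx
import pandas as pd

# Three made-up titles with placeholder names
toy_credits = pd.DataFrame({
    'director': ['Director A', 'Director A', 'Director B'],
    'cast': ['Actor X, Actor Y', 'Actor Z', 'Actor Y'],
})

toy_graph = nx.DiGraph()
for director, cast in zip(toy_credits['director'], toy_credits['cast']):
    for actor in (name.strip() for name in cast.split(',')):
        # The edge points from the actor to the director they worked with
        toy_graph.add_edge(actor, director)

# Node with the highest degree (incoming plus outgoing edges)
most_connected = max(toy_graph.degree(), key=lambda pair: pair[1])
print(most_connected)  # ('Director A', 3)
```

`add_edge` creates missing nodes automatically, so separate `add_node` calls are only needed for people who should appear in the graph without any connection.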

In [None]:
def draw_network_graph(country_name):
    # Creating actor and director lists
    actors_list = set()
    for actor, country in zip(df.cast,df.country):
        if country_name in country:
            for substr in actor.strip().split(','):
                actors_list.add(substr.strip())
    if '' in actors_list:
        actors_list.remove('')

    directors_list = set()
    for director, country in zip(df.director,df.country):
        if country_name in country:
            for substr in director.strip().split(','):
                directors_list.add(substr.strip())
    if '' in directors_list:
        directors_list.remove('')
    
    # Rows belonging to the selected country (literal substring match, not regex)
    in_country = df['country'].str.contains(country_name, regex=False)

    # Creating dataframes for actor count and director count. Rows are collected
    # in lists first because DataFrame.append was removed in pandas 2.0
    actor_rows = [{'Name':actor,'Count':int((df['cast'].str.contains(actor, regex=False) & in_country).sum())}
                  for actor in actors_list]
    actor_count = pd.DataFrame(actor_rows, columns=['Name','Count'])
    actor_count.sort_values(by='Count',inplace=True,ascending=False)

    director_rows = [{'Name':director,'Count':int((df['director'].str.contains(director, regex=False) & in_country).sum())}
                     for director in directors_list]
    director_count = pd.DataFrame(director_rows, columns=['Name','Count'])
    director_count.sort_values(by='Count',inplace=True,ascending=False)

    # Keep only names with at least two words
    top_50_actors = actor_count[actor_count['Name'].str.contains(' ')].head(50)
    top_50_directors = director_count[director_count['Name'].str.contains(' ')].head(50)
    
    G = nx.DiGraph()
    
    for actor in top_50_actors['Name']:
        G.add_node(actor)
    for director in top_50_directors['Name']:
        G.add_node(director)
    # Add an edge whenever an actor and a director share at least one title in this country
    for actor in top_50_actors['Name']:
        acted = df['cast'].str.contains(actor, regex=False) & in_country
        for director in top_50_directors['Name']:
            if (acted & df['director'].str.contains(director, regex=False)).any():
                G.add_edge(actor, director)
    
    #Blue nodes for actors and red for directors
    color_map = []
    for node in G:
        if node in top_50_actors['Name'].values:
            color_map.append('blue')
        else:
            color_map.append('red')
    
    plt.figure(1,figsize=(30,30))
    nx.draw(G,node_color=color_map, with_labels=True,font_color='green',font_size=15)
    print('Max connections: '+ str(max(dict(G.degree()).items(), key = lambda x : x[1])))
    plt.show()

In [None]:
draw_network_graph('India')

I hope the notebook was useful.