# EDA of "Netflix Movies and TV Shows" dataset

The purpose of this notebook is to practice creating a visually appealing EDA.

From what I understand, this can sometimes be useful, because you can save the created graphs as PNG images and put them, for example, in MS PowerPoint, or show this type of notebook to your management as it is.

However, as I am writing these words after I did all the things below, I don't think that I would do that too often, because creating something cool-looking in .ipynb probably takes more time than creating something similar in Tableau or Power BI.

You can find a detailed description of the Netflix dataset [here](https://www.kaggle.com/shivamb/netflix-shows).

Besides the original data, I also added some external data (revenue / cost of content / number of memberships) from [Netflix's 2010-2020 Financial Statements](https://ir.netflix.net/financials/quarterly-earnings/default.aspx) to see the dynamics of these indicators.

I managed to create / recreate some of the visualizations below after studying these notebooks (really cool stuff, highly recommend!):
* [Netflix Data Visualization by Joshua Swords](https://www.kaggle.com/joshuaswords/netflix-data-visualization)
* [Storytelling with Data - Netflix ver. by Subin An](https://www.kaggle.com/subinium/storytelling-with-data-netflix-ver#Relation-Between-Month?)

So let's start!

In [None]:
# libraries for working with data
import pandas as pd
import numpy as np

# libraries for visualizations
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from matplotlib import gridspec

# library for visualizing missing values
import missingno as msno

# library for counting values in the lists
from collections import Counter

# setting to ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# reading csv and taking a first look at the data
df = pd.read_csv('../input/netflix-shows/netflix_titles.csv', parse_dates = ['date_added'])
df.head(3)

# Dealing with the missing values
Let's look at the missing values.

In [None]:
msno.matrix(df, fontsize = 14)
plt.show()

In [None]:
null_rate = df.isna().sum() / len(df) * 100
null_rate = pd.DataFrame(null_rate)\
              .rename(columns={0: 'share_of_nulls'})\
              .query('share_of_nulls>0')\
              .round(2)
null_rate

The features with a lot of missing values are `director`, `cast` and `country`.

We keep them and fill them in a simple way, because we will need them later.

The NaNs in `date_added` and `rating` are insignificant, so we just drop these rows.

In [None]:
# fill director and cast NaN with a placeholder
# (assigning the result back avoids chained in-place modification, which newer pandas versions warn about)
df['director'] = df['director'].fillna('No Data')
df['cast'] = df['cast'].fillna('No Data')

# fill country's NaN with the value that appears most often
df['country'] = df['country'].fillna(df['country'].mode()[0])

# drop NaN in other columns because they are insignificant
df.dropna(inplace=True)

# check that we filled or dropped all NaN's
df.isna().sum()

In addition, let's check the dataframe for duplicates.

In [None]:
df.duplicated().sum()

Let's take a look at descriptive statistics. Things I'm particularly interested in are **unique counts** and **first/last date**.

In [None]:
df.describe(include='all').head(6)

# Data processing
We will format some of the features later, but `date_added` is going to be used a lot, so it makes sense to process it here in advance.
What we want to do is extract date characteristics from the `date_added` column.

Let's also separate movies and TV shows into two different dataframes, which makes the later analysis easier.

In [None]:
# Extracting months, month's name and year of adding
df['month_added']=df['date_added'].dt.month
df['month_name_added']=df['date_added'].dt.month_name()
df['year_added'] = df['date_added'].dt.year

# Creating separate dataframes for Movies and TV Shows
movie = df[df['type'] == 'Movie']
tv_show = df[df['type'] == 'TV Show']

# Exploring the data through visualization

I think we should start the EDA by looking at the countries. To do that, we need to process the `country` column, because some movies / TV shows were produced internationally, which means a single row can list several countries that we have to split.
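As an aside, the splitting described above can be wrapped in a small reusable helper. The sketch below is one possible way to do it with pandas' `str.split`, `explode` and `value_counts`; the function name `count_items` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def count_items(series: pd.Series) -> pd.Series:
    """Count individual entries in a column of comma-separated strings."""
    items = (
        series
        .str.split(',')   # 'United States, India' -> ['United States', ' India']
        .explode()        # one entry per row
        .str.strip()      # remove spaces left around the commas
    )
    # Drop empty strings that a trailing comma would produce
    items = items[items != '']
    return items.value_counts()


# Toy data standing in for df['country'] (made up for illustration)
toy = pd.DataFrame({
    'country': ['United States, India', 'India', 'United Kingdom,United States'],
})

counts = count_items(toy['country'])
print(counts)
```

The same helper could be applied to any other comma-separated column, such as `cast`, `director` or `listed_in`.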

Let's also look at some general interesting facts:

In [None]:
country_df = df['country']
country_count = pd.Series(dict(Counter(','.join(country_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)
total = sum(country_count)
top_10 = sum(country_count[:10])

print(f'''
{round(country_df.str.contains(",").sum() / len(df) * 100)} percent of the movies/tv-shows were produced internationally.

Total number of times any country was involved in production - {total} (including international production).

Top 10 countries were involved in {top_10} cases. Thus, producing {round(top_10/total * 100, 2)} percent of the content.
''')

In [None]:
# Choosing data
data = country_count[:10]

# Setting main color
color_map = ['#999999' for _ in range(10)]
# Highlighting top 3 countries with Netflix's brand color - Symbol Dark Red
color_map[0] = color_map[1] = color_map[2] =  '#b20710'

# Visualizing bar chart
fig, ax = plt.subplots(1,1, figsize=(12, 6))
ax.bar(data.index, data, width=0.5, 
       edgecolor='darkgray',
       linewidth=0.6,color=color_map)

# Setting annotations
for i in data.index:
    ax.annotate(f"{data[i]}", 
                   xy=(i, data[i] + 100), # number sets how high above bar we want our annotation to be 
                   va = 'center', ha='center',fontweight='light', fontfamily='serif')

# Removing frames from a figure
for s in ['top', 'left', 'right']:
    ax.spines[s].set_visible(False)
    
# Making a horizontal grid and putting it behind bars
ax.grid(axis='y', linestyle='-', alpha=0.4)
ax.set_axisbelow(True)
#grid_y_ticks = np.arange(0, 4000, 500) # y ticks, min, max, then step
#ax.set_yticks(grid_y_ticks)

# Removing ticks (= reducing their length to zero)
ax.tick_params(axis=u'both', which=u'both',length=0)
    
# Setting title and sub-title 
fig.text(0.09, 1, 'Top 10 content producing countries on Netflix', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.09, 0.95, 'Top three countries have been highlighted.', fontsize=12, fontweight='light', fontfamily='serif')

# Removes unnecessary information
plt.show()

The US producing a lot of content isn't really a surprise, but India and the UK had me intrigued.
I don't know about the UK, but for India it's most certainly connected to the size of its population and the popularity of Bollywood.

In [None]:
# Building grid for 2 graphs
fig = plt.figure(figsize=(20, 6))
gs = gridspec.GridSpec(nrows=1, ncols=2,
                       height_ratios=[6], 
                       width_ratios=[10, 10])

    ### Pie Chart visualization ###

# Setting parameters
sizes = [round(len(movie) / len(df) * 100, 2), round(len(tv_show) / len(df) * 100, 2)]
explode = (0, 0.1)
color_map = ['#221f1f' for _ in range(2)]
color_map[0] = '#b20710'

# Visualizing pie chart
ax = plt.subplot(gs[0])
patches, texts, autotexts = ax.pie(sizes, explode=explode, autopct='%1.f%%',
                                   startangle=90, colors=color_map,
                                   textprops={'color':"w",'fontweight':'bold','fontsize':'25','fontfamily':'serif'}) # settings for values / text inside pie chart
# Setting a title    
fig.text(0.15, 0.93, 'Movies', fontsize=15, fontweight='bold', fontfamily='serif', color='#b20710')
fig.text(0.195, 0.93, '| TV Shows', fontsize=15, fontweight='bold', fontfamily='serif', color='#221f1f')
fig.text(0.262, 0.93, 'general distribution', fontsize=15, fontweight='bold', fontfamily='serif', color='black')

# Equal aspect ratio ensures that pie is drawn as a circle.
ax.axis('equal')

    ### Ratio visualization of content in top 10 countries ###

# Picking top 10 countries by amount of movies/tv-shows
country_order = df['country'].value_counts()[:10].index

# Building a table (filtering top 10) with countries and counting number of movies/tv-shows
df_mtv_cnt = df[['type', 'country']].groupby('country')['type']\
                                    .value_counts()\
                                    .unstack()\
                                    .loc[country_order]
# Calculating total amount of content 
df_mtv_cnt['sum'] = df_mtv_cnt.sum(axis=1)

# Calculating ratio and sorting
df_mtv_cnt_ratio = (df_mtv_cnt.T / df_mtv_cnt['sum']).T[['Movie', 'TV Show']]\
                                                     .sort_values(by='Movie',ascending=False)[::-1]

# Visualizing ratio
ax2 = plt.subplot(gs[1])
ax2.barh(df_mtv_cnt_ratio.index, df_mtv_cnt_ratio['Movie'], color='#b20710', label='Movie')
ax2.barh(df_mtv_cnt_ratio.index, df_mtv_cnt_ratio['TV Show'], left=df_mtv_cnt_ratio['Movie'], color='#221f1f', label='TV Show')

# Deleting x-ticks
ax2.set_xticks([])
ax2.set_yticklabels(df_mtv_cnt_ratio.index, fontfamily='serif', fontsize=15)

# Percentage in bars
for i in df_mtv_cnt_ratio.index:
    ax2.annotate(f"{round(df_mtv_cnt_ratio['Movie'][i]*100)}%", 
                   xy=(df_mtv_cnt_ratio['Movie'][i]/2, i),
                   va = 'center', ha='center',fontsize=12, fontweight='bold', fontfamily='serif', color='white')

for i in df_mtv_cnt_ratio.index:
    ax2.annotate(f"{round(df_mtv_cnt_ratio['TV Show'][i]*100)}%", 
                   xy=(df_mtv_cnt_ratio['Movie'][i]+df_mtv_cnt_ratio['TV Show'][i]/2, i),
                   va = 'center', ha='center',fontsize=12, fontweight='bold', fontfamily='serif', color='white')
    

# Settings for text
fig.text(0.545, 0.93, 'Movies', fontsize=15, fontweight='bold', fontfamily='serif', color='#b20710')
fig.text(0.59, 0.93, '| TV Shows', fontsize=15, fontweight='bold', fontfamily='serif', color='#221f1f')  
fig.text(0.658, 0.93, 'split of top 10 countries*', fontsize=15, fontweight='bold', fontfamily='serif')   
fig.text(0.547, 0.89, '* by the amount of produced content', fontsize=12,fontfamily='serif')   

# Deleting frames of the graph
for s in ['top', 'left', 'right', 'bottom']:
    ax2.spines[s].set_visible(False)
    
# Deleting y-axis ticks
ax2.tick_params(axis=u'both', which=u'both',length=0)

plt.show()

In general, about 2/3 of the content is represented by movies. However, if we look at the ratio by country, we can see that India's and Egypt's content is represented almost entirely by movies, while Japan and South Korea are represented more by their TV shows.

I think this is a great example of how Netflix adds the most recognisable / popular content of every country to its library:
* From India - Bollywood's movies
* From Japan - Anime
* From Korea - K-dramas

Graphs below (top 5 genres) somewhat confirm this. 

Don't know much about Egyptian movies, so can't say whether they are popular in the world or not.
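The Movie / TV Show split per country discussed above can also be computed in one step. The following is a minimal sketch using `pd.crosstab` with `normalize='index'`; it is an alternative to the `groupby` / `unstack` approach used in this notebook, and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy data standing in for df[['country', 'type']] (values are made up)
toy = pd.DataFrame({
    'country': ['India', 'India', 'India', 'Japan', 'Japan', 'Japan', 'Japan'],
    'type': ['Movie', 'Movie', 'TV Show', 'TV Show', 'TV Show', 'TV Show', 'Movie'],
})

# normalize='index' divides every row by its total,
# so each country's Movie and TV Show shares sum to 1
share = pd.crosstab(toy['country'], toy['type'], normalize='index')
print(share.round(2))
```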

In [None]:
# Formatting data to build graphs below
m_india = movie[movie.country == 'India']
m_india_df = m_india['listed_in']
m_india_count = pd.Series(dict(Counter(','.join(m_india_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)


tv_japan = tv_show[tv_show.country == 'Japan']
tv_japan_df = tv_japan['listed_in']
tv_japan_count = pd.Series(dict(Counter(','.join(tv_japan_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)

tv_korea = tv_show[tv_show.country == 'South Korea']
tv_korea_df = tv_korea['listed_in']
tv_korea_count = pd.Series(dict(Counter(','.join(tv_korea_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)

In [None]:
fig = plt.figure(figsize=(20, 4))
gs = gridspec.GridSpec(nrows=1, ncols=3,
                       height_ratios=[4], 
                       width_ratios=[12, 12, 12],
                       wspace=0.5)

# Choosing top 5 genres
data_1 = m_india_count[0:5]
data_2 = tv_japan_count[0:5]
data_3 = tv_korea_count[0:5]

# Setting colors and highlighting top 3
color_map_1 = ['#999999' for _ in range(11)]
color_map_1[0] = color_map_1[1] = color_map_1[2] = '#b20710'

color_map_2 = ['#999999' for _ in range(11)]
color_map_2[0] = color_map_2[1] = color_map_2[2] = '#221f1f'

# Visualizing bar charts
ax = plt.subplot(gs[0])
ax.barh(data_1.index, data_1, alpha=0.8, edgecolor='darkgray',color=color_map_1)

ax2 = plt.subplot(gs[1])
ax2.barh(data_2.index, data_2, alpha=0.8, edgecolor='darkgray',color=color_map_2)

ax3 = plt.subplot(gs[2])
ax3.barh(data_3.index, data_3, alpha=0.8, edgecolor='darkgray',color=color_map_2)

# Setting annotations for India graph
for i in data_1.index:
    ax.annotate(f"{data_1[i]}", 
                   xy=(data_1[i]/2, i),
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")
# Setting annotations for Japan graph
for i in data_2.index:
    ax2.annotate(f"{data_2[i]}", 
                   xy=(data_2[i]/2, i),
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")
# Setting annotations for Korea graph
for i in data_3.index:
    ax3.annotate(f"{data_3[i]}", 
                   xy=(data_3[i]/2, i),
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")

# Removing frames from a figure
for s in ['top', 'bottom', 'left', 'right']:
    ax.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    ax3.spines[s].set_visible(False)  

# Removing ticks (= reducing their length to zero) and bottom labels
ax.tick_params(axis=u'both', which=u'both',length=0)
ax.axes.get_xaxis().set_visible(False)
ax2.tick_params(axis=u'both', which=u'both',length=0)
ax2.axes.get_xaxis().set_visible(False)
ax3.tick_params(axis=u'both', which=u'both',length=0)
ax3.axes.get_xaxis().set_visible(False)
    
# Setting titles
ax.set_title('India', fontsize=16, fontweight='bold', fontfamily='serif', color="black")
ax2.set_title('Japan', fontsize=16, fontweight='bold', fontfamily='serif', color="black")
ax3.set_title('South Korea', fontsize=16, fontweight='bold', fontfamily='serif', color="black")

# Removes unnecessary information
plt.show()

# Adding content over the years

From what I understand after reading the [timeline of Netflix on Wikipedia](https://en.wikipedia.org/wiki/Timeline_of_Netflix), events in 2015 and 2016 played a major role in the rapid growth of the platform's content and membership base.

**Most noticeable events of 2015 and 2016**
* **02/09/2015** - Netflix launched its streaming service in Japan.
* **06/01/2016** - Netflix announced a major international expansion into 150 additional countries, with one notable exclusion - China.
* **11/02/2016** - Netflix finished the massive migration of its data servers to Amazon Web Services.

In [None]:
# Setting size and colors
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
color = ["#b20710", "#221f1f"]

# Defining types and counting # of values
for i, mtv in enumerate(df['type'].value_counts().index):
    mtv_rel = df[df['type']==mtv]['year_added'].value_counts().sort_index()
    
    # Building charts
    ax.plot(mtv_rel.index, mtv_rel, color=color[i], label=mtv)
    
    # Filling space under lines
    ax.fill_between(mtv_rel.index, 0, mtv_rel, color=color[i], alpha=0.85)
    
# Moving y-axis values to right
ax.yaxis.tick_right()
    
# Settings for bottom line
#ax.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)

# Removing frames from a figure
for s in ['top', 'right','bottom','left']:
    ax.spines[s].set_visible(False)

# Deleting grid
ax.grid(False)

# Limiting x range (in our case - "years")
ax.set_xlim(2008,2020)

# Adjust x-axis steps  
plt.xticks(np.arange(2008, 2021, 1))

# Adding and formatting title and legend "Movie | TV Show"
fig.text(0.13, 0.85, 'Movies', fontsize=18, fontweight='bold', fontfamily='serif', color='#b20710')
fig.text(0.22, 0.85, '| TV Show', fontsize=18, fontweight='bold', fontfamily='serif', color='#221f1f')
fig.text(0.345, 0.85, 'added over time', fontsize=18, fontweight='bold', fontfamily='serif', color='black')

# Removing ticks (= reducing their length to zero)
ax.tick_params(axis=u'both', which=u'both',length=0)

# Adding white strap with stars
ax.axvspan(xmin=2015, xmax=2016, ymin=0.03, ymax=0.95,  color='white', alpha=0.6, linestyle='-', linewidth=0, hatch='*')

# Removes unnecessary information
plt.show()

Let's load some information that we manually collected from [Netflix's 2010-2020 Financial Statements](https://ir.netflix.net/financials/quarterly-earnings/default.aspx).

In [None]:
rev_content_costs = pd.read_csv('../input/fs-and-subs/FS 2010-2020.csv')
subs_after_2017 = pd.read_csv('../input/fs-and-subs/Subscribers after 2017.csv')
subs_before_2017 = pd.read_csv('../input/fs-and-subs/Subscribers before 2017.csv')

Before plotting the graph, we need to merge and format the data first.
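The merge-and-pick-year-end step can be illustrated on toy data. This is only a sketch under stated assumptions: the numbers are hypothetical (not real Netflix figures), the column names mirror the ones used in this notebook, and `groupby(...).last()` is swapped in as one way to pick the last reported quarter of each year without relying on row positions. It uses `pd.concat`, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical quarterly totals (in thousands), NOT real Netflix numbers
before = pd.DataFrame({
    'date': ['2016-03-31', '2016-06-30', '2016-09-30', '2016-12-31'],
    'total': [100, 110, 120, 130],
})
after = pd.DataFrame({
    'date': ['2017-03-31', '2017-06-30', '2017-09-30', '2017-12-31'],
    'total': [140, 150, 160, 170],
})

# Stack the two tables on top of each other and rebuild the index
total = pd.concat([before, after], ignore_index=True)
total['date'] = pd.to_datetime(total['date'])
total['year'] = total['date'].dt.year

# Keep the last reported quarter of each year
year_end = total.sort_values('date').groupby('year', as_index=False).last()
print(year_end[['year', 'total']])
```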

In [None]:
# Merging datasets to see the dynamics of the total number of memberships from 2010 to 2020
subs_before_2017['total'] = subs_before_2017.subs_US + subs_before_2017.subs_international
subs_after_2017['total'] = subs_after_2017.subs_US_Canada + subs_after_2017.subs_EMEA + subs_after_2017.subs_LATAM + subs_after_2017.subs_Asia_Pacific
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
total_subs = pd.concat([subs_before_2017[['date','total']], subs_after_2017[['date','total']]])

# Extracting year from the date
total_subs["date"] = pd.to_datetime(total_subs['date'])
total_subs['year'] = total_subs['date'].dt.year

# Selecting only number of memberships at the end of the year
# (.copy() avoids modifying a slice of total_subs in the next line)
total_subs_years = total_subs.iloc[3::4].copy()
total_subs_years['total'] = total_subs_years['total'] / 1000

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))

# Creating barplot
ax.bar(total_subs_years.year, total_subs_years.total,
       width=0.6, edgecolor='darkgray',
       linewidth=0.6, color="#b20710")

# Removing first and last years for aesthetic reasons
plt.xticks(np.arange(2011, 2021, 1))

# Removing ticks (= reducing their length to zero)
ax.tick_params(axis=u'both', which=u'both',length=0)

# Making a horizontal grid and putting it behind bars
ax.grid(axis='y', linestyle='-', alpha=0.4)
ax.set_axisbelow(True)

# Setting step of the grid 
grid_y_ticks = np.arange(0, 225, 25) # y ticks, min, max, then step
ax.set_yticks(grid_y_ticks)

# Removing frames from a figure
for s in ['top', 'right','left']:
    ax.spines[s].set_visible(False)
    
# Limiting x range (in our case - "years")
ax.set_xlim(2010,2021)    
    
# Setting title and sub-title 
fig.text(0.09, 0.92, 'Netflix paying streaming subscribers, millions', fontsize=15, fontweight='bold', fontfamily='serif')

# Removes unnecessary information
plt.show()

The number of paying users has been steadily increasing since 2011, by about 20 million new members annually.
According to the Q4 2020 Financial Statement, at the end of 2020 the number of Netflix paying subscribers was **203.66 million**.

In [None]:
# Removing unnecessary columns and rows
# (.copy() lets us add new columns below without modifying a slice)
rev_content_costs.drop(columns=rev_content_costs.columns[4:14], inplace=True)
rev_content_costs = rev_content_costs.iloc[:44].copy()

rev_content_costs["date"] = pd.to_datetime(rev_content_costs['date'])
rev_content_costs['year'] = rev_content_costs['date'].dt.year

rcn_1 = rev_content_costs.groupby('year', as_index=False)\
                       .agg({'revenue':'sum', 'cost_of_content':'sum', 'net income':'sum'})\
                       .rename(columns={'net income':'net_income'})

rcn = round(rcn_1.iloc[:,1:4] / 1000)

rcn['year'] = rcn_1['year']
rcn['net_profit_margin'] = round(rcn.net_income / rcn.revenue * 100)

In [None]:
fig = plt.figure(figsize=(20, 6))
gs = gridspec.GridSpec(nrows=1, ncols=2,
                       height_ratios=[6], 
                       width_ratios=[15, 15],
                       wspace=0.2)

# Building plots
ax = plt.subplot(gs[0])
ax.plot(rcn.year, rcn.revenue, color="#b20710", alpha=0.85)
ax.plot(rcn.year, rcn.cost_of_content, color='#999999', alpha=1)
ax2 = plt.subplot(gs[1])
ax2.plot(rcn.year, rcn.net_income, color="#b20710", alpha=0.85)
    
# Filling space under lines
ax.fill_between(rcn.year, 0,  rcn.revenue, color="#b20710", alpha=0.85)
ax.fill_between(rcn.year, 0,  rcn.cost_of_content, color='#999999')
ax2.fill_between(rcn.year, 0,  rcn.net_income, color="#b20710", alpha=0.85)
    
# Moving y-axis values to right
ax.yaxis.tick_right()
ax2.yaxis.tick_right()
    
# Removing frames from a figure
for s in ['top', 'right','bottom','left']:
    ax.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)

# Limiting x range (in our case - "years")
ax.set_xlim(2010,2020)
ax2.set_xlim(2010,2020)

# Adjust x-axis steps  
#plt.xticks(np.arange(2010, 2021, 1))

# Adding and formatting titles for both charts
fig.text(0.125, 0.9, 'Revenue', fontsize=15, fontweight='bold', fontfamily='serif', color="#b20710", alpha=0.85)
fig.text(0.178, 0.9, 'and', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.203, 0.9, 'Content Add-on Costs', fontsize=15, fontweight='bold', fontfamily='serif', color='#999999')
fig.text(0.332, 0.9, ', USD millions', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.545, 0.9, 'Net Income, USD millions', fontsize=15, fontweight='bold', fontfamily='serif')

# Removing ticks (= reducing their length to zero)
ax.tick_params(axis=u'both', which=u'both',length=0)
ax2.tick_params(axis=u'both', which=u'both',length=0)

# Removes unnecessary information
plt.show()

I thought it would be interesting to look at Netflix's revenue and content add-on costs side by side.


As you can see, up until 2018 Netflix spent a significant share of its revenue on licensing and original content. Although the company was always profitable, only starting in 2019 did it begin to see significant benefits, with a noticeable increase in revenue and net income, which, at least for 2020, is probably connected to COVID-19 and people staying at home more often. Another theory is that Netflix finally built a library big enough to keep attracting more people without increasing the cost of adding content.

There is also a noticeable dip in content add-on expenses in 2020, which is probably connected to the specifics of licensing. It would be logical to assume that some content is licensed for several years rather than renewed annually.

# Actors (cast)
Let's look at the most represented actors on Netflix.

In [None]:
# Formatting data for the actors graphs
castm_df = movie['cast']
castm_count = pd.Series(dict(Counter(','.join(castm_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)

casttv_df = tv_show['cast']
casttv_count = pd.Series(dict(Counter(','.join(casttv_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)

In [None]:
fig = plt.figure(figsize=(20, 6))
gs = gridspec.GridSpec(nrows=1, ncols=2,
                       height_ratios=[6], 
                       width_ratios=[10, 10])

# Choosing top 10 actors.
# As a reminder, 9 percent of the cast data was missing and we filled it with 'No Data'.
# Because of that we need to exclude the first row, which contains 'No Data'
data_1 = castm_count[1:11]
data_2 = casttv_count[1:11]

# Setting colors and highlighting top 3
color_map_1 = ['#999999' for _ in range(11)]
color_map_1[0] = color_map_1[1] = color_map_1[2] = '#b20710'

color_map_2 = ['#999999' for _ in range(11)]
color_map_2[0] = color_map_2[1] = color_map_2[2] = '#221f1f'

# Visualizing bar chart
ax = plt.subplot(gs[0])
ax.barh(data_1.index, data_1, alpha=0.8, edgecolor='darkgray',color=color_map_1)


ax2 = plt.subplot(gs[1])
ax2.barh(data_2.index, data_2, alpha=0.8, edgecolor='darkgray',color=color_map_2)

# Setting annotations for movies graph
for i in data_1.index:
    ax.annotate(f"{data_1[i]}",
                   xy=(data_1[i]/2, i), # placing the label in the middle of the bar
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")
    
# Setting annotations for tv shows graph
for i in data_2.index:
    ax2.annotate(f"{data_2[i]}",
                   xy=(data_2[i]/2, i), # placing the label in the middle of the bar
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")
    
    
# Removing frames from a figure
for s in ['top', 'bottom', 'left', 'right']:
    ax.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    
# Removing ticks (= reducing their length to zero) and bottom labels
ax.tick_params(axis=u'both', which=u'both',length=0)
ax.axes.get_xaxis().set_visible(False)
ax2.tick_params(axis=u'both', which=u'both',length=0)
ax2.axes.get_xaxis().set_visible(False)
    
# Setting title and sub-title 
fig.text(0.09, 1, 'Actors with the most presence on NETFLIX', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.09, 0.96, 'Numbers inside bars represent number of', fontsize=12, fontweight='light', fontfamily='serif')
fig.text(0.272, 0.96, 'Movies', fontsize=13, fontweight='bold', fontfamily='serif', color="#b20710")
fig.text(0.31, 0.96, '/ TV Shows', fontsize=13, fontweight='bold', fontfamily='serif', color="#221f1f")

# Removes unnecessary information
plt.show()

With a little bit of googling, these charts make sense:
* **Anupam Kher** is a famous Indian actor in Bollywood movies
* **Takahiro Sakurai** is a famous Japanese voice actor who has appeared in many anime series

The tables below also confirm that.
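The row filtering used in the next cells checks whether a name is one of the comma-separated entries in `cast`. A minimal sketch of that idea as a small helper is shown here; the function name `has_name` is hypothetical, and apart from Anupam Kher the names in the toy data are placeholders.

```python
import pandas as pd


def has_name(cell: str, name: str) -> bool:
    """Return True if `name` is one of the comma-separated entries in `cell`."""
    return name in [part.strip() for part in cell.split(',')]


# Toy data standing in for df[['title', 'cast']] (made up for illustration)
toy = pd.DataFrame({
    'title': ['X', 'Y', 'Z'],
    'cast': ['Anupam Kher, Someone Else', 'No Data', 'Another Person,Anupam Kher'],
})

# Boolean mask: True for rows where the actor appears in the cast
mask = toy['cast'].apply(lambda cell: has_name(cell, 'Anupam Kher'))
print(toy[mask])
```

Matching whole entries rather than substrings avoids false hits when one name is contained in another.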

In [None]:
search_list = set(['Takahiro Sakurai'])
df['actors_j'] = df['cast'].apply(lambda x: set.intersection(set(x.replace(' ,',',').replace(', ',',').split(',')), search_list))
df[df['actors_j'].astype(bool)].head()

In [None]:
search_list = set(['Anupam Kher'])
df['actors_i'] = df['cast'].apply(lambda x: set.intersection(set(x.replace(' ,',',').replace(', ',',').split(',')), search_list))
df[df['actors_i'].astype(bool)].head()

# Directors
Let's look at the most represented directors on Netflix.

In [None]:
# Formatting data for the directors graphs
directorm_df = movie['director']
directorm_count = pd.Series(dict(Counter(','.join(directorm_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)

directortv_df = tv_show['director']
directortv_count = pd.Series(dict(Counter(','.join(directortv_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)

In [None]:
fig = plt.figure(figsize=(20, 6))
gs = gridspec.GridSpec(nrows=1, ncols=2,
                       height_ratios=[6], 
                       width_ratios=[10, 10])

# Choosing top 10 directors.
# As a reminder, 30 percent of the director data was missing and we filled it with 'No Data'.
# Because of that we need to exclude the first row, which contains 'No Data'
data_1 = directorm_count[1:11]
data_2 = directortv_count[1:11]

# Setting colors and highlighting top 3
color_map_1 = ['#999999' for _ in range(11)]
color_map_1[0] = color_map_1[1] = color_map_1[2] = '#b20710'

color_map_2 = ['#999999' for _ in range(11)]
color_map_2[0] = color_map_2[1] = color_map_2[2] = '#221f1f'

# Visualizing bar chart
ax = plt.subplot(gs[0])
ax.barh(data_1.index, data_1, alpha=0.8, edgecolor='darkgray',color=color_map_1)


ax2 = plt.subplot(gs[1])
ax2.barh(data_2.index, data_2, alpha=0.8, edgecolor='darkgray',color=color_map_2)

# Setting annotations for movies graph
for i in data_1.index:
    ax.annotate(f"{data_1[i]}",
                   xy=(data_1[i]/2, i), # placing the label in the middle of the bar
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")
    
# Setting annotations for tv shows graph
for i in data_2.index:
    ax2.annotate(f"{data_2[i]}",
                   xy=(data_2[i]/2, i), # placing the label in the middle of the bar
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")
    
    
# Removing frames from a figure
for s in ['top', 'bottom', 'left', 'right']:
    ax.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    
# Removing ticks (= reducing their length to zero) and bottom labels
ax.tick_params(axis=u'both', which=u'both',length=0)
ax.axes.get_xaxis().set_visible(False)
ax2.tick_params(axis=u'both', which=u'both',length=0)
ax2.axes.get_xaxis().set_visible(False)
    
# Setting title and sub-title 
fig.text(0.09, 1, 'Directors with the most presence on NETFLIX', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.09, 0.96, 'Numbers inside bars represent number of', fontsize=12, fontweight='light', fontfamily='serif')
fig.text(0.272, 0.96, 'Movies', fontsize=13, fontweight='bold', fontfamily='serif', color="#b20710")
fig.text(0.31, 0.96, '/ TV Shows', fontsize=13, fontweight='bold', fontfamily='serif', color="#221f1f")

# Removes unnecessary information
plt.show()

Interestingly, movie directors have a lot more titles than TV show directors. My theory is that directing a TV show takes more time, so TV show directors can't take part in as many projects.

Examples of movies by Jan Suter (a Mexican film director):

In [None]:
search_list = set(['Jan Suter'])
df['directors'] = df['director'].apply(lambda x: set.intersection(set(x.replace(' ,',',').replace(', ',',').split(',')), search_list))
df[df['directors'].astype(bool)].head()

# Genres

Let's look at which genres are represented the most in the Netflix library.

In [None]:
# Formatting data for the genre graphs
genrem_df = movie['listed_in']
genrem_count = pd.Series(dict(Counter(','.join(genrem_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)

genretv_df = tv_show['listed_in']
genretv_count = pd.Series(dict(Counter(','.join(genretv_df)\
                                          .replace(' ,',',')\
                                          .replace(', ',',')\
                                          .split(','))))\
                                          .sort_values(ascending=False)

In [None]:
fig = plt.figure(figsize=(20, 6))
gs = gridspec.GridSpec(nrows=1, ncols=2,
                       height_ratios=[6], 
                       width_ratios=[10, 10])

# Choosing top 10 genres
data_1 = genrem_count[0:10]
data_2 = genretv_count[0:10]

# Setting colors and highlighting top 3
color_map_1 = ['#999999' for _ in range(11)]
color_map_1[0] = color_map_1[1] = color_map_1[2] = '#b20710'

color_map_2 = ['#999999' for _ in range(11)]
color_map_2[0] = color_map_2[1] = color_map_2[2] = '#221f1f'

# Visualizing bar chart
ax = plt.subplot(gs[0])
ax.barh(data_1.index, data_1, alpha=0.8, edgecolor='darkgray',color=color_map_1)


ax2 = plt.subplot(gs[1])
ax2.barh(data_2.index, data_2, alpha=0.8, edgecolor='darkgray',color=color_map_2)

# Setting annotations for the movies graph
for i in data_1.index:
    ax.annotate(f"{data_1[i]}", 
                   xy=(data_1[i]/2, i), # x = half of the bar length, so the label sits in the middle of the bar 
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")
    
# Setting annotations for the tv shows graph
for i in data_2.index:
    ax2.annotate(f"{data_2[i]}", 
                   xy=(data_2[i]/2, i), # x = half of the bar length, so the label sits in the middle of the bar 
                   va = 'center', ha='center',fontweight='bold', fontfamily='serif', fontsize=12, color="w")
    
    
# Removing frames from a figure
for s in ['top', 'bottom', 'left', 'right']:
    ax.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    
# Removing ticks (= reducing their length to zero) and bottom labels
ax.tick_params(axis=u'both', which=u'both',length=0)
ax.axes.get_xaxis().set_visible(False)
ax2.tick_params(axis=u'both', which=u'both',length=0)
ax2.axes.get_xaxis().set_visible(False)
    
# Setting title and sub-title 
fig.text(0.09, 1, 'Genres with the most presence on NETFLIX', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.09, 0.96, 'Numbers inside bars represent quantity of', fontsize=12, fontweight='light', fontfamily='serif')
fig.text(0.274, 0.96, 'Movies', fontsize=13, fontweight='bold', fontfamily='serif', color="#b20710")
fig.text(0.313, 0.96, '/ TV Shows', fontsize=13, fontweight='bold', fontfamily='serif', color="#221f1f")

# Removes unnecessary information
plt.show()

International titles, Dramas and Comedies dominate both Movies and TV Shows, which is not really a surprise.

# Correlation of genres
To see the correlation between genres, we need to create **a function for processing and visualizing correlation.**


I took the code from [this notebook](https://www.kaggle.com/subinium/storytelling-with-data-netflix-ver#Comparison-by-country-for-time), studied it and saved it for future use.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

def relation_heatmap(df, title):
    
    # Splitting column listed_in by comma into a list where each genre is a list item
    df['genre'] = df['listed_in'].apply(lambda x : x.replace(' ,',',')\
                                                    .replace(', ',',')\
                                                    .split(','))
    
    # Adding '+=' all the list items from 'genre' into a single list
    Types = []
    for i in df['genre']:
        Types += i
    
    # Creating a set (= only unique values are left)
    Types = set(Types)
    print(f"There are {len(Types)} genres in the Netflix {title} Dataset")    
    
    # Taking column with lists and transforming it into a matrix where genres are added as columns. The number of rows remained the same.
    # If genre was in a particular row it is marked as 1, and if not, it is 0.
    test = df['genre']
    mlb = MultiLabelBinarizer()
    res = pd.DataFrame(mlb.fit_transform(test), columns=mlb.classes_, index=test.index)
    
    # Computing correlation of columns
    corr = res.corr()
    
    # Creating a mask for visualization purposes. Data will not be shown in cells where mask is 'True'.
    
    # First we create an array with the same shape as the corr matrix, but all values are 'False'
    # (plain `bool` is used because the `np.bool` alias was removed in NumPy 1.24)
    mask = np.zeros_like(corr, dtype=bool)
    # Turning the upper triangle of the array into 'True'
    mask[np.triu_indices_from(mask)] = True
    
    # Creating a graph
    fig, ax = plt.subplots(figsize=(15, 14))
    pl = sns.heatmap(corr, mask=mask, cmap= "RdGy", vmax=.5, vmin=-.5, center=0, square=True, linewidths=.7, cbar_kws={"shrink": 0.6})
    
    plt.show()
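
To see what the binarization and correlation steps inside the function produce, here is a minimal sketch on toy data. The four `listed_in` values are hypothetical, and `Series.str.get_dummies` is swapped in for `MultiLabelBinarizer` only to keep the example pandas-only; both build a 0/1 matrix with one column per genre.

```python
import pandas as pd

# Hypothetical toy values in the same format as the 'listed_in' column
listed_in = pd.Series([
    "Dramas, International Movies",
    "Documentaries",
    "Dramas, International Movies",
    "Documentaries, International Movies",
])

# Removing the space after each comma, then building one 0/1 column per genre
one_hot = listed_in.str.replace(", ", ",", regex=False).str.get_dummies(sep=",")

# Pearson correlation between the 0/1 genre columns
corr = one_hot.corr()
print(corr.round(2))
```

In this toy set no title is both a documentary and a drama, so that pair gets a correlation of -1, while Dramas and International Movies come out positively correlated because they appear together.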

In [None]:
relation_heatmap(movie, 'Movie')

Quite interesting results.

It looks like **Documentaries / Dramas** and **Children & Family movies / International Movies** have a pretty strong negative correlation.
My theory is that **Documentaries** focus more on facts and an objective representation of events than on human emotions. As for **Children & Family movies**, I think they are created more for local markets (probably something connected with national culture / language).

On the contrary, **Independent** and **International movies** correlate well with **Dramas**. Thinking back on some independent and international movies that I have watched, I think that's quite true.

In [None]:
relation_heatmap(tv_show, 'TV Show')

Similar to the movie genre correlation results, **Kids' TV** tends to be local rather than international.

**Science & Nature TV** has a strong correlation with **Docuseries**, which is quite logical.

# Popular words in titles

It is interesting to see whether movies and tv-shows share the same keywords in their titles.

I took and studied the code for visualizing this from [here](https://www.kaggle.com/dmitryuarov/netflix-eda-with-plotly).

In [None]:
from wordcloud import WordCloud
import random
from PIL import Image
import matplotlib

# Making custom colour map based on Netflix palette
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#b20710', '#221f1f'])

# Defining list of words and mask
text = str(list(df['title'])).replace(',', '').replace('[', '').replace("'", '').replace(']', '').replace('.', '')
mask = np.array(Image.open('../input/circle-mask/circle-mask-png.png'))

wordcloud = WordCloud(background_color = 'white', colormap=cmap, max_words = 20000, mask = mask).generate(text)

plt.figure( figsize=(6,6))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Most popular words are Love, Christmas, World, Man, Girl, Story, Time and Life.
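
A word cloud only shows frequencies visually, so a quick way to double-check the ranking is to count the words directly. Below is a minimal sketch on a hypothetical list of titles. The stop-word set and the regular expression are my own assumptions and are not what `WordCloud` uses internally.

```python
import re
from collections import Counter

# Hypothetical toy titles; in the notebook this would be list(df['title'])
titles = [
    "Love Actually",
    "A Christmas Prince",
    "Love Story",
    "The Christmas Chronicles",
    "Man of the World",
]

# A tiny, hand-picked stop-word set (an assumption, extend as needed)
stop_words = {"a", "the", "of", "and", "in"}

# Lower-casing titles, extracting words and dropping stop words
words = [
    w
    for title in titles
    for w in re.findall(r"[a-z']+", title.lower())
    if w not in stop_words
]

word_counts = Counter(words)
top_words = word_counts.most_common(2)
print(top_words)
```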

# Best month(s) for adding content
In which months is the most content added?

In [None]:
cnt_months = df.groupby(['month_added', 'month_name_added'], as_index=False)\
               .agg({'type':'count'})

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))

# Creating barplot
ax.bar(cnt_months.month_name_added, cnt_months.type,
       width=0.6, edgecolor='#b20710', color="#b20710", alpha=0.9)

# Removing ticks (= reducing their length to zero)
ax.tick_params(axis=u'both', which=u'both',length=0)

# Making a horizontal grid and putting it behind bars
ax.grid(axis='y', linestyle='-', alpha=0.4)
ax.set_axisbelow(True)

# Setting step of the grid 
grid_y_ticks = np.arange(0, 900, 100) # y ticks, min, max, then step
ax.set_yticks(grid_y_ticks)

# Removing frames from a figure
for s in ['top', 'right','left']:
    ax.spines[s].set_visible(False)
    
# Setting title   
fig.text(0.09, 0.92, 'Number of titles (Movies and TV Shows) added by months', fontsize=15, fontweight='bold', fontfamily='serif')

# Removes unnecessary information
plt.show()

It looks like the most content is added in the last three months of the year and in January, which is quite logical considering that, in most countries, people have more free time during the winter holidays / Christmas. Let's also not forget that Christmas is one of the most popular words in movie / TV show titles!
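
To back up this kind of observation with numbers, each month's share of all added titles can be computed. This is a minimal sketch on hypothetical dates; the column names follow the ones used above, and deriving them with `dt.month` / `dt.month_name()` is an assumption about how they could be built.

```python
import pandas as pd

# Hypothetical toy dates standing in for df['date_added']
toy = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2019-12-01", "2019-12-15", "2020-01-01", "2020-06-10"]
    )
})

# Month number (for sorting) and month name (for labels)
toy["month_added"] = toy["date_added"].dt.month
toy["month_name_added"] = toy["date_added"].dt.month_name()

# Share of titles added in each month, in percent
month_share = (
    toy.groupby(["month_added", "month_name_added"])
       .size()
       .div(len(toy))
       .mul(100)
       .round(1)
)
print(month_share)
```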

# Maturity rating

Before visualizing the distribution of content let's see what types of ratings we have in general.

In [None]:
cnt_rating = df['rating'].value_counts()
cnt_rating             

The meaning and age group of most rating types can be found in [Netflix's Help Center (information for the US)](https://help.netflix.com/en/node/2064/us):

**Kids** - TV-Y, TV-Y7, G, TV-G, PG, TV-PG


**Teens** - PG-13, TV-14


**Adults** - R, TV-MA, NC-17

After a little bit of googling, I found the meaning of the remaining types too:

**Not Rated** - NR, UR


**Kids** - TV-Y7-FV (basically a TV-Y7, but FV indicates that a program contains “fantasy violence”)


In [None]:
# Setting order of ratings based on the age group and rearranging the array
rating_order =  ['NR', 'UR', 'TV-Y', 'TV-Y7', 'TV-Y7-FV', 'G', 'TV-G', 'PG', 'TV-PG', 'PG-13', 'TV-14', 'R', 'TV-MA', 'NC-17']   
df_rating = df['rating'].value_counts()[rating_order]

I managed to understand settings for annotations and visual split of age groups thanks to [this notebook](https://www.kaggle.com/subinium/storytelling-with-data-netflix-ver#Relation-Between-Month?).

In [None]:
fig, ax = plt.subplots(1,1, figsize=(16, 6))

# Creating a graph
ax.bar(df_rating.index, df_rating, color="#b20710", width=0.6, edgecolor='#b20710')

# Removing frames from a figure
for s in ['top', 'right','left']:
    ax.spines[s].set_visible(False)

# Removing ticks (= reducing their length to zero)
ax.tick_params(axis=u'both', which=u'both',length=0)

# Making a horizontal grid and putting it behind bars
ax.grid(axis='y', linestyle='-', alpha=0.4)
ax.set_axisbelow(True)    

# Setting colors for age groups
color =  ['grey',  'green',  'orange',  'red']
span_range = [[0, 1], [2, 8], [9,  10], [11, 13]]

# Settings for annotations
for idx, sub_title in enumerate(['Not Rated', 'Kids', 'Teens', 'Adults']):
    ax.annotate(sub_title,
                xy=(sum(span_range[idx])/2 ,3050),
                xytext=(0,0), textcoords='offset points',
                va="center", ha="center",
                color="w", fontsize=16, fontweight='bold',
                bbox=dict(boxstyle='round4', pad=0.4, color=color[idx], alpha=0.6))
    ax.axvspan(span_range[idx][0]-0.4,span_range[idx][1]+0.4,  color=color[idx], alpha=0.05)

# Y-axis height setting
ax.set_ylim(0, 3300) 

# Setting title
ax.set_title('Distribution of maturity ratings', fontsize=18, fontweight='bold', fontfamily='serif')

# Removes unnecessary information
plt.show()

Judging by the distribution, most of the content is directed at adults and teens.
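
The same conclusion can be checked numerically by mapping every rating to its age group and counting per group. Here is a minimal sketch: the grouping follows the lists above, while `toy_ratings` is a hypothetical sample rather than real data.

```python
from collections import Counter

# Age groups as described above
age_groups = {
    "Not Rated": ["NR", "UR"],
    "Kids": ["TV-Y", "TV-Y7", "TV-Y7-FV", "G", "TV-G", "PG", "TV-PG"],
    "Teens": ["PG-13", "TV-14"],
    "Adults": ["R", "TV-MA", "NC-17"],
}

# Inverting the dictionary: rating -> age group
rating_to_group = {
    rating: group
    for group, ratings in age_groups.items()
    for rating in ratings
}

# Hypothetical toy ratings standing in for df['rating']
toy_ratings = ["TV-MA", "TV-14", "PG", "TV-MA", "R", "NR"]

group_counts = Counter(rating_to_group[r] for r in toy_ratings)
print(group_counts)
```

On the real data, `df['rating'].map(rating_to_group).value_counts()` should give the number of titles per age group.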

# Conclusion

So what I basically did was:
* Cleaned (dealt with missing values) and formatted the original data
* Scraped, formatted and visualized some external data (revenue / cost of content / number of memberships)
* Looked at top countries, actors, directors and genres by visualizing them
* Checked correlation of genres and popular words in titles
* Visualized distribution of content by types, months and maturity ratings

At the same time, I was making some generic assumptions based on what I saw.

For now I will stop here and maybe come back to this notebook to do some more challenging stuff like hypothesis testing / recommendation / prediction once I get more experienced.

I didn't really do anything with `release_year` and `duration`, so let's leave the exploration of these features for the future as well.

# Useful code
This piece of code I didn't use for any of the visualizations but kept anyway, because I might need to process the data in a similar way in some future projects.

What this code does: it 'explodes' a column of lists, giving every list item its own row and repeating the values of the other columns for it.


Initially I wanted to explore when each genre was first added to the Netflix library, but didn't get anything interesting from it.

In [None]:
# Splitting column 'listed_in' by comma into a list where each genre is a list item
df['genre'] = df['listed_in'].apply(lambda x : x.replace(' ,',',')\
                                                .replace(', ',',')\
                                                .split(','))
# Leaving only the columns we need
gr_df = df[['genre','year_added']]
# Creating a variable because we need it several places
list_col = 'genre'

# Creating df with 'exploded' year to each genre in the lists
r = pd.DataFrame({
    col: np.repeat(gr_df[col].values, # Choosing values from columns that we want to explode. By using 'col' we generalized code for all columns with scalar values. 
                   gr_df[list_col].str.len()) # Getting the length (N) of the corresponding list and setting N as the number of repetitions.
    for col in gr_df.columns.drop(list_col)} # Getting names of the columns and dropping the one with the lists (in our case 'genre').
    # By using np.concatenate() we flatten all the values in the list column (genre) and get a 1D vector
    ).assign(**{list_col:np.concatenate(gr_df[list_col].values)})[gr_df.columns]

r.sort_values('year_added')\
 .drop_duplicates(subset=['genre'], keep='first')\
 .tail(10)
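
For comparison, pandas 0.25 and newer ship `DataFrame.explode`, which should do the same 'exploding' in one call. Below is a minimal sketch on hypothetical toy data. It is an alternative to the manual `np.repeat` / `np.concatenate` approach above, not a replacement for studying it.

```python
import pandas as pd

# Hypothetical toy frame with a column of lists, like gr_df above
toy = pd.DataFrame({
    "genre": [["Dramas", "Comedies"], ["Dramas"]],
    "year_added": [2019, 2016],
})

# Every list item gets its own row; 'year_added' is repeated for each item
exploded = toy.explode("genre").reset_index(drop=True)

# Earliest year in which each genre appears
first_added = (
    exploded.sort_values("year_added")
            .drop_duplicates(subset=["genre"], keep="first")
)
print(first_added)
```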