<img src="https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fjaymcgregor%2Ffiles%2F2014%2F07%2F11243566003_e50ae60bad_h.jpg" width="500">

# Introduction
<font size=3>Netflix has become one of the most popular services in the media industry.<br />
    I also watch Netflix in my free time :) It offers many movies and TV shows from different countries. In this notebook, I focus on data visualization using matplotlib and seaborn. These are the questions I am going to explore:<br />

1. Which actors appear most often in US Netflix movies and TV shows?<br />
2. Which countries have produced the most movies and TV shows?<br />
3. What percentage does each genre account for?<br />
4. How many movies and TV shows were released in each decade?<br />
5. What content is available in different countries?<br />
6. Has Netflix been increasingly focusing on TV shows rather than movies in recent years? <br />    <font size/>

In [None]:
# Basic module
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

# Helps to visualize missing values in each column
import missingno as msno

# Simple & Easy way to overview dataset
from pandas_profiling import ProfileReport

# Used for choosing most frequently shown country.
from collections import Counter

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
sns.set_theme(style= 'darkgrid', palette = 'pastel')

# 1. Load Data & Check Information

In [None]:
df_net = pd.read_csv('../input/netflix-shows/netflix_titles.csv')

<font size=3> 
In this data set, there are 12 features. <br />
* show_id = Unique ID for every Movie / TV Show<br />
* type = Identifier - A Movie / TV Show<br />
* title = Title of Movie / TV Show<br />
* director = Director of the Movie<br />
* cast = Actors involved in the Movie / TV Show <br />
* country = Country where the Movie / TV Show was produced<br />
* date_added = Date it was added on Netflix<br />
* release_year = Actual Release year of the Movie / TV Show<br />
* rating = TV Rating of the Movie / TV Show<br />
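* duration = Total Duration - in minutes or number of seasons<br />
* listed_in = Genres the Movie / TV Show is listed under<br />
* description = Short summary of the Movie / TV Show<br />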

Let's take a look at each feature and figure out how to use it for visualization.
<font size />

In [None]:
df_net.head()

<font size =3>
As the output below shows, only 'release_year' has an integer data type; the rest of the features are of type object. Because of that, the describe method only shows 'release_year'. <br /><br />
'director' includes 2389 null values. <br />
'cast' includes 718 null values. <br />
'country' includes 507 null values. <br />
'date_added' includes 10 null values. <br />
'rating' includes 7 null values. <br />
    <font size/>

In [None]:
df_net.info()

In [None]:
df_net.describe()

In [None]:
df_net.isna().sum()

In [None]:
# Showing how missing values correlate between columns, using the missingno heatmap
msno.heatmap(df_net)

<font size=3>
Everything above is the classic, manual way of inspecting a dataset. There is also a simple library that summarizes a dataset in detail: ProfileReport shows dataset statistics, variable types, variables, interactions, correlations, missing values, and samples.
    <font size/>

In [None]:
ProfileReport(df_net)

# 2. Data Cleaning

<font size=4>Before visualizing, let's clean the data so the plots are easier to read!<br/>
<font size/>

<font size=3>Drop the features that are not used in this analysis.<font size/>

In [None]:
df_net.drop(['director', 'date_added', 'description'], axis=1, inplace=True)

<font size=4> **Rating** <br/> <font size/>
<font size=3> There are only seven NaN values in rating, so I am going to fill them in manually.<font size/>

In [None]:
df_net[df_net['rating'].isna()]

In [None]:
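# Row index -> rating to fill in for each of the seven missing values
changing_nan = {
    67: 'TV-PG',
    2359: 'TV-14',
    3660: 'TV-MA',
    3736: 'TV-MA',
    3737: 'NR',
    3738: 'TV-MA',
    4323: 'TV-MA'
}

# The index is still the default RangeIndex here, so labels match row positions
for idx, rate in changing_nan.items():
    df_net.loc[idx, 'rating'] = rate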
    
df_net['rating'].isna().sum()
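
Hand-typed replacement labels are easy to get slightly wrong: a stray leading or trailing space, for example, silently creates a separate category. The sketch below uses a small made-up list rather than the real column, and the helper name `normalize_rating` is hypothetical. It only illustrates one way to normalize labels and compare the number of categories before and after.

```python
from collections import Counter

def normalize_rating(label):
    """Strip surrounding whitespace so 'TV-MA ' and 'TV-MA' count as one category."""
    return label.strip()

# Hypothetical sample of rating labels, two of them with stray spaces
sample_ratings = ['TV-MA', 'TV-MA ', 'TV-14', 'NR', ' TV-PG']

raw_counts = Counter(sample_ratings)
clean_counts = Counter(normalize_rating(r) for r in sample_ratings)

print(len(raw_counts))    # number of distinct labels before normalizing
print(len(clean_counts))  # number of distinct labels after normalizing
```

On the real DataFrame, a similar effect could probably be achieved with `df_net['rating'].str.strip()`.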

<font size=4> **Cast** <br/> <font size/>
<font size=3> Drop the rows with NaN values in cast.<font size/>

In [None]:
# .copy() avoids a SettingWithCopyWarning when columns are assigned later
df_net = df_net[df_net['cast'].notna()].copy()
df_net['cast'].isna().sum()

<font size=4> **Country** <br/> <font size/>
<font size=3> In the country feature, some values are NaN and some include multiple countries.<br/>For NaN values, I am going to fill in the most common country.<br/>For values that contain more than one country, the first country in the list will be selected as the main country.<font size/>

In [None]:
Counter(df_net['country']).most_common(1)

In [None]:
df_net['country'] = df_net['country'].fillna('United States')
df_net['country'].isna().sum()

In [None]:
#After cleaning the NaN values, keep only the first country listed for each title
df_net['main_country']=df_net['country'].apply(lambda x:x.split(',')[0].strip())
df_net.drop('country', axis=1, inplace=True)
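
Taking the first comma-separated entry works as long as every value starts with a country name. The sketch below is a slightly more defensive variant on plain strings: it skips empty entries (for instance, a value that begins with a comma) and falls back to a default. The function name `first_country` is hypothetical, and the default of 'United States' simply mirrors the fill value used above. Missing values are represented as `None` here, which is an assumption; in pandas they are NaN floats, so this helper would need to run after `fillna`.

```python
def first_country(value, default='United States'):
    """Return the first non-empty country in a comma-separated string."""
    if value is None:
        return default
    for part in value.split(','):
        name = part.strip()
        if name:
            return name
    # Every entry was empty, so fall back to the default
    return default

print(first_country('United States, India'))
print(first_country(', France, Algeria'))
print(first_country(None))
```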

<font size=3> As you can see, there are no NaN values left! Also, values that contained multiple countries have been reduced to a single main country. <font size/>

In [None]:
df_net.head()

In [None]:
df_net.isna().sum()

# 3. Data Visualization

<font size=3>
Which actors appear most often in US Netflix movies and TV shows? <br/><br/>

* **Andrea Libman** = 20 movies and TV shows <br/>
* **Samuel L. Jackson** = 19 movies and TV shows <br/>
* **Adam Sandler** = 19 movies and TV shows <br/>
* **Fred Tatasciore** = 18 movies and TV shows <br/>
* **Tara Strong** = 17 movies and TV shows <br/>

<font size/>

In [None]:
#Making a new DataFrame for visualizing
df_us = df_net[df_net['main_country'] == 'United States']
#Split every cast string into individual, whitespace-stripped actor names
actor_list = [x.strip() for x in ','.join(df_us['cast']).split(',')]
counter_list = Counter(actor_list).most_common(5)
actor_name = [name for name, _ in counter_list]
actor_frequency = [count for _, count in counter_list]
us_actor = pd.DataFrame({'actor_name': actor_name, 'actor_frequency' : actor_frequency},
                       columns=['actor_name', 'actor_frequency'])

#In total, 13772 distinct actors have appeared in US movies and TV shows
my_set = set(actor_list)
print("Total number of actors in US Netflix titles: " + str(len(my_set)))

#Visualizing using seaborn
plt.figure(figsize=(15,6))
sns.set_context('paper', font_scale=1.2)
sns.barplot(x='actor_name', y='actor_frequency', data=us_actor)
plt.xlabel('Actor Name')
plt.ylabel('Actor Frequency')
plt.show()
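
The counting step can be illustrated on a few made-up cast strings. The sketch below is a small variation, not the exact method used in the cell above: it counts each actor at most once per title by collecting names into a set first, which matters only if a name is repeated within one cast list. The sample data and the function name `count_appearances` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical cast strings, one per title
cast_column = [
    'Adam Sandler, Kevin James',
    'Adam Sandler,  Drew Barrymore',
    'Tara Strong, Fred Tatasciore, Tara Strong',
]

def count_appearances(cast_strings):
    """Count the number of titles each actor appears in."""
    counts = Counter()
    for cast in cast_strings:
        # A set ensures an actor credited twice in one title is counted once
        actors = {name.strip() for name in cast.split(',') if name.strip()}
        counts.update(actors)
    return counts

appearances = count_appearances(cast_column)
print(appearances.most_common(1))
```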

<font size=3> Which countries have produced the most movies and TV shows?<br/><br/>
* United States = 2961 movies and TV shows <br/>
* India = 927 movies and TV shows <br/>
* United Kingdom = 500 movies and TV shows <br/>
* Canada = 234 movies and TV shows <br/>
* Japan = 232 movies and TV shows <br/>
* South Korea = 191 movies and TV shows <br/>
* France = 176 movies and TV shows <br/>
* Spain = 149 movies and TV shows <br/>
* Mexico = 114 movies and TV shows <br/>
* Turkey = 106 movies and TV shows <br/>
<font size/>

In [None]:
#Making new DataFrame by using country and type features
count_type = df_net.groupby(['main_country']).count()
most_country = count_type['type'].to_frame().reset_index().sort_values(by='type', ascending=False)[:10]

#Visualizing using seaborn
plt.figure(figsize=(15,5))
sns.set_context('paper', font_scale=1.2)
sns.barplot(x='main_country', y='type', data=most_country)
plt.ylabel('# of TV & Movie')
plt.xlabel('Country')
plt.show()

<font size=3>What percentage does each genre account for?<br/><br/>
* International Movies = 14.4%<br/>
* Dramas = 13.3%<br/>
* Comedies = 9.3%<br/>
* International TV Shows = 7.0%<br/>
* Action & Adventure = 4.5%<br/>
* TV Dramas = 4.4%<br/>
* Independent Movies = 4.3%<br/>
* Romantic Movies = 3.3%<br/>
* Children & Family Movies = 3.2%<br/>
* Others = 36.2%

<font size/>    

In [None]:
#Extract the 9 most common genres from Netflix and group the rest as 'Others'
top9_genre = [x.strip() for x in ','.join(df_net['listed_in']).split(',')]
top9_list = Counter(top9_genre).most_common(9)
total_genre=len(top9_genre)
labels = [genre for genre, _ in top9_list]
labels.append('Others')
sizes = [count for _, count in top9_list]
sizes.append(total_genre - sum(sizes))

#Visualizing using matplotlib
plt.figure(figsize=(12,15))
plt.title('Percentage of Genre', fontsize=15)
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True)
plt.show()
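
Because one title can be listed under several genres, the percentages in this chart are shares of all genre tags, not shares of titles. The following sketch shows the "top N plus Others" computation on a tiny made-up list; the function name `top_n_with_others` and the sample values are assumptions for illustration.

```python
from collections import Counter

def top_n_with_others(tags, n):
    """Return (labels, sizes) for the n most common tags plus an 'Others' bucket."""
    counts = Counter(tags)
    top = counts.most_common(n)
    labels = [tag for tag, _ in top]
    sizes = [count for _, count in top]
    remainder = sum(counts.values()) - sum(sizes)
    if remainder > 0:
        labels.append('Others')
        sizes.append(remainder)
    return labels, sizes

# Hypothetical listed_in values; each entry may hold several genre tags
listed_in = ['Dramas, International Movies', 'Comedies', 'Dramas, Comedies', 'Dramas']
tags = [tag.strip() for entry in listed_in for tag in entry.split(',')]

labels, sizes = top_n_with_others(tags, n=2)
percentages = [round(100 * size / sum(sizes), 1) for size in sizes]
print(labels, sizes, percentages)
```

Here four titles produce six tags, so 'Dramas' accounts for half of the tags even though it appears on three of the four titles.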

<font size=3>Visualize how many movies and TV shows were released in each decade<font size=3>

In [None]:
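#Visualizing using seaborn, with bin edges aligned to decades
decade_edges = np.arange(df_net['release_year'].min() // 10 * 10,
                         df_net['release_year'].max() // 10 * 10 + 11, 10)
plt.figure(figsize=(12,8))
plt.title('# of TV & Movie in each decade', fontsize=18)
sns.set_context("poster", font_scale = 0.8)
sns.histplot(df_net['release_year'], bins=decade_edges, kde=True)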
plt.show()
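
A histogram's bins line up with calendar decades only if the bin edges are chosen that way. As a cross-check, the per-decade counts can also be computed directly. The sketch below does this on a handful of made-up years; the helper name `decade_of` and the sample values are assumptions for illustration.

```python
from collections import Counter

def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1997 -> 1990."""
    return year // 10 * 10

# Hypothetical release years
release_years = [1997, 2003, 2015, 2018, 2019, 2020]

titles_per_decade = Counter(decade_of(y) for y in release_years)
for decade in sorted(titles_per_decade):
    print(f"{decade}s: {titles_per_decade[decade]}")
```

With the real data, a comparable table could likely be obtained by applying the same integer division to `df_net['release_year']` and calling `value_counts()`.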

<font size=3>Understanding what content is available in different countries <br/><br/>
Since there are too many countries, I am going to use the top 3: the United States, India, and the United Kingdom. 
<font size=3>

In [None]:
#Make new dataframe for each country
df_US = df_net[df_net['main_country'] == 'United States']
df_In = df_net[df_net['main_country'] == 'India']
df_UK = df_net[df_net['main_country'] == 'United Kingdom']

#Function which returns the sizes and labels for a pie chart (top 9 genres + 'Others')
def show_pie(df):
    genre = [x.strip() for x in ','.join(df['listed_in']).split(',')]
    df_list = Counter(genre).most_common(9)
    total_genre=len(genre)
    labels = [name for name, _ in df_list]
    labels.append('Others')
    sizes = [count for _, count in df_list]
    sizes.append(total_genre - sum(sizes))
    return sizes, labels

#Get the sizes and labels
US_sizes, US_labels = show_pie(df_US)
In_sizes, In_labels = show_pie(df_In)
UK_sizes, UK_labels = show_pie(df_UK)

#Visualizing using matplotlib
fig, ([ax1, ax2], [ax3, ax4]) = plt.subplots(2,2, figsize=(40,40))
ax1.pie(US_sizes, labels=US_labels, autopct='%1.1f%%', shadow=True)
ax1.set_title('US Content', size=20, fontweight='bold')
ax2.pie(In_sizes, labels=In_labels, autopct='%1.1f%%', shadow=True)
ax2.set_title('India Content', size=20, fontweight='bold')
ax3.pie(UK_sizes, labels=UK_labels, autopct='%1.1f%%', shadow=True)
ax3.set_title('UK Content', size=20, fontweight='bold')
#Hide the unused fourth subplot
ax4.set_visible(False)
plt.show()

<font size=3>Has Netflix been increasingly focusing on TV shows rather than movies in recent years? <br/><br/>
    
<font size=3>

In [None]:
#Making new dataframes to separate TV shows & movies, counted per release year
df_TV = df_net[df_net['type']=='TV Show'].groupby('release_year').count()
df_Movie = df_net[df_net['type']=='Movie'].groupby('release_year').count()

#Visualizing using seaborn
plt.figure(figsize=(12,8))
sns.set_context("poster", font_scale = 0.8)
sns.lineplot(data=df_TV['show_id'], label='TV')
sns.lineplot(data=df_Movie['show_id'], label='Movie')
plt.ylabel('Count')
plt.xlabel('Release Year')
plt.legend(fontsize='large')
plt.title('TV vs Movie')
plt.show()
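
Raw counts of TV shows and movies can both rise at the same time, so the share of TV shows per year may be a more direct way to look at this question. The sketch below computes that share on made-up `(type, release_year)` pairs; the function name `tv_share_by_year` and the sample records are assumptions for illustration. Note that it groups by release year, as the plot above does, not by the date a title was added to Netflix.

```python
from collections import defaultdict

def tv_share_by_year(records):
    """Return {year: fraction of that year's titles that are TV shows}."""
    totals = defaultdict(int)
    tv_counts = defaultdict(int)
    for title_type, year in records:
        totals[year] += 1
        if title_type == 'TV Show':
            tv_counts[year] += 1
    return {year: tv_counts[year] / totals[year] for year in sorted(totals)}

# Hypothetical (type, release_year) pairs
records = [
    ('Movie', 2018), ('Movie', 2018), ('TV Show', 2018),
    ('Movie', 2019), ('TV Show', 2019),
    ('Movie', 2020), ('TV Show', 2020), ('TV Show', 2020), ('TV Show', 2020),
]

share = tv_share_by_year(records)
for year, fraction in share.items():
    print(year, round(fraction, 2))
```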

# Reference

* Pandas Profiling GitHub - https://github.com/pandas-profiling/pandas-profiling
* Heatmap for missing values GitHub - https://github.com/ResidentMario/missingno
* Handling NaN values in rating - https://www.kaggle.com/bhartiprasad17/netflix-movies-and-tv-shows-eda