<center><h1> Exploratory data analysis of Netflix content </h1></center>
<img src="https://www.enigma-mag.com/wp-content/uploads/2019/05/netflix-logo-and-screen-1-1.jpg" width="600px">

# 1. Introduction

This notebook analyzes and visualizes the content available on Netflix.

The data are preprocessed first so that they are easier to plot, and all visualizations are built with Plotly.

Please don't forget to upvote this notebook if you like it.

# 2. Importing required libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import plotly.express as px
import plotly.graph_objects as go

import datetime

# 3. Reading CSV Data

In [None]:
path = '../input/netflix-shows/netflix_titles.csv'
netflix_titles = pd.read_csv(path)
netflix_titles.head()

In [None]:
netflix_titles.info()

Several columns contain missing values. Some of them are treated as *Unknown* in the steps below.

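As a complement to `info()`, the missing entries can be counted per column. The snippet below is a minimal sketch on a made-up toy frame (the `toy` name and its values are illustrative assumptions, not the real dataset); the same two lines should apply to `netflix_titles`.

```python
import pandas as pd

# Toy frame standing in for netflix_titles (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, None],
    "country": ["Spain", None, "India"],
})

# Count missing entries per column and keep only the columns that have any
missing_per_column = toy.isna().sum()
missing_per_column = missing_per_column[missing_per_column > 0].sort_values(ascending=False)
print(missing_per_column)
```

Columns that do not appear in the result have no missing values.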
# 4. Data Wrangling

**- Release year and date_added**

Convert the date columns to the datetime type.


In [None]:
# Strip stray whitespace first: entries with a leading space can break date parsing
netflix_titles['date_added'] = pd.to_datetime(netflix_titles['date_added'].str.strip())

In [None]:
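netflix_titles['release_year'] = netflix_titles['release_year'].astype(str)
# Parse the four-digit year explicitly; each title is dated January 1 of its release year
netflix_titles['release_year'] = pd.to_datetime(netflix_titles['release_year'], format='%Y')
netflix_titles['release_year'].dt.year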

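If some `date_added` entries cannot be parsed, `errors="coerce"` turns them into `NaT` instead of raising. This is a small sketch on made-up strings; the `"%B %d, %Y"` format (for example "September 9, 2019") is an assumption about how the column is written.

```python
import pandas as pd

# Made-up examples: a clean date, one with a leading space, a missing value, and garbage
raw_dates = pd.Series(["September 9, 2019", " August 4, 2017", None, "not a date"])

# Strip whitespace, then parse; unparseable entries become NaT
parsed = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")
print(parsed.dt.year)
```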
**- Country**

Some movies and TV shows list several countries; only the first one is kept. NaN values are filled with 'Unknown'.

In [None]:
netflix_titles['country'] = netflix_titles['country'].fillna('Unknown')
netflix_titles['country'] = [countries[0] for countries in netflix_titles['country'].str.split(',')]
netflix_titles['country'].unique()

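The same selection can also be written with vectorized string methods, which additionally trims whitespace around the country name. This is a sketch on a made-up Series, not a replacement for the cell above.

```python
import pandas as pd

# Made-up country strings, including a missing value
countries = pd.Series(["United States, India", "Spain", None, "France,Belgium"])

first_country = (
    countries.fillna("Unknown")  # missing values become 'Unknown'
    .str.split(",")              # split the comma-separated list
    .str[0]                      # keep the first entry
    .str.strip()                 # drop surrounding whitespace
)
print(first_country.tolist())
```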
**- Rating**

The values 'UR' (Unrated) and 'NR' (Not Rated) have the same meaning, so 'NR' values are replaced by 'UR' to unify the variable.

In addition, NaN values are also treated as Unrated.

In [None]:
netflix_titles['rating'] = netflix_titles['rating'].replace('NR','UR')
netflix_titles['rating'] = netflix_titles['rating'].fillna('UR')
netflix_titles['rating'].unique()

**- Duration**

The duration is given in seasons for TV shows and in minutes for movies. Only the leading number of each row is kept:

In [None]:
netflix_titles['duration'] = [int(duration[0]) for duration in netflix_titles['duration'].str.split(' ')]
netflix_titles['duration']

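The list comprehension above fails if a duration is missing. A more defensive variant, sketched here on made-up values, extracts the leading digits with a regular expression and keeps missing entries as `<NA>`.

```python
import pandas as pd

# Made-up durations, including a missing value
durations = pd.Series(["90 min", "2 Seasons", "1 Season", None])

# Capture the first run of digits; expand=False returns a Series
numbers = durations.str.extract(r"(\d+)", expand=False)

# The nullable integer dtype keeps missing values as <NA>
duration_number = pd.to_numeric(numbers).astype("Int64")
print(duration_number)
```

Whether the real column contains missing durations depends on the dataset version, so this is only needed if the original cell raises an error.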
# 5. Visualisations


**- Movies and TV-shows**

In [None]:
fig = px.pie(values=netflix_titles['type'].value_counts(), 
             names=netflix_titles['type'].value_counts().index, 
             title='Total number of TV shows and movies on Netflix')
fig.show()

In [None]:
x = netflix_titles['type'].value_counts().index
y = netflix_titles['type'].value_counts()
print(x)
print(y)

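The shares shown in the pie chart can also be computed directly with `value_counts(normalize=True)`. The sketch below uses a made-up `types` Series in place of `netflix_titles['type']`.

```python
import pandas as pd

# Made-up stand-in for netflix_titles['type']
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"])

counts = types.value_counts()
shares = types.value_counts(normalize=True).mul(100).round(1)

# Counts and percentages side by side, aligned on the type label
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```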
It can be concluded that there are more movies than TV shows. Next, we break the movies and TV shows down by country.

`visualise_country` is a function that plots the number of TV shows and movies for the selected country.

In [None]:
def visualise_country(country):
    # ALL is the sentinel defined in the next cell; it must exist before this function is called
    if country == ALL:
        netflix_titles_vis = netflix_titles
    else:
        netflix_titles_vis = netflix_titles[netflix_titles['country'] == country]

    type_counts = netflix_titles_vis['type'].value_counts()
    fig = px.pie(values=type_counts,
                 names=type_counts.index,
                 title=f'Total number of TV shows and movies from {country}')
    fig.show()

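Separating the filtering from the plotting makes the logic easy to check without rendering a figure. The helper `type_counts_for_country` below is hypothetical (it is not part of the notebook) and is shown on a made-up toy frame.

```python
import pandas as pd

ALL = "ALL"  # same sentinel value the notebook uses for "no filter"

def type_counts_for_country(titles, country=ALL):
    """Return value counts of 'type', optionally restricted to one country."""
    if country != ALL:
        titles = titles[titles["country"] == country]
    return titles["type"].value_counts()

# Made-up data standing in for netflix_titles
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "country": ["Spain", "Spain", "India", "India"],
})
print(type_counts_for_country(toy, "Spain"))
```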
A dropdown widget is used to select the country to plot.

In [None]:
import ipywidgets as widgets
from IPython.display import display

ALL = 'ALL'

def unique_sorted_values_plus_ALL(array):
    # Sorted unique values with the ALL sentinel as the first option
    unique = array.unique().tolist()
    unique.sort()
    unique.insert(0, ALL)
    return unique

dropdown_country = widgets.Dropdown(options = unique_sorted_values_plus_ALL(netflix_titles.country))
output_country = widgets.Output()

def dropdown_country_eventhandler(change):
    output_country.clear_output()
    with output_country:
        # visualise_country shows the figure itself and returns None, so it is not passed to display()
        visualise_country(change.new)
        
dropdown_country.observe(dropdown_country_eventhandler, names='value')
display(dropdown_country)

In [None]:
display(output_country)

**- Rating**

In [None]:
netflix_titles['rating'].unique()

In [None]:
order_rating = ['TV-Y', 'TV-Y7', 'TV-Y7-FV', 'G', 'TV-G', 'PG', 'TV-PG', 'PG-13', 'TV-14', 'R', 'NC-17', 
               'TV-MA', 'UR']  # 'NR' was merged into 'UR' during data wrangling


fig = px.bar(y = netflix_titles['rating'].value_counts(), 
             x = netflix_titles['rating'].value_counts().index,
             labels = dict(x="Rating", y="Total Number"),
             title = 'Ratings of TV shows and movies on Netflix'
            )

fig.update_xaxes(categoryorder = 'array', categoryarray= order_rating)

fig.show()


Most of the content on Netflix is for Mature Audiences (TV-MA), followed by TV-14 (Parents strongly cautioned). 

Grouping the ratings into Kids ('TV-Y', 'TV-Y7', 'TV-Y7-FV', 'G', 'TV-G', 'PG', 'TV-PG'), Teens ('PG-13', 'TV-14'), Adults ('R', 'NC-17', 'TV-MA') and Unrated ('UR') gives a clearer and more understandable plot.

In [None]:
def group_by_rating(rating):
    # Map a single rating label to its audience group
    if rating in ['TV-Y', 'TV-Y7', 'TV-Y7-FV', 'G', 'TV-G', 'PG', 'TV-PG']:
        return 'Kids'
    elif rating in ['PG-13', 'TV-14']:
        return 'Teens'
    elif rating in ['R', 'NC-17', 'TV-MA']:
        return 'Adults'
    return 'Unrated'


# Apply the mapping to the rating column only, instead of row by row over the whole frame
netflix_titles['rating_group'] = netflix_titles['rating'].apply(group_by_rating)

order_rating = ['Kids', 'Teens', 'Adults', 'Unrated']


fig = px.bar(y = netflix_titles['rating_group'].value_counts(), 
             x = netflix_titles['rating_group'].value_counts().index,
             labels = dict(x="Rating group", y="Total Number"),
             title = 'Rating groups of TV shows and movies on Netflix'
            )

fig.update_xaxes(categoryorder = 'array', categoryarray= order_rating)

fig.show()
    
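The same grouping can be expressed as a lookup table and applied with `Series.map`, which keeps the group definitions in one place. This is a sketch on made-up ratings; `RATING_GROUPS` and `rating_to_group` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Group definitions taken from the text above
RATING_GROUPS = {
    "Kids": ["TV-Y", "TV-Y7", "TV-Y7-FV", "G", "TV-G", "PG", "TV-PG"],
    "Teens": ["PG-13", "TV-14"],
    "Adults": ["R", "NC-17", "TV-MA"],
}

# Invert to a flat rating -> group dictionary
rating_to_group = {
    rating: group for group, ratings in RATING_GROUPS.items() for rating in ratings
}

# Made-up ratings; anything not in the table (such as 'UR') falls back to 'Unrated'
ratings = pd.Series(["TV-MA", "PG", "UR", "TV-14"])
rating_group = ratings.map(rating_to_group).fillna("Unrated")
print(rating_group.tolist())
```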

**- Categories / Genre**

The `listed_in` column contains between one and three genres or categories for each movie or TV show. These are split into three separate columns: 'category1', 'category2' and 'category3':

In [None]:
netflix_titles['listed_in'] = netflix_titles['listed_in'].str.split(', ')
netflix_titles['listed_in']

In [None]:
netflix_titles[['category1','category2', 'category3']] = pd.DataFrame(netflix_titles.listed_in.tolist(), 
                                                                      index= netflix_titles.index)
netflix_titles.head(5)

Now it is possible to group the Netflix content by category.

In [None]:
netflix_categories_content = netflix_titles[['type', 'category1','category2', 'category3']]
netflix_categories_content

Counting the titles for each combination of type and category gives the number of movies and TV shows per category:

In [None]:
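# Count titles per (type, category). melt stacks the three category columns into one;
# this avoids Series.sum(level=...), which was removed in pandas 2.0.
netflix_categories_group = (netflix_categories_content
                            .melt(id_vars='type', value_name='category')
                            .dropna(subset=['category'])
                            .groupby(['type', 'category'])
                            .size())
netflix_categories_group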

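An alternative that skips the intermediate 'category1' to 'category3' columns is to explode the list of categories into one row per category. The sketch below uses a made-up frame and assumes `listed_in` is still the raw comma-separated string; if it has already been split into lists, the `str.split` step can be dropped.

```python
import pandas as pd

# Made-up data standing in for netflix_titles
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show"],
    "listed_in": ["Dramas, Comedies", "Dramas", "TV Dramas, Kids' TV"],
})

category_counts = (
    toy.assign(category=toy["listed_in"].str.split(", "))
    .explode("category")            # one row per (title, category)
    .groupby(["type", "category"])
    .size()                         # number of titles per pair
)
print(category_counts["Movie"].sort_values(ascending=False))
```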
In [None]:
fig = px.bar(y = netflix_categories_group['TV Show'].sort_values(ascending=False).head(10), 
             x = netflix_categories_group['TV Show'].sort_values(ascending=False).head(10).index,
             labels = dict(x="Category", y="Total Number"),
             title = 'The most common categories in TV Shows'
            )


fig.show()

In [None]:
netflix_categories_group['Movie'].sort_values(ascending=False).head(10)

In [None]:
fig = px.bar(y = netflix_categories_group['Movie'].sort_values(ascending=False).head(10), 
             x = netflix_categories_group['Movie'].sort_values(ascending=False).head(10).index,
             labels = dict(x="Category", y="Total Number"),
             title = 'The most common categories in Movies'
            )

fig.show()

**- Duration**

In [None]:
fig = px.bar(x = netflix_titles[netflix_titles['type']=='TV Show']['duration'].value_counts().index, 
             y = netflix_titles[netflix_titles['type']=='TV Show']['duration'].value_counts())

fig.update_layout(
    title='Duration of TV Shows',
    xaxis_title="Duration (seasons)",
    yaxis_title="Total number",
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 1,
        dtick = 1)
)

fig.show()

In [None]:
fig = px.histogram(netflix_titles[netflix_titles['type']=='Movie'], x="duration")

fig.update_layout(
    title='Duration of movies',
    xaxis_title="Duration (min)",
    yaxis_title="Count",
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 0,
        dtick = 15),
)

fig.update_traces(
    xbins = dict( # bins used for histogram
        start = 0,
        end = 315,
        size = 15)
    )
fig.show()


The most common movie length falls between 90 and 105 minutes.

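The most populated bin can be checked numerically with `pd.cut`, using the same 15-minute bin width as the histogram. This is a sketch on made-up durations; left-closed bins (`right=False`) are an assumption about how the histogram assigns edge values.

```python
import pandas as pd

# Made-up movie durations in minutes
minutes = pd.Series([88, 92, 95, 101, 104, 110, 130])

# 15-minute bins from 0 to 315, closed on the left: [0, 15), [15, 30), ...
bins = list(range(0, 330, 15))
binned = pd.cut(minutes, bins=bins, right=False)

bin_counts = binned.value_counts().sort_index()
most_common_bin = bin_counts.idxmax()
print(most_common_bin, bin_counts.max())
```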
**- Director**


In [None]:
netflix_titles['director'] = netflix_titles['director'].fillna('Unknown')
netflix_titles['director'] = netflix_titles['director'].str.split(', ')

In [None]:
netflix_director = pd.DataFrame(netflix_titles.director.tolist())
netflix_director['type'] = netflix_titles['type']

In [None]:
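# Count titles per (type, director). melt stacks the director columns into one;
# this avoids Series.sum(level=...), which was removed in pandas 2.0.
netflix_director_group = (netflix_director
                          .melt(id_vars='type', value_name='director')
                          .dropna(subset=['director'])
                          .groupby(['type', 'director'])
                          .size())
netflix_director_group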

In [None]:
fig = px.bar(y = netflix_director_group['TV Show'].sort_values(ascending=False).head(11)[1:], 
             x = netflix_director_group['TV Show'].sort_values(ascending=False).head(11)[1:].index,
             labels = dict(x="Director", y="Total Number"),
             title = 'The most common directors in TV Shows'
            )


fig.show()

In [None]:
fig = px.bar(y = netflix_director_group['Movie'].sort_values(ascending=False).head(11)[1:], 
             x = netflix_director_group['Movie'].sort_values(ascending=False).head(11)[1:].index,
             labels = dict(x="Director", y="Total Number"),
             title = 'The most common directors in Movies'
            )


fig.show()

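The `head(11)[1:]` slices above rely on 'Unknown' being the most frequent entry. Dropping the placeholder by name does not depend on its rank. The sketch below uses made-up counts and hypothetical director names.

```python
import pandas as pd

# Made-up director counts; 'Unknown' is the placeholder for missing directors
counts = pd.Series({"Unknown": 50, "Director A": 3, "Director B": 2, "Director C": 1})

# errors="ignore" keeps this safe even if no 'Unknown' entry exists
top_directors = counts.drop("Unknown", errors="ignore").nlargest(2)
print(top_directors)
```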
**- Year added**

In [None]:
fig = go.Figure()

# Add traces
fig.add_trace(go.Scatter(x=netflix_titles.date_added.dt.year.value_counts().sort_index().index, 
                    y=netflix_titles.date_added.dt.year.value_counts().sort_index(),
                    mode='lines+markers',
                    name='Total'))
fig.add_trace(go.Scatter(x=netflix_titles[netflix_titles['type']=='Movie'].date_added.dt.year.value_counts().sort_index().index,
                    y=netflix_titles[netflix_titles['type']=='Movie'].date_added.dt.year.value_counts().sort_index(),
                    mode='lines+markers',
                    name='Movies'))
fig.add_trace(go.Scatter(x=netflix_titles[netflix_titles['type']=='TV Show'].date_added.dt.year.value_counts().sort_index().index,
                    y=netflix_titles[netflix_titles['type']=='TV Show'].date_added.dt.year.value_counts().sort_index(),
                    mode='lines+markers',
                    name='TV Shows'))

fig.update_layout(
    title='Content added to Netflix',
    xaxis_title="Year",
    yaxis_title="Total number",
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 2008,
        dtick = 1),
)


fig.show()

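The three series plotted above can also be built in a single table with `pd.crosstab`, which avoids repeating the filter for each type. This is a sketch on a made-up frame; rows without a date are dropped first.

```python
import pandas as pd

# Made-up data standing in for netflix_titles
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": pd.to_datetime(["2019-01-05", "2019-06-01", "2020-03-10", None]),
})

# Keep only rows with a known date, then count titles per year and type
dated = toy.dropna(subset=["date_added"])
per_year = pd.crosstab(dated["date_added"].dt.year, dated["type"])
per_year["Total"] = per_year.sum(axis=1)
print(per_year)
```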
**- Release year**

In [None]:
fig = go.Figure()

# Add traces
fig.add_trace(go.Scatter(x=netflix_titles.release_year.dt.year.value_counts().sort_index().index, 
                    y=netflix_titles.release_year.dt.year.value_counts().sort_index(),
                    mode='lines+markers',
                    name='Total'))
fig.add_trace(go.Scatter(x=netflix_titles[netflix_titles['type']=='Movie'].release_year.dt.year.value_counts().sort_index().index,
                    y=netflix_titles[netflix_titles['type']=='Movie'].release_year.dt.year.value_counts().sort_index(),
                    mode='lines+markers',
                    name='Movies'))
fig.add_trace(go.Scatter(x=netflix_titles[netflix_titles['type']=='TV Show'].release_year.dt.year.value_counts().sort_index().index,
                    y=netflix_titles[netflix_titles['type']=='TV Show'].release_year.dt.year.value_counts().sort_index(),
                    mode='lines+markers',
                    name='TV Shows'))

fig.update_layout(
    title='Release year of the content on Netflix',
    xaxis_title="Year",
    yaxis_title="Total number",
    xaxis = dict(
        tickmode = 'linear',
        tick0 = 1925,
        dtick = 5),
)


fig.show()