# Exploratory Analysis of Netflix Data
This exploratory analysis was made for study purposes by a Data Science student and enthusiast.
You can check the dataset source here: https://www.kaggle.com/shivamb/netflix-shows

## Index
### Packages
### Missing Values
### Exploratory Analysis

#### i. Show Type
##### i.i. Overview
##### i.ii. Evolution Over Time
##### i.iii. Duration

#### ii. Countries

#### iii. Conclusion and Next Steps


# Packages

In [None]:
#packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import pycountry as pc

### Data Import

In [None]:
#Data Import
kaggle_path = '../input/netflix-shows/netflix_titles.csv'
df = pd.read_csv(kaggle_path)
df.head()

In [None]:
df.info()

In [None]:
dup = df.duplicated()
dup.sum()

# Missing Values

In [None]:
#heatmap
fig = px.imshow(df.isnull().T,template='ggplot2')
fig.update_layout(title='Missing values in data set')
fig.show()

In [None]:
#barplot
missing_value = 100 * df.isnull().sum()/len(df)
missing_value = missing_value.reset_index()
missing_value.columns = ['variables','missing']
missing_value = missing_value[missing_value.missing != 0]

fig = px.bar(missing_value, y='missing',x='variables',title='Missing Values in data set (%)')
fig.show()

Director and cast are the columns with the most missing data. Fortunately, those columns are not important to the focus of this study, since the data will be analysed by type, duration, time and genres (listed_in).
Country, which is an important category of analysis in this study, has 6.5% of missing data, which is relevant. The rows with a missing country will be dropped from the dataset when analysing data by country.
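
As a minimal sketch of that step, assuming a toy DataFrame in place of the real data (the values and the names `toy` and `by_country` are made up for illustration), the rows without a country can be dropped only for the country analysis while the full table stays untouched:

```python
import pandas as pd

# Toy stand-in for the Netflix table (values are made up for illustration)
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'type': ['Movie', 'TV Show', 'Movie', 'Movie'],
    'country': ['Brazil', None, 'India', None],
})

# Share of missing values per column, in percent
missing_pct = 100 * toy.isnull().sum() / len(toy)

# Drop rows without a country only for the country analysis;
# `toy` itself keeps every row for the other analyses
by_country = toy.dropna(subset=['country'])

print(missing_pct['country'])      # 50.0
print(len(toy), len(by_country))   # 4 2
```

Keeping the filtered copy separate means the analyses by type and duration still use every title.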

# Exploratory Analysis

In [None]:
#Exploratory Analysis
df = df.drop(columns = ['director','description','cast','rating'])
df.head()

## i. Show Type

### i.i. Overview

In [None]:
plot_type = df['type'].value_counts().reset_index()
plot_type.columns = ['type','count']
plot_type

In [None]:
px.pie(plot_type,values='count',names='type',template='ggplot2',title='Type')

### i.ii. Evolution over time 

In [None]:
#parse date_added once; missing or unparseable values become NaT
date_added = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
df['year_added'] = date_added.dt.year
df['month_added'] = date_added.dt.month
df['quarter_added'] = date_added.dt.quarter

In [None]:
plot_type_year = df.groupby(['type','year_added']).size().reset_index()
plot_type_year.columns = ['type','year_added','count']
plot_type_year.year_added = plot_type_year.year_added.astype('category')
chart = sns.catplot(
    data = plot_type_year, kind = 'bar',x='year_added',y='count',hue = 'type',
    ci='sd',palette = 'dark',height=20)
chart.set_axis_labels('Year','Count')
#Netflix has always added more movies than TV shows
#The number of movies added in 2020 decreased, while the number of TV shows added increased in the same year

In [None]:
plot_type_quarter = df.groupby(['type','quarter_added']).size().reset_index()
plot_type_quarter.columns = ['type','quarter_added','count']
plot_type_quarter['quarter_added'] = plot_type_quarter['quarter_added'].astype('category')
fig = sns.catplot(
    data = plot_type_quarter, kind = 'bar',x='quarter_added',y='count',hue = 'type',
    ci='sd',palette = 'dark',alpha = .6,height=20)
fig.set_axis_labels('Quarter','Count')
#Netflix usually adds more movies to its catalog in the first and last quarters of the year
#TV shows are mostly added to the catalog in the last quarter of the year

In [None]:
#plot difference year by year of movies and tv shows
#copy each subset so new columns can be added without a SettingWithCopyWarning
yoy_movies = plot_type_year.query("type == 'Movie'").copy()
yoy_shows = plot_type_year.query("type == 'TV Show'").copy()
#Movies:
#year_added was cast to category for plotting; cast it back to int for the numeric filters
yoy_movies['year_added'] = yoy_movies['year_added'].astype(int)
#percentage change relative to the previous year's count
yoy_movies['difference'] = 100*yoy_movies['count'].pct_change()
yoy_movies = yoy_movies.query("year_added > 2008")
yoy_movies = yoy_movies.query("year_added < 2021")
yoy_movies

In [None]:
px.bar(data_frame = yoy_movies,x='year_added',y='difference',title = '% Difference of Movies Added Year over Year')

In [None]:
#TV Shows
yoy_shows['year_added'] = yoy_shows['year_added'].astype(int)
yoy_shows['difference'] = 100*yoy_shows['count'].pct_change()
yoy_shows = yoy_shows.query("year_added > 2008")
yoy_shows = yoy_shows.query("year_added < 2021")
yoy_shows

In [None]:
px.bar(data_frame = yoy_shows,x='year_added',y='difference',title = '% Difference of TV Shows Added Year over Year')

It is clear that Netflix started as a streaming platform based on Movies, since TV Shows were only included in its catalog in 2012. TV Shows are becoming more and more relevant, since the number of new titles has never decreased compared with the year before.
In 2012, when TV Shows were introduced, there was a major decrease in the number of new movies added to the catalog.
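
One common way to express a year-over-year difference is relative to the previous year's count. The sketch below illustrates that calculation with pandas `pct_change`; the yearly counts are hypothetical and are not the real Netflix numbers:

```python
import pandas as pd

# Hypothetical yearly counts, not the real Netflix numbers
counts = pd.DataFrame({
    'year_added': [2018, 2019, 2020],
    'count': [100, 150, 120],
})

# Percentage change relative to the previous year:
# 100 * (current - previous) / previous
counts['difference'] = 100 * counts['count'].pct_change()

# The first year has no previous year, so its difference is NaN
print(counts)
```

In this toy table 2019 is about 50% above 2018 and 2020 is about 20% below 2019, so a negative bar marks a year with fewer additions than the one before.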

### i.iii. Duration

In [None]:
#getting the duration of each title
df[['duration_time','unit']] = df['duration'].str.split(' ',n=1,expand=True)
df['duration_time'] = df['duration_time'].astype(int)
df.head()

In [None]:
duration_movies = df[['type','duration_time','unit']].query("type == 'Movie'")
duration_shows = df[['type','duration_time','unit']].query("type == 'TV Show'")

#Describe
#movies
duration_movies.describe()

In [None]:
#shows
duration_shows.describe()

In [None]:
#plotting histogram
#movies
#hist
ax1 = duration_movies.plot(kind = 'hist', density = True,bins=25,figsize = (10,8),
                          xlim = (duration_movies['duration_time'].min(),duration_movies['duration_time'].max()))
#kde
duration_movies.plot(kind = 'kde', ax = ax1, secondary_y = True,
                     figsize = (10,8),
                     title = 'Histogram of Duration time (min) with KDE Distribution')

#shows
#hist
ax2 = duration_shows.plot(kind = 'hist', density = True,bins=10,figsize = (10,8),
                          xlim = (duration_shows['duration_time'].min(),duration_shows['duration_time'].max()))
#kde
duration_shows.plot(kind = 'kde', ax = ax2, secondary_y = True,
                     figsize = (10,8),
                     title = 'Histogram of Duration time (Seasons) with KDE Distribution')

A kernel density estimate (KDE) was plotted over each histogram because it shows the shape of the distribution as a smooth curve instead of discrete bins.
The average duration of a Movie on Netflix is 99 minutes, and most of them run between 86 and 119 minutes.
The average duration of TV Shows is 1.7 seasons, and most of them have between 1 and 2 seasons.
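
To illustrate what a kernel smoother does, here is a minimal hand-written Gaussian KDE in NumPy. It is only a sketch: the function name `gaussian_kde_1d`, the sample durations and the bandwidth of 10 minutes are assumptions chosen for illustration, not the settings used by the plots above.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distances between grid points and samples,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

# Hypothetical movie durations in minutes
durations = [85, 90, 95, 99, 100, 110, 120]
grid = np.linspace(0, 250, 1001)
density = gaussian_kde_1d(durations, grid, bandwidth=10.0)

# A density should integrate to roughly 1 over a wide enough grid
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that hides detail, while a smaller one follows individual data points more closely.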

## ii. Countries

### ii.i. Overview

In [None]:
plot_country = df['country'].value_counts().reset_index()
plot_country.columns = ['country','count']
#keep only the first country listed in each entry
plot_country['country'] = plot_country['country'].str.split(',').str[0]
plot_country = plot_country['country'].value_counts().reset_index()
plot_country.columns = ['country','count']
#total of countries
plot_country.country.count()

In [None]:
plot_country['% of total'] = 100*plot_country['count']/plot_country['count'].sum()
plot_country['acum'] = plot_country['% of total'].cumsum()
plot_country

In [None]:
total = plot_country['country'].count()
total

In [None]:
#bar plot
fig = px.line(x=plot_country['country'],y=plot_country['acum'],title = 'Pareto Chart')
fig.add_bar(x=plot_country['country'],y=plot_country['count'])
fig.show()
#plot_country.plot(kind = 'line',ax = ax1,secondary_y = True, figsize = (100,80),x='country',y='acum')

In [None]:
limit = (100*2/3)
top_60p = plot_country[plot_country['acum'] < limit]
fig = px.line(x=top_60p['country'],y=top_60p['acum'],title = 'Pareto Chart - top 2/3 only')
fig.add_bar(x=top_60p['country'],y=top_60p['count'])
fig.show()

In [None]:
#number of countries that have 66.6% of the total titles
percent = 100*(top_60p['country'].count()/total)
percent

With the Pareto Chart it is possible to analyse how the number of titles is distributed among the producing countries.
Two thirds of all titles were produced in 14 countries. This shows that although Netflix has a diverse catalog, with titles from more than 81 countries, most of them are concentrated in 17.2% of those nations.
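
The arithmetic behind a Pareto cut-off can be sketched on a toy table. The country labels and counts below are hypothetical, and `top` and `share_of_countries` are illustrative names:

```python
import pandas as pd

# Hypothetical title counts per country, sorted from largest to smallest
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'count': [40, 20, 15, 15, 10],
})

# Share of each country and running (cumulative) share, in percent
toy['% of total'] = 100 * toy['count'] / toy['count'].sum()
toy['acum'] = toy['% of total'].cumsum()

# Countries whose cumulative share stays below two thirds
limit = 100 * 2 / 3
top = toy[toy['acum'] < limit]

# Share of countries that hold that part of the catalog
share_of_countries = 100 * len(top) / len(toy)
print(list(top['country']), share_of_countries)   # ['A', 'B'] 40.0
```

The cumulative column only makes sense when the rows are sorted from the largest count to the smallest, which is the order `value_counts()` returns by default.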

## iii. Conclusion and Next Steps
This study shows that Netflix started as a platform for Movies and, over the years, quickly turned its business towards media distribution, including more and more TV Shows in the catalog.

One interesting observation is that most of its TV Shows have between 1 and 2 seasons. This probably happens because most of them are produced by Netflix. Since the company started investing in production later than the big players in the market, those titles still have few seasons. It is another argument showing that Netflix is investing to become a major media player, not just a movie distributor or a tech/streaming company.

A key conclusion after exploring the countries where Movies and TV Shows were made is that Netflix still has most of its titles from the USA and the UK, since those are the greatest producers in the globe (in financial and awards numbers). However, it presents a diverse catalog, with more than 81 nations, even though 2/3 of the titles are concentrated in less than 20% of these countries.

The next steps of this Netflix exploratory analysis would be to include financial factors and to analyse the data in a more granular way, including producers and other stakeholders.