![netflix-imdb-header.jpg](https://i.postimg.cc/rwj6wc7j/netflix-imdb-header.jpg)


Photo by <a href="https://unsplash.com/@surface?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Surface</a> on <a href="https://unsplash.com/s/photos/netflix?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  

# <font color='#E50914'> Netflix </font> Data Exploration

---
This notebook is part of a chapter assignment in the course
[Data Analysis with Python: Zero to Pandas](zerotopandas.com). <br>
The objective of this assignment is to perform an exploratory data analysis
**using seaborn / matplotlib** for the visualizations.


Table of Contents
* [Downloading the Dataset](#first-bullet)
* [Data Preparation and Cleaning](#second-bullet)
* [Exploratory Analysis and Visualization](#third-bullet)
* [Asking and Answering Questions](#fourth-bullet)
* [Conclusion and Summary](#summary)

In [None]:
# !pip install squarify
!pip install pingouin

In [None]:
import pandas as pd
import pingouin as pg
import numpy as np
import seaborn as sns
import matplotlib
from matplotlib import style
import matplotlib.font_manager
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.io as pio
import squarify
import warnings
import os
# import opendatasets as od
%matplotlib inline

# set seaborn style
sns.set_style('darkgrid')
style.use('ggplot')

# plotly settings
pio.renderers.default = 'kaggle'

# set warning settings
warnings.filterwarnings(action='ignore')

# change default params in matplotlib
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['font.family'] = "roboto"
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

## Downloading the Dataset <a class="anchor" id="first-bullet"></a>


Let's begin by downloading the data, and listing the files within the dataset.

In [None]:
## dataset link from kaggle
# dataset_url = 'https://www.kaggle.com/luiscorter/netflix-original-films-imdb-scores' 

# # download data from url
# od.download(dataset_url)

# # data directory
# data_dir = 'netflix-original-films-imdb-scores'

# # list data
# os.listdir(data_dir)

## Data Preparation and Cleaning <a class="anchor" id="second-bullet"></a>

- Load our data using Pandas
- Check heads, tails, and also view by sampling
- Know our dataset shape and Column names
- Check missing Values


In [None]:
# read dataset
netflix_df = pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv')

In [None]:
# view data
netflix_df.head()

In [None]:
# check tail
netflix_df.tail()

In [None]:
# get samples
netflix_df.sample(5)

In [None]:
# check dataset shape
netflix_rows, netflix_cols = netflix_df.shape
netflix_column_names = netflix_df.columns
print(f'The Netflix Dataset contains {netflix_rows} rows and {netflix_cols} columns')

# print column names with their positions
print('The columns are as follows:')
for idx, col_name in enumerate(netflix_column_names):
    print(idx, col_name)

In [None]:
# change column names
netflix_df.columns = netflix_df.columns.str.lower().\
    str.replace(' ', '_')

In [None]:
# check dataset info
netflix_df.info()

In [None]:
# convert premiere feature to datetime
netflix_df['premiere'] = pd.to_datetime(netflix_df.premiere)
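
As a hedged illustration of this conversion, the sketch below parses a few made-up date strings and extracts the year with the `.dt` accessor. The sample values and the format string `'%B %d, %Y'` are assumptions for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical date strings, not rows from the Netflix dataset
raw_dates = pd.Series(['August 5, 2019', 'December 11, 2020', 'January 1, 2021'])

# An explicit format makes parsing strict instead of relying on inference
parsed = pd.to_datetime(raw_dates, format='%B %d, %Y')

# Once the column is datetime, components are available through .dt
years = parsed.dt.year.tolist()
months = parsed.dt.month_name().tolist()
print(years, months)
```

Passing `format` is optional; without it pandas tries to infer the layout, which is what the cell above relies on.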

In [None]:
# check for missing values
netflix_df.isnull().sum()

**Tables for Exploratory Data Analysis**

---

Titles and Genre

In [None]:
netflix_df.head()

In [None]:
# count titles per genre
netflix_genre_counts = netflix_df.genre.value_counts()

print(f'There are {len(netflix_genre_counts)} distinct genres.')
netflix_genre_counts

In [None]:
# total titles by year
netflix_df['premiere_year'] = netflix_df.premiere.dt.year
netflix_titles_premiere_per_year = netflix_df.groupby('premiere_year').size()

netflix_titles_premiere_per_year

In [None]:
# total titles by year with genre
crosstab_year_genre = pd.crosstab(index=netflix_df.premiere_year,
    columns=netflix_df.genre)

crosstab_year_genre
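
To make the shape of such a crosstab concrete, here is a minimal sketch on a made-up frame. The rows below are hypothetical and only mirror the `premiere_year` and `genre` column names used in this notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'premiere_year': [2019, 2019, 2019, 2020, 2020],
    'genre': ['Drama', 'Drama', 'Comedy', 'Drama', 'Thriller'],
})

# Rows become years, columns become genres, cells hold title counts
counts = pd.crosstab(index=toy['premiere_year'], columns=toy['genre'])
print(counts)
```

Combinations that never occur are filled with 0, which is why the full crosstab is mostly zeros when there are many genres.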

Runtimes and IMDB scores

In [None]:
# summary statistics for runtime
netflix_df.runtime.describe()

In [None]:
sns.boxplot(x=netflix_df.runtime);

In [None]:
sns.histplot(netflix_df.runtime);

In [None]:
# get median runtime
runtime_median = netflix_df.runtime.median()
print(f'The median of runtime is {runtime_median}')

# check normality
pg.normality(netflix_df.runtime, method='normaltest')

In [None]:
# get long runtimes: titles running 120 minutes (2 hours) or longer
netflix_longest_runtime = netflix_df[netflix_df.runtime >= 120]

# sort netflix values 
netflix_longest_runtime = netflix_longest_runtime.\
    sort_values(by='runtime', ascending=False)

In [None]:
# count long runtime titles
netflix_longest_runtime_count = netflix_longest_runtime.shape[0]
print(f'There are {netflix_longest_runtime_count} titles that are 2 hours or longer.')

# view sample from longest runtimes var
netflix_longest_runtime.sample(5)

In [None]:
# get the 15 shortest titles among those running 50 minutes or less
netflix_shortest_runtime = netflix_df[netflix_df.runtime <= 50]
netflix_shortest_runtime = netflix_shortest_runtime.\
    sort_values(by='runtime', ascending=True).head(15)

In [None]:
# count short runtime titles (capped at 15 by head(15) above)
netflix_shortest_runtime_count = netflix_shortest_runtime.shape[0]
print(f'Showing the {netflix_shortest_runtime_count} shortest titles, each 50 minutes or less.')

# view sample from shortest runtimes var
netflix_shortest_runtime.sample(5)

In [None]:
# netflix multilingual films (several languages separated by '/')
netflix_multilanguage_count = netflix_df[netflix_df.language.str.contains('/')]
multilanguage_size = len(netflix_multilanguage_count.language.value_counts())
print(f'There are {multilanguage_size} multilingual language combinations in this dataset. All are combined with English.')

# netflix unique languages
netflix_total_languages = netflix_df.language.value_counts()
unique_language_size = len(netflix_total_languages)

print(f'There are {unique_language_size - multilanguage_size} single languages in this dataset.')

In [None]:
# languages of netflix titles per year
netflix_language_in_years = pd.crosstab(index=netflix_df.premiere_year,
    columns=netflix_df.language)

## Exploratory Analysis and Visualization  <a class="anchor" id="third-bullet"></a>

#### What is the distribution of Netflix's IMDB scores?

In [None]:
# imdb_score feature distribution
netflix_df['imdb_score'].describe()

In [None]:
# imdb_score KDE (Kernel Density Estimate)
sns.kdeplot(data=netflix_df, x='imdb_score');

Our `imdb_score` is slightly skewed to the left. We can check its normality with pingouin's `normality` function.

In [None]:
# is our imdb_score feature distributed normally?
pg.normality(netflix_df['imdb_score'])
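
Beyond the normality test, the direction of the skew can be quantified with the sample skewness. The sketch below uses pandas' `Series.skew()` on made-up scores (hypothetical values, not the dataset); a negative result means a longer tail on the left.

```python
import pandas as pd

# Hypothetical scores with a few low values forming a left tail
scores = pd.Series([2.5, 4.0, 5.5, 6.0, 6.3, 6.5, 6.8, 7.0, 7.1, 7.2])

# Negative skewness: the tail extends toward low values
skewness = scores.skew()

# In a left-skewed sample the mean is pulled below the median
print(f'skewness = {skewness:.2f}')
print(f'mean = {scores.mean():.2f}, median = {scores.median():.2f}')
```

The same call on `netflix_df['imdb_score']` would give the value for the real feature.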

The boxplot below shows the extreme outliers in `imdb_score` that fall beyond the lower fence.

In [None]:
# Check outliers using seaborn boxplot
sns.boxplot(x=netflix_df.imdb_score);
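
The fences behind a boxplot follow Tukey's 1.5 × IQR rule, which is the default whisker length in seaborn. A minimal sketch on made-up scores (hypothetical values, not the dataset) shows how points beyond the lower fence can be listed explicitly:

```python
import pandas as pd

# Hypothetical scores; 2.5 is placed far below the rest on purpose
scores = pd.Series([2.5, 5.8, 6.0, 6.2, 6.3, 6.5, 6.7, 6.9, 7.0, 7.3])

# Quartiles and interquartile range
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Tukey fences: points outside them are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print(f'fences: [{lower_fence:.2f}, {upper_fence:.2f}]')
print(outliers.tolist())
```

Replacing `scores` with `netflix_df['imdb_score']` would return the actual titles' scores that the boxplot marks as outliers.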

#### How are the features correlated?

Using seaborn's `pairplot()` function, we can visualize each pair of numeric features (X, Y) as a grid of scatter subplots.

In [None]:
sns.pairplot(netflix_df);

The `imdb_score` and `runtime` features seem to show some kind of relationship. We can check
the correlation coefficients of our features using `pairwise_corr` from the pingouin library.

In [None]:
# netflix pairwise correlation
netflix_pairwise = pg.pairwise_corr(netflix_df, method='pearson')
netflix_pairwise
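
For intuition about what a Pearson coefficient measures, the sketch below computes it from its definition and compares the result with pandas' `Series.corr`. The values are hypothetical and chosen only to show a negative relationship; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    'runtime': [80, 90, 100, 110, 130],
    'imdb_score': [7.5, 7.0, 6.8, 6.0, 5.9],
})
x, y = toy['runtime'], toy['imdb_score']

# Pearson r: co-deviation from the means, scaled by both spreads
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (Pearson is its default method)
r_pandas = x.corr(y)
print(f'manual r = {r_manual:.4f}, pandas r = {r_pandas:.4f}')
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates little or none.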

#### What titles have the **<font color='green'> highest</font>** IMDB score

---

In [None]:
# sort values by imdb score in descending order
netflix_df.sort_values(by='imdb_score', ascending=False).head(10).\
    style.background_gradient(subset=['imdb_score', 'premiere_year'], cmap='Greens')

![lifeontheplanet.jpg](https://i.postimg.cc/DZ6gGsSD/lifeontheplanet.jpg)

#### What titles have the **<font color='red'> lowest </font>** IMDB score

---

In [None]:
# sort values by imdb score, but this time ascending
netflix_df.sort_values(by='imdb_score', ascending=True).head(10).\
    style.background_gradient(subset=['imdb_score', 'premiere_year'], cmap='Reds_r')

![image.png](attachment:b6b504e3-ab9e-4263-af20-e75d2da9d0b6.png)

#### What is the median IMDB score in terms of genre?
The `imdb_score` feature is not normally distributed, so we summarize each genre with the median rather than the mean.
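
A small sketch of why the median is the safer summary here: with made-up scores (hypothetical values, not the dataset), a single very low score drags a genre's mean down while its median barely moves.

```python
import pandas as pd

# Hypothetical scores; the Drama group contains one extreme low value
toy = pd.DataFrame({
    'genre': ['Drama'] * 5 + ['Documentary'] * 3,
    'imdb_score': [2.0, 6.4, 6.5, 6.6, 6.7, 7.0, 7.2, 7.4],
})

# Compare both summaries side by side for each genre
summary = toy.groupby('genre')['imdb_score'].agg(['mean', 'median'])
print(summary)
```

For the Drama group the mean falls well below the typical score, while the median stays at the middle value.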

In [None]:
# calculates the median for every genre
median_genre_imdb = netflix_df.groupby('genre')['imdb_score'].median()

# filter the genre with median less than or equal to 5
median_low_genre_imdb = median_genre_imdb[median_genre_imdb <= 5].\
    sort_values()

# filter the genre with median greater than or equal to 6
median_high_genre_imdb = median_genre_imdb[median_genre_imdb >= 6].\
    sort_values(ascending=False)

In [None]:
# view the top 15 lowest (median) imdb genre score
pd.DataFrame(median_low_genre_imdb[:15]).\
    style.background_gradient(subset=['imdb_score'], cmap='Reds_r')

In [None]:
# view the top 10 highest (median) imdb genre score
pd.DataFrame(median_high_genre_imdb[:10]).\
    style.background_gradient(subset=['imdb_score'], cmap='Greens')

## Asking and Answering Questions <a class='anchor' id='fourth-bullet'></a>



#### **What genre do they frequently release in their titles?**

---

In [None]:
total_genre = len(netflix_genre_counts)
print(f'Netflix has a total of {total_genre} genres in its catalog of original films.')

In [None]:
# arrange values
netflix_genre_counts = pd.DataFrame(netflix_genre_counts)

# get index as array
netflix_genre_names = np.array(netflix_genre_counts.index)

# flatten values
netflix_genre_values = netflix_genre_counts.values.flatten()

# figure size
f, ax = plt.subplots(figsize=(30, 10))

# create plot
genre_plot = sns.barplot(y=netflix_genre_values, x=netflix_genre_names);

# rotate x-axis text vertically
plt.setp(genre_plot.get_xticklabels(), rotation=90)
plt.xticks(fontsize=10)

# tweak labels
genre_plot.set_xlabel(xlabel='Genres', fontsize=20);
genre_plot.set_ylabel(ylabel='Count', fontsize=20);
genre_plot.set_title('Netflix Genres', fontsize=20);

In [None]:
# filter top 5
top_five_genres = ['Documentary', 'Drama', 'Comedy', 'Romantic comedy', 'Thriller']
netflix_top_five_genre = netflix_df[netflix_df.genre.isin(top_five_genres)]

# use crosstab to count genre in years
crosstab_top_five_genre = pd.crosstab(
    netflix_top_five_genre.premiere_year, 
    netflix_top_five_genre.genre)

# marker colors
sub_colors = ['lightgray'] * 5
sub_colors[1] = '#E50914'

# create plotly figure
fig = go.Figure()

# for loop to add genres
color_list = 0
for genres in crosstab_top_five_genre.columns:
    fig.add_trace(go.Scatter(
        x=crosstab_top_five_genre.index,
        y=crosstab_top_five_genre[genres].values,
        mode='lines+markers',
        line_color=sub_colors[color_list],
        name=genres))
    
    color_list += 1

# tweak settings and names
fig.update_layout(template='plotly_white', width=780,
    title='<b>Top 5 Genre</b><br> Counts in Years')
fig.update_yaxes(title='Count')
fig.update_xaxes(title='Years', tickmode='linear')

# view plot
fig.show()

#### **How many films did they release through the years (2014-2021)?**

---

In [None]:
# convert to dataframe
netflix_year_title_counts = pd.DataFrame(netflix_titles_premiere_per_year)

# get index as array
netflix_years_names = np.array(netflix_year_title_counts.index)

# get values and flatten for viz
netflix_years_values = netflix_year_title_counts.values.flatten()

# figure size
f, ax = plt.subplots(figsize=(15,5))

# seaborn bar plot
netflix_year_plot = sns.barplot(x=netflix_years_names, y=netflix_years_values);

# set barplot title
netflix_year_plot.set_title('Film Counts per Year', fontsize=15);

# set ylabel title
netflix_year_plot.set_ylabel('Count');

#### How many films were released per year in the first quarter (Q1)?

---

In [None]:
# filter first quarter (copy so the new column is not assigned on a slice)
netflix_first_quarter = netflix_df[netflix_df['premiere'].dt.month <= 3].copy()

# use the .dt accessor to get month names
netflix_first_quarter['month'] = netflix_first_quarter.premiere.dt.month_name()

# use crosstab to get counts in months and their designated years
netflix_first_quarter = pd.crosstab(netflix_first_quarter.month, netflix_first_quarter.premiere_year)

# flip data
netflix_first_quarter = netflix_first_quarter.T

# tweak columns in order
netflix_first_quarter = netflix_first_quarter[['January', 'February', 'March']]

In [None]:
# colors
colors = ['lightgray'] * 3
colors[0] = '#E50914'

fig = go.Figure()
line_color = 0
for columns in netflix_first_quarter.columns:
    fig.add_trace(go.Scatter(
        x=netflix_first_quarter.index,
        y=netflix_first_quarter[columns].values,
        name=columns,
        line_color=colors[line_color]
    ))
    line_color += 1
    
fig.update_layout(
    template='plotly_white', 
    title='<b>Quarter 1:</b> 2016-2021 <br> No. of Films Released',
    width=780)

fig.show()

#### **What titles had the longest runtime and which genre do they belong to?**

---

In [None]:
# top 15 longest runtime
netflix_longest_runtime = netflix_longest_runtime.head(15)

# figure size
f, ax = plt.subplots(figsize=(10,10))

# create barplot
netflix_runtime_plot = sns.barplot(
    y='title', x='runtime',
    hue='genre',
    data=netflix_longest_runtime,
    dodge=False);

# tweak title and labels
netflix_runtime_plot.set_title('Longest Runtime', fontsize=15)
netflix_runtime_plot.set_xlabel('Runtime in minutes', fontsize=15)
netflix_runtime_plot.set_ylabel('Title', fontsize=15)

# move legend to side
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

#### **What titles have the shortest runtime and what genre do they belong to?**

---

In [None]:
# final sort
netflix_shortest_runtime = netflix_shortest_runtime.sort_values(by='runtime')

# figure size
f, ax = plt.subplots(figsize=(10,10))

# create barplot
netflix_runtime_plot = sns.barplot(
    y='title', x='runtime',
    hue='genre',
    data=netflix_shortest_runtime,
    dodge=False);

# tweak title and labels 
netflix_runtime_plot.set_title('Shortest Runtime', fontsize=15)
netflix_runtime_plot.set_xlabel('Runtime in minutes', fontsize=15)
netflix_runtime_plot.set_ylabel('Title', fontsize=15)

# move legend to side
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

#### **What languages are they in? (Except for English)**

---

In [None]:
# check top languages
netflix_total_languages.head()

In [None]:
# convert to dataframe
netflix_languages = pd.DataFrame(netflix_total_languages)

# drop english language from counts
netflix_languages = netflix_languages.drop(index='English')

# flatten array
netflix_langauges_values = netflix_languages.values.flatten()

# make array of label / names
netflix_langauges_labels = np.array(netflix_languages.index)

In [None]:
# set figure size
f, ax = plt.subplots(figsize=(25,15))

# add figure title
f.suptitle('Netflix Language', fontsize=30)

# tweak matplotlib cmap
cmap = matplotlib.cm.Dark2

# create min and max values for cmap
mini, maxi = netflix_langauges_values.min(), netflix_langauges_values.max()
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)

# use list comprehension to get color list
colors = [cmap(norm(value)) for value in netflix_langauges_values]

# create tree map using squarify library
netflix_language_treemap = squarify.plot(sizes=netflix_langauges_values,
    color=colors,
    label=netflix_langauges_labels, alpha=0.8,
    bar_kwargs={'alpha':.7},
    text_kwargs={'fontsize':10, 'color': 'white'});

# set axes to none
ax.set_xticks([]);
ax.set_yticks([]);

#### **Which genres were prevalent through the years (2014-2021)?**

---

In [None]:
# set figure size
f, ax = plt.subplots(figsize=(30, 20))

# create seaborn heatmap reference crosstab year and genre
heatmap_genre_year = sns.heatmap(crosstab_year_genre, cbar=False, cmap="magma");

# set title and tweak fontsize
heatmap_genre_year.set_title('Netflix Genre', fontsize=30);

#### **Which languages were prevalent through the years (2014-2021)?**

---

In [None]:
# set figure size
f, ax = plt.subplots(figsize=(30, 20))

# create a crosstab for year and language
crosstab_year_language = pd.crosstab(index=netflix_df.premiere_year,
    columns=netflix_df.language)

# drop english counts
crosstab_year_language = crosstab_year_language.drop(columns='English')

# plot heatmap using seaborn referencing crosstab
sns.heatmap(crosstab_year_language.T, cbar=False, cmap="magma");

## Inferences and Conclusion <a class='anchor' id='summary'></a>
Data Summary

---
- The data consists of `584 films` and 6 features.
- The data does not contain `NaN` values.
- The 6 features are:
	- title, `object`-string
	- genre, `object`-string
	- premiere, `object`-string
	- runtime, `int`
	- imdb_score, `float`
	- language, `object`-string
- The dataset contains Netflix original titles with a `premiere_year` from 2014 to 2021, roughly 7 years of data.
- There are `115` unique genres in the dataset.

Feature `runtime`

---

- The median runtime (in minutes) is `97` with a standard deviation of `27.76`. However, the feature is negatively 
skewed: a tail of very short titles stretches the distribution to the left.

- There are `74` titles that run 2 hours or longer, and the short-runtime view is limited to `15` titles of 50 minutes or less.
- The longest title is `The Irishman`. The top 15 longest runtimes span varying genres, while
the top 15 shortest runtimes are mostly documentary originals.

Feature `language`

---
- `English is the main language` used in these films, followed by Spanish and Hindi.
- There are `23` multilingual language combinations in this dataset. All are combined with English.
- There are `22` single languages in this dataset.


Feature `genre`


---

- There's a total of `115` distinct genres in this dataset.
- `Documentary is the dominant genre`, followed by Drama, Comedy, and Romantic Comedy.
- There's a large `uptrend of romantic comedy in 2018`, from 1 to 12.
- The number of `documentary films` rose by `60%` between 2018 and 2019, from 25 to 40.
- The number of `thriller films` almost `quadrupled` between 2019 and 2020, from 5 to 18.


Feature `imdb_score`

---
- imdb_score is slightly skewed to the left, with some outliers below the lower fence
	of the distribution.
- imdb_score has a `min` of `2.5` and a `max` of `9`.

We can infer that:
- There seems to be a negative correlation between features `runtime` and `imdb_score`.
- There seems to be a negative correlation between features `imdb_score` and `premiere_year`.

Further statistical tests should be considered.
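
One possible follow-up, sketched here under stated assumptions: because `imdb_score` is not normally distributed, a rank-based (Spearman) correlation with a permutation test avoids the normality assumption. The values below are hypothetical, and the helper `spearman` plus the choice of 1000 permutations are illustrative, not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'runtime': [45, 80, 90, 95, 100, 110, 125, 150],
    'imdb_score': [7.4, 6.9, 6.6, 6.7, 6.1, 6.0, 5.8, 5.5],
})


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())


observed = spearman(toy['runtime'], toy['imdb_score'])

# Permutation test: shuffling one variable simulates "no association"
rng = np.random.default_rng(0)
n_permutations = 1000
scores = toy['imdb_score'].to_numpy()
permuted = np.array([
    spearman(toy['runtime'], pd.Series(rng.permutation(scores)))
    for _ in range(n_permutations)
])

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value = (np.abs(permuted) >= abs(observed)).mean()
print(f'rho = {observed:.3f}, permutation p-value = {p_value:.4f}')
```

Applied to the real columns, a small p-value would support the negative relationship suggested above, while a large one would indicate the pattern may be noise.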




## References and Future Work
   
   - Add other observations that did not make the summary
