# Netflix Originals Visualization

# Introduction

**In this notebook we analyze and visualize Netflix Originals to draw insights about their ratings, genres, release dates and runtimes.**

## Menu

- [Imports](#Imports)
- [Data Cleaning](#Data-Cleaning)
- [Visualization](#Visualization)
    - [Top Ratings](#Top-Ratings)
    - [Genre Analysis](#Genre-Analysis)
    - [Date Analysis](#Date-Analysis)
    - [Runtime Analysis](#Runtime-Analysis)

# Imports

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings('ignore')

# Data Cleaning

In [None]:
df = pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv')

df.head()

Convert `Premiere` to datetime, derive the year, month and weekday columns, and rename the score column.

In [None]:
df['Premiere'] = pd.to_datetime(df['Premiere'])

# columns year, month and weekday
df['year']    = df['Premiere'].dt.year
df['month']   = df['Premiere'].dt.month_name()
df['weekday'] = df['Premiere'].dt.day_name()

#rename score
df.rename(columns= {'IMDB Score' : 'Score'}, inplace= True)

df.head()

In [None]:
# Build a long-format dataframe with one row per (film, genre) pair
columns = ['Genre', 'Title', 'Premiere', 'Runtime', 'Score', 'Language', 'year', 'month', 'weekday']

rows = []
for _, film in df.iterrows():
    # '/', '-' and spaces all act as separators, so 'Romantic comedy-drama' yields three genres
    genres = film['Genre'].replace('/ ', '/').replace(' /', '/').replace('-', '/').replace(' ', '/').split('/')

    for genre in genres:
        new_row = film[columns].to_dict()
        new_row['Genre'] = genre.title()
        rows.append(new_row)

# DataFrame.append was removed in pandas 2.0, so the frame is built once from the collected rows
df_genre = pd.DataFrame(rows, columns= columns)
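The one-row-per-genre table can also be built without an explicit loop. The following is a minimal sketch on a toy frame (the titles and scores are made up for illustration), and `split_genres` is a hypothetical helper, not part of the notebook. It assumes the same separators as the loop above (`/`, `-` and spaces), but collapses runs of separators so that no empty genre tokens are produced.

```python
import pandas as pd

def split_genres(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (film, genre token). Hypothetical helper for illustration."""
    tokens = (frame['Genre']
              .str.replace(r'\s*[/-]\s*|\s+', '/', regex=True)  # '/', '-' and spaces become '/'
              .str.split('/'))
    out = frame.assign(Genre=tokens).explode('Genre', ignore_index=True)
    out['Genre'] = out['Genre'].str.title()
    return out

# Toy data, made up for illustration
toy = pd.DataFrame({
    'Title': ['Film A', 'Film B'],
    'Genre': ['Romantic comedy-drama', 'Horror / Thriller'],
    'Score': [6.1, 5.4],
})

toy_genre = split_genres(toy)
print(toy_genre)
```

Because `explode` repeats the other columns for every token, each film's score is counted once per genre it belongs to, just as in the loop version.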

# Visualization

## Top Ratings

In [None]:
data = df.sort_values('Score', ascending= False)[:10]


plt.figure(figsize= (6, 15))
sns.set_theme()

plt.subplot(2, 1, 1)
ax = sns.barplot(data= data, 
                 x= 'Score', 
                 y= 'Title', 
                 palette= 'BuGn_r', )

ax.set_xlim(0, 10)
plt.title('Top 10 Highest-Rated Netflix Originals', size= 22)
plt.xlabel('Score', size= 20)
plt.ylabel('Title', size= 20)
plt.xticks(size= 16)
plt.yticks(size= 20)

# Annotate each bar with its score near the right end of the bar
for patch in ax.patches:
    width = patch.get_width()
    x = patch.get_x()  # left edge of the bar (was mistakenly read from get_height())
    y = patch.get_y()
    
    plt.text(x + width - 0.7, y + 0.5, '{}'.format(width), size= 16)

    
data = df.sort_values('Score')[:10]

plt.subplot(2, 1, 2)
ax = sns.barplot(data= data, 
                 x= 'Score', 
                 y= 'Title', 
                 palette= 'Reds_r', )

ax.set_xlim(0, 10)
plt.title('Top 10 Lowest-Rated Netflix Originals', size= 22)
plt.xlabel('Score', size= 20)
plt.ylabel('Title', size= 20)
plt.xticks(size= 16)
plt.yticks(size= 20)

# Annotate each bar with its score near the right end of the bar
for patch in ax.patches:
    width = patch.get_width()
    x = patch.get_x()  # left edge of the bar (was mistakenly read from get_height())
    y = patch.get_y()
    
    plt.text(x + width - 0.7, y + 0.5, '{}'.format(width), size= 16)

## Genre Analysis

In [None]:
plt.figure(figsize= (11,8), )


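# Take the ten genres with the largest total score (this favors genres with many releases),
# then order them by mean score for the plots below
scores = df_genre['Score'].astype(float).groupby(df_genre['Genre'])
top10 = scores.sum().sort_values(ascending= False).index[:10]
top_genres = scores.mean().loc[top10].sort_values(ascending= False).index.tolist()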

plt.subplot(2, 2, 1)
sns.boxplot(data= df_genre, 
            x= 'Genre', 
            y= 'Score', 
            order= top_genres)
plt.title('Score Distribution per Genre', size= 22)
plt.xlabel(None)
plt.xticks([])
plt.yticks(size= 14)
plt.ylim((0, 10))

plt.subplot(2, 2, 3)
for genre in top_genres:
    sns.scatterplot(data= df_genre.loc[df_genre['Genre'] == genre],
                   x= 'Genre',
                   y= 'Score',
                   hue= 'Score',
                   size= 'Score',
                   palette= 'RdYlGn',
                   legend= False)
plt.xlabel(None)
plt.xticks(rotation= 90,size= 16)
plt.yticks(size= 14)
plt.ylim((0, 10))


plt.subplot(1, 2, 2)
sns.countplot(data= df_genre.loc[df_genre['Genre'].isin(top_genres)], x= 'Genre', order= top_genres)
plt.title('Released Originals per Genre', size= 22)
plt.xticks(rotation= 90, size= 16)
plt.yticks(size= 14)
plt.xlabel(None)

plt.show()

- **Insights**
    - Documentary is the genre with both the most releases and the highest ratings.
    - Animation has the fewest releases among the top genres, yet it is one of the highest rated, so Netflix could invest more in it.
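The comparison between release counts and ratings can be checked numerically as well as visually. This is a minimal sketch on toy data (the genres and scores below are made up for illustration, not taken from the dataset), assuming a long-format frame shaped like `df_genre`:

```python
import pandas as pd

# Toy long-format data (one row per film/genre pair), made up for illustration
toy_genre = pd.DataFrame({
    'Genre': ['Documentary', 'Documentary', 'Documentary', 'Animation', 'Comedy', 'Comedy'],
    'Score': [7.0, 7.4, 6.6, 7.2, 5.8, 6.0],
})

# Number of releases and typical score for each genre
summary = (toy_genre.groupby('Genre')['Score']
           .agg(releases='count', mean_score='mean', median_score='median')
           .sort_values('mean_score', ascending=False))
print(summary)
```

Reading `releases` next to `mean_score` makes it easy to spot genres that score well despite having few titles. A mean computed from very few films is a weak signal, so the count column matters when interpreting it.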

In [None]:
plt.figure(figsize= (10, 6))

sns.countplot(data= df_genre.loc[df_genre['Genre'].isin(top_genres)], 
              x= 'year', 
              hue= 'Genre')

plt.title('Released Genre per Year', size= 25)
plt.xlabel(None)
plt.xticks(size= 16)

plt.show()

- **Insights**
    - Over the years, Netflix has invested much more in documentaries, dramas and comedies.
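The counts behind a grouped bar chart like this one can be tabulated with `pd.crosstab`. The sketch below uses toy data (made up for illustration) and assumes a frame with `year` and `Genre` columns like `df_genre`:

```python
import pandas as pd

# Toy data, made up for illustration
toy_genre = pd.DataFrame({
    'year':  [2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    'Genre': ['Drama', 'Comedy', 'Drama', 'Documentary', 'Documentary',
              'Drama', 'Comedy', 'Documentary', 'Documentary'],
})

# Rows are years, columns are genres, cells are release counts
releases = pd.crosstab(toy_genre['year'], toy_genre['Genre'])
print(releases)
```

A table of this shape lets the growth of each genre be read off row by row, and combinations with no releases show up explicitly as zeros.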

## Date Analysis

In [None]:

plt.figure(figsize= (11, 8))

sns.set_theme()
plt.subplot(2, 2, 1)
ax = sns.boxplot(data= df, 
                 x= 'year', 
                 y= 'Score')


ax.set_ylim((0, 10))
plt.title('Distribution of Rating per Year', size= 20)
plt.ylabel('Score', size= 18)
plt.xlabel(None)
plt.xticks([])

plt.subplot(2, 2, 3)
ax = sns.scatterplot(data= df, 
                     x= 'year', 
                     y= 'Score', 
                     size= 'Score', 
                     hue= 'Score', 
                     palette= 'RdYlGn', 
                     legend= False)

ax.set_ylim((0, 10))
plt.xlabel('Year', size= 18)
plt.ylabel('Score', size= 18)
plt.xticks(size= 13, rotation= 90)

plt.subplot(1, 2, 2)
sns.countplot(data= df, x= 'year')
plt.title('Released per Year', size= 22)
plt.xlabel('Year', size= 18)
plt.xticks(size= 13, rotation= 90)

plt.show()

- **Insights**
    - Each year Netflix releases more originals.
    - At first glance the ratings appear to have dropped over the years. The "Released per Year" chart shows, however, that the number of releases increased sharply over the same period, which pulled the average rating down.
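One way to examine this effect is to put the number of releases next to the mean, median and spread of the scores for each year. The following is a minimal sketch on toy data (the values are made up for illustration), assuming a frame with `year` and `Score` columns like `df`:

```python
import pandas as pd

# Toy data, made up for illustration: few releases early, many later
toy = pd.DataFrame({
    'year':  [2015, 2015, 2020, 2020, 2020, 2020],
    'Score': [7.0, 6.8, 7.5, 6.5, 5.5, 4.5],
})

# Release count alongside central tendency and spread of the scores
per_year = toy.groupby('year')['Score'].agg(
    releases='count', mean_score='mean', median_score='median', std_score='std')
print(per_year)
```

In this toy example the later year has the highest-scoring film and also a lower mean, because the additional releases widen the range of scores.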
    

In [None]:
plt.figure(figsize= (10, 5))

plt.subplot(1, 2, 1)
ax = sns.boxplot(data= df, 
                 x= 'month', 
                 y= 'Score',
                 order= ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])
ax.set_ylim((0, 10))
plt.title('Distribution of Rating per Month', size= 18)
plt.ylabel('Score', size= 14)
plt.xlabel(None)
plt.xticks(rotation= 90, size= 15)


plt.subplot(1, 2, 2)
ax = sns.countplot(data= df, x= 'month', order= ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'])
plt.title('Released per Month', size= 22)
plt.xlabel(None)
plt.xticks(rotation= 90, size= 15)

plt.show()

- **Insights**
    - There is no clear relationship between the release month and the score.
    - October has the most releases and July the fewest.
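The busiest and quietest months can also be read from a table of counts in calendar order. This is a minimal sketch on toy dates (made up for illustration), assuming a datetime `Premiere` column as in `df`. The month order is derived from pandas itself rather than typed out by hand:

```python
import pandas as pd

# Month names in calendar order, generated by pandas so they match dt.month_name()
month_order = pd.date_range('2000-01-01', periods=12, freq='MS').month_name().tolist()

# Toy premiere dates, made up for illustration
toy = pd.DataFrame({
    'Premiere': pd.to_datetime(['2020-10-02', '2020-10-16', '2019-07-12', '2021-01-08'])
})

# Count releases per month; months without releases are kept with a count of 0
releases_per_month = (toy['Premiere'].dt.month_name()
                      .value_counts()
                      .reindex(month_order, fill_value=0))
print(releases_per_month)
print('Busiest month:', releases_per_month.idxmax())
```

The same `month_order` list could be passed as the `order` argument of the plots above, which avoids repeating the twelve names in every call.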

In [None]:
plt.figure(figsize= (10, 6))

plt.subplot(1, 2, 1)
ax = sns.boxplot(data= df, 
                 x= 'weekday', 
                 y= 'Score',
                 order= ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
ax.set_ylim((0, 10))
plt.title('Distribution of Rating per Day of Week', size= 15)
plt.ylabel('Score', size= 15)
plt.xlabel(None)
plt.xticks(rotation= 90, size= 15)
plt.yticks(size= 15)


plt.subplot(1, 2, 2)
ax = sns.countplot(data= df, 
                   x= 'weekday',
                   order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.title('Released per Day of Week', size= 18, )
plt.xlabel(None)
plt.xticks(rotation= 90, size= 15)
plt.yticks(size= 14)

plt.show()

- **Insights**
    - There is no clear relationship between the release day and the rating.
    - The vast majority of originals are released on Friday.

## Runtime Analysis

In [None]:
plt.figure(figsize= (10, 6))

plt.subplot(1, 2, 1)
ax = sns.scatterplot(data= df, 
                x= 'Runtime', 
                y= 'Score', 
                hue= 'Score', 
                size= 'Score',
                palette= 'RdYlGn',
                legend= False)

ax.set_ylim((0, 10))
plt.title('Score per Runtime', size= 20)
plt.xlabel('Runtime', size= 17)
plt.ylabel('Score', size= 17)
plt.yticks(size= 15)
plt.xticks(size= 15)

plt.subplot(1, 2, 2)

sns.histplot(df['Runtime'])
plt.title('Runtime Distribution', size= 20)
plt.xlabel('Runtime', size= 17)
plt.yticks(size= 15)
plt.xticks(size= 15)

plt.show()

- **Insights**
    - Most originals run for about 100 minutes.
    - Short originals (those with a low runtime) have predominantly high ratings.
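The relationship between runtime and score can be summarized by grouping films into runtime bands. The sketch below uses toy values (made up for illustration), and the 60 and 120 minute cut-offs are assumptions chosen for the example, not thresholds from the dataset:

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'Runtime': [15, 30, 45, 90, 100, 110, 130],
    'Score':   [7.5, 7.0, 7.2, 6.0, 6.4, 5.8, 6.5],
})

# Assumed runtime bands: (0, 60], (60, 120], (120, inf)
bins = [0, 60, 120, float('inf')]
labels = ['short (<= 60 min)', 'feature (61-120 min)', 'long (> 120 min)']
toy['length'] = pd.cut(toy['Runtime'], bins=bins, labels=labels)

by_length = toy.groupby('length', observed=True)['Score'].agg(films='count', mean_score='mean')
print(by_length)

# A single-number summary of the linear relationship
correlation = toy['Runtime'].corr(toy['Score'])
print('Pearson correlation:', round(correlation, 2))
```

The banded table shows whether short titles really score higher on average, while the correlation indicates how strong the overall linear trend is. Neither replaces the scatter plot, which also reveals outliers.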

In [None]:
plt.figure(figsize= (10, 6))
sns.boxplot(data= df_genre.loc[df_genre['Genre'].isin(top_genres)], 
            x= 'Genre',
            y= 'Runtime')

plt.title('Runtime Distribution per Genre', size= 20)
plt.ylabel('Runtime', size= 15)
plt.xlabel(None)
plt.xticks(rotation= 90,size= 15)
plt.yticks(size= 14)


plt.show()

- **Insights**
    - Animations tend to have shorter runtimes than the other top genres.

In [None]:
plt.figure(figsize= (10, 6))
sns.boxplot(data = df, 
            y= 'Runtime',
            x= 'year')

plt.title('Runtime Distribution per Year', size= 20)
plt.ylabel('Runtime', size= 15)
plt.xlabel(None)
plt.xticks(rotation= 45, size= 15)
plt.yticks(size= 13)

plt.show()

- **Insights**
    - Runtime increased slightly after 2015.