# INTRODUCTION

In this study:
- We will perform Exploratory Data Analysis (EDA) on the Netflix original films dataset.
- The study aims to be beginner friendly and explains each step along the way as fully as possible.
- The dataset contains 584 Netflix original films across different genres.
- Each film has a language, premiere date, runtime, and IMDB score.

- First and foremost, let's import the required libraries.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# OVERVIEW

Let's read our csv dataset and have a look at basic information about it.

In [None]:
df = pd.read_csv("../input/netflix-original-films-imdb-scores/NetflixOriginals.csv", encoding='latin-1')
df.head()

In [None]:
df.shape

In total, we have 584 films with 6 attributes.

In [None]:
df.columns

In [None]:
df.isnull().sum()

It seems we have a black swan here: there are no missing values at all. (In black swan theory, a black swan is an event that comes as a surprise and has a major effect.)

In [None]:
df.info()

Although **Premiere** holds dates, it is stored as object type, so we need to convert it to datetime. The rest is fine.

In [None]:
df["Date"] = pd.to_datetime(df.Premiere)
df["Date"]
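
When converting text to dates, it can help to check that every value was actually parsed. The following is a minimal sketch on made-up sample strings (the values are assumptions, not rows from the dataset): `errors="coerce"` turns anything that cannot be parsed into `NaT`, so the failures can be listed and inspected.

```python
import pandas as pd

# Hypothetical sample values standing in for the Premiere column.
premiere = pd.Series(["August 5, 2019", "December 11, 2020", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising an error.
parsed = pd.to_datetime(premiere, errors="coerce")

# Original strings that failed to parse; inspect these before relying on the new column.
failed = premiere[parsed.isna()]

print(parsed.dtype)
print(failed.tolist())
```

If the list of failures is empty, the conversion covered every row.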

In [None]:
df.head()

It is better now. We can also split the date into **year**, **month**, and **day of week**.

In [None]:
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["day_of_week"] = df["Date"].dt.dayofweek
df["Year_Month"] = df["Date"].dt.strftime("%Y-%m")
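
For beginners, it may be useful to see what the `.dt` accessors return. This small sketch uses two made-up dates (assumptions for illustration only) and shows that `dayofweek` is an integer where Monday is 0 and Sunday is 6, while `day_name()` gives a readable label for the same day.

```python
import pandas as pd

# Two hypothetical dates: a Friday and the following Monday.
dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-04"]))

parts = pd.DataFrame({
    "Date": dates,
    "day_of_week": dates.dt.dayofweek,         # integer, Monday=0 ... Sunday=6
    "day_name": dates.dt.day_name(),           # readable label for the same day
    "Year_Month": dates.dt.strftime("%Y-%m"),  # text such as "2021-01"
})

print(parts)
```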

In [None]:
df.head()

In [None]:
df.info()

# ANALYSIS

### 1. Genre

In [None]:
df.Genre.nunique()

In [None]:
df.Genre.unique()

In [None]:
a = df.Genre.unique()
len(a)

In [None]:
df.Genre.value_counts(normalize=True) * 100

In total, we have 115 different genres.

Let's check the top 20 **Genre**.

In [None]:
genre = df.Genre.value_counts().nlargest(20)  # genre = df['Genre'].value_counts()[:20]
genre

At 27.22%, Documentary is the most common genre, followed by Drama.
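
The percentage comes from `value_counts(normalize=True)`, which returns shares instead of counts. Here is a minimal sketch on made-up genre labels (not taken from the dataset) that puts counts and percentages side by side.

```python
import pandas as pd

# Hypothetical genre labels, not taken from the dataset.
genres = pd.Series(["Documentary", "Drama", "Documentary", "Comedy", "Documentary"])

counts = genres.value_counts()                      # absolute counts per genre
shares = genres.value_counts(normalize=True) * 100  # percentages that sum to 100

# Both Series share the same index, so they line up row by row.
summary = pd.DataFrame({"count": counts, "percent": shares.round(2)})

print(summary)
```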

In [None]:
genre.index

In [None]:
genre.values

In [None]:
fig = px.bar(data_frame=genre, x=genre.index, y=genre.values, labels={"y":"Number of Movies from the Genre", "index":"Genres"})
fig.update_layout(xaxis={"categoryorder":"total descending"})

fig.show()

### 2. Languages

In [None]:
df.Language.unique()

In [None]:
df.Language.value_counts()

In [None]:
df.Language.nunique()

In [None]:
df.Language.value_counts().nlargest(20)

In [None]:
top_20_lang = df.Language.value_counts().nlargest(20)
top_20_lang

In [None]:
top_20_lang.index

In [None]:
top_20_lang.values

In [None]:
fig = px.bar(data_frame=top_20_lang, x=top_20_lang.index, y=top_20_lang.values, labels={"y":"Count", "index":"Language"})
fig.update_layout(xaxis={"categoryorder":"total descending"})

fig.show()

English is by far the most common language among the programs, followed by Hindi and Spanish.

### 3. Runtime

In [None]:
df.Runtime

In [None]:
df.Runtime.describe()

**Note:** If the **mean** is greater than the **median**, the distribution is **positively** skewed. If the **mean** is less than the **median**, the distribution is **negatively** skewed.

- The typical runtime is around 95 minutes.
- Based on the descriptive statistics, we can expect outliers on both the maximum side and the minimum side.
- Since the mean runtime is lower than the median, we can expect a left-skewed distribution, which means a longer tail of short runtimes on the minimum side.
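
The mean-versus-median comparison is a rule of thumb, and it can be cross-checked with the sample skewness. The sketch below uses made-up runtimes (an assumption for illustration) with one very short film pulling the mean down.

```python
import pandas as pd

# Hypothetical runtimes in minutes; the single short value creates a left tail.
runtimes = pd.Series([10, 90, 95, 97, 100, 102, 105])

mean = runtimes.mean()
median = runtimes.median()
skewness = runtimes.skew()  # sample skewness; a negative value indicates a longer left tail

print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}")
```

On the real data, `df.Runtime.skew()` would give the same kind of check.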

In [None]:
fig = px.histogram(data_frame=df, x="Runtime", title="Runtime of Programs")

fig.show()

The histogram above confirms that the distribution of **Runtime** is left (negatively) skewed.

In [None]:
fig = px.box(data_frame=df, x="Runtime", hover_data=["Title", "Genre"])
fig.update_traces(quartilemethod="inclusive")

fig.show()

In [None]:
df[df.Runtime == df.Runtime.max()]["Title"]

In [None]:
df[df.Runtime == df.Runtime.min()]["Title"]

- As expected, we have a left-skewed distribution with multiple outliers on both sides, but many more on the left (minimum) side.
- The longest movie is **The Irishman** and the shortest movie is **Sol Levante**.
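
Box plots commonly mark points as outliers when they fall more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand to made-up runtimes. It uses pandas' default quantile interpolation, which is an assumption here and may differ slightly from the quartile method chosen for the plot.

```python
import pandas as pd

# Hypothetical runtimes in minutes, with one very short and one very long film.
runtimes = pd.Series([4, 80, 90, 95, 100, 105, 110, 209])

q1 = runtimes.quantile(0.25)
q3 = runtimes.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = runtimes[(runtimes < lower) | (runtimes > upper)]

print(f"lower fence={lower}, upper fence={upper}")
print(outliers.tolist())
```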

### 4. IMDB Score

In [None]:
df["IMDB Score"].describe()

- The average rating is around 6.3. The maximum rating is 9 and the minimum is 2.5.

- The mean and median are close to each other. Since the median is larger than the mean, we can expect a left-skewed distribution with several outliers on the sides.

In [None]:
fig = px.histogram(data_frame=df, x="IMDB Score", title="IMDB Scores of the Programs")

fig.show()

As expected, we have a left-skewed distribution.

In [None]:
fig = px.box(data_frame=df, x="IMDB Score", hover_data=["Title", "Genre"])
fig.update_traces(quartilemethod="inclusive")

fig.show()

In [None]:
df[df["IMDB Score"] == df["IMDB Score"].max()][["Title", "Genre"]]

In [None]:
df[df["IMDB Score"] == df["IMDB Score"].min()][["Title", "Genre"]]

- The documentary **A Life on Our Planet** has the maximum rating in the list, 9.

- **Enter the Anime** has the minimum rating.

- Interestingly, both the highest-rated and the lowest-rated programs are documentaries.

Now, let's see the **correlation between Runtime and IMDB Score**.

In [None]:
df["Runtime"].corr(df["IMDB Score"])

In [None]:
df[["IMDB Score", "Runtime"]].corr()

In [None]:
df[["Runtime", "IMDB Score"]].corr()

In [None]:
fig = px.scatter(data_frame=df, x="IMDB Score", y="Runtime")
fig.update_layout(autosize=False, width=800, height=600,)

fig.show()

The graph above shows no clear relationship between runtime and IMDB score.
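
By default, `.corr()` computes the Pearson correlation coefficient, a number between -1 and 1 where values near 0 indicate little linear relationship. The sketch below computes it by hand on a made-up table (the values are assumptions, not from the dataset) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical runtimes and scores, for illustration only.
toy = pd.DataFrame({
    "Runtime": [80, 95, 100, 110, 125],
    "IMDB Score": [6.1, 5.8, 7.0, 6.4, 6.6],
})

# Pearson r: sum of products of deviations, scaled by the deviations' magnitudes.
x = toy["Runtime"] - toy["Runtime"].mean()
y = toy["IMDB Score"] - toy["IMDB Score"].mean()
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

pandas_r = toy["Runtime"].corr(toy["IMDB Score"])

print(manual_r, pandas_r)
```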


### 5. Year

In [None]:
year = df.Year.value_counts()
year

In [None]:
year.index

In [None]:
year.values

In [None]:
fig = px.bar(data_frame=df, x=year.index, y=year.values, labels={"y":"Count of Movies per each Year", "x":"Year"})
fig.update_layout(xaxis={'categoryorder':'total descending'})

fig.show()

- The number of programs on Netflix increases each year.
- Since the data for 2021 is incomplete, the lower count for 2021 compared with 2020 is expected.

### 6. Month

In [None]:
month = df.Month.value_counts().sort_index()  # sort by month number so the counts line up with the Jan..Dec labels below
month

In [None]:
month.index

In [None]:
month.values

In [None]:
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

fig = px.bar(data_frame=df, x=months, y=month.values, labels={"y":"Count of Movies per each Month", "x":"Month"})

fig.show()

- October and April have the highest number of program releases.
- The fewest movies are released during the summer.
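
Pairing counts with a fixed list of month labels only works when the counts are in calendar order and all 12 months are present. The sketch below shows one way to guarantee both, using a made-up `Month` column (an assumption for illustration).

```python
import pandas as pd

# Hypothetical Month column: only January, April and October occur.
month_numbers = pd.Series([10, 4, 10, 1, 4, 10])

labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# reindex puts the counts in calendar order and fills absent months with 0.
counts = month_numbers.value_counts().reindex(range(1, 13), fill_value=0)
counts.index = labels

print(counts)
```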

### 7. Day

In [None]:
df.day_of_week.value_counts().sort_index()

In [None]:
day = df.day_of_week.value_counts().sort_index()  # Monday=0 ... Sunday=6, matching the labels below
day.values

In [None]:
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

fig = px.bar(data_frame=df, x=days, y=day.values, labels={"y":"Count of Movies per each Day", "x":"Day"})

fig.show()

- Friday has the maximum number of new releases.

- Saturday and Sunday have the lowest number of releases.

### Top 15 Ratings by Genre

In [None]:
df.groupby("Genre")["IMDB Score"].nlargest(10)

In [None]:
top_15_ratings_by_genre = df.groupby("Genre")["IMDB Score"].mean().nlargest(15)
top_15_ratings_by_genre

In [None]:
# top_10_ratings_by_genre = df.groupby('Genre')['IMDB Score'].mean().sort_values(ascending=False)[:10]
# top_10_ratings_by_genre

In [None]:
fig = px.bar(data_frame=top_15_ratings_by_genre, x=top_15_ratings_by_genre.index, y=top_15_ratings_by_genre.values, 
             labels={'y':'Average Rating Score', 'x':'Genre'})

fig.show()

The highest average rating belongs to the Animation-Christmas-Comedy-Adventure genre, followed by Musical/short and Concert Film.
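
With this many genres, an average may rest on very few films, so it can help to look at the group size next to the mean. The sketch below does this on a made-up table (the rows are assumptions, not from the dataset).

```python
import pandas as pd

# Hypothetical films, for illustration only.
toy = pd.DataFrame({
    "Genre": ["Documentary", "Documentary", "Documentary",
              "Concert Film", "Drama", "Drama"],
    "IMDB Score": [7.0, 6.5, 7.5, 8.4, 6.0, 6.4],
})

# Mean rating and number of films per genre, highest mean first.
by_genre = (
    toy.groupby("Genre")["IMDB Score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(by_genre)
```

In this toy table the top genre has a single film, so its average reflects one title rather than a trend.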

### Lowest 15 Ratings by Genre

In [None]:
lowest_15_ratings_by_genre = df.groupby("Genre")["IMDB Score"].mean().nsmallest(15)
lowest_15_ratings_by_genre

In [None]:
fig = px.bar(data_frame=lowest_15_ratings_by_genre, x=lowest_15_ratings_by_genre.index, y=lowest_15_ratings_by_genre.values, 
             labels={'y':'Average Rating Score', 'x':'Genre'})

fig.show()

The lowest-rated movies are from the Heist film/Thriller genre, followed by the Musical/Western/Fantasy and Horror Anthology genres.

### Top 10 Rating Movies

In [None]:
top_10_ratings = df[["IMDB Score", "Title", "Genre", "Year", "Language"]].sort_values(["IMDB Score"], ascending=False)[:10]
top_10_ratings

In [None]:
fig = px.scatter(top_10_ratings, y='Title', x='IMDB Score', hover_data=['Genre', 'Year', 'Language'], color='Genre', 
                 title = "Top 10 High Rated Programs")

fig.show()

More than half of the top-rated movies come from the Documentary genre.
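
Sorting and slicing is one way to get the top rows. `DataFrame.nlargest` is another that reads a little more directly. The sketch below compares the two on a made-up table (titles and scores are assumptions) with no tied scores.

```python
import pandas as pd

# Hypothetical titles and scores, for illustration only.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "IMDB Score": [6.2, 8.1, 7.4, 5.0, 9.0],
})

# Two ways to take the three highest-rated rows.
top_sorted = toy.sort_values("IMDB Score", ascending=False)[:3]
top_nlargest = toy.nlargest(3, "IMDB Score")

print(top_sorted["Title"].tolist())
print(top_nlargest["Title"].tolist())
```

`nsmallest` works the same way for the lowest-rated rows.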

### Lowest 10 Rated Movies

In [None]:
lowest_10_ratings = df[["IMDB Score", "Title", "Genre", "Year", "Language"]].sort_values(["IMDB Score"])[:10]
lowest_10_ratings

In [None]:
fig = px.scatter(lowest_10_ratings, y='Title', x='IMDB Score', hover_data=['Genre', 'Year', 'Language'], color='Genre', 
                 title="Lowest 10 Rated Programs")

fig.show()

Unlike the top 10 rated movies, the lowest 10 rated movies come from a variety of genres.

- Thanks to the dataset contributor for this data.

- It was a pleasure to share this detailed, beginner-friendly EDA with you.

- Have fun reading.