- Recently I received a lot of feedback from friends who have just changed, or are about to change, their careers towards Data Analysis, Data Science and Machine Learning. They pointed to a lack of material between the start of the analysis journey and the advanced techniques.

- They are looking for detailed but beginner-friendly Exploratory Data Analysis examples that are not overly complicated (with different regression or normalization techniques, etc.) and that show them how to start and, most importantly, how to read descriptive statistics and graphs.

- After this feedback, I decided to create a series of EDAs on different datasets, kept deliberately simple for people taking their first steps in the DS/ML journey.

### This notebook is part of the 9 Beginner Friendly EDAs series. I would be more than happy if these EDAs are helpful to anyone.



### **INTRO**

- In this study, we are going to perform Exploratory Data Analysis (EDA) on the Netflix original films dataset.
- The study aims to be beginner friendly and to explain each step along the way as fully as possible.
- The dataset contains 584 Netflix original films from different genres.
- Each film has a language, release date, runtime and IMDB score.

- First, let's import the required libraries.
- We will use Plotly's interactive environment for visualization.

In [None]:
import pandas as pd
import numpy as np


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [None]:
df= pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv')

In [None]:
df.head()

In [None]:
df.shape

- We have 584 films and 6 attributes

In [None]:
df.isnull().sum()

- It is very hard to find data this clean in real life.
- No missing values. Hurray!
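
To see what the missing-value check reports when data is not this clean, here is a minimal sketch on a toy frame. The titles and numbers are made up for illustration and are not taken from the Netflix dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two deliberately missing entries.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Runtime": [90, np.nan, 120, 45],
    "IMDB Score": [6.1, 7.0, np.nan, 5.5],
})

# Count of missing values per column, the same call used on df above.
missing_per_column = toy.isnull().sum()

# Share of missing values per column, as a percentage.
missing_share = toy.isnull().mean().mul(100)

print(missing_per_column)
print(missing_share)
```

A column with a non-zero count would need a decision (drop, fill, or investigate) before moving on.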

In [None]:
df.info()

- We need to adjust the Premiere feature: it should be a datetime object.
- Other than that, everything seems OK.

In [None]:
df['date'] = pd.to_datetime(df['Premiere'])
df['date']

- OK, that is much better.
- Let's make use of it and derive new columns from it, such as year, month and day of the week.

In [None]:
df['year_month']= df['date'].dt.strftime('%Y-%m')
df['year'] = df['date'].dt.year
df['month']= df['date'].dt.month
df['day_of_week']=df['date'].dt.dayofweek

- Now we are ready to move on to the analysis part.
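
The date columns created above can be checked on a tiny example. This is only a sketch: the two date strings are made up, and the explicit `format` argument is an assumption about how the `Premiere` strings look, so drop it if your data differs.

```python
import pandas as pd

# Two hypothetical premiere strings in a "Month day, year" layout.
toy = pd.DataFrame({"Premiere": ["August 5, 2019", "October 30, 2020"]})

# Parse the text into datetime values (format is an assumption, see above).
toy["date"] = pd.to_datetime(toy["Premiere"], format="%B %d, %Y")

# Derive the same helper columns as in the notebook.
toy["year_month"] = toy["date"].dt.strftime("%Y-%m")
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["day_of_week"] = toy["date"].dt.dayofweek  # Monday=0 ... Sunday=6

print(toy)
```

Note that `dayofweek` is zero-based and starts on Monday, which matters later when the numbers are mapped to day names.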

### Analysis Part

#### **Genre**

In [None]:
df['Genre'].nunique()

In [None]:
df['Genre'].value_counts(normalize=True)

- We have 115 different genres.
- Let's look at the top 20 genres.

In [None]:
genre = df['Genre'].value_counts()[:20]
genre

- 27.2% of the movies are in the Documentary genre, followed by Drama with 13%.
- The remaining movies are spread across many different genres, many of them with a share of around 1% each.
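
The percentages quoted above come from `value_counts(normalize=True)`. As a rough sketch of how to turn those fractions into readable percentages, here is a toy series with made-up genre labels:

```python
import pandas as pd

# Hypothetical genre column: 5 documentaries, 3 dramas, 2 comedies.
genres = pd.Series(
    ["Documentary"] * 5 + ["Drama"] * 3 + ["Comedy"] * 2, name="Genre"
)

# normalize=True returns fractions that sum to 1; scale to percent and round.
share = genres.value_counts(normalize=True).mul(100).round(1)

print(share)
```

Reading the table this way makes it easy to say "X% of the movies are in genre Y" without doing the division by hand.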

In [None]:
fig = px.bar(genre, x= genre.index, y=genre.values, labels={'y':'Number of Movies from the Genre', 'index':'Genres'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### Languages

In [None]:
df['Language'].nunique()

In [None]:
top_10_languages_used= df['Language'].value_counts()[:10]
top_10_languages_used

In [None]:
fig = px.bar(top_10_languages_used, x= top_10_languages_used.index, y=top_10_languages_used.values, labels={'y':'Count', 'index':'Language'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- English is the most used language in the programs, followed by Hindi and Spanish.

#### **Runtime**

In [None]:
df['Runtime'].describe()

- The typical runtime of a Netflix program is roughly 93 to 97 minutes.
- Based on the descriptive statistics, we can expect outliers on both the maximum side and the minimum side.
- Since the mean is lower than the median, we can expect a left-skewed distribution: a tail of unusually short runtimes pulls the mean down.
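
The "mean lower than median" reading is a rule of thumb rather than a guarantee, so it can help to also look at the skewness number. A minimal sketch with made-up runtimes (two very short values and a cluster around 100 minutes):

```python
import pandas as pd

# Hypothetical runtimes in minutes; the two small values form a left tail.
runtimes = pd.Series([4, 30, 90, 95, 97, 100, 105, 110], name="Runtime")

mean_runtime = runtimes.mean()
median_runtime = runtimes.median()

# Sample skewness: negative values suggest a longer tail on the left.
skewness = runtimes.skew()

print(f"mean={mean_runtime:.1f}, median={median_runtime:.1f}, skew={skewness:.2f}")
```

On the real data, the same three calls on `df['Runtime']` would show whether the histogram should lean the way the bullets predict.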

In [None]:
fig = px.histogram(df, x= 'Runtime', title='Runtime of the Programs in Netflix')

fig.show()

In [None]:
fig = px.box(df, x='Runtime', hover_data=['Title', 'Genre'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- As expected, the distribution is left-skewed, with outliers on both sides but many more on the left (minimum) side.

- The movie with the maximum runtime is 'The Irishman'. Agreed, it was quite a long movie, but no complaints: I loved seeing Al Pacino and Robert De Niro in the same movie.

- The minimum runtime belongs to 'Sol Levante', a 4-minute animation.
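
Besides hovering over the box plot, the longest and shortest titles can be looked up directly. This is a small sketch with hypothetical titles and runtimes, not the real rows:

```python
import pandas as pd

# Toy data (made-up titles and runtimes).
toy = pd.DataFrame({
    "Title": ["Short Clip", "Feature A", "Epic B"],
    "Runtime": [4, 97, 200],
})

# idxmax / idxmin return the row label of the extreme value;
# .loc then pulls the whole row for that label.
longest = toy.loc[toy["Runtime"].idxmax()]
shortest = toy.loc[toy["Runtime"].idxmin()]

print(longest["Title"], longest["Runtime"])
print(shortest["Title"], shortest["Runtime"])
```

If several films tie for the extreme value, `idxmax` and `idxmin` return only the first one.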

#### IMDB Score

In [None]:
df['IMDB Score'].describe()

- Before going further, I have to admit that I am a regular follower of the IMDB website. Most of the time, I agree with their rating scores.

- Programs on Netflix have an average rating of around 6.3. The maximum is 9 and the minimum is 2.5.

- The mean and median are close to each other. Since the median is higher than the mean, we can expect a left-skewed distribution with several outliers on the left (minimum) side.

In [None]:
fig = px.histogram(df, x= 'IMDB Score', title='IMDB Score of the Programs in Netflix')

fig.show()

In [None]:
fig = px.box(df, x='IMDB Score', hover_data=['Title', 'Genre'])
fig.update_traces(quartilemethod="inclusive")
fig.show()

- **David Attenborough** and his documentaries: I love him. He is a true hero and an excellent documentary producer and presenter. It is no surprise to me that his documentary has the maximum score of 9 in the list.

- The minimum rating belongs to 'Enter the Anime'.

- Interestingly, both the maximum- and minimum-rated programs are from the Documentary genre.

#### Correlation Between Runtime and IMDB Ratings

In [None]:
df[['IMDB Score','Runtime']].corr()

In [None]:
fig = px.scatter(df, x='IMDB Score', y='Runtime')
fig.show()

- Neither the correlation coefficient nor the scatter plot shows a meaningful linear relationship between runtime and IMDB score.
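
For readers new to `.corr()`: by default it computes the Pearson correlation, which ranges from -1 to 1, and values near 0 mean little linear relationship. A minimal sketch on made-up numbers, cross-checked with NumPy:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical scores and runtimes with no clear pattern).
toy = pd.DataFrame({
    "IMDB Score": [5.0, 6.0, 7.0, 8.0],
    "Runtime": [100, 90, 110, 95],
})

# Pearson correlation between the two columns.
r_pandas = toy["IMDB Score"].corr(toy["Runtime"])

# The same quantity from NumPy's correlation matrix, as a sanity check.
r_numpy = np.corrcoef(toy["IMDB Score"], toy["Runtime"])[0, 1]

print(round(r_pandas, 3), round(r_numpy, 3))
```

A value close to 0 only rules out a straight-line relationship; the scatter plot is still worth checking for other patterns.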

#### **Year**

In [None]:
Year = df['year'].value_counts()
Year

In [None]:
fig = px.bar(Year, x= Year.index, y=Year.values, labels={'y':'Count of Movies in Each Year', 'index':'Year'})
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- As one might expect, the number of Netflix programs increases each year.
- Since we don't have full data for 2021, the drop from 2020 to 2021 is expected.

#### **Month**

In [None]:
# Sort by month number (1-12) so the counts line up with the Jan-Dec labels used in the plot
Month= df['month'].value_counts().sort_index()
Month

In [None]:
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

fig = px.bar(Month, x= months, y=Month.values, labels={'y':'Count of Movies in Each Month', 'x':'Month'})
fig.show()

- The number of program releases differs by month. October and April have the highest number of releases.

- During the summer (Jun-Aug), the fewest movies are released.
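
Because the month names are attached to the counts by position, the counts need to be in calendar order before plotting. Here is a sketch of one way to guarantee that, using made-up month numbers; `reindex` also fills in months that never occur:

```python
import pandas as pd

# Hypothetical month numbers for six releases.
month_numbers = pd.Series([10, 4, 10, 12, 4, 10], name="month")

month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Count releases per month, then force the index to run 1..12 in order,
# using 0 for months with no releases.
counts = month_numbers.value_counts().reindex(range(1, 13), fill_value=0)

# Now it is safe to swap the numbers for names, position by position.
counts.index = month_labels

print(counts)
```

With the index already holding the month names, the series can be passed to a bar chart without supplying a separate label list.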

#### **Day**

In [None]:
# Sort by weekday number (Monday=0 ... Sunday=6) so the counts line up with the Mon-Sun labels
days= df['day_of_week'].value_counts().sort_index()
days

In [None]:
day = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

fig = px.bar(days, x= day, y=days.values, labels={'y':'Count of Movies in Each Day', 'x':'Day'})
fig.show()

- Friday has the maximum number of new releases.

- Saturday and Sunday have the lowest number of releases.
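
The weekday versus weekend split can also be expressed as a single number. A minimal sketch with four made-up dates, relying on `dayofweek` returning 5 for Saturday and 6 for Sunday:

```python
import pandas as pd

# Hypothetical premiere dates: Friday, Saturday, Monday, Wednesday.
dates = pd.to_datetime(
    pd.Series(["2020-10-30", "2020-10-31", "2019-08-05", "2019-08-07"])
)

# True for Saturday (5) and Sunday (6), False otherwise.
is_weekend = dates.dt.dayofweek >= 5

# The mean of a boolean series is the share of True values.
weekend_share = is_weekend.mean()

print(f"{weekend_share:.0%} of releases fall on a weekend")
```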

#### **Top 10 Ratings by Genre**

In [None]:
top_10_ratings_by_genre = df.groupby('Genre')['IMDB Score'].mean().sort_values(ascending=False)[:10]
top_10_ratings_by_genre

#### Top 10 Rating Genres

In [None]:
fig = px.bar(top_10_ratings_by_genre, x= top_10_ratings_by_genre.index, y=top_10_ratings_by_genre.values, labels={'y':'Average Rating Score', 'x':'Genre'})
fig.show()

- The top average rating belongs to the Animation-Christmas-Comedy-Adventure genre, followed by Musical/short and Concert Film.
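
One thing to keep in mind when ranking genres by average score: a genre with only one or two films can reach the top by chance. The sketch below, on made-up data, shows the mean next to the film count and applies a hypothetical minimum-count filter (`min_films` is an illustrative threshold, not something used in this notebook):

```python
import pandas as pd

# Toy data: "Niche/Mix" has a single, highly rated film.
toy = pd.DataFrame({
    "Genre": ["Documentary"] * 3 + ["Drama"] * 2 + ["Niche/Mix"],
    "IMDB Score": [7.0, 8.0, 6.0, 6.5, 5.5, 9.0],
})

# Average score and number of films per genre, best average first.
summary = (
    toy.groupby("Genre")["IMDB Score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Hypothetical threshold: keep only genres with at least this many films.
min_films = 2
reliable = summary[summary["count"] >= min_films]

print(summary)
print(reliable)
```

In the unfiltered table the single-film genre leads; after filtering, the ranking reflects genres with more than one data point.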

#### **Lowest 10 Ratings by Genre**

In [None]:
bottom_10_ratings_by_genre = df.groupby('Genre')['IMDB Score'].mean().sort_values()[:10]
bottom_10_ratings_by_genre

#### Bottom 10 Ratings

In [None]:
fig = px.bar(bottom_10_ratings_by_genre, x= bottom_10_ratings_by_genre.index, y=bottom_10_ratings_by_genre.values, labels={'y':'Average Rating Score', 'x':'Genre'})
fig.show()

- The lowest-rated genres are Heist film/Thriller, Musical/Western/Fantasy and Horror anthology.

#### **Top 20 High Rating Movies** 

In [None]:
top_20 = df[['IMDB Score','Title','Genre','year','Language']].sort_values(['IMDB Score'], ascending=False)[:20]
top_20

In [None]:
fig = px.scatter(top_20, y='Title', x='IMDB Score',
                 hover_data=['Genre', 'year', 'Language'], color='Genre',
                 title="Top 20 High Rated Programs")
fig.show()

- 16 out of the 20 top-rated movies come from the Documentary genre.
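
Rather than counting points on the chart, the genre tally of the top-rated films can be computed. A small sketch with made-up titles and scores, using `nlargest` as a shortcut for sorting and slicing:

```python
import pandas as pd

# Toy data (hypothetical titles, genres and scores).
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E", "F"],
    "Genre": ["Documentary", "Drama", "Documentary",
              "Comedy", "Documentary", "Drama"],
    "IMDB Score": [8.5, 6.0, 8.0, 5.0, 7.5, 7.0],
})

# The three highest-rated rows, highest first.
top_3 = toy.nlargest(3, "IMDB Score")

# How many of those rows belong to each genre.
genre_counts = top_3["Genre"].value_counts()

print(top_3)
print(genre_counts)
```

On the real data, `top_20['Genre'].value_counts()` would give the same kind of tally for the table above.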

#### **20 Lowest Rated Movies** 

In [None]:
bottom_20 = df[['IMDB Score','Title','Genre','year','Language']].sort_values(['IMDB Score'])[:20]
bottom_20

In [None]:
fig = px.scatter(bottom_20, y='Title', x='IMDB Score',
                 hover_data=['Genre', 'year', 'Language'], color='Genre',
                 title="20 Lowest Rated Programs")
fig.show()

- The lowest-rated movies come from many different genres.

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!

* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Top Games on Google Play Store](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-games)
* [Hollywood Top Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-movies)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)
* [London Bike Sharing - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)


- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner-friendly EDA with you. Thanks for your time.

- All the best 