# Introduction

#### I love watching movies, so I decided to do some EDA on this dataset of Netflix original films. The analysis follows the questions that came to mind while looking at the data. The notebook also shows how to make awesome plots with plotly, which should help beginners.

![image](https://sm.pcmag.com/pcmag_au/review/n/netflix/netflix_38rt.jpg)

In [None]:
# For data handling
import numpy as np
import pandas as pd

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
df=pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv')
df.head()

### IMDB Score:
#### IMDb registered users can cast a vote (from 1 to 10) on every released title in the database. Individual votes are then aggregated and summarized as a single IMDb rating

In [None]:
# checking for correct data types
df.info()

#### `Premiere` holds dates, so I will convert it to datetime.

In [None]:
# Changing the format of 'Premiere' to datetime
df['Premiere']=pd.to_datetime(df['Premiere'], dayfirst=True)

# Adding day, month and year columns
df['Day']=df['Premiere'].apply(lambda x: x.day)

month_dict = {1:"January", 2:"February", 3:"March", 4:"April", 5:"May", 6:"June",
  7:"July", 8:"August", 9:"September", 10:"October", 11:"November", 12:"December"}
df['Month']=df['Premiere'].apply(lambda x: month_dict[x.month])

df['Year']=df['Premiere'].apply(lambda x: x.year)
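
#### The same day, month and year columns can also be derived with the vectorized `.dt` accessor instead of `apply`. A minimal sketch on a made-up two-row frame; the date strings and their format are assumptions for illustration, not taken from the dataset:

```python
import pandas as pd

# Illustrative rows only; the date format "%B %d, %Y" is an assumption
toy = pd.DataFrame({"Premiere": ["August 5, 2019", "October 30, 2020"]})
toy["Premiere"] = pd.to_datetime(toy["Premiere"], format="%B %d, %Y")

# The .dt accessor works on the whole column at once
toy["Day"] = toy["Premiere"].dt.day
toy["Month"] = toy["Premiere"].dt.month_name()  # English month names by default
toy["Year"] = toy["Premiere"].dt.year

print(toy)
```

#### With `month_name()` no hand-written month dictionary is needed.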

In [None]:
# Checking for missing values
df.isnull().sum()

#### No missing values found!

In [None]:
# Checking for duplicate data
df.duplicated().sum()

#### No duplicate entries found.

#### The data is clean and ready for EDA!

## Distribution of the numeric features

In [None]:
## numeric features
num_col=['Runtime', 'IMDB Score', 'Day', 'Year']

# Plotting
fig=make_subplots(cols=4, rows=1)

for i,col in enumerate(num_col):
    fig.add_trace(go.Box(y=df[col], name=col, hovertext=df['Title']),row=1, col=i+1)

# The code below would also work, but the features have very different ranges, so I plotted each one in its own subplot
#fig=px.box(df, y=num_col)

fig.show()

### Observations:
#### The runtime varies from 4 to 209 minutes.
#### The lowest IMDB score is 2.5 for 'Enter the Anime' and the highest is 9 for 'David Attenborough: A Life On Our Planet'.
#### The earliest premiere year is 2014, with only one movie: 'My Own Man'.
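
#### Extremes like these can also be looked up in code rather than read off the hover text. A minimal sketch on a small made-up frame with the same column names (the rows are illustrative, not the real data):

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the loaded `df`
toy = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Runtime": [95, 4, 209],
    "IMDB Score": [6.1, 2.5, 9.0],
})

# idxmin/idxmax return the index label of the row holding the extreme value
lowest = toy.loc[toy["IMDB Score"].idxmin(), "Title"]
highest = toy.loc[toy["IMDB Score"].idxmax(), "Title"]
runtime_range = (int(toy["Runtime"].min()), int(toy["Runtime"].max()))

print(lowest, highest, runtime_range)
```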

## Is there any relation between runtime and the IMDB score?

In [None]:
# Put the score on the y-axis so that its relation with runtime is visible
fig=px.scatter(df, x='Runtime', y='IMDB Score', color='IMDB Score', hover_name='Title', 
               color_continuous_scale=[(0, "skyblue"), (0.5, "blue"), (1, "red")])
fig.update_layout(title=dict(text='IMDB Score & Runtime', xanchor='center', yanchor='top', x=0.5))
fig.show()

### Observations
#### There is no visible relation between runtime and IMDB rating: both short and long movies get low as well as high ratings. To be sure, let's check numerically.

In [None]:
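# Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
df.corr(numeric_only=True)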

#### The correlation score:
#### 1  : Strongly and positively correlated (when one increases, the other also increases, and vice versa)
#### 0  : No correlation
#### -1 : Strongly and negatively correlated (when one increases, the other decreases, and vice versa)

#### The correlation between runtime and IMDB score is -0.04089, which is very close to 0, so there is essentially no linear correlation between them. A short movie can have a high IMDB rating and a long movie can have a low one.
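
#### To make the three reference values concrete, here is a small sketch with NumPy on made-up numbers (not the Netflix data). It computes the Pearson correlation for a perfectly increasing, a perfectly decreasing and an unrelated pair of variables:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises linearly with x -> correlation of 1
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls linearly with x -> correlation of -1
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# y has no linear trend in x -> correlation of 0
r_zero = np.corrcoef(x, np.array([1.0, -1.0, 0.0, -1.0, 1.0]))[0, 1]

print(round(r_pos, 3), round(r_neg, 3), round(r_zero, 3))
```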

## Which genres have the highest IMDB ratings?

In [None]:
# Let's first check how many genres there are
df['Genre'].nunique()

#### There are too many genres to plot them all in a readable way, so I will plot only the top 5 genres by average IMDB score.

In [None]:
# Average IMDB score per genre, highest first, keeping the top 5
df_temp=df.groupby('Genre')['IMDB Score'].mean().sort_values(ascending=False).reset_index().iloc[:5,:]

fig=px.pie(df_temp, names='Genre', values='IMDB Score',hole=0.5)
fig.update_layout(title=dict(text='Top 5 rated Genre', xanchor='center', yanchor='top', x=0.4))
fig.show()

### Observations:
#### The top-rated genres are mostly Animation, Comedy, Adventure and Musical movies.
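
#### A ranking by average score can be dominated by genres that contain a single film. As a hedged sketch (made-up rows, and a hypothetical `min_films` cutoff that is not part of the original analysis), the film count can be reported next to the mean:

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "Genre": ["Comedy", "Comedy", "Comedy", "Drama", "Drama", "Musical"],
    "IMDB Score": [6.0, 7.0, 6.5, 7.0, 8.0, 9.0],
})

# Mean score and number of films per genre, best average first
summary = (
    toy.groupby("Genre")["IMDB Score"]
    .agg(mean_score="mean", n_films="count")
    .sort_values("mean_score", ascending=False)
)

min_films = 2  # hypothetical cutoff, choose what suits the data
reliable = summary[summary["n_films"] >= min_films]

print(summary)
print(reliable)
```

#### Here 'Musical' tops the raw ranking with only one film and drops out once the cutoff is applied.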

## Which are the top 10 highest rated movies?

In [None]:
df_temp=df.sort_values(by='IMDB Score', ascending=False).reset_index().iloc[:10,:]
fig=px.bar(df_temp, x='Title', y='IMDB Score', hover_name='Genre')
fig.update_layout(title=dict(text='Top 10 rated movies', xanchor='center', yanchor='top', x=0.5))
fig.show()

### Observations
#### 6 out of 10 are documentaries. That's interesting, as Documentary was not among the top 5 genres; it suggests that some documentaries have very low IMDB scores.

#### So, let's look at the distribution of IMDB scores for documentaries to check this theory.

In [None]:
fig=px.box(df[df['Genre']=='Documentary'], x='IMDB Score', hover_name='Title')
fig.update_layout(title=dict(text='IMDB Score distribution in Documentary', xanchor='center', yanchor='top', x=0.5))
fig.show()

#### As predicted, some documentaries, such as 'Enter the Anime', 'Searching for Sheela' and 'After the Raid', have very low IMDB ratings, which pull the Documentary genre out of the top 5.

## Which languages have the highest-rated movies?

In [None]:
# Let's first check how many languages there are
df['Language'].nunique()

#### There are too many languages to plot them all (the same problem as with genres), so I will plot the top 5 languages by average IMDB score.

In [None]:
# Average IMDB score per language, highest first, keeping the top 5
df_temp=df.groupby('Language')['IMDB Score'].mean().sort_values(ascending=False).reset_index().iloc[:5,:]

fig=px.pie(df_temp, names='Language', values='IMDB Score')
fig.update_layout(title=dict(text='Top 5 rated Language movies', xanchor='center', yanchor='top', x=0.45))
fig.show()

## Is there any relation between Premiere month and IMDB rating?

In [None]:
monthlist=['January', 'February', 'March', 'April', 'May', 'June', 'July', 
           'August', 'September', 'October', 'November', 'December']

yearlist=list(np.sort(df['Year'].unique()))

# If month list is not given in 'category_orders', then the month names will not be in order
fig=px.box(df, y='Month', x='IMDB Score', category_orders={'Month':monthlist}, hover_name='Title')
fig.update_layout(title=dict(text='Premiere month and IMDB Score', xanchor='center', yanchor='top', x=0.5))
fig.show()

### Observation
#### All months have roughly the same median, between 6 and 7, so there is no visible relation between the premiere month and the IMDB score.
#### Let's see which month has the most premieres.

In [None]:
df_temp=df.groupby(['Month'])['Title'].count().reset_index()

# If month list is not given in 'category_orders', then the month names will not be in order
fig=px.bar(df_temp, y='Month', x='Title', category_orders={'Month':monthlist})
fig.update_layout(title=dict(text='Month and Number of Premieres', xanchor='center', yanchor='top', x=0.5), 
                 xaxis=dict(title='Number of Premieres'))
fig.show()
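
#### An alternative to passing `category_orders` to every plot is to store the month as an ordered categorical, so that grouping already returns the months in calendar order. A minimal sketch on made-up rows (an optional variation, not what the cells above do):

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Illustrative rows only
toy = pd.DataFrame({"Month": ["October", "April", "October", "January"]})

# An ordered categorical sorts by calendar position, not alphabetically
toy["Month"] = pd.Categorical(toy["Month"], categories=month_order, ordered=True)

# observed=True keeps only the months that actually occur in the data
counts = toy.groupby("Month", observed=True).size()

print(counts)
```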

### Observation
#### Most premieres fall in October, followed by April. I think this is due to holidays, but I am not sure. What do you think? Please write in the comments.

## How many good movies are released over the years?

In [None]:
# 'Good' movies are those whose IMDB score is more than 7. I chose this threshold based on my own search, but you can pick any threshold you like!
threshold=7
df['Best']=df['IMDB Score'].apply(lambda x: 1 if x>threshold else 0)

In [None]:
fig=px.histogram(df, x='Year', color='Best', barmode='group')
fig.update_layout(title=dict(text='Good movies over the years', xanchor='center', yanchor='top', x=0.5), 
                 xaxis=dict(title='Year'), yaxis=dict(title='Number of movies'))
fig.show()

### Observation
#### The proportion of good movies is decreasing over the years. Hope we get some good directors, writers and actors soon!
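
#### The grouped histogram shows counts, while the observation is about a proportion. Because `Best` is 0 or 1, its mean per year is exactly the share of good movies. A minimal sketch on made-up rows (not the real data):

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "IMDB Score": [7.5, 8.0, 6.0, 5.5, 7.2, 6.0, 6.5, 5.0],
})

threshold = 7
toy["Best"] = (toy["IMDB Score"] > threshold).astype(int)

# Mean of a 0/1 flag = fraction of rows where the flag is 1
share = toy.groupby("Year")["Best"].mean()

print(share)
```

#### In this toy frame the share drops from 0.5 in 2019 to 0.25 in 2020.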

## Which languages have the longest movies?

In [None]:
df_temp=df.groupby(['Language'])['Runtime'].mean().reset_index()

fig=px.bar(df_temp, x='Language', y='Runtime')
fig.update_layout(title=dict(text='Longest movies are in which language?', xanchor='center', yanchor='top', x=0.5), 
                 yaxis=dict(title='Average Runtime'))
fig.show()

### Observation
#### Khmer/English/French and English/Akan movies have the longest average runtime, while Georgian and English/Hindi movies have the shortest.
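
#### With many languages on the x-axis, the extremes are easier to read from the averages themselves than from the bar chart. A minimal sketch using `nlargest` and `nsmallest` on made-up rows (not the real data):

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "Language": ["English", "English", "Hindi", "Hindi", "Georgian"],
    "Runtime": [90, 110, 120, 140, 20],
})

avg_runtime = toy.groupby("Language")["Runtime"].mean()

# Languages with the longest and shortest average runtime
longest = avg_runtime.nlargest(1)
shortest = avg_runtime.nsmallest(1)

print(longest)
print(shortest)
```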

### I like science fiction movies, so I will explore them. You can replace science fiction with your favorite genre

In [None]:
# Replace 'Science fiction' with your favorite genre
favorite_genre='Science fiction'

# Filtering the favorite genre
df_genre=df[df['Genre'].str.contains(favorite_genre)].reset_index(drop=True)
df_genre.head()
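
#### `str.contains` is case-sensitive and treats the pattern as a regular expression by default. As a hedged sketch on made-up rows (the lowercase spelling below is hypothetical and may not occur in the dataset), `case=False` and `regex=False` make the filter more forgiving:

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Genre": ["Science fiction", "Science fiction/Drama",
              "Comedy", "science fiction/Thriller"],
})

favorite_genre = "Science fiction"

# Literal, case-sensitive substring match
exact_case = toy[toy["Genre"].str.contains(favorite_genre, regex=False)]

# Literal, case-insensitive substring match
any_case = toy[toy["Genre"].str.contains(favorite_genre, case=False, regex=False)]

print(exact_case["Title"].tolist(), any_case["Title"].tolist())
```

#### `regex=False` also protects against genre names that contain regex characters such as `+` or `(`.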

## Distribution of Science Fiction movies over the years

In [None]:
# For counting the movies over the years
df_genre['Count']=1

fig=px.sunburst(df_genre, path=['Year','Month','Title'], values='Count')
fig.update_layout(title=dict(text=f'Number of {favorite_genre} movies over the years',
                             xanchor='center', yanchor='top', x=0.5), yaxis=dict(title='Movies count'))
fig.show()

## Distribution of Science Fiction movies over the Languages

In [None]:
# Sum (not average) the helper column so that each bar reflects the number of movies
df_hist=df_genre.groupby(['Year','Language'])['Count'].sum().reset_index()
fig=px.histogram(df_hist, x='Language', y='Count', color='Year')
fig.update_layout(title=dict(text=f'Distribution of {favorite_genre} movies over languages', xanchor='center', yanchor='top', x=0.5),
                 xaxis=dict(title='Language'), yaxis=dict(title='Movies count'))
fig.show()
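
#### Counting movies per year and language does not strictly need a helper `Count` column: `groupby(...).size()` counts the rows in each group. A minimal sketch on made-up rows (not the real data):

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020, 2020],
    "Language": ["English", "English", "English", "Hindi", "English"],
})

# size() counts the rows in each (Year, Language) group
counts = toy.groupby(["Year", "Language"]).size().reset_index(name="Count")

print(counts)
```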

## How are all the categories related?

In [None]:
# Plotting every movie would clutter the figure, so I only consider the top 200 movies by IMDB score
df_temp=df.sort_values(by='IMDB Score', ascending=False).reset_index(drop=True).iloc[:200,:]
fig=px.parallel_categories(df_temp,  dimensions=['Language', 'Genre', 'Best'],
                          color='IMDB Score',color_continuous_scale=[(0,'blue'),(0.5,'yellow'),(1,'red')])
fig.update_layout(title=dict(text='Parallel Categories Plot', xanchor='center', yanchor='top', x=0.5))
fig.show()

## If you like, please **upvote**