# NETFLIX || EXPLORATORY DATA ANALYSIS 

## Data Description:

This Netflix dataset contains information about the TV shows and movies available on Netflix from 2008 to 2021.
Netflix is a streaming platform that has grown rapidly around the world and is one of the best-known services of its kind.

![netflix%20img.png](attachment:netflix%20img.png)

## STEP 1: IMPORT LIBRARIES

In [2]:
# IMPORT THE LIBRARIES NEEDED TO PERFORM EDA ON THE GIVEN DATASET

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set()

### Description of the above commands

* pandas: 
  A powerful data manipulation library for working with structured data.
  
* numpy: 
  A library for numerical operations and handling arrays.
  
* matplotlib.pyplot: 
  A widely used plotting library for creating static, animated, and interactive visualizations.
  
* seaborn: 
  A statistical data visualization library based on Matplotlib, providing a high-level interface for drawing attractive and informative statistical graphics.
  
* warnings: 
  The Python warnings module is used to control warning messages that may be issued by the interpreter or other Python modules.
  
* warnings.filterwarnings('ignore'): 
  Suppresses every warning raised during the rest of the session. This keeps the output tidy, but it can also hide useful deprecation notices.

* sns.set(): 
  Applies seaborn's default theme to all subsequent plots.
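
If hiding every warning for the whole session feels too broad, the standard library also allows warnings to be silenced for a single block of code. The sketch below is only an illustration: `noisy()` is a hypothetical helper standing in for any library call that emits a warning.

```python
import warnings

def noisy():
    # Hypothetical helper: stands in for a library call that emits a warning
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Warnings are ignored only inside this block; the previous
# filter settings are restored when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

print(result)
```

This keeps the notebook output clean for a known noisy call while leaving warnings visible everywhere else.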

## STEP 2: READING DATASET

In [5]:
df = pd.read_csv('/Users/suparthjain/Desktop/EDA-project/Dataset/netflix_titles_2021.csv')

In [6]:
df.head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...


### Description of Netflix Dataset

The dataset consists of the following columns:

* show_id: The identifier of each title.

* type: Whether the title is a Movie or a TV Show (the only two values).

* title: The title of the Movie or TV Show.

* director: The director(s) of the Movie or TV Show.

* cast: The actors who appear in the Movie or TV Show.

* country: The country or countries associated with the title, as a comma-separated list.

* date_added: The date on which the Movie or TV Show was added to Netflix.

* release_year: The year in which the Movie or TV Show was released.

* rating: The audience rating of the Movie or TV Show (e.g. PG-13, TV-14, TV-MA), which indicates the age group it is suitable for.

* duration: The length of the title, in minutes for Movies and in seasons for TV Shows.

* listed_in: The genres under which the Movie or TV Show is listed.

* description: A short summary of the Movie or TV Show.
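
A quick way to confirm what a column holds is to count its distinct values. The sketch below uses a small hand-made frame (`sample`, an illustrative name) with values copied from the preview above, so it does not depend on the CSV file.

```python
import pandas as pd

# Toy rows shaped like the dataset (values taken from the preview above)
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})

# How often each value of `type` occurs
type_counts = sample["type"].value_counts()

# Number of distinct values in every column
unique_per_column = sample.nunique()

print(type_counts)
print(unique_per_column)
```

On the full dataset, the same two calls would be applied to `df` instead of `sample`.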

## STEP 2 (CONTINUED): INSPECTING DATASET

In [7]:
#Calculating the number of rows and columns in dataset
df.shape

(8807, 12)

#### The given dataset has 12 columns and 8807 records

In [8]:
#Inspecting the datatype of different columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


#### From the above result, we can see that some columns have a non-null count below 8807, which means that these columns contain missing values.

In [9]:
# Summary statistics for the numeric columns (only release_year at this point)
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0
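
By default, `describe()` summarises only numeric columns, which is why only release_year appears above. As a hedged sketch on a toy frame (`sample` is an illustrative name, not part of the notebook), the text columns can be summarised by excluding the numeric ones:

```python
import pandas as pd

# Toy data; the column names mirror the dataset, the rows are illustrative
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie", "Movie"],
    "rating": ["PG-13", "TV-MA", "TV-MA", "PG", "TV-MA"],
    "release_year": [2020, 2021, 2021, 2021, 1993],
})

# Excluding numeric columns leaves the text columns, which are
# summarised by count, number of unique values, most frequent value (top)
# and how often that value occurs (freq)
text_summary = sample.describe(exclude=["number"])

print(text_summary)
```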


## STEP 3: DATA CLEANING

In [10]:
#CHECKING NULL VALUES COUNT IN EVERY COLUMN
df.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

#### From the above result, we can conclude that the director, cast, country, date_added, rating and duration columns have null values in some of the records
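
The subsections that follow compute the share of missing values one column at a time. The same figures can be obtained for every column in one expression, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on toy data (`sample` is an illustrative name):

```python
import numpy as np
import pandas as pd

# Toy data with some missing entries
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", np.nan, "Julien Leclercq", np.nan],
    "country": ["United States", "South Africa", np.nan, "India"],
    "title": ["A", "B", "C", "D"],
})

# isnull() yields booleans; their column-wise mean is the missing fraction
missing_pct = sample.isnull().mean().mul(100).round(2)

print(missing_pct)
```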

### Handling missing values of director column

In [13]:
percentage_of_missing_values_in_director_column =  round((df['director'].isnull().sum()/df.shape[0])*100, 2)

print(f"Director column has {percentage_of_missing_values_in_director_column} % null values")

Director column has 29.91 % null values


#### The director column has 29.91% missing values, which is a large share. Filling in the actual directors for these records is not practical, so we can either remove the records or replace the missing values with the placeholder 'N/A'.

#### Removing 2634 records would shrink the dataset considerably. To avoid this, let's fill these records with the placeholder 'N/A'

In [15]:
#Adding N/A in place of missing values in director column
df['director'] = df['director'].fillna('N/A')

In [16]:
df['director'].isnull().sum()

0

### Handling missing values of cast column

In [18]:
percentage_of_missing_values_in_cast_column =  round((df['cast'].isnull().sum()/df.shape[0])*100, 2)

print(f"Cast column has {percentage_of_missing_values_in_cast_column} % null values")

Cast column has 9.37 % null values


#### The cast column has 9.37% missing values, which is a significant share. Filling in the actual cast for these records is not practical, so we can either remove the records or replace the missing values with the placeholder 'N/A'.

#### Removing 825 records would noticeably reduce the size of the dataset. To avoid this, let's fill these records with the placeholder 'N/A'

In [19]:
#Adding N/A in place of missing values in cast column
df['cast'] = df['cast'].fillna('N/A')

In [20]:
df['cast'].isnull().sum()

0

### Handling missing values of country column

In [21]:
percentage_of_missing_values_in_country_column =  round((df['country'].isnull().sum()/df.shape[0])*100, 2)

print(f"Country column has {percentage_of_missing_values_in_country_column} % null values")

Country column has 9.44 % null values


#### The country column has 9.44% missing values, which is a significant share. Filling in the actual country for these records is not practical, so we can either remove the records or replace the missing values with the placeholder 'N/A'.

#### Removing 831 records would noticeably reduce the size of the dataset. To avoid this, let's fill these records with the placeholder 'N/A'

In [22]:
#Adding N/A in place of missing values in country column
df['country'] = df['country'].fillna('N/A')

In [23]:
df['country'].isnull().sum()

0
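
The three `fillna` steps above could also be written as a single call, since `fillna` accepts a dictionary that maps column names to fill values. This is only a sketch on toy data (`sample` and `filled` are illustrative names); columns not listed in the dictionary are left untouched.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in several columns
sample = pd.DataFrame({
    "director": ["Mike Flanagan", np.nan],
    "cast": [np.nan, "Kate Siegel"],
    "country": ["United States", np.nan],
    "rating": ["TV-MA", np.nan],
})

# One placeholder per column; `rating` is not listed, so it keeps its gap
filled = sample.fillna({"director": "N/A", "cast": "N/A", "country": "N/A"})

print(filled.isnull().sum())
```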

### Handling missing values of date_added column

In [24]:
percentage_of_missing_values_in_date_added_column =  round((df['date_added'].isnull().sum()/df.shape[0])*100, 2)

print(f"Date added column has {percentage_of_missing_values_in_date_added_column} % null values")

Date added column has 0.11 % null values


#### The date_added column has 0.11% missing values, which is very few. We can either remove these records or fill in the dates manually if they can be found elsewhere. Since so few records are affected, we remove them.

In [25]:
#Removing the records which have missing values in date_added column
df = df.dropna(subset=['date_added'])

In [26]:
df.shape

(8797, 12)

In [27]:
df['date_added'].isnull().sum()

0

### Handling missing values of rating column

In [28]:
percentage_of_missing_values_in_rating_column =  round((df['rating'].isnull().sum()/df.shape[0])*100, 2)

print(f"Rating column has {percentage_of_missing_values_in_rating_column} % null values")

Rating column has 0.05 % null values


#### The rating column has 0.05% missing values, which is very few. We can either remove these records or fill in the ratings manually if they can be found elsewhere. Since so few records are affected, we remove them.

In [29]:
#Removing the records which have missing values in rating column
df = df.dropna(subset=['rating'])

In [30]:
df.shape

(8793, 12)

In [31]:
df['rating'].isnull().sum()

0

### Handling missing values of duration column

In [32]:
percentage_of_missing_values_in_duration_column =  round((df['duration'].isnull().sum()/df.shape[0])*100, 2)

print(f"Duration column has {percentage_of_missing_values_in_duration_column} % null values")

Duration column has 0.03 % null values


#### The duration column has 0.03% missing values, which is very few. We can either remove these records or fill in the durations manually if they can be found elsewhere. Since so few records are affected, we remove them.

In [33]:
#Removing the records which have missing values in duration column
df = df.dropna(subset=['duration'])

In [34]:
df.shape

(8790, 12)

In [35]:
df['duration'].isnull().sum()

0

In [36]:
#Checking if still any missing values present in the dataset
df.isnull().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64
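
The three `dropna` steps above remove rows one column at a time. As a hedged alternative, `subset` accepts a list, so rows with a gap in any of the listed columns can be dropped in one call. The toy frame below (`sample`, an illustrative name) shows the effect:

```python
import numpy as np
import pandas as pd

# Toy data: each of rows B, C and D has one missing value
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "date_added": ["July 1, 2019", np.nan, "May 2, 2018", "June 3, 2020"],
    "rating": ["PG", "TV-MA", np.nan, "TV-14"],
    "duration": ["90 min", "1 Season", "2 Seasons", np.nan],
})

# Drop a row if any of the listed columns is missing
cleaned = sample.dropna(subset=["date_added", "rating", "duration"])

print(cleaned.shape)
```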

### Modification in date_added column

In [37]:
#Checking date_added and release_year column
df[['date_added','release_year']]

Unnamed: 0,date_added,release_year
0,"September 25, 2021",2020
1,"September 24, 2021",2021
2,"September 24, 2021",2021
3,"September 24, 2021",2021
4,"September 24, 2021",2021
...,...,...
8802,"November 20, 2019",2007
8803,"July 1, 2019",2018
8804,"November 1, 2019",2009
8805,"January 11, 2020",2006


#### To make the analysis more informative, let's split date_added into month_added and year_added

In [38]:
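#Splitting date_added into month_added and year_added

# Text before the comma is "Month day"; keep only the month name
df['month_added'] = df['date_added'].apply(lambda x: x.split(',')[0].split()[0])
# Text after the comma is the year; strip the leading space
df['year_added'] = df['date_added'].apply(lambda x: x.split(',')[1].strip())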

df[['month_added','year_added']]

Unnamed: 0,month_added,year_added
0,September,2021
1,September,2021
2,September,2021
3,September,2021
4,September,2021
...,...,...
8802,November,2019
8803,July,2019
8804,November,2019
8805,January,2020
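
An alternative to splitting the string by hand is to parse it as a date and read the parts from the `.dt` accessor. The sketch below assumes every value follows the "Month day, year" pattern seen above; the leading space in the second value is a made-up example of stray whitespace, and `dates` is an illustrative name.

```python
import pandas as pd

# Illustrative values in the same "Month day, year" format as date_added
dates = pd.Series(["September 25, 2021", " November 20, 2019", "July 1, 2019"])

# Strip stray whitespace, then parse with an explicit format
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y")

month_added = parsed.dt.month_name()  # full month name as text
year_added = parsed.dt.year           # year as an integer

print(month_added.tolist())
print(year_added.tolist())
```

With this approach the year is already numeric, so no separate type conversion is needed afterwards.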


In [39]:
df.head(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,year_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",September,2021
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",September,2021
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,September,2021
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",September,2021
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,September,2021
5,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...,September,2021
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...,September,2021
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s...",September,2021
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...,September,2021
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...,September,2021


In [40]:
# Dropping the date_added column as it is no longer needed

df = df.drop(columns='date_added')

In [41]:
# Checking the datatype of every column again

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8790 entries, 0 to 8806
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8790 non-null   object
 1   type          8790 non-null   object
 2   title         8790 non-null   object
 3   director      8790 non-null   object
 4   cast          8790 non-null   object
 5   country       8790 non-null   object
 6   release_year  8790 non-null   int64 
 7   rating        8790 non-null   object
 8   duration      8790 non-null   object
 9   listed_in     8790 non-null   object
 10  description   8790 non-null   object
 11  month_added   8790 non-null   object
 12  year_added    8790 non-null   object
dtypes: int64(1), object(12)
memory usage: 961.4+ KB


#### From the above result, we can see that the year_added column has the object datatype, which is not suitable for a year. Let's convert year_added to integer

In [43]:
#Converting datatype of year_added column from object to integer

df['year_added'] = df['year_added'].astype(int)

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8790 entries, 0 to 8806
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8790 non-null   object
 1   type          8790 non-null   object
 2   title         8790 non-null   object
 3   director      8790 non-null   object
 4   cast          8790 non-null   object
 5   country       8790 non-null   object
 6   release_year  8790 non-null   int64 
 7   rating        8790 non-null   object
 8   duration      8790 non-null   object
 9   listed_in     8790 non-null   object
 10  description   8790 non-null   object
 11  month_added   8790 non-null   object
 12  year_added    8790 non-null   int64 
dtypes: int64(2), object(11)
memory usage: 961.4+ KB
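
`astype(int)` raises an error if any value cannot be read as an integer. Where that is a risk, a more tolerant conversion is possible with `pd.to_numeric`. In this sketch the `"unknown"` entry is a made-up bad value and `years` is an illustrative name:

```python
import pandas as pd

# Illustrative year strings, including one value that is not a number
years = pd.Series([" 2021", " 2019", "unknown"])

# errors="coerce" turns unparseable values into NaN instead of raising;
# the nullable Int64 dtype then keeps whole numbers alongside missing values
as_numbers = pd.to_numeric(years.str.strip(), errors="coerce").astype("Int64")

print(as_numbers)
```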


### Comparing year_added and release_year column

In [45]:
df[['year_added','release_year']]

Unnamed: 0,year_added,release_year
0,2021,2020
1,2021,2021
2,2021,2021
3,2021,2021
4,2021,2021
...,...,...
8802,2019,2007
8803,2019,2018
8804,2019,2009
8805,2020,2006


In [48]:
df[df['year_added'] < df['release_year']]

Unnamed: 0,show_id,type,title,director,cast,country,release_year,rating,duration,listed_in,description,month_added,year_added
1551,s1552,TV Show,Hilda,,"Bella Ramsey, Ameerah Falzon-Ojo, Oliver Nelso...","United Kingdom, Canada, United States",2021,TV-Y7,2 Seasons,Kids' TV,"Fearless, free-spirited Hilda finds new friend...",December,2020
1696,s1697,TV Show,Polly Pocket,,"Emily Tennant, Shannon Chan-Kent, Kazumi Evans...","Canada, United States, Ireland",2021,TV-Y,2 Seasons,Kids' TV,After uncovering a magical locket that allows ...,November,2020
2920,s2921,TV Show,Love Is Blind,,"Nick Lachey, Vanessa Lachey",United States,2021,TV-MA,1 Season,"Reality TV, Romantic TV Shows",Nick and Vanessa Lachey host this social exper...,February,2020
3168,s3169,TV Show,Fuller House,,"Candace Cameron Bure, Jodie Sweetin, Andrea Ba...",United States,2020,TV-PG,5 Seasons,TV Comedies,The Tanner family’s adventures continue as DJ ...,December,2019
3287,s3288,TV Show,Maradona in Mexico,,Diego Armando Maradona,"Argentina, United States, Mexico",2020,TV-MA,1 Season,"Docuseries, Spanish-Language TV Shows","In this docuseries, soccer great Diego Maradon...",November,2019
3369,s3370,TV Show,BoJack Horseman,,"Will Arnett, Aaron Paul, Amy Sedaris, Alison B...",United States,2020,TV-MA,6 Seasons,TV Comedies,Meet the most beloved sitcom horse of the '90s...,October,2019
3433,s3434,TV Show,The Hook Up Plan,,"Marc Ruchmann, Zita Hanrot, Sabrina Ouazani, J...",France,2020,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...","When Parisian Elsa gets hung up on her ex, her...",October,2019
4844,s4845,TV Show,Unbreakable Kimmy Schmidt,,"Ellie Kemper, Jane Krakowski, Tituss Burgess, ...",United States,2019,TV-14,4 Seasons,TV Comedies,When a woman is rescued from a doomsday cult a...,May,2018
4845,s4846,TV Show,Arrested Development,,"Jason Bateman, Portia de Rossi, Will Arnett, M...",United States,2019,TV-MA,5 Seasons,TV Comedies,It's the Emmy-winning story of a wealthy famil...,May,2018
5394,s5395,Movie,Hans Teeuwen: Real Rancour,Doesjka van Hoogdalem,Hans Teeuwen,Netherlands,2018,TV-MA,86 min,Stand-Up Comedy,Comedian Hans Teeuwen rebels against political...,July,2017


#### From the above result, we can see that some records have year_added < release_year 
#### This should not happen, as a movie or show cannot be added to Netflix before it is released. We therefore treat these records as inconsistent.

In [49]:
#Removing the records where year_added < release_year

df.drop(df[df['year_added']<df['release_year']].index,inplace = True)

In [50]:
# Checking the dimension of data available after cleaning the dataset
df.shape

(8776, 13)
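
The same filtering can be expressed by keeping the rows that satisfy the opposite condition, which avoids `inplace` and index lookups. A minimal sketch using three rows whose years are taken from the outputs above (`sample` and `consistent` are illustrative names):

```python
import pandas as pd

# Years copied from the outputs above for three titles
sample = pd.DataFrame({
    "title": ["Hilda", "Sankofa", "The Starling"],
    "release_year": [2021, 1993, 2021],
    "year_added": [2020, 2021, 2021],
})

# Keep only rows where the title was added in or after its release year
consistent = sample[sample["year_added"] >= sample["release_year"]].reset_index(drop=True)

print(consistent)
```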

### Checking for Duplicate rows in Dataset

In [52]:
#Finding duplicate rows in the dataset

duplicate_rows = df[df.duplicated()]
duplicate_rows

Unnamed: 0,show_id,type,title,director,cast,country,release_year,rating,duration,listed_in,description,month_added,year_added


#### The result is empty, so there are no duplicate rows in the given dataset
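
Note that `duplicated()` compares entire rows, so two records that differ only in show_id are not reported. If that case matters, the comparison can be limited to chosen columns with `subset`. The sketch below uses made-up titles, and the choice of columns is an assumption, not part of the original analysis:

```python
import pandas as pd

# Made-up rows: s1 and s3 describe the same title under different ids
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Title A", "Title B", "Title A"],
    "type": ["Movie", "TV Show", "Movie"],
})

# Full-row comparison finds nothing because show_id differs
full_row_duplicates = sample[sample.duplicated()]

# Comparing only title and type flags both copies (keep=False marks all of them)
same_title_and_type = sample[sample.duplicated(subset=["title", "type"], keep=False)]

print(len(full_row_duplicates), len(same_title_and_type))
```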

### Outliers

#### The only numeric columns are release_year and year_added, so the dataset leaves little scope for outlier analysis. This step is skipped.

### Standardization of Dataset

In [53]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,release_year,rating,duration,listed_in,description,month_added,year_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",September,2021
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",September,2021
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,September,2021
3,s4,TV Show,Jailbirds New Orleans,,,,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",September,2021
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,September,2021
