<a href="https://colab.research.google.com/github/r9hit10/NetFlix-EDA/blob/main/Netflix_Exploratory_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##<font color="pink" style="sans-serif"> NETFLIX </font>
<font color="cornflowerblue" style="sans-serif">Netflix is one of the most popular media and video streaming platforms.
They have over 10000 movies or tv shows available on their platform, as of mid-2021, they have over 222M Subscribers globally.
This tabular dataset consists of listings of all the movies and tv shows available on Netflix, along with details such as - cast, directors, ratings, release year, duration, etc.</font>

## <font color="pink" style="sans-serif"> Business Problem Statement </font>
<font color="cornflowerblue" style="sans-serif">
Analyze the data for Netflix and generate insights that could help Netflix in deciding which type of shows/movies to produce and how they can grow the business in different countries

## <font color="pink" style="sans-serif"> Importing Python Libraries </font>

In [144]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from matplotlib import rcParams
rcParams['figure.figsize']= 20,10

In [145]:
## Reading the Dataset

data=pd.read_csv("netflix.csv")

In [146]:
## Exploring the Dataset
## shape, number of rows and number of columns
## Default Data Types for the Columns

In [147]:
data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [148]:
## Number of rows and columns
## data.shape

In [149]:
data.shape

(8807, 12)

In [150]:
## No. of Rows 8807
## No of Columns/Features 12

In [151]:
## Dataset Information
## data.info()

In [152]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


## <font color="pink" style="sans-serif"> Dataset Information </font>

The dataset provided consists of a list of all the TV shows/movies available on Netflix:

Show_id: Unique ID for every Movie / Tv Show

Type: Identifier - A Movie or TV Show

Title: Title of the Movie / Tv Show

Director: Director of the Movie

Cast: Actors involved in the movie/show

Country: Country where the movie/show was produced

Date_added: Date it was added on Netflix

Release_year: Actual Release year of the movie/show

Rating: TV Rating of the movie/show

Duration: Total Duration - in minutes or number of seasons

Listed_in: Genre

Description: The summary description


In [153]:
## Data types for columns
## df.dtypes

In [154]:
data.dtypes

Unnamed: 0,0
show_id,object
type,object
title,object
director,object
cast,object
country,object
date_added,object
release_year,int64
rating,object
duration,object


In [155]:
## Converting "type" column to 'category' data type for faster evaluation and comparisions

In [156]:
data["type"]=data["type"].astype("category")

In [157]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   show_id       8807 non-null   object  
 1   type          8807 non-null   category
 2   title         8807 non-null   object  
 3   director      6173 non-null   object  
 4   cast          7982 non-null   object  
 5   country       7976 non-null   object  
 6   date_added    8797 non-null   object  
 7   release_year  8807 non-null   int64   
 8   rating        8803 non-null   object  
 9   duration      8804 non-null   object  
 10  listed_in     8807 non-null   object  
 11  description   8807 non-null   object  
dtypes: category(1), int64(1), object(10)
memory usage: 765.7+ KB


In [158]:
## Movie/TV_Series Release_Years

In [159]:
data.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


<font color="cornflowerblue" style="sans-serif">
Using .describe() we can see that,

Release_year
Ranges from 1925(min) to
2021(max)

50% of Releases are in 2017

In [160]:
## checking for the missing values
## using .isnull()

In [161]:
data.isnull().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,2634
cast,825
country,831
date_added,10
release_year,0
rating,4
duration,3


In [162]:
## Total Missing values in the Dataset

In [163]:
data.isnull().sum().sum()

4307

<font color="cornflowerblue" style="sans-serif">There are in total 4307 Missing Values in the Dataset

In [164]:
## Percentage of Missing Values
## Total Missing Values/Total Rows

In [165]:
Percent_Missing_Values = round((data.isnull().sum().sum()/data.shape[0])*100,2)
print("Percent of Missing Values",Percent_Missing_Values)

Percent of Missing Values 48.9


In [166]:
## Checking the Percentage of Movies and TV Series in the Dataset
## .value_counts(normalize=True)

In [167]:
data["type"].value_counts(normalize=True)

Unnamed: 0_level_0,proportion
type,Unnamed: 1_level_1
Movie,0.696151
TV Show,0.303849


In [168]:
## Approximately 70% of the data is for Movies
## Aproximately 30% of the data is for TV Series

<font color="cornflowerblue" style="sans-serif">Top 5 Countrywide Releases

In [169]:
data["country"].value_counts(normalize=True).head()

Unnamed: 0_level_0,proportion
country,Unnamed: 1_level_1
United States,0.35331
India,0.121866
United Kingdom,0.052533
Japan,0.030717
South Korea,0.02495


<font color="cornflowerblue" style="sans-serif">United States with 35% Movies and TV Series Release with 65% contribution from the rest of the World.

In [170]:
## Ratings Proportion

In [171]:
data["rating"].value_counts(normalize=True)[:5]

Unnamed: 0_level_0,proportion
rating,Unnamed: 1_level_1
TV-MA,0.364308
TV-14,0.245371
TV-PG,0.098035
R,0.090765
PG-13,0.055663


In [172]:
## Most common ratings "TV-MA"

In [173]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   show_id       8807 non-null   object  
 1   type          8807 non-null   category
 2   title         8807 non-null   object  
 3   director      6173 non-null   object  
 4   cast          7982 non-null   object  
 5   country       7976 non-null   object  
 6   date_added    8797 non-null   object  
 7   release_year  8807 non-null   int64   
 8   rating        8803 non-null   object  
 9   duration      8804 non-null   object  
 10  listed_in     8807 non-null   object  
 11  description   8807 non-null   object  
dtypes: category(1), int64(1), object(10)
memory usage: 765.7+ KB


## <font color="pink" style="sans-serif"> Splitting data into Movies and TV Shows to Analyse Cast

In [174]:
Movies=data.loc[data["type"]=="Movie"]
TV_Shows=data.loc[data["type"]=="TV Show"]

In [175]:
Movies.shape

(6131, 12)

In [176]:
TV_Shows.shape

(2676, 12)

<font color="cornflowerblue" style="sans-serif">Total Movies = 6131
<font color="cornflowerblue" style="sans-serif">Total TV Shows = 2676

In [177]:
Movies.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...","Germany, Czech Republic","September 23, 2021",2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...


In [178]:
Movies.drop(columns=["show_id","type","date_added","release_year","duration","description"])

Unnamed: 0,title,director,cast,country,rating,listed_in
0,Dick Johnson Is Dead,Kirsten Johnson,,United States,PG-13,Documentaries
6,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,PG,Children & Family Movies
7,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...",TV-MA,"Dramas, Independent Movies, International Movies"
9,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,PG-13,"Comedies, Dramas"
12,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...","Germany, Czech Republic",TV-MA,"Dramas, International Movies"
...,...,...,...,...,...,...
8801,Zinzana,Majid Al Ansari,"Ali Suliman, Saleh Bakri, Yasa, Ali Al-Jabri, ...","United Arab Emirates, Jordan",TV-MA,"Dramas, International Movies, Thrillers"
8802,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,R,"Cult Movies, Dramas, Thrillers"
8804,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,R,"Comedies, Horror Movies"
8805,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,PG,"Children & Family Movies, Comedies"


## <font color="cornflowerblue"> Most frequently casted actor(s/ess) in movies</font>

In [179]:
movie_cast=dict()

In [180]:
for i in Movies["cast"].dropna().values:
  for j in i.split(","):
    if j in movie_cast:
      movie_cast[str(j).strip()] +=1
    else:
      movie_cast[str(j).strip()] =1

In [181]:
movie_cast=sorted(movie_cast.items(), key= lambda x : x[1], reverse=True)

In [182]:
movie_cast[:5]

[('Adam Sandler', 20),
 ('Vatsal Dubey', 16),
 ('Ahmed Helmy', 13),
 ('Samuel West', 11),
 ('Eddie Murphy', 10)]

## <font color="pink"> Most frequent directors</font>

In [183]:
movie_directors=dict()
for i in Movies["director"].dropna().values:
  for j in i.split(","):
    if j in movie_directors:
      movie_directors[str(j).strip()] +=1
    else:
      movie_directors[str(j).strip()] =1

In [184]:
movie_directors = sorted(movie_directors.items(), key= lambda x : x[1] , reverse=True)

In [185]:
movie_directors[:5]

[('Rajiv Chilaka', 22),
 ('Suhas Kadav', 16),
 ('Jay Karas', 15),
 ('Marcus Raboy', 15),
 ('Cathy Garcia-Molina', 13)]

## <font color="pink"> Most frequent Movie Genres</font>

In [186]:
Movies["listed_in"]

Unnamed: 0,listed_in
0,Documentaries
6,Children & Family Movies
7,"Dramas, Independent Movies, International Movies"
9,"Comedies, Dramas"
12,"Dramas, International Movies"
...,...
8801,"Dramas, International Movies, Thrillers"
8802,"Cult Movies, Dramas, Thrillers"
8804,"Comedies, Horror Movies"
8805,"Children & Family Movies, Comedies"


In [187]:
movie_genre= dict()
for i in Movies["listed_in"].dropna().values:
  for j in i.split(","):
    if  j in movie_genre:
      movie_genre[str(j).strip()] +=1
    else:
      movie_genre[str(j).strip()] = 1

In [188]:
movie_genre

{'Documentaries': 19,
 'Children & Family Movies': 5,
 'Dramas': 2,
 'Independent Movies': 1,
 'International Movies': 1,
 'Comedies': 1,
 'Thrillers': 1,
 'Romantic Movies': 1,
 'Music & Musicals': 1,
 'Horror Movies': 1,
 'Sci-Fi & Fantasy': 1,
 'Action & Adventure': 859,
 'Classic Movies': 2,
 'Anime Features': 2,
 'Sports Movies': 1,
 'Cult Movies': 2,
 'Faith & Spirituality': 1,
 'LGBTQ Movies': 1,
 'Stand-Up Comedy': 26,
 'Movies': 57}

In [189]:
movie_genre = sorted(movie_genre.items(), key = lambda x:x[1], reverse= True)
movie_genre[:5]

[('Action & Adventure', 859),
 ('Movies', 57),
 ('Stand-Up Comedy', 26),
 ('Documentaries', 19),
 ('Children & Family Movies', 5)]

## <font color="pink"> High Number of Movies producing countries</font>

In [201]:
Movies["country"].value_counts().reset_index().head(10)

Unnamed: 0,country,count
0,United States,2058
1,India,893
2,United Kingdom,206
3,Canada,122
4,Spain,97
5,Egypt,92
6,Nigeria,86
7,Indonesia,77
8,Turkey,76
9,Japan,76
