## Dataset Content

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes scores could also yield interesting findings.
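As a rough illustration of such an integration, the sketch below joins a toy titles table to an external ratings table. The `ratings` frame, its `external_score` column, and the score values are hypothetical placeholders, and matching on `title` plus `release_year` is only an assumption about how the two sources could be aligned.

```python
import pandas as pd

# Toy stand-in for the Netflix titles table (real columns: title, release_year).
netflix = pd.DataFrame({
    "title": ["Zodiac", "Zombieland", "Zoom"],
    "release_year": [2007, 2009, 2006],
})

# Hypothetical external ratings table; the column name and scores are placeholders.
ratings = pd.DataFrame({
    "title": ["Zodiac", "Zombieland"],
    "release_year": [2007, 2009],
    "external_score": [1.0, 2.0],
})

# A left join keeps every Netflix title; indicator=True adds a _merge column
# that shows which rows found a match in the external table.
merged = netflix.merge(
    ratings, on=["title", "release_year"], how="left", indicator=True
)
print(merged)
```

Rows marked `left_only` in `_merge` are titles without a match, which is worth checking before drawing conclusions from the joined scores.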

Inspiration
- Understanding what content is available in different countries
- Identifying similar content by matching text-based features
- Network analysis of actors and directors to find interesting insights
- Has Netflix increasingly focused on TV rather than movies in recent years?


## Import Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

In [2]:
df = pd.read_csv('E:/Data Analyst Portofilio Data/Datasets/Netflex/netflix_titles.csv')
df

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


## Data Preprocessing

In [3]:
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [5]:
# Convert data type for date_added column to datetime
# (strip stray whitespace first so every value parses consistently)
df["date_added"] = pd.to_datetime(df['date_added'].str.strip())

# Add new columns
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
df['day_added'] = df['date_added'].dt.day
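The conversion above can also be written with an explicit format, which makes parsing failures visible instead of relying on format inference. This is a minimal sketch on toy values; the `"%B %d, %Y"` pattern is an assumption based on the dates visible in the preview (for example `September 25, 2021`), and `errors="coerce"` turns anything that does not match into `NaT`.

```python
import pandas as pd

# Toy values: a clean date, one with a leading space, and a missing entry.
raw = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip whitespace, then parse with an explicit "Month day, year" format.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Derive the same helper columns as in the cell above.
parts = pd.DataFrame({
    "year_added": parsed.dt.year,
    "month_added": parsed.dt.month,
    "day_added": parsed.dt.day,
})
print(parts)
```

Counting `parsed.isna()` against `raw.isna()` afterwards is a quick way to see whether any non-empty value failed to parse.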

In [6]:
# check null values
df.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
year_added        10
month_added       10
day_added         10
dtype: int64
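Raw null counts are easier to judge next to the share of rows they represent. The sketch below, on a small toy frame, puts both side by side; `null_summary` is an illustrative name and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two columns.
toy = pd.DataFrame({
    "title": ["w", "x", "y", "z"],
    "director": ["A", np.nan, np.nan, "B"],
    "rating": ["R", "PG", np.nan, "R"],
})

# isnull().sum() counts missing cells; isnull().mean() gives their fraction.
null_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(2),
}).sort_values("missing", ascending=False)
print(null_summary)
```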

In [7]:
# drop unnecessary features/columns

# drop column show_id
df.drop('show_id',axis=1,inplace=True)

# drop column director
df.drop('director',axis=1,inplace=True)

# drop column cast
df.drop('cast',axis=1,inplace=True)

# drop column description
df.drop('description',axis=1,inplace=True)
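The four separate drops could equally be expressed as one call. A minimal sketch on an empty toy frame with the same column names, returning a new frame rather than modifying in place:

```python
import pandas as pd

# Empty toy frame that only carries the relevant column names.
toy = pd.DataFrame(
    columns=["show_id", "type", "title", "director", "cast", "description"]
)

# drop(columns=...) removes several columns at once and returns a new frame.
trimmed = toy.drop(columns=["show_id", "director", "cast", "description"])
print(list(trimmed.columns))
```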

In [8]:
# the mode value of country column
df['country'].value_counts().idxmax()

'United States'

In [9]:
# the mode value of rating column
df['rating'].value_counts().idxmax()

'TV-MA'

In [10]:
# Replace null values with the mode of each column

df['country'] = df['country'].fillna('United States')

df['rating'] = df['rating'].fillna('TV-MA')

# drop rows with null values in the remaining columns

df.dropna(inplace=True)
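Instead of typing the mode values by hand, they can be computed and applied in one step. This is a sketch on toy data, assuming that the first value returned by `Series.mode()` is an acceptable fill when several values tie.

```python
import pandas as pd

# Toy frame with one gap in each column.
toy = pd.DataFrame({
    "country": ["United States", "India", "United States", None],
    "rating": ["TV-MA", None, "TV-MA", "R"],
    "duration": ["90 min", "1 Season", None, "88 min"],
})

# Fill the categorical gaps with each column's most frequent value.
for col in ["country", "rating"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Rows that still contain a null (here: a missing duration) are dropped.
cleaned = toy.dropna()
print(cleaned)
```

Filling with the mode inflates the most common category, so counts for that category after imputation include the filled rows.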

In [11]:
# check null values
df.isnull().sum()

type            0
title           0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
year_added      0
month_added     0
day_added       0
dtype: int64

In [12]:
# check duplicated values
df.duplicated().value_counts()

False    8794
dtype: int64

In [13]:
df.head()

,type,title,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,day_added
0,Movie,Dick Johnson Is Dead,United States,2021-09-25,2020,PG-13,90 min,Documentaries,2021.0,9.0,25.0
1,TV Show,Blood & Water,South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries",2021.0,9.0,24.0
2,TV Show,Ganglands,United States,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",2021.0,9.0,24.0
3,TV Show,Jailbirds New Orleans,United States,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV",2021.0,9.0,24.0
4,TV Show,Kota Factory,India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",2021.0,9.0,24.0


In [14]:
df[['year_added','month_added','day_added']] = df[['year_added','month_added','day_added']].astype('int')
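The `2021.0` style values seen before this cast appear because a column containing `NaT` cannot be stored as a plain integer, so pandas falls back to floats. If the rows with missing dates had to be kept, pandas' nullable `Int64` dtype would be one alternative; a small sketch on toy values:

```python
import pandas as pd

# One valid date and one missing date.
parsed = pd.to_datetime(pd.Series(["2021-09-25", None]))

# With a NaT present, .dt.year is float; "Int64" keeps integers plus <NA>.
year_added = parsed.dt.year.astype("Int64")
print(year_added)
```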

In [15]:
# Calculate summary statistics for categorical columns
df.describe(include='object')

,type,title,country,rating,duration,listed_in
count,8794,8794,8794,8794,8794,8794
unique,2,8794,748,14,220,513
top,Movie,Dick Johnson Is Dead,United States,TV-MA,1 Season,"Dramas, International Movies"
freq,6128,1,3639,3209,1793,362


In [16]:
df['type'].value_counts()

Movie      6128
TV Show    2666
Name: type, dtype: int64

In [17]:
df['country'].value_counts()

United States                             3639
India                                      972
United Kingdom                             418
Japan                                      244
South Korea                                199
                                          ... 
Romania, Bulgaria, Hungary                   1
Uruguay, Guatemala                           1
France, Senegal, Belgium                     1
Mexico, United States, Spain, Colombia       1
United Arab Emirates, Jordan                 1
Name: country, Length: 748, dtype: int64
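The output above counts each combination of countries as its own value, which is why there are 748 distinct entries. To count titles per individual country, the comma-separated strings can be split and exploded first. A minimal sketch on toy rows; the filter for empty strings is a precaution in case a value carries a stray comma.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "country": [
        "United States",
        "United Kingdom, United States",
        "France, Senegal, Belgium",
    ],
})

# Split on commas, give each country its own row, and trim whitespace.
countries = toy["country"].str.split(",").explode().str.strip()

# Drop empty fragments, then count titles per single country.
per_country = countries[countries != ""].value_counts()
print(per_country)
```

A co-production is counted once for every country it lists, so the totals can exceed the number of titles.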

In [18]:
df['rating'].value_counts()

TV-MA       3209
TV-14       2157
TV-PG        861
R            799
PG-13        490
TV-Y7        333
TV-Y         306
PG           287
TV-G         220
NR            79
G             41
TV-Y7-FV       6
NC-17          3
UR             3
Name: rating, dtype: int64

In [19]:
df['duration'].value_counts()

1 Season     1793
2 Seasons     421
3 Seasons     198
90 min        152
94 min        146
             ... 
16 min          1
186 min         1
193 min         1
189 min         1
191 min         1
Name: duration, Length: 220, dtype: int64
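Because `duration` mixes minutes (movies) and seasons (TV shows) as text, numeric summaries need the number pulled out first. The sketch below does this on toy rows; `duration_value` is an illustrative column name, not one from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons", "158 min", "1 Season"],
})

# Extract the leading number; its unit depends on the title type.
toy["duration_value"] = (
    toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

movie_minutes = toy.loc[toy["type"] == "Movie", "duration_value"]
show_seasons = toy.loc[toy["type"] == "TV Show", "duration_value"]
print(movie_minutes.mean(), show_seasons.max())
```

Keeping movies and TV shows separate matters here, since averaging minutes and seasons together would be meaningless.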

In [20]:
df['listed_in'].value_counts()

Dramas, International Movies                                   362
Documentaries                                                  359
Stand-Up Comedy                                                334
Comedies, Dramas, International Movies                         274
Dramas, Independent Movies, International Movies               252
                                                              ... 
Crime TV Shows, International TV Shows, TV Sci-Fi & Fantasy      1
International TV Shows, TV Horror, TV Sci-Fi & Fantasy           1
Crime TV Shows, Kids' TV                                         1
Horror Movies, International Movies, Sci-Fi & Fantasy            1
Cult Movies, Dramas, Thrillers                                   1
Name: listed_in, Length: 513, dtype: int64
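As with `country`, each `listed_in` value is a combination of genres, so the counts above describe genre combinations rather than single genres. One possible way to count individual genres is to build indicator columns with `Series.str.get_dummies`; this sketch assumes the separator is always a comma followed by a space, as in the values shown.

```python
import pandas as pd

toy = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Documentaries",
        "Comedies, Dramas, International Movies",
    ],
})

# One 0/1 column per genre; a row has a 1 for every genre it lists.
genre_flags = toy["listed_in"].str.get_dummies(sep=", ")

# Summing each column gives the number of titles per genre.
genre_counts = genre_flags.sum().sort_values(ascending=False)
print(genre_counts)
```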

In [21]:
df['release_year'].value_counts()

2018    1146
2017    1031
2019    1030
2020     953
2016     901
        ... 
1959       1
1925       1
1961       1
1947       1
1966       1
Name: release_year, Length: 74, dtype: int64

In [22]:
df.head()

,type,title,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,day_added
0,Movie,Dick Johnson Is Dead,United States,2021-09-25,2020,PG-13,90 min,Documentaries,2021,9,25
1,TV Show,Blood & Water,South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries",2021,9,24
2,TV Show,Ganglands,United States,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",2021,9,24
3,TV Show,Jailbirds New Orleans,United States,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV",2021,9,24
4,TV Show,Kota Factory,India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",2021,9,24


## Exploratory Data Analysis

In [23]:
df.sample(10)

,type,title,country,date_added,release_year,rating,duration,listed_in,year_added,month_added,day_added
4471,Movie,Been So Long,United States,2018-10-26,2018,TV-MA,100 min,"Dramas, International Movies, Music & Musicals",2018,10,26
4457,Movie,Satyagraha,India,2018-11-01,2013,TV-14,146 min,"Dramas, International Movies, Music & Musicals",2018,11,1
149,Movie,I Got the Hook Up,United States,2021-09-01,1998,R,93 min,"Action & Adventure, Comedies",2021,9,1
5244,TV Show,Star Trek: Enterprise,United States,2017-10-01,2004,TV-14,4 Seasons,"Classic & Cult TV, TV Action & Adventure, TV S...",2017,10,1
4388,Movie,Aalorukkam,India,2018-11-15,2018,TV-PG,122 min,"Dramas, Independent Movies, International Movies",2018,11,15
2908,Movie,The Forest,United States,2020-02-16,2016,PG-13,93 min,"Horror Movies, Independent Movies",2020,2,16
7292,Movie,Leo the Lion,"United States, Italy",2015-12-20,2013,TV-Y7-FV,78 min,"Children & Family Movies, Comedies",2015,12,20
2138,Movie,Christine,"United Kingdom, United States",2020-08-13,2016,R,119 min,"Dramas, Independent Movies",2020,8,13
8752,Movie,Wish Man,United States,2019-12-03,2019,TV-14,108 min,"Children & Family Movies, Dramas",2019,12,3
1883,Movie,StarBeam: Halloween Hero,Canada,2020-10-06,2020,TV-Y,33 min,Children & Family Movies,2020,10,6


In [24]:
df_movie = df[df['type'] == 'Movie']

df_tv_show = df[df['type'] == 'TV Show']

In [25]:
# Top 10 countries by number of movies

df_movie[['type','country']].groupby(['country']).count()['type'].nlargest(10).to_frame()

country,type
United States,2495
India,893
United Kingdom,206
Canada,122
Spain,97
Egypt,92
Nigeria,86
Indonesia,77
Japan,76
Turkey,76
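The `groupby` / `count` / `nlargest` chain used in these cells yields the same top-N counts as `value_counts().head(n)`, which may be easier to read. A small sketch on toy data comparing the two:

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie"] * 6,
    "country": [
        "United States", "India", "United States",
        "Canada", "India", "United States",
    ],
})

# The pattern used in the notebook cells.
via_groupby = (
    toy[["type", "country"]].groupby(["country"]).count()["type"].nlargest(2)
)

# Equivalent counts: value_counts sorts in descending order by default.
via_value_counts = toy["country"].value_counts().head(2)

print(via_groupby.to_dict())
print(via_value_counts.to_dict())
```

When several values tie for the last place, the two approaches are not guaranteed to pick the same one.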


In [26]:
# Top 10 countries by number of TV shows

df_tv_show[['type','country']].groupby(['country']).count()['type'].nlargest(10).to_frame()

country,type
United States,1144
United Kingdom,212
Japan,168
South Korea,158
India,79
Taiwan,68
Canada,59
France,49
Spain,48
Australia,47


In [27]:
# 10 most common durations for movies

df_movie[['type','duration']].groupby(['duration']).count()['type'].nlargest(10).to_frame()

duration,type
90 min,152
93 min,146
94 min,146
97 min,146
91 min,144
95 min,137
96 min,130
92 min,129
102 min,122
98 min,120


In [28]:
# 10 most common durations for TV shows

df_tv_show[['type','duration']].groupby(['duration']).count()['type'].nlargest(10).to_frame()

duration,type
1 Season,1793
2 Seasons,421
3 Seasons,198
4 Seasons,94
5 Seasons,64
6 Seasons,33
7 Seasons,23
8 Seasons,17
9 Seasons,9
10 Seasons,6


In [29]:
# 10 most common ratings for movies

df_movie[['type','rating']].groupby(['rating']).count()['type'].nlargest(10).to_frame()

rating,type
TV-MA,2064
TV-14,1427
R,797
TV-PG,540
PG-13,490
PG,287
TV-Y7,139
TV-Y,131
TV-G,126
NR,75


In [30]:
# 10 most common ratings for TV shows

df_tv_show[['type','rating']].groupby(['rating']).count()['type'].nlargest(10).to_frame()

rating,type
TV-MA,1145
TV-14,730
TV-PG,321
TV-Y7,194
TV-Y,175
TV-G,94
NR,4
R,2
TV-Y7-FV,1


In [31]:
# 10 most common listed_in (genre) combinations for movies

df_movie[['type','listed_in']].groupby(['listed_in']).count()['type'].nlargest(10).to_frame()

listed_in,type
"Dramas, International Movies",362
Documentaries,359
Stand-Up Comedy,334
"Comedies, Dramas, International Movies",274
"Dramas, Independent Movies, International Movies",252
Children & Family Movies,215
"Children & Family Movies, Comedies",201
"Documentaries, International Movies",186
"Dramas, International Movies, Romantic Movies",180
"Comedies, International Movies",176


In [32]:
# 10 most common listed_in (genre) combinations for TV shows

df_tv_show[['type','listed_in']].groupby(['listed_in']).count()['type'].nlargest(10).to_frame()

listed_in,type
Kids' TV,219
"International TV Shows, TV Dramas",121
"Crime TV Shows, International TV Shows, TV Dramas",110
"Kids' TV, TV Comedies",98
Reality TV,95
"International TV Shows, Romantic TV Shows, TV Comedies",94
"International TV Shows, Romantic TV Shows, TV Dramas",90
"Anime Series, International TV Shows",88
Docuseries,84
TV Comedies,68


In [33]:
# 10 years in which the most movies were added to Netflix

df_movie[['type','year_added']].groupby(['year_added']).count()['type'].nlargest(10).to_frame()

year_added,type
2019,1424
2020,1284
2018,1237
2021,993
2017,838
2016,251
2015,56
2014,19
2011,13
2013,6


In [34]:
# 10 years in which the most TV shows were added to Netflix

df_tv_show[['type','year_added']].groupby(['year_added']).count()['type'].nlargest(10).to_frame()

year_added,type
2020,595
2019,592
2021,505
2018,412
2017,349
2016,176
2015,26
2013,5
2014,5
2008,1
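To compare the two tables above directly, movies and TV shows added per year can be placed in one table along with the TV share, which speaks to the question of whether the catalogue is shifting toward TV. A minimal sketch on toy rows; `tv_share` is an illustrative column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "year_added": [2019, 2019, 2019, 2019, 2020, 2020],
})

# Rows are years, columns are title types, cells are counts.
counts = pd.crosstab(toy["year_added"], toy["type"])

# Fraction of each year's additions that are TV shows.
counts["tv_share"] = counts["TV Show"] / (counts["Movie"] + counts["TV Show"])
print(counts)
```

A rising `tv_share` across years would suggest a shift toward TV among titles still in the catalogue; it says nothing about titles that were removed before the snapshot.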


In [35]:
# 10 most common release years for movies on Netflix

df_movie[['type','release_year']].groupby(['release_year']).count()['type'].nlargest(10).to_frame()

release_year,type
2018,767
2017,766
2016,658
2019,633
2020,517
2015,397
2021,277
2014,264
2013,225
2012,173


In [36]:
# 10 most common release years for TV shows on Netflix

df_tv_show[['type','release_year']].groupby(['release_year']).count()['type'].nlargest(10).to_frame()

release_year,type
2020,436
2019,397
2018,379
2021,315
2017,265
2016,243
2015,160
2014,88
2012,63
2013,62
