## Netflix Show Data Exploratory Analysis

In [1]:
import pandas as pd 
import numpy as np
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
import plotly.express as px

In [2]:
netflix = pd.read_csv(r"C:\Users\Administrator\Desktop\project\netflix\netflix_titles.csv")

### Data Exploration  

In [3]:
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


In [4]:
netflix['date_added'] = pd.to_datetime(netflix['date_added'])
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       7787 non-null   object        
 1   type          7787 non-null   object        
 2   title         7787 non-null   object        
 3   director      5398 non-null   object        
 4   cast          7069 non-null   object        
 5   country       7280 non-null   object        
 6   date_added    7777 non-null   datetime64[ns]
 7   release_year  7787 non-null   int64         
 8   rating        7780 non-null   object        
 9   duration      7787 non-null   object        
 10  listed_in     7787 non-null   object        
 11  description   7787 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 730.2+ KB


In [5]:
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,2020-08-14,2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19:00 AM,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016-12-23,2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,11:59:00 PM,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018-12-20,2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,2017-11-16,2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,2020-01-01,2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [6]:
missing_value_count = netflix.isnull().sum()

missing_value_count

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64

## 1. Number of Shows

In [7]:
show_number = netflix.type.value_counts()
show_number = pd.DataFrame(show_number).reset_index()
show_number.columns = ['type','number']
show_number

Unnamed: 0,type,number
0,Movie,5377
1,TV Show,2410


## 2. Rating


The values in the 'rating' column are hard to compare because they mix film and TV rating systems, so I add an 'ages' column that takes one of the values 'Kids', 'Teens', 'Adults', or 'Not Rated'.

In [8]:
kids = [ 'TV-Y', 'G','TV-Y7','TV-Y7-FV', 'TV-G', 'PG', 'TV-PG']
teens = [ 'PG-13', 'TV-14']
adults = [ 'R', 'TV-MA', 'NC-17']

# Work on a one-column copy so the original DataFrame is left untouched
netflix_rating = netflix[['rating']].copy()

def age(x):
    if x in kids:
        return 'Kids'
    elif x in teens:
        return 'Teens'
    elif x in adults:
        return 'Adults'
    else:
        return 'Not Rated'

netflix_rating['ages'] = netflix['rating'].apply(age)

netflix_rating

Unnamed: 0,rating,ages
0,TV-MA,Adults
1,TV-MA,Adults
2,R,Adults
3,PG-13,Teens
4,PG-13,Teens
...,...,...
7782,TV-MA,Adults
7783,TV-14,Teens
7784,TV-MA,Adults
7785,TV-PG,Kids
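
The same grouping can also be expressed as a lookup table instead of an `if/elif` function. The sketch below is one possible alternative, not the method used in this notebook; `ratings`, `age_groups`, and `rating_to_age` are illustrative names, and the toy series only stands in for `netflix['rating']`.

```python
import pandas as pd

# Toy ratings standing in for netflix['rating'] (assumption: illustrative data only).
ratings = pd.Series(['TV-MA', 'PG-13', 'TV-Y', 'NR', None], name='rating')

# The same three groups as the kids/teens/adults lists, written as one lookup table.
age_groups = {
    'Kids': ['TV-Y', 'G', 'TV-Y7', 'TV-Y7-FV', 'TV-G', 'PG', 'TV-PG'],
    'Teens': ['PG-13', 'TV-14'],
    'Adults': ['R', 'TV-MA', 'NC-17'],
}
# Invert it: one entry per rating, pointing at its age group.
rating_to_age = {r: group for group, rs in age_groups.items() for r in rs}

# Ratings that are not in the lookup (including missing values) become 'Not Rated'.
ages = ratings.map(rating_to_age).fillna('Not Rated')
print(ages.tolist())
```

A dictionary keeps the rating-to-group assignment in one place, which makes it easier to spot a rating that was left out.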


Count the number of shows for each rating.

In [9]:
nr = netflix_rating.groupby(['rating','ages']).agg(number=('rating','count')).sort_values(by= 'ages').reset_index()
nr

Unnamed: 0,rating,ages,number
0,NC-17,Adults,3
1,R,Adults,665
2,TV-MA,Adults,2863
3,G,Kids,39
4,PG,Kids,247
5,TV-G,Kids,194
6,TV-PG,Kids,806
7,TV-Y,Kids,280
8,TV-Y7,Kids,271
9,TV-Y7-FV,Kids,6


In [10]:
fig = px.sunburst(nr,path = ['ages','rating'],values = 'number')
fig.show()

## 3. Genre

Take a look at the data.

In [11]:
netflix['listed_in'].head(3)

0    International TV Shows, TV Dramas, TV Sci-Fi &...
1                         Dramas, International Movies
2                  Horror Movies, International Movies
Name: listed_in, dtype: object

A show can have several genres, so I split each entry into one row per genre.

In [12]:
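df = netflix
# Split each comma-separated genre string into columns, then stack them into one long column
new_df = pd.DataFrame(df.listed_in.str.split(',').tolist()).stack()
new_df = pd.DataFrame(new_df).reset_index()
# The two index levels (row, position) are not needed; keep only the genre
new_df.columns = ['a','b','genre']
new_df.drop(columns=['a','b'],inplace= True)
# Remove the space left after each comma
new_df['genre'] = new_df['genre'].str.strip()
new_df[:5]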

Unnamed: 0,genre
0,International TV Shows
1,TV Dramas
2,TV Sci-Fi & Fantasy
3,Dramas
4,International Movies
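
For reference, the split-into-rows step can probably be written more compactly with `Series.explode`, assuming pandas 0.25 or later. This is a sketch on toy data, not the code that produced the output above; `listed_in` and `genres` are illustrative names.

```python
import pandas as pd

# Toy stand-in for netflix['listed_in'] (assumption: illustrative data only).
listed_in = pd.Series([
    'International TV Shows, TV Dramas',
    'Dramas, International Movies',
])

genres = (
    listed_in.str.split(',')    # each cell becomes a list of genres
    .explode()                  # one row per list element
    .str.strip()                # remove the space left after each comma
    .reset_index(drop=True)     # discard the repeated original row labels
    .to_frame('genre')
)
print(genres)
```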


In [13]:
genre = pd.DataFrame(new_df[['genre']].value_counts()).reset_index()
genre.columns = ['genre','number']
genre[:10]

Unnamed: 0,genre,number
0,International Movies,2437
1,Dramas,2106
2,Comedies,1471
3,International TV Shows,1199
4,Documentaries,786
5,Action & Adventure,721
6,TV Dramas,704
7,Independent Movies,673
8,Children & Family Movies,532
9,Romantic Movies,531


In [14]:
fig = px.bar(genre[:10],x='genre',y='number')
fig.show()

## 4. Country (producing shows)

The country column has the same issue as the genre column, so I split it the same way, using a slightly different method that also keeps date_added for each country.

In [15]:
netflix['country'] = netflix['country'].fillna("null")
netflix_country=pd.concat([Series(row['date_added'], row['country'].split(','))              
                    for _, row in netflix.iterrows()]).reset_index()
netflix_country.columns = ['country','date_added']
netflix_country

Unnamed: 0,country,date_added
0,Brazil,2020-08-14
1,Mexico,2016-12-23
2,Singapore,2018-12-20
3,United States,2017-11-16
4,United States,2020-01-01
...,...,...
9569,,2020-09-25
9570,Australia,2020-10-31
9571,United Kingdom,2020-03-01
9572,Canada,2020-03-01
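
Looping with `iterrows` works but can be slow on larger tables. A possible alternative, assuming pandas 0.25 or later, is `DataFrame.explode`, which repeats `date_added` for every country in a row. The frame `toy` and the name `country_dates` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the country and date_added columns (assumption: illustrative data only).
toy = pd.DataFrame({
    'country': ['United States, Canada', None, 'India'],
    'date_added': pd.to_datetime(['2020-03-01', '2020-09-25', '2019-01-01']),
})

country_dates = (
    toy.assign(country=toy['country'].fillna('null').str.split(','))
    .explode('country')              # one row per country, date_added is repeated
    .reset_index(drop=True)[['country', 'date_added']]
)
# Remove the space left after each comma.
country_dates['country'] = country_dates['country'].str.strip()
print(country_dates)
```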


In [16]:
netflix_country['country'] = netflix_country['country'].str.strip()
shows_by_country = netflix_country['country'].value_counts().to_frame().reset_index()
shows_by_country.columns = ['country','shows']
shows_by_country

Unnamed: 0,country,shows
0,United States,3297
1,India,990
2,United Kingdom,723
3,,507
4,Canada,412
...,...,...
114,Belarus,1
115,Kazakhstan,1
116,Angola,1
117,Samoa,1


Delete the row for missing countries (index = 3).

In [17]:
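# drop() returns a new DataFrame, so assign the result back to keep the change
shows_by_country = shows_by_country.drop([3])
shows_by_country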

Unnamed: 0,country,shows
0,United States,3297
1,India,990
2,United Kingdom,723
4,Canada,412
5,France,349
...,...,...
114,Belarus,1
115,Kazakhstan,1
116,Angola,1
117,Samoa,1
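
Dropping by position depends on where the placeholder row happens to land after sorting. A sketch of a more robust variant filters on the value instead, assuming missing countries were filled with the string "null" as in cell 15. The frame `toy_counts` and the name `known` are illustrative.

```python
import pandas as pd

# Toy stand-in for shows_by_country (assumption: illustrative data only).
toy_counts = pd.DataFrame({
    'country': ['United States', 'India', 'null', 'Canada'],
    'shows': [3297, 990, 507, 412],
})

# Keep every row whose country is not the placeholder, wherever that row sits.
known = toy_counts[toy_counts['country'] != 'null'].reset_index(drop=True)
print(known)
```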


In [18]:
netflix_country = netflix_country.sort_values(by='date_added',ascending=False)
netflix_country['n'] = 1
date_country = netflix_country.groupby(['country','date_added']).sum().groupby(level=0).cumsum().reset_index()
date_country.sort_values(by = 'n')


Unnamed: 0,country,date_added,n
0,,2014-11-07,1
2754,Singapore,2017-05-01,1
2748,Serbia,2016-07-01,1
2745,Senegal,2017-09-01,1
2735,Saudi Arabia,2017-10-12,1
...,...,...,...
5001,United States,2021-01-10,3279
5002,United States,2021-01-11,3280
5003,United States,2021-01-13,3281
5004,United States,2021-01-15,3288


The cumulative count in the last row (index=5005) for 'United States' should be 3297, but it does not reach that value. There may be a problem in the 'date_added' column.

In [19]:
netflix_country[netflix_country['country']=='United States']

Unnamed: 0,country,date_added,n
238,United States,2021-01-16,1
2032,United States,2021-01-16,1
6126,United States,2021-01-16,1
8567,United States,2021-01-15,1
3296,United States,2021-01-15,1
...,...,...,...
2749,United States,NaT,1
2779,United States,NaT,1
4241,United States,NaT,1
4767,United States,NaT,1


The rows where date_added is NaT are the likely cause: groupby drops rows whose group key is missing, so they never reach the cumulative sum.

In [20]:
date_country[date_country['date_added'].isnull()]

Unnamed: 0,country,date_added,n


In [21]:
netflix_country[netflix_country['date_added'].isnull()]

Unnamed: 0,country,date_added,n
311,United Kingdom,NaT,1
660,United States,NaT,1
2749,United States,NaT,1
2779,United States,NaT,1
3120,Japan,NaT,1
4095,,NaT,1
4241,United States,NaT,1
4767,United States,NaT,1
6213,United States,NaT,1
7330,Australia,NaT,1
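
A small toy example shows why these rows matter: by default `groupby` leaves out any row whose key is NaT, so the grouped total comes up short until the missing dates are filled. The frame `toy` and the names `grouped` and `filled` are illustrative.

```python
import pandas as pd

# Three United States rows, one of them without a date (assumption: illustrative data only).
toy = pd.DataFrame({
    'country': ['United States', 'United States', 'United States'],
    'date_added': pd.to_datetime(['2021-01-15', '2021-01-16', None]),
    'n': [1, 1, 1],
})

# The NaT row is silently dropped from the groups, so only 2 of 3 shows are counted.
grouped = toy.groupby(['country', 'date_added'])['n'].sum()
print(grouped.sum())

# After filling NaT with a placeholder date, all 3 shows are counted.
filled = toy.assign(date_added=toy['date_added'].fillna(pd.Timestamp('2014-01-01')))
filled_total = filled.groupby(['country', 'date_added'])['n'].sum().sum()
print(filled_total)
```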


Replace NaT with a fixed placeholder date (2014-01-01) so that the groupby keeps those rows.

In [22]:
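# Assign the result back instead of calling fillna(inplace=True) on a column selection
netflix_country['date_added'] = netflix_country['date_added'].fillna(pd.Timestamp('2014-01-01'))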
date_country = netflix_country.groupby(['country','date_added']).sum().groupby(level=0).cumsum().reset_index()
pd.DataFrame(date_country.groupby(['country'])['n'].max()).sort_values('n',ascending = False)

country,n
United States,3297
India,990
United Kingdom,723
,507
Canada,412
...,...
East Germany,1
Dominican Republic,1
Montenegro,1
Somalia,1


It worked.
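
The two-step groupby used for `date_country` can be checked on a toy frame: the first groupby counts shows per country and date, and the second turns those counts into a running total within each country. The frame `toy` and the name `running` are illustrative.

```python
import pandas as pd

# Toy stand-in for netflix_country (assumption: illustrative data only).
toy = pd.DataFrame({
    'country': ['India', 'India', 'India', 'Canada'],
    'date_added': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-02-01', '2020-01-15']),
    'n': 1,
})

running = (
    toy.groupby(['country', 'date_added'])['n'].sum()   # shows added per country and date
    .groupby(level=0).cumsum()                          # running total within each country
    .reset_index()
)
print(running)
```

The largest `n` per country equals that country's total number of shows, which is the check applied to the real data above.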

Visualize the cumulative number of shows over time for the top 10 countries.

In [23]:
top10_country = shows_by_country[:10].set_index('country')
dc = date_country[date_country['country'].isin(top10_country.index.tolist())]
fig = px.line(dc,x='date_added',y='n',color = 'country')
fig.show()
