<a href="https://colab.research.google.com/github/satijagunika/Netflix-analysis-using-python/blob/main/Netflix_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the required libraries

In [55]:
import numpy as np
import pandas as pd
import plotly.express as px
from textblob import TextBlob
import matplotlib.pyplot as plt
import seaborn as sns

## **Fetching raw data**


In [56]:
# Using pandas to load the CSV file for analysis
df = pd.read_csv('netflix_titles.csv')

 Total number of elements (cells) in a DataFrame

In [76]:
df.size  # total cell count (rows × columns), including cells with missing values

105684

Number of rows and columns in the DataFrame, essential for understanding the structure and size of the data.

In [58]:
df.shape

(8807, 12)
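
The two figures are related: `size` is the product of the two entries in `shape` (8807 × 12 = 105684 here). A minimal sketch on a small made-up frame, whose column names and values are purely illustrative and not taken from the Netflix file:

```python
import pandas as pd

# Hypothetical toy frame (not the Netflix data): 3 rows, 2 columns, one missing cell
toy = pd.DataFrame({"title": ["A", "B", "C"], "director": ["X", None, "Z"]})

n_rows, n_cols = toy.shape
print(toy.shape)  # (3, 2)
print(toy.size)   # 6: the missing cell is counted too
```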

 For quick view of the first few rows of the DataFrame

In [59]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


# Visualizing the Distribution of Content Rating with a Pie Chart



In [60]:
x = df.groupby('rating').size().reset_index(name='count')
cont_rating = px.pie(x,
                     values='count',
                     names='rating',
                     title = 'Distribution of Content Rating',
                     width = 700,
                     height=500)
title_font = dict(size = 20,family = 'Arial',color = 'black')
# A pie chart has no axes, so only the title font needs to be applied
cont_rating.update_layout(title_font = title_font)
cont_rating.show()


Filling Missing Values in Multiple Columns

In [61]:
df['director'] = df['director'].fillna('No Director Specified')
df['cast'] = df['cast'].fillna('No Cast Specified')
df['country'] = df['country'].fillna('No Country Specified')
df['date_added'] = df['date_added'].fillna('No Date Specified')
df['rating'] = df['rating'].fillna('No Rating Specified')
df['duration'] = df['duration'].fillna('No Duration Specified')
df['listed_in'] = df['listed_in'].fillna('No Listed In Specified')
df['description'] = df['description'].fillna('No Description Specified')
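
The column-by-column assignments above could also be written as a single `fillna` call that takes a dictionary mapping each column to its placeholder. This is only a sketch on a hypothetical two-column frame; the placeholder strings mirror the ones used in the cell above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data
toy = pd.DataFrame({
    "director": ["Kirsten Johnson", None],
    "country": [None, "South Africa"],
})

# One placeholder per column, applied in a single call
placeholders = {
    "director": "No Director Specified",
    "country": "No Country Specified",
}
filled = toy.fillna(value=placeholders)
print(filled)
```

Columns that are not listed in the dictionary are left untouched, so the mapping also documents exactly which columns were filled.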


In [62]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,No Cast Specified,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,No Director Specified,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",No Country Specified,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,No Director Specified,No Cast Specified,No Country Specified,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,No Director Specified,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


# Splitting and Stacking Director's Names from 'director' Column

In [63]:
# df_exploded = df.explode('director')
# df_exploded['director list'] = df_exploded['director'].str.split(', ')
# # director_list.column = ['Directors']
# # print(director_list)


In [64]:
# Split each comma-separated cell into columns, then stack them into one long column
director_list = df['director'].str.split(', ', expand=True).stack()
director_list = director_list.to_frame()
director_list.columns = ['Directors']
print(director_list)


                    Directors
0    0        Kirsten Johnson
1    0  No Director Specified
2    0        Julien Leclercq
3    0  No Director Specified
4    0  No Director Specified
...                       ...
8802 0          David Fincher
8803 0  No Director Specified
8804 0        Ruben Fleischer
8805 0           Peter Hewitt
8806 0            Mozez Singh

[9612 rows x 1 columns]
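
The commented-out cell above experiments with `explode`. For comparison, here is a minimal sketch of how a split followed by `explode` can produce the same one-name-per-row layout, assuming a hypothetical two-row column. Unlike `stack`, it keeps a single-level index that repeats the original row label.

```python
import pandas as pd

# Hypothetical toy column; the second entry lists two directors
toy = pd.DataFrame({"director": ["Kirsten Johnson", "Robert Cullen, José Luis Ucha"]})

# split() turns each cell into a list; explode() gives each list item its own row
directors_long = (
    toy["director"]
    .str.split(", ")
    .explode()
    .rename("Directors")
    .to_frame()
)
print(directors_long)
```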


In [65]:
director_list = director_list[director_list['Directors'] != 'No Director Specified']
print(director_list)

              Directors
0    0  Kirsten Johnson
2    0  Julien Leclercq
5    0    Mike Flanagan
6    0    Robert Cullen
     1   José Luis Ucha
...                 ...
8801 0  Majid Al Ansari
8802 0    David Fincher
8804 0  Ruben Fleischer
8805 0     Peter Hewitt
8806 0      Mozez Singh

[6978 rows x 1 columns]


In [66]:
directors = director_list.groupby(['Directors']).size().reset_index(name='Total count')
print(directors)

                Directors  Total count
0             A. L. Vijay            2
1            A. Raajdheep            1
2               A. Salaam            1
3         A.R. Murugadoss            2
4         Aadish Keluskar            1
...                   ...          ...
4988           Éric Warin            1
4989     Ísold Uggadóttir            1
4990  Óskar Thór Axelsson            1
4991     Ömer Faruk Sorak            3
4992         Şenol Sönmez            2

[4993 rows x 2 columns]


In [67]:
directors = directors.sort_values(by=['Total count'])
print(directors)

             Directors  Total count
4554   Taylor Hackford            1
2692  Lionel C. Martin            1
2693       Lisa Arnold            1
2694       Lisa Cortés            1
2695      Liu Bang-yao            1
...                ...          ...
2866      Marcus Raboy           16
4457       Suhas Kadav           16
3800       Raúl Campos           19
1906         Jan Suter           21
3749     Rajiv Chilaka           22

[4993 rows x 2 columns]


In [68]:
directors = directors.sort_values('Total count', ascending = False)
print(directors)

                Directors  Total count
3749        Rajiv Chilaka           22
1906            Jan Suter           21
3800          Raúl Campos           19
4457          Suhas Kadav           16
2866         Marcus Raboy           16
...                   ...          ...
3086  Michael James Regan            1
3087  Michael John Warren            1
3114       Michael Seater            1
3113      Michael Schmitt            1
4809        Vincent Perez            1

[4993 rows x 2 columns]


In [69]:
top5directors = directors.head(5)
print(top5directors)

          Directors  Total count
3749  Rajiv Chilaka           22
1906      Jan Suter           21
3800    Raúl Campos           19
4457    Suhas Kadav           16
2866   Marcus Raboy           16
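
The group, sort and head steps above can also be collapsed into one chain with `value_counts`, which returns counts already sorted in descending order. A sketch on made-up names (the frame and the cutoff of 2 are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data in the same one-column layout as director_list
toy = pd.DataFrame({"Directors": ["A", "B", "A", "C", "A", "B"]})

top2 = (
    toy["Directors"]
    .value_counts()              # counts per name, largest first
    .head(2)                     # keep the two most frequent names
    .rename_axis("Directors")
    .reset_index(name="Total count")
)
print(top2)
```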


Creating a Bar Chart for Top 5 Directors by Total Count

In [87]:
directors_5 = px.bar(top5directors,
                     x='Directors',
                     y='Total count',
                     title = 'Top 5 Directors',
                     width = 700,
                     height=500,
                     color_discrete_sequence =['violet'])
title_font = dict(size = 20,family = 'Garamond',color = 'black')# Customizing Title Font
directors_5.update_layout(title_font = title_font)  # apply the custom title font
directors_5.show()

# Extracting and Stacking Actor's Names from 'cast' Column


In [45]:
# Note: splitting on ',' (without the space) leaves a leading space on every name after the first
cast_df = df['cast'].str.split(',', expand = True).stack()
cast_df = cast_df.to_frame()
cast_df.columns = ['Actors']
print(cast_df)

                        Actors
0    0       No Cast Specified
1    0              Ama Qamata
     1             Khosi Ngema
     2           Gail Mabalane
     3          Thabang Molaba
...                        ...
8806 3        Manish Chaudhary
     4            Meghna Malik
     5           Malkeet Rauni
     6          Anita Shabdish
     7   Chittaranjan Tripathy

[64951 rows x 1 columns]


In [46]:
Actors = cast_df.groupby(['Actors']).size().reset_index(name = 'Total count')
Actors = Actors.sort_values(by = ['Total count'])
Actors = Actors[Actors.Actors != 'No Cast Specified']
Actors = Actors.sort_values('Total count', ascending = False)
Actors = Actors.head(5)

print(Actors)

                  Actors  Total count
2612         Anupam Kher           39
26941       Rupa Bhimani           31
30303   Takahiro Sakurai           30
15541      Julie Tejwani           28
23624            Om Puri           27
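
One caveat about the counts above: the `cast` column separates names with a comma and a space, while the split uses `','` only, so a name listed first in one title and later in another is stored as two different strings. The sketch below illustrates the effect on a hypothetical two-row column and shows `str.strip()` as one possible remedy; whether it changes the top five shown above has not been checked here.

```python
import pandas as pd

# Hypothetical toy cast column: each actor is listed first in one row, second in the other
toy = pd.DataFrame({"cast": ["Om Puri, Anupam Kher", "Anupam Kher, Om Puri"]})

raw = toy["cast"].str.split(",", expand=True).stack()
clean = raw.str.strip()  # drop the leading space left by the ", " separator

print(raw.nunique())    # 4 distinct strings, e.g. "Om Puri" and " Om Puri"
print(clean.nunique())  # 2 distinct names
```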


# Creating a Bar Chart for Top 5 Actors by Total Count


In [95]:
actors_5 = px.bar(Actors,
                 x = 'Actors',
                 y = 'Total count',
                 title = 'Top 5 Actors',
                 width = 700,
                 height=500,
                 color_discrete_sequence=['crimson'])
actors_5.show()

# Grouping and Counting Content Types by Release Year


In [50]:
df1 = df[['type','release_year']]
df1 = df1.rename(columns = {'release_year':'Release Year','type':'Type'})
# print(df1)
df2 = df1.groupby(['Type', 'Release Year']).size().reset_index(name = 'Total count')
print(df2)

        Type  Release Year  Total count
0      Movie          1942            2
1      Movie          1943            3
2      Movie          1944            3
3      Movie          1945            3
4      Movie          1946            1
..       ...           ...          ...
114  TV Show          2017          265
115  TV Show          2018          380
116  TV Show          2019          397
117  TV Show          2020          436
118  TV Show          2021          315

[119 rows x 3 columns]


In [96]:
# Filtering and Pivoting Data for Content Types Released Since 2000
df2 = df2[df2['Release Year']>=2000]
df2_pivot = df2.pivot(index = 'Release Year', columns = 'Type', values = 'Total count')
print(df2_pivot)

Type          Movie  TV Show
Release Year                
2000             33        4
2001             40        5
2002             44        7
2003             51       10
2004             55        9
2005             67       13
2006             82       14
2007             74       14
2008            113       23
2009            118       34
2010            154       40
2011            145       40
2012            173       64
2013            225       63
2014            264       88
2015            398      162
2016            658      244
2017            767      265
2018            767      380
2019            633      397
2020            517      436
2021            277      315
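
A year-by-type table like the one above can also be built directly from the raw rows with `pd.crosstab`, skipping the intermediate `groupby` and `pivot`. This is a sketch on hypothetical rows; note that `crosstab` fills combinations that never occur with 0, whereas `pivot` leaves them as missing values.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as df1
toy = pd.DataFrame({
    "Type": ["Movie", "Movie", "TV Show", "Movie"],
    "Release Year": [2020, 2020, 2020, 2021],
})

# Rows are release years, columns are content types, cells are title counts
counts = pd.crosstab(toy["Release Year"], toy["Type"])
print(counts)
```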


Visualizing the Trend of Content Presented on Netflix Over Release Years


In [52]:
df2_graph = px.line(df2,
                    x = 'Release Year',
                    y = 'Total count',
                    color = 'Type',
                    height = 500,
                    width = 700,
                    title = 'Trend of content presented on Netflix')
df2_graph.show()

## Sentiment Analysis Using the Provided Descriptions


In [72]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,No Cast Specified,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,No Director Specified,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",No Country Specified,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,No Director Specified,No Cast Specified,No Country Specified,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,No Director Specified,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


# Performing Sentiment Analysis on Netflix Descriptions and Visualizing Results


In [97]:
# Creating new variable with required columns in the dataset
df3 = df[['release_year','description']]
df3 = df3.rename(columns = {'release_year':'Release Year',
                            'description':'Description'})
# Label each description as Positive, Neutral or Negative by the sign of its TextBlob polarity
for index, row in df3.iterrows():
    analysis = TextBlob(row['Description'])
    polarity = analysis.sentiment.polarity
    if polarity > 0:
        df3.loc[index, 'Sentiment'] = 'Positive'
    elif polarity == 0:
        df3.loc[index, 'Sentiment'] = 'Neutral'
    else:
        df3.loc[index, 'Sentiment'] = 'Negative'

df3 = df3.groupby(['Release Year','Sentiment']).size().reset_index(name = 'Total count')
df3_pivot = df3.pivot(index = 'Release Year', columns = 'Sentiment', values = 'Total count')
# print(df3_pivot)
df3 = df3[df3['Release Year']>2005]
# print(df3)

# Visualizing Sentiment Analysis Results of Descriptions
bargraph = px.bar(df3,
                  x = 'Release Year',
                  y = 'Total count',
                  color = 'Sentiment',
                  color_discrete_sequence=['rebeccapurple','burlywood','springgreen'],
                  height = 500,
                  width = 700,
                  title = 'Sentiment Analysis of Description')
bargraph.show()
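
The labeling rule in the loop above depends only on the sign of the polarity score, which TextBlob reports in the range -1 to 1. As a sketch, the rule can be pulled out into a small helper and applied with `Series.map` instead of `iterrows`. The helper name `label_polarity` and the scores below are hypothetical stand-ins for `TextBlob(text).sentiment.polarity`.

```python
import pandas as pd

def label_polarity(polarity: float) -> str:
    """Map a polarity score to the three labels used in the notebook."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# Hypothetical polarity scores, not computed from real descriptions
scores = pd.Series([0.5, 0.0, -0.25])
labels = scores.map(label_polarity)
print(labels.tolist())  # ['Positive', 'Neutral', 'Negative']
```

Keeping the rule in one function makes it easy to test on its own and to adjust later, for example by treating scores very close to zero as neutral.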

## Insights from the Analysis of Netflix Data
(1). Among titles released in 2021, TV shows (315) outnumber movies (277); in every earlier release year since 2000, movies outnumber TV shows.

(2). Descriptions with positive sentiment are the most common on Netflix.

(3). The largest share of content is rated for Mature Audiences Only: intended for adults and possibly unsuitable for children under 17.