![](https://torranceca.files.wordpress.com/2019/10/netflix.jpg)


# Netflix Movies & TV shows Analysis

As we all know, NETFLIX is a subscription-based streaming service offering online streaming from a library of films and television series, including those produced in-house.

## Let's do Exploratory Data Analysis on NETFLIX trends.

## Importing Useful Libraries

To begin, we will import some basic but powerful Python libraries.

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt

## Loading Datasets
DataSource : [Kaggle](http://www.kaggle.com/shivamb/netflix-shows)

Now we are ready to load the dataset from the DataSource. We will look for the data file in the input directory and use read_csv() to get a glimpse of the data.

In [None]:
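# Walk the Kaggle input directory, print every file and collect the CSV paths
csv_paths = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        path = os.path.join(dirname, filename)
        print(path)
        if filename.endswith('.csv'):
            csv_paths.append(path)
# Assumes the input directory holds a single CSV file
df = pd.read_csv(csv_paths[0])
df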

After a first glimpse of the dataset, a few points stand out:
1. Some columns contain NaN/Null values.
2. Some columns hold comma-separated values, which may make it harder to find insights.
3. The data in some columns does not seem useful.

Let's dig deeper.

### Data Preparation and Cleaning



In [None]:
df.info()

There are 12 columns and 7786 records in total in our dataframe.

We can also get the number of rows and columns of the dataset by using shape.

In [None]:
df.shape

### Handling NaN/Null values

It seems we have a lot of unique values in our dataframe. The show_id column has all unique values, as it serves as the unique key for the dataset.
The type column has a count of 2, so there may be only 2 types of shows in our dataset. Let's check.

In [None]:
df.nunique()

As expected, the type column has only 2 unique values. Let's see what they are.

In [None]:
df.type.unique()

We were right. We have 2 types of shows in our dataframe, i.e. Movies and TV shows.

### Checking number of NaN/Null values in series

In [None]:
missing_values = df.isna().sum()
missing_values

### Finding percentage of NaN/Null values in series

In [None]:
missing_percent = (missing_values.sort_values(ascending= False)/len(df))*100
missing_percent

About 30% of the values in the director column and 9% in the cast column are Null, so let's not consider these columns in our analysis.
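The count and percentage steps above can be combined into one small helper. This is a minimal sketch on a made-up toy frame; the name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up)
toy = pd.DataFrame({
    'show_id': ['s1', 's2', 's3', 's4'],
    'director': ['A', None, None, 'B'],
    'cast': ['X', 'Y', None, 'Z'],
})

def missing_summary(frame):
    """Return missing-value counts and percentages per column, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': counts / len(frame) * 100,
    })
    return summary.sort_values('percent', ascending=False)

summary = missing_summary(toy)
print(summary)
```

Having both numbers side by side makes it easier to decide which columns to leave out.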

### Filling Nan/Null values with default values

To include the null values in our counts as well, let's assign them a default value.

In [None]:
df = df.fillna('Unknown')  # fillna returns a new DataFrame, so assign it back
df

NaN/Null values are now filled with 'Unknown' value.
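One detail worth keeping in mind: by default, `fillna` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept. A minimal sketch on a made-up toy frame (the names `toy` and `filled` are illustrative only):

```python
import pandas as pd

# Toy frame with two missing values (made up for illustration)
toy = pd.DataFrame({'director': ['A', None], 'country': [None, 'India']})

toy.fillna('Unknown')                      # result is discarded, `toy` is unchanged
still_missing = int(toy.isna().sum().sum())

filled = toy.fillna('Unknown')             # keep the returned frame
remaining = int(filled.isna().sum().sum())

print(still_missing, remaining)
```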

In [None]:
df['title'][:20]

It seems that the title column doesn't contain correct/valuable data. Let's drop this column from the analysis too.

As we have noticed, a few columns contain comma-separated values.

Let's take country as an example. How can we get a correct count of released shows per country when the values are comma-separated?

In [None]:
df['country'] = df['country'].str.split(',')
df_country = df.explode('country')
df_country

Now we have one row per show and country.

If a show was released/telecast in 3 countries, we need 3 separate entries to count releases per country.

The new dataset has 9574 rows and 12 columns.
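To make the split-and-explode idea concrete, here is a minimal sketch on a made-up toy frame. The data and the names `toy`, `exploded` and `per_country` are assumptions for illustration; stripping whitespace after the split keeps ' India' and 'India' from being counted separately.

```python
import pandas as pd

# Toy data: three shows, some released in several countries (made up)
toy = pd.DataFrame({
    'show_id': ['s1', 's2', 's3'],
    'country': ['United States, India', 'India', 'United States, Canada, India'],
})

# Split the comma-separated string into a list, then create one row per list item
exploded = (
    toy.assign(country=toy['country'].str.split(','))
       .explode('country')
       .reset_index(drop=True)
)

# Remove the leading space left behind by splitting on ','
exploded['country'] = exploded['country'].str.strip()

per_country = exploded.groupby('country')['show_id'].count().sort_values(ascending=False)
print(per_country)
```

Each show now contributes one row per country, so a simple group-by count gives the number of releases per country.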

We would like to do the same analysis on the listed_in column as we did for the country column.

In [None]:
df['listed_in'] = df['listed_in'].str.split(',')
df_listed_in = df.explode('listed_in')
df_listed_in

## EXPLORATORY DATA ANALYSIS AND VISUALIZATION

Let's do some data visualization using a few basic graphs.

In [None]:
df_country['country'] = df_country['country'].str.strip()
df1 = df_country.groupby(['country'])['show_id'].count().sort_values(ascending = False)/len(df_country)*100
df1[:20]

This gives us the top 20 countries by share of releases.

- The United States is first and India is second, with 3297 and 990 shows respectively. Note that df1 expresses these counts as a percentage of the rows in df_country.

In [None]:
plt.figure(figsize=(14, 6))
df1[:20].plot(kind = 'bar', color='r')
plt.xlabel('Country')
plt.ylabel('% of releases')
plt.title("% of releases per country")
plt.show()


Let's check which genres have the most releases.

In [None]:
df_listed_in[:50]

In [None]:
df_listed_in['listed_in'] = df_listed_in['listed_in'].str.strip()
df2 = df_listed_in['listed_in'].value_counts().sort_values(ascending = False).head(20)/len(df_listed_in)*100
df_listed_sort = df2.sort_values()
df_listed_sort

This shows which genres have the most releases.

- 'International Movies' has the most releases, followed by 'Dramas'.

In [None]:
plt.figure(figsize=(14, 6))
df_listed_sort.plot(kind = 'barh', color='r')
plt.xlabel('% of releases')
plt.ylabel('Categories')
plt.title("% of releases per category")
plt.show()


### Ratings

In [None]:
df.rating.value_counts().sort_values(ascending=False)

Now let's dive into the ratings and find out which audience most of the content on Netflix is aimed at ;)

In [None]:
def name_(row):
    if row == 'TV-MA':
        row = 'TV-MA - Mature Audience Only'
    elif row == 'G':
        row = 'G – General Audiences'
    elif row == 'TV-PG':
        row = 'TV-PG – Parental Guidance Suggested'
    elif row == 'PG':
        row = 'PG – Parental Guidance Suggested'
    elif row == 'PG-13':
        row = 'PG-13 – Parents Strongly Cautioned'
    elif row == 'R':
        row = 'R – Restricted'
    elif row == 'NC-17':
        row = 'NC-17 – No children under 17'
    elif row == 'TV-Y':
        row = 'TV-Y - All Children'
    elif row == 'TV-Y7':
        row = 'TV-Y7 - Directed to Older Children above 7'
    elif row == 'TV-Y7-FV':
        row = 'TV-Y7-FV - Directed to Older Children - Fantasy Violence'
    elif row == 'TV-14':
        row = 'TV-14 - Parents Strongly Cautioned'
    elif row == 'TV-G':
        row = 'TV-G: General Audience'
    elif row in ('UR', 'NR'):
        # `row == 'UR' or 'NR'` is always truthy, so test membership instead
        row = 'UR - Unrated'
    else:
        row = row
    return row
df['rating']=df['rating'].apply(name_)

df_rating = df['rating'].value_counts().sort_values(ascending=True)
df_rating

So here is an interesting finding:

- Most of the content on Netflix is rated for 'Mature Audience' only.
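The rating mapping used above can also be written as a dictionary lookup, which is shorter than a long if/elif chain. The sketch below is only an illustration: `RATING_LABELS` and `label_rating` are hypothetical names, and the table holds just a subset of the labels. It also shows why a condition such as `row == 'UR' or 'NR'` needs care in Python.

```python
import pandas as pd

# Subset of the rating labels, for illustration only; extend as needed
RATING_LABELS = {
    'TV-MA': 'TV-MA - Mature Audience Only',
    'TV-14': 'TV-14 - Parents Strongly Cautioned',
    'UR': 'UR - Unrated',
    'NR': 'UR - Unrated',
}

def label_rating(code):
    """Return the descriptive label, or the code itself when it is not in the table."""
    return RATING_LABELS.get(code, code)

ratings = pd.Series(['TV-MA', 'NR', 'TV-14', 'PG', 'TV-MA'])
labelled = ratings.apply(label_rating)
print(labelled.value_counts())

# Pitfall: a non-empty string is truthy, so this condition holds for any code
always_true = bool('PG' == 'UR' or 'NR')
print(always_true)
```

Codes that are missing from the table, such as 'PG' here, pass through unchanged instead of being relabelled.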

In [None]:
plt.figure(figsize=(14, 6))
df_rating.plot(kind = 'barh', color='r')
plt.xlabel('Releases')
plt.ylabel('Rating')
plt.legend(title = "Release",loc="lower right", fontsize = 10)
plt.title("Nb of releases as per ratings")
plt.show()


I'm also curious about the impact of the pandemic on Netflix.
Let's check the number of shows released in the last few years, especially 2020 and 2021.

In [None]:
df_release_year = df['release_year'].value_counts().sort_index(ascending=False).head(20)
df_release_year.head(20)


Over the last 20 years, the amount of content grew year by year until 2018 and then went down.
So yes, the pandemic may have affected Netflix as well, although the decline already begins in 2019.

- The most content was released in 2018, with 1121 shows.
- It was 996 in 2019.
- It was 868 in 2020.
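To put the decline in perspective, the year-over-year change can be worked out from the three counts quoted above. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Counts per release year, as quoted in the text above
releases = {2018: 1121, 2019: 996, 2020: 868}

# Percentage change of each year relative to the previous one
years = sorted(releases)
yoy = {
    year: (releases[year] - releases[prev]) / releases[prev] * 100
    for prev, year in zip(years, years[1:])
}

for year, change in yoy.items():
    print(f"{year}: {change:.1f}%")
```

Both years show a drop of a similar size, roughly 11% in 2019 and 13% in 2020.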


In [None]:
plt.figure(figsize=(14, 6))
df_release_year.plot(kind = 'bar', color = 'r')
plt.xlabel('Release year')
plt.ylabel('Nb of releases')
plt.legend(title = "Release",loc="lower right", fontsize = 10)
plt.title("Nb of releases per year")
plt.show()


Here is a different way to get the number of shows released per year, using the df_country dataframe.

In [None]:
df_country.drop_duplicates(subset=['show_id'])['release_year'].value_counts().sort_index(ascending = False)

Earlier we checked the types of shows on Netflix. Before ending, let's check the percentage of movies and TV shows.

In [None]:
plt.figure(figsize=(14, 6))
plt.title("% of Netflix Titles that are either Movies or TV Shows",color = 'w')
g = plt.pie(df.type.value_counts(), explode=(0.025,0.025), labels=df.type.value_counts().index, colors=['Crimson','IndianRed'],autopct='%1.1f%%', startangle=180,textprops={'color':"w"});
plt.legend()
plt.show()

69% of the shows on Netflix are 'Movies'.

31% of the shows are 'TV shows'.
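The same percentages can be read off as numbers, without a chart, by passing `normalize=True` to `value_counts`. A minimal sketch on made-up data (the toy Series is an assumption and does not reproduce the real split):

```python
import pandas as pd

# Toy 'type' column (made up); the real column comes from the dataset
types = pd.Series(['Movie', 'Movie', 'TV Show', 'Movie'], name='type')

# normalize=True returns fractions instead of counts
share = types.value_counts(normalize=True) * 100
print(share)
```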

## Ask and Answer questions

1. Which are the top 5 countries with the most releases?
2. Which are the top 5 genres with the most released shows?
3. Under which rating are most shows released?
4. Which year had the most shows released in the last 20 years?
5. How many releases are there for each type of show?
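Most of these questions come down to "which values occur most often in a column", so one small helper can answer them. The sketch below runs on a made-up toy frame; `top_n` is a hypothetical helper name. For countries and genres, the exploded and stripped frames from earlier would be the natural input.

```python
import pandas as pd

def top_n(series, n=5):
    """Return the n most frequent values of a Series with their counts."""
    return series.value_counts().head(n)

# Toy frame (made up) with the same column names as the dataset
toy = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie', 'Movie'],
    'rating': ['TV-MA', 'TV-14', 'TV-MA', 'PG'],
    'release_year': [2018, 2018, 2019, 2018],
})

print(top_n(toy['rating'], 1))        # most common rating
print(top_n(toy['release_year'], 1))  # year with the most releases
print(top_n(toy['type']))             # releases per type
```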


    

## Summary & Conclusions

It's clear that Netflix has grown over the years. The data suggests that the company took certain approaches in its marketing strategy to break into new markets around the world. Based on an article from Business Insider, Netflix had about 158 million subscribers worldwide, with 60 million from the US and almost 98 million internationally. Following its IPO, Netflix's subscriber base was located solely in the United States. A large part of its success was due to the decision to expand to international markets. Its most popular markets influence which content the company prioritizes. In this case, we can see that a good number of international movies and TV shows were added over the years as part of Netflix's global expansion. Most of the content is rated for a Mature Audience (TV-MA), which suggests that most Netflix viewers are adults. Netflix may also have been affected by Covid: over the last 20 years, content increased year by year until 2018 and has declined since.

In [None]:
!pip install jovian --upgrade -q