### Introduction

Netflix is a streaming service that hosts tens of thousands of TV shows, movies, documentaries and more on internet-connected devices.

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.
**The objective of this notebook is to explore the dataset and extract interesting and meaningful insights from it.**


### Importing the libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Loading and inspecting the dataset

In [None]:
df = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
df.head()

In [None]:
df.shape

### Data Cleaning

In [None]:
# Counting the number of missing values in each column 

df.isnull().sum()

Clearly, there are missing values in certain columns. We can use `seaborn.heatmap` to visualize where these missing values occur.

In [None]:
sns.heatmap(df.isnull(), cmap = 'Reds')
plt.title('Heatmap of null values')
plt.show()
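
As a numeric complement to the heatmap, the share of missing values per column can be computed directly. This is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the Netflix file); on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Netflix frame; all values are invented
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['X', np.nan, np.nan, np.nan],
    'rating': ['TV-MA', 'TV-14', np.nan, 'NR'],
})

# isnull() yields booleans, so the column mean is the fraction missing
missing_pct = toy.isnull().mean() * 100
print(missing_pct.sort_values(ascending = False))
```

Percentages make it easier to compare columns than raw counts do, especially once rows have been dropped.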

### Dealing With Missing Values

From the above count of missing values and its visualization, we can see that
* the 'director', 'cast' and 'country' columns have numerous missing values, in varying amounts
* the 'date_added' and 'rating' columns have only a few missing values

Now let's deal with the missing values:

* 'director' and 'cast': these columns are dropped entirely, as my analysis doesn't require them.
* 'rating': since very few values are missing, I will fill them in by looking up the appropriate ratings online.
* 'date_added': few values are missing, but they are difficult to find online, so the affected rows are dropped.
* 'country': missing values are filled with the modal value.
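
The four steps listed above can be sketched end to end on a tiny frame. Everything in `toy`, as well as the `manual_ratings` lookup, is invented for illustration; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; all values are invented
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['X', np.nan, 'Y', np.nan],
    'cast': [np.nan, 'P', 'Q', np.nan],
    'country': ['India', 'Japan', 'India', np.nan],
    'date_added': ['January 1, 2019', np.nan, 'March 5, 2018', 'May 2, 2017'],
    'rating': ['TV-MA', 'TV-14', np.nan, 'NR'],
})

# 1. Drop the columns the analysis does not need
cleaned = toy.drop(columns = ['director', 'cast'])

# 2. Fill the few missing ratings by hand, keyed by row label
manual_ratings = {2: 'TV-PG'}  # hypothetical looked-up value
for idx, rate in manual_ratings.items():
    cleaned.loc[idx, 'rating'] = rate

# 3. Drop the rows whose 'date_added' is missing
cleaned = cleaned.dropna(subset = ['date_added'])

# 4. Fill missing countries with the most frequent remaining value
cleaned['country'] = cleaned['country'].fillna(cleaned['country'].mode()[0])

print(cleaned.isna().sum())
```

Filling by label with `.loc` keeps the rating step independent of column order, which matters once columns have been dropped.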

In [None]:
# Dropping 'director' and 'cast' columns

df = df.drop(['director', 'cast'], axis = 1)
df.columns

In [None]:
# Finding the rows where the 'rating' value is missing

df[df['rating'].isna()]

In [None]:
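# Ratings looked up manually, keyed by row index
rating_replacements = {
    67: 'TV-PG',
    2359: 'TV-14',
    3660: 'TV-MA',
    3736: 'TV-MA',
    3737: 'NR',
    3738: 'TV-MA',
    4323: 'TV-MA'
}

# Assign by column label rather than by a hard-coded column position
for idx, rate in rating_replacements.items():
    df.loc[idx, 'rating'] = rate

df['rating'].isna().sum()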


In [None]:
# Dropping the rows with null values in the 'date_added' column

df = df[df['date_added'].notnull()]

In [None]:
# Filling country with modal value

df['country'] = df['country'].fillna(df['country'].mode()[0])

In [None]:
# Checking the cleaned data

df.isna().sum()

### Data Transformations

In [None]:
df.dtypes

In [None]:
# 'date_added' column should be in datetime format

df['date_added'] = pd.to_datetime(df['date_added'])


# 'release_year' column should be in datetime format

df['release_year'] = pd.to_datetime(df['release_year'], format = '%Y')

In [None]:
df.dtypes
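
The two conversions can be tried in isolation on a few sample values. The strings below are invented, and the leading space in the second one is an assumption about how raw entries might look, not something shown above; stripping whitespace first is a harmless precaution so that every entry parses with the same format.

```python
import pandas as pd

# Hypothetical raw values; the second one has a stray leading space
raw_dates = pd.Series(['September 9, 2019', ' November 1, 2016'])
raw_years = pd.Series([2019, 2016])

# Strip surrounding whitespace before parsing the date strings
parsed_dates = pd.to_datetime(raw_dates.str.strip())

# A bare year becomes January 1 of that year when parsed with '%Y'
release_years = pd.to_datetime(raw_years, format = '%Y')

print(parsed_dates.dtype, release_years.dtype)
```

Note that converting 'release_year' this way stores a full date, so later code needs `.dt.year` to recover the plain year.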

In [None]:
df1 = df.head(3)

### Visualizations

**Content Types Comparison**

In [None]:
# Representing the types of content in the form of a pie chart

type_counts = df['type'].value_counts()

plt.figure(figsize = (5,5))
plt.pie(type_counts.values, labels = type_counts.index, colors = ['lightblue', 'royalblue'],
        explode = [0.02, 0.02], autopct = '%0.2f%%', startangle = 90)
plt.title('Composition of content on Netflix')
plt.show()

As is evident from the chart, more than two-thirds of the content on Netflix consists of movies, while the rest is TV shows.
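
The shares shown in the pie chart can also be read off as numbers. A minimal sketch, using an invented `types` series in place of `df['type']`:

```python
import pandas as pd

# Hypothetical 'type' column; the values are invented
types = pd.Series(['Movie', 'Movie', 'TV Show', 'Movie'], name = 'type')

# normalize=True returns fractions of the total instead of raw counts
type_share = types.value_counts(normalize = True) * 100
print(type_share.round(2))
```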

**Distribution Of Content Across Countries**

In [None]:
# Top 10 countries which produced the highest amount of content

content_by_country = df['country'].value_counts()[0:10]

plt.figure(figsize = (7,7))
plt.bar(content_by_country.keys().tolist(), content_by_country.values.tolist(), color = '#6593F5')
plt.xlabel('Countries', fontsize = 15)
plt.ylabel('Total content produced', fontsize = 15)
plt.xticks(rotation = 90)
plt.show()

As you might guess, the United States leads the chart with the most content produced. Keep in mind that its total also includes the titles whose missing country was filled with the modal value earlier.

The margin between the first- and second-placed countries is striking: the US has produced more than three times as much content as any other nation.
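
The size of the gap between the top two entries can be checked with a simple ratio. This is a sketch on invented labels and counts; with the real data, `content_by_country` from the cell above would be used directly.

```python
import pandas as pd

# Hypothetical 'country' column with placeholder labels; counts are invented
country = pd.Series(['A'] * 7 + ['B'] * 2 + ['C'], name = 'country')

# value_counts() sorts by frequency, so positions 0 and 1 are the top two
content_by_country = country.value_counts()
lead_ratio = content_by_country.iloc[0] / content_by_country.iloc[1]

print(content_by_country.index[0], 'leads', content_by_country.index[1],
      'by a factor of', lead_ratio)
```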

**Content Produced Across The Years**

In [None]:
# Extracting the year added from the 'date_added' column

df['year_added'] = df['date_added'].dt.year

In [None]:
year_wise_content = df.year_added.value_counts()[:20]


It is worth noticing the huge rise in the content added to Netflix after 2015.
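
The year-wise counts are computed above but not charted in this section. One possible way to plot them, sketched on invented `year_added` values and reusing the bar style of the other charts, is:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical 'year_added' values; the real ones come from df['year_added']
year_added = pd.Series([2014, 2016, 2016, 2017, 2017, 2017, 2015])

# value_counts() orders by frequency, so sort by year to get a time axis
year_wise_content = year_added.value_counts().sort_index()

plt.figure(figsize = (7,7))
plt.bar(year_wise_content.index, year_wise_content.values, color = '#6593F5')
plt.xlabel('Year added')
plt.ylabel('Titles added')
plt.show()
```

Sorting by index matters here: without it the bars would appear in order of frequency rather than chronologically.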

**Range Of Seasons In TV Shows Produced**

In [None]:
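# Copy the slice so that adding a column does not touch a view of df
tvshow_df = df[df['type'] == 'TV Show'].copy()
# The first token of 'duration' is the season count (e.g. '2 Seasons')
tvshow_df['seasons'] = tvshow_df['duration'].apply(lambda x: int(x.split()[0]))
# Sort by number of seasons so the x-axis runs in numeric order
seasons_count = tvshow_df['seasons'].value_counts().sort_index()

plt.figure(figsize = (7,7))
plt.bar(seasons_count.index, seasons_count.values, color = '#6593F5')
plt.xlabel('No of seasons')
plt.ylabel('No of shows')
plt.show()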

**Top Categories**

In [None]:
from collections import Counter

# Each 'listed_in' entry is a comma-separated list of categories
category_counts = Counter(
    category
    for genres in df['listed_in']
    for category in genres.split(', ')
)

# Ten most frequent categories as (name, count) pairs
allcategories = category_counts.most_common(10)

category_name = [key for key, value in allcategories]
category_count = [value for key, value in allcategories]

plt.figure(figsize=(7,7))
plt.bar(category_name, category_count, color = '#6593F5')
plt.xlabel('Top Categories')
plt.ylabel('Occurrences')
plt.xticks(rotation = 90)
plt.show()
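
The same tally can also be produced with pandas alone, by splitting each entry into a list and giving every category its own row. A minimal sketch, with an invented `listed_in` series standing in for `df['listed_in']`:

```python
import pandas as pd

# Hypothetical 'listed_in' values; the category strings are invented
listed_in = pd.Series([
    'Dramas, International Movies',
    'Comedies',
    'Dramas, Comedies',
])

# Split on ', ', expand each list into separate rows, then count
top_categories = (
    listed_in.str.split(', ')
    .explode()
    .value_counts()
    .head(10)
)
print(top_categories)
```

The result is a Series indexed by category, so it can be passed to `plt.bar` through its `index` and `values`.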

