# EDA: Netflix Shows
Netflix, Inc. is an American over-the-top content platform and production company headquartered in Los Gatos, California. Netflix was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California. The company's primary business is a subscription-based streaming service offering a library of films and television series, including those produced in-house. As of October 2020, Netflix had over 195 million paid subscriptions worldwide, including 73 million in the United States. It is available worldwide except in mainland China (due to local restrictions) and in Syria, North Korea, and Crimea (due to US sanctions). Netflix's operating income was reported in 2020 as $1.2 billion. The company has offices in France, the United Kingdom, Brazil, the Netherlands, India, Japan, and South Korea. Netflix is a member of the Motion Picture Association (MPA) and produces and distributes content from countries all over the globe.
<img src="https://variety.com/wp-content/uploads/2020/05/netflix-logo.png?w=681&amp;h=383&amp;crop=1" alt="Netflix Logo">

### Importing Python Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('darkgrid')
from sklearn.linear_model import LinearRegression
%matplotlib inline

### Reading the CSV file

In [None]:
df = pd.read_csv('../input/netflix-shows/Netflix Shows.csv', encoding='latin-1')

### A Glance at the DataFrame

In [None]:
df

### Deleting duplicate values

In [None]:
df = df.drop_duplicates() 
df.duplicated().values.any()
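
To see how many rows `drop_duplicates` actually removes, it can help to compare the row count before and after. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are illustrative assumptions, not rows from the Netflix file.

```python
import pandas as pd

# Hypothetical toy data; the second and third rows are exact duplicates.
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show B", "Show C"],
    "release year": [2015, 2016, 2016, 2017],
})

rows_before = len(toy)
deduped = toy.drop_duplicates()
rows_removed = rows_before - len(deduped)

print(f"removed {rows_removed} duplicate row(s)")
print(deduped.duplicated().any())  # False once duplicates are gone
```

The same before/after comparison can be run on `df` to report how many duplicate shows the dataset contained.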

### Deleting Outliers

In [None]:
# Drop rows released in 1940, treated here as outliers
df.drop(df.index[df['release year'] == 1940], inplace=True)
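
The cell above removes one hard-coded year. An alternative, shown here only as a sketch and not as the method used in this notebook, is to flag outliers with the common 1.5 × IQR rule. The `years` series below is hypothetical example data.

```python
import pandas as pd

# Hypothetical release years; 1940 sits far from the rest.
years = pd.Series(
    [1940, 2010, 2012, 2014, 2015, 2016, 2016, 2017],
    name="release year",
)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = years.quantile(0.25), years.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (years < lower) | (years > upper)
outliers = years[is_outlier]
kept = years[~is_outlier]

print(outliers.tolist())
```

A rule like this avoids hard-coding a value, at the cost of possibly flagging legitimate older titles, so the flagged rows are worth inspecting before dropping them.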

### Filling the gaps

In [None]:
# Forward-fill missing values; fillna(method="ffill") is deprecated in recent pandas
df = df.ffill()
df

### Checking for Null Values

In [None]:
df.isnull().any()  # True for each column that still contains a null value
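
Per-column null counts are often more informative than a True/False flag. The sketch below uses a hypothetical `toy` frame to show `isnull().sum()` and one caveat of forward fill: a missing value in the first row has nothing before it to copy, so it stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a gap in the first and last rows.
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "user rating score": [np.nan, 82.0, np.nan],
})

nulls_before = toy.isnull().sum()   # null count per column
filled = toy.ffill()                # forward fill, as in the previous step
nulls_after = filled.isnull().sum()

print(nulls_before)
print(nulls_after)
```

If a leading gap like this matters, a backward fill (`bfill()`) after the forward fill is one option.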

# Exploratory Data Analysis

DataFrame Head

In [None]:
df.head()

DataFrame Tail

In [None]:
df.tail()

## Overview of the Dataset

In [None]:
df.describe()

From the above cell, we can read off basic statistics of the data:
<ul>
<li>The average description rating is 68.821643</li>
<li>The average user rating score is 78.388778</li>
<li>The average user rating size is 80.973948</li>
</ul>

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.columns

## Visualizing Data

### How many shows are there in the Netflix Shows dataset?

In [None]:
df['title'].count()

We are analyzing 500 Netflix shows in this dataset across 7 features.

### How many Netflix shows are produced each year?

In [None]:
plt.figure(figsize=(18,10))
plt.hist(df['release year'], ec='white', bins = 40, color='#f05454')
plt.xlabel('Year of Release', fontsize=16)

plt.ylabel('No. of shows released', fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(18,10))
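# distplot is deprecated; histplot with a KDE overlay gives the equivalent plot
sns.histplot(df['release year'], bins=40, color='#f05454', kde=True, stat='density')
plt.title('Distribution of shows by release year', fontsize=16)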
plt.show()

### How many Netflix shows are produced per rating?

In [None]:
freq = df['rating'].value_counts()
print(freq)
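
Alongside the raw counts, `value_counts(normalize=True)` gives each rating's share of the total, which is what the pie chart below displays. A minimal sketch on hypothetical ratings (the `ratings` series is made up for illustration):

```python
import pandas as pd

# Hypothetical ratings, not taken from the dataset.
ratings = pd.Series(
    ["TV-14", "PG", "TV-14", "TV-MA", "TV-14", "PG"],
    name="rating",
)

counts = ratings.value_counts()                                   # absolute frequency
shares = ratings.value_counts(normalize=True).mul(100).round(1)   # percentage share

print(counts)
print(shares)
```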

In [None]:
plt.figure(figsize=(18, 10))
plt.bar(freq.index, height = freq,ec='white',color='#f05454')
plt.xlabel('Rating', fontsize=16)
plt.ylabel('No. of shows released', fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(18, 10))
freq.plot.pie(autopct="%.1f%%")
plt.title('Share of shows per rating', fontsize=16)
plt.ylabel('')  # a pie chart has no meaningful axis labels
plt.show()


## Shows based on Rating Description

### Release Year vs Rating Description
 

In [None]:
plt.figure(figsize=(18, 10))
X = df['release year']
y = df['ratingDescription']
plt.xlabel('Release Year', fontsize=16)
plt.ylabel('Rating Description', fontsize=16)
plt.scatter(X, y, s=900, alpha=0.8, ec='white', color='#f05454')
plt.show()


In [None]:

plt.figure(figsize=(18,10))
plt.hist(df['ratingDescription'], ec='white', bins = 15, color='#f05454')
plt.xlabel('Rating Description', fontsize=16)
plt.title('Distribution of shows based on rating description')
plt.axvline(df['ratingDescription'].mean(), color='#21209c', linestyle='dashed', linewidth=5, label='Average Rating')
plt.legend(loc='upper right', bbox_to_anchor=(0.98, 1.11), frameon=False, fontsize=14)

plt.ylabel('No. of shows', fontsize=16)
plt.show()

In [None]:

plt.figure(figsize=(18,10))
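# distplot is deprecated; histplot with a KDE overlay gives the equivalent plot
sns.histplot(df['ratingDescription'], bins=15, color='#f05454', kde=True, stat='density')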
plt.xlabel('Rating Description', fontsize=16)
plt.title('Distribution of shows based on rating description')
plt.axvline(df['ratingDescription'].mean(), color='#21209c', linestyle='dashed', linewidth=5, label='Average Rating')
plt.legend(loc='upper right', bbox_to_anchor=(0.98, 1.11), frameon=False, fontsize=14)

plt.ylabel('No. of shows', fontsize=16)
plt.show()

In [None]:
plt.figure(figsize = (20,8))
sns.swarmplot(x = 'rating',
              y = 'ratingDescription', 
              data = df,
              alpha = 0.8,
              s = 25
)
plt.show()

## Shows based on User Rating Score


### Release Year vs User Rating Score
 

In [None]:
plt.figure(figsize=(18, 10))
X = df['release year']
y = df['user rating score']
plt.xlabel('Release Year', fontsize=16)
plt.ylabel('User Rating Score', fontsize=16)
plt.scatter(X, y, s=900, alpha=0.8, ec='white', color='#f05454')
plt.show()

In [None]:
plt.figure(figsize=(18,10))
plt.hist(df['user rating score'], ec='white', bins = 15, color='#f05454')
plt.xlabel('User Rating Score', fontsize=16)
plt.title('Distribution of shows based on user rating score')
plt.axvline(df['user rating score'].mean(), color='#21209c', linestyle='dashed', linewidth=5, label='Average Rating')
plt.legend(loc='upper right', bbox_to_anchor=(0.98, 1.11), frameon=False, fontsize=14)

plt.ylabel('No. of shows', fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(18,10))
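# distplot is deprecated; histplot with a KDE overlay gives the equivalent plot
sns.histplot(df['user rating score'], bins=15, color='#f05454', kde=True, stat='density')
plt.xlabel('User Rating Score', fontsize=16)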
plt.title('Distribution of shows based on user rating score')
plt.axvline(df['user rating score'].mean(), color='#21209c', linestyle='dashed', linewidth=5, label='Average Rating')
plt.legend(loc='upper right', bbox_to_anchor=(0.98, 1.11), frameon=False, fontsize=14)

plt.ylabel('No. of shows', fontsize=16)
plt.show()


In [None]:
plt.figure(figsize = (20,8))
sns.swarmplot(x = 'rating',
              y = 'user rating score', 
              data = df,
              alpha = 0.8,
              s = 20
)
plt.show()

## Shows based on User Rating Size


### Release Year vs User Rating Size

In [None]:
plt.figure(figsize=(18, 10))
X = df['release year']
y = df['user rating size']
plt.xlabel('Release Year', fontsize=16)
plt.ylabel('User Rating Size', fontsize=16)
plt.scatter(X, y, s=900, alpha=0.8, ec='white', color='#f05454')
plt.show()

In [None]:
plt.figure(figsize=(18,10))
plt.hist(df['user rating size'], ec='white', bins = 15, color='#f05454')
plt.xlabel('User Rating Size', fontsize=16)
plt.title('Distribution of shows based on user rating size')
plt.axvline(df['user rating size'].mean(), color='#21209c', linestyle='dashed', linewidth=5, label='Average Rating Size')
plt.legend(loc='upper right', bbox_to_anchor=(0.98, 1.11), frameon=False, fontsize=14)

plt.ylabel('No. of shows', fontsize=16)
plt.show()

## Comparing the User Rating Score and Description Rating

In [None]:
# lmplot is figure-level and creates its own figure, so size it via height/aspect
sns.lmplot(x='user rating score',
           y='ratingDescription',
           hue='rating',
           data=df,
           height=8,
           aspect=1.8
)
plt.show()

In [None]:
plt.figure(figsize=(18,10))

sns.regplot(x='user rating score', y='ratingDescription', data=df, color='#ff4646')

plt.title('Relationship Between User Rating score and Description Rating', fontsize=22)
plt.ylabel('Description Rating', fontsize=16)
plt.xlabel('User Rating score',fontsize=16)
sns.despine()

plt.show()
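
The regression plot shows the trend visually; a Pearson correlation coefficient puts a number on it. The sketch below uses a hypothetical four-row frame with the same column names as the dataset, so the value it prints says nothing about the real data.

```python
import pandas as pd

# Hypothetical values; only the column names match the dataset.
toy = pd.DataFrame({
    "user rating score": [60, 70, 80, 90],
    "ratingDescription": [50, 65, 70, 95],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = toy["user rating score"].corr(toy["ratingDescription"])
print(f"Pearson r = {r:.3f}")
```

Running the same call on `df` would give the coefficient for the two columns plotted above; a value near 0 would suggest little linear relationship despite the fitted line.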

In [None]:
# pairplot is figure-level and creates its own figure, so no plt.figure() call is needed
sns.pairplot(df[['user rating score',
                 'ratingDescription',
                 'user rating size']]
)

plt.show()