# TV Shows and Movies listed on Netflix
This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes scores could also provide many interesting findings.
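
As a rough illustration of such an integration, the sketch below joins a toy Netflix table with a hypothetical ratings table on the title. The `ratings` frame and its `imdb_rating` column are assumptions made for this example and are not part of the dataset; real sources would likely need title cleaning and a release-year match as well.

```python
import pandas as pd

# Toy stand-in for the Netflix table (the real file has more columns).
netflix = pd.DataFrame({
    "title": ["Show A", "Movie B", "Movie C"],
    "type": ["TV Show", "Movie", "Movie"],
})

# Hypothetical external ratings table; the column names are assumptions.
ratings = pd.DataFrame({
    "title": ["Show A", "Movie B"],
    "imdb_rating": [8.1, 6.4],
})

# A left join keeps every Netflix title, even when no rating is found.
merged = netflix.merge(ratings, on="title", how="left")
print(merged)
```

Titles without a match end up with a missing rating, which makes the coverage of the external source easy to check.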

Inspiration
Some interesting questions (tasks) that can be explored with this dataset:

* Understanding what content is available in different countries
* Identifying similar content by matching text-based features
* Running a network analysis of actors and directors to find interesting insights
* Checking whether Netflix has increasingly focused on TV shows rather than movies in recent years
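
To make the "similar content" task more concrete, here is a minimal sketch that compares descriptions by the overlap of their words (Jaccard similarity). The titles and descriptions are made up for illustration, and word overlap is only one simple choice among many text-matching techniques.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up descriptions; in the notebook these would come from df["description"].
descriptions = {
    "Title 1": "a detective hunts a killer in a small town",
    "Title 2": "a detective investigates a murder in a small town",
    "Title 3": "two friends open a bakery",
}

query = "Title 1"
# Score every other title against the query title.
scores = {
    title: jaccard(descriptions[query], text)
    for title, text in descriptions.items()
    if title != query
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```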

![](https://i.gadgets360cdn.com/large/netflix_best_tv_series_1600167552333.jpg)

In [None]:
!pip install dataprep

# Import libraries


In [None]:
# data manipulation
import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
import plotly.graph_objects as go
import plotly.express as px

# default theme
sns.set(context='notebook', style='darkgrid', palette='colorblind', font='sans-serif', font_scale=1, rc=None)
matplotlib.rcParams['figure.figsize'] =[8,8]
matplotlib.rcParams.update({'font.size': 15})
matplotlib.rcParams['font.family'] = 'sans-serif'

In [None]:
from plotly.offline import init_notebook_mode, iplot

# enable offline rendering so iplot() can draw figures inside the notebook
init_notebook_mode(connected=True)

In [None]:
# import only the dataprep helpers this notebook uses
from dataprep.eda import plot, create_report

# Load and analyze the data

In [None]:
df = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
df

In [None]:
df.info()

In [None]:
# share of columns per dtype; explode needs one entry per dtype (two here)
df.dtypes.value_counts().plot.pie(explode=[0.1, 0.1], autopct='%1.1f%%', shadow=True)
plt.title('Column data types');

In [None]:
df.columns

In [None]:
df.describe(include='all')

#### Notes on the data description
As the summary shows, the most frequent value in each column is:
* type: Movie, 5377 of 7787 non-null values
* director: Raúl Campos, Jan Suter, 18 of 5398
* cast: David Attenborough, 18 of 7069
* country: United States, 2555 of 7280
* date_added: January 1, 2020, out of 7777
* release_year:
* duration:
* listed_in:
* description:

# Finding missing values

In [None]:
missing_values=df.isnull().sum()
percent_missing = df.isnull().sum()/df.shape[0]*100

value = {
    'missing_values ':missing_values,
    'percent_missing %':percent_missing
}
frame=pd.DataFrame(value)
frame

So the missing data are:
* rating: 7 (0.08%)
* date_added: 10 (0.12%)
* country: 507 (6.51%)
* cast: 718 (9.22%)
* director: 2389 (30.67%)

In [None]:
df.shape

### a) rating 

In [None]:
freq_value=df.rating.value_counts()
print(freq_value)
freq_value.plot.bar()

1. In the **rating** column, **TV-MA** is the most frequent value, with 2863 occurrences.
2. Missing values make up only 0.08% of the data.

==> So we will replace them with the most frequent value.
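
The idea of filling gaps with the most frequent value can be sketched on a toy Series. The values below are made up; the point to note is that `mode()` returns a Series (there can be ties), so the first entry is taken before passing it to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy rating column with one missing entry.
toy_rating = pd.Series(["TV-MA", "TV-14", "TV-MA", np.nan, "PG"])

# mode() ignores NaN and returns a Series; [0] picks the first mode.
most_frequent = toy_rating.mode()[0]

# fillna returns a new Series with the gaps replaced.
filled = toy_rating.fillna(most_frequent)
print(filled.tolist())
```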

In [None]:
freq_rating=df.rating.mode()

In [None]:
# mode is a method and returns a Series, so call it and take the first value
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])

In [None]:
df.rating.isnull().sum()

### b) date_added

In [None]:
freq_date=df.date_added.value_counts()
freq_date

Only 10 values are missing in **date_added**, so it is better to drop those rows.

In [None]:
df=df.dropna(axis=0, subset=['date_added'])

In [None]:
df.date_added.isnull().sum()

### c) country

In [None]:
df.country.value_counts()

In [None]:
plt.figure(figsize=(15, 8))
country_val = df.country.value_counts().head(15)
# recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=country_val.index, y=country_val.values)
plt.xticks(rotation=45)
plt.title('Content available in different countries')

* The most frequent country is the **United States**.
* So we will replace all 507 missing values (6.51%) with **United States**.
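
One caveat when counting countries: the sketch below assumes that a single `country` cell may list several countries separated by `", "`, in the same way the director column is split later in this notebook. Under that assumption, splitting and exploding the column counts each country once per title. The toy rows are made up.

```python
import pandas as pd

# Toy rows; the second-level split assumes ", " separates countries.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", None],
})

per_country = (
    toy["country"]
    .dropna()            # ignore titles with no country
    .str.split(", ")     # turn each cell into a list of countries
    .explode()           # one row per (title, country) pair
    .value_counts()
)
print(per_country)
```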

In [None]:
df.country.mode()

In [None]:
# mode is a method and returns a Series, so call it and take the first value
df['country'] = df['country'].fillna(df['country'].mode()[0])

In [None]:
df.country.isnull().sum()

### d) cast

In [None]:
df.cast.value_counts().count()

As we can see in this case:
* 718 values are missing, which is 9.22% of our data
* the most frequent value is **David Attenborough**, with a count of 18
* the **cast** column has 6821 distinct values

#### CONCLUSION :
It is hard to find a sound method for imputing the missing values, so we will drop those rows.

In [None]:
df=df.dropna(axis=0, subset=['cast'])

In [None]:
df.isnull().sum()

In [None]:
df.director.value_counts()

## 1. Content Type on Netflix

In [None]:
col = "type"
# rename_axis/reset_index(name=...) gives the same columns on old and new pandas
grouped = df[col].value_counts().rename_axis(col).reset_index(name="count")

## plot
trace = go.Pie(labels=grouped[col], values=grouped['count'], pull=[0.05, 0], marker=dict(colors=["#6ad49b", "#a678de"]))
layout = go.Layout(title="", height=400, legend=dict(x=0.1, y=1.1))
fig = go.Figure(data=[trace], layout=layout)
fig.show()

In [None]:
# dataprep's plot takes the DataFrame and a column name
plot(df, "type")

* 66% of the content on Netflix is movies.
* 33% is TV shows.
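
Shares like these can be read directly from the data with `value_counts(normalize=True)`. The sketch below uses a made-up three-row column, so its numbers are illustrative only and are not the notebook's results.

```python
import pandas as pd

# Toy "type" column: two movies and one TV show.
toy_type = pd.Series(["Movie", "Movie", "TV Show"], name="type")

# normalize=True returns fractions; scale to percentages and round.
share = toy_type.value_counts(normalize=True).mul(100).round(1)
print(share)
```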

## 2. Growth in content over the years

In [None]:
d1 = df[df["type"] == "TV Show"]
d2 = df[df["type"] == "Movie"]

col = "release_year"

# titles per release year for TV shows, with each year's share of the total
vc1 = d1[col].value_counts().rename_axis(col).reset_index(name="count")
vc1['percent'] = 100 * vc1['count'] / vc1['count'].sum()
vc1 = vc1.sort_values(col)

# same computation for movies
vc2 = d2[col].value_counts().rename_axis(col).reset_index(name="count")
vc2['percent'] = 100 * vc2['count'] / vc2['count'].sum()
vc2 = vc2.sort_values(col)

trace1 = go.Scatter(x=vc1[col], y=vc1["count"], name="TV Shows", marker=dict(color="#a678de"))
trace2 = go.Scatter(x=vc2[col], y=vc2["count"], name="Movies", marker=dict(color="#6ad49b"))
data = [trace1, trace2]
layout = go.Layout(title="Content added over the years", legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure(data, layout=layout)
fig.show()

1. The growth in the number of movies on Netflix is much higher than that of TV shows.
2. About 1300 new movies were added in both 2018 and 2019.
3. The growth in content started in 2013.
4. Netflix kept adding different movies and TV shows to its platform over the years.
5. This content was varied: it came from different countries and was released across many different years.
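
Since these observations concern when titles were added, a count by `date_added` year is a natural complement to the release-year view. The sketch below is a minimal example on made-up rows; it assumes the dates follow the `Month day, year` pattern seen in the data description (for example `January 1, 2020`) and strips stray whitespace before parsing.

```python
import pandas as pd

# Toy rows; one date carries a leading space to show why strip() helps.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["January 1, 2020", " March 5, 2019",
                   "July 12, 2019", "December 31, 2019"],
})

# Parse the text dates and keep only the year each title was added.
added = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")
toy["year_added"] = added.dt.year

# Titles added per year, split by content type.
per_year = toy.groupby(["year_added", "type"]).size().unstack(fill_value=0)
print(per_year)
```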

## 3. Original Release Year of the movies

In [None]:
# dataprep's plot takes the DataFrame and a column name
plot(df, "release_year")

## 4. Top directors on Netflix

In [None]:
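from collections import Counter

# movies whose country is exactly "India"
small = df[df["type"] == "Movie"]
small = small[small["country"] == "India"]

col = "director"
# a cell can list several directors separated by ", "; count each one
categories = ", ".join(small[col].fillna("")).split(", ")
counter_list = Counter(categories).most_common(12)
counter_list = [item for item in counter_list if item[0] != ""]
# reverse so the most frequent director is drawn at the top of the chart
labels = [item[0] for item in counter_list][::-1]
values = [item[1] for item in counter_list][::-1]
trace1 = go.Bar(y=labels, x=values, orientation="h", name="Movies", marker=dict(color="yellow"))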

data = [trace1]
layout = go.Layout(title="Movie Directors from India with most content", legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure(data, layout=layout)
fig.show()

In [None]:
# movies whose country is exactly "United States"
small = df[df["type"] == "Movie"]
small = small[small["country"] == "United States"]

col = "director"
# a cell can list several directors separated by ", "; count each one
categories = ", ".join(small[col].fillna("")).split(", ")
counter_list = Counter(categories).most_common(12)
counter_list = [item for item in counter_list if item[0] != ""]
# reverse so the most frequent director is drawn at the top of the chart
labels = [item[0] for item in counter_list][::-1]
values = [item[1] for item in counter_list][::-1]
trace1 = go.Bar(y=labels, x=values, orientation="h", name="Movies", marker=dict(color="yellow"))

data = [trace1]
layout = go.Layout(title="Movie Directors from US with most content", legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure(data, layout=layout)
fig.show()