**NETFLIX - Movies and TV Shows Prediction**

![picture](https://www.it-tech.co.za/wp-content/uploads/2020/09/netflix.jpg)









About the Dataset:

*   This dataset consists of TV shows and movies available on Netflix.
*   The dataset was collected from Flixable, a third-party Netflix search engine.





Aim: To analyse the data and the factors affecting the trend of movies and TV shows available on Netflix. Data visualization is a primary aim and is implemented using Plotly, along with Matplotlib and Seaborn. The final step builds a prediction model for Netflix movies and TV shows based on the dataset.

The dataset contains 7787 rows and 12 columns.

Content :

      Import required libraries
      Import the dataset
      Data Exploration
      Data Preprocessing
      Data Analysis and Visualization
      Feature Engineering
      Feature Selection
      Splitting : training and testing dataset
      Modeling
      

## Import the required libraries

In [None]:
import pandas as pd
import numpy as np   
import matplotlib.pyplot as plt  
import seaborn as sns 
import warnings
warnings.filterwarnings("ignore")

## Import the dataset

In [None]:
#upload dataset from system
'''from google.colab import files
uploaded = files.upload()'''

In [None]:
#we can upload the dataset from github :
df = pd.read_csv('https://raw.githubusercontent.com/Somali19/dataset/main/netflix_titles.csv')


Display the Dataset

In [None]:
'''import io
df = pd.read_csv(io.BytesIO(uploaded['netflix_titles.csv']))  '''
df

## Data Exploration

In [None]:
df.shape

    So, the dataset has 7787 rows and 12 columns.

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.director.unique()

In [None]:
df.cast.unique()

## Data Preprocessing

In [None]:
df.duplicated().sum() #check for duplicate value

In [None]:
df.isnull().sum() #null value

In [None]:
print("Percentage Of Missing Values")
Perc_Of_Missing_Values=df.isna().sum()/len(df)*100
Perc_Of_Missing_Values[Perc_Of_Missing_Values!=0]

In [None]:
only_missing=Perc_Of_Missing_Values[Perc_Of_Missing_Values!=0]
only_missing.plot(kind="bar")
plt.title("Percentage of Missing Values")
plt.show()

Missing values and columns that are not required:

    "show_id": Not informative, so it will be dropped.
    "director": Very little information and not needed for the analysis, so it will be dropped.
    "cast": There are too many different values, so it will be dropped.
    "country": An important variable, so its missing values need to be filled.
    "date_added": Only a few values are missing, so those rows will be dropped.
    "rating": Only a few values are missing, so they will be filled.
    "description": Not important for this analysis, so it will be dropped.

Drop the cast, director, show_id and description columns:

In [None]:
# drop all four unneeded columns in a single call
df.drop(columns=["cast", "director", "show_id", "description"], inplace=True)

Fill the missing values:

In [None]:
df["country"]=df["country"].fillna(df["country"].mode()[0])
df["rating"]=df["rating"].fillna(df["rating"].mode()[0])

Fix the date_added column:

In [None]:
df[df.date_added.isna()]

We will drop those rows. There are only 10 of them, and it is difficult to recover their dates.

In [None]:
df=df[df["date_added"].notna()]

Check the Cleaned dataset

In [None]:
df.isna().sum()

Let's make New Columns:

In [None]:
df['year_added'] = df['date_added'].apply(lambda x: x.strip().split(" ")[-1])  # strip() guards against stray leading/trailing spaces
df['year_added'].head()

In [None]:
df['month_added'] = df['date_added'].apply(lambda x: x.strip().split(" ")[0])  # without strip(), a leading space would yield an empty month
df['month_added'].head()
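
A possible alternative to splitting the date string by hand is to parse it with `pd.to_datetime`. The sketch below runs on a small made-up sample; the `sample` frame and the `day_added` column are illustrative names, not part of the dataset, and the code assumes dates follow the `Month day, year` pattern.

```python
import pandas as pd

# Hypothetical sample mimicking the 'date_added' format, including a stray leading space
sample = pd.DataFrame(
    {"date_added": ["August 14, 2020", " September 1, 2019", "December 23, 2018"]}
)

# Strip whitespace first, then parse with an explicit format
parsed = pd.to_datetime(sample["date_added"].str.strip(), format="%B %d, %Y")

# Derive numeric year/day and the month name from the parsed dates
sample["year_added"] = parsed.dt.year
sample["month_added"] = parsed.dt.month_name()
sample["day_added"] = parsed.dt.day
print(sample)
```

Parsing once and deriving the parts from the datetime keeps year and day numeric, which avoids encoding them as strings later.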

In [None]:
df["type"].map({'TV Show': 0, 'Movie': 1}).head() #preview: tv show as 0 and movie as 1; df itself is left unchanged because later cells filter on the string labels

In [None]:
df["rating"].unique() #list of unique rating

In [None]:
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}  #map each rating to a target age group (Kids, Older Kids, Teens, Adults)

In [None]:
df['target_ages'] = df['rating'].replace(ratings_ages)  #replace the ratings with above category
df['target_ages'].unique() 
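
One thing to keep in mind is that `replace` leaves any rating that is missing from the dictionary untouched, so an unexpected value would silently end up in `target_ages`. Below is a minimal sketch of how such values could be detected with `map`. It uses a made-up subset of the mapping and a hypothetical dirty value (`"74 min"`); none of these names come from the notebook.

```python
import pandas as pd

# Illustrative subset of the rating -> age group mapping
ratings_ages_demo = {"TV-MA": "Adults", "PG-13": "Teens", "TV-Y": "Kids"}

# Made-up ratings; the last entry is a hypothetical dirty value
ratings = pd.Series(["TV-MA", "PG-13", "TV-Y", "74 min"])

# replace() keeps unmapped values as they are
with_replace = ratings.replace(ratings_ages_demo)

# map() turns unmapped values into NaN, which makes them easy to find
with_map = ratings.map(ratings_ages_demo)
unmapped = ratings[with_map.isna()].unique().tolist()

print(with_replace.tolist())
print("Unmapped ratings:", unmapped)
```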

In [None]:
df

## Exploratory Data Analysis and Visualization

In [None]:
import plotly.graph_objects as go
val = df['type'].value_counts().index
cnt = df['type'].value_counts().values

fig = go.Figure([go.Bar(x=val, y=cnt, marker_color='darkturquoise')])
fig.update_layout(title_text='Netflix Sources Distribution', title_x=0.5)
fig.show()
#movie and tv show distribution

In [None]:
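countries=pd.crosstab(df["country"],df["type"])  #rows: country, columns: Movie / TV Show
#sort countries by their total number of titles
countries=countries.loc[countries.sum(axis=1).sort_values(ascending=False).index]
countries.head(10).plot(kind="bar")
plt.legend()
plt.title("COUNTRIES WITH HIGHEST NUMBER OF SHOWS")
plt.show()
#top 10 countries by number of movies and tv shows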

In [None]:
df["type"].value_counts().plot(kind="pie",autopct="%1.1f%%")
plt.title("PERCENTAGE OF MOVIES AND TV SHOWS")
plt.legend()
plt.show()
#percentage of movie and tv show

In [None]:
df_movie = df[df['type']=='Movie'].groupby('release_year').count()
df_tv = df[df['type']=='TV Show'].groupby('release_year').count()


df_movie.reset_index(level=0, inplace=True)
df_tv.reset_index(level=0, inplace=True)

# fig = px.line(data_movie, x="release_year", y="type")
# fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_movie['release_year'], y=df_movie['type'],
                    mode='lines',
                    name='Movies', marker_color='mediumpurple'))
fig.add_trace(go.Scatter(x=df_tv['release_year'], y=df_tv['type'],
                    mode='lines',
                    name='TV Shows', marker_color='lightcoral'))
fig.update_layout(title_text='Trend Movies vs TV Shows in recent years', title_x=0.5)
fig.show()
#trend of movies and tv shows in recent year (from 1930 to 2020)

In [None]:
df_movie = df[df['type']=='Movie'].groupby('year_added').count()
df_tv = df[df['type']=='TV Show'].groupby('year_added').count()


df_movie.reset_index(level=0, inplace=True)
df_tv.reset_index(level=0, inplace=True)

# fig = px.line(data_movie, x="year_added", y="type")
# fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_movie['year_added'], y=df_movie['type'],
                    mode='lines',
                    name='Movies', marker_color='mediumpurple'))
fig.add_trace(go.Scatter(x=df_tv['year_added'], y=df_tv['type'],
                    mode='lines',
                    name='TV Shows', marker_color='lightcoral'))
fig.update_layout(title_text='Trend Movies vs TV Shows in year added', title_x=0.5)
fig.show()
#trend of movies and tv shows in year added (from 2008 to 2020)

In [None]:
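# order the months chronologically; groupby alone would sort them alphabetically
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
df_movie = df[df['type']=='Movie'].groupby('month_added').count().reindex(month_order).rename_axis('month_added')
df_tv = df[df['type']=='TV Show'].groupby('month_added').count().reindex(month_order).rename_axis('month_added')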


df_movie.reset_index(level=0, inplace=True)
df_tv.reset_index(level=0, inplace=True)

# fig = px.line(data_movie, x="year_added", y="type")
# fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_movie['month_added'], y=df_movie['type'],
                    mode='lines',
                    name='Movies', marker_color='mediumpurple'))
fig.add_trace(go.Scatter(x=df_tv['month_added'], y=df_tv['type'],
                    mode='lines',
                    name='TV Shows', marker_color='lightcoral'))
fig.update_layout(title_text='Trend Movies vs TV Shows in month added', title_x=0.5)
fig.show()
#trend of movies and tv shows by month added

In [None]:
df_tv = df[df["type"] == "TV Show"]
df_movie = df[df["type"] == "Movie"]

movie_ratings = df_movie.groupby(['target_ages'])['type'].count().reset_index(name='count').sort_values(by='count',ascending=False)
fig_dims = (18,8)
fig, ax = plt.subplots(figsize=fig_dims)  
sns.pointplot(x='target_ages',y='count',data=movie_ratings)
plt.title('Top Movie Ratings Based On Rating System',size='20')
plt.show()
#movie rating 

In [None]:
tv_ratings = df_tv.groupby(['target_ages'])['type'].count().reset_index(name='count').sort_values(by='count',ascending=False)
fig_dims = (18,8)
fig, ax = plt.subplots(figsize=fig_dims)  
sns.pointplot(x='target_ages',y='count',data=tv_ratings)
plt.title('Top TV Show Ratings Based On Rating System',size='20')
plt.show()
#tv show rating based on rating system

In [None]:
import plotly.express as px
def generate_rating_df(df):
    rating_df = df.groupby(['rating', 'target_ages']).agg({'type': 'count'}).reset_index()
    rating_df = rating_df[rating_df['type'] != 0]
    rating_df.columns = ['rating', 'target_ages', 'counts']
    rating_df = rating_df.sort_values('target_ages')
    return rating_df


rating_df = generate_rating_df(df)
fig = px.bar(rating_df, x='rating', y='counts', color='target_ages', title='Ratings of Movies And TV Shows Based On Target Age Groups',  labels={'counts':'COUNT', 'rating':'RATINGS', 'target_ages':'TARGET AGE GROUPS' })
fig.show()
#Ratings of Movies And TV Shows Based On Target Age Groups

In [None]:
plt.figure(figsize=(12,10))
sns.set(style="whitegrid")
ax = sns.countplot(y="release_year", data=df_movie, palette="coolwarm", order=df_movie['release_year'].value_counts().index[0:15])

plt.title('ANALYSIS ON RELEASE YEAR OF MOVIES', fontsize=15, fontweight='bold')
plt.show()
#release year of movies

In [None]:
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="release_year", data=df_tv, palette="coolwarm", order=df_tv['release_year'].value_counts().index[0:15])

plt.title('ANALYSIS ON RELEASE YEAR OF TV Show', fontsize=15, fontweight='bold')
plt.show()
#release year of tv show

In [None]:
from collections import Counter
country_data = df['country']
country_count = pd.Series(dict(Counter(','.join(country_data).replace(' ,',',').replace(
    ', ',',').split(',')))).sort_values(ascending=False)
top10country = country_count.head(10)  # the ten countries with the most titles
plt.figure(figsize=(15,5))
sns.barplot(x= top10country.index, y=top10country, palette="pastel")
plt.xticks(rotation=50)
plt.title('Top 10 countries with most contents', fontsize=15, fontweight='bold')
plt.show()
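
The same per-country count can also be sketched with pandas string methods. This is only an illustration on a made-up `countries` series (not the notebook's data), assuming multi-country entries are separated by commas.

```python
import pandas as pd

# Made-up 'country' values; some titles list several countries
countries = pd.Series(
    ["United States, India", "India", "United Kingdom,United States", "Japan"]
)

# Split on commas, put each country on its own row, trim spaces, then count
country_count = (
    countries.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(country_count)
```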

In [None]:
rating_order_movie =  ['G', 'TV-Y', 'TV-G', 'PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG', 'PG-13', 'TV-14', 'R', 'NC-17', 'TV-MA']
rating_order_tv =  [ 'TV-Y', 'TV-G', 'TV-Y7', 'TV-Y7-FV', 'TV-PG', 'TV-14', 'R', 'TV-MA']
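# reindex() puts the ratings in the desired order and fills ratings that do not occur with 0
movie_rating = df_movie['rating'].value_counts().reindex(rating_order_movie, fill_value=0)
tv_rating = df_tv['rating'].value_counts().reindex(rating_order_tv, fill_value=0)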
def rating_barplot(data, title, height, h_lim=None):
    fig, ax = plt.subplots(1,1, figsize=(15, 7))
    if h_lim :
        ax.set_ylim(0, h_lim)
    ax.bar(data.index, data,  color="#d0d0d0", width=0.6, edgecolor='black')

    color =  ['green',  'blue',  'orange',  'red']
    span_range = [[0, 2], [3,  6], [7, 8], [9, 11]]

    for idx, sub_title in enumerate(['Little Kids', 'Older Kids', 'Teens', 'Mature']):
        ax.annotate(sub_title,
                    xy=(sum(span_range[idx])/2 ,height),
                    xytext=(0,0), textcoords='offset points',
                    va="center", ha="center",
                    color="w", fontsize=16, fontweight='bold',
                    bbox=dict(boxstyle='round4', pad=0.4, color=color[idx], alpha=0.6))
        ax.axvspan(span_range[idx][0]-0.4,span_range[idx][1]+0.4,  color=color[idx], alpha=0.1)

    ax.set_title(f'Distribution of {title} Rating', fontsize=20, fontweight='bold', position=(0.5, 1.0+0.03))
    plt.show()

In [None]:
rating_barplot(movie_rating,'Movie', 1500)

In [None]:
rating_barplot(tv_rating,'TV Show' , 600, 1500)

In [None]:
from wordcloud import WordCloud, STOPWORDS

text = ' '.join(df_movie['listed_in'])

plt.rcParams['figure.figsize'] = (12,12)
wordcloud = WordCloud(background_color = 'black',colormap='vlag', width = 1200,  height = 1200, max_words = 121).generate(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
from wordcloud import WordCloud, STOPWORDS

text = ' '.join(df_tv['listed_in'])

plt.rcParams['figure.figsize'] = (12,12)
wordcloud = WordCloud(background_color = 'lightblue', width = 1200,  height = 1200, max_words = 121).generate(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

## Feature Engineering

In [None]:
#label encoding
from sklearn.preprocessing import LabelEncoder
'''target_ages_enc = LabelEncoder()
target_ages_enc.fit(df['target_ages'])
df['target_ages_enc'] = target_ages_enc.transform(df['target_ages'])'''
#LabelEncoder assigns codes in alphabetical order: adults is replaced by 0 , kids by 1 , older kids by 2 and teens by 3

In [None]:
#df.drop('target_ages', axis='columns', inplace=True)

In [None]:
type_enc = LabelEncoder()
type_enc.fit(df['type'])
df['type_enc'] = type_enc.transform(df['type'])
# LabelEncoder sorts the labels alphabetically, so movie is replaced by 0 and tv show is replaced by 1
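
`LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, not in order of appearance. A small check on made-up labels shows which code each class receives; the variable names here are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels in arbitrary order
labels = ["TV Show", "Movie", "Movie", "TV Show"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ holds the labels in sorted order; the position is the assigned code
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)
print(codes.tolist())
```

Inspecting `classes_` after fitting is a quick way to confirm what each code means before interpreting a model's predictions.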

In [None]:
df.drop('type', axis='columns', inplace=True)

In [None]:
title_enc = LabelEncoder()
title_enc.fit(df['title'])
df['title_enc'] = title_enc.transform(df['title'])

In [None]:
df.drop('title', axis='columns', inplace=True)

In [None]:
country_enc = LabelEncoder()
country_enc.fit(df['country'])
df['country_enc'] = country_enc.transform(df['country'])

In [None]:
df.drop('country', axis='columns', inplace=True)

In [None]:
df['date_added_date'] = df['date_added'].apply(lambda x: x.split(",")[-2])
df['date_added_date'].head()

In [None]:
df

In [None]:
df['date_added_dates'] = df['date_added_date'].apply(lambda x: int(x.split(" ")[-1]))  # day of the month, stored as an integer
df['date_added_dates'].head()

In [None]:
df

In [None]:
df.drop('date_added', axis='columns', inplace=True)

In [None]:
df.drop('date_added_date', axis='columns', inplace=True)

In [None]:
df

In [None]:
month_added_enc = LabelEncoder()
month_added_enc.fit(df['month_added'])
df['month_added_enc'] = month_added_enc.transform(df['month_added'])

In [None]:
df.drop('month_added', axis='columns', inplace=True)

In [None]:
year_added_enc = LabelEncoder()
year_added_enc.fit(df['year_added'])
df['year_added_enc'] = year_added_enc.transform(df['year_added'])

In [None]:
df.drop('year_added', axis='columns', inplace=True)

In [None]:
release_year_enc = LabelEncoder()
release_year_enc.fit(df['release_year'])
df['release_year_enc'] = release_year_enc.transform(df['release_year'])

In [None]:
df

In [None]:
df.drop('release_year', axis='columns', inplace=True)

In [None]:
df

In [None]:
rating_enc = LabelEncoder()
rating_enc.fit(df['rating'])
df['rating_enc'] = rating_enc.transform(df['rating'])

In [None]:
df.drop('rating', axis='columns', inplace=True)

In [None]:
duration_enc = LabelEncoder()
duration_enc.fit(df['duration'])
df['duration_enc'] = duration_enc.transform(df['duration'])

In [None]:
df.drop('duration', axis='columns', inplace=True)

In [None]:
listed_in_enc = LabelEncoder()
listed_in_enc.fit(df['listed_in'])
df['listed_in_enc'] = listed_in_enc.transform(df['listed_in'])

In [None]:
df.drop('listed_in', axis='columns', inplace=True)

In [None]:
target_ages_enc = LabelEncoder()
target_ages_enc.fit(df['target_ages'])
df['target_ages_enc'] = target_ages_enc.transform(df['target_ages'])

In [None]:
df.drop('target_ages', axis='columns', inplace=True)

In [None]:
df

Constant and Quasi-Constant Feature Removal

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(df)

len(df.columns[constant_filter.get_support()])

constant_columns = [column for column in df.columns
                    if column not in df.columns[constant_filter.get_support()]]

df.drop(labels=constant_columns, axis=1, inplace=True)

In [None]:
df

In [None]:
qconstant_filter = VarianceThreshold(threshold=0.16)
qconstant_filter.fit(df)

In [None]:
len(df.columns[qconstant_filter.get_support()])

In [None]:
qconstant_columns = [column for column in df.columns
                    if column not in df.columns[qconstant_filter.get_support()]]

print(len(qconstant_columns))
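
To illustrate how `VarianceThreshold` behaves, here is a minimal sketch on a made-up frame (`toy` and its column names are hypothetical). The selector keeps only features whose variance is above the threshold. Note that fitting the selector only identifies such columns; they still have to be dropped explicitly for the filter to have any effect.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up features: one constant, one almost constant, one that varies
toy = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],
    "quasi_constant": [0, 0, 0, 0, 0, 1],  # variance = 5/36, about 0.139
    "informative": [0, 1, 2, 3, 4, 5],
})

selector = VarianceThreshold(threshold=0.16)
selector.fit(toy)

# get_support() is a boolean mask of the columns that pass the threshold
kept = toy.columns[selector.get_support()].tolist()
dropped = [c for c in toy.columns if c not in kept]

toy_reduced = toy.drop(columns=dropped)
print("kept:", kept)
print("dropped:", dropped)
```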

## Feature Selection

In [None]:
corrmat = df.corr()
plt.subplots(figsize=(12,10))
sns.heatmap(corrmat)

In [None]:
colormap = plt.cm.RdBu
plt.subplots(figsize=(15,14))
sns.heatmap(df.corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='black' , annot=True)

In [None]:
def correlation(dataset, threshold):
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) >= threshold:  # abs() also flags strong negative correlations
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
    return col_corr


In [None]:
corr_features = correlation(df , 0.8)
len(set(corr_features))
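
The correlation filter can be sketched in a vectorised form as well. The example below is an illustration on synthetic data (the `toy` frame and its columns are made up): it looks only at the upper triangle of the absolute correlation matrix, so each pair is checked once and the later column of a highly correlated pair is flagged.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is almost a multiple of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the entries above the diagonal; everything else becomes NaN
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag a column if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_drop)
```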

In [None]:
df.cov()

In [None]:
y = df['target_ages_enc'] #target column: the encoded target age group (column position 10)

In [None]:
df.drop('target_ages_enc', axis=1, inplace=True) #remove the target (column position 10) from the features

In [None]:
x=df

## Modeling

In [None]:
from sklearn.model_selection import train_test_split
xtr,xts,ytr,yts = train_test_split(x,y,test_size=0.2)
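
Because the split is random, accuracy figures change from run to run. A minimal sketch of a reproducible, stratified split follows, on made-up arrays; `random_state=42` and the variable names are arbitrary choices for illustration. Stratifying keeps the class proportions the same in the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, imbalanced target (15 of class 0, 5 of class 1)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# random_state fixes the shuffle; stratify preserves the 3:1 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(y_train), len(y_test), int(y_test.sum()))
```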

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
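knn_scores=[]  # accuracy for each k (the name `list` would shadow the built-in)
for i in range(1,21):
    knn=KNeighborsClassifier(n_neighbors=i)
    knn.fit(xtr,np.ravel(ytr))  # ravel: the classifier expects a 1-D target
    pred=knn.predict(xts)
    knn_scores.append(accuracy_score(yts,pred))
res1=max(knn_scores)  # best accuracy over k = 1..20
best_k=knn_scores.index(res1)+1
top5=["{:.2f}%".format(100*s) for s in sorted(knn_scores, reverse=True)[:5]]
print("K Nearest Neighbors Top 5 Success Rates are:" , top5)
print("Best k is", best_k, "with a success rate of", "{:.2f}%".format(100*res1))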
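
Two caveats apply to choosing k this way: KNN is distance based, so features on larger scales dominate, and picking k by test accuracy gives an optimistic estimate. The sketch below shows one common way to address both, on synthetic data from `make_classification` (not the Netflix data): features are standardised inside a pipeline, and k is chosen by cross-validation on the training set only. All names and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Mean 5-fold cross-validated accuracy for each k, using the training set only
cv_scores = {}
for k in range(1, 21):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)

# Refit with the selected k and evaluate once on the held-out test set
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("best k:", best_k, "test accuracy:", round(test_accuracy, 3))
```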

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()

rf.fit(xtr,ytr)
pred1=rf.predict(xts)
res2=accuracy_score(yts,pred1)
print("Random Forest Classifier Success Rate is :", "{:.2f}%".format(100*res2))
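
Accuracy alone does not show which classes are confused with each other. As a hedged sketch, again on synthetic three-class data rather than the Netflix features, a confusion matrix and a per-class report could be produced like this (`rf_demo`, `pred_demo` and the parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in data
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_train, y_train)
pred_demo = rf_demo.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, pred_demo, labels=[0, 1, 2])
print(cm)

# Precision, recall and F1 for each class
print(classification_report(y_test, pred_demo))
```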


In [None]:
lst = [res1 , res2]

In [None]:
lst

In [None]:
lst2 = ["KNearestNeighbours" , "RandomForest"]

In [None]:
plt.rcParams['figure.figsize']=20,8
sns.set_style('darkgrid')
ax = sns.barplot(x=lst2, y=[100*v for v in lst], palette = "husl", saturation =2.0)  # plot the accuracies as percentages
plt.xlabel('Classifier Models', fontsize = 20 )
plt.ylabel('% of Accuracy', fontsize = 20)
plt.title('Accuracy of different Classifier Models', fontsize = 20)
plt.xticks(fontsize = 12, horizontalalignment = 'center', rotation = 8)
plt.yticks(fontsize = 12)
for i in ax.patches:
    width, height = i.get_width(), i.get_height()
    x0, y0 = i.get_xy()  # separate names so the feature matrix x and the target y are not overwritten
    ax.annotate(f'{round(height,2)}%', (x0 + width/2, y0 + height*1.02), ha='center', fontsize = 'x-large')
plt.show()