![](https://logos-download.com/wp-content/uploads/2016/03/Netflix_logo_red.png)

# Introduction

Netflix was conceived in 1997 by Reed Hastings (the current CEO) and Marc Randolph. Both had previous experience in the West Coast tech scene: Hastings was the owner of the debugging software firm Pure Atria, while Randolph had cofounded, and then sold, the computer mail-order company MicroWarehouse for $700 million.
Netflix.com started life as a DVD rental service in 1998, an online rival to the then-dominant Blockbuster Video.

At the end of 2019, Netflix subscribers numbered 167.1 million. Of these, 61 million accounts were registered in the US, with the remaining 106.1 million (63%) spread over the rest of the globe.
International growth in Netflix subscriptions has far outpaced domestic growth in recent years; international users first came to account for the majority of all subscribers as recently as 2017. Since 2015, the number of international Netflix users has increased nearly fourfold, while domestic users have increased by less than 50%.

One of the technologies that made Netflix the technology giant it is today is its recommendation engine.
A recommendation engine, in simple terms, is a piece of code that recommends the most relevant items to a user based on their current choice or their history of choices. In this notebook, I build a simple recommendation engine using two techniques: weighted averages and content-based filtering.

# NOTE

I have used some sections of code from Krish Naik's notebook and would like to credit him. This project was made for study and learning purposes. I have also added my own changes to make the recommendation system more efficient and useful. The data I used is available on Kaggle, and I engineered the features to suit my needs.

In [None]:
import os
import pickle

import numpy as np
import pandas as pd

# List every file available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
imdb_df = pd.read_csv('/kaggle/input/netflix-data/IMDb movies.csv')
netflix_df = pd.read_csv('/kaggle/input/netflix-data/netflix_titles.csv')
netflix_df2 = pd.read_csv('/kaggle/input/netflix-data/NetflixViewingHistory.csv')
streaming_platforms_df = pd.read_csv('/kaggle/input/movies-on-netflix-prime-video-hulu-and-disney/MoviesOnStreamingPlatforms_updated.csv')

# Data-Engineering

In [None]:
streaming_platforms_df['title']=streaming_platforms_df['Title']
drop=['Unnamed: 0', 'ID','Year', 'Age','Type','Directors','Genres', 'Country', 'Language', 'Runtime','Title','Rotten Tomatoes','IMDb']
streaming_platforms_df.drop(drop, axis=1, inplace=True)

In [None]:
netflix_df2['title']=netflix_df2.Title

In [None]:
drop=['Title','Date']
netflix_df2.drop(drop, axis=1,inplace=True)
netflix_df2 = netflix_df2.drop_duplicates()

In [None]:
imdb_df.columns

In [None]:
drop = ['imdb_title_id','original_title','worlwide_gross_income','metascore','usa_gross_income','budget',
       'writer', 'duration', 'country', 'language', 'director','year', 'date_published']
imdb_df.drop(drop, axis=1, inplace=True)

In [None]:
imdb_df.head()

In [None]:
# Keep only movies; .copy() avoids modifying a view when columns are dropped in place later
netflix_df = netflix_df[netflix_df['type']=='Movie'].copy()

In [None]:
drop = ['show_id', 'cast', 'country','listed_in','rating','release_year','type','date_added','duration','description']
netflix_df.drop(drop, axis=1, inplace=True)

In [None]:
netflix_df = pd.merge(netflix_df, netflix_df2, how='outer', on='title')
netflix_df = netflix_df.drop_duplicates()
dataset = pd.merge(imdb_df,netflix_df, how='inner',on='title')

In [None]:
dataset.head()

# Weighted Averages Method

In the weighted averages method, I recommend movies based on both the number of votes cast by users (`votes`) and the average vote (the IMDb score, `avg_vote`). I could have simply recommended the movies with the highest IMDb scores, but a movie that is obscure or newly released may have a high average from only a handful of votes, so it is more suitable to take the vote count into account as well.
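
For reference, the score computed in the cells that follow can be written out with the same symbols the code uses. This appears to be the commonly cited IMDb-style weighted rating, although the notebook does not name its source, so treat that attribution as an assumption:

$$
W = \frac{R \cdot v + C \cdot m}{v + m} = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
$$

Here $v$ is the number of votes for a movie, $R$ is its average vote, $C$ is the mean vote across the whole dataset, and $m$ is a vote threshold (the 70th percentile of vote counts in this notebook). When $v \gg m$ the score stays close to $R$; when $v \ll m$ it is pulled toward $C$.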

In [None]:
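# Calculate all the components of the weighted average formula
v=dataset['votes']                    # number of votes per movie
R=dataset['avg_vote']                 # average vote (IMDb score) per movie
C=dataset['avg_vote'].mean()          # mean vote across the whole dataset
m=dataset['votes'].quantile(0.70)     # vote threshold: 70th percentile of vote counts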

In [None]:
dataset['weighted_average']=((R*v)+ (C*m))/(v+m)
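
To see how this expression behaves, here is a minimal sketch with made-up numbers. The function name `weighted_rating` and every value below are assumptions chosen for illustration, not figures from the dataset.

```python
def weighted_rating(R, v, C, m):
    """Shrink a movie's mean rating R (based on v votes) toward the global mean C."""
    return (R * v + C * m) / (v + m)

# Hypothetical values, chosen only for illustration
C = 6.0      # assumed mean rating across the catalogue
m = 1000     # assumed vote threshold

niche = weighted_rating(R=9.0, v=50, C=C, m=m)          # high rating, few votes
popular = weighted_rating(R=8.5, v=100_000, C=C, m=m)   # slightly lower rating, many votes

print(round(niche, 3), round(popular, 3))
```

The movie with only 50 votes ends up close to the global mean despite its higher raw rating, while the heavily voted movie keeps almost all of its own rating. That is the effect the ranking relies on.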

In [None]:
dataset.head()

In [None]:
df_sorted=dataset.sort_values('weighted_average',ascending=False)
df_sorted[['title', 'votes', 'avg_vote', 'weighted_average']].head(20)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
# df_sorted is already ordered by weighted_average, so take the top 20 rows directly
top20=df_sorted.head(20)
plt.figure(figsize=(12,6))
axis1=sns.barplot(x='weighted_average', y='title', data=top20)
plt.xlim(4, 10)
plt.title('Best Movies on Netflix by Weighted Average of IMDb Votes', weight='bold')
plt.xlabel('Weighted Average Score', weight='bold')
plt.ylabel('Movie Title', weight='bold')

# Content Filtering Method

In content-based filtering, I use features describing the content of a movie, such as genre, actors and description, to compute the similarity between a given movie and every other movie. I then select the 10 movies with the highest similarity values. The method has certain advantages and disadvantages:

### Advantages
1. Content-based filtering does not require user history to make a recommendation; it only needs to examine the content of the movies. In other words, even if a user is using the recommendation system for the first time, it will work just fine.

### Disadvantages
1. Content-based filtering takes a lot of time, because it has to process every movie and its content before it can make a recommendation.
2. Examining a large amount of data also requires a lot of memory; in particular, a pairwise similarity matrix grows with the square of the number of movies.
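
The overall idea of this section can be sketched on a toy corpus before applying it to the real data. Everything in the block below (`toy_titles`, `toy_text`, `toy_recommend`) is hypothetical and only meant to show the shape of the approach: vectorise the text, compute pairwise similarities, and return the nearest titles.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Hypothetical mini-catalogue; titles and blurbs are made up for illustration
toy_titles = ["space saga", "star voyage", "baking duel", "cake wars"]
toy_text = [
    "astronauts explore a distant galaxy in a spaceship",
    "a spaceship crew travels across the galaxy",
    "amateur bakers compete to bake the best cake",
    "bakers compete in a cake decorating contest",
]

# One TF-IDF row per title, then an n x n similarity matrix
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_text)
toy_sim = sigmoid_kernel(toy_matrix, toy_matrix)

def toy_recommend(title, k=1):
    """Return the k titles most similar to `title`, excluding the title itself."""
    idx = toy_titles.index(title)
    order = np.argsort(-toy_sim[idx])          # highest similarity first
    return [toy_titles[i] for i in order if i != idx][:k]

print(toy_recommend("space saga"))
print(toy_recommend("cake wars"))
```

Note that no user history appears anywhere in the sketch, and that `toy_sim` holds one entry for every pair of titles, which is where the memory cost listed above comes from.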

In [None]:
dataset['IMDb Score']=dataset['avg_vote']
dataset.drop('avg_vote',axis=1, inplace=True)
dataset.head(1)['description']

In [None]:
def augmentation(df, col1, col2, col3, col4, col5):
    """Fill col3 with col1, the first three names from col2, col4 and col5, separated by spaces."""
    def as_text(value):
        # Treat missing values as empty text instead of the literal string "nan"
        return "" if pd.isna(value) else str(value)

    def build(row):
        # Keep only the first three comma-separated cast members
        main_cast = " ".join(name.strip() for name in as_text(row[col2]).split(",")[:3])
        # Join with spaces so the last word of one field does not fuse with the first word of the next
        return " ".join([as_text(row[col1]), main_cast, as_text(row[col4]), as_text(row[col5])])

    df[col3] = df.apply(build, axis=1)

dataset["Information"]=""

augmentation(dataset,'description','actors','Information','genre','director')

In [None]:
def case_conversion(df, col1, col2):
    # Vectorised lower-casing of col1 into col2
    df[col2] = df[col1].str.lower()

dataset['title_lower'] = ""
case_conversion(dataset, "title", "title_lower")

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 6),
            stop_words = 'english')

# Filling NaNs with empty string
dataset['Information'] = dataset['Information'].fillna('')
dataset['description'] = dataset['description'].fillna('None')

In [None]:
dataset.to_csv('movie_dataset.csv', header=True, index=False)

In [None]:
# Fitting the TF-IDF on the 'Information' text
tfv_matrix = tfv.fit_transform(dataset['Information'])
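
One setting worth keeping in mind here is `min_df=3`, which drops any term that occurs in fewer than three documents. The sketch below uses a hypothetical four-document corpus (`docs`, `vec` and `matrix` are illustrative names) to show the effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: only "red" appears in at least three documents
docs = ["red car", "red bike", "red bus", "blue car"]

vec = TfidfVectorizer(min_df=3, token_pattern=r'\w{1,}')
matrix = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))   # terms that survived the min_df filter
print(matrix.shape)              # (number of documents, number of surviving terms)
print(matrix.toarray())          # the last row has no surviving terms
```

A document whose terms are all rare ends up as an all-zero row, so it would look equally (un)related to every other document. On a small dataset it may be worth checking how many rows of `tfv_matrix` are affected.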

In [None]:
from sklearn.metrics.pairwise import sigmoid_kernel

# Compute the sigmoid kernel
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
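
As a quick check of what this call computes: with default arguments, scikit-learn's `sigmoid_kernel` returns `tanh(gamma * <x, y> + coef0)` with `gamma = 1 / n_features` and `coef0 = 1`. The sketch below verifies this on a small hypothetical matrix `X` that stands in for TF-IDF rows.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Hypothetical 3 x 4 feature matrix standing in for TF-IDF rows
X = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

K = sigmoid_kernel(X, X)

# Reproduce the default behaviour by hand
gamma = 1.0 / X.shape[1]
manual = np.tanh(gamma * (X @ X.T) + 1.0)

print(np.allclose(K, manual))
```

Because `tanh` is monotonically increasing, ranking by this kernel gives the same order as ranking by the raw dot product. With the L2-normalised rows that `TfidfVectorizer` produces by default, that is also the order given by cosine similarity.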

In [None]:
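# Reverse mapping from lower-cased movie title to row position
indices = pd.Series(dataset.index, index=dataset['title_lower'])
# Keep the first row per title; Series.drop_duplicates() would compare the values, not the titles
indices = indices[~indices.index.duplicated(keep='first')]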
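
When building a title-to-position lookup like this one, it may help to remember that `Series.drop_duplicates` compares the values of the Series rather than its index labels. A minimal sketch with a hypothetical three-row frame (`toy`, `lookup` and `unique_lookup` are illustrative names):

```python
import pandas as pd

# Hypothetical frame with one repeated title
toy = pd.DataFrame({"title_lower": ["heat", "alien", "heat"]})

lookup = pd.Series(toy.index, index=toy["title_lower"])

# drop_duplicates looks at the values (0, 1, 2), which are all distinct
after_drop = lookup.drop_duplicates()
print(len(after_drop))             # the repeated label is still present
print(type(after_drop["heat"]))    # a Series of positions, not a single position

# Filtering on the index keeps only the first row for each title
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(unique_lookup["heat"])
```

If a lookup returns a Series instead of a single position, indexing the similarity matrix with it yields several rows at once, which is why a repeated title can break the recommendation function.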

In [None]:

def recommendations(title, sig=sig):
    # Get the row position corresponding to the lower-cased title
    title = title.lower()
    idx = indices[title]

    # Get the pairwise similarity scores
    sig_scores = list(enumerate(sig[idx]))

    # Sort the movies by similarity, highest first
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Scores of the 10 most similar movies (position 0 is normally the movie itself, so skip it)
    sig_scores = sig_scores[1:11]

    # Movie indices
    movie_indices = [i[0] for i in sig_scores]

    # Top 10 most similar movies
    return dataset.iloc[movie_indices]
    

In [None]:
df = recommendations("the green mile")
data = df[['title','genre','description','IMDb Score','actors']].head(10)
data

# Conclusion

In this project, I tried to study, understand and implement some of the algorithms used in modern recommendation engines. In future, I will try other techniques such as collaborative filtering and hybrid recommender systems. Through this notebook, I tried to explain the theoretical aspects along with the practical implementation of what I learned while working on this project. I hope this notebook helps you in some way. Thanks for your time.