# **NETFLIX MOVIE RECOMMENDATION SYSTEM**
In this notebook, I have created classes and functions to automate the code.\
You can call individual functions in the class or call the automate() function to run them all in sequence.\
This approach makes the code cleaner but reduces flexibility.

In [1]:
#installing required library
%pip install neattext

Collecting neattext
  Downloading neattext-0.1.3-py3-none-any.whl.metadata (12 kB)
Downloading neattext-0.1.3-py3-none-any.whl (114 kB)
Installing collected packages: neattext
Successfully installed neattext-0.1.3


In [2]:
#importing required libraries for the recommendation system
import pandas as pd
import numpy as np
import neattext.functions as nfx
from sklearn.feature_extraction.text import CountVectorizer #You can also use TfidfVectorizer instead of CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#importing libraries for better visualization of the table
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'colab'

#importing time for adding delays at runtime to make the process appear sequential and automated
import time

import warnings
warnings.filterwarnings('ignore')

# **CREATING A CLASS AND FUNCTIONS TO AUTOMATE THE PROCESS**

The **recommendationEngine** class reads `netflix_titles.csv` into a data frame when it is instantiated.

## **Functions**

### **1. clean_data(type):**
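Takes type=***Movie*** or ***TV Show*** as input, filters the dataset to that type, and cleans the result: missing ratings (and, for TV shows, missing directors) are filled with a placeholder string, and rows with any remaining null values are dropped. Returns 2 data frames: the full cleaned data frame and a copy restricted to the title, director, cast, country, genre, rating, and type columns.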

### **2. remove_stop_char(data):**
Takes the Movie or TV Show dataset as input and removes stopwords (the, a, and, etc.) from the text columns and special characters (!, &, %, etc.) from the country column, as they might skew the analysis and degrade the recommendations. Returns the cleaned dataset.
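
As a rough illustration of what this cleaning step does, the sketch below removes stopwords and special characters with plain Python. The helper names `drop_stopwords` and `drop_special_characters` and the tiny `STOPWORDS` set are assumptions made for this example; the notebook itself relies on neattext, whose word list and rules are more extensive and may differ in detail.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for this sketch
STOPWORDS = {"the", "a", "an", "and", "of"}

def drop_stopwords(text):
    """Drop words that appear in STOPWORDS (case-insensitive match)."""
    return " ".join(word for word in text.split() if word.lower() not in STOPWORDS)

def drop_special_characters(text):
    """Keep only ASCII letters, digits, and spaces."""
    return re.sub(r"[^A-Za-z0-9 ]+", "", text)

print(drop_stopwords("The Lord of the Rings"))       # Lord Rings
print(drop_special_characters("Hong Kong, China!"))  # Hong Kong China
```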

### **3. vectorization(data,column,token=True):**
Takes a dataset, a column name, and token=True (default) or False as input. It vectorizes the given column: the text of each cell is split into tokens, each distinct token becomes its own column, and each value is 1 or 0 depending on whether that token is present in that row. With token=True the text is split on commas, so each comma-separated entry (for example, a cast member or a genre) is one token; with token=False the vectorizer's default word-level tokenization is used. Returns an array of the vectorized column.
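
The binary encoding described above can be sketched from scratch in a few lines. This is only an illustration of the idea, not the notebook's implementation: the helper `binary_vectorize` is a hypothetical name, and stripping whitespace around each token is a simplification chosen for this sketch.

```python
def binary_vectorize(cells, sep=","):
    """Return (vocabulary, rows) with one 0/1 column per distinct token."""
    # Split each cell on the separator; lowercase and strip every token
    tokenized = [
        {tok.strip().lower() for tok in cell.split(sep) if tok.strip()}
        for cell in cells
    ]
    # The sorted union of all tokens defines the column order
    vocabulary = sorted(set().union(*tokenized))
    rows = [
        [1 if term in tokens else 0 for term in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, rows

genres = ["Dramas, International Movies", "Comedies, Dramas", "Comedies"]
vocab, matrix = binary_vectorize(genres)
print(vocab)   # ['comedies', 'dramas', 'international movies']
print(matrix)  # [[0, 1, 1], [1, 1, 0], [1, 0, 0]]
```

Each row is one title and each column is one token, which is the shape the later steps expect.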

### **4. binary(type,cast,country,genre,director=1):**
Takes as input:
1. type=***Movie*** or ***TV Show***
2. cast column's vectorized array
3. country column's vectorized array
4. genre column's vectorized array
5. director= 1(default, placeholder) or director column's vectorized array

Converts all arrays to data frames in which titles are rows and tokens are columns, and concatenates them horizontally (axis=1). The director array is included only when type is ***Movie***. Returns this concatenated data frame.
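
To make the horizontal concatenation concrete, here is a small sketch using plain lists instead of data frames. The function name `concat_feature_blocks` and the toy arrays are assumptions for illustration only.

```python
def concat_feature_blocks(*blocks):
    """Join per-column 0/1 blocks side by side, one combined row per title."""
    n_rows = len(blocks[0])
    if any(len(block) != n_rows for block in blocks):
        raise ValueError("every block needs one row per title")
    # Row i of the result is row i of each block, laid end to end
    return [
        [value for block in blocks for value in block[i]]
        for i in range(n_rows)
    ]

# Toy vectorized columns for two titles
cast = [[1, 0], [0, 1]]
country = [[1], [1]]
genre = [[0, 1, 1], [1, 1, 0]]

combined = concat_feature_blocks(cast, country, genre)
print(combined)  # [[1, 0, 1, 0, 1, 1], [0, 1, 1, 1, 1, 0]]
```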

### **5. cosine(df_binary,data):**
Takes the concatenated data frame returned by the binary() function and the corresponding dataset (used only to label the progress message) as input and calculates the cosine similarity. Cosine similarity is the cosine of the angle between 2 vectors; here each vector is the vectorized data of one movie or TV show.

$$
\text{Cosine Similarity (A, B)} = \frac{A \cdot B}{\|A\| \|B\|}
$$

Where A and B are of the form [0,1,1,0,0,...], $A \cdot B$ is the dot product of the vectors, and $\|A\| \|B\|$ is the product of their magnitudes.

The resulting output is a square matrix whose size is the number of movies/TV shows. Because every vector entry is 0 or 1, each cell holds a value between 0 and 1, where 0 means the 2 titles share no features and 1 means their feature vectors are identical.

The higher the value, the more similar the titles. Returns this similarity matrix.
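
A small worked example may help here. The sketch below computes the formula by hand for three toy binary vectors; the helper names `cosine_sim` and `similarity_matrix` are hypothetical, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption for this sketch: no features means no similarity
    return dot / (norm_a * norm_b)

def similarity_matrix(rows):
    """Square matrix of pairwise cosine similarities."""
    return [[cosine_sim(a, b) for b in rows] for a in rows]

# Three toy titles described by four binary features
titles = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
sim = similarity_matrix(titles)
print([[round(value, 3) for value in row] for row in sim])
# [[1.0, 0.816, 0.0], [0.816, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

The first two titles share two features, so their score is 2 / (√3 · √2) ≈ 0.816; the third shares nothing with either and scores 0.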

### **6. recommedation(title,df_movies,df_tv,movie_sim,tv_sim):**
Takes as input:
1. title = The movie or TV show to base the recommendations on
2. The movie dataset
3. The TV Show dataset
4. The cosine similarity matrix of movies
5. The cosine similarity matrix of TV Shows

It looks up the index of the title in the movie or TV show dataset, takes the matching row of the cosine similarity matrix, and sorts it in descending order so the most similar titles come first. It then skips the first entry (the title itself), selects the next 5 entries, finds those movies/TV shows in the dataset by index, and returns a data frame of those 5 titles with their similarity scores. If the title is in neither dataset, it prints 'Title not found'.
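
The lookup-and-sort logic can be sketched without pandas as follows. The function `top_k_similar` and the toy data are assumptions for illustration; unlike the class method, this sketch excludes the query title by index rather than by dropping the first sorted entry, and returns an empty list for an unknown title.

```python
def top_k_similar(titles, sim_matrix, title, k=5):
    """Return up to k (title, score) pairs most similar to the given title."""
    if title not in titles:
        return []
    index = titles.index(title)
    # Pair every position with its similarity to the query, highest first
    ranked = sorted(enumerate(sim_matrix[index]), key=lambda pair: pair[1], reverse=True)
    # Leave out the query title itself, then keep the first k
    others = [(i, score) for i, score in ranked if i != index]
    return [(titles[i], score) for i, score in others[:k]]

titles = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.8, 0.0, 0.5],
    [0.8, 1.0, 0.1, 0.3],
    [0.0, 0.1, 1.0, 0.2],
    [0.5, 0.3, 0.2, 1.0],
]
print(top_k_similar(titles, sim, "A", k=2))  # [('B', 0.8), ('D', 0.5)]
```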

### **7. table(df):**
Takes the recommended dataset as input and uses Plotly to display it as an interactive table.

### **8. automate():**
Automates the entire process of creating a recommendation engine.

Calls the functions/class in following order:
1. engine=recommendationEngine() - Class
2. engine.clean_data() - Twice, once for Movie and once for TV Shows
3. engine.remove_stop_char() - Twice, once for Movie and once for TV Shows
4. engine.vectorization() - Multiple times, for all required columns of both datasets
5. engine.binary() - Twice, once for Movie and once for TV Shows
6. engine.cosine() - Twice, once for Movie and once for TV Shows

Returns:
1. engine = class it initialized
2. Movies Data Frame
3. TV Shows Data Frame
4. Cosine Similarity of Movies
5. Cosine Similarity of TV Shows
6. The original Dataset

**NOTE: After running recommendationEngine.automate(), you can call engine.recommedation() (spelled as defined in the class) to get the recommended movies/TV shows, or engine.table(engine.recommedation()) to display them using Plotly.**

**Also, if you want, you can call each function individually.**

In [6]:
class recommendationEngine:

  def __init__(self):
    self.df=pd.read_csv('netflix_titles.csv')

  def clean_data(self,type):
    self.df.rename(columns={'listed_in':'genre'},inplace=True)
    df_temp=self.df[self.df['type']==type].copy()
    df_temp.reset_index(drop=True,inplace=True)

    # Filling missing values in the director and rating columns with the string 'NaN'
    # so that dropna() below does not discard those rows
    if type == 'TV Show':
      df_temp['director'] = df_temp['director'].fillna('NaN')
    df_temp['rating'] = df_temp['rating'].fillna('NaN')

    # Dropping null values
    df_temp.dropna(inplace= True)
    df_temp.reset_index(drop=True, inplace=True)

    time.sleep(1)
    print(f'Shape of {type} dataframe is {df_temp.shape[0]} rows and {df_temp.shape[1]} columns\n')

    temp=df_temp[['title','director','cast','country','genre','rating','type']].copy()
    time.sleep(1)
    print(f'Few statistics about the columns of the {type} Dataset are\n{temp.describe().T}\n')

    return df_temp, temp

  #Removing stopwords and special characters since they have negligible influence on text analysis
  def remove_stop_char(self,data):
    data['cast']=data['cast'].apply(nfx.remove_stopwords)
    data['country']=data['country'].apply(nfx.remove_stopwords)
    data['genre']=data['genre'].apply(nfx.remove_stopwords)
    data['country']=data['country'].apply(nfx.remove_special_characters)

    if data['type'].unique()[0]=='Movie':
      data['director']=data['director'].apply(nfx.remove_stopwords)
      time.sleep(1)
      print('Removed Stopwords and Special Characters from Movies Dataset')
    else:
      time.sleep(1)
      print('Removed Stopwords and Special Characters from TV Show Dataset')
    return data

  #Vectorizing Data
  def vectorization(self,data,column,token=True):
    if token:
      # One token per comma-separated entry (e.g. a cast member or a genre)
      countVector = CountVectorizer(binary=True, tokenizer=lambda x:x.split(','))
    else:
      # Default word-level tokenization
      countVector = CountVectorizer(binary=True)
    time.sleep(1)
    print(f'Vectorized {column} column of {data.type.unique()[0]} dataset')
    return countVector.fit_transform(data[column]).toarray()

  def binary(self,type,cast,country,genre,director=1):
    if type=='Movie':
      time.sleep(1)
      print('Converting Movies data to Binary')
      binary_director=pd.DataFrame(director)
    else:
      time.sleep(1)
      print('Converting TV Show data to Binary')
    binary_cast=pd.DataFrame(cast)
    binary_country=pd.DataFrame(country)
    binary_genre=pd.DataFrame(genre)

    # Concating Dataframe
    if type=='Movie':
      df_binary = pd.concat([binary_director,binary_cast,binary_country,binary_genre],axis=1, ignore_index=True)
    else:
      df_binary = pd.concat([binary_cast,  binary_country, binary_genre], axis=1,ignore_index=True)
    time.sleep(1)
    print('Converted\n')
    return df_binary

  def cosine(self,df_binary,data):
    time.sleep(1)
    print(f'Calculating Cosine Similarity of {data.type.unique()[0]} data')
    cos_sim = cosine_similarity(df_binary)
    time.sleep(1)
    print('Cosine Similarity Calculated\n')
    return cos_sim

  def recommedation(self,title,df_movies,df_tv,movie_sim,tv_sim):
    if title in df_movies.title.values:
      index=df_movies[df_movies.title == title].index.item()
      scores=dict(enumerate(movie_sim[index]))
      sorted_scores=dict(sorted(scores.items(),key=lambda x:x[1],reverse=True))

      selected_movies_index=[id for id, scores in sorted_scores.items()]
      selected_movies_score=[scores for id, scores in sorted_scores.items()]

      # .copy() so the similarity column is added to a new frame, not a view of df_movies
      recommend_movies=df_movies.iloc[selected_movies_index].copy()
      recommend_movies['similarity'] = selected_movies_score

      movie_recommend = recommend_movies.reset_index(drop=True)
      return movie_recommend[1:6]

    elif title in df_tv['title'].values:
      index=df_tv[df_tv.title == title].index.item()
      scores=dict(enumerate(tv_sim[index]))
      sorted_scores=dict(sorted(scores.items(),key=lambda x:x[1],reverse=True))

      selected_tv_index=[id for id, scores in sorted_scores.items()]
      selected_tv_score=[scores for id, scores in sorted_scores.items()]

      # .copy() so the similarity column is added to a new frame, not a view of df_tv
      recommend_tv=df_tv.iloc[selected_tv_index].copy()
      recommend_tv['similarity'] = selected_tv_score

      tv_recommend = recommend_tv.reset_index(drop=True)
      return tv_recommend[1:6]

    else:
      print('Title not found')

  def table(self,df):
    fig = go.Figure(data=[go.Table(
        columnorder=[1, 2, 3, 4, 5],
        columnwidth=[20, 20, 20, 30, 50],
        header=dict(values=list(['Type', 'Title', 'Country', 'Genre(s)', 'Description']),
                    line_color='black', font=dict(color='black', family="Gravitas One", size=20), height=40,
                    fill_color='#FF6865',
                    align='center'),
        cells=dict(values=[df.type, df.title, df.country, df.genre, df.description],
                   font=dict(color='black', family="Lato", size=16),
                   fill_color='#FFB3B2',
                   align='left'))
    ])

    fig.update_layout(height=700,
                      title={'text': "Top Movie Recommendations", 'font': {'size': 22, 'family': 'Gravitas One'}},
                      title_x=0.5
                      )
    fig.show()

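  # Takes no self, so it is declared static; call it as recommendationEngine.automate()
  @staticmethod
  def automate():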
    engine=recommendationEngine()
    time.sleep(1)
    print('RECOMMENDATION ENGINE CALLED\n')

    df_movies,movies=engine.clean_data('Movie')
    df_tv,tv=engine.clean_data('TV Show')

    movies=engine.remove_stop_char(movies)
    tv=engine.remove_stop_char(tv)
    print('\n')

    country = engine.vectorization(movies,'country',False)
    director = engine.vectorization(movies,'director')
    cast = engine.vectorization(movies,'cast')
    genre = engine.vectorization(movies,'genre')

    tv_country = engine.vectorization(tv,'country',False)
    tv_cast = engine.vectorization(tv,'cast')
    tv_genre = engine.vectorization(tv,'genre')
    print('\n')

    movie_binary = engine.binary('Movie',cast,country,genre,director)
    movie_sim = engine.cosine(movie_binary,df_movies)

    tv_binary = engine.binary('TV Show',tv_cast,tv_country,tv_genre)
    tv_sim = engine.cosine(tv_binary,df_tv)

    time.sleep(1)
    print('RECOMMENDATION ENGINE CREATED\n')
    time.sleep(1)
    print('USE engine.recommedation("title",df_movies,df_tv,movie_sim,tv_sim) TO GET RECOMMENDATIONS IN TABULAR FORM USING PANDAS\n')
    time.sleep(1)
    print('USE engine.table(engine.recommedation("title",df_movies,df_tv,movie_sim,tv_sim)) TO GET RECOMMENDATIONS IN TABULAR FORM USING PLOTLY\n')

    return engine,df_movies,df_tv,movie_sim,tv_sim,engine.df

In [7]:
engine,df_movies,df_tv,movie_sim,tv_sim,df = recommendationEngine.automate()

RECOMMENDATION ENGINE CALLED

Shape of Movie dataframe is 5186 rows and 12 columns

Few statistics about the columns of the Movie Dataset are
         count unique                           top  freq
title     5186   5186                       Sankofa     1
director  5186   3829        Raúl Campos, Jan Suter    18
cast      5186   5062                   Samuel West    10
country   5186    594                 United States  1819
genre     5186    268  Dramas, International Movies   336
rating    5186     15                         TV-MA  1741
type      5186      1                         Movie  5186

Shape of TV Show dataframe is 2015 rows and 12 columns

Few statistics about the columns of the TV Show Dataset are
         count unique                 top  freq
title     2015   2015       Blood & Water     1
director  2015    142                 NaN  1868
cast      2015   1982  David Attenborough    14
country   2015    184       United States   618
genre     2015    219            Kids

In [8]:
engine.table(engine.recommedation('Coffee & Kareem',df_movies,df_tv,movie_sim,tv_sim))

In [9]:
engine.recommedation('Kota Factory',df_movies,df_tv,movie_sim,tv_sim)

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | genre | description | similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | s3294 | TV Show | Little Things | | Dhruv Sehgal, Mithila Palkar | India | November 9, 2019 | 2019 | TV-MA | 3 Seasons | International TV Shows, Romantic TV Shows, TV ... | A cohabiting couple in their 20s navigate the ... | 0.471405 |
| 2 | s7873 | TV Show | Rishta.com | | Shruti Seth, Kavi Shastri, Siddhant Karnick, K... | India | March 15, 2018 | 2010 | TV-14 | 1 Season | International TV Shows, Romantic TV Shows, TV ... | Partners at an Indian matrimonial agency face ... | 0.408248 |
| 3 | s7454 | TV Show | Midnight Misadventures With Mallika Dua | | Mallika Dua | India | April 1, 2019 | 2018 | TV-14 | 1 Season | International TV Shows, Stand-Up Comedy & Talk... | In this talk show, comedian Mallika Dua serves... | 0.387298 |
| 4 | s8776 | TV Show | Yeh Meri Family | | Vishesh Bansal, Mona Singh, Akarsh Khurana, Ah... | India | August 31, 2018 | 2018 | TV-PG | 1 Season | International TV Shows, TV Comedies | In the summer of 1998, middle child Harshu bal... | 0.365148 |
| 5 | s1590 | TV Show | Bhaag Beanie Bhaag | | Swara Bhasker, Dolly Singh, Ravi Patel, Varun ... | India | December 4, 2020 | 2020 | TV-MA | 1 Season | International TV Shows, Romantic TV Shows, TV ... | Facing disapproving parents, a knotty love lif... | 0.365148 |
