# What is a Recommendation System?
A recommender (or recommendation) system is a subclass of information filtering systems that seeks to predict the rating or preference a user would give to an item.

They are primarily used in applications where a person or entity interacts with a product or service. To improve the user's experience, we personalize the product to their needs, and to do that we look at their past interactions with it.

*In one line* -> **Specialized content for everyone.**

*For further info, [Wiki](https://en.wikipedia.org/wiki/Recommender_system#:~:text=A%20recommender%20system%2C%20or%20a,would%20give%20to%20an%20item.)*

## Types of Recommender System
* 1). Popularity Based
* 2). Classification Based
* 3). Content Based
* 4). Collaborative Based
* 5). Hybrid Based (Content + Collaborative)
* 6). Association Based Rule Mining

# Content based recommender system
Recommends content based on the product's description. Here we convert each movie's metadata (genre, director, actors and keywords from its description) into a word-count vector and compare the vectors by cosine similarity. Movies similar to a given one have a high cosine similarity with it and are therefore recommended to the user.
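
As a rough illustration of the measure itself, the sketch below computes cosine similarity by hand with NumPy. The three vectors and the four-word vocabulary are made up for this example; the notebook itself uses scikit-learn's `cosine_similarity` further down.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word counts over the vocabulary ["crime", "drama", "space", "war"]
movie_a = [1, 1, 0, 0]
movie_b = [1, 1, 0, 1]
movie_c = [0, 0, 1, 1]

sim_ab = cosine(movie_a, movie_b)  # share "crime" and "drama"
sim_ac = cosine(movie_a, movie_c)  # share no words
print(round(sim_ab, 3), round(sim_ac, 3))
```

Movies that share many words point in nearly the same direction and score close to 1, while movies with no words in common score 0.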

# Import packages and dataset

We will use the rake_nltk package. RAKE stands for Rapid Automatic Keyword Extraction, a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
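
To give a feel for the idea, here is a simplified, pure-Python sketch of RAKE-style phrase splitting and word degrees. It is not the rake_nltk implementation: the tiny `STOPWORDS` set and the helper names `candidate_phrases` and `word_degrees` are assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Tiny stand-in stopword list (assumption); rake_nltk uses NLTK's English stopwords
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "who", "his", "by", "with"}

def candidate_phrases(text):
    """Split text into candidate phrases at stopwords and punctuation."""
    phrases = []
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        phrase = []
        for word in chunk.split():
            if word in STOPWORDS:
                if phrase:
                    phrases.append(phrase)
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            phrases.append(phrase)
    return phrases

def word_degrees(phrases):
    """Degree of a word = summed length of every phrase it appears in."""
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            degree[word] += len(phrase)
    return dict(degree)

text = "A young wizard joins a school of magic, and the dark wizard returns."
phrases = candidate_phrases(text)
degrees = word_degrees(phrases)
print(phrases)
print(degrees)
```

Words that appear often and inside long phrases (here "wizard") get a high degree, which is the kind of signal the keyword extraction step relies on.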

*Credits to -> [csurfer](https://github.com/csurfer/rake-nltk)*

In [None]:
!pip install rake_nltk

In [None]:
import pandas as pd
import numpy as np

from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer #converts a collection of text docs into a matrix of token counts
from ast import literal_eval #safely evaluates a string containing a Python literal (not used below)

In [None]:
data = pd.read_csv('../input/imdb-extensive-dataset/IMDb movies.csv')
print(data.shape)
data.head()

In [None]:
#There are many null values
data.isnull().sum()

In [None]:
#Let's convert all null values into 'missing value'
data = data.fillna('missing value')

### Recommend movies based on a director/ writer

In [None]:
#Recommend movies based on a director (please give the full name; == only matches movies with exactly this one director)
#rec_director = input('Enter director you want to be recommended movies of: ')
rec_director = 'Christopher Nolan'
data[data['director'] == rec_director]

#Recommend movies based on a writer (please give the full name)
#rec_writer = input('Enter writer you want to be recommended movies of: ')
#data[data['writer'] == rec_writer]

### Recommend movies based on actor

In [None]:
#rec_actor = input('Enter actor you want to be recommended movies of: ')
rec_actor = 'Ryan Gosling'
rec_actor = data[data['actors'].str.contains(rec_actor)] 

In [None]:
rec_actor
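
The two lookups above can also be wrapped in one small helper. This is only a sketch on a made-up three-row table: `movies` and `movies_with` are hypothetical names, and the column names mirror the dataset. Unlike the `==` comparison, a substring match also finds co-directed movies, and `regex=False` keeps characters such as `.` or `(` in a name from being read as a pattern.

```python
import pandas as pd

# Hypothetical miniature of the IMDb table, for illustration only
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "director": ["Jane Doe", "John Roe", "Jane Doe, John Roe"],
    "actors": ["Ann Lee, Bob Ray", "Bob Ray", "missing value"],
})

def movies_with(df, column, name):
    """Rows whose `column` mentions `name`, ignoring case and regex characters."""
    mask = df[column].str.contains(name, case=False, regex=False)
    return df[mask]

by_director = movies_with(movies, "director", "jane doe")
by_actor = movies_with(movies, "actors", "Bob Ray")
print(by_director["title"].tolist())
print(by_actor["title"].tolist())
```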

# Data Preprocessing

**Things to do:**
* Impute all missing values
* Extract only relevant columns
* Convert all columns into lower case
* Split the comma separated genres and names into lists
* Merge each director and actor name into a single word, so that a full name acts as one token during text extraction

In [None]:
data.columns

In [None]:
#Extract relevant columns that would influence a movie's rating based on the content.

#Due to memory issues we use just the first 3k rows. You can try this code on Google Colab for better performance
data1 = data[['title','genre','director','actors','description']].head(3000)
data1.head()

Remember that the more columns you extract here, the higher the chance of overfitting, as the movies recommended will also take into account director, writer, production_company, et al. These features may be irrelevant to a user who wants to be recommended a movie based on their own preferences.

In [None]:
data1.isnull().sum()

In [None]:
#Impute all missing values
data1 = data1.fillna('missing value')

In [None]:
#Convert all columns into lower case
data1 = data1.applymap(lambda x: x.lower() if isinstance(x, str) else x)
data1.head()

In [None]:
#Use genre as a list of words
data1['genre'] = data1['genre'].map(lambda x: x.split(','))
data1['genre']

In [None]:
#Similarly let's split the comma separated names into a list of names
data1[['director','actors']] = data1[['director','actors']].applymap(lambda x: x.split(',')) #applymap is used for more than 1 column, map for 1 column
data1[['director','actors']].head()

In [None]:
#Combine each director and actor name into 1 word, this will be used for text extraction
#(assigning to the rows yielded by iterrows() may not change data1, so we map over the columns instead)

for col in ['actors', 'director']:
    data1[col] = data1[col].map(lambda names: [x.replace(' ','') for x in names])

In [None]:
data1.head()

## For content based movie recommendation we have to use NLP techniques like
* Keyword extraction -> Extract keywords from description
* Bag of Words Creation -> Collect all words of a row into one bag
* Count Vectorizer -> Count frequency of words in this BOW
* Cosine Similarity -> Find cosine similarity between the count vectors of all movies




# Keyword Extraction
Keyword extraction is automatic detection of terms that best describe the subject of a document. We will use Rake to extract keywords from description.

*For more info -> [Wiki](https://en.wikipedia.org/wiki/Keyword_extraction)*

**Things to do:**
* Create an empty 'keywords' column
* Go through all rows to extract the keywords from description
* Get the dictionary of keywords and their scores
* Store the keywords in the 'keywords' column of the dataframe

In [None]:
#Create an empty 'keywords' column
data1['keywords'] = ''

In [None]:
#Go through all rows to extract all keywords from description

#instantiate Rake once, by default it uses English stopwords from NLTK and discards all punctuation chars
r = Rake()

def extract_keywords(description):
    #extract key phrases by passing the text
    r.extract_keywords_from_text(description)
    #get the dictionary with key words and their degree scores, keep only the words
    return list(r.get_word_degrees().keys())

#assign keywords to the new column (writing to rows yielded by iterrows() may not update data1)
data1['keywords'] = data1['description'].map(extract_keywords)

#drop description, its keywords now stand in for it
data1 = data1.drop(columns = ['description'])

In [None]:
data1.set_index('title', inplace = True)
data1.head()

# Bag of Words Creation

This is an important technique used in NLP and information retrieval: a text is represented as the bag of its words, ignoring grammar and word order *(in our case the text is the combined genre, director, actors and keywords of a movie)*. The occurrence of every word is then used as a feature, here for measuring the similarity between movies.

*For more info, -> [Wiki](https://en.wikipedia.org/wiki/Bag-of-words_model)*

**Things to do:**
* Create an empty 'bow' column
* For every row, combine genre, director names, actor names and keywords into one string

In [None]:
#Columns whose word lists make up the bag of words
#(leave 'director' out of this list if you do not want director names in the bow)
bow_columns = ['genre', 'director', 'actors', 'keywords']

#Join the words of every column into one space separated string per row
data1['bow'] = data1[bow_columns].apply(
    lambda row: ' '.join(' '.join(row[col]) for col in bow_columns), axis = 1)

#Optionally keep only the bow column
#data1.drop(columns = [col for col in data1.columns if col != 'bow'], inplace = True)

In [None]:
data1.head()

# Count Vectorizer

Converts a collection of text documents into a matrix of token counts. Each row of this count matrix is a document (here, one movie's bag of words), each column is a word from the vocabulary, and each cell holds how often that word occurs in that document.

*For more info -> [Count Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)*

**Things to do:**
* Instantiate & fit CountVectorizer on 'bow' -> to create count_matrix, which is the input for cosine similarity
* 'title' is the index of the dataframe as we saw above, hence we convert it to a Series -> to map each title to its row number
* Understand the count_matrix -> Check its shape and type
* Convert sparse count_matrix to a dense matrix -> To inspect its values; the dense form needs far more memory, *For more info -> [Sparse Matrices](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/)*
* Sparse entries for a sample row
* Check all words in the vocabulary
* Generate cosine similarity for count_matrix

In [None]:
#instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(data1['bow'])

#create a Series for movie titles so they are associated to an ordered numerical list, we will use this later to match index
indices = pd.Series(data1.index)
indices[:5]

In [None]:
#Shape count_matrix
count_matrix

In [None]:
type(count_matrix)

In [None]:
#Convert sparse count_matrix to a dense matrix (only for inspection, it needs much more memory)
c = count_matrix.todense()
c
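
To see why the count matrix is kept sparse, the sketch below compares storage sizes on a made-up matrix. The shape (1,000 documents by 5,000 terms) and the ten counts per row are assumptions chosen for illustration, not figures from this dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical count matrix: 1,000 documents x 5,000 terms, ten counts per row
rng = np.random.default_rng(0)
rows = np.repeat(np.arange(1000), 10)
cols = rng.integers(0, 5000, size=rows.size)
vals = np.ones(rows.size, dtype=np.int64)
sparse = csr_matrix((vals, (rows, cols)), shape=(1000, 5000))

# The dense form stores every cell, zeros included
dense = sparse.toarray()

# The sparse form stores only the non-zero values plus their positions
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"sparse: {sparse_bytes:,} bytes, dense: {dense.nbytes:,} bytes")
```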

In [None]:
#Print count_matrix for 0th row
print(count_matrix[0,:]) #This shows the (row, column index) and count of every word present in bow of 0th row

In [None]:
#Gives vocabulary of all words in 'bow', mapped to their column index in count_matrix (not their counts)
count.vocabulary_
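
A small sketch of how `vocabulary_` relates to the count matrix, using two made-up bags of words: the dictionary holds column positions, while the counts live in the matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["crime drama crime", "drama romance"]  # hypothetical bags of words
vec = CountVectorizer()
matrix = vec.fit_transform(docs)

# vocabulary_ maps each term to its column index
vocab = {term: int(col) for term, col in vec.vocabulary_.items()}
print(vocab)

# the counts themselves are in the matrix, one row per document
counts = matrix.toarray()
print(counts)
```

Here "crime" sits in column 0, and the value 2 in the first row of that column says it occurs twice in the first document.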

# Calculate Cosine similarity

In [None]:
#generating the cosine similarity matrix

cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

# Recommend top n movies given a movie name

**Things to do:**
* Create empty list
* Get index of the movie that matches this title
* Find highest cosine_sim this title shares with other titles extracted earlier and save it in a Series
* Get indexes of the 'n' most similar movies
* Populate list with titles of n matching movies

In [None]:
#Let's build a function that takes in a movie and recommends the top n movies

def recommendations(title,n,cosine_sim = cosine_sim):
    recommended_movies = []
    
    #get index of the movie that matches the title
    matches = indices[indices == title]
    if matches.empty:
        raise ValueError(f"'{title}' not found, titles are stored in lower case")
    idx = matches.index[0]
    
    #sort the cosine_sim this title shares with the other titles, highest first
    #drop the movie itself so it is never recommended, even when scores tie
    score_series = pd.Series(cosine_sim[idx]).drop(idx).sort_values(ascending = False)
    
    #get indexes of the 'n' most similar movies
    top_n_indexes = list(score_series.iloc[:n].index)
    
    #populating the list with titles of n matching movies
    for i in top_n_indexes:
        recommended_movies.append(data1.index[i])
    return recommended_movies

In [None]:
#movie = input("Enter the movie name for which you want similar movies recommended: ").lower()
movie = 'cleopatra'
#n = int(input("How many movies do you want to be recommended: "))
n = 10

In [None]:
movie

In [None]:
recommendations(movie, n)

**What is the index of the movie you requested?**

In [None]:
indices[indices == movie].index[0]

**What is the cosine similarity this movie shares with all other movies?**

In [None]:
pd.Series(cosine_sim[indices[indices == movie].index[0]])
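
To check the whole flow by hand, here is a toy version on four made-up movies. All titles, bags of words and the names `toy`, `toy_sim` and `toy_recommendations` are hypothetical. It is a sketch of the same steps, not a replacement for the function above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical bags of words for four made-up movies
toy = pd.DataFrame(
    {"bow": [
        "crime drama janedoe annlee heist bank",
        "crime thriller janedoe bobray heist city",
        "romance comedy johnroe annlee wedding",
        "scifi adventure johnroe bobray space",
    ]},
    index=["heist one", "heist two", "wedding day", "star trip"],
)

# Count the words of every bag, then compare every movie with every other movie
toy_sim = cosine_similarity(CountVectorizer().fit_transform(toy["bow"]))

def toy_recommendations(title, n):
    """Titles of the n movies most similar to `title`, excluding the movie itself."""
    idx = toy.index.get_loc(title)
    scores = pd.Series(toy_sim[idx], index=toy.index).drop(title)
    return scores.sort_values(ascending=False).head(n).index.tolist()

top_two = toy_recommendations("heist one", 2)
print(top_two)
```

"heist two" shares three of its six words with "heist one" (similarity 0.5), "wedding day" shares only one actor, and "star trip" shares nothing, so the ranking follows the overlap in content.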

***Thus we can recommend movies based on their content.***