# Movie Recommendation System Using Scikit-Learn

There are many types of recommender systems:
+ **Collaborative filtering** : The main idea behind these methods is to use other users’ preferences and tastes to recommend new items to a user. The usual procedure is to find similar users (or items) and recommend items that those users liked, on the assumption that the user receiving the recommendation will like them too.
+ **Content-Based filtering** : Content-based filtering methods are based on a description of the item and a profile of the user's preferences.
+ **Hybrid** : Includes techniques combining collaborative filtering, content-based filtering and other possible approaches. Nowadays most recommender systems are hybrid.
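
To make the collaborative idea concrete, here is a minimal sketch of user-based filtering on a made-up rating matrix. The matrix, the `cosine` helper and the variable names are assumptions for illustration only; the notebook itself works with movie-to-movie neighbors further down.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# All values are invented for illustration.
ratings_toy = np.array([
    [5, 4, 0, 1],   # user 0: the user we recommend for
    [5, 5, 4, 1],   # user 1: tastes close to user 0
    [1, 1, 2, 5],   # user 2: very different tastes
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 0
others = [u for u in range(len(ratings_toy)) if u != target]

# The other user whose ratings point in the most similar direction
nearest = max(others, key=lambda u: cosine(ratings_toy[target], ratings_toy[u]))

# Items the target has not rated yet, ranked by the neighbor's rating
unseen = np.flatnonzero(ratings_toy[target] == 0)
recommended = sorted(unseen, key=lambda i: ratings_toy[nearest, i], reverse=True)

print(nearest, recommended)
```

Here user 1 turns out to be the closest match, so the one item user 0 has not rated is suggested on the strength of user 1's rating.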

### Our Movie Recommender System

We use a hybrid approach in our recommender system: content-based filtering with a TF-IDF vectorizer first narrows the candidates by genre, and collaborative filtering with a Nearest Neighbors model then ranks those candidates by user ratings.

#### The Movielens Dataset

The dataset can be found here - https://grouplens.org/datasets/movielens/

---

### Import packages and read the data

In [None]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None) # don't truncate the columns of the data frame while printing

In [None]:
ratings = pd.read_csv('rating.csv')

In [None]:
ratings.head()

In [None]:
movies = pd.read_csv('movie.csv', index_col='movieId')
movies.head()

---

### Manipulating the Genres column

**Genres are stored as a single string with a '|' delimiter, which is awkward to feed into ML models. We therefore split each entry into a list of genres and convert that list back into a comma-separated string.**

In [None]:
# Break up the big genre string into a list of genre names
movies['genres'] = movies['genres'].str.split('|')

# Convert each list to its string representation (missing values become '')
movies['genres'] = movies['genres'].fillna("").astype('str')

movies.head()

In [None]:
movies.shape

---
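
The effect of the genre conversion above is easiest to see on a few rows. The following is a small sketch on an invented Series (`genres_toy` is not part of the dataset) that applies the same two steps.

```python
import pandas as pd

# Toy genre column in the raw '|' format (values invented for illustration)
genres_toy = pd.Series(["Adventure|Animation|Children", "Comedy", None])

# Same two steps as in the notebook: split on '|', then cast each list to text
as_lists = genres_toy.str.split("|")
as_text = as_lists.fillna("").astype("str")

print(as_text.tolist())
```

The first entry ends up as the text `['Adventure', 'Animation', 'Children']`, brackets and quotes included. That punctuation is harmless later on, because the vectorizer's default tokenizer only keeps runs of word characters.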

##### Create a dataset like 'movies' but with natural indices starting from 0

In [None]:
movies_genre = movies.reset_index()
movies_genre.head()

---
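
A short sketch of why the reset matters: the rows of the TF-IDF and similarity matrices built later are addressed by position, so the frame's row labels need to run from 0 to n-1. The frame below is an invented stand-in for `movies`.

```python
import pandas as pd

# Toy frame indexed by movieId, mimicking `movies` (ids and titles are illustrative)
movies_toy = pd.DataFrame(
    {"title": ["Movie A", "Movie B", "Movie C"]},
    index=pd.Index([1, 5, 42], name="movieId"),
)

# reset_index moves movieId into a regular column and labels the rows 0..n-1
movies_toy_reset = movies_toy.reset_index()

print(movies_toy_reset)
```

After the reset, row label 2 refers to the third movie regardless of its `movieId` (42 here), which is what positional lookups into a matrix expect.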

## Content Based Recommendations

**Analyzing the contents of the movie, such as its genres, and getting movies with similar content by ranking the similarity scores calculated with the `linear_kernel` function in scikit-learn.**

TF-IDF stands for Term Frequency-Inverse Document Frequency. TF is how often a term appears in a document. IDF grows as the share of documents in the corpus that contain the term shrinks. Combining the two dampens the effect of terms that are frequent across the whole corpus when determining the importance of an item.

The fewer movies that contain a given genre, the higher the resulting weight.

---

#### Use the TF-IDF Vectorizer from scikit-learn to vectorize genres.

Vectorization makes it possible to apply ML models to text-based features. In simple terms, the TF-IDF vectorizer converts strings into numeric vectors based on term frequencies. Terms that appear in many documents are down-weighted, which helps to separate informative words from connecting words like 'the', 'is', 'in' etc.

##### *TF-IDF Vectorizer parameters*

+ **analyzer** = 'word'; features are built from word n-grams
+ **ngram_range** = (1, 2); both unigrams and bigrams are used
+ **min_df** = 1; ignores terms that have a document frequency strictly lower than 1, so no term is dropped (recent scikit-learn releases reject an integer `min_df` of 0)
+ **stop_words** = 'english'; 'english' is currently the only supported string value

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=1 keeps every term that occurs in at least one movie
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
genre_matrix = tf.fit_transform(movies_genre['genres'])

---

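##### **Using scikit-learn's linear_kernel function to generate cosine similarities from the genre matrix**

Although scikit-learn provides a `cosine_similarity` function, in this case we can compute the cosine similarities simply as the dot product of the matrix with itself, because `TfidfVectorizer` L2-normalizes each row by default. `linear_kernel` computes exactly this dot product and skips the normalization step that `cosine_similarity` performs, which makes it faster. Despite its name, the `dis_cosine` matrix created below therefore holds similarities (higher means more alike), not distances.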
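
The equivalence can be verified on a toy corpus. This is a sketch with invented genre strings; it assumes the vectorizer's default `norm='l2'` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["action adventure", "action thriller", "romance comedy"]  # toy genre strings
X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

by_dot = linear_kernel(X, X)       # plain dot products
by_cos = cosine_similarity(X, X)   # normalizes the rows, then takes dot products

print(np.allclose(by_dot, by_cos))
```

Both matrices agree, every movie has similarity 1 with itself, and documents without a shared term score 0.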


In [None]:
from sklearn.metrics.pairwise import linear_kernel
dis_cosine = linear_kernel(genre_matrix, genre_matrix)
titles = movies_genre['title']
idx = pd.Series(movies_genre.index, index=movies_genre['title'])

---
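
The `idx` Series built above is a title-to-row-position lookup. The sketch below rebuilds it on an invented frame with one deliberate duplicate to show a case worth checking: whether the real dataset contains duplicate titles is an assumption to verify, for example with `movies_genre['title'].duplicated().any()`.

```python
import pandas as pd

# Toy stand-in for movies_genre; titles are illustrative, with one deliberate duplicate
toy = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie A"]})

# Same construction as idx: values are row positions, labels are titles
lookup = pd.Series(toy.index, index=toy["title"])

unique_hit = lookup["Movie B"]      # a single integer position
duplicate_hit = lookup["Movie A"]   # a Series holding both positions

print(unique_hit)
print(list(duplicate_hit))
```

A unique title yields one integer, while a duplicated title yields a Series of positions, which any code expecting a single row position would need to handle.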

### Importing and initializing scikit-learn's Nearest Neighbors to find closest matches using movie-ratings

In the earlier steps, we built a cosine-similarity matrix based on genres. Now we will use Nearest Neighbors with the cosine metric to find movies that are similar to each other based on user ratings.

We will combine the results of both approaches to build a recommendation system that suggests similar movies based on genre as well as user ratings. In simple terms, **The Best of both Worlds**.


##### Model parameters

+ **n_neighbors** = 11; the model returns the 11 nearest movies, which normally means the query movie itself plus its 10 most similar movies
+ **algorithm** = auto; scikit-learn chooses the algorithm used to search for similar movies
+ **metric** = cosine; the model uses the cosine distance between movies to judge similarity
+ **n_jobs** = -1; number of parallel jobs used for the search (-1 uses all processors)
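
The reason for asking for one neighbor more than needed shows up on a tiny example. This sketch uses an invented movie-by-user matrix and `n_neighbors=2` instead of the notebook's 11.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy movie-by-user rating matrix (rows: movies, columns: users); values are invented
toy_ratings = np.array([
    [5.0, 4.0, 0.0],   # movie 0: the query
    [4.0, 5.0, 0.0],   # movie 1: rated much like movie 0
    [0.0, 1.0, 5.0],   # movie 2: liked by a different user
])

# Ask for one neighbor more than needed, because the query row matches itself
nn = NearestNeighbors(n_neighbors=2, algorithm="auto", metric="cosine")
nn.fit(toy_ratings)

distances, indices = nn.kneighbors(toy_ratings[[0]])
print(indices, distances)
```

The first neighbor returned is the query row itself at distance 0, and the second is movie 1, whose ratings point in almost the same direction.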


In [None]:
from sklearn.neighbors import NearestNeighbors

In [None]:
model = NearestNeighbors(n_neighbors=11, algorithm='auto', metric='cosine', n_jobs = -1)

---

### Function to combine both Models and recommend movies

##### How the function works:

+ The function takes the title of the movie as input.

+ It looks up the row position of the movie in the `idx` Series created alongside the similarity matrix.

+ It takes the movie's row of the cosine-similarity matrix, sorts the scores in descending order and keeps the 100 most similar movies.

+ It selects those rows from the `movies_genre` dataset and merges the result (inner join on 'movieId') with the ratings dataset. The input movie is appended to the candidates so that it is part of the resulting dataset, provided it has at least one rating.

+ It creates a pivot table with movie titles as the index, reviewers' user IDs as the columns and ratings as the values. Missing ratings are filled with 0 so that the model receives a complete numeric matrix.

+ The Nearest Neighbors model is fitted on the pivot table of movie ratings.

+ Then we use the `kneighbors` method of the model to find the 11 nearest rows, which are normally the input movie itself and its 10 best matches. A suggestions data frame is returned that contains the suggested movie titles and their genres.
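
The merge and pivot steps can be sketched on invented data. The frames below are stand-ins for the genre candidates and the ratings table; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-ins for the genre candidates and the ratings table (all values invented)
candidates = pd.DataFrame({"movieId": [1, 2, 3], "title": ["Movie A", "Movie B", "Movie C"]})
ratings_toy = pd.DataFrame({
    "userId":  [10, 10, 20, 30],
    "movieId": [1, 2, 1, 2],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# The inner join keeps only candidates with at least one rating (Movie C drops out)
merged = candidates.merge(ratings_toy, on="movieId", how="inner")

# One row per movie, one column per user; unrated cells become 0
pivot = merged.pivot_table(index="title", columns="userId", values="rating").fillna(0)
print(pivot)
```

Note that a movie without any ratings disappears in the inner join, so looking up such a title in the pivot table would raise a `KeyError`.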


In [None]:
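def recommend(title):
    # Row position of the query movie in movies_genre (assumes the title is unique)
    index = idx[title]

    # Rank all movies by genre similarity to the query, highest first
    scores = sorted(enumerate(dis_cosine[index]), key=lambda x: x[1], reverse=True)
    # Keep the 100 best candidates; drop the query explicitly, since ties mean it is not always first
    movie_indices = [i for i, _ in scores if i != index][:100]
    movie_indices.append(index)
    genre_recommend = movies_genre.iloc[movie_indices]

    # Attach user ratings and build a title-by-user matrix; missing ratings become 0
    movies_suggested = genre_recommend.merge(ratings, on='movieId', how='inner')
    pivot_movies = movies_suggested.pivot_table(index='title', columns='userId', values='rating').fillna(0)

    model.fit(pivot_movies)

    # kneighbors returns (distances, indices); indices has shape (1, n_neighbors)
    suggest = model.kneighbors(pivot_movies.loc[title, :].values.reshape(1, -1))[1]

    # Flatten before indexing, because a pandas Index cannot be indexed with a 2-D array
    suggested = pd.DataFrame({'title': pivot_movies.index[suggest.ravel()]})

    suggestions = suggested.merge(movies, on='title', how='inner')

    return suggestions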
    
    

---
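
When many movies share exactly the same genres, several entries tie at the top similarity score, so ranking by position alone cannot be relied on to put the query movie first. The following sketch shows a ranking helper that excludes the query explicitly; `top_similar` is a hypothetical name used only for illustration.

```python
import numpy as np

def top_similar(sim_row, query_pos, n):
    """Return positions of the n highest scores in sim_row, never the query itself.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    # Descending order; a stable sort keeps tied entries in index order
    order = np.argsort(-np.asarray(sim_row), kind="stable")
    # Drop the query by position rather than assuming it ranks first
    order = order[order != query_pos]
    return order[:n].tolist()

# Movies 0, 1 and 2 share identical genres, so all three score 1.0 against movie 2
sim_row = [1.0, 1.0, 1.0, 0.3]
picked = top_similar(sim_row, query_pos=2, n=2)
print(picked)
```

In this example, simply skipping the first sorted entry would discard movie 0 and keep the query movie 2 among the candidates, whereas the helper returns movies 0 and 1.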

In [None]:
recommend('Five Children and It (2004)')

---

In [None]:
recommend('Scarface (1983)')

---

In [None]:
recommend("Captain Phillips (2013)")

---

In [None]:
recommend("Pulp Fiction (1994)")

---

In [None]:
recommend("Frozen (2013)")

---

In [None]:
recommend("E.T. the Extra-Terrestrial (1982)")

---

In [None]:
recommend("Psycho (1960)")

---

In [None]:
recommend("Saving Private Ryan (1998)")

---

In [None]:
recommend("Casablanca (1942)")

---

In [None]:
recommend("Chicago (2002)")

---

In [None]:
recommend("Jumanji (1995)")

---