# Movie Recommendation System Using Scikit-Learn

There are many types of recommender systems:
+ **Collaborative filtering** : These methods use other users' preferences and tastes to recommend new items to a user. The usual procedure is to find similar users (or items) and recommend items that those users liked, on the assumption that the target user will like them too.
+ **Content-Based filtering** : Content-based filtering methods are based on a description of the item and a profile of the user's preferences.
+ **Hybrid** : Techniques that combine collaborative filtering, content-based filtering and other possible approaches. Nowadays most recommender systems are hybrid.

### Our Movie Recommender System

We use a hybrid approach in our recommender system: content-based filtering with a TF-IDF vectorizer to shortlist movies by genre, followed by collaborative filtering with a Nearest Neighbors model on user ratings to make the final movie recommendations.

#### The Movielens Dataset

The dataset can be found here - https://grouplens.org/datasets/movielens/

---

### Import packages and read the data

In [1]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None) # don't truncate the columns of the data frame while printing

In [2]:
ratings = pd.read_csv('rating.csv')

In [3]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [4]:
movies = pd.read_csv('movie.csv', index_col='movieId')
movies.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


---

### Manipulating the Genres column

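**Genres are stored as a single '|'-delimited string. We split each string into a list of genres and then convert the list to its string representation (for example, "['Comedy', 'Romance']"), so that the TF-IDF vectorizer can later tokenize the individual genre names.**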

In [5]:
# Break up the big genre string into a string array
movies['genres'] = movies['genres'].str.split('|')

# Convert genres to string value
movies['genres'] = movies['genres'].fillna("").astype('str')

movies.head()

movieId,title,genres
1,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']"
2,Jumanji (1995),"['Adventure', 'Children', 'Fantasy']"
3,Grumpier Old Men (1995),"['Comedy', 'Romance']"
4,Waiting to Exhale (1995),"['Comedy', 'Drama', 'Romance']"
5,Father of the Bride Part II (1995),['Comedy']


In [6]:
movies.shape

(27278, 2)

---

##### Create a dataset like 'movies' but with natural indices starting from 0

In [7]:
movies_genre = movies.reset_index()
movies_genre.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']"
1,2,Jumanji (1995),"['Adventure', 'Children', 'Fantasy']"
2,3,Grumpier Old Men (1995),"['Comedy', 'Romance']"
3,4,Waiting to Exhale (1995),"['Comedy', 'Drama', 'Romance']"
4,5,Father of the Bride Part II (1995),['Comedy']


---

## Content Based Recommendations

**We analyze the content of each movie (here, its genres) and find movies with similar content by ranking the similarity scores calculated with the linear_kernel function in scikit-learn.**

TF-IDF stands for Term Frequency-Inverse Document Frequency. TF is how often a word appears in a document. IDF is the inverse of the word's document frequency, that is, of the number of documents in the whole corpus that contain it. The idea behind TF-IDF is to dampen the effect of words that occur in many documents when determining the importance of a word for an item.

The fewer movies that contain a given genre the higher the resulting weight. 

---

#### Use the TF-IDF Vectorizer from scikit-learn to vectorize genres.

Vectorization makes it possible to apply ML models to text-based features. In simple terms, the TF-IDF vectorizer converts strings into numeric vectors based on term frequencies. Terms that appear in many documents are down-weighted, which helps to separate informative words from connecting words like 'the', 'is', 'in' etc.

##### *TF-IDF Vectorizer parameter*

+ **analyzer** = 'word'; features are made of word n-grams
+ **ngram_range** = (1, 2); use both unigrams and bigrams
+ **min_df** = 1; ignore terms that appear in fewer than 1 document, so no term is dropped (this is the default)
+ **stop_words** = 'english'; remove common English stop words ('english' is currently the only supported string value)

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=1 keeps every term; the original min_df=0 had the same effect, but recent scikit-learn versions may reject it
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
genre_matrix = tf.fit_transform(movies_genre['genres'])

---

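##### **Using scikit-learn's linear_kernel function to compute cosine similarities of the genre matrix**

Although scikit-learn provides a cosine_similarity function, in this case we can compute the cosine similarities simply as the dot product of the matrix with itself: TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows already equals their cosine similarity. linear_kernel computes exactly this dot product and skips the normalization step that cosine_similarity performs, which makes it faster. Despite its name, dis_cosine therefore holds similarities, not distances. Note that the result is a dense 27278 x 27278 array, which takes roughly 6 GB of memory as 64-bit floats.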


In [9]:
from sklearn.metrics.pairwise import linear_kernel
dis_cosine = linear_kernel(genre_matrix, genre_matrix)
titles = movies_genre['title']
idx = pd.Series(movies_genre.index, index=movies_genre['title'])
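
The relationship between linear_kernel and cosine similarity can be checked on a small example. The sketch below uses three made-up genre strings (`toy_genres` is hypothetical) and compares the two functions on the resulting TF-IDF matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy genre strings
toy_genres = ["Action Crime Drama", "Crime Drama", "Comedy Romance"]

# TfidfVectorizer uses norm='l2' by default, so every row has unit length
toy_matrix = TfidfVectorizer().fit_transform(toy_genres)

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

# On unit-length rows the plain dot product already equals the cosine similarity
print(np.allclose(dot_products, cosines))

# Movies with no genre in common get a similarity of 0
print(dot_products[0, 2])
```

This equivalence only holds while the vectorizer keeps its default L2 normalization; with `norm=None` the dot products would no longer be bounded by 1.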

---

### Importing and initializing scikit-learn's Nearest Neighbors to find closest matches using movie-ratings

In the earlier steps, we built a cosine-similarity matrix based on genres. Now we will use Nearest Neighbors with the cosine metric to find movies that are similar to each other based on user ratings.

We will combine results from both approaches to build the recommendation system that recommends similar movies based on genre as well as user ratings. In simple terms, **The Best of both Worlds**


##### Model parameters

+ **n_neighbors** = 10; Model will return the 10 nearest movies for a query (the query movie itself is among them)
+ **algorithm** = auto; Algorithm used to search for similar movies
+ **metric** = cosine; Model will use cosine distance between movies to judge similarity
+ **n_jobs** = -1; Number of parallel jobs used for search(-1 will use all processors)
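
A minimal sketch of how such a model behaves, using a made-up 4 x 4 rating matrix (`toy_ratings` is hypothetical, and `n_neighbors=2` is chosen only to keep the output short). Each row is a movie and each column a user, so two movies are close when the same users rated them similarly.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical rating matrix: one row per movie, one column per user, 0 = not rated
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # movie 0
    [4.0, 5.0, 0.0, 1.0],   # movie 1: rated much like movie 0
    [0.0, 0.0, 5.0, 4.0],   # movie 2: liked by a different set of users
    [0.0, 1.0, 4.0, 5.0],   # movie 3: rated much like movie 2
])

toy_model = NearestNeighbors(n_neighbors=2, algorithm='auto', metric='cosine')
toy_model.fit(toy_ratings)

# Query with movie 0's own rating vector; kneighbors returns (distances, indices)
distances, indices = toy_model.kneighbors(toy_ratings[0].reshape(1, -1))
print(indices[0], distances[0])
```

Since the query vector is itself part of the fitted data, it comes back as its own nearest neighbor at a distance of about 0, which is why the first row of each recommendation table is the input movie.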


In [10]:
from sklearn.neighbors import NearestNeighbors

In [11]:
# n_neighbors=10 matches the parameter list above and the 10-row outputs below
model = NearestNeighbors(n_neighbors=10, algorithm='auto', metric='cosine', n_jobs=-1)

---

### Function to combine both Models and recommend movies

##### Working of the Function:

+ Function takes in title of the movie as input.

+ It looks up the positional index of the movie in the idx Series, which maps each title to its row position in movies_genre.

+ It takes that movie's row of the cosine-similarity matrix, sorts the scores in descending order and keeps the 100 most similar movies.

+ It selects those rows from the movies_genre dataset and merges the result (inner join on 'movieId') with the ratings dataset. The input movie is appended to the selection so that it is part of the resulting dataset, provided it has at least one rating.

+ It creates a pivot table with index as movie titles, columns as reviewer's User IDs and values as ratings. We impute nulls as 0 for efficient processing.

+ Nearest Neighbors model is fitted on the pivot table of movies-ratings.

+ Then we use the kneighbors method of the model to find the 10 best matches. A suggestions data frame is returned that contains the suggested movie titles and their genres.
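
The pivot step can be sketched on a small made-up table (`toy` and its titles are hypothetical). It turns long-format ratings into one rating vector per movie, which is the shape the Nearest Neighbors model expects.

```python
import pandas as pd

# Hypothetical ratings already joined with movie titles (long format)
toy = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie B', 'Movie C', 'Movie C'],
    'userId': [1, 2, 1, 2, 3],
    'rating': [4.0, 5.0, 3.0, 2.5, 4.5],
})

# One row per title, one column per user; unrated combinations become 0
toy_pivot = toy.pivot_table(index='title', columns='userId', values='rating').fillna(0)
print(toy_pivot)

# The row of a single title, reshaped to 2-D, is the vector passed to kneighbors
query_vector = toy_pivot.loc['Movie A', :].values.reshape(1, -1)
print(query_vector.shape)
```

Filling missing entries with 0 treats "not rated" the same as a rating of 0. That is a simplification, but it keeps the matrix purely numeric for the distance computation.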


In [12]:
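def recommend(title):
    # Positional index of the query movie in movies_genre
    index = idx[title]

    # Rank all movies by genre similarity to the query, most similar first
    scores = sorted(enumerate(dis_cosine[index]), key=lambda x: x[1], reverse=True)

    # Keep the 100 most similar movies. The query is skipped explicitly because
    # ties in similarity mean it is not guaranteed to be sorted first.
    movie_indices = [i for i, _ in scores if i != index][:100]
    movie_indices.append(index)
    genre_recommend = movies_genre.iloc[movie_indices]

    # Attach user ratings and build a title x user matrix (missing ratings become 0)
    movies_suggested = genre_recommend.merge(ratings, on='movieId', how='inner')
    pivot_movies = movies_suggested.pivot_table(index='title', columns='userId', values='rating').fillna(0)

    # Fit the model on the rating vectors of the candidate movies
    model.fit(pivot_movies)

    # Row positions of the nearest neighbors of the query movie
    suggest = model.kneighbors(pivot_movies.loc[title, :].values.reshape(1, -1))[1]

    # Flatten to 1-D before indexing; recent pandas versions reject multi-dimensional keys
    suggested = pd.DataFrame({'title': pivot_movies.index[suggest.ravel()]})

    # Add the genres back for display
    suggestions = suggested.merge(movies, on='title', how='inner')

    return suggestions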

---

In [13]:
recommend('Five Children and It (2004)')

Unnamed: 0,title,genres
0,Five Children and It (2004),"['Adventure', 'Children', 'Fantasy']"
1,The Search for Santa Paws (2010),"['Adventure', 'Children']"
2,Napoleon (1995),"['Adventure', 'Children']"
3,Fishtales (2008),"['Children', 'Fantasy']"
4,Back to the Secret Garden (2001),"['Adventure', 'Children', 'Fantasy']"
5,Snow Queen (2002),"['Adventure', 'Children', 'Fantasy']"
6,"Thief Lord, The (2006)","['Adventure', 'Children', 'Fantasy']"
7,Standing Up (2013),"['Adventure', 'Children']"
8,"Chronicles of Narnia: The Voyage of the Dawn Treader, The (2010)","['Adventure', 'Children', 'Fantasy']"
9,"Borrowers, The (2011)","['Adventure', 'Children', 'Fantasy']"


---

In [14]:
recommend('Scarface (1983)')

Unnamed: 0,title,genres
0,Scarface (1983),"['Action', 'Crime', 'Drama']"
1,"Untouchables, The (1987)","['Action', 'Crime', 'Drama']"
2,Swordfish (2001),"['Action', 'Crime', 'Drama']"
3,Colors (1988),"['Action', 'Crime', 'Drama']"
4,Black Rain (1989),"['Action', 'Crime', 'Drama']"
5,Menace II Society (1993),"['Action', 'Crime', 'Drama']"
6,Four Brothers (2005),"['Action', 'Crime', 'Drama']"
7,New Jack City (1991),"['Action', 'Crime', 'Drama']"
8,Death Wish (1974),"['Action', 'Crime', 'Drama']"
9,Harley Davidson and the Marlboro Man (1991),"['Action', 'Crime', 'Drama']"


---

In [15]:
recommend("Captain Phillips (2013)")

Unnamed: 0,title,genres
0,Captain Phillips (2013),"['Adventure', 'Drama', 'Thriller', 'IMAX']"
1,Skyfall (2012),"['Action', 'Adventure', 'Thriller', 'IMAX']"
2,Contagion (2011),"['Sci-Fi', 'Thriller', 'IMAX']"
3,Mission: Impossible - Ghost Protocol (2011),"['Action', 'Adventure', 'Thriller', 'IMAX']"
4,Iron Man 3 (2013),"['Action', 'Sci-Fi', 'Thriller', 'IMAX']"
5,"Bourne Legacy, The (2012)","['Action', 'Adventure', 'Drama', 'Thriller', 'IMAX']"
6,127 Hours (2010),"['Adventure', 'Drama', 'Thriller']"
7,Super 8 (2011),"['Mystery', 'Sci-Fi', 'Thriller', 'IMAX']"
8,White House Down (2013),"['Action', 'Drama', 'Thriller', 'IMAX']"
9,Jack Ryan: Shadow Recruit (2014),"['Action', 'Drama', 'Thriller', 'IMAX']"


---

In [16]:
recommend("Pulp Fiction (1994)")

Unnamed: 0,title,genres
0,Pulp Fiction (1994),"['Comedy', 'Crime', 'Drama', 'Thriller']"
1,Fargo (1996),"['Comedy', 'Crime', 'Drama', 'Thriller']"
2,Trainspotting (1996),"['Comedy', 'Crime', 'Drama']"
3,Bad Boys (1995),"['Action', 'Comedy', 'Crime', 'Drama', 'Thriller']"
4,"Player, The (1992)","['Comedy', 'Crime', 'Drama']"
5,Fried Green Tomatoes (1991),"['Comedy', 'Crime', 'Drama']"
6,In Bruges (2008),"['Comedy', 'Crime', 'Drama', 'Thriller']"
7,Matchstick Men (2003),"['Comedy', 'Crime', 'Drama']"
8,Burn After Reading (2008),"['Comedy', 'Crime', 'Drama']"
9,Confessions of a Dangerous Mind (2002),"['Comedy', 'Crime', 'Drama', 'Thriller']"


---

In [17]:
recommend("Frozen (2013)")

Unnamed: 0,title,genres
0,Frozen (2013),"['Adventure', 'Animation', 'Comedy', 'Fantasy', 'Musical', 'Romance']"
1,Monsters University (2013),"['Adventure', 'Animation', 'Comedy']"
2,Tangled (2010),"['Animation', 'Children', 'Comedy', 'Fantasy', 'Musical', 'Romance', 'IMAX']"
3,"Croods, The (2013)","['Adventure', 'Animation', 'Comedy']"
4,ParaNorman (2012),"['Adventure', 'Animation', 'Comedy']"
5,Puss in Boots (2011),"['Adventure', 'Animation', 'Comedy', 'Fantasy', 'IMAX']"
6,Scott Pilgrim vs. the World (2010),"['Action', 'Comedy', 'Fantasy', 'Musical', 'Romance']"
7,"Princess and the Frog, The (2009)","['Animation', 'Children', 'Fantasy', 'Musical', 'Romance']"
8,Ice Age 4: Continental Drift (2012),"['Adventure', 'Animation', 'Comedy']"
9,Enchanted (2007),"['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy', 'Musical', 'Romance']"


---

In [18]:
recommend("E.T. the Extra-Terrestrial (1982)")

Unnamed: 0,title,genres
0,E.T. the Extra-Terrestrial (1982),"['Children', 'Drama', 'Sci-Fi']"
1,Contact (1997),"['Drama', 'Sci-Fi']"
2,"Truman Show, The (1998)","['Comedy', 'Drama', 'Sci-Fi']"
3,Babe (1995),"['Children', 'Drama']"
4,Unbreakable (2000),"['Drama', 'Sci-Fi']"
5,"Iron Giant, The (1999)","['Adventure', 'Animation', 'Children', 'Drama', 'Sci-Fi']"
6,Old Yeller (1957),"['Children', 'Drama']"
7,Forbidden Planet (1956),"['Drama', 'Sci-Fi']"
8,Metropolis (1927),"['Drama', 'Sci-Fi']"
9,Powder (1995),"['Drama', 'Sci-Fi']"


---

In [19]:
recommend("Psycho (1960)")

Unnamed: 0,title,genres
0,Psycho (1960),"['Crime', 'Horror']"
1,"Silence of the Lambs, The (1991)","['Crime', 'Horror', 'Thriller']"
2,American Psycho (2000),"['Crime', 'Horror', 'Mystery', 'Thriller']"
3,From Hell (2001),"['Crime', 'Horror', 'Mystery', 'Thriller']"
4,Psycho (1998),"['Crime', 'Horror', 'Thriller']"
5,Henry: Portrait of a Serial Killer (1986),"['Crime', 'Horror', 'Thriller']"
6,Identity (2003),"['Crime', 'Horror', 'Mystery', 'Thriller']"
7,Serial Mom (1994),"['Comedy', 'Crime', 'Horror']"
8,Book of Shadows: Blair Witch 2 (2000),"['Crime', 'Horror', 'Mystery', 'Thriller']"
9,"Last House on the Left, The (1972)","['Crime', 'Horror', 'Thriller']"


---

In [20]:
recommend("Saving Private Ryan (1998)")

Unnamed: 0,title,genres
0,Saving Private Ryan (1998),"['Action', 'Drama', 'War']"
1,Apocalypse Now (1979),"['Action', 'Drama', 'War']"
2,Black Hawk Down (2001),"['Action', 'Drama', 'War']"
3,"Patriot, The (2000)","['Action', 'Drama', 'War']"
4,"Boot, Das (Boat, The) (1981)","['Action', 'Drama', 'War']"
5,"Thin Red Line, The (1998)","['Action', 'Drama', 'War']"
6,"Dirty Dozen, The (1967)","['Action', 'Drama', 'War']"
7,Inglourious Basterds (2009),"['Action', 'Drama', 'War']"
8,We Were Soldiers (2002),"['Action', 'Drama', 'War']"
9,"Longest Day, The (1962)","['Action', 'Drama', 'War']"


---

In [21]:
recommend("Casablanca (1942)")

Unnamed: 0,title,genres
0,Casablanca (1942),"['Drama', 'Romance']"
1,Breakfast at Tiffany's (1961),"['Drama', 'Romance']"
2,Good Will Hunting (1997),"['Drama', 'Romance']"
3,Titanic (1997),"['Drama', 'Romance']"
4,"Room with a View, A (1986)","['Drama', 'Romance']"
5,Jerry Maguire (1996),"['Drama', 'Romance']"
6,Leaving Las Vegas (1995),"['Drama', 'Romance']"
7,"Piano, The (1993)","['Drama', 'Romance']"
8,Shine (1996),"['Drama', 'Romance']"
9,"Remains of the Day, The (1993)","['Drama', 'Romance']"


---

In [22]:
recommend("Chicago (2002)")

Unnamed: 0,title,genres
0,Chicago (2002),"['Comedy', 'Crime', 'Drama', 'Musical']"
1,Matchstick Men (2003),"['Comedy', 'Crime', 'Drama']"
2,Trainspotting (1996),"['Comedy', 'Crime', 'Drama']"
3,Fried Green Tomatoes (1991),"['Comedy', 'Crime', 'Drama']"
4,Sexy Beast (2000),"['Comedy', 'Crime', 'Drama']"
5,Death to Smoochy (2002),"['Comedy', 'Crime', 'Drama']"
6,Paper Moon (1973),"['Comedy', 'Crime', 'Drama']"
7,Burn After Reading (2008),"['Comedy', 'Crime', 'Drama']"
8,"Player, The (1992)","['Comedy', 'Crime', 'Drama']"
9,Crimes and Misdemeanors (1989),"['Comedy', 'Crime', 'Drama']"


---

In [23]:
recommend("Jumanji (1995)")

Unnamed: 0,title,genres
0,Jumanji (1995),"['Adventure', 'Children', 'Fantasy']"
1,Casper (1995),"['Adventure', 'Children']"
2,"Wizard of Oz, The (1939)","['Adventure', 'Children', 'Fantasy', 'Musical']"
3,Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001),"['Adventure', 'Children', 'Fantasy']"
4,"Indian in the Cupboard, The (1995)","['Adventure', 'Children', 'Fantasy']"
5,"NeverEnding Story, The (1984)","['Adventure', 'Children', 'Fantasy']"
6,"Chronicles of Narnia: The Lion, the Witch and the Wardrobe, The (2005)","['Adventure', 'Children', 'Fantasy']"
7,Escape to Witch Mountain (1975),"['Adventure', 'Children', 'Fantasy']"
8,Swiss Family Robinson (1960),"['Adventure', 'Children']"
9,Tom and Huck (1995),"['Adventure', 'Children']"


---