# Content-based recommendation engine

Ever wondered how Google comes up with movies that are similar to the ones you like? In this section, we will build such a recommendation system ourselves.

## 1. Finding the similarity

We know that our recommendation engine will be content-based. So, we need to find movies that are similar to a given movie and then recommend them to the user. The logic is pretty straightforward, right?

But wait... How can we find out which movies are similar to a given movie in the first place? How can we measure how similar (or dissimilar) two movies are?

Let us start with something simple and easy to understand. Suppose, you are given the following two texts:

- Text A: London Paris London
- Text B: Paris Paris London

How would you find the similarity between Text A and Text B? Let’s analyze these texts...

- Text A contains the word "London" twice and the word "Paris" once.
- Text B contains the word "London" once and the word "Paris" twice.

What will happen if we try to represent these two texts in a 2D plane (with "London" on the X axis and "Paris" on the Y axis)? Let’s try it. It will look like this:

<img src="./resources/cos1.png"  style="height: 250px"/>

Here, the red vector represents "Text A" and the blue vector represents "Text B". Now we have graphically represented these two texts. So can we find out the similarity between these two texts?

The answer is "Yes, we can". But, exactly how? These two texts are represented as vectors. Right? So, we can say that two vectors are similar if the distance between them is small. By distance, we mean the angular distance between two vectors, which is represented by θ (theta).

<img src="./resources/cos2.png"  style="height: 250px"/>

From a machine learning perspective, the value of cos θ is more useful to us than the value of θ (theta) itself, because for angles in the first quadrant the cosine function maps θ to a value between 0 and 1 (remember: cos 90° = 0 and cos 0° = 1). Word counts are never negative, so our text vectors always lie in the first quadrant. Can you see that the smaller θ is (the more similar the two texts are), the larger the value of cos θ will be?
 
<img src="./resources/cos3.png"  style="height: 250px"/>
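To make this concrete, here is a minimal sketch that computes cos θ for Text A and Text B by hand. It assumes the standard definition of cosine similarity (the dot product of the two vectors divided by the product of their lengths); the variable names are illustrative only.

```python
import math

# Word counts as (London, Paris) coordinates, taken from the two texts above
text_a = (2, 1)  # "London Paris London"
text_b = (1, 2)  # "Paris Paris London"

# Dot product: multiply matching coordinates and add them up
dot_product = sum(a * b for a, b in zip(text_a, text_b))

# Length (Euclidean norm) of each vector
length_a = math.sqrt(sum(a * a for a in text_a))
length_b = math.sqrt(sum(b * b for b in text_b))

cos_theta = dot_product / (length_a * length_b)
theta_degrees = math.degrees(math.acos(cos_theta))

print(f"cos(theta) = {cos_theta:.2f}")        # cos(theta) = 0.80
print(f"theta = {theta_degrees:.1f} degrees")  # theta = 36.9 degrees
```

A value of 0.80 is fairly close to 1, which matches the intuition that the two texts use the same words in slightly different proportions.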

## 2. cosine_similarity()

Don’t be scared: we don’t need to implement the formula for cos θ from scratch. Our friend scikit-learn can calculate it for us :) Just remember: cosine similarity is a simple but very effective vector similarity metric that is heavily used, especially in the field of NLP (Natural Language Processing).

Let’s see how we can do that.

First, we need Text A and Text B in our program (this time with four cities):

In [54]:
text = ["London Paris London Amsterdam", "Paris Paris London Brussels"]

Now, we need a way to represent these texts as vectors so that we can calculate with them. In other words, we have to encode each text numerically as a vector (rather than as a single scalar, which gives us more expressive power).

The CountVectorizer() class can do this for us.

In [55]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
cv = CountVectorizer()
count_matrix = cv.fit_transform(text)

count_matrix gives us a sparse matrix. 

In [56]:
count_matrix

<2x4 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

To get it into a human-readable form, we need to call the `toarray()` method on it. Before printing out this count_matrix, let us first print out the feature list (or word list) that our `CountVectorizer()` object has learned from the texts.

In [57]:
print(cv.get_feature_names_out())
print(count_matrix.toarray())

['amsterdam' 'brussels' 'london' 'paris']
[[1 0 2 1]
 [0 1 1 2]]


This indicates that the word amsterdam occurs 1 time in A and 0 times in B, brussels occurs 0 times in A and 1 time in B, london occurs 2 times in A and 1 time in B and paris occurs 1 time in A and 2 times in B. Makes sense. Right?
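To see where these numbers come from, the following rough sketch reproduces the same counts in plain Python. It is only an approximation of what `CountVectorizer()` does with its default settings (lowercasing, splitting into word tokens, ordering the vocabulary alphabetically); the real class uses a regular-expression tokenizer that, for example, ignores single-character tokens.

```python
from collections import Counter

text = ["London Paris London Amsterdam", "Paris Paris London Brussels"]

# Lowercase each text and split it into words on whitespace
tokenized = [document.lower().split() for document in text]

# The vocabulary is the alphabetically sorted set of all distinct words
vocabulary = sorted({word for document in tokenized for word in document})

# One row per text, one column per vocabulary word
count_rows = []
for document in tokenized:
    word_counts = Counter(document)
    count_rows.append([word_counts[word] for word in vocabulary])

print(vocabulary)  # ['amsterdam', 'brussels', 'london', 'paris']
print(count_rows)  # [[1, 0, 2, 1], [0, 1, 1, 2]]
```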

Now, we need to find the cosine similarity between these vectors to find out how similar they are to each other. We can calculate this using the `cosine_similarity()` function:

In [58]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(count_matrix)
print(similarity_scores)

[[1.         0.66666667]
 [0.66666667 1.        ]]


What does this output indicate? We can interpret this output like this: 

1. Each row of the similarity matrix corresponds to one text of our input. So, row 0 = Text A and row 1 = Text B.
2. The same thing applies for columns.

To get a better understanding of this, we can say that the output given above is the same as the following:

```
        Text A      Text B
Text A    1          0.67
Text B   0.67         1
```

The output says that Text A is 100% similar to itself (position [0,0]) and about 67% similar to Text B (position [0,1]). We can easily see that the output is always going to be a symmetric matrix: if Text A is 67% similar to Text B, then Text B is also 67% similar to Text A.
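As a sanity check, the off-diagonal score can be recomputed by hand from the two rows of the count matrix. This sketch assumes the standard cosine formula (dot product divided by the product of the vector lengths) and uses only the standard library; the helper name `cosine` is made up for this example.

```python
import math

# Rows of the count matrix printed earlier (amsterdam, brussels, london, paris)
vector_a = [1, 0, 2, 1]
vector_b = [0, 1, 1, 2]

def cosine(u, v):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(round(cosine(vector_a, vector_b), 8))  # 0.66666667
print(round(cosine(vector_a, vector_a), 8))  # 1.0
```

The dot product is 4 and both vectors have length √6, so the score is 4/6, which agrees with the value returned by scikit-learn.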

## 3. Building the recommendation engine - Exercise

Now we know how to find the similarity between contents. So, let’s try to apply this knowledge to build a content-based movie recommendation engine.

### 3.1 Read the data

Import all the required libraries and then read the `movie_dataset.csv` file from the `resources` directory using the `read_csv()` method.

In [59]:

import pandas as pd

# Read the movie dataset
movies_df = pd.read_csv('resources/movie_dataset.csv')

# Display the first few rows of the dataset
print(movies_df.head())

   index     budget                                    genres  \
0      0  237000000  Action Adventure Fantasy Science Fiction   
1      1  300000000                  Adventure Fantasy Action   
2      2  245000000                    Action Adventure Crime   
3      3  250000000               Action Crime Drama Thriller   
4      4  260000000          Action Adventure Science Fiction   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  culture clash future space war space colony so...                en   
1  ocean drug abuse exotic island east india trad...                en   
2         spy based on novel sec

Explore the data by showing the first 5 lines.

In [60]:
movies_df.head(5)


|  | index | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | cast | crew | director |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 237000000 | Action Adventure Fantasy Science Fiction | http://www.avatarmovie.com/ | 19995 | culture clash future space war space colony so... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | ... | 162.0 | `[{"iso_639_1": "en", "name": "English"}, {"iso...` | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | Sam Worthington Zoe Saldana Sigourney Weaver S... | `[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...` | James Cameron |
| 1 | 1 | 300000000 | Adventure Fantasy Action | http://disney.go.com/disneypictures/pirates/ | 285 | ocean drug abuse exotic island east india trad... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | ... | 169.0 | `[{"iso_639_1": "en", "name": "English"}]` | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | Johnny Depp Orlando Bloom Keira Knightley Stel... | `[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...` | Gore Verbinski |
| 2 | 2 | 245000000 | Action Adventure Crime | http://www.sonypictures.com/movies/spectre/ | 206647 | spy based on novel secret agent sequel mi6 | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | ... | 148.0 | `[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...` | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 | Daniel Craig Christoph Waltz L\u00e9a Seydoux ... | `[{'name': 'Thomas Newman', 'gender': 2, 'depar...` | Sam Mendes |
| 3 | 3 | 250000000 | Action Crime Drama Thriller | http://www.thedarkknightrises.com/ | 49026 | dc comics crime fighter terrorist secret ident... | en | The Dark Knight Rises | Following the death of District Attorney Harve... | 112.31295 | ... | 165.0 | `[{"iso_639_1": "en", "name": "English"}]` | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 | Christian Bale Michael Caine Gary Oldman Anne ... | `[{'name': 'Hans Zimmer', 'gender': 2, 'departm...` | Christopher Nolan |
| 4 | 4 | 260000000 | Action Adventure Science Fiction | http://movies.disney.com/john-carter | 49529 | based on novel mars medallion space travel pri... | en | John Carter | John Carter is a war-weary, former military ca... | 43.926995 | ... | 132.0 | `[{"iso_639_1": "en", "name": "English"}]` | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 | Taylor Kitsch Lynn Collins Samantha Morton Wil... | `[{'name': 'Andrew Stanton', 'gender': 2, 'depa...` | Andrew Stanton |


You will see that the dataset contains a lot of extra information about each movie. We don’t need all of it, so we choose the keywords, cast, genres and director columns as our feature set. We also need to clean and preprocess the data for our use: we will fill all the NaN values in these four columns with blank strings.

In [61]:
features = ['keywords', 'cast', 'genres', 'director']

# filling all NaNs with blank strings
for feature in features:
    movies_df[feature] = movies_df[feature].fillna('') 
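Why fill the NaN values first? A small sketch on a made-up two-row DataFrame (the rows below are illustrative and are not taken from `movie_dataset.csv`) shows that once missing values are replaced with blank strings, the columns can be joined safely into one string per movie.

```python
import pandas as pd

# Hypothetical mini dataset: the second movie has no keywords
toy_df = pd.DataFrame({
    "keywords": ["spy secret agent", None],
    "director": ["Sam Mendes", "James Cameron"],
})

# Replace missing values with blank strings, as in the cell above
for feature in ["keywords", "director"]:
    toy_df[feature] = toy_df[feature].fillna("")

# String concatenation now works for every row
toy_df["combined"] = toy_df["keywords"] + " " + toy_df["director"]
print(toy_df["combined"].tolist())
# ['spy secret agent Sam Mendes', ' James Cameron']
```

Without the `fillna('')` step, the missing value would propagate into the combined string instead of simply contributing no words.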

Your next task is to create a function for combining the values of these columns into a single string and call this function over each row of our dataframe. The name of the new column is `combined_features`.

In [62]:
def combine_features(row):
    return row['keywords'] + ' ' + row['cast'] + ' ' + row['genres'] + ' ' + row['director']
movies_df['combined_features'] = movies_df.apply(combine_features, axis=1)

Check the content of the new `combined_features` column by showing the first 5 rows.

In [63]:
movies_df['combined_features'].head(5)


0    culture clash future space war space colony so...
1    ocean drug abuse exotic island east india trad...
2    spy based on novel secret agent sequel mi6 Dan...
3    dc comics crime fighter terrorist secret ident...
4    based on novel mars medallion space travel pri...
Name: combined_features, dtype: object

### 3.2 Create the matrix

Now that you have obtained the combined_features column, you can feed these strings to a CountVectorizer() object to get the count matrix.

In [64]:
vectorizer = CountVectorizer()

# Fit and transform the 'combined_features' column to get the count matrix
count_matrix = vectorizer.fit_transform(movies_df['combined_features'])


Print the total number of different words found in the combined_features column.

In [65]:
vocabulary = vectorizer.vocabulary_
total_words = len(vocabulary)
total_words


14845

Print all the words found in the combined_features column.

In [66]:
print(vectorizer.get_feature_names_out())


['11' '15th' '17th' ... 'zwick' 'zwigoff' 'zylka']


Print the number of films and the number of features (words).
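One possible way to approach this, sketched on a tiny made-up corpus so that it runs on its own: the `shape` attribute of a count matrix holds (number of rows, number of columns), that is, (number of texts, number of distinct words). The three feature strings below are invented for illustration; in the exercise you would inspect the `count_matrix` built from `movies_df` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical combined-feature strings for three films
toy_features = ["space war future", "spy secret agent", "space travel mars"]

toy_count_matrix = CountVectorizer().fit_transform(toy_features)

# Rows correspond to films, columns to distinct words
n_films, n_words = toy_count_matrix.shape
print(n_films, n_words)  # 3 8
```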

Print the content of the count_matrix.

In [67]:
print(count_matrix)


  (0, 3115)	1
  (0, 2616)	1
  (0, 4886)	1
  (0, 12386)	2
  (0, 14235)	1
  (0, 2755)	1
  (0, 12299)	1
  (0, 11517)	1
  (0, 14561)	1
  (0, 14820)	1
  (0, 11490)	1
  (0, 12134)	1
  (0, 14291)	1
  (0, 12567)	1
  (0, 7496)	1
  (0, 8831)	1
  (0, 11217)	1
  (0, 86)	1
  (0, 144)	1
  (0, 4435)	1
  (0, 11745)	1
  (0, 4566)	1
  (0, 6542)	1
  (0, 2061)	1
  (1, 86)	1
  :	:
  (4801, 10069)	1
  (4801, 5844)	1
  (4801, 252)	1
  (4801, 4098)	1
  (4801, 14796)	1
  (4801, 11361)	1
  (4801, 2978)	1
  (4801, 12036)	1
  (4801, 6138)	1
  (4802, 9659)	1
  (4802, 3812)	1
  (4802, 1788)	2
  (4802, 4210)	1
  (4802, 5181)	1
  (4802, 2912)	1
  (4802, 3821)	1
  (4802, 1069)	1
  (4802, 11185)	1
  (4802, 3681)	1
  (4802, 5399)	1
  (4802, 3894)	1
  (4802, 2056)	1
  (4802, 3093)	1
  (4802, 4502)	1
  (4802, 5900)	2


Now, obtain the cosine similarity matrix from the count matrix.

In [68]:

cosine_sim = cosine_similarity(count_matrix, count_matrix)

We will define two helper functions: one to get a movie's title from its index, and one to get a movie's index from its title.

In [69]:
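def get_title_from_index(index):
    # Title of the row whose DataFrame index equals `index`
    return movies_df[movies_df.index == index]["title"].values[0]

def get_index_from_title(title):
    # Value of the 'index' column for the first row with this title
    return movies_df[movies_df.title == title]["index"].values[0]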

Print the index of Interstellar.

Print the title of the film with index 95.
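To show how the two helpers are called, here is a sketch that redefines them on a three-row stand-in DataFrame (the name `toy_df` is an assumption for this example; the titles are the first three rows shown earlier). In the exercise itself you would call the helpers defined on `movies_df` with "Interstellar" and 95.

```python
import pandas as pd

# Stand-in for movies_df with only the columns the helpers need
toy_df = pd.DataFrame({
    "index": [0, 1, 2],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"],
})

def get_title_from_index(index):
    # Title of the row whose DataFrame index equals `index`
    return toy_df[toy_df.index == index]["title"].values[0]

def get_index_from_title(title):
    # Value of the 'index' column for the first row with this title
    return toy_df[toy_df.title == title]["index"].values[0]

print(get_index_from_title("Spectre"))  # 2
print(get_title_from_index(0))          # Avatar
```

Note that both helpers raise an `IndexError` when no row matches, so a misspelled title fails loudly rather than returning a wrong movie.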

### 3.3 Find similar movies

We will need the functions `enumerate()` and `list()`:

In [70]:
x = ['apple', 'banana', 'cherry']
# enumerate over x and make a tuple (index, fruit)
print(list(enumerate(x)))


[(0, 'apple'), (1, 'banana'), (2, 'cherry')]


- We will find the index of the movie the user likes.
- After that, we will access the row corresponding to this movie in the similarity matrix. This gives us the similarity scores between this movie and every movie in the dataset (including itself).
- Then we will enumerate through all the similarity scores of that movie to make tuples of (movie index, similarity score).

In [71]:
movie_user_likes = "Pulp Fiction"
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))

This converts a row of similarity scores such as `[1, 0.5, 0.2, 0.9]` into `[(0, 1), (1, 0.5), (2, 0.2), (3, 0.9)]`, where each item has the form (movie index, similarity score).

In [72]:
print(similar_movies)

[(0, 0.0), (1, 0.0408248290463863), (2, 0.04662524041201569), (3, 0.1315587028960544), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (11, 0.08944271909999159), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.0), (17, 0.0), (18, 0.0), (19, 0.04767312946227961), (20, 0.0), (21, 0.0), (22, 0.0512989176042577), (23, 0.0), (24, 0.039528470752104736), (25, 0.049999999999999996), (26, 0.0), (27, 0.045643546458763846), (28, 0.045643546458763846), (29, 0.04767312946227961), (30, 0.0), (31, 0.0), (32, 0.0), (33, 0.042257712736425826), (34, 0.04767312946227961), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.0), (39, 0.043852900965351466), (40, 0.08304547985373996), (41, 0.043852900965351466), (42, 0.0), (43, 0.044721359549995794), (44, 0.0), (45, 0.043852900965351466), (46, 0.0), (47, 0.0), (48, 0.0), (49, 0.0), (50, 0.0), (51, 0.0), (52, 0.04767312946227961), (53, 0.0), (54, 0.0), (55, 0.0), (56, 0.0), (57, 0.05590169943749474), (58, 0.10540925533894599), (59, 0.04303314829

### Question

If you print the index of Pulp Fiction, what value of y would you expect in the tuple (index, y) for that index in similar_movies? Does that make sense?

In [73]:
print(movie_index)

3232


### 3.4 Sort and print Top 9 similar movies

We will sort the list similar_movies according to similarity scores in descending order. Since the most similar movie to a given movie will be itself, we will discard the first element after sorting the movies.

In [74]:
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]
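The enumerate-and-sort steps can be tried on the small example row of scores `[1, 0.5, 0.2, 0.9]`, where movie 0 is the movie the user likes. As a variation that is not part of the original notebook, this sketch also filters out the liked movie by its index instead of dropping the first sorted element, which stays correct even if another movie happens to tie with a score of 1.

```python
scores = [1, 0.5, 0.2, 0.9]  # similarity of movie 0 to movies 0..3
liked_index = 0

# Pair each movie index with its similarity score
similar = list(enumerate(scores))

# Sort by score, highest first
ranked = sorted(similar, key=lambda pair: pair[1], reverse=True)

# Notebook approach: drop the first element (the movie itself)
ranked_slice = ranked[1:]

# Alternative: drop the liked movie explicitly by its index
ranked_filtered = [pair for pair in ranked if pair[0] != liked_index]

print(ranked_slice)     # [(3, 0.9), (1, 0.5), (2, 0.2)]
print(ranked_filtered)  # [(3, 0.9), (1, 0.5), (2, 0.2)]
```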

In [75]:
print("Top 9 similar movies to " + movie_user_likes + " are:\n")
# The list is already sorted, so its first 9 entries are the best matches
for element in sorted_similar_movies[:9]:
    print(get_title_from_index(element[0]))

Top 9 similar movies to Pulp Fiction are:

Basic
Die Hard: With a Vengeance
Shaft
Con Air
Surrogates
Unbreakable
Kill Bill: Vol. 2
Kill Bill: Vol. 1
Jackie Brown


You can compare your results with the collaborative-filtering-based recommendation engine from Google: 4 of our 9 recommendations match. Mmm, not bad.

<img src="./resources/pulp.png"/>

Now try to find similar movies for some other movie you like.
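To make such experiments easier, the steps of this section could be wrapped in a single function. The sketch below is one possible arrangement, not the notebook's own code: the function name `recommend`, its parameters, and the four-row `toy_df` with invented feature strings are all assumptions made so that the example runs on its own. With the real data you would pass `movies_df` (after building `combined_features`) as `df`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini dataset; the feature strings are made up for illustration
toy_df = pd.DataFrame({
    "title": ["Kill Bill: Vol. 1", "Kill Bill: Vol. 2", "Avatar", "John Carter"],
    "combined_features": [
        "revenge assassin Uma Thurman Quentin Tarantino",
        "revenge bride Uma Thurman Quentin Tarantino",
        "space colony future James Cameron",
        "mars space travel Andrew Stanton",
    ],
})

def recommend(title, n=9, df=toy_df):
    """Return the titles of the n movies most similar to `title`."""
    titles = df["title"].tolist()
    if title not in titles:
        raise ValueError(f"Unknown movie title: {title!r}")
    movie_position = titles.index(title)

    # Count matrix and similarity scores for the chosen movie
    count_matrix = CountVectorizer().fit_transform(df["combined_features"])
    scores = cosine_similarity(count_matrix)[movie_position]

    # Pair each row position with its score, leaving out the movie itself
    candidates = [(i, s) for i, s in enumerate(scores) if i != movie_position]
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in ranked[:n]]

print(recommend("Kill Bill: Vol. 1", n=1))  # ['Kill Bill: Vol. 2']
```

Recomputing the similarity matrix on every call keeps the sketch short; for the full dataset it would be more sensible to compute `cosine_sim` once and reuse it.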