## Content Based Recommendation

In this section, we will implement **Content Based Recommendation** using the MovieLens dataset.
* We will use the tf-idf vectorizer to extract features from the movie plot summaries (**NLP techniques**).
* We will use cosine similarity to find similar movies.

### Import MovieLens dataset

The MovieLens 100k dataset contains 100,000 ratings from 943 users on 1,682 movies and is available at [GroupLens](https://grouplens.org/datasets/movielens/). We will use the 100k dataset, which is stored in the `datasets` folder.

Let's inspect the dataset by looking at the columns and the first few rows.

In [2]:
import pandas as pd

metadata = pd.read_csv('datasets/movies_metadata.csv', low_memory=False)

# only keep movies that received more votes than 90% of all movies,
# in order to avoid movies with very few ratings
quantile = 0.9
# reset the index so that row positions match the rows of the tf-idf matrix built later
metadata = metadata[metadata.vote_count > metadata.vote_count.quantile(quantile)].reset_index()
pd.DataFrame(metadata.columns, columns=['columns']).T

In [None]:
metadata.head()

### Preprocessing

The `overview` column contains the **plot summary** of a movie, and we will use it to build our content-based recommendation system.

To apply machine learning to the plot summaries, we need to **transform** them into numeric vector representations. This process is called **feature extraction** or, in this case, simply vectorization, and it is an essential first step toward language-aware analysis. Every plot summary is transformed from a sequence of words into a point in a **high-dimensional semantic space**. The simplest encoding of this space is the **bag-of-words (BOW)** model; another is the tf-idf model.

Both models are tables in which each row represents a plot summary and each column represents a word, and both are **count-based**: they use the number of times a word appears in a document as a proxy for the importance of that word in the document. The difference is that the BOW model uses the raw count, while the tf-idf (term frequency-inverse document frequency) model also takes into account how many documents in the corpus contain the word. Both are **sparse** representations, meaning that most of the entries in each vector are zero. The tf-idf model is usually more useful than the BOW model because it **downweights** words that appear in many documents of the corpus and are therefore less informative than words that appear in only a few.

To sum up, every plot summary (or document) is encoded as a single vector whose length equals the size of the vocabulary of all plot summaries (the so-called corpus) and whose entries are counts, or weighted counts, of the words in that summary. These vectors are sparse because most words in the vocabulary do not appear in any given plot summary.

We'll use the tf-idf vectorizer to transform the plot summaries into a matrix of tf-idf features. The vectorizer is configured to remove English stop words and to **ignore** words that occur in **more than 80%** of the movies or in **fewer than 2 movies**. This helps **reduce the noise** in the dataset.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, min_df=2)

metadata.overview = metadata.overview.fillna('')
tfidf_model = vectorizer.fit_transform(metadata.overview)
print(f'Matrix contains {tfidf_model.shape[0]} movies and {tfidf_model.shape[1]} words')

### Inspect the tf-idf model

What does the tf-idf model look like?
Let us inspect the columns with popular movie terms like 'love', 'young', 'story', etc.

In [None]:
popular_terms = ['life', 'young', 'man', 'film', 'new', 'love', 'story', 'world']
columns = vectorizer.get_feature_names_out()
tfidf_model_df = pd.DataFrame.sparse.from_spmatrix(tfidf_model, columns=columns)
tfidf_model_df[popular_terms].head()

## Find similar movies

We'll use a K-Nearest Neighbors (KNN) model to find similar movies. As you might remember from earlier lessons, every KNN model uses a **distance metric** to find the nearest neighbors. In this case we'll use the **cosine distance**, which is one minus the **cosine similarity**. The cosine similarity of two non-zero vectors of an inner product space is the cosine of the angle between them.

### Cosine Similarity

The cosine similarity is a measure of similarity between two vectors $\mathbf{x}$ and $\mathbf{y}$.

$\cos(\mathbf{x},\mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$
<br/>
$\phantom{\cos(\mathbf{x},\mathbf{y})} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}$

where $\|\mathbf{x}\|$ and $\|\mathbf{y}\|$ are the Euclidean norms of $\mathbf{x}$ and $\mathbf{y}$, $n$ is the size of the vocabulary, and $x_i$ and $y_i$ are the tf-idf weights of the $i$th word in the two documents.

For the non-negative tf-idf vectors used here, the cosine similarity is:

* a **normalized dot product**.
* **independent of the magnitude** of the vectors.
* **zero** if the two vectors are **orthogonal** and **one** if they point in the **same direction** (for example, when they are equal).
* **symmetric**: the similarity between A and B is the same as the similarity between B and A.
* **non-negative**, because tf-idf weights are never negative.
* **bounded** between 0 and 1. For general vectors, which may have negative entries, it ranges from -1 to 1.
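The formula and some of these properties can be checked with a few lines of NumPy. This is a minimal sketch: `cosine_sim` is a hypothetical helper written for this illustration, and the vectors are arbitrary. The result is compared with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine_sim(x, y):
    """Cosine similarity of two non-zero 1-D vectors, straight from the formula."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 4.0, 0.0])  # same direction as x, twice the length
z = np.array([0.0, 0.0, 3.0])  # orthogonal to x

print(cosine_sim(x, y))  # 1 up to floating-point rounding: magnitude does not matter
print(cosine_sim(x, z))  # 0 for orthogonal vectors
print(cosine_sim(x, y) == cosine_sim(y, x))  # symmetric

# scikit-learn expects 2-D inputs (one row per vector) and returns a matrix
print(cosine_similarity([x], [y])[0, 0])
```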

We're going to compute the **cosine similarity** between movies based on the tf-idf _signature_ of their plot summaries, that is, the vector representation of each plot summary. The higher the cosine similarity, the more similar the movies are.

scikit-learn provides a function, `cosine_similarity` in `sklearn.metrics.pairwise`, to compute the cosine similarity between vectors. We could use that function to compare the plot summaries of the movies directly, but in this case we're going to use a KNN model to find the nearest neighbors of a movie. The KNN model has a parameter called `metric` that we can set to `cosine` to use the cosine distance (one minus the cosine similarity) as the distance metric.
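The relationship between the two can be sketched on a handful of made-up vectors (the array and variable names are hypothetical). The distances returned by `NearestNeighbors` with `metric='cosine'` should equal one minus the corresponding cosine similarities, and the query vector itself comes back as its own nearest neighbor.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Four toy "documents" in a three-word vocabulary (hypothetical values)
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

toy_knn = NearestNeighbors(n_neighbors=2, metric='cosine')
toy_knn.fit(toy_vectors)

# Query with the first row; kneighbors expects a 2-D array
distances, indices = toy_knn.kneighbors(toy_vectors[[0]])

# Cosine similarity of the first row to every row
similarities = cosine_similarity(toy_vectors[[0]], toy_vectors)[0]

print(indices[0])                    # the query row comes first, then its closest neighbor
print(distances[0])                  # cosine distances
print(1 - similarities[indices[0]])  # the same values, derived from the similarities
```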


In [None]:
from sklearn.neighbors import NearestNeighbors


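def get_content_based_recommendation(title, top_n=10, metric='cosine'):
    # Find the movie whose title matches (case-insensitive); because the index was reset,
    # its index is also the row of that movie in the tf-idf matrix
    matches = metadata[metadata.title.str.lower() == title.lower()]
    if matches.empty:
        raise ValueError(f'No movie found with title {title!r}')
    idx = matches.index[0]

    # Ask for one extra neighbor, because the query movie is its own nearest neighbor
    model = NearestNeighbors(n_neighbors=top_n + 1, metric=metric)
    model.fit(tfidf_model)
    neighbors = model.kneighbors(tfidf_model[idx], return_distance=False)[0]

    # Drop the query movie itself and return the top_n most similar movies
    similar_movies = [i for i in neighbors if i != idx][:top_n]
    return metadata.iloc[similar_movies]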

In [None]:
get_content_based_recommendation('I am legend')[['title', 'genres', 'vote_average', 'vote_count', 'overview']]

In [None]:
get_content_based_recommendation('The Matrix')[['title', 'genres', 'vote_average', 'vote_count', 'overview']]
