# Recommendation system by description with natural language processing

This notebook presents a recommendation system that proposes similar movies based on their descriptions. Each description is converted to a vector, and the cosine similarity between the description vectors is used to find the most similar movies (based on the description alone). For more about cosine similarity, see the notebook which uses cosine similarity for prediction.

The descriptions are converted to vectors with the Doc2Vec algorithm, a neural-network-based embedding method designed for tasks like this: it maps documents to vectors such that a higher cosine similarity between two vectors indicates more similar documents. It builds strongly on the Word2Vec algorithm, which does the same for individual words.

![](https://miro.medium.com/max/1400/1*9tVCGDm-ytPydhtJWVx3Zw.png)
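As a minimal sketch of the similarity measure itself, the snippet below computes the cosine similarity of two vectors with NumPy. The three-dimensional vectors and their names are made up for illustration and are not output of the Doc2Vec model.

```python
import numpy as np

def cosine_sim_pair(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "description vectors" (made-up numbers, not model output).
space_opera = [1.0, 2.0, 0.0]
space_western = [2.0, 4.0, 0.0]    # points in the same direction as space_opera
courtroom_drama = [0.0, 0.0, 3.0]  # orthogonal to space_opera

same_direction = cosine_sim_pair(space_opera, space_western)   # close to 1
orthogonal = cosine_sim_pair(space_opera, courtroom_drama)     # 0
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.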

## Preprocessing

Libraries used:
* numpy : for linear algebra calculation
* pandas : for csv/dataframe manipulation
* re : for regular-expression based text cleaning
* nltk : for natural language preprocessing functions like stopword removal, lemmatization, tokenization
* gensim : for the Doc2Vec algorithm
* cosine_similarity (from scikit-learn) : for cosine similarity calculation
* plt (matplotlib.pyplot) : for plotting diagrams

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt

Data used:
* df_movies : dataframe of movies
* ratings_df : dataframe of ratings
* useful_links_df : contains only the movies for which the descriptions were scraped
* df : dataframe of descriptions

We need the ratings because, as with matrix factorization and the poster similarity calculation, only the most popular movies are used: scraping the descriptions for all of the movies would have taken too much time.

In [None]:
df_links = pd.read_csv("/kaggle/input/movielens-25m-dataset/ml-25m/links.csv")
df_movies = pd.read_csv("/kaggle/input/movielens-25m-dataset/ml-25m/movies.csv")
ratings_df = pd.read_csv("/kaggle/input/movielens-25m-dataset/ml-25m/ratings.csv")
ratings_df.drop(columns = ["timestamp"], inplace=True)

In [None]:
df_links = df_links.merge(df_movies, on="movieId")
df_links.fillna(0, inplace=True)
df_links["tmdbId"] = df_links["tmdbId"].astype(int)
df_links

ratings_df["movie_freq"] = ratings_df.groupby("movieId")["movieId"].transform('count')
MOVIE_FREQ_LIMIT = 500
ratings_df = ratings_df.loc[(ratings_df["movie_freq"] > MOVIE_FREQ_LIMIT)]
most_popular_film_ids = ratings_df["movieId"].unique()
most_popular_film_ids.sort()
useful_links_df = df_links.loc[df_links["movieId"].isin(most_popular_film_ids)]
useful_links_df = useful_links_df.reset_index(drop=True)
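The popularity filter can be tried on a small scale as follows. This is only a sketch on made-up ratings; `toy_ratings` and `TOY_FREQ_LIMIT` are hypothetical names, and the threshold of 1 stands in for the limit of 500 used on the real data.

```python
import pandas as pd

# Toy ratings table (made-up data): movie 1 has 3 ratings, movie 2 has 1, movie 3 has 2.
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [1, 1, 1, 2, 3, 3],
    "rating":  [4.0, 5.0, 3.5, 2.0, 4.5, 4.0],
})

# Attach to every row the number of ratings its movie received.
toy_ratings["movie_freq"] = toy_ratings.groupby("movieId")["movieId"].transform("count")

TOY_FREQ_LIMIT = 1  # hypothetical threshold for the toy data
popular = toy_ratings.loc[toy_ratings["movie_freq"] > TOY_FREQ_LIMIT]
popular_ids = sorted(popular["movieId"].unique().tolist())  # movies 1 and 3 remain
```

Because the comparison is strict, a movie needs more ratings than the limit to be kept.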

In [None]:
useful_links_df = useful_links_df.drop([1816, 2511, 3459, 3643, 3707, 4050, 4327, 4698, 4947, 5086, 5088, 5109, 5167])

In [None]:
useful_links_df = useful_links_df.reset_index(drop=True)
useful_links_df = useful_links_df.head(5155)

In [None]:
useful_links_df.loc[useful_links_df["title"].str.contains("Star")]

In [None]:
df = pd.read_csv("/kaggle/input/movie-reviews/descriptions-2.csv")

In [None]:
df["vector"] = ""
df

In [None]:
proba_description = df.iloc[0,1]
proba_description

The following function cleans a description. It applies these preprocessing steps:
* removal of non-alphabetic characters
* tokenization of the text into words
* stop word removal
* lemmatization of words

In [None]:
def clean_sentence(description_text):
    #remove non-alphabetic characters
    description_text = re.sub("[^a-zA-Z]"," ", description_text)

    #tokenize the sentences
    description_tokens = word_tokenize(description_text.lower())

    #stop words removal
    omit_words = set(stopwords.words('english'))
    words = [x for x in description_tokens if x not in omit_words]

    #lemmatize each word to its lemma, reusing one lemmatizer instance
    lemmatizer = WordNetLemmatizer()
    lemma_words = [lemmatizer.lemmatize(i) for i in words]

    return lemma_words
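To see the effect of the cleaning steps without downloading the NLTK corpora, here is a simplified stand-in. It is an assumption-laden sketch, not the function used in this notebook: it uses a tiny hand-picked stop-word list (`TOY_STOPWORDS`, hypothetical), whitespace tokenization instead of `word_tokenize`, and it skips lemmatization entirely. The sample sentence is made up.

```python
import re

# Tiny hand-picked stop-word list, for illustration only; NLTK's English list is much longer.
TOY_STOPWORDS = {"a", "an", "the", "of", "in", "and", "to", "is"}

def clean_sentence_sketch(text):
    """Simplified stand-in for clean_sentence: no NLTK, no lemmatization."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)  # replace digits and punctuation with spaces
    tokens = letters_only.lower().split()          # lowercase, then split on whitespace
    return [t for t in tokens if t not in TOY_STOPWORDS]

sample = "In 1977, a farm boy joins the Rebellion to rescue a princess!"
cleaned = clean_sentence_sketch(sample)
# cleaned == ['farm', 'boy', 'joins', 'rebellion', 'rescue', 'princess']
```

The year and the punctuation disappear in the first step, and the stop words are filtered out in the last one.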

In [None]:
tmdb_ids = df["tmdb_id"].tolist()

In [None]:
cleaned_reviews = []

In [None]:
for i in range(len(df)):
    cleaned_reviews.append(clean_sentence(df.iloc[i, 1]))

After the basic NLP preprocessing, the cleaned descriptions are stored in the documents variable in the format that the Doc2Vec algorithm accepts.

In [None]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(cleaned_reviews)]

## Training

Train the Doc2Vec model

In [None]:
model = Doc2Vec(documents, vector_size=300, min_count=2, epochs=40, window=2)

In [None]:
vectors = []

Infer a vector for each description and collect the vectors in a list

In [None]:
for i in range(len(df)):
    vector = model.infer_vector(documents[i][0])
    vectors.append(vector.tolist())

Calculate the pairwise cosine similarity between all description vectors

In [None]:
cosine_sim = cosine_similarity(vectors)
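Each row of the similarity matrix holds the similarities between one movie and all others, so recommendations come from picking the largest entries in a row. The sketch below shows one way to do this with NumPy on a made-up 4x4 matrix; `top_k_similar` and `toy_sim` are hypothetical names, and this is not the lookup used later in the notebook.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=3):
    """Return (index, similarity) pairs for the k rows most similar to row idx."""
    scores = np.asarray(sim_matrix, dtype=float)[idx].copy()
    scores[idx] = -np.inf                # exclude the query movie itself
    best = np.argsort(scores)[::-1][:k]  # indices ordered by descending similarity
    return [(int(j), float(scores[j])) for j in best]

# Made-up symmetric similarity matrix for four movies.
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.4],
    [0.5, 0.2, 0.4, 1.0],
])
neighbours = top_k_similar(toy_sim, idx=0, k=2)
# neighbours == [(1, 0.8), (3, 0.5)]
```

Masking the query's own entry avoids relying on the assumption that the movie itself is always ranked first.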

## Evaluation

In [None]:
cosine_sim

In [None]:
actual_genres = []
predicted_genres = {}

In [None]:
def predict_by_idx(idx):
    actual_genres = []
    predicted_genres_good = {}
    predicted_genres_bad = {}
    print("Prediction for movie:",useful_links_df["title"].iloc[idx], "\n \n")
    actual_genres = useful_links_df["genres"].iloc[idx].split("|")
    similar_movies = list(enumerate(cosine_sim[idx]))
    sorted_similar_movies = sorted(similar_movies, key=lambda x:x[1], reverse=True)
    
    print("Predictions \n \n")
    
    i = 0
    for movie_idx in sorted_similar_movies:
        if i>0:
            print(useful_links_df["title"].iloc[movie_idx[0]], "   sim:", movie_idx[1])
            predicted_genres_for_movie = useful_links_df["genres"].iloc[movie_idx[0]].split("|")
            for genre in predicted_genres_for_movie:
                if genre in actual_genres:
                    if genre in predicted_genres_good:
                        predicted_genres_good[genre] =  predicted_genres_good[genre] + 1
                    else:
                        predicted_genres_good[genre] = 1
                else:
                    if genre in predicted_genres_bad:
                        predicted_genres_bad[genre] =  predicted_genres_bad[genre] + 1
                    else:
                        predicted_genres_bad[genre] = 1
        i = i + 1
        if i>20:
            break
            
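    print("Actual genres:", actual_genres)
    # Print the counts collected above (the global predicted_genres dict is never filled)
    print("Matching genre counts:", predicted_genres_good)
    print("Non-matching genre counts:", predicted_genres_bad)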
    plt.subplots(figsize=(18,5))
    plt.bar(predicted_genres_good.keys(), predicted_genres_good.values(), width=0.3, color='g')
    plt.bar(predicted_genres_bad.keys(), predicted_genres_bad.values(), width=0.3, color='r')
    plt.show()

In [None]:
predict_by_idx(204)

The diagram below shows an example of the top 15 recommendations for Star Wars 4, grouped by genre. The green bars represent the genres that Star Wars 4 itself has, and the red bars represent genres it does not have.

![](https://i.ibb.co/WsMgzCk/Screenshot-2021-12-12-at-17-41-06.png)


## Conclusion

It can be seen that the recommendations for this particular movie are satisfactory. However, as in the CNN case, a clear metric should be defined, and the hyperparameters should be optimized against it.
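One candidate for such a metric, offered here only as an assumption and not as something this notebook defines or evaluates, is the average Jaccard overlap between the genres of a movie and the genres of its recommendations. The function names and the genre strings below are hypothetical; the strings follow the "A|B|C" format of the genres column.

```python
def genre_jaccard(genres_a, genres_b):
    """Jaccard overlap between two '|'-separated genre strings."""
    a, b = set(genres_a.split("|")), set(genres_b.split("|"))
    return len(a & b) / len(a | b)

def mean_genre_overlap(query_genres, recommended_genres):
    """Average Jaccard overlap between a query movie and its recommendations."""
    if not recommended_genres:
        return 0.0
    scores = [genre_jaccard(query_genres, g) for g in recommended_genres]
    return sum(scores) / len(scores)

# Made-up example: one exact match, one partial match, one miss.
query = "Action|Adventure|Sci-Fi"
recs = ["Action|Adventure|Sci-Fi", "Action|Sci-Fi", "Comedy|Romance"]
score = mean_genre_overlap(query, recs)  # (1 + 2/3 + 0) / 3
```

Averaging this score over many query movies would give a single number to compare hyperparameter settings, with the caveat that genre overlap is only a proxy for recommendation quality.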