# Recommendation system with cosine similarity

This notebook builds a movie recommendation system that uses cosine similarity to measure how similar the movies are to each other. Each movie is represented as a vector whose dimensions are the features available for the movies. The formula for the cosine similarity is shown here:

![](https://i.ibb.co/HDkh3Gg/cos.png)

In our case the formula can be interpreted as follows: each movie has a feature vector (f1, f2, ..., fn) with n features (the features are the genres and the relevant tags calculated for the movie, e.g. oscar-winner). If feature fm is present for the movie, the m-th component of the vector is 1, otherwise it is 0. Put simply: the more features two movies have in common, the greater their cosine similarity, and the more similar they are considered to be.
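
As a small illustration of the formula, the sketch below computes the cosine similarity of three made-up binary feature vectors with NumPy. The feature order and the movies are hypothetical and are not taken from the dataset.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature order: (Action, Thriller, Comedy, oscar-winner)
movie_a = np.array([1, 1, 0, 1])
movie_b = np.array([1, 1, 0, 0])
movie_c = np.array([0, 0, 1, 0])

sim_ab = cosine(movie_a, movie_b)  # two shared features
sim_ac = cosine(movie_a, movie_c)  # no shared features
print(round(sim_ab, 3), sim_ac)    # prints: 0.816 0.0
```

Movies that share features get a similarity close to 1, while movies with no common feature get 0.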

## Preparations

We use the movielens-25m-dataset, which contains data about more than 62k movies. The 'movies.csv' file contains the movie titles and their genres. We also use the 'genome-scores.csv' file, which contains the features (tags) and their relevance to each of the movies.

### Imports
*  Pandas - for reading the CSV files and manipulating their data
*  Numpy - linear algebra calculations
*  CountVectorizer - creates vectors from feature words
*  cosine_similarity - for calculating the actual cosine similarity

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
base_path = "/kaggle/input/movielens-25m-dataset/ml-25m/"

### Reading in the necessary csv files

The movies_df dataframe holds all the movies, and genome_scores_df holds the movie-tag relevance scores.

In [None]:
movies_df = pd.read_csv(base_path + "movies.csv")
movies_df.head()

In [None]:
genome_scores_df = pd.read_csv(base_path + "genome-scores.csv")
genome_scores_df.head()

### Data clean and transformation

We say that a movie has a feature if the feature's relevance is above the RELEVANT_SCORE threshold (0.6 in our case). Therefore we keep only those movie-tag pairs whose relevance is above this threshold.

In [None]:
RELEVANT_SCORE = 0.6
genome_score_relevant_df = genome_scores_df.loc[genome_scores_df["relevance"] > RELEVANT_SCORE]
genome_score_relevant_df.head()

In [None]:
genome_tags_df = pd.read_csv(base_path + "genome-tags.csv")
genome_tags_df.head()

In [None]:
genome_score_relevant_df = genome_score_relevant_df.merge(genome_tags_df, on="tagId")

After joining genome_score_relevant_df with genome_tags_df, every row of the new dataframe contains a movieId and the name of one of its relevant tags.

In [None]:
genome_score_relevant_df.sort_values(by = "movieId", inplace=True)
genome_score_relevant_df

In [None]:
pd.set_option("display.max_colwidth", None)
genome_score_relevant_df["tag"] = genome_score_relevant_df["tag"].astype(str)
genome_score_relevant_df["tag"] = '"' + genome_score_relevant_df["tag"] + '"'
genome_score_relevant_df = genome_score_relevant_df.groupby(['movieId'])['tag'].apply(lambda x: ','.join(x)).reset_index()

We create a new column named "tag" in movies_df, which contains all the tags relevant to the movie (from genome_score_relevant_df), separated by the "," character.
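
The tag preparation steps (threshold filter, join with the tag names, grouping, left join into the movies) can be sketched on toy data as follows. All rows, titles and tag names here are made up for illustration; only the column names follow the dataset files.

```python
import pandas as pd

RELEVANT_SCORE = 0.6

# Toy stand-ins for genome-scores.csv, genome-tags.csv and movies.csv
scores = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "tagId": [10, 11, 12, 10],
    "relevance": [0.9, 0.7, 0.2, 0.3],
})
tags = pd.DataFrame({"tagId": [10, 11, 12], "tag": ["twist ending", "dark", "funny"]})
movies = pd.DataFrame({"movieId": [1, 2], "title": ["Movie A", "Movie B"]})

# Keep the relevant (movie, tag) pairs and attach the tag names
relevant = scores.loc[scores["relevance"] > RELEVANT_SCORE].merge(tags, on="tagId")

# One row per movie: quoted tag names joined by commas
relevant["tag"] = '"' + relevant["tag"].astype(str) + '"'
tag_lists = relevant.groupby("movieId")["tag"].apply(",".join).reset_index()

# The left join keeps movies without any relevant tag; they get an empty string
movies = movies.merge(tag_lists, on="movieId", how="left")
movies["tag"] = movies["tag"].fillna("")
print(movies)
```

In this toy data "Movie A" ends up with two quoted tags, while "Movie B" has no tag above the threshold and gets an empty string.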

In [None]:
movies_df = movies_df.merge(genome_score_relevant_df, on="movieId", how="left")
movies_df["tag"] = movies_df["tag"].fillna("")  # assignment instead of inplace fillna on a column selection
movies_df.head()

The genre lists are converted to the same format as the tags, then the genres are combined with the tag list.
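
A minimal sketch of the format conversion on two made-up genre strings. Passing regex=False is a choice made in this sketch so that "|" is treated as a literal separator regardless of the pandas version.

```python
import pandas as pd

genres = pd.Series(["Action|Sci-Fi", "Drama"])

# regex=False treats "|" as a literal separator, not as regex alternation
quoted = '"' + genres.str.replace("|", '","', regex=False) + '"'
print(quoted.tolist())  # prints: ['"Action","Sci-Fi"', '"Drama"']
```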

In [None]:
movies_df["genres"] = '"' + movies_df["genres"].str.replace('|', '","', regex=False) + '"'  # "|" is a literal separator
movies_df.head()

In [None]:
movies_df["combined_features"] = movies_df["genres"] + "," + movies_df["tag"]
movies_df

In [None]:
movies_df["combined_features"] = movies_df["combined_features"].fillna('')

## Cosine similarity

The CountVectorizer turns each movie's feature list (movies_df["combined_features"]) into a count vector, and the cosine similarity is calculated between these vectors.
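
It may help to see which features CountVectorizer actually extracts from such a string. The sketch below fits it on a single made-up feature string, once with the default settings and once with a hypothetical custom tokenizer (split_features, not part of this notebook) that keeps every quoted entry whole. The custom variant assumes that no tag contains a comma.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['"Action","Sci-Fi","oscar (best picture)"']

# Default settings: lowercase the text, then keep runs of two or more word characters
default_cv = CountVectorizer()
default_cv.fit(docs)
default_vocab = sorted(default_cv.vocabulary_)

# Hypothetical alternative: one feature per quoted, comma-separated entry
def split_features(text):
    return [part.strip('"') for part in text.split(",") if part.strip('"')]

whole_cv = CountVectorizer(tokenizer=split_features, token_pattern=None, lowercase=False)
whole_cv.fit(docs)
whole_vocab = sorted(whole_cv.vocabulary_)

print(default_vocab)  # ['action', 'best', 'fi', 'oscar', 'picture', 'sci']
print(whole_vocab)    # ['Action', 'Sci-Fi', 'oscar (best picture)']
```

With the default settings, multi-word tags are split into single words, so two movies can also match on a shared word of two different tags. Whether that is desirable is a design decision; the notebook itself uses the default settings.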

In [None]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(movies_df["combined_features"])

In [None]:
chunk_size = 500
matrix_len = count_matrix.shape[0]  # number of movies; count_matrix is a SciPy sparse matrix

# Preallocate the full result once instead of re-copying it with np.append on
# every chunk; float32 halves the memory needed compared to float64
cosine_sim = np.empty((matrix_len, matrix_len), dtype=np.float32)

def similarity_cosine_by_chunk(start, end):
    # Similarity of rows [start, end) against all rows (scikit-learn function)
    end = min(end, matrix_len)
    return cosine_similarity(X=count_matrix[start:end], Y=count_matrix)

for chunk_start in range(0, matrix_len, chunk_size):
    chunk_end = min(chunk_start + chunk_size, matrix_len)
    cosine_sim[chunk_start:chunk_end] = similarity_cosine_by_chunk(chunk_start, chunk_end)
    print(f"rows {chunk_start}-{chunk_end} done")

print(cosine_sim.nbytes)  # size of the similarity matrix in bytes

In [None]:
cosine_sim.shape
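
As a sanity check, the chunk-wise computation can be compared with a single cosine_similarity call on a small matrix. The sketch below uses a random toy matrix (an assumption for illustration) in place of count_matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Small random 0/1 matrix standing in for count_matrix
rng = np.random.default_rng(0)
toy_matrix = csr_matrix((rng.random((23, 8)) < 0.4).astype(int))

chunk = 5
n_rows = toy_matrix.shape[0]
chunked = np.empty((n_rows, n_rows), dtype=np.float64)
for start in range(0, n_rows, chunk):
    end = min(start + chunk, n_rows)  # the last chunk may be shorter
    chunked[start:end] = cosine_similarity(toy_matrix[start:end], toy_matrix)

# Reference: all pairwise similarities in one call
full = cosine_similarity(toy_matrix)
print(np.allclose(chunked, full))
```

Processing the rows in chunks only limits the size of the intermediate results; the final matrix is the same as the one computed in a single call.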

## Prediction

The prediction is done by the predict_by_title function, which prints the 20 movies most similar to the one given in the "title" parameter.
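
To try the whole idea without the full dataset, here is a minimal end-to-end sketch on four made-up movies. The titles, the feature strings and the helper most_similar are hypothetical and only mirror the steps of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up titles and feature strings, only for illustration
toy_movies = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "combined_features": [
        '"Action","Thriller","dark"',
        '"Action","Thriller"',
        '"Comedy","Romance"',
        '"Thriller","dark","Drama"',
    ],
})

toy_counts = CountVectorizer().fit_transform(toy_movies["combined_features"])
toy_sim = cosine_similarity(toy_counts)

def most_similar(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding the movie itself."""
    idx = int(np.flatnonzero(toy_movies["title"].to_numpy() == title)[0])
    order = np.argsort(-toy_sim[idx], kind="stable")  # most similar first
    order = order[order != idx][:top_n]
    return toy_movies["title"].iloc[order].tolist()

print(most_similar("Alpha"))  # prints: ['Beta', 'Delta']
```

"Beta" shares two of its two features with "Alpha" and ranks first, "Delta" shares two of its three features and ranks second, and "Gamma" shares nothing.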

In [None]:
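def predict_by_title(title, top_n=20):
    # Position of the first movie whose title matches exactly
    matches = np.flatnonzero(movies_df["title"].to_numpy() == title)
    if len(matches) == 0:
        print(f"No movie titled {title!r} found")
        return
    idx = matches[0]
    # Order all movies by similarity to the query, most similar first
    order = np.argsort(-cosine_sim[idx], kind="stable")
    # Skip the query movie itself and keep the top_n remaining movies
    order = order[order != idx][:top_n]
    for similar_title in movies_df["title"].iloc[order]:
        print(similar_title)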

In [None]:
predict_by_title("Fight Club (1999)")