## Collaborative filtering recommendation system

The primary assumption underlying collaborative filtering is that personal tastes are correlated. For example, if persons A and B both like items X and Y (which suggests they have similar tastes) and person B also likes item Z, then person A is likely to like item Z as well.
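
To make this intuition concrete, here is a minimal sketch on a made-up like/dislike table. The persons, items and values are illustrative assumptions, not taken from the dataset, and cosine similarity is just one possible way to measure how close two taste profiles are.

```python
import numpy as np

# Hypothetical toy data: rows are persons A, B, C; columns are items X, Y, Z.
# 1 = liked, 0 = not liked or unknown. Values are illustrative only.
likes = np.array([
    [1, 1, 0],  # A likes X and Y; Z is unknown
    [1, 1, 1],  # B likes X, Y and Z
    [0, 0, 1],  # C likes only Z
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(likes[0], likes[1])
sim_ac = cosine(likes[0], likes[2])
print(f"sim(A, B) = {sim_ab:.3f}, sim(A, C) = {sim_ac:.3f}")
# sim(A, B) = 0.816, sim(A, C) = 0.000
```

A is much closer to B than to C, so B's opinion of Z is the more useful signal when guessing whether A will like Z.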

In this task, we use the matrix factorization approach to collaborative filtering. The user-item matrix is decomposed into the product of two lower-dimensional rectangular matrices, which represents users and items in a shared low-dimensional latent space.
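
The decomposition can be sketched in plain NumPy on a tiny made-up rating matrix. This is only an illustration under stated assumptions: the matrix `R`, the latent dimension `k`, the learning rate and the regularization strength are all hypothetical, and the factors are fitted with simple full-batch gradient descent rather than the Keras model built later in this notebook.

```python
import numpy as np

# Hypothetical 4x5 rating matrix; 0 marks an unknown entry.
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 0],
    [0, 1, 5, 4, 1],
], dtype=float)
observed = R > 0

n_users, n_items = R.shape
k = 2  # latent dimension (assumed for illustration)

rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, k))  # user factors, shape (4, 2)
Q = 0.1 * rng.standard_normal((n_items, k))  # item factors, shape (5, 2)

def observed_mse(P, Q):
    """Mean squared error over the known entries only."""
    err = (R - P @ Q.T) * observed
    return float((err ** 2).sum() / observed.sum())

initial_mse = observed_mse(P, Q)

lr, reg = 0.005, 0.02  # step size and L2 penalty (illustrative values)
for _ in range(3000):
    err = (R - P @ Q.T) * observed       # errors on known entries only
    step_P = err @ Q - reg * P           # descent direction for user factors
    step_Q = err.T @ P - reg * Q         # descent direction for item factors
    P = P + lr * step_P
    Q = Q + lr * step_Q

final_mse = observed_mse(P, Q)
R_hat = P @ Q.T  # dense matrix: every entry, known or not, now has a prediction
print(f"MSE on known entries: {initial_mse:.2f} -> {final_mse:.2f}")
```

The product `P @ Q.T` has the same shape as `R`, but it is built from only `(n_users + n_items) * k` numbers, which is what "lower-dimensional latent space" means here.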

The essential task is to predict the ratings of unknown entries in the user-item matrix. The items with the highest predicted ratings are then recommended to each user.
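
As a small illustration of that last step, the sketch below ranks items for a single user by predicted rating and skips items the user has already rated. The helper `top_n` and the numbers are hypothetical, introduced only for this example.

```python
import numpy as np

# Hypothetical predicted ratings for one user over six items (illustrative values).
predicted = np.array([3.9, 4.7, 2.1, 4.4, 3.0, 4.9])
# Items the user has already rated are excluded from the recommendations.
already_rated = np.array([False, True, False, False, False, False])

def top_n(predicted, already_rated, n=3):
    """Return indices of the n highest-predicted items the user has not rated."""
    scores = np.where(already_rated, -np.inf, predicted)
    order = np.argsort(-scores)  # highest score first
    n_available = int((~already_rated).sum())
    return order[:min(n, n_available)]

recommended = top_n(predicted, already_rated, n=3)
print(recommended)  # [5 3 0]
```

Item 1 has the second-highest prediction but is left out because the user has already rated it.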

In [1]:
# imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

import tensorflow as tf
import keras

from keras.layers import Input, Embedding, Reshape, Dot, Concatenate, Dense, Dropout
from keras.models import Model

from scipy.sparse import vstack

Using TensorFlow backend.


In [2]:
# remove unnecessary TF logs
import logging
tf.get_logger().setLevel(logging.ERROR)

In [3]:
# check keras and TF version used
print('TF Version:', tf.__version__)
print('Keras Version:', keras.__version__)

TF Version: 1.15.0
Keras Version: 2.2.5


## Load datasets

In [4]:
movies_df = pd.read_csv('./datasets/ml-latest-small/movies.csv')
movies_df.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
ratings_df = pd.read_csv('./datasets/ml-latest-small/ratings.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
tags_df = pd.read_csv('./datasets/ml-latest-small/tags.csv')
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## Process data

In [7]:
# filter out rarely rated movies and rarely rating users
min_movie_ratings = 10
min_user_ratings = 10

filter_movies = (ratings_df['movieId'].value_counts() > min_movie_ratings)
filter_movies = filter_movies[filter_movies].index.tolist()

filter_users = (ratings_df['userId'].value_counts() > min_user_ratings)
filter_users = filter_users[filter_users].index.tolist()

# get the filtered data
mask = (ratings_df['movieId'].isin(filter_movies)) & (ratings_df['userId'].isin(filter_users))
ratings_df_filtered = ratings_df[mask]
del filter_movies, filter_users
ratings_df_filtered.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
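
One detail of the filter above is worth spelling out: the comparison is strict (`>`), so a movie or user with exactly `min_movie_ratings` or `min_user_ratings` ratings is dropped. The toy DataFrame below is hypothetical and only contrasts the strict and inclusive versions of the same `value_counts` filter.

```python
import pandas as pd

# Hypothetical toy ratings: movie 10 has 3 ratings, movie 20 has 2, movie 30 has 1.
toy = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 3],
    'movieId': [10, 10, 10, 20, 20, 30],
})
min_ratings = 2  # illustrative threshold

counts = toy['movieId'].value_counts()
kept_strict = sorted(counts[counts > min_ratings].index.tolist())      # more than 2 ratings
kept_inclusive = sorted(counts[counts >= min_ratings].index.tolist())  # at least 2 ratings
print(kept_strict, kept_inclusive)  # [10] [10, 20]
```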


In [9]:
# set aside a small portion of ratings_df_filtered for testing
n = 10000

rng = np.random.default_rng(42)
permuted_indices = rng.permutation(ratings_df_filtered.shape[0])


df_train = ratings_df_filtered.iloc[permuted_indices[:-n],:]
df_test = ratings_df_filtered.iloc[permuted_indices[-n:],:]
print(df_train.shape)
print(df_test.shape)

(69636, 4)
(10000, 4)
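
Slicing a single permutation, as done above, guarantees that the training and test rows are disjoint and together cover every row. The sketch below checks this on a small hypothetical index range; the sizes are illustrative.

```python
import numpy as np

n_rows, n_test = 10, 3  # illustrative sizes
rng = np.random.default_rng(42)
perm = rng.permutation(n_rows)

# Last n_test positions go to the test set, everything else to training.
train_idx, test_idx = perm[:-n_test], perm[-n_test:]

no_overlap = set(train_idx).isdisjoint(test_idx)
covers_all = sorted(np.concatenate([train_idx, test_idx]).tolist()) == list(range(n_rows))
print(len(train_idx), len(test_idx), no_overlap, covers_all)  # 7 3 True True
```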


## Deep learning matrix factorization based collaborative filtering recommendation

In [8]:
# user id and movie id mappings
user_id_mapping = {raw_id: i for i, raw_id in enumerate(ratings_df_filtered['userId'].unique())}
movie_id_mapping = {raw_id: i for i, raw_id in enumerate(ratings_df_filtered['movieId'].unique())}

In [40]:
# inverse mapping for movies
movie_id_mapping_inv = {i: raw_id for i, raw_id in enumerate(ratings_df_filtered['movieId'].unique())}
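
The mappings exist because an embedding layer looks rows up by position, so ids must be contiguous integers starting at 0, while raw MovieLens ids have gaps. A minimal sketch on hypothetical raw ids shows the round trip from raw id to index and back.

```python
import pandas as pd

raw_ids = pd.Series([50, 7, 50, 1193, 7])  # hypothetical raw movie ids

# unique() keeps order of first appearance, so indices are 0, 1, 2, ...
id_to_index = {raw: idx for idx, raw in enumerate(raw_ids.unique())}
index_to_id = {idx: raw for raw, idx in id_to_index.items()}

encoded = raw_ids.map(id_to_index)   # contiguous indices for the embedding layer
decoded = encoded.map(index_to_id)   # back to the original ids
print(encoded.tolist())  # [0, 1, 0, 2, 1]
```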

In [10]:
# apply mapping on training data
train_user_data = df_train['userId'].map(user_id_mapping)
train_movie_data = df_train['movieId'].map(movie_id_mapping)

In [11]:
# apply mapping on testing data
test_user_data = df_test['userId'].map(user_id_mapping)
test_movie_data = df_test['movieId'].map(movie_id_mapping)

In [12]:
# input variable sizes
users = len(user_id_mapping)
movies = len(movie_id_mapping)
embedding_size = 100

In [13]:
# create tensors for user and movie
user_id_input = Input(shape=(1,), name='user')
movie_id_input = Input(shape=(1,), name='movie')

In [14]:
# embedding layer for users
user_embedding = Embedding(output_dim=embedding_size,
                          input_dim=users,
                          input_length=1,
                          name='user_embedding')(user_id_input)

# embedding layer for movie
movie_embedding = Embedding(output_dim=embedding_size,
                           input_dim=movies,
                           input_length=1,
                           name='movie_embedding')(movie_id_input)

In [15]:
# reshape embedding layers
user_vector = Reshape([embedding_size])(user_embedding)
movie_vector = Reshape([embedding_size])(movie_embedding)

In [16]:
# dot product of user_vector and movie_vector
y = Dot(axes=1, normalize=False)([user_vector, movie_vector])
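
What the embedding lookup and dot product compute can be reproduced in NumPy. In this sketch the two tables are random stand-ins for the learned embeddings, and all sizes and indices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, dim = 4, 6, 3  # illustrative sizes

# Stand-ins for the learned embedding tables (random here, trained in the real model).
user_table = rng.standard_normal((n_users, dim))
movie_table = rng.standard_normal((n_movies, dim))

# A batch of (user index, movie index) pairs.
user_idx = np.array([0, 2, 3])
movie_idx = np.array([5, 1, 1])

user_vecs = user_table[user_idx]      # embedding lookup, shape (3, dim)
movie_vecs = movie_table[movie_idx]   # embedding lookup, shape (3, dim)

# Row-wise dot product, shape (3, 1): one predicted rating per pair.
pred = np.sum(user_vecs * movie_vecs, axis=1, keepdims=True)
print(pred.shape)  # (3, 1)
```

Each predicted rating is the inner product of one user vector and one movie vector, which is exactly the entry of the factorized user-item matrix for that pair.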

In [17]:
# model
model = Model(inputs=[user_id_input, movie_id_input], outputs=y)
model.compile(loss='mse', optimizer='adam')
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
user (InputLayer)               (None, 1)            0                                            
__________________________________________________________________________________________________
movie (InputLayer)              (None, 1)            0                                            
__________________________________________________________________________________________________
user_embedding (Embedding)      (None, 1, 100)       61000       user[0][0]                       
__________________________________________________________________________________________________
movie_embedding (Embedding)     (None, 1, 100)       212100      movie[0][0]                      
____________________________________________________________________________________________

In [18]:
# fit model
X = [train_user_data, train_movie_data]
y = df_train['rating']

batch_size = 100
epochs = 10
validation_split = 0.1

model.fit(X, y,
         batch_size=batch_size,
         epochs=epochs,
         validation_split=validation_split,
         shuffle=True,
         verbose=1)

Train on 62672 samples, validate on 6964 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fed4da93fd0>

In [19]:
# test model
y_pred = model.predict([test_user_data, test_movie_data]).ravel()
y_pred = np.clip(y_pred, 1.0, 5.0)  # clamp predictions to the range [1.0, 5.0]
y_true = df_test['rating'].values

# rmse
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f'RMSE on test data is: {rmse}')

RMSE on test data is: 0.8353738832188007
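
The effect of clamping predictions before computing RMSE can be seen on a few made-up numbers. The values below are hypothetical; the bounds 1.0 and 5.0 are the ones used in the cell above.

```python
import numpy as np

y_true = np.array([4.0, 1.0, 5.0, 3.5])
y_raw = np.array([4.4, 0.2, 5.6, 3.5])  # hypothetical unclipped predictions

y_clipped = np.clip(y_raw, 1.0, 5.0)    # same bounds as the cell above

def rmse(a, b):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

rmse_raw = rmse(y_true, y_raw)
rmse_clipped = rmse(y_true, y_clipped)
print(f"{rmse_raw:.3f} {rmse_clipped:.3f}")  # 0.539 0.200
```

When every true rating lies inside the clipping range, clamping can only move a prediction closer to its target, so the clipped RMSE is never larger than the raw one.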


In [58]:
# compare predictions and actual ratings
test_movie_ids = test_movie_data.map(movie_id_mapping_inv)

test_movie_titles = []

for i in test_movie_ids:
    title = movies_df[movies_df.movieId.values==i].title.values[0]
    test_movie_titles.append(title)

results = pd.DataFrame({'new_userId': test_user_data.values,
                        'new_movieId': test_movie_data.values,
                        'old_movieId': test_movie_ids.values,
                        'title': test_movie_titles,
                        'predicted_rating': np.round(y_pred, 1),
                        'actual_rating': y_true
                        })

results.head(10)

Unnamed: 0,new_userId,new_movieId,old_movieId,title,predicted_rating,actual_rating
0,248,1081,788,"Nutty Professor, The (1996)",3.1,3.5
1,0,165,2872,Excalibur (1981),4.4,5.0
2,483,628,33794,Batman Begins (2005),4.0,4.0
3,287,16,296,Pulp Fiction (1994),4.4,5.0
4,15,842,913,"Maltese Falcon, The (1941)",3.6,4.0
5,327,288,912,Casablanca (1942),4.6,1.0
6,533,1119,1485,Liar Liar (1997),4.3,3.5
7,517,233,31,Dangerous Minds (1995),3.7,1.0
8,447,324,1923,There's Something About Mary (1998),3.9,4.0
9,181,656,5902,Adaptation (2002),3.9,4.5
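
A quick way to read a table like this is to sort by absolute error and look at the worst misses. The sketch below does that on four rows copied from the output above; the `abs_error` column is an addition for illustration.

```python
import pandas as pd

# A few rows copied from the table above.
sample = pd.DataFrame({
    'title': ['Batman Begins (2005)', 'Casablanca (1942)',
              'Dangerous Minds (1995)', 'Adaptation (2002)'],
    'predicted_rating': [4.0, 4.6, 3.7, 3.9],
    'actual_rating': [4.0, 1.0, 1.0, 4.5],
})

# Absolute difference between prediction and ground truth, largest first.
sample['abs_error'] = (sample['predicted_rating'] - sample['actual_rating']).abs()
worst = sample.sort_values('abs_error', ascending=False)
print(worst['title'].tolist()[:2])
# ['Casablanca (1942)', 'Dangerous Minds (1995)']
```

Both of the largest errors come from 1.0-star ratings of movies the model expected the user to like, which is the kind of individual disagreement a purely collaborative model has trouble anticipating.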
