Model-based collaborative filtering (CF) based on:
- SVD
- Deep learning models 

Let's go through Singular Value Decomposition (SVD) first, a dimensionality reduction technique used in modern model-based CF recommender systems. The other type of CF recommender system is memory-based CF. We will also build another model-based recommender system (RS) based on deep learning networks
in the latter part of this notebook. 
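
To make the idea concrete, here is a minimal sketch of a rank-k approximation with NumPy on a tiny made-up rating matrix. The matrix `R_toy` and the choice `k = 2` are illustrative assumptions, not part of the MovieLens data. Keeping only the k largest singular values gives a low-rank matrix whose entries can be read as predicted ratings.

```python
import numpy as np

# Toy 4-user x 5-movie rating matrix (hypothetical values; 0 means "not rated")
R_toy = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 4],
    [1, 0, 4, 5, 5],
], dtype=float)

# Full SVD: R_toy = U @ diag(s) @ Vt, with singular values in descending order
U, s, Vt = np.linalg.svd(R_toy, full_matrices=False)

# Keep only the k largest singular values (the latent factors)
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("rank-2 approximation:")
print(np.round(R_approx, 2))
```

The reconstruction error (Frobenius norm) equals the square root of the sum of the squared singular values that were dropped, so a larger `k` always fits the observed matrix at least as well.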

Keras installation fails because I have not updated my Mac to the latest OS version. Return to this part after updating. 
See https://github.com/khanhnamle1994/movielens/blob/master/Deep_Learning_Model.ipynb

In [50]:
# Import libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn import model_selection, metrics, preprocessing

# Reading ratings file
ratings = pd.read_csv('../data/ratings.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating', 'timestamp'])

# Reading users file
users = pd.read_csv('../data/users.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'occ_desc'])

# Reading movies file
movies = pd.read_csv('../data/movies.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

In [51]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [52]:
n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]
print(f'Number of users = {n_users} | Number of movies = {n_movies}')

Number of users = 6040 | Number of movies = 3706


In [53]:
ratings_1 = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0) 

In [54]:
R = ratings_1.to_numpy()
user_ratings_mean = np.mean(R, axis = 1)
user_ratings_mean

array([0.05990286, 0.12924987, 0.05369671, ..., 0.02050729, 0.1287102 ,
       0.3291959 ])

In [55]:
ratings_1_sub_mean = R - user_ratings_mean.reshape(-1, 1)
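
Note that `np.mean(R, axis = 1)` averages over all 3706 columns, including the zeros that stand in for missing ratings, which is why the user means above are so small. One possible alternative, sketched here on a made-up matrix, is to average only over the movies each user actually rated. The matrix `R_toy` and the masking approach are assumptions for illustration, not what this notebook does.

```python
import numpy as np

# Hypothetical 3-user x 4-movie matrix; 0 means "not rated"
R_toy = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 4.0],
])

rated = R_toy > 0  # True where a rating exists

# Mean over all columns, zeros included (same idea as the cell above)
mean_all = R_toy.mean(axis=1)

# Mean over rated entries only; np.maximum guards against users with no ratings
counts = rated.sum(axis=1)
mean_rated = R_toy.sum(axis=1) / np.maximum(counts, 1)

# Center only the observed ratings and leave missing entries at 0
R_centered = np.where(rated, R_toy - mean_rated[:, None], 0.0)

print("mean over all entries:  ", mean_all)
print("mean over rated entries:", mean_rated)
print(R_centered)
```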

In [56]:
# Check how sparse the user-movie rating matrix is

sparsity = round(1.0 - len(ratings) / float(n_users * n_movies), 3)
print ('The sparsity level of MovieLens1M dataset is ' +  str(sparsity * 100) + '%') 

# Sanity check on the number of ratings: (1 - 0.955) * 6040 * 3706 = 0.045 * 6040 * 3706, approximately 1007291

The sparsity level of MovieLens1M dataset is 95.5%


In [57]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(ratings_1_sub_mean, k = 50)

In [58]:
U.shape, Vt.shape

((6040, 50), (50, 3706))

In [59]:
sigma = np.diag(sigma)
all_user_predicted_ratings = (U@sigma)@Vt + user_ratings_mean.reshape(-1, 1)


In [60]:
all_user_predicted_ratings.shape

(6040, 3706)

In [61]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = ratings_1.columns)
preds.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
0,4.288861,0.143055,-0.19508,-0.018843,0.012232,-0.176604,-0.07412,0.141358,-0.059553,-0.19595,...,0.027807,0.00164,0.026395,-0.022024,-0.085415,0.403529,0.105579,0.031912,0.05045,0.08891
1,0.744716,0.169659,0.335418,0.000758,0.022475,1.35305,0.051426,0.071258,0.161601,1.567246,...,-0.056502,-0.013733,-0.01058,0.062576,-0.016248,0.15579,-0.418737,-0.101102,-0.054098,-0.140188
2,1.818824,0.456136,0.090978,-0.043037,-0.025694,-0.158617,-0.131778,0.098977,0.030551,0.73547,...,0.040481,-0.005301,0.012832,0.029349,0.020866,0.121532,0.076205,0.012345,0.015148,-0.109956
3,0.408057,-0.07296,0.039642,0.089363,0.04195,0.237753,-0.049426,0.009467,0.045469,-0.11137,...,0.008571,-0.005425,-0.0085,-0.003417,-0.083982,0.094512,0.057557,-0.02605,0.014841,-0.034224
4,1.574272,0.021239,-0.0513,0.246884,-0.032406,1.552281,-0.19963,-0.01492,-0.060498,0.450512,...,0.110151,0.04601,0.006934,-0.01594,-0.05008,-0.052539,0.507189,0.03383,0.125706,0.199244


In [62]:
user_row_number = 1310 - 1 # Row position of user 1310: user IDs start at 1, row positions at 0
sorted_user_predictions = preds.iloc[user_row_number].sort_values(ascending=False) # Predicted ratings for this user, best first
sorted_user_predictions

movie_id
1097    1.571580
1090    1.373596
1674    1.292395
1196    1.263977
1961    1.153507
          ...   
2712   -0.334191
3578   -0.362995
910    -0.368467
1304   -0.494537
1136   -0.512347
Name: 1309, Length: 3706, dtype: float64

In [63]:
user_data = ratings[ratings.user_id == (1310)]
user_data

Unnamed: 0,user_id,movie_id,rating,timestamp
215928,1310,2988,3,974781935
215929,1310,1293,5,974781839
215930,1310,1295,2,974782001
215931,1310,1299,4,974781701
215932,1310,2243,4,974782001
215933,1310,2248,5,974781573
215934,1310,2620,5,974781573
215935,1310,3683,5,974781935
215936,1310,3685,4,974781935
215937,1310,1185,4,974781839


In [64]:
user_full = user_data.merge(movies).sort_values(['rating'], ascending=False)
user_full

Unnamed: 0,user_id,movie_id,rating,timestamp,title,genres
5,1310,2248,5,974781573,Say Anything... (1989),Comedy|Drama|Romance
6,1310,2620,5,974781573,This Is My Father (1998),Drama|Romance
7,1310,3683,5,974781935,Blood Simple (1984),Drama|Film-Noir
15,1310,1704,5,974781573,Good Will Hunting (1997),Drama
1,1310,1293,5,974781839,Gandhi (1982),Drama
12,1310,3101,4,974781573,Fatal Attraction (1987),Thriller
11,1310,1343,4,974781534,Cape Fear (1991),Thriller
20,1310,2000,4,974781892,Lethal Weapon (1987),Action|Comedy|Crime|Drama
18,1310,3526,4,974781892,Parenthood (1989),Comedy|Drama
17,1310,3360,4,974781935,Hoosiers (1986),Drama


In [65]:
print ('User {0} has already rated {1} movies.'.format(1310, user_full.shape[0]))

User 1310 has already rated 24 movies.


In [66]:
sorted_user_predictions = preds.iloc[user_row_number].sort_values(ascending=False)
sorted_user_predictions

movie_id
1097    1.571580
1090    1.373596
1674    1.292395
1196    1.263977
1961    1.153507
          ...   
2712   -0.334191
3578   -0.362995
910    -0.368467
1304   -0.494537
1136   -0.512347
Name: 1309, Length: 3706, dtype: float64

In [67]:
num_recommendations = 10
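recommendations = (
    movies[~movies['movie_id'].isin(user_full['movie_id'])]      # drop movies the user has already rated
    .merge(pd.DataFrame(sorted_user_predictions).reset_index())  # attach predicted ratings via movie_id
    .rename(columns = {user_row_number: 'Pred'})
    .sort_values('Pred', ascending = False)
    .iloc[:num_recommendations, :-1]                             # keep the top N rows, drop the 'Pred' column
)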

recommendations

Unnamed: 0,movie_id,title,genres
1527,1674,Witness (1985),Drama|Romance|Thriller
1769,1961,Rain Man (1988),Drama
1115,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War
1144,1242,Glory (1989),Action|Drama|War
1130,1225,Amadeus (1984),Drama
1201,1302,Field of Dreams (1989),Drama
1148,1246,Dead Poets Society (1989),Drama
1770,1962,Driving Miss Daisy (1989),Drama
1766,1957,Chariots of Fire (1981),Drama
1827,2020,Dangerous Liaisons (1988),Drama|Romance


## Let's try some model evaluation now using Surprise

In [68]:
# Import libraries from Surprise package
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

# Load Reader library
reader = Reader()

# Load ratings dataset with Dataset library
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

# Split the dataset for 5-fold evaluation
# Use the SVD algorithm.
svd = SVD()
cross_validate(svd, data, measures=['RMSE'], cv=5)
# measures=['RMSE', 'MAE']

{'test_rmse': array([0.87374066, 0.87515976, 0.87290008, 0.87342131, 0.87423553]),
 'fit_time': (5.140377998352051,
  5.085600852966309,
  4.666769981384277,
  4.462267160415649,
  4.609418153762817),
 'test_time': (0.8201639652252197,
  0.6316211223602295,
  0.8383281230926514,
  0.641211986541748,
  0.6348040103912354)}

In [69]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x15acff9d0>

In [70]:
ratings[ratings['user_id'] == 1310] # Ratings given by User 1310

Unnamed: 0,user_id,movie_id,rating,timestamp
215928,1310,2988,3,974781935
215929,1310,1293,5,974781839
215930,1310,1295,2,974782001
215931,1310,1299,4,974781701
215932,1310,2243,4,974782001
215933,1310,2248,5,974781573
215934,1310,2620,5,974781573
215935,1310,3683,5,974781935
215936,1310,3685,4,974781935
215937,1310,1185,4,974781839


In [71]:
svd.predict(1310, 1674)

Prediction(uid=1310, iid=1674, r_ui=None, est=3.76537631492737, details={'was_impossible': False})

In [72]:
svd.predict(1310, 1961)

Prediction(uid=1310, iid=1961, r_ui=None, est=3.8592427426163822, details={'was_impossible': False})

Note that movie IDs 1674 and 1961 were both recommended by our earlier SVD computation. Instead of SVD, we could also have
used NMF (Non-negative Matrix Factorization). Let's try it and see how its RMSE compares to the roughly 0.874 (mean over the five folds) from SVD. 

In [73]:
#from surprise import NMF
#algo = NMF()
#cross_validate(algo, data, measures=['RMSE'], cv=5)

NMF gave an RMSE of 0.915 (the cell above is now commented out), which is higher than the RMSE from SVD. Therefore, SVD does better! 
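
For reference, the RMSE reported by `cross_validate` is the square root of the mean squared difference between predicted and true ratings on the held-out fold, so lower is better. A minimal NumPy sketch with made-up numbers (the `rmse` helper and the arrays below are illustrative, not results from this notebook):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical true ratings and model predictions for four (user, movie) pairs
true_ratings = [5, 3, 4, 2]
predicted = [4.5, 3.5, 4.0, 3.0]

print(rmse(true_ratings, predicted))
```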

## Now deep learning stuff! 

In [74]:
class MovieDataset:
    """Map-style dataset that pairs each (user, movie) index with its rating."""
    def __init__(self, users, movies, ratings):
        self.users = users
        self.movies = movies
        self.ratings = ratings

    # Number of samples; enables len(movie_dataset)
    def __len__(self):
        return len(self.users)

    # Return one sample as tensors; enables movie_dataset[1]
    def __getitem__(self, item):
        users = self.users[item]
        movies = self.movies[item]
        ratings = self.ratings[item]

        return {
            "users": torch.tensor(users, dtype=torch.long),
            "movies": torch.tensor(movies, dtype=torch.long),
            # Ratings are regression targets, so store them as floats (nn.MSELoss expects float targets)
            "ratings": torch.tensor(ratings, dtype=torch.float),
        }

In [75]:
class RecSysModel(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
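        # Trainable lookup tables that map each user / movie index to a 32-dimensional embedding vector
        self.user_embed = nn.Embedding(n_users, 32)
        self.movie_embed = nn.Embedding(n_movies, 32)
        # Linear layer that maps the concatenated user and movie embeddings (32 + 32 = 64) to one predicted rating
        self.out = nn.Linear(64, 1)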

    
    def forward(self, users, movies, ratings=None):
        user_embeds = self.user_embed(users)
        movie_embeds = self.movie_embed(movies)
        output = torch.cat([user_embeds, movie_embeds], dim=1)
        
        output = self.out(output)
        
        return output

In [76]:
ratings.user_id.nunique()
# Keep only the first 5000 ratings, a small subset for the deep learning experiments below
ratings = ratings.iloc[:5000]
ratings.describe()


Unnamed: 0,user_id,movie_id,rating,timestamp
count,5000.0,5000.0,5000.0,5000.0
mean,19.786,1835.7824,3.5744,978509700.0
std,9.505045,1079.542711,1.081064,2020025.0
min,1.0,1.0,1.0,978100400.0
25%,11.0,1013.0,3.0,978136000.0
50%,19.5,1777.0,4.0,978197500.0
75%,26.0,2722.0,4.0,978273000.0
max,35.0,3952.0,5.0,1009669000.0


In [77]:
# Encode user and movie IDs as contiguous indices starting from 0 so that nn.Embedding lookups do not go out of bounds

df = ratings.copy()

lbl_user = preprocessing.LabelEncoder()
lbl_movie = preprocessing.LabelEncoder()
df.user_id = lbl_user.fit_transform(df.user_id.values)
df.movie_id = lbl_movie.fit_transform(df.movie_id.values)

df_train, df_valid = model_selection.train_test_split(df, test_size=0.1, random_state=42, stratify=df.rating.values)
train_dataset = MovieDataset(users=df_train.user_id.values,movies=df_train.movie_id.values,ratings=df_train.rating.values)
valid_dataset = MovieDataset(users=df_valid.user_id.values,movies=df_valid.movie_id.values,ratings=df_valid.rating.values)

In [78]:
print(len(lbl_user.classes_))
print(len(lbl_movie.classes_))
print(ratings.movie_id.max())
print(len(train_dataset))

35
1647
3952
4500
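
The numbers above show why the re-encoding matters: the subset contains 1647 distinct movies, but the largest raw movie ID is 3952, so raw IDs cannot index an embedding table with 1647 rows. Here is a minimal sketch of the same idea in plain NumPy, where `np.unique(..., return_inverse=True)` produces the same kind of 0-based codes as `LabelEncoder`. The IDs and the table width of 8 are made up for illustration.

```python
import numpy as np

# Hypothetical raw movie IDs with gaps and repeats
raw_movie_ids = np.array([3952, 10, 1193, 10, 3952, 661])

# classes: sorted unique IDs; codes: position of each raw ID within classes
classes, codes = np.unique(raw_movie_ids, return_inverse=True)

print(classes)
print(codes)

# The embedding table needs one row per distinct ID, not one row per possible raw ID
embedding_table = np.zeros((len(classes), 8))
vectors = embedding_table[codes]  # every code is a valid row index
print(vectors.shape)

# The mapping is reversible: codes can be decoded back to the raw IDs
decoded = classes[codes]
```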


In [79]:
import numpy as np
from keras.layers import Embedding, Reshape, Concatenate
from keras.models import Sequential

In [150]:
class CFModel(Sequential):

    # The constructor for the class
    def __init__(self, n_users, m_items, k_factors, **kwargs):
        # P is the embedding layer that creates a user-by-latent-factors matrix.
        # If the input is a user_id, P returns the latent factor vector for that user.
        P = Sequential()
        P.add(Embedding(n_users, k_factors, input_length=1))
        P.add(Reshape((k_factors,)))

        # Q is the embedding layer that creates a Movie by latent factors matrix.
        # If the input is a movie_id, Q returns the latent factor vector for that movie.
        Q = Sequential()
        Q.add(Embedding(m_items, k_factors, input_length=1))
        Q.add(Reshape((k_factors,)))

        super(CFModel, self).__init__(**kwargs)
        
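        # Intended: merge P and Q by taking the dot product of the user and movie latent factor vectors to return the corresponding rating.
        # NOTE: Concatenate([P, Q]) does not do this. The layer's constructor does not take input models, so P and Q are
        # never connected to this model, and model.fit fails with the ValueError shown in the training cell below.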
        self.add(Concatenate([P, Q]))

    # The rate function to predict user's rating of unrated items
    def rate(self, user_id, item_id):
        return self.predict([np.array([user_id]), np.array([item_id])])[0][0]

In [151]:
# Import Keras libraries
from keras.callbacks import Callback, EarlyStopping, ModelCheckpoint

# Define constants
K_FACTORS = 100 # Dimensionality of the user and movie embeddings
TEST_USER = 2000 # A random test user (user_id = 2000)

max_userid = ratings['user_id'].max()
max_movieid = ratings['movie_id'].max()

# Define model
# Raw IDs start at 1, so each embedding table needs max_id + 1 rows for the largest ID to be a valid index
model = CFModel(max_userid + 1, max_movieid + 1, K_FACTORS)
# Compile the model using MSE as the loss function and the AdaMax learning algorithm
model.compile(loss='mse', optimizer='adamax')



In [152]:
# Shuffle the ratings before building the training set
shuffled_ratings = ratings.sample(frac=1., random_state=11)

# User IDs in shuffled order
Users = shuffled_ratings['user_id'].values
print ('Users:', Users, ', shape =', Users.shape)

# Movie IDs in shuffled order
Movies = shuffled_ratings['movie_id'].values
print ('Movies:', Movies, ', shape =', Movies.shape)

# Ratings in shuffled order
Ratings = shuffled_ratings['rating'].values
print ('Ratings:', Ratings, ', shape =', Ratings.shape)

Users: [22  8 19 ... 29 27 18] , shape = (5000,)
Movies: [1921 1265 1610 ... 3753 1188 1682] , shape = (5000,)
Ratings: [4 5 2 ... 4 3 4] , shape = (5000,)


In [153]:
shuffled_ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
2608,22,1921,4,978136684
587,8,1265,5,978229524
2407,19,1610,2,978147043
1687,15,1092,4,978198581
2915,23,316,3,978464728


In [154]:
x = shuffled_ratings[["user_id", "movie_id"]].values
# Targets are the raw ratings (1 to 5); they are not normalized to [0, 1] here.
y = shuffled_ratings["rating"].values
# Train on the first 90% of the shuffled data and validate on the remaining 10%.
train_indices = int(0.9 * shuffled_ratings.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)

In [155]:
# Train model 

#import tensorflow as tf

# Callbacks monitor the validation loss
# Save the model weights each time the validation loss has improved
#callbacks = [EarlyStopping('val_loss', patience=2), ModelCheckpoint('weights.h5', save_best_only=True)]

# Use 1 epoch for now, 90% training data, 10% validation data 
#history = model.fit([Users, Movies], Ratings, epochs=1, validation_split=.1, verbose=2, callbacks=callbacks)

#Ratings=tf.convert_to_tensor(Ratings) 
#X=tf.convert_to_tensor([Users, Movies]) 

#history = model.fit(X, Ratings, epochs=1, validation_split = 0.1, batch_size = 600)


history = model.fit(x=x_train,y=y_train,batch_size=16,epochs=1,verbose=1,validation_data=(x_val, y_val))


ValueError: in user code:

    File "/Users/apple/miniconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1284, in train_function  *
        return step_function(self, iterator)
    File "/Users/apple/miniconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1268, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/apple/miniconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in run_step  **
        outputs = model.train_step(data)
    File "/Users/apple/miniconda3/lib/python3.10/site-packages/keras/engine/training.py", line 1050, in train_step
        y_pred = self(x, training=True)
    File "/Users/apple/miniconda3/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/Users/apple/miniconda3/lib/python3.10/site-packages/keras/layers/merging/concatenate.py", line 98, in build
        raise ValueError(

    ValueError: Exception encountered when calling layer 'cf_model_15' (type CFModel).
    
    A `Concatenate` layer should be called on a list of at least 1 input. Received: input_shape=(None, 2)
    
    Call arguments received by layer 'cf_model_15' (type CFModel):
      • inputs=tf.Tensor(shape=(None, 2), dtype=int64)
      • training=True
      • mask=None
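
The error above arises because the model receives a single `(None, 2)` array of `[user_id, movie_id]` pairs, while `Concatenate` expects a list of tensors, so the dot product described in the class comments is never built. The NumPy sketch below shows the computation that was intended, under stated assumptions: `P` and `Q` are randomly initialized stand-ins for the two embedding matrices, and all sizes, IDs, and ratings are made up. In Keras, one plausible way to get the same computation is the functional API, with two inputs feeding two `Embedding` layers that are combined by a `Dot` layer, but that variant is not shown or tested here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_movies, k_factors = 36, 50, 4  # hypothetical sizes

# Stand-ins for the two embedding layers: one latent vector per user / movie
P = rng.normal(scale=0.1, size=(n_users, k_factors))
Q = rng.normal(scale=0.1, size=(n_movies, k_factors))

# A batch shaped like x_train: each row is [user_id, movie_id]
x_batch = np.array([[22, 19], [8, 12], [19, 16]])
y_batch = np.array([4.0, 5.0, 2.0])

# Split the two columns, look up the latent vectors, take row-wise dot products
user_vecs = P[x_batch[:, 0]]
movie_vecs = Q[x_batch[:, 1]]
y_pred = np.sum(user_vecs * movie_vecs, axis=1)

# Mean squared error, the loss the model was compiled with
mse = np.mean((y_batch - y_pred) ** 2)
print(y_pred.shape, mse)
```

The key point is that the two ID columns are separated before the lookup, so each embedding matrix only ever sees its own kind of index.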
