*Copyright (c) Microsoft Corporation. All rights reserved.*

*Licensed under the MIT License.*

## Riemannian Low-rank Matrix Completion algorithm on Movielens dataset

Riemannian Low-rank Matrix Completion (RLRMC) is a matrix factorization based (vanilla) matrix completion algorithm that solves the optimization problem using Riemannian conjugate gradients algorithm (Absil et al., 2008). RLRMC is based on the works by Jawanpuria and Mishra (2018) and Mishra et al. (2013). 

This notebook provides an example of how to utilize and evaluate RLRMC implementation in **reco_utils**.

In [1]:
import numpy as np
import sys
import time
import pandas as pd
sys.path.append("../../")
sys.path.append("../../reco_utils/recommender/rlrmc/")

from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.dataset import movielens
from reco_utils.recommender.rlrmc.RLRMCdataset import RLRMCdataset 
from reco_utils.recommender.rlrmc.RLRMCalgorithm import RLRMCalgorithm
# %load_ext autoreload
# %autoreload 2

In [2]:
print("Pandas version: {}".format(pd.__version__))
print("System version: {}".format(sys.version))


Pandas version: 0.23.4
System version: 3.7.1 (default, Dec 14 2018, 13:28:58) 
[Clang 4.0.1 (tags/RELEASE_401/final)]


In [3]:
# Select Movielens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '10m'

### 1. Download the MovieLens dataset


In [4]:

df = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=["userID", "itemID", "rating", "timestamp"]
)

65.6MB [00:40, 1.63MB/s]                            


### 2. Split the data using the Spark chronological splitter provided in utilities

In [5]:
train, test = python_random_split(df) # we need to ensure that corresponding to every (user,item) in test split, the user and the item has at least one rating in training split

Generate an RLRMCdataset object from the data subsets.

In [7]:
data = RLRMCdataset(train=train, validation=test)

### 3. Train the RLRMC model on the training data

Set the default parameters.



In [8]:
# Model parameters

# rank of the model, a positive integer (usually small), required parameter
rank_parameter = 10
# regularization parameter multiplied to loss function, a positive number (usually small), required parameter
regularization_parameter = 0.001
# initialization option for the model, 'svd' employs singular value decomposition, optional parameter
initialization_flag = 'svd' #default is 'random'
# maximum number of iterations for the solver, a positive integer, optional parameter
maximum_iteration = 10 #optional, default is 100
# maximum time in seconds for the solver, a positive integer, optional parameter
maximum_time = 300#optional, default is 1000

In [9]:
model = RLRMCalgorithm(rank = rank_parameter,
                       C = regularization_parameter,
                       model_param = data.model_param,
                       initialize_flag = initialization_flag,
                       maxiter=maximum_iteration,
                       max_time=maximum_time)

In [10]:
# Verbosity of the intermediate results
verbosity=0 #optional parameter, valid values are 0,1,2, default is 0
# Whether to compute per iteration train RMSE (and test RMSE, if test data is given)
compute_iter_rmse=True #optional parameter, boolean value, default is False

In [12]:
start_time = time.time()

model.fit(data,verbosity=verbosity)

# fit_and_evaluate will compute RMSE on the validation set (if given) at every iteration
model.fit_and_evaluate(data,verbosity=verbosity)

train_time = time.time() - start_time

print("Took {} seconds for training.".format(train_time))

Took 5.135854721069336 seconds for training.


### 4. Obtain predictions from the RLRMC model on the test data

In [13]:
# Obtain predictions on (userID,itemID) pairs (60586,54775) and (52681,36519)
output = model.predict([60586,52681],[54775,36519])
# Obtain prediction on the full test set
output_full = model.predict(test['userID'].values,test['itemID'].values)


### Reference
[1] Pratik Jawanpuria and Bamdev Mishra. *A unified framework for structured low-rank matrix learning*. In International Conference on Machine Learning, 2018.

[2] Bamdev Mishra, Gilles Meyer, Francis Bach, and Rodolphe Sepulchre. *Low-rank optimization with trace norm penalty*. In SIAM Journal on Optimization 23(4):2124-2149, 2013.

[3] James Townsend, Niklas Koep, and Sebastian Weichwald. *Pymanopt: A Python Toolbox for Optimization on Manifolds using Automatic Differentiation*. In Journal of Machine Learning Research 17(137):1-5, 2016.

[4] P.-A. Absil, R. Mahony, and R. Sepulchre. *Optimization Algorithms on Matrix Manifolds*. Princeton University Press, Princeton, NJ, 2008.

[5] A. Edelman, T. Arias, and S. Smith. *The geometry of algo- rithms with orthogonality constraints*. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.