## Alternating Least Squares

In this notebook, I implement and tune an ALS model (from the `implicit` library) on the MovieLens 1M dataset. The model is evaluated with mean average precision at k (MAP@k), precision at k (P@k), and root mean squared error (RMSE).

**Alternating Least Squares (ALS)**: a matrix factorization approach for recommender systems. ALS decomposes the rating matrix **R** into two factor matrices **U** and **V** such that $R \approx U^T V$, alternately holding one factor fixed and solving a least-squares problem for the other.
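To make the alternation concrete, here is a minimal NumPy sketch of ALS on a small, fully observed dense matrix. It is an illustration only and not the algorithm used by `implicit`, which implements a confidence-weighted variant for implicit feedback on sparse data. The function name `als_dense`, the regularization weight `reg`, and the toy matrix are assumptions introduced here.

```python
import numpy as np


def als_dense(R, factors=2, reg=0.1, iterations=20, seed=0):
    """Toy ALS on a fully observed dense matrix R (users x items).

    Returns U (factors x users) and V (factors x items) with R ~= U.T @ V.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(factors, n_users))
    V = rng.normal(scale=0.1, size=(factors, n_items))
    ridge = reg * np.eye(factors)
    for _ in range(iterations):
        # Hold V fixed: the regularized least-squares solution for U is closed form.
        U = np.linalg.solve(V @ V.T + ridge, V @ R.T)
        # Hold U fixed: solve the same kind of problem for V.
        V = np.linalg.solve(U @ U.T + ridge, U @ R)
    return U, V


# Toy rating matrix with two visible taste groups (illustrative values only).
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])
U, V = als_dense(R)
print(np.round(U.T @ V, 1))
```

Each half-step is an ordinary ridge regression, which is why the method only needs a linear solver rather than gradient descent.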


**Precision at k (P@k)**: the fraction of the top-k recommendations that are relevant to the user. The higher the better.
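As a small illustration of the definition, the sketch below computes P@k for one user from a ranked list and a set of held-out items. The function name and the toy ids are made up for this example, and library implementations may normalize differently (for instance by the number of held-out items when it is smaller than k), so the values need not match `evaluation.precision_at_k` exactly.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


ranked = [10, 20, 30, 40, 50]   # items ordered by predicted score (toy ids)
held_out = {20, 50, 70}         # items the user actually interacted with
print(precision_at_k(ranked, held_out, k=5))
```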


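**Mean average precision at k (MAP@k)**: the mean over all users of the average precision at k (AP@k), where AP@k averages the precision at each rank at which a relevant item appears in the top-k list. The higher the better.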
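The following sketch shows one common way to compute AP@k and MAP@k, assuming AP@k is normalized by `min(k, number of relevant items)`. The function names and dictionaries are illustrative, and the exact normalization in a given library may differ.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """Average of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(list(recommended)[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(k, len(relevant))


def mean_average_precision_at_k(recs_by_user, relevant_by_user, k=10):
    """Mean of AP@k over users that have at least one relevant item."""
    users = [u for u, items in relevant_by_user.items() if items]
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(recs_by_user.get(u, []), relevant_by_user[u], k)
        for u in users
    )
    return total / len(users)


recs = {"a": [10, 20, 30, 40, 50], "b": [1, 2, 3]}
truth = {"a": {20, 50, 70}, "b": {1}}
print(mean_average_precision_at_k(recs, truth, k=5))
```

Unlike P@k, this metric rewards placing relevant items near the top of the list, not just somewhere within the first k positions.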

**Root Mean Squared Error (RMSE)**: the square root of the mean squared difference between predicted and true ratings. The lower the better.
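Written out, with $T$ the set of held-out (user, item) pairs, $r_{ui}$ the true rating and $\hat{r}_{ui}$ the predicted rating (symbols introduced here for illustration only):

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left(\hat{r}_{ui} - r_{ui}\right)^2}$$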

In [1]:
import os
import sys
import pathlib

In [2]:
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

In [3]:
parent = pathlib.Path(os.getcwd()).parent
sys.path.append(str(parent))

In [4]:
import pandas as pd
import numpy as np
from implicit import evaluation
from sklearn.model_selection import train_test_split

In [5]:
from implicit.als import AlternatingLeastSquares
from recommend.dataset import Dataset
from utils.util import save_model, load_model

In [6]:
rating = pd.read_csv("../data/rating.csv", index_col=0)

In [7]:
rating.head()

|   | UserID | MovieID | Rating |
|---|--------|---------|--------|
| 0 | 1 | 1193 | 5 |
| 1 | 1 | 661 | 3 |
| 2 | 1 | 914 | 3 |
| 3 | 1 | 3408 | 4 |
| 4 | 1 | 2355 | 5 |


In [8]:
rating.nunique()

UserID     6040
MovieID    3706
Rating        5
dtype: int64

In [9]:
ds_train = Dataset(rating, user="UserID", item="MovieID", rating="Rating")
# ds_test = Dataset(train_df, user="UserID", item="MovieID", rating="Rating")

### Train test split
In recommender systems, be careful when splitting the data into train and test sets: some users and items have only a single rating. Make sure that every user and item in the original data still appears in the training set.
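One way to check this condition is to compare the ids that occur in the full data with those that survive in the training part. The sketch below does this with plain NumPy index arrays; the helper name `missing_after_split` and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def missing_after_split(full_ids, train_ids):
    """Return the ids present in the full data but absent from the training part."""
    return np.setdiff1d(np.unique(full_ids), np.unique(train_ids))


# Toy interactions as parallel (user, item) index arrays. User 3 and item 3
# have a single rating, and that rating is sent to the test set.
users = np.array([0, 0, 1, 2, 2, 3])
items = np.array([0, 1, 1, 0, 2, 3])
in_train = np.array([True, True, True, True, True, False])

lost_users = missing_after_split(users, users[in_train])
lost_items = missing_after_split(items, items[in_train])
print(lost_users, lost_items)
```

For sparse matrices such as the ones in this notebook, the index arrays could be taken from `matrix.nonzero()`, which returns the row (user) and column (item) indices of the stored entries.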

In [10]:
# Train/test split: 95% of the ratings for training, 5% for testing
train_df, test_df = evaluation.train_test_split(ds_train.get_csr(), train_percentage=.95)

In [11]:
train_df

<6040x3706 sparse matrix of type '<class 'numpy.float32'>'
	with 949990 stored elements in Compressed Sparse Row format>

In [12]:
test_df.data

array([5., 5., 4., ..., 1., 4., 2.], dtype=float32)

In [13]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

### Experiment: model with 100 latent factors

In [12]:
# Initialize the model; the number of latent factors defaults to 100
als = AlternatingLeastSquares(num_threads=1, random_state=1)

In [13]:
als.fit(train_df)

  0%|          | 0/15 [00:00<?, ?it/s]

#### RMSE of the model
To compute RMSE, we need to:

- Compute the predicted rating matrix from the user factors and item factors
- Get the user and item indices of the entries in the test data
- Use these indices to collect the true ratings in one array
- Use the same indices to collect the predicted ratings in another array
- Use `sklearn.metrics.mean_squared_error` to compute the MSE of the predictions
- Take the square root to get the RMSE
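The first step materializes a dense users x items matrix, which is fine for MovieLens 1M but grows quickly with larger catalogs. As a hedged alternative, the sketch below computes only the predictions that are needed, one dot product per held-out pair. The function name `rmse_on_pairs` and the toy factors are assumptions, and the factor matrices are assumed to be NumPy arrays with one row per user or item.

```python
import numpy as np


def rmse_on_pairs(user_factors, item_factors, uidx, iidx, true_ratings):
    """RMSE over the given (user, item) pairs without building the full matrix."""
    # Row-wise dot product: predicted[n] = user_factors[uidx[n]] . item_factors[iidx[n]]
    predicted = np.einsum("ij,ij->i", user_factors[uidx], item_factors[iidx])
    return float(np.sqrt(np.mean((predicted - true_ratings) ** 2)))


# Toy factors: 3 users, 2 items, 2 latent dimensions.
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[2.0, 0.0], [0.0, 3.0]])
uidx = np.array([0, 1, 2])
iidx = np.array([0, 1, 0])
truth = np.array([2.0, 3.0, 4.0])
print(rmse_on_pairs(U, V, uidx, iidx, truth))
```

The result should agree with indexing the dense product `U.dot(V.T)[uidx, iidx]`, only with memory proportional to the number of test pairs.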

In [27]:
# Compute the predicted rating matrix (dense, users x items)
rev_matrix = als.user_factors.dot(als.item_factors.T)

In [37]:
# The true ratings of the test data are stored in test_df.data (CSR sparse matrix)
uidx, iidx = test_df.nonzero()  # user and item indices of the nonzero test entries
# Look up the predicted rating for each (user, item) pair
predict = rev_matrix[uidx, iidx]

In [38]:
# Take the square root of the MSE to get the RMSE of the model
als_rmse = np.sqrt(mean_squared_error(test_df.data, predict))  # (y_true, y_pred)
print(f"RMSE of ALS is {als_rmse}")

RMSE of ALS is 3.214193820953369


In [43]:
# Compute MAP@k of the ALS model (K defaults to 10)
print(f"MAP@k of ALS is {evaluation.mean_average_precision_at_k(als, train_df, test_df)}")

  0%|          | 0/5627 [00:00<?, ?it/s]

MAP@k of ALS is 0.12102710341157799


In [44]:
# Compute P@k of the ALS model (K defaults to 10)
print(f"P@k of ALS is {evaluation.precision_at_k(als, train_df, test_df)}")

  0%|          | 0/5627 [00:00<?, ?it/s]

P@k of ALS is 0.2342468307233408


In [14]:
def train_and_evaluation(factor=100, data=ds_train.get_csr()):
    """Train an ALS model with the given number of factors and print RMSE, MAP@k and P@k.

    Parameters
    ----------
    factor: int
        Number of latent factors.

    data: CSR matrix
        Rating matrix that is split into train and test sets.

    """
    train, test = evaluation.train_test_split(data, train_percentage=.95)
    model = AlternatingLeastSquares(factors=factor, num_threads=1, random_state=1)
    model.fit(train)
    inv_mt = model.user_factors.dot(model.item_factors.T)
    rmse_res = np.sqrt(mean_squared_error(test.data, inv_mt[test.nonzero()]))  # (y_true, y_pred)
    print(f"RMSE of model is: {rmse_res}")
    mapk = evaluation.mean_average_precision_at_k(model, train, test)
    preck = evaluation.precision_at_k(model, train, test)
    print(f"MAP@k of model is {mapk}")
    print(f"P@k of model is {preck}")
    return

In [15]:
# Compute RMSE, MAP@k and P@k with 1000 latent factors
train_and_evaluation(1000)

  0%|          | 0/15 [00:00<?, ?it/s]

RMSE of model is: 3.6769254207611084


  0%|          | 0/5591 [00:00<?, ?it/s]

  0%|          | 0/5591 [00:00<?, ?it/s]

MAP@k of model is 0.03617523741612405
P@k of model is 0.07756670488704329


In [16]:
# Compute RMSE, MAP@k and P@k with 300 latent factors
train_and_evaluation(300)

  0%|          | 0/15 [00:00<?, ?it/s]

RMSE of model is: 3.4202327728271484


  0%|          | 0/5642 [00:00<?, ?it/s]

  0%|          | 0/5642 [00:00<?, ?it/s]

MAP@k of model is 0.08315190597455513
P@k of model is 0.16853209039894515


In [17]:
# Compute RMSE, MAP@k and P@k with 50 latent factors
train_and_evaluation(50)

  0%|          | 0/15 [00:00<?, ?it/s]

RMSE of model is: 3.181378126144409


  0%|          | 0/5636 [00:00<?, ?it/s]

  0%|          | 0/5636 [00:00<?, ?it/s]

MAP@k of model is 0.1265496960866067
P@k of model is 0.24678591391839017


In [18]:
# Compute RMSE, MAP@k and P@k with 30 latent factors
train_and_evaluation(30)

  0%|          | 0/15 [00:00<?, ?it/s]

RMSE of model is: 3.1846776008605957


  0%|          | 0/5642 [00:00<?, ?it/s]

  0%|          | 0/5642 [00:00<?, ?it/s]

MAP@k of model is 0.12268558662026065
P@k of model is 0.24459694447894673


In [19]:
# Compute RMSE, MAP@k and P@k with 10 latent factors
train_and_evaluation(10)

  0%|          | 0/15 [00:00<?, ?it/s]

RMSE of model is: 3.216597080230713


  0%|          | 0/5610 [00:00<?, ?it/s]

  0%|          | 0/5610 [00:00<?, ?it/s]

MAP@k of model is 0.1023264824167602
P@k of model is 0.20694595857343093
