# Latent Factors Model

In this notebook, a latent factor model will be implemented to create a recommendation system for movies. 

This notebook contains part of the documentation. The full details of the algorithm can be found in the accompanying report in the repository.

First, we will install the dependencies: numpy, scikit-learn and optuna. Optuna will be used for hyper-parameter optimization, and scikit-learn provides the mean squared error metric.

In [1]:
%pip install numpy scikit-learn optuna

Note: you may need to restart the kernel to use updated packages.


Now we import the required dependencies.

In [2]:
from random import randint
import os
import operator
import math
import time
from typing import Any, Dict, List
import json
import optuna
import numpy as np
from sklearn.metrics import mean_squared_error

In order to retrieve the train, test and validation datasets as well as the constructed utility matrix, we will use the MatrixMaker class.
The class is responsible for constructing the matrices and storing them on disk to save computation time. Make sure to delete these stored files when changing the ratings file.
The full functionality of the class can be found in the documentation.

In [3]:
from matrix_maker import MatrixMaker

data_retriever = MatrixMaker()
(train_set, test_set, validation_set, utility_matrix) = data_retriever.make_matrices(remake=False)

Files found, returning the matrices.
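
The internals of MatrixMaker are not shown in this notebook, so the following is only a minimal sketch of how a utility matrix of this shape could be built. It assumes rating rows of the form (user id, movie id, rating, timestamp) with 1-based ids, which matches how the later cells index the data. The helper name `build_utility_matrix` and the toy ratings are hypothetical.

```python
import numpy as np

def build_utility_matrix(ratings: np.ndarray, n_movies: int, n_users: int) -> np.ndarray:
    """Hypothetical helper: movies x users matrix, NaN where no rating exists."""
    matrix = np.full((n_movies, n_users), np.nan)
    users = ratings[:, 0].astype(int) - 1   # 1-based ids -> 0-based columns
    movies = ratings[:, 1].astype(int) - 1  # 1-based ids -> 0-based rows
    matrix[movies, users] = ratings[:, 2]
    return matrix

# Toy ratings: (user id, movie id, rating, timestamp)
toy_ratings = np.array([
    [1, 1, 5.0, 0],
    [1, 2, 3.0, 0],
    [2, 1, 4.0, 0],
    [3, 3, 2.0, 0],
])
toy_utility = build_utility_matrix(toy_ratings, n_movies=3, n_users=3)
print(toy_utility)
```

Missing ratings stay as NaN, which is why the later cells rely on `np.isnan` and `np.nanmean`.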


We will use the Logger class to log metadata about the execution of the program so that it can be used later to analyse the algorithm.

In [4]:
from logger import Logger

OPERATION = "latent_factors"
logger = Logger(OPERATION)

We will get the program configuration from a configuration file. This is done to separate the parameters of the program from the program itself.

In [5]:
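config = None
try:
    # Use a context manager so the file handle is always closed
    with open("config.json", "r") as config_file:
        config = json.load(config_file)
except (OSError, json.JSONDecodeError) as e:
    print(e)
    print("Check if config file exists and is in good order")
    # Stop here: the cells below cannot run without a valid configuration
    raise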

hyper_optimization = bool(config["hyper_optimization"])
hyper_epoch = config["hyper_epoch"]
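
The configuration file itself is not shown here. As a rough sketch, it needs at least the keys that this notebook reads: `hyper_optimization`, `hyper_epoch`, `latent_factors`, `train_epoch`, `alpha` and `regularization_factor`. The values below are illustrative placeholders, not the settings used for the reported results, and the helper `load_config` is hypothetical.

```python
import json
import os
import tempfile

# Keys are the ones this notebook reads; the values are illustrative placeholders.
example_config = {
    "hyper_optimization": False,
    "hyper_epoch": 10,
    "latent_factors": 10,
    "train_epoch": 300,
    "alpha": 0.01,
    "regularization_factor": 0.05,
}

REQUIRED_KEYS = set(example_config)

def load_config(path: str) -> dict:
    """Hypothetical loader: read the JSON file and fail early on missing keys."""
    with open(path, "r") as config_file:
        loaded = json.load(config_file)
    missing = REQUIRED_KEYS - set(loaded)
    if missing:
        raise KeyError(f"Missing config keys: {sorted(missing)}")
    return loaded

# Round-trip through a temporary file so the sketch does not touch config.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "config.json")
    with open(tmp_path, "w") as config_file:
        json.dump(example_config, config_file)
    loaded_config = load_config(tmp_path)

print(loaded_config["hyper_optimization"], loaded_config["alpha"])
```

Checking the keys up front gives a clearer error than a `KeyError` raised halfway through training.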

The **get_biases** function will be used to calculate the biases for movies and users. This is needed to add the local effects for the final predicted ratings.

In [6]:
def get_biases(utility_matrix: np.ndarray, global_average: float) -> (Dict, Dict):
    """Calculate biases for movies and users.

    Args:
        utility_matrix (np.ndarray): The utility matrix (movies x users)
        global_average (float): The average rating for the train set

    Returns:
        (Dict, Dict): Dictionaries for user and movie biases
    """    
    # Calculate the user biases
    users_bias = dict()
    for i in range(utility_matrix.shape[1]):
        m = np.nanmean(utility_matrix[:, i])
        if(np.isnan(m)):
            users_bias[i] = 0.0
        else: 
            users_bias[i] = m - global_average
    
    # Calculate the movies biases
    movies_bias = dict()
    for i in range(utility_matrix.shape[0]):
        m = np.nanmean(utility_matrix[i, :])
        if(np.isnan(m)):
            movies_bias[i] = 0.0
        else:
            movies_bias[i] = m - global_average

    return (users_bias, movies_bias)
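
The two loops above can also be expressed with array operations. The following is a sketch of an equivalent computation on a toy matrix, not the version used in this notebook. The name `get_biases_vectorized` is hypothetical, and it returns arrays instead of dictionaries (both can be indexed by an integer position).

```python
import warnings
import numpy as np

def get_biases_vectorized(utility_matrix, global_average):
    """Hypothetical variant: returns arrays (users_bias, movies_bias)."""
    with warnings.catch_warnings():
        # All-NaN rows/columns trigger "Mean of empty slice"; they are handled below.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        user_means = np.nanmean(utility_matrix, axis=0)   # one mean per column (user)
        movie_means = np.nanmean(utility_matrix, axis=1)  # one mean per row (movie)
    # Users or movies without any rating get a neutral bias of 0.0
    users_bias = np.where(np.isnan(user_means), 0.0, user_means - global_average)
    movies_bias = np.where(np.isnan(movie_means), 0.0, movie_means - global_average)
    return users_bias, movies_bias

# Toy matrix: 2 movies (rows) x 3 users (columns); user 2 has no ratings
toy = np.array([
    [5.0, 4.0, np.nan],
    [3.0, np.nan, np.nan],
])
toy_users_bias, toy_movies_bias = get_biases_vectorized(toy, global_average=4.0)
print(toy_users_bias)   # user biases: 0.0, 0.0, 0.0
print(toy_movies_bias)  # movie biases: 0.5, -1.0
```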

We will now use the **get_biases** function to get the biases.

In [7]:
(number_users, number_movies, max_ratings, max_timestamp) = np.max(train_set, axis=0)
number_users = int(number_users)
number_movies = int(number_movies)
number_predictions = len(test_set)
number_ratings = len(train_set)
global_average = train_set.mean(axis=0)[2]

(init_users_bias, init_movies_bias) = get_biases(utility_matrix, global_average)



The **build_latent_factors** function will create the matrices q and p through stochastic gradient descent.

In [8]:
def build_latent_factors(latent_factors: int, train_epoch: int, alpha: float, regularization: float, movies_bias: Dict, users_bias: Dict) -> (np.ndarray, np.ndarray):
    """Build the Q and P matrices using stochastic gradient descent.

    Args:
        latent_factors (int): The number of latent factors to build the Q and P matrices
        train_epoch (int): The number of training cycles
        alpha (float): The learning rate
        regularization (float): The regularization factor
        movies_bias (Dict): A dictionary containing the biases of all movies.
        users_bias (Dict): A dictionary containing the biases of all users.

    Returns:
        (np.ndarray, np.ndarray): The Q and P matrices respectively
    """    
    # Initialize random matrices q and p
    q = np.random.rand(number_movies, latent_factors)
    p = np.random.rand(latent_factors, number_users)
    #Perform stochastic gradient descent to get matrices q and p
    for e in range(train_epoch):
        print("Iteration "+str(e+1)+ " out of "+str(train_epoch))
        for i in range(number_movies):
            for j in range(number_users):
                if(np.isnan(utility_matrix[i, j])): continue
                current_rating = predict(p, q, i, j, movies_bias, users_bias)
                difference = utility_matrix[i, j] - current_rating
                movies_bias[i] = movies_bias[i] + (alpha * (difference-(regularization*movies_bias[i])))
                users_bias[j] = users_bias[j] + (alpha * (difference-(regularization*users_bias[j])))
                q_i = q[i, :].copy()  # keep the pre-update movie factors for the user update
                q[i, :] = q[i, :] + (alpha*((difference*p[:, j])-(regularization*q[i, :])))
                p[:, j] = p[:, j] + (alpha*((difference*q_i)-(regularization*p[:, j])))
    return (q, p)
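
To make the update rule easier to follow, here is a minimal sketch of the step for a single observed rating, repeated on toy values. The function name `sgd_step` and all numbers are hypothetical. The sketch computes both factor updates from the values held before the step, which is an assumption about the intended rule.

```python
import numpy as np

def sgd_step(q_i, p_j, b_i, b_j, rating, mu, alpha, reg):
    """Hypothetical single update for one (movie i, user j) rating."""
    error = rating - (mu + b_i + b_j + np.dot(q_i, p_j))
    new_b_i = b_i + alpha * (error - reg * b_i)
    new_b_j = b_j + alpha * (error - reg * b_j)
    # Both factor updates use the values from before the step.
    new_q_i = q_i + alpha * (error * p_j - reg * q_i)
    new_p_j = p_j + alpha * (error * q_i - reg * p_j)
    return new_q_i, new_p_j, new_b_i, new_b_j, error

q_i = np.array([0.5, 0.5])   # latent factors of one movie
p_j = np.array([0.5, 0.5])   # latent factors of one user
b_i, b_j, mu = 0.0, 0.0, 3.0

errors = []
for _ in range(50):
    q_i, p_j, b_i, b_j, error = sgd_step(
        q_i, p_j, b_i, b_j, rating=5.0, mu=mu, alpha=0.01, reg=0.05
    )
    errors.append(abs(error))

print(errors[0], errors[-1])  # the absolute error shrinks over the steps
```

The regularization term pulls the biases and factors towards zero, so the error settles at a small positive value instead of reaching exactly zero.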

The **predict_results** function loops through the entries of either the test set or the validation set and produces a list of the predicted ratings.

In [9]:
def predict_results(q: np.ndarray, p: np.ndarray, movies_bias: Dict, users_bias: Dict, prediction_set: np.ndarray) -> List[float]:
    """Calculate the predicted ratings for the provided data set.

    Args:
        q (np.ndarray): The Q matrix (movies x latent factors)
        p (np.ndarray): The P matrix (latent factors x users)
        movies_bias (Dict): A dictionary containing the biases of all movies.
        users_bias (Dict): A dictionary containing the biases of all users.
        prediction_set (np.ndarray): The data set that the predictions need to be made from

    Returns:
        List[float]: The list of the predicted ratings.
    """
    result = []
    for index in range(len(prediction_set)):
        userp = int(prediction_set[index, 0])-1
        moviep = int(prediction_set[index, 1])-1
        rating = predict(p, q, moviep, userp, movies_bias, users_bias)
        result.append(rating)
    return result

The **predict** function calculates the rating for a single entry.

In [10]:
def predict(p: np.ndarray, q: np.ndarray, i: int, j: int, movies_bias: Dict, users_bias: Dict) -> float:
    """Calculate the predicted rating for a single entry.

    Args:
        p (np.ndarray): The P matrix (latent factors x users)
        q (np.ndarray): The Q matrix (movies x latent factors)
        i (int): The movie index
        j (int): The user index
        movies_bias (Dict): A dictionary containing the biases of all movies.
        users_bias (Dict): A dictionary containing the biases of all users.

    Returns:
        float: The predicted rating.
    """
    return global_average + movies_bias[i] + users_bias[j] + np.dot(q[i, :], p[:, j])
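
As a worked example of this formula, the prediction is the global average plus the movie bias, the user bias and the dot product of the movie and user factors. The numbers below are made up for illustration only.

```python
import numpy as np

mu = 3.5                              # global average rating
toy_movies_bias = {0: 0.2}            # movie 0 is rated slightly above average
toy_users_bias = {0: -0.5}            # user 0 rates below average
toy_q = np.array([[1.0, 0.5]])        # 1 movie x 2 latent factors
toy_p = np.array([[0.4], [0.2]])      # 2 latent factors x 1 user

interaction = np.dot(toy_q[0, :], toy_p[:, 0])   # 1.0*0.4 + 0.5*0.2 = 0.5
toy_prediction = mu + toy_movies_bias[0] + toy_users_bias[0] + interaction
print(toy_prediction)                 # 3.5 + 0.2 - 0.5 + 0.5, i.e. about 3.7
```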

The **calculate_RMSE** function is used to calculate the Root Mean Squared Error which is used to find the accuracy of the algorithm on a data set (test/validation).

In [11]:
def calculate_RMSE(results: List[float], prediction_set: np.ndarray) -> float:
    """Calculate the RMSE between the results and prediction set.

    Args:
        results (List[float]): The list of predicted results
        prediction_set (np.ndarray): The data set (test or validation) holding the expected ratings

    Returns:
        float: The RMSE between the results and the prediction set
    """
    expected = prediction_set[:, 2].flatten()
    assert len(expected) == len(results)
    return math.sqrt(mean_squared_error(expected, results))
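
For reference, the RMSE is the square root of the mean of the squared differences between expected and predicted ratings. The sketch below computes it by hand on toy values; the helper `rmse` is hypothetical and only meant to show the definition, not to replace the scikit-learn call above.

```python
import math

def rmse(expected, predicted):
    """Hypothetical pure-Python RMSE: square root of the mean squared error."""
    if len(expected) != len(predicted):
        raise ValueError("Both sequences must have the same length")
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

toy_expected = [4.0, 3.0, 5.0, 2.0]
toy_predicted = [3.0, 3.0, 4.0, 4.0]
toy_rmse = rmse(toy_expected, toy_predicted)   # sqrt((1 + 0 + 1 + 4) / 4)
print(toy_rmse)
```

Because the errors are squared before averaging, a single large miss (the last entry here) dominates the result.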

The **objective** function is used by Optuna for hyper-parameter optimization.

In [12]:
def objective(trial: optuna.trial.Trial) -> float:
    """Used by Optuna for hyper parameter optimization.
    Calculates the RMSE for a particular set of hyper parameters.

    Args:
        trial (optuna.trial.Trial): The Trial object that Optuna uses.

    Returns:
        float: The RMSE of the model built using the hyperparameters on the validation set.
    """    
    latent_factors = trial.suggest_int("latent_factors", 7, 18)
    # Pass the step by keyword; newer Optuna versions no longer accept it positionally
    train_epoch = trial.suggest_int("train_epoch", 200, 600, step=50)
    alpha = trial.suggest_float("alpha", 0.01, 0.02)
    regularization = trial.suggest_float("regularization", 0.045, 0.85)
    movies_bias = init_movies_bias.copy()
    users_bias = init_users_bias.copy() 
    (q, p) = build_latent_factors(latent_factors, train_epoch, alpha, regularization, movies_bias, users_bias)
    results = predict_results(q, p, movies_bias, users_bias, validation_set)
    return calculate_RMSE(results, validation_set)

This is the main function. If hyper-parameter optimization is on, the program will use Optuna to optimize the hyper-parameters. Otherwise, the Q and P matrices will be constructed and the predictions will be made using the parameters from the config file.

In [None]:
if __name__ == '__main__':
    movies_bias = init_movies_bias.copy()
    users_bias = init_users_bias.copy() 
    if not hyper_optimization:
        latent_factors = config["latent_factors"]
        train_epoch = config["train_epoch"]
        alpha = config["alpha"]
        regularization = config["regularization_factor"]
        start_time = time.time()
        (q, p) = build_latent_factors(latent_factors, train_epoch, alpha, regularization, movies_bias, users_bias)
        results = predict_results(q, p, movies_bias, users_bias, test_set)

        # Calculate the time taken and RMSE and save to the log file
        # (use total_time so the time module is not shadowed)
        total_time = time.time() - start_time
        rmse = calculate_RMSE(results, test_set)
        print("--- %s seconds ---" % (total_time))
        print(rmse)
        logger.save(total_time, rmse)

    else:
        # Start an Optuna study for hyper parameter optimization
        study = optuna.create_study()
        print("start of hyperoptimization")
        study.optimize(objective, n_trials=hyper_epoch, n_jobs=-1)
        print("end of hyperoptimization")

        # Retrieve the best parameters found
        latent_factors = study.best_params["latent_factors"]
        train_epoch = study.best_params["train_epoch"]
        alpha = study.best_params["alpha"]
        regularization = study.best_params["regularization"]

        # Modify the logger object to the new parameters
        logger.latent_factors = latent_factors
        logger.train_epoch = train_epoch
        logger.alpha = alpha
        logger.regularization = regularization

        # Calculate the time taken and RMSE and save to the log file
        start_time = time.time()
        (q, p) = build_latent_factors(latent_factors, train_epoch, alpha, regularization, movies_bias, users_bias)
        results = predict_results(q, p, movies_bias, users_bias, test_set)
        total_time = time.time() - start_time
        rmse = calculate_RMSE(results, test_set)
        print("--- " + str(total_time) + " seconds ---")
        print("--- rmse: " + str(rmse) + " ---")
        logger.save(total_time, rmse)

[I 2021-02-12 23:52:04,051] A new study created in memory with name: no-name-9e867a1b-4546-499d-938b-53edc9d5361a


start of hyperoptimization
Iteration 1 out of 300
Iteration 1 out of 450
Iteration 1 out of 500
Iteration 1 out of 400
Iteration 1 out of 450
Iteration 2 out of 450
Iteration 2 out of 450
Iteration 2 out of 400
Iteration 2 out of 300
Iteration 2 out of 500
Iteration 3 out of 450
Iteration 3 out of 400
Iteration 3 out of 300
Iteration 3 out of 450
Iteration 3 out of 500
Iteration 4 out of 450
Iteration 4 out of 450
Iteration 4 out of 400
Iteration 4 out of 300
Iteration 4 out of 500
Iteration 5 out of 450
Iteration 5 out of 300
Iteration 5 out of 400
Iteration 5 out of 450
Iteration 5 out of 500
Iteration 6 out of 450
Iteration 6 out of 300
Iteration 6 out of 450
Iteration 6 out of 400
Iteration 6 out of 500
Iteration 7 out of 450
Iteration 7 out of 300
Iteration 7 out of 450
Iteration 7 out of 400
Iteration 7 out of 500
Iteration 8 out of 450
Iteration 8 out of 500
Iteration 8 out of 450
Iteration 8 out of 400
Iteration 8 out of 300
Iteration 9 out of 450
Iteration 9 out of 500
Iterati