# Recommender Systems
This project focuses on building a recommender system that predicts user ratings for books based on their past interactions. By understanding user preferences, the system aims to suggest books that align with individual tastes, enhancing the reading experience through personalized recommendations.

Let's import the necessary libraries.

In [None]:
# Import necessary libraries

from tqdm import tqdm  # For displaying progress bars during iterations

import pandas as pd  # For data manipulation and CSV file handling
import numpy as np  # For numerical operations
from sklearn.model_selection import train_test_split  # For the train/validation split

Before we start, we need to load the two files needed for the project:
- `train.csv`: contains the user ratings for the books.
- `test.csv`: contains the user-book pairs for which we need to predict a rating.

In [None]:
# Load the training and test data
train_df = pd.read_csv("Data/train.csv")
test_df = pd.read_csv("Data/test.csv")

## Matrix Factorization Method
This section implements matrix factorization with bias terms (an SVD-like approach) to predict the rating a user might give to a book.
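
For reference, the model implemented in the code of this section can be written as follows, where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and book biases, and $p_u$ and $q_i$ are the user and book latent factor vectors. The symbols $\gamma$ and $\lambda$ are notation introduced here for the `learning_rate` and `reg_param` arguments.

$$
\hat{r}_{ui} = \mu + b_u + b_i + p_u^\top q_i
$$

For each observed rating $r_{ui}$, with error $e_{ui} = r_{ui} - \hat{r}_{ui}$, the parameters are updated as:

$$
\begin{aligned}
b_u &\leftarrow b_u + \gamma\,(e_{ui} - \lambda\, b_u) \\
b_i &\leftarrow b_i + \gamma\,(e_{ui} - \lambda\, b_i) \\
p_u &\leftarrow p_u + \gamma\,(e_{ui}\, q_i - \lambda\, p_u) \\
q_i &\leftarrow q_i + \gamma\,(e_{ui}\, p_u - \lambda\, q_i)
\end{aligned}
$$

These are stochastic gradient steps on the regularized squared error $e_{ui}^2 + \lambda\,(b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2)$, with the constant factor of 2 absorbed into $\gamma$.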

First, we define a function to compute the Root Mean Squared Error (RMSE), the metric used to evaluate the accuracy of the model's predictions. RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower values indicate better predictions and large errors are penalized more heavily than small ones. In this notebook it is used to score each hyperparameter combination on the validation set.

In [None]:
# Function for the computation of RMSE
def compute_rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))
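
As a quick illustration of the metric, the sketch below applies the same function to a handful of hypothetical ratings (made up for this example, not taken from the dataset) and compares the result with the mean absolute error. Two errors of size 2 out of four predictions give an RMSE of about 1.41 but a mean absolute error of 1.0, which shows how RMSE weights large errors more heavily.

```python
import numpy as np

def compute_rmse(actual, predicted):
    # Same definition as in the notebook cell
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Hypothetical ratings, chosen only to illustrate the arithmetic
actual = np.array([5.0, 3.0, 4.0, 2.0])
predicted = np.array([3.0, 3.0, 4.0, 4.0])

rmse = compute_rmse(actual, predicted)      # sqrt((4 + 0 + 0 + 4) / 4) = sqrt(2)
mae = np.mean(np.abs(actual - predicted))   # (2 + 0 + 0 + 2) / 4 = 1.0

print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")  # RMSE: 1.4142, MAE: 1.0000
```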

The next step is to split the training data into two subsets: `train_set` (80% of the rows) and `val_set` (the remaining 20%). The model is trained on `train_set`, while `val_set` is held out for validation during hyperparameter tuning. Evaluating on data the model has not seen during training lets us detect overfitting and select hyperparameters that generalize, rather than ones that only fit the training data.

In [None]:
# Split training data into training and validation sets
train_set, val_set = train_test_split(train_df, test_size=0.2, random_state=42)
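
Because the split is random over rows, some validation rows may involve a user or book that never appears in the training subset. Those rows can only be predicted with the global-mean fallback, so it may be worth checking how common they are. The sketch below does this on a hypothetical toy table that reuses the `user_id`, `book_id`, and `rating` column names; it uses `DataFrame.sample` as a stand-in for `train_test_split` to keep the example dependency-light.

```python
import pandas as pd

# Hypothetical toy ratings with the same column names as train.csv
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "book_id": [10, 11, 10, 12, 11, 12, 10, 13, 12, 14],
    "rating":  [4, 5, 3, 2, 5, 4, 1, 3, 4, 2],
})

# 80/20 split via pandas sampling (stand-in for train_test_split)
val_toy = ratings.sample(frac=0.2, random_state=42)
train_toy = ratings.drop(val_toy.index)

# A validation row is "cold" if its user or its book never appears in the training split
known_users = set(train_toy["user_id"])
known_books = set(train_toy["book_id"])
is_cold = ~(val_toy["user_id"].isin(known_users) & val_toy["book_id"].isin(known_books))

cold_fraction = is_cold.mean()
print(f"{len(train_toy)} train rows, {len(val_toy)} validation rows, "
      f"cold fraction: {cold_fraction:.2f}")
```

On the real data, the same check can be run with `train_set` and `val_set` in place of the toy frames. A high cold fraction would mean that validation RMSE is dominated by the fallback rather than by the learned factors.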

We then define the grid search function, which systematically tests every combination of hyperparameters (latent dimensions, learning rates, and regularization terms). For each combination, the model is trained on `train_set` and evaluated on `val_set` using RMSE. The combination with the lowest validation RMSE is returned for training the final model. Selecting on held-out RMSE favors hyperparameters that generalize to new data over ones that merely fit the training set.

In [None]:
# Function for performing grid search for hyperparameter tuning.
def grid_search(train_set, val_set, latent_dims, learning_rates, reg_params, n_epochs=20):
    print("Starting grid search...")
    best_params = None  # To store the best combination of hyperparameters
    best_rmse = float('inf')  # Initialize the best RMSE to infinity
    results = []  # List to store the results for each combination

    # Iterate through all combinations of latent dimensions, learning rates, and regularization parameters
    for latent_dim in latent_dims:
        for learning_rate in learning_rates:
            for reg_param in reg_params:
                print(f"Testing params: Latent Dim: {latent_dim}, Learning Rate: {learning_rate}, Regularization: {reg_param}")

                # Map user and book IDs to numerical indices for matrix operations
                user_to_index = {user_id: idx for idx, user_id in enumerate(train_set['user_id'].unique())}
                book_to_index = {book_id: idx for idx, book_id in enumerate(train_set['book_id'].unique())}

                n_users = len(user_to_index)  # Total number of unique users
                n_items = len(book_to_index)  # Total number of unique books

                # Initialize latent factor matrices (P, Q) and biases (b_u, b_i) with small random values
                P = np.random.normal(scale=0.01, size=(n_users, latent_dim))  # User latent factors
                Q = np.random.normal(scale=0.01, size=(n_items, latent_dim))  # Book latent factors
                mu = train_set['rating'].mean()  # Global mean of ratings
                b_u = np.zeros(n_users)  # User biases
                b_i = np.zeros(n_items)  # Book biases

                # Training loop for the current combination of hyperparameters
                for epoch in range(n_epochs):
                    for _, row in train_set.iterrows():
                        user_idx = user_to_index.get(row['user_id'])  # User index
                        book_idx = book_to_index.get(row['book_id'])  # Book index
                        rating = row['rating']  # Actual rating

                        # Skip if user or book is not in the mapping
                        if user_idx is None or book_idx is None:
                            continue

                        # Predicted rating using the model
                        pred_rating = mu + b_u[user_idx] + b_i[book_idx] + np.dot(P[user_idx], Q[book_idx])

                        # Compute the error between the actual and predicted ratings
                        error = rating - pred_rating

                        # Update biases and latent factors using gradient descent
                        b_u[user_idx] += learning_rate * (error - reg_param * b_u[user_idx])
                        b_i[book_idx] += learning_rate * (error - reg_param * b_i[book_idx])
                        p_old = P[user_idx].copy()  # Keep pre-update user factors for the book update
                        P[user_idx] += learning_rate * (error * Q[book_idx] - reg_param * P[user_idx])
                        Q[book_idx] += learning_rate * (error * p_old - reg_param * Q[book_idx])

                # Evaluate the model on the validation set
                val_actual = []  # List of actual ratings in the validation set
                val_predicted = []  # List of predicted ratings in the validation set

                for _, row in val_set.iterrows():
                    user_idx = user_to_index.get(row['user_id'])
                    book_idx = book_to_index.get(row['book_id'])
                    if user_idx is not None and book_idx is not None:
                        # Predict rating using the model
                        pred_rating = mu + b_u[user_idx] + b_i[book_idx] + np.dot(P[user_idx], Q[book_idx])
                    else:
                        # Fallback to global mean if user or book is not in the mapping
                        pred_rating = mu  

                    val_actual.append(row['rating'])
                    val_predicted.append(pred_rating)

                # Compute RMSE for the validation set
                rmse = compute_rmse(np.array(val_actual), np.array(val_predicted))
                print(f"  Params RMSE: {rmse:.4f}")
                results.append((latent_dim, learning_rate, reg_param, rmse))  # Store results

                # Update best parameters if current RMSE is the lowest
                if rmse < best_rmse:
                    best_rmse = rmse
                    best_params = (latent_dim, learning_rate, reg_param)

    # Print the best RMSE and corresponding hyperparameters
    print(f"Best RMSE: {best_rmse}")
    print(f"Best Parameters: Latent Dim: {best_params[0]}, Learning Rate: {best_params[1]}, Regularization: {best_params[2]}")
    return best_params, results  # Return the best parameters and all results
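
The validation loop in `grid_search` predicts one row at a time with `iterrows`, which can be slow on large rating tables. As a possible alternative, the sketch below computes the same prediction formula for a whole batch of index pairs with NumPy. The helper name `predict_batch` and the convention of using `-1` for an unknown user or book are assumptions made for this example, and the parameter values are hypothetical.

```python
import numpy as np

def predict_batch(user_idx, book_idx, mu, b_u, b_i, P, Q):
    """Vectorized version of the per-row prediction; index -1 marks an unknown user or book."""
    user_idx = np.asarray(user_idx)
    book_idx = np.asarray(book_idx)
    known = (user_idx >= 0) & (book_idx >= 0)

    preds = np.full(len(user_idx), mu, dtype=float)  # Fallback: global mean
    u, i = user_idx[known], book_idx[known]
    # Row-wise dot product between matching user and book factor vectors
    preds[known] = mu + b_u[u] + b_i[i] + np.sum(P[u] * Q[i], axis=1)
    return preds

# Hypothetical parameters for 2 users and 2 books with 2 latent factors
mu = 3.5
b_u = np.array([0.5, -0.5])
b_i = np.array([0.2, -0.2])
P = np.array([[1.0, 0.0], [0.0, 1.0]])
Q = np.array([[0.3, 0.1], [0.2, 0.4]])

# Pairs: (user 0, book 0), (user 1, book 1), (unknown user, book 0)
preds = predict_batch([0, 1, -1], [0, 1, 0], mu, b_u, b_i, P, Q)
print(np.round(preds, 2))  # approximately 4.5, 3.2, and the fallback 3.5
```

With the notebook's mappings, the index arrays could be obtained with something like `val_set['user_id'].map(user_to_index).fillna(-1).astype(int)`, since `Series.map` with a dictionary yields `NaN` for IDs that are not in the mapping.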

Using the best hyperparameters obtained from the grid search, we train the final model on the entire training dataset. The following function initializes the latent factors (user and book embeddings) and biases, then updates them with stochastic gradient descent over multiple epochs to minimize the regularized prediction error. Once training is complete, the function generates predictions for the test set: for each user-book pair, it computes a predicted rating from the trained parameters. If a user or book does not appear in the training data, the function falls back to the global mean rating. Finally, the predictions are saved to `svd_predictions.csv`.

In [None]:
# Function for training the model using the best parameters and predicting ratings on the test set
def train_and_predict(train_df, test_df, best_params, n_epochs=20):
    # Unpack the best parameters obtained from grid search
    latent_dim, learning_rate, reg_param = best_params
    print(f"Training final model with best params: Latent Dim: {latent_dim}, Learning Rate: {learning_rate}, Regularization: {reg_param}")

    # Map user and book IDs to numerical indices
    user_to_index = {user_id: idx for idx, user_id in enumerate(train_df['user_id'].unique())}
    book_to_index = {book_id: idx for idx, book_id in enumerate(train_df['book_id'].unique())}

    # Number of unique users and books
    n_users = len(user_to_index)
    n_items = len(book_to_index)

    # Initialize latent factor matrices for users (P) and books (Q) and biases
    P = np.random.normal(scale=0.01, size=(n_users, latent_dim))  # User latent factors
    Q = np.random.normal(scale=0.01, size=(n_items, latent_dim))  # Book latent factors
    mu = train_df['rating'].mean()  # Global mean rating
    b_u = np.zeros(n_users)  # User biases
    b_i = np.zeros(n_items)  # Book biases

    # Training loop: iterate over epochs to optimize the model
    for epoch in range(n_epochs):
        total_loss = 0  # Track the cumulative loss for the epoch
        for _, row in tqdm(train_df.iterrows(), total=len(train_df), desc=f"Epoch {epoch+1}/{n_epochs}"):
            user_idx = user_to_index.get(row['user_id'])  # Map user ID to index
            book_idx = book_to_index.get(row['book_id'])  # Map book ID to index
            rating = row['rating']  # Actual rating

            # Skip if user or book is not in the training data
            if user_idx is None or book_idx is None:
                continue

            # Predict the rating using the model
            pred_rating = mu + b_u[user_idx] + b_i[book_idx] + np.dot(P[user_idx], Q[book_idx])

            # Compute the error between actual and predicted ratings
            error = rating - pred_rating

            # Accumulate the squared error for loss tracking
            total_loss += error**2

            # Update biases and latent factors using gradient descent
            b_u[user_idx] += learning_rate * (error - reg_param * b_u[user_idx])  # Update user bias
            b_i[book_idx] += learning_rate * (error - reg_param * b_i[book_idx])  # Update book bias
            p_old = P[user_idx].copy()  # Keep pre-update user factors for the book update
            P[user_idx] += learning_rate * (error * Q[book_idx] - reg_param * P[user_idx])  # Update user latent factors
            Q[book_idx] += learning_rate * (error * p_old - reg_param * Q[book_idx])  # Update book latent factors

        # Print the cumulative loss after each epoch
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {total_loss:.4f}")

    # Prediction: Generate predictions for the test set
    predictions = []
    for _, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Predicting"):
        user_idx = user_to_index.get(row['user_id'])  # Map user ID to index
        book_idx = book_to_index.get(row['book_id'])  # Map book ID to index

        # Predict the rating for user-book pair or fallback to the global mean
        if user_idx is not None and book_idx is not None:
            pred_rating = mu + b_u[user_idx] + b_i[book_idx] + np.dot(P[user_idx], Q[book_idx])
        else:
            pred_rating = mu  # Fallback to global mean rating if user or book is unknown

        # Append the prediction to the results
        predictions.append({'id': row['id'], 'rating': pred_rating})

    # Save predictions to a CSV file
    predictions_df = pd.DataFrame(predictions)
    predictions_df.to_csv("svd_predictions.csv", index=False)  # Save predictions for submission or evaluation
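
To see the training loop in isolation, here is a minimal sketch of the same update rule on a tiny, hypothetical set of ratings (three users, three books, six made-up ratings). The hyperparameter values are chosen only so that the toy example converges quickly and are not recommendations for the real data. The accumulated squared error per epoch should shrink as the biases and latent factors fit the ratings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (user index, book index, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, latent_dim = 3, 3, 2
learning_rate, reg_param, n_epochs = 0.05, 0.02, 100

# Same initialization scheme as in the notebook: small random factors, zero biases
P = rng.normal(scale=0.01, size=(n_users, latent_dim))
Q = rng.normal(scale=0.01, size=(n_items, latent_dim))
b_u = np.zeros(n_users)
b_i = np.zeros(n_items)
mu = np.mean([r for _, _, r in ratings])

epoch_losses = []
for epoch in range(n_epochs):
    total_loss = 0.0
    for u, i, r in ratings:
        error = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
        total_loss += error ** 2

        b_u[u] += learning_rate * (error - reg_param * b_u[u])
        b_i[i] += learning_rate * (error - reg_param * b_i[i])
        p_old = P[u].copy()  # Keep pre-update user factors for the book update
        P[u] += learning_rate * (error * Q[i] - reg_param * P[u])
        Q[i] += learning_rate * (error * p_old - reg_param * Q[i])
    epoch_losses.append(total_loss)

print(f"first epoch loss: {epoch_losses[0]:.4f}, last epoch loss: {epoch_losses[-1]:.4f}")
```

Note that the loss printed here, like the one in `train_and_predict`, is the sum of squared errors accumulated while the parameters are still changing within the epoch, so it is a training signal rather than a held-out estimate.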

Here is the search space for the hyperparameters that will be tested during the grid search. The key hyperparameters include:
- `latent_dims`: The number of latent dimensions for user and item embeddings, which control the complexity of the model.
- `learning_rates`: The step sizes for updating the model parameters during training.
- `reg_params`: Regularization parameters to prevent overfitting by penalizing large values in the model's parameters.

Using the grid search function, we evaluate every combination of these hyperparameters (3 × 3 × 4 = 36 in total) by training on the training subset and scoring on the validation subset. The goal is to identify the combination that minimizes the validation error, measured by RMSE.

The best parameters are then stored and passed to the final training step.

In [None]:
# Define hyperparameters to test
latent_dims = [25, 50, 75]
learning_rates = [0.001, 0.005, 0.01]
reg_params = [0.01, 0.05, 0.1, 0.02]

# Perform grid search
best_params, grid_results = grid_search(train_set, val_set, latent_dims, learning_rates, reg_params, n_epochs=20)
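
The size of the search space and the format of the returned results can be checked without running any training. The sketch below enumerates the same grid with `itertools.product`, which yields the combinations in the same order as the three nested loops in `grid_search`, and then picks the best entry from a results list. The `example_results` values are made up for illustration and are not output from this notebook.

```python
from itertools import product

latent_dims = [25, 50, 75]
learning_rates = [0.001, 0.005, 0.01]
reg_params = [0.01, 0.05, 0.1, 0.02]

# Every (latent_dim, learning_rate, reg_param) triple, in nested-loop order
grid = list(product(latent_dims, learning_rates, reg_params))
print(len(grid))   # 36 = 3 * 3 * 4
print(grid[0])     # (25, 0.001, 0.01)

# Hypothetical results in the (latent_dim, learning_rate, reg_param, rmse) format
# returned by grid_search; the RMSE values are made up for illustration
example_results = [(25, 0.001, 0.01, 0.95), (50, 0.005, 0.05, 0.88), (75, 0.01, 0.1, 0.91)]
best = min(example_results, key=lambda row: row[3])
print(best[:3])    # (50, 0.005, 0.05)
```

Since each combination trains for `n_epochs` passes over the training subset, the total cost of the search is roughly 36 times that of a single training run.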

Finally, we train the model using the best hyperparameters identified during the grid search, in order to predict ratings for the test set.

In [None]:
# Train final model and predict
train_and_predict(train_df, test_df, best_params, n_epochs=20)
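
The model's raw output is not bounded, so some predictions may fall outside the range of valid ratings. One possible post-processing step is to clip them to the rating scale. The sketch below assumes a 1 to 5 scale, which the notebook does not state; the actual bounds should be read from the data, for example with `train_df['rating'].min()` and `train_df['rating'].max()`. The raw predictions shown are hypothetical.

```python
import numpy as np

# Assumed rating bounds; replace with train_df['rating'].min() / .max() for the real data
MIN_RATING, MAX_RATING = 1.0, 5.0

raw_predictions = np.array([0.4, 2.7, 5.3, 4.1])  # Hypothetical raw model outputs
clipped = np.clip(raw_predictions, MIN_RATING, MAX_RATING)
print(clipped.tolist())  # [1.0, 2.7, 5.0, 4.1]
```

Clipping can only reduce the squared error for any true rating inside the bounds, so it should not hurt RMSE if the assumed scale is correct.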