# Using Singular Value Decomposition (SVD) to create a recommendation system

This is a brief introduction to recommender systems based on **collaborative filtering using SVD** and to their implementation in the specific case of goal-based portfolio recommendation.

We want to **recommend goal portfolios, associated with goals (or needs), to each client**. In the standard framework of recommendation systems (whose terminology can be misleading here), the goals are the **"products" (or "items")**, while the customers are the **"users"**. The standard framework also has the concept of **"rating"**: for us, the rating is a measure of satisfaction with the goal or need. For example, it can be a metric associated with the purchase of a goal portfolio (e.g., % weight), or with the purchase of funds that characterize a goal portfolio. This is the concept of **implicit rating, or implied rating**.

The matrix of ratings, of dimensions users x products, is called the **Utility Matrix**.

A few words on SVD.
Just as a number, say 30, can be decomposed into factors, 30 = 2 x 3 x 5, a matrix can also be expressed as a product of other matrices. Because matrices are arrays of numbers, they have their own rules of multiplication: SVD is a linear algebra technique that breaks a matrix down into the product of three simpler matrices - see https://en.wikipedia.org/wiki/Singular_value_decomposition, and https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html.

SVD decomposes the Utility Matrix (of dimension users x products) as follows: $\mathrm{UtilityMatrix} = U \cdot S \cdot V'$, where V' denotes the transpose of V.

Basically, we are looking for the **latent variables, or latent factors**, that hide under the surface of the phenomenon we are analyzing (in this case, the relationship between users and products=investment goals):
- U represents the relationship between users and latent factors
- S describes the strength of each latent factor
- V describes the similarity between products and latent factors.
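
To make the three factors concrete, here is a minimal sketch using `numpy.linalg.svd` on a tiny utility matrix. The 4 x 3 matrix and its values are made up for illustration only. Note that NumPy returns the singular values as a 1-D array and returns V' (the transpose of V) directly.

```python
import numpy as np

# Tiny made-up utility matrix: 4 users (rows) x 3 products (columns)
utility = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 3.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
])

# full_matrices=False returns the compact ("economy") factors
U, s, Vt = np.linalg.svd(utility, full_matrices=False)

print("U shape: ", U.shape)    # users x latent factors
print("s shape: ", s.shape)    # one singular value per latent factor, largest first
print("Vt shape:", Vt.shape)   # latent factors x products

# Multiplying the three factors back together recovers the original matrix
reconstructed = U @ np.diag(s) @ Vt
print("Exact reconstruction:", np.allclose(utility, reconstructed))
```

Keeping only the largest singular values (and the matching columns of U and rows of V') gives a low-rank approximation instead of an exact reconstruction, which is the idea used in the rest of this notebook.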

Let's put it into practice with a toy example.

## Generate some synthetic data

We create a dataset of N=1000 customers and Q=10 products (our goals or needs). For each customer-product pair, we generate an (implicit) rating using a normal distribution with, say, a mean of 5 and a standard deviation of 1. We clip the ratings between 1 and 10 (the range used by Azimut) so that they stay within the valid range.
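
The generation and clipping step described above can also be sketched in vectorized form. This is an illustrative alternative, not the code used in this notebook: it assumes a local `numpy.random.default_rng` generator, so the numbers it draws differ from the ones shown in the outputs further down.

```python
import numpy as np

N, Q = 1000, 10                    # customers, products (same sizes as in the text)
rng = np.random.default_rng(42)    # assumption: local generator, not np.random.seed

# Draw all N x Q ratings at once from Normal(mean=5, std=1)
raw_ratings = rng.normal(loc=5.0, scale=1.0, size=(N, Q))

# Clip to the allowed rating range [1, 10]
ratings = np.clip(raw_ratings, 1.0, 10.0)

print("shape:", ratings.shape)
print("min:", round(float(ratings.min()), 3), "max:", round(float(ratings.max()), 3))
print("mean:", round(float(ratings.mean()), 3), "std:", round(float(ratings.std()), 3))
```

With a mean of 5 and a standard deviation of 1, the bounds 1 and 10 lie 4 and 5 standard deviations away from the mean, so the clipping rarely changes a value here.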

Then we convert the generated data into a Pandas DataFrame with columns "user_id", "product", and "rating".

The next step is to create a user-item matrix, where rows represent users, columns represent products, and the values are the ratings. We fill missing values with 0, as we assume no rating is available for those user-product pairs. In this toy dataset every pair has a rating, so nothing is actually filled.
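
As a small illustration of the pivot step, the sketch below builds a user-item matrix from a made-up long-format table in which one user-product pair is missing. The product labels "A" and "B" and all the values are hypothetical.

```python
import pandas as pd

# Long-format toy data: user 2 has no rating for product "B"
toy_df = pd.DataFrame(
    {
        "user_id": [0, 0, 1, 1, 2],
        "product": ["A", "B", "A", "B", "A"],
        "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
    }
)

# Rows = users, columns = products, values = ratings; missing pairs become 0
toy_matrix = toy_df.pivot_table(
    index="user_id", columns="product", values="rating", fill_value=0
)
print(toy_matrix)
```

Keep in mind that 0 lies below the minimum valid rating of 1: it works as a "no rating" marker, but the SVD treats it as an actual value, which can pull predictions for sparse users downward.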

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate synthetic data
N = 1000  # Number of customers
Q = 10  # Number of products (e.g., goals, or needs)
np.random.seed(42)

ratings_data = [] # implied rating (e.g. a purchase of a goal based portfolio)

for i in range(N):
    for j in range(Q):
        user_id = i
        product_id = j
        rating = np.random.normal(loc=5, scale=1)  # Normal distribution with mean 5 and std 1
        rating = max(1, min(rating, 10))  # Clip rating between 1 and 10
        ratings_data.append([user_id, product_id, rating])

# Convert the data to a DataFrame
ratings_df = pd.DataFrame(ratings_data, columns=["user_id", "product", "rating"])

# Create user-item matrix
user_product_matrix = ratings_df.pivot_table(index="user_id", columns="product", values="rating", fill_value=0)

In [2]:
user_product_matrix

| user_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.496714 | 4.861736 | 5.647689 | 6.523030 | 4.765847 | 4.765863 | 6.579213 | 5.767435 | 4.530526 | 5.542560 |
| 1 | 4.536582 | 4.534270 | 5.241962 | 3.086720 | 3.275082 | 4.437712 | 3.987169 | 5.314247 | 4.091976 | 3.587696 |
| 2 | 6.465649 | 4.774224 | 5.067528 | 3.575252 | 4.455617 | 5.110923 | 3.849006 | 5.375698 | 4.399361 | 4.708306 |
| 3 | 4.398293 | 6.852278 | 4.986503 | 3.942289 | 5.822545 | 3.779156 | 5.208864 | 3.040330 | 3.671814 | 5.196861 |
| 4 | 5.738467 | 5.171368 | 4.884352 | 4.698896 | 3.521478 | 4.280156 | 4.539361 | 6.057122 | 5.343618 | 3.236960 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 5.867805 | 5.227405 | 4.110155 | 4.039220 | 5.254128 | 5.697051 | 5.391881 | 3.965402 | 5.650668 | 5.425911 |
| 996 | 3.929334 | 4.215679 | 5.688496 | 4.765492 | 6.589147 | 5.501129 | 4.513369 | 4.989794 | 5.063383 | 4.271610 |
| 997 | 4.087412 | 5.701390 | 5.845273 | 5.603781 | 6.515318 | 4.458227 | 6.674271 | 4.099079 | 3.987314 | 3.240041 |
| 998 | 4.554205 | 4.496278 | 5.525937 | 5.243891 | 3.807027 | 4.607274 | 4.628538 | 3.224018 | 4.019053 | 4.229186 |

(Rows are user_id values; columns are product ids 0 to 9.)


## Train and validate
First, for the sake of simplicity, we split the user-item matrix into just a training set (80%) and a validation set (20%). The split is by rows, so the validation set contains users the model has not seen during fitting.

Then we train the SVD model, using the TruncatedSVD class from Scikit-Learn to perform matrix factorization on the training set. TruncatedSVD is a matrix factorization method that works well on sparse data (note: in a recommender system, the user-product matrix is often sparse, as users rate/buy/click/etc. only a small subset of the available items). TruncatedSVD helps in finding **latent factors** that can explain the observed ratings and make predictions for the unrated items. It is called "truncated" because it keeps only the top k singular values/vectors, where k is the number of components you set.
In this example, we use TruncatedSVD to **approximate the user-product matrix using the 5 most significant singular vectors (n_components=5)**. This gives us **a lower-dimensional representation of the user-product matrix**, which we then use for making predictions and recommendations.
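
The choice of 5 components is a modelling decision. One rough way to explore it, sketched below on stand-in random data (not the notebook's matrix), is to watch how the reconstruction error changes with the number of components. The data, the seed, and the list of k values are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Stand-in for the user-product matrix: 200 users x 10 products
X = np.clip(rng.normal(loc=5.0, scale=1.0, size=(200, 10)), 1.0, 10.0)

rmse_by_k = {}
for k in (1, 3, 5, 7, 9):
    svd_k = TruncatedSVD(n_components=k, random_state=0)
    svd_k.fit(X)
    # Project onto k latent factors, then map back to product space
    X_hat = svd_k.inverse_transform(svd_k.transform(X))
    rmse_by_k[k] = float(np.sqrt(np.mean((X - X_hat) ** 2)))
    print(f"k={k}: reconstruction RMSE = {rmse_by_k[k]:.3f}")
```

The error shrinks as k grows, since more of the matrix is retained. On purely random ratings like these there is no real latent structure to find, so the extra components mostly fit noise; on real data one would typically look for the point where the improvement levels off.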

Once TruncatedSVD is fitted on the training data, we reconstruct the user-product matrix: we project the train and validation data onto the latent factors with the transform method, and then map them back to the original product space with the inverse_transform method. The reconstructed matrix gives us the predicted ratings for the user-product pairs.
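
To see what this round trip does in terms of matrices, the sketch below checks on stand-in data that transform multiplies by the transposed components and inverse_transform multiplies by the components. The data and seed are illustrative assumptions; `components_` is the fitted attribute holding the top-k right singular vectors (the rows of V').

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=1.0, size=(50, 10))   # stand-in data: 50 users x 10 products

svd_demo = TruncatedSVD(n_components=5, random_state=1).fit(X)
V_k = svd_demo.components_                    # shape (5, 10): latent factors x products

latent = svd_demo.transform(X)                # shape (50, 5): users x latent factors
X_hat = svd_demo.inverse_transform(latent)    # shape (50, 10): back in product space

# transform is a projection onto the latent factors...
print(np.allclose(latent, X @ V_k.T))
# ...and the round trip is a rank-5 approximation of X
print(np.allclose(X_hat, X @ V_k.T @ V_k))
```

This also shows why the same fitted model can be applied to users that were not in the training set: only the product-side factors are learned, and any new row of ratings can be projected onto them.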

Now we can calculate the Root Mean Squared Error (RMSE) between the actual ratings in the validation set and the reconstructed (predicted) ratings. This is a measure of the accuracy of the model on unseen data.
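
For reference, the RMSE is the square root of the mean of the squared element-wise errors. The small sketch below, with made-up numbers, computes it by hand and compares it with the `mean_squared_error` route.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted ratings: 2 users x 3 products
actual = np.array([[5.0, 4.0, 6.0],
                   [3.0, 5.0, 4.0]])
predicted = np.array([[4.5, 4.0, 5.0],
                      [3.5, 5.5, 4.0]])

# RMSE = sqrt(mean of squared element-wise errors)
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

print("manual: ", rmse_manual)
print("sklearn:", rmse_sklearn)
print("equal:  ", np.isclose(rmse_manual, rmse_sklearn))
```

The RMSE is expressed in the same units as the ratings, so it can be read against the 1 to 10 rating scale.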

In [3]:
# Split the data into train and validation sets
train_data, val_data = train_test_split(user_product_matrix, test_size=0.2, random_state=42)

# Train the SVD model
n_components = 5
svd = TruncatedSVD(n_components=n_components, random_state=42)
svd.fit(train_data)

# Transform and reconstruct the user-item matrix
train_data_reconstructed = svd.inverse_transform(svd.transform(train_data))

# Calculate RMSE on the validation set
val_data_reconstructed = svd.inverse_transform(svd.transform(val_data))
rmse = np.sqrt(mean_squared_error(val_data, val_data_reconstructed))
print("Validation RMSE:", rmse)


Validation RMSE: 0.7595377830921124


## Prediction and recommendation
Let's create a new synthetic customer: we generate ratings for the 10 products using the same normal distribution and clipping process as before.

Then we recommend the best products for this new customer.

To do that, we first convert the new customer's ratings into a 1x10 NumPy array and apply the same SVD transformation and reconstruction process we used before, to get the **predicted ratings for the new customer**.

We then **sort the products by their predicted ratings and select the top K = 3 products with the highest ratings**: these are our recommendations.
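
The sorting step can be illustrated in isolation. The helper `top_k_products` below is a hypothetical name introduced for this sketch, and the predicted ratings are made up.

```python
import numpy as np

def top_k_products(predicted_ratings, k=3):
    """Return the indices of the k highest predicted ratings, best first."""
    scores = np.asarray(predicted_ratings, dtype=float)
    order = np.argsort(scores)[::-1]   # argsort is ascending, so reverse it
    return order[:k].tolist()

# Made-up predicted ratings for 5 products (index = product id)
predicted = [4.8, 5.3, 4.1, 5.9, 5.0]
print(top_k_products(predicted, k=3))   # [3, 1, 4]
```

Product 3 has the highest predicted rating (5.9), followed by product 1 (5.3) and product 4 (5.0).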

In [4]:
# New synthetic customer
new_customer_ratings = []

for i in range(Q):
    product_id = i
    rating = np.random.normal(loc=5, scale=1)  # Normal distribution with mean 5 and std 1
    rating = max(1, min(rating, 10))  # Clip rating between 1 and 10
    new_customer_ratings.append(rating)

# Recommend products for the new customer
new_customer_vector = np.array(new_customer_ratings).reshape(1, -1)
new_customer_reconstructed = svd.inverse_transform(svd.transform(new_customer_vector))

# Recommend the top K = 3 products
top_3_products = np.argsort(new_customer_reconstructed[0])[::-1][:3]
print("Top 3 recommended products for the new customer:", top_3_products.tolist())

Top 3 recommended products for the new customer: [6, 4, 8]
