# Standard Variational Autoencoder (SVAE)

A Standard Variational Autoencoder (SVAE) is a generative neural network that learns to compress data into a compact representation and then reconstruct it. It operates in two main phases:

1. Encoding: The SVAE takes complex input data, such as Goodreads book reviews, and compresses it into a simplified, lower-dimensional representation. This process involves distilling the essential features and patterns from the data.
2. Decoding: It then reconstructs the input data from this compact form. The reconstruction aims to be as close as possible to the original, retaining the critical elements of the input.


The distinguishing feature of SVAEs is their 'variational' approach. Instead of representing an input as a single fixed point, they encode it as a distribution over possible representations. This probabilistic encoding captures the uncertainty and variation in the data, and it regularizes the compact representation so that similar inputs map to nearby regions.
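
Encoding an input as a distribution rather than a point can be sketched in a few lines of NumPy. This is a minimal illustration, not the `SVAE.StandardVAE` implementation: the function names `sample_latent` and `kl_to_standard_normal` are hypothetical, and the sketch assumes the common choice of a diagonal Gaussian encoder output with a standard normal prior.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) using the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)       # noise that does not depend on the encoder
    return mu + np.exp(0.5 * log_var) * eps   # shift and scale the noise

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), one value per row."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(123)
mu = np.zeros((2, 4))        # hypothetical encoder means: 2 users, 4 latent dimensions
log_var = np.zeros((2, 4))   # a log-variance of 0 means unit variance

z = sample_latent(mu, log_var, rng)
print(z.shape)                              # (2, 4)
print(kl_to_standard_normal(mu, log_var))   # zero when the encoding equals the prior
```

In a VAE, the KL term is added to the reconstruction loss, which is what keeps the encoded distributions from collapsing to isolated points.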

For a recommendation system using Goodreads book reviews, an SVAE can effectively learn the complex preferences and nuances in the reviews, facilitating accurate and personalized book suggestions.

In [1]:
import sys
import os
# Extending the system path to include the parent directory for module imports
sys.path.append(os.path.dirname(os.getcwd()))

import numpy as np
import pandas as pd

# Local utility functions and model imports
from src.utils import preprocess, metrics, build_features
from src.models import SVAE

# TensorFlow specific setup for disabling eager execution
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

# Suppressing warnings for a cleaner notebook presentation
import warnings
warnings.simplefilter("ignore")

## Prepare data

In [2]:
# Loading ratings data
rating_file = os.path.join('..', 'src', 'data', 'goodreads_2m', 'ratings.csv')
ratings = pd.read_csv(rating_file)

# Displaying the shape of the dataset and a random sample of 5 entries
print(f'Shape: {ratings.shape}')
ratings.sample(5, random_state=123)

Shape: (91226, 3)


| | user_id | book_name | rating |
|---|---|---|---|
| 74505 | 2540 | A Game of Thrones (A Song of Ice and Fire, #1) | 4 |
| 60643 | 5886 | The Amazing Adventures of Kavalier & Clay | 4 |
| 87603 | 4411 | The World to Come | 5 |
| 81524 | 4934 | Harry Potter and the Philosopher's Stone (Harr... | 5 |
| 60556 | 5791 | Bloodsucking Fiends (A Love Story, #1) | 3 |


In [3]:
# Binarizing the dataset - retaining only ratings of 4 or higher
df_preferred = ratings[ratings['rating'] >= 4]
# Optionally handling lower ratings, if needed in future analysis
df_low_rating = ratings[ratings['rating'] < 4]

print(df_preferred.shape)
df_preferred.head(5)

(61252, 3)


| | user_id | book_name | rating |
|---|---|---|---|
| 0 | 1 | The Restaurant at the End of the Universe (Hit... | 5 |
| 1 | 1 | Ready Player One (Ready Player One, #1) | 4 |
| 3 | 1 | Bad Blood: Secrets and Lies in a Silicon Valle... | 4 |
| 4 | 1 | A Short History of Nearly Everything | 5 |
| 5 | 1 | The Collapsing Empire (The Interdependency, #1) | 4 |


In [4]:
# Keeping only users with at least 5 positive reviews
# (this drops a handful of rows: compare the shape below with the one above)
df_filtered_users = df_preferred.groupby('user_id').filter(lambda x: len(x) >= 5)

# Keeping books reviewed by at least one user
# (a no-op with a threshold of 1; raise the threshold to drop rarely reviewed books)
df_final = df_filtered_users.groupby('book_name').filter(lambda x: len(x) >= 1)

# Displaying the shape of the final filtered dataset
print(df_final.shape)

(61245, 3)


In [5]:
# Count the reviews per user and per book
usercount = df_final.groupby('user_id', as_index=False).size().rename(columns={'size': 'user_count'})
itemcount = df_final.groupby('book_name', as_index=False).size().rename(columns={'size': 'book_count'})

# Compute the fraction of filled user-book pairs after filtering (reported as "sparsity")
total_ratings = df_final.shape[0]  # count reviews in the filtered set, not in the raw file
num_users = usercount.shape[0]
num_books = itemcount.shape[0]
sparsity = 1. * total_ratings / (num_users * num_books)

print(f"After filtering, there are {total_ratings} book reviews from {num_users} users and {num_books} books (sparsity: {sparsity * 100:.3f}%)")

After filtering, there are 61245 book reviews from 1368 users and 2718 books (sparsity: 1.647%)


## Splitting

In [6]:
# Shuffle the unique users; the next cell splits them into train, validation, and test sets
unique_users = sorted(df_final.user_id.unique())
np.random.seed(123)
unique_users = np.random.permutation(unique_users)

In [7]:
HELDOUT_USERS = 200  # Number of users to hold out for validation and test sets

# Splitting the unique users
n_users = len(unique_users)
train_users = unique_users[:(n_users - HELDOUT_USERS * 2)]
val_users = unique_users[(n_users - HELDOUT_USERS * 2):(n_users - HELDOUT_USERS)]
test_users = unique_users[(n_users - HELDOUT_USERS):]

print("Number of unique users:", n_users)
print("\nNumber of training users:", len(train_users))
print("\nNumber of validation users:", len(val_users))
print("\nNumber of test users:", len(test_users))

Number of unique users: 1368

Number of training users: 968

Number of validation users: 200

Number of test users: 200


In [8]:
# Filtering the dataset for training, validation, and test sets based on the user splits
train_set = df_final[df_final['user_id'].isin(train_users)]
val_set = df_final[df_final['user_id'].isin(val_users)]
test_set = df_final[df_final['user_id'].isin(test_users)]

print("Number of training observations:", train_set.shape[0])
print("\nNumber of validation observations:", val_set.shape[0])
print("\nNumber of test observations:", test_set.shape[0])

Number of training observations: 44171

Number of validation observations: 8374

Number of test observations: 8700


In [9]:
# Identifying unique books in the training set
unique_train_items = pd.unique(train_set['book_name'])
print(f"Number of unique books rated in training set: {unique_train_items.size}")

Number of unique books rated in training set: 2712


In [10]:
# Filtering validation and test sets to include only books from the training set
val_set = val_set[val_set['book_name'].isin(unique_train_items)]
print(f"Number of validation observations after filtering: {val_set.shape[0]}")

test_set = test_set[test_set['book_name'].isin(unique_train_items)]
print(f"\nNumber of test observations after filtering: {test_set.shape[0]}")

Number of validation observations after filtering: 8364

Number of test observations after filtering: 8694


In [11]:
# Generating affinity matrices for train, validation, and test sets
am_train = build_features.AffinityMatrix(df=train_set, items_list=unique_train_items)
am_val = build_features.AffinityMatrix(df=val_set, items_list=unique_train_items)
am_test = build_features.AffinityMatrix(df=test_set, items_list=unique_train_items)

In [12]:
# Obtaining the sparse matrices
train_data, _, _ = am_train.gen_affinity_matrix()
print(train_data.shape)

val_data, val_map_users, val_map_items = am_val.gen_affinity_matrix()
print(val_data.shape)

test_data, test_map_users, test_map_items = am_test.gen_affinity_matrix()
print(test_data.shape)

(968, 2712)
(200, 2712)
(200, 2712)


In [13]:
# Stratified splitting for validation and test sets
val_data_tr, val_data_te = preprocess.numpy_stratified_split(val_data, ratio=0.75, seed=123)
test_data_tr, test_data_te = preprocess.numpy_stratified_split(test_data, ratio=0.75, seed=123)
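
The intent of the split above can be sketched as follows. This is an assumption about what `preprocess.numpy_stratified_split` does, since its source is not shown here: for each held-out user, roughly 75% of the observed ratings stay in the "tr" matrix (the model's input) and the rest move to the "te" matrix (used for scoring). The helper name `split_user_rows` and the toy matrix are hypothetical.

```python
import numpy as np

def split_user_rows(matrix, ratio=0.75, seed=123):
    """Split each row's non-zero entries into two matrices of the same shape."""
    rng = np.random.default_rng(seed)
    part_tr = np.zeros_like(matrix)
    part_te = np.zeros_like(matrix)
    for u in range(matrix.shape[0]):
        rated = np.flatnonzero(matrix[u])        # items this user rated
        rng.shuffle(rated)                       # random order, reproducible via the seed
        n_tr = int(round(ratio * rated.size))    # number of items kept as model input
        part_tr[u, rated[:n_tr]] = matrix[u, rated[:n_tr]]
        part_te[u, rated[n_tr:]] = matrix[u, rated[n_tr:]]
    return part_tr, part_te

# Toy affinity matrix: 2 users, 5 books, 0 means "not rated"
toy = np.array([[5, 4, 0, 4, 5],
                [0, 4, 4, 5, 4]])
toy_tr, toy_te = split_user_rows(toy)
print(toy_tr)
print(toy_te)
```

Every rating lands in exactly one of the two parts, so the model never sees the entries it is scored on.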

In [14]:
# Binarizing data based on a threshold
threshold = 3.5
train_data = np.where(train_data > threshold, 1.0, 0.0)
val_data_tr = np.where(val_data_tr > threshold, 1.0, 0.0)
val_data_te_ratings = val_data_te.copy()
val_data_te = np.where(val_data_te > threshold, 1.0, 0.0)
test_data_tr = np.where(test_data_tr > threshold, 1.0, 0.0)
test_data_te_ratings = test_data_te.copy()
test_data_te = np.where(test_data_te > threshold, 1.0, 0.0)

In [15]:
# Restore the original low ratings (below 4) in the held-out matrices
test_data_te_ratings = pd.DataFrame(test_data_te_ratings)
val_data_te_ratings = pd.DataFrame(val_data_te_ratings)

for _, row in df_low_rating.iterrows():
    user_old = row['user_id']    # original user ID
    item_old = row['book_name']  # original book name

    # Map original IDs to matrix indices; None means the user or book is not in the test split
    user_new = test_map_users.get(user_old)
    item_new = test_map_items.get(item_old)
    if user_new is not None and item_new is not None:
        test_data_te_ratings.at[user_new, item_new] = row['rating']

    # Same lookup for the validation split
    user_new = val_map_users.get(user_old)
    item_new = val_map_items.get(item_old)
    if user_new is not None and item_new is not None:
        val_data_te_ratings.at[user_new, item_new] = row['rating']

val_data_te_ratings = val_data_te_ratings.to_numpy()
test_data_te_ratings = test_data_te_ratings.to_numpy()

## SVAE

In [16]:
# Model configuration parameters
INTERMEDIATE_DIM = 200
LATENT_DIM = 64
EPOCHS = 400
BATCH_SIZE = 100

# Initialize the SVAE model with specified parameters
model = SVAE.StandardVAE(n_users=train_data.shape[0],  # Number of unique users
                         original_dim=train_data.shape[1],  # Number of unique items
                         intermediate_dim=INTERMEDIATE_DIM,
                         latent_dim=LATENT_DIM,
                         n_epochs=EPOCHS,
                         batch_size=BATCH_SIZE,
                         k=10,  # Number of items to recommend
                         verbose=0,
                         seed=123,  # Seed for reproducibility
                         drop_encoder=0.5,  # Dropout rate for encoder
                         drop_decoder=0.5,  # Dropout rate for decoder
                         annealing=False,  # Whether to use annealing
                         beta=1.0)  # Beta parameter for VAE

In [17]:
%%time
# Fitting the model (the %%time cell magic must be the first line of the cell)
model.fit(x_train=train_data,
          x_valid=val_data,
          x_val_tr=val_data_tr,
          x_val_te=val_data_te_ratings,  # Validation data with original ratings
          mapper=am_val)  # AffinityMatrix instance for validation set



## Recommend

In [18]:
# Generating recommendations for the test set
top_k = model.recommend_k_items(x=test_data_tr, k=10, remove_seen=True)

# Mapping the sparse matrix back to a DataFrame for further analysis
recommendations = am_test.map_back_sparse(top_k, kind='prediction')
test_df = am_test.map_back_sparse(test_data_te_ratings, kind='ratings')

## Evaluation Metrics

The model is evaluated on the 200 held-out test users, none of whom appear in the training set. For each test user, 75% of their reviews are given to the model as input and the remaining 25% are hidden from it. A good model will recommend books that appear among a user's hidden reviews.

---


### Precision@k

Of the books that are recommended, what proportion is relevant? Here, a book is relevant if the user has reviewed it.

A precision@10 of about 0.1 means that about 10% of the recommendations are relevant to the user. In other words, of the 10 recommendations made, on average 1 is actually relevant.
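
As a minimal sketch for a single user, precision@k can be computed as below. The function name `toy_precision_at_k` and the book labels are hypothetical, and the sketch assumes the score is the hit count divided by k; `metrics.precision_at_k` may differ in details such as how it handles users with fewer than k recommendations.

```python
def toy_precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

recommended = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]  # ranked list
relevant = {"C", "Z"}                                             # hidden reviews
print(toy_precision_at_k(recommended, relevant, k=10))  # 0.1
```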

### Recall@k

Of all the relevant books (in the test set), what proportion is recommended?

A recall@10 of 0.1 means that about 10% of the relevant books were recommended. Recall@k is limited by k: even if every recommendation is relevant, a user with more than k relevant books cannot reach a recall of 1, because at most k of those books can be recommended. A higher k allows more of the relevant books to be recommended.
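
The cap on recall is easy to see in a small example. This is a sketch for a single user with hypothetical names (`toy_recall_at_k`, `book_0` and so on), not the `metrics.recall_at_k` implementation.

```python
def toy_recall_at_k(recommended, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant)

# A user with 40 relevant books; all 10 recommendations are relevant.
relevant = {f"book_{i}" for i in range(40)}
recommended = [f"book_{i}" for i in range(10)]
print(toy_recall_at_k(recommended, relevant, k=10))  # 0.25
```

Even with perfect recommendations, this user's recall@10 cannot exceed 10/40.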

### Mean Average Precision (MAP)

Computes the average precision for each user, then averages these values over all users. Unlike precision@k, it penalizes rankings that place relevant books low in the list.
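
A minimal sketch of average precision for one user is shown below. Definitions vary in how they normalize; this version assumes division by the number of relevant books, which may not match `metrics.mean_average_precision` exactly. The name `toy_average_precision` and the book labels are hypothetical.

```python
def toy_average_precision(recommended, relevant):
    """Mean of precision@rank over the ranks where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this cut-off
    return precision_sum / len(relevant)

relevant = {"A", "B"}
good_order = ["A", "B", "C", "D"]   # relevant books ranked first
bad_order = ["C", "D", "A", "B"]    # same books, ranked last

print(toy_average_precision(good_order, relevant))  # 1.0
print(toy_average_precision(bad_order, relevant))   # about 0.417

# MAP is the mean of the per-user values.
per_user = [toy_average_precision(good_order, relevant),
            toy_average_precision(bad_order, relevant)]
print(sum(per_user) / len(per_user))
```

Both lists contain the same two relevant books, so precision@4 is identical; only the ordering changes the score.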

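### Normalized Discounted Cumulative Gain (NDCG)

Considers both which relevant books are recommended and where they are ranked: a relevant book near the top of the list counts more than one near the bottom. Each user's score is normalized by the score of an ideal ranking, so it lies between 0 and 1, and the scores are then averaged over all users.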
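
The sketch below computes NDCG@k for one user with binary relevance and the usual logarithmic discount. It is an illustration under those assumptions, not the `metrics.ndcg` implementation; the name `toy_ndcg_at_k` and the book labels are hypothetical.

```python
import math

def toy_ndcg_at_k(recommended, relevant, k=10):
    """DCG of the top-k list divided by the DCG of an ideal ranking."""
    top_k = recommended[:k]
    # Each relevant item contributes 1 / log2(rank + 1), so early hits count more.
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(top_k, start=1)
              if item in relevant)
    # The ideal ranking puts all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

relevant = {"A", "B"}
print(toy_ndcg_at_k(["A", "B", "C"], relevant, k=3))  # 1.0
print(toy_ndcg_at_k(["C", "A", "B"], relevant, k=3))  # about 0.693
```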

---


The SVAE results below are much better than those of SVD, with a Precision@10 of about 20%.

In [19]:
# Evaluating model performance
top_k = recommendations.copy()
top_k['rank'] = top_k.groupby('user_id', sort=False).cumcount() + 1
precision_at_k = metrics.precision_at_k(top_k, test_df, 'user_id', 'book_name', 'rank')
recall_at_k = metrics.recall_at_k(top_k, test_df, 'user_id', 'book_name', 'rank')
mean_average_precision = metrics.mean_average_precision(top_k, test_df, 'user_id', 'book_name', 'rank')
ndcg = metrics.ndcg(top_k, test_df, 'user_id', 'book_name', 'rank')

# Printing performance metrics
print(f'Precision: {precision_at_k:.6f}',
      f'Recall: {recall_at_k:.6f}',
      f'MAP: {mean_average_precision:.6f}',
      f'NDCG: {ndcg:.6f}', sep='\n')

Precision: 0.207000
Recall: 0.071519
MAP: 0.041507
NDCG: 0.216293
