# Explicit Feedback Neural Recommender Systems

This notebook is based on the Deep Learning course from the Master Datascience Paris Saclay. Materials of the course can be found [here](https://github.com/m2dsupsdlclass/lectures-labs). 

**Goals**

* Understand recommender data
* Build different model architectures using Keras
* Retrieve Embeddings and visualize them
* Add some metadata information as input to the models

**Dataset used**

* Anime Recommendations Database from Kaggle [link](https://www.kaggle.com/CooperUnion/anime-recommendations-database).

In [None]:
%%bash
pip install -U keras-tuner

In [None]:
# Load libraries
import umap

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import tensorflow as tf

from collections import deque

from keras_tuner.tuners import RandomSearch

from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

from tensorflow.keras.layers import (Concatenate, Dense, Dot, Dropout, Embedding, Flatten, Softmax)
from tensorflow.keras.models import Model, load_model

## Load and preprocess the data

### Ratings file

After loading the data, each line of the dataframe contains:
 * user_id - non identifiable randomly generated user id.
 * anime_id - the anime that this user has rated.
 * rating - rating out of $10$ this user has assigned ($-1$ if the user watched it but did not assign a rating).

In [None]:
# Load and preprocess rating files
df_raw = pd.read_csv('../input/anime-recommendations-database/rating.csv')

In [None]:
print(f"Shape of the ratings data: {df_raw.shape}.") 

In [None]:
df_raw.head(5)

### Anime metadata file

The anime metadata file contains the following metadata: 
 * anime_id - myanimelist.net's unique id identifying an anime.
 * name - full name of the anime.
 * genre - comma separated list of genres for this anime.
 * type - movie, TV, OVA, etc.
 * episodes - how many episodes in this show ($1$ if it's a movie).
 * rating - average rating out of $10$ for this anime.
 * members - number of community members that are in this anime's group.

In [None]:
# Load metadata file
metadata = pd.read_csv('../input/anime-recommendations-database/anime.csv')

In [None]:
print(f"Shape of the metadata: {metadata.shape}.")

In [None]:
metadata.head(5)

## Merge ratings and metadata

Let's enrich the raw ratings with the collected items metadata by merging the two dataframes on `anime_id`.

In [None]:
ratings = df_raw.merge(metadata.loc[:, ['name', 'anime_id', 'type', 'episodes']], left_on='anime_id', right_on='anime_id')
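`DataFrame.merge` performs an inner join by default, so ratings whose `anime_id` is absent from the metadata are dropped without warning. A minimal sketch on toy frames (the names `toy_ratings`, `toy_metadata` and all values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the ratings and metadata tables (made-up values).
toy_ratings = pd.DataFrame({'user_id': [1, 1, 2],
                            'anime_id': [10, 20, 30],
                            'rating': [8, -1, 7]})
toy_metadata = pd.DataFrame({'anime_id': [10, 20],
                             'name': ['A', 'B']})

# Default how='inner': only anime_id values present in both frames survive,
# so the rating for anime_id 30 disappears from the result.
toy_merged = toy_ratings.merge(toy_metadata, on='anime_id')
print(toy_merged.shape)
```

Comparing the shape before and after the merge is a cheap way to check how many ratings were lost.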

In [None]:
print(f"Shape of the complete data: {ratings.shape}.")

In [None]:
ratings.head(5)

### Data preprocessing

To better understand the distribution of the data, the following statistics are computed:
* the number of users
* the number of items
* the rating distribution
* the popularity of each anime

In [None]:
print(f"Number of unique users: {ratings['user_id'].unique().size}.")

In [None]:
print(f"Number of unique animes: {ratings['anime_id'].unique().size}.")

In [None]:
# Histogram of the ratings
x, height = np.unique(ratings['rating'], return_counts=True)

fig, ax = plt.subplots()
ax.bar(x, height, align='center')
ax.set(xticks=np.arange(-1, 11), xlim=[-1.5, 10.5])
plt.show()

Now, let's compute the popularity of each anime, defined as the number of ratings.

In [None]:
# Count the number of ratings for each anime
popularity = ratings.groupby('anime_id').size().reset_index(name='popularity')
metadata = metadata.merge(popularity, left_on='anime_id', right_on='anime_id')

## Speed up the computation

In order to speed up the computation, we will subset the dataset using the following criteria:
* Remove the $-1$ ratings (people who watched the anime without rating it).
* Keep only TV shows (because I like TV shows).
* Keep only the most popular ones (more than $5000$ ratings).
* Keep only the users whose id is below $10000$.

In [None]:
# Get most popular anime id and TV shows
metadata_5000 = metadata.loc[(metadata['popularity'] > 5000) & (metadata['type'] == 'TV')]
# Remove -1 ratings and keep only the users with an id below 10000
ratings = ratings[(ratings['rating'] > -1) & (ratings['user_id'] < 10000)]

## Clean id

Embedding layers index their rows with integer ids starting at $0$, so we add a new column to `metadata_5000` that maps each anime to a contiguous id.

In [None]:
# Create a dataframe for anime_id
metadata_5000 = metadata_5000.assign(new_anime_id=pd.Series(np.arange(metadata_5000.shape[0])).values)
metadata_5000_indexed = metadata_5000.set_index('new_anime_id')

In [None]:
# Merge the dataframe
ratings = ratings.merge(metadata_5000.loc[:, ['anime_id', 'new_anime_id', 'popularity']], left_on='anime_id', right_on='anime_id')

In [None]:
# Create a dataframe for user_id
user = pd.DataFrame({'user_id': np.unique(ratings['user_id'])})
user = user.assign(new_user_id=pd.Series(np.arange(user.shape[0])).values)

In [None]:
# Merge the dataframe
ratings = ratings.merge(user, left_on='user_id', right_on='user_id')
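The id cleaning above builds the mapping with dataframes and merges. As a rough sketch of the same idea, `np.unique` with `return_inverse=True` maps sparse raw ids to contiguous ones in a single call (the raw ids below are made up):

```python
import numpy as np

# Made-up raw ids with gaps, as found in the original files.
raw_ids = np.array([5114, 20, 5114, 9253, 20])

# unique_ids is sorted; contiguous_ids[i] is the position of raw_ids[i] in it.
unique_ids, contiguous_ids = np.unique(raw_ids, return_inverse=True)

print(contiguous_ids)              # ids in the range [0, len(unique_ids))
print(unique_ids[contiguous_ids])  # recovers the raw ids
```

Keeping `unique_ids` around makes it possible to translate a model index back to the original id.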

In [None]:
ratings.head(5)

In [None]:
print(f'Shape of the rating dataset: {ratings.shape}.')

Later in the analysis, we will assume that this popularity does not come from the ratings themselves but from external metadata, *e.g.* box office numbers in the month after the release in movie theaters.

### Split the dataset into train/test sets

Let's split the enriched data into train and test sets to make it possible to do predictive modeling.

In [None]:
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

user_id_train = np.array(train['new_user_id'])
anime_id_train = np.array(train['new_anime_id'])
ratings_train = np.array(train['rating'])

user_id_test = np.array(test['new_user_id'])
anime_id_test = np.array(test['new_anime_id'])
ratings_test = np.array(test['rating'])

## Explicit feedback: supervised ratings prediction

For each (user, anime) pair, we would like to predict the rating the user would give to the item.

This is the classical setup for building recommender systems from offline data with an explicit supervision signal.

### Predictive ratings as a regression problem

The code below implements the following architecture:

![](https://raw.githubusercontent.com/m2dsupsdlclass/lectures-labs/3cb7df3a75b144b4cb812bc6eacec8e27daa5214/labs/03_neural_recsys/images/rec_archi_1.svg)

In [None]:
# For each sample, we input the integer identifiers
# of a single user and a single item.
class RegressionModel(Model):
    """Define a regression model for item recommendation.
    
    Parameters
    ----------
    embedding_size: integer
        Size of the embedding vectors
    max_user_id: integer
        Largest user id in the dataset
    max_item_id: integer
        Largest item id in the dataset
    
    Attributes
    ----------
    user_embedding: Embedding
        Embedding layer for users
    item_embedding: Embedding
        Embedding layer for items
    flatten: Flatten
        Flatten layer
    dot: Dot
        Dot product layer
    """
    def __init__(self, embedding_size, max_user_id, max_item_id, **kwargs):
        super().__init__(**kwargs)
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                       input_dim=max_user_id + 1,
                                       input_length=1,
                                       name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                       input_dim=max_item_id + 1,
                                       input_length=1,
                                       name='item_embedding')
        self.flatten = Flatten()
        self.dot = Dot(axes=1)
    
    def call(self, inputs, **kwargs):
        """
        Parameters
        ----------
        inputs: list with two elements
            First element corresponds to the users
            Second element corresponds to the items
        """
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        
        # Definition of the user vectors
        user_vecs = self.flatten(self.user_embedding(user_inputs))
        # Definition of the item vectors
        item_vecs = self.flatten(self.item_embedding(item_inputs))
        
        # Compute the dot product of the previous vectors
        y = self.dot([user_vecs, item_vecs])
        return y
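To make the forward pass of this architecture concrete, here is a minimal numpy sketch: an embedding lookup is a row selection in a table, and the prediction is the dot product of the selected rows. The table sizes and random values are assumptions chosen for illustration, not the trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, embedding_size = 4, 3, 2   # toy sizes (assumptions)

# One row per user id and per item id, like the Embedding layers.
user_table = rng.normal(size=(n_users, embedding_size))
item_table = rng.normal(size=(n_items, embedding_size))

user_ids = np.array([0, 2, 3])
item_ids = np.array([1, 1, 0])

# Embedding lookup: select the rows matching the ids.
user_vecs = user_table[user_ids]
item_vecs = item_table[item_ids]

# Row-wise dot product, kept as a column like the output of Dot(axes=1).
preds = np.sum(user_vecs * item_vecs, axis=1, keepdims=True)
print(preds.shape)
```

With this view, training the model amounts to a matrix factorization of the (user, item) rating matrix.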

In [None]:
# Define parameters
EMBEDDING_SIZE = 64
MAX_USER_ID = np.max(user_id_train)
MAX_ITEM_ID = np.max(anime_id_train)

# Define and run the model
model = RegressionModel(EMBEDDING_SIZE, MAX_USER_ID, MAX_ITEM_ID)
model.compile(optimizer='adam', loss='mae')

In [None]:
# Initial prediction
initial_train_preds = model.predict([user_id_train, anime_id_train])

### Model error

Using `initial_train_preds`, compute the model errors:
* mean absolute error
* mean squared error

Converting a pandas Series to a numpy array is usually implicit, but you may use `.values` to do so explicitly (here, `ratings_train` is already a numpy array). Be sure to monitor the shapes of each object you deal with by using `object.shape`.

In [None]:
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

In [None]:
print(f'Mean Absolute Error: {mae(ratings_train, initial_train_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_train, initial_train_preds.squeeze())}.')
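The calls to `.squeeze()` above matter because `model.predict` returns a column of shape `(n, 1)` while the targets have shape `(n,)`. A small sketch with made-up values shows what numpy broadcasting does when the shapes are not aligned:

```python
import numpy as np

y_true = np.array([7.0, 8.0, 9.0])          # shape (3,)
y_pred = np.array([[7.0], [8.0], [9.0]])    # shape (3, 1), like model.predict output

# (3,) minus (3, 1) broadcasts to a (3, 3) matrix of all pairwise differences.
wrong = np.mean(np.abs(y_true - y_pred))

# After squeeze, the difference is element-wise with shape (3,).
right = np.mean(np.abs(y_true - y_pred.squeeze()))

print((y_true - y_pred).shape, wrong, right)
```

The predictions are perfect here, yet the unsqueezed version reports a non-zero error, which is why checking shapes is worth the effort.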

### Monitoring runs

Keras makes it possible to monitor various variables during training.

`model.fit` returns a `History` object whose `history` attribute is a dictionary containing the training loss `'loss'` and the validation loss `'val_loss'` after each epoch.

In [None]:
%%time

BATCH_SIZE = 64
EPOCHS = 10
VALIDATION_SPLIT = 0.1

# Train the model
history = model.fit(x=[user_id_train, anime_id_train], y=ratings_train,
                    batch_size=BATCH_SIZE, epochs=EPOCHS,
                    validation_split=VALIDATION_SPLIT, shuffle=True)

In [None]:
# Plot training and validation losses
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.ylim(0, 2)
plt.legend(loc='best')
plt.show()

In [None]:
# Save the model
#model.save('model', save_format='tf')
#model = load_model('../input/embeddings-model/model')

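The training loss is higher than the validation loss during the first few epochs. The training loss reported by Keras is a running average over the batches of the epoch, computed while the weights are still being updated, whereas the validation loss is computed at the end of the epoch with the final weights. Keras does not recompute the training loss on the full training set at the end of each epoch, which saves computation.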
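The effect of averaging the loss during the epoch can be reproduced on a toy problem. The sketch below is a simplification under stated assumptions: a one-parameter linear model, noise-free made-up data, plain gradient descent, and the end-of-epoch loss measured on the training data itself rather than on a validation split.

```python
import numpy as np

x = np.linspace(0.5, 1.5, 100)
y = 3.0 * x                      # noise-free toy target
w, lr, batch_size = 0.0, 0.1, 10

batch_losses = []
for start in range(0, len(x), batch_size):
    xb, yb = x[start:start + batch_size], y[start:start + batch_size]
    err = w * xb - yb
    batch_losses.append(np.mean(err ** 2))   # loss measured before the update
    w -= lr * np.mean(2 * err * xb)          # gradient step on the batch MSE

reported_train_loss = np.mean(batch_losses)      # average over the epoch
end_of_epoch_loss = np.mean((w * x - y) ** 2)    # loss with the final weights

print(reported_train_loss, end_of_epoch_loss)
```

The average over the epoch is dominated by the early batches, when the model was still poor, so it overestimates the loss of the model obtained at the end of the epoch.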

Now that the model is trained, let's look back at the MSE and MAE.

In [None]:
def plot_prediction(y_true, y_pred):
    """Plot of the prediction.
    :param y_true: Vector of true label
    :param y_pred: Vector of predicted label
    """
    plt.scatter(y_true, y_pred, s=60, alpha=0.01)
    plt.xlim(0.5, 10.5)
    plt.xlabel('True rating')
    plt.ylim(-1, 14)
    plt.ylabel('Predicted rating')

* On the test set

In [None]:
# Perform prediction on the test set
test_preds = model.predict([user_id_test, anime_id_test])

In [None]:
print(f'Mean Absolute Error: {mae(ratings_test, test_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_test, test_preds.squeeze())}.')

In [None]:
# Plot the prediction
plot_prediction(ratings_test, test_preds.squeeze())

* On the train set

In [None]:
# Perform prediction on the train set
train_preds = model.predict([user_id_train, anime_id_train])

In [None]:
print(f'Mean Absolute Error: {mae(ratings_train, train_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_train, train_preds.squeeze())}.')

In [None]:
# Plot the prediction
plot_prediction(ratings_train, train_preds.squeeze())

### Model Embeddings

* It is possible to retrieve the embeddings by simply using the Keras function `model.get_weights`, which returns all of the model's learnable parameters.
* The weights are returned in the same order as they were built in the model.

In [None]:
# Get the weights
weights = model.get_weights()

In [None]:
print(f'The shape of the different weights matrices are: {[w.shape for w in weights]}.')

In [None]:
model.summary()

In [None]:
print(f'There are {sum(w.size for w in weights)} trainable parameters in the model.')

In [None]:
# Retrieve the different embeddings
user_embeddings = weights[0]
item_embeddings = weights[1]

In [None]:
# Get embedding vector for a particular anime_id
ANIME_ID = 8
print(f'Title for ANIME_ID={ANIME_ID}: {metadata_5000["name"].iloc[ANIME_ID]}.')

In [None]:
print(f'Embedding vector for ANIME_ID={ANIME_ID}.')
print(item_embeddings[ANIME_ID])
print(f'Shape of the embedding vector: {item_embeddings[ANIME_ID].shape}.')

### Finding the most similar items

Finding the $k$ most similar items to a point in embedding space:
* Write a function to compute cosine similarity between two animes in embedding space.
* Test it on the following cells to check the similarities between popular animes.
* Try to generalize the function to compute similarities between one anime and all the others and return the most related animes.

Notes:
* We may use `np.linalg.norm` to compute the norm of vectors, and we may specify the `axis`.
* The numpy function `np.argsort(...)` returns the indices that would sort a vector.
* `titles.iloc[idxs]` returns the titles of the items indexed by the array `idxs`.

In [None]:
EPSILON = 1e-07

def cosine(x, y):
    """Compute cosine similarities.
    :param x: Vector
    :param y: Vector
    :return: Cosine similarity
    """
    dot_prod = np.dot(x, y.T)
    norm = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_prod / (norm + EPSILON)

def cosine_similarities(item_id, item_embeddings):
    """Compute cosine similarities between item_id and all items embeddings.
    :param item_id: Item id (integer)
    :param item_embeddings: Matrix of weights of embeddings
    :return: Vector of cosine similarities
    """
    query_vector = item_embeddings[item_id]
    dot_products = item_embeddings @ query_vector
    
    query_vector_norm = np.linalg.norm(query_vector)
    all_item_norms = np.linalg.norm(item_embeddings, axis=1)
    norm_products = query_vector_norm * all_item_norms
    return dot_products / (norm_products + EPSILON)
    
def most_similar(item_id, item_embeddings, titles, top_n=10):
    """Find the `top_n` most similar items to `item_id`.
    :param item_id: Item id (integer)
    :param item_embeddings: Matrix of weights of embeddings
    :param titles: Vector of titles
    :param top_n: Number of anime to return (default=10)
    :return: A list of the most similar items, by decreasing similarity
    """
    similarities = cosine_similarities(item_id, item_embeddings)
    sorted_indexes = np.argsort(similarities)[::-1]
    idxs = sorted_indexes[0:top_n]
    return list(zip(idxs, titles.iloc[idxs], similarities[idxs]))
    
def print_similarity(item_a, item_b, item_embeddings, titles):
    """Print a summary of similarity between 2 items
    :param item_a: First item (integer)
    :param item_b: Second item (integer)
    :param item_embeddings: Matrix of weights of embeddings
    :param titles: Vector of titles
    """
    similarity = cosine(item_embeddings[item_a], item_embeddings[item_b])
    print(f'Cosine similarity between {titles.iloc[item_a]} and {titles.iloc[item_b]}: {similarity:.3}.')
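Before looking at the learned embeddings, it can help to check the idea on vectors whose similarities are known by hand. This sketch uses a made-up 2D embedding matrix and normalizes the rows once, which is an alternative to dividing by the product of norms; it assumes that no row has a zero norm.

```python
import numpy as np

# Made-up embeddings: row 1 points roughly like row 0, row 2 is orthogonal
# to row 0, and row 3 points in the opposite direction.
toy_embeddings = np.array([[1.0, 0.0],
                           [2.0, 1.0],
                           [0.0, 3.0],
                           [-1.0, 0.0]])

# Normalize each row to unit length (assumes no zero-norm row).
norms = np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
unit_rows = toy_embeddings / norms

# Cosine similarity between item 0 and every item is then a plain dot product.
similarity_to_0 = unit_rows @ unit_rows[0]

# Indices sorted by decreasing similarity; the query item itself comes first.
ranking = np.argsort(similarity_to_0)[::-1]
print(similarity_to_0, ranking)
```

Note that the query item always ranks first with a similarity of $1$, so a top-$n$ list of neighbors usually starts at the second position.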

In [None]:
print_similarity(8, 102, item_embeddings, metadata_5000['name'])

In [None]:
print_similarity(8, 14, item_embeddings, metadata_5000['name'])

In [None]:
print_similarity(8, 8, item_embeddings, metadata_5000['name'])

In [None]:
# Histogram of cosine similarities
plt.hist(cosine_similarities(8, item_embeddings), bins=30)
plt.show()

In [None]:
# Find the most similar items to One Punch Man
most_similar(8, item_embeddings, metadata_5000['name'], top_n=10)

The similarities do not always make sense: the number of ratings is low and the embeddings do not automatically capture semantic relationships in that context. Better representations arise with a higher number of ratings, less overfitting, or possibly a better loss function, such as those based on implicit feedback.

### Visualizing embeddings using t-SNE

We can use scikit-learn to visualize item embeddings via [t-SNE](https://lvdmaaten.github.io/tsne/).

In [None]:
def plot_tsne(item_embeddings, perplexities):
    """Plot tSNE representations of embeddings
    :param item_embeddings: Matrix of weights of embeddings
    :param perplexities: Vector of perplexity
    """
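    # One panel per perplexity value instead of a hard-coded number of panels
    fig, subplots = plt.subplots(1, len(perplexities), squeeze=False,
                                 figsize=(4 * len(perplexities), 4))
    for i, perplexity in enumerate(perplexities):
        ax = subplots[0, i]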
        item_tsne = TSNE(perplexity=perplexity).fit_transform(item_embeddings)
        ax.set_title(f"Perplexity = {perplexity}")
        ax.scatter(item_tsne[:, 0], item_tsne[:, 1])
        ax.axis('tight')

In [None]:
plot_tsne(item_embeddings, [5, 30, 50, 100])
plt.show()

In [None]:
# t-SNE visualisation using plotly
item_tsne = TSNE(perplexity=5).fit_transform(item_embeddings)
tsne_df = pd.DataFrame(item_tsne, columns=['tsne_1', 'tsne_2'])
tsne_df = tsne_df.assign(item_id=pd.Series(np.arange(item_tsne.shape[0])).values)
tsne_df = tsne_df.merge(metadata_5000, left_on='item_id', right_on='new_anime_id')

px.scatter(tsne_df, x='tsne_1', y='tsne_2', 
           color='rating', 
           hover_data=['name', 'type', 'rating', 'episodes'])

We can do similar things with [Uniform Manifold Approximation and Projection (UMAP)](https://github.com/lmcinnes/umap).

In [None]:
# Plot UMAP
item_umap = umap.UMAP().fit_transform(item_embeddings)

plt.scatter(item_umap[:, 0], item_umap[:, 1])
plt.show()

## A Deep Recommender Model

Using a framework similar to the previous one, we build the deep model described in the course (with only two fully connected layers).

![](https://raw.githubusercontent.com/m2dsupsdlclass/lectures-labs/3cb7df3a75b144b4cb812bc6eacec8e27daa5214/labs/03_neural_recsys/images/rec_archi_2.svg)

To build this model, we will need a new kind of layer, namely `Concatenate`.

In [None]:
# Define a class for the deep recommender model
class DeepRegressionModel(Model):
    """Define a deep regression model for item recommendation.
    
    Parameters
    ----------
    embedding_size: integer
        Size of the embedding vectors
    max_user_id: integer
        Largest user id in the dataset
    max_item_id: integer
        Largest item id in the dataset
    dropout_size: float
        Probability of dropping a unit (between 0 and 1)
    layer_size: integer
        Size of the first hidden dense layer
        
    Attributes
    ----------
    user_embedding: Embedding
        Embedding layer for users
    item_embedding: Embedding
        Embedding layer for items
    flatten: Flatten
        Flatten layer
    concat: Concatenate
        Concatenate layer
    dropout: Dropout
        Dropout layer
    dense1: Dense
        First dense layer
    dense2: Dense
        Second dense layer
    """
    def __init__(self, embedding_size, max_user_id, max_item_id, dropout_size, layer_size, **kwargs):
        super().__init__(**kwargs)
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                       input_dim=max_user_id + 1,
                                       input_length=1,
                                       name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                       input_dim=max_item_id + 1,
                                       input_length=1,
                                       name='item_embedding')
        self.flatten = Flatten()
        self.concat = Concatenate()
        
        # Too much dropout leads to underfitting
        self.dropout = Dropout(dropout_size)
        
        self.dense1 = Dense(layer_size, activation="relu")
        # We predict a one-dimensional rating.
        # No activation is applied: the output is a real value regressed onto the rating scale.
        self.dense2 = Dense(1)
    
    def call(self, inputs, training=False, **kwargs):
        """
        Parameters
        ----------
        inputs: list with two elements
            First element corresponds to the users
            Second element corresponds to the items
        """
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        
        # Definition of the user vectors
        user_vecs = self.flatten(self.user_embedding(user_inputs))
        # Definition of the item vectors
        item_vecs = self.flatten(self.item_embedding(item_inputs))
        
        # Concatenate user and item vectors (fc1)
        input_vecs = self.concat([user_vecs, item_vecs])
        
        # Build the network
        y = self.dropout(input_vecs, training=training)
        y = self.dense1(y) # fc2
        y = self.dense2(y) # fc3
        return y
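As a shape check for this architecture, the forward pass (without dropout) can be sketched in numpy: concatenate the two embedding vectors, apply a ReLU dense layer, then a linear layer with one output. The sizes and random weights below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, embedding_size, layer_size = 5, 4, 3   # toy sizes (assumptions)

# Stand-ins for the flattened user and item embedding vectors.
user_vecs = rng.normal(size=(batch, embedding_size))
item_vecs = rng.normal(size=(batch, embedding_size))

# Random weights standing in for dense1 and dense2.
w1 = rng.normal(size=(2 * embedding_size, layer_size))
b1 = np.zeros(layer_size)
w2 = rng.normal(size=(layer_size, 1))
b2 = np.zeros(1)

input_vecs = np.concatenate([user_vecs, item_vecs], axis=1)  # (batch, 2 * embedding_size)
hidden = np.maximum(input_vecs @ w1 + b1, 0.0)               # dense layer with ReLU
preds = hidden @ w2 + b2                                     # linear output, one rating per row

print(input_vecs.shape, hidden.shape, preds.shape)
```

Unlike the dot product model, the interaction between user and item is learned by the dense layers, which is what allows extra inputs to be concatenated later.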

In [None]:
# Define parameters
EMBEDDING_SIZE = 64
MAX_USER_ID = np.max(user_id_train)
MAX_ITEM_ID = np.max(anime_id_train)
DROPOUT_SIZE = 0.2
LAYER_SIZE = 64

# Define and run the model
model = DeepRegressionModel(EMBEDDING_SIZE, MAX_USER_ID, MAX_ITEM_ID, DROPOUT_SIZE, LAYER_SIZE)
model.compile(optimizer='adam', loss='mae')

In [None]:
# Initial prediction
initial_train_preds = model.predict([user_id_train, anime_id_train])

In [None]:
%%time

BATCH_SIZE = 64
EPOCHS = 10
VALIDATION_SPLIT = 0.1

# Train the model
history = model.fit(x=[user_id_train, anime_id_train], y=ratings_train,
                    batch_size=BATCH_SIZE, epochs=EPOCHS,
                    validation_split=VALIDATION_SPLIT, shuffle=True)

In [None]:
# Plot training and validation losses
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.ylim(0, 1.25)
plt.legend(loc='best')
plt.show()

In [None]:
# Perform prediction on the train set
train_preds = model.predict([user_id_train, anime_id_train])

In [None]:
print(f'Mean Absolute Error: {mae(ratings_train, train_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_train, train_preds.squeeze())}.')

In [None]:
# Plot the prediction
plot_prediction(ratings_train, train_preds.squeeze())

In [None]:
# Perform prediction on the test set
test_preds = model.predict([user_id_test, anime_id_test])

In [None]:
print(f'Mean Absolute Error: {mae(ratings_test, test_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_test, test_preds.squeeze())}.')

In [None]:
# Plot the prediction
plot_prediction(ratings_test, test_preds.squeeze())

The performance of this model is not necessarily significantly better than that of the previous model, but the gap between the train and test errors is smaller, probably thanks to the use of dropout.
Furthermore, this model is more flexible in the sense that we can extend it to include metadata for a hybrid recommender system, as we will see below.
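To illustrate what dropout does, here is a minimal numpy sketch of inverted dropout, the variant where kept activations are rescaled by $1 / (1 - \text{rate})$ during training so that nothing changes at inference time. The helper `dropout` below is a hypothetical illustration, not the Keras implementation.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of the units during training."""
    if not training:
        return x                              # identity at inference time
    keep_mask = rng.random(x.shape) >= rate   # True for the units that are kept
    return x * keep_mask / (1.0 - rate)       # rescale to preserve the mean

rng = np.random.default_rng(0)
x = np.ones((1000, 64))

train_out = dropout(x, rate=0.2, training=True, rng=rng)
infer_out = dropout(x, rate=0.2, training=False, rng=rng)

print(np.unique(train_out), train_out.mean(), infer_out.mean())
```

This is why the `training` flag is forwarded to the `Dropout` layer in `call`: the random masking should only happen while fitting the model.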

But before that, let's do some hyperparameter tuning. Manual tuning of so many hyperparameters is tedious. In practice, it is better to automate the design of the model using a hyperparameter search tool such as:
* https://keras-team.github.io/keras-tuner/ (Keras specific)
* https://optuna.org/ (any machine learning framework, Keras included)
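The principle behind a random search can be sketched in a few lines of plain Python: sample configurations at random, evaluate each one, and keep the best. The search space values and the function `toy_validation_loss` are hypothetical stand-ins; a real run would train a model and return its validation loss instead.

```python
import random

# Hypothetical search space, chosen for illustration.
search_space = {
    'dropout_size': [0.1, 0.2, 0.3, 0.4, 0.5],
    'layer_size': [32, 64],
}

def toy_validation_loss(config):
    """Stand-in for 'train a model with this config and return val_loss'."""
    return (config['dropout_size'] - 0.3) ** 2 + abs(config['layer_size'] - 64) / 64

rng = random.Random(0)
trials = []
for _ in range(5):
    # Draw one value per hyperparameter, independently and uniformly.
    config = {name: rng.choice(values) for name, values in search_space.items()}
    trials.append((toy_validation_loss(config), config))

best_loss, best_config = min(trials, key=lambda trial: trial[0])
print(best_loss, best_config)
```

Dedicated tools add bookkeeping on top of this loop, such as saving trials to disk and reloading the best model.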

In [None]:
# Build a model to do hyperparameters tuning
EMBEDDING_SIZE = 64
MAX_USER_ID = np.max(user_id_train)
MAX_ITEM_ID = np.max(anime_id_train)

def build_model(hp):
    model = DeepRegressionModel(EMBEDDING_SIZE, MAX_USER_ID, MAX_ITEM_ID, 
                                dropout_size=hp.Float('dropout_size', min_value=0.1, max_value=0.5, step=0.1), 
                                layer_size=hp.Int('layer_size', min_value=32, max_value=64, step=32))

    model.compile(optimizer='adam', loss='mae')
    return model

# Instantiate a tuner
tuner = RandomSearch(build_model, 
                     objective='val_loss', 
                     max_trials=5, executions_per_trial=1, 
                     directory='.', project_name='Embedding')

In [None]:
# Start the search for the best hyperparameter configuration
EPOCHS = 5

tuner.search(x=[user_id_train, anime_id_train], y=ratings_train,
             batch_size=BATCH_SIZE, epochs=EPOCHS,
             validation_split=VALIDATION_SPLIT, shuffle=True)

In [None]:
model_best = tuner.get_best_models(1)[0]

In [None]:
# Define parameters
BATCH_SIZE = 64
EPOCHS = 10
VALIDATION_SPLIT = 0.1

# Train the model
history = model_best.fit(x=[user_id_train, anime_id_train], y=ratings_train,
                         batch_size=BATCH_SIZE, epochs=EPOCHS,
                         validation_split=VALIDATION_SPLIT, shuffle=True)

In [None]:
# Plot training and validation losses
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.ylim(0, 1.25)
plt.legend(loc='best')
plt.show()

In [None]:
# Perform prediction on the test set
test_preds = model_best.predict([user_id_test, anime_id_test])

In [None]:
print(f'Mean Absolute Error: {mae(ratings_test, test_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_test, test_preds.squeeze())}.')

In [None]:
# Plot the prediction
plot_prediction(ratings_test, test_preds.squeeze())

## Using Item Metadata in the Model

Using a framework similar to the previous one, we will build another deep model that can also leverage additional metadata. The resulting system is therefore a **Hybrid Recommender System** that does both **Collaborative Filtering** and **Content-based recommendations**.

![](https://raw.githubusercontent.com/m2dsupsdlclass/lectures-labs/3cb7df3a75b144b4cb812bc6eacec8e27daa5214/labs/03_neural_recsys/images/rec_archi_3.svg)

In [None]:
# Define some metadata
meta_columns = ['episodes', 'popularity']

scaler = QuantileTransformer()
item_meta_train = scaler.fit_transform(train[meta_columns])
item_meta_test = scaler.transform(test[meta_columns])
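The idea behind a quantile transformation is to replace each value by its rank among the training values, which maps heavy-tailed features such as popularity onto $[0, 1]$. The helper `to_uniform_quantiles` below is a hypothetical, simplified sketch based on the empirical distribution function; scikit-learn's `QuantileTransformer` interpolates between quantile landmarks, so its outputs will differ in the details.

```python
import numpy as np

def to_uniform_quantiles(train_col, new_col):
    """Map values to the fraction of training values that are <= each value."""
    sorted_train = np.sort(train_col)
    ranks = np.searchsorted(sorted_train, new_col, side='right')
    return ranks / len(sorted_train)

# Made-up popularity counts with one very large value.
popularity_train = np.array([5200.0, 6000.0, 9000.0, 30000.0])

scaled_train = to_uniform_quantiles(popularity_train, popularity_train)
scaled_new = to_uniform_quantiles(popularity_train, np.array([1e6, 0.0]))

print(scaled_train, scaled_new)
```

As in the cell above, the mapping is fitted on the training values only and then reused for new data, so that the test set does not leak into the preprocessing.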

In [None]:
# Define a class for the Hybrid model
class HybridModel(Model):
    """Define a hybrid regression model for item recommendation.
    
    Parameters
    ----------
    embedding_size: integer
        Size of the embedding vectors
    max_user_id: integer
        Largest user id in the dataset
    max_item_id: integer
        Largest item id in the dataset
        
    Attributes
    ----------
    user_embedding: Embedding
        Embedding layer for users
    item_embedding: Embedding
        Embedding layer for items
    flatten: Flatten
        Flatten layer
    concat: Concatenate
        Concatenate layer
    dropout: Dropout
        Dropout layer
    dense1: Dense
        First dense layer
    dense2: Dense
        Second dense layer
    dense3: Dense
        Output layer (one predicted rating)
    """
    def __init__(self, embedding_size, max_user_id, max_item_id, **kwargs):
        super().__init__(**kwargs)
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                       input_dim=max_user_id + 1,
                                       input_length=1,
                                       name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                       input_dim=max_item_id + 1,
                                       input_length=1,
                                       name='item_embedding')
        self.flatten = Flatten()
        self.concat = Concatenate()
        
        self.dropout = Dropout(0.3)
        self.dense1 = Dense(64, activation="relu")
        self.dense2 = Dense(64, activation="relu")
        self.dense3 = Dense(1)
    
    def call(self, inputs, training=False, **kwargs):
        """
        Parameters
        ----------
        inputs: list with three elements
            First element corresponds to the users
            Second element corresponds to the items
            Third element corresponds to the metadata
        """
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        meta_inputs = inputs[2]
        
        # Definition of the user vectors
        user_vecs = self.flatten(self.user_embedding(user_inputs))
        user_vecs = self.dropout(user_vecs, training=training)
        
        # Definition of the item vectors
        item_vecs = self.flatten(self.item_embedding(item_inputs))
        item_vecs = self.dropout(item_vecs, training=training)
        
        # Concatenate user, item and meta vectors (fc1)
        input_vecs = self.concat([user_vecs, item_vecs, meta_inputs])
        
        # Build the network
        y = self.dense1(input_vecs) # fc2
        y = self.dropout(y, training=training)
        y = self.dense2(y) # fc3
        y = self.dropout(y, training=training)
        y = self.dense3(y)
        return y

In [None]:
# Define parameters
EMBEDDING_SIZE = 64
MAX_USER_ID = np.max(user_id_train)
MAX_ITEM_ID = np.max(anime_id_train)

# Define and run the model
model = HybridModel(EMBEDDING_SIZE, MAX_USER_ID, MAX_ITEM_ID)
model.compile(optimizer='adam', loss='mae')

In [None]:
# Initial prediction
initial_train_preds = model.predict([user_id_train, anime_id_train, item_meta_train])

In [None]:
# Define parameters
BATCH_SIZE = 64
EPOCHS = 10
VALIDATION_SPLIT = 0.1

# Train the model
history = model.fit(x=[user_id_train, anime_id_train, item_meta_train], y=ratings_train,
                         batch_size=BATCH_SIZE, epochs=EPOCHS,
                         validation_split=VALIDATION_SPLIT, shuffle=True)

In [None]:
# Plot training and validation losses
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.ylim(0, 1.5)
plt.legend(loc='best')
plt.show()

In [None]:
# Perform prediction on the test set
test_preds = model.predict([user_id_test, anime_id_test, item_meta_test])

In [None]:
print(f'Mean Absolute Error: {mae(ratings_test, test_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_test, test_preds.squeeze())}.')

In [None]:
# Plot the prediction
plot_prediction(ratings_test, test_preds.squeeze())

The additional metadata seems to slightly improve the predictive power of the model, but the experiment should be re-run several times to account for the impact of the random initialization of the model. The variance of the predictions nevertheless seems to improve.

## A recommendation function for a given user

Once the model is trained, the system can be used to recommend a few items to a user that they have not already seen:
* We use `model.predict` to compute the ratings a user would give to the items.
* We build a recommendation function that excludes the items the user has already seen and sorts the remaining ones by predicted rating.

In [None]:
def recommend(user_id, top_n=10):
    """Recommend anime for a given user id.
    :param user_id: Id of a user (value of `new_user_id`)
    :param top_n: Number of anime to recommend
    :return: A list of (title, predicted rating) by decreasing predicted rating
    """
    # Item ids run from 0 to MAX_ITEM_ID included
    seen_mask = ratings['new_user_id'] == user_id
    seen_animes = set(ratings[seen_mask]['new_anime_id'])
    item_ids = [i for i in range(MAX_ITEM_ID + 1) if i not in seen_animes]
    
    print(f'User {user_id} has seen {len(seen_animes)} animes, including:')
    for title in ratings[seen_mask].nlargest(20, 'popularity')['name']:
        print(f'\t{title}')
    print(f'Computing ratings for {len(item_ids)} other animes:')
    
    item_ids = np.array(item_ids)
    user_ids = np.full_like(item_ids, user_id)
    items_meta = scaler.transform(metadata_5000_indexed[meta_columns].loc[item_ids])
    
    ratings_preds = model.predict([user_ids, item_ids, items_meta])
    
    # argsort returns positions in `item_ids`, so map them back to item ids
    order = np.argsort(ratings_preds[:, 0])[::-1][:top_n]
    return [(metadata_5000_indexed['name'].loc[item_ids[pos]], ratings_preds[pos, 0]) for pos in order]
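One detail of such a function is easy to get wrong: once the seen items are filtered out, `np.argsort` returns positions in the filtered array, not item ids. A small sketch with made-up ids and scores shows the mapping back to item ids:

```python
import numpy as np

n_items = 6
seen = {1, 4}                                       # made-up items already seen

# Candidate items are the unseen ones: ids 0, 2, 3 and 5.
candidate_ids = np.array([i for i in range(n_items) if i not in seen])

# Hypothetical predicted ratings, one per candidate (same order as candidate_ids).
scores = np.array([6.5, 9.1, 7.2, 8.8])

order = np.argsort(scores)[::-1][:2]    # positions of the two best scores
top_ids = candidate_ids[order]          # map positions back to item ids

print(order, top_ids)
```

Using the positions directly as item ids would return items 1 and 3 here, one of which the user has already seen.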

In [None]:
for title, pred_rating in recommend(5):
    print(f'\t{pred_rating}: {title}')

## Predicting ratings a classification problem

In this dataset, the ratings all belong  to a finite set of possible values: $1$ to $10$.

Maybe we can help the model by forcing it to predict those values, treating the problem as a multi-class classification problem. The only required changes are:
* setting the final layer to output class membership probabilities using a softmax activation with $10$ outputs;
* optimizing the categorical cross-entropy classification loss instead of a regression loss such as MSE or MAE.
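The two changes above can be illustrated with plain NumPy. This is a sketch with made-up logits, not the notebook's model: ratings in $1$ to $10$ are shifted to class indices $0$ to $9$, a softmax produces class probabilities, the loss is the mean negative log-probability of the true class, and predictions are shifted back by $+1$.

```python
import numpy as np

ratings = np.array([1, 7, 10])          # observed ratings in 1..10
labels = ratings - 1                    # class indices in 0..9

# Toy logits for 3 samples and 10 classes (made-up numbers).
rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 10))

# Softmax turns logits into class membership probabilities.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Sparse categorical cross-entropy: mean negative log-probability of the true class.
loss = -np.log(probs[np.arange(len(labels)), labels]).mean()

# Decoding: the most likely class index, shifted back to the rating scale.
predicted_ratings = probs.argmax(axis=1) + 1

print(f'loss = {loss:.3f}')
print(f'predicted ratings = {predicted_ratings}')
```

The same shift has to be applied consistently: subtract $1$ from the targets when fitting, and add $1$ to the `argmax` of the predicted probabilities before comparing them with the true ratings.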

In [None]:
# Define a class for the hybrid classification model
class HybridClassificationModel(Model):
    """Define a deep classification model for item recommendation.
    
    Parameters
    ----------
    embedding_size: integer
        Size of the embedding vector
    max_user_id: integer
        Largest user id in the dataset
    max_item_id: integer
        Largest item id in the dataset
        
    Attributes
    ----------
    user_embedding: Embedding
        Embedding layer of user 
    item_embedding: Embedding
        Embedding layer of item
    flatten: Flatten
        Flatten layer
    concat: Concatenate
        Concatenate layer
    dropout: Dropout
        Dropout layer
    dense1: Dense
        First dense layer
    dense2: Dense
        Second dense layer
    dense3: Dense
        Output layer with 10 softmax units, one per rating value
    """
    def __init__(self, embedding_size, max_user_id, max_item_id, **kwargs):
        super().__init__(**kwargs)
        
        self.user_embedding = Embedding(output_dim=embedding_size,
                                       input_dim=max_user_id + 1,
                                       input_length=1,
                                       name='user_embedding')
        self.item_embedding = Embedding(output_dim=embedding_size,
                                       input_dim=max_item_id + 1,
                                       input_length=1,
                                       name='item_embedding')
        self.flatten = Flatten()
        self.concat = Concatenate()
        
        self.dropout = Dropout(0.3)
        self.dense1 = Dense(64, activation="relu")
        self.dense2 = Dense(64, activation="relu")
        self.dense3 = Dense(10, activation='softmax')
    
    def call(self, inputs, training=False, **kwargs):
        """
        Parameters
        ----------
        inputs: list with three elements
            First element corresponds to the users
            Second element corresponds to the items
            Third element corresponds to the metadata
        """
        user_inputs = inputs[0]
        item_inputs = inputs[1]
        meta_inputs = inputs[2]
        
        # Definition of the user vectors
        user_vecs = self.flatten(self.user_embedding(user_inputs))
        user_vecs = self.dropout(user_vecs, training=training)
        
        # Definition of the item vectors
        item_vecs = self.flatten(self.item_embedding(item_inputs))
        item_vecs = self.dropout(item_vecs, training=training)
        
        # Concatenate user, item and meta vectors (fc1)
        input_vecs = self.concat([user_vecs, item_vecs, meta_inputs])
        
        # Build the network
        y = self.dense1(input_vecs) # fc2
        y = self.dropout(y, training=training)
        y = self.dense2(y) # fc3
        y = self.dropout(y, training=training)
        y = self.dense3(y)
        return y

In [None]:
# Define parameters
EMBEDDING_SIZE = 64
MAX_USER_ID = np.max(user_id_train)
MAX_ITEM_ID = np.max(anime_id_train)

# Define and run the model
model = HybridClassificationModel(EMBEDDING_SIZE, MAX_USER_ID, MAX_ITEM_ID)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

In [None]:
# Initial prediction
initial_train_preds = model.predict([user_id_train, anime_id_train, item_meta_train]).argmax(axis=1) + 1

In [None]:
print(f'Mean Absolute Error: {mae(ratings_train, initial_train_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_train, initial_train_preds.squeeze())}.')

In [None]:
# Define parameters
BATCH_SIZE = 64
EPOCHS = 10
VALIDATION_SPLIT = 0.1

# Train the model
history = model.fit(x=[user_id_train, anime_id_train, item_meta_train], y=ratings_train - 1,
                         batch_size=BATCH_SIZE, epochs=EPOCHS,
                         validation_split=VALIDATION_SPLIT, shuffle=True)

In [None]:
# Plot training and validation losses
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.ylim(0, 1.5)
plt.legend(loc='best')
plt.show()

In [None]:
# Perform prediction on the test set (class indices 0..9 are shifted back to ratings 1..10)
test_preds = model.predict([user_id_test, anime_id_test, item_meta_test]).argmax(axis=1) + 1

In [None]:
print(f'Mean Absolute Error: {mae(ratings_test, test_preds.squeeze())}.')

In [None]:
print(f'Mean Squared Error: {mse(ratings_test, test_preds.squeeze())}.')

In [None]:
# Plot the prediction
plot_prediction(ratings_test, test_preds.squeeze())