# **CUHK-STAT3009**: Notebook - Neural Collaborative Filtering

##  Latent factor model (matrix factorization) by `tf.keras`

- Before introduce NCF model for recommender systems, we first develop `LFM` by `tf.keras`
- `LFM` is **NOT** a sequential model, it is difficult to construct `LFM` by `keras.Sequential`
- First define `layers` -> `Keras.Model.call` to connect `input` to `output`
- Illustrate based on [MovieLens-latest-small](https://grouplens.org/datasets/movielens/) dataset

In [1]:
!wget https://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip ml-100k.zip -d ./

--2022-11-09 08:31:01--  https://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2022-11-09 08:31:02 (5.77 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ./ml-100k/
  inflating: ./ml-100k/allbut.pl     
  inflating: ./ml-100k/mku.sh        
  inflating: ./ml-100k/README        
  inflating: ./ml-100k/u.data        
  inflating: ./ml-100k/u.genre       
  inflating: ./ml-100k/u.info        
  inflating: ./ml-100k/u.item        
  inflating: ./ml-100k/u.occupation  
  inflating: ./ml-100k/u.user        
  inflating: ./ml-100k/u1.base       
  inflating: ./ml-100k/u1.test       
  inflating: ./ml-100k/u2.base       
  inflating: ./ml-100k/u2.test       
  inflating: ./ml-100k/u3.ba

In [2]:
import numpy as np
import pandas as pd

## train read_csv
train = pd.read_csv('./ml-100k/u1.base', delimiter='\t',
                    names = ['user_id', 'item_id', 'rating', 'timestamp'],
                    header=None)
## test - read_csv
test = pd.read_csv('./ml-100k/u1.test', delimiter='\t',
                    names = ['user_id', 'item_id', 'rating', 'timestamp'],
                    header=None)

## LFM (MF) of MovieLens dataset based on `tf.keras`

- The code is adapted from [Keras Code Example](https://keras.io/examples/structured_data/collaborative_filtering_movielens/)


### Pre-process the ML-100K raw data

- check the `user_id` and `item_id`: mapping `item_id` to a continuous sequence based on `sklean.preprocessing`
- use `sklearn.model_selection.train_test_split` to generate train and test dataset

In [7]:
## mapping 
from sklearn import preprocessing
le_item = preprocessing.LabelEncoder()
le_item.fit(train['item_id'].append(test['item_id']))

train['item_id'] = le_item.transform(train['item_id'])
test['item_id'] = le_item.transform(test['item_id'])

LabelEncoder()

In [8]:
le_user = preprocessing.LabelEncoder()
le_user.fit(train['user_id'].append(test['user_id']))

train['user_id'] = le_user.transform(train['user_id'])
test['user_id'] = le_user.transform(test['user_id'])

In [9]:
## save real ratings for test set for evaluation.
test_rating = np.array(test['rating'])
## remove the ratings in the test set to simulate prediction
test = test.drop(columns='rating')

In [10]:
train.sample(5).T

Unnamed: 0,33206,67939,79463,32650,43366
user_id,470,829,936,462,561
item_id,432,625,864,1033,118
rating,1,3,3,2,3
timestamp,889827822,891561541,876769530,890530703,879196483


In [11]:
# tran_pair, train_rating
train_pair = train[['user_id', 'item_id']].values
train_rating = train['rating'].values

# test_pair
test_pair = test[['user_id', 'item_id']].values
# get descriptive parameters for the dataset
n_user, n_item = max(train_pair[:,0].max(), test_pair[:,0].max())+1, max(train_pair[:,1].max(), test_pair[:,1].max())+1
print('total number of users: %d; total number of items: %d' %(n_user, n_item))

total number of users: 943; total number of items: 1683


### Define LFM by `tf.keras`
- Define the layers: embedding layers: embed both users and movies in to 50-dimensional vectors.
- Connect from `input` to `output`: LFM computes a match score between user and movie embeddings via a dot product, and adds a per-movie and per-user bias.

Take a close look to [tf.keras.layers.Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)

    tf.keras.layers.Embedding(
        input_dim,
        output_dim,
        embeddings_initializer='uniform',
        embeddings_regularizer=None,
        activity_regularizer=None,
        embeddings_constraint=None,
        mask_zero=False,
        input_length=None,
        **kwargs
    )

- `input_dim`: Integer. Size of the vocabulary, i.e. maximum integer index + 1.

- `output_dim`: Integer. Dimension of the dense embedding.

- `embeddings_initializer`: Initializer for the embeddings matrix (see keras.initializers).

- `embeddings_regularizer`: Regularizer function applied to the embeddings matrix (see keras.regularizers).

- `embeddings_constraint`: Constraint function applied to the embeddings matrix (see keras.constraints).

- `mask_zero`: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

- `input_length`: Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed). 

In [12]:
import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
import matplotlib.pyplot as plt

In [None]:
# class LFactorNet(keras.Model):
#     ## r_{u,i} = p_u @ q_i
#     def __init__(self, num_users, num_movies, embedding_size, **kwargs):
#         super(LFactorNet, self).__init__(**kwargs)
#         self.num_users = num_users
#         self.num_movies = num_movies
#         self.embedding_size = embedding_size
#         self.user_embedding = layers.Embedding(
#             num_users,
#             embedding_size,
#             embeddings_initializer="he_normal",
#             embeddings_regularizer=keras.regularizers.l2(1e-2),
#         )
#         self.movie_embedding = layers.Embedding(
#             num_movies,
#             embedding_size,
#             embeddings_initializer="he_normal",
#             embeddings_regularizer=keras.regularizers.l2(1e-2),
#         )

#     def call(self, inputs):
#         user_vector = self.user_embedding(inputs[:, 0])
#         movie_vector = self.movie_embedding(inputs[:, 1])
#         dot_user_movie = tf.tensordot(user_vector, movie_vector, 2)
#         x = dot_user_movie
#         return x

In [13]:
class LFactorNet(keras.Model):
    ## r_{u,i} = p_u @ q_i + a_u + b_i + mu
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(LFactorNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.user_bias = layers.Embedding(num_users, 1)
        self.glb_bias = tf.Variable(0., trainable=True)
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.movie_bias = layers.Embedding(num_movies, 1)

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        movie_bias = self.movie_bias(inputs[:, 1])
        dot_user_movie = tf.tensordot(user_vector, movie_vector, 2)
        # Add all the components (including bias)
        x = dot_user_movie + user_bias + movie_bias + self.glb_bias
        return x

### Quick **memo**

- `Model`: `LFactorNet`
- `Loss`: MSE 
- `Algo`: SGD, Adam, ... + `callback`
- `Data`: [u,i] -> rating
- `metric`: RMSE, MAE

In [14]:
model = LFactorNet(num_users=n_user, num_movies=n_item, embedding_size=50)

metrics = [
    keras.metrics.MeanAbsoluteError(name='mae'),
    keras.metrics.RootMeanSquaredError(name='rmse')
]

model.compile(
    optimizer=keras.optimizers.SGD(1e-3), 
    loss=tf.keras.losses.MeanSquaredError(), 
    metrics=metrics
)

In [None]:
callbacks = [keras.callbacks.EarlyStopping( 
    monitor='val_rmse', min_delta=0, patience=5, verbose=1, 
    mode='min', baseline=None, restore_best_weights=True)]

history = model.fit(
    x=train_pair,
    y=train_rating,
    batch_size=64,
    epochs=50,
    verbose=1,
    validation_split=.2,
)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50

In [None]:
## make prediction
pred_rating = model.predict(test_pair).flatten()
print(pred_rating)
print('rmse: LFactorNet: %.3f' %np.sqrt(np.mean((pred_rating - test_rating)**2)))

[3.638312  3.3905618 2.786836  ... 3.619333  3.4383523 3.6299665]
rmse: LFactorNet: 0.989


### Define NCF by `tf.keras`
- Recall the figure
- Define the layers: `layers.Embedding` + `layers.concatenate` + `layers.Dense`
- Connect from `input` to `output`...

- `layers.concatnate`

```python
>>> x1 = tf.keras.layers.Dense(8)(np.arange(10).reshape(5, 2))
>>> x2 = tf.keras.layers.Dense(8)(np.arange(10, 20).reshape(5, 2))
>>> concatted = tf.keras.layers.Concatenate()([x1, x2])
>>> concatted.shape
TensorShape([5, 16])
```

In [None]:
from tensorflow.keras.layers import Embedding, Flatten, Input, Dropout, Dense, Concatenate
from IPython.display import SVG

class NCF(keras.Model):
    ## r_{u,i} = net([p_u, q_i])
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(NCF, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.concatenate = layers.Concatenate()
        self.dense1 = layers.Dense(100, name='fc-1', activation='relu')
        self.dense2 = layers.Dense(50, name='fc-2', activation='relu')
        self.dense3 = layers.Dense(1, name='fc-3', activation='relu')

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        concatted_vec = self.concatenate([user_vector, movie_vector])
        fc_1 = self.dense1(concatted_vec)
        fc_2 = self.dense2(fc_1)
        fc_3 = self.dense3(fc_2)
        return fc_3

In [None]:
model = NCF(num_users=n_user, num_movies=n_item, embedding_size=50)

metrics = [
    keras.metrics.MeanAbsoluteError(name='mae'),
    keras.metrics.RootMeanSquaredError(name='rmse')
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-3), 
    loss=tf.keras.losses.MeanSquaredError(), 
    metrics=metrics
)

callbacks = [keras.callbacks.EarlyStopping( 
    monitor='val_rmse', min_delta=0, patience=5, verbose=1, 
    mode='min', baseline=None, restore_best_weights=True)]

history = model.fit(
    x=train_pair,
    y=train_rating,
    batch_size=64,
    epochs=50,
    verbose=1,
    validation_split=.2,
)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [None]:
## make prediction
pred_rating = model.predict(test_pair).flatten()
print(pred_rating)
print('rmse: NCF: %.3f' %np.sqrt(np.mean((pred_rating - test_rating)**2)))

[3.3204513 3.652381  1.8998168 ... 3.9807856 3.5982878 4.2914233]
rmse: NCF: 0.888


## Additive NCF (A-NCF):

- Recall the figure
- Define the layers: layers.Embedding + layers.concatenate + layers.Dense
- Connect from input to output...

In [None]:
class ANCF(keras.Model):
    ## r_{u,i} = net([a_u, b_i]) + p_u @ q_i
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(ANCF, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.fc_user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.fc_movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.concatenate = layers.Concatenate()
        self.last_concatenate = layers.Concatenate()
        self.dense1 = layers.Dense(100, name='fc-1', activation='relu')
        self.dense2 = layers.Dense(50, name='fc-2', activation='relu')
        self.dense3 = layers.Dense(1, name='fc-3', activation='relu')

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        fc_user_vector = self.fc_user_embedding(inputs[:, 0])
        fc_movie_vector = self.fc_movie_embedding(inputs[:, 1])
        
        ## MF
        dot_user_movie = tf.tensordot(user_vector, movie_vector, 2)

        ## fc
        fc_concatted_vec = self.concatenate([fc_user_vector, fc_movie_vector])
        fc_1 = self.dense1(fc_concatted_vec)
        fc_2 = self.dense2(fc_1)
        fc_3 = self.dense3(fc_2)

        ## outcome
        out = fc_3 + dot_user_movie
        return out

In [None]:
model = ANCF(num_users=n_user, num_movies=n_item, embedding_size=50)

metrics = [
    keras.metrics.MeanAbsoluteError(name='mae'),
    keras.metrics.RootMeanSquaredError(name='rmse')
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-3), 
    loss=tf.keras.losses.MeanSquaredError(), 
    metrics=metrics
)

callbacks = [keras.callbacks.EarlyStopping( 
    monitor='val_rmse', min_delta=0, patience=5, verbose=1, 
    mode='min', baseline=None, restore_best_weights=True)]

history = model.fit(
    x=train_pair,
    y=train_rating,
    batch_size=64,
    epochs=50,
    verbose=1,
    validation_split=.2,
)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [None]:
## make prediction
pred_rating = model.predict(test_pair).flatten()
print(pred_rating)
print('rmse: ANCF: %.3f' %np.sqrt(np.mean((pred_rating - test_rating)**2)))

[3.0393822 3.5689142 1.9841704 ... 4.1937647 3.6621404 4.036018 ]
rmse: ANCF: 0.885


## Neural NCF (NeuMF):
- Recall the figure
- Define the layers: `layers.Embedding` + `layers.concatenate` + `layers.Dense`
- Connect from `input` to `output`...

In [None]:
class NeuMF(keras.Model):
    ## r_{u,i} = net([a_u, b_i, p_u * q_i])
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(NeuMF, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.fc_user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.fc_movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.concatenate = layers.Concatenate()
        self.last_concatenate = layers.Concatenate()
        self.dense1 = layers.Dense(100, name='fc-1', activation='relu')
        self.dense2 = layers.Dense(50, name='fc-2', activation='relu')
        self.dense3 = layers.Dense(1, name='fc-3', activation='relu')

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        fc_user_vector = self.fc_user_embedding(inputs[:, 0])
        fc_movie_vector = self.fc_movie_embedding(inputs[:, 1])
        
        ## MF
        dot_user_movie = user_vector * movie_vector

        ## fc
        fc_concatted_vec = self.concatenate([fc_user_vector, fc_movie_vector])
        fc_1 = self.dense1(fc_concatted_vec)
        fc_2 = self.dense2(fc_1)

        ## concat
        neu_vec = self.concatenate([dot_user_movie, fc_concatted_vec])

        ## outcome
        fc_3 = self.dense3(neu_vec)
        return fc_3

In [None]:
model = NeuMF(num_users=n_user, num_movies=n_item, embedding_size=50)

metrics = [
    keras.metrics.MeanAbsoluteError(name='mae'),
    keras.metrics.RootMeanSquaredError(name='rmse')
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-3), 
    loss=tf.keras.losses.MeanSquaredError(), 
    metrics=metrics
)

callbacks = [keras.callbacks.EarlyStopping( 
    monitor='val_rmse', min_delta=0, patience=5, verbose=1, 
    mode='min', baseline=None, restore_best_weights=True)]

history = model.fit(
    x=train_pair,
    y=train_rating,
    batch_size=64,
    epochs=50,
    verbose=1,
    validation_split=.2,
)

Epoch 1/50




Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [None]:
## make prediction
pred_rating = model.predict(test_pair).flatten()
print(pred_rating)
print('rmse: NeuMF: %.3f' %np.sqrt(np.mean((pred_rating - test_rating)**2)))

[2.9664671 3.333763  2.204947  ... 3.8826873 3.4663348 3.7612767]
rmse: NeuMF: 0.887
