# CUHK [STAT3009](https://www.bendai.org/STAT3009/) Notebook7: Neural Collaborative Filtering

### Install ``TensorFlow`` and ``Keras``
- TensorFlow: https://www.tensorflow.org/install
- Keras: Keras comes packaged with TensorFlow 2 as tensorflow.keras (https://keras.io/about/)
- If you use Apple M1: https://naturale0.github.io/2021/01/29/setting-up-m1-mac-for-both-tensorflow-and-pytorch

Credit: The notebook is adapted from https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recommender/neural_collaborative_filtering

## Introduction to Deep learning with Keras
	- Model: input -> layers -> output
	- Loss: find an appropriate loss function for your problem
	- Algo: SGD, Adam, ...
	- Data: Define the model, then feed the data
	- metric: final evaluation or something you care

## Example 1: [Imbalanced classification: credit card fraud detection](https://keras.io/examples/structured_data/imbalanced_classification/)

- Author: fchollet
- Date created: 2019/05/28
- Last modified: 2020/04/17
- Description: Demonstration of how to handle highly imbalanced classification problems.

In [12]:
# https://keras.io/examples/structured_data/imbalanced_classification/

import csv
import numpy as np

# Get the real data from https://www.kaggle.com/mlg-ulb/creditcardfraud/
fname = "/home/ben/dataset/creditcard.csv"

all_features = []
all_targets = []
with open(fname) as f:
    for i, line in enumerate(f):
        if i == 0:
            print("HEADER:", line.strip())
            continue  # Skip header
        fields = line.strip().split(",")
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        all_targets.append([int(fields[-1].replace('"', ""))])
        if i == 1:
            print("EXAMPLE FEATURES:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)

HEADER: "Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
EXAMPLE FEATURES: [0.0, -1.3598071336738, -0.0727811733098497, 2.53634673796914, 1.37815522427443, -0.338320769942518, 0.462387777762292, 0.239598554061257, 0.0986979012610507, 0.363786969611213, 0.0907941719789316, -0.551599533260813, -0.617800855762348, -0.991389847235408, -0.311169353699879, 1.46817697209427, -0.470400525259478, 0.207971241929242, 0.0257905801985591, 0.403992960255733, 0.251412098239705, -0.018306777944153, 0.277837575558899, -0.110473910188767, 0.0669280749146731, 0.128539358273528, -0.189114843888824, 0.133558376740387, -0.0210530534538215, 149.62]
features.shape: (284807, 30)
targets.shape: (284807, 1)


In [13]:
num_val_samples = int(len(features) * 0.2)
train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]
val_features = features[-num_val_samples:]
val_targets = targets[-num_val_samples:]

print("Number of training samples:", len(train_features))
print("Number of validation samples:", len(val_features))

Number of training samples: 227846
Number of validation samples: 56961


In [14]:
mean = np.mean(train_features, axis=0)
train_features -= mean
val_features -= mean
std = np.std(train_features, axis=0)
train_features /= std
val_features /= std

In [15]:
## Build binary classifcation model
from tensorflow import keras

model = keras.Sequential(
    [
        keras.layers.Dense(
            256, activation="relu", input_shape=(train_features.shape[-1],)
        ),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 256)               7936      
_________________________________________________________________
dense_5 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 257       
Total params: 139,777
Trainable params: 139,777
Non-trainable params: 0
________________________________________________

In [16]:
metrics = [
    keras.metrics.BinaryAccuracy(name='acc'),
    keras.metrics.AUC(name='auc')
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)

callbacks = [keras.callbacks.EarlyStopping( 
    monitor='val_auc', min_delta=0, patience=5, verbose=1, 
    mode='auto', baseline=None, restore_best_weights=True)]

model.fit(
    train_features,
    train_targets,
    batch_size=2048,
    epochs=30,
    verbose=2,
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
)

Epoch 1/30
112/112 - 2s - loss: 2.2655e-06 - acc: 6.5834e-04 - auc: 0.9531 - val_loss: 0.0754 - val_acc: 0.0000e+00 - val_auc: 0.9815
Epoch 2/30
112/112 - 2s - loss: 1.3744e-06 - acc: 6.9345e-04 - auc: 0.9813 - val_loss: 0.0928 - val_acc: 2.4578e-04 - val_auc: 0.9878
Epoch 3/30
112/112 - 2s - loss: 1.1540e-06 - acc: 7.0662e-04 - auc: 0.9859 - val_loss: 0.3780 - val_acc: 6.6712e-04 - val_auc: 0.9864
Epoch 4/30
112/112 - 2s - loss: 1.1425e-06 - acc: 8.1634e-04 - auc: 0.9882 - val_loss: 0.1672 - val_acc: 4.9156e-04 - val_auc: 0.9861
Epoch 5/30
112/112 - 2s - loss: 9.6454e-07 - acc: 9.1729e-04 - auc: 0.9919 - val_loss: 0.0197 - val_acc: 6.6712e-04 - val_auc: 0.9834
Epoch 6/30
112/112 - 2s - loss: 1.1569e-06 - acc: 0.0012 - auc: 0.9911 - val_loss: 0.0520 - val_acc: 1.4045e-04 - val_auc: 0.9860
Epoch 7/30
112/112 - 2s - loss: 8.4055e-07 - acc: 8.3390e-04 - auc: 0.9945 - val_loss: 0.0449 - val_acc: 5.0912e-04 - val_auc: 0.9838
Epoch 8/30
112/112 - 2s - loss: 9.3953e-07 - acc: 0.0012 - auc: 0.

<tensorflow.python.keras.callbacks.History at 0x7fe4ec439f40>

In [None]:
## Back to array
pred_prob = model.predict(val_features)
pred_label = 1*(pred_prob > .5)

## Example 2: [Collaborative Filtering for Movie Recommendations](https://keras.io/examples/structured_data/collaborative_filtering_movielens/)

- Author: Siddhartha Banerjee
- Date created: 2020/05/24
- Last modified: 2020/05/24
- Description: Recommending movies using a model trained on Movielens dataset.

## Pre-process the ML-100K raw data
- check the `user_id` and `item_id`: mapping `item_id` to a continuous sequence based on `sklean.preprocessing`
- use `sklearn.model_selection.train_test_split` to generate `train` and `test` dataset

In [17]:
import numpy as np
import pandas as pd

df = pd.read_csv('./dataset/ml-latest-small/ratings.csv')
del df['timestamp']
## mapping 
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['movieId'] = le.fit_transform(df['movieId'])
df['userId'] = le.fit_transform(df['userId'])
## generate train / test dataset
from sklearn.model_selection import train_test_split
dtrain, dtest = train_test_split(df, test_size=0.33, random_state=42)
## save real ratings for test set for evaluation.
test_rating = np.array(dtest['rating'])
## remove the ratings in the test set to simulate prediction
dtest = dtest.drop(columns='rating')


In [18]:
# tran_pair, train_rating
train_pair = dtrain[['userId', 'movieId']].values
train_rating = dtrain['rating'].values
# test_pair
test_pair = dtest[['userId', 'movieId']].values
n_user, n_item = max(train_pair[:,0].max(), test_pair[:,0].max())+1, max(train_pair[:,1].max(), test_pair[:,1].max())+1

### Create the model
We embed both users and movies in to 50-dimensional vectors.

The model computes a match score between user and movie embeddings via a dot product, and adds a per-movie and per-user bias.

In [19]:
import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
import matplotlib.pyplot as plt

In [26]:
class LFactorNet(keras.Model):
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(LFactorNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.user_bias = layers.Embedding(num_users, 1)
        self.glb_bias = tf.Variable(0., trainable=True) 
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.movie_bias = layers.Embedding(num_movies, 1)

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        movie_bias = self.movie_bias(inputs[:, 1])
        dot_user_movie = tf.tensordot(user_vector, movie_vector, 2)
        # Add all the components (including bias)
        x = dot_user_movie + user_bias + movie_bias + self.glb_bias
        return x

In [22]:
model = LFactorNet(num_users=n_user, num_movies=n_item, embedding_size=50)

metrics = [
    keras.metrics.MeanAbsoluteError(name='mae'),
    keras.metrics.RootMeanSquaredError(name='rmse')
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-3), 
    loss=tf.keras.losses.MeanSquaredError(), 
    metrics=metrics
)

callbacks = [keras.callbacks.EarlyStopping( 
    monitor='val_rmse', min_delta=0, patience=5, verbose=1, 
    mode='auto', baseline=None, restore_best_weights=True)]

history = model.fit(
    x=train_pair,
    y=train_rating,
    batch_size=64,
    epochs=50,
    verbose=1,
    validation_split=.2,
)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [25]:
## make prediction
pred_rating = model.predict(test_pair).flatten()
print(pred_rating)
print('rmse: LFactorNet: %.3f' %np.sqrt(np.mean((pred_rating - test_rating)**2)))

[0.9852332  1.7486626  0.33845353 ... 2.354248   1.447402   2.3437057 ]
rmse: LFactorNet: 1.920


## Example 3: Neural Collaborative Filtering in MovieLens dataset

	Credit: The notebook is adapted from https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/recommender/neural_collaborative_filtering