# Neural Collaborative Filtering (NCF)

This notebook introduces Neural Collaborative Filtering (NCF), an algorithm based on deep neural networks that tackles the key problem in recommendation, collaborative filtering, on the basis of implicit feedback.

## 0 Global Settings and Imports

In [1]:
import sys
sys.path.append("../../")
import time
import os
import pandas as pd
import numpy as np
from reco_utils.recommender.ncf.ncf_singlenode import NCF
from reco_utils.recommender.ncf.dataset import Dataset as NCFDataset
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_chrono_split
from reco_utils.evaluation.python_evaluation import (rmse, mae, rsquared, exp_var, map_at_k, ndcg_at_k, precision_at_k, 
                                                     recall_at_k, get_top_k_items)

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas version: 0.22.0


In [2]:
# Select Movielens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

## 1 Matrix factorization algorithm

NCF is a new neural matrix factorization model that ensembles Generalized Matrix Factorization (GMF) and a Multi-Layer Perceptron (MLP), unifying the linearity of MF with the non-linearity of the MLP to model the user–item latent structures. NCF can be viewed as a general framework of which GMF and MLP are instances.

### 1.1 The GMF model

In ALS, the ratings are modeled as follows:

$$\hat r_{u,i} = q_{i}^{T}p_{u}$$

GMF introduces a neural CF layer as the output layer of standard MF. In this way, MF can be easily generalized and extended. For example, if we allow the edge weights of this output layer to be learnt from data without the uniform constraint, the result is a variant of MF that allows varying importance of the latent dimensions. If we also use a non-linear activation function, MF is generalized to a non-linear setting, which might be more expressive than the linear MF model. GMF can be written as follows:

$$\hat{r}_{u,i} = a_{out}({h}^T( q_{i}\odot p_{u})),$$

where $\odot$ is the element-wise product of vectors, and $a_{out}$ and $h$ denote the activation function and the edge weights of the output layer, respectively. MF can be interpreted as a special case of GMF: if we use the identity function for $a_{out}$ and force $h$ to be a uniform vector of ones, we exactly recover the MF model.
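The special case described above can be checked numerically. The following is a minimal NumPy sketch, not the `reco_utils` implementation; the helper name `gmf_score`, the random vectors, and the choice of a sigmoid for the non-linear variant are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 4  # illustrative latent dimension

p_u = rng.normal(size=n_factors)  # user latent vector (random stand-in)
q_i = rng.normal(size=n_factors)  # item latent vector (random stand-in)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gmf_score(p_u, q_i, h, a_out=lambda x: x):
    """GMF prediction a_out(h^T (q_i * p_u)); a_out defaults to the identity."""
    return a_out(h @ (q_i * p_u))


# Identity activation and a uniform vector of ones recover plain MF.
mf_score = q_i @ p_u
gmf_as_mf = gmf_score(p_u, q_i, h=np.ones(n_factors))

# Non-uniform edge weights (random here, learnt in practice) with a sigmoid
# output give a non-linear variant whose score lies in (0, 1).
h = rng.normal(size=n_factors)
gmf_nonlinear = gmf_score(p_u, q_i, h=h, a_out=sigmoid)
```

Here `gmf_as_mf` equals the dot product `mf_score`, while `gmf_nonlinear` weights each latent dimension differently before squashing the result.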

### 1.2 The MLP model

NCF adopts two pathways to model users and items: 1) the element-wise product of vectors, and 2) the concatenation of vectors. To learn the interactions after concatenating the user and item latent features, a standard MLP model is applied. This gives the model a large degree of flexibility and non-linearity for learning the interactions between $p_{u}$ and $q_{i}$. The details of the MLP model are as follows.

The input layer is the concatenation of the user and item vectors:

$$z_{1} = \phi_{1}(p_{u}, q_{i})=
\left[\begin{array}{c} 
    p_{u}\\
    q_{i}\\ 
\end{array}\right]$$

The hidden layers and the output layer of the MLP are defined as:

$$
\phi_{l}(z_{l}) = a_{out}(W^{T}_{l} z_{l} + b_{l}),\quad (l=2,3,\ldots,L-1)
$$
$$
\hat{r}_{u,i} = \sigma(h^T\phi(z_{L-1}))
$$

where $W_l$, $b_l$, and $a_{out}$ denote the weight matrix, bias vector, and activation function for the $l$-th layer’s perceptron, respectively. For the activation functions of the MLP layers, one can freely choose among sigmoid, hyperbolic tangent (tanh), and the rectified linear unit (ReLU), among others. Because the task uses implicit feedback, the activation function of the output layer is the sigmoid $\sigma(x)=\frac{1}{1+\exp{(-x)}}$, which restricts the predicted score to (0,1).
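As a rough illustration of the equations above, the forward pass of the MLP pathway can be sketched in NumPy. This is not the TensorFlow code in `reco_utils`; the helper `mlp_score`, the layer sizes, the random weights, and the use of ReLU for the hidden layers are all assumptions chosen for the example.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mlp_score(p_u, q_i, weights, biases, h):
    """Concatenate the embeddings, apply the hidden layers, then the sigmoid output."""
    z = np.concatenate([p_u, q_i])  # input layer: [p_u; q_i]
    for W, b in zip(weights, biases):
        z = relu(W.T @ z + b)  # hidden layer: a(W^T z + b)
    return sigmoid(h @ z)  # output layer: sigma(h^T z)


rng = np.random.default_rng(0)
n_factors = 4
layer_sizes = [2 * n_factors, 8, 4]  # hypothetical sizes: input layer, then hidden layers

weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]
h = rng.normal(scale=0.1, size=layer_sizes[-1])

score = mlp_score(
    rng.normal(size=n_factors), rng.normal(size=n_factors), weights, biases, h
)
```

Because the output activation is a sigmoid, `score` always falls strictly between 0 and 1, matching the implicit-feedback setting.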


### 1.3 Fusion of GMF and MLP

To provide more flexibility to the fused model, we allow GMF and MLP to learn separate embeddings, and combine the two models by concatenating their last hidden layer. We get $\phi^{GMF}$ from GMF:

$$\phi_{u,i}^{GMF}=p_u^{GMF}\odot q_i^{GMF}$$

and obtain $\phi^{MLP}$ from MLP:

$$\phi_{u,i}^{MLP}=a_{out}\left(W^T_{L}\left(a_{out}\left(\ldots a_{out}\left(W^T_2
\left[\begin{array}{c} 
    p_{u}^{MLP}\\
    q_{i}^{MLP}\\ 
\end{array}\right] + b_2\right)\ldots\right)\right)+b_L\right)$$

Lastly, we fuse output from GMF and MLP:

$$\hat{r}_{u,i}=\sigma\left(h^T\left[\begin{array}{c} 
    \phi^{GMF}\\
    \phi^{MLP}\\ 
\end{array}\right]\right)$$

This model combines the linearity of MF and non-linearity of DNNs for modelling user–item latent structures.
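The fusion step can be sketched as follows. This is a toy NumPy example under stated assumptions: the embeddings are random, `phi_mlp` is a random non-negative stand-in for the last hidden layer of the MLP, and `neumf_score` is a hypothetical helper name rather than a function from `reco_utils`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def neumf_score(phi_gmf, phi_mlp, h):
    """Fuse both pathways: sigma(h^T [phi_gmf; phi_mlp])."""
    return sigmoid(h @ np.concatenate([phi_gmf, phi_mlp]))


rng = np.random.default_rng(0)
n_factors = 4

# GMF pathway: element-wise product of its own user and item embeddings.
p_gmf, q_gmf = rng.normal(size=n_factors), rng.normal(size=n_factors)
phi_gmf = p_gmf * q_gmf

# MLP pathway: stand-in for the output of the last hidden layer (assumed ReLU).
phi_mlp = np.maximum(rng.normal(size=n_factors), 0.0)

# One output weight per fused feature.
h = rng.normal(size=phi_gmf.size + phi_mlp.size)
score = neumf_score(phi_gmf, phi_mlp, h)
```

Note that the two pathways keep separate embeddings, so the output weight vector `h` has as many entries as both feature vectors combined.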

### 1.4 Objective Function

We define the likelihood function as:

$$P(\mathcal{R}, \mathcal{R^-}|\mathbf{P, Q}, \Theta)=\prod_{(u,i)\in\mathcal{R}}\hat{r}_{u,i}
\prod_{(u,j)\in\mathcal{R^-}}(1-\hat{r}_{u,j})$$

where $\mathcal{R}$ denotes the set of observed interactions and $\mathcal{R^-}$ denotes the set of negative instances. $\mathbf{P}$ and $\mathbf{Q}$ denote the latent factor matrices for users and items, respectively, and $\Theta$ denotes the model parameters. Taking the negative logarithm of the likelihood, we obtain the objective function to minimize for the NCF method, which is known as the *binary cross-entropy loss*:

$$L=-\sum_{(u,i)\in \mathcal{R}\cup\mathcal{R^-}}\left[r_{u,i}\log \hat{r}_{u,i}+(1-r_{u,i})\log (1-\hat{r}_{u,i})\right]$$

The optimization can be done with Stochastic Gradient Descent (SGD), which is introduced for the SVD algorithm in the Surprise SVD deep dive notebook. The SGD procedure used here is very similar to the one used for SVD.
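To make the link between the likelihood and the loss concrete, here is a small NumPy sketch with made-up labels and scores. The helper name `binary_cross_entropy` and the clipping constant `eps` are assumptions for this example; the loss is summed over samples, as in the formula above.

```python
import numpy as np


def binary_cross_entropy(r, r_hat, eps=1e-12):
    """Summed binary cross-entropy; eps guards against log(0)."""
    r = np.asarray(r, dtype=float)
    r_hat = np.clip(np.asarray(r_hat, dtype=float), eps, 1.0 - eps)
    return -np.sum(r * np.log(r_hat) + (1.0 - r) * np.log(1.0 - r_hat))


# Two observed interactions (label 1) and two sampled negatives (label 0).
labels = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.6, 0.2, 0.1])  # made-up predicted scores

loss = binary_cross_entropy(labels, scores)

# The likelihood is the product of r_hat over positives and (1 - r_hat) over negatives.
likelihood = 0.9 * 0.6 * (1 - 0.2) * (1 - 0.1)
```

For these values, `loss` is the negative logarithm of `likelihood`, which is exactly the relationship stated in the text.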

## 2 TensorFlow implementation of NCF

We will use the Movielens dataset, which is composed of integer ratings from 1 to 5.

We convert Movielens into implicit feedback, and evaluate under our *leave-one-out* evaluation protocol.

You can check the details of the implementation in `reco_utils/recommender/ncf`.


## 3 TensorFlow NCF movie recommender

### 3.1 Load and split data

To evaluate the performance of item recommendation, we adopt the leave-one-out evaluation.

In this protocol, each user's latest interaction is held out as the test set and the remaining data is used for training. We use `python_chrono_split` for the chronological split; note that the call below passes a ratio of 0.75, so the test set holds the most recent part of each user's history rather than a single interaction. Since it is too time-consuming to rank all items for every user during evaluation, we follow the common strategy of randomly sampling 100 items that the user has not interacted with and ranking the test item among these 100 items. The test samples are constructed by `NCFDataset`.
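For illustration, a strict leave-one-out split and the sampling of unseen items could look roughly like the sketch below. It uses plain pandas and NumPy on a toy frame; `leave_one_out_split` and `sample_unseen_items` are hypothetical helper names, not part of `reco_utils`, and the sketch does not reproduce what `python_chrono_split` or `NCFDataset` do internally.

```python
import numpy as np
import pandas as pd


def leave_one_out_split(df, col_user="userID", col_timestamp="timestamp"):
    """Hold out each user's latest interaction; assumes a unique index."""
    ordered = df.sort_values([col_user, col_timestamp])
    test = ordered.groupby(col_user).tail(1)  # last (latest) row per user
    train = ordered.drop(test.index)
    return train, test


def sample_unseen_items(df, user, n_samples, rng, col_user="userID", col_item="itemID"):
    """Sample up to n_samples items the user has never interacted with."""
    all_items = df[col_item].unique()
    seen = df.loc[df[col_user] == user, col_item].unique()
    candidates = np.setdiff1d(all_items, seen)
    size = min(n_samples, len(candidates))
    return rng.choice(candidates, size=size, replace=False)


toy = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "itemID": [10, 11, 12, 10, 13],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.0],
    "timestamp": [1, 2, 3, 5, 4],
})

toy_train, toy_test = leave_one_out_split(toy)
negatives = sample_unseen_items(toy, user=2, n_samples=100, rng=np.random.default_rng(0))
```

On this toy frame, user 1's held-out item is 12 and user 2's is 10; user 2 has only two unseen items, so fewer than 100 negatives are returned.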

In [3]:
df = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=["userID", "itemID", "rating", "timestamp"]
)

df.head()

| | userID | itemID | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3.0 | 881250949 |
| 1 | 186 | 302 | 3.0 | 891717742 |
| 2 | 22 | 377 | 1.0 | 878887116 |
| 3 | 244 | 51 | 2.0 | 880606923 |
| 4 | 166 | 346 | 1.0 | 886397596 |


In [4]:
train, test = python_chrono_split(df, 0.75)

### 3.2 Functions of NCF Dataset 

`NCFDataset` is the dataset class for NCF. Its most important functions are:

`negative_sampling()`, which samples negative (user, item) pairs for every positive instance; the number of negatives is controlled by the parameter `n_neg`.

`train_loader(batch_size, shuffle=True)`, which generates training batches of size `batch_size`; `shuffle` controls whether the training set is shuffled.

`test_loader()`, which generates one test batch per positive test instance. For example, if \[1, 2, 1\] is a positive (user, item) pair in the test set (in the form \[userID, itemID, rating\]), this function returns something like \[\[1, 2, 1\], \[1, 3, 0\], \[1, 6, 0\], ...\], following our *leave-one-out* evaluation protocol.
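The idea behind negative sampling for training can be sketched as follows. This is a simplified illustration, not the `NCFDataset` implementation: the function `add_negative_samples`, the `label` column, and sampling with replacement are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_negative_samples(positives, n_neg, seed=0, col_user="userID", col_item="itemID"):
    """For every positive pair, draw n_neg items the user has not interacted with."""
    rng = np.random.default_rng(seed)
    all_items = positives[col_item].unique()
    rows = []
    for user, group in positives.groupby(col_user):
        # Items this user never interacted with are the negative candidates.
        candidates = np.setdiff1d(all_items, group[col_item].unique())
        for item in group[col_item]:
            rows.append((user, item, 1))  # observed interaction
            if len(candidates) == 0:
                continue  # the user has seen every item, nothing to sample
            for neg in rng.choice(candidates, size=n_neg, replace=True):
                rows.append((user, neg, 0))  # sampled negative
    return pd.DataFrame(rows, columns=[col_user, col_item, "label"])


toy = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 11, 12]})
samples = add_negative_samples(toy, n_neg=2)
```

With `n_neg=2`, each of the three positive pairs is followed by two negatives, so `samples` has nine rows, three of which carry the label 1.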

In [5]:
data = NCFDataset(train=train, test=test)

### 3.3 Train NCF based on TensorFlow
The NCF model has many parameters. The most important ones are:

`n_factors`, which controls the dimension of the latent space. Usually, the quality of the training set predictions improves as `n_factors` gets higher.

`layer_sizes`, a list with the sizes of the input layer and the hidden layers of the MLP.

`n_epochs`, which defines the number of iterations of the SGD procedure. Note that these parameters also affect the training time.

`model_type`, which selects whether to train a single `"MLP"` or `"GMF"` model, or the combined `"NeuMF"` model.

Here we set `n_factors` to `4`, `layer_sizes` to `[32,16,8,4]`, `n_epochs` to `50`, and `batch_size` to `256`. To train the model, we simply call the `fit()` method.

In [7]:
model = NCF(
    n_users=data.n_users, 
    n_items=data.n_items,
    model_type="NeuMF",
    n_factors=4,
    layer_sizes=[32,16,8,4],
    n_epochs=50,
    batch_size=256,
    learning_rate=1e-3,
    verbose=5,
)

start_time = time.time()

model.fit(data)

train_time = time.time() - start_time

print("Took {} seconds for training.".format(train_time))

Training model: NeuMF
Epoch 5 [7.99s]: train_loss = 0.281639 
Epoch 10 [8.89s]: train_loss = 0.256466 
Epoch 15 [7.73s]: train_loss = 0.242392 
Epoch 20 [7.89s]: train_loss = 0.232905 
Epoch 25 [7.93s]: train_loss = 0.225265 
Epoch 30 [8.00s]: train_loss = 0.220340 
Epoch 35 [7.77s]: train_loss = 0.215326 
Epoch 40 [7.54s]: train_loss = 0.211756 
Epoch 45 [8.49s]: train_loss = 0.208063 
Epoch 50 [7.77s]: train_loss = 0.204887 
Took 397.36696910858154 seconds for training.


## 3.4 Prediction and Evaluation

### 3.4.1 Prediction

Now that our model is fitted, we can call `predict` to get predictions. For a given user and item, `predict` returns a score, which we collect into a dataframe:

In [8]:
# Restart the timer so that test_time does not include the training time.
start_time = time.time()

predictions = [[row.userID, row.itemID, model.predict(row.userID, row.itemID)]
               for (_, row) in test.iterrows()]

test_time = time.time() - start_time

predictions = pd.DataFrame(predictions, columns=['userID', 'itemID', 'prediction'])
predictions.head()

| | userID | itemID | prediction |
|---|---|---|---|
| 0 | 1.0 | 88.0 | 0.563818 |
| 1 | 1.0 | 149.0 | 0.006455 |
| 2 | 1.0 | 103.0 | 0.064066 |
| 3 | 1.0 | 239.0 | 0.44759 |
| 4 | 1.0 | 110.0 | 0.012456 |


### 3.4.2 "Leave-one-out" Evaluation

We randomly sample 100 items that the user has not interacted with and rank the test item among these 100 items. The quality of the ranked list is judged by **Hit Ratio (HR)** and **Normalized Discounted Cumulative Gain (NDCG)**.

We truncated the ranked list at 10 for both metrics. As such, the HR intuitively measures whether the test item is present on the top-10 list, and the NDCG accounts for the position of the hit by assigning higher scores to hits at top ranks.

In [9]:
k = 10

ndcgs = []
hit_ratio = []

for b in data.test_loader():
    user_input, item_input, labels = b
    output = model.predict(user_input, item_input, is_list=True)

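    output = np.squeeze(output)
    # The first entry of each test batch is the held-out positive item,
    # so its rank is the number of scores at least as high as output[0].
    rank = sum(output >= output[0])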
    if rank <= k:
        ndcgs.append(1 / np.log(rank + 1))
        hit_ratio.append(1)
    else:
        ndcgs.append(0)
        hit_ratio.append(0)

eval_ndcg = np.mean(ndcgs)
eval_hr = np.mean(hit_ratio)

print("HR:\t%f" % eval_hr)
print("NDCG:\t%f" % eval_ndcg)


HR:	0.481846
NDCG:	0.378893
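Note that the cell above uses the natural logarithm (`np.log`), whereas NDCG is more commonly defined with a base-2 logarithm, so that a hit at rank 1 scores exactly 1. The two conventions differ only by a constant factor: the values from the cell above are larger by 1/ln 2 (about 1.44). The following sketch shows the base-2 variant for a single held-out item; `hit_and_ndcg_at_k` is a hypothetical helper and the score lists are made up.

```python
import numpy as np


def hit_and_ndcg_at_k(scores, k=10):
    """HR@k and NDCG@k for one test batch.

    scores[0] is the held-out positive item; the rest are sampled negatives.
    """
    scores = np.asarray(scores, dtype=float)
    # 1-based rank of the positive item; ties count against it.
    rank = int(np.sum(scores >= scores[0]))
    if rank > k:
        return 0, 0.0
    return 1, 1.0 / np.log2(rank + 1)


hit_top, ndcg_top = hit_and_ndcg_at_k([0.9, 0.1, 0.3])  # positive ranked 1st
hit_third, ndcg_third = hit_and_ndcg_at_k([0.5, 0.7, 0.6, 0.1])  # ranked 3rd
hit_miss, ndcg_miss = hit_and_ndcg_at_k([0.1, 0.7, 0.6, 0.5], k=3)  # ranked 4th, outside top 3
```

A positive ranked first yields an NDCG of 1.0, one ranked third yields 0.5, and one outside the top k contributes 0 to both metrics.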


### 3.4.3 Generic Evaluation
To compute ranking metrics, we need predictions on all (user, item) pairs. We then remove the items the user has already rated in the training set from the top-k recommendations, since we choose not to recommend them again.

In [10]:
start_time = time.time()

users, items, preds = [], [], []
item = list(train.itemID.unique())
for user in train.userID.unique():
    user = [user] * len(item) 
    users.extend(user)
    items.extend(item)
    preds.extend(list(model.predict(user, item, is_list=True)))

all_predictions = pd.DataFrame(data={"userID": users, "itemID":items, "prediction":preds})

merged = pd.merge(train, all_predictions, on=["userID", "itemID"], how="outer")
all_predictions = merged[merged.rating.isnull()].drop('rating', axis=1)

test_time = time.time() - start_time
print("Took {} seconds for prediction.".format(test_time))

Took 3.6529297828674316 seconds for prediction.


In [11]:
k = 10
eval_map = map_at_k(test, all_predictions, col_prediction='prediction', k=k)
eval_ndcg = ndcg_at_k(test, all_predictions, col_prediction='prediction', k=k)
eval_precision = precision_at_k(test, all_predictions, col_prediction='prediction', k=k)
eval_recall = recall_at_k(test, all_predictions, col_prediction='prediction', k=k)

print("MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

MAP:	0.046111
NDCG:	0.194733
Precision@K:	0.179321
Recall@K:	0.101376


## 3.5 Pre-training

To improve the performance of NeuMF, we can adopt a pre-training strategy. We first train GMF and MLP with random initializations until convergence, and then use their model parameters to initialize the corresponding parts of NeuMF’s parameters. Pay particular attention to the output layer, where we concatenate the weights of the two models with

$$h^{NCF}\leftarrow \left[\begin{array}{c} 
    \alpha h^{GMF}\\
    (1-\alpha )h^{MLP}\\ 
\end{array}\right]$$

where $h^{GMF}$ and $h^{MLP}$ denote the $h$ vector of the pretrained GMF and MLP model, respectively, and $\alpha$ is a hyper-parameter determining the trade-off between the two pre-trained models. We set $\alpha = 0.5$.
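The weighted concatenation of the output layers can be illustrated with a few lines of NumPy. The vectors below are made-up values and `init_neumf_output_weights` is a hypothetical helper; how `model.load` applies this internally is not shown here.

```python
import numpy as np


def init_neumf_output_weights(h_gmf, h_mlp, alpha=0.5):
    """Concatenate pretrained output weights, traded off by alpha."""
    return np.concatenate([alpha * h_gmf, (1.0 - alpha) * h_mlp])


h_gmf = np.array([1.0, 2.0, 3.0, 4.0])  # made-up pretrained GMF output weights
h_mlp = np.array([-2.0, 0.0, 2.0, 4.0])  # made-up pretrained MLP output weights

h_ncf = init_neumf_output_weights(h_gmf, h_mlp, alpha=0.5)
```

With $\alpha = 0.5$ both pretrained models contribute equally, so every weight is simply halved before concatenation.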

### 3.5.1 Training GMF and MLP model
With `model.save`, we can set `dir_name` to the directory in which the parameters of GMF and MLP are stored.

In [12]:
model = NCF(
    n_users=data.n_users, 
    n_items=data.n_items,
    model_type="GMF",
    n_factors=4,
    layer_sizes=[32,16,8,4],
    n_epochs=50,
    batch_size=256,
    learning_rate=1e-3,
    verbose=5,
)

start_time = time.time()

model.fit(data)

train_time = time.time() - start_time

print("Took {} seconds for training.".format(train_time))

model.save(dir_name=".pretrain/GMF")

Training model: GMF
Epoch 5 [4.70s]: train_loss = 0.348936 
Epoch 10 [5.43s]: train_loss = 0.309113 
Epoch 15 [5.40s]: train_loss = 0.283016 
Epoch 20 [4.82s]: train_loss = 0.274695 
Epoch 25 [4.67s]: train_loss = 0.271215 
Epoch 30 [4.70s]: train_loss = 0.269850 
Epoch 35 [4.78s]: train_loss = 0.269069 
Epoch 40 [4.75s]: train_loss = 0.267751 
Epoch 45 [5.07s]: train_loss = 0.268026 
Epoch 50 [5.15s]: train_loss = 0.267242 
Took 249.30098176002502 seconds for training.


In [None]:
model = NCF(
    n_users=data.n_users, 
    n_items=data.n_items,
    model_type="MLP",
    n_factors=4,
    layer_sizes=[32,16,8,4],
    n_epochs=50,
    batch_size=256,
    learning_rate=1e-3,
    verbose=5,
)

start_time = time.time()

model.fit(data)

train_time = time.time() - start_time

print("Took {} seconds for training.".format(train_time))

model.save(dir_name=".pretrain/MLP")

Training model: MLP
Epoch 5 [7.00s]: train_loss = 0.316183 
Epoch 10 [6.75s]: train_loss = 0.294825 
Epoch 15 [6.76s]: train_loss = 0.282679 
Epoch 20 [6.96s]: train_loss = 0.272337 
Epoch 25 [6.81s]: train_loss = 0.264747 
Epoch 30 [7.16s]: train_loss = 0.259429 
Epoch 35 [6.68s]: train_loss = 0.254417 
Epoch 40 [6.55s]: train_loss = 0.251136 
Epoch 45 [6.77s]: train_loss = 0.248145 
Epoch 50 [8.06s]: train_loss = 0.245289 
Took 348.6357350349426 seconds for training.


### 3.5.2 Load pre-trained GMF and MLP model for NeuMF
With `model.load`, we can set `gmf_dir` and `mlp_dir` to the directories from which the pre-trained GMF and MLP parameters are loaded into NeuMF.

In [None]:
model = NCF(
    n_users=data.n_users, 
    n_items=data.n_items,
    model_type="NeuMF",
    n_factors=4,
    layer_sizes=[32,16,8,4],
    n_epochs=50,
    batch_size=256,
    learning_rate=1e-3,
    verbose=5,
)

model.load(gmf_dir=".pretrain/GMF", mlp_dir=".pretrain/MLP", alpha=0.5)

start_time = time.time()

model.fit(data)

train_time = time.time() - start_time

print("Took {} seconds for training.".format(train_time))

INFO:tensorflow:Restoring parameters from .pretrain/GMF/model.ckpt
INFO:tensorflow:Restoring parameters from .pretrain/MLP/model.ckpt
Training model: NeuMF
Epoch 5 [13.26s]: train_loss = 0.211389 
Epoch 10 [10.01s]: train_loss = 0.205701 
Epoch 15 [15.90s]: train_loss = 0.201626 
Epoch 20 [15.59s]: train_loss = 0.199055 
Epoch 25 [20.53s]: train_loss = 0.197161 
Epoch 30 [19.91s]: train_loss = 0.193835 
Epoch 35 [17.18s]: train_loss = 0.192848 


You can use the evaluation methods described above to evaluate the pre-trained `NCF` model.

### References
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu and Tat-Seng Chua. Neural Collaborative Filtering: https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf

Official NCF implementation [Keras with Theano]: https://github.com/hexiangnan/neural_collaborative_filtering

Other nice NCF implementation [Pytorch]: https://github.com/LaceyChen17/neural-collaborative-filtering