# Recommender Systems with DGL

## Introduction

Graph Neural Networks (GNNs), a family of methods for learning representations on graphs, have gained much attention recently.  Various models such as Graph Convolutional Networks and GraphSAGE have been proposed to obtain representations of whole graphs, or of nodes on a single graph.

A primary goal of Collaborative Filtering (CF) is to automatically predict a user's interests, e.g. whether or how a user would interact with a set of items, given that user's own interaction history as well as the histories of other users.  The user-item interactions can also be viewed as a bipartite graph, where users and items form two sets of nodes, and the edges connecting them stand for interactions.  The problem can then be formulated as a *link-prediction* problem, where we try to predict whether an edge (of a given type) exists between two nodes.
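As a minimal sketch of this formulation, the toy interaction table below (made-up users, items, and ratings) is turned into the edge lists of a bipartite graph. Placing item IDs after user IDs in one shared node-ID space is an assumption made for illustration; it is not necessarily how the `movielens` helper lays out its nodes.

```python
import numpy as np

# Hypothetical toy interactions: (user, item, rating).
interactions = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 2, 1.0)]
n_users, n_items = 3, 3

users = np.array([u for u, _, _ in interactions])
items = np.array([i for _, i, _ in interactions])
ratings = np.array([r for _, _, r in interactions])

# Shared node-ID space (assumption): users are 0..n_users-1, items follow.
src = users
dst = items + n_users

# Dense user-item matrix, built only to make the bipartite structure visible.
adj = np.zeros((n_users, n_items))
adj[users, items] = ratings

# Link prediction asks about the pairs that have no edge yet.
unobserved = np.argwhere(adj == 0)
```

Every edge runs between the user block and the item block, and the entries of `unobserved` are the candidate links a model would score.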

Based on this intuition, researchers have developed multiple new models for CF, including but not limited to:

* Geometric Learning Approaches
  * [Geometric Matrix Completion](https://papers.nips.cc/paper/5938-collaborative-filtering-with-graph-information-consistency-and-scalable-methods.pdf)
  * [Recurrent Multi-graph CNN](https://arxiv.org/pdf/1704.06803.pdf)
* Graph-convolutional Approaches
  * Models such as [R-GCN](https://arxiv.org/pdf/1703.06103.pdf) or [GraphSAGE](https://github.com/stellargraph/stellargraph/tree/develop/demos/link-prediction/hinsage) also apply.
  * [Graph Convolutional Matrix Completion](https://arxiv.org/abs/1706.02263)
  * [PinSage](https://arxiv.org/pdf/1806.01973.pdf)
  
In this hands-on tutorial, we will demonstrate how to write GraphSAGE in DGL + MXNet, and how to apply it in a recommender system setting.

## Dependencies

* Latest DGL release: `conda install -c dglteam dgl`
* `pandas`
* `stanfordnlp`
* `mxnet`
* `tqdm` for displaying the progress bar.

## Loading data

In this tutorial, we focus on rating prediction on the MovieLens-1M dataset.  The data comes from [MovieLens](http://files.grouplens.org/datasets/movielens/ml-1m.zip) and is already shipped with the notebook.

After loading and train-validation-test-splitting the dataset, we process the movie title into (padded) word-ID sequences, and other features into categorical variables (i.e. integers).  We then store them as node features on the graph.
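The padded word-ID representation can be sketched as follows. This is only an illustration: it tokenizes by whitespace instead of using `stanfordnlp`, the titles are made-up samples, and reserving ID 0 for padding is an assumption chosen to match the zero-padding described here.

```python
import numpy as np

titles = ["Toy Story", "Heat", "The Lion King"]  # made-up sample titles

# ID 0 is reserved for padding; real words start at 1.
vocab = {}
sequences = []
for title in titles:
    ids = []
    for word in title.lower().split():
        # Assign the next free ID the first time a word is seen.
        ids.append(vocab.setdefault(word, len(vocab) + 1))
    sequences.append(ids)

# Right-pad every sequence with zeros up to the longest title.
max_len = max(len(s) for s in sequences)
padded = np.zeros((len(sequences), max_len), dtype=np.int64)
for row, ids in enumerate(sequences):
    padded[row, :len(ids)] = ids
```

Each row of `padded` has the same length, so the titles can be stored as a single integer node feature.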

Since user features and item features are different, we pad both types of features with zeros.

All of the above is encapsulated in the `movielens.MovieLens` class to keep this notebook concise.

In [None]:
import mxnet as mx
from mxnet import ndarray as nd, autograd, gluon
from mxnet.gluon import nn
import dgl
import dgl.function as FN
import numpy as np

In [None]:
import movielens
import stanfordnlp

# IMPORTANT!!!
# If you don't have stanfordnlp installed and the English models downloaded, please uncomment this statement
#stanfordnlp.download('en', force=True)

ml = movielens.MovieLens('ml-100k')

## See the features in the MovieLens dataset

The MovieLens dataset has some user features and movie features.

User features:
* age
* gender
* occupation
* zip code

Movie features:
* genre
* year
* title

We use one-hot encoding for "age", "gender", "occupation", "zip code" and "year". "genre" uses multi-hot encoding, while "title" encodes the frequency of different words. For simplicity, we store "genre" and "title" in float32 dense matrices.
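To make the two encodings concrete, here is a small sketch. The helper names `one_hot` and `multi_hot` and the four-genre vocabulary are hypothetical and exist only for this illustration.

```python
import numpy as np

def one_hot(index, n_classes):
    """Exactly one active position, e.g. a single occupation."""
    vec = np.zeros(n_classes, dtype=np.float32)
    vec[index] = 1.0
    return vec

def multi_hot(indices, n_classes):
    """Several active positions, e.g. a movie with more than one genre."""
    vec = np.zeros(n_classes, dtype=np.float32)
    vec[list(indices)] = 1.0
    return vec

genres = ["Action", "Comedy", "Drama", "Thriller"]  # hypothetical vocabulary

occupation = one_hot(2, 5)
movie_genres = multi_hot(
    [genres.index("Action"), genres.index("Thriller")], len(genres)
)
```

A multi-hot vector is already a dense float32 row, which is why a feature such as "genre" can be stored directly as a float32 matrix.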

In [None]:
g = ml.g
print(g.ndata)

MovieLens is a bipartite graph. It has user nodes and movie nodes. When we construct the graph, we add a vector on the node data to identify the node type of every node. User nodes are `1` and movie nodes are `0`.

In [None]:
g.ndata['type']

## Compute embeddings on the MovieLens dataset

"age", "gender", "occupation", "zip code" and "year" use one-hot encoding. "genre" and "title" are stored in float32 dense matrices. In addition, we add one-hot encoding for every user node and every movie node.

We sum these embeddings to construct the initial node features.
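The sketch below shows, in plain NumPy, the two operations this step relies on: a masked mean over padded word-ID sequences, and an element-wise sum of per-feature representations of equal width. The embedding table, the IDs, and the `other` feature are random stand-ins, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_size = 4

# Hypothetical embedding table; row 0 corresponds to the padding ID.
table = rng.normal(size=(7, feature_size))
word_ids = np.array([[1, 2, 0], [3, 0, 0]])  # padded word-ID sequences

emb = table[word_ids]                        # (n_nodes, seq_len, feature_size)
mask = (word_ids != 0).astype(emb.dtype)     # 1 for real words, 0 for padding
counts = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)  # avoid dividing by 0
bag = (emb * mask[:, :, None]).sum(axis=1) / counts        # masked mean

# Representations of different features share one width, so they can be summed.
other = rng.normal(size=(2, feature_size))   # stand-in for another feature
h = np.stack([bag, other], axis=0).sum(axis=0)
```

Padding positions contribute nothing to `bag`, so a short title is not pulled toward the padding embedding.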

In [None]:
def mix_embeddings(ndata, gcn):
    """Adds external (categorical and numeric) features into node representation G.ndata['h']"""
    extra_repr = []
    for key, value in ndata.items():
        if (value.dtype == np.int64) and key in gcn.emb:
            result = gcn.emb[key](value)
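            if result.ndim == 3:    # bag of words: the result would be a (n_nodes x seq_len x feature_size) tensor
                # Average over the non-padding positions only (word ID 0 is padding).
                mask = (value != 0).astype(np.float32)
                # keepdims lets the per-node count broadcast; the clamp avoids division by zero.
                result = (result * mask.expand_dims(2)).sum(1) / nd.maximum(mask.sum(1, keepdims=True), 1)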
            extra_repr.append(result)
        elif (value.dtype == np.float32) and key in gcn.proj:
            result = gcn.proj[key](value)
            extra_repr.append(result)
    ndata['h'] = ndata['h'] + nd.stack(*extra_repr, axis=0).sum(axis=0)

class MovieLensEmbedding(nn.Block):
    def __init__(self, G, feature_size):
        super(MovieLensEmbedding, self).__init__()
        self.emb = {}
        self.proj = {}

        for key, scheme in G.node_attr_schemes().items():
            if scheme.dtype == np.int64:
                # This is for one-hot encoding.
                n_items = G.ndata[key].max().asscalar()
                self.emb[key] = nn.Embedding(n_items + 1, feature_size)
                # We need to make it a member of the class in order to
                # initialize the weight matrix.
                setattr(self, 'emb_' + key, self.emb[key])
            elif scheme.dtype == np.float32:
                seq = nn.Sequential()
                with seq.name_scope():
                    w = nn.Dense(feature_size)
                    seq.add(w)
                    seq.add(nn.LeakyReLU(0.1))
                self.proj[key] = seq
                # We need to make it a member of the class in order to
                # initialize the weight matrix.
                setattr(self, 'proj_' + key, seq)
        
        self.node_emb = nn.Embedding(G.number_of_nodes() + 1, feature_size)

    def forward(self, ndata, nids):
        ndata['h'] = self.node_emb(nids + 1)
        mix_embeddings(ndata, self)
    

## Model

We can now write a GraphSAGE layer.  In GraphSAGE, the node representation is updated with the representation in the previous layer as well as an aggregation (often mean) of "messages" sent from all neighboring nodes.

### Algorithm

The algorithm of a single GraphSAGE layer goes as follows for each node $v$:

1. $h_{\mathcal{N}(v)} \gets \mathtt{Average}_{u \in \mathcal{N}(v)} h_{u}$
2. $h_{v} \gets \sigma\left(W \cdot \mathtt{CONCAT}(h_v, h_{\mathcal{N}(v)})\right)$
3. $h_{v} \gets h_{v} / \lVert h_{v} \rVert_2$

where

* $\mathtt{Average}$ can be replaced by any kind of aggregation including `sum`, `max`, or even an LSTM.
* $\sigma$ is any non-linearity function (e.g. `LeakyReLU`)

We simply repeat the computation above for multiple GraphSAGE layers.
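The three steps can be sketched in NumPy as below. This is an illustrative re-implementation, not the DGL code used in this tutorial: the function name `graphsage_layer`, the row-vector convention `h @ W`, and the toy graph are all assumptions.

```python
import numpy as np

def graphsage_layer(h, src, dst, W, slope=0.1):
    """One mean-aggregation GraphSAGE layer over edges src -> dst."""
    n_nodes = h.shape[0]

    # Step 1: average the embeddings of each node's in-neighbors.
    h_sum = np.zeros_like(h)
    np.add.at(h_sum, dst, h[src])            # unbuffered scatter-add of messages
    deg = np.bincount(dst, minlength=n_nodes).astype(h.dtype)
    h_neigh = h_sum / np.maximum(deg, 1.0)[:, None]

    # Step 2: concatenate, apply the linear map, then LeakyReLU.
    z = np.concatenate([h, h_neigh], axis=1) @ W
    z = np.where(z > 0, z, slope * z)

    # Step 3: L2-normalize each row.
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norm, 1e-6)

# Toy graph: nodes 0 and 1 are linked to node 2 in both directions.
src = np.array([0, 1, 2, 2])
dst = np.array([2, 2, 0, 1])

rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 4))
W = rng.normal(size=(8, 4))                  # maps 2 * feature_size -> feature_size
h1 = graphsage_layer(h0, src, dst, W)
```

Stacking several layers amounts to calling `graphsage_layer` repeatedly, each time with its own weight matrix.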

In [None]:
class GraphSage(nn.Block):
    def __init__(self, feature_size, n_layers, G):
        super(GraphSage, self).__init__()
        
        self.feature_size = feature_size
        self.n_layers = n_layers

        self.layers = gluon.nn.Sequential()
        for i in range(n_layers):
            self.layers.add(GraphSageNodeUpdate(feature_size))

        self.G = G
        self.emb = MovieLensEmbedding(G, feature_size)

    def forward(self, G):
        assert G.number_of_nodes() == self.G.number_of_nodes()
        all_nodes = mx.nd.arange(G.number_of_nodes(), dtype=np.int64)
        self.emb(G.ndata, all_nodes)
            
        if self.n_layers == 0:
            return G.ndata['h']
        
        G.ndata['deg'] = G.in_degrees(all_nodes).astype(np.float32)
        for i in range(self.n_layers):
            G.update_all(FN.copy_src('h', 'h'), FN.sum('h', 'h_agg'), self.layers[i])

        return G.ndata['h']

## GraphSage node update function

The MovieLens dataset has two types of nodes: users and movies. We need to apply a separate node update function to each type.

The node update function performs the computation of the last two steps as shown above.

For the movie nodes,

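$h_{m} \gets \sigma\left(W_0 \cdot \mathtt{CONCAT}(h_m, h_{\mathcal{N}(m)} / d_{\mathcal{N}(m)})\right)$,
$h_{m} \gets h_{m} / \lVert h_{m} \rVert_2$

For the user nodes,

$h_{u} \gets \sigma\left(W_1 \cdot \mathtt{CONCAT}(h_u, h_{\mathcal{N}(u)} / d_{\mathcal{N}(u)})\right)$,
$h_{u} \gets h_{u} / \lVert h_{u} \rVert_2$

Here $h_{\mathcal{N}(\cdot)}$ denotes the *sum* of the neighbor embeddings and $d_{\mathcal{N}(\cdot)}$ the number of neighbors, so their ratio is the average from step 1.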
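One simple way to apply type-specific weights is to compute both transforms for every node and then select one per row. The NumPy sketch below assumes the type convention stated earlier (`1` for users, `0` for movies); the inputs are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
h_concat = rng.normal(size=(4, 6))     # stand-in for CONCAT(h_v, mean of neighbors)
node_type = np.array([1, 0, 1, 0])     # assumed convention: 1 = user, 0 = movie

W0 = rng.normal(size=(6, 3))           # movie weights
W1 = rng.normal(size=(6, 3))           # user weights

# Apply both transforms to every node ...
out_movie = h_concat @ W0
out_user = h_concat @ W1

# ... then keep, per row, the output that matches the node's type.
is_user = (node_type == 1)[:, None]
h_new = np.where(is_user, out_user, out_movie)
```

This does twice the matrix multiplication that is strictly needed, but it avoids splitting and re-merging the node batch by type.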

In [None]:
class GraphSageNodeUpdate(nn.Block):
    def __init__(self, feature_size):
        super(GraphSageNodeUpdate, self).__init__()

        self.feature_size = feature_size
        self.W0 = nn.Dense(feature_size)
        self.W1 = nn.Dense(feature_size)
        self.leaky_relu = nn.LeakyReLU(0.1)

    def forward(self, nodes):
        # Node embedding from the previous layer.
        h = nodes.data['h']
        # Aggregation of the node embeddings in the neighborhood
        h_agg = nodes.data['h_agg']
        # Degree of the vertex.
        deg = nodes.data['deg'].expand_dims(1)
        h_concat = nd.concat(h, h_agg / nd.maximum(deg, 1e-6), dim=1)
        
        # There are two types of nodes. Each type should have its own model weights.
        h_new0 = self.W0(h_concat)
        h_new1 = self.W1(h_concat)
        # Pick the right embedding per node: nd.where takes h_new0 where
        # 'type' is nonzero and h_new1 elsewhere.
        h_new = nd.where(nodes.data['type'], h_new0, h_new1)
        
        h_new = self.leaky_relu(h_new)
        # L2-normalize each embedding (step 3 of the algorithm); the clamp avoids division by zero.
        return {'h': h_new / nd.maximum(h_new.norm(axis=1, keepdims=True), 1e-6)}

## Rating score

In this graph, a movie node can only connect to a user node, and vice versa. There are no connections between movie nodes, nor between user nodes.

For recommendation, the predicted rating of item $j$ by user $i$ is $u_i^T v_j$, where $u_i$ and $v_j$ are the final embeddings of the user and the movie.

Given the observed ratings $r_{i,j}$, we minimize $$\sum_{i,j}\left(r_{i,j}-u_i^T v_j\right)^2.$$

In practice, recommendation models also have a user bias term and a movie bias term. Thus, we minimize
$$\sum_{i,j}\left(r_{i,j}-(u_i^T v_j + b_{u_i} + b_{v_j})\right)^2.$$
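A small NumPy sketch of this objective follows. The helper names `predict` and `squared_error`, the embeddings, and the biases are made up for illustration; a single bias vector indexed by node ID is assumed.

```python
import numpy as np

def predict(h, bias, src, dst):
    """Dot product of the two endpoint embeddings plus both node biases."""
    return (h[src] * h[dst]).sum(axis=1) + bias[src] + bias[dst]

def squared_error(h, bias, src, dst, rating):
    """Sum of squared differences between observed and predicted ratings."""
    return ((rating - predict(h, bias, src, dst)) ** 2).sum()

# Toy values: nodes 0 and 1 are users, node 2 is a movie.
h = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
bias = np.array([0.5, 0.0, -0.5])
src = np.array([0, 1])
dst = np.array([2, 2])
rating = np.array([1.0, 3.0])

scores = predict(h, bias, src, dst)
loss = squared_error(h, bias, src, dst, rating)
```

The training loop in this notebook uses the mean rather than the sum of the squared errors, which only rescales the objective by the number of edges.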

When we generate a `NodeFlow`, roughly half of the target nodes are movie nodes and half of them are user nodes. Running GraphSage on the target nodes refines their embeddings with information from their neighbors. We then use the final node embeddings for rating prediction.

In [None]:
class GraphSAGERecommenderFull(nn.Block):
    def __init__(self, gcn):
        super(GraphSAGERecommenderFull, self).__init__()
        
        with self.name_scope():
            self.gcn = gcn
            self.node_biases = self.params.get(
                'node_biases',
                init=mx.init.Zero(),
                shape=(gcn.G.number_of_nodes()+1,))
        
    def forward(self, G, src, dst):
        h_output = self.gcn(G)
        h_src = h_output[src]
        h_dst = h_output[dst]
        score = (h_src * h_dst).sum(1) + self.node_biases.data()[src+1] + self.node_biases.data()[dst+1]
        return score

## Construct the training set

We train on a subset of the edges in the MovieLens dataset. To construct the training set, we take all the edges marked for training and construct a graph from them.

We first use `filter_edges` to select the edge IDs for training. We then call `edge_subgraph` to construct the induced subgraph with the training edges. In the induced subgraph, we preserve all nodes from the parent graph.
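Conceptually, the selection works like the NumPy sketch below. It mimics the idea with plain arrays and does not call DGL; the edge list, the ratings, and the `train_mask` flag are made up.

```python
import numpy as np

# Toy edge list over 5 nodes (users 0-2, movies 3-4) with one rating per edge.
src = np.array([0, 0, 1, 2])
dst = np.array([3, 4, 3, 4])
rating = np.array([5.0, 3.0, 4.0, 1.0])
train_mask = np.array([True, False, True, True])  # hypothetical split flag

# Select the IDs of the training edges.
train_eid = np.nonzero(train_mask)[0]

# Keep only those edges; node IDs are left untouched, so all nodes are preserved.
src_train, dst_train = src[train_eid], dst[train_eid]
rating_train = rating[train_eid]
n_nodes = 5  # same node count as the parent graph
```

Because the node IDs are not relabeled, embeddings computed on the training graph can be indexed with the node IDs of the full graph.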

In [None]:
g = ml.g
# Find the subgraph of all "training" edges
train_eid = g.filter_edges(lambda edges: edges.data['train']).astype('int64')
g_train = g.edge_subgraph(train_eid, preserve_nodes=True)
g_train.copy_from_parent()
rating_train = g_train.edata['rating']
src_train, dst_train = g_train.all_edges()

## Construct the test set

Similarly, we use `filter_edges` to select the edge IDs for testing.

In [None]:
eid_test = g.filter_edges(lambda edges: edges.data['test']).astype('int64')
src_test, dst_test = g.find_edges(eid_test)
rating_test = g.edges[eid_test].data['rating']

## Run the model

In [None]:
model = GraphSAGERecommenderFull(GraphSage(100, 1, g_train))
model.collect_params().initialize(ctx=mx.cpu())
trainer = gluon.Trainer(model.collect_params(), 'adam', {'learning_rate': 0.0002, 'wd': 1e-9})

for epoch in range(200):
    for i in range(100):
        # Training: one full-graph gradient step on the training edges.
        with mx.autograd.record():
            score = model(g_train, src_train, dst_train)
            loss = ((score - rating_train) ** 2).mean()
        # The backward pass does not need to be recorded.
        loss.backward()
        trainer.step(1)

    # Testing: evaluate once per epoch, since only the last value is printed.
    h = model.gcn(g)
    # Compute test RMSE
    node_biases = model.node_biases.data()
    score = (h[src_test] * h[dst_test]).sum(1) + node_biases[src_test + 1] + node_biases[dst_test + 1]
    test_rmse = nd.sqrt(((score - rating_test) ** 2).mean())

    print('Training loss:', loss.asscalar(), 'Test RMSE:', test_rmse.asscalar())
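
The test metric printed by the loop is the root-mean-square error (RMSE) between predicted and observed ratings. As a quick sanity check of the formula, here is a NumPy sketch with made-up numbers; the helper name `rmse` is hypothetical.

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error between two equally sized sequences."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# One prediction is off by 2 stars and the other two are exact.
value = rmse([3.0, 4.0, 5.0], [3.0, 2.0, 5.0])
```

RMSE is expressed in the same unit as the ratings, which makes it easier to read than the squared-error training loss.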