<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# User2Item recommendations with LightGCN 
This notebook shows how to run an ID-based collaborative filtering baseline with LightGCN. <br>
LightGCN is a simple and neat Graph Convolution Network (GCN) model for recommender systems. It uses a GCN to learn user and item embeddings so that both low-order and high-order user-item interactions are explicitly encoded in the embedding function.
<img src="https://recodatasets.z20.web.core.windows.net/kdd2020/images%2FLightGCN-graphexample.JPG" width="600">



The model architecture is illustrated as follows:
<img src="https://recodatasets.z20.web.core.windows.net/images/lightGCN-model.jpg" width="600">

For more details and instructions, please refer to [lightgcn_deep_dive.ipynb](../../02_model_collaborative_filtering/lightgcn_deep_dive.ipynb).
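
To make the idea of low-order and high-order interactions concrete, here is a minimal NumPy sketch of the propagation rule described in the LightGCN paper: embeddings are repeatedly multiplied by a symmetrically normalized user-item adjacency matrix, and the per-layer results are averaged. The function name `lightgcn_propagate`, the toy interaction matrix, and the dense-matrix representation are assumptions made for illustration; this is not the `reco_utils` implementation.

```python
import numpy as np

def lightgcn_propagate(interactions, user_emb, item_emb, n_layers=3):
    """Toy LightGCN propagation (hypothetical helper, dense matrices only)."""
    n_users, n_items = interactions.shape
    # Bipartite adjacency matrix A = [[0, R], [R^T, 0]].
    adj = np.zeros((n_users + n_items, n_users + n_items))
    adj[:n_users, n_users:] = interactions
    adj[n_users:, :n_users] = interactions.T
    # Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes get weight 0.
    degree = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    inv_sqrt[degree > 0] = degree[degree > 0] ** -0.5
    norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    # Each layer is a plain neighborhood aggregation: no weights, no nonlinearity.
    layer = np.vstack([user_emb, item_emb])
    layers = [layer]
    for _ in range(n_layers):
        layer = norm_adj @ layer
        layers.append(layer)
    # Final embedding: uniform average over layers 0..n_layers.
    final = np.mean(layers, axis=0)
    return final[:n_users], final[n_users:]

rng = np.random.default_rng(0)
R = np.array([[1, 0, 1], [0, 1, 1]], dtype=float)  # 2 users x 3 items (made up)
users, items = lightgcn_propagate(R, rng.normal(size=(2, 4)), rng.normal(size=(3, 4)))
scores = users @ items.T  # one predicted score per user-item pair
```

Layer 1 mixes in direct neighbors (items a user clicked), layer 2 reaches users who share those items, and so on, which is how higher-order connectivity ends up in the final embedding.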

In [1]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
from reco_utils.common.timer import Timer
from reco_utils.recommender.deeprec.models.graphrec.lightgcn import LightGCN
from reco_utils.recommender.deeprec.DataModel.ImplicitCF import ImplicitCF
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_stratified_split
from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k
from reco_utils.common.constants import SEED as DEFAULT_SEED
from reco_utils.recommender.deeprec.deeprec_utils import prepare_hparams
from reco_utils.recommender.deeprec.deeprec_utils import cal_metric
from utils.general import *
from utils.data_helper import *
from utils.task_helper import *

tf.logging.set_verbosity(tf.logging.ERROR)

In [2]:
tag = 'small'

In [3]:
lightgcn_dir = 'data_folder/my/LightGCN-training-folder'
rawdata_dir = 'data_folder/my/DKN-training-folder'
create_dir(lightgcn_dir)

First, we need to transform the raw dataset into LightGCN's input data format:

In [4]:
prepare_dataset(lightgcn_dir, rawdata_dir, tag)

load_instance_file: train_small.txt   done.
load_instance_file: valid_small.txt   done.
load_instance_file: test_small.txt   done.


In [5]:
df_train = pd.read_csv(
        os.path.join(lightgcn_dir, 'lightgcn_train_{0}.txt'.format(tag)),
        sep=' ',
        engine="python",
        names=['userID', 'itemID', 'rating'],
        header=0
    )

In [6]:
df_train.head()

|   | userID | itemID | rating |
|---|---|---|---|
| 0 | 2556758139 | 1639559569 | 0 |
| 1 | 2556758139 | 2750948673 | 0 |
| 2 | 2556758139 | 3009232636 | 0 |
| 3 | 2556758139 | 1997686688 | 0 |
| 4 | 2630447844 | 2253252279 | 1 |


LightGCN only takes positive user-item interactions for model training. Pairs with rating < 1 will be ignored by the model.
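
To preview which rows actually contribute to training, the same rule can be applied by hand. The snippet below is a small sketch on a made-up DataFrame; the column names follow the notebook, but the values are invented, and how the library performs this filtering internally is not shown here.

```python
import pandas as pd

# Toy interactions with the same columns as df_train (values are made up).
toy = pd.DataFrame({
    "userID": [1, 1, 2, 2],
    "itemID": [10, 11, 10, 12],
    "rating": [0, 1, 1, 0],
})

# Keep only positive interactions, i.e. rows with rating >= 1.
positives = toy[toy["rating"] >= 1]
print(len(toy), "rows in total,", len(positives), "positive")
```

Running the same filter on `df_train` gives a quick sanity check of how many interactions the graph is built from.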

In [7]:
df_valid = pd.read_csv(
        os.path.join(lightgcn_dir, 'lightgcn_valid_{0}.txt'.format(tag)),
        sep=' ',
        engine="python",
        names=['userID', 'itemID', 'rating'],
        header=0
    )

In [8]:
data = ImplicitCF(
    train=df_train, test=df_valid, seed=0,
    col_user='userID',
    col_item='itemID',
    col_rating='rating'
)

In [9]:
yaml_file = './lightgcn.yaml'


hparams = prepare_hparams(yaml_file,                          
                          learning_rate=0.005,
                          eval_epoch=1,
                          top_k=10,
                          save_model=True,
                          epochs=15,
                          save_epoch=1
                         )
hparams.MODEL_DIR = os.path.join(lightgcn_dir, 'saved_models')


In [10]:
hparams.values

<bound method HParams.values of HParams([('DNN_FIELD_NUM', None), ('EARLY_STOP', 100), ('FEATURE_COUNT', None), ('FIELD_COUNT', None), ('L', None), ('MODEL_DIR', 'data_folder/my/LightGCN-training-folder/saved_models'), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('T', None), ('activation', None), ('att_fcn_layer_sizes', None), ('attention_activation', None), ('attention_dropout', 0.0), ('attention_layer_sizes', None), ('attention_size', None), ('batch_size', 1024), ('cate_embedding_dim', None), ('cate_vocab', None), ('contextEmb_file', None), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0), ('cross_layer_sizes', None), ('cross_layers', None), ('data_format', None), ('decay', 0.0001), ('dilations', None), ('dim', None), ('doc_size', None), ('dropout', [0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0), ('embed_size', 64), ('embedding_dropout', 0.3), ('enable_BN', False), ('entityEmb_file', None), ('entity_dim', None), ('entity_embedding_method', None), (

In [11]:
model = LightGCN(hparams, data, seed=0)

Already create adjacency matrix.
Already normalize adjacency matrix.
Using xavier initialization.


In [12]:
with Timer() as train_time:
    model.fit()

print("Took {} seconds for training.".format(train_time.interval))

Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_1
Epoch 1 (train)13.8s + (eval)1.2s: train loss = 0.08667 = (mf)0.08563 + (embed)0.00104, recall = 0.18498, ndcg = 0.09494, precision = 0.01850, map = 0.06812
Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_2
Epoch 2 (train)12.8s + (eval)1.1s: train loss = 0.01980 = (mf)0.01793 + (embed)0.00187, recall = 0.22820, ndcg = 0.12585, precision = 0.02282, map = 0.09494
Save model to path /data/home/jialia/jialia/kdd2020tutorial/formal_03/recommenders/scenarios/academic/KDD2020-tutorial/data_folder/my/LightGCN-training-folder/saved_models/epoch_3
Epoch 3 (train)12.8s + (eval)1.1s: train loss = 0.01252 = (mf)0.01021 + (embed)0.00231, recall = 0.25020, ndcg = 0.13682, precision = 0.02502, map
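
The training log splits the loss into an `(mf)` term and an `(embed)` term. In the LightGCN paper the objective is the BPR pairwise ranking loss plus L2 regularization on the embeddings, so one plausible reading is sketched below. The function name `bpr_loss_with_reg` and the exact scaling of the regularizer are assumptions; only the `decay` value of 0.0001 is taken from the hyperparameters printed earlier.

```python
import numpy as np

def bpr_loss_with_reg(user_vecs, pos_item_vecs, neg_item_vecs, decay=1e-4):
    """BPR loss plus L2 embedding penalty (illustrative sketch, not the library code)."""
    pos_scores = np.sum(user_vecs * pos_item_vecs, axis=1)
    neg_scores = np.sum(user_vecs * neg_item_vecs, axis=1)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp keeps it numerically stable.
    mf_loss = np.mean(np.logaddexp(0.0, -(pos_scores - neg_scores)))
    # L2 penalty on the embeddings used in this batch (scaling is an assumption).
    batch_size = user_vecs.shape[0]
    l2 = 0.5 * (np.sum(user_vecs ** 2)
                + np.sum(pos_item_vecs ** 2)
                + np.sum(neg_item_vecs ** 2))
    embed_loss = decay * l2 / batch_size
    return mf_loss + embed_loss, mf_loss, embed_loss

rng = np.random.default_rng(0)
u, p, n = (rng.normal(size=(8, 4)) for _ in range(3))
total, mf, embed = bpr_loss_with_reg(u, p, n)
```

Under this reading, the `(mf)` term falls as positive items are ranked above sampled negatives, while the `(embed)` term grows as the embedding norms increase, which matches the trend in the log.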

In [13]:
user_emb_file = os.path.join(lightgcn_dir, 'user.emb.txt')
item_emb_file = os.path.join(lightgcn_dir, 'item.emb.txt')
model.infer_embedding(
    user_emb_file,
    item_emb_file    
)

To compare LightGCN's performance with DKN, we need to make predictions on the same test set. So we infer the user and item embeddings, then compute a similarity score (dot product) for each user-item pair in the test set.
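
The scoring step itself is just a dot product between two embedding vectors. The sketch below shows it on made-up vectors and adds one design choice that is not part of the notebook: a fallback score for IDs that have no embedding (for example, users absent from the training set). The helper name `score_pair` and the default of 0.0 are assumptions.

```python
import numpy as np

def score_pair(user2vec, item2vec, userid, itemid, default=0.0):
    """Dot-product score; returns `default` when either ID has no embedding."""
    u = user2vec.get(userid)
    v = item2vec.get(itemid)
    if u is None or v is None:
        return default
    return float(np.dot(u, v))

# Toy embeddings keyed by string IDs (values are made up).
user2vec = {"U1": np.array([1.0, 0.0]), "U2": np.array([0.0, 1.0])}
item2vec = {"I1": np.array([0.5, 0.5]), "I2": np.array([1.0, 0.0])}

known = score_pair(user2vec, item2vec, "U1", "I1")    # both IDs present
unseen = score_pair(user2vec, item2vec, "U9", "I1")   # unknown user falls back
```

Whether a fallback is appropriate depends on the evaluation protocol; direct dictionary indexing would instead raise a `KeyError` on an unseen ID.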

In [14]:
def infer_scores_via_embeddings(test_filename, user_emb_file, item_emb_file):
    """Score every user-item pair in the test file by embedding dot product."""
    print('loading embedding file...', end=' ')
    user2vec = load_emb_file(user_emb_file)
    item2vec = load_emb_file(item_emb_file)
    preds, labels, groupids = [], [], []
    with open(test_filename, 'r') as rd:
        for line in rd:
            # The user ID follows the '%' separator; the label and item ID are
            # the first and third space-separated tokens before it.
            words = line.strip().split('%')
            tokens = words[0].split(' ')
            userid = words[1]
            itemid = tokens[2]
            pred = user2vec[userid].dot(item2vec[itemid])
            preds.append(pred)
            labels.append(int(tokens[0]))
            groupids.append(userid)
    print('done')
    return labels, preds, groupids
            

In [15]:
test_filename = os.path.join(rawdata_dir, 'test_{}.txt'.format(tag))
labels, preds, group_keys = infer_scores_via_embeddings(test_filename, user_emb_file, item_emb_file)
# Use new names so the group_labels helper function is not shadowed when the cell is re-run.
grouped_labels, grouped_preds = group_labels(labels, preds, group_keys)


loading embedding file... done


In [16]:
res_pairwise = cal_metric(
    grouped_labels, grouped_preds, ['ndcg@2;4;6', 'group_auc']
)
print(res_pairwise)
res_pointwise = cal_metric(labels, preds, ['auc'])
print(res_pointwise)

{'ndcg@2': 0.4026, 'ndcg@4': 0.4953, 'ndcg@6': 0.5346, 'group_auc': 0.8096}
{'auc': 0.8092}
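
The two AUC numbers answer different questions: `auc` ranks all test pairs together, while `group_auc` is computed per user and then aggregated. The pure-Python sketch below illustrates the distinction on made-up data. The helper names are hypothetical, and the choices to average groups without weighting and to skip groups with only one class are assumptions, not a description of how `cal_metric` behaves.

```python
from collections import defaultdict

def auc(labels, preds):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for l, p in zip(labels, preds) if l == 1]
    neg = [p for l, p in zip(labels, preds) if l == 0]
    if not pos or not neg:
        return None  # undefined when a class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def group_auc(labels, preds, keys):
    """Unweighted mean of per-group AUC, skipping single-class groups."""
    grouped = defaultdict(lambda: ([], []))
    for label, pred, key in zip(labels, preds, keys):
        grouped[key][0].append(label)
        grouped[key][1].append(pred)
    per_group = [auc(ls, ps) for ls, ps in grouped.values()]
    per_group = [a for a in per_group if a is not None]
    return sum(per_group) / len(per_group)

# Toy data: user "a" is ranked perfectly, user "b" is ranked in reverse.
toy_labels = [1, 0, 1, 0]
toy_preds = [0.9, 0.2, 0.3, 0.8]
toy_keys = ["a", "a", "b", "b"]
pointwise = auc(toy_labels, toy_preds)               # 0.75
grouped = group_auc(toy_labels, toy_preds, toy_keys)  # 0.5
```

In the toy data the pooled AUC looks better than the per-user average because scores from different users are compared against each other; the per-user view is usually closer to what a recommender is asked to do.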


### Reference: 
1. Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang & Meng Wang, LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation, 2020, https://arxiv.org/abs/2002.02126