# A Short Introduction to NeuRec

This example aims to describe the building blocks of NeuRec.

Following this example, researchers can quickly implement their ideas and run experiments.

## Dependencies

- numpy>=1.17
- scipy>=1.3.1
- pandas>=0.17
- reckit==0.2.0
- tensorflow==1.14.0 or pytorch==1.4.0

## Configurator

Read configuration

The class `Configurator` reads arguments from ini-style configuration files and/or parses them from the command line.
Argument values are automatically converted to `int`, `float`, `bool`, `list`, or `None`.
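
As a rough illustration of this kind of automatic conversion, the sketch below parses raw strings with `ast.literal_eval` and falls back to the original string when parsing fails. The helper `convert_value` is hypothetical and not part of reckit, whose actual conversion rules may differ.

```python
import ast


def convert_value(text):
    """Hypothetical helper: turn a raw string into int, float, bool, list or None.

    Anything that is not a valid Python literal is returned unchanged as a string.
    """
    try:
        return ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        return text


for raw in ["8", "0.001", "True", '["Recall", "NDCG"]', "None", "MF"]:
    value = convert_value(raw)
    print(f"{raw!r:24} -> {type(value).__name__}")
```

Plain words such as `MF` are not Python literals, so they stay strings.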

Command-line arguments take the form `--arg_name arg_value`:
```bash
python main.py --model MF --num_thread 8 --metric '["Recall", "NDCG"]'
```

Use `Configurator.add_config()` to read ini-style configuration files:

In [1]:
from reckit import Configurator

config = Configurator()
config.add_config("Preprocess.ini", section="Preprocess")
# config.parse_cmd()  # Parse the arguments from command line.


The `section` argument takes effect only if the configuration file contains more than one section; if the file has a single section, the arguments are read from it whatever its name is.

**Note** that command-line arguments take priority over those from configuration files.
That is, if the same argument name appears in both a configuration file and the command line, the value from the file is overridden by the command-line value, regardless of whether the command line is parsed before or after the ini files are read.
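
The precedence rule can be sketched with the standard `configparser` module. `TinyConfig` and its methods are hypothetical stand-ins written for this illustration; they are not reckit's `Configurator`, and they skip type conversion entirely.

```python
import configparser


class TinyConfig:
    """Hypothetical stand-in that only demonstrates precedence."""

    def __init__(self):
        self._file_args = {}
        self._cmd_args = {}

    def add_config_string(self, ini_text, section=None):
        parser = configparser.ConfigParser()
        parser.optionxform = str  # keep option names case-sensitive
        parser.read_string(ini_text)
        sections = parser.sections()
        # With a single section, read it whatever its name is.
        name = sections[0] if len(sections) == 1 else section
        self._file_args.update(parser[name])

    def parse_cmd(self, argv):
        # Expect pairs of the form: --arg_name arg_value
        for flag, value in zip(argv[0::2], argv[1::2]):
            self._cmd_args[flag.lstrip("-")] = value

    def __getitem__(self, key):
        # Command-line values are looked up first, so they always win.
        if key in self._cmd_args:
            return self._cmd_args[key]
        return self._file_args[key]


INI_TEXT = "[NeuRec]\nrecommender = MF\ntop_k = 10\n"

early = TinyConfig()  # command line parsed before the ini file is read
early.parse_cmd(["--top_k", "20"])
early.add_config_string(INI_TEXT, section="ignored_for_single_section")

late = TinyConfig()  # command line parsed after the ini file is read
late.add_config_string(INI_TEXT)
late.parse_cmd(["--top_k", "20"])

print(early["top_k"], late["top_k"])  # prints: 20 20
```

Keeping command-line values in a separate dictionary, rather than overwriting a single one, is what makes the result independent of call order.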

## Preprocessor

This step is optional.
If your dataset has already been preprocessed, you can go directly to [Dataset](#Dataset).

In [2]:
from reckit import Preprocessor

data = Preprocessor()
data.load_data(config.filename, sep=config.separator, columns=config.file_column)
if config.drop_duplicates is True:
    data.drop_duplicates(keep="first")  # keep only the first occurrence of each duplicate

data.filter_data(user_min=5, item_min=5)  # filter out users and items with too few interactions
if config.remap_id is True:
    data.remap_data_id()  # convert user and item IDs to integers starting from 0

if config.splitter == "leave_out":
    data.split_data_by_leave_out(valid=config.valid, test=config.test,
                                 by_time=config.by_time)
elif config.splitter == "ratio":
    data.split_data_by_ratio(train=config.train, valid=config.valid,
                             test=config.test, by_time=config.by_time)

data.save_data()  # save the preprocessed data


loading data...
filtering items...
filtering users...
remapping user IDs...
remapping item IDs...
splitting data by ratio...
saving data to disk...
2020-09-07 09:15:45.636: 
columns = UIRT
filename = dataset/ml-100k.rating
sep = 	
item_min = 5
user_min = 5
remap_user_id = True
remap_item_id = True
split_by = ratio
train = 0.7
valid = 0.0
test = 0.3
by_time = False
2020-09-07 09:15:45.636: Data statistic:
2020-09-07 09:15:45.637: The number of users: 943
2020-09-07 09:15:45.637: The number of items: 1349
2020-09-07 09:15:45.638: The number of ratings: 99287
2020-09-07 09:15:45.638: Average actions of users: 105.29
2020-09-07 09:15:45.639: Average actions of items: 73.60
2020-09-07 09:15:45.639: The sparsity of the dataset: 92.195075%
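
To make the preprocessing steps concrete, here is a minimal pure-Python sketch of filtering, ID remapping and ratio splitting on `(user, item)` pairs. All function names are hypothetical, and the details are assumptions: items are filtered before users (following the log order above), filtering is a single pass, and the split is done per user at random. reckit's `Preprocessor` may behave differently.

```python
import random
from collections import Counter, defaultdict


def filter_data(pairs, user_min=5, item_min=5):
    """Drop rare items first, then rare users (single pass, as an assumption)."""
    item_count = Counter(item for _, item in pairs)
    pairs = [(u, i) for u, i in pairs if item_count[i] >= item_min]
    user_count = Counter(user for user, _ in pairs)
    return [(u, i) for u, i in pairs if user_count[u] >= user_min]


def remap_ids(pairs):
    """Map raw user and item IDs to consecutive integers starting from 0."""
    user2id, item2id = {}, {}
    remapped = []
    for user, item in pairs:
        uid = user2id.setdefault(user, len(user2id))
        iid = item2id.setdefault(item, len(item2id))
        remapped.append((uid, iid))
    return remapped, user2id, item2id


def split_by_ratio(pairs, train=0.7, seed=2020):
    """Randomly split each user's interactions into train and test parts."""
    rng = random.Random(seed)
    per_user = defaultdict(list)
    for user, item in pairs:
        per_user[user].append(item)
    train_pairs, test_pairs = [], []
    for user, items in per_user.items():
        rng.shuffle(items)
        cut = int(len(items) * train)
        train_pairs += [(user, i) for i in items[:cut]]
        test_pairs += [(user, i) for i in items[cut:]]
    return train_pairs, test_pairs


# Toy data: users "a" and "b" rate items 1..6; user "c" rates items 1 and 99.
raw = [(u, i) for u in ("a", "b") for i in range(1, 7)] + [("c", 1), ("c", 99)]
kept = filter_data(raw, user_min=2, item_min=2)  # drops item 99, then user "c"
remapped, user2id, item2id = remap_ids(kept)
train_pairs, test_pairs = split_by_ratio(remapped, train=0.5)
print(len(kept), len(train_pairs), len(test_pairs))  # prints: 12 6 6
```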


## Dataset

In [3]:
import os

config = Configurator()
config.add_config("NeuRec.ini", section="NeuRec")  # read basic settings
# config.parse_cmd()

model_cfg = os.path.join("conf", "MF.ini")  # model cfg path
config.add_config(model_cfg, section="hyperparameters", used_as_summary=True)


The data files use the name of `data_dir` as their prefix, and their extensions are `train`, `valid`, `test`, `user2id`, and `item2id`.

Directory structure:

    data_dir
        ├── data_dir.train      // training data
        ├── data_dir.valid      // validation data, optional
        ├── data_dir.test       // test data
        ├── data_dir.user2id    // user to id, optional
        └── data_dir.item2id    // item to id, optional
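
The naming convention above can be expressed as a small helper that builds the expected file paths. `data_file_paths` is a hypothetical function for illustration; it assumes the file prefix is the last component of `data_dir`, which matches the dataset name shown in the output below.

```python
import os

EXTENSIONS = ("train", "valid", "test", "user2id", "item2id")


def data_file_paths(data_dir):
    """Hypothetical helper: map each extension to its expected file path."""
    prefix = os.path.basename(os.path.normpath(data_dir))
    return {ext: os.path.join(data_dir, f"{prefix}.{ext}") for ext in EXTENSIONS}


paths = data_file_paths("dataset/ml-100k_ratio_u5_i5")
for ext, path in paths.items():
    status = "found" if os.path.isfile(path) else "missing"
    print(f"{ext:8} {path} ({status})")
```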

In [4]:
from data import Dataset
dataset = Dataset(config.data_dir, config.sep, config.file_column)
print(dataset)


Dataset statistics:
Name: ml-100k_ratio_u5_i5
The number of users: 943
The number of items: 1349
The number of ratings: 99287
Average actions of users: 105.29
Average actions of items: 73.60
The sparsity of the dataset: 92.195075%

The number of training: 69918
The number of validation: 0
The number of testing: 29369
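
The reported statistics follow directly from the three counts. The sketch below recomputes them with the usual definitions (average actions as ratings per user or per item, sparsity as one minus the density of the interaction matrix). The function name is hypothetical.

```python
def dataset_statistics(num_users, num_items, num_ratings):
    """Recompute the summary statistics from raw counts."""
    density = num_ratings / (num_users * num_items)
    return {
        "avg_user_actions": num_ratings / num_users,
        "avg_item_actions": num_ratings / num_items,
        "sparsity_percent": (1.0 - density) * 100.0,
    }


stats = dataset_statistics(num_users=943, num_items=1349, num_ratings=99287)
print(f"Average actions of users: {stats['avg_user_actions']:.2f}")      # 105.29
print(f"Average actions of items: {stats['avg_item_actions']:.2f}")      # 73.60
print(f"The sparsity of the dataset: {stats['sparsity_percent']:.6f}%")  # 92.195075%
```

The training and testing counts also add up to the total: 69918 + 29369 = 99287.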


## Logger

`Logger` prints a message to standard output and simultaneously writes it to the file named `filename`.
This makes it convenient to monitor and save training results.
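
The same idea can be reproduced with the standard `logging` module by attaching one handler for standard output and one for a file. This is a sketch of the technique only; `make_logger` is a hypothetical helper, not reckit's `Logger`, and the timestamp format is an assumption.

```python
import logging
import os
import sys


def make_logger(filename):
    """Hypothetical helper: log every message to stdout and to `filename`."""
    log_dir = os.path.dirname(filename)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)  # create the log directory if needed

    logger = logging.getLogger(filename)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:  # do not attach handlers twice for the same file
        formatter = logging.Formatter("%(asctime)s: %(message)s")
        handlers = (logging.StreamHandler(sys.stdout), logging.FileHandler(filename))
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

A call such as `make_logger(os.path.join("log", "demo.log")).info("epoch 0")` would then print the line and append it to the file.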

In [5]:
from reckit import Logger
import time
# create logger filename
data_name = dataset.data_name  # dataset name
timestamp = time.time()  # run time
model_name = config.recommender  # model name
model_param = config.summarize()  # return a string of model parameters
param_str = f"{data_name}_{model_name}_{model_param}"
run_id = f"{param_str[:150]}_{timestamp:.8f}"

log_dir = os.path.join("log", data_name, model_name)  # logger directory
logger_name = os.path.join(log_dir, run_id + ".log")  # full path
logger = Logger(logger_name)

logger.info(dataset)  # show and write dataset info
logger.info(config)  # show and write config info


2020-09-07 09:15:45.762: Dataset statistics:
Name: ml-100k_ratio_u5_i5
The number of users: 943
The number of items: 1349
The number of ratings: 99287
Average actions of users: 105.29
Average actions of items: 73.60
The sparsity of the dataset: 92.195075%

The number of training: 69918
The number of validation: 0
The number of testing: 29369
2020-09-07 09:15:45.763: NeuRec:[NeuRec]:
recommender=MF
platform=pytorch
data_dir=dataset/ml-100k_ratio_u5_i5
file_column=UIRT
sep='\t'
gpu_id=0
gpu_mem=0.99
metric=["Recall", "NDCG"]
top_k=[10,20]
test_thread=4
test_batch_size=64
seed=2020

MF:[hyperparameters]:
lr=0.001
reg=0.001
embedding_size=64
batch_size=1024
epochs=500
is_pairwise=True
loss_func=bpr
param_init=normal


## Evaluator

The metrics of `Evaluator` are configurable, and the evaluator automatically adapts to both leave-one-out and fold-out data splits without any extra settings.
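
For reference, the two metrics used in this example can be computed for a single user as sketched below. These are the common textbook definitions with binary relevance; the exact normalization used inside reckit's `Evaluator` is an assumption and may differ.

```python
import math


def recall_at_k(ranked_items, ground_truth, k):
    """Fraction of the user's test items that appear in the top-k list."""
    hits = sum(1 for item in ranked_items[:k] if item in ground_truth)
    return hits / len(ground_truth)


def ndcg_at_k(ranked_items, ground_truth, k):
    """DCG of the top-k list divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k])
              if item in ground_truth)
    ideal_hits = min(k, len(ground_truth))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


ranked = [3, 1, 7, 5]  # items sorted by predicted score, best first
truth = {1, 5, 9}      # held-out test items of the user
print(recall_at_k(ranked, truth, k=3), ndcg_at_k(ranked, truth, k=3))
```

With `k=3` only item `1` is hit, at rank 2, so recall is 1/3 and the DCG is `1 / log2(3)`.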

In [6]:
from reckit import Evaluator

user_train_dict = dataset.train_data.to_user_dict()
user_test_dict = dataset.test_data.to_user_dict()
evaluator = Evaluator(user_train_dict, user_test_dict,
                      metric=config.metric, top_k=config.top_k,
                      batch_size=config.test_batch_size,
                      num_thread=config.test_thread)
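
The evaluator receives both the training and the test interactions. A plausible reason, stated here as an assumption rather than documented behavior, is that items already seen in training are excluded before the top-k list is built. A minimal sketch of that step with the hypothetical helper `top_k_items`:

```python
import heapq


def top_k_items(scores, train_items, k):
    """Return the k best-scoring item IDs, skipping items seen in training."""
    candidates = ((score, item) for item, score in enumerate(scores)
                  if item not in train_items)
    return [item for _, item in heapq.nlargest(k, candidates)]


scores = [0.9, 0.1, 0.8, 0.7]  # predicted scores for items 0..3
print(top_k_items(scores, train_items={0}, k=2))  # prints: [2, 3]
```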


## PairwiseSampler

`PairwiseSampler` wraps the training data of a `Dataset` to sample negative items and construct training instances.
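
Negative sampling for pairwise training can be sketched in a few lines: for every observed (user, positive item) pair, draw items uniformly at random until one is found that the user has not interacted with. The function below is a hypothetical illustration of that idea, not the implementation of `PairwiseSampler`; it assumes every user has at least one unseen item.

```python
import random


def sample_pairwise(user_pos, num_items, num_neg=1, seed=None):
    """Build (user, positive item, negative item) training triples."""
    rng = random.Random(seed)
    triples = []
    for user, pos_items in user_pos.items():
        pos_set = set(pos_items)
        for pos in pos_items:
            for _ in range(num_neg):
                neg = rng.randrange(num_items)
                while neg in pos_set:  # reject items the user already has
                    neg = rng.randrange(num_items)
                triples.append((user, pos, neg))
    return triples


user_pos = {0: [1, 2], 1: [0]}  # toy training data: user -> positive items
triples = sample_pairwise(user_pos, num_items=5, num_neg=2, seed=2020)
print(len(triples))  # prints: 6
```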

In [7]:
from data import PairwiseSampler

data_iter = PairwiseSampler(dataset.train_data, num_neg=1,
                            batch_size=config["batch_size"], 
                            shuffle=True, drop_last=False)


## Matrix Factorization with PyTorch

In [8]:
# define model
import numpy as np  # used by MF.predict
import torch
import torch.nn as nn
from util.pytorch import inner_product
from util.pytorch import get_initializer


class MF(nn.Module):
    def __init__(self, num_users, num_items, embed_dim):
        super(MF, self).__init__()

        # user and item embeddings
        self.user_embeddings = nn.Embedding(num_users, embed_dim)
        self.item_embeddings = nn.Embedding(num_items, embed_dim)

        self.item_biases = nn.Embedding(num_items, 1)

        # weight initialization
        self.reset_parameters()

    def reset_parameters(self, init_method="uniform"):
        init = get_initializer(init_method)
        zero_init = get_initializer("zeros")
        init(self.user_embeddings.weight)
        init(self.item_embeddings.weight)
        zero_init(self.item_biases.weight)

    def forward(self, user_ids, item_ids):
        user_embs = self.user_embeddings(user_ids)
        item_embs = self.item_embeddings(item_ids)
        item_bias = self.item_biases(item_ids)
        ratings = inner_product(user_embs, item_embs) + torch.squeeze(item_bias)
        return ratings

    def predict(self, user_ids):
        user_ids = torch.from_numpy(np.asarray(user_ids)).long().to(self.item_embeddings.weight.device)
        user_embs = self.user_embeddings(user_ids)
        ratings = torch.matmul(user_embs, self.item_embeddings.weight.T)
        ratings += torch.squeeze(self.item_biases.weight)
        return ratings.cpu().detach().numpy()


In [9]:
# training
from util.pytorch import pairwise_loss
from util.common import Reduction
from util.pytorch import l2_loss
import numpy as np

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
mf = MF(dataset.num_users, dataset.num_items, config["embedding_size"]).to(device)
mf.reset_parameters(config["param_init"])
optimizer = torch.optim.Adam(mf.parameters(), lr=config["lr"])

logger.info(evaluator.metrics_info())  # show the metrics information
for epoch in range(20):
    mf.train()
    for bat_users, bat_pos_items, bat_neg_items in data_iter:
        bat_users = torch.from_numpy(bat_users).long().to(device)
        bat_pos_items = torch.from_numpy(bat_pos_items).long().to(device)
        bat_neg_items = torch.from_numpy(bat_neg_items).long().to(device)
        yui = mf(bat_users, bat_pos_items)
        yuj = mf(bat_users, bat_neg_items)

        loss = pairwise_loss("bpr", yui-yuj, reduction=Reduction.SUM)
        reg_loss = l2_loss(mf.user_embeddings(bat_users),
                           mf.item_embeddings(bat_pos_items),
                           mf.item_embeddings(bat_neg_items),
                           mf.item_biases(bat_pos_items),
                           mf.item_biases(bat_neg_items))
        loss += config["reg"] * reg_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    mf.eval()
    result = evaluator.evaluate(mf)
    logger.info("epoch %d:\t%s" % (epoch, result))
print("done")


2020-09-07 09:15:48.084: metrics:	Recall@10   	Recall@20   	NDCG@10     	NDCG@20     
2020-09-07 09:15:48.582: epoch 0:	0.10472979  	0.16936544  	0.28584769  	0.27690309  
2020-09-07 09:15:48.956: epoch 1:	0.12934996  	0.19387847  	0.35781866  	0.33092606  
2020-09-07 09:15:49.297: epoch 2:	0.13976939  	0.20833862  	0.37019661  	0.34513870  
2020-09-07 09:15:49.636: epoch 3:	0.14822304  	0.22378592  	0.37593025  	0.35645196  
2020-09-07 09:15:49.997: epoch 4:	0.15608349  	0.23660547  	0.38383657  	0.36740601  
2020-09-07 09:15:50.336: epoch 5:	0.15860106  	0.24483863  	0.39068785  	0.37650907  
2020-09-07 09:15:50.690: epoch 6:	0.16183163  	0.24846396  	0.39420494  	0.38145208  
2020-09-07 09:15:51.027: epoch 7:	0.16267784  	0.25457168  	0.39702213  	0.38693318  
2020-09-07 09:15:51.367: epoch 8:	0.16336972  	0.25615507  	0.39993602  	0.38893095  
2020-09-07 09:15:51.729: epoch 9:	0.16387241  	0.25937146  	0.40065935  	0.39105812  
2020-09-07 09:15:52.070: epoch 10:	0.16473994  	0.2596
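
The model above scores a user-item pair as the inner product of their embeddings plus an item bias, and the training loop compares the scores of a positive and a negative item. The sketch below shows that computation on plain Python lists, using the standard BPR definition `-log(sigmoid(pos - neg))`. Whether `pairwise_loss("bpr", ...)` uses exactly this form is an assumption, and the L2 regularization term is left out.

```python
import math


def score(user_vec, item_vec, item_bias):
    """Matrix factorization score: inner product plus item bias."""
    return sum(p * q for p, q in zip(user_vec, item_vec)) + item_bias


def bpr_loss(pos_score, neg_score):
    """-log(sigmoid(pos - neg)), computed stably as softplus(neg - pos)."""
    x = neg_score - pos_score
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


user = [0.5, -1.0]
y_ui = score(user, [1.0, -0.5], item_bias=0.1)  # positive item
y_uj = score(user, [-0.2, 0.4], item_bias=0.0)  # negative item
print(bpr_loss(y_ui, y_uj))
```

The loss shrinks toward zero as the positive item is scored further above the negative one, and equals `log(2)` when both scores are equal.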

## Matrix Factorization with TensorFlow

In [10]:
# define model
import tensorflow as tf
from util.common import Reduction  # needed by pairwise_loss below
from util.tensorflow import inner_product, l2_loss
from util.tensorflow import pairwise_loss
from util.tensorflow import get_initializer, get_session


class MF(object):
    def __init__(self, config, num_users, num_items):
        self.emb_size = config["embedding_size"]
        self.lr = config["lr"]
        self.reg = config["reg"]
        self.epochs = config["epochs"]
        self.batch_size = config["batch_size"]
        self.param_init = config["param_init"]
        self.loss_func = config["loss_func"]

        self.num_users, self.num_items = num_users, num_items

        self._build_model()
        self.sess = get_session(config["gpu_mem"])

    def _create_variable(self):
        self.user_ph = tf.placeholder(tf.int32, [None], name="user")
        self.pos_item_ph = tf.placeholder(tf.int32, [None], name="pos_item")
        self.neg_item_ph = tf.placeholder(tf.int32, [None], name="neg_item")
        self.label_ph = tf.placeholder(tf.float32, [None], name="label")

        # embedding layers
        init = get_initializer(self.param_init)
        zero_init = get_initializer("zeros")
        self.user_embeddings = tf.Variable(init([self.num_users, self.emb_size]),
                                           name="user_embedding")
        self.item_embeddings = tf.Variable(init([self.num_items, self.emb_size]),
                                           name="item_embedding")
        self.item_biases = tf.Variable(zero_init([self.num_items]), name="item_bias")

    def _build_model(self):
        self._create_variable()
        user_emb = tf.nn.embedding_lookup(self.user_embeddings, self.user_ph)
        pos_item_emb = tf.nn.embedding_lookup(self.item_embeddings, self.pos_item_ph)
        neg_item_emb = tf.nn.embedding_lookup(self.item_embeddings, self.neg_item_ph)
        pos_bias = tf.gather(self.item_biases, self.pos_item_ph)
        neg_bias = tf.gather(self.item_biases, self.neg_item_ph)

        yi_hat = inner_product(user_emb, pos_item_emb) + pos_bias
        yj_hat = inner_product(user_emb, neg_item_emb) + neg_bias

        # pairwise (BPR) loss and L2 regularization
        model_loss = pairwise_loss("bpr", yi_hat-yj_hat, reduction=Reduction.SUM)
        reg_loss = l2_loss(user_emb, pos_item_emb, pos_bias, neg_item_emb, neg_bias)

        final_loss = model_loss + self.reg * reg_loss

        self.train_opt = tf.train.AdamOptimizer(self.lr).minimize(final_loss, name="train_opt")

        # for evaluation
        u_emb = tf.nn.embedding_lookup(self.user_embeddings, self.user_ph)
        self.batch_ratings = tf.matmul(u_emb, self.item_embeddings, transpose_b=True) + self.item_biases

    def predict(self, users):
        all_ratings = self.sess.run(self.batch_ratings, feed_dict={self.user_ph: users})
        return all_ratings
    





In [11]:
# training
num_users, num_items = dataset.num_users, dataset.num_items
mf = MF(config, num_users, num_items)
logger.info(evaluator.metrics_info())
for epoch in range(20):
    for bat_users, bat_pos_items, bat_neg_items in data_iter:
        feed = {mf.user_ph: bat_users,
                mf.pos_item_ph: bat_pos_items,
                mf.neg_item_ph: bat_neg_items}
        mf.sess.run(mf.train_opt, feed_dict=feed)
    result = evaluator.evaluate(mf)
    logger.info("epoch %d:\t%s" % (epoch, result))
print("done")






2020-09-07 09:15:56.572: metrics:	Recall@10   	Recall@20   	NDCG@10     	NDCG@20     
2020-09-07 09:15:56.883: epoch 0:	0.09928416  	0.16243352  	0.28622663  	0.27297863  
2020-09-07 09:15:57.095: epoch 1:	0.11891628  	0.17962168  	0.33301952  	0.30941290  
2020-09-07 09:15:57.291: epoch 2:	0.12585293  	0.19218649  	0.34554151  	0.32398596  
2020-09-07 09:15:57.499: epoch 3:	0.13744630  	0.20872766  	0.35996604  	0.34028333  
2020-09-07 09:15:57.699: epoch 4:	0.15097035  	0.22414714  	0.37924343  	0.35717216  
2020-09-07 09:15:57.916: epoch 5:	0.15545547  	0.23639758  	0.38656548  	0.36843264  
2020-09-07 09:15:58.153: epoch 6:	0.16020486  	0.24717879  	0.39365819  	0.37842610  
2020-09-07 09:15:58.363: epoch 7:	0.16115277  	0.25156903  	0.39415100  	0.38187385  
2020-09-07 09:15:58.580: epoch 8:	0.16128425  	0.25755256  	0.39690319  	0.38882911  
2020-09-07 09:15:58.797: epoch 9:	0.16385663  	0.25982428  	0.40062752  	0.39204168  
2020-09-07 09:15:59.054: epoch 10:	0.16487202  	0.