<a href="https://colab.research.google.com/github/wendy60/Hybrid-recommender-system/blob/second-submit/NRMS_Neural_News_Recommendation_with_Multi_Head_Self_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I use an open-source project from GitHub, so I reproduce its copyright notice below. This notebook uses the public MIND dataset and Microsoft's recommenders Python package.

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# **Global settings and imports**

In [None]:
%pip install recommenders

Collecting recommenders
  Downloading recommenders-0.7.0-py3-none-manylinux1_x86_64.whl (314 kB)
[?25l[K     |█                               | 10 kB 26.6 MB/s eta 0:00:01[K     |██                              | 20 kB 24.0 MB/s eta 0:00:01[K     |███▏                            | 30 kB 11.1 MB/s eta 0:00:01[K     |████▏                           | 40 kB 9.4 MB/s eta 0:00:01[K     |█████▏                          | 51 kB 5.1 MB/s eta 0:00:01[K     |██████▎                         | 61 kB 5.7 MB/s eta 0:00:01[K     |███████▎                        | 71 kB 5.4 MB/s eta 0:00:01[K     |████████▍                       | 81 kB 6.1 MB/s eta 0:00:01[K     |█████████▍                      | 92 kB 4.7 MB/s eta 0:00:01[K     |██████████▍                     | 102 kB 5.0 MB/s eta 0:00:01[K     |███████████▌                    | 112 kB 5.0 MB/s eta 0:00:01[K     |████████████▌                   | 122 kB 5.0 MB/s eta 0:00:01[K     |█████████████▌                  | 133 kB 

In [None]:
%pip install tensorflow-gpu==1.15.2

Collecting tensorflow-gpu==1.15.2
  Downloading tensorflow_gpu-1.15.2-cp37-cp37m-manylinux2010_x86_64.whl (410.9 MB)
Collecting tensorboard<1.16.0,>=1.15.0
  Downloading tensorboard-1.15.0-py3-none-any.whl (3.8 MB)
Collecting tensorflow-estimator==1.15.1
  Downloading tensorflow_estimator-1.15.1-py2.py3-none-any.whl (503 kB)
Collecting gast==0.2.2
  Downloading gast-0.2.2.tar.gz (10 kB)
Collecting keras-applications>=1.0.8
  Downloading Keras_Applications-1.0.8-py3-none-any.whl (50 kB)
Building wheels for collected packages: gast
  Building wheel for gast (setup.py) ... done
  Created wheel for gast: filename=gast-0.2.2-py3-none-any.whl size=7554 sha256=d4961145cac902be6c8e4959dd30af11f67cf49cb12c3c76238de831e881e653
  Stored in directo

In [None]:
import sys
import os
import numpy as np
import zipfile
from tqdm import tqdm
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))


System version: 3.7.12 (default, Sep 10 2021, 00:21:48) 
[GCC 7.5.0]
Tensorflow version: 1.15.2


# **Prepare parameters**

In [None]:
epochs = 5
seed = 42
batch_size = 32

# Options: demo, small, large
MIND_type = 'demo'

# **Download and load data**

In [None]:
tmpdir = TemporaryDirectory()
data_path = tmpdir.name

# Training data
train_news_file = os.path.join(data_path, 'train', 'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', 'behaviors.tsv')
# Validation data
valid_news_file = os.path.join(data_path, 'valid', 'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', 'behaviors.tsv')
# Word embedding and dictionary files
wordEmb_file = os.path.join(data_path, 'utils', 'embedding.npy')
userDict_file = os.path.join(data_path, 'utils', 'uid2index.pkl')
wordDict_file = os.path.join(data_path, 'utils', 'word_dict.pkl')
# Model configuration file
yaml_file = os.path.join(data_path, 'utils', 'nrms.yaml')

mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

# Download each resource only if it is not already present
if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)

if not os.path.exists(yaml_file):
    download_deeprec_resources('https://recodatasets.z20.web.core.windows.net/newsrec/',
                               os.path.join(data_path, 'utils'), mind_utils)

100%|██████████| 17.0k/17.0k [00:01<00:00, 12.4kKB/s]
100%|██████████| 9.84k/9.84k [00:01<00:00, 9.19kKB/s]
100%|██████████| 95.0k/95.0k [00:04<00:00, 22.2kKB/s]
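
For orientation, each line of a MIND behaviors.tsv file is tab-separated and describes one impression: impression ID, user ID, timestamp, click history (space-separated news IDs), and the impression list (space-separated `newsID-label` pairs, where label 1 means clicked). The sketch below parses one such line. The sample line and the helper name `parse_behavior_line` are invented for illustration and are not part of the recommenders API.

```python
from typing import Dict, List, Tuple


def parse_behavior_line(line: str) -> Dict[str, object]:
    """Split one behaviors.tsv line into its five tab-separated fields."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    candidates: List[Tuple[str, int]] = []
    for item in impressions.split():
        # Each candidate looks like "N11-1": news ID, dash, click label
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": history.split(),
        "candidates": candidates,
    }


# Hypothetical sample line (IDs invented for illustration)
sample = "1\tU100\t11/15/2019 8:55:22 AM\tN1 N2 N3\tN10-0 N11-1 N12-0"
parsed = parse_behavior_line(sample)
print(parsed["history"])
print(parsed["candidates"])
```

The MINDIterator used later does this parsing internally, so the helper is only meant to make the file layout concrete.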


# **Create hyper-parameters**

In [None]:
hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs,
                          show_step=10)
print(hparams)

data_format=news,iterator_type=None,support_quick_scoring=True,wordEmb_file=/tmp/tmp0c5hadae/utils/embedding.npy,wordDict_file=/tmp/tmp0c5hadae/utils/word_dict.pkl,userDict_file=/tmp/tmp0c5hadae/utils/uid2index.pkl,vertDict_file=None,subvertDict_file=None,title_size=30,body_size=None,word_emb_dim=300,word_size=None,user_num=None,vert_num=None,subvert_num=None,his_size=50,npratio=4,dropout=0.2,attention_hidden_dim=200,head_num=20,head_dim=20,cnn_activation=None,dense_activation=None,filter_num=200,window_size=3,vert_emb_dim=100,subvert_emb_dim=100,gru_unit=400,type=ini,user_emb_dim=50,learning_rate=0.0001,loss=cross_entropy_loss,optimizer=adam,epochs=5,batch_size=32,show_step=10,metrics=['group_auc', 'mean_mrr', 'ndcg@5;10']
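
The printed hyper-parameters include `title_size=30`, `word_emb_dim=300`, `head_num=20` and `head_dim=20`. To make those shapes concrete, here is a minimal NumPy sketch of multi-head self-attention over the words of one title. It uses random weights and the common scaled dot-product formulation. This is an illustration under those assumptions, not the Keras layer that recommenders actually builds.

```python
import numpy as np


def multi_head_self_attention(x, wq, wk, wv, head_num, head_dim):
    """x: (seq_len, emb_dim); wq/wk/wv: (emb_dim, head_num * head_dim)."""
    seq_len = x.shape[0]

    def split_heads(t):
        # (seq_len, head_num * head_dim) -> (head_num, seq_len, head_dim)
        return t.reshape(seq_len, head_num, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
    # Scaled dot-product scores per head: (head_num, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over attended words
    # Weighted sum of values, then concatenate the heads again
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, head_num * head_dim)
    return out, weights


rng = np.random.default_rng(42)
title_size, word_emb_dim, head_num, head_dim = 30, 300, 20, 20
x = rng.normal(size=(title_size, word_emb_dim))  # stand-in for word embeddings
wq, wk, wv = (
    rng.normal(scale=0.05, size=(word_emb_dim, head_num * head_dim)) for _ in range(3)
)
out, weights = multi_head_self_attention(x, wq, wk, wv, head_num, head_dim)
print(out.shape)      # (30, 400)
print(weights.shape)  # (20, 30, 30)
```

With 20 heads of 20 dimensions each, every word ends up with a 400-dimensional contextual representation.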


In [None]:
iterator = MINDIterator
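
The iterator is also where negative sampling happens during training. With `npratio=4` in the hyper-parameters, each clicked news item is paired with four non-clicked items from the same impression. The following is a rough sketch of that idea, assuming padding with a placeholder when an impression has fewer than four negatives. The function names and the `<PAD>` token are hypothetical, and the library's own iterator may differ in detail.

```python
import random


def sample_negatives(negatives, npratio, pad="<PAD>", rng=random):
    """Pick `npratio` non-clicked news IDs; pad when the impression has too few."""
    if len(negatives) < npratio:
        return list(negatives) + [pad] * (npratio - len(negatives))
    return rng.sample(list(negatives), npratio)


def build_training_examples(candidates, npratio=4, rng=random):
    """candidates: list of (news_id, label). Returns one example per clicked item."""
    positives = [news_id for news_id, label in candidates if label == 1]
    negatives = [news_id for news_id, label in candidates if label == 0]
    examples = []
    for pos in positives:
        negs = sample_negatives(negatives, npratio, rng=rng)
        # The clicked item goes first, so the target vector is [1, 0, 0, 0, 0]
        examples.append(([pos] + negs, [1] + [0] * npratio))
    return examples


rng = random.Random(42)
# Hypothetical impression: one click (N11) among six candidates
impression = [("N10", 0), ("N11", 1), ("N12", 0), ("N13", 0), ("N14", 0), ("N15", 0)]
examples = build_training_examples(impression, npratio=4, rng=rng)
print(examples)
```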


# **Train the NRMS model**

In [None]:
model = NRMSModel(hparams, iterator, seed=seed)


In [None]:
# Evaluate the untrained model first to get a baseline for comparison
print(model.run_eval(valid_news_file, valid_behaviors_file))


586it [00:02, 230.18it/s]
236it [00:06, 37.13it/s]
7538it [00:01, 4918.88it/s]


{'group_auc': 0.4792, 'mean_mrr': 0.2059, 'ndcg@5': 0.2045, 'ndcg@10': 0.2701}
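
The reported metrics are ranking metrics computed per impression and then averaged. A group AUC near 0.5, as in the output above, is what a random ranking would give. The sketch below shows the usual definitions for a single impression. The labels and scores are invented, and the library's own implementation may handle ties or edge cases differently.

```python
import numpy as np


def impression_auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    labels, scores = np.asarray(labels), np.asarray(scores, dtype=float)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def mrr(labels, scores):
    """Mean reciprocal rank of the clicked items within one impression."""
    order = np.argsort(scores)[::-1]
    ranked = np.asarray(labels)[order]
    reciprocal_ranks = ranked / (np.arange(len(ranked)) + 1)
    return float(reciprocal_ranks.sum() / ranked.sum())


def dcg_at_k(labels, scores, k):
    """Discounted cumulative gain of the top-k items when sorted by `scores`."""
    order = np.argsort(scores)[::-1][:k]
    gains = 2.0 ** np.asarray(labels)[order] - 1
    discounts = np.log2(np.arange(len(order)) + 2)
    return float((gains / discounts).sum())


def ndcg_at_k(labels, scores, k):
    """DCG normalized by the DCG of the ideal ordering."""
    return dcg_at_k(labels, scores, k) / dcg_at_k(labels, labels, k)


# Hypothetical impression: the clicked item is ranked third of four
labels = [0, 1, 0, 0]
scores = [0.3, 0.2, 0.9, 0.1]
print(impression_auc(labels, scores))
print(mrr(labels, scores))
print(ndcg_at_k(labels, scores, 5))
```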


In [None]:
%%time
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)


step 1080 , total_loss: 1.5143, data_loss: 1.4070: : 1086it [01:36, 11.31it/s]
586it [00:01, 501.95it/s]
236it [00:05, 42.13it/s]
7538it [00:01, 5144.49it/s]


at epoch 1
train info: logloss loss:1.5139183657823567
eval info: group_auc:0.576, mean_mrr:0.2458, ndcg@10:0.3321, ndcg@5:0.2602
at epoch 1 , train time: 96.0 eval time: 17.5


step 1080 , total_loss: 1.4159, data_loss: 1.2717: : 1086it [01:31, 11.85it/s]
586it [00:01, 496.46it/s]
236it [00:05, 42.27it/s]
7538it [00:01, 4973.85it/s]


at epoch 2
train info: logloss loss:1.4159993352591442
eval info: group_auc:0.6017, mean_mrr:0.2558, ndcg@10:0.3475, ndcg@5:0.2735
at epoch 2 , train time: 91.6 eval time: 17.5


step 1080 , total_loss: 1.3759, data_loss: 1.1641: : 1086it [01:31, 11.84it/s]
586it [00:01, 504.42it/s]
236it [00:05, 41.86it/s]
7538it [00:01, 5074.44it/s]


at epoch 3
train info: logloss loss:1.375670265536721
eval info: group_auc:0.6052, mean_mrr:0.2634, ndcg@10:0.3553, ndcg@5:0.2825
at epoch 3 , train time: 91.7 eval time: 17.4


step 1080 , total_loss: 1.3502, data_loss: 1.1403: : 1086it [01:31, 11.83it/s]
586it [00:01, 500.29it/s]
236it [00:05, 42.28it/s]
7538it [00:01, 4925.39it/s]


at epoch 4
train info: logloss loss:1.3503898715885085
eval info: group_auc:0.6103, mean_mrr:0.2668, ndcg@10:0.3576, ndcg@5:0.2881
at epoch 4 , train time: 91.8 eval time: 17.6


step 1080 , total_loss: 1.3300, data_loss: 1.3003: : 1086it [01:31, 11.85it/s]
586it [00:01, 488.08it/s]
236it [00:05, 42.23it/s]
7538it [00:01, 4985.97it/s]


at epoch 5
train info: logloss loss:1.3302437960552687
eval info: group_auc:0.6095, mean_mrr:0.2687, ndcg@10:0.3606, ndcg@5:0.2887
at epoch 5 , train time: 91.6 eval time: 17.7
CPU times: user 9min 5s, sys: 26.2 s, total: 9min 31s
Wall time: 9min 10s


<recommenders.models.newsrec.models.nrms.NRMSModel at 0x7fc763ec4690>
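
One way to read the logged training loss is to compare it with a uniform-guess baseline. This assumes that `cross_entropy_loss` here is a softmax cross-entropy over one clicked candidate plus `npratio=4` sampled negatives, which is an assumption about the loss layout and is not confirmed by this notebook. Under that assumption, a model that scores all five candidates equally has loss ln 5. The sketch compares the logged losses with that baseline.

```python
import math

import numpy as np


def softmax_cross_entropy(scores, target_index):
    """Negative log-probability of the target under a softmax over `scores`."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target_index])


npratio = 4
# Equal scores for 1 positive + 4 negatives -> uniform probabilities of 1/5
uniform_loss = softmax_cross_entropy([0.0] * (npratio + 1), target_index=0)
print(round(uniform_loss, 4), round(math.log(npratio + 1), 4))

# Epoch losses copied from the training log of this run
logged = {1: 1.5139, 5: 1.3302}
for epoch, loss in logged.items():
    # exp(-loss) is the geometric-mean probability given to the clicked item
    print(epoch, round(math.exp(-loss), 3))
```

Both logged values fall below the uniform baseline, which is consistent with the validation metrics improving over the same epochs.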

In [None]:
%%time
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)

586it [00:01, 470.37it/s]
236it [00:05, 42.01it/s]
7538it [00:01, 5024.04it/s]


{'group_auc': 0.6095, 'mean_mrr': 0.2687, 'ndcg@5': 0.2887, 'ndcg@10': 0.3606}
CPU times: user 17.9 s, sys: 1.88 s, total: 19.8 s
Wall time: 17.6 s
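
To compare epochs at a glance, the validation metrics logged during this run can be collected into a small table. The numbers below are copied from the outputs of this notebook (epoch 0 is the evaluation before training), and the dictionary layout is just one convenient choice.

```python
# Validation metrics copied from this run's output (epoch 0 = before training)
history = {
    0: {"group_auc": 0.4792, "mean_mrr": 0.2059, "ndcg@5": 0.2045, "ndcg@10": 0.2701},
    1: {"group_auc": 0.576, "mean_mrr": 0.2458, "ndcg@5": 0.2602, "ndcg@10": 0.3321},
    2: {"group_auc": 0.6017, "mean_mrr": 0.2558, "ndcg@5": 0.2735, "ndcg@10": 0.3475},
    3: {"group_auc": 0.6052, "mean_mrr": 0.2634, "ndcg@5": 0.2825, "ndcg@10": 0.3553},
    4: {"group_auc": 0.6103, "mean_mrr": 0.2668, "ndcg@5": 0.2881, "ndcg@10": 0.3576},
    5: {"group_auc": 0.6095, "mean_mrr": 0.2687, "ndcg@5": 0.2887, "ndcg@10": 0.3606},
}
metrics = ["group_auc", "mean_mrr", "ndcg@5", "ndcg@10"]

# Print one row per epoch
print("epoch  " + "  ".join(f"{m:>9}" for m in metrics))
for epoch, values in history.items():
    print(f"{epoch:>5}  " + "  ".join(f"{values[m]:>9.4f}" for m in metrics))

# Epoch with the highest value for each metric
best_epoch = {m: max(history, key=lambda e: history[e][m]) for m in metrics}
print(best_epoch)
```

In this run, group AUC peaks at epoch 4 while the other three metrics are still highest at epoch 5. The gaps between the last two epochs are small.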
