# Neural News Recommendation 
In this notebook, we demonstrate how to use the `recommenders` library for news recommendation.

## Data format:
For quicker training and evaluation, we use MINDdemo, a sample of 5k users drawn from the [MIND small dataset](https://msnews.github.io/). MINDdemo has the same file format as MINDsmall and MINDlarge. To run experiments on MINDsmall or MINDlarge, change the download source by setting the `MIND_type` parameter to one of ['large', 'small', 'demo'].
 
**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. The training and evaluation data each consist of a news file and a behaviors file. A more detailed description of the data is available in the [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md).

### news data
This file contains news information: news ID, category, subcategory, news title, news abstract, news URL, entities in the news title, and entities in the news abstract.
One simple example: <br>

`N46466	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]`
<br>

In general, each line in the data file describes one piece of news, with tab-separated fields: <br>

`[News ID] [Category] [Subcategory] [News Title] [News Abstract] [News Url] [Entities in News Title] [Entities in News Abstract] ...`

<br>
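As an illustration of the layout above, here is a minimal sketch that splits one news line into named fields. The `NewsRecord` type and `parse_news_line` helper are hypothetical names introduced for this example and are not part of the `recommenders` library; the sample line is an abbreviated version of the example shown earlier, keeping only one title entity.

```python
import json
from typing import NamedTuple


class NewsRecord(NamedTuple):
    """Hypothetical container for the eight documented news columns."""
    news_id: str
    category: str
    subcategory: str
    title: str
    abstract: str
    url: str
    title_entities: list
    abstract_entities: list


def parse_news_line(line: str) -> NewsRecord:
    """Split one tab-separated news line into named fields."""
    fields = line.rstrip("\n").split("\t")
    news_id, category, subcategory, title, abstract, url = fields[:6]
    # The entity columns hold JSON lists; treat a missing or empty column as "no entities".
    title_entities = json.loads(fields[6]) if len(fields) > 6 and fields[6] else []
    abstract_entities = json.loads(fields[7]) if len(fields) > 7 and fields[7] else []
    return NewsRecord(news_id, category, subcategory, title, abstract, url,
                      title_entities, abstract_entities)


# Abbreviated sample built from the example line in this notebook.
sample_line = "\t".join([
    "N46466",
    "lifestyle",
    "lifestyleroyals",
    "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By",
    "Shop the notebooks, jackets, and more that the royals can't live without.",
    "https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata",
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, '
    '"OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]',
    "[]",
])

record = parse_news_line(sample_line)
print(record.news_id, record.category, record.title_entities[0]["Label"])
```

Splitting on tabs rather than on whitespace matters here, because titles and abstracts contain spaces and commas.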

We generate a word_dict file to transform the words in news titles into word indexes, and the embedding matrix is initialized from pretrained GloVe embeddings.
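To make the word-index step concrete, the sketch below maps a title to a fixed-length array of word indexes. It is only an approximation under stated assumptions: the tokenizer regex, the use of index 0 for padding and unknown words, and the `title_to_indexes` helper and `toy_word_dict` are all introduced for this example, and the library's own preprocessing may differ. The length of 10 matches the `title_size` value used by the configuration in this notebook.

```python
import re

import numpy as np

# Assumed tokenizer: runs of word characters, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"[\w]+|[.,!?;|]")


def title_to_indexes(title, word_dict, title_size=10):
    """Lowercase and tokenize a title, then look each token up in word_dict.

    Tokens beyond title_size are dropped; unknown tokens and unused
    positions are left at 0 (an assumption made for this sketch).
    """
    tokens = TOKEN_PATTERN.findall(title.lower())
    indexes = np.zeros(title_size, dtype="int32")
    for position, token in enumerate(tokens[:title_size]):
        indexes[position] = word_dict.get(token, 0)
    return indexes


# Toy dictionary for illustration only; the real word_dict is loaded from word_dict.pkl.
toy_word_dict = {"the": 1, "brands": 2, "queen": 3, "elizabeth": 4}

encoded = title_to_indexes("The Brands Queen Elizabeth, Prince Charles", toy_word_dict)
print(encoded)
```

Each index would then select one row of the embedding matrix, so a title becomes a `title_size` by embedding-dimension array.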

### behaviors data
One simple example: <br>
`1	U82271	11/11/2019 3:28:58 PM	N3130 N11621 N12917 N4574 N12140 N9748	N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`
<br>

In general, each line in the data file represents one impression. The format is: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History lists the news the user clicked before Impression Time. Impression News lists the news displayed in the impression, in the format:<br>

`[News ID 1]-[label1] ... [News ID n]-[labeln]`

<br>
Label indicates whether the user clicked the news (1) or not (0). Information about every news item in User Click History and Impression News can be found in the news data file.
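The behaviors layout can be sketched in the same way. The `parse_behavior_line` helper below is a hypothetical name introduced for illustration, not a library function, and the sample line is a shortened version of the example above with only three history items and three impression candidates.

```python
def parse_behavior_line(line):
    """Split one tab-separated behaviors line into its five documented fields."""
    impression_id, user_id, impression_time, history, impressions = (
        line.rstrip("\n").split("\t")
    )
    candidates, labels = [], []
    for item in impressions.split():
        # Each candidate looks like "N15368-1": a news ID, a dash, then the click label.
        news_id, label = item.rsplit("-", 1)
        candidates.append(news_id)
        labels.append(int(label))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": impression_time,
        "history": history.split(),
        "candidates": candidates,
        "labels": labels,
    }


# Shortened sample built from the example line in this notebook.
sample_behavior = "\t".join([
    "1",
    "U82271",
    "11/11/2019 3:28:58 PM",
    "N3130 N11621 N12917",
    "N13390-0 N7180-0 N15368-1",
])

parsed = parse_behavior_line(sample_behavior)
print(parsed["user_id"], parsed["candidates"], parsed["labels"])
```

Keeping the candidates and labels in parallel lists makes it straightforward to score all candidates of one impression together, which is how ranking metrics are typically computed.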

In [3]:
! pip install -q recommenders
! pip install -q scrapbook

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scrapbook
  Downloading scrapbook-0.5.0-py3-none-any.whl (34 kB)
Collecting papermill
  Downloading papermill-2.3.4-py3-none-any.whl (37 kB)
Collecting ansiwrap
  Downloading ansiwrap-0.8.4-py2.py3-none-any.whl (8.5 kB)
Collecting jupyter-client>=6.1.5
  Downloading jupyter_client-7.3.4-py3-none-any.whl (132 kB)
Collecting traitlets>=4.2
  Downloading traitlets-5.3.0-py3-none-any.whl (106 kB)
Collecting tornado>=6.0
  Downloading tornado-6.2-cp37-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (423 kB)
Collecting textwrap3>=0.9.2
  Downloading textwrap3-0.9.2-py2.py3-none-any.whl (12 kB)
Installing collected packages: traitlets, tornado, textwrap3, jupyter

In [4]:
import sys
import os
import numpy as np
import zipfile
from tqdm import tqdm
import scrapbook as sb
from tempfile import TemporaryDirectory
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set

## Download and load data

In [5]:
# Options: demo, small, large
MIND_type = 'demo'

tmpdir = TemporaryDirectory()
data_path = tmpdir.name

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
wordEmb_file = os.path.join(data_path, "utils", "embedding.npy")
userDict_file = os.path.join(data_path, "utils", "uid2index.pkl")
wordDict_file = os.path.join(data_path, "utils", "word_dict.pkl")
yaml_file = os.path.join(data_path, "utils", r'npa.yaml')

mind_url, mind_train_dataset, mind_dev_dataset, mind_utils = get_mind_data_set(MIND_type)

if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)

if not os.path.exists(yaml_file):
    download_deeprec_resources('https://recodatasets.z20.web.core.windows.net/newsrec/',
                               os.path.join(data_path, 'utils'), mind_utils)

100%|██████████| 17.0k/17.0k [00:05<00:00, 3.07kKB/s]
100%|██████████| 9.84k/9.84k [00:04<00:00, 2.36kKB/s]
100%|██████████| 95.0k/95.0k [00:13<00:00, 7.11kKB/s]


## Define hyperparameters for training

In [6]:
epochs = 5
seed = 42
batch_size = 32

hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=batch_size,
                          epochs=epochs)
print(hparams)

HParams object with values {'support_quick_scoring': False, 'dropout': 0.2, 'attention_hidden_dim': 200, 'head_num': 4, 'head_dim': 100, 'filter_num': 400, 'window_size': 3, 'vert_emb_dim': 100, 'subvert_emb_dim': 100, 'gru_unit': 400, 'type': 'ini', 'user_emb_dim': 100, 'learning_rate': 0.0001, 'optimizer': 'adam', 'epochs': 5, 'batch_size': 32, 'show_step': 100000, 'title_size': 10, 'his_size': 50, 'data_format': 'news', 'npratio': 4, 'metrics': ['group_auc', 'mean_mrr', 'ndcg@5;10'], 'word_emb_dim': 300, 'cnn_activation': 'relu', 'model_type': 'npa', 'loss': 'cross_entropy_loss', 'wordEmb_file': '/tmp/tmpmavt_1t_/utils/embedding.npy', 'wordDict_file': '/tmp/tmpmavt_1t_/utils/word_dict.pkl', 'userDict_file': '/tmp/tmpmavt_1t_/utils/uid2index.pkl'}


## Create models

In [7]:
from recommenders.models.newsrec.models.npa import NPAModel
from recommenders.models.newsrec.models.nrms import NRMSModel

In [8]:
iterator = MINDIterator
model = NPAModel(hparams, iterator, seed=seed)



In [None]:
# train the model according to your configuration
model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)

1086it [01:27, 12.39it/s]
8874it [01:12, 121.60it/s]


at epoch 1
train info: logloss loss:1.5229833546483933
eval info: group_auc:0.5722, mean_mrr:0.2441, ndcg@10:0.3286, ndcg@5:0.2591
at epoch 1 , train time: 87.7 eval time: 80.2


1086it [01:11, 15.18it/s]
8874it [01:11, 124.18it/s]


at epoch 2
train info: logloss loss:1.4103396817026437
eval info: group_auc:0.5905, mean_mrr:0.2606, ndcg@10:0.3484, ndcg@5:0.284
at epoch 2 , train time: 71.6 eval time: 78.7


1086it [01:11, 15.17it/s]
8874it [01:11, 124.04it/s]


at epoch 3
train info: logloss loss:1.3518082941873737
eval info: group_auc:0.5982, mean_mrr:0.2694, ndcg@10:0.3573, ndcg@5:0.2931
at epoch 3 , train time: 71.6 eval time: 78.8


1086it [01:11, 15.19it/s]
8874it [01:11, 124.24it/s]


at epoch 4
train info: logloss loss:1.2990738118331515
eval info: group_auc:0.5933, mean_mrr:0.2672, ndcg@10:0.3533, ndcg@5:0.288
at epoch 4 , train time: 71.5 eval time: 78.7


1086it [01:11, 15.19it/s]
8874it [01:11, 124.58it/s]


at epoch 5
train info: logloss loss:1.2647306379582561
eval info: group_auc:0.5944, mean_mrr:0.2684, ndcg@10:0.3554, ndcg@5:0.2888
at epoch 5 , train time: 71.5 eval time: 78.5


<recommenders.models.newsrec.models.npa.NPAModel at 0x7f79f92d5610>

In [None]:
res_syn = model.run_eval(valid_news_file, valid_behaviors_file)
print(res_syn)

8874it [01:10, 124.99it/s]


{'group_auc': 0.5944, 'mean_mrr': 0.2684, 'ndcg@5': 0.2888, 'ndcg@10': 0.3554}
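
For readers unfamiliar with the reported metrics, the following sketch computes AUC, MRR, and nDCG@k for a single impression using their common textbook definitions. It assumes that the reported values are averages of such per-impression scores over all evaluation impressions; the function names and the toy labels and scores are made up for this example, and the library's exact implementation may differ.

```python
import numpy as np


def auc_score(y_true, y_score):
    """Fraction of (clicked, non-clicked) pairs ranked in the right order; ties count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


def mrr_score(y_true, y_score):
    """Mean reciprocal rank of the clicked items within one impression."""
    order = np.argsort(y_score)[::-1]
    ranked = np.take(y_true, order)
    reciprocal_ranks = ranked / (np.arange(len(ranked)) + 1)
    return np.sum(reciprocal_ranks) / np.sum(ranked)


def dcg_score(y_true, y_score, k=10):
    """Discounted cumulative gain of the top-k items ranked by y_score."""
    order = np.argsort(y_score)[::-1]
    ranked = np.take(y_true, order[:k])
    gains = 2 ** ranked - 1
    discounts = np.log2(np.arange(len(ranked)) + 2)
    return np.sum(gains / discounts)


def ndcg_score(y_true, y_score, k=10):
    """DCG normalized by the DCG of the ideal ordering."""
    return dcg_score(y_true, y_score, k) / dcg_score(y_true, y_true, k)


# Toy impression: four candidates, the third one was clicked.
labels = np.array([0, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.3, 0.2])

auc = auc_score(labels, scores)
mrr = mrr_score(labels, scores)
ndcg5 = ndcg_score(labels, scores, k=5)
print(auc, mrr, ndcg5)
```

In this toy impression the clicked item is ranked second, so its reciprocal rank is 1/2 and it outranks two of the three non-clicked candidates.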
