<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# DKN: Deep Knowledge-Aware Network for News Recommendation
DKN \[1\] is a deep learning model that incorporates information from a knowledge graph for better news recommendation. Specifically, DKN uses a TransX \[2\] method for knowledge graph representation learning, then applies a CNN framework, named KCNN, to combine entity embeddings with word embeddings and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer.

## Properties of DKN:
- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. 
- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knowledge-level representations of news articles.
- DKN uses an attention module to dynamically calculate a user's aggregated historical representation.

## Reference environment 
We tested this notebook with two environment settings: CPU with tensorflow==1.15.2 and GPU with tensorflow==1.15.2, on July 1, 2020.

## Data format:
### DKN takes the following files as input:
- training / validation / test files: each line in these files represents one instance. The impression ID is used to evaluate performance within an impression session, so it is only needed for evaluation; you can set it to 0 for training data. The format is: <br> 
`[label] [userid] [CandidateNews]%[impressionid] `<br> 
e.g., `1 train_U1 N1%0` <br> 
- user history file: each line in this file represents one user's click history. Set the `history_size` parameter in the config file to the maximum number of clicks kept per user. If a user's click history is longer than `history_size`, only the last `history_size` clicks are kept; if it is shorter, it is automatically padded with 0. The format is: <br> 
`[Userid] [newsid1,newsid2...]`<br>
e.g., `train_U1 N1,N2` <br> 
- document feature file:
It contains the word and entity features of news articles. Each article is represented by its (aligned) title words and title entities. For example, if a news title is "Trump to deliver State of the Union address next week", the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entities value may be `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of the entity vector is non-zero, which corresponds to the word "Trump". Word and entity values are hashed to integers from 1 to n (where n is the number of distinct words or entities). Each feature must have a fixed length k (the `doc_size` parameter): if a document has more than k words, truncate it to k words; if it has fewer, pad it with 0 at the end. 
The format is: <br> 
`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]`
- word embedding / entity embedding / context embedding files: these are npy files of pretrained embeddings. After loading, each file is a two-dimensional matrix of shape [n+1, k], where n is the number of words (or entities) in the hash dictionary and k is the embedding dimension. Row 0 is reserved for zero padding. 
In this experiment, we used GloVe \[4\] vectors to initialize the word embeddings. We trained the entity embeddings using TransE \[2\] on the knowledge graph; the context embedding of an entity is the average of the embeddings of its neighbors in the knowledge graph.<br>
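
The line formats described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's own parser: the helper names `parse_instance`, `fit_to_length` and `parse_doc_feature` are hypothetical, and padding a short user history at the end is an assumption (the description only says that 0 is used as padding).

```python
from typing import List, Tuple


def parse_instance(line: str) -> Tuple[int, str, str, str]:
    """Split '[label] [userid] [CandidateNews]%[impressionid]' into its fields."""
    body, impression_id = line.strip().split("%")
    label, user_id, news_id = body.split(" ")
    return int(label), user_id, news_id, impression_id


def fit_to_length(ids: List[int], size: int, keep: str = "last") -> List[int]:
    """Truncate or zero-pad a list of integer IDs to exactly `size` entries.

    keep="last" keeps the most recent entries (user history);
    keep="first" keeps the leading entries (document words or entities).
    Padding is appended at the end (an assumption for user histories).
    """
    if len(ids) > size:
        return ids[-size:] if keep == "last" else ids[:size]
    return ids + [0] * (size - len(ids))


def parse_doc_feature(line: str, doc_size: int) -> Tuple[str, List[int], List[int]]:
    """Split '[Newsid] [w1,...,wk] [e1,...,ek]' and enforce length doc_size."""
    news_id, words, entities = line.strip().split(" ")
    word_ids = fit_to_length([int(w) for w in words.split(",")], doc_size, keep="first")
    entity_ids = fit_to_length([int(e) for e in entities.split(",")], doc_size, keep="first")
    return news_id, word_ids, entity_ids


# Example lines following the formats above.
print(parse_instance("1 train_U1 N1%0"))
print(parse_doc_feature("N1 34,45,334 45,0,0", doc_size=5))
```

Keeping truncation and padding in one helper makes the two policies explicit: histories keep their most recent clicks, while documents keep their first `doc_size` words.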

## Global settings and imports

In [1]:
import os  # used below for os.path.join and os.path.exists
import sys
sys.path.append("../../")
from reco_utils.recommender.deeprec.deeprec_utils import *
from reco_utils.recommender.deeprec.models.dkn import *
from reco_utils.recommender.deeprec.io.dkn_iterator import *
import papermill as pm


## Download and load data

In [2]:
data_path = '../../tests/resources/deeprec/dkn'
yaml_file = os.path.join(data_path, r'dkn.yaml')
train_file = os.path.join(data_path, r'train_mind_demo.txt')
valid_file = os.path.join(data_path, r'valid_mind_demo.txt')
test_file = os.path.join(data_path, r'test_mind_demo.txt')
news_feature_file = os.path.join(data_path, r'doc_feature.txt')
user_history_file = os.path.join(data_path, r'user_history.txt')
wordEmb_file = os.path.join(data_path, r'word_embeddings_100.npy')
entityEmb_file = os.path.join(data_path, r'TransE_entity2vec_100.npy')
contextEmb_file = os.path.join(data_path, r'TransE_context2vec_100.npy')
infer_embedding_file = os.path.join(data_path, r'infer_embedding.txt')
if not os.path.exists(yaml_file):
    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'mind-demo.zip')
    
    

## Create hyper-parameters

In [3]:
epoch=5
run_MIND_small = True

In [4]:
hparams = prepare_hparams(yaml_file,
                          news_feature_file = news_feature_file,
                          user_history_file = user_history_file,
                          wordEmb_file=wordEmb_file,
                          entityEmb_file=entityEmb_file,
                          contextEmb_file=contextEmb_file,
                          epochs=epoch)
print(hparams)


kg_file=None,user_clicks=None,FEATURE_COUNT=None,FIELD_COUNT=None,data_format=dkn,PAIR_NUM=None,DNN_FIELD_NUM=None,n_user=None,n_item=None,n_user_attr=None,n_item_attr=None,iterator_type=None,SUMMARIES_DIR=None,MODEL_DIR=None,wordEmb_file=../../tests/resources/deeprec/dkn/word_embeddings_100.npy,entityEmb_file=../../tests/resources/deeprec/dkn/TransE_entity2vec_100.npy,contextEmb_file=../../tests/resources/deeprec/dkn/TransE_context2vec_100.npy,news_feature_file=../../tests/resources/deeprec/dkn/doc_feature.txt,user_history_file=../../tests/resources/deeprec/dkn/user_history.txt,use_entity=True,use_context=True,doc_size=10,history_size=50,word_size=12600,entity_size=3987,entity_dim=100,entity_embedding_method=None,transform=True,train_ratio=None,dim=100,layer_sizes=[300],cross_layer_sizes=None,cross_layers=None,activation=['sigmoid'],cross_activation=identity,user_dropout=False,dropout=[0.0],attention_layer_sizes=100,attention_activation=relu,attention_dropout=0.0,model_type=dkn,method

In [5]:
input_creator = DKNTextIterator

## Train the DKN model

In [6]:
model = DKN(hparams, input_creator)


In [7]:
print(model.run_eval(valid_file))

{'auc': 0.5125, 'group_auc': 0.5119, 'mean_mrr': 0.1495, 'ndcg@5': 0.1394, 'ndcg@10': 0.2033}


In [8]:
model.fit(train_file, valid_file)

at epoch 1
train info: logloss loss:0.6935237138453176
eval info: auc:0.5577, group_auc:0.5343, mean_mrr:0.1816, ndcg@10:0.2441, ndcg@5:0.1898
at epoch 1 , train time: 142.2 eval time: 19.9
at epoch 2
train info: logloss loss:0.6216785455659285
eval info: auc:0.5781, group_auc:0.5381, mean_mrr:0.1837, ndcg@10:0.2479, ndcg@5:0.1919
at epoch 2 , train time: 139.9 eval time: 20.5
at epoch 3
train info: logloss loss:0.588279095115298
eval info: auc:0.5867, group_auc:0.5478, mean_mrr:0.184, ndcg@10:0.2503, ndcg@5:0.1894
at epoch 3 , train time: 139.6 eval time: 20.0
at epoch 4
train info: logloss loss:0.5617338119674538
eval info: auc:0.583, group_auc:0.5645, mean_mrr:0.184, ndcg@10:0.2501, ndcg@5:0.1835
at epoch 4 , train time: 139.3 eval time: 20.0
at epoch 5
train info: logloss loss:0.541451787165666
eval info: auc:0.5909, group_auc:0.5872, mean_mrr:0.1922, ndcg@10:0.2614, ndcg@5:0.1918
at epoch 5 , train time: 139.4 eval time: 20.3


<reco_utils.recommender.deeprec.models.dkn.DKN at 0x7f56dd6b0940>

Now we can evaluate the model again on the validation set:

In [9]:
res = model.run_eval(valid_file)
print(res)
pm.record("res", res)

{'auc': 0.5909, 'group_auc': 0.5872, 'mean_mrr': 0.1922, 'ndcg@5': 0.1918, 'ndcg@10': 0.2614}
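
Because `group_auc`, `mean_mrr` and `ndcg@k` are evaluated within impression sessions, it can help to see how such per-impression ranking metrics are commonly computed. The sketch below uses standard definitions of MRR and nDCG@k and averages them over impressions. The function names are hypothetical and the code is not taken from the library, so its values may differ from what `run_eval` reports.

```python
import math
from collections import defaultdict
from typing import Callable, Iterable, List, Sequence, Tuple


def rank_labels(labels: Sequence[int], scores: Sequence[float]) -> List[int]:
    """Return the labels reordered by descending predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def mrr(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Sum of reciprocal ranks of clicked items, divided by the number of clicks."""
    ranked = rank_labels(labels, scores)
    return sum(label / (rank + 1) for rank, label in enumerate(ranked)) / sum(ranked)


def dcg(ranked_labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum((2 ** label - 1) / math.log2(rank + 2)
               for rank, label in enumerate(ranked_labels[:k]))


def ndcg_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> float:
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(rank_labels(labels, scores), k) / ideal


def mean_over_impressions(rows: Iterable[Tuple[str, int, float]],
                          metric: Callable[[List[int], List[float]], float]) -> float:
    """Average a metric over impressions; rows are (impression_id, label, score).

    Impressions without any click are skipped, since the metrics above
    are undefined for them.
    """
    groups = defaultdict(lambda: ([], []))
    for impression_id, label, score in rows:
        groups[impression_id][0].append(label)
        groups[impression_id][1].append(score)
    values = [metric(l, s) for l, s in groups.values() if sum(l) > 0]
    return sum(values) / len(values)


# Two toy impressions: the clicked item is ranked 1st in "1" and 2nd in "2".
rows = [("1", 1, 0.9), ("1", 0, 0.1),
        ("2", 1, 0.2), ("2", 0, 0.9), ("2", 0, 0.1)]
print(mean_over_impressions(rows, mrr))  # 0.75
print(mean_over_impressions(rows, lambda l, s: ndcg_at_k(l, s, 5)))
```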




## Document embedding inference API
After training, you can obtain document embeddings through the document embedding inference API. The input file format is the same as that of the document feature file. The output file format is: `[Newsid] [embedding]`

In [10]:
model.run_get_embedding(news_feature_file, infer_embedding_file)

<reco_utils.recommender.deeprec.models.dkn.DKN at 0x7f56dd6b0940>
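
The output file can then be read back for downstream use, for example to compare two articles by cosine similarity. The sketch below assumes that each line holds the news ID, a single space, and the embedding as comma-separated floats. The description above only specifies `[Newsid] [embedding]`, so check the separators against your own output file. The helper names are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def parse_embedding_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse '[Newsid] [embedding]' lines into a dict of news ID -> vector.

    Assumes a space between ID and embedding, and commas between values.
    """
    embeddings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        news_id, values = line.split(" ", 1)
        embeddings[news_id] = [float(v) for v in values.split(",")]
    return embeddings


def load_doc_embeddings(path: str) -> Dict[str, List[float]]:
    """Read an inferred-embedding file from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return parse_embedding_lines(f)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy lines standing in for the contents of the output file.
demo = parse_embedding_lines(["N1 1.0,0.0", "N2 0.0,1.0", "N3 2.0,0.0"])
print(cosine(demo["N1"], demo["N3"]))
```

With the notebook's variables, `load_doc_embeddings(infer_embedding_file)` would load the file written by `run_get_embedding`, provided the format assumption holds.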

## Quick start with MINDsmall dataset
To help you get started with DKN on the MIND dataset, we provide an example that runs on the MINDsmall dataset. It includes a script, called in the code below, that transforms the MINDsmall dataset into the Microsoft Recommenders DKN data format.

Now we run DKN on the MINDsmall dataset:

In [11]:
data_path = '../../tests/resources/deeprec/dkn/MINDsmall/'
yaml_file = os.path.join(data_path, r'dkn_MINDsmall.yaml')
train_file = os.path.join(data_path, r'train_mind_small.txt')
valid_file = os.path.join(data_path, r'valid_mind_small.txt')
news_feature_file = os.path.join(data_path, r'doc_feature.txt')
user_history_file = os.path.join(data_path, r'user_history_small.txt')
wordEmb_file = os.path.join(data_path, r'word_embeddings_5w_100.npy')
entityEmb_file = os.path.join(data_path, r'entity_embeddings_5w_100.npy')

In [12]:
hparams = prepare_hparams(yaml_file,
                          news_feature_file = news_feature_file,
                          user_history_file = user_history_file,
                          wordEmb_file=wordEmb_file,
                          entityEmb_file=entityEmb_file,
                          epochs=5)


In [13]:
if run_MIND_small:
    os.system("bash ../../tests/resources/deeprec/dkn/MINDsmall/scripts/build_dkn_data.sh")
    input_creator = DKNTextIterator
    model = DKN(hparams, input_creator)
    model.fit(train_file, valid_file)

FileNotFoundError: [Errno 2] No such file or directory: '../../tests/resources/deeprec/dkn/MINDsmall/word_embeddings_5w_100.npy'

## Running models with large dataset
Here are the results on the whole MIND dataset \[3\]. 

MIND is a large-scale English news dataset collected from anonymized behavior logs of the Microsoft News website. It contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, the non-clicked events and the user's historical news click behavior before that impression.

| Models | g-AUC | MRR | NDCG@5 | NDCG@10 |
| :------ | :------: | :------: | :------: | :------: |
| LibFM | 0.5993 | 0.2823 | 0.3005 | 0.3574 |
| Wide&Deep | 0.6216 | 0.2931 | 0.3138 | 0.3712 |
| DKN | 0.6436 | 0.3128 | 0.3371 | 0.3908 |


Note that the DKN results were obtained with Microsoft Recommenders, while the results of the first two models come from the MIND paper \[3\].
All results are compared on the same test dataset. 

One epoch takes 6381.3s (5066.6s for training, 1314.7s for evaluation) for DKN on GPU. Hardware specification for running the large dataset: <br>
GPU: Tesla P100-PCIE-16GB <br>
CPU: 6 cores Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz

## Reference
\[1\] Wang, Hongwei, et al. "DKN: Deep Knowledge-Aware Network for News Recommendation." Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.<br>
\[2\] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E <br>
\[3\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. https://msnews.github.io/competition.html <br>
\[4\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/