<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# DKN : Deep Knowledge-Aware Network for News Recommendation
DKN \[1\] is a deep learning model that incorporates information from a knowledge graph for better news recommendation. Specifically, DKN uses a TransX \[2\] method for knowledge graph representation learning, then applies a CNN framework, named KCNN, to combine entity embeddings with word embeddings and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer. 
<img src="https://recodatasets.z20.web.core.windows.net/kdd2020/images%2FDKN-introduction-pic.JPG" width="600">

## Properties of DKN:
- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. 
- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knowledge-level representations of news articles.
- DKN uses an attention module to dynamically calculate a user's aggregated historical representation.



## Data format:
### DKN takes several files as input as follows:
- training / validation / test files: each line in these files represents one instance. `impressionid` is used to evaluate performance within an impression session, so it is only needed for evaluation; you can set it to 0 for training data. The format is: <br> 
`[label] [userid] [CandidateNews]%[impressionid] `<br> 
e.g., `1 train_U1 N1%0` <br> 
- user history file: each line in this file represents one user's click history. Set the `his_size` parameter in the config file (passed as `history_size` in the hyper-parameter cell below) to the maximum number of clicks kept per user. If a user's click history is longer than `his_size`, only the last `his_size` clicks are kept; if it is shorter, it is padded with 0. The format is: <br> 
`[Userid] [newsid1,newsid2...]`<br>
e.g., `train_U1 N1,N2` <br> 
- document feature file:
It contains the word and entity features of news articles. Each article is represented by its (aligned) title words and title entities. For example, for the news title `Trump to deliver State of the Union address next week`, the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entities value may be `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of the entity vector is non-zero, because of the word Trump. Word and entity values are hashed to integers from 1 to n (n is the number of distinct words or entities). Each feature must have a fixed length k (the `doc_size` parameter): if a document has more than k words, truncate it to k words; if it has fewer, pad it with 0 at the end. 
The format is: <br> 
`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]`
- word embedding / entity embedding / context embedding files: these are npy files of pretrained embeddings. After loading, each file is a two-dimensional matrix of shape [n+1, k], where n is the number of words (or entities) in the hash dictionary and k is the embedding dimension. Index 0 is reserved for zero padding. 
In this experiment, we used GloVe\[4\] vectors to initialize the word embeddings. We trained entity embeddings using TransE\[2\] on the knowledge graph; the context embedding of an entity is the average of the embeddings of its neighbors in the knowledge graph.<br>
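
The line formats and the truncate/pad rules described above can be sketched in plain Python. This is only an illustration, not the iterator used by the library: the helper names (`parse_instance`, `parse_history`, `fit_history`, `fit_document`) are hypothetical, and padding the click history at the end is an assumption, since the text above only states the padding side for document features.

```python
from typing import List, Tuple


def parse_instance(line: str) -> Tuple[int, str, str, str]:
    """Split '[label] [userid] [CandidateNews]%[impressionid]' into its fields."""
    body, impression_id = line.strip().split("%")
    label, user_id, news_id = body.split()
    return int(label), user_id, news_id, impression_id


def parse_history(line: str) -> Tuple[str, List[str]]:
    """Split '[Userid] [newsid1,newsid2...]' into a user id and a list of news ids."""
    user_id, clicks = line.strip().split(" ", 1)
    return user_id, clicks.split(",")


def fit_history(clicks: List[str], his_size: int, pad: str = "0") -> List[str]:
    """Keep the last his_size clicks, or pad with `pad` up to his_size.

    Padding at the end is an assumption made for this sketch.
    """
    if his_size <= 0:
        raise ValueError("his_size must be positive")
    kept = clicks[-his_size:]
    return kept + [pad] * (his_size - len(kept))


def fit_document(ids: List[int], doc_size: int) -> List[int]:
    """Truncate to the first doc_size ids, or pad with 0 at the end."""
    kept = ids[:doc_size]
    return kept + [0] * (doc_size - len(kept))


print(parse_instance("1 train_U1 N1%0"))   # (1, 'train_U1', 'N1', '0')
print(parse_history("train_U1 N1,N2"))     # ('train_U1', ['N1', 'N2'])
print(fit_history(["N1", "N2", "N3"], 2))  # ['N2', 'N3']
print(fit_document([34, 45, 334], 5))      # [34, 45, 334, 0, 0]
```
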

## Global settings and imports

In [1]:
from reco_utils.recommender.deeprec.deeprec_utils import *
from reco_utils.recommender.deeprec.models.dkn import *
from reco_utils.recommender.deeprec.io.dkn_iterator import *
import os  # used explicitly for os.path.join in the data path cell
import time

import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)

## Data paths
We usually debug and search hyper-parameters on a small dataset. You can switch between the small dataset and the full dataset by changing the value of `tag`.

In [2]:
tag = 'small' # small or full

In [3]:
data_path = 'data_folder/my/DKN-training-folder'

yaml_file = './dkn.yaml' #  os.path.join(data_path, r'../../../../../../dkn.yaml')
train_file = os.path.join(data_path, r'train_{0}.txt'.format(tag))
valid_file = os.path.join(data_path, r'valid_{0}.txt'.format(tag))
test_file = os.path.join(data_path, r'test_{0}.txt'.format(tag))
user_history_file = os.path.join(data_path, r'user_history_{0}.txt'.format(tag))
news_feature_file = os.path.join(data_path, r'../paper_feature.txt')
wordEmb_file = os.path.join(data_path, r'word_embedding.npy')
entityEmb_file = os.path.join(data_path, r'entity_embedding.npy')
contextEmb_file = os.path.join(data_path, r'context_embedding.npy')
infer_embedding_file = os.path.join(data_path, r'infer_embedding.txt')
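
Before building the model, it can help to confirm that the input paths actually exist. The following is a minimal sketch; `find_missing` is a hypothetical helper, not part of `reco_utils`, and the example dictionary uses placeholder entries rather than the variables defined above.

```python
import os
from typing import Dict, List


def find_missing(named_paths: Dict[str, str]) -> List[str]:
    """Return the names whose paths do not exist on disk."""
    return [name for name, path in named_paths.items() if not os.path.exists(path)]


# In the notebook you would pass the variables defined above, for example
# {"train_file": train_file, "valid_file": valid_file, "yaml_file": yaml_file}.
example = {"current_dir": ".", "yaml_file": "./dkn.yaml"}
print(find_missing(example))
```

An empty list means every listed input was found. The inferred embedding file is an output of a later step, so it does not need to exist yet.
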
    

## Create hyper-parameters

In [4]:
epoch=5
hparams = prepare_hparams(yaml_file,
                          news_feature_file = news_feature_file,
                          user_history_file = user_history_file,
                          wordEmb_file=wordEmb_file,
                          entityEmb_file=entityEmb_file,
                          contextEmb_file=contextEmb_file,
                          epochs=epoch,
                          is_clip_norm=True,
                          max_grad_norm=0.5,
                          history_size=20,
                          MODEL_DIR=os.path.join(data_path, 'save_models'),
                          learning_rate=0.001,
                          embed_l2=0.0,
                          layer_l2=0.0,
                          use_entity=True,
                          use_context=True
                         )
print(hparams.values)

<bound method HParams.values of HParams([('DNN_FIELD_NUM', None), ('EARLY_STOP', 100), ('FEATURE_COUNT', None), ('FIELD_COUNT', None), ('L', None), ('MODEL_DIR', 'data_folder/my/DKN-training-folder/save_models'), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('T', None), ('activation', ['sigmoid']), ('att_fcn_layer_sizes', None), ('attention_activation', 'relu'), ('attention_dropout', 0.0), ('attention_layer_sizes', 32), ('attention_size', None), ('batch_size', 100), ('cate_embedding_dim', None), ('cate_vocab', None), ('contextEmb_file', 'data_folder/my/DKN-training-folder/context_embedding.npy'), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0), ('cross_layer_sizes', None), ('cross_layers', None), ('data_format', 'dkn'), ('decay', None), ('dilations', None), ('dim', 32), ('doc_size', 15), ('dropout', [0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0), ('embed_size', None), ('embedding_dropout', 0.3), ('enable_BN', False), ('entityEmb_file', 'data_folder/m

In [5]:
input_creator = DKNTextIterator

## Train the DKN model
<img src="https://recodatasets.z20.web.core.windows.net/kdd2020/images%2FDKN-main.JPG" width="600">

In [6]:
model = DKN(hparams, input_creator)

In [7]:
t01 = time.time()
print(model.run_eval(valid_file))
t02 = time.time()
print((t02-t01)/60)

{'auc': 0.4975, 'group_auc': 0.4995, 'mean_mrr': 0.4499, 'ndcg@2': 0.3188, 'ndcg@4': 0.5096, 'ndcg@6': 0.5844}
0.2868070244789124
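
Both `auc` and `group_auc` are close to 0.5 before training, which is what a random ranking gives. As an illustration of how a per-impression AUC can be computed, here is a minimal sketch. It assumes that `group_auc` is the unweighted mean of the AUC within each impression session and that sessions containing only one class are skipped; the library's implementation may differ in these details.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_auc(labels: Sequence[int], scores: Sequence[float],
              impression_ids: Sequence[Hashable]) -> float:
    """Mean AUC over impression sessions (assumption: unweighted mean)."""
    grouped = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, impression_ids):
        grouped[g][0].append(y)
        grouped[g][1].append(s)
    # Sessions with only clicks or only non-clicks have no defined AUC.
    per_group = [auc(ys, ss) for ys, ss in grouped.values() if 0 < sum(ys) < len(ys)]
    if not per_group:
        raise ValueError("no impression contains both classes")
    return sum(per_group) / len(per_group)


labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.3, 0.6]
impressions = ["a", "a", "a", "b", "b"]
print(auc(labels, scores))                     # overall AUC across all instances
print(group_auc(labels, scores, impressions))  # 0.5
```

In this toy example impression `a` is ranked perfectly and impression `b` is ranked backwards, so the per-impression mean is 0.5 even though the overall AUC is higher.
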


In [8]:
model.fit(train_file, valid_file)

at epoch 1
train info: logloss loss:0.31074869422989354
eval info: auc:0.9233, group_auc:0.9227, mean_mrr:0.871, ndcg@2:0.8764, ndcg@4:0.9031, ndcg@6:0.9044
at epoch 1 , train time: 158.5 eval time: 15.7
at epoch 2
train info: logloss loss:0.23968442060617945
eval info: auc:0.9389, group_auc:0.9359, mean_mrr:0.8922, ndcg@2:0.8978, ndcg@4:0.9189, ndcg@6:0.9201
at epoch 2 , train time: 157.4 eval time: 15.8
at epoch 3
train info: logloss loss:0.21604214868106048
eval info: auc:0.9449, group_auc:0.941, mean_mrr:0.8986, ndcg@2:0.905, ndcg@4:0.9241, ndcg@6:0.9249
at epoch 3 , train time: 157.3 eval time: 15.7
at epoch 4
train info: logloss loss:0.20288348058693245
eval info: auc:0.9483, group_auc:0.9457, mean_mrr:0.906, ndcg@2:0.9126, ndcg@4:0.9298, ndcg@6:0.9305
at epoch 4 , train time: 157.2 eval time: 15.8
at epoch 5
train info: logloss loss:0.19293928237546187
eval info: auc:0.9496, group_auc:0.9481, mean_mrr:0.9091, ndcg@2:0.9168, ndcg@4:0.9321, ndcg@6:0.9328
at epoch 5 , train time: 1

<reco_utils.recommender.deeprec.models.dkn.DKN at 0x7f7c617c2898>

Now we can evaluate the trained model on the test set:

In [9]:
t01 = time.time()
print(model.run_eval(test_file))
t02 = time.time()
print((t02-t01)/60)

{'auc': 0.94, 'group_auc': 0.9374, 'mean_mrr': 0.7071, 'ndcg@2': 0.6735, 'ndcg@4': 0.746, 'ndcg@6': 0.7647}
0.4620617826779683
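
For readers unfamiliar with the ranking metrics in the output above, the sketch below computes a reciprocal rank and NDCG@k for a single impression with binary click labels. It uses one common definition of each metric (reciprocal rank of the first clicked item, and DCG with a `log2` discount); these definitions are assumptions for illustration and may not match the library's exact formulas. The reported values would be averages of such per-impression scores.

```python
import math
from typing import List, Sequence


def rank_by_score(labels: Sequence[int], scores: Sequence[float]) -> List[int]:
    """Return labels ordered by descending score (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [labels[i] for i in order]


def reciprocal_rank(labels: Sequence[int], scores: Sequence[float]) -> float:
    """1 / rank of the first clicked item, or 0.0 if nothing was clicked."""
    for rank, y in enumerate(rank_by_score(labels, scores), start=1):
        if y == 1:
            return 1.0 / rank
    return 0.0


def dcg_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the top k positions."""
    return sum((2 ** y - 1) / math.log2(i + 2) for i, y in enumerate(ranked_labels[:k]))


def ndcg_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> float:
    """DCG@k of the predicted ranking divided by DCG@k of the ideal ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(rank_by_score(labels, scores), k) / ideal


labels = [0, 1, 0, 1]
scores = [0.9, 0.8, 0.1, 0.3]
print(reciprocal_rank(labels, scores))  # 0.5, the first click is at rank 2
print(ndcg_at_k(labels, scores, 2))
print(ndcg_at_k(labels, scores, 4))
```
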


## Document embedding inference API
After training, you can obtain document embeddings through this document embedding inference API. The input file format is the same as that of the document feature file. The output file format is: `[Newsid] [embedding]`

In [10]:
model.run_get_embedding(news_feature_file, infer_embedding_file)

<reco_utils.recommender.deeprec.models.dkn.DKN at 0x7f7c617c2898>
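
Once the embeddings are written, they can be loaded back for downstream use such as finding similar articles. The sketch below parses lines of the form `[Newsid] [embedding]` and compares two documents by cosine similarity. It assumes the news id and the embedding are separated by a single space and that the embedding values are comma-separated; check the actual output file before relying on this. The helper names are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence


def parse_embeddings(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse '[Newsid] [embedding]' lines (assumed comma-separated values)."""
    embeddings: Dict[str, List[float]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        news_id, values = line.split(" ", 1)
        embeddings[news_id] = [float(v) for v in values.split(",")]
    return embeddings


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy lines standing in for the real file; with a real file you could use
# `with open(infer_embedding_file) as f: emb = parse_embeddings(f)`.
sample = ["N1 1.0,0.0", "N2 0.0,1.0", "N3 2.0,0.0"]
emb = parse_embeddings(sample)
print(cosine(emb["N1"], emb["N3"]))  # 1.0, same direction
print(cosine(emb["N1"], emb["N2"]))  # 0.0, orthogonal
```
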

The table below compares DKN with and without knowledge entities (DKN(-)), alongside LightGCN:

| Models | Group-AUC | MRR | NDCG@2 | NDCG@4 |
| :------ | :------: | :------: | :------: | :------: |
| DKN | 0.9557 | 0.8993 | 0.8951 | 0.9123 |
| DKN(-) | 0.9506 | 0.8817 | 0.8758 | 0.8982 |
| LightGCN | 0.8608 | 0.5605 | 0.4975 | 0.5792 |

## Reference
\[1\] Wang, Hongwei, et al. "DKN: Deep Knowledge-Aware Network for News Recommendation." Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.<br>
