<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Data manipulation
This notebook walks through the steps needed to generate DKN's input dataset from the raw MAG COVID-19 dataset.

In [1]:
import os 
import codecs
import pickle
import time 
from datetime import datetime  
import random
import numpy as np
import math

from utils.task_helper import *
from utils.general import *
from utils.data_helper import *


## Preparing paper related files
First, let's generate the paper data. 
For DKN, each paper is stored in the following format: <br>
`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]` <br>
where w and e are the index sequences of the paper's words and entities. 
Words and entities are aligned. As a quick example, consider a paper titled:  <br> `One Health approach in the South East Asia region: opportunities and challenges` <br> 
The title's word values could be <br>  `101,56,23,14,1,69,256,887,365,32,11,567` <br>   and its entity values could be: <br>  `10,10,0,0,0,45,45,45,0,0,0,0` <br>  The first two values of the entity sequence are both 10, indicating that the first two words correspond to the same entity. Word and entity values are hashed to indices from 1 to n and from 1 to m, respectively, where n and m are the numbers of distinct words and entities. 
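
To make the alignment concrete, here is a minimal sketch of how aligned words and entity labels could be turned into index sequences. The helper `encode_title` and the demo dictionaries are hypothetical names used only for illustration; they are not the tutorial's `gen_paper_content`. The sketch also assumes that 0 marks a word without a linked entity, which is consistent with the example above.

```python
from typing import Dict, List, Tuple


def encode_title(
    words: List[str],
    entities: List[str],
    word2idx: Dict[str, int],
    entity2idx: Dict[str, int],
) -> Tuple[List[int], List[int]]:
    """Map aligned words and entity labels to integer indices.

    entities[i] names the entity that words[i] belongs to, or "" if none.
    Assumption: index 0 means "no entity"; real indices start at 1.
    """
    word_ids, entity_ids = [], []
    for word, entity in zip(words, entities):
        # Unseen words/entities receive the next free index (1, 2, 3, ...).
        word_ids.append(word2idx.setdefault(word, len(word2idx) + 1))
        if entity:
            entity_ids.append(entity2idx.setdefault(entity, len(entity2idx) + 1))
        else:
            entity_ids.append(0)
    return word_ids, entity_ids


word2idx_demo: Dict[str, int] = {}
entity2idx_demo: Dict[str, int] = {}
words = ["one", "health", "approach", "in"]
entities = ["one_health", "one_health", "", ""]
word_ids, entity_ids = encode_title(words, entities, word2idx_demo, entity2idx_demo)
print(word_ids)    # [1, 2, 3, 4]
print(entity_ids)  # [1, 1, 0, 0]
```

Both output lists have the same length, so position i in the entity sequence always describes position i in the word sequence.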

In [2]:
InFile_dir = 'data_folder/raw'
OutFile_dir = 'data_folder/my'
create_dir(OutFile_dir)

Path_PaperTitleAbs_bySentence = os.path.join(InFile_dir, 'PaperTitleAbs_bySentence.txt')
Path_PaperFeature = os.path.join(OutFile_dir, 'paper_feature.txt')

max_word_size_per_paper = 15 

Step 1 is to hash the words and entities. <br>
For simplicity, this tutorial uses only the paper title to represent a paper's content. You can certainly use more content, such as the abstract and body. <br>
Each feature has a fixed length k (`max_word_size_per_paper`). If a document has more than k words, it is truncated to k words; if it has fewer, it is padded with 0 at the end. 
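
The fixed-length rule can be sketched in a couple of lines. `pad_or_truncate` is a hypothetical helper written for illustration, not a function from the tutorial's `utils` package.

```python
def pad_or_truncate(ids, doc_len, pad_value=0):
    """Return exactly doc_len ids: drop the tail or right-pad with pad_value."""
    return ids[:doc_len] + [pad_value] * max(0, doc_len - len(ids))


short_doc = pad_or_truncate([101, 56, 23], 5)
long_doc = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5)
print(short_doc)  # [101, 56, 23, 0, 0]
print(long_doc)   # [1, 2, 3, 4, 5]
```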

In [3]:
word2idx = {}
entity2idx = {}
relation2idx = {}
word2idx, entity2idx = gen_paper_content(
    Path_PaperTitleAbs_bySentence, Path_PaperFeature, word2idx, entity2idx, field=["Title"], doc_len=max_word_size_per_paper
)


loading file PaperTitleAbs_bySentence.txt...
loading line: 880000, time elapses: 10.1s  
parsing into feature file  ...
parsed paper count: 110000, time elapses: 0.5s 


Step 2 is to generate the knowledge graph data, in the form of a set of triples: <br>
`head, tail, relation` <br>
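
As a rough illustration of the triple format, the sketch below converts named triples into integer triples in `head, tail, relation` order. The function `index_triples`, the entity and relation names, and the choice to skip triples with unknown entities are all assumptions made for this example; the behavior of `gen_knowledge_relations` may differ.

```python
def index_triples(named_triples, entity2idx, relation2idx):
    """Convert (head, tail, relation) name triples into integer triples.

    Assumption: triples whose head or tail is not a known entity are skipped,
    and unseen relation names receive the next free index.
    """
    indexed = []
    for head, tail, relation in named_triples:
        if head not in entity2idx or tail not in entity2idx:
            continue
        rel_id = relation2idx.setdefault(relation, len(relation2idx))
        indexed.append((entity2idx[head], entity2idx[tail], rel_id))
    return indexed


entity_ids_demo = {"virology": 1, "coronavirus": 2}
relation_ids_demo = {}
triples = index_triples(
    [("virology", "coronavirus", "parent_of"), ("virology", "unknown", "parent_of")],
    entity_ids_demo,
    relation_ids_demo,
)
print(triples)  # [(1, 2, 0)]
```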

In [4]:
word2idx_filename = os.path.join(OutFile_dir, 'word2idx.pkl')
entity2idx_filename = os.path.join(OutFile_dir, 'entity2idx.pkl')

Path_RelatedFieldOfStudy = os.path.join(InFile_dir, 'RelatedFieldOfStudy.txt')
OutFile_dir_KG = os.path.join(OutFile_dir, 'KG')
create_dir(OutFile_dir_KG)

gen_knowledge_relations(Path_RelatedFieldOfStudy, OutFile_dir_KG, entity2idx, relation2idx) 

processing file RelatedFieldOfStudy.txt... done.


The data files will be written to the folder `OutFile_dir_KG`.  <br>
To train word embeddings, we need a collection of sentences:

In [5]:
Path_SentenceCollection = os.path.join(OutFile_dir, 'sentence.txt')
gen_sentence_collection(
    Path_PaperTitleAbs_bySentence,
    Path_SentenceCollection,
    word2idx
)

## save the id mapper
with open(word2idx_filename, 'wb') as f:
    pickle.dump(word2idx, f)
dump_dict_as_txt(word2idx, os.path.join(OutFile_dir, 'word2id.tsv'))
with open(entity2idx_filename, 'wb') as f:
    pickle.dump(entity2idx, f)

loading file PaperTitleAbs_bySentence.txt...
loading line: 880000, time elapses: 8.8s 
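
The pickled mappers can be reloaded in a later session with `pickle.load`. The sketch below is a self-contained round trip on a toy dictionary in a temporary directory, so the file name and contents are placeholders rather than the real `word2idx.pkl`.

```python
import os
import pickle
import tempfile

mapping = {"covid": 1, "health": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "word2idx_demo.pkl")
    # Save the mapper the same way the cell above does.
    with open(path, "wb") as f:
        pickle.dump(mapping, f)
    # Reload it, as a later notebook or script would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Invert the mapper for index -> word lookups.
idx2word = {idx: word for word, idx in restored.items()}
print(idx2word[1])  # covid
```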

## Prepare user related files
Next, we generate the user-related files.
Our first task is user-to-paper recommendation. For each user, we collect all the papers they have cited and arrange them in chronological order. The recommendation task can then be formulated as: given a user's citation history, predict which paper they will cite in the future.
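
A minimal sketch of this formulation follows. `build_citation_history` is a hypothetical helper, and two details are assumptions rather than documented behavior of `get_author_reference_list`: a cited paper is dated by the author's earliest paper that cites it, and repeated citations are dropped.

```python
from datetime import date


def build_citation_history(author_papers, paper2refs, paper2date):
    """Return the papers an author cited, ordered by the citing paper's date.

    Assumption: a reference is placed at the date of the first of the
    author's papers that cites it; later repeats are ignored.
    """
    history, seen = [], set()
    for paper in sorted(author_papers, key=lambda p: paper2date[p]):
        for ref in paper2refs.get(paper, []):
            if ref not in seen:
                seen.add(ref)
                history.append(ref)
    return history


paper2date_demo = {"P1": date(2020, 3, 1), "P2": date(2020, 1, 15)}
paper2refs_demo = {"P1": ["R3", "R1"], "P2": ["R1", "R2"]}
history = build_citation_history(["P1", "P2"], paper2refs_demo, paper2date_demo)
print(history)  # ['R1', 'R2', 'R3']

# Hold out the most recent citation as the prediction target.
observed, target = history[:-1], history[-1]
print(observed, target)  # ['R1', 'R2'] R3
```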

In [6]:

_t0 = time.time()

Path_PaperReference = os.path.join(InFile_dir, 'PaperReferences.txt')
Path_PaperAuthorAffiliations = os.path.join(InFile_dir, 'PaperAuthorAffiliations.txt')
Path_Papers = os.path.join(InFile_dir, 'Papers.txt')
Path_Author2ReferencePapers = os.path.join(OutFile_dir, 'Author2ReferencePapers.tsv')

author2paper_list = load_author_paperlist(Path_PaperAuthorAffiliations)
paper2date = load_paper_date(Path_Papers)
paper2reference_list = load_paper_reference(Path_PaperReference)

author2reference_list = get_author_reference_list(author2paper_list, paper2reference_list, paper2date)

output_author2reference_list(
    author2reference_list,
    Path_Author2ReferencePapers
)

OutFile_dir_DKN = os.path.join(OutFile_dir, 'DKN-training-folder')
create_dir(OutFile_dir_DKN)


loading PaperAuthorAffiliations.txt...
loading Papers.txt...
loading PaperReferences.txt...
parsing user's reference list ...
parsed user count: 430000, time elapses: 3.6s 
outputing author reference list


#### DKN takes several more files as inputs:
- training / validation / test files: each line in these files represents one instance. `impressionid` is used to evaluate performance within an impression session, so it is only needed at evaluation time; you can set it to 0 for training data. The format is: <br> 
`[label] [userid] [CandidateNews]%[impressionid] `<br> 
e.g., `1 train_U1 N1%0` <br> 
- user history file: each line in this file represents one user's citation history. You need to set the `his_size` parameter in the config file, which is the maximum number of items from a user's click history that we use. If a user's click history is longer than `his_size`, we automatically keep the last `his_size` items; if it is shorter, we automatically pad it with 0. The format is: <br> 
`[Userid] [newsid1,newsid2...]`<br>
e.g., `train_U1 N1,N2` <br> 
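
The two line formats and the `his_size` rule can be sketched as follows. All three helpers are hypothetical and exist only for illustration. Whether the padding is placed before or after the real items is an assumption here (this sketch pads at the end).

```python
def format_instance_line(label, user_id, candidate, impression_id=0):
    """Build one '[label] [userid] [CandidateNews]%[impressionid]' line."""
    return "{0} {1} {2}%{3}".format(label, user_id, candidate, impression_id)


def format_history_line(user_id, items):
    """Build one '[Userid] [newsid1,newsid2...]' line."""
    return "{0} {1}".format(user_id, ",".join(items))


def clip_history(items, his_size, pad_token="0"):
    """Keep the last his_size items; pad with pad_token if there are fewer.

    Assumption: padding goes at the end of the sequence.
    """
    recent = list(items[-his_size:]) if his_size > 0 else []
    return recent + [pad_token] * (his_size - len(recent))


print(format_instance_line(1, "train_U1", "N1"))        # 1 train_U1 N1%0
print(format_history_line("train_U1", ["N1", "N2"]))    # train_U1 N1,N2
print(clip_history(["N1", "N2"], 4))                    # ['N1', 'N2', '0', '0']
print(clip_history(["N1", "N2", "N3", "N4", "N5"], 4))  # ['N2', 'N3', 'N4', 'N5']
```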

DKN treats recommendation as a binary classification problem. We sample negative instances according to item popularity:
<img src="https://recodatasets.z20.web.core.windows.net/kdd2020/images/item-popularity.JPG" width="600">
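
One simple way to sample negatives by popularity is to weight each candidate by how often it appears in the interaction data. The sketch below assumes weights proportional to raw counts; the exact distribution used by `gen_experiment_splits` is not specified here, and `sample_negatives` is a hypothetical helper.

```python
import random
from collections import Counter


def sample_negatives(positive_items, item_popularity, num_negatives, rng):
    """Draw distinct negatives, weighted by popularity, excluding positives.

    Assumption: sampling weight is the raw interaction count of each item.
    """
    candidates = [
        item for item, count in item_popularity.items()
        if item not in positive_items and count > 0
    ]
    weights = [item_popularity[item] for item in candidates]
    negatives = set()
    # Draw with replacement and deduplicate until enough negatives are found.
    while len(negatives) < min(num_negatives, len(candidates)):
        negatives.add(rng.choices(candidates, weights=weights, k=1)[0])
    return sorted(negatives)


popularity = Counter(["N1", "N1", "N1", "N2", "N2", "N3", "N4"])
negatives = sample_negatives({"N1"}, popularity, 2, random.Random(0))
print(negatives)
```

Popular items are drawn more often, so the model sees harder negatives than it would under uniform sampling.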

In [7]:
gen_experiment_splits(
    Path_Author2ReferencePapers,
    OutFile_dir_DKN,
    Path_PaperFeature,
    item_ratio=0.1,
    tag='small',
    process_num=2
)

_t1 = time.time()
print('time elapses for user is : {0:.1f}s'.format(_t1 - _t0))

expanding user behaviors...
processing user number : 287000, time elapses: 1.7s done. 
sample number in train / valid / test is 150874 / 8198 / 8198
negative sampling for train...
sampling process 0:  150000 / 150874, time elapses: 28.3s 	sampling process 1 done.
	sampling process 0 done.
negative sampling for validation...
sampling process 1:  8000 / 8198, time elapses: 1.5s                	sampling process 0 done.
	sampling process 1 done.
negative sampling for test...
sampling process 1:  8000 / 8198, time elapses: 1.6s                	sampling process 0 done.
	sampling process 1 done.
done.
time elapses for user is : 51.8s


## Prepare item2item recommendation dataset
Our second recommendation scenario is item-to-item recommendation. Given a paper, we recommend a list of related papers for users to cite.
Here we use a supervised learning approach to train this model. Each instance is a tuple of `<paper_a, paper_b, label>`. A label of 1 means the pair is highly related; otherwise the label is 0.
The positive labels are constructed in the following three ways: <br>
1. Papers A and B overlap substantially in their reference lists; 
2. Papers A and B are co-cited by many other papers;
3. Papers A and B were published within 12 months of each other by the same first author.
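
The first two signals can both be derived from a single paper-to-references map. The following is a minimal sketch with a hypothetical `count_pairs` helper and toy data; it is not the tutorial's `gen_paper_cocitation`, which may count or filter pairs differently.

```python
from collections import Counter
from itertools import combinations


def count_pairs(paper2refs):
    """Return (co_cited, co_reference) pair counters.

    co_cited[(x, y)]     = number of papers that cite both x and y.
    co_reference[(a, b)] = number of references shared by papers a and b.
    """
    co_cited = Counter()
    ref2citing = {}
    for paper, refs in paper2refs.items():
        unique_refs = sorted(set(refs))
        # Every pair in one reference list is co-cited once by this paper.
        for x, y in combinations(unique_refs, 2):
            co_cited[(x, y)] += 1
        for ref in unique_refs:
            ref2citing.setdefault(ref, set()).add(paper)

    co_reference = Counter()
    # Two papers citing the same reference share one reference-list entry.
    for citing in ref2citing.values():
        for a, b in combinations(sorted(citing), 2):
            co_reference[(a, b)] += 1
    return co_cited, co_reference


refs_demo = {"A": ["X", "Y", "Z"], "B": ["X", "Y"], "C": ["Y", "Z"]}
co_cited, co_reference = count_pairs(refs_demo)
print(co_cited[("X", "Y")])      # 2
print(co_reference[("A", "B")])  # 2
```

Pairs whose count exceeds some threshold would then be treated as positive instances; the threshold itself is a tuning choice not fixed by this sketch.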

In [8]:
OutFile_dir_item2item = r'data_folder/my/item2item'
create_dir(OutFile_dir_item2item)
item_set = load_has_feature_items(Path_PaperFeature)


Path_PaperReference = os.path.join(InFile_dir, 'PaperReferences.txt')
pair2CocitedCnt, pair2CoReferenceCnt = gen_paper_cocitation(Path_PaperReference)

Path_paper_pair_cocitation = os.path.join(OutFile_dir_item2item, 'paper_pair_cocitation_cnt.csv')
Path_paper_pair_coreference = os.path.join(OutFile_dir_item2item, 'paper_pair_coreference_cnt.csv')

with open(Path_paper_pair_cocitation, 'w') as wt:
    for p, v in pair2CocitedCnt.items():
        if p[0] in item_set and p[1] in item_set:
            wt.write('{0},{1},{2}\n'.format(p[0], p[1], v))

with open(Path_paper_pair_coreference, 'w') as wt:
    for p, v in pair2CoReferenceCnt.items():
        if p[0] in item_set and p[1] in item_set:
            wt.write('{0},{1},{2}\n'.format(p[0], p[1], v))
            
            
Path_Papers = os.path.join(InFile_dir, 'Papers.txt')
Path_PaperAuthorAffiliations = os.path.join(InFile_dir, 'PaperAuthorAffiliations.txt')
paper2date = load_paper_date(Path_Papers)
author2paper_list, paper2author_set = load_paper_author_relation(Path_PaperAuthorAffiliations)
Path_FirstAuthorPaperPair = os.path.join(OutFile_dir_item2item, 'paper_pair_cofirstauthor.csv')
first_author_pairs = gen_paper_pairs_from_same_author(
    author2paper_list, paper2author_set, paper2date, Path_FirstAuthorPaperPair, item_set
)

loading PaperReferences.txt...
process paper num 53400 / 53452...time elapses: 8.8s	Done.
process paper num 73600 / 73699...time elapses: 48.9s	Done.
loading Papers.txt...
loading PaperAuthorAffiliations.txt...
process author num 435800 / 435822...time elapses: 1.0s

Now let's split the instances into training and validation sets and perform negative sampling:

In [9]:
split_train_valid_file(
    [Path_paper_pair_cocitation, Path_FirstAuthorPaperPair, Path_paper_pair_coreference],
    OutFile_dir_DKN
)
gen_negative_instances(
    item_set,
    os.path.join(OutFile_dir_DKN, 'item2item_train.txt'),
    os.path.join(OutFile_dir_DKN, 'item2item_train_instances.txt'),
    9
)
gen_negative_instances(
    item_set,
    os.path.join(OutFile_dir_DKN, 'item2item_valid.txt'),
    os.path.join(OutFile_dir_DKN, 'item2item_valid_instances.txt'),
    9
)


negative sampling for file item2item_train.txt...
process line num 182500 / 182537...time elapses: 3.6s	done.
negative sampling for file item2item_valid.txt...
process line num 45600 / 45613...time elapses: 0.9s	done.
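
The split and negative sampling step can be sketched as below. This is an illustration under assumptions, not the implementation of `split_train_valid_file` or `gen_negative_instances`: the helper names and the validation ratio are hypothetical, negatives are drawn uniformly, and the argument `9` in the calls above is interpreted as the number of negatives per positive pair.

```python
import random


def split_pairs(pairs, valid_ratio, rng):
    """Shuffle positive pairs and split them into (train, valid) lists."""
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]


def add_negatives(positive_pairs, items, num_negatives, rng):
    """Return (paper_a, paper_b, label) tuples: each positive plus negatives.

    Assumption: negatives are drawn uniformly (with replacement) from items
    that are not known positives for paper_a.
    """
    positives = set(positive_pairs)
    item_list = sorted(items)
    instances = []
    for a, b in positive_pairs:
        instances.append((a, b, 1))
        candidates = [c for c in item_list if c != a and (a, c) not in positives]
        if candidates:
            for c in rng.choices(candidates, k=num_negatives):
                instances.append((a, c, 0))
    return instances


rng = random.Random(42)
items_demo = {"P{0}".format(i) for i in range(20)}
pairs_demo = [("P0", "P1"), ("P2", "P3"), ("P4", "P5"), ("P6", "P7"), ("P8", "P9")]
train_pairs, valid_pairs = split_pairs(pairs_demo, valid_ratio=0.2, rng=rng)
train_instances = add_negatives(train_pairs, items_demo, num_negatives=9, rng=rng)
print(len(train_pairs), len(valid_pairs), len(train_instances))  # 4 1 40
```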


Generating the full dataset takes much longer, so you can let it run in the background.

In [10]:
gen_experiment_splits(
    Path_Author2ReferencePapers,
    OutFile_dir_DKN,
    Path_PaperFeature,
    item_ratio=1.0,
    tag='full',
    process_num=8
) 

expanding user behaviors...
processing user number : 287000, time elapses: 8.7s done. 
sample number in train / valid / test is 1782333 / 125010 / 125010
negative sampling for train...


sampling process 1:  1014000 / 1782333, time elapses: 698.0s

sampling process 5:  125000 / 125010, time elapses: 82.3s   	sampling process 5 done.
sampling process 6:  125000 / 125010, time elapses: 83.7s      	sampling process 6 done.
sampling process 2:  125000 / 125010, time elapses: 84.2s 	sampling process 2 done.
negative sampling for test...
sampling process 1:  125000 / 125010, time elapses: 81.9s