<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>


# Pretraining word and entity embeddings
This notebook pretrains the word embeddings and entity embeddings used to initialize DKN [1].

In [1]:
import os
import pickle
import time

import numpy as np
from gensim.models import Word2Vec

# Tutorial helper modules
from utils.general import *
from utils.task_helper import *

In [2]:
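class MySentenceCollection:
    """Restartable iterable over a corpus file with one space-tokenized sentence per line."""

    def __init__(self, filename):
        self.filename = filename
        self.rd = None

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated more than once.
        # newline='\r\n' means only CRLF is treated as a line terminator.
        self.rd = open(self.filename, 'r', encoding='utf-8', newline='\r\n')
        return self

    def __next__(self):
        line = self.rd.readline()
        if line:
            # Drop the line terminator and split the sentence into tokens.
            return line.strip('\r\n').split(' ')
        else:
            self.rd.close()
            raise StopIteration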


In [3]:
InFile_dir = 'data_folder/my'
OutFile_dir = 'data_folder/my/pretrained-embeddings'
OutFile_dir_KG = 'data_folder/my/KG'
OutFile_dir_DKN = 'data_folder/my/DKN-training-folder'

Word2vec [4] learns distributed vector representations that capture many syntactic and semantic relationships between words. We use the word2vec implementation in Gensim [5] to generate word embeddings.
<img src="https://recodatasets.z20.web.core.windows.net/kdd2020/images%2Fword2vec.JPG" width="300">

In [4]:
def train_word2vec(Path_sentences, OutFile_dir):     
    OutFile_word2vec = os.path.join(OutFile_dir, r'word2vec.model')
    OutFile_word2vec_txt = os.path.join(OutFile_dir, r'word2vec.txt')
    create_dir(OutFile_dir)

    print('start to train word embedding...', end=' ')
    my_sentences = MySentenceCollection(Path_sentences)
    # Use more epochs for better accuracy.
    # Note: gensim >= 4.0 renamed `size` to `vector_size` and `iter` to `epochs`.
    model = Word2Vec(my_sentences, size=32, window=5, min_count=1, workers=8, iter=10)

    model.save(OutFile_word2vec)
    model.wv.save_word2vec_format(OutFile_word2vec_txt, binary=False)
    print('\tdone . ')

Path_sentences = os.path.join(InFile_dir, 'sentence.txt')

t0 = time.time()
train_word2vec(Path_sentences, OutFile_dir)
t1 = time.time()
print('time elapses: {0:.1f}s'.format(t1 - t0))

start to train word embedding... 	done . 
time elapses: 218.2s

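The `word2vec.txt` file written above uses the plain-text word2vec format: a header line with the vocabulary size and the vector dimension, followed by one `word v1 ... vN` line per word. As a minimal sketch of how such a file can be inspected, the snippet below parses that format and compares words by cosine similarity. The helper names (`load_word2vec_text`, `cosine`) and the toy 3-dimensional vectors are hypothetical and only for illustration; they are not output of the training run above.

```python
import io
import math


def load_word2vec_text(handle):
    """Parse the plain-text word2vec format from an open text handle."""
    # Header line: "<vocab_size> <dim>"
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for word2vec.txt (made-up vectors, not real training output)
toy_file = io.StringIO(
    "3 3\n"
    "king 0.9 0.1 0.0\n"
    "queen 0.8 0.2 0.0\n"
    "apple 0.0 0.1 0.9\n"
)
vectors, dim = load_word2vec_text(toy_file)
sim_royal = cosine(vectors["king"], vectors["queen"])
sim_fruit = cosine(vectors["king"], vectors["apple"])
print(f"king~queen: {sim_royal:.3f}, king~apple: {sim_fruit:.3f}")
```

To run the same check on the real file, pass `open(os.path.join(OutFile_dir, 'word2vec.txt'), encoding='utf-8')` instead of the in-memory toy file.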

We leverage a graph embedding model to encode entities into embedding vectors.
<img src="https://recodatasets.z20.web.core.windows.net/kdd2020/images%2Fkg-embedding-math.JPG" width="600">
<img src="https://recodatasets.z20.web.core.windows.net/kdd2020/images%2Fkg-embedding.JPG" width="600">
We use an open-source implementation of TransE (https://github.com/thunlp/Fast-TransX) for generating knowledge graph embeddings:

In [5]:
!bash ./run_transE.sh

/mnt/jialia/kdd2020/recommenders/examples/07_tutorials/KDD2020-tutorial
Cloning into 'Fast-TransX'...
remote: Enumerating objects: 439, done.
remote: Total 439 (delta 0), reused 0 (delta 0), pack-reused 439
Receiving objects: 100% (439/439), 10.01 MiB | 13.27 MiB/s, done.
Resolving deltas: 100% (130/130), done.
epoch 0 449712.375000
epoch 1 393208.781250
epoch 2 337558.531250
epoch 3 331067.187500
epoch 4 306186.406250
epoch 5 284518.781250
epoch 6 267733.031250
epoch 7 247449.140625
epoch 8 229839.609375
epoch 9 213476.515625

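For intuition, TransE treats a relation as a translation in the embedding space: for a true triple (head, relation, tail), the vector `head + relation` should land close to `tail`. Training pushes the distance of true triples below that of corrupted triples by a margin. The snippet below is a minimal sketch of that scoring rule and margin loss, not the Fast-TransX implementation. The function names, the L2 distance (TransE can also use L1), the margin value, and the 2-dimensional vectors are all assumptions chosen for illustration.

```python
import math


def transe_distance(head, relation, tail):
    """L2 distance ||h + r - t||; a smaller value means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(positive, negative, margin=1.0):
    """Margin ranking loss for one (true triple, corrupted triple) pair."""
    return max(0.0, margin + transe_distance(*positive) - transe_distance(*negative))


# Made-up 2-dimensional embeddings, for illustration only
paris = [1.0, 0.0]
france = [1.0, 1.0]
tokyo = [3.0, 0.0]
capital_of = [0.0, 1.0]

# paris + capital_of lands exactly on france, so the true triple has distance 0
d_true = transe_distance(paris, capital_of, france)
# Replacing the head with tokyo gives a corrupted triple with a larger distance
d_false = transe_distance(tokyo, capital_of, france)
loss = margin_loss((paris, capital_of, france), (tokyo, capital_of, france))
print(d_true, d_false, loss)
```

In this toy case the true triple already beats the corrupted one by more than the margin, so the pair contributes no loss.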

DKN takes into account both an entity's embedding and its context embedding, which summarizes the entity's neighbors in the knowledge graph.
<img src="https://recodatasets.z20.web.core.windows.net/kdd2020/images/context-embedding.JPG" width="600">
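In the DKN paper [1], the context embedding of an entity is the average of the embeddings of its immediate neighbors in the knowledge graph. The snippet below is a minimal sketch of that averaging under stated assumptions. `build_context_embeddings` is a hypothetical name, the triples are assumed to be `(head, tail, relation)` id tuples, and entities without neighbors are assumed to fall back to their own vector. The tutorial's `gen_context_embedding` helper may differ in these details.

```python
from collections import defaultdict


def build_context_embeddings(entity_vectors, triples):
    """Average the vectors of each entity's one-hop neighbours."""
    # Collect neighbours in both directions, ignoring the relation type.
    neighbours = defaultdict(set)
    for head, tail, _relation in triples:
        neighbours[head].add(tail)
        neighbours[tail].add(head)

    context = {}
    for entity, vector in entity_vectors.items():
        linked = [entity_vectors[n] for n in sorted(neighbours[entity]) if n in entity_vectors]
        if not linked:
            # Assumption: an isolated entity keeps its own vector as context.
            context[entity] = list(vector)
            continue
        # Element-wise mean over all neighbour vectors.
        context[entity] = [sum(values) / len(linked) for values in zip(*linked)]
    return context


# Made-up entity vectors and triples, for illustration only
entity_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}
triples = [(0, 1, 0), (0, 2, 1)]  # assumed order: (head, tail, relation)
context = build_context_embeddings(entity_vectors, triples)
print(context)
```

Entity 0 is linked to entities 1 and 2, so its context vector is the mean of their two vectors, while entity 3 has no neighbors and keeps its own vector.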

In [6]:
# Build context embeddings from the entity embeddings and the KG triples
EMBEDDING_LENGTH = 32
entity_file = os.path.join(OutFile_dir_KG, 'entity2vec.vec')
context_file = os.path.join(OutFile_dir_KG, 'context2vec.vec')
kg_file = os.path.join(OutFile_dir_KG, 'train2id.txt')
gen_context_embedding(entity_file, context_file, kg_file, dim=EMBEDDING_LENGTH)

In [7]:
load_np_from_txt(
    os.path.join(OutFile_dir_KG, 'entity2vec.vec'),
    os.path.join(OutFile_dir_DKN, 'entity_embedding.npy'),
)
load_np_from_txt(
    os.path.join(OutFile_dir_KG, 'context2vec.vec'),
    os.path.join(OutFile_dir_DKN, 'context_embedding.npy'),
)
format_word_embeddings(
    os.path.join(OutFile_dir, 'word2vec.txt'),
    os.path.join(InFile_dir, 'word2idx.pkl'),
    os.path.join(OutFile_dir_DKN, 'word_embedding.npy')
)


## References
\[1\] Wang, Hongwei, et al. "DKN: Deep Knowledge-Aware Network for News Recommendation." Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.<br>
\[2\] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E <br>
 of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>
\[3\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/ <br>
\[4\] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’13). Curran Associates Inc., Red Hook, NY, USA, 3111–3119. <br>
\[5\] Gensim Word2vec embeddings: https://radimrehurek.com/gensim/models/word2vec.html <br>