# Training DRKG Using TransE_l2
This notebook shows how to train DRKG embeddings using the TransE_l2 model.

Before training the model, you need to download the original DRKG source file into your local storage, e.g., ./data/drkg.tsv

## Install DGL-KE
Before training the model, we need to install the dgl and dglke packages, along with PyTorch and any other dependencies.

In [3]:
!pip3 install torch
!pip3 install dgl 
!pip3 install dglke



In [5]:
!pip install --upgrade torch torchvision

Collecting torch
  Downloading torch-1.6.0-cp37-cp37m-manylinux1_x86_64.whl (748.8 MB)
Collecting torchvision
  Downloading torchvision-0.7.0-cp37-cp37m-manylinux1_x86_64.whl (5.9 MB)
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.5.0
    Uninstalling torchvision-0.5.0:
      Successfully uninstalled torchvision-0.5.0
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

fastai 1.0.61 requires nvidia-ml-py3, which is not installed.
Successfully installed torch

## Prepare train/valid/test set
Before training, we need to split the original DRKG triples into train/valid/test sets in a 90:5:5 ratio.

In [8]:
import pandas as pd
import numpy as np
import sys
sys.path.insert(1, '../utils')
from utils import download_and_extract
download_and_extract()
drkg_file = '../data/drkg/drkg.tsv'

df = pd.read_csv(drkg_file, sep="\t")
triples = df.values.tolist()
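
One detail worth checking when loading the file: `pd.read_csv` treats the first line as a header unless told otherwise. The sketch below uses a tiny in-memory TSV with made-up triples (not DRKG data) to show the effect. Whether your copy of `drkg.tsv` has a header row is an assumption to verify before relying on either variant.

```python
import io

import pandas as pd

# Three made-up head/relation/tail triples, tab-separated, with no header row.
sample_tsv = (
    "Gene::1\trel::a\tCompound::1\n"
    "Gene::2\trel::b\tCompound::2\n"
    "Gene::3\trel::a\tDisease::1\n"
)

# Default behaviour: the first line is consumed as column names.
df_default = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# header=None keeps every line as a data row.
df_all = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None)

print(len(df_default), len(df_all))
```

If the file has no header row, the default call silently drops one triple, so the triple count ends up one lower than the number of lines in the file.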

In [4]:
!ls

Edge_score_analysis.ipynb
Edge_similarity_based_on_link_recommendation_results.ipynb
Entity_similarity_analysis.ipynb
Readme.md
Relation_similarity_analysis.ipynb
Train_embeddings.ipynb


We get 5,874,260 triples. Next, we split them into three files.

In [9]:
num_triples = len(triples)
num_triples

5874260

In [10]:
import os

# Create the output directory if it does not exist yet.
os.makedirs("train", exist_ok=True)

# Shuffle the triple indices, then cut them into 90% / 5% / 5%.
seed = np.arange(num_triples)
np.random.shuffle(seed)

train_cnt = int(num_triples * 0.9)
valid_cnt = int(num_triples * 0.05)
train_set = seed[:train_cnt].tolist()
valid_set = seed[train_cnt:train_cnt+valid_cnt].tolist()
test_set = seed[train_cnt+valid_cnt:].tolist()  # whatever remains

# Write each split as tab-separated head/relation/tail lines.
with open("train/drkg_train.tsv", 'w') as f:
    for idx in train_set:
        f.write("{}\t{}\t{}\n".format(triples[idx][0], triples[idx][1], triples[idx][2]))

with open("train/drkg_valid.tsv", 'w') as f:
    for idx in valid_set:
        f.write("{}\t{}\t{}\n".format(triples[idx][0], triples[idx][1], triples[idx][2]))

with open("train/drkg_test.tsv", 'w') as f:
    for idx in test_set:
        f.write("{}\t{}\t{}\n".format(triples[idx][0], triples[idx][1], triples[idx][2]))
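
The split above is reshuffled on every run because the random generator is unseeded. As a minimal sketch of a reproducible variant, the helper below (the name `split_indices` and its `seed` parameter are hypothetical, not part of the notebook) uses a seeded NumPy generator and the same 90/5/5 arithmetic.

```python
import numpy as np


def split_indices(num_triples, train_frac=0.9, valid_frac=0.05, seed=0):
    """Return shuffled, disjoint train/valid/test index arrays."""
    rng = np.random.default_rng(seed)  # fixed seed gives the same split each run
    perm = rng.permutation(num_triples)
    train_cnt = int(num_triples * train_frac)
    valid_cnt = int(num_triples * valid_frac)
    train = perm[:train_cnt]
    valid = perm[train_cnt:train_cnt + valid_cnt]
    test = perm[train_cnt + valid_cnt:]  # the remainder goes to the test set
    return train, valid, test


# Same triple count as reported by the notebook.
train_idx, valid_idx, test_idx = split_indices(5874260)
print(len(train_idx), len(valid_idx), len(test_idx))
```

For 5,874,260 triples this gives 5,286,834 / 293,713 / 293,713 indices, which matches the counts DGL-KE reports when it reads the three files.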

## Training TransE_l2 model
We can train the TransE_l2 model simply by using the DGL-KE command line. For more information about DGL-KE, please refer to https://github.com/awslabs/dgl-ke.

Here we train the model using 8 GPUs on an AWS p3.16xlarge instance.
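
For orientation, TransE treats a relation as a translation from head to tail, and the L2 variant is commonly scored as `gamma - ||h + r - t||_2`, where a higher score means a more plausible triple. The sketch below is only that formula on toy 2-dimensional vectors. The default `gamma=12.0` mirrors the `--gamma` value in this notebook's training command, and the function name is hypothetical. DGL-KE's own implementation is batched and may differ in detail.

```python
import numpy as np


def transe_l2_score(head, rel, tail, gamma=12.0):
    """Score one triple: gamma minus the L2 distance between h + r and t."""
    return gamma - np.linalg.norm(head + rel - tail, ord=2)


h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t_good = np.array([1.0, 1.0])  # exactly h + r, so the distance is 0
t_bad = np.array([4.0, 5.0])   # offset (-3, -4) from h + r, so the distance is 5

print(transe_l2_score(h, r, t_good))  # 12.0
print(transe_l2_score(h, r, t_bad))   # 7.0
```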

In [11]:
!pip install --upgrade dgl-cu101

Requirement already up-to-date: dgl-cu101 in /opt/conda/lib/python3.7/site-packages (0.5.1)


In [19]:
!pip3 install dgl==0.4.3

Collecting dgl==0.4.3
  Using cached dgl-0.4.3-cp37-cp37m-manylinux1_x86_64.whl (3.0 MB)
Installing collected packages: dgl
  Attempting uninstall: dgl
    Found existing installation: dgl 0.5.1
    Uninstalling dgl-0.5.1:
      Successfully uninstalled dgl-0.5.1
Successfully installed dgl-0.4.3


In [13]:
!pip install --upgrade dglke

Requirement already up-to-date: dglke in /opt/conda/lib/python3.7/site-packages (0.1.1)


In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train --data_files drkg_train.tsv drkg_valid.tsv drkg_test.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

Reading train triples....
Finished. Read 5286834 train triples.
Reading valid triples....
Finished. Read 293713 valid triples.
Reading test triples....
Finished. Read 293713 test triples.
|Train|: 5286834
random partition 5286834 edges into 8 parts
part 0 has 660855 edges
part 1 has 660855 edges
part 2 has 660855 edges
part 3 has 660855 edges
part 4 has 660855 edges
part 5 has 660855 edges
part 6 has 660855 edges
part 7 has 660849 edges
|valid|: 293713
|test|: 293713
Total initialize time 19.015 seconds
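
The part sizes in the log above are consistent with cutting the training edges into chunks of `ceil(num_edges / num_parts)`, with the last part taking whatever is left. The sketch below reproduces those numbers. That DGL-KE computes them exactly this way is an assumption, and `partition_sizes` is a hypothetical helper.

```python
import math


def partition_sizes(num_edges, num_parts):
    """Sizes of equal chunks of ceil(num_edges / num_parts); last chunk is the rest."""
    part_size = math.ceil(num_edges / num_parts)
    sizes = []
    remaining = num_edges
    for _ in range(num_parts):
        take = min(part_size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


# 5286834 training triples across 8 GPU processes, as in the log.
sizes = partition_sizes(5286834, 8)
print(sizes)
```

This yields seven parts of 660,855 edges and a final part of 660,849, so each of the 8 processes gets a near-equal share of the training set.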


## Get Entity and Relation Embeddings
The resulting model, i.e., the entity and relation embeddings, can be found under ./ckpts. (Please refer to the first line of the training log for the specific location.)

The overall process generates four important files:

  - Entity embedding: ./ckpts/<model\_name>_<dataset\_name>_<run\_id>/xxx\_entity.npy
  - Relation embedding: ./ckpts/<model\_name>_<dataset\_name>_<run\_id>/xxx\_relation.npy
  - The entity id mapping, formatted as <entity\_name> <entity\_id> pairs: <data\_path>/entities.tsv
  - The relation id mapping, formatted as <relation\_name> <relation\_id> pairs: <data\_path>/relations.tsv
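
The id mapping is what connects an entity name to a row of the embedding matrix. The sketch below shows that lookup on in-memory stand-ins with made-up values. The name-then-id column order follows the description above and should be checked against your generated file, and `load_id_map` is a hypothetical helper.

```python
import csv
import io

import numpy as np


def load_id_map(handle):
    """Read a two-column TSV of <name> <id> rows into a dict of name -> int id."""
    return {name: int(idx) for name, idx in csv.reader(handle, delimiter="\t")}


# Stand-ins for entities.tsv and the entity .npy file (made-up values).
entities_tsv = io.StringIO("Gene::1\t0\nCompound::1\t1\nDisease::1\t2\n")
entity_emb = np.arange(12, dtype=np.float32).reshape(3, 4)  # 3 entities, dim 4

entity_ids = load_id_map(entities_tsv)
vec = entity_emb[entity_ids["Compound::1"]]  # row 1 of the matrix
print(vec)
```

With the real outputs, the same function can be given an open handle on `<data_path>/entities.tsv`, and `entity_emb` would come from `np.load` on the entity embedding file.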

In [2]:
!ls ./ckpts/TransE_l2_DRKG_0/
!ls ./train/

ls: cannot access './ckpts/TransE_l2_DRKG_0/': No such file or directory
drkg_test.tsv  drkg_train.tsv  drkg_valid.tsv


## A Glance at the Entity and Relation Embeddings

In [None]:
node_emb = np.load('./ckpts/TransE_l2_DRKG_0/DRKG_TransE_l2_entity.npy')
relation_emb = np.load('./ckpts/TransE_l2_DRKG_0/DRKG_TransE_l2_relation.npy')

print(node_emb.shape)
print(relation_emb.shape)