# Training Our Self-Classified Data Using ComplEx
This notebook shows how to train entity and relation embeddings using ComplEx.

## Prepare train/valid/test set
Before training, we need to split the original DRKG triples into train/valid/test sets in a 9:0.5:0.5 ratio (90% / 5% / 5%).

In [1]:
import dgl

Using backend: pytorch


In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/triplets.tsv', sep = '\t')

triples = df.values.tolist()

We get 2,063,235 triples. Next, we split them into three files.

In [None]:
num_triples = len(triples)
num_triples

In [None]:
# Make sure the output directory (./train) exists before running this cell.
import random
random.seed(0)  # fix the seed so the pseudo-random shuffle is reproducible
seed = np.arange(num_triples)
random.shuffle(seed)

train_cnt = int(num_triples * 0.9)
valid_cnt = int(num_triples * 0.05)
train_set = seed[:train_cnt]
train_set = train_set.tolist()
valid_set = seed[train_cnt:train_cnt+valid_cnt].tolist()
test_set = seed[train_cnt+valid_cnt:].tolist()

# Write each split as tab-separated (head, relation, tail) lines.
with open("train/triple_train.tsv", 'w') as f:
    for idx in train_set:
        f.write("{}\t{}\t{}\n".format(triples[idx][0], triples[idx][1], triples[idx][2]))

with open("train/triple_valid.tsv", 'w') as f:
    for idx in valid_set:
        f.write("{}\t{}\t{}\n".format(triples[idx][0], triples[idx][1], triples[idx][2]))

with open("train/triple_test.tsv", 'w') as f:
    for idx in test_set:
        f.write("{}\t{}\t{}\n".format(triples[idx][0], triples[idx][1], triples[idx][2]))
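
As a quick sanity check on the split above, the following sketch reproduces the index arithmetic in isolation. The helper name `split_indices` is hypothetical and not part of the notebook. It assumes the same truncating 90% / 5% cut, with the test set taking whatever remains, and it uses its own `random.Random` instance, so the resulting order will not match the cell above.

```python
import random

def split_indices(num_items, train_frac=0.9, valid_frac=0.05, seed=0):
    """Shuffle indices 0..num_items-1 and cut them into train/valid/test lists."""
    rng = random.Random(seed)              # local generator; global state is untouched
    indices = list(range(num_items))
    rng.shuffle(indices)
    train_cnt = int(num_items * train_frac)
    valid_cnt = int(num_items * valid_frac)
    train = indices[:train_cnt]
    valid = indices[train_cnt:train_cnt + valid_cnt]
    test = indices[train_cnt + valid_cnt:]  # remainder, so no index is dropped
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because the test set is defined as the remainder, the three lists always cover every index exactly once, even when the fractions do not divide the total evenly.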

## Training ComplEx model
We can train the ComplEx model simply by using the DGL-KE command line. For more information about using DGL-KE, please refer to https://github.com/awslabs/dgl-ke.

Here we train the model using 1 GPU on an AWS p2.xlarge instance.

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset triple --data_path ./train --data_files triple_train.tsv triple_valid.tsv triple_test.tsv --format 'raw_udd_hrt' --model_name ComplEx --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --gpu 0 --neg_sample_size_eval 10000

## Get Entity and Relation Embeddings
The resulting model, i.e., the entity and relation embeddings, can be found under ./ckpts. (Please refer to the first line of the training log for the specific location.)

The overall process will generate 4 important files:

  - Entity embedding: ./ckpts/<model\_name>_<dataset\_name>_<run\_id>/xxx\_entity.npy
  - Relation embedding: ./ckpts/<model\_name>_<dataset\_name>_<run\_id>/xxx\_relation.npy
  - The entity id mapping, formatted as <entity\_name> <entity\_id> pairs: <data\_path>/entities.tsv
  - The relation id mapping, formatted as <relation\_name> <relation\_id> pairs: <data\_path>/relations.tsv
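
The id mapping files can be read back into a dictionary so that an entity or relation name can be translated to its row in the embedding matrix. The sketch below is a minimal example, assuming a two-column tab-separated file with no header. The helper `load_id_map` and its `name_first` flag are hypothetical; the flag exists because the column order should be checked against the generated file rather than taken for granted.

```python
import csv

def load_id_map(path, name_first=True):
    """Read a two-column TSV into a {name: int_id} dictionary."""
    mapping = {}
    with open(path, newline='') as f:
        # QUOTE_NONE keeps names containing quote characters intact.
        for row in csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            name, idx = (row[0], row[1]) if name_first else (row[1], row[0])
            mapping[name] = int(idx)
    return mapping
```

For example, `load_id_map('./train/entities.tsv')` would return the entity mapping if the file lists the name first; pass `name_first=False` if the id comes first.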

In [None]:
!ls ./ckpts/ComplEx_triple_1/
!ls ./train/

## A Glance at the Entity and Relation Embeddings

In [None]:
node_emb = np.load('./ckpts/ComplEx_triple_1/triple_ComplEx_entity.npy')
relation_emb = np.load('./ckpts/ComplEx_triple_1/triple_ComplEx_relation.npy')

print(node_emb.shape)
print(relation_emb.shape)
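
To illustrate how the loaded vectors can be used, the sketch below computes the standard ComplEx score Re(<h, r, conj(t)>) for a single triple. The function name `complex_score` is hypothetical. It also assumes that each stored vector holds the real part in its first half and the imaginary part in its second half; that layout should be verified against the DGL-KE version used for training.

```python
import numpy as np

def complex_score(head, rel, tail):
    """ComplEx score Re(<h, r, conj(t)>) for 1-D vectors laid out as [real | imaginary]."""
    # Assumption: first half of each vector is the real part, second half the imaginary part.
    h_re, h_im = np.split(head, 2)
    r_re, r_im = np.split(rel, 2)
    t_re, t_im = np.split(tail, 2)
    return float(np.sum(
        h_re * r_re * t_re
        + h_im * r_re * t_im
        + h_re * r_im * t_im
        - h_im * r_im * t_re
    ))

# Toy vectors stand in for rows of node_emb and relation_emb.
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 8))
print(complex_score(h, r, t))
```

With the real arrays, a call would look like `complex_score(node_emb[head_id], relation_emb[rel_id], node_emb[tail_id])`, where the ids come from the mapping files in the data path.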