# PyKEEN

PyKEEN is a Python package for training knowledge graph embedding models; it abstracts away the training loop and the evaluation. It is built with reproducibility in mind, and the learned embeddings capture the semantics of the entities and relations in the knowledge graph.

To read more about it, please refer to [this article](https://analyticsindiamag.com/complete-guide-to-pykeen-python-knowledge-embeddings-for-knowledge-graphs/).

# Code Implementation

## PyKeen Installation

Installing PyKEEN is simple: a pip install is enough.

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim --user -q --no-warn-script-location




In [None]:
!python -m pip install pykeen==1.0.4 --user -q

In [None]:
# Restart the kernel so that the freshly installed packages are picked up
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
from pykeen.pipeline import pipeline
pipeline_result = pipeline(
    dataset='Nations',
    model='TransE',
)
pipeline_result.save_to_directory('nations_transe')

## Data

PyKEEN provides many open-source datasets as classes that integrate seamlessly with the rest of the package. Let's check out the OpenBioLink knowledge graph in this article.

In [None]:
from pykeen.datasets import OpenBioLink
dataset = OpenBioLink()
training_triples_factory = dataset.training

In [None]:
# Inspect the label-based training triples (head, relation, tail),
# reusing the triples factory loaded in the previous cell
training_triples_factory.triples
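
The triples shown above are label triples of the form (head, relation, tail). For training and evaluation, PyKEEN works with integer-encoded triples (the `mapped_triples` attribute used later in this notebook). The following is a minimal pure-Python sketch of that kind of encoding; the toy triples and the helper `build_index` are invented for illustration and are not part of PyKEEN.

```python
# Toy label triples (head, relation, tail); invented for illustration.
triples = [
    ("brazil", "exports_to", "uk"),
    ("uk", "exports_to", "brazil"),
    ("brazil", "neighbor_of", "argentina"),
]

def build_index(labels):
    """Assign consecutive integer ids to the sorted unique labels (hypothetical helper)."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

entity_to_id = build_index([h for h, _, _ in triples] + [t for _, _, t in triples])
relation_to_id = build_index([r for _, r, _ in triples])

# Integer-encoded triples, analogous in spirit to `mapped_triples`.
mapped = [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in triples]
print(mapped)
```

Embedding models only ever see these integer ids; the labels are needed again when predictions are turned back into readable results.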

## Model, Optimizer and Training Approach

Next, we need to pick an embedding model to learn embeddings from the OpenBioLink knowledge graph. The following code instantiates the TransE model in PyKEEN:

In [None]:
# Pick a model
from pykeen.models import TransE
model = TransE(triples_factory=training_triples_factory)
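
For intuition, TransE treats a relation as a translation in embedding space: a triple (h, r, t) is considered plausible when h + r lies close to t. The sketch below computes such a score in plain Python with hand-picked 2-dimensional vectors. The function name `transe_score` and all values are illustrative assumptions; PyKEEN's own implementation works on batched torch tensors.

```python
def transe_score(h, r, t, p=2):
    """Return -||h + r - t||_p; a score closer to 0 means a more plausible triple."""
    diffs = [abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t)]
    return -sum(d ** p for d in diffs) ** (1.0 / p)

# Hand-picked toy embeddings (not learned).
head = [1.0, 0.0]
relation = [0.0, 1.0]
good_tail = [1.0, 1.0]   # exactly head + relation
bad_tail = [4.0, 5.0]    # far away from head + relation

print(transe_score(head, relation, good_tail))
print(transe_score(head, relation, bad_tail))
```

Training adjusts the embeddings so that known triples score higher than corrupted ones.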

We can use any optimizer from `torch.optim` to train the model.

In [None]:
# Pick an optimizer from Torch
from torch.optim import Adam
optimizer = Adam(params=model.get_grad_params())

We also need to select a training approach for the model: either the stochastic local closed world assumption (sLCWA) or the local closed world assumption (LCWA).

In [None]:
# Pick a training approach (sLCWA or LCWA)
from pykeen.training import SLCWATrainingLoop
training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)
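
Under sLCWA, each known triple is contrasted with negative examples produced by corrupting its head or its tail. The following is a minimal sketch of that corruption step, assuming uniform random replacement; `corrupt` and the toy ids are hypothetical, and this is not PyKEEN's own sampler.

```python
import random

def corrupt(triple, num_entities, rng):
    """Replace the head or the tail of an (h, r, t) id triple with a different random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        # Corrupt the head; redraw until it differs from the original.
        new_h = rng.randrange(num_entities)
        while new_h == h:
            new_h = rng.randrange(num_entities)
        return (new_h, r, t)
    # Otherwise corrupt the tail in the same way.
    new_t = rng.randrange(num_entities)
    while new_t == t:
        new_t = rng.randrange(num_entities)
    return (h, r, new_t)

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
positive = (1, 0, 2)
negatives = [corrupt(positive, num_entities=5, rng=rng) for _ in range(4)]
print(negatives)
```

A sampler like this can occasionally produce a triple that is actually true in the graph; the sketch makes no attempt to filter such cases.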

## Training and Evaluation

We are all set to train the model now. The following command trains it.

In [None]:
training_loop.train(num_epochs=5, batch_size=256)

The following code evaluates the trained model on the test set.

In [None]:
# Pick an evaluator
from pykeen.evaluation import RankBasedEvaluator
evaluator = RankBasedEvaluator()

# Get triples to test
mapped_triples = dataset.testing.mapped_triples

# Evaluate
results = evaluator.evaluate(model, mapped_triples, batch_size=128)
print(results)

In [None]:
results.to_df()

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(results)
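
Rank-based evaluation scores the candidate entities for each test triple and records the rank of the true one. The usual summary metrics can be sketched from a list of such ranks. The ranks below are invented, and `rank_metrics` is a hypothetical helper rather than part of PyKEEN.

```python
def rank_metrics(ranks, k=10):
    """Compute mean rank, mean reciprocal rank and hits@k from 1-based ranks."""
    n = len(ranks)
    return {
        "mean_rank": sum(ranks) / n,
        "mean_reciprocal_rank": sum(1.0 / r for r in ranks) / n,
        f"hits_at_{k}": sum(1 for r in ranks if r <= k) / n,
    }

# Invented ranks of the true entity for four test triples.
ranks = [1, 2, 4, 20]
metrics = rank_metrics(ranks, k=10)
print(metrics)
```

A lower mean rank is better, while higher values are better for mean reciprocal rank and hits@k.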

## Pipeline

PyKEEN provides a high-level entry point called the pipeline. We pass it the dataset, the model, and the training and evaluation settings, and the pipeline takes care of everything required for training.

In [None]:
from pykeen.pipeline import pipeline
pipeline_result = pipeline(
    dataset='Nations',
    model='TransE',
    evaluator='RankBasedEvaluator',
    training_loop='sLCWA',
    negative_sampler='basic',
    model_kwargs=dict(
        scoring_fct_norm=2,
    ),
)
pipeline_result.save_to_directory('nations_transe')

## Hyper Parameter Optimization

PyKEEN provides a hyperparameter optimization function, `pykeen.hpo.hpo_pipeline()`. It uses Optuna as its backend to run the optimization. The following code snippet shows how to optimize the hyperparameters.

In [None]:
from pykeen.hpo import hpo_pipeline
hpo_pipeline_result = hpo_pipeline(
    n_trials=30,
    dataset='Nations',
    model='TransE',
    loss='MarginRankingLoss',
    model_kwargs_ranges=dict(
        embedding_dim=dict(type=int, low=100, high=500, q=100),
    ),
    loss_kwargs_ranges=dict(
        margin=dict(type=float, low=1.0, high=2.0),
    ),
)
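
To make the search space concrete: the ranges above describe an integer `embedding_dim` between 100 and 500 and a float `margin` between 1.0 and 2.0. The sketch below draws 30 random trials from a space of that shape in plain Python. It assumes that `q=100` acts as a step size, uses plain random search as a stand-in for Optuna's samplers, and replaces model training with a made-up `toy_objective`; none of these helpers come from PyKEEN or Optuna.

```python
import random

def sample_trial(rng):
    """Draw one configuration from a search space shaped like the ranges above."""
    return {
        # Assumption: q=100 quantizes the integer range into steps of 100.
        "embedding_dim": rng.choice(range(100, 501, 100)),
        "margin": rng.uniform(1.0, 2.0),
    }

def toy_objective(params):
    """Stand-in for a validation metric; a real run would train and evaluate a model."""
    return -abs(params["embedding_dim"] - 300) / 100 - abs(params["margin"] - 1.5)

rng = random.Random(42)  # fixed seed so repeated runs give the same trials
trials = [sample_trial(rng) for _ in range(30)]
best = max(trials, key=toy_objective)
print(best)
```

In the real pipeline each trial trains and evaluates a model, so the number of trials directly drives the running time.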

## Saving and Restoring Model

PyKEEN models are torch models with utility functions on top, so we can use torch's own functionality to save and reload a model.

In [None]:
import torch
torch.save(model,'trained_model.pkl')
my_pykeen_model = torch.load('trained_model.pkl')

We can also save model checkpoints during training so that training can be resumed if it fails due to a crash. This functionality is enabled through the `training_kwargs` argument of the pipeline.

In [None]:
# Checkpoint settings; pass this dict as pipeline(..., training_kwargs=training_kwargs)
training_kwargs = dict(
    num_epochs=2000,
    checkpoint_name='my_checkpoint.pt',
    checkpoint_directory='doctests/checkpoint_dir',
    checkpoint_frequency=5,
)

In [None]:
dir(model)

## Results

We have taken a knowledge graph and converted all the entities and relations into embeddings. Let’s see some of the interesting information we can extract from these embeddings.

What are the possible phenotypes observed due to the presence of the gene NCBIGENE:534? 

In [None]:
import numpy as np
np.array([['NCBIGENE:534', 'GENE_PHENOTYPE']])[:,0]

In [None]:
# Predict all tails for the given head and relation
predicted_tails_df = model.predict_tails('NCBIGENE:534', 'GENE_PHENOTYPE')
predicted_tails_df  # append .head(10) to show only the first ten rows
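
Conceptually, tail prediction scores every entity as a candidate tail for a fixed head and relation, then sorts the candidates by score. The following toy sketch does this with a TransE-style score; the embeddings, the labels such as `phenotype_a`, and the helper `rank_tails` are all made up for illustration.

```python
def rank_tails(head_vec, rel_vec, entity_vecs):
    """Return (label, score) pairs sorted from most to least plausible tail."""
    def score(tail_vec):
        # TransE-style score: negative Euclidean distance between h + r and t.
        return -sum((h + r - t) ** 2 for h, r, t in zip(head_vec, rel_vec, tail_vec)) ** 0.5

    scored = [(label, score(vec)) for label, vec in entity_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up candidate tail embeddings.
entity_vecs = {
    "phenotype_a": [1.0, 1.0],
    "phenotype_b": [3.0, 1.0],
    "phenotype_c": [1.0, 4.0],
}
ranking = rank_tails([1.0, 0.0], [0.0, 1.0], entity_vecs)
print(ranking)
```

The data frame returned by the model plays the same role as `ranking` here: one row per candidate tail, with a score attached.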

In [None]:
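# Note: 'brazil', 'uk' and 'conferences' are labels from the Nations dataset,
# so these calls need a model trained on Nations (here: pipeline_result.model),
# not the OpenBioLink model above. Prediction method names also differ between
# PyKEEN releases, so check them against the installed version.
nations_model = pipeline_result.model
# Predict relations
predicted_relations_df = nations_model.get_relation_prediction_df('brazil', 'uk')
# Predict heads
predicted_heads_df = nations_model.get_head_prediction_df('conferences', 'brazil')

# Score top K triples
top_k_predictions_df = nations_model.get_all_prediction_df(k=150)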