# Implementation of Ampligraph
Ampligraph is a library designed to generate knowledge graph embeddings and combine these with model-specific scoring functions to predict unseen and novel links. 

## Setup

Ampligraph presently works only with Tensorflow 1.x, which will no longer be supported as of August 1, 2022. According to [issue #262 of the library repository on GitHub](https://github.com/Accenture/AmpliGraph/issues/262), an update of Ampligraph that works with Tensorflow 2.x is in development, and this notebook will be modified accordingly once that update is released.

In [None]:
# IF RUNNING LOCALLY: install tensorflow version lower than 2 in your working environment
%tensorflow_version 1.x

In [None]:
!pip install ampligraph

In [None]:
# import statements
import numpy as np
import pandas as pd
import ampligraph
import requests

from ampligraph.datasets import load_from_csv
from ampligraph.evaluation import train_test_split_no_unseen
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import evaluate_performance
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score
from ampligraph.utils import create_tensorboard_visualizations


import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)

from scipy.special import expit


In [None]:
# for running locally:
# X = load_from_csv('.', '../data/best_models/knowledge-graph.csv', sep=',')

X = load_from_csv('.', 'knowledge-graph.csv', sep=',')
print(len(X))
X[:5, ]

## Defining train and test datasets
Here the dataset is divided into a training set, used to fit the selected model, and a test set, used to evaluate how well the trained model ranks triples it has not seen.

Every entity in the test set must also appear in at least one training triple, because the model has no embedding for an entity it never saw during training. The `train_test_split_no_unseen` function ensures this by only moving triples into the test set when their entities remain represented in the training set.
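
The idea behind a split with no unseen entities can be illustrated with a small pure-Python sketch. This is not the library's implementation: the function name `split_no_unseen` and the toy triples are hypothetical, and the sketch tracks entities only, whereas the library function may apply further checks.

```python
import random

def split_no_unseen(triples, test_size, seed=0):
    """Hypothetical sketch: move a triple to the test set only if both of
    its entities still occur in some other training triple."""
    rng = random.Random(seed)
    pool = list(triples)
    rng.shuffle(pool)

    # Count how often each entity occurs among the training triples.
    counts = {}
    for s, _, o in pool:
        counts[s] = counts.get(s, 0) + 1
        counts[o] = counts.get(o, 0) + 1

    train, test = [], []
    for s, p, o in pool:
        # A self-loop (s == o) contributes two occurrences of one entity.
        needed = 2 if s == o else 1
        movable = counts[s] > needed and counts[o] > needed
        if len(test) < test_size and movable:
            test.append((s, p, o))
            counts[s] -= 1
            counts[o] -= 1
        else:
            train.append((s, p, o))
    return train, test

# Toy graph (made up for illustration): every entity occurs in three triples.
toy = [
    ("a", "r", "b"), ("b", "r", "c"), ("a", "r", "c"),
    ("c", "r", "d"), ("d", "r", "a"), ("b", "r", "d"),
]
train, test = split_no_unseen(toy, test_size=2)
train_entities = {e for s, _, o in train for e in (s, o)}
test_entities = {e for s, _, o in test for e in (s, o)}
print(len(train), len(test), test_entities <= train_entities)
```

The final check mirrors the guarantee we rely on: the set of test entities is a subset of the set of training entities.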

In [None]:
num_test = int(len(X) * (20 / 100))

data = {}

data['train'], data['test'] = train_test_split_no_unseen(X, test_size=num_test, seed=0, allow_duplication=False) 

print('Train set size: ', data['train'].shape)
print('Test set size: ', data['test'].shape)


## Training the model

The main model parameters, as described in the Ampligraph documentation, are:
- **k** : the dimensionality of the embedding space (the length of each embedding vector).
- **eta** ($\eta$) : the number of negative (false) triples generated at training time for each positive (true) triple.
- **batches_count** : the number of batches into which the training set is split during the training loop.
  - *Context*: if you have a `csv` with 500 rows of data and you set `batches_count` to 5, the data will be divided into 5 batches, each containing 100 samples (500/5).
- **epochs** : the number of epochs to train the model for.
  - *Context*: An epoch is one complete pass through the dataset. Continuing the example above, one epoch is 5 batches (i.e. 5 updates to the model). Training usually runs for many epochs because the model needs to see the same data more than once in order to keep improving.
- **optimizer** : the optimization algorithm. The cell below uses Adam, with a learning rate of 1e-4 set via the `optimizer_params` kwarg.
- **loss** : the loss function. The cell below uses the multiclass negative log-likelihood loss (`multiclass_nll`).
- **regularizer** : $L_p$ regularization. The cell below uses $p=3$ and $\lambda$ = 1e-5, set via the `regularizer_params` kwarg.
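
To make the relationship between `batches_count`, `epochs` and `eta` concrete, here is a short arithmetic sketch for a 500-row dataset. The helper `training_schedule` is hypothetical, and it assumes batches of (near) equal size with one model update per batch.

```python
import math

def training_schedule(n_triples, batches_count, epochs, eta):
    """Hypothetical helper: derive batch size and update counts."""
    batch_size = math.ceil(n_triples / batches_count)
    return {
        "batch_size": batch_size,                 # positive triples per batch
        "updates_per_epoch": batches_count,       # one update per batch
        "total_updates": batches_count * epochs,  # over the whole training run
        "negatives_per_batch": batch_size * eta,  # eta corruptions per positive
    }

schedule = training_schedule(n_triples=500, batches_count=5, epochs=10, eta=15)
print(schedule)
```

With these numbers each batch holds 100 positive triples, and each epoch performs 5 updates.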



In [None]:
model = ComplEx(batches_count=50,
                seed=0,
                epochs=1000,  # pretty much settles down by 400
                k=400,
                eta=15,
                optimizer='adam',
                optimizer_params={'lr': 1e-4},
                loss='multiclass_nll',
                regularizer='LP',
                regularizer_params={'p': 3, 'lambda': 1e-5},
                verbose=True)



In [None]:
model.fit(data['train'], early_stopping = False)

In [None]:
#july15, 100 epochs
model.fit(data['train'], early_stopping = False)

## Evaluating the model

The `evaluate_performance` function is given our test set and returns ranks for the test triples. Each rank is the position of a true triple when the model scores it against corrupted versions of itself, so a rank of 1 means the model considered the true triple the most likely.
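
As a rough illustration of how such a rank can arise, the sketch below ranks one triple against its object-side corruptions. It is a simplification under stated assumptions: the function `rank_against_corruptions`, the toy scores and the entity names are all made up, only the object is corrupted, and ties are ignored, so the library's own procedure may differ in these details.

```python
def rank_against_corruptions(triple, score, entities, known):
    """Hypothetical sketch of a filtered rank (1 = best)."""
    s, p, o = triple
    # Object-side corruptions: replace the object with every other entity.
    candidates = [(s, p, e) for e in entities if e != o]
    # Filtered setting: drop corruptions that are themselves known true triples.
    candidates = [c for c in candidates if c not in known]
    true_score = score(triple)
    # Rank = 1 + number of corruptions scored strictly higher than the true triple.
    return 1 + sum(score(c) > true_score for c in candidates)

# Toy scores standing in for a trained model's scoring function.
toy_scores = {
    ("a", "likes", "b"): 2.0,   # the triple under test
    ("a", "likes", "c"): 3.5,   # a known positive, removed by the filter
    ("a", "likes", "d"): 2.5,   # a corruption that outscores the test triple
    ("a", "likes", "a"): -1.0,
}
entities = ["a", "b", "c", "d"]
known = {("a", "likes", "b"), ("a", "likes", "c")}

filtered = rank_against_corruptions(("a", "likes", "b"), toy_scores.get, entities, known)
unfiltered = rank_against_corruptions(("a", "likes", "b"), toy_scores.get, entities, set())
print(filtered, unfiltered)
```

Passing the known positives as a filter keeps a true triple from being penalised just because another true triple scored higher.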

In [None]:
positives_filter = X

ranks = evaluate_performance(data['test'], 
                             model=model, 
                             filter_triples=positives_filter,   # known true triples to exclude from the corruptions
                             use_default_protocol=True, # corrupt subj and obj separately while evaluating
                             verbose=True)



In [None]:
print(ranks)

### Metrics

The `mrr_score` function computes the mean reciprocal rank (MRR) of the positive triples in the `ranks` vector. Each rank n is converted to its reciprocal, 1/n, and the MRR is the sum of these reciprocals divided by the number of positive triples that were evaluated. The score lies between 0 and 1, and a value closer to 1 means the model tends to rank true triples near the top.
- In [the example given in the Ampligraph documentation](https://docs.ampligraph.org/en/latest/generated/ampligraph.evaluation.mrr_score.html), the first triple is ranked 2, which becomes 1/2 = **0.5**. The second triple is ranked 1, which becomes 1/1 = **1**. The MRR score is then (0.5 + 1) / 2 = **0.75**, where 2 is the total number of triples evaluated.
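
The same calculation in a few lines of plain Python, as a sketch of the formula rather than the library's code (the function name `mean_reciprocal_rank` is ours):

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all evaluated triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Ranks from the documentation example discussed above.
mrr_example = mean_reciprocal_rank([2, 1])
print(mrr_example)  # 0.75
```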

The `hits_at_n_score` is the fraction of true triples that were ranked in the top N (with N being the value we pass as `n`). This tells us how often the model places a true relationship among its N best guesses.
- An explanation of top-N accuracy can be found [here](https://stats.stackexchange.com/q/331508).
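
A minimal sketch of the Hits@N calculation, using made-up ranks and a helper name of our own (`hits_at_n`):

```python
def hits_at_n(ranks, n):
    """Fraction of ranks that are less than or equal to n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

example_ranks = [1, 3, 12, 2, 7]  # illustrative values, not model output
hits_1 = hits_at_n(example_ranks, 1)
hits_3 = hits_at_n(example_ranks, 3)
hits_10 = hits_at_n(example_ranks, 10)
print(hits_1, hits_3, hits_10)  # 0.2 0.6 0.8
```

Hits@N can only grow as N grows, so Hits@1 is the strictest of the three values reported below.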





In [None]:
mrr = mrr_score(ranks)
print("MRR: %.2f" % (mrr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

## Predicting New Links

Link prediction allows us to infer missing links in a graph. To do this, the model is presented with a series of candidate statements and asked to evaluate the likelihood that each one is true.
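
For intuition about what the model computes for a candidate statement, the sketch below evaluates the ComplEx scoring function, $\mathrm{Re}(\sum_k e_{s,k} \, w_{r,k} \, \overline{e_{o,k}})$, on tiny hand-written complex embeddings and passes the result through the logistic sigmoid, which is what `expit` does in the cells that follow. The embedding values are invented for illustration and do not come from the trained model, and the sigmoid output should probably be read as a squashed score rather than a calibrated probability.

```python
import math

def complex_score(e_s, w_r, e_o):
    """ComplEx score: real part of sum_k e_s[k] * w_r[k] * conj(e_o[k])."""
    return sum((s * r * o.conjugate()).real for s, r, o in zip(e_s, w_r, e_o))

def sigmoid(x):
    """Logistic function, equivalent to scipy.special.expit for a scalar."""
    return 1.0 / (1.0 + math.exp(-x))

# Invented two-dimensional complex embeddings for a subject, a relation and an object.
e_s = [1 + 1j, 0.5 - 0.5j]
w_r = [1 + 0j, 0 + 1j]
e_o = [1 - 1j, 0.5 + 0.5j]

forward = complex_score(e_s, w_r, e_o)   # score of (s, r, o)
reverse = complex_score(e_o, w_r, e_s)   # score of (o, r, s)
print(forward, reverse, sigmoid(forward))
```

The forward and reverse scores differ, which reflects why ComplEx suits directional relations such as `bought_from`: swapping subject and object does not have to give the same score.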

In [None]:
X_unseen = np.array([
  ['Giacomo Medici', 'employed', 'Marion True'],
  ['Giacomo Medici','sold_antiquities_to', 'Marion True'],
  ['Marion True', 'bought_from', 'Giacomo Medici'],
  ['Roger Cornelius Russell Yorke', 'bought_from', 'Robin Symes'],
  ['Fritz Bürki', 'sold_antiquities_to', 'Leon Levy'],
  ['Gianfranco Becchina', 'partnered', 'Hischam Aboutaam'],
  ['Robert Hecht', 'sold_antiquities_to', 'Barbara Fleischman']
])


unseen_filter = np.array(list({tuple(i) for i in np.vstack((positives_filter, X_unseen))}))

ranks_unseen = evaluate_performance(
    X_unseen, 
    model=model, 
    filter_unseen=True,
    filter_triples=unseen_filter,   # known and candidate triples to exclude from the corruptions
    corrupt_side='s+o',             # corrupt both subject and object
    use_default_protocol=False,     # use corrupt_side above instead of the default 's,o' protocol
    verbose=True
)

scores = model.predict(X_unseen)

In [None]:
probs = expit(scores)

rankings = pd.DataFrame(list(zip([' '.join(x) for x in X_unseen], 
                      ranks_unseen, 
                      np.squeeze(scores),
                      np.squeeze(probs))), 
             columns=['statement', 'rank', 'score', 'prob']).sort_values("score")

In [None]:
# inspect the scores
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 350)  # full option name avoids ambiguous matches
rankings = rankings.reset_index(drop=True)
rankings


In [None]:
# train/evaluation splits the data, which allows us to evaluate the accuracy of the model
# so now, train a model on the complete knowledge graph, THEN do discovery

model.fit(X)

In [None]:
from ampligraph.latent_features import save_model, restore_model

# for running locally:
# save_model(model, '../data/best_models/best_model.pkl')
save_model(model, 'best_model.pkl')


In [None]:
# reload a model from pickle
from ampligraph.latent_features import restore_model

# for running locally:
# model = restore_model('../data/best_models/best_model.pkl')
model = restore_model('./best_model.pkl')

In [None]:
from ampligraph.discovery import discover_facts

# top_n is the rank cutoff for a candidate to be considered true (here, 1)
discover_facts(X, model, top_n=1, max_candidates=20000, strategy='entity_frequency', target_rel='bought_from', seed=42)


In [None]:
# let's score those, after cleaning out the logically unsound and the already existing statements
# the statements below were compiled from the rank-1 results of every strategy except random and exhaustive

X_unseen = np.array([
  ['Benjamin Bishop Johnson', 'bought_from', 'Fred Drew'],
  ['Charles Craig', 'bought_from', 'David Swetnam'],
  ['Dietrich von Bothmer', 'bought_from', 'Gianfranco Becchina'],
  ['Giacomo Medici', 'bought_from', 'Nikolas Koutoulakis'],
  ['Harry Brown', 'bought_from', 'Johnnie Brown Fell'],
  ['Hydra Gallery', 'bought_from', "Antonio ‘Nino' Savoca"],
  ['J Paul Getty Museum', 'bought_from', 'Frieda Tchacos'],
  ['J Paul Getty Museum', 'bought_from', 'Samuel Schweitzer'],
  ['Joel Malter', 'bought_from', 'Marquis of Tavistock'],
  ['Leon Levy', 'bought_from', 'Fritz Bürki'],
  ['Leon Levy', 'bought_from', 'Fritz Bürki'],
  ['Leon Levy', 'bought_from', 'Fritz Bürki'],
  ['Leonardo Patterson', 'bought_from', 'Clive Hollinshead'],
  ['Marion True', 'bought_from', 'Giacomo Medici'],
  ['Pereda', 'bought_from', 'J Paul Getty Museum'],
  ['Robert Hecht', 'bought_from', 'Robin Symes'],
  ['Roger Cornelius Russell Yorke', 'bought_from', 'Harry Brown'],
  ['Roger Cornelius Russell Yorke', 'bought_from', 'Harry Brown'],
  ['Vaman Ghiya', 'bought_from', 'David Bernstein']
])

unseen_filter = np.array(list({tuple(i) for i in np.vstack((positives_filter, X_unseen))}))

ranks_unseen = evaluate_performance(
    X_unseen, 
    model=model, 
    filter_unseen=True,
    filter_triples=unseen_filter,   # known and candidate triples to exclude from the corruptions
    corrupt_side='s+o',             # corrupt both subject and object
    use_default_protocol=False,     # use corrupt_side above instead of the default 's,o' protocol
    verbose=True
)

scores = model.predict(X_unseen)

probs = expit(scores)

rankings = pd.DataFrame(list(zip([' '.join(x) for x in X_unseen], 
                      ranks_unseen, 
                      np.squeeze(scores),
                      np.squeeze(probs))), 
             columns=['statement', 'rank', 'score', 'prob']).sort_values("score")

# inspect the scores 
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 350)  # full option name avoids ambiguous matches
rankings = rankings.reset_index(drop=True)
rankings


In [None]:
from ampligraph.discovery import find_nearest_neighbours
neighbors, dist = find_nearest_neighbours(model,
                                           entities=['Giacomo Medici','Marion True','Robin Symes'],
                                           n_neighbors=5)

print(neighbors, dist)

In [None]:
from ampligraph.discovery import discover_facts
# top_n is the rank cutoff for a candidate to be considered true (here, 1)

# sold_antiquities_to is inverse of bought_from
# try 'partnered'

p_result = discover_facts(X, model, top_n=1, max_candidates=20000, strategy='cluster_squares', target_rel='partnered', seed=42)


In [None]:
p_result

In [None]:
# let's score those, after cleaning out the logically unsound and the already existing statements

X_unseen = np.array([
  ['Clive Hollinshead', 'partnered', 'Harry Brown'],
  ['Charles Craig', 'partnered', 'Roger Cornelius Russell Yorke'],
  ['United States Customs', 'partnered', 'Royal Canadian Mounted Police'],
  ['Mario Bruno', 'partnered', 'Giacomo Medici'],
  ['Roger Cornelius Russell Yorke', 'partnered', 'Charles Craig'],
  ['Anton Tkalec', 'partnered', 'Mansur Mokhtarzade'],
  ['Michael Kelly', 'partnered', 'Miguel de Osma Berckemeyer'],
  ['Robert Hecht', 'partnered', 'Robin Symes'],
  ['Clive Hollinshead', 'partnered', 'Harry Brown']
])

unseen_filter = np.array(list({tuple(i) for i in np.vstack((positives_filter, X_unseen))}))

ranks_unseen = evaluate_performance(
    X_unseen, 
    model=model, 
    filter_unseen=True,
    filter_triples=unseen_filter,   # known and candidate triples to exclude from the corruptions
    corrupt_side='s+o',             # corrupt both subject and object
    use_default_protocol=False,     # use corrupt_side above instead of the default 's,o' protocol
    verbose=True
)

scores = model.predict(X_unseen)

probs = expit(scores)

rankings = pd.DataFrame(list(zip([' '.join(x) for x in X_unseen], 
                      ranks_unseen, 
                      np.squeeze(scores),
                      np.squeeze(probs))), 
             columns=['statement', 'rank', 'score', 'prob']).sort_values("score")

# inspect the scores 
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 350)  # full option name avoids ambiguous matches
rankings = rankings.reset_index(drop=True)
rankings




In [None]:
from ampligraph.discovery import query_topn

query_topn(model, top_n=3,
           head='Marion True', relation='partnered', tail=None,
           ents_to_consider=None, rels_to_consider=None)

## Visualizing with TensorBoard


In [None]:
# reload a model from pickle
from ampligraph.latent_features import restore_model

# for running locally:
# model = restore_model('../data/best_models/best_model.pkl')
model = restore_model('./best_model.pkl')


In [None]:
from ampligraph.utils import create_tensorboard_visualizations

In [None]:
create_tensorboard_visualizations(model, '4thtc_embeddings')

In [None]:
# restart the runtime to reset tensorflow to 2.x

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
%tensorboard --logdir ./4thtc_embeddings

Another codebase for further visualizations:

- https://github.com/roosyay/CoDa_Hypotheses/blob/master/4.%20Visualisation.ipynb
- https://link.springer.com/chapter/10.1007/978-3-030-77385-4_28

In [None]:
!zip -r out.zip 4thtc_embeddings/