# Installing PyKEEN

To install our extensions for PyKEEN v1.4.0, clone that specific version of PyKEEN onto your machine and follow the instructions in the `README.md` file in [this GitHub directory](https://github.com/sntcristian/and-kge/tree/main/pykeen-extension). Then execute the following steps: <br/>
1. Open the command line in the folder that contains your modified version of PyKEEN.
2. Install the library in development mode.
3. Install the sentence-transformers library (it is used by our preprocessing classes).

In [None]:
cd "/content/drive/MyDrive/thesis_project/pykeen-1.4.0"

/content/drive/MyDrive/thesis_project/pykeen-1.4.0


In [None]:
!pip install -e .
!pip install sentence-transformers

# Importing libraries

**Note:** if you have problems importing PyKEEN inside the Jupyter environment (this happens in Colab), restart the runtime.

In [None]:
import json
import pykeen
import torch
from sklearn.cluster import AgglomerativeClustering
import time
from tqdm import tqdm
import numpy as np
import random

# Load model

You can find the model used in this notebook in this [Zenodo repository](https://zenodo.org/record/5569490#.YW7u4NlBwwQ).

In [None]:
distmult_kge = torch.load("distmult.pkl", map_location=torch.device('cpu'))

# Load data

The data processed by our clustering algorithm is a dictionary of blocks of `(author, publication)` pairs: each key is an ambiguous name, and its value is a list of publications written by the different authors who share that name. The script below shows a sample of the input file used in this notebook, with the ambiguous publications written by authors whose name corresponds to "Ali M".<br/>
The data used in this research is available in this [GitHub repository](https://github.com/sntcristian/and-kge/blob/main/author_disambiguation/OC-782K/data/lnfi_blocks.json).
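
To make the block structure concrete, here is a minimal sketch with made-up data. The identifiers (`ex:ra/1`, `orcid:A`, ...), the variable `toy_blocks` and the helper `count_true_authors` are illustrative assumptions, not part of the dataset. The sketch assumes each entry carries at least the `author`, `work` and `label` keys that the clustering and evaluation code in this notebook reads.

```python
# Toy input in the same shape as the blocks file: one key per ambiguous name,
# one list entry per (author, publication) pair. All identifiers are made up.
toy_blocks = {
    "Ali M": [
        {"author": "ex:ra/1", "work": "ex:br/10", "label": "orcid:A"},
        {"author": "ex:ra/2", "work": "ex:br/11", "label": "orcid:A"},
        {"author": "ex:ra/3", "work": "ex:br/12", "label": "orcid:B"},
    ],
    "Rossi G": [
        {"author": "ex:ra/4", "work": "ex:br/13", "label": "orcid:C"},
        {"author": "ex:ra/5", "work": "ex:br/14", "label": "orcid:D"},
    ],
}


def count_true_authors(blocks):
    """Return, for each ambiguous name, the number of distinct ground-truth labels."""
    return {name: len({entry["label"] for entry in block})
            for name, block in blocks.items()}


author_counts = count_true_authors(toy_blocks)
print(author_counts)
```

A count above 1 means the name is genuinely ambiguous, that is, the block mixes publications by more than one real author.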

In [16]:
with open("lnfi_blocks.json", "r") as f:
    eval_data = json.load(f)

print(eval_data["Ali M"])

[{'author': 'https://github.com/arcangelo7/time_agnostic/ra/20818', 'coauthors': [], 'family_name': 'Ali', 'given_name': 'Mona Farouk', 'label': 'http://orcid.org/0000-0002-2928-6669', 'references': ['https://github.com/arcangelo7/time_agnostic/br/110391', 'https://github.com/arcangelo7/time_agnostic/br/119968', 'https://github.com/arcangelo7/time_agnostic/br/138728', 'https://github.com/arcangelo7/time_agnostic/br/138982', 'https://github.com/arcangelo7/time_agnostic/br/138983', 'https://github.com/arcangelo7/time_agnostic/br/138984', 'https://github.com/arcangelo7/time_agnostic/br/138985', 'https://github.com/arcangelo7/time_agnostic/br/138986', 'https://github.com/arcangelo7/time_agnostic/br/138987', 'https://github.com/arcangelo7/time_agnostic/br/138988', 'https://github.com/arcangelo7/time_agnostic/br/138989', 'https://github.com/arcangelo7/time_agnostic/br/138990', 'https://github.com/arcangelo7/time_agnostic/br/138991', 'https://github.com/arcangelo7/time_agnostic/br/138992', 'h

# Clustering of Knowledge Graph Embeddings (KGEs)

This is the algorithm used to cluster the publications in the `eval_data` dictionary. The function takes as input a KGE model (`distmult_kge`) and several parameters for performing Hierarchical Agglomerative Clustering on the embeddings. <br/>
Each clustered feature is the concatenation of the vector associated with the publication and the vector associated with the author, in order to obtain a more meaningful feature.<br/>
The output of the function mirrors the structure of the input data.
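
The idea can be illustrated without any KGE model. The sketch below is not the scikit-learn implementation used in this notebook: it is a plain-Python stand-in that links every pair of feature vectors whose cosine distance is below the threshold and takes the connected components, which should give the same partition as single-linkage clustering cut at that threshold (cluster numbering may differ). The toy vectors and the names `cosine_distance` and `single_linkage_labels` are assumptions made for illustration, and the vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def single_linkage_labels(features, threshold):
    """Link every pair closer than `threshold`; label the connected components."""
    parent = list(range(len(features)))

    def find(i):
        # Follow parent pointers to the component root (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i):
            if cosine_distance(features[i], features[j]) < threshold:
                parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(features))]


# Made-up 2-dimensional "embeddings" for three (publication, author) pairs.
work_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
author_vecs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]

# Concatenate each publication vector with its author vector.
features = [w + a for w, a in zip(work_vecs, author_vecs)]

labels = single_linkage_labels(features, threshold=0.6)
print(labels)
```

Here the first two pairs point in nearly the same direction and end up in one cluster, while the third is orthogonal to both and stays alone.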

In [17]:
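def cluster_KGEs(model, blocks, affinity_type, linkage, threshold):
    """Cluster the (author, work) pairs of every block with agglomerative clustering.

    model: trained PyKEEN model providing the entity embeddings
    blocks: dict mapping each ambiguous name to a list of entries
    affinity_type, linkage, threshold: passed to AgglomerativeClustering
    """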
    entity_representation_modules = model.entity_embeddings
    entity_to_id = model.triples_factory.entity_to_id
    output_data = dict()
    print("clustering blocks")
    pbar = tqdm(total=len(blocks))
    start_time = time.time()
    n = 0
    for name in blocks.keys():
        block = blocks[name]
        n += 1
        output_data[name] = list()
        works_idx = torch.tensor([entity_to_id[pub["work"]] for pub in block],
                                  dtype=torch.long)
        works_embeddings = entity_representation_modules.forward(indices=works_idx).detach().numpy()

        authors_idx = torch.tensor([entity_to_id[pub["author"]] for pub in block],
                                 dtype=torch.long)
        authors_embeddings = entity_representation_modules.forward(indices=authors_idx).detach().numpy()

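        # One feature per pair: publication vector followed by author vector.
        concat_embeddings = np.hstack((works_embeddings, authors_embeddings))

        # n_clusters=None lets distance_threshold decide how many clusters to keep.
        # Note: scikit-learn >= 1.2 renames `affinity` to `metric` (old name removed in 1.4).
        result = AgglomerativeClustering(n_clusters=None, affinity=affinity_type, linkage=linkage, compute_full_tree=True,
                                        distance_threshold=threshold).fit(concat_embeddings)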

        for entry, cluster_label in zip(block, result.labels_):
            new_d = dict()
            new_d["author"] = entry["author"]
            new_d["work"] = entry["work"]
            new_d["label"] = "disambiguated-"+str(n)+"#"+str(cluster_label)
            output_data[name].append(new_d)
        pbar.update(1)
    pbar.close()
    print("process took %s seconds" % (time.time() - start_time))
    return output_data


In [19]:
cluster_data = cluster_KGEs(model=distmult_kge, blocks=eval_data, affinity_type="cosine", linkage="single", threshold=0.6)

clustering blocks


100%|██████████| 184/184 [00:00<00:00, 1212.27it/s]

process took 0.15309762954711914 seconds





Here is a sample of the output of the function:

In [20]:
print(cluster_data["Ali M"])

[{'author': 'https://github.com/arcangelo7/time_agnostic/ra/20818', 'work': 'https://github.com/arcangelo7/time_agnostic/br/138980', 'label': 'disambiguated-1#2'}, {'author': 'https://github.com/arcangelo7/time_agnostic/ra/20782', 'work': 'https://github.com/arcangelo7/time_agnostic/br/138727', 'label': 'disambiguated-1#3'}, {'author': 'https://github.com/arcangelo7/time_agnostic/ra/247452', 'work': 'https://github.com/arcangelo7/time_agnostic/br/152815', 'label': 'disambiguated-1#1'}, {'author': 'https://github.com/arcangelo7/time_agnostic/ra/197191', 'work': 'https://github.com/arcangelo7/time_agnostic/br/48125', 'label': 'disambiguated-1#0'}]


In [21]:
with open("model_labels.json", "w") as output_file:
    json.dump(cluster_data, output_file, indent=4, sort_keys=True)

# Model evaluation

If the input data comes with ground-truth labels, we can compute the pairwise **Precision**, **Recall** and **F1 Score** of our model by comparing the ground-truth labels with the labels assigned by the model.
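
As a worked example of the pairwise metrics, consider a single block of four publications. Every pair of publications is one comparison: it is a true positive when both labelings put the pair together, a false positive when only the model does, and a false negative when only the ground truth does. The function name `pairwise_scores`, the toy labels and the fallback to `0.0` when a denominator is zero are assumptions of this sketch.

```python
from itertools import combinations


def pairwise_scores(predicted, truth):
    """Return pairwise (precision, recall, F1) for two aligned label lists."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    # Assumption of this sketch: report 0.0 instead of dividing by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = ["A", "A", "A", "B"]   # three publications by author A, one by B
predicted = [0, 0, 1, 1]       # the model splits A and merges one A with B

precision, recall, f1 = pairwise_scores(predicted, truth)
print(precision, recall, f1)
```

Of the six pairs, one is a true positive, one a false positive and two are false negatives, so precision is 1/2, recall is 1/3 and F1 is 0.4. The notebook's own function applies the same counting but pools the pairs of all blocks before computing the scores.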

In [18]:
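def evaluate_no_macro(x_blocks, y_blocks):
    """Compute pairwise precision, recall and F1 over all blocks pooled together.

    x_blocks holds the predicted labels and y_blocks the ground truth. Entries are
    compared by position, so matching blocks must list the same pairs in the same order.
    """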
    true_positive = 0
    true_negative = 0
    false_positive = 0
    false_negative = 0
    pbar = tqdm(total=len(x_blocks.keys()))
    for name in x_blocks.keys():
        cluster_block = x_blocks[name]
        true_block = y_blocks[name]
        for idx1 in range(1, len(cluster_block)):
            cluster_label1 = cluster_block[idx1]["label"]
            true_label1 = true_block[idx1]["label"]
            for idx2 in range(0, idx1):
                cluster_label2 = cluster_block[idx2]["label"]
                true_label2 = true_block[idx2]["label"]
                if cluster_label1 == cluster_label2 and true_label1 == true_label2:
                    true_positive += 1
                elif cluster_label1 == cluster_label2 and true_label1 != true_label2:
                    false_positive += 1
                elif cluster_label1 != cluster_label2 and true_label1 == true_label2:
                    false_negative += 1
                else:
                    true_negative += 1
        pbar.update(1)
    pbar.close()

    total_comparisons = true_positive + false_positive + true_negative + false_negative
    total_positives = true_positive + false_negative
    total_negatives = true_negative + false_positive
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f1_score = 2 * ((precision*recall)/(precision+recall))

    output_dict = {
        "total_comparisons": total_comparisons,
        "total_positives": total_positives,
        "total_negatives": total_negatives,
        "true_positive": true_positive,
        "true_negative": true_negative,
        "false_positive": false_positive,
        "false_negative": false_negative,
        "precision": precision,
        "recall": recall,
        "F1 score": f1_score
    }
    return output_dict

In [23]:
evaluation_output = evaluate_no_macro(cluster_data, eval_data)
print(evaluation_output)

100%|██████████| 184/184 [00:00<00:00, 47674.32it/s]

{'total_comparisons': 3156, 'total_positives': 1484, 'total_negatives': 1672, 'true_positive': 996, 'true_negative': 1582, 'false_positive': 90, 'false_negative': 488, 'precision': 0.9171270718232044, 'recall': 0.6711590296495957, 'F1 score': 0.7750972762645916}



