# Static link prediction
In this notebook, we apply knowledge graph embedding models from the PyKEEN package to generate link predictions on the entire graph, treated as a single static object.

In [1]:
import networkx as nx
from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline



## Read in the network and format

In [2]:
graph = nx.read_graphml('../data/kg/all_drought_dt_co_occurrence_graph_02May2024.graphml')

In [3]:
edgelist = nx.to_pandas_edgelist(graph)
edgelist.head()

Unnamed: 0,source,target,uids_of_origin,first_year_mentioned,is_drought,is_desiccation,num_doc_mentions_all_time
0,peg-induced drought tolerance,sesame,WOS:000623658100043,2021,True,False,1
1,peg-induced drought tolerance,sesame drought tolerance,WOS:000623658100043,2021,True,False,1
2,peg-induced drought tolerance,otsa,WOS:000621810600016,2020,True,False,1
3,peg-induced drought tolerance,p5cr,WOS:000621810600016,2020,True,False,1
4,peg-induced drought tolerance,glgx,WOS:000621810600016,2020,True,False,1


In [4]:
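def get_predicate(row):
    """Map an edge's is_drought / is_desiccation flags to a predicate label."""
    if row.is_drought and row.is_desiccation:
        return 'both'
    if row.is_drought:
        return 'drought'
    if row.is_desiccation:
        return 'desiccation'
    # Neither flag set: return None explicitly, which leaves the predicate missing
    return None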

In [5]:
edgelist['predicate'] = edgelist.apply(get_predicate, axis=1)
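
The branches in `get_predicate` amount to a small truth table over the two flags. As a sketch, the same mapping can be written as a dictionary lookup, which also makes the fourth case (neither flag set) explicit. The names `PREDICATE_BY_FLAGS` and `predicate_from_flags` are hypothetical and not part of the original notebook, and the toy rows only mimic the `is_drought` and `is_desiccation` columns.

```python
from collections import namedtuple

# Hypothetical lookup table: (is_drought, is_desiccation) -> predicate label
PREDICATE_BY_FLAGS = {
    (True, True): 'both',
    (True, False): 'drought',
    (False, True): 'desiccation',
    (False, False): None,  # no label for edges with neither flag set
}


def predicate_from_flags(row):
    """Return the predicate label for a row with boolean flag attributes."""
    return PREDICATE_BY_FLAGS[(bool(row.is_drought), bool(row.is_desiccation))]


# Toy rows standing in for DataFrame rows (column names assumed from the edgelist)
Row = namedtuple('Row', ['is_drought', 'is_desiccation'])
toy_rows = [Row(True, False), Row(True, True), Row(False, True), Row(False, False)]
toy_labels = [predicate_from_flags(r) for r in toy_rows]
```

On the real data this would be applied the same way, via `edgelist.apply(predicate_from_flags, axis=1)`. Counting missing values in the resulting column, for example with `edgelist['predicate'].isna().sum()`, would show whether any edge carries neither flag.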

In [6]:
edgelist.head()

Unnamed: 0,source,target,uids_of_origin,first_year_mentioned,is_drought,is_desiccation,num_doc_mentions_all_time,predicate
0,peg-induced drought tolerance,sesame,WOS:000623658100043,2021,True,False,1,drought
1,peg-induced drought tolerance,sesame drought tolerance,WOS:000623658100043,2021,True,False,1,drought
2,peg-induced drought tolerance,otsa,WOS:000621810600016,2020,True,False,1,drought
3,peg-induced drought tolerance,p5cr,WOS:000621810600016,2020,True,False,1,drought
4,peg-induced drought tolerance,glgx,WOS:000621810600016,2020,True,False,1,drought


In [7]:
# Arrange columns in (head, relation, tail) order, the layout PyKEEN expects for labeled triples
triples = edgelist[['source', 'predicate', 'target']].to_numpy()

In [8]:
tf = TriplesFactory.from_labeled_triples(triples, create_inverse_triples=True)
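
Setting `create_inverse_triples=True` asks PyKEEN to also train on a reversed copy of every triple, each under its own inverse relation. The sketch below illustrates the idea on label-level toy triples. PyKEEN does this internally on integer IDs, so the `_inverse` suffix and the function name `add_inverse_triples` are assumptions made for illustration only.

```python
def add_inverse_triples(triples, suffix='_inverse'):
    """Return the triples plus a reversed copy of each under an inverse relation."""
    inverse = [(tail, relation + suffix, head) for head, relation, tail in triples]
    return list(triples) + inverse


# Toy triples in (head, relation, tail) order, taken from the edgelist preview
toy_triples = [
    ('peg-induced drought tolerance', 'drought', 'sesame'),
    ('peg-induced drought tolerance', 'drought', 'otsa'),
]
augmented = add_inverse_triples(toy_triples)
relations = sorted({relation for _, relation, _ in augmented})
```

The augmented set holds twice as many triples and twice as many relation types as the input.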

## Train a model

In [9]:
training, validation, testing = tf.split([0.8, 0.1, 0.1])

using automatically assigned random_state=3243504460
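
The log line above shows that the split drew its own random seed, so rerunning the cell would likely produce different partitions. The pure-Python sketch below illustrates what a seeded 80/10/10 split does. It is not PyKEEN's implementation, which is understood to also take care that entities stay represented in the training set, and `seeded_split` is a hypothetical helper.

```python
import random


def seeded_split(triples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle triples with a fixed seed and cut them into consecutive parts."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for ratio in ratios[:-1]:
        end = start + round(ratio * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last part takes whatever remains
    return parts


# Toy triples with placeholder labels, just to exercise the helper
toy_triples = [(f'head_{i}', 'drought', f'tail_{i}') for i in range(20)]
train, valid, test = seeded_split(toy_triples, seed=42)
train_again, _, _ = seeded_split(toy_triples, seed=42)
```

In the notebook itself, the equivalent would be to pass a fixed integer through the `random_state` argument of `tf.split`, for instance the value printed in the log.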


In [14]:
result = pipeline(
    training=training,
    validation=validation,
    testing=testing,
    stopper='early',
    model='RESCAL',
    training_kwargs=dict(
        num_epochs=1,
        checkpoint_name='dt_rescal_notebook.pt',
        checkpoint_frequency=0,
    ),
)

INFO:pykeen.pipeline.api:=> no training loop checkpoint file found at '/mnt/home/lotrecks/.data/pykeen/checkpoints/dt_rescal_notebook.pt'. Creating a new file.
INFO:pykeen.pipeline.api:Using device: None
INFO:pykeen.stoppers.early_stopping:Inferred checkpoint path for best model weights: /mnt/home/lotrecks/.data/pykeen/checkpoints/best-model-weights-f73342c0-e282-40c9-b25e-9e508ee7bde5.pt
INFO:pykeen.training.training_loop:=> no checkpoint found at '/mnt/home/lotrecks/.data/pykeen/checkpoints/dt_rescal_notebook.pt'. Creating a new file.
INFO:pykeen.triples.triples_factory:Creating inverse triples.


Training epochs on cpu:   0%|          | 0/1 [00:00<?, ?epoch/s]

INFO:pykeen.triples.triples_factory:Creating inverse triples.


Training batches on cpu:   0%|          | 0/8058 [00:00<?, ?batch/s]


KeyboardInterrupt
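
The run above was interrupted during its single epoch, so the early stopper never had a chance to act. For reference, the general patience-based logic behind early stopping can be sketched as below. This is a generic illustration, not PyKEEN's `EarlyStopper` code: `should_stop`, `patience`, and `min_delta` are names chosen for the example, and the scores are invented toy values.

```python
def should_stop(history, patience=2, min_delta=0.0):
    """Return True once the last `patience` evaluations failed to beat the earlier best."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    recent_best = max(history[-patience:])
    return recent_best <= best_before + min_delta


# Hypothetical validation scores (higher is better), one per evaluation round
toy_scores = [0.10, 0.18, 0.21, 0.20, 0.21]
stop_flags = [should_stop(toy_scores[:i + 1]) for i in range(len(toy_scores))]
```

With `num_epochs=1` there is at most one evaluation, so a stopper of this kind cannot trigger. It only becomes useful once the epoch budget is raised.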



## Use embedding representations to make predictions
Rather than use the embedding model's built-in prediction capabilities, which I have previously found to perform poorly, I would like to start by feeding the learned embeddings into a random forest (RF) model. The idea is to use the node embeddings as features and train a classifier to predict new edges.

In [None]:
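model = result.model
# entity_representations is a list-like container of representation modules,
# not a plain array of embedding vectors
node_reps = model.entity_representations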
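
One way to turn node embeddings into inputs for an edge classifier is sketched below. Each candidate edge becomes the element-wise product of its two node vectors, observed edges are labeled 1, and randomly sampled non-edges are labeled 0. This is an assumption about how the plan above might be carried out, not the notebook's actual method. The toy embeddings, the element-wise product feature, and the helper names `edge_features` and `sample_negative_edges` are all illustrative; real vectors would come from the trained model.

```python
import random


def edge_features(embeddings, edge):
    """Combine two node vectors into one edge vector (element-wise product)."""
    u, v = edge
    return [a * b for a, b in zip(embeddings[u], embeddings[v])]


def sample_negative_edges(nodes, positive_edges, n_samples, seed=0):
    """Draw distinct node pairs that are not observed edges (treated as undirected)."""
    rng = random.Random(seed)
    # Ignore self-loops so the count of available non-edges below stays correct
    observed = {frozenset(e) for e in positive_edges if e[0] != e[1]}
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(observed)
    if n_samples > max_negatives:
        raise ValueError('not enough non-edges to sample from')
    negatives = set()
    while len(negatives) < n_samples:
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in observed:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy 3-dimensional embeddings with invented values; node names reuse the edgelist preview
toy_embeddings = {
    'sesame': [0.5, -1.0, 2.0],
    'otsa': [1.0, 0.5, -0.5],
    'p5cr': [-2.0, 1.0, 0.0],
    'glgx': [0.0, 3.0, 1.0],
}
positive_edges = [('sesame', 'otsa'), ('p5cr', 'glgx')]
negative_edges = sample_negative_edges(sorted(toy_embeddings), positive_edges, n_samples=2)

# Feature matrix and labels: 1 for observed edges, 0 for sampled non-edges
X = [edge_features(toy_embeddings, e) for e in positive_edges + negative_edges]
y = [1] * len(positive_edges) + [0] * len(negative_edges)
```

The resulting `X` and `y` have the layout that a scikit-learn `RandomForestClassifier` expects for `fit(X, y)`. Randomly sampled non-edges are only presumed negatives, since some of them may be true links that are simply missing from the graph.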