<a href="https://colab.research.google.com/github/versant2612/jnotebooks/blob/main/kgtk/03_kg_graph_embeddingsRESCAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install kgtk==1.0.1

Collecting kgtk==1.0.1
  Downloading kgtk-1.0.1-py3-none-any.whl (550 kB)

# KGTK graph-embeddings

The `kgtk graph-embeddings` command takes a KGTK edge file as input and computes embeddings for the graph's nodes and relations.

Please refer to [graph-embeddings documentation](https://kgtk.readthedocs.io/en/latest/analysis/graph_embeddings/) for further details.

In [20]:
import os
from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher
from gensim.models import KeyedVectors
import tempfile
import pandas as pd
import numpy as np
import h5py, torch
from torchbiggraph.model import ComplexDiagonalDynamicOperator, DotComparator, CosComparator
import json

In [21]:
# Parameters

# Folder on the local machine in which the output and temporary folders are created
input_path = None
output_path = "/tmp/projects"
project_name = "tutorial-graph-embeddings"
input_files_url="https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold-profiled"

#### Define a custom JSON file config to download `derived.P31x.tsv`, which is not part of the default download list

In [22]:
extra_files_config = {
    "derived_P31x": "derived.P31x.tsv"
    }

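# Write the config with a context manager so the file is closed before ConfigureKGTK reads it
with open('/root/extra_files.json', 'w') as config_file:
    json.dump(extra_files_config, config_file)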

In [23]:
files = [
    "all",
    "label",
    "alias",
    "description",
    "item",
    "qualifiers",
    "p31",
    "p279star",
    "derived_P31x"
]
ck = ConfigureKGTK(files, input_files_url=input_files_url)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                  json_config_file = '/root/extra_files.json')

User home: /root
Current dir: /content
KGTK dir: /
Use-cases dir: //use-cases
--2021-11-22 02:29:12--  https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold-profiled/all.tsv.gz
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/arnold-profiled/all.tsv.gz [following]
--2021-11-22 02:29:12--  https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/arnold-profiled/all.tsv.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/datasets/arnold-profiled/all.tsv.gz [following]
--2021-11-22 02:29:12--  https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/datasets/arnold-profiled/all.tsv.gz
Resolving raw.githubusercontent.com (raw.github

In [24]:
ck.print_env_variables()

STORE: /tmp/projects/tutorial-graph-embeddings/temp.tutorial-graph-embeddings/wikidata.sqlite3.db
KGTK_LABEL_FILE: /root/isi-kgtk-tutorial/input/labels.en.tsv.gz
EXAMPLES_DIR: //examples
TEMP: /tmp/projects/tutorial-graph-embeddings/temp.tutorial-graph-embeddings
KGTK_GRAPH_CACHE: /tmp/projects/tutorial-graph-embeddings/temp.tutorial-graph-embeddings/wikidata.sqlite3.db
GRAPH: /root/isi-kgtk-tutorial/input
KGTK_OPTION_DEBUG: false
OUT: /tmp/projects/tutorial-graph-embeddings
kypher: kgtk query --graph-cache /tmp/projects/tutorial-graph-embeddings/temp.tutorial-graph-embeddings/wikidata.sqlite3.db
USE_CASES_DIR: //use-cases
kgtk: kgtk
all: /root/isi-kgtk-tutorial/input/all.tsv.gz
label: /root/isi-kgtk-tutorial/input/labels.en.tsv.gz
alias: /root/isi-kgtk-tutorial/input/aliases.en.tsv.gz
description: /root/isi-kgtk-tutorial/input/descriptions.en.tsv.gz
item: /root/isi-kgtk-tutorial/input/claims.wikibase-item.tsv.gz
qualifiers: /root/isi-kgtk-tutorial/input/qualifiers.tsv.gz
p31: /root/is

In [25]:
ck.load_files_into_cache()

kgtk query --graph-cache /tmp/projects/tutorial-graph-embeddings/temp.tutorial-graph-embeddings/wikidata.sqlite3.db -i "/root/isi-kgtk-tutorial/input/all.tsv.gz" --as all  -i "/root/isi-kgtk-tutorial/input/labels.en.tsv.gz" --as label  -i "/root/isi-kgtk-tutorial/input/aliases.en.tsv.gz" --as alias  -i "/root/isi-kgtk-tutorial/input/descriptions.en.tsv.gz" --as description  -i "/root/isi-kgtk-tutorial/input/claims.wikibase-item.tsv.gz" --as item  -i "/root/isi-kgtk-tutorial/input/qualifiers.tsv.gz" --as qualifiers  -i "/root/isi-kgtk-tutorial/input/derived.P31.tsv.gz" --as p31  -i "/root/isi-kgtk-tutorial/input/derived.P279star.tsv.gz" --as p279star  -i "/root/isi-kgtk-tutorial/input/derived.P31x.tsv" --as derived_P31x  --limit 3
node1	label	node2	id	node2;wikidatatype
P10	alias	'gif'@en	P10-alias-en-282226-0	
P10	alias	'animation'@en	P10-alias-en-2f86d8-0	
P10	alias	'media'@en	P10-alias-en-c1427e-0	


In [26]:
# dimension of the output embeddings vector
vector_dimension = 30 

# output path for embeddings file in w2v format
vector_output_w2v_path = f"{os.environ['OUT']}/arnold.embeddings.augmented.{vector_dimension}.w2v.tsv"

os.environ['VECTOR_DIMENSION'] = str(vector_dimension)

## Compute RESCAL Graph Embeddings

In this notebook we will compute graph embeddings for the `arnold` subgraph using the `kgtk graph-embeddings` command and demonstrate a few applications.

The first step is to augment the `claims.wikibase-item.tsv.gz` file with the `derived.P31x.tsv` file, which adds the occupations of humans as `instance of (P31)` edges.

- `claims.wikibase-item.tsv.gz`: KGTK claims file containing only non-literal edges
- `derived.P31x.tsv`: file with additional (computed) P31x links, which add occupation as `instance of`

In [27]:
!kgtk cat -i $item \
-i $GRAPH/derived.P31x.tsv \
-o $GRAPH/claims.wikibase-item.augmented.tsv.gz
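
As a rough illustration of what this augmentation step amounts to, the sketch below concatenates edge files that share the same header and writes the header only once. It is a simplified stand-in and does not attempt to reproduce everything `kgtk cat` does. The function name `concat_edge_files` and the sample rows are hypothetical, and in-memory text is used instead of the gzipped files from the notebook.

```python
import io


def concat_edge_files(handles, out):
    """Concatenate TSV edge files with identical headers into `out`."""
    header = None
    for handle in handles:
        first = handle.readline().rstrip("\n")
        if header is None:
            # The first file defines the header, which is written exactly once.
            header = first
            out.write(header + "\n")
        elif first != header:
            raise ValueError(f"header mismatch: {first!r} != {header!r}")
        for line in handle:
            if line.strip():
                out.write(line if line.endswith("\n") else line + "\n")


# Made-up rows, for illustration only.
claims = io.StringIO("node1\tlabel\tnode2\nQ1\tP31\tQ5\n")
extra = io.StringIO("node1\tlabel\tnode2\nQ1\tP31x\tQ100\n")
merged = io.StringIO()
concat_edge_files([claims, extra], merged)
print(merged.getvalue())
```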

### Run `kgtk graph-embeddings`

The `kgtk graph-embeddings` command takes a KGTK edge file as input and computes graph embeddings of a user-specified type, producing vectors of a user-specified dimension.

The following parameters are used in this instance:

- `-op RESCAL`: compute RESCAL graph embeddings
- `--dimension 30`: desired dimension of the vectors
- `-ot w2v`: output format - w2v
- `--retain_temporary_data True`: retain the byproduct files, which we will use in subsequent steps
- `-T <folder path>`: temporary folder where the temporary files will be stored
- `-i <file>`: input file
- `-o <file>`: output file
- `--log <file>`: log file

**NOTE**: This cell will take ~15 minutes to run on the default Google Colab VM.

In [28]:
kgtk(f""" graph-embeddings
            -op RESCAL \
            --dimension $VECTOR_DIMENSION \
            -ot w2v \
            --retain_temporary_data True \
            -T $TEMP \
            -i $GRAPH/claims.wikibase-item.augmented.tsv.gz \
            -o {vector_output_w2v_path} \
            --log $TEMP/ge.log.txt
    """)

In Processing, Please go to /tmp/projects/tutorial-graph-embeddings/temp.tutorial-graph-embeddings/ge.log.txt to check details
Processed Finished.



#### Take a peek at the embeddings file.

In [29]:
kgtk(f"""head -i {vector_output_w2v_path}""")

In input header '66014 30': 
Invalid Quantity



Unnamed: 0,66014 30
0,Q424388 0.020481726 -0.024514707 0.038093463 -...
1,P2015 -0.038088001 0.001974118 0.025847377 0.0...
2,Q65048002 -0.047681786 -0.002064858 -0.0389210...
3,Q4398244 -0.068573214 0.019802140 -0.019038625...
4,Q358421 0.028888235 -0.011480494 0.054447353 0...
5,Q11953074 0.018478738 0.080301389 -0.034265213...
6,Q29642812 -0.000654230 0.065951467 0.050808102...
7,Q3744866 -0.035372514 0.043066524 0.011590560 ...
8,Q1640949 -0.081472144 -0.090333983 0.007368165...
9,Q492555 0.023064287 0.025230367 0.025415583 0....


The w2v header must contain the number of nodes and the number of dimensions (here `66014 30`). The components of each vector are separated by spaces.
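
To make the layout concrete, here is a minimal sketch of a reader for the word2vec text format: a header line with the vector count and the dimension, followed by one line per ID with space-separated components. The function name `read_w2v_text` and the sample values are hypothetical. In the notebook itself the file is loaded with `gensim`.

```python
import io


def read_w2v_text(handle):
    """Parse word2vec text format into (dimension, {id: [floats]})."""
    header = handle.readline().split()
    count, dimension = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != dimension + 1:
            raise ValueError(f"expected {dimension} components for {parts[0]}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header announces {count} vectors, found {len(vectors)}")
    return dimension, vectors


# Made-up vectors, for illustration only.
sample = io.StringIO("2 3\nQ1 0.1 0.2 0.3\nP1 -0.5 0.0 0.25\n")
dimension, vectors = read_w2v_text(sample)
print(dimension, len(vectors))
```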

### Load the vectors into `gensim`

Loading the vectors into `gensim` lets us find similar vectors based on cosine similarity.

In [30]:
ge_vectors = KeyedVectors.load_word2vec_format(f"{vector_output_w2v_path}", binary=False)

Define a function to compute the `topn` similar vectors, and get the labels and descriptions of the matching Qnodes.

Given a loaded vector space that maps IDs to embeddings, the `gensim` call below returns the `topn` IDs most similar to a given ID or vector (`positive`).

vectors.most_similar(positive=positive, topn=topn)

In [31]:
def kgtk_most_similar(
    vectors,
    positive,
    relation_label="similarity_score",
    add_label_description=True,
    output_path=None,
    topn=25,
):
    """
    Find the topn most similar Qnodes and add a label and description for each.
    
    :param vectors: vector space loaded into gensim KeyedVectors model
    :param positive: vector(s) or Qnode(s) to find similar entities for
    :param relation_label: name of the property to be used for the output file
    :param add_label_description: boolean parameter to add label and description for matched entities
    :param output_path: path to store the output file
    :param topn: desired number of similar entities
    """
    result = []
    if add_label_description:
        fp = tempfile.NamedTemporaryFile(
            mode="w", suffix=".tsv", delete=False, encoding="utf-8"
        )
        fp.write("node1\tlabel\tnode2\n")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            fp.write("{}\t{}\t{}\n".format(qnode, relation_label, similarity))
        filename = fp.name
        fp.close()

        os.environ["_temp_file"] = filename

        result = !$kypher -i label -i description -i "$_temp_file" --as sim \
--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, lab as `node1;label`, des as `node1;description`' \
--order-by 'cast(similarity, float) desc' 
        
        os.remove(filename)
        
    else:
        # No trailing newline here: lines are split into columns or newline-terminated below
        result.append("node1\tlabel\tnode2")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            result.append("{}\t{}\t{}".format(qnode, relation_label, similarity))

    if output_path:
        handle = open(output_path, "w")
        for line in result:
            handle.write(line)
            handle.write("\n")
        handle.close()
    else:
        columns = result[0].split("\t")
        data = []
        for line in result[1:]:
            data.append(line.split("\t"))
        return pd.DataFrame(data, columns=columns)
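
As a rough picture of the ranking that underlies `most_similar`, the sketch below scores every vector against a query by cosine similarity and keeps the best `topn`. It is a simplified pure-Python illustration, not gensim's implementation, and the names `cosine`, `most_similar_sketch` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_sketch(vectors, query_id, topn=10):
    """Return the topn (id, similarity) pairs closest to query_id, excluding itself."""
    query = vectors[query_id]
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items() if key != query_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy two-dimensional vectors, for illustration only.
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
top_two = most_similar_sketch(toy, "A", topn=2)
print([key for key, _ in top_two])
```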

### Link Prediction

The following code reads the vectors for Qnodes as `head` and for Properties as `relation`.

The files used in the code are produced as a byproduct by the `kgtk graph-embeddings` command, in the folder specified by the `-T` option.

In [32]:
# Names of relations and entities, in the index order used by the trained model
with open(f"{os.environ['TEMP']}/output/dynamic_rel_names.json") as f:
    relation_names_list = json.load(f)
with open(f"{os.environ['TEMP']}/output/entity_names_all_0.json") as f:
    entity_names_list = json.load(f)
prop_count = len(relation_names_list)

# operators
# NOTE: ComplexDiagonalDynamicOperator and the "real"/"imag" keys describe a ComplEx model.
# The KeyError in this cell's output suggests that the RESCAL model trained above stores
# its operator under different keys; listing the keys of model.v100.h5 shows what is available.
operator_lhs = ComplexDiagonalDynamicOperator(vector_dimension, prop_count)
operator_rhs = ComplexDiagonalDynamicOperator(vector_dimension, prop_count)
comparator = DotComparator()
cos_comparator = CosComparator()
with h5py.File(f"{os.environ['TEMP']}/output/model/model.v100.h5", "r") as hf:
    operator_state_dict_lhs = {
        "real": torch.from_numpy(hf["model/relations/0/operator/lhs/real"][...]),
        "imag": torch.from_numpy(hf["model/relations/0/operator/lhs/imag"][...]),
    }
    operator_state_dict_rhs = {
        "real": torch.from_numpy(hf["model/relations/0/operator/rhs/real"][...]),
        "imag": torch.from_numpy(hf["model/relations/0/operator/rhs/imag"][...]),
    }

operator_lhs.load_state_dict(operator_state_dict_lhs)
operator_rhs.load_state_dict(operator_state_dict_rhs)

# Load the embeddings
with h5py.File(f"{os.environ['TEMP']}/output/model/embeddings_all_0.v100.h5", "r") as hf:
    arnold_embedding = torch.from_numpy(hf["embeddings"][...])

# Map each entity and relation name to its row index in the model
entity_to_index = {entity: i for i, entity in enumerate(entity_names_list)}
rel_index = {rel: i for i, rel in enumerate(relation_names_list)}

KeyError: ignored

The following function takes a `Qnode` and a `Property` as input and outputs a vector that should be similar to the embedding of the relation's value (the tail).

For example, given Qnode `Q37079` (Tom Cruise) and Property `P166` (award received), it outputs a vector similar to the embeddings of awards. Here `+` stands for applying the relation's operator to the head embedding rather than literal vector addition. We will see this equation in action in the subsequent examples.

head + relation ~= tail *(h,r,?)*

In [33]:
def get_embed(head, relation=None):
    ''' Generate the embedding used to look up tail entities:
            Head only: read directly from the model's entity embeddings
            Head + relation: computed with torch by applying the relation operator to the head
        :param head: subject Qnode
        :param relation: optional property
    '''
    if relation is None:
        return arnold_embedding[entity_to_index[head], :].detach().numpy()
    return  operator_lhs(
                arnold_embedding[entity_to_index[head], :].view(1, vector_dimension),
                torch.tensor([rel_index[relation]])
            ).detach().numpy()[0]
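
To illustrate the idea behind combining a head with a relation, the sketch below assumes the standard RESCAL scoring form, in which a triple is scored as h^T M_r t with one matrix M_r per relation. Under that assumption, the query vector for `(h, r, ?)` is M_r^T h, and candidate tails are ranked by their dot product with it. This is a pure-Python toy, not the PyTorch-BigGraph operator used above, which may apply the matrix in a different orientation. The names `relation_query`, `rank_tails` and all vectors and labels are hypothetical.

```python
def relation_query(head, matrix):
    """Return M^T h, so that dot(query, tail) equals h^T M tail."""
    columns = range(len(matrix[0]))
    return [sum(h_i * row[j] for h_i, row in zip(head, matrix)) for j in columns]


def rank_tails(query, candidates, topn=3):
    """Rank candidate tail names by dot product with the query vector."""
    scores = {
        name: sum(q * t for q, t in zip(query, vec))
        for name, vec in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]


# Toy head embedding, relation matrix and candidate tails, for illustration only.
head = [1.0, 0.0]
relation_matrix = [[0.0, 1.0], [0.5, 0.0]]
candidates = {"award": [0.0, 1.0], "film": [1.0, 0.0], "person": [0.5, 0.5]}

query = relation_query(head, relation_matrix)
print(rank_tails(query, candidates))
```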

#### Get the vector for `Q37079` (Tom Cruise) + `P166` (award received), then find most similar entities

The `get_embed` function takes the head and the relation, "adds" them to approximate the tail, and returns the embedding vector of that approximate tail. This vector is then passed to `kgtk_most_similar` as the query (the `positive` parameter) to obtain the 10 (`topn`) most similar entities.

In [34]:
_vector = get_embed('Q37079', 'P166')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

NameError: ignored

#### Get the vector for `Q170564` (Terminator 2: Judgment Day) + `P161` (cast member), then find most similar entities

In [None]:
_vector = get_embed('Q170564', 'P161')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

#### Get the vector for `Q104123` (Pulp Fiction) + `P161` (cast member), then find most similar entities

In [None]:
_vector = get_embed('Q104123', 'P161')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

#### Get the vector for `Q2685` (Arnold Schwarzenegger), then find most similar entities

Given only a node ID, the `get_embed` function returns the embedding vector of that ID. This vector is then passed to `kgtk_most_similar` as the query (the `positive` parameter) to obtain the 10 (`topn`) most similar entities.

In [35]:
_vector = get_embed('Q2685')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)

NameError: ignored

#### Get the vector for `Q103148` (Lahn River), then find most similar entities

In [None]:
_vector = get_embed('Q103148')
kgtk_most_similar(ge_vectors, positive=[_vector], topn=10)