# CSKG embeddings

This notebook computes similarity between nodes in CSKG and performs grounding of questions/answers to CSKG.

We will play with two different families of embeddings: graph and text embeddings.

## Graph embeddings 

The graph embeddings have been computed by the command:

`python embeddings/embedding_click.py -i input/kgtk_framenet.tsv -o output/kgtk_framenet`

using the `embeddings/embedding_click.py` script in this repository. This command invokes Facebook's PyTorch-BigGraph (PBG) library and computes graph embeddings with the ComplEx algorithm.

We are currently integrating this functionality into the KGTK package to make it more accessible to the AI community.
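
To make the ComplEx scoring idea concrete, here is a minimal sketch of the standard ComplEx score, the real part of the trilinear product of the head, the relation, and the complex conjugate of the tail. The function name `complex_score` and the toy vectors are illustrative assumptions, not part of `embedding_click.py` or PBG, and the sketch does not cover how PBG lays out real and imaginary parts in its exported vectors.

```python
def complex_score(head, relation, tail):
    """ComplEx score: real part of sum_k head_k * relation_k * conj(tail_k)."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("all three vectors must have the same dimension")
    total = sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail))
    return total.real


# Toy 2-dimensional complex embeddings (made-up values, not taken from CSKG).
head = [1 + 1j, 0.5 - 0.5j]
relation = [0 + 1j, 1 + 0j]
tail = [1 - 1j, 0.5 + 0.5j]

forward = complex_score(head, relation, tail)
backward = complex_score(tail, relation, head)
print(forward, backward)
```

Swapping head and tail changes the score, which is why ComplEx can model asymmetric relations.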

## Text embeddings

The text embeddings were computed with the KGTK `text-embedding` command as follows:
```
kgtk text_embedding \
    --embedding-projector-metadata-path none \
    --label-properties "label" \
    --isa-properties "/r/IsA" \
    --description-properties "/r/DefinedAs" \
    --property-value "/r/Causes" "/r/UsedFor" "/r/PartOf" "/r/AtLocation" "/r/CapableOf" \
    "/r/CausesDesire" "/r/SymbolOf" "/r/MadeOf" "/r/LocatedNear" "/r/Desires" "/r/HasProperty" "/r/HasFirstSubevent" \
    "/r/HasLastSubevent" "at:xAttr" "at:xEffect" "at:xIntent" "at:xNeed" "at:xReact" "at:xWant" \
    --has-properties "" \
    -f kgtk_format \
    --output-data-format kgtk_format \
    --model bert-large-nli-cls-token \
    --save-embedding-sentence \
    -i sorted.tsv.gz \
    -p sorted.tsv.gz \
    > cskg_embedings.txt
```

# Setup

```
conda create -n mowgli-env python=3.6 anaconda
source activate mowgli-env

cd grounding
pip install -r requirements.txt
conda install --yes faiss-cpu -c pytorch -n mowgli-env
python -m spacy download en_core_web_lg
conda install -c conda-forge python-annoy
cd ..
```

## I. Load embeddings

In [1]:
from annoy import AnnoyIndex

In [2]:
# Dimension of the embeddings - choose one of 100, 300, 400
dim=100
distance='angular'
trees=10

In [3]:
tsv_filename='../output/embeddings/entity_embedding_%d.tsv' % dim

In [4]:
t = AnnoyIndex(dim, distance)  # dim = length of each item vector to be indexed
node2id={}
id2node={}
with open(tsv_filename, 'r') as f:
    # Each line holds a node label followed by its embedding values
    for i, line in enumerate(f):
        node, *data=line.split()
        v=[float(d) for d in data]
        t.add_item(i, v)
        node2id[node]=i
        id2node[i]=node
t.build(trees) # number of trees (more -> higher precision at query time)
t.save('complex_%d.ann' % dim)

True
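
The loading loop in this section assumes every row carries exactly `dim` values. As a sketch, the parsing step can be separated from Annoy and given a dimension check, which makes a mismatched `dim` setting fail early. The helper name `read_embeddings` and the sample rows are hypothetical, not part of this repository.

```python
def read_embeddings(lines, dim):
    """Parse 'node v1 v2 ... v_dim' lines into lookup tables and vectors."""
    node2id, id2node, vectors = {}, {}, []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        node, values = fields[0], [float(x) for x in fields[1:]]
        if len(values) != dim:
            raise ValueError("%s has %d values, expected %d" % (node, len(values), dim))
        idx = len(vectors)  # ids are assigned in file order, starting at 0
        node2id[node] = idx
        id2node[idx] = node
        vectors.append(values)
    return node2id, id2node, vectors


# Made-up 3-dimensional rows, for illustration only.
sample = ["/c/en/turtle\t0.1\t0.2\t0.3", "", "/c/en/happy\t-0.5\t0.0\t0.25"]
node2id, id2node, vectors = read_embeddings(sample, dim=3)
print(node2id)
```

The returned `vectors` list can then be fed to `add_item` one entry at a time, using the list position as the item id.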

In [5]:
u = AnnoyIndex(dim, distance)
u.load('complex_%d.ann' % dim) # super fast, will just mmap the file

True

## II. Most similar nodes in CSKG

In [6]:
def obtain_similar_nodes(node, num_nodes=10):
    node_id=node2id[node]
    # Ask for one extra neighbor and drop the first hit, which is normally the query node itself
    return [id2node[i] for i in u.get_nns_by_item(node_id, num_nodes+1)[1:]]

In [7]:
nodes=['/c/en/turtle', '/c/en/happy', '/c/en/turtle/n/wn/animal']
for node in nodes:
    print(node, obtain_similar_nodes(node), '\n')

/c/en/turtle ['/c/en/mock_turtle_soup/n', '/c/en/feeder_fish/n', '/c/en/freshwater', '/c/en/neckatee/n', '/c/en/big_cheeks', '/c/en/bockey/n', '/c/en/containing_things', '/c/en/catanadromous/a', '/c/en/downblouse/a', '/c/en/trebbiano/n'] 

/c/en/happy ['/c/en/happies/n', '/c/en/excited', '/c/en/people_who_enjoy_life', '/c/en/exultant/a', 'at:personx_feels_so_good', 'at:personx_finds_____to_play_with', '/c/en/gladsome/a', 'at:personx_makes_some_friends', 'at:personx_is_a_dream_come_true', 'at:personx_is_a_young_girl'] 

/c/en/turtle/n/wn/animal ['rg:en_carunculous', '/c/en/filmically/r', '/c/en/leroij', '/c/en/mata_mata_turtle/n', '/c/en/wide_bodied', '/c/en/surpassive', '/c/en/lepro', '/c/en/cognoscitive', '/c/en/run_red_light', 'at:gains_enemies'] 
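
Annoy returns approximate neighbors, so it can be useful to compare its results against an exact search on a small subset of nodes. The following is a minimal brute-force sketch using cosine similarity. The names `cosine` and `exact_similar_nodes` and the toy vectors are assumptions made for illustration, and the approach is only practical for small node sets.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def exact_similar_nodes(node, vectors_by_node, num_nodes=10):
    """Return the num_nodes most cosine-similar nodes, excluding the query node."""
    query = vectors_by_node[node]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors_by_node.items() if other != node]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:num_nodes]]


# Toy 2-dimensional vectors (made up for this example).
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
print(exact_similar_nodes("a", toy, num_nodes=2))
```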



## III. Compute similarity between two nodes

In [19]:
node1='/c/en/sailor'
node2='/c/en/man'

In [20]:
u.get_distance(node2id[node1], node2id[node2])

1.3065621852874756
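
With the `angular` metric, Annoy reports the Euclidean distance between the normalized vectors, which equals sqrt(2 * (1 - cos(u, v))). The value above is therefore a distance (smaller means more similar), not a cosine similarity. A small sketch of the conversion follows; the helper name `angular_to_cosine` is an assumption for illustration.

```python
def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


# Distance reported above for /c/en/sailor and /c/en/man.
similarity = angular_to_cosine(1.3065621852874756)
print(round(similarity, 3))
```

Under this conversion, a distance of 0 corresponds to a cosine of 1, a distance of sqrt(2) to orthogonal vectors, and a distance of 2 to opposite vectors.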

## IV. Parsing questions and answers

In [1]:
from grounding.graphify import parse


In [2]:
sentences=[
    'Max looked for the onions so that he could  make a stew.',
    'To get the bathroom counters dry after washing your face, take a small hand lotion and wipe away the extra water around the sink.',
    'To get the bathroom counters dry after washing your face, take a small hand towel and wipe away the extra water around the sink.'
]

In [3]:
parse_trees=parse.graphify_dataset(sentences)

100%|██████████| 3/3 [00:00<00:00,  5.99it/s]


In [4]:
for sent_data in parse_trees:
    print('Sentence:', sent_data['sentence'])
    print('Tokenized sentence', sent_data['tokenized_sentence'])
    
    nodes={}
    for n_id, n_data in sent_data['nodes'].items():
        nodes[n_id]=n_data['phrase']
    
    for e_id, e_data in sent_data['edges'].items():
        print('NODE1:', ' '.join(nodes[e_data['head_node_id']]), 'RELATION', e_data['edge_name'], 'NODE2', ' '.join(nodes[e_data['tail_node_id']]) )
    print()

Sentence: Max looked for the onions so that he could  make a stew.
Tokenized sentence ['Max', 'looked', 'for', 'the', 'onions', 'so', 'that', 'he', 'could', 'make', 'a', 'stew', '.']
NODE1: looked RELATION ARG0 NODE2 Max
NODE1: looked RELATION ARG1 NODE2 for the onions
NODE1: looked RELATION ARGM-PRP NODE2 so that he could make a stew
NODE1: make RELATION ARG0 NODE2 he
NODE1: make RELATION ARGM-MOD NODE2 could
NODE1: make RELATION ARG1 NODE2 a stew
NODE1: so that he could make a stew RELATION sub NODE2 make
NODE1: so that he could make a stew RELATION sub NODE2 he
NODE1: so that he could make a stew RELATION sub NODE2 could
NODE1: so that he could make a stew RELATION sub NODE2 a stew
NODE1: Max RELATION coref NODE2 he

Sentence: To get the bathroom counters dry after washing your face, take a small hand lotion and wipe away the extra water around the sink.
Tokenized sentence ['To', 'get', 'the', 'bathroom', 'counters', 'dry', 'after', 'washing', 'your', 'face', ',', 'take', 'a', 'smal
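
The edge-printing loop in this section can also be written as a helper that returns (head, relation, tail) triples, which is easier to reuse downstream than printed text. This sketch relies only on the keys used in the loop (`nodes`, `edges`, `phrase`, `head_node_id`, `tail_node_id`, `edge_name`). The helper name `edges_to_triples`, the node and edge ids, and the hand-built sample are hypothetical and are not real `graphify_dataset` output.

```python
def edges_to_triples(sent_data):
    """Turn one parsed sentence into a list of (head phrase, relation, tail phrase)."""
    phrases = {n_id: " ".join(n["phrase"]) for n_id, n in sent_data["nodes"].items()}
    return [
        (phrases[e["head_node_id"]], e["edge_name"], phrases[e["tail_node_id"]])
        for e in sent_data["edges"].values()
    ]


# Hand-built sample with made-up ids, mirroring the structure used in the loop.
sample = {
    "sentence": "Max looked for the onions.",
    "nodes": {
        "n0": {"phrase": ["looked"]},
        "n1": {"phrase": ["Max"]},
        "n2": {"phrase": ["for", "the", "onions"]},
    },
    "edges": {
        "e0": {"head_node_id": "n0", "tail_node_id": "n1", "edge_name": "ARG0"},
        "e1": {"head_node_id": "n0", "tail_node_id": "n2", "edge_name": "ARG1"},
    },
}
triples = edges_to_triples(sample)
print(triples)
```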

## V. Grounding questions and answers to ConceptNet

In [10]:
from grounding.graphify import link

In [12]:
linked_data=link.link(parse_trees, embedding_file='grounding/numberbatch-en-19.08.txt')

100%|██████████| 516782/516782 [00:37<00:00, 13925.21it/s]


In [16]:
for sent_data in linked_data:
    print('Sentence:', sent_data['sentence'])
    for n_id, n_data in sent_data['nodes'].items():
        print('Node phrase:', n_data['phrase'])
        for c in reversed(n_data['candidates']):
            print(c)
        print()
        
    print()

Sentence: Max looked for the onions so that he could  make a stew.
Node phrase: ['looked']
{'uri': '/c/en/give_glad_eye', 'score': 0.2498762607574463}
{'uri': '/c/en/uplook', 'score': 0.2498762607574463}
{'uri': '/c/en/look', 'score': 0.24307048320770264}
{'uri': '/c/en/lookt', 'score': 0.1385062336921692}
{'uri': '/c/en/looked', 'score': -1.0967254638671875e-05}

Node phrase: ['Max']
{'uri': '/c/en/maxy', 'score': 0.2219405174255371}
{'uri': '/c/en/maxie', 'score': 0.2098349928855896}
{'uri': '/c/en/maxine', 'score': 0.19799458980560303}
{'uri': '/c/en/maximus', 'score': 0.1656038761138916}
{'uri': '/c/en/max', 'score': 2.1219253540039062e-05}

Node phrase: ['for', 'the', 'onions']
{'uri': '/c/en/vidalia_onion', 'score': 0.11641442775726318}
{'uri': '/c/en/onion', 'score': 0.10344946384429932}
{'uri': '/c/en/ingredient_in_salsa', 'score': 0.09622251987457275}
{'uri': '/c/en/onions', 'score': 8.165836334228516e-05}
{'uri': '/c/en/for', 'score': 2.2649765014648438e-06}

Node phrase: ['s
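
In the output above, exact lexical matches such as `/c/en/looked` and `/c/en/max` have scores close to zero, which suggests that the score behaves like a distance where lower is better. Under that assumption (it is not confirmed by the `link` module here), a minimal sketch for keeping only the closest candidate per node could look as follows. The helper name `best_candidates` and the node ids in the sample are hypothetical; the candidate values are copied from the output above.

```python
def best_candidates(linked_sentence):
    """Map each node id to the URI of its lowest-score candidate.

    Assumes 'score' is a distance (lower = closer match).
    """
    best = {}
    for n_id, n_data in linked_sentence["nodes"].items():
        candidates = n_data["candidates"]
        if candidates:  # nodes without candidates are left out
            best[n_id] = min(candidates, key=lambda c: c["score"])["uri"]
    return best


# Sample built from two nodes in the output above; the ids "n0" and "n1" are made up.
sample = {
    "sentence": "Max looked for the onions so that he could  make a stew.",
    "nodes": {
        "n0": {"phrase": ["looked"], "candidates": [
            {"uri": "/c/en/looked", "score": -1.0967254638671875e-05},
            {"uri": "/c/en/lookt", "score": 0.1385062336921692},
            {"uri": "/c/en/look", "score": 0.24307048320770264},
        ]},
        "n1": {"phrase": ["Max"], "candidates": [
            {"uri": "/c/en/max", "score": 2.1219253540039062e-05},
            {"uri": "/c/en/maximus", "score": 0.1656038761138916},
        ]},
    },
}
grounded = best_candidates(sample)
print(grounded)
```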