This notebook will creates all data needed for evaluation.The data includes predictions based on either graph or text embeddings. The predictions will be created once (in theory) and stored on disk.

In [40]:
import predict # a script contains related function for making predictions
import importlib
importlib.reload(predict)

<module 'predict' from '/nas/home/binzhang/cskg/embeddings/predict.py'>

## Parameters for invoking the notebook

input:
- `cue_target`: file path of cue-target.xml (contains the ground truth of USF-FAN dataset)
- `cskg_connected`: file path of cskg_connected.tsv (contains the raw cskg edge information)
- `bert_embs`: file path of bert-nli-large-embeddings.tsv.gz (contains the text embeddings for nodes)
- `kgtk_embs`: file path of trans_log_dot_0.1.tsv.gz (contains the graph embeddings for nodes)

output:

- `trues`: file path of trues.json (conatins the validation data for evaluation)
- `graph_pred`: file path of graph_predictions.json (predictions generated by graph embedding and faiss)
- `text_pred`: file path of text_predictions.json (predictions generated by text embedding and faiss)
- `modi_text_pred`: file path of text_predictions.json (predictions generated by text embedding and faiss), here we add an addition step to proceed the return targets

In [30]:
##input
cue_target = '../input/cue-target.xml'
cskg_connected = '../input/cskg_connected.tsv'
kgtk_embs = '../input/trans_log_dot_0.1.tsv.gz' 
bert_embs = '../input/bert-nli-large-embeddings.tsv.gz'

##output
trues = '../output/trues.json'
graph_pred  = '../output/graph_predictions.json'
text_pred  = '../output/text_predictions.json'
modi_text_pred  = '../output/modi_text_predictions.json'

## Load Common datasets

variables:

- `USF_FAN_dict` : A dictionary whose key is a cue's label, value is a list containing cue's similar targets in decreasing order of similarity <br>e.g. 'turtle': ['slow','shell','tortoise','animal',...]
- `CSKG_label_dict` : A dictionary whose key is the label of the node, value is a list of node IDs, whode node's label is the corresponding key. <br> e.g. 'turtle': ['Q1705322', '/c/en/turtle', ...]
- `CSKG_inv_dict` : A inverted index dictionary recording the correspondence between the ID and label of each node. The key is the node ID, the value is the node's label corresponding to the ID <br> e.g.'Q1705322':'turtle', '/c/en/turtle', 'Q997698':'book'
- `ground_truth`: A dictionary whose key is both in USF_FAN and CSKG, value the same value as the USF_FAN_dict for the cue, this is used as the gold_list

In [3]:
##1. load USF_FAN dataset
USF_FAN_dict = predict.load_truth(cue_target)
print(f"Targets for 'turtle' in USF_FAN: {USF_FAN_dict['turtle'][:5]}...")

Targets for 'turtle' in USF_FAN: ['slow', 'shell', 'tortoise', 'animal', 'green']...


In [4]:
##2. load CSKG edge information(get all labels and nodes data)
CSKG_label_dict,CSKG_inv_dict = predict.load_cskg(cskg_connected)
print(f"Nodes with the label 'turtle' on CSKG': {CSKG_label_dict['turtle'][:5]}...\n")
print(f"Label for node 'Q32945370 on CSKG': {CSKG_inv_dict['Q32945370']}")

Nodes with the label 'turtle' on CSKG': ['/c/en/turtle/n/wikt/en_2', 'Q1705322', '/c/en/turtle/n/wn/artifact', 'Q32945370', '/c/en/turtle/v/wn/motion']...

Label for node 'Q32945370 on CSKG': turtle


In [5]:
##3. compra USF_FAN's label and CSKG's label to get common ones as our ground turth
ground_truth = predict.get_ground_truth(USF_FAN_dict,CSKG_label_dict)
predict.export_cue_targets(ground_truth,trues)
print(f"Length of ground_truth: {len(ground_truth)}, length of USF_FAN_dict: {len(USF_FAN_dict)}")

Length of ground_truth: 5011, length of USF_FAN_dict: 5018


## Create predictions based on graph embeddings

variables:

- `graph_node_emb`: A dictionary whose key is the Node id, value is the graph embeddings for such node.
- `graph_label_emb`: A dictionary whose key is the Node label , value is the average graph embeddings for such node.
- `graph_index`: A faiss index keeps the index for the graph label embeddings.
- `graph_label_ix`:  A dictionary whose key is the graph_index's number, value is the label. This dictionary is aimed at recording each label's position for future mapping.
- `graph_query_dict`:A dictionary whose key is both in ground_truth and CSKG, value is the graph embedding value for labels on CSKG nodes
- `graph_neighbor_dict`:   A dictionary whose key is a label in CSKG, value is a list containing the label's similar targets in decreasing order of cosine similarity, each item in the list is a tuple, first item is the similar target, and second one is the similarity to the label.<br>
example: {'a': [('s', 0.9048489),('more', 0.88388747),('c', 0.8800387)...]...}
- `graph_predictions`: A dictionary with same key with ground_truth , but the value is the list of neighbors generated by faiss searching according to the cue's graph embeddings

In [6]:
%%time
##1.load graph embedding for each node id on CSKG (1min)
graph_node_emb = predict.graph_emb_load(kgtk_embs) 
print(f"Graph embeddings for node 'Q32945370': {graph_node_emb['Q32945370'][:5]}...")

Graph embeddings for node 'Q32945370': [0.254152298, -0.446585357, 0.152848288, 0.144540176, 0.129683152]...
CPU times: user 1min 16s, sys: 3.78 s, total: 1min 19s
Wall time: 1min 19s


In [7]:
%%time
##2.get the embedding for each label, since each label may have multiple nodes, 
## so here we use their average embeddings as the embedding value for such label (30s)
graph_label_emb = predict.get_label_emb(graph_node_emb,CSKG_label_dict)
print(f"Graph embeddings for lable 'turtle': {graph_label_emb['turtle'][:5]}...")

Graph embeddings for lable 'turtle': [0.15604649623076924, -0.14148455738461538, -0.0034780889230769238, -0.1762729807692308, 0.16463685692307692]...
CPU times: user 29.3 s, sys: 2.71 s, total: 32 s
Wall time: 32 s


In [8]:
##3.build a faiss index for graph embddings (10s)
graph_index,graph_label_ix = predict.build_index(graph_label_emb)
print(f"Total number of labels in graph_index: {graph_index.ntotal}")
print(f"The index 9999 points to the label : {graph_label_ix[9999]}")

Total number of labels in graph_index: 1537680
The index 9999 points to the label : untimely


In [9]:
## 4.create query set for CSKG node, here the query set's labels are same as groud truth's labels
graph_query_dict =  predict.create_queryset(ground_truth,graph_label_emb)
print(f" Graph embddings for query label 'black',\n {graph_query_dict['black']}\n{graph_query_dict['black'].shape}")

 Graph embddings for query label 'black',
 [[ 0.17928712 -0.13591672  0.01267633  0.05109588  0.12160839  0.13548541
  -0.05430475 -0.19000961  0.08617514  0.0871693   0.04200997 -0.0019516
   0.01005572 -0.05972502  0.11564244  0.05589715  0.0324163   0.01095609
  -0.11914153  0.09515327 -0.11379138  0.02665885 -0.03155296 -0.06613164
  -0.07865904  0.14886895 -0.00828494  0.07173332 -0.08187395 -0.02142016
  -0.00032857  0.07197008 -0.14334884 -0.03807215  0.01257057 -0.09659239
  -0.0765395   0.13656841 -0.12027948  0.05086154 -0.13277571 -0.01514891
   0.14886029  0.04981698  0.15932456 -0.19051534  0.09474476 -0.06964613
  -0.14163026  0.12131403 -0.00805583 -0.06236116 -0.02707643 -0.17910491
   0.00504959  0.0686815  -0.11555821  0.00990293  0.06747081 -0.07370351
  -0.07396634 -0.07172409 -0.04467475 -0.02662679  0.12883261  0.17289415
   0.08157898 -0.16289166 -0.0800919   0.10152796  0.1769218  -0.0878875
  -0.11570306 -0.02765672  0.15355319  0.00549699  0.07113624  0.150103

In [10]:
##5. Neighbor Searching for all ground truth's labels, Here the returned targets' number for a label is equal to 
# the ones on ground turth => (@X)

# before searching all labels, let's search one label as an example
## e.g : search neighbors for label 'person'
tmp1 = predict.get_label_neighbor(graph_query_dict['person'],graph_index,graph_label_ix,5,include=False)
tmp2 = predict.get_label_neighbor(graph_query_dict['person'],graph_index,graph_label_ix,5,include=True)
print(f"Searching result for label 'person'(not include itself): {tmp1}\n")
print(f"Searching result for label 'person'(include itself):     {tmp2}")

Searching result for label 'person'(not include itself): [('man', 0.98969), ('boy', 0.98480916), ('girl', 0.9805683), ('people', 0.969992), ('black', 0.96583045)]

Searching result for label 'person'(include itself):     [('person', 1.0), ('man', 0.98969), ('boy', 0.98480916), ('girl', 0.9805683), ('people', 0.969992)]


In [11]:
## search all labels
graph_neighbor_dict = predict.neighbor_search(graph_query_dict,ground_truth,graph_index,graph_label_ix,1)
graph_predictions = predict.get_pred_dict(graph_neighbor_dict)
predict.export_cue_targets(graph_predictions,graph_pred)

100%|███████████████████████████████████████| 5011/5011 [07:17<00:00, 11.45it/s]


In [12]:
print(f"Searching result for label 'turtle'on ground turth:   {ground_truth['turtle'][:5]}...")
print(f"Searching result for label 'turtle'on CSKG:           {graph_predictions['turtle'][:5]}...")

print()
print(f"The format for graph_neighbor_dict(use 'turtle' as an example) neighbor_dict['turtle']:\
      {graph_neighbor_dict['turtle'][:5]}...")

Searching result for label 'turtle'on ground turth:   ['slow', 'shell', 'tortoise', 'animal', 'green']...
Searching result for label 'turtle'on CSKG:           ['skeleton', 'rock', 'style', 'frog', 'channel']...

The format for graph_neighbor_dict(use 'turtle' as an example) neighbor_dict['turtle']:      [('skeleton', 0.84700084), ('rock', 0.84658915), ('style', 0.8444997), ('frog', 0.8442888), ('channel', 0.842693)]...


In [14]:
## We can also use different X to do neighbors searching
# for X in [1,2,3,5,10]:
#     graph_neighbor_dict = predict.neighbor_search(graph_query_dict,ground_truth,graph_index,graph_label_ix,X)
#     graph_predictions = predict.get_pred_dict(graph_neighbor_dict)

## Create predictions based on text embeddings

variables:

- `text_node_emb`: A dictionary whose key is the Node id, value is the text embeddings for such node.
- `text_label_emb`: A dictionary whose key is the Node label , value is the average text embeddings for such node.
- `text_index`: A faiss index keeps the index for the text label embeddings.
- `text_label_ix`:  A dictionary whose key is the text_index's number, value is the label. This dictionary is aimed at recording each label's position for future mapping.

- `text_query_dict`:A dictionary whose key is both in ground_truth and CSKG, value is the text embedding value for labels on CSKG nodes

- `text_neighbor_dict`:   A dictionary whose key is a label in CSKG, value is a list containing the label's similar targets in decreasing order of cosine similarity, each item in the list is a tuple, first item is the similar target, and second one is the similarity to the label.<br>
example: {'a': [('s', 0.9048489),('more', 0.88388747),('c', 0.8800387)...]...}
- `text_predictions`: A dictionary with same key with ground_truth , but the value is the list of neighbors generated by faiss searching according to the cue's text embeddings

In [18]:
##1. load text embedding for each node id, due to file size, use tqdm to check process
text_node_emb = predict.txt_emb_load(bert_embs)

100%|███████████████████████████████| 2161048/2161048 [11:58<00:00, 3007.18it/s]


In [19]:
print(f"Text embeddings for node 'Q32945370': {text_node_emb['Q32945370'][:5]}...")

Text embeddings for node 'Q32945370': [-0.20672296, 0.13673855, -0.49276882, -0.13373701, -0.37759098]...


In [20]:
%%time
##2. get the embedding for each label, since each label may have multiple nodes, 
## so here we use their average embeddings (5min)
text_label_emb = predict.get_label_emb(text_node_emb,CSKG_label_dict)
print(f"Text embeddings for lable 'turtle': {text_label_emb['turtle'][:5]}...")

Text embeddings for lable 'turtle': [0.19174036153846152, -0.3205488705384616, 0.09568463307692308, -0.014491030384615376, -0.32387494615384615]...
CPU times: user 3min 37s, sys: 54.5 s, total: 4min 31s
Wall time: 4min 31s


In [21]:
%%time
##3. build a faiss index for text embddings  (1min)
text_index,text_label_ix = predict.build_index(text_label_emb)
print(f"Total number of labels in graph_index: {graph_index.ntotal}")
print(f"The index 9999 points to the label : {graph_label_ix[9999]}")

Total number of labels in graph_index: 1537680
The index 9999 points to the label : untimely
CPU times: user 1min 30s, sys: 24 s, total: 1min 54s
Wall time: 1min 29s


In [23]:
## 4.create query set for CSKG node, here the query set's labels are same as groud truth's labels
text_query_dict =  predict.create_queryset(ground_truth,text_label_emb)
print(f"Text embddings for query label 'black',\n {text_query_dict['black']}\n{text_query_dict['black'].shape}")

Text embddings for query label 'black',
 [[ 0.03769894  0.00105902  0.01100399 ... -0.01912858  0.00293697
  -0.01704775]]
(1, 1024)


In [24]:
##5. Neighbor Searching for all ground truth's labels, Here the returned targets' number for a label is equal to 
# the ones on ground turth => (@X)

# before searching all labels, let's search one label as an example
## e.g : search neighbors for label 'person'
tmp1 = predict.get_label_neighbor(text_query_dict['person'],text_index,text_label_ix,5,include=False)
tmp2 = predict.get_label_neighbor(text_query_dict['person'],text_index,text_label_ix,5,include=True)
print(f"Searching result for label 'person'(not include itself): {tmp1}\n")
print(f"Searching result for label 'person'(include itself):     {tmp2}")

Searching result for label 'person'(not include itself): [('man', 0.9726254), ('men', 0.96999484), ('boy', 0.96972156), ('board', 0.9528501), ('area', 0.95093006)]

Searching result for label 'person'(include itself):     [('person', 1.0000002), ('man', 0.9726254), ('men', 0.96999484), ('boy', 0.96972156), ('board', 0.9528501)]


In [25]:
## search all labels
text_neighbor_dict = predict.neighbor_search(text_query_dict,ground_truth,text_index,text_label_ix,1)
text_predictions = predict.get_pred_dict(text_neighbor_dict)
predict.export_cue_targets(text_predictions,text_pred)

100%|███████████████████████████████████████| 5011/5011 [54:51<00:00,  1.52it/s]


In [27]:
##6.Modified Neighbor Searching 
##
## Since there are too many nosiy targets for text embedding labels, here we process the return 
## targets according to the followsing rules:
  #  1. lev(query_label, target) < threshold, e.g.  lev('give', 'gives')>= 0.8, then disgard 'gives'
  # 2. target not in query_label e.g  'turtle' in 'green turtle', then disgard 'green turtle'

In [41]:
text_neighbor_dict = predict.adp_neighbor_search(text_query_dict,ground_truth,text_index,text_label_ix,1,0.8)
text_predictions = predict.get_pred_dict(text_neighbor_dict)
predict.export_cue_targets(text_predictions,modi_text_pred)

100%|███████████████████████████████████████| 5011/5011 [57:09<00:00,  1.46it/s]


In [42]:
## We can also use different X to do neighbors searching
# for X in [1,2,3,5,10]:
#     # use neighbor_search or adp_neighbor_search
#     text_neighbor_dict = predict.neighbor_search(text_query_dict,ground_truth,text_index,text_label_ix,X) 
#     text_predictions = predict.get_pred_dict(text_neighbor_dict)