A notebook for embedding CSKG edge sentences and analyzing their clusters.

In [1]:
import clustering  # helper script with functions for sentence embedding and clustering
import importlib
importlib.reload(clustering)
import warnings
warnings.filterwarnings('ignore')

## Parameters for invoking the notebook

input:

- `cskg_connected`: file path of cskg_connected.tsv (contains the raw CSKG edge information)
- `cskg_connected_dim`: file path of cskg_connected_dim.tsv.gz (contains the dimension-based (manual) clustering result)

output:
- `cskg_lexicalized`: file path of cskg_lexicalized.tsv (contains the lexicalization of each CSKG edge); each line has three tab-separated items: edge_id, lexicalization, and sentence
- `edge_embeddings_bert`: file path of edge_embeddings_bert.tsv (contains each edge id and its embedding generated by sentence-transformer-bert on CSKG; this is generated data, not raw data)
- `edge_embeddings_roberta`: file path of edge_embeddings_roberta.tsv (contains each edge id and its embedding generated by sentence-transformer-roberta on CSKG; this is generated data, not raw data)

- `clstr_bert`: file path of clstr_bert.tsv (contains the automatic clustering result from the sentence-transformer-bert model)
- `clstr_roberta`: file path of clstr_roberta.tsv (contains the automatic clustering result from the sentence-transformer-roberta model)

- `log_bert`: a folder that holds the TensorBoard projector configuration for the BERT sentence embeddings and their predicted cluster labels
- `log_roberta`: a folder that holds the TensorBoard projector configuration for the RoBERTa sentence embeddings and their predicted cluster labels
- `log_bert_human`: a folder that holds the TensorBoard projector configuration for the BERT sentence embeddings and their human-assigned cluster labels
- `log_roberta_human`: a folder that holds the TensorBoard projector configuration for the RoBERTa sentence embeddings and their human-assigned cluster labels

In [2]:
## input
cskg_connected = '../input/cskg_connected.tsv'
cskg_connected_dim = '../input/cskg_connected_dim.tsv.gz'
## output
cskg_lexicalized = '../output/cskg_lexicalized.tsv'
edge_embeddings_bert= '../output/edge_embeddings_bert.tsv'
edge_embeddings_roberta = '../output/edge_embeddings_roberta.tsv'
clstr_bert = '../output/clstr_bert.tsv'
clstr_roberta = '../output/clstr_roberta.tsv'
log_bert = '../output/log_bert'
log_roberta = '../output/log_roberta'
log_bert_human = '../output/log_bert_human'
log_roberta_human = '../output/log_roberta_human'

## Load Datasets

- `manually_res`: a dictionary whose key is the edge id and whose value is the manually assigned cluster label
- `edge_list`: a list of tuples holding each edge's nodes and relation information; each tuple has the format (edge_id, node1_lbl, rel_lbl, node2_lbl, rel_meta)
- `rel_dict`: a dictionary whose key is the relation ID and whose value is itself a dictionary mapping each relation label used for that relation ID to the number of times the label appears in CSKG <br> example: '/r/IsA': {'is a': 242358, 'subproperty of': 1, 'subclass of': 47501, 'instance of': 26685}

In [6]:
## 0. load the manually generated cluster result (for the final comparison)
manually_res = clustering.load_clstr_hand(cskg_connected_dim)
print(f"clustering label for edge  id '/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000': \
{manually_res['/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000']}\n")

print(f"edges on manually_res: {len(manually_res)}",end='\t')
print(f"cluster number: {len(set(manually_res.values()))}")

clustering label for edge  id '/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000': taxonomic

edges on manually_res: 5822389	cluster number: 12


In [4]:
## 1. load CSKG edges and get each edge's nodes and relation info
edge_list = clustering.get_edge(cskg_connected)
rel_dict = clustering.rel_mapping(edge_list) 

print(f"an edge on CSKG:{edge_list[0]}")
print()
print(f"labels for relation ID '/r/IsA': {rel_dict['/r/IsA']}")

an edge on CSKG:('/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000', '0.22 inch calibre', 'is a', '5.6 millimetres', '/r/IsA')

labels for relation ID '/r/IsA': {'is a': 242358, 'subproperty of': 1, 'subclass of': 47501, 'instance of': 26685}
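
To make the shape of `rel_dict` concrete, the nested label counts could be built roughly as follows. This is only a minimal sketch and not the actual `clustering.rel_mapping`; the name `rel_mapping_sketch` and the toy edges are made up for illustration, and we assume each tuple follows the (edge_id, node1_lbl, rel_lbl, node2_lbl, rel_meta) format described above.

```python
from collections import Counter, defaultdict


def rel_mapping_sketch(edge_list):
    """Count how often each relation label occurs under each relation ID.

    Hypothetical stand-in for clustering.rel_mapping; the real helper may differ.
    """
    counts = defaultdict(Counter)
    for _edge_id, _node1_lbl, rel_lbl, _node2_lbl, rel_meta in edge_list:
        counts[rel_meta][rel_lbl] += 1
    # convert to plain dicts so the result prints like the notebook output
    return {rel_id: dict(label_counts) for rel_id, label_counts in counts.items()}


# toy edges, invented for this example only
toy_edges = [
    ("e1", "cat", "is a", "animal", "/r/IsA"),
    ("e2", "dog", "is a", "animal", "/r/IsA"),
    ("e3", "poodle", "subclass of", "dog", "/r/IsA"),
    ("e4", "cup", "on", "table", "/r/LocatedNear"),
]
toy_rel_dict = rel_mapping_sketch(toy_edges)
print(toy_rel_dict["/r/IsA"])  # {'is a': 2, 'subclass of': 1}
```

Sorting `toy_rel_dict.items()` by `len(x[1])` then ranks relation IDs by how many distinct labels they carry, which is what the next cell does on the full data.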


In [5]:
### small finding: the relation ID with the most distinct labels is '/r/LocatedNear', with 8752 labels
## (worth a deeper dive?)
relation_types = sorted(rel_dict.items(),key=lambda x:len(x[1]),reverse=True)
# even though these labels all belong to one relation meta type, their descriptions vary widely
print(len(relation_types[0][1]),list(relation_types[0][1].items())[:20])

8752 [('has', 16438), ('on', 27337), ('on a', 1341), ('in', 13550), ('of', 7047), ('of a', 1069), ('of an', 60), ('behind', 6378), ('facing', 97), ('carries', 66), ('full of', 211), ('of street', 2), ('attaches to', 2), ('black', 76), ('has a', 8677), ('doors on', 1), ('made of', 843), ('near', 2419), ('horses head', 1), ('attached to', 1417)]


## Create lexicalization

- `rel_template`: a manually built dictionary that holds the template for each relation type
- `edge_sent_list`: a list of tuples holding each edge's id, lexicalization, and generated sentence; each tuple has the format (edge_id, lexicalization, sentence)

In [6]:
## 2. get the relation templates and generate a lexicalization for each edge
rel_template = clustering.rel_template
## create cskg_lexicalized.tsv for future use
edge_sent_list = clustering.create_lexi(edge_list,rel_template,cskg_lexicalized)
print(f"rel_template['/r/SimilarTo']: {clustering.rel_template['/r/SimilarTo']}")

rel_template['/r/SimilarTo']: is similar to
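
The idea of turning an edge into a sentence with a template can be sketched as below. This is an assumption-laden illustration, not the logic of `clustering.create_lexi`: the function `lexicalize_sketch` is hypothetical, the fallback to the edge's own relation label is our guess, and only the '/r/SimilarTo' template is taken from the output above.

```python
def lexicalize_sketch(node1_lbl, rel_lbl, node2_lbl, rel_meta, templates):
    """Build a plain sentence for one edge.

    Hypothetical helper: uses the template phrase for the relation ID when one
    exists, otherwise falls back to the edge's own relation label (an assumption).
    """
    phrase = templates.get(rel_meta, rel_lbl)
    return f"{node1_lbl} {phrase} {node2_lbl}"


# only this entry is known from the notebook output; the real dictionary is larger
toy_templates = {"/r/SimilarTo": "is similar to"}

# node labels 'happy' and 'glad' are invented for the example
print(lexicalize_sketch("happy", "similar to", "glad", "/r/SimilarTo", toy_templates))
# no template for '/r/IsA' in the toy dictionary, so the edge label 'is a' is used
print(lexicalize_sketch("0.22 inch calibre", "is a", "5.6 millimetres", "/r/IsA", toy_templates))
```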


## Sentence embedding

- `sent_embs_bert`: a list of tuples, each containing an edge's id and the edge's sentence embedding generated by sentence-transformer-bert
- `sent_embs_roberta`: a list of tuples, each containing an edge's id and the edge's sentence embedding generated by sentence-transformer-roberta

We have stored them in TSV files for reuse, so they can be imported with the load_sent_emb() function,
although loading still takes a long time (about an hour per file in the run below).

In [7]:
%%time
## 3. use two different pretrained models to compute sentence embeddings (more than 10h!)
# first check whether the sentence embeddings already exist; if so, load them,
# otherwise generate them with sentence-transformers
try:
    print('load sent_embs_bert...')
    sent_embs_bert = clustering.load_sent_emb(edge_embeddings_bert)
    print(f"# of edge:{len(sent_embs_bert)}, dimension for each setence vec: {len(sent_embs_bert[0][1])}\n")
    print('load sent_embs_roberta...')
    sent_embs_roberta = clustering.load_sent_emb(edge_embeddings_roberta) 
    print(f"# of edge:{len(sent_embs_roberta)}, dimension for each setence vec: {len(sent_embs_roberta[0][1])}")
    
except Exception:  # not a bare except, so Ctrl-C does not trigger the 10h regeneration
    print('no existing files, now we generate them (takes more than 10hs!)')
    sent_embs_bert = clustering.get_sent_emb('bert-large-nli-stsb-mean-tokens',edge_sent_list,edge_embeddings_bert)
    # fixed: call the clustering module (not the undefined `ra`) and write to the roberta file
    sent_embs_roberta = clustering.get_sent_emb('roberta-large-nli-stsb-mean-tokens',edge_sent_list,edge_embeddings_roberta)

  0%|          | 4053758/121448308839 [00:00<49:59, 40482314.16it/s]

load sent_embs_bert...


100%|█████████▉| 121448284267/121448308839 [1:01:32<00:00, 32893585.61it/s]
  0%|          | 4499978/120706669991 [00:00<44:53, 44813168.16it/s]

# of edge:5957575, dimension for each setence vec: 1024

load sent_embs_roberta...


100%|█████████▉| 120706645419/120706669991 [1:00:39<00:00, 33161709.72it/s]

# of edge:5957575, dimension for each setence vec: 1024
CPU times: user 1h 49min 4s, sys: 12min 47s, total: 2h 1min 51s
Wall time: 2h 2min 12s
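
The load-or-generate pattern used in this step can be sketched on its own. The code below is a minimal illustration under our own assumptions: the TSV layout (edge id, tab, space-separated floats) and the names `save_embeddings`, `load_embeddings`, and `load_or_compute` are hypothetical and need not match what `clustering.load_sent_emb` and `clustering.get_sent_emb` actually do. The encoder is passed in as a callable so the sketch runs without downloading a model.

```python
import csv
import os


def save_embeddings(path, embs):
    """Write (edge_id, vector) pairs as 'edge_id<TAB>v1 v2 ...' (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for edge_id, vec in embs:
            writer.writerow([edge_id, " ".join(repr(float(x)) for x in vec)])


def load_embeddings(path):
    """Read the file written by save_embeddings back into a list of tuples."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        return [(edge_id, [float(x) for x in vec.split()]) for edge_id, vec in reader]


def load_or_compute(path, edge_sentences, encode):
    """Return cached embeddings if the file exists, otherwise encode and cache them.

    edge_sentences: list of (edge_id, sentence) pairs
    encode: callable mapping a list of sentences to a list of vectors
    """
    if os.path.exists(path):
        return load_embeddings(path)
    edge_ids = [edge_id for edge_id, _ in edge_sentences]
    vectors = encode([sentence for _, sentence in edge_sentences])
    embs = [(edge_id, [float(x) for x in vec]) for edge_id, vec in zip(edge_ids, vectors)]
    save_embeddings(path, embs)
    return embs
```

With the sentence-transformers package installed, `encode` could be the `encode` method of `SentenceTransformer('bert-large-nli-stsb-mean-tokens')`. Checking for the file explicitly, rather than catching any exception, keeps unrelated errors from silently starting a multi-hour regeneration.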





## Edge Clustering by k-means
- `clstr_res_bert`: a dictionary whose key is the edge id and whose value is the cluster label predicted by k-means on the bert-large-nli-stsb-mean-tokens embeddings
- `clstr_res_roberta`: a dictionary whose key is the edge id and whose value is the cluster label predicted by k-means on the roberta-large-nli-stsb-mean-tokens embeddings

In [3]:
%%time
## 4. cluster the sentence embeddings with k-means
# first check whether the cluster results already exist; if so, load them,
# otherwise generate them from the sentence embeddings (strictly speaking, this check
# should come before step 3 when the cluster results are known to be ready)
try:
    print('load cluster result generated by bert model...')
    clstr_res_bert = clustering.load_clstr_auto(clstr_bert)
    print(f"cluster label for '/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000'\
    {clstr_res_bert['/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000']}")
    print('load cluster result of roberta...')
    clstr_res_roberta = clustering.load_clstr_auto(clstr_roberta)
    print(f"cluster label for '/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000'\
    {clstr_res_roberta['/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000']}")
    
except Exception:
    print('no existing files, now we generate them...')
    clstr_res_bert = clustering.edge_cluster(sent_embs_bert,clstr_bert,cluster_num=13)
    # fixed: pass the output path clstr_roberta, as in the bert call above
    clstr_res_roberta = clustering.edge_cluster(sent_embs_roberta,clstr_roberta,cluster_num=13)

load cluster result generated by bert model...
cluster label for '/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000'    3
load cluster result of roberta...
cluster label for '/c/en/0.22_inch_calibre-/r/IsA-/c/en/5.6_millimetres-0000'    3
CPU times: user 12.8 s, sys: 3.55 s, total: 16.3 s
Wall time: 16.5 s
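
For readers without access to `clustering.edge_cluster`, the k-means step could look roughly like the sketch below. It assumes scikit-learn and NumPy are installed and that the embeddings come as (edge_id, vector) tuples; the function name `cluster_edges_sketch` and the `seed` parameter are ours, and the real helper (which also writes the result to a file) may be implemented differently.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_edges_sketch(sent_embs, cluster_num=13, seed=0):
    """Run k-means on the edge embeddings and map each edge id to its cluster label.

    Hypothetical stand-in for clustering.edge_cluster (without the file output).
    """
    edge_ids = [edge_id for edge_id, _ in sent_embs]
    matrix = np.asarray([vec for _, vec in sent_embs], dtype=np.float64)
    model = KMeans(n_clusters=cluster_num, n_init=10, random_state=seed)
    labels = model.fit_predict(matrix)
    return {edge_id: int(label) for edge_id, label in zip(edge_ids, labels)}


# toy 2-d vectors forming two obvious groups, invented for the example
toy_embs = [
    ("a", [0.0, 0.1]), ("b", [0.1, 0.0]),
    ("c", [10.0, 10.1]), ("d", [10.1, 10.0]),
]
toy_clusters = cluster_edges_sketch(toy_embs, cluster_num=2)
print(toy_clusters["a"] == toy_clusters["b"], toy_clusters["a"] != toy_clusters["c"])
```

The label numbers themselves are arbitrary, so only co-membership is meaningful. For millions of 1024-dimensional vectors, scikit-learn's `MiniBatchKMeans` may be a more practical choice than full `KMeans`.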


## Calculate the adjusted Rand index between the automatic result and the human result

In [9]:
ari_bert = clustering.adj_rank_index(clstr_res_bert,manually_res)
ari_robert = clustering.adj_rank_index(clstr_res_roberta,manually_res)
print(f"adjusted Rand index by using bert is:    {ari_bert}")
print(f"adjusted Rand index by using roberta is: {ari_robert}")

adjusted Rand index by using bert is:    0.22624362772679796
adjusted Rand index by using roberta is: 0.23544859881340863
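
For reference, the adjusted Rand index counts the pairs of edges that two labelings put in the same cluster and corrects that count for chance agreement, so identical partitions score 1 and random ones score close to 0. The pure-Python sketch below follows the standard formula; it is not necessarily how `clustering.adj_rank_index` is implemented, and restricting the comparison to edge ids present in both dictionaries is our assumption (the manual result above covers fewer edges than the embeddings).

```python
from collections import Counter
from math import comb


def adjusted_rand_index_sketch(pred, truth):
    """Adjusted Rand index of two {edge_id: label} dictionaries.

    Only edge ids found in both dictionaries are compared (an assumption).
    """
    shared = pred.keys() & truth.keys()
    total_pairs = comb(len(shared), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two shared edges: trivially identical

    pair_counts = Counter((pred[k], truth[k]) for k in shared)  # contingency table
    pred_counts = Counter(pred[k] for k in shared)
    truth_counts = Counter(truth[k] for k in shared)

    same_in_both = sum(comb(c, 2) for c in pair_counts.values())
    same_in_pred = sum(comb(c, 2) for c in pred_counts.values())
    same_in_truth = sum(comb(c, 2) for c in truth_counts.values())

    expected = same_in_pred * same_in_truth / total_pairs
    maximum = (same_in_pred + same_in_truth) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (same_in_both - expected) / (maximum - expected)


# label names do not matter, only which edges are grouped together
print(adjusted_rand_index_sketch({"a": 0, "b": 0, "c": 1, "d": 1},
                                 {"a": "x", "b": "x", "c": "y", "d": "y"}))  # 1.0
```

Because the index only looks at co-membership, the automatic labels (integers) and the human labels (names such as 'taxonomic') can be compared directly, even when the two results use different numbers of clusters.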


## Visualize edge embeddings

We can use the TensorBoard projector to visualize the edge embeddings.
After executing the following code, a log folder is generated automatically; run `tensorboard --logdir={log_folder}` to view the visualization.
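
One way such a log folder could be produced is sketched below, using only the standard library. It reflects our understanding of the projector's file layout (a vectors TSV, a metadata TSV with a header row when there are several columns, and a `projector_config.pbtxt` pointing at both) and is not taken from `clustering.visualisation`, which may work differently, for example by sampling edges or saving a checkpoint. The function name, file names, and `tensor_name` value are all hypothetical.

```python
import os


def write_projector_files_sketch(log_dir, sent_embs, labels, tensor_name="edge_embeddings"):
    """Write vectors, metadata, and a projector config into log_dir.

    sent_embs: list of (edge_id, vector) tuples
    labels: {edge_id: cluster label}; edges without a label are skipped
    Returns the number of edges written.
    """
    os.makedirs(log_dir, exist_ok=True)
    kept = [(edge_id, vec) for edge_id, vec in sent_embs if edge_id in labels]

    vec_path = os.path.join(log_dir, "vectors.tsv")
    meta_path = os.path.join(log_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8") as vec_f, \
            open(meta_path, "w", encoding="utf-8") as meta_f:
        meta_f.write("edge_id\tcluster\n")  # header row for multi-column metadata
        for edge_id, vec in kept:
            vec_f.write("\t".join(str(float(x)) for x in vec) + "\n")
            meta_f.write(f"{edge_id}\t{labels[edge_id]}\n")

    # paths in the config are relative to the config file itself
    config = (
        "embeddings {\n"
        f'  tensor_name: "{tensor_name}"\n'
        '  tensor_path: "vectors.tsv"\n'
        '  metadata_path: "metadata.tsv"\n'
        "}\n"
    )
    with open(os.path.join(log_dir, "projector_config.pbtxt"), "w", encoding="utf-8") as f:
        f.write(config)
    return len(kept)
```

Coloring the points by the `cluster` column in the projector then shows how well the predicted or human labels separate in the embedding space.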


In [11]:
# log_bert = '../output/log_bert'
# log_roberta = '../output/log_roberta'
# log_bert_human = '../output/log_bert_human'
# log_roberta_human = '../output/log_roberta_human'
# clustering. visualisation(clstr_res_bert, sent_embs_bert, log_bert)

In [None]:
# bert embedding + human labels
clustering.visualisation(manually_res, sent_embs_bert, log_bert_human,'bert_human')

In [22]:
# bert embedding + auto labels
clustering.visualisation(clstr_res_bert, sent_embs_bert, log_bert,'bert_auto')

In [None]:
# roberta embedding + human labels
clustering.visualisation(manually_res, sent_embs_roberta, log_roberta_human,'roberta_human')

In [None]:
# roberta embedding + auto labels
clustering.visualisation(clstr_res_roberta, sent_embs_roberta, log_roberta,'roberta_auto')