# Embeddings

In [1]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *

ALIAS: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/aliases.en.tsv.gz"
ALL: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/all.tsv.gz"
CLAIMS: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/claims.tsv.gz"
DESCRIPTION: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/descriptions.en.tsv.gz"
EXAMPLES_DIR: "/Users/pedroszekely/Documents/GitHub/kgtk/examples"
GE: "/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding"
ISA: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/derived.isa.tsv.gz"
ITEM: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/claims.wikibase-item.tsv.gz"
LABEL: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/labels.en.tsv.gz"
OUT: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v5"
P279: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/derived.P279.tsv.gz"
P279STAR: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/derived.P279star.tsv.gz"
PROPERTY_DATATYPES: "/Users/pedroszekely/Downloads/kypher/wikidata_o

In [2]:
%cd {output_path}

/Users/pedroszekely/Downloads/kypher


## Graph Embeddings

Normally, we would use `Q154ITEM`, but the partitioning failed, so we will compute it using kypher.

In [59]:
os.environ["Q154GRAPH"] = os.environ["TEMP"] + "/Q154.edges.4.tsv.gz"

In [78]:
!zcat < "$Q154ITEM" | head

/bin/bash: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/parts/claims.wikibase-item.tsv.gz: No such file or directory


In [60]:
!zcat < "$Q154GRAPH" | wc

  197739  812559 10804072


In [61]:
!$kypher -i "$Q154GRAPH" -i "$TEMP"/Q154.metadata.property.datatype.tsv.gz -i "$Q154LABEL" \
--match 'edges: (n1)-[l {label: property}]->(n2), datatype: (property)-[]->(dt:`wikibase-item`), label: (n1)-[]->(lab)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$GE"/geinput.tsv

        1.06 real         0.81 user         0.20 sys


We have over 60,000 lines:

In [62]:
!wc "$GE"/geinput.tsv

   66666  266664 3307366 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv


Compute the graph embeddings using the default settings. Our output file `embeddings.txt` will be in word2vec format so we can use it directly in gensim.

In [63]:
!$kgtk graph-embeddings --verbose -i "$GE"/geinput.tsv \
-o "$GE"/embeddings.txt \
--retain_temporary_data True \
--operator translation \
--workers 5 \
--log "$GE"/ge.log \
-T "$GE" \
-ot w2v \
-e 600

In Processing, Please go to /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/ge.log to check details
Opening the input file: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv
header: id	node1	label	node2
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=1 label=2 node2=3 id=0
KgtkReader: Reading an edge file.
Opening the output file: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv
File_path.suffix: .tsv
KgtkWriter: writing file /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv
header: id	node1	label	node2
Processing the input records.
Processed 66665 records.
Processed Finished.
      339.63 real      1692.48 user       184.77 sys
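
The command above passes `--operator translation`. One common reading of a translation operator is the TransE idea: a relation is a vector that moves the head node close to the tail node. The sketch below only illustrates that idea on made-up 2-dimensional vectors; whether `kgtk graph-embeddings` scores edges exactly this way is an assumption, and `translation_distance` is a hypothetical helper.

```python
import math


def translation_distance(head, relation, tail):
    """TransE-style distance: small when head + relation lands near tail."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Made-up vectors, for illustration only
head = [0.0, 1.0]
relation = [1.0, 0.0]
good_tail = [1.0, 1.0]   # exactly head + relation
bad_tail = [-1.0, 0.0]   # far from head + relation

good = translation_distance(head, relation, good_tail)
bad = translation_distance(head, relation, bad_tail)
print(good, bad)
```

Training adjusts the vectors so that edges present in the graph get a smaller distance than randomly corrupted edges.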


Let's look at the output directory

In [64]:
!ls -hl "$GE"

total 449544
-rw-r--r--   1 pedroszekely  staff   101K Dec 26 16:09 Q27.sim.tsv
-rw-r--r--   1 pedroszekely  staff    44K Dec 25 22:18 Q27.tsv
-rw-r--r--   1 pedroszekely  staff   177K Dec 26 16:09 Q29.Q45.Q142.sim.tsv
-rw-r--r--   1 pedroszekely  staff    43K Dec 25 22:36 Q29.Q45.sim.tsv
-rw-r--r--   1 pedroszekely  staff    85K Dec 26 16:09 Q29.sim.tsv
-rw-r--r--   1 pedroszekely  staff   158K Dec 27 17:50 Q332378.sim.tsv
-rw-r--r--   1 pedroszekely  staff    88K Dec 27 17:50 Q374.sim.tsv
-rw-r--r--   1 pedroszekely  staff    87K Dec 26 16:09 Q502268.sim.tsv
-rw-r--r--   1 pedroszekely  staff    44K Dec 25 22:11 Q502268.tsv
-rw-r--r--   1 pedroszekely  staff   4.3K Dec 25 21:33 Q610672.tsv
-rw-r--r--   1 pedroszekely  staff    54M Dec 28 10:41 embeddings.txt
-rw-r--r--   1 pedroszekely  staff   955K Dec 28 10:41 ge.log
-rw-r--r--   1 pedroszekely  staff   3.2M Dec 28 10:35 geinput.tsv
-rw-r--r--   1 pedroszekely  staff   973K Dec 23 12:41 geinput.tsv.gz
drwxr-xr-x  10 pedroszekely  s

Let's peek at the file: we have about 44K vectors of dimension 100

In [65]:
!head -2 "$GE"/embeddings.txt

44695 100
Q214601 -0.102628224 0.359375894 -0.157025114 -0.464361548 0.037833635 -0.339900166 -0.124001674 -0.416769415 0.415618241 -0.207958832 0.225649863 0.398222595 -0.155308485 -0.316138476 -0.122366287 0.172886118 0.612552524 -0.207907990 0.059321735 0.362125903 0.009907527 0.251414299 0.391911834 0.040169623 0.391867101 -0.460243762 -0.060899656 -0.026212368 0.440496027 -0.302922249 0.312714458 -0.217636093 -0.009538481 0.103229381 0.134138778 -0.129249051 0.379549921 -0.601356268 0.118008241 0.252372622 0.345629156 -0.044342130 -0.307582080 0.062566765 -0.201944143 -0.009374032 -0.075185008 0.344059110 -0.060755178 0.329203039 -0.332021683 0.072686754 0.146988109 0.374888718 -0.252653658 0.052430481 -0.140533656 0.154647931 0.171625122 -0.573895752 0.054081772 0.007514662 -0.107331567 -0.306097299 0.256348401 0.298310518 0.125003204 -0.093739904 -0.259183556 0.440674067 -0.216976196 -0.029400053 0.131144762 -0.538199782 -0.248879820 0.120181866 -0.310127497 -0.459496319 0.40604
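
The layout shown above is the word2vec text format: a header line with the number of vectors and their dimension, then one line per vector with the key followed by space-separated values. As a minimal sketch of how such a file can be parsed (gensim does this for us; `read_w2v_text` is a hypothetical helper and the toy values are made up):

```python
import io


def read_w2v_text(handle):
    """Parse word2vec text format: a 'count dim' header, then one vector per line."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        if len(parts) != dim + 1:
            raise ValueError("expected {} values, got {}".format(dim, len(parts) - 1))
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError("header announced {} vectors, found {}".format(count, len(vectors)))
    return vectors


# Toy file with made-up values: 2 vectors of dimension 3
toy = io.StringIO("2 3\nQ1 0.1 0.2 0.3\nQ2 -0.4 0.5 0.6\n")
toy_vectors = read_w2v_text(toy)
print(toy_vectors["Q2"])
```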

Load the vectors in gensim

In [66]:
path = os.environ['GE'] + "/embeddings.txt"
ge_vectors = KeyedVectors.load_word2vec_format(path, binary=False)

In [67]:
# Q502268 is Johnnie Walker
ge_vectors['Q502268']

array([ 0.47429726,  0.63725895, -0.5864697 , -0.30693457,  0.05878768,
       -0.47394025,  0.3690229 ,  0.47179508,  0.10914682,  0.05990786,
        0.29345718, -0.50989914, -0.0930867 , -0.9283536 ,  0.4654144 ,
       -0.5097026 , -0.0989987 ,  0.5390261 ,  0.3539935 , -0.1391855 ,
       -0.7672231 , -1.1408933 ,  0.8327929 ,  0.494192  ,  0.33971542,
        0.07139378,  0.551032  , -0.19357733,  0.367034  , -0.5566508 ,
       -0.37830138, -0.5741709 ,  0.27643484,  0.64525676, -0.00791502,
       -0.40929148,  0.13406757, -0.2794057 , -0.03993555,  0.64496636,
       -0.48054212,  1.0069914 , -0.4344204 ,  0.8777195 , -0.8854221 ,
        0.680318  , -0.16724458, -0.01672718, -0.621057  , -0.47410202,
       -0.58648473, -0.9190876 ,  0.5850317 ,  0.39297456, -0.0438199 ,
       -1.0063976 ,  0.05002118,  1.1778641 , -0.4606614 , -0.4312578 ,
        0.14105996, -0.66551495,  0.31018743,  0.6011955 , -0.927424  ,
       -0.03615357,  0.43011943, -0.50710917,  0.21786   ,  0.94

Find the most similar qnodes to `Q15874936`, the qnode for Michelob.

In [68]:
ge_vectors.most_similar(positive=['Q15874936'], topn=5)

[('Q610672', 0.9382461905479431),
 ('Q85269976', 0.7516792416572571),
 ('Q5647008', 0.7455666065216064),
 ('Q4921899', 0.7417541146278381),
 ('Q2567026', 0.73448646068573)]
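
The scores returned by `most_similar` are cosine similarities. As a rough sketch of what the ranking does, here is a pure-Python version on made-up 2-dimensional vectors (`cosine` and `top_similar` are hypothetical helpers, not part of gensim):

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_similar(vectors, key, topn=5):
    """Rank every other key by cosine similarity to vectors[key]."""
    query = vectors[key]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != key]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, for illustration only
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
ranking = top_similar(toy, "A", topn=3)
print(ranking)
```

A score near 1 means the two vectors point in almost the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.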

This is hard to use because the results are qnodes and we have no idea what they are. Let's define a function to fetch the labels and descriptions so that we can interpret the results more easily.

`kgtk_most_similar` is a wrapper around gensim's `most_similar` function, designed to output the results in KGTK format. The `kg_path` argument is required if we want to output the labels and descriptions, as this path is where the `labels.en.tsv.gz` and `descriptions.en.tsv.gz` files are stored. You can optionally provide an `output_path` to store the results in a file; otherwise the results are returned as a dataframe.

In [9]:
def kgtk_most_similar(
    vectors,
    positive,
    relation_label="similarity_score",
    kg_path=None,
    add_label_description=True,
    output_path=None,
    topn=25,
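):
    """
    Find the topn nodes most similar to `positive` and format them as KGTK edges.

    When add_label_description is true and kg_path is given, labels and
    descriptions are added with kypher. The results are written to output_path
    if it is provided; otherwise they are returned as a dataframe.
    """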
    result = []
    if add_label_description and kg_path:
        fp = tempfile.NamedTemporaryFile(
            mode="w", suffix=".tsv", delete=False, encoding="utf-8"
        )
        fp.write("node1\tlabel\tnode2\n")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            fp.write("{}\t{}\t{}\n".format(qnode, relation_label, similarity))
        filename = fp.name
        fp.close()

        os.environ["_label_graph"] = kg_path + "/labels.en.tsv.gz"
        os.environ["_description_graph"] = kg_path + "/descriptions.en.tsv.gz"
        os.environ["_temp_file"] = filename

        result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_temp_file" --as sim \
--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, lab as `node1;label`, des as `node1;description`' \
--order-by 'cast(similarity, float) desc' 
        
        os.remove(filename)
        
    else:
        # Store lines without a trailing newline, like the kypher output above
        result.append("node1\tlabel\tnode2")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            result.append("{}\t{}\t{}".format(qnode, relation_label, similarity))

    if output_path:
        with open(output_path, "w") as handle:
            for line in result:
                handle.write(line + "\n")
    else:
        columns = result[0].split("\t")
        data = []
        for line in result[1:]:
            data.append(line.split("\t"))
        return pd.DataFrame(data, columns=columns)

Let's give it a try:

In [69]:
# Q15874936 is Michelob
kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q610672,0.9382461905479432,similarity,'Budweiser'@en,'brand of pale lager'@en
1,Q85269976,0.7516792416572571,similarity,'Busch Beer'@en,'brand of beer owned by Anheuser-Busch'@en
2,Q2397992,0.7307605743408203,similarity,'malt liquor'@en,'beer style'@en
3,Q4912182,0.7301785945892334,similarity,'Billy Beer'@en,'beer produced in the United States'@en
4,Q5149389,0.7276897430419922,similarity,'Colt 45'@en,'malt liquor'@en
5,Q694536,0.7243717312812805,similarity,'American whiskey'@en,'Whiskey produced in the United States'@en
6,Q3079990,0.7207326292991638,similarity,'Four Loko'@en,'Drink'@en
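
The labels and descriptions come back as KGTK language-qualified strings such as `'Budweiser'@en`. If plain text is easier to work with, a small helper can split the text from the language tag. This is a minimal sketch: `split_lq_string` is a hypothetical helper, and it assumes that apostrophes inside the text are escaped as `\'`.

```python
import re

# 'text'@lang, where lang is a simple language tag such as en or pt-br
_LQ_STRING = re.compile(r"^'(.*)'@([A-Za-z][A-Za-z0-9-]*)$", re.DOTALL)


def split_lq_string(value):
    """Return (text, language) for 'text'@lang, or (value, None) for anything else."""
    match = _LQ_STRING.match(value)
    if match is None:
        return value, None
    text = match.group(1).replace("\\'", "'")  # undo the assumed \' escaping
    return text, match.group(2)


print(split_lq_string("'Budweiser'@en"))
print(split_lq_string("Q610672"))
```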


## Text embeddings

In [15]:
!zcat < $OUT/all.tsv.gz | head

id	node1	label	node2
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508
P10-P1659-P1651-c4068028-0	P10	P1659	P1651
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238
P10-P1659-P51-86aca4c5-0	P10	P1659	P51
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950
P10-P1855-Q69063653-c8cdb04c-0	P10	P1855	Q69063653
zcat: error writing to output: Broken pipe


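The `text-embedding` command builds a sentence for each node and encodes it with a sentence-embedding model, here `bert-large-nli-cls-token`. The sentence is assembled from the node's label (`--label-properties`), its description (`--description-properties`), its is-a properties (`--isa-properties`: `P31`, `P279`, `P452`, `P106`) and the values of selected properties (`--property-value`: `P186`, `P17`, `P127`, `P176`, `P169`). `--save-embedding-sentence` keeps the generated sentence in the output, and both the input and the output are in KGTK format.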

In [None]:
!$kgtk text-embedding -i $OUT/all.tsv.gz \
--embedding-projector-metadata-path none \
--label-properties label \
--isa-properties P31 P279 P452 P106 \
--description-properties description \
--property-value P186 P17 P127 P176 P169 \
--has-properties "" \
-f kgtk_format \
--output-data-format kgtk_format \
--save-embedding-sentence \
--model bert-large-nli-cls-token \
-o "$TE" \
> "$TE"/text-embedding.tsv

Duration --parallel 1
16348.11 real     16066.21 user       315.45 sys

The text embeddings are output in KGTK format and we need them in word2vec format (need to enhance the command to produce w2v format). For now, define a function to convert the KGTK embeddings to w2v format.

In [16]:
def convert_kgtk_to_w2v(input_path, output_path, text_embedding_label="text_embedding"):
    """
    Convert a KGTK file (node1/label/node2) that contains embeddings to the w2v format
    """
    vector_count = 0
    vector_length = 0
    
    # Read the file once to count the lines as we need to put them at the top of the w2v file
    with open(input_path, "r") as kgtk_file:
        next(kgtk_file)
        for line in kgtk_file:
            items = line.split("\t")
            qnode = items[0]
            label = items[1]
            if label == text_embedding_label:
                if vector_count == 0:
                    vector_length = len(items[2].split(","))
                vector_count += 1

    with open(output_path, "w") as w2v_file:
        w2v_file.write("{} {}\n".format(vector_count, vector_length))
        with open(input_path, "r") as kgtk_file:
            next(kgtk_file)
            for line in kgtk_file:
                items = line.split("\t")
                qnode = items[0]
                label = items[1]
                if label == text_embedding_label:
                    vector = items[2].replace(",", " ")
                    w2v_file.write(qnode + " " + vector)
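
To make the two formats concrete, here is a minimal in-memory sketch of the same conversion on a toy input. `kgtk_lines_to_w2v` is a hypothetical helper, the vectors are made up, and the `embedding_sentence` label is an assumed name for a non-embedding edge that the conversion should skip. Unlike the function above, this version keeps all rows in memory, so it is only suitable for small inputs.

```python
def kgtk_lines_to_w2v(lines, embedding_label="text_embedding"):
    """Convert KGTK node1/label/node2 lines (header first) to word2vec text lines."""
    rows = []
    for line in lines[1:]:  # skip the KGTK header
        node1, label, node2 = line.rstrip("\n").split("\t")[:3]
        if label == embedding_label:
            rows.append((node1, node2.split(",")))
    dim = len(rows[0][1]) if rows else 0
    out = ["{} {}".format(len(rows), dim)]  # word2vec header: count and dimension
    out.extend(node + " " + " ".join(values) for node, values in rows)
    return out


# Toy input with made-up values
toy_kgtk = [
    "node1\tlabel\tnode2\n",
    "Q1\ttext_embedding\t0.1,0.2,0.3\n",
    "Q1\tembedding_sentence\tsome sentence\n",
    "Q2\ttext_embedding\t0.4,0.5,0.6\n",
]
w2v_lines = kgtk_lines_to_w2v(toy_kgtk)
print("\n".join(w2v_lines))
```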

In [17]:
convert_kgtk_to_w2v(os.environ['TE'] + "/text-embedding.tsv", os.environ['TE'] + "/embeddings.txt")

Let's look at the output file; the embeddings have 1024 dimensions

In [18]:
!head -2 "$TE"/embeddings.txt

56017 1024
undirected_pagerank -0.42267796 0.3995441 0.5533569 -0.71286017 0.35639343 0.23904479 -0.2763573 0.37157294 -0.4283453 1.3224101 0.6862846 0.19590487 -0.6082015 -0.11240994 0.33890438 -0.20922732 -0.23069456 -0.021294963 -1.912606 0.49719235 0.6929876 0.011938913 -1.5600294 0.20473605 -0.17875122 0.45237 -0.09061487 0.0838695 0.039139077 -0.5781012 -0.2535121 0.065458305 -0.34608266 -0.42478928 -0.4474916 -0.23409875 -0.13160512 -0.076800026 -0.6984711 0.12516521 -0.42880625 -0.85138726 0.04815936 -0.6207587 -0.08866266 -1.6658425 -0.51067406 -0.34878105 0.33144328 -0.69933593 -0.36479193 -0.6388813 0.76048696 0.12395467 -0.88557744 0.34427696 1.2574033 -0.65131736 -0.9506962 0.6257681 0.36623836 0.716814 0.36953598 -1.3571995 0.2660646 -1.2076085 0.09180403 -0.36115 0.42118248 -0.92440283 -0.32160524 -0.14557533 -0.50016695 -0.12131537 -0.74813855 0.5254087 0.42912796 -0.73770857 -0.39519224 1.1647401 0.63930184 -0.33095387 -0.17238976 0.19148383 -0.31919938 -0.7583614 0.15

Load the text embeddings in gensim

In [19]:
# Load the file written by convert_kgtk_to_w2v above
te_path = os.environ['TE'] + "/embeddings.txt"
te_vectors = KeyedVectors.load_word2vec_format(te_path, binary=False)

### Compare the graph and text embeddings

Most similar nodes to Johnnie Walker using the **graph embeddings**

In [70]:
# Q502268 is Johnnie Walker
kgtk_most_similar(ge_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q1799948,0.8258494138717651,similarity,'Ladies of Leisure'@en,'1930 film by Frank Capra'@en
1,Q4865371,0.8253136277198792,similarity,'Bartlet for America'@en,'episode of The West Wing (S3 E9)'@en
2,Q2288328,0.8127017021179199,similarity,'The Matinee Idol'@en,"'1928 film by Walt Disney, Frank Capra'@en"
3,Q7736602,0.8067774772644043,similarity,'The Girl of the Golden West'@en,'1930 film by John Francis Dillon'@en
4,Q7084279,0.8052603006362915,similarity,'Old Ironsides'@en,'1926 film by James Cruze'@en
5,Q209135,0.7267813086509705,similarity,'East Ayrshire'@en,'council area of Scotland'@en
6,Q628737,0.6894252896308899,similarity,'Campbeltown Single Malts'@en,'single malt Scotch whiskies distilled in the ...
7,P3029,0.6503645777702332,similarity,'UK National Archives ID'@en,"'identifier for a person, family or organisati..."
8,Q32358417,0.6465252041816711,similarity,'Category:Births in East Ayrshire'@en,'Wikimedia category'@en
9,Q501320,0.6421428918838501,similarity,'Mortlach Distillery'@en,'whisky distillery'@en


Most similar nodes to Johnnie Walker using the **text embeddings**

In [21]:
# Q502268 is Johnnie Walker
kgtk_most_similar(te_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q280,0.9379171133041382,similarity,'Lagavulin Distillery'@en,"'Scotch whisky distillery in Lagavulin, Islay,..."
1,Q2490031,0.9346836805343628,similarity,'William Grant & Sons'@en,'Scottish company which distills Scotch whisky...
2,Q1543646,0.9012988805770874,similarity,'Rob Roy'@en,'cocktail based on Scotch whisky'@en
3,Q2168523,0.8907997012138367,similarity,'The Famous Grouse'@en,'brand of Scotch whisky'@en
4,Q1069502,0.8856703042984009,similarity,'Chivas Regal'@en,'Blended Scotch Whisky produced by Chivas Brot...
5,Q4821838,0.8762272596359253,similarity,'Aultmore distillery'@en,"'whisky distillery in Moray, Scotland, UK'@en"
6,Q4720319,0.8761684894561768,similarity,'Alexander Walker'@en,'Scottish whisky distiller'@en
7,Q1754978,0.8664095401763916,similarity,'Rusty Nail'@en,'cocktail mixing Drambuie and Scotch whisky'@en
8,Q42032478,0.8583760857582092,similarity,'Tiree Whisky Company'@en,'company that sells whisky on the island of Ti...
9,Q20031443,0.8488548994064331,similarity,'Something Special'@en,'blended Scotch whisky'@en


The graph embeddings produce poor results as the top matches are not related to whiskey. The text embeddings look much better.

Most similar nodes to Michelob using the **graph embeddings**

In [71]:
# Q15874936 is Michelob
kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q610672,0.9382461905479432,similarity,'Budweiser'@en,'brand of pale lager'@en
1,Q85269976,0.7516792416572571,similarity,'Busch Beer'@en,'brand of beer owned by Anheuser-Busch'@en
2,Q2397992,0.7307605743408203,similarity,'malt liquor'@en,'beer style'@en
3,Q4912182,0.7301785945892334,similarity,'Billy Beer'@en,'beer produced in the United States'@en
4,Q5149389,0.7276897430419922,similarity,'Colt 45'@en,'malt liquor'@en
5,Q694536,0.7243717312812805,similarity,'American whiskey'@en,'Whiskey produced in the United States'@en
6,Q3079990,0.7207326292991638,similarity,'Four Loko'@en,'Drink'@en


Most similar nodes to Michelob using the **text embeddings**

In [23]:
# Q15874936 is Michelob
kgtk_most_similar(te_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q2011473,0.9664472341537476,similarity,'Fantôme'@en,'brand of beer'@en
1,Q3315575,0.9586231708526612,similarity,'Bersalis'@en,'beer brand'@en
2,Q3518554,0.9563601016998292,similarity,'Floris'@en,'beer brand'@en
3,Q15076069,0.9531255960464478,similarity,'Marckloff'@en,'beer brand'@en
4,Q1277388,0.9511646628379822,similarity,'Pripps Blå'@en,'beer brand'@en
5,Q1917255,0.9475076794624328,similarity,'St-Idesbald'@en,'beer'@en
6,Q263980,0.9443504810333252,similarity,'Soproni'@en,'beer mark'@en
7,Q3337782,0.9438232779502868,similarity,'Carrousel'@en,'Beer'@en


The graph embeddings contain some bad results, but the top matches are better as they include beers that are more closely related to Michelob. The text embeddings are reasonable as they include only beers.

Most similar nodes to vodka using the **graph embeddings**

In [72]:
# Q374 is vodka
kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q7468032,0.870713472366333,similarity,'Vodka'@en,'Detective Conan character'@en
1,Q11328065,0.8691505193710327,similarity,'Balalaika'@en,"'Japanese short drink, cocktail'@en"
2,Q20577688,0.8569937348365784,similarity,'.vodka'@en,'top-level Internet domain'@en
3,Q2206588,0.8523291349411011,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
4,Q21189725,0.8516043424606323,similarity,'Red Eye Louie\'s Vodquila'@en,'blend of vodka and tequila'@en
5,Q23712704,0.8362079858779907,similarity,'EB-11 / Vodka'@en,'encyclopedic article'@en
6,Q7151801,0.8258033394813538,similarity,'Category:Vodkas'@en,'Wikimedia category'@en


Most similar nodes to vodka using the **text embeddings**

In [25]:
# Q374 is vodka
kgtk_most_similar(te_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q4869283,0.959851622581482,similarity,'Batini'@en,'vodka-based cocktail'@en
1,Q3562046,0.959536910057068,similarity,'Vodka Stinger'@en,'type of cocktail'@en
2,Q2206588,0.943680465221405,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
3,Q22236238,0.9384630918502808,similarity,'Mariette'@en,"'vodka, alcohol'@en"
4,Q7939317,0.9203515648841858,similarity,'Vodka Cruiser'@en,'brand of vodka-based alcoholic drink'@en
5,Q11802565,0.9155371189117432,similarity,'Pan Tadeusz'@en,'brand of vodka'@en
6,Q268057,0.9129104614257812,similarity,'cosmopolitan'@en,'cocktail made with vodka'@en
7,Q4782617,0.9107505679130554,similarity,'Aqua Velva'@en,'vodka and gin based cocktail'@en


The graph embeddings are noisy, as the top matches include nodes not related to vodka; the text embeddings look much better.
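
To put a number on how different the two result lists are, we can compute their Jaccard overlap. The sketch below uses the qnodes from the two vodka tables above; `jaccard` is a hypothetical helper and this measure is our own addition, not part of KGTK or gensim.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# qnodes copied from the vodka results shown above
ge_vodka = ["Q7468032", "Q11328065", "Q20577688", "Q2206588",
            "Q21189725", "Q23712704", "Q7151801"]
te_vodka = ["Q4869283", "Q3562046", "Q2206588", "Q22236238",
            "Q7939317", "Q11802565", "Q268057", "Q4782617"]

shared = set(ge_vodka) & set(te_vodka)
score = jaccard(ge_vodka, te_vodka)
print(shared, score)
```

Only Caipiroska (`Q2206588`) appears in both lists, so the two kinds of embeddings capture quite different notions of similarity.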

Let's look at countries now as the differences between the two types of embeddings are more striking.
The graph embeddings retrieve nodes that are related to Ireland:

In [73]:
# Q27 Ireland
kgtk_most_similar(ge_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q9676,0.8768500089645386,similarity,'Isle of Man'@en,'British Crown dependency'@en
1,Q4368623,0.8426293730735779,similarity,'Category:Republic of Ireland'@en,'Wikimedia category'@en
2,Q184760,0.817055344581604,similarity,'County Monaghan'@en,'county in Ireland'@en
3,Q164421,0.8150794506072998,similarity,'Connacht'@en,'province in Ireland'@en
4,Q1263077,0.8149058818817139,similarity,'DAA'@en,'company that owns and operates Dublin Airport...
5,Q131438,0.8103623986244202,similarity,'Munster'@en,'province in Ireland'@en
6,Q187402,0.8079863786697388,similarity,'County Cavan'@en,'county in Ireland'@en
7,Q162475,0.8054779171943665,similarity,'County Cork'@en,'county in Ireland'@en
8,Q107397,0.802614688873291,similarity,'County Leitrim'@en,'county in Ireland'@en
9,Q179325,0.8024470806121826,similarity,'County Sligo'@en,'county in Ireland'@en


The text embeddings retrieve other countries:

In [27]:
# Q27 Ireland
kgtk_most_similar(te_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q191,0.7959819436073303,similarity,'Estonia'@en,'sovereign state in Northern Europe'@en
1,Q37,0.7896063327789307,similarity,'Lithuania'@en,'sovereign state in Northeastern Europe'@en
2,Q34,0.7771986722946167,similarity,'Sweden'@en,'sovereign state in Northern Europe'@en
3,Q35,0.7717932462692261,similarity,'Denmark'@en,'sovereign state and Scandinavian country in n...
4,Q756617,0.7578498125076294,similarity,'Kingdom of Denmark'@en,"'sovereign unitary state in Europe, the Arctic..."
5,Q33,0.7564055919647217,similarity,'Finland'@en,'sovereign state in Northern Europe'@en
6,Q16965019,0.7521861791610718,similarity,'North borough of Brescia'@en,'one of 5 boroughs of Brescia'@en
7,Q1526538,0.7520326972007751,similarity,'Reykjavík North'@en,'one of the six constituencies (kjördæmi) of I...
8,Q189,0.7486690282821655,similarity,'Iceland'@en,"'sovereign state in Northern Europe, situated ..."
9,Q22,0.7369431257247925,similarity,'Scotland'@en,"'country in Northwest Europe, part of the Unit..."


### Using the embeddings in queries to the KG

In [28]:
# Q281 whiskey
# Q282 wine
# Q3246609 mixed drink
# Q374 vodka
# Q332378 is absolut

Get the most similar nodes to **absolut**, the Swedish vodka, using the text embeddings and put them in a file

In [29]:
# Q332378 is absolut
kgtk_most_similar(te_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q332378.sim.tsv")

In [30]:
result = !head "$TE"/Q332378.sim.tsv
kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q7312560,0.9494208097457886,similarity,'Renat'@en,'Swedish vodka'@en
1,Q406157,0.9068878293037416,similarity,'bäsk'@en,'Swedish style spiced liquor'@en
2,Q1034035,0.8990318775177002,similarity,'Finlandia Vodka'@en,'Finnish brand of vodka'@en
3,Q374,0.8908252716064453,similarity,'vodka'@en,'distilled alcoholic beverage'@en
4,Q2553569,0.8900324106216431,similarity,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en
5,Q2206588,0.8866583108901978,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
6,Q268057,0.8860777616500854,similarity,'cosmopolitan'@en,'cocktail made with vodka'@en
7,Q4021706,0.8785413503646851,similarity,'Xan'@en,'Vodka from Goygol'@en
8,Q4869283,0.8784171342849731,similarity,'Batini'@en,'vodka-based cocktail'@en


Suppose I have absolut vodka and I want to make a cocktail. I can use the KG graph of the most similar nodes to absolut, and search the KG for mixed drinks (`Q3246609`) that appear in the list of most similar nodes to absolut.

Here are some drinks we can make with absolut vodka. The query starts with our similarity file (`Q332378.sim.tsv`) in clause `sim` and filters it to select the qnodes that are instances of mixed drink (`Q3246609`) using clauses `isa` and `star`. Then the first `claims` clause selects those that have vodka as an ingredient (`Q374`) and the second `claims` clause retrieves the other ingredients.

In [32]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$TE"/Q332378.sim.tsv -i "$Q154CLAIMS" -i "$Q154LABEL" \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class), \
  claims: (n1)-[:P186]->(:Q374), claims: (n1)-[:P186]->(ingredient), label: (ingredient)-[]->(i_label)' \
--return 'distinct n1 as node1, similarity as node2, n1.label, n1.description, \
  ingredient as ingredient, i_label as `ingredient label`' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 20 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,node1;label,node1;description,ingredient,ingredient label
0,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q1105343,'cocktail glass'@en
1,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q1621080,'olive'@en
2,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q26877166,'lemon twist'@en
3,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q26877423,'dry vermouth'@en
4,Q2553569,0.8900324106216431,'Vodka Martini'@en,'cocktail made with vodka and vermouth'@en,Q374,'vodka'@en
5,Q2206588,0.8866583108901978,'Caipiroska'@en,'cocktail prepared with vodka'@en,Q374,'vodka'@en
6,Q1966883,0.8709859848022461,'Yorsh'@en,'Russian drink of beer and vodka'@en,Q374,'vodka'@en
7,Q1966883,0.8709859848022461,'Yorsh'@en,'Russian drink of beer and vodka'@en,Q44,'beer'@en
8,Q1723060,0.8683922290802002,'Kamikaze'@en,"'cocktail of vodka, triple sec and lime juice'@en",Q1105343,'cocktail glass'@en
9,Q1723060,0.8683922290802002,'Kamikaze'@en,"'cocktail of vodka, triple sec and lime juice'@en",Q3539556,'triple sec'@en


The results are good, with many cocktails to choose from. Note that the embeddings are able to generalize from a specific vodka to vodka in general. The example also illustrates that KGTK can use the results of queries to gensim within queries to the KG.
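
For readers less familiar with kypher, the logic of the query can be sketched in plain Python over small dictionaries. The similarity scores and ingredients below are copied from the tables above; the class identifiers `C_cocktail` and `C_vodka_brand` are made up, and `drinks_with` is a hypothetical helper that only mirrors the `sim`, `isa`, `star` and `claims` clauses.

```python
MIXED_DRINK = "Q3246609"
VODKA = "Q374"


def drinks_with(similar, isa, superclasses, ingredients, target_class, required):
    """Keep similar nodes that belong to target_class and have the required ingredient."""
    rows = []
    for node, score in sorted(similar.items(), key=lambda kv: kv[1], reverse=True):
        classes = set()
        for c in isa.get(node, ()):          # isa clause
            classes |= superclasses.get(c, set())  # star clause
        if target_class not in classes:      # where class = target_class
            continue
        parts = ingredients.get(node, set())
        if required not in parts:            # first claims clause
            continue
        rows.append((node, score, sorted(parts)))  # second claims clause
    return rows


similar = {"Q7312560": 0.9494208097457886,   # Renat
           "Q2553569": 0.8900324106216431,   # Vodka Martini
           "Q1966883": 0.8709859848022461}   # Yorsh
isa = {"Q7312560": {"C_vodka_brand"}, "Q2553569": {"C_cocktail"}, "Q1966883": {"C_cocktail"}}
superclasses = {"C_cocktail": {"C_cocktail", MIXED_DRINK}, "C_vodka_brand": {"C_vodka_brand"}}
ingredients = {"Q2553569": {"Q1105343", "Q1621080", "Q26877166", "Q26877423", "Q374"},
               "Q1966883": {"Q374", "Q44"}}

rows = drinks_with(similar, isa, superclasses, ingredients, MIXED_DRINK, VODKA)
for row in rows:
    print(row)
```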

Now let's try the query using the graph embeddings, without explicitly filtering the ingredients to include vodka:

In [34]:
# Q332378 is absolut
kgtk_most_similar(ge_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + "/parts", topn=2000, output_path=os.environ['GE'] + "/Q332378.sim.tsv")

In [36]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$GE"/Q332378.sim.tsv \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, n1.label, n1.description' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q3527971,0.4424980580806732,similarity,'Ti\'Punch'@en,'cocktail'@en
1,Q594392,0.3889206945896148,similarity,'B-52'@en,"'cocktail of coffee liqueur, Irish cream and t..."
2,Q7535970,0.373583436012268,similarity,'Skittle Bomb'@en,'bomb shot cocktail'@en
3,Q7209010,0.3714387416839599,similarity,'Polar Bear'@en,'mint chocolate cocktail'@en
4,Q3309707,0.3705223202705383,similarity,'Hawaiian Punch'@en,'Fruit punch brand'@en
5,Q12738893,0.3702288269996643,similarity,'Quentão'@en,'Brazilian hot drink made from cachaça and s...
6,Q2935472,0.3678890466690063,similarity,'Campari Soda'@en,'pre-mixed drink made by Campari'@en
7,Q70428,0.3663345277309418,similarity,'Karsk'@en,'Scandinavian cocktail'@en
8,Q590793,0.3614485263824463,similarity,'Vesper'@en,"'cocktail originally made of gin, vodka, and K..."


The results are poor as for the most part, the retrieved cocktails do not have vodka. Let's try the query with vodka instead of absolut vodka.

Now let's get the qnodes that are similar to vodka (`Q374`) using the graph embeddings:

In [37]:
# Q374 vodka
kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['GE'] + "/Q374.sim.tsv")

In [38]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$GE"/Q374.sim.tsv \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, n1.label, n1.description' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q11328065,0.8384641408920288,similarity,'Balalaika'@en,"'Japanese short drink, cocktail'@en"
1,Q2206588,0.8186914920806885,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
2,Q3562046,0.6592038869857788,similarity,'Vodka Stinger'@en,'type of cocktail'@en
3,Q1966883,0.5952204465866089,similarity,'Yorsh'@en,'Russian drink of beer and vodka'@en
4,Q5459745,0.5736489295959473,similarity,'flirtini'@en,"'cocktail containing vodka, champagne and pine..."
5,Q455914,0.5721926093101501,similarity,'Vodka Red Bull'@en,'alcoholic beverage'@en
6,Q5103598,0.5712590217590332,similarity,'Chocolate Cake'@en,'cocktail'@en
7,Q26879480,0.5568693280220032,similarity,'Godmother'@en,'cocktail'@en
8,Q5580053,0.5458002090454102,similarity,'Golden Russian'@en,'cocktail of vodka and Galliano'@en
9,Q3900577,0.5457539558410645,similarity,'Pertini'@en,'cocktail drink with honey'@en


The results are good. Somehow, the graph embeddings are able to retrieve the cocktails that have vodka, but cannot generalize from absolut vodka to vodka.

## Produce files to load in the Google Embedding Projector
The Google Embedding Projector (https://projector.tensorflow.org) is a tool for visualizing embeddings. To use it we need two files:

- a TSV file with the vectors
- a TSV file with the metadata, in the same order as the vectors
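
As a minimal sketch of what these two files look like, the helper below writes them from a dictionary of vectors and a dictionary of metadata rows. `write_projector_files` is a hypothetical helper, the column names and toy values are made up, and it assumes the projector's convention that the vectors file has no header while a multi-column metadata file starts with a header row.

```python
import io


def write_projector_files(vectors, metadata, vec_out, meta_out,
                          columns=("qnode", "label", "type")):
    """Write one tab-separated vector per line and the matching metadata row."""
    meta_out.write("\t".join(columns) + "\n")  # header for multi-column metadata
    for key, vec in vectors.items():
        vec_out.write("\t".join(str(x) for x in vec) + "\n")
        meta_out.write("\t".join(metadata[key]) + "\n")  # same order as the vectors


# Made-up 2-dimensional vectors, for illustration only
toy_vectors = {"Q374": [0.1, 0.2], "Q27": [0.3, 0.4]}
toy_metadata = {"Q374": ("Q374", "vodka", "vodka"), "Q27": ("Q27", "Ireland", "country")}

vec_out, meta_out = io.StringIO(), io.StringIO()
write_projector_files(toy_vectors, toy_metadata, vec_out, meta_out)
print(vec_out.getvalue())
print(meta_out.getvalue())
```

Writing both files in a single loop guarantees that the metadata rows stay aligned with the vectors.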

We don't want to load all the vectors in the projector because there are too many to visualize. We will load only the following types, as it will be interesting to see whether they cluster properly.

In [74]:
focus_types = {
    "Q3246609": "mixed drink",
    "Q44": "beer",
    "Q282": "wine",
    "Q281": "whiskey",
    "Q374": "vodka",
    "Q6256": "country",
}

To do the filtering, we construct a dictionary that maps every q-node in the KG to the set of all its superclasses. We will use this dictionary later to tag each q-node with one of the focus types. For every q-node we will test if the focus type is in the set of all super-classes.

In [75]:
classes_result = !$kypher_raw -i "$ISA" -i "$Q154CLAIMS" -i "$TEMP"/Q154.descendant.tsv -i "$P279STAR" \
--match 'isa: (n1)-[]->(c), P279: (c)-[]->(class), claims: ()-[]->(class), descendant: (n1)-[]->()' \
--return 'distinct n1 as qnode, class as class' 

class_dict = {}
for r in classes_result[1:]:
    row = r.split("\t")
    qnode = row[0]
    isa = row[1]
    entry = class_dict.get(qnode)
    if entry is None:
        class_dict[qnode] = set()
        entry = class_dict[qnode]
    entry.add(isa)
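
The grouping step can also be sketched on a toy input, independent of the kypher query. This is a minimal sketch, not the notebook's code: `build_class_dict` and `toy_rows` are hypothetical names, and the rows only imitate the `qnode<TAB>class` shape of the query result with the header line already dropped.

```python
from collections import defaultdict

def build_class_dict(rows):
    """Group tab-separated 'qnode<TAB>class' rows into {qnode: set of classes}."""
    grouped = defaultdict(set)
    for r in rows:
        qnode, cls = r.split("\t")[:2]
        grouped[qnode].add(cls)
    return dict(grouped)

# Hypothetical rows in the shape of the query output (header already removed).
toy_rows = ["Q502268\tQ281", "Q502268\tQ154", "Q374\tQ154"]
toy_class_dict = build_class_dict(toy_rows)
```

Using `defaultdict(set)` removes the explicit `None` check while producing the same kind of mapping.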

Let's look at the `class_dict` entry for Johnnie Walker (`Q502268`). We see that Johnnie Walker has many superclasses.

In [76]:
class_dict['Q502268']

{'Q102205',
 'Q1048607',
 'Q11024',
 'Q11028',
 'Q11064354',
 'Q111352',
 'Q11435',
 'Q1150070',
 'Q1166770',
 'Q11795009',
 'Q1190554',
 'Q1194058',
 'Q12055130',
 'Q124291',
 'Q12767945',
 'Q131257',
 'Q13878858',
 'Q1400881',
 'Q1422299',
 'Q14819853',
 'Q14912053',
 'Q154',
 'Q15401930',
 'Q1554231',
 'Q1632297',
 'Q16686448',
 'Q16722960',
 'Q167270',
 'Q1681365',
 'Q16887380',
 'Q16889133',
 'Q169336',
 'Q1704572',
 'Q174984',
 'Q1786828',
 'Q1865992',
 'Q187931',
 'Q1914636',
 'Q20817253',
 'Q20937557',
 'Q2095',
 'Q214609',
 'Q2150504',
 'Q2200417',
 'Q22269697',
 'Q22272508',
 'Q22294683',
 'Q22299433',
 'Q22299483',
 'Q223557',
 'Q23009552',
 'Q23009675',
 'Q2424752',
 'Q25481995',
 'Q266328',
 'Q26717101',
 'Q26907166',
 'Q2695280',
 'Q27166344',
 'Q281',
 'Q2844972',
 'Q28555911',
 'Q28728771',
 'Q28732711',
 'Q28823',
 'Q28877',
 'Q28921572',
 'Q2944660',
 'Q29651519',
 'Q2990593',
 'Q2996394',
 'Q31464082',
 'Q3249551',
 'Q337060',
 'Q34394',
 'Q3505845',
 'Q35120',
 'Q35

In [77]:
def focus_type(qnode):
    """
    Retrieve the focus type for any qnode, and return "other" for nodes that are not instances of our focus types.
    """
    # Look up the superclasses once instead of once per focus type.
    classes = class_dict.get(qnode) or set()
    for t, name in focus_types.items():
        if t in classes:
            return name
    # Countries are identified through the separate country_qnodes set.
    if qnode in country_qnodes:
        return "country"
    return "other"
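
To see the tagging logic in isolation, here is a small sketch that takes the lookup tables as arguments instead of reading the notebook's globals. The function name `focus_type_for` and all the toy data are hypothetical and only illustrate the three possible outcomes.

```python
def focus_type_for(qnode, focus_types, class_dict, country_qnodes):
    """Return the focus type name for qnode, or "country", or "other"."""
    classes = class_dict.get(qnode, set())
    for type_qnode, name in focus_types.items():
        if type_qnode in classes:
            return name
    if qnode in country_qnodes:
        return "country"
    return "other"

# Toy lookup tables (assumptions for illustration only).
toy_focus_types = {"Q281": "whiskey", "Q374": "vodka"}
toy_class_dict = {"Q502268": {"Q281", "Q154"}}
toy_countries = {"Q145"}

whiskey_tag = focus_type_for("Q502268", toy_focus_types, toy_class_dict, toy_countries)
country_tag = focus_type_for("Q145", toy_focus_types, toy_class_dict, toy_countries)
other_tag = focus_type_for("Q1", toy_focus_types, toy_class_dict, toy_countries)
```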

Construct `country_qnodes`, the set of all country qnodes.

In [78]:
country_result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$Q154CLAIMS" \
--match 'claims: (country)-[]->(), isa: (country)-[:isa]->(c), P279: (c)-[]->(:Q6256)' \
--return 'distinct country as country' 

country_qnodes = set()
for r in country_result[1:]:
    country_qnodes.add(r)

Construct `alcoholic_qnodes`, the set of all alcoholic beverage qnodes.

In [79]:
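alcoholic_qnodes = set()
# The context manager closes the file once all lines have been read.
with open(os.environ["TEMP"] + "/Q154.descendant.tsv", "r") as descendant_file:
    for line in descendant_file:
        alcoholic_qnodes.add(line.split("\t")[0])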

The `build_embedding_projector_vectors` function builds the vectors file, a TSV file with one line for each vector. We do this by scanning through the full embeddings file and selecting the qnodes that are in our sets `alcoholic_qnodes` or `country_qnodes`. We also write a file of all the qnodes that we select. We will use this file later to construct the metadata file. We have to be careful to list the qnodes in the metadata file in the same order as they appear in the vectors file.

In [80]:
def build_embedding_projector_vectors(embeddings_path):
    input_path = embeddings_path + "/embeddings.txt"
    vectors_path = embeddings_path + "/projector.vectors.tsv"
    qnodes_path = embeddings_path + "/projector.qnodes.tsv"

    # Context managers close all three files, even if an error occurs.
    with open(input_path, "r") as w2v_file, \
         open(vectors_path, "w") as vectors_file, \
         open(qnodes_path, "w") as qnodes_file:
        qnodes_file.write("node1\n")

        # Skip the first line of the embeddings file (the header).
        next(w2v_file)
        for line in w2v_file:
            items = line.split(" ")
            qnode = items[0]
            if qnode in alcoholic_qnodes or qnode in country_qnodes:
                # The last item keeps its trailing newline, which ends the row.
                vectors_file.write("\t".join(items[1:]))
                qnodes_file.write("{}\n".format(qnode))

In [81]:
build_embedding_projector_vectors(os.environ["GE"])
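
The selection step can be tried on an in-memory file. The sketch below assumes the embeddings file is in word2vec text format (a first line with the vector count and dimension, then one space-separated `key v1 v2 ...` line per vector); `filter_w2v_text` and the toy data are hypothetical names used only for illustration.

```python
import io

def filter_w2v_text(w2v_file, keep):
    """Yield (qnode, tab-separated vector) for rows whose qnode is in keep."""
    next(w2v_file)  # skip the "<count> <dimensions>" header line
    for line in w2v_file:
        items = line.rstrip("\n").split(" ")
        if items[0] in keep:
            yield items[0], "\t".join(items[1:])

# Toy embeddings: three 2-dimensional vectors.
toy_embeddings = io.StringIO("3 2\nQ1 0.1 0.2\nQ2 0.3 0.4\nQ3 0.5 0.6\n")
selected = list(filter_w2v_text(toy_embeddings, keep={"Q1", "Q3"}))
```

Because the rows are emitted in file order, the list of selected qnodes has the same order as the selected vectors, which is the property the metadata file depends on.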

Let's take a peek at our qnodes file, which we use in the next step.

In [82]:
!head "$GE"/projector.qnodes.tsv

node1
Q2640292
Q56282772
Q2535077
Q42901501
Q3053356
Q28739942
Q3688856
Q3945342
Q3555792


The `build_embedding_projector_metadata` function uses a kypher query to retrieve the labels of the qnodes. In a later version we will also include the descriptions; for now we don't, because the query filters out qnodes that don't have descriptions, and unfortunately many alcoholic beverages are missing English descriptions.

The idea is:
- Retrieve the labels for all the qnodes using the kypher query. The query returns the results in arbitrary order.
- Build a dictionary that maps each node to the metadata that we want.
- Scan the qnodes file and for each qnode, write a metadata line in the metadata file (`projector.metadata.tsv`)

Our metadata file has three columns (you can have as many as you want):
- tag: includes the label and the focus type, as it is often difficult to tell from the label alone what type of beverage it is
- qnode
- focus type
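
The steps above can be sketched without kypher as follows. This is a simplified illustration under stated assumptions: `metadata_rows` is a hypothetical helper, `labels` stands in for the dictionary built from the query result, and `type_of` stands in for `focus_type`.

```python
def metadata_rows(ordered_qnodes, labels, type_of):
    """Build metadata lines in the same order as ordered_qnodes."""
    rows = ["tag\tqnode\ttype"]
    for qnode in ordered_qnodes:
        ftype = type_of(qnode)
        # Fall back to the qnode itself when no label is available.
        label = labels.get(qnode, qnode)
        rows.append("{} ({})\t{}\t{}".format(label, ftype, qnode, ftype))
    return rows

# Toy inputs: Q2 has no label, so its tag falls back to the qnode.
toy_order = ["Q2", "Q1"]
toy_labels = {"Q1": "'some beer'@en"}
toy_rows = metadata_rows(toy_order, toy_labels,
                         lambda q: "beer" if q == "Q1" else "other")
```

The dictionary lookup is what makes the arbitrary order of the query result harmless: the output order is driven entirely by the qnodes file.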

In [83]:
def build_embedding_projector_metadata(embeddings_path):
    kg_path = os.environ["OUT"] + "/parts"
    os.environ["_label_graph"] = kg_path + "/labels.en.tsv.gz"
    os.environ["_description_graph"] = kg_path + "/descriptions.en.tsv.gz"
    os.environ["_qnodes"] = embeddings_path + "/projector.qnodes.tsv"

    #result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_qnodes" \
    #--match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \
    #--return 'distinct n1 as node1, lab as `node1;label`, des as `node1;description`' 
    
    result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_qnodes" \
    --match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab)' \
    --return 'distinct n1 as node1, lab as `node1;label`'
    
    metadata_path = embeddings_path + "/projector.metadata.tsv"
    metadata_file = open(metadata_path, "w")
    metadata_file.write("tag\tqnode\ttype\n")

    qnode_dict = {}
    for line in result[1:]:
        items = line.split("\t")
        qnode = items[0]
        # qnode_dict[qnode] = "{} ({})".format(items[1], items[2])
        qnode_dict[qnode] = "{}".format(items[1])

    with open(os.environ["_qnodes"]) as qnodes_file:
        next(qnodes_file)  # skip the "node1" header
        for line in qnodes_file:
            qnode = line.rstrip("\n")
            ftype = focus_type(qnode)
            # Fall back to the qnode itself when no label was found.
            label = qnode_dict.get(qnode, qnode)
            tag = "{} ({})".format(label, ftype)
            metadata_file.write("{}\t{}\t{}\n".format(tag, qnode, ftype))

    metadata_file.close()

In [84]:
build_embedding_projector_metadata(os.environ["GE"])

Check that the file sizes are correct: the metadata file has one more line than the vectors file because it has a header row.

In [85]:
!wc "$GE"/projector.metadata.tsv "$GE"/projector.vectors.tsv

    2244   11695   95421 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/projector.metadata.tsv
    2243  224300 2808982 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding/projector.vectors.tsv
    4487  235995 2904403 total
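
The same check can be done in Python, for example before uploading the files. A minimal sketch, using toy files in a temporary directory instead of the real projector files; `count_lines` and `projector_files_aligned` are hypothetical helper names.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a text file."""
    with open(path, "r") as f:
        return sum(1 for _ in f)

def projector_files_aligned(vectors_path, metadata_path):
    """True when the metadata file has exactly one (header) line more than the vectors file."""
    return count_lines(metadata_path) == count_lines(vectors_path) + 1

# Toy files standing in for projector.vectors.tsv and projector.metadata.tsv.
toy_dir = tempfile.mkdtemp()
toy_vectors = os.path.join(toy_dir, "vectors.tsv")
toy_metadata = os.path.join(toy_dir, "metadata.tsv")
with open(toy_vectors, "w") as f:
    f.write("0.1\t0.2\n0.3\t0.4\n")
with open(toy_metadata, "w") as f:
    f.write("tag\tqnode\ttype\nA (beer)\tQ1\tbeer\nB (wine)\tQ2\twine\n")

aligned = projector_files_aligned(toy_vectors, toy_metadata)
```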


In [86]:
!head -1 "$GE"/projector.vectors.tsv

-0.134678289	0.328208089	0.410306215	0.311301500	0.363741189	-0.397925675	-0.163192526	0.551493526	-0.638068974	-0.709253848	0.075043671	-0.037439797	-0.499106675	-0.633605242	0.428065211	0.577703059	-0.945140064	0.482611597	0.198202372	0.359114230	0.249259233	-0.434400380	-0.269524962	0.549175620	0.736032188	0.178097680	0.041504886	-0.492026001	-0.080035999	-0.076001510	-0.057112414	-0.272092074	0.229329199	-0.500828743	0.199075758	0.492696315	0.410107374	-0.412885010	-0.354030132	0.048465252	0.521094620	0.203816339	-0.304734200	-0.199651301	-0.740915835	0.014186437	-0.378538668	0.544250429	-0.487764388	0.103201188	-0.548755169	-0.423733592	-0.130399838	-0.122459903	-0.555753589	0.169917032	0.528418005	0.376666993	-0.106688112	-0.312881023	-0.290667921	-0.414196581	-0.016444767	0.757796407	-0.267977566	0.477938861	-0.153773859	0.383622676	0.340801269	0.678838015	-0.499238700	-1.093385816	-0.130329251	0.741248250	-0.075507775	-0.105734833	-0.120644622	0.129789278	0.444864303	0.08069029

Now build the projector files for the text embeddings, and check that the sizes are correct.

In [52]:
build_embedding_projector_vectors(os.environ["TE"])

In [53]:
build_embedding_projector_metadata(os.environ["TE"])

In [54]:
!wc "$TE"/projector.metadata.tsv "$TE"/projector.vectors.tsv "$TE"/projector.qnodes.tsv

    2782   14542  118841 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.metadata.tsv
    2781 2847744 31710917 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.vectors.tsv
    2782    2782   24800 /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/text-embedding/projector.qnodes.tsv
    8345 2865068 31854558 total


### Google embedding projector
- Open https://projector.tensorflow.org
- Load your files using the Load button
- Configure the visualization

Here we searched on the right for Absolut Vodka, and we see the closest vectors as well as the cluster where it belongs:
![Google embedding projector](assets/embedding-projector.png "Google embedding projector")

### UMAP visualization of the graph embeddings


There are very few vodkas, so they are hard to see in the visualization.


![UMAP visualization](assets/graph-embedding-umap-13.png "UMAP visualization of graph embeddings")

### UMAP visualization of the text embeddings
There are very few vodkas, so they are hard to see in the visualization.


![UMAP visualization](assets/text-embedding-umap-17.png "UMAP visualization of text embeddings")

In [89]:
from sentence_transformers import SentenceTransformer


class ComputeEmbeddings:
    def __init__(self, model_name=None):
        if not model_name:
            self.model_name = 'bert-large-nli-cls-token'
        else:
            self.model_name = model_name

        self.model = SentenceTransformer(self.model_name)

    def get_vectors(self, sentence):
        """
            main function to get the vector representations of the descriptions
        """
        if isinstance(sentence, bytes):
            sentence = sentence.decode("utf-8")
        return self.model.encode([sentence], show_progress_bar=False)

In [90]:
em = ComputeEmbeddings()

In [177]:
v = em.get_vectors("beer company")[0]

In [178]:
te_vectors.similar_by_vector(v)

[('Q22333354', 0.88853520154953),
 ('Q878975', 0.8738116025924683),
 ('Q4880037', 0.8519435524940491),
 ('Q1637028', 0.8471935987472534),
 ('Q28530481', 0.8351479768753052),
 ('Q696787', 0.8316407203674316),
 ('Q20571254', 0.8302997350692749),
 ('Q460206', 0.829006552696228),
 ('Q899967', 0.8265884518623352),
 ('Q6439205', 0.8240088224411011)]
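
For intuition about the scores above, here is a pure-Python sketch of ranking by cosine similarity. It assumes that `te_vectors.similar_by_vector` ranks by cosine similarity, as gensim's `KeyedVectors` does; the functions `cosine` and `most_similar` and the toy vectors are hypothetical and are not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(query, vectors, topn=10):
    """Return the topn (key, similarity) pairs, most similar first."""
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]

# Toy 2-dimensional vectors keyed by made-up identifiers.
toy_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
ranking = most_similar([1.0, 0.1], toy_vectors, topn=2)
```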

In [158]:
!wd u Q2744746

id Q2744746
Label La Chouffe
Description beer brand from Belgium
instance of (P31): beer brand (Q15075508)
