# Embeddings

In [1]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *

ALIAS: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/aliases.en.tsv.gz"
ALL: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/all.tsv.gz"
CLAIMS: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/claims.tsv.gz"
DESCRIPTION: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/descriptions.en.tsv.gz"
EXAMPLES_DIR: "/Users/amandeep/Github/kgtk/examples"
GE: "/Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding"
ISA: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.isa.tsv.gz"
ITEM: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/claims.wikibase-item.tsv.gz"
LABEL: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/labels.en.tsv.gz"
OUT: "/Users/amandeep/Documents/kypher/wikidata_os_v5"
P279: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.P279.tsv.gz"
P279STAR: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.P279star

In [2]:
%cd {output_path}

/Users/amandeep/Documents/kypher


## Graph Embeddings

Normally, we would use `Q154ITEM`, but the partitioning failed, so we will compute it using kypher.

Amandeep, Jan 14, 2021: Partition succeeded, change this?

In [3]:
os.environ["Q154GRAPH"] = os.environ["TEMP"] + "/Q154.edges.4.tsv.gz"

In [4]:
!zcat < "$Q154ITEM" | head

id	node1	label	node2	node2;wikidatatype
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	wikibase-item
P10-P1855-Q69063653-c8cdb04c-0	P10	P1855	Q69063653	wikibase-item
P10-P1855-Q7378-555592a4-0	P10	P1855	Q7378	wikibase-item
P10-P31-Q18610173-85ef4d24-0	P10	P31	Q18610173	wikibase-item
P1001-P1855-Q103163-54a6fd56-0	P1001	P1855	Q103163	wikibase-item
P1001-P1855-Q11696-cdbf391b-0	P1001	P1855	Q11696	wikibase-item
P1001-P1855-Q12371988-12c10bc0-0	P1001	P1855	Q12371988	wikibase-item
P1001-P1855-Q181574-7f428c9b-0	P1001	P1855	Q181574	wikibase-item
P1001-P1855-Q19689183-3f30ea56-0	P1001	P1855	Q19689183	wikibase-item
zcat: error writing to output: Broken pipe


In [5]:
!zcat < "$Q154GRAPH" | wc

  208254  857521 11405766


In [6]:
!$kypher -i "$Q154GRAPH" -i "$TEMP"/Q154.metadata.property.datatype.tsv.gz -i "$Q154LABEL" \
--match 'edges: (n1)-[l {label: property}]->(n2), datatype: (property)-[]->(dt:`wikibase-item`), label: (n1)-[]->(lab)' \
--return 'distinct l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$GE"/geinput.tsv

We have over 86,000 lines:

In [7]:
!wc "$GE"/geinput.tsv

   86663  346652 4354175 /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv


Compute the graph embeddings with the `translation` operator. The output file `embeddings.txt` will be in word2vec format (`-ot w2v`), so we can use it directly in gensim.

In [8]:
!$kgtk graph-embeddings --verbose -i "$GE"/geinput.tsv \
-o "$GE"/embeddings.txt \
--retain_temporary_data True \
--operator translation \
--workers 5 \
--log "$GE"/ge.log \
-T "$GE" \
-ot w2v \
-e 600

In Processing, Please go to /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding/ge.log to check details
Opening the input file: /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv
KgtkReader: File_path.suffix: .tsv
KgtkReader: reading file /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding/geinput.tsv
header: id	node1	label	node2
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=1 label=2 node2=3 id=0
KgtkReader: Reading an edge file.
Opening the output file: /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv
File_path.suffix: .tsv
KgtkWriter: writing file /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding/tmp_geinput.tsv
header: id	node1	label	node2
Processing the input records.
Processed 86662 records.
Processed Finished.
      375.28 real      2742.21 user       395.41 sys


Let's look at the output directory:

In [11]:
!ls -hl "$GE"

total 137760
-rw-r--r--   1 amandeep  staff    61M Jan 14 10:21 embeddings.txt
-rw-r--r--   1 amandeep  staff   953K Jan 14 10:21 ge.log
-rw-r--r--   1 amandeep  staff   4.2M Jan 14 10:14 geinput.tsv
drwxr-xr-x  10 amandeep  staff   320B Jan 14 10:20 output
-rw-r--r--   1 amandeep  staff   1.6M Jan 14 10:14 tmp_geinput.tsv


Let's peek at the file; we have about 50K vectors of dimension 100:

In [12]:
!head -2 "$GE"/embeddings.txt

50595 100
Q1601968 -0.189542875 0.263272136 -0.467894673 -0.096159801 -0.260767043 -0.148187160 0.697097719 0.024128364 -0.066908717 0.088979505 -0.533382595 0.519155979 0.196563646 -0.731619656 -0.154397860 0.190175638 -0.356174946 -0.057612449 0.326212198 0.021003731 0.018041078 -0.041870262 -0.353938043 0.282775432 -0.172888696 -0.304219842 -0.016697280 -0.002433700 -0.339767098 0.378536344 -0.306584060 -0.070071086 -0.747376978 0.033365436 0.256729245 0.501700222 -0.411833316 0.624401689 0.290195197 -0.180381507 -0.747302055 0.005587193 0.020948432 -0.069586039 0.003260779 0.201090351 -0.174071208 0.599602699 -0.247729421 -0.059140623 -0.165437430 -0.008114293 -0.037619047 0.519872963 0.261861920 -0.195989668 0.049719550 -0.160675868 0.019724635 -0.515699267 0.068126924 0.127018496 0.223180100 -0.209688857 0.157721624 -0.255295545 0.235112444 -0.051866431 -0.292870343 0.577912152 0.083747998 0.532344818 -0.161367849 0.323484302 0.116688535 0.136800468 -0.000167912 0.324934244 0.380
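
The first line of a word2vec text file is a header with the number of vectors and their dimension, and every following line holds a key followed by its components. The sketch below checks a file-like object against its header. The function name `check_w2v_text` is illustrative and is not part of KGTK or gensim, and the sample values are made up.

```python
import io


def check_w2v_text(handle):
    """Return (count, dim) from the header and verify every row against it."""
    count, dim = (int(x) for x in handle.readline().split())
    rows = 0
    for line in handle:
        parts = line.split()
        # Each row is one key followed by exactly `dim` components.
        if len(parts) != dim + 1:
            raise ValueError("row {} has {} components, expected {}".format(
                rows + 1, len(parts) - 1, dim))
        rows += 1
    if rows != count:
        raise ValueError("header says {} vectors, found {}".format(count, rows))
    return count, dim


# Toy data in the same layout as embeddings.txt (values are made up).
sample = io.StringIO("2 3\nQ1 0.1 0.2 0.3\nQ2 -0.4 0.5 0.6\n")
print(check_w2v_text(sample))
```

Running such a check before loading a large file can catch a truncated or mis-converted file early.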

Load the vectors in gensim:

In [13]:
path = os.environ['GE'] + "/embeddings.txt"
ge_vectors = KeyedVectors.load_word2vec_format(path, binary=False)

In [14]:
# Q502268 is Johnnie Walker
ge_vectors['Q502268']

array([ 0.28878835, -0.35322767,  0.26007372,  0.4524916 , -0.08659455,
        0.4521467 , -0.04694197,  0.6531368 ,  0.07050663,  0.2013739 ,
       -0.03594051,  0.31004724, -0.07058927,  0.13580728,  0.02396606,
        0.1671944 ,  0.31901574, -0.43167108,  0.45736042,  0.5251668 ,
        0.855177  ,  0.17596672, -1.166093  , -0.3054819 ,  0.09798457,
        0.3316938 ,  0.13404633, -1.2845033 ,  0.7766631 ,  0.06099582,
       -0.13624653,  0.26146337, -0.64696527, -0.06243888, -0.8558508 ,
        0.23678172,  1.3432237 , -0.0358898 ,  0.4447332 , -0.7872621 ,
        1.3248678 , -1.5155113 ,  0.4170288 ,  1.1499861 , -0.19721869,
        0.05920016, -0.39469308, -1.0859811 ,  0.3567379 , -0.36590043,
        0.57748246,  0.22165635,  0.23174508,  0.5522654 ,  0.06546084,
       -0.4951431 ,  0.00652562,  0.69837475,  0.32054716, -0.7406864 ,
        0.32900575, -0.5262449 , -0.21720454, -0.76523215,  0.10023912,
       -0.49566865,  0.39148957,  1.6146023 , -0.7188857 , -0.75
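
The similarity scores reported in the rest of this section come from gensim, which ranks neighbors by cosine similarity. The following is a minimal pure-Python sketch of that measure and of a brute-force neighbor search over a small dictionary of vectors. The helper names and the toy vectors are assumptions for illustration; gensim's own implementation is vectorized and differs in detail.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_neighbors(vectors, key, topn=5):
    """Rank every other key in a {key: vector} dict by similarity to `key`."""
    scores = [
        (other, cosine_similarity(vectors[key], vec))
        for other, vec in vectors.items()
        if other != key
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-d vectors with made-up keys, for illustration only.
toy = {"a": [1.0, 0.0], "b": [2.0, 0.1], "c": [0.0, 1.0]}
print(top_neighbors(toy, "a", topn=2))
```

Because the measure depends only on the angle between vectors, `b` ranks above `c` for `a` even though `b` is longer than `a`.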

Find the most similar qnodes to `Q15874936`, the qnode for Michelob.

In [15]:
ge_vectors.most_similar(positive=['Q15874936'], topn=5)

[('Q610672', 0.8602399230003357),
 ('Q14694794', 0.819470226764679),
 ('Q2567026', 0.8172732591629028),
 ('Q2706702', 0.7299268841743469),
 ('Q1888522', 0.7114064693450928)]

This is hard to use because the results are qnodes and we have no idea what they are. Let's define a function to fetch the labels and descriptions so that we can interpret the results more easily.

`kgtk_most_similar` is a wrapper around gensim's `most_similar` function, designed to output the results in KGTK format. The `kg_path` argument is required if we want to output the labels and descriptions, as this path is where the `labels.en.tsv.gz` and `descriptions.en.tsv.gz` files are stored. You can optionally provide an `output_path` to store the results in a file; otherwise the results are returned as a dataframe.

In [16]:
def kgtk_most_similar(
    vectors,
    positive,
    relation_label="similarity_score",
    kg_path=None,
    add_label_description=True,
    output_path=None,
    topn=25,
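):
    """
    Return the `topn` nodes most similar to `positive` as KGTK-style rows.

    If `kg_path` is given and `add_label_description` is true, labels and
    descriptions are added. Results are written to `output_path` if provided;
    otherwise they are returned as a dataframe.
    """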
    result = []
    if add_label_description and kg_path:
        fp = tempfile.NamedTemporaryFile(
            mode="w", suffix=".tsv", delete=False, encoding="utf-8"
        )
        fp.write("node1\tlabel\tnode2\n")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            fp.write("{}\t{}\t{}\n".format(qnode, relation_label, similarity))
        filename = fp.name
        fp.close()

        os.environ["_label_graph"] = kg_path + "/labels.en.tsv.gz"
        os.environ["_description_graph"] = kg_path + "/descriptions.en.tsv.gz"
        os.environ["_temp_file"] = filename

        result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_temp_file" --as sim \
--match 'sim: (n1)-[]->(similarity), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, lab as `node1;label`, des as `node1;description`' \
--order-by 'cast(similarity, float) desc' 
        
        os.remove(filename)
        
    else:
        # No trailing newlines here: lines are split on tabs or written with "\n" below
        result.append("node1\tlabel\tnode2")
        for (qnode, similarity) in vectors.most_similar(positive=positive, topn=topn):
            result.append("{}\t{}\t{}".format(qnode, relation_label, similarity))

    if output_path:
        handle = open(output_path, "w")
        for line in result:
            handle.write(line)
            handle.write("\n")
        handle.close()
    else:
        columns = result[0].split("\t")
        data = []
        for line in result[1:]:
            data.append(line.split("\t"))
        return pd.DataFrame(data, columns=columns)

Let's give it a try:

In [17]:
# Q15874936 is Michelob
kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q610672,0.8602399230003357,similarity,'Budweiser'@en,'brand of pale lager'@en
1,Q14694794,0.819470226764679,similarity,'Salitos'@en,'American beer brand'@en
2,Q212654,0.6493383646011353,similarity,'Washington Football Team'@en,'American football team'@en
3,Q1341618,0.6382841467857361,similarity,'Leffe'@en,'trademark'@en
4,Q85269976,0.6330214738845825,similarity,'Busch Beer'@en,'brand of beer owned by Anheuser-Busch'@en
5,Q21286736,0.6311758160591125,similarity,'Samuel Adams'@en,'American brand of beer'@en


## Text embeddings

In [18]:
!zcat < $OUT/all.tsv.gz | head

id	node1	label	node2
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508
P10-P1659-P1651-c4068028-0	P10	P1659	P1651
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238
P10-P1659-P51-86aca4c5-0	P10	P1659	P51
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950
P10-P1855-Q69063653-c8cdb04c-0	P10	P1855	Q69063653
zcat: error writing to output: Broken pipe


The `kgtk text-embedding` command computes sentence vectors for each Qnode in the knowledge graph. The input to this command is a sorted KGTK edge file.

This is a two-step process:

**Create a sentence for a Qnode using user specified properties**

In the command below, we have specified the following options:

- `--label-properties label` specifies that the property `label` has the label for the Qnode.
- `--isa-properties P31 P279 P452 P106` specifies that `instance of` for the Qnode is defined by the properties `P31 P279 P452 P106`.
- `--description-properties description` specifies that the property `description` has the description for the Qnode.
- `--property-value P186 P17 P127 P176 P169` tells the command to use the property labels and values from the properties `P186 P17 P127 P176 P169` to add additional context to the sentence for the Qnode.

Example sentence here
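
As a rough illustration of this first step, the sketch below assembles a sentence from the four groups of properties. The template, the function name `build_sentence`, and the sample values are all assumptions made for illustration; the wording that `kgtk text-embedding` actually generates may differ.

```python
def build_sentence(label, description="", isa=(), property_values=()):
    """Join the pieces of a node description into one sentence.

    property_values is a sequence of (property label, value label) pairs.
    The template used here is hypothetical, not KGTK's own.
    """
    parts = [label]
    if description:
        parts.append(description)
    if isa:
        parts.append("is a " + ", ".join(isa))
    if property_values:
        parts.append("has " + ", ".join(
            "{} {}".format(prop, value) for prop, value in property_values))
    return ", ".join(parts) + "."


# Made-up sample values, not taken from the KG.
print(build_sentence(
    "Example Lager",
    description="brand of beer",
    isa=["beer brand"],
    property_values=[("country", "Exampleland")],
))
```

The point of the sketch is only that the choice of properties determines which facts about a Qnode reach the language model.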

**Compute sentence vector using the sentence created in the previous step**

The command then computes a vector for the sentence using one of the available models, specified as:
- `--model bert-large-nli-cls-token`

For more information on this command, please [click here](https://kgtk.readthedocs.io/en/latest/analysis/text_embedding/)


In [19]:
!$kgtk text-embedding -i $OUT/all.tsv.gz \
--embedding-projector-metadata-path none \
--label-properties label \
--isa-properties P31 P279 P452 P106 \
--description-properties description \
--property-value P186 P17 P127 P176 P169 \
--has-properties "" \
-f kgtk_format \
--output-data-format kgtk_format \
--save-embedding-sentence \
--model bert-large-nli-cls-token \
-o "$TE" \
> "$TE"/text-embedding.tsv

100%|██████████████████████████████████████| 1.24G/1.24G [03:02<00:00, 6.80MB/s]
100%|███████████████████████████████████| 61298/61298 [4:10:29<00:00,  4.08it/s]
    15300.45 real     14970.96 user       261.58 sys


Duration --parallel 1
15300.45 real     14970.96 user       261.58 sys

The text embeddings are output in KGTK format and we need them in word2vec format (need to enhance the command to produce w2v format). For now, define a function to convert the KGTK embeddings to w2v format.

In [20]:
def convert_kgtk_to_w2v(input_path, output_path, text_embedding_label="text_embedding"):
    """
    Convert a KGTK file (node1/label/node2) that contains embeddings to the w2v format
    """
    vector_count = 0
    vector_length = 0
    
    # Read the file once to count the lines as we need to put them at the top of the w2v file
    with open(input_path, "r") as kgtk_file:
        next(kgtk_file)
        for line in kgtk_file:
            items = line.split("\t")
            qnode = items[0]
            label = items[1]
            if label == text_embedding_label:
                if vector_count == 0:
                    vector_length = len(items[2].split(","))
                vector_count += 1

    with open(output_path, "w") as w2v_file:
        w2v_file.write("{} {}\n".format(vector_count, vector_length))
        with open(input_path, "r") as kgtk_file:
            next(kgtk_file)
            for line in kgtk_file:
                items = line.split("\t")
                qnode = items[0]
                label = items[1]
                if label == text_embedding_label:
                    vector = items[2].replace(",", " ")
                    w2v_file.write(qnode + " " + vector)

In [21]:
convert_kgtk_to_w2v(os.environ['TE'] + "/text-embedding.tsv", os.environ['TE'] + "/embeddings.txt")
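
The conversion is a small text transformation, and it can be checked on a tiny in-memory example. The sketch below applies the same idea to a list of lines. `kgtk_lines_to_w2v` is an illustrative name, the `embedding_sentence` label in the toy input is made up, and the sketch assumes the vector is stored as comma-separated values in `node2`, as in the function above.

```python
def kgtk_lines_to_w2v(lines, text_embedding_label="text_embedding"):
    """Convert KGTK edge lines (header first) into word2vec text lines."""
    rows = []
    for line in lines[1:]:  # skip the KGTK header
        items = line.rstrip("\n").split("\t")
        if len(items) >= 3 and items[1] == text_embedding_label:
            rows.append(items[0] + " " + items[2].replace(",", " "))
    # The dimension is taken from the first vector; 0 if there are none.
    dim = len(rows[0].split(" ")) - 1 if rows else 0
    return ["{} {}".format(len(rows), dim)] + rows


# Toy input: two embedding edges and one edge with a different (made-up) label.
sample = [
    "node1\tlabel\tnode2",
    "Q1\ttext_embedding\t0.1,0.2,0.3",
    "Q1\tembedding_sentence\tsome sentence",
    "Q2\ttext_embedding\t0.4,0.5,0.6",
]
for row in kgtk_lines_to_w2v(sample):
    print(row)
```

Unlike the file-based function, this variant holds all rows in memory, so it is suited to small tests rather than the full embedding file.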

Let's look at the output file; the embeddings have 1024 dimensions:

In [24]:
!head -10 "$TE"/embeddings.txt | tail -2

Q99970346 -0.35467714 0.11079551 -0.011766396 -0.64368856 0.7587074 -0.029240295 -0.34339845 -0.06344555 -1.6708547 0.388923 0.016877629 -0.016170679 0.17811266 1.0552806 -0.10560113 0.5062175 -0.37100965 -0.43509555 -0.7369594 0.9275887 0.6351612 -0.026170328 -0.6812031 -0.49427545 0.15076277 0.497177 -0.5669475 0.33832487 0.38121685 -0.34155178 -0.03627377 0.019129895 0.32135636 -1.3127131 0.2910208 -0.6110071 -0.21233878 -0.26547825 -0.48265418 0.19074659 -0.221765 -0.6583791 0.26793227 0.106484234 -0.51117957 -0.9209578 -0.53469723 -0.8773248 -0.50579745 0.28408417 -0.33325395 0.9733218 -0.20266499 1.2573 -0.67561316 -0.42509234 0.93198144 0.104132675 -0.72978777 0.61797714 0.5810334 0.11720219 -0.5360808 -1.2015952 0.31788793 -1.6091578 0.29825193 0.25895777 0.34890306 -0.64605564 -0.8923556 -0.6606609 0.27037627 0.13712278 0.047953844 0.9390667 -1.0347372 -1.0345485 0.6995126 0.8249064 0.35724065 0.27384388 -0.73517066 -0.35368446 -1.0148574 1.8248662 -0.07850242 1.4354324 0.5072

Load the text embeddings in gensim

In [26]:
te_path = os.environ['TE'] + "/embeddings.txt"
te_vectors = KeyedVectors.load_word2vec_format(te_path, binary=False)

### Compare the graph and text embeddings

Most similar nodes to Johnnie Walker using the **graph embeddings**

In [27]:
# Q502268 is Johnnie Walker
kgtk_most_similar(ge_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q1799948,0.8151147961616516,similarity,'Ladies of Leisure'@en,'1930 film by Frank Capra'@en
1,Q4865371,0.8101491928100586,similarity,'Bartlet for America'@en,'episode of The West Wing (S3 E9)'@en
2,Q7084279,0.8094995617866516,similarity,'Old Ironsides'@en,'1926 film by James Cruze'@en
3,Q7736602,0.8019922971725464,similarity,'The Girl of the Golden West'@en,'1930 film by John Francis Dillon'@en
4,Q2288328,0.7627399563789368,similarity,'The Matinee Idol'@en,"'1928 film by Walt Disney, Frank Capra'@en"
5,Q209135,0.6924302577972412,similarity,'East Ayrshire'@en,'council area of Scotland'@en
6,Q628737,0.6563564538955688,similarity,'Campbeltown Single Malts'@en,'single malt Scotch whiskies distilled in the ...
7,Q1761185,0.6248385906219482,similarity,'Pimm\\\\'s'@en,'Alcoholic drink brand'@en
8,Q773797,0.6149157285690308,similarity,'Dalwhinnie distillery'@en,'whisky distillery'@en


Most similar nodes to Johnnie Walker using the **text embeddings**

In [28]:
# Q502268 is Johnnie Walker
kgtk_most_similar(te_vectors, positive=['Q502268'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q280,0.9379171133041382,similarity,'Lagavulin Distillery'@en,"'Scotch whisky distillery in Lagavulin, Islay,..."
1,Q2490031,0.9346836805343628,similarity,'William Grant & Sons'@en,'Scottish company which distills Scotch whisky...
2,Q1543646,0.9012988805770874,similarity,'Rob Roy'@en,'cocktail based on Scotch whisky'@en
3,Q382947,0.8983699083328247,similarity,'Scotch whisky'@en,"'malt or grain whisky (or a blend of the two),..."
4,Q2168523,0.8907997012138367,similarity,'The Famous Grouse'@en,'brand of Scotch whisky'@en
5,Q1069502,0.8856704235076904,similarity,'Chivas Regal'@en,'Blended Scotch Whisky produced by Chivas Brot...
6,Q6744642,0.8838940858840942,similarity,'malt whisky'@en,"'Distilled spirit from Scotland (a/k/a \\\\""Sc..."
7,Q4821838,0.8762272596359253,similarity,'Aultmore distillery'@en,"'whisky distillery in Moray, Scotland, UK'@en"
8,Q4720319,0.8761684894561768,similarity,'Alexander Walker'@en,'Scottish whisky distiller'@en
9,Q1754978,0.8664095401763916,similarity,'Rusty Nail'@en,'cocktail mixing Drambuie and Scotch whisky'@en


The graph embeddings produce poor results as the top matches are not related to whiskey. The text embeddings look much better.

Most similar nodes to Michelob using the **graph embeddings**

In [29]:
# Q15874936 is Michelob
kgtk_most_similar(ge_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q610672,0.8602399230003357,similarity,'Budweiser'@en,'brand of pale lager'@en
1,Q14694794,0.819470226764679,similarity,'Salitos'@en,'American beer brand'@en
2,Q212654,0.6493383646011353,similarity,'Washington Football Team'@en,'American football team'@en
3,Q1341618,0.6382841467857361,similarity,'Leffe'@en,'trademark'@en
4,Q85269976,0.6330214738845825,similarity,'Busch Beer'@en,'brand of beer owned by Anheuser-Busch'@en
5,Q21286736,0.6311758160591125,similarity,'Samuel Adams'@en,'American brand of beer'@en


Most similar nodes to Michelob using the **text embeddings**

In [30]:
# Q15874936 is Michelob
kgtk_most_similar(te_vectors, positive=['Q15874936'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q2011473,0.9664472341537476,similarity,'Fantôme'@en,'brand of beer'@en
1,Q3315575,0.9586231708526612,similarity,'Bersalis'@en,'beer brand'@en
2,Q3518554,0.9563601016998292,similarity,'Floris'@en,'beer brand'@en
3,Q15076069,0.9531255960464478,similarity,'Marckloff'@en,'beer brand'@en
4,Q1277388,0.951164722442627,similarity,'Pripps Blå'@en,'beer brand'@en
5,Q1917255,0.9475076794624328,similarity,'St-Idesbald'@en,'beer'@en
6,Q263980,0.9443504810333252,similarity,'Soproni'@en,'beer mark'@en


The graph embeddings contain some bad results, but the top matches are better as they include beers that are more closely related to Michelob. The text embeddings are reasonable as they include only beers.

Most similar nodes to vodka using the **graph embeddings**

In [31]:
# Q374 is vodka
kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q4220357,0.8792701959609985,similarity,'Kizlyarka'@en,"'Grape vodka made in Kizlyar, Dagestan, Russia..."
1,Q26236698,0.8739076852798462,similarity,'Trump Vodka'@en,'American brand of vodka produced by The Trump...
2,Q5458524,0.8690915703773499,similarity,'Fleischmann\\\\'s vodka'@en,"'gin, and whiskey'@en"
3,Q2401798,0.8629519939422607,similarity,'Ursus'@en,'Icelandic-Dutch vodka'@en
4,Q8050608,0.8593697547912598,similarity,'Yazi Ginger Vodka'@en,'brand of vodka'@en
5,Q7468032,0.8534938097000122,similarity,'Vodka'@en,'Detective Conan character'@en
6,Q20577688,0.8512636423110962,similarity,'.vodka'@en,'top-level Internet domain'@en
7,Q22236232,0.8460223078727722,similarity,'Kors Vodka'@en,'Good'@en
8,Q22236238,0.8386521339416504,similarity,'Mariette'@en,"'vodka, alcohol'@en"


Most similar nodes to vodka using the **text embeddings**

In [32]:
# Q374 is vodka
kgtk_most_similar(te_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q4869283,0.9598516821861268,similarity,'Batini'@en,'vodka-based cocktail'@en
1,Q3562046,0.9595369696617126,similarity,'Vodka Stinger'@en,'type of cocktail'@en
2,Q2206588,0.9436805248260498,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
3,Q22236238,0.9384632110595704,similarity,'Mariette'@en,"'vodka, alcohol'@en"
4,Q7939317,0.9203516244888306,similarity,'Vodka Cruiser'@en,'brand of vodka-based alcoholic drink'@en
5,Q11802565,0.915537178516388,similarity,'Pan Tadeusz'@en,'brand of vodka'@en
6,Q268057,0.9129105806350708,similarity,'cosmopolitan'@en,'cocktail made with vodka'@en
7,Q4782617,0.9107506275177002,similarity,'Aqua Velva'@en,'vodka and gin based cocktail'@en


The graph embeddings are noisy, as the top matches include nodes not related to vodka. The text embeddings look much better.
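
The comparisons so far are made by eye. One simple way to put a number on how much two neighbor lists agree is the Jaccard overlap of their qnode sets. The sketch below uses the qnodes from the two vodka results shown above; the helper name `jaccard` is illustrative.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Neighbors of vodka (Q374) copied from the two result tables above.
graph_neighbors = [
    "Q4220357", "Q26236698", "Q5458524", "Q2401798", "Q8050608",
    "Q7468032", "Q20577688", "Q22236232", "Q22236238",
]
text_neighbors = [
    "Q4869283", "Q3562046", "Q2206588", "Q22236238", "Q7939317",
    "Q11802565", "Q268057", "Q4782617",
]

print(sorted(set(graph_neighbors) & set(text_neighbors)))
print(jaccard(graph_neighbors, text_neighbors))
```

For these two lists the only shared qnode is `Q22236238` (Mariette), so the two embeddings agree on very little even when both return vodka-related nodes.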

Let's look at countries now as the differences between the two types of embeddings are more striking.
The graph embeddings retrieve nodes that are related to Ireland:

In [33]:
# Q27 Ireland
kgtk_most_similar(ge_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q9676,0.8533604145050049,similarity,'Isle of Man'@en,'British Crown dependency'@en
1,Q178283,0.8005253672599792,similarity,'County Limerick'@en,'county in Ireland'@en
2,Q93195,0.8004276752471924,similarity,'Ulster'@en,'province in Ireland'@en
3,Q31747,0.7979310154914856,similarity,'Irish Free State'@en,'state on the island of Ireland between Decemb...
4,Q4368623,0.7965912818908691,similarity,'Category:Republic of Ireland'@en,'Wikimedia category'@en
5,Q184760,0.7961042523384094,similarity,'County Monaghan'@en,'county in Ireland'@en
6,Q178626,0.7955714464187622,similarity,'County Mayo'@en,'county in Ireland'@en
7,Q107397,0.7947046756744385,similarity,'County Leitrim'@en,'county in Ireland'@en
8,Q187402,0.7924615144729614,similarity,'County Cavan'@en,'county in Ireland'@en
9,Q162475,0.7917241454124451,similarity,'County Cork'@en,'county in Ireland'@en


The text embeddings retrieve other countries:

In [34]:
# Q27 Ireland
kgtk_most_similar(te_vectors, positive=['Q27'], kg_path=os.environ['OUT'] + "/parts", topn=10)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q191,0.7966251969337463,similarity,'Estonia'@en,'sovereign state in northeastern Europe'@en
1,Q37,0.7891267538070679,similarity,'Lithuania'@en,'sovereign state in northeastern Europe'@en
2,Q20,0.7881592512130737,similarity,'Norway'@en,'sovereign state in northern Europe'@en
3,Q34,0.7823097109794617,similarity,'Sweden'@en,'sovereign state in northern Europe'@en
4,Q35,0.7809572815895081,similarity,'Denmark'@en,'sovereign state in northern Europe that is pa...
5,Q33,0.7614077925682068,similarity,'Finland'@en,'sovereign state in northern Europe'@en
6,Q1526538,0.7550898194313049,similarity,'Reykjavík North'@en,'one of the six constituencies (kjördæmi) of I...
7,Q16965019,0.7516392469406128,similarity,'North borough of Brescia'@en,'one of 5 boroughs of Brescia'@en
8,Q189,0.7509456276893616,similarity,'Iceland'@en,"'sovereign state in Northern Europe, situated ..."
9,Q22,0.7428288459777832,similarity,'Scotland'@en,"'country in Northwest Europe, part of the Unit..."


### Using the embeddings in queries to the KG

In [35]:
# Q281 whiskey
# Q282 wine
# Q3246609 mixed drink
# Q374 vodka
# Q332378 is absolut

Get the most similar nodes to **absolut**, the Swedish vodka, using the text embeddings and put them in a file:

In [36]:
# Q332378 is absolut
kgtk_most_similar(te_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['TE'] + "/Q332378.sim.tsv")

In [37]:
result = !head "$TE"/Q332378.sim.tsv
kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q7312560,0.9494207501411438,similarity,'Renat'@en,'Swedish vodka'@en
1,Q406157,0.9068877696990968,similarity,'bäsk'@en,'Swedish style spiced liquor'@en
2,Q1034035,0.8990318775177002,similarity,'Finlandia Vodka'@en,'Finnish brand of vodka'@en
3,Q374,0.8908253908157349,similarity,'vodka'@en,'distilled alcoholic beverage'@en
4,Q2553569,0.8900324106216431,similarity,'vodka martini'@en,'cocktail made with vodka and vermouth'@en
5,Q2206588,0.8866581916809082,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
6,Q268057,0.8860777616500854,similarity,'cosmopolitan'@en,'cocktail made with vodka'@en
7,Q4021706,0.8785414695739746,similarity,'Xan'@en,'Vodka from Goygol'@en
8,Q4869283,0.8784171938896179,similarity,'Batini'@en,'vodka-based cocktail'@en


Suppose I have absolut vodka and I want to make a cocktail. I can use the KG graph of the most similar nodes to absolut, and search the KG for mixed drinks (`Q3246609`) that appear in the list of most similar nodes to absolut.

Here are some drinks we can make with absolut vodka. The query starts with our similarity file (`Q332378.sim.tsv`) in clause `sim` and filters it to select the qnodes that are instances of mixed drink (`Q3246609`) using clauses `isa` and `star`. Then the first `claims` clause selects those that have vodka as an ingredient (`Q374`) and the second `claims` clause retrieves the other ingredients.
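
For readers new to Kypher, the join described above can be sketched in plain Python over toy data. The function name, the dictionary layout, and the class identifiers `CLASS_COCKTAIL` and `CLASS_VODKA_BRAND` are hypothetical, and so are the class assignments; only the qnodes, scores, and ingredients are taken from this tutorial's outputs.

```python
def drinks_with_ingredient(similar, isa, subclass_star, ingredients,
                           target_class, required_ingredient):
    """Filter (qnode, score) pairs and list the ingredients of the survivors.

    similar:        list of (qnode, similarity) pairs
    isa:            {qnode: set of direct classes}
    subclass_star:  {class: set of its superclasses, assumed to include itself}
    ingredients:    {qnode: set of ingredient qnodes} (P186 edges)
    """
    rows = []
    for qnode, score in sorted(similar, key=lambda p: p[1], reverse=True):
        classes = set()
        for c in isa.get(qnode, ()):
            classes |= subclass_star.get(c, {c})
        made_of = ingredients.get(qnode, set())
        if target_class in classes and required_ingredient in made_of:
            for ingredient in sorted(made_of):
                rows.append((qnode, score, ingredient))
    return rows


# Toy data; the class assignments are hypothetical.
similar = [("Q7312560", 0.9494207501411438),   # Renat
           ("Q2553569", 0.8900324106216431),   # vodka martini
           ("Q1966883", 0.8709858655929565)]   # Yorsh
isa = {"Q7312560": {"CLASS_VODKA_BRAND"},
       "Q2553569": {"CLASS_COCKTAIL"},
       "Q1966883": {"Q3246609"}}
subclass_star = {"CLASS_COCKTAIL": {"CLASS_COCKTAIL", "Q3246609"},
                 "Q3246609": {"Q3246609"}}
ingredients = {"Q2553569": {"Q374", "Q26877423"},
               "Q1966883": {"Q374", "Q44"}}

for row in drinks_with_ingredient(similar, isa, subclass_star, ingredients,
                                  "Q3246609", "Q374"):
    print(row)
```

Renat is dropped because it is not a mixed drink, while the two drinks that contain vodka are kept and expanded into one row per ingredient, which mirrors the role of the two `claims` clauses.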

In [50]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$TE"/Q332378.sim.tsv -i "$Q154CLAIMS" -i "$Q154LABEL" \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class), \
  claims: (n1)-[:P186]->(:Q374), claims: (n1)-[:P186]->(ingredient), label: (ingredient)-[]->(i_label)' \
--return 'distinct n1 as node1, similarity as node2, n1.label, n1.description, \
  ingredient as ingredient, i_label as `ingredient label`' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 20 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,node1;label,node1;description,ingredient,ingredient label
0,Q2553569,0.8900324106216431,'vodka martini'@en,'cocktail made with vodka and vermouth'@en,Q1105343,'cocktail glass'@en
1,Q2553569,0.8900324106216431,'vodka martini'@en,'cocktail made with vodka and vermouth'@en,Q1621080,'olive'@en
2,Q2553569,0.8900324106216431,'vodka martini'@en,'cocktail made with vodka and vermouth'@en,Q26877166,'lemon twist'@en
3,Q2553569,0.8900324106216431,'vodka martini'@en,'cocktail made with vodka and vermouth'@en,Q26877423,'dry vermouth'@en
4,Q2553569,0.8900324106216431,'vodka martini'@en,'cocktail made with vodka and vermouth'@en,Q374,'vodka'@en
5,Q2206588,0.8866581916809082,'Caipiroska'@en,'cocktail prepared with vodka'@en,Q374,'vodka'@en
6,Q1966883,0.8709858655929565,'Yorsh'@en,'Russian drink of beer and vodka'@en,Q374,'vodka'@en
7,Q1966883,0.8709858655929565,'Yorsh'@en,'Russian drink of beer and vodka'@en,Q44,'beer'@en
8,Q1723060,0.8683922290802002,'Kamikaze'@en,"'cocktail of vodka, triple sec and lime juice'@en",Q1105343,'cocktail glass'@en
9,Q1723060,0.8683922290802002,'Kamikaze'@en,"'cocktail of vodka, triple sec and lime juice'@en",Q3539556,'triple sec'@en


The results are good, lots of choices of cocktails. Note that the embeddings are able to generalize from a specific vodka to vodka in general. The example also illustrates that KGTK can use the results of queries to gensim within queries to the KG.

**This cell sometimes does not produce results. Seems to be randomly working?**

When we try the query using the graph embeddings, and do not explicitly filter the ingredients to include vodka:

In [51]:
# Q332378 is absolut
kgtk_most_similar(ge_vectors, positive=['Q332378'], kg_path=os.environ['OUT'] + "/parts", topn=2000, output_path=os.environ['GE'] + "/Q332378.sim.tsv")

In [57]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$GE"/Q332378.sim.tsv \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, n1.label, n1.description' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q3562046,0.5027239322662354,similarity,'Vodka Stinger'@en,'type of cocktail'@en
1,Q11346028,0.4882080554962158,similarity,'Yokohama'@en,'gin-based cocktail'@en
2,Q11328065,0.4468060433864593,similarity,'Balalaika'@en,"'Japanese short drink, cocktail'@en"
3,Q2206588,0.4369325637817383,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
4,Q3157309,0.4058018922805786,similarity,'Jack Rose'@en,'short drink popular in the 1920s and 1930s'@en
5,Q455914,0.4020183980464935,similarity,'Vodka Red Bull'@en,'alcoholic beverage'@en
6,Q921623,0.3949605226516723,similarity,'Sazerac'@en,'cognac or whiskey cocktail'@en
7,Q87587764,0.3918575346469879,similarity,'The Transporter'@en,'cocktail'@en
8,Q11313508,0.390922337770462,similarity,'Sledgehammer'@en,'cocktail with vodka'@en
9,Q11335008,0.3907725214958191,similarity,'Blue Monday'@en,'cocktail'@en


The results are poor as for the most part, the retrieved cocktails do not have vodka. Let's try the query with vodka instead of absolut vodka.

**This cell sometimes does not produce results. Seems to be randomly working, same as above?**

Now let's get the qnodes that are similar to vodka (`Q374`) using the graph embeddings:

In [58]:
# Q374 vodka
kgtk_most_similar(ge_vectors, positive=['Q374'], kg_path=os.environ['OUT'] + "/parts", topn=1000, output_path=os.environ['GE'] + "/Q374.sim.tsv")

In [59]:
result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$GE"/Q374.sim.tsv \
--match 'sim: (n1)-[]->(similarity), isa: (n1)-[]->(isa), star: (isa)-[]->(class)' \
--return 'distinct n1 as node1, similarity as node2, "similarity" as label, n1.label, n1.description' \
--order-by 'cast(similarity, float) desc' \
--where 'class = "Q3246609"' \
--limit 10 

kgtk_to_dataframe(result)

Unnamed: 0,node1,node2,label,node1;label,node1;description
0,Q2206588,0.6694541573524475,similarity,'Caipiroska'@en,'cocktail prepared with vodka'@en
1,Q3562046,0.5724750757217407,similarity,'Vodka Stinger'@en,'type of cocktail'@en
2,Q11328065,0.5633806586265564,similarity,'Balalaika'@en,"'Japanese short drink, cocktail'@en"
3,Q455914,0.5060765147209167,similarity,'Vodka Red Bull'@en,'alcoholic beverage'@en
4,Q5459745,0.4931340217590332,similarity,'flirtini'@en,"'cocktail containing vodka, champagne and pine..."
5,Q26879480,0.4924023151397705,similarity,'Godmother'@en,'cocktail'@en
6,Q11346028,0.4803248643875122,similarity,'Yokohama'@en,'gin-based cocktail'@en
7,Q5580053,0.4762502014636993,similarity,'Golden Russian'@en,'cocktail of vodka and Galliano'@en
8,Q1966883,0.4719317555427551,similarity,'Yorsh'@en,'Russian drink of beer and vodka'@en
9,Q5103598,0.4694012403488159,similarity,'Chocolate Cake'@en,'cocktail'@en


The results are good. Somehow, the graph embeddings are able to retrieve the cocktails that have vodka, but cannot generalize from absolut vodka to vodka.

## Produce files to load in the Google Embedding Projector
The Google Embedding Projector (https://projector.tensorflow.org) is a tool for visualizing embeddings. To use it, we need two files:

- a TSV file with the vectors
- a TSV file with the metadata, in the same order as the vectors
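
As a rough illustration of the two files, the sketch below writes a tiny vectors file and a matching metadata file from toy data. The `write_projector_files` helper, the file names and the embedding values are assumptions made for this example only; the point is that row *i* of the metadata (after its header) describes row *i* of the vectors.

```python
import os
import tempfile

# Hypothetical toy embeddings: qnode -> (label, vector). Values are made up.
toy = {
    "Q_x": ("drink x", [0.1, -0.2, 0.3]),
    "Q_y": ("drink y", [0.4, 0.5, -0.6]),
}


def write_projector_files(embeddings, out_dir):
    """Write vectors.tsv and metadata.tsv with rows in the same order."""
    vectors_path = os.path.join(out_dir, "vectors.tsv")
    metadata_path = os.path.join(out_dir, "metadata.tsv")
    with open(vectors_path, "w") as vf, open(metadata_path, "w") as mf:
        # Multi-column metadata carries a header row; the vectors file does not.
        mf.write("label\tqnode\n")
        for qnode, (label, vector) in embeddings.items():
            vf.write("\t".join(str(x) for x in vector) + "\n")
            mf.write("{}\t{}\n".format(label, qnode))
    return vectors_path, metadata_path


out_dir = tempfile.mkdtemp()
vectors_path, metadata_path = write_projector_files(toy, out_dir)
```

Writing both files in a single loop is the simplest way to guarantee that the orders match.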

We don't want to load all the vectors into the projector because there are too many to visualize. We will load only the following types, as it will be interesting to see whether they cluster properly.

In [60]:
focus_types = {
    "Q3246609": "mixed drink",
    "Q44": "beer",
    "Q282": "wine",
    "Q281": "whiskey",
    "Q374": "vodka",
    "Q6256": "country",
}

To do the filtering, we construct a dictionary that maps every q-node in the KG to the set of all its superclasses. We will use this dictionary later to tag each q-node with one of the focus types. For every q-node we will test if the focus type is in the set of all its superclasses.

In [61]:
classes_result = !$kypher_raw -i "$ISA" -i "$Q154CLAIMS" -i "$TEMP"/Q154.descendant.tsv -i "$P279STAR" \
--match 'isa: (n1)-[]->(c), P279: (c)-[]->(class), claims: ()-[]->(class), descendant: (n1)-[]->()' \
--return 'distinct n1 as qnode, class as class' 

class_dict = {}
for r in classes_result[1:]:
    # Each row is "qnode<TAB>class"; collect all classes of a qnode in one set.
    qnode, isa = r.split("\t")[:2]
    class_dict.setdefault(qnode, set()).add(isa)

Let's look at the `class_dict` entry for Johnnie Walker (`Q502268`). We see that Johnnie Walker has many superclasses.

In [65]:
class_dict['Q502268']

{'Q102205',
 'Q107715',
 'Q11024',
 'Q11028',
 'Q111352',
 'Q11435',
 'Q1150070',
 'Q1166770',
 'Q11795009',
 'Q1190554',
 'Q1194058',
 'Q12767945',
 'Q131257',
 'Q13878858',
 'Q1400881',
 'Q1422299',
 'Q154',
 'Q15401930',
 'Q1554231',
 'Q15619164',
 'Q1632297',
 'Q16686448',
 'Q16722960',
 'Q167270',
 'Q1681365',
 'Q16887380',
 'Q16889133',
 'Q1704572',
 'Q174984',
 'Q1786828',
 'Q17988854',
 'Q187931',
 'Q1914636',
 'Q20817253',
 'Q20937557',
 'Q2095',
 'Q2150504',
 'Q22269697',
 'Q22272508',
 'Q22294683',
 'Q223557',
 'Q23009552',
 'Q23009675',
 'Q2424752',
 'Q246672',
 'Q25481995',
 'Q26907166',
 'Q27166344',
 'Q281',
 'Q28728771',
 'Q28732711',
 'Q28813620',
 'Q28877',
 'Q2944660',
 'Q29651519',
 'Q2990593',
 'Q2996394',
 'Q309314',
 'Q31464082',
 'Q3249551',
 'Q337060',
 'Q35120',
 'Q35758',
 'Q3695082',
 'Q382947',
 'Q386724',
 'Q40050',
 'Q4026292',
 'Q427581',
 'Q42848',
 'Q43460564',
 'Q4373292',
 'Q4406616',
 'Q4437984',
 'Q46737',
 'Q478798',
 'Q483247',
 'Q488383',
 'Q529

In [66]:
def focus_type(qnode):
    """
    Retrieve the focus type for any qnode, and return "other" for nodes that are not instances of our focus types.
    """
    # Countries are identified through the separately built country_qnodes set.
    if qnode in country_qnodes:
        return "country"
    # Look up the superclasses once instead of on every loop iteration.
    classes = class_dict.get(qnode, set())
    for t, name in focus_types.items():
        if t in classes:
            return name
    return "other"
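
To see how this lookup behaves, here is a small self-contained sketch with toy data. The `toy_*` names and the `Q_...` identifiers are hypothetical stand-ins for the notebook's `focus_types`, `class_dict` and `country_qnodes`; only the set-membership logic is meant to carry over.

```python
# Toy stand-ins for the notebook's focus_types, class_dict and country_qnodes.
# The Q_... node identifiers are hypothetical.
toy_focus_types = {"Q44": "beer", "Q282": "wine"}
toy_class_dict = {
    "Q_lager_brand": {"Q44", "Q154"},
    "Q_merlot": {"Q282"},
    "Q_unrelated": {"Q35120"},
}
toy_country_qnodes = {"Q_some_country"}


def toy_focus_type(qnode):
    """Tag a qnode with the first focus type found among its superclasses."""
    if qnode in toy_country_qnodes:
        return "country"
    classes = toy_class_dict.get(qnode, set())
    for t, name in toy_focus_types.items():
        if t in classes:
            return name
    return "other"


nodes = ["Q_lager_brand", "Q_merlot", "Q_some_country", "Q_unrelated", "Q_missing"]
labels = [toy_focus_type(q) for q in nodes]
```

Nodes without an entry in the class dictionary, such as `Q_missing` here, fall through to "other" rather than raising an error.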

Construct `country_qnodes`, the set of all country qnodes.

In [67]:
country_result = !$kypher_raw -i "$ISA" -i "$P279STAR" -i "$Q154CLAIMS" \
--match 'claims: (country)-[]->(), isa: (country)-[:isa]->(c), P279: (c)-[]->(:Q6256)' \
--return 'distinct country as country' 

country_qnodes = set()
for r in country_result[1:]:
    country_qnodes.add(r)

Construct `alcoholic_qnodes`, the set of all alcoholic beverage qnodes.

In [68]:
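alcoholic_qnodes = set()
# Use a context manager so the file is closed after reading.
with open(os.environ["TEMP"] + "/Q154.descendant.tsv", "r") as descendant_file:
    for line in descendant_file:
        alcoholic_qnodes.add(line.split("\t")[0])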

The `build_embedding_projector_vectors` function builds the vectors file, a TSV file with one line for each vector. We do this by scanning through the full embeddings file and selecting qnodes that are in our set of `alcoholic_qnodes` or `country_qnodes`. We also write a file of all the qnodes that we select. We will use this file later to construct the metadata file. We have to be careful to list the qnodes in the metadata file in the same order as they appear in the vectors file.

In [69]:
def build_embedding_projector_vectors(embeddings_path):
    input_path = embeddings_path + "/embeddings.txt"
    vectors_path = embeddings_path + "/projector.vectors.tsv"
    qnodes_path = embeddings_path + "/projector.qnodes.tsv"

    # Open all three files in one context so they are closed even on errors.
    with open(input_path, "r") as w2v_file, \
         open(vectors_path, "w") as vectors_file, \
         open(qnodes_path, "w") as qnodes_file:
        qnodes_file.write("node1\n")

        # Skip the header line of the embeddings file.
        next(w2v_file)
        for line in w2v_file:
            items = line.split(" ")
            qnode = items[0]
            if qnode in alcoholic_qnodes or qnode in country_qnodes:
                # The last item keeps its trailing newline, which ends the row.
                vectors_file.write("\t".join(items[1:]))
                qnodes_file.write("{}\n".format(qnode))

In [70]:
build_embedding_projector_vectors(os.environ["GE"])

Let's take a peek at our qnodes file, which we use in the next step.

In [71]:
!head "$GE"/projector.qnodes.tsv

node1
Q282221
Q87193814
Q2535077
Q30932951
Q2715616
Q2013069
Q2395665
Q103349186
Q91304643


The `build_embedding_projector_metadata` function uses a kypher query to retrieve the labels of the qnodes (in a later version we will also include the descriptions; for now we don't because the query filters out qnodes that don't have descriptions, and unfortunately, many alcoholic beverages are missing English descriptions).

The idea is:
- Retrieve the labels for all the qnodes using the kypher query. The query returns the results in arbitrary order.
- Build a dictionary that maps each node to the metadata that we want.
- Scan the qnodes file and for each qnode, write a metadata line in the metadata file (`projector.metadata.tsv`)

Our metadata file has three columns (you can have as many as you want):
- tag: includes the label and the focus type as it is often difficult to tell from the tag what type of beverage it is
- qnode
- focus type

In [72]:
def build_embedding_projector_metadata(embeddings_path):
    kg_path = os.environ["OUT"] + "/parts"
    os.environ["_label_graph"] = kg_path + "/labels.en.tsv.gz"
    os.environ["_description_graph"] = kg_path + "/descriptions.en.tsv.gz"
    os.environ["_qnodes"] = embeddings_path + "/projector.qnodes.tsv"

    #result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_qnodes" \
    #--match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab), description: (n1)-[]->(des)' \
    #--return 'distinct n1 as node1, lab as `node1;label`, des as `node1;description`' 
    
    result = !$kypher_raw -i "$_label_graph" -i "$_description_graph" -i "$_qnodes" \
    --match 'qnodes: (n1)-[]->(), label: (n1)-[]->(lab)' \
    --return 'distinct n1 as node1, lab as `node1;label`'
    
    metadata_path = embeddings_path + "/projector.metadata.tsv"
    metadata_file = open(metadata_path, "w")
    metadata_file.write("tag\tqnode\ttype\n")

    qnode_dict = {}
    for line in result[1:]:
        items = line.split("\t")
        qnode = items[0]
        # qnode_dict[qnode] = "{} ({})".format(items[1], items[2])
        qnode_dict[qnode] = "{}".format(items[1])

    with open(os.environ["_qnodes"]) as qnodes_file:
        next(qnodes_file)
        for line in qnodes_file:
            qnode = line.rstrip("\n")
            ftype = focus_type(qnode)
            # Fall back to the qnode itself when it has no English label.
            label = qnode_dict.get(qnode, qnode)
            tag = "{} ({})".format(label, ftype)
            metadata_file.write("{}\t{}\t{}\n".format(tag, qnode, ftype))

    metadata_file.close()

In [73]:
build_embedding_projector_metadata(os.environ["GE"])

Check that the file sizes are correct: the metadata file has one more line because it has a header.

In [74]:
!wc "$GE"/projector.metadata.tsv "$GE"/projector.vectors.tsv

    2866   14978  122404 /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding/projector.metadata.tsv
    2865  286500 3573349 /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding/projector.vectors.tsv
    5731  301478 3695753 total
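
The same check can be sketched in Python, which is handy if it should run automatically after building the files. The `count_lines` and `projector_files_aligned` helpers and the temporary demo files are hypothetical and not part of the notebook.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines of a text file without loading it into memory."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


def projector_files_aligned(vectors_path, metadata_path):
    """True when the metadata file has exactly one (header) line more than the vectors file."""
    return count_lines(metadata_path) == count_lines(vectors_path) + 1


# Small demo on temporary files with made-up content.
demo_dir = tempfile.mkdtemp()
demo_vectors = os.path.join(demo_dir, "vectors.tsv")
demo_metadata = os.path.join(demo_dir, "metadata.tsv")
with open(demo_vectors, "w") as f:
    f.write("0.1\t0.2\n0.3\t0.4\n")
with open(demo_metadata, "w") as f:
    f.write("tag\tqnode\ttype\na\tQ_1\tbeer\nb\tQ_2\twine\n")

aligned = projector_files_aligned(demo_vectors, demo_metadata)
```

For the files built above, the call would take the paths of `projector.vectors.tsv` and `projector.metadata.tsv` under `os.environ["GE"]`.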


In [75]:
!head -1 "$GE"/projector.vectors.tsv

0.802669048	-0.964344561	-0.193773538	-0.148945779	-0.007782838	-0.157434672	0.577200949	-1.281233549	0.198331118	0.589500129	-0.006058669	0.147418320	0.740693808	-0.437887728	-0.351121515	-0.356504679	0.168335319	-0.468036890	-0.580174923	0.120565958	-0.499842435	-0.195666820	-0.684497178	-0.058450073	0.352301121	-0.851026535	0.460445255	0.314766288	-0.118312418	0.283018619	-0.240185469	0.794400871	-1.590748668	-0.490659028	-0.339496970	0.402179122	-0.488775909	0.495471150	1.355804205	-1.034523129	0.491589516	-0.487702727	-0.119838580	-0.583049595	1.449240685	0.106511809	-0.815095901	-0.102886721	-0.648804426	0.569476008	0.583674550	0.472205669	0.419318050	-0.293761969	0.076773778	0.139429271	0.949091315	0.289780438	0.848724961	1.315896273	0.899223864	0.246369481	-0.170488834	0.431306630	-0.076363429	0.023316680	-0.383828551	0.707150340	0.351247162	-0.161992744	0.972927213	1.075955629	-0.828405678	0.659954309	0.946677089	-1.575971842	-0.908337712	-0.438164473	0.067075402	0.461092591	-

Now build the projector files for the text embeddings, and check that the sizes are OK.

In [76]:
build_embedding_projector_vectors(os.environ["TE"])

In [77]:
build_embedding_projector_metadata(os.environ["TE"])

In [78]:
!wc "$TE"/projector.metadata.tsv "$TE"/projector.vectors.tsv "$TE"/projector.qnodes.tsv

    2866   14978  122404 /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/text-embedding/projector.metadata.tsv
    2865 2933760 32668364 /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/text-embedding/projector.vectors.tsv
    2866    2866   25582 /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/text-embedding/projector.qnodes.tsv
    8597 2951604 32816350 total


### Google embedding projector
- Open https://projector.tensorflow.org
- Load your files using the load button
- Configure the visualization

Here we searched on the right for Absolut Vodka, and we see the closest vectors as well as the cluster where it belongs:
![Google embedding projector](assets/embedding-projector.png "Google embedding projector")

### UMAP visualization of the graph embeddings


Very few vodkas, hard to see them in the visualization.


![UMAP visualization](assets/graph-embedding-umap-13.png "UMAP visualization of graph embeddings")

### UMAP visualization of the text embeddings
Very few vodkas, hard to see them in the visualization.


![UMAP visualization](assets/text-embedding-umap-17.png "UMAP visualization of text embeddings")

In [89]:
from sentence_transformers import SentenceTransformer


class ComputeEmbeddings:
    def __init__(self, model_name=None):
        if not model_name:
            self.model_name = 'bert-large-nli-cls-token'
        else:
            self.model_name = model_name

        self.model = SentenceTransformer(self.model_name)

    def get_vectors(self, sentence):
        """
            main function to get the vector representations of the descriptions
        """
        if isinstance(sentence, bytes):
            sentence = sentence.decode("utf-8")
        return self.model.encode([sentence], show_progress_bar=False)

In [90]:
em = ComputeEmbeddings()

In [177]:
v = em.get_vectors("beer company")[0]

In [178]:
te_vectors.similar_by_vector(v)

[('Q22333354', 0.88853520154953),
 ('Q878975', 0.8738116025924683),
 ('Q4880037', 0.8519435524940491),
 ('Q1637028', 0.8471935987472534),
 ('Q28530481', 0.8351479768753052),
 ('Q696787', 0.8316407203674316),
 ('Q20571254', 0.8302997350692749),
 ('Q460206', 0.829006552696228),
 ('Q899967', 0.8265884518623352),
 ('Q6439205', 0.8240088224411011)]
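
The scores above look like cosine similarities. As a minimal sketch, assuming `te_vectors` behaves like a gensim `KeyedVectors` object whose `similar_by_vector` ranks entries by cosine similarity to the query, the lookup can be pictured as follows. The `cosine` and `rank_by_cosine` helpers and the two-dimensional toy vectors are hypothetical and only illustrate the ranking, not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(vectors, query, topn=10):
    """Return the topn (key, score) pairs, most similar to the query first."""
    scored = [(key, cosine(vec, query)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 2-dimensional vectors; real text embeddings have many more dimensions.
toy_vectors = {"Q_a": [1.0, 0.0], "Q_b": [0.0, 1.0], "Q_c": [1.0, 1.0]}
ranked = rank_by_cosine(toy_vectors, [1.0, 0.1], topn=2)
```

A brute-force scan like this is linear in the number of vectors, which is fine for a toy example but slow for a full knowledge graph.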

In [158]:
!wd u Q2744746

id Q2744746
Label La Chouffe
Description beer brand from Belgium
instance of (P31): beer brand (Q15075508)
