# Query knowledge graphs and embeddings with KGTK Kypher-V

Kypher-V supports importing and querying vector data. It extends
Kypher to work with unstructured data such as text and images,
represented by embedding vectors. Kypher-V provides efficient storage,
indexing and querying of large-scale vector data on a laptop. It is fully
integrated into Kypher to enable expressive hybrid queries over
Wikidata-size structured and unstructured data. To the best of our
knowledge, this is the first system providing such functionality in a
query language for knowledge graphs.

Please see the [**Kypher-V Manual**](https://kgtk.readthedocs.io/en/latest/transform/query/#kypher-v)
for an introduction to the basic concepts and usage.

<A NAME="setup"></A>
### Setup

Some preliminaries to facilitate command invocation and result formatting:

In [1]:
import re
from IPython.display import display, HTML
from kgtk.functions import kgtk

def show_html(img_width=150):
    """Display command output in 'out' as HTML after munging image links for inline display."""
    output = '\n'.join(out)
    html = re.sub(r'<td>&quot;(https?://upload.wikimedia.org/[^<]+)&quot;</td>', 
                  f'<td style="width:{img_width}px;vertical-align:top"><img src="\\1"/></td>', 
                  output)
    display(HTML(html))
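
To make the substitution performed by `show_html` more concrete, here is a minimal standalone sketch of the same rewrite applied to a single made-up table row.  The helper name `inline_images` and the sample row are hypothetical; we only assume that image URLs arrive in cells of the form `<td>&quot;URL&quot;</td>`, which is what the regular expression above looks for:

```python
import re

def inline_images(html_text, img_width=150):
    """Replace table cells holding a quoted Wikimedia image URL with an <img> cell."""
    return re.sub(
        r'<td>&quot;(https?://upload\.wikimedia\.org/[^<]+)&quot;</td>',
        f'<td style="width:{img_width}px;vertical-align:top"><img src="\\1"/></td>',
        html_text)

# Hypothetical row, shaped like the cells the regular expression expects:
row = ('<tr><td>Q76</td>'
       '<td>&quot;https://upload.wikimedia.org/example.jpg&quot;</td></tr>')

print(inline_images(row, img_width=200))
```

Cells that do not contain a quoted `upload.wikimedia.org` URL are left untouched, so the rest of the table renders as before.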

This notebook contains a number of example queries using Kypher-V. The queries assume the existence of a number of similarity graph caches in the DB directory which are defined here via shell variables:

In [2]:
DB="/kgtk-data/kypherv"
%env DB={DB}
%env MAIN={DB}/wikidata-20221102-dwd-v8-main.sqlite3.db
%env COMPLEX={DB}/wikidata-20221102-dwd-v8-complex-embeddings.sqlite3.db
%env TRANSE={DB}/wikidata-20221102-dwd-v8-transe-embeddings.sqlite3.db
%env ABSTRACT={DB}/wikidata-20221102-dwd-v8-abstract-embeddings.sqlite3.db
%env IMAGE={DB}/wikimedia-capcom-image-embeddings-v2.sqlite3.db

env: DB=/kgtk-data/kypherv
env: MAIN=/kgtk-data/kypherv/wikidata-20221102-dwd-v8-main.sqlite3.db
env: COMPLEX=/kgtk-data/kypherv/wikidata-20221102-dwd-v8-complex-embeddings.sqlite3.db
env: TRANSE=/kgtk-data/kypherv/wikidata-20221102-dwd-v8-transe-embeddings.sqlite3.db
env: ABSTRACT=/kgtk-data/kypherv/wikidata-20221102-dwd-v8-abstract-embeddings.sqlite3.db
env: IMAGE=/kgtk-data/kypherv/wikimedia-capcom-image-embeddings-v2.sqlite3.db


If you copied the graph caches to a different location, please adjust the
paths and definitions accordingly.

Throughout the notebook we use a number of different invocation styles for
the `kgtk` command to better control the appearance of the generated output.
We either use it via the `!kgtk ...` syntax directly, use the `kgtk(...)`
function which produces an HTML rendering of a Pandas frame containing the
result, or we use the `show_html` function for some additional control on
how long texts and inline images are displayed.  All of these incantations
should be straightforward to translate into a shell environment if needed.

<A NAME="graph-caches"></A>
### Similarity graph caches

The examples in this notebook use a number of different standard and similarity
graph caches based on `wikidata-20221102-dwd-v8`.  These graph caches are
available in the `DB` directory of the `ckg06` server from where they can be
copied or accessed directly in example queries.  It will generally not be
possible to run the notebook directly from that server, so if you want to
run and experiment with the notebook in a Jupyter environment, you have to
copy the graph caches to a different location where a notebook server can be run.
Make sure to also include the associated ANNS index files that end in
a `.faiss.idx` extension.

This notebook also does not show how the individual similarity caches were
constructed.  To see how that can be done, please consult
the [**Kypher-V Manual**](https://kgtk.readthedocs.io/en/latest/transform/query/#kypher-v)
or look at the respective `*.db.build.txt` files in the `DB` directory.  For reference,
we show just one incantation here on how the `COMPLEX` graph cache was built.  Other
graph caches were built similarly with some modifications to adjust for differences in
the embedding data used (for `COMPLEX` this takes about 3 hours to run):

```
$ export WD=.../datasets/wikidata-20221102-dwd-v8

$ cat $WD/wikidatadwd.complEx.graph-embeddings.txt | sed -e 's/ /\t/' \
      | kgtk --debug add-id --no-input-header=False --input-column-names node1 node2 \
                   --implied-label emb \
           / query --gc $DB/wikidata-20221102-dwd-v8-complex-embeddings.sqlite3.db \
                   -i - --as complex \
                   --idx vector:node2/nn/ram=25g/nlist=16k mode:valuegraph \
                   --single-user --limit 5
```

We use the following similarity graph caches which can be combined
with a main graph cache using one or more `--auxiliary-cache` or `--ac`
options.  The `COMPLEX` graph cache contains 59M 100-D ComplEx
graph embeddings:

In [3]:
!kgtk query --gc $COMPLEX --sc

Graph Cache:
DB file: /kgtk-data/kypherv/wikidata-20221102-dwd-v8-complex-embeddings.sqlite3.db
  size:  28.92 GB   	free:  0 Bytes   	modified:  2022-12-15 20:40:26

KGTK File Information:
complex:
  size:  0 Bytes   	modified:  2022-12-15 17:55:31   	graph:  graph_1

Graph Table Information:
graph_1:
  size:  29.76 GB   	created:  2022-12-15 17:55:31
  header:  ['node1', 'label', 'node2', 'id']


The `TRANSE` graph cache contains 59M 100-D TransE graph embeddings:

In [4]:
!kgtk query --gc $TRANSE --sc

Graph Cache:
DB file: /kgtk-data/kypherv/wikidata-20221102-dwd-v8-transe-embeddings.sqlite3.db
  size:  28.92 GB   	free:  0 Bytes   	modified:  2022-12-17 11:39:02

KGTK File Information:
transe:
  size:  0 Bytes   	modified:  2022-12-16 14:09:02   	graph:  graph_1

Graph Table Information:
graph_1:
  size:  29.76 GB   	created:  2022-12-16 14:09:02
  header:  ['node1', 'label', 'node2', 'id']


The `ABSTRACT` graph cache contains the sentences and embedding vectors
generated from the first sentences of Wikipedia short abstracts.  It
contains about 6M 768-D Roberta base vectors:

In [5]:
!kgtk query --gc $ABSTRACT --sc

Graph Cache:
DB file: /kgtk-data/kypherv/wikidata-20221102-dwd-v8-abstract-embeddings.sqlite3.db
  size:  26.32 GB   	free:  0 Bytes   	modified:  2023-01-09 18:14:00

KGTK File Information:
sentence:
  size:  256.32 MB   	modified:  2023-01-04 13:53:44   	graph:  graph_2
abstract:
  size:  0 Bytes   	modified:  2023-01-09 13:45:47   	graph:  graph_1

Graph Table Information:
graph_1:
  size:  25.16 GB   	created:  2023-01-09 13:45:47
  header:  ['node1', 'label', 'node2', 'id']
graph_2:
  size:  1.23 GB   	created:  2023-01-09 18:13:31
  header:  ['node1', 'label', 'node2', 'id']


The `IMAGE` graph cache contains image embeddings published by the
<a href="https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/">
Wikipedia image/caption matching challenge</a>.  The embeddings are 2048-D vectors
taken from the second-to-last layer of a ResNet-50 neural network trained with
Imagenet data.  We only use the 2.7M images associated with English Wikipedia
pages.  The resulting vector graph cache is shown here:

In [6]:
!kgtk query --gc $IMAGE --sc

Graph Cache:
DB file: /kgtk-data/kypherv/wikimedia-capcom-image-embeddings-v2.sqlite3.db
  size:  24.39 GB   	free:  0 Bytes   	modified:  2023-01-11 14:10:32

KGTK File Information:
wiki_image:
  size:  0 Bytes   	modified:  2023-01-11 12:54:36   	graph:  graph_1

Graph Table Information:
graph_1:
  size:  24.42 GB   	created:  2023-01-11 12:54:36
  header:  ['node1', 'label', 'node2', 'id', 'page_url', 'qnode']


Finally, we also use a standard Wikidata graph cache for the claims and
labels of `wikidata-20221102-dwd-v8`.  It is called `MAIN` below.

<A NAME="vector-tables"></A>
### Vector tables are regular KGTK files

Any KGTK representation that associates a node or edge ID with a vector
will work.  A format we commonly use is where a `node1` points to a vector
literal in `node2` via an `emb` edge (but any label will do).  For example,
here we show the first three embedding edges in `COMPLEX` (the `node2;_kgtk_vec_qcell`
column is an auxiliary column automatically computed by ANNS indexing):

In [7]:
kgtk("""query --gc $COMPLEX -i complex --limit 3""")

Unnamed: 0,node1,label,node2,id,node2;_kgtk_vec_qcell
0,Q102108199,emb,b'x13x99x13?x96xb7xf9xbdxb0x99x0fxbexf1xd4|>&x...,E465008,0
1,Q28980109,emb,b'xa1xdax8e=xdfx17x1e>xffxa4y=xf8+(xbeaxb5!xbd...,E686337,0
2,Q42012492,emb,b'txb8xe4xbexfcR;?x00xd6xd1>x87x1fxcdxbeTIx88x...,E1762936,0
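
The byte strings shown in `node2` are binary vector literals.  As a rough illustration of how such a value could be unpacked outside of KGTK, the sketch below assumes that the bytes are a flat sequence of little-endian 32-bit floats.  That layout is an assumption on our part, and `decode_vector` is a hypothetical helper, not a KGTK function:

```python
import struct

def decode_vector(blob):
    """Interpret a byte string as a sequence of little-endian float32 values."""
    count = len(blob) // 4                      # 4 bytes per float32 (assumed layout)
    return list(struct.unpack(f'<{count}f', blob[:count * 4]))

# Round-trip a small made-up vector; a 100-D embedding would occupy 400 bytes.
packed = struct.pack('<3f', 0.5, -0.25, 2.0)
print(len(packed))             # 12
print(decode_vector(packed))   # [0.5, -0.25, 2.0]
```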


<A NAME="vector-computation"></A>
### Vector computation

The simplest operation in Kypher-V is a similarity computation between two vectors
which we perform here using the `ABSTRACT` graph cache:

In [8]:
kgtk(""" 
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels
      --match 'abstract: (x:Q868)-[]->(xv),
                         (y:Q913)-[]->(yv),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --return 'xl as xlabel, yl as ylabel, kvec_cos_sim(xv, yv) as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Aristotle'@en,'Socrates'@en,0.908608
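
For readers who want to see the arithmetic, here is a small pure-Python sketch of cosine similarity.  We assume that `kvec_cos_sim` computes the standard cosine similarity, as its name suggests; the function `cos_sim` below is only an illustration and not the Kypher-V implementation:

```python
import math

def cos_sim(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cos_sim([1.0, 0.0], [1.0, 0.0]))   # 1.0 (same direction)
print(cos_sim([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
```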


<A NAME="brute-force-search"></A>
### Brute-force similarity search

A more interesting operation is *similarity search* where we look
for the most similar matches for a given seed.  In the query below, we
use a simple but expensive brute-force search over about 10,000 input
vectors by computing similarities between `x` and each possible `y`,
then sorting and returning the top-10.  This is still pretty fast
given that the set of inputs is fairly small:

In [9]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x:Q913)-[]->(xv), (y)-[]->(yv),
               claims:   (y)-[:P106]->(:Q4964182),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --return 'xl as xlabel, yl as ylabel, kvec_cos_sim(xv, yv) as sim'
      --order  'sim desc'
      --limit 10
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Socrates'@en,'Socrates'@en,1.0
1,'Socrates'@en,'early life of Plato'@en,0.93826
2,'Socrates'@en,'Aristippus'@en,0.934973
3,'Socrates'@en,'Empedocles'@en,0.930798
4,'Socrates'@en,'Adamantios Korais'@en,0.928561
5,'Socrates'@en,'Menedemus'@en,0.928002
6,'Socrates'@en,'Plato'@en,0.926748
7,'Socrates'@en,'Eubulides'@en,0.925711
8,'Socrates'@en,'Iosipos Moisiodax'@en,0.924585
9,'Socrates'@en,'Henry Oldenburg'@en,0.923927
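
Conceptually, the brute-force query above scores every candidate against the seed and keeps the best ones.  The following sketch shows that pattern on a handful of made-up 2-D vectors; the names `brute_force_topk` and `candidates` are hypothetical, and the real query of course operates on 768-D vectors stored in the graph cache:

```python
import heapq
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def brute_force_topk(seed, candidates, k):
    """Score every candidate against the seed and keep the k most similar."""
    scored = ((cos_sim(seed, vec), name) for name, vec in candidates.items())
    return heapq.nlargest(k, scored)

# Made-up candidate vectors keyed by a node name:
candidates = {'a': [1.0, 0.0], 'b': [0.6, 0.8], 'c': [0.0, 1.0], 'd': [-1.0, 0.0]}

for sim, name in brute_force_topk([1.0, 0.0], candidates, k=2):
    print(name, round(sim, 3))
```

The cost grows linearly with the number of candidates, which is why this approach stops being practical for millions of vectors.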


There are about 9M Q5's (humans) in the claims graph (the count below is not restricted to those that have short abstract vectors):

In [10]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x:Q913)-[]->(xv),
               claims:   (y)-[:P31]->(:Q5)'
      --return 'count(distinct y)' --force
     """)

Unnamed: 0,"count(DISTINCT graph_1_c2.""node1"")"
0,8944218


If we used the same brute-force search from above on this much larger set,
it would take about 5 min to run (which is why this command is disabled):

In [None]:
!time DISABLED kgtk query --gc $MAIN \
                 --ac $ABSTRACT \
      -i abstract -i labels -i claims \
      --match 'abstract: (x:Q913)-[]->(xv), (y)-[]->(yv), \
               claims:   (y)-[:P31]->(:Q5), \
               labels:   (x)-[]->(xl), (y)-[]->(yl)' \
      --return 'xl as xlabel, yl as ylabel, kvec_cos_sim(xv, yv) as sim' \
      --order  'sim desc' \
      --limit 10

```
xlabel	ylabel	sim
'Socrates'@en	'Socrates'@en	1.0000001192092896
'Socrates'@en	'Anytus'@en	0.9346579909324646
'Socrates'@en	'Heraclitus'@en	0.9344534277915955
'Socrates'@en	'Hippocrates'@en	0.9304061532020569
'Socrates'@en	'Cleisthenes'@en	0.9292828440666199
'Socrates'@en	'Aristides'@en	0.9283562898635864
'Socrates'@en	'Yannis Xirotiris'@en	0.926308274269104
'Socrates'@en	'Sotiris Trivizas'@en	0.9255445003509521
'Socrates'@en	'Aris Maragkopoulos'@en	0.9234243035316467
'Socrates'@en	'Valerios Stais'@en	0.919943630695343
93.859u 38.640s 4:49.84 45.7%	0+0k 18782808+8io 0pf+0w
```

<A NAME="indexed-search"></A>
### Indexed similarity search

For much faster search, we use the ANNS index that was constructed when the vector data
was imported.  The query now runs in less than a second, compared to about 5 minutes before.
Results here are slightly different from above, since this query does not restrict on
occupation = philosopher (we will address that later):

In [12]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x:Q913)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 5, nprobe: 4}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
      --limit 10
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Socrates'@en,'Socrates'@en,1.0
1,'Socrates'@en,'Histories'@en,0.93762
2,'Socrates'@en,'Cadmus'@en,0.915083
3,'Socrates'@en,'Eudorus of Alexandria'@en,0.914027
4,'Socrates'@en,'John Wilkins'@en,0.913926
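
To give some intuition for the `nprobe` parameter, here is a toy sketch of the general inverted-file idea behind many ANNS indexes: vectors are grouped into cells around centroids, and a search only scans the `nprobe` cells closest to the seed.  This is an illustration under our own assumptions (fixed hand-picked centroids, made-up 2-D vectors, hypothetical function names) and not a description of how the Kypher-V index is implemented:

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def build_index(vectors, centroids):
    """Assign each vector to the cell of its most similar centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for name, vec in vectors.items():
        best = max(range(len(centroids)), key=lambda i: cos_sim(vec, centroids[i]))
        cells[best].append(name)
    return cells

def search(seed, vectors, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids are most similar to the seed."""
    order = sorted(range(len(centroids)),
                   key=lambda i: cos_sim(seed, centroids[i]), reverse=True)
    hits = [(cos_sim(seed, vectors[n]), n) for i in order[:nprobe] for n in cells[i]]
    return sorted(hits, reverse=True)[:k]

# Made-up data; real centroids would be learned by clustering the vectors.
vectors = {'a': [1.0, 0.1], 'b': [0.9, 0.3], 'c': [0.1, 1.0],
           'd': [0.3, 0.9], 'e': [-1.0, 0.1]}
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
cells = build_index(vectors, centroids)

seed = [1.0, 0.2]
print([n for _, n in search(seed, vectors, centroids, cells, k=3, nprobe=1)])
print([n for _, n in search(seed, vectors, centroids, cells, k=3, nprobe=2)])
```

With `nprobe=1` only one cell is scanned, so fewer than `k` results can come back; a larger `nprobe` trades speed for recall.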


<A NAME="similarity-join"></A>
### Full similarity join

Below we query for three philosophers' top-k similar neighbors that are also humans and have
occupation (`P106`) philosopher.  Dynamic scaling ensures that `k` gets increased dynamically
up to `maxk` until we've found enough qualifying results for each:

In [13]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 5, maxk: 1024, nprobe: 4}]->(y),
               claims:   (y)-[:P106]->(:Q4964182),
                         (y)-[:P31]->(:Q5),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q859", "Q868", "Q913"] and x != y'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Plato'@en,'Aenesidemus'@en,0.965394
1,'Plato'@en,'Hicetas'@en,0.96499
2,'Plato'@en,'Empedocles'@en,0.962913
3,'Plato'@en,'Eubulides'@en,0.962904
4,'Plato'@en,'Aristotle'@en,0.961594
5,'Aristotle'@en,'Bryson of Achaea'@en,0.974303
6,'Aristotle'@en,'Michael Papageorgiou'@en,0.970041
7,'Aristotle'@en,'Hicetas'@en,0.967692
8,'Aristotle'@en,'Anaxarchus'@en,0.967682
9,'Aristotle'@en,'Metrodorus of Lampsacus'@en,0.967349
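
The dynamic scaling described above can be pictured as follows: start with the top-`k` candidates, apply the additional restrictions, and enlarge the candidate window until enough results qualify or `maxk` is reached.  The sketch below is a simplified illustration; the doubling strategy, the function name `topk_with_filter` and the sample data are our own assumptions and may differ from what Kypher-V does internally:

```python
def topk_with_filter(ranked, accept, k, maxk):
    """Grow the candidate window until k accepted results are found or maxk is reached.

    'ranked' stands in for the similarity-ordered output of an index search.
    """
    window = k
    while True:
        hits = [item for item in ranked[:window] if accept(item)]
        if len(hits) >= k or window >= maxk or window >= len(ranked):
            return hits[:k]
        window = min(window * 2, maxk)   # doubling is an assumption of this sketch

# Made-up candidates 1..100 in similarity order; only multiples of 7 "qualify".
ranked = list(range(1, 101))
is_match = lambda item: item % 7 == 0

print(topk_with_filter(ranked, is_match, k=3, maxk=1024))   # [7, 14, 21]
print(topk_with_filter(ranked, is_match, k=3, maxk=3))      # [] (no scaling)
```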


For comparison, here is a run without dynamic scaling, which returns far fewer results since
only a small number of the top-5 similar results for each input also satisfy the post conditions:

In [14]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 5, nprobe: 4}]->(y),
               claims:   (y)-[:P106]->(:Q4964182),
                         (y)-[:P31]->(:Q5),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q859", "Q868", "Q913"] and x != y'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Plato'@en,'Aenesidemus'@en,0.965394
1,'Plato'@en,'Hicetas'@en,0.96499
2,'Plato'@en,'Empedocles'@en,0.962913
3,'Plato'@en,'Eubulides'@en,0.962904
4,'Aristotle'@en,'Bryson of Achaea'@en,0.974303
5,'Aristotle'@en,'Michael Papageorgiou'@en,0.970041
6,'Aristotle'@en,'Hicetas'@en,0.967692
7,'Aristotle'@en,'Anaxarchus'@en,0.967682
8,'Socrates'@en,'Eudorus of Alexandria'@en,0.914027
9,'Socrates'@en,'John Wilkins'@en,0.913926


<A NAME="applications"></A>
## Example applications

### Image search

In the examples below, we use image similarity to link QNodes in Wikidata.  We
use the precomputed `IMAGE` graph cache (see above) which contains embeddings
for about 2.7M images linked to their respective Wikipedia pages and Wikidata
QNodes.  

We start with a QNode (such as the one for Barack Obama below), find one or more
images associated with that QNode, look up their image embeddings and then find
other similar images and their associated QNodes.

We do not compute any image embeddings on the fly here; we simply link nodes based
on the similarity of the images they are associated with.  Note that this will often not
preserve the type of the source node, as can be seen in the result for Barack Obama.
To enforce type or other restrictions, additional clauses can be added.
Since there are multiple images associated with Barack Obama, we use a `not exists`
clause to only look at the first one to make the results less cluttered:

Barack Obama:

In [15]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels \
      --match 'image:  (ximg)-[rx {qnode: $SEED}]->(xiv), \
                       (xiv)-[r:kvec_topk_cos_sim {k: 10, nprobe: 8}]->(yimg), \
                       (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-[]->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q76 \
    / html

show_html(img_width=200)

qnode,label,sim,image
Q76,'Barack Obama'@en,1.0,
Q567497,'France–Germany relations'@en,0.77576,
Q27804564,'Wahidullah Waissi'@en,0.75814,
Q7747,'Vladimir Putin'@en,0.75264,
Q188888,'Teachers\' Day'@en,0.75262,
Q702725,'Shirani Bandaranayake'@en,0.75063,
Q18274595,'list of international presidential trips made by Serzh Sargsyan'@en,0.74954,
Q1151352,'John Piper'@en,0.74702,
Q170645,'2018 FIFA World Cup'@en,0.74702,
Q381157,'Orrin Hatch'@en,0.74424,


To get more type appropriate matches, we can add a restriction to only return matches of
type human (`Q5`):

In [16]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels -i claims \
      --match 'image:  (ximg)-[rx {qnode: $SEED}]->(xiv), \
                       (xiv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(yimg), \
                       (yimg)-[ry {qnode: y}]->(), \
               claims: (y)-[:P31]->(:Q5), \
               labels: (y)-->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q76 \
    / html

show_html(img_width=200)

qnode,label,sim,image
Q76,'Barack Obama'@en,1.0,
Q27804564,'Wahidullah Waissi'@en,0.75814,
Q7747,'Vladimir Putin'@en,0.75264,
Q702725,'Shirani Bandaranayake'@en,0.75063,
Q1151352,'John Piper'@en,0.74702,
Q381157,'Orrin Hatch'@en,0.74424,
Q2339668,'Twan Huys'@en,0.749,
Q128949,'Miri Regev'@en,0.73791,
Q160157,'Joe Lieberman'@en,0.7345,
Q355130,'Richard Petty'@en,0.75015,


Charles Dadant: again, note that some of the results are not of type human but are
just linked to a similar image:

In [17]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels \
      --match 'image: (ximg)-[rx {qnode: $SEED}]->(xiv), \
                      (xiv)-[r:kvec_topk_cos_sim {k: 10, nprobe: 8}]->(yimg), \
                      (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-[]->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q582964 \
      --limit 20 \
    / html

show_html(img_width=100)

qnode,label,sim,image
Q582964,'Charles Dadant'@en,1.0,
Q5956831,'Hymns Ancient and Modern'@en,0.84983,
Q3759575,'list of American Civil War generals (Confederate)'@en,0.84305,
Q6084534,'Ismael Cerna'@en,0.832,
Q26003,'Sergey Botkin'@en,0.82388,
Q5494660,'Fred Bonsor'@en,0.81946,
Q3303297,'ironmaster'@en,0.81704,
Q4631421,'22nd Regiment Alabama Infantry'@en,0.80955,
Q4641399,'5th North Carolina Regiment'@en,0.80858,


Beaumaris Castle in Wales:

In [18]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels \
      --match 'image: (ximg)-[rx {qnode: $SEED}]->(xiv), \
                      (xiv)-[r:kvec_topk_cos_sim {k: 20, nprobe: 8}]->(yimg), \
                      (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-[]->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q756815  \
    / html

show_html()

qnode,label,sim,image
Q756815,'Beaumaris Castle'@en,1.0,
Q267153,'list of monasteries dissolved by Henry VIII of England'@en,0.79353,
Q6566349,'list of Category A listed buildings in Dumfries and Galloway'@en,0.79212,
Q40889043,'Scheduled monuments in Renfrewshire'@en,0.7897,
Q912664,'Clan MacDougall'@en,0.78582,
Q922422,'Warkworth Castle'@en,0.78453,
Q6566359,'list of Category A listed buildings in Fife'@en,0.78269,
Q16148507,'list of Historic Scotland properties'@en,0.78237,
Q11808,'castles in Great Britain and Ireland'@en,0.78151,
Q16148507,'list of Historic Scotland properties'@en,0.78122,


<A NAME="image-similarity-join"></A>

Castles similar to Beaumaris Castle but that are located in Austria (with
country (`P17`) equal to `Q40`).  We use a full vector join to get relevant
results further down the similarity list.  Note that even with `maxk=1024` we only
get a few results, and that the similarities are significantly lower than in the
previous example:

In [19]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels -i claims \
      --match 'image: (ximg)-[rx {qnode: $SEED}]->(xiv), \
                      (xiv)-[r:kvec_topk_cos_sim {k: 20, nprobe: 4, maxk: 1024}]->(yimg), \
                      (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-[]->(ylabel), \
               claims: (y)-[:P17]->(c:Q40)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q756815  \
      --limit 20 \
    / html

show_html()

qnode,label,sim,image
Q1012592,'Burgruine Kaja'@en,0.72402,
Q15954565,'Austrian walled towns'@en,0.74951,
Q1015533,'Burgruine Steuerberg'@en,0.70776,
Q1015457,'Prandegg Castle'@en,0.70276,
Q188358,'Burgruine Dürnstein'@en,0.70275,


<A NAME="text-embedding-queries"></A>
## Text embedding queries:

In the following example we dynamically compute an embedding vector
for a text query and then use the similarity machinery to query for
matching QNodes.  The basic story here is the following:

- formulate a simple textual query such as 'Ancient Greek philosopher'
- create a KGTK input file for one or more such queries and run it through the `text-embedding` command
- query Wikidata by finding top-k matches based on short abstract text embedding vectors
- then filter with additional restrictions to get more relevant results.

In [20]:
!echo '\
q1	Ancient Greek philosopher\n\
q2	castle in Austria\n\
q3	award-winning actor and comedian' | \
sed -e 's/^ *//' | \
kgtk cat --no-input-header --input-column-names node1 node2 --implied-label sentence \
   / add-id \
   / text-embedding -i - --model roberta-base-nli-mean-tokens \
          --output-data-format kgtk --output-property emb -o - \
   / query -i - --idx vector:node2 --as text_emb_queries --match '(x)' --return x

Running with logging level 30
2023-01-13 13:53:13.932934: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-01-13 13:53:13.932961: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
  return torch._C._cuda_getDeviceCount() > 0
Batches: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 16.01it/s]
node1
q1
q2
q3
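
If the queries are generated programmatically, the `echo ... | sed` step above could be replaced by a few lines of Python that produce the same headerless two-column input.  The helper `queries_to_tsv` below is a hypothetical sketch; its output would still need to be piped through the same `kgtk cat --no-input-header ...` pipeline shown above:

```python
def queries_to_tsv(queries):
    """Render one 'node1<TAB>node2' line per query, without a header line."""
    lines = []
    for name, text in queries.items():
        if '\t' in text or '\n' in text:
            raise ValueError(f'query {name!r} contains a tab or newline')
        lines.append(f'{name}\t{text}')
    return '\n'.join(lines) + '\n'

queries = {
    'q1': 'Ancient Greek philosopher',
    'q2': 'castle in Austria',
    'q3': 'award-winning actor and comedian',
}

print(queries_to_tsv(queries), end='')
```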


The above created 768-D text embedding vectors for the three short queries
using the same text embedding type as used in our `ABSTRACT` embeddings.
Now we find Wikidata QNodes whose short-abstract embedding vector is most similar
to the queries, and that satisfy any additional conditions we might have.
Note that the queries in this example are much shorter than the first sentences
of our Wikipedia abstracts, thus the similarity matching is not very good, but
we can compensate for some of that by adding additional restrictions:

Matches for "Ancient Greek philosopher" that have occupation (`P106`) philosopher:

In [21]:
out = !kgtk query --ac $MAIN --ac $ABSTRACT \
      -i text_emb_queries -i abstract -i labels -i claims -i sentence \
      --match  'queries:  (x:q1)-[]->(xv), \
                abstract: (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 4}]->(y), \
                claims:   (y)-[:P106]->(:Q4964182), \
                labels:   (y)-->(yl), \
                sentence: (y)-->(ys)' \
      --return 'y as y, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

y,ylabel,sim,ysent
Q325955,'Speusippus'@en,0.9440442323684692,Speusippus (/spjuːˈsɪpəs/; Greek: Σπεύσιππος; c. 408 – 339/8 BC) was an ancient Greek philosopher.
Q1200209,'Dercil·lides'@en,0.935701549053192,Dercyllides was an ancient Greek Platonist philosopher.
Q2927235,'Bryson of Achaea'@en,0.9300292134284972,"Bryson of Achaea (or Bryson the Achaean; Greek: Βρύσων ὁ Ἀχαιός Vryson o Acheos, gen.: Βρύσωνος Vrysonos; fl. 330 BC) was an ancient Greek philosopher."
Q9250176,'Echecratides'@en,0.9262670874595642,Echecratides (Ancient Greek: Ἐχεκρατίδης) was an Ancient Greek Peripatetic philosopher who is mentioned among the disciples of Aristotle.
Q668009,'Aristotelis the Dialectician'@en,0.9235112071037292,"Aristotle the Dialectician (or Aristoteles of Argos, Greek: Ἀριστοτέλης; fl. 3rd century BC), was an ancient Greek dialectic philosopher from Argos."
Q366031,'Anaxarchus'@en,0.921642243862152,Anaxarchus (/ˌænəɡˈzɑːrkəs/; Ancient Greek: Ἀνάξαρχος; c. 380 – c. 320 BC) was a Greek philosopher of the school of Democritus.
Q297420,'Panaetius'@en,0.9199343919754028,"Panaetius (/pəˈniːʃiəs/; Greek: Παναίτιος, translit. Panetios; c. 185 – c. 110/109 BC) of Rhodes was an ancient Greek Stoic philosopher."
Q962486,'Echecrates of Flius'@en,0.9173671007156372,Echecrates (Greek: Ἐχεκράτης) was a Pythagorean philosopher from the ancient Greek town of Phlius.
Q365977,'Bias of Priene'@en,0.9115197658538818,Bias (/ˈbaɪəs/; Greek: Βίας ὁ Πριηνεύς; fl. 6th century BC) of Priene was a Greek sage.
Q13634113,'Michael Papageorgiou'@en,0.9098575115203856,Michail Papageorgiou (Greek: Μιχαήλ Παπαγεωργίου; 1727–1796) was a Greek philosopher.


Matches for "castle in Austria" that have country (`P17`) Austria:

In [22]:
out = !kgtk query --ac $MAIN --ac $ABSTRACT \
      -i text_emb_queries -i abstract -i labels -i claims -i sentence \
      --match  'queries:  (x:q2)-[]->(xv), \
                abstract: (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
                claims:   (y)-[:P17]->(:Q40), \
                labels:   (y)-->(yl), \
                sentence: (y)-->(ys)' \
      --return 'y as y, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

y,ylabel,sim,ysent
Q673952,'Haidershofen'@en,0.9632641077041626,Haidershofen is a town located in Austria.
Q256996,'Grieskirchen'@en,0.9585073590278624,Grieskirchen is a town in Austria.
Q2240044,'Annabichl Castle'@en,0.9552702307701112,Annabichl Castle is a castle in Austria.
Q7378773,'Ruine Hauenstein'@en,0.94871723651886,"Ruine Hauenstein is a castle in Styria, Austria."
Q37809497,'Ruine Neudeck'@en,0.9469427466392516,"Ruine Neudeck is a castle in Styria, Austria."
Q7378781,'Ruine Raabeck'@en,0.946899950504303,"Ruine Raabeck is a castle in Styria, Austria."
Q4998499,'Burg Kaisersberg'@en,0.9449542760849,"Burg Kaisersberg is a castle in Styria, Austria."
Q674097,'Mannersdorf am Leithagebirge'@en,0.9442192316055298,Mannersdorf am Leithagebirge is a town in Austria.
Q7378769,'Ruine Kalsberg'@en,0.943941056728363,"Ruine Kalsberg is a castle in Styria, Austria."
Q1012734,'Burg Krems'@en,0.942907452583313,"Burg Krems is a castle in Styria, Austria."


Matches for "award-winning actor and comedian" that are of type human
and have country of citizenship (`P27`) UK:

In [23]:
out = !kgtk query --ac $MAIN --ac $ABSTRACT \
      -i text_emb_queries -i abstract -i labels -i claims -i sentence \
      --match  'queries:  (x:q3)-[]->(xv), \
                abstract: (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
                claims:   (y)-[:P31]->(:Q5), \
                          (y)-[:P27]->(:Q145), \
                labels:   (y)-->(yl), \
                sentence: (y)-->(ys)' \
      --return 'y as y, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

y,ylabel,sim,ysent
Q27924985,'Toby Williams'@en,0.9054120779037476,"Toby Williams is a British actor, writer and award-winning stand-up comedian performing both as himself and Dr George Ryegold."
Q7087463,'Oliver Cotton'@en,0.8896428942680359,"Oliver Charles Cotton (born 20 June 1944) is an English actor, comedian and playwright, known for his prolific work on stage, TV and film."
Q7704327,'Terry Duggan'@en,0.8872928619384766,"Terence A. Duggan (15 April 1932 – 1 May 2008) was a British comedian and actor who had a successful career in cabaret and variety, and played numerous character roles on television."
Q23772268,'Guz Khan'@en,0.8805128335952759,"Ghulam Dustgir \""Guz\"" Khan (born 1986) is a British comedian, impressionist, and actor best known for his work in the TV show Man Like Mobeen and stand up appearances in Live at the Apollo."
Q6988861,'Neil Linpow'@en,0.8776082992553711,"Neil Linpow is a multi-award-winning English actor, writer and filmmaker."
Q7320263,'Rhashan Stone'@en,0.8773206472396851,Rhashan Stone is an American actor and comedian based in the UK. He is best known for appearing in many comedy shows such as Desmond\'s and Mutual Friends.
Q7608608,'Stephen Ashfield'@en,0.8739212155342102,Stephen Ashfield is an Olivier Award-winning Scottish actor.
Q5290454,'Dominic Anciano'@en,0.8687206506729126,"Dominic Anciano (born 1959) is an English producer, actor, director, writer and comedian best known for his role as Sgt."
Q7626524,'Stuart Fell'@en,0.8686633110046387,Stuart Fell is a professional actor and stuntman.
Q5534773,'Geoffrey McGivern'@en,0.8672927021980286,"Geoffrey M. McGivern is a British actor in film, radio, stage and television, as well as a comedian."


<A NAME="comparing-embeddings"></A>
## Comparing different types of embeddings

Below we run a number of similarity queries for each of our various types of
embeddings to see how they behave relative to each other.  Note how they
behave quite differently, reasonable for some use cases but not so much for others:

### Philosophers:

In [24]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q859", "Q868", "Q913"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Plato'@en,'Plato'@en,1.0
1,'Plato'@en,'Socrates'@en,0.778851
2,'Plato'@en,'Epicurus'@en,0.7682
3,'Plato'@en,'Aratus'@en,0.744131
4,'Plato'@en,'Hippocrates'@en,0.742684
5,'Plato'@en,'Theophrastus'@en,0.732886
6,'Plato'@en,'Aeschines'@en,0.727185
7,'Plato'@en,'Antiphon of Rhamnus'@en,0.725084
8,'Plato'@en,'Gorgias'@en,0.724764
9,'Plato'@en,'Antisthenes'@en,0.723077


In [25]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q859", "Q868", "Q913"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Plato'@en,'Plato'@en,1.0
1,'Plato'@en,'Plotinus'@en,0.752719
2,'Plato'@en,'Cornelius Nepos'@en,0.72332
3,'Plato'@en,'Bret Harte'@en,0.706325
4,'Plato'@en,'Federico Caffè'@en,0.702316
5,'Plato'@en,'Marcel Duchamp'@en,0.677284
6,'Plato'@en,'Quintus Julius Balbus'@en,0.662613
7,'Plato'@en,'Laurentius Abstemius'@en,0.662188
8,'Plato'@en,'Celso Lucio'@en,0.654929
9,'Plato'@en,'Peter von Cornelius'@en,0.684013


In [26]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-[]->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-[]->(xl), (y)-[]->(yl), \
               sent:     (y)-[]->(ys)' \
      --where 'x in ["Q859", "Q868", "Q913"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'Plato'@en,'Plato'@en,1.0,Plato (/ˈpleɪtoʊ/ PLAY-toe; Greek: Πλάτων Plátōn; 428/427 or 424/423 – 348/347 BC) was a Greek philosopher born in Athens during the Classical period in Ancient Greece.
'Plato'@en,'Aenesidemus'@en,0.965393602848053,"Aenesidemus (Ancient Greek: Αἰνησίδημος or Αἰνεσίδημος) was a Greek Pyrrhonist philosopher, born in Knossos on the island of Crete."
'Plato'@en,'Hicetas'@en,0.9649903178215028,Hicetas (Ancient Greek: Ἱκέτας or Ἱκέτης; c. 400 – c. 335 BC) was a Greek philosopher of the Pythagorean School.
'Plato'@en,'Empedocles'@en,0.9629127979278564,"Empedocles (/ɛmˈpɛdəkliːz/; Greek: Ἐμπεδοκλῆς; c. 494 – c. 434 BC, fl. 444–443 BC) was a Greek pre-Socratic philosopher and a native citizen of Akragas, a Greek city in Sicily."
'Plato'@en,'Eubulides'@en,0.9629042744636536,"Eubulides of Miletus (Ancient Greek: Εὐβουλίδης; fl. 4th century BCE) was a Greek philosopher of the Megarian school, a pupil of Euclid of Megara and a contemporary of Aristotle."
'Plato'@en,'Aristotle'@en,0.9615942239761353,"Aristotle (/ˈærɪstɒtəl/; Greek: Ἀριστοτέλης Aristotélēs, pronounced [aristotélɛːs]; 384–322 BC) was a Greek philosopher and polymath during the Classical period in Ancient Greece."
'Plato'@en,'Metrodorus of Lampsacus'@en,0.9613872766494752,"Metrodorus of Lampsacus (Greek: Μητρόδωρος Λαμψακηνός, Mētrodōros Lampsakēnos; 331/0–278/7 BC) was a Greek philosopher of the Epicurean school."
'Plato'@en,'Xenophon'@en,0.960830569267273,"Xenophon of Athens (/ˈzɛnəfən, zi-, -fɒn/; Ancient Greek: Ξενοφῶν [ksenopʰɔ̂ːn]; c. 430 – probably 355 or 354 BC) was a Greek military leader, philosopher, and historian, born in Athens."
'Plato'@en,'Anaxarchus'@en,0.9582780599594116,Anaxarchus (/ˌænəɡˈzɑːrkəs/; Ancient Greek: Ἀνάξαρχος; c. 380 – c. 320 BC) was a Greek philosopher of the school of Democritus.
'Plato'@en,'Clearchus of Soli'@en,0.957090437412262,"Clearchus of Soli (Greek: Kλέαρχoς ὁ Σολεύς, Klearkhos ho Soleus) was a Greek philosopher of the 4th–3rd century BCE, belonging to Aristotle\'s Peripatetic school."


### Countries:

In [27]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q40", "Q41", "Q30"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'United States of America'@en,'United States of America'@en,1.0
1,'United States of America'@en,'United Kingdom'@en,0.819738
2,'United States of America'@en,'France'@en,0.810034
3,'United States of America'@en,'Canada'@en,0.79315
4,'United States of America'@en,'Spain'@en,0.791431
5,'United States of America'@en,'Australia'@en,0.780531
6,'United States of America'@en,'Thailand'@en,0.742816
7,'United States of America'@en,'South Korea'@en,0.734353
8,'United States of America'@en,'India'@en,0.730247
9,'United States of America'@en,'Mexico'@en,0.717486


In [28]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q40", "Q41", "Q30"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'United States of America'@en,'United States of America'@en,1.0
1,'United States of America'@en,'State of Scott'@en,0.790001
2,'United States of America'@en,'.سورية'@en,0.781829
3,'United States of America'@en,'Republic of South Carolina'@en,0.77654
4,'United States of America'@en,'State of Kanawha'@en,0.770454
5,'United States of America'@en,'Wedge: The Secret War between the FBI and CIA...,0.762854
6,'United States of America'@en,'Marin County'@en,0.741442
7,'United States of America'@en,'Light Stations of the United States MPS'@en,0.731502
8,'United States of America'@en,'Republic of Florida'@en,0.730721
9,'United States of America'@en,'Women's Professional Racquetball Organization...,0.737493


In [29]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-[]->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-[]->(xl), (y)-[]->(yl), \
               sent:     (y)-[]->(ys)' \
      --where 'x in ["Q40", "Q41", "Q30"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'United States of America'@en,'United States of America'@en,1.0000001192092896,"The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America."
'United States of America'@en,'North African American'@en,0.9427798390388488,North African Americans are Americans with origins in the region of North Africa.
'United States of America'@en,'Central America'@en,0.9361214637756348,Central America (Spanish: América Central [aˈmeɾika senˈtɾal] or Centroamérica [sentɾoaˈmeɾika]) is a subcontinent of North America.
'United States of America'@en,'Northern United States'@en,0.9349914789199828,"The Northern United States, commonly referred to as the American North, the Northern States, or simply the North, is a geographical or historical region of the United States."
'United States of America'@en,'Episcopal Diocese of Atlanta'@en,0.930703103542328,"The Episcopal Diocese of Atlanta is the diocese of the Episcopal Church in the United States of America, with jurisdiction over middle and north Georgia."
'United States of America'@en,'Tidewater region of Virginia'@en,0.9302077293395996,Tidewater refers to the north Atlantic coastal plain region of the United States of America.
'United States of America'@en,'Great Northern Railway'@en,0.9266144037246704,The Great Northern Railway (reporting mark GN) was an American Class I railroad.
'United States of America'@en,'American people of North American descent'@en,0.9230941534042358,American people of North American descent refers to inhabitants of the United States with lineage tracing to other North American countries.
'United States of America'@en,'Episcopal Diocese of Northern Michigan'@en,0.922684907913208,The Episcopal Diocese of Northern Michigan is the diocese of the Episcopal Church in the United States of America (TEC) with canonical jurisdiction in the Upper Peninsula of Michigan.
'United States of America'@en,'list of metropolitan areas in Northern America'@en,0.9206945300102234,"This is a list of metropolitan areas in Northern America, typically defined to include Canada and the United States as well as Bermuda (UK), Greenland (Denmark), and St. Pierre and Miquelon (France)."
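
Some self-similarity scores in these tables are not exactly 1.0, for example `1.0000001192092896` above and `0.9999999403953552` in the journalist table. A minimal sketch, assuming (this is not stated in the source) that similarities are computed in single precision: the two values coincide with the float32 numbers immediately above and below 1.0, so they are most likely rounding noise rather than meaningful differences. The helper `to_float32` is a hypothetical name used only for this illustration.

```python
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

above_one = 1.0 + 2.0 ** -23   # smallest float32 value greater than 1.0
below_one = 1.0 - 2.0 ** -24   # largest float32 value less than 1.0

# Both values survive a round trip through float32 unchanged,
# so they are exactly representable in single precision.
is_f32_above = to_float32(above_one) == above_one
is_f32_below = to_float32(below_one) == below_one
```

When comparing scores across queries, it may therefore be safer to treat values within about `1e-6` of each other as equal.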


### Types of animals:

In [30]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q144", "Q146", "Q726"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'dog'@en,'dog'@en,1.0
1,'dog'@en,'hat'@en,0.72131
2,'dog'@en,'house cat'@en,0.706597
3,'dog'@en,'body armor'@en,0.692382
4,'dog'@en,'woman'@en,0.68766
5,'dog'@en,'peafowl'@en,0.686448
6,'dog'@en,'bouquet'@en,0.678995
7,'dog'@en,'hunting dog'@en,0.667602
8,'dog'@en,'logo'@en,0.647672
9,'dog'@en,'sceptre'@en,0.644042


In [31]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q144", "Q146", "Q726"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'dog'@en,'dog'@en,1.0
1,'dog'@en,'Salty and Roselle'@en,0.840819
2,'dog'@en,'Fred Basset'@en,0.831087
3,'dog'@en,'Theo'@en,0.821203
4,'dog'@en,'Heaven Sent Brandy'@en,0.820738
5,'dog'@en,'Old Hemp'@en,0.818933
6,'dog'@en,'Rubia'@en,0.81686
7,'dog'@en,'Alcmène'@en,0.81526
8,'dog'@en,'Alex the Dog'@en,0.813087
9,'dog'@en,'Edda'@en,0.810013


In [32]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-[]->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-[]->(xl), (y)-[]->(yl), \
               sent:     (y)-[]->(ys)' \
      --where 'x in ["Q144", "Q146", "Q726"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'dog'@en,'dog'@en,1.0000001192092896,The dog or domestic dog (Canis familiaris or Canis lupus familiaris) is a domesticated descendant of the wolf.
'dog'@en,'Canis simensis'@en,0.9140985012054444,"The Ethiopian wolf (Canis simensis), also called the Simien jackal and Simien fox, is a canine native to the Ethiopian Highlands."
'dog'@en,'Bucovina Shepherd Dog'@en,0.8995509743690491,The Romanian Bucovina Shepherd (Romanian: Ciobănesc Românesc de Bucovina) is a breed of livestock guardian dogs native to historical Bukovina (Bucovina) region.
'dog'@en,'Karst Shepherd'@en,0.8969720602035522,"The Karst Shepherd (Slovene: kraški ovčar or kraševec ) is a breed of dog of the livestock guardian type, originating in Slovenia."
'dog'@en,'Majorca Shepherd Dog'@en,0.8946498036384583,"The Majorca Shepherd Dog (Catalan: Ca de bestiar, Spanish: Perro de pastor mallorquín) is a domesticated breed of dog, used in the Balearic Islands of Spain, both for guarding sheep and as a general purpose farm dog."
'dog'@en,'Tornjak'@en,0.8933776021003723,"The Tornjak (pronounced [torɲâk]), is a breed of livestock guardian dog native to Bosnia and Herzegovina and Croatia."
'dog'@en,'Schapendoes'@en,0.8902978301048279,"The Schapendoes (Dutch pronunciation: [ˈsxaːpəndus]) or Dutch Sheepdog, is a breed of dog originating in the Netherlands."
'dog'@en,'Native American dogs'@en,0.8873973488807678,"Native American dogs, or Pre-Columbian dogs, were dogs living with people indigenous to the Americas."
'dog'@en,'Mozart family'@en,0.8867671489715576,"The Mozart family were the ancestors, relatives, and descendants of Wolfgang Amadeus Mozart."
'dog'@en,'Hare Indian Dog'@en,0.88480544090271,"The Hare Indian dog is an extinct domesticated canine; possibly a breed of domestic dog, coydog, or domesticated coyote; formerly found and originally bred in northern Canada by the Hare Indians for coursing."


### Handball:

In [33]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q8418"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'handball'@en,'handball'@en,1.0
1,'handball'@en,'beach handball'@en,0.755496
2,'handball'@en,'field hockey'@en,0.747243
3,'handball'@en,'korfball'@en,0.735095
4,'handball'@en,'indoor handball'@en,0.729936
5,'handball'@en,'biathlon'@en,0.705106
6,'handball'@en,'volleyball'@en,0.704901
7,'handball'@en,'softball'@en,0.68622
8,'handball'@en,'field handball'@en,0.683182
9,'handball'@en,'orienteering'@en,0.666485


In [34]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q8418"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'handball'@en,'handball'@en,1.0
1,'handball'@en,'Wikipedia:WikiProject Handball'@en,0.830904
2,'handball'@en,'women's beach handball'@en,0.714068
3,'handball'@en,'ski jumping'@en,0.70067
4,'handball'@en,'futsal'@en,0.689599
5,'handball'@en,'indoor handball'@en,0.687324
6,'handball'@en,'biathlon'@en,0.683123
7,'handball'@en,'women's association football'@en,0.6782
8,'handball'@en,'Qatch'@en,0.65917
9,'handball'@en,'hockey'@en,0.649488


In [35]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-[]->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-[]->(xl), (y)-[]->(yl), \
               sent:     (y)-[]->(ys)' \
      --where 'x in ["Q8418"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'handball'@en,'handball'@en,1.0,"Handball (also known as team handball, European handball or Olympic handball) is a team sport in which two teams of seven players each (six outcourt players and a goalkeeper) pass a ball using their hands with the aim of throwing it into the goal of the other team."
'handball'@en,'beach handball'@en,0.8905559182167053,"Beach handball is a team sport where two teams pass and bounce or roll a ball, trying to throw it in the goal of the opposing team."
'handball'@en,'volleyball injury'@en,0.8481535315513611,"Volleyball is a game played between two opposing sides, with six players on each team, where the players use mainly their hands to hit the ball over a net and try to make the ball land on the opposing team\'s side of the court."
'handball'@en,'ball boy'@en,0.8265464305877686,"Ball boys and ball girls, also known as ball kids are individuals, usually human youths but sometimes dogs, who retrieve and supply balls for players or officials in sports such as association football, American football, bandy, cricket, tennis, baseball and basketball."
'handball'@en,'Balonpesado'@en,0.8249591588973999,"The balonpesado is a team sport, devised for both open field as closed, in which two sets of five players each try to score goals within circles drawn on the ground of each end of the field."
'handball'@en,'Screwball Scramble'@en,0.8165739178657532,Screwball Scramble is a toy made by Tomy that involves guiding a 14-millimeter-diameter chrome steel ball bearing around an obstacle course.
'handball'@en,'sepak takraw'@en,0.808526337146759,"Sepak takraw, or Sepaktakraw, also called kick volleyball, is a team sport played with a ball made of rattan or synthetic plastic between two teams of two to four players on a court resembling a badminton court."
'handball'@en,'muggle quidditch'@en,0.7998005151748657,"Quidditch, also known as quadball, is a sport of two teams of seven players each mounted on a broomstick, and is played on a hockey rink-sized pitch."
'handball'@en,'tag'@en,0.7951781749725342,"Tag (also called tig, it, tiggy, tips, tick, tip) is a playground game involving two or more players chasing other players in an attempt to \""tag\"" and mark them out of play, usually by touching with a hand."
'handball'@en,'dodgeball'@en,0.7933317422866821,"Dodgeball is a team sport in which players on two teams try to throw balls and hit opponents, while avoiding being hit themselves."
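
One simple way to quantify how differently the three embedding types behave is to measure the overlap between their neighbor lists. The sketch below uses the `ylabel` values from the three handball tables above (query node excluded) and computes the Jaccard overlap; the variable and function names are illustrative only.

```python
# Neighbor labels copied from the three handball result tables above.
complex_hits = {
    "beach handball", "field hockey", "korfball", "indoor handball",
    "biathlon", "volleyball", "softball", "field handball", "orienteering",
}
transe_hits = {
    "Wikipedia:WikiProject Handball", "women's beach handball", "ski jumping",
    "futsal", "indoor handball", "biathlon", "women's association football",
    "Qatch", "hockey",
}
abstract_hits = {
    "beach handball", "volleyball injury", "ball boy", "Balonpesado",
    "Screwball Scramble", "sepak takraw", "muggle quidditch", "tag", "dodgeball",
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

overlap_complex_transe = jaccard(complex_hits, transe_hits)
overlap_complex_abstract = jaccard(complex_hits, abstract_hits)
overlap_transe_abstract = jaccard(transe_hits, abstract_hits)
```

For this query the lists share at most two neighbors, which is consistent with the observation that each embedding type captures a different notion of similarity.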


### Journalist:

In [36]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q1930187"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'journalist'@en,'journalist'@en,1.0
1,'journalist'@en,'television presenter'@en,0.806589
2,'journalist'@en,'writer'@en,0.79479
3,'journalist'@en,'poet'@en,0.785165
4,'journalist'@en,'playwright'@en,0.77569
5,'journalist'@en,'politician'@en,0.756889
6,'journalist'@en,'short story writer'@en,0.756759
7,'journalist'@en,'actor'@en,0.751951
8,'journalist'@en,'film critic'@en,0.743542
9,'journalist'@en,'teacher'@en,0.743155


In [37]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q1930187"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'journalist'@en,'journalist'@en,1.0
1,'journalist'@en,'Category:Journalists'@en,0.719601
2,'journalist'@en,'journalistic scandal'@en,0.693972
3,'journalist'@en,'children's writer'@en,0.667045
4,'journalist'@en,'László Török'@en,0.658004
5,'journalist'@en,'columnist'@en,0.650447
6,'journalist'@en,'novelist'@en,0.644886
7,'journalist'@en,'foreign correspondent'@en,0.626349
8,'journalist'@en,'Category:Journalists of Ceará'@en,0.650367
9,'journalist'@en,'business journalist'@en,0.640736


In [38]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-[]->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-[]->(xl), (y)-[]->(yl), \
               sent:     (y)-[]->(ys)' \
      --where 'x in ["Q1930187"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'journalist'@en,'journalist'@en,0.9999999403953552,"A journalist is an individual that collects/gathers information in form of text, audio, or pictures, processes them into a news-worthy form, and disseminates it to the public."
'journalist'@en,'technology journalism'@en,0.8715080618858337,"Technology journalism is the activity, or product, of journalists engaged in the preparation of written, visual, audio or multi-media material intended for dissemination through public media, focusing on technology-related subjects."
'journalist'@en,'Information subsidy'@en,0.8664877414703369,An information subsidy is the provision of ready-to-use newsworthy information to the news media by various sources interested in gaining access to media time and space.
'journalist'@en,'media relations'@en,0.8634818196296692,"Media Relations involves working with media for the purpose of informing the public of an organization\'s mission, policies and practices in a positive, consistent and credible manner."
'journalist'@en,'journalism'@en,0.8631913065910339,"Journalism is the production and distribution of reports on the interaction of events, facts, ideas, and people that are the \""news of the day\"" and that informs society to at least some degree."
'journalist'@en,'Mediated deliberation'@en,0.851024329662323,Mediated deliberation is a form of deliberation that is achieved through the media which acts as a mediator between the mass public and elected officials.
'journalist'@en,'news conference'@en,0.8386815190315247,A press conference or news conference is a media event in which notable individuals or organizations invite journalists to hear them speak and ask questions.
'journalist'@en,'Media pilgrimage'@en,0.8380250930786133,A media pilgrimage refers to visits made to the sites mentioned in popular media.
'journalist'@en,'press kit'@en,0.836276650428772,"A press kit, often referred to as a media kit in business environments, is a pre-packaged set of promotional materials that provide information about a person, company, organization or cause and which is distributed to members of the media for promotional use."
'journalist'@en,'multimedia journalism'@en,0.8353787064552307,"Multimedia journalism is the practice of contemporary journalism that distributes news content either using two or more media formats via the Internet, or disseminating news report via multiple media platforms."


### Head of state:

In [39]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q48352"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'head of state'@en,'head of state'@en,1.0
1,'head of state'@en,'head of government'@en,0.831704
2,'head of state'@en,'leader of organisation'@en,0.715446
3,'head of state'@en,'governor'@en,0.669001
4,'head of state'@en,'consul general'@en,0.642665
5,'head of state'@en,'Floor leader'@en,0.605335
6,'head of state'@en,'defence minister'@en,0.691881
7,'head of state'@en,'French ambassador'@en,0.636968
8,'head of state'@en,'supreme court justice'@en,0.635844
9,'head of state'@en,'Executive Secretary of the Secretariat'@en,0.568668
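
Note that the rows above are not strictly ordered by `sim` (for example, 'defence minister' at 0.691881 follows 'Floor leader' at 0.605335); the source does not state why. If a strict ranking is needed, the rows can be re-sorted after the query. A minimal sketch using the values from the table above as plain Python tuples:

```python
# (ylabel, sim) pairs in the order returned by the query above.
rows = [
    ("head of state", 1.0),
    ("head of government", 0.831704),
    ("leader of organisation", 0.715446),
    ("governor", 0.669001),
    ("consul general", 0.642665),
    ("Floor leader", 0.605335),
    ("defence minister", 0.691881),
    ("French ambassador", 0.636968),
    ("supreme court justice", 0.635844),
    ("Executive Secretary of the Secretariat", 0.568668),
]

# Sort by similarity, highest first; sorted() returns a new list.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)
```

If the result is held in a pandas DataFrame, as returned by the notebook's `kgtk(...)` calls, `sort_values("sim", ascending=False)` achieves the same thing.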


In [40]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-[]->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in ["Q48352"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'head of state'@en,'head of state'@en,1.0
1,'head of state'@en,'head of government'@en,0.861243
2,'head of state'@en,'governor'@en,0.792502
3,'head of state'@en,'prime minister'@en,0.789789
4,'head of state'@en,'speaker'@en,0.781205
5,'head of state'@en,'Governor-general'@en,0.730634
6,'head of state'@en,'attorney general'@en,0.71397
7,'head of state'@en,'colonial governor'@en,0.70737
8,'head of state'@en,'foreign minister'@en,0.703298
9,'head of state'@en,'minister'@en,0.765075


In [41]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-[]->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-[]->(xl), (y)-[]->(yl), \
               sent:     (y)-[]->(ys)' \
      --where 'x in ["Q48352"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'head of state'@en,'head of state'@en,1.0,A head of state (or chief of state) is the public persona who officially embodies a state in its unity and legitimacy.
'head of state'@en,'state religion'@en,0.9016953706741332,A state religion (also called an established religion or official religion) is a religion or creed officially endorsed by a sovereign state.
'head of state'@en,'nation state'@en,0.8929967284202576,A nation state is a political unit where the state and nation are congruent.
'head of state'@en,"'Iman, Ittihad, Nazm'@en",0.8915225863456726,"Faith, Unity, Discipline (Urdu: ایمان، اتحاد، نظم) is the national motto of Pakistan."
'head of state'@en,'Freedom and Unity'@en,0.890255331993103,"\""Freedom and Unity\"" is the official motto of the U.S. state of Vermont."
'head of state'@en,'Ukrainian nationalism'@en,0.8886831402778625,"Ukrainian nationalism refers to the promotion of the unity of Ukrainians and the titular Ukraine nation state (and in a modern sense, also the \""people of Ukraine\"" in a constitutionally mandated \""territorial-civic\"" sense), as well as nation building as a means of strengthening and protecting state sovereignty within the international system of states."
'head of state'@en,'Department for Constitutional Affairs'@en,0.8855394124984741,The Department for Constitutional Affairs (DCA) was a United Kingdom government department.
'head of state'@en,'Most Excellent Majesty'@en,0.8850839734077454,Most Excellent Majesty is a form of address in the United Kingdom.
'head of state'@en,'Official culture'@en,0.8842212557792664,Official culture is the culture that receives social legitimation or institutional support in a given society.
'head of state'@en,'National Enterprise Board'@en,0.8837820887565613,The National Enterprise Board (NEB) was a United Kingdom government body.
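
The `$COMPLEX` and `$TRANSE` cells in this section differ only in the auxiliary cache, the graph handle and the list of query nodes. A small helper can generate the query string and reduce the repetition; this is a sketch only, and `similarity_query` is a hypothetical name, not part of KGTK. It merely builds the same text that is passed to `kgtk(...)` in the cells above.

```python
def similarity_query(cache_var, graph, qnodes, k=10, maxk=1024, nprobe=8):
    """Build the argument string for a top-k similarity query.

    cache_var: name of the env variable holding the auxiliary cache, e.g. "COMPLEX"
    graph:     input alias used as graph handle, e.g. "complex"
    qnodes:    list of Q-node IDs to use as query nodes
    """
    qnode_list = ", ".join(f'"{q}"' for q in qnodes)
    return f"""
      query --gc $MAIN --ac ${cache_var}
      -i {graph} -i labels
      --match '{graph}: (x)-[]->(xv),
               (xv)-[r:kvec_topk_cos_sim {{k: {k}, maxk: {maxk}, nprobe: {nprobe}}}]->(y),
               labels: (x)-[]->(xl), (y)-[]->(yl)'
      --where 'x in [{qnode_list}]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
    """

query = similarity_query("TRANSE", "transe", ["Q8418"])
# In the notebook, the string would then be run with the existing helper:
#   kgtk(query)
```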
