# Query knowledge graphs and embeddings with KGTK Kypher-V

Kypher-V supports import and queries over vector data. Kypher-V extends
Kypher to allow work with unstructured data such as text, images, and so
on, represented by embedding vectors. Kypher-V provides efficient storage,
indexing and querying of large-scale vector data on a laptop. It is fully
integrated into Kypher to enable expressive hybrid queries over
Wikidata-size structured and unstructured data. To the best of our
knowledge, this is the first system providing such functionality in a
query language for knowledge graphs.

Please see the [**Kypher-V Manual**](https://kgtk.readthedocs.io/en/latest/transform/query/#kypher-v)
for an introduction to the basic concepts and usage.

<A NAME="setup"></A>
### Setup

Some preliminaries to facilitate command invocation and result formatting:

In [1]:
import re
from IPython.display import display, HTML
from kgtk.functions import kgtk

def show_html(img_width=150):
    """Display command output in 'out' as HTML after munging image links for inline display."""
    output = '\n'.join(out)
    html = re.sub(r'<td>&quot;(https?://upload.wikimedia.org/[^<]+)&quot;</td>', 
                  f'<td style="width:{img_width}px;vertical-align:top"><img src="\\1"/></td>', 
                  output)
    display(HTML(html))
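
To make the link rewriting in `show_html` easier to follow, here is a minimal standalone sketch of the same substitution applied to a single table row. The helper name `inline_images` and the sample row are hypothetical and only serve as illustration; the regular expression is the one used above.

```python
import re

def inline_images(html, img_width=150):
    """Turn table cells holding a quoted Wikimedia URL into inline <img> cells."""
    return re.sub(
        r'<td>&quot;(https?://upload.wikimedia.org/[^<]+)&quot;</td>',
        f'<td style="width:{img_width}px;vertical-align:top"><img src="\\1"/></td>',
        html)

# Hypothetical sample row, shaped like a row produced by the `html` command.
row = ('<tr><td>Q76</td>'
       '<td>&quot;https://upload.wikimedia.org/example.jpg&quot;</td></tr>')
print(inline_images(row, img_width=200))
```

Cells that do not contain a quoted `upload.wikimedia.org` URL, such as the QNode cell in the sample row, are left untouched.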

The Kypher-V example queries in this notebook assume the existence of a number of similarity
graph caches in the `DB` directory, which are all defined here via shell variables:

In [4]:
DB="/kgtk-data/kypherv"
%env DB={DB}
%env MAIN={DB}/wikidata-20221102-dwd-v8-main.sqlite3.db
%env COMPLEX={DB}/wikidata-20221102-dwd-v8-complex-embeddings.sqlite3.db
%env TRANSE={DB}/wikidata-20221102-dwd-v8-transe-embeddings.sqlite3.db
%env ABSTRACT={DB}/wikidata-20221102-dwd-v8-abstract-embeddings-large.sqlite3.db
%env IMAGE={DB}/wikimedia-capcom-image-embeddings-v2.sqlite3.db

env: DB=/kgtk-data/kypherv
env: MAIN=/kgtk-data/kypherv/wikidata-20221102-dwd-v8-main.sqlite3.db
env: COMPLEX=/kgtk-data/kypherv/wikidata-20221102-dwd-v8-complex-embeddings.sqlite3.db
env: TRANSE=/kgtk-data/kypherv/wikidata-20221102-dwd-v8-transe-embeddings.sqlite3.db
env: ABSTRACT=/kgtk-data/kypherv/wikidata-20221102-dwd-v8-abstract-embeddings-large.sqlite3.db
env: IMAGE=/kgtk-data/kypherv/wikimedia-capcom-image-embeddings-v2.sqlite3.db


If you copied the graph caches and their associated `.faiss.idx` ANNS index files
to a different location, please adjust the paths and definitions accordingly.

Throughout the notebook we use three different invocation styles for
the `kgtk` command to better control the appearance of the generated output.
We either use it via the `!kgtk ...` syntax directly, use the `kgtk(...)`
function which produces an HTML rendering of a Pandas frame containing the
result, or we use the `show_html` function for some additional control on
how long texts and inline images are displayed.  All of these incantations
should be straightforward to translate into a shell environment if needed.

<A NAME="graph-caches"></A>
### Similarity graph caches

The examples in this notebook rely on several standard and similarity
graph caches based on `wikidata-20221102-dwd-v8`.  These graph caches are
available in the `DB` directory of the `ckg06` server from where they can be
copied or accessed directly in example queries.  It will generally not be
possible to run the notebook directly from that server, so if you want to
run and experiment with the notebook in a Jupyter environment, you have to
copy the graph caches to a different location where a notebook server can be run.
In this case, make sure to also copy the associated ANNS index files that end in
a `.faiss.idx` extension.

This notebook does not show how the individual similarity caches were
constructed.  To see how that can be done, please consult
the [**Kypher-V Manual**](https://kgtk.readthedocs.io/en/latest/transform/query/#kypher-v)
or look at the respective `*.db.build.txt` files in the `DB` directory.  For reference,
we show just one incantation here on how the `COMPLEX` graph cache was built.  Other
graph caches were built similarly with some modifications to adjust for differences in
the embedding data used (for `COMPLEX` this takes about 2.5-3 hours to run on a laptop):

```
$ export WD=.../datasets/wikidata-20221102-dwd-v8

$ cat $WD/wikidatadwd.complEx.graph-embeddings.txt | sed -e 's/ /\t/' \
      | kgtk --debug add-id --no-input-header=False --input-column-names node1 node2 \
                   --implied-label emb \
           / query --gc $DB/wikidata-20221102-dwd-v8-complex-embeddings.sqlite3.db \
                   -i - --as complex \
                   --idx vector:node2/nn/ram=25g/nlist=16k mode:valuegraph \
                   --single-user --limit 5
```

We use the following similarity graph caches which can be combined
with a main graph cache using one or more `--auxiliary-cache` or `--ac`
options to the `query` command.  The `COMPLEX` graph cache contains
59M 100-D ComplEx graph embeddings:

In [3]:
!kgtk query --gc $COMPLEX --sc

Graph Cache:
DB file: /kgtk-data/kypherv/wikidata-20221102-dwd-v8-complex-embeddings.sqlite3.db
  size:  28.92 GB   	free:  0 Bytes   	modified:  2022-12-15 20:40:26

KGTK File Information:
complex:
  size:  0 Bytes   	modified:  2022-12-15 17:55:31   	graph:  graph_1

Graph Table Information:
graph_1:
  size:  29.76 GB   	created:  2022-12-15 17:55:31
  header:  ['node1', 'label', 'node2', 'id']


The `TRANSE` graph cache contains 59M 100-D TransE graph embeddings:

In [4]:
!kgtk query --gc $TRANSE --sc

Graph Cache:
DB file: /kgtk-data/kypherv/wikidata-20221102-dwd-v8-transe-embeddings.sqlite3.db
  size:  28.92 GB   	free:  0 Bytes   	modified:  2022-12-17 11:39:02

KGTK File Information:
transe:
  size:  0 Bytes   	modified:  2022-12-16 14:09:02   	graph:  graph_1

Graph Table Information:
graph_1:
  size:  29.76 GB   	created:  2022-12-16 14:09:02
  header:  ['node1', 'node2', 'label', 'id']


The `ABSTRACT` graph cache contains the sentences and embedding vectors
generated from the first sentences of Wikipedia short abstracts.  It
contains about 6M 1024-D Roberta large vectors (**Note**: these are different
embeddings than the ones used and reported on in the 2022 Wikidata Workshop paper,
so the query results in this notebook are somewhat different):

In [5]:
!kgtk query --gc $ABSTRACT --sc

Graph Cache:
DB file: /kgtk-data/kypherv/wikidata-20221102-dwd-v8-abstract-embeddings-large.sqlite3.db
  size:  29.37 GB   	free:  0 Bytes   	modified:  2023-01-19 15:02:30

KGTK File Information:
abstract:
  size:  0 Bytes   	modified:  2023-01-19 13:24:19   	graph:  graph_1
sentence:
  size:  256.32 MB   	modified:  2023-01-04 13:53:44   	graph:  graph_2

Graph Table Information:
graph_1:
  size:  28.21 GB   	created:  2023-01-19 13:24:19
  header:  ['node1', 'label', 'node2', 'id']
graph_2:
  size:  1.23 GB   	created:  2023-01-19 15:01:41
  header:  ['node1', 'label', 'node2', 'id']


The `IMAGE` graph cache contains image embeddings published by the
<a href="https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/">
Wikipedia image/caption matching challenge</a>.  The embeddings are 2048-D vectors
taken from the second-to-last layer of a ResNet-50 neural network trained with
Imagenet data.  We only use the 2.7M images associated with English Wikipedia
pages.  The resulting vector graph cache is shown here:

In [6]:
!kgtk query --gc $IMAGE --sc

Graph Cache:
DB file: /kgtk-data/kypherv/wikimedia-capcom-image-embeddings-v2.sqlite3.db
  size:  24.39 GB   	free:  0 Bytes   	modified:  2023-01-11 14:10:32

KGTK File Information:
wiki_image:
  size:  0 Bytes   	modified:  2023-01-11 12:54:36   	graph:  graph_1

Graph Table Information:
graph_1:
  size:  24.42 GB   	created:  2023-01-11 12:54:36
  header:  ['node1', 'label', 'node2', 'id', 'page_url', 'qnode']


Finally, we also use a standard Wikidata graph cache for the claims and
labels of `wikidata-20221102-dwd-v8`.  It is called `MAIN` below.

<A NAME="vector-tables"></A>
### Vector tables are regular KGTK files

Any KGTK representation that associates a node or edge ID with a vector
will work.  An edge format we commonly use is a `node1` pointing to a vector
literal in `node2` via an `emb` edge (but any label will do).  For example,
here we show the first three embedding edges in `COMPLEX` (the `node2;_kgtk_vec_qcell`
column is an auxiliary column automatically computed by ANNS indexing):

In [7]:
kgtk("""query --gc $COMPLEX -i complex --limit 3""")

Unnamed: 0,node1,label,node2,id,node2;_kgtk_vec_qcell
0,Q102108199,emb,b'x13x99x13?x96xb7xf9xbdxb0x99x0fxbexf1xd4|>&x...,E465008,0
1,Q28980109,emb,b'xa1xdax8e=xdfx17x1e>xffxa4y=xf8+(xbeaxb5!xbd...,E686337,0
2,Q42012492,emb,b'txb8xe4xbexfcR;?x00xd6xd1>x87x1fxcdxbeTIx88x...,E1762936,0
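
The vector literals in `node2` are shown as raw bytes. As a rough sketch of how such a value could be turned back into numbers, the snippet below assumes that a vector is stored as a flat sequence of little-endian 32-bit floats. This layout is an assumption made for illustration and is not confirmed by the output above; the helper `decode_vector` and the sample vector are made up.

```python
import struct

def decode_vector(blob):
    """Unpack a bytes blob into a list of floats.

    Assumption: the blob is a flat sequence of little-endian float32 values.
    """
    n = len(blob) // 4  # four bytes per float32 component
    return list(struct.unpack(f"<{n}f", blob))

# Round trip with a made-up 4-D vector (not taken from the graph cache).
vec = [0.5, -0.25, 1.0, 2.0]
blob = struct.pack("<4f", *vec)
print(len(blob), decode_vector(blob))
```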


<A NAME="vector-computation"></A>
### Vector computation

The simplest operation in Kypher-V is a similarity computation between two vectors
which we perform here using the `ABSTRACT` graph cache:

In [6]:
kgtk(""" 
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels
      --match 'abstract: (x:Q868)-->(xv),
                         (y:Q913)-->(yv),
               labels:   (x)-->(xl), (y)-->(yl)'
      --return 'xl as xlabel, yl as ylabel, kvec_cos_sim(xv, yv) as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Aristotle'@en,'Socrates'@en,0.816283
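
For readers who want to see the arithmetic, here is a small sketch of cosine similarity in plain Python. It assumes that `kvec_cos_sim` computes the standard cosine similarity, which its name suggests; the function `cos_sim` and the sample vectors are illustrative only.

```python
import math

def cos_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-D vectors; real abstract embeddings have 1024 dimensions.
xv = [1.0, 0.0, 1.0]
yv = [1.0, 1.0, 0.0]
print(cos_sim(xv, yv))   # similarity between two different vectors
print(cos_sim(xv, xv))   # a vector compared with itself
```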


<A NAME="brute-force-search"></A>
### Brute-force similarity search

A more interesting operation is *similarity search* where we look
for the most similar matches for a given seed.  In the query below, we
use a simple but expensive brute-force search over about 10,000 input
vectors by computing similarities between `x` and each possible `y`,
then sorting and returning the top-10.  This is still pretty fast
given that the set of inputs is fairly small:

In [8]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x:Q913)-->(xv), (y)-->(yv),
               claims:   (y)-[:P106]->(:Q4964182),
               labels:   (x)-->(xl), (y)-->(yl)'
      --return 'xl as xlabel, yl as ylabel, kvec_cos_sim(xv, yv) as sim'
      --order  'sim desc'
      --limit 10
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Socrates'@en,'Socrates'@en,1.0
1,'Socrates'@en,'Adamantios Korais'@en,0.873166
2,'Socrates'@en,'Prodicus'@en,0.872791
3,'Socrates'@en,'Protagoras'@en,0.870216
4,'Socrates'@en,'Manuel Chrysoloras'@en,0.868033
5,'Socrates'@en,'Cebes'@en,0.867012
6,'Socrates'@en,'Pyrrho'@en,0.866274
7,'Socrates'@en,'Menedemus'@en,0.86322
8,'Socrates'@en,'Epicurus'@en,0.861731
9,'Socrates'@en,'Xenophon'@en,0.860759
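
Conceptually, the brute-force query scores every candidate against the seed, sorts by score and keeps the best `k`. The following sketch shows that idea in plain Python on made-up 2-D vectors; the names `brute_force_topk` and `candidates` are hypothetical and this is not how Kypher-V executes the query internally.

```python
import heapq
import math

def cos_sim(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def brute_force_topk(seed, candidates, k):
    """Score every candidate against the seed and keep the k most similar."""
    scored = ((cos_sim(seed, vec), name) for name, vec in candidates.items())
    return heapq.nlargest(k, scored)

# Made-up candidates; the cost grows linearly with the number of vectors.
candidates = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top = brute_force_topk([1.0, 0.0], candidates, k=2)
print([name for _, name in top])
```

Because every candidate is scored, the running time is proportional to the size of the candidate set, which is why this approach only works well for small inputs.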


There are about 9M Q5's (humans) in Wikidata, 1.8M of which have short abstract vectors:

In [9]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i claims
      --match 'abstract: (x)-->(),
               claims:   (x)-[:P31]->(:Q5)'
      --return 'count(distinct x)'
     """)

Unnamed: 0,"count(DISTINCT db1_graph_1_c1.""node1"")"
0,1801484


If we used the same brute-force search from above on this much larger set,
it would take about 2 min to run (which is why this command is disabled):

In [None]:
!time DISABLED kgtk query --gc $MAIN \
                 --ac $ABSTRACT \
      -i abstract -i labels -i claims \
      --match 'abstract: (x:Q913)-->(xv), (y)-->(yv), \
               claims:   (y)-[:P31]->(:Q5), \
               labels:   (x)-->(xl), (y)-->(yl)' \
      --return 'xl as xlabel, yl as ylabel, kvec_cos_sim(xv, yv) as sim' \
      --order  'sim desc' \
      --limit 10

```
xlabel	ylabel	sim
'Socrates'@en	'Socrates'@en	1.0000001192092896
'Socrates'@en	'Adamantios Korais'@en	0.8731658458709717
'Socrates'@en	'Prodicus'@en	0.872790515422821
'Socrates'@en	'Protagoras'@en	0.8702158331871033
'Socrates'@en	'Manuel Chrysoloras'@en	0.8680326342582703
'Socrates'@en	'Cebes'@en	0.8670117259025574
'Socrates'@en	'Pyrrho'@en	0.8662737011909485
'Socrates'@en	'Menedemus'@en	0.8632197380065918
'Socrates'@en	'Epicurus'@en	0.8617314696311951
'Socrates'@en	'Xenophon'@en	0.8607585430145264
52.997u 15.548s 1:50.53 62.0%	0+0k 19477248+136io 0pf+0w
```

<A NAME="indexed-search"></A>
### Indexed similarity search

For much faster search, we use the ANNS index that was constructed when the vector
data was imported.  The query now runs in less than a second, compared to about
2 minutes for the brute-force search above.  Results here are slightly different
from above, since the query does not restrict on occupation = philosopher (we will
address that later):

In [11]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x:Q913)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 5, nprobe: 4}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
      --limit 10
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Socrates'@en,'Socrates'@en,1.0
1,'Socrates'@en,'Adamantios Korais'@en,0.873166
2,'Socrates'@en,'Prodicus'@en,0.872791
3,'Socrates'@en,'Manuel Chrysoloras'@en,0.868033
4,'Socrates'@en,'Cebes'@en,0.867012


<A NAME="similarity-join"></A>
### Full similarity join

Below we query for three philosophers' top-k similar neighbors that are also humans and have
occupation (`P106`) philosopher.  Dynamic scaling ensures that `k` gets increased dynamically
up to `maxk` until we've found enough qualifying results for each:

In [12]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 5, maxk: 1024, nprobe: 4}]->(y),
               claims:   (y)-[:P106]->(:Q4964182),
                         (y)-[:P31]->(:Q5),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q859", "Q868", "Q913"] and x != y'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Plato'@en,'Aenesidemus'@en,0.936797
1,'Plato'@en,'Aristotle'@en,0.928277
2,'Plato'@en,'Menedemus'@en,0.926272
3,'Plato'@en,'Hicetas'@en,0.92338
4,'Plato'@en,'Philo of Larissa'@en,0.921708
5,'Aristotle'@en,'Philo of Larissa'@en,0.93101
6,'Aristotle'@en,'Speusippus'@en,0.930034
7,'Aristotle'@en,'Plato'@en,0.928277
8,'Aristotle'@en,'Hicetas'@en,0.927509
9,'Aristotle'@en,'Bryson of Achaea'@en,0.92793


For comparison, here is a run without dynamic scaling which returns fewer results, since
not all of the top-5 similar results for each input also satisfy the post conditions:

In [14]:
kgtk("""
      query --gc $MAIN --ac $ABSTRACT
      -i abstract -i labels -i claims
      --match 'abstract: (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 5, nprobe: 4}]->(y),
               claims:   (y)-[:P106]->(:Q4964182),
                         (y)-[:P31]->(:Q5),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q859", "Q868", "Q913"] and x != y'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Plato'@en,'Aenesidemus'@en,0.936797
1,'Plato'@en,'Aristotle'@en,0.928277
2,'Plato'@en,'Menedemus'@en,0.926272
3,'Plato'@en,'Hicetas'@en,0.92338
4,'Aristotle'@en,'Philo of Larissa'@en,0.93101
5,'Aristotle'@en,'Speusippus'@en,0.930034
6,'Aristotle'@en,'Plato'@en,0.928277
7,'Aristotle'@en,'Hicetas'@en,0.927509
8,'Socrates'@en,'Adamantios Korais'@en,0.873166
9,'Socrates'@en,'Prodicus'@en,0.872791
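
The difference between the two runs can be illustrated with a small sketch of dynamic scaling in plain Python. It assumes that the search window is doubled in each round until `k` qualifying results are found or `maxk` is reached; the doubling policy, the function name `topk_with_dynamic_scaling` and the sample data are assumptions for illustration, and the actual growth strategy may differ.

```python
def topk_with_dynamic_scaling(ranked, accept, k, maxk):
    """Return up to k accepted items, widening the search window up to maxk.

    ranked: candidates sorted by decreasing similarity (stand-in for an index lookup)
    accept: predicate playing the role of the additional match clauses
    Assumption: the window doubles each round; the real policy may differ.
    """
    window = k
    while True:
        hits = [item for item in ranked[:window] if accept(item)]
        if len(hits) >= k or window >= maxk or window >= len(ranked):
            return hits[:k]
        window = min(window * 2, maxk)

# Made-up ranking in which only every third candidate satisfies the filter.
ranked = [(f"n{i}", i % 3 == 0) for i in range(40)]
accept = lambda item: item[1]

fixed = topk_with_dynamic_scaling(ranked, accept, k=5, maxk=5)      # no scaling
scaled = topk_with_dynamic_scaling(ranked, accept, k=5, maxk=1024)  # with scaling
print([name for name, _ in fixed])
print([name for name, _ in scaled])
```

Without scaling, the filter is applied to the first `k` candidates only, so fewer than `k` results may survive. With scaling, the window keeps growing until enough candidates pass the filter.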


<A NAME="applications"></A>
## Example applications

### Image search

In the examples below, we use image similarity to link QNodes in Wikidata.  We
use the precomputed `IMAGE` graph cache (see above) which contains embeddings
for about 2.7M images linked to their respective Wikipedia pages and Wikidata
QNodes.  

We start with a QNode (such as the one for Barack Obama below), find one or more
images associated with that QNode, look up their image embeddings and then find
other similar images and their associated QNodes.

We do not compute any image embeddings on the fly here, we simply link nodes based
on similarity of images they are associated with.  Note that this will often not
preserve the type of the source node as can be seen in the result for Barack Obama.
To enforce such type or other restrictions, additional clauses can be added.
Since there are multiple images associated with Barack Obama, we use a `not exists`
clause to only look at the first one to make the results less cluttered:

Barack Obama:

In [15]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels \
      --match 'image:  (ximg)-[rx {qnode: $SEED}]->(xiv), \
                       (xiv)-[r:kvec_topk_cos_sim {k: 10, nprobe: 8}]->(yimg), \
                       (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q76 \
    / html

show_html(img_width=200)

qnode,label,sim,image
Q76,'Barack Obama'@en,1.0,
Q567497,'France–Germany relations'@en,0.77576,
Q27804564,'Wahidullah Waissi'@en,0.75814,
Q7747,'Vladimir Putin'@en,0.75264,
Q188888,'Teachers\' Day'@en,0.75262,
Q702725,'Shirani Bandaranayake'@en,0.75063,
Q18274595,'list of international presidential trips made by Serzh Sargsyan'@en,0.74954,
Q1151352,'John Piper'@en,0.74702,
Q170645,'2018 FIFA World Cup'@en,0.74702,
Q381157,'Orrin Hatch'@en,0.74424,


To get more type appropriate matches, we can add a restriction to only return matches of
type human (`Q5`):

In [16]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels -i claims \
      --match 'image:  (ximg)-[rx {qnode: $SEED}]->(xiv), \
                       (xiv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(yimg), \
                       (yimg)-[ry {qnode: y}]->(), \
               claims: (y)-[:P31]->(:Q5), \
               labels: (y)-->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q76 \
    / html

show_html(img_width=200)

qnode,label,sim,image
Q76,'Barack Obama'@en,1.0,
Q27804564,'Wahidullah Waissi'@en,0.75814,
Q7747,'Vladimir Putin'@en,0.75264,
Q702725,'Shirani Bandaranayake'@en,0.75063,
Q1151352,'John Piper'@en,0.74702,
Q381157,'Orrin Hatch'@en,0.74424,
Q2339668,'Twan Huys'@en,0.749,
Q128949,'Miri Regev'@en,0.73791,
Q160157,'Joe Lieberman'@en,0.7345,
Q355130,'Richard Petty'@en,0.75015,


Charles Dadant: again, note that some of the results are not of type human but are
just linked to a similar image:

In [17]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels \
      --match 'image: (ximg)-[rx {qnode: $SEED}]->(xiv), \
                      (xiv)-[r:kvec_topk_cos_sim {k: 10, nprobe: 8}]->(yimg), \
                      (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q582964 \
      --limit 20 \
    / html

show_html(img_width=100)

qnode,label,sim,image
Q582964,'Charles Dadant'@en,1.0,
Q5956831,'Hymns Ancient and Modern'@en,0.84983,
Q3759575,'list of American Civil War generals (Confederate)'@en,0.84305,
Q6084534,'Ismael Cerna'@en,0.832,
Q26003,'Sergey Botkin'@en,0.82388,
Q5494660,'Fred Bonsor'@en,0.81946,
Q3303297,'ironmaster'@en,0.81704,
Q4631421,'22nd Regiment Alabama Infantry'@en,0.80955,
Q4641399,'5th North Carolina Regiment'@en,0.80858,


Beaumaris Castle in Wales:

In [18]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels \
      --match 'image: (ximg)-[rx {qnode: $SEED}]->(xiv), \
                      (xiv)-[r:kvec_topk_cos_sim {k: 20, nprobe: 8}]->(yimg), \
                      (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q756815  \
    / html

show_html()

qnode,label,sim,image
Q756815,'Beaumaris Castle'@en,1.0,
Q267153,'list of monasteries dissolved by Henry VIII of England'@en,0.79353,
Q6566349,'list of Category A listed buildings in Dumfries and Galloway'@en,0.79212,
Q40889043,'Scheduled monuments in Renfrewshire'@en,0.7897,
Q912664,'Clan MacDougall'@en,0.78582,
Q922422,'Warkworth Castle'@en,0.78453,
Q6566359,'list of Category A listed buildings in Fife'@en,0.78269,
Q16148507,'list of Historic Scotland properties'@en,0.78237,
Q11808,'castles in Great Britain and Ireland'@en,0.78151,
Q16148507,'list of Historic Scotland properties'@en,0.78122,


<A NAME="image-similarity-join"></A>

Castles similar to Beaumaris Castle but that are located in Austria (with
country (`P17`) equal to `Q40`).  We use a full vector join to get relevant
results further down the similarity list.  Note that even with `maxk=1024` we only
get a few results, and that the similarities are significantly lower than in the
previous example:

In [19]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels -i claims \
      --match 'image: (ximg)-[rx {qnode: $SEED}]->(xiv), \
                      (xiv)-[r:kvec_topk_cos_sim {k: 20, nprobe: 4, maxk: 1024}]->(yimg), \
                      (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-->(ylabel), \
               claims: (y)-[:P17]->(c:Q40)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q756815  \
      --limit 20 \
    / html

show_html()

qnode,label,sim,image
Q1012592,'Burgruine Kaja'@en,0.72402,
Q15954565,'Austrian walled towns'@en,0.74951,
Q1015533,'Burgruine Steuerberg'@en,0.70776,
Q1015457,'Prandegg Castle'@en,0.70276,
Q188358,'Burgruine Dürnstein'@en,0.70275,


<A NAME="text-embedding-queries"></A>
## Text embedding queries:

In the following example we dynamically compute an embedding vector
for a text query and then use the similarity machinery to query for
matching QNodes.  The basic story here is the following:

- formulate one or more simple textual queries such as 'Ancient Greek philosopher'
- create a KGTK input file for them and run it through the `text-embedding` command
- query Wikidata by finding top-k matches based on short abstract text embedding vectors
- then filter with additional restrictions to get more relevant results.

In [15]:
!echo '\
q1	Ancient Greek philosopher\n\
q2	castle in Austria\n\
q3	award-winning actor and comedian' | \
sed -e 's/^ *//' | \
kgtk cat --no-input-header --input-column-names node1 node2 --implied-label sentence \
   / add-id \
   / text-embedding -i - --model roberta-large-nli-mean-tokens \
          --output-data-format kgtk --output-property emb -o - \
   / query -i - --idx vector:node2 --as text_emb_queries --match '(x)' --return x

Running with logging level 30
2023-01-20 12:48:51.752147: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-01-20 12:48:51.752170: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
  return torch._C._cuda_getDeviceCount() > 0
Batches: 100%|████████████████████████████████████| 1/1 [00:00<00:00,  4.12it/s]
node1
q1
q2
q3


The above created 1024-D text embedding vectors for three short queries
using the same text embedding type as used in our `ABSTRACT` embeddings.
Now we find Wikidata QNodes whose short-abstract embedding vector is most similar
to the queries, and that satisfy any additional conditions we might have.
Note that the queries in this example are much shorter than the first sentences
of our Wikipedia abstracts, thus the similarity matching is not very good, but
we can compensate for some of that by adding additional restrictions:

Matches for "Ancient Greek philosopher" that have occupation (`P106`) philosopher:

In [17]:
out = !kgtk query --ac $MAIN --ac $ABSTRACT \
      -i text_emb_queries -i abstract -i labels -i claims -i sentence \
      --match  'queries:  (x:q1)-->(xv), \
                abstract: (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 4}]->(y), \
                claims:   (y)-[:P106]->(:Q4964182), \
                labels:   (y)-->(yl), \
                sentence: (y)-->(ys)' \
      --return 'y as y, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

y,ylabel,sim,ysent
Q1200209,'Dercil·lides'@en,0.9270190596580504,Dercyllides was an ancient Greek Platonist philosopher.
Q12901192,'Nessas of Chios'@en,0.9017531871795654,Nessos of Chios (Ancient Greek: Νεσσᾶς or Νέσσος ὁ Χῖος) was a pre-Socratic ancient Greek philosopher from the island of Chios.
Q962486,'Echecrates of Flius'@en,0.8998405933380127,Echecrates (Greek: Ἐχεκράτης) was a Pythagorean philosopher from the ancient Greek town of Phlius.
Q20379195,'Nestor of Tarsus'@en,0.8979542255401611,Nestor of Tarsus (Ancient Greek: Νέστωρ) was an ancient Greek philosopher of the Stoic school of thought.
Q3780759,'Patro the Epicurean'@en,0.8746783137321472,Patro (Greek: Πάτρων) was an Epicurean philosopher.
Q2397427,'Heraclides Lembus'@en,0.8738521933555603,"Heraclides Lembus (Greek: Ἡρακλείδης Λέμβος, Hērakleidēs Lembos) was an Ancient Greek statesman, historian and philosophical writer."
Q992324,'Eudorus of Alexandria'@en,0.8724848031997681,"Eudorus of Alexandria (Greek: Εὔδωρος ὁ Ἀλεξανδρεύς; 1st century BC) was an ancient Greek philosopher, and a representative of Middle Platonism."
Q373042,'Onasander'@en,0.8724181056022644,Onasander or Onosander (Greek: Ὀνήσανδρος Onesandros or Ὀνόσανδρος Onosandros; fl. 1st century AD) was a Greek philosopher.
Q924215,'Hecato of Rhodes'@en,0.8698219060897827,Hecato or Hecaton of Rhodes (Greek: Ἑκάτων; fl. c. 100 BC) was a Greek Stoic philosopher.
Q325955,'Speusippus'@en,0.869309663772583,Speusippus (/spjuːˈsɪpəs/; Greek: Σπεύσιππος; c. 408 – 339/8 BC) was an ancient Greek philosopher.


Matches for "castle in Austria" that have country (`P17`) Austria:

In [18]:
out = !kgtk query --ac $MAIN --ac $ABSTRACT \
      -i text_emb_queries -i abstract -i labels -i claims -i sentence \
      --match  'queries:  (x:q2)-->(xv), \
                abstract: (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
                claims:   (y)-[:P17]->(:Q40), \
                labels:   (y)-->(yl), \
                sentence: (y)-->(ys)' \
      --return 'y as y, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

y,ylabel,sim,ysent
Q2240044,'Annabichl Castle'@en,0.9592803120613098,Annabichl Castle is a castle in Austria.
Q7378781,'Ruine Raabeck'@en,0.9573754072189332,"Ruine Raabeck is a castle in Styria, Austria."
Q7431733,'Schloss Frondsberg'@en,0.9572933316230774,"Schloss Frondsberg is a castle in Styria, Austria."
Q1379421,'Burg Bideneck'@en,0.9563704133033752,"Burg Bideneck is a castle in Tyrol, Austria."
Q1013482,'Burgruine Pfannberg'@en,0.9561655521392822,"Burgruine Pfannberg is a castle in Styria, Austria."
Q7378775,'Ruine Ligist'@en,0.9561580419540404,"Ruine Ligist is a castle in Styria, Austria."
Q7378769,'Ruine Kalsberg'@en,0.956127405166626,"Ruine Kalsberg is a castle in Styria, Austria."
Q4998492,'Burg Baiersdorf'@en,0.9559961557388306,"Burg Baiersdorf is a castle in Styria, Austria."
Q7378780,'Ruine Pernegg'@en,0.9556364417076112,"Ruine Pernegg is a castle in Styria, Austria."
Q7378773,'Ruine Hauenstein'@en,0.955113649368286,"Ruine Hauenstein is a castle in Styria, Austria."


Matches for "award-winning actor and comedian" that are of type human
and have country of citizenship (`P27`) UK:

In [19]:
out = !kgtk query --ac $MAIN --ac $ABSTRACT \
      -i text_emb_queries -i abstract -i labels -i claims -i sentence \
      --match  'queries:  (x:q3)-->(xv), \
                abstract: (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
                claims:   (y)-[:P31]->(:Q5), \
                          (y)-[:P27]->(:Q145), \
                labels:   (y)-->(yl), \
                sentence: (y)-->(ys)' \
      --return 'y as y, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

y,ylabel,sim,ysent
Q7803499,'Tim FitzHigham'@en,0.866931676864624,"Tim FitzHigham FRSA FRGS, is an English comedian, author, artist and world record holder."
Q5561891,'Gill Isles'@en,0.8463524580001831,Gill Isles is a BAFTA winning TV comedy producer.
Q6988861,'Neil Linpow'@en,0.8445479869842529,"Neil Linpow is a multi-award-winning English actor, writer and filmmaker."
Q16210661,'Philip Bulcock'@en,0.8129876255989075,Philip Bulcock is an English actor who has appeared in numerous award-winning film and theatre productions.
Q7626524,'Stuart Fell'@en,0.8109696507453918,Stuart Fell is a professional actor and stuntman.
Q4424886,'Hedrick Smith'@en,0.801676869392395,Hedrick Smith is a Pulitzer Prize-winning former New York Times reporter and Emmy award-winning producer and correspondent.
Q6137252,'James Kenny'@en,0.7837412357330322,"James Kenny is a professional photographer based in the United Kingdom, best known for his fashion, celebrity portrait, and documentary work."
Q8002945,'Will Lyons'@en,0.781859815120697,"Will Lyons is a journalist, newspaper columnist, award-winning wine writer and broadcaster."
Q6229086,'John Deery'@en,0.7798240780830383,John Deery is a British award-winning film and television drama director.
Q5213767,'Dan Jones'@en,0.7785274386405945,Dan Jones is a BAFTA and Ivor Novello Award winning composer and sound designer working in film and theatre.


<A NAME="comparing-embeddings"></A>
## Comparing different types of embeddings

Below we run a number of similarity queries for each of our various types of
embeddings to see how they behave relative to each other.  Note how they
behave quite differently, reasonable for some use cases but not so much for others:

### Philosophers:

In [24]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q859", "Q868", "Q913"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Plato'@en,'Plato'@en,1.0
1,'Plato'@en,'Socrates'@en,0.778851
2,'Plato'@en,'Epicurus'@en,0.7682
3,'Plato'@en,'Aratus'@en,0.744131
4,'Plato'@en,'Hippocrates'@en,0.742684
5,'Plato'@en,'Theophrastus'@en,0.732886
6,'Plato'@en,'Aeschines'@en,0.727185
7,'Plato'@en,'Antiphon of Rhamnus'@en,0.725084
8,'Plato'@en,'Gorgias'@en,0.724764
9,'Plato'@en,'Antisthenes'@en,0.723077


In [25]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q859", "Q868", "Q913"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'Plato'@en,'Plato'@en,1.0
1,'Plato'@en,'Plotinus'@en,0.752719
2,'Plato'@en,'Cornelius Nepos'@en,0.72332
3,'Plato'@en,'Bret Harte'@en,0.706325
4,'Plato'@en,'Federico Caffè'@en,0.702316
5,'Plato'@en,'Marcel Duchamp'@en,0.677284
6,'Plato'@en,'Quintus Julius Balbus'@en,0.662613
7,'Plato'@en,'Laurentius Abstemius'@en,0.662188
8,'Plato'@en,'Celso Lucio'@en,0.654929
9,'Plato'@en,'Peter von Cornelius'@en,0.684013
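
Note that the last row of the output above (0.684013) has a higher similarity than several rows before it, so the rows are not strictly in descending order. A minimal sketch for re-ranking such a result by similarity, using the label and similarity values copied from the table above (language tags dropped for brevity):

```python
# (ylabel, sim) pairs copied from the TransE output for Plato above.
rows = [
    ("Plato", 1.0),
    ("Plotinus", 0.752719),
    ("Cornelius Nepos", 0.72332),
    ("Bret Harte", 0.706325),
    ("Federico Caffè", 0.702316),
    ("Marcel Duchamp", 0.677284),
    ("Quintus Julius Balbus", 0.662613),
    ("Laurentius Abstemius", 0.662188),
    ("Celso Lucio", 0.654929),
    ("Peter von Cornelius", 0.684013),
]


def sort_by_similarity(rows):
    """Return the rows ordered by descending similarity (second tuple field)."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = sort_by_similarity(rows)
for rank, (label, sim) in enumerate(ranked):
    print(rank, label, sim)
```

If the result is a pandas DataFrame, as the indexed output suggests, `df.sort_values("sim", ascending=False)` would be the equivalent.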


In [20]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-->(xl), (y)-->(yl), \
               sent:     (y)-->(ys)' \
      --where 'x in ["Q859", "Q868", "Q913"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'Plato'@en,'Plato'@en,1.0,Plato (/ˈpleɪtoʊ/ PLAY-toe; Greek: Πλάτων Plátōn; 428/427 or 424/423 – 348/347 BC) was a Greek philosopher born in Athens during the Classical period in Ancient Greece.
'Plato'@en,'Aenesidemus'@en,0.936797022819519,"Aenesidemus (Ancient Greek: Αἰνησίδημος or Αἰνεσίδημος) was a Greek Pyrrhonist philosopher, born in Knossos on the island of Crete."
'Plato'@en,'Aristotle'@en,0.928276777267456,"Aristotle (/ˈærɪstɒtəl/; Greek: Ἀριστοτέλης Aristotélēs, pronounced [aristotélɛːs]; 384–322 BC) was a Greek philosopher and polymath during the Classical period in Ancient Greece."
'Plato'@en,'Menedemus'@en,0.9262720942497252,Menedemus of Eretria (Greek: Μενέδημος ὁ Ἐρετριεύς; 345/44 – 261/60 BC) was a Greek philosopher and founder of the Eretrian school.
'Plato'@en,'Hicetas'@en,0.9233798980712892,Hicetas (Ancient Greek: Ἱκέτας or Ἱκέτης; c. 400 – c. 335 BC) was a Greek philosopher of the Pythagorean School.
'Plato'@en,'Philo of Larissa'@en,0.9217081069946288,Philo of Larissa (Greek: Φίλων ὁ Λαρισσαῖος Philon ho Larissaios; 159/8–84/3 BC) was a Greek philosopher.
'Plato'@en,'Metrodorus of Lampsacus'@en,0.917860209941864,"Metrodorus of Lampsacus (Greek: Μητρόδωρος Λαμψακηνός, Mētrodōros Lampsakēnos; 331/0–278/7 BC) was a Greek philosopher of the Epicurean school."
'Plato'@en,'Speusippus'@en,0.9144768714904784,Speusippus (/spjuːˈsɪpəs/; Greek: Σπεύσιππος; c. 408 – 339/8 BC) was an ancient Greek philosopher.
'Plato'@en,'Echecrates of Flius'@en,0.9101097583770752,Echecrates (Greek: Ἐχεκράτης) was a Pythagorean philosopher from the ancient Greek town of Phlius.
'Plato'@en,'Philolaus'@en,0.9096436500549316,"Philolaus (/ˌfɪləˈleɪəs/; Ancient Greek: Φιλόλαος, Philólaos; c. 470 – c. 385 BCE) was a Greek Pythagorean and pre-Socratic philosopher."


### Countries:

In [27]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q40", "Q41", "Q30"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'United States of America'@en,'United States of America'@en,1.0
1,'United States of America'@en,'United Kingdom'@en,0.819738
2,'United States of America'@en,'France'@en,0.810034
3,'United States of America'@en,'Canada'@en,0.79315
4,'United States of America'@en,'Spain'@en,0.791431
5,'United States of America'@en,'Australia'@en,0.780531
6,'United States of America'@en,'Thailand'@en,0.742816
7,'United States of America'@en,'South Korea'@en,0.734353
8,'United States of America'@en,'India'@en,0.730247
9,'United States of America'@en,'Mexico'@en,0.717486


In [28]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q40", "Q41", "Q30"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'United States of America'@en,'United States of America'@en,1.0
1,'United States of America'@en,'State of Scott'@en,0.790001
2,'United States of America'@en,'.سورية'@en,0.781829
3,'United States of America'@en,'Republic of South Carolina'@en,0.77654
4,'United States of America'@en,'State of Kanawha'@en,0.770454
5,'United States of America'@en,'Wedge: The Secret War between the FBI and CIA...,0.762854
6,'United States of America'@en,'Marin County'@en,0.741442
7,'United States of America'@en,'Light Stations of the United States MPS'@en,0.731502
8,'United States of America'@en,'Republic of Florida'@en,0.730721
9,'United States of America'@en,'Women's Professional Racquetball Organization...,0.737493


In [21]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-->(xl), (y)-->(yl), \
               sent:     (y)-->(ys)' \
      --where 'x in ["Q40", "Q41", "Q30"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'United States of America'@en,'United States of America'@en,1.0000001192092896,"The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America."
'United States of America'@en,'Flora of Arkansas'@en,0.9252734184265136,"Geobotanically, Arkansas belongs to the North American Atlantic Region."
'United States of America'@en,'Northern United States'@en,0.9048947095870972,"The Northern United States, commonly referred to as the American North, the Northern States, or simply the North, is a geographical or historical region of the United States."
'United States of America'@en,'Backcountry'@en,0.9047999382019044,The Backcountry was a region in North America.
'United States of America'@en,'Canada'@en,0.9028735160827636,Canada is a country in North America.
'United States of America'@en,'Medfield'@en,0.9025246500968932,"Medfield is a neighborhood located in north Baltimore, Maryland, United States of America."
'United States of America'@en,'Northwest Georgia'@en,0.901907742023468,Northwest Georgia is a region of the state of Georgia in the United States.
'United States of America'@en,'North America'@en,0.9008367657661438,North America is a continent in the Northern Hemisphere and almost entirely within the Western Hemisphere.
'United States of America'@en,'Tidewater region of Virginia'@en,0.9006068110466003,Tidewater refers to the north Atlantic coastal plain region of the United States of America.
'United States of America'@en,'Oreamnos'@en,0.8993678689002991,Oreamnos is a genus of North American caprines.
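
The self-similarity in the output above is reported as 1.0000001192092896 rather than exactly 1.0, and other outputs in this notebook show 0.9999999403953552. These are exactly the single-precision (float32) values adjacent to 1.0, which suggests that the similarity is computed in float32. That is an assumption; it is not stated in this notebook. The sketch below checks the two values with the standard library and shows a hypothetical `clamp_similarity` helper for downstream code that expects values within [-1, 1].

```python
import struct


def float32_neighbors(x):
    """Return the float32 values just below and just above a positive float32 x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # raw IEEE 754 bit pattern
    below = struct.unpack("<f", struct.pack("<I", bits - 1))[0]
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return below, above


def clamp_similarity(sim):
    """Clamp a cosine similarity into [-1.0, 1.0] (hypothetical helper)."""
    return max(-1.0, min(1.0, sim))


below, above = float32_neighbors(1.0)
print(below, above)
print(clamp_similarity(1.0000001192092896))
```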


### Types of animals:

In [30]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q144", "Q146", "Q726"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'dog'@en,'dog'@en,1.0
1,'dog'@en,'hat'@en,0.72131
2,'dog'@en,'house cat'@en,0.706597
3,'dog'@en,'body armor'@en,0.692382
4,'dog'@en,'woman'@en,0.68766
5,'dog'@en,'peafowl'@en,0.686448
6,'dog'@en,'bouquet'@en,0.678995
7,'dog'@en,'hunting dog'@en,0.667602
8,'dog'@en,'logo'@en,0.647672
9,'dog'@en,'sceptre'@en,0.644042


In [31]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q144", "Q146", "Q726"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'dog'@en,'dog'@en,1.0
1,'dog'@en,'Salty and Roselle'@en,0.840819
2,'dog'@en,'Fred Basset'@en,0.831087
3,'dog'@en,'Theo'@en,0.821203
4,'dog'@en,'Heaven Sent Brandy'@en,0.820738
5,'dog'@en,'Old Hemp'@en,0.818933
6,'dog'@en,'Rubia'@en,0.81686
7,'dog'@en,'Alcmène'@en,0.81526
8,'dog'@en,'Alex the Dog'@en,0.813087
9,'dog'@en,'Edda'@en,0.810013


In [22]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-->(xl), (y)-->(yl), \
               sent:     (y)-->(ys)' \
      --where 'x in ["Q144", "Q146", "Q726"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'dog'@en,'dog'@en,0.9999999403953552,The dog or domestic dog (Canis familiaris or Canis lupus familiaris) is a domesticated descendant of the wolf.
'dog'@en,'wolfdog'@en,0.8726404309272766,"A wolfdog is a canine produced by the mating of a domestic dog (Canis familiaris) with a gray wolf (Canis lupus), eastern wolf (Canis lycaon), red wolf (Canis rufus), or Ethiopian wolf (Canis simensis) to produce a hybrid."
'dog'@en,'Garmr'@en,0.8452869653701782,"In Norse mythology, Garmr or Garm (Old Norse: Garmr [ˈɡɑrmz̠]; \""rag\"") is a wolf or dog associated with both Hel and Ragnarök, and described as a blood-stained guardian of Hel\'s gate."
'dog'@en,'Shaun Ellis'@en,0.839640736579895,"Shaun Ellis is an English animal researcher who lived among wolves, and adopted a pack of abandoned North American timber wolf pups."
'dog'@en,'Saarloos wolfdog'@en,0.8337552547454834,"The Saarloos Wolfdog (Dutch: Saarlooswolfhond, German: Saarlooswolfhund) is a wolf-dog breed originating from the Netherlands by the crossing of a German Shepherd with a Siberian grey wolf in 1935."
'dog'@en,'Canidae'@en,0.8309797048568726,"Canidae (/ˈkænɪdiː/; from Latin, canis, \""dog\"") is a biological family of dog-like carnivorans, colloquially referred to as dogs, and constitutes a clade."
'dog'@en,'Pembroke Welsh Corgi'@en,0.8307735323905945,"The Pembroke Welsh Corgi (/ˈkɔːrɡi/; Welsh for \""dwarf dog\"") is a cattle herding dog breed that originated in Pembrokeshire, Wales."
'dog'@en,'Wolf distribution'@en,0.8283219933509827,Wolf distribution is the species distribution of the wolf (Canis lupus).
'dog'@en,'Schizocosa stridulans'@en,0.8269302248954773,Schizocosa stridulans is a sibling species of S. ocreata and S. rovneri and is part of the wolf spider family.
'dog'@en,'Himalayan Wolf'@en,0.8238919973373413,The Himalayan wolf (Canis lupus chanco) is a canine of debated taxonomy.


### Handball:

In [33]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q8418"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'handball'@en,'handball'@en,1.0
1,'handball'@en,'beach handball'@en,0.755496
2,'handball'@en,'field hockey'@en,0.747243
3,'handball'@en,'korfball'@en,0.735095
4,'handball'@en,'indoor handball'@en,0.729936
5,'handball'@en,'biathlon'@en,0.705106
6,'handball'@en,'volleyball'@en,0.704901
7,'handball'@en,'softball'@en,0.68622
8,'handball'@en,'field handball'@en,0.683182
9,'handball'@en,'orienteering'@en,0.666485


In [34]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q8418"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'handball'@en,'handball'@en,1.0
1,'handball'@en,'Wikipedia:WikiProject Handball'@en,0.830904
2,'handball'@en,'women's beach handball'@en,0.714068
3,'handball'@en,'ski jumping'@en,0.70067
4,'handball'@en,'futsal'@en,0.689599
5,'handball'@en,'indoor handball'@en,0.687324
6,'handball'@en,'biathlon'@en,0.683123
7,'handball'@en,'women's association football'@en,0.6782
8,'handball'@en,'Qatch'@en,0.65917
9,'handball'@en,'hockey'@en,0.649488


In [23]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-->(xl), (y)-->(yl), \
               sent:     (y)-->(ys)' \
      --where 'x in ["Q8418"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'handball'@en,'handball'@en,0.9999999403953552,"Handball (also known as team handball, European handball or Olympic handball) is a team sport in which two teams of seven players each (six outcourt players and a goalkeeper) pass a ball using their hands with the aim of throwing it into the goal of the other team."
'handball'@en,'beach handball'@en,0.8959406614303589,"Beach handball is a team sport where two teams pass and bounce or roll a ball, trying to throw it in the goal of the opposing team."
'handball'@en,'Gaelic handball'@en,0.8065397143363953,"Gaelic handball (known in Ireland simply as handball; Irish: liathróid láimhe) is a sport where players hit a ball with a hand or fist against a wall in such a way as to make a shot the opposition cannot return, and that may be played with two (singles) or four players (doubles)."
'handball'@en,'Tennis polo'@en,0.80338454246521,Tennis polo (or toccer) is a field sport where two teams of ten players (nine field players and one goalkeeper) use a tennis ball to score goals by throwing the ball into a goal defended by a keeper who holds a racket.
'handball'@en,'Balonpesado'@en,0.7840337157249451,"The balonpesado is a team sport, devised for both open field as closed, in which two sets of five players each try to score goals within circles drawn on the ground of each end of the field."
'handball'@en,'Harrison Hoist'@en,0.776003897190094,"The Harrison Hoist, also known as the Chairlift, is a form of goaltending in netball where one defender lifts another defender, rugby union lineout-style, in order to catch the ball and prevent a goal scoring opportunity."
'handball'@en,'ball boy'@en,0.7694556713104248,"Ball boys and ball girls, also known as ball kids are individuals, usually human youths but sometimes dogs, who retrieve and supply balls for players or officials in sports such as association football, American football, bandy, cricket, tennis, baseball and basketball."
'handball'@en,'dodgeball'@en,0.7659051418304443,"Dodgeball is a team sport in which players on two teams try to throw balls and hit opponents, while avoiding being hit themselves."
'handball'@en,'sepak takraw'@en,0.7623462677001953,"Sepak takraw, or Sepaktakraw, also called kick volleyball, is a team sport played with a ball made of rattan or synthetic plastic between two teams of two to four players on a court resembling a badminton court."
'handball'@en,'Guts'@en,0.7574184536933899,"Guts or disc guts (sometimes guts Frisbee in reference to the trademarked brand name) is a disc game inspired by dodgeball, involving teams throwing a flying disc (rather than balls) at members of the opposing team."
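
One way to quantify how differently the embedding types behave is to measure the overlap of their top-k neighbor lists. The sketch below computes the Jaccard index for the three handball results above. The labels are copied from those outputs, and the query node itself is left out because it trivially appears in every list. The `jaccard` helper is illustrative and not part of KGTK.

```python
# Neighbor labels copied from the three handball outputs above (query node omitted).
complex_top = ["beach handball", "field hockey", "korfball", "indoor handball",
               "biathlon", "volleyball", "softball", "field handball", "orienteering"]
transe_top = ["Wikipedia:WikiProject Handball", "women's beach handball", "ski jumping",
              "futsal", "indoor handball", "biathlon", "women's association football",
              "Qatch", "hockey"]
abstract_top = ["beach handball", "Gaelic handball", "Tennis polo", "Balonpesado",
                "Harrison Hoist", "ball boy", "dodgeball", "sepak takraw", "Guts"]


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


print("complex vs transe:  ", jaccard(complex_top, transe_top))
print("complex vs abstract:", jaccard(complex_top, abstract_top))
print("transe vs abstract: ", jaccard(transe_top, abstract_top))
```

For this example, ComplEx and TransE share two neighbors (indoor handball and biathlon), ComplEx and the abstract embeddings share one (beach handball), and TransE and the abstract embeddings share none.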


### Journalist:

In [36]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q1930187"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'journalist'@en,'journalist'@en,1.0
1,'journalist'@en,'television presenter'@en,0.806589
2,'journalist'@en,'writer'@en,0.79479
3,'journalist'@en,'poet'@en,0.785165
4,'journalist'@en,'playwright'@en,0.77569
5,'journalist'@en,'politician'@en,0.756889
6,'journalist'@en,'short story writer'@en,0.756759
7,'journalist'@en,'actor'@en,0.751951
8,'journalist'@en,'film critic'@en,0.743542
9,'journalist'@en,'teacher'@en,0.743155


In [37]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q1930187"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'journalist'@en,'journalist'@en,1.0
1,'journalist'@en,'Category:Journalists'@en,0.719601
2,'journalist'@en,'journalistic scandal'@en,0.693972
3,'journalist'@en,'children's writer'@en,0.667045
4,'journalist'@en,'László Török'@en,0.658004
5,'journalist'@en,'columnist'@en,0.650447
6,'journalist'@en,'novelist'@en,0.644886
7,'journalist'@en,'foreign correspondent'@en,0.626349
8,'journalist'@en,'Category:Journalists of Ceará'@en,0.650367
9,'journalist'@en,'business journalist'@en,0.640736


In [24]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-->(xl), (y)-->(yl), \
               sent:     (y)-->(ys)' \
      --where 'x in ["Q1930187"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'journalist'@en,'journalist'@en,0.9999999403953552,"A journalist is an individual that collects/gathers information in form of text, audio, or pictures, processes them into a news-worthy form, and disseminates it to the public."
'journalist'@en,'journalism'@en,0.8947291374206543,"Journalism is the production and distribution of reports on the interaction of events, facts, ideas, and people that are the \""news of the day\"" and that informs society to at least some degree."
'journalist'@en,'news analyst'@en,0.8600048422813416,"A news analyst examines, analyses and interprets broadcast news received from various sources."
'journalist'@en,'outline of journalism'@en,0.8528070449829102,"The following outline is provided as an overview of and topical guide to journalism: Journalism – investigation and reporting of events, issues and trends to a broad audience."
'journalist'@en,'Public editor'@en,0.8508998155593872,A public editor is a position existing at some news publications; the person holding this position is responsible for supervising the implementation of proper journalism ethics at that publication.
'journalist'@en,'Index of journalism articles'@en,0.831515371799469,Articles related to the field of journalism include:
'journalist'@en,'source'@en,0.8213073015213013,"In journalism, a source is a person, publication, or knowledge other record or document that gives timely information."
'journalist'@en,'journalism ethics and standards'@en,0.8076792359352112,Journalistic ethics and standards comprise principles of ethics and good practice applicable to journalists.
'journalist'@en,'Assignment editor'@en,0.8065866231918335,"In journalism, an assignment editor is an editor – either at a newspaper or a radio or television station – who selects, develops, and plans reporting assignments, either news events or feature stories, to be covered by reporters."
'journalist'@en,'Information subsidy'@en,0.8030204176902771,An information subsidy is the provision of ready-to-use newsworthy information to the news media by various sources interested in gaining access to media time and space.


### Head of state:

In [39]:
kgtk("""
      query --gc $MAIN --ac $COMPLEX
      -i complex -i labels
      --match 'complex:  (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q48352"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'head of state'@en,'head of state'@en,1.0
1,'head of state'@en,'head of government'@en,0.831704
2,'head of state'@en,'leader of organisation'@en,0.715446
3,'head of state'@en,'governor'@en,0.669001
4,'head of state'@en,'consul general'@en,0.642665
5,'head of state'@en,'Floor leader'@en,0.605335
6,'head of state'@en,'defence minister'@en,0.691881
7,'head of state'@en,'French ambassador'@en,0.636968
8,'head of state'@en,'supreme court justice'@en,0.635844
9,'head of state'@en,'Executive Secretary of the Secretariat'@en,0.568668


In [40]:
kgtk("""
      query --gc $MAIN --ac $TRANSE
      -i transe -i labels
      --match 'transe:   (x)-->(xv),
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y),
               labels:   (x)-->(xl), (y)-->(yl)'
      --where 'x in ["Q48352"]'
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'head of state'@en,'head of state'@en,1.0
1,'head of state'@en,'head of government'@en,0.861243
2,'head of state'@en,'governor'@en,0.792502
3,'head of state'@en,'prime minister'@en,0.789789
4,'head of state'@en,'speaker'@en,0.781205
5,'head of state'@en,'Governor-general'@en,0.730634
6,'head of state'@en,'attorney general'@en,0.71397
7,'head of state'@en,'colonial governor'@en,0.70737
8,'head of state'@en,'foreign minister'@en,0.703298
9,'head of state'@en,'minister'@en,0.765075


In [25]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-->(xl), (y)-->(yl), \
               sent:     (y)-->(ys)' \
      --where 'x in ["Q48352"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

xlabel,ylabel,sim,ysent
'head of state'@en,'head of state'@en,1.0000001192092896,A head of state (or chief of state) is the public persona who officially embodies a state in its unity and legitimacy.
'head of state'@en,'Justification for the state'@en,0.8505247831344604,The justification of the state refers to the source of legitimate authority for the state or government.
'head of state'@en,'Seal of Tamil Nadu'@en,0.8389069437980652,The Emblem of Tamil Nadu is the official state emblem of Tamil Nadu and is used as the official state symbol of the Government of Tamil Nadu.
'head of state'@en,'Head of Kalmykia'@en,0.8310245275497437,The Head of Kalmykia is an elected official who serves as the head of state of Kalmykia.
'head of state'@en,'Prime Minister of the United Kingdom'@en,0.8299723863601685,The prime minister of the United Kingdom is the head of government of the United Kingdom.
'head of state'@en,'Governor of Kaduna State'@en,0.8240423202514648,The Kaduna State Governor is the head of .The governor leads the executive branch of the Government.This position places its holder in leadership of the state with command authority over the state affairs.
'head of state'@en,'Official culture'@en,0.8230085372924805,Official culture is the culture that receives social legitimation or institutional support in a given society.
'head of state'@en,'President of Tanzania'@en,0.8226332664489746,The President of the United Republic of Tanzania (Swahili: Rais wa Jamhuri ya Muungano wa Tanzania) is the head of state and head of government of the United Republic of Tanzania.
'head of state'@en,'Administrator of Ascension Island'@en,0.8202604055404663,"The Administrator of Ascension is the head of government and representative of the Governor of St Helena, Ascension and Tristan da Cunha in Ascension Island."
'head of state'@en,'Contributions Agency'@en,0.8172905445098877,The Contributions Agency was an executive agency of the United Kingdom government.
