# Query IMKG and its embeddings with KGTK Kypher-V

In [43]:
import re
from IPython.display import display, HTML
from kgtk.functions import kgtk

def show_html(img_width=150):
    """Display command output in the global 'out' as HTML after munging image links for inline display."""
    # 'out' is set by a preceding `out = !kgtk query ... / html` cell.
    output = '\n'.join(out)
    html = re.sub(r'<td>&quot;(https?://upload\.wikimedia\.org/[^<]+)&quot;</td>',
                  f'<td style="width:{img_width}px;vertical-align:top"><img src="\\1"/></td>',
                  output)
    display(HTML(html))
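
To see what the substitution in `show_html` does, here is a minimal standalone sketch that applies the same kind of pattern to a single table row. The helper name `inline_images` and the sample URL are hypothetical and used for illustration only; nothing is displayed, the rewritten HTML is just printed.

```python
import re

def inline_images(html, img_width=150):
    """Replace table cells holding a quoted Wikimedia URL with an <img> cell."""
    return re.sub(
        r'<td>&quot;(https?://upload\.wikimedia\.org/[^<]+)&quot;</td>',
        f'<td style="width:{img_width}px;vertical-align:top"><img src="\\1"/></td>',
        html,
    )

# Hypothetical one-row table; the URL is a made-up example.
sample = ('<tr><td>Q1</td>'
          '<td>&quot;https://upload.wikimedia.org/example.png&quot;</td></tr>')
print(inline_images(sample, img_width=200))
```

Cells that do not contain a quoted `upload.wikimedia.org` URL, such as the `Q1` cell here, pass through unchanged.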

In [44]:
DB="kypherv"
%env DB={DB}
%env MAIN={DB}/wikidata-20221102-dwd-v8-main.sqlite3.db
%env ABSTRACT={DB}/wikidata-20221102-dwd-v8-abstract-embeddings-large.sqlite3.db
%env IMAGE={DB}/wikimedia-capcom-image-embeddings-v2.sqlite3.db

env: DB=kypherv
env: MAIN=kypherv/wikidata-20221102-dwd-v8-main.sqlite3.db
env: ABSTRACT=kypherv/wikidata-20221102-dwd-v8-abstract-embeddings-large.sqlite3.db
env: IMAGE=kypherv/wikimedia-capcom-image-embeddings-v2.sqlite3.db


In [45]:
!kgtk query --gc $ABSTRACT --sc

Graph Cache:
DB file: kypherv/wikidata-20221102-dwd-v8-abstract-embeddings-large.sqlite3.db
  size:  33.89 GB   	free:  0 Bytes   	modified:  2023-01-25 16:20:33

KGTK File Information:
/Users/filipilievski/mcs/kgtk-tutorial-aaai23/wikidata/labels.en.tsv.gz:
  size:  679.79 MB   	modified:  2023-01-25 14:11:01   	graph:  graph_3
abstract:
  size:  0 Bytes   	modified:  2023-01-19 13:24:19   	graph:  graph_1
sentence:
  size:  256.32 MB   	modified:  2023-01-04 13:53:44   	graph:  graph_2

Graph Table Information:
graph_1:
  size:  28.21 GB   	created:  2023-01-19 13:24:19
  header:  ['node1', 'label', 'node2', 'id']
graph_2:
  size:  1.23 GB   	created:  2023-01-19 15:01:41
  header:  ['node1', 'label', 'node2', 'id']
graph_3:
  size:  4.52 GB   	created:  2023-01-25 16:19:55
  header:  ['id', 'node1', 'label', 'node2', 'lang', 'rank', 'node2;wikidatatype']


In [23]:
kgtk("""query --gc $ABSTRACT -i abstract --limit 3""")

Unnamed: 0,node1,label,node2,id,node2;_kgtk_vec_qcell
0,Q1000929,emb,b'7xcbM?x88x8ax0c?x8b:xa0xbfN3\?xb9Yxac>-xe6d?...,E567,0
1,Q100146561,emb,b'6|xa0?x89x06x84?xc1x00\xbf.XVxbf$%a>x11xfex8...,E1080,0
2,Q100146569,emb,b'Gx8fy?xc9x0fx08?x06`Axbfnxdfxdbxbexeb;xda=x1...,E1088,0


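Compute the cosine similarity between the abstract embeddings of two QNodes, SpongeBob SquarePants (`Q83279`) and balloon (`Q183951`), and show their labels: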

In [26]:
kgtk(""" 
      query --gc $ABSTRACT
      -i abstract -i "wikidata/labels.en.tsv.gz"
      --match 'abstract: (x:Q83279)-->(xv),
                         (y:Q183951)-->(yv),
               label:   (x)-->(xl), (y)-->(yl)'
      --return 'xl as xlabel, yl as ylabel, kvec_cos_sim(xv, yv) as sim'
     """)

Unnamed: 0,xlabel,ylabel,sim
0,'SpongeBob SquarePants'@en,'balloon'@en,0.277892
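
As a rough illustration of the quantity reported in the `sim` column, the sketch below computes cosine similarity for two toy vectors in plain Python. It assumes the vectors arrive as byte strings of packed little-endian float32 values; that layout is our assumption for the example, not a statement about how KGTK stores embeddings, and the helper names are hypothetical.

```python
import math
import struct

def unpack_vector(blob):
    """Decode a byte string of packed little-endian float32 values (assumed layout)."""
    count = len(blob) // 4
    return struct.unpack(f'<{count}f', blob)

def cos_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, packed the same way we assume above.
xv = struct.pack('<3f', 1.0, 0.0, 0.0)
yv = struct.pack('<3f', 1.0, 1.0, 0.0)
print(round(cos_sim(unpack_vector(xv), unpack_vector(yv)), 4))  # 0.7071
```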


In the examples below, we use image similarity to link QNodes in Wikidata.  We
use the precomputed `IMAGE` graph cache (see above) which contains embeddings
for about 2.7M images linked to their respective Wikipedia pages and Wikidata
QNodes.  

We start with a QNode (such as the one for SpongeBob SquarePants below), find one or more
images associated with that QNode, look up their image embeddings and then find
other similar images and their associated QNodes.

We do not compute any image embeddings on the fly here; we simply link nodes based
on the similarity of the images they are associated with.  Note that this will often not
preserve the type of the source node, as can be seen in the results for SpongeBob
SquarePants, which include a balloon and several places.
To enforce type or other restrictions, additional clauses can be added.
Since a QNode can have multiple images associated with it, we use a `not exists`
clause to only look at the first one to make the results less cluttered:

SpongeBob SquarePants:

In [16]:
out = !kgtk query --gc $IMAGE \
      -i wiki_image -i "wikidata/labels.en.tsv.gz" \
      --match 'image:  (ximg)-[rx {qnode: $SEED}]->(xiv), \
                       (xiv)-[r:kvec_topk_cos_sim {k: 10, nprobe: 8}]->(yimg), \
                       (yimg)-[ry {qnode: y}]->(), \
               labels: (y)-->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q83279 \
    / html

show_html(img_width=200)

qnode,label,sim,image
Q83279,'SpongeBob SquarePants'@en,1.0,
Q498881,'2013 Asian Indoor-Martial Arts Games'@en,0.73503,
Q183951,'balloon'@en,0.70113,
Q7228285,'Pontiki'@en,0.69247,
Q2086354,'Marsden'@en,0.68826,
Q6537417,'Lewiston–Auburn'@en,0.6881,
Q10939861,'Tianshui Railway Station'@en,0.68097,
Q4240381,'Russian church architecture'@en,0.678,
Q749387,'paratrooper'@en,0.67306,
Q34804,'Albuquerque'@en,0.666,
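
The `not exists` clause in the query above keeps only the seed image with the smallest row ID. The same idea can be sketched in plain SQLite with Python's `sqlite3` module. The table, columns and file names below are hypothetical, and this is an analogy to the clause, not the SQL that Kypher actually generates.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE image (file TEXT, qnode TEXT)')
con.executemany('INSERT INTO image VALUES (?, ?)', [
    ('a.jpg', 'Q1'),   # first image of Q1 (smallest rowid)
    ('b.jpg', 'Q1'),   # second image of Q1, filtered out below
    ('c.jpg', 'Q2'),
])

# Keep a row only if no other row for the same qnode has a smaller rowid.
rows = con.execute("""
    SELECT i.file FROM image AS i
    WHERE i.qnode = ?
      AND NOT EXISTS (SELECT 1 FROM image AS j
                      WHERE j.qnode = i.qnode AND j.rowid < i.rowid)
""", ('Q1',)).fetchall()
print(rows)  # [('a.jpg',)]
```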


To get more type-appropriate matches, we can add a restriction to only return matches of
type animated series (`Q581714`):

In [None]:
out = !kgtk query --gc $IMAGE --ac $MAIN \
      -i wiki_image -i labels -i claims \
      --match 'image:  (ximg)-[rx {qnode: $SEED}]->(xiv), \
                       (xiv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(yimg), \
                       (yimg)-[ry {qnode: y}]->(), \
               claims: (y)-[:P31]->(:Q581714), \
               labels: (y)-->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q83279 \
    / html

show_html(img_width=200)
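
One way to picture how a type restriction interacts with a top-k search is to over-fetch candidates and filter them, widening the search until enough matches survive or an upper bound is hit. The sketch below does this by brute force on toy data. It is a conceptual illustration only: the doubling strategy, the helper names and the toy vectors are our assumptions, and it does not model the approximate index or the `nprobe` setting used by `kvec_topk_cos_sim`.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def topk(query, vectors, k):
    """Brute-force top-k by cosine similarity; returns [(name, sim), ...]."""
    scored = [(name, cos_sim(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def topk_filtered(query, vectors, keep, k, maxk):
    """Widen the candidate set until k items pass `keep` or maxk is reached."""
    fetch = k
    while True:
        hits = [(n, s) for n, s in topk(query, vectors, fetch) if keep(n)]
        if len(hits) >= k or fetch >= maxk or fetch >= len(vectors):
            return hits[:k]
        fetch = min(fetch * 2, maxk)  # assumed growth strategy

# Toy data: 2-dimensional vectors and a made-up type for each item.
vectors = {'a': (1.0, 0.0), 'b': (0.9, 0.1), 'c': (0.7, 0.7), 'd': (0.0, 1.0)}
types = {'a': 'series', 'b': 'city', 'c': 'series', 'd': 'series'}

result = topk_filtered((1.0, 0.0), vectors,
                       lambda name: types[name] == 'series', k=2, maxk=4)
print([name for name, _ in result])  # ['a', 'c']
```

With `k=2` the first pass finds only one item of the wanted type, so the candidate set is widened once before two matches are returned.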

Similarly, we can restrict the results to matches of type Internet meme (`Q2927074`):

In [None]:
out = !kgtk query --gc $IMAGE  --ac $MAIN \
      -i wiki_image -i labels -i claims \
      --match 'image:  (ximg)-[rx {qnode: $SEED}]->(xiv), \
                       (xiv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(yimg), \
                       (yimg)-[ry {qnode: y}]->(), \
               claims: (y)-[:P31]->(:Q2927074), \
               labels: (y)-->(ylabel)' \
      --where 'not exists {image: (ximg2)-[{qnode: $SEED}]->() WHERE rowid(ximg2) < rowid(ximg) }' \
      --return 'y as qnode, ylabel as label, printf("%.5g", r.similarity) as sim, yimg as image' \
      --para  SEED=Q83279 \
    / html

show_html(img_width=200)

Let's get the most similar entities according to the abstract embeddings:

In [39]:
out = !kgtk query --gc $MAIN --ac $ABSTRACT \
      -i abstract -i labels -i sentence \
      --match 'abstract: (x)-->(xv), \
                         (xv)-[r:kvec_topk_cos_sim {k: 10, maxk: 1024, nprobe: 8}]->(y), \
               labels:   (x)-->(xl), (y)-->(yl), \
               sent:     (y)-->(ys)' \
      --where 'x in ["Q40", "Q41", "Q30"]' \
      --return 'xl as xlabel, yl as ylabel, r.similarity as sim, kgtk_lqstring_text(ys) as ysent' \
    / html

show_html()

^C
