# Build Graph For The Tutorial

This notebook works for any root node; the default is `Q2685` (Schwarzenegger).

In [1]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd
from IPython.display import display, HTML

import papermill as pm

sys.path.insert(0,'..')
from configure_kgtk_notebooks import ConfigureKGTK

User home: /Users/pedroszekely
Current dir: /Users/pedroszekely/Documents/GitHub/kgtk/tutorial
Use-cases dir: /Users/pedroszekely/Documents/GitHub/kgtk/use-cases


In [2]:
# Parameters

# Folder on local machine where to create the output and temporary folders
input_path = "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data/"
output_path = "/Users/pedroszekely/Downloads/kypher/projects"
project_name = "build-tutorial"
root = "Q2685"

Put the root q-node in the environment variable `ROOT`

In [3]:
os.environ['ROOT'] = root

In [4]:
files = [
    "claims",
    "item",
    "wikibase_property",
    "datatypes",
    "qualifiers",
    "p31",
    "p279",
    "p279star",
    "quantity",
    "time",
    "external_id",
    "globe_coordinate",
    "monolingualtext",
    "string",
    "label",
    "alias",
    "description"
]
ck = ConfigureKGTK()
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name)

In [5]:
os.environ['KGTK_LABEL_FILE'] = os.environ['label']

In [6]:
ck.print_env_variables(files)

kypher: kgtk query --graph-cache /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/wikidata.sqlite3.db
GRAPH: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data/
kgtk: kgtk
TEMP: /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial
STORE: /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/wikidata.sqlite3.db
EXAMPLES_DIR: /Users/pedroszekely/Documents/GitHub/kgtk/tutorial
USE_CASES_DIR: /Users/pedroszekely/Documents/GitHub/kgtk/use-cases
OUT: /Users/pedroszekely/Downloads/kypher/projects/build-tutorial
claims: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//claims.tsv.gz
item: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//claims.wikibase-item.tsv.gz
wikibase_property: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//claims.wikibase-property.tsv.gz
datatypes: /Volumes/GoogleDrive/Shared drives/KGTK/datas

## Define a custom location for the store when working with full Wikidata so that I can reuse it

In [7]:
os.environ['STORE'] = "/Users/pedroszekely/Downloads/kypher/wikidata.sqlite3.db"

Turn on debugging for kypher

In [8]:
os.environ['kypher'] = "kgtk --debug query --graph-cache " + os.environ['STORE']

In [9]:
!echo "$kypher"

kgtk --debug query --graph-cache /Users/pedroszekely/Downloads/kypher/wikidata.sqlite3.db


Load all my files into the kypher cache so that all graph aliases are defined

In [10]:
ck.load_files_into_cache(file_list=files)

kgtk --debug query --graph-cache /Users/pedroszekely/Downloads/kypher/wikidata.sqlite3.db -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//claims.tsv.gz" --as claims  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//claims.wikibase-item.tsv.gz" --as item  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//claims.wikibase-property.tsv.gz" --as wikibase_property  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//metadata.property.datatypes.tsv.gz" --as datatypes  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//qualifiers.tsv.gz" --as qualifiers  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//derived.P31.tsv.gz" --as p31  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//derived.P279.tsv.gz" --as p279  -i "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20210215/data//derived.P279star.ts

In [20]:
%cd {os.environ['OUT']}

/Users/pedroszekely/Downloads/kypher/projects/build-tutorial


# Approach:
- Select a subgraph of full Wikidata that includes films (Q11424), people (Q5), organizations (Q43229), geographic regions (Q82794), and awards (Q618779). This graph contains all edges that connect instances of the target classes listed above. Output the graph using a single relation we call `link`.
- Starting from Schwarzenegger Q2685, compute reachable nodes in the graph computed in the previous step. This step will produce the collection of nodes that will be part of the Schwarzenegger graph.
- Extract from Wikidata all the edges that connect nodes from the previous step.
- Extract from Wikidata the time, quantity, monolingual text, string and external identifier properties.
- Extract from Wikidata the qualifiers for the edges computed in the previous steps.
- Extract from Wikidata the labels, aliases and descriptions for the Schwarzenegger nodes.
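
The first three steps above can be sketched in plain Python on toy data. This is only an illustration of the idea, not how KGTK implements it: the node names `place_y`, `country_w`, `city_v`, `gene_z`, the classes `class_city` and `class_gene`, and the property `P999` are made up, and the sketch assumes the subclass closure lists each class among its own superclasses.

```python
from collections import deque

TARGET_CLASSES = {"Q11424", "Q5", "Q43229", "Q82794", "Q618779"}

def in_target(node, p31, p279star):
    """True if some class of `node` has a target class among its superclasses."""
    return any(p279star.get(cls, {cls}) & TARGET_CLASSES for cls in p31.get(node, ()))

def select_links(item_edges, p31, p279star):
    """Step 1: keep edges whose two endpoints are instances of target classes."""
    return [(n1, n2) for n1, _prop, n2 in item_edges
            if in_target(n1, p31, p279star) and in_target(n2, p31, p279star)]

def reachable(root, links, depth_limit):
    """Step 2: breadth-first search over the links, at most depth_limit hops."""
    out = {}
    for n1, n2 in links:
        out.setdefault(n1, []).append(n2)
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == depth_limit:
            continue  # do not expand nodes that sit at the depth limit
        for nxt in out.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

def induced_edges(item_edges, nodes):
    """Step 3: keep the original edges that connect two selected nodes."""
    return [edge for edge in item_edges if edge[0] in nodes and edge[2] in nodes]

# Toy data (hypothetical except for the target class ids and the root).
p31 = {"Q2685": ["Q5"], "place_y": ["class_city"], "country_w": ["Q82794"],
       "city_v": ["class_city"], "gene_z": ["class_gene"]}
p279star = {"Q5": {"Q5"}, "Q82794": {"Q82794"},
            "class_city": {"class_city", "Q82794"}, "class_gene": {"class_gene"}}
item_edges = [("Q2685", "P19", "place_y"), ("Q2685", "P999", "gene_z"),
              ("place_y", "P17", "country_w"), ("country_w", "P36", "city_v")]

links = select_links(item_edges, p31, p279star)
nodes = reachable("Q2685", links, depth_limit=2)
graph = induced_edges(item_edges, nodes)
print(sorted(nodes))
print(graph)
```

The edge to `gene_z` is dropped in step 1 because its class is not under any target class, and `city_v` is dropped in step 2 because it is three hops from the root.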

## Extract a subset of Wikidata to use as the base for the Schwarzenegger graph

This query takes a really long time, so don't re-execute unless you have to.

In [23]:
%%time
!$kypher -i p31 -i item -i p279star \
--match ' \
    p31: (n1)-[]->(n1_class), \
    item: (n1)-[l]->(n2), \
    p31: (n2)-[]->(n2_class), \
    p279star: (n1_class)-[]->(n1_superclass), \
    p279star: (n2_class)-[]->(n2_superclass)' \
--where 'n1_superclass in ["Q11424", "Q5", "Q43229", "Q82794", "Q618779"] and n2_superclass in ["Q11424", "Q5", "Q43229", "Q82794", "Q618779"]' \
--return 'distinct n1 as node1, "link" as label, n2 as node2, l as id' \
-o "$TEMP"/item.per.org.cw.geo.award.link.tsv.gz 

[2021-10-02 17:24:53 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_6_c1."node1" "_aLias.node1", ? "_aLias.label", graph_2_c2."node2" "_aLias.node2", graph_2_c2."id" "_aLias.id"
     FROM graph_2 AS graph_2_c2
     INNER JOIN graph_6 AS graph_6_c1, graph_6 AS graph_6_c3, graph_8 AS graph_8_c4, graph_8 AS graph_8_c5
     ON graph_2_c2."node2" = graph_6_c3."node1"
        AND graph_6_c1."node1" = graph_2_c2."node1"
        AND graph_6_c1."node2" = graph_8_c4."node1"
        AND graph_6_c3."node2" = graph_8_c5."node1"
        AND ((graph_8_c4."node2" IN (?, ?, ?, ?, ?)) AND (graph_8_c5."node2" IN (?, ?, ?, ?, ?)))
  PARAS: ['link', 'Q11424', 'Q5', 'Q43229', 'Q82794', 'Q618779', 'Q11424', 'Q5', 'Q43229', 'Q82794', 'Q618779']
---------------------------------------------
[2021-10-02 17:24:53 sqlstore]: CREATE INDEX on table graph_6 column node2 ...
[2021-10-02 17:26:12 sqlstore]: ANALYZE INDEX on table graph_6 column node2 ...
[2021-10-02 17:2

The original graph contains qualifier values that we want to follow in the reachability search. To do so, we create `link` edges from the value of each statement to the item-valued qualifiers defined on that statement.

In [24]:
%%time
!$kypher -i qualifiers -i datatypes -i "$TEMP"/item.per.org.cw.geo.award.link.tsv.gz --as links \
--match ' \
    links: ()-[l]->(n2), \
    qualifiers: (l)-[q {label: property}]->(qualifier), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["wikibase-item"]' \
--return 'n2 as node1, "link" as label, qualifier as node2' \
/ add-id --id-style wikidata \
/ cat -i - -i "$TEMP"/item.per.org.cw.geo.award.link.tsv.gz \
-o "$TEMP"/item.per.org.cw.geo.award.link.qualifier.tsv.gz

[2021-10-02 20:53:07 sqlstore]: IMPORT graph directly into table graph_18 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/item.per.org.cw.geo.award.link.tsv.gz ...
[2021-10-02 20:56:37 query]: SQL Translation:
---------------------------------------------
  SELECT graph_18_c1."node2" "_aLias.node1", ? "_aLias.label", graph_5_c2."node2" "_aLias.node2"
     FROM graph_18 AS graph_18_c1
     INNER JOIN graph_4 AS graph_4_c3, graph_5 AS graph_5_c2
     ON graph_18_c1."id" = graph_5_c2."node1"
        AND graph_4_c3."node1" = graph_5_c2."label"
        AND graph_4_c3."label" = ?
        AND graph_5_c2."label" = graph_4_c3."node1"
        AND (graph_4_c3."node2" IN (?))
  PARAS: ['link', 'datatype', 'wikibase-item']
---------------------------------------------
[2021-10-02 20:56:37 sqlstore]: CREATE INDEX on table graph_5 column label ...
[2021-10-02 21:01:07 sqlstore]: ANALYZE INDEX on table graph_5 column label ...
[2021-10-02 21:01:25 sqlstore]: CREAT
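
The join in the query above can be pictured with a small Python sketch. It is a simplified stand-in for the Kypher query, assuming that edge ids are unique and that the inputs fit in memory; `award_a`, `film_x`, `film_y` and the statement ids `s1` and `s9` are made-up values.

```python
def qualifier_links(links, qualifiers, datatypes):
    """Build `link` edges from a statement's value to its item-valued qualifiers.

    links:      iterable of (node1, node2, edge_id)
    qualifiers: iterable of (statement_id, property, value)
    datatypes:  dict mapping a property to its datatype
    """
    # Index the value (node2) of every link edge by its edge id.
    value_of = {edge_id: node2 for _node1, node2, edge_id in links}
    result = []
    for statement_id, prop, value in qualifiers:
        # Keep only qualifiers on known statements whose value is an item.
        if statement_id in value_of and datatypes.get(prop) == "wikibase-item":
            result.append((value_of[statement_id], "link", value))
    return result

links = [("Q2685", "award_a", "s1")]
qualifiers = [
    ("s1", "P1686", "film_x"),      # item-valued qualifier: kept
    ("s1", "P585", "^2004-01-01"),  # time-valued qualifier: skipped
    ("s9", "P1686", "film_y"),      # statement not among the links: skipped
]
datatypes = {"P1686": "wikibase-item", "P585": "time"}

extra_links = qualifier_links(links, qualifiers, datatypes)
print(extra_links)
```

Only the first qualifier yields an edge, from `award_a` to `film_x`, so a reachability search that arrives at the award can continue on to the film.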

Starting from `ROOT`, traverse links forward in breadth-first mode up to a fixed number of levels (3 in the command below) to build the graph

In [25]:
%%time
!$kgtk reachable-nodes \
    --root $ROOT \
    --prop link \
    --label "reachable" \
    --selflink \
    --breadth-first --depth-limit 3 \
    -i "$TEMP"/item.per.org.cw.geo.award.link.qualifier.tsv.gz  \
    -o "$TEMP"/root.reachable.per.org.cw.geo.award.tsv.gz

CPU times: user 3.36 s, sys: 999 ms, total: 4.36 s
Wall time: 5min 38s


In [26]:
!$kgtk head -i "$TEMP"/root.reachable.per.org.cw.geo.award.tsv.gz

node1	label	node2
Q2685	reachable	Q2685
Q2685	reachable	Q12158205
Q2685	reachable	Q170564
Q2685	reachable	Q1765879
Q2685	reachable	Q28754213
Q2685	reachable	Q1976616
Q2685	reachable	Q551327
Q2685	reachable	Q12325509
Q2685	reachable	Q30331794
Q2685	reachable	Q557584


Index the resulting file in kypher

In [27]:
!$kypher -i $TEMP/root.reachable.per.org.cw.geo.award.tsv.gz --as root_nodes --limit 2

[2021-10-02 21:38:51 sqlstore]: IMPORT graph directly into table graph_19 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/root.reachable.per.org.cw.geo.award.tsv.gz ...
[2021-10-02 21:38:51 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_19 AS graph_19_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1	label	node2
Q2685	reachable	Q2685
Q2685	reachable	Q12158205


## Build initial graph containing the item edges

Figure out which properties are used so that we can add them as node1 values and get all the information about them.

In [28]:
%%time
!$kypher -i root_nodes -i datatypes -i claims \
--match ' \
    root_nodes: ()-[]->(n1), \
    claims: (n1)-[l {label: property}]->(), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["wikibase-item", "string", "quantity", "time", "monolingualtext"]' \
--return 'distinct "root" as node1, "link" as label, property as node2' \
-o "$TEMP"/root.nodes.property.tsv.gz

[2021-10-02 21:38:52 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT ? "_aLias.node1", ? "_aLias.label", graph_4_c3."node1" "_aLias.node2"
     FROM graph_1 AS graph_1_c2
     INNER JOIN graph_19 AS graph_19_c1, graph_4 AS graph_4_c3
     ON graph_19_c1."node2" = graph_1_c2."node1"
        AND graph_4_c3."node1" = graph_1_c2."label"
        AND graph_1_c2."label" = graph_4_c3."node1"
        AND graph_4_c3."label" = ?
        AND (graph_4_c3."node2" IN (?, ?, ?, ?, ?))
  PARAS: ['root', 'link', 'datatype', 'wikibase-item', 'string', 'quantity', 'time', 'monolingualtext']
---------------------------------------------
[2021-10-02 21:38:52 sqlstore]: CREATE INDEX on table graph_19 column node2 ...
[2021-10-02 21:38:52 sqlstore]: ANALYZE INDEX on table graph_19 column node2 ...
[2021-10-02 21:38:52 sqlstore]: CREATE INDEX on table graph_1 column node1 ...
[2021-10-02 21:53:50 sqlstore]: ANALYZE INDEX on table graph_1 column node1 ...
[2021-10-02 21:

Concatenate the new nodes with the ones we found via reachability

In [29]:
!kgtk cat -i "$TEMP"/root.nodes.property.tsv.gz -i "$TEMP"/root.reachable.per.org.cw.geo.award.tsv.gz \
-o "$TEMP"/root.nodes.all.tsv.gz

Print number of nodes that we have so far for the new graph

In [30]:
!zcat < "$TEMP"/root.nodes.all.tsv.gz | wc -l

   35796
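
Note that `wc -l` counts every line of the file, including the header row. A minimal Python sketch that counts only the data rows could look like the following; `count_data_rows` and the demo file are hypothetical helpers, not part of KGTK.

```python
import gzip
import os
import tempfile

def count_data_rows(path):
    """Count the rows of a gzipped TSV file, excluding the header line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        next(handle, None)  # skip the header if the file is not empty
        return sum(1 for _ in handle)

# Demo on a tiny file with one header line and two edges.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("node1\tlabel\tnode2\n")
    handle.write("Q2685\treachable\tQ2685\n")
    handle.write("root\tlink\tP112\n")

demo_rows = count_data_rows(demo_path)
print(demo_rows)  # 2
```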


Update the Kypher database

In [31]:
!$kypher -i "$TEMP"/root.nodes.all.tsv.gz --as root_nodes --limit 2

[2021-10-02 22:20:17 sqlstore]: DROP graph data table graph_19 from root_nodes
[2021-10-02 22:20:17 sqlstore]: IMPORT graph directly into table graph_19 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/root.nodes.all.tsv.gz ...
[2021-10-02 22:20:17 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_19 AS graph_19_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1	label	node2
root	link	P1082
root	link	P112


Extract the item to item edges connecting the nodes in the new graph

In [68]:
%%time
!$kypher -i root_nodes -i item \
--match ' \
    root_nodes: ()-[]->(n1), \
    root_nodes: ()-[]->(n2), \
    item: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.item.tsv.gz

[2021-10-02 22:49:41 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_19_c1."node2" "_aLias.node1", graph_2_c3."label" "_aLias.label", graph_19_c2."node2" "_aLias.node2", graph_2_c3."id" "_aLias.id"
     FROM graph_19 AS graph_19_c1
     INNER JOIN graph_19 AS graph_19_c2, graph_2 AS graph_2_c3
     ON graph_19_c1."node2" = graph_2_c3."node1"
        AND graph_19_c2."node2" = graph_2_c3."node2"
  PARAS: []
---------------------------------------------
CPU times: user 245 ms, sys: 75.6 ms, total: 320 ms
Wall time: 23.5 s


Add to the kypher database

In [69]:
!$kypher -i $OUT/root.graph.item.tsv.gz --as rootitems --limit 2

[2021-10-02 22:50:04 sqlstore]: DROP graph data table graph_20 from rootitems
[2021-10-02 22:50:04 sqlstore]: IMPORT graph directly into table graph_23 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/root.graph.item.tsv.gz ...
[2021-10-02 22:50:05 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_23 AS graph_23_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1	label	node2	id
P1001	P1855	Q11696	P1001-P1855-Q11696-cdbf391b-0
P1001	P1855	Q181574	P1001-P1855-Q181574-7f428c9b-0


## Extract the other types of edges

Extract the quantities

In [73]:
%%time
!$kypher -i quantity -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    quantity: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.quantity.tsv.gz

[2021-10-02 22:51:33 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_19_c1."node2" "_aLias.node1", graph_9_c2."label" "_aLias.label", graph_9_c2."node2" "_aLias.node2", graph_9_c2."id" "_aLias.id"
     FROM graph_19 AS graph_19_c1
     INNER JOIN graph_9 AS graph_9_c2
     ON graph_19_c1."node2" = graph_9_c2."node1"
  PARAS: []
---------------------------------------------
[2021-10-02 22:51:33 sqlstore]: CREATE INDEX on table graph_9 column node1 ...
[2021-10-02 22:52:38 sqlstore]: ANALYZE INDEX on table graph_9 column node1 ...
CPU times: user 839 ms, sys: 254 ms, total: 1.09 s
Wall time: 1min 13s


Extract the time edges

In [74]:
%%time
!$kypher -i time -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    time: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.time.tsv.gz

[2021-10-02 22:52:47 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_19_c1."node2" "_aLias.node1", graph_10_c2."label" "_aLias.label", graph_10_c2."node2" "_aLias.node2", graph_10_c2."id" "_aLias.id"
     FROM graph_10 AS graph_10_c2
     INNER JOIN graph_19 AS graph_19_c1
     ON graph_19_c1."node2" = graph_10_c2."node1"
  PARAS: []
---------------------------------------------
[2021-10-02 22:52:47 sqlstore]: CREATE INDEX on table graph_10 column node1 ...
[2021-10-02 22:53:23 sqlstore]: ANALYZE INDEX on table graph_10 column node1 ...
CPU times: user 483 ms, sys: 145 ms, total: 628 ms
Wall time: 42.6 s


Extract the monolingual text edges

In [75]:
%%time
!$kypher -i monolingualtext -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    monolingualtext: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.monolingual.tsv.gz

[2021-10-02 22:53:30 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_19_c1."node2" "_aLias.node1", graph_13_c2."label" "_aLias.label", graph_13_c2."node2" "_aLias.node2", graph_13_c2."id" "_aLias.id"
     FROM graph_13 AS graph_13_c2
     INNER JOIN graph_19 AS graph_19_c1
     ON graph_19_c1."node2" = graph_13_c2."node1"
  PARAS: []
---------------------------------------------
[2021-10-02 22:53:30 sqlstore]: CREATE INDEX on table graph_13 column node1 ...
[2021-10-02 22:54:11 sqlstore]: ANALYZE INDEX on table graph_13 column node1 ...
CPU times: user 540 ms, sys: 165 ms, total: 706 ms
Wall time: 48.3 s


Extract the string edges

In [76]:
%%time
!$kypher -i string -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    string: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.string.tsv.gz

[2021-10-02 22:54:18 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_19_c1."node2" "_aLias.node1", graph_14_c2."label" "_aLias.label", graph_14_c2."node2" "_aLias.node2", graph_14_c2."id" "_aLias.id"
     FROM graph_14 AS graph_14_c2
     INNER JOIN graph_19 AS graph_19_c1
     ON graph_19_c1."node2" = graph_14_c2."node1"
  PARAS: []
---------------------------------------------
CPU times: user 122 ms, sys: 41.2 ms, total: 164 ms
Wall time: 11.1 s


Extract the external identifiers

In [77]:
%%time
!$kypher -i external_id -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    external_id: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.external_ids.tsv.gz

[2021-10-02 22:54:29 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_19_c1."node2" "_aLias.node1", graph_11_c2."label" "_aLias.label", graph_11_c2."node2" "_aLias.node2", graph_11_c2."id" "_aLias.id"
     FROM graph_11 AS graph_11_c2
     INNER JOIN graph_19 AS graph_19_c1
     ON graph_19_c1."node2" = graph_11_c2."node1"
  PARAS: []
---------------------------------------------
[2021-10-02 22:54:29 sqlstore]: CREATE INDEX on table graph_11 column node1 ...
[2021-10-02 22:56:29 sqlstore]: ANALYZE INDEX on table graph_11 column node1 ...
CPU times: user 1.66 s, sys: 492 ms, total: 2.16 s
Wall time: 2min 25s


## Complete the graph

Concatenate all the edge files extracted so far, including the external identifiers

In [78]:
%%time
!kgtk cat \
-i $OUT/root.graph.item.tsv.gz \
-i $OUT/root.graph.quantity.tsv.gz \
-i $OUT/root.graph.time.tsv.gz \
-i $OUT/root.graph.monolingual.tsv.gz \
-i $OUT/root.graph.string.tsv.gz \
-i $OUT/root.graph.external_ids.tsv.gz \
-o $OUT/root.graph.item.quantity.time.monolingual.string.tsv.gz 

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.tsv.gz --as rootbase --limit 2

[2021-10-02 22:57:01 sqlstore]: IMPORT graph directly into table graph_24 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/root.graph.item.quantity.time.monolingual.string.tsv.gz ...
[2021-10-02 22:57:05 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_24 AS graph_24_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1	label	node2	id
P1001	P1855	Q11696	P1001-P1855-Q11696-cdbf391b-0
P1001	P1855	Q181574	P1001-P1855-Q181574-7f428c9b-0
CPU times: user 134 ms, sys: 51.3 ms, total: 185 ms
Wall time: 12 s


### Collect all the properties

Get edges for the properties

In [79]:
%%time
!$kypher -i rootbase -i wikibase_property \
--match ' \
    rootbase: ()-[l {label: property}]->(), \
    wikibase_property: (property)-[lp]->(n) \
    ' \
--return 'distinct property as node1, lp.label as label, n as node2, lp as id' \
/ sort \
-o $OUT/root.graph.property.tsv.gz

[2021-10-02 22:57:06 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_3_c2."node1" "_aLias.node1", graph_3_c2."label" "_aLias.label", graph_3_c2."node2" "_aLias.node2", graph_3_c2."id" "_aLias.id"
     FROM graph_24 AS graph_24_c1
     INNER JOIN graph_3 AS graph_3_c2
     ON graph_3_c2."node1" = graph_24_c1."label"
        AND graph_24_c1."label" = graph_3_c2."node1"
  PARAS: []
---------------------------------------------
[2021-10-02 22:57:06 sqlstore]: CREATE INDEX on table graph_24 column label ...
[2021-10-02 22:57:07 sqlstore]: ANALYZE INDEX on table graph_24 column label ...
[2021-10-02 22:57:07 sqlstore]: CREATE INDEX on table graph_3 column node1 ...
[2021-10-02 22:57:07 sqlstore]: ANALYZE INDEX on table graph_3 column node1 ...
CPU times: user 41.4 ms, sys: 18.4 ms, total: 59.8 ms
Wall time: 3.34 s


Update the base

In [80]:
%%time
!kgtk cat \
-i $OUT/root.graph.item.quantity.time.monolingual.string.tsv.gz \
-i $OUT/root.graph.property.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.tsv.gz 

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.tsv.gz --as rootbase --limit 2

[2021-10-02 22:57:28 sqlstore]: DROP graph data table graph_24 from rootbase
[2021-10-02 22:57:29 sqlstore]: IMPORT graph directly into table graph_24 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/root.graph.item.quantity.time.monolingual.string.property.tsv.gz ...
[2021-10-02 22:57:33 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_24 AS graph_24_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1	label	node2	id
P1001	P1647	P276	P1001-P1647-P276-e4e44f83-0
P1001	P1659	P1269	P1001-P1659-P1269-785921cd-0
CPU times: user 265 ms, sys: 90.9 ms, total: 355 ms
Wall time: 24.4 s


### Compute qualifiers

In [81]:
%%time
!$kypher -i qualifiers -i rootbase \
--match ' \
    rootbase: ()-[l]->(), \
    qualifiers: (l)-[lq {label: property}]->(n) \
    ' \
--return 'distinct l as node1, property as label, n as node2, lq as id' \
/ sort \
-o $OUT/root.graph.qualifiers.tsv.gz

[2021-10-02 22:57:34 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."id" "_aLias.node1", graph_5_c2."label" "_aLias.label", graph_5_c2."node2" "_aLias.node2", graph_5_c2."id" "_aLias.id"
     FROM graph_24 AS graph_24_c1
     INNER JOIN graph_5 AS graph_5_c2
     ON graph_24_c1."id" = graph_5_c2."node1"
        AND graph_5_c2."label" = graph_5_c2."label"
  PARAS: []
---------------------------------------------
[2021-10-02 22:57:34 sqlstore]: CREATE INDEX on table graph_24 column id ...
[2021-10-02 22:57:34 sqlstore]: ANALYZE INDEX on table graph_24 column id ...
CPU times: user 211 ms, sys: 68.1 ms, total: 279 ms
Wall time: 19.8 s


Update the base again

In [82]:
%%time
!kgtk cat \
-i $OUT/root.graph.item.quantity.time.monolingual.string.property.tsv.gz \
-i $OUT/root.graph.qualifiers.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.tsv.gz 

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.tsv.gz --as rootbase --limit 2

[2021-10-02 22:58:17 sqlstore]: DROP graph data table graph_24 from rootbase
[2021-10-02 22:58:18 sqlstore]: IMPORT graph directly into table graph_24 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/root.graph.item.quantity.time.monolingual.string.property.qualifiers.tsv.gz ...
[2021-10-02 22:58:23 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_24 AS graph_24_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1	label	node2	id
P1001	P1647	P276	P1001-P1647-P276-e4e44f83-0
P1001	P1659	P1269	P1001-P1659-P1269-785921cd-0
CPU times: user 338 ms, sys: 113 ms, total: 450 ms
Wall time: 30.8 s


### Add the units

Find all values of quantity properties, and get the units defined for them.

> `kgtk_quantity_wd_units` throws an exception when it gets a quantity without units, so we have to hack around that using grep.

In [83]:
%%time
!$kypher -i datatypes -i rootbase \
--match ' \
    rootbase: ()-[l {label: property}]->(n2), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["quantity"]' \
--return 'distinct n2' \
| grep Q > "$TEMP"/units.noheader.tsv

!echo -e "node1" | cat - "$TEMP"/units.noheader.tsv > "$TEMP"/quantities.units.tsv

[2021-10-02 22:58:24 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."node2"
     FROM graph_24 AS graph_24_c1
     INNER JOIN graph_4 AS graph_4_c2
     ON graph_4_c2."node1" = graph_24_c1."label"
        AND graph_24_c1."label" = graph_4_c2."node1"
        AND graph_4_c2."label" = ?
        AND (graph_4_c2."node2" IN (?))
  PARAS: ['datatype', 'quantity']
---------------------------------------------
[2021-10-02 22:58:24 sqlstore]: CREATE INDEX on table graph_24 column label ...
[2021-10-02 22:58:25 sqlstore]: ANALYZE INDEX on table graph_24 column label ...
CPU times: user 19.1 ms, sys: 14 ms, total: 33.1 ms
Wall time: 1.67 s


In [84]:
%%time
!$kypher -i "$TEMP"/quantities.units.tsv \
--match '(quantity)' \
--return 'distinct kgtk_quantity_wd_units(quantity) as node1' \
-o "$TEMP"/units.tsv

[2021-10-02 22:58:26 sqlstore]: DROP graph data table graph_21 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/quantities.units.tsv
[2021-10-02 22:58:26 sqlstore]: IMPORT graph directly into table graph_25 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/quantities.units.tsv ...
[2021-10-02 22:58:26 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT kgtk_quantity_wd_units(graph_25_c1."node1") "_aLias.node1"
     FROM graph_25 AS graph_25_c1
  PARAS: []
---------------------------------------------
CPU times: user 9.82 ms, sys: 8.99 ms, total: 18.8 ms
Wall time: 673 ms
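
For intuition, the filtering and unit extraction can be imitated in Python. This is a rough stand-in, not the implementation of `kgtk_quantity_wd_units`: it assumes that a quantity literal ends with the Q-node of its unit when it has one (for example `+188Q174728`), and it returns `None` instead of raising an exception when there is no unit.

```python
import re

# Assumption: a Wikidata unit appears as a trailing Q-node in the literal.
_UNIT = re.compile(r"(Q[0-9]+)$")

def wd_unit(quantity):
    """Return the trailing unit Q-node of a quantity literal, or None."""
    match = _UNIT.search(quantity)
    return match.group(1) if match else None

def distinct_units(quantities):
    """Collect the distinct units, skipping quantities without units."""
    return sorted({unit for unit in map(wd_unit, quantities) if unit is not None})

sample = ["+188Q174728", "+3", "+1.88[+1.87,+1.89]Q11573", "+2Q11573"]
units = distinct_units(sample)
print(units)
```

Skipping the unitless values inside the function plays the role of the `grep Q` filter in the cell above.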


Now that we have the units in a file, we can get all the properties we want about them

In [85]:
%%time
!$kypher -i "$TEMP"/units.tsv -i item -i datatypes \
--match ' \
    units: (unit), \
    datatypes: (property)-[:datatype]->(datatype), \
    item: (unit)-[l {label: property}]->(n2) \
    ' \
--where 'datatype in ["wikibase-item", "string", "quantity", "time", "monolingualtext"]' \
--return 'distinct unit as node1, property as label, n2 as node2, l as id' \
/ cat -i - -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.units.tsv.gz

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.units.tsv.gz --as rootbase --limit 2

[2021-10-02 22:58:27 sqlstore]: DROP graph data table graph_22 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/units.tsv
[2021-10-02 22:58:27 sqlstore]: IMPORT graph directly into table graph_26 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/units.tsv ...
[2021-10-02 22:58:27 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_26_c1."node1" "_aLias.node1", graph_4_c2."node1" "_aLias.label", graph_2_c3."node2" "_aLias.node2", graph_2_c3."id" "_aLias.id"
     FROM graph_2 AS graph_2_c3
     INNER JOIN graph_26 AS graph_26_c1, graph_4 AS graph_4_c2
     ON graph_26_c1."node1" = graph_2_c3."node1"
        AND graph_4_c2."node1" = graph_2_c3."label"
        AND graph_2_c3."label" = graph_4_c2."node1"
        AND graph_4_c2."label" = ?
        AND (graph_4_c2."node2" IN (?, ?, ?, ?, ?))
  PARAS: ['datatype', 'wikibase-item', 'string', 'quantity', 'time', 'monolingualtext']


### Make sure that every q-node has at least P31 and P279
We need to do it twice, once for node1 and once for node2.

In [86]:
%%time
!$kypher -i rootbase -i claims \
--match 'rootbase: (n)-[]->(), claims: (n)-[l {label: property}]->(n2)' \
--where 'property in ["P31", "P279"]' \
--return 'distinct n as node1, property as label, n2 as node2, l as id' \
-o "$TEMP"/root.node1.P31.P279.tsv.gz

[2021-10-02 23:07:57 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."node1" "_aLias.node1", graph_1_c2."label" "_aLias.label", graph_1_c2."node2" "_aLias.node2", graph_1_c2."id" "_aLias.id"
     FROM graph_1 AS graph_1_c2
     INNER JOIN graph_24 AS graph_24_c1
     ON graph_24_c1."node1" = graph_1_c2."node1"
        AND graph_1_c2."label" = graph_1_c2."label"
        AND (graph_1_c2."label" IN (?, ?))
  PARAS: ['P31', 'P279']
---------------------------------------------
[2021-10-02 23:07:57 sqlstore]: CREATE INDEX on table graph_24 column node1 ...
[2021-10-02 23:07:58 sqlstore]: ANALYZE INDEX on table graph_24 column node1 ...
CPU times: user 8.86 s, sys: 2.59 s, total: 11.5 s
Wall time: 14min 12s


In [87]:
%%time
!$kypher -i rootbase -i claims \
--match 'rootbase: ()-[]->(n), claims: (n)-[l {label: property}]->(n2)' \
--where 'property in ["P31", "P279"]' \
--return 'distinct n as node1, property as label, n2 as node2, l as id' \
-o "$TEMP"/root.node2.P31.P279.tsv.gz

[2021-10-02 23:22:10 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."node2" "_aLias.node1", graph_1_c2."label" "_aLias.label", graph_1_c2."node2" "_aLias.node2", graph_1_c2."id" "_aLias.id"
     FROM graph_1 AS graph_1_c2
     INNER JOIN graph_24 AS graph_24_c1
     ON graph_24_c1."node2" = graph_1_c2."node1"
        AND graph_1_c2."label" = graph_1_c2."label"
        AND (graph_1_c2."label" IN (?, ?))
  PARAS: ['P31', 'P279']
---------------------------------------------
[2021-10-02 23:22:10 sqlstore]: CREATE INDEX on table graph_24 column node2 ...
[2021-10-02 23:22:12 sqlstore]: ANALYZE INDEX on table graph_24 column node2 ...
CPU times: user 8.75 s, sys: 2.6 s, total: 11.3 s
Wall time: 13min 49s
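
The two queries above look up class edges for nodes in the node1 position and in the node2 position separately. A toy Python sketch of the same lookup, done in one pass over the union of both positions, is shown below; the function name `class_edges` and the sample nodes are hypothetical.

```python
def class_edges(base_edges, claims):
    """Return the P31/P279 claims whose node1 appears anywhere in base_edges."""
    nodes = set()
    for node1, _label, node2 in base_edges:
        nodes.add(node1)  # first pass in the notebook: node1 position
        nodes.add(node2)  # second pass in the notebook: node2 position
    return [claim for claim in claims
            if claim[0] in nodes and claim[1] in ("P31", "P279")]

base_edges = [("Q2685", "P19", "place_y")]
claims = [
    ("Q2685", "P31", "Q5"),
    ("place_y", "P31", "class_city"),
    ("place_y", "P17", "country_w"),  # not a class edge: skipped
    ("other", "P31", "Q5"),           # node not in the base graph: skipped
]

found = class_edges(base_edges, claims)
print(found)
```

Literal values in the node2 position are harmless here because they never appear as node1 of a claim.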


Recreate the base file
> The output file should have the `.units` segment in its name, but I didn't add it so that I don't have to modify all the other commands.
> A better design for the file names would not have this problem.

In [88]:
%%time
!kgtk cat \
-i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.units.tsv.gz \
-i "$TEMP"/root.node2.P31.P279.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.tsv.gz 

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.tsv.gz --as rootbase --limit 2

[2021-10-02 23:36:25 sqlstore]: DROP graph data table graph_24 from rootbase
[2021-10-02 23:36:28 sqlstore]: IMPORT graph directly into table graph_24 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.tsv.gz ...
[2021-10-02 23:36:34 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_24 AS graph_24_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1	label	node2	id
P10	P31	Q18610173	P10-P31-Q18610173-85ef4d24-0
P1000	P31	Q18608871	P1000-P31-Q18608871-093affb5-0
CPU times: user 395 ms, sys: 133 ms, total: 527 ms
Wall time: 35.9 s


### Incorporate all nodes up to the top of the class hierarchy
When we do a breadth-first traversal, we may not follow enough links on the P279 hierarchy to reach the top. We need to do a full traversal on the P279 hierarchy to incorporate all the relevant classes.

Approach:
- Create a graph including P31 and P279 to do the traversal
- Create a file of all the nodes in the Schwarzenegger file to use as roots
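
The idea behind this traversal can be sketched in Python: starting from many roots at once, follow P31 and P279 edges upward with no depth limit until no new class is found. This is an illustrative sketch only; all class names in the toy data are made up.

```python
def upward_closure(roots, links):
    """Return the roots plus every node reachable by following links upward.

    links: iterable of (node1, node2) pairs taken from P31 and P279 edges.
    """
    parents = {}
    for child, parent in links:
        parents.setdefault(child, set()).add(parent)
    seen = set(roots)
    stack = list(roots)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:  # guards against cycles in the hierarchy
                seen.add(parent)
                stack.append(parent)
    return seen

links = [
    ("Q2685", "Q5"),
    ("Q5", "class_mammal"),
    ("class_mammal", "class_animal"),
    ("class_animal", "class_top"),
    ("unrelated", "class_other"),
]

closure = upward_closure({"Q2685"}, links)
print(sorted(closure))
```

Unlike the depth-limited search used to pick the initial nodes, this traversal always reaches the top of the hierarchy for every root.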

In [89]:
%%time
!$kypher -i claims \
--match '(n1)-[l {label:property}]->(n2)' \
--where 'property in ["P31", "P279"]' \
--return 'distinct n1 as node1, "link" as label, n2 as node2' \
-o "$TEMP"/P31.P279.subgraph.tsv.gz

[2021-10-02 23:36:35 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node1" "_aLias.node1", ? "_aLias.label", graph_1_c1."node2" "_aLias.node2"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label" = graph_1_c1."label"
        AND (graph_1_c1."label" IN (?, ?))
  PARAS: ['link', 'P31', 'P279']
---------------------------------------------
CPU times: user 13.2 s, sys: 3.94 s, total: 17.1 s
Wall time: 22min 13s


#### Create the roots

Find roots in node1

> This step also picks up qualifier ids in node1, so `reachable-nodes` receives more roots than necessary. It would be nice to eliminate the qualifier ids here.

In [90]:
%%time
!$kypher -i rootbase \
--match '(n)-[]->()' \
--return 'distinct n as node1' \
-o "$TEMP"/root.node1.tsv.gz

[2021-10-02 23:58:49 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."node1" "_aLias.node1"
     FROM graph_24 AS graph_24_c1
  PARAS: []
---------------------------------------------
CPU times: user 37.6 ms, sys: 17.9 ms, total: 55.5 ms
Wall time: 3.81 s
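
One possible way to drop the qualifier ids mentioned in the note above is to keep only values that look like bare entity ids. The sketch below is plain Python and rests on an assumption: entity ids have the form `Q123` or `P123`, and anything else, such as an edge id, is not wanted as a root. The pattern and the function name `keep_entity_ids` are illustrative, not part of KGTK.

```python
import re

# Assumption: entity ids look like Q123 or P123; anything else (for example
# an edge id such as "P10-P31-Q18610173-85ef4d24-0") is treated as a non-entity.
ENTITY_ID = re.compile(r"[QP][1-9][0-9]*")


def keep_entity_ids(values):
    """Keep only values that are bare Q-node or P-node ids, preserving order."""
    return [v for v in values if ENTITY_ID.fullmatch(v)]


candidates = ["Q2685", "P10-P31-Q18610173-85ef4d24-0", "P1000"]
print(keep_entity_ids(candidates))  # ['Q2685', 'P1000']
```

If the graph contains other kinds of entity ids, the pattern would need to be widened accordingly.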


Find roots in node2

In [91]:
%%time
!$kypher -i rootbase -i datatypes \
--match ' \
    rootbase: ()-[l {label: property}]->(n), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["wikibase-item"]' \
--return 'distinct n as node1' \
-o "$TEMP"/root.node2.tsv.gz

[2021-10-02 23:58:52 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."node2" "_aLias.node1"
     FROM graph_24 AS graph_24_c1
     INNER JOIN graph_4 AS graph_4_c2
     ON graph_4_c2."node1" = graph_24_c1."label"
        AND graph_24_c1."label" = graph_4_c2."node1"
        AND graph_4_c2."label" = ?
        AND (graph_4_c2."node2" IN (?))
  PARAS: ['datatype', 'wikibase-item']
---------------------------------------------
[2021-10-02 23:58:52 sqlstore]: CREATE INDEX on table graph_24 column label ...
[2021-10-02 23:58:53 sqlstore]: ANALYZE INDEX on table graph_24 column label ...
CPU times: user 21.8 ms, sys: 12.9 ms, total: 34.7 ms
Wall time: 1.96 s


Combine the two files to create all the roots

In [92]:
%%time
!$kgtk cat --mode=NONE -i "$TEMP"/root.node1.tsv.gz -i "$TEMP"/root.node2.tsv.gz \
/ compact --mode=NONE --columns node1 \
-o "$TEMP"/root.nodes.tsv.gz

!$kypher -i "$TEMP"/root.nodes.tsv.gz --as rootnode1 --limit 2

[2021-10-02 23:58:58 sqlstore]: IMPORT graph directly into table graph_27 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/root.nodes.tsv.gz ...
[2021-10-02 23:58:58 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_27 AS graph_27_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1
P10
P1000
CPU times: user 49.3 ms, sys: 22.6 ms, total: 71.9 ms
Wall time: 4.43 s


Circumvent a problem in `reachable-nodes` where it does not accept a root file with column header `node1`

In [93]:
%%time
!$kgtk rename-columns -i "$TEMP"/root.nodes.tsv.gz --output-columns id --mode=NONE \
/ compact -o "$TEMP"/root.roots.tsv.gz

CPU times: user 35.2 ms, sys: 16.3 ms, total: 51.5 ms
Wall time: 3.24 s


Do a depth-first traversal of the P31/P279 graph using as roots all items in the Schwarzenegger graph

In [94]:
%%time
!$kgtk reachable-nodes \
    --rootfile "$TEMP"/root.roots.tsv.gz \
    --rootfilecolumn id \
    --prop link \
    --label "reachable" \
    --selflink \
    -i "$TEMP"/P31.P279.subgraph.tsv.gz \
    -o "$TEMP"/P31.P279.reachable.tsv.gz

CPU times: user 16.6 s, sys: 5.13 s, total: 21.7 s
Wall time: 26min 24s


Deduplicate the reachable nodes file

In [95]:
%%time
!$kgtk remove-columns -i "$TEMP"/P31.P279.reachable.tsv.gz --columns node1 label \
/ rename-columns --mode=NONE --output-columns node1 \
/ compact --mode=NONE --columns node1 \
-o "$TEMP"/P31.P279.reachable.dedup.tsv.gz

CPU times: user 494 ms, sys: 159 ms, total: 653 ms
Wall time: 50.3 s
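
For readers who want to see what the deduplication amounts to, here is a minimal sketch in plain Python. It assumes a tab-separated table with a header row; the helper name `distinct_column` and the sample rows are invented for illustration. The values are sorted only to make the result deterministic.

```python
import csv
import io


def distinct_column(tsv_text, column):
    """Return the sorted distinct values of one column of a tab-separated table."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return sorted({row[column] for row in reader})


# Toy rows in the node1/label/node2 layout; the identifiers are made up.
sample = (
    "node1\tlabel\tnode2\n"
    "Q1\treachable\tQ5\n"
    "Q2\treachable\tQ5\n"
    "Q2\treachable\tQ7\n"
)
print(distinct_column(sample, "node2"))  # ['Q5', 'Q7']
```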


Put all the reachable nodes in `rootnode1`

In [96]:
%%time
!$kgtk cat --mode=NONE \
-i "$TEMP"/root.nodes.tsv.gz \
-i "$TEMP"/P31.P279.reachable.dedup.tsv.gz \
/ compact --deduplicate --mode=NONE --columns node1 \
-o "$TEMP"/root.nodes.ontology.tsv.gz

!$kypher -i "$TEMP"/root.nodes.ontology.tsv.gz --as rootnode1 --limit 2

[2021-10-03 00:26:20 sqlstore]: DROP graph data table graph_27 from rootnode1
[2021-10-03 00:26:20 sqlstore]: IMPORT graph directly into table graph_27 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/root.nodes.ontology.tsv.gz ...
[2021-10-03 00:26:20 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_27 AS graph_27_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1
P10
P1000
CPU times: user 47.4 ms, sys: 24.5 ms, total: 71.9 ms
Wall time: 4.5 s


Extract all P31/P279 edges from Wikidata for all the nodes in the new graph and consolidate.

In [97]:
%%time
!$kypher -i claims -i rootnode1 \
--match ' \
    rootnode1: (n1), \
    claims: (n1)-[l {label:property}]->(n2) \
    ' \
--where 'property in ["P31", "P279"]' \
--return 'distinct n1 as node1, property as label, n2 as node2, l as id' \
/ cat -i - -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.tsv.gz

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.tsv.gz --as rootbase --limit 2

[2021-10-03 00:26:21 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_27_c1."node1" "_aLias.node1", graph_1_c2."label" "_aLias.label", graph_1_c2."node2" "_aLias.node2", graph_1_c2."id" "_aLias.id"
     FROM graph_1 AS graph_1_c2
     INNER JOIN graph_27 AS graph_27_c1
     ON graph_27_c1."node1" = graph_1_c2."node1"
        AND graph_1_c2."label" = graph_1_c2."label"
        AND (graph_1_c2."label" IN (?, ?))
  PARAS: ['P31', 'P279']
---------------------------------------------
[2021-10-03 00:26:21 sqlstore]: CREATE INDEX on table graph_27 column node1 ...
[2021-10-03 00:26:21 sqlstore]: ANALYZE INDEX on table graph_27 column node1 ...
[2021-10-03 00:39:36 sqlstore]: DROP graph data table graph_24 from rootbase
[2021-10-03 00:39:37 sqlstore]: IMPORT graph directly into table graph_24 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.tsv.

I am not certain this cell is needed, that is, whether new nodes can appear after adding the P31 and P279 edges.

In [98]:
%%time
!$kypher -i rootbase -i datatypes \
--match ' \
    rootbase: ()-[l {label: property}]->(n), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["wikibase-item"]' \
--return 'distinct n as node1' \
/ cat -i - -i "$TEMP"/root.nodes.ontology.tsv.gz --mode=NONE \
/ compact --mode=NONE --columns node1 \
-o "$TEMP"/root.nodes.ontology.star.tsv.gz

!$kypher -i "$TEMP"/root.nodes.ontology.star.tsv.gz --as rootnode1 --limit 2

[2021-10-03 00:39:45 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."node2" "_aLias.node1"
     FROM graph_24 AS graph_24_c1
     INNER JOIN graph_4 AS graph_4_c2
     ON graph_4_c2."node1" = graph_24_c1."label"
        AND graph_24_c1."label" = graph_4_c2."node1"
        AND graph_4_c2."label" = ?
        AND (graph_4_c2."node2" IN (?))
  PARAS: ['datatype', 'wikibase-item']
---------------------------------------------
[2021-10-03 00:39:45 sqlstore]: CREATE INDEX on table graph_24 column label ...
[2021-10-03 00:39:45 sqlstore]: ANALYZE INDEX on table graph_24 column label ...
[2021-10-03 00:39:49 sqlstore]: DROP graph data table graph_27 from rootnode1
[2021-10-03 00:39:49 sqlstore]: IMPORT graph directly into table graph_27 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/root.nodes.ontology.star.tsv.gz ...
[2021-10-03 00:39:50 query]: SQL Translation:
----------------------------------------
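
One way to settle whether the cell above is needed would be to compare the node list before and after it, for example `root.nodes.ontology.tsv.gz` against `root.nodes.ontology.star.tsv.gz`. The sketch below is plain Python and assumes both files are gzip-compressed, tab-separated, and have a `node1` column; the function names `read_node1` and `added_nodes` are hypothetical.

```python
import csv
import gzip


def read_node1(path):
    """Read the node1 column of a gzip-compressed TSV file into a set."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["node1"] for row in reader}


def added_nodes(before_path, after_path):
    """Return the sorted nodes that appear in the second file but not the first."""
    return sorted(read_node1(after_path) - read_node1(before_path))
```

If `added_nodes` returns an empty list for the two files, the cell did not contribute any new nodes on this dataset.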

Include in node1 all the properties in the graph

In [99]:
%%time
!$kypher -i rootbase \
--match ' \
    rootbase: ()-[l {label: property}]->(n)' \
--return 'distinct property as node1' \
/ cat -i - -i "$TEMP"/root.nodes.ontology.star.tsv.gz --mode=NONE \
/ compact --mode=NONE --columns node1 \
-o "$TEMP"/root.nodes.ontology.star.property.tsv.gz

!$kypher -i "$TEMP"/root.nodes.ontology.star.property.tsv.gz --as rootnode1 --limit 2

[2021-10-03 00:39:51 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."label" "_aLias.node1"
     FROM graph_24 AS graph_24_c1
     WHERE graph_24_c1."label" = graph_24_c1."label"
  PARAS: []
---------------------------------------------
[2021-10-03 00:39:54 sqlstore]: DROP graph data table graph_27 from rootnode1
[2021-10-03 00:39:54 sqlstore]: IMPORT graph directly into table graph_27 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/temp.build-tutorial/root.nodes.ontology.star.property.tsv.gz ...
[2021-10-03 00:39:54 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_27 AS graph_27_c1
     LIMIT ?
  PARAS: [2]
---------------------------------------------
node1
P10
P1000
CPU times: user 48.7 ms, sys: 23.6 ms, total: 72.3 ms
Wall time: 4.32 s


## Add property datatypes

In [100]:
%%time
!$kypher -i datatypes -i rootbase \
--match ' \
    rootbase: ()-[r {label: property}]->(), \
    datatypes: (property)-[l:datatype]->(datatype) \
    ' \
--return 'distinct property as node1, l.label as label, datatype as node2, l as id' \
/ cat -i - -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.datatype.tsv.gz

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.datatype.tsv.gz --as rootbase --limit 2

[2021-10-03 00:39:55 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_4_c2."node1" "_aLias.node1", graph_4_c2."label" "_aLias.label", graph_4_c2."node2" "_aLias.node2", graph_4_c2."id" "_aLias.id"
     FROM graph_24 AS graph_24_c1
     INNER JOIN graph_4 AS graph_4_c2
     ON graph_4_c2."node1" = graph_24_c1."label"
        AND graph_24_c1."label" = graph_4_c2."node1"
        AND graph_4_c2."label" = ?
  PARAS: ['datatype']
---------------------------------------------
[2021-10-03 00:40:22 sqlstore]: DROP graph data table graph_24 from rootbase
[2021-10-03 00:40:23 sqlstore]: IMPORT graph directly into table graph_24 from /Users/pedroszekely/Downloads/kypher/projects/build-tutorial/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.datatype.tsv.gz ...
[2021-10-03 00:40:29 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_24 AS graph_24_c1
     LIMIT ?
  PARA

## Build labels, aliases and descriptions

Extract the label edges

In [103]:
%%time
!$kypher -i label -i rootnode1 \
--match ' \
    rootnode1: (n1)-[]->(), \
    label: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.label.tsv.gz

[2021-10-03 00:40:32 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_27_c1."node1" "_aLias.node1", graph_15_c2."label" "_aLias.label", graph_15_c2."node2" "_aLias.node2", graph_15_c2."id" "_aLias.id"
     FROM graph_15 AS graph_15_c2
     INNER JOIN graph_27 AS graph_27_c1
     ON graph_27_c1."node1" = graph_15_c2."node1"
  PARAS: []
---------------------------------------------
[2021-10-03 00:40:32 sqlstore]: CREATE INDEX on table graph_27 column node1 ...
[2021-10-03 00:40:32 sqlstore]: ANALYZE INDEX on table graph_27 column node1 ...
[2021-10-03 00:40:32 sqlstore]: CREATE INDEX on table graph_15 column node1 ...
[2021-10-03 00:41:30 sqlstore]: ANALYZE INDEX on table graph_15 column node1 ...
CPU times: user 746 ms, sys: 224 ms, total: 970 ms
Wall time: 1min 7s
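
Conceptually, the query above keeps only the label edges whose `node1` appears in the node list. A minimal plain-Python sketch of that filter follows; it assumes edges are `(node1, label, node2)` tuples, and the function name `edges_for_nodes` and the toy rows are illustrative only.

```python
def edges_for_nodes(edges, nodes):
    """Keep the edges whose node1 is in the given collection of nodes."""
    wanted = set(nodes)
    return [edge for edge in edges if edge[0] in wanted]


# Toy label edges; the node list contains only Q2685.
label_edges = [
    ("Q2685", "label", "'Arnold Schwarzenegger'@en"),
    ("Q42", "label", "'Douglas Adams'@en"),
]
print(edges_for_nodes(label_edges, ["Q2685"]))
```

The alias and description extractions follow the same pattern with a different edge file.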


Extract the alias edges

In [108]:
%%time
!$kypher -i alias -i rootnode1 \
--match ' \
    rootnode1: (n1)-[]->(), \
    alias: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.alias.tsv.gz

[2021-10-03 08:24:04 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_27_c1."node1" "_aLias.node1", graph_16_c2."label" "_aLias.label", graph_16_c2."node2" "_aLias.node2", graph_16_c2."id" "_aLias.id"
     FROM graph_16 AS graph_16_c2
     INNER JOIN graph_27 AS graph_27_c1
     ON graph_27_c1."node1" = graph_16_c2."node1"
  PARAS: []
---------------------------------------------
[2021-10-03 08:24:04 sqlstore]: CREATE INDEX on table graph_16 column node1 ...
[2021-10-03 08:24:08 sqlstore]: ANALYZE INDEX on table graph_16 column node1 ...
CPU times: user 89.1 ms, sys: 31.6 ms, total: 121 ms
Wall time: 7.37 s


Extract the description edges

In [105]:
%%time
!$kypher -i description -i rootnode1 \
--match ' \
    rootnode1: (n1)-[]->(), \
    description: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.description.tsv.gz

[2021-10-03 00:41:41 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_27_c1."node1" "_aLias.node1", graph_17_c2."label" "_aLias.label", graph_17_c2."node2" "_aLias.node2", graph_17_c2."id" "_aLias.id"
     FROM graph_17 AS graph_17_c2
     INNER JOIN graph_27 AS graph_27_c1
     ON graph_27_c1."node1" = graph_17_c2."node1"
  PARAS: []
---------------------------------------------
[2021-10-03 00:41:41 sqlstore]: CREATE INDEX on table graph_17 column node1 ...
[2021-10-03 00:42:30 sqlstore]: ANALYZE INDEX on table graph_17 column node1 ...
CPU times: user 636 ms, sys: 191 ms, total: 828 ms
Wall time: 57.1 s


## Compute useful derived files

### Inverses of `P279`

> To do: need to define the `P279_` property, its datatype, label, etc.

In [106]:
!$kypher -i rootbase \
--match '(n1)-[:P279]->(class)' \
--return 'distinct class as node1, "P279_" as label, n1 as node2' \
/ add-id --id-style wikidata \
/ sort \
-o "$OUT"/root.derived.P279inv.tsv.gz

[2021-10-03 00:42:39 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_24_c1."node2" "_aLias.node1", ? "_aLias.label", graph_24_c1."node1" "_aLias.node2"
     FROM graph_24 AS graph_24_c1
     WHERE graph_24_c1."label" = ?
  PARAS: ['P279_', 'P279']
---------------------------------------------
[2021-10-03 00:42:39 sqlstore]: CREATE INDEX on table graph_24 column label ...
[2021-10-03 00:42:40 sqlstore]: ANALYZE INDEX on table graph_24 column label ...
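
The inversion itself is simple enough to sketch in plain Python. The sketch assumes edges are `(node1, label, node2)` tuples and leaves out the edge ids that `add-id` generates in the cell above; the function name `invert_edges` and the toy identifiers are made up.

```python
def invert_edges(edges, label="P279", inverse_label="P279_"):
    """Return sorted distinct (node2, inverse_label, node1) triples for one label."""
    return sorted(
        {(node2, inverse_label, node1) for node1, lab, node2 in edges if lab == label}
    )


# Toy edges with made-up identifiers; only the P279 edges are inverted.
edges = [
    ("Q_b", "P279", "Q_a"),
    ("Q_c", "P279", "Q_a"),
    ("Q_b", "P31", "Q_x"),
]
print(invert_edges(edges))
# [('Q_a', 'P279_', 'Q_b'), ('Q_a', 'P279_', 'Q_c')]
```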


## Final files
- base: includes all edges except labels, aliases and descriptions
- labels
- aliases
- descriptions

In [109]:
%%time
!$kgtk cat \
-i "$OUT"/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.datatype.tsv.gz \
-i "$OUT"/root.graph.alias.tsv.gz \
-i "$OUT"/root.graph.label.tsv.gz \
-i "$OUT"/root.graph.description.tsv.gz \
-o "$OUT"/all.tsv.gz

CPU times: user 108 ms, sys: 38.5 ms, total: 147 ms
Wall time: 10 s
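
A quick sanity check on a combined file is to count the edges per label. The sketch below is plain Python on an in-memory sample and assumes a tab-separated table with a `label` column; the helper name `count_labels` and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_labels(tsv_text):
    """Count how many rows of a tab-separated edge table carry each label."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(row["label"] for row in reader)


# Toy rows in the node1/label/node2/id layout; ids and values are made up.
sample = (
    "node1\tlabel\tnode2\tid\n"
    "Q1\tP31\tQ5\te1\n"
    "Q1\tlabel\t'one'@en\te2\n"
    "Q2\tP31\tQ5\te3\n"
)
print(count_labels(sample))  # Counter({'P31': 2, 'label': 1})
```

Seeing counts for `label`, `alias`, and `description` next to the property labels would suggest that all four inputs made it into the combined file.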
