# Generating Useful Wikidata Files

In [1]:
import io
import os

import numpy as np
import pandas as pd

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

## Set up environment and folders to store the files

- `WIKIDATA_HOME` folder where you put your Wikidata data
- `OUT` folder where the output files go
- `TEMP` folder to keep temporary files, including the database
- `kgtk` shortcut to invoke the kgtk software
- `compress` the compression software to use

The current implementation of some of the kgtk commands does not understand compressed files. In particular, `query` often rejects `gz` files.

To dos:

- Make sure that all files have id columns as `query` gets unhappy when files have no ids.
- Create an output folder for a subset of Wikidata without scholarly articles. This is half done: the remaining work is to subtract the scholarly articles from `EDGES` and repeat the workflow.
- Change the naming convention to make it clear which files are a partition of the original `EDGES`, so users know what files they need to get to have a full version.
- Create a qualifier file for the partition files of Wikidata: this is so that if a user gets one of the partitions, they can get the corresponding qualifier file.
- Add pagerank and other stats. We can compute the pagerank from the `all.item` file, so the output could be called `all.item.pagerank.tsv`.

Naming convention: the name `all` is redundant, and we should consider removing it. I recommend using the prefix `part.` to name the partitions of Wikidata, e.g., `part.label`, `part.quantity`. Files such as `P279` are not partitions because they are subsets of `part.item`.

If we create a subset of Wikidata, e.g., no scholarly articles, we could call it `minus.Q13442814`; if we remove galaxies too, we could call it `minus.Q13442814-Q318`, so the files would be named like `minus.Q13442814-Q318.part.quantity.tsv` (the idea of `all` is in contrast to `minus`). We can also have files that start with Qnodes, e.g., `Q5.part.quantity.tsv`; constructing such files is harder because we don't want dangling nodes in the item file.
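
The proposed naming scheme can be sketched as a small helper. This is only an illustration of the convention described above: the function name `product_file_name` and its parameters are hypothetical, and the rule that `minus` and `root` cannot be combined is an assumption of this sketch, not something the convention states.

```python
from typing import Optional, Sequence


def product_file_name(
    part: str,
    minus: Sequence[str] = (),
    root: Optional[str] = None,
    extension: str = "tsv",
) -> str:
    """Build a file name following the proposed convention.

    part   -- name of the partition, e.g. "quantity" or "label"
    minus  -- Qnodes whose instances were removed, e.g. ["Q13442814", "Q318"]
    root   -- Qnode the subset starts from, e.g. "Q5"
    """
    # Assumption of this sketch: a file is either a "minus" subset or a
    # Qnode-rooted subset, never both.
    if minus and root:
        raise ValueError("use either `minus` or `root`, not both")
    pieces = []
    if minus:
        pieces.append("minus." + "-".join(minus))
    elif root:
        pieces.append(root)
    pieces.append("part." + part)
    pieces.append(extension)
    return ".".join(pieces)


print(product_file_name("quantity"))
print(product_file_name("quantity", minus=["Q13442814", "Q318"]))
print(product_file_name("quantity", root="Q5"))
```

Generating names from one function would keep the prefixes consistent if the convention changes again.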

In [2]:
#home = %env HOME
# supply home from command line
# supply wiki_file from command line as well
# f_path from command line
%env WIKIDATA_HOME=$home/$f_path
wikidata_home = %env WIKIDATA_HOME
%env OUT = $wikidata_home/output
%env TEMP = $wikidata_home/temp

# Define $kgtk so that we can turn on and off the debugging options
%env kgtk = time kgtk --debug
#%env kgtk = kgtk
%env compress = compress
%env compress = cat

env: WIKIDATA_HOME=/Users/pedroszekely/Downloads/kypher
env: OUT=/Users/pedroszekely/Downloads/kypher/output
env: TEMP=/Users/pedroszekely/Downloads/kypher/temp
env: kgtk=time kgtk --debug
env: compress=compress
env: compress=cat


In [3]:
cd $wikidata_home

/Users/pedroszekely/Downloads/kypher


In [4]:
!mkdir output
!mkdir temp

mkdir: output: File exists
mkdir: temp: File exists


Clean up the output and temp folders before we start

In [5]:
!rm $OUT/*.tsv
!rm $TEMP/*.tsv

rm: /Users/pedroszekely/Downloads/kypher/temp/*.tsv: No such file or directory


The `all` file contains 100M edges of the full dump, and `all.10` contains 10M edges. These subsets are for testing; the real run should use the full edges file.

In [6]:
%env STORE=$wikidata_home/temp/wikidata.sqlite3.db
# %env EDGES=$wikidata_home/all.10.tsv
%env EDGES=$wikidata_home/$wiki_file

#%env QUALS=$wikidata_home/wikidata-20200803-all-qualifiers.tsv.gz
#%env LABELS=$wikidata_home/wikidata-20200803-all-labels-en-sorted.tsv.gz

env: STORE=/Users/pedroszekely/Downloads/kypher/temp/wikidata.sqlite3.db
env: EDGES=/Users/pedroszekely/Downloads/kypher/all.10.tsv
env: EDGES=/Users/pedroszekely/Downloads/kypher/all.tsv


Uncomment the line below to remove the SQLite database. It takes a long time to load all the data and create indices, so don't remove the database unless you change files that have already been loaded and you need to force a reload.

In [7]:
#rm $TEMP/wikidata.sqlite3.db

### Get a sample and force importing the edge file into the database

In [8]:
!$kgtk query -i $EDGES --limit 10 --graph-cache $STORE

[2020-09-20 22:30:38 sqlstore]: IMPORT graph directly into table graph_1 from /Users/pedroszekely/Downloads/kypher/all.tsv ...
[2020-09-20 22:42:49 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type	node2;wikidatatype
Q45-label-en	Q45	label	'Portugal'@en													
Q45-label-fr	Q45	label	'Portugal'@fr													
Q45-label-nb	Q45	label	'Portugal'@nb													
Q45-label-it	Q45	label	'Portogallo'@it													
Q45-label-ru	Q45	label	'Португалия'@ru													
Q45-label-nl	Q45	label	'Portugal'@nl													
Q45-label-es	Q45	label	'Portugal'@es													
Q45-label-de	Q45	label	'Portugal'@de													
Q45-label-pl	Q45	label	'Portugalia'@pl													
Q45-label-be-t

Force creation of the index on the label column

In [9]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(i)-[:P31]->(c)' \
    --limit 5

[2020-09-20 22:42:50 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."label"=?
     LIMIT ?
  PARAS: ['P31', 5]
---------------------------------------------
[2020-09-20 22:42:50 sqlstore]: CREATE INDEX on table graph_1 column label ...
[2020-09-20 22:45:19 sqlstore]: ANALYZE INDEX on table graph_1 column label ...
id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type	node2;wikidatatype
Q45-P31-1	Q45	P31	Q3624078	normal				Q3624078							item	wikibase-item
Q45-P31-2	Q45	P31	Q6256	normal				Q6256							item	wikibase-item
Q45-P31-3	Q45	P31	Q20181813	normal				Q20181813							item	wikibase-item
Q140-P31-1	Q140	P31	Q16521	normal				Q16521							item	wikibase-item
Q183-P31-1	Q183	P31	Q3624078	preferred				Q3624078							item	wikibase-item
      161.10 real        67.03 user        21.

Force creation of the index on the node2 column

In [10]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(i)-[r]->(:Q5)' \
    --limit 5

[2020-09-20 22:45:33 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."node2"=?
     LIMIT ?
  PARAS: ['Q5', 5]
---------------------------------------------
[2020-09-20 22:45:33 sqlstore]: CREATE INDEX on table graph_1 column node2 ...
[2020-09-20 22:49:38 sqlstore]: ANALYZE INDEX on table graph_1 column node2 ...
id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type	node2;wikidatatype
Q1253-P31-1	Q1253	P31	Q5	normal				Q5							item	wikibase-item
Q1526-P31-1	Q1526	P31	Q5	normal				Q5							item	wikibase-item
Q3794-P31-1	Q3794	P31	Q5	normal				Q5							item	wikibase-item
Q4291-P31-1	Q4291	P31	Q5	normal				Q5							item	wikibase-item
Q4489-P31-1	Q4489	P31	Q5	normal				Q5							item	wikibase-item
      263.21 real       102.59 user        35.79 sys


### Count the number of edges

In [11]:
!$kgtk query -i $EDGES --graph-cache $STORE \
    --match 'all: ()-[r]->()' \
    --return 'count(r) as count' \
    --limit 10

[2020-09-20 22:49:54 query]: SQL Translation:
---------------------------------------------
  SELECT count(graph_1_c1."id") "count"
     FROM graph_1 graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
count
99999999
       52.37 real         8.40 user         9.00 sys


### Get the distribution of the label column
I would like to have it sorted numerically, but I don't know how to make that happen; the listing below is sorted as strings.

In [12]:
!$kgtk unique --column label -i $EDGES / sort2 -c node2 -r -o $OUT/all-distribution.tsv 

      326.64 real       320.87 user         5.13 sys


In [13]:
!head $OUT/all-distribution.tsv | column -t -s $'\t' 

node1  label  node2
P814   count  999
P18    count  99319
P4342  count  991
P687   count  990
P6055  count  99
P3884  count  99
P169   count  99
P1653  count  99
P282   count  9858
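
The listing above is ordered as strings (`999` comes before `99319`). One way to get a numeric ordering is to post-process the file in pandas, which this notebook already imports. The sketch below uses a few rows copied from the listing in an in-memory buffer; in the notebook, the path of `all-distribution.tsv` would be passed to `read_csv` instead. It sidesteps the question of whether kgtk itself can sort numerically.

```python
import io

import pandas as pd

# A few rows copied from the listing above; replace the buffer with the
# path of all-distribution.tsv when running against the real file.
sample = io.StringIO(
    "node1\tlabel\tnode2\n"
    "P814\tcount\t999\n"
    "P18\tcount\t99319\n"
    "P4342\tcount\t991\n"
    "P282\tcount\t9858\n"
)

distribution = pd.read_csv(sample, sep="\t", dtype={"node1": str, "label": str})
# Make sure the counts are integers so the sort is numeric, not textual.
distribution["node2"] = distribution["node2"].astype(int)
by_count = distribution.sort_values("node2", ascending=False, ignore_index=True)
print(by_count.to_string(index=False))
```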


### Compute files with labels, aliases and descriptions
Return the id, node1, label and node2 columns

In [14]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(n1)-[l:label]->(n2)' \
    --return 'l, n1, l.label, n2' \
    | $compress > $OUT/all.label.tsv

[2020-09-20 22:56:14 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['label']
---------------------------------------------
      117.57 real        38.36 user        20.29 sys


In [15]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(n1)-[l:alias]->(n2)' \
    --return 'l, n1, l.label, n2' \
    | $compress > $OUT/all.alias.tsv

[2020-09-20 22:58:11 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['alias']
---------------------------------------------
       20.87 real         7.68 user         2.70 sys


In [16]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(n1)-[l:description]->(n2)' \
    --return 'l, n1, l.label, n2' \
    | $compress > $OUT/all.description.tsv

[2020-09-20 22:58:32 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['description']
---------------------------------------------
      318.41 real       194.71 user        28.51 sys


### Now create files with the English labels, aliases and descriptions

In [17]:
!$kgtk query -i $OUT/all.label.tsv --graph-cache $STORE -o - \
    --match '()-[]->(n2)' \
    --where 'n2.kgtk_lqstring_lang = "en"' \
    | kgtk sort2 \
    | $compress > $OUT/all.label.en.tsv

[2020-09-20 23:03:51 sqlstore]: IMPORT graph directly into table graph_2 from /Users/pedroszekely/Downloads/kypher/output/all.label.tsv ...
[2020-09-20 23:04:50 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_2 graph_2_c1
     WHERE (kgtk_lqstring_lang(graph_2_c1."node2") = ?)
  PARAS: ['en']
---------------------------------------------
       83.68 real       114.10 user         3.45 sys


In [18]:
!$kgtk query -i $OUT/all.alias.tsv --graph-cache $STORE -o - \
    --match '()-[]->(n2)' \
    --where 'n2.kgtk_lqstring_lang = "en"' \
    | kgtk sort2 \
    | $compress > $OUT/all.alias.en.tsv

[2020-09-20 23:05:16 sqlstore]: IMPORT graph directly into table graph_3 from /Users/pedroszekely/Downloads/kypher/output/all.alias.tsv ...
[2020-09-20 23:05:26 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_3 graph_3_c1
     WHERE (kgtk_lqstring_lang(graph_3_c1."node2") = ?)
  PARAS: ['en']
---------------------------------------------
       14.72 real        20.29 user         0.65 sys


In [19]:
!$kgtk query -i $OUT/all.description.tsv --graph-cache $STORE -o - \
    --match '()-[]->(n2)' \
    --where 'n2.kgtk_lqstring_lang = "en"' \
    | kgtk sort2 \
    | $compress > $OUT/all.description.en.tsv

[2020-09-20 23:05:31 sqlstore]: IMPORT graph directly into table graph_4 from /Users/pedroszekely/Downloads/kypher/output/all.description.tsv ...
[2020-09-20 23:11:41 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_4 graph_4_c1
     WHERE (kgtk_lqstring_lang(graph_4_c1."node2") = ?)
  PARAS: ['en']
---------------------------------------------
      470.45 real       658.21 user        17.29 sys


Let's sample these files to see what they look like:

* We are getting all variants of English (`en-gb`, `en-ca`, ...), but we really want `en` only.
* The labels carry language tags; how do we output only the string without the language tag?

In [20]:
!head $OUT/all.label.en.tsv | column -t -s $'\t' 

id                 node1  label  node2
P1015-label-en     P1015  label  'NORAF ID'@en
P1015-label-en-gb  P1015  label  'BIBSYS ID'@en-gb
P102-label-en      P102   label  'member of political party'@en
P102-label-en-ca   P102   label  'member of political party'@en-ca
P102-label-en-gb   P102   label  'member of political party'@en-gb
P1025-label-en     P1025  label  'SUDOC editions'@en
P1057-label-en     P1057  label  'chromosome'@en
P1057-label-en-gb  P1057  label  'chromosome'@en-gb
P1064-label-en     P1064  label  'track gauge'@en
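
Both issues visible in the sample above could be handled in a Python post-processing step. The sketch below is not a kgtk feature: `split_lq_string` is a hypothetical helper, and it assumes that language-qualified strings have the shape `'text'@lang` and that embedded single quotes are escaped as `\'`. It keeps only rows whose tag is exactly `en` and drops the quotes and the tag.

```python
import re
from typing import Optional, Tuple

# Assumed shape of a language-qualified string: 'text'@lang
LQ_STRING = re.compile(r"^'(?P<text>.*)'@(?P<lang>[A-Za-z0-9-]+)$", re.DOTALL)


def split_lq_string(value: str) -> Optional[Tuple[str, str]]:
    """Return (text, language tag), or None if the value does not match."""
    match = LQ_STRING.match(value)
    if match is None:
        return None
    # Assumption: embedded single quotes are escaped as \' in the file.
    text = match.group("text").replace("\\'", "'")
    return text, match.group("lang")


# Rows copied from the sample above: (node1, node2).
rows = [
    ("P1015", "'NORAF ID'@en"),
    ("P1015", "'BIBSYS ID'@en-gb"),
    ("P102", "'member of political party'@en"),
    ("P102", "'member of political party'@en-ca"),
]

plain_english = []
for node1, value in rows:
    parsed = split_lq_string(value)
    # Exact comparison, so en-gb and en-ca are filtered out.
    if parsed is not None and parsed[1] == "en":
        plain_english.append((node1, parsed[0]))

print(plain_english)
```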


### Compute the distribution of the number of edges for each Wikidata type

In [21]:
!$kgtk unique --column 'node2;wikidatatype' -i $EDGES / sort2 -c node2 -r -o $OUT/all.wikidatatype.distribution.tsv

      296.67 real       291.06 user         4.66 sys


In [22]:
!column -t -s $'\t' $OUT/all.wikidatatype.distribution.tsv

node1              label  node2
monolingualtext    count  875537
wikibase-property  count  710
geo-shape          count  560
string             count  5431775
math               count  391
external-id        count  3559912
wikibase-lexeme    count  25
globe-coordinate   count  224842
wikibase-form      count  16
quantity           count  1434724
url                count  141091
musical-notation   count  14
tabular-data       count  128
commonsMedia       count  122171
wikibase-item      count  11793643
time               count  1049993


### Create a file to contain the edges for each wikidata type

In [23]:
types = [
    "time",
    "wikibase-item",
    "math",
    "wikibase-form",
    "quantity",
    "string",
    "external-id",
    "commonsMedia",
    "globe-coordinate",
    "monolingualtext",
    "musical-notation",
    "geo-shape",
    "wikibase-property",
    "url",
]
command = "$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(n1)-[l]->(n2 {wikidatatype: type})' \
    --return 'l, n1, l.label, n2'\
    --where 'type = \"TYPE\"' \
    | kgtk sort2 | $compress > $OUT/all.TYPE.tsv"
for wikidata_type in types:  # avoid shadowing the built-in `type`
    cmd = command.replace("TYPE", wikidata_type)
    print(cmd)
    os.system(cmd)

$kgtk query -i $EDGES --graph-cache $STORE -o -     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "time"'     | kgtk sort2 | $compress > $OUT/all.time.tsv
$kgtk query -i $EDGES --graph-cache $STORE -o -     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "wikibase-item"'     | kgtk sort2 | $compress > $OUT/all.wikibase-item.tsv
$kgtk query -i $EDGES --graph-cache $STORE -o -     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "math"'     | kgtk sort2 | $compress > $OUT/all.math.tsv
$kgtk query -i $EDGES --graph-cache $STORE -o -     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "wikibase-form"'     | kgtk sort2 | $compress > $OUT/all.wikibase-form.tsv
$kgtk query -i $EDGES --graph-cache $STORE -o -     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2' 
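
A hard-coded list can drift from the types actually present in the data. The check below compares the list used in the loop with the types in the wikidatatype distribution shown earlier. Both sets are copied by hand from this notebook; whether the two types it reports were left out on purpose is not stated here.

```python
# Types that appear in the wikidatatype distribution file shown earlier.
types_in_distribution = {
    "monolingualtext", "wikibase-property", "geo-shape", "string", "math",
    "external-id", "wikibase-lexeme", "globe-coordinate", "wikibase-form",
    "quantity", "url", "musical-notation", "tabular-data", "commonsMedia",
    "wikibase-item", "time",
}

# Types the loop iterates over.
types_in_loop = {
    "time", "wikibase-item", "math", "wikibase-form", "quantity", "string",
    "external-id", "commonsMedia", "globe-coordinate", "monolingualtext",
    "musical-notation", "geo-shape", "wikibase-property", "url",
}

# Types present in the data for which no per-type file is produced.
missing = sorted(types_in_distribution - types_in_loop)
# Types in the loop that never occur in the data.
unused = sorted(types_in_loop - types_in_distribution)

print("missing from loop:", missing)
print("not in distribution:", unused)
```

In the notebook itself, the first set could be read from the `node1` column of `all.wikidatatype.distribution.tsv` instead of being typed in.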

### Create a file with the sitelinks

In [24]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(n1)-[l:wikipedia_sitelink]->(n2)' \
    --return 'l, n1, l.label, n2' \
    | kgtk sort2 \
    | $compress > $OUT/all.wikipedia_sitelink.tsv

[2020-09-20 23:33:12 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['wikipedia_sitelink']
---------------------------------------------
       18.79 real         7.54 user         3.12 sys


### Create a file that specifies for each node whether it is an item or a property

In [25]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(n1)-[l:type]->(n2)' \
    --return 'l, n1, l.label, n2' \
    | kgtk sort2 \
    | $compress > $OUT/all.type.tsv 

[2020-09-20 23:33:32 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['type']
---------------------------------------------
       33.82 real         7.49 user         7.16 sys


### Create the P31 and P279 files

In [26]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(n1)-[l:P31]->(n2)' \
    --return 'l, n1, l.label, n2' \
    | kgtk sort2 | $compress > $OUT/all.P31.tsv

[2020-09-20 23:34:07 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['P31']
---------------------------------------------
       25.20 real         6.62 user         5.06 sys


In [27]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(n1)-[l:P279]->(n2)' \
    --return 'l, n1, l.label, n2' \
    | kgtk sort2 | $compress > $OUT/all.P279.tsv

[2020-09-20 23:34:32 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['P279']
---------------------------------------------
        1.52 real         0.77 user         0.27 sys


In [28]:
!head $OUT/all.P31.tsv | column -t -s $'\t' 

id           node1  label  node2
P1015-P31-1  P1015  P31    Q55586529
P1015-P31-2  P1015  P31    Q19595382
P1015-P31-3  P1015  P31    Q19833377
P102-P31-1   P102   P31    Q18608871
P1025-P31-1  P1025  P31    Q29547399
P1025-P31-2  P1025  P31    Q96776953
P1057-P31-1  P1057  P31    Q19887775
P1064-P31-1  P1064  P31    Q23310071
P1108-P31-1  P1108  P31    Q21294996


In [29]:
!$kgtk cat -i $OUT/all.P279.tsv -i $OUT/all.P31.tsv -o - \
    | kgtk sort2 | $compress > $OUT/all.P31_P279.tsv

        9.27 real         8.97 user         0.20 sys


In [30]:
!head $OUT/all.P31_P279.tsv | column -t -s $'\t' 

id           node1  label  node2
P1015-P31-1  P1015  P31    Q55586529
P1015-P31-2  P1015  P31    Q19595382
P1015-P31-3  P1015  P31    Q19833377
P102-P31-1   P102   P31    Q18608871
P1025-P31-1  P1025  P31    Q29547399
P1025-P31-2  P1025  P31    Q96776953
P1057-P31-1  P1057  P31    Q19887775
P1064-P31-1  P1064  P31    Q23310071
P1108-P31-1  P1108  P31    Q21294996


### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [31]:
!$kgtk query -i $OUT/all.P279.tsv --graph-cache $STORE -o - \
    --match '(n1)-[]->()' \
    --return 'n1 as node' \
    | kgtk sort -c node > $TEMP/P279.n1.tsv

[2020-09-20 23:34:44 sqlstore]: IMPORT graph directly into table graph_5 from /Users/pedroszekely/Downloads/kypher/output/all.P279.tsv ...
[2020-09-20 23:34:45 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."node1" "node"
     FROM graph_5 graph_5_c1
  PARAS: []
---------------------------------------------
        1.90 real         1.03 user         0.20 sys


In [32]:
!$kgtk query -i $OUT/all.P31.tsv --graph-cache $STORE  -o - \
    --match '()-[]->(n2)' \
    --return 'n2 as node' \
    | kgtk sort -c node > $TEMP/P31.n2.tsv

[2020-09-20 23:34:47 sqlstore]: IMPORT graph directly into table graph_6 from /Users/pedroszekely/Downloads/kypher/output/all.P31.tsv ...
[2020-09-20 23:34:52 query]: SQL Translation:
---------------------------------------------
  SELECT graph_6_c1."node2" "node"
     FROM graph_6 graph_6_c1
  PARAS: []
---------------------------------------------
        8.04 real        11.35 user         0.42 sys


In [33]:
!$kgtk cat --mode NONE $TEMP/P31.n2.tsv $TEMP/P279.n1.tsv \
    / compact --mode NONE --columns node \
    > $TEMP/P279.roots.tsv

       11.50 real        16.15 user         0.58 sys


Now we can invoke the reachable-nodes command

In [34]:
!$kgtk reachable-nodes \
    --rootfile $TEMP/P279.roots.tsv \
    --rootfilecolumn 0 \
    --subj 1 --pred 2 --obj 3 \
    $OUT/all.P279.tsv \
    | kgtk sort2 \
    | $compress > $TEMP/P279.reachable.tsv

        3.11 real         2.48 user         0.25 sys


The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them.

In [35]:
!$kgtk query -i $TEMP/P279.reachable.tsv --graph-cache $STORE  -o - \
    --match '(n1)-[]->(n2)' \
    --return 'n1, "P279star" as label, n2 as node2' \
     > $TEMP/P279star.1.tsv

[2020-09-20 23:35:17 sqlstore]: IMPORT graph directly into table graph_7 from /Users/pedroszekely/Downloads/kypher/temp/P279.reachable.tsv ...
[2020-09-20 23:35:17 query]: SQL Translation:
---------------------------------------------
  SELECT graph_7_c1."node1", ? "label", graph_7_c1."node2" "node2"
     FROM graph_7 graph_7_c1
  PARAS: ['P279star']
---------------------------------------------
        0.99 real         0.93 user         0.15 sys


We also want `P279star` to be reflexive, i.e., to contain `(n1)-[:P279star]->(n1)` for every node1.

In [36]:
!$kgtk query -i $TEMP/P279.reachable.tsv --graph-cache $STORE  -o - \
    --match '(n1)-[]->(n2)' \
    --return 'n1 as node1, "P279star" as label, n1 as node2' \
     > $TEMP/P279star.2.tsv

[2020-09-20 23:35:18 query]: SQL Translation:
---------------------------------------------
  SELECT graph_7_c1."node1" "node1", ? "label", graph_7_c1."node1" "node2"
     FROM graph_7 graph_7_c1
  PARAS: ['P279star']
---------------------------------------------
        0.73 real         0.62 user         0.10 sys


In [37]:
!$kgtk query -i $TEMP/P279.reachable.tsv --graph-cache $STORE  -o - \
    --match '(n1)-[]->(n2)' \
    --return 'n2 as node1, "P279star" as label, n2 as node2' \
     > $TEMP/P279star.3.tsv

[2020-09-20 23:35:19 query]: SQL Translation:
---------------------------------------------
  SELECT graph_7_c1."node2" "node1", ? "label", graph_7_c1."node2" "node2"
     FROM graph_7 graph_7_c1
  PARAS: ['P279star']
---------------------------------------------
        0.71 real         0.60 user         0.10 sys


In [38]:
!$kgtk query -i $OUT/all.P31.tsv --graph-cache $STORE  -o - \
    --match '(n1)-[]->(n2)' \
    --return 'n2 as node1, "P279star" as label, n2 as node2' \
     > $TEMP/P279star.4.tsv

[2020-09-20 23:35:20 query]: SQL Translation:
---------------------------------------------
  SELECT graph_6_c1."node2" "node1", ? "label", graph_6_c1."node2" "node2"
     FROM graph_6 graph_6_c1
  PARAS: ['P279star']
---------------------------------------------
        3.55 real         3.37 user         0.16 sys


Now we can concatenate these files to produce the final output

In [39]:
!$kgtk cat --mode NONE $TEMP/P279star.1.tsv $TEMP/P279star.2.tsv $TEMP/P279star.3.tsv $TEMP/P279star.4.tsv \
    | kgtk compact \
    | kgtk sort2 \
    | kgtk add-id --id-style node1-label-node2-num \
    > $OUT/all.P279star.tsv

        9.63 real         8.43 user         0.17 sys


This is difficult to test with our Wikidata subset because our hierarchy is very sparse.
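
One way to sanity-check the construction is to reproduce it on a toy hierarchy in plain Python. The sketch below is not how kgtk computes the file; it mirrors the steps above (roots, reachable edges, reflexive edges) on made-up node names that are not real Wikidata identifiers, and then answers an instance-of-subclass question against the result.

```python
from collections import defaultdict

# Toy data with made-up node names (not real Wikidata identifiers).
p279_edges = [("lager", "beer"), ("beer", "drink"), ("pilsner", "lager")]
p31_edges = [("my_pils", "pilsner"), ("my_cup", "cup")]

parents = defaultdict(set)
for child, parent in p279_edges:
    parents[child].add(parent)


def ancestors(node):
    """All nodes reachable from `node` by following one or more P279 edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Roots: every node2 of P31 plus every node1 of P279.
roots = {cls for _, cls in p31_edges} | {n1 for n1, _ in p279_edges}

p279star = set()
for root in roots:
    p279star.add((root, root))              # reflexive edge for the root
    for ancestor in ancestors(root):
        p279star.add((root, ancestor))      # reachable edge
        p279star.add((ancestor, ancestor))  # reflexive edge for node2

# Items that are instances of "beer" or of any of its subclasses.
instances_of_beer = sorted(
    item for item, cls in p31_edges if (cls, "beer") in p279star
)
print(instances_of_beer)
```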

This is how we would do the typical `?item P31/P279* ?class` in Kypher. 
The example shows how to get all the `n1` that are instances of subclasses of beer (Q44).

In [40]:
!$kgtk query -i $OUT/all.P31.tsv -i $OUT/all.P279star.tsv --graph-cache $STORE  -o - \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q44)' \
    --return 'count(n1) as count'

[2020-09-20 23:35:42 sqlstore]: IMPORT graph directly into table graph_8 from /Users/pedroszekely/Downloads/kypher/output/all.P279star.tsv ...
[2020-09-20 23:35:43 query]: SQL Translation:
---------------------------------------------
  SELECT count(graph_6_c1."node1") "count"
     FROM graph_6 graph_6_c1, graph_8 graph_8_c2
     WHERE graph_6_c1."label"=?
     AND graph_8_c2."node2"=?
     AND graph_6_c1."node2"=graph_8_c2."node1"
  PARAS: ['P31', 'Q44']
---------------------------------------------
[2020-09-20 23:35:43 sqlstore]: CREATE INDEX on table graph_6 column node2 ...
[2020-09-20 23:35:44 sqlstore]: ANALYZE INDEX on table graph_6 column node2 ...
[2020-09-20 23:35:44 sqlstore]: CREATE INDEX on table graph_8 column node1 ...
[2020-09-20 23:35:44 sqlstore]: ANALYZE INDEX on table graph_8 column node1 ...
[2020-09-20 23:35:44 sqlstore]: CREATE INDEX on table graph_8 column node2 ...
[2020-09-20 23:35:44 sqlstore]: ANALYZE INDEX on table graph_8 column node2 ...
[2020-09-20 23:35

### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` holds when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`.

We do this by concatenating the files and renaming the relation

In [41]:
!$kgtk cat $OUT/all.P31.tsv $OUT/all.P279.tsv \
    > $TEMP/isa.1.tsv

        8.72 real         8.49 user         0.20 sys


In [42]:
!$kgtk query -i $TEMP/isa.1.tsv --graph-cache $STORE  -o - \
    --match '(n1)-[]->(n2)' \
    --return 'n1, "isa" as label, n2' \
    | kgtk sort2 \
    | $compress > $OUT/all.isa.tsv 

[2020-09-20 23:35:55 sqlstore]: IMPORT graph directly into table graph_9 from /Users/pedroszekely/Downloads/kypher/temp/isa.1.tsv ...
[2020-09-20 23:36:00 query]: SQL Translation:
---------------------------------------------
  SELECT graph_9_c1."node1", ? "label", graph_9_c1."node2"
     FROM graph_9 graph_9_c1
  PARAS: ['isa']
---------------------------------------------
        9.27 real        12.63 user         0.38 sys


Example of how to use the `isa` relation

In [43]:
!$kgtk query -i $OUT/all.isa.tsv -i $OUT/all.P279star.tsv --graph-cache $STORE  -o - \
    --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44)' \
    --return 'distinct n1, l.label, "Q44" as node2' \
    --limit 10

[2020-09-20 23:36:04 sqlstore]: IMPORT graph directly into table graph_10 from /Users/pedroszekely/Downloads/kypher/output/all.isa.tsv ...
[2020-09-20 23:36:08 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_10_c1."node1", graph_10_c1."label", ? "node2"
     FROM graph_10 graph_10_c1, graph_8 graph_8_c2
     WHERE graph_10_c1."label"=?
     AND graph_8_c2."node2"=?
     AND graph_10_c1."node2"=graph_8_c2."node1"
     LIMIT ?
  PARAS: ['Q44', 'isa', 'Q44', 10]
---------------------------------------------
[2020-09-20 23:36:08 sqlstore]: CREATE INDEX on table graph_10 column node2 ...
[2020-09-20 23:36:09 sqlstore]: ANALYZE INDEX on table graph_10 column node2 ...
[2020-09-20 23:36:09 sqlstore]: CREATE INDEX on table graph_10 column label ...
[2020-09-20 23:36:09 sqlstore]: ANALYZE INDEX on table graph_10 column label ...
node1	label	node2
Q10313616	isa	Q44
Q12009657	isa	Q44
Q1445539	isa	Q44
Q15883984	isa	Q44
Q16069508	isa	Q44
Q16545279	isa	

### Creating a subset of Wikidata without scholarly articles (Q13442814)
First create a file with the scholarly articles

In [52]:
!$kgtk query -i $OUT/all.isa.tsv -i $OUT/all.P279star.tsv --graph-cache $STORE  -o - \
    --match 'isa: (n1)-[l:isa]->(n2:Q13442814)' \
    --return 'distinct n1, l.label, n2' \
    | kgtk sort2 \
    | $compress > $OUT/all.isa.Q13442814.tsv 

[2020-09-21 09:57:27 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_10_c1."node1", graph_10_c1."label", graph_10_c1."node2"
     FROM graph_10 graph_10_c1
     WHERE graph_10_c1."label"=?
     AND graph_10_c1."node2"=?
  PARAS: ['isa', 'Q13442814']
---------------------------------------------
        7.84 real         2.59 user         0.54 sys


Now we need to remove from `$EDGES` any edge where node1 or node2 is in node1 of `$OUT/all.isa.Q13442814.tsv`. The result will be `$OUT/minus.Q13442814.tsv`. We can then run the whole notebook with this new file as `$EDGES` and compute all the product files in a new output directory.

In [55]:
!head $OUT/all.isa.Q13442814.tsv | column -t -s $'\t' 

node1      label  node2
Q12376813  isa    Q13442814
Q14565069  isa    Q13442814
Q16338058  isa    Q13442814
Q17166308  isa    Q13442814
Q17454081  isa    Q13442814
Q17485684  isa    Q13442814
Q17485687  isa    Q13442814
Q17535010  isa    Q13442814
Q1801903   isa    Q13442814
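
The subtraction described above could be sketched in pandas as follows. This is an illustration on toy rows rather than the planned implementation: the edge ids and the extra edges are made up, and the two in-memory buffers stand in for `$EDGES` and `all.isa.Q13442814.tsv`. For the full edges file, loading everything at once may not fit in memory, so reading with the `chunksize` argument of `read_csv` and filtering chunk by chunk would probably be needed.

```python
import io

import pandas as pd

# Toy stand-in for $EDGES (made-up rows).
edges = pd.read_csv(
    io.StringIO(
        "id\tnode1\tlabel\tnode2\n"
        "e1\tQ1\tP31\tQ5\n"
        "e2\tQ100\tP31\tQ13442814\n"
        "e3\tQ100\tP50\tQ1\n"
        "e4\tQ2\tP1343\tQ100\n"
        "e5\tQ2\tP31\tQ5\n"
    ),
    sep="\t",
    dtype=str,
)

# Toy stand-in for all.isa.Q13442814.tsv.
articles = pd.read_csv(
    io.StringIO("node1\tlabel\tnode2\nQ100\tisa\tQ13442814\n"),
    sep="\t",
    dtype=str,
)

# Nodes to remove: every node1 of the scholarly-article file.
to_remove = set(articles["node1"])

# Keep an edge only if neither endpoint is a scholarly article.
keep = ~(edges["node1"].isin(to_remove) | edges["node2"].isin(to_remove))
minus = edges[keep].reset_index(drop=True)
print(minus.to_string(index=False))
```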


## Summary

In [58]:
!wc -l $OUT/*.tsv $EDGES

    6030 /Users/pedroszekely/Downloads/kypher/output/all-distribution.tsv
   75887 /Users/pedroszekely/Downloads/kypher/output/all.P279.tsv
  161308 /Users/pedroszekely/Downloads/kypher/output/all.P279star.tsv
 1790373 /Users/pedroszekely/Downloads/kypher/output/all.P31.tsv
 1866259 /Users/pedroszekely/Downloads/kypher/output/all.P31_P279.tsv
  215468 /Users/pedroszekely/Downloads/kypher/output/all.alias.en.tsv
 2418893 /Users/pedroszekely/Downloads/kypher/output/all.alias.tsv
  122172 /Users/pedroszekely/Downloads/kypher/output/all.commonsMedia.tsv
 2622641 /Users/pedroszekely/Downloads/kypher/output/all.description.en.tsv
 57987893 /Users/pedroszekely/Downloads/kypher/output/all.description.tsv
 3559913 /Users/pedroszekely/Downloads/kypher/output/all.external-id.tsv
     561 /Users/pedroszekely/Downloads/kypher/output/all.geo-shape.tsv
  224843 /Users/pedroszekely/Downloads/kypher/output/all.globe-coordinate.tsv
  699853 /Users/pedroszekely/Downloads/kypher/output/all.isa.Q13442814.t

Number of distinct items in our dataset

In [61]:
!$kgtk query -i $EDGES --graph-cache $STORE  -o - \
    --match '(n1)-[]->()' \
    --return 'count(distinct n1) as count'

[2020-09-21 10:15:37 query]: SQL Translation:
---------------------------------------------
  SELECT count(DISTINCT graph_1_c1."node1") "count"
     FROM graph_1 graph_1_c1
  PARAS: []
---------------------------------------------
count
1959831
      100.28 real        57.04 user        10.30 sys


## Other Stuff

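Little bug: if two files are specified as input and the match clause does not name a graph, only one is used. In the cells below, the query without a graph name runs against `all.isa.tsv` (the first input) only.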

In [44]:
!$kgtk query -i $OUT/all.isa.tsv -i $OUT/all.P279star.tsv --graph-cache $STORE  -o - \
    --match 'P279star: (c)-[]->(:Q44)' \
    --limit 10

[2020-09-20 23:36:10 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_8 graph_8_c1
     WHERE graph_8_c1."node2"=?
     LIMIT ?
  PARAS: ['Q44', 10]
---------------------------------------------
node1	label	node2	id
Q10313616	P279star	Q44	Q10313616-P279star-Q44-0000
Q12009657	P279star	Q44	Q12009657-P279star-Q44-0000
Q1445539	P279star	Q44	Q1445539-P279star-Q44-0000
Q16069508	P279star	Q44	Q16069508-P279star-Q44-0000
Q16545279	P279star	Q44	Q16545279-P279star-Q44-0000
Q28077004	P279star	Q44	Q28077004-P279star-Q44-0000
Q44	P279star	Q44	Q44-P279star-Q44-0000
Q4488344	P279star	Q44	Q4488344-P279star-Q44-0000
Q4626	P279star	Q44	Q4626-P279star-Q44-0000
Q4880001	P279star	Q44	Q4880001-P279star-Q44-0000
        0.60 real         0.49 user         0.10 sys


In [45]:
!$kgtk query -i $OUT/all.isa.tsv -i $OUT/all.P279star.tsv --graph-cache $STORE  -o - \
    --match '(c)-[]->(:Q44)' \
    --limit 10

[2020-09-20 23:36:11 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_10 graph_10_c1
     WHERE graph_10_c1."node2"=?
     LIMIT ?
  PARAS: ['Q44', 10]
---------------------------------------------
node1	label	node2
Q10313616	isa	Q44
Q12009657	isa	Q44
Q1445539	isa	Q44
Q15883984	isa	Q44
Q16069508	isa	Q44
Q16545279	isa	Q44
Q2579953	isa	Q44
Q28077004	isa	Q44
Q3699039	isa	Q44
Q4488344	isa	Q44
        0.60 real         0.49 user         0.10 sys


In [62]:
!$kgtk query -i $OUT/all.isa.tsv -i $OUT/all.P279star.tsv --graph-cache $STORE  -o - \
    --match 'isa: (n1)-[l:isa]->(n2:Q318)' \
    --return 'distinct n1, l.label, n2' \
    | kgtk sort2 \
    | $compress > $OUT/all.isa.Q318.tsv 

[2020-09-21 21:31:06 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_10_c1."node1", graph_10_c1."label", graph_10_c1."node2"
     FROM graph_10 graph_10_c1
     WHERE graph_10_c1."label"=?
     AND graph_10_c1."node2"=?
  PARAS: ['isa', 'Q318']
---------------------------------------------
        5.60 real         0.72 user         0.27 sys
