# Generating Useful Wikidata Files

In [1]:
# Parameters
home = "/Users/pedroszekely/Downloads"
f_path = "kypher"
wiki_file = "all.10.tsv.gz"
output_folder = "output.all.10"
temp_folder = "temp.all.10"
delete_database = ""

In [2]:
import io
import os

import numpy as np
import pandas as pd

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

## Set up environment and folders to store the files

- `WIKIDATA_HOME` folder where you put your Wikidata data
- `OUT` folder where the output files go
- `TEMP` folder to keep temporary files, including the database
- `kgtk` shortcut to invoke the kgtk software

The current implementation of some of the kgtk commands does not understand compressed files. In particular, `query` often rejects `gz` files.
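
One possible workaround, sketched here in plain Python rather than with any kgtk feature, is to keep an uncompressed copy of a `gz` file next to the original. The helper name `decompress_gz` and the sample file are illustrative assumptions, not part of the workflow.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_gz(src: Path, dest_dir: Path) -> Path:
    """Write an uncompressed copy of `src` (a .gz file) into `dest_dir`."""
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    # Path.stem drops only the last suffix: sample.tsv.gz -> sample.tsv
    dest = dest_dir / src.stem
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


# Self-contained demonstration on a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    sample = tmp_dir / "sample.tsv.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as f:
        f.write("id\tnode1\tlabel\tnode2\nQ45-P31-1\tQ45\tP31\tQ3624078\n")
    plain = decompress_gz(sample, tmp_dir)
    demo_name = plain.name
    demo_text = plain.read_text(encoding="utf-8")
```

The copy is streamed with `shutil.copyfileobj`, so the whole file is never held in memory, which matters for the full edges file.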

To dos:

- Make sure that all files have id columns as `query` gets unhappy when files have no ids.
- Create an output folder for a subset of Wikidata without scholarly articles. This is half done: the remaining work is to subtract the scholarly articles from `EDGES` and repeat the workflow.
- Change the naming convention to make it clear which files are a partition of the original `EDGES`, so users know what files they need to get to have a full version.
- Create a qualifier file for the partition files of Wikidata: this is so that if a user gets one of the partitions, they can get the corresponding qualifier file.
- Add pagerank and other stats. We can compute the pagerank from the `all.item` file, so the result could be called `all.item.pagerank.tsv`.
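
As a rough illustration of the first to-do item, the sketch below assigns an id to every edge that lacks one. The id style `<node1>-<label>-<n>` is an assumption modeled on ids such as `Q45-P31-1` that appear in the edges file; the function name `add_ids` is hypothetical and this is not how kgtk itself generates ids.

```python
from collections import defaultdict


def add_ids(edges):
    """Return (id, node1, label, node2) rows for (node1, label, node2) edges.

    Assumed id style: "<node1>-<label>-<n>", where n counts the edges that
    share the same node1 and label, starting at 1.
    """
    seen = defaultdict(int)
    rows = []
    for node1, label, node2 in edges:
        seen[(node1, label)] += 1
        edge_id = f"{node1}-{label}-{seen[(node1, label)]}"
        rows.append((edge_id, node1, label, node2))
    return rows


edges = [
    ("Q45", "P31", "Q3624078"),
    ("Q45", "P31", "Q6256"),
    ("Q140", "P31", "Q16521"),
]
rows_with_ids = add_ids(edges)
```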

Naming convention: the name `all` is redundant, and we should consider removing it. I recommend using the prefix `part.` to name the partitions of Wikidata, e.g., `part.label`, `part.quantity`. Files such as `P279` are not partitions, as each is a subset of `part.item`.

If we create a subset of Wikidata, e.g., no scholarly articles, we could call it `minus.Q13442814`; if we remove galaxies too, we could call it `minus.Q13442814-Q318`, so the files would be `minus.Q13442814-Q318.part.quantity.tsv` (the idea of `all` is in contrast to `minus`). We can also have files that start with Qnodes, e.g., `Q5.part.quantity.tsv`; constructing such files is harder, as we don't want dangling nodes in the item file.
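
The proposed convention can be written down as a small helper, shown here only as a sketch. The function name `product_filename` and its parameters are hypothetical, and the choice to drop the `all` prefix when nothing is excluded is an assumption based on the remark that `all` is redundant.

```python
def product_filename(part, minus=(), root=None, ext="tsv"):
    """Build a file name following the proposed naming convention.

    part  -- partition name, e.g. "quantity" or "label"
    minus -- Qnodes whose instances were removed, e.g. ["Q13442814", "Q318"]
    root  -- optional Qnode the subset starts from, e.g. "Q5"
    """
    pieces = []
    if root:
        pieces.append(root)
    if minus:
        pieces.append("minus." + "-".join(minus))
    pieces.append(f"part.{part}")
    return ".".join(pieces) + f".{ext}"


full_name = product_filename("label")
minus_name = product_filename("quantity", minus=["Q13442814", "Q318"])
root_name = product_filename("quantity", root="Q5")
```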

In [15]:
!echo $WIKIDATA_HOME
!echo $OUT
!echo $TEMP
!echo $kgtk

/Users/pedroszekely/Downloads/kypher
/Users/pedroszekely/Downloads/kypher/output.all.10
/Users/pedroszekely/Downloads/kypher/temp.all.10
time kgtk --debug


In [5]:
cd $wikidata_home

/Users/pedroszekely/Downloads/kypher


In [6]:
!mkdir $OUT
!mkdir $TEMP

mkdir: /Users/pedroszekely/Downloads/kypher/output.all.10: File exists
mkdir: /Users/pedroszekely/Downloads/kypher/temp.all.10: File exists


Clean up the output and temp folders before we start

In [7]:
!rm $OUT/*.tsv $OUT/*.tsv.gz
!rm $TEMP/*.tsv $TEMP/*.tsv.gz

In [8]:
if delete_database:
    print("Deleting database")
    !rm $TEMP/wikidata.sqlite3.db

The `all` file contains 100M edges of the full dump; `all.10` contains 10M edges. These subsets are for testing; the final run should use the full edges file.

In [9]:
%env STORE=$wikidata_home/temp/wikidata.sqlite3.db
# %env EDGES=$wikidata_home/all.10.tsv
%env EDGES=$wikidata_home/$wiki_file

#%env QUALS=$wikidata_home/wikidata-20200803-all-qualifiers.tsv.gz
#%env LABELS=$wikidata_home/wikidata-20200803-all-labels-en-sorted.tsv.gz

env: STORE=/Users/pedroszekely/Downloads/kypher/temp/wikidata.sqlite3.db
env: EDGES=/Users/pedroszekely/Downloads/kypher/all.10.tsv.gz


To remove the SQLite database, set the `delete_database` parameter to a non-empty string so that the deletion cell above runs. It takes a long time to load all the data and create indices, so don't remove the database unless you change files that have already been loaded and you need to force a reload.

### Get a sample and force importing the edge file into the database

In [10]:
!$kgtk query -i $EDGES --limit 10 --graph-cache $STORE

[2020-09-25 15:23:51 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_5 AS graph_5_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type	node2;wikidatatype
Q45-label-en	Q45	label	'Portugal'@en													
Q45-label-fr	Q45	label	'Portugal'@fr													
Q45-label-nb	Q45	label	'Portugal'@nb													
Q45-label-it	Q45	label	'Portogallo'@it													
Q45-label-ru	Q45	label	'Португалия'@ru													
Q45-label-nl	Q45	label	'Portugal'@nl													
Q45-label-es	Q45	label	'Portugal'@es													
Q45-label-de	Q45	label	'Portugal'@de													
Q45-label-pl	Q45	label	'Portugalia'@pl													
Q45-label-be-tarask	Q45	label	'Партугалія'@be-tarask													
        0.90 real         0.60 user         0.17 sys


Force creation of the index on the label column

In [11]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(i)-[:P31]->(c)' \
    --limit 5

[2020-09-25 15:23:52 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."label"=?
     LIMIT ?
  PARAS: ['P31', 5]
---------------------------------------------
id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type	node2;wikidatatype
Q45-P31-1	Q45	P31	Q3624078	normal				Q3624078							item	wikibase-item
Q45-P31-2	Q45	P31	Q6256	normal				Q6256							item	wikibase-item
Q45-P31-3	Q45	P31	Q20181813	normal				Q20181813							item	wikibase-item
Q140-P31-1	Q140	P31	Q16521	normal				Q16521							item	wikibase-item
Q183-P31-1	Q183	P31	Q3624078	preferred				Q3624078							item	wikibase-item
        0.85 real         0.67 user         0.15 sys


Force creation of the index on the node2 column

In [12]:
!$kgtk query -i $EDGES --graph-cache $STORE -o - \
    --match '(i)-[r]->(:Q5)' \
    --limit 5

[2020-09-25 15:23:52 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."node2"=?
     LIMIT ?
  PARAS: ['Q5', 5]
---------------------------------------------
[2020-09-25 15:23:52 sqlstore]: CREATE INDEX on table graph_5 column node2 ...
[2020-09-25 15:24:18 sqlstore]: ANALYZE INDEX on table graph_5 column node2 ...
id	node1	label	node2	rank	node2;magnitude	node2;unit	node2;date	node2;item	node2;lower	node2;upper	node2;latitude	node2;longitude	node2;precision	node2;calendar	node2;entity-type	node2;wikidatatype
Q1253-P31-1	Q1253	P31	Q5	normal				Q5							item	wikibase-item
Q1526-P31-1	Q1526	P31	Q5	normal				Q5							item	wikibase-item
Q3794-P31-1	Q3794	P31	Q5	normal				Q5							item	wikibase-item
Q4291-P31-1	Q4291	P31	Q5	normal				Q5							item	wikibase-item
Q4489-P31-1	Q4489	P31	Q5	normal				Q5							item	wikibase-item
       27.54 real        18.31 user         4.63 sys


### Count the number of edges

In [13]:
!$kgtk query -i $EDGES --graph-cache $STORE \
    --match 'all: ()-[r]->()' \
    --return 'count(r) as count' \
    --limit 10

[2020-09-25 15:24:20 query]: SQL Translation:
---------------------------------------------
  SELECT count(graph_5_c1."id") "count"
     FROM graph_5 AS graph_5_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
count
9999999
        2.79 real         1.49 user         1.02 sys


### Get the distribution of the label column
The counts below are sorted as strings (998, 987, 985, 98, ...). I would like to have them sorted numerically, but don't know how to make that happen with these commands.

In [14]:
!$kgtk unique --column label -i $EDGES / sort2 -c node2 -r -o $OUT/all-distribution.tsv 

       50.16 real        48.25 user         1.37 sys


In [15]:
!head $OUT/all-distribution.tsv | column -t -s $'\t' 

node1  label  node2
P3987  count  998
P410   count  987
P575   count  985
P6879  count  98
P6562  count  98
P5395  count  98
P3153  count  98
P4933  count  97
P3135  count  97
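
The output above is ordered lexicographically, which is why `98` follows `985`. As a sketch outside kgtk, the rows could be re-sorted in Python by converting the count to an integer first. The row `P_EXAMPLE` is hypothetical and is included only to make the difference between the two orderings visible.

```python
def sort_by_count(rows):
    """Sort (node1, label, node2) rows by node2 read as an integer, descending."""
    return sorted(rows, key=lambda row: int(row[2]), reverse=True)


rows = [
    ("P3987", "count", "998"),
    ("P410", "count", "987"),
    ("P6879", "count", "98"),
    ("P_EXAMPLE", "count", "1200"),  # hypothetical row, not from the real output
]

# String comparison reproduces the ordering seen in the notebook output.
as_strings = sorted(rows, key=lambda row: row[2], reverse=True)
# Integer comparison puts the largest count first.
as_numbers = sort_by_count(rows)
```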


### Compute files with labels, aliases and descriptions
Return the id, node1, label and node2 columns

In [16]:
!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.label.tsv.gz \
    --match '(n1)-[l:label]->(n2)' \
    --return 'l, n1, l.label, n2' 

[2020-09-25 15:25:14 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."id", graph_5_c1."node1", graph_5_c1."label", graph_5_c1."node2"
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."label"=?
  PARAS: ['label']
---------------------------------------------
       22.71 real        19.10 user         1.80 sys


In [17]:
!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.alias.tsv.gz \
    --match '(n1)-[l:alias]->(n2)' \
    --return 'l, n1, l.label, n2'

[2020-09-25 15:25:37 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."id", graph_5_c1."node1", graph_5_c1."label", graph_5_c1."node2"
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."label"=?
  PARAS: ['alias']
---------------------------------------------
        3.99 real         3.21 user         0.50 sys


In [18]:
!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.description.tsv.gz \
    --match '(n1)-[l:description]->(n2)' \
    --return 'l, n1, l.label, n2'

[2020-09-25 15:25:41 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."id", graph_5_c1."node1", graph_5_c1."label", graph_5_c1."node2"
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."label"=?
  PARAS: ['description']
---------------------------------------------
       66.17 real        55.06 user         4.49 sys


### Now create files with the English labels, aliases and descriptions

In [19]:
!$kgtk query -i $OUT/all.label.tsv.gz --graph-cache $STORE -o $OUT/all.label.en.tsv.gz \
    --match '()-[]->(n2)' \
    --where 'n2.kgtk_lqstring_lang_suffix = "en"' 

[2020-09-25 15:26:47 sqlstore]: IMPORT graph directly into table graph_14 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.label.tsv.gz ...
[2020-09-25 15:26:56 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_14 AS graph_14_c1
     WHERE (kgtk_lqstring_lang_suffix(graph_14_c1."node2") = ?)
  PARAS: ['en']
---------------------------------------------
       14.52 real        19.55 user         0.92 sys


In [20]:
!$kgtk query -i $OUT/all.alias.tsv.gz --graph-cache $STORE -o $OUT/all.alias.en.tsv.gz \
    --match '()-[]->(n2)' \
    --where 'n2.kgtk_lqstring_lang_suffix = "en"'

[2020-09-25 15:27:01 sqlstore]: IMPORT graph directly into table graph_15 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.alias.tsv.gz ...
[2020-09-25 15:27:03 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_15 AS graph_15_c1
     WHERE (kgtk_lqstring_lang_suffix(graph_15_c1."node2") = ?)
  PARAS: ['en']
---------------------------------------------
        2.59 real         3.12 user         0.24 sys


In [21]:
!$kgtk query -i $OUT/all.description.tsv.gz --graph-cache $STORE -o $OUT/all.description.en.tsv.gz \
    --match '()-[]->(n2)' \
    --where 'n2.kgtk_lqstring_lang_suffix = "en"' 

[2020-09-25 15:27:04 sqlstore]: IMPORT graph directly into table graph_16 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.description.tsv.gz ...
[2020-09-25 15:27:45 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_16 AS graph_16_c1
     WHERE (kgtk_lqstring_lang_suffix(graph_16_c1."node2") = ?)
  PARAS: ['en']
---------------------------------------------
       51.48 real        73.68 user         2.88 sys


Let's sample these files to see what they look like:

* We are getting all variants of English; we really want `en` only.
* The labels carry language tags; how do we output only the string without the language tag?

In [22]:
!gzcat $OUT/all.label.en.tsv.gz | head | column -t -s $'\t' 

id             node1  label  node2
Q45-label-en   Q45    label  'Portugal'@en
Q140-label-en  Q140   label  'lion'@en
Q183-label-en  Q183   label  'Germany'@en
Q317-label-en  Q317   label  'dictatorship'@en
Q433-label-en  Q433   label  'Gmina Kurów'@en
Q514-label-en  Q514   label  'anatomy'@en
Q595-label-en  Q595   label  'The Intouchables'@en
Q647-label-en  Q647   label  'Rennes'@en
Q716-label-en  Q716   label  'titanium'@en
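
Both open questions (keeping exactly `en`, and dropping the language tag) could be handled in a post-processing step. The sketch below is plain Python, not a kgtk feature. It assumes values shaped like `'text'@lang` with `\'` as the only escape; the helper names and the `en-gb` value are hypothetical.

```python
import re

# 'text'@lang, where lang may contain letters, digits and hyphens (assumption).
_LQSTRING = re.compile(r"^'(?P<text>.*)'@(?P<lang>[A-Za-z0-9-]+)$", re.DOTALL)


def split_lqstring(value):
    """Split a language-qualified string into (text, language tag)."""
    match = _LQSTRING.match(value)
    if match is None:
        raise ValueError(f"not a language-qualified string: {value!r}")
    text = match.group("text").replace("\\'", "'")
    return text, match.group("lang")


def is_plain_english(value):
    """True only for the exact tag `en`, not for variants such as `en-gb`."""
    return split_lqstring(value)[1] == "en"


portugal = split_lqstring("'Portugal'@en")
tarask = split_lqstring("'Партугалія'@be-tarask")
variant_is_en = is_plain_english("'colour'@en-gb")  # hypothetical value
```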



### Compute the distribution of the number of edges for each Wikidata type

In [23]:
!$kgtk unique --column 'node2;wikidatatype' -i $EDGES / sort2 -c node2 -r | gzip > $OUT/all.wikidatatype.distribution.tsv.gz

       43.63 real        42.65 user         1.06 sys


In [24]:
!gzcat $OUT/all.wikidatatype.distribution.tsv.gz | column -t -s $'\t' 

node1              label  node2
time               count  76936
wikibase-item      count  729535
math               count  70
wikibase-form      count  7
quantity           count  69823
string             count  68283
external-id        count  416408
commonsMedia       count  36794
globe-coordinate   count  26063
monolingualtext    count  24131
musical-notation   count  2
geo-shape          count  183
wikibase-property  count  148
url                count  12874


### Create a file to contain the edges for each wikidata type

In [25]:
types = [
    "time",
    "wikibase-item",
    "math",
    "wikibase-form",
    "quantity",
    "string",
    "external-id",
    "commonsMedia",
    "globe-coordinate",
    "monolingualtext",
    "musical-notation",
    "geo-shape",
    "wikibase-property",
    "url",
]
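# Template for one query per Wikidata type; the TYPE placeholder is replaced below.
command = "$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.TYPE.tsv.gz \
    --match '(n1)-[l]->(n2 {wikidatatype: type})' \
    --return 'l, n1, l.label, n2'\
    --where 'type = \"TYPE\"'"
# Name the loop variable so that it does not shadow the built-in `type`.
for wikidata_type in types:
    cmd = command.replace("TYPE", wikidata_type)
    print(cmd)
    os.system(cmd)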

$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.time.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "time"'
$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.wikibase-item.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "wikibase-item"'
$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.math.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "math"'
$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.wikibase-form.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "wikibase-form"'
$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.quantity.tsv.gz     --match '(n1)-[l]->(n2 {wikidatatype: type})'     --return 'l, n1, l.label, n2'    --where 'type = "quantity"'
$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.string.tsv.
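
For readers who want to see what the per-type split amounts to, here is a small in-memory sketch in plain Python. It groups rows by the `node2;wikidatatype` column and skips rows where that column is empty, on the assumption that such rows (labels, for example) match none of the per-type queries. The `P2046` row is hypothetical.

```python
import csv
import io
from collections import defaultdict

TYPE_COLUMN = "node2;wikidatatype"

sample_tsv = (
    "id\tnode1\tlabel\tnode2\tnode2;wikidatatype\n"
    "Q45-P31-1\tQ45\tP31\tQ3624078\twikibase-item\n"
    "Q45-P31-2\tQ45\tP31\tQ6256\twikibase-item\n"
    "Q45-label-en\tQ45\tlabel\t'Portugal'@en\t\n"
    "Q45-P2046-1\tQ45\tP2046\t+92212\tquantity\n"  # hypothetical row
)


def partition_by_type(tsv_text):
    """Group (id, node1, label, node2) rows by their Wikidata type."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    parts = defaultdict(list)
    for row in reader:
        wikidata_type = row[TYPE_COLUMN]
        if wikidata_type:  # rows without a type belong to no partition
            parts[wikidata_type].append(
                (row["id"], row["node1"], row["label"], row["node2"])
            )
    return dict(parts)


parts = partition_by_type(sample_tsv)
```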

### Create a file with the sitelinks

In [26]:
!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.wikipedia_sitelink.tsv.gz \
    --match '(n1)-[l:wikipedia_sitelink]->(n2)' \
    --return 'l, n1, l.label, n2' 

[2020-09-25 15:29:42 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."id", graph_5_c1."node1", graph_5_c1."label", graph_5_c1."node2"
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."label"=?
  PARAS: ['wikipedia_sitelink']
---------------------------------------------
        8.68 real         7.11 user         0.83 sys


### Create a file that specifies for each node whether it is an item or a property

In [27]:
!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.type.tsv.gz \
    --match '(n1)-[l:type]->(n2)' \
    --return 'l, n1, l.label, n2' 

[2020-09-25 15:29:50 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."id", graph_5_c1."node1", graph_5_c1."label", graph_5_c1."node2"
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."label"=?
  PARAS: ['type']
---------------------------------------------
        3.42 real         2.64 user         0.63 sys


### Create the P31 and P279 files

In [28]:
!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.P31.tsv.gz \
    --match '(n1)-[l:P31]->(n2)' \
    --return 'l, n1, l.label, n2' 

[2020-09-25 15:29:54 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."id", graph_5_c1."node1", graph_5_c1."label", graph_5_c1."node2"
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."label"=?
  PARAS: ['P31']
---------------------------------------------
        2.72 real         2.15 user         0.46 sys


In [29]:
!$kgtk query -i $EDGES --graph-cache $STORE -o $OUT/all.P279.tsv.gz \
    --match '(n1)-[l:P279]->(n2)' \
    --return 'l, n1, l.label, n2' 

[2020-09-25 15:29:57 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."id", graph_5_c1."node1", graph_5_c1."label", graph_5_c1."node2"
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."label"=?
  PARAS: ['P279']
---------------------------------------------
        0.78 real         0.61 user         0.14 sys


In [30]:
!gzcat $OUT/all.P31.tsv.gz | head | column -t -s $'\t' 

id          node1  label  node2
Q45-P31-1   Q45    P31    Q3624078
Q45-P31-2   Q45    P31    Q6256
Q45-P31-3   Q45    P31    Q20181813
Q140-P31-1  Q140   P31    Q16521
Q183-P31-1  Q183   P31    Q3624078
Q183-P31-2  Q183   P31    Q43702
Q183-P31-3  Q183   P31    Q7270
Q183-P31-4  Q183   P31    Q619610
Q183-P31-5  Q183   P31    Q4209223


In [31]:
!$kgtk cat -i $OUT/all.P279.tsv.gz -i $OUT/all.P31.tsv.gz -o $OUT/all.P31_P279.tsv.gz 

        2.54 real         2.35 user         0.14 sys


In [32]:
!gzcat $OUT/all.P31_P279.tsv.gz | head | column -t -s $'\t' 

id            node1  label  node2
Q317-P279-1   Q317   P279   Q173424
Q514-P279-1   Q514   P279   Q420
Q514-P279-2   Q514   P279   Q11190
Q716-P279-1   Q716   P279   Q19588
Q716-P279-2   Q716   P279   Q428766
Q901-P279-1   Q901   P279   Q1650915
Q901-P279-2   Q901   P279   Q20826540
Q1071-P279-1  Q1071  P279   Q34749
Q1071-P279-2  Q1071  P279   Q8008


### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [33]:
!$kgtk query -i $OUT/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/P279.n1.tsv.gz \
    --match '(n1)-[]->()' \
    --return 'n1 as node' 

[2020-09-25 15:30:01 sqlstore]: IMPORT graph directly into table graph_17 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279.tsv.gz ...
[2020-09-25 15:30:01 query]: SQL Translation:
---------------------------------------------
  SELECT graph_17_c1."node1" "node"
     FROM graph_17 AS graph_17_c1
  PARAS: []
---------------------------------------------
        0.93 real         0.72 user         0.19 sys


In [34]:
!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE  -o $TEMP/P31.n2.tsv.gz \
    --match '()-[]->(n2)' \
    --return 'n2 as node' 

[2020-09-25 15:30:02 sqlstore]: DROP graph data table graph_7 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz
[2020-09-25 15:30:02 sqlstore]: IMPORT graph directly into table graph_18 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz ...
[2020-09-25 15:30:02 query]: SQL Translation:
---------------------------------------------
  SELECT graph_18_c1."node2" "node"
     FROM graph_18 AS graph_18_c1
  PARAS: []
---------------------------------------------
        1.47 real         1.44 user         0.23 sys


In [35]:
!$kgtk cat --mode NONE $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \
    / compact --mode NONE --columns node \
    > $TEMP/P279.roots.tsv

        2.32 real         2.88 user         0.43 sys


Now we can invoke the reachable-nodes command

In [36]:
!$kgtk reachable-nodes \
    --rootfile $TEMP/P279.roots.tsv \
    --rootfilecolumn 0 \
    --subj 1 --pred 2 --obj 3 \
    $OUT/all.P279.tsv.gz \
    | kgtk sort2 \
    | gzip > $TEMP/P279.reachable.tsv.gz

        1.35 real         0.87 user         0.21 sys


The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them.

In [37]:
!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.1.tsv.gz \
    --match '(n1)-[]->(n2)' \
    --return 'n1, "P279star" as label, n2 as node2' 

[2020-09-25 15:30:07 sqlstore]: DROP graph data table graph_6 from /Users/pedroszekely/Downloads/kypher/temp.all.10/P279.reachable.tsv.gz
[2020-09-25 15:30:07 sqlstore]: IMPORT graph directly into table graph_19 from /Users/pedroszekely/Downloads/kypher/temp.all.10/P279.reachable.tsv.gz ...
[2020-09-25 15:30:07 query]: SQL Translation:
---------------------------------------------
  SELECT graph_19_c1."node1", ? "label", graph_19_c1."node2" "node2"
     FROM graph_19 AS graph_19_c1
  PARAS: ['P279star']
---------------------------------------------
        0.84 real         0.65 user         0.17 sys


We also want `P279star` to be reflexive, i.e., to contain `(n1)-[:P279star]->(n1)` for every node `n1`.

In [38]:
!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.2.tsv.gz \
    --match '(n1)-[]->(n2)' \
    --return 'n1 as node1, "P279star" as label, n1 as node2' 

[2020-09-25 15:30:08 query]: SQL Translation:
---------------------------------------------
  SELECT graph_19_c1."node1" "node1", ? "label", graph_19_c1."node1" "node2"
     FROM graph_19 AS graph_19_c1
  PARAS: ['P279star']
---------------------------------------------
        0.77 real         0.62 user         0.13 sys


In [39]:
!$kgtk query -i $TEMP/P279.reachable.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.3.tsv.gz \
    --match '(n1)-[]->(n2)' \
    --return 'n2 as node1, "P279star" as label, n2 as node2' 

[2020-09-25 15:30:09 query]: SQL Translation:
---------------------------------------------
  SELECT graph_19_c1."node2" "node1", ? "label", graph_19_c1."node2" "node2"
     FROM graph_19 AS graph_19_c1
  PARAS: ['P279star']
---------------------------------------------
        0.81 real         0.65 user         0.14 sys


In [40]:
!$kgtk query -i $OUT/all.P31.tsv.gz --graph-cache $STORE  -o $TEMP/P279star.4.tsv.gz \
    --match '(n1)-[]->(n2)' \
    --return 'n2 as node1, "P279star" as label, n2 as node2' 

[2020-09-25 15:30:10 query]: SQL Translation:
---------------------------------------------
  SELECT graph_18_c1."node2" "node1", ? "label", graph_18_c1."node2" "node2"
     FROM graph_18 AS graph_18_c1
  PARAS: ['P279star']
---------------------------------------------
        1.24 real         1.05 user         0.16 sys


Now we can concatenate these files to produce the final output

In [41]:
!$kgtk cat --mode NONE $TEMP/P279star.1.tsv.gz $TEMP/P279star.2.tsv.gz $TEMP/P279star.3.tsv.gz $TEMP/P279star.4.tsv.gz \
    | kgtk compact \
    | kgtk sort2 \
    | kgtk add-id --id-style node1-label-node2-num \
    | gzip > $OUT/all.P279star.tsv.gz

        1.49 real         1.28 user         0.17 sys
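
To make the intent of these steps easier to check on a toy graph, the sketch below computes the same kind of relation in plain Python: it takes the roots (every node2 of `P31` and every node1 of `P279`), follows `P279` edges upward from each root, and adds the reflexive pairs. It is an illustration under those assumptions, not a replacement for `reachable-nodes`; the node names `A` to `D`, `x` and `y` are made up.

```python
from collections import deque


def p279_star(p279_edges, p31_edges):
    """Return the set of (node1, node2) pairs of the reflexive P279star relation.

    p279_edges -- iterable of (subclass, superclass) pairs
    p31_edges  -- iterable of (instance, class) pairs
    """
    parents = {}
    for child, parent in p279_edges:
        parents.setdefault(child, set()).add(parent)

    # Roots: classes used in P31 plus every node that has a superclass.
    roots = {cls for _, cls in p31_edges} | {child for child, _ in p279_edges}

    closure = set()
    for root in roots:
        closure.add((root, root))  # reflexive pair for the root itself
        visited = {root}
        queue = deque([root])
        while queue:  # breadth-first walk up the P279 hierarchy
            node = queue.popleft()
            for parent in parents.get(node, ()):
                if parent not in visited:
                    visited.add(parent)
                    queue.append(parent)
                    closure.add((root, parent))

    # Reflexive pairs for nodes that were only ever reached, never a root.
    closure |= {(node2, node2) for _, node2 in closure}
    return closure


star = p279_star(
    p279_edges=[("A", "B"), ("B", "C")],
    p31_edges=[("x", "A"), ("y", "D")],
)
```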


This is difficult to test with our Wikidata subset because our hierarchy is very sparse.

This is how we would express the typical `?item P31/P279* ?class` pattern in Kypher. 
The example shows how to get all the `n1` that are instances of subclasses of beer (Q44).

In [42]:
!$kgtk query -i $OUT/all.P31.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE  -o - \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q44)' \
    --return 'count(n1) as count'

[2020-09-25 15:30:14 sqlstore]: IMPORT graph directly into table graph_20 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz ...
[2020-09-25 15:30:14 query]: SQL Translation:
---------------------------------------------
  SELECT count(graph_18_c1."node1") "count"
     FROM graph_18 AS graph_18_c1, graph_20 AS graph_20_c2
     WHERE graph_18_c1."label"=?
     AND graph_20_c2."node2"=?
     AND graph_18_c1."node2"=graph_20_c2."node1"
  PARAS: ['P31', 'Q44']
---------------------------------------------
[2020-09-25 15:30:14 sqlstore]: CREATE INDEX on table graph_20 column node1 ...
[2020-09-25 15:30:14 sqlstore]: ANALYZE INDEX on table graph_20 column node1 ...
[2020-09-25 15:30:14 sqlstore]: CREATE INDEX on table graph_18 column node2 ...
[2020-09-25 15:30:14 sqlstore]: ANALYZE INDEX on table graph_18 column node2 ...
[2020-09-25 15:30:14 sqlstore]: CREATE INDEX on table graph_20 column node2 ...
[2020-09-25 15:30:14 sqlstore]: ANALYZE INDEX on table graph_20 co
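
The join that this query performs can be mirrored on small in-memory lists, which may help when checking results by hand. The sketch below is plain Python with made-up edges (`lager`, `beer1`, and so on are hypothetical names); like `count(n1)` in the query, it counts matching rows rather than distinct items.

```python
def instances_of(p31_edges, p279star_edges, target):
    """Return the P31 edges whose class reaches `target` through P279star."""
    # Classes c such that (c)-[:P279star]->(target).
    classes = {node1 for node1, node2 in p279star_edges if node2 == target}
    return [(item, cls) for item, cls in p31_edges if cls in classes]


p279star_edges = [
    ("Q44", "Q44"),      # reflexive pair for the target class
    ("lager", "Q44"),    # hypothetical subclass of Q44
    ("lager", "lager"),
    ("wine", "wine"),    # unrelated hypothetical class
]
p31_edges = [
    ("beer1", "lager"),
    ("beer2", "Q44"),
    ("wine1", "wine"),
]

matches = instances_of(p31_edges, p279star_edges, "Q44")
count = len(matches)
```

Because `P279star` is reflexive, direct instances of `Q44` are found by the same lookup as instances of its subclasses.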

### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`

We do this by concatenating the files and renaming the relation

In [43]:
!$kgtk cat $OUT/all.P31.tsv.gz $OUT/all.P279.tsv.gz \
    | gzip > $TEMP/isa.1.tsv.gz

        1.45 real         1.27 user         0.15 sys


In [44]:
!$kgtk query -i $TEMP/isa.1.tsv.gz --graph-cache $STORE  -o $OUT/all.isa.tsv.gz \
    --match '(n1)-[]->(n2)' \
    --return 'n1, "isa" as label, n2' 

[2020-09-25 15:30:17 sqlstore]: IMPORT graph directly into table graph_21 from /Users/pedroszekely/Downloads/kypher/temp.all.10/isa.1.tsv.gz ...
[2020-09-25 15:30:17 query]: SQL Translation:
---------------------------------------------
  SELECT graph_21_c1."node1", ? "label", graph_21_c1."node2"
     FROM graph_21 AS graph_21_c1
  PARAS: ['isa']
---------------------------------------------
        1.85 real         1.87 user         0.20 sys
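
In plain Python terms, the two steps above amount to concatenating the two edge lists and overwriting the label. The following sketch shows this on two edges taken from the samples earlier in the notebook; the function name `make_isa` is hypothetical.

```python
def make_isa(p31_edges, p279_edges):
    """Concatenate P31 and P279 edges and relabel every edge as `isa`."""
    combined = list(p31_edges) + list(p279_edges)
    return [(node1, "isa", node2) for node1, _label, node2 in combined]


isa_edges = make_isa(
    p31_edges=[("Q45", "P31", "Q3624078")],
    p279_edges=[("Q317", "P279", "Q173424")],
)
```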


Example of how to use the `isa` relation

In [45]:
!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE  -o - \
    --match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44)' \
    --return 'distinct n1, l.label, "Q44" as node2' \
    --limit 10

[2020-09-25 15:30:19 sqlstore]: IMPORT graph directly into table graph_22 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.tsv.gz ...
[2020-09-25 15:30:19 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_22_c1."node1", graph_22_c1."label", ? "node2"
     FROM graph_20 AS graph_20_c2, graph_22 AS graph_22_c1
     WHERE graph_20_c2."node2"=?
     AND graph_22_c1."label"=?
     AND graph_20_c2."node1"=graph_22_c1."node2"
     LIMIT ?
  PARAS: ['Q44', 'Q44', 'isa', 10]
---------------------------------------------
[2020-09-25 15:30:19 sqlstore]: CREATE INDEX on table graph_22 column node2 ...
[2020-09-25 15:30:19 sqlstore]: ANALYZE INDEX on table graph_22 column node2 ...
[2020-09-25 15:30:19 sqlstore]: CREATE INDEX on table graph_22 column label ...
[2020-09-25 15:30:19 sqlstore]: ANALYZE INDEX on table graph_22 column label ...
node1	label	node2
Q2579953	isa	Q44
Q4488344	isa	Q44
Q7587890	isa	Q44
Q10313616	isa	Q44
        1.12 r

### Creating a subset of Wikidata without scholarly articles (Q13442814)
First create a file with the scholarly articles

In [46]:
!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE  -o $OUT/all.isa.Q13442814.tsv.gz \
    --match 'isa: (n1)-[l:isa]->(n2:Q13442814)' \
    --return 'distinct n1, l.label, n2'

[2020-09-25 15:30:20 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_22_c1."node1", graph_22_c1."label", graph_22_c1."node2"
     FROM graph_22 AS graph_22_c1
     WHERE graph_22_c1."label"=?
     AND graph_22_c1."node2"=?
  PARAS: ['isa', 'Q13442814']
---------------------------------------------
        0.76 real         0.59 user         0.14 sys


Now we need to remove from `$EDGES` any edge whose node1 or node2 appears as node1 in `$OUT/all.isa.Q13442814.tsv.gz`. The result will be `$OUT/minus.Q13442814.tsv`. We can then rerun the whole notebook with this new file as `$EDGES` and compute all the product files in a new output directory.

In [47]:
!gzcat $OUT/all.isa.Q13442814.tsv.gz | head | column -t -s $'\t' 

node1     label  node2
Q1801903  isa    Q13442814
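
The subtraction step has not been implemented in this notebook. A minimal in-memory sketch of the filter, in plain Python, could look like the following. It assumes edges are `(id, node1, label, node2)` tuples; the example edges involving `Q1801903` are hypothetical, and a real run over the full edges file would need a streaming or database-backed version.

```python
def remove_nodes(edges, excluded):
    """Drop every edge whose node1 or node2 is in `excluded`."""
    return [
        edge
        for edge in edges
        if edge[1] not in excluded and edge[3] not in excluded
    ]


# node1 values of the scholarly-article file shown above.
scholarly_articles = {"Q1801903"}

edges = [
    ("Q45-P31-1", "Q45", "P31", "Q3624078"),
    ("Q1801903-P31-1", "Q1801903", "P31", "Q13442814"),  # hypothetical id
    ("Q999-P2860-1", "Q999", "P2860", "Q1801903"),       # hypothetical edge
]

remaining = remove_nodes(edges, scholarly_articles)
```

Filtering on both node1 and node2 keeps the result free of edges that would point at a removed node.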


## Summary

In [48]:
!wc -l $OUT/*.tsv $OUT/*.tsv.gz $EDGES

    4882 /Users/pedroszekely/Downloads/kypher/output.all.10/all-distribution.tsv
     143 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279.tsv.gz
     793 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz
    5348 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31.tsv.gz
    5512 /Users/pedroszekely/Downloads/kypher/output.all.10/all.P31_P279.tsv.gz
    1814 /Users/pedroszekely/Downloads/kypher/output.all.10/all.alias.en.tsv.gz
   13162 /Users/pedroszekely/Downloads/kypher/output.all.10/all.alias.tsv.gz
    3125 /Users/pedroszekely/Downloads/kypher/output.all.10/all.commonsMedia.tsv.gz
    6171 /Users/pedroszekely/Downloads/kypher/output.all.10/all.description.en.tsv.gz
  227442 /Users/pedroszekely/Downloads/kypher/output.all.10/all.description.tsv.gz
   22109 /Users/pedroszekely/Downloads/kypher/output.all.10/all.external-id.tsv.gz
      16 /Users/pedroszekely/Downloads/kypher/output.all.10/all.geo-shape.tsv.gz
    1495 /Users/pedroszekely/Downl

Number of distinct items in our dataset

In [49]:
!$kgtk query -i $EDGES --graph-cache $STORE  -o - \
    --match '(n1)-[]->()' \
    --return 'count(distinct n1) as count'

[2020-09-25 15:30:21 query]: SQL Translation:
---------------------------------------------
  SELECT count(DISTINCT graph_5_c1."node1") "count"
     FROM graph_5 AS graph_5_c1
  PARAS: []
---------------------------------------------
count
156559
        6.02 real         4.90 user         1.01 sys


## Other Stuff

In [50]:
!$kgtk query -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz --graph-cache $STORE  -o $OUT/all.isa.Q318.tsv.gz  \
    --match 'isa: (n1)-[l:isa]->(n2:Q318)' \
    --return 'distinct n1, l.label, n2' 

[2020-09-25 15:30:27 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_22_c1."node1", graph_22_c1."label", graph_22_c1."node2"
     FROM graph_22 AS graph_22_c1
     WHERE graph_22_c1."label"=?
     AND graph_22_c1."node2"=?
  PARAS: ['isa', 'Q318']
---------------------------------------------
        0.73 real         0.56 user         0.13 sys
