# Generating Useful Wikidata Files

This notebook generates files containing derived data that is useful in many applications. The input is the full Wikidata dump or a subset of it. The notebook also works for arbitrary KGs as long as they follow the representation requirements of Wikidata:

- the *instance of* relation is represented using the `P31` property
- the *subclass of* relation is represented using the `P279` property
- all properties declare a datatype, and each datatype must be one of the datatypes used in Wikidata.

Inputs:

- `claims_file`: contains all statements, which consist of edges `node1/label/node2` where `label` is a property in Wikidata (sitelinks, labels, aliases and descriptions are not in the claims file).
- `item_file`: the subset of the `claims_file` consisting of edges for properties of data type `wikibase-item`.
- `label_file`, `alias_file` and `description_file`: contain labels, aliases and descriptions. It is assumed that these files contain the labels, aliases and descriptions of all nodes appearing in the claims file. Users may provide these files for specific languages only.

Outputs:

- **Instance of (P31):** `derived.P31.tsv.gz` contains all the `instance of (P31)` edges present in the claims file.
- **Subclass of (P279):** `derived.P279.tsv.gz` contains all the `subclass of (P279)` edges present in the claims file.
- **Is A (isa):** `derived.isa.tsv.gz` contains edges `node1/isa/node2` where either `node1/P31/node2` or `node1/P279/node2` is present in the claims file.
- **Closure of subclass of (P279star):** `derived.P279star.tsv.gz` contains edges `node1/P279star/node2` where `node2` is reachable from `node1` via zero or more hops using the `P279` property. Because zero hops are allowed, the file contains self-links such as `Q44/P279star/Q44`. This file is useful, for example, when you want to find all the instances of a class, including instances of subclasses of the given class.
- **In/out degrees:** `metadata.out_degree.tsv.gz` contains the out degree of every node, and `metadata.in_degree.tsv.gz` contains the in degree of every node.
- **Pagerank:** outputs page rank on the directed graph in `metadata.pagerank.directed.tsv.gz` and page rank on the undirected graph in `metadata.pagerank.undirected.tsv.gz`.

### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p claims_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.property.wikibase-item.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p languages es,ru,zh-cn
```
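
The same invocation can also be assembled from Python, which avoids escaping the spaces in the paths by hand. This is only a sketch: `build_papermill_command` is a hypothetical helper, not part of papermill, and the claims path is a placeholder.

```python
import shlex


def build_papermill_command(notebook, output_notebook, parameters):
    """Return the papermill command line as a list of arguments."""
    args = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        # papermill takes each parameter as: -p <name> <value>
        args.extend(["-p", name, str(value)])
    return args


command = build_papermill_command(
    "Wikidata Useful Files.ipynb",
    "useful-files.out.ipynb",
    {
        "claims_file": "/path/to/all.tsv.gz",  # placeholder path
        "output_folder": "useful_files_v4",
        "languages": "es,ru,zh-cn",
    },
)

# shlex.join quotes any argument that contains spaces.
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run(command)`, in which case no shell quoting is needed at all.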

In [1]:
# Parameters

# Folder on local machine where to create the output and temporary folders
output_path = "/data/amandeep/wikidata-20210215"

# The names of the output and temporary folders
output_folder = "useful_wikidata_files"
temp_folder = "temp.useful_wikidata_files"

# The location of input files
wiki_root_folder = "/data/amandeep/wikidata-20210215"
claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"

label_all = "labels.tsv.gz"
alias_all = "aliases.tsv.gz"
description_all = "descriptions.tsv.gz"

# Location of the cache database for kypher
cache_path = f"{output_path}/{output_folder}/{temp_folder}"

# Whether to delete the cache database
delete_database = False

# Whether to compute pagerank; it may need more memory than a laptop has
compute_pagerank = True
languages = 'en,ru,es,zh-cn,de,it,nl,pl,fr,pt,sv'
debug = "false"
debug = debug.lower() == "true"

In [2]:
languages = languages.split(',')

In [3]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt

## Set up environment and folders to store the files

- `OUT` folder where the output files go
- `TEMP` folder to keep temporary files, including the database
- `kgtk` shortcut to invoke the kgtk software
- `kypher` shortcut to invoke `kgtk query` with the cache database
- `CLAIMS` the `all.tsv` file of wikidata that contains all edges except label/alias/description
- `LABELS` the file with the English labels
- `ITEMS` the wikibase-item file (currently does not include node1 that are properties so for now we need the net file)
- `STORE` location of the cache file

In [4]:
if cache_path:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    os.environ['STORE'] = "{}/{}/{}/wikidata.sqlite3.db".format(output_path, output_folder, temp_folder)
os.environ['OUT'] = "{}/{}".format(output_path, output_folder)
os.environ['TEMP'] = "{}/{}/{}".format(output_path, output_folder, temp_folder)
# Use plain "kgtk" here to silence the debug output
os.environ['kgtk'] = "kgtk --debug"
os.environ['kypher'] = "kgtk --debug query --graph-cache " + os.environ['STORE']
os.environ['CLAIMS'] = f"{wiki_root_folder}/{claims_file}"
os.environ['LABELS'] = f"{wiki_root_folder}/{label_file}"
os.environ['ALIASES'] = f"{wiki_root_folder}/{alias_file}"
os.environ['DESCRIPTIONS'] = f"{wiki_root_folder}/{description_file}"
os.environ['ITEMS'] = f"{wiki_root_folder}/{item_file}"
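
Because the later cells run for hours, it can pay to confirm that every input file exists before starting. The sketch below is an assumption about how such a check might look; `missing_inputs` is a hypothetical helper and the demo uses a temporary folder rather than the real `wiki_root_folder`.

```python
import os
import tempfile


def missing_inputs(root_folder, file_names):
    """Return the names in file_names that do not exist under root_folder."""
    return [
        name
        for name in file_names
        if not os.path.isfile(os.path.join(root_folder, name))
    ]


# Demo on a throwaway folder that contains only one of the two expected files.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "claims.tsv.gz"), "wb").close()
    demo_missing = missing_inputs(root, ["claims.tsv.gz", "labels.en.tsv.gz"])

print(demo_missing)
```

In the notebook, the call would presumably be `missing_inputs(wiki_root_folder, [claims_file, label_file, alias_file, description_file, item_file])`.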

Echo the variables to see if they are all set correctly

In [7]:
!echo $OUT
!echo $TEMP
!echo $kgtk
!echo $kypher
!echo $CLAIMS
!echo $LABELS
!echo $ALIASES
!echo $LABELS
!echo $DESCRIPTIONS
!echo $STORE
!alias col="column -t -s $'\t' "

/data/amandeep/wikidata-20210215/useful_wikidata_files
/data/amandeep/wikidata-20210215/temp.useful_wikidata_files
time kgtk --debug
time kgtk --debug query --graph-cache /data/amandeep/wikidata-20210215/temp.useful_wikidata_files/wikidata.sqlite3.db
/data/amandeep/wikidata-20210215/claims.tsv.gz
/data/amandeep/wikidata-20210215/labels.en.tsv.gz
/data/amandeep/wikidata-20210215/aliases.en.tsv.gz
/data/amandeep/wikidata-20210215/labels.en.tsv.gz
/data/amandeep/wikidata-20210215/descriptions.en.tsv.gz
/data/amandeep/wikidata-20210215/temp.useful_wikidata_files/wikidata.sqlite3.db


Go to the output directory and create the subfolders for the output files and the temporary files

In [8]:
cd $output_path

/data/amandeep/wikidata-20210215


In [10]:
!mkdir -p $OUT
!mkdir -p $TEMP

Clean up the output and temp folders before we start

In [11]:
# !rm $OUT/*.tsv $OUT/*.tsv.gz
# !rm $TEMP/*.tsv $TEMP/*.tsv.gz

In [12]:
if delete_database:
    print("Deleted database") 
    !rm $STORE

In [13]:
!ls -l $OUT
!ls $TEMP
!ls -l "$CLAIMS"
!ls -l "$LABELS"
!ls -l "$ALIASES"
!ls -l "$LABELS"
!ls -l "$DESCRIPTIONS"
!ls $STORE

total 0
-rw-r--r-- 1 amandeep root 25910383480 Feb 26 04:59 /data/amandeep/wikidata-20210215/claims.tsv.gz
-rw-r--r-- 1 amandeep root 2247582427 Feb 26 11:18 /data/amandeep/wikidata-20210215/labels.en.tsv.gz
-rw-r--r-- 1 amandeep root 140720117 Feb 25 10:36 /data/amandeep/wikidata-20210215/aliases.en.tsv.gz
-rw-r--r-- 1 amandeep root 2247582427 Feb 26 11:18 /data/amandeep/wikidata-20210215/labels.en.tsv.gz
-rw-r--r-- 1 amandeep root 632811768 Feb 26 08:53 /data/amandeep/wikidata-20210215/descriptions.en.tsv.gz
ls: cannot access /data/amandeep/wikidata-20210215/temp.useful_wikidata_files/wikidata.sqlite3.db: No such file or directory


In [14]:
!zcat < "$CLAIMS" | head | col

id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video" normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item
P10-P1855-Q4504-a69d2c73-0	P10	P1855	Q4504	normal	wikibase-item

gzip: stdout: Broken pipe


### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.

In [17]:
!$kypher -i "$CLAIMS" --limit 10 | col 

[2021-03-01 11:55:31 sqlstore]: IMPORT graph directly into table graph_1 from /data/amandeep/wikidata-20210215/claims.tsv.gz ...
[2021-03-01 12:29:45 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video" normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item
P10-P1855-Q4504-a69d2c73-0	P10	P1855	Q4504	normal	w

Force creation of the index on the label column

In [18]:
!$kypher -i "$CLAIMS" -o - \
--match '(i)-[:P31]->(c)' \
--limit 5 \
| column -t -s $'\t' 

[2021-03-01 12:29:46 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
     LIMIT ?
  PARAS: ['P31', 5]
---------------------------------------------
[2021-03-01 12:29:46 sqlstore]: CREATE INDEX on table graph_1 column label ...
[2021-03-01 12:46:38 sqlstore]: ANALYZE INDEX on table graph_1 column label ...
id                              node1  label  node2      rank    node2;wikidatatype
P10-P31-Q18610173-85ef4d24-0    P10    P31    Q18610173  normal  wikibase-item
P1000-P31-Q18608871-093affb5-0  P1000  P31    Q18608871  normal  wikibase-item
P1001-P31-Q15720608-deeedec9-0  P1001  P31    Q15720608  normal  wikibase-item
P1001-P31-Q22984026-8beb0cfe-0  P1001  P31    Q22984026  normal  wikibase-item
P1001-P31-Q22997934-1e5b1a96-0  P1001  P31    Q22997934  normal  wikibase-item


Force creation of the index on the node2 column

In [19]:
!$kypher -i "$CLAIMS" -o - \
--match '(i)-[r]->(:Q5)' \
--limit 5 \
| column -t -s $'\t' 

[2021-03-01 12:47:50 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."node2"=?
     LIMIT ?
  PARAS: ['Q5', 5]
---------------------------------------------
[2021-03-01 12:47:50 sqlstore]: CREATE INDEX on table graph_1 column node2 ...
[2021-03-01 13:14:30 sqlstore]: ANALYZE INDEX on table graph_1 column node2 ...
id                         node1  label  node2  rank    node2;wikidatatype
P1424-P1855-Q5-47bdcd17-0  P1424  P1855  Q5     normal  wikibase-item
P1552-P1855-Q5-53b667e4-0  P1552  P1855  Q5     normal  wikibase-item
P1963-P1855-Q5-1ba43aca-0  P1963  P1855  Q5     normal  wikibase-item
P3055-P1629-Q5-fb63cfeb-0  P3055  P1629  Q5     normal  wikibase-item
P5869-P1855-Q5-3a19317f-0  P5869  P1855  Q5     normal  wikibase-item
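
The log lines above show the cache creating an index the first time a query restricts a column. The following sketch uses Python's built-in `sqlite3` module to illustrate what such an index does on a toy table. The table and index names are made up for the example and are not Kypher's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (id TEXT, node1 TEXT, label TEXT, node2 TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?, ?)",
    [
        ("e1", "Q1", "P31", "Q5"),
        ("e2", "Q2", "P279", "Q5"),
        ("e3", "Q3", "P31", "Q515"),
    ],
)

# Index the column used in the WHERE clause, as the cache does for "label".
conn.execute("CREATE INDEX edges_label_idx ON edges (label)")

# Ask SQLite how it would run the query; the last column describes the plan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM edges WHERE label = ?", ("P31",)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

rows = conn.execute(
    "SELECT node1 FROM edges WHERE label = ? ORDER BY node1", ("P31",)
).fetchall()
print(plan_text)
print(rows)
```

On a table with over a billion rows, building the index is slow (about 17 minutes for `label` in the log above), but later queries on that column no longer need a full table scan.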


### Count the number of edges

Counting takes a long time, so the cell below runs only when `debug` is true.

In [20]:
if debug:
    !$kypher -i "$CLAIMS" \
    --match '()-[r]->()' \
    --return 'count(r) as count' \
    --limit 10

[2021-03-01 13:15:59 query]: SQL Translation:
---------------------------------------------
  SELECT count(graph_1_c1."id") "_aLias.count"
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
count
1180224951


### Get labels, aliases and descriptions for other languages

In [21]:
for lang in languages:
    cmd = f"kgtk --debug query --graph-cache {os.environ['STORE']} -i {wiki_root_folder}/{label_all} -o {output_path}/{output_folder}/labels.{lang}.tsv.gz --match '(n1)-[l:label]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}

[2021-03-01 16:17:26 sqlstore]: IMPORT graph directly into table graph_2 from /data/amandeep/wikidata-20210215/labels.tsv.gz ...
[2021-03-01 16:30:42 query]: SQL Translation:
---------------------------------------------
  SELECT graph_2_c1."node1", graph_2_c1."label", graph_2_c1."node2", graph_2_c1."id"
     FROM graph_2 AS graph_2_c1
     WHERE graph_2_c1."label"=?
     AND (kgtk_lqstring_lang_suffix(graph_2_c1."node2") = ?)
  PARAS: ['label', 'ru']
---------------------------------------------
[2021-03-01 16:30:42 sqlstore]: CREATE INDEX on table graph_2 column label ...
[2021-03-01 16:35:29 sqlstore]: ANALYZE INDEX on table graph_2 column label ...
[2021-03-01 16:45:36 query]: SQL Translation:
---------------------------------------------
  SELECT graph_2_c1."node1", graph_2_c1."label", graph_2_c1."node2", graph_2_c1."id"
     FROM graph_2 AS graph_2_c1
     WHERE graph_2_c1."label"=?
     AND (kgtk_lqstring_lang_suffix(graph_2_c1."node2") = ?)
  PARAS: ['label', 'es']
------------

In [22]:
for lang in languages:
    cmd = f"kgtk --debug query --graph-cache {os.environ['STORE']} -i {wiki_root_folder}/{alias_all} -o {output_path}/{output_folder}/aliases.{lang}.tsv.gz --match '(n1)-[l:alias]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}

[2021-03-01 18:17:16 sqlstore]: IMPORT graph directly into table graph_3 from /data/amandeep/wikidata-20210215/aliases.tsv.gz ...
[2021-03-01 18:19:25 query]: SQL Translation:
---------------------------------------------
  SELECT graph_3_c1."node1", graph_3_c1."label", graph_3_c1."node2", graph_3_c1."id"
     FROM graph_3 AS graph_3_c1
     WHERE graph_3_c1."label"=?
     AND (kgtk_lqstring_lang_suffix(graph_3_c1."node2") = ?)
  PARAS: ['alias', 'ru']
---------------------------------------------
[2021-03-01 18:19:25 sqlstore]: CREATE INDEX on table graph_3 column label ...
[2021-03-01 18:20:07 sqlstore]: ANALYZE INDEX on table graph_3 column label ...
[2021-03-01 18:22:15 query]: SQL Translation:
---------------------------------------------
  SELECT graph_3_c1."node1", graph_3_c1."label", graph_3_c1."node2", graph_3_c1."id"
     FROM graph_3 AS graph_3_c1
     WHERE graph_3_c1."label"=?
     AND (kgtk_lqstring_lang_suffix(graph_3_c1."node2") = ?)
  PARAS: ['alias', 'es']
-----------

In [23]:
for lang in languages:
    cmd = f"kgtk --debug query --graph-cache {os.environ['STORE']} -i {wiki_root_folder}/{description_all} -o {output_path}/{output_folder}/descriptions.{lang}.tsv.gz --match '(n1)-[l:description]->(n2)' --where 'n2.kgtk_lqstring_lang_suffix = \"{lang}\"' --return 'n1, l.label, n2, l.id' "
    !{cmd}

[2021-03-01 18:38:00 sqlstore]: IMPORT graph directly into table graph_4 from /data/amandeep/wikidata-20210215/descriptions.tsv.gz ...
[2021-03-01 19:52:20 query]: SQL Translation:
---------------------------------------------
  SELECT graph_4_c1."node1", graph_4_c1."label", graph_4_c1."node2", graph_4_c1."id"
     FROM graph_4 AS graph_4_c1
     WHERE graph_4_c1."label"=?
     AND (kgtk_lqstring_lang_suffix(graph_4_c1."node2") = ?)
  PARAS: ['description', 'ru']
---------------------------------------------
[2021-03-01 19:52:20 sqlstore]: CREATE INDEX on table graph_4 column label ...
[2021-03-01 20:15:45 sqlstore]: ANALYZE INDEX on table graph_4 column label ...
[2021-03-01 21:03:03 query]: SQL Translation:
---------------------------------------------
  SELECT graph_4_c1."node1", graph_4_c1."label", graph_4_c1."node2", graph_4_c1."id"
     FROM graph_4 AS graph_4_c1
     WHERE graph_4_c1."label"=?
     AND (kgtk_lqstring_lang_suffix(graph_4_c1."node2") = ?)
  PARAS: ['description', 
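
The three loops above keep only the values whose language tag matches `lang`. As an illustration of the idea, the sketch below extracts the tag from a language-qualified string of the form `'text'@lang`. `lang_suffix` is a hypothetical stand-in written for this example, not the implementation of `kgtk_lqstring_lang_suffix`, and the edges are toy data.

```python
def lang_suffix(lqstring):
    """Return the language tag of a value like 'text'@en, or None if absent."""
    text, separator, lang = lqstring.rpartition("'@")
    if not separator or not text.startswith("'") or not lang:
        return None
    return lang


# Toy edges in (node1, label, node2) order.
edges = [
    ("Q1", "label", "'universe'@en"),
    ("Q1", "label", "'universo'@es"),
    ("Q1", "label", "'Universum'@de"),
]

spanish = [edge for edge in edges if lang_suffix(edge[2]) == "es"]
print(spanish)
```

Tags with a region part, such as `zh-cn`, are returned whole, which matches how the `languages` parameter lists them.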

### Create the P31 and P279 files

Create the `P31` file

In [24]:
!$kypher -i "$CLAIMS" -o $OUT/derived.P31.tsv.gz \
--match '(n1)-[l:P31]->(n2)' \
--return 'l, n1, l.label, n2' 

[2021-03-02 03:49:07 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['P31']
---------------------------------------------


In [25]:
!zcat < $OUT/derived.P31.tsv.gz | head | col



Create the P279 file

In [26]:
!$kypher -i "$CLAIMS" -o $OUT/derived.P279.tsv.gz \
    --match '(n1)-[l:P279]->(n2)' \
    --return 'l, n1, l.label, n2' 

[2021-03-02 04:05:11 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['P279']
---------------------------------------------


### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [27]:
!$kypher -i $OUT/derived.P279.tsv.gz -o $TEMP/P279.n1.tsv.gz \
--match '(n1)-[l]->()' \
--return 'n1 as id' 

[2021-03-02 04:06:27 sqlstore]: IMPORT graph directly into table graph_5 from /data/amandeep/wikidata-20210215/useful_wikidata_files/derived.P279.tsv.gz ...
[2021-03-02 04:06:31 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."node1" "_aLias.id"
     FROM graph_5 AS graph_5_c1
  PARAS: []
---------------------------------------------


In [28]:
!$kypher -i $OUT/derived.P31.tsv.gz -o $TEMP/P31.n2.tsv.gz \
--match '()-[l]->(n2)' \
--return 'n2 as id' 

[2021-03-02 04:06:46 sqlstore]: IMPORT graph directly into table graph_6 from /data/amandeep/wikidata-20210215/useful_wikidata_files/derived.P31.tsv.gz ...
[2021-03-02 04:08:22 query]: SQL Translation:
---------------------------------------------
  SELECT graph_6_c1."node2" "_aLias.id"
     FROM graph_6 AS graph_6_c1
  PARAS: []
---------------------------------------------


In [29]:
!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \
| gzip > $TEMP/P279.roots.1.tsv.gz

In [30]:
!$kgtk sort2 --mode NONE --column id -i $TEMP/P279.roots.1.tsv.gz \
| gzip > $TEMP/P279.roots.2.tsv.gz

We have lots of duplicates

In [31]:
!zcat < $TEMP/P279.roots.2.tsv.gz | head

id
Q100000030
Q100000058
Q1000017
Q1000032
Q1000032
Q1000039
Q100004761
Q100004761
Q100004761

gzip: stdout: Broken pipe


In [47]:
!$kgtk compact -i $TEMP/P279.roots.2.tsv.gz --mode NONE \
    --presorted \
    --columns id \
    -o $TEMP/P279.roots.tsv
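
Sorting first matters because duplicates in a sorted file are adjacent, so they can be dropped in a single streaming pass without holding all ids in memory. A minimal sketch of that idea, using ids from the preview above (this is not how `kgtk compact` is implemented, only the principle behind `--presorted`):

```python
from itertools import groupby


def dedupe_presorted(values):
    """Yield each value once; assumes equal values are adjacent (sorted input)."""
    for value, _group in groupby(values):
        yield value


sorted_ids = [
    "Q1000017",
    "Q1000032",
    "Q1000032",
    "Q100004761",
    "Q100004761",
    "Q100004761",
]
unique_ids = list(dedupe_presorted(sorted_ids))
print(unique_ids)
```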

Now we can invoke the reachable-nodes command

In [48]:
!$kgtk reachable-nodes \
    --rootfile $TEMP/P279.roots.tsv \
    --selflink \
    -i $OUT/derived.P279.tsv.gz \
| gzip > $TEMP/P279.reachable.tsv.gz

In [49]:
!zcat < $TEMP/P279.reachable.tsv.gz | head | col

node1	label	node2
Q100000030	reachable	Q100000030
Q100000030	reachable	Q14748
Q100000030	reachable	Q14745
Q100000030	reachable	Q1357761
Q100000030	reachable	Q223557
Q100000030	reachable	Q4406616
Q100000030	reachable	Q488383
Q100000030	reachable	Q35120
Q100000030	reachable	Q2424752

gzip: stdout: Broken pipe
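
Conceptually, the command performs a graph traversal from every root and emits one `reachable` edge per node it visits. The sketch below shows that idea with a breadth-first search on a toy graph. The function name and its `selflink` argument only mirror the command-line options used above and are assumptions, not KGTK's code.

```python
from collections import defaultdict, deque


def reachable_nodes(edges, roots, selflink=True):
    """Return (root, 'reachable', node) triples for every node reachable from a root."""
    children = defaultdict(list)
    for node1, node2 in edges:
        children[node1].append(node2)

    result = set()
    for root in roots:
        seen = {root}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nxt in children[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        for node in seen:
            # With selflink, the root itself is reported (the "zero hops" case).
            if node != root or selflink:
                result.add((root, "reachable", node))
    return result


toy_edges = [("A", "B"), ("B", "C")]  # A subclass of B, B subclass of C
closure = reachable_nodes(toy_edges, roots=["A", "C"])
print(sorted(closure))
```

The `seen` set also keeps the traversal from looping forever when the subclass graph contains cycles.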


The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them.

In [50]:
!$kypher -i $TEMP/P279.reachable.tsv.gz -o $TEMP/P279star.1.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "P279star" as label, n2 as node2' 

[2021-03-02 09:28:05 sqlstore]: IMPORT graph directly into table graph_7 from /data/amandeep/wikidata-20210215/temp.useful_wikidata_files/P279.reachable.tsv.gz ...
[2021-03-02 09:29:11 query]: SQL Translation:
---------------------------------------------
  SELECT graph_7_c1."node1", ? "_aLias.label", graph_7_c1."node2" "_aLias.node2"
     FROM graph_7 AS graph_7_c1
  PARAS: ['P279star']
---------------------------------------------


Now sort the renamed edges so that duplicates end up next to each other

In [51]:
!$kgtk sort2 -i $TEMP/P279star.1.tsv.gz -o $TEMP/P279star.2.tsv.gz

Make sure there are no duplicates

In [52]:
!$kgtk compact --presorted -i $TEMP/P279star.2.tsv.gz -o $TEMP/P279star.3.tsv.gz

Add ids

In [62]:
!$kgtk add-id --id-style node1-label-node2-num -i $TEMP/P279star.3.tsv.gz -o $OUT/derived.P279star.tsv.gz

In [63]:
!zcat < $OUT/derived.P279star.tsv.gz | head | col

node1	label	node2	id
Q100000030	P279star	Q100000030	Q100000030-P279star-Q100000030-0000
Q100000030	P279star	Q1357761	Q100000030-P279star-Q1357761-0000
Q100000030	P279star	Q14745	Q100000030-P279star-Q14745-0000
Q100000030	P279star	Q14748	Q100000030-P279star-Q14748-0000
Q100000030	P279star	Q15401930	Q100000030-P279star-Q15401930-0000
Q100000030	P279star	Q15621286	Q100000030-P279star-Q15621286-0000
Q100000030	P279star	Q16686448	Q100000030-P279star-Q16686448-0000
Q100000030	P279star	Q17537576	Q100000030-P279star-Q17537576-0000
Q100000030	P279star	Q223557	Q100000030-P279star-Q223557-0000

gzip: stdout: Broken pipe


This is how to express the typical SPARQL pattern `?item P31/P279* ?class` in Kypher.
The example counts the instances of each subclass of city (Q515).

In [64]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i "$LABELS" \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q515), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, count(c) as count, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'count(c) desc, c, n1' \
    --limit 10 \
    | col

[2021-03-02 11:13:34 sqlstore]: IMPORT graph directly into table graph_8 from /data/amandeep/wikidata-20210215/useful_wikidata_files/derived.P279star.tsv.gz ...
[2021-03-02 11:15:10 sqlstore]: IMPORT graph directly into table graph_9 from /data/amandeep/wikidata-20210215/labels.en.tsv.gz ...
[2021-03-02 11:17:42 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_9_c4."node1" "_aLias.class", count(graph_9_c4."node1") "_aLias.count", graph_9_c4."node2" "_aLias.class name", graph_9_c3."node1" "_aLias.instance", graph_9_c3."node2" "_aLias.label"
     FROM graph_6 AS graph_6_c1, graph_8 AS graph_8_c2, graph_9 AS graph_9_c3, graph_9 AS graph_9_c4
     WHERE graph_6_c1."label"=?
     AND graph_8_c2."node2"=?
     AND graph_9_c3."label"=?
     AND graph_9_c4."label"=?
     AND graph_6_c1."node1"=graph_9_c3."node1"
     AND graph_6_c1."node2"=graph_8_c2."node1"
     AND graph_8_c2."node1"=graph_9_c4."node1"
     GROUP BY "_aLias.class"
     ORDER BY c

Illustrate that it is indeed `P279*`

In [65]:
if debug:
    !$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i "$LABELS" \
    --match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q63440326), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
    --return 'distinct c as class, c_label as `class name`, n1 as instance, label as `label`' \
    --order-by 'c, n1' \
    --limit 10 \
    | col 

[2021-03-02 11:23:52 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_9_c4."node1" "_aLias.class", graph_9_c4."node2" "_aLias.class name", graph_9_c3."node1" "_aLias.instance", graph_9_c3."node2" "_aLias.label"
     FROM graph_6 AS graph_6_c1, graph_8 AS graph_8_c2, graph_9 AS graph_9_c3, graph_9 AS graph_9_c4
     WHERE graph_6_c1."label"=?
     AND graph_8_c2."node2"=?
     AND graph_9_c3."label"=?
     AND graph_9_c4."label"=?
     AND graph_6_c1."node1"=graph_9_c3."node1"
     AND graph_6_c1."node2"=graph_8_c2."node1"
     AND graph_8_c2."node1"=graph_9_c4."node1"
     ORDER BY graph_9_c4."node1" ASC, graph_9_c3."node1" ASC
     LIMIT ?
  PARAS: ['P31', 'Q63440326', 'label', 'label', 10]
---------------------------------------------
class	class name	instance	label
Q63440326	'city of Oregon'@en	Q1003672	'Cascade Locks'@en
Q63440326	'city of Oregon'@en	Q1003826	'Yamhill'@en
Q63440326	'city of Oregon'@en	Q1003838	'La Pine'@en
Q63440326	'ci

Test that `P279star` is indeed star

In [66]:
if debug:
    !$kypher -i $OUT/derived.P279star.tsv.gz \
    --match '(n1:Q44)-[:P279star]->(n2:Q44)'

[2021-03-02 11:23:53 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_8 AS graph_8_c1
     WHERE graph_8_c1."label"=?
     AND graph_8_c1."node1"=?
     AND graph_8_c1."node2"=?
  PARAS: ['P279star', 'Q44', 'Q44']
---------------------------------------------
[2021-03-02 11:23:53 sqlstore]: CREATE INDEX on table graph_8 column label ...
[2021-03-02 11:24:29 sqlstore]: ANALYZE INDEX on table graph_8 column label ...
node1	label	node2	id
Q44	P279star	Q44	Q44-P279star-Q44-0000
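
The `P31/P279*` pattern used in the queries above amounts to a join: take each `P31` edge and keep it if its class has a `P279star` edge to the target class. The sketch below illustrates this on toy data in plain Python. The node names are invented and `instances_per_class` is a hypothetical helper.

```python
from collections import Counter

# Toy data: (instance, class) pairs and a P279star closure with self-links.
p31 = [("X1", "C1"), ("X2", "C1"), ("X3", "C2"), ("X4", "C3")]
p279star = {
    ("C0", "C0"),
    ("C1", "C1"),
    ("C1", "C0"),
    ("C2", "C2"),
    ("C2", "C0"),
    ("C3", "C3"),
}


def instances_per_class(p31_edges, p279star_edges, target):
    """Count instances per class for all classes at or below target."""
    below_target = {n1 for n1, n2 in p279star_edges if n2 == target}
    return Counter(cls for _instance, cls in p31_edges if cls in below_target)


counts = instances_per_class(p31, p279star, "C0")
print(counts.most_common())
```

Because the closure contains self-links, direct instances of the target class itself would be counted too, with no special case needed.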


### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`

We do this by concatenating the files and renaming the relation

In [67]:
!$kgtk cat -i $OUT/derived.P31.tsv.gz $OUT/derived.P279.tsv.gz \
-o $TEMP/isa.1.tsv.gz

In [68]:
!$kypher -i $TEMP/isa.1.tsv.gz -o $OUT/derived.isa.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "isa" as label, n2' 

[2021-03-02 11:32:09 sqlstore]: IMPORT graph directly into table graph_10 from /data/amandeep/wikidata-20210215/temp.useful_wikidata_files/isa.1.tsv.gz ...
[2021-03-02 11:33:50 query]: SQL Translation:
---------------------------------------------
  SELECT graph_10_c1."node1", ? "_aLias.label", graph_10_c1."node2"
     FROM graph_10 AS graph_10_c1
  PARAS: ['isa']
---------------------------------------------
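
In plain Python terms, the two steps above concatenate the edge lists and overwrite the label. A minimal sketch on toy edges (`make_isa` is a hypothetical helper):

```python
def make_isa(p31_edges, p279_edges):
    """Concatenate both edge lists and relabel every edge as 'isa'."""
    combined = list(p31_edges) + list(p279_edges)
    return [(node1, "isa", node2) for node1, _label, node2 in combined]


p31_edges = [("Q1", "P31", "Q5")]
p279_edges = [("Q5", "P279", "Q2")]
isa_edges = make_isa(p31_edges, p279_edges)
print(isa_edges)
```

Like the cells above, this sketch does not remove duplicates, so a pair connected by both `P31` and `P279` would appear twice.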


Example of how to use the `isa` relation

In [69]:
!$kypher -i $OUT/derived.isa.tsv.gz -i $OUT/derived.P279star.tsv.gz -i "$LABELS" -o - \
--match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label)' \
--return 'distinct n1, l.label, "Q44" as node2, label as n1_label' \
--limit 10 \
| col

[2021-03-02 11:39:35 sqlstore]: IMPORT graph directly into table graph_11 from /data/amandeep/wikidata-20210215/useful_wikidata_files/derived.isa.tsv.gz ...
[2021-03-02 11:40:39 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_9_c3."node1", graph_11_c1."label", ? "_aLias.node2", graph_9_c3."node2" "_aLias.n1_label"
     FROM graph_11 AS graph_11_c1, graph_8 AS graph_8_c2, graph_9 AS graph_9_c3
     WHERE graph_11_c1."label"=?
     AND graph_8_c2."node2"=?
     AND graph_9_c3."label"=?
     AND graph_11_c1."node1"=graph_9_c3."node1"
     AND graph_11_c1."node2"=graph_8_c2."node1"
     LIMIT ?
  PARAS: ['Q44', 'isa', 'Q44', 'label', 10]
---------------------------------------------
[2021-03-02 11:40:39 sqlstore]: CREATE INDEX on table graph_11 column node2 ...
[2021-03-02 11:41:41 sqlstore]: ANALYZE INDEX on table graph_11 column node2 ...
[2021-03-02 11:41:46 sqlstore]: CREATE INDEX on table graph_11 column node1 ...
[2021-03-02 11:42:24 sql

### Create files with `isa/P279*` and `P31/P279*`
These files are useful to find all nodes that are below a q-node via P279 or isa.

> These files are very large and take many hours to compute

In [75]:
os.environ['P279STAR'] = f"{os.environ['OUT']}/derived.P279star.tsv.gz"
os.environ['ISA'] = f"{os.environ['OUT']}/derived.isa.tsv.gz"

In [76]:
!ls -l $CLAIMS
!ls -l $P279STAR
!ls -l $ISA

-rw-r--r-- 1 amandeep root 25910383480 Feb 26 04:59 /data/amandeep/wikidata-20210215/claims.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 555040419 Mar  2 11:12 /data/amandeep/wikidata-20210215/useful_wikidata_files/derived.P279star.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 301114416 Mar  2 11:39 /data/amandeep/wikidata-20210215/useful_wikidata_files/derived.isa.tsv.gz


In [77]:
!$kypher -i "$P279STAR" -i "$ISA"  \
--match '\
  isa: (n1)-[]->(n2), \
  P279star: (n2)-[]->(n3)' \
--return 'distinct n1 as node1, "isa_star" as label, n3 as node2' \
-o "$TEMP"/derived.isastar_1.tsv.gz

[2021-03-02 11:48:35 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_11_c1."node1" "_aLias.node1", ? "_aLias.label", graph_8_c2."node2" "_aLias.node2"
     FROM graph_11 AS graph_11_c1, graph_8 AS graph_8_c2
     WHERE graph_11_c1."node2"=graph_8_c2."node1"
  PARAS: ['isa_star']
---------------------------------------------


Now add ids and sort it

In [78]:
!$kgtk add-id --id-style wikidata -i "$TEMP"/derived.isastar_1.tsv.gz \
/ sort2 -o "$OUT"/derived.isastar.tsv.gz

It is very big

In [79]:
if debug:
    !zcat < "$OUT"/derived.isastar.tsv.gz | wc

2617235272 10468941088 145469856635


Also calculate the same file for P31/P279*

In [80]:
!$kypher -i "$CLAIMS" -i "$P279STAR" \
--match '\
  claims: (n1)-[:P31]->(n2), \
  P279star: (n2)-[]->(n3)' \
--return 'distinct n1 as node1, "P31P279star" as label, n3 as node2' \
-o "$TEMP"/derived.P31P279star.gz

[2021-03-02 22:51:38 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node1" "_aLias.node1", ? "_aLias.label", graph_8_c2."node2" "_aLias.node2"
     FROM graph_1 AS graph_1_c1, graph_8 AS graph_8_c2
     WHERE graph_1_c1."label"=?
     AND graph_1_c1."node2"=graph_8_c2."node1"
  PARAS: ['P31P279star', 'P31']
---------------------------------------------


Add ids and sort it

In [81]:
!$kgtk add-id --id-style wikidata -i "$TEMP"/derived.P31P279star.gz \
/ sort2 -o "$OUT"/derived.P31P279star.tsv.gz

It is also very big

In [82]:
if debug:
    !zcat < "$OUT"/derived.P31P279star.tsv.gz | wc

2593337226 10373348904 159708531569


## Compute pagerank

Now compute pagerank. These commands will exceed 16GB memory for graphs containing over 25 million nodes.

In [88]:
if compute_pagerank:
    !$kgtk graph-statistics -i "$ITEMS" -o $OUT/metadata.pagerank.directed.tsv.gz \
    --page-rank-property directed_pagerank \
    --pagerank --statistics-only \
    --log $TEMP/metadata.pagerank.directed.summary.txt 

In [89]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.directed.summary.txt

graph loaded! It has 90168160 nodes and 527218491 edges

###Top relations:
P2860	181975473
P31	94148018
P1433	37049039
P50	20454513
P17	13639526
P407	13632234
P921	12555461
P131	10428633
P106	8273145
P6259	8083472

###PageRank
Max pageranks
3724	Q4167836	0.024227
38679	Q13442814	0.023336
7471	Q1860	0.007249
2306	Q5	0.006819
966866	Q35252665	0.006501


In [90]:
if compute_pagerank:
    !$kgtk graph-statistics -i "$ITEMS" -o $OUT/metadata.pagerank.undirected.tsv.gz \
    --page-rank-property undirected_pagerank \
    --pagerank --statistics-only --undirected \
    --log $TEMP/metadata.pagerank.undirected.summary.txt 

In [91]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.undirected.summary.txt 

graph loaded! It has 90168160 nodes and 527218491 edges

###Top relations:
P2860	181975473
P31	94148018
P1433	37049039
P50	20454513
P17	13639526
P407	13632234
P921	12555461
P131	10428633
P106	8273145
P6259	8083472

###PageRank
Max pageranks
38679	Q13442814	0.039830
118262	Q1264450	0.013909
3724	Q4167836	0.012974
2306	Q5	0.008969
7471	Q1860	0.008689
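
For readers unfamiliar with PageRank, the sketch below computes it by power iteration on a toy graph and shows what the `--undirected` switch changes conceptually: every edge is also followed in reverse. This is an illustration only. The damping factor of 0.85 and the fixed iteration count are assumptions, and the code is not what `kgtk graph-statistics` runs internally.

```python
def pagerank(edges, damping=0.85, iterations=50, undirected=False):
    """Return a dict node -> PageRank score computed by power iteration."""
    edges = list(edges)
    if undirected:
        # Treat every edge as traversable in both directions.
        edges = edges + [(b, a) for a, b in edges]

    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for a, b in edges:
        out_links[a].append(b)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


toy = [("a", "hub"), ("b", "hub"), ("c", "hub")]
directed = pagerank(toy)
undirected = pagerank(toy, undirected=True)
print(max(directed, key=directed.get), max(undirected, key=undirected.get))
```

The scores always sum to 1, which is why even the top nodes in the summaries above have small values on a graph with 90 million nodes.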


## Compute Degrees

Kypher can compute the out degree by counting the node2s for each node1

In [92]:
!$kypher -i "$CLAIMS" -o $TEMP/metadata.out_degree.tsv.gz \
--match '(n1)-[l]->()' \
--return 'distinct n1 as node1, count(distinct l) as node2, "out_degree" as label' 

[2021-03-03 14:42:51 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node1" "_aLias.node1", count(DISTINCT graph_1_c1."id") "_aLias.node2", ? "_aLias.label"
     FROM graph_1 AS graph_1_c1
     GROUP BY "_aLias.node1"
  PARAS: ['out_degree']
---------------------------------------------


In [105]:
!$kgtk add-id --id-style wikidata -i $TEMP/metadata.out_degree.tsv.gz \
/ sort2 -o $OUT/metadata.out_degree.tsv.gz

To count the in-degree we only care when the node2 is a wikibase-item

In [94]:
# BUG in kypher: this command sometimes produces multiple rows for the same Qnode.
# Deleting the cache database and rerunning fixes it.
!$kypher -i "$CLAIMS" -o $TEMP/metadata.in_degree.tsv.gz \
--match '()-[l]->(n2 {`wikidatatype`:"wikibase-item"})' \
--return 'distinct n2 as node1, count(distinct l) as node2, "in_degree" as label' \
--order-by 'n2'

[2021-03-03 15:20:22 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node2" "_aLias.node1", count(DISTINCT graph_1_c1."id") "_aLias.node2", ? "_aLias.label"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."node2;wikidatatype"=?
     GROUP BY "_aLias.node1"
     ORDER BY graph_1_c1."node2" ASC
  PARAS: ['in_degree', 'wikibase-item']
---------------------------------------------


In [106]:
!$kgtk add-id --id-style wikidata -i $TEMP/metadata.in_degree.tsv.gz \
/ sort2 -o $OUT/metadata.in_degree.tsv.gz
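
The effect of the `wikibase-item` restriction can be sketched in plain Python as follows. The rows and the `in_degree` helper are hypothetical; the fourth field stands in for the `node2;wikidatatype` column, and each row is assumed to be a distinct edge.

```python
from collections import Counter

def in_degree(edges):
    """Count incoming edges per node2, keeping only wikibase-item targets."""
    counts = Counter()
    for node1, label, node2, node2_type in edges:
        if node2_type == "wikibase-item":
            counts[node2] += 1
    # Sort keys as strings, as the unquoted ORDER BY on node2 does.
    return dict(sorted(counts.items()))

# Toy rows (node1, label, node2, node2;wikidatatype); values are illustrative.
edges = [
    ("Q1", "P31", "Q5", "wikibase-item"),
    ("Q2", "P31", "Q5", "wikibase-item"),
    ("Q1", "P569", "^1952-03-11T00:00:00Z/11", "time"),  # literal target, skipped
    ("Q2", "P17", "Q30", "wikibase-item"),
]
in_degrees = in_degree(edges)
print(in_degrees)
```

Without the filter, literal values such as dates would appear as nodes with an in-degree of their own.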

In [97]:
!zcat < $OUT/metadata.in_degree.tsv.gz | head | col

node1	node2	label	id
Q1	82	in_degree	Q1-in_degree-82-0000
Q100	12349	in_degree	Q100-in_degree-12349-0000
Q1000	6529	in_degree	Q1000-in_degree-6529-0000
Q100000	124	in_degree	Q100000-in_degree-124-0000
Q10000000	1	in_degree	Q10000000-in_degree-1-0000
Q100000001	3	in_degree	Q100000001-in_degree-3-0000
Q10000002	1	in_degree	Q10000002-in_degree-1-0000
Q100000040	1	in_degree	Q100000040-in_degree-1-0000
Q10000005	1	in_degree	Q10000005-in_degree-1-0000

gzip: stdout: Broken pipe


Calculate the distribution of in-degrees and out-degrees so we can chart them.

In [98]:
!$kypher -i $OUT/metadata.in_degree.tsv.gz -o $OUT/statistics.in_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as in_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 

[2021-03-03 16:43:34 sqlstore]: IMPORT graph directly into table graph_12 from /data/amandeep/wikidata-20210215/useful_wikidata_files/metadata.in_degree.tsv.gz ...
[2021-03-03 16:44:08 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_12_c1."node2" "_aLias.in_degree", count(DISTINCT graph_12_c1."node1") "_aLias.count", ? "_aLias.label"
     FROM graph_12 AS graph_12_c1
     GROUP BY "_aLias.in_degree"
     ORDER BY CAST(graph_12_c1."node2" AS integer) ASC
  PARAS: ['count']
---------------------------------------------


In [107]:
!head $OUT/statistics.in_degree.distribution.tsv | col

in_degree	count	label
1	10695755	count
2	4087073 count
3	2203179 count
4	1458671 count
5	1119862 count
6	876990	count
7	724348	count
8	587460	count
9	500293	count
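
The distribution query groups nodes by their degree value and orders the groups numerically. A minimal Python sketch of the same idea is shown below, using a few in-degree values from the output earlier in this section; the `degree_distribution` helper is hypothetical.

```python
from collections import Counter

def degree_distribution(degree_by_node):
    """Map each degree to the number of nodes having it, sorted numerically."""
    counts = Counter(int(degree) for degree in degree_by_node.values())
    return sorted(counts.items())

# Degrees are stored as strings in the TSV file, hence the int() conversion.
in_degree = {"Q1": "82", "Q100": "12349", "Q10000000": "1", "Q10000002": "1"}
distribution = degree_distribution(in_degree)
print(distribution)
```

Sorting the raw strings instead would place `12349` before `82`, which is why the query orders by `cast(n2, integer)`.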


In [100]:
!$kypher -i $OUT/metadata.out_degree.tsv.gz -o $OUT/statistics.out_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as out_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 

[2021-03-03 16:44:51 sqlstore]: IMPORT graph directly into table graph_13 from /data/amandeep/wikidata-20210215/useful_wikidata_files/metadata.out_degree.tsv.gz ...
[2021-03-03 16:46:37 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_13_c1."node2" "_aLias.out_degree", count(DISTINCT graph_13_c1."node1") "_aLias.count", ? "_aLias.label"
     FROM graph_13 AS graph_13_c1
     GROUP BY "_aLias.out_degree"
     ORDER BY CAST(graph_13_c1."node2" AS integer) ASC
  PARAS: ['count']
---------------------------------------------


Draw some charts

In [101]:
if debug:
    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.in_degree.distribution.tsv", sep="\t"
    )

    alt.Chart(data).mark_circle(size=60).encode(
        x=alt.X("in_degree", scale=alt.Scale(type="log")),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["in_degree", "count"],
    ).interactive().properties(title="Distribution of In Degree")

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation

alt.Chart(...)
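
The error above is raised because the distribution file has more than 5000 rows. Altair provides `alt.data_transformers.disable_max_rows()` to lift the limit. Another option, sketched below under the assumption that a coarser chart is acceptable, is to aggregate the distribution into logarithmic bins before plotting. The `log_bin` helper and its `bins_per_decade` parameter are hypothetical.

```python
import math
from collections import defaultdict

def log_bin(rows, bins_per_decade=10):
    """Sum node counts into logarithmically spaced degree bins.

    rows: iterable of (degree, count) pairs with degree >= 1.
    Returns a sorted list of (bin_lower_bound, total_count) pairs.
    """
    totals = defaultdict(int)
    for degree, count in rows:
        index = math.floor(math.log10(degree) * bins_per_decade)
        totals[index] += count
    return [(10 ** (i / bins_per_decade), c) for i, c in sorted(totals.items())]

# The first three rows come from the in-degree distribution; the last two are made up.
rows = [(1, 10695755), (2, 4087073), (3, 2203179), (2000, 5), (2001, 4)]
binned = log_bin(rows)
print(binned)
```

With ten bins per decade, degrees up to one billion fit in about 90 rows, well under the limit, and the log-log chart keeps its overall shape.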

In [102]:
if debug:
    data = pd.read_csv(
        os.environ["OUT"] + "/statistics.out_degree.distribution.tsv", sep="\t"
    )

    alt.Chart(data).mark_circle(size=60).encode(
        x=alt.X("out_degree", scale=alt.Scale(type="log")),
        y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
        tooltip=["out_degree", "count"],
    ).interactive().properties(title="Distribution of Out Degree")

## Summary of results

In [103]:
!ls -lh $OUT/*

-rw-r--r-- 1 amandeep isdstaff  32M Mar  1 18:27 /data/amandeep/wikidata-20210215/useful_wikidata_files/aliases.de.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  24M Mar  1 18:23 /data/amandeep/wikidata-20210215/useful_wikidata_files/aliases.es.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  27M Mar  1 18:34 /data/amandeep/wikidata-20210215/useful_wikidata_files/aliases.fr.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  13M Mar  1 18:29 /data/amandeep/wikidata-20210215/useful_wikidata_files/aliases.it.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  26M Mar  1 18:31 /data/amandeep/wikidata-20210215/useful_wikidata_files/aliases.nl.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 6.9M Mar  1 18:32 /data/amandeep/wikidata-20210215/useful_wikidata_files/aliases.pl.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 9.2M Mar  1 18:36 /data/amandeep/wikidata-20210215/useful_wikidata_files/aliases.pt.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  34M Mar  1 18:22 /data/amandeep/wikidata-20210215/useful_wikidata_files/aliases.ru.tsv.gz
-rw-r--r-- 1 amandeep is

Highest page rank

In [5]:
if debug:
    if compute_pagerank:
        !$kypher -i $OUT/metadata.pagerank.undirected.tsv.gz -i "$LABELS" \
        --match 'pagerank: (n1)-[:undirected_pagerank]->(page_rank), label: (n1)-[:label]->(label)' \
        --return 'distinct n1, label as label, page_rank as `undirected page rank`' \
        --order-by 'page_rank desc' \
        --limit 10 

[2021-05-17 11:13:55 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_2_c1."node1", graph_3_c2."node2" "_aLias.label", graph_2_c1."node2" "_aLias.undirected page rank"
     FROM graph_2 AS graph_2_c1, graph_3 AS graph_3_c2
     WHERE graph_2_c1."label"=?
     AND graph_3_c2."label"=?
     AND graph_2_c1."node1"=graph_3_c2."node1"
     ORDER BY graph_2_c1."node2" DESC
     LIMIT ?
  PARAS: ['undirected_pagerank', 'label', 10]
---------------------------------------------
node1	label	undirected page rank
Q75818371	'Charles Illingworth'@en	9.999999243667743e-09
Q23417931	'3-deoxy-manno-octulosonate cytidylyltransferase BN117_1886'@en	9.999999002311657e-09
Q23911432	'Gahnite in the Metamorphosed Stratiform Massive Sulfide Deposits of the Mineral District, Virginia, U.S.A'@en	9.999998269043717e-09
Q23352480	'hypothetical protein BSU10100'@en	9.999997894232425e-09
Q80741080	'Cl* NGC 6981 KBP S5'@en	9.999997538018684e-09
Q80741836	'[GMF2013] VVDS2
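
The values listed above are close to 1e-08, far below the maxima of about 0.04 reported in the page rank summary file. This suggests that the ordering compared the page rank values as strings rather than as numbers. The sketch below reproduces the effect in plain Python with three values taken from the outputs in this notebook. Ordering by `cast(page_rank, float)`, by analogy with the `cast(n2, integer)` used for the degree distributions, would presumably avoid it, although that variant was not run here.

```python
# Page rank values as they appear in the TSV files (strings).
page_ranks = {
    "Q13442814": "0.039830",
    "Q1264450": "0.013909",
    "Q75818371": "9.999999243667743e-09",
}

# Descending string comparison puts "9.99...e-09" first because "9" > "0".
by_string = sorted(page_ranks, key=lambda q: page_ranks[q], reverse=True)

# Converting to float first gives the numerically highest values.
by_number = sorted(page_ranks, key=lambda q: float(page_ranks[q]), reverse=True)

print(by_string)
print(by_number)
```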

## Create DWD ISA (Variant of IS A)

In [None]:
!$kgtk filter -i "$CLAIMS" -p '; P31, P279, P106, P39 ;' -o "$TEMP"/derived.P31_39_106_279.1.tsv.gz

In [None]:
# The with block closes the file on exit, so no explicit fp.close() is needed.
with open(os.environ['TEMP'] + '/custom-edges.tsv', 'w') as fp:
    fp.write("node1\tlabel\tnode2\n")
    fp.write("Q215627\tdwd_isa\tQ5\n") # person dwd_isa human
    fp.write("Q12737077\tdwd_isa\tQ5\n") # occupation dwd_isa human (perhaps controversial)
    fp.write("Q5\tdwd_isa_\tQ215627\n") # inverse
    fp.write("Q5\tdwd_isa_\tQ12737077\n") # inverse

In [None]:
!$kgtk cat -i "$TEMP"/derived.P31_39_106_279.1.tsv.gz -i "$TEMP"/custom-edges.tsv \
-o "$OUT"/derived.dwd_isa.tsv.gz

In [None]:
!zcat < "$OUT"/derived.dwd_isa.tsv.gz | wc

## Compute TF-IDF: Class and Property Count Files

### Class Counts

In [None]:
!$kypher  -i "$P279STAR" -i "$OUT"/derived.dwd_isa.tsv.gz \
--match 'dwd_isa: (n1)-[]->(class), P279star: (class)-[]->(super_class)' \
--return 'distinct class as node1, "P31_39_106_279star" as label, super_class as node2' \
--order-by 'node1, label, node2' \
/ add-id --id-style wikidata \
-o "$OUT"/derived.P31_39_106_279star.tsv.gz

In [None]:
!$kypher -i "$OUT"/derived.P31_39_106_279star.tsv.gz --as dwd_isa_star -i "$OUT"/derived.dwd_isa.tsv.gz --as P31_39_106_279 \
--match 'P31_39_106_279: (n1)-[]->(class), dwd_isa_star: (class)-[]->(super_class)' \
--return 'distinct super_class as node1, count(distinct n1) as node2, "P31_39_106_279_count" as label' \
--order-by 'node1, label, node2' \
-o "$OUT"/derived.dwd.count.tsv.gz
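
Taken together, the two queries above count, for every super class, the distinct nodes whose class reaches that super class through the closure. The following Python sketch shows the same computation on toy data; the node and class names and the `class_counts` helper are hypothetical.

```python
from collections import defaultdict

def class_counts(isa_edges, closure_edges):
    """Count distinct nodes whose class reaches each super class."""
    supers = defaultdict(set)
    for cls, super_cls in closure_edges:
        supers[cls].add(super_cls)
    members = defaultdict(set)
    for node, cls in isa_edges:
        for super_cls in supers.get(cls, ()):
            members[super_cls].add(node)
    return {cls: len(nodes) for cls, nodes in members.items()}

# (node, class) pairs, standing in for the dwd_isa edges.
isa_edges = [("n1", "C1"), ("n2", "C1"), ("n3", "C2")]
# (class, super_class) pairs; the closure includes zero-hop self edges.
closure_edges = [("C1", "C1"), ("C1", "C0"), ("C2", "C2"), ("C2", "C0")]
counts = class_counts(isa_edges, closure_edges)
print(counts)
```

Because the closure contains the zero-hop edges, each class is also counted for its own direct members.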

In [None]:
!$kypher -i dwd_isa_star -i P31_39_106_279 -i "$OUT"/derived.dwd.count.tsv.gz \
--match 'P31_39_106_279: (n1)-[]->(class), dwd_isa_star: (class)-[]->(super_class), count: (super_class)-[]->(count)' \
--return 'distinct n1 as node1, "class_count" as label, printf("%s:%s", super_class, count) as node2' \
--order-by 'node1, label, node2' \
-o "$TEMP"/dwd_isa_class_count.tsv.gz

In [None]:
!kgtk sort -i "$TEMP"/dwd_isa_class_count.tsv.gz -X "--parallel 8 --buffer-size 60%" -o "$TEMP"/dwd_isa_class_count.sorted.tsv.gz

In [None]:
!kgtk compact -i "$TEMP"/dwd_isa_class_count.sorted.tsv.gz --mode=NONE --columns node1 label --presorted True -o "$OUT"/dwd_isa_class_count.compact.tsv.gz
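
The sort and compact steps above gather all `class:count` pairs of a node into a single row. A minimal sketch of that grouping follows, assuming values are joined with the `|` list separator used in KGTK files and that repeated values are kept once. The `compact` function here is a hypothetical stand-in, not the KGTK implementation.

```python
from itertools import groupby

def compact(rows):
    """Merge presorted (node1, label, node2) rows that share node1 and label."""
    merged = []
    for (node1, label), group in groupby(rows, key=lambda r: (r[0], r[1])):
        values = []
        for _, _, node2 in group:
            if node2 not in values:  # keep each value once
                values.append(node2)
        merged.append((node1, label, "|".join(values)))
    return merged

# Rows must already be sorted by node1 and label, as with --presorted.
rows = [
    ("n1", "class_count", "C0:3"),
    ("n1", "class_count", "C1:2"),
    ("n3", "class_count", "C0:3"),
]
compacted = compact(rows)
print(compacted)
```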

### Property Counts

#### For each property, count the distinct node1 values it occurs with

In [None]:
!$kypher -i "$CLAIMS" \
--match '(n1)-[l {label: property}]->()' \
--return 'distinct property as node1, count(distinct n1) as node2, "nodes_count" as label' \
-o "$TEMP"/property.count.tsv.gz

#### For each item, list the properties it has

In [None]:
!$kypher -i "$CLAIMS" \
--match '(n1)-[l {label: property}]->()' \
--return 'distinct n1 as node1, property as node2, "property" as label' \
-o "$TEMP"/item.property.tsv.gz

#### Combine the property and the counts into one column

In [None]:
!$kypher -i "$TEMP"/property.count.tsv.gz -i "$TEMP"/item.property.tsv.gz \
--match 'count: (property)-[]->(count), item: (n1)-[]->(property)' \
--return 'distinct n1 as node1, "property_count" as label, printf("%s:%s", property, count) as node2' \
--order-by 'node1, label, node2' \
-o "$TEMP"/item.property.count.tsv.gz

#### Put all the property/count pairs in one row for each node

In [None]:
!kgtk sort -i "$TEMP"/item.property.count.tsv.gz --sort-command gsort -X "--parallel 8 --buffer-size 60%" -o "$TEMP"/item.property.count.sorted.tsv.gz

In [None]:
!kgtk compact -i "$TEMP"/item.property.count.sorted.tsv.gz --mode=NONE --columns node1 label --presorted True -o "$OUT"/item.property.count.compact.tsv.gz
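
The cells above produce counts only; the notebook does not state how they are turned into TF-IDF weights downstream. As one possibility, the sketch below parses a compacted `property:count` value and derives inverse document frequency weights, assuming the common `log(N / count)` formulation. The `idf_weights` helper, the sample value and `total_nodes` are all hypothetical.

```python
import math

def idf_weights(compact_value, total_nodes):
    """Turn 'P31:80|P106:8' into {property: log(total_nodes / count)}."""
    weights = {}
    for pair in compact_value.split("|"):
        prop, count = pair.rsplit(":", 1)
        weights[prop] = math.log(total_nodes / int(count))
    return weights

# Made-up counts: P31 occurs on 80 of 100 nodes, P106 on 8 of 100.
weights = idf_weights("P31:80|P106:8", total_nodes=100)
print(weights)
```

Under this formulation a rare property receives a larger weight than a common one, so it contributes more when comparing items.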