# Generating Useful Wikidata Files

This notebook generates files containing derived data that is useful in many applications. The input is the full Wikidata or a subset of it. The notebook also works for arbitrary KGs as long as they follow the representation requirements of Wikidata:

- the *instance of* relation is represented using the `P31` property
- the *subclass of* relation is represented using the `P279` property
- every property declares a datatype, and each datatype must be one of the Wikidata datatypes

Inputs:

- `claims_file`: contains all statements, which consist of edges `node1/label/node2` where `label` is a property in Wikidata (sitelinks, labels, aliases and descriptions are not in the claims file).
- `item_file`: the subset of the `claims_file` consisting of edges for properties of data type `wikibase-item`.
- `label_file`, `alias_file` and `description_file`: contain labels, aliases and descriptions. It is assumed that these files contain the labels, aliases and descriptions of all nodes appearing in the claims file. Users may provide these files for specific languages only.

Outputs:

- **Instance of (P31):** `derived.P31.tsv.gz` contains all the `instance of (P31)` edges present in the claims file.
- **Subclass of (P279):** `derived.P279.tsv.gz` contains all the `subclass of (P279)` edges present in the claims file.
- **Is A (isa):** `derived.isa.tsv.gz` contains edges `node1/isa/node2` where either `node1/P31/node2` or `node1/P279/node2` is present in the claims file.
- **Closure of subclass of (P279star):** `derived.P279star.tsv.gz` contains edges `node1/P279star/node2` where `node2` is reachable from `node1` via zero or more hops using the `P279` property. Because zero hops are allowed, the closure is reflexive, for example `Q44/P279star/Q44`. This file is useful when you want to find all the instances of a class, including instances of subclasses of the given class.
- **In/out degrees:** `metadata.out_degree.tsv.gz` contains the out degree of every node, and `metadata.in_degree.tsv.gz` contains the in degree of every node.
- **Pagerank:** outputs page rank on the directed graph in `metadata.pagerank.directed.tsv.gz` and page rank on the undirected graph in `metadata.pagerank.undirected.tsv.gz`.
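
The `P31` and `P279star` files combine to answer "which nodes are instances of a class or of any of its subclasses". The following is a minimal sketch on a toy graph; the node names and the helper `instances_of` are made up for illustration and are not part of KGTK.

```python
# Toy edges standing in for derived.P31.tsv.gz and derived.P279star.tsv.gz.
# The node names are illustrative, not real Wikidata identifiers.
p31_edges = [
    ("portland", "city_of_oregon"),
    ("paris", "city"),
    ("rex", "dog"),
]
# P279star is reflexive: every class reaches itself in zero hops.
p279star_edges = [
    ("city", "city"),
    ("city_of_oregon", "city_of_oregon"),
    ("city_of_oregon", "city"),
    ("dog", "dog"),
]


def instances_of(target_class, p31, p279star):
    """Return nodes n where n -P31-> c and c -P279star-> target_class."""
    # Classes that reach the target in zero or more P279 hops.
    subclasses = {c for c, ancestor in p279star if ancestor == target_class}
    return sorted(n for n, c in p31 if c in subclasses)


print(instances_of("city", p31_edges, p279star_edges))
```

Because the closure contains the zero-hop edge `city/P279star/city`, direct instances of the class are returned by the same lookup as instances of its subclasses.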

### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p claims_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.property.wikibase-item.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database False
```
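
Paths that contain spaces are easy to get wrong when escaped by hand. As a hedged sketch, the same kind of command can be assembled as an argument list in Python; the helper `build_papermill_command` and the paths below are hypothetical and only illustrate the `-p name value` pattern used above.

```python
import shlex


def build_papermill_command(notebook, output_notebook, parameters):
    """Build a papermill argument list; each parameter becomes `-p name value`."""
    command = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        command += ["-p", name, str(value)]
    return command


# Hypothetical paths, for illustration only.
command = build_papermill_command(
    "Wikidata Useful Files.ipynb",
    "useful-files.out.ipynb",
    {
        "claims_file": "/data/wiki data/all.tsv.gz",
        "output_folder": "useful_files_v4",
        "delete_database": False,
    },
)

# shlex.join quotes arguments containing spaces, so no manual backslashes.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting altogether.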

In [1]:
# Parameters

# Folder on local machine where to create the output and temporary folders
output_path = "/Users/pedroszekely/Downloads/kypher"

# The names of the output and temporary folders
output_folder = "useful_wikidata_files_v4"
temp_folder = "temp.useful_wikidata_files_v4"

# The location of input files
wiki_root_folder = "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/"
claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"

# Location of the cache database for kypher
cache_path = "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4"

# Whether to delete the cache database
delete_database = False

# Whether to compute pagerank as it may not run on the laptop
compute_pagerank = False

In [2]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt

## Set up environment and folders to store the files

- `OUT` folder where the output files go
- `TEMP` folder to keep temporary files, including the database
- `kgtk` shortcut to invoke the kgtk software
- `kypher` shortcut to invoke `kgtk query` with the cache database
- `CLAIMS` the `all.tsv` file of Wikidata that contains all edges except label/alias/description
- `LABELS` the file with the English labels
- `ITEMS` the wikibase-item file (currently does not include node1 that are properties, so for now we need the net file)
- `STORE` location of the cache file

In [3]:
if cache_path:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)
os.environ['OUT'] = "{}/{}".format(output_path, output_folder)
os.environ['TEMP'] = "{}/{}".format(output_path, temp_folder)
# Plain invocation, overridden on the next line to add timing and debug output
os.environ['kgtk'] = "kgtk"
os.environ['kgtk'] = "time kgtk --debug"
os.environ['kypher'] = "time kgtk --debug query --graph-cache " + os.environ['STORE']
os.environ['CLAIMS'] = wiki_root_folder + claims_file
os.environ['LABELS'] = wiki_root_folder + label_file
os.environ['ALIASES'] = wiki_root_folder + alias_file
os.environ['DESCRIPTIONS'] = wiki_root_folder + description_file
os.environ['ITEMS'] = wiki_root_folder + item_file

Echo the variables to see if they are all set correctly

In [4]:
!echo $OUT
!echo $TEMP
!echo $kgtk
!echo $kypher
!echo $CLAIMS
!echo $LABELS
!echo $ALIASES
!echo $LABELS
!echo $DESCRIPTIONS
!echo $STORE
!alias col="column -t -s $'\t' "

/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4
/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4
time kgtk --debug
time kgtk --debug query --graph-cache /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4/wikidata.sqlite3.db
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/claims.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/labels.en.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/aliases.en.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/labels.en.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/descriptions.en.tsv.gz
/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4/wikidata.sqlite3.db
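
Echoing the variables shows their values but not whether the files exist. Since several later steps run for a long time, it may help to check the inputs up front. This is a minimal sketch; `missing_inputs` is a hypothetical helper, and a temporary folder stands in for `wiki_root_folder`.

```python
import os
import tempfile


def missing_inputs(root_folder, file_names):
    """Return the names in file_names that do not exist under root_folder."""
    return [
        name
        for name in file_names
        if not os.path.isfile(os.path.join(root_folder, name))
    ]


# Demonstration with a temporary folder standing in for wiki_root_folder.
with tempfile.TemporaryDirectory() as root:
    # Create an empty placeholder for one of the two expected files.
    open(os.path.join(root, "claims.tsv.gz"), "wb").close()
    missing = missing_inputs(root, ["claims.tsv.gz", "labels.en.tsv.gz"])

print(missing)
```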


Go to the output directory and create the subfolders for the output files and the temporary files

In [5]:
cd $output_path

/Users/pedroszekely/Downloads/kypher


In [6]:
!mkdir $OUT
!mkdir $TEMP

mkdir: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4: File exists
mkdir: /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4: File exists


Clean up the output and temp folders before we start

In [8]:
# !rm $OUT/*.tsv $OUT/*.tsv.gz
# !rm $TEMP/*.tsv $TEMP/*.tsv.gz

In [9]:
if delete_database:
    print("Deleted database")
    !rm $STORE

In [7]:
!ls -l $OUT
!ls $TEMP
!ls -l "$CLAIMS"
!ls -l "$LABELS"
!ls -l "$ALIASES"
!ls -l "$LABELS"
!ls -l "$DESCRIPTIONS"
!ls $STORE

total 1888376
-rw-r--r--  1 pedroszekely  staff   38563332 Nov 14 00:50 all.P279.tsv.gz
-rw-r--r--  1 pedroszekely  staff         21 Nov 14 00:50 all.P31_P279.tsv.gz
-rw-r--r--  1 pedroszekely  staff         37 Nov 14 00:51 all.isa.tsv.gz
-rw-r--r--  1 pedroszekely  staff   38563336 Nov 14 08:14 derived.P279.tsv.gz
-rw-r--r--  1 pedroszekely  staff  876497386 Nov 14 08:39 derived.P31.tsv.gz
P279.n1.tsv.gz        P279.roots.tsv        isa.1.tsv.gz
P279.reachable.tsv.gz P279star.1.tsv.gz     wikidata.sqlite3.db
P279.roots.1.tsv.gz   P279star.2.tsv.gz
P279.roots.2.tsv.gz   P31.n2.tsv.gz
-rw-------  1 pedroszekely  staff  24260264435 Nov 10 21:52 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/claims.tsv.gz
-rw-------  1 pedroszekely  staff  2142929019 Nov 10 22:19 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/labels.en.tsv.gz
-rw-------  1 pedroszekely  staff  129552943 Nov 10 21:55 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-202

In [8]:
!zcat < "$CLAIMS" | head | col

id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item
P10-P1855-Q4504-a69d2c73-0	P10	P1855	Q4504	normal	wikibase-item
zcat: error writing to output: Broken pipe


### Preview the input files

It is good practice to peek at the files to make sure the column headings are what we expect.
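
The same check can be done without the cache database. The sketch below reads only the first line of a gzipped TSV and reports any expected column that is absent. The names `read_header`, `missing_columns` and `EXPECTED_COLUMNS` are assumptions made for this example, and an in-memory buffer stands in for a real file.

```python
import gzip
import io

EXPECTED_COLUMNS = ["id", "node1", "label", "node2"]


def read_header(stream):
    """Return the column names from the first line of a gzipped TSV."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as handle:
        return handle.readline().rstrip("\n").split("\t")


def missing_columns(header, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the header."""
    return [name for name in expected if name not in header]


# In-memory stand-in for a claims file; a real call would pass a file path.
sample = io.BytesIO(
    gzip.compress(b"id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n")
)
header = read_header(sample)
print(header, missing_columns(header))
```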

In [9]:
!$kypher -i "$CLAIMS" --limit 10 | col 

[2020-11-14 08:45:04 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
        0.80 real         0.53 user         0.14 sys
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video" normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item
P10-P1855-Q4504-a69d2c73-0	P10	P1855	Q4504	normal	wikibase-item
P10-P1855-Q69063653-c8cdb04c-0	P10	P1855	Q69063653	normal	wikib

Force creation of the index on the label column

In [10]:
!$kypher -i "$CLAIMS" -o - \
--match '(i)-[:P31]->(c)' \
--limit 5 \
| column -t -s $'\t' 

[2020-11-14 08:45:06 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
     LIMIT ?
  PARAS: ['P31', 5]
---------------------------------------------
        0.62 real         0.50 user         0.11 sys
id                              node1  label  node2      rank    node2;wikidatatype
P10-P31-Q18610173-85ef4d24-0    P10    P31    Q18610173  normal  wikibase-item
P1000-P31-Q18608871-093affb5-0  P1000  P31    Q18608871  normal  wikibase-item
P1001-P31-Q15720608-deeedec9-0  P1001  P31    Q15720608  normal  wikibase-item
P1001-P31-Q22984026-8beb0cfe-0  P1001  P31    Q22984026  normal  wikibase-item
P1001-P31-Q22997934-1e5b1a96-0  P1001  P31    Q22997934  normal  wikibase-item


Force creation of the index on the node2 column

In [11]:
!$kypher -i "$CLAIMS" -o - \
--match '(i)-[r]->(:Q5)' \
--limit 5 \
| column -t -s $'\t' 

[2020-11-14 08:45:09 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."node2"=?
     LIMIT ?
  PARAS: ['Q5', 5]
---------------------------------------------
        0.62 real         0.50 user         0.11 sys
id                         node1  label  node2  rank    node2;wikidatatype
P1424-P1855-Q5-47bdcd17-0  P1424  P1855  Q5     normal  wikibase-item
P1963-P1855-Q5-1ba43aca-0  P1963  P1855  Q5     normal  wikibase-item
P3055-P1629-Q5-fb63cfeb-0  P3055  P1629  Q5     normal  wikibase-item
P685-P1855-Q5-76c93460-0   P685   P1855  Q5     normal  wikibase-item
P8168-P1855-Q5-1f792f8c-0  P8168  P1855  Q5     normal  wikibase-item


### Count the number of edges

Counting takes a long time: about 8 minutes for the roughly 1.1 billion edges in the run below.

In [48]:
!$kypher -i "$CLAIMS" \
--match '()-[r]->()' \
--return 'count(r) as count' \
--limit 10

[2020-11-14 08:04:54 query]: SQL Translation:
---------------------------------------------
  SELECT count(graph_1_c1."id") "count"
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
count
1102950183
      491.63 real        87.41 user        94.80 sys


### Create the P31 and P279 files

Create the `P31` file

In [12]:
!$kypher -i "$CLAIMS" -o $OUT/derived.P31.tsv.gz \
--match '(n1)-[l:P31]->(n2)' \
--return 'l, n1, l.label, n2' 

[2020-11-14 08:45:19 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['P31']
---------------------------------------------
     1573.64 real       930.85 user       202.92 sys


Preview the `P31` file

In [13]:
!gzcat $OUT/derived.P31.tsv.gz | head | col

id	node1	label	node2
P10-P31-Q18610173-85ef4d24-0	P10	P31	Q18610173
P1000-P31-Q18608871-093affb5-0	P1000	P31	Q18608871
P1001-P31-Q15720608-deeedec9-0	P1001	P31	Q15720608
P1001-P31-Q22984026-8beb0cfe-0	P1001	P31	Q22984026
P1001-P31-Q22997934-1e5b1a96-0	P1001	P31	Q22997934
P1001-P31-Q61719275-0ccc11a5-0	P1001	P31	Q61719275
P1002-P31-Q22963600-b3a47587-0	P1002	P31	Q22963600
P1003-P31-Q19595382-152d2cdd-0	P1003	P31	Q19595382
P1003-P31-Q19833377-75138cf5-0	P1003	P31	Q19833377
gzcat: error writing to output: Broken pipe
gzcat: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.P31.tsv.gz: uncompress failed


Create the `P279` file

In [14]:
!$kypher -i "$CLAIMS" -o $OUT/derived.P279.tsv.gz \
    --match '(n1)-[l:P279]->(n2)' \
    --return 'l, n1, l.label, n2' 

[2020-11-14 09:11:34 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."id", graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
  PARAS: ['P279']
---------------------------------------------
      102.82 real        38.44 user        18.03 sys


### Create the file that contains all nodes reachable via P279 starting from a node2 in P31 or a node1 in P279

First compute the roots

In [17]:
!$kypher -i $OUT/derived.P279.tsv.gz -o $TEMP/P279.n1.tsv.gz \
--match '(n1)-[l]->()' \
--return 'n1 as id' 

[2020-11-14 09:24:17 sqlstore]: IMPORT graph directly into table graph_3 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.P279.tsv.gz ...
[2020-11-14 09:24:31 query]: SQL Translation:
---------------------------------------------
  SELECT graph_3_c1."node1" "id"
     FROM graph_3 AS graph_3_c1
  PARAS: []
---------------------------------------------
       27.08 real        34.74 user         0.84 sys


In [24]:
!$kypher -i $OUT/derived.P31.tsv.gz -o $TEMP/P31.n2.tsv.gz \
--match '()-[l]->(n2)' \
--return 'n2 as id' 

[2020-11-14 10:08:48 sqlstore]: IMPORT graph directly into table graph_4 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.P31.tsv.gz ...
[2020-11-14 10:14:38 query]: SQL Translation:
---------------------------------------------
  SELECT graph_4_c1."node2" "id"
     FROM graph_4 AS graph_4_c1
  PARAS: []
---------------------------------------------
      526.29 real       735.22 user        21.42 sys


In [25]:
!$kgtk cat --mode NONE -i $TEMP/P31.n2.tsv.gz $TEMP/P279.n1.tsv.gz \
| gzip > $TEMP/P279.roots.1.tsv.gz

In [26]:
!$kgtk sort2 --mode NONE --column id -i $TEMP/P279.roots.1.tsv.gz \
| gzip > $TEMP/P279.roots.2.tsv.gz

We have lots of duplicates

In [27]:
!zcat < $TEMP/P279.roots.2.tsv.gz | head

id
Q1
Q1
Q1000032
Q1000032
Q1000039
Q1000064
Q1000084
Q1000108
Q1000116
zcat: error writing to output: Broken pipe
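
Sorting first is what makes the duplicates cheap to remove: equal ids end up next to each other, so one pass over the file is enough. A small sketch of that idea in plain Python, using toy ids rather than the real roots file:

```python
from itertools import groupby

# Concatenated root ids, as in P279.roots.1.tsv.gz (toy values).
root_ids = ["Q1000032", "Q1", "Q1000039", "Q1", "Q1000032"]

# Sorting places equal ids next to each other ...
sorted_ids = sorted(root_ids)

# ... so a single pass that keeps one id per run removes all duplicates.
unique_ids = [key for key, _run in groupby(sorted_ids)]

print(unique_ids)
```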


In [28]:
!$kgtk compact -i $TEMP/P279.roots.2.tsv.gz --mode NONE \
    --presorted \
    --columns id \
> $TEMP/P279.roots.tsv

Now we can invoke the reachable-nodes command

In [30]:
!$kgtk reachable-nodes \
    --rootfile $TEMP/P279.roots.tsv \
    --selflink \
    -i $OUT/derived.P279.tsv.gz \
| gzip > $TEMP/P279.reachable.tsv.gz

     4429.37 real      2866.14 user      1546.66 sys


In [31]:
!zcat < $TEMP/P279.reachable.tsv.gz | head | col

node1	label	node2
Q1000032	reachable	Q1000032
Q1000032	reachable	Q1813494
Q1000032	reachable	Q1799072
Q1000032	reachable	Q16686448
Q1000032	reachable	Q35120
Q1000032	reachable	novalue
Q1000032	reachable	Q2695280
Q1000032	reachable	Q1914636
Q1000032	reachable	Q20937557
zcat: error writing to output: Broken pipe
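
To make the shape of this output concrete, here is a hedged sketch of a breadth-first traversal that emits one `reachable` edge per (root, node) pair and adds the self link for each root. It illustrates the idea only and is not the KGTK implementation; `reachable_edges` and the toy class names are assumptions.

```python
from collections import defaultdict, deque


def reachable_edges(p279_edges, roots, selflink=True):
    """Yield (root, "reachable", node) for every node reachable from a root."""
    # Map each class to its direct superclasses.
    superclasses = defaultdict(list)
    for node1, node2 in p279_edges:
        superclasses[node1].append(node2)

    for root in roots:
        seen = {root}
        queue = deque([root])
        if selflink:
            # Zero hops: the root reaches itself.
            yield (root, "reachable", root)
        while queue:
            current = queue.popleft()
            for parent in superclasses.get(current, []):
                if parent not in seen:  # also protects against cycles
                    seen.add(parent)
                    queue.append(parent)
                    yield (root, "reachable", parent)


toy_p279 = [("city_of_oregon", "city"), ("city", "settlement")]
toy_roots = ["city_of_oregon", "city"]
result = list(reachable_edges(toy_p279, toy_roots))
for edge in result:
    print("\t".join(edge))
```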


The reachable-nodes command produces edges labeled `reachable`, so we need one command to rename them.

In [32]:
!$kypher -i $TEMP/P279.reachable.tsv.gz -o $TEMP/P279star.1.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "P279star" as label, n2 as node2' 

[2020-11-14 11:46:10 sqlstore]: IMPORT graph directly into table graph_5 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4/P279.reachable.tsv.gz ...
[2020-11-14 11:49:16 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."node1", ? "label", graph_5_c1."node2" "node2"
     FROM graph_5 AS graph_5_c1
  PARAS: ['P279star']
---------------------------------------------
      738.99 real       866.13 user        11.47 sys


Sort the renamed edges so that any duplicate edges become adjacent

In [33]:
!$kgtk sort2 -i $TEMP/P279star.1.tsv.gz -o $TEMP/P279star.2.tsv.gz

      239.06 real       232.07 user        47.03 sys


Make sure there are no duplicates

In [34]:
!$kgtk compact --presorted -i $TEMP/P279star.2.tsv.gz -o $TEMP/P279star.3.tsv.gz

     1372.23 real      1368.60 user         2.27 sys


Add ids

In [42]:
!$kgtk add-id --id-style node1-label-node2-num -i $TEMP/P279star.3.tsv.gz -o $OUT/derived.P279star.tsv.gz

     1273.07 real      1239.92 user        19.64 sys


In [43]:
!zcat < $OUT/derived.P279star.tsv.gz | head | col

node1	label	node2	id
Q1000032	P279star	Q1000032	Q1000032-P279star-Q1000032-0000
Q1000032	P279star	Q1150070	Q1000032-P279star-Q1150070-0000
Q1000032	P279star	Q1190554	Q1000032-P279star-Q1190554-0000
Q1000032	P279star	Q133500	Q1000032-P279star-Q133500-0000
Q1000032	P279star	Q13878858	Q1000032-P279star-Q13878858-0000
Q1000032	P279star	Q14819853	Q1000032-P279star-Q14819853-0000
Q1000032	P279star	Q14912053	Q1000032-P279star-Q14912053-0000
Q1000032	P279star	Q16686448	Q1000032-P279star-Q16686448-0000
Q1000032	P279star	Q16722960	Q1000032-P279star-Q16722960-0000
zcat: error writing to output: Broken pipe
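
The ids in this output follow the pattern `node1-label-node2-NNNN`. A rough sketch of how such ids could be generated is shown below. It assumes that the numeric suffix counts repeated occurrences of the same triple; that reading is an assumption based on the output above, and `add_ids` is a hypothetical helper, not the KGTK code.

```python
from collections import Counter


def add_ids(edges):
    """Append an id of the form node1-label-node2-NNNN to each edge."""
    # Assumption: the suffix is a zero-padded counter per distinct triple.
    seen = Counter()
    for node1, label, node2 in edges:
        key = (node1, label, node2)
        yield (node1, label, node2, f"{node1}-{label}-{node2}-{seen[key]:04d}")
        seen[key] += 1


toy_edges = [
    ("Q1000032", "P279star", "Q1000032"),
    ("Q1000032", "P279star", "Q1150070"),
    ("Q1000032", "P279star", "Q1150070"),
]
with_ids = list(add_ids(toy_edges))
for row in with_ids:
    print("\t".join(row))
```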


This is how to express the typical SPARQL path `?item P31/P279* ?class` in Kypher.
The example counts the instances of each subclass of city (Q515).

In [44]:
!$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i "$LABELS" \
--match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q515), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
--return 'distinct c as class, count(c) as count, c_label as `class name`, n1 as instance, label as `label`' \
--order-by 'count(c) desc, c, n1' \
--limit 10 \
| col

[2020-11-14 12:54:48 sqlstore]: IMPORT graph directly into table graph_6 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.P279star.tsv.gz ...
[2020-11-14 13:01:15 sqlstore]: IMPORT graph directly into table graph_7 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/labels.en.tsv.gz ...
[2020-11-14 13:10:32 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_4_c1."node2" "class", count(graph_4_c1."node2") "count", graph_7_c4."node2" "class name", graph_7_c3."node1" "instance", graph_7_c3."node2" "label"
     FROM graph_4 AS graph_4_c1, graph_6 AS graph_6_c2, graph_7 AS graph_7_c3, graph_7 AS graph_7_c4
     WHERE graph_4_c1."label"=?
     AND graph_6_c2."node2"=?
     AND graph_7_c3."label"=?
     AND graph_7_c4."label"=?
     AND graph_4_c1."node1"=graph_7_c3."node1"
     AND graph_4_c1."node2"=graph_6_c2."node1"
     AND graph_6_c2."node1"=graph_7_c4."node1"
     GROUP BY class
     ORDER BY c

Illustrate that it is indeed `P279*`

In [49]:
!$kypher -i $OUT/derived.P31.tsv.gz -i $OUT/derived.P279star.tsv.gz -i "$LABELS" \
--match 'P31: (n1)-[:P31]->(c), P279star: (c)-[]->(:Q63440326), label: (n1)-[:label]->(label), label: (c)-[:label]->(c_label)' \
--return 'distinct c as class, c_label as `class name`, n1 as instance, label as `label`' \
--order-by 'c, n1' \
--limit 10 \
| col 

[2020-11-14 13:44:45 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_6_c2."node1" "class", graph_7_c4."node2" "class name", graph_7_c3."node1" "instance", graph_7_c3."node2" "label"
     FROM graph_4 AS graph_4_c1, graph_6 AS graph_6_c2, graph_7 AS graph_7_c3, graph_7 AS graph_7_c4
     WHERE graph_4_c1."label"=?
     AND graph_6_c2."node2"=?
     AND graph_7_c3."label"=?
     AND graph_7_c4."label"=?
     AND graph_4_c1."node1"=graph_7_c3."node1"
     AND graph_4_c1."node2"=graph_6_c2."node1"
     AND graph_4_c1."node2"=graph_7_c4."node1"
     ORDER BY graph_6_c2."node1" ASC, graph_7_c3."node1" ASC
     LIMIT ?
  PARAS: ['P31', 'Q63440326', 'label', 'label', 10]
---------------------------------------------
        1.28 real         0.61 user         0.22 sys
class	class name	instance	label
Q63440326	'city of Oregon'@en	Q1003672	'Cascade Locks'@en
Q63440326	'city of Oregon'@en	Q1003826	'Yamhill'@en
Q63440326	'city of Oregon'@en	Q1003838	'

Test that `P279star` includes the zero-hop case, so that a class reaches itself

In [50]:
!$kypher -i $OUT/derived.P279star.tsv.gz \
--match '(n1:Q44)-[:P279star]->(n2:Q44)'

[2020-11-14 14:58:59 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_6 AS graph_6_c1
     WHERE graph_6_c1."label"=?
     AND graph_6_c1."node1"=?
     AND graph_6_c1."node2"=?
  PARAS: ['P279star', 'Q44', 'Q44']
---------------------------------------------
[2020-11-14 14:58:59 sqlstore]: CREATE INDEX on table graph_6 column label ...
[2020-11-14 15:00:08 sqlstore]: ANALYZE INDEX on table graph_6 column label ...
node1	label	node2	id
Q44	P279star	Q44	Q44-P279star-Q44-0000
       78.41 real        38.58 user        16.33 sys


### Create a file to do generalized Is-A queries
The idea is that `(n1)-[:isa]->(n2)` when `(n1)-[:P31]->(n2)` or `(n1)-[:P279]->(n2)`

We do this by concatenating the files and renaming the relation
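
In plain Python terms, the two steps amount to chaining the edge lists and overwriting the label. A minimal sketch with toy edges; `isa_edges` and the node names are made up for illustration.

```python
from itertools import chain


def isa_edges(p31_edges, p279_edges):
    """Concatenate P31 and P279 edges and relabel every edge as `isa`."""
    for node1, _label, node2 in chain(p31_edges, p279_edges):
        yield (node1, "isa", node2)


toy_p31 = [("portland", "P31", "city_of_oregon")]
toy_p279 = [("city_of_oregon", "P279", "city")]
isa = list(isa_edges(toy_p31, toy_p279))
print(isa)
```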

In [46]:
!$kgtk cat -i $OUT/derived.P31.tsv.gz $OUT/derived.P279.tsv.gz \
    | gzip > $TEMP/isa.1.tsv.gz

      435.98 real       431.80 user         2.86 sys


In [47]:
!$kypher -i $TEMP/isa.1.tsv.gz -o $OUT/derived.isa.tsv.gz \
--match '(n1)-[]->(n2)' \
--return 'n1, "isa" as label, n2' 

[2020-11-14 13:27:24 sqlstore]: IMPORT graph directly into table graph_8 from /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4/isa.1.tsv.gz ...
[2020-11-14 13:33:32 query]: SQL Translation:
---------------------------------------------
  SELECT graph_8_c1."node1", ? "label", graph_8_c1."node2"
     FROM graph_8 AS graph_8_c1
  PARAS: ['isa']
---------------------------------------------
      736.21 real       953.69 user        24.91 sys


Example of how to use the `isa` relation

In [48]:
!$kypher -i $OUT/derived.isa.tsv.gz -i $OUT/derived.P279star.tsv.gz -i "$LABELS" -o - \
--match 'isa: (n1)-[l:isa]->(c), P279star: (c)-[]->(:Q44), label: (n1)-[:label]->(label)' \
--return 'distinct n1, l.label, "Q44" as node2, label as n1_label' \
--limit 10 \
| col

[2020-11-14 13:39:41 sqlstore]: IMPORT graph directly into table graph_9 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.isa.tsv.gz ...
[2020-11-14 13:42:09 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_9_c1."node1", graph_9_c1."label", ? "node2", graph_7_c3."node2" "n1_label"
     FROM graph_6 AS graph_6_c2, graph_7 AS graph_7_c3, graph_9 AS graph_9_c1
     WHERE graph_6_c2."node2"=?
     AND graph_7_c3."label"=?
     AND graph_9_c1."label"=?
     AND graph_6_c2."node1"=graph_9_c1."node2"
     AND graph_7_c3."node1"=graph_9_c1."node1"
     LIMIT ?
  PARAS: ['Q44', 'Q44', 'label', 'isa', 10]
---------------------------------------------
[2020-11-14 13:42:09 sqlstore]: CREATE INDEX on table graph_9 column label ...
[2020-11-14 13:42:50 sqlstore]: ANALYZE INDEX on table graph_9 column label ...
[2020-11-14 13:42:56 sqlstore]: CREATE INDEX on table graph_9 column node1 ...
[2020-11-14 13:43:32 sqlstore]: ANALYZE I

## Compute pagerank

Now compute pagerank. These commands will exceed 16GB memory for graphs containing over 25 million nodes.
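
For readers unfamiliar with the measure, the following is a small power-iteration sketch of PageRank on a toy graph, with an option to treat edges as undirected. It is an illustration under simple assumptions (damping factor 0.85, fixed iteration count, rank of nodes without out-links spread evenly) and is not how KGTK computes its values.

```python
def pagerank(edges, undirected=False, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over (source, target) edges."""
    if undirected:
        # Treat every edge as traversable in both directions.
        edges = edges + [(dst, src) for src, dst in edges]
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Three nodes pointing at one hub (toy data).
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub")]
directed = pagerank(toy_edges)
undirected = pagerank(toy_edges, undirected=True)
print(max(directed, key=directed.get), max(undirected, key=undirected.get))
```

The sketch keeps one rank value per node and one list entry per edge in memory at once, which is the kind of footprint that makes the full graph expensive to process.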

In [51]:
if compute_pagerank:
    !$kgtk graph-statistics -i "$ITEMS" -o $OUT/metadata.pagerank.directed.tsv.gz \
    --page-rank-property directed_pagerank \
    --pagerank --statistics-only \
    --log $TEMP/metadata.pagerank.directed.summary.txt 

In [52]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.directed.summary.txt

In [53]:
if compute_pagerank:
    !$kgtk graph-statistics -i "$ITEMS" -o $OUT/metadata.pagerank.undirected.tsv.gz \
    --page-rank-property undirected_pagerank \
    --pagerank --statistics-only \
    --log $TEMP/metadata.pagerank.undirected.summary.txt 

In [54]:
if compute_pagerank:
    !cat $TEMP/metadata.pagerank.undirected.summary.txt 

## Compute Degrees

Kypher can compute the out degree by counting the node2s for each node1

In [78]:
!$kypher -i "$CLAIMS" -o $TEMP/metadata.out_degree.tsv.gz \
--match '(n1)-[l]->()' \
--return 'distinct n1 as node1, count(l) as node2, "out_degree" as label' 

[2020-11-14 18:33:05 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node1" "node1", count(graph_1_c1."id") "node2", ? "label"
     FROM graph_1 AS graph_1_c1
     GROUP BY node1
  PARAS: ['out_degree']
---------------------------------------------
     2160.01 real       986.18 user       826.28 sys


In [79]:
!$kgtk add-id --id-style node1-label-node2-num -i $TEMP/metadata.out_degree.tsv.gz \
/ sort2 -o $OUT/metadata.out_degree.tsv.gz

      707.37 real       742.88 user        69.35 sys


To count the in-degree we only care when the node2 is a wikibase-item
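
The difference between the two counts can be sketched with `collections.Counter` on a handful of toy claims. The rows below are illustrative and only mimic the `node1`, `label`, `node2` and `node2;wikidatatype` columns of the claims file.

```python
from collections import Counter

# Toy claims as (node1, label, node2, node2;wikidatatype).
claims = [
    ("Q1", "P31", "Q2", "wikibase-item"),
    ("Q1", "P1628", '"https://schema.org/video"', "url"),
    ("Q3", "P31", "Q2", "wikibase-item"),
    ("Q3", "P279", "Q1", "wikibase-item"),
]

# Out-degree counts every edge leaving node1, whatever the datatype of node2.
out_degree = Counter(node1 for node1, _label, _node2, _type in claims)

# In-degree counts only edges whose node2 is an item, so literals such as
# URLs or strings never show up as nodes with an in-degree.
in_degree = Counter(
    node2
    for _node1, _label, node2, datatype in claims
    if datatype == "wikibase-item"
)

print(dict(out_degree), dict(in_degree))
```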

In [80]:
!$kypher -i "$CLAIMS" -o $TEMP/metadata.in_degree.tsv.gz \
--match '()-[l {`node2;wikidatatype`:"wikibase-item"}]->(n2)' \
--return 'distinct n2 as node1, count(l) as node2, "in_degree" as label' 

[2020-11-14 19:20:53 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node2" "node1", count(graph_1_c1."id") "node2", ? "label"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."node2;wikidatatype"=?
     GROUP BY node1
  PARAS: ['in_degree', 'wikibase-item']
---------------------------------------------
     1342.16 real       458.20 user       498.16 sys


In [81]:
!$kgtk add-id --id-style node1-label-node2-num -i $TEMP/metadata.in_degree.tsv.gz \
/ sort2 -o $OUT/metadata.in_degree.tsv.gz

       29.74 real        32.55 user         1.18 sys


In [82]:
!zcat < $OUT/metadata.in_degree.tsv.gz | head | col

node1	node2	label	id
Q1	11	in_degree	Q1-in_degree-11-0000
Q1	12	in_degree	Q1-in_degree-12-0000
Q1	17	in_degree	Q1-in_degree-17-0000
Q1	2	in_degree	Q1-in_degree-2-0000
Q1	3	in_degree	Q1-in_degree-3-0000
Q1	4	in_degree	Q1-in_degree-4-0000
Q1	6	in_degree	Q1-in_degree-6-0000
Q100	1	in_degree	Q100-in_degree-1-0000
Q100	10	in_degree	Q100-in_degree-10-0000
zcat: error writing to output: Broken pipe


Calculate the distribution so we can make a nice chart
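
The distribution is a count of nodes per degree value, ordered numerically. Because values read from a TSV column are strings, the order has to be forced with an integer cast, which is what `cast(n2, integer)` does in the query. A small sketch with toy values:

```python
from collections import Counter

# Degrees as they appear in a TSV column: strings, not numbers (toy values).
in_degree = {"Q1": "2", "Q2": "11", "Q3": "2", "Q4": "1"}

# Count how many nodes share each degree value.
distribution = Counter(in_degree.values())

# Sorting the string keys directly is lexicographic: "11" comes before "2".
lexicographic = sorted(distribution)
# Casting to int first gives the numeric order wanted on a chart axis.
numeric = sorted(distribution, key=int)

print(lexicographic, numeric)
print([(degree, distribution[degree]) for degree in numeric])
```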

In [83]:
!$kypher -i $OUT/metadata.in_degree.tsv.gz -o $OUT/statistics.in_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as in_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 

[2020-11-14 19:43:45 sqlstore]: DROP graph data table graph_12 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/metadata.in_degree.tsv.gz
[2020-11-14 19:45:40 sqlstore]: IMPORT graph directly into table graph_12 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/metadata.in_degree.tsv.gz ...
[2020-11-14 19:45:56 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_12_c1."node2" "in_degree", count(DISTINCT graph_12_c1."node1") "count", ? "label"
     FROM graph_12 AS graph_12_c1
     GROUP BY in_degree
     ORDER BY CAST(graph_12_c1."node2" AS integer) ASC
  PARAS: ['count']
---------------------------------------------
      135.41 real        33.74 user        43.07 sys


In [84]:
!head $OUT/statistics.in_degree.distribution.tsv | col


In [85]:
!$kypher -i $OUT/metadata.out_degree.tsv.gz -o $OUT/statistics.out_degree.distribution.tsv \
--match '(n1)-[]->(n2)' \
--return 'distinct n2 as out_degree, count(distinct n1) as count, "count" as label' \
--order-by 'cast(n2, integer)' 

[2020-11-14 19:46:01 sqlstore]: DROP graph data table graph_11 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/metadata.out_degree.tsv.gz
[2020-11-14 19:48:11 sqlstore]: IMPORT graph directly into table graph_11 from /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/metadata.out_degree.tsv.gz ...
[2020-11-14 19:53:34 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_11_c1."node2" "out_degree", count(DISTINCT graph_11_c1."node1") "count", ? "label"
     FROM graph_11 AS graph_11_c1
     GROUP BY out_degree
     ORDER BY CAST(graph_11_c1."node2" AS integer) ASC
  PARAS: ['count']
---------------------------------------------
      593.21 real       659.68 user        72.66 sys


Draw some charts

In [86]:
data = pd.read_csv(
    os.environ["OUT"] + "/statistics.in_degree.distribution.tsv", sep="\t"
)

alt.Chart(data).mark_circle(size=60).encode(
    x=alt.X("in_degree", scale=alt.Scale(type="log")),
    y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
    tooltip=["in_degree", "count"],
).interactive().properties(title="Distribution of In Degree")

In [87]:
data = pd.read_csv(
    os.environ["OUT"] + "/statistics.out_degree.distribution.tsv", sep="\t"
)

alt.Chart(data).mark_circle(size=60).encode(
    x=alt.X("out_degree", scale=alt.Scale(type="log")),
    y=alt.Y("count", scale=alt.Scale(type="log"), title="count of nodes"),
    tooltip=["out_degree", "count"],
).interactive().properties(title="Distribution of Out Degree")

## Summary of results

In [88]:
!ls -lh $OUT/*

-rw-r--r--  1 pedroszekely  staff    37M Nov 14 09:13 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.P279.tsv.gz
-rw-r--r--  1 pedroszekely  staff   500M Nov 14 12:54 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.P279star.tsv.gz
-rw-r--r--  1 pedroszekely  staff   973M Nov 14 09:11 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.P31.tsv.gz
-rw-r--r--  1 pedroszekely  staff   252M Nov 14 13:39 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/derived.isa.tsv.gz
-rw-r--r--  1 pedroszekely  staff    21M Nov 14 19:43 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/metadata.in_degree.tsv.gz
-rw-r--r--  1 pedroszekely  staff   512M Nov 14 19:20 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/metadata.out_degree.tsv.gz
-rw-r--r--  1 pedroszekely  staff    11K Nov 14 19:46 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v4/statistics.in_degree.distribution.tsv
-rw-r--r--  1 p

Highest page rank

In [100]:
if compute_pagerank:
    !$kypher -i $OUT/metadata.pagerank.undirected.tsv.gz -i "$LABELS" -o - \
    --match 'pagerank: (n1)-[:undirected_pagerank]->(page_rank), label: (n1)-[:label]->(label)' \
    --return 'distinct n1, label as label, page_rank as `undirected page rank`' \
    --order-by 'cast(page_rank, float) desc' \
    --limit 10 \
    | col

       36.63 real         4.07 user         7.29 sys
node1      label                              page_rank
Q81581     'Szeged'@en                        9.99820910584327e-06
Q474406    'Tropiduchidae'@en                 9.99775062441874e-06
Q102496    'parish'@en                        9.989295648293259e-06
Q19830596  'Rubens'@en                        9.98709688634465e-06
Q211661    'Jämtland'@en                      9.983987961260548e-06
Q10361310  'Rick Bonadio'@en                  9.983680487413594e-06
Q688275    'São Leopoldo'@en                  9.972744274999134e-06
Q15008131  'Category:Acyrthosiphon'@en        9.971425871370554e-06
Q9876232   'Category:Colladonus'@en           9.971425871370554e-06
Q10387575  'registered historic monument'@en  9.963088508250605e-06


