# Creating a subset of Wikidata

This notebook shows how to create a subset of Wikidata, using https://www.wikidata.org/wiki/Q11173 (chemical compound) as the example. The recorded run below sets `subset_name` to Q5 (human), so the outputs shown are for Q5.

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill Example8\ -\ Wikidata\ Subset.ipynb example8.out.ipynb \
-p wikidata_home /Users/pedroszekely/Downloads/kypher \
-p wikidata_file all.10.tsv.gz \
-p wikidata_parts_folder /Users/pedroszekely/Downloads/kypher/output.all.10 \
-p subset_name Q11173 \
-p home /Users/pedroszekely/Downloads/kypher \
-p cache_folder /Users/pedroszekely/Downloads/kypher \
-p delete_database no
```
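
The same invocation can be assembled from Python, which sidesteps shell quoting of the notebook name. The following is a minimal sketch: `papermill_command` is a hypothetical helper, not part of papermill, and it only relies on the `-p name value` flag shown in the command above.

```python
import shlex


def papermill_command(notebook, output, parameters):
    """Build the argument list for a papermill run (hypothetical helper)."""
    argv = ["papermill", notebook, output]
    for key, value in parameters.items():
        # One "-p name value" triple per notebook parameter.
        argv += ["-p", key, str(value)]
    return argv


argv = papermill_command(
    "Example8 - Wikidata Subset.ipynb",
    "example8.out.ipynb",
    {"subset_name": "Q11173", "delete_database": "no"},
)
# shlex.join quotes the notebook name, which contains spaces.
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` as is, without `shell=True`.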

In [39]:
# Parameters
wikidata_home = "/Users/pedroszekely/Downloads/kypher"
wikidata_file = "wikidata-20200803-all-edges.tsv.gz"
wikidata_file = "all.10.tsv.gz"
wikidata_parts_folder = "/Users/pedroszekely/Downloads/kypher/useful_wikidata_files"
wikidata_parts_folder = "/Users/pedroszekely/Downloads/kypher/output.all.10"
subset_name = "Q11173"
#subset_name = "Q318"
subset_name = "Q5"
home = "/Users/pedroszekely/Downloads/kypher"
cache_folder = "/Users/pedroszekely/Downloads/kypher"
delete_database = "no"

In [30]:
temp_folder = subset_name + "-temp"
output_folder = subset_name

In [31]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

A convenience function to run templatized commands, substituting NAME with the name of the dataset and substituting any other keys provided in a dictionary.

In [32]:
def run_command(command, substitution_dictionary=None):
    """Run a templatized command."""
    cmd = command.replace("NAME", subset_name)
    # Use None as the default to avoid a shared mutable default argument.
    for k, v in (substitution_dictionary or {}).items():
        cmd = cmd.replace(k, v)

    print(cmd)
    # shell=True expects the command as a single string.
    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)
    #print(output.returncode)
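
To see what `run_command` will execute without invoking the shell, the substitution step can be isolated. This is a small sketch: `render_command` is a hypothetical helper that repeats the replacement logic above and takes the subset name as an argument instead of reading the global.

```python
def render_command(template, name, substitutions=None):
    """Return the command string after substitution, without running it."""
    cmd = template.replace("NAME", name)
    for key, value in (substitutions or {}).items():
        cmd = cmd.replace(key, value)
    return cmd


template = "$kgtk query -i $TEMP/all.isa.NAME.tsv.gz -i $WIKIDATA_PARTS/part.TYPE_FILE.tsv.gz"
rendered = render_command(template, "Q5", {"TYPE_FILE": "time"})
print(rendered)
```

Because the replacement is plain text, a template containing `$NAME` would become `$Q5`. The templates passed to `run_command` in this notebook write `NAME` without the dollar sign, while the `!` shell cells use the `$NAME` environment variable.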

### Set up environment variables and folders that we need

In [33]:
# folder containing wikidata broken down into smaller files.
os.environ['WIKIDATA_PARTS'] = wikidata_parts_folder
# path of folder where the wikidata parts folder is stored.
os.environ['WIKIDATA_HOME'] = wikidata_home
# name of the dataset
os.environ['NAME'] = subset_name
# folder where to put the output
os.environ['OUT'] = "{}/{}".format(home, output_folder)
# temporary folder
os.environ['TEMP'] = "{}/{}".format(home, temp_folder)
# kgtk command to run
os.environ['kgtk'] = "kgtk"
os.environ['kgtk'] = "time kgtk --debug"
# absolute path of the db
if cache_folder:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_folder)
else:
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(home, temp_folder)
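
The rule for choosing the database location can be written as a small function, which makes it easy to check both branches. This is a sketch only: `store_path` is a hypothetical name, and the paths in the example are placeholders.

```python
def store_path(home, temp_folder, cache_folder=""):
    """Return the database path using the same rule as the cell above."""
    if cache_folder:
        # A cache folder, when given, holds the database.
        return "{}/wikidata.sqlite3.db".format(cache_folder)
    # Otherwise the database lives in the temporary folder of this subset.
    return "{}/{}/wikidata.sqlite3.db".format(home, temp_folder)


print(store_path("/data", "Q5-temp", "/cache"))
print(store_path("/data", "Q5-temp"))
```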

In [34]:
cd $home

/Users/pedroszekely/Downloads/kypher


In [35]:
!mkdir $output_folder
!mkdir $temp_folder

mkdir: Q5: File exists
mkdir: Q5-temp: File exists


In [36]:
!rm $OUT/*.tsv $OUT/*.tsv.gz
!rm $TEMP/*.tsv $TEMP/*.tsv.gz

rm: /Users/pedroszekely/Downloads/kypher/Q5/*.tsv: No such file or directory


In [37]:
if delete_database and delete_database != "no":
    !rm $STORE
    print("Deleted database")

### Extract the Q-nodes for the items we want
Here we assume that the subset is defined by a single Q-node, so the subset name is the name of that Q-node. This should be generalized so that the selection query can be passed in as a parameter. We construct a file containing every node1 that has an isa edge to the given NAME Q-node.

In [10]:
command = "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $WIKIDATA_PARTS/all.P279star.tsv.gz \
    --graph-cache $STORE \
    -o $TEMP/all.isa.NAME.tsv.gz  \
    --match 'isa: (n1)-[l:isa]->(n2:NAME)' \
    --return 'distinct n1, l.label, n2'"
run_command(command)

$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $WIKIDATA_PARTS/all.P279star.tsv.gz     --graph-cache $STORE     -o $TEMP/all.isa.Q5.tsv.gz      --match 'isa: (n1)-[l:isa]->(n2:Q5)'     --return 'distinct n1, l.label, n2'

[2020-09-30 11:55:11 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
     AND graph_1_c1."node2"=?
  PARAS: ['isa', 'Q5']
---------------------------------------------
        0.92 real         0.72 user         0.14 sys
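
The SQL translation above amounts to a filter on `label` and `node2` followed by duplicate removal. The following pure-Python sketch shows the same selection on a toy edge list; the edges and the `instances_of` helper are illustrative assumptions, not KGTK code.

```python
# Toy edges in (node1, label, node2) form; the rows are illustrative only.
edges = [
    ("Q42", "isa", "Q5"),
    ("Q42", "isa", "Q5"),  # duplicate row, dropped like DISTINCT would
    ("Q64", "isa", "Q515"),
    ("Q1339", "isa", "Q5"),
]


def instances_of(edges, target, label="isa"):
    """Return the distinct edges with the given label that point at target."""
    seen = set()
    result = []
    for edge in edges:
        _node1, edge_label, node2 = edge
        if edge_label == label and node2 == target and edge not in seen:
            seen.add(edge)
            result.append(edge)
    return result


subset = instances_of(edges, "Q5")
print(subset)
```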



### Generate the parts of this dataset

In [11]:
types = [
    "time",
    "wikibase-item",
    "math",
    "wikibase-form",
    "quantity",
    "string",
    "external-id",
    "commonsMedia",
    "globe-coordinate",
    "monolingualtext",
    "musical-notation",
    "geo-shape",
    "wikibase-property",
    "url",
]
command = "$kgtk query -i $TEMP/all.isa.NAME.tsv.gz -i $WIKIDATA_PARTS/part.TYPE_FILE.tsv.gz --graph-cache $STORE  \
    -o $OUT/NAME.part.TYPE_FILE.tsv.gz  \
    --match 'NAME: (n1)-[]->(), `TYPE_FILE`: (n1)-[l]->(n2)' \
    --return 'distinct l, n1, l.label, n2' \
    --order-by 'n1, l.label, n2'"
for type in types:
    run_command(command, {"TYPE_FILE": type})


$kgtk query -i $TEMP/all.isa.Q5.tsv.gz -i $WIKIDATA_PARTS/part.time.tsv.gz --graph-cache $STORE      -o $OUT/Q5.part.time.tsv.gz      --match 'Q5: (n1)-[]->(), `time`: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'

[2020-09-30 11:55:12 sqlstore]: IMPORT graph directly into table graph_29 from /Users/pedroszekely/Downloads/kypher/Q5-temp/all.isa.Q5.tsv.gz ...
[2020-09-30 11:55:12 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_4_c2."id", graph_4_c2."node1", graph_4_c2."label", graph_4_c2."node2"
     FROM graph_29 AS graph_29_c1, graph_4 AS graph_4_c2
     WHERE graph_29_c1."node1"=graph_4_c2."node1"
     ORDER BY graph_4_c2."node1" ASC, graph_4_c2."label" ASC, graph_4_c2."node2" ASC
  PARAS: []
---------------------------------------------
[2020-09-30 11:55:12 sqlstore]: CREATE INDEX on table graph_29 column node1 ...
[2020-09-30 11:55:12 sqlstore]: ANALYZE INDEX on table graph_29 column node1 
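
Each of these queries keeps the edges of one part file whose node1 also appears as node1 in the isa file, then removes duplicates and sorts. A minimal sketch of that join on toy data follows; `edges_for_subjects` and the rows are assumptions made for illustration.

```python
def edges_for_subjects(subject_edges, part_edges):
    """Return distinct part edges whose node1 is a subject, in query order."""
    subjects = {node1 for node1, _label, _node2 in subject_edges}
    # Part edges carry an id column: (id, node1, label, node2).
    matched = {edge for edge in part_edges if edge[1] in subjects}
    # Mirrors --order-by 'n1, l.label, n2'.
    return sorted(matched, key=lambda edge: (edge[1], edge[2], edge[3]))


subject_edges = [("Q42", "isa", "Q5")]
part_edges = [
    ("e1", "Q42", "P569", "^1952-03-11T00:00:00Z/11"),
    ("e2", "Q64", "P571", "^1237-01-01T00:00:00Z/9"),
    ("e1", "Q42", "P569", "^1952-03-11T00:00:00Z/11"),
]
selected = edges_for_subjects(subject_edges, part_edges)
print(selected)
```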

### Generate a P279star file

First generate the P279 and P31 edges for every node2 in the wikibase-item file.

In [12]:
command_p279 = "$kgtk query -i $OUT/NAME.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P279.tsv.gz --graph-cache $STORE \
-o $TEMP/NAME.node2.P279.tsv.gz \
--match 'NAME: ()-[]->(n1), P279: (n1)-[l]->(n2)' \
--return 'distinct l, n1 as node1, l.label, n2' \
--order-by 'n1, l.label, n2'"

command_p31 = "$kgtk query -i $OUT/NAME.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P31.tsv.gz --graph-cache $STORE \
-o $TEMP/NAME.node2.P31.tsv.gz \
--match 'NAME: ()-[]->(n1), P31: (n1)-[l]->(n2)' \
--return 'distinct l, n1 as node1, l.label, n2' \
--order-by 'n1, l.label, n2'"

run_command(command_p279)
run_command(command_p31)

!$kgtk cat $TEMP/$NAME.node2.P279.tsv.gz $TEMP/$NAME.node2.P31.tsv.gz | gzip > $TEMP/$NAME.P279_P31.tsv.gz


$kgtk query -i $OUT/Q5.part.wikibase-item.tsv.gz -i $WIKIDATA_PARTS/all.P279.tsv.gz --graph-cache $STORE -o $TEMP/Q5.node2.P279.tsv.gz --match 'Q5: ()-[]->(n1), P279: (n1)-[l]->(n2)' --return 'distinct l, n1 as node1, l.label, n2' --order-by 'n1, l.label, n2'

[2020-09-30 11:55:32 sqlstore]: IMPORT graph directly into table graph_30 from /Users/pedroszekely/Downloads/kypher/Q5/Q5.part.wikibase-item.tsv.gz ...
[2020-09-30 11:55:33 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_19_c2."id", graph_30_c1."node2" "node1", graph_19_c2."label", graph_19_c2."node2"
     FROM graph_19 AS graph_19_c2, graph_30 AS graph_30_c1
     WHERE graph_19_c2."node1"=graph_30_c1."node2"
     ORDER BY graph_30_c1."node2" ASC, graph_19_c2."label" ASC, graph_19_c2."node2" ASC
  PARAS: []
---------------------------------------------
[2020-09-30 11:55:33 sqlstore]: CREATE INDEX on table graph_30 column node2 ...
[2020-09-30 11:55:33 sqlstore]: ANALYZE INDEX on tabl

In [13]:
!kgtk cat -i $OUT/$NAME.part.*.tsv.gz  | gzip > $TEMP/$NAME.all_1.tsv.gz

In [14]:
command_node1 = "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \
    --graph-cache $STORE  \
    -o $TEMP/NAME.P279star.1.tsv.gz \
    --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()' \
    --return 'distinct l, n1, l.label, n2'"

command_node2 = "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \
    --graph-cache $STORE  \
    -o $TEMP/NAME.P279star.2.tsv.gz \
    --match 'P279star: (n1)-[l]->(n2), all_1: ()-[]->(n1)' \
    --return 'distinct l, n1 as node1, l.label, n2'" 

cat_command = "$kgtk cat $TEMP/NAME.P279star.1.tsv.gz $TEMP/NAME.P279star.2.tsv.gz | gzip > $OUT/NAME.P279star.tsv.gz"

run_command(command_node1)
run_command(command_node2)
run_command(cat_command)

$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/Q5.all_1.tsv.gz     --graph-cache $STORE      -o $TEMP/Q5.P279star.1.tsv.gz     --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()'     --return 'distinct l, n1, l.label, n2'

[2020-09-30 11:55:40 sqlstore]: IMPORT graph directly into table graph_31 from /Users/pedroszekely/Downloads/kypher/Q5-temp/Q5.all_1.tsv.gz ...
[2020-09-30 11:55:42 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_2_c1."id", graph_2_c1."node1", graph_2_c1."label", graph_2_c1."node2"
     FROM graph_2 AS graph_2_c1, graph_31 AS graph_31_c2
     WHERE graph_2_c1."node1"=graph_31_c2."node1"
  PARAS: []
---------------------------------------------
[2020-09-30 11:55:42 sqlstore]: CREATE INDEX on table graph_31 column node1 ...
[2020-09-30 11:55:42 sqlstore]: ANALYZE INDEX on table graph_31 column node1 ...
        3.10 real         4.24 user         0.31 sys

$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz 

### Get info on all properties

In [15]:
!$kgtk cat $OUT/*.gz | gzip > $TEMP/$NAME.everything_1.tsv.gz

        4.01 real         3.82 user         0.15 sys


First get a list of all the properties used in this file.

In [16]:
!$kgtk query -i $TEMP/$NAME.everything_1.tsv.gz --graph-cache $STORE \
-o $TEMP/$NAME.properties.tsv \
--match '(n1)-[l]->(n2)' \
--return 'distinct l.label as node1, "dummy" as label, "dummy" as node2' 

[2020-09-30 11:55:49 sqlstore]: IMPORT graph directly into table graph_32 from /Users/pedroszekely/Downloads/kypher/Q5-temp/Q5.everything_1.tsv.gz ...
[2020-09-30 11:55:51 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_32_c1."label" "node1", ? "label", ? "node2"
     FROM graph_32 AS graph_32_c1
  PARAS: ['dummy', 'dummy']
---------------------------------------------
        3.21 real         4.43 user         0.29 sys
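
The query above lists each distinct edge label once and pads it with two constant columns, presumably so that the result is itself a node1/label/node2 edge file that later queries can join against. A small sketch of that step, with a hypothetical helper name and toy edges:

```python
def property_placeholder_edges(edges):
    """Return one (property, "dummy", "dummy") row per distinct edge label."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    distinct_labels = dict.fromkeys(label for _node1, label, _node2 in edges)
    return [(label, "dummy", "dummy") for label in distinct_labels]


edges = [
    ("Q42", "P31", "Q5"),
    ("Q42", "P106", "Q36180"),
    ("Q1339", "P31", "Q5"),
]
rows = property_placeholder_edges(edges)
print(rows)
```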


Now get all the information about these properties.

In [17]:
!$kgtk query -i $TEMP/$NAME.properties.tsv -i $WIKIDATA_PARTS/part.wikibase-item.tsv.gz --graph-cache $STORE \
-o $OUT/$NAME.properties.tsv.gz \
--match '`wikibase-item`: (p)-[l]->(n2), properties: (p)-[]->()' \
--return 'distinct l, p, l.label, n2' 

[2020-09-30 11:55:52 sqlstore]: IMPORT graph directly into table graph_33 from /Users/pedroszekely/Downloads/kypher/Q5-temp/Q5.properties.tsv ...
[2020-09-30 11:55:52 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_5_c1."id", graph_5_c1."node1", graph_5_c1."label", graph_5_c1."node2"
     FROM graph_33 AS graph_33_c2, graph_5 AS graph_5_c1
     WHERE graph_33_c2."node1"=graph_5_c1."node1"
  PARAS: []
---------------------------------------------
[2020-09-30 11:55:52 sqlstore]: CREATE INDEX on table graph_33 column node1 ...
[2020-09-30 11:55:52 sqlstore]: ANALYZE INDEX on table graph_33 column node1 ...
        0.75 real         0.58 user         0.15 sys


### Generate the labels, aliases and descriptions
We want the labels, aliases and descriptions for every Q-node in our dataset. This means that we need them for all Q-nodes that appear in the node1 or node2 position.

The first step is to concatenate all the files in our dataset.

In [18]:
!$kgtk cat $OUT/*.gz | gzip > $TEMP/$NAME.everything_2.tsv.gz

        4.05 real         3.86 user         0.15 sys


Now we extract the labels from our input Wikidata folder. We do this by matching node1, then node2, and then concatenating the resulting label files.

In [19]:
labels = [
    "label",
    "alias",
    "description"
]

command_node1 = "$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE  \
    -o $TEMP/NAME.LABEL.en.1.tsv.gz  \
    --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)' \
    --return 'distinct l, n1, l.label, n2' \
    --order-by 'n1, l.label, n2'"

command_node2 = "$kgtk query -i $TEMP/NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE  \
    -o $TEMP/NAME.LABEL.en.2.tsv.gz  \
    --match 'everything_2: ()-[]->(n1), part: (n1)-[l]->(n2)' \
    --return 'distinct l, n1 as node1, l.label, n2' \
    --order-by 'n1, l.label, n2'"

cat_command = "kgtk cat $TEMP/NAME.LABEL.*.gz | gzip > $OUT/NAME.LABEL.en.tsv.gz"

for label in labels:
    run_command(command_node1, {"LABEL": label})
    run_command(command_node2, {"LABEL": label})
    run_command(cat_command, {"LABEL": label})


$kgtk query -i $TEMP/Q5.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE      -o $TEMP/Q5.label.en.1.tsv.gz      --match 'everything_2: (n1)-[]->(), part: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'

[2020-09-30 11:55:57 sqlstore]: IMPORT graph directly into table graph_34 from /Users/pedroszekely/Downloads/kypher/Q5-temp/Q5.everything_2.tsv.gz ...
[2020-09-30 11:56:00 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_25_c2."id", graph_34_c1."node1", graph_25_c2."label", graph_25_c2."node2"
     FROM graph_25 AS graph_25_c2, graph_34 AS graph_34_c1
     WHERE graph_25_c2."node1"=graph_34_c1."node1"
     ORDER BY graph_34_c1."node1" ASC, graph_25_c2."label" ASC, graph_25_c2."node2" ASC
  PARAS: []
---------------------------------------------
[2020-09-30 11:56:00 sqlstore]: CREATE INDEX on table graph_34 column node1 ...
[2020-09-30 11:56:00 sqlstore]: ANALYZE IND
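
The two queries plus the concatenation select the label edges for every identifier that occurs as node1 or node2 in the dataset. The sketch below does the same in one pass over toy data; the helper names and the rows are assumptions made for illustration.

```python
def nodes_in_edges(edges):
    """Collect every identifier that appears as node1 or node2."""
    nodes = set()
    for node1, _label, node2 in edges:
        nodes.add(node1)
        nodes.add(node2)
    return nodes


def labels_for(edges, label_edges):
    """Keep the label edges whose node1 occurs anywhere in edges."""
    wanted = nodes_in_edges(edges)
    return sorted({edge for edge in label_edges if edge[0] in wanted})


edges = [("Q42", "P31", "Q5"), ("Q42", "P106", "Q36180")]
label_edges = [
    ("Q42", "label", "'Douglas Adams'@en"),
    ("Q5", "label", "'human'@en"),
    ("Q64", "label", "'Berlin'@en"),
]
found = labels_for(edges, label_edges)
print(found)
```

Using a set means that a node occurring in both positions contributes its label only once.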

### Summary of what we got

In [20]:
%%bash
for f in $OUT/*.tsv.gz; do
    echo -n `basename $f`
    gzcat $f | wc -l
done

Q5.P279star.tsv.gz     482
Q5.alias.en.tsv.gz   12633
Q5.description.en.tsv.gz   27179
Q5.label.en.tsv.gz   29886
Q5.part.commonsMedia.tsv.gz   10285
Q5.part.external-id.tsv.gz  182960
Q5.part.geo-shape.tsv.gz       1
Q5.part.globe-coordinate.tsv.gz       1
Q5.part.math.tsv.gz       1
Q5.part.monolingualtext.tsv.gz    6334
Q5.part.musical-notation.tsv.gz       1
Q5.part.quantity.tsv.gz   13984
Q5.part.string.tsv.gz    9277
Q5.part.time.tsv.gz   47355
Q5.part.url.tsv.gz    1591
Q5.part.wikibase-form.tsv.gz       8
Q5.part.wikibase-item.tsv.gz  337605
Q5.part.wikibase-property.tsv.gz       1
Q5.properties.tsv.gz      39
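
The `gzcat` command used above is not available under that name on every system, so a Python version of the same count can be handy. The sketch below is an assumption-labeled alternative, not part of the notebook: `count_lines` and `summarize` are hypothetical helpers, and the demo writes a toy file to a temporary folder instead of reading `$OUT`.

```python
import glob
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines of a gzip-compressed text file, header included."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _line in handle)


def summarize(folder):
    """Map each *.tsv.gz file name in folder to its line count."""
    paths = sorted(glob.glob(os.path.join(folder, "*.tsv.gz")))
    return {os.path.basename(path): count_lines(path) for path in paths}


# Demo on a toy file that contains only a header row.
with tempfile.TemporaryDirectory() as folder:
    toy_path = os.path.join(folder, "toy.part.math.tsv.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as handle:
        handle.write("id\tnode1\tlabel\tnode2\n")
    summary = summarize(folder)
print(summary)
```

Assuming each output file starts with a header row, a count of 1 in the listing above means the file holds no edges.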


Unzip the everything file, as graph-statistics cannot work with gz files.

In [21]:
!rm $TEMP/$NAME.everything_2.tsv

rm: /Users/pedroszekely/Downloads/kypher/Q5-temp/Q5.everything_2.tsv: No such file or directory


In [22]:
!gunzip --keep $TEMP/$NAME.everything_2.tsv.gz

In [23]:
!$kgtk graph-statistics --log $OUT/$NAME.everything.statistics.txt \
    --statistics-only --pagerank -i $TEMP/$NAME.everything_2.tsv \
    | gzip > $OUT/$NAME.statistics.tsv.gz

       15.82 real        17.16 user         0.98 sys


In [24]:
!cat $OUT/$NAME.everything.statistics.txt

loading the TSV graph now ...
graph loaded! It has 330218 nodes and 609909 edges

###Top relations:
P106	44621
P31	30987
P21	30593
P569	29455
P27	28609
P735	26676
P19	21619
P54	16787
P570	15185
P734	14326

###PageRank
Max pageranks
242	Q82955	0.002671
261693	Q30	0.003147
351	Q6581072	0.002999
261449	Q6581097	0.016354
1	Q5	0.019606


In [25]:
!exa -l $WIKIDATA_PARTS

.rw-r--r--  71k pedroszekely 30 Sep 11:22 all-distribution.tsv
.rw-r--r--  555 pedroszekely 30 Sep 11:27 all.isa.Q318.tsv.gz
.rw-r--r--   88 pedroszekely 30 Sep 11:27 all.isa.Q13442814.tsv.gz
.rw-r--r-- 577k pedroszekely 30 Sep 11:27 all.isa.tsv.gz
.rw-r--r-- 842k pedroszekely 30 Sep 11:26 all.P31.tsv.gz
.rw-r--r-- 885k pedroszekely 30 Sep 11:26 all.P31_P27

Example of how to do simple queries

In [26]:
!$kgtk query -i $OUT/$NAME.label.en.tsv.gz --graph-cache $STORE \
--match '(n:P39)-[:label]->(n2)' \
--return 'n as node, n2 as label'

[2020-09-30 11:56:27 sqlstore]: IMPORT graph directly into table graph_35 from /Users/pedroszekely/Downloads/kypher/Q5/Q5.label.en.tsv.gz ...
[2020-09-30 11:56:27 query]: SQL Translation:
---------------------------------------------
  SELECT graph_35_c1."node1" "node", graph_35_c1."node2" "label"
     FROM graph_35 AS graph_35_c1
     WHERE graph_35_c1."label"=?
     AND graph_35_c1."node1"=?
  PARAS: ['label', 'P39']
---------------------------------------------
[2020-09-30 11:56:27 sqlstore]: CREATE INDEX on table graph_35 column node1 ...
[2020-09-30 11:56:27 sqlstore]: ANALYZE INDEX on table graph_35 column node1 ...
[2020-09-30 11:56:27 sqlstore]: CREATE INDEX on table graph_35 column label ...
[2020-09-30 11:56:27 sqlstore]: ANALYZE INDEX on table graph_35 column label ...
node	label
P39	'position held'@en
        0.99 real         0.83 user         0.17 sys


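Example of how to get statistics on the values of a property, here P106 (occupation). Note that the column named property_label holds the label of the value (n2), not of the property.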

In [27]:
!kgtk query -i $TEMP/$NAME.everything_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE \
--match 'everything: (n1)-[l:P106]->(n2), label: (n2)-[:label]->(label)' \
--return 'distinct l.label as property_id, label as property_label, n2 as value, count(n2) as value_count' \
--order-by 'count(n2) desc' \
--limit 10 \
| column -t -s $'\t' 

property_id  property_label                     value     value_count
P106         'association football manager'@en  Q628099   326
P106         'scientist'@en                     Q901      127
P106         'theater director'@en              Q3387717  83
P106         'geographer'@en                    Q901402   48
P106         'merchant'@en                      Q215536   35
P106         'postage stamp designer'@en        Q2000124  8
P106         'carpenter'@en                     Q154549   4
P106         'type designer'@en                 Q354034   3
P106         'restorer'@en                      Q2145981  2
P106         'sniper'@en                        Q201948   2
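
The aggregation in this query counts how often each value occurs for one property and keeps only values that have a label. A minimal pure-Python sketch of the same computation follows; `top_values`, the toy edges and the label dictionary are illustrative assumptions.

```python
from collections import Counter


def top_values(edges, prop, labels, limit=10):
    """Return (property, value label, value, count) rows, most frequent first."""
    counts = Counter(
        node2
        for _node1, label, node2 in edges
        # Values without a label are skipped, like the join in the query.
        if label == prop and node2 in labels
    )
    return [
        (prop, labels[value], value, count)
        for value, count in counts.most_common(limit)
    ]


edges = [
    ("Q1", "P106", "Q901"),
    ("Q2", "P106", "Q901"),
    ("Q3", "P106", "Q628099"),
    ("Q3", "P27", "Q30"),
    ("Q4", "P106", "Q999"),
]
labels = {
    "Q901": "'scientist'@en",
    "Q628099": "'association football manager'@en",
}
rows = top_values(edges, "P106", labels)
print(rows)
```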
