# Creating a subset of Wikidata

This notebook illustrates how to create a subset of Wikidata. We use as an example https://www.wikidata.org/wiki/Q11173 (chemical compound)

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill Example8\ -\ Wikidata\ Subset.ipynb exmaple8.out.ipynb \
-p wikidata_home /Users/pedroszekely/Downloads/kypher \
-p wikidata_file all.10.tsv.gz \
-p wikidata_parts_folder /Users/pedroszekely/Downloads/kypher/output.all.10 \
-p subset_name Q11173 \
-p home /Users/pedroszekely/Downloads/kypher \
-p delete_database yes 
```

In [1]:
# Parameters
wikidata_home = "/Users/pedroszekely/Downloads/kypher"
wikidata_file = "all.10.tsv.gz"
wikidata_parts_folder = "/Users/pedroszekely/Downloads/kypher/output.all.10"
subset_name = "Q11173"
subset_name = "Q318"
subset_name = "Q5"
home = "/Users/pedroszekely/Downloads/kypher"
delete_database = "yes"

In [None]:
temp_folder = subset_name + "-temp"
output_folder = subset_name

In [2]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

A convenience function to run templetazed commands, substituting NAME with the name of the dataset and substituting other keys provided in a dictionary.

In [3]:
def run_command(command, substitution_dictionary = {}):
    """Run a templetized command."""
    cmd = command.replace("NAME", subset_name)
    for k, v in substitution_dictionary.items():
        cmd = cmd.replace(k, v)
    
    print(cmd)
    output = subprocess.run([cmd], shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)
    #print(output.returncode)

### Set up environment variables and folders that we need

In [4]:
# folder containing wikidata broken down into smaller files.
os.environ['WIKIDATA_PARTS'] = wikidata_parts_folder
# path of folder where the wikidata parts folder is stored.
os.environ['WIKIDATA_HOME'] = wikidata_home
# name of the dataset
os.environ['NAME'] = subset_name
# folder where to put the output
os.environ['OUT'] = "{}/{}".format(home, output_folder)
# temporary folder
os.environ['TEMP'] = "{}/{}".format(home, temp_folder)
# kgtk command to run
os.environ['kgtk'] = "kgtk"
os.environ['kgtk'] = "time kgtk --debug"
# absolute path of the db
os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(home, temp_folder)

In [5]:
cd $home

/Users/pedroszekely/Downloads/kypher


In [6]:
!mkdir $output_folder
!mkdir $temp_folder

mkdir: Q5: File exists
mkdir: Q5-temp: File exists


In [7]:
!rm $OUT/*.tsv $OUT/*.tsv.gz
!rm $TEMP/*.tsv $TEMP/*.tsv.gz

rm: /Users/pedroszekely/Downloads/kypher/Q5/*.tsv: No such file or directory
rm: /Users/pedroszekely/Downloads/kypher/Q5-temp/*.tsv: No such file or directory


In [8]:
if delete_database:
    print("Deleted database")
    !rm $TEMP/wikidata.sqlite3.db

Deleted database


### Extract the Q-nodes for the items we want
Here we assume that the subset is for an individual q-node, so that the subset name is the name of the q-node. We should generalize this so that this query can be passed in as a parameter. We construct a file that contains all the node1s that are isa of the given NAME q-node

In [9]:
command = "$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $WIKIDATA_PARTS/all.P279star.tsv.gz \
    --graph-cache $STORE \
    -o $TEMP/all.isa.NAME.tsv.gz  \
    --match 'isa: (n1)-[l:isa]->(n2:NAME)' \
    --return 'distinct n1, l.label, n2'"
run_command(command)

$kgtk query -i $WIKIDATA_PARTS/all.isa.tsv.gz -i $WIKIDATA_PARTS/all.P279star.tsv.gz     --graph-cache $STORE     -o $TEMP/all.isa.Q5.tsv.gz      --match 'isa: (n1)-[l:isa]->(n2:Q5)'     --return 'distinct n1, l.label, n2'

[2020-09-26 21:40:15 sqlstore]: IMPORT graph directly into table graph_1 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.isa.tsv.gz ...
[2020-09-26 21:40:16 sqlstore]: IMPORT graph directly into table graph_2 from /Users/pedroszekely/Downloads/kypher/output.all.10/all.P279star.tsv.gz ...
[2020-09-26 21:40:16 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node1", graph_1_c1."label", graph_1_c1."node2"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
     AND graph_1_c1."node2"=?
  PARAS: ['isa', 'Q5']
---------------------------------------------
[2020-09-26 21:40:16 sqlstore]: CREATE INDEX on table graph_1 column label ...
[2020-09-26 21:40:16 sqlstore]: ANALYZE INDEX on table graph_1 c

### Generate the parts of this dataset

In [10]:
types = [
    "time",
    "wikibase_item",
    "math",
    "wikibase_form",
    "quantity",
    "string",
    "external_id",
    "commonsMedia",
    "globe_coordinate",
    "monolingualtext",
    "musical_notation",
    "geo_shape",
    "wikibase_property",
    "url",
]
command = "$kgtk query -i $TEMP/all.isa.NAME.tsv.gz -i $WIKIDATA_PARTS/part.TYPE_FILE.tsv.gz --graph-cache $STORE  \
    -o $OUT/NAME.part.TYPE_FILE.tsv.gz  \
    --match 'NAME: (n1)-[]->(), TYPE_FILE: (n1)-[l]->(n2)' \
    --return 'distinct l, n1, l.label, n2' \
    --order-by 'n1, l.label, n2'"
for type in types:
    run_command(command, {"TYPE_FILE": type.replace("-", "_")})


$kgtk query -i $TEMP/all.isa.Q5.tsv.gz -i $WIKIDATA_PARTS/part.time.tsv.gz --graph-cache $STORE      -o $OUT/Q5.part.time.tsv.gz      --match 'Q5: (n1)-[]->(), time: (n1)-[l]->(n2)'     --return 'distinct l, n1, l.label, n2'     --order-by 'n1, l.label, n2'

[2020-09-26 21:40:17 sqlstore]: IMPORT graph directly into table graph_3 from /Users/pedroszekely/Downloads/kypher/Q5-temp/all.isa.Q5.tsv.gz ...
[2020-09-26 21:40:17 sqlstore]: IMPORT graph directly into table graph_4 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.time.tsv.gz ...
[2020-09-26 21:40:17 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_4_c2."id", graph_3_c1."node1", graph_4_c2."label", graph_4_c2."node2"
     FROM graph_3 AS graph_3_c1, graph_4 AS graph_4_c2
     WHERE graph_3_c1."node1"=graph_4_c2."node1"
     ORDER BY graph_3_c1."node1" ASC, graph_4_c2."label" ASC, graph_4_c2."node2" ASC
  PARAS: []
---------------------------------------------
[2020-09-26 2


[2020-09-26 21:40:34 sqlstore]: IMPORT graph directly into table graph_12 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.globe_coordinate.tsv.gz ...
[2020-09-26 21:40:35 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_12_c2."id", graph_12_c2."node1", graph_12_c2."label", graph_12_c2."node2"
     FROM graph_12 AS graph_12_c2, graph_3 AS graph_3_c1
     WHERE graph_12_c2."node1"=graph_3_c1."node1"
     ORDER BY graph_12_c2."node1" ASC, graph_12_c2."label" ASC, graph_12_c2."node2" ASC
  PARAS: []
---------------------------------------------
[2020-09-26 21:40:35 sqlstore]: CREATE INDEX on table graph_12 column node1 ...
[2020-09-26 21:40:35 sqlstore]: ANALYZE INDEX on table graph_12 column node1 ...
        0.79 real         0.72 user         0.14 sys

$kgtk query -i $TEMP/all.isa.Q5.tsv.gz -i $WIKIDATA_PARTS/part.monolingualtext.tsv.gz --graph-cache $STORE      -o $OUT/Q5.part.monolingualtext.tsv.gz      --match 'Q5: (n1)-[]->

### Generate a P279star file

In [11]:
!kgtk cat -i $OUT/$NAME.part.*.tsv.gz | gzip > $TEMP/$NAME.all_1.tsv.gz

In [12]:
command_node1 = "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \
    --graph-cache $STORE  \
    -o $TEMP/P279star.1.tsv.gz \
    --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()' \
    --return 'distinct l, n1, l.label, n2'"

command_node2 = "$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/NAME.all_1.tsv.gz \
    --graph-cache $STORE  \
    -o $TEMP/P279star.2.tsv.gz \
    --match 'P279star: (n1)-[l]->(n2), all_1: ()-[]->(n1)' \
    --return 'distinct l, n1 as node1, l.label, n2'" 

cat_command = "$kgtk cat $TEMP/P279star.1.tsv.gz $TEMP/P279star.2.tsv.gz | gzip > $OUT/NAME.P279star.tsv.gz"

run_command(command_node1)
run_command(command_node2)
run_command(cat_command)

$kgtk query -i $WIKIDATA_PARTS/all.P279star.tsv.gz -i $TEMP/Q5.all_1.tsv.gz     --graph-cache $STORE      -o $TEMP/P279star.1.tsv.gz     --match 'P279star: (n1)-[l]->(n2), all_1: (n1)-[]->()'     --return 'distinct l, n1, l.label, n2'

[2020-09-26 21:40:43 sqlstore]: IMPORT graph directly into table graph_18 from /Users/pedroszekely/Downloads/kypher/Q5-temp/Q5.all_1.tsv.gz ...
[2020-09-26 21:40:45 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_2_c1."id", graph_2_c1."node1", graph_2_c1."label", graph_2_c1."node2"
     FROM graph_18 AS graph_18_c2, graph_2 AS graph_2_c1
     WHERE graph_18_c2."node1"=graph_2_c1."node1"
  PARAS: []
---------------------------------------------
[2020-09-26 21:40:45 sqlstore]: CREATE INDEX on table graph_18 column node1 ...
[2020-09-26 21:40:45 sqlstore]: ANALYZE INDEX on table graph_18 column node1 ...
[2020-09-26 21:40:45 sqlstore]: CREATE INDEX on table graph_2 column node1 ...
[2020-09-26 21:40:45 sqlstore

### Generate the labels, aliases and descriptions
We want the labels, aliases and descriptions for every q-node in our dataset. THis means that we need these lables for all q-nodes that appear in the node1 or node2 position.

The first step is to concatenate all the files in our dataset.

In [13]:
!kgtk cat -i $TEMP/$NAME.all_1.tsv.gz $OUT/$NAME.P279star.tsv.gz | gzip > $TEMP/$NAME.all_2.tsv.gz

Now we extract the labels from from our input wikidata folder. We do this matching node1, thend node 2, then we concatenate the resulting label files.

In [14]:
labels = [
    "label",
    "alias",
    "description"
]

command_node1 = "$kgtk query -i $TEMP/NAME.all_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE  \
    -o $TEMP/NAME.LABEL.en.1.tsv.gz  \
    --match 'NAME: (n1)-[]->(), part: (n1)-[l]->(n2)' \
    --return 'l, n1, l.label, n2' \
    --order-by 'n1, l.label, n2'"

command_node2 = "$kgtk query -i $TEMP/NAME.all_2.tsv.gz -i $WIKIDATA_PARTS/part.LABEL.en.tsv.gz --graph-cache $STORE  \
    -o $TEMP/NAME.LABEL.en.2.tsv.gz  \
    --match 'NAME: ()-[]->(n1), part: (n1)-[l]->(n2)' \
    --return 'l, n1 as node1, l.label, n2' \
    --order-by 'n1, l.label, n2'"

cat_command = "kgtk cat $TEMP/NAME.LABEL.*.gz | gzip > $OUT/$NAME.LABEL.en.tsv.gz"

for label in labels:
    run_command(command_node1, {"LABEL": label})
    run_command(command_node2, {"LABEL": label})
    run_command(cat_command, {"LABEL": label})


$kgtk query -i $TEMP/Q5.all_2.tsv.gz -i $WIKIDATA_PARTS/part.label.en.tsv.gz --graph-cache $STORE      -o $TEMP/Q5.label.en.1.tsv.gz      --match 'Q5: (n1)-[]->(), part: (n1)-[l]->(n2)'     --return 'l, n1, l.label, n2'     --order-by 'n1, l.label, n2'

[2020-09-26 21:40:52 sqlstore]: IMPORT graph directly into table graph_19 from /Users/pedroszekely/Downloads/kypher/Q5-temp/Q5.all_2.tsv.gz ...
[2020-09-26 21:40:54 sqlstore]: IMPORT graph directly into table graph_20 from /Users/pedroszekely/Downloads/kypher/output.all.10/part.label.en.tsv.gz ...
[2020-09-26 21:40:54 query]: SQL Translation:
---------------------------------------------
  SELECT graph_20_c2."id", graph_19_c1."node1", graph_20_c2."label", graph_20_c2."node2"
     FROM graph_19 AS graph_19_c1, graph_20 AS graph_20_c2
     WHERE graph_19_c1."node1"=graph_20_c2."node1"
     ORDER BY graph_19_c1."node1" ASC, graph_20_c2."label" ASC, graph_20_c2."node2" ASC
  PARAS: []
---------------------------------------------
[2020-09-2

### Summary of what we got

In [15]:
%%bash
for f in $OUT/*.tsv.gz; do
    echo -n `basename $f`
    gzcat $f | wc -l
done

Q5.P279star.tsv.gz     482
Q5.part.commonsMedia.tsv.gz   10285
Q5.part.external_id.tsv.gz  182960
Q5.part.geo_shape.tsv.gz       1
Q5.part.globe_coordinate.tsv.gz       1
Q5.part.math.tsv.gz       1
Q5.part.monolingualtext.tsv.gz    6334
Q5.part.musical_notation.tsv.gz       1
Q5.part.quantity.tsv.gz   13984
Q5.part.string.tsv.gz    9277
Q5.part.time.tsv.gz   47355
Q5.part.url.tsv.gz    1591
Q5.part.wikibase_form.tsv.gz       8
Q5.part.wikibase_item.tsv.gz  337605
Q5.part.wikibase_property.tsv.gz       1


In [16]:
!$kgtk cat $OUT/*.gz > $TEMP/everything.tsv

        3.67 real         3.48 user         0.14 sys


In [17]:
!$kgtk graph-statistics --log $OUT/$NAME.everything.statistics.txt \
    --statistics-only --pagerank -i $TEMP/everything.tsv \
    | gzip > $OUT/$NAME.statistics.tsv.gz

       14.78 real        17.07 user         0.92 sys


In [18]:
!cat $OUT/$NAME.everything.statistics.txt

loading the TSV graph now ...
graph loaded! It has 330189 nodes and 609871 edges

###Top relations:
P106	44621
P31	30984
P21	30593
P569	29455
P27	28609
P735	26676
P19	21619
P54	16787
P570	15185
P734	14326

###PageRank
Max pageranks
242	Q82955	0.002672
351	Q6581072	0.003000
1	Q5	0.019608
261449	Q6581097	0.016356
261693	Q30	0.003148
