# Generating Subsets of Wikidata

### Batch Invocation
Example batch command using `papermill`. The second argument is the output notebook, where the executed cells and their output are stored. You can open it while the run is in progress to follow along.

```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p wiki_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz \
-p qual_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no 
```
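The same invocation can also be assembled from Python, for example when launching the run from a script. The following is a minimal sketch that only builds the argument list with the standard library. The helper name `build_papermill_args` is hypothetical, and the parameter values are the ones from the command above.

```python
import shlex


def build_papermill_args(notebook, output_notebook, parameters):
    """Build a `papermill` argument list with one `-p name value` triple per parameter."""
    args = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        args += ["-p", name, str(value)]
    return args


args = build_papermill_args(
    "Wikidata Useful Files.ipynb",
    "useful-files.out.ipynb",
    {
        "output_folder": "useful_files_v4",
        "temp_folder": "temp.useful_files_v4",
        "delete_database": "no",
    },
)

# shlex.join quotes arguments that contain spaces, so the printed line can be pasted into a shell.
print(shlex.join(args))
```

Passing `args` to `subprocess.run(args, check=True)` avoids manual escaping of the spaces in the notebook name and in the Google Drive paths.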

In [44]:
# Parameters

# Folder on local machine where to create the output and temporary folders
output_path = "/Users/pedroszekely/Downloads/kypher"

# The names of the output and temporary folders
output_folder = "useful_wikidata_files_v3"
temp_folder = "temp.useful_wikidata_files_v3"

# Classes to remove
remove_classes = "Q13442814, Q523, Q16521, Q318, Q7318358, Q7187, Q11173, Q8054"

# The location of input files
wiki_file = "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz"
label_file = "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz"
item_file = "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz"
property_item_file = "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz"
qual_file = "/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz"

# Location of the cache database for kypher
cache_path = "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3"

# Whether to delete the cache database
delete_database = "no"

In [2]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

# from IPython.display import display, HTML, Image
# from pandas_profiling import ProfileReport

## Set up environment and folders to store the files

- `OUT` folder where the output files go
- `TEMP` folder to keep temporary files, including the database
- `kgtk` shortcut to invoke the kgtk software
- `kypher` shortcut to invoke `kgtk query` with the cache database
- `EDGES` the `all.tsv` file of wikidata that contains all edges except label/alias/description
- `LABELS` the file with the English labels
- `ITEMS` the wikibase-item file (currently does not include node1 that are properties so for now we need the net file)
- `PROPERTY_ITEMS` the items that are properties
- `QUALS` the qualifiers file
- `STORE` location of the cache file

In [45]:
if cache_path:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)
os.environ['OUT'] = "{}/{}".format(output_path, output_folder)
os.environ['TEMP'] = "{}/{}".format(output_path, temp_folder)
os.environ['kgtk'] = "kgtk"
# os.environ['kgtk'] = "time kgtk --debug"
os.environ['kypher'] = "time kgtk query --graph-cache " + os.environ['STORE']
os.environ['EDGES'] = wiki_file
os.environ['LABELS'] = label_file
os.environ['ITEMS'] = item_file
os.environ['PROPERTY_ITEMS'] = property_item_file
os.environ['QUALS'] = qual_file
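The path logic in the cell above can be checked without touching `os.environ`. This is a small sketch under the assumption that paths use `/` as the separator; the function name `build_environment` and the `/tmp/kypher` paths are hypothetical.

```python
def build_environment(output_path, output_folder, temp_folder, cache_path=None):
    """Return the folder and cache settings as a dict instead of writing to os.environ."""
    temp = "{}/{}".format(output_path, temp_folder)
    # The cache database lives in cache_path when it is given, otherwise in the temp folder.
    store_folder = cache_path if cache_path else temp
    store = "{}/wikidata.sqlite3.db".format(store_folder)
    return {
        "OUT": "{}/{}".format(output_path, output_folder),
        "TEMP": temp,
        "STORE": store,
        "kypher": "time kgtk query --graph-cache " + store,
    }


default_env = build_environment("/tmp/kypher", "useful_files", "temp.useful_files")
cached_env = build_environment(
    "/tmp/kypher", "useful_files", "temp.useful_files", cache_path="/tmp/cache"
)
print(default_env["STORE"])
print(cached_env["STORE"])
```

Calling `os.environ.update(default_env)` would then apply the settings in one step.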

Echo the variables to see if they are all set correctly

In [4]:
!echo $OUT
!echo $TEMP
!echo $kgtk
!echo $kypher
!echo $EDGES
!echo $LABELS
!echo $ITEMS
!echo $PROPERTY_ITEMS
!echo $STORE

/Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3
/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3
kgtk
time kgtk query --graph-cache /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/wikidata.sqlite3.db
/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz
/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3/wikidata.sqlite3.db


Go to the output directory and create the subfolders for the output files and the temporary files

In [5]:
cd $output_path

/Users/pedroszekely/Downloads/kypher


In [6]:
!mkdir $OUT
!mkdir $TEMP

mkdir: /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3: File exists
mkdir: /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v3: File exists


Clean up the output and temp folders before we start

In [7]:
# !rm $OUT/*.tsv $OUT/*.tsv.gz
# !rm $TEMP/*.tsv $TEMP/*.tsv.gz

Set the `delete_database` parameter to a value other than `"no"` to remove the SQLite database. It takes a long time to load all the data and create indices, so don't remove the database unless you have changed files that were already loaded and you need to force a reload.

In [8]:
if delete_database and delete_database != "no":
    print("Deleted database")
    !rm $STORE
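The cell above prints its message before running `rm`, so the message appears even when the file is missing. A pure-Python variant, sketched here with a hypothetical helper `delete_cache`, reports a deletion only when a file was actually removed. The demo uses a throwaway file in a temporary folder rather than the real cache.

```python
import os
import tempfile


def delete_cache(store_path, delete_database):
    """Remove the cache file when requested; return True only if a file was deleted."""
    if not delete_database or delete_database == "no":
        return False
    if not os.path.exists(store_path):
        return False
    os.remove(store_path)
    return True


with tempfile.TemporaryDirectory() as folder:
    store = os.path.join(folder, "wikidata.sqlite3.db")
    open(store, "w").close()  # stand-in for the real cache database

    kept = delete_cache(store, "no")       # "no" leaves the file alone
    deleted = delete_cache(store, "yes")   # any other non-empty value deletes it
    still_there = os.path.exists(store)
```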

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.

In [128]:
!$kypher -i "$EDGES" --limit 10 | column -t -s $'\t' 

        0.80 real         0.63 user         0.14 sys
id                             node1  label  node2      rank    node2;wikidatatype
Q1-P1036-418bc4-78f5a565-0     Q1     P1036  "113"      normal  external-id
Q1-P1036-b98c08-1dc98be9-0     Q1     P1036  "523.1"    normal  external-id
Q1-P1051-d70eb1-60991f20-0     Q1     P1051  "517"      normal  external-id
Q1-P1245-ee25a9-46be09ed-0     Q1     P1245  "8506"     normal  external-id
Q1-P1256-8da0ce-af30f4e9-0     Q1     P1256  "51A11"    normal  external-id
Q1-P1296-f73b4e-4d0c1e5d-0     Q1     P1296  "0216407"  normal  external-id
Q1-P1343-Q19190511-ab132b87-0  Q1     P1343  Q19190511  normal  wikibase-item
Q1-P1343-Q2041543-4ed8a129-0   Q1     P1343  Q2041543   normal  wikibase-item
Q1-P1343-Q602358-12bf99e2-0    Q1     P1343  Q602358    normal  wikibase-item
Q1-P1343-Q88672152-5080b9e2-0  Q1     P1343  Q88672152  normal  wikibase-item


## Creating a list of all the items we want to remove

### Compute the items to be removed

First look at the classes we will remove

In [6]:
cmd = "wd u {}".format(" ".join(remove_classes.split(",")))
!{cmd}

[90mid[39m Q13442814
[42mLabel[49m scholarly article
[44mDescription[49m article in an academic publication, usually peer reviewed
[30m[47msubclass of[49m[39m [90m(P279)[39m[90m: [39mscholarly publication [90m(Q591041)[39m | article [90m(Q191067)[39m | scholarly work [90m(Q55915575)[39m

[90mid[39m Q523
[42mLabel[49m star
[44mDescription[49m astronomical object consisting of a luminous spheroid of plasma held together by its own gravity
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39m astronomical object type [90m(Q17444909)[39m
[30m[47msubclass of[49m[39m [90m(P279)[39m[90m: [39mastronomical object [90m(Q6999)[39m | fusor [90m(Q1027098)[39m

[90mid[39m Q16521
[42mLabel[49m taxon
[44mDescription[49m group of one or more organism(s), which a taxonomist adjudges to be a unit
[30m[47minstance of[49m[39m [90m(P31)[39m[90m: [39m first-order metaclass [90m(Q24017414)[39m
[30m[47msubclass of[49m[39m [90m(P279)[39m[90m: 

Preview the `isa` file, then compose the kypher command that finds the items belonging to the classes we want to remove

In [8]:
!zcat < $OUT/all.isa.tsv.gz | head | column -t -s $'\t' 

node1    label  node2
Q1       isa    Q36906466
Q100     isa    Q1093829
Q100     isa    Q1549591
Q100     isa    Q21518270
Q1000    isa    Q179023
Q1000    isa    Q3624078
Q1000    isa    Q6256
Q10000   isa    Q10876391
Q100000  isa    Q1852859
zcat: error writing to output: Broken pipe


In [12]:
classes = map(lambda x: '"{}"'.format(x), remove_classes.replace(" ", "").split(","))
remove_command = "$kypher -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz -o $TEMP/items.remove.tsv.gz \
--match 'isa: (n1)-[:isa]->(c), P279star: (c)-[]->(class)' \
--where 'class in [CLASSES]' \
--return 'distinct n1, \"p31_p279star\" as label, class as node2' ".replace("CLASSES", ", ".join(list(classes)))
remove_command

'$kypher -i $OUT/all.isa.tsv.gz -i $OUT/all.P279star.tsv.gz -o $TEMP/items.remove.tsv.gz --match \'isa: (n1)-[:isa]->(c), P279star: (c)-[]->(class)\' --where \'class in ["Q13442814", "Q523", "Q16521", "Q318", "Q7318358", "Q7187", "Q11173", "Q8054"]\' --return \'distinct n1, "p31_p279star" as label, class as node2\' '
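Because the class list is pasted into the query as text, a malformed entry in `remove_classes` would only surface when kypher runs, several minutes later. The sketch below validates the list first. The helpers `parse_qids` and `kypher_in_list` are hypothetical names, and the check assumes class ids have the form `Q` followed by digits.

```python
import re


def parse_qids(text):
    """Split a comma-separated string into item ids, rejecting anything that is not Q<digits>."""
    qids = [part.strip() for part in text.split(",") if part.strip()]
    bad = [qid for qid in qids if not re.fullmatch(r"Q[1-9]\d*", qid)]
    if bad:
        raise ValueError("not Wikidata item ids: {}".format(bad))
    return qids


def kypher_in_list(qids):
    """Render the ids as the quoted, comma-separated body of a kypher `in [...]` list."""
    return ", ".join('"{}"'.format(qid) for qid in qids)


qids = parse_qids("Q13442814, Q523, Q16521")
where_clause = "class in [{}]".format(kypher_in_list(qids))
print(where_clause)
```

The resulting string can be substituted for the `--where` argument in the same way the cell above replaces `CLASSES`.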

Run the command. The items to remove are written to `$TEMP/items.remove.tsv.gz`.

In [13]:
!{remove_command}

      502.25 real       312.36 user        63.43 sys


Preview the file

In [14]:
!zcat < $TEMP/items.remove.tsv.gz | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
node1      label         node2
Q356147    p31_p279star  Q11173
Q5198204   p31_p279star  Q11173
Q7117879   p31_p279star  Q11173
Q221307    p31_p279star  Q11173
Q24883404  p31_p279star  Q11173
Q2645893   p31_p279star  Q11173
Q27277736  p31_p279star  Q11173
Q377339    p31_p279star  Q11173
Q382897    p31_p279star  Q11173


Collect all the classes of items we will remove, just as a sanity check

In [15]:
!$kypher -i $TEMP/items.remove.tsv.gz \
--match '()-[]->(n2)' \
--return 'distinct n2' \
--limit 10

node2
Q11173
Q13442814
Q16521
Q318
Q523
Q7187
Q7318358
Q8054
      172.88 real       245.46 user        22.74 sys
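The comparison between the requested classes and the classes that actually appear in the output can be done mechanically instead of by eye. A minimal sketch follows; the function name `compare_classes` is hypothetical, and the two lists are copied from the `remove_classes` parameter and from the query output above.

```python
def compare_classes(requested, found):
    """Return (requested but never matched, matched but never requested), both sorted."""
    requested, found = set(requested), set(found)
    return sorted(requested - found), sorted(found - requested)


requested = "Q13442814, Q523, Q16521, Q318, Q7318358, Q7187, Q11173, Q8054".replace(" ", "").split(",")
found = ["Q11173", "Q13442814", "Q16521", "Q318", "Q523", "Q7187", "Q7318358", "Q8054"]

missing, unexpected = compare_classes(requested, found)
print("missing:", missing)
print("unexpected:", unexpected)
```

A non-empty `missing` list would mean that a requested class matched no items, which is worth investigating before the removal step.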


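The file needs to be sorted because the `ifnotexists` commands in the next section are run with `--presorted`. The sort itself is the first cell of that section.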

## Create the reduced edges file

### Remove the items from the all.tsv and the label, alias and description files
We will be left with `reduced` files whose edges do not mention the unwanted items. The items have to be removed from both the `node1` and the `node2` positions, so the `ifnotexists` command is run twice on the edges file.

Before we start, sort the list of items to remove and preview the files to check the column headings and whether they look sorted.

In [16]:
!$kgtk sort2 -i $TEMP/items.remove.tsv.gz -o $TEMP/items.remove.sorted.tsv.gz

In [18]:
!zcat < $TEMP/items.remove.sorted.tsv.gz | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
node1     label         node2
Q1000017  p31_p279star  Q16521
Q1000126  p31_p279star  Q16521
Q1000261  p31_p279star  Q16521
Q1000262  p31_p279star  Q16521
Q1000266  p31_p279star  Q16521
Q1000270  p31_p279star  Q16521
Q1000274  p31_p279star  Q16521
Q1000278  p31_p279star  Q16521
Q1000280  p31_p279star  Q16521


In [26]:
!zcat < $OUT/all.sorted.tsv.gz | head -5 | column -t -s $'\t' 

id                          node1  label  node2    rank    node2;wikidatatype
Q1-P1036-418bc4-78f5a565-0  Q1     P1036  "113"    normal  external-id
Q1-P1036-b98c08-1dc98be9-0  Q1     P1036  "523.1"  normal  external-id
Q1-P1051-d70eb1-60991f20-0  Q1     P1051  "517"    normal  external-id
Q1-P1245-ee25a9-46be09ed-0  Q1     P1245  "8506"   normal  external-id
zcat: error writing to output: Broken pipe


Remove from the full set of edges those edges that have a `node1` present in `items.remove.sorted.tsv`

In [29]:
!$kgtk ifnotexists -i $OUT/all.sorted.tsv.gz -o $TEMP/item.edges.reduced.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 

From the remaining edges, remove those that have a `node2` present in `items.remove.sorted.tsv`

In [41]:
!$kgtk sort2 -i $TEMP/item.edges.reduced.tsv.gz -o $TEMP/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id

In [42]:
!$kgtk ifnotexists -i $TEMP/item.edges.reduced.sorted.tsv.gz -o $TEMP/item.edges.reduced.2.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 
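The net effect of the two `ifnotexists` passes is an anti-join on both ends of each edge. The sketch below shows the same logic in memory on a toy edge list; it illustrates the semantics only and is not how kgtk implements the command. The function name `drop_edges_touching` and the sample edges are hypothetical.

```python
def drop_edges_touching(edges, removed_items):
    """Keep only edges whose node1 and node2 are both absent from removed_items."""
    removed = set(removed_items)
    # Pass 1: drop edges whose subject (node1) is a removed item.
    after_node1 = [edge for edge in edges if edge["node1"] not in removed]
    # Pass 2: from what is left, drop edges whose object (node2) is a removed item.
    return [edge for edge in after_node1 if edge["node2"] not in removed]


edges = [
    {"id": "e1", "node1": "Q1", "label": "P31", "node2": "Q2"},
    {"id": "e2", "node1": "Q3", "label": "P31", "node2": "Q1"},
    {"id": "e3", "node1": "Q4", "label": "P361", "node2": "Q5"},
    {"id": "e4", "node1": "Q5", "label": "P31", "node2": "Q3"},
]

kept = drop_edges_touching(edges, ["Q1"])
print([edge["id"] for edge in kept])
```

A set lookup needs no sorting, but it holds every removed item in memory. The notebook instead sorts both files and passes `--presorted`, which is why the edges are re-sorted by `node2` before the second pass.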

Create a file with the labels

In [30]:
!$kgtk ifnotexists -i $OUT/part.label.en.sorted.tsv.gz -o $TEMP/label.edges.reduced.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

Create a file with the aliases

In [31]:
!$kgtk ifnotexists -i $OUT/part.alias.en.sorted.tsv.gz -o $TEMP/alias.edges.reduced.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

Create a file with the descriptions

In [32]:
!$kgtk ifnotexists -i $OUT/part.description.en.sorted.tsv.gz -o $TEMP/description.edges.reduced.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

### Concatenate all the files to produce our edges file

To-do: verify that the sitelinks are still there

In [48]:
!$kgtk cat -o $TEMP/wikidataos.all.tsv.gz \
-i $TEMP/alias.edges.reduced.tsv.gz \
-i $TEMP/description.edges.reduced.tsv.gz \
-i $TEMP/item.edges.reduced.2.tsv.gz \
-i $TEMP/label.edges.reduced.tsv.gz 

## Create the reduced qualifiers file
We do this by collecting the ids of all edges in the reduced edges file and then selecting the matching rows from `qual.tsv`.

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `$QUALS` comes sorted by id/node1/label/node2, so we don't need to do anything to it
- `$TEMP/wikidataos.all.tsv.gz` is unsorted, so sort it by id/node1/label/node2 and write the result to `$OUT/wikidataos.all.tsv.gz`

In [59]:
!zcat < "$QUALS" | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
id                                              node1                          label  node2               node2;wikidatatype
Q1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0  Q1-P1343-Q19190511-ab132b87-0  P805   Q84065667           wikibase-item
Q1-P1343-Q2041543-4ed8a129-0-P805-Q23856221-0   Q1-P1343-Q2041543-4ed8a129-0   P805   Q23856221           wikibase-item
Q1-P1343-Q602358-12bf99e2-0-P805-Q24373557-0    Q1-P1343-Q602358-12bf99e2-0    P805   Q24373557           wikibase-item
Q1-P1343-Q88672152-5080b9e2-0-P304-5724c3-0     Q1-P1343-Q88672152-5080b9e2-0  P304   "13-36"             string
Q1-P1419-70a524-1b5a620e-0-P805-Q1647152-0      Q1-P1419-70a524-1b5a620e-0     P805   Q1647152            wikibase-item
Q1-P1419-Q5457948-c405a033-0-P3680-70a524-0     Q1-P1419-Q5457948-c405a033-0   P3680  somevalue           wikibase-item
Q1-P227-50e4b0-119ba012-0-P1932-d6e10c-0        Q1-P227-50e4b0-119ba012-0      P1932  "Weltall"           string
Q1-P22

In [49]:
!$kgtk sort2 -i $TEMP/wikidataos.all.tsv.gz -o $OUT/wikidataos.all.tsv.gz \
--columns id node1 label node2 

Run `ifexists` to select the qualifiers for the edges in `$OUT/wikidataos.all.tsv.gz` and write them to `$OUT/wikidataos.qual.tsv.gz`. Note that we use `node1` in the qualifier file, matching it to `id` in the `wikidataos.all.tsv` file.

In [60]:
!$kgtk ifexists -i "$QUALS" -o $OUT/wikidataos.qual.tsv.gz \
--filter-on $OUT/wikidataos.all.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted
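The key pairing in this step, `node1` of a qualifier against `id` of an edge, can be illustrated with an in-memory semi-join. This is a sketch of the semantics only, with a hypothetical function name `keep_qualifiers`. The ids are taken from the qualifier preview above, and the toy edge list assumes that only the first edge survived the reduction.

```python
def keep_qualifiers(qualifiers, edges):
    """Keep qualifiers whose node1 is the id of an edge that is still present."""
    edge_ids = {edge["id"] for edge in edges}
    return [qualifier for qualifier in qualifiers if qualifier["node1"] in edge_ids]


edges = [
    {"id": "Q1-P1343-Q19190511-ab132b87-0", "node1": "Q1", "label": "P1343", "node2": "Q19190511"},
]
qualifiers = [
    {
        "id": "Q1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0",
        "node1": "Q1-P1343-Q19190511-ab132b87-0",
        "label": "P805",
        "node2": "Q84065667",
    },
    {
        "id": "Q1-P1419-70a524-1b5a620e-0-P805-Q1647152-0",
        "node1": "Q1-P1419-70a524-1b5a620e-0",
        "label": "P805",
        "node2": "Q1647152",
    },
]

kept_qualifiers = keep_qualifiers(qualifiers, edges)
print([qualifier["id"] for qualifier in kept_qualifiers])
```

The second qualifier is dropped because the edge it annotates is not in the toy edge list.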

## Summary of results

In [61]:
!ls -lh $OUT/wikidataos.*

-rw-r--r--  1 pedroszekely  staff   5.5G Nov 10 11:07 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3/wikidataos.all.tsv.gz
-rw-r--r--  1 pedroszekely  staff   660M Nov 10 14:08 /Users/pedroszekely/Downloads/kypher/useful_wikidata_files_v3/wikidataos.qual.tsv.gz


In [54]:
!zcat < $OUT/wikidataos.all.tsv.gz | wc

 310387222 1938057258 24398709041


In [56]:
!zcat < $OUT/wikidataos.all.tsv.gz | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
id                       node1  label        node2                                                                                                                                     rank  node2;wikidatatype
P10-alias-en-282226-0    P10    alias        'gif'@en
P10-alias-en-2f86d8-0    P10    alias        'animation'@en
P10-alias-en-c1427e-0    P10    alias        'media'@en
P10-alias-en-c61ab1-0    P10    alias        'trailer (Commons)'@en
P10-description-en       P10    description  'relevant video. For images, use the property P18. For film trailers, qualify with \"object has role\" (P3831)=\"trailer\" (Q622550)'@en
P10-label-en             P10    label        'video'@en
P1000-description-en     P1000  description  'notable record achieved by a person or entity, include qualifiers for dates held'@en
P1000-label-en           P1000  label        'record held'@en
P1001-alias-en-0dd7ce-0  P1001  alias        'belongs to jurisdiction'@en


## Verification

The edges file must contain edges for properties. As of 2020-11-10 this is not the case: the query below returns no edges with `P10` as `node1`.


In [64]:
!$kypher -i "$EDGES" \
--match '(:P10)-[l]->(n2)' \
--limit 10

id	node1	label	node2	rank	node2;wikidatatype
        0.72 real         0.57 user         0.13 sys
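The same check can be written without kypher by streaming the gzipped TSV. This is a minimal sketch: the function name `count_edges_for_node1` is hypothetical, and the demo runs on a tiny sample file written to a temporary folder, using one row from the edge preview earlier in the notebook. A scan of the full `all.tsv.gz` would read the file sequentially and could take a long time.

```python
import csv
import gzip
import os
import tempfile


def count_edges_for_node1(path, node1, limit=10):
    """Count edges in a gzipped KGTK TSV whose node1 equals the given id, stopping at limit."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps KGTK string literals such as "113" exactly as written.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["node1"] == node1:
                count += 1
                if count >= limit:
                    break
    return count


with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.tsv.gz")
    with gzip.open(sample, "wt", encoding="utf-8", newline="") as handle:
        handle.write("id\tnode1\tlabel\tnode2\n")
        handle.write('Q1-P1036-418bc4-78f5a565-0\tQ1\tP1036\t"113"\n')

    q1_edges = count_edges_for_node1(sample, "Q1")
    p10_edges = count_edges_for_node1(sample, "P10")

print(q1_edges, p10_edges)
```

A count of zero for a property id reproduces the situation shown by the kypher query above.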
