# Generating Subsets of Wikidata

### Batch Invocation
Example batch command. The second argument is the notebook where the output will be stored; you can open it to see progress.

TODO: update the example invocation.


```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p wiki_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz \
-p qual_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no 
```

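The same kind of command line can be assembled from Python. This is a minimal sketch, not part of the original workflow: `papermill_command` is a hypothetical helper, and it only builds the command string, quoting each argument with `shlex` so that names with spaces need no manual backslashes.

```python
import shlex


def papermill_command(notebook, output_notebook, parameters):
    """Build a papermill command line; every argument is shell-quoted."""
    args = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        args += ["-p", name, str(value)]
    # shlex.join requires Python 3.8 or later
    return shlex.join(args)


cmd = papermill_command(
    "Wikidata Useful Files.ipynb",
    "useful-files.out.ipynb",
    {"output_folder": "useful_files_v4", "delete_database": "no"},
)
print(cmd)
```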
In [38]:
# Parameters

# Folder on local machine where to create the output and temporary folders
output_path = "/Users/pedroszekely/Downloads/kypher"

# The names of the output and temporary folders
output_folder = "wikidata_os_v5"
temp_folder = "temp.wikidata_os_v5"

# Classes to remove
remove_classes = "Q13442814, Q523, Q16521, Q318, Q7318358, Q7187, Q11173, Q8054"

# The location of input files
# The backslash escapes the space for the shell, so pass these paths to shell commands unquoted
wiki_root_folder = "/Volumes/GoogleDrive/Shared\\ drives/KGTK/datasets/wikidata-20200803-v4/"
claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"
qual_file = "qualifiers.tsv.gz"
property_datatypes_file = "metadata.property.datatypes.tsv.gz"
isa_file = "derived.isa.tsv.gz"
p279star_file = "derived.P279star.tsv.gz"

# Location of the cache database for kypher
cache_path = "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4"

# Whether to delete the cache database
delete_database = False

# shortcuts to commands
kgtk = "time kgtk --debug"
# kgtk = "kgtk --debug"

# Useful files Jupyter notebook
useful_files_notebook = "Wikidata Useful Files.ipynb"
notebooks_folder = "/Users/pedroszekely/Documents/GitHub/kgtk/examples/"

In [39]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt

import papermill as pm

## Set up variables for files

In [40]:
# Location of the kypher cache database: use cache_path if given, else the temp folder
if cache_path:
    store = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    store = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

os.environ['STORE'] = store

out = "{}/{}".format(output_path, output_folder)
temp = "{}/{}".format(output_path, temp_folder)

kypher = "kgtk query --debug --graph-cache " + store

claims = wiki_root_folder + claims_file
items = wiki_root_folder + item_file
isa = wiki_root_folder + isa_file
quals = wiki_root_folder + qual_file
datatypes = wiki_root_folder + property_datatypes_file
p279star = wiki_root_folder + p279star_file

labels = wiki_root_folder + label_file
aliases = wiki_root_folder + alias_file
descriptions = wiki_root_folder + description_file

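The input folder contains a space (`Shared drives`), which is easy to get wrong when a path is passed to the shell. As an alternative to embedding a backslash in the Python string, the sketch below assumes the folder is stored without any escape and is quoted only when handed to the shell. The names `plain_root` and `shell_path` are hypothetical and do not appear elsewhere in this notebook.

```python
import shlex

# Assumption: the folder is stored as a plain string, without a backslash escape.
plain_root = "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/"


def shell_path(folder, filename):
    """Join folder and file name, then quote the result for use in a shell command."""
    return shlex.quote(folder + filename)


claims_arg = shell_path(plain_root, "claims.tsv.gz")
print(claims_arg)
```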
Go to the output directory and create the subfolders for the output files and the temporary files

In [4]:
%cd {output_path}


In [5]:
!mkdir -p {out}
!mkdir -p {temp}


Clean up the output and temp folders before we start

In [6]:
# !rm {out}/*.tsv {out}/*.tsv.gz
# !rm {temp}/*.tsv {temp}/*.tsv.gz

In [7]:
if delete_database:
    !rm {out}/*.tsv {out}/*.tsv.gz
    !rm {temp}/*.tsv {temp}/*.tsv.gz

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect

In [8]:
!gzcat {claims} | head

id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item
P10-P1855-Q4504-a69d2c73-0	P10	P1855	Q4504	normal	wikibase-item
gzcat: error writing to output: Broken pipe
gzcat: /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/claims.tsv.gz: uncompress failed

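The header check can also be done from Python, which reads only the first line and so avoids the broken-pipe noise from `head`. This is a sketch on a tiny stand-in file; `read_header` and `missing_columns` are hypothetical helpers, and the expected column list is taken from the preview output.

```python
import gzip
import os
import tempfile


def read_header(path):
    """Return the column names from the first line of a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.readline().rstrip("\n").split("\t")


def missing_columns(path, expected):
    """Return the expected column names that are absent from the file header."""
    header = read_header(path)
    return [c for c in expected if c not in header]


# Demo on a small stand-in file rather than the real claims file
demo_path = os.path.join(tempfile.mkdtemp(), "claims.demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n")
    f.write("P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\tnormal\twikibase-item\n")

header = read_header(demo_path)
missing = missing_columns(demo_path, ["id", "node1", "label", "node2"])
print(header, missing)
```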

In [43]:
!{kypher} -i {claims} \
--match '()-[]-()' \
--limit 10

[2020-11-19 16:34:30 sqlstore]: IMPORT graph directly into table graph_18 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/claims.tsv.gz ...
Exception in thread background thread for pid 34908:
Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 1662, in wrap
    fn(*args, **kwargs)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 2606, in background_thread
    handle_exit_code(exit_code)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 2304, in fn
    return self.command.han

## Creating a list of all the items we want to remove

### Compute the items to be removed

First look at the classes we will remove

In [10]:
cmd = "wd u {}".format(" ".join(remove_classes.split(",")))
!{cmd}

id Q13442814
Label scholarly article
Description article in an academic publication, usually peer reviewed
subclass of (P279): scholarly publication (Q591041) | article (Q191067) | scholarly work (Q55915575)

id Q523
Label star
Description astronomical object consisting of a luminous spheroid of plasma held together by its own gravity
instance of (P31): astronomical object type (Q17444909)
subclass of (P279): astronomical object (Q6999) | fusor (Q1027098)

id Q16521
Label taxon
Description group of one or more organism(s), which a taxonomist adjudges to be a unit
instance of (P31): first-order metaclass (Q24017414)
subclass of (P279): 

Preview the `isa` file before composing the kypher command that finds the items to remove

In [11]:
!zcat < {isa} | head | col

zcat: failed to read stdin: Input/output error


Run the command; the items to remove will be written to `{temp}/items.remove.tsv.gz`

In [12]:
classes = ", ".join(list(map(lambda x: '"{}"'.format(x), remove_classes.replace(" ", "").split(","))))
!{kypher}  -i {isa} -i {p279star} -o {temp}/items.remove.tsv.gz \
--match 'isa: (n1)-[:isa]->(c), P279star: (c)-[]->(class)' \
--where 'class in [{classes}]' \
--return 'distinct n1, "p31_p279star" as label, class as node2'


[2020-11-19 09:32:42 sqlstore]: IMPORT graph directly into table graph_15 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/derived.isa.tsv.gz ...
Exception in thread background thread for pid 30155:
Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 1662, in wrap
    fn(*args, **kwargs)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 2606, in background_thread
    handle_exit_code(exit_code)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/sh-1.13.1-py3.7.egg/sh.py", line 2304, in fn
    return self.comman

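To make the logic of the query concrete, here is a pure-Python sketch of the same join on toy data. It assumes that the `P279star` file contains a reflexive edge for every class, so that direct instances of a removed class are matched too. The item and class names other than `Q13442814` and `Q523` are made up, and `quote_classes` and `items_to_remove` are hypothetical helpers.

```python
def quote_classes(remove_classes):
    """Turn 'Q1, Q2' into '"Q1", "Q2"' for use in a kypher --where clause."""
    qnodes = [q.strip() for q in remove_classes.split(",") if q.strip()]
    return ", ".join('"{}"'.format(q) for q in qnodes)


def items_to_remove(isa_edges, p279star_edges, remove_classes):
    """Return (item, label, class) rows for items whose class reaches a removed class."""
    superclasses = {}
    for cls, ancestor in p279star_edges:
        superclasses.setdefault(cls, set()).add(ancestor)
    rows = set()
    for item, cls in isa_edges:
        for ancestor in superclasses.get(cls, ()):
            if ancestor in remove_classes:
                rows.add((item, "p31_p279star", ancestor))
    return sorted(rows)


# Toy data: item1 is an instance of a subclass of Q13442814, item2 a direct instance of Q523
isa_edges = [("item1", "classA"), ("item2", "Q523"), ("item3", "classB")]
p279star_edges = [
    ("classA", "classA"), ("classA", "Q13442814"),
    ("Q523", "Q523"), ("classB", "classB"),
]
removed = items_to_remove(isa_edges, p279star_edges, {"Q13442814", "Q523"})
print(quote_classes("Q13442814, Q523"))
print(removed)
```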
Preview the file

In [13]:
!zcat < {temp}/items.remove.tsv.gz | head | col

Collect all the classes of items we will remove, just as a sanity check

In [14]:
!{kypher} -i {temp}/items.remove.tsv.gz \
--match '()-[]->(n2)' \
--return 'distinct n2' \
--limit 10

Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
    index=options.get('index'))
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 182, in __init__
    store.add_graph(file)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 630, in import_graph_data_via_import
    if header.endswith('\r\n'):
TypeError: endswith first arg must be bytes or a tuple of bytes, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/e

## Create the reduced edges file

### Remove the items from the claims file and the label, alias and description files
We will be left with `reduced` files whose edges do not mention the unwanted items. An unwanted item can appear in either the `node1` or the `node2` position, so we need to run the `ifnotexists` command twice on the claims: once per position.

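The two-pass removal can be sketched in plain Python on a few toy edges. This is only an illustration of the idea: `drop_edges` is a hypothetical helper, the edge ids and node names are made up, and an in-memory set needs no sorting, whereas the `--presorted` option used in the commands that follow relies on sorted input files.

```python
def drop_edges(edges, remove_items, key):
    """Keep only the edges whose value in column `key` is not an item to remove."""
    return [e for e in edges if e[key] not in remove_items]


edges = [
    {"id": "e1", "node1": "keep1", "label": "P31", "node2": "keep2"},
    {"id": "e2", "node1": "drop1", "label": "P31", "node2": "keep2"},
    {"id": "e3", "node1": "keep1", "label": "P921", "node2": "drop1"},
]
remove_items = {"drop1"}

# First pass filters on node1, second pass filters the survivors on node2
pass1 = drop_edges(edges, remove_items, "node1")
pass2 = drop_edges(pass1, remove_items, "node2")
kept_ids = [e["id"] for e in pass2]
print(kept_ids)
```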
Before we start, preview the files to see the column headings and check whether they look sorted.

In [15]:
!$kgtk sort2 -i {temp}/items.remove.tsv.gz -o {temp}/items.remove.sorted.tsv.gz

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: sort2


In [16]:
!zcat < {temp}/items.remove.sorted.tsv.gz | head | col

/bin/bash: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/items.remove.sorted.tsv.gz: No such file or directory


In [17]:
!zcat < {claims} | head -5 | col


Remove from the full set of edges those edges that have a `node1` present in `items.remove.sorted.tsv.gz`

In [18]:
!$kgtk ifnotexists -i {claims} -o {temp}/item.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: ifnotexists --filter-on /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/items.remove.sorted.tsv.gz --input-keys node1 --filter-keys node1 --presorted


From the remaining edges, remove those that have a `node2` present in `items.remove.sorted.tsv.gz`

In [19]:
!$kgtk sort2 -i {temp}/item.edges.reduced.tsv.gz -o {temp}/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: sort2 --columns node2 label node1 id


In [20]:
!$kgtk ifnotexists -i {temp}/item.edges.reduced.sorted.tsv.gz -o {temp}/item.edges.reduced.2.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: ifnotexists --filter-on /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/items.remove.sorted.tsv.gz --input-keys node2 --filter-keys node1 --presorted


Create a file with the labels

In [21]:
!$kgtk ifnotexists -i {labels} -o {temp}/label.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: ifnotexists --filter-on /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/items.remove.sorted.tsv.gz --input-keys node1 --filter-keys node1 --presorted


Create a file with the aliases

In [22]:
!$kgtk ifnotexists -i {aliases} -o {temp}/alias.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: ifnotexists --filter-on /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/items.remove.sorted.tsv.gz --input-keys node1 --filter-keys node1 --presorted


Create a file with the descriptions

In [23]:
!$kgtk ifnotexists -i {descriptions} -o {temp}/description.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: ifnotexists --filter-on /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/items.remove.sorted.tsv.gz --input-keys node1 --filter-keys node1 --presorted


### Produce the output files for claims, labels, aliases and descriptions

In [24]:
!$kgtk sort2 -i {temp}/item.edges.reduced.2.tsv.gz -o {out}/claims.tsv.gz 

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: sort2


In [25]:
!$kgtk sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.en.tsv.gz 

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: sort2


In [26]:
!$kgtk sort2 -i {temp}/alias.edges.reduced.tsv.gz -o {out}/aliases.en.tsv.gz 

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: sort2


In [27]:
!$kgtk sort2 -i {temp}/description.edges.reduced.tsv.gz -o {out}/descriptions.en.tsv.gz 

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: sort2


Sanity checks to see if it looks reasonable

## Create the reduced qualifiers file
We do this by collecting the ids of all edges in the reduced claims file and then selecting the matching rows from the qualifiers file (`{quals}`).

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `{quals}` 
- `{out}/claims.tsv.gz` 

In [28]:
!zcat < {quals} | head | column -t -s $'\t' 


Run `ifexists` to select the qualifiers of the edges in `{out}/claims.tsv.gz` and write them to `{out}/qualifiers.tsv.gz`. Note that `node1` in the qualifiers file is matched against `id` in the claims file.

In [29]:
!$kgtk ifexists -i {quals} -o {out}/qualifiers.tsv.gz \
--filter-on {out}/claims.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted

usage: kgtk [options] command [ / command]*
kgtk: error: unrecognized arguments: ifexists --filter-on /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/claims.tsv.gz --input-keys node1 --filter-keys id --presorted

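The selection performed by `ifexists` amounts to a semi-join: keep a qualifier only if its `node1` is the `id` of a surviving claim. A minimal sketch on toy rows follows; `select_qualifiers` is a hypothetical helper and the ids are made up.

```python
def select_qualifiers(qualifiers, claims):
    """Keep the qualifier edges whose node1 is the id of an edge in claims."""
    claim_ids = {c["id"] for c in claims}
    return [q for q in qualifiers if q["node1"] in claim_ids]


claims = [
    {"id": "c1", "node1": "keep1", "label": "P31", "node2": "keep2"},
    {"id": "c2", "node1": "keep2", "label": "P279", "node2": "keep3"},
]
qualifiers = [
    {"id": "q1", "node1": "c1", "label": "P585", "node2": "value1"},
    {"id": "q2", "node1": "c9", "label": "P585", "node2": "value2"},  # claim c9 was removed
]

selected_ids = [q["id"] for q in select_qualifiers(qualifiers, claims)]
print(selected_ids)
```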

Look at the final output for qualifiers

In [30]:
!zcat < {out}/qualifiers.tsv.gz | head | col

/bin/bash: /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/qualifiers.tsv.gz: No such file or directory


## Sanity checks

In [31]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(n1:Q368441)-[l]->(n2)' \
--limit 10 \
| col

[2020-11-19 09:33:14 sqlstore]: IMPORT graph via csv.reader into table graph_16 from /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/claims.tsv.gz ...
Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 613, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
    index=options.get('index'))
  File "/Users/pedros

In [32]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(n1:P131)-[l]->(n2)' \
--limit 10 \
| col

[2020-11-19 09:33:15 sqlstore]: IMPORT graph via csv.reader into table graph_16 from /Users/pedroszekely/Downloads/kypher/wikidata_os_v5/claims.tsv.gz ...
Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 565, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 613, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
    index=options.get('index'))
  File "/Users/pedros

## Compute the derived files using the `Wikidata Useful Files` Jupyter notebook

Compute `claims.wikibase-item.tsv.gz`. The Wikidata partitioner would normally produce this file, but we are not using the partitioner here yet.

In [33]:
!zcat < {datatypes} | head | col


In [34]:
!{kypher} -i {out}/claims.tsv.gz -i {datatypes} -o {out}/claims.wikibase-item.tsv.gz \
--match 'claims: (n1)-[l {{label: p}}]->(n2), datatypes: (p)-[:datatype]->(:`wikibase-item`)' \
--return 'l as id, n1 as node1, p as label, n2 as node2' \
--order-by 'l' 

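The query keeps only the claims whose property has the datatype `wikibase-item` and orders them by id. The same selection is sketched below in plain Python; `wikibase_item_claims` is a hypothetical helper, and the sample rows and datatypes are copied from the claims preview earlier in this notebook.

```python
def wikibase_item_claims(claims, datatypes):
    """Select claims whose property has datatype wikibase-item, ordered by edge id."""
    item_props = {p for p, dt in datatypes.items() if dt == "wikibase-item"}
    selected = [c for c in claims if c["label"] in item_props]
    return sorted(selected, key=lambda c: c["id"])


datatypes = {"P1628": "url", "P1629": "wikibase-item", "P1855": "wikibase-item"}
claims = [
    {"id": "P10-P1855-Q4504-a69d2c73-0", "node1": "P10", "label": "P1855", "node2": "Q4504"},
    {"id": "P10-P1628-acf60d-b8950832-0", "node1": "P10", "label": "P1628",
     "node2": '"https://schema.org/video"'},
    {"id": "P10-P1629-Q34508-bcc39400-0", "node1": "P10", "label": "P1629", "node2": "Q34508"},
]

item_claim_ids = [c["id"] for c in wikibase_item_claims(claims, datatypes)]
print(item_claim_ids)
```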

To compute the derived files we use papermill to run the `Wikidata Useful Files` notebook.

In [35]:
pm.execute_notebook(
    notebooks_folder + useful_files_notebook,
    temp + "/useful_files_notebook_output.ipynb",
    parameters=dict(
        output_path="/Users/pedroszekely/Downloads/kypher",
        output_folder="wikidata_os_v1",
        temp_folder="temp.wikidata_os_v1",
        wiki_root_folder="/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/",
        claims_file="claims.tsv.gz",
        label_file="labels.en.tsv.gz",
        alias_file="aliases.en.tsv.gz",
        description_file="descriptions.en.tsv.gz",
        item_file="claims.wikibase-item.tsv.gz",
        cache_path="/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4",
        delete_database=False,
        compute_pagerank=False
    )
)

Look at the columns so we know how to construct the kypher query

## Summary of results

In [None]:
!ls -lh {out}/*.tsv.gz

In [None]:
!zcat < {out}/claims.tsv.gz | wc

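Edge counts can also be collected from Python, for example to compare the input and output files in one table. This is a sketch on a small stand-in file; `count_edges` is a hypothetical helper that skips the header line, so its result is one less than the line count reported by `wc`.

```python
import gzip
import os
import tempfile


def count_edges(path):
    """Count the data rows (excluding the header) in a gzipped KGTK TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        next(f, None)  # skip the header line
        return sum(1 for _ in f)


# Demo on a small stand-in file rather than the real output
demo_file = os.path.join(tempfile.mkdtemp(), "claims.demo.tsv.gz")
with gzip.open(demo_file, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\n")
    f.write("e1\tkeep1\tP31\tkeep2\n")
    f.write("e2\tkeep2\tP279\tkeep3\n")

edge_count = count_edges(demo_file)
print(edge_count)
```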
## Verification

The edges file must contain edges for properties; this was not the case on 2020-11-10.


In [None]:
!{kypher} -i {claims} \
--match '(:P10)-[l]->(n2)' \
--limit 10