# Generating Subsets of Wikidata

### Batch Invocation
Example batch command. The second argument is the output notebook: papermill stores the executed cells and their output there, so you can open it to check progress.

UPDATE EXAMPLE INVOCATION


```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p wiki_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz \
-p qual_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p compute_pagerank no \
-p languages es,ru,zh-cn 
```
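The same kind of invocation can be assembled from Python, which avoids hand-escaping the spaces in the paths. This is a minimal sketch: `build_papermill_command` is a hypothetical helper (it is not part of papermill or KGTK), the parameter values are placeholders, and it relies only on the `-p name value` form used in the command above.

```python
import shlex


def build_papermill_command(notebook, output_notebook, parameters):
    """Return the argv list for a papermill run (hypothetical helper)."""
    argv = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        # Each notebook parameter becomes one "-p name value" triple.
        argv += ["-p", name, str(value)]
    return argv


argv = build_papermill_command(
    "Wikidata Useful Files.ipynb",
    "useful-files.out.ipynb",
    {
        "output_folder": "useful_files_v4",
        "compute_pagerank": "no",
        "languages": "es,ru,zh-cn",
    },
)

# shlex.join quotes arguments that contain spaces, so the printed line
# can be pasted into a shell as-is.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would run the notebook without going through a shell at all.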

In [1]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import papermill as pm

from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

In [2]:
input_path = "/data/amandeep/wikidata-20220505/import-wikidata/data"
output_path = "/data/amandeep"
kgtk_path = "/data/amandeep/Github/kgtk"

graph_cache_path = None

project_name = "wikidata-20220505-dwd-v4"

files = 'isa,p279star'

# Classes to remove
remove_classes = "Q7318358,Q13442814"

useful_files_notebook = "Wikidata-Useful-Files.ipynb"
notebooks_folder = f"{kgtk_path}/use-cases"

languages = "en,ru,es,zh-cn,de,it,nl,pl,fr,pt,sv"
debug = False

In [3]:
files = files.split(',')
languages = languages.split(',')

In [4]:
ck = ConfigureKGTK(files, kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                 graph_cache_path=graph_cache_path)

User home: /nas/home/amandeep
Current dir: /data/amandeep/Github/kgtk/use-cases
KGTK dir: /data/amandeep/Github/kgtk
Use-cases dir: /data/amandeep/Github/kgtk/use-cases


In [5]:
ck.print_env_variables()

kypher: kgtk query --graph-cache /data/amandeep/wikidata-20220505-dwd-v4/temp.wikidata-20220505-dwd-v4/wikidata.sqlite3.db
GRAPH: /data/amandeep/wikidata-20220505/import-wikidata/data
KGTK_GRAPH_CACHE: /data/amandeep/wikidata-20220505-dwd-v4/temp.wikidata-20220505-dwd-v4/wikidata.sqlite3.db
STORE: /data/amandeep/wikidata-20220505-dwd-v4/temp.wikidata-20220505-dwd-v4/wikidata.sqlite3.db
OUT: /data/amandeep/wikidata-20220505-dwd-v4
kgtk: kgtk
TEMP: /data/amandeep/wikidata-20220505-dwd-v4/temp.wikidata-20220505-dwd-v4
USE_CASES_DIR: /data/amandeep/Github/kgtk/use-cases
KGTK_OPTION_DEBUG: false
EXAMPLES_DIR: /data/amandeep/Github/kgtk/examples
KGTK_LABEL_FILE: /data/amandeep/wikidata-20220505/import-wikidata/data/labels.en.tsv.gz
claims: /data/amandeep/wikidata-20220505/import-wikidata/data/claims.tsv.gz
label_all: /data/amandeep/wikidata-20220505/import-wikidata/data/labels.tsv.gz
alias_all: /data/amandeep/wikidata-20220505/import-wikidata/data/aliases.tsv.gz
description_all: /data/amande

In [6]:
ck.load_files_into_cache()

kgtk query --graph-cache /data/amandeep/wikidata-20220505-dwd-v4/temp.wikidata-20220505-dwd-v4/wikidata.sqlite3.db -i "/data/amandeep/wikidata-20220505/import-wikidata/data/claims.tsv.gz" --as claims  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/labels.tsv.gz" --as label_all  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/aliases.tsv.gz" --as alias_all  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/descriptions.tsv.gz" --as description_all  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/claims.wikibase-item.tsv.gz" --as item  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/qualifiers.tsv.gz" --as qualifiers  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/metadata.property.datatypes.tsv.gz" --as datatypes  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/metadata.types.tsv.gz" --as types  -i "/data/amandeep/wikidata-20220505/import-wikidata/data/derived.isa.tsv.gz" --as isa  -i "/data/amandeep/wikidata-20220505

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.

In [8]:
!zcat $claims | head

id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1630-53947a-fbe9093e-0	P10	P1630	"https://commons.wikimedia.org/wiki/File:$1"	normal	string
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item

gzip: stdout: Broken pipe


## Creating a list of all the items we want to remove

### Compute the items to be removed

Compose the kypher command to remove the classes

In [10]:
!zcat $isa | head | col

node1	label	node2
P10	isa	Q18610173
P10	isa	Q19847637
P1000	isa	Q18608871
P10000	isa	Q19833377
P10000	isa	Q89560413
P10001	isa	Q107738007
P10001	isa	Q64221137
P10002	isa	Q93433126
P10003	isa	Q108914651

gzip: stdout: Broken pipe


Run the command. The items to remove will be written to `$TEMP/items.remove.tsv.gz`.

In [11]:
classes = ", ".join(list(map(lambda x: '"{}"'.format(x), remove_classes.replace(" ", "").split(","))))

classes

'"Q7318358", "Q13442814"'
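The one-liner above can be hard to read. A possible alternative is sketched below; `quote_class_list` is a hypothetical name, and unlike the original expression it also skips empty entries such as those left by a trailing comma.

```python
def quote_class_list(remove_classes):
    """Turn 'Q1,Q2' into '"Q1", "Q2"' for use inside a kypher list literal."""
    # Strip whitespace around each id and drop empty entries.
    qnodes = [q.strip() for q in remove_classes.split(",") if q.strip()]
    return ", ".join(f'"{q}"' for q in qnodes)


classes = quote_class_list("Q7318358, Q13442814")
print(classes)
```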

In [13]:
kypher(f"""  -i isa -i p279star -o "$TEMP"/items.remove.tsv.gz 
            --match 'isa: (n1)-[:isa]->(c), p279star: (c)-[]->(class)' 
            --where 'class in [{classes}]' 
            --return 'distinct n1, "p31_p279star" as label, class as node2' 
            --order-by 'n1'
            """)

Preview the file

In [14]:
!zcat < "$TEMP"/items.remove.tsv.gz | head | col

node1	label	node2
Q100000005	p31_p279star	Q13442814
Q100000009	p31_p279star	Q13442814
Q100000015	p31_p279star	Q13442814
Q100000022	p31_p279star	Q13442814
Q100000031	p31_p279star	Q13442814
Q100000044	p31_p279star	Q13442814
Q100000056	p31_p279star	Q13442814
Q100000066	p31_p279star	Q13442814
Q100000074	p31_p279star	Q13442814

gzip: stdout: Broken pipe
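The rows previewed above come from joining each item's `isa` class with that class's `p279star` ancestors and keeping the rows whose ancestor is one of the classes to remove. A toy in-memory version of that join is sketched here purely as an illustration: the node names other than the two class ids are made up, and the self-edges in `p279star` are an assumption of this sketch about how the closure file is built.

```python
# Toy (node1, node2) pairs; labels are implied by the variable names.
isa = [
    ("item_a", "Q13442814"),
    ("item_b", "class_x"),
    ("item_c", "class_y"),
]
p279star = [
    ("Q13442814", "Q13442814"),  # self-edge: assumption of this sketch
    ("class_x", "class_x"),
    ("class_x", "Q7318358"),     # class_x is a subclass of a removed class
    ("class_y", "class_y"),
    ("class_y", "class_z"),
]
remove_classes = {"Q7318358", "Q13442814"}


def items_to_remove(isa, p279star, remove_classes):
    """Return sorted, distinct (item, 'p31_p279star', class) rows."""
    # Index the closure: class -> set of its ancestors.
    ancestors = {}
    for cls, ancestor in p279star:
        ancestors.setdefault(cls, set()).add(ancestor)
    rows = {
        (item, "p31_p279star", ancestor)
        for item, cls in isa
        for ancestor in ancestors.get(cls, ())
        if ancestor in remove_classes
    }
    return sorted(rows)


for row in items_to_remove(isa, p279star, remove_classes):
    print("\t".join(row))
```

Because the result is distinct on the whole row, an item that falls under both removed classes would appear twice, once per class.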


In [15]:
!zcat < "$TEMP"/items.remove.tsv.gz | wc

39873936 119621808 1314915334


Collect all the classes of items we will remove, just as a sanity check

In [16]:
!$kypher -i "$TEMP"/items.remove.tsv.gz \
--match '()-[]->(n2)' \
--return 'distinct n2' \
--limit 10

node2
Q13442814
Q7318358


## Create the reduced edges file

### Remove the items from the claims, label, alias and description files
We will be left with `reduced` files whose edges do not mention the unwanted items. For the claims, the items have to be removed from both the `node1` and the `node2` positions, so we need to run the `ifnotexists` command twice.

Before we start, preview the files to see the column headings and check whether they look sorted.

In [17]:
!zcat "$TEMP"/items.remove.tsv.gz | head | col

node1	label	node2
Q100000005	p31_p279star	Q13442814
Q100000009	p31_p279star	Q13442814
Q100000015	p31_p279star	Q13442814
Q100000022	p31_p279star	Q13442814
Q100000031	p31_p279star	Q13442814
Q100000044	p31_p279star	Q13442814
Q100000056	p31_p279star	Q13442814
Q100000066	p31_p279star	Q13442814
Q100000074	p31_p279star	Q13442814

gzip: stdout: Broken pipe


Remove from the full set of edges those edges that have a `node1` present in `items.remove.tsv`

In [18]:
kgtk("""ifnotexists 
        -i $claims 
        -o "$TEMP"/item.edges.reduced.tsv.gz
        --filter-on "$TEMP"/items.remove.tsv.gz
        --input-keys node1
        --filter-keys node1
        --presorted
    """)

From the remaining edges, remove those that have a `node2` present in `items.remove.tsv`

In [19]:
kgtk(f"""sort 
        -i "$TEMP"/item.edges.reduced.tsv.gz 
        -o "$TEMP"/item.edges.reduced.sorted.tsv.gz
        --extra '--parallel 24 --buffer-size 30% --temporary-directory {os.environ['TEMP']}'
        --columns node2 label node1 id""")

In [20]:
kgtk("""ifnotexists 
        -i $TEMP/item.edges.reduced.sorted.tsv.gz 
        -o $TEMP/item.edges.reduced.2.tsv.gz
        --filter-on $TEMP/items.remove.tsv.gz
        --input-keys node2
        --filter-keys node1
        --presorted""")
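The two `ifnotexists` passes amount to an anti-join applied first on `node1` and then on `node2`. The sketch below shows the intended effect on a handful of made-up edges; it uses an in-memory set rather than the sorted-file approach the KGTK commands take, and `drop_edges_touching` is a hypothetical name.

```python
def drop_edges_touching(edges, removed_items):
    """Keep only edges whose node1 and node2 are both outside removed_items."""
    removed = set(removed_items)
    # Pass 1: drop edges whose subject was removed.
    kept = [e for e in edges if e["node1"] not in removed]
    # Pass 2: drop edges whose object was removed.
    kept = [e for e in kept if e["node2"] not in removed]
    return kept


# Made-up edges for illustration only.
edges = [
    {"id": "e1", "node1": "paper_1", "label": "P50", "node2": "author_1"},
    {"id": "e2", "node1": "author_1", "label": "P800", "node2": "paper_1"},
    {"id": "e3", "node1": "author_1", "label": "P27", "node2": "country_1"},
]

for edge in drop_edges_touching(edges, ["paper_1"]):
    print(edge["id"])
```

Only `e3` survives: `e1` is dropped in the first pass and `e2` in the second.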

Create a file with the labels, for all the languages specified, **FIX THIS**

In [21]:
kgtk("""ifnotexists -i $label_all 
        -o "$TEMP"/label.all.edges.reduced.tsv.gz
        --filter-on "$TEMP"/items.remove.tsv.gz
        --input-keys node1
        --filter-keys node1
        --presorted""")

In [22]:
kgtk(f"""sort 
        -i $TEMP/label.all.edges.reduced.tsv.gz 
        --extra '--parallel 24 --buffer-size 30% --temporary-directory {os.environ['TEMP']}'
        -o $OUT/labels.tsv.gz""")


Create a file with the aliases, for all the languages specified

In [23]:
kgtk("""ifnotexists -i $alias_all
        -o $TEMP/alias.all.edges.reduced.tsv.gz
        --filter-on $TEMP/items.remove.tsv.gz
        --input-keys node1
        --filter-keys node1
        --presorted""")

In [24]:
kgtk(f"""sort 
        -i $TEMP/alias.all.edges.reduced.tsv.gz 
        --extra '--parallel 24 --buffer-size 30% --temporary-directory {os.environ['TEMP']}'
        -o $OUT/aliases.tsv.gz""")


Create a file with the descriptions, for all the languages specified

In [25]:
kgtk("""ifnotexists 
        -i $description_all
        -o $TEMP/description.all.edges.reduced.tsv.gz
        --filter-on $TEMP/items.remove.tsv.gz
        --input-keys node1
        --filter-keys node1
        --presorted""")

In [26]:
kgtk(f"""sort 
        -i $TEMP/description.all.edges.reduced.tsv.gz 
        --extra '--parallel 24 --buffer-size 30% --temporary-directory {os.environ['TEMP']}'
        -o $OUT/descriptions.tsv.gz""")

### Produce the output files for claims, labels, aliases and descriptions

In [27]:
kgtk(f"""sort 
        -i $TEMP/item.edges.reduced.2.tsv.gz
        --extra '--parallel 24 --buffer-size 30% --temporary-directory {os.environ['TEMP']}'
        -o $OUT/claims.tsv.gz""")

## Create the reduced qualifiers file
We do this by finding all the ids in the reduced claims file, and then selecting the matching edges from the `$qualifiers` file.

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `$qualifiers` 
- `$OUT/claims.tsv.gz` 
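To see why the sort order matters, here is a sketch of a generic merge-style filter over two sorted streams. It illustrates the general technique only and is not KGTK's implementation; `sorted_semi_join` and the sample rows are made up. Each input is read once, front to back, so a key that appears out of order would be silently missed.

```python
def sorted_semi_join(rows, key_index, filter_keys):
    """Yield rows whose key occurs in filter_keys.

    Both `rows` (by the key column) and `filter_keys` must be sorted in
    the same order that Python's string comparison uses.
    """
    filter_iter = iter(filter_keys)
    current = next(filter_iter, None)
    for row in rows:
        key = row[key_index]
        # Advance the filter stream until it catches up with this key.
        while current is not None and current < key:
            current = next(filter_iter, None)
        if current is None:
            break  # filter exhausted: nothing later can match
        if current == key:
            yield row


# Qualifier rows as (id, node1, label, node2), sorted by node1.
qualifiers = [
    ("q1", "e1", "P407", "x"),
    ("q2", "e1", "P3831", "y"),
    ("q3", "e3", "P10", "z"),
    ("q4", "e5", "P10", "w"),
]
claim_ids = ["e1", "e2", "e5"]  # sorted ids of the surviving claims

matched = list(sorted_semi_join(qualifiers, 1, claim_ids))
for row in matched:
    print(row[0])
```

The advantage of this design is constant memory use, which matters when both files hold billions of rows; the cost is that both inputs must be sorted consistently beforehand.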

In [28]:
if debug:
    !zcat < "$qualifiers" | head | column -t -s $'\t' 


gzip: stdout: Broken pipe
id                                                node1                           label  node2                                                                    node2;wikidatatype
P10-P1630-53947a-fbe9093e-0-P407-Q20923490-0      P10-P1630-53947a-fbe9093e-0     P407   Q20923490                                                                wikibase-item
P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0       P10-P1855-Q15075950-7eff6d65-0  P10    "Smoorverliefd 12 september.webm"                                        commonsMedia
P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0    P10-P1855-Q15075950-7eff6d65-0  P3831  Q622550                                                                  wikibase-item
P10-P1855-Q4504-a69d2c73-0-P10-bef003-0           P10-P1855-Q4504-a69d2c73-0      P10    "Komodo dragons video.ogv"                                               commonsMedia
P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0       P10-P1855-Q69063653-c8cdb04c-0  P10    "

Run `ifexists` to select the qualifiers of the edges that remain, writing them to `$OUT/qualifiers.tsv.gz`. Note that we match `node1` in the qualifier file to `id` in the `claims.tsv.gz` file.

In [29]:
kgtk("""ifexists 
    -i $qualifiers 
    -o $OUT/qualifiers.tsv.gz
    --filter-on $OUT/claims.tsv.gz
    --input-keys node1
    --filter-keys id
    --presorted""")

Look at the final output for qualifiers

In [30]:
if debug:
    !zcat $OUT/qualifiers.tsv.gz | head | col


gzip: stdout: Broken pipe
id	node1	label	node2	node2;wikidatatype
P10-P1630-53947a-fbe9093e-0-P407-Q20923490-0	P10-P1630-53947a-fbe9093e-0	P407	Q20923490	wikibase-item
P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0	P10-P1855-Q15075950-7eff6d65-0	P10	"Smoorverliefd 12 september.webm"	commonsMedia
P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0	P10-P1855-Q15075950-7eff6d65-0	P3831	Q622550 wikibase-item
P10-P1855-Q4504-a69d2c73-0-P10-bef003-0 P10-P1855-Q4504-a69d2c73-0	P10	"Komodo dragons video.ogv"	commonsMedia
P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0	P10-P1855-Q69063653-c8cdb04c-0	P10	"Couch Commander.webm"	commonsMedia
P10-P1855-Q825197-555592a4-0-P10-8a982d-0	P10-P1855-Q825197-555592a4-0	P10	"Elephants Dream (2006).webm"	commonsMedia
P10-P2302-Q21502404-d012aef4-0-P1793-1f3adb-0	P10-P2302-Q21502404-d012aef4-0	P1793	"(?i).+\\.(webm\|ogv\|ogg\|gif\|svg)"	string
P10-P2302-Q21502404-d012aef4-0-P2316-Q21502408-0	P10-P2302-Q21502404-d012aef4-0	P2316	Q21502408	wikibase-item
P10-P2302-Q215024

In [31]:
!ls -l "$OUT"

total 34220224
-rw-r--r-- 1 amandeep isdstaff  2214529468 May 14 20:50 aliases.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 11594856613 May 15 04:31 claims.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 12667243225 May 15 03:52 descriptions.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  6007956701 May 14 20:09 labels.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  2556913530 May 15 05:28 qualifiers.tsv.gz
drwxr-xr-x 2 amandeep isdstaff         288 May 15 04:31 temp.wikidata-20220505-dwd-v4


Copy the property datatypes file over

In [32]:
!cp $datatypes $OUT/metadata.property.datatypes.tsv.gz

Filter out edges from the metadata types file

In [33]:
kgtk("""ifexists 
        -i "$types" -o $OUT/metadata.types.tsv.gz
        --filter-on $OUT/claims.tsv.gz
        --input-keys node1
        --filter-keys node1
        --presorted""")

Get the sitelinks as well, since they are not in `claims.tsv.gz`

In [34]:
kgtk("""ifexists 
        -i "$GRAPH/sitelinks.tsv.gz" 
        -o "$OUT/sitelinks.tsv.gz"
        --filter-on "$OUT/claims.tsv.gz"
        --input-keys node1
        --filter-keys node1
        --presorted""")

Construct the cat command to generate `all.tsv.gz`

In [35]:
kgtk("""cat -i "$OUT/labels.tsv.gz"
            -i "$OUT/aliases.tsv.gz"
            -i "$OUT/descriptions.tsv.gz"
            -i "$OUT/claims.tsv.gz"
            -i "$OUT/qualifiers.tsv.gz"
            -i "$OUT/metadata.property.datatypes.tsv.gz"
            -i "$OUT/metadata.types.tsv.gz"
            -i "$OUT/sitelinks.tsv.gz"
            -o "$OUT/all.tsv.gz"
            """)

### Run the Partitions Notebook

In [None]:
pm.execute_notebook(
    "partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["OUT"] + "/all.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False,
        gzip_command = 'gzip'
    )
)

### Copy the `claims.wikibase-item.tsv` file from the `parts` folder

In [38]:
!cp $OUT/parts/claims.wikibase-item.tsv.gz $OUT

### Run the Useful Files notebook

In [None]:
pm.execute_notebook(
    useful_files_notebook,
    os.environ["TEMP"] + "/Wikidata-Useful-Files-Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        input_path = os.environ["OUT"],
        kgtk_path = kgtk_path,
        compute_pagerank=True,
        compute_degrees=True,
        compute_isa_star=True,
        compute_p31p279_star=True,
        debug=False
    )
)


## Sanity checks

In [None]:
if debug:
    !$kypher -i $OUT/claims.tsv.gz \
    --match '(n1:Q368441)-[l]->(n2)' \
    --limit 10 \
    | col

In [None]:
if debug:
    !$kypher -i $OUT/claims.tsv.gz \
    --match '(n1:P131)-[l]->(n2)' \
    --limit 10 \
    | col
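A more direct check is to confirm that no removed item still appears in the reduced claims. The sketch below does this with the standard library only; `count_leaked_edges` is a hypothetical helper, and it loads every removed id into a Python set, which for the roughly 40 million rows counted earlier may need several gigabytes of memory.

```python
import csv
import gzip


def read_rows(path):
    """Iterate over the rows of a gzipped KGTK TSV file as dicts."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: double quotes are part of the values, not CSV quoting.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from reader


def count_leaked_edges(claims_path, removed_path):
    """Count claims whose node1 or node2 is listed in the removal file."""
    removed = {row["node1"] for row in read_rows(removed_path)}
    return sum(
        1
        for row in read_rows(claims_path)
        if row["node1"] in removed or row["node2"] in removed
    )
```

Under these assumptions, `count_leaked_edges(os.environ["OUT"] + "/claims.tsv.gz", os.environ["TEMP"] + "/items.remove.tsv.gz")` should return 0 if both `ifnotexists` passes worked as intended.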

## Summary of results

In [10]:
!ls -lh $OUT/*.tsv.gz

-rw-r--r-- 1 amandeep isdstaff 175M May 16 04:59 /data/amandeep/wikidata-20220505-dwd-v4/aliases.en.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 2.0G May 16 01:22 /data/amandeep/wikidata-20220505-dwd-v4/aliases.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  39G May 15 22:08 /data/amandeep/wikidata-20220505-dwd-v4/all.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 184M May 16 07:06 /data/amandeep/wikidata-20220505-dwd-v4/claims.commonsMedia.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 2.5G May 16 07:06 /data/amandeep/wikidata-20220505-dwd-v4/claims.external-id.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 779K May 16 07:06 /data/amandeep/wikidata-20220505-dwd-v4/claims.geo-shape.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 227M May 16 07:06 /data/amandeep/wikidata-20220505-dwd-v4/claims.globe-coordinate.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 689K May 16 07:06 /data/amandeep/wikidata-20220505-dwd-v4/claims.math.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 295M May 16 07:06 /data/amandeep/wikidata-20220505-dwd-v4/claims.monolingualtext.tsv.g