# Generating Subsets of Wikidata

>Warning: 
**This notebook is under construction and it doesn't work**

## Purpose

>This notebook creates smaller subgraphs from a larger input Wikidata graph. Users provide lists of Wikidata classes (**QNodes**) to remove and to preserve, and the notebook produces the corresponding subset of Wikidata.


### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

**TODO:** Update the example invocation.


```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p wiki_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz \
-p qual_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p compute_pagerank no \
-p languages es,ru,zh-cn 
```
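The same kind of command line can also be assembled from a Python dictionary, which may make long parameter lists easier to maintain. The sketch below is only an illustration: `papermill_command` is a hypothetical helper (it is not part of papermill), and it relies only on the `-p name value` form shown in the example above.

```python
import shlex


def papermill_command(notebook, output_notebook, parameters):
    """Build a papermill command line with one `-p name value` pair per parameter."""
    parts = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        parts += ["-p", name, str(value)]
    # shlex.quote protects arguments that contain spaces, such as the notebook name.
    return " ".join(shlex.quote(part) for part in parts)


command = papermill_command(
    "Wikidata Useful Files.ipynb",
    "useful-files.out.ipynb",
    {
        "output_folder": "useful_files_v4",
        "delete_database": "no",
        "languages": "es,ru,zh-cn",
    },
)
print(command)
```

Quoting with `shlex.quote` replaces the hand-written backslash escapes for spaces in file names.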

In [1]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import papermill as pm

import gzip

In [2]:
# Parameters

# Folder on local machine where to create the output and temporary folders
# output_path = "/Users/pedroszekely/Downloads/kypher"
# output_path = "/Users/markmann/Downloads/subset"
output_path = "/nas/home/mbmann/subset"

# The names of the output and temporary folders
output_folder = "output"
temp_folder = "temp.output"

# Classes to remove
#Q34508 - video tape recording
# remove_classes = "Q13442814, Q523, Q16521, Q318, Q7318358, Q7187, Q11173, Q8054, Q5, Q13100073, Q8502, Q3305213, Q4022, Q79007, Q1931185, Q30612, Q101352, Q54050, Q13433827, Q2668072, Q23397, Q3863, Q11424, Q482994, Q47150325, Q16970, Q18593264, Q355304, Q9842, Q7725634, Q27020041, Q56436498, Q2154519, Q61443690, Q49008, Q3331189, Q47521, Q5084, Q19389637, Q21014462, Q4164871, Q11060274, Q5633421, Q39816, Q5185279, Q55488, Q134556, Q22698, Q985488, Q1260524, Q204107, Q2225692, Q215380, Q71963409, Q452237, Q93184, Q12323"
# remove_classes = "Q34508"

# The location of input files
# wiki_root_folder = "/Volumes/GoogleDrive/Shared\ drives/KGTK/datasets/wikidata-20200803-v4/"
# wiki_root_folder = "/Volumes/GoogleDrive/Shared\ drives/KGTK/datasets/wikidata-20200803-v4/"
# wiki_root_folder = "/Users/pedroszekely/Downloads/kypher/wikidataos-v4/"
# wiki_root_folder = "/Users/markmann/Google\ Drive/Shared\ drives/KGTK/datasets/wikidataos-v4-mm-2/"
wiki_root_folder = "/nas/home/mbmann/kgtk/datasets/wikidataos-v4-mm-2/"

metadata_folder = "/nas/home/mbmann/kgtk/datasets/wikidata-20200803-v5/data/"

claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"
qual_file = "qualifiers.tsv.gz"
property_datatypes_file = "metadata.property.datatypes.tsv.gz" #FIX
metadata_file = "metadata.types.tsv.gz" #FIX
isa_file = "derived.isa.tsv.gz"
p279star_file = "derived.P279star.tsv.gz"

# Useful files Jupyter notebook
useful_files_notebook = "Wikidata Useful Files.ipynb"
# notebooks_folder = "/Users/markmann/Desktop/CKG/kgtk_subset/kgtk/examples/"
notebooks_folder = "/nas/home/mbmann/kgtk_subset/kgtk/examples/"

# Location of the cache database for kypher
# cache_path = "/Users/pedroszekely/Downloads/kypher/wikidataos-v4"
cache_path = f'{output_path}/{output_folder}'

#Additional parameters
delete_database = "no"
compute_pagerank = "no"
languages = "en,"

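# Papermill may pass these flags as strings ("yes"/"no") or as booleans,
# so normalize both forms to a bool.
def to_bool(value):
    if isinstance(value, bool):
        return value
    return bool(value) and str(value).lower().strip() == 'yes'

# Whether to delete the cache database
delete_database = to_bool(delete_database)
compute_pagerank = to_bool(compute_pagerank)

# Split the comma-separated language codes, dropping empty entries
# (for example the one produced by the trailing comma in "en,").
if languages:
    languages = [lang.strip() for lang in languages.split(',') if lang.strip()]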

## Set up variables for files

In [3]:
#Environment variables
if cache_path:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

#Python variables
if cache_path:
    store = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    store = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

out = "{}/{}".format(output_path, output_folder)
temp = "{}/{}".format(output_path, temp_folder)

claims = wiki_root_folder + claims_file
labels = wiki_root_folder + label_file
aliases = wiki_root_folder + alias_file
descriptions = wiki_root_folder + description_file
items = wiki_root_folder + item_file
quals = wiki_root_folder + qual_file
isa = wiki_root_folder + isa_file
p279star = wiki_root_folder + p279star_file

datatypes = metadata_folder + property_datatypes_file #FIX
metadata = metadata_folder + metadata_file #FIX

# shortcuts to commands
kgtk_path = "~/anaconda3/envs/kgtk-subset/bin/kgtk"
kgtk = f'time {kgtk_path} --debug'
kypher = f"{kgtk_path} query --debug --graph-cache " + store

Go to the output directory and create the subfolders for the output files and the temporary files

In [None]:
# Each `!` command runs in its own subshell, so `!cd` would not persist; use full paths instead.
!mkdir -p {out}
!mkdir -p {temp}

Clean up the output and temp folders before we start

In [None]:
# !rm {out}/*.tsv {out}/*.tsv.gz
# !rm {temp}/*.tsv {temp}/*.tsv.gz

if delete_database:
    !rm {out}/*.tsv {out}/*.tsv.gz
    !rm {temp}/*.tsv {temp}/*.tsv.gz

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect

In [None]:
!{kypher} -i {claims} \
--match '()-[]->()' \
--limit 10
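The header check can also be done programmatically. The sketch below assumes that the input is a gzipped, tab-separated KGTK edge file whose header should contain `node1`, `label` and `node2`; `missing_columns` is a hypothetical helper, not a KGTK function.

```python
import gzip

# Assumed minimum set of columns for a KGTK edge file.
REQUIRED_COLUMNS = {"node1", "label", "node2"}


def missing_columns(path, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the header of a gzipped TSV."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        header = handle.readline().rstrip("\n").split("\t")
    return sorted(set(required) - set(header))
```

With the variables defined earlier, `missing_columns(claims)` should return an empty list when the claims file has the expected headings.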

## Creating a list of all the items to remove

**Add classes to remove, based on <u>classes themselves</u> here:** <br>
- **Example:** Let's remove the class (videotape recording, 'Q34508')
- **NOTE:** This will only remove items that have a P31/P279 relation with the class

In [None]:
classes_to_remove = ['Q34508'] #Parameter: Add classes manually here
classes = ', '.join([f'"{c}"' for c in classes_to_remove])

!{kypher} -i {claims} -o {temp}/classes.remove.manual.tsv.gz \
--match '(instance)-[:P279]->(c)' \
--where 'c in [{classes}]' \
--return 'c, instance, "p279" as label'

!zcat {temp}/classes.remove.manual.tsv.gz |head

**Add classes to remove, based on <u>instances</u> here:**
- **Example:** Let's remove the classes of the instances (Fireball, 'Q5451712'), (Bush, 'Q1017471'), and (Italian Grape Ale, 'Q67772833')
- **NOTE:** The expected classes to remove are (whisky, 'Q281'), (beer, 'Q44'), (beer brand, 'Q15075508'), and (ale, 'Q208385')

Specify the instances to remove

In [None]:
instances_to_remove = ['Q5451712', 'Q1017471', 'Q67772833'] #Parameter: Add instances manually here
instances = ', '.join([f'"{instance}"' for instance in instances_to_remove])

For all instances, get their **(P279, subclass)** from `claims.tsv.gz`

In [None]:
!{kypher} -i {claims} -o {temp}/classes.remove.p279.tsv.gz \
--match '(instance)-[:P279]->(c)' \
--where 'instance in [{instances}]'

!zcat {temp}/classes.remove.p279.tsv.gz |head

For all instances, get their **(P31, class)** from `claims.tsv.gz`

In [None]:
!{kypher} -i {claims} -o {temp}/classes.remove.p31.tsv.gz \
--match '(instance)-[:P31]->(c)' \
--where 'instance in [{instances}]'

!zcat {temp}/classes.remove.p31.tsv.gz |head
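The two queries above collect, for each seed item, the targets of its P279 and P31 edges. A minimal in-memory sketch of that selection follows; `classes_of` is a hypothetical helper, and the edge list is toy data in which only the Fireball/whisky pair comes from this notebook.

```python
def classes_of(edges, instances, properties=("P31", "P279")):
    """Collect node2 of every edge whose node1 is a seed item and whose label is P31 or P279."""
    seeds = set(instances)
    return {
        node2
        for node1, label, node2 in edges
        if node1 in seeds and label in properties
    }


# Toy edges for illustration only; "Q_other", "Q_x", "Q_class" and "P_other" are made up.
edges = [
    ("Q5451712", "P31", "Q281"),
    ("Q5451712", "P_other", "Q_x"),
    ("Q_other", "P279", "Q_class"),
]
found = classes_of(edges, ["Q5451712"])
print(found)
```

Only the P31 edge of the seed item contributes a class; edges with other labels or other subjects are ignored.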

Concatenate all `classes.remove.manual`, `classes.remove.p31`, and `classes.remove.p279` into one file

In [None]:
!{kgtk} cat -i {temp}/classes.remove.manual.tsv.gz -i {temp}/classes.remove.p31.tsv.gz -i {temp}/classes.remove.p279.tsv.gz \
-o {temp}/classes.remove.tsv.gz

Check and remove duplicate classes. 

In [None]:
!{kgtk} unique -i {temp}/classes.remove.tsv.gz \
-o {temp}/classes.remove2.tsv.gz

!zcat {temp}/classes.remove2.tsv.gz | head

In [None]:
#ISSUE: Can't query for multiple relations for given node1 within same query
#GITHUB: https://github.com/usc-isi-i2/kgtk/issues/330
#Test 1
# !{kypher} -i {claims} \
# --match '(n1)-[:P31]->(n2)' \
# --where 'n1 = "Q5451712"' \

#Test 2, Test 3 
# !{kypher} -i {claims} \
# --match 'claims: (n1)-[l1 {label: p}]->(n2)' \
# --where 'n1 = "Q5451712" and p = "P31"' \
# --where 'n1 = "Q5451712" and p = "P31" OR p = "P279"' \

### Compute the items to be removed via classes

First look at the classes we will remove

In [None]:
!zcat {temp}/classes.remove2.tsv.gz | head

1. Given all classes in `classes.remove2`, find all subclasses from `p279star`. <br>
2. Given all subclasses from `p279star`, find all subclass instances from `isa`
3. The resulting items to remove will be in `{temp}/items.remove.byclass.tsv.gz`

In [None]:
!{kypher} -i {temp}/classes.remove2.tsv.gz -i {p279star} -i {isa} \
--match 'isa: (item)-[:isa]->(subclass), P279star: (subclass)-[:P279star]->(c), class: (c)-[:count]->()' \
--return 'distinct item, "p31_p279star" as label, c as node2' \
-o {temp}/items.remove.byclass.tsv.gz
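The join performed by this query can be sketched in plain Python. The sketch assumes that the `P279star` file pairs every class with each of its ancestors and with itself, so that direct instances of a targeted class are caught as well; `items_to_remove` and the toy identifiers are hypothetical.

```python
def items_to_remove(isa, p279star, classes):
    """Return (item, "p31_p279star", class) triples for items under the given classes.

    isa:      (item, class) pairs
    p279star: (subclass, ancestor) pairs, assumed to include (class, class)
    classes:  the classes selected for removal
    """
    targets = set(classes)
    # Map each subclass to the targeted ancestors it falls under.
    targeted_ancestors = {}
    for subclass, ancestor in p279star:
        if ancestor in targets:
            targeted_ancestors.setdefault(subclass, set()).add(ancestor)
    return {
        (item, "p31_p279star", c)
        for item, subclass in isa
        for c in targeted_ancestors.get(subclass, ())
    }


# Toy data for illustration only.
isa = [("item1", "ale"), ("item2", "beer"), ("item3", "car")]
p279star = [("ale", "ale"), ("ale", "beer"), ("beer", "beer"), ("car", "car")]
removed = items_to_remove(isa, p279star, ["beer"])
print(sorted(removed))
```

Here `item1` is removed because its class is a subclass of the targeted class, and `item2` because it is a direct instance.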

Check the result

In [None]:
# !zcat {temp}/items.remove.byclass.tsv.gz | head
!echo 'Johnnie Walker'
!zgrep 'Q502268	' {temp}/items.remove.byclass.tsv.gz
!echo 'Fireball'
!zgrep 'Q5451712	' {temp}/items.remove.byclass.tsv.gz

### Compute the items to be removed via out degree

Choose an out-degree threshold `k` and identify the items whose out-degree is less than `k`.
- Ex: With `k = 2`, find the items that have an out-degree less than 2.

Compute out-degree for all QNodes in the `claims` file. Check the result.

In [None]:
!{kypher} -i {claims} -o {temp}/metadata.out_degree.tsv.gz \
--match '(n1)-[l]->()' \
--where "upper(substr(n1,0)) >= 'Q'" \
--return 'distinct n1 as node1, count(distinct l) as node2, "out_degree" as label' 

!zcat {temp}/metadata.out_degree.tsv.gz | head

Create a list of items that have out_degree < `k`, along with any parent classes they belong to. <br>
Put the results into `items.remove.bydegree.tsv.gz`. <br>

In [None]:
k = 2 #Parameter
!{kypher} -i {temp}/metadata.out_degree.tsv.gz -i {isa} -i {p279star} \
--match 'out: (item)-[:out_degree]->(n2), isa: (item)-[:isa]->(subclass), P279star: (subclass)-[:P279star]->(c)' \
--where 'cast(n2, integer) < {k}' \
--return 'distinct item, "p31_p279star" as label, c as node2' \
-o {temp}/items.remove.bydegree.tsv.gz

!zcat {temp}/items.remove.bydegree.tsv.gz | head
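The out-degree computation and the threshold test can be sketched in memory as follows. The sketch assumes the intent is to count distinct edges per Q-node and keep the nodes with fewer than `k` of them; `out_degrees` is a hypothetical helper and the edges are toy data.

```python
from collections import defaultdict


def out_degrees(edges):
    """Count distinct outgoing edge ids per node1, keeping only nodes that start with 'Q'."""
    edge_ids = defaultdict(set)
    for edge_id, node1, label, node2 in edges:
        if node1.startswith("Q"):
            edge_ids[node1].add(edge_id)
    return {node: len(ids) for node, ids in edge_ids.items()}


# Toy edges (id, node1, label, node2) for illustration only.
edges = [
    ("e1", "Q1", "P31", "Q5"),
    ("e2", "Q1", "P21", "Q6"),
    ("e3", "Q2", "P31", "Q5"),
    ("e4", "P10", "P31", "Q7"),
]
degrees = out_degrees(edges)

k = 2  # threshold, as in the cell above
low_degree_items = sorted(node for node, degree in degrees.items() if degree < k)
print(low_degree_items)
```

Nodes with no outgoing edges never appear in the claims file as `node1`, so they are not counted at all by this approach.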

### Combine the items to remove by-class and by-outdegree
Concatenate all items from `items.remove.byclass` and `items.remove.bydegree`.
The resulting list of items to remove will be `items.remove`.

In [None]:
# !zcat {temp}/items.remove.byclass.tsv.gz | head
# !echo 'Johnnie Walker'
# !zgrep 'Q502268	' {temp}/items.remove.byclass.tsv.gz
# !echo 'Fireball'
# !zgrep 'Q5451712	' {temp}/items.remove.byclass.tsv.gz

# !zcat {temp}/items.remove.bydegree.tsv.gz | head
# !zcat {temp}/items.remove.tsv.gz | head

!{kgtk} cat -i {temp}/items.remove.byclass.tsv.gz {temp}/items.remove.bydegree.tsv.gz \
-o {temp}/items.remove.tsv.gz

#Check if fireball is still in there
# !echo 'Fireball'
# !zgrep 'Q5451712	' {temp}/items.remove.tsv.gz

Deduplicate the concatenated file of items to remove. <br>
The resulting list of items to remove will be `items.remove2`.

In [None]:
!{kypher} -i {temp}/items.remove.tsv.gz -o {temp}/items.remove2.tsv.gz \
--match '(item)-[:p31_p279star]->(c)' \
--return 'distinct item, "p31_p279star" as label, c as node2'
!zcat {temp}/items.remove2.tsv.gz | head
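Concatenation followed by `distinct` amounts to an order-preserving union of the two edge lists. A minimal sketch, with a hypothetical helper name and toy data:

```python
def concat_distinct(*edge_lists):
    """Concatenate edge lists, keeping only the first occurrence of each edge."""
    seen = set()
    result = []
    for edges in edge_lists:
        for edge in edges:
            if edge not in seen:
                seen.add(edge)
                result.append(edge)
    return result


# Toy (item, label, class) triples for illustration only.
by_class = [("Q1", "p31_p279star", "C1"), ("Q2", "p31_p279star", "C1")]
by_degree = [("Q2", "p31_p279star", "C1"), ("Q3", "p31_p279star", "C2")]
combined = concat_distinct(by_class, by_degree)
print(combined)
```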

### Validate the items we will remove
Check the `items.remove2` file for classes added via different methods: 1) by-class, 2) by-instance, 3) by-outdegree

1) Check for class added manually, i.e. (videotape recording, 'Q34508')

In [None]:
!echo 'videotape recording'
!zgrep 'Q34508' {temp}/items.remove2.tsv.gz | head

2) Check for class added by-instance, i.e. (Fireball, 'Q5451712'), (whisky, 'Q281')

In [None]:
!echo 'fireball'
!zgrep 'Q5451712' {temp}/items.remove2.tsv.gz | head

!echo 'whisky'
!zgrep 'Q281' {temp}/items.remove2.tsv.gz | head

3) Check for class added by-outdegree, i.e. (??, 'Q100000030')

In [None]:
!zgrep 'Q100000030' {temp}/items.remove2.tsv.gz | head

Collect all the classes of items we will remove, just as a sanity check

In [None]:
!{kypher} -i {temp}/items.remove2.tsv.gz \
--match '()-[]->(n2)' \
--return 'count(distinct n2)'

!{kypher} -i {temp}/items.remove2.tsv.gz \
--match '()-[]->(n2)' \
--return 'distinct n2' \
--limit 10

## [TODO] Create a list of all items to protect

## Create the reduced edges file

### Remove the items from the all.tsv and the label, alias and description files
We will be left with `reduced` files whose edges do not mention the unwanted items. The items must be removed from both the `node1` and the `node2` positions, so we need to run the `ifnotexists` command twice.

Before we start, preview the files to see the column headings and check whether they look sorted.

In [None]:
!{kgtk} sort2 -i {temp}/items.remove2.tsv.gz -o {temp}/items.remove2.sorted.tsv.gz

In [None]:
!zcat < {temp}/items.remove2.sorted.tsv.gz | head | col
# !zgrep 'Q34508' {temp}/items.remove.sorted.tsv.gz -c #466

Remove from the full set of edges those edges that have a `node1` present in `items.remove2.sorted.tsv.gz`

In [None]:
!zcat {temp}/items.remove2.tsv.gz | head

In [None]:
!{kgtk} ifnotexists -i {claims} -o {temp}/item.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 

From the remaining edges, remove those that have a `node2` present in `items.remove2.sorted.tsv.gz`

In [None]:
!{kgtk} sort2 -i {temp}/item.edges.reduced.tsv.gz -o {temp}/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id

In [None]:
!{kgtk} ifnotexists -i {temp}/item.edges.reduced.sorted.tsv.gz -o {temp}/item.edges.reduced.2.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 
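The two passes above form an anti-join applied first to `node1` and then to `node2`. A minimal in-memory sketch of the same idea, with a hypothetical helper name and toy edges:

```python
def drop_edges(edges, removed_items):
    """Drop every (node1, label, node2) edge that mentions a removed item on either side."""
    removed = set(removed_items)
    # First pass: filter on node1.
    after_node1 = [edge for edge in edges if edge[0] not in removed]
    # Second pass: filter the survivors on node2.
    return [edge for edge in after_node1 if edge[2] not in removed]


# Toy edges for illustration only.
edges = [("A", "P1", "B"), ("B", "P1", "C"), ("C", "P1", "D"), ("D", "P1", "A")]
reduced = drop_edges(edges, {"B"})
print(reduced)
```

Unlike the file-based version, this sketch needs no sorting because the removed items are held in a set.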

Create a file with the labels

In [None]:
!{kgtk} ifnotexists -i {labels} -o {temp}/label.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

In [None]:
!{kgtk} sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.tsv.gz

Create a file with the aliases

In [None]:
!{kgtk} ifnotexists -i {aliases} -o {temp}/alias.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

Create a file with the descriptions

In [None]:
!{kgtk} ifnotexists -i {descriptions} -o {temp}/description.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

### Produce the output files for claims, labels, aliases and descriptions

In [None]:
!{kgtk} sort2 -i {temp}/item.edges.reduced.2.tsv.gz -o {out}/claims.tsv.gz 

In [None]:
!{kgtk} sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.en.tsv.gz 

In [None]:
!{kgtk} sort2 -i {temp}/alias.edges.reduced.tsv.gz -o {out}/aliases.en.tsv.gz 

In [None]:
!{kgtk} sort2 -i {temp}/description.edges.reduced.tsv.gz -o {out}/descriptions.en.tsv.gz 

## Tests: Confirm items were removed from claims file
**NOTE:** We will check that the items we removed no longer appear in the claims file

1) Confirm no instance of class added manually, i.e. (class: 'Q34508')<-(instance: 'Q100431477')

In [None]:
!zgrep 'Q100431477' {out}/claims.tsv.gz #PASS: No result

2) Confirm no target_instance of class added by source_instance, i.e. (source_instance, 'Q5451712')->(class:'Q281')<-(target_instance: Q1350656)

In [None]:
!zgrep 'Q1350656	' {out}/claims.tsv.gz #PASS: No result

3) Confirm no instance with out-degree < 2, i.e. (instance: 'Q100000030')

In [None]:
!zgrep 'Q100000030	' {out}/claims.tsv.gz #PASS: No result
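A plain `zgrep` matches substrings, so an id without a trailing tab can also match longer ids that start with the same digits. The sketch below checks the `node1` column exactly; it assumes a gzipped, tab-separated file with a `node1` column, and `rows_with_node1` is a hypothetical helper.

```python
import csv
import gzip


def rows_with_node1(path, item):
    """Return the rows of a gzipped KGTK TSV whose node1 column equals `item` exactly."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row["node1"] == item]
```

For a removed item, `rows_with_node1(f"{out}/claims.tsv.gz", "Q100000030")` should return an empty list.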

## Create the reduced qualifiers file
We do this by finding all the ids in the reduced edges file and then selecting the matching rows from the qualifiers file (`{quals}`).

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `{quals}` 
- `{out}/claims.tsv.gz` 

In [None]:
!zcat < {quals} | head | column -t -s $'\t' 

Run `ifexists` to select the qualifiers for the edges in `{out}/claims.tsv.gz`, writing the result to `{out}/qualifiers.tsv.gz`. Note that `node1` in the qualifier file is matched against `id` in the claims file.

In [None]:
!{kgtk} ifexists -i {quals} -o {out}/qualifiers.tsv.gz \
--filter-on {out}/claims.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted
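This step is a semi-join: a qualifier is kept only when its `node1` is the `id` of a surviving claim. A minimal in-memory sketch, with a hypothetical helper name and toy rows:

```python
def qualifiers_for(claims, qualifiers):
    """Keep the qualifiers whose node1 is the id of one of the given claims."""
    claim_ids = {claim["id"] for claim in claims}
    return [qualifier for qualifier in qualifiers if qualifier["node1"] in claim_ids]


# Toy rows for illustration only.
claims = [{"id": "e1", "node1": "Q1", "label": "P31", "node2": "Q5"}]
qualifiers = [
    {"id": "q1", "node1": "e1", "label": "P580", "node2": "2020"},
    {"id": "q2", "node1": "e9", "label": "P580", "node2": "2019"},
]
kept = qualifiers_for(claims, qualifiers)
print(kept)
```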

Look at the final output for qualifiers

In [None]:
!zcat < {out}/qualifiers.tsv.gz | head | col

## Call partition and useful_files notebooks, to generate the file output

In [None]:
kgtk_scripts_path = "/nas/home/mbmann/kgtk_subset/kgtk"
os.environ["EXAMPLES_DIR"] = kgtk_scripts_path + "/examples"
os.environ["USECASE_DIR"] = kgtk_scripts_path + "/use-cases"
os.environ["TEMP"] = temp
os.environ["OUT"] = out
os.environ["DATATYPES"] = datatypes
os.environ["METADATA"] = metadata

In [None]:
os.environ["EXAMPLES_DIR"]

In [None]:
!ls "$TEMP"

In [None]:
!ls "$OUT"

**Concatenate all output files together** <br>

**NOTE:** The `metadata.property.datatypes` and `metadata.types` are not currently generated by this notebook, and have been copied from `wikidata-20200803-v5/data`. <br>
**TODO:** We must confirm if these are source files, or computed. If they are computed, we should compute them in this notebook.

In [None]:
!{kgtk} cat \
-i {out}/aliases.en.tsv.gz \
-i {out}/descriptions.en.tsv.gz \
-i {out}/qualifiers.tsv.gz \
-i {out}/claims.tsv.gz \
-i {out}/labels.en.tsv.gz \
-i {datatypes} \
-i {metadata} \
-o {out}/all.tsv.gz
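For files that share exactly the same header, the concatenation can be sketched as follows. This is a simplification under that assumption and does not reproduce how `kgtk cat` reconciles differing columns; `cat_tsv_gz` is a hypothetical helper.

```python
import gzip


def cat_tsv_gz(input_paths, output_path):
    """Concatenate gzipped TSV files that share one header, writing the header once."""
    expected_header = None
    with gzip.open(output_path, "wt", encoding="utf-8") as out_file:
        for path in input_paths:
            with gzip.open(path, "rt", encoding="utf-8") as in_file:
                header = in_file.readline()
                if expected_header is None:
                    expected_header = header
                    out_file.write(header)
                elif header != expected_header:
                    raise ValueError(f"{path} has a different header")
                # Copy the data rows unchanged.
                for line in in_file:
                    out_file.write(line)
```

Raising on a header mismatch keeps the sketch from silently producing a file with misaligned columns.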

In [None]:
!ls {os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb"}

In [None]:
os.environ["OUT"]

### Call the partition-wikidata notebook
`partition-wikidata` takes the combined `all.tsv.gz` and splits it into separate files, one per kind of Wikidata content (claims, aliases, labels, descriptions, qualifiers). <br>

**NOTE:** This notebook also produces `claims.wikibase-item.tsv.gz` and outputs it to `wikidata_parts_path`

In [None]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["OUT"] + "/all.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
);

## Call the useful-files notebook
`Wikidata Useful Files` will take intermediary output generated by `partition-wikidata` and produce the following statistics: `derived.P31.tsv.gz`, `derived.P279.tsv.gz`, `derived.isa.tsv.gz`, `derived.P279star.tsv.gz`, and `metadata.out_degree.tsv.gz`.

In [None]:
#NOTE: Don't pass in cache path, as one doesn't yet exist for useful_files
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Wikidata Useful Files.ipynb",
    os.environ["TEMP"] + "/Wikidata Useful Files Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        output_folder = "useful_files",
        temp_folder = "temp.useful_files",
        wiki_root_folder = os.environ["OUT"] + "/parts/",
        languages = 'en',
        compute_pagerank = True,
        delete_database = False
    )
);

## Summary of results

In [None]:
!ls -lh {out}/*wikidataos.*

In [None]:
!zcat < {out}/wikidataos.all.tsv.gz | wc

## Verification

The edges file must contain edges for properties; this was not the case as of 2020-11-10.


In [None]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(:P10)-[l]->(n2)' \
--limit 10

## Concatenate files to get the `all` file

In [5]:
!{kgtk} cat -i {out}/claims.tsv.gz \
{out}/qualifiers.tsv.gz \
{out}/useful_files/metadata.pagerank.undirected.tsv.gz \
{out}/useful_files/metadata.pagerank.directed.tsv.gz \
{out}/useful_files/metadata.in_degree.sorted.tsv.gz \
{out}/useful_files/metadata.out_degree.sorted.tsv.gz \
-o {out}/wikidataos.all.tsv.gz


real	10m21.385s
user	10m19.903s
sys	0m0.653s


## Concatenate files to get the `all for triples` file


In [6]:
!{kgtk} cat -i {out}/wikidataos.all.tsv.gz \
{out}/useful_files/derived.P31.tsv.gz \
{out}/useful_files/derived.P279.tsv.gz \
{out}/useful_files/derived.isa.tsv.gz \
{out}/useful_files/derived.P279star.tsv.gz \
-o {out}/wikidataos.all.for.triples.tsv.gz


real	13m6.168s
user	13m4.726s
sys	0m0.687s


## Filter out `novalue`, `somevalue` and `P9`

In [7]:
!{kgtk} filter -i {out}/wikidataos.all.for.triples.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.tsv.gz \
-p ';;somevalue,novalue,P9' --invert


real	12m1.114s
user	11m59.364s
sys	0m0.563s
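As written, the pattern `;;somevalue,novalue,P9` leaves the first two fields empty and lists all three values in the third field, so the sketch below assumes the filter tests `node2` only and that `--invert` keeps the edges that do not match. `drop_special_values` is a hypothetical helper and the edges are toy data.

```python
def drop_special_values(edges, unwanted=("somevalue", "novalue", "P9")):
    """Keep the (node1, label, node2) edges whose node2 is not one of the unwanted values."""
    blocked = set(unwanted)
    return [edge for edge in edges if edge[2] not in blocked]


# Toy edges for illustration only.
edges = [
    ("Q1", "P31", "Q5"),
    ("Q2", "P570", "somevalue"),
    ("Q3", "P40", "novalue"),
    ("Q4", "P1", "P9"),
]
filtered = drop_special_values(edges)
print(filtered)
```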


## Add ids for any edge with missing id

In [8]:
!{kgtk} add-id -i {out}/wikidataos.all.for.triples.filtered.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.id.tsv.gz \
--id-style wikidata


real	12m35.521s
user	12m33.997s
sys	0m0.597s


## Sort by `id`

In [10]:
!{kgtk} sort2 -i {out}/wikidataos.all.for.triples.filtered.id.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.id.sorted.tsv.gz \
-c id


real	3m38.368s
user	3m36.319s
sys	0m18.084s
