# Generating Subsets of Wikidata


## Purpose

This notebook creates smaller subgraphs from a larger input Wikidata graph. Notebook users provide lists of Wikidata classes (**QNodes**) to remove and to preserve in order to create the desired subset of Wikidata.

## Prerequisite input data

**`wiki_root_folder`** : This folder should contain all the Wikidata input files.

**`useful_files_output_folder`** : This folder should contain the files that the `Wikidata Useful Files` notebook computes from the contents of `wiki_root_folder`.

### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress. It is recommended to run this papermill command when the input data is very large (GB scale), as this notebook will take some time to finish running.

UPDATE EXAMPLE INVOCATION
```
papermill 'Wikidata Subsets.ipynb' 'Wikidata Subsets.out.ipynb' \
-p output_path /nas/home/mbmann/subset2 \
-p output_folder output \
-p temp_folder temp.output \
-p wiki_root_folder /nas/home/mbmann/KGTK-public-graphs2/wikidata-20201130/data/ \
-p useful_files_notebook 'Wikidata Useful Files.ipynb' \
-p notebooks_folder /nas/home/mbmann/kgtk_subset/kgtk/examples/ \
-p languages en,
```
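
The same batch run can also be launched from Python. The sketch below is a hedged equivalent of the command above, not a tested replacement: `build_parameters` and `run_subset_notebook` are hypothetical helper names, the parameter names and values mirror the `-p` flags, and it assumes `papermill` is installed in the active environment.

```python
def build_parameters(output_path, wiki_root_folder, notebooks_folder, languages="en,"):
    """Collect the notebook parameters that the papermill CLI passes with -p."""
    return {
        "output_path": output_path,
        "output_folder": "output",
        "temp_folder": "temp.output",
        "wiki_root_folder": wiki_root_folder,
        "useful_files_notebook": "Wikidata Useful Files.ipynb",
        "notebooks_folder": notebooks_folder,
        "languages": languages,
    }


def run_subset_notebook(parameters,
                        input_notebook="Wikidata Subsets.ipynb",
                        output_notebook="Wikidata Subsets.out.ipynb"):
    """Execute the notebook; the executed copy is written to output_notebook."""
    import papermill as pm  # imported lazily so build_parameters has no dependency
    return pm.execute_notebook(input_notebook, output_notebook, parameters=parameters)


params = build_parameters(
    output_path="/nas/home/mbmann/subset2",
    wiki_root_folder="/nas/home/mbmann/KGTK-public-graphs2/wikidata-20201130/data/",
    notebooks_folder="/nas/home/mbmann/kgtk_subset/kgtk/examples/",
)
print(sorted(params))
```

Calling `run_subset_notebook(params)` would then start the long-running execution, and the output notebook can be opened while it runs to check progress.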

In [1]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import papermill as pm

import gzip

In [2]:
# Parameters

# Folder on local machine where to create the output and temporary folders
output_path = "/nas/home/mbmann/subset2"

# The names of the output and temporary folders
output_folder = "output"
temp_folder = "temp.output"

# The location of input files
wiki_root_folder = "/nas/home/mbmann/kgtk/datasets/wikidataos-v4-mm-2/"
# wiki_root_folder = "/nas/home/mbmann/KGTK-public-graphs2/wikidata-20201130/data/"

# The location of useful_files output
useful_files_output_folder = "/nas/home/mbmann/useful_files_output/output/useful_files/"

claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"
qual_file = "qualifiers.tsv.gz"
property_datatypes_file = "metadata.property.datatypes.tsv.gz"
metadata_file = "metadata.types.tsv.gz" 
isa_file = "derived.isa.tsv.gz" #Preprocessed
p279star_file = "derived.P279star.tsv.gz" #Preprocessed

# Useful files Jupyter notebook
useful_files_notebook = "Wikidata Useful Files.ipynb"
notebooks_folder = "/nas/home/mbmann/kgtk_subset/kgtk/examples/"

# Location of the cache database for kypher
cache_path = f'{output_path}/{output_folder}'

#Additional parameters
delete_database = "no"
compute_pagerank = "no"
languages = "en,"

# Whether to delete the cache database
if delete_database and delete_database.lower().strip() == 'yes':
    delete_database = True
else:
    delete_database = False

#Whether to compute pagerank
if compute_pagerank and compute_pagerank.lower().strip() == 'yes':
    compute_pagerank = True
else:
    compute_pagerank = False

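if languages:
    # Drop empty entries so that "en," yields ['en'] rather than ['en', '']
    languages = [lang.strip() for lang in languages.split(',') if lang.strip()]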

## Set up variables for files

In [3]:
#Python and Environment variables
if cache_path:
    store = "{}/wikidata.sqlite3.db".format(cache_path)
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    store = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

out = "{}/{}".format(output_path, output_folder)
temp = "{}/{}".format(output_path, temp_folder)

claims = wiki_root_folder + claims_file
labels = wiki_root_folder + label_file
aliases = wiki_root_folder + alias_file
descriptions = wiki_root_folder + description_file
items = wiki_root_folder + item_file
quals = wiki_root_folder + qual_file
datatypes = wiki_root_folder + property_datatypes_file 
metadata = wiki_root_folder + metadata_file 
isa = useful_files_output_folder + isa_file #Preprocessed
p279star = useful_files_output_folder + p279star_file #Preprocessed

# shortcuts to commands
kgtk_path = "~/anaconda3/envs/kgtk-subset/bin/kgtk"
kgtk = f'time {kgtk_path} --debug'
kypher = f"{kgtk_path} query --debug --graph-cache " + store

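Create the subfolders for the output files and the temporary files.

In [4]:
# `!cd` only affects its own subshell, so rely on the absolute paths instead.
# `mkdir -p` creates missing parent folders and does not fail on reruns.
!mkdir -p {out}
!mkdir -p {temp}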

Clean up the output and temp folders before we start

In [5]:
# !rm {out}/*.tsv {out}/*.tsv.gz
# !rm {temp}/*.tsv {temp}/*.tsv.gz

if delete_database:
    !rm {out}/*.tsv {out}/*.tsv.gz
    !rm {temp}/*.tsv {temp}/*.tsv.gz

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.
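
The header check can also be done without building the Kypher cache. The sketch below reads only the first lines of a gzipped TSV file with the standard library. `peek_tsv` is a hypothetical helper, and the demo file is a tiny stand-in written to a temporary folder so that the example runs without the real dump.

```python
import csv
import gzip
import os
import tempfile


def peek_tsv(path, n_rows=3):
    """Return the header and the first n_rows data rows of a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps KGTK string literals such as "..." untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Build a tiny stand-in file so the sketch runs without the real dump.
demo_path = os.path.join(tempfile.mkdtemp(), "claims.demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as handle:
    handle.write("id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n")
    handle.write("P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\tnormal\twikibase-item\n")

header, rows = peek_tsv(demo_path)
expected = ["id", "node1", "label", "node2", "rank", "node2;wikidatatype"]
print(header == expected)
```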

In [8]:
!{kypher} -i {claims} \
--match '()-[]->()' \
--limit 10

[2021-02-17 08:17:46 sqlstore]: IMPORT graph directly into table graph_1 from /nas/home/mbmann/kgtk/datasets/wikidataos-v4-mm-2/claims.tsv.gz ...
[2021-02-17 08:18:31 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q7378-555592a4-0	P10	P1855	Q7378	normal	wikibase-item
P10-P31-Q18610173-85ef4d24-0	P10	P31	Q1861

## Creating a list of all the items to remove

**[REQUIRED] Add classes to remove, given a list of classes** <br>
- **Example:** Let's remove the class (videotape recording, 'Q34508')
- **NOTE:** This will only remove items that have a `P31_P279star` relation with the class

**[OPTIONAL] Add instances to remove, given a list of instances** <br>
- **Example:** Let's remove instances (Fireball, 'Q5451712'), (Bush, 'Q1017471'), and (Italian Grape Ale, 'Q67772833')

In [9]:
classes_to_remove = ['Q34508', 'Q7378']
# classes_to_remove = ['Q13442814', 'Q523', 'Q318', 'Q7318358', 'Q7187', 'Q11173', 'Q8054'] #Parameter: Add classes manually here
instances_to_remove = []

classes = ', '.join([f'"{c}"' for c in classes_to_remove])
instances = ', '.join([f'"{c}"' for c in instances_to_remove])
print('classes: ', classes)
print('instances: ', instances)

classes:  "Q34508", "Q7378"
instances:  


### Compute the items to be removed via classes

First look at the classes we will remove

1. Given all classes in `classes_to_remove`, find all subclasses from `p279star`. <br>
2. Given all subclasses from `p279star`, find all subclass instances from `isa`
3. The resulting items to remove will be in `{temp}/items.remove.byclass.tsv.gz`
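
The three steps above can be sketched in plain Python on in-memory edge lists. This illustrates the join logic only and is not the Kypher query the notebook runs. `items_to_remove` is a hypothetical helper, every id except `Q34508` is made up, and the reflexive `("Q34508", "Q34508")` edge assumes that `p279star` relates each class to itself.

```python
def items_to_remove(isa_edges, p279star_edges, classes):
    """Return (item, "p31_p279star", class) rows for items under the given classes.

    isa_edges:       iterable of (item, class) pairs, as in the `isa` file
    p279star_edges:  iterable of (subclass, ancestor) pairs, as in `p279star`
    classes:         the classes to remove
    """
    targets = set(classes)
    # Step 1: subclasses whose P279star closure reaches a target class.
    reaches = {}
    for subclass, ancestor in p279star_edges:
        if ancestor in targets:
            reaches.setdefault(subclass, set()).add(ancestor)
    # Step 2: instances of those subclasses.
    rows = {
        (item, "p31_p279star", target)
        for item, cls in isa_edges
        for target in reaches.get(cls, ())
    }
    # Step 3: a deterministic, de-duplicated result.
    return sorted(rows)


# Toy data: only Q34508 comes from the notebook, every other id is made up.
p279star_toy = [("Q34508", "Q34508"), ("Qvhs", "Q34508"), ("Qcat", "Qanimal")]
isa_toy = [("Qtape1", "Qvhs"), ("Qtape2", "Q34508"), ("Qtom", "Qcat")]

result = items_to_remove(isa_toy, p279star_toy, ["Q34508"])
print(result)
```

`Qtom` is kept because its class never reaches `Q34508`, while both tapes are selected, one through a subclass and one directly.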

In [10]:
if classes != '' and instances == '':
    print('Finding items to remove based on classes only.')
    cmd = f'''
    {kypher} -i {p279star} -i {isa} \
    --match 'isa: (item)-[:isa]->(subclass), P279star: (subclass)-[:P279star]->(c)' \
    --return 'distinct item, "p31_p279star" as label, c as node2' \
    --where 'c in [{classes}]' \
    -o {temp}/items.remove.byclass.tsv.gz
    '''
    !{cmd}
else:
    print('Finding items to remove based on classes and instances.')
    cmd = f'''
    {kypher} -i {p279star} -i {isa} \
    --match 'isa: (item)-[:isa]->(subclass), P279star: (subclass)-[:P279star]->(c)' \
    --return 'distinct item, "p31_p279star" as label, c as node2' \
    --where 'c in [{classes}] OR item in [{instances}]' \
    -o {temp}/items.remove.byclass.tsv.gz
    '''
    !{cmd}

Finding items to remove based on classes only.
[2021-02-17 08:18:32 sqlstore]: IMPORT graph directly into table graph_2 from /nas/home/mbmann/useful_files_output/output/useful_files/derived.P279star.tsv.gz ...
[2021-02-17 08:18:49 sqlstore]: IMPORT graph directly into table graph_3 from /nas/home/mbmann/useful_files_output/output/useful_files/derived.isa.tsv.gz ...
[2021-02-17 08:18:52 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_3_c1."node1", ? "_aLias.label", graph_2_c2."node2" "_aLias.node2"
     FROM graph_2 AS graph_2_c2, graph_3 AS graph_3_c1
     WHERE graph_2_c2."label"=?
     AND graph_3_c1."label"=?
     AND graph_2_c2."node1"=graph_3_c1."node2"
     AND (graph_2_c2."node2" IN (?, ?))
  PARAS: ['p31_p279star', 'P279star', 'isa', 'Q34508', 'Q7378']
---------------------------------------------
[2021-02-17 08:18:52 sqlstore]: CREATE INDEX on table graph_3 column node2 ...
[2021-02-17 08:18:55 sqlstore]: ANALYZE INDEX on table gr

Check the result

In [11]:
!zcat {temp}/items.remove.byclass.tsv.gz | head

node1	label	node2
Q100000003	p31_p279star	Q34508
Q100000011	p31_p279star	Q34508
Q100000014	p31_p279star	Q34508
Q100000021	p31_p279star	Q34508
Q100000029	p31_p279star	Q34508
Q100000033	p31_p279star	Q34508
Q100000049	p31_p279star	Q34508
Q100328888	p31_p279star	Q34508
Q10267876	p31_p279star	Q34508


### Compute the items to be removed via out degree

Specify an out-degree threshold `k` and identify the items whose out-degree is less than `k`.
- Ex: With `k = 2`, find the items that have an out-degree less than 2.

Create a list of items that have out-degree < `k`, along with any parent classes they belong to. <br>
Put the results into `items.remove.bydegree.tsv.gz`. <br>
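
As a rough illustration of the selection rule, the sketch below counts out-degrees directly from a toy edge list and keeps the Q-items under the threshold. It is not the notebook's method, which reads a precomputed out-degree file. `low_out_degree_items` is a hypothetical helper and all ids are made up. It uses the strict `<` from the text, whereas the commented-out cell compares with `<=`.

```python
from collections import Counter


def low_out_degree_items(edges, k):
    """Return (item, "out_degree", degree) rows for Q-items with out-degree < k."""
    out_degree = Counter(node1 for node1, _label, _node2 in edges)
    return sorted(
        (item, "out_degree", degree)
        for item, degree in out_degree.items()
        if degree < k and item.startswith("Q")
    )


# Toy edges with made-up ids; P31/P279 are only used as example labels.
toy_edges = [
    ("Qa", "P31", "Qx"),
    ("Qa", "P279", "Qy"),
    ("Qb", "P31", "Qx"),
    ("P10", "P31", "Qz"),  # property node, skipped by the Q-prefix check
]

low_degree = low_out_degree_items(toy_edges, k=2)
print(low_degree)
```

Nodes that never appear as `node1` have no count here, so items with out-degree 0 would need separate handling.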

In [12]:
# k = 2 #Parameter
# !{kypher} -i {useful_files_output_folder}metadata.out_degree.sorted.tsv.gz \
# --match 'out: (item)-[:out_degree]->(n2)' \
# --where "cast(n2, integer) <= {k} and upper(substr(item,0)) >= 'Q'" \
# --return 'distinct item, "out_degree" as label, n2 as node2' \
# -o {temp}/items.remove.bydegree.tsv.gz \

# !zcat {temp}/items.remove.bydegree.tsv.gz | head

### Combine the items to remove by-class and by-outdegree
Concatenate all items from `items.remove.byclass` and `items.remove.bydegree`.
The resulting list of items to remove will be `items.remove`.

In [13]:
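# Match only the by-class and by-degree inputs; a broader `items.remove.*` glob
# would also pick up `items.remove.sorted.tsv.gz` left over from an earlier run.
!{kgtk} cat -i {temp}/items.remove.by*.tsv.gz \
-o {temp}/items.remove.tsv.gz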
!zcat {temp}/items.remove.tsv.gz | head


real	0m1.516s
user	0m0.575s
sys	0m0.154s
node1	label	node2
Q100000003	p31_p279star	Q34508
Q100000011	p31_p279star	Q34508
Q100000014	p31_p279star	Q34508
Q100000021	p31_p279star	Q34508
Q100000029	p31_p279star	Q34508
Q100000033	p31_p279star	Q34508
Q100000049	p31_p279star	Q34508
Q100328888	p31_p279star	Q34508
Q10267876	p31_p279star	Q34508


Deduplicate the concatenated file of items to remove. <br>
The resulting list of items to remove will be `items.remove2`.

In [14]:
!{kgtk} sort2 -i {temp}/items.remove.tsv.gz -o {temp}/items.remove.sorted.tsv.gz


real	0m1.069s
user	0m0.605s
sys	0m0.157s


In [15]:
!{kgtk} compact -i {temp}/items.remove.sorted.tsv.gz -o {temp}/items.remove2.tsv.gz \
--columns 'node1'
!zcat {temp}/items.remove2.tsv.gz | head


real	0m1.044s
user	0m0.576s
sys	0m0.121s
node1	label	node2
Q100000003	p31_p279star	Q34508
Q100000011	p31_p279star	Q34508
Q100000014	p31_p279star	Q34508
Q100000021	p31_p279star	Q34508
Q100000029	p31_p279star	Q34508
Q100000033	p31_p279star	Q34508
Q100000049	p31_p279star	Q34508
Q100328888	p31_p279star	Q34508
Q10267876	p31_p279star	Q34508


### Validate the items we will remove
Check the `items.remove2` file for items added via the different methods: 1) by-class, 2) by-instance, 3) by-outdegree

1) Check for class added manually, i.e. (videotape recording, 'Q34508')

In [None]:
# !echo 'videotape recording'
# !zgrep 'Q34508' {temp}/items.remove2.tsv.gz | head

2) Check for class added by-instance, i.e. (Fireball, 'Q5451712')

In [None]:
# !echo 'fireball'
# !zgrep 'Q5451712' {temp}/items.remove2.tsv.gz | head

3) Check for class added by-outdegree, i.e. (??, 'Q100000030')

In [None]:
# !zgrep 'Q100000030' {temp}/items.remove2.tsv.gz | head

Count the distinct items we will remove and list the first few, just as a sanity check

In [16]:
!{kypher} -i {temp}/items.remove2.tsv.gz \
--match '(n1)-[]->()' \
--return 'count(distinct n1)'

!{kypher} -i {temp}/items.remove2.tsv.gz \
--match '(n1)-[]->()' \
--return 'distinct n1' \
--limit 10

[2021-02-17 08:19:35 sqlstore]: IMPORT graph directly into table graph_4 from /nas/home/mbmann/subset2/temp.output/items.remove2.tsv.gz ...
[2021-02-17 08:19:35 query]: SQL Translation:
---------------------------------------------
  SELECT count(DISTINCT graph_4_c1."node1")
     FROM graph_4 AS graph_4_c1
  PARAS: []
---------------------------------------------
count(DISTINCT graph_4_c1."node1")
252
[2021-02-17 08:19:37 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_4_c1."node1"
     FROM graph_4 AS graph_4_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
node1
Q100000003
Q100000011
Q100000014
Q100000021
Q100000029
Q100000033
Q100000049
Q100328888
Q10267876
Q11293639


## [TODO] Create a list of all items to protect

## Create the reduced edges file

### Remove the items from the claims file and the label, alias and description files
We will be left with `reduced` files where the edges do not have the unwanted items. For the claims, the unwanted items must be removed from both the `node1` and `node2` positions, so we need to run the `ifnotexists` command twice.
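
The two-pass removal is an anti-join applied once per column. The sketch below shows the idea on in-memory tuples. `drop_edges_touching` is a hypothetical helper, the edge tuples are assumed to be `(id, node1, label, node2)`, and all ids are made up.

```python
def drop_edges_touching(edges, removed_items):
    """Keep only edges whose node1 and node2 are both absent from removed_items."""
    removed = set(removed_items)
    # Pass 1: anti-join on node1.
    kept = [edge for edge in edges if edge[1] not in removed]
    # Pass 2: anti-join on node2.
    return [edge for edge in kept if edge[3] not in removed]


# Toy edges as (id, node1, label, node2); all ids are made up.
claim_edges = [
    ("e1", "Qkeep", "P31", "Qclass"),
    ("e2", "Qgone", "P31", "Qclass"),   # dropped in pass 1 (node1 is removed)
    ("e3", "Qkeep", "P1629", "Qgone"),  # dropped in pass 2 (node2 is removed)
]

reduced = drop_edges_touching(claim_edges, ["Qgone"])
print(reduced)
```

On files too large for memory, the same effect needs sorted inputs and a streaming merge, which is why the notebook re-sorts the edges by `node2` before the second pass.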

Before we start, preview the files to see the column headings and check whether they look sorted.

In [17]:
!{kgtk} sort2 -i {temp}/items.remove2.tsv.gz -o {temp}/items.remove2.sorted.tsv.gz


real	0m1.039s
user	0m0.566s
sys	0m0.170s


In [18]:
!zcat < {temp}/items.remove2.sorted.tsv.gz | head | col
# !zgrep 'Q34508' {temp}/items.remove.sorted.tsv.gz -c #466

node1	label	node2
Q100000003	p31_p279star	Q34508
Q100000011	p31_p279star	Q34508
Q100000014	p31_p279star	Q34508
Q100000021	p31_p279star	Q34508
Q100000029	p31_p279star	Q34508
Q100000033	p31_p279star	Q34508
Q100000049	p31_p279star	Q34508
Q100328888	p31_p279star	Q34508
Q10267876	p31_p279star	Q34508


Remove from the full set of edges those edges that have a `node1` present in `items.remove2.sorted.tsv.gz`

In [19]:
!zcat {temp}/items.remove2.tsv.gz | head

node1	label	node2
Q100000003	p31_p279star	Q34508
Q100000011	p31_p279star	Q34508
Q100000014	p31_p279star	Q34508
Q100000021	p31_p279star	Q34508
Q100000029	p31_p279star	Q34508
Q100000033	p31_p279star	Q34508
Q100000049	p31_p279star	Q34508
Q100328888	p31_p279star	Q34508
Q10267876	p31_p279star	Q34508


In [20]:
!{kgtk} ifnotexists -i {claims} -o {temp}/item.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 


real	4m39.916s
user	4m38.509s
sys	0m0.669s


From the remaining edges, remove those that have a `node2` present in `items.remove2.sorted.tsv.gz`

In [21]:
!{kgtk} sort2 -i {temp}/item.edges.reduced.tsv.gz -o {temp}/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id


real	2m23.010s
user	2m29.580s
sys	0m8.189s


In [22]:
!{kgtk} ifnotexists -i {temp}/item.edges.reduced.sorted.tsv.gz -o {temp}/item.edges.reduced.2.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 


real	4m58.017s
user	4m56.338s
sys	0m0.450s


Create a file with the labels

In [23]:
!{kgtk} ifnotexists -i {labels} -o {temp}/label.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted


real	0m50.328s
user	0m43.826s
sys	0m0.208s


In [24]:
!{kgtk} sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.tsv.gz


real	0m12.949s
user	0m12.705s
sys	0m1.041s


Create a file with the aliases

In [25]:
!{kgtk} ifnotexists -i {aliases} -o {temp}/alias.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted


real	0m16.730s
user	0m13.992s
sys	0m0.162s


Create a file with the descriptions

In [26]:
!{kgtk} ifnotexists -i {descriptions} -o {temp}/description.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted


real	0m28.338s
user	0m24.969s
sys	0m0.176s


### Produce the output files for claims, labels, aliases and descriptions

In [27]:
!{kgtk} sort2 -i {temp}/item.edges.reduced.2.tsv.gz -o {out}/claims.tsv.gz 


real	1m58.866s
user	1m59.171s
sys	0m8.535s


In [28]:
!{kgtk} sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.en.tsv.gz 


real	0m12.330s
user	0m11.567s
sys	0m0.960s


In [29]:
!{kgtk} sort2 -i {temp}/alias.edges.reduced.tsv.gz -o {out}/aliases.en.tsv.gz 


real	0m4.445s
user	0m4.130s
sys	0m0.404s


In [30]:
!{kgtk} sort2 -i {temp}/description.edges.reduced.tsv.gz -o {out}/descriptions.en.tsv.gz 


real	0m7.325s
user	0m6.925s
sys	0m0.686s


## Tests: Confirm items were removed from claims file
**NOTE:** We will check that the items we removed do not exist in the claims file

1) Confirm no instance of class added manually, i.e. (class: 'Q34508')<-(instance: 'Q100431477')

In [None]:
#PASS: No result
# !zgrep 'Q100431477' {out}/claims.tsv.gz 

#PASS: Barack Obama not in result, as he is human (Q5) and human was removed
# !zgrep 'Q76' {out}/claims.tsv.gz 

2) Confirm no instance of instance added manually, i.e. (source_instance, 'Q5451712')

In [None]:
# !zgrep 'Q1350656	' {out}/claims.tsv.gz #PASS: No result

3) Confirm no instance with out-degree < 2, i.e. (instance: 'Q100000030')

In [None]:
# !zgrep 'Q100000030	' {out}/claims.tsv.gz #PASS: No result

## Create the reduced qualifiers file
We do this by finding all the ids in the reduced edges file, and then selecting the matching rows from the qualifiers file (`{quals}`).

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `{quals}` 
- `{out}/claims.tsv.gz` 

In [31]:
!zcat < {quals} | head | column -t -s $'\t' 

id                                               node1                             label  node2                          node2;wikidatatype
P10-P1855-Q7378-555592a4-0-P10-8a982d-0          P10-P1855-Q7378-555592a4-0        P10    "Elephants Dream (2006).webm"  commonsMedia
P1000-P1896-f63a36-b84f3cd2-0-P1476-bf511b-0     P1000-P1896-f63a36-b84f3cd2-0     P1476  'FAI records'@en               monolingualtext
P1001-P1855-Q29868931-76b67d84-0-P1001-Q11736-0  P1001-P1855-Q29868931-76b67d84-0  P1001  Q11736                         wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q17269-0  P1001-P1855-Q29868931-76b67d84-0  P1001  Q17269                         wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q21208-0  P1001-P1855-Q29868931-76b67d84-0  P1001  Q21208                         wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q34800-0  P1001-P1855-Q29868931-76b67d84-0  P1001  Q34800                         wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q41079-0  

Run `ifexists` to select the qualifiers for the edges in `{out}/claims.tsv.gz` and write them to `{out}/qualifiers.tsv.gz`. Note that we use `node1` in the qualifier file, matching it to `id` in the claims file.
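
Conceptually this is a semi-join: a qualifier survives only if the edge it qualifies survived. A minimal in-memory sketch follows. `select_qualifiers` is a hypothetical helper, rows are assumed to be `(id, node1, label, node2)` tuples, and the second qualifier row is made up.

```python
def select_qualifiers(qualifiers, claims):
    """Keep qualifier rows whose node1 equals the id of a remaining claim."""
    claim_ids = {claim[0] for claim in claims}
    return [qualifier for qualifier in qualifiers if qualifier[1] in claim_ids]


# Rows as (id, node1, label, node2). The first qualifier mirrors the preview
# output above; the second one is made up and points at a removed edge.
remaining_claims = [("P10-P1855-Q7378-555592a4-0", "P10", "P1855", "Q7378")]
all_qualifiers = [
    ("P10-P1855-Q7378-555592a4-0-P10-8a982d-0", "P10-P1855-Q7378-555592a4-0",
     "P10", '"Elephants Dream (2006).webm"'),
    ("Qgone-P31-Qx-0-P10-0", "Qgone-P31-Qx-0", "P10", '"made-up.webm"'),
]

kept_qualifiers = select_qualifiers(all_qualifiers, remaining_claims)
print(len(kept_qualifiers))
```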

In [32]:
!{kgtk} ifexists -i {quals} -o {out}/qualifiers.tsv.gz \
--filter-on {out}/claims.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted


real	2m12.444s
user	2m9.899s
sys	0m0.282s


Look at the final output for qualifiers

In [33]:
!zcat < {out}/qualifiers.tsv.gz | head | col

id	node1	label	node2	node2;wikidatatype
P10-P1855-Q7378-555592a4-0-P10-8a982d-0 P10-P1855-Q7378-555592a4-0	P10	"Elephants Dream (2006).webm"	commonsMedia
P1000-P1896-f63a36-b84f3cd2-0-P1476-bf511b-0	P1000-P1896-f63a36-b84f3cd2-0	P1476	'FAI records'@en	monolingualtext
P1001-P1855-Q29868931-76b67d84-0-P1001-Q11736-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q11736	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q17269-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q17269	wikibase-item

gzip: P1001-P1855-Q29868931-76b67d84-0-P1001-Q21208-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q21208	wikibase-item
stdout: Broken pipe
P1001-P1855-Q29868931-76b67d84-0-P1001-Q34800-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q34800	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q41079-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q41079	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q42392-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q42392	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q43684-

## Call partition and useful_files notebooks, to generate the file output

In [34]:
kgtk_scripts_path = "/nas/home/mbmann/kgtk_subset/kgtk"
os.environ["EXAMPLES_DIR"] = kgtk_scripts_path + "/examples"
os.environ["USECASE_DIR"] = kgtk_scripts_path + "/use-cases"
os.environ["TEMP"] = temp
os.environ["OUT"] = out
os.environ["DATATYPES"] = datatypes
os.environ["METADATA"] = metadata

In [35]:
os.environ["EXAMPLES_DIR"]

'/nas/home/mbmann/kgtk_subset/kgtk/examples'

In [36]:
!ls "$TEMP"

alias.edges.reduced.tsv.gz	  items.remove2.tsv.gz
description.edges.reduced.tsv.gz  items.remove.byclass.tsv.gz
item.edges.reduced.2.tsv.gz	  items.remove.sorted.tsv.gz
item.edges.reduced.sorted.tsv.gz  items.remove.tsv.gz
item.edges.reduced.tsv.gz	  label.edges.reduced.tsv.gz
items.remove2.sorted.tsv.gz


In [37]:
!ls "$OUT"

aliases.en.tsv.gz	labels.en.tsv.gz   wikidata.sqlite3.db
claims.tsv.gz		labels.tsv.gz
descriptions.en.tsv.gz	qualifiers.tsv.gz


**Concatenate all output files together** <br>

**NOTE:** The `metadata.property.datatypes` and `metadata.types` are not currently generated by this notebook, and have been copied from `wikidata-20200803-v5/data`. <br>
**TODO:** We must confirm if these are source files, or computed. If they are computed, we should compute them in this notebook.

In [38]:
!{kgtk} cat \
-i {out}/aliases.en.tsv.gz \
-i {out}/descriptions.en.tsv.gz \
-i {out}/qualifiers.tsv.gz \
-i {out}/claims.tsv.gz \
-i {out}/labels.en.tsv.gz \
-i {datatypes} \
-i {metadata} \
-o {out}/all.tsv.gz


real	19m56.099s
user	19m52.061s
sys	0m0.946s


In [39]:
!ls {os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb"}

/nas/home/mbmann/kgtk_subset/kgtk/examples/partition-wikidata.ipynb


In [40]:
os.environ["OUT"]

'/nas/home/mbmann/subset2/output'

### Call the partition-wikidata notebook
`partition-wikidata` takes the concatenated output in `all.tsv.gz` and splits it into separate files for each kind of Wikidata content (i.e. claims, aliases, labels, descriptions, qualifiers). <br>

**NOTE:** This notebook also produces `claims.wikibase-item.tsv.gz` and outputs it to `wikidata_parts_path`

In [41]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["OUT"] + "/all.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
)
;

Executing:   0%|          | 0/49 [00:00<?, ?cell/s]

''

## Call the useful-files notebook
`Wikidata Useful Files` takes the intermediary output generated by `partition-wikidata` and produces the following files: `derived.P31.tsv.gz`, `derived.P279.tsv.gz`, `derived.isa.tsv.gz`, `derived.P279star.tsv.gz`, and `metadata.out_degree.tsv.gz`.

In [42]:
#NOTE: Don't pass in cache path, as one doesn't yet exist for useful_files
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Wikidata Useful Files.ipynb",
    os.environ["TEMP"] + "/Wikidata Useful Files Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        output_folder = "useful_files",
        temp_folder = "temp.useful_files",
        wiki_root_folder = os.environ["OUT"] + "/parts/",
        languages = 'en',
        compute_pagerank = True,
        delete_database = False
    )
)
;

Executing:   0%|          | 0/103 [00:00<?, ?cell/s]

''

## Summary of results

In [43]:
!ls -lh {out}/*wikidataos.*

ls: cannot access /nas/home/mbmann/subset2/output/*wikidataos.*: No such file or directory


In [44]:
!zcat < {out}/wikidataos.all.tsv.gz | wc

/bin/bash: /nas/home/mbmann/subset2/output/wikidataos.all.tsv.gz: No such file or directory
      0       0       0


## Verification

The edges file must contain edges for properties; this was not the case on 2020-11-10.


In [45]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(:P10)-[l]->(n2)' \
--limit 10

[2021-02-17 10:04:28 sqlstore]: IMPORT graph directly into table graph_5 from /nas/home/mbmann/subset2/output/claims.tsv.gz ...
[2021-02-17 10:05:11 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."node1"=?
     LIMIT ?
  PARAS: ['P10', 10]
---------------------------------------------
[2021-02-17 10:05:11 sqlstore]: CREATE INDEX on table graph_5 column node1 ...
[2021-02-17 10:05:24 sqlstore]: ANALYZE INDEX on table graph_5 column node1 ...
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase

## Concatenate files to get the `all` file

In [46]:
!{kgtk} cat -i {out}/claims.tsv.gz \
{out}/qualifiers.tsv.gz \
{out}/useful_files/metadata.pagerank.undirected.tsv.gz \
{out}/useful_files/metadata.pagerank.directed.tsv.gz \
{out}/useful_files/metadata.in_degree.sorted.tsv.gz \
{out}/useful_files/metadata.out_degree.sorted.tsv.gz \
-o {out}/wikidataos.all.tsv.gz


real	11m42.484s
user	11m40.735s
sys	0m0.750s


## Concatenate files to get the `all for triples` file


In [47]:
!{kgtk} cat -i {out}/wikidataos.all.tsv.gz \
{out}/useful_files/derived.P31.tsv.gz \
{out}/useful_files/derived.P279.tsv.gz \
{out}/useful_files/derived.isa.tsv.gz \
{out}/useful_files/derived.P279star.tsv.gz \
-o {out}/wikidataos.all.for.triples.tsv.gz


real	16m4.981s
user	16m2.844s
sys	0m0.934s


## Filter out `novalue`, `somevalue` and `P9`
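
The intended effect of this step can be sketched in plain Python. The sketch assumes that the filter pattern used in the next cell, `;;somevalue,novalue,P9`, has the form `node1;label;node2`, so that the listed values are compared with `node2`, and that `--invert` keeps the edges that do not match. `filter_out_node2` is a hypothetical helper and the toy ids are made up.

```python
def filter_out_node2(edges, unwanted=("somevalue", "novalue", "P9")):
    """Drop edges whose node2 is one of the unwanted values."""
    unwanted_values = set(unwanted)
    return [edge for edge in edges if edge[2] not in unwanted_values]


# Toy edges as (node1, label, node2); the Q-ids are made up.
triple_edges = [
    ("Qa", "P31", "Qb"),
    ("Qa", "P570", "somevalue"),
    ("Qc", "P40", "novalue"),
]

filtered = filter_out_node2(triple_edges)
print(filtered)
```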

In [48]:
!{kgtk} filter -i {out}/wikidataos.all.for.triples.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.tsv.gz \
-p ';;somevalue,novalue,P9' --invert


real	14m42.770s
user	14m41.149s
sys	0m0.686s


## Add ids for any edge with missing id

In [49]:
!{kgtk} add-id -i {out}/wikidataos.all.for.triples.filtered.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.id.tsv.gz \
--id-style wikidata


real	15m36.013s
user	15m32.033s
sys	0m0.741s


## Sort by `id`

In [50]:
!{kgtk} sort2 -i {out}/wikidataos.all.for.triples.filtered.id.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.id.sorted.tsv.gz \
-c id


real	4m8.597s
user	4m5.535s
sys	0m20.281s
