# Generating Subsets of Wikidata


## Purpose

This notebook is used to create smaller subgraphs from a larger input Wikidata graph. Notebook users can provide a list of Wikidata classes (**QNodes**) to remove and preserve to create desired subsets of Wikidata. 

## Prerequisite input data

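**`wiki_root_folder`** : This folder should contain all of the input Wikidata files (claims, labels, aliases, descriptions, qualifiers, and metadata)

**`useful_files_output_folder`** : This folder should contain the files precomputed from the contents of `wiki_root_folder` (`derived.isa.tsv.gz` and `derived.P279star.tsv.gz`)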

### Batch Invocation
Example batch command. The second argument is the notebook where the output will be stored; you can open it to monitor progress. Running the notebook through papermill is recommended when the input data is very large (GB scale), because the notebook takes a long time to finish.

```
papermill 'Wikidata Subsets.ipynb' 'Wikidata Subsets.out.ipynb' \
-p output_path /nas/home/mbmann/subset2 \
-p output_folder output \
-p temp_folder temp.output \
-p wiki_root_folder /nas/home/mbmann/KGTK-public-graphs2/wikidata-20201130/data/ \
-p useful_files_output_folder /nas/home/mbmann/useful_files_output/output/useful_files/ \
-p notebooks_folder /nas/home/mbmann/kgtk_subset/kgtk/examples/ \
-p useful_files_notebook 'Wikidata Useful Files.ipynb' \
-p languages en,
```
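The same invocation can also be assembled from Python, which avoids shell quoting mistakes with notebook names that contain spaces. This is a minimal sketch: `papermill_command` is a hypothetical helper, not part of papermill, and only a few of the parameters above are shown.

```python
import shlex

def papermill_command(notebook, output_notebook, parameters):
    """Build the argument list for a `papermill` CLI call (hypothetical helper)."""
    args = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        # Each notebook parameter is passed as: -p <name> <value>
        args += ["-p", name, str(value)]
    return args

cmd = papermill_command(
    "Wikidata Subsets.ipynb",
    "Wikidata Subsets.out.ipynb",
    {"output_folder": "output", "temp_folder": "temp.output", "languages": "en,"},
)

# shlex.join shows the command as it would be typed in a shell, with quoting added
print(shlex.join(cmd))
```

Passing `cmd` as a list to `subprocess.run(cmd, check=True)` would run it without going through a shell at all.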

In [None]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import papermill as pm

import gzip

In [None]:
# Parameters

# Folder on local machine where to create the output and temporary folders
output_path = "/nas/home/mbmann/subset2"

# The names of the output and temporary folders
output_folder = "output"
temp_folder = "temp.output"

# The location of input files
wiki_root_folder = "/nas/home/mbmann/kgtk/datasets/wikidataos-v4-mm-2"
# wiki_root_folder = "/nas/home/mbmann/KGTK-public-graphs2/wikidata-20201130/data/"

# The location of useful_files output
useful_files_output_folder = "/nas/home/mbmann/useful_files_output/output/useful_files/"

claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"
qual_file = "qualifiers.tsv.gz"
property_datatypes_file = "metadata.property.datatypes.tsv.gz"
metadata_file = "metadata.types.tsv.gz" 
isa_file = "derived.isa.tsv.gz" #Preprocessed
p279star_file = "derived.P279star.tsv.gz" #Preprocessed

# Useful files Jupyter notebook
useful_files_notebook = "Wikidata Useful Files.ipynb"
notebooks_folder = "/nas/home/mbmann/kgtk_subset/kgtk/examples/"

# Location of the cache database for kypher
cache_path = f'{output_path}/{output_folder}'

#Additional parameters
delete_database = "no"
compute_pagerank = "no"
languages = "en,"

# Whether to delete cache database
if delete_database and delete_database.lower().strip() == 'yes':
    delete_database = True
else:
    delete_database = False

#Whether to compute pagerank
if compute_pagerank and compute_pagerank.lower().strip() == 'yes':
    compute_pagerank = True
else:
    compute_pagerank = False

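if languages:
    # Drop empty entries so that "en," yields ['en'] rather than ['en', '']
    languages = [lang.strip() for lang in languages.split(',') if lang.strip()]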
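Because papermill passes these parameters as strings, the normalization above can be collected into two small helpers. The sketch below is one possible way to do it; `parse_flag` and `parse_languages` are hypothetical names that do not appear elsewhere in this notebook.

```python
def parse_flag(value):
    """Return True only for a case-insensitive 'yes' (surrounding whitespace ignored)."""
    return bool(value) and str(value).strip().lower() == "yes"

def parse_languages(value):
    """Split a comma-separated string into language codes, dropping blank entries."""
    if not value:
        return []
    return [code.strip() for code in value.split(",") if code.strip()]

print(parse_flag("Yes "), parse_flag("no"), parse_flag(None))
print(parse_languages("en,"), parse_languages("en, es"))
```

Keeping the parsing in one place makes it easier to treat values such as `"en,"` consistently across notebooks.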

Confirm if the system has zcat installed, so zcat commands can be run below.

In [None]:
exit_code = os.system("which zcat")
if exit_code == 0:
    print("PASS: zcat is available and will be used.")
else:
    raise Exception("FAIL: zcat is a requirement, please install zcat to run this notebook in full.")
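An alternative to shelling out to `which` is the standard-library `shutil.which`, which returns the resolved path or `None`. The following is a small sketch of that approach; `require_tool` is a hypothetical helper name.

```python
import shutil

def require_tool(name):
    """Return the full path of an executable on PATH, or raise if it is missing."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"FAIL: {name} is required but was not found on PATH.")
    return path
```

Calling `require_tool("zcat")` at the top of the notebook would stop execution early with a clear message if the tool is missing.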

## Set up variables for files

In [None]:
#Python and Environment variables
if cache_path:
    store = "{}/wikidata.sqlite3.db".format(cache_path)
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    store = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

out = "{}/{}".format(output_path, output_folder)
temp = "{}/{}".format(output_path, temp_folder)

# os.path.join works whether or not the folder paths end with a slash
claims = os.path.join(wiki_root_folder, claims_file)
labels = os.path.join(wiki_root_folder, label_file)
aliases = os.path.join(wiki_root_folder, alias_file)
descriptions = os.path.join(wiki_root_folder, description_file)
items = os.path.join(wiki_root_folder, item_file)
quals = os.path.join(wiki_root_folder, qual_file)
datatypes = os.path.join(wiki_root_folder, property_datatypes_file)
metadata = os.path.join(wiki_root_folder, metadata_file)
isa = os.path.join(useful_files_output_folder, isa_file) #Preprocessed
p279star = os.path.join(useful_files_output_folder, p279star_file) #Preprocessed

# shortcuts to commands
kgtk_path = "~/anaconda3/envs/kgtk-subset/bin/kgtk"
kgtk = f'time {kgtk_path} --debug'
kypher = f"{kgtk_path} query --debug --graph-cache " + store

Confirm that the precomputed files are available in **useful_files_output_folder**

In [None]:
if os.path.isfile(isa) and os.path.isfile(p279star):
    print("PASS: Precomputed input files exist and will be used.")
else: 
    raise Exception("FAIL: Precomputed input files do not exist. Please create them first.")
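When the check fails, it helps to know which file is missing. The sketch below reports the missing paths by name; `missing_files` is a hypothetical helper, and the temporary directory only stands in for `useful_files_output_folder`.

```python
import os
import tempfile

def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [path for path in paths if not os.path.isfile(path)]

# Demo in a throwaway directory: only one of the two expected files is created
with tempfile.TemporaryDirectory() as folder:
    isa_path = os.path.join(folder, "derived.isa.tsv.gz")
    p279star_path = os.path.join(folder, "derived.P279star.tsv.gz")
    open(isa_path, "w").close()
    missing = missing_files([isa_path, p279star_path])
    missing_names = [os.path.basename(path) for path in missing]

print(missing_names)
```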

Create the subfolders for the output files and the temporary files under `output_path`

In [None]:
!mkdir -p {out}
!mkdir -p {temp}

Clean up the output and temp folders before we start

**NOTE:** This command will delete the previous output from `temp` and `output` folders.

In [None]:
!rm -f {out}/*.tsv {out}/*.tsv.gz
!rm -rf {out}/parts {out}/temp.useful_files {out}/useful_files
!rm -f {temp}/*.tsv {temp}/*.tsv.gz

We can preserve the pre-existing cache database, if desired with **delete_database**
- **delete_database** = `yes` if we want to create a new database
- **delete_database** =  `no` if we want to use pre-existing cache database

In [None]:
if delete_database:
    !rm {store}

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect

In [None]:
!{kypher} -i {claims} \
--match '()-[]->()' \
--limit 10

## Creating a list of all the items to remove

**[REQUIRED] Add classes to remove, given a list of classes** <br>
- **Example:** Let's remove the class (scholarly article, 'Q13442814')
- **NOTE:** This will only remove items that have a `P31_P279star` relation with the class

**[OPTIONAL] Add instances to remove, given a list of instances** <br>
- **Example:** Let's remove instances (Fireball, 'Q5451712'), (Bush, 'Q1017471'), and (Italian Grape Ale, 'Q67772833')

In [None]:
classes_to_remove = ['Q34508', 'Q7378']
# classes_to_remove = ['Q13442814', 'Q523', 'Q318', 'Q7318358', 'Q7187', 'Q11173', 'Q8054'] #Parameter: Add classes manually here
instances_to_remove = []

classes = ', '.join([f'"{c}"' for c in classes_to_remove])
instances = ', '.join([f'"{c}"' for c in instances_to_remove])
print('classes: ', classes)
print('instances: ', instances)

### Compute the items to be removed via classes

First look at the classes we will remove

1. Given all classes in `classes_to_remove`, find all of their subclasses using `p279star`. <br>
2. Given those subclasses, find all of their instances using `isa`.
3. The resulting items to remove will be in `{temp}/items.remove.byclass.tsv.gz`

In [None]:
if classes != '' and instances == '':
    print('Finding items to remove based on classes only.')
    cmd = f'''
    {kypher} -i {p279star} -i {isa} \
    --match 'isa: (item)-[:isa]->(subclass), P279star: (subclass)-[:P279star]->(c)' \
    --return 'distinct item, "p31_p279star" as label, c as node2' \
    --where 'c in [{classes}]' \
    -o {temp}/items.remove.byclass.tsv.gz
    '''
    !{cmd}
else:
    print('Finding items to remove based on classes and instances.')
    cmd = f'''
    {kypher} -i {p279star} -i {isa} \
    --match 'isa: (item)-[:isa]->(subclass), P279star: (subclass)-[:P279star]->(c)' \
    --return 'distinct item, "p31_p279star" as label, c as node2' \
    --where 'c in [{classes}] OR item in [{instances}]' \
    -o {temp}/items.remove.byclass.tsv.gz
    '''
    !{cmd}
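The join performed by the query above can be illustrated in plain Python on a toy graph. This is a sketch, not the Kypher implementation: the identifiers are made up, and it assumes that the `P279star` file contains a reflexive edge for every class, so that direct instances of a target class are also matched.

```python
from collections import defaultdict

# Toy edges (hypothetical identifiers): isa is (item, class), P279star is (class, ancestor)
isa_edges = [("Q_item1", "Q_sub"), ("Q_item2", "Q_other"), ("Q_item3", "Q_cls")]
p279star_edges = [
    ("Q_sub", "Q_cls"),
    ("Q_sub", "Q_sub"),
    ("Q_cls", "Q_cls"),
    ("Q_other", "Q_other"),
]

def items_to_remove(isa_edges, p279star_edges, classes, instances=()):
    """Mirror the query: join isa with P279star, then filter on class or instance."""
    classes, instances = set(classes), set(instances)
    ancestors = defaultdict(set)
    for subclass, ancestor in p279star_edges:
        ancestors[subclass].add(ancestor)
    rows = set()  # a set gives the 'distinct' of the --return clause
    for item, item_class in isa_edges:
        for c in ancestors.get(item_class, ()):
            if c in classes or item in instances:
                rows.add((item, "p31_p279star", c))
    return sorted(rows)

removed = items_to_remove(isa_edges, p279star_edges, classes=["Q_cls"])
print(removed)
```

Here `Q_item1` is selected through its subclass `Q_sub`, `Q_item3` is selected as a direct instance, and `Q_item2` is kept.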

Check the result

In [None]:
!zcat {temp}/items.remove.byclass.tsv.gz | head

### Compute the items to be removed via out degree

Specify an out-degree threshold `k`, and identify items with out-degree less than `k`
- Ex: Find items that have out-degree less than 2.

Create a list of items that have out_degree < `k`, along with any parent classes they belong to. <br>
Put the results into `items.remove.bydegree.tsv.gz`. <br>

In [None]:
# k = 2 #Parameter
# !{kypher} -i {useful_files_output_folder}metadata.out_degree.sorted.tsv.gz \
# --match 'out: (item)-[:out_degree]->(n2)' \
# --where "cast(n2, integer) <= {k} and upper(substr(item,0)) >= 'Q'" \
# --return 'distinct item, "out_degree" as label, n2 as node2' \
# -o {temp}/items.remove.bydegree.tsv.gz \

# !zcat {temp}/items.remove.bydegree.tsv.gz | head
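The selection by out-degree can be sketched in plain Python as follows. The sketch uses the strict comparison `< k` described in the text (the commented-out query uses `<=`), and it keeps only identifiers that start with `Q`, which is an assumption about the intent of the query's filter. The edges are toy data.

```python
from collections import Counter

# Toy edges as (node1, label, node2); identifiers are for illustration only
edges = [
    ("Q1", "P31", "Q5"),
    ("Q1", "P21", "Q6"),
    ("Q2", "P31", "Q5"),
    ("P10", "P31", "Q18616576"),
]

def low_out_degree_items(edges, k):
    """Return Q-items whose number of outgoing edges is strictly less than k."""
    out_degree = Counter(node1 for node1, _label, _node2 in edges)
    return sorted(
        item for item, degree in out_degree.items()
        if degree < k and item.startswith("Q")
    )

print(low_out_degree_items(edges, k=2))
```

Items with no outgoing edges at all never appear as `node1`, so they are not counted by this sketch.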

### Combine the items to remove by-class and by-outdegree
Concatenate all items from `items.remove.byclass` and `items.remove.bydegree`.
The resulting list of items to remove will be `items.remove`.

In [None]:
!{kgtk} cat -i {temp}/items.remove.*.tsv.gz \
-o {temp}/items.remove.tsv.gz
!zcat {temp}/items.remove.tsv.gz | head

Deduplicate the concatenated file of items to remove. <br>
The resulting list of items to remove will be `items.remove2`.

In [None]:
!{kgtk} sort2 -i {temp}/items.remove.tsv.gz -o {temp}/items.remove.sorted.tsv.gz

In [None]:
!{kgtk} compact -i {temp}/items.remove.sorted.tsv.gz -o {temp}/items.remove2.tsv.gz \
--columns 'node1'
!zcat {temp}/items.remove2.tsv.gz | head
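The effect of sorting and then compacting on `node1` can be pictured as collapsing all rows that share a `node1` into one row. The sketch below is only an approximation of that idea, under the assumption that differing values in the other columns are merged into `|`-separated lists; the rows are toy data.

```python
# Toy rows as (node1, label, node2); identifiers are for illustration only
rows = [
    ("Q2", "p31_p279star", "Q_a"),
    ("Q1", "p31_p279star", "Q_a"),
    ("Q1", "p31_p279star", "Q_b"),
    ("Q1", "p31_p279star", "Q_a"),
]

def compact_by_node1(rows):
    """Collapse rows with the same node1, joining distinct values with '|'."""
    merged = {}
    for node1, label, node2 in sorted(rows):
        labels, node2s = merged.setdefault(node1, ([], []))
        if label not in labels:
            labels.append(label)
        if node2 not in node2s:
            node2s.append(node2)
    return [
        (node1, "|".join(labels), "|".join(node2s))
        for node1, (labels, node2s) in merged.items()
    ]

compacted = compact_by_node1(rows)
print(compacted)
```

After this step each item to remove appears exactly once in the `node1` column, which is what the later filtering steps rely on.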

### Validate the items we will remove
Check the `items.remove2` file for items added via different methods: 1) by-class, 2) by-instance, 3) by-outdegree

1) Check for a class added manually, e.g. (scholarly article, 'Q13442814')

In [None]:
# !zgrep 'Q13442814' {temp}/items.remove2.tsv.gz | head

2) Check for an item added by-instance, e.g. (Fireball, 'Q5451712')

In [None]:
# !zgrep 'Q5451712' {temp}/items.remove2.tsv.gz | head

3) Check for class added by-outdegree, i.e. (??, 'Q100000030')

In [None]:
# !zgrep 'Q100000030' {temp}/items.remove2.tsv.gz | head

Count the distinct items we will remove and preview the first few, just as a sanity check

In [None]:
!{kypher} -i {temp}/items.remove2.tsv.gz \
--match '(n1)-[]->()' \
--return 'count(distinct n1)'

!{kypher} -i {temp}/items.remove2.tsv.gz \
--match '(n1)-[]->()' \
--return 'distinct n1' \
--limit 10

## [TODO] Create a list of all items to protect

## Create the reduced edges file

### Remove the items from the claims, label, alias and description files
We will be left with `reduced` files where the edges do not have the unwanted items. For the claims, we have to remove the items from both the node1 and node2 positions, so we need to run the ifnotexists command twice.

Before we start, preview the files to see the column headings and check whether they look sorted.

In [None]:
!{kgtk} sort2 -i {temp}/items.remove2.tsv.gz -o {temp}/items.remove2.sorted.tsv.gz

In [None]:
!zcat < {temp}/items.remove2.sorted.tsv.gz | head | col

Remove from the full set of edges those edges that have a `node1` present in `items.remove2.sorted.tsv.gz`

In [None]:
!zcat {temp}/items.remove2.tsv.gz | head

In [None]:
!{kgtk} ifnotexists -i {claims} -o {temp}/item.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 

From the remaining edges, remove those that have a `node2` present in `items.remove2.sorted.tsv.gz`

In [None]:
!{kgtk} sort2 -i {temp}/item.edges.reduced.tsv.gz -o {temp}/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id

In [None]:
!{kgtk} ifnotexists -i {temp}/item.edges.reduced.sorted.tsv.gz -o {temp}/item.edges.reduced.2.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 
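The two `ifnotexists` passes above amount to an anti-join applied first on `node1` and then on `node2`. A minimal in-memory sketch of the same logic, using toy edges and a hypothetical `drop_edges` helper:

```python
# Toy edges as (node1, label, node2); identifiers are for illustration only
edges = [
    ("Q1", "P31", "Q5"),
    ("Q2", "P50", "Q1"),
    ("Q3", "P921", "Q2"),
    ("Q4", "P31", "Q5"),
]
items_to_remove = {"Q2"}

def drop_edges(edges, removed):
    """Drop every edge that mentions a removed item as node1 or as node2."""
    removed = set(removed)
    # Pass 1: discard edges whose node1 is a removed item
    after_node1 = [edge for edge in edges if edge[0] not in removed]
    # Pass 2: from what remains, discard edges whose node2 is a removed item
    return [edge for edge in after_node1 if edge[2] not in removed]

reduced = drop_edges(edges, items_to_remove)
print(reduced)
```

The KGTK commands stream over sorted files instead of holding a set in memory, which is why the claims are re-sorted on `node2` before the second pass.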

Create a file with the labels

In [None]:
!{kgtk} ifnotexists -i {labels} -o {temp}/label.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

In [None]:
!{kgtk} sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.tsv.gz

Create a file with the aliases

In [None]:
!{kgtk} ifnotexists -i {aliases} -o {temp}/alias.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

Create a file with the descriptions

In [None]:
!{kgtk} ifnotexists -i {descriptions} -o {temp}/description.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove2.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

### Produce the output files for claims, labels, aliases and descriptions

In [None]:
!{kgtk} sort2 -i {temp}/item.edges.reduced.2.tsv.gz -o {out}/claims.tsv.gz 

In [None]:
!{kgtk} sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.en.tsv.gz 

In [None]:
!{kgtk} sort2 -i {temp}/alias.edges.reduced.tsv.gz -o {out}/aliases.en.tsv.gz 

In [None]:
!{kgtk} sort2 -i {temp}/description.edges.reduced.tsv.gz -o {out}/descriptions.en.tsv.gz 

## Tests: Confirm items were removed from claims file
**NOTE:** We will check that the items we removed no longer appear in the claims file

1) Confirm no instance of class added manually, i.e. (scholarly article, 'Q13442814')

In [None]:
#TEST: This should not return anything, as it has been removed from claims.tsv.gz
# !zgrep 'Q13442814' {out}/claims.tsv.gz 

2) Confirm no occurrence of an instance added manually, e.g. (Fireball, 'Q5451712')

In [None]:
# !zgrep 'Q5451712	' {out}/claims.tsv.gz #PASS: No result

3) Confirm no instance with out-degree < 2, i.e. (instance: 'Q100000030')

In [None]:
# !zgrep 'Q100000030	' {out}/claims.tsv.gz #PASS: No result

## Create the reduced qualifiers file
We do this by collecting the ids of the edges in the reduced claims file, and then selecting the matching rows from the qualifiers file (`{quals}`)

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `{quals}` 
- `{out}/claims.tsv.gz` 

In [None]:
!zcat < {quals} | head | column -t -s $'\t' 

Run `ifexists` to select the qualifiers for the edges in `{out}/claims.tsv.gz` and write them to `{out}/qualifiers.tsv.gz`. Note that we match `node1` in the qualifier file to `id` in the claims file.

In [None]:
!{kgtk} ifexists -i {quals} -o {out}/qualifiers.tsv.gz \
--filter-on {out}/claims.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted
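Conceptually, this step is a semi-join: a qualifier row is kept only if its `node1` equals the `id` of a surviving claim. A small sketch with toy rows (the edge ids are made up, and `keep_qualifiers` is a hypothetical helper):

```python
# Toy rows; the edge ids are made up for illustration
claims = [
    {"id": "Q1-P31-Q5-0", "node1": "Q1", "label": "P31", "node2": "Q5"},
    {"id": "Q4-P31-Q5-0", "node1": "Q4", "label": "P31", "node2": "Q5"},
]
qualifiers = [
    {"id": "Q1-P31-Q5-0-P580-0", "node1": "Q1-P31-Q5-0", "label": "P580", "node2": "^2001"},
    {"id": "Q2-P50-Q1-0-P1545-0", "node1": "Q2-P50-Q1-0", "label": "P1545", "node2": "1"},
]

def keep_qualifiers(qualifiers, claims):
    """Keep qualifier rows whose node1 is the id of a claim that still exists."""
    claim_ids = {claim["id"] for claim in claims}
    return [qualifier for qualifier in qualifiers if qualifier["node1"] in claim_ids]

kept = keep_qualifiers(qualifiers, claims)
print([qualifier["id"] for qualifier in kept])
```

The second qualifier is dropped because the claim it annotates was removed in the previous section.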

Look at the final output for qualifiers

In [None]:
!zcat < {out}/qualifiers.tsv.gz | head | col

## Call partition and useful_files notebooks, to generate the file output

In [None]:
kgtk_scripts_path = "/nas/home/mbmann/kgtk_subset/kgtk"
os.environ["EXAMPLES_DIR"] = kgtk_scripts_path + "/examples"
os.environ["USECASE_DIR"] = kgtk_scripts_path + "/use-cases"
os.environ["TEMP"] = temp
os.environ["OUT"] = out
os.environ["DATATYPES"] = datatypes
os.environ["METADATA"] = metadata

In [None]:
os.environ["EXAMPLES_DIR"]

In [None]:
!ls "$TEMP"

In [None]:
!ls "$OUT"

**Concatenate all output files together** <br>

**NOTE:** The `metadata.property.datatypes` and `metadata.types` are not currently generated by this notebook, and have been copied from `wikidata-20200803-v5/data`. <br>
**TODO:** We must confirm if these are source files, or computed. If they are computed, we should compute them in this notebook.

In [None]:
!{kgtk} cat \
-i {out}/aliases.en.tsv.gz \
-i {out}/descriptions.en.tsv.gz \
-i {out}/qualifiers.tsv.gz \
-i {out}/claims.tsv.gz \
-i {out}/labels.en.tsv.gz \
-i {datatypes} \
-i {metadata} \
-o {out}/all.tsv.gz

In [None]:
!ls {os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb"}

In [None]:
os.environ["OUT"]

### Call the partition-wikidata notebook
`partition-wikidata` takes the combined `all.tsv.gz` file and splits it into separate files by type of edge (e.g. claims, aliases, labels, descriptions, qualifiers). <br>

**NOTE:** This notebook also produces `claims.wikibase-item.tsv.gz` and outputs it to `wikidata_parts_path`

In [None]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["OUT"] + "/all.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
);

## Call the useful-files notebook
`Wikidata Useful Files` takes the output generated by `partition-wikidata` and produces derived files, including `derived.P31.tsv.gz`, `derived.P279.tsv.gz`, `derived.isa.tsv.gz`, `derived.P279star.tsv.gz`, and `metadata.out_degree.tsv.gz`.

In [None]:
#NOTE: Don't pass in cache path, as one doesn't yet exist for useful_files
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Wikidata Useful Files.ipynb",
    os.environ["TEMP"] + "/Wikidata Useful Files Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        output_folder = "useful_files",
        temp_folder = "temp.useful_files",
        wiki_root_folder = os.environ["OUT"] + "/parts/",
        languages = 'en',
        compute_pagerank = True,
        delete_database = False
    )
);

## Summary of results

In [None]:
!ls -lh {out}/*.tsv.gz

In [None]:
!zcat < {out}/all.tsv.gz | wc

## Verification

The edges file must contain edges for properties; this was not the case on 2020-11-10.


In [None]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(:P10)-[l]->(n2)' \
--limit 10

## Concatenate files to get the `all` file

In [None]:
!{kgtk} cat -i {out}/claims.tsv.gz \
{out}/qualifiers.tsv.gz \
{out}/useful_files/metadata.pagerank.undirected.tsv.gz \
{out}/useful_files/metadata.pagerank.directed.tsv.gz \
{out}/useful_files/metadata.in_degree.sorted.tsv.gz \
{out}/useful_files/metadata.out_degree.sorted.tsv.gz \
-o {out}/wikidataos.all.tsv.gz

## Concatenate files to get the `all for triples` file


In [None]:
!{kgtk} cat -i {out}/wikidataos.all.tsv.gz \
{out}/useful_files/derived.P31.tsv.gz \
{out}/useful_files/derived.P279.tsv.gz \
{out}/useful_files/derived.isa.tsv.gz \
{out}/useful_files/derived.P279star.tsv.gz \
-o {out}/wikidataos.all.for.triples.tsv.gz

## Filter out `novalue`, `somevalue` and `P9`

In [None]:
!{kgtk} filter -i {out}/wikidataos.all.for.triples.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.tsv.gz \
-p ';;somevalue,novalue,P9' --invert
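As a rough illustration of what this filter does, the sketch below assumes that the pattern `;;somevalue,novalue,P9` constrains only the `node2` position, and that `--invert` keeps the edges that do not match. The edges are toy data and `filter_edges` is a hypothetical helper.

```python
# Values that the pattern matches in the node2 position (under the stated assumption)
EXCLUDED_NODE2 = {"somevalue", "novalue", "P9"}

# Toy edges as (node1, label, node2); identifiers are for illustration only
edges = [
    ("Q1", "P31", "Q5"),
    ("Q1", "P569", "somevalue"),
    ("Q3", "P40", "novalue"),
    ("Q4", "P1647", "P9"),
]

def filter_edges(edges, excluded=EXCLUDED_NODE2):
    """Keep only the edges whose node2 is not one of the excluded values."""
    return [edge for edge in edges if edge[2] not in excluded]

filtered = filter_edges(edges)
print(filtered)
```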

## Add ids for any edge with missing id

In [None]:
!{kgtk} add-id -i {out}/wikidataos.all.for.triples.filtered.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.id.tsv.gz \
--id-style wikidata

## Sort by `id`

In [None]:
!{kgtk} sort2 -i {out}/wikidataos.all.for.triples.filtered.id.tsv.gz \
-o {out}/wikidataos.all.for.triples.filtered.id.sorted.tsv.gz \
-c id