# Generating Subsets of Wikidata

### Batch Invocation
The following is an example batch command. The second argument is the notebook in which the executed output is stored; you can open it to monitor progress.

Note: the example invocation below is out of date and needs to be updated.


```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p wiki_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz \
-p qual_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p compute_pagerank no \
-p languages es,ru,zh-cn 
```

In [36]:
# Parameters

# Folder on local machine where to create the output and temporary folders
output_path = "/data/amandeep"

# The names of the output and temporary folders
output_folder = "wikidata-20210215-dwd"
temp_folder = "temp.wikidata-20210215-dwd"

# Classes to remove
remove_classes = "Q591041,Q523,Q318,Q7318358,Q7187,Q11173,Q8054"

# The location of input files
wiki_root_folder = "/data/amandeep/wikidata-20210215/"

claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"
qual_file = "qualifiers.tsv.gz"
property_datatypes_file = "metadata.property.datatypes.tsv.gz"
metadata_types_file = "metadata.types.tsv.gz"
isa_file = "derived.isa.tsv.gz"
p279star_file = "derived.P279star.tsv.gz"

# Location of the cache database for kypher
cache_path = "/data/amandeep/temp.wikidata-20210215-dwd"

# Whether to delete the cache database
### Needs fixing
delete_database = "no"
if delete_database and delete_database.lower().strip() == 'yes':
    delete_database = True
else:
    delete_database = False

compute_pagerank = "yes"
### Needs fixing
if compute_pagerank and compute_pagerank.lower().strip() == 'yes':
    compute_pagerank = True
else:
    compute_pagerank = False

# shortcuts to commands
kgtk = "time kgtk --debug"
# kgtk = "kgtk --debug"

# Useful files Jupyter notebook
useful_files_notebook = "Wikidata Useful Files.ipynb"
notebooks_folder = "/data/amandeep/Github/kgtk/use-cases/"

languages = "en,ru,es,zh-cn,de,it,nl,pl,fr,pt,sv"
if languages:
    languages = languages.split(',')
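The parameter cell above turns `yes`/`no` strings into booleans and splits the language list. A minimal sketch of the same conversions as reusable helpers, assuming a parameter may arrive either as a string or as a boolean; `parse_flag` and `parse_languages` are hypothetical names, not part of KGTK or papermill:

```python
def parse_flag(value):
    """Return True for the boolean True or a 'yes' string (case and spaces ignored)."""
    if isinstance(value, bool):
        return value
    return isinstance(value, str) and value.strip().lower() == "yes"


def parse_languages(value):
    """Split a comma-separated language string into a list, dropping empty entries."""
    if not value:
        return []
    return [lang.strip() for lang in value.split(",") if lang.strip()]


delete_database = parse_flag("no")
compute_pagerank = parse_flag(" Yes ")
languages = parse_languages("en,ru,es,zh-cn")
print(delete_database, compute_pagerank, languages)
```

Accepting booleans as well as strings means the same helper works when another notebook passes `True` or `False` directly.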

In [37]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt

import papermill as pm

## Set up variables for files

In [38]:
# Use the explicit cache path when given, otherwise fall back to the temp folder
if cache_path:
    store = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    store = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)
os.environ['STORE'] = store

out = "{}/{}".format(output_path, output_folder)
temp = "{}/{}".format(output_path, temp_folder)

kypher = "kgtk query --debug --graph-cache " + store

claims = wiki_root_folder + claims_file
items = wiki_root_folder + item_file
isa = wiki_root_folder + isa_file
quals = wiki_root_folder + qual_file
datatypes = wiki_root_folder + property_datatypes_file
metadata_types = wiki_root_folder + metadata_types_file
p279star = wiki_root_folder + p279star_file

labels = wiki_root_folder + label_file
aliases = wiki_root_folder + alias_file
descriptions = wiki_root_folder + description_file


kgtk_path = "/data/amandeep/Github/kgtk"
os.environ["EXAMPLES_DIR"] = kgtk_path + "/examples"
os.environ["USECASE_DIR"] = kgtk_path + "/use-cases"
os.environ['TEMP'] = temp
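The cell above builds paths by string concatenation, which relies on `wiki_root_folder` ending in a slash. A small sketch of the same logic with `os.path.join`, assuming POSIX-style paths; `store_path` and `input_path` are illustrative helper names and the folders are made up:

```python
import os


def store_path(cache_path, output_path, temp_folder, db_name="wikidata.sqlite3.db"):
    """Return the graph-cache location, preferring cache_path when it is set."""
    base = cache_path if cache_path else os.path.join(output_path, temp_folder)
    return os.path.join(base, db_name)


def input_path(root_folder, file_name):
    """Join a root folder and a file name; works with or without a trailing slash."""
    return os.path.join(root_folder, file_name)


with_cache = store_path("/data/cache", "/data/out", "temp.demo")
without_cache = store_path("", "/data/out", "temp.demo")
claims_demo = input_path("/data/wikidata", "claims.tsv.gz")
print(with_cache, without_cache, claims_demo)
```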

Go to the output directory and create the subfolders for the output files and the temporary files

In [4]:
cd $output_path

/data/amandeep


In [14]:
!mkdir -p {out}
!mkdir -p {temp}

Clean up the output and temp folders before we start

In [15]:
if delete_database:
    !rm {out}/*.tsv {out}/*.tsv.gz
    !rm {temp}/*.tsv {temp}/*.tsv.gz

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.

In [16]:
!zcat {claims} | head

id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item
P10-P1855-Q4504-a69d2c73-0	P10	P1855	Q4504	normal	wikibase-item

gzip: stdout: Broken pipe
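The header check can also be done in Python, which reads only the first line and so avoids the `Broken pipe` message. This is a sketch under the assumption that the inputs are gzipped, tab-separated files; `read_header` and `check_columns` are hypothetical helpers, and the demonstration runs on a small temporary file rather than on the real claims file:

```python
import gzip
import os
import tempfile


def read_header(path):
    """Return the column names from the first line of a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.readline().rstrip("\n").split("\t")


def check_columns(path, required):
    """Return the required columns that are missing from the file's header."""
    header = read_header(path)
    return [c for c in required if c not in header]


# Demonstration on a temporary file that mimics the claims header.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "claims.tsv.gz")
    with gzip.open(demo, "wt", encoding="utf-8") as f:
        f.write("id\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\n")
        f.write("P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\tnormal\twikibase-item\n")
    demo_header = read_header(demo)
    missing = check_columns(demo, ["id", "node1", "label", "node2"])

print(demo_header, missing)
```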


In [17]:
!{kypher} -i {claims} \
--match '()-[]->()' \
--limit 10

[2021-03-03 18:05:25 sqlstore]: IMPORT graph directly into table graph_1 from /data/amandeep/wikidata-20210215/claims.tsv.gz ...
[2021-03-03 18:47:35 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950	normal	wikibase-item
P10-P1855-Q4504-a69d2c73-0	P10	P1855	Q4504	normal	w

## Creating a list of all the items we want to remove

### Compute the items to be removed

First look at the classes we will remove

In [18]:
cmd = "wd u {}".format(" ".join(remove_classes.split(",")))
!{cmd}

id Q13442814
Label scholarly article
Description article in an academic publication, usually peer reviewed
subclass of (P279): scientific publication (Q591041) | article (Q191067) | scholarly work (Q55915575)

id Q523
Label star
Description astronomical object consisting of a luminous spheroid of plasma held together by its own gravity
instance of (P31): astronomical object type (Q17444909)
subclass of (P279): astronomical object (Q6999) | fusor (Q1027098)

id Q318
Label galaxy
Description astronomical structure
instance of (P31): astronomical object type (Q17444909)
subclass of (P279): deep-sky object (Q249389)

id

Compose the kypher command to remove the classes

In [19]:
!zcat < {isa} | head | col

node1	label	node2
P10	isa	Q18610173
P1000	isa	Q18608871
P1001	isa	Q15720608
P1001	isa	Q22984026
P1001	isa	Q22997934
P1001	isa	Q61719275
P1001	isa	Q70564278
P1002	isa	Q22963600
P1003	isa	Q19595382

gzip: stdout: Broken pipe


Run the command. The items to remove will be written to the file `{temp}/items.remove.tsv.gz`.

In [20]:
# Quote each class id so the list can be pasted into the kypher --where clause
classes = ", ".join('"{}"'.format(c) for c in remove_classes.replace(" ", "").split(","))
!{kypher}  -i {isa} -i {p279star} -o {temp}/items.remove.tsv.gz \
--match 'isa: (n1)-[:isa]->(c), P279star: (c)-[]->(class)' \
--where 'class in [{classes}]' \
--return 'distinct n1, "p31_p279star" as label, class as node2'


[2021-03-03 18:49:10 sqlstore]: IMPORT graph directly into table graph_2 from /data/amandeep/wikidata-20210215/derived.isa.tsv.gz ...
[2021-03-03 18:50:26 sqlstore]: IMPORT graph directly into table graph_3 from /data/amandeep/wikidata-20210215/derived.P279star.tsv.gz ...
[2021-03-03 18:52:29 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_2_c1."node1", ? "_aLias.label", graph_3_c2."node2" "_aLias.node2"
     FROM graph_2 AS graph_2_c1, graph_3 AS graph_3_c2
     WHERE graph_2_c1."label"=?
     AND graph_2_c1."node2"=graph_3_c2."node1"
     AND (graph_3_c2."node2" IN (?, ?, ?, ?, ?, ?, ?))
  PARAS: ['p31_p279star', 'isa', 'Q13442814', 'Q523', 'Q318', 'Q7318358', 'Q7187', 'Q11173', 'Q8054']
---------------------------------------------
[2021-03-03 18:52:29 sqlstore]: CREATE INDEX on table graph_2 column node2 ...
[2021-03-03 18:53:42 sqlstore]: ANALYZE INDEX on table graph_2 column node2 ...
[2021-03-03 18:53:49 sqlstore]: CREATE INDEX on t
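The query above joins the `isa` edges with the `P279star` edges and keeps every item whose class has one of the unwanted classes as an ancestor. A minimal in-memory sketch of that join, on made-up toy edges (`C1`, `C3` and the `Q1`..`Q4` items are illustrative, not real data); it assumes the `P279star` input also contains a reflexive edge for each class, as the toy data does:

```python
def items_to_remove(isa_edges, p279star_edges, remove_classes):
    """Join (item, class) edges with (class, ancestor) edges on the class.

    Returns sorted (item, "p31_p279star", ancestor) triples for every ancestor
    that is listed in remove_classes.
    """
    remove = set(remove_classes)
    # Index: class -> ancestors of that class that are marked for removal.
    unwanted_ancestors = {}
    for cls, ancestor in p279star_edges:
        if ancestor in remove:
            unwanted_ancestors.setdefault(cls, set()).add(ancestor)
    result = set()
    for item, cls in isa_edges:
        for ancestor in unwanted_ancestors.get(cls, ()):
            result.add((item, "p31_p279star", ancestor))
    return sorted(result)


isa_edges = [("Q1", "C1"), ("Q2", "C2"), ("Q3", "C3"), ("Q4", "Q523")]
p279star_edges = [
    ("C1", "C1"), ("C1", "Q523"),   # C1 is a subclass of Q523
    ("C2", "C2"),                   # C2 has no unwanted ancestor
    ("C3", "C3"), ("C3", "Q318"),   # C3 is a subclass of Q318
    ("Q523", "Q523"),               # reflexive edge catches direct instances
]
removed = items_to_remove(isa_edges, p279star_edges, ["Q523", "Q318"])
print(removed)
```

`Q2` is kept because none of its class's ancestors is in the removal list.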

Preview the file

In [21]:
!zcat < {temp}/items.remove.tsv.gz | head | col

node1	label	node2
Q100000005	p31_p279star	Q13442814
Q100000009	p31_p279star	Q13442814
Q100000015	p31_p279star	Q13442814
Q100000022	p31_p279star	Q13442814
Q100000031	p31_p279star	Q13442814
Q100000044	p31_p279star	Q13442814
Q100000056	p31_p279star	Q13442814
Q100000066	p31_p279star	Q13442814
Q100000074	p31_p279star	Q13442814

gzip: stdout: Broken pipe


In [22]:
!zcat < {temp}/items.remove.tsv.gz | wc

50813634 152440902 1624655207


In [23]:
!zcat < {temp}/items.remove.tsv.gz | grep -P '^Q502268\t'

In [24]:
!zcat < {temp}/items.remove.tsv.gz | grep -P '^Q15874936\t'
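The two `grep` checks above look up individual items in the removal list. The same lookup can be sketched in Python, which makes the tab-separated field boundary explicit. `find_items` is a hypothetical helper; for the real file the lines would come from `gzip.open(path, "rt")` rather than from the in-memory sample used here:

```python
import io


def find_items(lines, wanted):
    """Return the rows (as lists of fields) whose first column is a wanted id."""
    wanted = set(wanted)
    found = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in wanted:
            found.append(fields)
    return found


demo_text = (
    "node1\tlabel\tnode2\n"
    "Q100000005\tp31_p279star\tQ13442814\n"
    "Q100000009\tp31_p279star\tQ13442814\n"
)
absent = find_items(io.StringIO(demo_text), ["Q502268", "Q15874936"])
present = find_items(io.StringIO(demo_text), ["Q100000009"])
print(absent, present)
```

An empty result means the item is not scheduled for removal.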

Collect all the classes of items we will remove, just as a sanity check

In [25]:
!{kypher} -i {temp}/items.remove.tsv.gz \
--match '()-[]->(n2)' \
--return 'distinct n2' \
--limit 10

[2021-03-03 19:11:08 sqlstore]: IMPORT graph directly into table graph_4 from /data/amandeep/temp.wikidata-20210215-dwd/items.remove.tsv.gz ...
[2021-03-03 19:11:52 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_4_c1."node2"
     FROM graph_4 AS graph_4_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
node2
Q13442814
Q7187
Q11173
Q8054
Q523
Q7318358
Q318
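Beyond listing the distinct classes, it can be useful to see how many items each class contributes. A sketch with `collections.Counter`, assuming rows with a `node2` column as in `items.remove.tsv.gz`; the sample rows are invented for the demonstration:

```python
import csv
import io
from collections import Counter


def count_by_class(lines):
    """Count rows per value of the node2 column in a tab-separated edge file."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["node2"] for row in reader)


sample = io.StringIO(
    "node1\tlabel\tnode2\n"
    "Q100000005\tp31_p279star\tQ13442814\n"
    "Q100000009\tp31_p279star\tQ13442814\n"
    "Q900000001\tp31_p279star\tQ523\n"
)
counts = count_by_class(sample)
expected_classes = {"Q13442814", "Q523", "Q318", "Q7318358", "Q7187", "Q11173", "Q8054"}
unexpected = set(counts) - expected_classes
print(counts.most_common(), unexpected)
```

Any id in `unexpected` would indicate a class that was not in the removal list.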


## Create the reduced edges file

### Remove the items from the claims file and the label, alias and description files
We will be left with `reduced` files whose edges do not mention the unwanted items. The items have to be removed from both the `node1` and `node2` positions, so we need to run the `ifnotexists` command twice.

Before we start, preview the files to check the column headings and whether they look sorted.

In [26]:
!$kgtk sort2 -i {temp}/items.remove.tsv.gz -o {temp}/items.remove.sorted.tsv.gz


real	0m54.884s
user	0m56.043s
sys	0m6.135s


In [27]:
!zcat < {temp}/items.remove.sorted.tsv.gz | head | col

node1	label	node2
Q100000005	p31_p279star	Q13442814
Q100000009	p31_p279star	Q13442814
Q100000015	p31_p279star	Q13442814
Q100000022	p31_p279star	Q13442814
Q100000031	p31_p279star	Q13442814
Q100000044	p31_p279star	Q13442814
Q100000056	p31_p279star	Q13442814
Q100000066	p31_p279star	Q13442814
Q100000074	p31_p279star	Q13442814

gzip: stdout: Broken pipe


In [28]:
!zcat < "{claims}" | head -5 | col

id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video" normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property

gzip: stdout: Broken pipe


Remove from the full set of edges those edges that have a `node1` present in `items.remove.sorted.tsv.gz`

In [29]:
!$kgtk ifnotexists -i "{claims}" -o {temp}/item.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 


real	134m45.951s
user	134m26.898s
sys	0m18.649s
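The `--presorted` flag tells the command that both files are already sorted on their key columns. Conceptually, this allows a single merge-style pass over the two files instead of loading the filter file into memory. The sketch below illustrates that idea only; it is not KGTK's implementation, and `anti_join_sorted` is a hypothetical name:

```python
def anti_join_sorted(rows, keys, key_of):
    """Yield rows whose key does not occur in keys.

    Both rows (by key_of) and keys must be sorted in ascending order.
    """
    keys = iter(keys)
    current = next(keys, None)
    for row in rows:
        k = key_of(row)
        # Advance the filter keys until they catch up with the current row.
        while current is not None and current < k:
            current = next(keys, None)
        if current is None or current != k:
            yield row


edges = [
    ("P10", "P31", "Q18610173"),
    ("Q1", "P21", "Q6"),
    ("Q1", "P31", "Q5"),
    ("Q3", "P31", "Q5"),
]
remove_keys = ["Q1", "Q2"]
kept = list(anti_join_sorted(edges, remove_keys, key_of=lambda e: e[0]))
print(kept)
```

If either input were not sorted, the merge could skip past a key and let unwanted rows through, which is why the removal list is sorted first.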


From the remaining edges, remove those that have a `node2` present in `items.remove.sorted.tsv.gz`. The edges must first be re-sorted by `node2`.

In [30]:
!$kgtk sort2 -i {temp}/item.edges.reduced.tsv.gz -o {temp}/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id


real	31m44.772s
user	33m46.173s
sys	2m6.857s


In [31]:
!$kgtk ifnotexists -i {temp}/item.edges.reduced.sorted.tsv.gz -o {temp}/item.edges.reduced.2.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 


real	83m19.639s
user	83m7.751s
sys	0m7.814s
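Taken together, the two passes drop every edge that mentions an unwanted item in either position. A toy sketch of the whole sequence with plain Python lists, using invented edge ids `e1`..`e4`; the sort key mirrors the `--columns node2 label node1 id` order used above:

```python
edges = [
    {"id": "e1", "node1": "Q1", "label": "P31", "node2": "Q5"},
    {"id": "e2", "node1": "Q2", "label": "P31", "node2": "Q13442814"},
    {"id": "e3", "node1": "Q3", "label": "P2860", "node2": "Q2"},
    {"id": "e4", "node1": "Q3", "label": "P31", "node2": "Q5"},
]
remove = {"Q2"}  # items scheduled for removal

# Pass 1: drop edges whose node1 is an unwanted item.
pass1 = [e for e in edges if e["node1"] not in remove]

# Re-sort by node2 so that the second pass can match on that column.
pass1_sorted = sorted(pass1, key=lambda e: (e["node2"], e["label"], e["node1"], e["id"]))

# Pass 2: drop edges whose node2 is an unwanted item.
pass2 = [e for e in pass1_sorted if e["node2"] not in remove]

reduced_ids = [e["id"] for e in pass2]
print(reduced_ids)
```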


Create a file with the labels, for all the languages specified

In [35]:
for lang in languages:
    cmd = f"kgtk --debug ifnotexists -i {wiki_root_folder}labels.{lang}.tsv.gz \
    -o {temp}/label.{lang}.edges.reduced.tsv.gz \
    --filter-on {temp}/items.remove.sorted.tsv.gz \
    --input-keys node1 \
    --filter-keys node1 \
    --presorted"
    !$cmd

In [36]:
for lang in languages:
    cmd = f"kgtk sort2 -i {temp}/label.{lang}.edges.reduced.tsv.gz -o {out}/labels.{lang}.tsv.gz" 
    !$cmd

Create a file with the aliases, for all the languages specified

In [37]:
for lang in languages:
    cmd = f"kgtk --debug ifnotexists -i {wiki_root_folder}aliases.{lang}.tsv.gz \
    -o {temp}/alias.{lang}.edges.reduced.tsv.gz \
    --filter-on {temp}/items.remove.sorted.tsv.gz \
    --input-keys node1 \
    --filter-keys node1 \
    --presorted"
    !$cmd

In [38]:
for lang in languages:
    cmd = f"kgtk sort2 -i {temp}/alias.{lang}.edges.reduced.tsv.gz -o {out}/aliases.{lang}.tsv.gz" 
    !$cmd

Create a file with the descriptions, for all the languages specified

In [39]:
for lang in languages:
    cmd = f"kgtk --debug ifnotexists -i {wiki_root_folder}descriptions.{lang}.tsv.gz \
    -o {temp}/description.{lang}.edges.reduced.tsv.gz \
    --filter-on {temp}/items.remove.sorted.tsv.gz \
    --input-keys node1 \
    --filter-keys node1 \
    --presorted"
    !$cmd

In [40]:
for lang in languages:
    cmd = f"kgtk sort2 -i {temp}/description.{lang}.edges.reduced.tsv.gz -o {out}/descriptions.{lang}.tsv.gz" 
    !$cmd
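The three loops above differ only in the kind of file they process. A sketch that builds the same `ifnotexists` commands as argument lists in one place, assuming the file naming shown above; `ifnotexists_command` is a hypothetical helper and the folder names are placeholders:

```python
import shlex

PLURALS = {"label": "labels", "alias": "aliases", "description": "descriptions"}


def ifnotexists_command(kind, lang, wiki_root_folder, temp):
    """Build the kgtk ifnotexists argument list for one file kind and language."""
    return [
        "kgtk", "--debug", "ifnotexists",
        "-i", f"{wiki_root_folder}{PLURALS[kind]}.{lang}.tsv.gz",
        "-o", f"{temp}/{kind}.{lang}.edges.reduced.tsv.gz",
        "--filter-on", f"{temp}/items.remove.sorted.tsv.gz",
        "--input-keys", "node1",
        "--filter-keys", "node1",
        "--presorted",
    ]


commands = [
    ifnotexists_command(kind, lang, "/data/wikidata/", "/data/temp")
    for kind in ("label", "alias", "description")
    for lang in ("en", "es")
]
# Print a shell-quoted form of the first command for inspection.
print(" ".join(shlex.quote(arg) for arg in commands[0]))
```

Each list could then be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual paths.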

### Produce the output files for claims, labels, aliases and descriptions

In [41]:
!$kgtk sort2 -i {temp}/item.edges.reduced.2.tsv.gz -o {out}/claims.tsv.gz 


real	30m39.899s
user	31m21.509s
sys	2m31.669s


## Create the reduced qualifiers file
We do this by collecting the ids of the edges in the reduced claims file and then selecting the matching rows from the qualifiers file.

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `{quals}` 
- `{out}/claims.tsv.gz` 

In [45]:
!zcat < "{quals}" | head | column -t -s $'\t' 

id                                                node1                           label  node2                                                                    node2;wikidatatype
P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0       P10-P1855-Q15075950-7eff6d65-0  P10    "Smoorverliefd 12 september.webm"                                        commonsMedia
P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0    P10-P1855-Q15075950-7eff6d65-0  P3831  Q622550                                                                  wikibase-item
P10-P1855-Q4504-a69d2c73-0-P10-bef003-0           P10-P1855-Q4504-a69d2c73-0      P10    "Komodo dragons video.ogv"                                               commonsMedia
P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0       P10-P1855-Q69063653-c8cdb04c-0  P10    "Couch Commander.webm"                                                   commonsMedia
P10-P1855-Q7378-555592a4-0-P10-8a982d-0           P10-P1855-Q7378-555592a4-0      P10    "Elephants Dream (2006).webm"

Run `ifexists` to select the qualifiers of the edges in `{out}/claims.tsv.gz` and write them to `{out}/qualifiers.tsv.gz`. Note that `node1` in the qualifier file is matched against `id` in the claims file.

In [46]:
!$kgtk ifexists -i "{quals}" -o {out}/qualifiers.tsv.gz \
--filter-on {out}/claims.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted


real	49m25.596s
user	49m17.469s
sys	0m7.472s
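The selection keeps a qualifier only when the edge it annotates survived the reduction. A minimal sketch of that semi-join with a set of claim ids, using rows shaped like the previews above (the second qualifier is invented to show a dropped row):

```python
def keep_qualifiers(qualifiers, claims):
    """Keep qualifiers whose node1 is the id of an edge in the claims list."""
    claim_ids = {c["id"] for c in claims}
    return [q for q in qualifiers if q["node1"] in claim_ids]


claims = [
    {"id": "P10-P1855-Q15075950-7eff6d65-0", "node1": "P10", "label": "P1855", "node2": "Q15075950"},
]
qualifiers = [
    {"id": "P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0",
     "node1": "P10-P1855-Q15075950-7eff6d65-0", "label": "P3831", "node2": "Q622550"},
    {"id": "removed-edge-0-P585-000000-0",  # hypothetical qualifier of a removed edge
     "node1": "removed-edge-0", "label": "P585", "node2": "^2020-01-01T00:00:00Z/11"},
]
kept_qualifiers = keep_qualifiers(qualifiers, claims)
print([q["id"] for q in kept_qualifiers])
```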


Look at the final output for qualifiers

In [47]:
!zcat < {out}/qualifiers.tsv.gz | head | col

id	node1	label	node2	node2;wikidatatype
P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0	P10-P1855-Q15075950-7eff6d65-0	P10	"Smoorverliefd 12 september.webm"	commonsMedia
P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0	P10-P1855-Q15075950-7eff6d65-0	P3831	Q622550 wikibase-item
P10-P1855-Q4504-a69d2c73-0-P10-bef003-0 P10-P1855-Q4504-a69d2c73-0	P10	"Komodo dragons video.ogv"	commonsMedia
P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0	P10-P1855-Q69063653-c8cdb04c-0	P10	"Couch Commander.webm"	commonsMedia
P10-P1855-Q7378-555592a4-0-P10-8a982d-0 P10-P1855-Q7378-555592a4-0	P10	"Elephants Dream (2006).webm"	commonsMedia
P10-P2302-Q21502404-d012aef4-0-P1793-f4c2ed-0	P10-P2302-Q21502404-d012aef4-0	P1793	"(?i).+\\.(webm\|ogv\|ogg\|gif)"	string
P10-P2302-Q21502404-d012aef4-0-P2316-Q21502408-0	P10-P2302-Q21502404-d012aef4-0	P2316	Q21502408	wikibase-item
P10-P2302-Q21502404-d012aef4-0-P2916-cb0917-0	P10-P2302-Q21502404-d012aef4-0	P2916	'filename with extension: webm, ogg, ogv, or gif (case insensitive)'@en 

In [51]:
!ls "$TEMP"

alias.de.edges.reduced.tsv.gz	     description.sv.edges.reduced.tsv.gz
alias.en.edges.reduced.tsv.gz	     description.zh-cn.edges.reduced.tsv.gz
alias.es.edges.reduced.tsv.gz	     item.edges.reduced.2.tsv.gz
alias.fr.edges.reduced.tsv.gz	     item.edges.reduced.sorted.tsv.gz
alias.it.edges.reduced.tsv.gz	     item.edges.reduced.tsv.gz
alias.nl.edges.reduced.tsv.gz	     items.remove.sorted.tsv.gz
alias.pl.edges.reduced.tsv.gz	     items.remove.tsv.gz
alias.pt.edges.reduced.tsv.gz	     label.de.edges.reduced.tsv.gz
alias.ru.edges.reduced.tsv.gz	     label.en.edges.reduced.tsv.gz
alias.sv.edges.reduced.tsv.gz	     label.es.edges.reduced.tsv.gz
alias.zh-cn.edges.reduced.tsv.gz     label.fr.edges.reduced.tsv.gz
description.de.edges.reduced.tsv.gz  label.it.edges.reduced.tsv.gz
description.en.edges.reduced.tsv.gz  label.nl.edges.reduced.tsv.gz
description.es.edges.reduced.tsv.gz  label.pl.edges.reduced.tsv.gz
description.fr.edges.reduced.tsv.gz  label.pt.edges.reduced.tsv.gz
description.it.e

In [52]:
!ls "$OUT"

aliases.de.tsv.gz     descriptions.de.tsv.gz	 labels.en.tsv.gz
aliases.en.tsv.gz     descriptions.en.tsv.gz	 labels.es.tsv.gz
aliases.es.tsv.gz     descriptions.es.tsv.gz	 labels.fr.tsv.gz
aliases.fr.tsv.gz     descriptions.fr.tsv.gz	 labels.it.tsv.gz
aliases.it.tsv.gz     descriptions.it.tsv.gz	 labels.nl.tsv.gz
aliases.nl.tsv.gz     descriptions.nl.tsv.gz	 labels.pl.tsv.gz
aliases.pl.tsv.gz     descriptions.pl.tsv.gz	 labels.pt.tsv.gz
aliases.pt.tsv.gz     descriptions.pt.tsv.gz	 labels.ru.tsv.gz
aliases.ru.tsv.gz     descriptions.ru.tsv.gz	 labels.sv.tsv.gz
aliases.sv.tsv.gz     descriptions.sv.tsv.gz	 labels.zh-cn.tsv.gz
aliases.zh-cn.tsv.gz  descriptions.zh-cn.tsv.gz  qualifiers.tsv.gz
claims.tsv.gz	      labels.de.tsv.gz


Copy the property datatypes and metadata types file over

In [62]:
os.environ["DATATYPES"] = datatypes

In [63]:
!cp $DATATYPES $OUT/metadata.property.datatypes.tsv.gz

Filter the metadata types file, keeping only the edges whose `node1` appears in the reduced claims file

In [67]:
!$kgtk ifexists -i "{metadata_types}" -o {out}/metadata.types.tsv.gz \
--filter-on {out}/claims.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted


real	31m7.984s
user	31m4.925s
sys	0m2.904s


Construct the cat command to generate `all.tsv.gz`

In [112]:
_files = []

cat_cmd = ""
for lang in languages:
    _files.append(f"-i \"$OUT\"/labels.{lang}.tsv.gz")
    _files.append(f"-i \"$OUT\"/aliases.{lang}.tsv.gz")
    _files.append(f"-i \"$OUT\"/descriptions.{lang}.tsv.gz")

_files.append("-i \"$OUT\"/qualifiers.tsv.gz")
_files.append("-i \"$OUT\"/claims.tsv.gz")
_files.append("-i \"$OUT\"/metadata.property.datatypes.tsv.gz")
_files.append("-i \"$OUT\"/metadata.types.tsv.gz")
_files.append("-o \"$OUT\"/all.tsv.gz")

cat_command = "kgtk cat " + " ".join(_files)

In [5]:
os.environ["TEMP"] = temp
os.environ["OUT"] = out

### Run the Partitions Notebook

In [118]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["OUT"] + "/all.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False,
        gzip_command = 'gzip'
    )
)
;

HBox(children=(HTML(value='Executing'), FloatProgress(value=0.0, max=49.0), HTML(value='')))




''

### Copy the `claims.wikibase-item.tsv` file from the `parts` folder

In [9]:
!cp $OUT/parts/claims.wikibase-item.tsv.gz $OUT

### Run the Useful Files notebook

In [10]:
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Wikidata Useful Files.ipynb",
    os.environ["TEMP"] + "/Wikidata Useful Files Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        output_folder = "useful_files",
        temp_folder = "temp.useful_files",
        wiki_root_folder = os.environ["OUT"] + "/parts/",
        cache_path = os.environ["OUT"] + "/temp.useful_files",
        languages = 'en',
        compute_pagerank = True,
        delete_database = False
    )
)
;

HBox(children=(HTML(value='Executing'), FloatProgress(value=0.0, max=112.0), HTML(value='')))




PapermillExecutionError: 
---------------------------------------------------------------------------
Exception encountered at "In [63]":
---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
/opt/miniconda3/envs/kgtk-env/lib/python3.7/site-packages/altair/vegalite/v4/api.py in to_dict(self, *args, **kwargs)
    361         copy = self.copy(deep=False)
    362         original_data = getattr(copy, "data", Undefined)
--> 363         copy.data = _prepare_data(original_data, context)
    364 
    365         if original_data is not Undefined:

/opt/miniconda3/envs/kgtk-env/lib/python3.7/site-packages/altair/vegalite/v4/api.py in _prepare_data(data, context)
     82     # convert dataframes  or objects with __geo_interface__ to dict
     83     if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 84         data = _pipe(data, data_transformers.get())
     85 
     86     # convert string input to a URLData

/opt/miniconda3/envs/kgtk-env/lib/python3.7/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    625     """
    626     for func in funcs:
--> 627         data = func(data)
    628     return data
    629 

/opt/miniconda3/envs/kgtk-env/lib/python3.7/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    301     def __call__(self, *args, **kwargs):
    302         try:
--> 303             return self._partial(*args, **kwargs)
    304         except TypeError as exc:
    305             if self._should_curry(args, kwargs, exc):

/opt/miniconda3/envs/kgtk-env/lib/python3.7/site-packages/altair/vegalite/data.py in default_data_transformer(data, max_rows)
     17 @curried.curry
     18 def default_data_transformer(data, max_rows=5000):
---> 19     return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
     20 
     21 

/opt/miniconda3/envs/kgtk-env/lib/python3.7/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    625     """
    626     for func in funcs:
--> 627         data = func(data)
    628     return data
    629 

/opt/miniconda3/envs/kgtk-env/lib/python3.7/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    301     def __call__(self, *args, **kwargs):
    302         try:
--> 303             return self._partial(*args, **kwargs)
    304         except TypeError as exc:
    305             if self._should_curry(args, kwargs, exc):

/opt/miniconda3/envs/kgtk-env/lib/python3.7/site-packages/altair/utils/data.py in limit_rows(data, max_rows)
     82             "than the maximum allowed ({}). "
     83             "For information on how to plot larger datasets "
---> 84             "in Altair, see the documentation".format(max_rows)
     85         )
     86     return data

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
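The traceback shows Altair's default data transformer rejecting a dataset with more than 5000 rows. One way to stay under that limit is to sample the rows before plotting. The sketch below shows such a sampling step in plain Python; `sample_rows` is a hypothetical helper and not part of the Useful Files notebook. Altair also offers `alt.data_transformers.disable_max_rows()`, which lifts the limit at the cost of a larger notebook.

```python
import random


def sample_rows(rows, max_rows=5000, seed=0):
    """Return at most max_rows rows, sampled uniformly and kept in original order."""
    rows = list(rows)
    if len(rows) <= max_rows:
        return rows
    rng = random.Random(seed)  # fixed seed so repeated runs plot the same sample
    indices = sorted(rng.sample(range(len(rows)), max_rows))
    return [rows[i] for i in indices]


large = sample_rows(range(12000))
small = sample_rows([1, 2, 3])
print(len(large), small)
```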


## Sanity checks

In [11]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(n1:Q368441)-[l]->(n2)' \
--limit 10 \
| col

[2021-03-06 14:52:07 sqlstore]: IMPORT graph directly into table graph_5 from /data/amandeep/wikidata-20210215-dwd/claims.tsv.gz ...
[2021-03-06 15:02:12 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."node1"=?
     LIMIT ?
  PARAS: ['Q368441', 10]
---------------------------------------------
[2021-03-06 15:02:12 sqlstore]: CREATE INDEX on table graph_5 column node1 ...
[2021-03-06 15:05:15 sqlstore]: ANALYZE INDEX on table graph_5 column node1 ...
id	node1	label	node2	rank	node2;wikidatatype
Q368441-P106-Q937857-ba9afa6b-0 Q368441 P106	Q937857 normal	wikibase-item
Q368441-P109-358e4e-63970f77-0	Q368441 P109	"James Rodriguez Signature.svg" normal	commonsMedia
Q368441-P118-Q82595-62cd72d9-0	Q368441 P118	Q82595	normal	wikibase-item
Q368441-P1344-Q170645-3f2d9c6a-0	Q368441 P1344	Q170645 normal	wikibase-item
Q368441-P1344-Q4630358-8e287039-0	Q368441 P1344	Q4630358	normal	wikibase-item
Q368441-P1344-Q7

In [12]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(n1:P131)-[l]->(n2)' \
--limit 10 \
| col

[2021-03-06 15:05:40 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."node1"=?
     LIMIT ?
  PARAS: ['P131', 10]
---------------------------------------------
id	node1	label	node2	rank	node2;wikidatatype
P131-P1628-951146-4681d72b-0	P131	P1628	"http://dati.beniculturali.it/cis/GovernamentalAdministrativeArea"	normal	url
P131-P1629-Q56061-0d5b0586-0	P131	P1629	Q56061	normal	wikibase-item
P131-P1647-P276-5cc63556-0	P131	P1647	P276	normal	wikibase-property
P131-P1647-P361-257a2660-0	P131	P1647	P361	normal	wikibase-property
P131-P1659-P1001-f0f7e26a-0	P131	P1659	P1001	normal	wikibase-property
P131-P1659-P1383-3ebd92d5-0	P131	P1659	P1383	normal	wikibase-property
P131-P1659-P150-d414f410-0	P131	P1659	P150	normal	wikibase-property
P131-P1659-P159-e71dc93e-0	P131	P1659	P159	normal	wikibase-property
P131-P1659-P17-bbd89dc1-0	P131	P1659	P17	normal	wikibase-property
P131-P1659-P206-7eb31568-0	P131	P1659	P206	

## Summary of results

In [13]:
!ls -lh {out}/*.tsv.gz

-rw-r--r-- 1 amandeep isdstaff  32M Mar  4 01:21 /data/amandeep/wikidata-20210215-dwd/aliases.de.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  81M Mar  4 01:20 /data/amandeep/wikidata-20210215-dwd/aliases.en.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  24M Mar  4 01:21 /data/amandeep/wikidata-20210215-dwd/aliases.es.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  27M Mar  4 01:21 /data/amandeep/wikidata-20210215-dwd/aliases.fr.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  13M Mar  4 01:21 /data/amandeep/wikidata-20210215-dwd/aliases.it.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  25M Mar  4 01:21 /data/amandeep/wikidata-20210215-dwd/aliases.nl.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 6.7M Mar  4 01:21 /data/amandeep/wikidata-20210215-dwd/aliases.pl.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 9.2M Mar  4 01:21 /data/amandeep/wikidata-20210215-dwd/aliases.pt.tsv.gz
-rw-r--r-- 1 amandeep isdstaff  35M Mar  4 01:20 /data/amandeep/wikidata-20210215-dwd/aliases.ru.tsv.gz
-rw-r--r-- 1 amandeep isdstaff 7.6M Mar  4 01:21 /data/amandeep/

## Concatenate files to get the `all` file

In [16]:
lad = []
if 'en' not in languages:
    languages.append('en')
for lang in languages:
    lad.append(f"{out}/labels.{lang}.tsv.gz")
    lad.append(f"{out}/aliases.{lang}.tsv.gz")
    lad.append(f"{out}/descriptions.{lang}.tsv.gz")
lad_file_list = " ".join(lad)

In [18]:
!kgtk cat -i {out}/claims.tsv.gz \
{lad_file_list} \
{out}/qualifiers.tsv.gz \
{out}/useful_files/metadata.pagerank.undirected.tsv.gz \
{out}/useful_files/metadata.pagerank.directed.tsv.gz \
{out}/useful_files/metadata.in_degree.tsv.gz \
{out}/useful_files/metadata.out_degree.tsv.gz \
-o {out}/wikidatadwd.all.tsv.gz

## Concatenate files to get the `all for triples` file


In [19]:
!kgtk cat -i $OUT/wikidatadwd.all.tsv.gz \
$OUT/useful_files/derived.isa.tsv.gz \
$OUT/useful_files/derived.P279star.tsv.gz \
-o $OUT/wikidatadwd.all.for.triples.tsv.gz

## Concatenate files to get the `all for elasticsearch` file


In [31]:
!kgtk cat -i $OUT/wikidatadwd.all.tsv.gz \
$OUT/useful_files/derived.P279.tsv.gz \
$OUT/useful_files/derived.isastar.tsv.gz \
-o $OUT/wikidatadwd.all.for.es.tsv.gz

#### remove `somevalue,novalue,P9`

#### add text and graph embeddings, augmented wikipedia and abbreviated human names for ES

In [None]:
!kgtk cat \
-i $OUT/wikidatadwd.all.for.es.tsv.gz \
-i $OUT/metadata.property.datatypes.tsv.gz \
-i $OUT/graph-embeddings/wikidataos.complEx.graph-embeddings.tsv.gz \
-i $OUT/graph-embeddings/wikidataos.transE.graph-embeddings.tsv.gz \
-i $OUT/text-embeddings/text-embeddings-concatenated.tsv.gz \
-i $OUT/derived_files_for_es/augmentation.wikipedia.anchors.tsv.gz \
-i $OUT/derived_files_for_es/augmentation.wikipedia.redirect.tsv.gz \
-i $OUT/derived_files_for_es/augmentation.wikipedia.tables.anchors.tsv.gz \
-i $OUT/derived_files_for_es/derived.Q5.abbreviations.tsv.gz \
-o $OUT/wikidatadwd.all.for.es.embeddings.augmented.unsorted.tsv.gz

#### Remove columns `id rank node2;wikidatatype url`, which are not required in the ES file, and then sort the file by `node1,label`

In [43]:
temp

'/data/amandeep/temp.wikidata-20210215-dwd'

In [None]:
!kgtk sort -c node1,label \
-i $OUT/wikidatadwd.all.for.es.embeddings.augmented.unsorted.tsv.gz \
--extra '--parallel 24 --buffer-size 30% --temporary-directory /data/amandeep/temp.wikidata-20210215-dwd' \
-o $OUT/wikidatadwd.all.for.es.embeddings.augmented.sorted.tsv.gz

## Filter out `novalue`, `somevalue` and `P9`

In [None]:
!kgtk filter -i $OUT/wikidatadwd.all.for.triples.tsv.gz \
    -o $OUT/wikidataos.all.for.triples.filtered.tsv.gz \
    -p ';;somevalue,novalue,P9' --invert
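The pattern passed to `-p` can be read as three fields, `node1;label;node2`, and with `--invert` the matching edges are dropped. The sketch below imitates that behaviour in Python, assuming that an empty field means "any value" and that commas separate alternatives; `parse_pattern` and `filter_edges` are hypothetical names and the edges are toy data:

```python
def parse_pattern(pattern):
    """Split 'node1;label;node2' into three sets of allowed values (None = any)."""
    parts = pattern.split(";")
    if len(parts) != 3:
        raise ValueError("expected a pattern of the form node1;label;node2")
    return [set(p.split(",")) if p else None for p in parts]


def matches(edge, pattern_sets):
    """True if every field of the (node1, label, node2) edge fits the pattern."""
    return all(allowed is None or value in allowed
               for value, allowed in zip(edge, pattern_sets))


def filter_edges(edges, pattern, invert=False):
    """Keep matching edges, or the non-matching ones when invert is True."""
    sets = parse_pattern(pattern)
    return [e for e in edges if matches(e, sets) != invert]


edges = [
    ("Q1", "P31", "Q5"),
    ("Q1", "P570", "somevalue"),
    ("Q2", "P40", "novalue"),
    ("P10", "P1659", "P9"),
    ("Q3", "P9", "Q4"),   # P9 as a label is not in the node2 field
]
kept = filter_edges(edges, ";;somevalue,novalue,P9", invert=True)
print(kept)
```

Under this reading the pattern only tests the `node2` field, so an edge that uses `P9` as its label would be kept.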

## Add ids for any edge with missing id

In [None]:
!kgtk add-id -i $OUT/wikidataos.all.for.triples.filtered.tsv.gz \
-o $OUT/wikidataos.all.for.triples.filtered.id.tsv.gz \
--id-style wikidata

## Sort by `id`

In [None]:
!kgtk sort2 -i $OUT/wikidataos.all.for.triples.filtered.id.tsv.gz \
-o $OUT/wikidataos.all.for.triples.filtered.id.sorted.tsv.gz \
-c id

### Run graph embeddings: complEx

In [None]:
# make sure the output directories are created
!kgtk --debug graph-embeddings --verbose -i $OUT/parts/claims.wikibase-item.tsv.gz \
-o $OUT/graph-embeddings/wikidataos.complEx.graph-embeddings.txt \
--retain_temporary_data True \
--operator ComplEx \
--workers 24 \
--log $OUT/graph-embeddings/temp/ge.complex.log \
-T $OUT/graph-embeddings/temp \
-ot w2v \
-e 600

In Processing, Please go to /data/amandeep/wikidata-20210215-dwd/graph-embeddings/temp/ge.complex.log to check details
Opening the input file: /data/amandeep/wikidata-20210215-dwd/parts/claims.wikibase-item.tsv.gz
KgtkReader: File_path.suffix: .gz
KgtkReader: reading gzip /data/amandeep/wikidata-20210215-dwd/parts/claims.wikibase-item.tsv.gz
header: id	node1	label	node2	node2;wikidatatype	rank
input format: kgtk
node1 column found, this is a KGTK edge file
KgtkReader: Special columns: node1=1 label=2 node2=3 id=0
KgtkReader: Reading an edge file.
Opening the output file: /data/amandeep/wikidata-20210215-dwd/graph-embeddings/temp/tmp_claims.wikibase-item.tsv.gz
File_path.suffix: .gz
KgtkWriter: writing gzip /data/amandeep/wikidata-20210215-dwd/graph-embeddings/temp/tmp_claims.wikibase-item.tsv.gz
header: id	node1	label	node2	node2;wikidatatype	rank
Processing the input records.
Processed 182246240 records.
