## Step 6: partition the files to follow the conventions KGTK uses for Wikidata

In [1]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *

ALIAS: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/aliases.en.tsv.gz"
ALL: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/all.tsv.gz"
CLAIMS: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/claims.tsv.gz"
DESCRIPTION: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/descriptions.en.tsv.gz"
EXAMPLES_DIR: "/Users/pedroszekely/Documents/GitHub/kgtk/examples"
GE: "/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v5/graph-embedding"
ISA: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/derived.isa.tsv.gz"
ITEM: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/claims.wikibase-item.tsv.gz"
LABEL: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/labels.en.tsv.gz"
OUT: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v5"
P279: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/derived.P279.tsv.gz"
P279STAR: "/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/derived.P279star.tsv.gz"
PROPERTY_DATATYPES: "/Users/pedroszekely/Downloads/kypher/wikidata_o

In [2]:
%cd {output_path}

/Users/pedroszekely/Downloads/kypher


We'll use the partition-wikidata notebook to complete this step. The notebook expects a single input file that contains all edges and qualifiers together. We also need to specify a directory where the partitioned files will be written and a directory for temporary files. The temporary directory must be different from our own temp directory, because the partition notebook deletes any existing files in it.
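The directory requirements above can be checked before running the partition notebook. The following is a minimal sketch, not part of KGTK: `prepare_partition_dirs` is a hypothetical helper, and it assumes that the parts folder lives under the output directory and that its scratch folder is a `temp` subfolder, as in the parameters used later in this step.

```python
from pathlib import Path


def prepare_partition_dirs(out_dir, temp_dir):
    """Hypothetical helper (not part of KGTK): create the parts folder and a
    dedicated scratch folder inside it, refusing to reuse the shared temp folder."""
    parts = Path(out_dir) / "parts"
    scratch = parts / "temp"
    # The partition notebook clears its temp folder, so it must not be the
    # same directory as our general-purpose temp directory.
    if scratch.resolve() == Path(temp_dir).resolve():
        raise ValueError(f"{scratch} must differ from the shared temp directory")
    # parents=True / exist_ok=True behaves like `mkdir -p`, so reruns do not fail.
    scratch.mkdir(parents=True, exist_ok=True)
    return parts, scratch
```

In this notebook it could be called as `prepare_partition_dirs(os.environ["OUT"], os.environ["TEMP"])`, assuming both environment variables are set by the tutorial setup.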

In [3]:
!mkdir -p $OUT/parts


In [4]:
!$kgtk cat -i $OUT/all.tsv.gz -i $OUT/qualifiers.tsv.gz -o $TEMP/all_and_qualifiers.tsv.gz

        7.64 real         6.57 user         0.19 sys


In [5]:
!zcat < $TEMP/all_and_qualifiers.tsv.gz | head

id	node1	label	node2
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508
P10-P1659-P1651-c4068028-0	P10	P1659	P1651
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238
P10-P1659-P51-86aca4c5-0	P10	P1659	P51
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950
P10-P1855-Q69063653-c8cdb04c-0	P10	P1855	Q69063653
zcat: error writing to output: Broken pipe
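The `Broken pipe` message is harmless: `head` exits after printing ten lines while `zcat` is still writing. As an alternative, here is a small sketch that previews the file from Python without that message. `head_gzip_tsv` is a hypothetical helper name, not a KGTK function.

```python
import gzip
from itertools import islice


def head_gzip_tsv(path, n=10):
    """Return the first n lines (header included) of a gzip-compressed TSV
    file, each line split into its tab-separated fields."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n").split("\t") for line in islice(f, n)]
```

For example, `head_gzip_tsv(os.environ["TEMP"] + "/all_and_qualifiers.tsv.gz")` would return the header row followed by the first nine edges.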


In [6]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["TEMP"] + "/all_and_qualifiers.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
)

{'cells': [{'cell_type': 'markdown',
   'metadata': {'tags': [],
    'papermill': {'exception': False,
     'start_time': '2021-01-08T01:55:49.187689',
     'end_time': '2021-01-08T01:55:49.215617',
     'duration': 0.027928,
     'status': 'completed'}},
   'source': '# Partitioning a subset of Wikidata\n\nThis notebook illustrates how to partition a Wikidata KGTK edges file.\n\nParameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:\n\n```\npapermill partition-wikidata.ipynb partition-wikidata.out.ipynb \\\n-p wikidata_input_path /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/data/all.tsv.gz \\\n-p wikidata_parts_path /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/parts \\\n```\n\nHere is a sample of the records that might appear in the input KGTK file:\n```\nid\tnode1\tlabel\tnode2\trank\tnode2;wikidatatype\tlang\nQ1-P1036-418bc4-78f5a565-0\tQ1\tP1036\t"113"\tnormal\texternal-id\t\nQ1-P13

The partition-wikidata notebook created the following partitioned KGTK files:

In [88]:
!ls $OUT/parts

aliases.en.tsv.gz                  metadata.property.datatypes.tsv.gz
aliases.tsv.gz                     metadata.types.tsv.gz
all.tsv.gz                         qualifiers.tsv.gz
claims.tsv.gz                      sitelinks.en.tsv.gz
descriptions.en.tsv.gz             sitelinks.qualifiers.en.tsv.gz
descriptions.tsv.gz                sitelinks.qualifiers.tsv.gz
labels.en.tsv.gz                   sitelinks.tsv.gz
labels.tsv.gz                      temp


In [75]:
!$kypher -i $OUT/parts/claims.tsv.gz \
--match '(n1)-[]->()' \
--return 'count(distinct n1)'

count(DISTINCT graph_36_c1."node1")
13153
        2.61 real         2.55 user         0.37 sys
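The Kypher query above counts the distinct `node1` values in the claims file. For a small file, the count can be cross-checked in plain Python. This is a sketch under the assumption that the file is a gzip-compressed TSV with a `node1` column in its header; `count_distinct_node1` is a hypothetical helper, not part of KGTK.

```python
import csv
import gzip


def count_distinct_node1(path):
    """Count the distinct values in the node1 column of a gzip-compressed
    KGTK edge file (tab-separated, with a header row)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps double quotes literal; KGTK string values such as
        # "https://schema.org/video" carry their quotes as part of the value.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return len({row["node1"] for row in reader})
```

This loads every distinct `node1` value into memory, so it suits subsets like the one in this tutorial rather than a full Wikidata dump.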
