# Generating RDF Triples
This notebook shows how to generate RDF triples according to the conventions used in Wikidata. A separate Jupyter notebook takes a KGTK file as input and produces a Turtle file that can be loaded into a triple store. That notebook can also deploy a Docker image running an instance of the Wikidata Blazegraph, but doing so takes longer than is practical during the tutorial.

We will need to do a bit of data preparation to extract the `datatype` edges, a required input to the triple generation notebook.

In [1]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *

ALIAS: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/aliases.en.tsv.gz"
ALL: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/all.tsv.gz"
CLAIMS: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.tsv.gz"
DESCRIPTION: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/descriptions.en.tsv.gz"
EXAMPLES_DIR: "/Users/pedroszekely/Documents/GitHub/kgtk/examples"
GE: "/Users/pedroszekely/Downloads/kgtk-tutorial/temp/graph-embedding"
ISA: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.isa.tsv.gz"
ITEM: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/claims.wikibase-item.tsv.gz"
KGTK_PATH: "/Users/pedroszekely/Documents/GitHub/kgtk"
LABEL: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/labels.en.tsv.gz"
OUT: "/Users/pedroszekely/Downloads/kgtk-tutorial/output"
P279: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/derived.P279.tsv.gz"
P279STAR: "/Users/pedroszekely/Downloads/kgtk-tutorial/miniwikidata/de

### Sort the file with all edges and qualifiers for the subgraph

Sort the file by identifier, which places all the qualifiers of a claim immediately after the claim.

In [2]:
!$kgtk sort -i "$TEMP"/all_and_qualifiers.tsv.gz -o "$OUT"/all_and_qualifiers.sorted.1.tsv.gz
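
To see why sorting by identifier can group qualifiers with their claims, consider the small sketch below. It assumes, for illustration only, that a qualifier edge's `id` begins with the `id` of the claim it qualifies; the edges themselves are made up and the sort is plain Python string ordering, not the `kgtk sort` implementation.

```python
# Hypothetical edges as (id, node1, label, node2); all ids and values are invented.
edges = [
    ("Q1-P39-Q2-aaa-0", "Q1-P39-Q2-aaa", "P580", "^1990-01-01T00:00:00Z/11"),  # qualifier
    ("Q3-P31-Q5-bbb", "Q3", "P31", "Q5"),                                      # claim
    ("Q1-P39-Q2-aaa", "Q1", "P39", "Q2"),                                      # claim
    ("Q1-P39-Q2-aaa-1", "Q1-P39-Q2-aaa", "P582", "^1994-01-01T00:00:00Z/11"),  # qualifier
]

# Sort lexicographically on the id column (the first field of each tuple).
sorted_edges = sorted(edges, key=lambda edge: edge[0])

for edge in sorted_edges:
    print("\t".join(edge))
```

Because a string sorts before any longer string that starts with it, the claim `Q1-P39-Q2-aaa` is listed first and its two qualifiers follow directly.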

Our file has `datatype` edges that specify the datatype of each property, including the properties that KGTK adds. Unfortunately, at this time these edges use two different labels, `datatype` and `data_type`:

In [3]:
lines = !$kgtk filter -i "$OUT"/all_and_qualifiers.sorted.1.tsv.gz -p ';data_type,datatype;' 
kgtk_to_dataframe(lines)

| | id | node1 | label | node2 |
|---|---|---|---|---|
| 0 | P10-datatype | P10 | datatype | commonsMedia |
| 1 | P1001-datatype | P1001 | datatype | wikibase-item |
| 2 | P1004-datatype | P1004 | datatype | external-id |
| 3 | P1005-datatype | P1005 | datatype | external-id |
| 4 | P101-datatype | P101 | datatype | wikibase-item |
| ... | ... | ... | ... | ... |
| 1078 | directed_pagerank-data_type-1a7b30 | directed_pagerank | data_type | quantity |
| 1079 | in_degree-data_type-1a7b30 | in_degree | data_type | quantity |
| 1080 | isa-data_type-643cc9 | isa | data_type | wikibase-item |
| 1081 | out_degree-data_type-1a7b30 | out_degree | data_type | quantity |
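
The pattern `';data_type,datatype;'` has three `;`-separated fields, for `node1`, `label`, and `node2`; an empty field matches anything, and commas separate alternatives. The sketch below is a rough Python rendering of that matching rule, intended only to illustrate the semantics rather than to reproduce KGTK's implementation. The function name `matches` and the `Q42` edge are hypothetical.

```python
def matches(edge, pattern):
    """Return True if a (node1, label, node2) edge satisfies a 'node1;label;node2' pattern."""
    fields = [field.strip() for field in pattern.split(";")]
    if len(fields) != 3:
        raise ValueError("pattern must have exactly three ';'-separated fields")
    for value, field in zip(edge, fields):
        # An empty field yields an empty set, which places no constraint on the value.
        allowed = {alt.strip() for alt in field.split(",") if alt.strip()}
        if allowed and value not in allowed:
            return False
    return True


pattern = ";data_type,datatype;"
edges = [
    ("P10", "datatype", "commonsMedia"),
    ("in_degree", "data_type", "quantity"),
    ("Q42", "P31", "Q5"),  # made-up claim edge
]

kept = [edge for edge in edges if matches(edge, pattern)]
dropped = [edge for edge in edges if not matches(edge, pattern)]  # what --invert would keep
```

Here `kept` holds the two datatype edges and `dropped` holds the remaining claim edge.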


We don't want the data type edges in our triples, so we remove them:

In [4]:
!$kgtk filter -i "$OUT"/all_and_qualifiers.sorted.1.tsv.gz -p ';data_type,datatype;' \
--invert -o "$OUT"/all_and_qualifiers.sorted.tsv.gz

Consolidate all the data type edges into a single file, since we need to pass them as an argument to triple generation:

In [5]:
!$kgtk filter -i "$TEMP"/kgtk.properties.tsv -p ';data_type;' \
-o "$TEMP/datatypes.kgtk.properties.tsv.gz"

!$kgtk cat -i "$WIKIDATA"/metadata.property.datatypes.tsv.gz  "$TEMP"/datatypes.kgtk.properties.tsv.gz \
-o "$OUT"/all.metadata.property.datatypes.tsv

!gzip -f "$OUT"/all.metadata.property.datatypes.tsv
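
When two files with different columns are concatenated, the result needs a header that covers both. The standard-library sketch below shows one way to do that: take the union of the input headers and leave blanks where an input lacks a column. It illustrates the idea only and is not how `kgtk cat` is implemented; the helper name `concat_tsv`, the sample rows, and the `rank` value are assumptions.

```python
import gzip


def concat_tsv(texts):
    """Concatenate TSV texts using the union of their headers, in first-seen order."""
    columns = []
    tables = []
    for text in texts:
        lines = text.rstrip("\n").split("\n")
        header = lines[0].split("\t")
        tables.append([dict(zip(header, line.split("\t"))) for line in lines[1:]])
        for name in header:
            if name not in columns:
                columns.append(name)
    out = ["\t".join(columns)]
    for rows in tables:
        for row in rows:
            # Columns absent from this input are written as empty strings.
            out.append("\t".join(row.get(name, "") for name in columns))
    return "\n".join(out) + "\n"


first = "id\tnode1\tlabel\tnode2\nP10-datatype\tP10\tdatatype\tcommonsMedia\n"
second = (
    "id\tnode1\tlabel\tnode2\trank\n"
    "in_degree-data_type-1a7b30\tin_degree\tdata_type\tquantity\tnormal\n"
)

merged = concat_tsv([first, second])
compressed = gzip.compress(merged.encode("utf-8"))  # in-memory analogue of the gzip step
```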

Look at the file:

In [6]:
lines = !zcat < "$OUT"/all.metadata.property.datatypes.tsv.gz 
kgtk_to_dataframe(lines)

| | id | node1 | label | node2 | node2;wikidatatype | rank |
|---|---|---|---|---|---|---|
| 0 | P10-datatype | P10 | datatype | commonsMedia | | |
| 1 | P1001-datatype | P1001 | datatype | wikibase-item | | |
| 2 | P1003-datatype | P1003 | datatype | external-id | | |
| 3 | P1004-datatype | P1004 | datatype | external-id | | |
| 4 | P1005-datatype | P1005 | datatype | external-id | | |
| ... | ... | ... | ... | ... | ... | ... |
| 1509 | P279star-data_type-643cc9 | P279star | data_type | wikibase-item | | |
| 1510 | directed_pagerank-data_type-1a7b30 | directed_pagerank | data_type | quantity | | |
| 1511 | undirected_pagerank-data_type-1a7b30 | undirected_pagerank | data_type | quantity | | |
| 1512 | in_degree-data_type-1a7b30 | in_degree | data_type | quantity | | |


Execute the triple generation notebook:

In [7]:
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Generate-Triples-And-Load-Blazegraph.ipynb",
    os.environ["TEMP"] + "/Generate-Triples-And-Load-Blazegraph-OUT.ipynb",
    parameters=dict(
        kgtk_path = os.environ["OUT"],
        kgtk_file_name = "all_and_qualifiers.sorted.tsv.gz",
        properties_file_path = os.environ["OUT"] + "/all.metadata.property.datatypes.tsv.gz",
        create_image = False,
        load_triples = False
    )
)
;

Executing:   0%|          | 0/14 [00:00<?, ?cell/s]

''
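
The call above relies on `pm`, which is presumably `papermill` imported by `tutorial_setup` (an assumption, since that module is not shown here). The sketch below wraps the same call so that missing inputs are reported before the notebook starts running. The helper names `build_parameters`, `missing_inputs`, and `run` are hypothetical, and the guess that the target notebook joins `kgtk_path` and `kgtk_file_name` to locate its input is also an assumption.

```python
import os


def build_parameters(out_dir):
    """Assemble the parameters passed to the triple generation notebook."""
    return dict(
        kgtk_path=out_dir,
        kgtk_file_name="all_and_qualifiers.sorted.tsv.gz",
        properties_file_path=os.path.join(out_dir, "all.metadata.property.datatypes.tsv.gz"),
        create_image=False,
        load_triples=False,
    )


def missing_inputs(params):
    """Return the expected input files that do not exist yet."""
    candidates = [
        # Assumption: the notebook reads kgtk_path/kgtk_file_name.
        os.path.join(params["kgtk_path"], params["kgtk_file_name"]),
        params["properties_file_path"],
    ]
    return [path for path in candidates if not os.path.exists(path)]


def run(usecase_dir, temp_dir, out_dir):
    """Check the inputs, then execute the notebook with papermill."""
    import papermill as pm  # imported lazily; assumed to be installed

    params = build_parameters(out_dir)
    missing = missing_inputs(params)
    if missing:
        raise FileNotFoundError(f"missing inputs: {missing}")
    return pm.execute_notebook(
        os.path.join(usecase_dir, "Generate-Triples-And-Load-Blazegraph.ipynb"),
        os.path.join(temp_dir, "Generate-Triples-And-Load-Blazegraph-OUT.ipynb"),
        parameters=params,
    )
```

In the notebook this could be invoked as `run(os.environ["USECASE_DIR"], os.environ["TEMP"], os.environ["OUT"])`.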

Take a peek at the output triples file:

In [8]:
!gzcat "$OUT"/all.ttl.gz | head -n 50

@prefix wikibase: <http://wikiba.se/ontology#> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix wdtn: <http://www.wikidata.org/prop/direct-normalized/> .
@prefix wdno: <http://www.wikidata.org/prop/novalue/> .
@prefix wds: <http://www.wikidata.org/entity/statement/> .
@prefix wdv: <http://www.wikidata.org/value/> .
@prefix wdref: <http://www.wikidata.org/reference/> .
@prefix p: <http://www.wikidata.org/prop/> .
@prefix pr: <http://www.wikidata.org/prop/reference/> .
@prefix prv: <http://www.wikidata.org/prop/reference/value/> .
@prefix prn: <http://www.wikidata.org/prop/reference/value-normalized/> .
@prefix ps: <http://www.wikidata.org/prop/statement/> .
@prefix psv: <http://www.wikidata.org/prop/statement/value/> .
@prefix psn: <http://www.wikidata.org/prop/statement/value-normalized/> .
@prefix pq: <http://www.wikidata.org/prop/qualifier/> .
@prefix pqv: <http://www.wikidata.org/prop/qualifier/value/> .
@prefix pqn: <ht