# KGTK Tutorial: Introduction


In [1]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *

ALIAS: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/aliases.en.tsv.gz"
ALL: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/all.tsv.gz"
CLAIMS: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/claims.tsv.gz"
DESCRIPTION: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/descriptions.en.tsv.gz"
EXAMPLES_DIR: "/Users/amandeep/Github/kgtk/examples"
GE: "/Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding"
ISA: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.isa.tsv.gz"
ITEM: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/claims.wikibase-item.tsv.gz"
LABEL: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/labels.en.tsv.gz"
OUT: "/Users/amandeep/Documents/kypher/wikidata_os_v5"
P279: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.P279.tsv.gz"
P279STAR: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.P279star

In [2]:
!mkdir -p {output_path}
%cd {output_path}

/Users/amandeep/Documents/kypher


In [3]:
!mkdir -p {output_folder}
!mkdir -p {temp_folder}

mkdir: wikidata_os_v5: File exists
mkdir: temp.wikidata_os_v5: File exists


In [4]:
!mkdir -p "$GE"

mkdir: /Users/amandeep/Documents/kypher/temp.wikidata_os_v5/graph-embedding: File exists


In [6]:
!mkdir -p "$TE"

# Wikidata in KGTK
KGTK can import a Wikidata JSON dump and convert it to the KGTK representation, which makes it easy to process the full Wikidata KG on a laptop. There are 86 files, which include all the information available in the Wikidata dump as well as files containing commonly used information derived from the dump. We partitioned the files because most use cases need only a subset of them.

The files are very large. `claims.tsv` (5.2GB compressed, about 23GB uncompressed) contains all the statements in the Wikidata dump, `qualifiers.tsv` contains the qualifiers of those edges, and `labels.en.tsv`, `aliases.en.tsv` and `descriptions.en.tsv` contain the English labels, aliases and descriptions.

In [7]:
!ls -lh "$CLAIMS" "$QUALIFIERS" "$LABEL" "$ALIAS" "$DESCRIPTION"

-rw-------  1 amandeep  staff    73M Dec  9 14:11 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/aliases.en.tsv.gz
-rw-------  1 amandeep  staff   5.2G Dec  9 14:05 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/claims.tsv.gz
-rw-------  1 amandeep  staff   290M Dec  9 14:12 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/descriptions.en.tsv.gz
-rw-------  1 amandeep  staff   401M Dec  9 14:11 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/labels.en.tsv.gz
-rw-------  1 amandeep  staff   768M Dec  9 14:51 /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/qualifiers.tsv.gz


`claims.tsv` contains about 285 million edges:

In [81]:
!time zcat < "$CLAIMS" | wc

 285160805 1770487103 22786890569
zcat < "$CLAIMS"  74.47s user 15.51s system 86% cpu 1:44.48 total
wc  79.97s user 3.02s system 79% cpu 1:44.48 total
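The same kind of inspection can be done from Python without shell pipes. The following is a minimal sketch, not part of KGTK: the helper names `head_kgtk` and `count_lines` and the tiny demo file are hypothetical, and the code only assumes that the files are gzip-compressed, tab-separated text.

```python
import gzip
import itertools
import os
import tempfile


def head_kgtk(path, n=10):
    """Return the first n lines of a gzipped TSV file as lists of fields."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t") for line in itertools.islice(f, n)]


def count_lines(path):
    """Count lines by streaming, so the file is never loaded into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Tiny stand-in file (hypothetical data) so the sketch runs anywhere.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\n")
    f.write("e1\tQ31\tP35\tQ12967\n")
    f.write("e2\tQ31\tP35\tQ12971\n")

print(count_lines(demo_path))
print(head_kgtk(demo_path, 2))
```

Because both helpers stream the file, they should also work on the large Wikidata files, although a full line count will take a while, as the timing above suggests.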


# KGTK Data Model
The KGTK data model is a generalization of RDF and property graphs, inspired by the Wikidata data model. In KGTK, a KG is represented using TSV files with four columns: three columns to store the subject, predicate and object of a triple, and a fourth column to store an identifier for the triple. By convention, we use the heading `id` for the identifier, `node1` for the subject, `node2` for the object and `label` for the predicate, as it labels the edge between `node1` and `node2`. The order of the columns is arbitrary.

All KGTK files must include the required `id`, `node1`, `label` and `node2` columns, and can contain additional columns to store additional information about an edge or the nodes in the edge. We will explain the details after we discuss *qualifiers*.
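To make the column rules concrete, here is a small sketch that reads an edge file by column name rather than by position. It uses only the Python standard library, not the KGTK API, and the in-memory file with its shuffled column order is made up for illustration.

```python
import csv
import io

REQUIRED = {"id", "node1", "label", "node2"}

# Columns deliberately in a non-standard order, plus one extra column.
text = (
    "node1\tlabel\tnode2\tid\trank\n"
    "P10\tP1659\tP18\tP10-P1659-P18-5e4b9c4f-0\tnormal\n"
)

# QUOTE_NONE keeps quote characters in values untouched.
reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)

# A file is usable as long as the four required headers are present somewhere.
missing = REQUIRED - set(reader.fieldnames)
edges = list(reader)

print(missing)
print(edges[0]["node1"], edges[0]["label"], edges[0]["node2"], edges[0]["rank"])
```

Looking columns up by header is what makes the column order irrelevant and lets extra columns such as `rank` travel along with each edge.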
Let's take a look at the first few lines of the `claims.tsv` file. We see the four required columns and two additional columns that the Wikidata import includes to facilitate processing of the `claims` file using custom scripts. The `rank` column records the Wikidata rank of a statement, and the `node2;wikidatatype` records the Wikidata type of the value in the `node2` column.

## Claims

In [8]:
!zcat < "$CLAIMS" | head | column -t -s $'\t'

id                              node1  label  node2                                    rank    node2;wikidatatype
P10-P1628-32b85d-7927ece6-0     P10    P1628  "http://www.w3.org/2006/vcard/ns#Video"  normal  url
P10-P1628-acf60d-b8950832-0     P10    P1628  "https://schema.org/video"               normal  url
P10-P1629-Q34508-bcc39400-0     P10    P1629  Q34508                                   normal  wikibase-item
P10-P1659-P1651-c4068028-0      P10    P1659  P1651                                    normal  wikibase-property
P10-P1659-P18-5e4b9c4f-0        P10    P1659  P18                                      normal  wikibase-property
P10-P1659-P4238-d21d1ac0-0      P10    P1659  P4238                                    normal  wikibase-property
P10-P1659-P51-86aca4c5-0        P10    P1659  P51                                      normal  wikibase-property
P10-P1855-Q15075950-7eff6d65-0  P10    P1855  Q15075950                                normal  wik

Wikidata uses numbered identifiers such as `P10` to identify items and properties. We can use the `wd` utility (https://github.com/maxlath/wikibase-cli) to understand the first few lines. The second line states that the `P10` property in Wikidata has an equivalent property in another ontology. Notice that each edge has a distinct id. These ids are unique identifiers for statements. The format of the id can be arbitrary, but we assigned ids so that sorting a file by id places all edges about a subject on consecutive lines.

In [83]:
!wd u P10 P1628 P1629

id P10
Label video
Description relevant video. For images, use the property P18. For film trailers, qualify with "object has role" (P3831)="trailer" (Q622550)
instance of (P31): Wikidata property to link to Commons (Q18610173)

id P1628
Label equivalent property
Description equivalent property in other ontologies (use in statements on properties, use property URI)
instance of (P31): Wikidata metaproperty for ontology mapping (Q42842547)

id P1629
Label subject item of this property
Description relationship represented by the property
instance of (P31): Wikidata property for property documentation (Q19820110)


Let's look at a more meaningful example. `Q31` (https://www.wikidata.org/wiki/Q31) is the Wikidata item about Belgium. We will use the KGTK `query` command to fetch edges about Belgium. `$kypher` is a shortcut for the `kgtk query` command that also passes in the location of the SQLite database we use to store the files. KGTK queries use Cypher syntax (https://neo4j.com/developer/cypher/): the following simple query retrieves 10 edges where `node1` is `Q31`, the q-node for Belgium. The results include an edge with `label` `P1036` (Dewey Decimal Classification) and several edges with `label` `P1081` (Human Development Index).

**Note:** We are using the `--as` option of `kgtk query` to set an alias for the `$CLAIMS` file. This alias can be used in subsequent `kgtk query` commands.

In [7]:
result = !$kypher_raw -i "$CLAIMS" --as "claims" \
--match '(:Q31)-[]-()' \
--limit 10 

kgtk_to_dataframe(result)

| | id | node1 | label | node2 | rank | node2;wikidatatype |
|---|---|---|---|---|---|---|
| 0 | Q31-P1036-c4e1ad-df86eeb8-0 | Q31 | P1036 | "2--493" | normal | external-id |
| 1 | Q31-P1081-02c2ed-033524b0-0 | Q31 | P1081 | +0.866 | normal | quantity |
| 2 | Q31-P1081-02c2ed-7971505b-0 | Q31 | P1081 | +0.866 | normal | quantity |
| 3 | Q31-P1081-068470-c1c63b8d-0 | Q31 | P1081 | +0.889 | normal | quantity |
| 4 | Q31-P1081-068470-ddac01e0-0 | Q31 | P1081 | +0.889 | normal | quantity |
| 5 | Q31-P1081-144738-c1851cdc-0 | Q31 | P1081 | +0.905 | normal | quantity |
| 6 | Q31-P1081-175742-c07ac1c8-0 | Q31 | P1081 | +0.888 | normal | quantity |
| 7 | Q31-P1081-19636d-c08dd8a8-0 | Q31 | P1081 | +0.896 | normal | quantity |
| 8 | Q31-P1081-1efc03-433a7a4d-0 | Q31 | P1081 | +0.913 | normal | quantity |
| 9 | Q31-P1081-1f8602-ddac530d-0 | Q31 | P1081 | +0.852 | normal | quantity |


The output of the command above is hard to read because we are seeing the numeric Wikidata identifiers. To make the output more readable, we need to look up the labels of the Wikidata nodes. This information is in the `labels.en.tsv` file.

In [10]:
!zcat < "$LABEL" | head | column -t -s $'\t'

id              node1  label  node2
P10-label-en    P10    label  'video'@en
P1000-label-en  P1000  label  'record held'@en
P1001-label-en  P1001  label  'applies to jurisdiction'@en
P1002-label-en  P1002  label  'engine configuration'@en
P1003-label-en  P1003  label  'National Library of Romania ID'@en
P1004-label-en  P1004  label  'MusicBrainz place ID'@en
P1005-label-en  P1005  label  'Portuguese National Library ID'@en
P1006-label-en  P1006  label  'Nationale Thesaurus voor Auteurs ID'@en
P1007-label-en  P1007  label  'Lattes Platform number'@en
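The `node2` values in this file are language-qualified strings of the form `'text'@lang`. The sketch below splits such a value into its text and language tag. The function name `parse_lq_string` is hypothetical and this is not the KGTK parser; in particular, it makes no attempt to unescape quotes inside the text.

```python
import re

# Single-quoted text followed by @ and a language tag, e.g. 'video'@en.
LQ_STRING = re.compile(r"^'(?P<text>.*)'@(?P<lang>[A-Za-z][A-Za-z0-9-]*)$", re.DOTALL)


def parse_lq_string(value):
    """Return (text, language) for a language-qualified string, else None."""
    m = LQ_STRING.match(value)
    if m is None:
        return None
    return m.group("text"), m.group("lang")


print(parse_lq_string("'video'@en"))
print(parse_lq_string("Q31"))
```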


KGTK accepts multiple files as input and can join them to retrieve the label for each property. When using multiple files, each clause must be tagged with the file that provides the data for that clause. For example, the first clause is tagged with `claim`, which works because the word `claim` is part of the file name. The variable `property` connects the two clauses.

**Note:** We use the alias `claims` defined in a previous cell and introduce a new alias for the `$LABEL` file.

In [13]:
result = !$kypher -i claims -i "$LABEL" --as "labels" \
--match 'claim: (n1:Q31)-[l {label: property}]-(n2), label: (property)-[:label]->(property_label)' \
--return 'l as id, n1 as node1, property as label, n2 as node2, property_label as `label;label`' \
--limit 10 

kgtk_to_dataframe(result)

| | id | node1 | label | node2 | label;label |
|---|---|---|---|---|---|
| 0 | Q31-P1036-c4e1ad-df86eeb8-0 | Q31 | P1036 | "2--493" | 'Dewey Decimal Classification'@en |
| 1 | Q31-P1081-02c2ed-033524b0-0 | Q31 | P1081 | +0.866 | 'Human Development Index'@en |
| 2 | Q31-P1081-02c2ed-7971505b-0 | Q31 | P1081 | +0.866 | 'Human Development Index'@en |
| 3 | Q31-P1081-068470-c1c63b8d-0 | Q31 | P1081 | +0.889 | 'Human Development Index'@en |
| 4 | Q31-P1081-068470-ddac01e0-0 | Q31 | P1081 | +0.889 | 'Human Development Index'@en |
| 5 | Q31-P1081-144738-c1851cdc-0 | Q31 | P1081 | +0.905 | 'Human Development Index'@en |
| 6 | Q31-P1081-175742-c07ac1c8-0 | Q31 | P1081 | +0.888 | 'Human Development Index'@en |
| 7 | Q31-P1081-19636d-c08dd8a8-0 | Q31 | P1081 | +0.896 | 'Human Development Index'@en |
| 8 | Q31-P1081-1efc03-433a7a4d-0 | Q31 | P1081 | +0.913 | 'Human Development Index'@en |
| 9 | Q31-P1081-1f8602-ddac530d-0 | Q31 | P1081 | +0.852 | 'Human Development Index'@en |
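Conceptually, the two-clause query is a join on the shared `property` variable. The following plain-Python sketch mimics that join on two rows copied from the output above. It only illustrates the idea and says nothing about how `kgtk query` executes the join internally.

```python
# Two claim edges and the matching label edges, copied from the outputs above.
claims = [
    {"id": "Q31-P1036-c4e1ad-df86eeb8-0", "node1": "Q31", "label": "P1036", "node2": '"2--493"'},
    {"id": "Q31-P1081-02c2ed-033524b0-0", "node1": "Q31", "label": "P1081", "node2": "+0.866"},
]
labels = [
    {"node1": "P1036", "label": "label", "node2": "'Dewey Decimal Classification'@en"},
    {"node1": "P1081", "label": "label", "node2": "'Human Development Index'@en"},
]

# Index the label edges by the node they describe (the `property` variable).
label_of = {e["node1"]: e["node2"] for e in labels if e["label"] == "label"}

# Keep claims about Q31 whose property has a label, adding a `label;label` column.
joined = [
    {**c, "label;label": label_of[c["label"]]}
    for c in claims
    if c["node1"] == "Q31" and c["label"] in label_of
]

for row in joined:
    print(row["id"], row["label"], row["label;label"])
```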


Let's look at the heads of state of Belgium, recorded in property `P35`:

In [14]:
result = !$kypher -i claims -i labels \
--match 'claims: (n1:Q31)-[l:P35]->(n2), labels: (n2)-[:label]->(n2_label)' \
--return 'l as id, n1 as node1, l.label as label, n2 as node2, n2_label as `node2;label`' \
--limit 10 

kgtk_to_dataframe(result)

| | id | node1 | label | node2 | node2;label |
|---|---|---|---|---|---|
| 0 | Q31-P35-Q1079522-c82ed584-0 | Q31 | P35 | Q1079522 | 'Erasme Louis Surlet de Chokier'@en |
| 1 | Q31-P35-Q12967-f2b9aaf3-0 | Q31 | P35 | Q12967 | 'Leopold II of Belgium'@en |
| 2 | Q31-P35-Q12971-2088471b-0 | Q31 | P35 | Q12971 | 'Leopold I of Belgium'@en |
| 3 | Q31-P35-Q12973-31c1b700-0 | Q31 | P35 | Q12973 | 'Leopold III of Belgium'@en |
| 4 | Q31-P35-Q12976-f3e8a567-0 | Q31 | P35 | Q12976 | 'Baudouin I of Belgium'@en |
| 5 | Q31-P35-Q155004-619ba603-0 | Q31 | P35 | Q155004 | 'Philippe of Belgium'@en |
| 6 | Q31-P35-Q3911-137f01fe-0 | Q31 | P35 | Q3911 | 'Albert II of Belgium'@en |
| 7 | Q31-P35-Q445553-7599749f-0 | Q31 | P35 | Q445553 | 'Prince Charles, Count of Flanders'@en |
| 8 | Q31-P35-Q55008046-725dce40-0 | Q31 | P35 | Q55008046 | 'Albert I of Belgium'@en |


## Qualifiers
Qualifiers provide additional information about the claims stated in the edges. For `P1081` the qualifiers tell us the year, and for head of state the qualifiers provide information about the period of time and the position held by the head of state. The qualifiers can be retrieved using the identifiers of the edges. Let's retrieve the qualifiers associated with the edge for the first head of state (Erasme Louis). To do so, we use the identifier of the edge (`Q31-P35-Q1079522-c82ed584-0`) as `node1` in the `qualifiers.tsv` file. We get three edges, meaning that the edge `Q31/P35/Q1079522` has three qualifiers. Note that qualifier edges are the same as any other edge in KGTK, having `id`, `node1`, `label` and `node2` columns:

In [15]:
!$kypher -i "$QUALIFIERS" --as "qualifiers" \
--match '(n1:`Q31-P35-Q1079522-c82ed584-0`)-[l]->(n2)' \
--limit 10 \
| column -t -s $'\t'

id                                         node1                        label  node2                     node2;wikidatatype
Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0  Q31-P35-Q1079522-c82ed584-0  P39    Q477406                   wikibase-item
Q31-P35-Q1079522-c82ed584-0-P580-106076-0  Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11  time
Q31-P35-Q1079522-c82ed584-0-P582-774519-0  Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11  time
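The lookup works because a qualifier is an ordinary edge whose `node1` is the `id` of a claim edge. Here is a small sketch of that lookup in plain Python, using the three qualifier edges shown above; the grouping dictionary `by_edge` is a hypothetical helper, not a KGTK structure.

```python
from collections import defaultdict

claim_id = "Q31-P35-Q1079522-c82ed584-0"

# Qualifier edges as (id, node1, label, node2), copied from the output above.
qualifiers = [
    ("Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0", claim_id, "P39", "Q477406"),
    ("Q31-P35-Q1079522-c82ed584-0-P580-106076-0", claim_id, "P580", "^1831-02-25T00:00:00Z/11"),
    ("Q31-P35-Q1079522-c82ed584-0-P582-774519-0", claim_id, "P582", "^1831-07-20T00:00:00Z/11"),
]

# Group qualifiers by the claim edge they describe (their node1).
by_edge = defaultdict(list)
for _id, node1, label, node2 in qualifiers:
    by_edge[node1].append((label, node2))

for label, value in by_edge[claim_id]:
    print(label, value)
```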


Let's make them readable: the following query combines the patterns of the previous queries to retrieve the label of each qualifier property. The query omits the identifier of the qualifier edges to save space. The header of the additional column is arbitrary, i.e., you can name it whatever you want; the name used here follows a KGTK convention that enables KGTK to parse the output automatically, which is useful if we want to use the output as input to another KGTK command. The word before the `;` refers to one of the standard columns, and the name after the `;` refers to a property of that element. In this example, we used `label;label` because the column contains the label of the entity in the `label` column.

In [16]:
!$kypher -i qualifiers -i labels \
--match 'qual: (n1:`Q31-P35-Q1079522-c82ed584-0`)-[l {label: property}]->(n2), labels: (property)-[:label]->(property_label)' \
--return 'n1 as node1, property as label, n2 as node2, property_label as `label;label`' \
--limit 10 \
| column -t -s $'\t'

node1                        label  node2                     label;label
Q31-P35-Q1079522-c82ed584-0  P39    Q477406                   'position held'@en
Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11  'start time'@en
Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11  'end time'@en
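The naming convention for additional columns can be read mechanically. As a minimal sketch, the hypothetical helper `split_header` below separates a header into the standard column it refers to and the property it records; it is an illustration, not the logic KGTK itself uses.

```python
def split_header(name):
    """Split 'column;property' into (column, property); property is None if absent."""
    base, sep, prop = name.partition(";")
    return base, (prop if sep else None)


for header in ["node1", "label;label", "node2;wikidatatype"]:
    print(header, "->", split_header(header))
```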


Let's put all the values of `P35` in a file, which we will conveniently name `Q31.P35.tsv`:

In [17]:
!$kypher -i claims \
--match '(n1:Q31)-[l:P35]->(n2)' \
--return 'l as id, n1 as node1, l.label as label, n2 as node2' \
-o "$TEMP"/Q31.P35.tsv

Now we are going to combine the `P35` edges of Belgium with their qualifiers. To do this, we run a query that uses the edges we stored in `Q31.P35.tsv` and retrieves the qualifiers for each of those edges; the result of the query is the set of qualifier edges of the head of state edges. To combine the qualifier edges with the claim edges, we feed the output of the query to the `cat` command (concatenate), and then feed the result to the `sort2` command to sort the edges. The first 12 lines of the result are shown below. Each claim edge is followed by the qualifiers defined for it.

This snippet illustrates that KGTK commands can be chained using the `/` chain operator to compose more complex workflows.

In [18]:
!$kypher -i qualifiers -i "$TEMP"/Q31.P35.tsv \
--match 'P35: ()-[l]->(), qual: (l)-[lq]->(n2)' \
--return 'lq as id, l as node1, lq.label as label, n2 as node2' \
/ cat -i - -i "$TEMP"/Q31.P35.tsv \
/ sort2 \
| head -12 \
| column -t -s $'\t'

id                                         node1                        label  node2
Q31-P35-Q1079522-c82ed584-0                Q31                          P35    Q1079522
Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0  Q31-P35-Q1079522-c82ed584-0  P39    Q477406
Q31-P35-Q1079522-c82ed584-0-P580-106076-0  Q31-P35-Q1079522-c82ed584-0  P580   ^1831-02-25T00:00:00Z/11
Q31-P35-Q1079522-c82ed584-0-P582-774519-0  Q31-P35-Q1079522-c82ed584-0  P582   ^1831-07-20T00:00:00Z/11
Q31-P35-Q12967-f2b9aaf3-0                  Q31                          P35    Q12967
Q31-P35-Q12967-f2b9aaf3-0-P39-Q13592862-0  Q31-P35-Q12967-f2b9aaf3-0    P39    Q13592862
Q31-P35-Q12967-f2b9aaf3-0-P580-f29037-0    Q31-P35-Q12967-f2b9aaf3-0    P580   ^1865-12-17T00:00:00Z/11
Q31-P35-Q12967-f2b9aaf3-0-P582-136f02-0    Q31-P35-Q12967-f2b9aaf3-0    P582   ^1909-12-17T00:00:00Z/11
Q31-P35-Q12971-2088471b-0                  Q31                          P35    Q12971
Q31-P35-Q12971-2088471b-0-P39-Q13592862-0  Q31-P35-Q12971-20884
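The interleaving seen above follows from how the ids were assigned: a qualifier id starts with the id of its claim, so ordering by `id` places each claim directly before its qualifiers. The sketch below reproduces this with rows copied from the output; it assumes the result is ordered by the `id` column, which is consistent with the listing but is not a statement about the defaults of `sort2`.

```python
# Claim edges and qualifier edges as (id, node1, label, node2), from the output above.
claims = [
    ("Q31-P35-Q12967-f2b9aaf3-0", "Q31", "P35", "Q12967"),
    ("Q31-P35-Q1079522-c82ed584-0", "Q31", "P35", "Q1079522"),
]
qualifiers = [
    ("Q31-P35-Q12967-f2b9aaf3-0-P39-Q13592862-0", "Q31-P35-Q12967-f2b9aaf3-0", "P39", "Q13592862"),
    ("Q31-P35-Q1079522-c82ed584-0-P582-774519-0", "Q31-P35-Q1079522-c82ed584-0", "P582", "^1831-07-20T00:00:00Z/11"),
    ("Q31-P35-Q1079522-c82ed584-0-P39-Q477406-0", "Q31-P35-Q1079522-c82ed584-0", "P39", "Q477406"),
    ("Q31-P35-Q1079522-c82ed584-0-P580-106076-0", "Q31-P35-Q1079522-c82ed584-0", "P580", "^1831-02-25T00:00:00Z/11"),
]

# Concatenate the two edge lists, then order by the id column.
merged = sorted(claims + qualifiers, key=lambda edge: edge[0])

for edge in merged:
    print("\t".join(edge))
```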

## Summary

- KGTK represents graphs in TSV files with standard columns `id`, `node1`, `label` and `node2`
- It is possible to include arbitrary additional columns in KGTK files
- The identifier of an edge can be used as a node in another edge enabling the representation of edges about edges
- KGTK provides a powerful query command based on Cypher as well as a host of other commands; type `kgtk --help` to see the list of commands.