# Creating a subset of Wikidata

This notebook illustrates how to partition a Wikidata KGTK edges file.

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill partition-wikidata.ipynb partition-wikidata.out.ipynb \
-p wikidata_input_path /data4/rogers/elicit/cache/datasets/wikidata-20200803/data/everything.tsv.gz \
-p wikidata_parts_path /data4/rogers/elicit/cache/datasets/wikidata-20200803/parts
```

Here is a sample of the records that might appear in the input KGTK file:
```
id	node1	label	node2	rank	node2;wikidatatype	lang
Q1-P1036-418bc4-78f5a565-0	Q1	P1036	"113"	normal	external-id	
Q1-P1343-Q19190511-ab132b87-0   Q1      P1343   Q19190511       normal  wikibase-item   
Q1-P18-92a7b3-0dcac501-0        Q1      P18     "Hubble ultra deep field.jpg"   normal  commonsMedia    
Q1-P2386-cedfb0-0fdbd641-0      Q1      P2386   +880000000000000000000000Q828224        normal  quantity        
Q1-P580-a2fccf-63cf4743-0       Q1      P580    ^-13798000000-00-00T00:00:00Z/3 normal  time    
Q1-P920-47c0f2-52689c4e-0       Q1      P920    "LEM201201756"  normal  string  
Q1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0  Q1-P1343-Q19190511-ab132b87-0   P805    Q84065667               wikibase-item   
Q1-P1343-Q88672152-5080b9e2-0-P304-5724c3-0     Q1-P1343-Q88672152-5080b9e2-0   P304    "13-36"         string  
Q1-P2670-Q18343-030eb87e-0-P1107-ce87f8-0       Q1-P2670-Q18343-030eb87e-0      P1107   +0.70           quantity        
Q1-P793-Q273508-1900d69c-0-P585-a2fccf-0        Q1-P793-Q273508-1900d69c-0      P585    ^-13798000000-00-00T00:00:00Z/3         time    
P10-alias-en-282226-0   P10     alias   'gif'@en
P10-description-en      P10     description     'relevant video. For images, use the property P18. For film trailers, qualify with \"object has role\" (P3831)=\"trailer\" (Q622550)'@en                        en
P10-label-en    P10     label   'video'@en                      en
Q1-addl_wikipedia_sitelink-19e42a-0     Q1      addl_wikipedia_sitelink http://enwikiquote.org/wiki/Universe                    en
Q1-addl_wikipedia_sitelink-19e42a-0-language-0  Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-language       en                      en
Q1-addl_wikipedia_sitelink-19e42a-0-site-0      Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-site   enwikiquote                     en
Q1-addl_wikipedia_sitelink-19e42a-0-title-0     Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-title  "Universe"                      en
Q1-wikipedia_sitelink-5e459a-0  Q1      wikipedia_sitelink      http://en.wikipedia.org/wiki/Universe                   en
Q1-wikipedia_sitelink-5e459a-0-badge-Q17437798  Q1-wikipedia_sitelink-5e459a-0  sitelink-badge  Q17437798                       en
Q1-wikipedia_sitelink-5e459a-0-language-0       Q1-wikipedia_sitelink-5e459a-0  sitelink-language       en                      en
Q1-wikipedia_sitelink-5e459a-0-site-0   Q1-wikipedia_sitelink-5e459a-0  sitelink-site   enwiki                  en
Q1-wikipedia_sitelink-5e459a-0-title-0  Q1-wikipedia_sitelink-5e459a-0  sitelink-title  "Universe"                      en
```
Here are some constraints on the contents of the input file:
- The input file starts with a KGTK header record.
  - In addition to the `id`, `node1`, `label`, and `node2` columns, the file is expected to contain `rank`, `node2;wikidatatype`, and `lang` columns.
  - The `rank` column is not used in this script.
  - The `node2;wikidatatype` column is used to partition claims by Wikidata property datatype.
  - The `lang` column is used to extract English language sitelinks.
- The `id` column must contain a nonempty value.
  - It must follow certain patterns for claim and qualifier records.
    - Claim records contain 5 sections separated by hyphens (4 hyphens total).
    - Qualifier records contain 8 sections separated by hyphens (7 hyphens total).
- The first section of an `id` value must be the `node1` value for the record.
  - The qualifier extraction operations depend upon this constraint.
- In addition to the claims and qualifiers, the input file is expected to contain:
  - English language labels for all property entities appearing in the file.
- The input file ought to contain the following:
  - alias records in appropriate languages,
  - description records in appropriate languages,
  - label records in appropriate languages, and
  - sitelink records in appropriate languages.
- Additionally, this script provides for the appearance of `datatype` and `type` records.
  - `datatype` records map Wikidata property entities to Wikidata property datatypes. These records may provide an alternative to the `node2;wikidatatype` column in the future.
  - `type` records list all `entityId` values and identify them as properties or items. These records provide a correctness check on the operation of `kgtk import-wikidata`, and may be deprecated in the future.
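
The `id` constraints for claim and qualifier records can be illustrated with a small Python sketch. The helper names `is_claim_id` and `is_qualifier_id` are hypothetical and are not part of KGTK. The qualifier check additionally assumes, based on the sample records above, that a qualifier's `node1` value is the `id` of the claim it qualifies.

```python
def is_claim_id(edge_id: str, node1: str) -> bool:
    """True if the id has 5 hyphen-separated sections and starts with node1."""
    sections = edge_id.split("-")
    return len(sections) == 5 and sections[0] == node1


def is_qualifier_id(edge_id: str, node1: str) -> bool:
    """True if the id has 8 sections and its first 5 sections form node1.

    Assumption (taken from the sample records, not from a KGTK spec):
    a qualifier's node1 is the id of the claim being qualified.
    """
    sections = edge_id.split("-")
    return len(sections) == 8 and "-".join(sections[:5]) == node1


# Ids taken from the sample records shown earlier.
claim_id = "Q1-P1036-418bc4-78f5a565-0"
qual_id = "Q1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0"
print(is_claim_id(claim_id, "Q1"))
print(is_qualifier_id(qual_id, "Q1-P1343-Q19190511-ab132b87-0"))
```

This check only makes sense for claim and qualifier records. Alias, description, label, and sitelink records use other `id` layouts, as the sample shows.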

### Parameters for invoking the notebook

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `wikidata_input_path` | The file containing the Wikidata KGTK edges to partition. | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/data/everything.tsv.gz' |
| `wikidata_parts_path` | A folder containing the part files of Wikidata, including files such as `part.wikibase-item.tsv.gz` | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts' |
| `temp_folder_path` |    A folder that may be used for temporary files. | wikidata_parts_path + '/temp' |
| `gzip_command` |        The compression command for sorting. | 'pigz' |
| `sort_extras` |         Extra parameters for the sort program.  The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--temporary-directory ' + wikidata_parts_path |
| `unsorted_extension` |  The file extension for unsorted files. | 'unsorted.tsv.gz' |
| `sorted_extension` |    The file extension for sorted files. | 'tsv.gz' |
| `use_mgzip` |           When True, use the mgzip program where appropriate for faster compression. | 'True' |
| `verbose` |             When True, produce additional feedback messages. | 'True' |


In [1]:
# Parameters
wikidata_input_path = '/data4/rogers/elicit/cache/datasets/wikidata-20200803/data/everything.tsv.gz'
wikidata_parts_path = '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts2'
temp_folder_path =    wikidata_parts_path + '/temp'
gzip_command =        'pigz'
sort_extras =         '--temporary-directory ' + wikidata_parts_path
unsorted_extension =  'unsorted.tsv.gz'
sorted_extension =    'tsv.gz'
use_mgzip =           'True'
verbose =             'True'


### Import the Python modules we will use in this script.
Almost all of this script consists of shell commands, so all we need to import is `os`, which we use for setup.

In [2]:
import os

### Set up environment variables and folders that we need
Define environment variables to pass the script parameters to the KGTK commands.

In [3]:
# file containing wikidata edges.
os.environ['WIKIDATA_INPUT'] =     wikidata_input_path
# folder to receive wikidata broken down into smaller files.
os.environ['WIKIDATA_PARTS'] =     wikidata_parts_path
# temporary folder
os.environ['TEMP'] =               temp_folder_path
# kgtk command to run
# os.environ['kgtk'] =             "kgtk"
os.environ['kgtk'] =               "time kgtk --debug --timing"
# gzip command to run
os.environ['gzip'] =               gzip_command
# extra parameters for sort
os.environ['SORT_EXTRAS'] =        sort_extras
# The unsorted file extension.
os.environ['UNSORTED_EXTENSION'] = unsorted_extension
# The sorted file extension.
os.environ['SORTED_EXTENSION'] =   sorted_extension
# The use_mgzip flag.
os.environ['USE_MGZIP'] =          use_mgzip
# The verbose flag.
os.environ['VERBOSE'] =            verbose


### Create working folders and empty them

In [6]:
!mkdir -p $WIKIDATA_PARTS
!mkdir -p $TEMP

In [7]:
!rm -f $WIKIDATA_PARTS/*.tsv $WIKIDATA_PARTS/*.tsv.gz
!rm -f $TEMP/*.tsv $TEMP/*.tsv.gz

### Partition the Claims, Qualifiers, and Entity Data
Split out the entity data (alias, description, label, and sitelinks) and additional metadata (datatype, type).  Separate the qualifiers from the claims.

In [8]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --first-match-only --regex \
 --input-file $WIKIDATA_INPUT \
 -p '; ^datatype$ ;'      -o $TEMP/property.datatype.$UNSORTED_EXTENSION \
 -p '; ^alias$ ;'         -o $TEMP/part.alias.$UNSORTED_EXTENSION \
 -p '; ^description$ ;'   -o $TEMP/part.description.$UNSORTED_EXTENSION \
 -p '; ^label$ ;'         -o $TEMP/part.label.$UNSORTED_EXTENSION \
 -p '; ^(addl_wikipedia_sitelink|sitelink-badge|sitelink-language|sitelink-site|sitelink-title|wikipedia_sitelink)$ ;' \
                          -o $TEMP/part.wikipedia_sitelink.$UNSORTED_EXTENSION \
 -p '; ^type$ ;'          -o $TEMP/types.$UNSORTED_EXTENSION \
 -p '^.*-.*-.*-.*-.*$ ;;' -o $TEMP/part.qual.$UNSORTED_EXTENSION \
 --reject-file $TEMP/part.claims.$UNSORTED_EXTENSION
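
The routing performed by the filter command above can be sketched in plain Python. This illustrates the first-match-only logic only and is not how `kgtk filter` is implemented. The `RULES` table and the `route` function are hypothetical names, and the use of `re.search` is an assumption about how the anchored patterns are applied.

```python
import re

# (output name, column tested, regex), in the same order as the -p options.
RULES = [
    ("property.datatype", "label", r"^datatype$"),
    ("part.alias", "label", r"^alias$"),
    ("part.description", "label", r"^description$"),
    ("part.label", "label", r"^label$"),
    ("part.wikipedia_sitelink", "label",
     r"^(addl_wikipedia_sitelink|sitelink-badge|sitelink-language"
     r"|sitelink-site|sitelink-title|wikipedia_sitelink)$"),
    ("types", "label", r"^type$"),
    # Qualifiers: node1 is a claim id, which contains at least 4 hyphens.
    ("part.qual", "node1", r"^.*-.*-.*-.*-.*$"),
]


def route(row: dict) -> str:
    """Return the partition for a row; the first matching rule wins."""
    for name, column, pattern in RULES:
        if re.search(pattern, row[column]):
            return name
    # Rows that match no rule go to the reject file, i.e. the claims.
    return "part.claims"


print(route({"node1": "Q1", "label": "P1036"}))
print(route({"node1": "Q1-P1343-Q19190511-ab132b87-0", "label": "P805"}))
print(route({"node1": "P10", "label": "alias"}))
```

Because the entity-data rules come first, only rows that are neither entity data nor metadata reach the qualifier pattern, and everything left over is treated as a claim.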

### Sort the initial partitions.
Sort each of the initial partition files.

In [9]:
!$kgtk sort2 --verbose=$VERBOSE --gzip-command=$gzip \
 --input-file  $TEMP/property.datatype.$UNSORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/property.datatype.$SORTED_EXTENSION \
 --columns     id node1 label node2 \
 --extra       "$SORT_EXTRAS"

In [9]:
!$kgtk sort2 --verbose=$VERBOSE --gzip-command=$gzip \
 --input-file  $TEMP/part.alias.$UNSORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.alias.$SORTED_EXTENSION \
 --columns     id node1 label node2 \
 --extra       "$SORT_EXTRAS"

In [9]:
!$kgtk sort2 --verbose=$VERBOSE --gzip-command=$gzip \
 --input-file  $TEMP/part.description.$UNSORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.description.$SORTED_EXTENSION \
 --columns     id node1 label node2 \
 --extra       "$SORT_EXTRAS"

In [9]:
!$kgtk sort2 --verbose=$VERBOSE --gzip-command=$gzip \
 --input-file  $TEMP/part.label.$UNSORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.label.$SORTED_EXTENSION \
 --columns     id node1 label node2 \
 --extra       "$SORT_EXTRAS"

In [9]:
!$kgtk sort2 --verbose=$VERBOSE --gzip-command=$gzip \
 --input-file  $TEMP/part.wikipedia_sitelink.$UNSORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.wikipedia_sitelink.$SORTED_EXTENSION \
 --columns     id node1 label node2 \
 --extra       "$SORT_EXTRAS"

In [9]:
!$kgtk sort2 --verbose=$VERBOSE --gzip-command=$gzip \
 --input-file  $TEMP/types.$UNSORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/types.$SORTED_EXTENSION \
 --columns     id node1 label node2 \
 --extra       "$SORT_EXTRAS"

In [9]:
!$kgtk sort2 --verbose=$VERBOSE --gzip-command=$gzip \
 --input-file  $TEMP/part.qual.$UNSORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --columns     id node1 label node2 \
 --extra       "$SORT_EXTRAS"

In [9]:
!$kgtk sort2 --verbose=$VERBOSE --gzip-command=$gzip \
 --input-file  $TEMP/part.claims.$UNSORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.claims.$SORTED_EXTENSION \
 --columns     id node1 label node2 \
 --extra       "$SORT_EXTRAS"

### Extract the English aliases, descriptions, labels, and sitelinks.
Aliases, descriptions, and labels are extracted by selecting rows where the `node2` value ends in the language suffix for English (`@en`) in a KGTK language-qualified string. This is an abbreviated pattern; a more general pattern would include the single quotes used to delimit a KGTK language-qualified string. If `kgtk import-wikidata` has executed properly, the abbreviated pattern should be sufficient.

Sitelink rows do not have a language-specific marker in the `node2` value. We use the `lang` column to provide the language code for English ('en').  The `lang` column is an additional column created by `kgtk import-wikidata`.

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/part.alias.$SORTED_EXTENSION \
 -p ';; ^.*@en$' -o $WIKIDATA_PARTS/part.alias.en.$SORTED_EXTENSION

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/part.description.$SORTED_EXTENSION \
 -p ';; ^.*@en$' -o $WIKIDATA_PARTS/part.description.en.$SORTED_EXTENSION

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/part.label.$SORTED_EXTENSION \
 -p ';; ^.*@en$' -o $WIKIDATA_PARTS/part.label.en.$SORTED_EXTENSION

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --obj=lang \
 --input-file $WIKIDATA_PARTS/part.wikipedia_sitelink.$SORTED_EXTENSION \
 -p ';; en' -o $WIKIDATA_PARTS/part.wikipedia_sitelink.en.$SORTED_EXTENSION
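
To make the difference between the abbreviated pattern and a stricter one concrete, here is a minimal Python sketch. The stricter regex `STRICT` is an illustrative guess at what "including the single quotes" could look like; it is not taken from KGTK. The helper names are hypothetical.

```python
import re

# The abbreviated pattern used in the filter commands above.
ABBREVIATED = re.compile(r"^.*@en$")
# Hypothetical stricter variant: single-quoted text followed by @en.
STRICT = re.compile(r"^'.*'@en$")


def is_english(node2: str, pattern: re.Pattern = ABBREVIATED) -> bool:
    """True if the node2 value matches the given English-language pattern."""
    return pattern.match(node2) is not None


def is_english_sitelink(row: dict) -> bool:
    """Sitelinks carry no language marker in node2, so test the lang column."""
    return row.get("lang", "") == "en"


print(is_english("'video'@en"))
print(is_english("'Video'@de"))
print(is_english("user@en"), is_english("user@en", STRICT))
```

The last line shows the kind of value on which the two patterns disagree: an unquoted value that happens to end in `@en`.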

### Extract the Entity list
Create `part.claims.entities`.  This is a single-column KGTK node file that contains a list of all the Wikidata `entityId` values in the `node1` column of the claims file. Wikidata items have `entityId` values that start with `Q`, while Wikidata properties have `entityId` values that start with `P`.

In [9]:
!$kgtk unique --verbose=$VERBOSE --format=node-only --use-mgzip=$USE_MGZIP \
 --input-file  $WIKIDATA_PARTS/part.claims.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.claims.entities.$SORTED_EXTENSION \
 --column      node1
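
As a rough Python equivalent of this step, the sketch below collects the distinct `node1` values and classifies each by its first letter. The function names are hypothetical, and rows are modeled as plain dictionaries for illustration.

```python
def entity_ids(rows):
    """Return the sorted, de-duplicated node1 values of the given rows."""
    return sorted({row["node1"] for row in rows})


def entity_kind(entity_id: str) -> str:
    """Classify an entityId as an item (Q...) or a property (P...)."""
    if entity_id.startswith("Q"):
        return "item"
    if entity_id.startswith("P"):
        return "property"
    return "unknown"


rows = [{"node1": "Q1"}, {"node1": "P10"}, {"node1": "Q1"}]
for entity in entity_ids(rows):
    print(entity, entity_kind(entity))
```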

### Count the number of claims per Wikidata datatype

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file  $WIKIDATA_PARTS/part.claims.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.claims.datatypes.$SORTED_EXTENSION \
 --column      'node2;wikidatatype'
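
Conceptually, this step tallies how often each value occurs in the `node2;wikidatatype` column. A minimal sketch with `collections.Counter`, using a hypothetical `count_values` helper and a few rows modeled on the sample records:

```python
from collections import Counter


def count_values(rows, column):
    """Count how many rows carry each distinct value in the given column."""
    return Counter(row[column] for row in rows)


rows = [
    {"node2;wikidatatype": "external-id"},
    {"node2;wikidatatype": "wikibase-item"},
    {"node2;wikidatatype": "quantity"},
    {"node2;wikidatatype": "quantity"},
]
for datatype, count in sorted(count_values(rows, "node2;wikidatatype").items()):
    print(datatype, count)
```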

### Partition the claims by Wikidata Property Datatype
Wikidata has two names for each Wikidata property datatype: the name that appears in the JSON dump file, and the name that appears in the TTL dump file. `kgtk import-wikidata` currently imports rows from Wikidata JSON dump files, so the JSON names are the ones that appear below.

The `part.other` file catches any records that have an unknown Wikidata property datatype. Additional Wikidata property datatypes may occur when processing from certain Wikidata extensions.

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --first-match-only \
 --input-file $WIKIDATA_PARTS/part.claims.$SORTED_EXTENSION \
 --obj 'node2;wikidatatype' \
 -p ';; commonsMedia'      -o $WIKIDATA_PARTS/part.commonsMedia.$SORTED_EXTENSION \
 -p ';; external-id'       -o $WIKIDATA_PARTS/part.external-id.$SORTED_EXTENSION \
 -p ';; geo-shape'         -o $WIKIDATA_PARTS/part.geo-shape.$SORTED_EXTENSION \
 -p ';; globe-coordinate'  -o $WIKIDATA_PARTS/part.globe-coordinate.$SORTED_EXTENSION \
 -p ';; math'              -o $WIKIDATA_PARTS/part.math.$SORTED_EXTENSION \
 -p ';; monolingualtext'   -o $WIKIDATA_PARTS/part.monolingualtext.$SORTED_EXTENSION \
 -p ';; musical-notation'  -o $WIKIDATA_PARTS/part.musical-notation.$SORTED_EXTENSION \
 -p ';; quantity'          -o $WIKIDATA_PARTS/part.quantity.$SORTED_EXTENSION \
 -p ';; string'            -o $WIKIDATA_PARTS/part.string.$SORTED_EXTENSION \
 -p ';; tabular-data'      -o $WIKIDATA_PARTS/part.tabular-data.$SORTED_EXTENSION \
 -p ';; time'              -o $WIKIDATA_PARTS/part.time.$SORTED_EXTENSION \
 -p ';; url'               -o $WIKIDATA_PARTS/part.url.$SORTED_EXTENSION \
 -p ';; wikibase-form'     -o $WIKIDATA_PARTS/part.wikibase-form.$SORTED_EXTENSION \
 -p ';; wikibase-item'     -o $WIKIDATA_PARTS/part.wikibase-item.$SORTED_EXTENSION \
 -p ';; wikibase-lexeme'   -o $WIKIDATA_PARTS/part.wikibase-lexeme.$SORTED_EXTENSION \
 -p ';; wikibase-property' -o $WIKIDATA_PARTS/part.wikibase-property.$SORTED_EXTENSION \
 -p ';; wikibase-sense'    -o $WIKIDATA_PARTS/part.wikibase-sense.$SORTED_EXTENSION \
 --reject-file $WIKIDATA_PARTS/part.other.$SORTED_EXTENSION
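
The mapping from datatype to output file, including the `part.other` fallback, can be summarized in a few lines of Python. The set below lists the datatype names used in the command above; the `partition_name` helper is hypothetical.

```python
# Datatype names as they appear in the filter command above.
KNOWN_DATATYPES = {
    "commonsMedia", "external-id", "geo-shape", "globe-coordinate", "math",
    "monolingualtext", "musical-notation", "quantity", "string",
    "tabular-data", "time", "url", "wikibase-form", "wikibase-item",
    "wikibase-lexeme", "wikibase-property", "wikibase-sense",
}


def partition_name(datatype: str) -> str:
    """Return the partition for a claim, falling back to part.other."""
    if datatype in KNOWN_DATATYPES:
        return f"part.{datatype}"
    return "part.other"


print(partition_name("quantity"))
print(partition_name("some-new-datatype"))
```

Inspecting `part.other` after a run is a simple way to notice datatypes that are not yet covered by the list.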

### Partition the qualifiers
Extract the qualifier records for each of the Wikidata property datatype partition files.

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.commonsMedia.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.commonsMedia.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.external-id.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.external-id.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.geo-shape.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.geo-shape.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.globe-coordinate.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.globe-coordinate.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.math.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.math.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.monolingualtext.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.monolingualtext.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.musical-notation.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.musical-notation.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.quantity.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.quantity.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.string.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.string.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.tabular-data.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.tabular-data.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.time.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.time.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.url.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.url.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.wikibase-form.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.wikibase-form.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.wikibase-item.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.wikibase-item.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.wikibase-lexeme.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.wikibase-lexeme.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.wikibase-property.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.wikibase-property.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id

In [9]:
!$kgtk ifexists --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --presorted \
 --input-file  $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 --filter-on   $WIKIDATA_PARTS/part.wikibase-sense.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.wikibase-sense.qual.$SORTED_EXTENSION \
 --input-keys  node1 \
 --filter-keys id
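
Each of the `ifexists` cells above keeps the qualifier rows whose `node1` value appears as an `id` in the corresponding claims partition. The Python sketch below shows that semi-join in its simplest form. It builds an in-memory set of keys, which is an assumption made for brevity; it does not reproduce the streaming behavior that `--presorted` is meant to enable. The function name is hypothetical.

```python
def keep_if_exists(input_rows, filter_rows, input_key="node1", filter_key="id"):
    """Keep input rows whose input_key value occurs as filter_key in filter_rows."""
    keys = {row[filter_key] for row in filter_rows}
    return [row for row in input_rows if row[input_key] in keys]


# Rows modeled on the sample records.
claims_wikibase_item = [
    {"id": "Q1-P1343-Q19190511-ab132b87-0", "node1": "Q1"},
]
qualifiers = [
    {"id": "Q1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0",
     "node1": "Q1-P1343-Q19190511-ab132b87-0"},
    {"id": "Q1-P2670-Q18343-030eb87e-0-P1107-ce87f8-0",
     "node1": "Q1-P2670-Q18343-030eb87e-0"},
]
for row in keep_if_exists(qualifiers, claims_wikibase_item):
    print(row["id"])
```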

### Extract the Property claims
Extract the claims that are made about Wikidata properties.  Then, extract the qualifiers for those claims.

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/part.claims.$SORTED_EXTENSION \
 -p '^P.*$ ;;' -o $WIKIDATA_PARTS/part.property.$SORTED_EXTENSION

In [9]:
!$kgtk filter --verbose=$VERBOSE --use-mgzip=$USE_MGZIP --regex \
 --input-file $WIKIDATA_PARTS/part.qual.$SORTED_EXTENSION \
 -p '^P.*$ ;;' -o $WIKIDATA_PARTS/part.property.qual.$SORTED_EXTENSION
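
The same pattern works for both files because a qualifier's `node1` value is a claim `id`, and a claim `id` begins with the entity it describes. A short sketch, with a hypothetical helper name and a made-up qualifier subject for illustration:

```python
import re

# The node1 pattern used in both filter commands above.
PROPERTY_SUBJECT = re.compile(r"^P.*$")


def is_about_property(row: dict) -> bool:
    """True if the row's node1 value starts with P."""
    return PROPERTY_SUBJECT.match(row["node1"]) is not None


print(is_about_property({"node1": "P10"}))   # a claim about property P10
print(is_about_property({"node1": "Q1"}))    # a claim about item Q1
# Hypothetical qualifier row: node1 is the id of a claim about P10.
print(is_about_property({"node1": "P10-P31-Q18610173-85ef4d24-0"}))
```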

### Count the number of property claims per Wikidata property datatype.

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file  $WIKIDATA_PARTS/part.property.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.property.datatypes.$SORTED_EXTENSION \
 --column      'node2;wikidatatype'

### Count and label the property claims.
Count the number of claims per property, then lift the English label for each property.

In [9]:
!$kgtk unique --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file  $WIKIDATA_PARTS/part.property.$SORTED_EXTENSION \
 --output-file $WIKIDATA_PARTS/part.property.counts.$SORTED_EXTENSION \
 --column      label \
 --label       total-count

In [9]:
!$kgtk lift --verbose=$VERBOSE --use-mgzip=$USE_MGZIP \
 --input-file       $WIKIDATA_PARTS/part.property.counts.$SORTED_EXTENSION \
 --input-file       $WIKIDATA_PARTS/part.label.en.$SORTED_EXTENSION \
 --output-file      $WIKIDATA_PARTS/part.property.counts-with-labels.$SORTED_EXTENSION \
 --columns-to-lift  node1 \
 --prefilter-labels
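
The effect of the lift step can be pictured as a lookup that attaches an English label to each counted property. The sketch below is a simplified illustration, not KGTK's implementation. The function name is hypothetical, and the output column name `node1;label` is an assumption about what `kgtk lift` produces by default. The count value is made up.

```python
def lift_node1_labels(count_rows, label_rows):
    """Add a 'node1;label' column to each count row using the label edges."""
    labels = {
        row["node1"]: row["node2"]
        for row in label_rows
        if row["label"] == "label"
    }
    # Rows without a known label get an empty value.
    return [
        {**row, "node1;label": labels.get(row["node1"], "")}
        for row in count_rows
    ]


counts = [{"node1": "P10", "label": "total-count", "node2": "3"}]  # made-up count
labels_en = [{"node1": "P10", "label": "label", "node2": "'video'@en"}]
for row in lift_node1_labels(counts, labels_en):
    print(row["node1"], row["node2"], row["node1;label"])
```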