# Computes Graph and Text Embeddings, Elasticsearch Ready KGTK File and RDF Triples for Blazegraph

This notebook computes the following:

- `complEx` graph embeddings
- `transE` graph embeddings
- `BERT` text embeddings
- `elasticsearch` ready KGTK edge file for [KGTK Search](https://kgtk.isi.edu/search/)
- `elasticsearch` ready KGTK edge file for Table Linker
- `RDF Triples` to be loaded into Blazegraph

Inputs:

- `item_file`: the subset of the `claims_file` consisting of edges for properties of data type `wikibase-item`
- `label_file`, `alias_file` and `description_file`: files containing labels, aliases and descriptions. It is assumed that these files contain the labels, aliases and descriptions of all nodes appearing in the claims file. Users may provide these files for specific languages only.


### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

```
papermill 'Embeddings-Elasticsearch-&-Triples.ipynb' 'Embeddings-Elasticsearch-&-Triples.out.ipynb' \
-p claims_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.property.wikibase-item.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p languages es,ru,zh-cn
```
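
Because the notebook file name contains `&`, it can be easier to launch the same run from Python and pass the arguments as a list, so that no shell quoting is involved. The sketch below only assembles the argument list; `build_papermill_command` is an illustrative helper, not part of papermill or KGTK, and only a few of the parameters from the command above are shown.

```python
import subprocess


def build_papermill_command(notebook, output_notebook, parameters):
    """Return the papermill argument list for the given parameters."""
    command = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        # Each parameter becomes "-p <name> <value>", as in the shell example.
        command += ["-p", name, str(value)]
    return command


command = build_papermill_command(
    "Embeddings-Elasticsearch-&-Triples.ipynb",
    "Embeddings-Elasticsearch-&-Triples.out.ipynb",
    {
        "output_folder": "useful_files_v4",
        "temp_folder": "temp.useful_files_v4",
        "delete_database": "no",
        "languages": "es,ru,zh-cn",
    },
)
print(command)

# To actually launch the run (requires papermill to be installed):
# subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` bypasses the shell, so `&` and spaces in paths need no escaping.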

In [1]:
import os
import sys

import pandas as pd
 
from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

In [None]:
# Parameters

input_path = "/data/amandeep/wikidata-20211027-dwd-v3"
output_path = "/data/amandeep/wikidata-20211027-dwd-v3"
kgtk_path = "/Users/amandeep/github/kgtk"

graph_cache_path = None

project_name = "embeddings-elasticsearch-triples"

languages = 'en,ru,es,zh-cn,de,it,nl,pl,fr,pt,sv'

files = 'label_all,alias_all,description_all'

In [None]:
files = files.split(',')

In [None]:
languages = [f"'{x}'" for x in languages.split(",")]
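
Each language code is wrapped in single quotes because the list is later interpolated into a `--where` clause that is itself single-quoted. Assuming the command string is tokenized with POSIX shell quoting rules (an assumption about how `kypher` runs the command), the inner quotes close and reopen the outer quoting, and the query receives a plain list of double-quoted strings. The sketch below uses `shlex` to show the effect; `lang_codes` and `where_arg` are illustrative names.

```python
import shlex

# Same quoting as the cell above, on a shorter language list.
lang_codes = [f"'{x}'" for x in "en,ru".split(",")]

# Interpolating the list uses its repr: ["'en'", "'ru'"]
where_arg = f"--where 'n2.kgtk_lqstring_lang_suffix IN {lang_codes}'"

# shlex.split applies POSIX shell quoting rules to the string.
tokens = shlex.split(where_arg)
print(tokens[1])
# → n2.kgtk_lqstring_lang_suffix IN ["en", "ru"]
```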

In [None]:
ck = ConfigureKGTK(files, kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name,
                 graph_cache_path=graph_cache_path)

In [None]:
ck.print_env_variables()

In [None]:
if graph_cache_path is None:
    ck.load_files_into_cache()

## Filter the labels in user provided languages

In [None]:
kypher(f"""-i label_all 
            -o $OUT/labels.filtered.tsv.gz 
            --match '(n1)-[l:label]->(n2)' 
            --where 'n2.kgtk_lqstring_lang_suffix IN {languages}' 
            --return 'n1, l.label, n2, l.id'
            """)


## Filter the aliases in user provided languages

In [None]:
kypher(f""" -i alias_all 
            -o $OUT/aliases.filtered.tsv.gz 
            --match '(n1)-[l:alias]->(n2)' 
            --where 'n2.kgtk_lqstring_lang_suffix IN {languages}' 
            --return 'n1, l.label, n2, l.id'
            """)

## Filter the descriptions in user provided languages

In [None]:
kypher(f""" -i description_all 
            -o $OUT/descriptions.filtered.tsv.gz 
            --match '(n1)-[l:description]->(n2)'
            --where 'n2.kgtk_lqstring_lang_suffix IN {languages}' 
            --return 'n1, l.label, n2, l.id'
            """)

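The three filters above keep only edges whose `node2` value carries one of the requested language tags. As a rough illustration of that selection, the sketch below applies the same idea to a small in-memory edge file using only the standard library. It assumes language-qualified strings of the form `'text'@lang`; the sample rows, the edge ids and the `language_of` helper are made up for the example.

```python
import csv
import io

# Hypothetical sample with the columns returned by the queries above.
sample = (
    "node1\tlabel\tnode2\tid\n"
    "Q42\tlabel\t'Douglas Adams'@en\tQ42-label-en\n"
    "Q42\tlabel\t'Дуглас Адамс'@ru\tQ42-label-ru\n"
    "Q42\tlabel\t'ダグラス・アダムズ'@ja\tQ42-label-ja\n"
)


def language_of(lq_string):
    """Return the language tag after the closing quote, or None if absent."""
    text, sep, lang = lq_string.rpartition("@")
    return lang if sep and text.endswith("'") else None


wanted = {"en", "ru"}
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
kept = [row for row in reader if language_of(row["node2"]) in wanted]

for row in kept:
    print(row["node1"], row["node2"])
```
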

## Graph Embeddings

### complEx

In [6]:
# Temporary working folder for the ComplEx run, kept under output_path
complex_temp_folder = f"{output_path}/temp.graph-embeddings.complex"

In [7]:
!mkdir -p {complex_temp_folder}

In [None]:
os.environ['TEMP_COMPLEX'] = complex_temp_folder

In [None]:
!kgtk graph-embeddings --verbose -i "$ITEMS" \
-o $OUT/wikidatadwd.complEx.graph-embeddings.txt \
--retain_temporary_data True \
--operator ComplEx \
--workers 24 \
--log $TEMP_COMPLEX/ge.complex.log \
-T $TEMP_COMPLEX \
-ot w2v \
-e 600

### transE

In [None]:
# Temporary working folder for the TransE run, kept under output_path
transe_temp_folder = f"{output_path}/temp.graph-embeddings.transe"

In [None]:
!mkdir -p {transe_temp_folder}

In [None]:
os.environ['TEMP_TRANSE'] = transe_temp_folder

In [None]:
!$kgtk graph-embeddings --verbose -i "$ITEMS" \
-o $OUT/wikidatadwd.transE.graph-embeddings.txt \
--retain_temporary_data True \
--operator TransE \
--workers 24 \
--log $TEMP_TRANSE/ge.transE.log \
-T $TEMP_TRANSE \
-ot w2v \
-e 600
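
Both graph embedding commands write their vectors with `-ot w2v`. Assuming this follows the usual word2vec text layout (a first line with the vector count and dimension, then one line per node with the node id followed by its values), the output can be read with a few lines of Python. This is a sketch under that assumption; `read_w2v` is a hypothetical helper and the sample values are made up.

```python
import io


def read_w2v(handle):
    """Read word2vec text format into a dict of node id -> list of floats."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines at the end of the file
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values for {parts[0]!r}")
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header announced {count} vectors, read {len(vectors)}")
    return vectors


# Tiny in-memory stand-in for an embeddings file.
sample = io.StringIO("2 3\nQ1 0.1 0.2 0.3\nQ2 -1.0 0.0 1.0\n")
vectors = read_w2v(sample)
print(sorted(vectors))
```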

### BERT Embeddings

In [None]:
!$kgtk text-embedding -i $ALL   \
--model roberta-large-nli-mean-tokens   \
--property-labels-file $LABELS_EN   \
--isa-properties P31 P279 P106 P39 P1382 P373 P452 \
--save-embedding-sentence > $OUT/wikidatadwd-text-embeddings-all.tsv

### Build KGTK edge file for KGTK Search

In [None]:
kgtk("""cat -i $GRAPH/all.tsv.gz 
            -i $GRAPH/derived.isastar.tsv.gz 
            -i $GRAPH/metadata.property.datatypes.tsv.gz 
            -i $GRAPH/metadata.pagerank.undirected.tsv.gz
            -i $GRAPH/metadata.pagerank.directed.tsv.gz
            -o $OUT/wikidata.dwd.all.kgtk.search.unsorted.tsv.gz""")

In [None]:
kgtk(f"""sort -i $OUT/wikidata.dwd.all.kgtk.search.unsorted.tsv.gz
            --columns node1
            --extra '--parallel 24 --buffer-size 30% --temporary-directory {os.environ['TEMP']}'
            -o $OUT/wikidata.dwd.all.kgtk.search.sorted.tsv.gz""")

In [None]:
kgtk("""build-kgtk-search-input --input-file "$OUT"/wikidata.dwd.all.kgtk.search.sorted.tsv.gz
--output-file "$OUT"/wikidata.dwd.all.kgtk.search.sorted.jl 
--label-properties label 
--alias-properties alias 
--extra-alias-properties P1448,P1705,P1477,P1810,P742,P1449 
--description-properties description 
--pagerank-properties Pdirected_pagerank 
--mapping-file "$OUT"/wikidata_dwd_v3_mapping.json 
--property-datatype-file "$OUT"/metadata.property.datatypes.tsv.gz""")
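
The sort on `node1` places all edges of a node on consecutive lines, which lets a consumer build one record per node in a single streaming pass. The sketch below illustrates that idea with `itertools.groupby` on a few in-memory rows. It is not the `build-kgtk-search-input` implementation, and the record fields and sample edges are hypothetical.

```python
import json
from itertools import groupby
from operator import itemgetter

# (node1, label, node2) rows, already sorted by node1.
edges = [
    ("Q1", "alias", "'cosmos'@en"),
    ("Q1", "label", "'universe'@en"),
    ("Q2", "label", "'Earth'@en"),
]


def records(sorted_edges):
    """Yield one dict per node; requires the input to be sorted by node1."""
    for node, group in groupby(sorted_edges, key=itemgetter(0)):
        record = {"id": node, "labels": [], "aliases": []}
        for _, prop, value in group:
            if prop == "label":
                record["labels"].append(value)
            elif prop == "alias":
                record["aliases"].append(value)
        yield record


docs = list(records(edges))
for doc in docs:
    # One JSON object per line, matching the .jl (JSON lines) extension.
    print(json.dumps(doc))
```

On unsorted input, `groupby` would emit several partial records for the same node, which is why the sort step comes first.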

### Build KGTK edge file for Triple generation


In [None]:
!$kgtk cat \
-i $OUT/wikidata.dwd.all.kgtk.search.sorted.tsv.gz \
-i $OUT/derived.isa.tsv.gz \
-i $OUT/derived.P279star.tsv.gz \
-i $OUT/metadata.in_degree.tsv.gz \
-i $OUT/metadata.out_degree.tsv.gz \
-o $TEMP/wikidata.dwd.all.kgtk.triples.1.tsv.gz

In [None]:
!$kgtk add-id -i $TEMP/wikidata.dwd.all.kgtk.triples.1.tsv.gz \
--id-style wikidata \
-o $TEMP/wikidata.dwd.all.kgtk.triples.2.tsv.gz

In [None]:
!$kgtk sort -i $TEMP/wikidata.dwd.all.kgtk.triples.2.tsv.gz \
--columns node1 \
--extra "--parallel 24 --buffer-size 30% --temporary-directory $TEMP" \
-o $OUT/wikidata.dwd.all.kgtk.triples.sorted.tsv.gz

Split the triples file to parallelize triple generation

In [None]:
!mkdir -p $OUT/kgtk_triples_split

In [None]:
!$kgtk split -i $OUT/wikidata.dwd.all.kgtk.triples.sorted.tsv.gz \
--output-path $OUT/kgtk_triples_split \
--gzipped-output --lines 10000000 \
--file-prefix kgtk_triples
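
Splitting by line count gives each worker an independent, similarly sized file. A rough sketch of the idea follows, assuming every chunk repeats the header row so that it remains a valid edge file on its own (an assumption about the split output, not confirmed here). `split_rows` is a hypothetical helper and the rows are placeholders.

```python
def split_rows(header, rows, lines_per_chunk):
    """Yield lists of lines, each starting with the header row."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == lines_per_chunk:
            yield [header] + chunk
            chunk = []
    if chunk:
        # Final, possibly shorter, chunk.
        yield [header] + chunk


header = "id\tnode1\tlabel\tnode2"
rows = [f"e{i}\tQ{i}\tP31\tQ5" for i in range(5)]
chunks = list(split_rows(header, rows, lines_per_chunk=2))
print([len(chunk) for chunk in chunks])
# → [3, 3, 2]
```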

In [None]:
!curl https://raw.githubusercontent.com/usc-isi-i2/kgtk/dev/kgtk-properties/kgtk.properties.tsv -o $TEMP/kgtk-properties.tsv

In [None]:
!$kgtk filter -p ";data_type;" -i $TEMP/kgtk-properties.tsv -o $TEMP/kgtk-properties.datatype.tsv.gz

In [1]:
!$kgtk cat -i $TEMP/kgtk-properties.datatype.tsv.gz $OUT/metadata.property.datatypes.tsv.gz -o $OUT/metadata.property.datatypes.augmented.tsv.gz


In [None]:
!ls $OUT/kgtk_triples_split/*.tsv.gz | parallel -j 18  'kgtk --debug generate-wikidata-triples -lp label -ap alias -dp description -pf $OUT/metadata.property.datatypes.augmented.tsv.gz --output-n-lines 100000 --generate-truthy --warning --use-id --log-path $TEMP/generate_triples_log.txt --error-action log -i {} -o {.}.ttl'
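
In the `parallel` command, `{}` stands for each input path and `{.}` for the same path with its last extension removed, so an input ending in `.tsv.gz` produces an output ending in `.tsv.ttl`. The sketch below reproduces that naming in Python and builds one argument list per split file. It only prepares the commands; `triple_command` and the example paths are hypothetical, and the options are a shortened subset of the ones used above.

```python
import os


def triple_command(input_path, property_file):
    """Build the argument list for one split file, mirroring `{}` and `{.}`.ttl."""
    # splitext drops only the last extension: "x.tsv.gz" -> "x.tsv"
    stem, _last_ext = os.path.splitext(input_path)
    output_path = stem + ".ttl"
    return [
        "kgtk", "generate-wikidata-triples",
        "-lp", "label", "-ap", "alias", "-dp", "description",
        "-pf", property_file,
        "--generate-truthy", "--use-id",
        "-i", input_path,
        "-o", output_path,
    ]


# Hypothetical split file names, for illustration only.
inputs = ["split/part_1.tsv.gz", "split/part_2.tsv.gz"]
commands = [triple_command(p, "metadata.property.datatypes.augmented.tsv.gz") for p in inputs]
for command in commands:
    print(command[-1])
```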

