# Compute Graph and Text Embeddings, Elasticsearch-Ready KGTK Files and RDF Triples for Blazegraph

This notebook computes the following:

- `complEx` graph embeddings
- `transE` graph embeddings
- `BERT` text embeddings
- `elasticsearch` ready KGTK edge file for [KGTK Search](https://kgtk.isi.edu/search/)
- `elasticsearch` ready KGTK edge file for Table Linker
- `RDF Triples` to be loaded into Blazegraph

Inputs:

- `item_file`: the subset of the `claims_file` consisting of edges for properties of data type `wikibase-item`
- `label_file`, `alias_file` and `description_file`: files containing labels, aliases and descriptions. It is assumed that these files cover all nodes appearing in the claims file. Users may provide these files for specific languages only.
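
A quick sanity check on the inputs can catch a wrong file before a long run. The sketch below assumes the usual KGTK edge-file layout, a tab-separated header that includes `node1`, `label` and `node2`. The helper name `missing_kgtk_columns` is hypothetical and not part of KGTK.

```python
import gzip

# Assumption: KGTK edge files carry at least these three columns in the header.
REQUIRED_COLUMNS = {"node1", "label", "node2"}


def missing_kgtk_columns(path):
    """Return the required edge columns that are absent from the file header."""
    # Open gzip-compressed and plain TSV files the same way, in text mode.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
    return sorted(REQUIRED_COLUMNS - set(header))
```

An empty list means the header looks like an edge file; anything else names the columns to investigate.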


### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

```
papermill 'Embeddings-Elasticsearch-&-Triples.ipynb' 'Embeddings-Elasticsearch-&-Triples.out.ipynb' \
-p claims_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v4/part.property.wikibase-item.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p languages es,ru,zh-cn
```
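
Paths with spaces and the `&` in the notebook name are easy to get wrong in a shell. As a minimal sketch, the same invocation can be assembled in Python as an argument list, which avoids shell quoting altogether. The function name `papermill_command` is hypothetical, and the parameter values are copied from the example above.

```python
import shlex


def papermill_command(notebook, output_notebook, parameters):
    """Build the papermill argument list, one `-p name value` triple per parameter."""
    argv = ["papermill", notebook, output_notebook]
    for name, value in parameters.items():
        argv += ["-p", name, str(value)]
    return argv


argv = papermill_command(
    "Embeddings-Elasticsearch-&-Triples.ipynb",
    "Embeddings-Elasticsearch-&-Triples.out.ipynb",
    {
        "output_folder": "useful_files_v4",
        "temp_folder": "temp.useful_files_v4",
        "delete_database": "no",
        "languages": "es,ru,zh-cn",
    },
)

# Print a shell-safe rendering for inspection; shlex.join quotes where needed.
print(shlex.join(argv))
```

The list can be passed directly to `subprocess.run(argv, check=True)` once the remaining parameters are filled in.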

In [4]:
# Parameters
wikidata_root_folder = "/data/amandeep/wikidata-20210215-dwd-v2"
items_file = "claims.wikibase-item.tsv.gz"
all_sorted_file = "all.sorted.tsv.gz"
en_labels_file = "labels.en.tsv.gz"

In [None]:
import os

In [8]:
# Expose the paths and the kgtk command to the shell cells below
os.environ['OUT'] = wikidata_root_folder
os.environ['kgtk'] = "kgtk --debug"
os.environ['ITEMS'] = f"{wikidata_root_folder}/{items_file}"
os.environ['ALL'] = f"{wikidata_root_folder}/{all_sorted_file}"
os.environ['LABELS_EN'] = f"{wikidata_root_folder}/{en_labels_file}"


## Graph Embeddings

### complEx

In [6]:
complex_temp_folder = f"{wikidata_root_folder}/temp.graph-embeddings.complex"

In [7]:
!mkdir -p {complex_temp_folder}

In [None]:
os.environ['TEMP_COMPLEX'] = complex_temp_folder

In [None]:
!kgtk graph-embeddings --verbose -i "$ITEMS" \
-o $OUT/wikidatadwd.complEx.graph-embeddings.txt \
--retain_temporary_data True \
--operator ComplEx \
--workers 24 \
--log $TEMP_COMPLEX/ge.complex.log \
-T $TEMP_COMPLEX \
-ot w2v \
-e 600

### transE

In [None]:
transe_temp_folder = f"{wikidata_root_folder}/temp.graph-embeddings.transe"

In [None]:
!mkdir -p {transe_temp_folder}

In [None]:
os.environ['TEMP_TRANSE'] = transe_temp_folder

In [None]:
!$kgtk graph-embeddings --verbose -i "$ITEMS" \
-o $OUT/wikidatadwd.transE.graph-embeddings.txt \
--retain_temporary_data True \
--operator TransE \
--workers 24 \
--log $TEMP_TRANSE/ge.transE.log \
-T $TEMP_TRANSE \
-ot w2v \
-e 600
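
Both runs above write their output with `-ot w2v`. A minimal sketch for reading such a file back is shown below. It assumes the common word2vec text layout: a first line holding the vector count and dimension, then one line per node with the node id followed by space-separated floats. That layout is an assumption about the output and should be checked against the generated file; `read_w2v_embeddings` is a hypothetical helper, not a KGTK function.

```python
def read_w2v_embeddings(path):
    """Yield (node, vector) pairs from a word2vec-style text file."""
    with open(path, encoding="utf-8") as handle:
        # Assumed header line: "<number of vectors> <dimension>"
        _count, dimension = (int(field) for field in handle.readline().split())
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            node = fields[0]
            vector = [float(value) for value in fields[1:]]
            if len(vector) != dimension:
                raise ValueError(
                    f"{node}: expected {dimension} values, got {len(vector)}"
                )
            yield node, vector
```

Because the function is a generator, a large embeddings file can be streamed without loading every vector into memory at once.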

## BERT Text Embeddings

In [None]:
!kgtk text-embedding -i $ALL \
--model roberta-large-nli-mean-tokens \
--property-labels-file $LABELS_EN \
--isa-properties P31 P279 P106 P39 P1382 P373 P452 \
--save-embedding-sentence > $OUT/wikidatadwd-text-embeddings-all.tsv