# Wikifier Docker Runbook

## Set up Elasticsearch cluster on machines wikibase02, wikibase03 and wikibase04

  1. On all machines

    a. edit /etc/sysctl.conf and set vm.max_map_count=262144
    
    b. sudo sysctl --system
    
    c. sudo chmod 666 /var/run/docker.sock
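
After reloading the settings, it can be worth confirming that the kernel picked up the new value. The following is a minimal Python sketch, assuming a Linux host where the current value is exposed at `/proc/sys/vm/max_map_count`; the helper name `max_map_count_ok` is hypothetical and not part of any tool used in this runbook.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # value set in /etc/sysctl.conf above


def max_map_count_ok(raw_value: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True if the kernel value (given as text) is at least `required`."""
    return int(raw_value.strip()) >= required


if __name__ == "__main__":
    proc_file = Path("/proc/sys/vm/max_map_count")  # assumed Linux-only location
    if proc_file.exists():
        current = proc_file.read_text()
        print("vm.max_map_count =", current.strip(), "ok:", max_map_count_ok(current))
    else:
        print("Cannot read", proc_file, "(not a Linux host?)")
```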

  2. On wikibase02

```
    docker run -d --name elasticsearch --net novartis-es-network -p 9200:9200 -p 9300:9300 \
    -v /pool/amandeep/elasticsearch.data:/usr/share/elasticsearch/data \
    -e "discovery.seed_hosts=wikibase03,wikibase04" \
    -e "node.name=es02" \
    -e "cluster.initial_master_nodes=es04,es02,es03" \
    -e "network.publish_host=wikibase02" \
    -e "ES_JAVA_OPTS=-Xms12g -Xmx12g" \
    docker.elastic.co/elasticsearch/elasticsearch:7.12.1
```

  3. On wikibase03

```
    docker run -d -p 9200:9200 -p 9300:9300 \
    -v /pool/amandeep/elasticsearch.data:/usr/share/elasticsearch/data \
    -e "discovery.seed_hosts=wikibase02,wikibase04" \
    -e "node.name=es03" \
    -e "cluster.initial_master_nodes=es02,es03,es04" \
    -e "network.publish_host=wikibase03" \
    -e "ES_JAVA_OPTS=-Xms12g -Xmx12g" \
    docker.elastic.co/elasticsearch/elasticsearch:7.12.1
```

  4. On wikibase04

```
    docker run -d -p 9200:9200 -p 9300:9300 \
    -v /pool/amandeep/elasticsearch.data:/usr/share/elasticsearch/data \
    -e "discovery.seed_hosts=wikibase02,wikibase03" \
    -e "node.name=es04" \
    -e "cluster.initial_master_nodes=es04,es02,es03" \
    -e "network.publish_host=wikibase04" \
    -e "ES_JAVA_OPTS=-Xms12g -Xmx12g" \
    docker.elastic.co/elasticsearch/elasticsearch:7.12.1
```
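
The three `docker run` commands above differ only in the node name, the published host and the list of seed hosts (every machine except the current one). As a rough illustration of that pattern, the sketch below generates the per-node command from a single host-to-node mapping. The function `es_run_command` and the `HOSTS` dictionary are hypothetical helpers; the sketch also omits the `--name` and `--net` options that only the wikibase02 command uses.

```python
import shlex

# Machine name -> Elasticsearch node name, as used in the commands above
HOSTS = {"wikibase02": "es02", "wikibase03": "es03", "wikibase04": "es04"}
IMAGE = "docker.elastic.co/elasticsearch/elasticsearch:7.12.1"
DATA_VOLUME = "/pool/amandeep/elasticsearch.data:/usr/share/elasticsearch/data"


def es_run_command(host: str, hosts: dict = HOSTS, heap: str = "12g") -> list:
    """Build the `docker run` argument list for the Elasticsearch node on `host`."""
    seed_hosts = ",".join(h for h in hosts if h != host)  # every other machine
    masters = ",".join(hosts.values())                    # all node names
    env = [
        f"discovery.seed_hosts={seed_hosts}",
        f"node.name={hosts[host]}",
        f"cluster.initial_master_nodes={masters}",
        f"network.publish_host={host}",
        f"ES_JAVA_OPTS=-Xms{heap} -Xmx{heap}",
    ]
    argv = ["docker", "run", "-d", "-p", "9200:9200", "-p", "9300:9300", "-v", DATA_VOLUME]
    for item in env:
        argv += ["-e", item]
    return argv + [IMAGE]


# Print the command for each machine instead of executing it
for host in HOSTS:
    print(shlex.join(es_run_command(host)))
```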

  5. Check if all three nodes are up and form a cluster

```
    curl 'localhost:9200/_cat/nodes?v'
    ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
    172.16.4.7           33          92  11    2.41    1.07     1.37 cdfhilmrstw -      es03
    172.16.4.8           54          84  10    2.96    1.23     1.61 cdfhilmrstw -      es04
    172.16.4.6           50          97  13    3.00    1.29     1.72 cdfhilmrstw *      es02
```
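
To check this output in a script rather than by eye, one option is to parse the table and confirm that all three node names are present and that exactly one node is marked as master (`*`). This is a minimal sketch that works on the text format shown above; the function names are hypothetical.

```python
def parse_cat_nodes(text: str) -> list:
    """Parse the whitespace-separated table printed by `_cat/nodes?v`."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def cluster_looks_healthy(nodes: list, expected_names: set) -> bool:
    """True if exactly the expected nodes are present and one of them is master."""
    names = {node["name"] for node in nodes}
    masters = [node for node in nodes if node["master"] == "*"]
    return names == expected_names and len(masters) == 1


# Sample copied from the output above
sample = """\
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
172.16.4.7           33          92  11    2.41    1.07     1.37 cdfhilmrstw -      es03
172.16.4.8           54          84  10    2.96    1.23     1.61 cdfhilmrstw -      es04
172.16.4.6           50          97  13    3.00    1.29     1.72 cdfhilmrstw *      es02
"""

nodes = parse_cat_nodes(sample)
print(cluster_looks_healthy(nodes, {"es02", "es03", "es04"}))
```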

  6. Create index with mapping file

```
curl -H "Content-Type: application/json" -XPUT http://localhost:9200/wikidatadwd-augmented-01 -d @wikidata_dwd_mapping_es_ver7.json
```

The mapping file `wikidata_dwd_mapping_es_ver7.json` is located at `/pool/amandeep/wikidata-20210215-dwd` on `wikibase02`.
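
For reference, the same index-creation request can be built from Python with the standard library. The sketch below only constructs the `PUT` request and prints it; sending it is left as a commented-out step. The function name `build_create_index_request` and the placeholder mapping are assumptions for illustration, and in practice the mapping would be loaded from the JSON file mentioned above.

```python
import json
import urllib.request


def build_create_index_request(es_url: str, index: str, mapping: dict) -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates `index` with `mapping`."""
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index}",
        data=json.dumps(mapping).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder mapping for illustration only; load the real mapping file instead.
placeholder_mapping = {"mappings": {}}
req = build_create_index_request(
    "http://localhost:9200", "wikidatadwd-augmented-01", placeholder_mapping
)
print(req.get_method(), req.full_url)

# To actually send the request:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```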

  7. Load data into the Elasticsearch index using the table-linker Docker image
  
  
  Create a script `load_es.sh` in `LOCAL_PATH` containing the following lines:
  
```
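# Usage: load_es.sh <es_url> <es_index> <files_path>
es_url=$1
es_index=$2
files_path=$3

# Load each file in the folder, pausing 60 seconds between files
for f in "$files_path"/*
do
  tl load-elasticsearch-index --es-url "$es_url" --es-index "$es_index" --es-version 7 --kgtk-jl-path "$f"
  sleep 60
done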
```

The script takes three parameters:

- First parameter: the Elasticsearch URL
- Second parameter: the Elasticsearch index name
- Third parameter: the path to the folder with the files to be loaded into Elasticsearch
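
To preview what the script will do before starting a long load, a dry run can list the commands without executing them. The sketch below mirrors the loop in `load_es.sh` using the same `tl load-elasticsearch-index` flags; the function `build_load_commands` is a hypothetical helper, and unlike the shell glob it skips subdirectories and processes files in sorted order.

```python
import shlex
import sys
from pathlib import Path


def build_load_commands(es_url: str, es_index: str, files_path: str) -> list:
    """Return one `tl load-elasticsearch-index` argument list per file in `files_path`."""
    commands = []
    for f in sorted(Path(files_path).iterdir()):
        if f.is_file():
            commands.append([
                "tl", "load-elasticsearch-index",
                "--es-url", es_url,
                "--es-index", es_index,
                "--es-version", "7",
                "--kgtk-jl-path", str(f),
            ])
    return commands


if __name__ == "__main__" and len(sys.argv) == 4:
    # Dry run: print each command instead of executing it
    for argv in build_load_commands(*sys.argv[1:4]):
        print(shlex.join(argv))
```

Saved under a name of your choice, it takes the same three arguments as `load_es.sh`.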
  
The Dockerfile for table-linker: https://github.com/usc-isi-i2/table-linker/blob/master/Dockerfile

```
    git clone https://github.com/usc-isi-i2/table-linker
    cd table-linker
    docker build -t table-linker .
    docker run -it --rm -v <LOCAL_PATH>:/mnt/data   table-linker /bin/bash /mnt/data/load_es.sh "http://localhost:9200" "wikidatadwd-augmented-02" "/mnt/data/es_split"
```
 
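`LOCAL_PATH` is a path on the local machine that must contain `load_es.sh` and the data to be loaded. The folder passed to the script, and therefore to `--kgtk-jl-path`, must be a path inside the Docker container (here under `/mnt/data`). Note that with Docker's default networking, `localhost` inside the container refers to the container itself; if Elasticsearch is not reachable at `http://localhost:9200` from the container, use the host name instead, for example `http://wikibase02:9200`.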

The script path: `/pool/amandeep/wikidata-20210215-dwd/load_es.sh` (wikibase02).

The data files are in the folder `/pool/amandeep/wikidata-20210215-dwd/es_split`

**NOTE**: The above step can take a long time. If running on a server, it is advisable to run the above command in a `tmux` or `screen` session.

 8. Create Alias for Elasticsearch Index

```
curl -X POST "localhost:9200/_aliases?pretty" -H 'Content-Type: application/json' -d'
{
  "actions" : [
    { "add" : { "index" : "wikidatadwd-augmented-01", "alias" : "wikidatadwd-augmented" } }
  ]
}
'
```

9. Create a new index and switch the alias

Suppose the current index is `wikidatadwd-augmented-01` and we create a new index with more documents, `wikidatadwd-augmented-02`. The following curl command switches the alias `wikidatadwd-augmented` from `wikidatadwd-augmented-01` to `wikidatadwd-augmented-02`.


```
curl -X POST "localhost:9200/_aliases?pretty" -H 'Content-Type: application/json' -d'
{
  "actions" : [
    { "remove" : { "index" : "wikidatadwd-augmented-01", "alias" : "wikidatadwd-augmented" } },
    { "add" : { "index" : "wikidatadwd-augmented-02", "alias" : "wikidatadwd-augmented" } }
  ]
}
'

```
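
Because both actions are sent in a single `_aliases` request, Elasticsearch applies them atomically, so the alias never points at no index during the switch. The sketch below builds the same request body from Python, which can help when the index names change with every reload. The helper names are hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request


def alias_switch_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build the `_aliases` body that moves `alias` from `old_index` to `new_index`."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def build_alias_request(es_url: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the `_aliases` endpoint."""
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/_aliases",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = alias_switch_actions(
    "wikidatadwd-augmented", "wikidatadwd-augmented-01", "wikidatadwd-augmented-02"
)
alias_req = build_alias_request("http://localhost:9200", body)
print(json.dumps(body, indent=2))
# To actually send it: urllib.request.urlopen(alias_req)
```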

## Steps to build, set up and run the Wikifier Docker

 1. Download the git repository

```
git clone https://github.com/usc-isi-i2/wikidata-wikifier
```


 2. Change directory to `wikidata-wikifier`

```
cd wikidata-wikifier
```


 3. Build the docker image

```
docker build -t wikidata-wikifier .
```

**NOTE**: To rebuild the Docker image after updates, disable the build cache:

```
docker build -t wikidata-wikifier . --no-cache
```


 4. Set up environment variables in `docker-compose.yml`
      - `WIKIFIER_ES_URL`: the Elasticsearch URL. If ES is running on wikibase02, set this parameter to `http://wikibase02:9200`.
      - `WIKIFIER_ES_INDEX`: the Elasticsearch index. Use `wikidatadwd-augmented` (the alias created in the previous steps).
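
One way to sanity-check these two values before starting the container is to combine them into an Elasticsearch `_count` URL and open it, for example with `curl`; a document count in the response suggests the URL and alias resolve. The sketch below only builds the URL. The helper `count_url_from_env` is hypothetical and says nothing about how the wikifier itself reads these variables.

```python
import os


def count_url_from_env(env=os.environ) -> str:
    """Combine WIKIFIER_ES_URL and WIKIFIER_ES_INDEX into an Elasticsearch `_count` URL."""
    es_url = env["WIKIFIER_ES_URL"].rstrip("/")
    es_index = env["WIKIFIER_ES_INDEX"]
    return f"{es_url}/{es_index}/_count"


# Example values taken from the step above
example_env = {
    "WIKIFIER_ES_URL": "http://wikibase02:9200",
    "WIKIFIER_ES_INDEX": "wikidatadwd-augmented",
}
print(count_url_from_env(example_env))
```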

 5. Bring the wikifier container up

```
docker-compose up -d
```

 6. Wikifier should be running at `http://localhost:1703`
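
A quick way to confirm that something is listening on that port is a plain TCP connection attempt. This is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and a successful connection only shows that the port is open, not that the wikifier is answering requests correctly.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("wikifier port open:", is_port_open("localhost", 1703))
```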

# Call Wikidata Wikifier Service

```
import os
import requests
import pandas as pd
from io import StringIO
```

## Setup parameters

```
wikifier_service_url = "http://localhost:1703/wikify"
input_file = './wikifier/sample_files/cricketers.csv'
column_to_wikify = "cricketers"
```

## Peek at the input file

```
pd.read_csv(input_file).fillna("")
```

```
Unnamed: 0,cricketers,teams,weight,dob
0,Virat Kohli,royal challengers bangalore,152,5/11/88
1,Tendulkar,mumbai indians,137,24/04/1973
2,Dhoni,chennai super kings,154,7/7/81
3,Jasprit Bumrah,mumbai indians,154,6/12/93
4,Ajinkya Rahane,rajasthan royals,134,6/6/88
5,Rohit Sharma,mumbai indians,159,30/04/1987
6,Bhuvneshwar Kumar,deccan chargers,154,5/2/90
7,Ravindra Jadeja,chennai super kings,132,6/12/88
8,Rishabh Pant,delhi capitals,136,4/8/97
9,Shikhar Dhawan,delhi capitals,157,5/12/85
```


## Call via Python

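```
def call_wikifier(url, k=1):
    """POST the input file to the wikifier and return the result as a DataFrame."""
    file_name = os.path.basename(input_file)
    url += f'?k={k}&columns={column_to_wikify}'

    # Open the file in a context manager so it is closed after the upload
    with open(input_file, mode='rb') as f:
        files = {'file': (file_name, f, 'application/octet-stream')}
        resp = requests.post(url, files=files)
    resp.raise_for_status()  # fail early on HTTP errors

    # Parse the response body as CSV without a header row
    data = StringIO(str(resp.content, 'utf-8'))
    return pd.read_csv(data, header=None)
```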

```
df = call_wikifier(wikifier_service_url, k=3)
df
```

```
df.fillna("").to_csv('/tmp/linked_cricketers.csv', index=False)
```
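
The request URL above is assembled by string formatting, which works for a simple column name such as `cricketers`. If a column name contains spaces or other special characters, it should be URL-encoded. The sketch below shows one way to do that with the standard library; `build_wikify_url` is a hypothetical helper, and it assumes the service decodes query parameters in the usual way.

```python
from urllib.parse import urlencode


def build_wikify_url(base_url: str, columns: str, k: int = 1) -> str:
    """Append URL-encoded `k` and `columns` query parameters to the service URL."""
    return f"{base_url}?{urlencode({'k': k, 'columns': columns})}"


print(build_wikify_url("http://localhost:1703/wikify", "cricketers", k=3))
```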

## Call using `curl`

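```
url = f'{wikifier_service_url}?k=3&columns={column_to_wikify}'
```

From a notebook cell, prefix the command with `!` so that IPython runs it in a shell and substitutes the Python variables `input_file` and `url`:

```
!curl -XPOST -F "file=@$input_file" "$url"
```

An equivalent call from a shell, with the URL quoted so that the shell does not treat `&` as a control operator:

```
curl -XPOST -F file=@/Users/amandeep/Github/wikidata-wikifier/wikifier/sample_files/cricketers.csv "https://dsbox02.isi.edu:8888/wikifier/wikify?k=3&columns=cricketers"
```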
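
When generating such a command from Python, `shlex.join` takes care of the quoting: any argument containing characters that are special to the shell, such as `?` or `&`, is wrapped in single quotes. A minimal sketch, with `curl_command` as a hypothetical helper:

```python
import shlex


def curl_command(input_file: str, url: str) -> str:
    """Return a shell-safe curl command line that uploads `input_file` to `url`."""
    return shlex.join(["curl", "-XPOST", "-F", f"file=@{input_file}", url])


print(curl_command(
    "./wikifier/sample_files/cricketers.csv",
    "http://localhost:1703/wikify?k=3&columns=cricketers",
))
```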


## Run table-linker commands in a jupyter notebook via Docker

Build the table-linker docker (if not already built)

```
cd table-linker
docker build -t table-linker .
```

Run jupyter notebook

Assuming `LOCAL_PATH` contains the notebook to be run and the input file, mount `LOCAL_PATH` to the folder `/out` in the Docker container by running the following command:

```
docker run -it -v <LOCAL_PATH>:/out -p 8888:8888 table-linker:latest /bin/bash -c "jupyter lab --ip='*' --port=8888 --no-browser --allow-root"
```

This starts the Jupyter server and produces output like the following:

```
[I 2021-07-19 17:13:15.687 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.7/site-packages/jupyterlab
[I 2021-07-19 17:13:15.687 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 17:13:15.698 NotebookApp] Serving notebooks from local directory: /
[I 17:13:15.698 NotebookApp] Jupyter Notebook 6.4.0 is running at:
[I 17:13:15.699 NotebookApp] http://a16fa96774c1:8888/?token=2f25a3501853ff6e4177871f869fc9ed83029a42f5242677
[I 17:13:15.699 NotebookApp]  or http://127.0.0.1:8888/?token=2f25a3501853ff6e4177871f869fc9ed83029a42f5242677
[I 17:13:15.699 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 17:13:15.706 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
    Or copy and paste one of these URLs:
        http://a16fa96774c1:8888/?token=2f25a3501853ff6e4177871f869fc9ed83029a42f5242677
     or http://127.0.0.1:8888/?token=2f25a3501853ff6e4177871f869fc9ed83029a42f5242677
```

Copy and paste the URL `http://127.0.0.1:8888/?token=2f25a3501853ff6e4177871f869fc9ed83029a42f5242677` into a browser to access the Jupyter notebooks. The token is generated at startup, so use the one from your own output.
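
If the server output is captured to a file or variable, the local URL can be pulled out with a regular expression instead of being copied by hand. This is a small sketch based on the log format shown above; `find_local_jupyter_url` is a hypothetical helper, and the pattern assumes the `http://127.0.0.1:<port>/?token=<hex>` form.

```python
import re

# Matches the local access URL as printed in the Jupyter startup log above
TOKEN_URL = re.compile(r"http://127\.0\.0\.1:\d+/\?token=[0-9a-f]+")


def find_local_jupyter_url(log_text: str):
    """Return the first local Jupyter URL with a token found in the log, or None."""
    match = TOKEN_URL.search(log_text)
    return match.group(0) if match else None


sample_log = (
    "[I 17:13:15.699 NotebookApp]  or "
    "http://127.0.0.1:8888/?token=2f25a3501853ff6e4177871f869fc9ed83029a42f5242677\n"
)
print(find_local_jupyter_url(sample_log))
```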