## Deploy BERT server

Instruction website: https://bert-as-service.readthedocs.io/en/latest/section/get-start.html  
Install the server and client:
``` bash
pip install -U bert-serving-server bert-serving-client  
```
Download and unzip the pretrained BERT model (BERT-Large, Uncased, 1024-dimensional output):  
``` bash
cd ${model_path}
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
unzip uncased_L-24_H-1024_A-16.zip  
```  

Start the BERT server on the local machine:
``` bash
bert-serving-start -model_dir ${model_path}/uncased_L-24_H-1024_A-16 -num_worker=1  
```
Then call the server from the client side in Python:
``` python
from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])
```


## Load data as Pandas dataframe

``` python
import json
import numpy as np
import pandas as pd

train_file_path = "./JSONFiles/" + "train_with_text.json"
# Set to True to use the test split instead of the dev split.
use_test_file = False
if use_test_file:
    test_file_path = "./JSONFiles/" + "test_with_text.json"
else:
    test_file_path = "./JSONFiles/" + "dev_with_text.json"

with open(train_file_path, mode="r") as f:
    train = json.load(f)
with open(test_file_path, mode="r") as f:
    test = json.load(f)


def load_training_data(dataset: dict) -> list:
    """Turn each record into a claim, its joined evidence text, and a one-hot label."""
    dataset_list = []
    for key in dataset.keys():
        record = dataset.get(key)
        claim = record.get("claim")
        # Concatenate all evidence sentences into a single string.
        text = "".join(record.get("evidence_texts"))
        # One-hot label: anything other than SUPPORTS or REFUTES counts as NOINFO.
        SUP = NOINFO = REF = 0
        if record.get("label") == "SUPPORTS":
            SUP = 1
        elif record.get("label") == "REFUTES":
            REF = 1
        else:
            NOINFO = 1
        dataset_record = {
            "claim": claim,
            "evi_text": text,
            "SUP": SUP,
            "NOINFO": NOINFO,
            "REF": REF
        }
        dataset_list.append(dataset_record)
    return dataset_list


def load_test_data(dataset: dict) -> list:
    """Turn each record into its key, claim, evidence index, and joined evidence text."""
    dataset_list = []
    for key in dataset.keys():
        record = dataset.get(key)
        claim = record.get("claim")
        evi_index = record.get("evidence")
        # Concatenate all evidence sentences into a single string.
        text = "".join(record.get("evidence_texts"))

        dataset_record = {
            "key": key,
            "claim": claim,
            "evidence": evi_index,
            "evi_text": text
        }
        dataset_list.append(dataset_record)
    return dataset_list


train_df = pd.DataFrame(load_training_data(train))
test_df = pd.DataFrame(load_test_data(test))

train_df[0:10]
```

| | NOINFO | REF | SUP | claim | evi_text |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | Ireland does not have relatively low-lying mou... | The island 's geography comprises relatively l... |
| 1 | 0 | 0 | 1 | The drama Dark Matter stars Taylor Schilling. | She made her film debut in the 2007 drama Dark... |
| 2 | 0 | 0 | 1 | In 1932, Prussia was taken over. | In the Weimar Republic , the state of Prussia ... |
| 3 | 0 | 0 | 1 | IZombie premiered in 2015. | The series premiered on March 17 , 2015 .\n |
| 4 | 0 | 0 | 1 | Ronald Reagan had a nationality. | Ronald Wilson Reagan -LRB- -LSB- ˈrɒnəld_ˈwɪls... |
| 5 | 0 | 0 | 1 | Samoa Joe wrestles professionally. | Nuufolau Joel `` Joe '' Seanoa -LRB- born Marc... |
| 6 | 0 | 0 | 1 | University of Oxford is in the universe. | The University of Oxford -LRB- informally Oxfo... |
| 7 | 1 | 0 | 0 | The Renaissance began online. | The Hokies were led by 27th-year head coach Fr... |
| 8 | 0 | 0 | 1 | Portia de Rossi appeared on Scandal. | She appeared as a regular cast member on the A... |
| 9 | 0 | 1 | 0 | The Berlin Wall was only standing for 10 years. | The Berlin Wall -LRB- Berliner Mauer -RRB- was... |
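
The label handling in `load_training_data` amounts to one-hot encoding over three classes. A minimal sketch of that mapping on a made-up record follows. `one_hot_label` and the sample record are hypothetical and exist only for illustration; any label other than `SUPPORTS` or `REFUTES` is treated as `NOINFO`, which mirrors the `else` branch of the loader.

```python
def one_hot_label(label: str) -> dict:
    """Map a label string to the SUP / NOINFO / REF indicator columns."""
    return {
        "SUP": int(label == "SUPPORTS"),
        "NOINFO": int(label not in ("SUPPORTS", "REFUTES")),
        "REF": int(label == "REFUTES"),
    }


# Hypothetical record, shaped like the entries the loader reads.
record = {
    "claim": "An example claim.",
    "label": "REFUTES",
    "evidence_texts": ["First evidence sentence.\n", "Second evidence sentence.\n"],
}

row = {
    "claim": record["claim"],
    "evi_text": "".join(record["evidence_texts"]),  # evidence joined into one string
    **one_hot_label(record["label"]),
}
print(row)
```

Exactly one of the three indicator columns is 1 for every row, which is what lets the three columns be used together as a multi-class target later on.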


``` python
test_df[0:10]
```

| | claim | evi_text | evidence | key |
|---|---|---|---|---|
| 0 | Ripon College's student number totaled in at a... | As of 2015 , Ripon College 's student body sto... | `[[Ripon_College_-LRB-Wisconsin-RRB-, 1]]` | 100038 |
| 1 | Kesha was baptized on March 1st, 1987. | Kesha Rose Sebert -LRB- -LSB- ˈkɛʃə_roʊz_ˈsɛbə... | `[[Kesha, 0]]` | 100083 |
| 2 | Birthday Song (2 Chainz song) was banned by So... | The song , which features fellow American rapp... | `[[Birthday_Song_-LRB-2_Chainz_song-RRB-, 1]]` | 100169 |
| 3 | The University of Illinois at Chicago is a col... | The University of Illinois at Chicago or UIC i... | `[[University_of_Illinois_at_Chicago, 0]]` | 100234 |
| 4 | French Indochina was officially known as the I... | Kenya is comfortably the next most successful ... | `[[10,000_metres_at_the_World_Championships_in_...` | 100359 |
| 5 | Damon Albarn has refused to ever work with Bri... | His debut solo studio album Everyday Robots --... | `[[Damon_Albarn, 17]]` | 100366 |
| 6 | Lost (TV series) is a series of plays. | Lost is an American television drama series th... | `[[Lost_-LRB-TV_series-RRB-, 0]]` | 100429 |
| 7 | Edison Machine Works was barely set up to prod... | The neighborhood is located between 22nd Stree... | `[[Hospital_Hill, 1]]` | 100457 |
| 8 | The human brain is set apart from mammalian br... | Studebaker Building -LRB- Missoula , Montana -... | `[[Studebaker_Building, 14]]` | 100461 |
| 9 | There are rumors that Augustus' wife, Livia, p... | Between 1945 and 1954 , Leir saw service in fi... | `[[Richard_H._Leir, 10]]` | 100481 |


## Feature extraction

### Construct and save BERT features to file for reuse

``` python
from bert_serving.client import BertClient

# Connects to the BERT server started earlier on the local machine.
bc = BertClient()
```

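``` python
import os

# np.save does not create missing directories, so make sure the target exists.
os.makedirs("./BERT_MLP_encodings", exist_ok=True)

train_claim_encode = bc.encode(list(train_df['claim']))
np.save("./BERT_MLP_encodings/train_claim_encode", train_claim_encode)

test_claim_encode = bc.encode(list(test_df['claim']))
np.save("./BERT_MLP_encodings/test_claim_encode", test_claim_encode)
```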


Encoding the claims produces a warning from the client about input length. The captured output is truncated; the part that survives reads:

> here is what you can do:
> - or, start a new server with a larger "max_seq_len"
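
The warning refers to inputs that are longer than the server's `max_seq_len`. A rough way to see how many rows are affected is to count tokens before encoding. The sketch below uses whitespace splitting as a stand-in for BERT's WordPiece tokenizer, so it undercounts. `count_over_limit`, the sample texts, and the limit of 25 (assumed here to match the server's default) are illustrative assumptions, not part of the original notebook.

```python
def count_over_limit(texts, max_seq_len=25):
    """Return how many texts have more whitespace tokens than the limit allows.

    Two positions are reserved for the [CLS] and [SEP] markers BERT adds.
    """
    budget = max_seq_len - 2
    return sum(1 for t in texts if len(t.split()) > budget)


texts = [
    "IZombie premiered in 2015.",
    " ".join(["token"] * 40),  # an artificially long input
]
print(count_over_limit(texts), "of", len(texts), "texts exceed the limit")
```

Running the same check on `train_df['evi_text']` would give a first impression of how much evidence text is longer than the configured limit, since the evidence strings are several sentences joined together.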


``` python
train_evi_encode = bc.encode(list(train_df['evi_text']))
np.save("./BERT_MLP_encodings/train_evi_encode", train_evi_encode)

test_evi_encode = bc.encode(list(test_df['evi_text']))
np.save("./BERT_MLP_encodings/test_evi_encode", test_evi_encode)
```
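
`np.save` appends the `.npy` extension when the given path does not already end with it, and it does not create the target directory. The round trip below illustrates both points with a temporary directory and a random array standing in for real BERT encodings (both are stand-ins, not the notebook's data).

```python
import os
import tempfile

import numpy as np

# Random stand-in for an (n_sentences, 1024) encoding matrix.
fake_encoding = np.random.rand(3, 1024).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "BERT_MLP_encodings")
    os.makedirs(out_dir, exist_ok=True)  # the directory must exist before saving

    stem = os.path.join(out_dir, "train_claim_encode")
    np.save(stem, fake_encoding)  # written as train_claim_encode.npy

    saved_files = os.listdir(out_dir)
    restored = np.load(stem + ".npy")

print(saved_files, restored.shape)
```

When loading, the path therefore needs both the directory and the `.npy` suffix.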

### Load BERT features from file

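``` python
import os

# Directory the encodings were saved to; np.save added the ".npy" extension.
encoding_dir = "./BERT_MLP_encodings"

train_claim_features = np.load(os.path.join(encoding_dir, "train_claim_encode.npy"))
train_evi_features = np.load(os.path.join(encoding_dir, "train_evi_encode.npy"))
test_claim_features = np.load(os.path.join(encoding_dir, "test_claim_encode.npy"))
test_evi_features = np.load(os.path.join(encoding_dir, "test_evi_encode.npy"))

# The encodings are dense arrays, so concatenate them column-wise with NumPy.
x_train = np.hstack([train_claim_features, train_evi_features])
# Select the label columns by name so the result does not depend on column order.
y_train = train_df[["NOINFO", "REF", "SUP"]].values
test_features = np.hstack([test_claim_features, test_evi_features])
```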
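
Concatenating the claim and evidence encodings column-wise gives one feature vector per example whose width is the sum of the two inputs. A minimal sketch with constant-filled stand-in arrays, assuming the 1024-dimensional output stated for BERT-Large in the setup section and using `np.hstack` so that the result stays a dense array:

```python
import numpy as np

n_examples = 4
claim_features = np.zeros((n_examples, 1024), dtype=np.float32)
evi_features = np.ones((n_examples, 1024), dtype=np.float32)

# Column-wise concatenation: each row becomes [claim vector | evidence vector].
features = np.hstack([claim_features, evi_features])
print(features.shape)  # (4, 2048)
```

Both inputs must have the same number of rows in the same order, so the claim and evidence encodings have to come from the same dataframe without reshuffling in between.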