## Deploy BERT server

Instructions: https://bert-as-service.readthedocs.io/en/latest/section/get-start.html  
Install the server and client:
``` bash
pip install -U bert-serving-server bert-serving-client  
```
Download and unzip the pretrained BERT model (BERT-Large, Uncased, 1024-dimensional output):  
``` bash
cd ${model_path}
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
unzip uncased_L-24_H-1024_A-16.zip  
```  

Start the BERT server on the local machine (run only one of the two commands):
``` bash
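# Option 1: model unpacked under ${model_path}
bert-serving-start -model_dir ${model_path}/uncased_L-24_H-1024_A-16 -max_seq_len=100 -num_worker=1
# Option 2: model in a shared folder, longer sequences, capped GPU memory fraction
bert-serving-start -model_dir /share/ShareFolder/uncased_L-24_H-1024_A-16/ -max_seq_len=150 -gpu_memory_fraction=0.9 -num_worker=1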
```
Then call the server from a Python client:
``` python
from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])
```


## Load data as Pandas dataframe

In [1]:
import json
import numpy as np
import pandas as pd

train_file_path = "./JSONFiles/" + "train_with_text.json"
use_test_file = True
if use_test_file:
    test_file_path = './JSONFiles/' + 'test_with_text.json'
else:
    test_file_path = './JSONFiles/' + 'dev_with_text.json'

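# Read as UTF-8 explicitly: the evidence texts contain non-ASCII characters (e.g. IPA).
with open(train_file_path, mode='r', encoding='utf-8') as f:
    train = json.load(f)
with open(test_file_path, mode='r', encoding='utf-8') as f:
    test = json.load(f)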

def load_training_data(dataset: dict) -> list:
    dataset_list = []
    for key in dataset.keys():
        record = dataset.get(key)
        claim = record.get("claim")
        evi_texts = record.get("evidence_texts")
        text = ''.join(evi_texts)
        if len(text) == 0:
            text = "no word"

        SUP = NOINFO = REF = 0
        if record.get("label") == "SUPPORTS":
            SUP = 1
        elif record.get("label") == "REFUTES":
            REF = 1
        else:
            NOINFO = 1
        dataset_record = {
            "claim": claim,
            "evi_text": text,
            "claim_with_evi_text": claim + " ||| " + text,
            "SUP": SUP,
            "NOINFO": NOINFO,
            "REF": REF
        }
        dataset_list.append(dataset_record)
    return dataset_list

def load_test_data(dataset: dict) -> list:
    dataset_list = []
    for key in dataset.keys():
        record = dataset.get(key)
        claim = record.get("claim")
        evi_index = record.get("evidence")
        evi_texts = record.get("evidence_texts")
        text = ''.join(evi_texts)
        if len(text) == 0:
            text = "no word"
            
        dataset_record = {
            "key": key,
            "claim": claim,
            "evidence": evi_index,
            "claim_with_evi_text": claim + " ||| " + text,
            "evi_text": text
        }
        dataset_list.append(dataset_record)
    return dataset_list

train_df = pd.DataFrame(load_training_data(train))
test_df = pd.DataFrame(load_test_data(test))

train_df[0: 10]

Unnamed: 0,NOINFO,REF,SUP,claim,claim_with_evi_text,evi_text
0,0,1,0,From the Earth to the Moon is a WGBH miniseries.,From the Earth to the Moon is a WGBH miniserie...,From the Earth to the Moon is a twelve-part HB...
1,0,0,1,Bhagat Singh was a movie role performed by Aja...,Bhagat Singh was a movie role performed by Aja...,Bhagat Singh -LRB- -LSB- pə̀ɡət̪ sɪ́ŋɡ -RSB- -...
2,1,0,0,Daz Dillinger is the owner of Bad Boy Records.,Daz Dillinger is the owner of Bad Boy Records....,It is one of two high schools in Campbellsvill...
3,0,0,1,The release date of The Smurfs (film) changed ...,The release date of The Smurfs (film) changed ...,After having the release date changed three ti...
4,0,1,0,Morgan Freeman is incapable of being part of B...,Morgan Freeman is incapable of being part of B...,Freeman has appeared in many other box office ...
5,0,0,1,Michael Jackson released the album Thriller.,Michael Jackson released the album Thriller. |...,"His music videos , including those of `` Beat ..."
6,0,0,1,Outlander (TV series) adapts novels.,Outlander (TV series) adapts novels. ||| Drago...,Dragonfly in Amber is the second book in the O...
7,0,0,1,Bruce Springsteen was named MusiCares' person ...,Bruce Springsteen was named MusiCares' person ...,"In 2009 , Springsteen was a Kennedy Center Hon..."
8,0,1,0,Game of Thrones (season 3) had 0 episodes.,Game of Thrones (season 3) had 0 episodes. |||...,It was broadcast on Sunday at 9:00 pm in the U...
9,1,0,0,Anne Hathaway won the Critics' Choice Movie Aw...,Anne Hathaway won the Critics' Choice Movie Aw...,The Ballad .\n
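
To make the row layout above concrete, here is a minimal sketch that flattens one hand-made record in the same way as `load_training_data`. The sample record and the helper name `flatten_record` are illustrative assumptions; they are not part of the dataset or the notebook.

```python
LABEL_COLUMNS = ["NOINFO", "REF", "SUP"]  # assumed fixed label column order


def flatten_record(record: dict) -> dict:
    """Flatten one training record into a row (hypothetical helper)."""
    # Join evidence sentences; fall back to a placeholder when there are none.
    text = "".join(record.get("evidence_texts", [])) or "no word"
    label = record.get("label")
    row = {
        "claim": record["claim"],
        "evi_text": text,
        "claim_with_evi_text": record["claim"] + " ||| " + text,
        "SUP": int(label == "SUPPORTS"),
        "REF": int(label == "REFUTES"),
    }
    # Any label other than SUPPORTS/REFUTES is treated as "not enough info".
    row["NOINFO"] = int(not (row["SUP"] or row["REF"]))
    return row


sample = {"claim": "A is B.", "label": "REFUTES", "evidence_texts": []}
row = flatten_record(sample)
print(row["claim_with_evi_text"])        # A is B. ||| no word
print([row[c] for c in LABEL_COLUMNS])   # [0, 1, 0]
```

Note that the evidence sentences are joined without a separator, which mirrors the `''.join(evi_texts)` call in the cell above.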


In [2]:
test_df[0: 10]

Unnamed: 0,claim,claim_with_evi_text,evi_text,evidence,key
0,Henry III of France was murdered by the presid...,Henry III of France was murdered by the presid...,"Henry III , Prince of Anhalt-Aschersleben -LRB...","[[Henry_III, 19], [Henry_III, 15], [Henry_III,...",183452
1,Mary-Kate Olsen and Ashley Olsen are also know...,Mary-Kate Olsen and Ashley Olsen are also know...,She is an older sister of actress Elizabeth Ol...,"[[Mary-Kate_Olsen, 3], [Mary-Kate_Olsen, 0], [...",212309
2,Deepika Padukone played the lead role in a Hin...,Deepika Padukone played the lead role in a Hin...,Deepika Padukone -LRB- -LSB- d̪iːpɪkaː pəɖʊkoː...,"[[Deepika_Padukone, 0], [Deepika_Padukone, 14]...",19160
3,NRG Recording Studios is located in a hospital.,NRG Recording Studios is located in a hospital...,NRG Recording Studios is a recording facility ...,"[[NRG_Recording_Studios, 0]]",36451
4,Ayananka Bose is a director of cinematography.,Ayananka Bose is a director of cinematography....,He won the best cinematographer of Zee Cine Aw...,"[[Ayananka_Bose, 2], [Ayananka_Bose, 0], [Ayan...",124694
5,Aubrey Anderson-Emmons is a child actress.,Aubrey Anderson-Emmons is a child actress. |||...,Aubrey Frances Anderson-Emmons -LRB- born June...,"[[Aubrey_Anderson-Emmons, 0], [Aubrey_Anderson...",23211
6,The University of Leicester discovered and ide...,The University of Leicester discovered and ide...,Later editions of the catalogue contained mino...,"[[List_of_compositions_by_Franz_Schubert, 10]]",197637
7,Gal Gadot was ranked ahead of Esti Ginzburgh f...,Gal Gadot was ranked ahead of Esti Ginzburgh f...,Gadot is primarily known for her role as Wonde...,"[[Gal_Gadot, 1], [Gal_Gadot, 4], [Gal_Gadot, 5...",160309
8,Pocahontas had a reception.,Pocahontas had a reception. ||| In a well-know...,"In a well-known historical anecdote , she is s...","[[Pocahontas, 2], [Pocahontas, 8], [Pocahontas...",152201
9,"Watertown, Massachusetts is in the United King...","Watertown, Massachusetts is in the United King...",Watertown is made up of six neighborhoods : Be...,"[[Watertown,_Massachusetts, 6], [Watertown,_Ma...",157022


## Feature extraction

### Construct and save bert features to file for reuse

In [3]:
from bert_serving.client import BertClient
bc = BertClient()


In [None]:
# train, test claim encode
# restart server with 
# bert-serving-start -model_dir /share/ShareFolder/uncased_L-24_H-1024_A-16/ -max_seq_len=150 -gpu_memory_fraction=0.9 -num_worker=1


# train_claim_encode = bc.encode(list(train_df['claim']))
# np.save("./BERT_MLP_encodings/train_claim_encode", train_claim_encode)

test_claim_encode = bc.encode(list(test_df['claim']))
np.save("./BERT_MLP_encodings/test_claim_encode", test_claim_encode)

In [4]:
# train, test evidence encode
# restart server with 
# bert-serving-start -model_dir /share/ShareFolder/uncased_L-24_H-1024_A-16/ -max_seq_len=150 -gpu_memory_fraction=0.9 -num_worker=1

# train_evi_encode = bc.encode(list(train_df['evi_text']))
# np.save("./BERT_MLP_encodings/train_evi_encode", train_evi_encode)

test_evi_encode = bc.encode(list(test_df['evi_text']))
np.save("./BERT_MLP_encodings/test_evi_encode", test_evi_encode)



here is what you can do:
- or, start a new server with a larger "max_seq_len"
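
The warning above suggests that some inputs are longer than the server's `max_seq_len`. As a rough pre-check, one could count whitespace-separated tokens before encoding. This is only a sketch: BERT uses WordPiece tokenization, which typically produces more tokens than a whitespace split, so the count below should be read as a lower bound. The helper name `count_over_limit` is hypothetical.

```python
def count_over_limit(texts, max_seq_len=150):
    """Count texts whose whitespace-token count exceeds max_seq_len."""
    return sum(1 for t in texts if len(t.split()) > max_seq_len)


texts = ["short claim", "word " * 200]
print(count_over_limit(texts, max_seq_len=150))  # 1
```

If many texts exceed the limit, restarting the server with a larger `max_seq_len` is the option the warning itself proposes.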


In [5]:
# train, test claim+evidence pair encode

# train_pair_encode = bc.encode(list(train_df['claim_with_evi_text']))
# np.save("./BERT_MLP_encodings/train_pair_encode", train_pair_encode)

test_pair_encode = bc.encode(list(test_df['claim_with_evi_text']))
np.save("./BERT_MLP_encodings/test_pair_encode", test_pair_encode)

### Load bert features from file

In [6]:

train_claim_features = np.load("./BERT_MLP_encodings/train_claim_encode.npy")
test_claim_features = np.load("./BERT_MLP_encodings/test_claim_encode.npy")

train_evi_features = np.load("./BERT_MLP_encodings/train_evi_encode.npy")
test_evi_features = np.load("./BERT_MLP_encodings/test_evi_encode.npy")

train_pair_features = np.load("./BERT_MLP_encodings/train_pair_encode.npy")
test_pair_features = np.load("./BERT_MLP_encodings/test_pair_encode.npy")
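
The save-then-load pattern used in this section can be wrapped in a small cache helper, so that encoding only runs when the `.npy` file does not exist yet. The sketch below is one possible way to do this; `load_or_compute` and `fake_encode` are hypothetical names, and `fake_encode` merely stands in for a call such as `bc.encode(...)`.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Return the array cached at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        return np.load(path)
    features = compute_fn()
    np.save(path, features)
    return features


calls = []


def fake_encode():
    # Stand-in for the real encoder; records how often it is invoked.
    calls.append(1)
    return np.ones((2, 4), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_encode.npy")
    first = load_or_compute(path, fake_encode)   # computes and saves
    second = load_or_compute(path, fake_encode)  # loads from disk

print(len(calls))    # 1
print(second.shape)  # (2, 4)
```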


In [8]:
x_train = np.concatenate([train_claim_features, train_evi_features, train_pair_features], axis=1)
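# Select the label columns by name: positional slicing depends on the pandas
# version's column ordering. The order must match the argmax decoding used later.
y_train = train_df[["NOINFO", "REF", "SUP"]].values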
x_test = np.concatenate([test_claim_features, test_evi_features, test_pair_features], axis=1)

In [9]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)

(145449, 3072)
(145449, 3)
(14997, 3072)
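
The feature width of 3072 follows from concatenating three 1024-dimensional encodings (claim, evidence, and claim-evidence pair) along axis 1. The sketch below reproduces this with small random arrays standing in for the real BERT features; the array names and the sample count are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 5, 1024

# Random stand-ins for the three saved encodings.
claim = rng.normal(size=(n_samples, dim))
evidence = rng.normal(size=(n_samples, dim))
pair = rng.normal(size=(n_samples, dim))

# Concatenate per sample, so each row holds all three encodings side by side.
x = np.concatenate([claim, evidence, pair], axis=1)
print(x.shape)  # (5, 3072)
```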


## Build and train model

### Simple MLP model prototype


In [10]:
import keras
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import Sequential
from keras.layers import Dense, Dropout
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.optimizers import Adam

seed = 7
np.random.seed(seed)


model = Sequential()
model.add(Dense(units=200, activation='relu', input_dim=x_train.shape[1]))
model.add(Dense(units=50, activation='relu'))  # input size is inferred from the previous layer
model.add(Dropout(0.3))
model.add(Dense(units=3, activation='softmax'))
# optimizer = Adam(lr=0.01)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer='adam', metrics=['accuracy'])

model.summary()
# SVG(model_to_dot(model).create(prog='dot', format='svg'))

# callbacks
filepath="best_weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
earlyStopping = EarlyStopping(monitor='val_acc', patience=3, verbose=0, mode='auto')

callbacks_list = [checkpoint, earlyStopping]

# model.fit(x=x_train, y=y_train, batch_size=32, epochs=50, validation_split=0.1, callbacks=callbacks_list)

Using TensorFlow backend.


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 200)               614600    
_________________________________________________________________
dense_2 (Dense)              (None, 50)                10050     
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 153       
Total params: 624,803
Trainable params: 624,803
Non-trainable params: 0
_________________________________________________________________
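
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short sketch below redoes that arithmetic for the layer sizes used here; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_sizes = [3072, 200, 50, 3]  # input width, then the units of each Dense layer
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)       # [614600, 10050, 153]
print(sum(counts))  # 624803
```

The Dropout layer contributes no parameters, which is why it shows 0 in the summary.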


### Tune Hyperparameters manually

## Apply Model

In [11]:
# load from file
model.load_weights("best_weights.hdf5")
y_test = model.predict(x_test, batch_size=128, verbose=1)
y_test


result_dict = {}

for i in range(len(test_df)):
    if np.argmax(y_test[i]) == 0:
        label = "NOT ENOUGH INFO"
    elif np.argmax(y_test[i]) == 1:
        label = "REFUTES"
    else:
        label = "SUPPORTS"
    key = test_df['key'][i]
    result_dict.update({
        key:{
            "claim": test_df['claim'][i],
            "label": label,
            "evidence": test_df['evidence'][i]
        }
    })
    
with open('result_on_dev.json', 'w') as outfile:
    json.dump(result_dict, outfile, indent=4)
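
The index-to-label mapping in the loop above can also be written as a lookup list, which keeps the class order in one place. This is a sketch under the assumption that the label columns are ordered `NOINFO`, `REF`, `SUP`, as the `if`/`elif` chain implies; `LABELS` and `decode` are hypothetical names, and the probabilities are made up for illustration.

```python
LABELS = ["NOT ENOUGH INFO", "REFUTES", "SUPPORTS"]  # index order assumed from the loop above


def decode(probabilities):
    """Map one row of class probabilities to its label string."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best]


preds = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5]]
print([decode(p) for p in preds])
# ['NOT ENOUGH INFO', 'REFUTES', 'SUPPORTS']
```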

