# NER response

### LSTM vs BiLSTM Result
I used precision, recall, and F1 score as the metrics. The BiLSTM performs better than the LSTM on the NER task because labeling a word correctly depends on both the words before it and the words after it. However, both models overfit. The overfitting may be caused by the limited input features (word embeddings only), so more feature engineering is needed (discussed in Further Work).
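
To make the metrics concrete, here is a minimal sketch of precision, recall, and F1 computed over entity spans. The notebook's `compute_f1` comes from `util` and is not shown, so the span representation `(start, end, type)` and the function name `precision_recall_f1` are assumptions for illustration, not the actual implementation.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level scores from two sets of (start, end, type) spans.

    A predicted span counts as correct only if the same span with the
    same type appears in the gold set (exact match, an assumption here).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical spans for one sentence: three gold entities, two predictions,
# one of which has the right boundaries but the wrong type.
gold = {(0, 1, "PER"), (4, 5, "LOC"), (7, 7, "ORG")}
predicted = {(0, 1, "PER"), (4, 5, "ORG")}

p, r, f1 = precision_recall_f1(predicted, gold)
print("precision: %.3f, recall: %.3f, F1: %.3f" % (p, r, f1))
```

A large gap between the training and validation values of these scores is what indicates overfitting here.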

### LSTM vs RNN
The advantage of using an LSTM over a vanilla RNN is that an LSTM can maintain long-term information in its memory cell. While both models train on sequential data, a vanilla RNN easily loses long-term information due to vanishing gradients. In an LSTM, the sigmoid-activated forget, input, and output gates control what is kept in, written to, and read from the memory cell. Because the cell state is updated additively, gradients can flow across many time steps when the forget gate stays close to 1, so the dependency between two related words that are far apart is less likely to be lost. With these gates, an LSTM typically achieves higher prediction accuracy than a vanilla RNN on text tasks.
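
The role of the gates can be sketched with a single LSTM time step in NumPy. This is an illustrative sketch of the standard LSTM equations, not the Keras implementation; the function names, the weight layout (input, forget, output, candidate stacked in that order), and the toy values are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    Shapes: W is (4H, D), U is (4H, H), b is (4H,), where H is the number
    of hidden units and D the input size. The four blocks are stacked as
    input gate, forget gate, output gate, candidate (an assumed layout).
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate content
    c_t = f * c_prev + i * g       # additive memory update
    h_t = o * np.tanh(c_t)
    return h_t, c_t


# Toy setup: forget gate pushed towards 1 and input gate towards 0,
# so the memory cell is carried through the step almost unchanged.
D, H = 3, 2
W = np.zeros((4 * H, D))
U = np.zeros((4 * H, H))
b = np.concatenate([np.full(H, -20.0), np.full(H, 20.0), np.zeros(H), np.zeros(H)])

c_prev = np.array([1.0, -2.0])
h_t, c_t = lstm_step(np.ones(D), np.zeros(H), c_prev, W, U, b)
print(c_t)
```

With the forget gate near 1, `c_t` stays close to `c_prev`, which is the mechanism that lets information (and gradients) survive over long distances.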

### BiLSTM vs LSTM 
The advantage of using a BiLSTM is that it also learns from future context, so each prediction sees the whole sentence. A BiLSTM runs two LSTMs, one over the sequence in the forward direction and one over the reversed sequence, and combines their hidden states at each position to make predictions. Therefore, it preserves information from both the past and the future, while a unidirectional LSTM only predicts based on the past. This is especially important for the NER project.
A BiLSTM is applicable to this assignment because the task does not require online prediction. We already have the whole sentence when classifying named entities.
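
The forward/backward combination can be illustrated with a small NumPy sketch. To keep it short it uses a plain tanh RNN instead of an LSTM, so it only shows the bidirectional wiring; `run_rnn`, `run_bidirectional`, and all sizes are hypothetical names and values chosen for the example.

```python
import numpy as np


def run_rnn(xs, W, U):
    """Simple tanh RNN; returns the hidden state at every step, shape (T, H)."""
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return np.stack(states)


def run_bidirectional(xs, fwd_weights, bwd_weights):
    """Concatenate forward states with re-aligned backward states, shape (T, 2H)."""
    forward = run_rnn(xs, *fwd_weights)
    # Run over the reversed sequence, then flip back so row t matches token t.
    backward = run_rnn(xs[::-1], *bwd_weights)[::-1]
    return np.concatenate([forward, backward], axis=-1)


rng = np.random.default_rng(0)
T, D, H = 5, 4, 3
fwd = (0.5 * rng.standard_normal((H, D)), 0.5 * rng.standard_normal((H, H)))
bwd = (0.5 * rng.standard_normal((H, D)), 0.5 * rng.standard_normal((H, H)))

xs = rng.standard_normal((T, D))
out = run_bidirectional(xs, fwd, bwd)

# Change only the last token and look at the representation of the first one.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
out_changed = run_bidirectional(xs_changed, fwd, bwd)

print(out.shape)  # (T, 2H)
print("forward half of token 0 unchanged:", np.array_equal(out[0, :H], out_changed[0, :H]))
print("backward half of token 0 unchanged:", np.array_equal(out[0, H:], out_changed[0, H:]))
```

The forward half of the first token's representation cannot see the last token, while the backward half can, which is why the bidirectional output carries context from both sides.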

### Further Work
The performance of embedding + LSTM/BiLSTM is not very good on the validation dataset. According to the work of Jason Chiu and Eric Nichols, adding casing and character-level information, with a CNN that extracts character features for each word before the BiLSTM, improves performance. I tested two embedding methods, one using pretrained GloVe embeddings and the other using embeddings trained on this dataset itself. Their performance was very close, and neither addressed the overfitting problem. Further work to improve the NER classification would be to add a capitalization feature and to combine the word-level LSTM with character-level CNN features.
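
As a rough idea of what a capitalization feature could look like, the sketch below maps each token to a casing category that could be embedded and concatenated with the word embedding. The function name `casing_feature` and the category names are illustrative assumptions, not the exact scheme from the paper or from `util`.

```python
def casing_feature(token):
    """Map a token to a coarse casing category (illustrative categories)."""
    if token.isdigit():
        return "numeric"
    if any(ch.isdigit() for ch in token):
        return "contains_digit"
    if token.islower():
        return "all_lower"
    if token.isupper():
        return "all_upper"
    if token[:1].isupper():
        return "initial_upper"
    return "other"


# Hypothetical sentence; each category would get its own index and embedding.
tokens = ["NATO", "met", "in", "London", "in", "2019", "."]
categories = [casing_feature(t) for t in tokens]
casing2Idx = {c: i for i, c in enumerate(sorted(set(categories)))}
casing_ids = [casing2Idx[c] for c in categories]
print(list(zip(tokens, categories)))
```

Capitalization is lost when words are lowercased for embedding lookup, so a feature like this gives the model back a signal that is often informative for names.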

Reference: Jason P. C. Chiu and Eric Nichols. "Named Entity Recognition with Bidirectional LSTM-CNNs." arXiv:1511.08308.

In [1]:
import numpy as np
import pandas as pd
import tqdm
from keras.layers import TimeDistributed,Conv1D,Dense,Embedding,Input,Dropout,LSTM,Bidirectional,MaxPooling1D,Flatten,concatenate
from keras.models import Model
from keras.utils import Progbar
from util import *

Using TensorFlow backend.


In [2]:
train_processed, train_processed_len, wordEmbeddings, label2Idx = readfile('data/ner_train.txt')

In [3]:
val_processed, _, _, _ = readfile('data/ner_validation.txt')

# BiLSTM

In [7]:
# building the BiLSTM model
words_input = Input(shape=(None,),dtype='int32',name='words_input')
words = Embedding(input_dim=wordEmbeddings.shape[0], output_dim=wordEmbeddings.shape[1],  weights=[wordEmbeddings], trainable=False)(words_input)
output = Bidirectional(LSTM(300, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))(words)
output = TimeDistributed(Dense(len(label2Idx), activation='softmax'))(output)
model = Model(inputs=words_input, outputs= output)
model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam')
model.summary()

# training
epochs = 20
for epoch in range(epochs):    
    #print("Epoch %d/%d"%(epoch,epochs))
    a = Progbar(len(train_processed_len))
    for i,j in enumerate(iterate_minibatches(train_processed,train_processed_len)):
        labels, tokens = j      
        model.train_on_batch(tokens, labels)
        a.update(i)

idx2Label = {v: k for k, v in label2Idx.items()}

# Performance on the training dataset
predLabels, correctLabels = tag_dataset(train_processed, model)
prec_train, rec_train, f1_train = compute_f1(predLabels, correctLabels, idx2Label)
print("Train data precision: %.3f, Rec: %.3f, F1: %.3f" % (prec_train, rec_train, f1_train))

#  Performance on validation dataset        
predLabels, correctLabels = tag_dataset(val_processed, model)        
prec_val, rec_val, f1_val = compute_f1(predLabels, correctLabels, idx2Label)
print("Validation data precision: %.3f, Rec: %.3f, F1: %.3f" % (prec_val, rec_val, f1_val))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
words_input (InputLayer)     (None, None)              0         
_________________________________________________________________
embedding_4 (Embedding)      (None, None, 100)         1841700   
_________________________________________________________________
bidirectional_3 (Bidirection (None, None, 600)         962400    
_________________________________________________________________
time_distributed_4 (TimeDist (None, None, 10)          6010      
Total params: 2,810,110
Trainable params: 968,410
Non-trainable params: 1,841,700
_________________________________________________________________


# LSTM

In [8]:
# building the LSTM model
words_input = Input(shape=(None,),dtype='int32',name='words_input')
words = Embedding(input_dim=wordEmbeddings.shape[0], output_dim=wordEmbeddings.shape[1],  weights=[wordEmbeddings], trainable=False)(words_input)
output = LSTM(300, return_sequences=True, dropout=0.5, recurrent_dropout=0.5)(words)
output = TimeDistributed(Dense(len(label2Idx), activation='softmax'))(output)
model = Model(inputs=words_input, outputs= output)
model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam')
model.summary()

epochs = 20
for epoch in range(epochs):    
    #print("Epoch %d/%d"%(epoch,epochs))
    a = Progbar(len(train_processed_len))
    for i,j in enumerate(iterate_minibatches(train_processed,train_processed_len)):
        labels, tokens = j      
        model.train_on_batch(tokens, labels)
        a.update(i)
        
idx2Label = {v: k for k, v in label2Idx.items()}

# Performance on the training dataset
predLabels, correctLabels = tag_dataset(train_processed, model)
prec_train, rec_train, f1_train = compute_f1(predLabels, correctLabels, idx2Label)
print("Train data precision: %.3f, Rec: %.3f, F1: %.3f" % (prec_train, rec_train, f1_train))

#  Performance on validation dataset        
predLabels, correctLabels = tag_dataset(val_processed, model)        
prec_val, rec_val, f1_val = compute_f1(predLabels, correctLabels, idx2Label)
print("Validation data precision: %.3f, Rec: %.3f, F1: %.3f" % (prec_val, rec_val, f1_val))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
words_input (InputLayer)     (None, None)              0         
_________________________________________________________________
embedding_5 (Embedding)      (None, None, 100)         1841700   
_________________________________________________________________
lstm_5 (LSTM)                (None, None, 300)         481200    
_________________________________________________________________
time_distributed_5 (TimeDist (None, None, 10)          3010      
Total params: 2,325,910
Trainable params: 484,210
Non-trainable params: 1,841,700
_________________________________________________________________
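
The parameter counts in the two model summaries can be checked by hand. The sketch below assumes the standard LSTM layout (four weight blocks, each with input weights, recurrent weights, and a bias) and a dense layer with bias. The helper names are hypothetical, and the sizes (embedding dimension 100, 300 units, 10 labels) are read off the summaries above.

```python
def lstm_params(input_dim, units):
    """Four blocks (input, forget, output gates and candidate), each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias."""
    return input_dim * units + units


embedding_dim, units, n_labels = 100, 300, 10

lstm = lstm_params(embedding_dim, units)
bilstm = 2 * lstm  # one LSTM per direction, with separate weights

print("LSTM layer:          ", lstm)
print("Bidirectional layer: ", bilstm)
print("Dense after LSTM:    ", dense_params(units, n_labels))
print("Dense after BiLSTM:  ", dense_params(2 * units, n_labels))
```

The BiLSTM has twice the recurrent parameters of the LSTM and a dense layer with twice the input width, so it has roughly twice the trainable parameters, which is worth keeping in mind when comparing how strongly the two models overfit.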
