# Named Entity Recognition
## Approach
The Sequence_labeler class (module sequence_lableler_final) implements NER with language modeling as a secondary objective, which improves performance on the primary sequence labeling objective.

- Input tokens are mapped to word embeddings, from which a context-specific representation of each word is obtained
- A biLSTM model built with TensorFlow is trained with sequence labeling as the primary objective and language modeling as the secondary objective
- A CRF or softmax layer predicts the label for each token
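
The multitask objective described in the list above can be sketched as follows. This is a NumPy illustration, not the TensorFlow code in sequence_lableler_final: the function names, the lm_weight value, and the exact form of the language-modeling terms (the forward direction predicts the next token, the backward direction the previous one) are all assumptions.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean softmax cross-entropy; logits is (n, classes), targets is (n,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def joint_loss(label_logits, labels, fw_lm_logits, bw_lm_logits, token_ids, lm_weight=0.1):
    """Labeling loss plus a weighted language-modeling loss (hypothetical weighting)."""
    labeling = cross_entropy(label_logits, labels)
    # Assumed setup: position t of the forward LSTM predicts token t+1,
    # position t of the backward LSTM predicts token t-1.
    forward_lm = cross_entropy(fw_lm_logits[:-1], token_ids[1:])
    backward_lm = cross_entropy(bw_lm_logits[1:], token_ids[:-1])
    return labeling + lm_weight * (forward_lm + backward_lm)

# Toy sentence: 5 tokens, 4 label classes, vocabulary of 10 words, all-zero logits.
n_tokens, n_labels, vocab_size = 5, 4, 10
example_loss = joint_loss(
    np.zeros((n_tokens, n_labels)),
    np.array([0, 1, 2, 3, 0]),
    np.zeros((n_tokens, vocab_size)),
    np.zeros((n_tokens, vocab_size)),
    np.array([1, 2, 3, 4, 5]),
)
print(example_loss)
```

With lm_weight set to 0 the sketch reduces to plain sequence labeling, which makes it easy to see that the language-modeling term only acts as an auxiliary training signal.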

The approach has been run on all three benchmarks chosen by G6: CoNLL 2003, OntoNotes 5.0 and CHEMDNER.

In [2]:
from sequence_lableler_final import Sequence_labeler, parse_config
import os
import sys

In [3]:
path = "conf/fcepublic.conf"
temp_model_path = "model/model"
config = parse_config("config", path)
file_paths = {"train": config["path_train"], "dev": config["path_dev"], "test": config["path_test"]}

Instantiate a Sequence_labeler object.

In [5]:
labeler = Sequence_labeler(config)

In [6]:
dataset = labeler.read_dataset(file_paths,"ConLL")

train 91
dev 30
test 141


Read and format the train, dev and test datasets.

In [7]:
dataset_formatted = dict()
dataset_formatted["train"] = labeler.data_formatted(dataset["train"])
dataset_formatted["dev"] = labeler.data_formatted(dataset["dev"])
dataset_formatted["test"] = labeler.data_formatted(dataset["test"])
data_train, data_dev, data_test = dataset_formatted["train"],dataset_formatted["dev"],dataset_formatted["test"]

Inspect a formatted sentence, then build the vocabulary.

In [8]:
print(data_train[1])

[['Yes', 'UH', '(TOP(S(INTJ*)', 'O', 'bc/cnn/00/cnn_0003', '0', '0', '-', '-', '-', 'Linda_Hamilton', '*', '-'], ['they', 'PRP', '(NP*)', 'O', 'bc/cnn/00/cnn_0003', '0', '1', '-', '-', '-', 'Linda_Hamilton', '*', '(15)'], ['did', 'VBD', '(VP*)', 'O', 'bc/cnn/00/cnn_0003', '0', '2', 'do', '01', '-', 'Linda_Hamilton', '(V*)', '-'], ['/.', '.', '*))', 'O', 'bc/cnn/00/cnn_0003', '0', '3', '-', '-', '-', 'Linda_Hamilton', '*', '-']]
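
Each sentence is a list of token rows. A minimal sketch of pulling tokens and labels out of such rows is shown below, assuming the word sits at index 0 and the NER label at index 3, as the printed sample suggests. tokens_and_labels is a hypothetical helper and not part of Sequence_labeler.

```python
# Rows copied from the printed sample, trimmed to the first four fields.
sentence = [
    ["Yes", "UH", "(TOP(S(INTJ*)", "O"],
    ["they", "PRP", "(NP*)", "O"],
    ["did", "VBD", "(VP*)", "O"],
    ["/.", ".", "*))", "O"],
]

def tokens_and_labels(rows, token_col=0, label_col=3):
    """Return parallel lists of tokens and labels (column positions are assumptions)."""
    tokens = [row[token_col] for row in rows]
    labels = [row[label_col] for row in rows]
    return tokens, labels

tokens, labels = tokens_and_labels(sentence)
print(tokens)
print(labels)
```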


In [9]:
labeler.build_vocabs(data_train, data_dev, data_test, config["preload_vectors"])
labeler.construct_network()
labeler.initialize_session()

all labels:  Counter({'O': 199, 'I-WORK_OF_ART': 17, 'B-PERSON': 9, 'B-WORK_OF_ART': 6, 'I-PERSON': 3, 'I-FAC': 3, 'B-ORDINAL': 2, 'B-GPE': 1, 'TRUE_LABEL': 1, 'B-CARDINAL': 1, 'B-FAC': 1})
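
The label counts above are the raw material for the label2id and id2label mappings used later in this notebook. How build_vocabs constructs them is not shown, so the following is only a sketch of one plausible scheme: ids are assigned by descending frequency, and build_label_maps is a hypothetical name.

```python
from collections import Counter

def build_label_maps(label_counts):
    """Assign integer ids to labels, most frequent first (assumed ordering)."""
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    label2id = {label: idx for idx, label in enumerate(ordered)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label

# A subset of the counts printed above.
counts = Counter({"O": 199, "I-WORK_OF_ART": 17, "B-PERSON": 9, "B-WORK_OF_ART": 6})
label2id, id2label = build_label_maps(counts)
print(label2id)
```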


Load the preloaded word embeddings and, if available, a saved model.

In [11]:
if config["preload_vectors"] is not None:
    labeler.preload_word_embeddings(config["preload_vectors"])
# Restore a saved model only when the configured path is set and exists on disk.
if config["load"] and os.path.exists(config["load"]):
    try:
        labeler.load_model(labeler.session, temp_model_path)
    except Exception as exc:
        print("error in loading the model:", exc)

n_preloaded_embeddings: 86


Training the model 

In [12]:
labeler.train(data_train,data_dev,temp_model_path)

EPOCH: 0
current_learningrate: 1.0
precision:  0.7777777777777778  recall:  0.9545454545454546  F1 :  0.8571428571428572
best_epoch: 0
EPOCH: 1
current_learningrate: 1.0
precision:  0.7857142857142857  recall:  1.0  F1 :  0.88
best_epoch: 1
EPOCH: 2
current_learningrate: 1.0
precision:  0.7857142857142857  recall:  1.0  F1 :  0.88
best_epoch: 1
EPOCH: 3
current_learningrate: 1.0
precision:  0.7857142857142857  recall:  1.0  F1 :  0.88
best_epoch: 1
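
In the log above, best_epoch stays at 1 even though epochs 2 and 3 reach the same F1. That is consistent with a rule that updates the best epoch only on strict improvement. The sketch below illustrates that rule with the logged F1 values. The actual selection logic and metric inside labeler.train are not shown here, so treat this as an assumption.

```python
def best_epoch(f1_scores):
    """Return the index of the first epoch that reaches the highest score."""
    best_idx, best_f1 = -1, float("-inf")
    for epoch, f1 in enumerate(f1_scores):
        # Strict comparison: a tie does not replace the earlier best epoch.
        if f1 > best_f1:
            best_idx, best_f1 = epoch, f1
    return best_idx

# F1 values copied from the training log.
f1_per_epoch = [0.8571428571428572, 0.88, 0.88, 0.88]
print(best_epoch(f1_per_epoch))
```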


Predict labels for the test set.

In [13]:
predictions_formatted = labeler.predict(data_test)

Predictions are printed in the format [start index, span, token, token type].
In this case, the first two fields are None for all predictions.


In [14]:
print(predictions_formatted[2])

[[None, None, 'but', 'O'], [None, None, 'I', 'O'], [None, None, 'am', 'O'], [None, None, 'not', 'O'], [None, None, 'selling', 'O'], [None, None, 'medicine', 'O'], [None, None, 'or', 'O'], [None, None, 'pharmaceuticals', 'O'], [None, None, '/.', 'O']]
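
Because the start index and span fields are None, entity spans have to be recovered from the BIO tags themselves. The sketch below groups consecutive B-/I- tags into spans. bio_to_spans is a hypothetical helper, the example rows are made up for illustration, and token-level offsets stand in for the missing character offsets.

```python
def bio_to_spans(tagged_tokens):
    """Group (token, tag) pairs into (entity_type, start, end_exclusive, text) spans."""
    spans = []
    start, ent_type = None, None
    tokens = [token for token, _ in tagged_tokens]
    for i, (_, tag) in enumerate(tagged_tokens):
        if tag.startswith("I-") and ent_type == tag[2:]:
            continue  # still inside the current entity
        if ent_type is not None:
            # The current entity ended at the previous token.
            spans.append((ent_type, start, i, " ".join(tokens[start:i])))
        if tag == "O":
            start, ent_type = None, None
        else:
            # A B- tag, or an I- tag with no matching open entity, starts a new span.
            start, ent_type = i, tag[2:]
    if ent_type is not None:
        spans.append((ent_type, start, len(tokens), " ".join(tokens[start:])))
    return spans

# Illustrative rows in the [start index, span, token, token type] layout.
example = [
    [None, None, "Linda", "B-PERSON"],
    [None, None, "Hamilton", "I-PERSON"],
    [None, None, "spoke", "O"],
]
example_spans = bio_to_spans([(row[2], row[3]) for row in example])
print(example_spans)
```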


In [15]:
cost = 0

# Map each predicted label string (last field of every prediction) to its integer id.
predicted_labels = [
    [labeler.label2id[token[-1]] for token in sentence]
    for sentence in predictions_formatted
]

# Copy the gold test sentences; evaluate() receives the full token rows.
groundTruths = [list(sentence) for sentence in data_test]

In [16]:
with open('result.txt', 'w', encoding='utf-8') as f:
    for i in range(len(data_test)):
        for j in range(len(data_test[i])):
            # One token per line: word, field at index 3 of the test row, predicted label.
            f.write(str(data_test[i][j][0]) + " " + str(data_test[i][j][3]) + " " + str(labeler.id2label[predicted_labels[i][j]]))
            f.write("\n")
        # Blank line between sentences.
        f.write("\n")

In [18]:
precision,recall,f1 = labeler.evaluate(predicted_labels,groundTruths, cost,"test")

This result is on a small sample dataset; actual numbers are reported in the slide deck.

In [19]:
print(precision,recall,f1)

0.7727272727272727 1.0 0.8717948717948718
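
The three printed numbers are consistent with F1 being the harmonic mean of precision and recall. A short check is sketched below. f1_score is a hypothetical helper rather than part of Sequence_labeler, and how labeler.evaluate computes its values internally is not shown in this notebook.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the printed test result.
test_f1 = f1_score(0.7727272727272727, 1.0)
print(test_f1)
```

The result agrees with the reported F1 of 0.8717948717948718 up to floating-point rounding.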
