# Named Entity Recognition With Parallel Recurrent Neural Networks

This paper uses parallel RNNs to predict the tags of entities. Using parallel RNNs reduces the number of parameters in the network, which makes training faster. To use the library, users need to download the [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings, which translate words into float vectors. The embedding file should be put in the `./data/embeddings/` folder. In this demonstration, we use `glove.6B.100d.txt`, which has **100** dimensions.
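For orientation, a GloVe text file stores one word per line, followed by the components of its vector, separated by spaces. The following is a minimal sketch of parsing that layout. The helper name `parse_glove_lines` is hypothetical and is not part of this library, and the two sample lines are made up (3 dimensions instead of 100).

```python
def parse_glove_lines(lines):
    """Parse GloVe-style text lines into a {word: [float, ...]} dictionary.

    Hypothetical helper for illustration; not part of the library.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(v) for v in values]
    return embeddings


# Made-up sample with 3 dimensions; glove.6B.100d.txt has 100 per word.
sample = ["the 0.1 -0.2 0.3", "cat 0.5 0.0 -1.0"]
vectors = parse_glove_lines(sample)
print(len(vectors["the"]))
```

To read a real file, an open file object can be passed in place of `sample`, since the function only iterates over lines.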

Besides the embedding file, we also need a dataset to train the RNN model. In this document, we use [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/) as the dataset. It can be downloaded from many places, e.g., from [here](https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003). The dataset contains 3 files: `train.txt` for training, `valid.txt` for validating the model during training, and `test.txt` for testing the trained model.
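CoNLL 2003 files list one token per line with four whitespace-separated columns (word, part-of-speech tag, chunk tag, named-entity tag), and a blank line ends each sentence. The sketch below shows one way such lines can be grouped into sentences and tag sequences. The function `read_conll` is a hypothetical illustration, not the library's `read_dataset`, and skipping `-DOCSTART-` lines is an assumption about how document markers are handled.

```python
def read_conll(lines):
    """Group CoNLL-style lines into (sentences, tags).

    Hypothetical helper for illustration; the library may parse differently.
    """
    sentences, tags = [], []
    cur_tokens, cur_tags = [], []
    for line in lines:
        line = line.strip()
        # A blank line or a document marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if cur_tokens:
                sentences.append(cur_tokens)
                tags.append(cur_tags)
                cur_tokens, cur_tags = [], []
            continue
        columns = line.split()
        cur_tokens.append(columns[0])   # first column: the word
        cur_tags.append(columns[-1])    # last column: the entity tag
    if cur_tokens:  # flush the final sentence if the input lacks a trailing blank line
        sentences.append(cur_tokens)
        tags.append(cur_tags)
    return sentences, tags


# Made-up sample in the four-column layout.
sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "Peter NNP B-NP B-PER",
    "Blackburn NNP I-NP I-PER",
]
sentences, tags = read_conll(sample)
print(sentences)
print(tags)
```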

After all preparations are done, confirm that Python is version 3 and that TensorFlow is version 1.13.1.

In [1]:
import tensorflow as tf; print(tf.__version__)

1.13.1
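The cell above checks only TensorFlow. The Python version can be checked in a similar way with the standard `sys` module. This is a minimal sketch; the helper `version_tuple` is hypothetical and is only there to compare dotted version strings numerically.

```python
import sys


def version_tuple(version):
    """Turn a dotted version string such as '1.13.1' into (1, 13, 1)."""
    return tuple(int(part) for part in version.split("."))


# True when the interpreter running this code is Python 3.
python_is_v3 = sys.version_info.major == 3
print(python_is_v3)

# Compare a reported version string with the one this notebook expects.
reported = "1.13.1"  # the value printed by tf.__version__ in this notebook
print(version_tuple(reported) == (1, 13, 1))
```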


## Import the library

The first step in using the library is to `import` it.

In [3]:
import extraction.named_entity.UOI.Ner as Ner

Using TensorFlow backend.


In [4]:
uoi = Ner.UOI()

## Read the data set for training

The `read_dataset` method takes two parameters: `input_files`, a list containing the paths of the training and validation datasets, and `embedding`, the path of the embedding file.

In [14]:
train_sen, train_tag, val_sen, val_tag = uoi.read_dataset(
    input_files=['/home/ubuntu/UOI-P18-2012/dataset/CoNNL2003eng/train.txt',
                 '/home/ubuntu/UOI-P18-2012/dataset/CoNNL2003eng/valid.txt'],
    embedding='/home/ubuntu/UOI-P18-2012/dataset/CoNNL2003eng/glove.6B.100d.txt')

The method returns four arrays: `train_sen`, containing all training sentences; `train_tag`, containing the tags for the training sentences; `val_sen`, containing the validation sentences; and `val_tag`, containing the validation tags.

In [15]:
train_sen[0:3]  # let's look at a sample

[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
 ['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22']]

## Train the model

After reading the dataset, we can start training the model. **To speed up the process, only one epoch is trained, so the F1 score might be very low.** Training creates a file named `model.h5`, so **please confirm that Python has permission to create files.**
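One way to check that permission before training is to try creating a temporary file in the target directory. The sketch below assumes `model.h5` is written to the current working directory, which the text above does not state; the helper `can_create_file` is hypothetical and not part of the library.

```python
import os
import tempfile


def can_create_file(directory):
    """Return True if a file can actually be created in `directory`."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False


# Assumption: the model file is saved to the current working directory.
print(can_create_file(os.getcwd()))
```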

In [16]:
uoi.train()

Embedding...
Embedding open file
Embedding for loop
Embedding done
Embedding all done
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input_Char (InputLayer)         (None, None, 61)     0                                            
__________________________________________________________________________________________________
Input_Word (InputLayer)         (None, None)         0                                            
__________________________________________________________________________________________________
Embedding_Char_Pre (Embedding)  (None, None, 61, 100 8600        Input_Char[0][0]                 
__________________________________________________________________________________________________
Embedding_Word (Embedding)      (None, None, 100)    40000200    Input_Word[0][0]                 
_______________________

## Predict the data

Once the model is trained, we can use it to predict tags for new sentences.

In [17]:
predict_file='dataset/CoNNL2003eng/test.txt'
tokens, tags, predicts = uoi.predict(predict_file)

This method returns 3 values: the tokens, the ground-truth tags, and the predictions.

In [18]:
print(tokens[0:5])

[['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.'], ['Nadim', 'Ladki'], ['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06'], ['Japan', 'began', 'the', 'defence', 'of', 'their', 'Asian', 'Cup', 'title', 'with', 'a', 'lucky', '2-1', 'win', 'against', 'Syria', 'in', 'a', 'Group', 'C', 'championship', 'match', 'on', 'Friday', '.'], ['But', 'China', 'saw', 'their', 'luck', 'desert', 'them', 'in', 'the', 'second', 'match', 'of', 'the', 'group', ',', 'crashing', 'to', 'a', 'surprise', '2-0', 'defeat', 'to', 'newcomers', 'Uzbekistan', '.']]


In [19]:
print(tags[0:5])

[['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O'], ['B-PER', 'I-ORG'], ['B-LOC', 'O', 'B-LOC', 'I-MISC', 'I-LOC', 'O'], ['B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'I-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']]


In [20]:
print(predicts[0:5])

[['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O'], ['B-PER', 'I-PER'], ['B-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O'], ['B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'I-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']]
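To see where the predictions differ from the ground truth, the three lists can be walked in parallel. The sketch below uses the first three sentences from the outputs above; the helper `find_mismatches` is hypothetical and not part of the library. It reports token-level agreement, which is not the same measure as precision, recall, or F1.

```python
def find_mismatches(tokens, tags, predicts):
    """Return (sentence_index, token, gold_tag, predicted_tag) for each disagreement."""
    mismatches = []
    for i, (sentence, gold, pred) in enumerate(zip(tokens, tags, predicts)):
        for token, g, p in zip(sentence, gold, pred):
            if g != p:
                mismatches.append((i, token, g, p))
    return mismatches


# First three sentences, copied from the notebook outputs above.
tokens = [
    ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.'],
    ['Nadim', 'Ladki'],
    ['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06'],
]
tags = [
    ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O'],
    ['B-PER', 'I-ORG'],
    ['B-LOC', 'O', 'B-LOC', 'I-MISC', 'I-LOC', 'O'],
]
predicts = [
    ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O'],
    ['B-PER', 'I-PER'],
    ['B-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O'],
]

mismatches = find_mismatches(tokens, tags, predicts)
total = sum(len(sentence) for sentence in tokens)
accuracy = (total - len(mismatches)) / total  # share of tokens with matching tags

for mismatch in mismatches:
    print(mismatch)
print(accuracy)
```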


The `evaluate` method checks the performance of the model. Its arguments are not used, since the statistics are generated during prediction.

In [21]:
uoi.evaluate([],[])

(0.87972996982062, 0.8767705382654443, 0.8782477609509182)
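The text does not say which score is which in this tuple. The third value matches the harmonic mean of the first two, which is what an F1 score computed from precision and recall would give, so the tuple is consistent with precision, recall, and F1 (the order of the first two is an assumption). A short sketch of that check, with a hypothetical helper `f1_score`:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the output of uoi.evaluate above.
first, second, third = (0.87972996982062, 0.8767705382654443, 0.8782477609509182)

# True when the third value agrees with the harmonic mean of the first two.
is_harmonic_mean = abs(f1_score(first, second) - third) < 1e-6
print(is_harmonic_mean)
```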

The `convert_ground_truth` method takes one parameter, the path of the test file, and returns the list of tags.

In [8]:
result = uoi.convert_ground_truth('dataset/CoNNL2003eng/test.txt')

In [9]:
result[0:10]

['O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O']