# Fast and Accurate Sequence Labeling with Iterated Dilated Convolution

This document shows how to use Iterated Dilated Convolutions to predict entity tags. Because dilated convolutions cover a wider context per layer than standard convolutions, the model needs fewer layers and runs faster than a traditional CNN covering the same context. To use the library, first download the [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings, which map words to vectors of floats. Put the embedding file under the `./data/embeddings/` folder. This demonstration uses `glove.6B.100d.txt`, which has **100** dimensions.
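For reference, each line of a GloVe text file holds a word followed by its space-separated vector components. The sketch below is a minimal reader for that format. `load_glove` is a hypothetical helper written for illustration; it is not part of this library, which reads the file itself.

```python
import io


def load_glove(lines, dim=100):
    """Parse GloVe-format text lines into a {word: [floats]} dict.

    Hypothetical helper for illustration; not part of this library.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory sample with 3 dimensions instead of 100.
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
embeddings = load_glove(sample, dim=3)
print(embeddings["cat"])
```

With the real file, the same function could be given an open file handle for `./data/embeddings/glove.6B.100d.txt` and the default `dim=100`.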

Besides the embedding file, we also need a dataset to train the CNN model. This document uses [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/). It can be downloaded from several places, e.g. from [here](https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003). The dataset contains three files: `eng.train` for training, `eng.testa` for validating the model during training, and `eng.testb` for evaluation.
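To get a feel for the data, the sketch below reads CoNLL-style text into sentences of `(token, tag)` pairs. It assumes whitespace-separated columns with the token first and the NER tag last, blank lines between sentences, and `-DOCSTART-` lines marking document boundaries. `read_conll` is a hypothetical helper, and the sample is a short illustrative excerpt whose tag scheme may differ from the copy you download.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs.

    Hypothetical helper for illustration; not part of this library.
    """
    sentence = []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""
sentences = list(read_conll(sample.splitlines()))
print(len(sentences), sentences[1])
```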

Once everything is in place, confirm that Python is version 3 and TensorFlow is version 1.13.1.

In [2]:
import tensorflow as tf; print(tf.__version__)

1.13.1


## Import the library

The first step is to *import* the library and create a `DilatedCNN` instance.

In [3]:
from extraction.named_entity.DilatedCNN import Ner
d = Ner.DilatedCNN()

## Read the data set for training

`read_dataset` takes two parameters. The first, `input_files`, is a list of three paths to the *train*, *testa*, and *testb* files, in that order. The second, `embedding`, is the path to the embedding file.

In [4]:
total_training_sample = d.read_dataset(
    input_files=[
        '/home/ubuntu/dilated-cnn-ner/data/conll2003/eng.train',
        '/home/ubuntu/dilated-cnn-ner/data/conll2003/eng.testa',
        '/home/ubuntu/dilated-cnn-ner/data/conll2003/eng.testb',
    ],
    embedding='./data/embeddings/glove.6B.100d.txt',
)

5848


`read_dataset` preprocesses the input data by vectorizing the words and returns the number of sentences. Because preprocessing creates files and folders, please **confirm** that the Python process has the privilege to do so.
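One common way to vectorize words is to map each token to an integer id that indexes the embedding table, with rare or unseen tokens sharing a single out-of-vocabulary id. The sketch below illustrates that idea only. `build_vocab`, `vectorize`, and the `min_count` threshold are hypothetical names chosen for this example, and the library's own preprocessing may differ in detail.

```python
from collections import Counter

OOV = "<OOV>"


def build_vocab(sentences, min_count=1):
    """Assign an id to every token seen at least min_count times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {OOV: 0}  # id 0 is reserved for out-of-vocabulary tokens
    for tok, n in sorted(counts.items()):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def vectorize(sentence, vocab):
    """Replace each token by its id, falling back to the OOV id."""
    return [vocab.get(tok, vocab[OOV]) for tok in sentence]


train = [["West", "Indian", "took", "four"], ["took", "four", "for", "Friday"]]
vocab = build_vocab(train, min_count=2)
ids = vectorize(["Phil", "took", "four"], vocab)
print(vocab, ids)
```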

## Training the model

The most important part is training the CNN model. Invoking the `train` method starts the training, and progress information is printed while it runs. **To keep this demo short, the number of epochs is reduced from 100 to 2, so the F1 score will be very low.**

In [5]:
d.train(data=None)

infile/home/ubuntu/dilated-cnn-ner/data/conll2003/eng.train
out_dir/home/ubuntu/dilated-cnn-ner/data/conll2003/conll2003-w3-lample/train.txt
vocab/home/ubuntu/dilated-cnn-ner/data/conll2003/vocabs/conll2003_cutoff_4.txt
{'in_file': '/home/ubuntu/dilated-cnn-ner/data/conll2003/eng.train', 'vocab': '/home/ubuntu/dilated-cnn-ner/data/conll2003/vocabs/conll2003_cutoff_4.txt', 'labels': '', 'shapes': '', 'chars': '', 'embeddings': './data/embeddings/glove.6B.100d.txt', 'out_dir': '/home/ubuntu/dilated-cnn-ner/data/conll2003/conll2003-w3-lample/train.txt', 'window_size': 3, 'lowercase': False, 'start_end': False, 'debug': False, 'predict_pad': False, 'documents': False, 'update_maps': True, 'update_vocab': '', 'dataset': 'conll2003', 'f': ''}
start preprocess
Embeddings coverage: 89.64%
infile/home/ubuntu/dilated-cnn-ner/data/conll2003/eng.testa
out_dir/home/ubuntu/dilated-cnn-ner/data/conll2003/conll2003-w3-lample/valid.txt
vocab/home/ubuntu/dilated-cnn-ner/data/conll2003/vocabs/conll2003_c

Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Training on 14041 sentences (14041 examples)
                2816 examples at 65.04 examples/sec. Error: 0.61038
                5632 examples at 63.54 examples/sec. Error: 0.51379
                8448 examples at 65.65 examples/sec. Error: 0.45537
               11353 examples at 65.80 examples/sec. Error: 0.40900
Segment evaluation TRAIN (iteration 1):
	        F1	Prec	Recall
Micro (Seg)	44.80	51.75	39.50
Macro (Seg)	35.65	58.64	25.61
-------
-------
       ORG	35.65	58.64	25.61
         O	0.00	0.00	100.00
      MISC	0.00	0.00	0.00
       PER	48.51	42.73	56.09
       LOC	58.02	60.82	55.46
Processed 203621 tokens with 23499 phrases; found: 17936 phrases; correct:

## Predict the tags

After training the model, we can tag a new dataset by invoking the `predict` method. It takes one parameter, the path to the input file, and returns a list containing the predicted tags.

In [7]:
result = d.predict('/home/ubuntu/dilated-cnn-ner/data/conll2003-w3-lample/eng.testb')
result[0:100]

num classes: 17
/home/ubuntu/dilated-cnn-ner/data/conll2003/conll2003-w3-lample/train.txt/sizes.txt
num train examples: 14041
num train tokens: 203621
/home/ubuntu/dilated-cnn-ner/data/conll2003/conll2003-w3-lample/valid.txt
/home/ubuntu/dilated-cnn-ner/data/conll2003/conll2003-w3-lample/valid.txt/sizes.txt
num dev examples: 3250
num dev tokens: 51362
{'ORG': 0, 'O': 1, 'MISC': 2, 'PER': 3, 'LOC': 4}
Loaded 3226/5850 embeddings (55.15% coverage)
[<tf.Tensor 'forward/embedding_lookup/Identity:0' shape=(?, ?, 100) dtype=float32>, <tf.Tensor 'forward/embedding_lookup_1/Identity:0' shape=(?, ?, 5) dtype=float32>]
Adding initial layer conv0: width: 3; filters: 300
input feats expanded drop (?, 1, ?, 105)
last out shape (?, 1, ?, 300)
last dims 300
Adding layer conv1: dilation: 1; width: 3; filters: 300; take: False
Adding layer conv2: dilation: 2; width: 3; filters: 300; take: False
Adding layer conv3: dilation: 1; width: 3; filters: 300; take: True
input feats expanded drop (?, 1, ?, 105)
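
The layer log above lists a width-3 initial convolution followed by width-3 layers with dilations 1, 2, and 1. As a back-of-the-envelope check of how dilation widens the context that each output position sees, the sketch below applies the standard receptive-field rule, in which each layer adds `(width - 1) * dilation` tokens. It assumes stride 1 everywhere and a dilation of 1 for `conv0`, neither of which the log states, and it covers a single pass through the stack.

```python
def receptive_field(layers):
    """Receptive field, in tokens, of stacked stride-1 convolutions.

    layers: list of (width, dilation) pairs, in order.
    """
    field = 1
    for width, dilation in layers:
        field += (width - 1) * dilation
    return field


# (width, dilation) per layer, read from the log.
# conv0's dilation of 1 is an assumption.
block = [(3, 1), (3, 1), (3, 2), (3, 1)]
print(receptive_field(block))            # 11
print(receptive_field([(3, 1)] * 4))     # 9, same depth without dilation
```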


In [9]:
print(result[100:500])

 

AFTER B-ORG O 

<OOV> B-ORG O 

<OOV> B-ORG O 

. B-ORG O 



LONDON B-MISC B-LOC 

0000-00-00 B-ORG O 



West B-PER B-PER 

Indian I-PER I-PER 

<OOV> B-ORG O 

Phil B-LOC B-PER 

<OOV> B-ORG I-PER 

took B-ORG O 

four B-ORG O 

for B-ORG O 

00 B-ORG O 

on B-ORG O 

Friday B-ORG O 

as B-ORG O 

Leicestershire O B-ORG 

beat B-ORG O 

Somerset O B-ORG 

by B-ORG O 

an B-ORG O 

innings B-


## Evaluate the model

The `evaluate` method returns the evaluation results: the micro- and macro-averaged segment-level F1, precision, and recall.

In [10]:
d.evaluate()

['\t        F1\tPrec\tRecall\n',
 'Micro (Seg)\t1.91\t9.90\t1.06\n',
 'Macro (Seg)\t0.14\t2.52\t0.07\n']
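
The two rows differ in how per-class results are usually combined: micro-averaging pools the counts of all classes before computing precision and recall, while macro-averaging computes F1 per class and then takes the unweighted mean. The sketch below illustrates the difference with made-up counts. It is not the library's evaluation code, which may compute its averages differently, and `prf` and the example numbers are purely illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up segment counts per entity type: (tp, fp, fn).
counts = {"PER": (8, 2, 2), "LOC": (1, 1, 3)}

# Micro: sum the counts over classes, then compute F1 once.
micro_f1 = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))[2]
# Macro: compute F1 per class, then average.
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)
print(round(micro_f1, 4), round(macro_f1, 4))
```

A frequent class with good scores pulls the micro average up, while the macro average weights every class equally.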

## Convert to ground truth

The `convert_ground_truth` method returns the tags in the test file whose path is given as the parameter. The result is a list of tag strings.

In [11]:
result = d.convert_ground_truth('/home/ubuntu/dilated-cnn-ner/data/conll2003/eng.testb')
result[0:20]

['O',
 'O',
 'O',
 'B-LOC',
 'O',
 'O',
 'O',
 'O',
 'B-PER',
 'O',
 'O',
 'O',
 'O',
 'B-PER',
 'I-PER',
 'B-LOC',
 'O',
 'B-LOC',
 'I-LOC',
 'I-LOC']
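
Tags like these are often easier to inspect as entity spans. The sketch below groups a tag list into `(start, end, type)` spans, assuming the usual reading in which `B-` opens an entity and `I-` continues one of the same type. `bio_to_spans` is a hypothetical helper, not part of the library.

```python
def bio_to_spans(tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans = []
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not inside:
            spans.append((start, i, kind))
            start, kind = None, None
        # A stray I- tag with no open span is treated as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


# The 20 tags shown in the output above.
tags = ['O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O',
        'O', 'O', 'O', 'B-PER', 'I-PER', 'B-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
spans = bio_to_spans(tags)
print(spans)
```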