# 4. Named Entity Recognition with Window Classifier

We will perform named entity recognition (NER) with a window classifier. As you may have already noticed, recurrent networks such as RNN, GRU, and LSTM tend to work well on this kind of task, so we will revisit NER after we have covered those networks.
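
A window classifier predicts the tag of one word from a fixed-size window of its neighbors. The sketch below shows one plausible way to cut a sentence into such windows. The `make_windows` helper and the `<pad>` token are hypothetical and are not taken from the `models` package; the sketch also assumes that `window_size` counts every token in the window, including the center word.

```python
PAD = "<pad>"  # hypothetical padding token, not taken from the models package


def make_windows(words, window_size=5):
    """Return one window of `window_size` tokens centered on each word."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that there is a center word")
    half = window_size // 2
    # Pad both ends so that the first and last words also get full windows.
    padded = [PAD] * half + list(words) + [PAD] * half
    return [padded[i:i + window_size] for i in range(len(words))]


words = ("Sao", "Paulo", "(", "Brasil", ")")
tags = ("B-LOC", "I-LOC", "O", "B-LOC", "O")
for window, tag in zip(make_windows(words), tags):
    # The classifier would see `window` and be trained to predict `tag`.
    print(window, "->", tag)
```

Each window is labeled with the tag of its center word, so a sentence of `n` words yields `n` training examples.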

### References
- [CS224n: Natural Language Processing with Deep Learning - Lecture 4](http://web.stanford.edu/class/cs224n/lectures/lecture4.pdf)



In [1]:
from models import WindowClassifier
import nltk
import random

## Load and Preprocess Corpus

In [2]:
# If the corpus is missing, run nltk.download('conll2002') once beforehand
corpus = nltk.corpus.conll2002.iob_sents()

In [3]:
data = []
for sent in corpus:
    # Each sentence is a list of (word, POS tag, IOB tag) triples; drop the POS tags
    words, _, tags = zip(*sent)
    data.append([words, tags])

In [4]:
print(len(data))
print(data[0])

35651
[('Sao', 'Paulo', '(', 'Brasil', ')', ',', '23', 'may', '(', 'EFECOM', ')', '.'), ('B-LOC', 'I-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O')]
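
In the IOB scheme shown above, `B-` marks the first token of an entity, `I-` marks a continuation, and `O` marks a token outside any entity. As a minimal sketch of how to read such a tag sequence, the hypothetical helper `iob_to_spans` below (not part of `nltk` or the `models` package) groups tagged tokens back into entities. It leniently treats a stray `I-` tag as the start of a new entity.

```python
def iob_to_spans(words, tags):
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    spans, current_type, current_words = [], None, []
    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and etype == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        if current_type is not None:
            # Any other tag closes the open entity.
            spans.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            current_type, current_words = etype, [word]
        else:
            current_type, current_words = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_words)))
    return spans


words = ('Sao', 'Paulo', '(', 'Brasil', ')', ',', '23', 'may', '(', 'EFECOM', ')', '.')
tags = ('B-LOC', 'I-LOC', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'O')
print(iob_to_spans(words, tags))
# [('LOC', 'Sao Paulo'), ('LOC', 'Brasil'), ('ORG', 'EFECOM')]
```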


In [5]:
# split train/test data
random.seed(1004)
random.shuffle(data)
idx = int(len(data) * 0.8)
train_data = data[:idx]
test_data = data[idx:]

## Fit and Train WindowClassifier

In [6]:
model = WindowClassifier.WindowClassifier(word_embedding_size=100,
                                          window_size=5,
                                          hidden_size=300,
                                          learning_rate=0.001)



In [7]:
model.fit_to_data(train_data)

Instructions for updating:
Use the retry module or similar alternatives.


In [8]:
model.train(10, save_dir="save/04_ner", log_dir="log/04_ner", print_every=1000)

Writing TensorBoard summaries to log/04_ner
Saving TensorFlow models to save/04_ner
--------------------------------------------------------------------------------
Created and Initialized fresh model. Size: 6130409
Total number of batches: 4241
--------------------------------------------------------------------------------
step: 1000, epoch:1, time/batch: 0.004488, avg_loss: 0.5008
step: 2000, epoch:1, time/batch: 0.004278, avg_loss: 0.3411
step: 3000, epoch:1, time/batch: 0.004164, avg_loss: 0.2906
step: 4000, epoch:1, time/batch: 0.0042, avg_loss: 0.2473
step: 5000, epoch:2, time/batch: 0.00419, avg_loss: 0.1753
Saved summaries at step 5000
Saved a model at step 5000
step: 6000, epoch:2, time/batch: 0.003725, avg_loss: 0.1726
step: 7000, epoch:2, time/batch: 0.003706, avg_loss: 0.1559
step: 8000, epoch:2, time/batch: 0.004471, avg_loss: 0.1451
step: 9000, epoch:3, time/batch: 0.004477, avg_loss: 0.09224
step: 10000, epoch:3, time/batch: 0.004463, avg_loss: 0.09264
Saved summaries a

## Test
According to [Named Entity Recognition with Character-Level Models - Klein et al.](https://nlp.stanford.edu/cmanning/papers/conll-ner.pdf), "*because of data sparsity, sophisticated unknown word models are generally required for good performance.*"

This model, however, does nothing special for unknown words at test time: for convenience, every unknown word is embedded as a zero vector. We may go deeper into NER after we cover some CNN and RNN models.
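
The zero-vector fallback can be sketched as follows. This is an illustration in plain Python lists and not the code used by `WindowClassifier`: the helpers `build_vocab` and `embed`, the tiny dimension, and the random table standing in for trained embeddings are all assumptions.

```python
import random


def build_vocab(sentences):
    """Map every word seen in the training data to an integer id."""
    vocab = {}
    for words, _tags in sentences:
        for word in words:
            vocab.setdefault(word, len(vocab))
    return vocab


def embed(word, vocab, table, dim):
    """Return the stored vector for a known word, or zeros for an unknown one."""
    idx = vocab.get(word)
    if idx is None:
        return [0.0] * dim  # unknown word: contributes nothing to the window
    return table[idx]


dim = 4  # kept small for readability; the notebook uses word_embedding_size=100
train = [(("Sao", "Paulo"), ("B-LOC", "I-LOC"))]
vocab = build_vocab(train)
rng = random.Random(0)
# Random values stand in for trained embedding weights.
table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

print(embed("Paulo", vocab, table, dim))   # vector stored for a known word
print(embed("Madrid", vocab, table, dim))  # unseen word: all zeros
```

With this fallback, an unknown center word can only be classified from its context words, which is one reason the quoted paper recommends dedicated unknown-word models.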

In [9]:
model.test(test_data, load_dir="save/04_ner")

INFO:tensorflow:Restoring parameters from save/04_ner/WC_NER-42401.model
--------------------------------------------------------------------------------
Restored model from checkpoint for testing. Size: 6130409
--------------------------------------------------------------------------------
             precision    recall  f1-score   support

      B-LOC       0.82      0.78      0.79      2237
     B-MISC       0.68      0.63      0.65      1608
      B-ORG       0.82      0.80      0.81      2963
      B-PER       0.90      0.84      0.87      2534
      I-LOC       0.68      0.58      0.63       615
     I-MISC       0.57      0.52      0.54      1305
      I-ORG       0.78      0.77      0.77      2043
      I-PER       0.90      0.84      0.87      1859

avg / total       0.79      0.75      0.77     15164
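
The report above lists token-level scores for every tag except `O`. As a rough sketch of how such per-tag precision, recall, F1, and support can be derived from flat lists of gold and predicted tags, consider the hypothetical function `per_tag_scores` below. It is written in plain Python for illustration and is not necessarily how `model.test` produces its report.

```python
from collections import Counter


def per_tag_scores(gold, pred, ignore=("O",)):
    """Return {tag: (precision, recall, f1, support)} at the token level."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it does not belong
            fn[g] += 1  # missed an occurrence of gold tag g
    scores = {}
    for tag in sorted(set(gold) | set(pred)):
        if tag in ignore:
            continue
        predicted = tp[tag] + fp[tag]
        support = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / support if support else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1, support)
    return scores


gold = ["B-LOC", "I-LOC", "O", "B-LOC", "O", "B-ORG"]
pred = ["B-LOC", "O", "O", "B-LOC", "O", "B-LOC"]
for tag, (p, r, f, s) in per_tag_scores(gold, pred).items():
    print(f"{tag:>6}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}  support={s}")
```

Leaving `O` out keeps the scores from being dominated by the many tokens that belong to no entity.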

