# Named Entity Recognition

## Annotated Corpus for Named Entity Recognition (From Kaggle)
In this notebook we build a basic NER model on this corpus using the ELMo embedding module. All sentences are split into shorter word sequences of a fixed length.

In [1]:
import numpy as np
import pandas as pd

### Importing data
Loading the data. It is already well preprocessed, and the file contains many additional engineered features.

In [2]:
df = pd.read_csv("data/ner.csv", encoding = "ISO-8859-1", error_bad_lines=False)
df = df.iloc[281835:]
df.head()

b'Skipping line 281837: expected 25 fields, saw 34\n'


Unnamed: 0.1,Unnamed: 0,lemma,next-lemma,next-next-lemma,next-next-pos,next-next-shape,next-next-word,next-pos,next-shape,next-word,...,prev-prev-lemma,prev-prev-pos,prev-prev-shape,prev-prev-word,prev-shape,prev-word,sentence_idx,shape,word,tag
281835,0,thousand,of,demonstr,NNS,lowercase,demonstrators,IN,lowercase,of,...,__start2__,__START2__,wildcard,__START2__,wildcard,__START1__,1.0,capitalized,Thousands,O
281836,1,of,demonstr,have,VBP,lowercase,have,NNS,lowercase,demonstrators,...,__start1__,__START1__,wildcard,__START1__,capitalized,Thousands,1.0,lowercase,of,O
281837,2,demonstr,have,march,VBN,lowercase,marched,VBP,lowercase,have,...,thousand,NNS,capitalized,Thousands,lowercase,of,1.0,lowercase,demonstrators,O
281838,3,have,march,through,IN,lowercase,through,VBN,lowercase,marched,...,of,IN,lowercase,of,lowercase,demonstrators,1.0,lowercase,have,O
281839,4,march,through,london,NNP,capitalized,London,IN,lowercase,through,...,demonstr,NNS,lowercase,demonstrators,lowercase,have,1.0,lowercase,marched,O
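The `error_bad_lines` argument used above was deprecated in pandas 1.3 and removed in pandas 2.0. On a recent pandas the closest equivalent should be `on_bad_lines="skip"`. A minimal sketch on a small in-memory CSV (the sample rows are made up for illustration and are not taken from `ner.csv`):

```python
import io

import pandas as pd

# Hypothetical three-column CSV; the second data row has too many fields.
raw = "word,pos,tag\nThousands,NNS,O\nof,IN,O,extra,fields\nLondon,NNP,B-geo\n"

# on_bad_lines="skip" drops malformed rows instead of raising an error
# (requires pandas >= 1.3).
sample = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(sample["word"].tolist())
```

Unlike `error_bad_lines=False`, this option skips the rows silently; `on_bad_lines="warn"` reports each skipped line.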


We want to keep only the original text (the sentence index, each word, and its tag), so we drop the engineered feature columns.

In [3]:
data=df.drop(['Unnamed: 0', 'lemma', 'next-lemma', 'next-next-lemma', 'next-next-pos',
       'next-next-shape', 'next-next-word', 'next-pos', 'next-shape',
       'next-word', 'prev-iob', 'prev-lemma', 'prev-pos',
       'prev-prev-iob', 'prev-prev-lemma', 'prev-prev-pos', 'prev-prev-shape',
       'prev-prev-word', 'prev-shape', 'prev-word',"pos","shape"],axis=1)
data.head()

Unnamed: 0,sentence_idx,word,tag
281835,1.0,Thousands,O
281836,1.0,of,O
281837,1.0,demonstrators,O
281838,1.0,have,O
281839,1.0,marched,O


### Preprocess data into a list of sentences

In [4]:
from text_preprocessing import group_sentences, tokenize, pad, split_sentences, split_text_sentences

Using TensorFlow backend.


Forming text sentences

In [5]:
input_sentences, label_sentences = group_sentences(data, "word"), group_sentences(data, "tag")
print("Sample sentence:", input_sentences[0])
print("Sample label sentence:", label_sentences[0])

Sample sentence: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .
Sample label sentence: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O
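The `group_sentences` helper comes from the local `text_preprocessing` module, which is not shown here. Judging by the output above, it joins the values of one column into a space-separated string per `sentence_idx`. A minimal sketch under that assumption (the name `group_sentences_sketch` and the toy frame are hypothetical):

```python
import pandas as pd


def group_sentences_sketch(frame, column):
    """Join the values of `column` into one space-separated string per sentence."""
    # sort=False keeps sentences in the order they first appear in the frame.
    grouped = frame.groupby("sentence_idx", sort=False)[column]
    return grouped.apply(lambda values: " ".join(values.astype(str))).tolist()


# Tiny made-up frame with the same three columns as `data`.
toy = pd.DataFrame({
    "sentence_idx": [1.0, 1.0, 1.0, 2.0, 2.0],
    "word": ["Thousands", "marched", ".", "London", "."],
    "tag": ["O", "O", "O", "B-geo", "O"],
})
toy_words = group_sentences_sketch(toy, "word")
toy_tags = group_sentences_sketch(toy, "tag")
print(toy_words)
print(toy_tags)
```

Grouping words and tags with the same function keeps both lists aligned token by token, which the later windowing step relies on.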


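Tokenizing the label sentences. The `+ 1` in the vocabulary size accounts for index 0, which `word_index` never assigns (Keras-style tokenizers reserve it for padding).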

In [6]:
tokenized_labels, tokenizer_y = tokenize(label_sentences, to_lower=False)
labels_vocab_size = len(tokenizer_y.word_index) + 1
print("Vocabulary size:", labels_vocab_size)

Vocabulary size: 18
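A vocabulary size of 18 is consistent with 17 distinct tags (`O` plus the `B-` and `I-` variants of the eight entity types in the evaluation report) and one reserved index. The `tokenize` helper itself is not shown; the sketch below only illustrates the indexing idea in plain Python, assuming tags are numbered from 1 in order of decreasing frequency (`build_label_index` is a hypothetical name):

```python
from collections import Counter


def build_label_index(label_sentences):
    """Map each tag to an integer id starting at 1; id 0 stays free for padding."""
    counts = Counter(tag for sentence in label_sentences for tag in sentence.split())
    # The most frequent tag gets id 1; ties keep first-seen order.
    return {tag: i for i, (tag, _) in enumerate(counts.most_common(), start=1)}


toy_labels = ["O O B-geo O", "B-per I-per O"]
label_index = build_label_index(toy_labels)
encoded = [[label_index[tag] for tag in sentence.split()] for sentence in toy_labels]
vocab_size = len(label_index) + 1  # +1 for the reserved id 0
print(label_index)
print(encoded)
print(vocab_size)
```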


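Splitting sentences into fixed-length windows. Each sentence becomes a series of overlapping windows of `WINDOW_SIZE` words, shifted by one word at a time.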

In [7]:
WINDOW_SIZE = 8

divided_input = split_text_sentences(input_sentences, WINDOW_SIZE)
divided_tokenized_labels = split_sentences(tokenized_labels, WINDOW_SIZE)

divided_input[:5]

array([['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through',
        'London', 'to'],
       ['of', 'demonstrators', 'have', 'marched', 'through', 'London',
        'to', 'protest'],
       ['demonstrators', 'have', 'marched', 'through', 'London', 'to',
        'protest', 'the'],
       ['have', 'marched', 'through', 'London', 'to', 'protest', 'the',
        'war'],
       ['marched', 'through', 'London', 'to', 'protest', 'the', 'war',
        'in']], dtype='<U64')
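The output above shows that consecutive windows overlap and advance by one word. The helpers `split_text_sentences` and `split_sentences` are not shown, so the following is only a sketch of that sliding-window idea; the function name, the `__PAD__` token and the padding of short sentences are assumptions:

```python
def sliding_windows(tokens, window_size, pad_token="__PAD__"):
    """Return every run of `window_size` consecutive tokens, moving one token at a time."""
    # Assumption: a sentence shorter than the window is right-padded to one full window.
    if len(tokens) < window_size:
        tokens = tokens + [pad_token] * (window_size - len(tokens))
    return [tokens[i:i + window_size] for i in range(len(tokens) - window_size + 1)]


sentence = (
    "Thousands of demonstrators have marched through London to protest "
    "the war in Iraq and demand the withdrawal of British troops from "
    "that country ."
).split()
windows = sliding_windows(sentence, 8)
print(windows[0])
print(len(windows))
```

A sentence of `n >= window_size` tokens yields `n - window_size + 1` windows, so the 24-token sample sentence produces 17 training examples.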

Preparing training and test sets

In [8]:
from sklearn.model_selection import train_test_split
y = divided_tokenized_labels.reshape(*divided_tokenized_labels.shape, 1)


X_train, X_test, y_train, y_test = train_test_split(divided_input, y, test_size=0.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2)
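Because the second split is applied to the 80% left over from the first, the final proportions are not 60/20/20. A short worked calculation of the resulting shares (the variable names are illustrative):

```python
test_fraction = 0.2            # first split: test_size=0.2
valid_fraction_of_rest = 0.2   # second split: test_size=0.2 of the remainder

test_share = test_fraction
valid_share = (1 - test_fraction) * valid_fraction_of_rest
train_share = (1 - test_fraction) * (1 - valid_fraction_of_rest)
print(round(train_share, 2), round(valid_share, 2), round(test_share, 2))
```

The data therefore ends up roughly 64% training, 16% validation and 20% test.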

### Modelling

In [9]:
from models import embed_gru_model,embed_bi_gru_model,elmo_bi_gru_model
from evaluation import Evaluator
eval = Evaluator(tokenizer_y)

#### Bi-directional GRU model with ELMo embedding
![RNN](images/elmo-word-embedding.png)

Truncating each dataset to a multiple of the batch size, because the model is built with a fixed `BATCH_SIZE`.

In [10]:
BATCH_SIZE=512

train_length = len(X_train)-(len(X_train)%BATCH_SIZE)
valid_length = len(X_valid)-(len(X_valid)%BATCH_SIZE)
test_length = len(X_test)-(len(X_test)%BATCH_SIZE)

print(train_length)
print(test_length)

334848
104448


##### Build model

In [11]:
elmo_bi_gru_model = elmo_bi_gru_model(WINDOW_SIZE, BATCH_SIZE, labels_vocab_size)
elmo_bi_gru_model.summary()






_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 8)                 0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 8, 1024)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 8, 256)            885504    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 8, 18)             4626      
Total params: 890,130
Trainable params: 890,130
Non-trainable params: 0
_________________________________________________________________
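The parameter counts in the summary can be checked by hand. The sketch below assumes 128 GRU units per direction (inferred from the 256-wide bidirectional output) and the classic GRU formulation with one bias vector per gate; the `models` module itself is not shown, so these are assumptions rather than the author's stated configuration:

```python
def gru_params(input_dim, units):
    # Three gates (update, reset, candidate), each with an input kernel,
    # a recurrent kernel and one bias vector.
    return 3 * (units * (input_dim + units) + units)


def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output.
    return input_dim * outputs + outputs


elmo_dim, units, n_labels = 1024, 128, 18
bi_gru = 2 * gru_params(elmo_dim, units)    # forward + backward GRU
dense = dense_params(2 * units, n_labels)   # applied at every time step
print(bi_gru, dense, bi_gru + dense)        # 885504 4626 890130
```

The ELMo `Lambda` layer contributes no parameters to the total, so only the GRU and the output layer are trained.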


##### Train model

In [12]:
elmo_bi_gru_model.fit(
    X_train[:train_length], y_train[:train_length],
    validation_data=(X_valid[:valid_length], y_valid[:valid_length]),
    batch_size=BATCH_SIZE, epochs=1,
)  # validation_split dropped: Keras ignores it when validation_data is given

Train on 334848 samples, validate on 83456 samples
Epoch 1/1

KeyboardInterrupt: 

##### Evaluation

In [13]:
print(eval.evaluate_metrics(y_test[:test_length], elmo_bi_gru_model.predict(X_test[:test_length], batch_size=BATCH_SIZE)))

           precision    recall  f1-score   support

      eve       0.36      0.21      0.26       291
      gpe       0.95      0.93      0.94     11188
      nat       0.46      0.25      0.32       187
      per       0.76      0.79      0.77     13350
      art       0.50      0.05      0.09       370
      tim       0.87      0.81      0.84     15976
      org       0.69      0.67      0.68     16608
      geo       0.83      0.89      0.86     29430

micro avg       0.81      0.82      0.81     87400
macro avg       0.81      0.82      0.81     87400
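The `Evaluator` class comes from the local `evaluation` module and is not shown. The report is broken down by entity type rather than by `B-`/`I-` tag, which suggests scoring at the entity level: a predicted entity counts as correct only if its type and both boundaries match. A minimal sketch of that idea, assuming BIO tags (all names are hypothetical, and stray `I-` tags without a preceding `B-` are simply ignored):

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                entities.add((current, start, i - 1))
            if tag.startswith("B-"):
                start, current = i, tag[2:]
            else:
                start, current = None, None
    return entities


def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 for one tag sequence."""
    true_e, pred_e = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(true_e & pred_e)
    precision = correct / len(pred_e) if pred_e else 0.0
    recall = correct / len(true_e) if true_e else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-geo", "O", "B-per", "I-per", "O"]
pred = ["B-geo", "O", "B-per", "O", "O"]
scores = entity_prf(gold, pred)
print(scores)
```

In this toy case the `geo` entity matches exactly, while the `per` entity is predicted with the wrong right boundary and therefore counts as both a false positive and a false negative.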

