# Named Entity Recognition

## Annotated Corpus for Named Entity Recognition (from Kaggle)
In this notebook we build a basic NER model on this corpus. Every sentence is padded to the length of the longest sentence in the corpus.

In [1]:
import numpy as np
import pandas as pd

### Importing data
Load the data. The file is already well preprocessed and contains many additional feature columns.

In [2]:
# Note: `error_bad_lines` was removed in pandas 2.0; newer versions use on_bad_lines="skip" instead
df = pd.read_csv("data/ner.csv", encoding="ISO-8859-1", error_bad_lines=False)
df = df.iloc[281835:]
df.head()

b'Skipping line 281837: expected 25 fields, saw 34\n'


| | Unnamed: 0 | lemma | next-lemma | next-next-lemma | next-next-pos | next-next-shape | next-next-word | next-pos | next-shape | next-word | ... | prev-prev-lemma | prev-prev-pos | prev-prev-shape | prev-prev-word | prev-shape | prev-word | sentence_idx | shape | word | tag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 281835 | 0 | thousand | of | demonstr | NNS | lowercase | demonstrators | IN | lowercase | of | ... | `__start2__` | `__START2__` | wildcard | `__START2__` | wildcard | `__START1__` | 1.0 | capitalized | Thousands | O |
| 281836 | 1 | of | demonstr | have | VBP | lowercase | have | NNS | lowercase | demonstrators | ... | `__start1__` | `__START1__` | wildcard | `__START1__` | capitalized | Thousands | 1.0 | lowercase | of | O |
| 281837 | 2 | demonstr | have | march | VBN | lowercase | marched | VBP | lowercase | have | ... | thousand | NNS | capitalized | Thousands | lowercase | of | 1.0 | lowercase | demonstrators | O |
| 281838 | 3 | have | march | through | IN | lowercase | through | VBN | lowercase | marched | ... | of | IN | lowercase | of | lowercase | demonstrators | 1.0 | lowercase | have | O |
| 281839 | 4 | march | through | london | NNP | capitalized | London | IN | lowercase | through | ... | demonstr | NNS | lowercase | demonstrators | lowercase | have | 1.0 | lowercase | marched | O |


We keep only the sentence index, the word, and its tag, and drop the derived feature columns.

In [3]:
data=df.drop(['Unnamed: 0', 'lemma', 'next-lemma', 'next-next-lemma', 'next-next-pos',
       'next-next-shape', 'next-next-word', 'next-pos', 'next-shape',
       'next-word', 'prev-iob', 'prev-lemma', 'prev-pos',
       'prev-prev-iob', 'prev-prev-lemma', 'prev-prev-pos', 'prev-prev-shape',
       'prev-prev-word', 'prev-shape', 'prev-word',"pos","shape"],axis=1)
data.head()

| | sentence_idx | word | tag |
|---|---|---|---|
| 281835 | 1.0 | Thousands | O |
| 281836 | 1.0 | of | O |
| 281837 | 1.0 | demonstrators | O |
| 281838 | 1.0 | have | O |
| 281839 | 1.0 | marched | O |


### Preprocess data into a list of sentences

In [4]:
from text_preprocessing import group_sentences, tokenize, pad, split_sentences

Using TensorFlow backend.


Grouping words and tags into sentences

In [5]:
input_sentences, label_sentences = group_sentences(data, "word"), group_sentences(data, "tag")
print("Sample sentence:", input_sentences[0])
print("Sample label sentence:", label_sentences[0])

Sample sentence: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .
Sample label sentence: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O
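The `group_sentences` helper lives in the notebook's local `text_preprocessing` module and its source is not shown here. As a rough sketch of what such a grouping step might look like, the hypothetical `group_tokens` function below joins the values of one column for all rows that share a `sentence_idx`, preserving row order. It works on plain dictionaries rather than a DataFrame, and the second toy sentence is invented for illustration.

```python
def group_tokens(rows, column):
    """Join the values of `column` for rows sharing a sentence_idx, keeping row order."""
    sentences = {}  # dicts preserve insertion order (Python 3.7+)
    for row in rows:
        sentences.setdefault(row["sentence_idx"], []).append(str(row[column]))
    return [" ".join(tokens) for tokens in sentences.values()]


# Toy rows: the first sentence mirrors the table above, the second is made up.
rows = [
    {"sentence_idx": 1.0, "word": "Thousands", "tag": "O"},
    {"sentence_idx": 1.0, "word": "of", "tag": "O"},
    {"sentence_idx": 1.0, "word": "demonstrators", "tag": "O"},
    {"sentence_idx": 2.0, "word": "London", "tag": "B-geo"},
]

words = group_tokens(rows, "word")
tags = group_tokens(rows, "tag")
print(words)
print(tags)
```

Each word sentence and its tag sentence contain the same number of whitespace-separated tokens, which is what allows them to be aligned position by position later on.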


Tokenizing sentences

In [6]:
tokenized_input, tokenizer_X = tokenize(input_sentences)
tokenized_labels, tokenizer_y = tokenize(label_sentences, to_lower=False)

input_vocab_size = len(tokenizer_X.word_index) + 1
labels_vocab_size = len(tokenizer_y.word_index) + 1

print("X Vocabulary size:", input_vocab_size)
print("y Vocabulary size:", labels_vocab_size)

X Vocabulary size: 27420
y Vocabulary size: 18
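The `+ 1` in the vocabulary sizes reflects a common convention: token ids start at 1, and id 0 is reserved for padding. The `tokenize` helper itself is not shown, so the following is only a minimal sketch of that convention in plain Python. The names `build_word_index` and `encode` are hypothetical, and ordering ids by frequency is an assumption.

```python
from collections import Counter


def build_word_index(sentences, to_lower=True):
    """Map each token to an integer id, most frequent first, starting at 1."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split() if to_lower else sentence.split()
        counts.update(tokens)
    # sorted() is stable, so tokens with equal counts keep first-seen order
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    return {token: i for i, (token, _) in enumerate(ordered, start=1)}


def encode(sentences, word_index, to_lower=True):
    """Replace each token with its integer id."""
    encoded = []
    for sentence in sentences:
        tokens = sentence.lower().split() if to_lower else sentence.split()
        encoded.append([word_index[token] for token in tokens])
    return encoded


labels = ["O O B-geo O", "B-gpe O"]  # toy label sentences
label_index = build_word_index(labels, to_lower=False)
encoded_labels = encode(labels, label_index, to_lower=False)
vocab_size = len(label_index) + 1  # +1 for the reserved padding id 0

print(label_index)
print(encoded_labels)
print(vocab_size)
```

Labels are tokenized with `to_lower=False` so that tags such as `B-geo` keep their original casing when they are mapped back from ids.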


Padding sentences to the max length

In [7]:
padded_tokenized_input = pad(tokenized_input)
padded_tokenized_labels = pad(tokenized_labels)

max_sentence_size = padded_tokenized_input.shape[1]
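Padding turns the ragged list of id sequences into a rectangular array whose second dimension is the length of the longest sentence. The `pad` helper is not shown, so this is a minimal sketch under two assumptions: the padding id is 0 and padding is appended at the end of each sequence. The helper may just as well pad at the start. `pad_to_longest` is a hypothetical name.

```python
def pad_to_longest(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


padded = pad_to_longest([[1, 1, 2, 1], [3, 1]])
print(padded)
```

Padding everything to the longest sentence is simple, but it means short sentences are mostly padding, which wastes computation compared with bucketing sentences by length.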

In [8]:
y = padded_tokenized_labels.reshape(*padded_tokenized_labels.shape, 1)

Preparing training and test sets

In [9]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(padded_tokenized_input, y, test_size=0.2)

### Modelling

In [10]:
from models import embed_gru_model, embed_bi_gru_model
from evaluation import Evaluator
# Note: the name `eval` shadows Python's built-in eval() for the rest of the notebook
eval = Evaluator(tokenizer_y)

#### Unidirectional GRU model
![RNN](images/embedding.png)

##### Build model

In [11]:
embed_gru_model = embed_gru_model(max_sentence_size, input_vocab_size, labels_vocab_size)
embed_gru_model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 81, 128)           3509760   
_________________________________________________________________
gru_1 (GRU)                  (None, 81, 128)           98688     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 81, 18)            2322      
_________________________________________________________________
activation_1 (Activation)    (None, 81, 18)            0         
Total params: 3,610,770
Trainable params: 3,610,770
Non-trainable params: 0
_________________________________________________________________
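The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below assumes the classic GRU formulation with one bias vector per gate. GRU variants with separate input and recurrent biases have slightly more parameters. The function names are hypothetical; the sizes (vocabulary 27420, embedding and GRU width 128, 18 labels) are read off the summary and the vocabulary printout.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim


def gru_params(input_dim, units):
    # three blocks (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel and a bias vector
    return 3 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    # weight matrix plus bias, shared across all time steps
    return input_dim * units + units


embedding = embedding_params(27420, 128)
gru = gru_params(128, 128)
dense = dense_params(128, 18)
total = embedding + gru + dense

print(embedding, gru, dense, total)  # 3509760 98688 2322 3610770
```

Almost all parameters sit in the embedding table, so the vocabulary size dominates the model size far more than the recurrent layer does.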


##### Train model

In [12]:
embed_gru_model.fit(X_train, y_train, batch_size=32, epochs=5, validation_split=0.2)

Train on 22512 samples, validate on 5629 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7ff6b0155e80>
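The sample counts in the training log are consistent with `validation_split=0.2` holding out the tail of the training arrays. The following is a small sketch of that arithmetic. `holdout_sizes` is a hypothetical helper, and the truncating `int(...)` is an assumption about how the split point is computed.

```python
def holdout_sizes(n_samples, validation_split):
    """Return (n_train, n_validation) when the last fraction is held out."""
    split_at = int(n_samples * (1.0 - validation_split))  # truncate toward zero
    return split_at, n_samples - split_at


n_train, n_val = holdout_sizes(22512 + 5629, 0.2)
print(n_train, n_val)  # 22512 5629
```

Because the validation samples are carved out of `X_train`, the test set produced by `train_test_split` stays untouched until the final evaluation.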

##### Evaluation

In [13]:
print(eval.evaluate_metrics(y_test, embed_gru_model.predict(X_test)))

           precision    recall  f1-score   support

      eve       0.42      0.11      0.17        46
      org       0.45      0.47      0.46      2930
      tim       0.79      0.77      0.78      3007
      per       0.68      0.70      0.69      2400
      nat       0.67      0.26      0.37        31
      art       0.00      0.00      0.00        49
      gpe       0.94      0.94      0.94      2337
      geo       0.78      0.83      0.80      5485

micro avg       0.73      0.74      0.73     16285
macro avg       0.72      0.74      0.73     16285
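The report lists entity types without `B-`/`I-` prefixes, which suggests that scores are computed per entity span rather than per token. The `Evaluator` source is not shown, so the following is only a sketch of how entity-level scoring typically works on IOB tags: a predicted entity counts as correct only if its type and both boundaries match exactly. `extract_entities` and `precision_recall_f1` are hypothetical names.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from an IOB tag sequence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == ent_type
        if ent_type is not None and not continues:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if prefix == "B" or (prefix == "I" and ent_type is None):
            start, ent_type = i, label
    return entities


def precision_recall_f1(true_tags, pred_tags):
    """Score predicted spans against gold spans using exact matching."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["O", "B-geo", "I-geo", "O", "B-per"]
pred = ["O", "B-geo", "O", "O", "B-per"]  # the geo entity is cut short
print(extract_entities(gold))
print(precision_recall_f1(gold, pred))
```

In the toy example the truncated `geo` span earns no credit even though its first token is tagged correctly, which is why entity-level scores are usually lower than token-level accuracy.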



#### Bi-directional GRU model
![RNN](images/bidirectional.png)

##### Build model

In [14]:
embed_bi_gru_model = embed_bi_gru_model(max_sentence_size, input_vocab_size, labels_vocab_size)
embed_bi_gru_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 81)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 81, 128)           3509760   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 81, 256)           197376    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 81, 18)            4626      
Total params: 3,711,762
Trainable params: 3,711,762
Non-trainable params: 0
_________________________________________________________________
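The bidirectional summary follows from running two GRUs of width 128, one per direction, and concatenating their outputs into 256 features per time step. The sketch below reproduces the counts under the same assumption of one bias vector per GRU gate. The function names are hypothetical.

```python
def gru_params(input_dim, units):
    # update gate, reset gate and candidate state: kernel + recurrent kernel + bias each
    return 3 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


embedding = 27420 * 128                      # same embedding table as before
bidirectional = 2 * gru_params(128, 128)     # forward and backward GRU
dense = dense_params(2 * 128, 18)            # input is the 256-wide concatenation
total = embedding + bidirectional + dense

print(bidirectional, dense, total)  # 197376 4626 3711762
```

The second direction adds roughly 100,000 parameters, under 3% of the total, so the bidirectional model is only marginally larger than the unidirectional one.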


##### Train model

In [15]:
embed_bi_gru_model.fit(X_train, y_train, batch_size=32, epochs=5, validation_split=0.2)

Train on 22512 samples, validate on 5629 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7ff69c365b38>

##### Evaluation

In [16]:
print(eval.evaluate_metrics(y_test, embed_bi_gru_model.predict(X_test)))

           precision    recall  f1-score   support

      eve       0.45      0.11      0.18        46
      org       0.57      0.57      0.57      2930
      tim       0.86      0.84      0.85      3007
      per       0.71      0.69      0.70      2400
      nat       0.73      0.26      0.38        31
      art       0.00      0.00      0.00        49
      gpe       0.95      0.93      0.94      2337
      geo       0.84      0.84      0.84      5485

micro avg       0.79      0.78      0.78     16285
macro avg       0.79      0.78      0.78     16285
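The f1-score column is the harmonic mean of precision and recall, so it is pulled toward the lower of the two. That explains why `eve` and `nat` score poorly despite moderate precision: their recall is low. A small sketch that recomputes two rows of the report above from their rounded precision and recall (`f1_score` is a hypothetical helper):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.45, 0.11), 2))  # eve row
print(round(f1_score(0.73, 0.26), 2))  # nat row
print(f1_score(0.0, 0.0))              # art row: no correct predictions
```

The rare classes (`eve`, `nat`, `art`) have fewer than 50 test entities each, so their scores are noisy and say little about either architecture.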

