# Named Entity Recognition

## Annotated Corpus for Named Entity Recognition (From Kaggle)
This notebook builds basic NER models on the specified corpus. Each sentence is split into shorter, fixed-length sequences of words, which are then fed to the models.

In [1]:
import numpy as np
import pandas as pd

### Importing data
Load the data. The file is already heavily preprocessed and includes many derived feature columns.

In [2]:
# error_bad_lines=False skips malformed rows; pandas 1.3+ spells this on_bad_lines="skip"
df = pd.read_csv("data/ner.csv", encoding="ISO-8859-1", error_bad_lines=False)
df = df.iloc[281835:]
df.head()

b'Skipping line 281837: expected 25 fields, saw 34\n'


| Unnamed: 0.1 | Unnamed: 0 | lemma | next-lemma | next-next-lemma | next-next-pos | next-next-shape | next-next-word | next-pos | next-shape | next-word | ... | prev-prev-lemma | prev-prev-pos | prev-prev-shape | prev-prev-word | prev-shape | prev-word | sentence_idx | shape | word | tag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 281835 | 0 | thousand | of | demonstr | NNS | lowercase | demonstrators | IN | lowercase | of | ... | `__start2__` | `__START2__` | wildcard | `__START2__` | wildcard | `__START1__` | 1.0 | capitalized | Thousands | O |
| 281836 | 1 | of | demonstr | have | VBP | lowercase | have | NNS | lowercase | demonstrators | ... | `__start1__` | `__START1__` | wildcard | `__START1__` | capitalized | Thousands | 1.0 | lowercase | of | O |
| 281837 | 2 | demonstr | have | march | VBN | lowercase | marched | VBP | lowercase | have | ... | thousand | NNS | capitalized | Thousands | lowercase | of | 1.0 | lowercase | demonstrators | O |
| 281838 | 3 | have | march | through | IN | lowercase | through | VBN | lowercase | marched | ... | of | IN | lowercase | of | lowercase | demonstrators | 1.0 | lowercase | have | O |
| 281839 | 4 | march | through | london | NNP | capitalized | London | IN | lowercase | through | ... | demonstr | NNS | lowercase | demonstrators | lowercase | have | 1.0 | lowercase | marched | O |


We drop the engineered feature columns and keep only the raw words, their sentence index, and their tags.

In [3]:
data = df.drop(['Unnamed: 0', 'lemma', 'next-lemma', 'next-next-lemma', 'next-next-pos',
                'next-next-shape', 'next-next-word', 'next-pos', 'next-shape',
                'next-word', 'prev-iob', 'prev-lemma', 'prev-pos',
                'prev-prev-iob', 'prev-prev-lemma', 'prev-prev-pos', 'prev-prev-shape',
                'prev-prev-word', 'prev-shape', 'prev-word', 'pos', 'shape'], axis=1)
data.head()

| Unnamed: 0 | sentence_idx | word | tag |
|---|---|---|---|
| 281835 | 1.0 | Thousands | O |
| 281836 | 1.0 | of | O |
| 281837 | 1.0 | demonstrators | O |
| 281838 | 1.0 | have | O |
| 281839 | 1.0 | marched | O |


### Preprocess data into a list of sentences

In [4]:
from text_preprocessing import group_sentences, tokenize, pad, split_sentences

Using TensorFlow backend.


Forming text sentences

In [5]:
input_sentences, label_sentences = group_sentences(data, "word"), group_sentences(data, "tag")
print("Sample sentence:", input_sentences[0])
print("Sample label sentence:", label_sentences[0])

Sample sentence: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .
Sample label sentence: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O
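`group_sentences` comes from the local `text_preprocessing` module, whose source is not shown here. Judging by the sample output, it joins the values of one column, sentence by sentence, into space-separated strings. A minimal sketch under that assumption follows; the name `group_pairs` and its signature are hypothetical, the rows are toy data, and it works on plain `(sentence_idx, value)` pairs rather than on the DataFrame.

```python
from itertools import groupby
from operator import itemgetter


def group_pairs(pairs):
    """Join consecutive values that share a sentence index into one string.

    `pairs` is an iterable of (sentence_idx, value) tuples in document order.
    Hypothetical stand-in for the notebook's group_sentences helper.
    """
    return [
        " ".join(str(value) for _, value in rows)
        for _, rows in groupby(pairs, key=itemgetter(0))
    ]


# Toy rows: (sentence_idx, word, tag)
rows = [
    (1.0, "Thousands", "O"),
    (1.0, "of", "O"),
    (1.0, "demonstrators", "O"),
    (2.0, "London", "B-geo"),
    (2.0, "protests", "O"),
]
words = group_pairs((idx, word) for idx, word, _ in rows)
tags = group_pairs((idx, tag) for idx, _, tag in rows)
print(words)
print(tags)
```

With the DataFrame from the previous cells, an equivalent call would be `group_pairs(zip(data["sentence_idx"], data["word"]))`.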


Tokenizing sentences

In [6]:
tokenized_input, tokenizer_X = tokenize(input_sentences)
tokenized_labels, tokenizer_y = tokenize(label_sentences, to_lower=False)

input_vocab_size = len(tokenizer_X.word_index) + 1
labels_vocab_size = len(tokenizer_y.word_index) + 1

print("Input vocabulary size:", input_vocab_size)
print("Label vocabulary size:", labels_vocab_size)

Input vocabulary size: 27420
Label vocabulary size: 18
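The `+ 1` in both vocabulary sizes fits the convention used by the Keras `Tokenizer` (which `tokenizer_X` and `tokenizer_y` presumably are): word indices start at 1, so index 0 stays free for padding. The `tokenize` helper itself is not shown, so the following is only a plain-Python sketch of that indexing scheme; `build_word_index` and `encode` are hypothetical names and the label strings are toy data.

```python
from collections import Counter


def build_word_index(sentences, to_lower=True):
    """Map each word to an integer ID starting at 1, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower() if to_lower else sentence
        counts.update(text.split())
    # Index 0 is deliberately left unused so it can serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(sentence, word_index, to_lower=True):
    """Replace every known word in a sentence with its integer ID."""
    text = sentence.lower() if to_lower else sentence
    return [word_index[w] for w in text.split() if w in word_index]


labels = ["O O B-geo O", "B-geo O O B-gpe"]
label_index = build_word_index(labels, to_lower=False)
label_vocab_size = len(label_index) + 1  # +1 accounts for the padding ID 0

print(label_index)
print(encode("O B-gpe", label_index, to_lower=False), label_vocab_size)
```

Passing `to_lower=False` for the labels keeps tags such as `B-geo` intact instead of folding them to lowercase.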


Splitting sentences into fixed-length windows

In [7]:
WINDOW_SIZE = 8

divided_tokenized_input = split_sentences(tokenized_input, WINDOW_SIZE)
divided_tokenized_labels = split_sentences(tokenized_labels, WINDOW_SIZE)
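`split_sentences` is another helper from `text_preprocessing` whose source is not shown. One plausible reading, assumed here, is that each token sequence is cut into consecutive chunks of `WINDOW_SIZE` IDs and the last chunk is right-padded with 0. The real helper may pad or overlap differently, and it evidently returns a NumPy array (the next cell calls `.reshape` on its result) rather than the nested lists used in this sketch. The name `split_into_windows` is hypothetical.

```python
def split_into_windows(sequences, window_size, pad_value=0):
    """Cut each sequence into consecutive fixed-length windows, padding the last."""
    windows = []
    for seq in sequences:
        for start in range(0, len(seq), window_size):
            chunk = list(seq[start:start + window_size])
            chunk += [pad_value] * (window_size - len(chunk))
            windows.append(chunk)
    return windows


# Toy token IDs: one 10-token sentence and one 3-token sentence
sequences = [[5, 3, 9, 1, 7, 2, 8, 4, 6, 10], [11, 12, 13]]
windows = split_into_windows(sequences, window_size=8)
for w in windows:
    print(w)
```

Applying the same function to inputs and labels keeps the two aligned position by position, as long as every sentence has exactly one tag per token.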

Preparing training and test sets

In [8]:
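from sklearn.model_selection import train_test_split

# Add a trailing axis so the labels have shape (n_windows, WINDOW_SIZE, 1)
y = divided_tokenized_labels.reshape(*divided_tokenized_labels.shape, 1)

# Hold out 20% of the windows for testing
X_train, X_test, y_train, y_test = train_test_split(divided_tokenized_input, y, test_size=0.2)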
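The `reshape` call appends a trailing axis of length 1, turning labels of shape `(n_windows, WINDOW_SIZE)` into `(n_windows, WINDOW_SIZE, 1)`. Older Keras versions expect that layout for integer targets when a sparse categorical loss is applied to per-timestep outputs; the loss used in `models.py` is not shown, so that motivation is an assumption. A small NumPy illustration with made-up label IDs:

```python
import numpy as np

WINDOW_SIZE = 8
# Made-up label IDs for two windows; 0 stands for padding.
labels = np.array([
    [1, 1, 2, 1, 1, 3, 1, 1],
    [1, 4, 1, 0, 0, 0, 0, 0],
])

y = labels.reshape(*labels.shape, 1)  # same result as labels[..., np.newaxis]

print(labels.shape, "->", y.shape)
```

The values are untouched; only the shape changes, so `y[..., 0]` recovers the original array.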

### Modelling

In [9]:
from models import embed_gru_model,embed_bi_gru_model,encdec_embed_gru_model
from evaluation import Evaluator
eval = Evaluator(tokenizer_y)

#### Unidirectional GRU model
![RNN](images/embedding.png)

##### Build model

In [10]:
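# Note: this rebinds the imported factory function's name to the model instance,
# so re-running the cell without re-importing will not build a fresh model.
embed_gru_model = embed_gru_model(WINDOW_SIZE, input_vocab_size, labels_vocab_size)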
embed_gru_model.summary()





_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 8, 128)            3509760   
_________________________________________________________________
gru_1 (GRU)                  (None, 8, 128)            98688     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 8, 18)             2322      
_________________________________________________________________
activation_1 (Activation)    (None, 8, 18)             0         
Total params: 3,610,770
Trainable params: 3,610,770
Non-trainable params: 0
_________________________________________________________________
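The parameter counts in the summary can be reproduced by hand, which is a quick check that the layer sizes are what they appear to be. The sketch below assumes a classic GRU cell with a single bias vector per gate (three gates, each with an input kernel, a recurrent kernel, and a bias), which matches the 98,688 reported here; the variable names are illustrative.

```python
vocab_size = 27420   # input_vocab_size printed earlier
embed_dim = 128      # embedding output width from the summary
units = 128          # GRU output width from the summary
n_labels = 18        # labels_vocab_size printed earlier

embedding_params = vocab_size * embed_dim
# 3 gates x (input kernel + recurrent kernel + bias)
gru_params = 3 * (embed_dim * units + units * units + units)
# Dense layer applied at every timestep: weights + biases
dense_params = units * n_labels + n_labels

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.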


##### Train model

In [11]:
embed_gru_model.fit(X_train, y_train, batch_size=1024, epochs=15, validation_split=0.2)

Train on 335349 samples, validate on 83838 samples
Epoch 1/15
...
Epoch 15/15


<keras.callbacks.History at 0x7f078472e208>
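The sample counts in the training log follow from `validation_split=0.2`: Keras holds out the last 20% of the arrays passed to `fit` (taken before any shuffling) and trains on the rest. The arithmetic can be checked directly; the helper name below is hypothetical, and the rounding rule (truncate the training share) is an assumption that happens to reproduce the logged numbers.

```python
def keras_style_split(n_samples, validation_split):
    """Return (n_train, n_val) when the last fraction is held out for validation."""
    n_train = int(n_samples * (1.0 - validation_split))
    return n_train, n_samples - n_train


n_fit = 335349 + 83838  # rows in X_train, taken from the training log
n_train, n_val = keras_style_split(n_fit, 0.2)
print(n_fit, n_train, n_val)
```

Because the held-out rows are simply the tail of `X_train`, the split is only representative if the rows are already in random order, which `train_test_split` takes care of by shuffling by default.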

##### Evaluation

In [12]:
print(eval.evaluate_metrics(y_test, embed_gru_model.predict(X_test)))

           precision    recall  f1-score   support

      nat       0.79      0.56      0.65       190
      per       0.84      0.87      0.85     13462
      gpe       0.96      0.96      0.96     11237
      tim       0.89      0.86      0.88     16232
      eve       0.55      0.48      0.51       273
      org       0.65      0.65      0.65     16531
      geo       0.86      0.93      0.89     29373
      art       0.66      0.49      0.56       318

micro avg       0.84      0.85      0.84     87616
macro avg       0.83      0.85      0.84     87616
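The `Evaluator` class is not shown, but the report lists entity types (`geo`, `per`, ...) rather than individual `B-`/`I-` tags, which suggests entity-level scoring in the style of the `seqeval` package. Assuming that, a predicted entity counts as correct only when its type and both boundaries match the gold annotation. A minimal sketch of that idea for a single tag sequence (function names are hypothetical, tags are toy data):

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans encoded by a list of IOB tags."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    return entities


def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 for one pair of tag sequences."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


true_tags = ["O", "B-geo", "O", "B-per", "I-per", "O"]
pred_tags = ["O", "B-geo", "O", "B-per", "O", "O"]
print(entity_prf(true_tags, pred_tags))
```

In this example the person entity is only partially recognized, so it counts as both a false positive and a false negative. This strictness is why entity-level scores tend to be lower than per-token accuracy.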



#### Bi-directional GRU model
![RNN](images/bidirectional.png)

##### Build model

In [13]:
embed_bi_gru_model = embed_bi_gru_model(WINDOW_SIZE, input_vocab_size, labels_vocab_size)
embed_bi_gru_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 8)                 0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 8, 128)            3509760   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 8, 256)            197376    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 8, 18)             4626      
Total params: 3,711,762
Trainable params: 3,711,762
Non-trainable params: 0
_________________________________________________________________
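Wrapping the GRU in a bidirectional layer runs one GRU forward and one backward over each window and concatenates their outputs, which explains both the 256-wide output and the 197,376 parameters. A hand check of the numbers in the summary, assuming a classic GRU cell with one bias vector per gate (variable names are illustrative):

```python
embed_dim = 128   # embedding output width from the summary
units = 128       # width of each direction's GRU
n_labels = 18     # labels_vocab_size printed earlier

# 3 gates x (input kernel + recurrent kernel + bias) for one direction
one_direction = 3 * (embed_dim * units + units * units + units)
bidirectional_params = 2 * one_direction   # forward + backward GRU
output_width = 2 * units                   # concatenated hidden states
dense_params = output_width * n_labels + n_labels

print(bidirectional_params, output_width, dense_params)
```

The per-timestep dense layer also grows, because it now reads a 256-dimensional vector instead of a 128-dimensional one.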


##### Train model

In [14]:
embed_bi_gru_model.fit(X_train, y_train, batch_size=1024, epochs=15, validation_split=0.2)

Train on 335349 samples, validate on 83838 samples
Epoch 1/15
...
Epoch 15/15


<keras.callbacks.History at 0x7f077074d668>

##### Evaluation

In [15]:
print(eval.evaluate_metrics(y_test, embed_bi_gru_model.predict(X_test)))

           precision    recall  f1-score   support

      nat       0.83      0.72      0.77       190
      per       0.92      0.91      0.91     13462
      gpe       0.98      0.96      0.97     11237
      tim       0.93      0.92      0.93     16232
      eve       0.70      0.74      0.72       273
      org       0.84      0.83      0.83     16531
      geo       0.93      0.95      0.94     29373
      art       0.82      0.70      0.76       318

micro avg       0.92      0.91      0.92     87616
macro avg       0.92      0.91      0.92     87616



#### Bi-directional GRU encoder-decoder model
![RNN](images/encoder-decoder.jpg)

##### Build model

In [16]:
encdec_embed_gru_model = encdec_embed_gru_model(WINDOW_SIZE, input_vocab_size, labels_vocab_size)
encdec_embed_gru_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 8)            0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 8, 64)        1754880     input_2[0][0]                    
__________________________________________________________________________________________________
gru_3 (GRU)                     [(None, 128), (None, 74112       embedding_3[0][0]                
__________________________________________________________________________________________________
encoder_dense (Dense)           (None, 128)          16512       gru_3[0][0]                      
__________________________________________________________________________________________________
dropout_1 

##### Train model

In [17]:
encdec_embed_gru_model.fit(X_train, y_train, batch_size=1024, epochs=30, validation_split=0.2)

Train on 335349 samples, validate on 83838 samples
Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x7f074857bb00>

##### Evaluation

In [18]:
print(eval.evaluate_metrics(y_test, encdec_embed_gru_model.predict(X_test)))

           precision    recall  f1-score   support

      nat       0.86      0.66      0.75       190
      per       0.88      0.89      0.88     13462
      gpe       0.97      0.96      0.97     11237
      tim       0.91      0.91      0.91     16232
      eve       0.62      0.58      0.60       273
      org       0.80      0.79      0.80     16531
      geo       0.91      0.94      0.92     29373
      art       0.78      0.58      0.66       318

micro avg       0.89      0.90      0.89     87616
macro avg       0.89      0.90      0.89     87616

