# Recognize Named Entities


In this project, I use a recurrent neural network to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing systems: it extracts entities such as persons, organizations, and locations from text. Here I experiment with recognizing named entities in tweets.



The solution is based on neural networks, specifically on Bi-Directional Long Short-Term Memory networks (Bi-LSTMs).



### Load the Twitter Named Entity Recognition corpus





In [0]:
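def read_data(file_path):
    """Reads 'token tag' lines; blank lines separate tweets."""
    tokens = []
    tags = []

    tweet_tokens = []
    tweet_tags = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                if tweet_tokens:
                    tokens.append(tweet_tokens)
                    tags.append(tweet_tags)
                tweet_tokens = []
                tweet_tags = []
            else:
                token, tag = line.split()
                # Replace user mentions and URLs with placeholder tokens.
                if token.startswith('@'):
                    token = '<USR>'
                elif token.startswith(('http://', 'https://')):
                    token = '<URL>'

                tweet_tokens.append(token)
                tweet_tags.append(tag)

    # Keep the last tweet if the file does not end with a blank line.
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)

    return tokens, tags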

In [0]:
train_tokens, train_tags = read_data('/content/data/train.txt')  
validation_tokens, validation_tags = read_data('/content/data/validation.txt') 
test_tokens, test_tags = read_data('/content/data/test.txt') 

In [0]:
for i in range(3):
    for token, tag in zip(train_tokens[i], train_tags[i]):
        print('%s\t%s' % (token, tag))
    print()

RT	O
<USR>	O
:	O
Online	O
ticket	O
sales	O
for	O
Ghostland	B-musicartist
Observatory	I-musicartist
extended	O
until	O
6	O
PM	O
EST	O
due	O
to	O
high	O
demand	O
.	O
Get	O
them	O
before	O
they	O
sell	O
out	O
...	O

Apple	B-product
MacBook	I-product
Pro	I-product
A1278	I-product
13.3	I-product
"	I-product
Laptop	I-product
-	I-product
MD101LL/A	I-product
(	O
June	O
,	O
2012	O
)	O
-	O
Full	O
read	O
by	O
eBay	B-company
<URL>	O
<URL>	O

Happy	O
Birthday	O
<USR>	O
!	O
May	O
Allah	B-person
s.w.t	O
bless	O
you	O
with	O
goodness	O
and	O
happiness	O
.	O
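
The tags in the sample above follow a B-/I-/O pattern: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. As a minimal sketch of how such a sequence can be read back into entities (the helper name `extract_entities` is hypothetical and not part of this notebook):

```python
def extract_entities(tokens, tags):
    """Group B-/I-/O tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A new entity begins; flush the previous one, if any.
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # 'O' (or an I- tag with no matching open entity) closes the entity.
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, ' '.join(current_tokens)))
    return entities


tokens = ['sales', 'for', 'Ghostland', 'Observatory', 'extended']
tags = ['O', 'O', 'B-musicartist', 'I-musicartist', 'O']
print(extract_entities(tokens, tags))
# [('musicartist', 'Ghostland Observatory')]
```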



### Prepare dictionaries

To train a neural network, I use two mappings:
- {token}$\to${token id}: addresses the row in the embeddings matrix for the current token;
- {tag}$\to${tag id}: used to build the one-hot ground truth probability distribution vectors for computing the loss at the output of the network.

The function *build_dict* returns the {token or tag}$\to${index} mapping and its inverse.
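
To make the role of the two mappings concrete, here is a small sketch with a toy vocabulary. The names `toy_token2idx` and `toy_tag2idx`, the embedding size, and the random embedding matrix are illustrative assumptions, not the dictionaries built in this notebook.

```python
import numpy as np

# Toy mappings (assumed for illustration); index 0 is reserved for unknown tokens.
toy_token2idx = {'<UNK>': 0, '<PAD>': 1, 'Apple': 2, 'MacBook': 3}
toy_tag2idx = {'O': 0, 'B-product': 1, 'I-product': 2}

sentence = ['Apple', 'MacBook', 'Pro']
gold_tags = ['B-product', 'I-product', 'I-product']

# Out-of-vocabulary tokens ('Pro' here) fall back to the <UNK> index.
token_ids = [toy_token2idx.get(tok, toy_token2idx['<UNK>']) for tok in sentence]
tag_ids = [toy_tag2idx[tag] for tag in gold_tags]

# Token ids select rows of an embedding matrix (random here: 4 tokens x 5 dims).
embeddings = np.random.randn(len(toy_token2idx), 5)
sentence_vectors = embeddings[token_ids]          # shape (3, 5)

# Tag ids become one-hot ground truth rows used by the loss.
one_hot_tags = np.eye(len(toy_tag2idx))[tag_ids]  # shape (3, 3)

print(token_ids, tag_ids)
print(one_hot_tags)
```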

In [0]:
from collections import defaultdict

In [0]:
def build_dict(tokens_or_tags, special_tokens):
    """
    tokens_or_tags: a list of lists of tokens or tags
    special_tokens: special tokens that receive the first indices
    """
    # Unknown keys map to index 0 (the first special token).
    tok2idx = defaultdict(lambda: 0)
    idx2tok = []
    for idx, special in enumerate(special_tokens):
        tok2idx[special] = idx
        idx2tok.append(special)
    for sequence in tokens_or_tags:
        for item in sequence:
            if item not in tok2idx:
                # The next free index equals the current vocabulary size.
                tok2idx[item] = len(idx2tok)
                idx2tok.append(item)

    return tok2idx, idx2tok

With *build_dict* implemented, I build dictionaries for tokens and tags. The special tokens in our case are:
 - `<UNK>` for out-of-vocabulary tokens;
 - `<PAD>` for padding sentences to the same length when creating batches of sentences.

In [0]:
special_tokens = ['<UNK>', '<PAD>']
special_tags = ['O']

# Create dictionaries 
token2idx, idx2token = build_dict(train_tokens + validation_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)

In [0]:
def words2idxs(tokens_list):
    return [token2idx[word] for word in tokens_list]

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
    return [idx2tag[idx] for idx in idxs]

### Generate batches

Neural networks are usually trained with batches: each weight update of the network is computed from several sequences at once.
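
Sequences in a batch usually differ in length, so they have to be padded to a common length before they can be stacked into one array. A minimal sketch of that padding step, assuming a hypothetical helper `pad_batch` and a padding index of 1:

```python
import numpy as np


def pad_batch(sequences, pad_value):
    """Stack variable-length id sequences into one array filled with pad_value."""
    lengths = np.array([len(seq) for seq in sequences], dtype=np.int32)
    batch = np.full((len(sequences), lengths.max()), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch, lengths


batch, lengths = pad_batch([[5, 8, 2], [7], [4, 9]], pad_value=1)
print(batch)
# [[5 8 2]
#  [7 1 1]
#  [4 9 1]]
print(lengths)
# [3 1 2]
```

The true lengths are returned alongside the padded array so that later steps can ignore the padded positions.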

In [0]:
import numpy as np

def batches_generator(batch_size, tokens, tags,
                      shuffle=True, allow_smaller_last_batch=True):
    """Generates padded batches of tokens and tags."""
    
    n_samples = len(tokens)
    if shuffle:
        order = np.random.permutation(n_samples)
    else:
        order = np.arange(n_samples)

    n_batches = n_samples // batch_size
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1

    for k in range(n_batches):
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples)
        current_batch_size = batch_end - batch_start
        x_list = []
        y_list = []
        max_len_token = 0
        for idx in order[batch_start: batch_end]:
            x_list.append(words2idxs(tokens[idx]))
            y_list.append(tags2idxs(tags[idx]))
            max_len_token = max(max_len_token, len(tags[idx]))
            
        # Fill in the data into numpy nd-arrays filled with padding indices.
        x = np.ones([current_batch_size, max_len_token], dtype=np.int32) * token2idx['<PAD>']
        y = np.ones([current_batch_size, max_len_token], dtype=np.int32) * tag2idx['O']
        lengths = np.zeros(current_batch_size, dtype=np.int32)
        for n in range(current_batch_size):
            utt_len = len(x_list[n])
            x[n, :utt_len] = x_list[n]
            lengths[n] = utt_len
            y[n, :utt_len] = y_list[n]
        yield x, y, lengths

## Build a recurrent neural network

This is the most important part of the project. Here I specify the network architecture from TensorFlow building blocks. I create an LSTM network that produces a probability distribution over tags for each token in a sentence. To take both the left and right contexts of a token into account, I use a Bi-Directional LSTM (Bi-LSTM). A dense layer on top performs tag classification.
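
The data flow of a bi-directional network can be sketched at the shape level in NumPy. This is only an illustration under loud assumptions: a plain tanh RNN stands in for the LSTM cells, all weights are random, sequence lengths and dropout are ignored, and every name below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, emb_dim, hidden, n_tags = 2, 4, 3, 5, 6


def run_rnn(inputs, w_in, w_rec):
    """Simple tanh RNN over the time axis; returns one hidden state per step."""
    state = np.zeros((inputs.shape[0], w_rec.shape[0]))
    outputs = []
    for t in range(inputs.shape[1]):
        state = np.tanh(inputs[:, t] @ w_in + state @ w_rec)
        outputs.append(state)
    return np.stack(outputs, axis=1)


embeddings = rng.normal(size=(batch, steps, emb_dim))
w_in_fw, w_rec_fw = rng.normal(size=(emb_dim, hidden)), rng.normal(size=(hidden, hidden))
w_in_bw, w_rec_bw = rng.normal(size=(emb_dim, hidden)), rng.normal(size=(hidden, hidden))

# Forward pass reads left to right and sees the left context of each token.
out_fw = run_rnn(embeddings, w_in_fw, w_rec_fw)
# Backward pass reads the reversed sequence; reversing the result again
# aligns step t with the right context of token t.
out_bw = run_rnn(embeddings[:, ::-1], w_in_bw, w_rec_bw)[:, ::-1]

# Concatenate both directions, then project each token to tag scores.
rnn_output = np.concatenate([out_fw, out_bw], axis=2)   # (2, 4, 10)
w_dense = rng.normal(size=(2 * hidden, n_tags))
logits = rnn_output @ w_dense                           # (2, 4, 6)
print(rnn_output.shape, logits.shape)
```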

In [0]:
import tensorflow as tf
import numpy as np

In [0]:
class BiLSTMModel():
    pass

First, I create [placeholders](https://www.tensorflow.org/versions/r0.12/api_docs/python/io_ops/placeholders) to specify what data will be fed into the network at execution time.


 

In [0]:
def declare_placeholders(self):
    """Specifies placeholders for the model."""

    # Placeholders for input and ground truth output.
    self.input_batch = tf.placeholder(dtype=tf.int32, shape=[None, None], name='input_batch') 
    self.ground_truth_tags = tf.placeholder(dtype=tf.int32, shape=[None, None], name='ground_truth_tags')  
  
    # Placeholder for lengths of the sequences.
    self.lengths = tf.placeholder(dtype=tf.int32, shape=[None], name='lengths') 
    
    # Placeholder for a dropout keep probability.  

    self.dropout_ph = tf.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])
    
    # Placeholder for a learning rate (tf.float32).
    self.learning_rate_ph = tf.placeholder(dtype=tf.float32, shape=[], name='learning_rate') 

In [0]:
BiLSTMModel.__declare_placeholders = classmethod(declare_placeholders)

In [0]:
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    """Specifies bi-LSTM architecture and computes logits for inputs."""

    # Create the embedding variable, initialized with scaled random values.
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = tf.Variable(initial_embedding_matrix, dtype=tf.float32)

    # Forward and backward LSTM cells, both wrapped with dropout.
    forward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.LSTMCell(n_hidden_rnn),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph)
    backward_cell = tf.nn.rnn_cell.DropoutWrapper(
        tf.nn.rnn_cell.BasicLSTMCell(num_units=n_hidden_rnn),
        input_keep_prob=self.dropout_ph,
        output_keep_prob=self.dropout_ph,
        state_keep_prob=self.dropout_ph)

    # Look up the embeddings for the input token ids.
    embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)

    # Run the bi-directional RNN and concatenate the outputs of both directions.
    (rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        forward_cell, backward_cell, embeddings,
        sequence_length=self.lengths, dtype=tf.float32)
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)

    # A dense layer on top computes per-token logits over the tags.
    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)

In [0]:
BiLSTMModel.__build_layers = classmethod(build_layers)

To compute the actual predictions of the neural network, I need to apply [softmax](https://www.tensorflow.org/api_docs/python/tf/nn/softmax) to the last layer and find the most probable tags with [argmax](https://www.tensorflow.org/api_docs/python/tf/argmax).
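
The same two steps can be sketched in NumPy on a toy logits array (the values are made up for illustration). Since softmax preserves the ordering of the logits, taking argmax of the logits directly would select the same tags.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Two tokens, three candidate tags each.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
probs = softmax(logits)
predictions = probs.argmax(axis=-1)
print(predictions)
# [0 2]
```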

In [0]:
def compute_predictions(self):
    """Transforms logits to probabilities and finds the most probable tags."""
    
  
    softmax_output = tf.nn.softmax(logits=self.logits) 
    
    
    self.predictions = tf.argmax(softmax_output, axis=-1) 

In [0]:
BiLSTMModel.__compute_predictions = classmethod(compute_predictions)

In [0]:
def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entropy loss with logits."""

    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_tags_one_hot, logits=self.logits)
    # Mask out padded positions: compare the input token ids (not the loss values) with PAD_index.
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    self.loss = tf.reduce_mean(mask * loss_tensor)

In [0]:
BiLSTMModel.__compute_loss = classmethod(compute_loss)
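
The idea of a masked cross-entropy can be sketched in NumPy: compute a per-token loss, zero it at padded positions, then average. The sketch assumes the mask is derived from the token ids and that the mean is taken over all positions; the function name and toy inputs are hypothetical.

```python
import numpy as np


def masked_cross_entropy(logits, tag_ids, token_ids, pad_index):
    """Per-token cross-entropy with padded positions zeroed before averaging."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability of the gold tag at every position.
    token_loss = -np.take_along_axis(log_probs, tag_ids[..., None], axis=-1)[..., 0]
    # 1.0 for real tokens, 0.0 for padding.
    mask = (token_ids != pad_index).astype(np.float32)
    return (mask * token_loss).mean(), mask


# Uniform logits over 3 tags: every real token contributes log(3).
logits = np.zeros((2, 2, 3))
tag_ids = np.array([[0, 0], [1, 2]])
token_ids = np.array([[5, 1], [4, 6]])   # the second token of row 0 is padding
loss, mask = masked_cross_entropy(logits, tag_ids, token_ids, pad_index=1)
print(mask)
print(loss)
```

Dividing by `mask.sum()` instead of taking the plain mean would average over real tokens only, which makes the loss independent of how much padding a batch contains.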

The last thing to specify is how to optimize the loss.
I use the [Adam](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) optimizer with the learning rate taken from the corresponding placeholder.
I also apply [clipping](https://www.tensorflow.org/versions/r0.12/api_docs/python/train/gradient_clipping) to prevent exploding gradients, which is easily done with the [clip_by_norm](https://www.tensorflow.org/api_docs/python/tf/clip_by_norm) function.
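
Clipping by norm rescales a gradient whose L2 norm exceeds the threshold and leaves smaller gradients untouched. A minimal NumPy sketch of that rule (a stand-in for illustration, not the TensorFlow implementation itself):

```python
import numpy as np


def clip_by_norm(gradient, clip_norm):
    """Rescale gradient so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(gradient)
    if norm <= clip_norm:
        return gradient
    return gradient * (clip_norm / norm)


large = np.array([3.0, 4.0])             # L2 norm is 5
small = np.array([0.3, 0.4])             # L2 norm is 0.5
print(clip_by_norm(large, 1.0))          # direction kept, norm scaled down to 1
print(clip_by_norm(small, 1.0))          # unchanged
```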

In [0]:
def perform_optimization(self):
    
    self.optimizer =  tf.train.AdamOptimizer(learning_rate=self.learning_rate_ph) 
    self.grads_and_vars = self.optimizer.compute_gradients(self.loss)
    
    # Clip each gradient to a maximum L2 norm of clip_norm.
    clip_norm = 1.0
    self.grads_and_vars = [(tf.clip_by_norm(grad, clip_norm), var) for grad, var in self.grads_and_vars]
    
    self.train_op = self.optimizer.apply_gradients(self.grads_and_vars)

In [0]:
BiLSTMModel.__perform_optimization = classmethod(perform_optimization)

In [0]:
def init_model(self, vocabulary_size, n_tags, embedding_dim, n_hidden_rnn, PAD_index):
    self.__declare_placeholders()
    self.__build_layers(vocabulary_size, embedding_dim, n_hidden_rnn, n_tags)
    self.__compute_predictions()
    self.__compute_loss(n_tags, PAD_index)
    self.__perform_optimization()

In [0]:
BiLSTMModel.__init__ = classmethod(init_model)

## Train the network and predict tags

In [0]:
def train_on_batch(self, session, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability):
    feed_dict = {self.input_batch: x_batch,
                 self.ground_truth_tags: y_batch,
                 self.learning_rate_ph: learning_rate,
                 self.dropout_ph: dropout_keep_probability,
                 self.lengths: lengths}
    
    session.run(self.train_op, feed_dict=feed_dict)

In [0]:
BiLSTMModel.train_on_batch = classmethod(train_on_batch)

In [0]:
def predict_for_batch(self, session, x_batch, lengths):
    feed_dict = {self.input_batch: x_batch,
                 self.lengths: lengths}

    predictions = session.run(self.predictions,feed_dict)
    
    return predictions

In [0]:
BiLSTMModel.predict_for_batch = classmethod(predict_for_batch)

 
### Evaluation 


In [0]:
from evaluation import precision_recall_f1

In [0]:
def predict_tags(model, session, token_idxs_batch, lengths):
    """Performs predictions and transforms indices to tokens and tags."""
    
    tag_idxs_batch = model.predict_for_batch(session, token_idxs_batch, lengths)
    
    tags_batch, tokens_batch = [], []
    for tag_idxs, token_idxs in zip(tag_idxs_batch, token_idxs_batch):
        tags, tokens = [], []
        for tag_idx, token_idx in zip(tag_idxs, token_idxs):
            if token_idx != token2idx['<PAD>']:
                tags.append(idx2tag[tag_idx])
                tokens.append(idx2token[token_idx])
        tags_batch.append(tags)
        tokens_batch.append(tokens)
    return tags_batch, tokens_batch
    
    
def eval_conll(model, session, tokens, tags, short_report=True):
    
    y_true, y_pred = [], []
    for x_batch, y_batch, lengths in batches_generator(1, tokens, tags):
        tags_batch, tokens_batch = predict_tags(model, session, x_batch, lengths)
        ground_truth_tags = [idx2tag[tag_idx] for tag_idx in y_batch[0]]

        # extend every prediction and ground truth sequence with 'O' tag to indicate a possible end of entity.
        y_true.extend(ground_truth_tags + ['O'])
        y_pred.extend(tags_batch[0] + ['O'])
    results = precision_recall_f1(y_true, y_pred, print_results=True, short_report=short_report)
    return results
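
The summary lines printed by the evaluation report the number of true phrases, the number of phrases found, and the number of correct ones. Assuming the usual definitions of precision, recall, and F1, the percentages can be recomputed from those counts. The helper below is a hypothetical sketch, not the `precision_recall_f1` function from the `evaluation` module; the counts are taken from the epoch 3 train evaluation in the training log.

```python
def precision_recall_f1_from_counts(n_true, n_found, n_correct):
    """Compute precision, recall and F1 from phrase counts."""
    precision = n_correct / n_found if n_found else 0.0
    recall = n_correct / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 4489 phrases in the data, 4559 found, 1924 of them correct.
precision, recall, f1 = precision_recall_f1_from_counts(
    n_true=4489, n_found=4559, n_correct=1924)
print('precision: %.2f%%; recall: %.2f%%; F1: %.2f'
      % (100 * precision, 100 * recall, 100 * f1))
# precision: 42.20%; recall: 42.86%; F1: 42.53
```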

In [0]:
#!pip install tensorflow-gpu==1.15

In [0]:
tf.reset_default_graph()

model = BiLSTMModel(vocabulary_size=len(token2idx), n_tags=len(tag2idx), embedding_dim=200, n_hidden_rnn=200, PAD_index=token2idx['<PAD>'])  

batch_size = 32  
n_epochs = 4  
learning_rate = 0.005  
learning_rate_decay = np.sqrt(2) 
dropout_keep_probability = 0.5  
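
The training loop divides the learning rate by `learning_rate_decay` after every epoch, so with a decay of $\sqrt{2}$ the rate halves every two epochs. A small sketch of the resulting schedule, using the hyperparameter values above (the names `initial_learning_rate` and `schedule` are introduced only for this illustration):

```python
import numpy as np

initial_learning_rate = 0.005
learning_rate_decay = np.sqrt(2)
n_epochs = 4

# Learning rate used during each epoch: divided by sqrt(2) once per finished epoch.
schedule = [initial_learning_rate / learning_rate_decay ** epoch
            for epoch in range(n_epochs)]
for epoch, rate in enumerate(schedule, start=1):
    print('epoch %d: learning rate %.6f' % (epoch, rate))
```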

In [0]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())

print('Start training... \n')
for epoch in range(n_epochs):
    print('-' * 20 + ' Epoch {} '.format(epoch+1) + 'of {} '.format(n_epochs) + '-' * 20)
    print('Train data evaluation:')
    eval_conll(model, sess, train_tokens, train_tags, short_report=True)
    print('Validation data evaluation:')
    eval_conll(model, sess, validation_tokens, validation_tags, short_report=True)
    
    for x_batch, y_batch, lengths in batches_generator(batch_size, train_tokens, train_tags):
        model.train_on_batch(sess, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability)
        
    # Decaying the learning rate
    learning_rate = learning_rate / learning_rate_decay
    
print('...training finished.')

Start training... 

-------------------- Epoch 1 of 4 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 76658 phrases; correct: 180.

precision:  0.23%; recall:  4.01%; F1:  0.44

Validation data evaluation:
processed 12836 tokens with 537 phrases; found: 9315 phrases; correct: 32.

precision:  0.34%; recall:  5.96%; F1:  0.65

-------------------- Epoch 2 of 4 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 1735 phrases; correct: 432.

precision:  24.90%; recall:  9.62%; F1:  13.88

Validation data evaluation:
processed 12836 tokens with 537 phrases; found: 144 phrases; correct: 35.

precision:  24.31%; recall:  6.52%; F1:  10.28

-------------------- Epoch 3 of 4 --------------------
Train data evaluation:
processed 105778 tokens with 4489 phrases; found: 4559 phrases; correct: 1924.

precision:  42.20%; recall:  42.86%; F1:  42.53

Validation data evaluation:
processed 12836 tokens with 537 phr

In [0]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
train_results = eval_conll(model, sess, train_tokens, train_tags, short_report=False)

print('-' * 20 + ' Validation set quality: ' + '-' * 20)
validation_results = eval_conll(model, sess, validation_tokens, validation_tags, short_report=False)

print('-' * 20 + ' Test set quality: ' + '-' * 20)
test_results = eval_conll(model, sess, test_tokens, test_tags, short_report=False)

-------------------- Train set quality: --------------------
processed 105778 tokens with 4489 phrases; found: 4647 phrases; correct: 3387.

precision:  72.89%; recall:  75.45%; F1:  74.15

	     company: precision:   76.39%; recall:   89.58%; F1:   82.46; predicted:   754

	    facility: precision:   71.02%; recall:   71.02%; F1:   71.02; predicted:   314

	     geo-loc: precision:   84.74%; recall:   90.86%; F1:   87.69; predicted:  1068

	       movie: precision:    0.00%; recall:    0.00%; F1:    0.00; predicted:     4

	 musicartist: precision:   44.35%; recall:   23.71%; F1:   30.90; predicted:   124

	       other: precision:   70.65%; recall:   78.86%; F1:   74.53; predicted:   845

	      person: precision:   69.74%; recall:   93.91%; F1:   80.04; predicted:  1193

	     product: precision:   55.86%; recall:   56.92%; F1:   56.39; predicted:   324

	  sportsteam: precision:   85.71%; recall:    8.29%; F1:   15.13; predicted:    21

	      tvshow: precision:    0.00%; recall:  

### Conclusions

Better results could be obtained by combining several types of methods; see, e.g., [this](https://arxiv.org/abs/1603.01354) paper for more details.