## Named Entity Recognition using the MIT movie corpus



There are different approaches to the task of named entity recognition. A few of them are:

1) Lexical lookup:
Match text against manually defined lists (gazetteers) of common lexical variants for entities. \
2) Rules:
Match text against manually defined extraction patterns for identifying entities/relations. Example: Hearst patterns \
3) Machine learning:
Supervised, semi-supervised, or unsupervised learning of patterns and lexical variants, which is the most commonly used approach nowadays. Combining machine learning with rule-based approaches could also yield good results.
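
To make the first approach concrete, here is a minimal sketch of a lexical lookup tagger. The gazetteer entries and the function name `gazetteer_tag` are made up for illustration; a real gazetteer would be far larger and would need to handle spelling variants.

```python
# Toy gazetteer: phrase (as a tuple of lowercase tokens) -> entity type.
# The entries are illustrative only.
GAZETTEER = {
    ("steven", "soderbergh"): "Director",
    ("george", "clooney"): "Actor",
}

def gazetteer_tag(tokens, gazetteer=GAZETTEER):
    """Tag tokens with BIO labels using greedy longest-match lookup."""
    tags = ["O"] * len(tokens)
    max_len = max(len(phrase) for phrase in gazetteer)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in gazetteer:
                label = gazetteer[phrase]
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags

sentence = "remade by steven soderbergh and george clooney".split()
for token, tag in zip(sentence, gazetteer_tag(sentence)):
    print('%s\t%s' % (token, tag))
```

Such a tagger cannot label anything outside its lists, which is one reason the learned approach is used in the rest of this notebook.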

I will use a supervised machine learning approach (deep learning, to be specific) for this task; having labelled data makes the task easier.

## Loading the data and Pre-processing

We will work with a corpus that contains sentences about movies with their NE tags. Every line of a file contains a tag and a token, separated by whitespace. Different sentences are separated by an empty line.

The function read_data reads a corpus from file_path and returns two lists: one with tokens and one with the corresponding tags.



In [None]:
# Function to extract tokens and tags from the file into separate lists
def read_data(file_path):

    # Both outputs are lists of lists: each inner list holds the tokens
    # (or tags) of one sentence, in the order they appear in the data.
    tokens = []
    tags = []

    # Buffers for the sentence currently being read
    movie_tokens = []
    movie_tags = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            # Removing any whitespace at the beginning and at the end of the line
            line = line.strip()
            if not line:
                # An empty line marks the end of a sentence
                if movie_tokens:
                    tokens.append(movie_tokens)
                    tags.append(movie_tags)
                movie_tokens = []
                movie_tags = []
            else:
                # Splitting the line on whitespace to separate the tag and the token
                tag, token = line.split()
                movie_tokens.append(token)
                movie_tags.append(tag)
    # Keep the last sentence if the file does not end with an empty line
    if movie_tokens:
        tokens.append(movie_tokens)
        tags.append(movie_tags)
    return tokens, tags

In [None]:
# calling the above function to split both the train data and test data into tokens and tags
train_tokens, train_tags = read_data('trivia10k13train.bio.txt')
test_tokens, test_tags = read_data('trivia10k13test.bio.txt')

In [None]:
print(train_tokens)
print(train_tags)

[['B-Actor', 'I-Actor', 'O', 'O', 'B-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'B-Opinion', 'I-Opinion', 'I-Opinion', 'B-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot'], ['B-Actor', 'I-Actor', 'O', 'B-Actor', 'I-Actor', 'B-Award', 'I-Award', 'O', 'O', 'O', 'O', 'O', 'B-Year', 'O', 'B-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot'], ['O', 'O', 'O', 'B-Actor', 'I-Actor', 'O', 'B-Actor', 'I-Actor', 'O', 'O', 'B-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot'], ['O', 'O', 'O', 'O', 'B-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot', 'I-Plot'], ['O', 'O', 'O', 'O', 'O', 'B-Genre', 'O', 'B-Origin', 'I-Origin', 'I-Origin', 'I-Origin', 'I-Origin', 'I-Origin', 'I-Origin', 'I-Origin', 'I-Origin', 'O', 'B-Plot', 'I-Plot', 

In [None]:
# printing a few samples of tokens and tags to check that everything looks right
for i in range(3):
    for token, tag in zip(test_tokens[i], test_tags[i]):
        print('%s\t%s' % (token, tag))
    print()

i	O
need	O
that	O
movie	O
which	O
involves	O
aliens	B-Plot
invading	I-Plot
earth	I-Plot
in	I-Plot
a	I-Plot
particular	I-Plot
united	I-Plot
states	I-Plot
place	I-Plot
in	I-Plot
california	I-Plot

what	O
soviet	B-Genre
science	I-Genre
fiction	I-Genre
classic	B-Opinion
about	O
a	B-Plot
mysterious	I-Plot
planet	I-Plot
was	O
later	O
remade	B-Relationship
by	O
steven	B-Director
soderbergh	I-Director
and	O
george	B-Actor
clooney	I-Actor

this	O
american	B-Genre
classic	I-Genre
based	O
on	O
margaret	B-Origin
mitchell	I-Origin
s	I-Origin
novel	I-Origin
had	O
more	O
than	O
50	O
speaking	O
roles	O
and	O
2	O
400	O
extras	O
in	O
the	O
film	O



## Prepare dictionaries

To train a neural network, we will use two mappings:

{token}$\to${token id}: addresses the row of the embedding matrix for the current token. \
{tag}$\to${tag id}: used to build the one-hot ground truth probability distribution vectors for computing the loss at the output of the network.


In [None]:
from collections import defaultdict

# Building the token->id and id->token dictionaries

def build_dict(tokens_or_tags, special_tokens):

    # Dictionary with default value 0, so unseen keys map to index 0
    tok2idx = defaultdict(lambda: 0)
    count = 0
    # Special tokens get the first indices
    for token in special_tokens:
        tok2idx[token] = count
        count += 1
    # Iterating through each sentence in the tokens/tags list (i.e. a list of lists)
    for sentence in tokens_or_tags:
        # Iterating through each element of the inner list
        for word in sentence:
            # Add the word only if it is not in the dictionary yet
            if word not in tok2idx:
                tok2idx[word] = count
                count += 1

    # Creating the id->token dictionary from the token->id dictionary (needed later)
    idx2tok = {index: word for word, index in tok2idx.items()}

    return tok2idx, idx2tok

In [None]:
# defining the list of special tokens and their tag
# <UNK> token is for the out-of-vocabulary words;
# <PAD> token is for padding sentence to the same length when we create batches of sentences.
special_tokens = ['<UNK>', '<PAD>']
special_tags = ['O']

# Calling the above created build_dict function to create dictionary 
token2idx, idx2token = build_dict(train_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)

In [None]:
len(token2idx)

10989

In [None]:
tag2idx

defaultdict(<function __main__.build_dict.<locals>.<lambda>>,
            {'B-Actor': 1,
             'B-Award': 7,
             'B-Character_Name': 21,
             'B-Director': 13,
             'B-Genre': 10,
             'B-Opinion': 5,
             'B-Origin': 11,
             'B-Plot': 3,
             'B-Quote': 23,
             'B-Relationship': 19,
             'B-Soundtrack': 17,
             'B-Year': 9,
             'I-Actor': 2,
             'I-Award': 8,
             'I-Character_Name': 22,
             'I-Director': 14,
             'I-Genre': 15,
             'I-Opinion': 6,
             'I-Origin': 12,
             'I-Plot': 4,
             'I-Quote': 24,
             'I-Relationship': 20,
             'I-Soundtrack': 18,
             'I-Year': 16,
             'O': 0})

In [None]:
# functions to create the mapping between tokens and ids for a sentence
def words2idxs(tokens_list):
    return [token2idx[word] for word in tokens_list]

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
    return [idx2tag[idx] for idx in idxs]
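
A small, self-contained illustration of how out-of-vocabulary words end up as `<UNK>`. The toy vocabulary and the helper `toy_words2idxs` are assumptions made for this sketch. It uses `.get` rather than indexing, because indexing a `defaultdict` with a missing key also inserts that key, which silently grows the dictionary.

```python
from collections import defaultdict

# Toy vocabulary laid out like token2idx: <UNK> has index 0.
toy_token2idx = defaultdict(lambda: 0, {'<UNK>': 0, '<PAD>': 1, 'aliens': 2, 'earth': 3})
toy_idx2token = {idx: tok for tok, idx in toy_token2idx.items()}

def toy_words2idxs(tokens):
    # .get does not trigger the defaultdict factory, so the vocabulary stays unchanged.
    return [toy_token2idx.get(tok, toy_token2idx['<UNK>']) for tok in tokens]

ids = toy_words2idxs(['aliens', 'invading', 'earth'])
print(ids)
print([toy_idx2token[i] for i in ids])
```

The unseen word `invading` is mapped to index 0 and therefore decodes back to `<UNK>`.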

## Generate batches

Neural networks are usually trained with mini-batches, which means that every weight update is based on several sequences at a time. All the sequences within a batch need to have the same length, so we pad them with a special <PAD> token.



In [None]:
import numpy as np

# Function to generate padded batches of tokens and tags
def batches_generator(batch_size, tokens, tags,
                      shuffle=True, allow_smaller_last_batch=True):
    
    n_samples = len(tokens)
    if shuffle:
        #Randomly shuffle/permute the input sequences
        order = np.random.permutation(n_samples)
    else:
        order = np.arange(n_samples)

    #number of batches
    n_batches = n_samples // batch_size
    # allow smaller last batch with the remaining entries
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1

    # Iterating through the number of batches
    for k in range(n_batches):
        # getting starting and ending point of each batch 
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples)
        # calculating the current batch size
        current_batch_size = batch_end - batch_start
        # lists to collect the inputs ('X', i.e. token ids) and their labels ('Y', i.e. tag ids)
        x_list = []
        y_list = []
        max_len_token = 0
        # Iterating through each sample of the batch and appending it to the lists
        for idx in order[batch_start: batch_end]:
            x_list.append(words2idxs(tokens[idx]))
            y_list.append(tags2idxs(tags[idx]))
            # getting the max length
            max_len_token = max(max_len_token, len(tags[idx]))
            
        # Fill in the data into numpy nd-arrays filled with padding indices.
        x = np.ones([current_batch_size, max_len_token], dtype=np.int32) * token2idx['<PAD>']
        y = np.ones([current_batch_size, max_len_token], dtype=np.int32) * tag2idx['O']
        lengths = np.zeros(current_batch_size, dtype=np.int32)
        for n in range(current_batch_size):
            utt_len = len(x_list[n])
            x[n, :utt_len] = x_list[n]
            lengths[n] = utt_len
            y[n, :utt_len] = y_list[n]

        # The yield statement suspends the function's execution and sends a value back to the caller,
        # but retains enough state for the function to resume where it left off.
        # When resumed, the function continues execution immediately after the last yield.
        yield x, y, lengths
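
The padding step on its own can be seen in this toy sketch. The helper `pad_batch` and the example ids are hypothetical; the pad value 1 mirrors the position of `<PAD>` in `special_tokens`.

```python
import numpy as np

def pad_batch(sequences, pad_value):
    """Right-pad integer sequences to the length of the longest one."""
    lengths = np.array([len(seq) for seq in sequences], dtype=np.int32)
    padded = np.full((len(sequences), lengths.max()), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded, lengths

x, lengths = pad_batch([[5, 6, 7], [8], [9, 10]], pad_value=1)
print(x)
print(lengths)
```

The true lengths are returned alongside the padded array so that later stages can ignore the padded positions.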

## Building a Bi-directional recurrent neural network

Here I will build a bi-directional recurrent neural network using TensorFlow, which produces a probability distribution over tags for each token in a sentence. To take into account the context to both the left and the right of a token, I will use a bi-directional LSTM. A dense layer is used on top to perform tag classification.
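
The bi-directional idea can be sketched in plain NumPy before turning to TensorFlow. Note that this sketch swaps the LSTM cell for a plain tanh RNN to stay short, and all sizes and weights are arbitrary; it only shows how the forward and backward outputs are combined per token.

```python
import numpy as np

def run_rnn(inputs, w_in, w_rec):
    """Run a plain tanh RNN over inputs of shape [seq_len, input_dim]."""
    hidden = np.zeros(w_rec.shape[0])
    outputs = []
    for x_t in inputs:
        hidden = np.tanh(x_t @ w_in + hidden @ w_rec)
        outputs.append(hidden)
    return np.stack(outputs)

rng = np.random.default_rng(0)
seq_len, input_dim, n_hidden = 4, 3, 5
inputs = rng.normal(size=(seq_len, input_dim))

# Each direction has its own weights.
fw = run_rnn(inputs,
             rng.normal(size=(input_dim, n_hidden)),
             rng.normal(size=(n_hidden, n_hidden)))
# Backward pass: feed the reversed sequence, then flip the outputs back to the original order.
bw = run_rnn(inputs[::-1],
             rng.normal(size=(input_dim, n_hidden)),
             rng.normal(size=(n_hidden, n_hidden)))[::-1]

# Every token now has a vector built from its left context (fw) and its right context (bw).
rnn_output = np.concatenate([fw, bw], axis=1)
print(rnn_output.shape)
```

The concatenated output has twice the hidden size per token, which is the input the dense tagging layer receives.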

In [None]:
# importing tensorflow and disabling the version 2 behaviour for my convenience here
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np

Instructions for updating:
non-resource variables are not supported in the long term


In [None]:
class BiLSTMModel():
    # The pass statement is used as a placeholder for future code.
    # When the pass statement is executed, nothing happens, but you avoid getting an error when empty code is not allowed.
    pass

In [None]:
# function to create placeholders for the model to specify what data we are going to feed into the network during the execution time.
def declare_placeholders(self):

    # placeholders for input and ground truth output. [ i.e the sequence of words and tags]
    self.input_batch = tf.placeholder(dtype=tf.int32, shape=[None, None], name='input_batch') 
    self.ground_truth_tags = tf.placeholder(dtype=tf.int32, shape=[None, None], name='ground_truth_tags') 
    
    # Placeholder for lengths of the sequences.
    self.lengths = tf.placeholder(dtype=tf.int32, shape=[None], name='lengths') 
    
    # Placeholder for a dropout keep probability. If we don't feed
    # a value for this placeholder, it will be equal to 1.0.
    self.dropout_ph = tf.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])
    
    # Placeholder for a learning rate (tf.float32).
    self.learning_rate_ph = tf.placeholder(dtype=tf.float32, shape=[]) 

In [None]:
BiLSTMModel.__declare_placeholders = classmethod(declare_placeholders)

In [None]:
# function to specify bi-LSTM architecture and computes logits for inputs.
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    
    # the embedding matrix will be initialized randomly
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)

    # Create embedding variable (tf.Variable) with dtype tf.float32
    embedding_matrix_variable = tf.Variable(initial_value = initial_embedding_matrix, name = 'embeddings_matrix', dtype = tf.float32)
    
    # Creating forward and backward LSTM cells with n_hidden_rnn number of units 
    # and then wrapping the cells with dropout( tensorflow version 1 provides dropout as a wrapper unlike keras which provides it as layer)
    # Also initializing all *_keep_prob with dropout placeholder.
    forward_cell =  tf.nn.rnn_cell.BasicLSTMCell(num_units = n_hidden_rnn)
    forward_cell = tf.nn.rnn_cell.DropoutWrapper(forward_cell, input_keep_prob=self.dropout_ph,output_keep_prob=self.dropout_ph,state_keep_prob=self.dropout_ph)

    backward_cell =  tf.nn.rnn_cell.BasicLSTMCell(num_units = n_hidden_rnn)
    backward_cell = tf.nn.rnn_cell.DropoutWrapper(backward_cell, input_keep_prob=self.dropout_ph,output_keep_prob=self.dropout_ph,state_keep_prob=self.dropout_ph)

    # Look up embeddings for self.input_batch (tf.nn.embedding_lookup).
    # Shape: [batch_size, sequence_len, embedding_dim].
    embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)
   
    # Pass them through Bidirectional Dynamic RNN (tf.nn.bidirectional_dynamic_rnn) 
    # which takes input and builds independent forward and backward RNNs. The input_size of forward and backward cell must match.
    (rnn_output_fw, rnn_output_bw), _ =  tf.nn.bidirectional_dynamic_rnn(cell_fw = forward_cell, cell_bw = backward_cell,inputs = embeddings,dtype=tf.float32,sequence_length=self.lengths)
    
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)

    # adding a dense layer on top.
    # Shape: [batch_size, sequence_len, n_tags].   
    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)

In [None]:
BiLSTMModel.__build_layers = classmethod(build_layers)

In [None]:
# To compute the actual predictions of the neural network, we need to apply softmax to the last layer and
# find the most probable tags with argmax.

# function to transform logits to probabilities and find the most probable tags
def compute_predictions(self):
    
    # Applying softmax to turn the logits into probabilities
    softmax_output = tf.nn.softmax(self.logits)
    
    # Using argmax (tf.argmax) to get the most probable tags
    self.predictions = tf.argmax(softmax_output, axis = -1)

In [None]:
BiLSTMModel.__compute_predictions = classmethod(compute_predictions)
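
The softmax and argmax steps can be checked on toy numbers with NumPy. The logits below are made up; the helper `softmax` is a hand-written stand-in for `tf.nn.softmax`.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Two tokens, three candidate tags each.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
probs = softmax(logits)
predictions = probs.argmax(axis=-1)
print(probs.sum(axis=-1))
print(predictions)
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly would give the same tags; the softmax is only needed when the probabilities themselves are of interest.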

In [None]:
# During training we do not need the predictions of the network, but we need a loss function. 
# We will use cross-entropy loss, a standard loss function for classification problems such as tagging.
# It is applied to the logits of the model (not to softmax probabilities!). 
# We also don't want to take into account loss terms coming from <PAD> tokens, so we mask them out before computing the mean.

# Function to compute masked cross-entropy loss with logits
def compute_loss(self, n_tags, PAD_index):
    
    # Creating a cross entropy function
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits_v2(labels=ground_truth_tags_one_hot, logits=self.logits) 
    
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    # Creating a loss function which doesn't operate with <PAD> tokens
    self.loss =  tf.reduce_mean(tf.reduce_sum(tf.multiply(loss_tensor, mask), axis=-1) / tf.reduce_sum(mask, axis=-1))
    

In [None]:
BiLSTMModel.__compute_loss = classmethod(compute_loss)
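
As a rough NumPy counterpart of the masked loss, the sketch below averages the per-token cross-entropy over the real tokens of each sentence and then over sentences. The function name and the toy inputs are assumptions for illustration.

```python
import numpy as np

def masked_cross_entropy(logits, tag_ids, token_ids, pad_index):
    """logits: [batch, seq_len, n_tags]; tag_ids and token_ids: [batch, seq_len]."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability of the correct tag at every position.
    per_token = -np.take_along_axis(log_probs, tag_ids[..., None], axis=-1)[..., 0]
    # 1.0 for real tokens, 0.0 for padding.
    mask = (token_ids != pad_index).astype(np.float64)
    # Average over the real tokens of each sentence, then over sentences.
    per_sentence = (per_token * mask).sum(axis=-1) / mask.sum(axis=-1)
    return per_sentence.mean()

logits = np.zeros((2, 3, 4))                   # uniform scores over 4 tags
tag_ids = np.array([[1, 2, 0], [3, 0, 0]])
token_ids = np.array([[7, 8, 9], [5, 1, 1]])   # index 1 plays the role of <PAD>
loss = masked_cross_entropy(logits, tag_ids, token_ids, pad_index=1)
print(loss, np.log(4))
```

With uniform logits every real token contributes log(4), so the padded positions in the second sentence leave the result unchanged.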

In [None]:
# The last thing to specify is how we want to optimize the loss. 
# I'm using the Adam optimizer here, a widely used optimiser for deep learning models,
# and then applying clipping to limit exploding gradients

def perform_optimization(self):
    
    # Creating an optimizer 
    self.optimizer =  tf.train.AdamOptimizer(learning_rate=self.learning_rate_ph) 
    
    # computing gradients
    self.grads_and_vars = self.optimizer.compute_gradients(self.loss)
    
    # applying gradient clipping(to prevent exploding gradients in recurrent neural networks) for gradients in self.grads_and_vars
    clip_norm = tf.cast(1.0, tf.float32)
    self.grads_and_vars =   [(tf.clip_by_norm(grad,clip_norm = clip_norm),var) for grad, var in self.grads_and_vars]
    
    self.train_op = self.optimizer.apply_gradients(self.grads_and_vars)

In [None]:
BiLSTMModel.__perform_optimization = classmethod(perform_optimization)
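
What clipping by norm does to a single gradient tensor can be shown with a few lines of NumPy. The helper below is a hand-written stand-in written for this sketch, not the TensorFlow implementation.

```python
import numpy as np

def clip_by_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(grad)
    if norm <= clip_norm:
        return grad
    return grad * (clip_norm / norm)

large = np.array([3.0, 4.0])   # L2 norm 5.0
small = np.array([0.3, 0.4])   # L2 norm 0.5
print(clip_by_norm(large, 1.0))
print(clip_by_norm(small, 1.0))
```

The direction of the gradient is preserved and only its length is capped. In the model above each gradient tensor is clipped on its own; clipping all gradients by their joint norm (`tf.clip_by_global_norm`) is a possible alternative.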

In [None]:
# the constructor method for our Bi-LSTM class
def init_model(self, vocabulary_size, n_tags, embedding_dim, n_hidden_rnn, PAD_index):
    self.__declare_placeholders()
    self.__build_layers(vocabulary_size, embedding_dim, n_hidden_rnn, n_tags)
    self.__compute_predictions()
    self.__compute_loss(n_tags, PAD_index)
    self.__perform_optimization()

In [None]:
BiLSTMModel.__init__ = classmethod(init_model)

In [None]:
# function to feed the actual data through the placeholders that we defined before
def train_on_batch(self, session, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability):
    feed_dict = {self.input_batch: x_batch,
                 self.ground_truth_tags: y_batch,
                 self.learning_rate_ph: learning_rate,
                 self.dropout_ph: dropout_keep_probability,
                 self.lengths: lengths}
                 
    #Session.run is a point which initiates computations in the graph that we have defined
    session.run(self.train_op, feed_dict=feed_dict)

In [None]:
BiLSTMModel.train_on_batch = classmethod(train_on_batch)

In [None]:
# Implementing the function predict_for_batch by initializing feed_dict with input x_batch and lengths 
# and running the session for self.predictions
def predict_for_batch(self, session, x_batch, lengths):
    
    predictions = session.run(self.predictions, feed_dict={self.input_batch:x_batch, self.lengths:lengths})
    return predictions

In [None]:
BiLSTMModel.predict_for_batch = classmethod(predict_for_batch)

## Training and Evaluating the Model

Precision, recall and F1 score are the three metrics most widely used to measure the retrieval effectiveness of a system.

Precision, P = # correctly extracted items / Total # of extracted items \
Recall, R = # correctly extracted items / Total # of gold items

The F-score is the harmonic mean of precision and recall, which combines both scores into one.

$F = \frac{2 \cdot P \cdot R}{P + R}$
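
As a worked example, the sketch below plugs in the counts from the "Epoch 2" line of the training log further down (23030 gold phrases, 24721 found, 15519 correct). The function name is made up for this illustration and is not the one from `evaluation.py`.

```python
def scores_from_counts(n_correct, n_found, n_gold):
    """Compute precision, recall and F1 from entity counts."""
    precision = n_correct / n_found if n_found else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = scores_from_counts(n_correct=15519, n_found=24721, n_gold=23030)
print('precision: %.2f%%; recall: %.2f%%; F1: %.2f' % (100 * p, 100 * r, 100 * f1))
```

The printed values agree with the figures reported for that epoch.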

In [None]:
# the code in the link below contains a method to calculate precision, recall and F1 scores
import os
os.system("wget https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/week2/evaluation.py")

0

In [None]:
from evaluation import precision_recall_f1

In [None]:
# function to perform predictions and transform indices to tokens and tags
def predict_tags(model, session, token_idxs_batch, lengths):
    
    # predicted tags ids for the given batch
    tag_idxs_batch = model.predict_for_batch(session, token_idxs_batch, lengths)
    
    # extracting tags and tokens from their ids 
    tags_batch, tokens_batch = [], []
    for tag_idxs, token_idxs in zip(tag_idxs_batch, token_idxs_batch):
        tags, tokens = [], []
        for tag_idx, token_idx in zip(tag_idxs, token_idxs):
            tags.append(idx2tag[tag_idx])
            tokens.append(idx2token[token_idx])
        tags_batch.append(tags)
        tokens_batch.append(tokens)
    return tags_batch, tokens_batch
    

# function to evaluate the model using the performance metrics
def eval_conll(model, session, tokens, tags, short_report=True):
    
    y_true, y_pred = [], []
    for x_batch, y_batch, lengths in batches_generator(1, tokens, tags):
        tags_batch, tokens_batch = predict_tags(model, session, x_batch, lengths)
        if len(x_batch[0]) != len(tags_batch[0]):
            raise Exception("Incorrect length of prediction for the input, "
                            "expected length: %i, got: %i" % (len(x_batch[0]), len(tags_batch[0])))
        predicted_tags = []
        ground_truth_tags = []
        for gt_tag_idx, pred_tag, token in zip(y_batch[0], tags_batch[0], tokens_batch[0]): 
            if token != '<PAD>':
                ground_truth_tags.append(idx2tag[gt_tag_idx])
                predicted_tags.append(pred_tag)

        # We extend every prediction and ground truth sequence with 'O' tag to indicate a possible end of entity.
        y_true.extend(ground_truth_tags + ['O'])
        y_pred.extend(predicted_tags + ['O'])
        
    results = precision_recall_f1(y_true, y_pred, print_results=True, short_report=short_report)
    return results
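
The scores are computed over whole entities ("phrases") rather than single tokens. The sketch below shows one way such an entity-level comparison could work for BIO tags; it is an illustration written for this notebook, not the code inside `evaluation.py`, and the predicted tags are invented.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) entity spans in a BIO tag sequence."""
    spans = set()
    start, label = None, None
    # A trailing 'O' acts as a sentinel that closes an entity ending at the last token.
    for i, tag in enumerate(list(tags) + ['O']):
        continues = tag.startswith('I-') and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
            start, label = (i, tag[2:]) if tag != 'O' else (None, None)
    return spans

gold = ['O', 'B-Genre', 'I-Genre', 'I-Genre', 'B-Opinion']
pred = ['O', 'B-Genre', 'I-Genre', 'O', 'B-Opinion']

gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
correct = gold_spans & pred_spans
print(sorted(gold_spans))
print(sorted(pred_spans))
print(len(correct), 'correct out of', len(pred_spans), 'found and', len(gold_spans), 'gold')
```

The predicted Genre entity is one token too short, so it does not count as correct even though two of its three tokens are tagged correctly.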

In [None]:
tf.reset_default_graph()

model = BiLSTMModel(len(token2idx), len(tag2idx), 200, 200, token2idx['<PAD>'])

batch_size = 32 
n_epochs = 5 
learning_rate = 0.005
learning_rate_decay = 1.414
dropout_keep_probability = 0.5 

Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
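
The hyperparameters above imply a simple decay schedule, since the training loop divides the learning rate by `learning_rate_decay` after every epoch. This small sketch just lists the resulting values per epoch.

```python
learning_rate = 0.005
learning_rate_decay = 1.414
n_epochs = 5

schedule = []
for epoch in range(n_epochs):
    # Rate used during this epoch, followed by the decay applied afterwards.
    schedule.append(learning_rate)
    learning_rate = learning_rate / learning_rate_decay

for epoch, lr in enumerate(schedule, start=1):
    print('epoch %d: learning rate %.6f' % (epoch, lr))
```

Since 1.414 is close to the square root of 2, the learning rate roughly halves every two epochs.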




In [None]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())

for epoch in range(n_epochs):
    # At the start of each epoch, evaluate the model on the train data
    print('-' * 20 + ' Epoch {} '.format(epoch+1) + 'of {} '.format(n_epochs) + '-' * 20)
    print('Train data evaluation:')
    eval_conll(model, sess, train_tokens, train_tags, short_report=True)
    
    # Train the model
    for x_batch, y_batch, lengths in batches_generator(batch_size, train_tokens, train_tags):
        model.train_on_batch(sess, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability)
        
    # Decaying the learning rate
    learning_rate = learning_rate / learning_rate_decay
    

-------------------- Epoch 1 of 5 --------------------
Train data evaluation:
processed 166639 tokens with 23030 phrases; found: 117508 phrases; correct: 257.

precision:  0.22%; recall:  1.12%; F1:  0.37

-------------------- Epoch 2 of 5 --------------------
Train data evaluation:
processed 166639 tokens with 23030 phrases; found: 24721 phrases; correct: 15519.

precision:  62.78%; recall:  67.39%; F1:  65.00

-------------------- Epoch 3 of 5 --------------------
Train data evaluation:
processed 166639 tokens with 23030 phrases; found: 24216 phrases; correct: 17231.

precision:  71.16%; recall:  74.82%; F1:  72.94

-------------------- Epoch 4 of 5 --------------------
Train data evaluation:
processed 166639 tokens with 23030 phrases; found: 23856 phrases; correct: 17899.

precision:  75.03%; recall:  77.72%; F1:  76.35

-------------------- Epoch 5 of 5 --------------------
Train data evaluation:
processed 166639 tokens with 23030 phrases; found: 24083 phrases; correct: 18565.

pre

In [None]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
train_results = eval_conll(model, sess, train_tokens, train_tags, short_report=False)


print('-' * 20 + ' Test set quality: ' + '-' * 20)
test_results = eval_conll(model, sess, test_tokens, test_tags, short_report=False)


-------------------- Train set quality: --------------------
processed 166639 tokens with 23030 phrases; found: 23811 phrases; correct: 19051.

precision:  80.01%; recall:  82.72%; F1:  81.34

	       Actor: precision:   98.47%; recall:   98.94%; F1:   98.71; predicted:  5034

	       Award: precision:   57.06%; recall:   64.08%; F1:   60.37; predicted:   347

	Character_Name: precision:   85.27%; recall:   90.93%; F1:   88.01; predicted:  1093

	    Director: precision:   91.09%; recall:   96.64%; F1:   93.78; predicted:  1896

	       Genre: precision:   76.29%; recall:   78.37%; F1:   77.32; predicted:  3476

	     Opinion: precision:   52.02%; recall:   57.16%; F1:   54.47; predicted:   890

	      Origin: precision:   50.57%; recall:   57.12%; F1:   53.65; predicted:   880

	        Plot: precision:   69.71%; recall:   71.60%; F1:   70.64; predicted:  6643

	       Quote: precision:   62.69%; recall:   66.67%; F1:   64.62; predicted:   134

	Relationship: precision:   48.01%; reca

Considering that we used a randomly initialised embedding matrix and trained the model for only a few epochs, the model performs reasonably well, achieving a precision of 63%, a recall of 69% and an F-score of 65.47. But there are many improvements that could be made, which are mentioned below.

## Potential Improvements for a better performance of the model: 

1) Hyperparameter optimisation: tuning hyperparameters such as the following could improve the performance of the model. \
-- 	Learning rate \
--	Number of iterations/epochs \
--	Choice of activation function \
--	Regularisation \
--	Number of RNN layers in the network, etc.

2)	NLP models can be hard to train given the high dimensionality associated with them. Transfer learning could be very useful here: it lets us reuse low-level features already learned on similar tasks and build on them.

3)	Using pretrained word embeddings like GloVe or fastText would likely increase the performance of the model, as they provide a better representation of a word than a randomly initialised one. Higher-dimensional word vectors could help further if we have the resources to train with them (for instance, 300-dimensional GloVe vectors typically give better results than 50-dimensional ones).

4)	As an improvement over the static word embeddings mentioned above, we could use contextual word embeddings like BERT or ELMo, which could increase the performance of the model further.

5)	Character-level embeddings have been used in many state-of-the-art NLP models, so adding them could improve performance, as they handle out-of-vocabulary or misspelled words better than word embeddings.

6)	We could also use an ensemble of neural networks to obtain better predictions. But this could be expensive and might not be required unless the task demands a very low error rate.

7) Transformers being the latest state of the art in NLP, we could also implement a transformer-based approach to increase the performance of the model.
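
For point 3, one possible way to plug pretrained vectors into the model is to build the initial embedding matrix from them and fall back to random values for words without a pretrained vector. The sketch below is hypothetical: `build_embedding_matrix`, the toy vocabulary and the toy vectors are invented here, and loading real GloVe or fastText files is not shown.

```python
import numpy as np

def build_embedding_matrix(token2idx, pretrained, embedding_dim, seed=0):
    """Fill rows with pretrained vectors where available, random values elsewhere."""
    rng = np.random.default_rng(seed)
    # Same scaling as the random initialisation used in build_layers.
    matrix = rng.standard_normal((len(token2idx), embedding_dim)) / np.sqrt(embedding_dim)
    n_found = 0
    for token, idx in token2idx.items():
        vector = pretrained.get(token)
        if vector is not None:
            matrix[idx] = vector
            n_found += 1
    return matrix.astype(np.float32), n_found

# Toy stand-ins for the vocabulary and for vectors parsed from a pretrained file.
toy_vocab = {'<UNK>': 0, '<PAD>': 1, 'movie': 2, 'aliens': 3}
toy_vectors = {'movie': np.array([0.1, 0.2, 0.3]),
               'aliens': np.array([0.4, 0.5, 0.6])}

matrix, n_found = build_embedding_matrix(toy_vocab, toy_vectors, embedding_dim=3)
print(matrix.shape, n_found)
```

A matrix built this way could replace `initial_embedding_matrix` in `build_layers`, provided `embedding_dim` matches the dimensionality of the pretrained vectors.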


## References 

- https://www.coursera.org/learn/nlp-sequence-models

- https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/

- https://www.coursera.org/projects/named-entity-recognition-lstm-keras-tensorflow

- http://colah.github.io/posts/2015-08-Understanding-LSTMs/

- https://arxiv.org/abs/1603.01354

- https://www.datacamp.com/community/tutorials/tensorflow-tutorial

- https://github.com/hse-aml

- https://github.com/tensorflow/tensorflow/tree/r1.8

- https://aclanthology.org/L18-1708.pdf

- https://towardsdatascience.com/entity-level-evaluation-for-ner-task-c21fb3a8edf

- https://gist.github.com/jisungk/6e3a111aff72f8e00ec0bb987bf258a4

- https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)