# Sequence Labeling

In this assignment, you will work on the [MeasEval](https://competitions.codalab.org/competitions/25770) shared task that was part of SemEval-2021. The goal of **MeasEval** is  the extraction of counts, measurements, and related context from scientific documents. The task is a complex problem that involves solving a number of steps that range from identifying quantities and units of measurement to identify relationships between them. For this assignment, you will focus only on the *Quantity* recognition step:

*  Given a paragraph from a scientific text, identify all spans containing quantities like *12 kg*. This problem can be approached as a Sequence Labeling task.

You will develop a Recurrent Neural Network with [Keras](https://keras.io/), a high-level Deep Learning API written in **Python** that provides a user-friendly interface for the [TensorFlow](https://www.tensorflow.org/) library, one of the most popular low-level Deep Learning frameworks. You will use the following objects and functions:

In [1]:
import pandas as pd
import numpy as np
from tensorflow.keras.utils import set_random_seed
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, TimeDistributed
from tensorflow.keras.initializers import Constant
from sklearn.metrics import classification_report

When working with Neural Networks, there are a large number of random operations such as initializing the weights of the network, shuffling the data for training, or choosing samples. This causes that different training runs of the same model can lead to different results. To ensure reproducibility, i.e. obtaining the same results in the different runs, the random number generator must be initialized with a fixed value known as seed. In `Keras`, this can be done as follows:

In [2]:
seed = 42
set_random_seed(seed)

When developing a model, if the results you get are not as expected, try re-initializing the seed by running the cell above before compiling and training the model.

> **Note!** With models as complex as Neural Networks, reproducibility is susceptible to factors such as software versions or the hardware on which the models are run. Even with seed initialization, there may be slight differences in the results.

Working with Neural Networks also involves defining a number of hyperparameters that set the configuration of the model. Finding the appropriate hyperparameter values requires training the model with different combinations and testing them on the development set. This hyperparameter tuning is a costly process that needs multiple rounds of experimentation. However, for this assignments, you will use the following values:

In [3]:
maxlen = 130  # Maximum length of the input sequence accepted by the model
epochs = 6  # Number of epochs to train the model
batch_size = 64  # Number of examples used per gradient update
embedding_dim = 300  # Dimension of the embeddings
rnn_units = 256  # Number of units per RNN layer

Training a Deep Learning model with a large train set can be a time-consuming process, as the model needs to iterate over the entire set multiple times, often requiring significant computational resources. During the implementation of the model, it is often a good practice to use only a subset of the training data. This allows a faster debugging of the code. Set the `shrink_dataset` variable as `True` when a faster training is required and set it as `False` to train the model on the whole train set:

In [4]:
shrink_dataset = False

Although the value of this variable does not affect the tests that will evaluate your code, the output examples distributed throughout this notebook are based on a `shrink_dataset` variable set as `False`.

The train set for the assignment consists of 248 articles with 1366 sentences in total. The test set contains 136 articles with 848 sentences. A development set with 65 documents and 459 sentences is also provided. The dataset is annotated at the token level following a BIO schema with 3 labels: *B-Quantity*, *I-Quantity* and *O*.  The dataset can be loaded into three `DataFrames` as follows:

In [5]:
!git clone https://github.com/thanhxuan1995/NLP.git

Cloning into 'NLP'...
remote: Enumerating objects: 243, done.[K
remote: Counting objects: 100% (243/243), done.[K
remote: Compressing objects: 100% (194/194), done.[K
remote: Total 243 (delta 113), reused 80 (delta 29), pack-reused 0 (from 0)[K
Receiving objects: 100% (243/243), 15.14 MiB | 15.25 MiB/s, done.
Resolving deltas: 100% (113/113), done.


In [7]:
import os
new_folder_path = 'NLP/week3/assignment'
os.makedirs(new_folder_path, exist_ok=True)

In [8]:
!unzip NLP/week3/assignment/data.zip -d NLP/week3/assignment/


Archive:  NLP/week3/assignment/data.zip
  inflating: NLP/week3/assignment/data/trial.tsv  
  inflating: NLP/week3/assignment/data/eval.tsv  
  inflating: NLP/week3/assignment/data/train.tsv  


In [9]:
%cd /content/NLP/week3/assignment/

/content/NLP/week3/assignment


In [10]:
def load_data(data_path, shrink_dataset, seed):
    data = pd.read_csv(data_path, sep="\t", encoding="utf8").dropna()
    if shrink_dataset:
        sample = data[["docId",  "sentId"]].drop_duplicates().sample(frac=0.2, random_state=seed)
        data = pd.merge(data, sample, on=["docId", "sentId"])
    return data

In [11]:
train_data = load_data("data/train.tsv", shrink_dataset, seed)
dev_data = load_data("data/trial.tsv", shrink_dataset, seed)
test_data = load_data("data/eval.tsv", shrink_dataset, seed)
train_data[(train_data.docId == "S0378383912000130-3601") & (train_data.sentId == 3)]

Unnamed: 0,docId,sentId,word,lemma,label
22276,S0378383912000130-3601,3,The,the,O
22277,S0378383912000130-3601,3,experiments,experiment,O
22278,S0378383912000130-3601,3,involved,involve,O
22279,S0378383912000130-3601,3,two,two,B-Quantity
22280,S0378383912000130-3601,3,beach,beach,I-Quantity
22281,S0378383912000130-3601,3,materials,material,I-Quantity
22282,S0378383912000130-3601,3,with,with,O
22283,S0378383912000130-3601,3,nominal,nominal,O
22284,S0378383912000130-3601,3,sediment,sediment,O
22285,S0378383912000130-3601,3,diameters,diameter,O


The `DataFrames` created include the lemmatization of words in the `lemma` columns. You will use the lemmas as the input of the model.

## Data Pre-processing

In this assignment, you will have to implement some steps to pre-process and obtain a representation of the data. You will implement a model with an `Embedding` lookup table as the input layer, so the tokens of the input sentences should be represented as indexes. The target labels should also be represented in similar way. Besides, as one would expect, the sentences in the **MeasEval** dataset have different lengths. However, the input for a Deep Learning model is a batch of examples (in this case, sentences) in the form of a single tensor which requires that all examples in the batch must have the same length. Therefore, the sentences should be padded or truncated to a specific length.

> **Note!** For this particular task, the `maxlen` value provided to you guarantees that padding is sufficient to make all sentences the same length without the need for truncation.

This first of these pre-processing steps will be to obtain both a vocabulary and the set of labels from the train set. The vocabulary should be the list of unique lemmas and must include the special tokens `[PAD]`, that will be used for padding the sequences, and `[UNK]`, that will be used to represent out-of-vocabulary words. Along with the vocabulary and the label set, you will also have to build a dictionary mapping each lemma to its position in the vocabulary and a dictionary mapping each label to its position in the label set. These dictionaries will be used later to obtain the representation of the input and output of the model. The text is already tokenized and lemmatized which will help in this task.

You must complete the code for the `get_vocabulary` function that takes as input the `DataFrame` containing the train set. The function should create a list with the all the unique lemmas and include the special tokens `[PAD]` and `[UNK]` in the first two positions. Similarly, the function should create a list with the unique labels with the special token `[PAD]` in the first position. The **pandas** library provides some [functions](https://pandas.pydata.org/docs/reference/index.html) that may help you. Along with those lists, `get_vocabulary` should return the dictionaries mapping the lemmas and the labels to their corresponding positions. In total, the vocabulary and the label set should have 5508 and 4 items respectively:

> Vocabulary size: 5508  
Vocabulary first 5 lemmas: ['[PAD]', '[UNK]', 'datum', 'be', 'draw']  
Vocabulary dictionary: {'[PAD]': 0, '[UNK]': 1, 'datum': 2, 'be': 3, 'draw': 4}  
>
>Labels size: 4  
Labels: ['[PAD]', 'O', 'B-Quantity', 'I-Quantity']  
Labels dictionary: {'[PAD]': 0, 'O': 1, 'B-Quantity': 2, 'I-Quantity': 3}  


## **test_get_vocabulary = 3 Marks**

In [12]:
def get_vocabulary(train_data):
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    xu_list = list(train_data['lemma'].unique())
    vocab = ['[PAD]', '[UNK]'] + xu_list
    word2idx = {w: i for i, w in enumerate(vocab)}
    labels = ['[PAD]'] + list(train_data['label'].unique())
    label2idx = {l: i for i, l in enumerate(labels)}
    return vocab, word2idx, labels, label2idx
    #
    pass

In [13]:
vocab, word2idx, labels, label2idx = get_vocabulary(train_data)
vocab_size = len(vocab)
label_size = len(labels)
print(f"Vocabulary size: {vocab_size}")
print(f"Vocabulary first 5 words: {vocab[:5]}")
print(f"Vocabulary dictionary: { {w: word2idx[w] for w in vocab[:5]}}")
print("")
print(f"Labels size: {label_size}")
print(f"Labels: {labels}")
print(f"Labels dictionary: {label2idx}")

Vocabulary size: 5508
Vocabulary first 5 words: ['[PAD]', '[UNK]', 'datum', 'be', 'draw']
Vocabulary dictionary: {'[PAD]': 0, '[UNK]': 1, 'datum': 2, 'be': 3, 'draw': 4}

Labels size: 4
Labels: ['[PAD]', 'O', 'B-Quantity', 'I-Quantity']
Labels dictionary: {'[PAD]': 0, 'O': 1, 'B-Quantity': 2, 'I-Quantity': 3}


Since the *Quantity* recognition task is a Sequence Labeling problem, the input for the model must be the sequence of lemmas in the sentence and the output the sequence of labels. Therefore, the train, development and test `DataFrames` must be reformated by aggregating the data corresponding to each sentence. The `integrate_sentences` will do this for you. The output of `integrate_sentences` is a `DataFrame` with a row for each sentence and the columns `lemmas` and `labels` that contain the list of lemmas and the list of labels of the sentences respectively.  

In [14]:
def integrate_sentences(data):
    agg_func = lambda s: [s['lemma'].values.tolist(), s['label'].values.tolist()]
    data = data.groupby(["docId", "sentId"], sort=False).apply(agg_func).reset_index().rename(columns={0: 'lemmas_labels'})
    data['lemmas'] = data.apply(lambda x: x['lemmas_labels'][0], axis=1)
    data['labels'] = data.apply(lambda x: x['lemmas_labels'][1], axis=1)
    data = data.drop(columns="lemmas_labels")
    return data

In [15]:
train_examples = integrate_sentences(train_data)
dev_examples = integrate_sentences(dev_data)
test_examples = integrate_sentences(test_data)
pd.set_option('display.max_colwidth', None)
train_examples[(train_examples.docId == "S0378383912000130-3601") & (train_examples.sentId == 3)]

  data = data.groupby(["docId", "sentId"], sort=False).apply(agg_func).reset_index().rename(columns={0: 'lemmas_labels'})
  data = data.groupby(["docId", "sentId"], sort=False).apply(agg_func).reset_index().rename(columns={0: 'lemmas_labels'})
  data = data.groupby(["docId", "sentId"], sort=False).apply(agg_func).reset_index().rename(columns={0: 'lemmas_labels'})


Unnamed: 0,docId,sentId,lemmas,labels
767,S0378383912000130-3601,3,"[the, experiment, involve, two, beach, material, with, nominal, sediment, diameter, of, 1.5, mm, and, 8.5, mm, .]","[O, O, O, B-Quantity, I-Quantity, I-Quantity, O, O, O, O, O, B-Quantity, I-Quantity, I-Quantity, I-Quantity, I-Quantity, O]"


The dataset is now ready for you to get the numerical representation of both input and output. You must perform two steps to process the sequence of lemmas and the sequence of labels:
1. For each sentence, translate each lemma or label to its corresponding index using the `word2idx` and `label2idx` dictionaries. In case the lemma is not found in `word2idx`, use the index of the `[UNK]` token instead.
2. Pad both the sequences of lemmas and the sequences of labels to the same length as defined by the `maxlen` variable. For this, you should use the [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences) function with its default padding strategy. This function uses `0` as the default padding value which corresponds to the index of the `[PAD]` token in the vocabulary.   

You must complete the code for the `format_examples` function. This function takes as input a `DataFrame` in the format returned by `integrate_sentences`, the `word2idx` and `label2idx` dictionaries, and the `maxlen` variable. The function must run the steps described above and return a **numpy** array with the processed lemma sequences and a **numpy** array with the processed label sequences that will be used as input and output of the model respectively. Applying `format_examples` to the train, development and test sets should result on 6 arrays with the following shapes:

>Shape of train input data :  (1366, 130)  
Shape of train output data :  (1366, 130)  
Shape of development input data :  (459, 130)  
Shape of development output data :  (459, 130)  
Shape of test input data :  (848, 130)  
Shape of test output data :  (848, 130)  

## **test_format_examples = 3 Marks**

In [16]:
def format_examples(data, word2idx, label2idx, maxlen):
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    data['lemas_x'] = data.apply(lambda x: [word2idx.get(w, word2idx['[UNK]']) for w in x['lemmas']], axis=1)
    data['labels_x'] = data.apply(lambda x: [label2idx[l] for l in x['labels']], axis=1)
    x = pad_sequences(data['lemas_x'].tolist(), maxlen=maxlen)
    y = pad_sequences(data['labels_x'].tolist(), maxlen=maxlen)
    return x, y
    #
    pass

In [17]:
x_train, y_train = format_examples(train_examples, word2idx, label2idx, maxlen)
x_dev, y_dev = format_examples(dev_examples, word2idx, label2idx, maxlen)
x_test, y_test = format_examples(test_examples, word2idx, label2idx, maxlen)
print("Shape of train input data: ", x_train.shape)
print("Shape of train output data: ", y_train.shape)
print("Shape of development input data: ", x_dev.shape)
print("Shape of development output data: ", y_dev.shape)
print("Shape of train input data: ", x_test.shape)
print("Shape of test output data: ", y_test.shape)

Shape of train input data:  (1366, 130)
Shape of train output data:  (1366, 130)
Shape of development input data:  (459, 130)
Shape of development output data:  (459, 130)
Shape of train input data:  (848, 130)
Shape of test output data:  (848, 130)


## Recurrent Neural Network

There are three ways to create a neural network with **Kerars**: using the [Functional API](https://keras.io/guides/functional_api/), by [Model subclassing](https://keras.io/guides/making_new_layers_and_models_via_subclassing/) or creating a [Sequential model](https://keras.io/guides/sequential_model/). In this assignment, you will use the latter option. A `Sequential` model is a straightforward approach to build simple neural networks by stacking the layers. You will construct a RNN with the following 3 layers:
1. An [Embedding](https://keras.io/api/layers/core_layers/embedding/) layer with an input dimension equal to the vocabulary size, an embedding dimension defined by `embedding_dim` and where the length of the input sequences is equal to `maxlen`. The layer must also mask out the padding values so that they are not considered when computing the loss.
2. A Bidirectional [LSTM](https://keras.io/api/layers/recurrent_layers/lstm/) layer with a number of units determined by `rnn_units`. Since you are working on Sequence Labeling, the `LSTM` must return outputs for the full sequence. To make it Bidirectional, the `LSTM` must be wrapped by a [Bidirectional](https://keras.io/api/layers/recurrent_layers/bidirectional/) layer.
3. A [Dense](https://keras.io/api/layers/core_layers/dense/) layer with a number of units equal to the number of labels and a `softmax` activation function.  Since you are working on Sequence Labeling, the `Dense` layer must be wrapped by a [TimeDistributed](https://keras.io/api/layers/recurrent_layers/time_distributed/) layer.

You must complete the code for the `create_model` function. This function takes as input the size of the vocabulary, the number of labels and the `maxlen`, `embedding_dim` and `rnn_units` hyperparameters. The function must create a RNN according to the configuration described above. Read carefully all the linked documentation to learn how to create such a model.  Any option not mentioned in the description should be kept with its default value. The summary of the resulting model should look like:


> <pre>
> Model: "sequential_1"
> __________________________________________________________________________________________
> Layer (type)                           Output Shape                        Param #       
> ==========================================================================================
> embedding_1 (Embedding)               (None, 130, 300)                    1652400       
>                                                                                          
> bidirectional_1 (Bidirectional)       (None, 130, 512)                    1140736       
>                                                                                          
> time_distributed_1 (TimeDistributed)  (None, 130, 4)                      2052          
>                                                                                          
> ==========================================================================================
> Total params: 2,795,188
> Trainable params: 2,795,188
> Non-trainable params: 0
> __________________________________________________________________________________________
> </pre>

Before returning the model, the `create_model` function should [compile](https://keras.io/api/models/model_training_apis/#compile-method) it using `'sparse_categorical_crossentropy'` as the loss function, `'adam'` as the optimizer and `'sparse_categorical_accuracy'` as a metric to evaluate the model during training.

## **test_create_model = 3 Marks**

4

In [28]:
def create_model(vocab_size, label_size, maxlen, embedding_dim, rnn_units):
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    model = Sequential()
    # Embedding Layer
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxlen, #mask_zero=True,
                        input_shape = (maxlen,)))

    # Bidirectional LSTM Layer
    model.add(Bidirectional(LSTM(units=rnn_units, return_sequences=True)))

    # TimeDistributed Dense Layer
    model.add(TimeDistributed(Dense(label_size, activation='softmax')))

    # Compile the model
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
                  metrics=['sparse_categorical_accuracy'])
    return model
    #
    pass

In [29]:
model = create_model(vocab_size, label_size, maxlen, embedding_dim, rnn_units)
model.summary(line_length=90)

Once the data has been processed and the model has been compiled, you can proceed to train it.

You must complete the `train_model` function. The function takes as input the model created by `create_model`, the train input and output obtained by `format_examples` as well as the development input and output produced by the same function. The function also takes the `batch_size` and `epochs` hyperparameters. The function should train the model on the training data using those hyperparameters. During the training, `train_model` should evaluate the loss and any model metrics on the development data. With `shrink_dataset = False`, the training will take several minutes.

## **test_train_model= 4 Marks**

In [30]:
def train_model(model, x_train, y_train, x_dev, y_dev, batch_size, epochs):
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_dev, y_dev))
    #
    pass

In [31]:
train_model(model, x_train, y_train, x_dev, y_dev, batch_size, epochs)

Epoch 1/6
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m91s[0m 3s/step - loss: 0.4377 - sparse_categorical_accuracy: 0.7913 - val_loss: 0.0622 - val_sparse_categorical_accuracy: 0.9875
Epoch 2/6
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m66s[0m 3s/step - loss: 0.0566 - sparse_categorical_accuracy: 0.9873 - val_loss: 0.0512 - val_sparse_categorical_accuracy: 0.9875
Epoch 3/6
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m92s[0m 3s/step - loss: 0.0473 - sparse_categorical_accuracy: 0.9873 - val_loss: 0.0425 - val_sparse_categorical_accuracy: 0.9882
Epoch 4/6
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m75s[0m 3s/step - loss: 0.0347 - sparse_categorical_accuracy: 0.9890 - val_loss: 0.0314 - val_sparse_categorical_accuracy: 0.9904
Epoch 5/6
[1m22/22[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m83s[0m 3s/step - loss: 0.0227 - sparse_categorical_accuracy: 0.9929 - val_loss: 0.0258 - val_sparse_categorical_accuracy: 0.9923
Epoch 6/6
[1m2

After training, the model can be used to make predictions on unlabeled data using the [predict](https://keras.io/api/models/model_training_apis/#predict-method) method.

You must complete the code for the `make_predictions` function. The functions takes as input the model already trained, the test input data produced by `format_examples` and the `batch_size` hyperparameter. The function must run the `predict` method on the input data using batches of size equal to `batch_size`. The `predict` method will return a **numpy** array with 3 axes: `(number of sentences, maxlen, label_size)`. For each token in each sentence, `predict` returns a vector with the probabilities predicted for every label. The output of `make_predictions` must include only the index of the label with the highest probability for each token. For example, if the prediction for one token is the vector `[0.04974193, 0.1511916, 0.65180656, 0.14725993]`, the output for that token should be `2`. For this, you can apply the [argmax](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) method along the last axis of the **numpy** array.

## **test_make_predictions= 3 Marks**

In [32]:
def make_predictions(model, x_test, batch_size):
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    result = model.predict(x_test, batch_size = batch_size)
    return np.argmax(result, axis=-1)
    #
    pass

In [33]:
predictions = make_predictions(model, x_test, batch_size)

[1m14/14[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 572ms/step


Since the predictions are now label indexes, they can be translated to the corresponding label by accessing the `labels` list. The `predictions_to_labels` functions iterates over all the sequences in the test set and translates the prediction of each token to the corresponding label. The function skips the padding tokens. The new format of the predictions can be stored in the `prediction` column of the test `DataFrame`.

In [34]:
def predictions_to_labels(predictions, x_test, labels):
    pred_labels = []
    for pred_seq, x_seq in zip(predictions, x_test):
        pred_seq_labels = [labels[p] for p, x in zip(pred_seq, x_seq) if x!=0]
        pred_labels.extend(pred_seq_labels)
    return pred_labels

In [35]:
test_data['prediction'] = predictions_to_labels(predictions, x_test, labels)
test_data[(test_data.docId == "S0038071711004354-1624") & (test_data.sentId == 2)]

Unnamed: 0,docId,sentId,word,lemma,label,prediction
7328,S0038071711004354-1624,2,Approximately,approximately,B-Quantity,B-Quantity
7329,S0038071711004354-1624,2,15,15,I-Quantity,B-Quantity
7330,S0038071711004354-1624,2,min,min,I-Quantity,I-Quantity
7331,S0038071711004354-1624,2,before,before,O,O
7332,S0038071711004354-1624,2,the,the,O,O
7333,S0038071711004354-1624,2,beginning,beginning,O,O
7334,S0038071711004354-1624,2,of,of,O,O
7335,S0038071711004354-1624,2,the,the,O,O
7336,S0038071711004354-1624,2,experiment,experiment,O,O
7337,S0038071711004354-1624,2,",",",",O,O


Although it is not the usual way of evaluating Sequence Labeling tasks, **MeasEval** uses a metric based on the Reading Comprehension *Macro-Averaged F1*. This metric measures the amount of overlapping tokens between the predictions and the true labels. In this assignment, we will approximate this metric by evaluating how many tokens belonging to a *Quantity* are captured by the model. This is done by the `evaluate` function. For the model trained above, the result of this evaluation should look like:

> <pre>
>               precision    recall  f1-score   support
>
>     Quantity       0.85      0.69      0.76      1263
>
>    micro avg       0.85      0.69      0.76      1263
>    macro avg       0.85      0.69      0.76      1263
> weighted avg       0.85      0.69      0.76      1263
> </pre>

In [36]:
def evaluate(data):
    labels = data.apply(lambda x: x['label'].replace("B-", "").replace("I-", ""), axis=1).values
    predictions = data.apply(lambda x: x['prediction'].replace("B-", "").replace("I-", ""), axis=1).values
    print(classification_report(labels, predictions, labels=["Quantity"]))

In [37]:
evaluate(test_data)

              precision    recall  f1-score   support

    Quantity       0.91      0.67      0.77      1263

   micro avg       0.91      0.67      0.77      1263
   macro avg       0.91      0.67      0.77      1263
weighted avg       0.91      0.67      0.77      1263



## Pre-trained Word Embeddings

Initializing neural networks with pre-trained word embeddings has a significant impact on many NLP tasks. In the following exercise, you will experiment whether this is also the case for *Quantity* recognition using **GloVe**. You can refer to the following tutorial to learn how to complete this exercise with **keras**: [Using pre-trained word embeddings
](https://keras.io/examples/nlp/pretrained_word_embeddings/)


First, the `load_embeddings` function will load **GloVe** and return a dictionary mapping words to their embeddings.

In [38]:
def load_embeddings(glove_path):
    embedding_index = {}
    with open(glove_path, encoding="utf8") as glove_file:
        for line in glove_file:
            word, coefs = line.split(maxsplit=1)
            coefs = np.fromstring(coefs, "f", sep=" ")
            embedding_index[word] = coefs
    return embedding_index

In [None]:
glove_path = f"glove/glove.6B.{embedding_dim}d.txt"
embedding_index = load_embeddings(glove_path)

To initialize the `Embedding` layer with the **GloVe** embeddings, you have to create a matrix with `(vocab_size, embedding_dim)` dimensions. The *i-th* row in the matrix corresponds to the *i-th* lemma in the vocabulary and contains the **GloVe** embedding for that lemma.

You must complete the code for the `create_embedding_matrix` function. The function takes the embedding dictionary created by `load_embeddings`, the vocabulary dictionary, the size of the vocabulary and the `embedding_dim` hyperparameter. The function should initialize a `(vocab_size, embedding_dim)` **numpy** array with zeros and then replace each row with the appropriate **GloVe** embedding if the corresponding lemma exists in the embedding dictionary. For example, the embedding for "*statistic*" should exist in the resulting `embedding_matrix`:
> <pre>
array([ 0.1085    ,  0.82801998,  0.10672   ,  0.0094136 , -0.30441001,
        0.75617999, -0.14704999, -0.15469   , -0.97372001, -0.60413003,
        0.065233  , -0.055324  , -0.094477  ,  0.23502   ,  0.16466001,
        ...
</pre>

### **test_create_embedding_matrix= 2 Marks**

In [None]:
def create_embedding_matrix(embedding_index, word2idx, vocab_size, embedding_dim):
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    pass

In [None]:
embedding_matrix = create_embedding_matrix(embedding_index, word2idx, vocab_size, embedding_dim)
embedding_matrix[word2idx["statistic"]]

Finally, you can create a new model and load the pre-trained word embedding matrix into the `Embedding` layer.

You must complete the code for the `create_model_with_embeddings` function. This function takes as input the size of the vocabulary, the number of labels, the `maxlen`, `embedding_dim` and `rnn_units` hyperparameters, and the embedding matrix created by `create_embedding_matrix`. The function should construct, compile and return a RNN equal to the one built in `create_model` with the only difference that the `Embedding` layer must be initialized with the embedding matrix. Use the [Constant](https://keras.io/api/layers/initializers/) initializer for this purpose. For this task, the Embedding layer must be kept trainable so the embeddings can be updated during training.

The configuration of the `Embedding` layer for this version of the RNN should look like:
> <pre>
> {'name': 'embedding_1',
>  'trainable': True,
>  'batch_input_shape': (None, 130),
>  'dtype': 'float32',
>  'input_dim': 5508,
>  'output_dim': 300,
>  'embeddings_initializer': {'class_name': 'Constant',
>   'config': {'value': array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
>             0.        ,  0.        ],
>           [ 0.        ,  0.        ,  0.        , ...,  0.        ,
>             0.        ,  0.        ],
>           [ 0.72004002,  0.80954999,  0.77170002, ...,  0.39351001,
>            -0.47082999, -0.60759002],
>           ...,
>           [ 0.        ,  0.        ,  0.        , ...,  0.        ,
>             0.        ,  0.        ],
>           [-0.40123999, -0.27991   , -0.42445999, ...,  0.45576   ,
>             0.61864001, -0.30489001],
>           [ 0.        ,  0.        ,  0.        , ...,  0.        ,
>             0.        ,  0.        ]])}},
>  'embeddings_regularizer': None,
>  'activity_regularizer': None,
>  'embeddings_constraint': None,
>  'mask_zero': True,
>  'input_length': 130}
> </pre>

## **test_create_model_with_embeddings= 2 Marks**

In [None]:
def create_model_with_embeddings(vocab_size, label_size, maxlen, embedding_dim, rnn_units, embedding_matrix):
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    pass

In [None]:
model_with_embeddings = create_model_with_embeddings(vocab_size, label_size, maxlen, embedding_dim, rnn_units, embedding_matrix)
model_with_embeddings.get_layer(index=0).get_config()

Initializing the `Embedding` layer with **GloVe** embeddings should have a positive impact on the model performance for *Quantity* recognition by improving the recall.

> <pre>
>               precision    recall  f1-score   support
>
>     Quantity       0.85      0.79      0.82      1263
>
>    micro avg       0.85      0.79      0.82      1263
>    macro avg       0.85      0.79      0.82      1263
> weighted avg       0.85      0.79      0.82      1263
> </pre>

In [None]:
train_model(model_with_embeddings, x_train, y_train, x_dev, y_dev, batch_size, epochs)

In [None]:
predictions_with_embeddings = make_predictions(model_with_embeddings, x_test, batch_size)
test_data['prediction'] = predictions_to_labels(predictions_with_embeddings, x_test, labels)

In [None]:
evaluate(test_data)