# Named Entity Recognition (NER)

Named Entity Recognition (NER) is an important task in natural language processing. In this assignment you will implement a neural network model for NER, using an approach called the Sliding Window Neural Network. The dataset is composed of sentences. The dataframe already has each word parsed in one column and the corresponding label (entity) in the second column. We will build a "window" model: the idea is to use a 5-word window to predict the named entity of the middle word. The first observation in our data is the five words shown in the `data.head()` output below, and its label is the entity for the middle word 'expression'.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
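from ner import *
# np and DataLoader are used below; import them explicitly rather than relying on `from ner import *`
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader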

In [3]:
data = pd.read_csv("data/Genia4ERtask1.iob2", sep="\t", header=None, names=["word", "label"])

In [4]:
data.head()

|   | word | label |
|---|------|-------|
| 0 | IL-2 | B-DNA |
| 1 | gene | I-DNA |
| 2 | expression | O |
| 3 | and | O |
| 4 | NF-kappa | B-protein |


In [5]:
tiny_data = pd.read_csv("data/tiny.ner.train", sep="\t", header=None, names=["word", "label"])

The second observation is the 5 words starting with 'gene', and its label is the entity for the word 'and'. Each observation therefore has 5 features (categorical variables), which are words. We will use a word embedding to represent each value of the categorical features. For each observation, we concatenate the 5 word embeddings for that observation. The vector of concatenated embeddings is fed to a linear layer.
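As a rough illustration of the forward pass described above, the NumPy sketch below looks up five embeddings, concatenates them, and applies one linear layer. The sizes, the random weights, and the window indices are made-up values for illustration only. The assignment's `NERModel` is not shown in this notebook and may differ (for example, it may add hidden layers or dropout).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the assignment's values)
vocab_size, emb_size, window, n_class = 50, 4, 5, 11

# Embedding table: one row of length emb_size per word index
embedding = rng.normal(size=(vocab_size, emb_size))
# Linear layer mapping the concatenated embeddings to class scores
W = rng.normal(size=(window * emb_size, n_class))
b = np.zeros(n_class)

def forward(x):
    """x: integer array of shape (batch, window) holding word indices."""
    emb = embedding[x]                   # (batch, window, emb_size)
    flat = emb.reshape(x.shape[0], -1)   # concatenate the 5 embeddings per row
    return flat @ W + b                  # (batch, n_class) class scores

x = np.array([[11, 30, 26, 18, 13]])     # one 5-word window (made-up indices)
scores = forward(x)
print(scores.shape)                      # (1, 11)
```

The predicted entity for the middle word would be the class with the highest score, e.g. `scores.argmax(axis=1)`.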

## Split dataset

In [6]:
N = int(data.shape[0]*0.8)
N

394040

In [7]:
train_df = data.iloc[:N,].copy()
valid_df = data.iloc[N:,].copy()

In [8]:
train_df.shape, valid_df.shape

((394040, 2), (98511, 2))

## Word and label to index mapping

In [9]:
vocab2index = label_encoding(train_df["word"].values)
label2index = label_encoding(train_df["label"].values)

In [10]:
len(label2index)

11

In [11]:
label2index

{'B-DNA': 0,
 'B-RNA': 1,
 'B-cell_line': 2,
 'B-cell_type': 3,
 'B-protein': 4,
 'I-DNA': 5,
 'I-RNA': 6,
 'I-cell_line': 7,
 'I-cell_type': 8,
 'I-protein': 9,
 'O': 10}
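
The mapping above is consistent with assigning indices to the distinct labels in sorted order. The helper below is a minimal sketch of that idea. `label_encoding_sketch` is a hypothetical name, and the real `label_encoding` in `ner` is not shown here and may be implemented differently.

```python
def label_encoding_sketch(values):
    """Map each distinct value to an integer index, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Made-up label sequence for illustration
labels = ["O", "B-DNA", "I-DNA", "O", "B-protein", "I-protein"]
mapping = label_encoding_sketch(labels)
print(mapping)
# {'B-DNA': 0, 'B-protein': 1, 'I-DNA': 2, 'I-protein': 3, 'O': 4}
```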

## Label Encoding categorical variables

In [12]:
tiny_vocab2index = label_encoding(tiny_data["word"].values)
tiny_label2index = label_encoding(tiny_data["label"].values)
tiny_data_enc = dataset_encoding(tiny_data, tiny_vocab2index, tiny_label2index)

In [13]:
actual = np.array([17, 53, 31, 25, 44, 41, 32,  0, 11,  1])
assert(np.array_equal(tiny_data_enc.iloc[30:40].word.values, actual))
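
The validation set can contain words that never appear in the training vocabulary. The sketch below shows one common way to handle this: all unseen words share a single extra index. This is an assumption suggested by the later `vocab_size = len(vocab2index)+1`; the actual `dataset_encoding` in `ner` is not shown, and `encode_words` is a hypothetical helper.

```python
def encode_words(words, vocab2index):
    """Replace each word by its index; unseen words share one extra index."""
    unknown = len(vocab2index)  # assumed reserved index for unseen words
    return [vocab2index.get(w, unknown) for w in words]

# Made-up vocabulary for illustration
vocab = {"IL-2": 0, "and": 1, "gene": 2}
print(encode_words(["gene", "and", "NF-kappa"], vocab))  # [2, 1, 3]
```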

## Dataset definition

In [14]:
tiny_ds = NERDataset(tiny_data_enc)

In [15]:
len(tiny_ds)

93

In [16]:
x, y = tiny_ds[0]
x,y

(array([11, 30, 26, 18, 13]), 6)

In [17]:
x, y = tiny_ds[0]
assert(np.array_equal(x, np.array([11, 30, 26, 18, 13])))
assert(y == 6)
assert(len(tiny_ds) == 93)
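
The windowing behaviour checked above can be sketched as follows. `WindowDatasetSketch` is a hypothetical stand-in, not the assignment's `NERDataset`, and the inputs are made-up indices. Under this scheme 97 tokens yield 93 windows, which would match the length above if the tiny file has 97 rows (an inference, not something the notebook states).

```python
import numpy as np

class WindowDatasetSketch:
    """Hypothetical stand-in for NERDataset: 5-word windows, middle-word label."""

    def __init__(self, words, labels, window=5):
        self.words = np.asarray(words)
        self.labels = np.asarray(labels)
        self.window = window

    def __len__(self):
        # A sequence of n tokens yields n - window + 1 full windows
        return len(self.words) - self.window + 1

    def __getitem__(self, idx):
        x = self.words[idx: idx + self.window]
        y = self.labels[idx + self.window // 2]  # label of the middle word
        return x, y

words = np.arange(97)         # 97 made-up word indices
labels = np.arange(97) % 11   # made-up label indices
ds = WindowDatasetSketch(words, labels)
print(len(ds))                # 93
x, y = ds[1]
print(x, y)                   # [1 2 3 4 5] 3
```

Because the class defines `__len__` and `__getitem__`, an object like this could in principle be passed to a PyTorch `DataLoader`, which is how `train_ds` and `valid_ds` are used later in the notebook.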

## Testing

In [18]:
# encoding datasets
train_df_enc = dataset_encoding(train_df, vocab2index, label2index)
valid_df_enc = dataset_encoding(valid_df, vocab2index, label2index)

In [19]:
# creating datasets
train_ds =  NERDataset(train_df_enc)
valid_ds = NERDataset(valid_df_enc)

# dataloaders
batch_size = 10000
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

In [20]:
vocab_size = len(vocab2index)+1
n_class = len(label2index)
emb_size = 100

model = NERModel(vocab_size, n_class, emb_size)
optimizer = get_optimizer(model, lr = 0.01, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

train loss  0.756 val loss 0.406 and accuracy 0.877
train loss  0.319 val loss 0.326 and accuracy 0.899
train loss  0.251 val loss 0.302 and accuracy 0.905
train loss  0.217 val loss 0.296 and accuracy 0.908
train loss  0.196 val loss 0.297 and accuracy 0.908
train loss  0.181 val loss 0.287 and accuracy 0.911
train loss  0.170 val loss 0.316 and accuracy 0.906
train loss  0.162 val loss 0.312 and accuracy 0.907
train loss  0.157 val loss 0.317 and accuracy 0.908
train loss  0.151 val loss 0.306 and accuracy 0.908


In [21]:
optimizer = get_optimizer(model, lr = 0.001, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

train loss  0.134 val loss 0.294 and accuracy 0.912
train loss  0.129 val loss 0.297 and accuracy 0.911
train loss  0.126 val loss 0.298 and accuracy 0.911
train loss  0.125 val loss 0.299 and accuracy 0.911
train loss  0.123 val loss 0.303 and accuracy 0.910
train loss  0.122 val loss 0.304 and accuracy 0.910
train loss  0.121 val loss 0.304 and accuracy 0.910
train loss  0.120 val loss 0.307 and accuracy 0.910
train loss  0.119 val loss 0.306 and accuracy 0.910
train loss  0.118 val loss 0.310 and accuracy 0.909


In [22]:
valid_loss, valid_acc = valid_metrics(model, valid_dl)

In [23]:
valid_loss, valid_acc

(0.3096579613748972, 0.9094785142172638)
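
For intuition about what the two numbers above measure, here is a NumPy sketch of a mean cross-entropy loss and a per-word accuracy computed from raw class scores. It assumes the model is trained with cross-entropy, which the notebook does not state. `metrics_sketch` is a hypothetical helper, not the `valid_metrics` function from `ner`, and the scores are made up.

```python
import numpy as np

def metrics_sketch(scores, y):
    """Mean cross-entropy loss and accuracy from raw class scores."""
    # Numerically stable log-softmax over the class dimension
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(y)), y].mean()
    acc = (scores.argmax(axis=1) == y).mean()
    return loss, acc

# Three made-up examples with three classes each
scores = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [1.0, 0.0, 0.0]])
y = np.array([0, 1, 2])
loss, acc = metrics_sketch(scores, y)
print(round(float(acc), 3))   # 0.667
```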

In [24]:
assert(np.abs(valid_loss - 0.3) < 0.02)

In [25]:
assert(np.abs(valid_acc - 0.9) < 0.01)