# Named Entity Recognition (NER)

Named Entity Recognition (NER) is an important task in natural language processing. In this assignment you will implement a neural network model for NER, using an approach called a sliding-window neural network. The dataset is composed of sentences. The dataframe already has each word parsed in one column and the corresponding label (entity) in the second column. We will build a "window" model: the idea is to use a 5-word window to predict the named entity of the middle word. The first five rows loaded below form the first observation in our data, and its label is the entity of the middle word:

In [1]:
%load_ext autoreload
%autoreload 2

In [4]:
from ner import *
import pandas as pd
from sklearn.model_selection import train_test_split

In [5]:
data = pd.read_csv("data/Genia4ERtask1.iob2", sep="\t", header=None, names=["word", "label"])

In [6]:
data.head()

| | word | label |
|---|---|---|
| 0 | IL-2 | B-DNA |
| 1 | gene | I-DNA |
| 2 | expression | O |
| 3 | and | O |
| 4 | NF-kappa | B-protein |


In [7]:
tiny_data = pd.read_csv("data/tiny.ner.train", sep="\t", header=None, names=["word", "label"])

The second observation is the 5 words starting with 'gene', and its label is the entity for the word 'and'. We have 5 features (categorical variables), which are words. We will use a word embedding to represent each value of the categorical features. For each observation, we concatenate the 5 word embeddings for that observation. The vector of concatenated embeddings is fed to a linear layer.
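To make the windowing concrete, here is a minimal sketch in plain Python. The first five words and labels are the rows printed by `data.head()`; the sixth token and its label are placeholders invented for this example, and `make_windows` is a hypothetical helper, not a function from `ner.py`.

```python
# First five words/labels come from data.head(); the sixth pair is a
# made-up placeholder so that a second window can be formed.
words = ["IL-2", "gene", "expression", "and", "NF-kappa", "<next-word>"]
labels = ["B-DNA", "I-DNA", "O", "O", "B-protein", "<next-label>"]


def make_windows(words, labels, window=5):
    """Return (window_words, middle_label) pairs for every full window."""
    middle = window // 2  # offset of the word whose label is predicted
    return [
        (words[i:i + window], labels[i + middle])
        for i in range(len(words) - window + 1)
    ]


observations = make_windows(words, labels)
for window_words, label in observations:
    print(window_words, "->", label)
```

The first pair holds the five words starting with 'IL-2' and the label of 'expression'; the second holds the five words starting with 'gene' and the label of 'and'.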

## Split dataset

In [8]:
N = int(data.shape[0]*0.8)
N

394040

In [9]:
train_df = data.iloc[:N,].copy()
valid_df = data.iloc[N:,].copy()

In [10]:
train_df.shape, valid_df.shape

((394040, 2), (98511, 2))

## Word and label to index mapping

In [65]:
vocab2index = label_encoding(train_df["word"].values)
label2index = label_encoding(train_df["label"].values)

In [10]:
len(label2index)

11

In [11]:
label2index

{'B-DNA': 0,
 'B-RNA': 1,
 'B-cell_line': 2,
 'B-cell_type': 3,
 'B-protein': 4,
 'I-DNA': 5,
 'I-RNA': 6,
 'I-cell_line': 7,
 'I-cell_type': 8,
 'I-protein': 9,
 'O': 10}
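The implementation of `label_encoding` lives in `ner.py` and is not shown in this notebook. The mapping printed above is consistent with assigning indices to the distinct values in sorted order, so a minimal sketch under that assumption could look like this (`label_encoding_sketch` is a hypothetical name):

```python
def label_encoding_sketch(values):
    """Map each distinct value to an integer index, in sorted order.

    Assumption: the real label_encoding in ner.py may differ in detail.
    """
    return {value: index for index, value in enumerate(sorted(set(values)))}


labels = ["O", "B-DNA", "I-DNA", "O", "B-protein", "I-protein",
          "B-cell_type", "I-cell_type", "B-cell_line", "I-cell_line",
          "B-RNA", "I-RNA"]
mapping = label_encoding_sketch(labels)
print(mapping["B-DNA"], mapping["O"])  # 0 10
```

Uppercase letters sort before lowercase ones, which is why 'B-RNA' comes before 'B-cell_line' in the mapping.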

## Label Encoding categorical variables

In [70]:
tiny_vocab2index = label_encoding(tiny_data["word"].values)
tiny_label2index = label_encoding(tiny_data["label"].values)
tiny_data_enc = dataset_encoding(tiny_data, tiny_vocab2index, tiny_label2index)

In [71]:
actual = np.array([17, 53, 31, 25, 44, 41, 32,  0, 11,  1])
assert(np.array_equal(tiny_data_enc.iloc[30:40].word.values, actual))
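Encoding replaces every word by its index in the vocabulary. How the real `dataset_encoding` handles words that are missing from the vocabulary is not shown here. The sketch below assumes that all unseen words share one extra index, which would be consistent with `vocab_size = len(vocab2index)+1` used later in this notebook. The helper name and the toy vocabulary are hypothetical.

```python
def encode_words(words, vocab2index):
    """Replace each word by its index; unseen words share one extra index."""
    unknown_index = len(vocab2index)  # assumption: one index reserved for unknowns
    return [vocab2index.get(word, unknown_index) for word in words]


# Toy vocabulary, made up for this example.
vocab2index = {"IL-2": 0, "and": 1, "expression": 2, "gene": 3}
encoded = encode_words(["gene", "expression", "and", "NF-kappa"], vocab2index)
print(encoded)  # [3, 2, 1, 4]
```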

## Dataset definition

In [82]:
idx = 0
tiny_data_enc.word[idx:idx+5].to_numpy()
tiny_data_enc.label[idx+2]

6

In [83]:
tiny_ds = NERDataset(tiny_data_enc)

In [86]:
len(tiny_ds)

93

In [87]:
x, y = tiny_ds[0]
x,y

(array([11, 30, 26, 18, 13]), 6)

In [88]:
x, y = tiny_ds[0]
assert(np.array_equal(x, np.array([11, 30, 26, 18, 13])))
assert(y == 6)
assert(len(tiny_ds) == 93)
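The checks above suggest that item `idx` is the five word indices starting at row `idx`, paired with the label at row `idx + 2`, and that only full windows count as observations. A minimal, framework-free sketch of such a dataset follows. `WindowDatasetSketch` is a hypothetical stand-in for `NERDataset`, whose actual code is in `ner.py`; the example values are the first five encoded validation rows shown later in this notebook.

```python
class WindowDatasetSketch:
    """Serve (5 word indices, label of the middle word) pairs."""

    def __init__(self, words, labels, window=5):
        self.words = list(words)
        self.labels = list(labels)
        self.window = window

    def __len__(self):
        # Only positions where a full window fits are valid observations.
        return len(self.words) - self.window + 1

    def __getitem__(self, idx):
        x = self.words[idx:idx + self.window]
        y = self.labels[idx + self.window // 2]
        return x, y


ds = WindowDatasetSketch([7042, 17256, 10191, 44, 19261], [2, 7, 7, 10, 10])
print(len(ds), ds[0])  # 1 ([7042, 17256, 10191, 44, 19261], 7)
```

With this definition, a table of n rows yields n - 4 observations.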

## Testing

In [255]:
valid_df_enc

| | word | label |
|---|---|---|
| 394040 | 7042 | 2 |
| 394041 | 17256 | 7 |
| 394042 | 10191 | 7 |
| 394043 | 44 | 10 |
| 394044 | 19261 | 10 |
| ... | ... | ... |
| 492546 | 9939 | 10 |
| 492547 | 13637 | 10 |
| 492548 | 13779 | 4 |
| 492549 | 17544 | 10 |


In [288]:
train_ds = NERDataset(train_df_enc)
valid_ds = NERDataset(valid_df_enc)

In [290]:
valid_ds[0]

(array([ 7042, 17256, 10191,    44, 19261]), 7)

In [270]:
train_df_enc

| | word | label |
|---|---|---|
| 0 | 5052 | 0 |
| 1 | 12638 | 5 |
| 2 | 12206 | 10 |
| 3 | 9097 | 10 |
| 4 | 6171 | 4 |
| ... | ... | ... |
| 394035 | 725 | 9 |
| 394036 | 44 | 10 |
| 394037 | 18748 | 10 |
| 394038 | 12203 | 10 |


In [287]:
valid_df_enc.iloc[0,0]

7042

In [277]:
df = valid_df_enc.to_numpy()

In [283]:
df[idx:idx+5, 0].shape

(5,)

In [262]:
len(valid_df_enc)

98511

In [264]:
valid_ds

KeyError: 2

In [221]:
# encoding datasets
train_df_enc = dataset_encoding(train_df, vocab2index, label2index)
valid_df_enc = dataset_encoding(valid_df, vocab2index, label2index)

In [222]:
train_df_enc.shape

(394040, 2)

In [291]:
# creating datasets
train_ds = NERDataset(train_df_enc)
valid_ds = NERDataset(valid_df_enc)

# dataloaders
batch_size = 10000
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

In [292]:
valid_ds[1]

(array([17256, 10191,    44, 19261, 18482]), 10)

In [293]:
next(iter(valid_dl))

[tensor([[ 7042, 17256, 10191,    44, 19261],
         [17256, 10191,    44, 19261, 18482],
         [10191,    44, 19261, 18482, 15557],
         ...,
         [ 8175, 17356, 14585, 12182, 11377],
         [17356, 14585, 12182, 11377, 13490],
         [14585, 12182, 11377, 13490, 18482]]),
 tensor([ 7, 10, 10,  ..., 10, 10, 10])]

In [294]:
valid_dl

<torch.utils.data.dataloader.DataLoader at 0x7f9672839ee0>

In [298]:
class NERModel(nn.Module):
    def __init__(self, vocab_size, n_class, emb_size=50, seed=3):
        """Initialize an embedding layer and a linear layer
        """
        super(NERModel, self).__init__()
        torch.manual_seed(seed)
        ### BEGIN SOLUTION
        self.word_emb = nn.Embedding(vocab_size, emb_size)
        self.linear = nn.Linear(emb_size*5, n_class)

        ### END SOLUTION
        
    def forward(self, x):
        """Apply the model to x
        
        1. x has shape (N, 5). Look up the embeddings for x
        2. Reshape (or concatenate) the embeddings so that x has shape
           (N, 5*emb_size); .flatten works
        3. Apply a linear layer
        """
        ### BEGIN SOLUTION
        x = self.word_emb(x)        # (N, 5) -> (N, 5, emb_size)
        x = x.flatten(start_dim=1)  # (N, 5, emb_size) -> (N, 5*emb_size)
        x = self.linear(x)          # (N, 5*emb_size) -> (N, n_class)
        ### END SOLUTION
        return x
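The shape bookkeeping in `forward` can be checked without PyTorch. The sketch below mimics the three steps (embedding lookup, flattening, linear layer) with NumPy arrays. All sizes and the random weights are toy values chosen for illustration, not the ones used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_size, n_class, batch = 20, 4, 3, 2  # toy sizes

emb = rng.normal(size=(vocab_size, emb_size))   # embedding table
W = rng.normal(size=(n_class, 5 * emb_size))    # linear layer weights
b = np.zeros(n_class)                           # linear layer bias

x = np.array([[1, 2, 3, 4, 5],
              [2, 3, 4, 5, 6]])                 # (batch, 5) word indices

looked_up = emb[x]                              # (batch, 5, emb_size)
flat = looked_up.reshape(batch, 5 * emb_size)   # (batch, 5 * emb_size)
logits = flat @ W.T + b                         # (batch, n_class)
print(looked_up.shape, flat.shape, logits.shape)  # (2, 5, 4) (2, 20) (2, 3)
```

Flattening keeps the word order: the first `emb_size` entries of each row are the embedding of the first word in the window, the next `emb_size` entries belong to the second word, and so on.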

In [305]:

def train_model(model, optimizer, train_dl, valid_dl, epochs=10):
    for i in range(epochs):
        train_loss = []
        model.train()
        ### BEGIN SOLUTION
        loss_fn = nn.CrossEntropyLoss()  # expects logits and integer class labels
        for x, y in train_dl:
            y_pred = model(x)          # logits, shape (N, n_class)
            loss = loss_fn(y_pred, y)  # no one-hot encoding or softmax needed
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss.append(loss.item())
        ### END SOLUTION
        valid_loss, valid_acc = valid_metrics(model, valid_dl)
        print(np.mean(train_loss), valid_loss, valid_acc)
        # Formatted alternative:
        # print("train loss  %.3f val loss %.3f and accuracy %.3f" % (
        #     np.mean(train_loss), valid_loss, valid_acc))

def valid_metrics(model, valid_dl):
    model.eval()
    ### BEGIN SOLUTION
    losses = []
    y_hats = []
    ys = []
    loss_fn = nn.CrossEntropyLoss()  # expects logits and integer class labels
    with torch.no_grad():  # gradients are not needed for evaluation
        for x, y in valid_dl:
            y_pred = model(x)  # logits, shape (N, n_class)
            loss = loss_fn(y_pred, y)
            # The argmax of the logits is also the argmax of the softmax
            y_hats.append(y_pred.argmax(dim=1).numpy())
            ys.append(y.numpy())
            losses.append(loss.item())

    ys = np.concatenate(ys)
    y_hats = np.concatenate(y_hats)
    val_loss = np.mean(losses)  # mean of the per-batch mean losses
    val_acc = (ys == y_hats).sum()/len(ys)
    ### END SOLUTION
    return val_loss, val_acc
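The loss used in training and validation is the cross-entropy between the logits and the true class. `nn.CrossEntropyLoss` accepts integer class indices as targets, and the pure-Python sketch below illustrates for a single example why a one-hot target and an integer class index give the same value. The helper names and the logits are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def cross_entropy_index(logits, target):
    """Cross-entropy with an integer class label."""
    return -log_softmax(logits)[target]


def cross_entropy_one_hot(logits, one_hot):
    """Cross-entropy with a one-hot (probability) target."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))


logits = [2.0, 0.5, -1.0]
loss_index = cross_entropy_index(logits, 1)
loss_one_hot = cross_entropy_one_hot(logits, [0.0, 1.0, 0.0])
print(math.isclose(loss_index, loss_one_hot))  # True
```

The one-hot vector only selects the log-probability of the true class, so passing the class index directly avoids building the one-hot tensor at all.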

In [306]:
vocab_size = len(vocab2index)+1
n_class = len(label2index)
emb_size = 100

model = NERModel(vocab_size, n_class, emb_size)
optimizer = get_optimizer(model, lr = 0.01, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

0.7339784786105156 0.40738130211830137 0.8781406397514897
0.31660365015268327 0.32812400460243224 0.8986975544885135
0.2499956078827381 0.3057739049196243 0.9046159156202097
0.2154882848262787 0.29571646749973296 0.9080775985462962
0.1943738043308258 0.3014722615480423 0.9071436547656512
0.1795968994498253 0.2902646899223328 0.9112144314617235
0.16924089901149272 0.3090768575668335 0.907427898524978
0.1618583507835865 0.30718915462493895 0.9081385079232948
0.1561602033674717 0.2988574832677841 0.9100470017359172
0.15191591382026673 0.3098491162061691 0.9084532063711208


In [307]:
optimizer = get_optimizer(model, lr = 0.001, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

0.13339512664824724 0.30031385719776155 0.910229729866913
0.1282341878861189 0.30028333365917204 0.9103718517465764
0.1258051436394453 0.30114724636077883 0.9104632158120742
0.1243304267525673 0.30344178676605227 0.9099759407960856
0.12317418307065964 0.30476149916648865 0.9101789720527476
0.12187269330024719 0.3058551371097565 0.9100774564244165
0.12052327319979668 0.3047191560268402 0.910128214238582
0.11957029327750206 0.3058358788490295 0.9102703361182454
0.11869703885167837 0.30654798448085785 0.9100266986102511
0.11769825760275125 0.30898159444332124 0.9099759407960856


In [21]:
optimizer = get_optimizer(model, lr = 0.001, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

train loss  0.134 val loss 0.294 and accuracy 0.912
train loss  0.129 val loss 0.297 and accuracy 0.911
train loss  0.126 val loss 0.298 and accuracy 0.911
train loss  0.125 val loss 0.299 and accuracy 0.911
train loss  0.123 val loss 0.303 and accuracy 0.910
train loss  0.122 val loss 0.304 and accuracy 0.910
train loss  0.121 val loss 0.304 and accuracy 0.910
train loss  0.120 val loss 0.307 and accuracy 0.910
train loss  0.119 val loss 0.306 and accuracy 0.910
train loss  0.118 val loss 0.310 and accuracy 0.909


In [308]:
valid_loss, valid_acc = valid_metrics(model, valid_dl)

In [309]:
valid_loss, valid_acc

(0.30898159444332124, 0.9099759407960856)

In [310]:
assert(np.abs(valid_loss - 0.3) < 0.02)

In [311]:
assert(np.abs(valid_acc - 0.9) < 0.01)