# Named Entity Recognition (NER)

Named Entity Recognition (NER) is an important task in natural language processing. In this assignment I implemented a neural network model for NER, using an approach called a sliding window neural network. The dataset is composed of sentences, already parsed into a dataframe with one word per row in the first column and the corresponding label (entity) in the second column. We will build a "window" model: the idea is to use a 5-word window to predict the named entity of the middle word.

**Note: This is a testing notebook based on the code in ner.py**

In [35]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
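from ner import *
# Explicit imports for names used directly in this notebook (np, DataLoader)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader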

In [3]:
data = pd.read_csv("data/Genia4ERtask1.iob2", sep="\t", header=None, names=["word", "label"])

**Here are the first few observations in our data**

In [4]:
data.head()

|   | word | label |
|---|------|-------|
| 0 | IL-2 | B-DNA |
| 1 | gene | I-DNA |
| 2 | expression | O |
| 3 | and | O |
| 4 | NF-kappa | B-protein |


We use a window model to learn about the context of a word. For the first observation, the window is the first 5 words ('IL-2' through 'NF-kappa'), and its label is the named entity of the middle word, 'expression' ('O').
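To make the windowing concrete, here is a minimal sketch in plain Python. The helper `make_windows` is a hypothetical name used only for illustration and is not taken from ner.py; it pairs every run of 5 consecutive words with the label of the middle word, using the five rows printed above.

```python
def make_windows(words, labels, window=5):
    """Pair each run of `window` consecutive words with the label of its middle word."""
    half = window // 2
    return [
        (words[i:i + window], labels[i + half])
        for i in range(len(words) - window + 1)
    ]

# The five rows shown by data.head() above.
words = ["IL-2", "gene", "expression", "and", "NF-kappa"]
labels = ["B-DNA", "I-DNA", "O", "O", "B-protein"]

windows = make_windows(words, labels)
print(windows)
# [(['IL-2', 'gene', 'expression', 'and', 'NF-kappa'], 'O')]
```

With this scheme a sequence of n words yields n - 4 windows, because the first two and last two words never sit in the middle of a full window.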

In [5]:
tiny_data = pd.read_csv("data/tiny.ner.train", sep="\t", header=None, names=["word", "label"])

The second observation is the 5 words starting with 'gene', and its label is the entity for the word 'and'. Each observation therefore has 5 features (categorical variables), which are words. We use a word embedding to represent each value of the categorical features. For each observation, we concatenate the 5 word embeddings, and the vector of concatenated embeddings is fed to a linear layer.
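The shapes involved can be checked with a small dependency-free sketch. This is not the `NERModel` from ner.py (which is not shown here); the sizes, the random initial values, and the `forward` helper are all assumptions chosen for illustration. It looks up one embedding per word, concatenates the 5 vectors, and applies a single linear layer to get one score per class.

```python
import random

random.seed(0)

vocab_size, emb_size, n_class, window = 50, 4, 11, 5  # toy sizes, chosen for illustration

# Embedding table: one vector of length emb_size per word index.
embedding = [[random.gauss(0, 1) for _ in range(emb_size)] for _ in range(vocab_size)]

# Linear layer: one row of weights per class over the concatenated vector, plus a bias.
weights = [[random.gauss(0, 0.1) for _ in range(window * emb_size)] for _ in range(n_class)]
bias = [0.0] * n_class

def forward(word_ids):
    """Look up each word's embedding, concatenate them, and apply the linear layer."""
    concat = [value for idx in word_ids for value in embedding[idx]]
    scores = [sum(w * x for w, x in zip(row, concat)) + b for row, b in zip(weights, bias)]
    return concat, scores

concat, scores = forward([11, 30, 26, 18, 13])  # an encoded 5-word window
print(len(concat), len(scores))  # 20 11
```

The concatenated vector has length 5 × emb_size, so with the embedding size of 100 used later in this notebook the linear layer would take 500 inputs.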

## Split dataset

In [6]:
N = int(data.shape[0]*0.8)
N

394040

In [7]:
train_df = data.iloc[:N,].copy()
valid_df = data.iloc[N:,].copy()

In [8]:
train_df.shape, valid_df.shape

((394040, 2), (98511, 2))

## Word and label to index mapping

In [9]:
vocab2index = label_encoding(train_df["word"].values)
label2index = label_encoding(train_df["label"].values)

In [10]:
len(label2index)

11

In [11]:
label2index

{'B-DNA': 0,
 'B-RNA': 1,
 'B-cell_line': 2,
 'B-cell_type': 3,
 'B-protein': 4,
 'I-DNA': 5,
 'I-RNA': 6,
 'I-cell_line': 7,
 'I-cell_type': 8,
 'I-protein': 9,
 'O': 10}
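The mapping above is consistent with assigning indices to the distinct values in sorted order. A minimal sketch under that assumption follows; `label_encoding_sketch` is a hypothetical stand-in, and the real `label_encoding` in ner.py may be implemented differently.

```python
def label_encoding_sketch(values):
    """Map each distinct value to an integer index, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

labels = ["O", "B-DNA", "I-DNA", "O", "B-protein", "I-protein", "O"]
print(label_encoding_sketch(labels))
# {'B-DNA': 0, 'B-protein': 1, 'I-DNA': 2, 'I-protein': 3, 'O': 4}
```

Python sorts uppercase letters before lowercase ones, which is why 'B-RNA' precedes 'B-cell_line' in the mapping printed above.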

## Label Encoding categorical variables

In [12]:
tiny_vocab2index = label_encoding(tiny_data["word"].values)
tiny_label2index = label_encoding(tiny_data["label"].values)
tiny_data_enc = dataset_encoding(tiny_data, tiny_vocab2index, tiny_label2index)

In [13]:
actual = np.array([17, 53, 31, 25, 44, 41, 32,  0, 11,  1])
assert(np.array_equal(tiny_data_enc.iloc[30:40].word.values, actual))
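Encoding replaces each word with its index. Because the validation split can contain words that never appear in the training vocabulary, some fallback index is needed; the `len(vocab2index)+1` vocabulary size used later in this notebook suggests one extra index is reserved for that purpose. How `dataset_encoding` handles this is not shown, so the sketch below is an assumption, with `encode_words` and the toy vocabulary as hypothetical names.

```python
def encode_words(words, vocab2index):
    """Replace each word by its index; unseen words share one extra index."""
    unknown = len(vocab2index)  # assumption: reserved index is one past the last known word
    return [vocab2index.get(word, unknown) for word in words]

toy_vocab2index = {"IL-2": 0, "expression": 1, "gene": 2}
print(encode_words(["gene", "expression", "kinase"], toy_vocab2index))
# [2, 1, 3]
```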

## Dataset definition

In [14]:
idx = 0
tiny_data_enc.word[idx:idx+5].to_numpy()
tiny_data_enc.label[idx+2]

6

In [15]:
tiny_ds = NERDataset(tiny_data_enc)

In [16]:
len(tiny_ds)

93

In [17]:
x, y = tiny_ds[0]
x,y

(array([11, 30, 26, 18, 13]), 6)

In [18]:
x, y = tiny_ds[0]
assert(np.array_equal(x, np.array([11, 30, 26, 18, 13])))
assert(y == 6)
assert(len(tiny_ds) == 93)

**Loading The DataSet**

In [28]:
# encoding datasets
train_df_enc = dataset_encoding(train_df, vocab2index, label2index)
valid_df_enc = dataset_encoding(valid_df, vocab2index, label2index)

In [29]:
train_df_enc.shape

(394040, 2)

In [30]:
# creating datasets
train_ds = NERDataset(train_df_enc)
valid_ds = NERDataset(valid_df_enc)

# dataloaders
batch_size = 10000
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

In [31]:
valid_ds[1]

(array([17256, 10191,    44, 19261, 18482]), 10)

In [32]:
next(iter(valid_dl))

[tensor([[ 7042, 17256, 10191,    44, 19261],
         [17256, 10191,    44, 19261, 18482],
         [10191,    44, 19261, 18482, 15557],
         ...,
         [ 8175, 17356, 14585, 12182, 11377],
         [17356, 14585, 12182, 11377, 13490],
         [14585, 12182, 11377, 13490, 18482]]),
 tensor([ 7, 10, 10,  ..., 10, 10, 10])]

In [36]:
vocab_size = len(vocab2index) + 1
n_class = len(label2index)
emb_size = 100

model = NERModel(vocab_size, n_class, emb_size)
optimizer = get_optimizer(model, lr=0.01, wd=1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

0.7339784786105156 0.40738130211830137 0.8781406397514897
0.31660365015268327 0.32812400460243224 0.8986975544885135
0.2499956078827381 0.3057739049196243 0.9046159156202097
0.2154882848262787 0.29571646749973296 0.9080775985462962
0.1943738043308258 0.3014722615480423 0.9071436547656512
0.1795968994498253 0.2902646899223328 0.9112144314617235
0.16924089901149272 0.3090768575668335 0.907427898524978
0.1618583507835865 0.30718915462493895 0.9081385079232948
0.1561602033674717 0.2988574832677841 0.9100470017359172
0.15191591382026673 0.3098491162061691 0.9084532063711208


**As we can see, the training loss keeps decreasing while the validation loss and accuracy start to stabilize after a few epochs (around 0.30 and 0.91 respectively)**
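One way to read the log is to pick the epoch with the lowest validation loss. The sketch below does that on the ten rows printed above, rounded to four decimals. The column order (training loss, validation loss, validation accuracy) is an assumption, inferred from the last row matching the values that `valid_metrics` returns later in this notebook.

```python
# (training loss, validation loss, validation accuracy) per epoch, rounded from the log above.
# The column order is an assumption, not something the log states.
history = [
    (0.7340, 0.4074, 0.8781),
    (0.3166, 0.3281, 0.8987),
    (0.2500, 0.3058, 0.9046),
    (0.2155, 0.2957, 0.9081),
    (0.1944, 0.3015, 0.9071),
    (0.1796, 0.2903, 0.9112),
    (0.1692, 0.3091, 0.9074),
    (0.1619, 0.3072, 0.9081),
    (0.1562, 0.2989, 0.9100),
    (0.1519, 0.3098, 0.9085),
]

# 1-based index of the epoch with the smallest validation loss.
best_epoch = min(range(len(history)), key=lambda epoch: history[epoch][1]) + 1
train_loss, valid_loss, valid_acc = history[best_epoch - 1]
print(best_epoch, valid_loss, valid_acc)  # 6 0.2903 0.9112
```

Under that reading, epoch 6 has both the lowest validation loss and the highest validation accuracy, while the training loss continues to fall afterwards.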

In [307]:
optimizer = get_optimizer(model, lr=0.001, wd=1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

0.13339512664824724 0.30031385719776155 0.910229729866913
0.1282341878861189 0.30028333365917204 0.9103718517465764
0.1258051436394453 0.30114724636077883 0.9104632158120742
0.1243304267525673 0.30344178676605227 0.9099759407960856
0.12317418307065964 0.30476149916648865 0.9101789720527476
0.12187269330024719 0.3058551371097565 0.9100774564244165
0.12052327319979668 0.3047191560268402 0.910128214238582
0.11957029327750206 0.3058358788490295 0.9102703361182454
0.11869703885167837 0.30654798448085785 0.9100266986102511
0.11769825760275125 0.30898159444332124 0.9099759407960856


In [37]:
valid_loss, valid_acc = valid_metrics(model, valid_dl)

In [38]:
valid_loss, valid_acc

(0.3098491162061691, 0.9084532063711208)
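For reference, a loss and accuracy pair like this can be computed from per-class scores as follows. This is a plain-Python sketch that assumes the loss is mean cross-entropy over the softmax of the scores; the actual `valid_metrics` in ner.py is not shown and may differ, and `metrics_sketch` is a hypothetical name.

```python
import math

def metrics_sketch(scores, targets):
    """Return (mean cross-entropy, accuracy) for rows of class scores and integer targets."""
    total_loss, correct = 0.0, 0
    for row, target in zip(scores, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(row)
        log_norm = top + math.log(sum(math.exp(s - top) for s in row))
        total_loss += log_norm - row[target]
        # The prediction is the first class with the highest score.
        correct += int(row.index(top) == target)
    return total_loss / len(targets), correct / len(targets)

scores = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
targets = [0, 2, 1]
loss, acc = metrics_sketch(scores, targets)
print(round(acc, 3))  # 0.667
```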