# Named Entity Recognition using Transformers 
[source](https://keras.io/examples/nlp/ner_transformers/)

## Introduction

Named Entity Recognition (NER) is the process of identifying named entities in text. Example of named entities are: "Person", "Location", "Organization", "Dates" etc. NER is essentially a token classification task where every token is classified into one or more predetermined categories.

In this exercise, we will train a simple Transformer based model to perform NER. We will be using the data from CoNLL 2003 shared task. For more information about the dataset, please visit the dataset website. However, since obtaining this data requires an additional step of getting a free license, we will be using HuggingFace's datasets library which contains a processed version of this dataset.

## Install the open source datasets library from HuggingFace

* Huggingface `datasets` library [link](https://huggingface.co/docs/datasets/quicktour.html#loading-a-dataset)
> Datasets provides datasets for many NLP tasks like text classification, question answering, language modeling
> https://huggingface.co/datasets

In [1]:
# datasets library download
!pip install datasets

# script used to evaluate NER models
!wget https://raw.githubusercontent.com/sighsmile/conlleval/master/conlleval.py

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.[0m
--2021-08-08 07:44:57--  https://raw.githubusercontent.com/sighsmile/conlleval/master/conlleval.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7502 (7.3K) [text/plain]
Saving to: ‘conlleval.py.1’


2021-08-08 07:44:57 (104 MB/s) - ‘conlleval.py.1’ saved [7502/7502]



In [2]:
from datasets import list_datasets
datasets_list = list_datasets()
print(len(datasets_list))
print(', '.join(dataset for dataset in datasets_list))

1136
acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar, allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, amttl, anli, app_reviews, aqua_rat, aquamuse, ar_cov19, ar_res_reviews, ar_sarcasm, arabic_billion_words, arabic_pos_dialect, arabic_speech_corpus, arcd, arsentd_lev, art, arxiv_dataset, ascent_kb, aslg_pc12, asnq, asset, assin, assin2, atomic, autshumato, babi_qa, banking77, bbaw_egyptian, bbc_hindi_nli, bc2gm_corpus, best2009, bianet, bible_para, big_patent, billsum, bing_coronavirus_query_set, biomrc, blended_skill_talk, blimp, blog_authorship_corpus, bn_hate_speech, bookcorpus, bookcorpusopen, boolq, bprec, break_data, brwac, bsd_ja_en, bswac, c3, c4, cail2018, caner, capes, catalonia_independence, cawac, cbt, cc100, cc_news, ccaligned_multilingual, cdsc, cdt, cfq, chr_en, cifar10, cifar100, circa, civil_comments, clickbait_news_bg, climate_fever, 

## Defining a Transformer(Encoder)Block layer

In [3]:
import os
import numpy as np
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from datasets import load_dataset
from conlleval import evaluate

**[Transformer Implementaion Source from]** 
- [blog post - tutorial](https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec)
- [github](https://github.com/SamLynnEvans/Transformer)

**Attention is All you need**

> **3.4 Embeddings and Softmax(p5)**

> Similarly to other sequence transduction models, we use learned embeddings to convert the input
tokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In
our model, we share the same weight matrix between the two embedding layers and the pre-softmax
linear transformation, similar to [24]. **In the embedding layers, we multiply those weights by √
dmodel.**

* The reason we increase the embedding values before addition is **to make the positional encoding relatively smaller**. This means **the original meaning in the embedding vector won’t be lost** when we add them together.

1. Pytorch `self.register_buffer` 로 layer를 등록하면 어떤 특징이 있는가?

    1) optimizer가 업데이트하지 않는다.
    2) 그러나 값은 존재한다(하나의 layer로써 작용한다고 보면 된다.)
    3) state_dict()로 확인이 가능하다.
    4) GPU연산이 가능하다.

2. `Variable`이란? ([source1](https://medium.com/@poperson1205/%EC%B4%88%EA%B0%84%EB%8B%A8-pytorch%EC%97%90%EC%84%9C-tensor%EC%99%80-variable%EC%9D%98-%EC%B0%A8%EC%9D%B4-a846dfb72119), [source2](https://9bow.github.io/PyTorch-tutorials-kr-0.3.1/beginner/examples_autograd/two_layer_net_autograd.html))

    * Variable이 최근 버전에서 deprecated 상태 Variable은 원래 autograd를 사용하기 위해서 사용되던 타입이었으나, 현재는 Tensor 타입과 병합되었다고 한다. 즉, Tensor 타입에서 디폴트로 autograd 기능을 지원하도록 되어있다.

    * PyTorch Variable은 PyTorch Tensor의 래퍼(Wrapper)이며, 연산 그래프(Computational Graph)에서 노드(Node)로 표현(represent)된다. 

    * PyTorch 0.4 이상 버전에서는 더이상 Variable을 사용할 필요가 없다. Legacy 코드에 Variable이 있다면 그냥 Tensor라고 생각하고 읽으면 된다.

In [4]:
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
    def forward(self, x):
        return self.embed(x)

class PositionalEncoding(nn.Module):
    """
    PE (pos,2i) = sin(pos/10000^(2i/d_model))
    PE (pos,2i+1) = cos(pos/10000^(2i/d_model)) 
    """
    def __init__(self, d_model, max_seq_len = 128):
        super().__init__()
        self.d_model = d_model
        
        # create constant 'pe' matrix with values dependant on 
        # pos and i
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * (i + 1))/d_model)))
                
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        # make embeddings relatively larger
        x = x * math.sqrt(self.d_model)
        # add constant to embedding
        seq_len = x.size(1)
        x = x + Variable(self.pe[:,:seq_len], requires_grad=False)
        return x

# def attention(q, k, v, d_k, mask=None, dropout=None):
    
#     scores = torch.matmul(q, k.transpose(-2, -1)) /  math.sqrt(d_k)
    
#     if mask is not None:
#         mask = mask.unsqueeze(1)
#         scores = scores.masked_fill(mask == 0, -1e9)
    
#     scores = F.softmax(scores, dim=-1)
    
#     if dropout is not None:
#         scores = dropout(scores)
        
#     output = torch.matmul(scores, v)
#     return output
    
class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout = 0.1):
        super().__init__()
        
        self.d_model = d_model
        self.d_k = d_model // heads
        self.h = heads
        
        self.q_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        
        bs = q.size(0)
        
        # perform linear operation and split into h heads
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k)
        q = self.q_linear(q).view(bs, -1, self.h, self.d_k)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k)
        
        # transpose to get dimensions bs * h * sl * d_model
        k = k.transpose(1,2)
        q = q.transpose(1,2)
        v = v.transpose(1,2)
        
        # calculate attention
        scores = torch.matmul(q, k.transpose(-2, -1)) /  math.sqrt(self.d_k)
        scores = F.softmax(scores, dim=-1)
        scores = torch.matmul(scores, v)

        # calculate attention using function we will define next
#         scores = attention(q, k, v, self.d_k, mask, self.dropout)
        
        # concatenate heads and put through final linear layer
        concat = scores.transpose(1,2).contiguous().view(bs, -1, self.d_model)
        
        output = self.out(concat)
    
        return output

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=64, dropout = 0.1):
        super().__init__() 
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)
    def forward(self, x):
        x = self.dropout(F.relu(self.linear_1(x)))
        x = self.linear_2(x)
        return x

class Norm(nn.Module):
    def __init__(self, d_model, eps = 1e-6):
        super().__init__()
    
        self.size = d_model
        # create two learnable parameters to calibrate normalisation
        self.alpha = nn.Parameter(torch.ones(self.size))
        self.bias = nn.Parameter(torch.zeros(self.size))
        self.eps = eps
    def forward(self, x):
        norm = self.alpha * (x - x.mean(dim=-1, keepdim=True)) \
        / (x.std(dim=-1, keepdim=True) + self.eps) + self.bias
        return norm

In [5]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout = 0.1):
        super().__init__()
        self.attn = MultiHeadAttention(heads, d_model)
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.ff = FeedForward(d_model)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # self attention + add&norm
        attn_output = self.attn(x, x, x, mask)
        attn_output = self.dropout_1(attn_output)
        norm_output = self.norm_1(x + attn_output)
        
        # ffn + add&norm
        ffn_output = self.ff(norm_output)
        ffn_output = self.dropout_2(ffn_output)
        out = self.norm_2(norm_output + ffn_output)
        return out

## Build the NER model class

In [6]:
class NERModel(nn.Module):
    def __init__(self, num_tags, vocab_size, max_seq_len=128, d_model=32, heads=4, d_ff=64):
        super(NERModel, self).__init__()
        self.token_embedding = TokenEmbedding(vocab_size, d_model)
        self.add_positional_encoding = PositionalEncoding(d_model, max_seq_len)
        self.transformer_block = TransformerEncoderLayer(d_model, heads)
        self.dropout1 = nn.Dropout(0.1)
        self.ff = nn.Linear(d_model, d_ff) #Relu
        self.dropout2 = nn.Dropout(0.1)
        self.ff_final = nn.Linear(d_ff, num_tags) # Softmax
        
    def forward(self, inputs):
        x = self.token_embedding(inputs) 
        x = self.add_positional_encoding(x)
        x = self.transformer_block(x)
        x = self.dropout1(x)
        x = F.relu(self.ff(x))
        x = self.dropout2(x)
        x = self.ff_final(x) # torch.Size([batch_size, max_seq_len, num_tags])

        return x

## Load the CoNLL 2003 dataset from the datasets library and process it

* length / tokens / tags
* 8	China	says	time	right	for	Taiwan	talks	.	5	0	0	0	0	5	0	0

In [7]:
conll_data = load_dataset("conll2003")

Reusing dataset conll2003 (/home/subinkim/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6)


In [8]:
def export_to_file(export_file_path, data):
    with open(export_file_path, "w") as f:
        for record in data:
            ner_tags = record["ner_tags"]
            tokens = record["tokens"]
            f.write(
                str(len(tokens))
                + "\t"
                + "\t".join(tokens)
                + "\t"
                + "\t".join(map(str, ner_tags))
                + "\n"
            )

export_to_file("./datasets/conll2003/conll_train.txt", conll_data["train"])
export_to_file("./datasets/conll2003/conll_val.txt", conll_data["validation"])

## Make the NER label lookup table

 **Data Examlpe** 
 ```
 EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
```
[내용 출처](https://wikidocs.net/24682)

* 데이터의 형식은 [단어] [품사 태깅] [청크 태깅] [개체명 태깅]의 형식으로 되어 있음

**[품사 태깅]**

- NNP : 고유 명사 단수형
- VBZ는 3인칭 단수 동사 현재형
- 추가 상세 정보 [link](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

**[개체명 태깅]**

* LOC : location
* ORG : organization
* PER : person
* MISC : miscellaneous(여러 가지 종류의, 이것저것 다양한)


   ***BIO 표현 방법을 사용***
   

* B : Begin, 개체명이 시작되는 부분
* I : Inside, 개체명의 내부 부분
* O : Outside, 개체명이 앙닌 부분

        
   ***개체명 인식 상세 설명***    
   1) 개체명의 시작 부분이면서 Organization을 의미하는 German에는 `B-ORG`라는 개체명 태깅이 붙음. 다만, German 그 자체로 개체명 하나이기 때문에 거기서 개체명 인식은 종료되면서 뒤에 `I`가 별도로 붙는 단어가 나오지는 않음. 이에 German 뒤에 나오는 call은 개체명이 아니기 때문에 `O`가 태깅됨.
    
   2) `. . O O` 다음에 11번째 줄 Peter가 나오는 부분 사이에서 10번째 줄은 공란으로 되어 있는데, 이는 9번째 줄에서 문장이 끝나고 11번째 줄에서 새로운 문장이 시작됨을 의미
   3) 그 다음 문장이 시작되는 11번째 줄에서는 개체명이 하나의 단어로 끝나지 않았을 때, 어떻게 다음 단어로 개체명 인식이 이어지는지를 보여줌. Peter는 개체명이 시작되면서 person에 해당되기 때문에 `B-PER`이라는 개체명 태깅이 붙고, 아직 개체명에 대한 인식은 끝나지 않았기 때문에 뒤에 붙는 Blackburn에서는 `I`가 나오면서 `I-PER`이 개체명 태깅으로 붙게 됨. 즉, Peter Blackburn이 person에 속하는 하나의 개체명으로 인식

In [9]:
def make_tag_lookup_table():
    iob_labels = ["B", "I"]
    ner_labels = ["PER", "ORG", "LOC", "MISC"]
    all_labels = [(label1, label2) for label2 in ner_labels for label1 in iob_labels]
    all_labels = ["-".join([a, b]) for a, b in all_labels]
    all_labels = ["[PAD]", "O"] + all_labels
    # target의 index 0은 padding을 의미하도록 설정
    return dict(zip(range(0, len(all_labels) + 1), all_labels))


mapping = make_tag_lookup_table()
print(mapping)

{0: '[PAD]', 1: 'O', 2: 'B-PER', 3: 'I-PER', 4: 'B-ORG', 5: 'I-ORG', 6: 'B-LOC', 7: 'I-LOC', 8: 'B-MISC', 9: 'I-MISC'}


In [10]:
# tokens : ['china', 'says', 'time', 'right', 'for', 'taiwan', 'talks', '.']
# tags : [6, 1, 1, 1, 1, 6, 1, 1]

def map_record_to_training_data(record):
    record = record.lower()
    record = record.split('\t')
    length = int(record[0])
    tokens = record[1 : length + 1]
    tags = record[length + 1 :]
    tags = [int(tag) + 1 for tag in tags]
    return tokens, tags

In [11]:
map_record_to_training_data('8\tChina\tsays\ttime\tright\tfor\tTaiwan\ttalks .\t50\t0\t0\t0\t5\t0\t0')

(['china', 'says', 'time', 'right', 'for', 'taiwan', 'talks .', '50'],
 [1, 1, 1, 6, 1, 1])

In [12]:
map_record_to_training_data('9\tEU\trejects\tGerman\tcall\tto\tboycott\tBritish\tlamb\t.\t3\t0\t7\t0\t0\t0\t7\t0\t0')

(['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.'],
 [4, 1, 8, 1, 1, 1, 8, 1, 1])

* Custom Vocab

> 데이터셋 자체가 tokenized 된 상태이므로 별도의 tokenizer 활용하지 말고, 직접 vocab을 구축하여 token -> id 변환을 해주어야한다.

* Counter 함수
> 
```
from collections import Counter
Counter('hello world') # Counter({'l': 3, 'o': 2, 'h': 1, 'e': 1, ' ': 1, 'w': 1, 'r': 1, 'd': 1})
Counter('hello world').most_common() # [('l', 3), ('o', 2), ('h', 1), ('e', 1), (' ', 1), ('w', 1), ('r', 1), ('d', 1)]
```
* vocab size 20000개
> 20000개 중 2개는 [UNK], [PAD]로 사용하고자 함
> training set 기반으로 vocab구성했으므로, 실제 training 시에는 [UNK]가 거의 없겠지만, test/inference시에 사용될 수 있음

In [13]:
from collections import Counter

all_tokens = sum(conll_data["train"]["tokens"], [])
all_tokens_array = np.array(list(map(str.lower, all_tokens)))

counter = Counter(all_tokens_array)
print(len(counter))

num_tags = len(mapping)
vocab_size = 20000

# We only take (vocab_size - 2) most commons words from the training data since
# the `StringLookup` class uses 2 additional tokens - one denoting an unknown
# token and another one denoting a masking token
vocabulary = [token for token, count in counter.most_common(vocab_size - 2)]

21009


In [14]:
len(vocabulary)

19998

In [15]:
vocabulary

['the',
 '.',
 ',',
 'of',
 'in',
 'to',
 'a',
 'and',
 '(',
 ')',
 '"',
 'on',
 'said',
 "'s",
 'for',
 '1',
 '-',
 'at',
 'was',
 '2',
 '0',
 '3',
 'with',
 'that',
 'he',
 'from',
 'it',
 'by',
 'is',
 ':',
 'as',
 '4',
 'had',
 'his',
 'has',
 'but',
 'an',
 'not',
 'were',
 'be',
 'after',
 'have',
 'first',
 'new',
 'who',
 'will',
 'they',
 '5',
 'two',
 'u.s.',
 'been',
 '$',
 '--',
 'their',
 'beat',
 'are',
 '6',
 'which',
 'would',
 'this',
 'up',
 'its',
 'year',
 'i',
 'last',
 'percent',
 'out',
 'we',
 'thursday',
 'one',
 'million',
 'over',
 'government',
 'wednesday',
 'police',
 '7',
 'results',
 'against',
 'second',
 'when',
 '/',
 'also',
 'tuesday',
 'three',
 'soccer',
 'president',
 'no',
 'division',
 'told',
 '10',
 'monday',
 'people',
 'about',
 'or',
 'friday',
 'league',
 'some',
 'london',
 'there',
 'world',
 'her',
 'minister',
 'under',
 'more',
 'york',
 '9',
 '1996-08-28',
 'won',
 'into',
 'state',
 'sunday',
 '8',
 'before',
 'south',
 'played',
 

In [16]:
token_to_id = {}
token_to_id['[UNK]'] = 0
token_to_id['[PAD]'] = 1
for i, token in enumerate(vocabulary):
    token_to_id[token] = i + 2

In [17]:
token_to_id

{'[UNK]': 0,
 '[PAD]': 1,
 'the': 2,
 '.': 3,
 ',': 4,
 'of': 5,
 'in': 6,
 'to': 7,
 'a': 8,
 'and': 9,
 '(': 10,
 ')': 11,
 '"': 12,
 'on': 13,
 'said': 14,
 "'s": 15,
 'for': 16,
 '1': 17,
 '-': 18,
 'at': 19,
 'was': 20,
 '2': 21,
 '0': 22,
 '3': 23,
 'with': 24,
 'that': 25,
 'he': 26,
 'from': 27,
 'it': 28,
 'by': 29,
 'is': 30,
 ':': 31,
 'as': 32,
 '4': 33,
 'had': 34,
 'his': 35,
 'has': 36,
 'but': 37,
 'an': 38,
 'not': 39,
 'were': 40,
 'be': 41,
 'after': 42,
 'have': 43,
 'first': 44,
 'new': 45,
 'who': 46,
 'will': 47,
 'they': 48,
 '5': 49,
 'two': 50,
 'u.s.': 51,
 'been': 52,
 '$': 53,
 '--': 54,
 'their': 55,
 'beat': 56,
 'are': 57,
 '6': 58,
 'which': 59,
 'would': 60,
 'this': 61,
 'up': 62,
 'its': 63,
 'year': 64,
 'i': 65,
 'last': 66,
 'percent': 67,
 'out': 68,
 'we': 69,
 'thursday': 70,
 'one': 71,
 'million': 72,
 'over': 73,
 'government': 74,
 'wednesday': 75,
 'police': 76,
 '7': 77,
 'results': 78,
 'against': 79,
 'second': 80,
 'when': 81,
 '/': 82

In [18]:
id_to_token = {v: k for k, v in token_to_id.items()}

In [19]:
def encode_tokens(tokens, max_seq_len=128):
    for i, token in enumerate(tokens):
        if token in list(token_to_id.keys()):
            tokens[i] = token_to_id[token]
        else:
            tokens[i] = token_to_id['[UNK]'] # unknown token
    # padding
    if len(tokens) < max_seq_len:
        tokens = tokens + [0] * (max_seq_len - len(tokens))
    # truncate
    elif len(tokens) >= max_seq_len:
        tokens = tokens[:max_seq_len]
        
    return tokens

# map_record_to_training_data('9\tEU\trejects\tGerman\tcall\tto\tboycott\tBritish\tlamb\t.\t3\t0\t7\t0\t0\t0\t7\t0\t0')
encode_tokens(['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.'])

[989,
 10951,
 205,
 629,
 7,
 3939,
 216,
 5774,
 3,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [20]:
from torch.utils.data import DataLoader

class NERDataset(torch.utils.data.Dataset): 
    def __init__(self, data_path, max_length):
            self.data_path = data_path
            self.max_length = max_length

    def __len__(self):
        f = open(self.data_path, "r")
        return len(f.readlines())

    def __getitem__(self, idx): 
        with open(self.data_path, 'r') as f:
            lines = f.readlines()
            tokens, tags = map_record_to_training_data(lines[idx])
            token_ids = encode_tokens(tokens)
            
            # padding
            if len(tags) < self.max_length:
                tags = tags + [0] * (self.max_length - len(tags))
                
            # truncate
            elif len(tags) >= self.max_length:
                tags = tags[:self.max_length]
            output = {'input' : torch.tensor(token_ids),
                     'target' : torch.tensor(tags)}
            
            return output

In [21]:
train_dataset = NERDataset('datasets/conll2003/conll_train.txt', max_length=128)
train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)

In [22]:
val_dataset = NERDataset('datasets/conll2003/conll_val.txt', max_length=128)
val_loader = DataLoader(dataset=val_dataset, batch_size=32, shuffle=False)

In [23]:
print(iter(train_loader).next()['input'])
print(iter(train_loader).next()['input'].shape)

tensor([[ 1958,    23,    21,  ...,     0,     0,     0],
        [    2,   127,   269,  ...,     0,     0,     0],
        [   18, 12005, 12006,  ...,     0,     0,     0],
        ...,
        [    6,     8,  1016,  ...,     0,     0,     0],
        [ 1389,  2997,   897,  ...,     0,     0,     0],
        [  747,   306,  1450,  ...,     0,     0,     0]])
torch.Size([32, 128])


In [24]:
print(iter(train_loader).next()['target'])
print(iter(train_loader).next()['target'].shape)

tensor([[4, 1, 4,  ..., 0, 0, 0],
        [1, 2, 3,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [6, 1, 6,  ..., 0, 0, 0],
        [6, 1, 0,  ..., 0, 0, 0]])
torch.Size([32, 128])


## Train the model

**PyTorch 구현 과정 중 알게 된 개념**

* model.zero_grad(), optimizer.zero_grad() 차이 [link](https://minsuksung-ai.tistory.com/24)
> 내가 학습하고자 하는 가중치만 zero_grad할 경우에는 optimizer.zero_grad()
> 모델의 모든 가중치에 대해 적용할 때에는 model.zero_grad()

* Keras의 sparsecategoricalCE

```
keras.losses.SparseCategoricalCrossentropy(
             from_logits=True, reduction=keras.losses.Reduction.NONE
         )

```
1) 훈련 데이터의 label(target)이 one-hot vector 이면 CategoricalCrossentropy
2) 훈련 데이터의 label(target)이 정수이면 SparseCategoricalCrossentropy
3) One advantage of using sparse categorical cross entropy is it saves time in memory as well as computation because it simply uses a single integer for a class, rather than a whole vector. [link](https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other)

* PyTorch의 `nn.CrossEntropyLoss()`, `nn.NLLLoss()` ! [link](https://stackoverflow.com/questions/65408027/how-to-correctly-use-cross-entropy-loss-vs-softmax-for-classification)

    * 모델 학습이 잘 되지 않는다면, 그리고 데이터나 모델 자체에 문제가 없는 것 같다면, loss 계산 과정도 의심해보자!
    1) The input given through a forward call is expected to contain log-probabilities of each class. [Doc](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html)
    2) This criterion combines LogSoftmax and NLLLoss in one single class. [Doc](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)

**Appendix 참조**
* CE loss
Keras의 CE Loss 구현 상세
    * 아래 두 개는 동일한 결과를 출력하는데, keras는 2번 방식으로 구현해 둠
    1) nn.CrossEntropyLoss(ignore_index=0)
    2) nn.CrossEntropyLoss(reduce=False) + padding masking

* rearrange
    * reshpae =/= transpose

##### AdamW

In [25]:
from tqdm.notebook import tqdm
from einops import rearrange

num_tags = len(mapping)
vocab_size = 20000

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = 'cpu'
print ('Current device : ', device)
model = NERModel(num_tags, vocab_size, max_seq_len=128, d_model=32, heads=4, d_ff=64).to(device)

num_epochs=15
total_step = len(train_loader)
learning_rate = 0.001 #0.0005

criterion =  nn.CrossEntropyLoss(ignore_index=0)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.001)

Current device :  cpu


In [26]:
for epoch in tqdm(range(0, num_epochs)):    
    for i_batch, sample_batched in enumerate(train_loader):
        
        batch_inputs = sample_batched['input'].to(device)
        batch_targets = sample_batched['target'].to(device)
        
        batch_size = batch_targets.size(0)
        
        # Forward
        outputs = model(batch_inputs) # torch.Size([32, 128, 10])
        
        # Compute loss
        batch_predicts = rearrange(outputs, 'b c l -> b l c') # torch.Size([32, 10, 128])
        loss = criterion(batch_predicts, batch_targets) # torch.Size([32, 128])
        
        # We will be using a custom loss function that will ignore the loss from padded tokens.
        # if nn.CrossEntropyLoss(reduce=False)
#         mask = torch.tensor(batch_targets > 0, dtype=float)
#         loss = loss * mask
#         loss = torch.sum(loss)/torch.sum(mask)
        
        # Backward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        
        if (i_batch+1)%10 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, num_epochs, i_batch, total_step, loss.item())) 
    
    # Save the model checkpoints
    torch.save(model.state_dict(), './models/ner_transformer_encoder_adamW-{}.ckpt'.format(epoch+1))

HBox(children=(FloatProgress(value=0.0, max=15.0), HTML(value='')))

Epoch [1/15], Step [9/439], Loss: 1.8871
Epoch [1/15], Step [19/439], Loss: 1.3657
Epoch [1/15], Step [29/439], Loss: 0.9088
Epoch [1/15], Step [39/439], Loss: 1.0242
Epoch [1/15], Step [49/439], Loss: 1.0744
Epoch [1/15], Step [59/439], Loss: 0.8411
Epoch [1/15], Step [69/439], Loss: 0.9017
Epoch [1/15], Step [79/439], Loss: 0.7189
Epoch [1/15], Step [89/439], Loss: 0.8066
Epoch [1/15], Step [99/439], Loss: 0.7353
Epoch [1/15], Step [109/439], Loss: 0.8044
Epoch [1/15], Step [119/439], Loss: 0.7908
Epoch [1/15], Step [129/439], Loss: 0.6834
Epoch [1/15], Step [139/439], Loss: 0.7378
Epoch [1/15], Step [149/439], Loss: 0.6197
Epoch [1/15], Step [159/439], Loss: 0.7084
Epoch [1/15], Step [169/439], Loss: 0.8667
Epoch [1/15], Step [179/439], Loss: 0.7346
Epoch [1/15], Step [189/439], Loss: 0.7088
Epoch [1/15], Step [199/439], Loss: 0.6992
Epoch [1/15], Step [209/439], Loss: 0.7471
Epoch [1/15], Step [219/439], Loss: 0.7901
Epoch [1/15], Step [229/439], Loss: 0.8683
Epoch [1/15], Step [23

In [27]:
# AdamW

# Sample inference using the trained model
tokens = map_record_to_training_data('9\tEU\trejects\tGerman\tcall\tto\tboycott\tBritish\tlamb\t.\t3\t0\t7\t0\t0\t0\t7\t0\t0')[0]

sample_input = encode_tokens(tokens)

sample_input = torch.tensor(sample_input).reshape(1,-1)

output = model(sample_input)
output = output.detach().numpy()
prediction = np.argmax(output, axis=-1)[0]
prediction = [mapping[i] for i in prediction]

# eu -> B-ORG, german -> B-MISC, british -> B-MISC
print(prediction)

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


##### Adam

In [28]:
from tqdm.notebook import tqdm
from einops import rearrange

num_tags = len(mapping)
vocab_size = 20000

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = 'cpu'
print ('Current device : ', device)
model = NERModel(num_tags, vocab_size, max_seq_len=128, d_model=32, heads=4, d_ff=64).to(device)

num_epochs=10
total_step = len(train_loader)
learning_rate = 0.001

criterion =  nn.CrossEntropyLoss(ignore_index=0)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Current device :  cpu


* 20 epochs -> loss 0.1x

In [29]:
for epoch in tqdm(range(0, num_epochs)):    
    for i_batch, sample_batched in enumerate(train_loader):
        
        batch_inputs = sample_batched['input'].to(device)
        batch_targets = sample_batched['target'].to(device)
        
        batch_size = batch_targets.size(0)
        
        # Forward
        outputs = model(batch_inputs) # torch.Size([32, 128, 10])
        
        # Compute loss
        batch_predicts = rearrange(outputs, 'b c l -> b l c') # torch.Size([32, 10, 128])
        loss = criterion(batch_predicts, batch_targets) # torch.Size([32, 128])
        
#         mask = torch.tensor(batch_targets > 0, dtype=float)
#         loss = loss * mask
#         loss = torch.sum(loss)/torch.sum(mask)
        
        # Backward
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        
        if (i_batch+1)%10 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, num_epochs, i_batch, total_step, loss.item())) 
    
    # Save the model checkpoints
    torch.save(model.state_dict(), './models/ner_transformer_encoder_adam-{}.ckpt'.format(epoch+1))

HBox(children=(FloatProgress(value=0.0, max=10.0), HTML(value='')))

Epoch [1/10], Step [9/439], Loss: 1.9091
Epoch [1/10], Step [19/439], Loss: 1.4056
Epoch [1/10], Step [29/439], Loss: 0.8548
Epoch [1/10], Step [39/439], Loss: 0.8687
Epoch [1/10], Step [49/439], Loss: 0.6290
Epoch [1/10], Step [59/439], Loss: 0.7341
Epoch [1/10], Step [69/439], Loss: 0.7536
Epoch [1/10], Step [79/439], Loss: 0.8288
Epoch [1/10], Step [89/439], Loss: 0.8940
Epoch [1/10], Step [99/439], Loss: 0.7245
Epoch [1/10], Step [109/439], Loss: 0.9001
Epoch [1/10], Step [119/439], Loss: 0.5984
Epoch [1/10], Step [129/439], Loss: 0.7616
Epoch [1/10], Step [139/439], Loss: 0.9274
Epoch [1/10], Step [149/439], Loss: 0.8124
Epoch [1/10], Step [159/439], Loss: 0.7212
Epoch [1/10], Step [169/439], Loss: 0.9759
Epoch [1/10], Step [179/439], Loss: 0.7139
Epoch [1/10], Step [189/439], Loss: 0.7959
Epoch [1/10], Step [199/439], Loss: 0.6450
Epoch [1/10], Step [209/439], Loss: 0.6368
Epoch [1/10], Step [219/439], Loss: 0.6915
Epoch [1/10], Step [229/439], Loss: 0.7311
Epoch [1/10], Step [23

In [30]:
# Adam

# Sample inference using the trained model
tokens = map_record_to_training_data('9\tEU\trejects\tGerman\tcall\tto\tboycott\tBritish\tlamb\t.\t3\t0\t7\t0\t0\t0\t7\t0\t0')[0]

sample_input = encode_tokens(tokens)
sample_input = torch.tensor(sample_input).reshape(1,-1)

output = model(sample_input)
output = output.detach().numpy()
prediction = np.argmax(output, axis=-1)[0]
prediction = [mapping[i] for i in prediction]

# eu -> B-ORG, german -> B-MISC, british -> B-MISC
print(prediction)

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


## Metrics calculation

Here is a function to calculate the metrics. The function calculates F1 score for the
overall NER dataset as well as individual scores for each NER tag.

In [31]:
def calculate_metrics(dataset):
    all_true_tag_ids, all_predicted_tag_ids = [], []

    for i, batch_sample in enumerate(dataset):
        
        x = batch_sample['input'].reshape(1,-1)
        y = batch_sample['target'].reshape(1,-1)

        output = model(x) # [32, len, 10]
        output = output.detach().numpy()
        
        predictions = np.argmax(output, axis=-1)
        predictions = np.reshape(predictions, [-1])

        true_tag_ids = np.reshape(y.detach().numpy(), [-1])

        mask = (true_tag_ids > 0) & (predictions > 0)
        true_tag_ids = true_tag_ids[mask]
        predicted_tag_ids = predictions[mask]

        all_true_tag_ids.append(true_tag_ids)
        all_predicted_tag_ids.append(predicted_tag_ids)

    all_true_tag_ids = np.concatenate(all_true_tag_ids)
    all_predicted_tag_ids = np.concatenate(all_predicted_tag_ids)

    predicted_tags = [mapping[tag] for tag in all_predicted_tag_ids]
    real_tags = [mapping[tag] for tag in all_true_tag_ids]

    evaluate(real_tags, predicted_tags)

calculate_metrics(val_dataset)

processed 51362 tokens with 5942 phrases; found: 5830 phrases; correct: 2568.
accuracy:  42.03%; (non-O)
accuracy:  88.68%; precision:  44.05%; recall:  43.22%; FB1:  43.63
              LOC: precision:  70.27%; recall:  65.87%; FB1:  68.00  1722
             MISC: precision:  56.48%; recall:  47.72%; FB1:  51.73  779
              ORG: precision:  38.42%; recall:  34.90%; FB1:  36.58  1218
              PER: precision:  21.32%; recall:  24.43%; FB1:  22.77  2111


## Appendix

* CE loss

In [32]:
loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
print(input)
target = torch.tensor([3,0,0])
print(target)
output = loss(input, target)

print(output)

tensor([[ 1.0145, -0.0471, -2.4738, -0.3983, -1.3942],
        [ 0.4398, -0.4428,  0.0362,  1.5519,  0.7035],
        [-0.5667,  0.2630,  0.6030,  0.3430,  1.9727]], requires_grad=True)
tensor([3, 0, 0])
tensor(2.2950, grad_fn=<NllLossBackward>)


In [33]:
loss = nn.CrossEntropyLoss(ignore_index=0)
print(input)
output = loss(input, target)
print(output)

tensor([[ 1.0145, -0.0471, -2.4738, -0.3983, -1.3942],
        [ 0.4398, -0.4428,  0.0362,  1.5519,  0.7035],
        [-0.5667,  0.2630,  0.6030,  0.3430,  1.9727]], requires_grad=True)
tensor(1.9492, grad_fn=<NllLossBackward>)


In [34]:
loss = nn.CrossEntropyLoss(ignore_index=0, reduce=False)
print(input)
output = loss(input, target)
print(output)

tensor([[ 1.0145, -0.0471, -2.4738, -0.3983, -1.3942],
        [ 0.4398, -0.4428,  0.0362,  1.5519,  0.7035],
        [-0.5667,  0.2630,  0.6030,  0.3430,  1.9727]], requires_grad=True)
tensor([1.9492, 0.0000, 0.0000], grad_fn=<NllLossBackward>)




In [35]:
loss = nn.CrossEntropyLoss(reduce=False)
print(input)
output = loss(input, target)
print(output)

tensor([[ 1.0145, -0.0471, -2.4738, -0.3983, -1.3942],
        [ 0.4398, -0.4428,  0.0362,  1.5519,  0.7035],
        [-0.5667,  0.2630,  0.6030,  0.3430,  1.9727]], requires_grad=True)
tensor([1.9492, 1.8601, 3.0758], grad_fn=<NllLossBackward>)


* rearrange

In [36]:
from einops import rearrange

loss = nn.CrossEntropyLoss()
input = input.reshape(1,3,5)
# input = torch.tensor(np.transpose(input.detach().numpy(), (0,2,1)))
print(">> original input")
print(input)
input_rearranged = rearrange(input, 'b c l -> b l c')
print(">> by rearrange")
print(input_rearranged)

print(">> by reshape")
input_reshaped = input.reshape(1,5,3)
print(input_reshaped)

>> original input
tensor([[[ 1.0145, -0.0471, -2.4738, -0.3983, -1.3942],
         [ 0.4398, -0.4428,  0.0362,  1.5519,  0.7035],
         [-0.5667,  0.2630,  0.6030,  0.3430,  1.9727]]],
       grad_fn=<ViewBackward>)
>> by rearrange
tensor([[[ 1.0145,  0.4398, -0.5667],
         [-0.0471, -0.4428,  0.2630],
         [-2.4738,  0.0362,  0.6030],
         [-0.3983,  1.5519,  0.3430],
         [-1.3942,  0.7035,  1.9727]]], grad_fn=<ViewBackward>)
>> by reshape
tensor([[[ 1.0145, -0.0471, -2.4738],
         [-0.3983, -1.3942,  0.4398],
         [-0.4428,  0.0362,  1.5519],
         [ 0.7035, -0.5667,  0.2630],
         [ 0.6030,  0.3430,  1.9727]]], grad_fn=<ViewBackward>)
