<a href="https://colab.research.google.com/github/sadat1971/Deep_Learning_NLP/blob/main/Named_Entity_COSC6336_HW_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook aims to famailiarize you with the Named Entity Recognition (NER) task. To this end, we will implement two different sequence tagging models to recognize entities from text.

Below is the overview of this notebook:

* Load Data
* Part 1: CRF for Sequence Tagging
    * 1.1 Feature Extraction
    * 1.2 Define the Data Splits
    * 1.3 Define the Model Architecture
    * 1.4 Training and Evaluation
* Part 2: BERT for Sequence Tagging
    * 2.1 Define the Tokenizer
    * 2.2 Define the Datasets
    * 2.3 Define the DataLoaders
    * 2.4 Define the Model Architecture 
    * 2.5 Training and Evaluation
    * 2.6 Attention Visualization

**Note**: 
- You may encounter some errors when installing libraries. You can ignore them if they are just conflicts and move on.
- You may want to change the runtime type to `GPU` to accelerate training in the Part 2. It takes ~8 mins to train a BERT model with a single GPU.

# Load Data

We will use the CoNLL 2003 corpus. Below are some details about this corpus.

**Data Summary**
> CoNLL-2003
> 
> The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.
>
> The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2 tagging scheme, whereas the original dataset uses IOB1.


**Data Instances**
```
{
    "chunk_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0],
    "id": "0",
    "ner_tags": [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "pos_tags": [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7],
    "tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "."]
}
```

**Data Fields**

The data fields are the same among all splits:

* id: a string feature.
* tokens: a list of string features.
* pos_tags: a list of classification labels (int). Full tagset with indices:
```
{'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12, 'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23, 'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33, 'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43, 'WP': 44, 'WP$': 45, 'WRB': 46}
```
* chunk_tags: a list of classification labels (int). Full tagset with indices:
```
{'O': 0, 'B-ADJP': 1, 'I-ADJP': 2, 'B-ADVP': 3, 'I-ADVP': 4, 'B-CONJP': 5, 'I-CONJP': 6, 'B-INTJ': 7, 'I-INTJ': 8, 'B-LST': 9, 'I-LST': 10, 'B-NP': 11, 'I-NP': 12, 'B-PP': 13, 'I-PP': 14, 'B-PRT': 15, 'I-PRT': 16, 'B-SBAR': 17, 'I-SBAR': 18, 'B-UCP': 19, 'I-UCP': 20, 'B-VP': 21, 'I-VP': 22}
```
* ner_tags: a list of classification labels (int). Full tagset with indices:
```
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
```

In [1]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.5.2-py3-none-any.whl (432 kB)
[K     |████████████████████████████████| 432 kB 39.2 MB/s 
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
[K     |████████████████████████████████| 163 kB 69.1 MB/s 
Collecting multiprocess
  Downloading multiprocess-0.70.13-py37-none-any.whl (115 kB)
[K     |████████████████████████████████| 115 kB 63.0 MB/s 
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 53.3 MB/s 
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
[K     |████████████████████████████████| 127 kB 63.9 MB/s 
Installing collected packa

In [2]:
from datasets import load_dataset

corpus = load_dataset('conll2003')
corpus

Downloading builder script:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

Downloading and preparing dataset conll2003/conll2003 (download: 959.94 KiB, generated: 9.78 MiB, post-processed: Unknown size, total: 10.72 MiB) to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98...


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

Dataset conll2003 downloaded and prepared to /root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

We will create a one-to-one mapping between the NER tags and numeric values according to the label scheme in the corpus. These numeric values correspond the entries of labels for loss calculation or model predictions. 

In [3]:
LABEL_SCHEME = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}

In [4]:
index2label = {i: l for l, i in LABEL_SCHEME.items()}
label2index = LABEL_SCHEME

print('index2label: ', index2label)
print('label2index: ', label2index)

index2label:  {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
label2index:  {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}


In [5]:
def read_data(corpus):
    data = dict()
    for split in corpus:
        data[split] = {'tokens': [], 'labels': []}
        for i in range(len(corpus[split])):
            data[split]['tokens'].append(corpus[split][i]['tokens'])
            data[split]['labels'].append(corpus[split][i]['ner_tags'])
    return data

data = read_data(corpus)

print('*** Example ***')
print('tokens: ', data['train']['tokens'][0])
print('labels: ', data['train']['labels'][0])

*** Example ***
tokens:  ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
labels:  [3, 0, 7, 0, 0, 0, 7, 0, 0]


In [6]:
corpus

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [7]:
data["train"].keys()

dict_keys(['tokens', 'labels'])

# Part 1: CRF for Sequence Tagging

In this part, we will implement a sequence tagging model using a CRF as the underlying learning algorithm.

In [8]:
!pip install scikit-learn==0.23.2
!pip install sklearn_crfsuite

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn==0.23.2
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
[K     |████████████████████████████████| 6.8 MB 1.9 MB/s 
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.5 requires scikit-learn>=1.0.0, but you have scikit-learn 0.23.2 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.23.2 which is incompatible.[0m
Successfully installed scikit-learn-0.23.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab

## 1.1 Feature Extraction

Now, we extract the features on tokens as the inputs to the model. Below, we provide a list to features for you to consider and you are also encouraged to explore other features to represent tokens based on your own intuition of papers using feature engineering for NER.
* prefix
* suffix
* lowercase
* uppercase
* special symbols
* ...


In [9]:
def convert_tokens_to_features(tokens):
    """
    Description: 
        This function extracts the linguistic features as the representation for tokens. 
    Inputs:
        - tokens (List[str]): a list of strings to represent input sequence tokens.
    Outputs:
        - features (List[Dict]): a list of dictionary which contains a mapping between feature names and feature values.
    """

    features = []
    
    for token in tokens:
        ### YOUR CODE HERE
        ###
        ### TODO:
        ###     1) Create features for each tokens. 
        ### Note:
        ###     - The feature for each token should be a dictionary.
        token_feature = None
        
        ### END YOUR CODE
        
        features.append(token_feature)
        
    return features

In [10]:
import re

In [11]:
def convert_tokens_to_features(tokens):
    """
    Description: 
        This function extracts the linguistic features as the representation for tokens. 
    Inputs:
        - tokens (List[str]): a list of strings to represent input sequence tokens.
    Outputs:
        - features (List[Dict]): a list of dictionary which contains a mapping between feature names and feature values.
    """

    features = []
    
    for token in tokens:
        ### YOUR CODE HERE
        ###
        ### TODO:
        ###     1) Create features for each tokens. 
        ### Note:
        ###     - The feature for each token should be a dictionary.
        token_feature = dict()
        token_feature["word"] = token
        token_feature["len"] = len(token) #Checks the length of the token
        token_feature["starts_upper"] = int(token[0].isupper()) #checks if the token starts with uppercase
        token_feature["all_upper"] =  int(token.isupper()) #checks if it is an acronym
        token_feature["is_numeric"] =  int(token.isnumeric()) #checks if it is a number
        token_feature["is_any_numeric"] = int(any(char.isdigit() for char in token)) #checks if it contains a number
        token_feature["Any_spcl_char"] = int(re.search('[^A-Za-z0-9]+',token) is not None)
        token_feature["Any_hyphen"] = int('-' in token)
        token_feature["prefix"] =int((token[0:-1].isupper()) or (token[0:3].isupper())) #is prefix uppercased ?
        
        
        ### END YOUR CODE
        
        features.append(token_feature)
        
    return features

## 1.2 Define the Data Splits

Next, we create data splits for training.

In [12]:
### YOUR CODE HERE
###
### TODO:
###     1) Prepare data splits for training.
### Hint:
###     1) Call `convert_tokens_to_features` fucntion to convert tokens into feature.
###     2) Use `index2label` to convert labels from numerical values to alphabetical values.
###     3) The inputs X should be lists of dictionaries while the labels y should be lists of strings.
X_train = [convert_tokens_to_features(data["train"]["tokens"][i]) for i in range(len(data["train"]["tokens"]))]
y_train = []
for i in range(len(data["train"]["tokens"])):
    M = []
    for j in range(len(data["train"]["labels"][i])):
        idx = index2label[data["train"]["labels"][i][j]]
        M.append(idx)
    y_train.append(M)
X_valid = None
y_valid = None

X_test = None
y_test = None

### END YOUR CODE

## 1.3 Define the Model Architecture

We will use `sklearn_crfsuite.CRF` model for sequence tagging. Complete the code below and  feel free to try different parameters.

In [13]:
from sklearn_crfsuite import CRF

### YOUR CODE HERE
###
### TODO:
###     1) Define the model architecture here.
crf = None

### END YOUR CODE

crf

In [14]:
crf = CRF()

crf.fit(X_train, y_train)

KeyboardInterrupt: ignored

In [None]:
crf

In [None]:
y_pred = crf.predict(X_train)

## 1.4 Training and Evaluation

In [None]:
crf.fit(X_train, y_train)

In [None]:
len(y_train)

NER is evaluated on span **micro F1** score. This means that partially correct predictions do not receive any credit; only correct predictions for entire events are considered correct. Below, we will use the `seqeval` library to compute metrics.

In [9]:
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
[K     |████████████████████████████████| 43 kB 686 kB/s 
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... [?25l[?25hdone
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16180 sha256=f41e8edb2733433a15c3cc3a02f14a275c509cfa4220bce6d2573f207c1f3d8c
  Stored in directory: /root/.cache/pip/wheels/05/96/ee/7cac4e74f3b19e3158dce26a20a1c86b3533c43ec72a549fd7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [None]:
from seqeval.metrics import classification_report

print(classification_report(y_train, y_pred, digits=5, zero_division=0))

# Part 2: BERT for Sequence Tagging
In this part, we will implement a sequence tagging architecture with the BERT. BERT is a transformer-based large-scale pre-trained langauge model. We will use this model to embed tokens into the embedding space and encode them to the latent space.

Please see the following docs for more details:
* BERT: https://aclanthology.org/N19-1423.pdf

In [10]:
!pip install transformers
!pip install seqeval
!pip install bertviz

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
[K     |████████████████████████████████| 5.3 MB 30.1 MB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 39.0 MB/s 
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.1 transformers-4.23.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertviz
  Downloading bertviz-1.4.0-py3-none-any.whl (157 kB)
[K     |████████████████████████████████| 157 kB 35.7 MB/s 
Collecting boto3
  Downloading boto3-1.24.89-py3-none-any.whl (132 kB)
[K     |██████████████████████

In [11]:
import datasets
import transformers

# set up verbosity of libraries
datasets.utils.logging.set_verbosity_error()
transformers.utils.logging.set_verbosity_error()

In [12]:
import torch

# Set up device. Recommend to use GPU to accelerate training
DEVICE = 'cuda:0' if torch.cuda.is_available() else 'cpu'

## 2.1 Define the Tokenizer
We will use `BertTokenizer` to convert tokens into indexes. `BertTokenizer` has a vocabulary to index tokens. It also has some special tokens such as `pad_token` for padding and `unk_token` for unknown (out-of-vocabulary) tokens. 

Note that BERT uses subwords as inputs instead of original tokens. Each token is converted into multiple subwords. For example, the token `embeddings` is represented with `['em', '##bed', '##ding', '##s']`. All subwords in BERT start with `##`. For NER, we select the prediction for the first subword as the prediction for the whole token (i.e., use the prediction for `em` as the prediction for `embeddings`). For the rest of subwords, we still keep them in the inputs. However, we assign each subword with a special label `ignore_index` so that the predictions for these subwords can be ignored when calculating losses or metrics. 

Please see the following docs for more details:
* BertTokenizer: https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer
* Subword Tokenization: https://huggingface.co/docs/transformers/tokenizer_summary#subword-tokenization



In [13]:
# specify which pre-trained BERT model to use
PRETRAINED_MODEL_NAME_OR_PATH = 'bert-base-cased'

# speicify the value of `ignore_index`
IGNORE_INDEX = -100

In [14]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME_OR_PATH)
tokenizer

Downloading:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

PreTrainedTokenizer(name_or_path='bert-base-cased', vocab_size=28996, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

## 2.2 Define the Datasets

Now, we create datasets by converting tokens and labels to indexes. We will do tokenization with `BertTokenizer` to create `input_ids` and `label_ids` as inputs to the model. Specifically, we can call `tokenizer.tokenize()` method to convert each token to subwords. If the `tokenier` cannot find subwords for this token, it will return an empty list. In this case, just replace it with an unknown token. Then we can call `tokenizer.convert_tokens_to_ids()` to convert subwords into indexes. 

Afterwards, we use dataloaders to convert samples into batches to accelerate training. To this end, we need to pad each sentences to the same length with `pad_token`.

Please see the following docs for more details:
* Datasets & DataLoaders: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
* tokenize(): https://huggingface.co/docs/transformers/v4.21.3/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.tokenize
* convert_tokens_to_ids(): https://huggingface.co/docs/transformers/v4.21.3/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.convert_tokens_to_ids

In [16]:
def prepare_encoding_fields_with_bert(data, tokenizer, ignore_index):
    """
    Description:
        This function constructs `input_ids` and `label_ids` by encoding tokens and labels into indexes, respectively.
    Inputs:
        - data (Dict[str, Dict]): a dictionary containing a mapping between split name and split values. 
          Each split (Dict[str, List]) is a dictionary containing `tokens (List[List])` and `labels (List[List])`. 
        - tokenizer (BertTokenizer): an object of BertTokenizer, which is used to convert tokens into indexes.
        - ignore_index (int): an integer to be used as the padding value of labels for second or later subwords.
    """

    for split in data:
        data[split]['tokenized'] = []
        data[split]['input_ids'] = []
        data[split]['label_ids'] = []
        for tokens, labels in zip(data[split]['tokens'], data[split]['labels']):
            ### YOUR CODE HERE (~15 Lines)
            ###
            ### TODO:
            ###     1) Construct `tokenized`  by converting each token to subwords with the tokenizer. If the token cannot
            ###        be found in the vocabulary, just replace it with the unknown token (i.e., `unk_token` in tokenizer).
            ###     2) Construct `input_ids` by converting subwords in `tokenized` to indexes.
            ###     3) Construct `label_ids` using the indexes of labels. Note that we use the true label for the first subword
            ###        and `ignore_index` for the rest of subwords.
            ### Note:
            ###     - The inputs to BERT should start with a `[CLS]` token. You need to insert this token to the begining of each 
            ###       sentences. You can use `ignore_index` as the label for this token as we don't care about the prediction for it.
            tokenized = []
            input_ids = []
            label_ids = []
            
            tokenized.append('[CLS]')
            label_ids.append(IGNORE_INDEX)

            for T, L in zip(tokens,labels):
                tok = tokenizer.tokenize(T)
                tokenized += tok
                label_ids.append(L)
                if len(tok)>1:
                    for TOK in tok[1:]:
                        label_ids.append(IGNORE_INDEX)
            
            input_ids = tokenizer.convert_tokens_to_ids(tokenized)
            
            ### END YOUR CODE

            data[split]['tokenized'].append(tokenized)
            data[split]['input_ids'].append(input_ids)
            data[split]['label_ids'].append(label_ids)
        
        assert all([len(t) == len(i) == len(l) for t, i, l in zip(data[split]['tokenized'], data[split]['input_ids'], data[split]['label_ids'])])


prepare_encoding_fields_with_bert(data, tokenizer, ignore_index=IGNORE_INDEX)

print('*** Example ***')
print('tokens: ', data['train']['tokens'][0])
print('labels: ', data['train']['labels'][0])
print('======')
print('tokenized: ', data['train']['tokenized'][0])
print('input_ids: ', data['train']['input_ids'][0])
print('label_ids: ', data['train']['label_ids'][0])

*** Example ***
tokens:  ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
labels:  [3, 0, 7, 0, 0, 0, 7, 0, 0]
tokenized:  ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.']
input_ids:  [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119]
label_ids:  [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0]


In [17]:
import torch
from torch.utils.data import Dataset

class NERDataset(Dataset):
    def __init__(self, data, pad_token_id, pad_label_id):
        self.tokenized = data['tokenized']
        self.input_ids = data['input_ids']
        self.label_ids = data['label_ids']

        self.pad_token_id = pad_token_id
        self.pad_label_id = pad_label_id
    
    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, index):
        return self.input_ids[index], self.label_ids[index]
    
    def collate_fn(self, batch):
        """
        Description:
            This function collates lists of samples into batches. It should be passed as the `collate_fn` argument when creating dataloaders.
        Inputs:
            - batch (List[Tuple]): a list of tuples. The tuple (Tuple[List]) in the batch is the return value (List[List]) of 
              `__getitem(self, index)` function. The element (List[int]) (i.e., `input_ids` and `label_ids`) in the tuple is a list of integers.
        Outputs:
            - input_dict (Dict[str, torch.LongTensor]): a dictionary containing a mapping between input names and input values. The `input_ids` 
              (torch.LongTensor of shape (batch_size, sequence_length)) and `label_ids` (torch.LongTensor of shape (batch_size, sequence_length)) 
              in the dictionary are token indexes and label indexes, respectively.
        """

        # unwrap the batch into every field
        input_ids, label_ids = map(list, zip(*batch))
        maxlen = 0
        for I in input_ids:
            if len(I)>maxlen:
                maxlen=len(I)
        ### YOUR CODE HERE (~10 Lines)
        ###
        ### TODO:
        ###     1) pad `input_ids` with `pad_token_id` to the max length of a batch
        ###     2) pad `label_ids` with `ignore_index` to the max length of a batch
        ### Hint:
        ###     - You need to find the max length first
        ###     - Based on the max length, figure out how much padding we need for each sample
        ### Note:
        ###     - We don't need `masks` for BERT. When calculating losses or computing metrics,
        ###       we can simply create `masks` using `labels`, i.e., if the label index for a 
        ###       certain token is `ignore_index`, we should ignore it. Otherwise, we keep it.
        batch_input_ids = []
        batch_label_ids = []
        all_pad_id = [self.pad_token_id]*maxlen
        all_pad_label = [self.pad_label_id]*maxlen
        for I, L in zip(input_ids, label_ids):
            if(len(I)<maxlen):
                ids = all_pad_id
                ids[0:len(I)] = I
                labs = all_pad_label
                labs[0:len(L)] = L
                batch_input_ids.append(ids)
                batch_label_ids.append(labs)
            else:
                batch_input_ids.append(I)
                batch_label_ids.append(L)
        
        ### END YOUR CODE
        
        assert all([len(i) == len(l) for i, l in zip(batch_input_ids, batch_label_ids)])

        input_dict = {
            'input_ids': torch.tensor(batch_input_ids).long(),
            'labels': torch.tensor(batch_label_ids).long()
        }

        return input_dict

In [18]:
def get_datasets(data, pad_token_id, pad_label_id):
    datasets = dict()
    for split in data:
        datasets[split] = NERDataset(data[split], pad_token_id, pad_label_id)
    return datasets


ner_datasets = get_datasets(data, pad_token_id=tokenizer.pad_token_id, pad_label_id=IGNORE_INDEX)
ner_datasets

{'train': <__main__.NERDataset at 0x7f6841905b90>,
 'validation': <__main__.NERDataset at 0x7f6841905250>,
 'test': <__main__.NERDataset at 0x7f6841905190>}

## 2.3 Define the DataLoaders

Next, we create dataloaders with datasets to collate lists of samples into batches with the `collate_fn` function in the dataset class. For the `train` dataset, we use `RandomSampler` so that we can shuffle the training data after every epoch to avoid proving the training examples in a meaningful order to our model as this may bias the optimization algorithm. For the `validation` and `test` dataset, we can simply use `SequentialSampler`.

Please see the following docs for more details:
* Overview: https://pytorch.org/docs/stable/data.html#module-torch.utils.data
* DataLoader: https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader
* RandomSampler: https://pytorch.org/docs/stable/data.html#torch.utils.data.RandomSampler
* SequentialSampler: https://pytorch.org/docs/stable/data.html#torch.utils.data.SequentialSampler

In [19]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

def get_dataloaders(datasets, train_batch_size, eval_batch_size):
    """
    Description:
        This function implements batch training by createing dataloaders for datasets to acclerate training.
    Inputs:
        - datasets (Dict[str, Dict]): a dictionary containing a mapping between dataset names and dataset values. 
        - train_batch_size (int): an integer which is used as the batch size when creating the train dataloader
        - eval_batch_size (int): an integer which is used as the batch size when creating the validation and test dataloader
    """

    dataloaders = dict()

    for split in datasets.keys():
        dataset = datasets[split]
        
        ### YOUR CODE HERE (~15 Lines)
        ###
        ### TODO:
        ###     create dataloader for each split in datasets. For training data, use 
        ###     `RandomSampler()` and set batch size to `train_batch_size`. For validation
        ###     and test data, use `SequentialSampler()` and set batch size to `eval_batch_size`.
        dataloaders[split] = None
        if split=="train":
            dataloaders[split] = DataLoader(dataset,collate_fn=dataset.collate_fn, sampler = RandomSampler(dataset), batch_size=train_batch_size)
        else:
            dataloaders[split] = DataLoader(dataset,collate_fn=dataset.collate_fn, sampler = SequentialSampler(dataset), batch_size=eval_batch_size)
        ### END YOUR CODE        
        ### END YOUR CODE

    return dataloaders

dataloaders = get_dataloaders(ner_datasets, train_batch_size=32, eval_batch_size=32)
dataloaders

{'train': <torch.utils.data.dataloader.DataLoader at 0x7f682f274210>,
 'validation': <torch.utils.data.dataloader.DataLoader at 0x7f682f274390>,
 'test': <torch.utils.data.dataloader.DataLoader at 0x7f682f274450>}

## 2.4 Define the Model Architecture
Next, we create a sequence tagging framework with BERT.

Please see the following docs for more details:
* BertModel: https://huggingface.co/docs/transformers/v4.21.3/en/model_doc/bert#transformers.BertModel

In [20]:
from transformers import BertConfig

# set up the configuration for BERT model
config = BertConfig.from_pretrained(PRETRAINED_MODEL_NAME_OR_PATH, num_labels=len(LABEL_SCHEME))
config

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.23.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

In [21]:
from torch import nn
from transformers import BertModel, BertPreTrainedModel

class BERTTagger(BertPreTrainedModel):
    def __init__(self, config, pretrained_model_name_or_path, ignore_index):
        super().__init__(config)

        self.hidden_size = config.hidden_size
        self.n_classes = config.num_labels

        self.ignore_index = ignore_index

        # Load a pre-trained BERT model
        self.bert = BertModel.from_pretrained(pretrained_model_name_or_path)

        # Here, we simply use a linear layer as the classifier. Feel free to try CRF.
        self.classifier = nn.Linear(self.hidden_size, self.n_classes)

        # define the loss function
        self.loss_fct = nn.CrossEntropyLoss(reduction='sum', ignore_index=ignore_index)

    
    def get_ner_tags(self, logits, masks):
        """
        `logits` (torch.LongTensor of shape (batch_size, sequence_length, max_length))
            - Logits as probality values for each class
        `masks` (torch.FloatTensor of shape (batch_size, sequence_length))
            - Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        """
        ner_tags = []

        scores = torch.softmax(logits.float(), dim=-1)
        for i in range(len(scores)):
            _, preds = scores[i, masks[i] == 1].max(dim=-1)
            ner_tags.append(preds.data.detach().cpu().tolist())
        
        return ner_tags

    def forward(self, input_ids, labels):
        """
        `input_ids` (torch.LongTensor of shape (batch_size, sequence_length))
             Indices of input sequence tokens in the vocabulary.
        `labels` (torch.LongTensor of shape (batch_size, sequence_length))
            - Labels for computing the NER loss. Indices should be in [0, ..., self.n_classes - 1].
        """
        # bert output
        bert_outputs = self.bert(input_ids)
        
        # context representation for sentences: (batch_size, sequence_length, hidden_size)
        sequence_output = bert_outputs[0]
        
        # output probability values: (batch_size, sequence_length, n_classes)
        logits = self.classifier(sequence_output)

        # calculate losses
        loss = self.loss_fct(logits.view(-1, self.n_classes), labels.view(-1))
        
        # create masks for decoding: (batch_size, sequence_length)
        masks = (labels != self.ignore_index)
        
        # decode logits into ner tags
        ner_tags = self.get_ner_tags(logits, masks)

        return loss, logits, ner_tags, masks


model = BERTTagger(config, pretrained_model_name_or_path=PRETRAINED_MODEL_NAME_OR_PATH, ignore_index=IGNORE_INDEX)
model

Downloading:   0%|          | 0.00/436M [00:00<?, ?B/s]

BERTTagger(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
 

## 2.5 Training and Evaluation

In [22]:
import os
import torch

from tqdm.auto import tqdm

def train(model, train_dataloader, optimizer, num_epochs=10, print_every=1, device='cpu'):
    model.to(device)

    model.train()

    for epoch_i in tqdm(range(num_epochs)):
        # For NER, we ues token-level losses instead of sentence-level losses.
        # The token-level loss is the sum of batch losses divided by the total 
        # number of tokens in the dataset.
        epoch_loss = 0
        num_tokens = 0

        for step, batch in enumerate(train_dataloader):
            for field in batch:
                batch[field] = batch[field].to(device)

            loss, _, _, masks = model(**batch)

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            epoch_loss += loss.sum().item()
            num_tokens += (masks == 1).sum().item()

        if (epoch_i + 1) % print_every == 0:
            print(f"**** Epoch {epoch_i+1:03d} - Loss: {epoch_loss / num_tokens:.6f} ****")
    
    return model

In [23]:
# We will use `AdamW` optimizer for BERT.
def get_optimizer(model, learning_rate):
    no_decay = ["bias", "LayerNorm.weight"]

    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.01,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 
            "weight_decay": 0.0
        },
    ]

    optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=learning_rate)

    return optimizer

In [24]:
# set up model configuration
config = BertConfig.from_pretrained(PRETRAINED_MODEL_NAME_OR_PATH, num_labels=len(LABEL_SCHEME))

# define the model architecture
model = BERTTagger(config, pretrained_model_name_or_path=PRETRAINED_MODEL_NAME_OR_PATH, ignore_index=IGNORE_INDEX)

# define the optimizer for training
optimizer = get_optimizer(model, learning_rate=5.0e-5)

# start training
model = train(model, dataloaders['train'], optimizer, num_epochs=5, print_every=1, device=DEVICE)

  0%|          | 0/5 [00:00<?, ?it/s]

**** Epoch 001 - Loss: 0.211219 ****
**** Epoch 002 - Loss: 0.102609 ****
**** Epoch 003 - Loss: 0.090002 ****
**** Epoch 004 - Loss: 0.080052 ****
**** Epoch 005 - Loss: 0.071123 ****


In [25]:
def evaluate(model, dataloader, index2label, device):
    model.to(device)

    model.eval()

    all_preds, all_truth = [], []
    for step, batch in enumerate(tqdm(dataloader)):
        for field in batch:
            batch[field] = batch[field].to(device)

        _, _, preds, masks = model(**batch)

        for i in range(len(preds)):
            preds_i = preds[i]
            truth_i = batch['labels'][i, masks[i] == 1].data.detach().cpu().tolist()

            all_preds.append([index2label[p] for p in preds_i])
            all_truth.append([index2label[t] for t in truth_i])
    
    return all_preds, all_truth

In [26]:
test_preds, test_truth = evaluate(model, dataloaders['test'], index2label, device=DEVICE)

  0%|          | 0/108 [00:00<?, ?it/s]

In [28]:
from seqeval.metrics import classification_report

print(classification_report(test_truth, test_preds, digits=5, zero_division=0))

              precision    recall  f1-score   support

         LOC    0.88534   0.78321   0.83115      2514
        MISC    0.88214   0.58950   0.70672      1257
         ORG    0.70527   0.86761   0.77806      2976
         PER    0.95509   0.96483   0.95994      3725

   micro avg    0.84725   0.84855   0.84790     10472
   macro avg    0.85696   0.80129   0.81897     10472
weighted avg    0.85859   0.84855   0.84694     10472



## 2.6 Attention Visualization
Next, we use `BertViz` for visualizing attention in Transformer language models such as BERT.
The code below shows the head view which visualizes attention for one or more attention heads in the same layer. Feel tree to play with different samples. 

In [29]:
from bertviz import head_view, model_view

# select a sample for visualization
sample_id = 0

# prepare the model inputs
tokenized = ner_datasets['train'].tokenized[sample_id]
input_ids = ner_datasets['train'].input_ids[sample_id]
input_ids = torch.tensor(input_ids).unsqueeze(0).to('cuda:0' if torch.cuda.is_available() else 'cpu')

# output model attentions for visualization
attention = model.bert(input_ids, output_attentions=True)[-1]

# we will call `head_view(attention, tokenized)` to visualize attentions in one or more heads from each layer
# feel free to try `model_view(attention, tokenized)` to visualize all attentions throughout the entire model
head_view(attention, tokenized)

<IPython.core.display.Javascript object>