![DLI Header](images/DLI_Header.png)

# Token Classification with Large Language Models #

## 02 - Domain-Specific Token Classification Model ##

In this notebook, you will learn to fine-tune a pre-trained language model to perform token classification for a specific domain. Specifically, you will develop a named entity recognition (NER) model that finds disease names in medical abstracts.

**Table of Contents**<br>
This notebook covers the following sections:
* Project Overview
* Dataset
    * Download Data
    * Preprocess Data
* Fine-Tune a Pre-Trained Model for Custom Domain
    * Configuration File
    * Download Domain-Specific Pre-Trained Model
    * Exercise # 1 - Instantiate Model and Trainer
    * Exercise # 2 - Model Training
    * Exercise # 3 - Model Evaluation

## Project Overview ##

<img src='images/workflow.png' width=1080>

## Dataset ##

For this notebook, we're going to use the [NCBI-disease](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/) corpus, a set of 793 PubMed abstracts annotated by 14 annotators. The annotations take the form of HTML-style tags inserted into the abstract text according to clearly defined rules. They identify named diseases and can be used to fine-tune a language model to identify disease mentions in future abstracts, *whether those diseases were part of the original training set or not*.

### Download Data ###

In [None]:
import os
import wget

# set data path
DATA_DIR = "data/NCBI"
os.makedirs(DATA_DIR, exist_ok=True)

Here's an example of what an annotated abstract from the corpus looks like: 

In [None]:
with open(f'{DATA_DIR}/NCBI_corpus_testing.txt') as f: 
    sample_text=f.readline()
    
print(sample_text)

In this example, we see the following tags within the abstract:

In [None]:
import re

# use regular expression to find labels
# a raw string avoids the invalid "\/" escape sequence; "/" needs no escaping in a regex
categories = re.findall(r'<category.*?</category>', sample_text)
for sample in categories: 
    print(sample)

For our purposes, we will consider any identified category (such as "Modifier", "Specific Disease", and a few others) to generally be a "disease".  If you want to see more examples, you can explore the text of the corpus using the file browser to the left, or open files directly: 

* [data/NCBI/NCBI_corpus_training.txt](data/NCBI/NCBI_corpus_training.txt)
* [data/NCBI/NCBI_corpus_testing.txt](data/NCBI/NCBI_corpus_testing.txt)
* [data/NCBI/NCBI_corpus_development.txt](data/NCBI/NCBI_corpus_development.txt)
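
The category tags found above can also be split into their type and the tagged mention. The following is a minimal sketch on a hard-coded sentence written in the corpus tag style; it does not read the data files, and the variable names and the `Counter` tally are illustrative rather than part of any provided script.

```python
import re
from collections import Counter

# illustrative sentence in the corpus tag style (not read from the data files)
tagged = ('Identification of APC2, a homologue of the '
          '<category="Modifier">adenomatous polyposis coli tumour</category> suppressor.')

# capture the category type and the tagged mention separately
pattern = re.compile(r'<category="(.*?)">(.*?)</category>')
mentions = pattern.findall(tagged)

# tally how often each category type occurs
type_counts = Counter(category for category, _ in mentions)

print(mentions)
print(type_counts)
```

Running the same pattern over a whole corpus file would give a rough picture of how many mentions each category contributes.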

We have already derived a dataset from this corpus. For NER, the dataset labels individual words as diseases. 

In [None]:
NER_DATA_DIR = f'{DATA_DIR}/NER'
os.makedirs(NER_DATA_DIR, exist_ok=True)

# show downloaded files
!ls -lh $NER_DATA_DIR

In [None]:
!head $NER_DATA_DIR/train.tsv

_Note:_ We can see that the abstract has been broken into sentences. Each sentence is then further parsed into words with labels that correspond to the original HTML-style tags in the corpus.

### Preprocess Data ###

We need to convert these files to a format that is compatible with the NeMo token classification module. For convenience, we've provided the script for this conversion [here](https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/token_classification/data/import_from_iob_format.py).
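
To give a sense of what such a conversion involves, here is a simplified sketch (not the script's actual code). It assumes the input holds one word and its label per line, separated by whitespace, with a blank line between sentences, and it produces one line of space-separated words and one line of space-separated labels per sentence. The function name `iob_to_text_and_labels` and the toy input are hypothetical.

```python
def iob_to_text_and_labels(lines):
    """Group word/label lines into per-sentence text and label strings."""
    texts, labels = [], []
    words, tags = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            # a blank line closes the current sentence
            if words:
                texts.append(' '.join(words))
                labels.append(' '.join(tags))
                words, tags = [], []
            continue
        # first field is the word, last field is its label
        words.append(parts[0])
        tags.append(parts[-1])
    # flush the last sentence if the input does not end with a blank line
    if words:
        texts.append(' '.join(words))
        labels.append(' '.join(tags))
    return texts, labels


# toy input, not taken from the dataset files
sample = ["Cystic\tB", "fibrosis\tI", "is\tO", "inherited\tO", ".\tO", "",
          "No\tO", "mutation\tO", "was\tO", "found\tO", ".\tO"]
texts, labels = iob_to_text_and_labels(sample)
print(texts)
print(labels)
```

Keeping the text and label lines aligned one-to-one is what lets a word and its label be matched up by position later on.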

In [None]:
# invoke the conversion script 
!python import_from_iob_format.py --data_file=$NER_DATA_DIR/train.tsv
!python import_from_iob_format.py --data_file=$NER_DATA_DIR/dev.tsv
!python import_from_iob_format.py --data_file=$NER_DATA_DIR/test.tsv

Recall that the sentences and labels in the NER dataset map to each other with _inside, outside, beginning (IOB)_ tagging. Anything separated by white space is a word, including punctuation. This mechanism can be used in a general way for multiple named entity types:
* B-{CHUNK_TYPE} – for the word at the Beginning of a chunk
* I-{CHUNK_TYPE} – for words Inside the chunk
* O – Outside any chunk

In our case, we are only looking for "disease" as our entity (or chunk) type, so we need only three classes: B, I, and O.

**Three classes**
* B - Beginning of disease name
* I - Inside word of disease name
* O - Outside of all disease names
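
To make the scheme concrete, the sketch below decodes a sequence of B/I/O labels back into disease mentions. It is a minimal illustration under the three-class scheme described here; `decode_iob` is a hypothetical helper and the sentence is made up, not taken from the dataset.

```python
def decode_iob(tokens, labels):
    """Return the list of mentions marked by B/I labels."""
    mentions, current = [], []
    for token, label in zip(tokens, labels):
        if label == 'B':
            # a new mention starts; close any open one first
            if current:
                mentions.append(' '.join(current))
            current = [token]
        elif label == 'I' and current:
            # continue the open mention
            current.append(token)
        else:
            # 'O' (or a stray 'I' with no open mention) ends the current mention
            if current:
                mentions.append(' '.join(current))
            current = []
    if current:
        mentions.append(' '.join(current))
    return mentions


# hypothetical sentence, not taken from the dataset
tokens = 'The patient has cystic fibrosis and breast cancer .'.split()
labels = 'O O O B I O B I O'.split()
print(decode_iob(tokens, labels))
```

The **B** label is what keeps two adjacent mentions apart: without it, back-to-back disease names would merge into a single span.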

As an example, for the first sentence we have the following mapping: 

```text
Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor .
O              O  O    O O O         O  O   B           I         I    I      O          O  
```

For comparison, the original corpus tags looked like:
```html
Identification of APC2, a homologue of the <category="Modifier">adenomatous polyposis coli tumour</category> suppressor.
```

The first word of the tagged text, "adenomatous", is now IOB-tagged with a **B** (beginning) tag, the remaining words of the disease mention, "polyposis coli tumour", are tagged with **I** (inside) tags, and everything else is tagged as **O** (outside).
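
The mapping from corpus tags to IOB labels can be sketched in a few lines. This is an illustration only, not the procedure used to build the provided dataset: it assumes punctuation is already separated by white space, it collapses every category into a single "disease" type, and `tagged_to_iob` is a hypothetical helper.

```python
import re

def tagged_to_iob(tagged_sentence):
    """Convert a sentence with <category=...> tags into tokens and B/I/O labels."""
    tokens, labels = [], []
    # the capture group keeps the tagged segments in the split result
    segments = re.split(r'(<category=".*?">.*?</category>)', tagged_sentence)
    for segment in segments:
        match = re.fullmatch(r'<category=".*?">(.*?)</category>', segment)
        if match:
            words = match.group(1).split()
            if words:
                # first word gets B, the rest get I
                tokens.extend(words)
                labels.extend(['B'] + ['I'] * (len(words) - 1))
        else:
            # everything outside a tag is labeled O
            words = segment.split()
            tokens.extend(words)
            labels.extend(['O'] * len(words))
    return tokens, labels


tagged = ('Identification of APC2 , a homologue of the '
          '<category="Modifier">adenomatous polyposis coli tumour</category> suppressor .')
tokens, labels = tagged_to_iob(tagged)
print(' '.join(tokens))
print(' '.join(labels))
```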

In [None]:
# preview dataset
!head -n 1 $NER_DATA_DIR/text_train.txt
!head -n 1 $NER_DATA_DIR/labels_train.txt

## Fine-Tune a Pre-Trained Model for Custom Domain ##

A named entity recognition model typically consists of a pre-trained [BERT](https://arxiv.org/pdf/1810.04805.pdf) model followed by a token classification layer. For training, we can use a configuration file to define the model. The configuration (config) file consists of several important sections, including:
* **model**: All arguments that are related to the Model - language model, token classifier, optimizer and schedulers, datasets and any other related information
* **trainer**: Any argument to be passed to PyTorch Lightning

_Note:_ NeMo provides a template for creating the configuration file, which is recommended as a starting point, but you can create your own as long as it follows the required format. 

### Configuration File ###

In [None]:
# define config path
MODEL_CONFIG = "token_classification_config.yaml"
WORK_DIR = "WORK_DIR"
os.makedirs(WORK_DIR, exist_ok=True)

In [None]:
# download the model's configuration file 
BRANCH = 'main'
config_dir = WORK_DIR + '/configs/'
os.makedirs(config_dir, exist_ok=True)

if not os.path.exists(config_dir + MODEL_CONFIG):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/conf/' + MODEL_CONFIG, config_dir)
else:
    print('config file already exists')

The config file for NER, `token_classification_config.yaml`, specifies model, training, and experiment management details, such as file locations, pretrained models, and hyperparameters. The YAML config file we downloaded provides default values for most of the parameters, but there are a few items that must be specified for this experiment.

Each YAML section is a bit easier to view using the `omegaconf` package, which allows you to access and manipulate the configuration keys using a "dot" notation. We'll take a look at the details of each section using the `OmegaConf` tool. 

In [None]:
from omegaconf import OmegaConf

CONFIG_DIR = "/dli/task/WORK_DIR/configs"
CONFIG_FILE = "token_classification_config.yaml"

config=OmegaConf.load(CONFIG_DIR + "/" + CONFIG_FILE)

# print the entire configuration file
print(OmegaConf.to_yaml(config))

Notice that some config lines, including `model.dataset.data_dir`, have `???` in place of paths. This means that the user is required to specify values for these fields. Details about the model arguments can be found in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/token_classification.html#training-the-token-classification-model).

In [None]:
# in this exercise, train and dev datasets are located in the same folder under the default names, 
# so it is enough to add the path of the data directory to the config
config.model.dataset.data_dir = os.path.join(DATA_DIR, 'NER')

# print the model section
print(OmegaConf.to_yaml(config.model))

_Note:_ The required `model.dataset.data_dir` argument (for token classification) has been modified. 

### Download Domain-Specific Pre-Trained Model ###

For this token classification task, we can start with the pre-trained `BioMegatron` language model. The `BioMegatron` model is a domain-specific, BERT-like Megatron-LM model trained on a large biomedical text corpus. Since the model was trained on domain-specific text, we can expect it to perform better than a general language model at identifying diseases.

_Note:_ There are alternatives to BioMegatron, such as BioBERT. It's worth experimenting with different pre-trained models to find the one that provides the best performance for a specific task.

In [None]:
# import dependencies
from nemo.collections.nlp.models.language_modeling.megatron_bert_model import MegatronBertModel

# list available pre-trained models
for model in MegatronBertModel.list_available_models(): 
    print(model.pretrained_model_name)

To load the pretrained BERT LM model, we change the `model.language_model` arguments in the config as well as a few other arguments.

In [None]:
# add the model parameters described above to the config
MODEL_NAME='biomegatron345m_biovocab_30k_cased'
# MODEL_NAME='biomegatron-bert-345m-cased'

config.model.language_model.lm_checkpoint=None
config.model.language_model.pretrained_model_name=MODEL_NAME
config.model.tokenizer.tokenizer_name=None

# use appropriate configurations based on GPU capacity
config.model.dataset.max_seq_length=64
config.model.train_ds.batch_size=32
config.model.validation_ds.batch_size=32
config.model.test_ds.batch_size=32

# limit the number of epochs for this demonstration
config.trainer.max_epochs=1
# config.trainer.precision=16
# config.trainer.amp_level='O1'

_Note:_ The `config` variable holds an in-memory copy of `token_classification_config.yaml`. If you edit the file on disk after loading it, you must reload it (re-define `config`) for the changes to take effect.

Now, we are ready to initialize our model. During the model initialization call, the dataset and data loaders will be prepared for training and evaluation. Also, the pretrained BERT model will be downloaded, which can take up to a few minutes depending on the size of the chosen BERT model.

### Exercise # 1 - Instantiate Model and Trainer ###

* Modify the `<FIXME>` to instantiate a `TokenClassificationModel` based on the configuration file and trainer. 

In [None]:
# create trainer and model instances
from nemo.collections.nlp.models import TokenClassificationModel
import pytorch_lightning as pl

trainer=pl.Trainer(**config.trainer)
ner_model=TokenClassificationModel(<<<<FIXME>>>>)


### Exercise # 2 - Model Training ###

* Modify the `<FIXME>` to train the model. 

In [None]:
# start model training
trainer.<<<<FIXME>>>>


### Exercise # 3 - Model Evaluation ###

* Modify the `<FIXME>` to evaluate the model. 

To see how the model performs, we can generate predictions, as we did before, and compare them with the labels. Alternatively, the `evaluate_from_file()` method enables us to evaluate the model given a `text_file` and a `labels_file`. Optionally, you can use the `add_confusion_matrix` argument to get a visual representation of the model performance.
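
To make the comparison of predictions with labels concrete, the following is a minimal sketch of per-class precision, recall, and F1 computed at the token level from two aligned label sequences. It is an illustration only and does not reproduce the report that NeMo produces; `token_level_report` is a hypothetical helper and both label sequences are made up.

```python
def token_level_report(gold, predicted, classes=('B', 'I', 'O')):
    """Compute per-class precision, recall, and F1 from aligned label lists."""
    report = {}
    pairs = list(zip(gold, predicted))
    for cls in classes:
        tp = sum(1 for g, p in pairs if g == cls and p == cls)
        fp = sum(1 for g, p in pairs if g != cls and p == cls)
        fn = sum(1 for g, p in pairs if g == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {'precision': precision, 'recall': recall, 'f1': f1}
    return report


# made-up labels: the prediction misses the start of the second mention
gold = 'O O O B I O B I O'.split()
predicted = 'O O O B I O O I O'.split()
report = token_level_report(gold, predicted)
for cls, scores in report.items():
    print(cls, scores)
```

Because most tokens are **O**, overall accuracy can look high even when mentions are missed, which is why the per-class scores for **B** and **I** are the ones to watch.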

In [None]:
# create a subset of our dev data
!head -n 100 $NER_DATA_DIR/text_dev.txt > $NER_DATA_DIR/sample_text_dev.txt
!head -n 100 $NER_DATA_DIR/labels_dev.txt > $NER_DATA_DIR/sample_labels_dev.txt

Now, let's generate predictions for the provided text file. If a labels file is also specified, the model will evaluate the predictions and plot a confusion matrix.

In [None]:
# evaluate model performance on sample
ner_model.<<<<FIXME>>>>


In [None]:
# restart the kernel
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

**Well Done!** 
