# Model Training

So far the data has been preprocessed twice: a first pass, mostly complete, aimed at abstractive summarization, and a second, complete pass aimed at extractive summarization.

For my purposes I will probably focus on abstractive summarization. Some of the literature suggests that combining extractive and abstractive summarization works well. The idea is that the extractive step pulls out the sentences most relevant to the document in question, and the abstractive step is then applied to those remaining sentences to synthesize the key components.
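To make the extract-then-abstract idea concrete, here is a minimal sketch of the extractive half. It is not the preprocessing used in this project; it assumes a naive regex sentence split and scores each sentence by the average document-wide frequency of its words. The function name `top_sentences` and the parameter `k` are hypothetical.

```python
import re
from collections import Counter
from typing import List


def top_sentences(text: str, k: int = 2) -> List[str]:
    """Return the k highest-scoring sentences, kept in document order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:k])  # restore the original sentence order
    return [sentences[i] for i in keep]
```

The selected sentences would then be joined back into one string and handed to the abstractive model in place of the full document.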

The first step will be to play around with a base implementation of a Transformer model for text summarization. The simplest models for this purpose are T5-small and T5-base.

In [None]:
!nvidia-smi

Sun Apr 17 15:05:52 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

This is why I pay for Colab!

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Install dependencies and load libraries

In [None]:
!pip install --quiet transformers[torch]
!pip install --quiet pytorch-lightning
!pip install --quiet wandb


In [None]:
import torch
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np 
from tqdm.notebook import tqdm
import pathlib


In [None]:
datapath = pathlib.Path.cwd()/'gdrive/My Drive/Capstone_three/data/mtsamples'

In [None]:
df = pd.read_json(datapath/'preprocessed2.jsonl', lines=True)

In [None]:
df.head()

| | input | target |
|---|---|---|
| 0 | ,Duplex and color flow imaging as well as real... | ,Trace bilateral hydroceles, which are nonspec... |
| 1 | ,The left testicle is normal in size and atten... | 1. Hypervascularity of the left epididymis com... |
| 2 | Flexible cystoscopy.Atrophic vaginitis.,The pa... | Atrophic vaginitis with overactive bladder wit... |
| 3 | Performed for evaluation of anemia, gastrointe... | Internal hemorrhoids External hemorrhoids Unab... |
| 4 | ,Informed consent was obtained from the patien... | Ultrasound-guided paracentesis as above. |


## T5-small:

Here I will test the T5 small model as a baseline on the preprocessed data.

In [None]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast as T5Tokenizer

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():  # only default to CUDA tensors when a GPU is present
    torch.set_default_tensor_type('torch.cuda.FloatTensor')

In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-small')


In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')


In [None]:
def base_summarize(text):
    text_enc = tokenizer(
        "summarize:"+text+tokenizer.eos_token,
        max_length= 512,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt"
    )
    generated_ids = model.generate(
        input_ids = text_enc['input_ids'],
        attention_mask=text_enc['attention_mask'],
        max_length=128,
        num_beams=4,
        repetition_penalty = 5.5,
        length_penalty=1.1,
        early_stopping=True   
    )
    predictions = [
        tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for gen_id in generated_ids
    ]

    return "".join(predictions)
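
A note on `repetition_penalty`, since 5.5 is a strong setting. As far as I understand the Hugging Face implementation, the penalty rescales the scores of tokens that have already been generated: positive scores are divided by the penalty and negative scores are multiplied by it, so both become less likely. The pure-Python sketch below illustrates that rule on a plain list of scores; the function name is hypothetical and this is a simplification, not the library code.

```python
from typing import List, Set


def apply_repetition_penalty(logits: List[float], seen_ids: Set[int], penalty: float) -> List[float]:
    """Down-weight the scores of token ids that were already generated."""
    out = list(logits)
    for i in seen_ids:
        # Dividing a positive score or multiplying a negative one
        # both push the token's score down when penalty > 1.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


scores = apply_repetition_penalty([2.0, -1.0, 0.5], seen_ids={0, 1}, penalty=2.0)
print(scores)  # [1.0, -2.0, 0.5]
```

With a penalty this large, a token that has appeared once is heavily discouraged from appearing again, which may matter for clinical text where terms legitimately repeat.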
    

In [None]:
sample_row = df.iloc[10]

In [None]:
base_summarize(sample_row.input)

'the left main coronary artery bifurcates into the left anterior descending and circumflex arteries. there is no evidence of any hemodynamically significant stenosis.'

In [None]:
sample_row.input

'Non-ST elevation MI.LEFT MAIN CORONARY ARTERY: The left main coronary artery is a moderate caliber vessel, which bifurcates into the left anterior descending and circumflex arteries. There is no evidence of any hemodynamically significant stenosis'

In [None]:
base_predictions = []

for q in tqdm(np.random.choice(df.index,5)):
    base_predictions.append(base_summarize(df.iloc[q].input))

  0%|          | 0/5 [00:00<?, ?it/s]

In [None]:
for item in base_predictions:
    print(item,"\n")

total of 100 mL of Isovue was administered intravenously. oral contrast was also administered. the liver is enlarged and decreased in attenuation. 

there is no evidence of any wall motion abnormalities with an estimated ejection fraction of 60%. left ventricular end-diastolic pressure was 24 mmHg preinjection and 26 mmHg postinjection. 

his lower extremity edema has improved with higher doses of furosemide. he complains of urinary frequency, nocturia, weak stream and dribbling. 

prostatitis sufferer has prostatic hypertrophy. the patient is alert and oriented with a pleasant affect. 

the right index finger has some small soreness at the PIP joint. there is no crepitation at the wrist, forearm, elbow or shoulder with full range of motion. 



In [None]:
del model

### Results:

From the above sampling of preprocessed documents, the impression is that the model works well some of the time. To try to improve the quality of its summaries, I am going to attempt to "fine-tune" the model.

The first step will be to determine the unique tokens in the preprocessed corpora and add them to the tokenizer.

The next step will require resizing the token embedding matrix to accommodate the increased number of tokens.

To do all of this I will follow a similar outline to the process taken [here](https://huggingface.co/docs/transformers/training), [here](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb), and [here](https://www.youtube.com/watch?v=KMyZUIraHio&list=WL&index=22&t=2072s&ab_channel=VenelinValkov).
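
To put a number on the "works some of the time" impression before and after tuning, one option is a unigram-overlap score between a generated summary and its `target`. This notebook does not compute one; the sketch below is a simplified ROUGE-1-style F1 (lowercased words, no stemming), not the official ROUGE implementation, and the function name is hypothetical.

```python
import re
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Averaging this over the held-out rows for both the base and the tuned model would give a rough like-for-like comparison.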

## Tuning Model:

In [None]:
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers.wandb import WandbLogger
from sklearn.model_selection import train_test_split
from pytorch_lightning import Trainer
from transformers import AdamW

import wandb
import re

In [None]:
len(tokenizer)

32100

## Extending 

By testing the t5-base model on several of the documents preprocessed for this model, 



In [None]:
tokenizer.get_vocab().keys()



In [None]:
set1 = set([re.sub(r'▁','', key) for key, _ in tokenizer.get_vocab().items()])

In [None]:
set2 = set(re.findall(r'\w+',df.input.sum(0)+df.target.sum(0)))

In [None]:
tokens_to_add = ['▁'+item for item in set2.difference(set1) if item.strip()]

In [None]:
len(tokens_to_add)

6054

[Is this a good idea?](https://media.giphy.com/media/9Pz3MzP8FUdxXsAHFl/giphy.gif)
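
One way to probe that question is to check how often each candidate word actually occurs. A newly added token starts from a freshly initialized embedding, so a word seen only once or twice gets very little training signal. The sketch below counts words per document (concatenating all documents into one string without a separator can glue words together at document boundaries) and keeps only out-of-vocabulary words above a threshold. The function name and the `min_count` threshold are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def frequent_oov_words(docs: Iterable[str], vocab: Set[str], min_count: int = 2) -> List[str]:
    """Words missing from `vocab` that occur at least `min_count` times."""
    counts = Counter()
    for doc in docs:
        # Tokenize each document separately so the last word of one
        # document is never merged with the first word of the next.
        counts.update(re.findall(r"\w+", doc))
    return sorted(w for w, c in counts.items() if w not in vocab and c >= min_count)
```

Here `docs` would be something like `list(df.input) + list(df.target)` and `vocab` would be `set1` from above.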

In [None]:
custom_tokenizer = T5Tokenizer.from_pretrained('t5-small')

In [None]:
custom_tokenizer.add_tokens(tokens_to_add)

6054

In [None]:
custom_tokenizer.tokenize("summarize:"+df.iloc[1].input+custom_tokenizer.eos_token)

['▁summarize',
 ':',
 ',',
 'The',
 '▁left',
 ' testicle',
 '▁is',
 '▁normal',
 '▁in',
 '▁size',
 '▁and',
 ' attenuation',
 '▁',
 ',',
 '▁it',
 '▁measures',
 '▁',
 '3.2',
 '▁',
 'x',
 '▁',
 '1.7',
 '▁',
 'x',
 '▁',
 '2.3',
 '▁cm',
 '.',
 '▁The',
 '▁right',
 ' epididymis',
 '▁measures',
 '▁up',
 '▁to',
 '▁9',
 '▁',
 'mm',
 '.',
 '▁There',
 '▁is',
 '▁',
 'a',
 ' hydrocele',
 '▁on',
 '▁the',
 '▁right',
 '▁side',
 '.',
 '▁Normal',
 '▁flow',
 '▁is',
 '▁seen',
 '▁within',
 '▁the',
 ' testicle',
 '▁and',
 ' epididymis',
 '▁on',
 '▁the',
 '▁right',
 '.',
 '▁The',
 '▁left',
 ' testicle',
 '▁is',
 '▁normal',
 '▁in',
 '▁size',
 '▁and',
 ' attenuation',
 '▁',
 ',',
 '▁it',
 '▁measures',
 '▁',
 '3.9',
 '▁',
 'x',
 '▁',
 '2.1',
 '▁',
 'x',
 '▁',
 '2.6',
 '▁cm',
 '.',
 '▁The',
 '▁left',
 ' testicle',
 '▁shows',
 '▁normal',
 '▁blood',
 '▁flow',
 '.',
 '▁The',
 '▁left',
 ' epididymis',
 '▁measures',
 '▁up',
 '▁to',
 '▁9',
 '▁',
 'mm',
 '▁and',
 '▁shows',
 '▁',
 'a',
 ' markedly',
 '▁increased',
 '▁',
 

In [None]:
new_embed = len(custom_tokenizer) +128
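
Adding a flat 128 is one way to leave headroom. The stock model summary printed further down shows `Embedding(32128, 512)` for a tokenizer of length 32100, which is consistent with rounding the vocabulary up to a multiple of 128. A sketch of that rounding follows; the helper name is hypothetical, and aligning to 128 is an assumption rather than a requirement.

```python
def round_up_to_multiple(n: int, multiple: int = 128) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


base_vocab = 32100  # len(tokenizer) reported earlier in this notebook
added = 6054        # len(tokens_to_add) reported earlier in this notebook

print(round_up_to_multiple(base_vocab))          # 32128
print(round_up_to_multiple(base_vocab + added))  # 38272
```

Either value would be passed as `new_embed_dim` when constructing the model, together with `custom_tokenizer` in the data module, for the extended vocabulary to take effect.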

## Fine-Tuning

In [None]:
class CustomDataset(Dataset):
    def __init__(
        self, 
        dataframe: pd.DataFrame,
        tokenizer: T5Tokenizer,
        input_max_token_len: int = 512,
        target_max_token_len: int = 128
    ):

        self.tokenizer = tokenizer
        self.data=dataframe
        self.input_max_token_len = input_max_token_len
        self.target_max_token_len = target_max_token_len


    def __len__(self):
        return len(self.data)

    def __getitem__(self, index:int):
        eos = self.tokenizer.eos_token
        pad = self.tokenizer.pad_token

        data_row = self.data.iloc[index]
        text = data_row['input']
        text_enc = self.tokenizer(
            "summarize:"+text+eos,
            max_length=self.input_max_token_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors='pt'
        )

        target_enc = self.tokenizer(
            pad+data_row['target']+eos,
            # Pad/truncate targets to the target length, not the input length.
            max_length=self.target_max_token_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors='pt'
        )

        labels = target_enc['input_ids']
        labels[labels==0] = -100

        return dict(
            text=text,
            target=data_row['target'],
            text_input_ids=text_enc['input_ids'].flatten().to(device),
            text_attention_mask=text_enc['attention_mask'].flatten().to(device),
            labels=labels.flatten().to(device),
            labels_attention_mask = target_enc["attention_mask"].flatten().to(device)
        )
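
The `labels[labels==0] = -100` line is what keeps padding out of the loss: 0 is the pad id used here, and -100 is the default index that PyTorch's cross-entropy loss ignores. The sketch below shows the effect on plain Python lists; the helper names and the per-token loss values are made up for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids: List[int], pad_id: int = 0) -> List[int]:
    """Replace pad ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]


def mean_loss(per_token_loss: List[float], labels: List[int]) -> float:
    """Average the loss over positions whose label is not ignored."""
    kept = [loss for loss, t in zip(per_token_loss, labels) if t != IGNORE_INDEX]
    return sum(kept) / len(kept)


labels = mask_padding([363, 8, 1, 0, 0])
print(labels)                                        # [363, 8, 1, -100, -100]
print(mean_loss([2.0, 1.0, 3.0, 9.0, 9.0], labels))  # 2.0
```

Without the masking, the two padded positions would pull the average toward whatever the model predicts for padding.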



In [None]:
class CustomSummaryDataModule(pl.LightningDataModule):
    
    def __init__(
        self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame,
        tokenizer: T5Tokenizer,
        batch_size: int = 2,
        input_max_token_len: int = 512,
        target_max_token_len: int = 128
    ):

        super().__init__()
        self.train_df = train_df
        self.test_df = test_df

        self.batch_size = batch_size
        self.tokenizer = tokenizer
        self.input_max_token_len = input_max_token_len
        self.target_max_token_len = target_max_token_len

    def setup(self, stage=None):
        self.train_dataset = CustomDataset(
            self.train_df,
            self.tokenizer,
            self.input_max_token_len,
            self.target_max_token_len
        )

        self.test_dataset = CustomDataset(
            self.test_df,
            self.tokenizer,
            self.input_max_token_len,
            self.target_max_token_len
        )

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=0
        )

    def val_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=0
        )
    def test_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=0
        )

In [None]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

In [None]:
data_module = CustomSummaryDataModule(df_train, df_test, tokenizer)

In [None]:
class CustomDataSummaryModel(pl.LightningModule):
    
    def __init__(self, learning_rate = 1e-5, new_embed_dim: int = None):
        """Load t5-small and, if `new_embed_dim` is given, resize its token embeddings to that many rows."""
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained('t5-small')
        self.learning_rate = learning_rate
        if new_embed_dim:
            self.model.resize_token_embeddings(new_embed_dim)
        

    def forward(self, input_ids, input_attention_mask, target_attention_mask, labels=None):
        output = self.model(
            input_ids,
            attention_mask=input_attention_mask,
            labels=labels,
            decoder_attention_mask = target_attention_mask
        )
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        input_ids = batch['text_input_ids']
        input_attention_mask = batch['text_attention_mask']
        labels=batch['labels']
        labels_attention_mask = batch['labels_attention_mask']

        loss, outputs = self(
            input_ids=input_ids,
            input_attention_mask=input_attention_mask,
            target_attention_mask = labels_attention_mask,
            labels=labels
        )

        self.log("train_loss", loss, prog_bar=True, logger=True)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch['text_input_ids']
        input_attention_mask = batch['text_attention_mask']
        labels=batch['labels']
        labels_attention_mask = batch['labels_attention_mask']

        loss, outputs = self(
            input_ids=input_ids,
            input_attention_mask=input_attention_mask,
            target_attention_mask = labels_attention_mask,
            labels=labels
        )

        self.log("val_loss", loss, prog_bar=True, logger=True)
        return loss
    
    def test_step(self, batch, batch_idx):
        input_ids = batch['text_input_ids']
        input_attention_mask = batch['text_attention_mask']
        labels=batch['labels']
        labels_attention_mask = batch['labels_attention_mask']

        loss, outputs = self(
            input_ids=input_ids,
            input_attention_mask=input_attention_mask,
            target_attention_mask = labels_attention_mask,
            labels=labels
        )

        self.log("test_loss", loss, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=self.learning_rate)


In [None]:
my_model = CustomDataSummaryModel().to(device)

In [None]:
!nvidia-smi

Sun Apr 17 15:11:06 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    31W / 250W |   1669MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!wandb login

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
checkpoint_callback = ModelCheckpoint(
    dirpath=pathlib.Path.cwd()/'gdrive/My Drive/Capstone_three/checkpoints',
    filename='best-checkpoint',
    save_top_k=1,
    verbose=True,
    monitor="val_loss",
    mode='min'
)

In [None]:
wandb_logger = WandbLogger(project='mddoc-project-small')



[34m[1mwandb[0m: Currently logged in as: [33mthimmis[0m (use `wandb login --relogin` to force relogin)


In [None]:
N_EPOCHS = 50

In [None]:
from pytorch_lightning.callbacks import TQDMProgressBar

trainer = pl.Trainer(
    logger=wandb_logger,
    # `checkpoint_callback=` and `progress_bar_refresh_rate=` are deprecated
    # since v1.5; both are passed as callbacks instead.
    callbacks=[checkpoint_callback, TQDMProgressBar(refresh_rate=10)],
    max_epochs=N_EPOCHS,
    gpus=1,
    auto_lr_find=True  # only has an effect if trainer.tune(...) is called
)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [None]:
trainer.fit(my_model,data_module)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 60.5 M
-----------------------------------------------------
60.5 M    Trainable params
0         Non-trainable params
60.5 M    Total params
242.026   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]



Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


In [None]:
tmp_model = CustomDataSummaryModel()

In [None]:
checkpoint = torch.load(trainer.checkpoint_callback.best_model_path)

In [None]:
tmp_model.load_state_dict(checkpoint['state_dict'], strict=False)

<All keys matched successfully>

In [None]:
tmp_model.freeze()

In [None]:
tmp_model.to(device)

CustomDataSummaryModel(
  (model): T5ForConditionalGeneration(
    (shared): Embedding(32128, 512)
    (encoder): T5Stack(
      (embed_tokens): Embedding(32128, 512)
      (block): ModuleList(
        (0): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=512, out_features=512, bias=False)
                (k): Linear(in_features=512, out_features=512, bias=False)
                (v): Linear(in_features=512, out_features=512, bias=False)
                (o): Linear(in_features=512, out_features=512, bias=False)
                (relative_attention_bias): Embedding(32, 8)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseReluDense(
                (wi): Linear(in_features=512, out_features=2048, bias=False)
                (wo): Linear

In [None]:
def summarize(text):
    text_enc = tokenizer(
        "summarize:"+text+tokenizer.eos_token,
        max_length= 512,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt"
    )
    generated_ids = tmp_model.model.generate(
        input_ids = text_enc['input_ids'],
        attention_mask=text_enc['attention_mask'],
        max_length=128,
        num_beams=4,
        repetition_penalty = 4.5,
        length_penalty=1.0,
        early_stopping=True   
    )
    predictions = [
        tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for gen_id in generated_ids
    ]

    return "".join(predictions)

In [None]:
df_train.iloc[1].input


'The bone marrow demonstrates normal signal intensity. There is no evidence of bone contusion or fracture. There is no evidence of joint effusion. Tendinous structures surrounding the ankle joint are intact. No abnormal mass or fluid collection is seen surrounding the ankle joint'

In [None]:
summarize(df_train.iloc[1].input)

'Normal signal intensity of the bone marrow. No evidence of bone contusion or fracture.'

In [None]:
base_summarize(df_train.iloc[1].input)

'bone marrow shows normal signal intensity. no abnormal mass or fluid collection is seen surrounding the ankle joint.'

In [None]:
torch.save(my_model.model,'/content/gdrive/MyDrive/Capstone_three/models/tuned_model.pkl')

In [None]:
my_model.model.save_pretrained('/content/gdrive/MyDrive/Capstone_three/models/fine-tuned')
tokenizer.save_pretrained('/content/gdrive/MyDrive/Capstone_three/models/fine-tuned')

('/content/gdrive/MyDrive/Capstone_three/models/fine-tuned/tokenizer_config.json',
 '/content/gdrive/MyDrive/Capstone_three/models/fine-tuned/special_tokens_map.json',
 '/content/gdrive/MyDrive/Capstone_three/models/fine-tuned/tokenizer.json')