<p  style="font-size:30px; text-align:center; font-weight:bold">T5 (Text-to-Text Transfer Transformer)</p>

<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 20px">The T5 model, developed by Google Research, is based on the idea that practically every NLP problem can be cast as a text-to-text problem. In other words, both the input and the output are treated as text, whether the task is translation, summarization, question answering, or something else. The same model, objective, and training procedure are used to map textual inputs to textual outputs across all of these tasks.</p>

<p style="font-size: 20px; font-weight:bold"><u>Justification (Why used for medical chatbot (doctor-patient dialogues dataset))</u></p>

<p style="font-size: 20px">T5's text-to-text design offers a streamlined way to build a medical chatbot from doctor-patient dialogues: the patient's message is the input text and the doctor's reply is the output text. The model is pre-trained on a large text corpus and can then be fine-tuned on medical conversations. Its encoder-decoder architecture conditions every generated token on the full patient query, which matters for sensitive medical conversations, whether the reply is brief advice or a detailed explanation.</p>

<p style="font-size: 20px; font-weight:bold"><u>Version</u></p>

<p style="font-size: 20px">The T5-small checkpoint is used in this project for training on patient-doctor dialogues. T5-small has approximately 60 million parameters.</p>

<div style="width:100%;height:1px; background-color:black"></div>

<p id="lib" style="font-size:30px; text-align:center; font-weight:bold">Required libraries or packages</p> <a href="#top">Back To Top</a>

<div style="width:100%;height:1px; background-color:black"></div>

In [1]:
import torch # PyTorch open-source deep learning framework
import json # json library to load JSON data
import pandas as pd # for dataframe
from sklearn.model_selection import train_test_split # train_test_split function from scikit-learn for dataset splitting
from transformers import (
    T5Tokenizer, # T5 tokenizer
    DataCollatorForLanguageModeling,  # collates batches for language modeling tasks
    T5ForConditionalGeneration, # T5 model
    TrainingArguments, # class to store arguments for training transformers models
    Trainer,  # trainer class for training and evaluating transformers models
    TrainerCallback, # callbacks used with the trainer class
    IntervalStrategy, # enumeration of different interval strategies for logging
  
)
from datasets import load_metric # load metric utility from the datasets library
from torch.nn import CrossEntropyLoss # method from pytorch to calculate loss

from torch.utils.data import Dataset, DataLoader # pytorch utility that provides iterators over datasets, which helps in efficient data feeding
from nltk.translate.bleu_score import corpus_bleu # bleu_score from nltk to calculate BLEU score (evaluation metric)
from rouge import Rouge # rouge to calculate ROUGE score (evaluation metric)



<div style="width:100%;height:1px; background-color:black"></div>

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use the GPU if one is available, otherwise fall back to the CPU


<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Load Dataset</p>

In [3]:
with open("cleaned_medical_dialogues_dataset.json", "r") as f: # load the data from the JSON file
    data = json.load(f)

data = data[::4] # keep every 4th record (about 1/4 of the dataset) to manage time and computational resources

print(len(data)) # print the number of samples in the sampled data

61776


<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Data Formatting</p>

<p style="font-size: 20px">
The T5 model treats every NLP task as converting one text into another. For the doctor-patient dialogues, the description and the patient's question are concatenated to give the model the full context (the source), and the doctor's response is the desired output (the target). This setup matches T5's design: the model reads the source and generates a doctor's response conditioned on it.
</p>
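
<p style="font-size: 20px">
As a minimal sketch of this pairing, the snippet below builds one (source, target) pair from a single record. The record text and the helper name build_pair are hypothetical and used only for illustration; the three field names are the ones read from the dataset in this notebook.
</p>

```python
# Hypothetical record: only the three field names match the dataset used here.
record = {
    "Description": "Q. What causes a mild headache after long screen use?",
    "Patient": "Hello doctor, I get a mild headache after working on my laptop.",
    "Doctor": "Hello. This is commonly related to eye strain. Take regular breaks.",
}


def build_pair(item):
    """Return (source, target): description plus patient question, doctor reply."""
    source = item["Description"].strip() + " " + item["Patient"].strip()
    target = item["Doctor"].strip()
    return source, target


source, target = build_pair(record)
print("source:", source)
print("target:", target)
```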

In [4]:
sources = [] # list for saving queries
targets = [] # list for saving responses

for item in data: # loop to save dialogues in sources and targets list 
    source = item["Description"] + " " + item["Patient"]
    target = item["Doctor"]
    sources.append(source)
    targets.append(target)

<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Prepare the Dataset </p>

In [5]:
tokenizer = T5Tokenizer.from_pretrained("t5-small") # load the tokenizer of the pre-trained t5-small checkpoint

In [6]:
class DoctorPatientDataset(Dataset):
    def __init__(self, sources, targets, tokenizer, max_length=512): # initialization method for the dataset
        self.sources = sources # list of source sequences
        self.targets = targets # list of target sequences
        self.tokenizer = tokenizer  # T5 tokenizer for tokenizing input data
        self.max_length = max_length  # maximum token length for each sequence
        
    def __len__(self): # returns the number of items in the dataset
        return len(self.sources)
    
    # fetches and returns the tokenized source-target pair at the given index 'idx'
    def __getitem__(self, idx):
        source = self.sources[idx] # get the source sequence at the specified index
        target = self.targets[idx] # get the corresponding target sequence
        
        # tokenize the source sequence, truncating and padding it to 'max_length'
        # (uses the tokenizer stored on the instance rather than the global variable)
        source_encodings = self.tokenizer(source, truncation=True, padding='max_length', max_length=self.max_length, return_tensors="pt")
        
        # tokenize the target sequence in the same way
        target_encodings = self.tokenizer(target, truncation=True, padding='max_length', max_length=self.max_length, return_tensors="pt")
        
        # assemble the fields passed to the T5 model: input_ids, attention_mask, etc.
        item = {
            "input_ids": source_encodings["input_ids"].squeeze(), # source sequence input IDs
            "attention_mask": source_encodings["attention_mask"].squeeze(), # source sequence attention mask
            "decoder_input_ids": target_encodings["input_ids"].squeeze(), # target sequence input IDs for the decoder
            "decoder_attention_mask": target_encodings["attention_mask"].squeeze(), # target sequence attention mask for the decoder
            "labels": target_encodings["input_ids"].squeeze()  # target token IDs used as labels
        }
        
        return item

In [7]:
dataset = DoctorPatientDataset(sources, targets, tokenizer) # instantiate the dataset from the source and target lists

In [8]:
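# data collator used to build training batches
# note: with mlm=False this collator replaces "labels" with a copy of "input_ids" (padding set to -100),
# so the targets set in DoctorPatientDataset do not reach the loss; DataCollatorForSeq2Seq is the usual choice for T5
data_collator = DataCollatorForLanguageModeling( 
    tokenizer=tokenizer, # the tokenizer initialized earlier
    mlm=False # False: no masked language modeling (BERT-style) objective
)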

<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Split Dataset</p>

<p style="font-size: 20px">
  In machine learning and deep learning, the available dataset is typically split into a training set and a validation/test set.
This separation makes it possible to evaluate the model's performance on unseen data. The main goal is to detect overfitting to the training data: if a model performs exceptionally well on the training data but poorly on the validation data, it is most likely overfitted.</p>

In [9]:

train_dataset, val_dataset = train_test_split(dataset, test_size=0.2)  # 20% of data as validation


<p style="font-size: 20px">The parameter <b>test_size=0.2</b> indicates that 20% of the dataset will be used as the validation set, while the remaining 80% will be used for training.</p>
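
<p style="font-size: 20px">
To make the 80/20 arithmetic concrete, here is a rough pure-Python sketch of a shuffled index split. It is not scikit-learn's implementation; the helper name split_indices and the fixed seed are assumptions for illustration (the notebook does not set a random state). Rounding the validation size up reproduces the example counts reported in the training log for the 61776 sampled records.
</p>

```python
import math
import random


def split_indices(n_items, test_size=0.2, seed=42):
    """Shuffle the indices 0..n_items-1 and split them into train/validation lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)      # seed is an illustrative assumption
    n_val = math.ceil(n_items * test_size)    # validation size, rounded up
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(61776, test_size=0.2)
print(len(train_idx), len(val_idx))
```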

<div style="width:100%;height:1px; background-color:black"></div>

<p id="lib" style="font-size:30px; text-align:center; font-weight:bold">Fine tuning T5-small</p> <a href="#top">Back To Top</a>

<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Define Early Stopping</p>

<p style="font-size: 20px">A callback that halts training when the validation loss has not improved for a given number of consecutive evaluations (the patience).
</p>
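
<p style="font-size: 20px">
The patience rule can be illustrated without the Trainer. The sketch below replays a list of validation losses and reports the evaluation at which training would stop; the function name epochs_until_stop and the loss values are hypothetical.
</p>

```python
def epochs_until_stop(eval_losses, patience):
    """Return the 1-based evaluation at which training would stop, or None."""
    best = None
    waited = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if best is None or loss < best:
            best = loss      # new best score: reset the counter
            waited = 0
        else:
            waited += 1      # no improvement in this evaluation
        if waited >= patience:
            return epoch
    return None


hypothetical_losses = [2.9, 2.5, 2.6, 2.7, 2.4]
print(epochs_until_stop(hypothetical_losses, patience=2))
print(epochs_until_stop([2.9, 2.5], patience=2))
```

<p style="font-size: 20px">
With a patience of 2, a run of only two evaluations can never trigger the stop, because at most one evaluation can fail to improve on the first.
</p>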

In [10]:
class EarlyStoppingCallback(TrainerCallback):
    def __init__(self, early_stopping_patience): # initialize the callback with the number of evaluations to wait
        self.early_stopping_patience = early_stopping_patience
        self.patience_counter = 0 # counts consecutive evaluations without improvement
        self.best_score = None # stores the best (lowest) evaluation loss seen so far

    def on_evaluate(self, args, state, control, metrics, **kwargs):
        current_score = metrics.get("eval_loss") # monitoring 'eval_loss'

        if self.best_score is None or current_score < self.best_score: # improvement: store the new best score and reset the patience counter
            self.best_score = current_score
            self.patience_counter = 0
        else:
            self.patience_counter += 1

        if self.patience_counter >= self.early_stopping_patience: # stop training once the patience counter reaches the limit
            control.should_training_stop = True


<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Model Initialization and Training Arguments</p>

In [11]:

model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device) # model Initialization

# training arguments for fine-tuning the model
training_args = TrainingArguments(
    output_dir='./medical_t5_finetuned', # directory for checkpoints and the saved model
    num_train_epochs=2, # limited to 2 epochs for the prototype; longer runs took too much time
    per_device_train_batch_size=2, # training batch size per device
    logging_dir='./t5_logs', # directory for logs
    logging_steps=10, # log the training loss every 10 steps
    save_steps=10000, # only used when save_strategy="steps"; ignored with the epoch strategy below
    save_strategy="epoch",  # save a checkpoint after each epoch
    evaluation_strategy="epoch",  # evaluate on the validation dataset after each epoch
)


<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Initialize Trainer</p>

In [12]:
# initialize the Trainer instance 
trainer = Trainer(
    model=model, # passing the model to be fine tuned
    args=training_args, # passing the training configurations
    data_collator=data_collator, # data batching and format
    train_dataset=train_dataset, # passing the training dataset
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)], # early stopping callback with a patience of 2 evaluations (one per epoch)
    eval_dataset=val_dataset # passing the validation dataset
)

<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Train Model</p>

In [13]:
trainer.train() # training the model

***** Running training *****
  Num examples = 49420
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 49420
  Number of trainable parameters = 60506624


Epoch,Training Loss,Validation Loss
1,3.5149,2.897026
2,3.3202,2.535566


***** Running Evaluation *****
  Num examples = 12356
  Batch size = 8
Saving model checkpoint to ./medical_t5_finetuned/checkpoint-24710
Configuration saved in ./medical_t5_finetuned/checkpoint-24710/config.json
Model weights saved in ./medical_t5_finetuned/checkpoint-24710/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 12356
  Batch size = 8
Saving model checkpoint to ./medical_t5_finetuned/checkpoint-49420
Configuration saved in ./medical_t5_finetuned/checkpoint-49420/config.json
Model weights saved in ./medical_t5_finetuned/checkpoint-49420/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=49420, training_loss=3.7003659997088665, metrics={'train_runtime': 64126.876, 'train_samples_per_second': 1.541, 'train_steps_per_second': 0.771, 'total_flos': 1.337718365749248e+16, 'train_loss': 3.7003659997088665, 'epoch': 2.0})
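
<p style="font-size: 20px">
The step counts in the log above follow from the number of training examples, the batch size, and the number of epochs. A small arithmetic sketch, assuming a single device and no gradient accumulation (as the log reports); the variable names are illustrative:
</p>

```python
import math

num_examples = 49420      # "Num examples" from the training log
batch_size = 2            # per_device_train_batch_size
num_epochs = 2            # num_train_epochs
train_runtime_s = 64126.876  # "train_runtime" from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print("steps per epoch:", steps_per_epoch)   # matches the first checkpoint name
print("total steps:", total_steps)           # matches global_step
print("steps per second:", round(total_steps / train_runtime_s, 3))
print("runtime in hours:", round(train_runtime_s / 3600, 1))
```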

<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Save the model</p>

In [15]:
trainer.save_model("./medical_t5_finetuned") # save the model

Saving model checkpoint to ./medical_t5_finetuned
Configuration saved in ./medical_t5_finetuned/config.json
Model weights saved in ./medical_t5_finetuned/pytorch_model.bin


<div style="width:100%;height:1px; background-color:black"></div>

<p ><center><u style="font-size: 28px; margin-top: 10px; font-weight: bold">Model Evaluation</u></center></p>

<p style="font-size: 20px">
Model evaluation is an important part of building machine learning models. It tests how well the model performs on data it has not seen during training, usually called test or validation data. This shows whether the model's outputs are correct and helps in understanding the kinds of mistakes it makes.
</p>

<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Load the model</p>

In [13]:
finetuned_model = T5ForConditionalGeneration.from_pretrained("./medical_t5_finetuned")
finetuned_model = finetuned_model.to(device)


loading configuration file ./medical_t5_finetuned/config.json
Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_s

<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold">Loss and Perplexity</p>

<p style="font-size: 20px">
Loss measures the difference between a model's predictions and the actual data; here it is the cross-entropy between the predicted token distribution and the reference tokens. Tracking the loss shows whether the model is improving and whether changes to the training setup are needed.
</p>

<p style="font-size: 20px">
    Perplexity measures how well a model predicts the next token and is computed as the exponential of the average cross-entropy loss. For a medical chatbot, a lower value means the model assigns higher probability to the reference responses, which usually goes with more fluent and coherent replies.
</p>
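
<p style="font-size: 20px">
The relation between loss and perplexity can be checked with plain Python. This is a minimal sketch; the function name perplexity_from_loss is illustrative, and the loss value is the one reported for T5-small in this notebook.
</p>

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is the exponential of the average cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)


t5_loss = 22.296912651062012   # average loss reported for T5-small in this notebook
print(perplexity_from_loss(t5_loss))   # on the order of 4.8e9
print(perplexity_from_loss(0.0))       # a perfect model has perplexity 1.0
```

<p style="font-size: 20px">
Because of the exponential, even moderate differences in loss translate into very large differences in perplexity.
</p>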

<p style="font-size: 20px; margin-top: 10px; font-weight: bold">ROUGE score</p>

<p style="font-size: 20px">
    The ROUGE score measures how much the predicted text overlaps with the reference text, reported as precision, recall, and F1-score over shared n-grams (ROUGE-1, ROUGE-2) or the longest common subsequence (ROUGE-L). It is particularly useful for tasks like summarization, where it shows how much key information the model includes in its output. For a medical chatbot, ROUGE indicates how closely the generated response matches a reference answer, which serves as a proxy for how relevant the response is.
</p>
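
<p style="font-size: 20px">
To show what the precision, recall, and F1 values mean, here is a simplified ROUGE-1 sketch based on whitespace tokens. It is an illustration only, not the implementation used by the metric loaded in the next cell (which applies its own tokenization and aggregation); the function name rouge_1 and the two example sentences are hypothetical.
</p>

```python
from collections import Counter


def rouge_1(prediction, reference):
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"p": precision, "r": recall, "f": f1}


scores = rouge_1(
    "take rest and drink water",
    "please take rest and drink plenty of water",
)
print(scores)
```

<p style="font-size: 20px">
In this example all 5 predicted words appear in the 8-word reference, so precision is 5/5 and recall is 5/8.
</p>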

In [15]:
rouge = load_metric("rouge")

def compute_metrics_on_subset(model, subset, tokenizer): # function definition to calculate evaluation metrics
    # evaluation mode
    model.eval()

    # initialize the variables for metrics storing
    total_loss = 0.0
    total_correct = 0
    total_count = 0

    # lists to collect all predicted and actual sentences for ROUGE score
    predictions = []
    actuals = []

    # loss criterion initialization
    criterion = CrossEntropyLoss()

    with torch.no_grad():
        for item in subset:
            # extract tensors and move them to the correct device
            input_ids = item['input_ids'].unsqueeze(0).to(device)
            attention_mask = item['attention_mask'].unsqueeze(0).to(device)
            decoder_input_ids = item['decoder_input_ids'].unsqueeze(0).to(device)
            labels = item['labels'].unsqueeze(0).to(device)

            # model forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids, labels=labels)
            loss, logits = outputs.loss, outputs.logits

            # accumulate the per-example loss
            total_loss += loss.item()

            # token-level accuracy over all label positions (padding positions included)
            preds = logits.argmax(dim=-1)
            total_correct += (preds == labels).sum().item()
            total_count += labels.numel()

            # decode predicted and reference token IDs to text for the ROUGE score
            predictions.append(tokenizer.decode(preds.squeeze()))
            actuals.append(tokenizer.decode(labels.squeeze()))

    # calculate perplexity from the average loss
    perplexity = torch.exp(torch.tensor(total_loss / len(subset)))

    # calculate accuracy
    accuracy = total_correct / total_count

    # calculate ROUGE scores
    rouge_scores = rouge.compute(predictions=predictions, references=actuals)

    rouge1 = {
        'r': rouge_scores['rouge1'].mid.recall,
        'p': rouge_scores['rouge1'].mid.precision,
        'f': rouge_scores['rouge1'].mid.fmeasure
    }
    rouge2 = {
        'r': rouge_scores['rouge2'].mid.recall,
        'p': rouge_scores['rouge2'].mid.precision,
        'f': rouge_scores['rouge2'].mid.fmeasure
    }
    rougel = {
        'r': rouge_scores['rougeL'].mid.recall,
        'p': rouge_scores['rougeL'].mid.precision,
        'f': rouge_scores['rougeL'].mid.fmeasure
    }

    return {
        "loss": total_loss / len(subset),
        "perplexity": perplexity.item(),
        "accuracy": accuracy,
        "rouge-1": rouge1,
        "rouge-2": rouge2,
        "rouge-l": rougel
    }

subset_length = 100  # subset from validation dataset
subset = [val_dataset[i] for i in range(subset_length)]
metrics = compute_metrics_on_subset(finetuned_model, subset, tokenizer)
print(metrics)




{'loss': 22.296912651062012, 'perplexity': 4824211456.0, 'accuracy': 0.004140625, 'rouge-1': {'r': 0.0189294489628268, 'p': 0.04288250505494566, 'f': 0.023984178275725195}, 'rouge-2': {'r': 0.0005700558142582217, 'p': 0.0011365764373910494, 'f': 0.0006913916670250918}, 'rouge-l': {'r': 0.011724279000924332, 'p': 0.026785558832261913, 'f': 0.014857218795111458}}


In [16]:
metrics

{'loss': 22.296912651062012,
 'perplexity': 4824211456.0,
 'accuracy': 0.004140625,
 'rouge-1': {'r': 0.0189294489628268,
  'p': 0.04288250505494566,
  'f': 0.023984178275725195},
 'rouge-2': {'r': 0.0005700558142582217,
  'p': 0.0011365764373910494,
  'f': 0.0006913916670250918},
 'rouge-l': {'r': 0.011724279000924332,
  'p': 0.026785558832261913,
  'f': 0.014857218795111458}}

<p style="font-size: 23px; margin-top: 10px; font-weight: bold"><u>Saving Results to the dataframe</u></p>

<p style="font-size: 20px">
  The results of each model are saved to a shared dataframe one at a time so that they can be compared in a separate notebook (medical_chatbot_eval_metrics.ipynb).
</p>

In [19]:
eval_metrics_results = {
    'model_name': 'T5-small',
    'loss': metrics['loss'],
    'perplexity': metrics['perplexity'],
    'accuracy': metrics['accuracy'],
    'rouge-1_r': metrics['rouge-1']['r'],
    'rouge-1_p': metrics['rouge-1']['p'],
    'rouge-1_f': metrics['rouge-1']['f'],
    'rouge-2_r': metrics['rouge-2']['r'],
    'rouge-2_p': metrics['rouge-2']['p'],
    'rouge-2_f': metrics['rouge-2']['f'],
    'rouge-l_r': metrics['rouge-l']['r'],
    'rouge-l_p': metrics['rouge-l']['p'],
    'rouge-l_f': metrics['rouge-l']['f']
}


In [20]:
eval_metrics_results

{'model_name': 'T5-small',
 'loss': 22.296912651062012,
 'perplexity': 4824211456.0,
 'accuracy': 0.004140625,
 'rouge-1_r': 0.0189294489628268,
 'rouge-1_p': 0.04288250505494566,
 'rouge-1_f': 0.023984178275725195,
 'rouge-2_r': 0.0005700558142582217,
 'rouge-2_p': 0.0011365764373910494,
 'rouge-2_f': 0.0006913916670250918,
 'rouge-l_r': 0.011724279000924332,
 'rouge-l_p': 0.026785558832261913,
 'rouge-l_f': 0.014857218795111458}

In [21]:
eval_metrics_results_dataframe = pd.read_csv('eval_metrics_results_dataframe.csv') # load the csv  
eval_metrics_results_dataframe

Unnamed: 0,model_name,loss,perplexity,accuracy,rouge-1_r,rouge-1_p,rouge-1_f,rouge-2_r,rouge-2_p,rouge-2_f,rouge-l_r,rouge-l_p,rouge-l_f
0,Encoder-Decoder LSTM,0.074382,1.05291,0.990507,0.985153,0.967791,0.976377,0.975947,0.946462,0.960935,0.985153,0.967791,0.976377
1,GPT-2 Medium,10.379869,32204.726562,0.009531,0.497472,0.526547,0.511501,0.176954,0.186899,0.181735,0.388866,0.411154,0.399379
2,Facebook/BART-base,0.393779,1.482573,0.920898,0.946877,0.954811,0.950963,0.911431,0.918899,0.915209,0.933774,0.941522,0.937755


In [22]:
eval_metrics_results_dataframe = pd.concat([eval_metrics_results_dataframe, pd.DataFrame([eval_metrics_results])], ignore_index=True) # append the record (DataFrame.append was removed in pandas 2.0)
eval_metrics_results_dataframe.to_csv('eval_metrics_results_dataframe.csv', index=False) # save to the same file
eval_metrics_results_dataframe

Unnamed: 0,model_name,loss,perplexity,accuracy,rouge-1_r,rouge-1_p,rouge-1_f,rouge-2_r,rouge-2_p,rouge-2_f,rouge-l_r,rouge-l_p,rouge-l_f
0,Encoder-Decoder LSTM,0.074382,1.05291,0.990507,0.985153,0.967791,0.976377,0.975947,0.946462,0.960935,0.985153,0.967791,0.976377
1,GPT-2 Medium,10.379869,32204.73,0.009531,0.497472,0.526547,0.511501,0.176954,0.186899,0.181735,0.388866,0.411154,0.399379
2,Facebook/BART-base,0.393779,1.482573,0.920898,0.946877,0.954811,0.950963,0.911431,0.918899,0.915209,0.933774,0.941522,0.937755
3,T5-small,22.296913,4824211000.0,0.004141,0.018929,0.042883,0.023984,0.00057,0.001137,0.000691,0.011724,0.026786,0.014857


<div style="width:100%;height:1px; background-color:black"></div>

<p style="font-size: 23px; margin-top: 10px; font-weight: bold"><u>Answer to user queries by using the model</u></p>

In [21]:
def generate_response(input_text, model, tokenizer, max_length=512):
    input_text = "Medical query: " + input_text # prepend a task-style prefix; note that the training sources above were built without it
    input_tensor = tokenizer.encode(input_text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(input_tensor, max_length=max_length, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


user_query = "what is nlp?"
response = generate_response(user_query, finetuned_model, tokenizer)
print(response)


Medical query: is lp?
