**An Automated Discharge Summary System Built for
Multiple Clinical Texts Using a Pre-trained DistilBART Model**

The code mounts Google Drive into the Colab environment so that files and folders stored in Drive can be read from the notebook.

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Installing and importing the necessary libraries

In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import BartTokenizer, TFBartForConditionalGeneration
from transformers import AutoTokenizer

The code reads the CSV file "sample_data_1000.csv" from the mounted Google Drive into a pandas DataFrame called "Dataset_for_BART_Model." It then uses Hugging Face's `load_dataset` function to load the same CSV file into a dataset dictionary called "ds," whose "train" split holds all rows.

In [None]:
Dataset_for_BART_Model = pd.read_csv('/content/drive/MyDrive/sample_data_1000.csv')
ds = load_dataset('csv',data_files = '/content/drive/MyDrive/sample_data_1000.csv')




  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
def convert_to_string(example):
  for key in example.keys():
    example[key] = str(example[key])
  return example

ds = ds.map(convert_to_string)
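
The effect of "convert_to_string" can be illustrated on a single record. The sketch below is not part of the notebook run; the record and its values are hypothetical, and the helper is rewritten as a dictionary comprehension with the same behavior.

```python
def convert_to_string(example):
    # Cast every field of one record to str, as the notebook's helper does.
    return {key: str(value) for key, value in example.items()}

# Hypothetical record: an integer ID, a missing note, and ordinary text.
record = {"HADM_ID": 100001, "ECG": None, "Concatenated_Text": "Sinus rhythm."}
converted = convert_to_string(record)
print(converted)
```

One consequence worth noting: a missing value is turned into the literal text "None" rather than an empty string, so empty notes contribute the token "None" to the model input unless they are filtered out beforehand.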



In [None]:
! git clone https://huggingface.co/philschmid/tf-distilbart-cnn-12-6.git

fatal: destination path 'tf-distilbart-cnn-12-6' already exists and is not an empty directory.


In [None]:
model_name = "philschmid/tf-distilbart-cnn-12-6"

The code imports the "BartConfig" class from Hugging Face's Transformers library and loads the configuration of the pre-trained checkpoint "philschmid/tf-distilbart-cnn-12-6." It then reads `max_position_embeddings`, the largest number of token positions the model can encode, and stores it in the variable "max_input_dimension." Any input longer than this limit has to be truncated before it reaches the model.

In [None]:
from transformers import BartConfig

# Create a BART model configuration
config = BartConfig.from_pretrained("philschmid/tf-distilbart-cnn-12-6")

# Access the maximum input dimension
max_input_dimension = config.max_position_embeddings

print("Maximum Input Dimension:", max_input_dimension)

Maximum Input Dimension: 1024


In [None]:
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})


Preprocessing dataset

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
to_remove_columns = ['HADM_ID', 'ECG', 'Echo', 'Nursing', 'Physician ', 'Radiology']

ds = ds["train"].remove_columns(to_remove_columns)


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:

max_input_length = 500 #400 #4000 #47000
max_target_length = 200 #200 #1000 #10700
#(10681.578, 46985.783)
from nltk.corpus import stopwords

def preprocess_function(examples):
    stop_words = set(stopwords.words('english')) # Get the English stop words

    # Define a function to remove stop words from a given text
    def remove_stop_words(text):
        tokens = text.split()
        filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
        return ' '.join(filtered_tokens)

    # Remove stop words from examples['Concatenated_Text']
    examples['Concatenated_Text'] = [remove_stop_words(text) for text in examples['Concatenated_Text']]

    # Remove stop words from examples['Discharge summary']
    examples['Discharge summary'] = [remove_stop_words(text) for text in examples['Discharge summary']]

    model_inputs = tokenizer(
        examples['Concatenated_Text'],
        max_length=max_input_length,
        truncation=True
    )

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples['Discharge summary'],
            max_length=max_target_length,
            truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    model_inputs["references"] = examples['Discharge summary']

    return model_inputs

tokenized_datasets = ds.map(preprocess_function, batched=True)



Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
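
The stop-word filtering step inside "preprocess_function" can be tried in isolation. The sketch below is illustrative only: it uses a small hand-written stop-word set (an assumption, so that it runs without downloading NLTK data) instead of NLTK's full English list, and the sample sentences are invented.

```python
# Small illustrative stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "was", "to", "with", "a", "of", "no"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    # Split on whitespace and drop tokens whose lowercase form is a stop word.
    tokens = text.split()
    kept = [token for token in tokens if token.lower() not in stop_words]
    return " ".join(kept)

admission = remove_stop_words("The patient was admitted to the ICU with a fever.")
finding = remove_stop_words("No evidence of pneumonia.")
print(admission)
print(finding)
```

The second sentence shows a side effect to keep in mind: NLTK's English list contains negations such as "no" and "not", so filtering can remove the negation from a clinical finding. Whether that is acceptable depends on how the summaries are used.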



The code sets the test size to 15% of the dataset. It then shuffles the tokenized dataset and splits it into training and test subsets in that proportion.

In [None]:
test_size=.15

processed_dataset = tokenized_datasets.shuffle().train_test_split(test_size=test_size)
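
The resulting subset sizes can be worked out by hand. The sketch below assumes that the test subset is rounded up to a whole number of examples and that the last, incomplete batch is kept; the batch size of 4 is the value set in the configuration cell that follows.

```python
import math

n_examples = 1000       # rows reported by the Map progress bar
test_size = 0.15
train_batch_size = 4
eval_batch_size = 4

n_test = math.ceil(n_examples * test_size)    # assumed rounding rule
n_train = n_examples - n_test
train_batches = math.ceil(n_train / train_batch_size)
eval_batches = math.ceil(n_test / eval_batch_size)
print(n_train, n_test, train_batches, eval_batches)  # 850 150 213 38
```

The 38 evaluation batches agree with the progress bar shown during the ROUGE evaluation at the end of the notebook.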

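The code sets the training hyperparameters (number of epochs, batch sizes, learning rate, weight decay rate, and number of warm-up steps), the output directory, and the Hugging Face Hub settings. It also enables mixed-precision training when "fp16" is set to True.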

In [None]:
from huggingface_hub import HfFolder
import tensorflow as tf


num_train_epochs = 5
train_batch_size = 4
eval_batch_size = 4
learning_rate = 5.6e-5
weight_decay_rate= 0.01
num_warmup_steps= 155
output_dir=model_name.split("/")[1]
hub_token = HfFolder.get_token() # or your token directly "hf_xxx"
hub_model_id = f'{model_name.split("/")[1]}-tradetheevent'
fp16= False #True

# Train in mixed-precision float16
# Comment this line out if you're using a GPU that will not benefit from this
if fp16:
  tf.keras.mixed_precision.set_global_policy("mixed_float16")



The code imports a model class called "TFAutoModelForSeq2SeqLM" from the Hugging Face Transformers library. It then loads a pre-trained sequence-to-sequence language model using the "from_pretrained" method and stores it in the variable "model."

In [None]:
from transformers import TFAutoModelForSeq2SeqLM
# load pre-trained model
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)


All model checkpoint layers were used when initializing TFBartForConditionalGeneration.

All the layers of TFBartForConditionalGeneration were initialized from the model checkpoint at philschmid/tf-distilbart-cnn-12-6.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


The code imports a data collator class called "DataCollatorForSeq2Seq" from the Hugging Face Transformers library. It creates a data collator instance named "data_collator" that dynamically pads the input and label sequences.

Then, it converts the processed training dataset ("processed_dataset['train']") and testing dataset ("processed_dataset['test']") into TensorFlow `tf.data.Dataset` objects. During conversion, it selects specific columns ("input_ids," "attention_mask," and "labels") from the dataset and shuffles the training dataset. It uses the previously created "data_collator" to collate the data into batches based on the specified batch sizes for training and evaluation ("train_batch_size" and "eval_batch_size," respectively).

In [None]:

from transformers import DataCollatorForSeq2Seq

# Data collator that will dynamically pad the inputs received, as well as the labels.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

# converting our train dataset to tf.data.Dataset
tf_train_dataset = processed_dataset["train"].to_tf_dataset(
   columns=["input_ids", "attention_mask", "labels"],
   shuffle=True,
   batch_size=train_batch_size,
   collate_fn=data_collator)

# converting our test dataset to tf.data.Dataset
tf_eval_dataset = processed_dataset["test"].to_tf_dataset(
   columns=["input_ids", "attention_mask", "labels"],
   shuffle=False,
   batch_size=eval_batch_size,
   collate_fn=data_collator)


You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
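
Dynamic padding means that each batch is padded only to the length of its own longest sequence. The sketch below imitates that behavior in plain Python and is not the library's implementation. It assumes a pad token ID of 1 (the value used by BART tokenizers) and a label padding value of -100, which the loss ignores; the token IDs are made up.

```python
PAD_TOKEN_ID = 1      # assumed pad ID for the BART tokenizer
LABEL_PAD_ID = -100   # label padding value that the loss ignores

def pad_batch(features):
    # Pad inputs and labels to the longest sequence in this batch only.
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch

# Two hypothetical tokenized examples of different lengths.
features = [
    {"input_ids": [0, 11, 12, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 13, 2], "labels": [0, 22, 23, 24, 2]},
]
batch = pad_batch(features)
print(batch)
```

Because the labels are padded with -100 rather than the pad token, the evaluation code later has to swap -100 back to the pad token ID before decoding the labels to text.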


The code imports a function called "create_optimizer" from the Hugging Face Transformers library. It creates an optimizer with weight decay, which is commonly used for fine-tuning machine learning models.

The number of training steps is the number of batches in the training dataset multiplied by the number of training epochs. The optimizer is initialized with the specified learning rate, weight decay rate, and number of warm-up steps.

After creating the optimizer and learning rate schedule, the model is compiled using the specified optimizer.

In [None]:
from transformers import create_optimizer


# create optimizer with weight decay
num_train_steps = len(tf_train_dataset) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=learning_rate,
    num_train_steps=num_train_steps,
    weight_decay_rate=weight_decay_rate,
    num_warmup_steps=num_warmup_steps,
)

# compile model
model.compile(optimizer=optimizer)
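
The shape of the learning rate schedule can be sketched in plain Python. The function below is an approximation under stated assumptions, not the library code: it assumes a linear warm-up from zero to the initial learning rate, followed by a linear decay to zero, which is how "create_optimizer" is understood to behave with its default settings. The value of 1065 training steps assumes 213 batches per epoch over 5 epochs.

```python
def learning_rate_at(step, init_lr=5.6e-5, num_train_steps=1065, num_warmup_steps=155):
    # Linear warm-up: the rate grows from 0 to init_lr over the warm-up steps.
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    # Linear decay: the rate falls from init_lr to 0 over the remaining steps.
    decay_steps = num_train_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)

for step in (0, 155, 610, 1065):
    print(step, learning_rate_at(step))
```

With these numbers the warm-up covers roughly 15% of training (155 of 1065 steps), after which the learning rate decreases steadily until the final step.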


In this code, the callbacks used during training are collected in the "callbacks" list. First, the Keras "TensorBoard" callback (imported under the alias "TensorboardCallback") is added so that training and evaluation metrics are logged to the "logs" folder inside the output directory. Then, an "if" condition checks whether the "hub_token" variable has a value. If it does (i.e., a Hugging Face Hub token is available), the "PushToHubCallback" from Hugging Face Transformers is added as well. This callback pushes the trained model and its tokenizer to the Hugging Face model hub, which makes the model easy to share and version. It takes the "output_dir," "tokenizer," "hub_model_id," and "hub_token" values defined earlier.

In [None]:
import os
from transformers.keras_callbacks import PushToHubCallback
from tensorflow.keras.callbacks import TensorBoard as TensorboardCallback

callbacks=[]

callbacks.append(TensorboardCallback(log_dir=os.path.join(output_dir,"logs")))
if hub_token:
  callbacks.append(PushToHubCallback(output_dir=output_dir,
                                     tokenizer=tokenizer,
                                     hub_model_id=hub_model_id,
                                     hub_token=hub_token))




Fit the model

In [None]:
train_results = model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=callbacks,
    epochs=num_train_epochs,
)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Calculating ROUGE

In [None]:
from datasets import load_metric
from tqdm import tqdm
import numpy as np
import nltk
nltk.download("punkt")
from nltk.tokenize import sent_tokenize


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
metric = load_metric("rouge")



In [None]:
def evaluate(model, dataset):
    all_predictions = []
    all_labels = []
    for batch in tqdm(dataset):
        # Generate summaries and decode them to text
        predictions = model.generate(batch["input_ids"])
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        # Replace the -100 label padding with the pad token so the labels can be decoded
        labels = batch["labels"].numpy()
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        # rougeLsum expects one sentence per line
        decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
        decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
        all_predictions.extend(decoded_preds)
        all_labels.extend(decoded_labels)
    # Compute ROUGE once over all accumulated predictions and references
    result = metric.compute(
        predictions=all_predictions, references=all_labels, use_stemmer=True
    )
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}

results = evaluate(model, tf_eval_dataset)


100%|██████████| 38/38 [1:03:26<00:00, 100.18s/it]
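
To make the reported numbers easier to interpret, the sketch below computes a simplified ROUGE-1 F-measure for one prediction and one reference. It is a rough illustration, not the "rouge_score" implementation: it tokenizes on whitespace, applies no stemming, and does no aggregation across examples. The two sentences are invented.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    # Count unigrams shared by prediction and reference (clipped by frequency).
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)   # share of predicted words that match
    recall = overlap / len(ref_tokens)       # share of reference words recovered
    return 2 * precision * recall / (precision + recall)

score = rouge1_f(
    "patient discharged home stable",
    "patient was discharged home in stable condition",
)
print(round(score * 100, 2))
```

Here all four predicted words appear in the seven-word reference, so precision is 1, recall is 4/7, and the F-measure is 8/11. ROUGE-2 applies the same idea to word pairs, and ROUGE-L to the longest common subsequence.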


ROUGE scores

In [None]:
print(results)
#1 epoch = {'rouge1': 34.0009, 'rouge2': 19.8162, 'rougeL': 27.9357, 'rougeLsum': 32.1849}

{'rouge1': 40.5229, 'rouge2': 27.8146, 'rougeL': 37.2549, 'rougeLsum': 39.8693}
