To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

[NEW] Llama-3.1 8b, 70b & 405b are trained on a crazy 15 trillion tokens with 128K long context lengths!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
%%capture
!pip install unsloth "xformers==0.0.28.post2"
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
%%capture
!pip install -U bitsandbytes transformers

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

# Use the Llama 3.2 3B Instruct model

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 500 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.9: Fast Llama patching. Transformers = 4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu124. CUDA = 7.5. CUDA Toolkit = 12.4.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

In [None]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): Llam

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
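
The number of weights LoRA adds can be checked by hand. A rank-`r` adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights (the `lora_A` and `lora_B` matrices). The sketch below applies that to the seven projection shapes printed for the base model; `lora_param_count` is an illustrative helper, not part of Unsloth or PEFT.

```python
# Projection shapes (in_features, out_features) copied from the model printout.
PROJECTIONS = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # "(0-27): 28 x LlamaDecoderLayer"
RANK = 16        # r passed to get_peft_model


def lora_param_count(projections, rank, num_layers):
    """Count LoRA weights: lora_A is (in x r) and lora_B is (r x out)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in projections.values())
    return per_layer * num_layers


trainable = lora_param_count(PROJECTIONS, RANK, NUM_LAYERS)
print(f"{trainable:,}")  # 24,313,856
```

This agrees with the "Number of trainable parameters" that the trainer reports when training starts.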


In [None]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
        (layers): ModuleList(
          (0-27): 28 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

<a name="Data"></a>
### Data Prep
The original Unsloth template uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the 52K-example original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). This notebook replaces that code section with its own data prep: a CSV of MRI reports with `clinical_information`, `findings` and `impression` columns.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
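
As a small illustration of the EOS note above, the sketch below fills a two-input prompt template and appends an end-of-sequence marker to the target. `build_training_text` is a hypothetical helper, the report text is abridged, and `"<|eot_id|>"` is only an assumed stand-in for `tokenizer.eos_token`.

```python
# Two-input template in the same layout this notebook uses for training.
PROMPT_TEMPLATE = "\n\n### Input:\n{}\n\n### Input2:\n{}\n\n### Response:\n{}"

# Assumed stand-in for tokenizer.eos_token, for illustration only.
EOS_TOKEN = "<|eot_id|>"


def build_training_text(clinical_information, findings, impression, eos_token=EOS_TOKEN):
    """Fill the template and terminate the target with the EOS token."""
    return PROMPT_TEMPLATE.format(clinical_information, findings, impression) + eos_token


example = build_training_text(
    "Metastatic breast cancer.",
    "There is mild mass effect upon the underlying brain parenchyma.",
    "Left temporal calvarium region metastasis.",
)
print(repr(example))
```

Without the trailing EOS token the model never sees an example of where a response ends, which is why generations can run on until `max_new_tokens` is reached.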

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%time
import pandas as pd
from datasets import Dataset

# Load the dataset
MRIdf = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone2/Data/MRI_full.csv')

# Remove rows with missing values in the relevant columns
MRIdf = MRIdf.dropna(subset=['clinical_information', 'findings', 'impression'])
MRIdf.info()


<class 'pandas.core.frame.DataFrame'>
Index: 27484 entries, 0 to 33900
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            27484 non-null  int64  
 1   clinical_information  27484 non-null  object 
 2   technique             27484 non-null  object 
 3   findings              27484 non-null  object 
 4   comparison            23491 non-null  object 
 5   impression            27484 non-null  object 
 6   report_id_x           27484 non-null  object 
 7   join                  27484 non-null  int64  
 8   report_id_y           27484 non-null  float64
 9   modality              27484 non-null  object 
 10  instruction           27484 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 2.5+ MB
CPU times: user 932 ms, sys: 110 ms, total: 1.04 s
Wall time: 2.76 s


# Goal: fine-tune the model to generate the impression from the clinical information and findings

In [None]:
%%time
import pandas as pd
from datasets import Dataset

#HAD SAME RESULT AS FINDINGS AND CLINICAL INFO USED AS COMBINED INPUTS


# Load only the required columns and drop any rows with missing values in those columns
MRIdf_subset = MRIdf[['findings', 'clinical_information', 'impression']].dropna().drop_duplicates()

# Ensure that each column is of type string
MRIdf_subset['findings'] = MRIdf_subset['findings'].astype(str)
MRIdf_subset['clinical_information'] = MRIdf_subset['clinical_information'].astype(str)
MRIdf_subset['impression'] = MRIdf_subset['impression'].astype(str)

# Convert the subset DataFrame to a Hugging Face Dataset
MRIdf_dataset = Dataset.from_pandas(MRIdf_subset)

# Define the Alpaca-style prompt template
alpaca_prompt = """

### Input:
{}

### Input2:
{}

### Response:
{}"""

# EOS token for end-of-sequence in generated text
EOS_TOKEN = tokenizer.eos_token

# Formatting function to structure input-output pairs
def formatting_prompts_func(examples):
    clinical_infos = examples["clinical_information"]
    findings_list = examples["findings"]
    impressions = examples["impression"]
    texts = []
    # Descriptive loop names also avoid shadowing the built-in `input`
    for clinical_info, findings, impression in zip(clinical_infos, findings_list, impressions):
        # Fill the prompt template and append the EOS token so the model learns where to stop
        text = alpaca_prompt.format(clinical_info, findings, impression) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Apply formatting function to dataset with batched processing
formatted_data = MRIdf_dataset.map(formatting_prompts_func, batched=True)

# Show the first few formatted examples
formatted_data[:2]


Map:   0%|          | 0/27484 [00:00<?, ? examples/s]

CPU times: user 763 ms, sys: 268 ms, total: 1.03 s
Wall time: 1.24 s


{'findings': ['There is a heterogeneous left supratentorial and infratentorial lesion with associated cystic compartments, some of which contain fluid-fluid levels suggestive of hemorrhage. There is also surrounding vasogenic edema in the left cerebellar hemisphere. There is marked effacement of the fourth ventricle and dilatation of the third and fourth ventricles, which have increased in size slightly. There is also marked mass effect upon the brainstem. There is also a similar contiguous heterogeneous lesion within the left temporal and occipital lobes with surrounding vasogenic edema. There is no herniation. There are skin-based fiducial markers.',
  'No evidence of restricted diffusion is seen. Patchy periventricular T2 hyperintensity is evident, a nonspecific finding which likely represents chronic small vessel ischemic disease. No intracranial hemorrhage or abnormal extra-axial fluid collections are seen. There is no evidence of parenchymal edema or mass effect. The ventricular 

In [None]:
%%time
from sklearn.model_selection import train_test_split

# Convert the formatted dataset to a DataFrame for train-test split, and reset index to avoid duplicate index errors
formatted_df = formatted_data.to_pandas().reset_index(drop=True)

# First split into train and temp (80-20)
train_data, temp_data = train_test_split(formatted_df, test_size=0.2, random_state=42)

# Split temp into validation and test (50-50, meaning 10-10 of original data)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Convert back to Hugging Face Datasets without additional indices
train_dataset = Dataset.from_pandas(train_data.reset_index(drop=True))
val_dataset = Dataset.from_pandas(val_data.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_data.reset_index(drop=True))

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")

Training samples: 21987
Validation samples: 2748
Test samples: 2749
CPU times: user 361 ms, sys: 275 ms, total: 635 ms
Wall time: 779 ms
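
The split sizes printed above can be reproduced with simple arithmetic. This is a minimal sketch, assuming that a fractional `test_size` is rounded up and the remainder goes to the other side of the split; `split_sizes` is an illustrative helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second), rounding the second share up."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_total = 27484
n_train, n_temp = split_sizes(n_total, 0.2)  # 80/20 split
n_val, n_test = split_sizes(n_temp, 0.5)     # halve the held-out 20%
print(n_train, n_val, n_test)  # 21987 2748 2749
```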


In [None]:
train_dataset[1]

{'findings': 'There is a heterogeneously enhancing, low T1 and T2 signal intensity left temporal region calvarium lesion that measures up to 5 cm in length with epidural extension that measures up to 7 mm in thickness. There is mild mass effect upon the underlying brain parenchyma. There also appears to be mild subgaleal extension of tumor. There is no evidence of intracranial hemorrhage or acute infarct. There is mild nonspecific periventricular cerebral white matter T2 hyperintensity, but no evidence of intraparenchymal enhancing lesions. The ventricles and basal cisterns are normal in size and configuration. There is no midline shift or herniation. The major cerebral flow voids are intact. The orbits are grossly unremarkable.',
 'clinical_information': 'Metastatic breast cancer.',
 'impression': 'Left temporal calvarium region metastasis with epidural extension measuring up to 7 mm in width. ',
 '__index_level_0__': 16548,
 'text': '\n\n### Input:\nMetastatic breast cancer.\n\n### I

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run 60 steps to keep things fast; for a full run, set `num_train_epochs=1` and remove `max_steps`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,  # Validation set; only used if an evaluation strategy is configured
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 16,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 10, # Set this (and remove max_steps) to train by epochs instead of steps.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map (num_proc=2):   0%|          | 0/21987 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/2748 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
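
To put the 60-step run in perspective, the sketch below works out the effective batch size and how much of the training set those steps cover. It is plain arithmetic on the values configured above; the trainer's own bookkeeping may round the steps per epoch slightly differently.

```python
import math

num_examples = 21987             # training samples from the split
per_device_batch_size = 16
gradient_accumulation_steps = 4
num_devices = 1                  # single Tesla T4
max_steps = 60

# Examples consumed per optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
# Optimizer steps needed to see every example once (rounded up).
steps_per_epoch = math.ceil(num_examples / effective_batch)
# Examples actually seen in the capped run.
examples_seen = max_steps * effective_batch
epoch_fraction = examples_seen / num_examples

print(effective_batch)          # 64
print(steps_per_epoch)          # 344
print(examples_seen)            # 3840
print(f"{epoch_fraction:.1%}")  # 17.5%
```

So the 60-step run sees fewer than one in five training reports; a full epoch would need several times as many steps.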


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
11.582 GB of memory reserved.


In [None]:
%%time
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 21,987 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 16 | Gradient Accumulation steps = 4
\        /    Total batch size = 64 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,2.5878
2,2.5468
3,2.5939
4,2.4607
5,2.4376
6,2.4215
7,2.2876
8,2.314
9,2.1244
10,2.0615


CPU times: user 22min 53s, sys: 14min, total: 36min 54s
Wall time: 37min 30s


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
%%time
# Save the trained model and tokenizer
model.save_pretrained('/content/drive/MyDrive/Colab Notebooks/Capstone2/classicsMRILlama3b60stepsclinc')
tokenizer.save_pretrained('/content/drive/MyDrive/Colab Notebooks/Capstone2/classicsMRILlama3b60stepsclinc')

#Save training arguments
#torch.save(training_params, '/content/drive/MyDrive/Colab Notebooks/Capstone2/classicsMRILlama8b60steps')

CPU times: user 343 ms, sys: 226 ms, total: 569 ms
Wall time: 1.18 s


('/content/drive/MyDrive/Colab Notebooks/Capstone2/classicsMRILlama3b60stepsclinc/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/Capstone2/classicsMRILlama3b60stepsclinc/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/Capstone2/classicsMRILlama3b60stepsclinc/tokenizer.json')

In [None]:
%%time
# Reload the saved model (a cell magic such as %%time must be the first line of the cell)
from transformers import AutoModelForCausalLM, AutoTokenizer

modelsaved = AutoModelForCausalLM.from_pretrained('/content/drive/MyDrive/Colab Notebooks/Capstone2/classicsMRILlama3b60stepsclinc')
tokenizersaved = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Colab Notebooks/Capstone2/classicsMRILlama3b60stepsclinc')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now default to True since model is quantized.


CPU times: user 17.7 s, sys: 5.96 s, total: 23.7 s
Wall time: 1min 42s


<a name="Inference"></a>
### Inference
Let's run the model! You can change the clinical information (first input) and the findings (second input) - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Prepare the inputs for the model
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Hip pain",
            "ACETABULAR LABRUM: Focal tear along the anterosuperior aspect at approximately 1-2 o'clock without evidence of significant displacement or additional labral findings throughout the remainder. ARTICULAR CARTILAGE AND BONE: Intact and without focal defects. No abnormal discrete signal abnormality. Marrow signal is consistent with red marrow replacement scattered throughout the pelvis and visualized portions of the proximal femur; unremarkable for age. SOFT TISSUES: No significant abnormality noted. ADDITIONAL",  # input
            ""  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt"
).to("cuda")

# Generate the outputs
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode the outputs
generated_texts = tokenizer.batch_decode(outputs)

# Print the generated text
for text in generated_texts:
    print(text)

<|begin_of_text|>

### Input:
Hip pain

### Input2:
ACETABULAR LABRUM: Focal tear along the anterosuperior aspect at approximately 1-2 o'clock without evidence of significant displacement or additional labral findings throughout the remainder. ARTICULAR CARTILAGE AND BONE: Intact and without focal defects. No abnormal discrete signal abnormality. Marrow signal is consistent with red marrow replacement scattered throughout the pelvis and visualized portions of the proximal femur; unremarkable for age. SOFT TISSUES: No significant abnormality noted. ADDITIONAL

### Response:
1. Focal labral tear at approximately 1-2 o'clock without evidence of significant displacement or additional labral findings throughout the remainder.2. No focal cartilage defect or abnormal signal abnormality.<|eot_id|>
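
The decoded text contains the whole prompt plus special tokens, so only the part after the response marker is the generated impression. Here is a minimal sketch of pulling it out; `extract_impression` is a hypothetical helper and the sample string is an abridged copy of the output above.

```python
RESPONSE_MARKER = "### Response:"
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|eot_id|>")  # tokens visible in the decoded output


def extract_impression(decoded_text):
    """Return only the text generated after the response marker."""
    # Use the last marker, in case the findings themselves contain one.
    _, _, response = decoded_text.rpartition(RESPONSE_MARKER)
    for token in SPECIAL_TOKENS:
        response = response.replace(token, "")
    return response.strip()


decoded = (
    "<|begin_of_text|>\n\n### Input:\nHip pain\n\n### Input2:\n"
    "ACETABULAR LABRUM: Focal tear ...\n\n### Response:\n"
    "1. Focal labral tear at approximately 1-2 o'clock.<|eot_id|>"
)
print(extract_impression(decoded))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` removes the special tokens at decode time, in which case only the marker split is needed.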


In [None]:
# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Prepare the inputs for the model
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "55 years Male (DOB:4/28/1960)Reason: lower back pain and LLE radiculopathy History: abovePROVIDER/ATTENDING NAME: MICHAEL J LEE MICHAEL J LEE",
            "Five lumbar type vertebral bodies are presumed to be present which are appropriate in overall alignment and height. The conus medullaris on sagittal imaging is grossly intact.At L5-S1 there is no significant compromise to spinal canal or neural foramina. There is loss of disc space height, disc desiccation and diffuse disc bulge at this level associated with some ligamentum flavum hypertrophy and some mild facet hypertrophy. Although the neural foraminal are narrowed, within the neural foramina the exiting nerve roots are surrounded by fat.At L4-5 there is loss of disc space height, diffuse disc bulge and bilateral ligamentum flavum hypertrophy associated with partial effacement of the fat at the right lateral recess and some encroachment of the nerve roots of the right lateral recess. There is is some crowding of the nerve roots within the thecal sac at this level. The nerve roots appear to be adherent to the walls of the thecal sac. There is mild to moderate spinal stenosis present at this level. Although the neural foraminal are narrowed, within the neural foramina the exiting nerve roots are surrounded by fat.At L3-4 there is loss of disc space height, diffuse disc bulge and disc desiccation associated with some ligamentum flavum hypertrophy. There is a disc extrusion present at this level with disc material extrudes behind the L4 vertebral body. This contributes significantly to the spinal stenosis at this level. There is narrowing of the spinal canal at this level and there is crowding of the nerve roots within the thecal sac. This suggests moderate spinal stenosis. Although the neural foraminal are narrowed, within the neural foramina the exiting nerve roots are surrounded by fat.At L2-3 there is no significant compromise to spinal canal or neural foramina. At L1-2 there is no significant compromise to spinal canal or neural foramina.",  # input
            ""  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt"
).to("cuda")

# Generate the outputs
outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)

# Decode the outputs
generated_texts = tokenizer.batch_decode(outputs)

# Print the generated text
for text in generated_texts:
    print(text)

<|begin_of_text|>

### Input:
55 years Male (DOB:4/28/1960)Reason: lower back pain and LLE radiculopathy History: abovePROVIDER/ATTENDING NAME: MICHAEL J LEE MICHAEL J LEE

### Input2:
Five lumbar type vertebral bodies are presumed to be present which are appropriate in overall alignment and height. The conus medullaris on sagittal imaging is grossly intact.At L5-S1 there is no significant compromise to spinal canal or neural foramina. There is loss of disc space height, disc desiccation and diffuse disc bulge at this level associated with some ligamentum flavum hypertrophy and some mild facet hypertrophy. Although the neural foraminal are narrowed, within the neural foramina the exiting nerve roots are surrounded by fat.At L4-5 there is loss of disc space height, diffuse disc bulge and bilateral ligamentum flavum hypertrophy associated with partial effacement of the fat at the right lateral recess and some encroachment of the nerve roots of the right lateral recess. There is is some c

In [None]:
test_dataset

Dataset({
    features: ['findings', 'clinical_information', 'impression', '__index_level_0__', 'text'],
    num_rows: 2749
})

In [None]:
test_dataset[1]

{'findings': 'CHEST:LUNGS AND PLEURA: Mild sub-segmental atelectasis again identified.MEDIASTINUM AND HILA: Pneumomediastinum is little changed. Stable, nonenlarged precarinal lymph nodes are present.CHEST WALL: Gas is seen within the soft tissues. A tracheostomy tube traverses an apparent soft tissue mass in the region of the thyroid gland. Tube is seen along the posterior wall of the trachea well above the carina.ABDOMEN:LIVER, BILIARY TRACT: No significant abnormality noted.SPLEEN: No significant abnormality noted.PANCREAS: No significant abnormality noted.ADRENAL GLANDS: No significant abnormality noted.KIDNEYS, URETERS: The kidneys appear unremarkable. Particular, there is no evidence for renal mass despite multiphasic exam.RETROPERITONEUM, LYMPH NODES: No significant abnormality noted.BOWEL, MESENTERY: No significant abnormality noted.BONES, SOFT TISSUES: No significant abnormality noted.OTHER: No significant abnormality noted.PELVIS:UTERUS, ADNEXAE: The expected location of the 

# Generate impressions for the test dataset from clinical information and findings with the fine-tuned model

In [None]:
%%time
import json
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assuming you have already loaded your model and tokenizer
#tokenizer = tokenizersaved
#model = modelsaved


# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Inference prompt template: same layout as the training template, but it ends after "### Response:" with no output slot
alpaca_prompt = "\n\n### Input:\n{}\n\n### Input2:\n{}\n\n### Response:\n"

# List to store the results
results = []

# Iterate over the test_dataset with tqdm for progress bar
for entry in tqdm(test_dataset, desc="Processing Entries", unit="entry"):
    clinical_info = entry['clinical_information']
    findings = entry['findings']
    actual_impression = entry['impression']

    # Prepare the inputs for the model
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                clinical_info,
                findings,
            )
        ],
        return_tensors="pt"
    ).to("cuda")

    # Generate the outputs
    outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)

    # Decode the outputs
    generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Extract the predicted impression
    predicted_impression = generated_texts[0].split("### Response:\n")[-1].strip()

    # Store the results
    results.append({
        'clinical_information': clinical_info,
        'findings': findings,
        'predicted_impression': predicted_impression,
        'actual_impression': actual_impression
    })

# Save the results to a JSON file
with open('slothllama_3b.json', 'w') as f:
    json.dump(results, f, indent=4)

print("Results saved to slothllama_3b.json")


Processing Entries: 100%|██████████| 2749/2749 [2:57:23<00:00,  3.87s/entry]

Results saved to slothllama_3b.json
CPU times: user 2h 55min 44s, sys: 23.3 s, total: 2h 56min 8s
Wall time: 2h 57min 23s
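
The loop above makes one `generate` call per report, so runtime grows linearly with the test set. The sketch below reproduces the rough total from the per-entry rate on the progress bar, and shows how many calls a batched run would need. The batch size of 12 is an assumed value for illustration; each batched call takes longer than a single-report call, so the saving is smaller than the drop in call count suggests.

```python
import math

num_entries = 2749        # test samples
seconds_per_entry = 3.87  # rate reported by the progress bar

total_seconds = num_entries * seconds_per_entry
hours, remainder = divmod(round(total_seconds), 3600)
minutes, seconds = divmod(remainder, 60)
print(f"~{hours}h {minutes}min {seconds}s")  # ~2h 57min 19s

# Hypothetical batch size, for illustration only.
batch_size = 12
num_batches = math.ceil(num_entries / batch_size)
print(num_batches)  # 230
```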





In [None]:
%%time
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import json
import gc
import os
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set environment variable to avoid fragmentation
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

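# Define the Alpaca-style prompt template. Keep it identical to the training template,
# including the blank lines, so inference prompts match what the model was fine-tuned on.
alpaca_prompt = """

### Input:
{}

### Input2:
{}

### Response:
{}"""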

# Function to extract the response part from the generated impression
def extract_response(generated_text):
    response_start = generated_text.find("### Response:")
    if response_start != -1:
        return generated_text[response_start + len("### Response:"):].strip()
    return generated_text.strip()

def generate_predictions(model, test_dataset, batch_size: int = 12, max_new_tokens: int = 100):
    """
    Generate predictions in batches on the GPU (or on the CPU if CUDA is unavailable).

    Uses the global `tokenizer`. `max_new_tokens` is the maximum number of
    new tokens to generate per report.
    """
    # Setup device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

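    # Set the pad token if it's not already set
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # Decoder-only models need left padding for batched generation,
    # so that the new tokens directly follow each prompt
    tokenizer.padding_side = "left"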

    # Move the model to the selected device (GPU or CPU)
    model.to(device)
    model.eval()

    # Create a DataLoader for the test_dataset to load data in batches
    dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    # Initialize a list to store each prediction as a dictionary
    all_predictions = []

    print("\nGenerating predictions...")

    try:
        # Iterate through each batch in the DataLoader and display progress with tqdm
        for batch in tqdm(dataloader):
            # Extract findings, clinical information, and actual impressions
            batch_findings = batch['findings']
            batch_clinical_information = batch['clinical_information']
            batch_actual_impressions = batch['impression']

            # Tokenize inputs
            inputs = tokenizer(
                [alpaca_prompt.format(clinical_info, finding, "") for clinical_info, finding in zip(batch_clinical_information, batch_findings)],
                padding=True,
                truncation=True,
                max_length=500,  # Adjust based on the findings' length
                return_tensors="pt",
                return_attention_mask=True  # Explicitly request attention mask
            )
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # Generate predictions using greedy search
            with torch.no_grad():
                outputs = model.generate(
                    input_ids=inputs['input_ids'],
                    attention_mask=inputs['attention_mask'],  # Pass attention mask
                    max_new_tokens=max_new_tokens,
                    num_return_sequences=1,
                    pad_token_id=tokenizer.pad_token_id,
                    do_sample=False,  # Greedy search: no sampling
                    temperature=None,  # Unused without sampling
                    num_beams=1,  # A single beam, i.e. plain greedy decoding
                    no_repeat_ngram_size=3,  # Block repeated 3-grams
                    # early_stopping and length_penalty only affect beam search, so they are omitted
                )

            # Decode the generated token sequences back into readable text
            generated_impressions = tokenizer.batch_decode(outputs, skip_special_tokens=True)

            # Extract the response part from the generated impressions
            generated_impressions = [extract_response(imp) for imp in generated_impressions]

            # Append each prediction as a dictionary to the list
            for clinical_info, finding, pred_imp, actual_imp in zip(batch_clinical_information, batch_findings, generated_impressions, batch_actual_impressions):
                all_predictions.append({
                    'clinical_information': clinical_info,
                    'findings': finding,
                    'predicted_impression': pred_imp,
                    'actual_impression': actual_imp
                })

            # Clear GPU cache with torch.cuda.empty_cache() and use garbage collection (gc.collect()) to manage memory efficiently
            del inputs, outputs, batch_clinical_information, batch_findings, batch_actual_impressions
            torch.cuda.empty_cache()
            gc.collect()

    except Exception as e:
        print(f"An error occurred during processing: {e}")
        raise  # Re-raise with the original traceback

    finally:
        # Clean up GPU memory
        torch.cuda.empty_cache()
        gc.collect()

    return all_predictions

# Set random seed for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

# Generate predictions
try:
    predictions = generate_predictions(
        model=model,
        test_dataset=test_dataset,
        batch_size=12,
        max_new_tokens=150
    )

    # Print a sample prediction
    print("\nSample Prediction:")
    sample_prediction = predictions[0]  # Assuming there is at least one prediction
    print("Clinical Information:", sample_prediction['clinical_information'])
    print("Findings:", sample_prediction['findings'])
    print("Predicted Impression:", sample_prediction['predicted_impression'])
    print("Actual Impression:", sample_prediction['actual_impression'])

    # Save predictions to file
    with open('mri_Llama3b_predictionsfullDATA_60stepsclinc.json', 'w') as f:
        json.dump(predictions, f, indent=2)

    print("\nPredictions saved to 'mri_Llama3b_predictionsfullDATA_60stepsclinc.json'")

except Exception as e:
    print(f"An error occurred: {str(e)}")


Using device: cuda

Generating predictions...


100%|██████████| 230/230 [1:15:54<00:00, 19.80s/it]



Sample Prediction:
Clinical Information: Clinical question: Prior MRI with questionable demyelinating disease. Please assess for new lesions. Signs and symptoms: Cognitive decline/depression.
Findings: Pre-and post-enhanced MRI:No diffusion weighted abnormalities. Examination redemonstrates stable few subcortical and periventricular foci of flair hyperintensity in bilateral cerebral hemispheres. There is no evidence of any new lesions or any detectable abnormal enhancement.Previously noted lesions within bilateral middle cerebellar peduncles are significantly less conspicuous on current the study and there is no evidence of any new or enhancing lesions in the posterior fossa.Unremarkable cerebral cortex, cortical sulci, ventricular system and the CSF spaces for patient's stated age. No detectable abnormal parenchymal or leptomeningeal enhancement.Unremarkable images through the orbits and including axial fat sat post enhanced series.Unremarkable calvarium, soft tissues of the scalp, p