<a href="https://colab.research.google.com/github/reshalfahsi/medbot-instruct-conversational/blob/master/MedBot_Medical_Chatbot_Instruction_Fine_Tuning_Conversational_Memory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MedBot: Medical Chatbot with Instruction Fine-Tuning and Conversational Memory**

## **Important Libraries**

### **Install**

In [1]:
!curl -LsSf https://astral.sh/uv/install.sh | sh

downloading uv 0.5.21 x86_64-unknown-linux-gnu
no checksums to verify
installing to /root/.local/bin
  uv
  uvx
everything's installed!

To add $HOME/.local/bin to your PATH, either restart your shell or run:

    source $HOME/.local/bin/env (sh, bash, zsh)
    source $HOME/.local/bin/env.fish (fish)


In [2]:
!uv pip install -q --no-cache-dir --system trl peft accelerate bitsandbytes
!uv pip install -q --no-cache-dir --system transformers evaluate datasets
!uv pip install -q --no-cache-dir --system pytelegrambotapi rouge_score
!uv pip install -q --no-cache-dir --system tableprint langchain langgraph
!uv pip install -q --no-cache-dir --system --no-deps xformers
!uv pip install -q --no-cache-dir --system "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

### **Import**

In [1]:
import torch

from unsloth import FastLanguageModel

from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from evaluate import load as load_evaluate

from tqdm.auto import tqdm

from typing import Any, Dict, List, Optional

from langchain_core.callbacks.manager import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.prompts import PromptTemplate

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import START, MessagesState, StateGraph

import telebot
from google.colab import userdata

import warnings
import logging
import random
import json
import os

import numpy as np
import tableprint as tp

warnings.filterwarnings("ignore")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## **Configuration**

In [2]:
EXPERIMENT_DIR = "experiment/"
# Creates both "experiment/" and "experiment/model/".
os.makedirs(os.path.join(EXPERIMENT_DIR, "model"), exist_ok=True)

In [3]:
BASE_MODEL_NAME = "unsloth/llama-3-8b-Instruct-bnb-4bit"
FINE_TUNED_MODEL_NAME = os.path.join(EXPERIMENT_DIR, "model")
DATASET_NAME = "Shekswess/medical_llama3_instruct_dataset_short"

In [4]:
SEED = int(np.random.randint(2147483647))

In [5]:
MAX_LENGTH = 2048
MAX_TOKEN = 256
SAMPLE_TEST_SIZE = 128

In [6]:
BATCH_PER_DEVICE = 2
GRADIENT_ACCUMULATION_STEP = 4
LEARNING_RATE = 2e-4
WARMUP_STEP = 5
MAX_EPOCH = 1
OPTIMIZER = "adamw_8bit"
LR_SCHEDULER = "linear"
WEIGHT_DECAY = 1e-2
LOGGING_STEP = 25
SAVE_STEP = 25
REPORT_TO = "none"

In [7]:
LORA_DROPOUT = 0.0
LORA_RANK = 16
LORA_ALPHA = 16
TARGET_MODULE = [
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
]

In [8]:
DEVICE = "cpu" if not torch.cuda.is_available() else "cuda"
DATASET_NUM_PROC = 2

In [9]:
METRIC_NAME = "rouge"
TASK_TYPE = "text-generation"

In [10]:
os.environ["WANDB_DISABLED"] = "true"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

## **Dataset**

In [11]:
train_dataset = load_dataset(DATASET_NAME, split="train")

## **Training**

### **Model**

In [14]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL_NAME,
    max_seq_length=MAX_LENGTH,
    dtype=torch.float16,
    load_in_4bit=True,
)

==((====))==  Unsloth 2025.1.5: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [15]:
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=TARGET_MODULE,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=SEED,
    use_rslora=False,
    use_dora=False,
    loftq_config=None,
)

Unsloth 2025.1.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
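
The LoRA settings above decide how many weights are trained. As a rough check, the sketch below counts the adapter parameters for rank 16 on the seven target modules. The layer dimensions are assumptions about the Llama-3-8B architecture (they are not stated in this notebook), and `lora_params` is a hypothetical helper, not an Unsloth or PEFT function.

```python
# Assumed Llama-3-8B dimensions (not stated in this notebook).
HIDDEN = 4096          # model width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336   # MLP inner width
NUM_LAYERS = 32        # matches the "patched 32 layers" message
RANK = 16              # LORA_RANK

# (input features, output features) of each adapted linear layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the count agrees with the "Number of trainable parameters" line printed when training starts.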


### **Main Event**

In [16]:
training_args = SFTConfig(
    per_device_train_batch_size=BATCH_PER_DEVICE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEP,
    warmup_steps=WARMUP_STEP,
    num_train_epochs=MAX_EPOCH,
    learning_rate=LEARNING_RATE,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=LOGGING_STEP,
    save_steps=SAVE_STEP,
    optim=OPTIMIZER,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type=LR_SCHEDULER,
    seed=SEED,
    output_dir=EXPERIMENT_DIR,
    dataset_text_field="prompt",
    max_seq_length=MAX_LENGTH,
    # The number of processes to use for multiprocessing
    dataset_num_proc=DATASET_NUM_PROC,
    # ``packing`` -> where multiple short examples are packed in the same input
    # sequence to increase training efficiency.
    packing=False,
    report_to=REPORT_TO,
)

In [17]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=training_args,
)


In [18]:
print("================[ Memory Statistics Before Training ]================\n")
gpu_statistics = torch.cuda.get_device_properties(0)
reserved_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
max_memory = round(gpu_statistics.total_memory / 1024**3, 2)
print(f"Reserved Memory: {reserved_memory}GB")
print(f"Max Memory: {max_memory}GB")


Reserved Memory: 5.61GB
Max Memory: 14.75GB


In [19]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 250
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
25,1.8552
50,1.4076
75,1.2654
100,1.2594
125,1.2224
150,1.2757
175,1.2687
200,1.2531
225,1.2679
250,1.2226
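
The step count in the banner follows from the configuration. A minimal sketch of the arithmetic, using the values printed above (the variable names are illustrative only):

```python
import math

num_examples = 2000      # "Num examples" in the banner
per_device_batch = 2     # BATCH_PER_DEVICE
grad_accum_steps = 4     # GRADIENT_ACCUMULATION_STEP
num_gpus = 1
epochs = 1               # MAX_EPOCH
logging_every = 25       # LOGGING_STEP

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum_steps * num_gpus

# The division is exact here, so the rounding mode does not matter.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Steps at which a training loss is logged.
logged_steps = list(range(logging_every, total_steps + 1, logging_every))

print(effective_batch, total_steps, len(logged_steps))
```

This gives an effective batch of 8, 250 optimizer steps, and ten logged loss values, matching the table.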


In [20]:
print("================[ Memory Statistics After Training ]================\n")
used_memory = round(torch.cuda.max_memory_allocated() / 1024**3, 2)
used_memory_lora = round(used_memory - reserved_memory, 2)
used_memory_percentage = round((used_memory / max_memory) * 100, 2)
used_memory_lora_percentage = round((used_memory_lora / max_memory) * 100, 2)
print(f"Used Memory: {used_memory}GB ({used_memory_percentage}%)")
print(
    f"Used Memory for training(fine-tuning) LoRA: {used_memory_lora}GB "
    f"({used_memory_lora_percentage}%)"
)


Used Memory: 7.49GB (50.78%)
Used Memory for training(fine-tuning) LoRA: 1.88GB (12.75%)


In [21]:
with open(
    os.path.join(FINE_TUNED_MODEL_NAME, "trainer_stats.json"),
    "w",
) as f:
    # TrainOutput is a named tuple; _asdict() keeps the field names in the JSON.
    json.dump(trainer_stats._asdict(), f, indent=4)

In [22]:
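model.save_pretrained(FINE_TUNED_MODEL_NAME)
# Save the tokenizer next to the adapter so the directory is self-contained.
tokenizer.save_pretrained(FINE_TUNED_MODEL_NAME)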

## **Testing**

In [12]:
""" Restart the session, then re-run the Import, Configuration, and Dataset cells first. """

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=FINE_TUNED_MODEL_NAME,
    max_seq_length=MAX_LENGTH,
    dtype=torch.float16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

print("\n=========================[ Model Ready! ]=========================\n")

==((====))==  Unsloth 2025.1.5: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.1.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.






In [13]:
evaluation_metric = load_evaluate(METRIC_NAME)
predictions = list()
references = list()

# Sample distinct indices. np.random.randint can repeat an index, which
# would silently shrink the test set below SAMPLE_TEST_SIZE.
selected_index = np.random.choice(
    len(train_dataset), size=SAMPLE_TEST_SIZE, replace=False
)

# NOTE: the samples come from the training split, so the scores measure
# fit to seen data rather than generalization.
for index in tqdm(selected_index):

    txt = train_dataset[int(index)]
    user_input = txt['input']
    system_output = txt['output']

    prompt = (
        "<|start_header_id|>system<|end_header_id|> Answer the question "
        "truthfully, you are a medical professional.<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|> "
        f"{user_input}"
        "<|eot_id|>"
    )

    inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)
    outputs = model.generate(
        **inputs,
        max_new_tokens=MAX_TOKEN,
        use_cache=True,
    )

    # Decode only the newly generated tokens, so the word "assistant"
    # inside a question cannot confuse the extraction.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
    answer = answer.strip().removeprefix("assistant").strip()

    predictions.append(answer)
    references.append(system_output)

results = evaluation_metric.compute(
    predictions=predictions,
    references=references,
)


In [14]:
table_data = list()
test_metrics = results.keys()
test_scores = results.values()

for metric, score in zip(test_metrics, test_scores):
    table_data.append([metric, score])

tp.table(table_data, ['Test Metric', 'Score'])

╭─────────────┬─────────────╮
│ Test Metric │       Score │
├─────────────┼─────────────┤
│      rouge1 │      0.3598 │
│      rouge2 │      0.1967 │
│      rougeL │     0.27685 │
│   rougeLsum │     0.28422 │
╰─────────────┴─────────────╯
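
To give a feel for what `rouge1` measures, here is a simplified sketch of a unigram-overlap F1 score. It is not the `rouge_score` implementation: it only lowercases and splits on whitespace, with no stemming or special tokenization, and it assumes the reported scores are F-measures. `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over overlapping unigrams."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Multiset intersection: each word counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("take rest and drink water", "drink water and rest well")
print(round(score, 2))  # 0.8
```

Four of the five words overlap on each side, so precision and recall are both 0.8. Word order is ignored by ROUGE-1; `rouge2` and `rougeL` are stricter because they reward matching bigrams and longest common subsequences.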


## **Inference**

In [11]:
""" Restart the session, then re-run the Import and Configuration cells first. """

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=FINE_TUNED_MODEL_NAME,
    max_seq_length=MAX_LENGTH,
    dtype=torch.float16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

print("\n=========================[ Model Ready! ]=========================\n")

==((====))==  Unsloth 2025.1.5: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.1.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.






### **LangChain**

In [118]:
class MedBotInstruct(LLM):
    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        """Run the LLM on the given input.

        Override this method to implement the LLM logic.

        Args:
            prompt: The prompt to generate from.
            stop: Stop words to use when generating. Model output is cut off at
                the first occurrence of any of the stop substrings.
                If stop tokens are not supported consider raising
                NotImplementedError.
            run_manager: Callback manager for the run.
            **kwargs: Arbitrary additional keyword arguments. These are usually
                passed to the model provider API call.

        Returns:
            The model output as a string. Actual completions SHOULD NOT include
            the prompt.
        """
        inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)
        outputs = model.generate(
            **inputs,
            max_new_tokens=MAX_TOKEN,
            use_cache=True,
        )

        # Decode only the newly generated tokens so the returned text
        # excludes the prompt, then drop the leading role name.
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)
        response = response.strip().removeprefix("assistant").strip()
        return response

    @property
    def _identifying_params(self) -> Dict[str, Any]:
        """Return a dictionary of identifying parameters."""
        return {
            # The model name allows users to specify custom token counting
            # rules in LLM monitoring applications (e.g., in LangSmith users
            # can provide per token pricing for their model and monitor
            # costs for the given LLM.)
            "model_name": "MedBotInstruct",
        }

    @property
    def _llm_type(self) -> str:
        """Get the type of language model used by this chat model.
        Used for logging purposes only."""
        return "custom"
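
Decoding with `skip_special_tokens=True` drops the header tokens but leaves the role name `assistant` as plain text, which is what the post-processing in `_call` keys on. The sketch below illustrates that idea on a plain string. `extract_reply` is a hypothetical helper, and the example string is an assumption about what the decoded text looks like.

```python
def extract_reply(decoded: str, marker: str = "assistant") -> str:
    """Return the text that follows the first role marker."""
    start = decoded.find(marker)
    if start == -1:
        # No marker found: fall back to the whole text instead of slicing
        # from index -1, which would keep only the last character.
        return decoded.strip()
    return decoded[start + len(marker):].strip()


decoded = (
    "system Answer the question truthfully, you are a medical professional."
    "user What helps a sore throat?"
    "assistant Warm fluids and rest."
)
print(extract_reply(decoded))
```

Searching for the first marker fails if the question itself contains the word "assistant", so restricting the search to the generated part of the text is the safer choice.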

In [119]:
# Define a new graph
workflow = StateGraph(state_schema=MessagesState)

llm = MedBotInstruct()

# Define the function that calls the model
def call_model(state: MessagesState):
    template = (
        """<|start_header_id|>system<|end_header_id|> Answer the question """
        """truthfully, you are a medical professional. Your responses are """
        """based on the user question and/or context.<|eot_id|>"""
        """<|start_header_id|>context<|end_header_id|> {context}<|eot_id|>"""
        """<|start_header_id|>user<|end_header_id|> """
        """{user_input}<|eot_id|>"""
    )
    prompt = PromptTemplate.from_template(template)
    chain = prompt | llm

    messages = [msg.content for msg in state['messages']]

    user_input = messages[-1]

    # Use up to four messages preceding the latest one as context.
    context = "\n".join(messages[-5:-1])

    response = chain.invoke(
        {
            'user_input': user_input,
            'context': context,
        }
    )
    return {"messages": response}


# Define the (single) node in the graph
workflow.add_edge(START, "model")
workflow.add_node("model", call_model)

# Add memory
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)
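
The conversational memory boils down to a sliding window over the stored messages: the latest message is the question, and up to four earlier ones become the context. A minimal sketch with plain strings in place of LangChain message objects (`build_context` is a hypothetical helper):

```python
def build_context(history, window=4):
    """Split a message history into (user_input, context).

    The last entry is the current question; up to `window` entries before
    it are joined with newlines, mirroring the messages[-5:-1] slice.
    """
    *previous, user_input = history
    context = "\n".join(previous[-window:])
    return user_input, context


history = ["m1", "m2", "m3", "m4", "m5", "m6", "m7"]
user_input, context = build_context(history)
print(repr(user_input), repr(context))

# The first message of a conversation has no context at all.
first_input, first_context = build_context(["hello"])
```

Because the checkpointer stores both user messages and model replies, a window of four covers roughly the last two exchanges.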

### **Telegram**

In [120]:
""" Store your Telegram bot's API token as a Colab secret named TOKEN. """

TOKEN = userdata.get("TOKEN")
bot = telebot.TeleBot(TOKEN)

In [121]:
@bot.message_handler(commands=['start'])
def start(message):
    """Send a message when the command /start is issued."""

    bot.reply_to(
        message,
        "MedBot: Medical Chatbot. Ask me anything about medicine.",
    )

In [122]:
@bot.message_handler(commands=['help'])
def help_command(message):  # renamed to avoid shadowing the built-in help()
    """Send a message when the command /help is issued."""

    bot.reply_to(
        message,
        "Just type and send a message, and I will reply.",
    )

In [123]:
@bot.message_handler(func=lambda m: True)
def reply_text(message):
    """Reply text input from the user message."""

    config = {"configurable": {"thread_id": message.chat.id}}

    user_input = message.text

    output = app.invoke({"messages": user_input}, config)
    response = output["messages"][-1].content

    bot.reply_to(message, response)

In [124]:
bot.polling()