In [2]:
from datasets import load_dataset

## Step 2: Load the CounselChat dataset

In [3]:
dataset = load_dataset("nbertagnolli/counsel-chat")

Repo card metadata block was not found. Setting CardData to empty.


## Step 3: Format for Instruction Tuning

In [4]:
def format_example(example):
    return {
        "text": f"### Instruction:\n{example['questionText']}\n\n### Response:\n{example['answerText']}"
    }

formatted_dataset = dataset['train'].map(format_example)
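
The template can be checked on a single record without downloading the dataset. The sketch below reuses `format_example` on a made-up record (the `sample` dictionary is hypothetical, not taken from CounselChat) to show the exact string the model will be trained on.

```python
def format_example(example):
    # Same template as the cell above: question -> instruction, answer -> response.
    return {
        "text": f"### Instruction:\n{example['questionText']}\n\n### Response:\n{example['answerText']}"
    }

# Hypothetical record, used only to illustrate the template.
sample = {
    "questionText": "How do I start counseling?",
    "answerText": "A good first step is to contact a licensed therapist.",
}

formatted = format_example(sample)["text"]
print(formatted)
```

Printing one formatted string like this is a quick way to confirm the section markers and blank lines land where expected before mapping over the full split.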

In [5]:
formatted_dataset[0]['text']  # Display the formatted text for the first example

'### Instruction:\nI have so many issues to address. I have a history of sexual abuse, I’m a breast cancer survivor and I am a lifetime insomniac.    I have a long history of depression and I’m beginning to have anxiety. I have low self esteem but I’ve been happily married for almost 35 years.\n   I’ve never had counseling about any of this. Do I have too many issues to address in counseling?\n\n### Response:\nIt is very common for\xa0people to have multiple issues that they want to (and need to) address in counseling.\xa0 I have had clients ask that same question and through more exploration, there is often an underlying fear that they\xa0 "can\'t be helped" or that they will "be too much for their therapist." I don\'t know if any of this rings true for you. But, most people have more than one problem in their lives and more often than not,\xa0 people have numerous significant stressors in their lives.\xa0 Let\'s face it, life can be complicated! Therapists are completely ready and eq

## Step 4: Tokenize the dataset

In [6]:
from transformers import AutoTokenizer

base_model = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(base_model)

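def tokenize(batch):
    tokens = tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=256
    )
    # Labels for causal LM: copy input_ids, but set padded positions to -100
    # so the loss ignores them.
    tokens["labels"] = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

# Drop all original columns so only the tokenizer outputs and labels remain.
tokenized_dataset = formatted_dataset.map(
    tokenize, batched=True, remove_columns=formatted_dataset.column_names
)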

Map: 100%|██████████| 2775/2775 [00:00<00:00, 3733.52 examples/s]
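
Because `padding="max_length"` fills short sequences with pad tokens, labels copied from `input_ids` cover those positions too unless they are set to -100, the value that PyTorch's cross-entropy loss skips by default. A toy illustration with made-up token IDs (the helper `mask_padding` and the IDs are hypothetical, not part of the notebook):

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]

# Made-up token IDs: three real tokens followed by two pad tokens (ID 0 here).
input_ids = [101, 57, 902, 0, 0]
attention_mask = [1, 1, 1, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)  # [101, 57, 902, -100, -100]
```

Masking on `attention_mask` rather than on the pad token ID keeps real tokens intact even when the pad token shares an ID with another special token.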


## Step 5: Load Qwen2.5 in 4‑bit QLoRA Mode

In [7]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Pass the 4-bit settings through a BitsAndBytesConfig object;
# the bare `load_in_4bit` argument is deprecated.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # NormalFloat4, the quantization type used by QLoRA
    bnb_4bit_use_double_quant=True,   # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.


## Step 6: Add LoRA Adapters

In [8]:
from peft import LoraConfig, get_peft_model

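lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # updates are scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # attention query and value projections
    lora_dropout=0.05,                    # dropout applied on the LoRA branch
    bias="none",                          # bias terms stay frozen
    task_type="CAUSAL_LM"
)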
model = get_peft_model(model, lora_config)
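
To get a feel for how small the adapters are, the trainable parameter count can be estimated by hand: each adapted linear layer gains two matrices, of shapes `r x in_features` and `out_features x r`. The sketch below assumes the Qwen2.5-1.5B dimensions written in the comments (hidden size 1536, 28 layers, a 256-wide value projection); these are assumptions that should be checked against `model.config`, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    # LoRA adds A (r x in_features) and B (out_features x r) next to each frozen weight.
    return r * (in_features + out_features)

# Assumed Qwen2.5-1.5B dimensions; verify against model.config before relying on them.
hidden_size = 1536
num_layers = 28
kv_dim = 256  # assumed: 2 key/value heads * head_dim 128
r, lora_alpha = 16, 32

per_layer = (
    lora_param_count(hidden_size, hidden_size, r)  # q_proj
    + lora_param_count(hidden_size, kv_dim, r)     # v_proj
)
total = per_layer * num_layers
scaling = lora_alpha / r

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling factor: {scaling}")
```

Calling `model.print_trainable_parameters()` on the PEFT model reports the actual count, which is the number to trust over this estimate.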


## Step 7: Fine‑Tune

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./qwen-mentalhealth-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=100,
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

trainer.train()


No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
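
The training arguments imply a fairly short run. A rough calculation of how much data 100 optimizer steps cover, assuming a single device (`num_devices` is an assumption; the example count comes from the `Map` progress bar in Step 4):

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 100
num_examples = 2775  # from the Map progress bar in Step 4
num_devices = 1      # assumption: a single device

# Examples contributing to each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples processed over the whole run.
examples_seen = effective_batch * max_steps
# Fraction of one pass over the training split.
epochs = examples_seen / num_examples
# Approximate optimizer steps needed for one full pass.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} (~{epochs:.2f} epochs)")
print(f"steps for one full epoch: ~{steps_per_epoch}")
```

Under these assumptions the run covers well under one epoch, so raising `max_steps` or switching to `num_train_epochs` may be worth considering if the loss is still falling at step 100.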
