<a href="https://colab.research.google.com/github/RiverGumSecurity/AILabs/blob/main/022_LargeLanguageModels/LAB_SFT_Counselor.ipynb" target="_new"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## An AI-LLM to provide psychological counseling!

<span style='color: cyan;'>
We all go through challenging times in life. Often we turn to professional help to get through them,
in the form of a therapist, counselor, or perhaps a priest or rabbi.
Maybe we could get ourselves a little bit of help from our friend the AI-LLM counselor!
</span>

<span style='color: red'>
This is obviously intended to be tongue-in-cheek. If you are really having mental health challenges, please
seek professional help as soon as you can.
</span>

<span style='color: cyan;'>
Through Hugging Face (huggingface.co), the AI community publishes not only a wonderful selection of trained neural network models
but also a diverse array of datasets. One dataset I discovered while authoring this course is
a fully anonymized collection of online counseling and therapy sessions.
</span>


### Dataset: "Amod/mental_health_counseling_conversations"

<span style='color: cyan;'>
This dataset is a collection of questions and answers sourced from two online counseling and therapy platforms. The questions cover a wide range of mental health topics, and the answers are provided by qualified psychologists. The dataset is intended to be used for fine-tuning language models to improve their ability to provide mental health advice.
</span>

### Your Mission!

* Continue using the base training model "unsloth/Meta-Llama-3.1-8B".
* Reconfigure this notebook to load the "Amod/mental_health_counseling_conversations" dataset.
    * Consider experimenting with the "split" parameter to change the default 80:20 split to perhaps 70:30.
* Modify the **format_prompts()** Python function to work with the data fields in the dataset. (Hint: find and view the dataset on Hugging Face.)
* Train your model and observe the training loss. Approximately 80 training steps should be sufficient.
* If your training loss does not fall below 1.0 during training, try modifying some training hyper-parameters and train again. Caution: change one parameter at a time and observe the impact on training memory consumption and performance. You can lower **max_steps** to end training early and check memory consumption on your GPU.
    * PEFT parameters
      * r: 8, 16, or 32 (the rank of the low-rank decomposition; a higher rank retains more information, a lower rank trains fewer parameters and runs faster)
      * lora_alpha: 8, 16, 32, 64, 128
    * SFT hyper-parameters
      * weight_decay: anywhere from 0.1 up to 0.5
      * per_device_train_batch_size: anywhere from 2 up to 8 (too high and you will hit memory constraints)
      * gradient_accumulation_steps: 4, 6, or 8
      * learning_rate: 2e-4 (you can try increasing it to perhaps 5e-4, but training can be very sensitive to this value)
* Test your resulting model by asking it a few questions.
* Optionally: use the "push_to_hub()" function with your own Hugging Face API key to save your work!
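
When choosing **max_steps**, it can help to work out how many examples a given number of optimizer steps actually covers. The following is a minimal sketch, assuming the 3,512-example train split and the batch settings (4 per device, 8 accumulation steps, 1 GPU) used later in this notebook; the helper name `examples_seen` is made up for illustration.

```python
import math

def examples_seen(per_device_batch, grad_accum, steps, num_gpus=1):
    """Number of training examples consumed after `steps` optimizer steps."""
    return per_device_batch * grad_accum * num_gpus * steps

dataset_size = 3512                               # train split size reported when loading
effective_batch = examples_seen(4, 8, steps=1)    # 4 x 8 x 1 = 32
steps_per_epoch = math.ceil(dataset_size / effective_batch)

for steps in (10, 80):
    seen = examples_seen(4, 8, steps)
    print(f"{steps} steps -> {seen} examples ({seen / dataset_size:.2f} epochs)")
print(f"One full epoch needs about {steps_per_epoch} steps")
```

Under these assumptions, 80 steps cover less than one pass over the data, so changing the batch size or accumulation steps also changes how much data the model sees for a fixed step count.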

In [1]:
import torch
import warnings
import os
import sys
import pathlib
try:
    import unsloth
except ImportError:
    # Install on first run, then import the freshly installed package
    !pip install unsloth
    import unsloth

# Suppress some noisy warnings that are just annoying
warnings.filterwarnings('ignore')
max_seq_length = 4096

# Setup Hugging Face Credentials.
HF_APIKEY = ''
if 'google.colab' in sys.modules:
    from google.colab import userdata
    HF_APIKEY = userdata.get('HF_APIKEY')
else:
    # Outside Colab, read the key from ~/.hfkey if that file exists
    hfkey_path = pathlib.Path.home() / '.hfkey'
    if hfkey_path.exists():
        HF_APIKEY = hfkey_path.read_text().strip()
if not HF_APIKEY:
    print('[-] ERROR: Cannot continue without Hugging Face API Key')
    sys.exit(1)
os.environ['HF_TOKEN'] = HF_APIKEY

model, tokenizer = unsloth.FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = None, load_in_4bit = True
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [2]:
# This code fixes a word-wrapping issue
# in Google Colab on TextStreamer output.
from warnings import warn
from IPython import get_ipython
from IPython.display import HTML, display
from transformers import TextStreamer

def enable_word_wrap(*args, **kwargs):
    display(HTML('''
    <style>
        pre {
            white-space: pre-wrap;
        }
    </style>
    '''))

def register_ipython_callback_once(event_name, cb):
    ev = get_ipython().events
    cb_unregs = [cb_old for cb_old in ev.callbacks[event_name] if cb_old.__name__ == cb.__name__]
    if len(cb_unregs) == 1 and cb.__code__ == cb_unregs[0].__code__:
        return
    for cb_old in cb_unregs:
        warn(f'Removing unexpected callback {cb_old}.')
        ev.unregister(event_name, cb_old)
    ev.register(event_name, cb)

register_ipython_callback_once('pre_run_cell', enable_word_wrap)

In [3]:
# PEFT MODEL
model = unsloth.FastLanguageModel.get_peft_model(
    model,
    r = 16,               ## LoRA rank; 8 or 16 is reasonable
    lora_alpha = 32,      ## LoRA scaling factor; often set to r or 2*r
    lora_dropout = 0.07,  ## typically 0.0 to 0.1; 0 enables Unsloth's fast patching

    ## best to leave below params alone
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"],
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.07.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.3.19 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
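
The rank `r` directly sets how many parameters LoRA trains: each adapted linear layer gains two small matrices with `r * (in_features + out_features)` weights in total. The sketch below estimates that count for the seven target modules above. The layer shapes are assumptions taken from the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), and `lora_trainable_params` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Llama 3.1 8B layer shapes (assumed from the published model config).
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32

# (in_features, out_features) for each LoRA target module
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(r):
    """LoRA adds an (in x r) and an (r x out) matrix per target module."""
    per_layer = sum(r * (i + o) for i, o in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

for r in (8, 16, 32):
    print(f"r = {r:2d}: {lora_trainable_params(r):,} trainable parameters")
```

For `r = 16` this estimate agrees with the "Trainable parameters = 41,943,040" figure printed in the training banner, and it shows that the count scales linearly with `r`.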


In [4]:
prompt = """
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{}

### Input:

### Response:
{}"""

def format_prompts(p):
    ## You will have to adjust these to
    ## the dataset you are training with...
    instructions = p["Context"]
    outputs      = p["Response"]
    texts = []
    for ins, outp in zip(instructions, outputs):
        text = prompt.format(ins, outp) + tokenizer.eos_token
        texts.append(text)
    return { "text" : texts }
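
Because `format_prompts()` is applied with `batched = True`, it receives a dictionary of lists (one list per column) rather than a single row. The following standalone sketch mimics that call shape without loading a model. The two sample rows are invented for illustration, and `EOS` is a stand-in for `tokenizer.eos_token`.

```python
PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{}

### Input:

### Response:
{}"""

EOS = "<|end_of_text|>"  # stand-in for tokenizer.eos_token (assumption)

def format_prompts_demo(batch, eos_token=EOS):
    """Same logic as format_prompts(), but with the EOS token passed in."""
    texts = []
    for ins, outp in zip(batch["Context"], batch["Response"]):
        texts.append(PROMPT.format(ins, outp) + eos_token)
    return {"text": texts}

# A made-up batch in the column-oriented shape that datasets.map() passes in
sample_batch = {
    "Context": ["I feel anxious before meetings.", "I cannot sleep at night."],
    "Response": ["That sounds stressful. Try a short breathing exercise.",
                 "A regular bedtime routine may help."],
}

out = format_prompts_demo(sample_batch)
print(out["text"][0])
```

Appending the end-of-sequence token to every example is what teaches the model to stop after its answer instead of generating indefinitely.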


In [5]:
from datasets import load_dataset

# Load the train split and apply the prompt template to every row
ds = load_dataset("Amod/mental_health_counseling_conversations", split = "train")
ds = ds.map(format_prompts, batched = True)

README.md:   0%|          | 0.00/2.82k [00:00<?, ?B/s]

combined_dataset.json:   0%|          | 0.00/4.79M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3512 [00:00<?, ? examples/s]

Map:   0%|          | 0/3512 [00:00<?, ? examples/s]

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

newmodel = model
trainer = SFTTrainer(
    model = model,
    processing_class = tokenizer,
    train_dataset = ds,
    args = TrainingArguments(
        ## hyper-params we usually can tune
        learning_rate = 2e-4,             ## between 1e-4 and 5e-4
        per_device_train_batch_size = 4,  ## 2, 4, or 8
        gradient_accumulation_steps = 8,  ## 4 up to 8
        num_train_epochs = 1,             ## ignored when max_steps is set
        max_steps = 10,                   ## short test run; raise to ~80 for the lab
        weight_decay = 0.01,              ## 0.01 is conservative and helps limit overfitting

        ## params below are typically not modified
        warmup_steps = 5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # "none" disables logging integrations; set e.g. "wandb" to enable one
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=12):   0%|          | 0/3512 [00:00<?, ? examples/s]
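
With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate ramps up from zero and then decays linearly back to zero at `max_steps`. The function below is a minimal sketch of that shape, assuming the usual linear-warmup, linear-decay formula; `linear_lr` is a hypothetical helper, and the exact step at which the trainer applies each value may differ slightly.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=10):
    """Learning rate at optimizer step `step` (0-based) under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in range(11):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

With only 10 total steps, half of the run is spent warming up, which is one reason a very short run says little about how a longer one will behave.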

In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3,512 | Num Epochs = 1 | Total steps = 10
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,2.5116
2,2.4795
3,2.4786
4,2.4764
5,2.3989
6,2.2553
7,2.0304
8,2.1641
9,2.0787
10,2.0381
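
The mission sets a target of training loss below 1.0. A quick check of the ten logged values above, written as a small sketch with the numbers copied from the table, shows how far this short run is from that target.

```python
# Training loss per step, copied from the table above
losses = [2.5116, 2.4795, 2.4786, 2.4764, 2.3989,
          2.2553, 2.0304, 2.1641, 2.0787, 2.0381]
TARGET = 1.0

best_step = min(range(len(losses)), key=losses.__getitem__) + 1  # steps are 1-based
best_loss = losses[best_step - 1]
total_drop = losses[0] - losses[-1]
reached_target = best_loss < TARGET

print(f"Lowest loss {best_loss} at step {best_step}")
print(f"Loss dropped by {total_drop:.4f} over {len(losses)} steps")
print("Target reached" if reached_target else "Target not reached: train longer or tune")
```

The loss is still falling after 10 steps, so increasing **max_steps** is a reasonable first change before touching other hyper-parameters.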


In [8]:
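# NOTE: this cell runs after training, so the "start" reading below already
# equals the peak and the "for training" figures print as 0.0. To measure the
# training overhead, capture start_gpu_memory in a cell before trainer.train().
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")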

#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory/max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
7.725 GB of memory reserved.
80.3052 seconds used for training.
1.34 minutes used for training.
Peak reserved memory = 7.725 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 19.529 %.
Peak reserved memory for training % of max memory = 0.0 %.
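
The "for training" figures above read 0.0 because the baseline and the peak were sampled at the same moment. The sketch below isolates the arithmetic in a hypothetical helper, `memory_report`, so the effect of the baseline is easy to see. The 6.0 GB baseline in the second call is an invented value for illustration only, not a measurement from this run.

```python
def memory_report(baseline_gb, peak_gb, total_gb):
    """Summarize peak GPU memory and the share attributable to training."""
    training_gb = round(peak_gb - baseline_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }

# Baseline sampled after training (as in the cell above): the difference is zero
after = memory_report(baseline_gb=7.725, peak_gb=7.725, total_gb=39.557)
print(after)

# Baseline sampled before training (6.0 GB is a made-up example value)
before = memory_report(baseline_gb=6.0, peak_gb=7.725, total_gb=39.557)
print(before)
```

In the notebook, the baseline would come from calling `torch.cuda.max_memory_reserved()` in a cell that runs before `trainer.train()`.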


In [None]:
## Save your model?
# model.push_to_hub('l3-counselor', private=True)
# tokenizer.push_to_hub('l3-counselor', private=True)

In [11]:
from transformers import TextStreamer

unsloth.FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt.format('''
I am not really going to do well today. Lots of change at work. What should I do?
''', "")  # the empty second argument leaves the response slot for the model to fill
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 4096)

<|begin_of_text|>
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:

I am not really going to do well today. Lots of change at work. What should I do?


### Input:

### Response:
I am sorry to hear about the stress you are experiencing at work.  It is understandable that you would be feeling down.  One of the things that helps me is to find a hobby that I enjoy.  It gives me something to look forward to and helps me feel like I have a life outside of work.  I hope that you find something that helps you feel better.<|end_of_text|>
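
The streamed output echoes the whole prompt along with special tokens. If only the model's answer is needed, it can be cut out of the decoded string. This is a minimal sketch that assumes the prompt template above and the two special tokens visible in the output; `extract_response` is a hypothetical helper.

```python
def extract_response(generated,
                     marker="### Response:",
                     special_tokens=("<|begin_of_text|>", "<|end_of_text|>")):
    """Return the text after the response marker, without special tokens."""
    if marker in generated:
        # Keep only what follows the first occurrence of the marker
        generated = generated.split(marker, 1)[1]
    for tok in special_tokens:
        generated = generated.replace(tok, "")
    return generated.strip()

sample = (
    "<|begin_of_text|>\n### Instruction:\n\nWhat should I do?\n\n"
    "### Input:\n\n### Response:\n"
    "I am sorry to hear about the stress you are experiencing at work.<|end_of_text|>"
)
print(extract_response(sample))
```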
