# 🧠 Chain-of-Thought Fine-Tuning for CBT Mental Health Assistant

This notebook walks through the full workflow:

- Install dependencies
- Load and process datasets (`counsel-chat`, `mental-disorders`, `Psych8k`, `Amod`)
- Format data into Chain-of-Thought prompts
- Fine-tune a chat model (TinyLlama)
- Run inference to generate therapeutic responses

In [1]:
#STEP 1: Install Dependencies
# scikit-learn is needed for train_test_split in Step 6.
#!pip install transformers datasets accelerate trl peft bitsandbytes scikit-learn

In [2]:
#STEP 2: Import Libraries
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
import json
import random


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/opt/anaconda3/envs/PyEnv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/anaconda3/envs/PyEnv/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/PyEnv/lib/python3.9/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/opt/anaconda3/envs/PyEnv/lib/python3.9/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
 

In [3]:
#STEP 3: Load Datasets
counsel_chat = load_dataset("nbertagnolli/counsel-chat", split="train")
mental_disorders = load_dataset("Kanakmi/mental-disorders", split="train")
psych8k = load_dataset("json", data_files="Alexander_Street_shareGPT_2.0.json", split="train")
amod = load_dataset("Amod/mental_health_counseling_conversations", split="train")

Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.


In [4]:
#STEP 4: Define Format Functions
def format_disorder_entry(example):
    label_map = {
        '0': 'Borderline Personality Disorder (BPD)',
        '1': 'Bipolar Disorder',
        '2': 'Depression',
        '3': 'Anxiety',
        '4': 'Schizophrenia',
        '5': 'General Mental Illness'
    }
    disorder = label_map[str(example['label'])]
    return {
        "messages": [
            {"role": "system", "content": "You are a kind, supportive CBT counselor who identifies the disorder and then responds with empathy and structured reasoning in a GenZ tone in a friendly way."},
            {"role": "user", "content": example['text']},
            # Blank lines separate the heading, the explanation, and the result.
            {"role": "assistant", "content": f"**Step 1: Mental Disorder Detection**\n\nAfter reading the user's message, the symptoms suggest:\n\n**Detected Disorder:** {disorder}"}
        ]
    }
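
The lookup `label_map[str(example['label'])]` raises a `KeyError` if a row carries a label outside 0 to 5. A minimal sketch of a more forgiving lookup follows; the helper name `label_to_disorder` and the choice of fallback string are assumptions made for illustration, not part of the notebook.

```python
# Same label-to-name mapping as in format_disorder_entry.
LABEL_MAP = {
    "0": "Borderline Personality Disorder (BPD)",
    "1": "Bipolar Disorder",
    "2": "Depression",
    "3": "Anxiety",
    "4": "Schizophrenia",
    "5": "General Mental Illness",
}


def label_to_disorder(label, default="General Mental Illness"):
    """Map a dataset label (int or str) to a disorder name.

    Hypothetical helper: unknown labels fall back to `default`
    instead of raising KeyError.
    """
    # str() mirrors the notebook's lookup, so 3 and "3" behave the same.
    return LABEL_MAP.get(str(label), default)


print(label_to_disorder(3))
print(label_to_disorder("7"))
```

Whether a silent fallback or a hard failure is preferable depends on how much the label column is trusted; the fallback keeps the formatting loop running on unexpected rows.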

In [5]:
def format_counsel_entry(example):
    question = example.get('questionText') or ""
    answer = example.get('answerText') or ""
    topic = (example.get('topic') or "General").capitalize()

    return {
        "messages": [
            {"role": "system", "content": "You are a kind, supportive CBT counselor who identifies the disorder and then responds with empathy and structured reasoning in a GenZ tone in a friendly way."},
            {"role": "user", "content": question},
            # Blank lines keep each step heading apart from its body text.
            {"role": "assistant", "content": f"**Step 1: Disorder Context**\n\nTopic tagged: {topic}\n\n**Step 2: Counselor's Empathetic Guidance**\n\n{answer.strip()}"}
        ]
    }
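
Because `format_counsel_entry` falls back to an empty string when `questionText` is missing, some records can end up with an empty user turn. The following is a minimal sketch of a filter that could be applied before training; `has_valid_turns` is a hypothetical helper, and the sample records are made up for the demonstration.

```python
def has_valid_turns(record):
    """Return True if the record has system/user/assistant turns with non-empty text."""
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != ["system", "user", "assistant"]:
        return False
    # Reject turns whose content is missing, not a string, or only whitespace.
    return all(
        isinstance(m.get("content"), str) and bool(m["content"].strip())
        for m in messages
    )


good = {
    "messages": [
        {"role": "system", "content": "You are a kind, supportive CBT counselor."},
        {"role": "user", "content": "I can't stop worrying about everything."},
        {"role": "assistant", "content": "**Step 1: Disorder Context**"},
    ]
}
# Same record, but with an empty user turn (what a missing questionText produces).
empty_user = {
    "messages": [good["messages"][0], {"role": "user", "content": ""}, good["messages"][2]]
}

kept = [r for r in (good, empty_user) if has_valid_turns(r)]
print(len(kept))
```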


In [6]:
def format_cbtech_entry(example):
    return {
        "messages": [
            {"role": "system", "content": "You are a kind, supportive CBT counselor who identifies the disorder and then responds with empathy and structured reasoning in a GenZ tone in a friendly way."},
            {"role": "user", "content": example['input']},
            # Blank lines keep each step heading apart from its body text.
            {"role": "assistant", "content": f"**Step 1: CBT Technique Introduction**\n\nInstruction: {example['instruction']}\n\n**Step 2: Reflective CBT Response**\n\n{example['output'].strip()}"}
        ]
    }

In [7]:
def format_amod_entry(example):
    return {
        "messages": [
            {"role": "system", "content": "You are a kind, supportive CBT counselor who identifies the disorder and then responds with empathy and structured reasoning in a GenZ tone in a friendly way."},
            {"role": "user", "content": example['Context']},
            # Blank lines keep each step heading apart from its body text.
            {"role": "assistant", "content": f"**Step 1: CBT Framing of the Problem**\n\nThe user's concern indicates an issue with self-worth and cognitive distortions.\n\n**Step 2: Cognitive Reframing**\n\n{example['Response'].strip()}"}
        ]
    }

In [8]:
#STEP 5: Create Combined Chain-of-Thought Dataset
all_samples = []

for ex in mental_disorders.select(range(5000)):
    all_samples.append(format_disorder_entry(ex))
for ex in counsel_chat.select(range(2000)):
    all_samples.append(format_counsel_entry(ex))
for ex in psych8k.select(range(5000)):
    all_samples.append(format_cbtech_entry(ex))
for ex in amod.select(range(3000)):
    all_samples.append(format_amod_entry(ex))

random.seed(42)  # fixed seed so the shuffle order is reproducible
random.shuffle(all_samples)
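
The hard-coded sizes in `select(range(...))` assume every dataset has at least that many rows; selecting indices past the end of a dataset is expected to fail. A minimal sketch of a guard follows. `capped_range` is a hypothetical helper, and the row counts passed to it in the demonstration are made-up numbers, not the real dataset sizes.

```python
def capped_range(n_rows, cap):
    """Indices for the first `cap` rows, or for all rows if fewer are available."""
    return range(min(cap, n_rows))


# Requested sample sizes, taken from the loops in this step.
requested = {
    "mental_disorders": 5000,
    "counsel_chat": 2000,
    "psych8k": 5000,
    "amod": 3000,
}
# Upper bound on the combined dataset size if every selection succeeds.
print(sum(requested.values()))

# Demonstration with a made-up row count smaller than the cap.
print(len(capped_range(n_rows=1200, cap=2000)))
```

In the notebook this could be used as `counsel_chat.select(capped_range(len(counsel_chat), 2000))`, and likewise for the other three datasets.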

In [9]:
# Write the combined dataset as JSON Lines: one record per line
with open("cot_cbt_dataset.jsonl", "w") as f:
    for item in all_samples:
        json.dump(item, f)
        f.write("\n")
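
A quick way to confirm that a file written this way is valid JSON Lines is to read it back and compare. The sketch below does that on a temporary file; `write_jsonl`, `read_jsonl`, and the one-record sample are illustrative names and data, not part of the notebook.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [{"messages": [{"role": "user", "content": "I feel stuck.\nWhat now?"}]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    write_jsonl(path, records)
    restored = read_jsonl(path)

# Newlines inside strings are escaped by json.dumps, so each record stays on one line.
print(restored == records)
```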

In [10]:
#STEP 6: Split into Train/Validation/Test and Save
from sklearn.model_selection import train_test_split

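# Step 6.1: Split the data. Hold out 10% for test, then 10% of the
# remainder for validation (roughly 81% train, 9% validation, 10% test).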
train_val, test = train_test_split(all_samples, test_size=0.1, random_state=42)
train, validation = train_test_split(train_val, test_size=0.1, random_state=42)

# Step 6.2: Organize into a dictionary
split_data = {
    "train": train,
    "validation": validation,
    "test": test
}

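# Step 6.3: Save all splits to one file, pretty-printed for human inspection.
# Note: indent=2 spreads each record over several lines, so this file is not
# strict JSON Lines; use cot_cbt_dataset.jsonl for line-by-line loading.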
with open("cot_cbt_dataset_preetify.jsonl", "w") as f:
    for split, data in split_data.items():
        for item in data:
            json.dump({"split": split, "data": item}, f, indent=2)
            f.write("\n")
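
The two chained 10% splits do not give an 80/10/10 partition. The sketch below works out the resulting sizes in plain Python, assuming that `train_test_split` rounds the held-out count up and gives the remainder to the training side; `split_sizes` is a hypothetical helper, and 15000 is the combined size only if all four selections in Step 5 succeed.

```python
import math


def split_sizes(n, test_frac=0.1, val_frac=0.1):
    """Sizes of (train, validation, test) for two chained hold-out splits."""
    n_test = math.ceil(n * test_frac)           # first split: hold out the test set
    n_train_val = n - n_test
    n_val = math.ceil(n_train_val * val_frac)   # second split: hold out validation
    return n_train_val - n_val, n_val, n_test


train_n, val_n, test_n = split_sizes(15000)
print(train_n, val_n, test_n)
print(train_n / 15000)  # share of the data that ends up in the training split
```

If a strict 80/10/10 partition is wanted, the second call would need `test_size=1/9` rather than `0.1`.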