🧱 1️⃣ Install Dependencies

In [2]:
!uv add transformers datasets accelerate bitsandbytes peft sentencepiece

[2K[2mResolved [1m198 packages[0m [2min 150ms[0m[0m                                       [0m
[2K[37m⠙[0m [2mPreparing packages...[0m (0/3)                                                   
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/3)--------------[0m[0m     0 B/57.31 MiB           [1A
[2K[1A[37m⠙[0m [2mPreparing packages...[0m (0/3)--------------[0m[0m     0 B/57.31 MiB           [1A
[2mpeft                [0m [32m[2m------------------------------[0m[0m     0 B/493.06 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/3)--------------[0m[0m     0 B/57.31 MiB           [2A
[2mpeft                [0m [32m[2m------------------------------[0m[0m     0 B/493.06 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/3)--------------[0m[0m     0 B/57.31 MiB           [2A
[2mpeft                [0m [32m[2m------------------------------[0m[0m     0 B/493.06 KiB
[2K[2A[37m⠙[0m [2mPreparing packages...[0m (0/3)--------------

🧠 2️⃣ Import Libraries

In [3]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
import torch

🧩 3️⃣ Inspect and Load the Data

In [None]:
! mkdir -p data && gsutil -m cp gs://tusharwagh.appspot.com/data/combined_dataset.json data

In [4]:
from datasets import load_dataset

# Load the JSON file downloaded in the previous cell
dataset = load_dataset("json", data_files="data/combined_dataset.json")

# Inspect the dataset structure and the first example
print(dataset)
print(dataset["train"][0])

DatasetDict({
    train: Dataset({
        features: ['Context', 'Response'],
        num_rows: 3512
    })
})
{'Context': "I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n   I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n   How can I change my feeling of being worthless to everyone?", 'Response': "If everyone thinks you're worthless, then maybe you need to find new people to hang out with.Seriously, the social context in which a person lives is a big influence in self-esteem.Otherwise, you can go round and round trying to understand why you're not worthless, then go back to the same crowd and be knocked down again.There are many inspirational messages you can find in social media. \xa0Maybe read some of the ones which state that no person is worthless, and that everyone has a good purpose to their life.Also, since our

🧩 4️⃣ Prepare for Language Modeling

Combine context and response into one conversational string so the model learns counselor-style replies.

In [5]:
def format_conversation(example):
    return {
        "text": f"Client: {example['Context'].strip()}\nCounselor: {example['Response'].strip()}"
    }

dataset = dataset["train"].map(format_conversation)
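
To make the resulting `text` field concrete, here is a minimal sketch that applies the same formatting to a made-up record. The sample record is hypothetical, and the `or ""` guard for missing fields is an assumption added for illustration; the notebook's own function does not include it.

```python
def format_conversation(example):
    # Treat a missing (None) field as an empty string; this guard is an
    # illustrative assumption, not part of the original notebook cell.
    context = (example.get("Context") or "").strip()
    response = (example.get("Response") or "").strip()
    return {"text": f"Client: {context}\nCounselor: {response}"}


# Hypothetical record using the dataset's two column names.
sample = {
    "Context": "  I have trouble sleeping before exams.  ",
    "Response": "That sounds stressful. Let's look at your evening routine.\n",
}

formatted = format_conversation(sample)
print(formatted["text"])
```

Each training example therefore becomes a two-line string: a `Client:` line followed by a `Counselor:` line, which is the same shape the test prompt uses at generation time.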


🦙 5️⃣ Choose Base Llama Model

You can choose any open-source variant:

    Model                            Parameters   Notes
    "meta-llama/Meta-Llama-3-8B"     8B           Best balance of quality/performance
    "meta-llama/Llama-2-7b-hf"       7B           Older but lightweight
    "meta-llama/Meta-Llama-3-70B"    70B          For multi-GPU clusters

For Colab, stick with the 8B or 7B versions.
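
A rough way to see why the smaller models suit Colab is to estimate how much memory the weights alone occupy at different precisions. The sketch below uses the nominal parameter counts from the table; the `int4` entry is an idealized 0.5 bytes per parameter that ignores quantization overhead. These figures cover weights only: gradients, optimizer state, and activations add considerably more during training.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int4": 0.5}

# Nominal parameter counts from the table above.
MODELS = {
    "meta-llama/Meta-Llama-3-8B": 8e9,
    "meta-llama/Llama-2-7b-hf": 7e9,
    "meta-llama/Meta-Llama-3-70B": 70e9,
}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, num_params in MODELS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

By this estimate a 7B model needs about 14 GB in float16 for weights alone, while the 70B model needs about 140 GB, which is why it is listed for multi-GPU clusters.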

⚙️ 6️⃣ Load Model

In [6]:
model_name = "NousResearch/Llama-2-7b-chat-hf"  # publicly available Llama 2 7B chat checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad_token by default

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # let accelerate place layers on GPU, CPU, or disk
    torch_dtype=torch.float16,  # half precision halves weight memory vs. float32
)

`torch_dtype` is deprecated! Use `dtype` instead!
Fetching 2 files: 100%|██████████| 2/2 [09:37<00:00, 288.73s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.08s/it]
Some parameters are on the meta device because they were offloaded to the disk and cpu.


🧾 7️⃣ Tokenize Data

In [7]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])



Map: 100%|██████████| 3512/3512 [00:00<00:00, 8843.15 examples/s]


⚙️ 8️⃣ Data Collator

In [8]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
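
With `mlm=False`, the collator pads each batch and builds `labels` as a copy of `input_ids`, replacing every position that holds the pad token id with `-100` so the loss ignores it. The pure-Python sketch below imitates that behavior on made-up token ids; `collate_causal_lm` is a hypothetical helper, right-padding is assumed for simplicity, and the real collator returns tensors rather than lists.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def collate_causal_lm(batch, pad_token_id):
    """Right-pad token id lists and build causal-LM labels (simplified)."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        pad = [pad_token_id] * (max_len - len(ids))
        padded = ids + pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * len(pad))
        # Labels copy the inputs; every pad-id position is excluded from the loss.
        labels.append([IGNORE_INDEX if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Made-up token ids; 2 plays the role of both EOS and pad, as in the notebook.
batch = collate_causal_lm([[5, 6, 7, 2], [8, 9]], pad_token_id=2)
print(batch["labels"])
```

Because the notebook sets `pad_token` to `eos_token`, any genuine EOS token in a sequence shares the pad id and is masked out of the loss as well, as the first sequence in the example shows. The shift between inputs and targets is handled inside the model, not by the collator.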

🧮 9️⃣ Training Configuration

In [None]:
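from transformers import TrainingArguments  # re-import in case the import cell was not run in this session

training_args = TrainingArguments(
    output_dir="./llama3_counseling_domain",
    per_device_train_batch_size=1,   # one sequence per step to limit GPU memory
    gradient_accumulation_steps=32,  # effective batch size of 32 per device
    learning_rate=1e-5,
    num_train_epochs=2,
    fp16=True,                       # float16 mixed precision (the gradient scaler expects float32 weights)
    save_strategy="epoch",           # write a checkpoint at the end of each epoch
    logging_steps=50,
    report_to="none",                # disable external experiment trackers
    gradient_checkpointing=True,     # recompute activations to trade compute for memory
    optim="adamw_torch",             # PyTorch AdamW optimizer
)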
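
The batch-size settings above interact: a per-device batch of 1 with 32 accumulation steps behaves like a batch of 32 per optimizer update. The sketch below works out the resulting step counts for the 3512-row dataset, assuming a single GPU. The counts are approximate, since the exact handling of the final partial accumulation window depends on the Trainer version.

```python
import math

num_examples = 3512          # num_rows reported for the train split
per_device_batch_size = 1    # per_device_train_batch_size
grad_accum_steps = 32        # gradient_accumulation_steps
num_epochs = 2               # num_train_epochs
num_devices = 1              # assumption: a single GPU

effective_batch = per_device_batch_size * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch (approx.): {steps_per_epoch}")
print(f"total optimizer steps (approx.): {total_steps}")
```

With roughly 220 optimizer steps in total and `logging_steps=50`, the run would log the training loss only about four times.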

🚀 🔟 Start Training

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

trainer.train()


💾 1️⃣1️⃣ Save the Fine-Tuned Model

In [None]:
trainer.save_model("./llama3_counseling_domain")
tokenizer.save_pretrained("./llama3_counseling_domain")


🧪 1️⃣2️⃣ Test Generation

In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model="./llama3_counseling_domain", tokenizer=tokenizer)

prompt = "Client: I feel anxious and worthless lately. What should I do?\nCounselor:"
print(pipe(prompt, max_new_tokens=100)[0]["generated_text"])
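
By default the text-generation pipeline returns the prompt followed by the continuation, and the model may keep writing further `Client:` turns. A small post-processing step can isolate the counselor's reply. The sketch below is one possible approach; `build_prompt` and `extract_reply` are hypothetical helpers, and `fake_output` is a stand-in string, not real model output.

```python
def build_prompt(client_message):
    """Format a message the same way as the training text."""
    return f"Client: {client_message.strip()}\nCounselor:"


def extract_reply(generated_text, prompt):
    """Return only the counselor's turn from the pipeline output."""
    # Drop the echoed prompt if the output starts with it.
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    # Stop at the next "Client:" turn if the model keeps writing the dialogue.
    return reply.split("\nClient:")[0].strip()


prompt = build_prompt("I feel anxious and worthless lately. What should I do?")
# Stand-in for pipe(prompt, ...)[0]["generated_text"]; not real model output.
fake_output = prompt + " It may help to talk to someone you trust.\nClient: Thanks."
print(extract_reply(fake_output, prompt))
```

Building the prompt with the same `Client:` / `Counselor:` layout used during training keeps inference consistent with what the model saw in fine-tuning.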
