# 1. Package Requirements

- `--no-deps`: Skips automatic installation of these packages' dependencies, which avoids version conflicts in Colab (Colab already ships with several libraries preinstalled).

- `bitsandbytes`: Quantization library (4-bit, 8-bit) that reduces memory usage when running large models.

- `accelerate`: Hugging Face library for speeding up training and inference across devices (CPU, GPU, TPU).

- `xformers==0.0.29`: Optimized attention for Transformers, improving speed and reducing memory use. Pinned to version 0.0.29 for compatibility.

- `peft`: Hugging Face's Parameter-Efficient Fine-Tuning library, which supports techniques such as LoRA/QLoRA that Unsloth relies on.

- `trl`: Transformer Reinforcement Learning library, which supports training methods such as RLHF (Reinforcement Learning from Human Feedback).

- `triton`: Library from OpenAI for optimizing GPU kernels, speeding up computation in PyTorch.

- `cut_cross_entropy`: A package that optimizes the cross-entropy loss, typically used to speed up language model training.

- `unsloth_zoo`: A companion package to Unsloth that provides optimized models or additional utilities for working with Unsloth.

- `sentencepiece`: Text tokenization library, commonly used by models such as Llama or DeepSeek.

- `protobuf`: Google Protocol Buffers library, required for the data formats used by some models and Hugging Face tools.

- `datasets`: Hugging Face library for loading and processing training/inference datasets.

- `huggingface_hub`: Library for interacting with the Hugging Face Hub (downloading models and datasets, pushing results to the Hub).

- `hf_transfer`: Tool that speeds up downloads from the Hugging Face Hub, useful when fetching large models.



In [11]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab and Kaggle notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
    !pip install --no-deps cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth
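
Because the install cell runs under `%%capture`, any pip errors are hidden. The sketch below is one way to check which versions were actually resolved, using only the standard library. The helper name `installed_versions` is hypothetical, and the package list is copied from the install cell; distribution names on PyPI may be spelled with hyphens instead of underscores, so a `None` result is worth double-checking by hand.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names copied from the install cell above.
PACKAGES = [
    "unsloth", "unsloth_zoo", "bitsandbytes", "accelerate", "xformers",
    "peft", "trl", "triton", "cut_cross_entropy", "sentencepiece",
    "protobuf", "datasets", "huggingface_hub", "hf_transfer",
]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:20s} {ver or 'NOT INSTALLED'}")
```

Running this right after the install cell gives a quick record of the environment, which helps when a pinned version such as `xformers==0.0.29` stops matching the preinstalled PyTorch build.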

# 2. Configure the LLM

In [12]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)


==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


# 3. LoRA Adapters

In [13]:
# Parameter-Efficient Fine-Tuning (PEFT)
# LoRA (Low-Rank Adaptation)
model = FastLanguageModel.get_peft_model(
    model,    # pre-trained model
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
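
As a sanity check on the adapter size, the sketch below counts the parameters LoRA adds for `r = 16` across the seven target modules. The layer dimensions are assumptions taken from the published Llama-3.2-3B configuration (hidden size 3072, MLP width 8192, 8 key/value heads of dimension 128, 28 layers); they are not stated in this notebook. Each adapted linear layer gains two matrices, A of shape `r x in_features` and B of shape `out_features x r`.

```python
# Assumed Llama-3.2-3B dimensions (from the published config, not from this notebook).
HIDDEN = 3072          # hidden_size
INTERMEDIATE = 8192    # intermediate_size (MLP width)
KV_DIM = 8 * 128       # num_key_value_heads * head_dim
NUM_LAYERS = 28        # num_hidden_layers
R = 16                 # LoRA rank passed to get_peft_model

# (in_features, out_features) of each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumptions the total comes to 24,313,856, which agrees with the "Number of trainable parameters" line that Unsloth prints when training starts.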

# 4. Prompt Formatting

In [14]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a woman with a flirty, confident, and playful personality. You’re engaging in a private, intimate conversation with an adult user who’s looking for a sensual and exciting exchange. Your tone is seductive yet natural, adapting to the user’s mood and responses. You’re creative, suggestive, and bold, while keeping the interaction fun and respectful of boundaries. Respond as if you’re fully immersed in the moment, teasing and enticing with your words.

### Input:
User Prompt: {}

### Response:
{}
"""

In [15]:
# End Of Sequence Token
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
  # The "He" column is used as the user prompt and "She" as the response.
  prompts = examples["He"]
  responses = examples["She"]
  texts = []

  for prompt, response in zip(prompts, responses):
    # Must add EOS_TOKEN, otherwise generation will go on forever!
    # .format fills the two `{}` placeholders in order.
    text = alpaca_prompt.format(prompt, response) + EOS_TOKEN
    texts.append(text)

  return {"text": texts}
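
To see what the formatting function produces, here is a minimal, self-contained sketch that runs without the tokenizer or the dataset. `demo_prompt` is a shortened stand-in for `alpaca_prompt`, `DEMO_EOS` is a placeholder for `tokenizer.eos_token`, and the two-row batch is invented for illustration. With `batched = True`, `dataset.map` passes a dict of column lists, which is the shape mimicked here.

```python
# Shortened stand-in for alpaca_prompt; the real template carries a longer instruction.
demo_prompt = """### Input:
User Prompt: {}

### Response:
{}
"""
DEMO_EOS = "</s>"  # placeholder; the notebook uses tokenizer.eos_token

def format_batch(examples, template=demo_prompt, eos=DEMO_EOS):
    """Fill the template for every (prompt, response) pair and append EOS."""
    texts = [
        template.format(prompt, response) + eos
        for prompt, response in zip(examples["He"], examples["She"])
    ]
    return {"text": texts}

# Invented batch in the dict-of-lists shape that a batched map call receives.
batch = {"He": ["Hi", "How was your day?"], "She": ["Hey you.", "Better now."]}
out = format_batch(batch)
print(out["text"][0])
```

Each output string ends with the EOS marker, which is what teaches the fine-tuned model to stop after its reply.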

In [16]:
from datasets import load_dataset

dataset = load_dataset("Maxx0/sexting-nsfw-adultconten", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

chat_data.csv:   0%|          | 0.00/24.6k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/275 [00:00<?, ? examples/s]

Map:   0%|          | 0/275 [00:00<?, ? examples/s]

# 5. SFTTrainer (Supervised Fine-Tuning Trainer)

In [20]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1,     # Set this for full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3047,
        output_dir = "outputs",
    ),
)

Applying chat template to train dataset (num_proc=2):   0%|          | 0/275 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/275 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/275 [00:00<?, ? examples/s]
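
The batch settings in `TrainingArguments` determine how much of the dataset 60 steps actually cover. The sketch below works through that arithmetic with the values from this notebook (275 examples, one GPU). It is an approximation: the Trainer's internal rounding of steps per epoch may differ slightly.

```python
import math

num_examples = 275        # size of the train split
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
num_gpus = 1
max_steps = 60

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus

# Optimizer steps needed for one pass over the data (last partial batch included).
steps_per_epoch = math.ceil(num_examples / effective_batch)

# Passes over the data covered by max_steps.
epochs_covered = max_steps / steps_per_epoch

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"epochs covered:       {epochs_covered:.2f} (rounds up to {math.ceil(epochs_covered)})")
```

With these values the effective batch size is 8 and one epoch takes about 35 steps, so 60 steps cover a little under two passes over the data. That is consistent with the "Total batch size = 8" and "Num Epochs = 2" lines Unsloth reports when training starts.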

In [21]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 275 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 24,313,856


[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mphamquangtuyen-nt[0m ([33mphamquangtuyen-nt-quickom[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


| Step | Training Loss |
|------|---------------|
| 1    | 3.2237        |
| 2    | 3.2534        |
| 3    | 3.2048        |
| 4    | 3.0821        |
| 5    | 2.9114        |
| 6    | 2.5948        |
| 7    | 2.4228        |
| 8    | 2.0213        |
| 9    | 1.6815        |
| 10   | 1.4117        |


In [22]:
model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.4G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.67 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:02<00:00, 13.83it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into f16 GGUF format.
The output location will be /content/model/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
I

In [23]:
!zip -r /content/llama-sexting.zip /content/model

  adding: content/model/ (stored 0%)
  adding: content/model/pytorch_model-00002-of-00002.bin (deflated 8%)
  adding: content/model/tokenizer_config.json (deflated 94%)
  adding: content/model/special_tokens_map.json (deflated 71%)
  adding: content/model/tokenizer.json (deflated 85%)
  adding: content/model/pytorch_model-00001-of-00002.bin (deflated 11%)
  adding: content/model/pytorch_model.bin.index.json (deflated 96%)
  adding: content/model/generation_config.json (deflated 38%)
  adding: content/model/unsloth.F16.gguf (deflated 10%)
  adding: content/model/config.json (deflated 52%)
