In [2]:
!pip uninstall -y numpy
!pip install "numpy>=1.26.0,<2.0"

Found existing installation: numpy 2.2.6
Uninstalling numpy-2.2.6:
  Successfully uninstalled numpy-2.2.6
Defaulting to user installation because normal site-packages is not writeable
Collecting numpy<2.0,>=1.26.0
  Using cached numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: numpy
Successfully installed numpy-1.26.4


In [1]:
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
2025-11-09 10:29:43.612323: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  from pkg_resources import parse_version  # type: ignore


🦥 Unsloth Zoo will now patch everything to make training faster!


In [2]:
print("Loading model...")
max_seq_length = 2048
dtype = None  
load_in_4bit = False

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    device_map="auto"
)
print("✓ Model loaded")

Loading model...
==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.495 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.44it/s]


✓ Model loaded


In [3]:
print("\nConfiguring LoRA adapters...")
model = FastLanguageModel.get_peft_model(
    model,
    r = 12, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 12,
    lora_dropout = 0.1,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 42,
)
print("✓ LoRA configured")

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.



Configuring LoRA adapters...


Unsloth 2025.11.2 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


✓ LoRA configured
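
The rank and target modules chosen above determine how many weights are actually trained. A minimal sketch of that arithmetic follows, assuming the publicly documented Llama 3.1 8B shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers). These constants and the helper name `lora_param_count` are illustrative and are not read from the loaded model. The result should agree with the "Trainable parameters" figure that the trainer banner prints during training.

```python
# Sketch: count LoRA parameters for rank r on the seven targeted projections.
# The shapes are assumed Llama 3.1 8B dimensions, not values read from the model.
HIDDEN = 4096          # hidden size
KV_DIM = 1024          # 8 key/value heads x 128 head dim
INTERMEDIATE = 14336   # MLP intermediate size
NUM_LAYERS = 32
RANK = 12              # r from get_peft_model above

# (in_features, out_features) for each target module
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, projections, num_layers):
    """Each adapted linear layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in projections.values())
    return per_layer * num_layers


total = lora_param_count(RANK, PROJECTIONS, NUM_LAYERS)
print(f"{total:,}")  # 31,457,280
```

With `lora_alpha = 12` and `r = 12`, the adapter scaling factor `lora_alpha / r` is 1.0, so the low-rank update is added to the frozen weights unscaled.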


In [4]:
print("\nLoading dataset...")
dataset = load_dataset("json", data_files="mem0_finetune_5k.jsonl", split="train")
print(f"✓ Loaded {len(dataset)} samples")

train_test_split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]
print(f"✓ Dataset loaded: {len(dataset)} examples")


Loading dataset...
✓ Loaded 2526 samples
✓ Dataset loaded: 2526 examples
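
Both messages above report the full dataset size, not the size of each split. The split sizes can be worked out by hand with the sketch below, which assumes the fractional test share is rounded up (the convention scikit-learn style splits use). The helper name `split_sizes` is hypothetical. The resulting training count matches the 2273 samples reported by the trainer setup cell.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_train, n_eval = split_sizes(2526, 0.1)
print(n_train, n_eval)  # 2273 253
```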


In [5]:
print("\nSetting up trainer...")
TOTAL_SAMPLES = len(train_dataset)
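# Note: this estimate uses 2 * 4 = 8, but the SFTConfig below sets
# per_device_train_batch_size=4 and gradient_accumulation_steps=4 (16 samples
# per optimizer step), and max_steps=250 overrides the step count printed here.
GLOBAL_BATCH_SIZE = 2 * 4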
TRAIN_STEPS = TOTAL_SAMPLES // GLOBAL_BATCH_SIZE
print(f"Training configuration:")
print(f"  - Total samples: {TOTAL_SAMPLES}")
print(f"  - Global batch size: {GLOBAL_BATCH_SIZE}")
print(f"  - Training steps: {TRAIN_STEPS}")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=250,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        eval_strategy="steps",
        eval_steps=20,
        save_strategy="steps",
        save_steps=20,
        per_device_eval_batch_size=4,
        load_best_model_at_end=True,
        save_total_limit=2,
        dataloader_num_workers=0
    ),
)
print("✓ Trainer configured")


Setting up trainer...
Training configuration:
  - Total samples: 2273
  - Global batch size: 8
  - Training steps: 284
✓ Trainer configured
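
The figures printed above come from the hard-coded `2 * 4` estimate, while the `SFTConfig` passed to the trainer uses a per-device batch of 4, 4 gradient accumulation steps, and `max_steps=250`. The sketch below recomputes the schedule from those configured values. The helper name `schedule_summary` is hypothetical, and the steps-per-epoch figure is only approximate because it ignores how the Trainer handles the final partial batch.

```python
import math


def schedule_summary(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Derive the effective batch size and data coverage from trainer settings."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    # Approximate: optimizer steps needed to see every example once.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    # Fraction of the dataset consumed by max_steps optimizer steps.
    epochs_covered = max_steps * effective_batch / num_examples
    return effective_batch, steps_per_epoch, epochs_covered


eff, per_epoch, epochs = schedule_summary(2273, 4, 4, 250)
print(eff, per_epoch, round(epochs, 2))  # 16 143 1.76
```

Under these assumptions the run sees roughly 1.76 passes over the training split, which is consistent with the trainer banner reporting a total batch size of 16 and 2 epochs for 250 steps.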


In [6]:
print("\nStarting model training...")
trainer.train()
print("✓ Training finished!")

print("\nMerging LoRA adapters into base model...")
model = model.merge_and_unload()
print("✓ LoRA adapters merged")

The model is already on multiple devices. Skipping the move to device specified in `args`.



Starting model training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,273 | Num Epochs = 2 | Total steps = 250
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 31,457,280 of 8,061,718,528 (0.39% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
20,1.3874,1.229739
40,1.0965,1.078542
60,1.0412,1.040216
80,1.0168,1.019795
100,1.0016,1.003936
120,0.9536,0.991784
140,0.9575,0.984945
160,0.9491,0.979676
180,0.9096,0.972146
200,0.8945,0.967218


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


✓ Training finished!

Merging LoRA adapters into base model...
✓ LoRA adapters merged
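
The loss table above can be summarized with a few lines of Python. This is a sketch over the rows shown in the log only (the table stops at step 200, so later evaluations are not covered). It picks the row with the lowest validation loss, which is the criterion `load_best_model_at_end=True` uses by default, and converts that loss to perplexity. The variable names are illustrative.

```python
import math

# (step, training loss, validation loss) copied from the log table above
LOG = [
    (20, 1.3874, 1.229739),
    (40, 1.0965, 1.078542),
    (60, 1.0412, 1.040216),
    (80, 1.0168, 1.019795),
    (100, 1.0016, 1.003936),
    (120, 0.9536, 0.991784),
    (140, 0.9575, 0.984945),
    (160, 0.9491, 0.979676),
    (180, 0.9096, 0.972146),
    (200, 0.8945, 0.967218),
]

# Row with the lowest validation loss among the logged evaluations
best_step, best_train, best_val = min(LOG, key=lambda row: row[2])

# Perplexity is exp(mean cross-entropy loss in nats)
perplexity = math.exp(best_val)

# Positive gap means validation loss is higher than training loss
gap = best_val - best_train

print(f"best step: {best_step}")
print(f"validation perplexity: {perplexity:.2f}")
print(f"validation - training loss: {gap:.4f}")
```

Validation loss falls at every logged evaluation, so there is no sign of overfitting within the rows shown, although the gap to the training loss widens toward the end.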


In [7]:
model.save_pretrained("llama-3.1-8b-finetuned")
tokenizer.save_pretrained("llama-3.1-8b-finetuned")

('llama-3.1-8b-finetuned/tokenizer_config.json',
 'llama-3.1-8b-finetuned/special_tokens_map.json',
 'llama-3.1-8b-finetuned/chat_template.jinja',
 'llama-3.1-8b-finetuned/tokenizer.json')

In [9]:
model.save_pretrained_gguf(
    "llama-3.1-8b-finetuned",
    tokenizer,
    quantization_method="q4_k_m",
)

Unsloth: Merging model weights to 16-bit format...




Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF bf16 might take 3 minutes.
\        /    [2] Converting GGUF bf16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into bf16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['Meta-Llama-3.1-8B-Instruct.BF16.gguf']
Unsloth: [2] Converting GGUF bf16 into q4_k_m. This might take

{'save_directory': 'llama-3.1-8b-finetuned',
 'gguf_files': ['Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf'],
 'modelfile_location': '/home/ubuntu/Modelfile',
 'want_full_precision': False,
 'is_vlm': False,
 'fix_bos_token': False}