In [None]:
# Import unsloth before trl/transformers so its patches are applied first
from unsloth import FastLanguageModel
from trl import SFTTrainer

from poi.dataset.llm import load_llm_dataset
from poi.llm import LLMConfig
from poi.settings import DATASETS_DIR

config = LLMConfig(run_name="llama3-nyc-unsloth")

# Create dataset
train_dataset = load_llm_dataset(
    DATASETS_DIR / "NYC" / "train_codebook.json", config
)
eval_dataset = load_llm_dataset(
    DATASETS_DIR / "NYC" / "test_codebook.json", config
)

# Sequence length comes from config.max_length below.
# Unsloth supports automatic RoPE scaling, so any value can be chosen.

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="mlabonne/Meta-Llama-3-8B",
    max_seq_length=config.max_length,
    dtype=None,  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True,  # Use 4bit quantization to reduce memory usage. Can be False
    attn_implementation="flash_attention_2",
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0.1,  # Only dropout = 0 uses Unsloth's fast LoRA path; 0.1 disables it (see warning in the output)
    bias="none",  # Bias = "none" is currently optimized
    use_gradient_checkpointing=True,
    random_state=3407,
)


trainer = SFTTrainer(
    model=model,
    args=config.training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


Switching to PyTorch attention since your Xformers is broken.

Requires Flash-Attention version >=2.7.1,<=2.8.2 but got 2.8.3.
🦥 Unsloth Zoo will now patch everything to make training faster!


Map: 100%|██████████| 2848/2848 [00:00<00:00, 60303.08 examples/s]
Map: 100%|██████████| 876/876 [00:00<00:00, 59792.84 examples/s]


==((====))==  Unsloth 2025.10.6: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    NVIDIA GeForce RTX 4090. Num GPUs = 1. Max memory: 23.508 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00,  1.58s/it]


mlabonne/Meta-Llama-3-8B does not have a padding token! Will use pad_token = <|reserved_special_token_250|>.


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.6 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
Unsloth: Tokenizing ["text"] (num_proc=36): 100%|██████████| 2848/2848 [00:04<00:00, 633.22 examples/s]
Unsloth: Tokenizing ["text"] (num_proc=36): 100%|██████████| 876/876 [00:04<00:00, 189.05 examples/s]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,848 | Num Epochs = 8 | Total steps = 1,424
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Epoch,Training Loss,Validation Loss
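
The figures in the Unsloth banner above can be checked by hand. The sketch below is not part of the original run: it recomputes the trainable-parameter count and the total step count from the LoRA settings in the cell (r=16, seven target modules, bias="none") and the logged batch settings. The layer shapes are assumptions taken from the public Llama-3-8B configuration (hidden size 4096, MLP size 14336, 8 KV heads of dimension 128, 32 layers), and the helper names are hypothetical.

```python
import math

# Assumed Llama-3-8B shapes (from the public model config, not read from the checkpoint)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head dim 128 (grouped-query attention)
NUM_LAYERS = 32
RANK = 16  # matches r=16 in get_peft_model

# (in_features, out_features) of each LoRA target module in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) per module; no bias terms."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus, epochs):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    # 2848 / 16 divides exactly in this run, so the rounding convention does not matter
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


print(lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS))  # 41943040
print(total_optimizer_steps(2848, 2, 8, 1, 8))  # (16, 1424)
```

Both results agree with the logged values: 41,943,040 trainable parameters, a total batch size of 16, and 1,424 total steps.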
