# CodeGemma 7B Fine-tuning with KodCode (Unsloth)

**Goal:** Fine-tune CodeGemma-7B for pytest test generation using the KodCode dataset.

**Requirements:**
- GPU: A100 (40GB) recommended; L4 (24GB) works; T4 (15GB) works with a smaller batch size
- Runtime: ~1-4 hours for 1 epoch on 30k samples, depending on the GPU

**What this notebook does:**
1. Loads `unsloth/codegemma-7b-bnb-4bit` (4-bit quantized)
2. Adds LoRA adapters for efficient fine-tuning
3. Loads the KodCode-V1-SFT-R1 dataset (verified test cases)
4. Trains with SFTTrainer (1 epoch)
5. Saves LoRA adapter + exports to GGUF for Ollama

## 1. Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Colab-specific installation
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

## 2. Configuration

**Adjust these based on your GPU:**
- A100 (40GB): `batch_size=4`, `max_seq_length=4096`
- L4 (24GB): `batch_size=2`, `max_seq_length=2048`
- T4 (15GB): `batch_size=1`, `max_seq_length=2048`
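
The mapping above can be sketched as a small lookup helper. `suggest_config` and its VRAM thresholds are hypothetical, chosen only to separate the GPU tiers mentioned in this notebook; the 80GB tier takes its batch size from the GPU Recommendations table at the end.

```python
def suggest_config(vram_gb: float) -> dict:
    """Return batch size / sequence length for a given amount of GPU memory.

    Thresholds are illustrative assumptions, set somewhat below the nominal
    VRAM sizes because drivers usually report a little less than the label.
    """
    tiers = [
        (70, {"batch_size": 8, "max_seq_length": 4096}),  # 80GB class
        (35, {"batch_size": 4, "max_seq_length": 4096}),  # A100 40GB
        (20, {"batch_size": 2, "max_seq_length": 2048}),  # L4 24GB
    ]
    for min_gb, cfg in tiers:
        if vram_gb >= min_gb:
            return dict(cfg)  # copy so callers can modify the result safely
    return {"batch_size": 1, "max_seq_length": 2048}  # T4-class fallback


print(suggest_config(24))
```

On a live runtime, `vram_gb` could be derived from `torch.cuda.get_device_properties(0).total_memory / 1024**3`, the same call this notebook uses later to report memory.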

In [None]:
# ============ CONFIGURATION ============
# Defaults below target an 80GB GPU (e.g. H100); for smaller GPUs use the values listed above

# Model
MODEL_NAME = "unsloth/codegemma-7b-bnb-4bit"
MAX_SEQ_LENGTH = 4096    # Use 2048 on L4/T4
LOAD_IN_4BIT = True

# LoRA
LORA_R = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0

# Dataset
DATA_SOURCE = "kodcode"
MAX_SAMPLES = 30_000
FILTER_BY_LENGTH = True

# Training (defaults for an 80GB GPU)
BATCH_SIZE = 8           # Per-device batch size; lower to 4/2/1 on A100 40GB/L4/T4
GRAD_ACCUM_STEPS = 2     # Effective batch size = BATCH_SIZE * GRAD_ACCUM_STEPS
NUM_EPOCHS = 1
LEARNING_RATE = 2e-4
WARMUP_RATIO = 0.03
USE_PACKING = True

# Output
OUTPUT_DIR = "./codegemma-kodcode-lora"
HF_REPO = None
SAVE_GGUF = True

## 3. Load Model + LoRA

In [None]:
from unsloth import FastLanguageModel
import torch

# None lets Unsloth auto-detect the dtype: float16 on T4/V100, bfloat16 on Ampere and newer
dtype = None

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=dtype,
    load_in_4bit=LOAD_IN_4BIT,
    # token="hf_...",  # Uncomment if using gated models
)

print(f"Loaded {MODEL_NAME}")

In [None]:
# Add LoRA adapters (only a small fraction of the parameters is trainable; see the printout below)
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

model.print_trainable_parameters()
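
To make the trainable-parameter printout easier to interpret, here is a rough sketch of the arithmetic: a LoRA adapter of rank `r` on a `d_out x d_in` weight adds `r * (d_in + d_out)` parameters. The layer dimensions below are assumptions based on the published Gemma 7B configuration (hidden size 3072, attention width 4096, MLP width 24576, 28 layers, key/value projections as wide as the query projection). They are not read from the loaded model, so treat the result as an estimate.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a d_out x d_in weight."""
    return r * (d_in + d_out)


# Assumed Gemma-7B-style dimensions (hypothetical constants for illustration).
HIDDEN, ATTN, MLP, LAYERS = 3072, 4096, 24576, 28


def total_lora_params(r: int) -> int:
    """Sum adapter parameters over the seven target modules in every layer."""
    per_layer = (
        3 * lora_params(HIDDEN, ATTN, r)   # q_proj, k_proj, v_proj
        + lora_params(ATTN, HIDDEN, r)     # o_proj
        + 2 * lora_params(HIDDEN, MLP, r)  # gate_proj, up_proj
        + lora_params(MLP, HIDDEN, r)      # down_proj
    )
    return per_layer * LAYERS


print(f"{total_lora_params(16):,} trainable LoRA parameters")
```

Under these assumed dimensions, `r=16` gives roughly 50 million adapter parameters, and the count scales linearly with `r`.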

## 4. Prepare Dataset

In [None]:
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset, Dataset
from tqdm import tqdm

# Apply the ChatML template (this becomes the fine-tuned model's chat format; it is not CodeGemma's native template)
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    map_eos_token=True,
)

# Prompt template for test generation
PROMPT_TEMPLATE = """### Task: Write pytest tests

### Code Under Test:
{code}

### Constraints:
pytest only, no hypothesis, no unittest. Edge cases, parametrize, no explanation."""

def formatting_prompts_func(examples):
    """Convert conversations to ChatML text format."""
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

# Load dataset
print(f"Loading {DATA_SOURCE} dataset...")

if DATA_SOURCE == "guanaco":
    # Demo dataset (general conversation)
    dataset = load_dataset("philschmid/guanaco-sharegpt-style", split="train")
    if MAX_SAMPLES < len(dataset):
        dataset = dataset.shuffle(seed=42).select(range(MAX_SAMPLES))
    dataset = dataset.map(formatting_prompts_func, batched=True)
else:
    # KodCode or TestGenEval
    if DATA_SOURCE == "kodcode":
        ds = load_dataset("KodCode/KodCode-V1-SFT-R1", split="train")
        print(f"KodCode columns: {ds.column_names}")
        
        # Flexible column mapping
        code_col = next((c for c in ("solution", "r1_solution", "code") if c in ds.column_names), None)
        test_col = next((c for c in ("test", "tests") if c in ds.column_names), None)
        
        if not code_col or not test_col:
            raise KeyError(f"Required columns not found. Need code+test. Got: {ds.column_names}")
        print(f"Using columns: code={code_col}, test={test_col}")
        
        # Filter by correctness if available
        if "r1_correctness" in ds.column_names:
            before = len(ds)
            ds = ds.filter(lambda x: x.get("r1_correctness") in (True, "True", "true", 1))
            print(f"Filtered by r1_correctness: {before} -> {len(ds)}")
    else:
        # TestGenEval
        ds = load_dataset("kjain14/testgeneval", split="train")
        code_col, test_col = "code_src", "test_src"
    
    # Convert to conversation format with length filtering
    rows = []
    skipped_long = 0
    skipped_empty = 0
    
    for row in tqdm(ds, desc="Processing"):
        code = str(row.get(code_col, "") or "").strip()
        test = str(row.get(test_col, "") or "").strip()
        
        if not code or not test:
            skipped_empty += 1
            continue
        
        prompt = PROMPT_TEMPLATE.format(code=code)
        full_text = prompt + "\n" + test
        
        # Skip if too long
        if FILTER_BY_LENGTH:
            n_tokens = len(tokenizer.encode(full_text, add_special_tokens=False))
            if n_tokens > MAX_SEQ_LENGTH - 50:  # Leave room for special tokens
                skipped_long += 1
                continue
        
        rows.append({
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": test}
            ]
        })
        
        if len(rows) >= MAX_SAMPLES:
            break
    
    print(f"\nDataset stats:")
    print(f"  Loaded: {len(rows)} samples")
    print(f"  Skipped (empty): {skipped_empty}")
    print(f"  Skipped (too long): {skipped_long}")
    
    dataset = Dataset.from_list(rows)
    dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"\nFinal dataset size: {len(dataset)} samples")
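
As a rough illustration of what `formatting_prompts_func` produces, the sketch below renders ShareGPT-style messages in the standard ChatML layout. `render_chatml` is a hypothetical stand-in; the real string comes from the tokenizer's chat template and may differ in whitespace or special tokens.

```python
def render_chatml(convo, add_generation_prompt=False):
    """Approximate the ChatML layout for ShareGPT-style messages."""
    role_map = {"human": "user", "gpt": "assistant"}
    parts = []
    for msg in convo:
        # Any role other than human/gpt is treated as a system message here.
        role = role_map.get(msg["from"], "system")
        parts.append(f"<|im_start|>{role}\n{msg['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


convo = [
    {"from": "human", "value": "### Task: Write pytest tests"},
    {"from": "gpt", "value": "def test_ok():\n    assert True"},
]
print(render_chatml(convo))
```

During training the full conversation is rendered (`add_generation_prompt=False`); at inference only the user turn is rendered and the assistant turn is left open.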

In [None]:
# Preview a sample
print("=" * 60)
print("Sample conversation:")
print("=" * 60)
print(dataset[0]["text"][:2000])
print("..." if len(dataset[0]["text"]) > 2000 else "")

## 5. Train

In [None]:
from trl import SFTTrainer, SFTConfig

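# Estimate total steps (approximate: with packing enabled, short samples are
# concatenated into full-length sequences, so the real step count is usually lower)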
total_samples = len(dataset)
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS
steps_per_epoch = total_samples // effective_batch
total_steps = int(steps_per_epoch * NUM_EPOCHS)

print(f"Training config:")
print(f"  Samples: {total_samples}")
print(f"  Effective batch size: {effective_batch}")
print(f"  Steps per epoch: {steps_per_epoch}")
print(f"  Total steps: {total_steps}")
print(f"  Warmup steps: {int(total_steps * WARMUP_RATIO)}")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    packing=USE_PACKING,
    args=SFTConfig(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRAD_ACCUM_STEPS,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        warmup_ratio=WARMUP_RATIO,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        max_seq_length=MAX_SEQ_LENGTH,
        dataset_text_field="text",
        report_to="none",  # Set to "wandb" if you want logging
        logging_steps=50,
        save_steps=500,
        save_total_limit=2,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
    ),
)
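
The step arithmetic printed above can be written as a standalone helper. This is a sketch for an unpacked dataset: `estimate_steps` is a hypothetical name, and it rounds up on the assumption that a final partial batch still triggers an optimizer step, whereas the cell above rounds down.

```python
import math


def estimate_steps(n_samples, batch_size, grad_accum, epochs=1, warmup_ratio=0.03):
    """Estimate optimizer steps and warmup steps for an unpacked dataset."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": int(total_steps * warmup_ratio),
    }


print(estimate_steps(30_000, 8, 2))
```

With `USE_PACKING = True` the trainer sees fewer, longer sequences, so these numbers are best read as an upper bound.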

In [None]:
# Show memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
max_memory = round(gpu_stats.total_memory / 1024**3, 3)
print(f"GPU: {gpu_stats.name}")
print(f"Max memory: {max_memory} GB")
print(f"Reserved before training: {start_gpu_memory} GB")

In [None]:
# Train!
print("Starting training...")
trainer_stats = trainer.train()

In [None]:
# Show final stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)

print(f"\nTraining complete!")
print(f"  Time: {trainer_stats.metrics['train_runtime']:.1f} seconds ({trainer_stats.metrics['train_runtime']/60:.1f} min)")
print(f"  Final loss: {trainer_stats.metrics.get('train_loss', 'N/A')}")
print(f"  Peak memory: {used_memory} GB ({used_percentage}% of {max_memory} GB)")
print(f"  Memory for training: {used_memory_for_lora} GB")

## 6. Test the Model

In [None]:
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Re-apply chat template for inference
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    map_eos_token=True,
)

# Enable faster inference
FastLanguageModel.for_inference(model)

# Test prompt
test_code = '''
def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number."""
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
'''

test_prompt = PROMPT_TEMPLATE.format(code=test_code)

messages = [{"from": "human", "value": test_prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

print("Generating tests for fibonacci function...")
print("=" * 60)

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=512,
    use_cache=True,
    do_sample=True,  # Required for temperature/top_p to take effect
    temperature=0.1,
    top_p=0.9,
)
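
Beyond reading the streamed output, a quick sanity check is to confirm that the generated text parses as Python and contains test functions. The helper below is a hypothetical sketch using only the standard library; it assumes the decoded output may end with the ChatML `<|im_end|>` marker and strips it first.

```python
import ast


def summarize_generated_tests(source: str) -> dict:
    """Check that generated text parses as Python and list its test functions."""
    cleaned = source.replace("<|im_end|>", "").strip()
    try:
        tree = ast.parse(cleaned)
    except SyntaxError as exc:
        return {"valid": False, "error": str(exc), "tests": []}
    tests = [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
    ]
    return {"valid": True, "error": None, "tests": tests}


sample = "def test_fib_zero():\n    assert fibonacci(0) == 0\n<|im_end|>"
print(summarize_generated_tests(sample))
```

This only checks syntax and naming; it does not run the tests. To apply it to real output, the token IDs returned by `model.generate` would first need to be decoded with `tokenizer.decode`.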

## 7. Save Model

In [None]:
# Save LoRA adapter locally
print(f"Saving LoRA adapter to {OUTPUT_DIR}...")
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print("Done!")

# Push to HuggingFace (optional)
if HF_REPO:
    print(f"Pushing to HuggingFace: {HF_REPO}...")
    model.push_to_hub(HF_REPO, token=os.environ.get("HF_TOKEN"))
    tokenizer.push_to_hub(HF_REPO, token=os.environ.get("HF_TOKEN"))
    print("Done!")

In [None]:
# Save merged 16-bit model (for vLLM or direct inference)
if False:  # Set to True to save merged model
    print("Saving merged 16-bit model...")
    model.save_pretrained_merged(f"{OUTPUT_DIR}-merged", tokenizer, save_method="merged_16bit")
    print("Done!")

In [None]:
# Save GGUF for Ollama
if SAVE_GGUF:
    print("Saving GGUF (q4_k_m) for Ollama...")
    model.save_pretrained_gguf(
        f"{OUTPUT_DIR}-gguf",
        tokenizer,
        quantization_method="q4_k_m"  # Good balance of size/quality
    )
    print(f"Done! GGUF saved to {OUTPUT_DIR}-gguf/")
    print("\nTo use in Ollama:")
    print("  1. Copy the .gguf file to your local machine")
    print("  2. In the same folder, create a Modelfile with: FROM ./unsloth.Q4_K_M.gguf")
    print("  3. Run: ollama create codegemma-kodcode -f Modelfile")
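
Because the model was fine-tuned on ChatML-formatted text, the Modelfile probably also needs a matching prompt template and stop token. The sketch below builds one as a string. `build_modelfile` is a hypothetical helper, the template is the commonly used ChatML layout for Ollama, and the temperature mirrors the test cell; all of these should be checked against your Ollama version.

```python
def build_modelfile(gguf_path: str) -> str:
    """Return Modelfile text that wraps prompts in ChatML."""
    # Ollama templates use Go template syntax, hence the double braces.
    template = (
        "{{ if .System }}<|im_start|>system\n{{ .System }}<|im_end|>\n{{ end }}"
        "<|im_start|>user\n{{ .Prompt }}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    return (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        'PARAMETER stop "<|im_end|>"\n'
        "PARAMETER temperature 0.1\n"
    )


print(build_modelfile("./unsloth.Q4_K_M.gguf"))
```

Writing the returned text to a file named `Modelfile` next to the `.gguf` file lets `ollama create` pick up both.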

## 8. Download Files

Run this cell to create a zip file you can download from Colab.

In [None]:
import shutil

# Create zip of LoRA adapter
print("Creating zip file...")
shutil.make_archive(f"{OUTPUT_DIR}", 'zip', OUTPUT_DIR)
print(f"Created {OUTPUT_DIR}.zip")

# In Colab, you can download with:
try:
    from google.colab import files
    files.download(f"{OUTPUT_DIR}.zip")
except ImportError:  # Not running in Colab
    print(f"Download {OUTPUT_DIR}.zip manually from the file browser.")

---

## GPU Recommendations

| GPU | VRAM | Batch Size | Seq Length | Est. Time (30k samples, 1 epoch) |
|-----|------|------------|------------|----------------------------------|
| **A100** | 40GB | 4 | 4096 | ~1.5 hours |
| **A100** | 80GB | 8 | 4096 | ~1 hour |
| **L4** | 24GB | 2 | 2048 | ~2.5 hours |
| **T4** | 15GB | 1 | 2048 | ~4 hours |

### Cloud Options:

1. **Google Colab Pro** ($10/month): A100 access, good for experimentation
2. **Google Colab Pro+** ($50/month): Longer runtimes, priority A100
3. **RunPod**: A100 ~$1.99/hr, good for long training
4. **Lambda Labs**: A100 ~$1.29/hr (often waitlisted)
5. **Vast.ai**: Cheapest A100s ~$0.80-1.50/hr (variable quality)

**Recommendation:** Start with Colab Pro (A100) for prototyping, use RunPod for final training.
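
To compare the hourly options, the cost of one run is the estimated training time multiplied by the hourly rate. The sketch below uses the ~1.5 hour A100 (40GB) estimate and the approximate rates quoted in this section; `training_cost` is a hypothetical helper, and real prices vary.

```python
def training_cost(hours: float, usd_per_hour: float) -> float:
    """Cost of a single run, ignoring setup time and storage fees."""
    return hours * usd_per_hour


# Values copied from the table and list above (approximate).
A100_HOURS = 1.5
rates = {
    "RunPod": 1.99,
    "Lambda Labs": 1.29,
    "Vast.ai (low)": 0.80,
    "Vast.ai (high)": 1.50,
}

for provider, rate in rates.items():
    print(f"{provider}: ~${training_cost(A100_HOURS, rate):.2f}")
```

At these rates a single 30k-sample epoch costs a few dollars, so a monthly subscription mainly pays off when you expect many runs.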