# Phase 3: Python System Automation

Fine-tunes the Phase 2 output on Python automation and system scripting datasets.

**Input:**
- Base model: `/kaggle/input/qwen3-08b-coder-reasoning` (1.6GB)
- Phase 1 LoRA: `/kaggle/input/qwen3-phase1-lora-adapter` (42MB - CodeAlpaca)
- Phase 2 LoRA: `/kaggle/input/qwen3-phase2-linux-lora-adapter` (27MB - Linux commands, 1 epoch)

**PERFORMANCE FIX:**
- **Problem:** Phase 2 was 6x slower due to merge → save → reload → quantize cycle
- **Solution:** Keep merged model in memory, apply quantization once, no disk I/O
- **Expected:** ~2 hours (down from 10+ hours)

**Workflow:**
1. Load base model with 4-bit quantization (ONCE)
2. Load Phase 1 LoRA adapter
3. Merge Phase 1 (in memory, no save)
4. Load Phase 2 LoRA adapter  
5. Merge Phase 2 (in memory, no save)
6. Apply Phase 3 LoRA training directly

**Critical:** All phases use **pad_token_id=151645** (EOS token) for consistency

**Expected Time:** 2 hours on T4 GPU

In [None]:
# Install dependencies
!pip install -q transformers datasets accelerate peft bitsandbytes trl pandas

In [None]:
import torch
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from datasets import load_dataset, concatenate_datasets, Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import SFTTrainer, SFTConfig

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

In [None]:
# Configuration
BASE_MODEL_PATH = "/kaggle/input/qwen3-08b-coder-reasoning"
PHASE1_LORA_PATH = "/kaggle/input/qwen3-phase1-lora-adapter"
PHASE2_LORA_PATH = "/kaggle/input/qwen3-phase2-linux-lora-adapter"
OUTPUT_DIR = "/kaggle/working/qwen3-08b-phase3-python"

# Training hyperparameters (optimized for memory + time)
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 8  # Effective batch = 16
LEARNING_RATE = 2e-4
NUM_EPOCHS = 1  # Reduced to 1 epoch (3h time constraint)
MAX_SEQ_LENGTH = 2048

# CRITICAL: Padding token (consistent across all phases)
PAD_TOKEN_ID = 151645  # EOS token used in Phase 1 & 2

## Step 1: Load Base Model with 4-bit Quantization (PERFORMANCE FIX)

**Key Change:** Load base model with quantization ONCE at the start. This eliminates the costly reload cycle from Phase 2.

- **Previous (slow):** Base → merge P1 → save → reload+quantize → train P2
- **Now (fast):** Base+quantize → merge P1 → merge P2 → train P3 (all in memory)
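
As a rough sanity check on why a single 4-bit load is cheap, the sketch below estimates weight memory from a parameter count. The 0.8B figure is an assumption inferred from the `08b` in the model path and the 1.6GB checkpoint size (which would fit 16-bit storage); it is not read from the checkpoint. The ~4.127 bits per parameter is the NF4 plus double-quantization estimate from the QLoRA paper. Layers that stay in 16-bit (embeddings, norms) are ignored, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: inferred from "08b" in the model path, not verified against the checkpoint
ASSUMED_PARAMS = 0.8e9

fp16_gb = weight_memory_gb(ASSUMED_PARAMS, 16)
nf4_gb = weight_memory_gb(ASSUMED_PARAMS, 4.127)  # NF4 + double quantization (estimate)

print(f"fp16 weights: ~{fp16_gb:.2f} GB")
print(f"NF4 weights:  ~{nf4_gb:.2f} GB")
```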

In [None]:
print("="*60)
print("STEP 1: LOAD BASE MODEL WITH 4-BIT QUANTIZATION")
print("="*60)

# Configure 4-bit quantization ONCE
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

print("\n🔄 Loading base model with 4-bit quantization...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

print("✅ Base model loaded and quantized")
print(f"   Model device: {base_model.device}")
print(f"   Model dtype: {base_model.dtype}")

## Step 2: Load Tokenizer with Consistent Padding

**CRITICAL:** Must use pad_token_id=151645 (EOS token) to match Phase 1 & 2 training.
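
To make the padding convention concrete, here is a minimal sketch of right-padding a batch with the EOS id. `pad_right` is a hypothetical helper written for illustration (the real padding is done by the tokenizer and data collator), and the token ids in the example are toy values. Because the pad id equals the EOS id, it is the attention mask, not the token id, that tells padding apart from a real end-of-sequence token.

```python
PAD_TOKEN_ID = 151645  # same value as in the configuration cell

def pad_right(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # padding goes on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padded positions
    return input_ids, attention_mask

# Toy token ids, purely illustrative
ids, mask = pad_right([[10, 11, 12], [20]])
print(ids)
print(mask)
```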

In [None]:
print("\n" + "="*60)
print("STEP 2: LOAD TOKENIZER")
print("="*60)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# CRITICAL: Force padding token to match Phase 1 & 2
# Phase 1 & 2 used pad_token_id = 151645 (eos_token)
print(f"\n⚠️  CRITICAL: Setting pad_token_id={PAD_TOKEN_ID} (EOS token)")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = PAD_TOKEN_ID
tokenizer.padding_side = "right"

print(f"\n🔍 VERIFICATION:")
print(f"   pad_token_id: {tokenizer.pad_token_id}")
print(f"   eos_token_id: {tokenizer.eos_token_id}")

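if tokenizer.pad_token_id != PAD_TOKEN_ID:
    raise ValueError(f"❌ PADDING TOKEN MISMATCH! Expected {PAD_TOKEN_ID}, got {tokenizer.pad_token_id}")
# pad_token_id was forced above, so also confirm it really is this tokenizer's EOS id
if tokenizer.eos_token_id != PAD_TOKEN_ID:
    print(f"   ⚠️  WARNING: eos_token_id is {tokenizer.eos_token_id}, not {PAD_TOKEN_ID}")
else:
    print(f"   ✅ CORRECT: Matches Phase 1 & 2 (ID: {PAD_TOKEN_ID})")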

print("="*60)

## Step 3: Sequential Merging (In-Memory, No Disk I/O)

**Performance optimization:** Merge Phase 1 and Phase 2 LoRAs in memory without saving to disk.
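
For intuition, merging a LoRA adapter folds its low-rank update into the base weight: `W' = W + (alpha / r) * (B @ A)`. The sketch below applies two such merges in sequence on a toy 2x2 matrix using plain Python lists. `matmul` and `merge_lora` are hypothetical helpers, not PEFT functions, and the numbers are made up. It also ignores quantization: with a 4-bit base, PEFT has to dequantize, add the update, and requantize, so the merged weights can differ slightly from a full-precision merge.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_A, lora_B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the standard LoRA merge."""
    scale = alpha / r
    delta = matmul(lora_B, lora_A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 weight and two rank-1 adapters (A is r x d_in, B is d_out x r)
W = [[1.0, 0.0], [0.0, 1.0]]
A1, B1 = [[1.0, 2.0]], [[0.5], [0.0]]  # stands in for the Phase 1 adapter
A2, B2 = [[0.0, 1.0]], [[0.0], [1.0]]  # stands in for the Phase 2 adapter

W1 = merge_lora(W, A1, B1, alpha=2, r=1)   # merge "Phase 1"
W2 = merge_lora(W1, A2, B2, alpha=2, r=1)  # then merge "Phase 2" on top
print(W2)
```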

In [None]:
print("\n" + "="*60)
print("STEP 3: MERGE PHASE 1 & 2 LORAS (IN MEMORY)")
print("="*60)

# Load and merge Phase 1 LoRA
print("\n🔄 Loading Phase 1 LoRA adapter...")
model_with_phase1 = PeftModel.from_pretrained(base_model, PHASE1_LORA_PATH)

print("🔄 Merging Phase 1 LoRA (CodeAlpaca knowledge)...")
model = model_with_phase1.merge_and_unload()
del model_with_phase1
torch.cuda.empty_cache()
print("✅ Phase 1 merged (in memory)")

# Load and merge Phase 2 LoRA
print("\n🔄 Loading Phase 2 LoRA adapter...")
model_with_phase2 = PeftModel.from_pretrained(model, PHASE2_LORA_PATH)

print("🔄 Merging Phase 2 LoRA (Linux commands, 1 epoch)...")
model = model_with_phase2.merge_and_unload()
del model_with_phase2
torch.cuda.empty_cache()
print("✅ Phase 2 merged (in memory)")

print("\n✅ Model now includes:")
print("   • Base model (qwen3-08b-coder-reasoning) knowledge")
print("   • Phase 1: CodeAlpaca training")
print("   • Phase 2: Linux command training (1 epoch)")
print("\n🚀 Ready for Phase 3 training!")
print("="*60)

## Step 4: Prepare Model for LoRA Training

Enable gradient checkpointing and prepare for k-bit training.

In [None]:
print("\n" + "="*60)
print("STEP 4: PREPARE MODEL FOR LORA TRAINING")
print("="*60)

# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

print("✅ Gradient checkpointing enabled")
print("✅ Model prepared for k-bit training")

## Step 5: Configure Phase 3 LoRA

New LoRA adapter for Python automation training.
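
A quick way to reason about adapter size: each adapted linear layer of shape `d_in x d_out` adds `r * (d_in + d_out)` trainable parameters. The sketch below applies that to the four attention projections targeted in this step. The layer dimensions and layer count are hypothetical placeholders, not values read from the actual checkpoint, so the printed numbers only illustrate the arithmetic; `print_trainable_parameters()` in the cell below reports the real count.

```python
def lora_param_count(module_shapes, r):
    """Sum r * (d_in + d_out) over every adapted linear layer."""
    return sum(r * (d_in + d_out) for d_in, d_out in module_shapes)

# Hypothetical dimensions for illustration only (NOT taken from the real model)
HIDDEN, KV_DIM, NUM_LAYERS = 1024, 256, 24
per_layer = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV_DIM),  # k_proj
    (HIDDEN, KV_DIM),  # v_proj
    (HIDDEN, HIDDEN),  # o_proj
]

trainable = lora_param_count(per_layer * NUM_LAYERS, r=16)
size_mb_fp32 = trainable * 4 / (1024 * 1024)  # 4 bytes per fp32 parameter
scaling = 32 / 16                             # lora_alpha / r from the config

print(f"Trainable LoRA params: {trainable:,}")
print(f"Approx. adapter size (fp32): {size_mb_fp32:.2f} MB")
print(f"LoRA scaling factor: {scaling}")
```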

In [None]:
print("\n" + "="*60)
print("STEP 5: CONFIGURE PHASE 3 LORA")
print("="*60)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

print("\n✅ Phase 3 LoRA adapter configured")
print("="*60)

## Step 6: Load Python Automation Datasets

Focus on Python system automation, scripting, and DevOps tasks.
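
Keyword filters like the ones used in this step are sensitive to substring versus word matching. The sketch below compares the two on a few made-up instructions (not taken from either dataset): a plain substring test for `script` also keeps JavaScript prompts, while a word-start regex does not. Both helper functions are hypothetical and only illustrate the difference.

```python
import re

# Matches "script", "scripts", "scripting" at a word start, but not "JavaScript"
SCRIPT_WORD = re.compile(r"\bscript", re.IGNORECASE)

def substring_match(text):
    """Keep text containing 'python' or 'script' anywhere, even inside another word."""
    lowered = text.lower()
    return "python" in lowered or "script" in lowered

def word_match(text):
    """Keep text containing 'python', or 'script' only at the start of a word."""
    return "python" in text.lower() or SCRIPT_WORD.search(text) is not None

# Made-up instructions for illustration
samples = [
    "Write a Python function to list files",
    "Create a JavaScript function that reverses a string",
    "Write a script that renames log files",
]

substring_results = [substring_match(s) for s in samples]
word_results = [word_match(s) for s in samples]
print(substring_results)
print(word_results)
```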

In [None]:
print("\n" + "="*60)
print("STEP 6: LOAD PYTHON AUTOMATION DATASETS")
print("="*60)

# Python automation datasets
print("\n🔄 Loading Python automation datasets...")

# 1. CodeAlpaca (filter for Python)
import re

print("\nLoading CodeAlpaca (Python only)...")
codealpaca_ds = load_dataset("sahil2801/CodeAlpaca-20k", split="train")
# Match "script" only at a word start so "JavaScript" instructions are not kept
python_alpaca = codealpaca_ds.filter(
    lambda x: 'python' in x['instruction'].lower()
    or re.search(r'\bscript', x['instruction'], re.IGNORECASE) is not None
)
print(f"✓ CodeAlpaca Python: {len(python_alpaca)} examples")

# 2. CodeFeedback (filter for Python)
print("\nLoading CodeFeedback (filtering for Python)...")
codefeedback_ds = load_dataset("m-a-p/CodeFeedback-Filtered-Instruction", split="train")
python_cf = codefeedback_ds.filter(
    lambda x: x.get('lang') == 'python' or 'python' in x.get('query', '').lower()
)
print(f"✓ CodeFeedback Python: {len(python_cf)} examples")

# Combine and cap at 25K for memory
print("\n🔄 Combining datasets...")
def normalize_python(example):
    return {
        'instruction': example.get('instruction') or example.get('query', ''),
        'output': example.get('output') or example.get('answer', '')
    }

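# Keep only the shared instruction/output columns so both datasets have the same schema
python_alpaca_norm = python_alpaca.map(normalize_python, remove_columns=python_alpaca.column_names)
python_cf_norm = python_cf.map(normalize_python, remove_columns=python_cf.column_names)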

combined = concatenate_datasets([python_alpaca_norm, python_cf_norm])
combined = combined.shuffle(seed=42).select(range(min(25000, len(combined))))

print(f"\n✅ Total training examples: {len(combined):,}")
print("="*60)

## Step 7: Format Dataset

Convert to instruction-response format for training.
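
As a small illustration of the target format, the sketch below applies the same template used in this step to one made-up example (the sample instruction and output are not from the training data) and prints the resulting `text` field.

```python
def format_instruction(example):
    """Build the single 'text' field used for training."""
    return {
        "text": f"Instruction: {example['instruction']}\n\nResponse: {example['output']}"
    }

# Made-up sample for illustration
sample = {
    "instruction": "Print the current working directory.",
    "output": "import os\nprint(os.getcwd())",
}

formatted = format_instruction(sample)["text"]
print(formatted)
```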

In [None]:
print("\n" + "="*60)
print("STEP 7: FORMAT DATASET")
print("="*60)

def format_instruction(example):
    return {
        "text": f"Instruction: {example['instruction']}\n\nResponse: {example['output']}"
    }

train_dataset = combined.map(format_instruction)
print(f"✅ Dataset formatted: {len(train_dataset):,} examples")
print("="*60)

## Step 8: Configure Training

Set up training parameters with memory optimization.
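
To show what the warmup and cosine settings in the cell below imply, here is a minimal sketch of the usual linear-warmup plus half-cosine learning-rate shape. `lr_at_step` is a hypothetical helper, not a Trainer API. The total step count assumes the 25,000-example cap is reached, giving `ceil(25000 / 16)` optimizer steps; the values logged by the trainer may differ slightly.

```python
import math

TOTAL_STEPS = math.ceil(25000 / 16)  # assumption: dataset cap reached, effective batch 16

def lr_at_step(step, base_lr=2e-4, warmup_steps=100, total_steps=TOTAL_STEPS):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")
```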

In [None]:
print("\n" + "="*60)
print("STEP 8: CONFIGURE TRAINING")
print("="*60)

training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    learning_rate=LEARNING_RATE,
    fp16=True,
    save_strategy="epoch",
    logging_steps=50,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    report_to="none",
    max_grad_norm=0.3,
    # SFT-specific parameters
    max_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=False,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    args=training_args,
)

print(f"✅ Trainer configured")
print(f"   Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"   Approx. total steps: {len(train_dataset) // (BATCH_SIZE * GRADIENT_ACCUMULATION) * NUM_EPOCHS}")
print(f"   Expected time: ~2 hours (with performance fix)")
print("="*60)

## Step 9: Train Phase 3

Start training with performance monitoring.
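
The time targets in this notebook follow from simple arithmetic on step count and seconds per step. The sketch below works it through, assuming the 25,000-example cap is reached (about 1,563 optimizer steps at an effective batch size of 16). The 23.48 sec/step figure is the Phase 2 baseline and 5 sec/step is the Phase 3 target's upper bound; `training_hours` is a hypothetical helper.

```python
def training_hours(num_steps, sec_per_step):
    """Convert a step count and per-step time into wall-clock hours."""
    return num_steps * sec_per_step / 3600

STEPS = 1563  # assumption: ceil(25,000 / 16), i.e. the dataset cap is reached

slow_hours = training_hours(STEPS, 23.48)  # Phase 2 baseline speed
fast_hours = training_hours(STEPS, 5.0)    # upper bound of the Phase 3 target

print(f"At 23.48 sec/step: ~{slow_hours:.1f} hours")
print(f"At 5.00 sec/step:  ~{fast_hours:.1f} hours")
```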

In [None]:
import time

print("\n" + "="*80)
print("STEP 9: TRAINING PHASE 3")
print("="*80)

print("\n🚀 Starting training...")
print(f"   Dataset: {len(train_dataset):,} examples")
print(f"   Epochs: {NUM_EPOCHS}")
print(f"   Batch size: {BATCH_SIZE} (effective: {BATCH_SIZE * GRADIENT_ACCUMULATION})")
print("\n⏱️  PERFORMANCE MONITORING:")
print(f"   Phase 2 baseline: 23.48 sec/step (about 6x slower)")
print(f"   Phase 3 target: <5 sec/step (with fix)")
print("\n" + "="*80)

start_time = time.time()
trainer.train()
end_time = time.time()

training_hours = (end_time - start_time) / 3600
print(f"\n✅ Training complete!")
print(f"   Total time: {training_hours:.2f} hours")
print(f"   Avg sec/step: {(end_time - start_time) / trainer.state.global_step:.2f}")
print("="*80)

## Step 10: Save Phase 3 LoRA Adapter

Save only the LoRA adapter weights (~27-40MB).
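
The size check in the cell below assumes the weights are written as `adapter_model.safetensors`; depending on the PEFT version and serialization settings the file may instead be `adapter_model.bin`. A filename-independent alternative is to sum every file in the output directory. The sketch below shows that with a hypothetical `dir_size_mb` helper, demonstrated on a throwaway temporary directory with dummy files rather than a real adapter.

```python
import os
import tempfile

def dir_size_mb(path):
    """Total size of all files under path, in MB (1 MB = 1024 * 1024 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

# Demo on a temporary directory with dummy files (not a real adapter)
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "adapter_model.safetensors"), "wb") as f:
        f.write(b"\0" * (2 * 1024 * 1024))  # 2 MB of zeros
    with open(os.path.join(tmp, "adapter_config.json"), "wb") as f:
        f.write(b"{}")
    demo_mb = dir_size_mb(tmp)

print(f"Directory size: {demo_mb:.2f} MB")
```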

In [None]:
print("\n" + "="*60)
print("STEP 10: SAVE PHASE 3 LORA ADAPTER")
print("="*60)

print("\n💾 Saving Phase 3 LoRA adapter...")
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

import os
adapter_size = os.path.getsize(f"{OUTPUT_DIR}/adapter_model.safetensors") / (1024 * 1024)
print(f"\n✅ Phase 3 adapter saved!")
print(f"   Location: {OUTPUT_DIR}")
print(f"   Size: {adapter_size:.1f} MB")
print(f"   Ready for Phase 4 training")
print("="*60)

## Training Complete! 🎉

**Phase 3 Results:**
- Model has learned: Base + CodeAlpaca + Linux Commands + Python Automation
- Adapter saved for Phase 4 (Advanced Troubleshooting)

**Next Steps:**
1. Download this Phase 3 adapter (~27-40 MB)
2. Upload to Kaggle as dataset
3. Create Phase 4 notebook
4. Final training phase!

**Performance Check:**
- If this phase completed in ~2 hours: ✅ Fix worked!
- If this phase took 10+ hours: ⚠️ Need further debugging