# Phase 3: Python System Automation

This notebook fine-tunes the Phase 2 merged model on Python system automation datasets.

**Input:**
- Phase 2 Merged Model: `/kaggle/input/qwen3-phase2-linux-merged` (Base + CodeAlpaca + Linux Commands)

**Workflow:**
1. Load Phase 2 merged model
2. Apply LoRA and train on Python automation datasets
3. Save Phase 3 LoRA adapter
4. Merge LoRA with base model for Phase 4 input

**Datasets:**
- HuggingFace: `RazinAleks/SO-Python_QA-System_Administration_and_DevOps_class` ⭐ Perfect fit!
- HuggingFace: `infinite-dataset-hub/ShellScriptDataset`
- HuggingFace: `flytech/python-codes-25k` (FILTERED for system-relevant code only)

**Filtering Strategy:**
- ✅ Keep: File I/O, process management, system calls, automation, subprocess, os module
- ❌ Remove: Web scraping, data science, algorithms, LeetCode, matplotlib/pandas
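
The keep/remove rule above can be sketched as a small predicate. This is a minimal illustration, assuming plain substring matching on lowercased text; the keyword lists `KEEP` and `REMOVE` and the function name `keep_example` are hypothetical, shortened stand-ins rather than the lists used for training.

```python
# Hypothetical, shortened keyword lists for illustration only.
KEEP = ["subprocess", "os.", "shutil", "open("]
REMOVE = ["pandas", "matplotlib", "leetcode", "scraping"]

def keep_example(text: str) -> bool:
    """Keep text that mentions a system keyword and no excluded topic."""
    lowered = text.lower()
    has_keep = any(keyword in lowered for keyword in KEEP)
    has_remove = any(keyword in lowered for keyword in REMOVE)
    return has_keep and not has_remove

samples = [
    "import subprocess; subprocess.run(['ls', '-l'])",
    "import pandas as pd; df = pd.read_csv(open('data.csv'))",
    "print('hello world')",
]
for sample in samples:
    print(keep_example(sample), "|", sample)
```

Substring matching is simple but coarse: a short keyword can match inside an unrelated longer word, so the lists are worth reviewing against a sample of the filtered output.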

In [None]:
# Install dependencies
!pip install -q transformers datasets accelerate peft bitsandbytes trl pandas

In [None]:
import torch
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset, concatenate_datasets, Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Configuration
BASE_MODEL_PATH = "/kaggle/input/qwen3-08b-coder-reasoning"  # Original base model (the tokenizer is loaded from here)
PHASE2_MERGED_PATH = "/kaggle/input/qwen3-phase2-linux-merged"  # Phase 2 merged model
OUTPUT_DIR = "/kaggle/working/qwen3-08b-phase3-python"

# Training hyperparameters (OPTIMIZED FOR MEMORY)
BATCH_SIZE = 2  # Reduced from 4 to save memory
GRADIENT_ACCUMULATION = 8  # Increased to keep effective batch size = 16
LEARNING_RATE = 2e-4
NUM_EPOCHS = 3
MAX_SEQ_LENGTH = 2048

In [None]:
# Load Phase 2 merged model
print("="*60)
print("STEP 1: LOADING PHASE 2 MERGED MODEL")
print("="*60)

print("\nLoading Phase 2 merged model (Base + CodeAlpaca + Linux Commands)...")
merged_phase2_model = AutoModelForCausalLM.from_pretrained(
    PHASE2_MERGED_PATH,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

print("✓ Phase 2 merged model loaded")
print("✓ Model includes: Base + CodeAlpaca + Linux Commands")

# This merged model will be used for Phase 3 training
print("\n" + "="*60)

## Step 1: Load Phase 2 Merged Model

We start with the merged model from Phase 2, which already contains knowledge from:
- Base Qwen3-0.8B model
- Phase 1: CodeAlpaca training
- Phase 2: Linux Commands training
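
Before loading, it can help to confirm that the Phase 2 input directory actually looks like a saved Hugging Face model. The check below is a minimal sketch, assuming the directory was written by `save_pretrained` (a `config.json` plus at least one `.safetensors` or `.bin` weight file); the helper name `looks_like_hf_model_dir` is hypothetical.

```python
from pathlib import Path

def looks_like_hf_model_dir(path: str) -> bool:
    """Return True if `path` has a config.json and at least one weight file."""
    root = Path(path)
    if not (root / "config.json").is_file():
        return False
    # Weight files may be a single file or several shards.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    return has_weights

# Path taken from the Input section of this notebook.
print(looks_like_hf_model_dir("/kaggle/input/qwen3-phase2-linux-merged"))
```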

In [None]:
# Load tokenizer with padding token verification
print("="*60)
print("STEP 2: LOADING TOKENIZER")
print("="*60)
print("\nLoading tokenizer from base model...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# CRITICAL: Force padding token to match Phase 1 & 2 exactly
# ALL PHASES use pad_token_id = 151645 (eos_token)
print("⚠️  Forcing pad_token to eos_token (ID: 151645) to match Phase 1 & 2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = 151645
tokenizer.padding_side = "right"

print(f"\n🔍 VERIFICATION:")
print(f"   pad_token_id: {tokenizer.pad_token_id}")
print(f"   eos_token_id: {tokenizer.eos_token_id}")

# Phase 1 & 2 used pad_token_id = 151645
if tokenizer.pad_token_id == 151645:
    print(f"   ✅ CORRECT: Matches Phase 1 & 2 configuration (ID: 151645)")
else:
    print(f"   ⚠️  WARNING: Different from Phase 1 & 2 (expected 151645, got {tokenizer.pad_token_id})")
    print(f"   This may cause training inconsistencies!")

print("="*60)

In [None]:
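# Prepare merged model for LoRA training
print("\n" + "="*60)
print("STEP 3: PREPARING MERGED MODEL FOR TRAINING")
print("="*60)

# Note: the model was loaded in float16 without a quantization config, so no
# 4-bit quantization happens here. prepare_model_for_kbit_training is kept to
# freeze the base weights before LoRA is attached.
print("\nPreparing merged model for LoRA training...")
model = prepare_model_for_kbit_training(merged_phase2_model, use_gradient_checkpointing=True)

# Enable gradient checkpointing explicitly: some PEFT versions only do this
# inside prepare_model_for_kbit_training when the model is actually quantized.
model.gradient_checkpointing_enable()
model.enable_input_require_grads()
print(f"✓ Model prepared for training on device: {model.device}")
print("✓ Gradient checkpointing enabled (saves memory)")
print("="*60)

# `model` and `merged_phase2_model` are the same object, so this only drops
# the extra name; the weights stay on the GPU for training.
del merged_phase2_model
torch.cuda.empty_cache()
print("✓ Cleared unused cached GPU memory")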

In [None]:
# Configure LoRA for Phase 3 training
print("\n" + "="*60)
print("STEP 4: CONFIGURING PHASE 3 LORA")
print("="*60)
print("\nSetting up LoRA for Python automation training...")
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
print("="*60)

## Step 2: Load and Filter Datasets

We'll load three datasets focused on Python system automation:
1. **StackOverflow Python QA** - System Administration & DevOps
2. **Shell Script Dataset** - Shell scripting patterns
3. **Python Codes** - Filtered for system-relevant code only
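
All three sources end up in the same `Instruction: ... Response: ...` text format used by the normalization cell further down. The sketch below condenses that idea into one function; the helper names `first_present` and `to_training_text` and the sample records are hypothetical, and the real column names depend on each dataset.

```python
def first_present(example: dict, keys: list) -> str:
    """Return the first non-empty value among `keys`, or an empty string."""
    for key in keys:
        value = example.get(key)
        if value:
            return str(value)
    return ""

def to_training_text(example: dict) -> dict:
    """Map a raw record to the shared Instruction/Response text format."""
    instruction = first_present(example, ["instruction", "question", "input"])
    response = first_present(example, ["output", "answer", "script", "code"])
    return {"text": f"Instruction: {instruction}\n\nResponse: {response}"}

# Hypothetical records; real column names depend on each dataset.
records = [
    {"question": "How do I list files?", "answer": "Use os.listdir(path)."},
    {"instruction": "Print the date", "script": "date"},
]
texts = [to_training_text(record)["text"] for record in records]
print(texts[0])
```

Records where neither field is found produce an empty instruction or response, so it is worth dropping those before training.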

In [None]:
# Load HuggingFace datasets (requires internet enabled in notebook settings)
print("Loading HuggingFace datasets...")

# Dataset 1: StackOverflow Python QA - System Administration
try:
    print("Loading RazinAleks/SO-Python_QA-System_Administration_and_DevOps_class...")
    hf_so = load_dataset("RazinAleks/SO-Python_QA-System_Administration_and_DevOps_class", split="train")
    # Sample if too large to speed up processing
    if len(hf_so) > 10000:
        hf_so = hf_so.shuffle(seed=42).select(range(10000))
        print(f"✓ StackOverflow Python QA: {len(hf_so)} examples (sampled for efficiency)")
    else:
        print(f"✓ StackOverflow Python QA: {len(hf_so)} examples")
except Exception as e:
    print(f"! Could not load StackOverflow dataset: {e}")
    hf_so = None

# Dataset 2: Shell Script Dataset
try:
    print("Loading infinite-dataset-hub/ShellScriptDataset...")
    hf_shell = load_dataset("infinite-dataset-hub/ShellScriptDataset", split="train")
    # Sample if too large
    if len(hf_shell) > 5000:
        hf_shell = hf_shell.shuffle(seed=42).select(range(5000))
        print(f"✓ Shell Script Dataset: {len(hf_shell)} examples (sampled for efficiency)")
    else:
        print(f"✓ Shell Script Dataset: {len(hf_shell)} examples")
except Exception as e:
    print(f"! Could not load Shell Script dataset: {e}")
    hf_shell = None

In [None]:
# Dataset 3: Python Codes (will filter for system-relevant)
try:
    print("Loading flytech/python-codes-25k...")
    hf_python_raw = load_dataset("flytech/python-codes-25k", split="train")
    print(f"✓ Python Codes: {len(hf_python_raw)} examples (will filter)")
    
    # Filter for system-relevant keywords with negative filtering
    print("Filtering for system-relevant Python code...")
    
    # POSITIVE keywords - MUST have at least one
    system_keywords = [
        'os.', 'sys.', 'subprocess', 'shutil', 'pathlib', 'glob',
        'file', 'directory', 'process', 'path', 'chmod', 'chown',
        'environ', 'execute', 'script', 'automation', 'system',
        'open(', 'read', 'write', 'json', 'yaml', 'config',
        'argparse', 'logging', 'socket', 'threading', 'multiprocessing'
    ]
    
    # NEGATIVE keywords - exclude if ANY are present
    exclude_keywords = [
        'hexagonal', 'tile', 'palindrome', 'fibonacci', 'leetcode',
        'algorithm', 'dataframe', 'matplotlib', 'pandas', 'seaborn',
        'pyplot', 'plot', 'graph', 'visualization', 'sklearn',
        'tensorflow', 'keras', 'pytorch', 'neural', 'machine learning',
        'deep learning', 'scraping', 'beautifulsoup', 'selenium',
        'theorem', 'proof', 'equation', 'matrix multiplication',
        'binary tree', 'linked list', 'stack', 'queue', 'heap'
    ]
    

    def is_system_relevant(example):
        """Check if code contains system keywords and excludes non-system topics"""
        code = str(example.get('code', '') or
                   example.get('text', '') or
                   example.get('instruction', '') or
                   example.get('output', '') or '').lower()

        # Must have at least one positive keyword
        has_system = any(keyword.lower() in code for keyword in system_keywords)
        # Must NOT have any negative keywords
        has_exclude = any(keyword.lower() in code for keyword in exclude_keywords)
        return has_system and not has_exclude

    hf_python = hf_python_raw.filter(is_system_relevant)
    print(f"✓ Filtered to {len(hf_python)} system-relevant examples (from {len(hf_python_raw)})")
except Exception as e:
    print(f"! Could not load/filter Python Codes dataset: {e}")
    hf_python = None

In [None]:
# Normalize all datasets to common format
print("\nNormalizing datasets to common format...")

def normalize_so_python(dataset):
    """Normalize StackOverflow Python QA dataset"""
    if dataset is None:
        return None
    
    def format_so(example):
        question = example.get('question', example.get('instruction', example.get('input', '')))
        answer = example.get('answer', example.get('output', example.get('response', '')))
        return {"text": f"Instruction: {question}\n\nResponse: {answer}"}
    
    return dataset.map(format_so, remove_columns=dataset.column_names)

def normalize_shell(dataset):
    """Normalize Shell Script dataset"""
    if dataset is None:
        return None
    
    def format_shell(example):
        instruction = example.get('instruction', example.get('question', example.get('input', '')))
        script = example.get('script', example.get('output', example.get('code', '')))
        return {"text": f"Instruction: {instruction}\n\nResponse: {script}"}
    
    return dataset.map(format_shell, remove_columns=dataset.column_names)

def normalize_python(dataset):
    """Normalize Python Codes dataset"""
    if dataset is None:
        return None
    
    def format_python(example):
        instruction = example.get('instruction', example.get('question', example.get('input', '')))
        code = example.get('code', example.get('output', example.get('text', '')))
        return {"text": f"Instruction: {instruction}\n\nResponse: {code}"}
    
    return dataset.map(format_python, remove_columns=dataset.column_names)

# Normalize all datasets
datasets_normalized = []

if hf_so is not None:
    ds1 = normalize_so_python(hf_so)
    if ds1:
        datasets_normalized.append(ds1)
        print(f"✓ Normalized StackOverflow: {len(ds1)}")

if hf_shell is not None:
    ds2 = normalize_shell(hf_shell)
    if ds2:
        datasets_normalized.append(ds2)
        print(f"✓ Normalized Shell Scripts: {len(ds2)}")

if hf_python is not None:
    ds3 = normalize_python(hf_python)
    if ds3:
        datasets_normalized.append(ds3)
        print(f"✓ Normalized Python Codes: {len(ds3)}")

In [None]:
# Combine all normalized datasets
if not datasets_normalized:
    raise ValueError("No datasets loaded successfully! Check data sources and internet connection.")

print(f"\nCombining {len(datasets_normalized)} datasets...")
dataset = concatenate_datasets(datasets_normalized)
print(f"✓ Combined dataset size: {len(dataset)}")

# CRITICAL: Cap dataset size to avoid OOM errors
# Phase 1 used 20K, Phase 2 capped at 25K from 76K
# Phase 3: Target 20K to avoid capping and ensure high quality
MAX_DATASET_SIZE = 20000
if len(dataset) > MAX_DATASET_SIZE:
    print(f"⚠️  Dataset too large ({len(dataset)} examples)")
    print(f"⚠️  Sampling {MAX_DATASET_SIZE} examples to fit in GPU memory...")
    dataset = dataset.shuffle(seed=42).select(range(MAX_DATASET_SIZE))
    print(f"✓ Reduced to {len(dataset)} examples")
else:
    # Shuffle for better training
    dataset = dataset.shuffle(seed=42)
    print(f"✓ Dataset shuffled")

print(f"\n✓ Final training size: {len(dataset)} examples")

# Show sample
print("\nSample training example:")
print(dataset[0]['text'][:500])

In [None]:
# Training arguments (using SFTConfig for SFT-specific parameters)
training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    learning_rate=LEARNING_RATE,
    fp16=True,
    save_strategy="epoch",
    logging_steps=50,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    report_to="none",
    max_grad_norm=0.3,
    # SFT-specific parameters
    max_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=False,
)

In [None]:
# Initialize trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    processing_class=tokenizer,
    args=training_args,
)

print("Trainer initialized. Starting training...")
print(f"Starting training on {len(dataset)} examples...")

In [None]:
# Train!
trainer.train()

In [None]:
# Save the fine-tuned model
print("Saving model...")
trainer.model.save_pretrained(OUTPUT_DIR + "/final")
tokenizer.save_pretrained(OUTPUT_DIR + "/final")
print(f"✓ Model saved to {OUTPUT_DIR}/final")

In [None]:
# Merge LoRA adapter into base model for next phase
print("\n" + "="*60)
print("MERGING LORA ADAPTER INTO BASE MODEL")
print("="*60)

# Reload the model in half precision (float16) for merging
print("\nLoading trained model for merging...")
from peft import PeftModel

# The adapter was trained on top of the Phase 2 merged model, so it must be
# merged into those same weights. Merging into the original base model would
# drop the Phase 1 and Phase 2 updates.
base_model = AutoModelForCausalLM.from_pretrained(
    PHASE2_MERGED_PATH,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load the LoRA adapter we just trained
lora_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR + "/final")

# Merge LoRA weights into the Phase 2 weights
print("Merging LoRA adapter into Phase 2 merged model...")
merged_model = lora_model.merge_and_unload()

# Save the merged model
MERGED_OUTPUT = OUTPUT_DIR + "/merged"
print(f"Saving merged model to {MERGED_OUTPUT}...")
merged_model.save_pretrained(MERGED_OUTPUT)
tokenizer.save_pretrained(MERGED_OUTPUT)

print(f"\n✓ Merged model saved to {MERGED_OUTPUT}")
print("✓ This merged model should be used as input for Phase 4")
print(f"✓ Pad token ID: {tokenizer.pad_token_id} (preserved for next phase)")

# Clean up to free memory
del base_model, lora_model, merged_model
torch.cuda.empty_cache()
print("✓ Memory cleared")

## Merge LoRA Adapter into Base Model

For sequential training (Phase 3 → Phase 4), we merge the LoRA adapter into the base model so Phase 4 can build on all accumulated knowledge.
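
As a rough numerical illustration of what merging does, the toy example below applies the standard LoRA update `W_merged = W + (alpha / r) * (B @ A)` to a 2x2 matrix. It is a sketch under stated assumptions: the matrices and the helpers `matmul` and `merge_lora` are made up for illustration, and the scaling `alpha / r` assumes the default (non-rank-stabilized) LoRA configuration.

```python
# Toy illustration of a LoRA merge: W_merged = W + (alpha / r) * (B @ A).
# Shapes and values are made up; real layers are far larger.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(w, a, b, alpha, r):
    """Return w + (alpha / r) * (b @ a) without modifying the inputs."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

r, alpha = 1, 2                 # toy rank; r=16 with lora_alpha=32 gives the same scale of 2
w = [[1.0, 0.0], [0.0, 1.0]]    # frozen weight the adapter was trained on (2x2)
a = [[1.0, 2.0]]                # LoRA A: r x in_features (1x2)
b = [[0.5], [0.0]]              # LoRA B: out_features x r (2x1)

merged = merge_lora(w, a, b, alpha, r)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

Because the update is added to `W`, the merged result depends on which weights the adapter is merged into; the adapter only encodes a correction relative to the weights it was trained on.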

In [None]:
# Test the model
print("\nTesting the fine-tuned model...")
test_prompts = [
    "Instruction: Write a Python script to backup all files in a directory\n\nResponse:",
    "Instruction: Create a Python script to monitor CPU usage\n\nResponse:",
    "Instruction: Write Python code to automate file organization by extension\n\nResponse:",
]

for prompt in test_prompts:
    print(f"\n{'='*60}")
    print(f"PROMPT: {prompt}")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # temperature only takes effect when sampling is enabled
    outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"OUTPUT: {result}")

In [None]:
# Archive BOTH outputs for download
print("\nArchiving outputs...")
!zip -r qwen3-08b-phase3-python-lora.zip {OUTPUT_DIR}/final
!zip -r qwen3-08b-phase3-python-merged.zip {OUTPUT_DIR}/merged

print("\n" + "="*60)
print("PHASE 3 COMPLETE!")
print("="*60)
print("\n✓ LoRA adapter archived: qwen3-08b-phase3-python-lora.zip")
print("✓ Merged model archived: qwen3-08b-phase3-python-merged.zip")
print("\n⚠️  IMPORTANT: Upload the MERGED model as input for Phase 4")
print(f"⚠️  Pad token ID {tokenizer.pad_token_id} is preserved in merged model")