# 🚀 ROBUST Phi-3 Mini Fine-tuning for Kantra Rules Generation

## 🎯 **New Robust Pipeline Features:**

### **✅ Key Improvements:**
1. **Explicit Chat Template Formatting** - Ensures consistent user/assistant structure
2. **System Prompt Integration** - Forces YAML-only output during training
3. **Automatic Model Merging** - Creates standalone model (no PEFT complexity)
4. **Simple Inference** - Load and use like any regular model
5. **Better Error Handling** - Robust validation and fallbacks

### **🔄 Pipeline Overview:**
```
Training Data → Explicit Formatting → LoRA Fine-tuning → Model Merging → Standalone Model
```

### **📋 Expected Results:**
- ✅ **Consistent YAML Output** (no conversational text)
- ✅ **Simple Inference** (no adapter loading)
- ✅ **Better Reliability** (merged model is more stable)
- ✅ **Easy Deployment** (single model directory)
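
As a rough illustration of the explicit formatting and system prompt items above, the sketch below wraps one training example in a system/user/assistant message list. The field names `prompt` and `completion`, the helper `to_chat_messages`, and the system prompt wording are assumptions made for this example; the repository's own pipeline script may format its data differently.

```python
# Hypothetical system prompt; the wording used by the real pipeline is not shown in this notebook.
SYSTEM_PROMPT = "You are a Kantra rule generator. Respond with YAML only."


def to_chat_messages(example: dict) -> list[dict]:
    """Wrap one training example in an explicit system/user/assistant structure.

    Assumes the example has 'prompt' and 'completion' keys (illustrative names).
    """
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": example["prompt"].strip()},
        {"role": "assistant", "content": example["completion"].strip()},
    ]


example = {
    "prompt": "Generate a Kantra rule that flags imports of sun.misc.Unsafe.",
    "completion": "- ruleID: example-00001\n",
}
messages = to_chat_messages(example)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

A list like this can be passed to `tokenizer.apply_chat_template(messages, tokenize=False)` to produce the training text. Whether the system turn is rendered depends on the chat template shipped with the model.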

---


# 🚀 Phi-3 Mini Fine-tuning for Kantra Rules Generation

This notebook fine-tunes Microsoft's Phi-3-mini-4k-instruct model to generate Kantra migration rules using LoRA (Low-Rank Adaptation) for parameter-efficient training.

## 📋 What this notebook does:
- Fine-tunes Phi-3-mini (3.8B params) using QLoRA
- Uses your custom Kantra rules dataset
- Optimized for Google Colab (free GPU)
- Creates a specialized model for migration rule generation

## ⚡ Expected Performance:
- **Colab Free (T4)**: ~15-30 minutes training time
- **Colab Pro (A100)**: ~5-15 minutes training time
- **Local Mac**: ~2.5-4.5 hours (not recommended)

---


## ⚠️ Step 1: GPU Check

**IMPORTANT**: Make sure you're using a GPU runtime before proceeding!

Go to: **Runtime** → **Change runtime type** → **Hardware accelerator** → **T4 GPU**


In [None]:
# Verify GPU is available
import torch

if torch.cuda.is_available():
    print("✅ GPU is available!")
    print(f"🚀 Using GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("❌ GPU is not available!")
    print("⚠️ Please set up a GPU before using this notebook:")
    print("   1. Go to Runtime → Change runtime type")
    print("   2. Select 'T4 GPU' under Hardware accelerator")
    print("   3. Click Save and restart the runtime")
    print("   4. Re-run this cell")
    raise RuntimeError("GPU required for efficient fine-tuning. Please enable GPU and restart.")



## 📥 Step 2: Clone Repository

Clone the Kantra fine-tuning repository and install dependencies.


In [None]:
# Clone the repository and set up environment
%cd /content/
%rm -rf kantra-finetune
!git clone --depth 1 https://github.com/sshaaf/kantra-finetune.git
%cd kantra-finetune
%ls

# Install PyTorch with CUDA support first
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install bitsandbytes separately (common Colab issue)
!pip install bitsandbytes

# Install core fine-tuning packages
!pip install transformers datasets accelerate peft trl

# Install flash attention (optional, may fail but that's OK)
!pip install flash-attn --no-build-isolation || echo "⚠️ Flash attention install failed - will use eager attention"

# Install any additional dependencies from the repository (if setup.py exists)
!pip install -e . || echo "ℹ️ No setup.py found - using manual package installation"

print("✅ Installation completed!")


## 🔧 Alternative Installation (Run if above fails)

If the installation above fails, run this cell for a more robust installation.


In [None]:
# Alternative installation method
import shlex
import subprocess
import sys

def install_package(package):
    """Install one pip requirement (optionally followed by pip flags) with error handling"""
    try:
        # Split the spec so that flags such as --index-url reach pip as separate arguments
        subprocess.check_call([sys.executable, "-m", "pip", "install", *shlex.split(package)])
        print(f"✅ {package} installed successfully")
        return True
    except subprocess.CalledProcessError:
        print(f"❌ Failed to install {package}")
        return False

# Essential packages
packages = [
    "torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121",
    "transformers",
    "datasets",
    "accelerate",
    "peft",
    "trl",
    "psutil",  # For memory checking
]

print("🔧 Installing packages individually...")
for package in packages:
    install_package(package)

# Try bitsandbytes with different approaches
print("\n🔧 Installing bitsandbytes...")

# Method 1: Standard pip install
bitsandbytes_installed = install_package("bitsandbytes")
if not bitsandbytes_installed:
    # Method 2: Try with --no-deps
    print("🔄 Trying bitsandbytes with --no-deps...")
    bitsandbytes_installed = install_package("bitsandbytes --no-deps")

# Optional packages
print("\n🔧 Installing optional packages...")
optional_packages = ["flash-attn"]
for package in optional_packages:
    if not install_package(f"{package} --no-build-isolation"):
        print(f"⚠️ {package} failed - will use fallback")

print("\n✅ Installation complete!")
print(f"📦 Quantization available: {bitsandbytes_installed}")

if not bitsandbytes_installed:
    print("⚠️ Warning: bitsandbytes not installed - will run without quantization")
    print("💡 This will use more memory but should still work")

## 🔧 Step 3: Environment Setup

Verify the installation and check hardware configuration.


In [None]:
# Verify installation and setup
import torch
import sys
import os

print(f"📍 Current directory: {os.getcwd()}")
print(f"🐍 Python version: {sys.version}")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"⚡ CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"🚀 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("❌ No GPU available - this will be very slow!")

# Check if bitsandbytes is available
try:
    import bitsandbytes
    bitsandbytes_available = True
    print(f"📦 BitsAndBytes version: {bitsandbytes.__version__}")
except ImportError:
    bitsandbytes_available = False
    print("⚠️ BitsAndBytes not available - will run without quantization")

# Set device configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
use_quantization = torch.cuda.is_available() and bitsandbytes_available

print("\n✅ Configuration:")
print(f"   🎯 Device: {device}")
print(f"   📦 Quantization: {use_quantization}")
print(f"   📁 Working directory: {os.getcwd()}")

if not use_quantization and torch.cuda.is_available():
    print("   ⚠️ GPU available but quantization disabled (no bitsandbytes)")
    print("   💡 Will use more GPU memory but should still work")

# Verify key files exist
if os.path.exists("train_dataset.jsonl"):
    print(f"   📊 Dataset found: train_dataset.jsonl ({os.path.getsize('train_dataset.jsonl')/1024:.1f} KB)")
else:
    print("   ⚠️ Dataset not found - will need to upload train_dataset.jsonl")


## 📁 Step 4: Dataset Check

Check for existing dataset or upload your `train_dataset.jsonl` file.


In [None]:
# Check for existing dataset or upload
import os

dataset_file = "train_dataset.jsonl"

# Check if dataset already exists
if os.path.exists(dataset_file):
    print(f"✅ Found existing dataset: {dataset_file}")
    print(f"📊 File size: {os.path.getsize(dataset_file) / 1024:.1f} KB")
else:
    print("📁 Dataset not found locally. Please upload your train_dataset.jsonl file:")
    try:
        from google.colab import files
        uploaded = files.upload()

        if dataset_file in uploaded:
            print(f"✅ Dataset uploaded successfully: {dataset_file}")
            print(f"📊 File size: {os.path.getsize(dataset_file) / 1024:.1f} KB")
        else:
            print("❌ Dataset file not found. Please upload train_dataset.jsonl")
            raise FileNotFoundError("Dataset file is required to proceed")
    except ImportError:
        # Not in Colab environment
        print("ℹ️ Not in Colab environment. Please ensure train_dataset.jsonl is in the current directory.")
        if not os.path.exists(dataset_file):
            raise FileNotFoundError(f"Dataset file '{dataset_file}' not found in current directory")
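
The check above only confirms that the file exists. As a minimal sketch of a stricter check, the helper below (the name `check_jsonl` is hypothetical, not part of the repository) walks a JSONL file and reports lines that are not valid JSON objects. It makes no assumption about which fields each record contains. The demo writes a small temporary file so the block runs on its own.

```python
import json
import os
import tempfile
from pathlib import Path


def check_jsonl(path: str) -> tuple[int, list[int]]:
    """Return (number of valid records, 1-based line numbers that failed)."""
    valid, bad_lines = 0, []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
                continue
            if isinstance(record, dict):
                valid += 1
            else:
                bad_lines.append(lineno)  # each line should hold one JSON object
    return valid, bad_lines


# Demo on a throwaway file with two good records and two bad lines
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"a": 1}\n\nnot json\n[1, 2]\n{"b": 2}\n')
valid, bad = check_jsonl(tmp.name)
os.remove(tmp.name)
print(f"valid records: {valid}, bad lines: {bad}")
```

In this notebook the call would be `check_jsonl(dataset_file)`, run before loading the dataset.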


## ⚙️ Step 5: Model Configuration

Configure the base model, quantization settings, and training parameters.


In [None]:
# Import required libraries
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

# Model and training configuration
model_id = "microsoft/Phi-3-mini-4k-instruct"
new_model_name = "phi-3-mini-kantra-rules-generator"

print(f"🤖 Base model: {model_id}")
print(f"🎯 Output model: {new_model_name}")


In [None]:
# Configure quantization for memory efficiency (QLoRA)
bnb_config = None
if use_quantization:
    try:
        # BitsAndBytesConfig should already be imported above, but let's be safe
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        print("⚡ 4-bit quantization enabled for memory efficiency")
    except Exception as e:
        print(f"❌ Quantization setup failed: {e}")
        print("🔄 Disabling quantization and continuing...")
        use_quantization = False
        bnb_config = None

if not use_quantization:
    print("⚠️ Running without quantization (will use more memory)")
    print("💡 This is normal if bitsandbytes installation failed")
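
The quantization config above hard-codes `bfloat16`, and a later cell deals with FlashAttention failing on the T4. Both features are tied to GPU generation, so one option is to derive the settings from the CUDA compute capability. The sketch below is an assumption-laden illustration rather than the notebook's method: the helper `pick_precision` is hypothetical, and it assumes that compute capability 8.0 (Ampere) or newer is needed for native bfloat16 and for FlashAttention 2.

```python
def pick_precision(compute_capability: tuple[int, int], flash_attn_installed: bool) -> dict:
    """Suggest dtype and attention settings from a CUDA compute capability.

    Assumption: Ampere or newer (major version >= 8) is required for native
    bfloat16 and FlashAttention 2; older GPUs fall back to float16 and eager.
    """
    major, _minor = compute_capability
    ampere_or_newer = major >= 8
    return {
        "compute_dtype": "bfloat16" if ampere_or_newer else "float16",
        "attn_implementation": (
            "flash_attention_2" if ampere_or_newer and flash_attn_installed else "eager"
        ),
    }


print(pick_precision((7, 5), flash_attn_installed=True))  # T4 (Turing)
print(pick_precision((8, 0), flash_attn_installed=True))  # A100 (Ampere)
```

In the notebook, the capability tuple would come from `torch.cuda.get_device_capability(0)`, and the dtype name can be resolved with `getattr(torch, settings["compute_dtype"])`.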


## 📥 Step 6: Load Model and Tokenizer


In [None]:
# Load tokenizer and model
import importlib.util

print("📝 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# FlashAttention 2 needs an Ampere-or-newer GPU (compute capability >= 8.0, so not the T4)
# and an installed flash-attn package; otherwise fall back to eager attention
use_flash_attention = (
    torch.cuda.is_available()
    and torch.cuda.get_device_capability(0)[0] >= 8
    and importlib.util.find_spec("flash_attn") is not None
)

print("🤖 Loading Phi-3 model...")
model_kwargs = {
    "trust_remote_code": True,
    "attn_implementation": "flash_attention_2" if use_flash_attention else "eager",
}

if bnb_config is not None:
    model_kwargs["quantization_config"] = bnb_config
    model_kwargs["device_map"] = "auto"
    model_kwargs["dtype"] = torch.bfloat16
else:
    model_kwargs["dtype"] = torch.float32

model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
model.config.use_cache = False
model.config.pretraining_tp = 1

print("✅ Model and tokenizer loaded successfully!")


## 🎛️ Step 7: Configure LoRA and Load Dataset


In [None]:
# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=16,                # Rank of the update matrices
    lora_alpha=32,       # Alpha parameter for scaling
    lora_dropout=0.05,   # Dropout probability for LoRA layers
    bias="none",
    task_type="CAUSAL_LM",
    # Phi-3 fuses its attention and MLP projections into qkv_proj and gate_up_proj,
    # so the separate q_proj/k_proj/v_proj/gate_proj/up_proj names would match nothing
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)

# Load the dataset
print("📊 Loading training dataset...")
dataset = load_dataset("json", data_files=dataset_file, split="train")

print(f"✅ Dataset loaded: {len(dataset):,} examples")
print("🎯 LoRA will train only a small fraction (under ~1%) of model parameters!")

# Show a sample
print("\n📝 Sample training example:")
print("=" * 50)
sample = dataset[0]
for key, value in sample.items():
    if isinstance(value, str) and len(value) > 200:
        print(f"{key}: {value[:200]}...")
    else:
        print(f"{key}: {value}")
print("=" * 50)
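
The "~1% of parameters" figure can be sanity-checked by hand. LoRA adds two small matrices to each adapted linear layer, one of shape rank x in_features and one of shape out_features x rank, so each layer contributes `rank * (in_features + out_features)` trainable parameters. The sketch below assumes adapters on Phi-3's fused projection layers and uses layer shapes taken from the published Phi-3-mini configuration (hidden size 3072, MLP size 8192, 32 layers); treat those numbers as assumptions and confirm them against `model.config` before relying on them.

```python
def lora_param_count(shapes: dict[str, tuple[int, int]], rank: int, num_layers: int) -> int:
    """Count LoRA parameters: rank * (in_features + out_features) per adapted layer."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers


# Assumed (in_features, out_features) for one Phi-3-mini transformer block
phi3_mini_shapes = {
    "qkv_proj": (3072, 9216),
    "o_proj": (3072, 3072),
    "gate_up_proj": (3072, 16384),
    "down_proj": (8192, 3072),
}

trainable = lora_param_count(phi3_mini_shapes, rank=16, num_layers=32)
base_params = 3.8e9  # approximate model size quoted in this notebook
print(f"{trainable:,} trainable LoRA parameters ({trainable / base_params:.2%} of the base model)")
```

Once the trainer has wrapped the model with PEFT, `print_trainable_parameters()` on the resulting PEFT model reports the exact count.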


## 🏋️ Step 8: Training Configuration and Start Training


## 🔧 Fix Training Arguments (Run this if you get ValueError)

If you get a ValueError about `load_best_model_at_end`, run this cell to fix the training arguments.


In [None]:
# Fixed training arguments (without load_best_model_at_end)
training_args = TrainingArguments(
    output_dir=f"./{new_model_name}",
    per_device_train_batch_size=2 if use_quantization else 1,
    gradient_accumulation_steps=2 if use_quantization else 4,
    learning_rate=2e-4,
    logging_steps=10,
    num_train_epochs=3,
    save_strategy="epoch",
    optim="paged_adamw_32bit" if use_quantization else "adamw_torch",
    fp16=False,
    bf16=use_quantization,
    dataloader_pin_memory=True,
    remove_unused_columns=False,
    report_to="none",  # Disable wandb and other reporters (None does not reliably disable them)
    save_total_limit=2,  # Keep only the last 2 checkpoints to save space
    # Removed load_best_model_at_end since we don't have validation data
    # This parameter requires eval_strategy to be set, which we don't need for this training
)

print("✅ Training arguments fixed!")
print("💡 Removed load_best_model_at_end to avoid validation requirement")


## 🔧 Fix Flash Attention Error (Run this if training fails)

If you get a "FlashAttention only supports Ampere GPUs or newer" error, run this cell.


In [None]:
# Fix Flash Attention and Wandb issues
import os

# Disable wandb completely
os.environ["WANDB_DISABLED"] = "true"
os.environ["WANDB_MODE"] = "disabled"

# Recreate training arguments with eager attention and disabled wandb
training_args = TrainingArguments(
    output_dir=f"./{new_model_name}",
    per_device_train_batch_size=2 if use_quantization else 1,
    gradient_accumulation_steps=2 if use_quantization else 4,
    learning_rate=2e-4,
    logging_steps=10,
    num_train_epochs=3,
    save_strategy="epoch",
    optim="paged_adamw_32bit" if use_quantization else "adamw_torch",
    fp16=False,
    bf16=use_quantization,
    dataloader_pin_memory=True,
    remove_unused_columns=False,
    report_to=[],  # Empty list to disable all reporting
    save_total_limit=2,
)

# Reload model with eager attention (no flash attention)
print("🔄 Reloading model with eager attention...")

# Clear memory first
if 'model' in locals():
    del model
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Reload model with eager attention
model_kwargs = {
    "trust_remote_code": True,
    "attn_implementation": "eager",  # Force eager attention for T4 compatibility
}

if bnb_config is not None:
    model_kwargs["quantization_config"] = bnb_config
    model_kwargs["device_map"] = "auto"
    model_kwargs["dtype"] = torch.bfloat16
else:
    model_kwargs["dtype"] = torch.float32

model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
model.config.use_cache = False
model.config.pretraining_tp = 1

print("✅ Model reloaded with eager attention")
print("✅ Wandb disabled")
print("💡 Ready to train on T4 GPU!")


In [None]:
# Recreate trainer with fixed configuration
print("🚀 Creating trainer with T4-compatible settings...")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    peft_config=lora_config,
)

print("✅ Trainer created successfully!")
print("🎯 Configuration:")
print("   - Attention: eager (T4 compatible)")
print("   - Wandb: disabled")
print(f"   - Quantization: {use_quantization}")
print(f"   - Device: {model.device}")

# Show memory usage
if torch.cuda.is_available():
    memory_used = torch.cuda.memory_allocated() / 1e9
    memory_total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"   - GPU Memory: {memory_used:.1f}GB / {memory_total:.1f}GB ({memory_used/memory_total*100:.1f}%)")

print("\n🏃 Ready to start training!")


In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir=f"./{new_model_name}",
    per_device_train_batch_size=2 if use_quantization else 1,
    gradient_accumulation_steps=2 if use_quantization else 4,
    learning_rate=2e-4,
    logging_steps=10,
    num_train_epochs=3,
    save_strategy="epoch",
    optim="paged_adamw_32bit" if use_quantization else "adamw_torch",
    fp16=False,
    bf16=use_quantization,
    dataloader_pin_memory=True,
    remove_unused_columns=False,
    report_to="none",  # Disable wandb and other reporters
    # load_best_model_at_end is omitted: it needs an evaluation strategy and a
    # validation set, and raises a ValueError without them
)

# Calculate estimates
total_examples = len(dataset)
batch_size = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
steps_per_epoch = total_examples // batch_size
estimated_time = "15-30 minutes (GPU)" if use_quantization else "2.5-4.5 hours (CPU)"

print("🏋️ Training Configuration:")
print(f"   📊 Examples: {total_examples:,}")
print(f"   🔢 Effective batch size: {batch_size}")
print(f"   📈 Steps per epoch: {steps_per_epoch:,}")
print(f"   🔄 Epochs: {training_args.num_train_epochs}")
print(f"   ⏱️ Estimated time: {estimated_time}")
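
To make the estimate above concrete, here is the same arithmetic as a small standalone function. The dataset size of 1,000 examples is a made-up figure for illustration, and the helper `training_steps` is hypothetical. It rounds partial batches up, whereas the cell above rounds down, and the trainer's own step count may differ slightly from both depending on the library version.

```python
import math


def training_steps(num_examples: int, per_device_batch: int, grad_accum: int, epochs: int) -> dict:
    """Estimate optimizer steps for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)  # count the last partial batch
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


# Quantized settings from this notebook: batch size 2, accumulation 2, 3 epochs
print(training_steps(num_examples=1000, per_device_batch=2, grad_accum=2, epochs=3))
# Non-quantized settings: batch size 1, accumulation 4 (same effective batch)
print(training_steps(num_examples=1000, per_device_batch=1, grad_accum=4, epochs=3))
```

Both configurations give the same effective batch size of 4, so they differ in memory use and speed rather than in the number of optimizer updates.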


In [None]:
# Create trainer and start training
import time

print("🚀 Initializing trainer...")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    peft_config=lora_config,
)

print("🏃 Starting fine-tuning process...")
print(f"⏰ Started at: {time.strftime('%H:%M:%S')}")
print("\n" + "="*60)
print("🔥 TRAINING IN PROGRESS")
print("="*60)

start_time = time.time()
trainer.train()
end_time = time.time()

training_time = end_time - start_time
print("\n" + "="*60)
print("🎉 TRAINING COMPLETED!")
print("="*60)
print(f"⏱️ Total training time: {training_time/60:.1f} minutes")
print(f"⏰ Finished at: {time.strftime('%H:%M:%S')}")


## 💾 Step 9: Save Model and Test


In [None]:
# Save the fine-tuned model
import os
import shutil

final_model_path = f"./{new_model_name}-final"
print(f"💾 Saving model to: {final_model_path}")
trainer.save_model(final_model_path)
print("✅ Model saved successfully!")

# Create downloadable archive (for Colab)
try:
    archive_name = f"{new_model_name}-final"
    shutil.make_archive(archive_name, 'zip', final_model_path)
    archive_size = os.path.getsize(f"{archive_name}.zip") / 1024 / 1024
    print(f"📦 Archive created: {archive_name}.zip ({archive_size:.1f} MB)")
    print("📥 Download from Colab file browser!")
except OSError as e:
    print(f"ℹ️ Archive creation skipped: {e}")

print("\n🎊 Fine-tuning complete! Your model is ready to use.")


In [None]:
# Test the fine-tuned model
print("🧪 Testing the fine-tuned model...")

test_prompt = "Generate a Kantra rule to detect when a Java file imports `sun.misc.Unsafe`, which is a non-portable and risky API."

messages = [{"role": "user", "content": test_prompt}]
# add_generation_prompt=True appends the assistant header so the model starts its reply
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

print(f"🎯 Test prompt: {test_prompt}")
print("\n🤖 Generating response...")

with torch.no_grad():
    generated_ids = model.generate(
        model_inputs,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.1,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens (everything after the prompt).
# Splitting on "<|assistant|>" does not work once special tokens are stripped.
new_tokens = generated_ids[0][model_inputs.shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print("\n" + "="*60)
print("🎉 MODEL RESPONSE")
print("="*60)
print(response)
print("="*60)

# Validate YAML
try:
    import yaml
    parsed = yaml.safe_load(response)
    # Plain prose also parses as a YAML scalar, so require a mapping or a list
    if isinstance(parsed, (dict, list)):
        print("✅ Success! The output appears to be valid YAML.")
    else:
        print("⚠️ Output parsed as a plain scalar, not a YAML rule structure")
except ImportError:
    print("ℹ️ PyYAML not installed - YAML validation skipped")
except yaml.YAMLError as e:
    print(f"❌ Output is not valid YAML: {e}")
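
Parsing alone is a weak check, because a sentence of plain text is also valid YAML (it loads as a single string). A slightly stronger check looks at the shape of the parsed result. The sketch below works on the already-parsed object so it has no dependency beyond the standard library. The helper name `rule_problems` is hypothetical, and the required keys `ruleID` and `when` are an assumed minimal schema for illustration, so check the Kantra rule documentation for the authoritative field list.

```python
REQUIRED_KEYS = ("ruleID", "when")  # assumed minimal schema, for illustration only


def rule_problems(parsed) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks plausible."""
    if isinstance(parsed, dict):
        parsed = [parsed]  # accept a single rule as well as a list of rules
    if not isinstance(parsed, list) or not parsed:
        return ["expected a YAML mapping or a non-empty list of mappings"]
    problems = []
    for index, rule in enumerate(parsed):
        if not isinstance(rule, dict):
            problems.append(f"rule {index}: not a mapping")
            continue
        for key in REQUIRED_KEYS:
            if key not in rule:
                problems.append(f"rule {index}: missing '{key}'")
    return problems


# Conversational text loads as a plain string, which this check rejects
print(rule_problems("Sure! Here is your rule"))
# A rule-shaped object (contents are illustrative) passes
print(rule_problems([{"ruleID": "example-00001", "when": {"java.referenced": {"pattern": "sun.misc.Unsafe"}}}]))
```

In the test cell above this would be called as `rule_problems(yaml.safe_load(response))`.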


## 🔄 Step 10: Merge LoRA Adapter (Optional)

**⚠️ Memory Requirements:**
- **Phi-3-mini (3.8B)**: ~8GB RAM needed for merging
- **Colab Free**: 12GB RAM available ✅ **Should work**
- **Colab Pro**: 25GB+ RAM available ✅ **Will work**

**Note**: Larger models (7B+) need 18GB+ RAM and won't work on Colab Free.
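
These figures follow from simple arithmetic: the weights alone take roughly the parameter count times the bytes per parameter, and merging needs some headroom on top of that. The sketch below shows the weight-only part of the calculation. The helper `weights_size_gb` is hypothetical, and the result is a lower bound that ignores activations, temporary copies, and framework overhead.

```python
def weights_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in [("Phi-3-mini (3.8B)", 3.8e9), ("7B model", 7e9)]:
    fp16 = weights_size_gb(params, 2)  # float16: 2 bytes per parameter
    fp32 = weights_size_gb(params, 4)  # float32: 4 bytes per parameter
    print(f"{name}: {fp16:.1f} GB in float16, {fp32:.1f} GB in float32")
```

For Phi-3-mini in float16 this gives about 7.6 GB, which is in line with the ~8GB estimate above. A 7B model needs about 14 GB for the weights alone, which is why it does not fit on Colab Free.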


In [None]:
# Check available memory and decide whether to merge
import psutil
import gc

def get_available_memory_gb():
    """Get available RAM in GB"""
    return psutil.virtual_memory().available / (1024**3)

def get_total_memory_gb():
    """Get total RAM in GB"""
    return psutil.virtual_memory().total / (1024**3)

available_memory = get_available_memory_gb()
total_memory = get_total_memory_gb()

print("💾 System Memory:")
print(f"   Total RAM: {total_memory:.1f} GB")
print(f"   Available RAM: {available_memory:.1f} GB")

# Memory requirements for different models
memory_requirements = {
    "phi-3-mini": 8,     # 3.8B parameters
    "phi-3-small": 12,   # 7B parameters
    "phi-3-medium": 18,  # 14B parameters
}

model_size = "phi-3-mini"  # We're using Phi-3-mini
required_memory = memory_requirements[model_size]

print(f"📊 Model: {model_size}")
print(f"🔧 Required RAM for merging: ~{required_memory} GB")

can_merge = available_memory >= required_memory
print(f"✅ Can merge LoRA: {can_merge}")

if not can_merge:
    print(f"⚠️ Warning: Only {available_memory:.1f}GB available, need {required_memory}GB")
    print("💡 Options:")
    print("   1. Skip merging (use LoRA adapter as-is)")
    print("   2. Upgrade to Colab Pro for more RAM")
    print("   3. Use a local machine with more RAM")
else:
    print("🎉 Sufficient memory available for merging!")


In [None]:
# Merge LoRA adapter (if sufficient memory)
merge_lora = input("🤔 Do you want to merge the LoRA adapter? (y/n): ").lower().strip() == 'y'

if merge_lora and can_merge:
    print("🔄 Starting LoRA merge process...")
    print("⚠️ This may take several minutes and use significant memory...")

    try:
        # Clear memory first
        if 'trainer' in locals():
            del trainer
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        print("📥 Loading base model for merging...")
        # Load base model in float16 to save memory
        base_model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            device_map="auto" if torch.cuda.is_available() else None,
            trust_remote_code=True,
        )

        print("🔗 Loading and merging LoRA adapter...")
        from peft import PeftModel

        # Load the LoRA model
        lora_model = PeftModel.from_pretrained(base_model, final_model_path)

        # Merge the adapter
        merged_model = lora_model.merge_and_unload()

        # Save the merged model
        merged_model_path = f"./{new_model_name}-merged"
        print(f"💾 Saving merged model to: {merged_model_path}")

        merged_model.save_pretrained(merged_model_path)
        tokenizer.save_pretrained(merged_model_path)

        print("✅ LoRA merge completed successfully!")
        print(f"📁 Merged model saved to: {merged_model_path}")

        # Create archive for merged model
        try:
            import shutil
            merged_archive_name = f"{new_model_name}-merged"
            shutil.make_archive(merged_archive_name, 'zip', merged_model_path)
            archive_size = os.path.getsize(f"{merged_archive_name}.zip") / 1024 / 1024
            print(f"📦 Merged model archive: {merged_archive_name}.zip ({archive_size:.1f} MB)")
        except OSError as e:
            print(f"ℹ️ Could not create merged model archive: {e}")

        # Clean up memory
        del base_model, lora_model, merged_model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    except Exception as e:
        print(f"❌ Error during LoRA merge: {e}")
        print("💡 The LoRA adapter is still available for use without merging")

elif merge_lora and not can_merge:
    print("❌ Cannot merge: Insufficient memory")
    print(f"💡 Need {required_memory}GB RAM, but only {available_memory:.1f}GB available")
    print("🔧 The LoRA adapter works fine without merging!")

else:
    print("ℹ️ Skipping LoRA merge - using adapter format")
    print("💡 You can still use the model with the LoRA adapter!")


## 🚀 Step 11: Upload to Hugging Face Hub (Optional)

Upload your fine-tuned model to Hugging Face Hub for easy sharing and deployment.


In [None]:
# Upload to Hugging Face Hub
import os
import time

upload_to_hub = input("🤔 Do you want to upload the model to Hugging Face Hub? (y/n): ").lower().strip() == 'y'

if upload_to_hub:
    try:
        from huggingface_hub import HfApi, login

        print("🔑 Please log in to Hugging Face Hub...")
        print("💡 You'll need a Hugging Face account and access token")
        print("📝 Get your token from: https://huggingface.co/settings/tokens")

        # Login to Hugging Face
        login()

        # Get repository name
        repo_name = input("📝 Enter repository name (e.g., 'your-username/phi3-kantra-rules'): ").strip()

        if not repo_name:
            repo_name = f"phi3-kantra-rules-{int(time.time())}"
            print(f"🏷️ Using default name: {repo_name}")

        # Determine which model to upload
        if 'merged_model_path' in locals() and os.path.exists(merged_model_path):
            upload_path = merged_model_path
            model_type = "merged"
            print(f"📤 Uploading merged model from: {upload_path}")
        else:
            upload_path = final_model_path
            model_type = "LoRA adapter"
            print(f"📤 Uploading LoRA adapter from: {upload_path}")

        # Create repository and upload
        api = HfApi()

        print(f"🏗️ Creating repository: {repo_name}")
        api.create_repo(repo_id=repo_name, exist_ok=True)

        print(f"📤 Uploading {model_type}...")
        api.upload_folder(
            folder_path=upload_path,
            repo_id=repo_name,
            commit_message=f"Upload fine-tuned Phi-3 mini for Kantra rules generation ({model_type})"
        )

        print("✅ Upload completed successfully!")
        print(f"🔗 Your model is available at: https://huggingface.co/{repo_name}")
        print(f"💡 You can now use it with: AutoModelForCausalLM.from_pretrained('{repo_name}')")

        # Create model card
        model_card_content = f"""---
license: mit
base_model: {model_id}
tags:
- phi3
- kantra
- code-migration
- fine-tuned
library_name: transformers
---

# Phi-3 Mini Fine-tuned for Kantra Rules Generation

This model is a fine-tuned version of [{model_id}](https://huggingface.co/{model_id}) for generating Kantra migration rules.

## Model Details
- **Base Model**: {model_id}
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Task**: Code migration rule generation
- **Model Type**: {model_type}

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
{'from peft import PeftModel' if model_type == 'LoRA adapter' else ''}

tokenizer = AutoTokenizer.from_pretrained("{repo_name}")
{'base_model = AutoModelForCausalLM.from_pretrained("' + model_id + '")' if model_type == 'LoRA adapter' else ''}
{'model = PeftModel.from_pretrained(base_model, "' + repo_name + '")' if model_type == 'LoRA adapter' else 'model = AutoModelForCausalLM.from_pretrained("' + repo_name + '")'}

# Generate Kantra rules
prompt = "Generate a Kantra rule to detect deprecated Java APIs"
messages = [{{"role": "user", "content": prompt}}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=500)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Training Details
- Fine-tuned using QLoRA for parameter efficiency
- Optimized for generating YAML-formatted Kantra migration rules
- Trained on custom Kantra rules dataset
"""

        # Upload model card
        with open("README.md", "w") as f:
            f.write(model_card_content)

        api.upload_file(
            path_or_fileobj="README.md",
            path_in_repo="README.md",
            repo_id=repo_name,
            commit_message="Add model card"
        )

        os.remove("README.md")  # Clean up

        print("📄 Model card created and uploaded!")

    except ImportError:
        print("❌ huggingface_hub not installed. Install with: pip install huggingface_hub")
    except Exception as e:
        print(f"❌ Upload failed: {e}")
        print("💡 You can manually upload the model files to Hugging Face Hub")
else:
    print("ℹ️ Skipping Hugging Face Hub upload")
    print("💡 You can manually upload later if needed")


## 🎊 Summary & Next Steps

Congratulations! You've successfully fine-tuned Phi-3-mini for Kantra rules generation.

### ✅ What you accomplished:
- ✅ Fine-tuned a 3.8B parameter model using LoRA
- ✅ Used parameter-efficient training (~1% of parameters)
- ✅ Created a specialized model for migration rule generation
- ✅ Handled memory constraints intelligently
- ✅ Generated downloadable model files

### 📁 Your files:
- **LoRA Adapter**: `phi-3-mini-kantra-rules-generator-final/` (always created)
- **Merged Model**: `phi-3-mini-kantra-rules-generator-merged/` (if merged)
- **Archives**: `.zip` files for easy download

### 🔄 Model Formats Explained:

#### **LoRA Adapter** (Always Available):
- ✅ **Small size**: Only adapter weights (tens of MB rather than several GB)
- ✅ **Storage efficient**: Quick to download and share
- ✅ **Flexible**: Several adapters can share one copy of the base model they were trained on
- ⚠️ **Requires base model**: Need to load Phi-3-mini + adapter

#### **Merged Model** (If you have enough RAM):
- ✅ **Standalone**: Complete model, no base model needed
- ✅ **Faster loading**: No separate adapter-loading step
- ✅ **Easy deployment**: Standard transformer model
- ⚠️ **Larger size**: Full model weights (~7-8GB in float16)

### 🚀 Next steps:

#### **1. Local Testing**:
```python
from transformers import AutoModelForCausalLM

# For LoRA adapter:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = PeftModel.from_pretrained(base_model, "./phi-3-mini-kantra-rules-generator-final")

# For merged model (if available):
model = AutoModelForCausalLM.from_pretrained("./phi-3-mini-kantra-rules-generator-merged")
```

#### **2. Production Deployment**:
- Use merged model for faster inference
- Use LoRA adapter for memory-constrained environments
- Consider uploading to Hugging Face Hub for easy access

#### **3. Further Improvements**:
- 📊 Add validation dataset to monitor overfitting
- 🔧 Experiment with different LoRA ranks (8, 32, 64)
- 📈 Try different learning rates (1e-4, 5e-4)
- 📝 Add more diverse training examples

### 💡 Memory Guidelines:
- **Colab Free (12GB)**: ✅ Phi-3-mini merging works
- **Colab Pro (25GB+)**: ✅ All models work
- **Local Mac**: Depends on unified memory
- **For 7B+ models**: Need 18GB+ RAM for merging

---
**🎉 Happy fine-tuning! Your Kantra rules generator is ready to use! 🚀**


# 🚀 ALTERNATIVE: Run Robust Training Pipeline

## 🎯 **Use This Instead of Individual Steps Above**

If you want the most robust training with automatic model merging, run this cell instead of the individual steps above. This implements all the improvements in one go.

**Key Features:**
- ✅ Explicit chat template formatting
- ✅ System prompt integration
- ✅ Automatic LoRA merging
- ✅ Standalone model creation
- ✅ Built-in testing


In [None]:
# Run the robust training pipeline
!python robust_finetune.py


# 🧪 Test the Robust Model

## 🎯 **Simple Inference with Merged Model**

The robust pipeline creates a merged model that's much simpler to use. No PEFT/LoRA complexity!


In [None]:
# Test the merged model with simple inference
!python simple_inference.py
