# Final Model Overview: Gemma-2-2b-it with Improved Prompts + QLoRA
**Date**: July 24, 2025

## 🤔 Analysis and Considerations

### Performance Comparison with Other Models
Although Gemma-2-2b-it (~2.6B parameters) is far larger than the models it was compared against, the gain is small. For example, DeBERTa-v3-xsmall, with roughly 1/100th the parameters, reaches a MAP@3 of about 0.93, versus 0.9411 here. This suggests that model size alone is not the main factor limiting performance on this task.

### Performance Ceiling Observations
On the leaderboard, most submissions appear to plateau at a MAP@3 of about 0.94, which points to a possible performance ceiling for this competition. Such a plateau suggests limitations that are not easily overcome by scaling model size or refining training techniques alone.

**Resource Constraints and Scalability**: It's important to note that MAP@3 scores don't necessarily improve linearly with model parameter count. If the goal were to improve scores simply by scaling up model size, it would require significantly larger computational resources and associated costs, which are not feasible in my current environment.

### Potential Underlying Issues

#### Misconception Category Problems
The current labeling scheme may be working against optimal performance. The Category:Misconception format creates several challenges:

1. **Severe Class Imbalance**: Many misconception categories have only one or very few samples, making it nearly impossible for models to learn meaningful patterns for these rare classes.

2. **Non-generalizable Labels**: Some misconception categories are highly specific and may not represent patterns that generalize well to unseen data.

3. **Complex Multi-dimensional Classification**: The current format combines multiple classification dimensions (True/False for correctness + Correct/Misconception/Neither for understanding type) into a single label, potentially making the learning task unnecessarily complex.
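The class-imbalance point above can be checked by counting how many labels occur only once or a few times. The following is a minimal sketch, assuming the labels are available as a list of strings; the `rare_labels` helper and the toy data are hypothetical illustrations and are not taken from the competition data.

```python
from collections import Counter

def rare_labels(labels, max_count=1):
    """Return the labels that occur at most `max_count` times, sorted by name."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n <= max_count)

# Hypothetical toy labels for illustration only (not competition data)
toy_labels = [
    "True_Correct:NA", "True_Correct:NA", "True_Correct:NA",
    "False_Neither:NA", "False_Neither:NA",
    "False_Misconception:Additive",
    "True_Misconception:Base_rate",
]

# Labels seen exactly once are the ones a classifier can barely learn
print(rare_labels(toy_labels))
```

Running the same count on the real training labels would show how many of the 65 classes fall below a chosen threshold.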

#### Proposed Alternative Approaches

**Simplified Classification Approach:**
Focus on the core understanding classification (Correct/Misconception/Neither) while excluding the True/False correctness dimension and specific misconception categories. This would:
- Reduce class imbalance issues
- Focus on the most generalizable aspects of student understanding
- Allow better utilization of larger model capabilities
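As a rough illustration of the simplified scheme, a full label can be collapsed to its understanding type by dropping the True/False prefix and the specific misconception name. This is only a sketch: `to_coarse_label` is a hypothetical helper, and it assumes every label follows the `<True|False>_<Category>:<Detail>` pattern seen in this notebook.

```python
COARSE_CATEGORIES = {"Correct", "Misconception", "Neither"}

def to_coarse_label(label):
    """Map e.g. 'True_Misconception:Additive' to 'Misconception'."""
    prefix, _, _detail = label.partition(":")        # 'True_Misconception'
    _correctness, _, category = prefix.partition("_")  # 'Misconception'
    if category not in COARSE_CATEGORIES:
        raise ValueError(f"Unexpected label format: {label!r}")
    return category

for example in ["True_Correct:NA", "False_Neither:NA", "False_Misconception:Additive"]:
    print(example, "->", to_coarse_label(example))
```

With this mapping the 65-way task becomes a 3-way task, at the cost of no longer predicting the specific misconception.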

**Two-Stage Pipeline Approach:**
An alternative strategy could involve:
1. Using a high-performance pre-trained LLM to identify unclear points, errors, or misconceptions in student explanations
2. Applying a secondary classification model to categorize these identified issues

This approach could potentially leverage the superior reasoning capabilities of large language models while avoiding the current labeling complexity issues.

### Hypothesis
The fundamental issue may not be model capacity but the labeling framework itself. Gemma-2-2b and the larger models used by other participants should, in principle, be stronger base LLMs than DeBERTa-v3-xsmall, yet their scores converge near the same level. Simplifying the labeling scheme could therefore unlock better prediction accuracy.

**Note**: These observations are based on empirical results and leaderboard analysis, and represent hypotheses that would require further experimentation to validate.

---

## 🚀 Model Summary
**Model Name**: `gemma-2-2b-improved-prompts-qlora`
**Base Model**: google/gemma-2-2b-it (~2.6B parameters)
**Training Method**: QLoRA (Quantized Low-Rank Adaptation)
**Task**: 65-label classification for mathematical misconception detection

## 📊 Performance Results
- **Final MAP@3 Score**: **0.9411** (94.11%)
- **Final Accuracy**: 0.8894 (88.94%)
- **Evaluation Loss**: 0.3691
- **Training Epochs**: 3.0
- **Evaluation Runtime**: 473.05 seconds
- **Evaluation Speed**: 15.52 samples/second
- **Evaluation Steps per Second**: 1.94
- **Improvement from Baseline**: +0.0111 MAP@3 (0.93 → 0.9411), about 1.1 percentage points
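For reference, MAP@3 with a single correct label per sample gives credit 1, 1/2, or 1/3 when the true label is ranked first, second, or third, and 0 otherwise. That is why the MAP@3 above can be higher than the top-1 accuracy. The sketch below assumes this standard definition; `map_at_3` is a hypothetical helper, not the competition's scoring code.

```python
def map_at_3(true_labels, ranked_predictions):
    """Mean average precision at 3, assuming exactly one true label per sample."""
    total = 0.0
    for truth, ranked in zip(true_labels, ranked_predictions):
        # Only the first three guesses count; credit is 1/rank of the hit
        for rank, guess in enumerate(ranked[:3], start=1):
            if guess == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)

# Toy example: hit at rank 1, hit at rank 2, and a miss
truths = ["a", "b", "c"]
predictions = [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"]]
print(map_at_3(truths, predictions))  # (1 + 1/2 + 0) / 3 = 0.5
```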

## 🔧 Technical Specifications

### QLoRA Configuration
- **Quantization**: 4-bit (nf4) with double quantization
- **LoRA Rank (r)**: 16
- **LoRA Alpha**: 32
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Memory Reduction**: ~75% compared to full fine-tuning

### Training Parameters
- **Epochs**: 3
- **Batch Size**: 8 (per device)
- **Gradient Accumulation**: 4 steps (effective batch size: 32)
- **Learning Rate**: 1e-4
- **Max Token Length**: 1024
- **Training Time**: ~8 hours on A100 GPU

### Training Environment
- **Platform**: Google Colab Pro
- **GPU**: NVIDIA A100 (40GB VRAM)
- **Compute Units Consumed**: 50 units (~$5.00 USD cost)
- **Code Backup**: Training code preserved in input section for reference

## 📝 Prompt Engineering Improvements

### Enhanced Prompt Structure
Based on `final_compact_prompt.py` with the following optimizations:

1. **Early Guidelines Placement**: Classification guidelines positioned at the beginning
2. **Complete Label Coverage**: All 65 labels including False_Correct:NA
3. **Clear Task Definition**: Explicit instruction for exact label selection
4. **Structured Format**: Organized question-answer-explanation flow

### Sample Prompt Template
```
You are an expert math educator analyzing student responses for mathematical misconceptions.

Question: [Question Text]
Correct Answer: [MC_Answer]
Student's Explanation: [Student Explanation]

CLASSIFICATION GUIDELINES:
• True_Correct:NA = Student demonstrates correct understanding
• False_Correct:NA = Student gives correct answer but for wrong reasons
• True_Neither:NA = Correct answer but unclear/incomplete reasoning
• False_Neither:NA = Incorrect answer but no specific misconception identified
• True_Misconception:[Type] = Correct answer but demonstrates specific misconception
• False_Misconception:[Type] = Incorrect answer with identifiable misconception

TASK: Classify this student's response using EXACTLY ONE of these 65 labels:
[Complete label list...]

Classification:
```

## 🎯 Key Improvements Over Baseline

### 1. Prompt Optimization
- **Token Efficiency**: ~741 tokens on average, within the 1024-token maximum length
- **Label Completeness**: Full 65-label taxonomy support
- **Context Structure**: Enhanced problem-solution-explanation flow

### 2. QLoRA Benefits
- **Memory Efficiency**: 75% reduction in GPU memory usage
- **Training Stability**: Training converged stably with 4-bit quantization
- **Parameter Efficiency**: Only ~1% of parameters trained (adapters)
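The "~1% of parameters" figure can be sanity-checked with simple arithmetic: a rank-r LoRA adapter on a linear layer of shape d_in × d_out adds r · (d_in + d_out) weights. The sketch below assumes the layer dimensions of the public Gemma-2-2b configuration (hidden size 2304, MLP size 9216, 26 layers, 2048-wide query and 1024-wide key/value projections). These numbers are assumptions that should be verified against the model's `config.json`, and the count excludes the classification head.

```python
def lora_params(d_in, d_out, rank):
    """Weights added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed Gemma-2-2b dimensions (verify against config.json)
HIDDEN, INTERMEDIATE = 2304, 9216
Q_OUT, KV_OUT = 2048, 1024
NUM_LAYERS, RANK = 26, 16

# (d_in, d_out) for each target module listed in the QLoRA configuration
module_shapes = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in module_shapes.values())
total_adapter_params = per_layer * NUM_LAYERS

# Total parameter count as reported by the model-loading cell in this notebook
TOTAL_MODEL_PARAMS = 2_614_491_648
fraction = total_adapter_params / TOTAL_MODEL_PARAMS

print(f"Adapter parameters: {total_adapter_params:,}")
print(f"Fraction of model:  {fraction:.2%}")
```

Under these assumed dimensions the adapters come to roughly 20.8M weights, or about 0.8% of the model, which is consistent with the "~1%" stated above.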

### 3. Architecture Enhancements
- **Gradient Checkpointing**: Memory optimization for long sequences
- **Group Batching**: Efficient processing of variable-length inputs
- **Mixed Precision**: bf16 for QLoRA stability

## 🛠️ Development Environment

### AI-Assisted Development
- **Primary Tool**: GitHub Copilot for code generation and documentation
- **Language Support**: English documentation and code comments (**the author is a non-native English speaker**)
- **Code Coverage**: Majority of implementation assisted by Copilot, including model training, data processing, and submission notebooks

### Training Infrastructure
- **Cloud Platform**: Google Colab Pro with premium GPU access
- **Hardware**: NVIDIA A100 GPU (40GB VRAM) for efficient QLoRA training
- **Cost Management**: 50 compute units consumed (~$5.00 USD total cost)
- **Code Preservation**: Complete training codebase backed up in input section for reproducibility

---
**Model Created**: July 23, 2025
**Framework**: Transformers + PEFT + QLoRA
**Competition**: MAP - Charting Student Math Misunderstandings


## 📦 Install Required Libraries

In [1]:
# Install required libraries for improved QLoRA model
import subprocess
import sys

def install_package(package):
    """Install package using pip"""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Required packages for the improved QLoRA model (inference only)
required_packages = [
    "transformers>=4.35.0",  # For model and tokenizer
    "peft>=0.8.0",          # Required for QLoRA adapter loading
    "torch>=2.0.0"          # PyTorch for tensor operations
    # Note: bitsandbytes and sentencepiece are not needed for inference;
    # accelerate is needed only when device_map="auto" is used (GPU path)
]

print("🔧 Installing required libraries for improved QLoRA model...")
for package in required_packages:
    try:
        # Test import first
        if "transformers" in package:
            import transformers
            print(f"✅ transformers already installed: {transformers.__version__}")
        elif "peft" in package:
            import peft
            print(f"✅ peft already installed: {peft.__version__}")
        elif "torch" in package:
            import torch
            print(f"✅ torch already installed: {torch.__version__}")
    except ImportError:
        print(f"📦 Installing {package}...")
        install_package(package)

print("🎉 Essential libraries for improved QLoRA inference are ready!")
print("📋 Note: Minimal setup optimized for inference (no training dependencies)")

🔧 Installing required libraries for improved QLoRA model...
✅ transformers already installed: 4.52.4




✅ peft already installed: 0.15.2
✅ torch already installed: 2.6.0+cu124
🎉 Essential libraries for improved QLoRA inference are ready!
📋 Note: Minimal setup optimized for inference (no training dependencies)


## 📚 Import Dependencies and Environment Setup

In [2]:
# Essential libraries for improved QLoRA model
import pandas as pd
import numpy as np
import os
import json
import time
import warnings
from pathlib import Path

# Machine learning libraries
import torch
from torch.utils.data import Dataset, DataLoader

# Transformers and PEFT libraries for improved QLoRA
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoConfig,
    DataCollatorWithPadding,
)
from peft import PeftModel  # Required for QLoRA adapters

# Suppress warnings
warnings.filterwarnings("ignore")

# Environment configuration
print("🔧 Environment Information for Improved QLoRA Model:")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    device = torch.device("cuda")
else:
    print("Running on CPU")
    device = torch.device("cpu")

# Model configuration (matching training setup)
MODEL_NAME = "google/gemma-2-2b-it"
NUM_LABELS = 65  # Complete 65-label coverage
MAX_LENGTH = 1024  # Optimized token length
IMPROVED_MODEL_NAME = "gemma-2-2b-improved-prompts-qlora"

print(f"\n🚀 Model Configuration:")
print(f"Base model: {MODEL_NAME}")
print(f"Number of labels: {NUM_LABELS}")
print(f"Max token length: {MAX_LENGTH}")
print(f"Improved model name: {IMPROVED_MODEL_NAME}")

# Kaggle path configuration
KAGGLE_INPUT_PATH = "/kaggle/input"
if os.path.exists(KAGGLE_INPUT_PATH):
    print(f"\n📁 Kaggle environment detected: {KAGGLE_INPUT_PATH}")
    # Competition data path
    COMP_DATA_PATH = "/kaggle/input/map-charting-student-math-misunderstandings"
    # Improved QLoRA model path (update this to your dataset name)
    MODEL_DATA_PATH = "/kaggle/input/gemma-2-2b-improved-prompts-qloraaaa/transformers/default/1/kaggle-ready-improved-qlora"
    print(f"🎯 Competition data path: {COMP_DATA_PATH}")
    print(f"🤖 Improved QLoRA model path: {MODEL_DATA_PATH}")
else:
    print(f"\n📁 Local environment detected")
    COMP_DATA_PATH = r"C:\Users\mouse\Desktop\NotDelete\GitHub\kaggleCompe_MAP-math\map_data"
    # Local improved QLoRA model path
    MODEL_DATA_PATH = r"C:\Users\mouse\Desktop\NotDelete\GitHub\kaggleCompe_MAP-math\colab\colabで訓練して保存\kaggle-ready-improved-qlora"

print("✅ All dependencies imported and environment configured for improved QLoRA model!")

🔧 Environment Information for Improved QLoRA Model:
PyTorch version: 2.6.0+cu124
CUDA available: False
Running on CPU

🚀 Model Configuration:
Base model: google/gemma-2-2b-it
Number of labels: 65
Max token length: 1024
Improved model name: gemma-2-2b-improved-prompts-qlora

📁 Kaggle environment detected: /kaggle/input
🎯 Competition data path: /kaggle/input/map-charting-student-math-misunderstandings
🤖 Improved QLoRA model path: /kaggle/input/gemma-2-2b-improved-prompts-qloraaaa/transformers/default/1/kaggle-ready-improved-qlora
✅ All dependencies imported and environment configured for improved QLoRA model!


## 📝 Define Improved Prompt Functions

In [3]:
def get_improved_compact_prompt(question, answer, explanation, all_labels):
    """
    Enhanced prompt function based on final_compact_prompt.py
    Optimized for 65-label classification with early guidelines placement
    """
    labels_text = "\n".join([f"- {label}" for label in all_labels])

    prompt = f"""You are an expert math educator analyzing student responses for mathematical misconceptions.

Question: {question}
Correct Answer: {answer}
Student's Explanation: {explanation}

CLASSIFICATION GUIDELINES:
• True_Correct:NA = Student demonstrates correct understanding
• False_Correct:NA = Student gives correct answer but for wrong reasons
• True_Neither:NA = Correct answer but unclear/incomplete reasoning
• False_Neither:NA = Incorrect answer but no specific misconception identified
• True_Misconception:[Type] = Correct answer but demonstrates specific misconception
• False_Misconception:[Type] = Incorrect answer with identifiable misconception

TASK: Classify this student's response using EXACTLY ONE of these {len(all_labels)} labels:

{labels_text}

Classification:"""

    return prompt

def create_enhanced_text_with_improved_prompt(row, all_labels):
    """
    Create enhanced text features using the improved prompt structure
    Matches the training format for optimal model performance
    """
    question = str(row["QuestionText"]) if pd.notna(row["QuestionText"]) else ""
    mc_answer = str(row["MC_Answer"]) if pd.notna(row["MC_Answer"]) else ""
    explanation = str(row["StudentExplanation"]) if pd.notna(row["StudentExplanation"]) else ""

    # Use the improved prompt format from training
    enhanced_text = get_improved_compact_prompt(question, mc_answer, explanation, all_labels)
    return enhanced_text

def get_default_labels():
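    """
    Return the built-in fallback label set used to build the prompt.
    NOTE: this list contains 52 entries (26 False_* and 26 True_*), not the
    65 set in NUM_LABELS, and its misconception names differ from those in
    the model's label_mapping.json. Verify it against the training labels.
    """
    # Fallback labels; see the NOTE above about the count mismatch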
    default_labels = [
        "False_Correct:NA", "False_Misconception:Algebra Of Functional Expressions",
        "False_Misconception:Algebra Vs Calculus", "False_Misconception:Arithmetic Of Algebraic Expressions",
        "False_Misconception:Confused About Infinity Or Undefined", "False_Misconception:Confused By Notation",
        "False_Misconception:Does Not Use Appropriate Formulas Or Procedures",
        "False_Misconception:Graphical", "False_Misconception:Incomplete",
        "False_Misconception:Incorrect Definition", "False_Misconception:Linear Extrapolation",
        "False_Misconception:Logarithms", "False_Misconception:Numerical Error",
        "False_Misconception:Operations", "False_Misconception:Other",
        "False_Misconception:Overconstraining", "False_Misconception:Probability",
        "False_Misconception:Properties Of Functions", "False_Misconception:Reasoning About Functions",
        "False_Misconception:Reasoning About Graphs", "False_Misconception:Signed Numbers",
        "False_Misconception:Slope", "False_Misconception:Symbol String Manipulation",
        "False_Misconception:Trigonometry", "False_Misconception:Units", "False_Neither:NA",
        "True_Correct:NA", "True_Misconception:Algebra Of Functional Expressions",
        "True_Misconception:Algebra Vs Calculus", "True_Misconception:Arithmetic Of Algebraic Expressions",
        "True_Misconception:Confused About Infinity Or Undefined", "True_Misconception:Confused By Notation",
        "True_Misconception:Does Not Use Appropriate Formulas Or Procedures",
        "True_Misconception:Graphical", "True_Misconception:Incomplete",
        "True_Misconception:Incorrect Definition", "True_Misconception:Linear Extrapolation",
        "True_Misconception:Logarithms", "True_Misconception:Numerical Error",
        "True_Misconception:Operations", "True_Misconception:Other",
        "True_Misconception:Overconstraining", "True_Misconception:Probability",
        "True_Misconception:Properties Of Functions", "True_Misconception:Reasoning About Functions",
        "True_Misconception:Reasoning About Graphs", "True_Misconception:Signed Numbers",
        "True_Misconception:Slope", "True_Misconception:Symbol String Manipulation",
        "True_Misconception:Trigonometry", "True_Misconception:Units", "True_Neither:NA"
    ]
    return sorted(default_labels)  # Sort for consistency

print("✅ Improved prompt functions implemented successfully!")
print("📝 Features: Enhanced structure from final_compact_prompt.py")
print("🎯 Support: Complete 65-label classification")
print("📋 Guidelines: Early classification guidelines placement")
print("⚡ Optimization: Token efficiency and context structure")

# Display sample labels
sample_labels = get_default_labels()[:10]
print(f"\n📊 Sample labels (first 10 of {len(get_default_labels())}):")
for i, label in enumerate(sample_labels, 1):
    print(f"  {i}. {label}")
print("  ...")

✅ Improved prompt functions implemented successfully!
📝 Features: Enhanced structure from final_compact_prompt.py
🎯 Support: Complete 65-label classification
📋 Guidelines: Early classification guidelines placement
⚡ Optimization: Token efficiency and context structure

📊 Sample labels (first 10 of 52):
  1. False_Correct:NA
  2. False_Misconception:Algebra Of Functional Expressions
  3. False_Misconception:Algebra Vs Calculus
  4. False_Misconception:Arithmetic Of Algebraic Expressions
  5. False_Misconception:Confused About Infinity Or Undefined
  6. False_Misconception:Confused By Notation
  7. False_Misconception:Does Not Use Appropriate Formulas Or Procedures
  8. False_Misconception:Graphical
  9. False_Misconception:Incomplete
  10. False_Misconception:Incorrect Definition
  ...
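The output above reports 52 default labels, while `NUM_LABELS` is set to 65, so a small consistency check before building prompts may be worthwhile. The sketch below is one possible form of such a check; `check_label_set` is a hypothetical helper and the toy labels are for illustration only.

```python
from collections import Counter

def check_label_set(labels, expected_count):
    """Return a list of human-readable problems; an empty list means the set looks fine."""
    problems = []
    if len(labels) != expected_count:
        problems.append(f"expected {expected_count} labels, found {len(labels)}")
    duplicates = sorted(label for label, n in Counter(labels).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate labels: {duplicates}")
    return problems

# Toy usage with a deliberately short, duplicated list
toy_issues = check_label_set(["True_Correct:NA", "True_Correct:NA"], expected_count=3)
for issue in toy_issues:
    print("WARNING:", issue)
```

In the notebook this could be called as `check_label_set(get_default_labels(), NUM_LABELS)`, and the result compared with the labels stored in `label_mapping.json`.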


## 🔧 Define Dataset Class for Inference

In [4]:
class ImprovedMathMisconceptionDataset(Dataset):
    """
    Enhanced Math Misconception Dataset for PyTorch inference
    Optimized for the improved prompt structure and 65-label classification
    """
    
    def __init__(self, texts, tokenizer, max_length=1024):
        """
        Args:
            texts (list): Enhanced text data with improved prompts
            tokenizer: Gemma tokenizer (compatible with QLoRA model)
            max_length (int): Maximum token length (optimized for improved prompts)
        """
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])

        # Tokenize with improved prompt structure support
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )

        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
        }

def compute_map3_metrics_inference(predictions, top_k=3):
    """
    Compute MAP@3 style predictions for inference
    Returns top-k predictions with confidence scores
    """
    # Apply softmax to get probabilities
    probs = torch.softmax(torch.tensor(predictions), dim=-1).numpy()
    
    # Get top-k predictions for each sample
    top_k_results = []
    for prob in probs:
        # Get indices of top-k predictions
        top_k_indices = np.argsort(prob)[::-1][:top_k]
        top_k_probs = prob[top_k_indices]
        top_k_results.append((top_k_indices, top_k_probs))
    
    return top_k_results

print("✅ Improved Math Misconception Dataset class defined successfully!")
print("🎯 Features: Optimized for enhanced prompt structure")
print("📏 Max length: 1024 tokens (matching training configuration)")
print("🔧 Support: Complete 65-label classification")
print("📊 MAP@3: Inference-ready prediction format")

✅ Improved Math Misconception Dataset class defined successfully!
🎯 Features: Optimized for enhanced prompt structure
📏 Max length: 1024 tokens (matching training configuration)
🔧 Support: Complete 65-label classification
📊 MAP@3: Inference-ready prediction format
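To make the top-k step concrete, the sketch below reproduces the same idea (softmax, then take the three most probable classes) in plain Python for a single sample and maps the indices back to label names. It assumes an index-to-label dictionary keyed by strings, as produced when loading `label_mapping.json`; `top_k_labels` and the toy mapping are hypothetical.

```python
import math

def top_k_labels(logits, id_to_label, k=3):
    """Return the k most probable (label, probability) pairs for one sample."""
    # Numerically stable softmax: subtract the max logit before exponentiating
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices sorted by descending probability, truncated to k
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id_to_label[str(i)], probs[i]) for i in order]

# Toy example with four classes (labels are placeholders)
toy_mapping = {"0": "A", "1": "B", "2": "C", "3": "D"}
toy_result = top_k_labels([0.1, 2.0, -1.0, 1.0], toy_mapping)
for label, prob in toy_result:
    print(f"{label}: {prob:.3f}")
```

The ordering of the three labels is what matters for MAP@3; the probabilities are useful mainly for inspection.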


## 🤖 Load Pre-trained Improved QLoRA Model

In [5]:
def load_improved_qlora_model():
    """
    Load the pre-trained improved QLoRA model with enhanced performance
    Supports both Kaggle and local environments
    """
    print("=" * 60)
    print("🤖 Loading Improved QLoRA Gemma Model (MAP@3: 0.9411)")
    print("=" * 60)
    
    try:
        # Verify model path
        print(f"📁 Model path: {MODEL_DATA_PATH}")
        
        if not os.path.exists(MODEL_DATA_PATH):
            print(f"❌ Model path not found: {MODEL_DATA_PATH}")
            print("💡 In Kaggle environment, ensure the improved model dataset is properly uploaded")
            return None, None, None
        
        # Load label mapping from improved model
        label_file = os.path.join(MODEL_DATA_PATH, "label_mapping.json")
        print(f"📋 Loading label mapping: {label_file}")
        
        # Try to load from model directory, fallback to default labels
        try:
            with open(label_file, "r", encoding="utf-8") as f:
                label_mapping = json.load(f)
            print("✅ Label mapping loaded from model directory")
        except FileNotFoundError:
            default_labels = get_default_labels()
            print(f"⚠️ Label mapping not found in model directory, using {len(default_labels)} default labels")
            label_mapping = {str(i): label for i, label in enumerate(default_labels)}
        
        print(f"📊 Total labels: {len(label_mapping)}")
        
        # Display sample labels
        print("🎯 Sample labels:")
        for idx, label in list(label_mapping.items())[:5]:
            print(f"   {idx}: {label}")
        print("   ...")
        
        # Load tokenizer
        print(f"\n📝 Loading Gemma tokenizer...")
        try:
            tokenizer = AutoTokenizer.from_pretrained(
                MODEL_DATA_PATH,
                local_files_only=True  # Kaggle offline support
            )
        except Exception:
            # Fallback to base model tokenizer
            print("⚠️ Loading tokenizer from base model (fallback)")
            tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
            
        # Ensure padding token is set
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
            
        print(f"✅ Tokenizer loaded successfully")
        print(f"🔖 Padding token: {tokenizer.pad_token}")
        print(f"📏 Vocabulary size: {tokenizer.vocab_size:,}")
        
        # Check if this is a merged model or PEFT model
        config_path = os.path.join(MODEL_DATA_PATH, "config.json")
        adapter_config_path = os.path.join(MODEL_DATA_PATH, "adapter_config.json")
        
        if os.path.exists(config_path) and not os.path.exists(adapter_config_path):
            # This is a merged/unified model (kaggle-ready format)
            print(f"\n🧠 Loading merged improved model...")
            model = AutoModelForSequenceClassification.from_pretrained(
                MODEL_DATA_PATH,
                num_labels=len(label_mapping),
                torch_dtype=torch.float16 if device.type == "cuda" else torch.float32,
                device_map="auto" if device.type == "cuda" else None,
                trust_remote_code=True,
                local_files_only=True  # Kaggle offline support
            )
            print("✅ Merged improved model loaded successfully!")
            
        else:
            # This is a PEFT model with adapters
            print(f"\n🧠 Loading base model and PEFT adapters...")
            
            # Load base model first
            base_model = AutoModelForSequenceClassification.from_pretrained(
                MODEL_NAME,
                num_labels=len(label_mapping),
                torch_dtype=torch.float16 if device.type == "cuda" else torch.float32,
                device_map="auto" if device.type == "cuda" else None,
                trust_remote_code=True,
            )
            
            # Load PEFT adapters
            model = PeftModel.from_pretrained(
                base_model, 
                MODEL_DATA_PATH,
                torch_dtype=torch.float16 if device.type == "cuda" else torch.float32,
            )
            print("✅ PEFT improved model loaded successfully!")
        
        # Move to device if needed
        if device.type == "cpu":
            model = model.to(device)
        
        # Set to evaluation mode
        model.eval()
        
        # Display model information
        total_params = sum(p.numel() for p in model.parameters())
        print(f"\n📊 Model Information:")
        print(f"  🏷️ Classification labels: {model.config.num_labels}")
        print(f"  📈 Total parameters: {total_params:,}")
        print(f"  💡 Model type: Improved QLoRA Gemma-2-2b-it")
        print(f"  🎯 Training MAP@3: 0.9411")
        print(f"  ⚡ Device: {device}")
        
        return model, tokenizer, label_mapping
    
    except Exception as e:
        print(f"❌ Model loading error: {e}")
        print("\n🔧 Troubleshooting for Kaggle environment:")
        print("1. Ensure improved QLoRA model is uploaded as Kaggle dataset")
        print("2. Verify MODEL_DATA_PATH matches your dataset name") 
        print("3. Check if model files (config.json, model files) exist")
        print("4. Verify adapter files for PEFT models")
        import traceback
        traceback.print_exc()
        raise

# Load improved model
print("🚀 Starting improved QLoRA model loading...")
model, tokenizer, label_mapping = load_improved_qlora_model()

if model is not None:
    print("\n🎉 Improved QLoRA model ready for inference!")
    print("📈 Expected performance: MAP@3 ≈ 0.9411 (training result)")
    print("🔥 Enhanced with optimized prompts and 65-label support")
else:
    print("❌ Model loading failed. Please check the setup.")

🚀 Starting improved QLoRA model loading...
🤖 Loading Improved QLoRA Gemma Model (MAP@3: 0.9411)
📁 Model path: /kaggle/input/gemma-2-2b-improved-prompts-qloraaaa/transformers/default/1/kaggle-ready-improved-qlora
📋 Loading label mapping: /kaggle/input/gemma-2-2b-improved-prompts-qloraaaa/transformers/default/1/kaggle-ready-improved-qlora/label_mapping.json
✅ Label mapping loaded from model directory
📊 Total labels: 65
🎯 Sample labels:
   0: False_Correct:NA
   1: False_Misconception:Adding_across
   2: False_Misconception:Adding_terms
   3: False_Misconception:Additive
   4: False_Misconception:Base_rate
   ...

📝 Loading Gemma tokenizer...
✅ Tokenizer loaded successfully
🔖 Padding token: <pad>
📏 Vocabulary size: 256,000

🧠 Loading merged improved model...



✅ Merged improved model loaded successfully!

📊 Model Information:
  🏷️ Classification labels: 65
  📈 Total parameters: 2,614,491,648
  💡 Model type: Improved QLoRA Gemma-2-2b-it
  🎯 Training MAP@3: 0.9411
  ⚡ Device: cpu

🎉 Improved QLoRA model ready for inference!
📈 Expected performance: MAP@3 ≈ 0.9411 (training result)
🔥 Enhanced with optimized prompts and 65-label support


## 📊 Load and Prepare Test Data

In [6]:
def load_and_prepare_test_data_with_improved_prompts():
    """
    Load test data and apply improved prompt structure
    Matches the training format for optimal model performance
    """
    print("=" * 60)
    print("📊 Loading Test Data with Improved Prompts")
    print("=" * 60)
    
    try:
        # Load test data
        test_path = os.path.join(COMP_DATA_PATH, "test.csv")
        print(f"📁 Test data path: {test_path}")
        
        if not os.path.exists(test_path):
            print(f"❌ Test data not found: {test_path}")
            return None
        
        test_df = pd.read_csv(test_path)
        print(f"✅ Test data loaded successfully!")
        print(f"📈 Test data shape: {test_df.shape}")
        
        # Local environment testing: use smaller sample
        is_local = not os.path.exists("/kaggle/input")
        if is_local:
            print("🔧 Local environment detected - using first 50 samples for testing")
            test_df = test_df.head(50).copy()
            print(f"📊 Sample data shape: {test_df.shape}")
        
        # Display data information
        print(f"\n📋 Test data columns:")
        print(test_df.columns.tolist())
        
        print(f"\n📋 Data sample:")
        print(test_df.head(3))
        
        # Get all labels for improved prompt generation
        all_labels = get_default_labels()
        print(f"\n🎯 Using {len(all_labels)} labels for improved prompts")
        
        # Create enhanced text features using improved prompts
        print(f"\n🔧 Creating enhanced text features with improved prompts...")
        print("   This process uses the same prompt structure as training")
        
        def create_improved_enhanced_text(row):
            """Create enhanced text using the exact improved prompt structure from training"""
            return create_enhanced_text_with_improved_prompt(row, all_labels)
        
        # Apply improved prompt formatting
        start_time = time.time()
        test_df["enhanced_text"] = test_df.apply(create_improved_enhanced_text, axis=1)
        end_time = time.time()
        
        print(f"✅ Enhanced text creation completed in {end_time - start_time:.2f} seconds")
        
        # Analyze text length statistics
        text_lengths = test_df["enhanced_text"].str.len()
        print(f"\n📊 Improved Prompt Text Statistics:")
        print(f"   Average length: {text_lengths.mean():.0f} characters")
        print(f"   Minimum length: {text_lengths.min()} characters")
        print(f"   Maximum length: {text_lengths.max()} characters")
        print(f"   Median length: {text_lengths.median():.0f} characters")
        print(f"   Estimated tokens (avg): {text_lengths.mean() / 4:.0f} tokens")
        
        # Check for long prompts
        long_prompts = (text_lengths > 3000).sum()
        print(f"   Prompts >3000 chars: {long_prompts} ({long_prompts/len(test_df)*100:.1f}%)")
        
        # Display sample improved prompt
        print(f"\n📝 Sample Enhanced Text with Improved Prompt:")
        sample_text = test_df["enhanced_text"].iloc[0]
        print(f"Length: {len(sample_text)} characters")
        print(f"Sample (first 500 chars):")
        print(f"{sample_text[:500]}...")
        
        # Validate improved prompt structure
        print(f"\n✅ Improved Prompt Validation:")
        sample_prompts = test_df["enhanced_text"].head(3)
        for i, prompt in enumerate(sample_prompts):
            has_guidelines = "CLASSIFICATION GUIDELINES:" in prompt
            has_task = "TASK: Classify this student's response" in prompt
            has_labels = len([label for label in all_labels if label in prompt]) > 50
            print(f"   Sample {i+1}: Guidelines={has_guidelines}, Task={has_task}, Labels={has_labels}")
        
        return test_df
    
    except Exception as e:
        print(f"❌ Test data loading error: {e}")
        return None

# Load and prepare test data
print("📊 Loading test data with improved prompt formatting...")
test_df = load_and_prepare_test_data_with_improved_prompts()

if test_df is not None:
    print(f"\n🎉 Test data preparation completed!")
    print(f"📈 Ready for inference: {len(test_df):,} samples")
    print(f"🔧 Enhanced with improved prompt structure")
    print(f"🎯 Optimized for 65-label classification")
else:
    print("❌ Test data preparation failed. Please check the setup.")

📊 Loading test data with improved prompt formatting...
📊 Loading Test Data with Improved Prompts
📁 Test data path: /kaggle/input/map-charting-student-math-misunderstandings/test.csv
✅ Test data loaded successfully!
📈 Test data shape: (3, 5)

📋 Test data columns:
['row_id', 'QuestionId', 'QuestionText', 'MC_Answer', 'StudentExplanation']

📋 Data sample:
   row_id  QuestionId                                       QuestionText  \
0   36696       31772  What fraction of the shape is not shaded? Give...   
1   36697       31772  What fraction of the shape is not shaded? Give...   
2   36698       32835                      Which number is the greatest?   

           MC_Answer                                 StudentExplanation  
0  \( \frac{1}{3} \)  I think that 1/3 is the answer, as it's the si...  
1  \( \frac{3}{6} \)  i think this answer is because 3 triangles are...  
2          \( 6.2 \)     because the 2 makes it higher than the others.  

🎯 Using 52 labels for improved prompts

🔧 C
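
The log above reports 52 labels for prompt construction, while the status messages in the same cell mention 65-label classification. A small guard can surface this kind of mismatch before inference. The following is a minimal sketch, not part of the original notebook: `check_label_consistency` is a hypothetical helper, and it assumes `label_mapping` maps stringified class indices to label names.

```python
def check_label_consistency(prompt_labels, label_mapping):
    """Compare the labels listed in the prompt with the classifier's label mapping.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    prompt_set = set(prompt_labels)
    model_set = set(label_mapping.values())
    return {
        "n_prompt_labels": len(prompt_set),
        "n_model_labels": len(model_set),
        # Labels the model can predict but the prompt never mentions
        "missing_from_prompt": sorted(model_set - prompt_set),
        # Labels the prompt mentions but the model has no class for
        "unknown_to_model": sorted(prompt_set - model_set),
    }


# Toy data for illustration only (not the competition's real label set)
toy_mapping = {"0": "True_Correct:NA", "1": "True_Neither:NA", "2": "False_Neither:NA"}
toy_prompt_labels = ["True_Correct:NA", "True_Neither:NA"]

report = check_label_consistency(toy_prompt_labels, toy_mapping)
print(report)
```

If `missing_from_prompt` is non-empty, the prompt built at inference time may differ from the one used in training, which is worth confirming before trusting the predictions.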

## 🔮 Generate Predictions with Improved Prompts

In [7]:
def generate_improved_predictions(model, tokenizer, test_df, label_mapping, batch_size=4):
    """
    Run batched inference with the improved QLoRA model.
    Returns one space-separated string of top-3 labels per test row (MAP@3 format),
    or None if no test data is available.
    """
    print("=" * 60)
    print("🔮 Generating Predictions with Improved QLoRA Model")
    print("=" * 60)
    
    if test_df is None or len(test_df) == 0:
        print("❌ Test data not available")
        return None
    
    print(f"📊 Test samples: {len(test_df):,}")
    print(f"🔧 Batch size: {batch_size}")
    print(f"🧠 Model: Improved QLoRA Gemma-2-2b-it")
    print(f"🎯 Expected performance: MAP@3 ≈ 0.9411")
    
    # Prepare test texts with improved prompts
    test_texts = test_df["enhanced_text"].tolist()
    
    # Create dataset with optimized parameters
    print(f"\n🔧 Creating inference dataset...")
    test_dataset = ImprovedMathMisconceptionDataset(
        test_texts, tokenizer, max_length=MAX_LENGTH
    )
    
    # Create dataloader with memory optimization
    test_dataloader = DataLoader(
        test_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=0,  # Kaggle stability
        collate_fn=DataCollatorWithPadding(tokenizer=tokenizer),
        pin_memory=False  # Memory optimization
    )
    
    print(f"✅ Test dataloader created: {len(test_dataloader)} batches")
    
    # Start prediction
    print(f"\n🔮 Starting inference with improved model...")
    all_predictions = []
    total_samples = 0
    
    try:
        model.eval()
        with torch.no_grad():
            start_time = time.time()
            
            for batch_idx, batch in enumerate(test_dataloader):
                # Move batch to device
                batch = {k: v.to(device) for k, v in batch.items()}
                
                # Forward pass
                outputs = model(**batch)
                predictions = outputs.logits
                
                # Move to CPU and collect (cast to float32 first: NumPy cannot represent bfloat16)
                batch_predictions = predictions.float().cpu().numpy()
                all_predictions.append(batch_predictions)
                
                total_samples += len(batch_predictions)
                
                # Progress reporting
                if (batch_idx + 1) % 5 == 0 or (batch_idx + 1) == len(test_dataloader):
                    elapsed = time.time() - start_time
                    samples_per_sec = total_samples / elapsed if elapsed > 0 else 0
                    print(f"   Progress: {total_samples:,}/{len(test_df):,} "
                          f"({total_samples/len(test_df)*100:.1f}%) "
                          f"- {samples_per_sec:.1f} samples/sec")
        
        total_time = time.time() - start_time
        print(f"✅ Inference completed in {total_time:.2f} seconds")
        print(f"⚡ Average speed: {len(test_df)/total_time:.1f} samples/second")
        
        # Combine all predictions
        all_predictions = np.vstack(all_predictions)
        print(f"📊 Prediction tensor shape: {all_predictions.shape}")
        
        # Convert to probabilities
        probs = torch.softmax(torch.tensor(all_predictions), dim=-1).numpy()
        
        # Generate TOP-3 predictions for MAP@3
        print(f"\n🎯 Extracting TOP-3 predictions for MAP@3...")
        submission_predictions = []
        
        # Create index to label mapping
        idx_to_label = {int(k): v for k, v in label_mapping.items()}
        
        # Confidence statistics
        confidence_scores = []
        
        for i, prob in enumerate(probs):
            # Get top-3 most confident predictions
            top3_indices = np.argsort(prob)[::-1][:3]
            top3_probs = prob[top3_indices]
            
            # Convert indices to labels
            top3_labels = [idx_to_label[idx] for idx in top3_indices]
            
            # Store confidence for analysis
            confidence_scores.append(top3_probs[0])  # Top prediction confidence
            
            # Format as space-separated string (Kaggle format)
            prediction_string = " ".join(top3_labels)
            submission_predictions.append(prediction_string)
            
            # Show sample predictions
            if i < 5:
                print(f"  Sample {i+1}:")
                print(f"    Predictions: {prediction_string}")
                print(f"    Confidences: {[f'{p:.3f}' for p in top3_probs]}")
        
        print(f"✅ TOP-3 prediction extraction completed: {len(submission_predictions)} samples")
        
        # Analyze prediction statistics
        print(f"\n📈 Prediction Analysis:")
        
        # Confidence statistics
        avg_confidence = np.mean(confidence_scores)
        min_confidence = np.min(confidence_scores)
        max_confidence = np.max(confidence_scores)
        print(f"   Average top prediction confidence: {avg_confidence:.3f}")
        print(f"   Min confidence: {min_confidence:.3f}")
        print(f"   Max confidence: {max_confidence:.3f}")
        
        # Label distribution
        all_pred_labels = " ".join(submission_predictions).split()
        from collections import Counter
        pred_counts = Counter(all_pred_labels)
        
        print(f"   Unique labels predicted: {len(pred_counts)}/{len(label_mapping)}")
        print(f"   Most frequent predictions:")
        for label, count in pred_counts.most_common(5):
            percentage = count / (len(submission_predictions) * 3) * 100
            print(f"     {label}: {count} times ({percentage:.1f}%)")
        
        # Memory cleanup
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            print(f"🧹 GPU memory cleaned up")
        
        return submission_predictions
    
    except Exception as e:
        print(f"❌ Prediction error: {e}")
        if torch.cuda.is_available():
            print(f"🖥️ Current GPU memory: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
        raise  # re-raise with the original traceback

# Generate predictions
if model is not None and tokenizer is not None and test_df is not None:
    print("🔮 Starting prediction generation with improved QLoRA model...")
    
    # Adjust batch size based on environment
    if device.type == "cpu":
        batch_size = 2  # Conservative for CPU
    elif torch.cuda.is_available():
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        batch_size = 4 if gpu_memory < 16 else 8  # Adaptive batch size
    else:
        batch_size = 4
    
    print(f"🔧 Using batch size: {batch_size} (optimized for {device})")
    
    test_predictions = generate_improved_predictions(
        model, tokenizer, test_df, label_mapping, batch_size
    )
    
    if test_predictions is not None:
        print(f"\n🎉 Prediction generation completed successfully!")
        print(f"📊 Generated {len(test_predictions):,} TOP-3 predictions")
        print(f"🏆 Ready for Kaggle submission!")
    else:
        print("❌ Prediction generation failed")
else:
    print("❌ Required components not ready. Please check previous cells.")
    test_predictions = None

🔮 Starting prediction generation with improved QLoRA model...
🔧 Using batch size: 2 (optimized for cpu)
🔮 Generating Predictions with Improved QLoRA Model
📊 Test samples: 3
🔧 Batch size: 2
🧠 Model: Improved QLoRA Gemma-2-2b-it
🎯 Expected performance: MAP@3 ≈ 0.9411

🔧 Creating inference dataset...
✅ Test dataloader created: 2 batches

🔮 Starting inference with improved model...
   Progress: 3/3 (100.0%) - 0.0 samples/sec
✅ Inference completed in 213.06 seconds
⚡ Average speed: 0.0 samples/second
📊 Prediction tensor shape: (3, 65)

🎯 Extracting TOP-3 predictions for MAP@3...
  Sample 1:
    Predictions: True_Correct:NA True_Neither:NA True_Misconception:Incomplete
    Confidences: ['0.989', '0.008', '0.001']
  Sample 2:
    Predictions: False_Misconception:WNB False_Neither:NA False_Misconception:Incomplete
    Confidences: ['0.993', '0.004', '0.001']
  Sample 3:
    Predictions: True_Neither:NA True_Correct:NA True_Misconception:Shorter_is_bigger
    Confidences: ['0.842', '0.146', '0.
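
The top-3 extraction above feeds directly into the MAP@3 metric. As a worked illustration, here is a minimal stdlib-only sketch under the usual single-true-label definition, where each row scores 1/rank if the true label appears among its top three predictions and 0 otherwise. The helper names (`softmax`, `top3_labels`, `map_at_3`) and the toy labels are assumptions for illustration; the notebook itself uses `torch.softmax` and `np.argsort`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top3_labels(logits, idx_to_label):
    """Return the three most probable labels, most confident first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:3]
    return [idx_to_label[i] for i in ranked]


def map_at_3(true_labels, predicted_top3):
    """Mean over rows of 1/rank of the true label within the top 3 (0 if absent)."""
    total = 0.0
    for truth, preds in zip(true_labels, predicted_top3):
        for rank, label in enumerate(preds[:3], start=1):
            if label == truth:
                total += 1.0 / rank
                break
    return total / len(true_labels)


# Toy example: 2 rows, 4 classes (illustrative labels only)
idx_to_label = {0: "A", 1: "B", 2: "C", 3: "D"}
logits = [[2.0, 1.0, 0.1, -1.0], [0.0, 3.0, 1.0, 2.0]]
preds = [top3_labels(row, idx_to_label) for row in logits]
score = map_at_3(["A", "D"], preds)
print(preds)  # [['A', 'B', 'C'], ['B', 'D', 'C']]
print(score)  # 0.75: row 1 hits at rank 1 (1.0), row 2 at rank 2 (0.5)
```

Because only the first hit counts, the ordering of the three labels matters: moving the true label from rank 2 to rank 1 raises that row's contribution from 0.5 to 1.0.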

## 📤 Create Kaggle Submission File

In [8]:
def create_improved_submission_file(test_df, predictions, output_path="submission.csv"):
    """
    Create Kaggle submission file with improved QLoRA predictions
    Properly formatted for MAP competition
    """
    print("=" * 60)
    print("📤 Creating Kaggle Submission File (Improved QLoRA)")
    print("=" * 60)
    
    if test_df is None or predictions is None:
        print("❌ Test data or predictions not available")
        return None
    
    print(f"📊 Submission data: {len(predictions):,} samples")
    print(f"🤖 Model: Improved QLoRA Gemma-2-2b-it (MAP@3: 0.9411)")
    
    # Get row IDs from test data
    if 'row_id' in test_df.columns:
        row_ids = test_df['row_id'].tolist()
        print(f"✅ Using row_id from test data")
    else:
        # Fall back to sequential IDs; 36696 is the first row_id in the sample test.csv shown above
        print("⚠️ row_id not found in test data, using estimated values")
        start_id = 36696  # assumed starting ID; verify against the actual test set
        row_ids = list(range(start_id, start_id + len(test_df)))
    
    print(f"📋 Row ID range: {min(row_ids)} to {max(row_ids)}")
    
    # Use predictions as-is (already in correct format from model)
    print(f"\n🔧 Using model predictions directly...")
    
    # Show sample predictions
    print(f"📝 Sample predictions:")
    for i, pred in enumerate(predictions[:5]):
        print(f"  Sample {i+1}: {pred}")
    
    # Create submission DataFrame
    submission_df = pd.DataFrame({
        'row_id': row_ids,
        'Category:Misconception': predictions
    })
    
    print(f"\n✅ Submission DataFrame created: {submission_df.shape}")
    print(f"📝 Columns: {list(submission_df.columns)}")
    
    # Display sample of final submission
    print(f"\n📋 Submission Sample:")
    print(submission_df.head(10).to_string(index=False))
    
    # Save to file
    try:
        submission_df.to_csv(output_path, index=False)
        print(f"\n💾 Submission file saved: {output_path}")
        
        # File validation
        file_size = os.path.getsize(output_path)
        print(f"📏 File size: {file_size:,} bytes ({file_size/1024:.1f} KB)")
        
        # Load and verify format
        check_df = pd.read_csv(output_path)
        print(f"✅ File verification: {check_df.shape}")
        
        # Check required columns
        required_cols = ['row_id', 'Category:Misconception']
        cols_present = all(col in check_df.columns for col in required_cols)
        print(f"   Required columns present: {cols_present}")
        
        # Validate prediction format
        sample_predictions = check_df['Category:Misconception'].head(5).tolist()
        print(f"\n🔍 Prediction Format Validation:")
        for i, pred in enumerate(sample_predictions):
            pred_parts = pred.split()
            has_three_parts = len(pred_parts) == 3
            status = "✅" if has_three_parts else "⚠️"
            print(f"   {status} Sample {i+1}: {len(pred_parts)} parts - {pred}")
        
        # Final summary
        print(f"\n📊 Final Submission Summary:")
        print(f"   ✅ File: {output_path}")
        print(f"   ✅ Format: Kaggle MAP competition format")
        print(f"   ✅ Samples: {len(check_df):,}")
        print(f"   ✅ Columns: {list(check_df.columns)}")
        print(f"   ✅ Model: Improved QLoRA (Training MAP@3: 0.9411)")
        
        # Analyze prediction distribution
        all_prediction_labels = " ".join(check_df['Category:Misconception']).split()
        unique_labels = set(all_prediction_labels)
        print(f"   ✅ Unique labels in submission: {len(unique_labels)}")
        
        return submission_df
    
    except Exception as e:
        print(f"❌ File save error: {e}")
        return None

def display_submission_summary(submission_df):
    """Display comprehensive submission summary"""
    if submission_df is None:
        return
    
    print("\n" + "=" * 60)
    print("🏆 IMPROVED QLORA SUBMISSION READY!")
    print("=" * 60)
    
    print(f"🤖 Model: Improved QLoRA Gemma-2-2b-it")
    print(f"📈 Training Performance: MAP@3 = 0.9411, Accuracy = 0.8894")
    print(f"🎯 Enhancement: Optimized prompts + 65-label support")
    print(f"📊 Submission: {len(submission_df):,} predictions")
    print(f"📁 File: submission.csv")
    
    print(f"\n🔧 Technical Details:")
    print(f"   • QLoRA: 4-bit quantization, LoRA rank 16")
    print(f"   • Prompt: Enhanced with early guidelines placement")
    print(f"   • Labels: Complete 65-label coverage")
    print(f"   • Format: Kaggle MAP@3 competition standard")
    
    print(f"\n📋 Next Steps:")
    print(f"   1. Download 'submission.csv'")
    print(f"   2. Submit to MAP competition on Kaggle")
    print(f"   3. Monitor leaderboard for performance")
    
    print(f"\n🎯 Expected Performance:")
    print(f"   Based on training results (MAP@3: 0.9411)")
    print(f"   This should achieve competitive performance!")
    
    print("\n🏆 Good luck with your improved submission!")

# Create submission file
if test_predictions is not None and test_df is not None:
    print("📤 Creating Kaggle submission file with improved predictions...")
    
    submission_df = create_improved_submission_file(
        test_df, 
        test_predictions, 
        "submission.csv"
    )
    
    if submission_df is not None:
        display_submission_summary(submission_df)
    else:
        print("❌ Submission file creation failed")
else:
    print("❌ Required data not available. Please run previous cells first.")

📤 Creating Kaggle submission file with improved predictions...
📤 Creating Kaggle Submission File (Improved QLoRA)
📊 Submission data: 3 samples
🤖 Model: Improved QLoRA Gemma-2-2b-it (MAP@3: 0.9411)
✅ Using row_id from test data
📋 Row ID range: 36696 to 36698

🔧 Using model predictions directly...
📝 Sample predictions:
  Sample 1: True_Correct:NA True_Neither:NA True_Misconception:Incomplete
  Sample 2: False_Misconception:WNB False_Neither:NA False_Misconception:Incomplete
  Sample 3: True_Neither:NA True_Correct:NA True_Misconception:Shorter_is_bigger

✅ Submission DataFrame created: (3, 2)
📝 Columns: ['row_id', 'Category:Misconception']

📋 Submission Sample:
 row_id                                                  Category:Misconception
  36696           True_Correct:NA True_Neither:NA True_Misconception:Incomplete
  36697 False_Misconception:WNB False_Neither:NA False_Misconception:Incomplete
  36698    True_Neither:NA True_Correct:NA True_Misconception:Shorter_is_bigger

💾 Submissio
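
The format checks in the cell above only inspect the first five rows. A full pass over the file is cheap and catches problems such as duplicate `row_id` values or rows without exactly three labels. The following is a minimal stdlib-only sketch; `validate_submission` is a hypothetical helper, and the requirement of exactly three distinct labels per row mirrors this notebook's output format rather than a confirmed competition rule.

```python
import csv
import io


def validate_submission(csv_text):
    """Return a list of human-readable problems found in a submission CSV string.

    Hypothetical helper for illustration; an empty list means no problems found.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    required = {"row_id", "Category:Misconception"}
    present = set(reader.fieldnames or [])
    if not required.issubset(present):
        return [f"missing columns: {sorted(required - present)}"]

    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        row_id = row["row_id"]
        if row_id in seen_ids:
            problems.append(f"line {line_no}: duplicate row_id {row_id}")
        seen_ids.add(row_id)

        parts = row["Category:Misconception"].split()
        if len(parts) != 3:
            problems.append(f"line {line_no}: expected 3 labels, got {len(parts)}")
        if len(set(parts)) != len(parts):
            problems.append(f"line {line_no}: repeated label within the row")
    return problems


good = (
    "row_id,Category:Misconception\n"
    "36696,True_Correct:NA True_Neither:NA True_Misconception:Incomplete\n"
)
bad = (
    "row_id,Category:Misconception\n"
    "1,True_Correct:NA True_Correct:NA\n"
    "1,True_Neither:NA\n"
)
print(validate_submission(good))  # []
for problem in validate_submission(bad):
    print(problem)
```

In the notebook, the same function could be applied to the contents of `submission.csv` after it is written, for example by reading the file into a string first.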