# AURORA V2 - Meta-Learning Neural Oracle Training

This notebook trains the **Neural Oracle** (Layer 2) using curriculum-based meta-learning.

## What This Does
1. Collects up to 40 diverse datasets from OpenML
2. Implements curriculum learning (deterministic → numeric → categorical → text)
3. Generates ground truth labels by measuring actual ML performance
4. Trains XGBoost + LightGBM ensemble
5. Exports model for drop-in deployment

## Time Estimate
- Total: 30-60 minutes
- Dataset collection: ~5 min
- Meta-learning: ~25 min or more (the dominant cost)
- Training: ~5 min

## Output
- `neural_oracle_meta_v3_TIMESTAMP.pkl` - Trained model (~5MB)
- Copy to `models/` directory for deployment

## Cell 1: Install Dependencies

In [None]:
# Install required packages
!pip install -q xgboost lightgbm scikit-learn pandas numpy openml tqdm

# Verify installations
import xgboost as xgb
import lightgbm as lgb
import sklearn
import openml

print(f"✅ XGBoost: {xgb.__version__}")
print(f"✅ LightGBM: {lgb.__version__}")
print(f"✅ scikit-learn: {sklearn.__version__}")
print(f"✅ OpenML: {openml.__version__}")

## Cell 2: Clone AURORA Repository

In [None]:
# Configuration
REPO_URL = "https://github.com/shobith-s/AURORA-V2.git"
BRANCH = "main"  # Change if using a different branch

# Clone repository
!git clone {REPO_URL} 2>/dev/null || (cd AURORA-V2 && git pull)
%cd AURORA-V2

# Checkout branch
!git checkout {BRANCH}

# Verify
!ls -la src/features/

## Cell 3: Import Modules

In [None]:
import os
import sys
import json
import pickle
import warnings
from pathlib import Path
from typing import Dict, List, Tuple, Any, Optional
from datetime import datetime
from dataclasses import dataclass, field
from collections import Counter

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import openml

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Suppress warnings
warnings.filterwarnings('ignore')

# Suppress OpenML warnings about parquet download failures (ARFF fallback works fine)
import logging
logging.getLogger('openml').setLevel(logging.ERROR)

# Add project to path
sys.path.insert(0, os.getcwd())

# Import AURORA feature extractor
try:
    from src.features.enhanced_extractor import MetaLearningFeatureExtractor, MetaLearningFeatures
    print("✅ MetaLearningFeatureExtractor loaded")
    USE_META_FEATURES = True
except ImportError:
    from src.features.minimal_extractor import MinimalFeatureExtractor as MetaLearningFeatureExtractor
    print("⚠️ Using MinimalFeatureExtractor (fallback)")
    USE_META_FEATURES = False

print(f"\n📁 Working directory: {os.getcwd()}")

## Cell 4: Configuration

In [None]:
@dataclass
class TrainingConfig:
    """Configuration for meta-learning training."""
    # Dataset collection
    n_datasets: int = 40              # Number of OpenML datasets
    max_samples_per_dataset: int = 5000  # Max rows per dataset
    min_samples_for_cv: int = 50      # Minimum for cross-validation
    
    # Cross-validation
    cv_folds: int = 3                 # Number of CV folds
    
    # Training
    test_size: float = 0.2            # Test set size
    random_state: int = 42            # Random seed
    min_confidence: float = 0.5       # Filter low-confidence samples
    
    # Actions to try for each column type
    numeric_actions: List[str] = field(default_factory=lambda: [
        'keep_as_is',
        'standard_scale',
        'minmax_scale',
        'robust_scale',
        'log_transform',
        'log1p_transform',
        'sqrt_transform',
        'clip_outliers',
    ])
    
    categorical_actions: List[str] = field(default_factory=lambda: [
        'keep_as_is',
        'onehot_encode',
        'label_encode',
        'ordinal_encode',
        'frequency_encode',
        'drop_column',
    ])
    
    text_actions: List[str] = field(default_factory=lambda: [
        'keep_as_is',
        'drop_column',
        'label_encode',
    ])

# Initialize configuration
config = TrainingConfig()

print("📋 Configuration:")
print(f"  Datasets to collect: {config.n_datasets}")
print(f"  CV folds: {config.cv_folds}")
print(f"  Random seed: {config.random_state}")
print(f"  Numeric actions: {len(config.numeric_actions)}")
print(f"  Categorical actions: {len(config.categorical_actions)}")

## Cell 5: Dataset Collection

Collects up to 40 diverse datasets from OpenML for training.

In [None]:
class DatasetCollector:
    """Collects datasets from OpenML for meta-learning."""
    
    # Curated list of diverse OpenML dataset IDs
    OPENML_DATASETS = [
        # Classification datasets
        (31, "credit-g"),           # German Credit
        (37, "diabetes"),           # Pima Indians Diabetes
        (44, "spambase"),           # Spam Detection
        (151, "electricity"),       # Electricity
        (1461, "bank-marketing"),   # Bank Marketing
        (1464, "blood-transfusion"), # Blood Transfusion
        (1480, "ilpd"),             # Indian Liver Patient
        (1494, "qsar-biodeg"),      # QSAR Biodegradation
        (40536, "SpeedDating"),     # Speed Dating
        (40945, "titanic"),         # Titanic
        (41027, "Australian"),      # Australian Credit
        (4134, "Bioresponse"),      # Bioresponse
        (1590, "adult"),            # Adult Income
        (23, "cmc"),                # Contraceptive Method
        (40966, "MiceProtein"),     # Mice Protein
        (1063, "kc2"),              # KC2
        (1068, "pc1"),              # PC1
        (4538, "GesturePhaseSegmentationProcessed"),
        (40981, "wilt"),            # Wilt
        (40982, "steel-plates-fault"),
        (40983, "wdbc"),            # Wisconsin Diagnostic Breast Cancer
        (40984, "segment"),         # Image Segmentation
        (40994, "climate-model-simulation"),
        (1111, "KDDCup09_appetency"),
        (1169, "airlines"),         # Airlines
        (4135, "Amazon_employee_access"),
        (1486, "nomao"),            # Nomao
        (1489, "phoneme"),          # Phoneme
        (1501, "semeion"),          # Semeion
        (23381, "dresses-sales"),   # Dresses
        (42570, "compass"),         # COMPAS
        (6, "letter"),              # Letter Recognition
        (12, "mfeat-factors"),      # MFeat Factors
        (14, "mfeat-fourier"),      # MFeat Fourier
        (16, "mfeat-karhunen"),     # MFeat Karhunen
        (18, "mfeat-morphological"),# MFeat Morphological
        (20, "mfeat-pixel"),        # MFeat Pixel
        (22, "mfeat-zernike"),      # MFeat Zernike
        (32, "pendigits"),          # Pen Digits
        (40499, "texture"),         # Texture
    ]
    
    def __init__(self, config: TrainingConfig):
        self.config = config
        self.datasets = []
        
    def collect(self) -> List[Tuple[str, pd.DataFrame, str]]:
        """Collect datasets from OpenML."""
        collected = []
        
        print(f"📦 Collecting up to {self.config.n_datasets} datasets from OpenML...\n")
        
        for dataset_id, name in tqdm(self.OPENML_DATASETS[:self.config.n_datasets], desc="Downloading datasets"):
            max_retries = 3
            for attempt in range(max_retries):
                try:
                    # Fetch from OpenML
                    dataset = openml.datasets.get_dataset(dataset_id)
                    X, y, categorical_indicator, attribute_names = dataset.get_data(
                        target=dataset.default_target_attribute,
                        dataset_format='dataframe'
                    )
                    
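                    if X is None or y is None:
                        # Retrying cannot fix a missing target, so skip this dataset
                        print(f"  ⚠️ Skipped {name}: no data or target")
                        break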
                    
                    # Combine into DataFrame
                    df = X.copy()
                    df['target'] = y
                    
                    # Reset index to avoid IndexError after sampling
                    df = df.reset_index(drop=True)
                    
                    # Sample if too large
                    if len(df) > self.config.max_samples_per_dataset:
                        df = df.sample(n=self.config.max_samples_per_dataset, 
                                       random_state=self.config.random_state).reset_index(drop=True)
                    
                    collected.append((name, df, 'classification'))
                    print(f"  ✓ Downloaded {name}: {len(df)} rows, {len(df.columns)} columns")
                    
                except Exception as e:
                    if attempt < max_retries - 1:
                        continue
                    print(f"  ⚠️ Skipped {name}: {str(e)[:50]}")
                    break
                else:
                    # Success - break out of retry loop
                    break
        
        print(f"\n✅ Collected {len(collected)} datasets")
        self.datasets = collected
        return collected

# Collect datasets
collector = DatasetCollector(config)
datasets = collector.collect()

# Summary
total_rows = sum(len(df) for _, df, _ in datasets)
total_cols = sum(len(df.columns) for _, df, _ in datasets)
print(f"\n📊 Dataset Summary:")
print(f"  Total datasets: {len(datasets)}")
print(f"  Total rows: {total_rows:,}")
print(f"  Total columns: {total_cols:,}")

## Cell 6: Curriculum Meta-Learner

Core meta-learning logic with staged curriculum.

In [None]:
@dataclass
class TrainingSample:
    """A single training sample."""
    features: np.ndarray
    label: str
    confidence: float
    column_type: str
    column_name: str
    dataset_name: str
    performance_score: float


class PreprocessingExecutor:
    """Executes preprocessing actions on columns."""
    
    @staticmethod
    def apply(column: pd.Series, action: str) -> Optional[np.ndarray]:
        """Apply a preprocessing action to a column."""
        non_null = column.dropna()
        if len(non_null) == 0:
            return None
        
        try:
            if action == 'keep_as_is':
                if pd.api.types.is_numeric_dtype(column):
                    return column.fillna(0).values.reshape(-1, 1)
                else:
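                    le = LabelEncoder()
                    # Cast to object first: fillna raises on category dtype when '__NA__' is not a category
                    return le.fit_transform(column.astype(object).fillna('__NA__').astype(str)).reshape(-1, 1)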
            
            elif action == 'standard_scale':
                if not pd.api.types.is_numeric_dtype(column):
                    return None
                scaler = StandardScaler()
                values = column.fillna(column.mean()).values.reshape(-1, 1)
                return scaler.fit_transform(values)
            
            elif action == 'minmax_scale':
                if not pd.api.types.is_numeric_dtype(column):
                    return None
                scaler = MinMaxScaler()
                values = column.fillna(column.mean()).values.reshape(-1, 1)
                return scaler.fit_transform(values)
            
            elif action == 'robust_scale':
                if not pd.api.types.is_numeric_dtype(column):
                    return None
                from sklearn.preprocessing import RobustScaler
                scaler = RobustScaler()
                values = column.fillna(column.median()).values.reshape(-1, 1)
                return scaler.fit_transform(values)
            
            elif action == 'log_transform':
                if not pd.api.types.is_numeric_dtype(column):
                    return None
                values = column.fillna(0)
                if (values <= 0).any():
                    return None
                return np.log(values).values.reshape(-1, 1)
            
            elif action == 'log1p_transform':
                if not pd.api.types.is_numeric_dtype(column):
                    return None
                values = column.fillna(0)
                if (values < 0).any():
                    return None
                return np.log1p(values).values.reshape(-1, 1)
            
            
            elif action == 'sqrt_transform':
                if not pd.api.types.is_numeric_dtype(column):
                    return None
                values = column.fillna(0)
                if (values < 0).any():
                    return None
                return np.sqrt(values).values.reshape(-1, 1)
            elif action == 'clip_outliers':
                if not pd.api.types.is_numeric_dtype(column):
                    return None
                values = column.fillna(column.median())
                q1, q3 = values.quantile([0.25, 0.75])
                iqr = q3 - q1
                clipped = values.clip(q1 - 1.5*iqr, q3 + 1.5*iqr)
                return clipped.values.reshape(-1, 1)
            
            # The encoders below cast to object before fillna, because fillna raises
            # on category dtype when '__NA__' is not an existing category.
            elif action == 'onehot_encode':
                values = column.astype(object).fillna('__NA__').astype(str)
                dummies = pd.get_dummies(values)
                return dummies.values
            
            elif action == 'label_encode':
                values = column.astype(object).fillna('__NA__').astype(str)
                le = LabelEncoder()
                return le.fit_transform(values).reshape(-1, 1)
            
            elif action == 'ordinal_encode':
                from sklearn.preprocessing import OrdinalEncoder
                values = column.astype(object).fillna('__NA__').astype(str).values.reshape(-1, 1)
                enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
                return enc.fit_transform(values)
            
            elif action == 'frequency_encode':
                values = column.astype(object).fillna('__NA__').astype(str)
                freq = values.value_counts(normalize=True)
                return values.map(freq).values.reshape(-1, 1)
            
            elif action == 'drop_column':
                return None
            
            else:
                return None
                
        except Exception:
            return None


class CurriculumMetaLearner:
    """Curriculum-based meta-learner for neural oracle training."""
    
    def __init__(self, config: TrainingConfig):
        self.config = config
        self.extractor = MetaLearningFeatureExtractor()
        self.executor = PreprocessingExecutor()
        self.samples: List[TrainingSample] = []
        
    def classify_column(self, column: pd.Series, name: str) -> str:
        """Classify column as numeric, categorical, text, or deterministic."""
        if column.isna().all():
            return 'deterministic:all_null'
        if column.nunique() <= 1:
            return 'deterministic:constant'
        if column.nunique() == len(column.dropna()) and 'id' in name.lower():
            return 'deterministic:id'
        if pd.api.types.is_datetime64_any_dtype(column):
            return 'deterministic:datetime'
        if pd.api.types.is_bool_dtype(column):
            return 'deterministic:boolean'
        
        # Check for boolean-like strings
        if pd.api.types.is_object_dtype(column):
            unique_lower = set(column.dropna().astype(str).str.lower().unique())
            if unique_lower.issubset({'true', 'false', 'yes', 'no', '0', '1', 't', 'f', 'y', 'n'}):
                if len(unique_lower) <= 3:
                    return 'deterministic:boolean'
        
        if pd.api.types.is_numeric_dtype(column):
            return 'numeric'
        
        # Check for text vs categorical
        non_null = column.dropna()
        if len(non_null) > 0:
            avg_len = non_null.astype(str).str.len().mean()
            unique_ratio = column.nunique() / len(non_null)
            if unique_ratio > 0.5 and avg_len > 30:
                return 'text'
        
        return 'categorical'
    
    def measure_performance(self, X: np.ndarray, y: np.ndarray) -> Optional[float]:
        """Measure cross-validation performance."""
        if len(y) < self.config.min_samples_for_cv:
            return None
        
        # Encode target if needed
        if not pd.api.types.is_numeric_dtype(pd.Series(y)):
            le = LabelEncoder()
            y = le.fit_transform(y.astype(str))
        
        # Check for minimum class samples
        unique, counts = np.unique(y, return_counts=True)
        if len(unique) < 2 or min(counts) < self.config.cv_folds:
            return None
        
        try:
            cv = StratifiedKFold(
                n_splits=self.config.cv_folds,
                shuffle=True,
                random_state=self.config.random_state
            )
            
            model = LogisticRegression(
                max_iter=500,
                random_state=self.config.random_state,
                n_jobs=-1
            )
            
            scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
            return float(scores.mean())
        except Exception:
            return None
    
    def find_best_action(
        self,
        df: pd.DataFrame,
        col_name: str,
        target: pd.Series,
        actions: List[str],
        other_cols: Optional[pd.DataFrame] = None
    ) -> Tuple[str, float, Dict[str, float]]:
        """Find best action by measuring performance."""
        scores = {}
        column = df[col_name]
        
        # Prepare target
        mask = ~target.isna()
        y = target[mask].values
        
        for action in actions:
            transformed = self.executor.apply(column, action)
            
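            if transformed is None and action != 'drop_column':
                continue  # Inapplicable action: do not credit it with the drop baseline
            if action == 'drop_column':
                if other_cols is not None and len(other_cols.columns) > 0:
                    X = other_cols[mask].values
                else:
                    continue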
            else:
                X_col = transformed[mask]
                if other_cols is not None and len(other_cols.columns) > 0:
                    X = np.hstack([other_cols[mask].values, X_col])
                else:
                    X = X_col
            
            score = self.measure_performance(X, y)
            if score is not None:
                scores[action] = score
        
        if not scores:
            return 'keep_as_is', 0.0, {}
        
        best = max(scores, key=scores.get)
        return best, scores[best], scores
    
    def extract_features(self, column: pd.Series, name: str) -> np.ndarray:
        """Extract features from a column."""
        if USE_META_FEATURES and hasattr(self.extractor, 'extract_meta_features'):
            features = self.extractor.extract_meta_features(column, name)
        else:
            features = self.extractor.extract(column, name)
        return features.to_array()
    
    def process_dataset(self, name: str, df: pd.DataFrame) -> List[TrainingSample]:
        """Process a single dataset through curriculum stages."""
        samples = []
        
        if 'target' not in df.columns:
            return samples
        
        target = df['target']
        feature_cols = [c for c in df.columns if c != 'target']
        
        # Classify columns
        col_types = {c: self.classify_column(df[c], c) for c in feature_cols}
        
        # Stage 1: Deterministic
        det_rules = {
            'all_null': 'drop_column',
            'constant': 'drop_column',
            'id': 'drop_column',
            'datetime': 'parse_datetime',
            'boolean': 'parse_boolean',
        }
        
        for col, ctype in col_types.items():
            if ctype.startswith('deterministic:'):
                rule = ctype.split(':')[1]
                action = det_rules.get(rule, 'keep_as_is')
                features = self.extract_features(df[col], col)
                samples.append(TrainingSample(
                    features=features,
                    label=action,
                    confidence=1.0,
                    column_type='deterministic',
                    column_name=col,
                    dataset_name=name,
                    performance_score=1.0
                ))
        
        # Prepare processed numeric columns for context
        numeric_cols = [c for c, t in col_types.items() if t == 'numeric']
        processed_numeric = pd.DataFrame()
        
        # Stage 2: Numeric
        for col in numeric_cols:
            best, score, all_scores = self.find_best_action(
                df, col, target,
                self.config.numeric_actions,
                processed_numeric if len(processed_numeric.columns) > 0 else None
            )
            
            if all_scores:
                # Calculate confidence from score gap
                sorted_scores = sorted(all_scores.values(), reverse=True)
                gap = sorted_scores[0] - sorted_scores[1] if len(sorted_scores) > 1 else 0.3
                confidence = min(1.0, 0.5 + gap * 5)
                
                features = self.extract_features(df[col], col)
                samples.append(TrainingSample(
                    features=features,
                    label=best,
                    confidence=confidence,
                    column_type='numeric',
                    column_name=col,
                    dataset_name=name,
                    performance_score=score
                ))
                
                # Add to context
                scaled = self.executor.apply(df[col], 'standard_scale')
                if scaled is not None:
                    processed_numeric[col] = scaled.flatten()
        
        # Stage 3: Categorical
        cat_cols = [c for c, t in col_types.items() if t == 'categorical']
        for col in cat_cols:
            best, score, all_scores = self.find_best_action(
                df, col, target,
                self.config.categorical_actions,
                processed_numeric if len(processed_numeric.columns) > 0 else None
            )
            
            if all_scores:
                sorted_scores = sorted(all_scores.values(), reverse=True)
                gap = sorted_scores[0] - sorted_scores[1] if len(sorted_scores) > 1 else 0.3
                confidence = min(1.0, 0.5 + gap * 5)
                
                features = self.extract_features(df[col], col)
                samples.append(TrainingSample(
                    features=features,
                    label=best,
                    confidence=confidence,
                    column_type='categorical',
                    column_name=col,
                    dataset_name=name,
                    performance_score=score
                ))
        
        # Stage 4: Text
        text_cols = [c for c, t in col_types.items() if t == 'text']
        for col in text_cols:
            best, score, all_scores = self.find_best_action(
                df, col, target,
                self.config.text_actions,
                processed_numeric if len(processed_numeric.columns) > 0 else None
            )
            
            if not all_scores:
                best = 'drop_column'
                score = 0.0
                all_scores = {'drop_column': 0.0}
            
            confidence = 0.7  # Text is often dropped
            features = self.extract_features(df[col], col)
            samples.append(TrainingSample(
                features=features,
                label=best,
                confidence=confidence,
                column_type='text',
                column_name=col,
                dataset_name=name,
                performance_score=score
            ))
        
        return samples
    
    def run(self, datasets: List[Tuple[str, pd.DataFrame, str]]) -> List[TrainingSample]:
        """Run curriculum meta-learning on all datasets."""
        print(f"\n🎓 Starting Curriculum Meta-Learning...\n")
        
        for name, df, _ in tqdm(datasets, desc="Processing datasets"):
            samples = self.process_dataset(name, df)
            self.samples.extend(samples)
        
        # Filter by minimum confidence
        self.samples = [
            s for s in self.samples 
            if s.confidence >= self.config.min_confidence
        ]
        
        print(f"\n✅ Generated {len(self.samples)} training samples")
        return self.samples

print("✅ CurriculumMetaLearner defined")

## Cell 7-9: Run Curriculum Stages

### What This Does
This cell runs the **curriculum-based meta-learning** process:

1. **Uses the datasets collected in Cell 5** (up to 40 from OpenML, with diverse characteristics)
2. **For each column** in each dataset:
   - Classifies column type (numeric, categorical, text, or deterministic)
   - **Measures actual ML performance** by trying different preprocessing actions
   - Uses cross-validation with LogisticRegression to evaluate each action
   - Selects the best action based on CV accuracy
   - Calculates confidence from the performance gap
3. **Generates training samples** with:
   - 62 comprehensive meta-features
   - Performance-based labels (best action)
   - Confidence scores
4. **Filters samples** by minimum confidence threshold (0.5 by default; computed confidences start at 0.5, so the default keeps every sample)
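
The confidence rule used in `process_dataset` can be traced by hand. The sketch below restates it as a standalone function and applies it to made-up CV accuracies. The name `confidence_from_scores` and the example scores are illustrative assumptions, not part of AURORA.

```python
from typing import Dict


def confidence_from_scores(scores: Dict[str, float]) -> float:
    """Map the gap between the two best CV scores to a confidence in [0.5, 1.0]."""
    ranked = sorted(scores.values(), reverse=True)
    # With a single scored action there is no runner-up; the notebook assumes a gap of 0.3.
    gap = ranked[0] - ranked[1] if len(ranked) > 1 else 0.3
    return min(1.0, 0.5 + gap * 5)


# Hypothetical CV accuracies for one numeric column
scores = {"keep_as_is": 0.71, "standard_scale": 0.75, "log1p_transform": 0.73}
best_action = max(scores, key=scores.get)
confidence = confidence_from_scores(scores)
print(best_action, round(confidence, 2))
```

A gap of 0.02 between the top two actions gives a confidence of 0.6, and any gap of 0.1 or more saturates at 1.0.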

### Why This Takes Time (30-60 minutes)
Every candidate action for every column is scored with **cross-validation**, so the run fits thousands of small LogisticRegression models. This is slower than heuristics, but the resulting labels reflect measured ML performance rather than rules of thumb.

**Progress bar** will show dataset processing status below.
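
To get a feel for the cost, the number of model fits can be bounded with simple arithmetic. The sketch below is a rough estimate only: the helper name `count_cv_fits` and the column counts are hypothetical, while the default action counts and fold count mirror `TrainingConfig`.

```python
def count_cv_fits(n_numeric: int, n_categorical: int, n_text: int,
                  n_numeric_actions: int = 8, n_categorical_actions: int = 6,
                  n_text_actions: int = 3, cv_folds: int = 3) -> int:
    """Upper bound on LogisticRegression fits for one dataset."""
    evaluations = (n_numeric * n_numeric_actions
                   + n_categorical * n_categorical_actions
                   + n_text * n_text_actions)
    # Each evaluation runs one fit per CV fold
    return evaluations * cv_folds


# Hypothetical dataset with 10 numeric and 5 categorical columns
fits = count_cv_fits(n_numeric=10, n_categorical=5, n_text=0)
print(fits)
```

This is an upper bound, since actions that cannot be scored for a column are skipped. Wide datasets dominate the total, so lowering `n_datasets` or `cv_folds` is the most direct way to shorten a run.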

### Expected Output
- 1000-3000 training samples
- Diverse action distribution (up to 15 different actions)
- High-confidence labels based on actual performance


In [None]:
print("⚠️  WARNING: This training will take 30-60 minutes")
print("   We're running cross-validation on up to 40 datasets to measure ACTUAL performance.")
print("   This is necessary for learning which preprocessing actions work best.")
print("   Progress will be shown below...\n")

# Initialize and run meta-learner
learner = CurriculumMetaLearner(config)
samples = learner.run(datasets)

# Statistics
labels = [s.label for s in samples]
types = [s.column_type for s in samples]
confidences = [s.confidence for s in samples]

print("\n📊 Training Data Statistics:")
print(f"  Total samples: {len(samples)}")
print(f"  Average confidence: {np.mean(confidences):.3f}")
print(f"  Unique labels: {len(set(labels))}")
print(f"  Unique datasets: {len(set(s.dataset_name for s in samples))}")

print("\n📋 Label Distribution:")
for label, count in sorted(Counter(labels).items(), key=lambda x: -x[1]):
    pct = count / len(labels) * 100
    print(f"  {label:25s}: {count:4d} ({pct:5.1f}%)")

print("\n📋 Type Distribution:")
for ctype, count in sorted(Counter(types).items(), key=lambda x: -x[1]):
    pct = count / len(types) * 100
    print(f"  {ctype:15s}: {count:4d} ({pct:5.1f}%)")

## Cell 10: Prepare Training Data

In [None]:
# Prepare feature matrix and labels
X = np.vstack([s.features for s in samples])
y = np.array([s.label for s in samples])
sample_weights = np.array([s.confidence for s in samples])

# Get feature names
if USE_META_FEATURES:
    feature_names = MetaLearningFeatures.get_feature_names()
else:
    feature_names = [f'feature_{i}' for i in range(X.shape[1])]

print(f"📊 Training Data Shape:")
print(f"  Features (X): {X.shape}")
print(f"  Labels (y): {y.shape}")
print(f"  Feature count: {len(feature_names)}")
print(f"  Unique labels: {len(np.unique(y))}")

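# Split into train/test. Stratify only when every label has at least two
# samples; train_test_split raises a ValueError otherwise.
_, label_counts = np.unique(y, return_counts=True)
stratify_labels = y if label_counts.min() >= 2 else None

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, sample_weights,
    test_size=config.test_size,
    random_state=config.random_state,
    stratify=stratify_labels
)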

print(f"\n📊 Train/Test Split:")
print(f"  Training: {len(X_train)} samples")
print(f"  Test: {len(X_test)} samples")

## Cell 11: Train XGBoost + LightGBM Ensemble

In [None]:
print("🚀 Training XGBoost + LightGBM Ensemble...\n")

# XGBoost classifier
xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=8,
    learning_rate=0.03,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=config.random_state,
    n_jobs=-1,
    eval_metric='mlogloss'
)

# LightGBM classifier
lgb_model = LGBMClassifier(
    n_estimators=300,
    max_depth=8,
    learning_rate=0.03,
    subsample=0.8,
    subsample_freq=1,  # LightGBM ignores subsample unless subsample_freq > 0
    colsample_bytree=0.8,
    random_state=config.random_state,
    n_jobs=-1,
    verbose=-1
)

# Voting ensemble
ensemble = VotingClassifier(
    estimators=[
        ('xgb', xgb_model),
        ('lgb', lgb_model)
    ],
    voting='soft',
    weights=[0.6, 0.4]
)

# Train
print("Training ensemble (this may take a few minutes)...")
ensemble.fit(X_train, y_train, sample_weight=w_train)

# Evaluate
train_pred = ensemble.predict(X_train)
test_pred = ensemble.predict(X_test)

train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)

print(f"\n✅ Training Complete!")
print(f"  Training Accuracy: {train_acc:.1%}")
print(f"  Test Accuracy: {test_acc:.1%}")

## Cell 12: Evaluate on Test Set

In [None]:
print("📊 Detailed Evaluation:\n")

# Classification report
print("Classification Report:")
print(classification_report(y_test, test_pred))

# Per-class accuracy
print("\nPer-Class Accuracy:")
classes = np.unique(y_test)
for cls in sorted(classes):
    mask = y_test == cls
    cls_acc = accuracy_score(y_test[mask], test_pred[mask])
    count = mask.sum()
    print(f"  {cls:25s}: {cls_acc:5.1%} ({count:3d} samples)")

# Quality summary
print("\n📊 Model Quality Check:")
print(f"  ✅ Test accuracy: {test_acc:.1%}")
print(f"  {'✅' if test_acc >= 0.9 else '⚠️'} Target: 90%+")
print(f"  ✅ Unique classes: {len(classes)}")

## Cell 13: Save Model + Metadata

In [None]:
# Generate filename with timestamp
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
model_filename = f'neural_oracle_meta_v3_{timestamp}.pkl'

# Prepare metadata
metadata = {
    'version': 'meta_v3',
    'accuracy': float(test_acc),
    'training_samples': len(samples),
    'datasets': len(datasets),
    'feature_count': X.shape[1],
    'unique_labels': len(np.unique(y)),
    'timestamp': timestamp,
    'config': {
        'n_datasets': config.n_datasets,
        'cv_folds': config.cv_folds,
        'random_state': config.random_state,
    }
}

# Save model directly (NeuralOracle compatible format)
with open(model_filename, 'wb') as f:
    pickle.dump(ensemble, f)

# Save metadata separately
metadata_filename = f'model_metadata_{timestamp}.json'
with open(metadata_filename, 'w') as f:
    json.dump(metadata, f, indent=2)

# Get file size in MB (os is already imported in Cell 3)
model_size = os.path.getsize(model_filename) / (1024 * 1024)

print(f"✅ Model saved: {model_filename}")
print(f"✅ Metadata saved: {metadata_filename}")
print(f"\n📊 Model Info:")
print(f"  Size: {model_size:.1f} MB")
print(f"  Test Accuracy: {test_acc:.1%}")
print(f"  Training Samples: {len(samples)}")
print(f"  Datasets: {len(datasets)}")
print(f"  Features: {X.shape[1]}")

## Cell 14: Download Instructions

In [None]:
# Download files (Colab only)
try:
    from google.colab import files
    
    print("📥 Downloading model...")
    files.download(model_filename)
    
    print("\n📥 Downloading metadata...")
    files.download(metadata_filename)
    
    print("\n✅ Downloads started!")
    
except ImportError:
    print("📁 Not running in Colab. Files saved locally:")
    print(f"  - {model_filename}")
    print(f"  - {metadata_filename}")

print("\n" + "="*60)
print("DEPLOYMENT INSTRUCTIONS")
print("="*60)
print(f"""
1. Copy the model file to your AURORA installation:
   
   cp {model_filename} /path/to/AURORA-V2/models/

2. The NeuralOracle will automatically load the latest model,
   or you can specify the path explicitly:
   
   from src.neural.oracle import NeuralOracle
   oracle = NeuralOracle(model_path='models/{model_filename}')

3. Verify deployment:
   
   python scripts/model_integration_utils.py

Expected Results:
  - Model loads without errors
  - Bestsellers.csv accuracy: 90%+
  - Inference time: <5ms per column
""")

## Summary

### What Was Created
- **Trained Model**: XGBoost + LightGBM ensemble for preprocessing action prediction
- **Training Data**: Generated from up to 40 OpenML datasets using **performance-based ground truth** (actual CV scores)
- **Features**: 62 comprehensive meta-features covering statistics, semantics, patterns, and distributions
- **Actions**: Up to 15 preprocessing actions learned through curriculum meta-learning (13 scored by CV plus the deterministic labels `parse_datetime` and `parse_boolean`)

### Key Improvements in This Version
- ✓ **All 15 preprocessing actions** can now be learned (not just 3!)
- ✓ **Performance-based action selection** using cross-validation (not heuristics)
- ✓ **62 comprehensive meta-features** extracted per column
- ✓ **IndexError bug fixed** with `reset_index(drop=True)`
- ✓ **Retry logic** for failed dataset downloads
- ✓ **OpenML warning suppression** for cleaner output
- ✓ **Training time: 30-60 minutes** (necessary for actual performance measurement)

### Expected Action Distribution
After training, you should see a diverse distribution like:
```
📊 Action Distribution:
   - standard_scale: ~30-40%
   - keep_as_is: ~10-15%
   - log1p_transform: ~8-12%
   - robust_scale: ~8-10%
   - onehot_encode: ~8-10%
   - label_encode: ~6-8%
   - clip_outliers: ~5-8%
   - frequency_encode: ~3-5%
   - minmax_scale: ~3-5%
   - drop_column: ~2-4%
   - ordinal_encode: ~2-4%
   - sqrt_transform: ~2-3%
   - log_transform: ~1-3%
```

If you see only 3 actions (standard_scale, onehot_encode, label_encode), the training didn't work correctly.
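
One way to check for this collapse is to compute each action's share of the labels. The sketch below is a minimal illustration: the helper `action_shares` and the sample labels are hypothetical, and in the notebook the labels would come from `[s.label for s in samples]`.

```python
from collections import Counter
from typing import Dict, List


def action_shares(labels: List[str]) -> Dict[str, float]:
    """Return each action's share of the labels, largest first."""
    counts = Counter(labels)
    total = len(labels)
    return {action: count / total for action, count in counts.most_common()}


# Hypothetical labels showing the collapsed three-action case
labels = ["standard_scale"] * 6 + ["onehot_encode"] * 3 + ["label_encode"]
shares = action_shares(labels)
if len(shares) <= 3:
    print(f"Only {len(shares)} actions present: {sorted(shares)}")
```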

### Model Details
- **Architecture**: VotingClassifier (XGBoost 60%, LightGBM 40%)
- **All Actions**: keep_as_is, standard_scale, minmax_scale, robust_scale, log_transform, log1p_transform, sqrt_transform, clip_outliers, onehot_encode, label_encode, ordinal_encode, frequency_encode, drop_column, plus the deterministic labels parse_datetime and parse_boolean
- **Expected accuracy**: 85-92% on test data
- **Training samples**: 1000-3000 column examples
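
The 60/40 weighting applies to predicted probabilities, not to votes. The sketch below works through weighted soft voting for a single column in plain Python. The function `soft_vote` and the probabilities are illustrative assumptions, not the scikit-learn implementation.

```python
from typing import Dict, List


def soft_vote(probas: List[Dict[str, float]], weights: List[float]) -> str:
    """Return the class with the highest weighted-average probability."""
    total = sum(weights)
    classes = probas[0].keys()
    averaged = {
        cls: sum(w * p[cls] for w, p in zip(weights, probas)) / total
        for cls in classes
    }
    return max(averaged, key=averaged.get)


# Hypothetical per-class probabilities from the two models for one column
xgb_proba = {"standard_scale": 0.55, "robust_scale": 0.45}
lgb_proba = {"standard_scale": 0.40, "robust_scale": 0.60}
winner = soft_vote([xgb_proba, lgb_proba], weights=[0.6, 0.4])
print(winner)
```

Here XGBoost narrowly prefers `standard_scale`, but LightGBM's stronger preference for `robust_scale` wins the weighted average (0.51 against 0.49), so the lower-weighted model can still decide the outcome.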

### Next Steps
1. Download the model file (`neural_oracle_meta_v3_TIMESTAMP.pkl`)
2. Copy to `models/` directory in your AURORA installation
3. Run validation tests: `python tests/test_complete_system.py`
4. Deploy to production
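
Before copying the model, it can help to check the metadata file written in Cell 13. The sketch below is one possible gate, under stated assumptions: the helper `metadata_passes` and the 0.85 threshold (the lower end of the expected accuracy range) are hypothetical, and the demo file stands in for a real `model_metadata_TIMESTAMP.json`.

```python
import json
import tempfile
from pathlib import Path


def metadata_passes(path: Path, min_accuracy: float = 0.85) -> bool:
    """Return True if the saved metadata reports at least min_accuracy."""
    with open(path) as f:
        metadata = json.load(f)
    return metadata["accuracy"] >= min_accuracy


# Stand-in metadata file with keys that Cell 13 writes (values are made up)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model_metadata_demo.json"
    demo_path.write_text(json.dumps({"version": "meta_v3", "accuracy": 0.88}))
    ok = metadata_passes(demo_path)

print("deploy" if ok else "retrain")
```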

### Troubleshooting
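- **If accuracy is low (<80%)**: Add more dataset IDs to `OPENML_DATASETS` and raise `n_datasets` to match (the list holds 40 entries, so raising `n_datasets` alone has no effect). Lowering `min_confidence` below 0.5 does not help either, because computed confidences never fall below 0.5.
- **If training is slow**: This is expected! We're running CV on up to 40 datasets. Reduce `n_datasets` if needed.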
- **If only 3 actions learned**: Check that `find_best_action()` is being called, not heuristics
- **If IndexError occurs**: Make sure `reset_index(drop=True)` is called after sampling
- Check `docs/META_LEARNING_GUIDE.md` for detailed documentation
