# **SMU Course Bidding Prediction Using CatBoost V4**

<div style="background-color:#DFFFD6; padding:12px; border-radius:5px; border: 1px solid #228B22;">
   <h2 style="color:#006400;">✅ Looking to Implement This? ✅</h2>
   <p>🚀 <strong>Get started quickly by using <a href="example_prediction.ipynb">example_prediction.ipynb</a></strong>.</p>
   <ul> 
      <li>📌 <strong>Three pre-trained CatBoost models (<code>.cbm</code>) available for instant predictions.</strong></li>
      <li>🔧 Includes <strong>step-by-step instructions</strong> for making predictions with uncertainty quantification.</li>
      <li>⚡ Works <strong>out of the box</strong>: just load the models and start predicting!</li>
   </ul>
   <h3>🔗 📌 Next Steps:</h3>
   <p>👉 <a href="example_prediction.ipynb"><strong>Go to Example Prediction Notebook</strong></a></p>
</div> 
<h2><span style="color:red">NOTE: use at your own discretion.</span></h2>

### **Changes in V4**
- **Three-model architecture**: Added a classification model to predict whether a course will receive bids, complementing the existing median and min bid regression models
- **Advanced uncertainty quantification**: Implemented entropy-based confidence scoring for classification and bootstrap-based confidence intervals for regression models
- **Enhanced feature engineering**: Incorporated day-of-week boolean flags (`has_mon`, `has_tue`, etc.) for better temporal pattern recognition
- **Asymmetric loss function**: Custom loss that penalizes under-predictions more heavily than over-predictions, crucial for bidding strategy
- **Comprehensive evaluation suite**: Added confidence interval coverage analysis, residual analysis with emphasis on under-predictions, and cross-model feature importance comparison

### **Objective**
This notebook predicts bidding outcomes for courses in the SMU bidding system using **three specialized CatBoost models**. Building on insights from **V1, V2, and V3**, this version introduces a comprehensive **multi-model approach** with advanced uncertainty quantification:

1. **Classification Model**: Predicts whether a course will receive bids (optimized for high recall)
2. **Median Bid Regression Model**: Predicts the median bid price with confidence intervals
3. **Min Bid Regression Model**: Predicts the minimum bid price with confidence intervals

### **Key Enhancements in V4**

**Learning from V3:**
   - V3 focused on two regression models for median and min bid prediction
   - V4 adds a **classification component** to identify courses that will receive bidding activity
   - Enhanced with **probabilistic predictions** and **confidence scoring**

**New V4 Features:**
   - **Entropy-based confidence scoring** for classification predictions with five confidence levels (Very Low to Very High)
   - **Bootstrap sampling** (100 iterations) for robust confidence interval estimation
   - **Asymmetric loss function** (α=2.0) that heavily penalizes dangerous under-predictions
   - **Comprehensive uncertainty analysis** including interval width and coverage metrics

### **Three-Model Architecture**

| **Model Type** | **Purpose** | **Output** | **Uncertainty Measure** |
|----------------|-------------|------------|-------------------------|
| **Classification** | Predict bid courses | Probability + Confidence Level | Entropy-based confidence score |
| **Median Bid Regression** | Predict median bid price | Price + 95% CI | Bootstrap confidence intervals |
| **Min Bid Regression** | Predict minimum bid price | Price + 95% CI | Bootstrap confidence intervals |

### **Updated Dataset Features**

| **Feature Name** | **Type** | **Description** |
|------------------|----------|-----------------|
| **`subject_area`** | Categorical | Subject area (IS, ECON, etc.) |
| **`catalogue_no`** | Categorical | Course number |
| **`round`** | Categorical | Bidding round (1, 1A, 1B, 1C, 2, 2A; Incoming Freshmen rounds get an `F` suffix, e.g. 1F) |
| **`window`** | Numerical | Bidding window (1-5) |
| **`before_process_vacancy`** | Numerical | Available spots before bidding |
| **`acad_year_start`** | Numerical | Academic year start |
| **`term`** | Categorical | Academic term (1, 2, 3A, 3B) |
| **`start_time`** | Categorical | Class start time |
| **`course_name`** | Categorical | Course name/description |
| **`section`** | Categorical | Course section |
| **`instructor`** | Categorical | Instructor name |
| **`has_mon`** to **`has_sun`** | Boolean | Day-of-week indicators (0/1), one column per day |
| **🎯 Target Variables 🎯** | | **Model outputs** |
| **`bids`** | Binary | Whether course receives bids |
| **`target_median_bid`** | Numerical | Median bid price |
| **`target_min_bid`** | Numerical | Minimum bid price |

### **Advanced Uncertainty Quantification**

**Classification Confidence:**
- **Entropy-based scoring**: Measures prediction certainty using information entropy
- **Five confidence levels**: Very Low, Low, Medium, High, Very High
- **Probability outputs**: Separate probabilities for bid/non-bid outcomes
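
The entropy-based score can be sketched as follows. This is a minimal illustration that assumes the confidence score is one minus the binary Shannon entropy (in bits) of the predicted bid probability. The five level names come from the list above, but the evenly spaced cut-offs in `LEVELS` and the function names are assumptions, not the notebook's actual values.

```python
import math
from typing import Tuple

# Hypothetical cut-offs: the level names are from the text, the thresholds are not.
LEVELS = [(0.2, "Very Low"), (0.4, "Low"), (0.6, "Medium"), (0.8, "High"), (1.0, "Very High")]

def binary_entropy(p_bid: float) -> float:
    """Shannon entropy in bits of a two-class prediction (0 = certain, 1 = coin flip)."""
    entropy = 0.0
    for p in (p_bid, 1.0 - p_bid):
        if p > 0.0:  # the term 0 * log(0) is taken as 0
            entropy -= p * math.log2(p)
    return entropy

def entropy_confidence(p_bid: float) -> Tuple[float, str]:
    """Return (confidence score in [0, 1], confidence level label)."""
    confidence = 1.0 - binary_entropy(p_bid)
    for upper, label in LEVELS:
        if confidence <= upper:
            return confidence, label
    return confidence, "Very High"

for p in (0.5, 0.9, 0.99):
    score, level = entropy_confidence(p)
    print(f"P(bid)={p:.2f} -> confidence={score:.3f} ({level})")
```

A probability of 0.5 carries maximal entropy and therefore the lowest confidence, while probabilities near 0 or 1 score close to 1 regardless of which class is predicted.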

**Regression Confidence Intervals:**
- **Bootstrap sampling**: 100 model iterations for robust uncertainty estimation
- **95% confidence intervals**: Upper and lower bounds for each prediction
- **Interval width analysis**: Wider intervals indicate higher uncertainty
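
A percentile bootstrap of this kind can be sketched as below. To keep the example fast and dependency-light, a straight-line fit from `np.polyfit` stands in for the CatBoost regressor, and the data are synthetic; `bootstrap_interval` and its parameters are illustrative names, not functions from the notebook.

```python
import numpy as np

def bootstrap_interval(x_train, y_train, x_new, n_boot=100, level=0.95, seed=0):
    """Percentile bootstrap: refit on resampled rows and collect the predictions."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = np.empty((n_boot, len(x_new)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample row indices with replacement
        # Stand-in model: a fitted line instead of a CatBoost regressor
        slope, intercept = np.polyfit(x_train[idx], y_train[idx], deg=1)
        preds[b] = slope * x_new + intercept
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(preds, [tail, 100.0 - tail], axis=0)
    return preds.mean(axis=0), lower, upper

# Synthetic data for illustration only (not SMU bidding data)
rng = np.random.default_rng(42)
x = rng.uniform(0, 50, size=200)
y = 60.0 - 0.8 * x + rng.normal(0, 5, size=200)
x_new = np.array([5.0, 25.0, 45.0])

mean, lower, upper = bootstrap_interval(x, y, x_new)
width = upper - lower  # wider interval = less stable prediction
for xi, m, lo, hi in zip(x_new, mean, lower, upper):
    print(f"x={xi:.0f}: prediction={m:.1f}, 95% interval=[{lo:.1f}, {hi:.1f}]")
```

Intervals built this way describe how much the fitted model varies across resamples. They are not prediction intervals for individual outcomes, which is one reason to check empirical coverage on held-out data.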

### **Methodology**
The notebook follows this enhanced structure:

1. **Data Preparation**:
   - Loading separate datasets for classification and regression tasks
   - Feature standardization and categorical encoding
   - Train-test splitting with consistent random seeds

2. **Three-Model Training**:
   - **Classification**: CatBoost with recall optimization for bid opportunity detection
   - **Median Regression**: CatBoost with bootstrap uncertainty quantification
   - **Min Regression**: CatBoost with asymmetric loss for under-prediction penalties

3. **Advanced Evaluation**:
   - **Classification**: Recall (maximizing true positives for bid detection), confusion matrix, entropy-based confidence analysis
   - **Regression**: MSE, MAE, R², asymmetric MSE, confidence interval coverage
   - **Cross-model feature importance comparison**

4. **Comprehensive Visualization**:
   - Confidence distribution plots and uncertainty analysis
   - Residual analysis with under-prediction emphasis
   - Feature importance rankings across all three models

5. **Model Persistence and Reporting**:
   - All models saved as `.cbm` files for deployment
   - Detailed results exported to CSV format
   - Comprehensive summary report generation

### **Key Metrics and Performance**

**Classification Model:**
- **Primary metric**: Recall (optimized for capturing all bidding opportunities - maximizing true positives)
- **Confidence analysis**: Distribution of entropy-based confidence scores  
- **Output**: Probabilities for bid/no-bid outcomes, confidence levels, and entropy values

**Regression Models:**
- **Standard metrics**: MSE, MAE, R² for model accuracy
- **Asymmetric MSE**: Custom metric penalizing under-predictions (α=2.0)
- **Uncertainty metrics**: Mean confidence interval width and coverage percentage
- **Safety analysis**: Percentage of dangerous under-predictions
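
The exact form of the asymmetric metric is not spelled out in this section. One plausible reading, sketched here as an assumption, multiplies the squared error by α whenever the prediction falls below the actual value. The function names are illustrative.

```python
import numpy as np

def asymmetric_mse(y_true, y_pred, alpha=2.0):
    """Mean squared error with under-predictions (y_pred < y_true) weighted by alpha."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred  # positive residual = prediction was too low
    weights = np.where(residual > 0, alpha, 1.0)
    return float(np.mean(weights * residual ** 2))

def under_prediction_rate(y_true, y_pred):
    """Share of predictions that fall below the actual value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_pred < y_true))

actual = [30.0, 30.0]
predicted = [25.0, 35.0]  # one under- and one over-prediction, both off by 5
print(asymmetric_mse(actual, predicted))         # 37.5 (plain MSE would be 25.0)
print(under_prediction_rate(actual, predicted))  # 0.5
```

With α=2.0, an under-prediction of 5 contributes 50 to the sum while an over-prediction of 5 contributes 25, so a model scored this way is pushed toward erring on the high side.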

### **Classification Strategy - Maximizing Bidding Opportunities**

**Recall-Optimized Approach:**
- **Target**: Predict courses that will receive bids (positive class = 1)
- **Primary Goal**: Maximize recall to capture all potential bidding opportunities
- **Business Logic**: Missing a course that will receive bids (False Negative) is more costly than incorrectly predicting a course will receive bids (False Positive)
- **Optimization**: Model trained to minimize missed bidding opportunities while maintaining reasonable precision
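
This section does not say how recall is favored during training. One common approach, shown here purely as an assumption, is to lower the decision threshold applied to the predicted bid probability. The labels and probabilities below are made up for illustration.

```python
def recall_precision(y_true, p_bid, threshold):
    """Recall and precision for the positive class (1 = course receives bids)."""
    predicted = [1 if p >= threshold else 0 for p in p_bid]
    tp = sum(1 for t, y in zip(y_true, predicted) if t == 1 and y == 1)
    fn = sum(1 for t, y in zip(y_true, predicted) if t == 1 and y == 0)
    fp = sum(1 for t, y in zip(y_true, predicted) if t == 0 and y == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Made-up labels and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
p_bid = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.32, 0.1]

for threshold in (0.5, 0.3):
    recall, precision = recall_precision(y_true, p_bid, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
# threshold=0.5: recall=0.50, precision=0.67
# threshold=0.3: recall=1.00, precision=0.57
```

Lowering the threshold removes the false negatives at the cost of more false positives, which matches the stated preference for not missing courses that will receive bids.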

### **Implementation Notes**
To run this V4 notebook:
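- Install required packages: `pip install catboost pandas numpy matplotlib seaborn scikit-learn scipy`. The prediction cells below also import `psycopg2` and `dotenv` (PyPI: `psycopg2-binary`, `python-dotenv`) and read `.xlsx` input, for which pandas needs `openpyxl`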
- Ensure you have the three required datasets:
  - Classification training/test data
  - Median bid regression training/test data
  - Min bid regression training/test data
- Models automatically save to `script_output_model_training/mode/` directory

### **V4 Advantages**
- **Comprehensive coverage**: Handles both bid opportunity detection and price prediction
- **Risk-aware predictions**: Asymmetric loss discourages dangerous under-bidding
- **Confidence-calibrated**: Provides uncertainty measures for better decision-making
- **Feature-rich analysis**: Cross-model feature importance for strategic insights
- **Production-ready**: All models saved with consistent interfaces for deployment

## **1. Setup**

In [1]:
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor, CatBoostClassifier
from pathlib import Path
import warnings
import os
import psycopg2
from dotenv import load_dotenv
import uuid
from datetime import datetime
from collections import defaultdict
import pickle
import json
import hashlib
from typing import List
import re
warnings.filterwarnings('ignore')

# Add database configuration
load_dotenv()
db_config = {
    'host': os.getenv('DB_HOST'),
    'database': os.getenv('DB_NAME'),
    'user': os.getenv('DB_USER'),
    'password': os.getenv('DB_PASSWORD'),
    'port': int(os.getenv('DB_PORT', 5432)),
    'gssencmode': 'disable'
}

# Create output directories
output_dir = Path('script_output/predictions')
output_dir.mkdir(parents=True, exist_ok=True)
cache_dir = Path('db_cache')
cache_dir.mkdir(parents=True, exist_ok=True)

## **2. SMUBiddingTransformer**

In [2]:
class SMUBiddingTransformer:
    """
    A reusable transformer class for processing SMU course bidding data
    optimized for CatBoost model.
    
    Uses categorical encoding for instructors and one-hot encoding for multi-valued days.
    
    Expected input columns:
    - course_code: str (e.g. 'MGMT715', 'COR-COMM175')
    - course_name: str
    - acad_year_start: int
    - term: str ('1', '2', '3A', '3B')
    - start_time: str (e.g. '19:30', 'TBA') - preserved as categorical
    - day_of_week: str (can be multivalued, e.g. 'Mon,Thu')
    - before_process_vacancy: int
    - bidding_window: str (e.g. 'Round 1 Window 1', 'Incoming Freshmen Rnd 1 Win 4')
    - instructor: str (can be multivalued, e.g. 'JOHN DOE, JANE SMITH')
    - section: str (required by fit())
    """
    
    def __init__(self):
        """
        Initialize the transformer for CatBoost optimization.
        
        Uses categorical encoding for instructors and one-hot encoding for days.
        """
        # Fitted flags
        self.is_fitted = False
        
        # Lists to track feature types for CatBoost
        self.categorical_features = []
        self.numeric_features = []
        
    def fit(self, df: pd.DataFrame) -> 'SMUBiddingTransformer':
        """
        Fit the transformer on training data.
        
        Parameters:
        -----------
        df : pandas.DataFrame
            Training dataframe with all required columns
        """
        # Validate required columns
        required_cols = [
            'course_code', 'course_name', 'acad_year_start', 'term',
            'start_time', 'day_of_week', 'before_process_vacancy',
            'bidding_window', 'instructor', 'section'
        ]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        
        print(f"Fitting transformer on {len(df)} rows...")
        
        self.is_fitted = True
        return self
    
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Transform the dataframe to CatBoost-ready format.
        """
        # Transforming requires a fitted transformer
        if not self.is_fitted:
            raise ValueError("Transformer must be fitted before transform. Call fit() first.")
        
        # Create a copy to avoid modifying original
        df_transformed = df.copy()
        
        # Reset feature tracking
        self.categorical_features = []
        self.numeric_features = []
        
        # 1. Extract course components (categorical + numeric)
        course_features = self._extract_course_features(df_transformed)
        
        # 2. Process bidding window (categorical + numeric)
        round_window_features = self._extract_round_window(df_transformed)
        
        # 3. Basic features (preserve categorical nature) + instructor as categorical
        basic_features = self._process_basic_features(df_transformed)
        
        # 4. Create day one-hot encoding
        day_features = self._create_day_one_hot_encoding(df_transformed)
        
        # Combine all feature groups column-wise
        feature_dfs = [course_features, round_window_features, basic_features, day_features]
        
        # Filter out any empty DataFrames
        feature_dfs = [features for features in feature_dfs if not features.empty]
        
        if not feature_dfs:
            raise ValueError("No features were extracted")
        
        # Concatenate all features
        final_df = pd.concat(feature_dfs, axis=1)
        
        # Verify all expected features are present
        expected_features = self.categorical_features + self.numeric_features
        missing_features = [f for f in expected_features if f not in final_df.columns]
        
        if missing_features:
            print(f"Warning: Missing features in final dataframe: {missing_features}")
            print(f"Available columns: {list(final_df.columns)}")
        
        # Debug: Print feature summary
        print(f"Transformed data shape: {final_df.shape}")
        print(f"Features included: {list(final_df.columns)[:10]}...")  # Show first 10
        
        return final_df
        
    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Fit the transformer and transform the data in one step."""
        self.fit(df)
        return self.transform(df)
    
    def get_categorical_features(self) -> List[str]:
        """Get list of categorical feature names for CatBoost."""
        return self.categorical_features.copy()
    
    def get_numeric_features(self) -> List[str]:
        """Get list of numeric feature names."""
        return self.numeric_features.copy()
    
    def _extract_course_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Extract subject area and catalogue number from course code."""
        features = pd.DataFrame(index=df.index)
        
        def split_course_code(code):
            """Split course code into subject area and catalogue number."""
            if pd.isna(code):
                return None, None
            
            code = str(code).strip().upper()
            
            # Handle hyphenated codes like 'COR-COMM175'
            if '-' in code:
                parts = code.split('-')
                if len(parts) >= 2:
                    subject = '-'.join(parts[:-1])
                    # Extract number from last part
                    num_match = re.search(r'(\d+)', parts[-1])
                    if num_match:
                        return subject, int(num_match.group(1))
                    else:
                        # Try extracting from full last part
                        num_match = re.search(r'(\d+)', code)
                        if num_match:
                            return subject, int(num_match.group(1))
            
            # Standard format like 'MGMT715'
            match = re.match(r'([A-Z\-]+)(\d+)', code)
            if match:
                return match.group(1), int(match.group(2))
            
            return code, 0
        
        # Extract components
        splits = df['course_code'].apply(split_course_code)
        features['subject_area'] = splits.apply(lambda x: x[0] if x else None)
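        # split_course_code returns (None, None) for missing codes; a tuple is always
        # truthy, so test the element itself to apply the 0 default
        features['catalogue_no'] = splits.apply(lambda x: x[1] if x[1] is not None else 0)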

        # Debug: Verify extraction
        print(f"Extracted course features: {features.shape}")
        print(f"Sample subject_area values: {features['subject_area'].head()}")
        print(f"Sample catalogue_no values: {features['catalogue_no'].head()}")

        # subject_area and catalogue_no are categorical for CatBoost
        self.categorical_features.extend(['subject_area', 'catalogue_no'])

        return features
    
    def _extract_round_window(self, df: pd.DataFrame) -> pd.DataFrame:
        """Extract round and window from bidding_window string."""
        features = pd.DataFrame(index=df.index)
        
        def parse_bidding_window(window_str):
            """Parse bidding window string into round and window number."""
            if pd.isna(window_str):
                return None, None
            
            window_str = str(window_str).strip()
            # Check for Incoming Freshmen FIRST (before other patterns)
            if 'Incoming Freshmen' in window_str:
                match = re.search(r'Rnd\s+(\d)\s+Win\s+(\d)', window_str, re.IGNORECASE)
                if match:
                    # Add F suffix to distinguish from regular rounds
                    return f"{match.group(1)}F", int(match.group(2))     
            
            # Pattern 1: Standard format
            match = re.search(r'Round\s+(\d[A-C]?)\s+Window\s+(\d)', window_str, re.IGNORECASE)
            if match:
                return match.group(1), int(match.group(2))
            
            # Pattern 2: Abbreviated format
            match = re.search(r'Rnd\s+(\d[A-C]?)\s+Win\s+(\d)', window_str, re.IGNORECASE)
            if match:
                return match.group(1), int(match.group(2))
            
            # Pattern 3: Incoming Exchange format (keeps original round)
            match = re.search(r'Incoming\s+Exchange\s+Rnd\s+(\w+)\s+Win\s+(\d+)', window_str, re.IGNORECASE)
            if match:
                return match.group(1), int(match.group(2))
            
            # Pattern 4: Incoming Freshmen format (adds F suffix)
            match = re.search(r'Incoming\s+Freshmen\s+Rnd\s+(\w+)\s+Win\s+(\d+)', window_str, re.IGNORECASE)
            if match:
                original_round = match.group(1)
                window_num = int(match.group(2))
                # Incoming Freshmen rounds get an "F" suffix (e.g. Round 1 -> 1F)
                round_str = f"{original_round}F"
                return round_str, window_num
            
            # Fallback patterns...
            match = re.search(r'(\d[A-C]?)', window_str)
            if match:
                win_match = re.search(r'Window\s+(\d)|Win\s+(\d)', window_str, re.IGNORECASE)
                if win_match:
                    window_num = int(win_match.group(1) or win_match.group(2))
                    return match.group(1), window_num
                return match.group(1), 1
            
            return '1', 1
        
        # Extract round and window
        parsed = df['bidding_window'].apply(parse_bidding_window)
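        # parse_bidding_window returns (None, None) for missing values; a tuple is always
        # truthy, so test the elements themselves to apply the defaults
        features['round'] = parsed.apply(lambda x: x[0] if x[0] is not None else '1')
        features['window'] = parsed.apply(lambda x: x[1] if x[1] is not None else 1)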
        
        # Round is a categorical label (e.g. 1, 1A, 1B, 2, 2A), not an ordered number
        self.categorical_features.append('round')
        
        # Window as numeric
        self.numeric_features.append('window')
        
        return features
    
    def _process_basic_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process basic features, preserving categorical nature where beneficial."""
        features = pd.DataFrame(index=df.index)
        
        # Numeric features
        features['before_process_vacancy'] = pd.to_numeric(
            df['before_process_vacancy'], errors='coerce'
        ).fillna(0)
        features['acad_year_start'] = pd.to_numeric(
            df['acad_year_start'], errors='coerce'
        ).fillna(2025)
        
        self.numeric_features.extend(['before_process_vacancy', 'acad_year_start'])
        
        # Categorical features
        features['term'] = df['term'].astype(str)
        features['start_time'] = df['start_time'].astype(str)
        features['course_name'] = df['course_name'].astype(str)
        features['section'] = df['section'].astype(str)
        
        # Process instructor names (remove duplicates, handle comma-separated format)
        features['instructor'] = df['instructor'].apply(self._process_instructor_names)

        # Replace empty strings with None for proper CatBoost handling
        features.loc[features['start_time'].isin(['', 'nan']), 'start_time'] = None
        features.loc[features['course_name'].isin(['', 'nan']), 'course_name'] = None
        features.loc[features['section'].isin(['', 'nan']), 'section'] = None
        
        self.categorical_features.extend(['term', 'start_time', 'course_name', 'section', 'instructor'])
        
        return features

    def _create_day_one_hot_encoding(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create one-hot encoding for days of the week."""
        features = pd.DataFrame(index=df.index)
        
        # Initialize all day columns as 0
        day_columns = ['has_mon', 'has_tue', 'has_wed', 'has_thu', 'has_fri', 'has_sat', 'has_sun']
        for col in day_columns:
            features[col] = 0
        
        # Day mapping
        day_abbrev = {
            'MONDAY': 'MON', 'TUESDAY': 'TUE', 'WEDNESDAY': 'WED',
            'THURSDAY': 'THU', 'FRIDAY': 'FRI', 'SATURDAY': 'SAT', 'SUNDAY': 'SUN',
            'MON': 'MON', 'TUE': 'TUE', 'WED': 'WED', 'THU': 'THU',
            'FRI': 'FRI', 'SAT': 'SAT', 'SUN': 'SUN'
        }
        
        day_to_column = {
            'MON': 'has_mon', 'TUE': 'has_tue', 'WED': 'has_wed', 'THU': 'has_thu',
            'FRI': 'has_fri', 'SAT': 'has_sat', 'SUN': 'has_sun'
        }
        
        # Process each row's day_of_week
        for idx, days_value in enumerate(df['day_of_week']):
            if pd.isna(days_value) or str(days_value).strip() == '':
                continue  # Leave all days as 0
            
            days_str = str(days_value).strip()
            
            # Handle JSON array format
            if days_str.startswith('[') and days_str.endswith(']'):
                try:
                    import json
                    days_list = json.loads(days_str)
                    if isinstance(days_list, list):
                        for day in days_list:
                            day_upper = str(day).strip().upper()
                            standardized_day = day_abbrev.get(day_upper, day_upper)
                            
                            if standardized_day in day_to_column:
                                features.loc[df.index[idx], day_to_column[standardized_day]] = 1
                except json.JSONDecodeError:
                    # Malformed JSON: leave all day flags at 0 for this row
                    pass
            else:
                # Handle comma-separated format (legacy support)
                for day in days_str.split(','):
                    day_upper = day.strip().upper()
                    standardized_day = day_abbrev.get(day_upper, day_upper)
                    
                    if standardized_day in day_to_column:
                        features.loc[df.index[idx], day_to_column[standardized_day]] = 1
        
        # These are numeric binary features (0/1)
        self.numeric_features.extend(day_columns)
        
        return features

    def get_feature_names(self) -> List[str]:
        """Get all feature names after transformation."""
        if not self.is_fitted:
            raise ValueError("Transformer must be fitted to get feature names.")
        
        return self.categorical_features + self.numeric_features
    
    def _process_instructor_names(self, instructor_input):
        """Process instructor names to ensure consistent JSON array format as categorical string."""
        # Handle list/array input
        if isinstance(instructor_input, (list, np.ndarray)):
            if len(instructor_input) == 0:
                return None
            # Convert list to JSON string format
            unique_instructors = []
            seen = set()
            for instructor in instructor_input:
                if pd.notna(instructor) and str(instructor).strip() and str(instructor).upper() != 'TBA':
                    instructor_clean = str(instructor).strip()
                    if instructor_clean not in seen:
                        seen.add(instructor_clean)
                        unique_instructors.append(instructor_clean)
            
            if unique_instructors:
                unique_instructors.sort()
                import json
                return json.dumps(unique_instructors)
            else:
                return None
        
        # Handle string input (original logic)
        if pd.isna(instructor_input) or str(instructor_input).strip() == '' or str(instructor_input).upper() == 'TBA':
            return None
        
        instructor_str = str(instructor_input).strip()
        
        # Check if it's already a JSON array
        if instructor_str.startswith('[') and instructor_str.endswith(']'):
            try:
                import json
                # Parse JSON array
                instructors = json.loads(instructor_str)
                if isinstance(instructors, list) and instructors:
                    # Remove duplicates and sort alphabetically
                    unique_instructors = []
                    seen = set()
                    for instructor in instructors:
                        instructor_clean = str(instructor).strip()
                        if instructor_clean and instructor_clean not in seen:
                            seen.add(instructor_clean)
                            unique_instructors.append(instructor_clean)
                    
                    unique_instructors.sort()  # Alphabetical order
                    
                    # Return as JSON string for categorical feature
                    return json.dumps(unique_instructors)
                else:
                    return None
            except json.JSONDecodeError:
                # If JSON parsing fails, treat as regular string
                pass
        
        # If not JSON format, treat as comma-separated string and convert to JSON
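        # Split on ", " only when the next token is an all-uppercase word
        # (names are expected in upper case, e.g. 'JOHN DOE, JANE SMITH')
        parts = re.split(r', (?=[A-Z]+(?:\s|,|$))', instructor_str)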
        
        seen = set()
        unique_parts = []
        for part in parts:
            part = part.strip()
            if part and part not in seen:
                seen.add(part)
                unique_parts.append(part)
        
        if unique_parts:
            unique_parts.sort()  # Alphabetical order
            import json
            return json.dumps(unique_parts)
        else:
            return None
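
To see what two of the transformer's steps derive from raw strings without building a full dataframe, here is a compact standalone sketch. It is a simplified restatement covering only the common input formats, not the class itself; `parse_window` and `day_flags` are illustrative names.

```python
import json
import re

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]

def parse_window(text):
    """Return (round, window), adding an 'F' suffix for Incoming Freshmen rounds."""
    match = re.search(
        r"(?:Round|Rnd)\s+(\d[A-C]?)\s+(?:Window|Win)\s+(\d)", text, re.IGNORECASE
    )
    if not match:
        return "1", 1  # same default the transformer falls back to
    round_label = match.group(1)
    if "Incoming Freshmen" in text:
        round_label += "F"
    return round_label, int(match.group(2))

def day_flags(value):
    """Map 'Mon,Thu' or '["Mon", "Thu"]' to has_mon ... has_sun flags (0/1)."""
    value = value.strip()
    days = json.loads(value) if value.startswith("[") else value.split(",")
    # The first three letters are enough to tell 'MONDAY' and 'MON' apart from other days
    present = {str(day).strip().upper()[:3] for day in days}
    return {f"has_{day.lower()}": int(day in present) for day in DAYS}

print(parse_window("Round 1 Window 1"))               # ('1', 1)
print(parse_window("Incoming Freshmen Rnd 1 Win 4"))  # ('1F', 4)
print(day_flags("Mon,Thu"))
```

The `F` suffix keeps Incoming Freshmen rounds as a separate category from the regular round with the same number, so the model can learn different price levels for each.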


## **3. Database Helper**

In [3]:
def connect_database():
    """Connect to PostgreSQL database"""
    load_dotenv()
    db_config = {
        'host': os.getenv('DB_HOST'),
        'database': os.getenv('DB_NAME'),
        'user': os.getenv('DB_USER'),
        'password': os.getenv('DB_PASSWORD'),
        'port': int(os.getenv('DB_PORT', 5432)),
        'gssencmode': 'disable'
    }
    
    try:
        connection = psycopg2.connect(**db_config)
        print("✅ Database connection established")
        return connection
    except Exception as e:
        print(f"❌ Database connection failed: {e}")
        return None

def load_or_cache_data(connection, cache_dir):
    """Load data from cache or database"""
    cache_files = {
        'courses': cache_dir / 'courses_cache.pkl',
        'classes': cache_dir / 'classes_cache.pkl',
        'acad_terms': cache_dir / 'acad_terms_cache.pkl',
        'professors': cache_dir / 'professors_cache.pkl'
    }
    
    data_cache = {}
    
    # Try loading from cache first
    if all(f.exists() for f in cache_files.values()):
        print("✅ Loading from cache...")
        for key, file in cache_files.items():
            data_cache[key] = pd.read_pickle(file)
    else:
        print("📥 Downloading from database...")
        queries = {
            'courses': "SELECT * FROM courses",
            'classes': "SELECT * FROM classes",
            'acad_terms': "SELECT * FROM acad_term",
            'professors': "SELECT * FROM professors"
        }
        
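        if connection is None:
            raise RuntimeError("Cache is incomplete and no database connection is available")
        
        for key, query in queries.items():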
            df = pd.read_sql_query(query, connection)
            df.to_pickle(cache_files[key])
            data_cache[key] = df
    
    return data_cache
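
The cache-or-fetch pattern used by `load_or_cache_data` can be tried in isolation with the sketch below. It uses a temporary directory and a fake query function in place of the database, and `load_or_fetch` and `fake_query` are hypothetical names. Unlike the function above, which re-downloads all four tables when any cache file is missing, this variant handles one file at a time.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_fetch(cache_file: Path, fetch):
    """Return cached data if the file exists; otherwise call fetch() and cache the result."""
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    data = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(data, fh)
    return data

calls = []

def fake_query():
    """Stand-in for a database query; records each call."""
    calls.append(1)
    return [{"id": 1, "code": "MGMT715"}]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "courses_cache.pkl"
    first = load_or_fetch(cache_file, fake_query)   # cache miss: runs the query
    second = load_or_fetch(cache_file, fake_query)  # cache hit: reads the pickle

print(len(calls))  # 1
```

Neither version invalidates the cache, so the files in `db_cache` need to be deleted manually whenever the database contents change.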

In [4]:
# Cell 4: Data Preparation Functions
def prepare_prediction_data(raw_data_path='script_input/raw_data.xlsx'):
    """Prepare data for prediction from raw_data.xlsx"""
    print("📂 Loading raw data...")
    
    # Load sheets
    standalone_df = pd.read_excel(raw_data_path, sheet_name='standalone')
    multiple_df = pd.read_excel(raw_data_path, sheet_name='multiple')
    
    # Filter for classes with bidding data
    bidding_data = standalone_df[
        standalone_df['bidding_window'].notna() & 
        standalone_df['total'].notna()
    ].copy()
    
    # Calculate before_process_vacancy
    bidding_data['before_process_vacancy'] = bidding_data['total'] - bidding_data['current_enrolled']
    
    # Extract round and window from bidding_window
    def parse_bidding_window(window_str):
        if pd.isna(window_str):
            return None, None
        
        # Most specific pattern first: "Incoming Freshmen Rnd 1 Win 4" also matches the
        # generic "Rnd ... Win ..." pattern, which would drop the F suffix
        patterns = [
            (r'Incoming\s+Freshmen\s+Rnd\s+(\w+)\s+Win\s+(\d+)', lambda m: (f"{m.group(1)}F", int(m.group(2)))),
            (r'Round\s+(\w+)\s+Window\s+(\d+)', lambda m: (m.group(1), int(m.group(2)))),
            (r'Rnd\s+(\w+)\s+Win\s+(\d+)', lambda m: (m.group(1), int(m.group(2))))
        ]
        
        for pattern, extractor in patterns:
            match = re.search(pattern, str(window_str), re.IGNORECASE)
            if match:
                return extractor(match)
        return '1', 1
    
    bidding_data[['round', 'window']] = bidding_data['bidding_window'].apply(
        lambda x: pd.Series(parse_bidding_window(x))
    )
    
    # Get instructor information from multiple sheet
    instructor_map = {}
    for record_key, group in multiple_df.groupby('record_key'):
        professors = group['professor_name'].dropna().unique()
        if len(professors) > 0:
            instructor_map[record_key] = professors.tolist()
    
    # Map instructors to bidding data
    bidding_data['instructor'] = bidding_data['record_key'].map(
        lambda x: instructor_map.get(x, [])
    )
    
    # Get day of week information
    day_map = {}
    for record_key, group in multiple_df[multiple_df['type'] == 'CLASS'].groupby('record_key'):
        days = group['day_of_week'].dropna().unique()
        if len(days) > 0:
            day_map[record_key] = ', '.join(days)
    
    bidding_data['day_of_week'] = bidding_data['record_key'].map(
        lambda x: day_map.get(x, '')
    )
    
    # Get start time
    time_map = {}
    for record_key, group in multiple_df[multiple_df['type'] == 'CLASS'].groupby('record_key'):
        times = group['start_time'].dropna()
        if len(times) > 0:
            time_map[record_key] = times.iloc[0]
    
    bidding_data['start_time'] = bidding_data['record_key'].map(
        lambda x: time_map.get(x, '')
    )
    
    return bidding_data, standalone_df, multiple_df

def map_classes_to_predictions(bidding_data, data_cache):
    """Map predictions to class IDs - checks both database cache and new_classes.csv"""
    courses_df = data_cache['courses']
    classes_df = data_cache['classes']
    
    # Create course code to ID mapping from both sources
    course_id_map = dict(zip(courses_df['code'], courses_df['id']))
    
    # Also check new_courses.csv for courses not in database yet
    new_courses_path = Path('script_output/new_courses.csv')
    if new_courses_path.exists():
        try:
            new_courses_df = pd.read_csv(new_courses_path)
            for _, row in new_courses_df.iterrows():
                if row['code'] not in course_id_map:
                    course_id_map[row['code']] = row['id']
            print(f"📚 Added {len(new_courses_df)} courses from new_courses.csv")
        except Exception as e:
            print(f"⚠️ Could not load new_courses.csv: {e}")
    
    # Also check verify folder for new courses
    verify_courses_path = Path('script_output/verify/new_courses.csv')
    if verify_courses_path.exists():
        try:
            verify_courses_df = pd.read_csv(verify_courses_path)
            for _, row in verify_courses_df.iterrows():
                if row['code'] not in course_id_map:
                    course_id_map[row['code']] = row['id']
            print(f"📚 Added {len(verify_courses_df)} courses from verify/new_courses.csv")
        except Exception as e:
            print(f"⚠️ Could not load verify/new_courses.csv: {e}")
    
    # Load new_classes.csv to find classes not in database yet
    new_classes_df = None
    new_classes_path = Path('script_output/new_classes.csv')
    if new_classes_path.exists():
        try:
            new_classes_df = pd.read_csv(new_classes_path)
            print(f"📚 Loaded {len(new_classes_df)} classes from new_classes.csv")
        except Exception as e:
            print(f"⚠️ Could not load new_classes.csv: {e}")
    
    # Map each row to class IDs
    class_mappings = []
    unmapped_courses = set()
    
    for idx, row in bidding_data.iterrows():
        course_code = row['course_code']
        section = str(row['section'])
        acad_term_id = row['acad_term_id']
        record_key = row.get('record_key', '')
        
        # Get course ID
        course_id = course_id_map.get(course_code)
        if not course_id:
            unmapped_courses.add(course_code)
            continue
        
        found_in_db = False
        found_in_new = False
        
        # First, try to find in database cache
        matching_classes = classes_df[
            (classes_df['course_id'] == course_id) & 
            (classes_df['section'] == section) & 
            (classes_df['acad_term_id'] == acad_term_id)
        ]
        
        if not matching_classes.empty:
            found_in_db = True
            for _, class_row in matching_classes.iterrows():
                mapping = {
                    'prediction_idx': idx,
                    'class_id': class_row['id'],
                    'professor_id': class_row.get('professor_id'),
                    'course_code': course_code,
                    'section': section,
                    'acad_term_id': acad_term_id,
                    'record_key': record_key,
                    'source': 'database'
                }
                class_mappings.append(mapping)
        
        # If not found in database, check new_classes.csv
        if not found_in_db and new_classes_df is not None:
            # Try matching by course_id, section, and acad_term_id
            new_matching = new_classes_df[
                (new_classes_df['course_id'] == course_id) & 
                (new_classes_df['section'] == section) & 
                (new_classes_df['acad_term_id'] == acad_term_id)
            ]
            
            if not new_matching.empty:
                found_in_new = True
                for _, class_row in new_matching.iterrows():
                    mapping = {
                        'prediction_idx': idx,
                        'class_id': class_row['id'],
                        'professor_id': class_row.get('professor_id'),
                        'course_code': course_code,
                        'section': section,
                        'acad_term_id': acad_term_id,
                        'record_key': record_key,
                        'source': 'new_classes'
                    }
                    class_mappings.append(mapping)
        
        # If still not found anywhere
        if not found_in_db and not found_in_new:
            # Create a placeholder mapping
            mapping = {
                'prediction_idx': idx,
                'class_id': f"PENDING_{course_code}_{section}_{acad_term_id}",
                'professor_id': None,
                'course_code': course_code,
                'section': section,
                'acad_term_id': acad_term_id,
                'record_key': record_key,
                'source': 'not_found'
            }
            class_mappings.append(mapping)
    
    # Create summary
    mappings_df = pd.DataFrame(class_mappings)
    
    if not mappings_df.empty:
        print(f"\n📊 Mapping Summary:")
        print(f"   Total mappings: {len(mappings_df)}")
        print(f"   Unique predictions mapped: {mappings_df['prediction_idx'].nunique()}")
        source_counts = mappings_df['source'].value_counts()
        for source, count in source_counts.items():
            print(f"   From {source}: {count}")
        
        if unmapped_courses:
            print(f"\n⚠️ Courses without IDs: {len(unmapped_courses)}")
            print(f"   Sample: {list(unmapped_courses)[:5]}")
    
    return mappings_df

In [5]:
# Cell 5: Load and Prepare Data
print("="*60)
print("DATA LOADING AND PREPARATION")
print("="*60)

# Connect to database
connection = connect_database()
if not connection:
    raise ConnectionError("Failed to connect to database")

# Load cache data
data_cache = load_or_cache_data(connection, cache_dir)

# Prepare prediction data
bidding_data, standalone_df, multiple_df = prepare_prediction_data()
print(f"📊 Prepared {len(bidding_data)} records for prediction")

# Display sample data
print("\nSample bidding data:")
print(bidding_data.head())

DATA LOADING AND PREPARATION
✅ Database connection established
✅ Loading from cache...
📂 Loading raw data...
📊 Prepared 1165 records for prediction

Sample bidding data:
                                          record_key  \
0  SelectedAcadTerm=2510&SelectedClassNumber=1002...   
1  SelectedAcadTerm=2510&SelectedClassNumber=1003...   
2  SelectedAcadTerm=2510&SelectedClassNumber=1004...   
3  SelectedAcadTerm=2510&SelectedClassNumber=1005...   
4  SelectedAcadTerm=2510&SelectedClassNumber=1006...   

                                            filepath course_code section  \
0  script_input/classTimingsFull\2025-26_T1\Selec...     THES720      G1   
1  script_input/classTimingsFull\2025-26_T1\Selec...     OBHR701      G1   
2  script_input/classTimingsFull\2025-26_T1\Selec...     FNCE710      G1   
3  script_input/classTimingsFull\2025-26_T1\Selec...  LAW103_603      G1   
4  script_input/classTimingsFull\2025-26_T1\Selec...  LAW103_603     G61   

            course_name             

In [6]:
# Cell 6: Transform Data (FIXED)
def transform_bidding_data(bidding_data, output_dir):
    print("="*60)
    print("FEATURE TRANSFORMATION")
    print("="*60)
    
    # Initialize transformer
    transformer = SMUBiddingTransformer()
    transformer.fit(bidding_data)
    
    # Transform data
    X_transformed = transformer.transform(bidding_data)
    print(f"✅ Transformed data shape: {X_transformed.shape}")
    print(f"📋 Features: {list(X_transformed.columns)[:10]}...")
    
    # Ensure all categorical features have __NA__ for null values
    categorical_features = transformer.get_categorical_features()
    for col in categorical_features:
        if col in X_transformed.columns:
            # Convert to string type first
            X_transformed[col] = X_transformed[col].astype(str)
            # Replace 'nan' strings with a consistent placeholder
            X_transformed[col] = X_transformed[col].replace('nan', '__NA__')
            # Also handle any remaining NaN values
            X_transformed[col] = X_transformed[col].fillna('__NA__')
            # Handle empty strings
            X_transformed[col] = X_transformed[col].replace('', '__NA__')
    
    # Save transformed data to CSV
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    transformed_output_path = output_dir / f'transformed_features_{timestamp}.csv'
    
    # Add the original identifiers to help with tracking
    X_transformed_with_ids = X_transformed.copy()
    X_transformed_with_ids['course_code'] = bidding_data['course_code'].values
    X_transformed_with_ids['section'] = bidding_data['section'].values
    X_transformed_with_ids['acad_term_id'] = bidding_data['acad_term_id'].values
    X_transformed_with_ids['record_key'] = bidding_data['record_key'].values
    
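    # Reorder columns to put identifiers first.
    # Note: 'section' is both an identifier and a model feature, so it is excluded
    # from feature_cols here and the feature count reported below is one lower
    # than X_transformed.shape[1].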
    id_cols = ['record_key', 'course_code', 'section', 'acad_term_id']
    feature_cols = [col for col in X_transformed.columns if col not in id_cols]
    X_transformed_with_ids = X_transformed_with_ids[id_cols + feature_cols]
    
    # Save to CSV
    X_transformed_with_ids.to_csv(transformed_output_path, index=False)
    print(f"\n💾 Transformed features saved to: {transformed_output_path}")
    print(f"   Total columns: {len(X_transformed_with_ids.columns)}")
    print(f"   - Identifier columns: {len(id_cols)}")
    print(f"   - Feature columns: {len(feature_cols)}")
    print(f"   - Categorical features: {len(transformer.get_categorical_features())}")
    print(f"   - Numeric features: {len(transformer.get_numeric_features())}")
    
    # Also save a metadata file with feature information
    metadata = {
        'timestamp': timestamp,
        'total_rows': len(X_transformed),
        'total_features': len(feature_cols),
        'categorical_features': transformer.get_categorical_features(),
        'numeric_features': transformer.get_numeric_features(),
        'identifier_columns': id_cols,
        'transformation_date': datetime.now().isoformat()
    }
    
    metadata_path = output_dir / f'transformation_metadata_{timestamp}.json'
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)
    print(f"📋 Metadata saved to: {metadata_path}")
    
    # Display sample of transformed data
    print("\n🔍 Sample of transformed data:")
    print(X_transformed_with_ids.head())
    
    return X_transformed, transformer

# Run Transformation
X_transformed, transformer = transform_bidding_data(bidding_data, output_dir)

FEATURE TRANSFORMATION
Fitting transformer on 1165 rows...
Extracted course features: (1165, 2)
Sample subject_area values: 0    THES
1    OBHR
2    FNCE
3     LAW
4     LAW
Name: subject_area, dtype: object
Sample catalogue_no values: 0    720
1    701
2    710
3    103
4    103
Name: catalogue_no, dtype: int64
Transformed data shape: (1165, 18)
Features included: ['subject_area', 'catalogue_no', 'round', 'window', 'before_process_vacancy', 'acad_year_start', 'term', 'start_time', 'course_name', 'section']...
✅ Transformed data shape: (1165, 18)
📋 Features: ['subject_area', 'catalogue_no', 'round', 'window', 'before_process_vacancy', 'acad_year_start', 'term', 'start_time', 'course_name', 'section']...

💾 Transformed features saved to: script_output\predictions\transformed_features_20250628_101742.csv
   Total columns: 21
   - Identifier columns: 4
   - Feature columns: 17
   - Categorical features: 8
   - Numeric features: 10
📋 Metadata saved to: script_output\predictions\transformat

In [7]:
# Cell 7: Map to Classes
print("="*60)
print("CLASS MAPPING")
print("="*60)

# Map to classes
class_mappings = map_classes_to_predictions(bidding_data, data_cache)
print(f"🔗 Mapped to {len(class_mappings)} class instances")

# Display sample mappings
print("\nSample class mappings:")
print(class_mappings.head())

CLASS MAPPING
📚 Added 19 courses from verify/new_courses.csv
📚 Loaded 1205 classes from new_classes.csv

📊 Mapping Summary:
   Total mappings: 1205
   Unique predictions mapped: 1165
   From new_classes: 1205
🔗 Mapped to 1205 class instances

Sample class mappings:
   prediction_idx                              class_id  \
0               0  570e6d3c-111e-4f4b-a624-ec3b154af0b8   
1               1  72d2235b-48ec-4c20-abca-e6465d12fe64   
2               2  2b9599aa-27ab-44b9-bb7c-301e258316ac   
3               3  07841c5b-5d9b-4ea6-858e-c6026b052fe5   
4               4  8c31aac8-00de-4663-a025-84894b676116   

                           professor_id course_code section acad_term_id  \
0                                   NaN     THES720      G1   AY202526T1   
1  80e3f253-ef1f-46e2-93bf-356749da74bc     OBHR701      G1   AY202526T1   
2  2344a6fb-c450-4362-ae49-89ddad3fe6ee     FNCE710      G1   AY202526T1   
3  c680436e-1fa6-49d5-a325-da3deafa8dcb  LAW103_603      G1   AY202526T1   
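
The row-by-row lookup in `map_classes_to_predictions` can also be expressed as a join, which makes it easier to see why the mapping count (1205) can exceed the prediction count (1165): one prediction row may match several class rows. The sketch below is a minimal illustration on toy frames. It assumes `section` and `acad_term_id` are already comparable types on both sides and it ignores the database-then-`new_classes.csv` fallback order; every value in it is made up.

```python
import pandas as pd

# Toy prediction rows (already carrying a course_id) and toy class rows
toy_predictions = pd.DataFrame({
    'course_id': ['c1', 'c2', 'c3'],
    'section': ['G1', 'G1', 'G2'],
    'acad_term_id': ['AY202526T1'] * 3,
})
toy_classes = pd.DataFrame({
    'id': ['k1', 'k2', 'k3'],
    'course_id': ['c1', 'c2', 'c2'],   # c2/G1 has two class rows
    'section': ['G1', 'G1', 'G1'],
    'acad_term_id': ['AY202526T1'] * 3,
    'professor_id': ['p1', 'p2', 'p3'],
})

keys = ['course_id', 'section', 'acad_term_id']

# Keep the original row position as prediction_idx, then left-join on the keys
merged = (
    toy_predictions.rename_axis('prediction_idx').reset_index()
    .merge(toy_classes.rename(columns={'id': 'class_id'}),
           on=keys, how='left', indicator=True)
)

# Rows present only on the left side had no matching class
merged['matched'] = merged['_merge'] == 'both'

print(merged[['prediction_idx', 'class_id', 'professor_id', 'matched']])
print("mappings:", len(merged), "unique predictions:", merged['prediction_idx'].nunique())
```

A left join keeps unmatched predictions in the result, so they can still be given a placeholder `class_id` afterwards, as the loop version does.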

In [None]:
# Cell 8: Load Models and Generate Predictions (FIXED)
def load_models_and_predict(X_transformed, bidding_data):
    print("="*60)
    print("MODEL PREDICTIONS")
    print("="*60)
    
    # Load models
    models = {
        'classification': CatBoostClassifier(),
        'median': CatBoostRegressor(),
        'min': CatBoostRegressor()
    }
    
    model_paths = {
        'classification': 'script_output/models/classification/production_classification_model.cbm',
        'median': 'script_output/models/regression_median/production_regression_median_model.cbm',
        'min': 'script_output/models/regression_min/production_regression_min_model.cbm'
    }
    
    # Load each model
    for name, model in models.items():
        try:
            model.load_model(model_paths[name])
            print(f"✅ Loaded {name} model")
        except Exception as e:
            print(f"❌ Error loading {name} model: {e}")
            # Return a pair so the caller's two-value unpacking still works
            return None, None
    
    # Verify data format matches model expectations
    expected_features = [
        'subject_area', 'catalogue_no', 'round', 'window', 'before_process_vacancy',
        'acad_year_start', 'term', 'start_time', 'course_name', 'section', 'instructor',
        'has_mon', 'has_tue', 'has_wed', 'has_thu', 'has_fri', 'has_sat', 'has_sun'
    ]
    
    # Create prediction dataset with only the features expected by models
    prediction_data = X_transformed.copy()
    
    # Ensure all expected features are present
    missing_features = set(expected_features) - set(prediction_data.columns)
    if missing_features:
        print(f"⚠️ Warning: Missing features: {missing_features}")
    
    # Select only the features that exist and are expected
    available_features = [col for col in expected_features if col in prediction_data.columns]
    prediction_data = prediction_data[available_features]
    
    print(f"📊 Using {len(available_features)} features for prediction")
    print(f"🔮 Generating predictions for {len(prediction_data)} records...")
    
    try:
        # Classification predictions
        clf_pred = models['classification'].predict(prediction_data)
        clf_proba = models['classification'].predict_proba(prediction_data)
        
        # Regression predictions  
        median_pred = models['median'].predict(prediction_data)
        min_pred = models['min'].predict(prediction_data)
        
        print(f"✅ Generated predictions for {len(prediction_data)} records")
        
        # Collect results in a dictionary
        results = {
            'classification_prediction': clf_pred,
            'classification_probabilities': clf_proba,
            'median_prediction': median_pred,
            'min_prediction': min_pred
        }
        
        return results, models
        
    except Exception as e:
        print(f"❌ Error during prediction: {e}")
        print(f"   Data shape: {prediction_data.shape}")
        print(f"   Data types: {prediction_data.dtypes}")
        return None, models
    
# Load Models and Generate Predictions
prediction_results, loaded_models = load_models_and_predict(X_transformed, bidding_data)
if prediction_results:
    clf_pred = prediction_results['classification_prediction']
    clf_proba = prediction_results['classification_probabilities'] 
    median_pred = prediction_results['median_prediction']
    min_pred = prediction_results['min_prediction']
    
    print(f"\n📈 Prediction Summary:")
    print(f"   Classification predictions: {len(clf_pred)}")
    print(f"   Median predictions range: {min(median_pred):.2f} - {max(median_pred):.2f}")
    print(f"   Min predictions range: {min(min_pred):.2f} - {max(min_pred):.2f}")

MODEL PREDICTIONS
✅ Loaded classification model
✅ Loaded median model
✅ Loaded min model
📊 Using 18 features for prediction
🔮 Generating predictions for 1165 records...
✅ Generated predictions for 1165 records

📈 Prediction Summary:
   Classification predictions: 1165
   Median predictions range: 3.93 - 99.52
   Min predictions range: -2.32 - 66.04
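
The summary above shows a minimum-bid prediction of -2.32; a regression model has no built-in lower bound. If bids are assumed to be non-negative (an assumption about the bidding system, not something this notebook enforces), one option is to floor predictions at zero before they are scaled or reported. The helper name `floor_predictions` and the array values below are illustrative only.

```python
import numpy as np

def floor_predictions(predictions, lower_bound=0.0):
    """Clip predictions from below; return the clipped array and the number of values changed."""
    predictions = np.asarray(predictions, dtype=float)
    clipped = np.maximum(predictions, lower_bound)
    n_changed = int((predictions < lower_bound).sum())
    return clipped, n_changed

# Illustrative values spanning the range printed in the summary
toy_min_pred = np.array([-2.32, 0.0, 12.5, 66.04])
floored, n_floored = floor_predictions(toy_min_pred)
print(floored, n_floored)
```

Flooring before a multiplicative safety factor matters because scaling a negative value by `(1 + sf)` pushes it further below zero.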


In [10]:
# Cell 9: Calculate Uncertainties and Confidence
print("="*60)
print("UNCERTAINTY QUANTIFICATION")
print("="*60)

def calculate_entropy_confidence(probabilities):
    """Return (score, level) per row, where score = 1 - entropy / log(n_classes)."""
    epsilon = 1e-10
    entropy = -np.sum(probabilities * np.log(probabilities + epsilon), axis=1)
    max_entropy = -np.log(1/probabilities.shape[1])
    confidence_score = 1 - (entropy / max_entropy)
    
    confidence_levels = np.where(
        confidence_score >= 0.9, 'Very High',
        np.where(confidence_score >= 0.7, 'High',
                np.where(confidence_score >= 0.5, 'Medium',
                        np.where(confidence_score >= 0.3, 'Low', 'Very Low')))
    )
    return confidence_score, confidence_levels

# Calculate classification confidence
confidence_scores, confidence_levels = calculate_entropy_confidence(clf_proba)

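# Estimate regression uncertainties from the spread of partial predictions across
# consecutive tree subsets (a manual, virtual-ensemble-style heuristic, not a calibrated interval)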
uncertainties = {}
for model_name in ['median', 'min']:
    model = loaded_models[model_name]
    n_trees = model.tree_count_
    n_subsets = 10
    trees_per_subset = max(1, n_trees // n_subsets)
    
    subset_predictions = []
    for i in range(n_subsets):
        tree_start = i * trees_per_subset
        tree_end = min((i + 1) * trees_per_subset, n_trees)
        if tree_start < n_trees:
            partial_pred = model.predict(X_transformed, 
                                       ntree_start=tree_start, 
                                       ntree_end=tree_end)
            subset_predictions.append(partial_pred)
    
    uncertainties[model_name] = np.std(subset_predictions, axis=0)

print("✅ Calculated prediction uncertainties")

UNCERTAINTY QUANTIFICATION
✅ Calculated prediction uncertainties
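
To make the confidence buckets concrete, the sketch below re-implements the entropy score from Cell 9 on a few hand-picked binary probability rows. The formula and thresholds are the ones used above; the input rows and the helper names `entropy_confidence` and `confidence_level` are illustrative assumptions, not model output.

```python
import numpy as np

def entropy_confidence(probabilities, epsilon=1e-10):
    """Return 1 - H(p) / log(n_classes) for each row of class probabilities."""
    probabilities = np.asarray(probabilities, dtype=float)
    entropy = -np.sum(probabilities * np.log(probabilities + epsilon), axis=1)
    max_entropy = np.log(probabilities.shape[1])
    return 1 - entropy / max_entropy

def confidence_level(score):
    """Bucket a score with the same thresholds as calculate_entropy_confidence."""
    for threshold, label in [(0.9, 'Very High'), (0.7, 'High'),
                             (0.5, 'Medium'), (0.3, 'Low')]:
        if score >= threshold:
            return label
    return 'Very Low'

# Illustrative rows: [P(no bid), P(bid)]
toy_proba = [[0.5, 0.5], [0.2, 0.8], [0.1, 0.9], [0.01, 0.99]]
toy_scores = entropy_confidence(toy_proba)
toy_levels = [confidence_level(s) for s in toy_scores]

for row, score, level in zip(toy_proba, toy_scores, toy_levels):
    print(row, round(float(score), 3), level)
```

Because the score is normalized entropy rather than the winning-class probability, it falls off quickly: in this sketch a 90% probability lands in `Medium` and an 80% probability in `Very Low`.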


In [11]:
# Cell 10: Apply Safety Factors
print("="*60)
print("SAFETY FACTOR APPLICATION")
print("="*60)

# Load safety factor tables
median_sf_df = pd.read_csv('script_output/models/regression_median/median_bid_safety_factor_analysis.csv')
min_sf_df = pd.read_csv('script_output/models/regression_min/min_bid_safety_factor_analysis.csv')

# Find optimal safety factors (example: smallest SF with TPR > 0.9)
# idxmin() returns an index label, so look the row up with .loc rather than .iloc
median_optimal_idx = median_sf_df.loc[median_sf_df['tpr'] > 0.9, 'safety_factor'].idxmin()
min_optimal_idx = min_sf_df.loc[min_sf_df['tpr'] > 0.9, 'safety_factor'].idxmin()

median_optimal_sf = median_sf_df.loc[median_optimal_idx, 'safety_factor']
min_optimal_sf = min_sf_df.loc[min_optimal_idx, 'safety_factor']

print(f"📊 Optimal safety factors:")
print(f"   Median: {median_optimal_sf:.2f}")
print(f"   Min: {min_optimal_sf:.2f}")

# Apply safety factors
median_recommended = median_pred * (1 + median_optimal_sf)
min_recommended = min_pred * (1 + min_optimal_sf)

SAFETY FACTOR APPLICATION
📊 Optimal safety factors:
   Median: 0.70
   Min: 0.70
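
The selection rule in Cell 10 (the smallest safety factor whose TPR exceeds 0.9) can be sketched on a toy table. The column names `safety_factor` and `tpr` come from the CSVs loaded above; the values, the helper name `pick_safety_factor`, and the fallback to the largest factor when no row qualifies are assumptions of this sketch.

```python
import pandas as pd

def pick_safety_factor(sf_df, min_tpr=0.9):
    """Return the smallest safety factor whose TPR exceeds min_tpr.

    Falls back to the largest available factor when no row qualifies
    (an assumption of this sketch, not behavior taken from the notebook).
    """
    eligible = sf_df.loc[sf_df['tpr'] > min_tpr, 'safety_factor']
    if eligible.empty:
        return float(sf_df['safety_factor'].max())
    return float(eligible.min())

# Made-up table with a non-default index: the lookup does not rely on row positions
toy_sf = pd.DataFrame(
    {'safety_factor': [0.1, 0.4, 0.7, 1.0], 'tpr': [0.62, 0.85, 0.93, 0.97]},
    index=[10, 11, 12, 13],
)

chosen_sf = pick_safety_factor(toy_sf)
fallback_sf = pick_safety_factor(toy_sf, min_tpr=0.99)
print(chosen_sf, fallback_sf)
```

Taking the minimum of the eligible values directly avoids any index bookkeeping and makes the empty case explicit.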


In [12]:
# Cell 11: Create Output DataFrames
print("="*60)
print("CREATING OUTPUT TABLES")
print("="*60)

# Create PredictionResult entries
prediction_results = []

for idx in range(len(bidding_data)):
    # Create input hash for deduplication
    input_data = bidding_data.iloc[idx]
    input_str = f"{input_data['course_code']}_{input_data['term']}_{input_data['round']}_{input_data['window']}"
    input_hash = hashlib.sha256(input_str.encode()).hexdigest()
    
    prediction_result = {
        'id': str(uuid.uuid4()),
        'input_hash': input_hash,
        'model_version': 'v4.0',
        'clf_predicted': bool(clf_pred[idx]),
        'clf_prob_no_bid': float(clf_proba[idx, 0]),
        'clf_prob_bid': float(clf_proba[idx, 1]),
        'clf_confidence_score': float(confidence_scores[idx]),
        'clf_confidence_level': confidence_levels[idx],
        'median_predicted': float(median_pred[idx]),
        'median_lower_95ci': float(median_pred[idx] - 1.96 * uncertainties['median'][idx]),
        'median_upper_95ci': float(median_pred[idx] + 1.96 * uncertainties['median'][idx]),
        'median_uncertainty': float(uncertainties['median'][idx]),
        'median_recommended': float(median_recommended[idx]),
        'min_predicted': float(min_pred[idx]),
        'min_lower_95ci': float(min_pred[idx] - 1.96 * uncertainties['min'][idx]),
        'min_upper_95ci': float(min_pred[idx] + 1.96 * uncertainties['min'][idx]),
        'min_uncertainty': float(uncertainties['min'][idx]),
        'min_recommended': float(min_recommended[idx]),
        'safety_factor_median': float(median_optimal_sf),
        'safety_factor_min': float(min_optimal_sf),
        'recommendations': json.dumps({
            'action': 'bid' if clf_pred[idx] else 'skip',
            'suggested_bid': float(median_recommended[idx]),
            'minimum_safe_bid': float(min_recommended[idx])
        }),
        'created_at': datetime.now().isoformat(),
        'updated_at': datetime.now().isoformat()
    }
    prediction_results.append(prediction_result)

prediction_results_df = pd.DataFrame(prediction_results)
print(f"✅ Created {len(prediction_results_df)} prediction results")

CREATING OUTPUT TABLES
✅ Created 1165 prediction results
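
The `input_hash` in Cell 11 is built from course code, term, round, and window. If the hash is meant to distinguish individual sections (an assumption; the notebook does not say), two sections of the same course would share a hash under that key. The sketch below compares the two key choices. The helper `make_input_hash` and the sample field values are hypothetical.

```python
import hashlib

def make_input_hash(record, fields):
    """Hash the selected fields of a record, joined with underscores."""
    key = "_".join(str(record[name]) for name in fields)
    return hashlib.sha256(key.encode()).hexdigest()

base_fields = ['course_code', 'term', 'round', 'window']
section_fields = base_fields + ['section']

# Two sections of the same course in the same bidding window (illustrative values)
rec_g1 = {'course_code': 'LAW103_603', 'term': '1', 'round': '1',
          'window': 1, 'section': 'G1'}
rec_g61 = {'course_code': 'LAW103_603', 'term': '1', 'round': '1',
           'window': 1, 'section': 'G61'}

same_without_section = (make_input_hash(rec_g1, base_fields)
                        == make_input_hash(rec_g61, base_fields))
same_with_section = (make_input_hash(rec_g1, section_fields)
                     == make_input_hash(rec_g61, section_fields))
print(same_without_section, same_with_section)
```

Whether the shared hash is a problem depends on how `input_hash` is used downstream; if it only groups predictions by course and window, the original key may be exactly what is wanted.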


In [13]:
# Cell 12: Map Predictions to Classes
print("="*60)
print("CLASS-PREDICTION MAPPING")
print("="*60)

# Create class-prediction mappings
class_predictions = []

for _, mapping in class_mappings.iterrows():
    pred_idx = mapping['prediction_idx']
    if pred_idx < len(prediction_results_df):
        pred_result = prediction_results_df.iloc[pred_idx]
        
        class_pred = {
            'class_id': mapping['class_id'],
            'prediction_result_id': pred_result['id'],
            'course_code': mapping['course_code'],
            'section': mapping['section'],
            'acad_term_id': mapping['acad_term_id'],
            'professor_id': mapping.get('professor_id')
        }
        class_predictions.append(class_pred)

class_predictions_df = pd.DataFrame(class_predictions)
print(f"✅ Created {len(class_predictions_df)} class-prediction mappings")

CLASS-PREDICTION MAPPING
✅ Created 1205 class-prediction mappings


In [14]:
# Cell 13: Create Safety Factor Table
print("="*60)
print("SAFETY FACTOR TABLE")
print("="*60)

safety_factor_entries = []

# Process median safety factors
for _, row in median_sf_df.iterrows():
    entry = {
        'id': str(uuid.uuid4()),
        'model_version': 'v4.0',
        'prediction_type': 'median',
        'safety_factor': float(row['safety_factor']),
        'tpr': float(row['tpr']),
        'mean_loss': float(row['mean_loss']),
        'under_prediction_rate': None,
        'mae': float(row['mae']),
        'mse': float(row['mse']),
        'created_at': datetime.now().isoformat()
    }
    safety_factor_entries.append(entry)

# Process min safety factors
for _, row in min_sf_df.iterrows():
    entry = {
        'id': str(uuid.uuid4()),
        'model_version': 'v4.0',
        'prediction_type': 'min',
        'safety_factor': float(row['safety_factor']),
        'tpr': float(row['tpr']),
        'mean_loss': float(row['mean_loss']),
        'under_prediction_rate': float(row.get('under_prediction_rate', 0)),
        'mae': float(row['mae']),
        'mse': float(row['mse']),
        'created_at': datetime.now().isoformat()
    }
    safety_factor_entries.append(entry)

safety_factor_df = pd.DataFrame(safety_factor_entries)
print(f"✅ Created {len(safety_factor_df)} safety factor entries")

SAFETY FACTOR TABLE
✅ Created 22 safety factor entries


In [15]:
# Cell 14: Save Results
print("="*60)
print("SAVING RESULTS")
print("="*60)

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Save all outputs
prediction_results_df.to_csv(output_dir / f'prediction_results_{timestamp}.csv', index=False)
class_predictions_df.to_csv(output_dir / f'class_predictions_{timestamp}.csv', index=False)
safety_factor_df.to_csv(output_dir / f'safety_factor_table_{timestamp}.csv', index=False)

# Create summary report
summary = {
    'timestamp': timestamp,
    'total_predictions': len(prediction_results_df),
    'total_class_mappings': len(class_predictions_df),
    'unique_courses': bidding_data['course_code'].nunique(),
    'unique_terms': bidding_data['acad_term_id'].nunique(),
    'predictions_with_bids': int(clf_pred.sum()),
    'predictions_without_bids': int(len(clf_pred) - clf_pred.sum()),
    'median_bid_range': {
        'min': float(median_pred.min()),
        'max': float(median_pred.max()),
        'mean': float(median_pred.mean())
    },
    'min_bid_range': {
        'min': float(min_pred.min()),
        'max': float(min_pred.max()),
        'mean': float(min_pred.mean())
    }
}

with open(output_dir / f'prediction_summary_{timestamp}.json', 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\n✅ Batch prediction completed!")
print(f"📁 Results saved to {output_dir}")
print(f"   - Predictions: {len(prediction_results_df)} records")
print(f"   - Class mappings: {len(class_predictions_df)} records")
print(f"   - Safety factors: {len(safety_factor_df)} entries")
print(f"\n📊 Summary:")
print(f"   - Courses with bids: {summary['predictions_with_bids']}")
print(f"   - Courses without bids: {summary['predictions_without_bids']}")
print(f"   - Median bid range: {summary['median_bid_range']['min']:.0f} - {summary['median_bid_range']['max']:.0f}")

# Close database connection
if connection:
    connection.close()
    print("\n🔒 Database connection closed")

SAVING RESULTS

✅ Batch prediction completed!
📁 Results saved to script_output\predictions
   - Predictions: 1165 records
   - Class mappings: 1205 records
   - Safety factors: 22 entries

📊 Summary:
   - Courses with bids: 1057
   - Courses without bids: 108
   - Median bid range: 4 - 100

🔒 Database connection closed
