# **SMU Course Bidding Data Preprocessing**

<div style="background-color:#DFFFD6; padding:12px; border-radius:5px; border: 1px solid #228B22;">
    
  <h2 style="color:#006400;">✅ Looking to Implement This? ✅</h2>
  
  <p>🚀 Get started quickly by using <strong><a href="example_prediction.ipynb">example_prediction.ipynb</a></strong>.</p>
  
  <ul>
    <li>📌 <strong>Pre-trained CatBoost model (<code>.cbm</code>) available for instant predictions.</strong></li>
    <li>🔧 Includes <strong>step-by-step instructions</strong> for making predictions.</li>
    <li>⚡ Works <strong>out-of-the-box</strong>: just load the model and start predicting!</li>
  </ul>

  <h3>🔗 📌 Next Steps:</h3>
  <p>👉 <a href="example_prediction.ipynb"><strong>Go to Example Prediction Notebook</strong></a></p>

</div>

### **Changes in V4**
- Feature selection of the top variables.
- Reusable data transformer.
- Data is extracted from `raw_data` rather than directly from HTML.
- Training data is produced by passing the extracted data into the data transformer.


### **Objective**
This notebook performs the following steps:
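1. Extracts data from the raw data file and the BOSS results.
2. Transforms the data and splits it into training and test sets based on `acad_year_start` and `term`. Splitting by time rather than at random keeps later terms out of the training set, which limits leakage caused by autocorrelation between terms.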
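
A minimal sketch of such a time-based split, assuming everything from a chosen (`acad_year_start`, `term`) pair onwards is held out as the test set. The helper name `split_by_term` and the `TERM_ORDER` mapping are illustrative assumptions, not code from this notebook.

```python
import pandas as pd

# Assumed chronological order of terms within one academic year.
TERM_ORDER = {"1": 0, "2": 1, "3A": 2, "3B": 3}

def split_by_term(df: pd.DataFrame, test_year: int, test_term: str):
    """Hold out one (year, term) and everything after it as the test set."""
    # Sortable period index: year * 10 + position of the term within the year.
    period = df["acad_year_start"].astype(int) * 10 + df["term"].astype(str).map(TERM_ORDER)
    cutoff = test_year * 10 + TERM_ORDER[test_term]
    train = df[period < cutoff]
    test = df[period >= cutoff]
    return train, test

# Made-up rows, purely for illustration.
demo = pd.DataFrame({
    "acad_year_start": [2023, 2023, 2024, 2024],
    "term": ["1", "2", "1", "2"],
    "median_bid": [25.0, 30.0, 28.0, 35.0],
})
train_df, test_df = split_by_term(demo, test_year=2024, test_term="2")
print(len(train_df), len(test_df))
```

Rows whose term is not in `TERM_ORDER` end up in neither set in this sketch, so unusual term labels would need to be mapped first.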

### **Requirements**
- Python 3.x
- Pandas, NumPy

---



## **1. Setup**

In [1]:
import pandas as pd
import os
import re
import glob
from datetime import datetime
from typing import List
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

## **2. SMU Bidding Data Feature Engineering Transformer**

#### **What This Code Does**
The `SMUBiddingTransformer` class transforms raw SMU course bidding data into features ready for CatBoost training. It sorts features into categorical columns (handled natively by CatBoost) and numeric columns, and expands multi-valued day strings into seven binary `has_<day>` indicator columns.

**Output:** A pandas DataFrame with engineered features organized into two categories:
- **Categorical Features**: `subject_area`, `catalogue_no`, `round`, `term`, `start_time`, `course_name`, `section`, `instructor`
- **Numeric Features**: `window`, `before_process_vacancy`, `acad_year_start`, `has_mon`, `has_tue`, `has_wed`, `has_thu`, `has_fri`, `has_sat`, `has_sun`

#### **What Is Required**

**Input Data:** A pandas DataFrame with these required columns:
- `course_code`, `course_name`, `section`, `acad_year_start`, `term`
- `start_time`, `day_of_week`, `before_process_vacancy`, `bidding_window`, `instructor`
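
As a quick illustration, the snippet below builds a one-row input frame and checks it against the required column list. The row values are made up (the section label `G1` in particular is an assumption), and `REQUIRED_COLUMNS` is a name introduced here for the example.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "course_code", "course_name", "section", "acad_year_start", "term",
    "start_time", "day_of_week", "before_process_vacancy",
    "bidding_window", "instructor",
]

# One made-up row, purely for illustration.
sample = pd.DataFrame([{
    "course_code": "MGMT715",
    "course_name": "Example Course",
    "section": "G1",
    "acad_year_start": 2024,
    "term": "1",
    "start_time": "19:30",
    "day_of_week": "Mon,Thu",
    "before_process_vacancy": 40,
    "bidding_window": "Round 1 Window 1",
    "instructor": "JOHN DOE",
}])

# Same check that fit() performs: list any required column that is absent.
missing = [col for col in REQUIRED_COLUMNS if col not in sample.columns]
print(missing)  # an empty list means every required column is present
```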

**Dependencies:**
- Python packages: `pandas`, `re`, `typing`

**Configuration:**
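- The constructor takes no arguments; the embedding parameters from earlier versions no longer exist
- CatBoost handles categorical features natively; missing numeric values are filled with defaults by the transformer, and missing categorical values are left as `None`
- Multi-valued days (e.g., "Mon,Thu") are encoded into 7 binary columns (`has_mon` to `has_sun`), so one row can have several days set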
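
The day encoding can be sketched on its own as follows. This is a simplified stand-in for the class method defined below, assuming comma-separated day names; `encode_days` and `DAY_NAMES` are names introduced for this example.

```python
import pandas as pd

# Abbreviation -> full name; both spellings are accepted in the input.
DAY_NAMES = {
    "MON": "MONDAY", "TUE": "TUESDAY", "WED": "WEDNESDAY", "THU": "THURSDAY",
    "FRI": "FRIDAY", "SAT": "SATURDAY", "SUN": "SUNDAY",
}

def encode_days(day_of_week: pd.Series) -> pd.DataFrame:
    """Return seven 0/1 columns (has_mon ... has_sun) for a day_of_week column."""
    # Missing values become an empty string, which matches no day.
    tokens = day_of_week.fillna("").astype(str).str.upper().str.split(",")
    out = pd.DataFrame(index=day_of_week.index)
    for abbrev, full in DAY_NAMES.items():
        out[f"has_{abbrev.lower()}"] = tokens.apply(
            lambda parts, a=abbrev, f=full: int(any(p.strip() in (a, f) for p in parts))
        )
    return out

encoded = encode_days(pd.Series(["Mon,Thu", None, "Friday"]))
print(encoded)
```

Because a section can meet on several days, more than one column can be 1 in the same row, which is why the columns are treated as independent binary features.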

In [2]:
class SMUBiddingTransformer:
    """
    A reusable transformer class for processing SMU course bidding data
    optimized for CatBoost model.
    
    Uses categorical encoding for instructors and one-hot encoding for multi-valued days.
    
    Expected input columns:
    - course_code: str (e.g. 'MGMT715', 'COR-COMM175')
    - course_name: str
    - section: str
    - acad_year_start: int
    - term: str ('1', '2', '3A', '3B')
    - start_time: str (e.g. '19:30', 'TBA') - preserved as categorical
    - day_of_week: str (can be multivalued, e.g. 'Mon,Thu')
    - before_process_vacancy: int
    - bidding_window: str (e.g. 'Round 1 Window 1', 'Incoming Freshmen Rnd 1 Win 4')
    - instructor: str (can be multivalued, e.g. 'JOHN DOE, JANE SMITH')
    """
    
    def __init__(self):
        """
        Initialize the transformer for CatBoost optimization.
        
        Uses categorical encoding for instructors and one-hot encoding for days.
        """
        # Fitted flags
        self.is_fitted = False
        
        # Lists to track feature types for CatBoost
        self.categorical_features = []
        self.numeric_features = []
        
    def fit(self, df: pd.DataFrame) -> 'SMUBiddingTransformer':
        """
        Fit the transformer on training data.
        
        Parameters:
        -----------
        df : pandas.DataFrame
            Training dataframe with all required columns
        """
        # Validate required columns
        required_cols = [
            'course_code', 'course_name', 'acad_year_start', 'term',
            'start_time', 'day_of_week', 'before_process_vacancy',
            'bidding_window', 'instructor', 'section'
        ]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns: {missing_cols}")
        
        print(f"Fitting transformer on {len(df)} rows...")
        
        self.is_fitted = True
        return self
    
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Transform the dataframe to CatBoost-ready format.
        """
        # Refuse to transform until fit() has validated the input columns
        if not self.is_fitted:
            raise ValueError("Transformer must be fitted before transform. Call fit() first.")
        
        # Create a copy to avoid modifying original
        df_transformed = df.copy()
        
        # Reset feature tracking
        self.categorical_features = []
        self.numeric_features = []
        all_features = []
        
        # 1. Extract course components (subject area and catalogue number, both categorical)
        course_features = self._extract_course_features(df_transformed)
        all_features.append(course_features)
        
        # 2. Process bidding window (categorical + numeric)
        round_window_features = self._extract_round_window(df_transformed)
        all_features.append(round_window_features)
        
        # 3. Basic features (preserve categorical nature) + instructor as categorical
        basic_features = self._process_basic_features(df_transformed)
        all_features.append(basic_features)
        
        # 4. Create day one-hot encoding
        day_features = self._create_day_one_hot_encoding(df_transformed)
        all_features.append(day_features)
        
        # Combine all features
        final_df = pd.concat(all_features, axis=1)
        
        return final_df
    
    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Fit the transformer and transform the data in one step."""
        self.fit(df)
        return self.transform(df)
    
    def get_categorical_features(self) -> List[str]:
        """Get list of categorical feature names for CatBoost."""
        return self.categorical_features.copy()
    
    def get_numeric_features(self) -> List[str]:
        """Get list of numeric feature names."""
        return self.numeric_features.copy()
    
    def _extract_course_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Extract subject area and catalogue number from course code."""
        features = pd.DataFrame(index=df.index)
        
        def split_course_code(code):
            """Split course code into subject area and catalogue number."""
            if pd.isna(code):
                return None, None
            
            code = str(code).strip().upper()
            
            # Handle hyphenated codes like 'COR-COMM175'
            if '-' in code:
                parts = code.split('-')
                if len(parts) >= 2:
                    subject = '-'.join(parts[:-1])
                    # Extract number from last part
                    num_match = re.search(r'(\d+)', parts[-1])
                    if num_match:
                        return subject, int(num_match.group(1))
                    else:
                        # Fall back to the first number found anywhere in the code
                        num_match = re.search(r'(\d+)', code)
                        if num_match:
                            return subject, int(num_match.group(1))
            
            # Standard format like 'MGMT715'
            match = re.match(r'([A-Z\-]+)(\d+)', code)
            if match:
                return match.group(1), int(match.group(2))
            
            return code, 0
        
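        # Extract components; each element is a (subject, number) tuple
        splits = df['course_code'].apply(split_course_code)
        features['subject_area'] = splits.apply(lambda x: x[0])
        # Missing course codes yield (None, None), so default the number to 0
        features['catalogue_no'] = splits.apply(lambda x: x[1] if x[1] is not None else 0)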
        
        # subject_area and catalogue_no are categorical for CatBoost
        self.categorical_features.extend(['subject_area', 'catalogue_no'])
        
        return features
    
    def _extract_round_window(self, df: pd.DataFrame) -> pd.DataFrame:
        """Extract round and window from bidding_window string."""
        features = pd.DataFrame(index=df.index)
        
        def parse_bidding_window(window_str):
            """Parse bidding window string into round and window number."""
            if pd.isna(window_str):
                return None, None
            
            window_str = str(window_str).strip()
            
            # Handle patterns from V4_01 notebook (re is imported at module level)
            match = re.search(r'Round\s+(\d[A-C]?)\s+Window\s+(\d)', window_str, re.IGNORECASE)
            if match:
                return match.group(1), int(match.group(2))
            
            match = re.search(r'Rnd\s+(\d[A-C]?)\s+Win\s+(\d)', window_str, re.IGNORECASE)
            if match:
                return match.group(1), int(match.group(2))
            
            match = re.search(r'(\d[A-C]?)', window_str)
            if match:
                win_match = re.search(r'Window\s+(\d)|Win\s+(\d)', window_str, re.IGNORECASE)
                if win_match:
                    window_num = int(win_match.group(1) or win_match.group(2))
                    return match.group(1), window_num
                return match.group(1), 1
            
            return '1', 1
        
        # Extract round and window; missing values parse to (None, None),
        # so fall back to round '1' and window 1
        parsed = df['bidding_window'].apply(parse_bidding_window)
        features['round'] = parsed.apply(lambda x: x[0] if x[0] is not None else '1')
        features['window'] = parsed.apply(lambda x: x[1] if x[1] is not None else 1)
        
        # Round as categorical (keeps labels such as 1, 1A, 1B, 2, 2A distinct)
        self.categorical_features.append('round')
        
        # Window as numeric
        self.numeric_features.append('window')
        
        return features
    
    def _process_basic_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process basic features, preserving categorical nature where beneficial."""
        features = pd.DataFrame(index=df.index)
        
        # Numeric features
        features['before_process_vacancy'] = pd.to_numeric(
            df['before_process_vacancy'], errors='coerce'
        ).fillna(0)
        features['acad_year_start'] = pd.to_numeric(
            df['acad_year_start'], errors='coerce'
        ).fillna(2025)
        
        self.numeric_features.extend(['before_process_vacancy', 'acad_year_start'])
        
        # Categorical features
        features['term'] = df['term'].astype(str)
        features['start_time'] = df['start_time'].astype(str)
        features['course_name'] = df['course_name'].astype(str)
        features['section'] = df['section'].astype(str)
        
        # Process instructor names (remove duplicates, handle comma-separated format)
        features['instructor'] = df['instructor'].apply(self._process_instructor_names)

        # Replace empty strings and stringified nulls with None for proper CatBoost handling
        null_like = ['', 'nan', 'None']
        features.loc[features['start_time'].isin(null_like), 'start_time'] = None
        features.loc[features['course_name'].isin(null_like), 'course_name'] = None
        features.loc[features['section'].isin(null_like), 'section'] = None
        
        self.categorical_features.extend(['term', 'start_time', 'course_name', 'section', 'instructor'])
        
        return features

    def _create_day_one_hot_encoding(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create one-hot encoding for days of the week."""
        features = pd.DataFrame(index=df.index)
        
        # Initialize all day columns as 0
        day_columns = ['has_mon', 'has_tue', 'has_wed', 'has_thu', 'has_fri', 'has_sat', 'has_sun']
        for col in day_columns:
            features[col] = 0
        
        # Day mapping
        day_abbrev = {
            'MONDAY': 'MON', 'TUESDAY': 'TUE', 'WEDNESDAY': 'WED',
            'THURSDAY': 'THU', 'FRIDAY': 'FRI', 'SATURDAY': 'SAT', 'SUNDAY': 'SUN',
            'MON': 'MON', 'TUE': 'TUE', 'WED': 'WED', 'THU': 'THU',
            'FRI': 'FRI', 'SAT': 'SAT', 'SUN': 'SUN'
        }
        
        day_to_column = {
            'MON': 'has_mon', 'TUE': 'has_tue', 'WED': 'has_wed', 'THU': 'has_thu',
            'FRI': 'has_fri', 'SAT': 'has_sat', 'SUN': 'has_sun'
        }
        
        # Process each row's day_of_week
        for idx, days in enumerate(df['day_of_week']):
            if pd.isna(days) or str(days).strip() == '':
                continue  # Leave all days as 0
            
            # Handle multiple days separated by comma
            for day in str(days).split(','):
                day_upper = day.strip().upper()
                standardized_day = day_abbrev.get(day_upper, day_upper)
                
                if standardized_day in day_to_column:
                    features.loc[df.index[idx], day_to_column[standardized_day]] = 1
        
        # These are numeric binary features (0/1)
        self.numeric_features.extend(day_columns)
        
        return features

    def get_feature_names(self) -> List[str]:
        """Get all feature names after transformation."""
        if not self.is_fitted:
            raise ValueError("Transformer must be fitted to get feature names.")
        
        return self.categorical_features + self.numeric_features
    
    def _process_instructor_names(self, instructor_str):
        """Process instructor names to remove duplicates and handle comma-separated format."""
        if pd.isna(instructor_str):
            return None
        
        # Convert to string and clean
        instructor_str = str(instructor_str).strip()
        if instructor_str == '' or instructor_str.upper() == 'TBA':
            return None
        
        # Split on ", " wherever the next token is an all-uppercase word
        # This handles cases like "TSE, JUSTIN K, AIDAN WONG, TSE, JUSTIN K"
        parts = re.split(r', (?=[A-Z]+(?:\s|,|$))', instructor_str)
        
        # Remove duplicates while preserving order (don't use set() to avoid losing order)
        seen = set()
        unique_parts = []
        for part in parts:
            part = part.strip()
            if part and part not in seen:
                seen.add(part)
                unique_parts.append(part)
        
        # Join back into single string for categorical feature
        return ', '.join(unique_parts) if unique_parts else None
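
The two parsing steps in the class above can be tried in isolation. The sketch below is a condensed restatement using only `re`, not a replacement for the class methods. It assumes non-missing string input and covers only the explicit `Round ... Window ...` and `Rnd ... Win ...` patterns. As in the class, a hyphenated code keeps only the part before the last hyphen as its subject area.

```python
import re
from typing import Optional, Tuple

def parse_bidding_window(text: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (round, window) for strings such as 'Round 1A Window 2'."""
    match = re.search(
        r"(?:Round|Rnd)\s+(\d[A-C]?)\s+(?:Window|Win)\s+(\d)", text, re.IGNORECASE
    )
    if match is None:
        return None, None
    return match.group(1), int(match.group(2))

def split_course_code(code: str) -> Tuple[str, int]:
    """Split a course code into (subject_area, catalogue_no)."""
    code = code.strip().upper()
    # Hyphenated codes: subject is everything before the last hyphen.
    prefix, _, last = code.rpartition("-")
    number = re.search(r"\d+", last)
    if prefix and number:
        return prefix, int(number.group())
    # Standard codes such as 'MGMT715'.
    match = re.match(r"([A-Z\-]+)(\d+)", code)
    if match:
        return match.group(1), int(match.group(2))
    return code, 0

print(parse_bidding_window("Incoming Freshmen Rnd 1 Win 4"))
print(split_course_code("COR-COMM175"))
print(split_course_code("MGMT715"))
```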



---

## **3. SMUDataMerger Class**

### **SMU Raw Data and BOSS Results Integration Pipeline**

#### **What This Code Does**
The `SMUDataMerger` class combines SMU's raw course data with BOSS bidding results to create a unified dataset for machine learning analysis. It aggregates class session timings per `record_key`, joins them to the course records, and matches the result against BOSS rows using a course key built from course code, section, academic year, and term. The merged records are suitable as input for the `SMUBiddingTransformer`.
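
The timing aggregation can be illustrated on a toy `multiple` sheet. This sketch mirrors the idea used by `process_class_timings` (unique days joined into one string, most frequent start time kept) with made-up session rows; the helper names are introduced here for the example.

```python
from collections import Counter

import pandas as pd

# Made-up class sessions: record A meets Mon and Thu, mostly at 08:15.
sessions = pd.DataFrame({
    "record_key": ["A", "A", "A", "B"],
    "day_of_week": ["Mon", "Thu", "Mon", "Wed"],
    "start_time": ["08:15", "08:15", "12:00", "15:30"],
})

def join_unique_days(days):
    """Join the distinct, non-missing days into one comma-separated string."""
    valid = sorted({d for d in days if pd.notna(d)})
    return ", ".join(valid) if valid else None

def most_common_time(times):
    """Return the most frequent non-missing start time."""
    valid = [t for t in times if pd.notna(t)]
    return Counter(valid).most_common(1)[0][0] if valid else None

summary = sessions.groupby("record_key").agg(
    day_of_week=("day_of_week", join_unique_days),
    start_time=("start_time", most_common_time),
).reset_index()
print(summary)
```

Each `record_key` collapses to a single row, which is what allows a one-to-one left join onto the `standalone` sheet.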

**Output:** Two timestamped CSV files saved to `script_output/model_training/`, plus a returned DataFrame:
- **Classification Dataset**: `classification/classification_model_data_{timestamp}.csv` - All merged records for predicting bidding success
- **Regression Dataset**: `regression/regression_model_data_{timestamp}.csv` - Only records with non-zero bids for predicting bid amounts
- **Returned DataFrame**: The classification dataset, with columns `course_code`, `course_name`, `acad_year_start`, `term`, `section`, `start_time`, `day_of_week`, `before_process_vacancy`, `bidding_window`, `instructor`, `median_bid`, `min_bid`
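
A rough sketch of how the two files could be derived from the merged frame. The helper `save_model_datasets` is hypothetical, and two details are assumptions rather than facts from this notebook: "non-zero bids" is read here as `median_bid > 0`, and the timestamp format is `%Y%m%d_%H%M%S`.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd

def save_model_datasets(final_df: pd.DataFrame, output_folder: str) -> dict:
    """Write classification and regression CSVs and return their paths."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed format
    subsets = {
        "classification": final_df,
        # Assumption: "non-zero bids" means a median bid above zero.
        "regression": final_df[final_df["median_bid"] > 0],
    }
    paths = {}
    for name, subset in subsets.items():
        folder = os.path.join(output_folder, name)
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, f"{name}_model_data_{timestamp}.csv")
        subset.to_csv(path, index=False)
        paths[name] = path
    return paths

# Made-up merged rows: one course with no bids, one with bids.
demo = pd.DataFrame({
    "course_code": ["MGMT715", "COR-COMM175"],
    "median_bid": [0.0, 32.5],
    "min_bid": [0.0, 30.0],
})
with tempfile.TemporaryDirectory() as tmp:
    saved = save_model_datasets(demo, tmp)
    classification_rows = len(pd.read_csv(saved["classification"]))
    regression_rows = len(pd.read_csv(saved["regression"]))
print(classification_rows, regression_rows)
```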

#### **What Is Required**

**Input Data:**
- **Raw Data Excel File** (`script_input/raw_data.xlsx`):
  - `standalone` sheet: Core course information with `course_code`, `course_name`, `section`, `acad_year_start`, `acad_year_end`, `term`, `record_key`
  - `multiple` sheet: Class sessions with `type`, `day_of_week`, `start_time`, `venue`, `professor_name`, `record_key`
- **BOSS Results Folder** (`script_input/overallBossResults/`):
  - Multiple Excel files with columns: `Course Code`, `Section`, `Term`, `Before Process Vacancy`, `Bidding Window`, `Instructor`, `Median Bid`, `Min Bid`
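
The two sources label terms differently: the raw data uses values such as `T2`, while BOSS files use strings such as `2021-22 Term 2`. The sketch below is a condensed restatement of how the class reconciles them into one matching key. The function names and the section label `G1` are illustrative.

```python
import re

def parse_boss_term(term_str):
    """Turn '2021-22 Term 2' into (2021, '2'); return (None, None) on no match."""
    match = re.match(r"(\d{4})-\d{2}\s+Term\s+(.+)", term_str.strip())
    if not match:
        return None, None
    return int(match.group(1)), match.group(2).strip()

def make_course_key(course_code, section, acad_year_start, term):
    """Build COURSECODE_SECTION_YEAR_TERM, dropping a leading 'T' from the term."""
    term_clean = str(term).strip().upper()
    if term_clean.startswith("T"):
        term_clean = term_clean[1:]
    code = str(course_code).strip().upper()
    sec = str(section).strip().upper()
    return f"{code}_{sec}_{int(acad_year_start)}_{term_clean}"

# Key from the raw data side.
raw_key = make_course_key("MGMT715", "G1", 2021, "T2")

# Key from the BOSS side, after parsing its term string.
year, term = parse_boss_term("2021-22 Term 2")
boss_key = make_course_key("mgmt715", "g1", year, term)

print(raw_key, raw_key == boss_key)
```

Only rows whose keys agree on all four parts survive the inner merge, so a mismatch in any one of them (for example a differently formatted section) silently drops the row.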

**Dependencies:**
- Python packages: `pandas`, `glob`, `os`, `re`, `datetime`, `collections.Counter`
- An Excel reader for `.xlsx` files (`openpyxl`)

**Configuration:**
- `raw_data_path` (default="script_input/raw_data.xlsx"): Path to raw data Excel file
- `boss_results_folder` (default="script_input/overallBossResults"): Folder containing BOSS results Excel files

#### **What the User Needs to Do Step by Step**

**Step 1: Initialize the Merger**
```python
# Assumes the class has been exported to SMUDataMerger.py; skip this import inside the notebook
from SMUDataMerger import SMUDataMerger

# Initialize with default paths
merger = SMUDataMerger()

# Or with custom paths
merger = SMUDataMerger(
    raw_data_path="custom_path/raw_data.xlsx",
    boss_results_folder="custom_path/boss_results"
)
```

**Step 2: Execute the Complete Merge Process**
```python
# Execute merge and get final dataset
final_dataset = merger.process_and_merge()

# Check results
print(f"Final dataset shape: {final_dataset.shape}")
print(f"Unique courses: {final_dataset['course_code'].nunique()}")
```

**Step 3: Use Output with SMUBiddingTransformer**
```python
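# The output is automatically saved and ready for SMUBiddingTransformer.
# Assumes the class has been exported to SMUBiddingTransformer.py; skip this import inside the notebook
from SMUBiddingTransformer import SMUBiddingTransformer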

transformer = SMUBiddingTransformer()
ml_features = transformer.fit_transform(final_dataset)
```

In [3]:
class SMUDataMerger:
    """
    A class to merge SMU raw data with boss results data for bidding analysis.
    """
    
    def __init__(self, raw_data_path="script_input/raw_data.xlsx", 
                 boss_results_folder="script_input/overallBossResults"):
        self.raw_data_path = raw_data_path
        self.boss_results_folder = boss_results_folder
        self.output_folder = "script_output/model_training"
        
        # Create output directory if it doesn't exist
        os.makedirs(self.output_folder, exist_ok=True)
    
    def load_raw_data(self):
        """
        Load and process the raw_data.xlsx file with standalone and multiple sheets.
        """
        print(f"Loading raw data from {self.raw_data_path}")
        
        # Load both sheets
        standalone_df = pd.read_excel(self.raw_data_path, sheet_name='standalone')
        multiple_df = pd.read_excel(self.raw_data_path, sheet_name='multiple')
        
        print(f"Standalone sheet: {standalone_df.shape[0]} rows")
        print(f"Multiple sheet: {multiple_df.shape[0]} rows")
        
        # Filter multiple_df to only include CLASS entries (ignore EXAM)
        class_df = multiple_df[multiple_df['type'] == 'CLASS'].copy()
        print(f"Class entries in multiple sheet: {class_df.shape[0]} rows")
        
        return standalone_df, class_df

    def process_class_timings(self, class_df):
        """
        Process class timings by grouping by record_key and aggregating the timing information.
        """
        if class_df.empty:
            return pd.DataFrame(columns=['record_key', 'day_of_week', 'start_time'])
        
        def aggregate_days(days):
            # Remove NaN values and convert to set to remove duplicates
            valid_days = [day for day in days if pd.notna(day)]
            if not valid_days:
                return None
            unique_days = sorted(set(valid_days))
            return ', '.join(unique_days)
        
        def get_most_common_time(times):
            # Remove NaN values
            valid_times = [time for time in times if pd.notna(time)]
            if not valid_times:
                return None
            # Get most common time, or first occurrence if tie
            time_counts = Counter(valid_times)
            return time_counts.most_common(1)[0][0]
        
        # Group by record_key and aggregate
        timing_summary = class_df.groupby('record_key').agg({
            'day_of_week': aggregate_days,
            'start_time': get_most_common_time,
            'venue': lambda x: ', '.join([str(v) for v in x if pd.notna(v)]) if any(pd.notna(v) for v in x) else None,
            'professor_name': lambda x: ', '.join([str(p) for p in x if pd.notna(p)]) if any(pd.notna(p) for p in x) else None
        }).reset_index()
        
        return timing_summary

    def combine_raw_data(self, standalone_df, class_df):
        """
        Combine standalone and multiple (class) data into one flat dataset.
        """
        print("Combining standalone and class timing data...")
        
        # Process class timings first
        timing_summary = self.process_class_timings(class_df)
        
        # Merge standalone with timing summary
        if not timing_summary.empty:
            combined_df = pd.merge(
                standalone_df,
                timing_summary,
                on='record_key',
                how='left'
            )
        else:
            combined_df = standalone_df.copy()
            combined_df['day_of_week'] = None
            combined_df['start_time'] = None
            combined_df['venue'] = None
            combined_df['professor_name'] = None
        
        # Create boss-compatible term format: "2021-22 Term 1"
        def create_boss_term_format(row):
            if pd.notna(row['acad_year_start']) and pd.notna(row['acad_year_end']) and pd.notna(row['term']):
                year_start = int(row['acad_year_start'])
                year_end = str(int(row['acad_year_end']))[-2:]  # Last 2 digits
                term = str(row['term']).strip()
                if term.startswith('T'):
                    term = term[1:]  # Remove T prefix
                return f"{year_start}-{year_end} Term {term}"
            return None
        
        # Add boss_term_format column for matching
        combined_df['boss_term_format'] = combined_df.apply(create_boss_term_format, axis=1)
        
        print(f"Combined raw data shape: {combined_df.shape}")
        print(f"Sample boss term formats created: {combined_df['boss_term_format'].value_counts().head()}")
        
        return combined_df

    def standardize_term_format(self, term_str):
        """
        Convert term formats - remove T prefix if present.
        """
        if pd.isna(term_str):
            return None
        
        term_str = str(term_str).strip().upper()
        
        # If starts with 'T', remove it
        if term_str.startswith('T'):
            return term_str[1:]
        
        return term_str
    
    def clean_text_encoding(self, text):
        """
        Clean text encoding issues from web scraping.
        """
        if pd.isna(text):
            return text
        
        text = str(text)
        
        # Common encoding fixes
        replacements = {
            'â€"': '–',  # en dash
            'â€™': "'",  # apostrophe
            'â€œ': '"',  # left quote
            'â€': '"',   # right quote
            'Ã©': 'é',   # e acute
            'Ã¨': 'è',   # e grave
            'Ã ': 'à',   # a grave
            'Ã¢': 'â',   # a circumflex
            'Ã®': 'î',   # i circumflex
            'Ã´': 'ô',   # o circumflex
            'Ã»': 'û',   # u circumflex
            'Ã§': 'ç',   # c cedilla
            'â€¦': '...',  # ellipsis
            'â€‰': ' ',   # thin space
            'Â': '',      # non-breaking space artifact
        }
        
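        # Apply longer patterns first so that a short key cannot clobber a longer
        # key that starts with it (the right-quote key is a prefix of the ellipsis key)
        for old, new in sorted(replacements.items(), key=lambda kv: len(kv[0]), reverse=True):
            text = text.replace(old, new)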
        
        # Normalize to NFKC: compatibility characters are folded to their plain
        # forms while accented letters stay as single composed characters
        import unicodedata
        text = unicodedata.normalize('NFKC', text)
        
        return text.strip()

    def create_course_key(self, course_code, section, acad_year_start, term):
        """
        Create a standardized key for matching courses.
        Format: COURSECODE_SECTION_YEAR_TERM
        """
        if pd.isna(course_code) or pd.isna(section):
            return None
        
        # Clean course code and section
        course_code_clean = str(course_code).strip().upper()
        section_clean = str(section).strip().upper()
        
        # Handle academic year - ensure it's an integer
        if pd.notna(acad_year_start):
            acad_year = int(float(acad_year_start))
        else:
            return None
        
        # Standardize term format (remove T prefix if present)
        if pd.notna(term):
            term_clean = self.standardize_term_format(term)
        else:
            return None
        
        key = f"{course_code_clean}_{section_clean}_{acad_year}_{term_clean}"
        return key

    def load_boss_results(self):
        """
        Load and combine all Excel files from the overallBossResults folder.
        """
        print(f"Loading boss results from {self.boss_results_folder}")
        
        # Find all Excel files in the folder
        excel_files = glob.glob(os.path.join(self.boss_results_folder, "*.xlsx"))
        
        if not excel_files:
            print("No Excel files found in the boss results folder!")
            return pd.DataFrame()
        
        print(f"Found {len(excel_files)} Excel files")
        
        all_boss_data = []
        
        for file_path in excel_files:
            try:
                # Keep the bare filename for logging and source tracking
                filename = os.path.basename(file_path)
                print(f"Processing file: {filename}")
                
                # Load the Excel file
                df = pd.read_excel(file_path)
                
                # Add source filename for tracking
                df['source_file'] = filename
                
                all_boss_data.append(df)
                
            except Exception as e:
                print(f"Error processing {file_path}: {str(e)}")
                continue
        
        if not all_boss_data:
            print("No valid data found in boss results files!")
            return pd.DataFrame()
        
        # Combine all dataframes
        combined_boss_df = pd.concat(all_boss_data, ignore_index=True)
        print(f"Combined boss results: {combined_boss_df.shape[0]} rows")
        
        return combined_boss_df

    def parse_term_info(self, term_str):
        """
        Parse term string like "2021-22 Term 2" to extract academic year start and term.
        """
        if pd.isna(term_str):
            return None, None
        
        try:
            # Pattern: "YYYY-YY Term X"
            match = re.match(r'(\d{4})-\d{2}\s+Term\s+(.+)', str(term_str).strip())
            if match:
                acad_year_start = int(match.group(1))
                term = match.group(2).strip()  # This will be "1", "2", "3A", "3B"
                return acad_year_start, term
        except Exception as e:
            print(f"Error parsing term '{term_str}': {e}")
        
        return None, None

    def merge_data(self, combined_raw_df, boss_df):
        """
        Merge the combined raw data with boss results data.
        """
        print("Starting data merge process...")
        
        # Filter out rows with missing terms to avoid duplicates
        print(f"\nRows before filtering missing terms: {len(combined_raw_df)}")
        combined_raw_df = combined_raw_df[combined_raw_df['term'].notna()].copy()
        print(f"Rows after filtering missing terms: {len(combined_raw_df)}")
        
        # Clean boss results - remove unnamed columns
        boss_columns_to_keep = ['Term', 'Session', 'Bidding Window', 'Course Code', 'Description', 
                            'Section', 'Vacancy', 'Opening Vacancy', 'Before Process Vacancy', 
                            'D.I.C.E', 'After Process Vacancy', 'Enrolled Students', 
                            'Median Bid', 'Min Bid', 'Instructor', 'School/Department', 'source_file']
        
        # Filter boss_df to only keep valid columns
        boss_df_clean = boss_df[[col for col in boss_columns_to_keep if col in boss_df.columns]].copy()
        
        # Parse boss results term information to extract year and term
        boss_df_clean[['boss_acad_year_start', 'boss_term']] = boss_df_clean['Term'].apply(
            lambda x: pd.Series(self.parse_term_info(x))
        )
        
        # Create course keys for raw data
        combined_raw_df['course_key'] = combined_raw_df.apply(
            lambda row: self.create_course_key(
                row['course_code'], 
                row['section'], 
                row['acad_year_start'], 
                row['term']  # This is the original term like T1, T2
            ), axis=1
        )
        
        # Create course keys for boss data
        boss_df_clean['course_key'] = boss_df_clean.apply(
            lambda row: self.create_course_key(
                row['Course Code'], 
                row['Section'], 
                row['boss_acad_year_start'], 
                row['boss_term']  # This should already be in format like "1", "2", "3A"
            ), axis=1
        )
        
        # Debug: Show sample keys
        print(f"\nRaw data course keys (first 5):")
        print(combined_raw_df[['course_code', 'section', 'term', 'course_key']].dropna().head())
        print(f"\nBoss data course keys (first 5):")
        print(boss_df_clean[['Course Code', 'Section', 'Term', 'boss_term', 'course_key']].dropna().head())
        
        print(f"\nRaw data valid keys: {combined_raw_df['course_key'].notna().sum()} out of {len(combined_raw_df)}")
        print(f"Boss data valid keys: {boss_df_clean['course_key'].notna().sum()} out of {len(boss_df_clean)}")
        
        # Find common keys
        raw_keys = set(combined_raw_df['course_key'].dropna())
        boss_keys = set(boss_df_clean['course_key'].dropna())
        common_keys = raw_keys.intersection(boss_keys)
        print(f"\nCommon keys found: {len(common_keys)}")
        
        if len(common_keys) == 0:
            print("\nNo matching keys found. Checking for mismatches...")
            print("Sample raw keys:", list(raw_keys)[:5])
            print("Sample boss keys:", list(boss_keys)[:5])
        
        # Perform the merge
        merged_df = pd.merge(
            combined_raw_df,
            boss_df_clean,
            on='course_key',
            how='inner',
            suffixes=('_raw', '_boss')
        )
        
        print(f"\nMerged data: {merged_df.shape[0]} rows")
        
        if merged_df.empty:
            print("No matching records found between raw data and boss results.")
            return pd.DataFrame(), merged_df
        
        # Create the final dataframe with only required columns
        final_df = pd.DataFrame()
        
        # Map columns to match SMUBiddingTransformer expected input
        column_mapping = {
            'course_code': 'course_code',
            'course_name': 'course_name',
            'acad_year_start': 'acad_year_start',
            'term': 'term',  # Original term format
            'start_time': 'start_time',
            'day_of_week': 'day_of_week',
            'before_process_vacancy': 'Before Process Vacancy',
            'bidding_window': 'Bidding Window',
            'instructor': 'professor_name',  # Prefer professor_name from class data
            'median_bid': 'Median Bid',
            'min_bid': 'Min Bid',
            'section': 'section'
            # REMOVED: 'vacancy' and 'grading_basis'
        }
        
        # Copy each mapped column; fill with None when the source column is missing
        for new_col, source_col in column_mapping.items():
            if source_col in merged_df.columns:
                final_df[new_col] = merged_df[source_col]
            else:
                print(f"Warning: Column {source_col} not found in merged data")
                final_df[new_col] = None
        
        # Special handling for instructor - use boss Instructor if professor_name is empty
        if 'Instructor' in merged_df.columns:
            mask = final_df['instructor'].isna() | (final_df['instructor'] == '')
            final_df.loc[mask, 'instructor'] = merged_df.loc[mask, 'Instructor']
        
        # Add course description from boss results if course_name is missing
        if 'Description' in merged_df.columns:
            mask = final_df['course_name'].isna() | (final_df['course_name'] == '')
            final_df.loc[mask, 'course_name'] = merged_df.loc[mask, 'Description']
        
        # Clean course names and instructor names for encoding issues
        final_df['course_name'] = final_df['course_name'].apply(self.clean_text_encoding)
        final_df['instructor'] = final_df['instructor'].apply(self.clean_text_encoding)
        
        print(f"\nFinal dataframe columns: {list(final_df.columns)}")
        print(f"Final dataframe shape: {final_df.shape}")
        
        return final_df, merged_df
    
    def process_and_merge(self):
        """
        Main method to execute the data merging process and save results.
        """
        try:
            # Step 1: Load raw data
            standalone_df, class_df = self.load_raw_data()
            
            # Step 2: Combine standalone and class data into one flat file
            combined_raw_df = self.combine_raw_data(standalone_df, class_df)
            
            # Step 3: Load boss results
            boss_df = self.load_boss_results()
            
            if boss_df.empty:
                print("No boss results data found. Exiting.")
                return None
            
            # Step 4: Merge the combined raw data with boss results
            final_df, detailed_df = self.merge_data(combined_raw_df, boss_df)
            
            if final_df.empty:
                print("No matching records found between raw data and boss results.")
                return None
            
            # Clean text encoding issues in course_name and instructor
            print("\nCleaning text encoding issues...")
            final_df['course_name'] = final_df['course_name'].apply(self.clean_text_encoding)
            final_df['instructor'] = final_df['instructor'].apply(self.clean_text_encoding)
            
            # Step 5: Save the results with timestamp
            timestamp = datetime.now().strftime("%d%m%y%H%M%S")

            # Create folder structure
            classification_folder = os.path.join(self.output_folder, "classification")
            regression_folder = os.path.join(self.output_folder, "regression")
            os.makedirs(classification_folder, exist_ok=True)
            os.makedirs(regression_folder, exist_ok=True)

            # Save classification model data (all data)
            classification_filename = f"classification_model_data_{timestamp}.csv"
            classification_path = os.path.join(classification_folder, classification_filename)
            final_df.to_csv(classification_path, index=False)

            print(f"\nClassification model data saved to: {classification_path}")
            print(f"Classification dataset shape: {final_df.shape}")

            # Create and save regression model data (non-zero bids only)
            regression_df = final_df[(final_df['median_bid'] > 0) & (final_df['min_bid'] > 0)].copy()
            regression_filename = f"regression_model_data_{timestamp}.csv"
            regression_path = os.path.join(regression_folder, regression_filename)
            regression_df.to_csv(regression_path, index=False)

            print(f"\nRegression model data saved to: {regression_path}")
            print(f"Regression dataset shape: {regression_df.shape}")
            
            # Display summary statistics
            print(f"\nSummary Statistics (Classification Data):")
            print(f"- Total merged records: {final_df.shape[0]}")
            print(f"- Unique courses: {final_df['course_code'].nunique()}")
            print(f"- Unique sections: {final_df['section'].nunique()}")
            print(f"- Academic years covered: {final_df['acad_year_start'].min()} - {final_df['acad_year_start'].max()}")
            print(f"- Terms covered: {sorted(final_df['term'].unique())}")
            
            # Check for missing critical values
            critical_cols = ['course_code', 'section', 'before_process_vacancy', 'median_bid', 'min_bid']
            for col in critical_cols:
                if col in final_df.columns:
                    missing_count = final_df[col].isna().sum()
                    if missing_count > 0:
                        print(f"- Missing values in {col}: {missing_count}")
            
            # Return the full merged dataframe (the classification dataset) as the primary output
            return final_df
            
        except Exception as e:
            print(f"Error during merge process: {str(e)}")
            raise

In [4]:
# Extract data required for model
merger = SMUDataMerger()
result_df = merger.process_and_merge()

Loading raw data from script_input/raw_data.xlsx
Standalone sheet: 12973 rows
Multiple sheet: 19986 rows
Class entries in multiple sheet: 13082 rows
Combining standalone and class timing data...
Combined raw data shape: (12973, 26)
Sample boss term formats created: boss_term_format
2023-24 Term 2    1664
2023-24 Term 1    1659
2024-25 Term 1    1647
2021-22 Term 2    1631
2022-23 Term 1    1614
Name: count, dtype: int64
Loading boss results from script_input/overallBossResults
Found 14 Excel files
Processing file: 2021-22_T2.xlsx
Processing file: 2021-22_T3B.xlsx
Processing file: 2022-23_T1.xlsx
Processing file: 2022-23_T2.xlsx
Processing file: 2022-23_T3A.xlsx
Processing file: 2022-23_T3B.xlsx
Processing file: 2023-24_T1.xlsx
Processing file: 2023-24_T2.xlsx
Processing file: 2023-24_T3A.xlsx
Processing file: 2023-24_T3B.xlsx
Processing file: 2024-25_T1.xlsx
Processing file: 2024-25_T2.xlsx
Processing file: 2024-25_T3A.xlsx
Processing file: 2024-25_T3B.xlsx
Combined boss results: 12134


---

### **4. Training Data Preparation Pipeline for SMU Bidding Models**

#### **What This Code Does**
The `SMUBiddingPipeline` class creates model-ready training datasets from SMUDataMerger output using SMUBiddingTransformer. It handles the complete workflow from merged data to timestamped training/test datasets with temporal splits to prevent data leakage.

**Output:** Three types of timestamped CSV datasets saved to `script_output/model_training/`, each written as a train file and a test file:
- **Classification Dataset**: `classification/classification_train_{timestamp}.csv` and `classification/classification_test_{timestamp}.csv` - Predicts whether a section receives bids, using the `bids` target (True when `median_bid` or `min_bid` is greater than zero)
- **Median Regression Dataset**: `regression/regression_median_train_{timestamp}.csv` and `regression/regression_median_test_{timestamp}.csv` - Predicts median bid amounts with the `target_median_bid` column (filtered for non-zero bids only)
- **Min Regression Dataset**: `regression/regression_min_train_{timestamp}.csv` and `regression/regression_min_test_{timestamp}.csv` - Predicts minimum bid amounts with the `target_min_bid` column (filtered for non-zero bids only)
- **Temporal Split**: The test set is academic year 2024-25 Terms 2, 3A and 3B (`acad_year_start == 2024`); the training set is every other term, up to and including 2024-25 Term 1
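
The split and the classification target described above can be sketched on a toy frame. This is a minimal illustration, not the pipeline itself: the rows are made-up values, and the target follows the expression used in the pipeline code, `(median_bid > 0) | (min_bid > 0)`.

```python
import pandas as pd

# Toy stand-in for the merged data; the values are illustrative only.
toy_bids = pd.DataFrame({
    "acad_year_start": [2023, 2024, 2024, 2024],
    "term": ["T2", "T1", "T2", "T3A"],
    "median_bid": [25.0, 0.0, 31.5, 0.0],
    "min_bid": [10.0, 0.0, 20.0, 0.0],
})

TEST_TERMS = ["T2", "T3A", "T3B"]

# Held-out rows: academic year starting 2024, Term 2 onwards.
test_mask = (toy_bids["acad_year_start"] == 2024) & toy_bids["term"].isin(TEST_TERMS)

# Classification target: True when either bid statistic is above zero.
toy_bids["bids"] = (toy_bids["median_bid"] > 0) | (toy_bids["min_bid"] > 0)

toy_train = toy_bids[~test_mask]
toy_test = toy_bids[test_mask]
print(f"train rows: {len(toy_train)}, test rows: {len(toy_test)}")
```

Splitting on term rather than at random keeps every test row later in time than the training rows, which is what guards against leakage from future bidding rounds.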

#### **What Is Required**

**Input Data:**
- **Merged CSV File** (from SMUDataMerger output) with columns: `course_code`, `course_name`, `acad_year_start`, `term`, `section`, `start_time`, `day_of_week`, `before_process_vacancy`, `bidding_window`, `instructor`, `median_bid`, `min_bid`
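
A quick way to check a CSV against the column list above before running the pipeline is sketched below. `missing_columns` is a hypothetical helper written for this example, not part of `SMUBiddingPipeline`; the column names are copied from the list above.

```python
import pandas as pd

# Column names copied from the input list above.
FEATURE_COLUMNS = [
    "course_code", "course_name", "acad_year_start", "term", "section",
    "start_time", "day_of_week", "before_process_vacancy",
    "bidding_window", "instructor",
]
TARGET_COLUMNS = ["median_bid", "min_bid"]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    expected = FEATURE_COLUMNS + TARGET_COLUMNS
    return [col for col in expected if col not in df.columns]


# Toy frame that lacks `instructor` and both target columns.
toy_input = pd.DataFrame(columns=FEATURE_COLUMNS[:-1])
absent = missing_columns(toy_input)
print(absent)
```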

**Dependencies:**
- Python packages: `pandas`, `pickle`, `os`, `datetime`
- Custom modules: `SMUBiddingTransformer` class  
- Directory structure: `script_output/model_training/` (auto-created)

**Configuration:**
- All previous embedding parameters are deprecated and ignored
- CatBoost handles categorical features and missing values natively
- Multi-valued days (e.g., "Mon,Thu") are one-hot encoded into 7 binary columns
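
The day encoding can be pictured with the sketch below. It assumes comma-separated three-letter day names and the `has_mon` … `has_sun` column names that appear in the transformer output; `one_hot_days` is a hypothetical helper, and the real `SMUBiddingTransformer` may implement this step differently.

```python
import pandas as pd

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def one_hot_days(day_of_week: pd.Series) -> pd.DataFrame:
    """Expand strings such as "Mon,Thu" into seven 0/1 columns (illustrative only)."""
    # Split each value on commas; missing values become an empty set of days.
    tokens = [
        {part.strip() for part in value.split(",") if part.strip()}
        for value in day_of_week.fillna("").astype(str)
    ]
    return pd.DataFrame(
        {f"has_{day.lower()}": [int(day in row) for row in tokens] for day in DAYS},
        index=day_of_week.index,
    )


encoded_days = one_hot_days(pd.Series(["Mon,Thu", "Sat", None]))
print(encoded_days)
```

Seven fixed columns keep the feature layout identical for every row, whether a section meets once a week, several times, or has no recorded day.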

In [8]:
class SMUBiddingPipeline:
    """
    A comprehensive pipeline class that manages the entire workflow from 
    SMUDataMerger output to model-ready features using SMUBiddingTransformer.
    
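    This class handles:
    1. Loading and validating merged data from SMUDataMerger
    2. Fitting SMUBiddingTransformer and preparing classification and regression data
    3. Writing temporally split train/test CSV files
    
    The fitted transformer stays available on `self.transformer`, so it can be
    reused to transform user inputs for predictions.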
    
    Uses categorical encoding for instructors and one-hot encoding for multi-valued days.
    Optimized for CatBoost models.
    """
    
    def __init__(self):
        """
        Initialize the pipeline for CatBoost-optimized feature engineering.
        
        Uses categorical encoding for instructors and one-hot encoding for days.
        """
        # Initialize transformer
        self.transformer = None
        
        # Required columns for the transformer
        self.required_columns = [
            'course_code', 'course_name', 'acad_year_start', 'term',
            'start_time', 'day_of_week', 'before_process_vacancy',
            'bidding_window', 'instructor', 'section'
        ]

        # Initialize fitted status
        self.is_fitted = False
        self.training_columns = []
    
    def load_merged_data(self, file_path: str) -> pd.DataFrame:
        """
        Load the merged data CSV from SMUDataMerger.
        
        Parameters:
        -----------
        file_path : str
            Path to the CSV file created by SMUDataMerger
            
        Returns:
        --------
        pd.DataFrame
            Loaded and validated dataframe
        """
        print(f"Loading merged data from: {file_path}")
        
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"Data file not found: {file_path}")
        
        df = pd.read_csv(file_path)
        print(f"Loaded data shape: {df.shape}")
        
        # Validate required columns
        missing_cols = [col for col in self.required_columns if col not in df.columns]
        if missing_cols:
            print(f"Warning: Missing columns will be created with default values: {missing_cols}")
            
            # Create missing columns with appropriate defaults
            for col in missing_cols:
                if col == 'section':
                    df[col] = 'SEC1'  # Default section
                elif col == 'bidding_window':
                    df[col] = 'Round 1 Window 1'  # Default bidding window
                else:
                    df[col] = None
        
        # Data quality checks
        print("\nData Quality Summary:")
        print(f"- Total records: {len(df)}")
        print(f"- Unique courses: {df['course_code'].nunique()}")
        print(f"- Date range: {df['acad_year_start'].min()} - {df['acad_year_start'].max()}")
        print(f"- Terms: {sorted(df['term'].unique())}")
        
        # Check for missing values in critical columns
        for col in ['course_code', 'before_process_vacancy']:
            missing_count = df[col].isna().sum()
            if missing_count > 0:
                print(f"- Missing {col}: {missing_count} records")
        
        return df
    
    def prepare_classification_data(self, df: pd.DataFrame) -> None:
        """
        Prepare classification training data with temporal split and save as CSV files.
        
        Parameters:
        -----------
        df : pd.DataFrame
            Input dataframe with merged SMU data
        """
        print("Preparing classification training data with temporal split...")
        
        self.transformer = SMUBiddingTransformer()
        
        print("Fitting SMUBiddingTransformer on all data...")
        X_transformed = self.transformer.fit_transform(df)
        
        # Create bids target column for classification
        # True when median_bid or min_bid is greater than zero, False otherwise
        X_transformed['bids'] = (df['median_bid'] > 0) | (df['min_bid'] > 0)

        # Create temporal split
        # Test: 2024-25 T2, T3A, T3B
        test_mask = (df['acad_year_start'] == 2024) & (df['term'].isin(['T2', 'T3A', 'T3B', '2', '3A', '3B']))
        train_mask = ~test_mask
        
        X_train = X_transformed[train_mask].copy()
        X_test = X_transformed[test_mask].copy()
        
        print(f"Training set: {X_train.shape[0]} samples")
        print(f"Test set: {X_test.shape[0]} samples")
        
        # Save datasets
        timestamp = datetime.now().strftime("%d%m%y%H%M%S")
        classification_folder = "script_output/model_training/classification"
        os.makedirs(classification_folder, exist_ok=True)
        
        train_path = os.path.join(classification_folder, f"classification_train_{timestamp}.csv")
        test_path = os.path.join(classification_folder, f"classification_test_{timestamp}.csv")
        
        X_train.to_csv(train_path, index=False)
        X_test.to_csv(test_path, index=False)
        
        print(f"Classification training data saved to: {train_path}")
        print(f"Classification test data saved to: {test_path}")
        
        self.is_fitted = True
        self.training_columns = list(X_transformed.columns)

    def prepare_regression_data_median(self, df: pd.DataFrame) -> None:
        """
        Prepare regression training data for median_bid prediction with temporal split.
        """
        print("Preparing regression training data for median_bid prediction...")
        
        # Filter out zero/missing median bids
        df_clean = df[(df['median_bid'] > 0) & (df['median_bid'].notna())].copy()
        print(f"Records after filtering median_bid: {len(df_clean)}")
        
        # Always fit a fresh transformer on the filtered data
        self.transformer = SMUBiddingTransformer()
        
        print("Fitting SMUBiddingTransformer...")
        X_transformed = self.transformer.fit_transform(df_clean)
        self.is_fitted = True
        self.training_columns = list(X_transformed.columns)
        
        # Create temporal split
        test_mask = (df_clean['acad_year_start'] == 2024) & (df_clean['term'].isin(['T2', 'T3A', 'T3B', '2', '3A', '3B']))
        train_mask = ~test_mask
        
        X_train = X_transformed[train_mask].copy()
        X_test = X_transformed[test_mask].copy()
        y_train = df_clean[train_mask]['median_bid'].copy()
        y_test = df_clean[test_mask]['median_bid'].copy()
        
        # Add target column to features for saving
        X_train['target_median_bid'] = y_train.values
        X_test['target_median_bid'] = y_test.values
        
        print(f"Median regression - Training: {X_train.shape[0]}, Test: {X_test.shape[0]}")
        
        # Save datasets
        timestamp = datetime.now().strftime("%d%m%y%H%M%S")
        regression_folder = "script_output/model_training/regression"
        os.makedirs(regression_folder, exist_ok=True)
        
        train_path = os.path.join(regression_folder, f"regression_median_train_{timestamp}.csv")
        test_path = os.path.join(regression_folder, f"regression_median_test_{timestamp}.csv")
        
        X_train.to_csv(train_path, index=False)
        X_test.to_csv(test_path, index=False)
        
        print(f"Median regression training data saved to: {train_path}")
        print(f"Median regression test data saved to: {test_path}")

    def prepare_regression_data_min(self, df: pd.DataFrame) -> None:
        """
        Prepare regression training data for min_bid prediction with temporal split.
        """
        print("Preparing regression training data for min_bid prediction...")
        
        # Filter out zero/missing min bids
        df_clean = df[(df['min_bid'] > 0) & (df['min_bid'].notna())].copy()
        print(f"Records after filtering min_bid: {len(df_clean)}")
        
        # Initialize transformer if not already fitted
        if not self.is_fitted:
            
            self.transformer = SMUBiddingTransformer()
            
            print("Fitting SMUBiddingTransformer...")
            X_transformed = self.transformer.fit_transform(df_clean)
            self.is_fitted = True
            self.training_columns = list(X_transformed.columns)
        else:
            X_transformed = self.transformer.transform(df_clean)
        
        # Create temporal split
        test_mask = (df_clean['acad_year_start'] == 2024) & (df_clean['term'].isin(['T2', 'T3A', 'T3B', '2', '3A', '3B']))
        train_mask = ~test_mask
        
        X_train = X_transformed[train_mask].copy()
        X_test = X_transformed[test_mask].copy()
        y_train = df_clean[train_mask]['min_bid'].copy()
        y_test = df_clean[test_mask]['min_bid'].copy()
        
        # Add target column to features for saving
        X_train['target_min_bid'] = y_train.values
        X_test['target_min_bid'] = y_test.values
        
        print(f"Min regression - Training: {X_train.shape[0]}, Test: {X_test.shape[0]}")
        
        # Save datasets
        timestamp = datetime.now().strftime("%d%m%y%H%M%S")
        regression_folder = "script_output/model_training/regression"
        os.makedirs(regression_folder, exist_ok=True)
        
        train_path = os.path.join(regression_folder, f"regression_min_train_{timestamp}.csv")
        test_path = os.path.join(regression_folder, f"regression_min_test_{timestamp}.csv")
        
        X_train.to_csv(train_path, index=False)
        X_test.to_csv(test_path, index=False)
        
        print(f"Min regression training data saved to: {train_path}")
        print(f"Min regression test data saved to: {test_path}")

    def prepare_all_datasets(self, classification_df: pd.DataFrame, regression_df: pd.DataFrame) -> None:
        """
        OPTIMIZED: Fit the transformer once on the classification data, then reuse
        the fitted transformer to build the classification and regression datasets.
        """
        print("🚀 OPTIMIZED: Preparing all datasets with single transformation...")
        print("="*60)
        
        # STEP 1: Initialize and fit transformer ONCE on classification data
        print("Fitting transformer on classification data...")
        self.transformer = SMUBiddingTransformer()
        
        X_transformed_full = self.transformer.fit_transform(classification_df)
        self.is_fitted = True
        self.training_columns = list(X_transformed_full.columns)
        
        print(f"✅ Single transformation complete: {X_transformed_full.shape}")
        
        # STEP 2: Create all datasets from the same transformation
        timestamp = datetime.now().strftime("%d%m%y%H%M%S")
        
        # Classification data with temporal split
        print("\n📊 Creating classification datasets...")
        test_mask = (classification_df['acad_year_start'] == 2024) & (classification_df['term'].isin(['T2', 'T3A', 'T3B', '2', '3A', '3B']))
        
        # Add classification target
        X_transformed_full['bids'] = (classification_df['median_bid'] > 0) | (classification_df['min_bid'] > 0)
        
        X_class_train = X_transformed_full[~test_mask].copy()
        X_class_test = X_transformed_full[test_mask].copy()
        
        # Save classification data
        classification_folder = "script_output/model_training/classification"
        os.makedirs(classification_folder, exist_ok=True)
        
        X_class_train.to_csv(f"{classification_folder}/classification_train_{timestamp}.csv", index=False)
        X_class_test.to_csv(f"{classification_folder}/classification_test_{timestamp}.csv", index=False)
        
        print(f"📁 Classification: Train={X_class_train.shape[0]}, Test={X_class_test.shape[0]}")
        
        # STEP 3: Transform regression data (reuse fitted transformer)
        print("\n📈 Creating regression datasets...")
        
        # Filter for non-zero bids
        median_mask = (regression_df['median_bid'] > 0) & (regression_df['median_bid'].notna())
        min_mask = (regression_df['min_bid'] > 0) & (regression_df['min_bid'].notna())
        
        # Transform regression data using fitted transformer
        X_reg_median = self.transformer.transform(regression_df[median_mask])
        X_reg_min = self.transformer.transform(regression_df[min_mask])
        
        # Add targets
        X_reg_median['target_median_bid'] = regression_df[median_mask]['median_bid'].values
        X_reg_min['target_min_bid'] = regression_df[min_mask]['min_bid'].values
        
        # Temporal splits for regression
        reg_median_test_mask = (regression_df[median_mask]['acad_year_start'] == 2024) & (regression_df[median_mask]['term'].isin(['T2', 'T3A', 'T3B', '2', '3A', '3B']))
        reg_min_test_mask = (regression_df[min_mask]['acad_year_start'] == 2024) & (regression_df[min_mask]['term'].isin(['T2', 'T3A', 'T3B', '2', '3A', '3B']))
        
        # Split and save regression datasets
        regression_folder = "script_output/model_training/regression"
        os.makedirs(regression_folder, exist_ok=True)
        
        # Median regression
        X_reg_median[~reg_median_test_mask].to_csv(f"{regression_folder}/regression_median_train_{timestamp}.csv", index=False)
        X_reg_median[reg_median_test_mask].to_csv(f"{regression_folder}/regression_median_test_{timestamp}.csv", index=False)
        
        # Min regression  
        X_reg_min[~reg_min_test_mask].to_csv(f"{regression_folder}/regression_min_train_{timestamp}.csv", index=False)
        X_reg_min[reg_min_test_mask].to_csv(f"{regression_folder}/regression_min_test_{timestamp}.csv", index=False)
        
        print(f"📁 Median Regression: Train={(~reg_median_test_mask).sum()}, Test={reg_median_test_mask.sum()}")
        print(f"📁 Min Regression: Train={(~reg_min_test_mask).sum()}, Test={reg_min_test_mask.sum()}")        
        print(f"\n🎯 OPTIMIZATION COMPLETE!")
        print(f"⚡ Single transformation instead of 3 separate ones")
        print("="*60)

In [9]:
# File paths
classification_data_path = "script_output/model_training/classification/classification_model_data_160625133429.csv"
regression_data_path = "script_output/model_training/regression/regression_model_data_160625133429.csv"

# Initialize the pipeline
pipeline = SMUBiddingPipeline()

# Load both datasets
df_classification = pipeline.load_merged_data(classification_data_path)
df_regression = pipeline.load_merged_data(regression_data_path)

# Prepare all datasets with both dataframes
pipeline.prepare_all_datasets(df_classification, df_regression)

print("Pipeline setup complete!")
print(f"Transformer fitted: {pipeline.is_fitted}")
print(f"Feature columns: {len(pipeline.training_columns)}")

Loading merged data from: script_output/model_training/classification/classification_model_data_160625133429.csv
Loaded data shape: (121172, 12)

Data Quality Summary:
- Total records: 121172
- Unique courses: 701
- Date range: 2021 - 2024
- Terms: ['T1', 'T2', 'T3A', 'T3B']
Loading merged data from: script_output/model_training/regression/regression_model_data_160625133429.csv
Loaded data shape: (36667, 12)

Data Quality Summary:
- Total records: 36667
- Unique courses: 626
- Date range: 2021 - 2024
- Terms: ['T1', 'T2', 'T3A', 'T3B']
🚀 OPTIMIZED: Preparing all datasets with single transformation...
Fitting transformer on classification data...
Fitting transformer on 121172 rows...
✅ Single transformation complete: (121172, 18)

📊 Creating classification datasets...
📁 Classification: Train=114854, Test=6318

📈 Creating regression datasets...
📁 Median Regression: Train=33351, Test=3316
📁 Min Regression: Train=33351, Test=3316

🎯 OPTIMIZATION COMPLETE!
⚡ Single transformation instead of

In [10]:
# Create a sample user input (what a student might want to bid on)
user_input = pd.DataFrame({
    'course_code': ['MGMT715'],
    'course_name': ['Strategic Management'],
    'acad_year_start': [2025],
    'term': ['1'],
    'start_time': ['19:30'],
    'day_of_week': ['Mon,Thu'],
    'before_process_vacancy': [15],
    'bidding_window': ['Round 1 Window 1'],
    'instructor': ['JOHN DOE'],
    'section': ['G1']
})

print("🔍 Testing user input transformation...")
print(f"Input shape: {user_input.shape}")
print("\nUser input:")
print(user_input.to_string())

try:
    # Use the existing fitted transformer
    if 'pipeline' in locals() and pipeline.transformer is not None and pipeline.is_fitted:
        transformed_features = pipeline.transformer.transform(user_input)
        print(f"\n✅ SUCCESS! Transformed shape: {transformed_features.shape}")
        print(f"📊 Feature columns: {list(transformed_features.columns)}")
        print(f"🔢 Sample values:\n{transformed_features.iloc[0].head(10)}")
        
        # Show feature types
        print(f"\n📋 Feature breakdown:")
        print(f"   Categorical: {len(pipeline.transformer.get_categorical_features())}")
        print(f"   Numeric: {len(pipeline.transformer.get_numeric_features())}")
        print(f"   Total features: {len(pipeline.transformer.get_feature_names())}")
        
    else:
        print("❌ No fitted transformer found. Run the pipeline first!")
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("💡 Make sure you've run the pipeline training first!")

🔍 Testing user input transformation...
Input shape: (1, 10)

User input:
  course_code           course_name  acad_year_start term start_time day_of_week  before_process_vacancy    bidding_window instructor section
0     MGMT715  Strategic Management             2025    1      19:30     Mon,Thu                      15  Round 1 Window 1   JOHN DOE      G1

✅ SUCCESS! Transformed shape: (1, 18)
📊 Feature columns: ['subject_area', 'catalogue_no', 'round', 'window', 'before_process_vacancy', 'acad_year_start', 'term', 'start_time', 'course_name', 'section', 'instructor', 'has_mon', 'has_tue', 'has_wed', 'has_thu', 'has_fri', 'has_sat', 'has_sun']
🔢 Sample values:
subject_area                              MGMT
catalogue_no                               715
round                                        1
window                                       1
before_process_vacancy                      15
acad_year_start                           2025
term                                         1
sta