# **SMU Course Scraping Using Selenium for new classes**

<div style="background-color:#FFD700; padding:15px; border-radius:5px; border: 2px solid #FF4500;">
    
  <h1 style="color:#8B0000;">⚠️🚨 SCRAPE THIS DATA AT YOUR OWN RISK 🚨⚠️</h1>
  
  <p><strong>📌 If you need the data, please contact me directly.</strong> Only available to <strong>existing students</strong>.</p>

  <h3>🔗 📩 How to Get the Data?</h3>
  <p>📨 <strong>Reach out to me for access</strong> instead of scraping manually.</p>
  <p>Visit <a href='https://www.afterclass.io/'>AfterClass</a> to use the data for planning.</p>

</div>

<br>

### **Objective**
This script is designed to scrape SMU course details from the BOSS system using Selenium. The process involves:
1. Logging into the system manually to complete authentication (including the Microsoft Authenticator prompt).
2. Iteratively scraping class details for specified academic years and terms.
3. Writing the scraped data to structured CSV files.

The data is then ingested into [AfterClass.io](https://www.afterclass.io/) to serve students.

### **Script Structure**
1. **Setup**: Import libraries and initialize Selenium WebDriver.
2. **Login**: Wait for manual login and authentication.
3. **Scraping Logic**:
    - `BOSSClassScraper`: Scrapes class detail pages for each class number in the selected academic years and terms.
    - `ScrapeOverallResults`: Scrapes the Overall Results page for a given term, bidding round, and window.
4. **Execution**: Log in and start scraping.


---

## **1. Setup**

In [None]:
import os
os.environ['PGGSSENCMODE'] = 'disable'  # Disable GSSAPI encryption for libpq (psycopg2) connections

import re
import csv
import time
import pandas as pd
import random
import glob
import win32com.client as win32
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from webdriver_manager.chrome import ChromeDriverManager
from pathlib import Path
from thefuzz import fuzz
import uuid
import logging
import psycopg2
from typing import List, Optional, Tuple
from collections import Counter, defaultdict
from dotenv import load_dotenv
from datetime import datetime, timedelta
import webbrowser
import json
import google.generativeai as genai

In [None]:
# Define the academic term range you want to scrape or process.
# For a single term, set both START and END to the same value.
START_AY_TERM = '2025-26_T1'
END_AY_TERM = '2025-26_T1'
ACAD_TERM_ID = 'AY202526T1'

# Define the specific bidding round and window you want to target.
# Set to None to let the script auto-detect the current phase based on the schedule.
TARGET_ROUND = None   # e.g., '1A', '2', etc.
TARGET_WINDOW = None  # e.g., 1, 2, 3, etc.

# Central bidding schedule for each academic term.
# The script will use this to determine the correct folder names and bidding phases.
# Format: (results_datetime, "Full Bidding Window Name", "Folder_Suffix")
BIDDING_SCHEDULES = {
    '2025-26_T1': [
        (datetime(2025, 7, 9, 14, 0), "Round 1 Window 1", "R1W1"),
        (datetime(2025, 7, 11, 14, 0), "Round 1A Window 1", "R1AW1"),
        (datetime(2025, 7, 14, 14, 0), "Round 1A Window 2", "R1AW2"),
        (datetime(2025, 7, 16, 14, 0), "Round 1A Window 3", "R1AW3"),
        (datetime(2025, 7, 18, 14, 0), "Round 1B Window 1", "R1BW1"),
        (datetime(2025, 7, 21, 14, 0), "Round 1B Window 2", "R1BW2"),
        (datetime(2025, 7, 30, 14, 0), "Incoming Exchange Rnd 1C Win 1", "R1CW1"),
        (datetime(2025, 7, 31, 14, 0), "Incoming Exchange Rnd 1C Win 2", "R1CW2"),
        (datetime(2025, 8, 1, 14, 0), "Incoming Exchange Rnd 1C Win 3", "R1CW3"),
        (datetime(2025, 8, 11, 14, 0), "Incoming Freshmen Rnd 1 Win 1", "R1FW1"),
        (datetime(2025, 8, 12, 14, 0), "Incoming Freshmen Rnd 1 Win 2", "R1FW2"),
        (datetime(2025, 8, 13, 14, 0), "Incoming Freshmen Rnd 1 Win 3", "R1FW3"),
        (datetime(2025, 8, 14, 14, 0), "Incoming Freshmen Rnd 1 Win 4", "R1FW4"),
        (datetime(2025, 8, 20, 14, 0), "Round 2 Window 1", "R2W1"),
        (datetime(2025, 8, 22, 14, 0), "Round 2 Window 2", "R2W2"),
        (datetime(2025, 8, 25, 14, 0), "Round 2 Window 3", "R2W3"),
        (datetime(2025, 8, 27, 14, 0), "Round 2A Window 1", "R2AW1"),
        (datetime(2025, 8, 29, 14, 0), "Round 2A Window 2", "R2AW2"),
        (datetime(2025, 9, 1, 14, 0), "Round 2A Window 3", "R2AW3"),
    ]
    # You can add schedules for other terms here, e.g., '2025-26_T2': [...]
}

## **2. Scrape all BOSS data**

### **BOSS Class Scraper Summary**

#### **What This Code Does**
The `BOSSClassScraper` class automates the extraction of class timing data from SMU's BOSS system. It systematically scrapes class details across multiple academic terms and saves them as HTML files for further processing.

**Key Features:**
- **Automated Web Scraping**: Navigates through BOSS class detail pages using Selenium WebDriver
- **Flexible Term Range**: Dynamically derives academic years from input parameters (e.g., '2025-26_T1' to '2028-29_T2') rather than hardcoded lists
- **Early-Stopping Scan**: Scans class numbers from 1000 to 5000 and stops a term's scan after 300 consecutive empty records
- **Data Organization**: Saves HTML files in structured directories by academic term (`script_input/classTimingsFull/`)
- **Incremental CSV Updates**: Appends only new valid files to the existing CSV index, avoiding duplicates
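
The term-range expansion and the early-stopping scan described above can be sketched without a browser. This is a minimal illustration, not the scraper itself: `expand_terms` and `scan_until_gap` are hypothetical helper names, and `has_record` stands in for the real page fetch.

```python
def expand_terms(start_ay_term, end_ay_term, terms=("T1", "T2", "T3A", "T3B")):
    """Return every academic term from start to end, inclusive."""
    start_year = int(start_ay_term[:4])
    end_year = int(end_ay_term[:4])
    # Build the full ordered list first, e.g. '2025-26_T1', '2025-26_T2', ...
    all_terms = [
        f"{year}-{(year + 1) % 100:02d}_{term}"
        for year in range(start_year, end_year + 1)
        for term in terms
    ]
    start_idx = all_terms.index(start_ay_term)  # raises ValueError for an unknown term
    end_idx = all_terms.index(end_ay_term)
    return all_terms[start_idx:end_idx + 1]


def scan_until_gap(has_record, first=1000, last=5000, max_empty=300):
    """Collect class numbers that have a record, stopping after
    max_empty consecutive misses."""
    found, consecutive_empty = [], 0
    for class_num in range(first, last + 1):
        if has_record(class_num):
            found.append(class_num)
            consecutive_empty = 0  # any hit resets the gap counter
        else:
            consecutive_empty += 1
            if consecutive_empty >= max_empty:
                break
    return found


print(expand_terms("2025-26_T2", "2026-27_T1"))
```

Resetting the counter on every hit means sparse but real class numbers late in the range are still reached, as long as no gap exceeds the threshold.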

#### **What Is Required**

**Technical Dependencies:**
- Python packages: `selenium`, `webdriver-manager`, standard libraries (`os`, `time`, `csv`, `re`)
- Chrome browser and ChromeDriver (auto-managed)
- Network access to SMU's BOSS system

**User Requirements:**
- **Manual Authentication**: User must manually log in and complete Microsoft Authenticator process when prompted
- **SMU Credentials**: Valid access to BOSS system
- **Directory Structure**: Code creates `script_input/classTimingsFull/` for HTML files and `script_input/scraped_filepaths.csv` for the file index
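
As a rough sketch of how that directory layout fits together, the snippet below builds the path of one saved class page and parses the term and class number back out of a filename. The filename pattern and term codes follow the scraper code in this notebook; `html_path_for` and `parse_html_filename` are hypothetical helpers added only for illustration.

```python
import os
import re

# Term codes as used in the BOSS URL (taken from the scraper's term_code_map).
TERM_CODES = {"T1": "10", "T2": "20", "T3A": "31", "T3B": "32"}


def html_path_for(base_dir, ay_term, folder_suffix, class_num):
    """Build the path where one class page would be saved."""
    ay, term = ay_term.split("_")
    acad_term = f"{ay[2:4]}{TERM_CODES[term]}"  # '2025-26', 'T1' -> '2510'
    filename = f"SelectedAcadTerm={acad_term}&SelectedClassNumber={class_num:04}.html"
    return os.path.join(base_dir, ay_term, f"{ay_term}_{folder_suffix}", filename)


FILENAME_RE = re.compile(
    r"SelectedAcadTerm=(\d{2})(\d{2})&SelectedClassNumber=(\d{4})\.html$"
)


def parse_html_filename(path):
    """Return (ay_short, term_code, class_num), or None if the name does not match."""
    match = FILENAME_RE.search(path)
    if not match:
        return None
    ay_short, term_code, class_num = match.groups()
    return ay_short, term_code, int(class_num)


path = html_path_for("script_input/classTimingsFull", "2025-26_T1", "R1W1", 1234)
print(path)
print(parse_html_filename(path))
```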

**Usage in Jupyter Notebook:**
```python
# The scraper now uses the global configuration variables defined in cell 1.1
scraper = BOSSClassScraper()

# The start and end terms are read from START_AY_TERM and END_AY_TERM
success = scraper.run_full_scraping_process(START_AY_TERM, END_AY_TERM)
```

In [None]:
import os
import time
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from webdriver_manager.chrome import ChromeDriverManager
from datetime import datetime
import csv
from pathlib import Path

class BOSSClassScraper:
    """
    A class to scrape class details from BOSS (SMU's online class registration system).
    It performs a full scan of the class-number range for every bidding window and
    saves each class page as an HTML file, overwriting any existing copy.
    """
    
    def __init__(self):
        """
        Initialize the BOSS Class Scraper with configuration parameters.
        """
        self.term_code_map = {'T1': '10', 'T2': '20', 'T3A': '31', 'T3B': '32'}
        self.all_terms = ['T1', 'T2', 'T3A', 'T3B']
        self.driver = None
        self.min_class_number = 1000
        self.max_class_number = 5000
        self.consecutive_empty_threshold = 300
        
        # Use the global bidding schedule
        self.bidding_schedule = BIDDING_SCHEDULES

    def _get_bidding_round_info_for_term(self, ay_term, now):
        """
        Determines the bidding round folder name for a given academic term based on the current time.
        """
        # Get the schedule for the specific academic term
        schedule = self.bidding_schedule.get(ay_term)
        if not schedule:
            return None

        # Find the correct window from the schedule
        for results_date, _, folder_suffix in schedule:
            if now < results_date:
                return f"{ay_term}_{folder_suffix}"
        return None

    def wait_for_manual_login(self):
        """Wait for manual login and Microsoft Authenticator process completion."""
        print("Please log in manually and complete the Microsoft Authenticator process.")
        print("Waiting for BOSS dashboard to load...")
        wait = WebDriverWait(self.driver, 120)
        try:
            wait.until(EC.presence_of_element_located((By.ID, "Label_UserName")))
            username = self.driver.find_element(By.ID, "Label_UserName").text
            print(f"Login successful! Logged in as {username}")
        except TimeoutException:
            raise Exception("Login failed or timed out.")
        time.sleep(1)

    def scrape_and_save_html(self, start_ay_term=START_AY_TERM, end_ay_term=END_AY_TERM, base_dir='script_input/classTimingsFull'):
        """
        Scrapes class details, always performing a full scan from 1000-5000.
        """
        now = datetime.now()
        start_year = int(start_ay_term[:4])
        end_year = int(end_ay_term[:4])
        all_academic_years = [f"{year}-{(year + 1) % 100:02d}" for year in range(start_year, end_year + 1)]
        all_ay_terms = [f"{ay}_{term}" for ay in all_academic_years for term in self.all_terms]
        
        try:
            start_idx = all_ay_terms.index(start_ay_term)
            end_idx = all_ay_terms.index(end_ay_term)
        except ValueError:
            print("Invalid start or end term provided.")
            return
            
        ay_terms_to_scrape = all_ay_terms[start_idx:end_idx+1]
        
        for ay_term in ay_terms_to_scrape:
            print(f"\nProcessing Academic Term: {ay_term}")
            
            round_window_folder_name = self._get_bidding_round_info_for_term(ay_term, now)
            if not round_window_folder_name:
                print(f"Not in a bidding window for {ay_term} at this time. Skipping.")
                continue

            current_round_path = os.path.join(base_dir, ay_term, round_window_folder_name)
            os.makedirs(current_round_path, exist_ok=True)
            
            ay, term = ay_term.split('_')
            ay_short, term_code = ay[2:4], self.term_code_map.get(term, '10')

            # Always perform full scan regardless of previous rounds
            print(f"Performing full scan for {ay_term}.")
            consecutive_empty = 0
            for class_num in range(self.min_class_number, self.max_class_number + 1):
                was_scraped = self._scrape_single_class(current_round_path, ay_short, term_code, class_num)
                if was_scraped is None: # Error occurred, stop this scan
                    break
                if not was_scraped: # Page had no record
                    consecutive_empty += 1
                    if consecutive_empty >= self.consecutive_empty_threshold:
                        print(f"Stopping scan after {consecutive_empty} consecutive empty records.")
                        break
                else: # Successful scrape
                    consecutive_empty = 0
        print("\nScraping process completed.")
    
    def _scrape_single_class(self, target_path, ay_short, term_code, class_num):
        """
        Scrapes a single class number and saves the HTML, always overwriting existing files.
        Returns True if data was found, False if "No record found", None on error.
        """
        filename = f"SelectedAcadTerm={ay_short}{term_code}&SelectedClassNumber={class_num:04}.html"
        filepath = os.path.join(target_path, filename)

        # No existing-file check: always re-scrape and overwrite
        url = f"https://boss.intranet.smu.edu.sg/ClassDetails.aspx?SelectedClassNumber={class_num:04}&SelectedAcadTerm={ay_short}{term_code}&SelectedAcadCareer=UGRD"
        
        try:
            self.driver.get(url)
            # Robust wait for either content or an error message
            WebDriverWait(self.driver, 15).until(EC.any_of(
                EC.visibility_of_element_located((By.ID, "RadGrid_MeetingInfo_ctl00")),
                EC.presence_of_element_located((By.ID, "lblErrorDetails"))
            ))
            
            page_source = self.driver.page_source
            if "No record found" in page_source:
                print(f"No record for class {class_num}")
                return False
            
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write(page_source)
            print(f"Saved {filepath}")
            time.sleep(1) # Small delay to be polite
            return True
            
        except Exception as e:
            print(f"Error processing {url}: {str(e)}")
            time.sleep(5)
            return None # Indicate an error occurred

    def generate_scraped_filepaths_csv(self, base_dir='script_input/classTimingsFull', output_csv='script_input/scraped_filepaths.csv'):
        """Generates/appends to a CSV file with paths to all valid HTML files."""
        existing_filepaths = set()
        if os.path.exists(output_csv):
            try:
                with open(output_csv, 'r', encoding='utf-8', newline='') as csvfile:
                    reader = csv.reader(csvfile)
                    next(reader)
                    for row in reader:
                        if row: existing_filepaths.add(row[0])
            except (IOError, StopIteration) as e:
                print(f"Could not read existing CSV, will overwrite: {e}")

        new_filepaths = []
        for root, _, files in os.walk(base_dir):
            for file in files:
                if file.endswith('.html'):
                    filepath = os.path.join(root, file)
                    if filepath not in existing_filepaths:
                        new_filepaths.append(filepath)
        
        if not new_filepaths:
            print("No new valid HTML files found to add to the CSV.")
            return

        mode = 'a' if existing_filepaths else 'w'
        with open(output_csv, mode, newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            if mode == 'w':
                writer.writerow(['Filepath'])
            for path in new_filepaths:
                writer.writerow([path])
        
        print(f"CSV updated. Total valid files now: {len(existing_filepaths) + len(new_filepaths)}")

    def run_full_scraping_process(self, start_ay_term=START_AY_TERM, end_ay_term=END_AY_TERM):
        """Run the complete scraping process for a specified term range."""
        try:
            options = webdriver.ChromeOptions()
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=options)
            
            self.driver.get("https://boss.intranet.smu.edu.sg/")
            self.wait_for_manual_login()
            
            self.scrape_and_save_html(start_ay_term, end_ay_term)
            self.generate_scraped_filepaths_csv()
            
            return True
        except Exception as e:
            print(f"Error during scraping process: {str(e)}")
            return False
        finally:
            if self.driver:
                self.driver.quit()
            print("Process completed!")

In [None]:
# Run the scraper
scraper = BOSSClassScraper()
success = scraper.run_full_scraping_process(START_AY_TERM, END_AY_TERM)

## **Scrape Overall Bidding Results**

### **ScrapeOverallResults Summary**

#### **What This Code Does**
The `ScrapeOverallResults` class is designed to scrape the main "Overall Results" page from the BOSS system. This page provides a summary of bidding data for all courses in a specific round, including median/min bids and vacancy information. It is the primary source for historical and current bidding statistics.

The scraper operates in two main modes, controlled by the global configuration variables:

- **Automatic Phase Detection**: If `TARGET_ROUND` and `TARGET_WINDOW` are set to `None`, the scraper uses the current system time to check against the `BIDDING_SCHEDULES`. It automatically determines the most recently concluded bidding phase and scrapes its results. This is the default and recommended mode for running during the bidding period.

- **Manual Targeting**: If `TARGET_ROUND` and `TARGET_WINDOW` are set to specific values (e.g., `'1A'`, `2`), the scraper will ignore the current time and target that exact round and window for the academic term defined in `START_AY_TERM`.
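
The two modes can be sketched as a single resolution function. This is an illustration under stated assumptions rather than the class's actual method: `resolve_phase` is a hypothetical name, `SAMPLE_SCHEDULE` is a three-entry excerpt of the schedule, and the sketch assumes manual mode applies only when both the round and the window are set.

```python
import re
from datetime import datetime

# Excerpt of the bidding schedule as (results_datetime, phase_name) pairs.
SAMPLE_SCHEDULE = [
    (datetime(2025, 7, 9, 14, 0), "Round 1 Window 1"),
    (datetime(2025, 7, 11, 14, 0), "Round 1A Window 1"),
    (datetime(2025, 7, 30, 14, 0), "Incoming Exchange Rnd 1C Win 1"),
]

# Accepts both the long ("Round"/"Window") and short ("Rnd"/"Win") spellings.
PHASE_RE = re.compile(r"(?:Round|Rnd)\s+(\d+[A-Z]*)\s+(?:Window|Win)\s+(\d+)")


def resolve_phase(now, schedule, target_round=None, target_window=None):
    """Return (round, window) as strings, or (None, None) if nothing applies."""
    # Manual targeting: ignore the clock entirely (assumes both values are given).
    if target_round is not None and target_window is not None:
        return str(target_round), str(target_window)

    # Automatic detection: the latest phase whose results time has passed.
    latest = None
    for results_time, phase_name in schedule:
        if now >= results_time:
            latest = phase_name
        else:
            break  # the schedule is assumed to be sorted by time
    if latest is None:
        return None, None

    match = PHASE_RE.search(latest)
    return (match.group(1), match.group(2)) if match else (None, None)


print(resolve_phase(datetime(2025, 7, 31, 9, 0), SAMPLE_SCHEDULE))
print(resolve_phase(datetime(2025, 7, 31, 9, 0), SAMPLE_SCHEDULE, "1A", 2))
```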

**Key Features:**
- **Automated or Manual Mode**: Can either auto-detect the correct bidding phase or be manually aimed at a specific round/window.
- **Robust Form Interaction**: Reliably navigates the BOSS interface, selecting the correct term, round, and window from dropdown menus.
- **Full Data Extraction**: Scrapes all pages of the results table, setting the page size to 50 for efficiency.
- **Structured Output**: Saves the final, cleaned data into a single Excel file named after the academic term (e.g., `2025-26 T1.xlsx`) in the `script_input/overallBossResults/` directory.
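
To illustrate the structured-output step, the sketch below puts scraped rows into a fixed column order before writing them out. It deliberately swaps in the standard-library `csv` module for the Excel writer so that it runs without extra packages. `order_rows` and `rows_to_csv` are hypothetical helpers, the column list is a shortened excerpt, and the sample row is made-up placeholder data.

```python
import csv
import io

# Shortened excerpt of the desired column order used by the scraper.
DESIRED_COLUMNS = [
    "Term", "Session", "Bidding Window", "Course Code",
    "Section", "Median Bid", "Min Bid",
]


def order_rows(rows, columns=DESIRED_COLUMNS):
    """Keep only the desired columns, in order, filling gaps with ''."""
    return [{col: row.get(col, "") for col in columns} for row in rows]


def rows_to_csv(rows, columns=DESIRED_COLUMNS):
    """Serialize the ordered rows to CSV text (stand-in for the Excel export)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(order_rows(rows, columns))
    return buffer.getvalue()


# Placeholder row: the values are invented purely for the example.
sample_rows = [{"Course Code": "XX101", "Term": "2025-26 Term 1", "Extra": "ignored"}]
print(rows_to_csv(sample_rows))
```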

#### **What Is Required**

**Technical Dependencies:**
- All packages from the main setup cell.
- Network access to SMU's BOSS system.

**User Requirements:**
- **Manual Authentication**: User must manually log in and complete the Microsoft Authenticator process.
- **Global Configuration**: Relies on `START_AY_TERM`, `TARGET_ROUND`, `TARGET_WINDOW`, and `BIDDING_SCHEDULES` defined in the configuration cell.

**Usage in Jupyter Notebook:**
```python
# The scraper reads its targets from the global configuration variables.
# This example will use the values set in START_AY_TERM, TARGET_ROUND, and TARGET_WINDOW.
scraper = ScrapeOverallResults(headless=False, delay=5)
scraper.run(
    term=START_AY_TERM.replace('_', ' '), # Converts '2025-26_T1' to '2025-26 T1'
    bid_round=TARGET_ROUND,
    bid_window=TARGET_WINDOW
)
```

In [None]:
class ScrapeOverallResults:
    """
    BOSS Overall Results Scraper using Selenium
    
    This class scrapes course bidding results from the BOSS system with proper
    authentication, form interaction, and data extraction capabilities.
    """
    
    def __init__(self, headless=False, delay=5):
        """
        Initialize the scraper with configuration parameters.
        
        Args:
            headless (bool): Run browser in headless mode
            delay (int): Delay between requests in seconds
        """
        self.driver = None
        self.delay = delay
        self.headless = headless
        self.base_url = "https://boss.intranet.smu.edu.sg/OverallResults.aspx"
        
        # Column mapping and ordering as specified
        self.desired_columns = [
            'Term', 'Session', 'Bidding Window', 'Course Code', 'Description',
            'Section', 'Vacancy', 'Opening Vacancy', 'Before Process Vacancy',
            'D.I.C.E', 'After Process Vacancy', 'Enrolled Students',
            'Median Bid', 'Min Bid', 'Instructor', 'School/Department'
        ]
        
        # Use the global bidding schedule, extracting only the date and name
        # Assuming we are targeting the start term for this scraper.
        self.boss_schedule = []
        if START_AY_TERM in BIDDING_SCHEDULES:
            schedule_for_term = BIDDING_SCHEDULES[START_AY_TERM]
            # Keep only the datetime and full name for this class's logic
            self.boss_schedule = [(dt, name) for dt, name, suffix in schedule_for_term]
        
        # Setup logging
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)
        
    def _determine_current_bidding_phase(self):
        """
        Determine the current bidding phase based on current time
        
        Returns:
            tuple: (round, window) or (None, None) if no active phase
        """
        current_time = datetime.now()
        self.logger.info(f"Current time: {current_time}")
        
        # Find the most recent bidding phase whose results time has passed
        active_phase = None
        for schedule_time, phase_name in self.boss_schedule:
            if current_time >= schedule_time:
                active_phase = (schedule_time, phase_name)
            else:
                break
        
        if active_phase is None:
            self.logger.warning("No active bidding phase found - before first scheduled phase")
            return None, None
        
        schedule_time, phase_name = active_phase
        self.logger.info(f"Current bidding phase: {phase_name} (results released at {schedule_time})")
        
        # Parse the phase name to extract round and window
        # Handle different formats:
        # "Round 1 Window 1" -> ("1", "1")
        # "Round 1A Window 2" -> ("1A", "2")
        # "Round 2A Window 3" -> ("2A", "3")
        # "Incoming Exchange Rnd 1C Win 1" -> ("1C", "1")
        # "Incoming Freshmen Rnd 1 Win 1" -> ("1", "1")
        
        try:
            # Remove prefixes and normalize
            normalized = phase_name.replace("Incoming Exchange ", "").replace("Incoming Freshmen ", "")
            normalized = normalized.replace("Rnd ", "Round ").replace("Win ", "Window ")
            
            # Extract round and window using regex (`re` is imported in the setup cell)
            match = re.search(r'Round\s+(\d+[A-Z]*)\s+Window\s+(\d+)', normalized)
            
            if match:
                round_value = match.group(1)
                window_value = match.group(2)
                
                self.logger.info(f"Parsed phase: Round {round_value}, Window {window_value}")
                return round_value, window_value
            else:
                self.logger.warning(f"Could not parse phase name: {phase_name}")
                return None, None
                
        except Exception as e:
            self.logger.error(f"Error parsing phase name '{phase_name}': {str(e)}")
            return None, None
    
    def _setup_driver(self):
        """Setup Chrome WebDriver with appropriate options"""
        chrome_options = Options()
        
        if self.headless:
            chrome_options.add_argument("--headless")
        
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        
        try:
            self.driver = webdriver.Chrome(options=chrome_options)
            self.logger.info("Chrome WebDriver initialized successfully")
        except Exception as e:
            self.logger.error(f"Failed to initialize WebDriver: {str(e)}")
            raise
    
    def wait_for_manual_login(self):
        """
        Wait for manual login and Microsoft Authenticator process completion.
        """
        print("Please log in manually and complete the Microsoft Authenticator process.")
        print("Waiting for BOSS dashboard to load...")
        
        wait = WebDriverWait(self.driver, 120)
        
        try:
            # Wait for login success indicators
            wait.until(EC.presence_of_element_located((By.ID, "Label_UserName")))
            wait.until(EC.presence_of_element_located((By.XPATH, "//a[contains(text(),'Sign out')]")))
            
            username = self.driver.find_element(By.ID, "Label_UserName").text
            print(f"Login successful! Logged in as {username}")
            
        except TimeoutException:
            print("Login failed or timed out. Could not detect login elements.")
            raise Exception("Login failed")
        
        time.sleep(2)
    
    def _navigate_to_overall_results(self):
        """Navigate to the Overall Results page"""
        try:
            self.driver.get(self.base_url)
            
            # Wait for page to load
            wait = WebDriverWait(self.driver, 30)
            wait.until(EC.presence_of_element_located((By.ID, "rcboCourseCareer")))
            
            self.logger.info("Successfully navigated to Overall Results page")
            time.sleep(2)
            
        except Exception as e:
            self.logger.error(f"Failed to navigate to Overall Results page: {str(e)}")
            raise
    
    def _select_course_career(self, career="Undergraduate"):
        """Select course career (default: Undergraduate)"""
        try:
            # The dropdown is already set to Undergraduate by default
            career_input = self.driver.find_element(By.ID, "rcboCourseCareer_Input")
            current_value = career_input.get_attribute("value")
            
            if current_value != career:
                # If we need to change it, click the dropdown arrow
                dropdown_arrow = self.driver.find_element(By.ID, "rcboCourseCareer_Arrow")
                dropdown_arrow.click()
                time.sleep(1)
                
                # Select the desired option
                option = self.driver.find_element(By.XPATH, f"//li[@class='rcbItem' and text()='{career}']")
                option.click()
                time.sleep(1)
            
            self.logger.info(f"Course career set to: {career}")
            
        except Exception as e:
            self.logger.error(f"Failed to select course career: {str(e)}")
            raise
    
    def _select_term(self, term):
        """
        Selects a term ONLY if it's not already the selected term.
        
        Args:
            term (str): The full-text term to select (e.g., '2025-26 Term 1').
        """
        try:
            wait = WebDriverWait(self.driver, 10)
            
            # 1. First, check the currently displayed value in the term input box.
            current_term_input = self.driver.find_element(By.ID, "rcboTerm_Input")
            current_term_value = current_term_input.get_attribute("value").strip()
            
            # 2. Compare with the desired term. If they match, do nothing.
            if current_term_value == term:
                self.logger.info(f"Term '{term}' is already selected. Skipping interaction.")
                return

            # 3. If the term needs to be changed, proceed with the selection logic.
            self.logger.info(f"Current term is '{current_term_value}', changing to '{term}'.")
            term_arrow = wait.until(EC.element_to_be_clickable((By.ID, "rcboTerm_Arrow")))
            self.driver.execute_script("arguments[0].click();", term_arrow)
            
            dropdown_div = wait.until(EC.visibility_of_element_located((By.ID, "rcboTerm_DropDown")))
            
            selected_checkboxes = dropdown_div.find_elements(By.XPATH, ".//input[@type='checkbox' and @checked='checked']")
            for checkbox in selected_checkboxes:
                self.driver.execute_script("arguments[0].checked = false;", checkbox)
            
            term_checkbox_xpath = f"//div[@id='rcboTerm_DropDown']//label[contains(., '{term}')]/input[@type='checkbox']"
            term_checkbox = wait.until(EC.presence_of_element_located((By.XPATH, term_checkbox_xpath)))

            self.driver.execute_script("arguments[0].click();", term_checkbox)
            
            self.driver.find_element(By.TAG_NAME, "body").click()
            time.sleep(1)
            
            self.logger.info(f"Term selected: {term}")
            
        except (NoSuchElementException, TimeoutException) as e:
            self.logger.error(f"Failed to select term '{term}'. The element could not be found or timed out: {e}")
            with open("error_page_source.html", "w", encoding="utf-8") as f:
                f.write(self.driver.page_source)
            self.logger.info("Page HTML at the time of error saved to 'error_page_source.html'.")
            raise e
    
    def _select_bid_round(self, round_value=None):
        """
        Select bid round
        
        Args:
            round_value (str): Round to select (e.g., '1', '1A', '2', '1F')
                            If None, leave as default (empty)
                            Note: '1F' is automatically mapped to '1' for dropdown selection
        """
        try:
            if round_value is None:
                self.logger.info("Bid round left as default (empty)")
                return
            
            # Map round values that don't exist in dropdown to valid options
            original_round_value = round_value
            if round_value == '1F':  # Freshmen Round 1 maps to Round 1
                round_value = '1'
                self.logger.info(f"Mapped round '{original_round_value}' to '{round_value}' for dropdown selection")
            
            # Click the round dropdown arrow
            round_arrow = self.driver.find_element(By.ID, "rcboBidRound_Arrow")
            round_arrow.click()
            time.sleep(1)
            
            # Select the round option
            round_option = self.driver.find_element(
                By.XPATH, 
                f"//div[@id='rcboBidRound_DropDown']//li[@class='rcbItem' and text()='{round_value}']"
            )
            round_option.click()
            time.sleep(1)
            
            self.logger.info(f"Bid round selected: {round_value} (original: {original_round_value})")
            
        except Exception as e:
            self.logger.error(f"Failed to select bid round '{original_round_value}' (mapped to '{round_value}'): {str(e)}")
            raise

    def _select_bid_window(self, window_value=None):
        """
        Select bid window
        
        Args:
            window_value (str): Window to select (e.g., '1', '2', '3')
                               If None, leave as default (empty)
        """
        try:
            if window_value is None:
                self.logger.info("Bid window left as default (empty)")
                return
            
            # Click the window dropdown arrow
            window_arrow = self.driver.find_element(By.ID, "rcboBidWindow_Arrow")
            window_arrow.click()
            time.sleep(1)
            
            # Select the window option
            window_option = self.driver.find_element(
                By.XPATH, 
                f"//div[@id='rcboBidWindow_DropDown']//li[@class='rcbItem' and text()='{window_value}']"
            )
            window_option.click()
            time.sleep(1)
            
            self.logger.info(f"Bid window selected: {window_value}")
            
        except Exception as e:
            self.logger.error(f"Failed to select bid window '{window_value}': {str(e)}")
            raise
    
    def _click_search(self):
        """Click the search button to submit the form"""
        try:
            search_button = self.driver.find_element(By.ID, "RadButton_Search_input")
            search_button.click()
            
            # Wait for results to load
            wait = WebDriverWait(self.driver, 30)
            wait.until(EC.presence_of_element_located((By.ID, "RadGrid_OverallResults_ctl00")))
            
            self.logger.info("Search completed successfully")
            time.sleep(3)  # Give extra time for data to load
            
        except Exception as e:
            self.logger.error(f"Failed to click search or load results: {str(e)}")
            raise
    
    def _set_page_size_to_50(self):
        """Set the page size dropdown to 50 records per page"""
        try:
            # Click the page size dropdown arrow
            page_size_arrow = self.driver.find_element(
                By.ID, "RadGrid_OverallResults_ctl00_ctl03_ctl01_PageSizeComboBox_Arrow"
            )
            page_size_arrow.click()
            time.sleep(1)
            
            # Select 50 from the dropdown
            option_50 = self.driver.find_element(
                By.XPATH, 
                "//div[@id='RadGrid_OverallResults_ctl00_ctl03_ctl01_PageSizeComboBox_DropDown']//li[text()='50']"
            )
            option_50.click()
            
            # Wait for page to reload with new page size
            time.sleep(5)  # Extended wait for page reload
            
            self.logger.info("Page size set to 50 records per page")
            
        except Exception as e:
            self.logger.error(f"Failed to set page size to 50: {str(e)}")
            # Continue anyway, might work with default page size

    def _sort_by_bidding_window(self):
        """Sort the results by bidding window to get Incoming Freshmen first"""
        try:
            # Click the Bidding Window header to sort
            sort_link = self.driver.find_element(
                By.XPATH, 
                "//a[contains(@onclick, 'EVENT_TYPE_DESCRIPTION') and contains(text(), 'Bidding Window')]"
            )
            sort_link.click()
            
            # Wait for table to reload after sorting
            time.sleep(3)
            wait = WebDriverWait(self.driver, 15)
            wait.until(EC.presence_of_element_located((By.ID, "RadGrid_OverallResults_ctl00")))
            
            self.logger.info("Successfully sorted by Bidding Window")
            
        except Exception as e:
            self.logger.error(f"Failed to sort by bidding window: {str(e)}")
            # Continue without sorting rather than failing completely

    def _extract_table_data(self, stop_on_bidding_window_change=False, last_bidding_window=None):
        """
        Extract data from the current page table with improved robustness
        
        Args:
            stop_on_bidding_window_change (bool): Whether to stop when bidding window changes
            last_bidding_window (str): The last bidding window seen (for change detection)
        
        Returns:
            tuple: (page_data, should_stop, current_bidding_window)
        """
        try:
            # Wait for table to be fully loaded
            wait = WebDriverWait(self.driver, 15)
            wait.until(EC.presence_of_element_located((By.ID, "RadGrid_OverallResults_ctl00")))
            
            # Additional wait for data to populate
            time.sleep(2)
            
            # Find the main table
            table = self.driver.find_element(By.ID, "RadGrid_OverallResults_ctl00")
            
            # Get all rows in the table
            all_rows = table.find_elements(By.TAG_NAME, "tr")
            self.logger.info(f"Found {len(all_rows)} total rows in table")
            
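            # Imported locally in case the setup cell did not import this exception
            from selenium.common.exceptions import StaleElementReferenceException
            
            # Filter for data rows only (rgRow and rgAltRow classes)
            data_rows = []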
            for row in all_rows:
                try:
                    row_class = row.get_attribute("class") or ""
                    if "rgRow" in row_class or "rgAltRow" in row_class:
                        data_rows.append(row)
                except StaleElementReferenceException:
                    continue
            
            self.logger.info(f"Found {len(data_rows)} data rows")
            
            if len(data_rows) == 0:
                self.logger.warning("No data rows found - checking table structure")
                self._debug_table_content()
                # Return the same 3-tuple shape the caller unpacks
                return [], False, None
                        
            page_data = []
            current_bidding_window = None
            should_stop = False

            for i, row in enumerate(data_rows):
                try:
                    # Get all cells in the row
                    cells = row.find_elements(By.TAG_NAME, "td")
                    
                    if len(cells) < 16:
                        self.logger.warning(f"Row {i} has only {len(cells)} cells, expected 16")
                        continue
                    
                    # Extract current row's bidding window for change detection
                    row_bidding_window = cells[2].text.strip()
                    
                    # Compare against the most recent window seen: the previous row,
                    # or the previous page's window when this is the first row
                    previous_window = current_bidding_window or last_bidding_window
                    if stop_on_bidding_window_change and previous_window:
                        if (previous_window.startswith("Incoming Freshmen") and 
                            not row_bidding_window.startswith("Incoming Freshmen")):
                            self.logger.info(f"Bidding window changed from '{previous_window}' to '{row_bidding_window}' - stopping extraction")
                            should_stop = True
                            break
                    
                    # Remember this row's window so a change is caught mid-page as well
                    current_bidding_window = row_bidding_window
                    
                    # Extract section text from cell 5 (which contains a link)
                    section_cell = cells[5]
                    section_text = ""
                    try:
                        # Try to get link text first
                        link = section_cell.find_element(By.TAG_NAME, "a")
                        section_text = link.get_attribute("title") or link.text.strip()
                    except NoSuchElementException:
                        # If no link, get cell text directly
                        section_text = section_cell.text.strip()
                    
                    # Create record with proper column mapping
                    record = {
                        'Term': cells[0].text.strip(),
                        'Session': cells[1].text.strip(), 
                        'Bidding Window': cells[2].text.strip(),
                        'Course Code': cells[3].text.strip(),
                        'Description': cells[4].text.strip(),
                        'Section': section_text,
                        'Median Bid': cells[6].text.strip(),
                        'Min Bid': cells[7].text.strip(),
                        'Vacancy': cells[8].text.strip(),
                        'Opening Vacancy': cells[9].text.strip(),
                        'Before Process Vacancy': cells[10].text.strip(),
                        'After Process Vacancy': cells[11].text.strip(),
                        'D.I.C.E': cells[12].text.strip(),
                        'Enrolled Students': cells[13].text.strip(),
                        'Instructor': cells[14].text.strip(),
                        'School/Department': cells[15].text.strip()
                    }
                    
                    # Clean up data
                    cleaned_record = {}
                    for key, value in record.items():
                        # Normalize non-breaking spaces first, then collapse whitespace runs
                        cleaned_value = str(value).replace('\u00a0', ' ').replace('&nbsp;', ' ')
                        cleaned_value = re.sub(r'\s+', ' ', cleaned_value).strip()
                        
                        # Handle empty values
                        if cleaned_value == '':
                            cleaned_value = '-'
                        
                        # Convert '-' to '0' for Median Bid and Min Bid fields
                        if key in ['Median Bid', 'Min Bid'] and cleaned_value == '-':
                            cleaned_value = '0'
                        
                        cleaned_record[key] = cleaned_value
                    
                    # Only add record if it has a valid course code
                    if cleaned_record['Course Code'] and cleaned_record['Course Code'] != '-':
                        page_data.append(cleaned_record)
                        
                        # Log first record for verification
                        if len(page_data) == 1:
                            self.logger.info(f"Sample record: {cleaned_record['Course Code']} - {cleaned_record['Description'][:30]}...")
                    
                except Exception as e:
                    self.logger.warning(f"Error processing row {i}: {str(e)}")
                    continue
            
            self.logger.info(f"Successfully extracted {len(page_data)} valid records")
            return page_data, should_stop, current_bidding_window
            
        except Exception as e:
            self.logger.error(f"Failed to extract table data: {str(e)}")
            self._debug_table_content()
            # Return the same 3-tuple shape the caller unpacks
            return [], False, None
    
    def _debug_table_content(self):
        """Debug method to inspect table content when extraction fails"""
        try:
            self.logger.info("=== TABLE DEBUG INFO ===")
            
            # Check if main table exists
            try:
                table = self.driver.find_element(By.ID, "RadGrid_OverallResults_ctl00")
                self.logger.info("✓ Main table found")
            except NoSuchElementException:
                self.logger.error("✗ Main table NOT found")
                return
            
            # Check all rows
            all_rows = table.find_elements(By.TAG_NAME, "tr")
            self.logger.info(f"Total rows in table: {len(all_rows)}")
            
            # Analyze first few rows
            for i, row in enumerate(all_rows[:5]):
                try:
                    row_class = row.get_attribute("class") or "no-class"
                    cells = row.find_elements(By.TAG_NAME, "td") + row.find_elements(By.TAG_NAME, "th")
                    cell_count = len(cells)
                    
                    first_cell_text = ""
                    if cells:
                        first_cell_text = cells[0].text.strip()[:50]
                    
                    is_data_row = "rgRow" in row_class or "rgAltRow" in row_class
                    
                    self.logger.info(f"Row {i}: class='{row_class}', cells={cell_count}, data_row={is_data_row}")
                    self.logger.info(f"  First cell: '{first_cell_text}'")
                    
                except Exception as e:
                    self.logger.warning(f"Error analyzing row {i}: {str(e)}")
            
            # Check for error messages in the page
            error_elements = self.driver.find_elements(By.CLASS_NAME, "error")
            if error_elements:
                for error in error_elements:
                    if error.text.strip():
                        self.logger.warning(f"Error message found: {error.text}")
            
            # Check if there's a "no data" message
            no_data_elements = self.driver.find_elements(By.XPATH, "//*[contains(text(), 'No record') or contains(text(), 'no data') or contains(text(), 'No data')]")
            if no_data_elements:
                for element in no_data_elements:
                    self.logger.warning(f"No data message: {element.text}")
            
            self.logger.info("=== END TABLE DEBUG ===")
            
        except Exception as e:
            self.logger.error(f"Debug method failed: {str(e)}")
    
    def _has_next_page(self):
        """Check if there is a next page available using flexible selectors"""
        try:
            # Method 1: Look for Next Page button by title attribute (most reliable)
            next_buttons = self.driver.find_elements(
                By.XPATH, "//input[@title='Next Page' and contains(@name, 'RadGrid_OverallResults')]"
            )
            
            if next_buttons:
                next_button = next_buttons[0]
                is_enabled = next_button.is_enabled()
                button_class = next_button.get_attribute("class") or ""
                is_not_disabled = "disabled" not in button_class.lower()
                
                self.logger.debug(f"Next button found: enabled={is_enabled}, class='{button_class}'")
                return is_enabled and is_not_disabled
            
            # Method 2: Look for Next Page button by class
            next_buttons_by_class = self.driver.find_elements(By.CLASS_NAME, "rgPageNext")
            if next_buttons_by_class:
                next_button = next_buttons_by_class[0]
                is_enabled = next_button.is_enabled()
                button_class = next_button.get_attribute("class") or ""
                is_not_disabled = "disabled" not in button_class.lower()
                
                self.logger.debug(f"Next button (by class) found: enabled={is_enabled}, class='{button_class}'")
                return is_enabled and is_not_disabled
            
            # Method 3: Check pagination info to see if we're on the last page
            try:
                # Look for pagination info like "1129 items in 23 pages"
                info_elements = self.driver.find_elements(By.CLASS_NAME, "rgInfoPart")
                if info_elements:
                    info_text = info_elements[0].text
                    self.logger.debug(f"Pagination info: {info_text}")
                    
                    # Extract the total page count
                    match = re.search(r'(\d+)\s+items\s+in\s+(\d+)\s+pages', info_text)
                    if match:
                        total_pages = int(match.group(2))
                        
                        # Find current page by looking for rgCurrentPage
                        current_page_elements = self.driver.find_elements(By.CLASS_NAME, "rgCurrentPage")
                        if current_page_elements:
                            current_page_text = current_page_elements[0].text.strip()
                            if current_page_text.isdigit():
                                current_page = int(current_page_text)
                                has_next = current_page < total_pages
                                
                                self.logger.info(f"Pagination: Page {current_page} of {total_pages} (has_next: {has_next})")
                                return has_next
            except Exception as e:
                self.logger.debug(f"Error checking pagination info: {str(e)}")
            
            self.logger.warning("No Next Page button found using any method")
            return False
            
        except Exception as e:
            self.logger.error(f"Error checking for next page: {str(e)}")
            return False
    
    def _click_next_page(self):
        """Click the next page button using flexible selectors"""
        try:
            next_button = None
            
            # Method 1: Find by title attribute
            next_buttons = self.driver.find_elements(
                By.XPATH, "//input[@title='Next Page' and contains(@name, 'RadGrid_OverallResults')]"
            )
            
            if next_buttons:
                next_button = next_buttons[0]
                self.logger.debug("Found Next button by title attribute")
            else:
                # Method 2: Find by class
                next_buttons_by_class = self.driver.find_elements(By.CLASS_NAME, "rgPageNext")
                if next_buttons_by_class:
                    next_button = next_buttons_by_class[0]
                    self.logger.debug("Found Next button by class")
            
            if next_button and next_button.is_enabled():
                # Log button details for debugging
                button_name = next_button.get_attribute("name")
                self.logger.debug(f"Clicking Next button: {button_name}")
                
                next_button.click()
                
                # Wait for page to load
                time.sleep(self.delay)
                
                # Wait for table to be updated with longer timeout
                wait = WebDriverWait(self.driver, 15)
                wait.until(EC.presence_of_element_located((By.ID, "RadGrid_OverallResults_ctl00")))
                
                # Additional wait for content to fully load
                time.sleep(2)
                
                self.logger.info("Successfully navigated to next page")
                return True
            else:
                if next_button:
                    self.logger.info("Next button found but is disabled")
                else:
                    self.logger.info("No Next button found")
                return False
                
        except Exception as e:
            self.logger.error(f"Failed to click next page: {str(e)}")
            return False
    
    def _generate_filename(self, term):
        """
        Generate filename based on term
        
        Args:
            term (str): Term like '2025-26 Term 1'
            
        Returns:
            str: Filename like '2025-26 T1.xlsx' (the space from the term string is kept)
        """
        # Convert term format
        term_map = {
            'Term 1': 'T1',
            'Term 2': 'T2', 
            'Term 3A': 'T3A',
            'Term 3B': 'T3B'
        }
        
        filename = term
        for full_term, short_term in term_map.items():
            if full_term in term:
                filename = term.replace(full_term, short_term)
                break
        
        return filename + '.xlsx'
    
    def _save_to_excel(self, data, filename):
        """
        Save data to Excel file, concatenating if file exists
        
        Args:
            data (list): List of dictionaries containing the data
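            filename (str): Excel filename to save to
        
        Returns:
            int or None: Total number of records in the file after saving,
                         or None if there was no data to save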
        """
        try:
            if not data:
                self.logger.warning("No data to save")
                return
            
            # Create DataFrame with desired column order
            new_df = pd.DataFrame(data)
            new_df = new_df[self.desired_columns]
            
            # Check if file exists
            if os.path.exists(filename):
                self.logger.info(f"File {filename} exists, concatenating data...")
                
                try:
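                    # Read existing data as strings so rows compare equal to the
                    # freshly scraped (all-string) records when dropping duplicates
                    existing_df = pd.read_excel(filename, engine='openpyxl', dtype=str)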
                    
                    # Concatenate and remove duplicates
                    combined_df = pd.concat([existing_df, new_df], ignore_index=True)
                    combined_df = combined_df.drop_duplicates().reset_index(drop=True)
                    
                    self.logger.info(f"Combined {len(existing_df)} existing + {len(new_df)} new = {len(combined_df)} total records")
                    
                except Exception as e:
                    self.logger.error(f"Error reading existing file, creating new: {str(e)}")
                    combined_df = new_df
            else:
                combined_df = new_df
            
            # Save to Excel
            combined_df.to_excel(filename, index=False, engine='openpyxl')
            
            self.logger.info(f"Data saved to {filename}")
            self.logger.info(f"Total records in file: {len(combined_df)}")
            
            return len(combined_df)
            
        except Exception as e:
            self.logger.error(f"Failed to save data to Excel: {str(e)}")
            raise
    
    def _get_current_page_info(self):
        """Get current page number and total pages"""
        try:
            # Method 1: From pagination info text
            info_elements = self.driver.find_elements(By.CLASS_NAME, "rgInfoPart")
            if info_elements:
                info_text = info_elements[0].text
                # Extract item and page totals from text like "1129 items in 23 pages"
                match = re.search(r'(\d+)\s+items\s+in\s+(\d+)\s+pages', info_text)
                if match:
                    total_items = int(match.group(1))
                    total_pages = int(match.group(2))
                    
                    # Get current page from rgCurrentPage element
                    current_page_elements = self.driver.find_elements(By.CLASS_NAME, "rgCurrentPage")
                    if current_page_elements:
                        current_page_text = current_page_elements[0].text.strip()
                        if current_page_text.isdigit():
                            current_page = int(current_page_text)
                            return current_page, total_pages, total_items
            
            return None, None, None
            
        except Exception as e:
            self.logger.debug(f"Error getting page info: {str(e)}")
            return None, None, None
    
    def scrape_term_data(self, term, bid_round=None, bid_window=None, output_dir="./"):
        """
        Scrape data for a specific term with improved pagination handling
        
        Args:
            term (str): Term to scrape (e.g., '2025-26 Term 1')
            bid_round (str): Specific bid round to filter by
            bid_window (str): Specific bid window to filter by
            output_dir (str): Directory to save the Excel file
        
        Returns:
            list: Records (dicts) collected during this call
        """
        try:
            # Setup driver if not already done
            if self.driver is None:
                self._setup_driver()
            
            # Navigate to page
            self._navigate_to_overall_results()
            
            # Fill form
            self._select_course_career("Undergraduate")
            self._select_term(term)
            self._select_bid_round(bid_round)
            self._select_bid_window(bid_window)
            
            # Submit search
            self._click_search()
            
            # Set page size to 50
            self._set_page_size_to_50()

            # Sort by bidding window to get Incoming Freshmen first
            self._sort_by_bidding_window()
            
            # Get initial page information
            current_page, total_pages, total_items = self._get_current_page_info()
            if total_pages:
                self.logger.info(f"Starting scrape: {total_items} total items across {total_pages} pages")
            
            # Collect all data from all pages
            all_data = []
            page_num = 1
            max_pages = 200  # Increased safety limit
            last_bidding_window = None
            should_stop_scraping = False

            while page_num <= max_pages and not should_stop_scraping:
                # Get current page info for verification
                current_page, total_pages, total_items = self._get_current_page_info()
                
                if current_page and total_pages:
                    self.logger.info(f"Scraping page {current_page} of {total_pages} (iteration {page_num})...")
                else:
                    self.logger.info(f"Scraping page {page_num}...")
                
                # Extract data from current page with early termination support
                page_data, should_stop, current_bidding_window = self._extract_table_data(
                    stop_on_bidding_window_change=True, 
                    last_bidding_window=last_bidding_window
                )
                
                # Update tracking variables
                if current_bidding_window:
                    last_bidding_window = current_bidding_window
                if should_stop:
                    should_stop_scraping = True
                
                if page_data:
                    all_data.extend(page_data)
                    self.logger.info(f"Page {page_num}: Found {len(page_data)} records")
                elif not should_stop_scraping:
                    self.logger.warning(f"Page {page_num}: No data found")
                    
                    # If first page has no data, something is wrong
                    if page_num == 1:
                        self.logger.error("No data on first page - check search criteria or page structure")
                        break
                
                # Stop before paginating further once the bidding window has changed
                if should_stop_scraping:
                    self.logger.info("Early termination triggered due to bidding window change")
                    break
                
                # Check if we've reached the last page using multiple methods
                if current_page and total_pages and current_page >= total_pages:
                    self.logger.info(f"Reached last page ({current_page}/{total_pages})")
                    break
                
                # Check if there's a next page using our improved method
                if self._has_next_page():
                    # Store current page for verification
                    old_page = current_page
                    
                    if self._click_next_page():
                        # Verify we actually moved to next page
                        time.sleep(1)  # Brief wait
                        new_current_page, _, _ = self._get_current_page_info()
                        
                        if new_current_page and old_page and new_current_page <= old_page:
                            self.logger.warning(f"Page didn't advance (was {old_page}, now {new_current_page})")
                            break
                        
                        page_num += 1
                        time.sleep(self.delay)  # Rate limiting
                    else:
                        self.logger.info("Failed to navigate to next page, stopping")
                        break
                else:
                    self.logger.info("No more pages available")
                    break
            
            # Generate filename and save data
            if all_data:
                filename = self._generate_filename(term)
                filepath = os.path.join(output_dir, filename)
                
                total_records = self._save_to_excel(all_data, filepath)
                
                self.logger.info(f"Scraping completed for {term}")
                self.logger.info(f"Records collected this session: {len(all_data)}")
                self.logger.info(f"Total records in file: {total_records}")
                
                # Final verification
                if current_page and total_pages:
                    expected_total = total_items if total_items else "unknown"
                    self.logger.info(f"Expected ~{expected_total} total records from {total_pages} pages")
            else:
                self.logger.error("No data collected for any page")
            
            return all_data
            
        except Exception as e:
            self.logger.error(f"Failed to scrape term data: {str(e)}")
            raise
    
    def scrape_multiple_terms(self, terms_config, output_dir="./"):
        """
        Scrape data for multiple terms
        
        Args:
            terms_config (list): List of dictionaries with term configurations
                                Example: [
                                    {'term': '2025-26 Term 1', 'round': '1', 'window': '1'},
                                    {'term': '2024-25 Term 2', 'round': None, 'window': None}
                                ]
            output_dir (str): Directory to save Excel files
        """
        try:
            # Create output directory
            os.makedirs(output_dir, exist_ok=True)
            
            # Setup driver
            self._setup_driver()
            
            # Navigate to login page and wait for manual login
            self.driver.get("https://boss.intranet.smu.edu.sg/")
            self.wait_for_manual_login()
            
            # Process each term configuration
            for i, config in enumerate(terms_config):
                try:
                    self.logger.info(f"Processing term {i+1}/{len(terms_config)}: {config['term']}")
                    
                    data = self.scrape_term_data(
                        term=config['term'],
                        bid_round=config.get('round'),
                        bid_window=config.get('window'),
                        output_dir=output_dir
                    )
                    
                    self.logger.info(f"Completed {config['term']}: {len(data)} records")
                    
                    # Delay between terms
                    if i < len(terms_config) - 1:
                        self.logger.info(f"Waiting {self.delay} seconds before next term...")
                        time.sleep(self.delay)
                    
                except Exception as e:
                    self.logger.error(f"Failed to process term {config['term']}: {str(e)}")
                    continue
            
            self.logger.info("All terms processing completed")
            
        except Exception as e:
            self.logger.error(f"Failed to scrape multiple terms: {str(e)}")
            raise
        finally:
            self.close()

    def _transform_term_format(self, short_term):
        """
        Converts a short-form term into the website's full-text format.
        Example: '2025-26_T1' -> '2025-26 Term 1'
        
        Args:
            short_term (str): The term in short format (e.g., 'YYYY-YY_TX').
            
        Returns:
            str: The term in the website's format.
        """
        # Mapping from short-form to the website's text.
        term_map = {
            'T1': 'Term 1',
            'T2': 'Term 2',
            'T3A': 'Term 3A',
            'T3B': 'Term 3B'
        }
        
        try:
            # Split the string into the year part and the term part (e.g., '2025-26' and 'T1')
            year_part, term_part = short_term.split('_')
            
            # Look up the full term name from our map.
            full_term_name = term_map.get(term_part)
            
            if full_term_name:
                # Combine them back into the final format.
                return f"{year_part} {full_term_name}"
            else:
                # If the term part is not in our map, raise an error.
                raise ValueError(f"Unknown term suffix: '{term_part}'")
                
        except (ValueError, IndexError):
            self.logger.error(f"Invalid term format: '{short_term}'. Expected format like '2025-26_T1'.")
            raise
    
    def run(self, term, bid_round=None, bid_window=None, output_dir="./script_input/overallBossResults", auto_detect_phase=True):
        """
        Runs the scraper for a single term, handling term format transformation internally.
        
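        Args:
            term (str): Term to scrape in short format (e.g., '2025-26_T1').
            bid_round (str): Specific bid round to filter by. If None, it may be auto-detected.
            bid_window (str): Specific bid window to filter by. If None, it may be auto-detected.
            output_dir (str): Directory to save the Excel file.
            auto_detect_phase (bool): If True and bid_round or bid_window is None, fill the
                                      missing value(s) from _determine_current_bidding_phase().
        
        Returns:
            list: Records (dicts) collected for the term.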
        """
        try:
            # First, transform the short-form term into the website-friendly format.
            website_term = self._transform_term_format(term)
            
            # Auto-detect current bidding phase if enabled and no explicit round/window provided
            if auto_detect_phase and (bid_round is None or bid_window is None):
                detected_round, detected_window = self._determine_current_bidding_phase()
                if detected_round and detected_window:
                    if bid_round is None: bid_round = detected_round
                    if bid_window is None: bid_window = detected_window
                    self.logger.info(f"Auto-detected bidding phase: Round {bid_round}, Window {bid_window}")
                else:
                    self.logger.info("Could not auto-detect a current bidding phase. Scraping with default filters.")
            
            round_str = f"Round {bid_round}" if bid_round else "All Rounds"
            window_str = f"Window {bid_window}" if bid_window else "All Windows"
            self.logger.info(f"Scraping {website_term} - {round_str}, {window_str}")
            
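            # Make sure the output directory exists before anything is saved
            os.makedirs(output_dir, exist_ok=True)
            
            self._setup_driver()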
            self.driver.get("https://boss.intranet.smu.edu.sg/")
            self.wait_for_manual_login()
            
            # Scrape the term data using the correctly formatted term.
            data = self.scrape_term_data(
                term=website_term,
                bid_round=bid_round,
                bid_window=bid_window,
                output_dir=output_dir
            )
            
            print(f"Scraping completed! Collected {len(data)} records for {website_term}")
            return data
            
        except Exception as e:
            print(f"An error occurred during the scraping process: {str(e)}")
            self.logger.error(f"Error during scraping: {str(e)}")
            raise
        finally:
            self.close()

    def close(self):
        """Close the WebDriver"""
        if self.driver:
            self.driver.quit()
            self.logger.info("WebDriver closed")

In [None]:
# Option 1: Use the run() method with the new configuration variables
scraper = ScrapeOverallResults(headless=False, delay=5)
scraper.run(
    term=START_AY_TERM,
    bid_round=TARGET_ROUND,
    bid_window=TARGET_WINDOW,
    auto_detect_phase=True  # Ensure auto-detection is enabled.
)

# # Option 2: Use the run() method with automatic detection disabled
# scraper = ScrapeOverallResults(headless=False, delay=5)
# scraper.run(term=START_AY_TERM, auto_detect_phase=False)  # term is required; will scrape all rounds and windows

# # Option 3: Use the run() method with custom parameters (overrides auto-detection)
# scraper = ScrapeOverallResults(headless=False, delay=5)
# scraper.run(
#     term='2024-25_T2',  # run() expects the short format and converts it internally
#     bid_round='2', 
#     bid_window='2',
#     output_dir="./my_output_folder",
#     auto_detect_phase=False  # Disable auto-detection since we're specifying manually
# )

# # Option 4: Mix auto-detection with manual override
# scraper = ScrapeOverallResults(headless=False, delay=5)
# scraper.run(
#     term='2025-26_T1',  # run() expects the short format and converts it internally
#     bid_round='1A',  # Override auto-detected round
#     # bid_window will be auto-detected
#     auto_detect_phase=True
# )

# # Option 5: Use the scrape_multiple_terms method for multiple terms
# scraper = ScrapeOverallResults(headless=False, delay=5)
# terms_config = [
#     {'term': '2025-26 Term 1', 'round': '1', 'window': '1'},
#     {'term': '2024-25 Term 2', 'round': '2', 'window': '2'},
#     {'term': '2024-25 Term 1', 'round': None, 'window': None}  # None means use default/empty
# ]
# scraper.scrape_multiple_terms(terms_config, output_dir="./multiple_terms_output")


---

## **3. Extract Data from HTML Files**

### **HTML Data Extractor Summary**

#### **What This Code Does**
The `HTMLDataExtractor` class processes previously scraped HTML files from SMU's BOSS system and extracts structured data into Excel format. It systematically parses course information, class timings, academic terms, and exam schedules from local HTML files without requiring network access or authentication.

**Key Features:**
- **Local File Processing**: Uses Selenium WebDriver to parse local HTML files without network connectivity requirements
- **Comprehensive Data Extraction**: Extracts course details, academic terms, class timings, exam schedules, grading information, and professor names
- **Test-First Approach**: Includes `run_test()` function to validate extraction logic on a small sample before processing all files
- **Structured Output**: Organizes extracted data into two Excel data sheets: `standalone` (one record per HTML file) and `multiple` (class/exam timings)
- **Error Tracking**: Captures and logs parsing errors in a separate sheet for debugging and quality assurance
- **Flexible Data Parsing**: Handles multiple academic term naming conventions and date formats used across different years
- **Record Linking**: Uses record keys to maintain relationships between standalone and multiple data records
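
To make the term-naming point concrete, here is a minimal sketch of how a term label can be mapped to an ID such as `AY202122T2`. It follows the keyword rules used by `parse_acad_term` in the class below, but `term_to_id` and `TERM_KEYWORDS` are hypothetical names for illustration, and the folder-path fallback for a bare "Term 3" is left out.

```python
import re

# Hypothetical lookup table; the keywords mirror those checked in parse_acad_term.
# "term 3a"/"term 3b" must come before the generic "term 3".
TERM_KEYWORDS = [
    ("term 1", "T1"), ("session 1", "T1"), ("august term", "T1"),
    ("term 2", "T2"), ("session 2", "T2"), ("january term", "T2"),
    ("term 3a", "T3A"),
    ("term 3b", "T3B"),
    ("term 3", "T3"),
]

def term_to_id(term_text):
    """Return an ID such as 'AY202122T2', or None if the text is not recognised."""
    match = re.search(r"(\d{4})-(\d{2})\s+(.*)", term_text or "")
    if not match:
        return None
    start_year, end_short = match.group(1), match.group(2)
    desc = match.group(3).lower()
    for keyword, code in TERM_KEYWORDS:
        if keyword in desc:
            return f"AY{start_year}{end_short}{code}"
    return None

print(term_to_id("2021-22 Term 2"))
print(term_to_id("2022-23 Session 1"))
print(term_to_id("2024-25 Term 3B"))
```

A table of keywords keeps the naming variants in one place, so a new label only needs one extra entry.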

#### **What Is Required**

**Technical Dependencies:**
- Python packages: `selenium`, `webdriver-manager`, `pandas`, `openpyxl`, standard libraries (`os`, `re`, `random`, `datetime`, `pathlib`)
- Chrome browser and ChromeDriver (auto-managed)
- No access to BOSS required (processes local files only); note that `webdriver-manager` may still need to download ChromeDriver the first time it runs

**Input Requirements:**
- **Scraped HTML Files**: Previously downloaded HTML files from BOSS system stored locally
- **File Path CSV**: `script_input/scraped_filepaths.csv` containing paths to valid HTML files
- **Directory Structure**: HTML files organized in the expected folder structure (typically `script_input/classTimingsFull/`)
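
As a rough illustration of these inputs, the sketch below reads a file-path CSV (accepting either a `Filepath` or `filepath` column, as the extractor does) and pulls the BOSS IDs out of a file name with the same `SelectedAcadTerm=` / `SelectedClassNumber=` patterns used later in the class. The helper names and the sample file name are assumptions made for this example, not the real data.

```python
import csv
import io
import os
import re

def read_filepaths(csv_file):
    """Return the non-empty paths from a 'Filepath' or 'filepath' column."""
    reader = csv.DictReader(csv_file)
    column = "Filepath" if "Filepath" in (reader.fieldnames or []) else "filepath"
    return [row[column] for row in reader if row.get(column)]

def boss_ids(filepath):
    """Pull the term and class numbers out of the file name, if present."""
    name = os.path.basename(filepath)
    term = re.search(r"SelectedAcadTerm=(\d+)", name)
    cls = re.search(r"SelectedClassNumber=(\d+)", name)
    return (int(term.group(1)) if term else None,
            int(cls.group(1)) if cls else None)

# Made-up sample row standing in for script_input/scraped_filepaths.csv.
sample = io.StringIO(
    "Filepath\n"
    "script_input/classTimingsFull/2025-26_T1/2025-26_T1_R1W1/"
    "SelectedClassNumber=1234_SelectedAcadTerm=2510.html\n"
)
paths = read_filepaths(sample)
print(boss_ids(paths[0]))
```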

**Output Structure:**
- **Excel File**: `script_input/raw_data.xlsx` (or custom path) with multiple sheets:
  - `standalone`: One record per HTML file with course and class information
  - `multiple`: Multiple records for class timings and exam schedules
  - `errors`: Parsing errors and problematic files for debugging

**Data Extraction Capabilities:**
- **Course Information**: Course codes, names, descriptions, credit units, course areas, enrollment requirements
- **Academic Terms**: Term IDs, academic years, start/end dates, BOSS IDs
- **Class Details**: Sections, grading basis, course outline URLs, professor names
- **Timing Data**: Class schedules, exam dates, venues, day-of-week information
- **Cross-References**: Maintains linking keys between related records across sheets
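
The linking between sheets can be pictured with a small, standard-library-only sketch: dates are normalised the same way as `convert_date_to_timestamp`, and timing rows are grouped by `record_key` so they can be matched back to their standalone record. The course code, file name, and helper name `to_timestamp` are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

def to_timestamp(date_str):
    """Convert 'DD-Mmm-YYYY' to the timestamp string format used in the sheets."""
    try:
        parsed = datetime.strptime(date_str, "%d-%b-%Y")
    except (TypeError, ValueError):
        return None  # None input or a date that does not match the format
    return parsed.strftime("%Y-%m-%d 00:00:00.000 +0800")

# Made-up rows shaped like the 'standalone' and 'multiple' sheets.
standalone = [{"record_key": "a.html", "course_code": "ABC101", "section": "G1"}]
multiple = [
    {"record_key": "a.html", "type": "CLASS", "start_date": to_timestamp("10-Jan-2022")},
    {"record_key": "a.html", "type": "EXAM", "date": to_timestamp("01-May-2022")},
]

# Group timing rows by the shared key, then look them up per standalone record.
timings_by_key = defaultdict(list)
for row in multiple:
    timings_by_key[row["record_key"]].append(row)

for record in standalone:
    linked = timings_by_key[record["record_key"]]
    print(record["course_code"], [row["type"] for row in linked])
```

With the real Excel output, the same join could be done by merging the two sheets on `record_key`.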

**Usage in Jupyter Notebook:**
```python
# Initialize extractor
extractor = HTMLDataExtractor()

# Test with sample files first (recommended)
test_success = extractor.run_test(test_count=10)

if test_success:
    # Run full extraction
    extractor.run()
    
# Or run directly without testing
extractor.run(
    scraped_filepaths_csv='script_input/scraped_filepaths.csv',
    output_path='script_input/raw_data.xlsx'
)
```

The class provides a crucial intermediate step between raw HTML scraping and database insertion, creating clean, structured data that can be further processed for database integration or analysis.

In [None]:
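# Imports needed by this class: Path builds file:// URIs, ChromeDriverManager supplies the driver binary.
from pathlib import Path
from webdriver_manager.chrome import ChromeDriverManager

class HTMLDataExtractor: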
    """
    Extract raw data from scraped HTML files and save to Excel format using Selenium
    """
    
    def __init__(self):
        self.standalone_data = []
        self.multiple_data = []
        self.errors = []
        self.driver = None
        
    def setup_selenium_driver(self):
        """Set up Selenium WebDriver for local file access"""
        try:
            options = Options()
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            options.add_argument('--headless')  # Run in headless mode for efficiency
            options.add_argument('--disable-gpu')
            
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=options)
            print("Selenium WebDriver initialized successfully")
        except Exception as e:
            print(f"Failed to initialize Selenium WebDriver: {e}")
            raise
    
    def safe_find_element_text(self, by, value):
        """Safely find element and return its text with proper encoding handling"""
        try:
            element = self.driver.find_element(by, value)
            if element:
                raw_text = element.text.strip()
                return self.clean_text_encoding(raw_text)
            return None
        except Exception:
            return None
    
    def safe_find_element_attribute(self, by, value, attribute):
        """Safely find element and return its attribute with proper encoding handling"""
        try:
            element = self.driver.find_element(by, value)
            if element:
                raw_attr = element.get_attribute(attribute)
                return self.clean_text_encoding(raw_attr) if raw_attr else None
            return None
        except Exception:
            return None
    
    def convert_date_to_timestamp(self, date_str):
        """Convert DD-Mmm-YYYY to database timestamp format"""
        try:
            date_obj = datetime.strptime(date_str, '%d-%b-%Y')
            return date_obj.strftime('%Y-%m-%d 00:00:00.000 +0800')
        except (TypeError, ValueError):  # None input or a date that does not match the format
            return None
    
    def parse_acad_term(self, term_text, filepath=None):
        """Parse academic term text and return structured data with folder path fallback"""
        try:
            # Clean the term text first
            if term_text:
                term_text = self.clean_text_encoding(term_text)
            
            # Pattern like "2021-22 Term 2" or "2021-22 Session 1"
            pattern = r'(\d{4})-(\d{2})\s+(.*)'
            match = re.search(pattern, term_text) if term_text else None
            
            if not match:
                return None, None, None, None
            
            start_year = int(match.group(1))
            end_year_short = int(match.group(2))
            term_desc = match.group(3).lower()
            
            # Convert 2-digit year to 4-digit
            if end_year_short < 50:
                end_year = 2000 + end_year_short
            else:
                end_year = 1900 + end_year_short
            
            # Determine term code from text
            term_code = None
            if 'term 1' in term_desc or 'session 1' in term_desc or 'august term' in term_desc:
                term_code = 'T1'
            elif 'term 2' in term_desc or 'session 2' in term_desc or 'january term' in term_desc:
                term_code = 'T2'
            elif 'term 3a' in term_desc:
                term_code = 'T3A'
            elif 'term 3b' in term_desc:
                term_code = 'T3B'
            elif 'term 3' in term_desc:
                # Generic T3 - need to check folder path for A/B
                term_code = 'T3'
            
            # If term_code is incomplete or missing, use folder path as fallback
            if not term_code or term_code == 'T3':
                folder_term = self.extract_term_from_folder_path(filepath) if filepath else None
                if folder_term:
                    # If we have folder term, use it
                    if term_code == 'T3' and folder_term in ['T3A', 'T3B']:
                        term_code = folder_term
                    elif not term_code:
                        term_code = folder_term
            
            # If still no term code, return None
            if not term_code:
                return start_year, end_year, None, None
            
            acad_term_id = f"AY{start_year}{end_year_short:02d}{term_code}"
            
            return start_year, end_year, term_code, acad_term_id
        except Exception as e:
            return None, None, None, None
    
    def parse_course_and_section(self, header_text):
        """Parse course code and section from header text with encoding fixes"""
        try:
            if not header_text:
                return None, None
            
            # Clean the text first
            clean_text = self.clean_text_encoding(header_text)
            clean_text = re.sub(r'<[^>]+>', '', clean_text)
            clean_text = re.sub(r'\s+', ' ', clean_text.strip())
            
            # Try multiple regex patterns
            patterns = [
                r'([A-Z0-9_-]+)\s+—\s+(.+)',  # Standard format with em-dash
                r'([A-Z0-9_-]+)\s+-\s+(.+)',  # Standard format with hyphen
                r'([A-Z]+)\s+(\d+[A-Z0-9_]*)\s+—\s+(.+)',  # Split format with em-dash
                r'([A-Z]+)\s+(\d+[A-Z0-9_]*)\s+-\s+(.+)',  # Split format with hyphen
                r'([A-Z0-9_\s-]+?)\s+—\s+([^—]+)',  # Flexible format with em-dash
                r'([A-Z0-9_\s-]+?)\s+-\s+([^-]+)',  # Flexible format with hyphen
            ]
            
            for pattern in patterns:
                match = re.match(pattern, clean_text)
                if match:
                    if len(match.groups()) == 2:
                        # Standard format: course_code - section
                        course_section = match.group(1).strip()
                        section_name = match.group(2).strip()
                        
                        # Extract section from the end of course_section if it's there
                        section_match = re.search(r'^(.+?)\s+(SG\d+|G\d+|\d+)$', course_section)
                        if section_match:
                            course_code = section_match.group(1)
                            section = section_match.group(2)
                        else:
                            course_code = course_section
                            # Try to extract section from section_name
                            section_extract = re.search(r'(SG\d+|G\d+|\d+)', section_name)
                            section = section_extract.group(1) if section_extract else None
                    else:
                        # Split format: course_prefix course_number - section_name
                        course_code = f"{match.group(1)}{match.group(2)}"
                        section_name = match.group(3).strip()
                        section_extract = re.search(r'(SG\d+|G\d+|\d+)', section_name)
                        section = section_extract.group(1) if section_extract else None
                    
                    return course_code.strip() if course_code else None, section
            
            return None, None
        except Exception as e:
            return None, None
    
    def parse_date_range(self, date_text):
        """Parse date range text and return start and end timestamps"""
        try:
            # Example: "10-Jan-2022 to 01-May-2022"
            pattern = r'(\d{1,2}-\w{3}-\d{4})\s+to\s+(\d{1,2}-\w{3}-\d{4})'
            match = re.search(pattern, date_text)
            
            if not match:
                return None, None
            
            start_date = self.convert_date_to_timestamp(match.group(1))
            end_date = self.convert_date_to_timestamp(match.group(2))
            
            return start_date, end_date
        except Exception as e:
            return None, None
    
    def extract_course_areas_list(self):
        """Extract course areas with encoding fixes"""
        try:
            course_areas_element = self.driver.find_element(By.ID, 'lblCourseAreas')
            if not course_areas_element:
                return None
            
            # Get innerHTML to handle HTML content
            course_areas_html = course_areas_element.get_attribute('innerHTML')
            if course_areas_html:
                # Clean encoding first
                course_areas_html = self.clean_text_encoding(course_areas_html)
                
                # Extract list items
                areas_list = re.findall(r'<li[^>]*>([^<]+)</li>', course_areas_html)
                if areas_list:
                    # Clean each area and join
                    cleaned_areas = [self.clean_text_encoding(area.strip()) for area in areas_list]
                    return ', '.join(cleaned_areas)
                else:
                    # Fallback to text content
                    text_content = course_areas_element.text.strip()
                    return self.clean_text_encoding(text_content)
            else:
                # Fallback to text content
                text_content = course_areas_element.text.strip()
                return self.clean_text_encoding(text_content)
        except Exception:
            return None
    
    def extract_course_outline_url(self):
        """Extract course outline URL from HTML using Selenium"""
        try:
            onclick_attr = self.safe_find_element_attribute(By.ID, 'imgCourseOutline', 'onclick')
            if onclick_attr:
                url_match = re.search(r"window\.open\('([^']+)'", onclick_attr)
                if url_match:
                    return url_match.group(1)
        except Exception:
            pass
        return None
    
    def extract_boss_ids_from_filepath(self, filepath):
        """Extract BOSS IDs from filepath"""
        try:
            filename = os.path.basename(filepath)
            acad_term_match = re.search(r'SelectedAcadTerm=(\d+)', filename)
            class_match = re.search(r'SelectedClassNumber=(\d+)', filename)
            
            acad_term_boss_id = int(acad_term_match.group(1)) if acad_term_match else None
            class_boss_id = int(class_match.group(1)) if class_match else None
            
            return acad_term_boss_id, class_boss_id
        except Exception:
            return None, None
    
    def extract_meeting_information(self, record_key):
        """Extract class timing and exam timing information using Selenium"""
        try:
            meeting_table = self.driver.find_element(By.ID, 'RadGrid_MeetingInfo_ctl00')
            tbody = meeting_table.find_element(By.TAG_NAME, 'tbody')
            rows = tbody.find_elements(By.TAG_NAME, 'tr')
            
            for row in rows:
                cells = row.find_elements(By.TAG_NAME, 'td')
                if len(cells) < 7:
                    continue
                
                meeting_type = cells[0].text.strip()
                start_date_text = cells[1].text.strip()
                end_date_text = cells[2].text.strip()
                day_of_week = cells[3].text.strip()
                start_time = cells[4].text.strip()
                end_time = cells[5].text.strip()
                venue = cells[6].text.strip()  # index 6 always exists after the length check above
                professor_name = cells[7].text.strip() if len(cells) > 7 else ""
                
                # Assume CLASS if meeting_type is empty
                if not meeting_type:
                    meeting_type = 'CLASS'
                
                if meeting_type == 'CLASS':
                    # Convert dates to timestamp format
                    start_date = self.convert_date_to_timestamp(start_date_text)
                    end_date = self.convert_date_to_timestamp(end_date_text)
                    
                    timing_record = {
                        'record_key': record_key,
                        'type': 'CLASS',
                        'start_date': start_date,
                        'end_date': end_date,
                        'day_of_week': day_of_week,
                        'start_time': start_time,
                        'end_time': end_time,
                        'venue': venue,
                        'professor_name': professor_name
                    }
                    self.multiple_data.append(timing_record)
                
                elif meeting_type == 'EXAM':
                    # For exams, use the second date (end_date_text) as the exam date
                    exam_date = self.convert_date_to_timestamp(end_date_text)
                    
                    exam_record = {
                        'record_key': record_key,
                        'type': 'EXAM',
                        'date': exam_date,
                        'day_of_week': day_of_week,
                        'start_time': start_time,
                        'end_time': end_time,
                        'venue': venue,
                        'professor_name': professor_name
                    }
                    self.multiple_data.append(exam_record)
        
        except Exception as e:
            self.errors.append({
                'record_key': record_key,
                'error': f'Error extracting meeting information: {str(e)}',
                'type': 'parse_error'
            })
    
    def process_html_file(self, filepath):
        """Process a single HTML file and extract all data using Selenium"""
        try:
            # Load HTML file
            html_file = Path(filepath).resolve()
            file_url = html_file.as_uri()
            self.driver.get(file_url)
            
            # Create unique record key
            record_key = f"{os.path.basename(filepath)}"
            
            # Extract basic information
            class_header_text = self.safe_find_element_text(By.ID, 'lblClassInfoHeader')
            if not class_header_text:
                self.errors.append({
                    'filepath': filepath,
                    'error': 'Missing class header',
                    'type': 'parse_error'
                })
                return False
            
            course_code, section = self.parse_course_and_section(class_header_text)
            
            # Extract academic term
            term_text = self.safe_find_element_text(By.ID, 'lblClassInfoSubHeader')
            acad_year_start, acad_year_end, term, acad_term_id = self.parse_acad_term(term_text, filepath) if term_text else (None, None, None, None)
            
            # Extract course information
            course_name = self.safe_find_element_text(By.ID, 'lblClassSection')
            course_description = self.safe_find_element_text(By.ID, 'lblCourseDescription')
            credit_units_text = self.safe_find_element_text(By.ID, 'lblUnits')
            course_areas = self.extract_course_areas_list()
            enrolment_requirements = self.safe_find_element_text(By.ID, 'lblEnrolmentRequirements')
            
            # Process credit units
            try:
                credit_units = float(credit_units_text) if credit_units_text else None
            except (ValueError, TypeError):
                credit_units = None
            
            # Extract grading basis
            grading_text = self.safe_find_element_text(By.ID, 'lblGradingBasis')
            grading_basis = None
            if grading_text:
                if grading_text.lower() == 'graded':
                    grading_basis = 'Graded'
                elif grading_text.lower() in ['pass/fail', 'pass fail']:
                    grading_basis = 'Pass/Fail'
                else:
                    grading_basis = 'NA'
            
            # Extract course outline URL
            course_outline_url = self.extract_course_outline_url()
            
            # Extract dates
            period_text = self.safe_find_element_text(By.ID, 'lblDates')
            start_dt, end_dt = self.parse_date_range(period_text) if period_text else (None, None)
            
            # Extract BOSS IDs
            acad_term_boss_id, class_boss_id = self.extract_boss_ids_from_filepath(filepath)
            
            # Extract bidding information
            total, current_enrolled, reserved, available = self.extract_bidding_info()
            
            # Get extraction date and determine bidding window from folder path
            extraction_date = datetime.now()
            bidding_window = self.determine_bidding_window_from_filepath(filepath)
            
            # Create standalone record
            standalone_record = {
                'record_key': record_key,
                'filepath': filepath,
                'course_code': course_code,
                'section': section,
                'course_name': course_name,
                'course_description': course_description,
                'credit_units': credit_units,
                'course_area': course_areas,
                'enrolment_requirements': enrolment_requirements,
                'acad_term_id': acad_term_id,
                'acad_year_start': acad_year_start,
                'acad_year_end': acad_year_end,
                'term': term,
                'start_dt': start_dt,
                'end_dt': end_dt,
                'grading_basis': grading_basis,
                'course_outline_url': course_outline_url,
                'acad_term_boss_id': acad_term_boss_id,
                'class_boss_id': class_boss_id,
                'term_text': term_text,
                'period_text': period_text,
                'total': total,
                'current_enrolled': current_enrolled,
                'reserved': reserved,
                'available': available,
                'date_extracted': extraction_date.strftime('%Y-%m-%d %H:%M:%S'),
                'bidding_window': bidding_window
            }
            
            self.standalone_data.append(standalone_record)
            
            # Extract meeting information
            self.extract_meeting_information(record_key)
            
            return True
            
        except Exception as e:
            self.errors.append({
                'filepath': filepath,
                'error': str(e),
                'type': 'processing_error'
            })
            return False

    def determine_bidding_window_from_filepath(self, filepath):
        """Determine bidding window from the file path structure"""
        try:
            # Extract folder name from path
            # e.g., script_input/classTimingsFull/2025-26_T1/2025-26_T1_R1W1/file.html
            folder_path = os.path.dirname(filepath)
            folder_name = os.path.basename(folder_path)
            
            return self.extract_bidding_window_from_folder(folder_name)
            
        except Exception as e:
            print(f"Error determining bidding window from filepath: {e}")
            return None
    
    def run_test(self, scraped_filepaths_csv='script_input/scraped_filepaths.csv', test_count=10):
        """Randomly test the extraction on a subset of files"""
        try:
            print(f"Starting test run with {test_count} randomly selected files...")

            # Reset data containers
            self.standalone_data = []
            self.multiple_data = []
            self.errors = []

            # Set up Selenium driver
            self.setup_selenium_driver()

            # Read the CSV file with file paths
            df = pd.read_csv(scraped_filepaths_csv)

            # Handle both 'Filepath' and 'filepath' column names
            filepath_column = 'Filepath' if 'Filepath' in df.columns else 'filepath'
            all_filepaths = df[filepath_column].dropna().tolist()

            if len(all_filepaths) == 0:
                raise ValueError("No valid filepaths found in CSV")

            # Randomly sample filepaths
            sample_size = min(test_count, len(all_filepaths))
            sampled_filepaths = random.sample(all_filepaths, sample_size)

            processed_files = 0
            successful_files = 0

            for i, filepath in enumerate(sampled_filepaths, start=1):
                if os.path.exists(filepath):
                    print(f"Processing test file {i}/{sample_size}: {os.path.basename(filepath)}")
                    if self.process_html_file(filepath):
                        successful_files += 1
                    processed_files += 1
                else:
                    self.errors.append({
                        'filepath': filepath,
                        'error': 'File not found',
                        'type': 'file_error'
                    })

            print(f"\nTest run complete: {successful_files}/{processed_files} files successful")
            print(f"Standalone records extracted: {len(self.standalone_data)}")
            print(f"Multiple records extracted: {len(self.multiple_data)}")
            if self.errors:
                print(f"Errors encountered: {len(self.errors)}")
                for error in self.errors[:3]:  # Show only the first 3 errors
                    print(f"  - {error['type']}: {error['error']}")

            # Save test results
            test_output_path = 'script_input/test_raw_data.xlsx'
            self.save_to_excel(test_output_path)

            return successful_files > 0

        except Exception as e:
            print(f"Error in test run: {e}")
            return False

        finally:
            if self.driver:
                self.driver.quit()
                print("Test selenium driver closed")
    
    def process_all_files(self, base_dir='script_input/classTimingsFull'):
        """Process only files from the latest round folder that haven't been processed yet"""
        try:
            # Find the current academic term (e.g., 2025-26_T1)
            current_term = self.get_current_academic_term()
            if not current_term:
                print("Could not determine current academic term")
                return
            
            print(f"Current academic term: {current_term}")
            print(f"Base directory: {base_dir}")
            
            term_path = os.path.join(base_dir, current_term)
            print(f"Term path: {term_path}")
            
            if not os.path.exists(term_path):
                print(f"Academic term folder not found: {term_path}")
                return
            
            # Find the latest round folder
            latest_round_folder = self.find_latest_round_folder(term_path)
            if not latest_round_folder:
                print(f"No round folders found in {term_path}")
                return
            
            latest_round_path = os.path.join(term_path, latest_round_folder)
            bidding_window = self.extract_bidding_window_from_folder(latest_round_folder)
            
            print(f"Processing latest round: {latest_round_folder}")
            print(f"Latest round path: {latest_round_path}")
            print(f"Bidding window: {bidding_window}")
            
            # Get all HTML files from the latest round folder
            html_files = []
            for filename in os.listdir(latest_round_path):
                if filename.endswith('.html'):
                    filepath = os.path.join(latest_round_path, filename)
                    html_files.append(filepath)
            
            if not html_files:
                print(f"No HTML files found in {latest_round_path}")
                return
            
            print(f"Found {len(html_files)} HTML files in latest round")
            
            # Load existing data to check what's already processed
            existing_standalone, _, _ = self.load_existing_data()
            
            # Filter files that haven't been processed for this bidding window
            files_to_process = []
            for filepath in html_files:
                record_key = os.path.basename(filepath)
                
                # Check if this file has already been processed for this bidding window
                if existing_standalone.empty:
                    files_to_process.append(filepath)
                else:
                    # Extract course info from filename to check against existing data
                    acad_term_boss_id, class_boss_id = self.extract_boss_ids_from_filepath(filepath)
                    
                    # Check if record exists for this bidding window
                    mask = (existing_standalone['acad_term_boss_id'] == acad_term_boss_id) & \
                        (existing_standalone['class_boss_id'] == class_boss_id) & \
                        (existing_standalone['bidding_window'] == bidding_window)
                    
                    if not mask.any():
                        files_to_process.append(filepath)
            
            if not files_to_process:
                print(f"All files from {latest_round_folder} have already been processed")
                return
            
            print(f"Processing {len(files_to_process)} new files from {latest_round_folder}")
            
            # Process only the new files
            processed_files = 0
            successful_files = 0
            
            for filepath in files_to_process:
                if os.path.exists(filepath):
                    # print(f"Processing: {os.path.basename(filepath)}")
                    if self.process_html_file(filepath):
                        successful_files += 1
                    processed_files += 1
                    
                    if processed_files % 100 == 0:
                        print(f"Processed {processed_files}/{len(files_to_process)} files")
            
            print(f"Processing complete: {successful_files}/{processed_files} files successful")
            
        except Exception as e:
            print(f"Error in process_all_files: {e}")
            raise
    
    def save_to_excel(self, output_path='script_input/raw_data.xlsx'):
        """Save extracted data to Excel file, appending new records only"""
        try:
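            # Ensure output directory exists (skip when output_path has no directory part,
            # because os.makedirs('') raises FileNotFoundError)
            output_dir = os.path.dirname(output_path)
            if output_dir:
                os.makedirs(output_dir, exist_ok=True)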
            
            # Load existing data
            existing_standalone, existing_multiple, existing_errors = self.load_existing_data(output_path)
            
            # Filter out duplicate standalone records
            new_standalone_records = []
            skipped_standalone = 0
            
            for record in self.standalone_data:
                if not self.check_record_exists(existing_standalone, record):
                    new_standalone_records.append(record)
                else:
                    skipped_standalone += 1
            
            # Handle multiple records - replace if changes detected per record_key
            records_to_update = set()
            updated_multiple = 0
            
            # Group new records by record_key
            new_records_by_key = {}
            for record in self.multiple_data:
                key = record['record_key']
                if key not in new_records_by_key:
                    new_records_by_key[key] = []
                new_records_by_key[key].append(record)
            
            # Check each record_key for changes
            for record_key, new_records in new_records_by_key.items():
                if existing_multiple.empty:
                    records_to_update.add(record_key)
                else:
                    # Get existing records for this record_key
                    existing_records = existing_multiple[existing_multiple['record_key'] == record_key]
                    
                    # Check if there are any changes
                    changes_detected = False
                    
                    # Compare counts first
                    if len(existing_records) != len(new_records):
                        changes_detected = True
                    else:
                        # Compare each record
                        for new_record in new_records:
                            # Find matching existing record
                            existing_match = existing_records[existing_records['type'] == new_record['type']]
                            
                            if existing_match.empty:
                                changes_detected = True
                                break
                            
                            # Compare relevant fields based on type
                            if new_record['type'] == 'CLASS':
                                compare_fields = ['start_date', 'end_date', 'day_of_week', 'start_time', 'end_time', 'venue', 'professor_name']
                            elif new_record['type'] == 'EXAM':
                                compare_fields = ['date', 'day_of_week', 'start_time', 'end_time', 'venue', 'professor_name']
                            else:
                                continue
                            
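                            for field in compare_fields:
                                new_value = new_record.get(field)
                                old_value = existing_match.iloc[0].get(field)
                                # Blank Excel cells load as NaN, and NaN != NaN is True,
                                # so treat two missing values as equal
                                if pd.isna(new_value) and pd.isna(old_value):
                                    continue
                                if new_value != old_value:
                                    changes_detected = True
                                    break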
                            
                            if changes_detected:
                                break
                    
                    if changes_detected:
                        records_to_update.add(record_key)
                        updated_multiple += 1
            
            # Build final multiple records list
            if existing_multiple.empty:
                final_multiple_records = self.multiple_data
            else:
                # Keep existing records that are not being updated
                final_multiple_records = []
                for _, row in existing_multiple.iterrows():
                    if row['record_key'] not in records_to_update:
                        final_multiple_records.append(row.to_dict())
                
                # Add new records for updated record_keys
                for record_key in records_to_update:
                    final_multiple_records.extend(new_records_by_key[record_key])
            
            new_multiple_df = pd.DataFrame(final_multiple_records)
            
            # Create DataFrames for new records
            new_standalone_df = pd.DataFrame(new_standalone_records)
            new_errors_df = pd.DataFrame(self.errors)
            
            # Combine with existing data
            if not existing_standalone.empty and not new_standalone_df.empty:
                combined_standalone = pd.concat([existing_standalone, new_standalone_df], ignore_index=True)
            elif not new_standalone_df.empty:
                combined_standalone = new_standalone_df
            else:
                combined_standalone = existing_standalone
            
            # Multiple records are already handled above
            combined_multiple = new_multiple_df
            
            if not existing_errors.empty and not new_errors_df.empty:
                combined_errors = pd.concat([existing_errors, new_errors_df], ignore_index=True)
            elif not new_errors_df.empty:
                combined_errors = new_errors_df
            else:
                combined_errors = existing_errors
            
            # Save to Excel with multiple sheets
            max_retries = 3
            for attempt in range(max_retries):
                try:
                    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
                        combined_standalone.to_excel(writer, sheet_name='standalone', index=False)
                        combined_multiple.to_excel(writer, sheet_name='multiple', index=False)
                        
                        if not combined_errors.empty:
                            combined_errors.to_excel(writer, sheet_name='errors', index=False)
                    break
                except PermissionError:
                    if attempt < max_retries - 1:
                        print(f"Excel file is locked. Retrying in 2 seconds... (Attempt {attempt + 1}/{max_retries})")
                        time.sleep(2)
                    else:
                        print(f"Failed to save Excel file after {max_retries} attempts. Please close the file and try again.")
                        raise
            
            print(f"Data saved to {output_path}")
            print(f"New standalone records added: {len(new_standalone_records)}")
            print(f"Skipped duplicate standalone records: {skipped_standalone}")
            print(f"Total standalone records: {len(combined_standalone)}")
            print(f"Updated multiple record keys: {updated_multiple}")
            print(f"Total multiple records: {len(combined_multiple)}")
            if self.errors:
                print(f"New errors: {len(self.errors)}")
            
        except Exception as e:
            print(f"Error saving to Excel: {e}")
            raise
    
    def run(self, output_path='script_input/raw_data.xlsx'):
        """Run the complete extraction process"""
        print("Starting HTML data extraction...")
        
        # Reset data containers
        self.standalone_data = []
        self.multiple_data = []
        self.errors = []
        
        # Set up Selenium driver
        self.setup_selenium_driver()
        
        try:
            # Process files from latest round folder only
            self.process_all_files()  # Use default base_dir parameter
            
            # Save to Excel
            self.save_to_excel(output_path)
            
            print("HTML data extraction completed!")
            
        finally:
            if self.driver:
                self.driver.quit()
                print("Selenium driver closed")

    def clean_text_encoding(self, text):
        """Clean text to fix encoding issues like â€" -> —"""
        if not text:
            return text
        
        # Common encoding fixes - ORDER MATTERS! Process longer patterns first
        encoding_fixes = [
            ('â€"', '—'),   # em-dash
            ('â€™', "'"),   # right single quotation mark
            ('â€œ', '"'),   # left double quotation mark
            ('â€¦', '…'),   # horizontal ellipsis
            ('â€¢', '•'),   # bullet
            ('â€‹', ''),    # zero-width space
            ('â€‚', ' '),   # en space
            ('â€ƒ', ' '),   # em space
            ('â€‰', ' '),   # thin space
            ('â€', '"'),    # right double quotation mark (shorter pattern, process last)
            ('Â', ''),      # non-breaking space artifacts
        ]
        
        cleaned_text = text
        # Process in order to avoid substring conflicts
        for bad, good in encoding_fixes:
            cleaned_text = cleaned_text.replace(bad, good)
        
        # Remove any remaining problematic characters
        cleaned_text = re.sub(r'â€[^\w]', '', cleaned_text)
        
        return cleaned_text.strip()
    
    def extract_term_from_folder_path(self, filepath):
        """Extract term from folder path as fallback
        E.g., script_input\\classTimingsFull\\2023-24_T3A -> T3A"""
        try:
            # Get the folder path
            folder_path = os.path.dirname(filepath)
            folder_name = os.path.basename(folder_path)
            
            # Look for term pattern in folder name
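            # Pattern: YYYY-YY_TXX or YYYY-YY_TXXA
            # Use [A-Za-z0-9] rather than \w: \w also matches '_' and would
            # swallow a trailing round suffix such as _R1W1
            term_match = re.search(r'(\d{4}-\d{2})_T([A-Za-z0-9]+)', folder_name)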
            if term_match:
                return f"T{term_match.group(2)}"
            
            # Fallback: look for any T followed by alphanumeric
            term_fallback = re.search(r'T(\w+)', folder_name)
            if term_fallback:
                return f"T{term_fallback.group(1)}"
            
            return None
        except Exception as e:
            return None

    def extract_bidding_info(self):
        """Extract current bidding information from HTML elements"""
        try:
            # Extract Total
            total = self.safe_find_element_text(By.ID, 'lblClassCapacity')
            total = int(total) if total and total.isdigit() else None
            
            # Extract Current Enrolled
            current_enrolled = self.safe_find_element_text(By.ID, 'lblEnrolmentTotal')
            current_enrolled = int(current_enrolled) if current_enrolled and current_enrolled.isdigit() else None
            
            # Extract Reserved (for incoming students)
            reserved = self.safe_find_element_text(By.ID, 'lblReserved')
            reserved = int(reserved) if reserved and reserved.isdigit() else None
            
            # Extract Available Seats
            available = self.safe_find_element_text(By.ID, 'lblAvailableSeats')
            available = int(available) if available and available.isdigit() else None
            
            return total, current_enrolled, reserved, available
            
        except Exception as e:
            return None, None, None, None

    def load_existing_data(self, output_path='script_input/raw_data.xlsx'):
        """Load existing data from Excel file if it exists"""
        try:
            if os.path.exists(output_path):
                existing_standalone = pd.read_excel(output_path, sheet_name='standalone')
                existing_multiple = pd.read_excel(output_path, sheet_name='multiple')
                
                # Handle case where errors sheet might not exist
                try:
                    existing_errors = pd.read_excel(output_path, sheet_name='errors')
                except Exception:
                    existing_errors = pd.DataFrame()
                
                return existing_standalone, existing_multiple, existing_errors
            else:
                return pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
        except Exception as e:
            print(f"Error loading existing data: {e}")
            return pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
        
    def check_record_exists(self, existing_df, new_record):
        """Check if a record already exists based on key fields"""
        if existing_df.empty:
            return False
        
        # Define key fields that make a record unique
        key_fields = ['acad_term_id', 'course_code', 'section', 'bidding_window']
        
        # Check if all key fields exist in both dataframes
        for field in key_fields:
            if field not in existing_df.columns:
                return False
            if new_record.get(field) is None:
                return False
        
        # Create a mask to check for matching records
        mask = True
        for field in key_fields:
            mask = mask & (existing_df[field] == new_record[field])
        
        return mask.any()
    
    def get_current_academic_term(self):
        """
        Determines the current academic term (e.g., 2025-26_T1) by mapping the
        current date to the academic calendar terms.

        This logic is based on the SMU Academic Calendar for 2025-26, where bidding
        for a term can start before the official term commencement date. The function
        approximates term boundaries based on the calendar.
        """
        now = datetime.now()
        month = now.month

        # Determine the starting year of the academic year.
        # The academic year 'YYYY-(YY+1)' is assumed to start with T1 bidding in July of 'YYYY'.
        # So, from July onwards, we are in the 'YYYY' to 'YYYY+1' academic year.
        # Before July, we are in the latter half of the 'YYYY-1' to 'YYYY' academic year.
        if month >= 7:
            acad_year_start = now.year
        else:
            acad_year_start = now.year - 1
        
        acad_year_end_short = (acad_year_start + 1) % 100
        
        # Define the approximate start dates for each term's activities (including bidding)
        # for the determined academic year. These dates are based on the 2025-26 calendar.
        # 
        t1_start = datetime(acad_year_start, 7, 1) # Bidding for T1 starts in July
        t2_start = datetime(acad_year_start + 1, 1, 1) # T2 starts in January
        t3a_start = datetime(acad_year_start + 1, 5, 11) # T3A starts May 11
        t3b_start = datetime(acad_year_start + 1, 6, 29) # T3B starts June 29
        
        # Determine the term by checking the current date against the start dates
        # in reverse chronological order.
        term_code = None
        if now >= t3b_start:
            term_code = 'T3B'
        elif now >= t3a_start:
            term_code = 'T3A'
        elif now >= t2_start:
            term_code = 'T2'
        elif now >= t1_start:
            term_code = 'T1'

        if term_code:
            return f"{acad_year_start}-{acad_year_end_short:02d}_{term_code}"
        else:
            # This is a fallback for dates that might fall outside the defined ranges,
            # though the logic should cover all dates within a valid academic year.
            print(f"Warning: Could not determine the academic term for the date {now}. Review term start dates.")
            return None

    def find_latest_round_folder(self, term_path):
        """Find the latest round folder in the academic term directory"""
        try:
            # Get all subdirectories
            subdirs = [d for d in os.listdir(term_path) if os.path.isdir(os.path.join(term_path, d))]
            
            # Filter for round folders (should contain R and W)
            round_folders = [d for d in subdirs if 'R' in d and 'W' in d]
            
            if not round_folders:
                return None
            
            # Sort by folder name (this should work for the naming convention)
            # R1W1, R1AW1, R1AW2, etc.
            round_folders.sort(key=lambda x: (
                int(x.split('R')[1].split('W')[0].replace('A', '').replace('B', '').replace('C', '').replace('F', '')),
                x.count('A') + x.count('B') * 2 + x.count('C') * 3 + x.count('F') * 4,
                int(x.split('W')[1])
            ))
            
            return round_folders[-1]  # Return the latest one
            
        except Exception as e:
            print(f"Error finding latest round folder: {e}")
            return None

    def extract_bidding_window_from_folder(self, folder_name):
        """Extract bidding window from folder name (e.g., 2025-26_T1_R1W1 -> Round 1 Window 1)"""
        try:
            # Extract the round and window part (e.g., R1W1, R1AW2, etc.)
            round_part = folder_name.split('_')[-1]  # Get the last part after underscore
            
            # Map folder codes to full names
            folder_to_window = {
                'R1W1': 'Round 1 Window 1',
                'R1AW1': 'Round 1A Window 1',
                'R1AW2': 'Round 1A Window 2',
                'R1AW3': 'Round 1A Window 3',
                'R1BW1': 'Round 1B Window 1',
                'R1BW2': 'Round 1B Window 2',
                'R1CW1': 'Incoming Exchange Rnd 1C Win 1',
                'R1CW2': 'Incoming Exchange Rnd 1C Win 2',
                'R1CW3': 'Incoming Exchange Rnd 1C Win 3',
                'R1FW1': 'Incoming Freshmen Rnd 1 Win 1',
                'R1FW2': 'Incoming Freshmen Rnd 1 Win 2',
                'R1FW3': 'Incoming Freshmen Rnd 1 Win 3',
                'R1FW4': 'Incoming Freshmen Rnd 1 Win 4',
                'R2W1': 'Round 2 Window 1',
                'R2W2': 'Round 2 Window 2',
                'R2W3': 'Round 2 Window 3',
                'R2AW1': 'Round 2A Window 1',
                'R2AW2': 'Round 2A Window 2',
                'R2AW3': 'Round 2A Window 3',
            }
            
            return folder_to_window.get(round_part, round_part)
            
        except Exception as e:
            print(f"Error extracting bidding window from folder: {e}")
            return folder_name

In [None]:
# Example usage
extractor = HTMLDataExtractor()

# Run the extraction process
extractor.run(output_path='script_input/raw_data.xlsx')


---

## **4. Process Raw Data into Database Tables**

### **What This Code Does**
The `TableBuilder` class takes the structured output of the HTML extractor and transforms it into database-ready CSV files for SMU's class management system. It normalizes professor names, detects duplicates, and builds the tables for courses, classes, professors, class and exam timings, bidding data, and faculty assignments while maintaining referential integrity across them.

**Key Features:**
- **Three-Phase Processing**: Phase 1 (professors/courses with automated faculty mapping), Phase 2 (classes/timings), Phase 3 (BOSS bidding results from raw_data.xlsx)
- **Enhanced Professor Schema**: Stores `boss_aliases` as a JSON array string in the CSV output, so one professor can carry multiple name variations
- **Intelligent Professor Matching**: Advanced name normalization with email resolution via Outlook integration, comprehensive duplicate detection, and improved NaN handling from raw_data.xlsx
- **Automated Faculty Mapping**: Uses course code patterns from existing database courses to automatically assign new courses to SMU's schools and centers
- **Comprehensive Data Pipeline**: Processes professors, courses, academic terms, classes, class timings, exam schedules, bid windows, class availability, and bid results with proper foreign key relationships
- **Database Cache Integration**: Loads existing data from PostgreSQL to avoid duplicates and maintain consistency with enhanced validation for clean data files
- **Manual Review Workflow**: Outputs verification files for human review and correction before final processing
- **BOSS Bidding Integration**: Complete processing of bidding data directly from raw_data.xlsx with fields like total capacity, current enrollment, reserved seats, available seats, extraction timestamps, and bidding windows
- **Asian Name Handling**: Specialized normalization for Asian, Western, and mixed naming conventions with hardcoded multi-instructor handling
- **Data Integrity Validation**: Comprehensive validation system that checks referential integrity across all generated CSV files with detailed error reporting
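
To illustrate the `boss_aliases` feature listed above, here is a minimal sketch of round-tripping a list of name variations through a single CSV cell as a JSON array string. Only the `boss_aliases` column name comes from this notebook; the `name` column, the sample names, and the helpers `encode_aliases` and `decode_aliases` are hypothetical and are not part of `TableBuilder`.

```python
import csv
import io
import json

def encode_aliases(aliases):
    """Serialize name variations into one JSON array string (deduplicated, sorted)."""
    return json.dumps(sorted(set(aliases)))

def decode_aliases(cell):
    """Parse the JSON array string back into a list; blank cells become []."""
    return json.loads(cell) if cell else []

# Write one row to an in-memory CSV (hypothetical columns and sample data)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "boss_aliases"])
writer.writeheader()
writer.writerow({
    "name": "Jane Tan",
    "boss_aliases": encode_aliases(["TAN JANE", "JANE TAN", "TAN JANE"]),
})

# Read it back: the csv module handles quoting of the embedded commas and quotes
buffer.seek(0)
row = next(csv.DictReader(buffer))
aliases = decode_aliases(row["boss_aliases"])
```

Keeping the aliases in one cell avoids a separate alias table in the CSV stage, at the cost of needing `json.loads` whenever the column is read back.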

**Input Requirements:**
- **Raw Data Excel**: `script_input/raw_data.xlsx` from HTML extractor with `standalone` and `multiple` sheets containing bidding data fields
- **Database Configuration**: `.env` file with PostgreSQL connection parameters (`DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, and optionally `DB_PORT`, which defaults to 5432)
- **Professor Lookup**: `script_input/professor_lookup.csv` for existing professor mappings (optional)
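
The `TableBuilder` constructor reads `DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, and `DB_PORT` from the environment. As a rough sketch of how those settings could be checked before connecting, the hypothetical helper below builds the same kind of dictionary and fails early when a required value is missing. The function name `build_db_config`, the fail-fast behaviour, and the sample values are assumptions for illustration, not the notebook's own code.

```python
import os

def build_db_config(env=None):
    """Build a connection dict from DB_* settings, failing early on gaps."""
    env = os.environ if env is None else env
    required = ["DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {
        "host": env["DB_HOST"],
        "database": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "port": int(env.get("DB_PORT", 5432)),  # same default as the constructor
    }

# Placeholder values only; real values would come from the .env file
sample_env = {
    "DB_HOST": "localhost",
    "DB_NAME": "example_db",
    "DB_USER": "example_user",
    "DB_PASSWORD": "example_password",
}
config = build_db_config(sample_env)
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to exercise without touching real credentials.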

**Output Structure:**
- **Verification Files** (`script_output/verify/`): `new_professors.csv`, `new_courses.csv`, `new_faculties.csv`
- **Database Insert Files** (`script_output/`): All table CSV files including classes, timings, exams, bid windows, class availability, and bid results
- **Update Files**: `update_courses.csv`, `update_professor.csv` for existing record modifications
- **Validation Reports**: Data integrity validation with error/warning reports and comprehensive statistics
- **Processing Logs**: Detailed BOSS processing logs with timestamps and failure analysis
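
The referential-integrity validation mentioned above can be pictured with a small sketch: every foreign key in a child table should point at a key that exists in its parent table. The helper `find_orphans`, the column names `id` and `course_id`, and the sample rows are hypothetical; the actual validation in `TableBuilder` may work differently.

```python
def find_orphans(child_rows, parent_rows, fk, pk="id"):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]

# Made-up sample data standing in for rows read from the generated CSV files
courses = [
    {"id": "c1", "code": "ABC101"},
    {"id": "c2", "code": "ABC102"},
]
classes = [
    {"id": "k1", "course_id": "c1", "section": "G1"},
    {"id": "k2", "course_id": "c9", "section": "G2"},  # c9 is not in courses
]

orphans = find_orphans(classes, courses, fk="course_id")
```

Building the set of parent keys once keeps the check linear in the number of rows, which matters when the same test is repeated for each table pair.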

In [None]:
# Set up logging
import logging
import traceback


logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class TableBuilder:
    """Comprehensive table builder for university class management system"""
    
    def __init__(self, input_file: str = 'script_input/raw_data.xlsx'):
        """Initialize TableBuilder with database configuration and caching"""
        self.input_file = input_file
        self.output_base = 'script_output'
        self.verify_dir = os.path.join(self.output_base, 'verify')
        self.cache_dir = 'db_cache'
        
        # Create output directories
        os.makedirs(self.output_base, exist_ok=True)
        os.makedirs(self.verify_dir, exist_ok=True)
        os.makedirs(self.cache_dir, exist_ok=True)
        
        # Load environment variables
        load_dotenv()
        self.db_config = {
            'host': os.getenv('DB_HOST'),
            'database': os.getenv('DB_NAME'),
            'user': os.getenv('DB_USER'),
            'password': os.getenv('DB_PASSWORD'),
            'port': int(os.getenv('DB_PORT', 5432)),
            'gssencmode': 'disable'
        }
        
        # Database connection
        self.connection = None
        
        # Data storage
        self.standalone_data = None
        self.multiple_data = None
        
        # Caches
        self.professors_cache = {}  # name -> professor data
        self.courses_cache = {}     # code -> course data
        self.acad_term_cache = {}   # id -> acad_term data
        self.faculties_cache = {}   # id -> faculty data
        self.faculty_acronym_to_id = {}  # acronym -> faculty_id mapping
        self.professor_lookup = {}  # scraped_name -> database mapping
        
        self.processed_timing_keys = set()
        self.processed_exam_class_ids = set()

        # Output data collectors
        self.new_professors = []
        self.new_courses = []
        self.update_courses = []
        self.new_acad_terms = []
        self.new_classes = []
        self.new_class_timings = []
        self.new_class_exam_timings = []
        self.update_professors = []  # For boss_name updates
        self.update_bid_result = []  # For bid result updates
        
        # Class ID mapping for timing tables
        self.class_id_mapping = {}  # record_key -> class_id
        
        # Courses requiring faculty assignment
        self.courses_needing_faculty = []
        
        # Statistics
        self.stats = {
            'professors_created': 0,
            'professors_updated': 0,
            'courses_created': 0,
            'courses_updated': 0,
            'classes_created': 0,
            'timings_created': 0,
            'exams_created': 0,
            'courses_needing_faculty': 0
        }
        
        # Use the global bidding schedule, assuming the start term is the target
        self.bidding_schedule = BIDDING_SCHEDULES.get(START_AY_TERM, [])
        
        # Asian surnames database for name normalization
        self.asian_surnames = {
            # Top 100+ common surnames covering Mainland China, Taiwan, HK, and Singapore/Malaysia
            'chinese': [
                'WANG', 'LI', 'ZHANG', 'LIU', 'CHEN', 'YANG', 'HUANG', 'ZHAO', 'WU', 'ZHOU', 'XU', 'SUN', 'MA', 'ZHU', 'HU', 'GUO', 'HE', 'LIN', 'GAO', 'LUO', 
                'CHENG', 'LIANG', 'XIE', 'SONG', 'TANG', 'HAN', 'FENG', 'DENG', 'CAO', 'PENG', 'YUAN', 'SU', 'JIANG', 'JIA', 'LU', 'WEI', 'XIAO', 'YU', 'QIAN', 
                'PAN', 'YAO', 'TAN', 'DU', 'YE', 'TIAN', 'SHI', 'BAI', 'QIN', 'XUE', 'YAN', 'DAI', 'MO', 'CHANG', 'WAN', 'GU', 'ZENG', 'FAN', 'JIN',
                'ONG', 'LIM', 'LEE', 'TEO', 'NG', 'GOH', 'CHUA', 'CHAN', 'KOH', 'ANG', 'YEO', 'SIM', 'CHIA', 'CHONG', 'LAM', 'CHEW', 'TOH', 'LOW', 'SEAH',
                'PEK', 'KWEK', 'QUEK', 'LOH', 'AW', 'CHYE', 'LOK'
            ],
            # Top ~30 Korean surnames
            'korean': [
                'KIM', 'LEE', 'PARK', 'CHOI', 'JEONG', 'KANG', 'CHO', 'YOON', 'JANG', 'LIM', 'HAN', 'OH', 'SEO', 'KWON', 'HWANG', 'SONG', 'JUNG', 'HONG', 
                'AHN', 'GO', 'MOON', 'SON', 'BAE', 'BAEK', 'HEO', 'NAM'
            ],
            # Top ~20 Vietnamese surnames
            'vietnamese': [
                'NGUYEN', 'TRAN', 'LE', 'PHAM', 'HOANG', 'PHAN', 'VU', 'VO', 'DANG', 'BUI', 'DO', 'HO', 'NGO', 'DUONG', 'LY'
            ],
            # Top ~30 Indian surnames from various regions
            'indian': [
                'SHARMA', 'SINGH', 'KUMAR', 'GUPTA', 'PATEL', 'KHAN', 'REDDY', 'YADAV', 'DAS', 'JAIN', 'RAO', 'MEHTA', 'CHOPRA', 'KAPOOR', 'MALHOTRA',
                'AGGARWAL', 'JOSHI', 'MISHRA', 'TRIPATHI', 'PANDEY', 'NAIR', 'MENON', 'PILLAI', 'IYER', 'MUKHERJEE', 'BANERJEE', 'CHATTERJEE'
            ],
            # Top ~20 Japanese surnames
            'japanese': [
                'SATO', 'SUZUKI', 'TAKAHASHI', 'TANAKA', 'WATANABE', 'ITO', 'YAMAMOTO', 'NAKAMURA', 'KOBAYASHI', 'SAITO', 'KATO', 'YOSHIDA', 'YAMADA'
            ]
        }
        self.all_asian_surnames = set()
        for surnames in self.asian_surnames.values():
            self.all_asian_surnames.update(surnames)
        
        # Western given names
        self.western_given_names = {
            'AARON', 'ADAM', 'ADRIAN', 'ALAN', 'ALBERT', 'ALEX', 'ALEXANDER', 'ALFRED', 'ALVIN', 'AMANDA', 'AMY', 'ANDREA', 'ANDREW', 'ANGELA', 'ANNA', 'ANTHONY', 'ARTHUR', 'AUDREY',
            'BEN', 'BENJAMIN', 'BERNARD', 'BETTY', 'BILLY', 'BOB', 'BOWEN', 'BRANDON', 'BRENDA', 'BRIAN', 'BRYAN', 'BRUCE',
            'CARL', 'CAROL', 'CATHERINE', 'CHARLES', 'CHRIS', 'CHRISTIAN', 'CHRISTINA', 'CHRISTINE', 'CHRISTOPHER', 'COLIN', 'CRAIG', 'CRYS',
            'DANIEL', 'DANNY', 'DARREN', 'DAVID', 'DEBORAH', 'DENISE', 'DENNIS', 'DEREK', 'DIANA', 'DONALD', 'DOUGLAS',
            'EDWARD', 'EDWIN', 'ELAINE', 'ELIZABETH', 'EMILY', 'ERIC', 'EUGENE', 'EVELYN',
            'FELIX', 'FRANCIS', 'FRANK',
            'GABRIEL', 'GARY', 'GEOFFREY', 'GEORGE', 'GERALD', 'GLORIA', 'GORDON', 'GRACE', 'GRAHAM', 'GREGORY',
            'HANNAH', 'HARRY', 'HELEN', 'HENRY', 'HOWARD',
            'IAN', 'IVAN',
            'JACK', 'JACOB', 'JAMES', 'JANE', 'JANET', 'JASON', 'JEAN', 'JEFFREY', 'JENNIFER', 'JEREMY', 'JERRY', 'JESSICA', 'JIM', 'JOAN', 'JOE', 'JOHN', 'JONATHAN', 'JOSEPH', 'JOSHUA', 'JOYCE', 'JUDY', 'JULIA', 'JULIE', 'JUSTIN',
            'KAREN', 'KATHERINE', 'KATHY', 'KEITH', 'KELLY', 'KELVIN', 'KENNETH', 'KEVIN', 'KIMBERLY',
            'LARRY', 'LAURA', 'LAWRENCE', 'LEO', 'LEONARD', 'LINDA', 'LISA',
            'MARGARET', 'MARIA', 'MARK', 'MARTIN', 'MARY', 'MATTHEW', 'MEGAN', 'MELISSA', 'MICHAEL', 'MICHELLE', 'MIKE',
            'NANCY', 'NATHAN', 'NEHA', 'NICHOLAS', 'NICOLE',
            'OLIVER', 'OLIVIA',
            'PAMELA', 'PATRICIA', 'PATRICK', 'PAUL', 'PETER', 'PHILIP',
            'RACHEL', 'RAYMOND', 'REBECCA', 'RICHARD', 'ROBERT', 'ROGER', 'RONALD', 'ROY', 'RUSSELL', 'RYAN',
            'SAM', 'SAMUEL', 'SANDRA', 'SARAH', 'SCOTT', 'SEAN', 'SHARON', 'SOPHIA', 'STANLEY', 'STEPHANIE', 'STEPHEN', 'STEVEN', 'SUSAN',
            'TERENCE', 'TERRY', 'THERESA', 'THOMAS', 'TIMOTHY', 'TONY',
            'VALERIE', 'VICTOR', 'VINCENT', 'VIRGINIA',
            'WALTER', 'WAYNE', 'WENDY', 'WILLIAM', 'WILLIE'
        }

        # Keywords for patronymic names (Malay, Indian, etc.)
        self.patronymic_keywords = {'BIN', 'BINTE', 'S/O', 'D/O'}

        # European surname particles
        self.surname_particles = {'DE', 'DI', 'DA', 'VAN', 'VON', 'LA', 'LE', 'DEL', 'DELLA'}       

        # Bid results data collectors
        self.boss_log_file = os.path.join(self.output_base, 'boss_result_log.txt')
        self.new_bid_windows = []
        self.new_class_availability = []
        self.new_bid_result = []
        self.failed_mappings = []
        self.bid_window_cache = {}
        self.bid_window_id_counter = 1

        # Load professor lookup from CSV if available
        # (self.professor_lookup is already initialized with the caches above)
        self.load_professor_lookup_csv()

        # LLM Configuration
        logger.info("🔧 Initializing LLM configuration...")
        self.llm_model_name = "gemini-2.5-flash"
        self.llm_batch_size = 50
        self.llm_prompt = """
        You are an expert in academic name structures from around the world.
        You will be given a JSON list of professor names.
        Your task is to identify the primary surname for each name.
        You MUST return a single JSON array of strings, where each string is the identified surname.
        The order of surnames in your response must exactly match the order of the full names in the input list.
        Provide ONLY the JSON array in your response.
        """
        
        # Pre-configure the model if the API key exists
        self.llm_model = None
        if os.getenv('GEMINI_API_KEY'):
            try:
                genai.configure(api_key=os.getenv('GEMINI_API_KEY'))
                optimized_config = genai.GenerationConfig(
                    response_mime_type="application/json"
                )
                self.llm_model = genai.GenerativeModel(
                    self.llm_model_name,
                    generation_config=optimized_config
                )
                logger.info(f"✅ Gemini model '{self.llm_model_name}' configured successfully.")
            except Exception as e:
                logger.warning(f"⚠️ Could not pre-configure Gemini model: {e}")
        else:
            logger.warning("⚠️ GEMINI_API_KEY not found. LLM normalization will be skipped.")

    def connect_database(self):
        """Connect to PostgreSQL database"""
        try:
            self.connection = psycopg2.connect(**self.db_config)
            logger.info("✅ Database connection established")
            return True
        except Exception as e:
            logger.error(f"❌ Database connection failed: {e}")
            return False

    def load_or_cache_data(self):
        """Load data from cache or database"""
        # Try loading from cache first
        if self._load_from_cache():
            logger.info("✅ Loaded data from cache")
            return True
        
        # Connect to database and download
        if not self.connect_database():
            return False
        
        try:
            self._download_and_cache_data()
            logger.info("✅ Downloaded and cached data from database")
            return True
        except Exception as e:
            logger.error(f"❌ Failed to download data: {e}")
            return False

    def _download_and_cache_data(self):
        """Download data from all required database tables and cache them locally."""
        try:
            # Define all tables to be cached
            tables_to_cache = [
                "professors", "courses", "acad_term", "faculties", 
                "classes", "class_timing", "class_exam_timing", 
                "class_availability", "bid_window", "bid_result", "bid_prediction"
            ]
            
            for table_name in tables_to_cache:
                logger.info(f"⬇️ Caching table: {table_name}")
                query = f"SELECT * FROM {table_name}"
                df = pd.read_sql_query(query, self.connection)
                df.to_pickle(os.path.join(self.cache_dir, f'{table_name}_cache.pkl'))
                
            logger.info("✅ Downloaded all tables from database and cached locally")
            
            # Load the newly cached data into memory
            self._load_from_cache()
            
        except Exception as e:
            logger.error(f"❌ Failed to download and cache data for table '{table_name}': {e}")
            # Add traceback for detailed debugging
            import traceback
            traceback.print_exc()
            raise

    def _load_from_cache(self) -> bool:
        """
        Load cached data from files, with robust validation of the professor lookup against the database cache.
        Professor validation only runs during Phase 1.
        """
        try:
            cache_files = {
                'professors': os.path.join(self.cache_dir, 'professors_cache.pkl'),
                'courses': os.path.join(self.cache_dir, 'courses_cache.pkl'),
                'acad_terms': os.path.join(self.cache_dir, 'acad_term_cache.pkl'),
                'faculties': os.path.join(self.cache_dir, 'faculties_cache.pkl'),
                'bid_result': os.path.join(self.cache_dir, 'bid_result_cache.pkl'),
                'bid_window': os.path.join(self.cache_dir, 'bid_window_cache.pkl'),
                'class_availability': os.path.join(self.cache_dir, 'class_availability_cache.pkl'),
                'class_exam_timing': os.path.join(self.cache_dir, 'class_exam_timing_cache.pkl'),
                'class_timing': os.path.join(self.cache_dir, 'class_timing_cache.pkl'),
                'classes': os.path.join(self.cache_dir, 'classes_cache.pkl')
            }
            
            if not all(os.path.exists(f) for f in cache_files.values()):
                logger.warning("⚠️ Not all cache files found. Need to download from database.")
                return False

            # Load professors data first
            professors_df = pd.read_pickle(cache_files['professors'])
            
            # Phase 1 applies unless self._phase2_mode exists and is truthy
            is_phase1 = not getattr(self, '_phase2_mode', False)
            
            if is_phase1:
                # --- Professor Lookup Synchronization (Phase 1 only) ---
                logger.info("🔄 Phase 1: Synchronizing professor lookup with database cache...")
                
                database_professors = {}
                all_database_aliases = {}

                for _, row in professors_df.iterrows():
                    professor_data = row.to_dict()
                    professor_id = str(row.get('id'))
                    professor_name = str(row.get('name', '')).strip()
                    
                    database_professors[professor_id] = professor_data
                    
                    # Add the professor's actual name as an alias
                    if professor_name:
                        all_database_aliases[professor_name.upper()] = professor_id

                    # Handle boss_aliases - robust parsing
                    aliases_list = self._parse_boss_aliases(row.get('boss_aliases'))
                    for alias in aliases_list:
                        if alias and str(alias).strip():
                            all_database_aliases[str(alias).strip().upper()] = professor_id

                logger.info(f"📚 Loaded {len(database_professors)} professors from cache")
                logger.info(f"📚 Found {len(all_database_aliases)} total aliases (including names)")

                # Load and validate professor_lookup.csv
                lookup_file = 'script_input/professor_lookup.csv'
                validated_professor_lookup = {}
                csv_entries_removed = 0
                csv_entries_corrected = 0
                csv_entries_added = 0

                if os.path.exists(lookup_file):
                    try:
                        lookup_df = pd.read_csv(lookup_file)
                        for _, row in lookup_df.iterrows():
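                            # pd.read_csv yields NaN for empty cells and str(NaN) is 'nan',
                            # so map missing values to '' before the emptiness check below.
                            boss_name, afterclass_name, database_id = (
                                '' if pd.isna(row.get(col)) else str(row.get(col)).strip()
                                for col in ('boss_name', 'afterclass_name', 'database_id')
                            )
                            
                            if not boss_name or not database_id:
                                continue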
                                
                            boss_name_key = boss_name.upper()

                            # CRITICAL: Validate database_id exists in database
                            if database_id not in database_professors:
                                logger.warning(f"❌ Invalid database_id in lookup: '{boss_name}' references non-existent ID {database_id}. Removing.")
                                csv_entries_removed += 1
                                continue

                            db_professor = database_professors[database_id]
                            db_name = str(db_professor.get('name', '')).strip()
                            
                            # Correct afterclass_name if it differs from database
                            if afterclass_name != db_name:
                                logger.warning(f"✏️ Correcting lookup entry for '{boss_name}': Name mismatch (CSV: '{afterclass_name}' vs DB: '{db_name}'). Using DB name.")
                                afterclass_name = db_name
                                csv_entries_corrected += 1
                            
                            validated_professor_lookup[boss_name_key] = {
                                'database_id': database_id,
                                'boss_name': boss_name,
                                'afterclass_name': afterclass_name
                            }
                    except Exception as e:
                        logger.error(f"❌ Error reading professor_lookup.csv: {e}")
                else:
                    logger.info("📋 professor_lookup.csv not found. Creating from database.")

                # Add missing database aliases to lookup (bidirectional sync)
                for alias_key, professor_id in all_database_aliases.items():
                    if alias_key not in validated_professor_lookup:
                        db_professor = database_professors[professor_id]
                        db_name = str(db_professor.get('name', '')).strip()
                        
                        validated_professor_lookup[alias_key] = {
                            'database_id': str(professor_id),
                            'boss_name': alias_key,
                            'afterclass_name': db_name
                        }
                        csv_entries_added += 1
                        # Only log for non-name aliases to reduce noise
                        if alias_key != db_name.upper():
                            logger.info(f"➕ Added missing DB alias to lookup: '{alias_key}' -> '{db_name}' (ID: {professor_id})")

                self.professor_lookup = validated_professor_lookup
                
                logger.info("✅ Phase 1 Professor lookup synchronization complete:")
                logger.info(f"  - Entries validated: {len(validated_professor_lookup)}")
                logger.info(f"  - Corrected entries: {csv_entries_corrected}")
                logger.info(f"  - Added DB entries: {csv_entries_added}")
                logger.info(f"  - Removed invalid entries: {csv_entries_removed}")

                # Save corrected lookup back to file
                corrected_lookup_data = sorted(list(self.professor_lookup.values()), key=lambda x: x['boss_name'])
                for item in corrected_lookup_data:
                    item['method'] = 'validated'
                
                corrected_df = pd.DataFrame(corrected_lookup_data)
                corrected_df.to_csv(lookup_file, index=False, columns=['boss_name', 'afterclass_name', 'database_id', 'method'])
                logger.info(f"💾 Updated '{lookup_file}' with synchronized data.")

                # Build professors_cache for lookups
                self.professors_cache = {}
                for lookup_data in self.professor_lookup.values():
                    db_id = lookup_data['database_id']
                    boss_name_key = lookup_data['boss_name'].upper()
                    if db_id in database_professors:
                        self.professors_cache[boss_name_key] = database_professors[db_id]
            else:
                # Phase 2/3: Simple loading without validation
                logger.info("🔄 Phase 2/3: Loading professor data without validation...")
                self.professors_cache = {}
                for _, row in professors_df.iterrows(): 
                    # Simple loading for non-Phase 1
                    professor_data = row.to_dict()
                    professor_name = str(row.get('name', '')).strip().upper()
                    if professor_name:
                        self.professors_cache[professor_name] = professor_data

            # --- Load Remaining Caches (all phases) ---
            courses_df = pd.read_pickle(cache_files['courses'])
            for _, row in courses_df.iterrows():
                self.courses_cache[row['code']] = row.to_dict()
            
            acad_terms_df = pd.read_pickle(cache_files['acad_terms'])
            for _, row in acad_terms_df.iterrows():
                self.acad_term_cache[row['id']] = row.to_dict()
                
            faculties_df = pd.read_pickle(cache_files['faculties'])
            for _, row in faculties_df.iterrows():
                self.faculties_cache[row['id']] = row.to_dict()
                self.faculty_acronym_to_id[row['acronym'].upper()] = row['id']
                
            bid_window_df = pd.read_pickle(cache_files['bid_window'])
            if not bid_window_df.empty:
                self.bid_window_id_counter = bid_window_df['id'].max() + 1
                for _, row in bid_window_df.iterrows():
                    self.bid_window_cache[(row['acad_term_id'], row['round'], row['window'])] = row['id']
            else:
                self.bid_window_id_counter = 1
                self.bid_window_cache = {}

            logger.info("✅ All cache files loaded successfully.")
            if is_phase1:
                logger.info(f"  - Professor lookup entries: {len(self.professor_lookup)} entries")
            logger.info(f"  - Professors cache: {len(self.professors_cache)} entries")
            return True

        except Exception as e:
            logger.error(f"❌ Cache loading error: {e}")
            import traceback
            traceback.print_exc()
            return False
    
    def load_raw_data(self):
        """
        Load raw data WITHOUT applying global filtering.
        Each processing function will apply its own appropriate filtering.
        """
        try:
            logger.info(f"📂 Loading raw data from {self.input_file}")
            
            full_standalone_df = pd.read_excel(self.input_file, sheet_name='standalone')
            full_multiple_df = pd.read_excel(self.input_file, sheet_name='multiple')
            
            logger.info(f"✅ Loaded {len(full_standalone_df)} total standalone and {len(full_multiple_df)} total multiple records.")
            
            # No global filtering here: store the full data and let each
            # processing function apply its own filter as needed.
            self.standalone_data = full_standalone_df
            self.multiple_data = full_multiple_df
            
            # Log available bidding windows for debugging
            if 'bidding_window' in full_standalone_df.columns:
                available_windows = full_standalone_df['bidding_window'].dropna().unique()
                logger.info(f"📊 Available bidding windows in data: {sorted(available_windows)}")
            
            # Log available academic terms for debugging  
            if 'acad_term_id' in full_standalone_df.columns:
                available_terms = full_standalone_df['acad_term_id'].dropna().unique()
                logger.info(f"📊 Available academic terms in data: {sorted(available_terms)}")

            # Create optimized lookup for the multiple_data
            self.multiple_lookup = defaultdict(list)
            for _, row in self.multiple_data.iterrows():
                key = row.get('record_key')
                if pd.notna(key):
                    self.multiple_lookup[key].append(row)
            
            logger.info(f"✅ Created optimized lookup for {len(self.multiple_lookup)} record keys from unfiltered data.")
            return True

        except Exception as e:
            logger.error(f"❌ Failed to load raw data: {e}")
            return False

    def _normalize_professor_name_fallback(self, name: str) -> Tuple[str, str]:
        """
        (Fallback Method) Normalizes professor names using a definitive, rule-based system.
        """
        if name is None or pd.isna(name) or not str(name).strip():
            return "UNKNOWN", "Unknown"

        # --- Step 1: Aggressive Preprocessing ---
        name_str = str(name).strip().replace("’", "'")
        name_str = re.sub(r'\s*\(.*\)\s*', ' ', name_str).strip()
        # Remove all middle initials (e.g., "S.", "H.", "H H", "S") to standardize names
        # This looks for standalone single letters, with or without a dot.
        words = name_str.split()
        words_no_initials = [word for word in words if not (len(word) == 1 and word.isalpha()) and not (len(word) == 2 and word.endswith('.'))]
        name_str = ' '.join(words_no_initials)

        boss_name = name_str.upper()

        # --- Step 2: Handle High-Certainty Delimiters ---
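        if ',' in name_str:
            parts = [p.strip() for p in name_str.split(',')]
            words = ' '.join(parts).split()
            if not words:
                # Only commas/whitespace remained after preprocessing; avoid an IndexError below.
                return boss_name, "Unknown"
            surname_to_check = words[0].upper()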
            if len(parts) == 2:
                words_after_comma = parts[1].split()
                if words_after_comma and words_after_comma[0].upper() in self.all_asian_surnames:
                    surname_to_check = words_after_comma[0].upper()
                else:
                    words_before_comma = parts[0].split()
                    if words_before_comma and words_before_comma[0].upper() in self.all_asian_surnames:
                        surname_to_check = words_before_comma[0].upper()
            afterclass_parts = [word.capitalize() for word in words]
            for i, word in enumerate(words):
                if word.upper() == surname_to_check:
                    afterclass_parts[i] = word.upper()
            return boss_name, ' '.join(afterclass_parts)

        words = name_str.split()
        if not words:
            return boss_name, "Unknown"
        if len(words) == 1:
            return boss_name, words[0].capitalize()

        for i, word in enumerate(words):
            if word.upper() in self.patronymic_keywords and i < len(words) - 1:
                surname_index = i + 1
                afterclass_parts = [w.capitalize() for w in words]
                afterclass_parts[i] = word.lower()
                afterclass_parts[surname_index] = words[surname_index].upper()
                return boss_name, ' '.join(afterclass_parts)

        # --- Step 3: Definitive Rule-Based Surname Identification ---
        surname_index = -1
        
        # Rule 1 (Fixes "Middle Surname"): If the name starts with a Western/Indian given name,
        # actively search for the first known Asian/Indian surname that follows.
        if words[0].upper() in self.western_given_names:
            for i in range(1, len(words)):
                if words[i].upper() in self.all_asian_surnames:
                    surname_index = i
                    break
        
        # Rule 2 (Fixes "Surname-First Western"): If a name contains a Western given name but
        # does NOT start with it, and the first word is not an Asian surname, assume the first word is the surname.
        elif any(w.upper() in self.western_given_names for w in words) and words[0].upper() not in self.all_asian_surnames:
            surname_index = 0

        # Rule 3: If neither of the complex cases above apply, check if the name starts with a known Asian surname.
        # This handles the most common SURNAME-first pattern.
        elif words[0].upper() in self.all_asian_surnames:
            surname_index = 0

        # Rule 4 (Fallback): If no specific pattern has been matched, default to the last word.
        if surname_index == -1:
            surname_index = len(words) - 1
            
        # Post-processing for European particles
        afterclass_parts = [word.capitalize() for word in words]
        if surname_index > 0 and words[surname_index-1].upper() in self.surname_particles:
            afterclass_parts[surname_index-1] = words[surname_index-1].upper()

        afterclass_parts[surname_index] = words[surname_index].upper()
        
        return boss_name, ' '.join(afterclass_parts)

    def resolve_professor_email(self, professor_name):
        """Resolve professor email using Outlook contacts"""
        try:
            # Initialize Outlook
            outlook = win32.Dispatch("Outlook.Application")
            namespace = outlook.GetNamespace("MAPI")
            
            # Try exact resolver first
            recipient = namespace.CreateRecipient(professor_name)
            if recipient.Resolve():
                # Try to get SMTP address
                address_entry = recipient.AddressEntry
                
                # Try Exchange user
                try:
                    exchange_user = address_entry.GetExchangeUser()
                    if exchange_user and exchange_user.PrimarySmtpAddress:
                        return exchange_user.PrimarySmtpAddress.lower()
                except Exception:
                    pass
                
                # Try Exchange distribution list
                try:
                    exchange_dl = address_entry.GetExchangeDistributionList()
                    if exchange_dl and exchange_dl.PrimarySmtpAddress:
                        return exchange_dl.PrimarySmtpAddress.lower()
                except Exception:
                    pass
                
                # Try PR_SMTP_ADDRESS property
                try:
                    property_accessor = address_entry.PropertyAccessor
                    smtp_addr = property_accessor.GetProperty("http://schemas.microsoft.com/mapi/proptag/0x39FE001E")
                    if smtp_addr:
                        return smtp_addr.lower()
                except Exception:
                    pass
                
                # Fallback: regex search in Address field
                try:
                    address = getattr(address_entry, "Address", "") or ""
                    match = re.search(r"[\w\.-]+@[\w\.-]+\.\w+", address)
                    if match:
                        return match.group(0).lower()
                except Exception:
                    pass
            
            # If exact resolve fails, try contacts search
            contacts_folder = namespace.GetDefaultFolder(10)  # olFolderContacts
            tokens = [t.lower() for t in professor_name.split() if t]
            
            for item in contacts_folder.Items:
                try:
                    full_name = (item.FullName or "").lower()
                    if all(token in full_name for token in tokens):
                        # Try the three standard email slots
                        for field in ("Email1Address", "Email2Address", "Email3Address"):
                            addr = getattr(item, field, "") or ""
                            if addr and "@" in addr:
                                return addr.lower()
                except Exception:
                    continue
            
            # If no email found, return default
            return 'enquiry@smu.edu.sg'
            
        except Exception as e:
            logger.warning(f"Email resolution failed for {professor_name}: {e}")
            return 'enquiry@smu.edu.sg'
        
    def process_professors(self):
        """
        Orchestrates the processing of professors: extraction, normalization, and creation.
        """
        logger.info("👥 Processing professors...")
        
        # Step 1: Extract unique names and their variations from the data source.
        unique_professors, professor_variations = self._extract_unique_professors()

        # Step 2: Filter out existing professors to find only new names.
        new_professors_to_normalize = []
        for prof_name in unique_professors:
            # A professor is considered "new" if they are not in the primary lookup.
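            # getattr keeps this consistent with the hasattr checks in Step 3 when no lookup was loaded.
            if prof_name.upper() not in getattr(self, 'professor_lookup', {}):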
                new_professors_to_normalize.append(prof_name)

        logger.info(f"Found {len(unique_professors)} unique names. "
                    f"Identified {len(new_professors_to_normalize)} as new and requiring normalization.")
        
        # Step 2b: Normalize only the new names using the LLM-first, fallback-second approach.
        normalized_map = self._normalize_professors_batch(new_professors_to_normalize)
        
        # Add a fallback for existing professors to ensure they are still processed later
        for prof_name in unique_professors:
            if prof_name not in normalized_map:
                # If an existing professor wasn't normalized, add them to the map using the fallback
                # to ensure they are processed correctly in the steps that follow.
                normalized_map[prof_name] = self._normalize_professor_name_fallback(prof_name)

        if not normalized_map:
            logger.info("No professor names were normalized. Aborting professor processing.")
            return

        # Step 3: Check cache, fuzzy match, and create new professor records.
        email_to_professor = {}
        for boss_name_key, prof_data in self.professors_cache.items():
            email = prof_data.get('email')
            # Skip missing values (None, or NaN from pandas) and the generic enquiry address.
            if isinstance(email, str) and email and email.lower() != 'enquiry@smu.edu.sg':
                email_to_professor[email.lower()] = prof_data

        fuzzy_matched_professors = []
        
        for prof_name in unique_professors:
            try:
                boss_name, afterclass_name = normalized_map[prof_name]
                
                # --- Matching cascade: exact lookup, partial word match, cache match, fuzzy match, email match ---
                if hasattr(self, 'professor_lookup') and prof_name.upper() in self.professor_lookup:
                    continue
                if hasattr(self, 'professor_lookup') and boss_name.upper() in self.professor_lookup:
                    self.professor_lookup[prof_name.upper()] = self.professor_lookup[boss_name.upper()]
                    continue
                
                if hasattr(self, 'professor_lookup'):
                    found_partial_match = False
                    for lookup_boss_name, lookup_data in self.professor_lookup.items():
                        prof_words = set(prof_name.upper().split())
                        lookup_words = set(lookup_boss_name.split())
                        
                        if prof_words.issubset(lookup_words) and len(prof_words) >= 2:
                            self.professor_lookup[prof_name.upper()] = lookup_data
                            found_partial_match = True
                            break
                    if found_partial_match:
                        continue
                
                if boss_name in self.professors_cache:
                    if not hasattr(self, 'professor_lookup'):
                        self.professor_lookup = {}
                    self.professor_lookup[prof_name.upper()] = {
                        'database_id': self.professors_cache[boss_name]['id'],
                        'boss_name': boss_name,
                        'afterclass_name': self.professors_cache[boss_name].get('name', afterclass_name)
                    }
                    continue
                
                fuzzy_match_found = False
                normalized_prof = ' '.join(str(prof_name).replace(',', ' ').split()).upper()
                
                for cached_name, cached_prof in self.professors_cache.items():
                    if cached_name is None:
                        continue
                    cached_normalized = ' '.join(str(cached_name).replace(',', ' ').split()).upper()
                    if normalized_prof == cached_normalized:
                        if not hasattr(self, 'professor_lookup'):
                            self.professor_lookup = {}
                        self.professor_lookup[prof_name.upper()] = {
                            'database_id': cached_prof['id'],
                            'boss_name': cached_prof.get('boss_name', cached_prof['name'].upper()),
                            'afterclass_name': cached_prof.get('name', afterclass_name)
                        }
                        fuzzy_match_found = True
                        break
                if fuzzy_match_found:
                    continue
                
                # Compare against professors created earlier in this run, preferring the
                # first boss alias and falling back to afterclass_name.
                for new_prof in self.new_professors:
                    new_normalized = ' '.join((new_prof.get('afterclass_name') or '').replace(',', ' ').split()).upper()
                    if 'boss_aliases' in new_prof:
                        try:
                            boss_aliases = json.loads(new_prof['boss_aliases'])
                            if isinstance(boss_aliases, list) and boss_aliases:
                                new_normalized = ' '.join(boss_aliases[0].replace(',', ' ').split()).upper()
                        except (json.JSONDecodeError, TypeError):
                            pass  # keep the afterclass_name fallback

                    if normalized_prof == new_normalized:
                        if not hasattr(self, 'professor_lookup'):
                            self.professor_lookup = {}
                        self.professor_lookup[prof_name.upper()] = {
                            'database_id': new_prof['id'],
                            'boss_name': boss_name,
                            'afterclass_name': new_prof['afterclass_name']
                        }
                        fuzzy_match_found = True
                        break
                
                if fuzzy_match_found:
                    continue
                
                if hasattr(self, 'professor_lookup'):
                    best_fuzzy_match = None
                    best_fuzzy_score = 0
                    FUZZY_MATCH_THRESHOLD = 90
                    for lookup_boss_name, lookup_data in self.professor_lookup.items():                        
                        score = self._calculate_fuzzy_score(prof_name, lookup_boss_name)
                        if score > best_fuzzy_score:
                            best_fuzzy_match = lookup_data
                            best_fuzzy_score = score
                    
                    if best_fuzzy_match and best_fuzzy_score >= FUZZY_MATCH_THRESHOLD:
                        fuzzy_matched_professors.append({
                            # json.dumps escapes quotes and backslashes that a hand-built string would not
                            'boss_aliases': json.dumps([prof_name.upper()], ensure_ascii=False),
                            'afterclass_name': best_fuzzy_match.get('afterclass_name', prof_name),
                            'database_id': best_fuzzy_match['database_id'],
                            'method': 'fuzzy_match',
                            'confidence_score': f"{best_fuzzy_score:.2f}"
                        })
                        self.professor_lookup[prof_name.upper()] = best_fuzzy_match
                        continue
                
                resolved_email = self.resolve_professor_email(afterclass_name)
                
                if (resolved_email and 
                    resolved_email.lower() != 'enquiry@smu.edu.sg' and 
                    resolved_email.lower() in email_to_professor):
                    existing_prof = email_to_professor[resolved_email.lower()]
                    if not hasattr(self, 'professor_lookup'):
                        self.professor_lookup = {}
                    self.professor_lookup[prof_name.upper()] = {
                        'database_id': existing_prof['id'],
                        'boss_name': boss_name,
                        'afterclass_name': existing_prof.get('name', afterclass_name)
                    }
                    continue
                
                self._create_new_professor(prof_name, professor_variations, email_to_professor)

            except Exception as e:
                logger.error(f"❌ Error processing professor '{prof_name}': {e}")
                continue
        
        if fuzzy_matched_professors:
            fuzzy_df = pd.DataFrame(fuzzy_matched_professors)
            fuzzy_path = os.path.join(self.verify_dir, 'fuzzy_matched_professors.csv')
            fuzzy_df.to_csv(fuzzy_path, index=False)
            logger.info(f"🔍 Saved {len(fuzzy_matched_professors)} fuzzy matched professors for validation")
        
        logger.info(f"✅ Created {self.stats['professors_created']} new professors")

    def _calculate_fuzzy_score(self, new_name: str, known_alias: str) -> float:
        """
        Calculates a fuzzy match score using a hybrid strategy that prioritizes
        ordered matches and handles permutations.
        """
        if not new_name or not known_alias:
            return 0.0
        
        # Clean and normalize names for consistent comparison
        new_name_clean = ' '.join(str(new_name).upper().replace(',', ' ').split())
        known_alias_clean = ' '.join(str(known_alias).upper().replace(',', ' ').split())
        
        # --- Layer 1: High-Precision Substring Check ---
        # Handles cases where one cleaned name is a contiguous run of the other, e.g.
        # 'WARREN BARTHOLOMEW CHIK' inside 'KAM WAI WARREN BARTHOLOMEW CHIK'.
        # The check is fast and respects word order. It compares characters, not words,
        # so abbreviated or reordered forms such as 'WARREN B. CHIK' fall through to Layer 2.
        if new_name_clean in known_alias_clean or known_alias_clean in new_name_clean:
            # We return a very high score, but not 100, to indicate a strong partial match.
            return 95.0

        # --- Layer 2: Hybrid Fuzzy Logic ---
        # If it's not a direct substring, we use two different fuzzy algorithms.
        
        # a) Partial Ratio: Good for ordered, partial matches.
        # This respects word order. A jumbled name will get a LOW score here.
        # e.g., 'KAM WAI CHIK' vs 'KAM WAI WARREN CHIK' will score high.
        partial_score = fuzz.partial_ratio(new_name_clean, known_alias_clean)
        
        # b) Token Set Ratio: Good for permutations and names with extra/missing words.
        # This handles cases like 'TAN RACHEL' vs 'RACHEL TAN YEN JUN', which the
        # substring check in Layer 1 misses because the word order differs.
        token_set_score = fuzz.token_set_ratio(new_name_clean, known_alias_clean)
        
        # We take the best score from the two fuzzy methods. This gives us the
        # flexibility to catch both ordered variations and unordered permutations.
        # Cast to float so the return value matches the declared return type.
        return float(max(partial_score, token_set_score))

    def process_courses(self):
        """
        Processes courses from the standalone sheet. It correctly identifies if a course
        is new or if an existing course needs to be updated.
        """
        logger.info("📚 Processing courses with robust CREATE vs. UPDATE logic...")
        
        # Use a set to only process each unique course code once from the input file.
        processed_course_codes_in_run = set()

        for idx, row in self.standalone_data.iterrows():
            course_code = row.get('course_code')
            if pd.isna(course_code) or course_code in processed_course_codes_in_run:
                continue
            
            processed_course_codes_in_run.add(course_code)

            # Check if the course already exists in our database cache
            if course_code in self.courses_cache:
                # --- UPDATE LOGIC ---
                existing_course = self.courses_cache[course_code]
                update_record = {'id': existing_course['id'], 'code': course_code}
                
                # Define fields to check for potential updates
                field_mapping = {
                    'name': 'course_name', 'description': 'course_description',
                    'credit_units': 'credit_units', 'course_area': 'course_area',
                    'enrolment_requirements': 'enrolment_requirements'
                }

                if self.needs_update(existing_course, row, field_mapping):
                    # needs_update only reports whether anything changed; the loop
                    # below collects the changed fields into update_record.
                    for db_field, raw_field in field_mapping.items():
                        new_value = row.get(raw_field)
                        if pd.notna(new_value) and str(new_value) != str(existing_course.get(db_field)):
                            update_record[db_field] = new_value
                    
                    self.update_courses.append(update_record)
                    self.stats['courses_updated'] += 1
            else:
                # --- CREATE LOGIC ---
                course_id = str(uuid.uuid4())
                new_course = {
                    'id': course_id, 'code': course_code,
                    'name': row.get('course_name', 'Unknown Course'),
                    'description': row.get('course_description', 'No description available'),
                    'credit_units': float(row.get('credit_units', 1.0)) if pd.notna(row.get('credit_units')) else 1.0,
                    'belong_to_university': 1, 'belong_to_faculty': None,
                    'course_area': row.get('course_area'),
                    'enrolment_requirements': row.get('enrolment_requirements')
                }
                self.new_courses.append(new_course)
                self.courses_cache[course_code] = new_course  # Add to cache for this run
                self.stats['courses_created'] += 1
        
        logger.info(f"✅ Course processing complete. New: {self.stats['courses_created']}, Updated: {self.stats['courses_updated']}.")

    def assign_course_faculties_interactive(self):
        """Interactive faculty assignment with option to create new faculties"""
        if not self.courses_needing_faculty:
            logger.info("✅ No courses need faculty assignment")
            return
        
        logger.info(f"🎓 Starting interactive faculty assignment for {len(self.courses_needing_faculty)} courses")
        
        # Get current max faculty ID for incrementing
        max_faculty_id = max(self.faculties_cache.keys()) if self.faculties_cache else 0
        
        faculty_assignments = []
        
        for course_info in self.courses_needing_faculty:
            print(f"\n{'='*60}")
            print("🎓 FACULTY ASSIGNMENT NEEDED")
            print(f"{'='*60}")
            print(f"Course Code: {course_info['course_code']}")
            print(f"Course Name: {course_info['course_name']}")
            
            # Get the most recent scraped file path for this course from the 'multiple' sheet
            driver = None
            course_code = course_info['course_code']
            last_filepath = self.get_last_filepath_by_course(course_code)
            
            if last_filepath:
                print(f"\nOpening scraped HTML file: {last_filepath}")
                
                try:
                    # Setup Chrome options
                    chrome_options = Options()
                    chrome_options.add_argument("--new-window")
                    chrome_options.add_argument("--start-maximized")
                    
                    # Initialize driver
                    driver = webdriver.Chrome(options=chrome_options)
                    
                    # Open the HTML file
                    abs_path = os.path.abspath(last_filepath)
                    from pathlib import Path
                    file_path = Path(abs_path)
                    
                    if file_path.exists():
                        # Use pathlib's as_uri() method for proper file:// URL
                        file_url = file_path.as_uri()
                        driver.get(file_url)
                        print("✅ Scraped HTML file opened in browser")
                        print("📋 Review the course content to determine the correct faculty")
                    else:
                        print(f"⚠️ HTML file not found: {abs_path}")
                        print("📋 Proceeding without file preview")
                        
                except Exception as e:
                    print(f"⚠️ Could not open HTML file: {e}")
                    print("📋 Proceeding without file preview")
            else:
                print(f"⚠️ No scraped HTML file found for course {course_code}")
                print("📋 Proceeding without file preview")
            
            # Show existing faculties
            print("\nExisting Faculty Options:")
            faculty_list = sorted(self.faculties_cache.values(), key=lambda x: x['id'])
            for faculty in faculty_list:
                print(f"{faculty['id']}. {faculty['name']} ({faculty['acronym']})")
            
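            print("\n0. Skip (will need manual review)")
            print("99. Create new faculty")
            
            # default=0 keeps the prompt working when no faculties exist yet
            # (max() on an empty sequence would raise ValueError).
            max_listed_id = max((f['id'] for f in faculty_list), default=0)
            
            while True:
                choice = input(f"\nEnter faculty number (0-{max_listed_id}, 99): ").strip()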
                
                if choice == '0':
                    faculty_id = None
                    break
                elif choice == '99':
                    # Create new faculty
                    print("\n📝 Creating new faculty:")
                    faculty_name = input("Enter faculty name: ").strip()
                    faculty_acronym = input("Enter faculty acronym (e.g., SCIS): ").strip().upper()
                    faculty_url = input("Enter faculty website URL (or press Enter for default): ").strip()
                    
                    if not faculty_url:
                        faculty_url = f"https://smu.edu.sg/{faculty_acronym.lower()}"
                    
                    # Increment faculty ID
                    max_faculty_id += 1
                    new_faculty = {
                        'id': max_faculty_id,
                        'name': faculty_name,
                        'acronym': faculty_acronym,
                        'site_url': faculty_url,
                        'belong_to_university': 1,  # SMU
                        'created_at': datetime.now().isoformat(),
                        'updated_at': datetime.now().isoformat()
                    }
                    
                    # Add to cache
                    self.faculties_cache[max_faculty_id] = new_faculty
                    self.faculty_acronym_to_id[faculty_acronym] = max_faculty_id
                    
                    # Save to new_faculties list
                    if not hasattr(self, 'new_faculties'):
                        self.new_faculties = []
                    self.new_faculties.append(new_faculty)
                    
                    faculty_id = max_faculty_id
                    print(f"✅ Created new faculty: {faculty_name} (ID: {faculty_id})")
                    break
                else:
                    try:
                        faculty_id = int(choice)
                        if faculty_id in [f['id'] for f in faculty_list]:
                            break
                        else:
                            print("Invalid choice. Please enter a valid faculty ID.")
                    except ValueError:
                        print("Invalid input. Please enter a number.")
            
            # Close browser after selection
            if driver:
                try:
                    print("\n🔄 Closing browser...")
                    driver.quit()
                except Exception as e:
                    print(f"⚠️ Error closing browser: {e}")
            
            # Store assignment
            faculty_assignments.append({
                'course_id': course_info['course_id'],
                'course_code': course_info['course_code'],
                'faculty_id': faculty_id
            })
        
        # Apply assignments
        for assignment in faculty_assignments:
            if assignment['faculty_id'] is not None:
                # Update new_courses
                for course in self.new_courses:
                    if course['id'] == assignment['course_id']:
                        course['belong_to_faculty'] = assignment['faculty_id']
                        break
                
                # Update cache
                if assignment['course_code'] in self.courses_cache:
                    self.courses_cache[assignment['course_code']]['belong_to_faculty'] = assignment['faculty_id']
        
        # Save outputs
        if self.new_courses:
            df = pd.DataFrame(self.new_courses)
            df.to_csv(os.path.join(self.verify_dir, 'new_courses.csv'), index=False)
            logger.info("✅ Updated new_courses.csv with faculty assignments")
        
        if hasattr(self, 'new_faculties') and self.new_faculties:
            df = pd.DataFrame(self.new_faculties)
            df.to_csv(os.path.join(self.verify_dir, 'new_faculties.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_faculties)} new faculties")
        
        logger.info("✅ Faculty assignment completed")

    # Backwards-compatible alias so callers using the original method name still work
    def assign_course_faculties(self):
        """Alias for assign_course_faculties_interactive"""
        return self.assign_course_faculties_interactive()

    def process_acad_terms(self):
        """Process academic terms from standalone sheet"""
        logger.info("📅 Processing academic terms...")
        
        # Group by (acad_year_start, acad_year_end, term)
        term_groups = defaultdict(list)
        
        for _, row in self.standalone_data.iterrows():
            # Try to extract from row data first
            year_start = row.get('acad_year_start')
            year_end = row.get('acad_year_end')
            term = row.get('term')
            
            # If any are missing, try to extract from source file path if available
            if pd.isna(year_start) or pd.isna(year_end) or pd.isna(term):
                if 'source_file' in row and pd.notna(row['source_file']):
                    fallback_term_id = self.extract_acad_term_from_path(row['source_file'])
                    if fallback_term_id:
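                        # Parse the fallback ID. The regex captures only the last two
                        # digits of the end year, so expand it to four digits. This
                        # assumes 'acad_year_end' holds a four-digit year, as the
                        # `% 100` in the ID construction below implies.
                        match = re.match(r'AY(\d{4})(\d{2})T(\w+)', fallback_term_id)
                        if match:
                            path_year_start = int(match.group(1))
                            path_year_end = (path_year_start // 100) * 100 + int(match.group(2))
                            if path_year_end < path_year_start:  # end year rolls into the next century
                                path_year_end += 100
                            year_start = path_year_start if pd.isna(year_start) else year_start
                            year_end = path_year_end if pd.isna(year_end) else year_end
                            term = f"T{match.group(3)}" if pd.isna(term) else term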
            
            key = (year_start, year_end, term)
            if all(pd.notna(v) for v in key):
                term_groups[key].append(row)
        
        # Create one academic term record per unique (year_start, year_end, term) group
        for (year_start, year_end, term), rows in term_groups.items():
            # Generate acad_term_id (keep T for ID)
            acad_term_id = f"AY{int(year_start)}{int(year_end) % 100:02d}{term}"
            
            # Check if already exists
            if acad_term_id in self.acad_term_cache:
                continue
            
            # Find most common period_text and dates
            period_counter = Counter()
            date_info = {}
            
            for row in rows:
                period_text = row.get('period_text', '')
                if pd.notna(period_text):
                    period_counter[period_text] += 1
                    if period_text not in date_info:
                        date_info[period_text] = {
                            'start_dt': row.get('start_dt'),
                            'end_dt': row.get('end_dt')
                        }
            
            # Get most common period
            if period_counter:
                most_common_period = period_counter.most_common(1)[0][0]
                dates = date_info[most_common_period]
            else:
                dates = {'start_dt': None, 'end_dt': None}
            
            # Get boss_id from first row
            boss_id = rows[0].get('acad_term_boss_id')
            
            # Remove T prefix from term field for database storage
            clean_term = str(term)[1:] if str(term).startswith('T') else str(term)
            
            new_term = {
                'id': acad_term_id,
                'acad_year_start': int(year_start),
                'acad_year_end': int(year_end),
                'term': clean_term,  # Store without T prefix
                'boss_id': int(boss_id) if pd.notna(boss_id) else None,
                'start_dt': dates['start_dt'],
                'end_dt': dates['end_dt']
            }
            
            self.new_acad_terms.append(new_term)
            self.acad_term_cache[acad_term_id] = new_term
            
            logger.info(f"✅ Created academic term: {acad_term_id} (term: {clean_term})")
        
        logger.info(f"✅ Created {len(self.new_acad_terms)} new academic terms")

    def process_classes(self, use_db_cache_for_classes=True):
        """
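        Process classes from the standalone sheet. Classes are keyed by
        (acad_term_id, boss_id, professor_id), so a multi-professor class yields
        one record per professor. A TBA class (no professor) that later gains a
        single professor is matched by course_id + section + acad_term_id and
        updated in place instead of being created again.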
        """
        logger.info("🏫 Processing classes with robust CREATE vs. UPDATE logic...")
        
        # Use sets to track processed classes and prevent duplicate operations
        processed_class_keys = set()
        processed_update_class_ids = set()  # Prevents duplicate TBA updates

        # Initialize update_classes if it doesn't exist
        if not hasattr(self, 'update_classes'):
            self.update_classes = []

        # Build a lookup of existing classes from the cache for quick checks
        self.existing_class_lookup = {}
        if use_db_cache_for_classes:
            self.load_existing_classes_cache()
            if self.existing_classes_cache:
                for c in self.existing_classes_cache:
                    # For existing classes, use acad_term_id + boss_id + professor_id as the key
                    acad_term_id = c.get('acad_term_id')
                    class_boss_id = c.get('boss_id')
                    professor_id = c.get('professor_id')

                    # The primary key parts must exist to create a valid entry
                    if acad_term_id and class_boss_id is not None:
                        key = (acad_term_id, class_boss_id, professor_id)
                        self.existing_class_lookup[key] = c

        for idx, row in self.standalone_data.iterrows():
            try:
                acad_term_id = row.get('acad_term_id')
                class_boss_id = row.get('class_boss_id')
                course_code = row.get('course_code')
                section = str(row.get('section'))

                if pd.isna(acad_term_id) or pd.isna(class_boss_id):
                    continue
                
                course_id = self.get_course_id(course_code)
                if not course_id:
                    continue
                
                record_key = row.get('record_key')
                
                # Get all professors for this class
                professor_mappings = self._find_professors_for_class(record_key) if record_key else []
                
                # Handle the specific case where a TBA class gets a professor assigned.
                class_rows_in_multiple = [r for r in self.multiple_lookup.get(record_key, []) if r.get('type') == 'CLASS']
                
                # Condition: Exactly one professor is assigned now, and there's only one CLASS row for it.
                if len(professor_mappings) == 1 and len(class_rows_in_multiple) == 1:
                    new_prof_id = professor_mappings[0][0]
                    
                    # Search for a matching TBA record in the cache (where professor_id is null).
                    tba_class_to_update = None
                    if hasattr(self, 'existing_classes_cache') and self.existing_classes_cache:
                        for existing_class in self.existing_classes_cache:
                            if (existing_class.get('course_id') == course_id and
                                str(existing_class.get('section')) == section and
                                existing_class.get('acad_term_id') == acad_term_id and
                                pd.isna(existing_class.get('professor_id'))):
                                tba_class_to_update = existing_class
                                break
                    
                    # If we found a TBA record and haven't updated it yet, perform an UPDATE.
                    if tba_class_to_update and tba_class_to_update['id'] not in processed_update_class_ids:
                        logger.info(f"🔄 Converting TBA class {course_code}-{section} to assigned.")
                        
                        # Add to set to prevent this specific update from repeating.
                        processed_update_class_ids.add(tba_class_to_update['id'])

                        # Create the update record with the new professor.
                        update_record = {
                            'id': tba_class_to_update['id'],
                            'professor_id': new_prof_id
                        }
                        self.update_classes.append(update_record)
                        
                        # Map the record_key to the now-updated class ID for timing processing.
                        if record_key:
                            if record_key not in self.class_id_mapping:
                                self.class_id_mapping[record_key] = []
                            if tba_class_to_update['id'] not in self.class_id_mapping[record_key]:
                                self.class_id_mapping[record_key].append(tba_class_to_update['id'])
                                
                        # Update the in-memory cache to prevent a duplicate record from being created.
                        # This tells the rest of the function that this class now exists with an assigned professor.
                        new_assigned_key = (acad_term_id, class_boss_id, new_prof_id)
                        old_tba_key = (acad_term_id, class_boss_id, None)

                        updated_class_record = tba_class_to_update.copy()
                        updated_class_record['professor_id'] = new_prof_id
                        
                        self.existing_class_lookup[new_assigned_key] = updated_class_record
                        
                        # Clean up the old TBA entry from the in-memory cache
                        if old_tba_key in self.existing_class_lookup:
                            del self.existing_class_lookup[old_tba_key]

                        logger.info(f"🧠 In-memory cache updated for {course_code}-{section} to prevent duplicate creation.")

                        # Mark the new professor-class combination as processed to prevent the default logic
                        # from creating a duplicate class later in the loop.
                        class_key_for_processing = (acad_term_id, class_boss_id, new_prof_id)
                        processed_class_keys.add(class_key_for_processing)
                
                # If no professors found, create one class with professor_id = None
                if not professor_mappings:
                    professor_mappings = [(None, '')]
                
                # Check if this is a multi-professor class
                is_multi_professor = len(professor_mappings) > 1
                warn_inaccuracy = is_multi_professor
                
                # Process each professor
                for prof_id, prof_name in professor_mappings:
                    # Create unique key using boss_id
                    class_key = (acad_term_id, class_boss_id, prof_id)
                    
                    # Skip if we've already processed this exact class
                    if class_key in processed_class_keys:
                        continue
                        
                    processed_class_keys.add(class_key)
                    
                    # Check if this exact class exists (including professor_id match)
                    existing_class = self.existing_class_lookup.get(class_key)
                    
                    if existing_class:
                        # --- UPDATE LOGIC ---
                        update_record = {'id': existing_class['id']}
                        needs_update = False
                        
                        fields_to_check = {
                            'grading_basis': row.get('grading_basis'),
                            'course_outline_url': row.get('course_outline_url'),
                            'boss_id': int(row.get('class_boss_id')) if pd.notna(row.get('class_boss_id')) else None,
                            'warn_inaccuracy': warn_inaccuracy
                        }
                        
                        for field, new_value in fields_to_check.items():
                            old_value = existing_class.get(field)

                            # Safely unwrap array-like objects and pandas/numpy types
                            if hasattr(old_value, '__iter__') and not isinstance(old_value, str):
                                try:
                                    if hasattr(old_value, 'item'):  # numpy array
                                        old_value = old_value.item()
                                    elif hasattr(old_value, '__len__') and len(old_value) > 0:
                                        old_value = old_value[0]
                                    else:
                                        old_value = None
                                except Exception:
                                    # e.g. .item() on a multi-element array, or a non-indexable iterable
                                    old_value = None
                            
                            # Unwrap new_value the same way if it is array-like
                            if hasattr(new_value, '__iter__') and not isinstance(new_value, str):
                                try:
                                    if hasattr(new_value, 'item'):  # numpy array
                                        new_value = new_value.item()
                                    elif hasattr(new_value, '__len__') and len(new_value) > 0:
                                        new_value = new_value[0]
                                    else:
                                        new_value = None
                                except Exception:
                                    new_value = None
                            
                            # Safe comparison
                            try:
                                # Handle None/NaN values first
                                if pd.isna(new_value) and pd.isna(old_value):
                                    continue
                                elif pd.isna(new_value) or pd.isna(old_value):
                                    if pd.notna(new_value):
                                        update_record[field] = new_value
                                        needs_update = True
                                    continue
                                
                                # Convert to strings for comparison
                                new_val_str = str(new_value).strip()
                                old_val_str = str(old_value).strip()
                                
                                if new_val_str != old_val_str:
                                    update_record[field] = new_value
                                    needs_update = True
                                    
                            except Exception:
                                # If the comparison itself fails, fall back to writing any non-null new value
                                if pd.notna(new_value):
                                    update_record[field] = new_value
                                    needs_update = True
                        
                        if needs_update:
                            self.update_classes.append(update_record)
                        
                        # Map record_key to existing class ID for timings
                        if record_key:
                            if record_key not in self.class_id_mapping:
                                self.class_id_mapping[record_key] = []
                            if existing_class['id'] not in self.class_id_mapping[record_key]:
                                self.class_id_mapping[record_key].append(existing_class['id'])
                    else:
                        # --- CREATE LOGIC ---
                        # Check if we already created this class in new_classes
                        already_created = False
                        for new_class in self.new_classes:
                            if (new_class['acad_term_id'] == acad_term_id and
                                str(new_class.get('boss_id')) == str(class_boss_id) and
                                new_class.get('professor_id') == prof_id):
                                already_created = True
                                # Map record_key to this class ID
                                if record_key:
                                    if record_key not in self.class_id_mapping:
                                        self.class_id_mapping[record_key] = []
                                    if new_class['id'] not in self.class_id_mapping[record_key]:
                                        self.class_id_mapping[record_key].append(new_class['id'])
                                break
                        
                        if not already_created:
                            class_id = str(uuid.uuid4())
                            new_class = {
                                'id': class_id,
                                'section': section,
                                'course_id': course_id,
                                'professor_id': prof_id,
                                'acad_term_id': acad_term_id,
                                'created_at': datetime.now().isoformat(),
                                'updated_at': datetime.now().isoformat(),
                                'grading_basis': row.get('grading_basis'),
                                'course_outline_url': row.get('course_outline_url'),
                                'boss_id': int(row.get('class_boss_id')) if pd.notna(row.get('class_boss_id')) else None,
                                'raw_professor_name': prof_name,
                                'warn_inaccuracy': warn_inaccuracy
                            }
                            
                            self.new_classes.append(new_class)
                            self.stats['classes_created'] += 1
                            
                            # Also add to existing_class_lookup to prevent duplicates in same run
                            self.existing_class_lookup[class_key] = new_class
                            
                            # Map record_key to new class ID
                            if record_key:
                                if record_key not in self.class_id_mapping:
                                    self.class_id_mapping[record_key] = []
                                self.class_id_mapping[record_key].append(class_id)
            
            except Exception as e:
                logger.error(f"❌ Exception processing class row {idx}: {e}")
        
        logger.info(f"✅ Class processing complete. New: {self.stats['classes_created']}, Updates: {len(self.update_classes)}.")
        return True
        
    def _find_professors_for_class(self, record_key: str) -> List[tuple]:
        """Find professor IDs for a class and return list of (professor_id, original_name) tuples
        Deduplicates by professor_id to avoid creating multiple class records for the same professor.
        """
        if not record_key or pd.isna(record_key):
            return []
        
        rows = self.multiple_lookup.get(record_key, [])
        professor_mappings = []
        seen_professor_ids = set()  # Track unique professor IDs
        
        # Ensure professor lookup is loaded
        if not hasattr(self, 'professor_lookup_loaded'):
            self.load_professor_lookup_csv()
        
        for row in rows:
            prof_name_raw = row.get('professor_name')
            
            # Skip missing values (None or NaN) coming from raw_data.xlsx
            if prof_name_raw is None or pd.isna(prof_name_raw):
                continue
            
            # Convert to string and strip - handles float NaN properly
            original_prof_name = str(prof_name_raw).strip()
            
            # Skip empty strings and 'nan' strings
            if not original_prof_name or original_prof_name.lower() == 'nan':
                continue
            
            # Split the professor names intelligently
            split_professors = self._split_professor_names(original_prof_name)
            
            # Process each split professor
            for prof_name in split_professors:
                if prof_name and prof_name.strip():  # Additional check for empty strings
                    prof_id = self._lookup_professor_with_fallback(prof_name.strip())
                    if prof_id and prof_id not in seen_professor_ids:
                        professor_mappings.append((prof_id, prof_name.strip()))
                        seen_professor_ids.add(prof_id)
        
        return professor_mappings

    def _split_professor_names(self, prof_name: str) -> List[str]:
        """
        Intelligently splits a string of professor names using a greedy, longest-match-first
        approach, which eliminates the need for hardcoded combinations.
        """
        if prof_name is None or pd.isna(prof_name) or not str(prof_name).strip():
            return []

        prof_name_str = str(prof_name).strip()
        
        # First, check if the entire string is already a known professor.
        # This handles names that include commas, like "LEE, MICHELLE PUI YEE".
        if prof_name_str.upper() in self.professor_lookup:
            return [prof_name_str]
            
        # If there are no commas, it can only be one professor.
        if ',' not in prof_name_str:
            return [prof_name_str]

        parts = [p.strip() for p in prof_name_str.split(',') if p.strip()]
        
        found_professors = []
        i = 0
        while i < len(parts):
            # Start from the longest possible combination of remaining parts and work backwards.
            match_found = False
            for j in range(len(parts), i, -1):
                # Create a candidate name by joining the parts
                candidate = ', '.join(parts[i:j])
                
                # Check if this longest possible candidate is a known professor
                if candidate.upper() in self.professor_lookup:
                    found_professors.append(candidate)
                    i = j  # Move the pointer past the parts we just consumed
                    match_found = True
                    break # Exit the inner loop and continue with the next part
            
            # If the inner loop finished without finding any match for the part(s)
            if not match_found:
                # This part is an unknown entity.
                # Try appending it to the previously found professor.
                # This handles cases like "WONG LI DE, BRIAN" where "BRIAN" is unknown.
                unknown_part = parts[i]
                if found_professors and len(unknown_part.split()) == 1:
                    # Append the unknown single-word part to the last known professor
                    found_professors[-1] = f"{found_professors[-1]}, {unknown_part}"
                    logger.info(f"✅ Combined unknown single word '{unknown_part}' with previous professor -> '{found_professors[-1]}'")
                else:
                    # Otherwise, treat it as its own (potentially new) professor
                    found_professors.append(unknown_part)
                
                i += 1 # Move to the next part
                
        return found_professors

    def _lookup_professor_with_fallback(self, prof_name: str) -> Optional[str]:
        """Enhanced professor lookup with partial word matching; creates a new professor only when every strategy fails."""
        
        if prof_name is None or pd.isna(prof_name):
            return None
        
        prof_name = str(prof_name).strip()
        if not prof_name or prof_name.lower() == 'nan':
            return None
        
        # Strategy 1 & 2: Direct and variation-based lookup.
        normalized_name = prof_name.upper()
        if hasattr(self, 'professor_lookup'):
            if normalized_name in self.professor_lookup:
                return self.professor_lookup[normalized_name]['database_id']
            
            variations = [
                prof_name.strip().upper(),
                prof_name.replace(',', '').strip().upper(),
                ' '.join(prof_name.replace(',', ' ').split()).upper()
            ]
            for variation in variations:
                if variation in self.professor_lookup:
                    return self.professor_lookup[variation]['database_id']
        
        # Strategy 3: Search boss_aliases in professors_cache using the new robust parser.
        search_name_normalized = normalized_name
        for prof_data in self.professors_cache.values():
            aliases_list = self._parse_boss_aliases(prof_data.get('boss_aliases'))
            
            for alias in aliases_list:
                alias_normalized = alias.strip().upper()
                
                if alias_normalized == search_name_normalized:
                    logger.info(f"✅ Found exact match in boss_aliases: {prof_name} → {alias} (ID: {prof_data.get('id')})")
                    if not hasattr(self, 'professor_lookup'): self.professor_lookup = {}
                    self.professor_lookup[search_name_normalized] = {
                        'database_id': str(prof_data.get('id')),
                        'boss_name': alias_normalized,
                        'afterclass_name': prof_data.get('name', prof_name)
                    }
                    return str(prof_data.get('id'))

        # Strategy 4: Enhanced partial word matching for cases like "DENNIS LIM" → "LIM CHONG BOON DENNIS"
        search_words = set(normalized_name.replace(',', ' ').split())
        if len(search_words) >= 2:  # Only try partial matching for multi-word names
            best_match = None
            best_score = 0
            
            for prof_data in self.professors_cache.values():
                # Check against afterclass_name (database name)
                db_name = prof_data.get('name', '').upper()
                db_words = set(db_name.replace(',', ' ').split())
                
                # Check if all search words are found in database name
                if search_words.issubset(db_words):
                    # Calculate match score (percentage of db_words that match search_words)
                    match_score = len(search_words) / len(db_words) if db_words else 0
                    
                    if match_score > best_score and match_score >= 0.5:  # At least 50% match
                        best_match = prof_data
                        best_score = match_score
                
                # Also check against boss_aliases
                aliases_list = self._parse_boss_aliases(prof_data.get('boss_aliases'))
                for alias in aliases_list:
                    alias_words = set(alias.upper().replace(',', ' ').split())
                    if search_words.issubset(alias_words):
                        match_score = len(search_words) / len(alias_words) if alias_words else 0
                        if match_score > best_score and match_score >= 0.5:
                            best_match = prof_data
                            best_score = match_score
            
            if best_match and best_score >= 0.5:
                logger.info(f"🔍 Partial word match found: '{prof_name}' → '{best_match.get('name')}' (score: {best_score:.2f})")
                
                # Add to lookup and save to partial matches tracking
                if not hasattr(self, 'professor_lookup'): self.professor_lookup = {}
                self.professor_lookup[normalized_name] = {
                    'database_id': str(best_match.get('id')),
                    'boss_name': normalized_name,
                    'afterclass_name': best_match.get('name', prof_name)
                }
                
                # Track partial matches for review
                if not hasattr(self, 'partial_matches'):
                    self.partial_matches = []
                self.partial_matches.append({
                    'boss_name': prof_name,
                    'afterclass_name': best_match.get('name'),
                    'database_id': str(best_match.get('id')),
                    'method': 'partial_match',
                    'match_score': f"{best_score:.2f}"
                })
                
                return str(best_match.get('id'))
        
        # Strategy 5: Exact match after normalizing punctuation and whitespace
        if hasattr(self, 'professor_lookup'):
            for lookup_name in self.professor_lookup.keys():
                if self._names_match_fuzzy_exact(normalized_name, lookup_name):
                    return self.professor_lookup[lookup_name]['database_id']
        
        if normalized_name in self.professors_cache:
            # Cast to str so every return path matches the Optional[str] annotation
            return str(self.professors_cache[normalized_name]['id'])
        
        # Strategy 6: No match in any source - create a new professor as a last resort
        logger.warning(f"⚠️ Professor not found, will create new: {prof_name}")
        
        # Create new professor (only when absolutely necessary)
        return self._create_new_professor(prof_name)
        
    def _names_match_fuzzy_exact(self, name1: str, name2: str) -> bool:
        """Exact fuzzy matching for names - only matches if completely identical after normalization"""
        
        # Handle None and non-string values
        if name1 is None or name2 is None:
            return False
        
        # Ensure both names are strings (None was already ruled out above)
        name1 = str(name1)
        name2 = str(name2)
        
        # Remove common variations and normalize
        clean1 = ' '.join(name1.replace(',', ' ').replace('.', ' ').split()).upper()
        clean2 = ' '.join(name2.replace(',', ' ').replace('.', ' ').split()).upper()
        
        # Only return True if they are exactly the same after cleaning
        return clean1 == clean2

    def load_professor_lookup_csv(self):
        """Load professor lookup CSV once and cache it properly"""
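        # Check if already loaded to prevent repeated loading
        if getattr(self, 'professor_lookup_loaded', False):
            return
        
        # Make sure the lookup dict exists even when the CSV is missing or invalid,
        # so callers such as _split_professor_names can query it safely.
        if not hasattr(self, 'professor_lookup'):
            self.professor_lookup = {}
        
        lookup_file = 'script_input/professor_lookup.csv'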
        
        if not os.path.exists(lookup_file):
            logger.warning("📋 professor_lookup.csv not found - will use database cache only")
            self.professor_lookup_loaded = True
            return
        
        try:
            # Load the CSV file
            lookup_df = pd.read_csv(lookup_file)
            
            # Validate required columns exist
            required_cols = ['boss_name', 'afterclass_name', 'database_id', 'method']
            missing_cols = [col for col in required_cols if col not in lookup_df.columns]
            if missing_cols:
                logger.error(f"❌ professor_lookup.csv missing required columns: {missing_cols}")
                self.professor_lookup_loaded = True
                return
            
            # Clear existing lookup and load fresh data
            self.professor_lookup = {}
            loaded_count = 0
            
            for _, row in lookup_df.iterrows():
                boss_name = row.get('boss_name')
                afterclass_name = row.get('afterclass_name')
                database_id = row.get('database_id')
                
                # Skip rows with critical missing values
                if pd.isna(boss_name) or pd.isna(database_id):
                    continue
                    
                # Use boss_name as the primary key for lookup
                boss_name_key = str(boss_name).strip().upper()
                self.professor_lookup[boss_name_key] = {
                    'database_id': str(database_id),
                    'boss_name': str(boss_name),
                    'afterclass_name': str(afterclass_name) if not pd.isna(afterclass_name) else str(boss_name)
                }
                loaded_count += 1
            
            logger.info(f"✅ Loaded {loaded_count} entries from professor_lookup.csv")
            self.professor_lookup_loaded = True
            
        except Exception as e:
            logger.error(f"❌ Error loading professor_lookup.csv: {e}")
            logger.info("📋 Continuing with database cache only")
            self.professor_lookup_loaded = True

    def _create_new_professor(self, prof_name: str, professor_variations: Optional[dict] = None, email_to_professor: Optional[dict] = None) -> str:
        """
        Create a new professor record, ensure proper tracking, and handle both primary and fallback alias creation.
        """
        boss_name, afterclass_name = self._normalize_professor_name_fallback(prof_name)
        
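        # Check if already created in this session to prevent duplicates
        import json  # imported once here rather than on every loop iteration
        for new_prof in self.new_professors:
            aliases_val = new_prof.get('boss_aliases', '[]')
            try:
                alias_list = json.loads(aliases_val) if isinstance(aliases_val, str) else aliases_val
            except (json.JSONDecodeError, TypeError):
                alias_list = []
            # Guard against null/NaN alias values so the membership test below cannot raise
            if not isinstance(alias_list, list):
                alias_list = []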

            if boss_name in alias_list or afterclass_name == new_prof.get('name', ''):
                # This professor was already created in this run, just return its ID.
                return new_prof.get('id')
        
        # --- Unified Creation Logic ---
        professor_id = str(uuid.uuid4())
        slug = re.sub(r'[^a-zA-Z0-9]+', '-', afterclass_name.lower()).strip('-')
        resolved_email = self.resolve_professor_email(afterclass_name)

        # --- Conditional Alias Creation ---
        boss_aliases_set = set()
        boss_aliases_set.add(boss_name)
        
        # SCENARIO A: Use sophisticated alias creation if professor_variations dictionary is provided
        if professor_variations:
            professor_specific_variations = professor_variations.get(prof_name, set())
            for variation in professor_specific_variations:
                if variation and variation.strip():
                    variation_boss_name, _ = self._normalize_professor_name_fallback(variation.strip())
                    boss_aliases_set.add(variation_boss_name)
        # SCENARIO B: Fallback to simple alias creation if not provided
        else:
            if boss_name != prof_name.upper():
                boss_aliases_set.add(prof_name.upper())

        boss_aliases_list = sorted(list(boss_aliases_set))
        import json
        boss_aliases_json = json.dumps(boss_aliases_list)
        
        # --- Create and Register the New Professor ---
        new_prof = {
            'id': professor_id,
            'name': afterclass_name,
            'email': resolved_email,
            'slug': slug,
            'photo_url': 'https://smu.edu.sg',
            'profile_url': 'https://smu.edu.sg',
            'belong_to_university': 1,
            'boss_aliases': boss_aliases_json,
            'original_scraped_name': prof_name
        }
        
        self.new_professors.append(new_prof)
        self.stats['professors_created'] += 1
        
        # Update lookup tables
        if not hasattr(self, 'professor_lookup'):
            self.professor_lookup = {}
        
        lookup_entry = {
            'database_id': professor_id,
            'boss_name': boss_name,
            'afterclass_name': afterclass_name
        }
        # Map the original name and all its aliases to the new ID
        self.professor_lookup[prof_name.upper()] = lookup_entry
        for alias in boss_aliases_list:
            self.professor_lookup[alias.upper()] = lookup_entry

        # Update the email duplicate checker dictionary if it was passed in
        if email_to_professor is not None and resolved_email and resolved_email.lower() != 'enquiry@smu.edu.sg':
            email_to_professor[resolved_email.lower()] = new_prof
        
        logger.info(f"✅ Created professor: {afterclass_name} with email: {resolved_email}")
        logger.info(f"   Boss aliases: {boss_aliases_list}")
        
        return professor_id

    def process_timings(self):
        """
        Process class and exam timings with robust deduplication that handles TBA cases.
        """
        logger.info("⏰ Processing class timings and exam timings with strict uniqueness checks...")

        # A tiny, local helper function to consistently handle missing values for key creation.
        # This prevents the 'nan' vs 'None' string issue.
        def _clean_key_val(v):
            return '' if pd.isna(v) else str(v)

        # Load existing timing keys from the database cache ONCE.
        if not self.processed_timing_keys:
            cache_file = os.path.join(self.cache_dir, 'class_timing_cache.pkl')
            if os.path.exists(cache_file):
                try:
                    df = pd.read_pickle(cache_file)
                    if not df.empty and 'class_id' in df.columns:
                        for _, record in df.iterrows():
                            # FIX: Use the robust key creation logic for cached data
                            key = (
                                record['class_id'],
                                _clean_key_val(record.get('day_of_week')),
                                _clean_key_val(record.get('start_time')),
                                _clean_key_val(record.get('end_time')),
                                _clean_key_val(record.get('venue'))
                            )
                            self.processed_timing_keys.add(key)
                        logger.info(f"✅ Pre-loaded {len(self.processed_timing_keys)} existing class timing keys from cache.")
                except Exception as e:
                    logger.warning(f"Could not preload class_timing_cache.pkl: {e}")

        # Load existing exam timing keys
        if not self.processed_exam_class_ids:
            exam_cache_file = os.path.join(self.cache_dir, 'class_exam_timing_cache.pkl')
            if os.path.exists(exam_cache_file):
                try:
                    df = pd.read_pickle(exam_cache_file)
                    if not df.empty and 'class_id' in df.columns:
                        self.processed_exam_class_ids.update(df['class_id'].unique())
                        logger.info(f"✅ Pre-loaded {len(self.processed_exam_class_ids)} existing exam class IDs from cache.")
                except Exception as e:
                    logger.warning(f"Could not preload class_exam_timing_cache.pkl: {e}")
        
        for _, row in self.multiple_data.iterrows():
            record_key = row.get('record_key')
            if record_key not in self.class_id_mapping:
                continue
            
            class_ids = self.class_id_mapping.get(record_key, [])
            timing_type = row.get('type', 'CLASS')
            
            for class_id in class_ids:
                if timing_type == 'CLASS':
                    # --- FIX: Use the same robust key generation for new data ---
                    timing_key = (
                        class_id,
                        _clean_key_val(row.get('day_of_week')),
                        _clean_key_val(row.get('start_time')),
                        _clean_key_val(row.get('end_time')),
                        _clean_key_val(row.get('venue'))
                    )
                    
                    # Check if this exact timing record has already been processed
                    if timing_key in self.processed_timing_keys:
                        continue
                    
                    # Add to set *before* appending to prevent duplicates within the same run
                    self.processed_timing_keys.add(timing_key)
                    
                    timing_record = {
                        'class_id': class_id, 'start_date': row.get('start_date'),
                        'end_date': row.get('end_date'), 'day_of_week': row.get('day_of_week'),
                        'start_time': row.get('start_time'), 'end_time': row.get('end_time'),
                        'venue': row.get('venue', '')
                    }
                    self.new_class_timings.append(timing_record)
                    self.stats['timings_created'] += 1
                
                elif timing_type == 'EXAM':
                    if class_id in self.processed_exam_class_ids:
                        continue
                    
                    self.processed_exam_class_ids.add(class_id)
                    
                    exam_record = {
                        'class_id': class_id, 'date': row.get('date'),
                        'day_of_week': row.get('day_of_week'), 
                        'start_time': str(row.get('start_time')),
                        'end_time': str(row.get('end_time')), 
                        'venue': row.get('venue')
                    }
                    self.new_class_exam_timings.append(exam_record)
                    self.stats['exams_created'] += 1
        
        logger.info(f"✅ Created {self.stats['timings_created']} new class timings (after deduplication).")
        logger.info(f"✅ Created {self.stats['exams_created']} new exam timings (after deduplication).")

    def save_outputs(self):
        """Save all generated CSV files, only creating files that have data."""
        logger.info("💾 Saving output files...")

        def to_csv_if_not_empty(data_list, filename):
            if data_list:
                df = pd.DataFrame(data_list)
                if not df.empty:
                    path = os.path.join(self.output_base, filename)
                    df.to_csv(path, index=False)
                    logger.info(f"✅ Saved {len(df)} records to {filename}")
                    
                    # DEBUG: For update_bid_result, show what we're saving
                    if filename == 'update_bid_result.csv':
                        logger.info(f"🔍 DEBUG: update_bid_result.csv columns: {list(df.columns)}")
                        # Show the first few rows with median/min values.
                        # Guard on column presence so a missing column cannot raise KeyError.
                        if {'median', 'min'}.issubset(df.columns):
                            rows_with_bid_data = df[df['median'].notna() | df['min'].notna()]
                        else:
                            rows_with_bid_data = df.iloc[0:0]
                        if not rows_with_bid_data.empty:
                            logger.info(f"🔍 DEBUG: {len(rows_with_bid_data)} rows have median/min data")
                            logger.info("🔍 DEBUG: Sample data:")
                            for _, row in rows_with_bid_data.head(3).iterrows():
                                logger.info(f"  bid_window_id={row.get('bid_window_id')}, class_id={row.get('class_id')}, median={row.get('median')}, min={row.get('min')}")
                        else:
                            logger.warning("⚠️ DEBUG: No rows in update_bid_result have median/min values!")

        # new_professors.csv is intentionally not saved here.
        # to_csv_if_not_empty(self.new_professors, 'new_professors.csv') 
        
        to_csv_if_not_empty(getattr(self, 'update_professors', []), 'update_professor.csv')
        # new_courses.csv is not saved here either, as it's handled in Phase 1.
        # to_csv_if_not_empty(self.new_courses, os.path.join('verify', 'new_courses.csv'))
        to_csv_if_not_empty(self.update_courses, 'update_courses.csv')
        to_csv_if_not_empty(getattr(self, 'update_classes', []), 'update_classes.csv')
        to_csv_if_not_empty(self.new_acad_terms, 'new_acad_term.csv')
        to_csv_if_not_empty(self.new_classes, 'new_classes.csv')
        to_csv_if_not_empty(self.new_class_timings, 'new_class_timing.csv')
        to_csv_if_not_empty(self.new_class_exam_timings, 'new_class_exam_timing.csv')
        update_records = getattr(self, 'update_bid_result', [])
        if update_records:
            logger.info(f"📝 Preparing to save {len(update_records)} bid result updates")
            # Show sample of records with actual bid data
            records_with_bids = [r for r in update_records if r.get('median') is not None or r.get('min') is not None]
            logger.info(f"   - {len(records_with_bids)} records have median/min bid data")
            if records_with_bids:
                sample = records_with_bids[0]
                logger.info(f"   - Sample: bid_window_id={sample.get('bid_window_id')}, median={sample.get('median')}, min={sample.get('min')}")
        to_csv_if_not_empty(update_records, 'update_bid_result.csv')

        if self.courses_needing_faculty:
            df = pd.DataFrame(self.courses_needing_faculty)
            # Saved under the 'verify' folder (self.verify_dir)
            df.to_csv(os.path.join(self.verify_dir, 'courses_needing_faculty.csv'), index=False)
            logger.info(f"✅ Saved {len(self.courses_needing_faculty)} courses needing faculty assignment to the verify folder.")
            
    def update_professor_lookup_from_corrected_csv(self):
        """Update professor lookup from manually corrected new_professors.csv"""
        logger.info("🔄 Updating professor lookup from corrected CSV...")
        
        # Read corrected new_professors.csv
        corrected_csv_path = os.path.join(self.verify_dir, 'new_professors.csv')
        if not os.path.exists(corrected_csv_path):
            logger.info(f"📝 No corrected CSV found: {corrected_csv_path} - assuming all professors already exist")
            return True

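        try:
            corrected_df = pd.read_csv(corrected_csv_path)
        except pd.errors.EmptyDataError:
            # A zero-byte file has no header to parse; treat it like an empty CSV
            corrected_df = pd.DataFrame()
        if corrected_df.empty:
            logger.info("📝 Empty corrected CSV - no professors to update")
            return True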

        try:
            logger.info(f"📖 Reading {len(corrected_df)} corrected professor records")
            
            # Clear and rebuild the new_professors list with corrected data
            self.new_professors = []
            
            # Update internal professor_lookup and rebuild new_professors
            updated_count = 0
            
            # FIXED: Initialize professor_lookup if it doesn't exist
            if not hasattr(self, 'professor_lookup'):
                self.professor_lookup = {}
            
            for _, row in corrected_df.iterrows():
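                # pandas yields NaN (a truthy float) for blank cells, so normalize
                # to plain strings before any truthiness check or .upper() call
                original_name = row.get('original_scraped_name', '')
                original_name = '' if pd.isna(original_name) else str(original_name).strip()
                corrected_afterclass_name = row.get('name', '')  # This is the corrected name
                corrected_afterclass_name = '' if pd.isna(corrected_afterclass_name) else str(corrected_afterclass_name).strip()
                boss_aliases = row.get('boss_aliases', '')  # This should be JSON string
                professor_id = row.get('id', '')
                professor_id = '' if pd.isna(professor_id) else str(professor_id).strip()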
                
                # Parse boss_aliases JSON string
                try:
                    import json
                    if isinstance(boss_aliases, str) and boss_aliases.strip():
                        boss_aliases_list = json.loads(boss_aliases)
                        if isinstance(boss_aliases_list, list) and boss_aliases_list:
                            boss_name = boss_aliases_list[0]  # Use first boss alias
                        else:
                            boss_name = original_name.upper() if original_name else corrected_afterclass_name.upper()
                    else:
                        boss_name = original_name.upper() if original_name else corrected_afterclass_name.upper()
                except (json.JSONDecodeError, TypeError):
                    # Fallback if JSON parsing fails
                    boss_name = original_name.upper() if original_name else corrected_afterclass_name.upper()
                
                # Rebuild the professor record with corrected data
                corrected_prof = {
                    'id': professor_id,
                    'name': corrected_afterclass_name,  # Use corrected name
                    'email': row.get('email', 'enquiry@smu.edu.sg'),
                    'slug': row.get('slug', ''),
                    'photo_url': row.get('photo_url', 'https://smu.edu.sg'),
                    'profile_url': row.get('profile_url', 'https://smu.edu.sg'),
                    'belong_to_university': row.get('belong_to_university', 1),
                    'boss_aliases': boss_aliases,  # Keep as JSON string
                    'afterclass_name': corrected_afterclass_name,
                    'original_scraped_name': original_name
                }
                
                # Add to new_professors list
                self.new_professors.append(corrected_prof)
                
                # FIXED: Update professor_lookup with ALL variations
                if professor_id:
                    lookup_entry = {
                        'database_id': professor_id,
                        'boss_name': boss_name,
                        'afterclass_name': corrected_afterclass_name
                    }
                    
                    # Add original scraped name to lookup
                    if original_name:
                        self.professor_lookup[original_name.upper()] = lookup_entry
                        updated_count += 1
                    
                    # Add corrected afterclass name to lookup
                    if corrected_afterclass_name:
                        self.professor_lookup[corrected_afterclass_name.upper()] = lookup_entry
                    
                    # Add boss_name to lookup
                    if boss_name:
                        self.professor_lookup[boss_name.upper()] = lookup_entry
                    
                    # FIXED: Add all boss aliases to lookup
                    try:
                        if isinstance(boss_aliases, str) and boss_aliases.strip():
                            boss_aliases_list = json.loads(boss_aliases)
                            if isinstance(boss_aliases_list, list):
                                for alias in boss_aliases_list:
                                    if alias and str(alias).strip():
                                        self.professor_lookup[str(alias).upper()] = lookup_entry
                    except (json.JSONDecodeError, TypeError):
                        pass  # Skip if JSON parsing fails
            
            # Save updated professor lookup to CSV
            self._save_corrected_professor_lookup()
            
            logger.info(f"✅ Updated {updated_count} professor lookup entries")
            logger.info(f"✅ Rebuilt {len(self.new_professors)} professor records with corrections")
            logger.info(f"✅ Total lookup entries now: {len(self.professor_lookup)}")
            
            return True
            
        except Exception as e:
            logger.error(f"❌ Failed to update professor lookup: {e}")
            import traceback
            traceback.print_exc()
            return False

    def update_professors_with_boss_names(self):
        """
        Update professors with missing/additional boss_names by comparing professor_lookup.csv
        with database boss_aliases and combining new variations from high-confidence fuzzy matches.
        """
        logger.info("👤 Updating professors with boss_names and detecting new variations...")

        # --- Step 1: Load high-confidence fuzzy matches from Phase 1 ---
        fuzzy_path = os.path.join(self.verify_dir, 'fuzzy_matched_professors.csv')
        new_aliases_by_id = defaultdict(list)

        if os.path.exists(fuzzy_path):
            try:
                fuzzy_df = pd.read_csv(fuzzy_path)
                high_confidence_matches = fuzzy_df[fuzzy_df['confidence_score'] >= 95]
                logger.info(f"🔍 Found {len(high_confidence_matches)} high-confidence (>=95) fuzzy matches to process.")

                for _, row in high_confidence_matches.iterrows():
                    database_id = str(row['database_id'])
                    afterclass_name = row['afterclass_name']
                    
                    try:
                        import json
                        aliases_val = row.get('boss_aliases', '[]')
                        new_aliases = json.loads(aliases_val) if isinstance(aliases_val, str) else []
                        
                        for alias in new_aliases:
                            if alias and str(alias).strip():
                                clean_alias = str(alias).strip()
                                new_aliases_by_id[database_id].append(clean_alias)
                                
                                # Add to in-memory professor_lookup to be saved later
                                if not hasattr(self, 'professor_lookup'):
                                    self.professor_lookup = {}
                                
                                alias_key = clean_alias.upper()
                                if alias_key not in self.professor_lookup:
                                    self.professor_lookup[alias_key] = {
                                        'database_id': database_id,
                                        'boss_name': clean_alias,
                                        'afterclass_name': afterclass_name,
                                        'method': 'fuzzy_match' # Add method for tracking
                                    }
                                    logger.info(f"➕ Adding fuzzy match to lookup: '{clean_alias}' -> '{afterclass_name}'")

                    except (json.JSONDecodeError, TypeError) as e:
                        logger.warning(f"⚠️ Could not parse boss_aliases from fuzzy_matched_professors.csv for row: {row.to_dict()}. Error: {e}")
            
            except Exception as e:
                logger.error(f"❌ Error processing fuzzy_matched_professors.csv: {e}")

        # --- Step 2: Load existing variations from professor_lookup.csv ---
        lookup_file = 'script_input/professor_lookup.csv'
        lookup_groups = defaultdict(list)
        if os.path.exists(lookup_file):
            try:
                lookup_df = pd.read_csv(lookup_file)
                for _, row in lookup_df.iterrows():
                    database_id = row.get('database_id')
                    boss_name = row.get('boss_name')
                    if pd.notna(database_id) and pd.notna(boss_name):
                        lookup_groups[str(database_id)].append(str(boss_name).strip())
            except Exception as e:
                logger.error(f"❌ Error loading professor_lookup.csv: {e}")

        # --- Step 3: Iterate through professors and combine all alias sources ---
        updated_professor_ids = set() 
        self.update_professors = []
        new_variations_found = []
        import json

        for prof_key, prof_data in self.professors_cache.items():
            professor_id = str(prof_data.get('id'))
            if professor_id in updated_professor_ids:
                continue

            # Get all sources of aliases as sets for easy combination
            current_boss_aliases = set(self._parse_boss_aliases(prof_data.get('boss_aliases')))
            lookup_variations = set(lookup_groups.get(professor_id, []))
            fuzzy_variations = set(new_aliases_by_id.get(professor_id, []))

            # Combine all unique variations using set union
            final_aliases_raw = current_boss_aliases.union(lookup_variations).union(fuzzy_variations)

            # Normalize both sets for a stable comparison, preventing repeated updates
            current_aliases_normalized = {name.replace("’", "'") for name in current_boss_aliases}
            final_aliases_normalized = {name.replace("’", "'") for name in final_aliases_raw}

            # Check for changes using the normalized sets
            if final_aliases_normalized != current_aliases_normalized:
                # Save the raw, original names to preserve the smart quote from the source
                unique_boss_names = sorted(list(final_aliases_raw))
                # Use ensure_ascii=False to prevent encoding '’' to '\u2019'
                boss_aliases_json = json.dumps(unique_boss_names, ensure_ascii=False)

                self.update_professors.append({
                    'id': professor_id,
                    'boss_aliases': boss_aliases_json,
                })
                
                # For logging, find the newly added variations
                newly_added = final_aliases_raw - current_boss_aliases
                if newly_added:
                    logger.info(f"✅ Adding {len(newly_added)} new variations for professor {professor_id}: {sorted(list(newly_added))}")
                    new_variations_found.append({
                        'professor_id': professor_id,
                        'professor_name': prof_data.get('name', 'Unknown'),
                        'existing_aliases': sorted(list(current_boss_aliases)),
                        'new_variations': sorted(list(newly_added)),
                        'final_aliases': unique_boss_names
                    })
                
                updated_professor_ids.add(professor_id)

        # --- Step 4: Save all outputs ---
        # Save partial matches if any were found
        if hasattr(self, 'partial_matches') and self.partial_matches:
            partial_df = pd.DataFrame(self.partial_matches)
            partial_path = os.path.join(self.verify_dir, 'partial_matches.csv')
            partial_df.to_csv(partial_path, index=False)
            logger.info(f"🔍 Saved {len(self.partial_matches)} partial matches to partial_matches.csv")

        # Save new variations summary
        if new_variations_found:
            report_data = [
                {
                    'professor_id': item.get('professor_id'),
                    'professor_name': item.get('professor_name'),
                    'existing_aliases': '|'.join(item.get('existing_aliases', [])),
                    'new_variations': '|'.join(item.get('new_variations', [])),
                    'final_aliases': '|'.join(item.get('final_aliases', [])),
                }
                for item in new_variations_found
            ]
            variations_df = pd.DataFrame(report_data)
            variations_path = os.path.join(self.verify_dir, 'new_variations_found.csv')
            variations_df.to_csv(variations_path, index=False, encoding='utf-8-sig')
            logger.info(f"🆕 Saved {len(new_variations_found)} professors with new variations to new_variations_found.csv")

        # Save the update_professor.csv file
        if self.update_professors:
            df = pd.DataFrame(self.update_professors)
            update_path = os.path.join(self.output_base, 'update_professor.csv')
            df.to_csv(update_path, index=False, encoding='utf-8')
            logger.info(f"✅ Saved {len(self.update_professors)} unique professor updates to update_professor.csv")
            self.stats['professors_updated'] = len(self.update_professors)
        else:
            logger.info("ℹ️ No professors need boss_name updates.")
            self.stats['professors_updated'] = 0

        # --- Step 5: Persist the updated professor lookup table ---
        self._save_corrected_professor_lookup()

    def process_remaining_tables(self):
        """Process classes and timings after professor lookup is updated"""
        logger.info("🏫 Processing remaining tables (classes, timings)...")
        
        try:
            # Clear any existing data from Phase 1 to avoid duplicates
            self.new_classes = []
            self.new_class_timings = []
            self.new_class_exam_timings = []
            self.class_id_mapping = {}
            self.stats['classes_created'] = 0
            self.stats['timings_created'] = 0
            self.stats['exams_created'] = 0
            
            # Process classes (depends on updated professor lookup)
            self.process_classes()
            
            # Process timings (depends on classes)
            self.process_timings()
            
            logger.info("✅ Remaining tables processed successfully")
            return True
            
        except Exception as e:
            logger.error(f"❌ Failed to process remaining tables: {e}")
            return False

    def _save_corrected_professor_lookup(self):
        """Save professor lookup preserving all input entries, adding new ones, and including partial matches"""
        # Start with all existing entries from input professor_lookup.csv
        all_lookup_data = {}
        
        # Step 1: Load ALL entries from input professor_lookup.csv (preserve existing)
        input_lookup_file = 'script_input/professor_lookup.csv'
        if os.path.exists(input_lookup_file):
            try:
                input_df = pd.read_csv(input_lookup_file)
                for _, row in input_df.iterrows():
                    boss_name = row.get('boss_name')
                    afterclass_name = row.get('afterclass_name')
                    database_id = row.get('database_id')
                    method = row.get('method', 'exists')
                    
                    # Only require database_id to be present
                    if pd.notna(database_id):
                        if pd.isna(boss_name) or str(boss_name).strip() == '':
                            if pd.notna(afterclass_name):
                                lookup_key = f"EMPTY_BOSS_{str(afterclass_name).upper().replace(' ', '_')}"
                                boss_name_value = ""
                            else:
                                lookup_key = f"EMPTY_BOSS_{str(database_id)}"
                                boss_name_value = ""
                        else:
                            lookup_key = str(boss_name).upper()
                            boss_name_value = str(boss_name)
                        
                        all_lookup_data[lookup_key] = {
                            'boss_name': boss_name_value,
                            'afterclass_name': str(afterclass_name) if pd.notna(afterclass_name) else "",
                            'database_id': str(database_id),
                            'method': str(method)
                        }
                
                logger.info(f"📖 Loaded {len(all_lookup_data)} existing entries from input professor_lookup.csv")
            except Exception as e:
                logger.warning(f"⚠️ Could not load input professor_lookup.csv: {e}")
        
        # Step 2: Add/update with new entries from current processing
        new_entries_count = 0
        updated_entries_count = 0
        
        # professor_lookup may not exist yet if no earlier step populated it
        for scraped_name, data in getattr(self, 'professor_lookup', {}).items():
            boss_name = data.get('boss_name', scraped_name.upper())
            afterclass_name = data.get('afterclass_name', scraped_name)
            database_id = data['database_id']
            
            # Determine method: check if this is a newly created professor or partial match
            method = 'exists'  # default
            if any(prof['id'] == database_id for prof in self.new_professors):
                method = 'created'
            elif hasattr(self, 'partial_matches') and any(
                match['database_id'] == database_id and match['boss_name'] == scraped_name
                for match in self.partial_matches
            ):
                method = 'partial_match'
            
            boss_name_key = str(boss_name).upper()
            
            if boss_name_key in all_lookup_data:
                # Update existing entry if method changed
                if method in ['created', 'partial_match']:
                    all_lookup_data[boss_name_key]['method'] = method
                    updated_entries_count += 1
            else:
                # Add new entry
                all_lookup_data[boss_name_key] = {
                    'boss_name': str(boss_name),
                    'afterclass_name': str(afterclass_name),
                    'database_id': str(database_id),
                    'method': method
                }
                new_entries_count += 1
                logger.info(f"   -> NEW LOOKUP: Adding '{boss_name}' for '{afterclass_name}' (ID: {database_id}, method: {method})")
        
        # Step 3: Add partial matches that weren't already in professor_lookup
        if hasattr(self, 'partial_matches'):
            for match in self.partial_matches:
                boss_name_key = match['boss_name'].upper()
                if boss_name_key not in all_lookup_data:
                    all_lookup_data[boss_name_key] = {
                        'boss_name': match['boss_name'],
                        'afterclass_name': match['afterclass_name'],
                        'database_id': match['database_id'],
                        'method': f"partial_match_{match.get('match_score', '')}"
                    }
                    new_entries_count += 1
                    logger.info(f"   -> PARTIAL MATCH: Adding '{match['boss_name']}' → '{match['afterclass_name']}' (score: {match.get('match_score', 'N/A')})")
        
        # Step 4: Convert to list and sort
        lookup_data = list(all_lookup_data.values())
        lookup_data.sort(key=lambda x: x['boss_name'] if x['boss_name'] else x['afterclass_name'])
        
        # Step 5: Save main lookup file
        df = pd.DataFrame(lookup_data)
        df.to_csv(input_lookup_file, index=False)
        
        # Step 6: Save separate tracking files for manual review
        if hasattr(self, 'partial_matches') and self.partial_matches:
            partial_df = pd.DataFrame(self.partial_matches)
            partial_path = os.path.join(self.verify_dir, 'partial_matches_log.csv')
            partial_df.to_csv(partial_path, index=False)
            logger.info(f"🔍 Saved {len(self.partial_matches)} partial matches to partial_matches_log.csv")
        
        logger.info("✅ Updated professor_lookup.csv:")
        logger.info(f"   • Total entries: {len(lookup_data)}")
        logger.info(f"   • New entries added: {new_entries_count}")
        logger.info(f"   • Existing entries updated: {updated_entries_count}")
        
        # Step 7: Log summary of different methods
        method_counts = {}
        for entry in lookup_data:
            method = entry.get('method', 'unknown')
            method_counts[method] = method_counts.get(method, 0) + 1
        
        logger.info("📊 Entries by method:")
        for method, count in sorted(method_counts.items()):
            logger.info(f"   • {method}: {count}")

    def print_summary(self):
        """Print processing summary"""
        print("\n" + "="*70)
        print("📊 PROCESSING SUMMARY")
        print("="*70)
        print(f"✅ Professors created: {self.stats['professors_created']}")
        print(f"✅ Courses created: {self.stats['courses_created']}")
        print(f"✅ Courses updated: {self.stats['courses_updated']}")
        print(f"⚠️  Courses needing faculty: {self.stats['courses_needing_faculty']}")
        print(f"✅ Classes created: {self.stats['classes_created']}")
        print(f"✅ Class timings created: {self.stats['timings_created']}")
        print(f"✅ Exam timings created: {self.stats['exams_created']}")
        print("="*70)
        
        print("\n📁 OUTPUT FILES:")
        print(f"   Verify folder: {self.verify_dir}/")
        print(f"   - new_professors.csv ({self.stats['professors_created']} records)")
        print(f"   - new_courses.csv ({self.stats['courses_created']} records)")
        print(f"   Output folder: {self.output_base}/")
        print(f"   - update_courses.csv ({self.stats['courses_updated']} records)")
        print(f"   - new_acad_term.csv ({len(self.new_acad_terms)} records)")
        print(f"   - new_classes.csv ({self.stats['classes_created']} records)")
        print(f"   - new_class_timing.csv ({self.stats['timings_created']} records)")
        print(f"   - new_class_exam_timing.csv ({self.stats['exams_created']} records)")
        print("   - professor_lookup.csv (updated in script_input/)")
        print(f"   - courses_needing_faculty.csv ({self.stats['courses_needing_faculty']} records)")
        print("="*70)

    def run_phase1_professors_and_courses(self):
        """Phase 1: Process professors and courses with automated faculty mapping and cache checking"""
        try:
            logger.info("🚀 Starting Phase 1: Professors and Courses with Cache Checking")
            logger.info("="*60)
            
            # Step 1: Load or cache database data.
            # The new _load_from_cache now handles all professor lookup validation and synchronization internally.
            if not self.load_or_cache_data_with_freshness_check():
                logger.error("❌ Failed to load or validate database data")
                return False
            
            # Step 2: Load the raw input data from Excel
            if not self.load_raw_data():
                logger.error("❌ Failed to load raw data")
                return False
            
            # Step 3: Process the data using the now-validated caches
            logger.info("\n🎓 Running automated faculty mapping...")
            self.process_professors()
            self.process_courses()
            
            try:
                self.map_courses_to_faculties_from_boss()
            except Exception as e:
                logger.warning(f"⚠️ Automated faculty mapping failed: {e}")
                logger.info("  Continuing with manual faculty assignment...")
            
            self.process_acad_terms()
            
            # Step 4: Save phase 1 outputs
            self._save_phase1_outputs()
            
            # Step 5: Print faculty mapping summary
            if hasattr(self, 'courses_needing_faculty') and self.courses_needing_faculty:
                logger.info(f"\n📋 Faculty Assignment Summary:")
                logger.info(f"  • Automated mappings applied to {self.stats.get('courses_created', 0) - len(self.courses_needing_faculty)} courses")
                logger.info(f"  • {len(self.courses_needing_faculty)} courses still need manual review")
                
                if len(self.courses_needing_faculty) <= 10:
                    logger.info(f"  Courses needing manual review:")
                    for course_info in self.courses_needing_faculty:
                        logger.info(f"    - {course_info['course_code']}: {course_info['course_name']}")
            
            logger.info("✅ Phase 1 completed - Review files in verify/ folder")
            return True
            
        except Exception as e:
            logger.error(f"❌ Phase 1 failed: {e}")
            import traceback
            traceback.print_exc()
            return False

    def run_phase2_remaining_tables(self):
        """Phase 2: Process classes and timings after professor correction with cache checking"""
        try:
            logger.info("🚀 Starting Phase 2: Classes and Timings with Cache Checking")
            logger.info("="*60)
            
            # Set phase 2 mode to prevent overwriting corrected professors
            self._phase2_mode = True
            
            # Ensure cache is fresh
            if not self.load_or_cache_data_with_freshness_check():
                logger.error("❌ Failed to load fresh database data")
                return False
            
            # Update professor lookup from corrected CSV
            if not self.update_professor_lookup_from_corrected_csv():
                logger.error("❌ Failed to update professor lookup")
                return False
            
            # Update professors with missing boss_names
            self.update_professors_with_boss_names()
            
            # Process remaining tables with cache checking
            if not self.process_remaining_tables():
                logger.error("❌ Failed to process remaining tables")
                return False
            
            # Save all outputs
            self.save_outputs()
            
            # Print summary
            self.print_summary()
            
            logger.info("✅ Phase 2 completed successfully!")
            return True
            
        except Exception as e:
            # Include the traceback, matching the detail Phase 1 reports on failure
            logger.error(f"❌ Phase 2 failed: {e}", exc_info=True)
            return False

    def _save_phase1_outputs(self):
        """Save Phase 1 outputs (professors, courses, acad_terms)"""
        # Save new professors (to verify folder for manual correction)
        # Always create the file, even if empty
        df = pd.DataFrame(self.new_professors) if self.new_professors else pd.DataFrame(columns=['id', 'name', 'boss_name', 'afterclass_name', 'original_scraped_name'])
        df.to_csv(os.path.join(self.verify_dir, 'new_professors.csv'), index=False)
        if self.new_professors:
            logger.info(f"✅ Saved {len(self.new_professors)} new professors for review")
        else:
            logger.info(f"✅ Created empty new_professors.csv (all professors already exist)")
        
        # Save new courses (to verify folder)
        if self.new_courses:
            df = pd.DataFrame(self.new_courses)
            df.to_csv(os.path.join(self.verify_dir, 'new_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_courses)} new courses")
        
        # Save course updates
        if self.update_courses:
            df = pd.DataFrame(self.update_courses)
            df.to_csv(os.path.join(self.output_base, 'update_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.update_courses)} course updates")
        
        # Save academic terms
        if self.new_acad_terms:
            df = pd.DataFrame(self.new_acad_terms)
            df.to_csv(os.path.join(self.output_base, 'new_acad_term.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_acad_terms)} academic terms")

    def setup_boss_processing(self):
        """Initialize BOSS results processing with logging and caches"""
        # Setup logging for BOSS processing
        self.boss_log_file = os.path.join(self.output_base, 'boss_result_log.txt')
        
        # Create the log file and write header
        try:
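            # Explicit UTF-8: later log messages contain emoji, which the
            # platform default encoding (e.g. cp1252 on Windows) cannot encode
            with open(self.boss_log_file, 'w', encoding='utf-8') as f: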
                f.write(f"BOSS Results Processing Log - {datetime.now().isoformat()}\n")
                f.write("="*70 + "\n\n")
            print(f"📝 Log file created: {self.boss_log_file}")
        except Exception as e:
            print(f"⚠️ Warning: Could not create log file {self.boss_log_file}: {e}")
            self.boss_log_file = None
        
        # Initialize existing classes cache
        self.existing_classes_cache = []
        
        # Data storage for BOSS results
        self.boss_data = []
        self.failed_mappings = []
        
        # Output collectors
        self.new_bid_windows = []
        self.new_class_availability = []
        self.new_bid_result = []
        self.update_bid_result = []
        
        # Caches for deduplication
        self.bid_window_cache = {}  # (acad_term_id, round, window) -> bid_window_id
        
        # PROPERLY INITIALIZE bid_window_id_counter from database cache
        self.bid_window_id_counter = 1  # Default fallback
        
        # Load existing bid_window data and find max ID
        try:
            cache_file = os.path.join(self.cache_dir, 'bid_window_cache.pkl')
            if os.path.exists(cache_file):
                bid_window_df = pd.read_pickle(cache_file)
                if not bid_window_df.empty:
                    # Convert the NumPy scalar to a plain int for the ID counter
                    max_id = int(bid_window_df['id'].max())
                    self.bid_window_id_counter = max_id + 1
                    
                    # Build deduplication cache
                    for _, row in bid_window_df.iterrows():
                        window_key = (row['acad_term_id'], row['round'], row['window'])
                        self.bid_window_cache[window_key] = row['id']
                    
                    self.log_boss_activity(f"✅ Loaded {len(bid_window_df)} existing bid windows, next ID will be {self.bid_window_id_counter}")
                else:
                    self.log_boss_activity("⚠️ Bid window cache exists but is empty, starting from ID 1")
            else:
                # Try to download from database if cache doesn't exist
                if hasattr(self, 'connection') and self.connection:
                    try:
                        query = "SELECT * FROM bid_window ORDER BY id"
                        bid_window_df = pd.read_sql_query(query, self.connection)
                        if not bid_window_df.empty:
                            # Save to cache for future use
                            bid_window_df.to_pickle(cache_file)
                            
                            # Convert the NumPy scalar to a plain int for the ID counter
                            max_id = int(bid_window_df['id'].max())
                            self.bid_window_id_counter = max_id + 1
                            
                            # Build deduplication cache
                            for _, row in bid_window_df.iterrows():
                                window_key = (row['acad_term_id'], row['round'], row['window'])
                                self.bid_window_cache[window_key] = row['id']
                            
                            self.log_boss_activity(f"✅ Downloaded {len(bid_window_df)} bid windows from database, next ID will be {self.bid_window_id_counter}")
                        else:
                            self.log_boss_activity("⚠️ Database bid_window table is empty, starting from ID 1")
                    except Exception as e:
                        self.log_boss_activity(f"⚠️ Could not download bid_window from database: {e}")
                else:
                    self.log_boss_activity("⚠️ No bid window cache found and no database connection, starting from ID 1")
        
        except Exception as e:
            self.log_boss_activity(f"⚠️ Error initializing bid_window counter: {e}")
            self.bid_window_id_counter = 1
        
        # Statistics
        self.boss_stats = {
            'files_processed': 0,
            'total_rows': 0,
            'bid_windows_created': 0,
            'class_availability_created': 0,
            'bid_results_created': 0,
            'failed_mappings': 0
        }
        
        print("🔄 BOSS results processing setup completed")

    def log_boss_activity(self, message, print_to_stdout=True):
        """Log activity to both file and optionally stdout"""
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        log_message = f"[{timestamp}] {message}\n"
        
        # Only write to file if boss_log_file exists (after setup_boss_processing is called)
        if hasattr(self, 'boss_log_file') and self.boss_log_file:
            try:
                # Explicit UTF-8 so emoji in messages do not raise UnicodeEncodeError
                with open(self.boss_log_file, 'a', encoding='utf-8') as f:
                    f.write(log_message)
            except Exception as e:
                print(f"⚠️ Warning: Could not write to log file: {e}")
        
        if print_to_stdout:
            print(f"📝 {message}")

    def parse_bidding_window(self, bidding_window_str):
        """Complete parser for bidding window string to extract round and window
        
        Examples:
        "Round 1 Window 1" -> ("1", 1)
        "Round 1A Window 2" -> ("1A", 2)
        "Round 2A Window 3" -> ("2A", 3)
        "Incoming Exchange Rnd 1C Win 1" -> ("1C", 1)
        "Incoming Freshmen Rnd 1 Win 4" -> ("1F", 4)
        """
        # Check for missing values first: truth-testing pd.NA raises TypeError
        if pd.isna(bidding_window_str) or not str(bidding_window_str).strip():
            return None, None
        
        # Clean the string
        bidding_window_str = str(bidding_window_str).strip()
        
        # Pattern 1: Standard format "Round X[A/B/C] Window Y"
        pattern1 = r'Round\s+(\w+)\s+Window\s+(\d+)'
        match1 = re.match(pattern1, bidding_window_str)
        if match1:
            round_str = match1.group(1)
            window_num = int(match1.group(2))
            return round_str, window_num
        
        # Pattern 2: Incoming Exchange format "Incoming Exchange Rnd X[A/B/C] Win Y"
        # Map to same round but keep distinction if needed
        pattern2 = r'Incoming\s+Exchange\s+Rnd\s+(\w+)\s+Win\s+(\d+)'
        match2 = re.match(pattern2, bidding_window_str)
        if match2:
            round_str = match2.group(1)  # Keep original round (1C)
            window_num = int(match2.group(2))
            return round_str, window_num
        
        # Pattern 3: Incoming Freshmen format "Incoming Freshmen Rnd X Win Y"
        # Map Round 1 -> Round 1F for distinction
        pattern3 = r'Incoming\s+Freshmen\s+Rnd\s+(\w+)\s+Win\s+(\d+)'
        match3 = re.match(pattern3, bidding_window_str)
        if match3:
            original_round = match3.group(1)
            window_num = int(match3.group(2))
            # Suffix the round with "F" to mark it as a freshmen round (e.g. "1" -> "1F")
            round_str = f"{original_round}F"
            return round_str, window_num
        
        return None, None

    def load_boss_results(self):
        """Load BOSS results from raw_data.xlsx standalone sheet"""
        self.log_boss_activity("🔍 Loading BOSS results from raw_data.xlsx...")
        
        # Use existing standalone_data that's already loaded
        if not hasattr(self, 'standalone_data') or self.standalone_data is None:
            self.log_boss_activity("❌ No standalone data loaded")
            return False
        
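        # Guard against a sheet that is missing the expected bidding columns
        required_columns = ['bidding_window', 'total']
        missing_columns = [col for col in required_columns if col not in self.standalone_data.columns]
        if missing_columns:
            self.log_boss_activity(f"❌ Missing columns in standalone data: {missing_columns}")
            return False
        
        # Filter rows that have bidding data using the correct column names
        bidding_data = self.standalone_data[
            self.standalone_data['bidding_window'].notna() & 
            self.standalone_data['total'].notna()
        ].copy()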
        
        if bidding_data.empty:
            self.log_boss_activity("❌ No bidding data found in raw_data.xlsx")
            return False
        
        self.boss_data = bidding_data
        self.boss_stats['total_rows'] = len(self.boss_data)
        self.boss_stats['files_processed'] = 1
        
        self.log_boss_activity(f"✅ Loaded {self.boss_stats['total_rows']} bidding records from raw_data.xlsx")
        return True

    def process_bid_windows(self):
        """Process and create bid_window entries from raw_data bidding_window column"""
        self.log_boss_activity("🪟 Processing bid windows from raw_data...")
        
        if self.boss_data is None or len(self.boss_data) == 0:
            self.log_boss_activity("❌ No BOSS data loaded")
            return False
        
        # Track all unique bid windows found in data
        found_windows = defaultdict(set)  # acad_term_id -> set of (round, window) tuples
        
        # Discover all windows that exist in the data
        for _, row in self.boss_data.iterrows():
            acad_term_id = row.get('acad_term_id')
            bidding_window_str = row.get('bidding_window')
            
            if pd.isna(acad_term_id) or pd.isna(bidding_window_str):
                continue
            
            round_str, window_num = self.parse_bidding_window(bidding_window_str)
            
            if acad_term_id and round_str and window_num:
                found_windows[acad_term_id].add((round_str, window_num))
        
        # Use the counter that was set from existing data
        bid_window_id = self.bid_window_id_counter
        round_order = {'1': 1, '1A': 2, '1B': 3, '1C': 4, '1F': 5, '2': 6, '2A': 7}
        
        for acad_term_id in sorted(found_windows.keys()):
            windows_for_term = found_windows[acad_term_id]
            sorted_windows = sorted(windows_for_term, key=lambda x: (round_order.get(x[0], 99), x[1]))
            
            self.log_boss_activity(f"📅 Processing {acad_term_id}: found {len(sorted_windows)} windows")
            
            for round_str, window_num in sorted_windows:
                window_key = (acad_term_id, round_str, window_num)
                
                # Skip if already exists in database
                if window_key in self.bid_window_cache:
                    self.log_boss_activity(f"⏭️ Bid window already exists: {acad_term_id} Round {round_str} Window {window_num}")
                    continue
                
                new_bid_window = {
                    'id': bid_window_id,
                    'acad_term_id': acad_term_id,
                    'round': round_str,
                    'window': window_num
                }
                
                self.new_bid_windows.append(new_bid_window)
                self.bid_window_cache[window_key] = bid_window_id
                self.boss_stats['bid_windows_created'] += 1
                
                self.log_boss_activity(f"✅ Created bid_window {bid_window_id}: {acad_term_id} Round {round_str} Window {window_num}")
                bid_window_id += 1
        
        self.bid_window_id_counter = bid_window_id
        self.log_boss_activity(f"✅ Created {self.boss_stats['bid_windows_created']} bid windows")
        return True

    def get_course_id(self, course_code):
        """Get course_id from course_code, checking multiple sources"""
        # Check courses cache (from database)
        if course_code in self.courses_cache:
            return self.courses_cache[course_code]['id']
        
        # Check in new_courses (newly created)
        for course in self.new_courses:
            if course['code'] == course_code:
                return course['id']
        
        # Check new_courses.csv file
        try:
            new_courses_path = os.path.join(self.output_base, 'new_courses.csv')
            verify_courses_path = os.path.join(self.verify_dir, 'new_courses.csv')
            
            for path in [verify_courses_path, new_courses_path]:
                if os.path.exists(path):
                    df = pd.read_csv(path)
                    matching_courses = df[df['code'] == course_code]
                    if not matching_courses.empty:
                        return matching_courses.iloc[0]['id']
        except Exception as e:
            self.log_boss_activity(f"⚠️ Error reading new_courses.csv: {e}", print_to_stdout=False)
        
        return None

    def load_existing_classes_cache(self):
        """Load existing classes from the pickle cache, falling back to a full database query."""
        self.existing_classes_cache = []
        
        try:
            cache_file = os.path.join(self.cache_dir, 'classes_cache.pkl')
            
            # Try loading from cache file first
            if os.path.exists(cache_file):
                try:
                    classes_df = pd.read_pickle(cache_file)
                    if not classes_df.empty:
                        self.existing_classes_cache = classes_df.to_dict('records')
                        logger.info(f"📚 Loaded {len(self.existing_classes_cache)} existing classes from cache")
                        return
                    else:
                        logger.info("⚠️ Cache file exists but is empty")
                except Exception as e:
                    logger.warning(f"⚠️ Error reading cache file: {e}")
            
            # If cache doesn't exist or is empty, try database with SELECT *
            if self.connection:
                try:
                    query = "SELECT * FROM classes"
                    classes_df = pd.read_sql_query(query, self.connection)
                    if not classes_df.empty:
                        # Save to cache for future use
                        classes_df.to_pickle(cache_file)
                        self.existing_classes_cache = classes_df.to_dict('records')
                        logger.info(f"📚 Downloaded and cached {len(self.existing_classes_cache)} existing classes")
                        return
                    else:
                        logger.warning("⚠️ Database classes table is empty")
                except Exception as e:
                    logger.warning(f"⚠️ Error downloading classes from database: {e}")
            
            # Final fallback
            logger.warning("⚠️ All class loading methods failed - using empty cache")
                    
        except Exception as e:
            self.existing_classes_cache = []
            logger.error(f"⚠️ Critical error in load_existing_classes_cache: {e}")

    def process_class_availability(self):
        """
        Process class availability data, preventing the creation of duplicate records
        by checking against existing cache data first.
        """
        self.log_boss_activity("📊 Processing class availability from raw_data...")
        
        # === STEP 1: Determine Current Bidding Window ===
        now = datetime.now()
        current_window_name = None
        
        # Get the bidding schedule for the current term
        bidding_schedule_for_term = BIDDING_SCHEDULES.get(START_AY_TERM, [])
        
        if bidding_schedule_for_term:
            # Find the current window (the first window whose results date is still in the future)
            for results_date, window_name, _folder_suffix in bidding_schedule_for_term:
                if now < results_date:
                    current_window_name = window_name
                    break
            
            # If no future window was found, we are past all scheduled windows
            if current_window_name is None:
                # Use the last window as current
                current_window_name = bidding_schedule_for_term[-1][1]
        
        logger.info(f"🎯 Processing class availability for current window: '{current_window_name}'")

        # === STEP 2: Filter the data to only current window and current term records ===
        if current_window_name and hasattr(self, 'standalone_data') and not self.standalone_data.empty:
            if 'bidding_window' in self.standalone_data.columns:
                original_count = len(self.standalone_data)
                
                # Filter by bidding window
                current_window_data = self.standalone_data[
                    self.standalone_data['bidding_window'] == current_window_name
                ].copy()
                
                # Also filter by current academic term to prevent cross-term contamination
                if 'acad_term_id' in current_window_data.columns:
                    # Extract expected term from START_AY_TERM (e.g., '2025-26_T1' -> 'AY202526T1')
                    expected_term_id = START_AY_TERM.replace('-', '').replace('_', '')
                    expected_term_id = f"AY{expected_term_id}"
                    
                    before_term_filter = len(current_window_data)
                    current_window_data = current_window_data[
                        current_window_data['acad_term_id'] == expected_term_id
                    ].copy()
                    
                    self.log_boss_activity(f"🔽 Filtered data: {original_count} → {before_term_filter} (window) → {len(current_window_data)} (window + term)")
                    self.log_boss_activity(f"    Window filter: '{current_window_name}', Term filter: '{expected_term_id}'")
                else:
                    self.log_boss_activity(f"🔽 Filtered data from {original_count} to {len(current_window_data)} records for current window: '{current_window_name}'")
            else:
                self.log_boss_activity("⚠️ No 'bidding_window' column found - processing all data")
                current_window_data = self.standalone_data.copy()
        else:
            self.log_boss_activity("⚠️ Could not determine current window or no standalone data - processing all data")
            current_window_data = self.standalone_data.copy() if hasattr(self, 'standalone_data') else pd.DataFrame()
        
        # Load existing class availability data to prevent duplicates
        existing_availability_keys = set()
        cache_file = os.path.join(self.cache_dir, 'class_availability_cache.pkl')
        if os.path.exists(cache_file):
            try:
                existing_df = pd.read_pickle(cache_file)
                if not existing_df.empty:
                    # Build the (class_id, bid_window_id) key set column-wise instead of row by row
                    existing_availability_keys.update(
                        zip(existing_df['class_id'], existing_df['bid_window_id'])
                    )
                    self.log_boss_activity(f"✅ Pre-loaded {len(existing_availability_keys)} existing class availability keys from cache.")
            except Exception as e:
                self.log_boss_activity(f"⚠️ Could not pre-load class_availability_cache: {e}")
        
        # Track keys already queued in this run to prevent duplicates within the same processing pass
        current_run_keys = {
            (availability_record['class_id'], availability_record['bid_window_id'])
            for availability_record in self.new_class_availability
        }

        newly_created_count = 0
        updated_count = 0
        
        # === STEP 3: Process only the filtered current window data ===
        for _, row in current_window_data.iterrows():
            course_code = row.get('course_code')
            section = row.get('section')
            acad_term_id = row.get('acad_term_id')
            bidding_window_str = row.get('bidding_window')
            
            if pd.isna(course_code) or pd.isna(section) or pd.isna(acad_term_id) or pd.isna(bidding_window_str):
                continue
            
            round_str, window_num = self.parse_bidding_window(bidding_window_str)
            if not all([round_str, window_num]):
                continue
            
            class_boss_id = row.get('class_boss_id')
            class_ids = self.find_all_class_ids(acad_term_id, class_boss_id)

            if not class_ids:
                failed_row = {
                    'course_code': course_code, 'section': section, 'acad_term_id': acad_term_id,
                    'bidding_window_str': bidding_window_str, 'reason': 'class_not_found'
                }
                self.failed_mappings.append(failed_row)
                self.boss_stats['failed_mappings'] += 1
                continue
            
            window_key = (acad_term_id, round_str, window_num)
            bid_window_id = self.bid_window_cache.get(window_key)
            if bid_window_id is None:
                self.log_boss_activity(f"⚠️ No bid_window_id for {window_key}")
                continue
            
            # Extract values safely
            total_val = int(row.get('total')) if pd.notna(row.get('total')) else 0
            current_enrolled_val = int(row.get('current_enrolled')) if pd.notna(row.get('current_enrolled')) else 0
            reserved_val = int(row.get('reserved')) if pd.notna(row.get('reserved')) else 0
            available_val = int(row.get('available')) if pd.notna(row.get('available')) else 0
            
            for class_id in class_ids:
                # Check for existence in both existing data and current run
                availability_key = (class_id, bid_window_id)
                
                # Skip keys that already exist in the cache or were queued earlier in this run.
                # Updating existing availability records is not implemented, so updated_count stays 0.
                if availability_key in existing_availability_keys or availability_key in current_run_keys:
                    continue

                # Create new record
                availability_record = {
                    'class_id': class_id,
                    'bid_window_id': bid_window_id,
                    'total': total_val,
                    'current_enrolled': current_enrolled_val,
                    'reserved': reserved_val,
                    'available': available_val
                }
                
                self.new_class_availability.append(availability_record)
                current_run_keys.add(availability_key)
                newly_created_count += 1
        
        self.boss_stats['class_availability_created'] = newly_created_count
        self.log_boss_activity(f"✅ Class availability checks complete. Created {newly_created_count} new records, Updated {updated_count} records.")
        return True

    def process_bid_results(self):
        """
        Process bid data from raw_data.xlsx. Queues an update record when an existing
        (bid_window_id, class_id) record differs in median, min, or vacancy fields,
        and creates a new record only when none exists yet.
        """
        logger.info("📈 Processing bid results from raw_data...")
        
        # Ensure the list for update records exists
        if not hasattr(self, 'update_bid_result'):
            self.update_bid_result = []

        # === STEP 1: Determine Current and Previous Bidding Windows ===
        now = datetime.now()
        current_window_name = None
        previous_window_name = None
        
        # Get the bidding schedule for the current term
        bidding_schedule_for_term = BIDDING_SCHEDULES.get(START_AY_TERM, [])
        
        if bidding_schedule_for_term:
            # Find the current window (first future window) and previous window
            for i, (results_date, window_name, folder_suffix) in enumerate(bidding_schedule_for_term):
                if now < results_date:
                    current_window_name = window_name
                    # Previous window is the one before current (if exists)
                    if i > 0:
                        previous_window_name = bidding_schedule_for_term[i-1][1]
                    break
            
            # If no future window found, we're past all scheduled windows
            if current_window_name is None:
                # Use the last window as current, and second-to-last as previous
                current_window_name = bidding_schedule_for_term[-1][1]
                if len(bidding_schedule_for_term) > 1:
                    previous_window_name = bidding_schedule_for_term[-2][1]
        
        logger.info(f"🎯 Determined bidding windows - Current: '{current_window_name}', Previous: '{previous_window_name}'")

        # === STEP 2: Filter the data to only current window and current term records ===
        # For new bid results, we only want data from the CURRENT window and CURRENT term
        if current_window_name and hasattr(self, 'standalone_data') and not self.standalone_data.empty:
            if 'bidding_window' in self.standalone_data.columns:
                original_count = len(self.standalone_data)
                
                # Filter by bidding window
                current_window_data = self.standalone_data[
                    self.standalone_data['bidding_window'] == current_window_name
                ].copy()
                
                # Also filter by current academic term to prevent cross-term contamination
                if 'acad_term_id' in current_window_data.columns:
                    # Extract expected term from START_AY_TERM (e.g., '2025-26_T1' -> 'AY202526T1')
                    expected_term_id = START_AY_TERM.replace('-', '').replace('_', '')
                    expected_term_id = f"AY{expected_term_id}"
                    
                    before_term_filter = len(current_window_data)
                    current_window_data = current_window_data[
                        current_window_data['acad_term_id'] == expected_term_id
                    ].copy()
                    
                    logger.info(f"🔽 Filtered standalone_data: {original_count} → {before_term_filter} (window) → {len(current_window_data)} (window + term)")
                    logger.info(f"    Window filter: '{current_window_name}', Term filter: '{expected_term_id}'")
                else:
                    logger.info(f"🔽 Filtered standalone_data from {original_count} to {len(current_window_data)} records for current window: '{current_window_name}'")
            else:
                logger.warning("⚠️ No 'bidding_window' column found - processing all data")
                current_window_data = self.standalone_data.copy()
        else:
            logger.warning("⚠️ Could not determine current window or no standalone data - processing all data")
            current_window_data = self.standalone_data.copy() if hasattr(self, 'standalone_data') else pd.DataFrame()

        # Load existing bid_result data to check for duplicates
        existing_bid_result_keys = set()
        existing_bid_results = {}  # Store full records for update comparison
        cache_file = os.path.join(self.cache_dir, 'bid_result_cache.pkl')
        if os.path.exists(cache_file):
            try:
                existing_df = pd.read_pickle(cache_file)
                if not existing_df.empty:
                    for _, record in existing_df.iterrows():
                        key = (record['bid_window_id'], record['class_id'])
                        existing_bid_result_keys.add(key)
                        existing_bid_results[key] = record.to_dict()
                    self.log_boss_activity(f"✅ Pre-loaded {len(existing_bid_result_keys)} existing bid result keys from cache.")
            except Exception as e:
                self.log_boss_activity(f"⚠️ Could not pre-load bid_result_cache: {e}")

        newly_created_count = 0
        updated_count = 0
        
        
        # === STEP 3: Process only the filtered current window data ===
        for idx, row in current_window_data.iterrows():
            try:
                course_code = row.get('course_code')
                section = row.get('section')
                acad_term_id = row.get('acad_term_id')
                class_boss_id = row.get('class_boss_id')
                bidding_window_str = row.get('bidding_window')
                
                if pd.isna(acad_term_id) or pd.isna(class_boss_id):
                    continue
                
                round_str, window_num = self.parse_bidding_window(bidding_window_str)
                if not all([round_str, window_num]):
                    continue
                
                class_ids = self.find_all_class_ids(acad_term_id, class_boss_id)
                if not class_ids:
                    continue
                
                window_key = (acad_term_id, round_str, window_num)
                bid_window_id = self.bid_window_cache.get(window_key)
                if bid_window_id is None:
                    continue

                # Check all known column-name variants for median and min
                median_bid = None
                min_bid = None
                
                # Try all possible column names for median
                median_column_names = ['median', 'Median', 'Median Bid', 'median_bid', 'Median_Bid', 'MEDIAN']
                for col_name in median_column_names:
                    if col_name in row.index:
                        val = row[col_name]
                        if pd.notna(val):
                            median_bid = val
                            if idx < 5:  # Log first few for debugging
                                self.log_boss_activity(f"🔍 DEBUG Row {idx}: Found median value {median_bid} in column '{col_name}'")
                            break
                
                # Try all possible column names for min
                min_column_names = ['min', 'Min', 'Min Bid', 'min_bid', 'Min_Bid', 'MIN']
                for col_name in min_column_names:
                    if col_name in row.index:
                        val = row[col_name]
                        if pd.notna(val):
                            min_bid = val
                            if idx < 5:  # Log first few for debugging
                                self.log_boss_activity(f"🔍 DEBUG Row {idx}: Found min value {min_bid} in column '{col_name}'")
                            break
                
                has_bid_data = pd.notna(median_bid) or pd.notna(min_bid)
                
                # DEBUG: Log if we have bid data
                if has_bid_data and idx < 10:
                    self.log_boss_activity(f"🔍 DEBUG: Row {idx} {course_code}-{section} has bid data: median={median_bid}, min={min_bid}")

                # Prepare data record; these helpers map missing values (NaN/None) to None
                def safe_int(val):
                    return int(val) if pd.notna(val) else None

                def safe_float(val):
                    return float(val) if pd.notna(val) else None
                
                total_val = safe_int(row.get('total'))
                enrolled_val = safe_int(row.get('current_enrolled'))
                
                for class_id in class_ids:
                    # Check if record exists
                    bid_result_key = (bid_window_id, class_id)
                    
                    result_data = {
                        'bid_window_id': bid_window_id, 
                        'class_id': class_id,
                        'vacancy': total_val,
                        'opening_vacancy': safe_int(row.get('opening_vacancy')),
                        'before_process_vacancy': total_val - enrolled_val if total_val is not None and enrolled_val is not None else None,
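                        # Use an explicit missing-value check so a legitimate 0 in 'd_i_c_e' is not discarded
                        'dice': safe_int(row.get('d_i_c_e') if pd.notna(row.get('d_i_c_e')) else row.get('dice')),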
                        'after_process_vacancy': safe_int(row.get('after_process_vacancy')),
                        'enrolled_students': enrolled_val,
                        'median': safe_float(median_bid),
                        'min': safe_float(min_bid)
                    }

                    if bid_result_key in existing_bid_result_keys:
                        # Check if update is needed
                        existing_record = existing_bid_results.get(bid_result_key, {})
                        needs_update = False
                        
                        # Check if median or min values have changed
                        if has_bid_data:
                            if (pd.notna(median_bid) and safe_float(median_bid) != existing_record.get('median')):
                                needs_update = True
                            if (pd.notna(min_bid) and safe_float(min_bid) != existing_record.get('min')):
                                needs_update = True
                        
                        # Also check other fields for updates
                        for field in ['vacancy', 'opening_vacancy', 'before_process_vacancy', 'dice', 
                                    'after_process_vacancy', 'enrolled_students']:
                            if result_data.get(field) is not None and result_data[field] != existing_record.get(field):
                                needs_update = True
                        
                        if needs_update:
                            self.update_bid_result.append(result_data)
                            updated_count += 1
                            if has_bid_data:
                                self.log_boss_activity(f"📊 Update bid_result: {course_code}-{section} with median={median_bid}, min={min_bid}")
                    else:
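                        # This is a NEW record
                        self.new_bid_result.append(result_data)
                        existing_bid_result_keys.add(bid_result_key)
                        # Remember the full record so a duplicate row later in this run is compared
                        # against it instead of against an empty dict
                        existing_bid_results[bid_result_key] = result_data
                        newly_created_count += 1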
            
            except Exception as e:
                logger.error(f"Error processing bid result row for {row.get('course_code')}-{row.get('section')}: {e}")

        self.boss_stats['bid_results_created'] += newly_created_count
        self.log_boss_activity(f"✅ Bid result checks complete. Created: {newly_created_count}, Updated: {updated_count}.")
        return True

    def find_all_class_ids(self, acad_term_id, class_boss_id):
        """Finds all class_ids for a given acad_term_id and class_boss_id.
        Returns ALL class records for multi-professor classes."""
        
        if pd.isna(acad_term_id) or pd.isna(class_boss_id):
            return []

        found_class_ids = []

        # Source 1: Check newly created classes in this run
        if hasattr(self, 'new_classes') and self.new_classes:
            for class_obj in self.new_classes:
                if (class_obj.get('acad_term_id') == acad_term_id and 
                    str(class_obj.get('boss_id')) == str(class_boss_id)):
                    found_class_ids.append(class_obj['id'])

        # Source 2: Check classes that existed before this run (from cache)
        if hasattr(self, 'existing_classes_cache') and self.existing_classes_cache:
            for class_obj in self.existing_classes_cache:
                if (class_obj.get('acad_term_id') == acad_term_id and 
                    str(class_obj.get('boss_id')) == str(class_boss_id)):
                    found_class_ids.append(class_obj['id'])
        
        # Source 3: Check new_classes.csv file if cache is incomplete
        try:
            new_classes_path = os.path.join(self.output_base, 'new_classes.csv')
            if os.path.exists(new_classes_path):
                df = pd.read_csv(new_classes_path)
                # Match on acad_term_id and boss_id (boss_id compared as a string)
                matching_classes = df[
                    (df['acad_term_id'] == acad_term_id) & 
                    (df['boss_id'].astype(str) == str(class_boss_id))
                ]
                for _, row in matching_classes.iterrows():
                    if row['id'] not in found_class_ids:
                        found_class_ids.append(row['id'])
        except Exception as e:
            self.log_boss_activity(f"⚠️ Error reading new_classes.csv: {e}", print_to_stdout=False)
        
        # Remove duplicates while preserving order (dicts keep insertion order)
        unique_class_ids = list(dict.fromkeys(found_class_ids))
        
        # Debug logging for multi-professor classes
        if len(unique_class_ids) > 1:
            self.log_boss_activity(f"📚 Found {len(unique_class_ids)} class records for boss_id {class_boss_id}: multi-professor class")
        
        return unique_class_ids

    def save_boss_outputs(self):
        """Save all BOSS-related output files"""
        self.log_boss_activity("💾 Saving BOSS output files...")
        
        # Save bid windows
        if self.new_bid_windows:
            df = pd.DataFrame(self.new_bid_windows)
            output_path = os.path.join(self.output_base, 'new_bid_window.csv')
            df.to_csv(output_path, index=False)
            self.log_boss_activity(f"✅ Saved {len(self.new_bid_windows)} bid windows to new_bid_window.csv")
        
        # Save class availability
        if self.new_class_availability:
            df = pd.DataFrame(self.new_class_availability)
            output_path = os.path.join(self.output_base, 'new_class_availability.csv')
            df.to_csv(output_path, index=False)
            self.log_boss_activity(f"✅ Saved {len(self.new_class_availability)} availability records to new_class_availability.csv")
        
        # Save bid results
        if self.new_bid_result:
            df = pd.DataFrame(self.new_bid_result)
            output_path = os.path.join(self.output_base, 'new_bid_result.csv')
            df.to_csv(output_path, index=False)
            self.log_boss_activity(f"✅ Saved {len(self.new_bid_result)} bid results to new_bid_result.csv")
        
        # Save failed mappings
        if self.failed_mappings:
            df = pd.DataFrame(self.failed_mappings)
            output_path = os.path.join(self.output_base, 'failed_boss_results_mapping.csv')
            df.to_csv(output_path, index=False)
            self.log_boss_activity(f"⚠️ Saved {len(self.failed_mappings)} failed mappings to failed_boss_results_mapping.csv")
        
        self.log_boss_activity("✅ All BOSS output files saved successfully")

    def print_boss_summary(self):
        """Print BOSS processing summary"""
        print("\n" + "="*70)
        print("📊 BOSS RESULTS PROCESSING SUMMARY")
        print("="*70)
        print(f"📂 Files processed: {self.boss_stats['files_processed']}")
        print(f"📄 Total rows: {self.boss_stats['total_rows']}")
        print(f"🪟 Bid windows created: {self.boss_stats['bid_windows_created']}")
        print(f"📊 Class availability records: {self.boss_stats['class_availability_created']}")
        print(f"📈 Bid result records: {self.boss_stats['bid_results_created']}")
        print(f"❌ Failed mappings: {self.boss_stats['failed_mappings']}")
        print("="*70)
        
        print("\n📁 OUTPUT FILES:")
        print(f"   - new_bid_window.csv ({self.boss_stats['bid_windows_created']} records)")
        print(f"   - new_class_availability.csv ({self.boss_stats['class_availability_created']} records)")
        print(f"   - new_bid_result.csv ({self.boss_stats['bid_results_created']} records)")
        if self.boss_stats['failed_mappings'] > 0:
            print(f"   - failed_boss_results_mapping.csv ({self.boss_stats['failed_mappings']} records)")
        print("   - boss_result_log.txt (processing log)")
        print("="*70)

    def run_phase3_boss_processing(self):
        """
        Orchestrates the entire bidding data processing workflow with a robust, linear order of operations.
        """
        try:
            print("🚀 Starting Enhanced Phase 3: Final, Robust Workflow")
            print("============================================================")

            logger.info("🛠️ Pre-Phase 3: Updating cache with newly created records from previous phases...")
            
            # Update professors cache
            new_profs_path = os.path.join(self.verify_dir, 'new_professors.csv')
            prof_cache_path = os.path.join(self.cache_dir, 'professors_cache.pkl')
            
            if os.path.exists(new_profs_path) and os.path.exists(prof_cache_path):
                new_profs_df = pd.read_csv(new_profs_path)
                if not new_profs_df.empty:
                    # Drop columns from new_profs_df that are not in the main professors table
                    # This prevents errors if new_professors.csv has extra columns
                    prof_cache_df = pd.read_pickle(prof_cache_path)
                    new_profs_df = new_profs_df[[col for col in new_profs_df.columns if col in prof_cache_df.columns]]
                    
                    combined_profs_df = pd.concat([prof_cache_df, new_profs_df], ignore_index=True).drop_duplicates(subset=['id'])
                    combined_profs_df.to_pickle(prof_cache_path)
                    logger.info(f"   ✅ Updated professors_cache.pkl with {len(new_profs_df)} new records.")

            # Update classes cache (similar logic)
            new_classes_path = os.path.join(self.output_base, 'new_classes.csv')
            class_cache_path = os.path.join(self.cache_dir, 'classes_cache.pkl')

            if os.path.exists(new_classes_path) and os.path.exists(class_cache_path):
                new_classes_df = pd.read_csv(new_classes_path)
                if not new_classes_df.empty:
                    class_cache_df = pd.read_pickle(class_cache_path)
                    combined_classes_df = pd.concat([class_cache_df, new_classes_df], ignore_index=True).drop_duplicates(subset=['id'])
                    combined_classes_df.to_pickle(class_cache_path)
                    logger.info(f"   ✅ Updated classes_cache.pkl with {len(new_classes_df)} new records.")

            self.setup_boss_processing()

            # --- Step 1: Load ALL required data from cache and input files ---
            self.log_boss_activity("🔄 Loading all required data caches with freshness check and combination...")
            # Use the method that combines new data from CSVs before validation
            if not self.load_or_cache_data_with_freshness_check():
                return False
            
            self.log_boss_activity("🔄 Loading all raw input files...")
            if not self.load_raw_data():
                return False
            self.overall_boss_results_df = self.load_overall_boss_results() # Load this once

            # --- Step 2: Process base entities to establish a stable state ---
            self.log_boss_activity("🔄 Processing base entities (Courses, Classes, Timings)...")
            self.process_acad_terms()
            self.process_professors()
            self.process_courses()
            self.process_classes()  # Handles create vs. update
            self.process_timings()  # Relies on stable, non-duplicate class_ids from process_classes

            # --- Step 3: Sequential catch-up processing for all windows ---
            self.log_boss_activity("🔄 Starting sequential catch-up processing...")

            # Determine current live window and processing range
            now = datetime.now()
            current_window_index = None
            bidding_schedule_for_term = BIDDING_SCHEDULES.get(START_AY_TERM, [])

            if not bidding_schedule_for_term:
                self.log_boss_activity("❌ No bidding schedule found for current term")
                return False

            # Find current live window (first future window)
            for i, (results_date, window_name, folder_suffix) in enumerate(bidding_schedule_for_term):
                if now < results_date:
                    current_window_index = i
                    break

            if current_window_index is None:
                # Past all windows, use last window as current
                current_window_index = len(bidding_schedule_for_term) - 1

            current_window_name = bidding_schedule_for_term[current_window_index][1]
            self.log_boss_activity(f"🎯 Current live window: {current_window_name}")

            # Processing range: from start to current window (inclusive)
            processing_range = bidding_schedule_for_term[:current_window_index + 1]
            self.log_boss_activity(f"📅 Processing {len(processing_range)} windows chronologically")

            # Data sources were loaded in Step 1; reload only if they are missing
            if getattr(self, 'standalone_data', None) is None and not self.load_raw_data():
                self.log_boss_activity("⚠️ Could not load raw_data.xlsx")
                return False

            if self.overall_boss_results_df is None:
                self.overall_boss_results_df = self.load_overall_boss_results()
            if self.overall_boss_results_df is None:
                self.log_boss_activity("⚠️ Could not load overallBossResults.xlsx")

            # Load existing bid result keys for deduplication
            existing_bid_result_keys = set()
            cache_file = os.path.join(self.cache_dir, 'bid_result_cache.pkl')
            if os.path.exists(cache_file):
                try:
                    existing_df = pd.read_pickle(cache_file)
                    if not existing_df.empty:
                        # Build (bid_window_id, class_id) keys column-wise instead of row by row
                        existing_bid_result_keys = set(
                            zip(existing_df['bid_window_id'], existing_df['class_id'])
                        )
                        self.log_boss_activity(f"✅ Pre-loaded {len(existing_bid_result_keys)} existing bid result keys")
                except Exception as e:
                    self.log_boss_activity(f"⚠️ Could not pre-load bid_result_cache: {e}")

            # Sequential processing loop
            for window_index, (results_date, window_name, folder_suffix) in enumerate(processing_range):
                self.log_boss_activity(f"🔄 Processing window {window_index + 1}/{len(processing_range)}: {window_name}")
                
                # Parse window name to get round and window number
                round_str, window_num = self.parse_bidding_window(window_name)
                if not round_str or not window_num:
                    self.log_boss_activity(f"⚠️ Could not parse window: {window_name}")
                    continue
                
                acad_term_id = ACAD_TERM_ID
                window_key = (acad_term_id, round_str, window_num)
                
                # A. Find or create BidWindow record
                bid_window_id = self.bid_window_cache.get(window_key)
                if bid_window_id is None:
                    bid_window_id = self.bid_window_id_counter
                    new_bid_window = {
                        'id': bid_window_id,
                        'acad_term_id': acad_term_id,
                        'round': round_str,
                        'window': window_num
                    }
                    self.new_bid_windows.append(new_bid_window)
                    self.bid_window_cache[window_key] = bid_window_id
                    self.bid_window_id_counter += 1
                    self.boss_stats['bid_windows_created'] += 1
                    self.log_boss_activity(f"✅ Created bid window {bid_window_id}: {window_name}")
                
                # B. Process ClassAvailability (check if scrape data exists in raw_data.xlsx)
                window_data_in_raw = self.standalone_data[
                    self.standalone_data['bidding_window'] == window_name
                ] if hasattr(self, 'standalone_data') and self.standalone_data is not None else pd.DataFrame()
                
                if not window_data_in_raw.empty:
                    self.log_boss_activity(f"📊 Processing ClassAvailability for {window_name} from raw_data.xlsx")
                    
                    # Load existing availability keys for deduplication
                    existing_availability_keys = set()
                    avail_cache_file = os.path.join(self.cache_dir, 'class_availability_cache.pkl')
                    if os.path.exists(avail_cache_file):
                        try:
                            avail_df = pd.read_pickle(avail_cache_file)
                            if not avail_df.empty:
                                existing_availability_keys = set(
                                    zip(avail_df['class_id'], avail_df['bid_window_id'])
                                )
                        except Exception as e:
                            # Surface the failure instead of silently ignoring it
                            self.log_boss_activity(f"⚠️ Could not pre-load class_availability_cache: {e}")
                    
                    for _, row in window_data_in_raw.iterrows():
                        course_code = row.get('course_code')
                        section = row.get('section')
                        class_boss_id = row.get('class_boss_id')
                        
                        if pd.isna(course_code) or pd.isna(section) or pd.isna(class_boss_id):
                            continue
                        
                        class_ids = self.find_all_class_ids(acad_term_id, class_boss_id)
                        if not class_ids:
                            continue
                        
                        # Extract availability values
                        total_val = int(row.get('total')) if pd.notna(row.get('total')) else 0
                        current_enrolled_val = int(row.get('current_enrolled')) if pd.notna(row.get('current_enrolled')) else 0
                        reserved_val = int(row.get('reserved')) if pd.notna(row.get('reserved')) else 0
                        available_val = int(row.get('available')) if pd.notna(row.get('available')) else 0
                        
                        for class_id in class_ids:
                            availability_key = (class_id, bid_window_id)
                            if availability_key not in existing_availability_keys:
                                availability_record = {
                                    'class_id': class_id,
                                    'bid_window_id': bid_window_id,
                                    'total': total_val,
                                    'current_enrolled': current_enrolled_val,
                                    'reserved': reserved_val,
                                    'available': available_val
                                }
                                self.new_class_availability.append(availability_record)
                                existing_availability_keys.add(availability_key)
                                self.boss_stats['class_availability_created'] += 1
                else:
                    self.log_boss_activity(f"⏭️ Skipping ClassAvailability for {window_name} (no scrape data)")
                
                # C. Process BidResult based on window type
                is_current_live = (window_index == current_window_index)
                
                if is_current_live:
                    # Current live window: create placeholder BidResult from raw_data.xlsx
                    self.log_boss_activity(f"📈 Processing placeholder BidResult for current live window: {window_name}")
                    
                    if not window_data_in_raw.empty:
                        for _, row in window_data_in_raw.iterrows():
                            class_boss_id = row.get('class_boss_id')
                            
                            if pd.isna(class_boss_id):
                                continue
                            
                            class_ids = self.find_all_class_ids(acad_term_id, class_boss_id)
                            if not class_ids:
                                continue
                            
                            def safe_int(val): return int(val) if pd.notna(val) else None
                            
                            total_val = safe_int(row.get('total'))
                            enrolled_val = safe_int(row.get('current_enrolled'))
                            
                            for class_id in class_ids:
                                bid_result_key = (bid_window_id, class_id)
                                if bid_result_key not in existing_bid_result_keys:
                                    # Create placeholder record (median/min as None)
                                    result_data = {
                                        'bid_window_id': bid_window_id,
                                        'class_id': class_id,
                                        'vacancy': total_val,
                                        'opening_vacancy': safe_int(row.get('opening_vacancy')),
                                        'before_process_vacancy': total_val - enrolled_val if total_val is not None and enrolled_val is not None else None,
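                                        # Use an explicit NaN check: `or` would discard a valid 0 and keep a NaN
                                        'dice': safe_int(row.get('d_i_c_e') if pd.notna(row.get('d_i_c_e')) else row.get('dice')),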
                                        'after_process_vacancy': safe_int(row.get('after_process_vacancy')),
                                        'enrolled_students': enrolled_val,
                                        'median': None,  # Placeholder
                                        'min': None      # Placeholder
                                    }
                                    self.new_bid_result.append(result_data)
                                    existing_bid_result_keys.add(bid_result_key)
                                    self.boss_stats['bid_results_created'] += 1
                else:
                    # Historical window: process from overallBossResults.xlsx
                    self.log_boss_activity(f"📈 Processing historical BidResult for {window_name} from overallBossResults.xlsx")
                    
                    if self.overall_boss_results_df is not None and not self.overall_boss_results_df.empty:
                        # Filter overall results for this specific window
                        overall_df = self.overall_boss_results_df.copy()
                        
                        # Parse bidding window column
                        bidding_window_col = None
                        for col in overall_df.columns:
                            if 'bidding window' in col.lower() or 'bidding_window' in col.lower():
                                bidding_window_col = col
                                break
                        
                        if bidding_window_col:
                            # Parse and filter for current window
                            parsed_windows = overall_df[bidding_window_col].apply(self.parse_bidding_window)
                            overall_df['round'] = parsed_windows.apply(lambda x: x[0] if isinstance(x, tuple) else None)
                            overall_df['window'] = parsed_windows.apply(lambda x: x[1] if isinstance(x, tuple) else None)
                            
                            overall_df.dropna(subset=['round', 'window'], inplace=True)
                            overall_df['round'] = overall_df['round'].astype(str)
                            overall_df['window'] = pd.to_numeric(overall_df['window']).astype(int)
                            
                            window_filtered_df = overall_df[
                                (overall_df['round'] == str(round_str)) &
                                (overall_df['window'] == int(window_num))
                            ]
                            
                            if not window_filtered_df.empty:
                                for _, row in window_filtered_df.iterrows():
                                    # Extract course and section info
                                    course_code = self._get_column_value(row, ['Course Code', 'course_code', 'Course_Code'])
                                    section = self._get_column_value(row, ['Section', 'section'])
                                    
                                    if pd.isna(course_code) or pd.isna(section):
                                        continue
                                    
                                    # Find class_boss_id from raw data
                                    class_boss_id = self._find_class_boss_id_from_course_section(course_code, section, acad_term_id)
                                    if not class_boss_id:
                                        continue
                                    
                                    class_ids = self.find_all_class_ids(acad_term_id, class_boss_id)
                                    if not class_ids:
                                        continue
                                    
                                    # Extract bid data
                                    median_bid = None
                                    min_bid = None
                                    for col in row.index:
                                        col_lower = str(col).lower()
                                        if 'median' in col_lower and 'bid' in col_lower:
                                            median_bid = row[col]
                                        elif 'min' in col_lower and 'bid' in col_lower:
                                            min_bid = row[col]
                                    
                                    def safe_int(val): return int(val) if pd.notna(val) else None
                                    def safe_float(val): return float(val) if pd.notna(val) else None
                                    
                                    for class_id in class_ids:
                                        result_data = {
                                            'bid_window_id': bid_window_id,
                                            'class_id': class_id,
                                            'vacancy': safe_int(self._get_column_value(row, ['Vacancy', 'vacancy'])),
                                            'opening_vacancy': safe_int(self._get_column_value(row, ['Opening Vacancy', 'opening_vacancy', 'Opening_Vacancy'])),
                                            'before_process_vacancy': safe_int(self._get_column_value(row, ['Before Process Vacancy', 'before_process_vacancy', 'Before_Process_Vacancy'])),
                                            'dice': safe_int(self._get_column_value(row, ['D.I.C.E', 'dice', 'd_i_c_e', 'DICE'])),
                                            'after_process_vacancy': safe_int(self._get_column_value(row, ['After Process Vacancy', 'after_process_vacancy', 'After_Process_Vacancy'])),
                                            'enrolled_students': safe_int(self._get_column_value(row, ['Enrolled Students', 'enrolled_students', 'Enrolled_Students'])),
                                            'median': safe_float(median_bid),
                                            'min': safe_float(min_bid)
                                        }
                                        
                                        # Check if record exists (UPDATE vs CREATE)
                                        bid_result_key = (bid_window_id, class_id)
                                        if bid_result_key in existing_bid_result_keys:
                                            self.update_bid_result.append(result_data)
                                        else:
                                            self.new_bid_result.append(result_data)
                                            existing_bid_result_keys.add(bid_result_key)
                                            self.boss_stats['bid_results_created'] += 1
                            else:
                                self.log_boss_activity(f"⚠️ No data found in overallBossResults for {window_name}")
                        else:
                            self.log_boss_activity("⚠️ No bidding window column found in overallBossResults.xlsx")

            self.log_boss_activity("✅ Sequential catch-up processing completed")

            # --- Step 4: Save all generated files ---
            self.save_outputs()
            self.save_boss_outputs()

            # --- Step 5: Final Summary ---
            self.print_boss_summary()
            
            self.log_boss_activity("📝 ✅ Enhanced Phase 3 completed successfully!")
            return True
            
        except Exception as e:
            print(f"❌ Enhanced Phase 3 failed catastrophically: {e}")
            traceback.print_exc()
            return False

    def load_faculties_cache(self):
        """Load faculties for mapping: cache file first, then database, then a hard-coded fallback"""

        def build_lookups(faculties_df):
            # Populate the id -> record and ACRONYM -> id lookups from a faculties DataFrame
            self.faculties_cache = {}
            self.faculty_acronym_to_id = {}
            for _, row in faculties_df.iterrows():
                faculty_id = row['id']
                acronym = row['acronym'].upper()
                self.faculties_cache[faculty_id] = row.to_dict()
                self.faculty_acronym_to_id[acronym] = faculty_id

        try:
            cache_file = os.path.join(self.cache_dir, 'faculties_cache.pkl')

            # Try loading from cache file first
            if os.path.exists(cache_file):
                try:
                    faculties_df = pd.read_pickle(cache_file)
                    if not faculties_df.empty:
                        build_lookups(faculties_df)
                        logger.info(f"📚 Loaded {len(self.faculties_cache)} faculties from cache")
                        return True
                    else:
                        logger.warning("⚠️ Faculty cache file exists but is empty")
                except Exception as e:
                    logger.warning(f"⚠️ Error reading faculty cache file: {e}")

            # If cache doesn't exist or failed, try database
            if self.connection:
                try:
                    query = "SELECT * FROM faculties"
                    faculties_df = pd.read_sql_query(query, self.connection)
                    if not faculties_df.empty:
                        # Save to cache for future use, then load into memory
                        faculties_df.to_pickle(cache_file)
                        build_lookups(faculties_df)
                        logger.info(f"📚 Downloaded and cached {len(self.faculties_cache)} faculties from database")
                        return True
                    else:
                        logger.warning("⚠️ Database faculties table is empty")
                except Exception as e:
                    logger.warning(f"⚠️ Error downloading faculties from database: {e}")
            
            # Fallback: create basic mapping from known data
            logger.warning("⚠️ Using fallback faculty mapping")
            self.faculties_cache = {}
            self.faculty_acronym_to_id = {
                'LKCSB': 1,   # Lee Kong Chian School of Business
                'YPHSL': 2,   # Yong Pung How School of Law
                'SOE': 3,     # School of Economics
                'SCIS': 4,    # School of Computing and Information Systems
                'SOSS': 5,    # School of Social Sciences
                'SOA': 6,     # School of Accountancy
                'CIS': 7,     # College of Integrative Studies
                'CEC': 8,     # Centre for English Communication
                'C4SR': 9,    # Centre for Social Responsibility
                'OCS': 10,    # Dato’ Kho Hui Meng Career Centre
            }
            return True
            
        except Exception as e:
            logger.error(f"❌ Critical error in load_faculties_cache: {e}")
            return False

    def map_courses_to_faculties_from_boss(self):
        """Map courses to faculties using course code prefix patterns from existing courses"""
        logger.info("🎓 Starting automated faculty mapping from course code patterns...")
        
        # Load faculties cache first
        if not self.load_faculties_cache():
            logger.error("❌ Failed to load faculties cache")
            return False
        
        # Build prefix-to-faculty mapping from existing courses
        prefix_faculty_mapping = defaultdict(set)  # prefix -> set of faculty_ids
        
        logger.info("📋 Analyzing existing course patterns...")
        
        # Analyze existing courses in cache to build prefix patterns
        for course_code, course_data in self.courses_cache.items():
            faculty_id = course_data.get('belong_to_faculty')
            if faculty_id:
                # Extract prefix (characters before numbers)
                # Handle patterns like "COR-COMM567A" -> "COR-COMM"
                prefix_match = re.match(r'^([A-Z-]+)', course_code.upper())
                if prefix_match:
                    prefix = prefix_match.group(1)
                    prefix_faculty_mapping[prefix].add(faculty_id)
        
        logger.info(f"📊 Found {len(prefix_faculty_mapping)} unique course prefixes in existing courses")
        
        # Log the patterns found
        for prefix, faculty_ids in prefix_faculty_mapping.items():
            faculty_names = []
            for fid in faculty_ids:
                if fid in self.faculties_cache:
                    faculty_names.append(self.faculties_cache[fid]['acronym'])
            logger.info(f"   {prefix}: {len(faculty_ids)} faculties ({', '.join(faculty_names)})")
        
        # Apply automatic mapping to new courses
        mapped_count = 0
        course_faculty_mappings = {}
        
        for course_info in self.courses_needing_faculty[:]:  # Copy list to modify during iteration
            course_code = course_info['course_code']
            
            # Extract prefix from new course code
            prefix_match = re.match(r'^([A-Z-]+)', course_code.upper())
            if not prefix_match:
                continue
            
            prefix = prefix_match.group(1)
            
            # Check if this prefix has exactly 1 unique faculty in existing courses
            if prefix in prefix_faculty_mapping:
                faculty_ids = prefix_faculty_mapping[prefix]
                
                if len(faculty_ids) == 1:
                    # Only 1 faculty found - auto-assign
                    faculty_id = list(faculty_ids)[0]
                    course_faculty_mappings[course_code] = faculty_id
                    mapped_count += 1
                    
                    faculty_name = self.faculties_cache[faculty_id]['acronym'] if faculty_id in self.faculties_cache else str(faculty_id)
                    logger.info(f"✅ Auto-mapped {course_code}: prefix '{prefix}' → {faculty_name}")
                else:
                    # Multiple faculties found - leave for manual assignment
                    faculty_names = [self.faculties_cache[fid]['acronym'] for fid in faculty_ids if fid in self.faculties_cache]
                    logger.info(f"⚠️ {course_code}: prefix '{prefix}' maps to {len(faculty_ids)} faculties ({', '.join(faculty_names)}) - manual review needed")
            else:
                # No existing pattern found
                logger.info(f"🆕 {course_code}: new prefix '{prefix}' - manual review needed")
        
        # Apply mappings to courses
        if course_faculty_mappings:
            self._apply_faculty_mappings_to_courses(course_faculty_mappings)
        
        logger.info("✅ Pattern-based faculty mapping completed:")
        logger.info(f"   • {mapped_count} courses auto-mapped based on prefix patterns")
        logger.info(f"   • {len(self.courses_needing_faculty)} courses still need manual review")
        
        return True

    def _apply_faculty_mappings_to_courses(self, course_faculty_mappings):
        """Apply faculty mappings to new courses and update courses needing faculty"""
        logger.info(f"🔄 Applying faculty mappings to {len(course_faculty_mappings)} courses...")
        
        mapped_count = 0
        
        # Update new_courses
        for course in self.new_courses:
            course_code = course['code']
            if course_code in course_faculty_mappings:
                course['belong_to_faculty'] = course_faculty_mappings[course_code]
                mapped_count += 1
        
        # Update courses_cache
        for course_code, faculty_id in course_faculty_mappings.items():
            if course_code in self.courses_cache:
                self.courses_cache[course_code]['belong_to_faculty'] = faculty_id
        
        # Remove mapped courses from courses_needing_faculty
        original_needing_count = len(self.courses_needing_faculty)
        self.courses_needing_faculty = [
            course_info for course_info in self.courses_needing_faculty
            if course_info['course_code'] not in course_faculty_mappings
        ]
        
        removed_count = original_needing_count - len(self.courses_needing_faculty)
        
        logger.info("✅ Applied faculty mappings:")
        logger.info(f"   • {mapped_count} courses updated with faculty")
        logger.info(f"   • {removed_count} courses removed from manual review queue")
        logger.info(f"   • {len(self.courses_needing_faculty)} courses still need manual review")

    def extract_acad_term_from_path(self, file_path: str) -> Optional[str]:
        r"""Extract acad_term_id from file path as fallback
        Examples:
        'script_input\classTimingsFull\2021-22_T1' -> 'AY202122T1'
        'script_input\classTimingsFull\2022-23_T3A' -> 'AY202223T3A'
        """
        # Extract the term folder name
        path_parts = file_path.replace('/', '\\').split('\\')
        
        for part in path_parts:
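            # Look for a pattern like "2021-22_T1"; the term stops at the next underscore
            # so a folder name with a trailing suffix still yields the bare term
            match = re.match(r'(\d{4})-(\d{2})_T([0-9A-Za-z]+)', part)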
            if match:
                year_start = match.group(1)
                year_end = match.group(2)
                term = match.group(3)
                return f"AY{year_start}{year_end}T{term}"
        
        return None

    def get_last_filepath_by_course(self, course_code):
        """Direct filepath lookup for course code - bypasses record_key linking"""
        print(f"🔍 DEBUG: Looking for course {course_code} using direct method")
        
        # Check if we have standalone data with filepath column
        if hasattr(self, 'standalone_data') and self.standalone_data is not None:
            if 'filepath' in self.standalone_data.columns and 'course_code' in self.standalone_data.columns:
                print("✅ DEBUG: Found filepath column in standalone_data")

                course_records = self.standalone_data[
                    self.standalone_data['course_code'].str.upper() == course_code.upper()
                ]

                print(f"📊 DEBUG: Found {len(course_records)} records for {course_code}")

                if not course_records.empty:
                    # Get the most recent record (last row)
                    last_record = course_records.iloc[-1]
                    filepath = last_record.get('filepath')

                    print(f"📁 DEBUG: Last record filepath: {filepath}")

                    if pd.notna(filepath):
                        print(f"✅ Found filepath for {course_code}: {filepath}")
                        return filepath
                    else:
                        print(f"❌ DEBUG: Filepath is NaN for {course_code}")
            else:
                print("❌ DEBUG: standalone_data is missing the 'filepath' or 'course_code' column")
                print(f"Available columns: {list(self.standalone_data.columns)}")
        
        # Fallback: check multiple_data if standalone doesn't have filepath
        if hasattr(self, 'multiple_data') and self.multiple_data is not None:
            if 'filepath' in self.multiple_data.columns and 'course_code' in self.multiple_data.columns:
                print("✅ DEBUG: Checking multiple_data as fallback")
                
                course_records = self.multiple_data[
                    self.multiple_data['course_code'].str.upper() == course_code.upper()
                ].copy()
                
                if not course_records.empty:
                    last_record = course_records.iloc[-1]
                    filepath = last_record.get('filepath')
                    
                    if pd.notna(filepath):
                        print(f"✅ Found filepath in multiple_data for {course_code}: {filepath}")
                        return filepath
        
        print(f"❌ DEBUG: No filepath found for {course_code}")
        return None

    def close_connection(self):
        """Explicitly close database connection"""
        if self.connection:
            self.connection.close()
            self.connection = None
            logger.info("🔒 Database connection closed")

    def check_cache_freshness(self) -> bool:
        """
        Implements granular, window-aware cache freshness logic.
        The cache is STALE if its modification time is before the results date 
        of the previous bidding window.
        """
        logger.info("🔍 Checking cache freshness with window-aware logic...")
        now = datetime.now()

        # a. Identify the current active bidding window from the schedule.
        # The current round is the first one whose results_date is in the future.
        current_window_index = -1
        if not self.bidding_schedule:
            logger.info("✅ No bidding schedule found. Assuming cache is fresh.")
            return True

        for i, (results_date, _, _) in enumerate(self.bidding_schedule):
            if now < results_date:
                current_window_index = i
                break
                
        # b. If no future window is found, or we are before/in the first window,
        # there is no "previous window" to check against. The cache is fresh.
        if current_window_index <= 0:
            logger.info("✅ Not in an active bidding period or before the second round. Cache is considered fresh.")
            return True

        # c. Get the results_date of the previous bidding window.
        previous_window_info = self.bidding_schedule[current_window_index - 1]
        previous_round_results_date = previous_window_info[0]
        
        logger.info(f"ℹ️ Rule: Cache must be newer than the previous window's results date: {previous_round_results_date.strftime('%Y-%m-%d %H:%M')}")

        # d. Check if the cache is fresh by comparing its modification time.
        cache_files_to_check = [
            os.path.join(self.cache_dir, 'professors_cache.pkl'),
            os.path.join(self.cache_dir, 'courses_cache.pkl'),
            os.path.join(self.cache_dir, 'classes_cache.pkl'),
        ]
        oldest_cache_time = None

        for path in cache_files_to_check:
            if not os.path.exists(path):
                logger.warning(f"⚠️ Critical cache file not found: {path}. A full download is required.")
                return False
                
            mod_time_dt = datetime.fromtimestamp(os.path.getmtime(path))
            if oldest_cache_time is None or mod_time_dt < oldest_cache_time:
                oldest_cache_time = mod_time_dt

        if oldest_cache_time is None:
            # This case should be prevented by the check above, but is included for safety.
            logger.warning("⚠️ No cache files exist. A full download is required.")
            return False

        logger.info(f"ℹ️ Oldest relevant cache file was last modified on: {oldest_cache_time.strftime('%Y-%m-%d %H:%M')}")

        # The cache is fresh if its oldest file was last modified on or after the previous window's results date.
        if oldest_cache_time >= previous_round_results_date:
            logger.info("✅ Cache is FRESH.")
            return True
        else:
            logger.warning("❌ Cache is STALE. A full download is required.")
            return False

    def _get_bidding_round_info_for_term(self, ay_term, now):
        """Get current bidding round info for a term"""
        if START_AY_TERM in ay_term:
            for results_date, _, folder_suffix in self.bidding_schedule:
                if now < results_date:
                    return f"{ay_term}_{folder_suffix}"
        return None

    def load_or_cache_data_with_freshness_check(self):
        """
        Load data with a freshness check, combine it with new CSV files,
        and handle caching using robust, type-safe logic.
        """
        # Part 1: Check cache freshness and download new data if necessary.
        if not self.check_cache_freshness():
            logger.info("🔄 Cache is stale, downloading fresh data from the database...")
            if not self.connect_database():
                return False
            
            try:
                self._download_and_cache_data()
                logger.info("✅ Successfully downloaded fresh data from the database.")
            except Exception as e:
                logger.error(f"❌ Failed to download fresh data: {e}")
                return False
        
        # Part 2: Load the core data from the cache (either freshly downloaded or existing).
        if not self.load_or_cache_data():
            return False
        
        # Part 3: Combine the loaded cache with new data from local CSV files.
        # This logic is integrated from the improved _combine_with_new_files method.
        logger.info("🔄 Combining database cache with new CSV files...")
        
        # Combine new_classes.csv with existing_classes_cache
        new_classes_path = os.path.join(self.output_base, 'new_classes.csv')
        if os.path.exists(new_classes_path):
            try:
                new_classes_df = pd.read_csv(new_classes_path)
                if not new_classes_df.empty:
                    new_classes_list = new_classes_df.to_dict('records')
                    
                    # Safely initialize the cache if it doesn't exist to prevent errors.
                    if not hasattr(self, 'existing_classes_cache'):
                        self.existing_classes_cache = []
                    
                    # Add new classes, checking for duplicates with precise, multi-field logic.
                    added_count = 0
                    for new_class in new_classes_list:
                        exists = False
                        for existing_class in self.existing_classes_cache:
                            # Precise check including course, section, term, and professor.
                            if (existing_class['course_id'] == new_class['course_id'] and
                                str(existing_class['section']) == str(new_class['section']) and
                                existing_class['acad_term_id'] == new_class['acad_term_id'] and
                                existing_class.get('professor_id') == new_class.get('professor_id')):
                                exists = True
                                break
                        
                        if not exists:
                            self.existing_classes_cache.append(new_class)
                            added_count += 1
                    
                    if added_count > 0:
                        logger.info(f"✅ Added {added_count} new, unique classes to the existing cache.")
            except Exception as e:
                logger.warning(f"⚠️ Could not combine new_classes.csv: {e}")
        
        # Update bid_window_cache with new_bid_window.csv
        new_bid_window_path = os.path.join(self.output_base, 'new_bid_window.csv')
        if os.path.exists(new_bid_window_path):
            try:
                new_bid_window_df = pd.read_csv(new_bid_window_path)
                if not new_bid_window_df.empty:
                    added_count = 0
                    for _, row in new_bid_window_df.iterrows():
                        # Use explicit type casting for robust key creation.
                        window_key = (row['acad_term_id'], str(row['round']), int(row['window']))
                        if window_key not in self.bid_window_cache:
                            self.bid_window_cache[window_key] = row['id']
                            added_count += 1
                    if added_count > 0:
                        logger.info(f"✅ Added {added_count} new bid windows to the cache.")
            except Exception as e:
                logger.warning(f"⚠️ Could not combine new_bid_window.csv: {e}")
        
        # Update courses_cache with new_courses.csv from multiple locations
        new_courses_paths = [
            os.path.join(self.output_base, 'new_courses.csv'),
            os.path.join(self.verify_dir, 'new_courses.csv')
        ]
        
        for path in new_courses_paths:
            if os.path.exists(path):
                try:
                    new_courses_df = pd.read_csv(path)
                    if not new_courses_df.empty:
                        added_count = 0
                        for _, row in new_courses_df.iterrows():
                            if row['code'] not in self.courses_cache:
                                self.courses_cache[row['code']] = row.to_dict()
                                added_count += 1
                        if added_count > 0:
                            logger.info(f"✅ Added {added_count} new courses from {path}.")
                except Exception as e:
                    logger.warning(f"⚠️ Could not combine {path}: {e}")
        
        return True

    def check_record_exists_in_cache(self, table_name, record_data, key_fields):
        """Check if record exists in database cache and if it needs updates"""
        try:
            cache_file = os.path.join(self.cache_dir, f'{table_name}_cache.pkl')
            if not os.path.exists(cache_file):
                return False, None
            
            df = pd.read_pickle(cache_file)
            if df.empty:
                return False, None
            
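            # Build a boolean mask over all rows, narrowing it per key field.
            mask = pd.Series(True, index=df.index)
            applied_fields = 0
            for field in key_fields:
                if field in df.columns and field in record_data:
                    mask &= (df[field] == record_data[field])
                    applied_fields += 1
            
            # Without any usable key field there is nothing to match on.
            if applied_fields == 0:
                return False, None
            
            matching_records = df[mask]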
            if matching_records.empty:
                return False, None
            
            # Return first match for update comparison
            return True, matching_records.iloc[0].to_dict()
            
        except Exception as e:
            logger.error(f"Error checking cache for {table_name}: {e}")
            return False, None

    def needs_update(self, existing_record, new_record_or_row, field_mapping_or_fields):
        """
        Check if existing record needs updates based on field mapping or field list.
        Handles both dictionary-to-dictionary and row-to-record comparisons.
        """
        # Handle different parameter types
        if isinstance(field_mapping_or_fields, dict):
            # field_mapping case: {db_field: raw_field}
            field_mapping = field_mapping_or_fields
            for db_field, raw_field in field_mapping.items():
                old_value = existing_record.get(db_field)
                new_value = new_record_or_row.get(raw_field)
                
                # Type-specific comparison
                if db_field == 'credit_units':
                    new_value = float(new_value) if pd.notna(new_value) else None
                    old_value = float(old_value) if pd.notna(old_value) else None
                else:
                    if pd.isna(new_value):
                        new_value = None
                    else:
                        new_value = str(new_value).strip()
                    
                    if pd.isna(old_value):
                        old_value = None
                    else:
                        old_value = str(old_value).strip()
                
                # Check for actual change
                if new_value != old_value:
                    # Don't overwrite existing data with empty data
                    if new_value is None or new_value == '':
                        if old_value is not None and old_value != '':
                            continue
                    return True
            return False
        else:
            # field list case: direct field comparison
            compare_fields = field_mapping_or_fields
            for field in compare_fields:
                existing_value = existing_record.get(field)
                new_value = new_record_or_row.get(field)
                
                # Handle different data types
                if pd.isna(existing_value) and pd.isna(new_value):
                    continue
                if pd.isna(existing_value) or pd.isna(new_value):
                    return True
                
                # Compare as stripped strings (covers course_outline_url and other text fields)
                if str(existing_value).strip() != str(new_value).strip():
                    return True
            
            return False

    def load_overall_boss_results(self):
        """Load overall BOSS results from script_input/overallBossResults/"""
        logger.info("📊 Loading overall BOSS results...")
        
        now = datetime.now()
        current_round_info = self._get_bidding_round_info_for_term(START_AY_TERM, now)
        
        if not current_round_info:
            logger.info("⏭️ Not in active bidding period - skipping overall results")
            return None
        
        # Determine which overall results file to load based on current round
        overall_results_dir = 'script_input/overallBossResults'
        if not os.path.exists(overall_results_dir):
            logger.warning(f"⚠️ Overall results directory not found: {overall_results_dir}")
            return None
        
        # Look for the appropriate Excel file
        results_file = os.path.join(overall_results_dir, f"{START_AY_TERM}.xlsx")
        if not os.path.exists(results_file):
            logger.warning(f"⚠️ Overall results file not found: {results_file}")
            return None
        
        try:
            # Load the Excel file
            df = pd.read_excel(results_file)
            logger.info(f"✅ Loaded {len(df)} overall BOSS results from {results_file}")
            return df
            
        except Exception as e:
            logger.error(f"❌ Error loading overall results: {e}")
            return None

    def determine_previous_round_for_overall_results(self, current_round_info):
        """Determine which round's results to create based on current round"""
        if not current_round_info:
            return None, None
        
        # Extract current round suffix
        current_suffix = current_round_info.split('_')[-1]
        
        # Mapping of current round to previous round results
        round_mapping = {
            'R1AW1': ('1', 1),     # When in R1A W1, create R1 W1 results
            'R1AW2': ('1A', 1),    # When in R1A W2, create R1A W1 results
            'R1AW3': ('1A', 2),    # When in R1A W3, create R1A W2 results
            'R1BW1': ('1A', 3),    # When in R1B W1, create R1A W3 results
            'R1BW2': ('1B', 1),    # When in R1B W2, create R1B W1 results
            'R1CW1': ('1B', 2),    # When in R1C W1, create R1B W2 results
            'R1CW2': ('1C', 1),    # When in R1C W2, create R1C W1 results
            'R1CW3': ('1C', 2),    # When in R1C W3, create R1C W2 results
            'R1FW1': ('1C', 3),    # When in R1F W1, create R1C W3 results
            'R1FW2': ('1F', 1),    # When in R1F W2, create R1F W1 results
            'R1FW3': ('1F', 2),    # When in R1F W3, create R1F W2 results
            'R1FW4': ('1F', 3),    # When in R1F W4, create R1F W3 results
            'R2W1': ('1F', 4),     # When in R2 W1, create R1F W4 results
            'R2W2': ('2', 1),      # When in R2 W2, create R2 W1 results
            'R2W3': ('2', 2),      # When in R2 W3, create R2 W2 results
            'R2AW1': ('2', 3),     # When in R2A W1, create R2 W3 results
            'R2AW2': ('2A', 1),    # When in R2A W2, create R2A W1 results
            'R2AW3': ('2A', 2),    # When in R2A W3, create R2A W2 results
        }
        
        return round_mapping.get(current_suffix, (None, None))

    def process_overall_boss_results(self):
        """
        Process overall BOSS results from the dedicated Excel file.
        - It parses the 'Bidding Window' column to extract round and window numbers.
        - For existing bid_result records, it creates an UPDATE record with the latest median/min values.
        - If a record doesn't exist, it creates a NEW record as a fallback.
        - It correctly logs any failed class lookups.
        """
        logger.info("📈 Processing overall BOSS results (for updates)...")
        
        overall_df = self.load_overall_boss_results()
        if overall_df is None or overall_df.empty:
            logger.warning("⚠️ No overall BOSS results file found or file is empty. Skipping.")
            return True

        # --- FIX for 'Bidding Window' column ---
        # Standardize column names for robustness BUT preserve the actual column names for data access
        standardized_columns = {}
        for col in overall_df.columns:
            standardized = str(col).lower().replace(' ', '_').replace('.', '')
            standardized_columns[standardized] = col
        
        # Check for the bidding window column
        bidding_window_col = None
        for std_name, orig_name in standardized_columns.items():
            if 'bidding_window' in std_name or 'bidding window' in str(orig_name).lower():
                bidding_window_col = orig_name
                break
        
        if not bidding_window_col:
            logger.error("❌ 'overallBossResults' file is missing the 'Bidding Window' column. Cannot process.")
            return False

        # Apply the parsing function to create 'round' and 'window' columns from the 'bidding_window' string
        try:
            parsed_windows = overall_df[bidding_window_col].apply(self.parse_bidding_window)
            overall_df['round'] = parsed_windows.apply(lambda x: x[0] if isinstance(x, tuple) else None)
            overall_df['window'] = parsed_windows.apply(lambda x: x[1] if isinstance(x, tuple) else None)
        except Exception as e:
            logger.error(f"❌ Failed to parse 'bidding_window' column: {e}")
            return False

        # Drop rows where parsing might have failed (e.g., unexpected format)
        overall_df.dropna(subset=['round', 'window'], inplace=True)
        if overall_df.empty:
            logger.warning("⚠️ Could not parse any valid round/window from the 'Bidding Window' column. Skipping.")
            return True
        
        # ==================================================================
        # === NEW LOGIC TO PROCESS ONLY THE PREVIOUS ROUND/WINDOW        ===
        # ==================================================================
        now = datetime.now()
        # Use the global START_AY_TERM to find the current active round based on the system time and schedule
        current_round_info = self._get_bidding_round_info_for_term(START_AY_TERM, now)

        if not current_round_info:
            logger.info("ℹ️ Not in an active bidding period. Skipping overall results processing.")
            return True

        # Determine the target round/window that we SHOULD be processing from the Excel file
        target_round, target_window = self.determine_previous_round_for_overall_results(current_round_info)

        if not target_round or not target_window:
            logger.info(f"ℹ️ Current active window ('{current_round_info}') does not require processing of previous results. Skipping.")
            return True
            
        logger.info(f"🎯 Current active window is '{current_round_info}'. Targeting results for Round [{target_round}] Window [{target_window}].")
        
        # Filter the DataFrame to only include rows for the target round and window
        overall_df['round'] = overall_df['round'].astype(str)
        overall_df['window'] = pd.to_numeric(overall_df['window']).astype(int)
        
        original_rows = len(overall_df)
        overall_df = overall_df[
            (overall_df['round'] == str(target_round)) &
            (overall_df['window'] == int(target_window))
        ]
        
        if overall_df.empty:
            logger.warning(f"⚠️ No data found in the overall BOSS results file for the target Round [{target_round}] Window [{target_window}]. (Checked {original_rows} rows). Skipping.")
            return True
        
        logger.info(f"✅ Filtered 'overallBossResults' to {len(overall_df)} rows for the target round and window.")

        if not hasattr(self, 'update_bid_result'):
            self.update_bid_result = []
        
        grouped_results = overall_df.groupby(['round', 'window'])
        
        new_records_count = 0
        updated_records_count = 0
        failed_count = 0

        for (current_round, current_window), group in grouped_results:
            logger.info(f"📊 Processing results for Round {current_round} Window {current_window}")
            
            # NOTE: the term ID is hardcoded here; keep it in sync with START_AY_TERM.
            acad_term_id = "AY202526T1"
            window_key = (acad_term_id, str(current_round), int(current_window))
            bid_window_id = self.bid_window_cache.get(window_key)

            if not bid_window_id:
                logger.warning(f"⚠️ Could not find bid_window_id for {window_key}, creating it now.")
                new_bid_window = {
                    'id': self.bid_window_id_counter, 'acad_term_id': acad_term_id,
                    'round': str(current_round), 'window': int(current_window)
                }
                self.new_bid_windows.append(new_bid_window)
                self.bid_window_cache[window_key] = self.bid_window_id_counter
                bid_window_id = self.bid_window_id_counter
                self.boss_stats['bid_windows_created'] += 1
                self.bid_window_id_counter += 1

            # Track rows with bid data for debugging
            rows_with_bid_data = 0

            for idx, row in group.iterrows():
                try:
                    # Identify the class by Course Code + Section rather than class_boss_id.
                    course_code = self._get_column_value(row, ['Course Code', 'course_code', 'Course_Code'])
                    section = self._get_column_value(row, ['Section', 'section'])
                    
                    if pd.isna(course_code) or pd.isna(section):
                        continue

                    # Find class_boss_id from the raw data using course_code + section
                    class_boss_id = self._find_class_boss_id_from_course_section(course_code, section, acad_term_id)
                    
                    if not class_boss_id:
                        failed_count += 1
                        self.failed_mappings.append({
                            'course_code': course_code, 'section': section, 'acad_term_id': acad_term_id,
                            'round': current_round, 'window': current_window, 'reason': 'class_boss_id_not_found',
                            'bidding_window_str': row.get(bidding_window_col, '')
                        })
                        continue

                    class_ids = self.find_all_class_ids(acad_term_id, class_boss_id)
                    
                    if not class_ids:
                        failed_count += 1
                        self.failed_mappings.append({
                            'course_code': course_code, 'section': section, 'acad_term_id': acad_term_id,
                            'round': current_round, 'window': current_window, 'reason': 'class_not_found',
                            'bidding_window_str': row.get(bidding_window_col, '')
                        })
                        continue

                    # Find the correct column names (case-insensitive)
                    median_bid = None
                    min_bid = None
                    
                    # Map standardized names to actual column names
                    for col in row.index:
                        col_lower = str(col).lower()
                        if 'median' in col_lower and 'bid' in col_lower:
                            median_bid = row[col]
                        elif 'min' in col_lower and 'bid' in col_lower:
                            min_bid = row[col]

                    # DEBUG: Log first few rows with bid data
                    if pd.notna(median_bid) or pd.notna(min_bid):
                        rows_with_bid_data += 1
                        if rows_with_bid_data <= 3:
                            logger.info(f"🔍 DEBUG: {course_code}-{section} has median={median_bid}, min={min_bid}")

                    for class_id in class_ids:
                        # Convert to int/float, mapping missing values (NaN/None) to None.
                        def safe_int(val):
                            return int(val) if pd.notna(val) else None

                        def safe_float(val):
                            return float(val) if pd.notna(val) else None

                        # Map all the column names properly
                        result_data = {
                            'bid_window_id': bid_window_id, 
                            'class_id': class_id,
                            'vacancy': safe_int(self._get_column_value(row, ['Vacancy', 'vacancy'])),
                            'opening_vacancy': safe_int(self._get_column_value(row, ['Opening Vacancy', 'opening_vacancy', 'Opening_Vacancy'])),
                            'before_process_vacancy': safe_int(self._get_column_value(row, ['Before Process Vacancy', 'before_process_vacancy', 'Before_Process_Vacancy'])),
                            'dice': safe_int(self._get_column_value(row, ['D.I.C.E', 'dice', 'd_i_c_e', 'DICE'])),
                            'after_process_vacancy': safe_int(self._get_column_value(row, ['After Process Vacancy', 'after_process_vacancy', 'After_Process_Vacancy'])),
                            'enrolled_students': safe_int(self._get_column_value(row, ['Enrolled Students', 'enrolled_students', 'Enrolled_Students'])),
                            'median': safe_float(median_bid),
                            'min': safe_float(min_bid)
                        }
                        
                        exists, existing_record = self.check_record_exists_in_cache(
                            'bid_result',
                            {'bid_window_id': bid_window_id, 'class_id': class_id},
                            ['bid_window_id', 'class_id']
                        )

                        if exists:
                            # Always update when processing overall results (they have the final bid data)
                            self.update_bid_result.append(result_data)
                            updated_records_count += 1
                            if (pd.notna(median_bid) or pd.notna(min_bid)) and updated_records_count <= 5:
                                logger.info(f"📊 UPDATE: {course_code}-{section} with median={median_bid}, min={min_bid}")
                        else:
                            self.new_bid_result.append(result_data)
                            self.boss_stats['bid_results_created'] += 1
                            new_records_count += 1
                            
                except Exception as e:
                    logger.error(f"Error processing row for {row.get('Course Code', 'unknown')}-{row.get('Section', 'unknown')}: {e}")
                    continue

            logger.info(f"✅ Round {current_round} Window {current_window}: {rows_with_bid_data} rows had bid data")

        self.boss_stats['failed_mappings'] += failed_count
        logger.info("✅ Overall Results Processing Complete.")
        logger.info(f"  - Records to CREATE: {new_records_count}")
        logger.info(f"  - Records to UPDATE: {updated_records_count}")
        logger.info(f"  - Failed Mappings: {failed_count}")
        
        return True

    def _get_column_value(self, row, possible_names):
        """Helper method to get column value by trying multiple possible column names"""
        for name in possible_names:
            if name in row.index:
                return row[name]
        return None

    def _find_class_boss_id_from_course_section(self, course_code, section, acad_term_id):
        """Find class_boss_id from course_code + section + acad_term_id"""
        if not hasattr(self, 'standalone_data') or self.standalone_data is None:
            return None
        
        # Look up in standalone_data
        matches = self.standalone_data[
            (self.standalone_data['course_code'] == course_code) &
            (self.standalone_data['section'].astype(str) == str(section).strip()) &
            (self.standalone_data['acad_term_id'] == acad_term_id)
        ]
        
        if not matches.empty:
            return matches.iloc[0].get('class_boss_id')
        
        return None
     
    def _parse_boss_aliases(self, boss_aliases_val: object) -> list[str]:
        """
        Robustly parses the boss_aliases value from various formats into a clean list of strings.

        This function correctly handles:
        - None, pd.isna(), or other "empty" values.
        - A standard Python list.
        - A NumPy array.
        - A raw PostgreSQL array string (e.g., '{"item1","item2"}').
        - A JSON-formatted string array (e.g., '["item1", "item2"]').

        Returns:
            A clean Python list of strings. Returns an empty list for any invalid or empty input.
        """
        # Return an empty list for any "empty" or None-like value.
        if boss_aliases_val is None:
            return []
        
        # Handle arrays/lists before using pd.isna
        if hasattr(boss_aliases_val, '__len__') and not isinstance(boss_aliases_val, str):
            # It's already an array/list, so check if it's empty
            if len(boss_aliases_val) == 0:
                return []
            # If it's a non-empty array, process it
            if isinstance(boss_aliases_val, list):
                return [str(item).strip() for item in boss_aliases_val if item and str(item).strip()]
            elif hasattr(boss_aliases_val, 'tolist'):
                # NumPy array
                return [str(item).strip() for item in boss_aliases_val.tolist() if item and str(item).strip()]
            else:
                # Other iterable
                return [str(item).strip() for item in boss_aliases_val if item and str(item).strip()]
        
        # Now safe to use pd.isna for non-array values
        try:
            if pd.isna(boss_aliases_val):
                return []
        except Exception:
            # pd.isna can fail on unusual objects; fall through and keep processing.
            pass

        # Handle standard Python list.
        if isinstance(boss_aliases_val, list):
            return [str(item).strip() for item in boss_aliases_val if item and str(item).strip()]

        # Handle NumPy array by checking for the .tolist() method.
        if hasattr(boss_aliases_val, 'tolist'):
            return [str(item).strip() for item in boss_aliases_val.tolist() if item and str(item).strip()]

        # Handle various string formats.
        if isinstance(boss_aliases_val, str):
            aliases_str = boss_aliases_val.strip()
            
            if not aliases_str:
                return []
                
            # Case 1: PostgreSQL array format '{"item1","item2"}'
            if aliases_str.startswith('{') and aliases_str.endswith('}'):
                content = aliases_str[1:-1]
                # Parse with the csv module (imported in Setup) so that commas inside
                # quoted items, e.g. "LEE, JOHN", are not treated as separators.
                items = next(csv.reader([content], escapechar='\\', skipinitialspace=True), [])
                return [item.strip() for item in items if item.strip()]

            # Case 2: JSON array format '["item1", "item2"]'
            if aliases_str.startswith('[') and aliases_str.endswith(']'):
                try:
                    parsed_list = json.loads(aliases_str)
                    if isinstance(parsed_list, list):
                        return [str(item).strip() for item in parsed_list if item and str(item).strip()]
                except (json.JSONDecodeError, TypeError):
                    # If JSON is malformed, fall back to treating it as a plain string.
                    pass

            # Case 3: A single alias provided as a plain string.
            return [aliases_str]

        # Fallback for other iterable types like tuples or sets.
        if hasattr(boss_aliases_val, '__iter__'):
            return [str(item).strip() for item in boss_aliases_val if item and str(item).strip()]
            
        return []

    def _extract_unique_professors(self) -> Tuple[set, dict]:
        """Extracts unique professor names and their variations from the raw data."""
        unique_professors = set()
        professor_variations = defaultdict(set)

        for _, row in self.multiple_data.iterrows():
            prof_name_raw = row.get('professor_name')
            if prof_name_raw is None or pd.isna(prof_name_raw):
                continue
            
            prof_name = str(prof_name_raw).strip()
            if not prof_name or prof_name.lower() in ['nan', 'tba', 'to be announced']:
                continue
            
            split_professors = self._split_professor_names(prof_name)
            for individual in split_professors:
                clean_prof = individual.strip()
                if clean_prof:
                    unique_professors.add(clean_prof)
                    if ', ' in clean_prof:
                        parts = clean_prof.split(', ')
                        if len(parts) == 2:
                            base_name = parts[0].strip()
                            extension = parts[1].strip()
                            if len(extension.split()) == 1:
                                professor_variations[clean_prof].add(base_name)
                                professor_variations[clean_prof].add(clean_prof)
                                if base_name in professor_variations:
                                    professor_variations[base_name].add(clean_prof)
                    else:
                        professor_variations[clean_prof].add(clean_prof)
        
        return unique_professors, professor_variations

    def _normalize_professors_batch(self, names_to_process: list) -> dict:
        """
        Normalizes a list of professor names using the pre-configured LLM model,
        with a rule-based fallback.
        """
        normalized_map = {}
        if not names_to_process:
            return normalized_map

        # --- LLM Pathway ---
        try:
            # Check if the model was successfully initialized in __init__
            if not self.llm_model:
                raise ValueError("LLM model not configured. Check API key.")

            total_batches = (len(names_to_process) + self.llm_batch_size - 1) // self.llm_batch_size
            logger.info(f"🧪 Normalizing {len(names_to_process)} names in {total_batches} batches using '{self.llm_model_name}'...")

            for i in range(0, len(names_to_process), self.llm_batch_size):
                batch_names = names_to_process[i:i + self.llm_batch_size]
                logger.info(f"  -> Processing batch {i//self.llm_batch_size + 1} of {total_batches} ({len(batch_names)} names)...")
                
                response = self.llm_model.generate_content(
                    contents=f"{self.llm_prompt}\n\n{json.dumps(batch_names)}"
                )

                # Robustly find the JSON block within the response text
                match = re.search(r'\[.*\]', response.text, re.DOTALL)
                if not match:
                    raise ValueError("LLM response did not contain a valid JSON array.")
                
                json_text = match.group(0)
                surnames = json.loads(json_text)
                
                if not isinstance(surnames, list) or len(surnames) != len(batch_names):
                    raise ValueError(f"LLM returned malformed data for batch {i//self.llm_batch_size + 1}.")

                for original_name, surname in zip(batch_names, surnames):
                    name_str = str(original_name).strip().replace("’", "'")
                    name_str = re.sub(r'\s*\(.*\)\s*', ' ', name_str).strip()
                    words = name_str.split()
                    words_no_initials = [word for word in words if not (len(word) == 1 and word.isalpha()) and not (len(word) == 2 and word.endswith('.'))]
                    boss_name = ' '.join(words_no_initials).upper()

                    name_parts = re.split(r'([ ,])', original_name)
                    afterclass_parts = []
                    surname_found = False
                    for part in name_parts:
                        if not surname_found and part.strip(" ,").upper() == surname.upper():
                            afterclass_parts.append(part.upper())
                            surname_found = True
                        else:
                            afterclass_parts.append(part.capitalize())
                    afterclass_name = "".join(afterclass_parts)
                    normalized_map[original_name] = (boss_name, afterclass_name)
                
                time.sleep(6)
            
            logger.info("✅ Batch normalization completed using Gemini LLM.")

        # --- Fallback Pathway ---
        except Exception as e:
            logger.warning(f"⚠️ LLM normalization failed ({e}). Falling back to rule-based method.")
            normalized_map.clear() # Ensure map is empty before filling
            for name in names_to_process:
                normalized_map[name] = self._normalize_professor_name_fallback(name)
        
        return normalized_map

### **Cell 1: Phase 1 - Professor and Course Processing with Enhanced Schema Support**

**What This Does:**
- Initializes the `TableBuilder` and connects to the PostgreSQL database so new records can be validated against existing data
- Normalizes professor names from the raw data, handling Asian, Western, and mixed naming patterns, using the schema's `boss_aliases` JSON arrays
- Resolves professor email addresses through Microsoft Outlook; duplicate detection ignores default emails
- Handles hardcoded multi-instructor combinations and avoids creating duplicate professors through several validation strategies, including NaN handling for values read from `raw_data.xlsx`
- Creates new courses from the standalone data and maps them to SMU faculties using course code prefix patterns taken from existing database courses
- Generates academic terms with proper ID formatting, extracting date ranges from multiple sources
- Outputs verification files for manual review: `new_professors.csv` for name corrections and `new_courses.csv` for faculty validation
- Reports statistics on professors created, courses processed, automated faculty mappings applied, and errors encountered
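
The prefix-based faculty mapping mentioned above can be pictured with a minimal sketch. This is not the `TableBuilder` implementation: the helper names (`course_prefix`, `build_prefix_map`, `guess_faculty`), the majority-vote rule, and the sample course codes and faculty IDs are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def course_prefix(code: str) -> str:
    """Return the leading alphabetic part of a course code, upper-cased."""
    match = re.match(r"[A-Za-z]+", code.strip())
    return match.group(0).upper() if match else ""


def build_prefix_map(existing: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Map each prefix to the faculty most often seen with it (assumed rule)."""
    votes = defaultdict(Counter)
    for code, faculty_id in existing:
        prefix = course_prefix(code)
        if prefix:
            votes[prefix][faculty_id] += 1
    return {prefix: counts.most_common(1)[0][0] for prefix, counts in votes.items()}


def guess_faculty(code: str, prefix_map: Dict[str, int]) -> Optional[int]:
    """Return the mapped faculty ID, or None when the prefix is unknown."""
    return prefix_map.get(course_prefix(code))


# Hypothetical (course_code, faculty_id) pairs standing in for existing database courses.
existing_courses = [("ACCT101", 1), ("ACCT201", 1), ("IS111", 2), ("IS210", 2), ("IS450", 3)]
prefix_map = build_prefix_map(existing_courses)

for new_code in ["ACCT336", "IS215", "LAW101"]:
    print(new_code, "->", guess_faculty(new_code, prefix_map))
```

Courses whose prefix has no match come back as `None`; in the workflow described here, such courses would be the ones left for manual faculty assignment.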

In [None]:
# Initialize the TableBuilder
builder = TableBuilder()

In [None]:
# Run Phase 1 (professors, courses, acad_terms)
success = builder.run_phase1_professors_and_courses()

if success:
    print("\n🎉 Phase 1 completed successfully!")
    print("📝 Next steps:")
    print("   1. Review script_output/verify/new_professors.csv")
    print("   2. Manually correct any professor names if needed")
    print("   3. Run Phase 2 in the next cell")
else:
    print("\n❌ Phase 1 failed. Check logs for details.")

### **Cell 2: Professor Name Review and Correction Interface**

**What This Does:**
- Loads the generated `new_professors.csv` file from the verification directory
- Displays a comparison table with three columns: `name` (afterclass format), `boss_aliases` (BOSS format, ALL CAPS), and `original_scraped_name`
- Gives instructions for manual correction, which should touch only the `name` column (afterclass format)
- Guides users to leave `boss_aliases` unchanged while correcting parsing errors or name formatting issues
- Handles empty files gracefully when all professors already exist in the database
- Prepares corrected data for Phase 2 processing by maintaining proper name mapping relationships

In [None]:
# Display new professors for review
new_prof_path = os.path.join('script_output', 'verify', 'new_professors.csv')
if os.path.exists(new_prof_path):
    df = pd.read_csv(new_prof_path)
    if not df.empty:
        print(f"📋 {len(df)} new professors created:")
        print("\n🔍 Review these professor names:")
        display(df[['name', 'boss_aliases', 'original_scraped_name']])
        print("\n📝 If any names need correction, edit the 'name' column in:")
        print(f"   {new_prof_path}")
        print("\n⚠️  Only edit the 'name' column (afterclass format)")
        print("   Keep 'boss_aliases' unchanged")
    else:
        print("✅ No new professors created - all professors already exist in database")
else:
    print("✅ No new professors file - all professors already exist in database")

### **Cell 3: Phase 2 - Class and Timing Processing with Corrected Professor Data**

**What This Does:**
- Reads manually corrected professor names from verification CSV files and updates internal lookup tables
- Processes classes from standalone data using corrected professor mappings and established course relationships
- Handles complex professor assignments including single professors, JSON arrays for multi-instructor classes, and missing professor scenarios
- Generates class timing records (weekly schedules) and exam timing records with proper foreign key relationships
- Links all timing data to valid class IDs while maintaining referential integrity
- Creates complete set of database-ready CSV files: `new_classes.csv`, `new_class_timing.csv`, `new_class_exam_timing.csv`
- Provides comprehensive error reporting for validation issues and successful record creation statistics
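
As a rough illustration of the first step above, the sketch below rebuilds an alias-to-name lookup from corrected professor rows. It assumes `boss_aliases` holds either a JSON array string or a single plain alias; the function name `build_alias_lookup` and the sample rows are hypothetical and do not come from `TableBuilder`.

```python
import json

import pandas as pd


def build_alias_lookup(professors: pd.DataFrame) -> dict:
    """Map every upper-cased BOSS alias to the (manually corrected) display name."""
    lookup = {}
    for _, row in professors.iterrows():
        raw = row.get("boss_aliases")
        if raw is None or (isinstance(raw, float) and pd.isna(raw)):
            continue  # no aliases recorded for this professor
        try:
            aliases = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            aliases = [raw]  # not JSON: treat the value as one plain alias
        if not isinstance(aliases, list):
            aliases = [aliases]
        for alias in aliases:
            key = str(alias).strip().upper()
            if key:
                lookup[key] = row["name"]
    return lookup


# Hypothetical rows shaped like new_professors.csv after manual review.
professors = pd.DataFrame(
    {
        "name": ["Jane TAN", "John SMITH"],
        "boss_aliases": ['["TAN JANE", "JANE TAN"]', "SMITH JOHN"],
    }
)
lookup = build_alias_lookup(professors)
print(lookup)
```

Keying the lookup on the unchanged aliases is why the review step asks that only the `name` column be edited.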

In [None]:
# Run Phase 2 (classes, timings) after manual correction
success = builder.run_phase2_remaining_tables()

if success:
    print("\n🎉 Phase 2 completed successfully!")
    print("📝 All tables generated with corrected professor names")
else:
    print("\n❌ Phase 2 failed. Check logs for details.")

### **Cell 4: Interactive Faculty Assignment for Unmapped Courses**

**What This Does:**
- Identifies courses that still require manual faculty assignment after automated BOSS-based mapping
- Opens scraped HTML course outline files in web browser for informed faculty assignment decisions  
- Presents interactive menu of SMU's schools and centers with options to create new faculties for unmapped departments
- Provides course code, name, and content preview to guide proper faculty placement decisions
- Updates course records with selected faculty assignments and maintains faculty cache consistency
- Allows skipping courses that need additional research while preserving assignment workflow
- Re-saves updated CSV files with complete faculty information for database insertion

In [None]:
# Run faculty assignment process if needed
if hasattr(builder, 'courses_needing_faculty') and builder.courses_needing_faculty:
    builder.assign_course_faculties()
    print("\n✅ Faculty assignment completed!")
else:
    print("✅ No courses need faculty assignment")

### **Cell 5: BOSS Bidding Data Processing from Raw Data Integration**

**What This Does:**
- Processes BOSS bidding data directly from `script_input/raw_data.xlsx` standalone sheet containing integrated bidding information from the HTML extraction phase
- Extracts bidding metrics including total capacity, current enrollment, reserved seats, available seats, extraction timestamps, and bidding window information
- Parses complex bidding window formats including standard rounds (Round 1, 1A, 1B), incoming exchange rounds (Round 1C), and incoming freshmen rounds (Round 1F) with proper hierarchical ordering
- Creates comprehensive bid window records following SMU's bidding system rules with automatic deduplication based on academic term, round, and window combinations
- Maps course codes and sections from raw bidding data to existing class records using multiple fallback strategies including memory cache, database cache, and CSV file lookups
- Generates three interconnected database tables: `new_bid_window.csv` for bidding periods, `new_class_availability.csv` for seat availability tracking, and `new_bid_result.csv` for bidding outcomes
- Handles complex class-professor relationships by creating records for all class IDs associated with each course/section/term combination to support multi-instructor scenarios
- Provides comprehensive error tracking and failed mapping analysis with detailed logging for troubleshooting data integration issues
- Creates processing logs with timestamps, validation statistics, and detailed analysis of unmapped records for quality assurance
- Supports academic year variations and bidding rule changes while maintaining backward compatibility with historical data formats
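
To make the window parsing and ordering concrete, here is a small sketch. The label format it accepts (for example `Round 1A Window 2` or `Incoming Exchange Rnd 1C Win 1`), the regular expression, and the `ROUND_ORDER` list are assumptions for illustration; the real parser in `TableBuilder` may accept other formats and use a different ordering.

```python
import re
from typing import Optional, Tuple

# Assumed label shape: "Round|Rnd <round> Window|Win <number>".
WINDOW_RE = re.compile(
    r"(?:Round|Rnd)\s+(\d[A-Z]?)\s+(?:Window|Win)\s+(\d+)", re.IGNORECASE
)

# Assumed hierarchy of rounds; unknown rounds sort last.
ROUND_ORDER = ["1", "1A", "1B", "1C", "1F", "2", "2A"]


def parse_bid_window(text: str) -> Optional[Tuple[str, int]]:
    """Return (round, window) from a bidding window label, or None if unparseable."""
    match = WINDOW_RE.search(text)
    if not match:
        return None
    return match.group(1).upper(), int(match.group(2))


def sort_key(parsed: Tuple[str, int]) -> Tuple[int, int]:
    """Order by position in ROUND_ORDER, then by window number."""
    round_name, window = parsed
    rank = ROUND_ORDER.index(round_name) if round_name in ROUND_ORDER else len(ROUND_ORDER)
    return rank, window


labels = [
    "Round 2 Window 1",
    "Round 1A Window 2",
    "Incoming Exchange Rnd 1C Win 1",
    "Round 1 Window 1",
    "Round 1A Window 2",  # duplicate, removed by the set below
]
parsed = {p for p in (parse_bid_window(label) for label in labels) if p is not None}
ordered = sorted(parsed, key=sort_key)
print(ordered)
```

Collecting the parsed tuples in a set mirrors the deduplication on round and window described above; a full version would also include the academic term in the key.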

In [None]:
# Run complete Phase 3 pipeline
print("🚀 Starting Phase 3: BOSS Results Processing")
success = builder.run_phase3_boss_processing()

if success:
    print("\n🎉 Phase 3 completed successfully!")
    builder.close_connection()
else:
    print("\n❌ Phase 3 failed. Check logs for details.")

# Check failed mappings (if any)
failed_path = os.path.join('script_output', 'failed_boss_results_mapping.csv')
if os.path.exists(failed_path):
    failed_df = pd.read_csv(failed_path)
    print(f"⚠️ {len(failed_df)} failed mappings found:")
    display(failed_df.head(10))
    print(f"\n📝 Review failed mappings in: {failed_path}")
else:
    print("✅ No failed mappings - all BOSS results mapped successfully!")

# Inspect generated data
print("📋 Generated Data Summary:")

# Check bid windows
bid_window_path = os.path.join('script_output', 'new_bid_window.csv')
if os.path.exists(bid_window_path):
    bw_df = pd.read_csv(bid_window_path)
    print(f"\n🪟 Bid Windows ({len(bw_df)} records):")
    display(bw_df.head())  # preview the first rows under the header

# Check class availability
availability_path = os.path.join('script_output', 'new_class_availability.csv')
if os.path.exists(availability_path):
    av_df = pd.read_csv(availability_path)
    print(f"\n📊 Class Availability ({len(av_df)} records):")
    display(av_df.head())  # preview the first rows under the header

# Check bid results
result_path = os.path.join('script_output', 'new_bid_result.csv')
if os.path.exists(result_path):
    br_df = pd.read_csv(result_path)
    print(f"\n📈 Bid Results ({len(br_df)} records):")
    display(br_df.head())  # preview the first rows under the header

### **Cell 6: Comprehensive Data Integrity Validation**

**What This Does:**
- Validates referential integrity across all generated CSV files by checking foreign key relationships between tables
- Loads valid IDs from multiple sources: database cache files, new CSV files, professor lookup tables, and verification files
- Performs comprehensive validation of course_id references in classes, professor_id fields (both single UUIDs and JSON arrays), and class_id references in timing tables
- Checks UUID format validity and ensures all referenced IDs exist in their respective source tables
- Generates detailed error reports with specific row numbers, invalid IDs, and raw professor names for debugging
- Creates validation summary statistics including total records checked, error counts by type, and data loading metrics
- Provides categorized error analysis for professor ID issues including format errors, missing references, and null assignments
- Saves validation results to CSV files: `validation_errors.csv`, `validation_warnings.csv`, and `validation_summary.csv` for comprehensive quality assurance
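
The core of these checks is a foreign key lookup plus a UUID format test. The sketch below shows both in a vectorized form using `Series.isin` and the standard `uuid` module; it is an alternative illustration, not the row-by-row `DataIntegrityValidator` code in the next cell, and the helper names and sample IDs are made up.

```python
import uuid

import pandas as pd


def is_uuid(value) -> bool:
    """True if the value parses as a UUID (laxer than a strict regex check)."""
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False


def find_orphans(child: pd.DataFrame, fk_column: str, valid_ids: set) -> pd.DataFrame:
    """Return the child rows whose foreign key is absent from valid_ids."""
    mask = ~child[fk_column].astype(str).isin(valid_ids)
    return child[mask]


# Hypothetical data standing in for new_classes.csv and new_class_timing.csv.
classes = pd.DataFrame(
    {"id": ["11111111-1111-4111-8111-111111111111", "22222222-2222-4222-8222-222222222222"]}
)
timings = pd.DataFrame(
    {
        "class_id": [
            "11111111-1111-4111-8111-111111111111",
            "33333333-3333-4333-8333-333333333333",
            "not-a-uuid",
        ]
    }
)

orphans = find_orphans(timings, "class_id", set(classes["id"]))
print(orphans)
print(orphans["class_id"].map(is_uuid).tolist())
```

Note that `uuid.UUID` also accepts forms such as 32 hex digits without hyphens, so it is more permissive than a hyphenated-format regex.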

In [None]:
# Set up logging
import logging
import os
import re
import json
import pandas as pd
from datetime import datetime
import traceback
import sys

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class DataIntegrityValidator:
    """Validates data integrity across generated CSV files and database cache"""
    
    def __init__(self, output_base='script_output', cache_dir='db_cache'):
        self.output_base = output_base
        self.verify_dir = os.path.join(output_base, 'verify')
        self.cache_dir = cache_dir
        
        # Ensure directories exist
        os.makedirs(self.verify_dir, exist_ok=True)
        os.makedirs(self.cache_dir, exist_ok=True)
        
        # Data containers
        self.valid_course_ids = set()
        self.valid_professor_ids = set()
        self.valid_class_ids = set()
        
        # Professor lookup mapping
        self.professor_lookup = {}
        
        # Validation results
        self.validation_errors = []
        self.validation_warnings = []
        
        # Statistics
        self.stats = {
            'total_classes_checked': 0,
            'total_timings_checked': 0,
            'total_exam_timings_checked': 0,
            'course_id_errors': 0,
            'professor_id_errors': 0,
            'professor_id_format_errors': 0,
            'class_id_errors': 0,
            'warnings': 0,
            'professors_created': 0,
            'professors_updated': 0,
            'courses_created': 0,
            'courses_updated': 0,
            'courses_needing_faculty': 0,
            'classes_created': 0,
            'timings_created': 0,
            'exams_created': 0
        }
        
        # Initialize new_acad_terms for compatibility
        self.new_acad_terms = []
    
    def is_valid_uuid(self, uuid_string):
        """Check if a string is a valid UUID format"""
        if not uuid_string or pd.isna(uuid_string):
            return False
        
        try:
            uuid_pattern = re.compile(
                r'^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$',
                re.IGNORECASE
            )
            return bool(uuid_pattern.match(str(uuid_string).strip()))
        except Exception as e:
            logger.warning(f"UUID validation error for {uuid_string}: {e}")
            return False
    
    def safe_read_csv(self, file_path, required_columns=None):
        """Safely read CSV file with error handling"""
        try:
            if not os.path.exists(file_path):
                logger.warning(f"File not found: {file_path}")
                return pd.DataFrame()
            
            if os.path.getsize(file_path) == 0:
                logger.warning(f"File is empty: {file_path}")
                return pd.DataFrame()
            
            df = pd.read_csv(file_path)
            
            if df.empty:
                logger.warning(f"CSV file is empty: {file_path}")
                return pd.DataFrame()
            
            if required_columns:
                missing_columns = [col for col in required_columns if col not in df.columns]
                if missing_columns:
                    logger.warning(f"Missing columns in {file_path}: {missing_columns}")
                    return pd.DataFrame()
            
            return df
        except Exception as e:
            logger.error(f"Error reading CSV {file_path}: {e}")
            return pd.DataFrame()
    
    def safe_read_pickle(self, file_path):
        """Safely read pickle file with error handling"""
        try:
            if not os.path.exists(file_path):
                logger.warning(f"Pickle file not found: {file_path}")
                return pd.DataFrame()
            
            df = pd.read_pickle(file_path)
            return df if not df.empty else pd.DataFrame()
        except Exception as e:
            logger.error(f"Error reading pickle {file_path}: {e}")
            return pd.DataFrame()
    
    def load_valid_course_ids(self):
        """Load valid course IDs from new_courses.csv and database cache"""
        logger.info("📚 Loading valid course IDs...")
        
        # Load from new_courses.csv (verify folder)
        new_courses_path = os.path.join(self.verify_dir, 'new_courses.csv')
        df = self.safe_read_csv(new_courses_path, ['id'])
        if not df.empty:
            new_course_ids = set(df['id'].astype(str))
            self.valid_course_ids.update(new_course_ids)
            logger.info(f"   ✅ Loaded {len(new_course_ids)} course IDs from new_courses.csv")
        
        # Load from database cache
        cache_file = os.path.join(self.cache_dir, 'courses_cache.pkl')
        courses_df = self.safe_read_pickle(cache_file)
        if not courses_df.empty and 'id' in courses_df.columns:
            cache_course_ids = set(courses_df['id'].astype(str))
            self.valid_course_ids.update(cache_course_ids)
            logger.info(f"   ✅ Loaded {len(cache_course_ids)} course IDs from database cache")
        
        logger.info(f"   📊 Total valid course IDs: {len(self.valid_course_ids)}")
    
    def load_valid_professor_ids(self):
        """Load valid professor IDs from multiple sources including professor_lookup.csv"""
        logger.info("👥 Loading valid professor IDs...")
        
        # PRIORITY 1: Load from professor_lookup.csv (most authoritative)
        lookup_file = 'script_input/professor_lookup.csv'
        if os.path.exists(lookup_file):
            lookup_df = self.safe_read_csv(lookup_file, ['database_id'])
            if not lookup_df.empty and 'database_id' in lookup_df.columns:
                # Drop missing values first so NaN does not become the string 'nan'
                lookup_professor_ids = set(lookup_df['database_id'].dropna().astype(str))
                self.valid_professor_ids.update(lookup_professor_ids)
                logger.info(f"   ✅ Loaded {len(lookup_professor_ids)} professor IDs from professor_lookup.csv")
                
                # Also build lookup mapping for analysis
                for _, row in lookup_df.iterrows():
                    boss_name = row.get('boss_name')
                    database_id = row.get('database_id')
                    # Test for NaN before converting: str(NaN) is 'nan', which pd.notna() accepts
                    if pd.notna(boss_name) and pd.notna(database_id):
                        self.professor_lookup[boss_name] = str(database_id)
        
        # PRIORITY 2: Load from database cache (professors table)
        cache_file = os.path.join(self.cache_dir, 'professors_cache.pkl')
        professors_df = self.safe_read_pickle(cache_file)
        if not professors_df.empty and 'id' in professors_df.columns:
            cache_professor_ids = set(professors_df['id'].astype(str))
            self.valid_professor_ids.update(cache_professor_ids)
            logger.info(f"   ✅ Loaded {len(cache_professor_ids)} professor IDs from database cache")
        
        # PRIORITY 3: Load from new_professors.csv (verify folder)
        new_professors_path = os.path.join(self.verify_dir, 'new_professors.csv')
        df = self.safe_read_csv(new_professors_path, ['id'])
        if not df.empty:
            new_professor_ids = set(df['id'].astype(str))
            self.valid_professor_ids.update(new_professor_ids)
            logger.info(f"   ✅ Loaded {len(new_professor_ids)} professor IDs from new_professors.csv")
        
        logger.info(f"   📊 Total valid professor IDs: {len(self.valid_professor_ids)}")
    
    def load_valid_class_ids(self):
        """Load valid class IDs from new_classes.csv"""
        logger.info("🏫 Loading valid class IDs...")
        
        classes_path = os.path.join(self.output_base, 'new_classes.csv')
        df = self.safe_read_csv(classes_path, ['id'])
        if not df.empty:
            self.valid_class_ids = set(df['id'].astype(str))
            logger.info(f"   ✅ Loaded {len(self.valid_class_ids)} class IDs from new_classes.csv")
        else:
            logger.error(f"   ❌ Could not load class IDs from {classes_path}")
        
        logger.info(f"   📊 Total valid class IDs: {len(self.valid_class_ids)}")
    
    def parse_professor_ids(self, professor_id_field):
        """Safely parse professor ID field which can be single ID or JSON array"""
        if pd.isna(professor_id_field) or str(professor_id_field).strip() == '':
            return []
        
        professor_id_str = str(professor_id_field).strip()
        
        # Check if it's a JSON array
        if professor_id_str.startswith('[') and professor_id_str.endswith(']'):
            try:
                # Handle both single and double quotes
                normalized_json = professor_id_str.replace("'", '"')
                parsed_ids = json.loads(normalized_json)
                
                if isinstance(parsed_ids, list):
                    return [str(pid).strip() for pid in parsed_ids if pd.notna(pid)]
                else:
                    return []
            except (json.JSONDecodeError, TypeError) as e:
                logger.warning(f"JSON parsing error for professor_id: {professor_id_str} - {e}")
                return []
        else:
            # Single professor ID
            return [professor_id_str] if professor_id_str else []
    
    def validate_classes(self):
        """Validate course_id and professor_id references in new_classes.csv"""
        logger.info("🔍 Validating new_classes.csv...")
        
        classes_path = os.path.join(self.output_base, 'new_classes.csv')
        df = self.safe_read_csv(classes_path, ['id', 'course_id'])
        
        if df.empty:
            logger.error("   ❌ Could not validate classes - file not found, empty, or missing required columns")
            return
        
        try:
            self.stats['total_classes_checked'] = len(df)
            
            for idx, row in df.iterrows():
                try:
                    class_id = str(row['id'])
                    course_id = str(row['course_id'])
                    professor_id_field = row.get('professor_id')
                    raw_professor_name = row.get('raw_professor_name', '')
                    
                    # Validate course_id
                    if course_id not in self.valid_course_ids:
                        error = {
                            'type': 'course_id_missing',
                            'file': 'new_classes.csv',
                            'row': idx,
                            'class_id': class_id,
                            'invalid_id': course_id,
                            'field': 'course_id'
                        }
                        self.validation_errors.append(error)
                        self.stats['course_id_errors'] += 1
                    
                    # Validate professor_id
                    professor_ids_to_check = self.parse_professor_ids(professor_id_field)
                    
                    if professor_ids_to_check:
                        for prof_id in professor_ids_to_check:
                            prof_id_str = str(prof_id).strip()
                            
                            # Check UUID format
                            if not self.is_valid_uuid(prof_id_str):
                                error = {
                                    'type': 'professor_id_invalid_uuid',
                                    'file': 'new_classes.csv',
                                    'row': idx,
                                    'class_id': class_id,
                                    'invalid_id': prof_id_str,
                                    'field': 'professor_id',
                                    'raw_professor_name': raw_professor_name,
                                    'course_id': course_id
                                }
                                self.validation_errors.append(error)
                                self.stats['professor_id_format_errors'] += 1
                                continue
                            
                            # Check if professor exists
                            if prof_id_str not in self.valid_professor_ids:
                                error = {
                                    'type': 'professor_id_not_found',
                                    'file': 'new_classes.csv',
                                    'row': idx,
                                    'class_id': class_id,
                                    'invalid_id': prof_id_str,
                                    'field': 'professor_id',
                                    'raw_professor_name': raw_professor_name,
                                    'course_id': course_id
                                }
                                self.validation_errors.append(error)
                                self.stats['professor_id_errors'] += 1
                    else:
                        # Warning for missing professor
                        warning = {
                            'type': 'professor_id_null',
                            'file': 'new_classes.csv',
                            'row': idx,
                            'class_id': class_id,
                            'message': 'No professors found for class',
                            'raw_professor_name': raw_professor_name,
                            'course_id': course_id
                        }
                        self.validation_warnings.append(warning)
                        self.stats['warnings'] += 1
                
                except Exception as e:
                    logger.error(f"Error processing row {idx}: {e}")
                    continue
            
            logger.info(f"   ✅ Validated {len(df)} classes")
            
        except Exception as e:
            logger.error(f"   ❌ Error validating classes: {e}")
            traceback.print_exc()
    
    def validate_class_timings(self):
        """Validate class_id references in new_class_timing.csv"""
        logger.info("⏰ Validating new_class_timing.csv...")
        
        timings_path = os.path.join(self.output_base, 'new_class_timing.csv')
        df = self.safe_read_csv(timings_path, ['class_id'])
        
        if df.empty:
            logger.warning(f"   ⚠️ new_class_timing.csv not found or empty")
            return
        
        try:
            self.stats['total_timings_checked'] = len(df)
            
            for idx, row in df.iterrows():
                try:
                    class_id = str(row['class_id'])
                    
                    if class_id not in self.valid_class_ids:
                        error = {
                            'type': 'class_id_missing',
                            'file': 'new_class_timing.csv',
                            'row': idx,
                            'invalid_id': class_id,
                            'field': 'class_id'
                        }
                        self.validation_errors.append(error)
                        self.stats['class_id_errors'] += 1
                
                except Exception as e:
                    logger.error(f"Error processing timing row {idx}: {e}")
                    continue
            
            logger.info(f"   ✅ Validated {len(df)} class timings")
            
        except Exception as e:
            logger.error(f"   ❌ Error validating class timings: {e}")
    
    def validate_exam_timings(self):
        """Validate class_id references in new_class_exam_timing.csv"""
        logger.info("📝 Validating new_class_exam_timing.csv...")
        
        exam_timings_path = os.path.join(self.output_base, 'new_class_exam_timing.csv')
        df = self.safe_read_csv(exam_timings_path, ['class_id'])
        
        if df.empty:
            logger.warning(f"   ⚠️ new_class_exam_timing.csv not found or empty")
            return
        
        try:
            self.stats['total_exam_timings_checked'] = len(df)
            
            for idx, row in df.iterrows():
                try:
                    class_id = str(row['class_id'])
                    
                    if class_id not in self.valid_class_ids:
                        error = {
                            'type': 'class_id_missing',
                            'file': 'new_class_exam_timing.csv',
                            'row': idx,
                            'invalid_id': class_id,
                            'field': 'class_id'
                        }
                        self.validation_errors.append(error)
                        self.stats['class_id_errors'] += 1
                
                except Exception as e:
                    logger.error(f"Error processing exam timing row {idx}: {e}")
                    continue
            
            logger.info(f"   ✅ Validated {len(df)} exam timings")
            
        except Exception as e:
            logger.error(f"   ❌ Error validating exam timings: {e}")
    
    def analyze_professor_issues(self):
        """Analyze professor-related issues in detail"""
        logger.info("🔬 Analyzing professor issues...")
        
        # Group professor errors by type
        error_types = {}
        for error in self.validation_errors:
            if 'professor_id' in error['type']:
                error_type = error['type']
                if error_type not in error_types:
                    error_types[error_type] = []
                error_types[error_type].append(error)
        
        if error_types:
            print(f"\n📊 PROFESSOR ID ERROR ANALYSIS:")
            for error_type, errors in error_types.items():
                print(f"\n   Error Type: {error_type}")
                print(f"   Count: {len(errors)}")
                
                # Show unique invalid IDs for this error type
                unique_invalid_ids = set()
                for error in errors:
                    unique_invalid_ids.add(error['invalid_id'])
                
                print(f"   Unique Invalid IDs: {len(unique_invalid_ids)}")
                
                # Show sample errors
                print(f"   Sample errors:")
                for i, error in enumerate(errors[:3]):
                    print(f"     {i+1}. Class {error['class_id']} - Raw name: {error.get('raw_professor_name', 'N/A')}")
                    print(f"        Invalid ID: {error['invalid_id']}")
                
                if len(errors) > 3:
                    print(f"     ... and {len(errors) - 3} more")
    
    def save_validation_report(self):
        """Save validation errors and warnings to CSV files"""
        logger.info("💾 Saving validation report...")
        
        try:
            # Save validation errors
            if self.validation_errors:
                errors_df = pd.DataFrame(self.validation_errors)
                errors_path = os.path.join(self.output_base, 'validation_errors.csv')
                errors_df.to_csv(errors_path, index=False)
                logger.info(f"   ❌ Saved {len(self.validation_errors)} validation errors to validation_errors.csv")
            
            # Save validation warnings
            if self.validation_warnings:
                warnings_df = pd.DataFrame(self.validation_warnings)
                warnings_path = os.path.join(self.output_base, 'validation_warnings.csv')
                warnings_df.to_csv(warnings_path, index=False)
                logger.info(f"   ⚠️ Saved {len(self.validation_warnings)} validation warnings to validation_warnings.csv")
            
            # Save summary report
            summary = {
                'validation_timestamp': datetime.now().isoformat(),
                'total_classes_checked': self.stats['total_classes_checked'],
                'total_timings_checked': self.stats['total_timings_checked'],
                'total_exam_timings_checked': self.stats['total_exam_timings_checked'],
                'total_errors': len(self.validation_errors),
                'total_warnings': len(self.validation_warnings),
                'course_id_errors': self.stats['course_id_errors'],
                'professor_id_errors': self.stats['professor_id_errors'],
                'professor_id_format_errors': self.stats['professor_id_format_errors'],
                'class_id_errors': self.stats['class_id_errors'],
                'valid_course_ids_loaded': len(self.valid_course_ids),
                'valid_professor_ids_loaded': len(self.valid_professor_ids),
                'valid_class_ids_loaded': len(self.valid_class_ids)
            }
            
            summary_df = pd.DataFrame([summary])
            summary_path = os.path.join(self.output_base, 'validation_summary.csv')
            summary_df.to_csv(summary_path, index=False)
            logger.info(f"   📊 Saved validation summary to validation_summary.csv")
            
        except Exception as e:
            logger.error(f"Error saving validation report: {e}")
    
    def print_summary(self):
        """Print processing summary with enhanced statistics"""
        print("\n" + "="*70)
        print("📊 PROCESSING SUMMARY")
        print("="*70)
        print(f"✅ Professors created: {self.stats['professors_created']}")
        print(f"✅ Professors updated: {self.stats.get('professors_updated', 0)}")
        print(f"✅ Courses created: {self.stats['courses_created']}")
        print(f"✅ Courses updated: {self.stats['courses_updated']}")
        print(f"⚠️  Courses needing faculty: {self.stats['courses_needing_faculty']}")
        print(f"✅ Classes created: {self.stats['classes_created']}")
        print(f"✅ Class timings created: {self.stats['timings_created']}")
        print(f"✅ Exam timings created: {self.stats['exams_created']}")
        print("="*70)
        
        print("\n📁 OUTPUT FILES:")
        print(f"   Verify folder: {self.verify_dir}/")
        print(f"   - new_professors.csv ({self.stats['professors_created']} records)")
        print(f"   - new_courses.csv ({self.stats['courses_created']} records)")
        print(f"   Output folder: {self.output_base}/")
        print(f"   - update_courses.csv ({self.stats['courses_updated']} records)")
        print(f"   - update_professor.csv ({self.stats.get('professors_updated', 0)} records)")
        if hasattr(self, 'update_classes') and self.update_classes:
            print(f"   - update_classes.csv ({len(self.update_classes)} records)")
        if hasattr(self, 'update_bid_result') and self.update_bid_result:
            print(f"   - update_bid_result.csv ({len(self.update_bid_result)} records)")
        print(f"   - new_acad_term.csv ({len(self.new_acad_terms)} records)")
        print(f"   - new_classes.csv ({self.stats['classes_created']} records)")
        print(f"   - new_class_timing.csv ({self.stats['timings_created']} records)")
        print(f"   - new_class_exam_timing.csv ({self.stats['exams_created']} records)")
        print(f"   - professor_lookup.csv (updated)")
        print(f"   - courses_needing_faculty.csv ({self.stats['courses_needing_faculty']} records)")
        print("="*70)
    
    def run_validation(self):
        """Run complete data integrity validation"""
        try:
            logger.info("🚀 Starting Data Integrity Validation")
            logger.info("="*60)
            
            # Step 1: Load valid IDs from all sources
            self.load_valid_course_ids()
            self.load_valid_professor_ids()
            self.load_valid_class_ids()
            
            # Step 2: Validate references
            self.validate_classes()
            self.validate_class_timings()
            self.validate_exam_timings()
            
            # Step 3: Analyze issues
            self.analyze_professor_issues()
            
            # Step 4: Save and display results
            self.save_validation_report()
            self.print_summary()
            
            logger.info("\n✅ Data integrity validation completed!")
            
            # Return validation status
            return len(self.validation_errors) == 0
            
        except Exception as e:
            logger.error(f"❌ Validation failed: {e}")
            traceback.print_exc()
            return False

In [None]:
validator = DataIntegrityValidator()
success = validator.run_validation()

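# Report the result without calling exit(): in a notebook, exit() asks the
# kernel to shut down, so the cells below would not run.
if success:
    print("\n🎉 All data integrity checks passed!")
else:
    print("\n💥 Data integrity issues found - check error reports!")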
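
The reference checks run by the validator come down to one question: does every ID in a child table exist in the set of valid IDs loaded earlier? The snippet below is a minimal, self-contained sketch of that idea, not the validator's actual implementation. The helper name `find_invalid_references` and the sample data are hypothetical.

```python
import pandas as pd


def find_invalid_references(df, column, valid_ids):
    """Return the rows of df whose value in `column` is not in valid_ids.

    Missing values are ignored here; whether a missing ID counts as an
    error is a separate decision (assumption made for this sketch).
    """
    present = df[column].notna()
    invalid = present & ~df[column].isin(valid_ids)
    return df[invalid]


# Hypothetical sample data: three classes, one pointing at an unknown course
classes = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "course_id": ["course-a", "course-x", None],
})
valid_course_ids = {"course-a", "course-b"}

bad_rows = find_invalid_references(classes, "course_id", valid_course_ids)
print(bad_rows["id"].tolist())
```

A vectorised `isin` check like this avoids a Python-level loop over rows, which may matter once the class table grows to tens of thousands of records.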

In [None]:
def check_class_coverage_standalone(output_dir='script_output'):
    """Standalone function to analyze class coverage and generate detailed report"""
    import pandas as pd
    import os
    from collections import defaultdict
    
    print("\n" + "="*70)
    print("🔍 CLASS COVERAGE ANALYSIS")
    print("="*70)
    
    # Load new_classes.csv
    classes_path = os.path.join(output_dir, 'new_classes.csv')
    if not os.path.exists(classes_path):
        print(f"❌ File not found: {classes_path}")
        return
    
    try:
        classes_df = pd.read_csv(classes_path)
        total_classes = len(classes_df)
        all_class_ids = set(classes_df['id'].unique())
        
        # Files to check
        files_to_check = {
            'new_class_availability.csv': 'class_availability',
            'new_class_exam_timing.csv': 'class_exam_timing', 
            'new_class_timing.csv': 'class_timing',
            'new_bid_result.csv': 'bid_result'
        }
        
        coverage_results = {}
        orphan_class_ids = defaultdict(list)
        
        # Load each file and analyze
        for filename, table_name in files_to_check.items():
            file_path = os.path.join(output_dir, filename)
            
            if not os.path.exists(file_path):
                coverage_results[table_name] = {
                    'found_ids': set(),
                    'orphan_ids': set()
                }
                continue
                
            df = pd.read_csv(file_path)
            if 'class_id' in df.columns:
                found_class_ids = set(df['class_id'].unique())
                
                # Check for orphan class_ids
                orphan_ids = found_class_ids - all_class_ids
                if orphan_ids:
                    orphan_class_ids[table_name] = orphan_ids
                
                # Store valid class_ids
                valid_class_ids = found_class_ids & all_class_ids
                
                coverage_results[table_name] = {
                    'found_ids': valid_class_ids,
                    'orphan_ids': orphan_ids
                }
            else:
                # No class_id column: record empty coverage so the report still lists this table
                coverage_results[table_name] = {
                    'found_ids': set(),
                    'orphan_ids': set()
                }
        
        # Calculate statistics
        print("\n📊 STATISTICS:")
        print("-" * 50)
        
        # 1. Total class rows
        print(f"1. Total class rows created: {total_classes}")
        
        # 2. Unique course/section/term combinations
        unique_combinations = classes_df.groupby(['course_id', 'section', 'acad_term_id']).size()
        num_unique_combinations = len(unique_combinations)
        print(f"2. Unique course/section/term combinations: {num_unique_combinations}")
        
        # 3. Classes from multiple professors
        multi_professor_combinations = unique_combinations[unique_combinations > 1]
        total_multi_professor_classes = multi_professor_combinations.sum()
        print(f"3. Class records from multiple professors: {total_multi_professor_classes}")
        
        # 4. Unique classes duplicated due to multiple professors
        num_duplicated_unique_classes = len(multi_professor_combinations)
        print(f"4. Unique classes duplicated due to multiple professors: {num_duplicated_unique_classes}")
        
        # 5. Classes with no BOSS results
        has_availability = coverage_results.get('class_availability', {}).get('found_ids', set())
        has_bid_result = coverage_results.get('bid_result', {}).get('found_ids', set())
        no_boss_classes = all_class_ids - has_availability - has_bid_result
        print(f"5. Classes with no BOSS results (no availability/bid_result): {len(no_boss_classes)}")
        
        # 6. Classes with no exams but have class timings
        has_timing = coverage_results.get('class_timing', {}).get('found_ids', set())
        has_exam = coverage_results.get('class_exam_timing', {}).get('found_ids', set())
        no_exam_with_timing = has_timing - has_exam
        print(f"6. Classes with class timings but no exams: {len(no_exam_with_timing)}")
        
        # 7. Classes with exams but no class timings
        exam_no_timing = has_exam - has_timing
        print(f"7. Classes with exams but no class timings: {len(exam_no_timing)}")
        
        # 8. Classes with both exams and class timings
        both_exam_timing = has_exam & has_timing
        print(f"8. Classes with both exams and class timings: {len(both_exam_timing)}")
        
        # 9. Orphan class_ids
        total_orphans = sum(len(ids) for ids in orphan_class_ids.values())
        print(f"9. Orphan class_ids (in tables but not in new_classes): {total_orphans}")
        if total_orphans > 0:
            for table, ids in orphan_class_ids.items():
                print(f"   - {table}: {len(ids)} orphan IDs")
        
        # 10. BOSS records not mapped to scraped data
        # Matching BOSS rows to scraped classes needs a course_code -> course_id
        # mapping, which is not available here. This step therefore only counts the
        # unique course/section/term combinations in the first BOSS results file.
        print("\n📊 Checking BOSS records not mapped to scraped data...")
        
        try:
            import glob
            import re
            boss_files = glob.glob(os.path.join('script_input', 'overallBossResults', '*.xlsx'))
            if boss_files:
                required_cols = ['Course Code', 'Section', 'Term']
                term_pattern = re.compile(r'(\d{4})-(\d{2})\s+Term\s+(\w+)')
                
                boss_unique_combinations = set()
                for file_path in boss_files[:1]:  # Sample first file for performance
                    boss_df = pd.read_excel(file_path)
                    if all(col in boss_df.columns for col in required_cols):
                        # Skip rows missing any of the three key fields
                        for _, row in boss_df.dropna(subset=required_cols).iterrows():
                            term = row['Term']
                            # Convert term to acad_term_id format
                            match = term_pattern.match(term) if isinstance(term, str) else None
                            if match:
                                acad_term_id = f"AY{match.group(1)}{match.group(2)}T{match.group(3)}"
                                boss_unique_combinations.add((row['Course Code'], str(row['Section']), acad_term_id))
                
                print(f"10. Unique BOSS combinations sampled: {len(boss_unique_combinations)}")
                print("    (Note: Exact count requires course_code to course_id mapping)")
            else:
                print("10. No BOSS result files found - skipping")
        except Exception as e:
            print(f"10. Could not analyze BOSS unmapped records: {e}")
        
        # Generate detailed report
        print("\n💾 Generating detailed report...")
        
        missing_report = []
        for _, class_row in classes_df.iterrows():
            class_id = class_row['id']
            
            row = {
                'class_id': class_id,
                'course_id': class_row.get('course_id'),
                'section': class_row.get('section'),
                'professor_id': class_row.get('professor_id'),
                'acad_term_id': class_row.get('acad_term_id'),
                'boss_id': class_row.get('boss_id'),
                'raw_professor_name': class_row.get('raw_professor_name', ''),
                'warn_inaccuracy': class_row.get('warn_inaccuracy', False)
            }
            
            # Check each table
            for table_name, result in coverage_results.items():
                row[f'in_{table_name}'] = 'Yes' if class_id in result['found_ids'] else 'No'
            
            # Add summary flags
            row['has_boss_data'] = 'Yes' if (
                row.get('in_class_availability') == 'Yes' or 
                row.get('in_bid_result') == 'Yes'
            ) else 'No'
            
            row['has_timing_data'] = 'Yes' if row.get('in_class_timing') == 'Yes' else 'No'
            row['has_exam_data'] = 'Yes' if row.get('in_class_exam_timing') == 'Yes' else 'No'
            
            missing_report.append(row)
        
        # Save report
        report_df = pd.DataFrame(missing_report)
        report_path = os.path.join(output_dir, 'class_coverage_detailed_report.csv')
        report_df.to_csv(report_path, index=False)
        print(f"✅ Detailed report saved to: {report_path}")
        
        # Summary
        print("\n" + "="*70)
        print("📊 SUMMARY")
        print("="*70)
        print(f"Total classes analyzed: {total_classes}")
        print(f"Report generated: class_coverage_detailed_report.csv")
        print("="*70)
        
    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()

# Usage:
check_class_coverage_standalone('script_output')
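
The coverage analysis above rests on three set operations between the class IDs in `new_classes.csv` and the `class_id` values found in each related table. The following is a small sketch of that logic on in-memory data. The function name `split_coverage` and the sample IDs are hypothetical and only illustrate the idea.

```python
def split_coverage(parent_ids, child_ids):
    """Compare parent IDs with the IDs referenced by a child table."""
    parent_ids, child_ids = set(parent_ids), set(child_ids)
    return {
        "covered": parent_ids & child_ids,    # classes that have child rows
        "uncovered": parent_ids - child_ids,  # classes with no child rows
        "orphans": child_ids - parent_ids,    # child rows pointing at unknown classes
    }


# Hypothetical sample: c1 has no timing rows, c9 is not a known class
result = split_coverage(["c1", "c2", "c3"], ["c2", "c3", "c9"])
print(sorted(result["covered"]))
print(sorted(result["uncovered"]))
print(sorted(result["orphans"]))
```

Keeping orphans separate from covered IDs mirrors the report: orphans indicate a data problem in the child table, while uncovered classes may simply have no timing, exam, or bidding data yet.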

In [None]:
def check_duplicates_in_tables(output_dir='script_output'):
    """Check for duplicate records in class_availability, class_exam_timing, class_timing, and bid_result"""
    import pandas as pd
    import os
    
    print("\n" + "="*70)
    print("🔍 DUPLICATE RECORDS ANALYSIS")
    print("="*70)
    
    # Tables to check with their unique key combinations
    tables_to_check = {
        'new_class_availability.csv': {
            'name': 'class_availability',
            'key_columns': ['class_id', 'bid_window_id'],
            'description': 'class_id + bid_window_id'
        },
        'new_bid_result.csv': {
            'name': 'bid_result',
            'key_columns': ['bid_window_id', 'class_id'],
            'description': 'bid_window_id + class_id'
        },
        'new_class_timing.csv': {
            'name': 'class_timing',
            'key_columns': None,  # No composite key, check for exact duplicates
            'description': 'all columns (no defined unique key)'
        },
        'new_class_exam_timing.csv': {
            'name': 'class_exam_timing',
            'key_columns': None,  # No composite key, check for exact duplicates
            'description': 'all columns (no defined unique key)'
        }
    }
    
    all_duplicates = {}
    
    for filename, config in tables_to_check.items():
        file_path = os.path.join(output_dir, filename)
        table_name = config['name']
        
        print(f"\n📊 Checking {table_name}...")
        print("-" * 50)
        
        if not os.path.exists(file_path):
            print(f"⚠️  {filename} not found - skipping")
            continue
        
        try:
            df = pd.read_csv(file_path)
            total_rows = len(df)
            print(f"Total rows: {total_rows}")
            
            if total_rows == 0:
                print("❌ No data in file")
                continue
            
            duplicates_found = []
            
            if config['key_columns']:
                # Check for duplicates based on composite key
                key_cols = config['key_columns']
                
                # Verify columns exist
                missing_cols = [col for col in key_cols if col not in df.columns]
                if missing_cols:
                    print(f"❌ Missing required columns: {missing_cols}")
                    continue
                
                # Find duplicates
                duplicated_mask = df.duplicated(subset=key_cols, keep=False)
                duplicates = df[duplicated_mask].copy()
                
                if len(duplicates) > 0:
                    # Sort by key columns for better visualization
                    duplicates = duplicates.sort_values(by=key_cols)
                    
                    # Group duplicates
                    duplicate_groups = duplicates.groupby(key_cols).size().reset_index(name='count')
                    num_duplicate_groups = len(duplicate_groups)
                    
                    print(f"❌ Found {len(duplicates)} duplicate rows")
                    print(f"   Duplicate groups: {num_duplicate_groups}")
                    print(f"   Unique constraint violated: {config['description']}")
                    
                    # Show sample duplicates
                    print("\n   Sample duplicate groups:")
                    for idx, group in duplicate_groups.head(5).iterrows():
                        key_values = {col: group[col] for col in key_cols}
                        print(f"   • {key_values} appears {group['count']} times")
                    
                    if num_duplicate_groups > 5:
                        print(f"   ... and {num_duplicate_groups - 5} more duplicate groups")
                    
                    duplicates_found = duplicates
                else:
                    print(f"✅ No duplicates found on composite key: {config['description']}")
            
            else:
                # Check for exact row duplicates (all columns)
                duplicated_mask = df.duplicated(keep=False)
                duplicates = df[duplicated_mask].copy()
                
                if len(duplicates) > 0:
                    print(f"❌ Found {len(duplicates)} duplicate rows (exact matches)")
                    
                    # Show sample duplicates
                    print("\n   Sample duplicate rows:")
                    for idx, row in duplicates.head(5).iterrows():
                        print(f"   • Row {idx}: class_id={row.get('class_id', 'N/A')}")
                        if 'start_time' in row:
                            print(f"     Timing: {row.get('day_of_week', '')} {row.get('start_time', '')}-{row.get('end_time', '')}")
                        if 'date' in row:
                            print(f"     Exam: {row.get('date', '')} {row.get('start_time', '')}-{row.get('end_time', '')}")
                    
                    duplicates_found = duplicates
                else:
                    print(f"✅ No exact duplicate rows found")
            
            # Additional checks for timing tables
            if table_name == 'class_timing' and len(df) > 0:
                # Check for same class with overlapping timings
                print("\n🔍 Checking for overlapping class timings...")
                if all(col in df.columns for col in ['class_id', 'day_of_week', 'start_time', 'end_time']):
                    overlap_issues = []
                    for class_id in df['class_id'].unique():
                        class_timings = df[df['class_id'] == class_id]
                        if len(class_timings) > 1:
                            # Check each pair of timings
                            timings_list = class_timings.to_dict('records')
                            for i in range(len(timings_list)):
                                for j in range(i + 1, len(timings_list)):
                                    t1 = timings_list[i]
                                    t2 = timings_list[j]
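                                    if (t1['day_of_week'] == t2['day_of_week'] and 
                                        pd.notna(t1['day_of_week']) and pd.notna(t2['day_of_week'])):
                                        # Same day: flag only when the two time ranges overlap.
                                        # Times that cannot be parsed become NaT and are never flagged.
                                        s1, e1, s2, e2 = (
                                            pd.to_datetime(t[k], errors='coerce')
                                            for t in (t1, t2) for k in ('start_time', 'end_time')
                                        )
                                        if s1 < e2 and s2 < e1:
                                            overlap_issues.append({
                                                'class_id': class_id,
                                                'day': t1['day_of_week'],
                                                'timing1': f"{t1['start_time']}-{t1['end_time']}",
                                                'timing2': f"{t2['start_time']}-{t2['end_time']}"
                                            })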
                    
                    if overlap_issues:
                        print(f"⚠️  Found {len(overlap_issues)} potential timing conflicts")
                        for issue in overlap_issues[:3]:
                            print(f"   • Class {issue['class_id']} on {issue['day']}: {issue['timing1']} and {issue['timing2']}")
                    else:
                        print("✅ No overlapping timings found")
            
            # Store results
            if len(duplicates_found) > 0:
                all_duplicates[table_name] = {
                    'dataframe': duplicates_found,
                    'total_duplicates': len(duplicates_found),
                    'key_columns': config['key_columns'],
                    'description': config['description']
                }
            
        except Exception as e:
            print(f"❌ Error processing {filename}: {e}")
            import traceback
            traceback.print_exc()
    
    # Export duplicate reports
    print("\n" + "="*70)
    print("💾 EXPORTING DUPLICATE REPORTS")
    print("="*70)
    
    if all_duplicates:
        for table_name, dup_info in all_duplicates.items():
            output_filename = f"duplicates_{table_name}.csv"
            output_path = os.path.join(output_dir, output_filename)
            
            # Add duplicate group numbers
            df = dup_info['dataframe']
            if dup_info['key_columns']:
                # Add group number for easier identification
                df['duplicate_group'] = df.groupby(dup_info['key_columns']).ngroup() + 1
                df = df.sort_values(by=['duplicate_group'] + dup_info['key_columns'])
            
            df.to_csv(output_path, index=False)
            print(f"✅ Exported {dup_info['total_duplicates']} duplicate records to: {output_filename}")
    else:
        print("✅ No duplicates found in any table!")
    
    # Summary
    print("\n" + "="*70)
    print("📊 SUMMARY")
    print("="*70)
    
    total_duplicates = sum(info['total_duplicates'] for info in all_duplicates.values())
    print(f"Total duplicate records found: {total_duplicates}")
    
    if all_duplicates:
        print("\nDuplicates by table:")
        for table_name, info in all_duplicates.items():
            print(f"  • {table_name}: {info['total_duplicates']} duplicates on {info['description']}")
    
    print("="*70)

# Usage:
check_duplicates_in_tables('script_output')
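
The duplicate check relies on `DataFrame.duplicated` with `keep=False`, which marks every row of a duplicated key rather than only the second and later copies. The short sketch below shows the difference on made-up data and numbers the duplicate groups the same way the export step does. The `vacancy` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: (c1, 1) and (c3, 2) each appear twice
rows = pd.DataFrame({
    "class_id": ["c1", "c1", "c2", "c3", "c3", "c3"],
    "bid_window_id": [1, 1, 1, 2, 2, 3],
    "vacancy": [10, 12, 5, 0, 0, 7],
})
key = ["class_id", "bid_window_id"]

# keep=False flags every member of a duplicated key
all_copies = rows[rows.duplicated(subset=key, keep=False)]

# The default (keep='first') flags only the later copies
extra_copies = rows[rows.duplicated(subset=key)]

# Number each duplicate group so related rows are easy to find in a CSV
numbered = all_copies.assign(
    duplicate_group=all_copies.groupby(key).ngroup() + 1
)

print(len(all_copies), len(extra_copies))
print(numbered[key + ["duplicate_group"]])
```

Using `keep=False` for the report makes it possible to compare the conflicting rows side by side, whereas the default would be the natural choice when the goal is to drop the surplus copies.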