# **SMU Course Scraping Using Selenium**

<div style="background-color:#FFD700; padding:15px; border-radius:5px; border: 2px solid #FF4500;">
    
  <h1 style="color:#8B0000;">⚠️🚨 SCRAPE THIS DATA AT YOUR OWN RISK 🚨⚠️</h1>
  
  <p><strong>📌 If you need the data, please contact me directly.</strong> Only available to <strong>existing students</strong>.</p>

  <h3>🔗 📩 How to Get the Data?</h3>
  <p>📨 <strong>Reach out to me for access</strong> instead of scraping manually.</p>
  <p>Visit <a href='https://www.afterclass.io/'>AfterClass</a> to use the data for planning.</p>

</div>

<br>

### **Objective**
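This notebook scrapes SMU course details from the BOSS system using Selenium. The process involves:
1. Logging into the system manually, so that authentication (including Microsoft Authenticator) is completed by the user rather than by the script.
2. Iteratively scraping class detail pages for the specified academic years and terms and saving them as HTML files.
3. Indexing the valid HTML files in a CSV file and extracting their contents into structured Excel sheets.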

The data is then ingested into [AfterClass.io](https://www.afterclass.io/) to serve students.

### **Script Structure**
1. **Setup**: Import libraries and initialize Selenium WebDriver.
2. **Login**: Wait for manual login and authentication.
3. **Scraping Logic**:
    - `BOSSClassScraper.scrape_and_save_html`: Saves the class detail page for each class number in the requested academic years and terms.
    - `BOSSClassScraper.run_full_scraping_process`: Manages the whole run: WebDriver setup, manual login, scraping, and CSV index generation.
4. **Execution**: Log in and start scraping.


---

## **1. Setup**

In [None]:
import os
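# PGGSSENCMODE is a libpq setting; 'disable' turns off GSSAPI encryption for
# psycopg2 connections. Set it before any connection is opened.
os.environ['PGGSSENCMODE'] = 'disable'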

import re
import csv
import time
import pandas as pd
import random
import glob
import win32com.client as win32
from collections import defaultdict
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from webdriver_manager.chrome import ChromeDriverManager
from pathlib import Path
import uuid
import logging
import psycopg2
from typing import List, Optional, Tuple
from collections import Counter
from dotenv import load_dotenv
import webbrowser


## **2. Scrape all BOSS data**

### **BOSS Class Scraper Summary**

#### **What This Code Does**
The `BOSSClassScraper` class automates the extraction of class timing data from SMU's BOSS system. It systematically scrapes class details across multiple academic terms and saves them as HTML files for further processing.

**Key Features:**
- **Automated Web Scraping**: Navigates through BOSS class detail pages using Selenium WebDriver
- **Resume Capability**: Automatically detects existing scraped files and continues from the last scraped class number, preventing duplicate work
- **Flexible Term Range**: Dynamically derives academic years from input parameters (e.g., '2025-26_T1' to '2028-29_T2') rather than hardcoded lists
- **Bounded Class-Number Scan**: Scans class numbers 1000 to 5000 for each term and stops early after 300 consecutive empty records
- **Progress Tracking**: Monitors existing files and resumes scraping from the highest class number found for each term
- **Data Organization**: Saves HTML files in structured directories by academic term (`script_input/classTimingsFull/`)
- **Incremental CSV Updates**: Appends only new valid files to the existing CSV index, avoiding duplicates
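
The early-termination rule from the list above can be sketched in isolation. This is a minimal illustration, not the scraper itself: `has_record` is a hypothetical stand-in for loading a BOSS class page, and the small threshold is chosen only to keep the demo short.

```python
from typing import Callable, List


def scan_class_numbers(
    has_record: Callable[[int], bool],
    start: int = 1000,
    stop: int = 5000,
    empty_threshold: int = 300,
) -> List[int]:
    """Return class numbers that have data, stopping early after a long empty run."""
    found = []
    consecutive_empty = 0
    for class_num in range(start, stop + 1):
        if has_record(class_num):
            found.append(class_num)
            consecutive_empty = 0  # any hit resets the empty run
        else:
            consecutive_empty += 1
            if consecutive_empty >= empty_threshold:
                break  # assume the term has no further classes
    return found


# Hypothetical data: classes exist at 1000, 1001 and 1005 only.
fake_records = {1000, 1001, 1005}
result = scan_class_numbers(lambda n: n in fake_records, empty_threshold=5)
print(result)
```

A gap shorter than the threshold (1002 to 1004 here) is tolerated; the scan only ends once the empty run reaches the threshold or the upper bound is hit.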

#### **What Is Required**

**Technical Dependencies:**
- Python packages: `selenium`, `webdriver-manager`, standard libraries (`os`, `time`, `csv`, `re`)
- Chrome browser and ChromeDriver (auto-managed)
- Network access to SMU's BOSS system

**User Requirements:**
- **Manual Authentication**: User must manually log in and complete the Microsoft Authenticator process when prompted
- **SMU Credentials**: Valid access to the BOSS system
- **Directory Structure**: Code creates `script_input/classTimingsFull/` for HTML files and `script_input/scraped_filepaths.csv` for the file index
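
As a rough sketch of how a term such as `'2025-26_T1'` maps onto the saved file layout: the term-code mapping and filename pattern below mirror the class code in this notebook, while the helper names `boss_term_code` and `html_relative_path` are hypothetical and exist only for this illustration.

```python
from pathlib import PurePosixPath

# Term suffix -> BOSS term code, as used by BOSSClassScraper.
TERM_CODE_MAP = {'T1': '10', 'T2': '20', 'T3A': '31', 'T3B': '32'}


def boss_term_code(ay_term: str) -> str:
    """Convert '2025-26_T1' into the 4-digit code used in BOSS URLs, e.g. '2510'."""
    ay, term = ay_term.split('_')
    return f"{ay[2:4]}{TERM_CODE_MAP[term]}"


def html_relative_path(ay_term: str, class_num: int,
                       base_dir: str = 'script_input/classTimingsFull') -> str:
    """Build the path under which one class detail page is stored."""
    filename = (
        f"SelectedAcadTerm={boss_term_code(ay_term)}"
        f"&SelectedClassNumber={class_num:04}.html"
    )
    # PurePosixPath keeps the demo output identical on every OS.
    return str(PurePosixPath(base_dir) / ay_term / filename)


example_path = html_relative_path('2025-26_T1', 1000)
print(example_path)
```

Each term gets its own folder, so the files for one term can be listed or deleted without touching the others.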

**Resume Functionality:**
- **Interruption Handling**: If scraping stops halfway due to network issues or manual interruption, the next run resumes each term from the class number after the highest one already saved
- **Duplicate Prevention**: Existing files are automatically detected and skipped, preventing re-downloading of already scraped data
- **Natural Termination**: Stops a term after 300 consecutive empty records (or at class number 5000, whichever comes first) to handle gaps in BOSS class numbering
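
The resume behaviour can be sketched on its own, assuming the folder layout and filename pattern described above. The function name `last_scraped_per_term` is hypothetical, and the demo uses a temporary directory instead of real scraped files.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict

CLASS_NUM_RE = re.compile(r'SelectedClassNumber=(\d+)\.html$')


def last_scraped_per_term(base_dir: Path) -> Dict[str, int]:
    """Map each term folder to the highest class number saved in it."""
    progress: Dict[str, int] = {}
    if not base_dir.is_dir():
        return progress
    for term_dir in base_dir.iterdir():
        if not term_dir.is_dir():
            continue
        numbers = [
            int(m.group(1))
            for f in term_dir.glob('*.html')
            if (m := CLASS_NUM_RE.search(f.name))
        ]
        if numbers:
            progress[term_dir.name] = max(numbers)
    return progress


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    term_dir = Path(tmp) / '2025-26_T1'
    term_dir.mkdir()
    for n in (1000, 1001, 1007):
        (term_dir / f'SelectedAcadTerm=2510&SelectedClassNumber={n}.html').touch()
    progress = last_scraped_per_term(Path(tmp))
    next_class_num = progress['2025-26_T1'] + 1

print(progress, next_class_num)
```

Because only pages with data are saved, a resumed run starts just after the last saved file and re-checks any empty class numbers that follow it.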

**Usage in Jupyter Notebook:**
```python
scraper = BOSSClassScraper()
# Will automatically resume from previous progress if files exist
success = scraper.run_full_scraping_process('2025-26_T1', '2025-26_T3B')
```
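
To see which terms a start/end pair covers, the range expansion can be sketched as a standalone function. This is a simplified illustration under one stated difference: the hypothetical `expand_term_range` raises `ValueError` for an unknown term, whereas the class falls back to the full range.

```python
from typing import List

TERMS = ['T1', 'T2', 'T3A', 'T3B']


def expand_term_range(start_ay_term: str, end_ay_term: str) -> List[str]:
    """List every AY_TERM from start to end, inclusive, in chronological order."""
    start_year = int(start_ay_term[:4])
    end_year = int(end_ay_term[:4])
    all_ay_terms = [
        f"{year}-{(year + 1) % 100:02d}_{term}"
        for year in range(start_year, end_year + 1)
        for term in TERMS
    ]
    start_idx = all_ay_terms.index(start_ay_term)  # ValueError if not a valid term
    end_idx = all_ay_terms.index(end_ay_term)
    return all_ay_terms[start_idx:end_idx + 1]


terms = expand_term_range('2025-26_T2', '2026-27_T1')
print(terms)
```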

In [None]:
class BOSSClassScraper:
    """
    A class to scrape class details from BOSS (SMU's online class registration system)
    and save them as HTML files for further processing with resume capability.
    """
    
    def __init__(self):
        """
        Initialize the BOSS Class Scraper with configuration parameters.
        """
        self.term_code_map = {'T1': '10', 'T2': '20', 'T3A': '31', 'T3B': '32'}
        self.all_terms = ['T1', 'T2', 'T3A', 'T3B']
        self.driver = None
        self.min_class_number = 1000
        self.max_class_number = 5000
        self.consecutive_empty_threshold = 300
        
    def _derive_academic_years(self, start_ay_term, end_ay_term):
        """
        Derive academic years from start and end terms.
        
        Args:
            start_ay_term: Starting term (e.g., '2025-26_T1')
            end_ay_term: Ending term (e.g., '2028-29_T2')
            
        Returns:
            List of academic years in format ['2025-26', '2026-27', ...]
        """
        start_year = int(start_ay_term[:4])
        end_year = int(end_ay_term[:4])
        
        academic_years = []
        for year in range(start_year, end_year + 1):
            next_year = (year + 1) % 100
            ay = f"{year}-{next_year:02d}"
            academic_years.append(ay)
            
        return academic_years
    
    def _get_existing_files_progress(self, base_dir):
        """
        Check existing files and determine the last scraped position for each term.
        
        Args:
            base_dir: Base directory where HTML files are stored
            
        Returns:
            Dictionary with term as key and last scraped class number as value
        """
        progress = {}
        
        if not os.path.exists(base_dir):
            return progress
            
        for term_folder in os.listdir(base_dir):
            term_path = os.path.join(base_dir, term_folder)
            if os.path.isdir(term_path):
                max_class_num = 0
                
                for filename in os.listdir(term_path):
                    if filename.endswith('.html'):
                        # Extract class number from filename
                        # Format: SelectedAcadTerm=XXYY&SelectedClassNumber=ZZZZ.html
                        match = re.search(r'SelectedClassNumber=(\d+)\.html', filename)
                        if match:
                            class_num = int(match.group(1))
                            max_class_num = max(max_class_num, class_num)
                
                if max_class_num > 0:
                    progress[term_folder] = max_class_num
                    print(f"Found existing progress for {term_folder}: last class number {max_class_num}")
        
        return progress
    
    def wait_for_manual_login(self):
        """
        Wait for manual login and Microsoft Authenticator process completion.
        """
        print("Please log in manually and complete the Microsoft Authenticator process.")
        print("Waiting for BOSS dashboard to load...")
        
        wait = WebDriverWait(self.driver, 120)
        
        try:
            wait.until(EC.presence_of_element_located((By.ID, "Label_UserName")))
            wait.until(EC.presence_of_element_located((By.XPATH, "//a[contains(text(),'Sign out')]")))
            
            username = self.driver.find_element(By.ID, "Label_UserName").text
            print(f"Login successful! Logged in as {username}")
            
        except TimeoutException:
            print("Login failed or timed out. Could not detect login elements.")
            raise Exception("Login failed")
        
        time.sleep(1)
    
    def scrape_and_save_html(self, start_ay_term='2025-26_T1', end_ay_term='2025-26_T1', base_dir='script_input/classTimingsFull'):
        """
        Scrapes class details from BOSS and saves them as HTML files with resume capability.
        
        Args:
            start_ay_term: Starting academic year and term (e.g., '2025-26_T1')
            end_ay_term: Ending academic year and term (e.g., '2025-26_T1')
            base_dir: Base directory to save the HTML files
        """
        # Check existing progress
        existing_progress = self._get_existing_files_progress(base_dir)
        
        # Derive academic years from input terms
        all_academic_years = self._derive_academic_years(start_ay_term, end_ay_term)
        
        # Generate all possible AY_TERM combinations
        all_ay_terms = []
        for ay in all_academic_years:
            for term in self.all_terms:
                all_ay_terms.append(f"{ay}_{term}")
        
        # Find the indices of the start and end terms
        try:
            start_idx = all_ay_terms.index(start_ay_term)
            end_idx = all_ay_terms.index(end_ay_term)
        except ValueError:
            print("Invalid start or end term provided. Using full range.")
            start_idx = 0
            end_idx = len(all_ay_terms) - 1
        
        # Select the range to scrape
        ay_terms_to_scrape = all_ay_terms[start_idx:end_idx+1]
        
        # Create base directory if needed
        os.makedirs(base_dir, exist_ok=True)
        
        # Process each AY_TERM
        for ay_term in ay_terms_to_scrape:
            print(f"Processing {ay_term}...")
            
            # Parse AY_TERM for URL
            ay, term = ay_term.split('_')
            ay_short = ay[2:4]  # last two digits of first year
            term_code = self.term_code_map.get(term, '10')
            
            # Create folder for AY_TERM
            folder_path = os.path.join(base_dir, ay_term)
            os.makedirs(folder_path, exist_ok=True)
            
            # Determine starting class number based on existing progress
            start_class_num = self.min_class_number
            if ay_term in existing_progress:
                start_class_num = existing_progress[ay_term] + 1
                print(f"Resuming {ay_term} from class number {start_class_num}")
            
            consecutive_empty = 0
            
            # Scrape each class number in range
            for class_num in range(start_class_num, self.max_class_number + 1):
                # Check if file already exists
                filename = f"SelectedAcadTerm={ay_short}{term_code}&SelectedClassNumber={class_num:04}.html"
                filepath = os.path.join(folder_path, filename)
                
                if os.path.exists(filepath):
                    print(f"File already exists: {filepath}, skipping...")
                    consecutive_empty = 0  # Reset counter since we have data
                    continue
                
                url = f"https://boss.intranet.smu.edu.sg/ClassDetails.aspx?SelectedClassNumber={class_num:04}&SelectedAcadTerm={ay_short}{term_code}&SelectedAcadCareer=UGRD"
                
                try:
                    self.driver.get(url)
                    
                    wait = WebDriverWait(self.driver, 15)
                    try:
                        element = wait.until(EC.any_of(
                            EC.presence_of_element_located((By.ID, "lblClassInfoHeader")),
                            EC.presence_of_element_located((By.ID, "lblErrorDetails"))
                        ))
                        
                        error_elements = self.driver.find_elements(By.ID, "lblErrorDetails")
                        has_data = True
                        
                        for error in error_elements:
                            if "No record found" in error.text:
                                has_data = False
                                break
                                
                    except Exception as e:
                        print(f"Wait error: {e}")
                        has_data = False
                    
                    if not has_data:
                        consecutive_empty += 1
                        print(f"No record found for {ay_term}, class {class_num:04}. Consecutive empty: {consecutive_empty}")
                        
                        if consecutive_empty >= self.consecutive_empty_threshold:
                            print(f"{self.consecutive_empty_threshold} consecutive empty records reached for {ay_term}, moving on.")
                            break
                        
                        time.sleep(2)
                        continue
                    
                    # Reset consecutive empty counter if data found
                    consecutive_empty = 0
                    
                    # Save HTML file
                    with open(filepath, 'w', encoding='utf-8') as f:
                        f.write(self.driver.page_source)
                    
                    print(f"Saved {filepath}")
                    time.sleep(2)
                    
                except Exception as e:
                    print(f"Error processing {url}: {str(e)}")
                    time.sleep(5)
        
        print("Scraping completed.")
    
    def generate_scraped_filepaths_csv(self, base_dir='script_input/classTimingsFull', output_csv='script_input/scraped_filepaths.csv'):
        """
        Generates a CSV file with paths to all valid HTML files (those without "No record found").
        Updates existing CSV by appending new valid files.
        
        Args:
            base_dir: Base directory where HTML files are stored
            output_csv: Path to the output CSV file
            
        Returns:
            Path to the generated CSV file or None if error
        """
        # Read existing filepaths if CSV exists
        existing_filepaths = set()
        if os.path.exists(output_csv):
            try:
                with open(output_csv, 'r', encoding='utf-8') as csvfile:
                    reader = csv.reader(csvfile)
                    next(reader, None)  # Skip header; tolerate an empty CSV
                    for row in reader:
                        if row:
                            existing_filepaths.add(row[0])
                print(f"Found {len(existing_filepaths)} existing filepaths in CSV")
            except Exception as e:
                print(f"Error reading existing CSV: {str(e)}")
        
        filepaths = []
        
        if not os.path.exists(base_dir):
            print(f"Directory '{base_dir}' does not exist.")
            return None
        
        # Ensure output directory exists (dirname is '' when output_csv has no folder part)
        output_dir = os.path.dirname(output_csv)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)
        
        # Walk through directory structure
        for root, dirs, files in os.walk(base_dir):
            for file in files:
                if file.endswith('.html'):
                    filepath = os.path.join(root, file)
                    
                    # Skip if already in existing filepaths
                    if filepath in existing_filepaths:
                        continue
                        
                    try:
                        with open(filepath, 'r', encoding='utf-8') as f:
                            content = f.read()
                            if 'No record found' not in content:
                                filepaths.append(filepath)
                    except Exception as e:
                        print(f"Error reading file {filepath}: {str(e)}")
        
        # Append new filepaths to CSV
        mode = 'a' if existing_filepaths else 'w'
        with open(output_csv, mode, newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            if not existing_filepaths:  # Write header only if new file
                writer.writerow(['Filepath'])
            for path in filepaths:
                writer.writerow([path])
        
        total_valid_files = len(existing_filepaths) + len(filepaths)
        print(f"CSV updated with {len(filepaths)} new valid file paths. Total: {total_valid_files} files at {output_csv}")
        return output_csv
    
    def run_full_scraping_process(self, start_ay_term='2025-26_T1', end_ay_term='2025-26_T1'):
        """
        Run the complete scraping process from login to CSV generation with resume capability.
        
        Args:
            start_ay_term: Starting academic year and term
            end_ay_term: Ending academic year and term
            
        Returns:
            True if successful, False otherwise
        """
        try:
            # Set up WebDriver
            options = webdriver.ChromeOptions()
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=options)
            
            # Navigate to login page and wait for manual login
            self.driver.get("https://boss.intranet.smu.edu.sg/")
            self.wait_for_manual_login()
            
            # Run the main scraping function
            self.scrape_and_save_html(start_ay_term, end_ay_term)
            
            # Generate CSV with valid file paths
            self.generate_scraped_filepaths_csv()
            
            return True
            
        except Exception as e:
            print(f"Error during scraping process: {str(e)}")
            return False
            
        finally:
            if self.driver:
                self.driver.quit()
                self.driver = None
            print("Process completed!")

In [None]:
# Run the scraper
scraper = BOSSClassScraper()
success = scraper.run_full_scraping_process('2025-26_T1', '2025-26_T1')


---

## **3. Extract Data from HTML Files**

### **HTML Data Extractor Summary**

#### **What This Code Does**
The `HTMLDataExtractor` class processes previously scraped HTML files from SMU's BOSS system and extracts structured data into Excel format. It systematically parses course information, class timings, academic terms, and exam schedules from local HTML files without requiring network access or authentication.

**Key Features:**
- **Local File Processing**: Uses Selenium WebDriver to parse local HTML files without network connectivity requirements
- **Comprehensive Data Extraction**: Extracts course details, academic terms, class timings, exam schedules, grading information, and professor names
- **Test-First Approach**: Includes `run_test()` function to validate extraction logic on a small sample before processing all files
- **Structured Output**: Organizes extracted data into two data sheets - standalone records (one per HTML file) and multiple records (class/exam timings)
- **Error Tracking**: Captures and logs parsing errors in a separate sheet for debugging and quality assurance
- **Flexible Data Parsing**: Handles multiple academic term naming conventions and date formats used across different years
- **Record Linking**: Uses record keys to maintain relationships between standalone and multiple data records

#### **What Is Required**

**Technical Dependencies:**
- Python packages: `selenium`, `webdriver-manager`, `pandas`, `openpyxl`, standard libraries (`os`, `re`, `datetime`, `pathlib`)
- Chrome browser and ChromeDriver (auto-managed)
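- No BOSS login or network access required for parsing (processes local files only); `webdriver-manager` may still need a connection to download ChromeDriver on first use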

**Input Requirements:**
- **Scraped HTML Files**: Previously downloaded HTML files from BOSS system stored locally
- **File Path CSV**: `script_input/scraped_filepaths.csv` containing paths to valid HTML files
- **Directory Structure**: HTML files organized in the expected folder structure (typically `script_input/classTimingsFull/`)
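
The file index is a one-column CSV with a `Filepath` header, as written by the scraper. A minimal sketch of reading it back with the standard library is shown below; the helper name `read_filepath_index` is hypothetical and the demo writes a tiny index into a temporary directory.

```python
import csv
import os
import tempfile
from typing import List


def read_filepath_index(csv_path: str) -> List[str]:
    """Return the HTML file paths listed in the index CSV (empty list if missing)."""
    if not os.path.exists(csv_path):
        return []
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the 'Filepath' header; tolerate an empty file
        return [row[0] for row in reader if row]


# Demo: write a small index, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, 'scraped_filepaths.csv')
    with open(index_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Filepath'])
        writer.writerow(['a/SelectedAcadTerm=2510&SelectedClassNumber=1000.html'])
        writer.writerow(['a/SelectedAcadTerm=2510&SelectedClassNumber=1001.html'])
    paths = read_filepath_index(index_path)

print(paths)
```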

**Output Structure:**
- **Excel File**: `script_input/raw_data.xlsx` (or custom path) with multiple sheets:
  - `standalone`: One record per HTML file with course and class information
  - `multiple`: Multiple records for class timings and exam schedules
  - `errors`: Parsing errors and problematic files for debugging

**Data Extraction Capabilities:**
- **Course Information**: Course codes, names, descriptions, credit units, course areas, enrollment requirements
- **Academic Terms**: Term IDs, academic years, start/end dates, BOSS IDs
- **Class Details**: Sections, grading basis, course outline URLs, professor names
- **Timing Data**: Class schedules, exam dates, venues, day-of-week information
- **Cross-References**: Maintains linking keys between related records across sheets
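
As an illustration of how a term label becomes a term ID, the sketch below reduces the notebook's term parsing to its core. It is a simplification: the keyword table and the `AY<start year><end year><term>` format follow the class code, but the hypothetical `acad_term_id` omits the folder-path fallback that the class uses for a generic "Term 3".

```python
import re
from typing import Optional

# Checked in order, so the specific 'term 3a'/'term 3b' labels are tried first.
TERM_KEYWORDS = [
    ('term 3a', 'T3A'),
    ('term 3b', 'T3B'),
    ('term 1', 'T1'), ('session 1', 'T1'), ('august term', 'T1'),
    ('term 2', 'T2'), ('session 2', 'T2'), ('january term', 'T2'),
]


def acad_term_id(term_text: str) -> Optional[str]:
    """Turn text like '2021-22 Term 2' into an ID like 'AY202122T2'."""
    match = re.search(r'(\d{4})-(\d{2})\s+(.*)', term_text)
    if not match:
        return None
    start_year, end_short = match.group(1), match.group(2)
    description = match.group(3).lower()
    for keyword, code in TERM_KEYWORDS:
        if keyword in description:
            return f"AY{start_year}{end_short}{code}"
    return None  # label not recognised


for label in ('2021-22 Term 2', '2024-25 Term 3A', '2019-20 Session 1'):
    print(label, '->', acad_term_id(label))
```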

**Usage in Jupyter Notebook:**
```python
# Initialize extractor
extractor = HTMLDataExtractor()

# Test with sample files first (recommended)
test_success = extractor.run_test(test_count=10)

if test_success:
    # Run full extraction
    extractor.run()
    
# Or run directly without testing
extractor.run(
    scraped_filepaths_csv='script_input/scraped_filepaths.csv',
    output_path='script_input/raw_data.xlsx'
)
```

The class provides a crucial intermediate step between raw HTML scraping and database insertion, creating clean, structured data that can be further processed for database integration or analysis.

In [None]:
class HTMLDataExtractor:
    """
    Extract raw data from scraped HTML files and save to Excel format using Selenium
    """
    
    def __init__(self):
        self.standalone_data = []
        self.multiple_data = []
        self.errors = []
        self.driver = None
        
    def setup_selenium_driver(self):
        """Set up Selenium WebDriver for local file access"""
        try:
            options = Options()
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            options.add_argument('--headless')  # Run in headless mode for efficiency
            options.add_argument('--disable-gpu')
            
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=options)
            print("Selenium WebDriver initialized successfully")
        except Exception as e:
            print(f"Failed to initialize Selenium WebDriver: {e}")
            raise
    
    def safe_find_element_text(self, by, value):
        """Safely find element and return its text with proper encoding handling"""
        try:
            element = self.driver.find_element(by, value)
            if element:
                raw_text = element.text.strip()
                return self.clean_text_encoding(raw_text)
            return None
        except Exception:
            return None
    
    def safe_find_element_attribute(self, by, value, attribute):
        """Safely find element and return its attribute with proper encoding handling"""
        try:
            element = self.driver.find_element(by, value)
            if element:
                raw_attr = element.get_attribute(attribute)
                return self.clean_text_encoding(raw_attr) if raw_attr else None
            return None
        except Exception:
            return None
    
    def convert_date_to_timestamp(self, date_str):
        """Convert DD-Mmm-YYYY to database timestamp format"""
        try:
            date_obj = datetime.strptime(date_str, '%d-%b-%Y')
            return date_obj.strftime('%Y-%m-%d 00:00:00.000 +0800')
        except Exception as e:
            return None
    
    def parse_acad_term(self, term_text, filepath=None):
        """Parse academic term text and return structured data with folder path fallback"""
        try:
            # Clean the term text first
            if term_text:
                term_text = self.clean_text_encoding(term_text)
            
            # Pattern like "2021-22 Term 2" or "2021-22 Session 1"
            pattern = r'(\d{4})-(\d{2})\s+(.*)'
            match = re.search(pattern, term_text) if term_text else None
            
            if not match:
                return None, None, None, None
            
            start_year = int(match.group(1))
            end_year_short = int(match.group(2))
            term_desc = match.group(3).lower()
            
            # Convert 2-digit year to 4-digit
            if end_year_short < 50:
                end_year = 2000 + end_year_short
            else:
                end_year = 1900 + end_year_short
            
            # Determine term code from text
            term_code = None
            if 'term 1' in term_desc or 'session 1' in term_desc or 'august term' in term_desc:
                term_code = 'T1'
            elif 'term 2' in term_desc or 'session 2' in term_desc or 'january term' in term_desc:
                term_code = 'T2'
            elif 'term 3a' in term_desc:
                term_code = 'T3A'
            elif 'term 3b' in term_desc:
                term_code = 'T3B'
            elif 'term 3' in term_desc:
                # Generic T3 - need to check folder path for A/B
                term_code = 'T3'
            
            # If term_code is incomplete or missing, use folder path as fallback
            if not term_code or term_code == 'T3':
                folder_term = self.extract_term_from_folder_path(filepath) if filepath else None
                if folder_term:
                    # If we have folder term, use it
                    if term_code == 'T3' and folder_term in ['T3A', 'T3B']:
                        term_code = folder_term
                    elif not term_code:
                        term_code = folder_term
            
            # If still no term code, return None
            if not term_code:
                return start_year, end_year, None, None
            
            acad_term_id = f"AY{start_year}{end_year_short:02d}{term_code}"
            
            return start_year, end_year, term_code, acad_term_id
        except Exception as e:
            return None, None, None, None
    
    def parse_course_and_section(self, header_text):
        """Parse course code and section from header text with encoding fixes"""
        try:
            if not header_text:
                return None, None
            
            # Clean the text first
            clean_text = self.clean_text_encoding(header_text)
            clean_text = re.sub(r'<[^>]+>', '', clean_text)
            clean_text = re.sub(r'\s+', ' ', clean_text.strip())
            
            # Try multiple regex patterns
            patterns = [
                r'([A-Z0-9_-]+)\s+—\s+(.+)',  # Standard format with em-dash
                r'([A-Z0-9_-]+)\s+-\s+(.+)',  # Standard format with hyphen
                r'([A-Z]+)\s+(\d+[A-Z0-9_]*)\s+—\s+(.+)',  # Split format with em-dash
                r'([A-Z]+)\s+(\d+[A-Z0-9_]*)\s+-\s+(.+)',  # Split format with hyphen
                r'([A-Z0-9_\s-]+?)\s+—\s+([^—]+)',  # Flexible format with em-dash
                r'([A-Z0-9_\s-]+?)\s+-\s+([^-]+)',  # Flexible format with hyphen
            ]
            
            for pattern in patterns:
                match = re.match(pattern, clean_text)
                if match:
                    if len(match.groups()) == 2:
                        # Standard format: course_code - section
                        course_section = match.group(1).strip()
                        section_name = match.group(2).strip()
                        
                        # Extract section from the end of course_section if it's there
                        section_match = re.search(r'^(.+?)\s+([A-Z]\d+|G\d+|\d+)$', course_section)
                        if section_match:
                            course_code = section_match.group(1)
                            section = section_match.group(2)
                        else:
                            course_code = course_section
                            # Try to extract section from section_name
                            section_extract = re.search(r'([A-Z]\d+|G\d+|\d+)', section_name)
                            section = section_extract.group(1) if section_extract else None
                    else:
                        # Split format: course_prefix course_number - section_name
                        course_code = f"{match.group(1)}{match.group(2)}"
                        section_name = match.group(3).strip()
                        section_extract = re.search(r'([A-Z]\d+|G\d+|\d+)', section_name)
                        section = section_extract.group(1) if section_extract else None
                    
                    return course_code.strip() if course_code else None, section
            
            return None, None
        except Exception as e:
            return None, None
    
    def parse_date_range(self, date_text):
        """Parse date range text and return start and end timestamps"""
        try:
            # Example: "10-Jan-2022 to 01-May-2022"
            pattern = r'(\d{1,2}-\w{3}-\d{4})\s+to\s+(\d{1,2}-\w{3}-\d{4})'
            match = re.search(pattern, date_text)
            
            if not match:
                return None, None
            
            start_date = self.convert_date_to_timestamp(match.group(1))
            end_date = self.convert_date_to_timestamp(match.group(2))
            
            return start_date, end_date
        except Exception as e:
            return None, None
    
    def extract_course_areas_list(self):
        """Extract course areas with encoding fixes"""
        try:
            course_areas_element = self.driver.find_element(By.ID, 'lblCourseAreas')
            if not course_areas_element:
                return None
            
            # Get innerHTML to handle HTML content
            course_areas_html = course_areas_element.get_attribute('innerHTML')
            if course_areas_html:
                # Clean encoding first
                course_areas_html = self.clean_text_encoding(course_areas_html)
                
                # Extract list items
                areas_list = re.findall(r'<li[^>]*>([^<]+)</li>', course_areas_html)
                if areas_list:
                    # Clean each area and join
                    cleaned_areas = [self.clean_text_encoding(area.strip()) for area in areas_list]
                    return ', '.join(cleaned_areas)
                else:
                    # Fallback to text content
                    text_content = course_areas_element.text.strip()
                    return self.clean_text_encoding(text_content)
            else:
                # Fallback to text content
                text_content = course_areas_element.text.strip()
                return self.clean_text_encoding(text_content)
        except Exception:
            return None
    
    def extract_course_outline_url(self):
        """Extract course outline URL from HTML using Selenium"""
        try:
            onclick_attr = self.safe_find_element_attribute(By.ID, 'imgCourseOutline', 'onclick')
            if onclick_attr:
                url_match = re.search(r"window\.open\('([^']+)'", onclick_attr)
                if url_match:
                    return url_match.group(1)
        except Exception:
            pass
        return None
    
    def extract_boss_ids_from_filepath(self, filepath):
        """Extract BOSS IDs from filepath"""
        try:
            filename = os.path.basename(filepath)
            acad_term_match = re.search(r'SelectedAcadTerm=(\d+)', filename)
            class_match = re.search(r'SelectedClassNumber=(\d+)', filename)
            
            acad_term_boss_id = int(acad_term_match.group(1)) if acad_term_match else None
            class_boss_id = int(class_match.group(1)) if class_match else None
            
            return acad_term_boss_id, class_boss_id
        except Exception:
            return None, None
    
    def extract_meeting_information(self, record_key):
        """Extract class timing and exam timing information using Selenium"""
        try:
            meeting_table = self.driver.find_element(By.ID, 'RadGrid_MeetingInfo_ctl00')
            tbody = meeting_table.find_element(By.TAG_NAME, 'tbody')
            rows = tbody.find_elements(By.TAG_NAME, 'tr')
            
            for row in rows:
                cells = row.find_elements(By.TAG_NAME, 'td')
                if len(cells) < 7:
                    continue
                
                meeting_type = cells[0].text.strip()
                start_date_text = cells[1].text.strip()
                end_date_text = cells[2].text.strip()
                day_of_week = cells[3].text.strip()
                start_time = cells[4].text.strip()
                end_time = cells[5].text.strip()
                venue = cells[6].text.strip() if len(cells) > 6 else ""
                professor_name = cells[7].text.strip() if len(cells) > 7 else ""
                
                # Assume CLASS if meeting_type is empty
                if not meeting_type:
                    meeting_type = 'CLASS'
                
                if meeting_type == 'CLASS':
                    # Convert dates to timestamp format
                    start_date = self.convert_date_to_timestamp(start_date_text)
                    end_date = self.convert_date_to_timestamp(end_date_text)
                    
                    timing_record = {
                        'record_key': record_key,
                        'type': 'CLASS',
                        'start_date': start_date,
                        'end_date': end_date,
                        'day_of_week': day_of_week,
                        'start_time': start_time,
                        'end_time': end_time,
                        'venue': venue,
                        'professor_name': professor_name
                    }
                    self.multiple_data.append(timing_record)
                
                elif meeting_type == 'EXAM':
                    # For exams, use the second date (end_date_text) as the exam date
                    exam_date = self.convert_date_to_timestamp(end_date_text)
                    
                    exam_record = {
                        'record_key': record_key,
                        'type': 'EXAM',
                        'date': exam_date,
                        'day_of_week': day_of_week,
                        'start_time': start_time,
                        'end_time': end_time,
                        'venue': venue,
                        'professor_name': professor_name
                    }
                    self.multiple_data.append(exam_record)
        
        except Exception as e:
            self.errors.append({
                'record_key': record_key,
                'error': f'Error extracting meeting information: {str(e)}',
                'type': 'parse_error'
            })
    
    def process_html_file(self, filepath):
        """Process a single HTML file and extract all data using Selenium"""
        try:
            # Load HTML file
            html_file = Path(filepath).resolve()
            file_url = html_file.as_uri()
            self.driver.get(file_url)
            
            # Create unique record key
            record_key = f"{os.path.basename(filepath)}"
            
            # Extract basic information
            class_header_text = self.safe_find_element_text(By.ID, 'lblClassInfoHeader')
            if not class_header_text:
                self.errors.append({
                    'filepath': filepath,
                    'error': 'Missing class header',
                    'type': 'parse_error'
                })
                return False
            
            course_code, section = self.parse_course_and_section(class_header_text)
            
            # Extract academic term
            term_text = self.safe_find_element_text(By.ID, 'lblClassInfoSubHeader')
            acad_year_start, acad_year_end, term, acad_term_id = self.parse_acad_term(term_text, filepath) if term_text else (None, None, None, None)
            
            # Extract course information
            course_name = self.safe_find_element_text(By.ID, 'lblClassSection')
            course_description = self.safe_find_element_text(By.ID, 'lblCourseDescription')
            credit_units_text = self.safe_find_element_text(By.ID, 'lblUnits')
            course_areas = self.extract_course_areas_list()
            enrolment_requirements = self.safe_find_element_text(By.ID, 'lblEnrolmentRequirements')
            
            # Process credit units
            try:
                credit_units = float(credit_units_text) if credit_units_text else None
            except (ValueError, TypeError):
                credit_units = None
            
            # Extract grading basis
            grading_text = self.safe_find_element_text(By.ID, 'lblGradingBasis')
            grading_basis = None
            if grading_text:
                if grading_text.lower() == 'graded':
                    grading_basis = 'Graded'
                elif grading_text.lower() in ['pass/fail', 'pass fail']:
                    grading_basis = 'Pass/Fail'
                else:
                    grading_basis = 'NA'
            
            # Extract course outline URL
            course_outline_url = self.extract_course_outline_url()
            
            # Extract dates
            period_text = self.safe_find_element_text(By.ID, 'lblDates')
            start_dt, end_dt = self.parse_date_range(period_text) if period_text else (None, None)
            
            # Extract BOSS IDs
            acad_term_boss_id, class_boss_id = self.extract_boss_ids_from_filepath(filepath)
            
            # Create standalone record
            standalone_record = {
                'record_key': record_key,
                'filepath': filepath,
                'course_code': course_code,
                'section': section,
                'course_name': course_name,
                'course_description': course_description,
                'credit_units': credit_units,
                'course_area': course_areas,
                'enrolment_requirements': enrolment_requirements,
                'acad_term_id': acad_term_id,
                'acad_year_start': acad_year_start,
                'acad_year_end': acad_year_end,
                'term': term,
                'start_dt': start_dt,
                'end_dt': end_dt,
                'grading_basis': grading_basis,
                'course_outline_url': course_outline_url,
                'acad_term_boss_id': acad_term_boss_id,
                'class_boss_id': class_boss_id,
                'term_text': term_text,
                'period_text': period_text
            }
            
            self.standalone_data.append(standalone_record)
            
            # Extract meeting information
            self.extract_meeting_information(record_key)
            
            return True
            
        except Exception as e:
            self.errors.append({
                'filepath': filepath,
                'error': str(e),
                'type': 'processing_error'
            })
            return False
    
    def run_test(self, scraped_filepaths_csv='script_input/scraped_filepaths.csv', test_count=10):
        """Randomly test the extraction on a subset of files"""
        try:
            print(f"Starting test run with {test_count} randomly selected files...")

            # Reset data containers
            self.standalone_data = []
            self.multiple_data = []
            self.errors = []

            # Set up Selenium driver
            self.setup_selenium_driver()

            # Read the CSV file with file paths
            df = pd.read_csv(scraped_filepaths_csv)

            # Handle both 'Filepath' and 'filepath' column names
            filepath_column = 'Filepath' if 'Filepath' in df.columns else 'filepath'
            all_filepaths = df[filepath_column].dropna().tolist()

            if len(all_filepaths) == 0:
                raise ValueError("No valid filepaths found in CSV")

            # Randomly sample filepaths
            sample_size = min(test_count, len(all_filepaths))
            sampled_filepaths = random.sample(all_filepaths, sample_size)

            processed_files = 0
            successful_files = 0

            for i, filepath in enumerate(sampled_filepaths, start=1):
                if os.path.exists(filepath):
                    print(f"Processing test file {i}/{sample_size}: {os.path.basename(filepath)}")
                    if self.process_html_file(filepath):
                        successful_files += 1
                    processed_files += 1
                else:
                    self.errors.append({
                        'filepath': filepath,
                        'error': 'File not found',
                        'type': 'file_error'
                    })

            print(f"\nTest run complete: {successful_files}/{processed_files} files successful")
            print(f"Standalone records extracted: {len(self.standalone_data)}")
            print(f"Multiple records extracted: {len(self.multiple_data)}")
            if self.errors:
                print(f"Errors encountered: {len(self.errors)}")
                for error in self.errors[:3]:  # Show only the first 3 errors
                    print(f"  - {error['type']}: {error['error']}")

            # Save test results
            test_output_path = 'script_input/test_raw_data.xlsx'
            self.save_to_excel(test_output_path)

            return successful_files > 0

        except Exception as e:
            print(f"Error in test run: {e}")
            return False

        finally:
            if self.driver:
                self.driver.quit()
                print("Test selenium driver closed")
    
    def process_all_files(self, scraped_filepaths_csv='script_input/scraped_filepaths.csv'):
        """Process all files listed in the scraped filepaths CSV"""
        try:
            # Read the CSV file with file paths
            df = pd.read_csv(scraped_filepaths_csv)
            
            # Handle both 'Filepath' and 'filepath' column names
            filepath_column = 'Filepath' if 'Filepath' in df.columns else 'filepath'
            
            # Drop missing paths up front so NaN values never reach os.path.exists()
            filepaths = df[filepath_column].dropna().tolist()
            total_files = len(filepaths)
            processed_files = 0
            successful_files = 0
            
            print(f"Starting to process {total_files} files")
            
            for filepath in filepaths:
                
                if os.path.exists(filepath):
                    if self.process_html_file(filepath):
                        successful_files += 1
                    processed_files += 1
                    
                    if processed_files % 100 == 0:
                        print(f"Processed {processed_files}/{total_files} files")
                else:
                    self.errors.append({
                        'filepath': filepath,
                        'error': 'File not found',
                        'type': 'file_error'
                    })
            
            print(f"Processing complete: {successful_files}/{processed_files} files successful")
            
        except Exception as e:
            print(f"Error in process_all_files: {e}")
            raise
    
    def save_to_excel(self, output_path='script_input/raw_data.xlsx'):
        """Save extracted data to Excel file with two sheets"""
        try:
            # Ensure output directory exists
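            output_dir = os.path.dirname(output_path)
            if output_dir:  # os.makedirs('') raises when the path has no directory part
                os.makedirs(output_dir, exist_ok=True)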
            
            # Create DataFrames
            standalone_df = pd.DataFrame(self.standalone_data)
            multiple_df = pd.DataFrame(self.multiple_data)
            
            # Save to Excel with multiple sheets
            with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
                standalone_df.to_excel(writer, sheet_name='standalone', index=False)
                multiple_df.to_excel(writer, sheet_name='multiple', index=False)
                
                # Also save errors if any
                if self.errors:
                    errors_df = pd.DataFrame(self.errors)
                    errors_df.to_excel(writer, sheet_name='errors', index=False)
            
            print(f"Data saved to {output_path}")
            print(f"Standalone records: {len(self.standalone_data)}")
            print(f"Multiple records: {len(self.multiple_data)}")
            if self.errors:
                print(f"Errors: {len(self.errors)}")
            
        except Exception as e:
            print(f"Error saving to Excel: {e}")
            raise
    
    def run(self, scraped_filepaths_csv='script_input/scraped_filepaths.csv', output_path='script_input/raw_data.xlsx'):
        """Run the complete extraction process"""
        print("Starting HTML data extraction...")
        
        # Reset data containers
        self.standalone_data = []
        self.multiple_data = []
        self.errors = []
        
        # Set up Selenium driver
        self.setup_selenium_driver()
        
        try:
            # Process all files
            self.process_all_files(scraped_filepaths_csv)
            
            # Save to Excel
            self.save_to_excel(output_path)
            
            print("HTML data extraction completed!")
            
        finally:
            if self.driver:
                self.driver.quit()
                print("Selenium driver closed")

    def clean_text_encoding(self, text):
        """Clean text to fix encoding issues like â€" -> —"""
        if not text:
            return text
        
        # Common encoding fixes - ORDER MATTERS! Process longer patterns first
        encoding_fixes = [
            ('â€"', '—'),   # em-dash
            ('â€™', "'"),   # right single quotation mark
            ('â€œ', '"'),   # left double quotation mark
            ('â€¦', '…'),   # horizontal ellipsis
            ('â€¢', '•'),   # bullet
            ('â€‹', ''),    # zero-width space
            ('â€‚', ' '),   # en space
            ('â€ƒ', ' '),   # em space
            ('â€‰', ' '),   # thin space
            ('â€', '"'),    # right double quotation mark (shorter pattern, process last)
            ('Â', ''),      # non-breaking space artifacts
        ]
        
        cleaned_text = text
        # Process in order to avoid substring conflicts
        for bad, good in encoding_fixes:
            cleaned_text = cleaned_text.replace(bad, good)
        
        # Remove any remaining problematic characters
        cleaned_text = re.sub(r'â€[^\w]', '', cleaned_text)
        
        return cleaned_text.strip()
    
    def extract_term_from_folder_path(self, filepath):
        """Extract term from folder path as fallback
        E.g., script_input\\classTimingsFull\\2023-24_T3A -> T3A"""
        try:
            # Get the folder path
            folder_path = os.path.dirname(filepath)
            folder_name = os.path.basename(folder_path)
            
            # Look for term pattern in folder name
            # Pattern: YYYY-YY_TXX or YYYY-YY_TXXA
            term_match = re.search(r'(\d{4}-\d{2})_T(\w+)', folder_name)
            if term_match:
                return f"T{term_match.group(2)}"
            
            # Fallback: look for any T followed by alphanumeric
            term_fallback = re.search(r'T(\w+)', folder_name)
            if term_fallback:
                return f"T{term_fallback.group(1)}"
            
            return None
        except Exception:
            return None

In [None]:
# Example usage
extractor = HTMLDataExtractor()

# Run the extraction process
extractor.run(scraped_filepaths_csv='script_input/scraped_filepaths.csv', output_path='script_input/raw_data.xlsx')


---

## **4. Process Raw Data into Database Tables**

### **What This Code Does**
The `TableBuilder` class takes the structured output of the HTML extractor and turns it into database-ready CSV files for SMU's class management system. It normalizes professor names, detects duplicates, and builds the tables for courses, classes, professors, timing schedules, bidding data, and faculty assignments, keeping references between tables consistent.

**Key Features:**
- **Three-Phase Processing**: Phase 1 (professors/courses with automated faculty mapping), Phase 2 (classes/timings), Phase 3 (BOSS bidding results)
- **Intelligent Professor Matching**: Advanced name normalization with email resolution via Outlook integration and comprehensive duplicate detection
- **Automated Faculty Mapping**: Uses BOSS data to automatically assign courses to SMU's schools and centers based on department codes
- **Comprehensive Data Pipeline**: Processes professors, courses, academic terms, classes, class timings, exam schedules, bid windows, class availability, and bid results
- **Database Cache Integration**: Loads existing data from PostgreSQL to avoid duplicates and maintain consistency
- **Manual Review Workflow**: Outputs verification files for human review and correction before final processing
- **Asian Name Handling**: Specialized normalization for Asian, Western, and mixed naming conventions with hardcoded multi-instructor handling
- **BOSS Integration**: Complete processing of SMU's bidding system results with hierarchical window ordering and failed mapping tracking
- **Data Integrity Validation**: Comprehensive validation system that checks referential integrity across all generated CSV files
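
As a rough illustration of the cache-based duplicate detection listed above, the sketch below splits a batch of scraped professor names into ones already present in a cache and ones that would need manual review. The function `split_known_and_new` and the sample names are hypothetical. The upper-cased key follows the way the professor cache falls back to `name.upper()` when no `boss_name` is stored, but the real matching in `TableBuilder` is more involved than this.

```python
def split_known_and_new(scraped_names, professors_cache):
    """Separate scraped names into cache hits and candidates for manual review."""
    known, new = [], []
    seen = set()
    for raw_name in scraped_names:
        # Collapse repeated whitespace and upper-case to build the lookup key
        key = " ".join(str(raw_name).split()).upper()
        if not key or key in seen:
            continue  # skip blanks and repeats within the same batch
        seen.add(key)
        (known if key in professors_cache else new).append(key)
    return known, new


# Hypothetical cache and scraped batch
cache = {"TAN AH KOW": {"id": "p-001"}}
known, new = split_known_and_new(["Tan Ah Kow", "tan  ah kow", "LIM MEI LING", ""], cache)
print(known)  # ['TAN AH KOW']
print(new)    # ['LIM MEI LING']
```

Names that land in `new` correspond to the kind of rows that would be written out for human review rather than inserted directly.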

**Input Requirements:**
- **Raw Data Excel**: `script_input/raw_data.xlsx` from HTML extractor with `standalone` and `multiple` sheets
- **BOSS Results**: Excel files in `script_input/overallBossResults/` directory for bidding data processing
- **Database Configuration**: `.env` file with PostgreSQL connection parameters
- **Professor Lookup**: `script_input/professor_lookup.csv` for existing professor mappings (optional)
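
Before running the builder, it can help to confirm that these inputs are in place. The following is a minimal preflight sketch, not part of the original script: `missing_inputs` is a hypothetical helper, and it assumes the database settings use the `DB_HOST`, `DB_NAME`, `DB_USER`, and `DB_PASSWORD` keys that `TableBuilder.__init__` reads (`DB_PORT` is left out because the class defaults it to 5432).

```python
import os

REQUIRED_DB_KEYS = ("DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD")


def missing_inputs(env, raw_data_path, boss_results_dir):
    """Return a list of human-readable problems; an empty list means inputs look usable."""
    problems = [
        f"missing environment variable {key}"
        for key in REQUIRED_DB_KEYS
        if not env.get(key)
    ]
    if not os.path.isfile(raw_data_path):
        problems.append(f"raw data file not found: {raw_data_path}")
    if not os.path.isdir(boss_results_dir):
        problems.append(f"BOSS results directory not found: {boss_results_dir}")
    return problems


# Report anything missing before starting a long processing run
for problem in missing_inputs(
    os.environ, "script_input/raw_data.xlsx", "script_input/overallBossResults"
):
    print(f"WARNING: {problem}")
```

Note that `os.environ` only contains the `.env` values after `load_dotenv()` has run, so this check would need to come after that call.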

**Output Structure:**
- **Verification Files** (`script_output/verify/`): `new_professors.csv`, `new_courses.csv`, `new_faculties.csv`
- **Database Insert Files** (`script_output/`): All table CSV files including classes, timings, exams, bid data, and academic terms
- **Validation Reports**: Data integrity validation with error/warning reports and statistics
- **Processing Logs**: Detailed BOSS processing logs with timestamps and failure analysis
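
To make the idea of a referential integrity check on the generated CSV files concrete, here is a small sketch using only the standard library. The column names (`id`, `course_id`), the sample rows, and the helper `find_orphans` are assumptions for illustration; the actual validation in this notebook may use different columns and rules.

```python
import csv
import io


def find_orphans(child_rows, fk_column, parent_rows, pk_column="id"):
    """Return child rows whose foreign key has no matching parent primary key."""
    parent_ids = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_ids]


# Hypothetical in-memory stand-ins for two generated CSV files
courses_csv = "id,code\nc1,IS111\nc2,IS112\n"
classes_csv = "id,course_id,section\nk1,c1,G1\nk2,c9,G1\n"

courses = list(csv.DictReader(io.StringIO(courses_csv)))
classes = list(csv.DictReader(io.StringIO(classes_csv)))

# Class k2 points at course c9, which does not exist in the courses table
orphans = find_orphans(classes, "course_id", courses)
print([row["id"] for row in orphans])  # ['k2']
```

The same helper could be pointed at any parent/child pair of output files by swapping the column names.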

In [None]:
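# Imports used by this cell (harmless if already imported in an earlier cell)
import logging
from typing import Tuple

import psycopg2
from dotenv import load_dotenv

# Set up logging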
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class TableBuilder:
    """Comprehensive table builder for university class management system"""
    
    def __init__(self, input_file: str = 'script_input/raw_data.xlsx'):
        """Initialize TableBuilder with database configuration and caching"""
        self.input_file = input_file
        self.output_base = 'script_output'
        self.verify_dir = os.path.join(self.output_base, 'verify')
        self.cache_dir = 'db_cache'
        
        # Create output directories
        os.makedirs(self.output_base, exist_ok=True)
        os.makedirs(self.verify_dir, exist_ok=True)
        os.makedirs(self.cache_dir, exist_ok=True)
        
        # Load environment variables
        load_dotenv()
        self.db_config = {
            'host': os.getenv('DB_HOST'),
            'database': os.getenv('DB_NAME'),
            'user': os.getenv('DB_USER'),
            'password': os.getenv('DB_PASSWORD'),
            'port': int(os.getenv('DB_PORT', 5432)),
            'gssencmode': 'disable'
        }
        
        # Database connection
        self.connection = None
        
        # Data storage
        self.standalone_data = None
        self.multiple_data = None
        
        # Caches
        self.professors_cache = {}  # name -> professor data
        self.courses_cache = {}     # code -> course data
        self.acad_term_cache = {}   # id -> acad_term data
        self.faculties_cache = {}   # id -> faculty data
        self.faculty_acronym_to_id = {}  # acronym -> faculty_id mapping
        self.professor_lookup = {}  # scraped_name -> database mapping
        
        # Output data collectors
        self.new_professors = []
        self.new_courses = []
        self.update_courses = []
        self.new_acad_terms = []
        self.new_classes = []
        self.new_class_timings = []
        self.new_class_exam_timings = []
        
        # Class ID mapping for timing tables
        self.class_id_mapping = {}  # record_key -> class_id
        
        # Courses requiring faculty assignment
        self.courses_needing_faculty = []
        
        # Statistics
        self.stats = {
            'professors_created': 0,
            'courses_created': 0,
            'courses_updated': 0,
            'classes_created': 0,
            'timings_created': 0,
            'exams_created': 0,
            'courses_needing_faculty': 0
        }
        
        # Asian surnames database for name normalization
        self.asian_surnames = {
            'chinese': ['WANG', 'LI', 'ZHANG', 'LIU', 'CHEN', 'YANG', 'HUANG', 'ZHAO', 'WU', 'ZHOU',
                       'XU', 'SUN', 'MA', 'ZHU', 'HU', 'GUO', 'HE', 'LIN', 'GAO', 'LUO'],
            'singaporean': ['TAN', 'LIM', 'LEE', 'NG', 'ONG', 'WONG', 'GOH', 'CHUA', 'CHAN', 'KOH',
                           'TEO', 'AW', 'CHYE', 'YEO', 'SIM', 'CHIA', 'CHONG', 'LAM', 'CHEW', 'TOH'],
            'korean': ['KIM', 'LEE', 'PARK', 'CHOI', 'JUNG', 'KANG', 'CHO', 'YUN', 'JANG', 'LIM'],
            'vietnamese': ['NGUYEN', 'TRAN', 'LE', 'PHAM', 'HOANG', 'PHAN', 'VU', 'DANG', 'BUI'],
            'indian': ['SHARMA', 'SINGH', 'KUMAR', 'GUPTA', 'KOHLI', 'PATEL', 'MAKHIJA']
        }
        self.all_asian_surnames = set()
        for surnames in self.asian_surnames.values():
            self.all_asian_surnames.update(surnames)
        
        # Western given names
        self.western_given_names = {
            'AARON', 'ADAM', 'ADRIAN', 'ALEXANDER', 'AMANDA', 'ANDREW', 'ANTHONY',
            'BENJAMIN', 'CHRISTOPHER', 'DANIEL', 'DAVID', 'EMILY', 'JAMES', 'JENNIFER',
            'JOHN', 'MICHAEL', 'PETER', 'ROBERT', 'SARAH', 'THOMAS', 'WILLIAM'
        }

        # Bid results data collectors
        self.boss_log_file = os.path.join(self.output_base, 'boss_result_log.txt')
        self.new_bid_windows = []
        self.new_class_availability = []
        self.new_bid_results = []

        # Load professor lookup from CSV if available
        # (self.professor_lookup is already initialized with the other caches above)
        self.load_professor_lookup_csv()

    def connect_database(self):
        """Connect to PostgreSQL database"""
        try:
            self.connection = psycopg2.connect(**self.db_config)
            logger.info("✅ Database connection established")
            return True
        except Exception as e:
            logger.error(f"❌ Database connection failed: {e}")
            return False

    def load_or_cache_data(self):
        """Load data from cache or database"""
        # Try loading from cache first
        if self._load_from_cache():
            logger.info("✅ Loaded data from cache")
            return True
        
        # Connect to database and download
        if not self.connect_database():
            return False
        
        try:
            self._download_and_cache_data()
            logger.info("✅ Downloaded and cached data from database")
            return True
        except Exception as e:
            logger.error(f"❌ Failed to download data: {e}")
            return False

    def _download_and_cache_data(self):
        """Download data from database and cache locally - includes all requested tables"""
        try:
            # Map each database table to its local cache file
            table_cache_files = {
                'professors': 'professors_cache.pkl',
                'courses': 'courses_cache.pkl',
                'acad_term': 'acad_terms_cache.pkl',
                'faculties': 'faculties_cache.pkl',
                'bid_result': 'bid_result_cache.pkl',
                'bid_window': 'bid_window_cache.pkl',
                'class_availability': 'class_availability_cache.pkl',
                'class_exam_timing': 'class_exam_timing_cache.pkl',
                'class_timing': 'class_timing_cache.pkl',
                'classes': 'classes_cache.pkl',
            }
            
            for table_name, cache_file in table_cache_files.items():
                # Table names come from the fixed mapping above, never from user input
                table_df = pd.read_sql_query(f"SELECT * FROM {table_name}", self.connection)
                table_df.to_pickle(os.path.join(self.cache_dir, cache_file))
            
            logger.info("✅ Downloaded all tables from database and cached locally")
            
            # Load into memory
            self._load_from_cache()
            
        except Exception as e:
            logger.error(f"❌ Failed to download and cache data: {e}")
            raise

    def _load_from_cache(self) -> bool:
        """Load cached data from files"""
        try:
            cache_files = {
                'professors': os.path.join(self.cache_dir, 'professors_cache.pkl'),
                'courses': os.path.join(self.cache_dir, 'courses_cache.pkl'),
                'acad_terms': os.path.join(self.cache_dir, 'acad_terms_cache.pkl'),
                'faculties': os.path.join(self.cache_dir, 'faculties_cache.pkl'),
                'bid_result': os.path.join(self.cache_dir, 'bid_result_cache.pkl'),
                'bid_window': os.path.join(self.cache_dir, 'bid_window_cache.pkl'),
                'class_availability': os.path.join(self.cache_dir, 'class_availability_cache.pkl'),
                'class_exam_timing': os.path.join(self.cache_dir, 'class_exam_timing_cache.pkl'),
                'class_timing': os.path.join(self.cache_dir, 'class_timing_cache.pkl'),
                'classes': os.path.join(self.cache_dir, 'classes_cache.pkl')
            }
            
            if all(os.path.exists(f) for f in cache_files.values()):
                # Load professors, keyed by boss_name when available
                professors_df = pd.read_pickle(cache_files['professors'])
                for _, row in professors_df.iterrows():
                    # Safely handle None values
                    boss_name = row.get('boss_name')
                    regular_name = row.get('name')
                    
                    # Skip records with no valid name data
                    if boss_name is None and (regular_name is None or pd.isna(regular_name)):
                        continue
                        
                    # Use boss_name if available and not None, otherwise use name.upper()
                    if boss_name is not None and not pd.isna(boss_name):
                        cache_key = str(boss_name)
                    elif regular_name is not None and not pd.isna(regular_name):
                        cache_key = str(regular_name).upper()
                    else:
                        continue  # Skip if both are None/NaN
                        
                    self.professors_cache[cache_key] = row.to_dict()
                
                # Load courses
                courses_df = pd.read_pickle(cache_files['courses'])
                for _, row in courses_df.iterrows():
                    self.courses_cache[row['code']] = row.to_dict()
                
                # Load acad_terms
                acad_terms_df = pd.read_pickle(cache_files['acad_terms'])
                for _, row in acad_terms_df.iterrows():
                    self.acad_term_cache[row['id']] = row.to_dict()
                
                # Load faculties
                faculties_df = pd.read_pickle(cache_files['faculties'])
                for _, row in faculties_df.iterrows():
                    faculty_id = row['id']
                    acronym = row['acronym'].upper()
                    self.faculties_cache[faculty_id] = row.to_dict()
                    self.faculty_acronym_to_id[acronym] = faculty_id
                
                # Load professor lookup if it exists, validating each row
                lookup_file = 'script_input/professor_lookup.csv'
                if os.path.exists(lookup_file):
                    lookup_df = pd.read_csv(lookup_file)
                    for _, row in lookup_df.iterrows():
                        boss_name = row.get('boss_name')
                        afterclass_name = row.get('afterclass_name')
                        database_id = row.get('database_id')
                        
                        # Validate required fields - professor_lookup.csv should be clean but pandas might introduce NaN
                        if pd.isna(boss_name) or pd.isna(database_id):
                            continue
                        
                        # Convert to string and validate
                        boss_name_str = str(boss_name).strip()
                        if not boss_name_str or boss_name_str.lower() == 'nan':
                            continue
                        
                        self.professor_lookup[boss_name_str.upper()] = {
                            'database_id': str(database_id),
                            'boss_name': boss_name_str,
                            'afterclass_name': str(afterclass_name) if not pd.isna(afterclass_name) else boss_name_str
                        }
                
                return True
            return False
        except Exception as e:
            logger.error(f"Cache loading error: {e}")
            return False

    def load_raw_data(self):
        """Load raw data from Excel file"""
        try:
            logger.info(f"📂 Loading raw data from {self.input_file}")
            
            # Load both sheets
            self.standalone_data = pd.read_excel(self.input_file, sheet_name='standalone')
            self.multiple_data = pd.read_excel(self.input_file, sheet_name='multiple')
            
            logger.info(f"✅ Loaded {len(self.standalone_data)} standalone records")
            logger.info(f"✅ Loaded {len(self.multiple_data)} multiple records")
            
            self.multiple_lookup = defaultdict(list)
            for _, row in self.multiple_data.iterrows():
                key = row.get('record_key')
                if pd.notna(key):
                    self.multiple_lookup[key].append(row)
            
            logger.info(f"✅ Created optimized lookup for {len(self.multiple_lookup)} record keys")

            return True
        except Exception as e:
            logger.error(f"❌ Failed to load raw data: {e}")
            return False

    def normalize_professor_name(self, name: str) -> Tuple[str, str]:
        """Normalize professor name with improved handling for special cases"""
        # Handle None, NaN, and empty values
        if name is None or pd.isna(name) or name == "":
            return "UNKNOWN", "Unknown"
        
        # Clean and prepare name - ensure it's a string
        name = str(name).strip()
        
        # Additional safety check after conversion
        if not name:
            return "UNKNOWN", "Unknown"
        
        # Handle comma-separated names (SURNAME, GIVEN format)
        # These are almost always single professors
        if ',' in name:
            parts = name.split(',')
            if len(parts) == 2:
                # Standard "SURNAME, GIVEN" format
                surname_part = parts[0].strip()
                given_part = parts[1].strip()
                # Reconstruct as "SURNAME Given Names"
                given_names_title = ' '.join(word.capitalize() for word in given_part.split())
                name = f"{surname_part.upper()} {given_names_title}"
        
        # Detect naming pattern
        words = name.split()
        if not words:
            return name.upper(), name
        
        # Detect pattern
        pattern = self._detect_name_pattern(words)
        
        # Format based on pattern
        if pattern == 'WESTERN':
            # Western: Given SURNAME
            boss_name = name.upper()
            afterclass_parts = []
            for i, word in enumerate(words):
                if i == len(words) - 1:  # Last word is surname
                    afterclass_parts.append(word.upper())
                else:
                    afterclass_parts.append(word.capitalize())
            afterclass_name = ' '.join(afterclass_parts)
        
        elif pattern == 'ASIAN':
            # Asian: SURNAME Given Given
            boss_name = name.upper()
            afterclass_parts = []
            for i, word in enumerate(words):
                if i == 0:  # First word is surname
                    afterclass_parts.append(word.upper())
                else:
                    afterclass_parts.append(word.capitalize())
            afterclass_name = ' '.join(afterclass_parts)
        
        elif pattern == 'SINGAPOREAN':
            # Singaporean: Given SURNAME Given
            boss_name = name.upper()
            surname_idx = self._find_surname_index(words)
            afterclass_parts = []
            for i, word in enumerate(words):
                if i == surname_idx:
                    afterclass_parts.append(word.upper())
                else:
                    afterclass_parts.append(word.capitalize())
            afterclass_name = ' '.join(afterclass_parts)
        
        else:
            # Default fallback
            boss_name = name.upper()
            afterclass_name = ' '.join(word.capitalize() for word in words)
        
        return boss_name, afterclass_name
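
    # Worked example for normalize_professor_name (illustrative only; assumes 'JOHN' is in
    # self.western_given_names, 'TAN' is in self.all_asian_surnames, and 'AH'/'KOW' are in
    # neither set; both sets are defined elsewhere in this class):
    #   "John Smith" -> WESTERN pattern, last word treated as the surname
    #                   returns ("JOHN SMITH", "John SMITH")
    #   "TAN Ah Kow" -> ASIAN pattern, first word treated as the surname
    #                   returns ("TAN AH KOW", "TAN Ah Kow")
    #   None or ""   -> returns ("UNKNOWN", "Unknown")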

    def _detect_name_pattern(self, words: List[str]) -> str:
        """Detect naming pattern: WESTERN, ASIAN, or SINGAPOREAN (UNKNOWN if empty)"""
        if not words:
            return 'UNKNOWN'
        
        first_upper = words[0].upper()
        
        # Check for Singaporean mixed pattern first: a Western given name followed
        # by an Asian surname. This must precede the plain Western check, which
        # would otherwise return early and make this branch unreachable.
        if len(words) >= 3:
            if (first_upper in self.western_given_names and 
                any(w.upper() in self.all_asian_surnames for w in words[1:])):
                return 'SINGAPOREAN'
        
        # Check for Western pattern
        if first_upper in self.western_given_names:
            return 'WESTERN'
        
        # Check for pure Asian pattern
        if first_upper in self.all_asian_surnames:
            # Check if no Western names present
            has_western = any(w.upper() in self.western_given_names for w in words)
            if not has_western:
                return 'ASIAN'
        
        # Default to Western if unclear
        return 'WESTERN'

    def _find_surname_index(self, words: List[str]) -> int:
        """Find the index of surname in a list of words"""
        for i, word in enumerate(words):
            if word.upper() in self.all_asian_surnames:
                return i
        # Default to last word if no Asian surname found
        return len(words) - 1

    def resolve_professor_email(self, professor_name):
        """Resolve professor email using Outlook contacts"""
        try:
            # Initialize Outlook
            outlook = win32.Dispatch("Outlook.Application")
            namespace = outlook.GetNamespace("MAPI")
            
            # Try exact resolver first
            recipient = namespace.CreateRecipient(professor_name)
            if recipient.Resolve():
                # Try to get SMTP address
                address_entry = recipient.AddressEntry
                
                # Try Exchange user
                try:
                    exchange_user = address_entry.GetExchangeUser()
                    if exchange_user and exchange_user.PrimarySmtpAddress:
                        return exchange_user.PrimarySmtpAddress.lower()
                except Exception:
                    pass
                
                # Try Exchange distribution list
                try:
                    exchange_dl = address_entry.GetExchangeDistributionList()
                    if exchange_dl and exchange_dl.PrimarySmtpAddress:
                        return exchange_dl.PrimarySmtpAddress.lower()
                except Exception:
                    pass
                
                # Try PR_SMTP_ADDRESS property
                try:
                    property_accessor = address_entry.PropertyAccessor
                    smtp_addr = property_accessor.GetProperty("http://schemas.microsoft.com/mapi/proptag/0x39FE001E")
                    if smtp_addr:
                        return smtp_addr.lower()
                except Exception:
                    pass
                
                # Fallback: regex search in Address field
                try:
                    address = getattr(address_entry, "Address", "") or ""
                    match = re.search(r"[\w\.-]+@[\w\.-]+\.\w+", address)
                    if match:
                        return match.group(0).lower()
                except Exception:
                    pass
            
            # If exact resolve fails, try contacts search
            contacts_folder = namespace.GetDefaultFolder(10)  # olFolderContacts
            tokens = [t.lower() for t in professor_name.split()]
            
            # An empty token list would make all() match every contact, so skip the scan
            for item in (contacts_folder.Items if tokens else []):
                try:
                    full_name = (item.FullName or "").lower()
                    if all(token in full_name for token in tokens):
                        # Try the three standard email slots
                        for field in ("Email1Address", "Email2Address", "Email3Address"):
                            addr = getattr(item, field, "") or ""
                            if addr and "@" in addr:
                                return addr.lower()
                except Exception:
                    continue
            
            # If no email found, return default
            return 'enquiry@smu.edu.sg'
            
        except Exception as e:
            logger.warning(f"Email resolution failed for {professor_name}: {e}")
            return 'enquiry@smu.edu.sg'
        
    def process_professors(self):
        """Process professors from multiple sheet with intelligent multi-instructor handling"""
        logger.info("👥 Processing professors...")
        
        unique_professors = set()
        
        # Extract unique professor names from the multiple sheet, skipping NaN and placeholder values
        for _, row in self.multiple_data.iterrows():
            prof_name_raw = row.get('professor_name')
            
            # Handle NaN values from raw_data.xlsx properly
            if prof_name_raw is None or pd.isna(prof_name_raw):
                continue
            
            # Convert to string and validate - this handles float NaN values from pandas
            prof_name = str(prof_name_raw).strip()
            
            # Skip empty strings and special values
            if not prof_name or prof_name.lower() in ['nan', 'tba', 'to be announced']:
                continue
            
            # Use intelligent splitting instead of hardcoded combinations
            split_professors = self._split_professor_names(prof_name)
            for individual_prof in split_professors:
                if individual_prof and individual_prof.strip():  # Additional safety check
                    unique_professors.add(individual_prof.strip())
        
        # Create email-to-professor mapping from existing professors for duplicate detection.
        # The default fallback email is shared by many records, so it is excluded.
        email_to_professor = {}
        for boss_name, prof_data in self.professors_cache.items():
            if 'email' in prof_data and prof_data['email']:
                # Skip default email for duplicate detection
                if prof_data['email'].lower() != 'enquiry@smu.edu.sg':
                    email_to_professor[prof_data['email'].lower()] = prof_data
        
        # Initialize fuzzy match tracking
        fuzzy_matched_professors = []
        
        # Process each unique professor
        for prof_name in unique_professors:
            try:
                boss_name, afterclass_name = self.normalize_professor_name(prof_name)
                
                # Step 1: Check professor_lookup.csv first (exact match on the raw scraped name)
                if hasattr(self, 'professor_lookup') and prof_name.upper() in self.professor_lookup:
                    logger.info(f"✅ Found in professor_lookup.csv: {prof_name}")
                    continue
                
                # Also check with normalized boss_name
                if hasattr(self, 'professor_lookup') and boss_name.upper() in self.professor_lookup:
                    # Update the lookup with the original prof_name as key too
                    self.professor_lookup[prof_name.upper()] = self.professor_lookup[boss_name.upper()]
                    logger.info(f"✅ Found in professor_lookup.csv by boss_name: {prof_name} → {boss_name}")
                    continue
                
                # Step 1.5: Check for partial name matches against professor_lookup.csv boss_names
                if hasattr(self, 'professor_lookup'):
                    found_partial_match = False
                    # Word-level comparison (not a substring test): computed once per professor
                    prof_words = set(prof_name.upper().split())
                    for lookup_boss_name, lookup_data in self.professor_lookup.items():
                        lookup_words = set(lookup_boss_name.split())
                        
                        # If all words in prof_name are found in lookup_boss_name, it's a match
                        if prof_words.issubset(lookup_words) and len(prof_words) >= 2:  # At least 2 words must match
                            self.professor_lookup[prof_name.upper()] = lookup_data
                            logger.info(f"✅ Found partial match in professor_lookup.csv: {prof_name} → {lookup_boss_name}")
                            found_partial_match = True
                            break
                    
                    if found_partial_match:
                        continue
                
                # Step 2: Check exact matches in professors_cache
                if boss_name in self.professors_cache:
                    if not hasattr(self, 'professor_lookup'):
                        self.professor_lookup = {}
                    self.professor_lookup[prof_name.upper()] = {
                        'database_id': self.professors_cache[boss_name]['id'],
                        'boss_name': boss_name,
                        'afterclass_name': self.professors_cache[boss_name].get('name', afterclass_name)
                    }
                    logger.info(f"✅ Found in professors_cache: {prof_name} → {boss_name}")
                    continue
                
                # Step 3: Exact match after normalizing commas, whitespace, and case (no approximate matching here)
                fuzzy_match_found = False
                normalized_prof = ' '.join(str(prof_name).replace(',', ' ').split()).upper()
                
                # Check against existing professors
                for cached_name, cached_prof in self.professors_cache.items():
                    if cached_name is None:
                        continue
                    cached_normalized = ' '.join(str(cached_name).replace(',', ' ').split()).upper()
                    
                    # Only match if exactly the same after normalization
                    if normalized_prof == cached_normalized:
                        if not hasattr(self, 'professor_lookup'):
                            self.professor_lookup = {}
                        self.professor_lookup[prof_name.upper()] = {
                            'database_id': cached_prof['id'],
                            'boss_name': cached_prof.get('boss_name', cached_prof['name'].upper()),
                            'afterclass_name': cached_prof.get('name', afterclass_name)
                        }
                        fuzzy_match_found = True
                        logger.info(f"✅ Found fuzzy match (100% certain): {prof_name} → {cached_name}")
                        break
                
                if fuzzy_match_found:
                    continue
                
                # Also check against new professors being created in this session
                for new_prof in self.new_professors:
                    new_normalized = ' '.join(new_prof.get('boss_name', '').replace(',', ' ').split()).upper()
                    if normalized_prof == new_normalized:
                        if not hasattr(self, 'professor_lookup'):
                            self.professor_lookup = {}
                        self.professor_lookup[prof_name.upper()] = {
                            'database_id': new_prof['id'],
                            'boss_name': new_prof['boss_name'],
                            'afterclass_name': new_prof['afterclass_name']
                        }
                        fuzzy_match_found = True
                        logger.info(f"✅ Found in new_professors (100% certain): {prof_name}")
                        break
                
                if fuzzy_match_found:
                    continue
                
                # Step 4: Fuzzy matching against boss_name and afterclass_name in professor_lookup
                if hasattr(self, 'professor_lookup'):
                    best_fuzzy_match = None
                    best_fuzzy_score = 0
                    
                    for lookup_boss_name, lookup_data in self.professor_lookup.items():
                        # Get afterclass_name for comparison
                        afterclass_candidate = lookup_data.get('afterclass_name', lookup_boss_name)
                        
                        # Fuzzy match against both boss_name and afterclass_name
                        boss_score = self._calculate_fuzzy_score(prof_name, lookup_boss_name)
                        afterclass_score = self._calculate_fuzzy_score(prof_name, afterclass_candidate)
                        
                        max_score = max(boss_score, afterclass_score)
                        
                        # Only consider very strong matches (score strictly above 0.85)
                        if max_score > 0.85 and max_score > best_fuzzy_score:
                            best_fuzzy_match = lookup_data
                            best_fuzzy_score = max_score
                    
                    if best_fuzzy_match:
                        # Add to fuzzy matched professors for validation
                        fuzzy_matched_professors.append({
                            'boss_name': prof_name.upper(),
                            'afterclass_name': best_fuzzy_match.get('afterclass_name', prof_name),
                            'database_id': best_fuzzy_match['database_id'],
                            'method': 'fuzzy_match',
                            'confidence_score': f"{best_fuzzy_score:.2f}"
                        })
                        
                        # Update lookup (professor_lookup is known to exist inside this branch)
                        self.professor_lookup[prof_name.upper()] = best_fuzzy_match
                        
                        logger.info(f"🔍 Fuzzy match found: {prof_name} → {best_fuzzy_match.get('afterclass_name')} (score: {best_fuzzy_score:.2f})")
                        continue
                
                # Step 5: Validate professor name before creating - reject single words
                prof_words = prof_name.strip().split()
                if len(prof_words) == 1:
                    logger.warning(f"⚠️ Skipping single-word professor name (likely parsing error): '{prof_name}'")
                    continue
                
                # Step 6: Create new professor and resolve email
                resolved_email = self.resolve_professor_email(afterclass_name)
                
                # Step 7: Check if the resolved email already exists (the default email is never treated as a duplicate)
                if (resolved_email and 
                    resolved_email.lower() != 'enquiry@smu.edu.sg' and 
                    resolved_email.lower() in email_to_professor):
                    # Email already exists - use existing professor
                    existing_prof = email_to_professor[resolved_email.lower()]
                    if not hasattr(self, 'professor_lookup'):
                        self.professor_lookup = {}
                    self.professor_lookup[prof_name.upper()] = {
                        'database_id': existing_prof['id'],
                        'boss_name': boss_name,
                        'afterclass_name': existing_prof.get('name', afterclass_name)
                    }
                    logger.info(f"✅ Email duplicate found - using existing professor: {prof_name} → {existing_prof.get('name')} (email: {resolved_email})")
                    continue
                
                # Step 8: Create new professor only if no match found
                professor_id = str(uuid.uuid4())
                slug = re.sub(r'[^a-zA-Z0-9]+', '-', afterclass_name.lower()).strip('-')
                
                new_prof = {
                    'id': professor_id,
                    'name': afterclass_name,
                    'email': resolved_email,
                    'slug': slug,
                    'photo_url': 'https://smu.edu.sg',
                    'profile_url': 'https://smu.edu.sg',
                    'belong_to_university': 1,  # SMU
                    'created_at': datetime.now().isoformat(),
                    'updated_at': datetime.now().isoformat(),
                    'boss_name': boss_name,
                    'afterclass_name': afterclass_name,
                    'original_scraped_name': prof_name
                }
                
                self.new_professors.append(new_prof)
                self.stats['professors_created'] += 1
                
                # Update lookup
                if not hasattr(self, 'professor_lookup'):
                    self.professor_lookup = {}
                self.professor_lookup[prof_name.upper()] = {
                    'database_id': professor_id,
                    'boss_name': boss_name,
                    'afterclass_name': afterclass_name
                }
                
                # Add to email mapping to prevent duplicates within this session (the default email is skipped)
                if resolved_email and resolved_email.lower() != 'enquiry@smu.edu.sg':
                    email_to_professor[resolved_email.lower()] = new_prof
                
                logger.info(f"✅ Created professor: {afterclass_name} with email: {resolved_email}")
                
            except Exception as e:
                logger.error(f"❌ Error processing professor '{prof_name}': {e}")
                continue
        
        # Save fuzzy matched professors for validation
        if fuzzy_matched_professors:
            fuzzy_df = pd.DataFrame(fuzzy_matched_professors)
            fuzzy_path = os.path.join(self.verify_dir, 'fuzzy_matched_professors.csv')
            fuzzy_df.to_csv(fuzzy_path, index=False)
            logger.info(f"🔍 Saved {len(fuzzy_matched_professors)} fuzzy matched professors for validation")
        
        logger.info(f"✅ Created {self.stats['professors_created']} new professors")

    def _calculate_fuzzy_score(self, name1: str, name2: str) -> float:
        """Calculate fuzzy similarity score between two names (0-1)"""
        if not name1 or not name2:
            return 0.0
        
        # Normalize names
        name1_clean = ' '.join(str(name1).upper().replace(',', ' ').split())
        name2_clean = ' '.join(str(name2).upper().replace(',', ' ').split())
        
        if name1_clean == name2_clean:
            return 1.0
        
        # Check word overlap
        words1 = set(name1_clean.split())
        words2 = set(name2_clean.split())
        
        if not words1 or not words2:
            return 0.0
        
        # Calculate Jaccard similarity
        intersection = len(words1.intersection(words2))
        union = len(words1.union(words2))
        
        jaccard_score = intersection / union if union > 0 else 0.0
        
        # Boost score if one name is completely contained in the other
        if words1.issubset(words2) or words2.issubset(words1):
            jaccard_score = min(1.0, jaccard_score + 0.2)
        
        return jaccard_score
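
    # Worked example for _calculate_fuzzy_score above (illustrative names, not real data):
    #   "TAN AH KOW" vs "TAN AH KOW DAVID"
    #       shared words = 3, total distinct words = 4 -> Jaccard = 3/4 = 0.75
    #       first set is a subset of the second -> boost: min(1.0, 0.75 + 0.2) = 0.95
    #       0.95 > 0.85, so Step 4 of process_professors would accept it
    #   "JOHN TAN" vs "JOHN LIM"
    #       shared words = 1, total distinct words = 3 -> Jaccard = 1/3 (about 0.33)
    #       neither set contains the other -> no boost, so the pair is rejected
    #   "DAVID TAN" vs "TAN AH KOW DAVID"
    #       shared words = 2, total distinct words = 4 -> 0.5, plus subset boost = 0.7
    #       0.7 is below 0.85, so this function alone would not accept the pair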

    def process_courses(self):
        """Process courses from standalone sheet with proper change detection for updates"""
        logger.info("📚 Processing courses...")
        
        # Group by course code to handle duplicates
        course_groups = defaultdict(list)
        for _, row in self.standalone_data.iterrows():
            if pd.notna(row.get('course_code')):
                course_groups[row['course_code']].append(row)
        
        for course_code, rows in course_groups.items():
            # Helper function to get sortable key for academic term ordering
            def get_sort_key(row):
                year_start = row.get('acad_year_start', 0)
                year_end = row.get('acad_year_end', 0)
                term = str(row.get('term', ''))
                
                # Convert term to sortable format with proper hierarchy: T1 → T2 → T3A → T3B
                term_order = {
                    'T1': 1,
                    'T2': 2,
                    'T3A': 3,
                    'T3B': 4,
                    '1': 1,    # Handle cases without T prefix
                    '2': 2,
                    '3A': 3,
                    '3B': 4
                }
                term_value = term_order.get(term.upper(), 0)
                return (year_start, year_end, term_value)
            
            # Sort rows to get the latest one (highest year and term)
            sorted_rows = sorted(rows, key=get_sort_key, reverse=True)
            latest_row = sorted_rows[0]
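            # Worked example of the ordering (illustrative values only):
            #   (acad_year_start, acad_year_end, term) -> sort key
            #   (2023, 2024, 'T3B') -> (2023, 2024, 4)
            #   (2024, 2025, 'T1')  -> (2024, 2025, 1)
            #   (2024, 2025, 'T3A') -> (2024, 2025, 3)
            # With reverse=True the 2024/2025 'T3A' row sorts first and becomes latest_row.
            # A term string missing from term_order (for example 'Term 1') maps to 0,
            # so it ranks below 'T1' within the same academic year.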
            
            # Check if course exists in cache
            if course_code in self.courses_cache:
                # Course exists - check for actual changes that need updating
                existing = self.courses_cache[course_code]
                update_needed = False
                update_record = {'id': existing['id'], 'code': course_code}
                
                # Fields to compare for changes
                field_mapping = {
                    'name': 'course_name',
                    'description': 'course_description', 
                    'credit_units': 'credit_units',
                    'course_area': 'course_area',
                    'enrolment_requirements': 'enrolment_requirements'
                }
                
                # Check each field for actual changes
                for db_field, raw_field in field_mapping.items():
                    new_value = latest_row.get(raw_field)
                    old_value = existing.get(db_field)
                    
                    # Handle different data types properly
                    if db_field == 'credit_units':
                        # Convert to float for comparison
                        new_value = float(new_value) if pd.notna(new_value) else None
                        old_value = float(old_value) if pd.notna(old_value) else None
                    else:
                        # String comparison - handle None/NaN
                        if pd.isna(new_value):
                            new_value = None
                        else:
                            new_value = str(new_value).strip()
                        
                        if pd.isna(old_value):
                            old_value = None
                        else:
                            old_value = str(old_value).strip()
                    
                    # Only update if there's an actual change
                    if new_value != old_value:
                        # Skip if new value is empty/None and old value exists (don't overwrite with empty)
                        if new_value is None or new_value == '':
                            if old_value is not None and old_value != '':
                                continue  # Don't overwrite existing data with empty data
                        
                        update_record[db_field] = new_value
                        update_needed = True
                        logger.info(f"📝 Course {course_code}: {db_field} changed from '{old_value}' to '{new_value}'")
                
                if update_needed:
                    self.update_courses.append(update_record)
                    self.stats['courses_updated'] += 1
                    
                    # Update cache with new values
                    for field, value in update_record.items():
                        if field not in ['id', 'code']:
                            self.courses_cache[course_code][field] = value
                            
                    logger.info(f"✅ Course {course_code} marked for update")
                else:
                    logger.info(f"⏭️ Course {course_code} - no changes detected")
            else:
                # Create new course
                course_id = str(uuid.uuid4())
                
                new_course = {
                    'id': course_id,
                    'code': course_code,
                    'name': latest_row.get('course_name', 'Unknown Course'),
                    'description': latest_row.get('course_description', 'No description available'),
                    'credit_units': float(latest_row.get('credit_units', 1.0)) if pd.notna(latest_row.get('credit_units')) else 1.0,
                    'belong_to_university': 1,  # SMU
                    'belong_to_faculty': None,  # Will be assigned later
                    'course_area': latest_row.get('course_area'),
                    'enrolment_requirements': latest_row.get('enrolment_requirements')
                }
                
                self.new_courses.append(new_course)
                self.stats['courses_created'] += 1
                
                # Store course info for later faculty assignment
                self.courses_needing_faculty.append({
                    'course_id': course_id,
                    'course_code': course_code,
                    'course_name': latest_row.get('course_name', 'Unknown Course'),
                    'course_outline_url': latest_row.get('course_outline_url')
                })
                self.stats['courses_needing_faculty'] += 1
                
                # Update cache
                self.courses_cache[course_code] = new_course
        
        logger.info(f"✅ Created {self.stats['courses_created']} new courses")
        logger.info(f"✅ Updated {self.stats['courses_updated']} existing courses")
        logger.info(f"⚠️  {self.stats['courses_needing_faculty']} courses need faculty assignment")

    def assign_course_faculties_interactive(self):
        """Interactive faculty assignment with option to create new faculties"""
        if not self.courses_needing_faculty:
            logger.info("✅ No courses need faculty assignment")
            return
        
        logger.info(f"🎓 Starting interactive faculty assignment for {len(self.courses_needing_faculty)} courses")
        
        # Get current max faculty ID for incrementing
        max_faculty_id = max(self.faculties_cache.keys()) if self.faculties_cache else 0
        
        faculty_assignments = []
        
        for course_info in self.courses_needing_faculty:
            print(f"\n{'='*60}")
            print("🎓 FACULTY ASSIGNMENT NEEDED")
            print(f"{'='*60}")
            print(f"Course Code: {course_info['course_code']}")
            print(f"Course Name: {course_info['course_name']}")
            
            # Get the last filepath for this course from multiple sheet
            driver = None
            course_code = course_info['course_code']
            last_filepath = self.get_last_filepath_by_course(course_code)
            
            if last_filepath:
                print(f"\nOpening scraped HTML file: {last_filepath}")
                
                # Resolve the path first so Chrome is launched only when there is a file to show
                from pathlib import Path
                abs_path = os.path.abspath(last_filepath)
                file_path = Path(abs_path)
                
                if file_path.exists():
                    try:
                        # Setup Chrome options
                        chrome_options = Options()
                        chrome_options.add_argument("--new-window")
                        chrome_options.add_argument("--start-maximized")
                        
                        # Initialize driver
                        driver = webdriver.Chrome(options=chrome_options)
                        
                        # Use pathlib's as_uri() method for a proper file:// URL
                        driver.get(file_path.as_uri())
                        print("✅ Scraped HTML file opened in browser")
                        print("📋 Review the course content to determine the correct faculty")
                    except Exception as e:
                        print(f"⚠️ Could not open HTML file: {e}")
                        print("📋 Proceeding without file preview")
                else:
                    print(f"⚠️ HTML file not found: {abs_path}")
                    print("📋 Proceeding without file preview")
            else:
                print(f"⚠️ No scraped HTML file found for course {course_code}")
                print("📋 Proceeding without file preview")
            
            # Show existing faculties
            print("\nExisting Faculty Options:")
            faculty_list = sorted(self.faculties_cache.values(), key=lambda x: x['id'])
            for faculty in faculty_list:
                print(f"{faculty['id']}. {faculty['name']} ({faculty['acronym']})")
            
            print("\n0. Skip (will need manual review)")
            print("99. Create new faculty")
            
            while True:
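                # default=0 avoids a ValueError when no faculties exist yet
                max_existing_id = max((f['id'] for f in faculty_list), default=0)
                choice = input(f"\nEnter faculty number (0-{max_existing_id}, 99): ").strip()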
                
                if choice == '0':
                    faculty_id = None
                    break
                elif choice == '99':
                    # Create new faculty
                    print("\n📝 Creating new faculty:")
                    faculty_name = input("Enter faculty name: ").strip()
                    faculty_acronym = input("Enter faculty acronym (e.g., SCIS): ").strip().upper()
                    faculty_url = input("Enter faculty website URL (or press Enter for default): ").strip()
                    
                    if not faculty_url:
                        faculty_url = f"https://smu.edu.sg/{faculty_acronym.lower()}"
                    
                    # Increment faculty ID
                    max_faculty_id += 1
                    new_faculty = {
                        'id': max_faculty_id,
                        'name': faculty_name,
                        'acronym': faculty_acronym,
                        'site_url': faculty_url,
                        'belong_to_university': 1,  # SMU
                        'created_at': datetime.now().isoformat(),
                        'updated_at': datetime.now().isoformat()
                    }
                    
                    # Add to cache
                    self.faculties_cache[max_faculty_id] = new_faculty
                    self.faculty_acronym_to_id[faculty_acronym] = max_faculty_id
                    
                    # Save to new_faculties list
                    if not hasattr(self, 'new_faculties'):
                        self.new_faculties = []
                    self.new_faculties.append(new_faculty)
                    
                    faculty_id = max_faculty_id
                    print(f"✅ Created new faculty: {faculty_name} (ID: {faculty_id})")
                    break
                else:
                    try:
                        faculty_id = int(choice)
                        if faculty_id in [f['id'] for f in faculty_list]:
                            break
                        else:
                            print("Invalid choice. Please enter a valid faculty ID.")
                    except ValueError:
                        print("Invalid input. Please enter a number.")
            
            # Close browser after selection
            if driver:
                try:
                    print("\n🔄 Closing browser...")
                    driver.quit()
                except Exception as e:
                    print(f"⚠️ Error closing browser: {e}")
            
            # Store assignment
            faculty_assignments.append({
                'course_id': course_info['course_id'],
                'course_code': course_info['course_code'],
                'faculty_id': faculty_id
            })
        
        # Apply assignments
        for assignment in faculty_assignments:
            if assignment['faculty_id'] is not None:
                # Update new_courses
                for course in self.new_courses:
                    if course['id'] == assignment['course_id']:
                        course['belong_to_faculty'] = assignment['faculty_id']
                        break
                
                # Update cache
                if assignment['course_code'] in self.courses_cache:
                    self.courses_cache[assignment['course_code']]['belong_to_faculty'] = assignment['faculty_id']
        
        # Save outputs
        if self.new_courses:
            df = pd.DataFrame(self.new_courses)
            df.to_csv(os.path.join(self.verify_dir, 'new_courses.csv'), index=False)
            logger.info("✅ Updated new_courses.csv with faculty assignments")
        
        if hasattr(self, 'new_faculties') and self.new_faculties:
            df = pd.DataFrame(self.new_faculties)
            df.to_csv(os.path.join(self.verify_dir, 'new_faculties.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_faculties)} new faculties")
        
        logger.info("✅ Faculty assignment completed")

    # Backward-compatible alias so callers can keep using the shorter method name
    def assign_course_faculties(self):
        """Alias for assign_course_faculties_interactive"""
        return self.assign_course_faculties_interactive()

    def process_acad_terms(self):
        """Process academic terms from standalone sheet"""
        logger.info("📅 Processing academic terms...")
        
        # Group by (acad_year_start, acad_year_end, term)
        term_groups = defaultdict(list)
        
        for _, row in self.standalone_data.iterrows():
            # Try to extract from row data first
            year_start = row.get('acad_year_start')
            year_end = row.get('acad_year_end')
            term = row.get('term')
            
            # If any are missing, try to extract from source file path if available
            if pd.isna(year_start) or pd.isna(year_end) or pd.isna(term):
                if 'source_file' in row and pd.notna(row['source_file']):
                    fallback_term_id = self.extract_acad_term_from_path(row['source_file'])
                    if fallback_term_id:
                        # Parse the fallback
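                        match = re.match(r'AY(\d{4})(\d{2})T(\w+)', fallback_term_id)
                        if match:
                            parsed_start = int(match.group(1))
                            # The ID carries only the last two digits of the end year;
                            # expand it to four digits so it matches the row-level format
                            parsed_end = (parsed_start // 100) * 100 + int(match.group(2))
                            if parsed_end < parsed_start:
                                parsed_end += 100  # Century rollover, e.g. AY209900
                            year_start = parsed_start if pd.isna(year_start) else year_start
                            year_end = parsed_end if pd.isna(year_end) else year_end
                            term = f"T{match.group(3)}" if pd.isna(term) else term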
            
            key = (year_start, year_end, term)
            if all(pd.notna(v) for v in key):
                term_groups[key].append(row)
        
        # Build one academic term record per unique (year_start, year_end, term) group
        for (year_start, year_end, term), rows in term_groups.items():
            # Generate acad_term_id (keep T for ID)
            acad_term_id = f"AY{int(year_start)}{int(year_end) % 100:02d}{term}"
            
            # Check if already exists
            if acad_term_id in self.acad_term_cache:
                continue
            
            # Find most common period_text and dates
            period_counter = Counter()
            date_info = {}
            
            for row in rows:
                period_text = row.get('period_text', '')
                if pd.notna(period_text):
                    period_counter[period_text] += 1
                    if period_text not in date_info:
                        date_info[period_text] = {
                            'start_dt': row.get('start_dt'),
                            'end_dt': row.get('end_dt')
                        }
            
            # Get most common period
            if period_counter:
                most_common_period = period_counter.most_common(1)[0][0]
                dates = date_info[most_common_period]
            else:
                dates = {'start_dt': None, 'end_dt': None}
            
            # Get boss_id from first row
            boss_id = rows[0].get('acad_term_boss_id')
            
            # Remove T prefix from term field for database storage
            clean_term = str(term)[1:] if str(term).startswith('T') else str(term)
            
            new_term = {
                'id': acad_term_id,
                'acad_year_start': int(year_start),
                'acad_year_end': int(year_end),
                'term': clean_term,  # Store without T prefix
                'boss_id': int(boss_id) if pd.notna(boss_id) else None,
                'start_dt': dates['start_dt'],
                'end_dt': dates['end_dt']
            }
            
            self.new_acad_terms.append(new_term)
            self.acad_term_cache[acad_term_id] = new_term
            
            logger.info(f"✅ Created academic term: {acad_term_id} (term: {clean_term})")
        
        logger.info(f"✅ Created {len(self.new_acad_terms)} new academic terms")

    def process_classes(self, use_db_cache_for_classes=False):
        """Process classes from standalone sheet with proper field mapping and deduplication"""
        logger.info("🏫 Processing classes...")
        logger.info(f"   Using db_cache for classes: {use_db_cache_for_classes}")
        
        # Load existing classes if using cache
        if use_db_cache_for_classes:
            self.load_existing_classes_cache()
        
        processed_classes = set()
        validation_errors = []
        successful_creates = 0
        
        for idx, row in self.standalone_data.iterrows():
            try:
                course_code = row.get('course_code')
                section = row.get('section')
                acad_term_id = row.get('acad_term_id')
                boss_id = row.get('class_boss_id')  
                record_key = row.get('record_key')
                
                # Less strict validation - only require essential fields
                if pd.isna(course_code) or pd.isna(section) or pd.isna(acad_term_id):
                    validation_errors.append({
                        'row': idx,
                        'errors': ['missing_essential_fields'],
                        'course_code': course_code,
                        'section': section,
                        'acad_term_id': acad_term_id
                    })
                    continue
                
                # Enhanced course lookup with new_courses fallback
                course_id = None
                if course_code in self.courses_cache:
                    course_id = self.courses_cache[course_code]['id']
                else:
                    # Check in newly created courses
                    for course in self.new_courses:
                        if course['code'] == course_code:
                            course_id = course['id']
                            break
                
                if not course_id:
                    validation_errors.append({
                        'row': idx,
                        'errors': ['course_not_found'],
                        'course_code': course_code,
                        'section': section,
                        'acad_term_id': acad_term_id
                    })
                    continue
                
                # Check academic term exists
                if acad_term_id not in self.acad_term_cache:
                    validation_errors.append({
                        'row': idx,
                        'errors': ['acad_term_not_found'],
                        'course_code': course_code,
                        'section': section,
                        'acad_term_id': acad_term_id
                    })
                    continue
                
                # Professor lookup - get list of unique (professor_id, original_name) tuples
                professor_mappings = self._find_professors_for_class(record_key) if record_key else []
                
                # Allow classes without professors (log but continue)
                if not professor_mappings:
                    logger.warning(f"⚠️ No professors found for class: {course_code}-{section} - creating class anyway")
                    professor_mappings = [(None, '')]
                
                # Set warn_inaccuracy=True only when multiple DIFFERENT professors teach the same course/section/term
                unique_professors = set(prof_id for prof_id, _ in professor_mappings if prof_id is not None)
                warn_inaccuracy = len(unique_professors) > 1
                
                # Create class records - one per UNIQUE professor
                class_ids_created = []
                
                for prof_id, prof_name in professor_mappings:
                    class_id = str(uuid.uuid4())
                    
                    new_class = {
                        'id': class_id,
                        'section': str(section),
                        'course_id': course_id,
                        'professor_id': prof_id,
                        'acad_term_id': acad_term_id,
                        'created_at': datetime.now().isoformat(),
                        'updated_at': datetime.now().isoformat(),
                        'grading_basis': row.get('grading_basis'),
                        'course_outline_url': row.get('course_outline_url'),
                        'boss_id': int(boss_id) if pd.notna(boss_id) else None,
                        'raw_professor_name': prof_name,  # Individual professor name, not the full string
                        'warn_inaccuracy': warn_inaccuracy  # Only True when multiple different professors
                    }
                    
                    self.new_classes.append(new_class)
                    class_ids_created.append(class_id)
                    successful_creates += 1
                    self.stats['classes_created'] += 1
                
                # Update class_id_mapping to store list of class IDs
                if record_key and class_ids_created:
                    self.class_id_mapping[record_key] = class_ids_created
                
                # Mark this course/section/term combination as processed
                class_key = (str(course_code), str(section), str(acad_term_id))
                processed_classes.add(class_key)
                    
            except Exception as e:
                validation_errors.append({
                    'row': idx,
                    'errors': [f'exception: {str(e)}'],
                    'course_code': row.get('course_code'),
                    'section': row.get('section'),
                    'acad_term_id': row.get('acad_term_id')
                })
                logger.error(f"❌ Exception processing row {idx}: {e}")
        
        # Enhanced reporting
        logger.info(f"✅ Successfully created {successful_creates} classes")
        logger.info(f"⚠️ {len(validation_errors)} validation errors encountered")
        
        # Save validation errors for analysis
        if validation_errors:
            error_df = pd.DataFrame(validation_errors)
            error_path = os.path.join(self.output_base, 'class_validation_errors.csv')
            error_df.to_csv(error_path, index=False)
            logger.info(f"💾 Saved validation errors to {error_path}")
        
        return len(self.new_classes) > 0

    def _get_original_professor_names(self, record_key: str) -> List[str]:
        """Extract original professor names from multiple sheet for debugging"""
        rows = self.multiple_lookup.get(record_key, [])
        professor_names = []
        
        for row in rows:
            if pd.notna(row.get('professor_name')):
                professor_name = str(row['professor_name']).strip()
                if professor_name and professor_name not in professor_names:
                    professor_names.append(professor_name)
        
        return professor_names
        
    def _find_professors_for_class(self, record_key: str) -> List[tuple]:
        """Find professor IDs for a class and return list of (professor_id, original_name) tuples
        Deduplicates by professor_id to avoid creating multiple class records for same professor"""
        if not record_key or pd.isna(record_key):
            return []
        
        rows = self.multiple_lookup.get(record_key, [])
        professor_mappings = []
        seen_professor_ids = set()  # Track unique professor IDs
        
        # Ensure professor lookup is loaded
        if not hasattr(self, 'professor_lookup_loaded'):
            self.load_professor_lookup_csv()
        
        for row in rows:
            prof_name_raw = row.get('professor_name')
            
            # Skip missing values (None or NaN) coming from raw_data.xlsx
            if prof_name_raw is None or pd.isna(prof_name_raw):
                continue
            
            # Convert to string and strip - handles float NaN properly
            original_prof_name = str(prof_name_raw).strip()
            
            # Skip empty strings and 'nan' strings
            if not original_prof_name or original_prof_name.lower() == 'nan':
                continue
            
            # Split the professor names intelligently
            split_professors = self._split_professor_names(original_prof_name)
            
            # Process each split professor
            for prof_name in split_professors:
                if prof_name and prof_name.strip():  # Additional check for empty strings
                    prof_id = self._lookup_professor_with_fallback(prof_name.strip())
                    if prof_id and prof_id not in seen_professor_ids:
                        professor_mappings.append((prof_id, prof_name.strip()))
                        seen_professor_ids.add(prof_id)
        
        return professor_mappings

    def _split_professor_names(self, prof_name: str) -> List[str]:
        """Intelligently split professor names with improved comma-based parsing"""
        # Handle None, NaN, and non-string values from raw_data.xlsx
        if prof_name is None or pd.isna(prof_name):
            return []
        
        # Ensure it's a string and handle float NaN
        prof_name = str(prof_name).strip()
        
        # Check for empty string or 'nan' string after conversion
        if not prof_name or prof_name.lower() == 'nan':
            return []
        
        # Step 1: Check if the entire name exists in professor_lookup (single professor)
        prof_name_upper = prof_name.upper()
        if hasattr(self, 'professor_lookup') and prof_name_upper in self.professor_lookup:
            return [prof_name]
        
        # Step 2: Check hardcoded multi-instructor combinations
        multi_instructor_combinations = {
            "ERIC YEE SHIN CHONG, MANDY THAM": ["ERIC YEE SHIN CHONG", "MANDY THAM"],
            "ZHENG ZHICHAO, DANIEL, TAN KAR WAY": ["ZHENG ZHICHAO, DANIEL", "TAN KAR WAY"],
            "KAM WAI WARREN BARTHOLOMEW CHIK, LANX GOH": ["KAM WAI WARREN BARTHOLOMEW CHIK", "LANX GOH"],
            "ANDREW MIN HAN CHIN, DANIEL TAN": ["ANDREW MIN HAN CHIN", "DANIEL TAN"],
            "PAUL GRIFFIN, TA NGUYEN BINH DUONG": ["PAUL GRIFFIN", "TA NGUYEN BINH DUONG"],
            "ANDREW MIN HAN CHIN, JUNJI SUMITANI": ["ANDREW MIN HAN CHIN", "JUNJI SUMITANI"],
            "DAVID GOMULYA, LIM CHON PHUNG, AJAY MAKHIJA": ["DAVID GOMULYA", "LIM CHON PHUNG", "AJAY MAKHIJA"],
            "JACK HONG JIAJUN, ANG SER KENG": ["JACK HONG JIAJUN", "ANG SER KENG"],
            "DAVID GOMULYA, DAVID LLEWELYN": ["DAVID GOMULYA", "DAVID LLEWELYN"],
            "TERENCE FAN PING-CHING, JONATHAN TEE": ["TERENCE FAN PING-CHING", "JONATHAN TEE"],
            "RONG WANG, CHENG QIANG, CHEN XIA, LIANDONG ZHANG, WANG JIWEI, YUE HENG": ["RONG WANG", "CHENG QIANG", "CHEN XIA", "LIANDONG ZHANG", "WANG JIWEI", "YUE HENG"],
            "PASCALE CRAMA, ARNOUD DE MEYER": ["PASCALE CRAMA", "ARNOUD DE MEYER"],
            "TERENCE FAN PING-CHING, WILSON TENG": ["TERENCE FAN PING-CHING", "WILSON TENG"],
            "ANDREW MIN HAN CHIN, LI JIN": ["ANDREW MIN HAN CHIN", "LI JIN"],
            "ONG, BENJAMIN JOSHUA, EUGENE TAN KHENG BOON": ["ONG, BENJAMIN JOSHUA", "EUGENE TAN KHENG BOON"],
            "MANDY THAM, ERIC YEE SHIN CHONG": ["MANDY THAM", "ERIC YEE SHIN CHONG"],
            "TERENCE FAN PING-CHING, RUTH CHIANG": ["TERENCE FAN PING-CHING", "RUTH CHIANG"],
            "JARED POON JUN KEAT, CHAM YANWEI, DERRICK": ["JARED POON JUN KEAT", "CHAM YANWEI, DERRICK"],
            "DAVID GOMULYA, SZE TIAM LIN": ["DAVID GOMULYA", "SZE TIAM LIN"],
            "ANDREW MIN HAN CHIN, JAY WONG": ["ANDREW MIN HAN CHIN", "JAY WONG"],
            "MARK CHONG YIEW KIM, VICTOR OCAMPO": ["MARK CHONG YIEW KIM", "VICTOR OCAMPO"],
            "TSE, JUSTIN K, AIDAN WONG": ["TSE, JUSTIN K", "AIDAN WONG"],
            "TANG HONG WEE, GERALD SEAH, MUHAMMED AMEER S/O MOHAMED NOOR, LAU MENG YAN": ["TANG HONG WEE", "GERALD SEAH", "MUHAMMED AMEER S/O MOHAMED NOOR", "LAU MENG YAN"],
            "AURELIO GURREA MARTINEZ, LOH SONG-EN, SAMUEL": ["AURELIO GURREA MARTINEZ", "LOH SONG-EN, SAMUEL"],
            "CHNG SHUQI, AMELIA CHUA, MUHAMMED AMEER S/O MOHAMED NOOR": ["CHNG SHUQI", "AMELIA CHUA", "MUHAMMED AMEER S/O MOHAMED NOOR"]
        }
        
        # Keys are uppercase, so compare against the uppercased name
        if prof_name_upper in multi_instructor_combinations:
            return multi_instructor_combinations[prof_name_upper]
        
        # Step 3: Intelligent comma-based splitting with progressive matching
        # Split by comma first
        comma_parts = [part.strip() for part in prof_name.split(',') if part.strip()]
        
        # If only one part (no commas), treat as single professor
        if len(comma_parts) <= 1:
            return [prof_name]
        
        # Get all boss_names from professor_lookup for matching
        boss_names = set()
        if hasattr(self, 'professor_lookup'):
            for key in self.professor_lookup.keys():
                if key is not None and not pd.isna(key):
                    key_str = str(key).strip()
                    if key_str and key_str.lower() != 'nan':
                        boss_names.add(key_str.upper())
        
        professors_found = []
        i = 0
        
        while i < len(comma_parts):
            matched = False
            
            # Try progressive matching: add more comma parts until we find a match
            for j in range(i + 1, len(comma_parts) + 1):
                candidate = ', '.join(comma_parts[i:j])
                candidate_upper = candidate.upper()
                
                # Check for exact match
                if candidate_upper in boss_names:
                    professors_found.append(candidate)
                    i = j  # Move past all used parts
                    matched = True
                    break
                
                # Check for partial word match (all words in candidate must be in some boss_name)
                candidate_words = set(candidate.replace(',', ' ').split())
                for boss_name in boss_names:
                    boss_words = set(boss_name.replace(',', ' ').split())
                    if candidate_words.issubset(boss_words) and len(candidate_words) >= 2:
                        professors_found.append(candidate)
                        i = j  # Move past all used parts
                        matched = True
                        break
                
                if matched:
                    break
            
            # If no match was found, fall back to heuristics on the current part.
            # Each branch advances i exactly once so that no part is skipped.
            if not matched:
                single_part = comma_parts[i]
                words_in_part = single_part.split()
                
                if len(words_in_part) >= 2:
                    # Multi-word part: accept it as a professor name
                    professors_found.append(single_part)
                    i += 1
                elif i + 1 < len(comma_parts):
                    # Single word: combine with the next part (e.g. "SURNAME, GIVEN NAMES")
                    combined = f"{single_part} {comma_parts[i + 1]}"
                    professors_found.append(combined)
                    i += 2  # Consume both parts
                else:
                    # Last resort: single word, but log warning
                    logger.warning(f"⚠️ Single word professor detected (may be parsing error): '{single_part}' from '{prof_name}'")
                    professors_found.append(single_part)
                    i += 1
        
        # Final validation: remove any single-word results if there are multi-word alternatives
        valid_professors = []
        for prof in professors_found:
            prof_words = prof.strip().split()
            if len(prof_words) >= 2 or len(professors_found) == 1:  # Keep single words only if it's the only result
                valid_professors.append(prof)
        
        # If we couldn't split intelligently, fall back to treating as single professor
        if not valid_professors:
            return [prof_name]
        
        return valid_professors

    def _lookup_professor_with_fallback(self, prof_name: str) -> Optional[str]:
        """Enhanced professor lookup with multiple fallback strategies"""
        
        # Handle None, NaN, and ensure it's a string
        if prof_name is None or pd.isna(prof_name):
            return None
        
        # Ensure prof_name is a string and handle 'nan' strings
        prof_name = str(prof_name).strip()
        
        if not prof_name or prof_name.lower() == 'nan':
            return None
        
        # Strategy 1: Direct boss_name lookup in professor_lookup.csv
        normalized_name = prof_name.upper()
        if hasattr(self, 'professor_lookup') and normalized_name in self.professor_lookup:
            return self.professor_lookup[normalized_name]['database_id']
        
        # Strategy 2: Try variations of the name
        variations = [
            prof_name.strip(),  # Original
            prof_name.strip().upper(),  # Uppercase
            prof_name.replace(',', '').strip().upper(),  # Remove commas
            ' '.join(prof_name.replace(',', ' ').split()).upper()  # Normalize spaces
        ]
        
        if hasattr(self, 'professor_lookup'):
            for variation in variations:
                if variation in self.professor_lookup:
                    return self.professor_lookup[variation]['database_id']
        
        # Strategy 3: Fuzzy matching against boss_name keys (100% certain only)
        if hasattr(self, 'professor_lookup'):
            for lookup_name in self.professor_lookup.keys():
                if self._names_match_fuzzy_exact(normalized_name, lookup_name):
                    return self.professor_lookup[lookup_name]['database_id']
        
        # Strategy 4: Check if professor exists in database cache
        if normalized_name in self.professors_cache:
            return self.professors_cache[normalized_name]['id']
        
        # Strategy 5: Create new professor with email duplicate check
        logger.warning(f"⚠️ Professor not found in lookup: {prof_name} - creating new professor")
        
        try:
            boss_name, afterclass_name = self.normalize_professor_name(prof_name)
            
            # Check if already created in this session
            for new_prof in self.new_professors:
                if new_prof['boss_name'] == boss_name:
                    return new_prof['id']
            
            # Create new professor
            professor_id = str(uuid.uuid4())
            slug = re.sub(r'[^a-zA-Z0-9]+', '-', afterclass_name.lower()).strip('-')
            
            # Resolve email using Outlook
            resolved_email = self.resolve_professor_email(afterclass_name)
            
            # Check if email already exists in existing professors
            if resolved_email:
                # Check in professors_cache
                for cached_boss_name, cached_prof in self.professors_cache.items():
                    if 'email' in cached_prof and cached_prof['email'] and cached_prof['email'].lower() == resolved_email.lower():
                        # Email already exists - use existing professor
                        logger.info(f"✅ Email duplicate found during fallback - using existing professor: {prof_name} → {cached_prof.get('name')} (email: {resolved_email})")
                        
                        # Update lookup to point to existing professor
                        if not hasattr(self, 'professor_lookup'):
                            self.professor_lookup = {}
                        self.professor_lookup[prof_name.upper()] = {
                            'database_id': cached_prof['id'],
                            'boss_name': boss_name,
                            'afterclass_name': cached_prof.get('name', afterclass_name)
                        }
                        
                        # Also add boss_name as lookup key
                        self.professor_lookup[boss_name.upper()] = {
                            'database_id': cached_prof['id'],
                            'boss_name': boss_name,
                            'afterclass_name': cached_prof.get('name', afterclass_name)
                        }
                        
                        return cached_prof['id']
                
                # Also check in already created new professors
                for new_prof in self.new_professors:
                    if 'email' in new_prof and new_prof['email'] and new_prof['email'].lower() == resolved_email.lower():
                        # Email already exists in new professors - use that one
                        logger.info(f"✅ Email duplicate found in new professors - using existing: {prof_name} → {new_prof['name']} (email: {resolved_email})")
                        
                        # Update lookup to point to existing professor
                        if not hasattr(self, 'professor_lookup'):
                            self.professor_lookup = {}
                        self.professor_lookup[prof_name.upper()] = {
                            'database_id': new_prof['id'],
                            'boss_name': boss_name,
                            'afterclass_name': new_prof['afterclass_name']
                        }
                        
                        # Also add boss_name as lookup key
                        self.professor_lookup[boss_name.upper()] = {
                            'database_id': new_prof['id'],
                            'boss_name': boss_name,
                            'afterclass_name': new_prof['afterclass_name']
                        }
                        
                        return new_prof['id']
            
            # No email duplicate found - create new professor
            new_prof = {
                'id': professor_id,
                'name': afterclass_name,
                'email': resolved_email,
                'slug': slug,
                'photo_url': 'https://smu.edu.sg',
                'profile_url': 'https://smu.edu.sg',
                'belong_to_university': 1,  # SMU
                'created_at': datetime.now().isoformat(),
                'updated_at': datetime.now().isoformat(),
                'boss_name': boss_name,
                'afterclass_name': afterclass_name,
                'original_scraped_name': prof_name
            }
            
            self.new_professors.append(new_prof)
            self.stats['professors_created'] += 1
            
            # Update lookup
            if not hasattr(self, 'professor_lookup'):
                self.professor_lookup = {}
            self.professor_lookup[prof_name.upper()] = {
                'database_id': professor_id,
                'boss_name': boss_name,
                'afterclass_name': afterclass_name
            }
            
            # Also add boss_name as lookup key
            self.professor_lookup[boss_name.upper()] = {
                'database_id': professor_id,
                'boss_name': boss_name,
                'afterclass_name': afterclass_name
            }
            
            logger.info(f"✨ Created new professor for verification: {afterclass_name} with email: {resolved_email}")
            return professor_id
            
        except Exception as e:
            logger.error(f"❌ Error creating new professor for '{prof_name}': {e}")
            return None

    def _names_match_fuzzy(self, name1: str, name2: str) -> bool:
        """Loose name match: True if every word of the shorter name appears in the longer one.

        The comparison is case-sensitive and ignores commas and extra whitespace.
        """

        # Ensure both names are strings
        name1 = str(name1) if name1 is not None else ""
        name2 = str(name2) if name2 is not None else ""

        # Treat commas as separators and split into word lists
        words1 = name1.replace(',', ' ').split()
        words2 = name2.replace(',', ' ').split()

        # An empty name must never match: all() over an empty list is always True
        if not words1 or not words2:
            return False

        # Check if all words in the shorter name appear in the longer name
        if len(words1) <= len(words2):
            return all(word in words2 for word in words1)
        else:
            return all(word in words1 for word in words2)
        
    def _names_match_fuzzy_exact(self, name1: str, name2: str) -> bool:
        """Strict name match: True only if both names are identical after normalization.

        Normalization replaces commas and periods with spaces, collapses whitespace,
        and upper-cases the result.
        """
        
        # A missing name never matches
        if name1 is None or name2 is None:
            return False
        
        # Remove common variations and normalize
        clean1 = ' '.join(str(name1).replace(',', ' ').replace('.', ' ').split()).upper()
        clean2 = ' '.join(str(name2).replace(',', ' ').replace('.', ' ').split()).upper()
        
        # Only return True if they are exactly the same after cleaning
        return clean1 == clean2

    def load_professor_lookup_csv(self):
        """Load professor lookup CSV once and cache it properly"""
        # Skip if already loaded to prevent repeated loading
        if getattr(self, 'professor_lookup_loaded', False):
            return
        
        lookup_file = 'script_input/professor_lookup.csv'
        
        if not os.path.exists(lookup_file):
            logger.warning("📋 professor_lookup.csv not found - will use database cache only")
            self.professor_lookup_loaded = True
            return
        
        try:
            # Load the CSV file
            lookup_df = pd.read_csv(lookup_file)
            
            # Validate required columns exist
            required_cols = ['boss_name', 'afterclass_name', 'database_id', 'method']
            missing_cols = [col for col in required_cols if col not in lookup_df.columns]
            if missing_cols:
                logger.error(f"❌ professor_lookup.csv missing required columns: {missing_cols}")
                self.professor_lookup_loaded = True
                return
            
            # Clear existing lookup and load fresh data
            self.professor_lookup = {}
            loaded_count = 0
            
            for _, row in lookup_df.iterrows():
                boss_name = row.get('boss_name')
                afterclass_name = row.get('afterclass_name')
                database_id = row.get('database_id')
                
                # Skip rows with critical missing values
                if pd.isna(boss_name) or pd.isna(database_id):
                    continue
                    
                # Use the stripped, upper-cased boss_name as the lookup key
                boss_name_key = str(boss_name).strip().upper()
                self.professor_lookup[boss_name_key] = {
                    'database_id': str(database_id),
                    'boss_name': str(boss_name),
                    'afterclass_name': str(afterclass_name) if not pd.isna(afterclass_name) else str(boss_name)
                }
                loaded_count += 1
            
            logger.info(f"✅ Loaded {loaded_count} entries from professor_lookup.csv")
            self.professor_lookup_loaded = True
            
        except Exception as e:
            logger.error(f"❌ Error loading professor_lookup.csv: {e}")
            logger.info("📋 Continuing with database cache only")
            self.professor_lookup_loaded = True

    def _create_new_professor(self, prof_name: str) -> str:
        """Create a new professor record for verification"""
        boss_name, afterclass_name = self.normalize_professor_name(prof_name)
        
        # Check if already created in this session
        for new_prof in self.new_professors:
            if new_prof['boss_name'] == boss_name:
                return new_prof['id']
        
        # Create new professor
        professor_id = str(uuid.uuid4())
        slug = re.sub(r'[^a-zA-Z0-9]+', '-', afterclass_name.lower()).strip('-')
        
        # Resolve email using Outlook (same as in process_professors)
        resolved_email = self.resolve_professor_email(afterclass_name)
        
        new_prof = {
            'id': professor_id,
            'name': afterclass_name,
            'email': resolved_email,  # Now using Outlook resolution
            'slug': slug,
            'photo_url': 'https://smu.edu.sg',
            'profile_url': 'https://smu.edu.sg',
            'belong_to_university': 1,  # SMU
            'created_at': datetime.now().isoformat(),
            'updated_at': datetime.now().isoformat(),
            'boss_name': boss_name,
            'afterclass_name': afterclass_name,
            'original_scraped_name': prof_name
        }
        
        self.new_professors.append(new_prof)
        self.stats['professors_created'] += 1
        
        # Update lookup (keys are upper-cased, matching load_professor_lookup_csv)
        self.professor_lookup[boss_name.upper()] = {
            'database_id': professor_id,
            'boss_name': boss_name,
            'afterclass_name': afterclass_name
        }
        
        logger.info(f"✨ Created new professor for verification: {afterclass_name} with email: {resolved_email}")
        return professor_id

    def process_timings(self):
        """Build class timing and exam timing records from the rows of self.multiple_data"""
        logger.info("⏰ Processing class timings and exam timings...")
        
        for _, row in self.multiple_data.iterrows():
            record_key = row.get('record_key')
            if record_key not in self.class_id_mapping:
                continue
            
            # Get all class IDs for this record_key (now a list)
            class_ids = self.class_id_mapping[record_key]
            if not isinstance(class_ids, list):
                class_ids = [class_ids]  # Ensure it's a list for backward compatibility
            
            timing_type = row.get('type', 'CLASS')
            
            # Create timing records for each class ID
            for class_id in class_ids:
                if timing_type == 'CLASS':
                    timing_record = {
                        'class_id': class_id,
                        'start_date': row.get('start_date'),
                        'end_date': row.get('end_date'),
                        'day_of_week': row.get('day_of_week'),
                        'start_time': row.get('start_time'),
                        'end_time': row.get('end_time'),
                        'venue': row.get('venue', '')
                    }
                    self.new_class_timings.append(timing_record)
                    self.stats['timings_created'] += 1
                
                elif timing_type == 'EXAM':
                    exam_record = {
                        'class_id': class_id,
                        'date': row.get('date'),
                        'day_of_week': row.get('day_of_week'),
                        'start_time': row.get('start_time'),
                        'end_time': row.get('end_time'),
                        'venue': row.get('venue')
                    }
                    self.new_class_exam_timings.append(exam_record)
                    self.stats['exams_created'] += 1
        
        logger.info(f"✅ Created {self.stats['timings_created']} class timings")
        logger.info(f"✅ Created {self.stats['exams_created']} exam timings")
        
    def save_outputs(self):
        """Save all generated CSV files"""
        logger.info("💾 Saving output files...")
        
        # In Phase 2, professors have already been saved and manually corrected,
        # so new_professors.csv is only written outside Phase 2
        if self.new_professors and not getattr(self, '_phase2_mode', False):
            df = pd.DataFrame(self.new_professors)
            df.to_csv(os.path.join(self.verify_dir, 'new_professors.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_professors)} new professors")
        
        # Save new courses (to verify folder)
        if self.new_courses:
            df = pd.DataFrame(self.new_courses)
            df.to_csv(os.path.join(self.verify_dir, 'new_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_courses)} new courses")
        
        # Save course updates
        if self.update_courses:
            df = pd.DataFrame(self.update_courses)
            df.to_csv(os.path.join(self.output_base, 'update_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.update_courses)} course updates")
        
        # Save academic terms
        if self.new_acad_terms:
            df = pd.DataFrame(self.new_acad_terms)
            df.to_csv(os.path.join(self.output_base, 'new_acad_term.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_acad_terms)} academic terms")
        
        # Save classes
        if self.new_classes:
            df = pd.DataFrame(self.new_classes)
            df.to_csv(os.path.join(self.output_base, 'new_classes.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_classes)} classes")
        
        # Save class timings
        if self.new_class_timings:
            df = pd.DataFrame(self.new_class_timings)
            df.to_csv(os.path.join(self.output_base, 'new_class_timing.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_class_timings)} class timings")
        
        # Save exam timings
        if self.new_class_exam_timings:
            df = pd.DataFrame(self.new_class_exam_timings)
            df.to_csv(os.path.join(self.output_base, 'new_class_exam_timing.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_class_exam_timings)} exam timings")
        
        # Save courses needing faculty assignment
        if self.courses_needing_faculty:
            df = pd.DataFrame(self.courses_needing_faculty)
            df.to_csv(os.path.join(self.output_base, 'courses_needing_faculty.csv'), index=False)
            logger.info(f"✅ Saved {len(self.courses_needing_faculty)} courses needing faculty assignment")
        
        # Create placeholder files only if they don't exist
        placeholders = ['new_bid_window.csv', 'new_class_availability.csv', 'new_bid_result.csv']
        for filename in placeholders:
            filepath = os.path.join(self.output_base, filename)
            if not os.path.exists(filepath):
                df = pd.DataFrame()
                df.to_csv(filepath, index=False)
                logger.info(f"✅ Created placeholder: {filename}")

    def _save_professor_lookup(self):
        """Save updated professor lookup table"""
        lookup_data = []
        
        # Add all professors from lookup
        for scraped_name, data in self.professor_lookup.items():
            lookup_data.append({
                'boss_name': data.get('boss_name', scraped_name.upper()),
                'afterclass_name': data.get('afterclass_name', scraped_name),
                'database_id': data['database_id'],
                # Match on id (as in _save_corrected_professor_lookup): lookup keys may differ in case from the scraped name
                'method': 'created' if any(p['id'] == data['database_id'] for p in self.new_professors) else 'exists'
            })
        
        # Sort by boss_name (the entries have no 'scraped_name' field)
        lookup_data.sort(key=lambda x: str(x['boss_name']))
        
        # Save to output folder
        df = pd.DataFrame(lookup_data)
        df.to_csv(os.path.join(self.output_base, 'professor_lookup.csv'), index=False)
        logger.info(f"✅ Saved updated professor lookup with {len(lookup_data)} entries")

    def update_professor_lookup_from_corrected_csv(self):
        """Update professor lookup from manually corrected new_professors.csv"""
        logger.info("🔄 Updating professor lookup from corrected CSV...")
        
        # Read corrected new_professors.csv
        corrected_csv_path = os.path.join(self.verify_dir, 'new_professors.csv')
        if not os.path.exists(corrected_csv_path):
            logger.info(f"📝 No corrected CSV found: {corrected_csv_path} - assuming all professors already exist")
            return True

        try:
            # Read once, inside the try block, so a malformed file is logged instead of raising
            corrected_df = pd.read_csv(corrected_csv_path)
            if corrected_df.empty:
                logger.info("📝 Empty corrected CSV - no professors to update")
                return True

            logger.info(f"📖 Reading {len(corrected_df)} corrected professor records")
            
            # Clear and rebuild the new_professors list with corrected data
            self.new_professors = []
            
            # Update internal professor_lookup and rebuild new_professors
            updated_count = 0
            for _, row in corrected_df.iterrows():
                original_name = row.get('original_scraped_name', '')
                corrected_afterclass_name = row.get('name', '')  # This is the corrected name
                boss_name = row.get('boss_name', '')  # Keep boss name same
                professor_id = row.get('id', '')
                
                # Rebuild the professor record with corrected data
                corrected_prof = {
                    'id': professor_id,
                    'name': corrected_afterclass_name,  # Use corrected name
                    'email': row.get('email', 'enquiry@smu.edu.sg'),
                    'slug': row.get('slug', ''),
                    'photo_url': row.get('photo_url', 'https://smu.edu.sg'),
                    'profile_url': row.get('profile_url', 'https://smu.edu.sg'),
                    'belong_to_university': row.get('belong_to_university', 1),
                    'created_at': row.get('created_at', datetime.now().isoformat()),
                    'updated_at': row.get('updated_at', datetime.now().isoformat()),
                    'boss_name': boss_name,
                    'afterclass_name': corrected_afterclass_name,
                    'original_scraped_name': original_name
                }
                
                # Add to new_professors list
                self.new_professors.append(corrected_prof)
                
                # Blank CSV cells come back as NaN, which is truthy, so check explicitly
                if pd.notna(original_name) and pd.notna(professor_id) and original_name and professor_id:
                    # Same entry under three keys: original scraped name, corrected name, boss name
                    lookup_entry = {
                        'database_id': professor_id,
                        'boss_name': boss_name,  # Keep original boss name
                        'afterclass_name': corrected_afterclass_name  # Use corrected name
                    }
                    for lookup_key in (original_name, corrected_afterclass_name, boss_name):
                        if pd.notna(lookup_key) and lookup_key:
                            self.professor_lookup[lookup_key] = dict(lookup_entry)
                    updated_count += 1
            
            # Save updated professor lookup to CSV
            self._save_corrected_professor_lookup()
            
            logger.info(f"✅ Updated {updated_count} professor lookup entries")
            logger.info(f"✅ Rebuilt {len(self.new_professors)} professor records with corrections")
            return True
            
        except Exception as e:
            logger.error(f"❌ Failed to update professor lookup: {e}")
            return False

    def update_professors_with_boss_names(self):
        """Update professors with missing boss_names using professor_lookup.csv"""
        logger.info("👤 Updating professors with missing boss_names...")
        
        # Load professor_lookup.csv
        lookup_file = 'script_input/professor_lookup.csv'
        if not os.path.exists(lookup_file):
            logger.info("📋 professor_lookup.csv not found - skipping boss_name updates")
            return
        
        try:
            lookup_df = pd.read_csv(lookup_file)
            logger.info(f"📖 Loaded {len(lookup_df)} entries from professor_lookup.csv")
        except Exception as e:
            logger.error(f"❌ Error loading professor_lookup.csv: {e}")
            return
        
        # Group lookup entries by database_id to handle multiple boss_names
        lookup_groups = defaultdict(list)
        for _, row in lookup_df.iterrows():
            database_id = row.get('database_id')
            boss_name = row.get('boss_name')
            
            if pd.notna(database_id) and pd.notna(boss_name):
                lookup_groups[str(database_id)].append(str(boss_name).strip())
        
        # Find professors with empty boss_name
        professors_to_update = []
        
        for prof_key, prof_data in self.professors_cache.items():
            professor_id = prof_data.get('id')
            current_boss_name = prof_data.get('boss_name')
            
            # Check if boss_name is empty/null
            if (current_boss_name is None or 
                pd.isna(current_boss_name) or 
                str(current_boss_name).strip() == ''):
                
                # Look for this professor in the lookup
                if str(professor_id) in lookup_groups:
                    boss_names = lookup_groups[str(professor_id)]
                    
                    # Remove duplicates while preserving order
                    unique_boss_names = []
                    seen = set()
                    for name in boss_names:
                        if name not in seen:
                            unique_boss_names.append(name)
                            seen.add(name)
                    
                    # Store as JSON array for CSV compatibility
                    import json
                    boss_name_json = json.dumps(unique_boss_names)
                    
                    professors_to_update.append({
                        'id': professor_id,
                        'boss_name': boss_name_json,  # JSON array string for CSV
                        'original_boss_name': current_boss_name,
                        'boss_names_found': len(unique_boss_names)
                    })
                    
                    logger.info(f"✅ Found boss_name(s) for professor {professor_id}: {unique_boss_names}")
        
        # Save update_professor.csv
        if professors_to_update:
            df = pd.DataFrame(professors_to_update)
            update_path = os.path.join(self.output_base, 'update_professor.csv')
            df.to_csv(update_path, index=False)
            logger.info(f"✅ Saved {len(professors_to_update)} professor updates to update_professor.csv")
            
            # Update stats (self.stats is a dict, so hasattr() cannot test for a key)
            self.stats['professors_updated'] = len(professors_to_update)
        else:
            logger.info("ℹ️ No professors need boss_name updates")
            self.stats['professors_updated'] = 0

    def process_remaining_tables(self):
        """Process classes and timings after professor lookup is updated"""
        logger.info("🏫 Processing remaining tables (classes, timings)...")
        
        try:
            # Clear any existing data from Phase 1 to avoid duplicates
            self.new_classes = []
            self.new_class_timings = []
            self.new_class_exam_timings = []
            self.class_id_mapping = {}
            self.stats['classes_created'] = 0
            self.stats['timings_created'] = 0
            self.stats['exams_created'] = 0
            
            # Process classes (depends on updated professor lookup)
            self.process_classes()
            
            # Process timings (depends on classes)
            self.process_timings()
            
            logger.info("✅ Remaining tables processed successfully")
            return True
            
        except Exception as e:
            logger.error(f"❌ Failed to process remaining tables: {e}")
            return False

    def _save_corrected_professor_lookup(self):
        """Save professor lookup with corrected structure: boss_name, afterclass_name, database_id, method"""
        lookup_data = []
        seen_combinations = set()  # To track (boss_name, afterclass_name) for deduplication
        
        # Collect all unique professor entries
        all_entries = {}
        
        # From professor_lookup
        for scraped_name, data in self.professor_lookup.items():
            boss_name = data.get('boss_name', scraped_name.upper())
            afterclass_name = data.get('afterclass_name', scraped_name)
            database_id = data['database_id']
            
            # Deduplicate on (boss_name, afterclass_name). Entries are stored by boss_name,
            # so a later pair with the same boss_name replaces the earlier one.
            key = (boss_name, afterclass_name)
            if key not in seen_combinations:
                all_entries[boss_name] = {
                    'boss_name': boss_name,
                    'afterclass_name': afterclass_name,
                    'database_id': database_id,
                    'method': 'created' if any(prof['id'] == database_id for prof in self.new_professors) else 'exists'
                }
                seen_combinations.add(key)
        
        # Convert to list and sort
        lookup_data = list(all_entries.values())
        lookup_data.sort(key=lambda x: x['boss_name'])
        
        # Save to output folder
        df = pd.DataFrame(lookup_data)
        df.to_csv(os.path.join(self.output_base, 'professor_lookup.csv'), index=False)
        logger.info(f"✅ Saved updated professor lookup with {len(lookup_data)} unique entries")

    def print_summary(self):
        """Print processing summary"""
        print("\n" + "="*70)
        print("📊 PROCESSING SUMMARY")
        print("="*70)
        print(f"✅ Professors created: {self.stats['professors_created']}")
        print(f"✅ Courses created: {self.stats['courses_created']}")
        print(f"✅ Courses updated: {self.stats['courses_updated']}")
        print(f"⚠️  Courses needing faculty: {self.stats['courses_needing_faculty']}")
        print(f"✅ Classes created: {self.stats['classes_created']}")
        print(f"✅ Class timings created: {self.stats['timings_created']}")
        print(f"✅ Exam timings created: {self.stats['exams_created']}")
        print("="*70)
        
        print("\n📁 OUTPUT FILES:")
        print(f"   Verify folder: {self.verify_dir}/")
        print(f"   - new_professors.csv ({self.stats['professors_created']} records)")
        print(f"   - new_courses.csv ({self.stats['courses_created']} records)")
        print(f"   Output folder: {self.output_base}/")
        print(f"   - update_courses.csv ({self.stats['courses_updated']} records)")
        print(f"   - new_acad_term.csv ({len(self.new_acad_terms)} records)")
        print(f"   - new_classes.csv ({self.stats['classes_created']} records)")
        print(f"   - new_class_timing.csv ({self.stats['timings_created']} records)")
        print(f"   - new_class_exam_timing.csv ({self.stats['exams_created']} records)")
        print(f"   - professor_lookup.csv (updated)")
        print(f"   - courses_needing_faculty.csv ({self.stats['courses_needing_faculty']} records)")
        print("="*70)

    def run_phase1_professors_and_courses(self):
        """Phase 1: Process professors and courses with automated faculty mapping"""
        try:
            logger.info("🚀 Starting Phase 1: Professors and Courses with Automated Faculty Mapping")
            logger.info("="*60)
            
            # Load data
            if not self.load_or_cache_data():
                logger.error("❌ Failed to load database data")
                return False
            
            if not self.load_raw_data():
                logger.error("❌ Failed to load raw data")
                return False
            
            # Process professors (CSV only, no lookup update)
            self.process_professors()
            
            # Process courses
            self.process_courses()
            
            # NEW: Automated faculty mapping using BOSS data
            logger.info("\n🎓 Running automated faculty mapping...")
            try:
                self.map_courses_to_faculties_from_boss()
            except Exception as e:
                logger.warning(f"⚠️ Automated faculty mapping failed: {e}")
                logger.info("   Continuing with manual faculty assignment...")
            
            # Process academic terms
            self.process_acad_terms()
            
            # Save phase 1 outputs
            self._save_phase1_outputs()
            
            # Print faculty mapping summary
            if getattr(self, 'courses_needing_faculty', None):
                logger.info("\n📋 Faculty Assignment Summary:")
                logger.info(f"   • Automated mappings applied to {self.stats['courses_created'] - len(self.courses_needing_faculty)} courses")
                logger.info(f"   • {len(self.courses_needing_faculty)} courses still need manual review")
                
                # Show which courses need manual review
                if len(self.courses_needing_faculty) <= 10:
                    logger.info(f"   Courses needing manual review:")
                    for course_info in self.courses_needing_faculty:
                        logger.info(f"     - {course_info['course_code']}: {course_info['course_name']}")
            
            logger.info("✅ Phase 1 completed - Review files in verify/ folder")
            return True
            
        except Exception as e:
            logger.error(f"❌ Phase 1 failed: {e}")
            return False

    def run_phase2_remaining_tables(self):
        """Phase 2: Process classes and timings after professor correction"""
        try:
            logger.info("🚀 Starting Phase 2: Classes and Timings")
            logger.info("="*60)
            
            # Set phase 2 mode to prevent overwriting corrected professors
            self._phase2_mode = True
            
            # Update professor lookup from corrected CSV
            if not self.update_professor_lookup_from_corrected_csv():
                logger.error("❌ Failed to update professor lookup")
                return False
            
            # Update professors with missing boss_names
            self.update_professors_with_boss_names()
            
            # Process remaining tables
            if not self.process_remaining_tables():
                logger.error("❌ Failed to process remaining tables")
                return False
            
            # Save all outputs
            self.save_outputs()
            
            # Print summary
            self.print_summary()
            
            logger.info("✅ Phase 2 completed successfully!")
            return True
            
        except Exception as e:
            logger.error(f"❌ Phase 2 failed: {e}")
            return False

    def _save_phase1_outputs(self):
        """Save Phase 1 outputs (professors, courses, acad_terms)"""
        # Save new professors (to verify folder for manual correction)
        # Always create the file, even if empty
        df = pd.DataFrame(self.new_professors) if self.new_professors else pd.DataFrame(columns=['id', 'name', 'boss_name', 'afterclass_name', 'original_scraped_name'])
        df.to_csv(os.path.join(self.verify_dir, 'new_professors.csv'), index=False)
        if self.new_professors:
            logger.info(f"✅ Saved {len(self.new_professors)} new professors for review")
        else:
            logger.info(f"✅ Created empty new_professors.csv (all professors already exist)")
        
        # Save new courses (to verify folder)
        if self.new_courses:
            df = pd.DataFrame(self.new_courses)
            df.to_csv(os.path.join(self.verify_dir, 'new_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_courses)} new courses")
        
        # Save course updates
        if self.update_courses:
            df = pd.DataFrame(self.update_courses)
            df.to_csv(os.path.join(self.output_base, 'update_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.update_courses)} course updates")
        
        # Save academic terms
        if self.new_acad_terms:
            df = pd.DataFrame(self.new_acad_terms)
            df.to_csv(os.path.join(self.output_base, 'new_acad_term.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_acad_terms)} academic terms")

    def run(self, skip_faculty_assignment=True):
        """Run the complete table building process
        
        Args:
            skip_faculty_assignment: If True, faculty assignment is deferred
        """
        try:
            logger.info("🚀 Starting TableBuilder process")
            logger.info("="*60)
            
            # Step 1: Load or cache database data
            if not self.load_or_cache_data():
                logger.error("❌ Failed to load database data")
                return False
            
            # Step 2: Load raw data
            if not self.load_raw_data():
                logger.error("❌ Failed to load raw data")
                return False
            
            # Step 3: Process tables in dependency order
            logger.info("\n📋 Processing tables in dependency order...")
            
            # 3.1: Process professors first (no dependencies)
            self.process_professors()
            
            # 3.2: Process courses (without faculty assignment)
            self.process_courses()
            
            # 3.3: Process academic terms (no dependencies)
            self.process_acad_terms()
            
            # 3.4: Process classes (depends on courses, professors, acad_terms)
            self.process_classes()
            
            # 3.5: Process timings (depends on classes)
            self.process_timings()
            
            # Step 4: Save all outputs
            self.save_outputs()
            
            # Step 5: Print summary
            self.print_summary()
            
            if self.stats['courses_needing_faculty'] > 0 and not skip_faculty_assignment:
                print("\n⚠️  FACULTY ASSIGNMENT REQUIRED")
                print(f"   {self.stats['courses_needing_faculty']} courses need faculty assignment")
                print("   Run builder.assign_course_faculties() to complete assignment")
            
            logger.info("\n✅ TableBuilder process completed successfully!")
            return True
            
        except Exception as e:
            # logger.exception records the message together with the full traceback
            logger.exception(f"❌ Process failed: {e}")
            return False

    def setup_boss_processing(self):
        """Initialize BOSS results processing with logging and caches"""
        # Setup logging for BOSS processing
        self.boss_log_file = os.path.join(self.output_base, 'boss_result_log.txt')
        
        # Create the log file and write header
        try:
            with open(self.boss_log_file, 'w', encoding='utf-8') as f:
                f.write(f"BOSS Results Processing Log - {datetime.now().isoformat()}\n")
                f.write("="*70 + "\n\n")
            print(f"📝 Log file created: {self.boss_log_file}")
        except Exception as e:
            print(f"⚠️ Warning: Could not create log file {self.boss_log_file}: {e}")
            self.boss_log_file = None
        
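        # Load the existing classes cache up front: find_class_id() and
        # find_all_class_ids() only load it lazily when the attribute is missing,
        # so initializing it to an empty list here would leave it empty
        self.load_existing_classes_cache()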
        
        # Data storage for BOSS results
        self.boss_data = []
        self.failed_mappings = []
        
        # Output collectors
        self.new_bid_windows = []
        self.new_class_availability = []
        self.new_bid_result = []
        
        # Caches for deduplication
        self.bid_window_cache = {}  # (acad_term_id, round, window) -> bid_window_id
        self.bid_window_id_counter = 1
        
        # Statistics
        self.boss_stats = {
            'files_processed': 0,
            'total_rows': 0,
            'bid_windows_created': 0,
            'class_availability_created': 0,
            'bid_results_created': 0,
            'failed_mappings': 0
        }
        
        print("🔄 BOSS results processing setup completed")

    def log_boss_activity(self, message, print_to_stdout=True):
        """Log activity to both file and optionally stdout"""
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        log_message = f"[{timestamp}] {message}\n"
        
        # Only write to file if boss_log_file exists (after setup_boss_processing is called)
        if hasattr(self, 'boss_log_file') and self.boss_log_file:
            try:
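                # UTF-8 is required because log messages contain emoji, which the
                # default Windows encoding cannot represent
                with open(self.boss_log_file, 'a', encoding='utf-8') as f: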
                    f.write(log_message)
            except Exception as e:
                print(f"⚠️ Warning: Could not write to log file: {e}")
        
        if print_to_stdout:
            print(f"📝 {message}")

    def parse_term_to_acad_term_id(self, term_str):
        """Convert term string to acad_term_id format
        
        Examples:
        "2021-22 Term 1" -> "AY202122T1"
        "2021-22 Term 3A" -> "AY202122T3A"
        """
        if not term_str or pd.isna(term_str):
            return None
        
        # Clean the string
        term_str = str(term_str).strip()
        
        # Pattern: YYYY-YY Term X[A/B]
        pattern = r'(\d{4})-(\d{2})\s+Term\s+(\w+)'
        match = re.match(pattern, term_str)
        
        if match:
            year_start = match.group(1)
            year_end = match.group(2)
            term = match.group(3)
            return f"AY{year_start}{year_end}T{term}"
        
        return None

    def parse_bidding_window(self, bidding_window_str):
        """Complete parser for bidding window string to extract round and window
        
        Examples:
        "Round 1 Window 1" -> ("1", 1)
        "Round 1A Window 2" -> ("1A", 2)
        "Round 2A Window 3" -> ("2A", 3)
        "Incoming Exchange Rnd 1C Win 1" -> ("1C", 1)
        "Incoming Freshmen Rnd 1 Win 4" -> ("1F", 4)
        """
        if not bidding_window_str or pd.isna(bidding_window_str):
            return None, None
        
        # Clean the string
        bidding_window_str = str(bidding_window_str).strip()
        
        # Pattern 1: Standard format "Round X[A/B/C] Window Y"
        pattern1 = r'Round\s+(\w+)\s+Window\s+(\d+)'
        match1 = re.match(pattern1, bidding_window_str)
        if match1:
            round_str = match1.group(1)
            window_num = int(match1.group(2))
            return round_str, window_num
        
        # Pattern 2: Incoming Exchange format "Incoming Exchange Rnd X[A/B/C] Win Y"
        # The round label is kept unchanged (e.g. "1C")
        pattern2 = r'Incoming\s+Exchange\s+Rnd\s+(\w+)\s+Win\s+(\d+)'
        match2 = re.match(pattern2, bidding_window_str)
        if match2:
            round_str = match2.group(1)  # Keep original round (1C)
            window_num = int(match2.group(2))
            return round_str, window_num
        
        # Pattern 3: Incoming Freshmen format "Incoming Freshmen Rnd X Win Y"
        # Append "F" to the round (e.g. "1" -> "1F") to distinguish it from the regular round
        pattern3 = r'Incoming\s+Freshmen\s+Rnd\s+(\w+)\s+Win\s+(\d+)'
        match3 = re.match(pattern3, bidding_window_str)
        if match3:
            round_str = f"{match3.group(1)}F"
            window_num = int(match3.group(2))
            return round_str, window_num
        
        return None, None

    def get_window_hierarchy(self, acad_term_id):
        """Get the expected window hierarchy for a given academic term
        Regular terms include the Incoming Freshmen rounds ("1F") after the regular rounds"""
        if not acad_term_id:
            return []
        
        # Extract year and term from acad_term_id
        pattern = r'AY(\d{4})(\d{2})T(\w+)'
        match = re.match(pattern, acad_term_id)
        if not match:
            return []
        
        year_start = int(match.group(1))
        year_end = int(match.group(2))
        term = match.group(3)
        
        # Determine the four-digit end year from the start year's century,
        # rolling over when the academic year crosses a century boundary
        full_year_end = (year_start // 100) * 100 + year_end
        if full_year_end < year_start:
            full_year_end += 100
        
        # Term 3A and 3B have different hierarchy
        if term in ['3A', '3B']:
            return [
                ("1", 1), ("1", 2), ("1", 3), ("1", 4),
                ("2", 1), ("2", 2)
            ]
        
        # Regular terms (T1, T2) - includes incoming student rounds
        base_hierarchy = []
        
        if full_year_end < 2025:  # Before AY2024-25
            base_hierarchy = [
                ("1", 1), ("1", 2),
                ("1A", 1), ("1A", 2),
                ("1B", 1), ("1B", 2),
                ("1C", 1), ("1C", 2), ("1C", 3),
                ("2", 1), ("2", 2), ("2", 3),
                ("2A", 1), ("2A", 2), ("2A", 3)
            ]
        else:  # From AY2024-25 onwards
            base_hierarchy = [
                ("1", 1),
                ("1A", 1), ("1A", 2), ("1A", 3),
                ("1B", 1), ("1B", 2),
                ("1C", 1), ("1C", 2), ("1C", 3),
                ("2", 1), ("2", 2), ("2", 3),
                ("2A", 1), ("2A", 2), ("2A", 3)
            ]
        
        # Add incoming student rounds
        incoming_rounds = [
            ("1F", 1), ("1F", 2), ("1F", 3), ("1F", 4)  # Incoming Freshmen
        ]
        
        # Combine hierarchies: regular rounds first, then incoming rounds
        return base_hierarchy + incoming_rounds

    def load_boss_results(self):
        """Load all BOSS results XLSX files"""
        self.log_boss_activity("🔍 Loading BOSS results files...")
        
        input_pattern = os.path.join('script_input', 'overallBossResults', '*.xlsx')
        # Sort for a deterministic load order (glob returns files in arbitrary order)
        xlsx_files = sorted(glob.glob(input_pattern))
        
        if not xlsx_files:
            self.log_boss_activity(f"❌ No XLSX files found matching pattern: {input_pattern}")
            return False
        
        self.log_boss_activity(f"📂 Found {len(xlsx_files)} XLSX files")
        
        all_data = []
        
        for file_path in xlsx_files:
            try:
                self.log_boss_activity(f"📖 Loading: {os.path.basename(file_path)}")
                df = pd.read_excel(file_path)
                
                # Add source file for tracking
                df['source_file'] = os.path.basename(file_path)
                all_data.append(df)
                
                self.boss_stats['files_processed'] += 1
                self.log_boss_activity(f"✅ Loaded {len(df)} rows from {os.path.basename(file_path)}")
                
            except Exception as e:
                self.log_boss_activity(f"❌ Error loading {file_path}: {e}")
                continue
        
        if all_data:
            self.boss_data = pd.concat(all_data, ignore_index=True)
            self.boss_stats['total_rows'] = len(self.boss_data)
            self.log_boss_activity(f"✅ Combined {self.boss_stats['total_rows']} total rows")
            return True
        else:
            self.log_boss_activity("❌ No data loaded successfully")
            return False

    def process_bid_windows(self):
        """Process and create bid_window entries with dynamic window detection"""
        self.log_boss_activity("🪟 Processing bid windows...")
        
        if self.boss_data is None or len(self.boss_data) == 0:
            self.log_boss_activity("❌ No BOSS data loaded")
            return False
        
        # Track all unique bid windows found in data
        found_windows = defaultdict(set)  # acad_term_id -> set of (round, window) tuples
        
        # First pass: discover all windows that actually exist in the data
        for _, row in self.boss_data.iterrows():
            term_str = row.get('Term')
            bidding_window_str = row.get('Bidding Window')
            
            if pd.isna(term_str) or pd.isna(bidding_window_str):
                continue
            
            acad_term_id = self.parse_term_to_acad_term_id(term_str)
            round_str, window_num = self.parse_bidding_window(bidding_window_str)
            
            if acad_term_id and round_str and window_num:
                found_windows[acad_term_id].add((round_str, window_num))
        
        # Define the expected order for rounds
        round_order = {
            '1': 1,
            '1A': 2,
            '1B': 3,
            '1C': 4,    # Incoming Exchange
            '1F': 5,    # Incoming Freshmen
            '2': 6,
            '2A': 7
        }
        
        # Create bid windows in proper order
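        # Continue from the running counter so IDs never collide with cached windows
        bid_window_id = self.bid_window_id_counter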
        
        # Process each academic term
        for acad_term_id in sorted(found_windows.keys()):
            windows_for_term = found_windows[acad_term_id]
            
            # Convert to list and sort by round order first, then window number
            sorted_windows = sorted(
                windows_for_term,
                key=lambda x: (round_order.get(x[0], 99), x[1])
            )
            
            self.log_boss_activity(f"📅 Processing {acad_term_id}: found {len(sorted_windows)} windows")
            
            # Create bid windows in order
            for round_str, window_num in sorted_windows:
                window_key = (acad_term_id, round_str, window_num)
                
                # Skip if already created
                if window_key in self.bid_window_cache:
                    continue
                
                new_bid_window = {
                    'id': bid_window_id,
                    'acad_term_id': acad_term_id,
                    'round': round_str,
                    'window': window_num
                }
                
                self.new_bid_windows.append(new_bid_window)
                self.bid_window_cache[window_key] = bid_window_id
                self.boss_stats['bid_windows_created'] += 1
                
                self.log_boss_activity(
                    f"✅ Created bid_window {bid_window_id}: {acad_term_id} Round {round_str} Window {window_num}"
                )
                
                bid_window_id += 1
        
        self.bid_window_id_counter = bid_window_id
        self.log_boss_activity(f"✅ Created {self.boss_stats['bid_windows_created']} bid windows")
        return True

    def sort_bid_windows_by_hierarchy(self):
        """Sort bid windows according to the proper hierarchy"""
        self.log_boss_activity("🔄 Sorting bid windows by hierarchy...")
        
        # Group by acad_term_id
        term_groups = defaultdict(list)
        for bw in self.new_bid_windows:
            term_groups[bw['acad_term_id']].append(bw)
        
        sorted_windows = []
        new_id_mapping = {}  # old_id -> new_id
        new_id_counter = 1
        
        for acad_term_id in sorted(term_groups.keys()):
            windows = term_groups[acad_term_id]
            hierarchy = self.get_window_hierarchy(acad_term_id)
            
            # Create a mapping of (round, window) to bid_window for this term
            term_window_map = {(bw['round'], bw['window']): bw for bw in windows}
            
            # Order by hierarchy first; windows that the hierarchy does not list
            # are appended afterwards so that none are silently dropped
            ordered_keys = [key for key in hierarchy if key in term_window_map]
            ordered_keys += sorted(key for key in term_window_map if key not in hierarchy)
            
            for round_str, window_num in ordered_keys:
                bw = term_window_map[(round_str, window_num)]
                old_id = bw['id']
                new_id = new_id_counter
                new_id_counter += 1
                
                # Update the bid window with new ID
                bw['id'] = new_id
                sorted_windows.append(bw)
                new_id_mapping[old_id] = new_id
                
                # Update cache
                window_key = (acad_term_id, round_str, window_num)
                self.bid_window_cache[window_key] = new_id
        
        self.new_bid_windows = sorted_windows
        self.bid_window_id_counter = new_id_counter
        
        self.log_boss_activity(f"✅ Sorted {len(sorted_windows)} bid windows by hierarchy")


    def find_class_id(self, course_code, section, acad_term_id):
        """Find class_id using course_code, section, and acad_term_id
        Robust version that checks multiple sources in order:
        1. Memory cache (new_classes)
        2. Database cache (existing_classes_cache)
        3. new_classes.csv file
        4. Direct database query
        """
        
        # First, get course_id from course_code
        course_id = self.get_course_id(course_code)
        if not course_id:
            return None
        
        # Convert section to string for consistent comparison
        section_str = str(section)
        
        # Source 1: Search in newly created classes (memory)
        if hasattr(self, 'new_classes') and self.new_classes:
            for class_obj in self.new_classes:
                if (class_obj['course_id'] == course_id and 
                    str(class_obj['section']) == section_str and 
                    class_obj['acad_term_id'] == acad_term_id):
                    return class_obj['id']
        
        # Source 2: Search in existing database cache
        if not hasattr(self, 'existing_classes_cache'):
            self.load_existing_classes_cache()
        
        if hasattr(self, 'existing_classes_cache') and self.existing_classes_cache:
            for class_obj in self.existing_classes_cache:
                if (class_obj['course_id'] == course_id and 
                    str(class_obj['section']) == section_str and 
                    class_obj['acad_term_id'] == acad_term_id):
                    return class_obj['id']
        
        # Source 3: Check new_classes.csv file (if cache is empty/stale)
        class_id = self.search_new_classes_csv(course_id, section_str, acad_term_id)
        if class_id:
            return class_id
        
        # Source 4: Direct database query (last resort)
        if self.connection:
            class_id = self.search_database_classes(course_id, section_str, acad_term_id)
            if class_id:
                return class_id
        
        return None

    def get_course_id(self, course_code):
        """Get course_id from course_code, checking multiple sources"""
        # Check courses cache (from database)
        if course_code in self.courses_cache:
            return self.courses_cache[course_code]['id']
        
        # Check in new_courses (newly created)
        for course in self.new_courses:
            if course['code'] == course_code:
                return course['id']
        
        # Check new_courses.csv file
        try:
            new_courses_path = os.path.join(self.output_base, 'new_courses.csv')
            verify_courses_path = os.path.join(self.verify_dir, 'new_courses.csv')
            
            for path in [verify_courses_path, new_courses_path]:
                if os.path.exists(path):
                    df = pd.read_csv(path)
                    matching_courses = df[df['code'] == course_code]
                    if not matching_courses.empty:
                        return matching_courses.iloc[0]['id']
        except Exception as e:
            self.log_boss_activity(f"⚠️ Error reading new_courses.csv: {e}", print_to_stdout=False)
        
        return None

    def search_new_classes_csv(self, course_id, section_str, acad_term_id):
        """Search for class in new_classes.csv file"""
        try:
            new_classes_path = os.path.join(self.output_base, 'new_classes.csv')
            if os.path.exists(new_classes_path):
                df = pd.read_csv(new_classes_path)
                matching_classes = df[
                    (df['course_id'] == course_id) & 
                    (df['section'].astype(str) == section_str) & 
                    (df['acad_term_id'] == acad_term_id)
                ]
                if not matching_classes.empty:
                    return matching_classes.iloc[0]['id']
        except Exception as e:
            self.log_boss_activity(f"⚠️ Error reading new_classes.csv: {e}", print_to_stdout=False)
        
        return None

    def search_database_classes(self, course_id, section_str, acad_term_id):
        """Search for class directly in database"""
        try:
            query = """
            SELECT id FROM classes 
            WHERE course_id = %s AND section = %s AND acad_term_id = %s
            LIMIT 1
            """
            cursor = self.connection.cursor()
            cursor.execute(query, (course_id, section_str, acad_term_id))
            result = cursor.fetchone()
            cursor.close()
            
            if result:
                return result[0]
        except Exception as e:
            self.log_boss_activity(f"⚠️ Error querying database: {e}", print_to_stdout=False)
        
        return None

    def load_existing_classes_cache(self):
        """Load existing classes from database cache with fallback options"""
        self.existing_classes_cache = []
        
        try:
            cache_file = os.path.join(self.cache_dir, 'classes_cache.pkl')
            
            # Try loading from cache file first
            if os.path.exists(cache_file):
                try:
                    classes_df = pd.read_pickle(cache_file)
                    if not classes_df.empty:
                        self.existing_classes_cache = classes_df.to_dict('records')
                        self.log_boss_activity(f"📚 Loaded {len(self.existing_classes_cache)} existing classes from cache")
                        return
                    else:
                        self.log_boss_activity("⚠️ Cache file exists but is empty")
                except Exception as e:
                    self.log_boss_activity(f"⚠️ Error reading cache file: {e}")
            
            # If cache doesn't exist or is empty, try database
            if self.connection:
                try:
                    query = "SELECT * FROM classes"
                    classes_df = pd.read_sql_query(query, self.connection)
                    if not classes_df.empty:
                        # Save to cache for future use
                        classes_df.to_pickle(cache_file)
                        self.existing_classes_cache = classes_df.to_dict('records')
                        self.log_boss_activity(f"📚 Downloaded and cached {len(self.existing_classes_cache)} existing classes")
                        return
                    else:
                        self.log_boss_activity("⚠️ Database classes table is empty")
                except Exception as e:
                    self.log_boss_activity(f"⚠️ Error downloading classes from database: {e}")
            
            # If all else fails, try reading from new_classes.csv
            try:
                new_classes_path = os.path.join(self.output_base, 'new_classes.csv')
                if os.path.exists(new_classes_path):
                    classes_df = pd.read_csv(new_classes_path)
                    if not classes_df.empty:
                        self.existing_classes_cache = classes_df.to_dict('records')
                        self.log_boss_activity(f"📚 Loaded {len(self.existing_classes_cache)} classes from new_classes.csv as fallback")
                        return
            except Exception as e:
                self.log_boss_activity(f"⚠️ Error reading new_classes.csv as fallback: {e}")
            
            # Final fallback
            self.log_boss_activity("⚠️ All class loading methods failed - using empty cache")
                    
        except Exception as e:
            self.existing_classes_cache = []
            self.log_boss_activity(f"⚠️ Critical error in load_existing_classes_cache: {e}")

    def process_class_availability(self):
        """Process class availability data with support for multi-professor classes"""
        self.log_boss_activity("📊 Processing class availability...")
        
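        # boss_data starts as an empty list, which has no iterrows(); bail out early
        if self.boss_data is None or len(self.boss_data) == 0:
            self.log_boss_activity("❌ No BOSS data loaded")
            return False
        
        processed_count = 0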
        
        for _, row in self.boss_data.iterrows():
            # Parse required fields
            course_code = row.get('Course Code')
            section = row.get('Section')
            term_str = row.get('Term')
            bidding_window_str = row.get('Bidding Window')
            
            # Extract availability data
            vacancy = row.get('Vacancy')
            enrolled_students = row.get('Enrolled Students')
            before_process_vacancy = row.get('Before Process Vacancy')
            
            # Validate required fields
            if pd.isna(course_code) or pd.isna(section) or pd.isna(term_str) or pd.isna(bidding_window_str):
                continue
            
            # Parse term and bidding window
            acad_term_id = self.parse_term_to_acad_term_id(term_str)
            round_str, window_num = self.parse_bidding_window(bidding_window_str)
            
            if not all([acad_term_id, round_str, window_num]):
                continue
            
            # Find ALL class_ids for this course/section/term (handles multi-professor)
            class_ids = self.find_all_class_ids(course_code, str(section), acad_term_id)
            
            if not class_ids:
                # Record failed mapping
                failed_row = {
                    'course_code': course_code,
                    'section': section,
                    'acad_term_id': acad_term_id,
                    'term_str': term_str,
                    'bidding_window_str': bidding_window_str,
                    'reason': 'class_not_found',
                    'source_file': row.get('source_file', 'unknown')
                }
                self.failed_mappings.append(failed_row)
                self.boss_stats['failed_mappings'] += 1
                continue
            
            # Get bid_window_id
            window_key = (acad_term_id, round_str, window_num)
            bid_window_id = self.bid_window_cache.get(window_key)
            if not bid_window_id:
                self.log_boss_activity(f"⚠️ No bid_window_id for {window_key}")
                continue
            
            # Calculate fields
            total = int(vacancy) if pd.notna(vacancy) else 0
            current_enrolled = int(enrolled_students) if pd.notna(enrolled_students) else 0
            available = int(before_process_vacancy) if pd.notna(before_process_vacancy) else 0
            reserved = max(0, total - current_enrolled - available)
            
            # Create class availability record for EACH class_id (multi-professor support)
            for class_id in class_ids:
                availability_record = {
                    'class_id': class_id,
                    'bid_window_id': bid_window_id,
                    'total': total,
                    'current_enrolled': current_enrolled,
                    'reserved': reserved,
                    'available': available
                }
                
                self.new_class_availability.append(availability_record)
                self.boss_stats['class_availability_created'] += 1
            
            processed_count += len(class_ids)
        
        self.log_boss_activity(f"✅ Processed {processed_count} class availability records")
        return True

    def process_bid_results(self):
        """Process bid result data with support for multi-professor classes"""
        self.log_boss_activity("📈 Processing bid results...")
        
        # boss_data starts as an empty list, which has no iterrows(); bail out early
        if self.boss_data is None or len(self.boss_data) == 0:
            self.log_boss_activity("❌ No BOSS data loaded")
            return False
        
        # Numeric converters, defined once rather than on every row;
        # missing values (NaN/None) become zero
        def safe_int(val):
            return int(val) if pd.notna(val) else 0
        
        def safe_float(val):
            return float(val) if pd.notna(val) else 0.0
        
        processed_count = 0
        
        for _, row in self.boss_data.iterrows():
            # Parse required fields
            course_code = row.get('Course Code')
            section = row.get('Section')
            term_str = row.get('Term')
            bidding_window_str = row.get('Bidding Window')
            
            # Extract bid result data
            vacancy = row.get('Vacancy')
            opening_vacancy = row.get('Opening Vacancy')
            before_process_vacancy = row.get('Before Process Vacancy')
            dice = row.get('D.I.C.E')
            after_process_vacancy = row.get('After Process Vacancy', 0)
            enrolled_students = row.get('Enrolled Students')
            median_bid = row.get('Median Bid')
            min_bid = row.get('Min Bid')
            
            # Validate required fields
            if pd.isna(course_code) or pd.isna(section) or pd.isna(term_str) or pd.isna(bidding_window_str):
                continue
            
            # Parse term and bidding window
            acad_term_id = self.parse_term_to_acad_term_id(term_str)
            round_str, window_num = self.parse_bidding_window(bidding_window_str)
            
            if not all([acad_term_id, round_str, window_num]):
                continue
            
            # Find ALL class_ids for this course/section/term (handles multi-professor)
            class_ids = self.find_all_class_ids(course_code, str(section), acad_term_id)
            
            if not class_ids:
                # Failed mapping already recorded in process_class_availability
                continue
            
            # Get bid_window_id
            window_key = (acad_term_id, round_str, window_num)
            bid_window_id = self.bid_window_cache.get(window_key)
            if not bid_window_id:
                continue
            
            # Create bid result record for EACH class_id (multi-professor support)
            for class_id in class_ids:
                bid_result_record = {
                    'bid_window_id': bid_window_id,
                    'class_id': class_id,
                    'vacancy': safe_int(vacancy),
                    'opening_vacancy': safe_int(opening_vacancy),
                    'before_process_vacancy': safe_int(before_process_vacancy),
                    'dice': safe_int(dice),
                    'after_process_vacancy': safe_int(after_process_vacancy),
                    'enrolled_students': safe_int(enrolled_students),
                    'bid_actual_median': safe_float(median_bid),
                    'bid_actual_min': safe_float(min_bid),
                    'bid_predicted_median': 0.0,
                    'bid_predicted_min': 0.0
                }
                
                self.new_bid_result.append(bid_result_record)
                self.boss_stats['bid_results_created'] += 1
            
            processed_count += len(class_ids)
        
        self.log_boss_activity(f"✅ Processed {processed_count} bid result records")
        return True

    def find_all_class_ids(self, course_code, section, acad_term_id):
        """Find ALL class_ids for a course/section/term combination (handles multi-professor classes)"""
        
        # First, get course_id from course_code
        course_id = self.get_course_id(course_code)
        if not course_id:
            return []
        
        # Convert section to string for consistent comparison
        section_str = str(section)
        
        class_ids = []
        
        # Source 1: Search in newly created classes (memory)
        if hasattr(self, 'new_classes') and self.new_classes:
            for class_obj in self.new_classes:
                if (class_obj['course_id'] == course_id and 
                    str(class_obj['section']) == section_str and 
                    class_obj['acad_term_id'] == acad_term_id):
                    class_ids.append(class_obj['id'])
        
        # Source 2: Search in existing database cache
        if not hasattr(self, 'existing_classes_cache'):
            self.load_existing_classes_cache()
        
        if hasattr(self, 'existing_classes_cache') and self.existing_classes_cache:
            for class_obj in self.existing_classes_cache:
                if (class_obj['course_id'] == course_id and 
                    str(class_obj['section']) == section_str and 
                    class_obj['acad_term_id'] == acad_term_id):
                    if class_obj['id'] not in class_ids:  # Avoid duplicates
                        class_ids.append(class_obj['id'])
        
        # Source 3: Check new_classes.csv file (if cache is empty/stale)
        try:
            new_classes_path = os.path.join(self.output_base, 'new_classes.csv')
            if os.path.exists(new_classes_path) and not class_ids:
                df = pd.read_csv(new_classes_path)
                matching_classes = df[
                    (df['course_id'] == course_id) & 
                    (df['section'].astype(str) == section_str) & 
                    (df['acad_term_id'] == acad_term_id)
                ]
                for _, row in matching_classes.iterrows():
                    if row['id'] not in class_ids:
                        class_ids.append(row['id'])
        except Exception as e:
            self.log_boss_activity(f"⚠️ Error reading new_classes.csv: {e}", print_to_stdout=False)
        
        # Source 4: Direct database query (last resort)
        if self.connection and not class_ids:
            try:
                query = """
                SELECT id FROM classes 
                WHERE course_id = %s AND section = %s AND acad_term_id = %s
                """
                cursor = self.connection.cursor()
                cursor.execute(query, (course_id, section_str, acad_term_id))
                results = cursor.fetchall()
                cursor.close()
                
                for result in results:
                    if result[0] not in class_ids:
                        class_ids.append(result[0])
            except Exception as e:
                self.log_boss_activity(f"⚠️ Error querying database: {e}", print_to_stdout=False)
        
        return class_ids

    def save_boss_outputs(self):
        """Save all BOSS-related output files"""
        self.log_boss_activity("💾 Saving BOSS output files...")
        
        # Save bid windows
        if self.new_bid_windows:
            df = pd.DataFrame(self.new_bid_windows)
            output_path = os.path.join(self.output_base, 'new_bid_window.csv')
            df.to_csv(output_path, index=False)
            self.log_boss_activity(f"✅ Saved {len(self.new_bid_windows)} bid windows to new_bid_window.csv")
        
        # Save class availability
        if self.new_class_availability:
            df = pd.DataFrame(self.new_class_availability)
            output_path = os.path.join(self.output_base, 'new_class_availability.csv')
            df.to_csv(output_path, index=False)
            self.log_boss_activity(f"✅ Saved {len(self.new_class_availability)} availability records to new_class_availability.csv")
        
        # Save bid results
        if self.new_bid_result:
            df = pd.DataFrame(self.new_bid_result)
            output_path = os.path.join(self.output_base, 'new_bid_result.csv')
            df.to_csv(output_path, index=False)
            self.log_boss_activity(f"✅ Saved {len(self.new_bid_result)} bid results to new_bid_result.csv")
        
        # Save failed mappings
        if self.failed_mappings:
            df = pd.DataFrame(self.failed_mappings)
            output_path = os.path.join(self.output_base, 'failed_boss_results_mapping.csv')
            df.to_csv(output_path, index=False)
            self.log_boss_activity(f"⚠️ Saved {len(self.failed_mappings)} failed mappings to failed_boss_results_mapping.csv")
        
        self.log_boss_activity("✅ All BOSS output files saved successfully")

    def print_boss_summary(self):
        """Print BOSS processing summary"""
        print("\n" + "="*70)
        print("📊 BOSS RESULTS PROCESSING SUMMARY")
        print("="*70)
        print(f"📂 Files processed: {self.boss_stats['files_processed']}")
        print(f"📄 Total rows: {self.boss_stats['total_rows']}")
        print(f"🪟 Bid windows created: {self.boss_stats['bid_windows_created']}")
        print(f"📊 Class availability records: {self.boss_stats['class_availability_created']}")
        print(f"📈 Bid result records: {self.boss_stats['bid_results_created']}")
        print(f"❌ Failed mappings: {self.boss_stats['failed_mappings']}")
        print("="*70)
        
        print("\n📁 OUTPUT FILES:")
        print(f"   - new_bid_window.csv ({self.boss_stats['bid_windows_created']} records)")
        print(f"   - new_class_availability.csv ({self.boss_stats['class_availability_created']} records)")
        print(f"   - new_bid_result.csv ({self.boss_stats['bid_results_created']} records)")
        if self.boss_stats['failed_mappings'] > 0:
            print(f"   - failed_boss_results_mapping.csv ({self.boss_stats['failed_mappings']} records)")
        print("   - boss_result_log.txt (processing log)")
        print("="*70)

    def run_phase3_boss_processing(self):
        """Run the complete BOSS results processing pipeline"""
        try:
            self.log_boss_activity("🚀 Starting Phase 3: BOSS Results Processing")
            self.log_boss_activity("="*60)

            # Ensure database connection is available
            if not hasattr(self, 'connection') or not self.connection:
                self.log_boss_activity("🔌 Establishing database connection...")
                if not self.connect_database():
                    self.log_boss_activity("❌ Failed to establish database connection")
                    return False

            # Step 1: Setup
            self.setup_boss_processing()
            
            # Step 2: Load BOSS results
            if not self.load_boss_results():
                self.log_boss_activity("❌ Failed to load BOSS results")
                return False
            
            # Step 3: Process bid windows
            if not self.process_bid_windows():
                self.log_boss_activity("❌ Failed to process bid windows")
                return False
            
            # Step 4: Process class availability
            if not self.process_class_availability():
                self.log_boss_activity("❌ Failed to process class availability")
                return False
            
            # Step 5: Process bid results
            if not self.process_bid_results():
                self.log_boss_activity("❌ Failed to process bid results")
                return False
            
            # Step 6: Save outputs
            self.save_boss_outputs()
            
            # Step 7: Print summary
            self.print_boss_summary()
            
            self.log_boss_activity("✅ Phase 3: BOSS Results Processing completed successfully!")
            return True
            
        except Exception as e:
            self.log_boss_activity(f"❌ Phase 3 failed: {e}")
            import traceback
            traceback.print_exc()
            return False

    def load_faculties_cache(self):
        """Load faculties from database cache for mapping"""
        try:
            cache_file = os.path.join(self.cache_dir, 'faculties_cache.pkl')
            
            # Try loading from cache file first
            if os.path.exists(cache_file):
                try:
                    faculties_df = pd.read_pickle(cache_file)
                    if not faculties_df.empty:
                        self.faculties_cache = {}
                        self.faculty_acronym_to_id = {}
                        
                        for _, row in faculties_df.iterrows():
                            faculty_id = row['id']
                            acronym = row['acronym'].upper()
                            
                            self.faculties_cache[faculty_id] = row.to_dict()
                            self.faculty_acronym_to_id[acronym] = faculty_id
                        
                        logger.info(f"📚 Loaded {len(self.faculties_cache)} faculties from cache")
                        return True
                    else:
                        logger.warning("⚠️ Faculty cache file exists but is empty")
                except Exception as e:
                    logger.warning(f"⚠️ Error reading faculty cache file: {e}")
            
            # If cache doesn't exist or failed, try database
            if self.connection:
                try:
                    query = "SELECT * FROM faculties"
                    faculties_df = pd.read_sql_query(query, self.connection)
                    if not faculties_df.empty:
                        # Save to cache for future use
                        faculties_df.to_pickle(cache_file)
                        
                        # Load into memory
                        self.faculties_cache = {}
                        self.faculty_acronym_to_id = {}
                        
                        for _, row in faculties_df.iterrows():
                            faculty_id = row['id']
                            acronym = row['acronym'].upper()
                            
                            self.faculties_cache[faculty_id] = row.to_dict()
                            self.faculty_acronym_to_id[acronym] = faculty_id
                        
                        logger.info(f"📚 Downloaded and cached {len(self.faculties_cache)} faculties from database")
                        return True
                    else:
                        logger.warning("⚠️ Database faculties table is empty")
                except Exception as e:
                    logger.warning(f"⚠️ Error downloading faculties from database: {e}")
            
            # Fallback: create basic mapping from known data
            logger.warning("⚠️ Using fallback faculty mapping")
            self.faculties_cache = {}
            self.faculty_acronym_to_id = {
                'LKCSB': 1,   # Lee Kong Chian School of Business
                'YPHSL': 2,   # Yong Pung How School of Law
                'SOE': 3,     # School of Economics
                'SCIS': 4,    # School of Computing and Information Systems
                'SOSS': 5,    # School of Social Sciences
                'SOA': 6,     # School of Accountancy
                'CIS': 7,     # College of Integrative Studies
                'CEC': 8,     # Center for English Communication
                'C4SR': 9,    # Centre for Social Responsibility
                'OCS': 10,    # Dato’ Kho Hui Meng Career Centre
            }
            return True
            
        except Exception as e:
            logger.error(f"❌ Critical error in load_faculties_cache: {e}")
            return False

    def map_courses_to_faculties_from_boss(self):
        """Map courses to faculties using School/Department data from BOSS results"""
        logger.info("🎓 Starting automated faculty mapping from BOSS data...")
        
        # Load faculties cache first
        if not self.load_faculties_cache():
            logger.error("❌ Failed to load faculties cache")
            return False
        
        # Department code to faculty acronym mapping
        # Maps BOSS School/Department codes to our faculty acronyms
        dept_to_faculty_mapping = {
            'SOA': 'SOA',         # School of Accountancy
            'SOSS': 'SOSS',       # School of Social Sciences  
            'LKCSOB': 'LKCSB',    # Lee Kong Chian School of Business (alternative name)
            'LKCSB': 'LKCSB',     # Lee Kong Chian School of Business
            'SIS': 'SCIS',        # School of Computing and Information Systems (old name)
            'SCIS': 'SCIS',       # School of Computing and Information Systems
            'OCC': 'CIS',         # College of Integrative Studies (Office of Core Curriculum)
            'CIS': 'CIS',         # College of Integrative Studies
            'CEC': 'CEC',         # Center for English Communication
            'SOL': 'YPHSL',       # Yong Pung How School of Law (School of Law)
            'SOLGPO': 'YPHSL',    # Yong Pung How School of Law (alternative)
            'SOE': 'SOE',         # School of Economics
            'C4SR': 'C4SR',       # Centre for Social Responsibility
            'OCS': 'OCS',         # Dato’ Kho Hui Meng Career Centre
        }
        
        # Track new faculties that need to be created
        new_faculties_needed = set()
        course_faculty_mappings = {}
        
        # Load BOSS results to extract School/Department mapping
        boss_data_pattern = os.path.join('script_input', 'overallBossResults', '*.xlsx')
        boss_files = glob.glob(boss_data_pattern)
        
        if not boss_files:
            logger.warning("⚠️ No BOSS results files found for faculty mapping")
            return False
        
        logger.info(f"📂 Found {len(boss_files)} BOSS files for faculty mapping")
        
        # Collect all course-faculty mappings from BOSS data
        boss_faculty_data = []
        for file_path in boss_files:
            try:
                df = pd.read_excel(file_path)
                if 'Course Code' in df.columns and 'School/Department' in df.columns:
                    # Extract unique course-faculty pairs
                    course_dept_pairs = df[['Course Code', 'School/Department']].dropna().drop_duplicates()
                    boss_faculty_data.append(course_dept_pairs)
                    logger.info(f"✅ Extracted {len(course_dept_pairs)} course-department pairs from {os.path.basename(file_path)}")
            except Exception as e:
                logger.warning(f"⚠️ Could not read {file_path}: {e}")
        
        if not boss_faculty_data:
            logger.warning("⚠️ No valid BOSS faculty data found")
            return False
        
        # Combine all BOSS faculty data
        combined_boss_data = pd.concat(boss_faculty_data, ignore_index=True).drop_duplicates()
        logger.info(f"📋 Combined {len(combined_boss_data)} unique course-department pairs")
        
        # Log unique departments found
        # Cast to str first so non-string cells do not break the .str accessor
        unique_depts = combined_boss_data['School/Department'].astype(str).str.strip().str.upper().unique()
        logger.info(f"🏛️ Unique departments found in BOSS data: {sorted(unique_depts)}")
        
        # Process each course-department pair
        mapped_count = 0
        unmapped_depts = set()
        
        for _, row in combined_boss_data.iterrows():
            course_code = row['Course Code']
            dept_code = str(row['School/Department']).strip().upper()
            
            if not course_code or not dept_code:
                continue
            
            # Check if course exists in our courses (new or existing)
            course_exists = False
            if course_code in self.courses_cache:
                course_exists = True
            elif any(course['code'] == course_code for course in self.new_courses):
                course_exists = True
            
            if not course_exists:
                continue  # Skip courses we don't have
            
            # Map department code to faculty acronym, then to faculty ID
            if dept_code in dept_to_faculty_mapping:
                faculty_acronym = dept_to_faculty_mapping[dept_code]
                
                # Get faculty ID from acronym
                if faculty_acronym in self.faculty_acronym_to_id:
                    faculty_id = self.faculty_acronym_to_id[faculty_acronym]
                    course_faculty_mappings[course_code] = faculty_id
                    mapped_count += 1
                    logger.debug(f"✅ Mapped {course_code}: {dept_code} → {faculty_acronym} → ID {faculty_id}")
                else:
                    logger.warning(f"⚠️ Faculty acronym {faculty_acronym} not found in database")
                    unmapped_depts.add(dept_code)
            else:
                # Track unmapped department for new faculty creation
                unmapped_depts.add(dept_code)
                logger.info(f"🆕 Unmapped department: {dept_code} (for course {course_code})")
        
        logger.info(f"✅ Mapped {mapped_count} courses to existing faculties")
        logger.info(f"🆕 Found {len(unmapped_depts)} unmapped departments: {sorted(unmapped_depts)}")
        
        # Create new faculties for unmapped departments
        new_faculty_mappings = {}
        if unmapped_depts:
            new_faculty_mappings = self._create_new_faculties(unmapped_depts)
            
            # Update our faculty mapping caches
            for dept_code, faculty_data in new_faculty_mappings.items():
                faculty_id = faculty_data['id']
                faculty_acronym = faculty_data['acronym']
                
                # Update caches
                self.faculties_cache[faculty_id] = faculty_data
                self.faculty_acronym_to_id[faculty_acronym] = faculty_id
                dept_to_faculty_mapping[dept_code] = faculty_acronym
            
            # Re-process courses with new faculty mappings
            for _, row in combined_boss_data.iterrows():
                course_code = row['Course Code']
                dept_code = str(row['School/Department']).strip().upper()
                
                if course_code and dept_code and dept_code in new_faculty_mappings:
                    # Check if course exists
                    course_exists = False
                    if course_code in self.courses_cache:
                        course_exists = True
                    elif any(course['code'] == course_code for course in self.new_courses):
                        course_exists = True
                    
                    if course_exists:
                        faculty_id = new_faculty_mappings[dept_code]['id']
                        course_faculty_mappings[course_code] = faculty_id
                        mapped_count += 1
        
        # Apply faculty mappings to courses
        self._apply_faculty_mappings_to_courses(course_faculty_mappings)
        
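        logger.info("✅ Automated faculty mapping completed:")
        # Report unique courses; mapped_count tallies rows and can count a course more than once
        logger.info(f"   • {len(course_faculty_mappings)} courses mapped to faculties")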
        logger.info(f"   • {len(new_faculty_mappings)} new faculties created")
        logger.info(f"   • {len(self.courses_needing_faculty)} courses still need manual review")
        
        return True

    def _create_new_faculties(self, unmapped_dept_codes):
        """Create new faculties for unmapped department codes"""
        logger.info(f"🏗️ Creating {len(unmapped_dept_codes)} new faculties...")
        
        # Next available faculty ID: one past the highest cached ID, or 11 when
        # the cache is empty (the fallback mapping already uses IDs 1-10)
        if hasattr(self, 'faculties_cache') and self.faculties_cache:
            next_faculty_id = max(self.faculties_cache.keys()) + 1
        else:
            next_faculty_id = 11
        
        # Since all known faculties are already mapped, any new dept codes 
        # would be truly new faculties that don't exist yet
        new_faculties = []
        new_faculty_mappings = {}
        
        for dept_code in sorted(unmapped_dept_codes):
            # Generate a reasonable name from the code
            # Common patterns for new faculties
            if 'CENTRE' in dept_code.upper():
                faculty_name = f"Centre for {dept_code.replace('CENTRE', '').strip()}"
            elif 'CENTER' in dept_code.upper():
                faculty_name = f"Center for {dept_code.replace('CENTER', '').strip()}"
            elif 'INSTITUTE' in dept_code.upper():
                faculty_name = f"Institute of {dept_code.replace('INSTITUTE', '').strip()}"
            elif 'OFFICE' in dept_code.upper():
                faculty_name = f"Office of {dept_code.replace('OFFICE', '').strip()}"
            else:
                # Default pattern
                faculty_name = f"SMU {dept_code}"
            
            # Use a single timestamp so created_at and updated_at match exactly
            timestamp = datetime.now().isoformat()
            new_faculty = {
                'id': next_faculty_id,
                'name': faculty_name,
                'acronym': dept_code,  # Use the original dept_code as acronym
                'site_url': 'https://www.smu.edu.sg/',
                'belong_to_university': 1,  # SMU
                'created_at': timestamp,
                'updated_at': timestamp
            }
            
            new_faculties.append(new_faculty)
            new_faculty_mappings[dept_code] = new_faculty
            next_faculty_id += 1
            
            logger.info(f"✅ Created faculty: {faculty_name} ({dept_code}) with ID {new_faculty['id']}")
        
        # Save new faculties to verify folder
        if new_faculties:
            df = pd.DataFrame(new_faculties)
            output_path = os.path.join(self.verify_dir, 'new_faculties.csv')
            df.to_csv(output_path, index=False)
            logger.info(f"💾 Saved {len(new_faculties)} new faculties to {output_path}")
        
        return new_faculty_mappings

    def _apply_faculty_mappings_to_courses(self, course_faculty_mappings):
        """Apply faculty mappings to new courses and update courses needing faculty"""
        logger.info(f"🔄 Applying faculty mappings to {len(course_faculty_mappings)} courses...")
        
        mapped_count = 0
        
        # Update new_courses
        for course in self.new_courses:
            course_code = course['code']
            if course_code in course_faculty_mappings:
                course['belong_to_faculty'] = course_faculty_mappings[course_code]
                mapped_count += 1
        
        # Update courses_cache
        for course_code, faculty_id in course_faculty_mappings.items():
            if course_code in self.courses_cache:
                self.courses_cache[course_code]['belong_to_faculty'] = faculty_id
        
        # Remove mapped courses from courses_needing_faculty
        original_needing_count = len(self.courses_needing_faculty)
        self.courses_needing_faculty = [
            course_info for course_info in self.courses_needing_faculty
            if course_info['course_code'] not in course_faculty_mappings
        ]
        
        removed_count = original_needing_count - len(self.courses_needing_faculty)
        
        logger.info("✅ Applied faculty mappings:")
        logger.info(f"   • {mapped_count} courses updated with faculty")
        logger.info(f"   • {removed_count} courses removed from manual review queue")
        logger.info(f"   • {len(self.courses_needing_faculty)} courses still need manual review")

    def extract_acad_term_from_path(self, file_path: str) -> Optional[str]:
        r"""Extract acad_term_id from file path as fallback
        Examples:
        'script_input\classTimingsFull\2021-22_T1' -> 'AY202122T1'
        'script_input\classTimingsFull\2022-23_T3A' -> 'AY202223T3A'
        """
        # Extract the term folder name
        path_parts = file_path.replace('/', '\\').split('\\')
        
        for part in path_parts:
            # Look for pattern like "2021-22_T1"
            match = re.match(r'(\d{4})-(\d{2})_T(\w+)', part)
            if match:
                year_start = match.group(1)
                year_end = match.group(2)
                term = match.group(3)
                return f"AY{year_start}{year_end}T{term}"
        
        return None

    def open_course_html_files(self, driver, course_code):
        """Open relevant HTML files for a course from standalone data"""
        try:
            if not hasattr(self, 'standalone_data') or self.standalone_data is None:
                print(f"⚠️ No standalone data available")
                return
            
            # Filter standalone data for the current course
            course_files = self.standalone_data[
                self.standalone_data['course_code'].str.upper() == course_code.upper()
            ]
            
            if course_files.empty:
                print(f"📝 No HTML files found for {course_code}")
                return
            
            # Get unique filepaths for this course
            unique_filepaths = course_files['filepath'].dropna().unique()
            
            if len(unique_filepaths) == 0:
                print(f"📝 No valid filepaths found for {course_code}")
                return
            
            print(f"📂 Opening {len(unique_filepaths)} HTML files for {course_code}...")
            
            # Open each file in a new tab with proper file:// protocol
            for filepath in unique_filepaths:
                try:
                    # Ensure it's an HTML file
                    if not str(filepath).lower().endswith('.html'):
                        continue
                    
                    # Convert to absolute path and use proper file:// protocol
                    abs_path = os.path.abspath(str(filepath))
                    
                    # Use pathlib for cross-platform compatibility
                    from pathlib import Path
                    file_path = Path(abs_path)
                    
                    if file_path.exists():
                        # Use pathlib's as_uri() method for proper file:// URL
                        file_url = file_path.as_uri()
                        
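                        # Open in a new tab, passing the URL as a script argument
                        # instead of interpolating it into the JavaScript source
                        driver.execute_script("window.open(arguments[0], '_blank');", file_url)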
                        print(f"✅ Opened: {file_path.name}")
                    else:
                        print(f"⚠️ File not found: {abs_path}")
                        
                except Exception as e:
                    print(f"⚠️ Could not open {filepath}: {e}")
            
        except Exception as e:
            print(f"⚠️ Error opening HTML files for {course_code}: {e}")

    def get_last_filepath_by_course(self, course_code):
        """Direct filepath lookup for course code - bypasses record_key linking"""
        print(f"🔍 DEBUG: Looking for course {course_code} using direct method")
        
        # Check if we have standalone data with filepath column
        if hasattr(self, 'standalone_data') and self.standalone_data is not None:
            if 'filepath' in self.standalone_data.columns:
                print(f"✅ DEBUG: Found filepath column in standalone_data")
                
                course_records = self.standalone_data[
                    self.standalone_data['course_code'].str.upper() == course_code.upper()
                ].copy()
                
                print(f"📊 DEBUG: Found {len(course_records)} records for {course_code}")
                
                if not course_records.empty:
                    # Get the most recent record (last row)
                    last_record = course_records.iloc[-1]
                    filepath = last_record.get('filepath')
                    
                    print(f"📁 DEBUG: Last record filepath: {filepath}")
                    
                    if pd.notna(filepath):
                        print(f"✅ Found filepath for {course_code}: {filepath}")
                        return filepath
                    else:
                        print(f"❌ DEBUG: Filepath is NaN for {course_code}")
            else:
                print(f"❌ DEBUG: No 'filepath' column in standalone_data")
                print(f"Available columns: {list(self.standalone_data.columns)}")
        
        # Fallback: check multiple_data if standalone doesn't have filepath
        if hasattr(self, 'multiple_data') and self.multiple_data is not None:
            if 'filepath' in self.multiple_data.columns and 'course_code' in self.multiple_data.columns:
                print(f"✅ DEBUG: Checking multiple_data as fallback")
                
                course_records = self.multiple_data[
                    self.multiple_data['course_code'].str.upper() == course_code.upper()
                ].copy()
                
                if not course_records.empty:
                    last_record = course_records.iloc[-1]
                    filepath = last_record.get('filepath')
                    
                    if pd.notna(filepath):
                        print(f"✅ Found filepath in multiple_data for {course_code}: {filepath}")
                        return filepath
        
        print(f"❌ DEBUG: No filepath found for {course_code}")
        return None

    def close_connection(self):
        """Explicitly close database connection"""
        if self.connection:
            self.connection.close()
            self.connection = None
            logger.info("🔒 Database connection closed")

### **Cell 1: Phase 1 - Professor and Course Processing with Automated Faculty Mapping**

**What This Does:**
- Initializes the TableBuilder system and connects to PostgreSQL database for existing data validation
- Processes professors from raw data with advanced name normalization handling Asian, Western, and mixed naming patterns
- Resolves professor email addresses automatically using Microsoft Outlook integration
- Handles hardcoded multi-instructor combinations and prevents duplicate professor creation through multiple validation strategies
- Creates new courses from standalone data and automatically maps them to SMU faculties using BOSS department data
- Generates academic terms with proper ID formatting and date range extraction
- Outputs verification files for manual review: `new_professors.csv` for name corrections and `new_courses.csv` for faculty validation
- Provides detailed statistics on professors created, courses processed, and automated faculty mappings applied
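
The automated faculty mapping is a two-step lookup: BOSS department code to faculty acronym, then acronym to faculty ID. The sketch below shows that idea in isolation. The function name `resolve_faculty_id` is hypothetical, the tables are a small subset of the ones in `map_courses_to_faculties_from_boss`, and the IDs are the fallback values from `load_faculties_cache`; real IDs come from the `faculties` table.

```python
from typing import Optional

# Subset of the department-to-faculty table (illustrative, not exhaustive).
DEPT_TO_FACULTY = {
    "SIS": "SCIS",      # old name for the computing school
    "SCIS": "SCIS",
    "SOL": "YPHSL",
    "LKCSOB": "LKCSB",
    "LKCSB": "LKCSB",
}

# Fallback IDs only; the database is the source of truth.
FACULTY_ACRONYM_TO_ID = {"LKCSB": 1, "YPHSL": 2, "SCIS": 4}


def resolve_faculty_id(dept_code: str) -> Optional[int]:
    """Return the faculty ID for a BOSS department code, or None if unmapped."""
    normalized = str(dept_code).strip().upper()
    acronym = DEPT_TO_FACULTY.get(normalized)
    if acronym is None:
        return None  # unmapped department: a candidate for a new faculty
    return FACULTY_ACRONYM_TO_ID.get(acronym)


print(resolve_faculty_id(" sis "))  # 4
print(resolve_faculty_id("XYZ"))    # None
```

Returning `None` for an unknown code keeps the decision about creating a new faculty outside the lookup itself.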

In [None]:
# Initialize the TableBuilder
builder = TableBuilder()

In [None]:
# Run Phase 1 (professors, courses, acad_terms)
success = builder.run_phase1_professors_and_courses()

if success:
    print("\n🎉 Phase 1 completed successfully!")
    print("📝 Next steps:")
    print("   1. Review script_output/verify/new_professors.csv")
    print("   2. Manually correct any professor names if needed")
    print("   3. Run Phase 2 in the next cell")
else:
    print("\n❌ Phase 1 failed. Check logs for details.")

### **Cell 2: Professor Name Review and Correction Interface**

**What This Does:**
- Loads the generated `new_professors.csv` file from the verification directory
- Displays a comparison table showing four name formats: original scraped name, boss format (ALL CAPS), afterclass format (Title Case), and the final processed name
- Provides clear instructions for manual correction focusing only on the 'name' column (afterclass format)
- Guides users to preserve the boss_name format while correcting any parsing errors or name formatting issues
- Handles empty files gracefully when all professors already exist in the database
- Prepares corrected data for Phase 2 processing by maintaining proper name mapping relationships
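
Since only the `name` column should be edited, it can help to confirm that `boss_name` was left untouched before running Phase 2. The following is a minimal sketch of such a check, not part of the TableBuilder itself. The helper `changed_boss_names` and the sample names are hypothetical, and it assumes the row order is the same before and after editing.

```python
import csv
import io


def changed_boss_names(original_rows, edited_rows):
    """Return (row_index, original, edited) for rows whose boss_name changed."""
    changes = []
    for i, (before, after) in enumerate(zip(original_rows, edited_rows)):
        if before["boss_name"] != after["boss_name"]:
            changes.append((i, before["boss_name"], after["boss_name"]))
    return changes


# Inline CSV text stands in for the file before and after manual editing.
original_csv = "name,boss_name\nTan Ah Kow,TAN AH KOW\nJane Doe,JANE DOE\n"
edited_csv = "name,boss_name\nTAN Ah Kow,TAN AH KOW\nJane Doe,Jane Doe\n"

original_rows = list(csv.DictReader(io.StringIO(original_csv)))
edited_rows = list(csv.DictReader(io.StringIO(edited_csv)))

changes = changed_boss_names(original_rows, edited_rows)
print(changes)  # [(1, 'JANE DOE', 'Jane Doe')]
```

An empty list means every `boss_name` was preserved; any entry points at a row to revert.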

In [None]:
# Display new professors for review
new_prof_path = os.path.join('script_output', 'verify', 'new_professors.csv')
if os.path.exists(new_prof_path):
    df = pd.read_csv(new_prof_path)
    if not df.empty:
        print(f"📋 {len(df)} new professors created:")
        print("\n🔍 Review these professor names:")
        display(df[['name', 'boss_name', 'afterclass_name', 'original_scraped_name']])
        print("\n📝 If any names need correction, edit the 'name' column in:")
        print(f"   {new_prof_path}")
        print("\n⚠️  Only edit the 'name' column (afterclass format)")
        print("   Keep 'boss_name' unchanged")
    else:
        print("✅ No new professors created - all professors already exist in database")
else:
    print("✅ No new professors file - all professors already exist in database")

### **Cell 3: Phase 2 - Class and Timing Processing with Corrected Professor Data**

**What This Does:**
- Reads manually corrected professor names from verification CSV files and updates internal lookup tables
- Processes classes from standalone data using corrected professor mappings and established course relationships
- Handles complex professor assignments including single professors, JSON arrays for multi-instructor classes, and missing professor scenarios
- Generates class timing records (weekly schedules) and exam timing records with proper foreign key relationships
- Links all timing data to valid class IDs while maintaining referential integrity
- Creates a complete set of database-ready CSV files: `new_classes.csv`, `new_class_timing.csv`, `new_class_exam_timing.csv`
- Provides comprehensive error reporting for validation issues and successful record creation statistics
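
The three professor cases (single name, JSON array, missing) can be normalized into one list shape before any lookup. This is a hedged sketch: `parse_professor_field` is a hypothetical helper, and the exact way the notebook stores multi-instructor values is an assumption, since that code is not shown in this section.

```python
import json
from typing import List, Optional


def parse_professor_field(raw: Optional[str]) -> List[str]:
    """Normalize a raw professor cell into a list of names (possibly empty)."""
    if raw is None:
        return []
    text = str(raw).strip()
    if not text or text.lower() == "nan":
        return []  # missing professor
    if text.startswith("["):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            return [text]  # not valid JSON: treat it as one literal name
        return [str(item).strip() for item in parsed if str(item).strip()]
    return [text]  # single professor


print(parse_professor_field('["A B", "C D"]'))  # ['A B', 'C D']
print(parse_professor_field("A B"))             # ['A B']
print(parse_professor_field(""))                # []
```

Always returning a list lets downstream code loop over instructors without branching on the input shape.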

In [None]:
# Run Phase 2 (classes, timings) after manual correction
success = builder.run_phase2_remaining_tables()

if success:
    print("\n🎉 Phase 2 completed successfully!")
    print("📝 All tables generated with corrected professor names")
else:
    print("\n❌ Phase 2 failed. Check logs for details.")

### **Cell 4: Interactive Faculty Assignment for Unmapped Courses**

**What This Does:**
- Identifies courses that still require manual faculty assignment after automated BOSS-based mapping
- Opens scraped HTML course outline files in a web browser for informed faculty assignment decisions
- Presents an interactive menu of SMU's schools and centers with options to create new faculties for unmapped departments
- Provides course code, name, and content preview to guide proper faculty placement decisions
- Updates course records with selected faculty assignments and maintains faculty cache consistency
- Allows skipping courses that need additional research while preserving assignment workflow
- Re-saves updated CSV files with complete faculty information for database insertion
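
The menu step boils down to turning typed input into a faculty or a skip. The sketch below illustrates one way to do that; it is not the implementation of `assign_course_faculties`, which is not shown in this section. The helper `resolve_menu_choice`, the menu contents, and the skip keywords are all assumptions.

```python
from typing import Dict, Optional


def resolve_menu_choice(choice: str, menu: Dict[int, str]) -> Optional[str]:
    """Translate typed input into a faculty acronym; None means skip the course."""
    text = choice.strip().lower()
    if text in ("", "s", "skip"):
        return None
    if not text.isdigit():
        raise ValueError(f"Not a menu number: {choice!r}")
    number = int(text)
    if number not in menu:
        raise ValueError(f"No menu entry {number}")
    return menu[number]


# Hypothetical menu; the real one would be built from the faculties cache.
menu = {1: "LKCSB", 2: "YPHSL", 3: "SOE"}

print(resolve_menu_choice("2", menu))     # YPHSL
print(resolve_menu_choice("skip", menu))  # None
```

Keeping the parsing separate from `input()` makes the choice logic testable without a live prompt.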

In [None]:
# Run faculty assignment process if needed
if hasattr(builder, 'courses_needing_faculty') and builder.courses_needing_faculty:
    builder.assign_course_faculties()
    print("\n✅ Faculty assignment completed!")
else:
    print("✅ No courses need faculty assignment")

### **Cell 5: BOSS Bidding Results Processing and Integration**

**What This Does:**
- Scans `script_input/overallBossResults/` directory for Excel files containing SMU's bidding system historical data
- Parses academic terms from BOSS format ("2021-22 Term 1") to standardized database format ("AY202122T1")
- Creates hierarchical bid windows following SMU's bidding rules with proper round/window progression and incoming student handling
- Maps course codes and sections from BOSS data to existing class records using multiple fallback strategies
- Extracts comprehensive bidding metrics: vacancy counts, enrollment numbers, median/minimum bids, D.I.C.E scores, and availability data
- Generates three new database tables: `new_bid_window.csv`, `new_class_availability.csv`, `new_bid_result.csv`
- Creates detailed processing logs with timestamps, failed mapping analysis, and comprehensive statistics reporting
- Handles academic year rule differences (pre/post AY2024-25) and provides extensive error tracking for troubleshooting
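
The term conversion mentioned above ("2021-22 Term 1" to "AY202122T1") can be sketched with a single regular expression. `parse_boss_term` is a hypothetical name, and passing suffixes such as "3A" through unchanged is an assumption; the builder's actual parser is not shown in this section.

```python
import re

# Hypothetical pattern: four-digit start year, two-digit end year, then the term label
_TERM_RE = re.compile(r"^\s*(\d{4})-(\d{2})\s+Term\s+(\w+)\s*$", re.IGNORECASE)

def parse_boss_term(term_text):
    """Convert e.g. '2021-22 Term 1' to 'AY202122T1'; return None if the text is not recognized."""
    match = _TERM_RE.match(str(term_text))
    if match is None:
        return None
    start_year, end_year, term = match.groups()
    return f"AY{start_year}{end_year}T{term.upper()}"

print(parse_boss_term("2021-22 Term 1"))  # AY202122T1
```

Returning `None` for unrecognized input lets the caller log the raw value as a failed mapping rather than raising partway through a file.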

In [None]:
# Run complete Phase 3 pipeline
print("🚀 Starting Phase 3: BOSS Results Processing")
success = builder.run_phase3_boss_processing()

if success:
    print("\n🎉 Phase 3 completed successfully!")
    builder.close_connection()
else:
    print("\n❌ Phase 3 failed. Check logs for details.")

# Check failed mappings (if any)
failed_path = os.path.join('script_output', 'failed_boss_results_mapping.csv')
if os.path.exists(failed_path):
    failed_df = pd.read_csv(failed_path)
    print(f"⚠️ {len(failed_df)} failed mappings found:")
    display(failed_df.head(10))
    print(f"\n📝 Review failed mappings in: {failed_path}")
else:
    print("✅ No failed mappings - all BOSS results mapped successfully!")

# Inspect generated data
print("📋 Generated Data Summary:")

# Check bid windows
bid_window_path = os.path.join('script_output', 'new_bid_window.csv')
if os.path.exists(bid_window_path):
    bw_df = pd.read_csv(bid_window_path)
    print(f"\n🪟 Bid Windows ({len(bw_df)} records):")
    display(bw_df.head())

# Check class availability
availability_path = os.path.join('script_output', 'new_class_availability.csv')
if os.path.exists(availability_path):
    av_df = pd.read_csv(availability_path)
    print(f"\n📊 Class Availability ({len(av_df)} records):")
    display(av_df.head())

# Check bid results
result_path = os.path.join('script_output', 'new_bid_result.csv')
if os.path.exists(result_path):
    br_df = pd.read_csv(result_path)
    print(f"\n📈 Bid Results ({len(br_df)} records):")
    display(br_df.head())

### **Cell 6: Comprehensive Data Integrity Validation**

**What This Does:**
- Validates referential integrity across all generated CSV files by checking foreign key relationships between tables
- Loads valid IDs from multiple sources: database cache files, new CSV files, professor lookup tables, and verification files
- Performs comprehensive validation of course_id references in classes, professor_id fields (both single UUIDs and JSON arrays), and class_id references in timing tables
- Checks UUID format validity and ensures all referenced IDs exist in their respective source tables
- Generates detailed error reports with specific row numbers, invalid IDs, and raw professor names for debugging
- Creates validation summary statistics including total records checked, error counts by type, and data loading metrics
- Provides categorized error analysis for professor ID issues including format errors, missing references, and null assignments
- Saves validation results to CSV files: `validation_errors.csv`, `validation_warnings.csv`, and `validation_summary.csv` for comprehensive quality assurance
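
The foreign-key check at the heart of this validation can also be expressed in vectorized form with pandas. The sketch below is a minimal illustration on made-up frames: `find_orphan_rows` is a hypothetical helper, and the `day_of_week` column and the `c1`/`c2`/`c3` IDs are placeholder data.

```python
import pandas as pd

def find_orphan_rows(child_df, fk_column, valid_ids):
    """Return the rows of child_df whose fk_column value is not in valid_ids."""
    fk_values = child_df[fk_column].astype(str)
    return child_df[~fk_values.isin(valid_ids)]

# Placeholder data for illustration only
classes = pd.DataFrame({"id": ["c1", "c2"]})
timings = pd.DataFrame({"class_id": ["c1", "c3"], "day_of_week": ["Mon", "Tue"]})

orphans = find_orphan_rows(timings, "class_id", set(classes["id"].astype(str)))
print(orphans)
```

A row-by-row loop makes it easier to attach per-row context to each error; the vectorized form is mainly useful for a quick count on large files.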

In [None]:
# Set up logging
import logging
import os
import re
import json
import pandas as pd
from datetime import datetime
import traceback
import sys

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class DataIntegrityValidator:
    """Validates data integrity across generated CSV files and database cache"""
    
    def __init__(self, output_base='script_output', cache_dir='db_cache'):
        self.output_base = output_base
        self.verify_dir = os.path.join(output_base, 'verify')
        self.cache_dir = cache_dir
        
        # Ensure directories exist
        os.makedirs(self.verify_dir, exist_ok=True)
        os.makedirs(self.cache_dir, exist_ok=True)
        
        # Data containers
        self.valid_course_ids = set()
        self.valid_professor_ids = set()
        self.valid_class_ids = set()
        
        # Professor lookup mapping
        self.professor_lookup = {}
        
        # Validation results
        self.validation_errors = []
        self.validation_warnings = []
        
        # Statistics
        self.stats = {
            'total_classes_checked': 0,
            'total_timings_checked': 0,
            'total_exam_timings_checked': 0,
            'course_id_errors': 0,
            'professor_id_errors': 0,
            'professor_id_format_errors': 0,
            'class_id_errors': 0,
            'warnings': 0,
            'professors_created': 0,
            'professors_updated': 0,
            'courses_created': 0,
            'courses_updated': 0,
            'courses_needing_faculty': 0,
            'classes_created': 0,
            'timings_created': 0,
            'exams_created': 0
        }
        
        # Initialize new_acad_terms for compatibility
        self.new_acad_terms = []
    
    def is_valid_uuid(self, uuid_string):
        """Check if a string is a valid UUID format"""
        if not uuid_string or pd.isna(uuid_string):
            return False
        
        try:
            uuid_pattern = re.compile(
                r'^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$',
                re.IGNORECASE
            )
            return bool(uuid_pattern.match(str(uuid_string).strip()))
        except Exception as e:
            logger.warning(f"UUID validation error for {uuid_string}: {e}")
            return False
    
    def safe_read_csv(self, file_path, required_columns=None):
        """Safely read CSV file with error handling"""
        try:
            if not os.path.exists(file_path):
                logger.warning(f"File not found: {file_path}")
                return pd.DataFrame()
            
            if os.path.getsize(file_path) == 0:
                logger.warning(f"File is empty: {file_path}")
                return pd.DataFrame()
            
            df = pd.read_csv(file_path)
            
            if df.empty:
                logger.warning(f"CSV file is empty: {file_path}")
                return pd.DataFrame()
            
            if required_columns:
                missing_columns = [col for col in required_columns if col not in df.columns]
                if missing_columns:
                    logger.warning(f"Missing columns in {file_path}: {missing_columns}")
                    return pd.DataFrame()
            
            return df
        except Exception as e:
            logger.error(f"Error reading CSV {file_path}: {e}")
            return pd.DataFrame()
    
    def safe_read_pickle(self, file_path):
        """Safely read pickle file with error handling"""
        try:
            if not os.path.exists(file_path):
                logger.warning(f"Pickle file not found: {file_path}")
                return pd.DataFrame()
            
            df = pd.read_pickle(file_path)
            return df if not df.empty else pd.DataFrame()
        except Exception as e:
            logger.error(f"Error reading pickle {file_path}: {e}")
            return pd.DataFrame()
    
    def load_valid_course_ids(self):
        """Load valid course IDs from new_courses.csv and database cache"""
        logger.info("📚 Loading valid course IDs...")
        
        # Load from new_courses.csv (verify folder)
        new_courses_path = os.path.join(self.verify_dir, 'new_courses.csv')
        df = self.safe_read_csv(new_courses_path, ['id'])
        if not df.empty:
            new_course_ids = set(df['id'].astype(str))
            self.valid_course_ids.update(new_course_ids)
            logger.info(f"   ✅ Loaded {len(new_course_ids)} course IDs from new_courses.csv")
        
        # Load from database cache
        cache_file = os.path.join(self.cache_dir, 'courses_cache.pkl')
        courses_df = self.safe_read_pickle(cache_file)
        if not courses_df.empty and 'id' in courses_df.columns:
            cache_course_ids = set(courses_df['id'].astype(str))
            self.valid_course_ids.update(cache_course_ids)
            logger.info(f"   ✅ Loaded {len(cache_course_ids)} course IDs from database cache")
        
        logger.info(f"   📊 Total valid course IDs: {len(self.valid_course_ids)}")
    
    def load_valid_professor_ids(self):
        """Load valid professor IDs from multiple sources including professor_lookup.csv"""
        logger.info("👥 Loading valid professor IDs...")
        
        # PRIORITY 1: Load from professor_lookup.csv (most authoritative)
        lookup_file = 'script_input/professor_lookup.csv'
        if os.path.exists(lookup_file):
            lookup_df = self.safe_read_csv(lookup_file, ['database_id'])
            if not lookup_df.empty and 'database_id' in lookup_df.columns:
                lookup_professor_ids = set(lookup_df['database_id'].dropna().astype(str))
                self.valid_professor_ids.update(lookup_professor_ids)
                logger.info(f"   ✅ Loaded {len(lookup_professor_ids)} professor IDs from professor_lookup.csv")

                # Also build lookup mapping for analysis
                for _, row in lookup_df.iterrows():
                    boss_name = row.get('boss_name')
                    database_id = row.get('database_id')
                    # Check for NaN before casting: str(NaN) is 'nan', which pd.notna() treats as valid
                    if pd.notna(boss_name) and pd.notna(database_id):
                        self.professor_lookup[boss_name] = str(database_id)
        
        # PRIORITY 2: Load from database cache (professors table)
        cache_file = os.path.join(self.cache_dir, 'professors_cache.pkl')
        professors_df = self.safe_read_pickle(cache_file)
        if not professors_df.empty and 'id' in professors_df.columns:
            cache_professor_ids = set(professors_df['id'].astype(str))
            self.valid_professor_ids.update(cache_professor_ids)
            logger.info(f"   ✅ Loaded {len(cache_professor_ids)} professor IDs from database cache")
        
        # PRIORITY 3: Load from new_professors.csv (verify folder)
        new_professors_path = os.path.join(self.verify_dir, 'new_professors.csv')
        df = self.safe_read_csv(new_professors_path, ['id'])
        if not df.empty:
            new_professor_ids = set(df['id'].astype(str))
            self.valid_professor_ids.update(new_professor_ids)
            logger.info(f"   ✅ Loaded {len(new_professor_ids)} professor IDs from new_professors.csv")
        
        logger.info(f"   📊 Total valid professor IDs: {len(self.valid_professor_ids)}")
    
    def load_valid_class_ids(self):
        """Load valid class IDs from new_classes.csv"""
        logger.info("🏫 Loading valid class IDs...")
        
        classes_path = os.path.join(self.output_base, 'new_classes.csv')
        df = self.safe_read_csv(classes_path, ['id'])
        if not df.empty:
            self.valid_class_ids = set(df['id'].astype(str))
            logger.info(f"   ✅ Loaded {len(self.valid_class_ids)} class IDs from new_classes.csv")
        else:
            logger.error(f"   ❌ Could not load class IDs from {classes_path}")
        
        logger.info(f"   📊 Total valid class IDs: {len(self.valid_class_ids)}")
    
    def parse_professor_ids(self, professor_id_field):
        """Safely parse professor ID field which can be single ID or JSON array"""
        if pd.isna(professor_id_field) or str(professor_id_field).strip() == '':
            return []
        
        professor_id_str = str(professor_id_field).strip()
        
        # Check if it's a JSON array
        if professor_id_str.startswith('[') and professor_id_str.endswith(']'):
            try:
                # Handle both single and double quotes
                normalized_json = professor_id_str.replace("'", '"')
                parsed_ids = json.loads(normalized_json)
                
                if isinstance(parsed_ids, list):
                    return [str(pid).strip() for pid in parsed_ids if pd.notna(pid)]
                else:
                    return []
            except (json.JSONDecodeError, TypeError) as e:
                logger.warning(f"JSON parsing error for professor_id: {professor_id_str} - {e}")
                return []
        else:
            # Single professor ID
            return [professor_id_str] if professor_id_str else []
    
    def validate_classes(self):
        """Validate course_id and professor_id references in new_classes.csv"""
        logger.info("🔍 Validating new_classes.csv...")
        
        classes_path = os.path.join(self.output_base, 'new_classes.csv')
        df = self.safe_read_csv(classes_path, ['id', 'course_id'])
        
        if df.empty:
            logger.error("   ❌ Could not validate classes - file missing, empty, or lacking required columns")
            return
        
        try:
            self.stats['total_classes_checked'] = len(df)
            
            for idx, row in df.iterrows():
                try:
                    class_id = str(row['id'])
                    course_id = str(row['course_id'])
                    professor_id_field = row.get('professor_id')
                    raw_professor_name = row.get('raw_professor_name', '')
                    
                    # Validate course_id
                    if course_id not in self.valid_course_ids:
                        error = {
                            'type': 'course_id_missing',
                            'file': 'new_classes.csv',
                            'row': idx,
                            'class_id': class_id,
                            'invalid_id': course_id,
                            'field': 'course_id'
                        }
                        self.validation_errors.append(error)
                        self.stats['course_id_errors'] += 1
                    
                    # Validate professor_id
                    professor_ids_to_check = self.parse_professor_ids(professor_id_field)
                    
                    if professor_ids_to_check:
                        for prof_id in professor_ids_to_check:
                            prof_id_str = str(prof_id).strip()
                            
                            # Check UUID format
                            if not self.is_valid_uuid(prof_id_str):
                                error = {
                                    'type': 'professor_id_invalid_uuid',
                                    'file': 'new_classes.csv',
                                    'row': idx,
                                    'class_id': class_id,
                                    'invalid_id': prof_id_str,
                                    'field': 'professor_id',
                                    'raw_professor_name': raw_professor_name,
                                    'course_id': course_id
                                }
                                self.validation_errors.append(error)
                                self.stats['professor_id_format_errors'] += 1
                                continue
                            
                            # Check if professor exists
                            if prof_id_str not in self.valid_professor_ids:
                                error = {
                                    'type': 'professor_id_not_found',
                                    'file': 'new_classes.csv',
                                    'row': idx,
                                    'class_id': class_id,
                                    'invalid_id': prof_id_str,
                                    'field': 'professor_id',
                                    'raw_professor_name': raw_professor_name,
                                    'course_id': course_id
                                }
                                self.validation_errors.append(error)
                                self.stats['professor_id_errors'] += 1
                    else:
                        # Warning for missing professor
                        warning = {
                            'type': 'professor_id_null',
                            'file': 'new_classes.csv',
                            'row': idx,
                            'class_id': class_id,
                            'message': 'No professors found for class',
                            'raw_professor_name': raw_professor_name,
                            'course_id': course_id
                        }
                        self.validation_warnings.append(warning)
                        self.stats['warnings'] += 1
                
                except Exception as e:
                    logger.error(f"Error processing row {idx}: {e}")
                    continue
            
            logger.info(f"   ✅ Validated {len(df)} classes")
            
        except Exception as e:
            logger.error(f"   ❌ Error validating classes: {e}")
            traceback.print_exc()
    
    def validate_class_timings(self):
        """Validate class_id references in new_class_timing.csv"""
        logger.info("⏰ Validating new_class_timing.csv...")
        
        timings_path = os.path.join(self.output_base, 'new_class_timing.csv')
        df = self.safe_read_csv(timings_path, ['class_id'])
        
        if df.empty:
            logger.warning("   ⚠️ new_class_timing.csv missing, empty, or lacking a class_id column")
            return
        
        try:
            self.stats['total_timings_checked'] = len(df)
            
            for idx, row in df.iterrows():
                try:
                    class_id = str(row['class_id'])
                    
                    if class_id not in self.valid_class_ids:
                        error = {
                            'type': 'class_id_missing',
                            'file': 'new_class_timing.csv',
                            'row': idx,
                            'invalid_id': class_id,
                            'field': 'class_id'
                        }
                        self.validation_errors.append(error)
                        self.stats['class_id_errors'] += 1
                
                except Exception as e:
                    logger.error(f"Error processing timing row {idx}: {e}")
                    continue
            
            logger.info(f"   ✅ Validated {len(df)} class timings")
            
        except Exception as e:
            logger.error(f"   ❌ Error validating class timings: {e}")
    
    def validate_exam_timings(self):
        """Validate class_id references in new_class_exam_timing.csv"""
        logger.info("📝 Validating new_class_exam_timing.csv...")
        
        exam_timings_path = os.path.join(self.output_base, 'new_class_exam_timing.csv')
        df = self.safe_read_csv(exam_timings_path, ['class_id'])
        
        if df.empty:
            logger.warning("   ⚠️ new_class_exam_timing.csv missing, empty, or lacking a class_id column")
            return
        
        try:
            self.stats['total_exam_timings_checked'] = len(df)
            
            for idx, row in df.iterrows():
                try:
                    class_id = str(row['class_id'])
                    
                    if class_id not in self.valid_class_ids:
                        error = {
                            'type': 'class_id_missing',
                            'file': 'new_class_exam_timing.csv',
                            'row': idx,
                            'invalid_id': class_id,
                            'field': 'class_id'
                        }
                        self.validation_errors.append(error)
                        self.stats['class_id_errors'] += 1
                
                except Exception as e:
                    logger.error(f"Error processing exam timing row {idx}: {e}")
                    continue
            
            logger.info(f"   ✅ Validated {len(df)} exam timings")
            
        except Exception as e:
            logger.error(f"   ❌ Error validating exam timings: {e}")
    
    def analyze_professor_issues(self):
        """Analyze professor-related issues in detail"""
        logger.info("🔬 Analyzing professor issues...")
        
        # Group professor errors by type
        error_types = {}
        for error in self.validation_errors:
            if 'professor_id' in error['type']:
                error_types.setdefault(error['type'], []).append(error)
        
        if error_types:
            print(f"\n📊 PROFESSOR ID ERROR ANALYSIS:")
            for error_type, errors in error_types.items():
                print(f"\n   Error Type: {error_type}")
                print(f"   Count: {len(errors)}")
                
                # Show unique invalid IDs for this error type
                unique_invalid_ids = set()
                for error in errors:
                    unique_invalid_ids.add(error['invalid_id'])
                
                print(f"   Unique Invalid IDs: {len(unique_invalid_ids)}")
                
                # Show sample errors
                print(f"   Sample errors:")
                for i, error in enumerate(errors[:3]):
                    print(f"     {i+1}. Class {error['class_id']} - Raw name: {error.get('raw_professor_name', 'N/A')}")
                    print(f"        Invalid ID: {error['invalid_id']}")
                
                if len(errors) > 3:
                    print(f"     ... and {len(errors) - 3} more")
    
    def save_validation_report(self):
        """Save validation errors and warnings to CSV files"""
        logger.info("💾 Saving validation report...")
        
        try:
            # Save validation errors
            if self.validation_errors:
                errors_df = pd.DataFrame(self.validation_errors)
                errors_path = os.path.join(self.output_base, 'validation_errors.csv')
                errors_df.to_csv(errors_path, index=False)
                logger.info(f"   ❌ Saved {len(self.validation_errors)} validation errors to validation_errors.csv")
            
            # Save validation warnings
            if self.validation_warnings:
                warnings_df = pd.DataFrame(self.validation_warnings)
                warnings_path = os.path.join(self.output_base, 'validation_warnings.csv')
                warnings_df.to_csv(warnings_path, index=False)
                logger.info(f"   ⚠️ Saved {len(self.validation_warnings)} validation warnings to validation_warnings.csv")
            
            # Save summary report
            summary = {
                'validation_timestamp': datetime.now().isoformat(),
                'total_classes_checked': self.stats['total_classes_checked'],
                'total_timings_checked': self.stats['total_timings_checked'],
                'total_exam_timings_checked': self.stats['total_exam_timings_checked'],
                'total_errors': len(self.validation_errors),
                'total_warnings': len(self.validation_warnings),
                'course_id_errors': self.stats['course_id_errors'],
                'professor_id_errors': self.stats['professor_id_errors'],
                'professor_id_format_errors': self.stats['professor_id_format_errors'],
                'class_id_errors': self.stats['class_id_errors'],
                'valid_course_ids_loaded': len(self.valid_course_ids),
                'valid_professor_ids_loaded': len(self.valid_professor_ids),
                'valid_class_ids_loaded': len(self.valid_class_ids)
            }
            
            summary_df = pd.DataFrame([summary])
            summary_path = os.path.join(self.output_base, 'validation_summary.csv')
            summary_df.to_csv(summary_path, index=False)
            logger.info(f"   📊 Saved validation summary to validation_summary.csv")
            
        except Exception as e:
            logger.error(f"Error saving validation report: {e}")
    
    def print_summary(self):
        """Print validation summary (counts collected by this validator)"""
        print("\n" + "="*70)
        print("📊 VALIDATION SUMMARY")
        print("="*70)
        print(f"✅ Classes checked: {self.stats['total_classes_checked']}")
        print(f"✅ Class timings checked: {self.stats['total_timings_checked']}")
        print(f"✅ Exam timings checked: {self.stats['total_exam_timings_checked']}")
        print(f"❌ course_id errors: {self.stats['course_id_errors']}")
        print(f"❌ professor_id not found: {self.stats['professor_id_errors']}")
        print(f"❌ professor_id format errors: {self.stats['professor_id_format_errors']}")
        print(f"❌ class_id errors: {self.stats['class_id_errors']}")
        print(f"⚠️  Warnings: {self.stats['warnings']}")
        print("="*70)
        
        print("\n📁 LOADED REFERENCE IDS:")
        print(f"   - Course IDs: {len(self.valid_course_ids)}")
        print(f"   - Professor IDs: {len(self.valid_professor_ids)}")
        print(f"   - Class IDs: {len(self.valid_class_ids)}")
        
        print("\n📁 REPORT FILES:")
        print(f"   Output folder: {self.output_base}/")
        print(f"   - validation_errors.csv ({len(self.validation_errors)} records, written only if errors exist)")
        print(f"   - validation_warnings.csv ({len(self.validation_warnings)} records, written only if warnings exist)")
        print("   - validation_summary.csv")
        print("="*70)
    
    def run_validation(self):
        """Run complete data integrity validation"""
        try:
            logger.info("🚀 Starting Data Integrity Validation")
            logger.info("="*60)
            
            # Step 1: Load valid IDs from all sources
            self.load_valid_course_ids()
            self.load_valid_professor_ids()
            self.load_valid_class_ids()
            
            # Step 2: Validate references
            self.validate_classes()
            self.validate_class_timings()
            self.validate_exam_timings()
            
            # Step 3: Analyze issues
            self.analyze_professor_issues()
            
            # Step 4: Save and display results
            self.save_validation_report()
            self.print_summary()
            
            logger.info("\n✅ Data integrity validation completed!")
            
            # Return validation status
            return len(self.validation_errors) == 0
            
        except Exception as e:
            logger.error(f"❌ Validation failed: {e}")
            traceback.print_exc()
            return False

In [None]:
validator = DataIntegrityValidator()
success = validator.run_validation()

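# Report the outcome without calling exit(): in a notebook, exit() would stop the
# kernel before the remaining cells can run.
if success:
    print("\n🎉 All data integrity checks passed!")
else:
    print("\n💥 Data integrity issues found - check error reports!")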

In [None]:
def check_class_coverage_standalone(output_dir='script_output'):
    """Standalone function to analyze class coverage and generate detailed report"""
    import pandas as pd
    import os
    from collections import defaultdict
    
    print("\n" + "="*70)
    print("🔍 CLASS COVERAGE ANALYSIS")
    print("="*70)
    
    # Load new_classes.csv
    classes_path = os.path.join(output_dir, 'new_classes.csv')
    if not os.path.exists(classes_path):
        print(f"❌ File not found: {classes_path}")
        return
    
    try:
        classes_df = pd.read_csv(classes_path)
        total_classes = len(classes_df)
        all_class_ids = set(classes_df['id'].unique())
        
        # Files to check
        files_to_check = {
            'new_class_availability.csv': 'class_availability',
            'new_class_exam_timing.csv': 'class_exam_timing', 
            'new_class_timing.csv': 'class_timing',
            'new_bid_result.csv': 'bid_result'
        }
        
        coverage_results = {}
        orphan_class_ids = defaultdict(list)
        
        # Load each file and analyze
        for filename, table_name in files_to_check.items():
            file_path = os.path.join(output_dir, filename)
            
            if not os.path.exists(file_path):
                coverage_results[table_name] = {
                    'found_ids': set(),
                    'orphan_ids': set()
                }
                continue
                
            df = pd.read_csv(file_path)
            if 'class_id' in df.columns:
                found_class_ids = set(df['class_id'].unique())
                
                # Check for orphan class_ids
                orphan_ids = found_class_ids - all_class_ids
                if orphan_ids:
                    orphan_class_ids[table_name] = orphan_ids
                
                # Store valid class_ids
                valid_class_ids = found_class_ids & all_class_ids
                
                coverage_results[table_name] = {
                    'found_ids': valid_class_ids,
                    'orphan_ids': orphan_ids
                }
        
        # Calculate statistics
        print("\n📊 STATISTICS:")
        print("-" * 50)
        
        # 1. Total class rows
        print(f"1. Total class rows created: {total_classes}")
        
        # 2. Unique course/section/term combinations
        unique_combinations = classes_df.groupby(['course_id', 'section', 'acad_term_id']).size()
        num_unique_combinations = len(unique_combinations)
        print(f"2. Unique course/section/term combinations: {num_unique_combinations}")
        
        # 3. Classes from multiple professors
        multi_professor_combinations = unique_combinations[unique_combinations > 1]
        total_multi_professor_classes = multi_professor_combinations.sum()
        print(f"3. Class records from multiple professors: {total_multi_professor_classes}")
        
        # 4. Unique classes duplicated due to multiple professors
        num_duplicated_unique_classes = len(multi_professor_combinations)
        print(f"4. Unique classes duplicated due to multiple professors: {num_duplicated_unique_classes}")
        
        # 5. Classes with no BOSS results
        no_boss_classes = []
        for class_id in all_class_ids:
            in_availability = class_id in coverage_results.get('class_availability', {}).get('found_ids', set())
            in_bid_result = class_id in coverage_results.get('bid_result', {}).get('found_ids', set())
            if not in_availability and not in_bid_result:
                no_boss_classes.append(class_id)
        print(f"5. Classes with no BOSS results (no availability/bid_result): {len(no_boss_classes)}")
        
        # 6. Classes with no exams but have class timings
        has_timing = coverage_results.get('class_timing', {}).get('found_ids', set())
        has_exam = coverage_results.get('class_exam_timing', {}).get('found_ids', set())
        no_exam_with_timing = has_timing - has_exam
        print(f"6. Classes with class timings but no exams: {len(no_exam_with_timing)}")
        
        # 7. Classes with exams but no class timings
        exam_no_timing = has_exam - has_timing
        print(f"7. Classes with exams but no class timings: {len(exam_no_timing)}")
        
        # 8. Classes with both exams and class timings
        both_exam_timing = has_exam & has_timing
        print(f"8. Classes with both exams and class timings: {len(both_exam_timing)}")
        
        # 9. Orphan class_ids
        total_orphans = sum(len(ids) for ids in orphan_class_ids.values())
        print(f"9. Orphan class_ids (in tables but not in new_classes): {total_orphans}")
        if total_orphans > 0:
            for table, ids in orphan_class_ids.items():
                print(f"   - {table}: {len(ids)} orphan IDs")
        
        # 10. BOSS records not mapped to scraped data
        print("\n📊 Checking BOSS records not mapped to scraped data...")
        
        # Load BOSS data to check unmapped records
        try:
            import glob
            import re
            boss_files = glob.glob(os.path.join('script_input', 'overallBossResults', '*.xlsx'))
            if boss_files:
                # Unique course/section/term keys from the scraped data.
                # These are keyed by course_id, so they cannot yet be compared
                # directly with the course_code-keyed BOSS combinations below.
                scraped_combinations = set(zip(
                    classes_df['course_id'],
                    classes_df['section'].astype(str),
                    classes_df['acad_term_id']
                ))
                
                # Collect unique BOSS course/section/term combinations
                term_pattern = re.compile(r'(\d{4})-(\d{2})\s+Term\s+(\w+)')
                boss_unique_combinations = set()
                for file_path in boss_files[:1]:  # Sample first file for performance
                    boss_df = pd.read_excel(file_path)
                    if all(col in boss_df.columns for col in ['Course Code', 'Section', 'Term']):
                        for _, row in boss_df.iterrows():
                            if pd.notna(row['Course Code']) and pd.notna(row['Section']) and pd.notna(row['Term']):
                                course_code = row['Course Code']
                                section = str(row['Section'])
                                term = row['Term']
                                # Convert term (e.g. "2024-25 Term 1") to acad_term_id format
                                if isinstance(term, str) and '-' in term:
                                    match = term_pattern.match(term)
                                    if match:
                                        acad_term_id = f"AY{match.group(1)}{match.group(2)}T{match.group(3)}"
                                        boss_unique_combinations.add((course_code, section, acad_term_id))
                
                # Note: This is approximate as we're comparing course_code vs course_id
                print(f"10. Unique BOSS combinations sampled: {len(boss_unique_combinations)}")
                print("    (Note: Exact count requires course_code to course_id mapping)")
            else:
                print("10. No BOSS result files found in script_input/overallBossResults")
        except Exception as e:
            print(f"10. Could not analyze BOSS unmapped records: {e}")
        
        # Generate detailed report
        print("\n💾 Generating detailed report...")
        
        missing_report = []
        for _, class_row in classes_df.iterrows():
            class_id = class_row['id']
            
            row = {
                'class_id': class_id,
                'course_id': class_row.get('course_id'),
                'section': class_row.get('section'),
                'professor_id': class_row.get('professor_id'),
                'acad_term_id': class_row.get('acad_term_id'),
                'boss_id': class_row.get('boss_id'),
                'raw_professor_name': class_row.get('raw_professor_name', ''),
                'warn_inaccuracy': class_row.get('warn_inaccuracy', False)
            }
            
            # Check each table
            for table_name, result in coverage_results.items():
                row[f'in_{table_name}'] = 'Yes' if class_id in result.get('found_ids', set()) else 'No'
            
            # Add summary flags
            row['has_boss_data'] = 'Yes' if (
                row.get('in_class_availability') == 'Yes' or 
                row.get('in_bid_result') == 'Yes'
            ) else 'No'
            
            row['has_timing_data'] = 'Yes' if row.get('in_class_timing') == 'Yes' else 'No'
            row['has_exam_data'] = 'Yes' if row.get('in_class_exam_timing') == 'Yes' else 'No'
            
            missing_report.append(row)
        
        # Save report
        report_df = pd.DataFrame(missing_report)
        report_path = os.path.join(output_dir, 'class_coverage_detailed_report.csv')
        report_df.to_csv(report_path, index=False)
        print(f"✅ Detailed report saved to: {report_path}")
        
        # Summary
        print("\n" + "="*70)
        print("📊 SUMMARY")
        print("="*70)
        print(f"Total classes analyzed: {total_classes}")
        print("Report generated: class_coverage_detailed_report.csv")
        print("="*70)
        
    except Exception as e:
        print(f"❌ Error: {e}")
        import traceback
        traceback.print_exc()

# Usage:
check_class_coverage_standalone('script_output')
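
The term conversion used in check 10 can be pulled out into a small helper so it can be tested on its own. The sketch below reuses the same regular expression as the cell above and assumes BOSS term strings look like `2024-25 Term 1`, which is what that pattern expects. The function name `boss_term_to_acad_term_id` is hypothetical and not part of the original script.

```python
import re

# Same pattern as in check_class_coverage_standalone.
TERM_PATTERN = re.compile(r'(\d{4})-(\d{2})\s+Term\s+(\w+)')

def boss_term_to_acad_term_id(term):
    """Convert a BOSS term string to the acad_term_id format.

    Returns None when the value is not a string or does not match the
    expected pattern (hypothetical helper, for illustration only).
    """
    if not isinstance(term, str):
        return None
    match = TERM_PATTERN.match(term)
    if match is None:
        return None
    start_year, end_year, term_code = match.groups()
    return f"AY{start_year}{end_year}T{term_code}"

print(boss_term_to_acad_term_id("2024-25 Term 1"))  # AY202425T1
```

Returning `None` for unparseable values lets the caller skip those rows explicitly instead of silently dropping them inside nested conditionals.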

In [None]:
def check_duplicates_in_tables(output_dir='script_output'):
    """Check for duplicate records in class_availability, class_exam_timing, class_timing, and bid_result"""
    import pandas as pd
    import os
    
    print("\n" + "="*70)
    print("🔍 DUPLICATE RECORDS ANALYSIS")
    print("="*70)
    
    # Tables to check with their unique key combinations
    tables_to_check = {
        'new_class_availability.csv': {
            'name': 'class_availability',
            'key_columns': ['class_id', 'bid_window_id'],
            'description': 'class_id + bid_window_id'
        },
        'new_bid_result.csv': {
            'name': 'bid_result',
            'key_columns': ['bid_window_id', 'class_id'],
            'description': 'bid_window_id + class_id'
        },
        'new_class_timing.csv': {
            'name': 'class_timing',
            'key_columns': None,  # No composite key, check for exact duplicates
            'description': 'all columns (no defined unique key)'
        },
        'new_class_exam_timing.csv': {
            'name': 'class_exam_timing',
            'key_columns': None,  # No composite key, check for exact duplicates
            'description': 'all columns (no defined unique key)'
        }
    }
    
    all_duplicates = {}
    
    for filename, config in tables_to_check.items():
        file_path = os.path.join(output_dir, filename)
        table_name = config['name']
        
        print(f"\n📊 Checking {table_name}...")
        print("-" * 50)
        
        if not os.path.exists(file_path):
            print(f"⚠️  {filename} not found - skipping")
            continue
        
        try:
            df = pd.read_csv(file_path)
            total_rows = len(df)
            print(f"Total rows: {total_rows}")
            
            if total_rows == 0:
                print("❌ No data in file")
                continue
            
            duplicates_found = []
            
            if config['key_columns']:
                # Check for duplicates based on composite key
                key_cols = config['key_columns']
                
                # Verify columns exist
                missing_cols = [col for col in key_cols if col not in df.columns]
                if missing_cols:
                    print(f"❌ Missing required columns: {missing_cols}")
                    continue
                
                # Find duplicates
                duplicated_mask = df.duplicated(subset=key_cols, keep=False)
                duplicates = df[duplicated_mask].copy()
                
                if len(duplicates) > 0:
                    # Sort by key columns for better visualization
                    duplicates = duplicates.sort_values(by=key_cols)
                    
                    # Group duplicates
                    duplicate_groups = duplicates.groupby(key_cols).size().reset_index(name='count')
                    num_duplicate_groups = len(duplicate_groups)
                    
                    print(f"❌ Found {len(duplicates)} duplicate rows")
                    print(f"   Duplicate groups: {num_duplicate_groups}")
                    print(f"   Unique constraint violated: {config['description']}")
                    
                    # Show sample duplicates
                    print("\n   Sample duplicate groups:")
                    for idx, group in duplicate_groups.head(5).iterrows():
                        key_values = {col: group[col] for col in key_cols}
                        print(f"   • {key_values} appears {group['count']} times")
                    
                    if num_duplicate_groups > 5:
                        print(f"   ... and {num_duplicate_groups - 5} more duplicate groups")
                    
                    duplicates_found = duplicates
                else:
                    print(f"✅ No duplicates found on composite key: {config['description']}")
            
            else:
                # Check for exact row duplicates (all columns)
                duplicated_mask = df.duplicated(keep=False)
                duplicates = df[duplicated_mask].copy()
                
                if len(duplicates) > 0:
                    print(f"❌ Found {len(duplicates)} duplicate rows (exact matches)")
                    
                    # Show sample duplicates
                    print("\n   Sample duplicate rows:")
                    for idx, row in duplicates.head(5).iterrows():
                        print(f"   • Row {idx}: class_id={row.get('class_id', 'N/A')}")
                        # Print one detail line per row, chosen by table type
                        if table_name == 'class_exam_timing':
                            print(f"     Exam: {row.get('date', '')} {row.get('start_time', '')}-{row.get('end_time', '')}")
                        elif 'start_time' in row:
                            print(f"     Timing: {row.get('day_of_week', '')} {row.get('start_time', '')}-{row.get('end_time', '')}")
                    
                    duplicates_found = duplicates
                else:
                    print("✅ No exact duplicate rows found")
            
            # Additional checks for timing tables
            if table_name == 'class_timing' and len(df) > 0:
                # Flag classes that have more than one timing on the same day.
                # Start/end times are not compared, so these are potential conflicts only.
                print("\n🔍 Checking for same-day class timings...")
                if all(col in df.columns for col in ['class_id', 'day_of_week', 'start_time', 'end_time']):
                    overlap_issues = []
                    for class_id in df['class_id'].unique():
                        class_timings = df[df['class_id'] == class_id]
                        if len(class_timings) > 1:
                            # Check each pair of timings
                            timings_list = class_timings.to_dict('records')
                            for i in range(len(timings_list)):
                                for j in range(i + 1, len(timings_list)):
                                    t1 = timings_list[i]
                                    t2 = timings_list[j]
                                    if (t1['day_of_week'] == t2['day_of_week'] and 
                                        pd.notna(t1['day_of_week']) and pd.notna(t2['day_of_week'])):
                                        # Same day: record the pair for manual review
                                        overlap_issues.append({
                                            'class_id': class_id,
                                            'day': t1['day_of_week'],
                                            'timing1': f"{t1['start_time']}-{t1['end_time']}",
                                            'timing2': f"{t2['start_time']}-{t2['end_time']}"
                                        })
                    
                    if overlap_issues:
                        print(f"⚠️  Found {len(overlap_issues)} potential timing conflicts")
                        for issue in overlap_issues[:3]:
                            print(f"   • Class {issue['class_id']} on {issue['day']}: {issue['timing1']} and {issue['timing2']}")
                    else:
                        print("✅ No same-day timing pairs found")
            
            # Store results
            if len(duplicates_found) > 0:
                all_duplicates[table_name] = {
                    'dataframe': duplicates_found,
                    'total_duplicates': len(duplicates_found),
                    'key_columns': config['key_columns'],
                    'description': config['description']
                }
            
        except Exception as e:
            print(f"❌ Error processing {filename}: {e}")
            import traceback
            traceback.print_exc()
    
    # Export duplicate reports
    print("\n" + "="*70)
    print("💾 EXPORTING DUPLICATE REPORTS")
    print("="*70)
    
    if all_duplicates:
        for table_name, dup_info in all_duplicates.items():
            output_filename = f"duplicates_{table_name}.csv"
            output_path = os.path.join(output_dir, output_filename)
            
            # Add duplicate group numbers
            df = dup_info['dataframe']
            if dup_info['key_columns']:
                # Add group number for easier identification
                df['duplicate_group'] = df.groupby(dup_info['key_columns']).ngroup() + 1
                df = df.sort_values(by=['duplicate_group'] + dup_info['key_columns'])
            
            df.to_csv(output_path, index=False)
            print(f"✅ Exported {dup_info['total_duplicates']} duplicate records to: {output_filename}")
    else:
        print("✅ No duplicates found in any table!")
    
    # Summary
    print("\n" + "="*70)
    print("📊 SUMMARY")
    print("="*70)
    
    total_duplicates = sum(info['total_duplicates'] for info in all_duplicates.values())
    print(f"Total duplicate records found: {total_duplicates}")
    
    if all_duplicates:
        print("\nDuplicates by table:")
        for table_name, info in all_duplicates.items():
            print(f"  • {table_name}: {info['total_duplicates']} duplicates on {info['description']}")
    
    print("="*70)

# Usage:
check_duplicates_in_tables('script_output')
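
The `class_timing` check in the cell above pairs up timings for the same class on the same day without comparing the times themselves. A stricter interval test is sketched below. It assumes `start_time` and `end_time` are 24-hour `HH:MM` strings, which is an assumption because the actual column format is not shown here. The helper name `timings_overlap` is hypothetical.

```python
from datetime import datetime

def timings_overlap(start1, end1, start2, end2, fmt="%H:%M"):
    """Return True if the half-open intervals [start1, end1) and
    [start2, end2) share any time. Assumes times parse with `fmt`."""
    s1, e1 = datetime.strptime(start1, fmt), datetime.strptime(end1, fmt)
    s2, e2 = datetime.strptime(start2, fmt), datetime.strptime(end2, fmt)
    # Two intervals intersect exactly when each starts before the other ends.
    return s1 < e2 and s2 < e1

print(timings_overlap("08:15", "11:30", "10:00", "12:00"))  # True
print(timings_overlap("08:15", "11:30", "12:00", "15:15"))  # False
print(timings_overlap("08:15", "11:30", "11:30", "15:15"))  # False (back-to-back)
```

Treating the intervals as half-open means back-to-back sessions are not reported as conflicts. Rows with missing times would need to be filtered out before calling the helper.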