# **SMU Course Scraping Using Selenium**

<div style="background-color:#FFD700; padding:15px; border-radius:5px; border: 2px solid #FF4500;">
    
  <h1 style="color:#8B0000;">⚠️🚨 SCRAPE THIS DATA AT YOUR OWN RISK 🚨⚠️</h1>
  
  <p><strong>📌 If you need the data, please contact me directly.</strong> Only available for **existing students**.</p>

  <h3>🔗 📩 How to Get the Data?</h3>
  <p>📨 <strong>Reach out to me for access</strong> instead of scraping manually.</p>

</div>

<br>

<div style="background-color:#FFF8DC; padding:12px; border-radius:5px; border: 1px solid #DAA520;">
    
  <h2 style="color:#8B8000;">✨ Looking for the Latest Model? Consider V4! ✨</h2>
  <p>👉 <a href="V4_example_prediction.ipynb"><strong>Check out V4 Here</strong></a></p>

</div>

### **Objective**
This script is designed to scrape SMU course details from the BOSS system using Selenium. The process involves:
1. Logging into the system manually to bypass authentication.
2. Iteratively scraping class details for specified academic years and terms.
3. Writing the scraped data to structured CSV files.

### **Script Structure**
1. **Setup**: Import libraries and initialize Selenium WebDriver.
2. **Login**: Wait for manual login and authentication.
3. **Scraping Logic**:
    - `scrape_class_details`: Scrapes course details for a specific class number, academic year, and term.
    - `main`: Manages the scraping process for multiple academic years and terms.
4. **Execution**: Log in and start scraping.


---

## **1. Setup**

In [6]:
import os
os.environ['PGGSSENCMODE'] = 'disable'

import re
import csv
import time
import pandas as pd
import random
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from webdriver_manager.chrome import ChromeDriverManager
from pathlib import Path
import uuid
import logging
import psycopg2
from typing import List, Optional, Tuple
from collections import Counter, defaultdict
from dotenv import load_dotenv
import webbrowser


## **2. Scrape all BOSS data**

### **BOSS Class Scraper Summary**

#### **What This Code Does**
The `BOSSClassScraper` class automates the extraction of class timing data from SMU's BOSS (Banner Online Self-Service) system with intelligent resume capabilities. It systematically scrapes class details across multiple academic terms and saves them as HTML files for further processing.

**Key Features:**
- **Automated Web Scraping**: Navigates through BOSS class detail pages using Selenium WebDriver
- **Resume Capability**: Automatically detects existing scraped files and continues from the last scraped class number, preventing duplicate work
- **Flexible Term Range**: Dynamically derives academic years from input parameters (e.g., '2025-26_T1' to '2028-29_T2') rather than hardcoded lists
- **Smart Pagination**: Scans class numbers from 1000-5000 with intelligent termination after 300 consecutive empty records
- **Progress Tracking**: Monitors existing files and resumes scraping from the highest class number found for each term
- **Data Organization**: Saves HTML files in structured directories by academic term (`script_input/classTimingsFull/`)
- **Incremental CSV Updates**: Appends only new valid files to the existing CSV index, avoiding duplicates

#### **What Is Required**

**Technical Dependencies:**
- Python packages: `selenium`, `webdriver-manager`, standard libraries (`os`, `time`, `csv`, `re`)
- Chrome browser and ChromeDriver (auto-managed)
- Network access to SMU's BOSS system

**User Requirements:**
- **Manual Authentication**: User must manually log in and complete Microsoft Authenticator process when prompted
- **SMU Credentials**: Valid access to BOSS system
- **Directory Structure**: Code creates `script_input/classTimingsFull/` for HTML files and `script_input/scraped_filepaths.csv` for the file index

**Resume Functionality:**
- **Interruption Handling**: If scraping stops halfway due to network issues or manual interruption, the next run automatically resumes from the exact point it left off
- **Duplicate Prevention**: Existing files are automatically detected and skipped, preventing re-downloading of already scraped data
- **Natural Termination**: Uses 300 consecutive empty records threshold to handle BOSS system inconsistencies without hardcoded limits

**Usage in Jupyter Notebook:**
```python
scraper = BOSSClassScraper()
# Will automatically resume from previous progress if files exist
success = scraper.run_full_scraping_process('2025-26_T1', '2025-26_T3B')
```

In [2]:
class BOSSClassScraper:
    """
    A class to scrape class details from BOSS (SMU's online class registration system)
    and save them as HTML files for further processing with resume capability.
    """
    
    def __init__(self):
        """
        Initialize the BOSS Class Scraper with configuration parameters.
        """
        self.term_code_map = {'T1': '10', 'T2': '20', 'T3A': '31', 'T3B': '32'}
        self.all_terms = ['T1', 'T2', 'T3A', 'T3B']
        self.driver = None
        self.min_class_number = 1000
        self.max_class_number = 5000
        self.consecutive_empty_threshold = 300
        
    def _derive_academic_years(self, start_ay_term, end_ay_term):
        """
        Derive academic years from start and end terms.
        
        Args:
            start_ay_term: Starting term (e.g., '2025-26_T1')
            end_ay_term: Ending term (e.g., '2028-29_T2')
            
        Returns:
            List of academic years in format ['2025-26', '2026-27', ...]
        """
        start_year = int(start_ay_term[:4])
        end_year = int(end_ay_term[:4])
        
        academic_years = []
        for year in range(start_year, end_year + 1):
            next_year = (year + 1) % 100
            ay = f"{year}-{next_year:02d}"
            academic_years.append(ay)
            
        return academic_years
    
    def _get_existing_files_progress(self, base_dir):
        """
        Check existing files and determine the last scraped position for each term.
        
        Args:
            base_dir: Base directory where HTML files are stored
            
        Returns:
            Dictionary with term as key and last scraped class number as value
        """
        progress = {}
        
        if not os.path.exists(base_dir):
            return progress
            
        for term_folder in os.listdir(base_dir):
            term_path = os.path.join(base_dir, term_folder)
            if os.path.isdir(term_path):
                max_class_num = 0
                
                for filename in os.listdir(term_path):
                    if filename.endswith('.html'):
                        # Extract class number from filename
                        # Format: SelectedAcadTerm=XXYY&SelectedClassNumber=ZZZZ.html
                        match = re.search(r'SelectedClassNumber=(\d+)\.html', filename)
                        if match:
                            class_num = int(match.group(1))
                            max_class_num = max(max_class_num, class_num)
                
                if max_class_num > 0:
                    progress[term_folder] = max_class_num
                    print(f"Found existing progress for {term_folder}: last class number {max_class_num}")
        
        return progress
    
    def wait_for_manual_login(self):
        """
        Wait for manual login and Microsoft Authenticator process completion.
        """
        print("Please log in manually and complete the Microsoft Authenticator process.")
        print("Waiting for BOSS dashboard to load...")
        
        wait = WebDriverWait(self.driver, 120)
        
        try:
            wait.until(EC.presence_of_element_located((By.ID, "Label_UserName")))
            wait.until(EC.presence_of_element_located((By.XPATH, "//a[contains(text(),'Sign out')]")))
            
            username = self.driver.find_element(By.ID, "Label_UserName").text
            print(f"Login successful! Logged in as {username}")
            
        except TimeoutException:
            print("Login failed or timed out. Could not detect login elements.")
            raise Exception("Login failed")
        
        time.sleep(1)
    
    def scrape_and_save_html(self, start_ay_term='2025-26_T1', end_ay_term='2025-26_T1', base_dir='script_input/classTimingsFull'):
        """
        Scrapes class details from BOSS and saves them as HTML files with resume capability.
        
        Args:
            start_ay_term: Starting academic year and term (e.g., '2025-26_T1')
            end_ay_term: Ending academic year and term (e.g., '2025-26_T1')
            base_dir: Base directory to save the HTML files
        """
        # Check existing progress
        existing_progress = self._get_existing_files_progress(base_dir)
        
        # Derive academic years from input terms
        all_academic_years = self._derive_academic_years(start_ay_term, end_ay_term)
        
        # Generate all possible AY_TERM combinations
        all_ay_terms = []
        for ay in all_academic_years:
            for term in self.all_terms:
                all_ay_terms.append(f"{ay}_{term}")
        
        # Find the indices of the start and end terms
        try:
            start_idx = all_ay_terms.index(start_ay_term)
            end_idx = all_ay_terms.index(end_ay_term)
        except ValueError:
            print("Invalid start or end term provided. Using full range.")
            start_idx = 0
            end_idx = len(all_ay_terms) - 1
        
        # Select the range to scrape
        ay_terms_to_scrape = all_ay_terms[start_idx:end_idx+1]
        
        # Create base directory if needed
        os.makedirs(base_dir, exist_ok=True)
        
        # Process each AY_TERM
        for ay_term in ay_terms_to_scrape:
            print(f"Processing {ay_term}...")
            
            # Parse AY_TERM for URL
            ay, term = ay_term.split('_')
            ay_short = ay[2:4]  # last two digits of first year
            term_code = self.term_code_map.get(term, '10')
            
            # Create folder for AY_TERM
            folder_path = os.path.join(base_dir, ay_term)
            os.makedirs(folder_path, exist_ok=True)
            
            # Determine starting class number based on existing progress
            start_class_num = self.min_class_number
            if ay_term in existing_progress:
                start_class_num = existing_progress[ay_term] + 1
                print(f"Resuming {ay_term} from class number {start_class_num}")
            
            consecutive_empty = 0
            
            # Scrape each class number in range
            for class_num in range(start_class_num, self.max_class_number + 1):
                # Check if file already exists
                filename = f"SelectedAcadTerm={ay_short}{term_code}&SelectedClassNumber={class_num:04}.html"
                filepath = os.path.join(folder_path, filename)
                
                if os.path.exists(filepath):
                    print(f"File already exists: {filepath}, skipping...")
                    consecutive_empty = 0  # Reset counter since we have data
                    continue
                
                url = f"https://boss.intranet.smu.edu.sg/ClassDetails.aspx?SelectedClassNumber={class_num:04}&SelectedAcadTerm={ay_short}{term_code}&SelectedAcadCareer=UGRD"
                
                try:
                    self.driver.get(url)
                    
                    wait = WebDriverWait(self.driver, 15)
                    try:
                        element = wait.until(EC.any_of(
                            EC.presence_of_element_located((By.ID, "lblClassInfoHeader")),
                            EC.presence_of_element_located((By.ID, "lblErrorDetails"))
                        ))
                        
                        error_elements = self.driver.find_elements(By.ID, "lblErrorDetails")
                        has_data = True
                        
                        for error in error_elements:
                            if "No record found" in error.text:
                                has_data = False
                                break
                                
                    except Exception as e:
                        print(f"Wait error: {e}")
                        has_data = False
                    
                    if not has_data:
                        consecutive_empty += 1
                        print(f"No record found for {ay_term}, class {class_num:04}. Consecutive empty: {consecutive_empty}")
                        
                        if consecutive_empty >= self.consecutive_empty_threshold:
                            print(f"{self.consecutive_empty_threshold} consecutive empty records reached for {ay_term}, moving on.")
                            break
                        
                        time.sleep(2)
                        continue
                    
                    # Reset consecutive empty counter if data found
                    consecutive_empty = 0
                    
                    # Save HTML file
                    with open(filepath, 'w', encoding='utf-8') as f:
                        f.write(self.driver.page_source)
                    
                    print(f"Saved {filepath}")
                    time.sleep(2)
                    
                except Exception as e:
                    print(f"Error processing {url}: {str(e)}")
                    time.sleep(5)
        
        print("Scraping completed.")
    
    def generate_scraped_filepaths_csv(self, base_dir='script_input/classTimingsFull', output_csv='script_input/scraped_filepaths.csv'):
        """
        Generates a CSV file with paths to all valid HTML files (those without "No record found").
        Updates existing CSV by appending new valid files.
        
        Args:
            base_dir: Base directory where HTML files are stored
            output_csv: Path to the output CSV file
            
        Returns:
            Path to the generated CSV file or None if error
        """
        # Read existing filepaths if CSV exists
        existing_filepaths = set()
        if os.path.exists(output_csv):
            try:
                with open(output_csv, 'r', encoding='utf-8') as csvfile:
                    reader = csv.reader(csvfile)
                    next(reader)  # Skip header
                    for row in reader:
                        if row:
                            existing_filepaths.add(row[0])
                print(f"Found {len(existing_filepaths)} existing filepaths in CSV")
            except Exception as e:
                print(f"Error reading existing CSV: {str(e)}")
        
        filepaths = []
        
        if not os.path.exists(base_dir):
            print(f"Directory '{base_dir}' does not exist.")
            return None
        
        # Ensure output directory exists
        os.makedirs(os.path.dirname(output_csv), exist_ok=True)
        
        # Walk through directory structure
        for root, dirs, files in os.walk(base_dir):
            for file in files:
                if file.endswith('.html'):
                    filepath = os.path.join(root, file)
                    
                    # Skip if already in existing filepaths
                    if filepath in existing_filepaths:
                        continue
                        
                    try:
                        with open(filepath, 'r', encoding='utf-8') as f:
                            content = f.read()
                            if 'No record found' not in content:
                                filepaths.append(filepath)
                    except Exception as e:
                        print(f"Error reading file {filepath}: {str(e)}")
        
        # Append new filepaths to CSV
        mode = 'a' if existing_filepaths else 'w'
        with open(output_csv, mode, newline='', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile)
            if not existing_filepaths:  # Write header only if new file
                writer.writerow(['Filepath'])
            for path in filepaths:
                writer.writerow([path])
        
        total_valid_files = len(existing_filepaths) + len(filepaths)
        print(f"CSV updated with {len(filepaths)} new valid file paths. Total: {total_valid_files} files at {output_csv}")
        return output_csv
    
    def run_full_scraping_process(self, start_ay_term='2025-26_T1', end_ay_term='2025-26_T1'):
        """
        Run the complete scraping process from login to CSV generation with resume capability.
        
        Args:
            start_ay_term: Starting academic year and term
            end_ay_term: Ending academic year and term
            
        Returns:
            True if successful, False otherwise
        """
        try:
            # Set up WebDriver
            options = webdriver.ChromeOptions()
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=options)
            
            # Navigate to login page and wait for manual login
            self.driver.get("https://boss.intranet.smu.edu.sg/")
            self.wait_for_manual_login()
            
            # Run the main scraping function
            self.scrape_and_save_html(start_ay_term, end_ay_term)
            
            # Generate CSV with valid file paths
            self.generate_scraped_filepaths_csv()
            
            return True
            
        except Exception as e:
            print(f"Error during scraping process: {str(e)}")
            return False
            
        finally:
            if self.driver:
                self.driver.quit()
                self.driver = None
            print("Process completed!")

In [31]:
# Run the scraper
scraper = BOSSClassScraper()
success = scraper.run_full_scraping_process('2024-25_T3A', '2025-26_T1')

2025-06-06 17:13:30,049 - INFO - Get LATEST chromedriver version for google-chrome
2025-06-06 17:13:30,290 - INFO - Get LATEST chromedriver version for google-chrome
2025-06-06 17:13:30,530 - INFO - Driver [C:\Users\tanzh\.wdm\drivers\chromedriver\win64\137.0.7151.68\chromedriver.exe] found in cache


Please log in manually and complete the Microsoft Authenticator process.
Waiting for BOSS dashboard to load...
Login successful! Logged in as Welcome, TAN ZHONG YAN
Found existing progress for 2021-22_T1: last class number 2889
Found existing progress for 2021-22_T2: last class number 2957
Found existing progress for 2021-22_T3A: last class number 1038
Found existing progress for 2021-22_T3B: last class number 1033
Found existing progress for 2022-23_T1: last class number 2954
Found existing progress for 2022-23_T2: last class number 2920
Found existing progress for 2022-23_T3A: last class number 1031
Found existing progress for 2022-23_T3B: last class number 1027
Found existing progress for 2023-24_T1: last class number 2982
Found existing progress for 2023-24_T2: last class number 2964
Found existing progress for 2023-24_T3A: last class number 1028
Found existing progress for 2023-24_T3B: last class number 1033
Found existing progress for 2024-25_T1: last class number 2945
Found exis

KeyboardInterrupt: 


---

## **3. Extract Data from HTML Files**

### **HTML Data Extractor Summary**

#### **What This Code Does**
The `HTMLDataExtractor` class processes previously scraped HTML files from SMU's BOSS system and extracts structured data into Excel format. It systematically parses course information, class timings, academic terms, and exam schedules from local HTML files without requiring network access or authentication.

**Key Features:**
- **Local File Processing**: Uses Selenium WebDriver to parse local HTML files without network connectivity requirements
- **Comprehensive Data Extraction**: Extracts course details, academic terms, class timings, exam schedules, grading information, and professor names
- **Test-First Approach**: Includes `run_test()` function to validate extraction logic on a small sample before processing all files
- **Structured Output**: Organizes extracted data into two Excel sheets - standalone records (one per HTML file) and multiple records (class/exam timings)
- **Error Tracking**: Captures and logs parsing errors in a separate sheet for debugging and quality assurance
- **Flexible Data Parsing**: Handles multiple academic term naming conventions and date formats used across different years
- **Record Linking**: Uses record keys to maintain relationships between standalone and multiple data records

#### **What Is Required**

**Technical Dependencies:**
- Python packages: `selenium`, `webdriver-manager`, `pandas`, `openpyxl`, standard libraries (`os`, `re`, `datetime`, `pathlib`)
- Chrome browser and ChromeDriver (auto-managed)
- No network access required (processes local files only)

**Input Requirements:**
- **Scraped HTML Files**: Previously downloaded HTML files from BOSS system stored locally
- **File Path CSV**: `script_input/scraped_filepaths.csv` containing paths to valid HTML files
- **Directory Structure**: HTML files organized in the expected folder structure (typically `script_input/classTimingsFull/`)

**Output Structure:**
- **Excel File**: `script_input/raw_data.xlsx` (or custom path) with multiple sheets:
  - `standalone`: One record per HTML file with course and class information
  - `multiple`: Multiple records for class timings and exam schedules
  - `errors`: Parsing errors and problematic files for debugging

**Data Extraction Capabilities:**
- **Course Information**: Course codes, names, descriptions, credit units, course areas, enrollment requirements
- **Academic Terms**: Term IDs, academic years, start/end dates, BOSS IDs
- **Class Details**: Sections, grading basis, course outline URLs, professor names
- **Timing Data**: Class schedules, exam dates, venues, day-of-week information
- **Cross-References**: Maintains linking keys between related records across sheets

**Usage in Jupyter Notebook:**
```python
# Initialize extractor
extractor = HTMLDataExtractor()

# Test with sample files first (recommended)
test_success = extractor.run_test(test_count=10)

if test_success:
    # Run full extraction
    extractor.run()
    
# Or run directly without testing
extractor.run(
    scraped_filepaths_csv='script_input/scraped_filepaths.csv',
    output_path='script_input/raw_data.xlsx'
)
```

The class provides a crucial intermediate step between raw HTML scraping and database insertion, creating clean, structured data that can be further processed for database integration or analysis.

In [3]:
class HTMLDataExtractor:
    """
    Extract raw data from scraped HTML files and save to Excel format using Selenium
    """
    
    def __init__(self):
        self.standalone_data = []
        self.multiple_data = []
        self.errors = []
        self.driver = None
        
    def setup_selenium_driver(self):
        """Set up Selenium WebDriver for local file access"""
        try:
            options = Options()
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            options.add_argument('--headless')  # Run in headless mode for efficiency
            options.add_argument('--disable-gpu')
            
            service = Service(ChromeDriverManager().install())
            self.driver = webdriver.Chrome(service=service, options=options)
            print("Selenium WebDriver initialized successfully")
        except Exception as e:
            print(f"Failed to initialize Selenium WebDriver: {e}")
            raise
    
    def safe_find_element_text(self, by, value):
        """Safely find element and return its text"""
        try:
            element = self.driver.find_element(by, value)
            return element.text.strip() if element else None
        except Exception:
            return None
    
    def safe_find_element_attribute(self, by, value, attribute):
        """Safely find element and return its attribute"""
        try:
            element = self.driver.find_element(by, value)
            return element.get_attribute(attribute) if element else None
        except Exception:
            return None
    
    def convert_date_to_timestamp(self, date_str):
        """Convert DD-Mmm-YYYY to database timestamp format"""
        try:
            date_obj = datetime.strptime(date_str, '%d-%b-%Y')
            return date_obj.strftime('%Y-%m-%d 00:00:00.000 +0800')
        except Exception as e:
            return None
    
    def parse_acad_term(self, term_text):
        """Parse academic term text and return structured data"""
        try:
            # Pattern like "2021-22 Term 2" or "2021-22 Session 1"
            pattern = r'(\d{4})-(\d{2})\s+(.*)'
            match = re.search(pattern, term_text)
            
            if not match:
                return None, None, None, None
            
            start_year = int(match.group(1))
            end_year_short = int(match.group(2))
            term_desc = match.group(3).lower()
            
            # Convert 2-digit year to 4-digit
            if end_year_short < 50:
                end_year = 2000 + end_year_short
            else:
                end_year = 1900 + end_year_short
            
            # Determine term code
            if 'term 1' in term_desc or 'session 1' in term_desc or 'august term' in term_desc:
                term_code = 'T1'
            elif 'term 2' in term_desc or 'session 2' in term_desc or 'january term' in term_desc:
                term_code = 'T2'
            elif 'term 3a' in term_desc:
                term_code = 'T3A'
            elif 'term 3b' in term_desc:
                term_code = 'T3B'
            else:
                return start_year, end_year, None, None
            
            acad_term_id = f"AY{start_year}{end_year_short:02d}{term_code}"
            
            return start_year, end_year, term_code, acad_term_id
        except Exception as e:
            return None, None, None, None
    
    def parse_course_and_section(self, header_text):
        """Parse course code and section from header text"""
        try:
            # Clean the text first
            clean_text = re.sub(r'<[^>]+>', '', header_text)
            clean_text = re.sub(r'\s+', ' ', clean_text.strip())
            
            # Try multiple regex patterns
            patterns = [
                r'([A-Z0-9_-]+)\s+-\s+(.+)',  # Standard format
                r'([A-Z]+)\s+(\d+[A-Z0-9_]*)\s+-\s+(.+)',  # Split format
                r'([A-Z0-9_\s-]+?)\s*[-–—]\s*(.+)'  # Fallback
            ]
            
            for i, pattern in enumerate(patterns):
                match = re.match(pattern, clean_text, re.IGNORECASE)
                if match:
                    if i == 1:  # Split format
                        course_code = match.group(1) + match.group(2)
                        section = match.group(3)
                    else:
                        course_code = match.group(1)
                        section = match.group(2)
                    
                    course_code = re.sub(r'\s+', '', course_code.upper())
                    section = section.strip()
                    
                    return course_code, section
            
            return None, None
            
        except Exception as e:
            return None, None
    
    def parse_date_range(self, date_text):
        """Parse date range text and return start and end timestamps"""
        try:
            # Example: "10-Jan-2022 to 01-May-2022"
            pattern = r'(\d{1,2}-\w{3}-\d{4})\s+to\s+(\d{1,2}-\w{3}-\d{4})'
            match = re.search(pattern, date_text)
            
            if not match:
                return None, None
            
            start_date = self.convert_date_to_timestamp(match.group(1))
            end_date = self.convert_date_to_timestamp(match.group(2))
            
            return start_date, end_date
        except Exception as e:
            return None, None
    
    def extract_course_areas_list(self):
        """Extract course areas as comma-separated string using Selenium"""
        try:
            course_areas_element = self.driver.find_element(By.ID, 'lblCourseAreas')
            course_areas_html = course_areas_element.get_attribute('innerHTML')
            
            # Extract list items
            areas_list = re.findall(r'<li>(.*?)</li>', course_areas_html)
            if areas_list:
                return ', '.join(areas_list)
            else:
                # Fallback to text content
                return course_areas_element.text.strip()
        except Exception:
            return None
    
    def extract_course_outline_url(self):
        """Extract course outline URL from HTML using Selenium"""
        try:
            onclick_attr = self.safe_find_element_attribute(By.ID, 'imgCourseOutline', 'onclick')
            if onclick_attr:
                url_match = re.search(r"window\.open\('([^']+)'", onclick_attr)
                if url_match:
                    return url_match.group(1)
        except Exception:
            pass
        return None
    
    def extract_boss_ids_from_filepath(self, filepath):
        """Extract BOSS IDs from filepath"""
        try:
            filename = os.path.basename(filepath)
            acad_term_match = re.search(r'SelectedAcadTerm=(\d+)', filename)
            class_match = re.search(r'SelectedClassNumber=(\d+)', filename)
            
            acad_term_boss_id = int(acad_term_match.group(1)) if acad_term_match else None
            class_boss_id = int(class_match.group(1)) if class_match else None
            
            return acad_term_boss_id, class_boss_id
        except Exception:
            return None, None
    
    def extract_meeting_information(self, record_key):
        """Extract class timing and exam timing information using Selenium"""
        try:
            meeting_table = self.driver.find_element(By.ID, 'RadGrid_MeetingInfo_ctl00')
            tbody = meeting_table.find_element(By.TAG_NAME, 'tbody')
            rows = tbody.find_elements(By.TAG_NAME, 'tr')
            
            for row in rows:
                cells = row.find_elements(By.TAG_NAME, 'td')
                if len(cells) < 7:
                    continue
                
                meeting_type = cells[0].text.strip()
                start_date_text = cells[1].text.strip()
                end_date_text = cells[2].text.strip()
                day_of_week = cells[3].text.strip()
                start_time = cells[4].text.strip()
                end_time = cells[5].text.strip()
                venue = cells[6].text.strip() if len(cells) > 6 else ""
                professor_name = cells[7].text.strip() if len(cells) > 7 else ""
                
                # Assume CLASS if meeting_type is empty
                if not meeting_type:
                    meeting_type = 'CLASS'
                
                if meeting_type == 'CLASS':
                    # Convert dates to timestamp format
                    start_date = self.convert_date_to_timestamp(start_date_text)
                    end_date = self.convert_date_to_timestamp(end_date_text)
                    
                    timing_record = {
                        'record_key': record_key,
                        'type': 'CLASS',
                        'start_date': start_date,
                        'end_date': end_date,
                        'day_of_week': day_of_week,
                        'start_time': start_time,
                        'end_time': end_time,
                        'venue': venue,
                        'professor_name': professor_name
                    }
                    self.multiple_data.append(timing_record)
                
                elif meeting_type == 'EXAM':
                    # For exams, use the second date (end_date_text) as the exam date
                    exam_date = self.convert_date_to_timestamp(end_date_text)
                    
                    exam_record = {
                        'record_key': record_key,
                        'type': 'EXAM',
                        'date': exam_date,
                        'day_of_week': day_of_week,
                        'start_time': start_time,
                        'end_time': end_time,
                        'venue': venue,
                        'professor_name': professor_name
                    }
                    self.multiple_data.append(exam_record)
        
        except Exception as e:
            self.errors.append({
                'record_key': record_key,
                'error': f'Error extracting meeting information: {str(e)}',
                'type': 'parse_error'
            })
    
    def process_html_file(self, filepath):
        """Process a single HTML file and extract all data using Selenium"""
        try:
            # Load HTML file
            html_file = Path(filepath).resolve()
            file_url = html_file.as_uri()
            self.driver.get(file_url)
            
            # Create unique record key
            record_key = f"{os.path.basename(filepath)}"
            
            # Extract basic information
            class_header_text = self.safe_find_element_text(By.ID, 'lblClassInfoHeader')
            if not class_header_text:
                self.errors.append({
                    'filepath': filepath,
                    'error': 'Missing class header',
                    'type': 'parse_error'
                })
                return False
            
            course_code, section = self.parse_course_and_section(class_header_text)
            
            # Extract academic term
            term_text = self.safe_find_element_text(By.ID, 'lblClassInfoSubHeader')
            acad_year_start, acad_year_end, term, acad_term_id = self.parse_acad_term(term_text) if term_text else (None, None, None, None)
            
            # Extract course information
            course_name = self.safe_find_element_text(By.ID, 'lblClassSection')
            course_description = self.safe_find_element_text(By.ID, 'lblCourseDescription')
            credit_units_text = self.safe_find_element_text(By.ID, 'lblUnits')
            course_areas = self.extract_course_areas_list()
            enrolment_requirements = self.safe_find_element_text(By.ID, 'lblEnrolmentRequirements')
            
            # Process credit units
            try:
                credit_units = float(credit_units_text) if credit_units_text else None
            except (ValueError, TypeError):
                credit_units = None
            
            # Extract grading basis
            grading_text = self.safe_find_element_text(By.ID, 'lblGradingBasis')
            grading_basis = None
            if grading_text:
                if grading_text.lower() == 'graded':
                    grading_basis = 'Graded'
                elif grading_text.lower() in ['pass/fail', 'pass fail']:
                    grading_basis = 'Pass/Fail'
                else:
                    grading_basis = 'NA'
            
            # Extract course outline URL
            course_outline_url = self.extract_course_outline_url()
            
            # Extract dates
            period_text = self.safe_find_element_text(By.ID, 'lblDates')
            start_dt, end_dt = self.parse_date_range(period_text) if period_text else (None, None)
            
            # Extract BOSS IDs
            acad_term_boss_id, class_boss_id = self.extract_boss_ids_from_filepath(filepath)
            
            # Create standalone record
            standalone_record = {
                'record_key': record_key,
                'filepath': filepath,
                'course_code': course_code,
                'section': section,
                'course_name': course_name,
                'course_description': course_description,
                'credit_units': credit_units,
                'course_area': course_areas,
                'enrolment_requirements': enrolment_requirements,
                'acad_term_id': acad_term_id,
                'acad_year_start': acad_year_start,
                'acad_year_end': acad_year_end,
                'term': term,
                'start_dt': start_dt,
                'end_dt': end_dt,
                'grading_basis': grading_basis,
                'course_outline_url': course_outline_url,
                'acad_term_boss_id': acad_term_boss_id,
                'class_boss_id': class_boss_id,
                'term_text': term_text,
                'period_text': period_text
            }
            
            self.standalone_data.append(standalone_record)
            
            # Extract meeting information
            self.extract_meeting_information(record_key)
            
            return True
            
        except Exception as e:
            self.errors.append({
                'filepath': filepath,
                'error': str(e),
                'type': 'processing_error'
            })
            return False
    
    def run_test(self, scraped_filepaths_csv='script_input/scraped_filepaths.csv', test_count=10):
        """Randomly test the extraction on a subset of files"""
        try:
            print(f"Starting test run with {test_count} randomly selected files...")

            # Reset data containers
            self.standalone_data = []
            self.multiple_data = []
            self.errors = []

            # Set up Selenium driver
            self.setup_selenium_driver()

            # Read the CSV file with file paths
            df = pd.read_csv(scraped_filepaths_csv)

            # Handle both 'Filepath' and 'filepath' column names
            filepath_column = 'Filepath' if 'Filepath' in df.columns else 'filepath'
            all_filepaths = df[filepath_column].dropna().tolist()

            if len(all_filepaths) == 0:
                raise ValueError("No valid filepaths found in CSV")

            # Randomly sample filepaths
            sample_size = min(test_count, len(all_filepaths))
            sampled_filepaths = random.sample(all_filepaths, sample_size)

            processed_files = 0
            successful_files = 0

            for i, filepath in enumerate(sampled_filepaths, start=1):
                if os.path.exists(filepath):
                    print(f"Processing test file {i}/{sample_size}: {os.path.basename(filepath)}")
                    if self.process_html_file(filepath):
                        successful_files += 1
                    processed_files += 1
                else:
                    self.errors.append({
                        'filepath': filepath,
                        'error': 'File not found',
                        'type': 'file_error'
                    })

            print(f"\nTest run complete: {successful_files}/{processed_files} files successful")
            print(f"Standalone records extracted: {len(self.standalone_data)}")
            print(f"Multiple records extracted: {len(self.multiple_data)}")
            if self.errors:
                print(f"Errors encountered: {len(self.errors)}")
                for error in self.errors[:3]:  # Show only the first 3 errors
                    print(f"  - {error['type']}: {error['error']}")

            # Save test results
            test_output_path = 'script_input/test_raw_data.xlsx'
            self.save_to_excel(test_output_path)

            return successful_files > 0

        except Exception as e:
            print(f"Error in test run: {e}")
            return False

        finally:
            if self.driver:
                self.driver.quit()
                print("Test selenium driver closed")
    
    def process_all_files(self, scraped_filepaths_csv='script_input/scraped_filepaths.csv'):
        """Process all files listed in the scraped filepaths CSV"""
        try:
            # Read the CSV file with file paths
            df = pd.read_csv(scraped_filepaths_csv)
            
            # Handle both 'Filepath' and 'filepath' column names
            filepath_column = 'Filepath' if 'Filepath' in df.columns else 'filepath'
            
            total_files = len(df)
            processed_files = 0
            successful_files = 0
            
            print(f"Starting to process {total_files} files")
            
            for index, row in df.iterrows():
                filepath = row[filepath_column]
                
                if os.path.exists(filepath):
                    if self.process_html_file(filepath):
                        successful_files += 1
                    processed_files += 1
                    
                    if processed_files % 100 == 0:
                        print(f"Processed {processed_files}/{total_files} files")
                else:
                    self.errors.append({
                        'filepath': filepath,
                        'error': 'File not found',
                        'type': 'file_error'
                    })
            
            print(f"Processing complete: {successful_files}/{processed_files} files successful")
            
        except Exception as e:
            print(f"Error in process_all_files: {e}")
            raise
    
    def save_to_excel(self, output_path='script_input/raw_data.xlsx'):
        """Save extracted data to Excel file with two sheets"""
        try:
            # Ensure output directory exists
            os.makedirs(os.path.dirname(output_path), exist_ok=True)
            
            # Create DataFrames
            standalone_df = pd.DataFrame(self.standalone_data)
            multiple_df = pd.DataFrame(self.multiple_data)
            
            # Save to Excel with multiple sheets
            with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
                standalone_df.to_excel(writer, sheet_name='standalone', index=False)
                multiple_df.to_excel(writer, sheet_name='multiple', index=False)
                
                # Also save errors if any
                if self.errors:
                    errors_df = pd.DataFrame(self.errors)
                    errors_df.to_excel(writer, sheet_name='errors', index=False)
            
            print(f"Data saved to {output_path}")
            print(f"Standalone records: {len(self.standalone_data)}")
            print(f"Multiple records: {len(self.multiple_data)}")
            if self.errors:
                print(f"Errors: {len(self.errors)}")
            
        except Exception as e:
            print(f"Error saving to Excel: {e}")
            raise
    
    def run(self, scraped_filepaths_csv='script_input/scraped_filepaths.csv', output_path='script_input/raw_data.xlsx'):
        """Run the complete extraction process"""
        print("Starting HTML data extraction...")
        
        # Reset data containers
        self.standalone_data = []
        self.multiple_data = []
        self.errors = []
        
        # Set up Selenium driver
        self.setup_selenium_driver()
        
        try:
            # Process all files
            self.process_all_files(scraped_filepaths_csv)
            
            # Save to Excel
            self.save_to_excel(output_path)
            
            print("HTML data extraction completed!")
            
        finally:
            if self.driver:
                self.driver.quit()
                print("Selenium driver closed")

In [41]:
# Example usage
extractor = HTMLDataExtractor()

# Run the extraction process
extractor.run(scraped_filepaths_csv='script_input/scraped_filepaths.csv', output_path='script_input/raw_data.xlsx')



Starting HTML data extraction...


2025-06-06 18:08:38,277 - INFO - Get LATEST chromedriver version for google-chrome
2025-06-06 18:08:38,298 - INFO - Get LATEST chromedriver version for google-chrome
2025-06-06 18:08:38,323 - INFO - Driver [C:\Users\tanzh\.wdm\drivers\chromedriver\win64\137.0.7151.68\chromedriver.exe] found in cache


Selenium WebDriver initialized successfully
Starting to process 12976 files
Processed 100/12976 files
Processed 200/12976 files
Processed 300/12976 files
Processed 400/12976 files
Processed 500/12976 files
Processed 600/12976 files
Processed 700/12976 files
Processed 800/12976 files
Processed 900/12976 files
Processed 1000/12976 files
Processed 1100/12976 files
Processed 1200/12976 files
Processed 1300/12976 files
Processed 1400/12976 files
Processed 1500/12976 files
Processed 1600/12976 files
Processed 1700/12976 files
Processed 1800/12976 files
Processed 1900/12976 files
Processed 2000/12976 files
Processed 2100/12976 files
Processed 2200/12976 files
Processed 2300/12976 files
Processed 2400/12976 files
Processed 2500/12976 files
Processed 2600/12976 files
Processed 2700/12976 files
Processed 2800/12976 files
Processed 2900/12976 files
Processed 3000/12976 files
Processed 3100/12976 files
Processed 3200/12976 files
Processed 3300/12976 files
Processed 3400/12976 files
Processed 3500/


---

## **4. Process Raw Data into Database Tables**

### **TableBuilder Summary**

#### **What This Code Does**
The `TableBuilder` class processes structured data from the HTML extractor and transforms it into database-ready CSV files for SMU's class management system. It handles complex data relationships, professor name normalization, duplicate detection, and creates all necessary tables for courses, classes, professors, and timing schedules while maintaining referential integrity.

**Key Features:**
- **Two-Phase Processing**: Separates professor/course creation from class/timing processing to allow manual review and correction
- **Intelligent Professor Matching**: Advanced name normalization and substring matching to prevent duplicate professor creation
- **Comprehensive Data Pipeline**: Processes professors, courses, academic terms, classes, class timings, and exam schedules
- **Database Cache Integration**: Loads existing data from PostgreSQL to avoid duplicates and maintain consistency
- **Manual Review Workflow**: Outputs verification files for human review before final processing
- **Asian Name Handling**: Specialized normalization for Asian, Western, and mixed naming conventions common in Singapore
- **Faculty Assignment Interface**: Interactive web-based system for assigning courses to appropriate faculties
- **Error Recovery**: Robust handling of malformed data and missing information

#### **What Is Required**

**Technical Dependencies:**
- Python packages: `pandas`, `psycopg2`, `openpyxl`, `uuid`, `webbrowser`, standard libraries
- PostgreSQL database connection (configured via `.env` file)
- Database cache files or live database access for existing data validation

**Input Requirements:**
- **Raw Data Excel**: `script_input/raw_data.xlsx` from HTML extractor (point 3) with `standalone` and `multiple` sheets
- **Database Configuration**: `.env` file with PostgreSQL connection parameters
- **Professor Lookup**: `script_input/professor_lookup.csv` for existing professor mappings

**Output Structure:**
- **Verification Files** (`script_output/verify/`):
  - `new_professors.csv`: New professors requiring manual name review
  - `new_courses.csv`: New courses for validation
- **Database Insert Files** (`script_output/`):
  - `new_classes.csv`, `new_class_timing.csv`, `new_class_exam_timing.csv`
  - `new_acad_term.csv`, `update_courses.csv`
  - `professor_lookup.csv`: Updated professor mapping table

**Data Processing Capabilities:**
- **Professor Normalization**: Converts names to boss format (ALL CAPS) and afterclass format (Title Case)
- **Duplicate Detection**: Substring matching across existing professors, new professors, and cached data
- **Course Management**: Creates new courses and updates existing ones with latest information
- **Academic Term Generation**: Automatically creates term IDs and manages semester data
- **Relationship Mapping**: Maintains foreign key relationships across all generated tables
- **Faculty Assignment**: Interactive workflow for assigning courses to SMU's 8 schools/centers

In [16]:
# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class TableBuilder:
    """Comprehensive table builder for university class management system"""
    
    def __init__(self, input_file: str = 'script_input/raw_data.xlsx'):
        """Initialize TableBuilder with database configuration and caching"""
        self.input_file = input_file
        self.output_base = 'script_output'
        self.verify_dir = os.path.join(self.output_base, 'verify')
        self.cache_dir = 'db_cache'
        
        # Create output directories
        os.makedirs(self.output_base, exist_ok=True)
        os.makedirs(self.verify_dir, exist_ok=True)
        os.makedirs(self.cache_dir, exist_ok=True)
        
        # Load environment variables
        load_dotenv()
        self.db_config = {
            'host': os.getenv('DB_HOST'),
            'database': os.getenv('DB_NAME'),
            'user': os.getenv('DB_USER'),
            'password': os.getenv('DB_PASSWORD'),
            'port': int(os.getenv('DB_PORT', 5432)),
            'gssencmode': 'disable'
        }
        
        # Database connection
        self.connection = None
        
        # Data storage
        self.standalone_data = None
        self.multiple_data = None
        
        # Caches
        self.professors_cache = {}  # name -> professor data
        self.courses_cache = {}     # code -> course data
        self.acad_term_cache = {}   # id -> acad_term data
        self.professor_lookup = {}  # scraped_name -> database mapping
        
        # Output data collectors
        self.new_professors = []
        self.new_courses = []
        self.update_courses = []
        self.new_acad_terms = []
        self.new_classes = []
        self.new_class_timings = []
        self.new_class_exam_timings = []
        
        # Class ID mapping for timing tables
        self.class_id_mapping = {}  # record_key -> class_id
        
        # Courses requiring faculty assignment
        self.courses_needing_faculty = []
        
        # Statistics
        self.stats = {
            'professors_created': 0,
            'courses_created': 0,
            'courses_updated': 0,
            'classes_created': 0,
            'timings_created': 0,
            'exams_created': 0,
            'courses_needing_faculty': 0
        }
        
        # Asian surnames database for name normalization
        self.asian_surnames = {
            'chinese': ['WANG', 'LI', 'ZHANG', 'LIU', 'CHEN', 'YANG', 'HUANG', 'ZHAO', 'WU', 'ZHOU',
                       'XU', 'SUN', 'MA', 'ZHU', 'HU', 'GUO', 'HE', 'LIN', 'GAO', 'LUO'],
            'singaporean': ['TAN', 'LIM', 'LEE', 'NG', 'ONG', 'WONG', 'GOH', 'CHUA', 'CHAN', 'KOH',
                           'TEO', 'AW', 'CHYE', 'YEO', 'SIM', 'CHIA', 'CHONG', 'LAM', 'CHEW', 'TOH'],
            'korean': ['KIM', 'LEE', 'PARK', 'CHOI', 'JUNG', 'KANG', 'CHO', 'YUN', 'JANG', 'LIM'],
            'vietnamese': ['NGUYEN', 'TRAN', 'LE', 'PHAM', 'HOANG', 'PHAN', 'VU', 'DANG', 'BUI'],
            'indian': ['SHARMA', 'SINGH', 'KUMAR', 'GUPTA', 'KOHLI', 'PATEL', 'MAKHIJA']
        }
        self.all_asian_surnames = set()
        for surnames in self.asian_surnames.values():
            self.all_asian_surnames.update(surnames)
        
        # Western given names
        self.western_given_names = {
            'AARON', 'ADAM', 'ADRIAN', 'ALEXANDER', 'AMANDA', 'ANDREW', 'ANTHONY',
            'BENJAMIN', 'CHRISTOPHER', 'DANIEL', 'DAVID', 'EMILY', 'JAMES', 'JENNIFER',
            'JOHN', 'MICHAEL', 'PETER', 'ROBERT', 'SARAH', 'THOMAS', 'WILLIAM'
        }

    def connect_database(self):
        """Connect to PostgreSQL database"""
        try:
            self.connection = psycopg2.connect(**self.db_config)
            logger.info("✅ Database connection established")
            return True
        except Exception as e:
            logger.error(f"❌ Database connection failed: {e}")
            return False

    def load_or_cache_data(self):
        """Load data from cache or database"""
        # Try loading from cache first
        if self._load_from_cache():
            logger.info("✅ Loaded data from cache")
            return True
        
        # Connect to database and download
        if not self.connect_database():
            return False
        
        try:
            self._download_and_cache_data()
            logger.info("✅ Downloaded and cached data from database")
            return True
        except Exception as e:
            logger.error(f"❌ Failed to download data: {e}")
            return False

    def _load_from_cache(self) -> bool:
        """Load cached data from files"""
        try:
            cache_files = {
                'professors': os.path.join(self.cache_dir, 'professors_cache.pkl'),
                'courses': os.path.join(self.cache_dir, 'courses_cache.pkl'),
                'acad_terms': os.path.join(self.cache_dir, 'acad_terms_cache.pkl')
            }
            
            if all(os.path.exists(f) for f in cache_files.values()):
                # Load professors
                professors_df = pd.read_pickle(cache_files['professors'])
                for _, row in professors_df.iterrows():
                    self.professors_cache[row['name'].upper()] = row.to_dict()
                
                # Load courses
                courses_df = pd.read_pickle(cache_files['courses'])
                for _, row in courses_df.iterrows():
                    self.courses_cache[row['code']] = row.to_dict()
                
                # Load acad_terms
                acad_terms_df = pd.read_pickle(cache_files['acad_terms'])
                for _, row in acad_terms_df.iterrows():
                    self.acad_term_cache[row['id']] = row.to_dict()
                
                # Load professor lookup if exists
                lookup_file = 'script_input/professor_lookup.csv'
                if os.path.exists(lookup_file):
                    lookup_df = pd.read_csv(lookup_file)
                    for _, row in lookup_df.iterrows():
                        self.professor_lookup[row['scraped_name']] = {
                            'database_id': row['database_id'],
                            'boss_name': row.get('boss_name', row['scraped_name'].upper()),
                            'afterclass_name': row.get('afterclass_name', row['scraped_name'])
                        }
                
                return True
            return False
        except Exception as e:
            logger.error(f"Cache loading error: {e}")
            return False

    def _download_and_cache_data(self):
        """Download data from database and cache locally"""
        # Download professors
        query = "SELECT * FROM professors"
        professors_df = pd.read_sql_query(query, self.connection)
        professors_df.to_pickle(os.path.join(self.cache_dir, 'professors_cache.pkl'))
        
        # Download courses
        query = "SELECT * FROM courses"
        courses_df = pd.read_sql_query(query, self.connection)
        courses_df.to_pickle(os.path.join(self.cache_dir, 'courses_cache.pkl'))
        
        # Download acad_terms
        query = "SELECT * FROM acad_term"
        acad_terms_df = pd.read_sql_query(query, self.connection)
        acad_terms_df.to_pickle(os.path.join(self.cache_dir, 'acad_terms_cache.pkl'))
        
        # Load into memory
        self._load_from_cache()

    def load_raw_data(self):
        """Load raw data from Excel file"""
        try:
            logger.info(f"📂 Loading raw data from {self.input_file}")
            
            # Load both sheets
            self.standalone_data = pd.read_excel(self.input_file, sheet_name='standalone')
            self.multiple_data = pd.read_excel(self.input_file, sheet_name='multiple')
            
            logger.info(f"✅ Loaded {len(self.standalone_data)} standalone records")
            logger.info(f"✅ Loaded {len(self.multiple_data)} multiple records")
            
            from collections import defaultdict
            
            self.multiple_lookup = defaultdict(list)
            for _, row in self.multiple_data.iterrows():
                key = row.get('record_key')
                if pd.notna(key):
                    self.multiple_lookup[key].append(row)
            
            logger.info(f"✅ Created optimized lookup for {len(self.multiple_lookup)} record keys")

            return True
        except Exception as e:
            logger.error(f"❌ Failed to load raw data: {e}")
            return False

    def normalize_professor_name(self, name: str) -> Tuple[str, str]:
        """Normalize professor name and return (boss_format, afterclass_format)"""
        if not name or pd.isna(name):
            return "", ""
        
        # Clean and prepare name
        name = str(name).strip()
        
        # Handle comma-separated names properly
        if ',' in name:
            comma_count = name.count(',')
            if comma_count == 1:
                # Single comma - convert "SURNAME, Given Names" to "SURNAME Given Names"
                parts = name.split(',')
                if len(parts) == 2:
                    surname = parts[0].strip().upper()
                    given_names = parts[1].strip()
                    # Convert given names to title case
                    given_names_title = ' '.join(word.capitalize() for word in given_names.split())
                    name = f"{surname} {given_names_title}"
                # If not exactly 2 parts, keep original
            elif comma_count > 1:
                # Multiple commas - likely multiple professors, take first
                name = name.split(',')[0].strip()

        
        # Detect naming pattern
        words = name.split()
        if not words:
            return name.upper(), name
        
        # Detect pattern
        pattern = self._detect_name_pattern(words)
        
        # Format based on pattern
        if pattern == 'WESTERN':
            # Western: Given SURNAME
            boss_name = name.upper()
            afterclass_parts = []
            for i, word in enumerate(words):
                if i == len(words) - 1:  # Last word is surname
                    afterclass_parts.append(word.upper())
                else:
                    afterclass_parts.append(word.capitalize())
            afterclass_name = ' '.join(afterclass_parts)
        
        elif pattern == 'ASIAN':
            # Asian: SURNAME Given Given
            boss_name = name.upper()
            afterclass_parts = []
            for i, word in enumerate(words):
                if i == 0:  # First word is surname
                    afterclass_parts.append(word.upper())
                else:
                    afterclass_parts.append(word.capitalize())
            afterclass_name = ' '.join(afterclass_parts)
        
        elif pattern == 'SINGAPOREAN':
            # Singaporean: Given SURNAME Given
            boss_name = name.upper()
            surname_idx = self._find_surname_index(words)
            afterclass_parts = []
            for i, word in enumerate(words):
                if i == surname_idx:
                    afterclass_parts.append(word.upper())
                else:
                    afterclass_parts.append(word.capitalize())
            afterclass_name = ' '.join(afterclass_parts)
        
        else:
            # Default fallback
            boss_name = name.upper()
            afterclass_name = ' '.join(word.capitalize() for word in words)
        
        return boss_name, afterclass_name

    def _detect_name_pattern(self, words: List[str]) -> str:
        """Detect naming pattern: WESTERN, ASIAN, or SINGAPOREAN"""
        if not words:
            return 'UNKNOWN'
        
        # Check for Western pattern
        first_upper = words[0].upper()
        if first_upper in self.western_given_names:
            return 'WESTERN'
        
        # Check for pure Asian pattern
        if first_upper in self.all_asian_surnames:
            # Check if no Western names present
            has_western = any(w.upper() in self.western_given_names for w in words)
            if not has_western:
                return 'ASIAN'
        
        # Check for Singaporean mixed pattern
        if len(words) >= 3:
            if (words[0].upper() in self.western_given_names and 
                any(w.upper() in self.all_asian_surnames for w in words[1:])):
                return 'SINGAPOREAN'
        
        # Default to Western if unclear
        return 'WESTERN'

    def _find_surname_index(self, words: List[str]) -> int:
        """Find the index of surname in a list of words"""
        for i, word in enumerate(words):
            if word.upper() in self.all_asian_surnames:
                return i
        # Default to last word if no Asian surname found
        return len(words) - 1

    def process_professors(self):
        """Process professors from multiple sheet"""
        logger.info("👥 Processing professors...")
        
        unique_professors = set()
        
        # Extract unique professor names from multiple sheet
        for _, row in self.multiple_data.iterrows():
            if pd.notna(row.get('professor_name')):
                prof_name = str(row['professor_name']).strip()
                if prof_name and prof_name.upper() not in ['TBA', 'TO BE ANNOUNCED']:
                    # Handle comma-separated names properly during extraction
                    comma_count = prof_name.count(',')
                    if comma_count == 1:
                        parts = prof_name.split(',', 1)
                        before_comma = parts[0].strip()
                        if len(before_comma.split()) == 1:  # Single-word surname
                            prof_name = f"{before_comma} {parts[1].strip()}"  # "SURNAME Given"
                        else:  # Multi-word before comma = multi-instructor
                            prof_name = before_comma  # Take first instructor only
                    elif comma_count > 1:
                        prof_name = prof_name.split(',', 1)[0].strip()  # First instructor
                    else:
                        prof_name = prof_name  # No commas
                    unique_professors.add(prof_name)
        
        # Process each unique professor
        for prof_name in unique_professors:
            boss_name, afterclass_name = self.normalize_professor_name(prof_name)
            
            # Check if professor exists in lookup or cache
            if prof_name in self.professor_lookup:
                continue
            
            # Check cache by normalized name
            if boss_name in self.professors_cache or afterclass_name.upper() in self.professors_cache:
                continue
            
            # NEW: Add substring matching logic here
            duplicate_found = False
            
            # Check against existing professor_lookup
            for existing_scraped_name, prof_data in self.professor_lookup.items():
                existing_boss = prof_data.get('boss_name', '')
                existing_afterclass = prof_data.get('afterclass_name', '')
                
                # Check substring matches
                if (prof_name.upper() in existing_scraped_name.upper() or 
                    existing_scraped_name.upper() in prof_name.upper() or
                    boss_name.upper() in existing_boss.upper() or
                    existing_boss.upper() in boss_name.upper() or
                    afterclass_name.upper() in existing_afterclass.upper() or
                    existing_afterclass.upper() in afterclass_name.upper()):
                    
                    # Update lookup to include this variation
                    self.professor_lookup[prof_name] = prof_data.copy()
                    duplicate_found = True
                    break
            
            if duplicate_found:
                continue
            
            # Check against professors_cache
            for cached_name, cached_prof in self.professors_cache.items():
                cached_boss = cached_prof.get('name', '').upper()
                
                # Check substring matches with cache
                if (prof_name.upper() in cached_name.upper() or 
                    cached_name.upper() in prof_name.upper() or
                    boss_name.upper() in cached_boss or
                    cached_boss in boss_name.upper()):
                    
                    # Update lookup to point to existing professor
                    self.professor_lookup[prof_name] = {
                        'database_id': cached_prof['id'],
                        'boss_name': cached_boss,
                        'afterclass_name': cached_prof.get('name', afterclass_name)
                    }
                    duplicate_found = True
                    break
            
            if duplicate_found:
                continue
            
            # Check against new_professors being created in this run
            for new_prof in self.new_professors:
                new_original = new_prof.get('original_scraped_name', '')
                new_boss = new_prof.get('boss_name', '')
                new_afterclass = new_prof.get('afterclass_name', '')
                
                # Check substring matches with new professors
                if (prof_name.upper() in new_original.upper() or 
                    new_original.upper() in prof_name.upper() or
                    boss_name.upper() in new_boss.upper() or
                    new_boss.upper() in boss_name.upper() or
                    afterclass_name.upper() in new_afterclass.upper() or
                    new_afterclass.upper() in afterclass_name.upper()):
                    
                    # Update lookup to point to the new professor
                    self.professor_lookup[prof_name] = {
                        'database_id': new_prof['id'],
                        'boss_name': new_boss,
                        'afterclass_name': new_afterclass
                    }
                    duplicate_found = True
                    break
            
            if duplicate_found:
                continue
            
            # Create new professor (existing code continues here)
            professor_id = str(uuid.uuid4())
            slug = re.sub(r'[^a-zA-Z0-9]+', '-', afterclass_name.lower()).strip('-')
            
            new_prof = {
                'id': professor_id,
                'name': afterclass_name,
                'email': 'enquiry@smu.edu.sg',  # Default email
                'slug': slug,
                'photo_url': 'https://smu.edu.sg',
                'profile_url': 'https://smu.edu.sg',
                'belong_to_university': 1,  # SMU
                'created_at': datetime.now().isoformat(),
                'updated_at': datetime.now().isoformat(),
                'boss_name': boss_name,
                'afterclass_name': afterclass_name,
                'original_scraped_name': prof_name
            }
            
            self.new_professors.append(new_prof)
            self.stats['professors_created'] += 1
            
            # Update lookup
            self.professor_lookup[prof_name] = {
                'database_id': professor_id,
                'boss_name': boss_name,
                'afterclass_name': afterclass_name
            }
        
        logger.info(f"✅ Created {self.stats['professors_created']} new professors")

    def process_courses(self):
        """Process courses from standalone sheet WITHOUT prompting for faculty"""
        logger.info("📚 Processing courses...")
        
        # Group by course code to handle duplicates
        course_groups = defaultdict(list)
        for _, row in self.standalone_data.iterrows():
            if pd.notna(row.get('course_code')):
                course_groups[row['course_code']].append(row)
        
        for course_code, rows in course_groups.items():
            # Helper function to get sortable key for academic term ordering
            def get_sort_key(row):
                year_start = row.get('acad_year_start', 0)
                year_end = row.get('acad_year_end', 0)
                term = str(row.get('term', ''))
                
                # Convert term to sortable format
                term_order = {
                    'T1': 1,
                    'T2': 2,
                    'T3A': 3.1,
                    'T3B': 3.2
                }
                term_value = term_order.get(term.upper(), 0)
                return (year_start, year_end, term_value)
            
            # Sort rows to get the latest one (highest year and term)
            sorted_rows = sorted(rows, key=get_sort_key, reverse=True)
            latest_row = sorted_rows[0]
            
            # Check if course exists in cache
            if course_code in self.courses_cache:
                # Course exists - check for updates
                existing = self.courses_cache[course_code]
                update_needed = False
                update_record = {'id': existing['id'], 'code': course_code}
                
                # Fields that need comparison for changes
                comparison_fields = ['name', 'description', 'credit_units']
                
                # Fields that always need updating (even if null/empty in existing)
                always_update_fields = ['course_area', 'enrolment_requirements']
                
                # Field mapping from raw data to database columns
                field_mapping = {
                    'name': 'course_name',
                    'description': 'course_description',
                    'credit_units': 'credit_units'
                }
                
                # Check comparison fields for changes
                for field in comparison_fields:
                    raw_field = field_mapping.get(field, field)
                    new_value = latest_row.get(raw_field)
                    old_value = existing.get(field)
                    
                    # Convert credit_units to float for proper comparison
                    if field == 'credit_units':
                        new_value = float(new_value) if pd.notna(new_value) else None
                        old_value = float(old_value) if pd.notna(old_value) else None
                    
                    # Only update if new value exists and differs from old
                    if pd.notna(new_value) and new_value != old_value:
                        update_record[field] = new_value
                        update_needed = True
                
                # Always update course_area and enrolment_requirements if they have values
                for field in always_update_fields:
                    new_value = latest_row.get(field)
                    if pd.notna(new_value):
                        # Always add these fields to update, even if unchanged
                        update_record[field] = new_value
                        update_needed = True
                    elif existing.get(field) is None:
                        # If existing has no value and new has no value, no update needed
                        pass
                    else:
                        # If existing has value but new doesn't, keep existing (don't overwrite with null)
                        pass
                
                if update_needed:
                    self.update_courses.append(update_record)
                    self.stats['courses_updated'] += 1
                    
                    # Update cache with new values
                    for field, value in update_record.items():
                        if field != 'id' and field != 'code':
                            self.courses_cache[course_code][field] = value
            else:
                # Create new course WITHOUT faculty assignment
                course_id = str(uuid.uuid4())
                
                new_course = {
                    'id': course_id,
                    'code': course_code,
                    'name': latest_row.get('course_name', 'Unknown Course'),
                    'description': latest_row.get('course_description', 'No description available'),
                    'credit_units': float(latest_row.get('credit_units', 1.0)) if pd.notna(latest_row.get('credit_units')) else 1.0,
                    'belong_to_university': 1,  # SMU
                    'belong_to_faculty': None,  # Will be assigned later
                    'course_area': latest_row.get('course_area'),
                    'enrolment_requirements': latest_row.get('enrolment_requirements')
                }
                
                self.new_courses.append(new_course)
                self.stats['courses_created'] += 1
                
                # Store course info for later faculty assignment
                self.courses_needing_faculty.append({
                    'course_id': course_id,
                    'course_code': course_code,
                    'course_name': latest_row.get('course_name', 'Unknown Course'),
                    'course_outline_url': latest_row.get('course_outline_url')
                })
                self.stats['courses_needing_faculty'] += 1
                
                # Update cache
                self.courses_cache[course_code] = new_course
        
        logger.info(f"✅ Created {self.stats['courses_created']} new courses")
        logger.info(f"✅ Updated {self.stats['courses_updated']} existing courses")
        logger.info(f"⚠️  {self.stats['courses_needing_faculty']} courses need faculty assignment")

    def assign_course_faculties(self):
        """Separate method to handle faculty assignments for courses"""
        if not self.courses_needing_faculty:
            logger.info("✅ No courses need faculty assignment")
            return
        
        logger.info(f"🎓 Starting faculty assignment for {len(self.courses_needing_faculty)} courses")
        
        faculty_assignments = []
        
        for course_info in self.courses_needing_faculty:
            print(f"\n{'='*60}")
            print(f"🎓 FACULTY ASSIGNMENT NEEDED")
            print(f"{'='*60}")
            print(f"Course Code: {course_info['course_code']}")
            print(f"Course Name: {course_info['course_name']}")
            
            # Open course outline if available
            if pd.notna(course_info.get('course_outline_url')):
                url = course_info['course_outline_url']
                print(f"Opening course outline: {url}")
                webbrowser.open(url)
            
            print("\nFaculty Options:")
            print("1. Lee Kong Chian School of Business")
            print("2. Yong Pung How School of Law")
            print("3. School of Economics")
            print("4. School of Computing and Information Systems")
            print("5. School of Social Sciences")
            print("6. School of Accountancy")
            print("7. College of Integrative Studies")
            print("8. Center for English Communication")
            print("0. Skip (will need manual review)")
            
            while True:
                choice = input("\nEnter faculty number (0-8): ").strip()
                if choice == '0':
                    faculty_id = None
                    break
                elif choice in ['1', '2', '3', '4', '5', '6', '7', '8']:
                    faculty_id = int(choice)
                    break
                else:
                    print("Invalid choice. Please enter 0-8.")
            
            # Store assignment
            faculty_assignments.append({
                'course_id': course_info['course_id'],
                'course_code': course_info['course_code'],
                'faculty_id': faculty_id
            })
        
        # Update the new_courses list with faculty assignments
        for assignment in faculty_assignments:
            if assignment['faculty_id'] is not None:
                # Find and update the course in new_courses
                for course in self.new_courses:
                    if course['id'] == assignment['course_id']:
                        course['belong_to_faculty'] = assignment['faculty_id']
                        break
                
                # Update cache
                if assignment['course_code'] in self.courses_cache:
                    self.courses_cache[assignment['course_code']]['belong_to_faculty'] = assignment['faculty_id']
        
        # Re-save the new_courses.csv with faculty assignments
        if self.new_courses:
            df = pd.DataFrame(self.new_courses)
            df.to_csv(os.path.join(self.verify_dir, 'new_courses.csv'), index=False)
            logger.info(f"✅ Updated new_courses.csv with faculty assignments")
        
        logger.info("✅ Faculty assignment completed")

    def process_acad_terms(self):
        """Process academic terms from standalone sheet"""
        logger.info("📅 Processing academic terms...")
        
        # Group by (acad_year_start, acad_year_end, term)
        term_groups = defaultdict(list)
        for _, row in self.standalone_data.iterrows():
            key = (
                row.get('acad_year_start'),
                row.get('acad_year_end'),
                row.get('term')
            )
            if all(pd.notna(v) for v in key):
                term_groups[key].append(row)
        
        for (year_start, year_end, term), rows in term_groups.items():
            # Generate acad_term_id
            acad_term_id = f"AY{int(year_start)}{int(year_end) % 100:02d}{term}"
            
            # Check if already exists
            if acad_term_id in self.acad_term_cache:
                continue
            
            # Find most common period_text and dates
            period_counter = Counter()
            date_info = {}
            
            for row in rows:
                period_text = row.get('period_text', '')
                if pd.notna(period_text):
                    period_counter[period_text] += 1
                    if period_text not in date_info:
                        date_info[period_text] = {
                            'start_dt': row.get('start_dt'),
                            'end_dt': row.get('end_dt')
                        }
            
            # Get most common period
            if period_counter:
                most_common_period = period_counter.most_common(1)[0][0]
                dates = date_info[most_common_period]
            else:
                dates = {'start_dt': None, 'end_dt': None}
            
            # Get boss_id from first row
            boss_id = rows[0].get('acad_term_boss_id')
            
            new_term = {
                'id': acad_term_id,
                'acad_year_start': int(year_start),
                'acad_year_end': int(year_end),
                'term': str(term),
                'boss_id': int(boss_id) if pd.notna(boss_id) else None,
                'start_dt': dates['start_dt'],
                'end_dt': dates['end_dt']
            }
            
            self.new_acad_terms.append(new_term)
            self.acad_term_cache[acad_term_id] = new_term
        
        logger.info(f"✅ Created {len(self.new_acad_terms)} new academic terms")

    def process_classes(self):
        """Process classes from standalone sheet"""
        logger.info("🏫 Processing classes...")
        
        try:
            for _, row in self.standalone_data.iterrows():
                record_key = row.get('record_key')
                if pd.notna(record_key):
                    # Use optimized professor lookup
                    professor_id = self._find_professor_for_class(record_key)
                    # Generate class ID
                    class_id = str(uuid.uuid4())
                    
                    # Extract boss_id from record_key
                    record_key = row.get('record_key', '')
                    boss_id_match = re.search(r'SelectedClassNumber=(\d+)', record_key)
                    boss_id = int(boss_id_match.group(1)) if boss_id_match else None
                    
                    # Get course_id
                    course_code = row.get('course_code')
                    course_id = None
                    if course_code and course_code in self.courses_cache:
                        course_id = self.courses_cache[course_code]['id']
                    
                    # Get professor_id from multiple sheet
                    professor_id = self._find_professor_for_class(record_key)
                    
                    new_class = {
                        'id': class_id,
                        'section': row.get('section', ''),
                        'course_id': course_id,
                        'professor_id': professor_id,
                        'acad_term_id': row.get('acad_term_id'),
                        'grading_basis': row.get('grading_basis'),
                        'course_outline_url': row.get('course_outline_url'),
                        'boss_id': boss_id
                    }
                    
                    self.new_classes.append(new_class)
                    self.stats['classes_created'] += 1
                    
                    # Store mapping for timing tables
                    self.class_id_mapping[record_key] = class_id
        except Exception as e:
            logger.error(f"Error processing classes: {e}")
            raise
        logger.info(f"✅ Created {self.stats['classes_created']} new classes")

    def _find_professor_for_class(self, record_key: str) -> Optional[str]:
        """Optimised: Find professor ID for a class using pre-indexed multiple_lookup"""
        rows = self.multiple_lookup.get(record_key, [])
        for row in rows:
            if pd.notna(row.get('professor_name')):
                original_prof_name = str(row['professor_name']).strip()

                # Step 1: Try full string match first
                if original_prof_name in self.professor_lookup:
                    return self.professor_lookup[original_prof_name]['database_id']

                # Step 2: Parse name by commas
                comma_count = original_prof_name.count(',')
                if comma_count == 1:
                    # One comma - check substring before comma
                    parts = original_prof_name.split(',')
                    before_comma = parts[0].strip()
                    words_before_comma = before_comma.split()
                    
                    if len(words_before_comma) == 1:
                        # Exactly one word before comma - single professor in "SURNAME, FirstName" format
                        cleaned_name = original_prof_name  # Use full original string
                    else:
                        # More than one word before comma - multiple professors
                        cleaned_name = before_comma  # Use only part before comma
                elif comma_count >= 2:
                    # Two or more commas - definitely multiple professors
                    cleaned_name = original_prof_name.split(',')[0].strip()
                else:
                    # No commas - single professor
                    cleaned_name = original_prof_name

                # Step 3: Try cleaned name match
                if cleaned_name in self.professor_lookup:
                    return self.professor_lookup[cleaned_name]['database_id']

                # NEW: Step 3.5: Try substring matching
                # Check if any existing professor names are substrings of the original name
                # or if the original name is a substring of existing names
                for existing_name, prof_data in self.professor_lookup.items():
                    # Check if existing name is in the original name
                    if existing_name.upper() in original_prof_name.upper():
                        return prof_data['database_id']
                    # Check if original name is in existing name  
                    if original_prof_name.upper() in existing_name.upper():
                        return prof_data['database_id']
                
                # Also check against new professors being created
                for new_prof in self.new_professors:
                    new_prof_name = new_prof.get('original_scraped_name', '')
                    boss_name = new_prof.get('boss_name', '')
                    afterclass_name = new_prof.get('afterclass_name', '')
                    
                    # Check substring matches against various name formats
                    names_to_check = [new_prof_name, boss_name, afterclass_name]
                    for name in names_to_check:
                        if name and (name.upper() in original_prof_name.upper() or 
                                    original_prof_name.upper() in name.upper()):
                            return new_prof['id']

                # Step 4: Use normalisation fallback
                boss_name, afterclass_name = self.normalize_professor_name(cleaned_name)
                if boss_name in self.professors_cache:
                    return self.professors_cache[boss_name]['id']
                if afterclass_name.upper() in self.professors_cache:
                    return self.professors_cache[afterclass_name.upper()]['id']
        return None

    def process_timings(self):
        """Process class timings and exam timings from multiple sheet"""
        logger.info("⏰ Processing class timings and exam timings...")
        
        for _, row in self.multiple_data.iterrows():
            record_key = row.get('record_key')
            if record_key not in self.class_id_mapping:
                continue
            
            class_id = self.class_id_mapping[record_key]
            timing_type = row.get('type', 'CLASS')
            
            if timing_type == 'CLASS':
                timing_record = {
                    'class_id': class_id,
                    'start_date': row.get('start_date'),
                    'end_date': row.get('end_date'),
                    'day_of_week': row.get('day_of_week'),
                    'start_time': row.get('start_time'),
                    'end_time': row.get('end_time'),
                    'venue': row.get('venue', '')
                }
                self.new_class_timings.append(timing_record)
                self.stats['timings_created'] += 1
            
            elif timing_type == 'EXAM':
                exam_record = {
                    'class_id': class_id,
                    'date': row.get('date'),
                    'day_of_week': row.get('day_of_week'),
                    'start_time': row.get('start_time'),
                    'end_time': row.get('end_time'),
                    'venue': row.get('venue')
                }
                self.new_class_exam_timings.append(exam_record)
                self.stats['exams_created'] += 1
        
        logger.info(f"✅ Created {self.stats['timings_created']} class timings")
        logger.info(f"✅ Created {self.stats['exams_created']} exam timings")

    def save_outputs(self):
        """Save all generated CSV files"""
        logger.info("💾 Saving output files...")
        
        # Save new professors (to verify folder)
        if self.new_professors:
            df = pd.DataFrame(self.new_professors)
            df.to_csv(os.path.join(self.verify_dir, 'new_professors.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_professors)} new professors")
        
        # Save new courses (to verify folder)
        if self.new_courses:
            df = pd.DataFrame(self.new_courses)
            df.to_csv(os.path.join(self.verify_dir, 'new_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_courses)} new courses")
        
        # Save course updates
        if self.update_courses:
            df = pd.DataFrame(self.update_courses)
            df.to_csv(os.path.join(self.output_base, 'update_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.update_courses)} course updates")
        
        # Save academic terms
        if self.new_acad_terms:
            df = pd.DataFrame(self.new_acad_terms)
            df.to_csv(os.path.join(self.output_base, 'new_acad_term.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_acad_terms)} academic terms")
        
        # Save classes
        if self.new_classes:
            df = pd.DataFrame(self.new_classes)
            df.to_csv(os.path.join(self.output_base, 'new_classes.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_classes)} classes")
        
        # Save class timings
        if self.new_class_timings:
            df = pd.DataFrame(self.new_class_timings)
            df.to_csv(os.path.join(self.output_base, 'new_class_timing.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_class_timings)} class timings")
        
        # Save exam timings
        if self.new_class_exam_timings:
            df = pd.DataFrame(self.new_class_exam_timings)
            df.to_csv(os.path.join(self.output_base, 'new_class_exam_timing.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_class_exam_timings)} exam timings")
        
        # Save updated professor lookup
        # Commented out due to poor surname handling in original code
        # if self.new_professors:
        #     self._save_professor_lookup()
        pass
        
        # Save courses needing faculty assignment
        if self.courses_needing_faculty:
            df = pd.DataFrame(self.courses_needing_faculty)
            df.to_csv(os.path.join(self.output_base, 'courses_needing_faculty.csv'), index=False)
            logger.info(f"✅ Saved {len(self.courses_needing_faculty)} courses needing faculty assignment")
        
        # Create placeholder files
        placeholders = ['new_bid_window.csv', 'new_class_availability.csv', 'new_bid_result.csv']
        for filename in placeholders:
            df = pd.DataFrame()
            df.to_csv(os.path.join(self.output_base, filename), index=False)
            logger.info(f"✅ Created placeholder: {filename}")

    def _save_professor_lookup(self):
        """Save updated professor lookup table"""
        lookup_data = []
        
        # Add all professors from lookup
        for scraped_name, data in self.professor_lookup.items():
            lookup_data.append({
                'boss_name': data.get('boss_name', scraped_name.upper()),
                'afterclass_name': data.get('afterclass_name', scraped_name),
                'database_id': data['database_id'],
                'method': 'exists' if scraped_name not in [p['original_scraped_name'] for p in self.new_professors] else 'created'
            })
        
        # Sort by scraped_name
        lookup_data.sort(key=lambda x: x['scraped_name'])
        
        # Save to output folder
        df = pd.DataFrame(lookup_data)
        df.to_csv(os.path.join(self.output_base, 'professor_lookup.csv'), index=False)
        logger.info(f"✅ Saved updated professor lookup with {len(lookup_data)} entries")

    def update_professor_lookup_from_corrected_csv(self):
        """Update professor lookup from manually corrected new_professors.csv"""
        logger.info("🔄 Updating professor lookup from corrected CSV...")
        
        # Read corrected new_professors.csv
        corrected_csv_path = os.path.join(self.verify_dir, 'new_professors.csv')
        if not os.path.exists(corrected_csv_path):
            logger.error(f"❌ Corrected CSV not found: {corrected_csv_path}")
            return False
        
        try:
            corrected_df = pd.read_csv(corrected_csv_path)
            logger.info(f"📖 Reading {len(corrected_df)} corrected professor records")
            
            # Update internal professor_lookup for new professors
            updated_count = 0
            for _, row in corrected_df.iterrows():
                original_name = row.get('original_scraped_name', '')
                corrected_afterclass_name = row.get('name', '')  # This is the corrected name
                boss_name = row.get('boss_name', '')  # Keep boss name same
                professor_id = row.get('id', '')
                
                if original_name and professor_id:
                    # Update lookup with corrected afterclass name but same boss name
                    self.professor_lookup[original_name] = {
                        'database_id': professor_id,
                        'boss_name': boss_name,  # Keep original boss name
                        'afterclass_name': corrected_afterclass_name  # Use corrected name
                    }
                    updated_count += 1
            
            # Save updated professor lookup to CSV
            self._save_corrected_professor_lookup()
            
            logger.info(f"✅ Updated {updated_count} professor lookup entries")
            return True
            
        except Exception as e:
            logger.error(f"❌ Failed to update professor lookup: {e}")
            return False

    def process_remaining_tables(self):
        """Process classes and timings after professor lookup is updated"""
        logger.info("🏫 Processing remaining tables (classes, timings)...")
        
        try:
            # Process classes (depends on updated professor lookup)
            self.process_classes()
            
            # Process timings (depends on classes)
            self.process_timings()
            
            logger.info("✅ Remaining tables processed successfully")
            return True
            
        except Exception as e:
            logger.error(f"❌ Failed to process remaining tables: {e}")
            return False

    def _save_corrected_professor_lookup(self):
        """Save professor lookup with corrected names"""
        lookup_data = []
        
        # Load existing professor lookup if it exists
        existing_lookup_path = os.path.join(self.output_base, 'professor_lookup.csv')
        if os.path.exists(existing_lookup_path):
            existing_df = pd.read_csv(existing_lookup_path)
            for _, row in existing_df.iterrows():
                lookup_data.append({
                    'scraped_name': row.get('scraped_name', ''),
                    'boss_name': row.get('boss_name', ''),
                    'afterclass_name': row.get('afterclass_name', ''),
                    'database_id': row.get('database_id', ''),
                    'method': row.get('method', 'exists')
                })
        
        # Add/update with new professor lookup entries
        existing_scraped_names = {item['scraped_name'] for item in lookup_data}
        
        for scraped_name, data in self.professor_lookup.items():
            if scraped_name not in existing_scraped_names:
                lookup_data.append({
                    'scraped_name': scraped_name,
                    'boss_name': data.get('boss_name', scraped_name.upper()),
                    'afterclass_name': data.get('afterclass_name', scraped_name),
                    'database_id': data['database_id'],
                    'method': 'created'
                })
            else:
                # Update existing entry with corrected afterclass name
                for item in lookup_data:
                    if item['scraped_name'] == scraped_name:
                        item['afterclass_name'] = data.get('afterclass_name', scraped_name)
                        break
        
        # Sort by scraped_name
        lookup_data.sort(key=lambda x: x['scraped_name'])
        
        # Save to output folder
        df = pd.DataFrame(lookup_data)
        df.to_csv(os.path.join(self.output_base, 'professor_lookup.csv'), index=False)
        logger.info(f"✅ Saved updated professor lookup with {len(lookup_data)} entries")

    def print_summary(self):
        """Print processing summary"""
        print("\n" + "="*70)
        print("📊 PROCESSING SUMMARY")
        print("="*70)
        print(f"✅ Professors created: {self.stats['professors_created']}")
        print(f"✅ Courses created: {self.stats['courses_created']}")
        print(f"✅ Courses updated: {self.stats['courses_updated']}")
        print(f"⚠️  Courses needing faculty: {self.stats['courses_needing_faculty']}")
        print(f"✅ Classes created: {self.stats['classes_created']}")
        print(f"✅ Class timings created: {self.stats['timings_created']}")
        print(f"✅ Exam timings created: {self.stats['exams_created']}")
        print("="*70)
        
        print("\n📁 OUTPUT FILES:")
        print(f"   Verify folder: {self.verify_dir}/")
        print(f"   - new_professors.csv ({self.stats['professors_created']} records)")
        print(f"   - new_courses.csv ({self.stats['courses_created']} records)")
        print(f"   Output folder: {self.output_base}/")
        print(f"   - update_courses.csv ({self.stats['courses_updated']} records)")
        print(f"   - new_acad_term.csv ({len(self.new_acad_terms)} records)")
        print(f"   - new_classes.csv ({self.stats['classes_created']} records)")
        print(f"   - new_class_timing.csv ({self.stats['timings_created']} records)")
        print(f"   - new_class_exam_timing.csv ({self.stats['exams_created']} records)")
        print(f"   - professor_lookup.csv (updated)")
        print(f"   - courses_needing_faculty.csv ({self.stats['courses_needing_faculty']} records)")
        print("="*70)

    def run_phase1_professors_and_courses(self):
        """Phase 1: Process professors and courses only"""
        try:
            logger.info("🚀 Starting Phase 1: Professors and Courses")
            logger.info("="*60)
            
            # Load data
            if not self.load_or_cache_data():
                logger.error("❌ Failed to load database data")
                return False
            
            if not self.load_raw_data():
                logger.error("❌ Failed to load raw data")
                return False
            
            # Process professors (CSV only, no lookup update)
            self.process_professors()
            
            # Process courses
            self.process_courses()
            
            # Process academic terms
            self.process_acad_terms()
            
            # Save phase 1 outputs
            self._save_phase1_outputs()
            
            logger.info("✅ Phase 1 completed - Review new_professors.csv for manual correction")
            return True
            
        except Exception as e:
            logger.error(f"❌ Phase 1 failed: {e}")
            return False

    def run_phase2_remaining_tables(self):
        """Phase 2: Process classes and timings after professor correction"""
        try:
            logger.info("🚀 Starting Phase 2: Classes and Timings")
            logger.info("="*60)
            
            # Update professor lookup from corrected CSV
            if not self.update_professor_lookup_from_corrected_csv():
                logger.error("❌ Failed to update professor lookup")
                return False
            
            # Process remaining tables
            if not self.process_remaining_tables():
                logger.error("❌ Failed to process remaining tables")
                return False
            
            # Save all outputs
            self.save_outputs()
            
            # Print summary
            self.print_summary()
            
            logger.info("✅ Phase 2 completed successfully!")
            return True
            
        except Exception as e:
            logger.error(f"❌ Phase 2 failed: {e}")
            return False

    def _save_phase1_outputs(self):
        """Save Phase 1 outputs (professors, courses, acad_terms)"""
        # Save new professors (to verify folder for manual correction)
        if self.new_professors:
            df = pd.DataFrame(self.new_professors)
            df.to_csv(os.path.join(self.verify_dir, 'new_professors.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_professors)} new professors for review")
        
        # Save new courses (to verify folder)
        if self.new_courses:
            df = pd.DataFrame(self.new_courses)
            df.to_csv(os.path.join(self.verify_dir, 'new_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_courses)} new courses")
        
        # Save course updates
        if self.update_courses:
            df = pd.DataFrame(self.update_courses)
            df.to_csv(os.path.join(self.output_base, 'update_courses.csv'), index=False)
            logger.info(f"✅ Saved {len(self.update_courses)} course updates")
        
        # Save academic terms
        if self.new_acad_terms:
            df = pd.DataFrame(self.new_acad_terms)
            df.to_csv(os.path.join(self.output_base, 'new_acad_term.csv'), index=False)
            logger.info(f"✅ Saved {len(self.new_acad_terms)} academic terms")

    def run(self, skip_faculty_assignment=True):
        """Run the complete table building process
        
        Args:
            skip_faculty_assignment: If True, faculty assignment is deferred
        """
        try:
            logger.info("🚀 Starting TableBuilder process")
            logger.info("="*60)
            
            # Step 1: Load or cache database data
            if not self.load_or_cache_data():
                logger.error("❌ Failed to load database data")
                return False
            
            # Step 2: Load raw data
            if not self.load_raw_data():
                logger.error("❌ Failed to load raw data")
                return False
            
            # Step 3: Process tables in dependency order
            logger.info("\n📋 Processing tables in dependency order...")
            
            # 3.1: Process professors first (no dependencies)
            self.process_professors()
            
            # 3.2: Process courses (without faculty assignment)
            self.process_courses()
            
            # 3.3: Process academic terms (no dependencies)
            self.process_acad_terms()
            
            # 3.4: Process classes (depends on courses, professors, acad_terms)
            self.process_classes()
            
            # 3.5: Process timings (depends on classes)
            self.process_timings()
            
            # Step 4: Save all outputs
            self.save_outputs()
            
            # Step 5: Print summary
            self.print_summary()
            
            if self.stats['courses_needing_faculty'] > 0 and not skip_faculty_assignment:
                print("\n⚠️  FACULTY ASSIGNMENT REQUIRED")
                print(f"   {self.stats['courses_needing_faculty']} courses need faculty assignment")
                print("   Run builder.assign_course_faculties() to complete assignment")
            
            logger.info("\n✅ TableBuilder process completed successfully!")
            return True
            
        except Exception as e:
            logger.error(f"❌ Process failed: {e}")
            import traceback
            traceback.print_exc()
            return False
        finally:
            # Clean up database connection
            if self.connection:
                self.connection.close()
                logger.info("🔒 Database connection closed")

### **Cell 1: Phase 1 Initialization**
```python
# Initialize the TableBuilder
builder = TableBuilder()

# Run Phase 1 (professors, courses, acad_terms)
success = builder.run_phase1_professors_and_courses()
```

**When to Use:**
- **First-time setup**: Initial processing of new raw data from HTML extractor
- **Semester data ingestion**: Beginning of each new academic term data import
- **After HTML extraction**: Following successful completion of point 3 data extraction

**What It Does:**
- Loads existing database cache and raw Excel data
- Processes professors with advanced name normalization and duplicate detection
- Creates new courses without faculty assignments (deferred for manual review)
- Generates academic terms from date ranges and term codes
- Outputs verification files for manual review before proceeding

**Success Indicators:**
- Creates `script_output/verify/new_professors.csv` with properly normalized names
- Generates course files ready for faculty assignment
- Displays statistics on professors, courses, and terms processed

In [17]:
# Initialize the TableBuilder
builder = TableBuilder()

# Run Phase 1 (professors, courses, acad_terms)
success = builder.run_phase1_professors_and_courses()

if success:
    print("\n🎉 Phase 1 completed successfully!")
    print("📝 Next steps:")
    print("   1. Review script_output/verify/new_professors.csv")
    print("   2. Manually correct any professor names if needed")
    print("   3. Run Phase 2 in the next cell")
else:
    print("\n❌ Phase 1 failed. Check logs for details.")

2025-06-09 17:14:23,763 - INFO - 🚀 Starting Phase 1: Professors and Courses
2025-06-09 17:14:24,023 - INFO - ✅ Loaded data from cache
2025-06-09 17:14:24,023 - INFO - 📂 Loading raw data from script_input/raw_data.xlsx
2025-06-09 17:14:32,059 - INFO - ✅ Loaded 12976 standalone records
2025-06-09 17:14:32,059 - INFO - ✅ Loaded 19988 multiple records
2025-06-09 17:14:33,078 - INFO - ✅ Created optimized lookup for 11926 record keys
2025-06-09 17:14:33,078 - INFO - 👥 Processing professors...
2025-06-09 17:14:33,937 - INFO - ✅ Created 11 new professors
2025-06-09 17:14:33,938 - INFO - 📚 Processing courses...
2025-06-09 17:14:34,838 - INFO - ✅ Created 141 new courses
2025-06-09 17:14:34,839 - INFO - ✅ Updated 1158 existing courses
2025-06-09 17:14:34,839 - INFO - ⚠️  141 courses need faculty assignment
2025-06-09 17:14:34,858 - INFO - 📅 Processing academic terms...
2025-06-09 17:14:35,760 - INFO - ✅ Created 17 new academic terms
2025-06-09 17:14:35,772 - INFO - ✅ Saved 11 new professors for r


🎉 Phase 1 completed successfully!
📝 Next steps:
   1. Review script_output/verify/new_professors.csv
   2. Manually correct any professor names if needed
   3. Run Phase 2 in the next cell


### **Cell 2: Professor Review Interface**
```python
# Display new professors for review
new_prof_path = os.path.join('script_output', 'verify', 'new_professors.csv')
df = pd.read_csv(new_prof_path)
display(df[['name', 'boss_name', 'afterclass_name', 'original_scraped_name']])
```

**When to Use:**
- **After Phase 1 completion**: Review professor names before final processing
- **Quality assurance**: Verify name normalization accuracy for Asian and Western names
- **Before database insertion**: Ensure all professor names are correctly formatted

**What It Does:**
- Displays newly created professors with different name formats
- Shows original scraped names vs. normalized versions
- Provides clear guidance on which column to edit (name = afterclass format)
- Preserves boss_name format for database consistency

**Manual Review Process:**
- Check `name` column for proper Title Case formatting
- Verify Asian surnames are correctly identified and positioned
- Correct any obvious parsing errors (e.g., "TSE, JUSTIN K, AIDAN WONG" cases)
- Save changes directly to the CSV file for Phase 2 processing

In [18]:
# Display new professors for review
new_prof_path = os.path.join('script_output', 'verify', 'new_professors.csv')
if os.path.exists(new_prof_path):
    df = pd.read_csv(new_prof_path)
    print(f"📋 {len(df)} new professors created:")
    print("\n🔍 Review these professor names:")
    display(df[['name', 'boss_name', 'afterclass_name', 'original_scraped_name']])
    print("\n📝 If any names need correction, edit the 'name' column in:")
    print(f"   {new_prof_path}")
    print("\n⚠️  Only edit the 'name' column (afterclass format)")
    print("   Keep 'boss_name' unchanged")
else:
    print("❌ new_professors.csv not found")

📋 10 new professors created:

🔍 Review these professor names:


Unnamed: 0,name,boss_name,afterclass_name,original_scraped_name
0,LEE Yun,LEE YUN,LEE Yun,LEE YUN
1,Yu QI,YU QI,Yu QI,YU QI
2,Tang TONY,TANG TONY,Tang TONY,TANG TONY
3,Hara KOTARO,HARA KOTARO,Hara KOTARO,HARA KOTARO
4,HU Naiyuan,HU NAIYUAN,HU Naiyuan,HU NAIYUAN
5,Koh ANDREW,KOH ANDREW,Koh ANDREW,KOH ANDREW
6,Ricks JACOB,RICKS JACOB,Ricks JACOB,RICKS JACOB
7,ZHANG Ce,ZHANG CE,ZHANG Ce,ZHANG CE
8,Zeng QINGLI,ZENG QINGLI,Zeng QINGLI,ZENG QINGLI
9,Pepito NONA,PEPITO NONA,Pepito NONA,PEPITO NONA



📝 If any names need correction, edit the 'name' column in:
   script_output\verify\new_professors.csv

⚠️  Only edit the 'name' column (afterclass format)
   Keep 'boss_name' unchanged


### **Cell 3: Phase 2 Completion**
```python
# Run Phase 2 (classes, timings) after manual correction
success = builder.run_phase2_remaining_tables()
```

**When to Use:**
- **After manual professor review**: Following corrections to new_professors.csv
- **Final data processing**: Complete the database table generation pipeline
- **Before database insertion**: Generate all remaining tables with correct relationships

**What It Does:**
- Updates internal professor lookup from manually corrected CSV files
- Processes classes using corrected professor mappings and course relationships
- Generates class timing and exam timing records linked to classes
- Creates complete set of database-ready CSV files
- Maintains referential integrity across all generated tables

**Output Generation:**
- Links professors to classes using updated lookup mappings
- Ensures all timing records reference valid class IDs
- Produces final statistics and file summaries for database insertion

In [19]:
# Run Phase 2 (classes, timings) after manual correction
success = builder.run_phase2_remaining_tables()

if success:
    print("\n🎉 Phase 2 completed successfully!")
    print("📝 All tables generated with corrected professor names")
else:
    print("\n❌ Phase 2 failed. Check logs for details.")

2025-06-09 17:15:17,815 - INFO - 🚀 Starting Phase 2: Classes and Timings
2025-06-09 17:15:17,816 - INFO - 🔄 Updating professor lookup from corrected CSV...
2025-06-09 17:15:17,818 - INFO - 📖 Reading 10 corrected professor records
2025-06-09 17:15:17,825 - INFO - ✅ Saved updated professor lookup with 1116 entries
2025-06-09 17:15:17,825 - INFO - ✅ Updated 10 professor lookup entries
2025-06-09 17:15:17,826 - INFO - 🏫 Processing remaining tables (classes, timings)...
2025-06-09 17:15:17,827 - INFO - 🏫 Processing classes...
2025-06-09 17:15:19,021 - INFO - ✅ Created 12976 new classes
2025-06-09 17:15:19,023 - INFO - ⏰ Processing class timings and exam timings...
2025-06-09 17:15:20,118 - INFO - ✅ Created 13084 class timings
2025-06-09 17:15:20,119 - INFO - ✅ Created 6904 exam timings
2025-06-09 17:15:20,119 - INFO - ✅ Remaining tables processed successfully
2025-06-09 17:15:20,120 - INFO - 💾 Saving output files...
2025-06-09 17:15:20,122 - INFO - ✅ Saved 11 new professors
2025-06-09 17:15


📊 PROCESSING SUMMARY
✅ Professors created: 11
✅ Courses created: 141
✅ Courses updated: 1158
⚠️  Courses needing faculty: 141
✅ Classes created: 12976
✅ Class timings created: 13084
✅ Exam timings created: 6904

📁 OUTPUT FILES:
   Verify folder: script_output\verify/
   - new_professors.csv (11 records)
   - new_courses.csv (141 records)
   Output folder: script_output/
   - update_courses.csv (1158 records)
   - new_acad_term.csv (17 records)
   - new_classes.csv (12976 records)
   - new_class_timing.csv (13084 records)
   - new_class_exam_timing.csv (6904 records)
   - professor_lookup.csv (updated)
   - courses_needing_faculty.csv (141 records)

🎉 Phase 2 completed successfully!
📝 All tables generated with corrected professor names


### **Cell 4: Faculty Assignment (Optional)**
```python
# Run faculty assignment process if needed
if hasattr(builder, 'courses_needing_faculty') and builder.courses_needing_faculty:
    builder.assign_course_faculties()
```

**When to Use:**
- **New course processing**: When courses lack faculty assignments in manual mapping
- **Interactive assignment**: For courses requiring human judgment on faculty placement
- **Policy compliance**: Ensuring all courses are properly assigned to SMU schools

**What It Does:**
- Opens course outline URLs in web browser for informed decision-making
- Presents interactive menu of SMU's 8 schools and centers
- Updates course records with selected faculty assignments
- Re-saves CSV files with complete faculty information

**Faculty Options Available:**
1. Lee Kong Chian School of Business
2. Yong Pung How School of Law  
3. School of Economics
4. School of Computing and Information Systems
5. School of Social Sciences
6. School of Accountancy
7. College of Integrative Studies
8. Center for English Communication

**Best Practices:**
- Review course outlines before making faculty assignments
- Use existing course patterns as reference for similar courses
- Skip courses requiring additional research (can be assigned later)
- Ensure consistency with SMU's academic structure and course offerings

In [None]:
# Run faculty assignment process if needed
if hasattr(builder, 'courses_needing_faculty') and builder.courses_needing_faculty:
    builder.assign_course_faculties()
    print("\n✅ Faculty assignment completed!")
else:
    print("✅ No courses need faculty assignment")

2025-06-09 16:08:40,418 - INFO - 🎓 Starting faculty assignment for 141 courses



🎓 FACULTY ASSIGNMENT NEEDED
Course Code: ISFS603
Course Name: Corporate Banking and Blockchain
Opening course outline: https://courses.smu.edu.sg/sites/courses.smu.edu.sg/files/PGP2022/SCISGPO/2210/ISFS603_ PAUL GRIFFIN.pdf

Faculty Options:
1. Lee Kong Chian School of Business
2. Yong Pung How School of Law
3. School of Economics
4. School of Computing and Information Systems
5. School of Social Sciences
6. School of Accountancy
7. College of Integrative Studies
8. Center for English Communication
0. Skip (will need manual review)
Invalid choice. Please enter 0-8.
Invalid choice. Please enter 0-8.
Invalid choice. Please enter 0-8.


In [20]:
def sync_courses_with_manual_mapping():
    """
    Sync script-generated courses with manually mapped courses and filter related data
    """
    
    # File paths
    manual_courses_path = r'extracted_data\3. new_courses.csv'
    script_courses_path = r'script_output\verify\new_courses.csv'
    
    print("🔄 Starting course synchronization...")
    
    # Load manually mapped courses
    try:
        manual_courses = pd.read_csv(manual_courses_path)
        print(f"✅ Loaded {len(manual_courses)} manually mapped courses")
    except FileNotFoundError:
        print(f"❌ Manual courses file not found: {manual_courses_path}")
        return False
    
    # Load script-generated courses
    try:
        script_courses = pd.read_csv(script_courses_path)
        print(f"✅ Loaded {len(script_courses)} script-generated courses")
    except FileNotFoundError:
        print(f"❌ Script courses file not found: {script_courses_path}")
        return False
    
    # Create mappings from course code to manual UUID and faculty
    code_to_manual_uuid = dict(zip(manual_courses['code'], manual_courses['id']))
    code_to_faculty = dict(zip(manual_courses['code'], manual_courses['belong_to_faculty']))
    print(f"📋 Created mapping for {len(code_to_manual_uuid)} course codes")
    
    # CRITICAL: Identify script-generated UUIDs that should be deleted
    # These are the UUIDs from script courses that are NOT mapped to manual courses
    script_courses_to_keep = script_courses[script_courses['code'].isin(manual_courses['code'])].copy()
    script_courses_to_delete = script_courses[~script_courses['code'].isin(manual_courses['code'])].copy()
    
    # Get the UUIDs that should be deleted (unmapped script UUIDs)
    uuids_to_delete = set(script_courses_to_delete['id'])
    print(f"🗑️  Identified {len(uuids_to_delete)} script-generated UUIDs to delete (unmapped)")
    
    # Create old UUID to new UUID mapping for courses that are kept
    old_to_new_uuid = {}
    for _, row in script_courses_to_keep.iterrows():
        old_uuid = row['id']
        new_uuid = code_to_manual_uuid[row['code']]
        old_to_new_uuid[old_uuid] = new_uuid
    
    print(f"🔗 Created UUID mapping for {len(old_to_new_uuid)} courses (old → new)")
    
    # Filter script courses to only those in manual courses
    original_count = len(script_courses)
    filtered_script_courses = script_courses_to_keep.copy()
    filtered_count = len(filtered_script_courses)
    
    print(f"🔍 Filtered courses: {original_count} → {filtered_count} ({original_count - filtered_count} dropped)")
    
    # Update filtered script courses IDs and belong_to_faculty to match manual data
    filtered_script_courses['id'] = filtered_script_courses['code'].map(code_to_manual_uuid)
    filtered_script_courses['belong_to_faculty'] = filtered_script_courses['code'].map(code_to_faculty)
    
    # Save updated script courses
    filtered_script_courses.to_csv(script_courses_path, index=False)
    print(f"💾 Updated {script_courses_path} with UUIDs and faculty mapping")
    
    # Filter and update classes - ONLY delete unmapped script UUIDs
    classes_path = r'script_output\new_classes.csv'
    if os.path.exists(classes_path):
        classes_df = pd.read_csv(classes_path)
        original_classes = len(classes_df)
        
        # First, remove records with UUIDs that should be deleted
        classes_df_cleaned = classes_df[~classes_df['course_id'].isin(uuids_to_delete)].copy()
        deleted_classes = original_classes - len(classes_df_cleaned)
        
        # Then, update course_id from old UUID to new UUID (for mapped courses)
        classes_df_cleaned['course_id'] = classes_df_cleaned['course_id'].map(old_to_new_uuid).fillna(classes_df_cleaned['course_id'])
        
        filtered_classes_count = len(classes_df_cleaned)
        
        # Save updated classes
        classes_df_cleaned.to_csv(classes_path, index=False)
        print(f"💾 Updated classes: {original_classes} → {filtered_classes_count} ({deleted_classes} deleted from unmapped courses)")
        
        # Get valid class IDs for timing tables
        valid_class_ids = set(classes_df_cleaned['id'])
    else:
        print(f"⚠️  Classes file not found: {classes_path}")
        valid_class_ids = set()
    
    # Filter and update class timings - ONLY delete records linked to deleted classes
    timing_files = [
        (r'script_output\new_class_timing.csv', 'class timings'),
        (r'script_output\new_class_exam_timing.csv', 'exam timings')
    ]
    
    for file_path, description in timing_files:
        if os.path.exists(file_path):
            df = pd.read_csv(file_path)
            original_count = len(df)
            
            # Filter by valid class_ids (keep existing + mapped, remove only deleted class records)
            filtered_df = df[df['class_id'].isin(valid_class_ids)].copy()
            filtered_count = len(filtered_df)
            deleted_count = original_count - filtered_count
            
            # Save filtered data
            filtered_df.to_csv(file_path, index=False)
            print(f"💾 Updated {description}: {original_count} → {filtered_count} ({deleted_count} deleted)")
        else:
            print(f"⚠️  File not found: {file_path}")
    
    print("\n✅ Course synchronization completed successfully!")
    print(f"📊 Summary:")
    print(f"   • Courses: {filtered_count} kept (matched manual mapping)")
    print(f"   • {len(uuids_to_delete)} unmapped script courses deleted")
    print(f"   • UUIDs and faculty assignments updated from manual mapping")
    print(f"   • Classes: {filtered_classes_count if 'filtered_classes_count' in locals() else 'N/A'} kept")
    print(f"   • Existing non-script data preserved")
    
    return True

# Run the synchronization
sync_courses_with_manual_mapping()

🔄 Starting course synchronization...
✅ Loaded 99 manually mapped courses
✅ Loaded 141 script-generated courses
📋 Created mapping for 99 course codes
🗑️  Identified 42 script-generated UUIDs to delete (unmapped)
🔗 Created UUID mapping for 99 courses (old → new)
🔍 Filtered courses: 141 → 99 (42 dropped)
💾 Updated script_output\verify\new_courses.csv with UUIDs and faculty mapping
💾 Updated classes: 12976 → 12695 (281 deleted from unmapped courses)
💾 Updated class timings: 13084 → 12910 (174 deleted)
💾 Updated exam timings: 6904 → 6893 (11 deleted)

✅ Course synchronization completed successfully!
📊 Summary:
   • Courses: 6893 kept (matched manual mapping)
   • 42 unmapped script courses deleted
   • UUIDs and faculty assignments updated from manual mapping
   • Classes: 12695 kept
   • Existing non-script data preserved


True