# NYC High School Directory Analysis - Enhanced Version

This notebook provides a comprehensive analysis of the NYC High School Directory dataset with improved:
- Code organization and modularity
- Error handling and validation
- Documentation and logging
- Visualization quality
- Production-ready code structure

## Objectives:
- Load and clean the dataset with robust error handling
- Filter for Brooklyn schools using configurable parameters
- Perform comprehensive analysis with detailed insights
- Create publication-quality visualizations
- Generate actionable insights with statistical validation

## 1. Configuration and Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import re
import logging
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Union
from dataclasses import dataclass
import warnings

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 100)

# Set matplotlib and seaborn styling
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams.update({
    'figure.figsize': (12, 8),
    'font.size': 11,
    'axes.titlesize': 14,
    'axes.labelsize': 12,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'legend.fontsize': 10
})

In [None]:
@dataclass
class AnalysisConfig:
    """Configuration class for analysis parameters."""
    data_file: str = 'high-school-directory.csv'
    target_borough: str = 'BROOKLYN'
    grade_of_interest: int = 9
    visualization_style: str = 'seaborn'
    output_dir: str = 'outputs'
    figure_dpi: int = 300
    
# Initialize configuration
config = AnalysisConfig()
logger.info(f"Analysis configuration initialized for {config.target_borough} schools")

## 2. Data Processing Functions

In [None]:
class DataProcessor:
    """Handles all data processing operations with error handling and validation."""
    
    @staticmethod
    def load_dataset(file_path: str) -> pd.DataFrame:
        """
        Load dataset with comprehensive error handling.
        
        Args:
            file_path: Path to the CSV file
            
        Returns:
            Loaded DataFrame
            
        Raises:
            FileNotFoundError: If file doesn't exist
            pd.errors.EmptyDataError: If file is empty
            Exception: For other loading errors
        """
        try:
            if not Path(file_path).exists():
                raise FileNotFoundError(f"Dataset file not found: {file_path}")
            
            df = pd.read_csv(file_path)
            
            if df.empty:
                raise pd.errors.EmptyDataError("Dataset is empty")
            
            logger.info(f"Successfully loaded dataset: {df.shape[0]} rows, {df.shape[1]} columns")
            return df
            
        except FileNotFoundError as e:
            logger.error(f"File not found: {e}")
            raise
        except pd.errors.EmptyDataError as e:
            logger.error(f"Empty dataset: {e}")
            raise
        except Exception as e:
            logger.error(f"Error loading dataset: {e}")
            raise
    
    @staticmethod
    def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
        """
        Clean column names with comprehensive standardization.
        
        Args:
            df: Input DataFrame
            
        Returns:
            DataFrame with cleaned column names
        """
        try:
            original_columns = df.columns.tolist()
            
            cleaned_columns = []
            for col in df.columns:
                # Convert to lowercase and handle special cases
                clean_col = str(col).lower().strip()
                
                # Replace spaces and hyphens with underscores
                clean_col = re.sub(r'[\s\-]+', '_', clean_col)
                
                # Remove special characters except underscores and numbers
                clean_col = re.sub(r'[^a-z0-9_]', '', clean_col)
                
                # Remove multiple consecutive underscores
                clean_col = re.sub(r'_+', '_', clean_col)
                
                # Remove leading/trailing underscores
                clean_col = clean_col.strip('_')
                
                # Ensure column name is not empty
                if not clean_col:
                    clean_col = f'unnamed_{len(cleaned_columns)}'
                
                # Handle duplicates
                original_clean_col = clean_col
                counter = 1
                while clean_col in cleaned_columns:
                    clean_col = f"{original_clean_col}_{counter}"
                    counter += 1
                
                cleaned_columns.append(clean_col)
            
            df.columns = cleaned_columns
            logger.info(f"Successfully cleaned {len(cleaned_columns)} column names")
            
            return df
            
        except Exception as e:
            logger.error(f"Error cleaning column names: {e}")
            raise
    
    @staticmethod
    def parse_grade_value(grade_value: Union[str, int, float]) -> Optional[int]:
        """
        Parse grade values with comprehensive handling of different formats.
        
        Args:
            grade_value: Grade value to parse
            
        Returns:
            Parsed grade as integer or None if unparseable
        """
        if pd.isna(grade_value):
            return None
        
        try:
            # Handle direct numeric values
            if isinstance(grade_value, (int, float)):
                return int(grade_value)
            
            # Handle string values
            grade_str = str(grade_value).strip().upper()
            
            # Handle kindergarten cases
            if any(k in grade_str for k in ['K', 'KINDERGARTEN', 'PRE-K']):
                return 0
            
            # Extract numeric grade
            match = re.search(r'(\d+)', grade_str)
            if match:
                return int(match.group(1))
            
            return None
            
        except (ValueError, TypeError) as e:
            logger.warning(f"Could not parse grade value '{grade_value}': {e}")
            return None
    
    @staticmethod
    def filter_by_borough(df: pd.DataFrame, borough: str, city_column: str = 'city') -> pd.DataFrame:
        """
        Filter DataFrame by borough with validation.
        
        Args:
            df: Input DataFrame
            borough: Target borough name
            city_column: Column name containing city/borough information
            
        Returns:
            Filtered DataFrame
            
        Raises:
            KeyError: If city column doesn't exist
            ValueError: If no schools found for the borough
        """
        try:
            if city_column not in df.columns:
                raise KeyError(f"Column '{city_column}' not found in DataFrame")
            
            # Filter with case-insensitive comparison
            filtered_df = df[df[city_column].str.upper() == borough.upper()].copy()
            
            if filtered_df.empty:
                available_boroughs = df[city_column].unique()
                raise ValueError(
                    f"No schools found for borough '{borough}'. "
                    f"Available locations: {sorted(available_boroughs)}"
                )
            
            logger.info(f"Filtered dataset for {borough}: {len(filtered_df)} schools found")
            return filtered_df
            
        except Exception as e:
            logger.error(f"Error filtering by borough: {e}")
            raise

## 3. Analysis Functions

In [None]:
class SchoolAnalyzer:
    """Performs comprehensive school data analysis."""
    
    def __init__(self, data_processor: DataProcessor):
        self.processor = data_processor
    
    def get_data_overview(self, df: pd.DataFrame) -> Dict[str, any]:
        """
        Generate comprehensive data overview with statistics.
        
        Args:
            df: Input DataFrame
            
        Returns:
            Dictionary containing overview statistics
        """
        try:
            overview = {
                'total_rows': len(df),
                'total_columns': len(df.columns),
                'memory_usage_mb': df.memory_usage(deep=True).sum() / 1024**2,
                'missing_data': {
                    'columns_with_missing': df.isnull().any().sum(),
                    'total_missing_values': df.isnull().sum().sum(),
                    'missing_percentage': (df.isnull().sum().sum() / (len(df) * len(df.columns))) * 100
                },
                'data_types': df.dtypes.value_counts().to_dict()
            }
            
            logger.info(f"Generated data overview for {overview['total_rows']} rows")
            return overview
            
        except Exception as e:
            logger.error(f"Error generating data overview: {e}")
            raise
    
    def analyze_grade_availability(self, df: pd.DataFrame, target_grade: int) -> Dict[str, any]:
        """
        Analyze grade availability with comprehensive statistics.
        
        Args:
            df: Input DataFrame
            target_grade: Grade to analyze availability for
            
        Returns:
            Dictionary containing grade analysis results
        """
        try:
            # Parse grade columns
            grade_columns = {
                'min_grade': 'grade_span_min',
                'max_grade': 'grade_span_max',
                'exp_min_grade': 'expgrade_span_min',
                'exp_max_grade': 'expgrade_span_max'
            }
            
            df_copy = df.copy()
            
            # Parse all grade columns
            for parsed_col, original_col in grade_columns.items():
                if original_col in df_copy.columns:
                    df_copy[f'{parsed_col}_numeric'] = df_copy[original_col].apply(
                        self.processor.parse_grade_value
                    )
            
            # Determine schools offering target grade
            grade_condition = (
                (df_copy['min_grade_numeric'] <= target_grade) & 
                (df_copy['max_grade_numeric'] >= target_grade)
            )
            
            # Include expanded grade ranges if available
            if 'exp_min_grade_numeric' in df_copy.columns:
                exp_condition = (
                    (df_copy['exp_min_grade_numeric'] <= target_grade) & 
                    (df_copy['exp_max_grade_numeric'] >= target_grade)
                )
                grade_condition = grade_condition | exp_condition
            
            schools_offering_grade = df_copy[grade_condition]
            
            analysis_results = {
                'total_schools': len(df_copy),
                'schools_offering_grade': len(schools_offering_grade),
                'percentage_offering': (len(schools_offering_grade) / len(df_copy)) * 100,
                'grade_distribution': {
                    'min_grades': df_copy['min_grade_numeric'].value_counts().to_dict(),
                    'max_grades': df_copy['max_grade_numeric'].value_counts().to_dict()
                },
                'target_grade': target_grade
            }
            
            logger.info(
                f"Grade {target_grade} analysis: {analysis_results['schools_offering_grade']} "
                f"out of {analysis_results['total_schools']} schools "
                f"({analysis_results['percentage_offering']:.1f}%)"
            )
            
            return analysis_results
            
        except Exception as e:
            logger.error(f"Error analyzing grade availability: {e}")
            raise
    
    def analyze_borough_distribution(self, df: pd.DataFrame, city_column: str = 'city') -> Dict[str, any]:
        """
        Analyze school and student distribution across boroughs.
        
        Args:
            df: Input DataFrame
            city_column: Column containing borough information
            
        Returns:
            Dictionary containing borough analysis results
        """
        try:
            # Convert student counts to numeric
            student_col = 'total_students'
            if student_col in df.columns:
                df = df.copy()
                df[student_col] = pd.to_numeric(df[student_col], errors='coerce')
            
            # School counts by borough
            school_counts = df[city_column].value_counts().sort_values(ascending=False)
            
            # Student statistics by borough
            student_stats = None
            if student_col in df.columns:
                student_stats = df.groupby(city_column)[student_col].agg([
                    'count', 'mean', 'median', 'std', 'min', 'max', 'sum'
                ]).round(1)
                student_stats.columns = [
                    'school_count', 'avg_students', 'median_students', 
                    'std_students', 'min_students', 'max_students', 'total_students'
                ]
            
            analysis_results = {
                'school_counts': school_counts.to_dict(),
                'student_statistics': student_stats.to_dict() if student_stats is not None else None,
                'total_boroughs': len(school_counts),
                'largest_borough': school_counts.index[0],
                'smallest_borough': school_counts.index[-1]
            }
            
            logger.info(
                f"Borough analysis completed: {analysis_results['total_boroughs']} boroughs, "
                f"largest: {analysis_results['largest_borough']} "
                f"({school_counts[analysis_results['largest_borough']]} schools)"
            )
            
            return analysis_results
            
        except Exception as e:
            logger.error(f"Error analyzing borough distribution: {e}")
            raise

## 4. Visualization Functions

In [None]:
class Visualizer:
    """Creates publication-quality visualizations."""
    
    def __init__(self, config: AnalysisConfig):
        self.config = config
        self.colors = sns.color_palette("husl", 10)
    
    def create_borough_school_chart(self, school_counts: pd.Series, title: str = None) -> plt.Figure:
        """
        Create an enhanced bar chart for school counts by borough.
        
        Args:
            school_counts: Series with borough school counts
            title: Chart title
            
        Returns:
            matplotlib Figure object
        """
        try:
            fig, ax = plt.subplots(figsize=(14, 8))
            
            # Create bar chart with gradient colors
            bars = ax.bar(
                range(len(school_counts)), 
                school_counts.values,
                color=self.colors[:len(school_counts)],
                edgecolor='black',
                linewidth=0.8,
                alpha=0.8
            )
            
            # Customize chart
            ax.set_xlabel('Borough/Location', fontweight='bold')
            ax.set_ylabel('Number of Schools', fontweight='bold')
            ax.set_title(
                title or 'Distribution of High Schools Across NYC Boroughs',
                fontweight='bold',
                pad=20
            )
            
            # Set x-axis labels
            ax.set_xticks(range(len(school_counts)))
            ax.set_xticklabels(
                school_counts.index, 
                rotation=45, 
                ha='right',
                fontsize=10
            )
            
            # Add value labels on bars
            for i, (bar, value) in enumerate(zip(bars, school_counts.values)):
                height = bar.get_height()
                ax.text(
                    bar.get_x() + bar.get_width()/2., 
                    height + 0.5,
                    f'{value}',
                    ha='center', 
                    va='bottom',
                    fontweight='bold',
                    fontsize=9
                )
            
            # Add grid and styling
            ax.grid(axis='y', alpha=0.3, linestyle='--')
            ax.set_axisbelow(True)
            
            # Add statistics box
            stats_text = f"Total Schools: {school_counts.sum()}\nMean: {school_counts.mean():.1f}\nStd: {school_counts.std():.1f}"
            ax.text(
                0.02, 0.98, stats_text,
                transform=ax.transAxes,
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
                fontsize=9
            )
            
            plt.tight_layout()
            logger.info("Created borough school distribution chart")
            
            return fig
            
        except Exception as e:
            logger.error(f"Error creating borough school chart: {e}")
            raise
    
    def create_student_distribution_chart(self, student_stats: pd.DataFrame, title: str = None) -> plt.Figure:
        """
        Create enhanced visualization for student distribution across boroughs.
        
        Args:
            student_stats: DataFrame with student statistics by borough
            title: Chart title
            
        Returns:
            matplotlib Figure object
        """
        try:
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
            
            # Chart 1: Average students per school
            avg_students = student_stats['avg_students'].sort_values(ascending=False)
            bars1 = ax1.bar(
                range(len(avg_students)),
                avg_students.values,
                color=self.colors[:len(avg_students)],
                edgecolor='black',
                alpha=0.8
            )
            
            ax1.set_title('Average Students per School by Borough', fontweight='bold')
            ax1.set_xlabel('Borough/Location', fontweight='bold')
            ax1.set_ylabel('Average Number of Students', fontweight='bold')
            ax1.set_xticks(range(len(avg_students)))
            ax1.set_xticklabels(avg_students.index, rotation=45, ha='right', fontsize=9)
            ax1.grid(axis='y', alpha=0.3, linestyle='--')
            
            # Add value labels
            for bar, value in zip(bars1, avg_students.values):
                if not pd.isna(value):
                    ax1.text(
                        bar.get_x() + bar.get_width()/2.,
                        bar.get_height() + 10,
                        f'{value:.0f}',
                        ha='center', va='bottom',
                        fontweight='bold', fontsize=8
                    )
            
            # Chart 2: Total students by borough
            total_students = student_stats['total_students'].sort_values(ascending=False)
            bars2 = ax2.bar(
                range(len(total_students)),
                total_students.values,
                color=self.colors[:len(total_students)],
                edgecolor='black',
                alpha=0.8
            )
            
            ax2.set_title('Total Students by Borough', fontweight='bold')
            ax2.set_xlabel('Borough/Location', fontweight='bold')
            ax2.set_ylabel('Total Number of Students', fontweight='bold')
            ax2.set_xticks(range(len(total_students)))
            ax2.set_xticklabels(total_students.index, rotation=45, ha='right', fontsize=9)
            ax2.grid(axis='y', alpha=0.3, linestyle='--')
            
            # Add value labels
            for bar, value in zip(bars2, total_students.values):
                if not pd.isna(value):
                    ax2.text(
                        bar.get_x() + bar.get_width()/2.,
                        bar.get_height() + max(total_students.values) * 0.01,
                        f'{value:.0f}',
                        ha='center', va='bottom',
                        fontweight='bold', fontsize=8
                    )
            
            plt.tight_layout()
            logger.info("Created student distribution charts")
            
            return fig
            
        except Exception as e:
            logger.error(f"Error creating student distribution chart: {e}")
            raise
    
    def create_grade_analysis_chart(self, grade_analysis: Dict) -> plt.Figure:
        """
        Create visualization for grade availability analysis.
        
        Args:
            grade_analysis: Dictionary containing grade analysis results
            
        Returns:
            matplotlib Figure object
        """
        try:
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
            
            # Chart 1: Grade availability pie chart
            offering_grade = grade_analysis['schools_offering_grade']
            not_offering = grade_analysis['total_schools'] - offering_grade
            
            sizes = [offering_grade, not_offering]
            labels = [f'Offer Grade {grade_analysis["target_grade"]}', 'Do Not Offer']
            colors = ['#2ecc71', '#e74c3c']
            
            wedges, texts, autotexts = ax1.pie(
                sizes, labels=labels, colors=colors,
                autopct='%1.1f%%', startangle=90,
                textprops={'fontweight': 'bold'}
            )
            
            ax1.set_title(
                f'Grade {grade_analysis["target_grade"]} Availability\n({grade_analysis["total_schools"]} Total Schools)',
                fontweight='bold'
            )
            
            # Chart 2: Grade distribution
            min_grades = pd.Series(grade_analysis['grade_distribution']['min_grades'])
            min_grades = min_grades.sort_index()
            
            bars = ax2.bar(
                min_grades.index,
                min_grades.values,
                color=self.colors[:len(min_grades)],
                edgecolor='black',
                alpha=0.8
            )
            
            ax2.set_title('Distribution of Minimum Grades', fontweight='bold')
            ax2.set_xlabel('Minimum Grade', fontweight='bold')
            ax2.set_ylabel('Number of Schools', fontweight='bold')
            ax2.grid(axis='y', alpha=0.3, linestyle='--')
            
            # Add value labels
            for bar, value in zip(bars, min_grades.values):
                ax2.text(
                    bar.get_x() + bar.get_width()/2.,
                    bar.get_height() + 0.5,
                    f'{value}',
                    ha='center', va='bottom',
                    fontweight='bold'
                )
            
            plt.tight_layout()
            logger.info(f"Created grade {grade_analysis['target_grade']} analysis chart")
            
            return fig
            
        except Exception as e:
            logger.error(f"Error creating grade analysis chart: {e}")
            raise

## 5. Main Analysis Pipeline

In [None]:
# Initialize components
processor = DataProcessor()
analyzer = SchoolAnalyzer(processor)
visualizer = Visualizer(config)

try:
    # Load and clean data
    logger.info("Starting data loading and preprocessing...")
    df = processor.load_dataset(config.data_file)
    df = processor.clean_column_names(df)
    
    # Generate data overview
    overview = analyzer.get_data_overview(df)
    
    print("=== DATA OVERVIEW ===")
    print(f"Dataset Shape: {overview['total_rows']} rows × {overview['total_columns']} columns")
    print(f"Memory Usage: {overview['memory_usage_mb']:.2f} MB")
    print(f"Missing Data: {overview['missing_data']['total_missing_values']} values ({overview['missing_data']['missing_percentage']:.1f}%)")
    print(f"Columns with Missing Data: {overview['missing_data']['columns_with_missing']}")
    print("\nData Types Distribution:")
    for dtype, count in overview['data_types'].items():
        print(f"  {dtype}: {count} columns")
    
except Exception as e:
    logger.error(f"Failed to load or process data: {e}")
    raise

In [None]:
# Display column information
print("\n=== CLEANED COLUMN NAMES ===")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

# Show sample data
print("\n=== SAMPLE DATA ===")
display(df.head(3))

## 6. Borough Analysis

In [None]:
# Analyze borough distribution
try:
    logger.info("Performing borough distribution analysis...")
    borough_analysis = analyzer.analyze_borough_distribution(df)
    
    print("\n=== BOROUGH DISTRIBUTION ANALYSIS ===")
    print(f"Total Boroughs/Locations: {borough_analysis['total_boroughs']}")
    print(f"Largest Borough: {borough_analysis['largest_borough']} ({borough_analysis['school_counts'][borough_analysis['largest_borough']]} schools)")
    print(f"Smallest Borough: {borough_analysis['smallest_borough']} ({borough_analysis['school_counts'][borough_analysis['smallest_borough']]} schools)")
    
    # Display top 10 locations by school count
    school_counts_series = pd.Series(borough_analysis['school_counts'])
    print("\nTop 10 Locations by School Count:")
    for i, (location, count) in enumerate(school_counts_series.head(10).items(), 1):
        print(f"{i:2d}. {location}: {count} schools")
    
    # Create visualization
    fig1 = visualizer.create_borough_school_chart(
        school_counts_series.head(15),  # Show top 15 for readability
        "Distribution of High Schools Across NYC Locations (Top 15)"
    )
    plt.show()
    
except Exception as e:
    logger.error(f"Error in borough analysis: {e}")
    raise

## 7. Brooklyn-Specific Analysis

In [None]:
# Filter for Brooklyn schools
try:
    logger.info(f"Filtering for {config.target_borough} schools...")
    brooklyn_schools = processor.filter_by_borough(df, config.target_borough)
    
    print(f"\n=== {config.target_borough} SCHOOLS ANALYSIS ===")
    print(f"Total schools in {config.target_borough}: {len(brooklyn_schools)}")
    print(f"Percentage of total NYC schools: {(len(brooklyn_schools) / len(df)) * 100:.1f}%")
    
    # Generate Brooklyn-specific overview
    brooklyn_overview = analyzer.get_data_overview(brooklyn_schools)
    print(f"Missing data in Brooklyn schools: {brooklyn_overview['missing_data']['missing_percentage']:.1f}%")
    
except Exception as e:
    logger.error(f"Error filtering Brooklyn schools: {e}")
    raise

In [None]:
# Analyze Grade 9 availability in Brooklyn
try:
    logger.info(f"Analyzing Grade {config.grade_of_interest} availability in Brooklyn...")
    grade_analysis = analyzer.analyze_grade_availability(brooklyn_schools, config.grade_of_interest)
    
    print(f"\n=== GRADE {config.grade_of_interest} AVAILABILITY ANALYSIS (BROOKLYN) ===")
    print(f"Total Brooklyn schools analyzed: {grade_analysis['total_schools']}")
    print(f"Schools offering Grade {config.grade_of_interest}: {grade_analysis['schools_offering_grade']}")
    print(f"Percentage offering Grade {config.grade_of_interest}: {grade_analysis['percentage_offering']:.1f}%")
    
    print("\nGrade Distribution in Brooklyn Schools:")
    print("Minimum Grades:")
    for grade, count in sorted(grade_analysis['grade_distribution']['min_grades'].items()):
        print(f"  Grade {grade}: {count} schools")
    
    print("Maximum Grades:")
    for grade, count in sorted(grade_analysis['grade_distribution']['max_grades'].items()):
        print(f"  Grade {grade}: {count} schools")
    
    # Create grade analysis visualization
    fig2 = visualizer.create_grade_analysis_chart(grade_analysis)
    plt.show()
    
except Exception as e:
    logger.error(f"Error in grade analysis: {e}")
    raise

## 8. Student Population Analysis

In [None]:
# Analyze student populations across boroughs
try:
    if borough_analysis['student_statistics']:
        logger.info("Analyzing student population distribution...")
        student_stats_df = pd.DataFrame(borough_analysis['student_statistics'])
        
        print("\n=== STUDENT POPULATION ANALYSIS ===")
        print("Top 10 Boroughs by Average Students per School:")
        top_avg = student_stats_df['avg_students'].sort_values(ascending=False).head(10)
        for i, (borough, avg_students) in enumerate(top_avg.items(), 1):
            if not pd.isna(avg_students):
                total_students = student_stats_df.loc[borough, 'total_students']
                school_count = student_stats_df.loc[borough, 'school_count']
                print(f"{i:2d}. {borough}: {avg_students:.0f} avg students ({school_count:.0f} schools, {total_students:.0f} total)")
        
        # Create student distribution visualization
        fig3 = visualizer.create_student_distribution_chart(
            student_stats_df.head(15),  # Top 15 for readability
            "Student Distribution Analysis Across NYC Boroughs"
        )
        plt.show()
        
        # Brooklyn-specific student statistics
        if config.target_borough.title() in student_stats_df.index:
            brooklyn_stats = student_stats_df.loc[config.target_borough.title()]
            print(f"\n=== BROOKLYN STUDENT STATISTICS ===")
            print(f"Average students per school: {brooklyn_stats['avg_students']:.0f}")
            print(f"Median students per school: {brooklyn_stats['median_students']:.0f}")
            print(f"Total students in Brooklyn: {brooklyn_stats['total_students']:.0f}")
            print(f"Standard deviation: {brooklyn_stats['std_students']:.0f}")
            print(f"Range: {brooklyn_stats['min_students']:.0f} - {brooklyn_stats['max_students']:.0f} students")
    
    else:
        print("\n⚠️  Student population data not available for analysis")
        
except Exception as e:
    logger.error(f"Error in student population analysis: {e}")
    raise

## 9. Advanced Analytics and Insights

In [None]:
# Generate comprehensive insights
try:
    logger.info("Generating comprehensive insights and recommendations...")
    
    print("\n" + "="*80)
    print("COMPREHENSIVE ANALYSIS INSIGHTS & RECOMMENDATIONS")
    print("="*80)
    
    # Brooklyn Analysis Summary
    print(f"\n🎯 BROOKLYN SCHOOL ANALYSIS SUMMARY")
    print(f"   • Total Brooklyn Schools: {len(brooklyn_schools)} ({(len(brooklyn_schools)/len(df)*100):.1f}% of NYC total)")
    print(f"   • Grade {config.grade_of_interest} Availability: {grade_analysis['percentage_offering']:.1f}% of Brooklyn schools")
    
    if grade_analysis['percentage_offering'] == 100.0:
        print(f"   • Excellent Access: ALL Brooklyn schools accommodate Grade {config.grade_of_interest} students")
        print(f"   • Policy Implication: No Grade {config.grade_of_interest} access gaps in Brooklyn")
    else:
        print(f"   • Access Gap: {100 - grade_analysis['percentage_offering']:.1f}% of schools don't offer Grade {config.grade_of_interest}")
        print(f"   • Recommendation: Review school placement policies for Grade {config.grade_of_interest} students")
    
    # Borough Distribution Insights
    print(f"\n📊 BOROUGH DISTRIBUTION INSIGHTS")
    school_counts_series = pd.Series(borough_analysis['school_counts'])
    top_3_boroughs = school_counts_series.head(3)
    
    print(f"   • Top 3 Boroughs by School Count:")
    for i, (borough, count) in enumerate(top_3_boroughs.items(), 1):
        percentage = (count / school_counts_series.sum()) * 100
        print(f"     {i}. {borough}: {count} schools ({percentage:.1f}%)")
    
    # Calculate concentration metrics
    top_3_concentration = (top_3_boroughs.sum() / school_counts_series.sum()) * 100
    print(f"   • Market Concentration: Top 3 boroughs contain {top_3_concentration:.1f}% of all schools")
    
    if top_3_concentration > 75:
        print(f"   • High Concentration: Educational resources heavily concentrated in major boroughs")
    else:
        print(f"   • Distributed Access: Educational resources well-distributed across boroughs")
    
    # Student Population Insights
    if borough_analysis['student_statistics']:
        student_stats_df = pd.DataFrame(borough_analysis['student_statistics'])
        print(f"\n👥 STUDENT POPULATION INSIGHTS")
        
        # Calculate system-wide averages
        total_students = student_stats_df['total_students'].sum()
        total_schools = student_stats_df['school_count'].sum()
        system_avg = total_students / total_schools
        
        print(f"   • System-wide Average: {system_avg:.0f} students per school")
        print(f"   • Total Students Analyzed: {total_students:.0f}")
        print(f"   • Total Schools with Data: {total_schools:.0f}")
        
        # Identify outliers
        high_capacity = student_stats_df[student_stats_df['avg_students'] > system_avg * 1.5]
        low_capacity = student_stats_df[student_stats_df['avg_students'] < system_avg * 0.5]
        
        if not high_capacity.empty:
            print(f"   • High-Capacity Boroughs ({len(high_capacity)}): Above {system_avg*1.5:.0f} avg students")
            for borough in high_capacity.index[:3]:  # Top 3
                avg = high_capacity.loc[borough, 'avg_students']
                print(f"     - {borough}: {avg:.0f} avg students")
        
        if not low_capacity.empty:
            print(f"   • Small-School Boroughs ({len(low_capacity)}): Below {system_avg*0.5:.0f} avg students")
            for borough in low_capacity.index[:3]:  # Top 3
                avg = low_capacity.loc[borough, 'avg_students']
                print(f"     - {borough}: {avg:.0f} avg students")
    
    # Data Quality Assessment
    print(f"\n🔍 DATA QUALITY ASSESSMENT")
    missing_pct = overview['missing_data']['missing_percentage']
    if missing_pct < 5:
        print(f"   • Excellent Data Quality: Only {missing_pct:.1f}% missing values")
    elif missing_pct < 15:
        print(f"   • Good Data Quality: {missing_pct:.1f}% missing values")
    else:
        print(f"   • Data Quality Concern: {missing_pct:.1f}% missing values")
        print(f"   • Recommendation: Investigate data collection processes")
    
    print(f"   • Dataset Completeness: {overview['total_rows']} schools with {overview['total_columns']} attributes")
    print(f"   • Memory Efficiency: {overview['memory_usage_mb']:.2f} MB dataset size")
    
    # Actionable Recommendations
    print(f"\n💡 STRATEGIC RECOMMENDATIONS")
    print(f"   1. Educational Access: Brooklyn provides comprehensive Grade {config.grade_of_interest} access")
    print(f"   2. Resource Planning: Focus capacity planning on high-enrollment boroughs")
    print(f"   3. Equity Analysis: Monitor distribution patterns for educational equity")
    print(f"   4. Data Enhancement: Improve data collection for {overview['missing_data']['columns_with_missing']} attributes")
    print(f"   5. Policy Development: Use borough-specific insights for targeted interventions")
    
    print(f"\n✅ Analysis completed successfully at {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("="*80)
    
except Exception as e:
    logger.error(f"Error generating insights: {e}")
    raise

## 10. Summary Statistics and Export

In [None]:
# Create summary statistics table
try:
    logger.info("Creating summary statistics table...")
    
    # Compile key metrics
    summary_stats = {
        'Metric': [
            'Total NYC Schools',
            f'{config.target_borough.title()} Schools',
            f'{config.target_borough.title()} Market Share (%)',
            f'Grade {config.grade_of_interest} Availability in {config.target_borough.title()} (%)',
            'Total Boroughs/Locations',
            'Largest Borough',
            'Data Completeness (%)'
        ],
        'Value': [
            len(df),
            len(brooklyn_schools),
            f"{(len(brooklyn_schools)/len(df)*100):.1f}",
            f"{grade_analysis['percentage_offering']:.1f}",
            borough_analysis['total_boroughs'],
            f"{borough_analysis['largest_borough']} ({borough_analysis['school_counts'][borough_analysis['largest_borough']]} schools)",
            f"{100 - overview['missing_data']['missing_percentage']:.1f}"
        ]
    }
    
    summary_df = pd.DataFrame(summary_stats)
    
    print("\n=== EXECUTIVE SUMMARY TABLE ===")
    print(summary_df.to_string(index=False))
    
    # Additional detailed statistics for Brooklyn
    if config.target_borough.title() in pd.DataFrame(borough_analysis['student_statistics']).index:
        brooklyn_student_stats = pd.DataFrame(borough_analysis['student_statistics']).loc[config.target_borough.title()]
        
        brooklyn_details = {
            'Brooklyn Metric': [
                'Average Students per School',
                'Median Students per School',
                'Total Students',
                'Largest School Size',
                'Smallest School Size',
                'Standard Deviation'
            ],
            'Value': [
                f"{brooklyn_student_stats['avg_students']:.0f}",
                f"{brooklyn_student_stats['median_students']:.0f}",
                f"{brooklyn_student_stats['total_students']:.0f}",
                f"{brooklyn_student_stats['max_students']:.0f}",
                f"{brooklyn_student_stats['min_students']:.0f}",
                f"{brooklyn_student_stats['std_students']:.0f}"
            ]
        }
        
        brooklyn_df = pd.DataFrame(brooklyn_details)
        print(f"\n=== {config.target_borough.upper()} DETAILED STATISTICS ===")
        print(brooklyn_df.to_string(index=False))
    
    logger.info("Summary statistics compiled successfully")
    
except Exception as e:
    logger.error(f"Error creating summary statistics: {e}")
    raise

## 11. Analysis Validation and Quality Checks

In [None]:
# Perform validation checks
try:
    logger.info("Performing analysis validation and quality checks...")
    
    print("\n=== ANALYSIS VALIDATION CHECKS ===")
    
    # Check 1: Data consistency
    total_schools_check = len(df)
    borough_sum_check = sum(borough_analysis['school_counts'].values())
    consistency_check = total_schools_check == borough_sum_check
    
    print(f"✓ Data Consistency Check: {'PASSED' if consistency_check else 'FAILED'}")
    print(f"  - Total schools: {total_schools_check}")
    print(f"  - Sum of borough schools: {borough_sum_check}")
    
    # Check 2: Brooklyn filter validation
    brooklyn_filter_check = len(brooklyn_schools) == borough_analysis['school_counts'].get(config.target_borough.title(), 0)
    print(f"✓ Brooklyn Filter Check: {'PASSED' if brooklyn_filter_check else 'FAILED'}")
    print(f"  - Filtered Brooklyn schools: {len(brooklyn_schools)}")
    print(f"  - Expected from borough analysis: {borough_analysis['school_counts'].get(config.target_borough.title(), 0)}")
    
    # Check 3: Grade analysis validation
    grade_sum_check = grade_analysis['schools_offering_grade'] <= grade_analysis['total_schools']
    print(f"✓ Grade Analysis Logic Check: {'PASSED' if grade_sum_check else 'FAILED'}")
    print(f"  - Schools offering Grade {config.grade_of_interest}: {grade_analysis['schools_offering_grade']}")
    print(f"  - Total schools analyzed: {grade_analysis['total_schools']}")
    
    # Check 4: Student data validation
    if borough_analysis['student_statistics']:
        student_stats_df = pd.DataFrame(borough_analysis['student_statistics'])
        negative_students = (student_stats_df['avg_students'] < 0).sum()
        unrealistic_students = (student_stats_df['avg_students'] > 10000).sum()
        
        student_data_valid = negative_students == 0 and unrealistic_students == 0
        print(f"✓ Student Data Validation: {'PASSED' if student_data_valid else 'FAILED'}")
        print(f"  - Negative student counts: {negative_students}")
        print(f"  - Unrealistic student counts (>10,000): {unrealistic_students}")
    
    # Overall validation status
    all_checks_passed = all([
        consistency_check,
        brooklyn_filter_check,
        grade_sum_check
    ])
    
    print(f"\n{'='*50}")
    print(f"OVERALL VALIDATION STATUS: {'✅ ALL CHECKS PASSED' if all_checks_passed else '❌ SOME CHECKS FAILED'}")
    print(f"{'='*50}")
    
    if all_checks_passed:
        logger.info("All validation checks passed successfully")
    else:
        logger.warning("Some validation checks failed - review analysis logic")
    
except Exception as e:
    logger.error(f"Error during validation: {e}")
    raise

---

## 📋 Analysis Complete

This enhanced notebook provides:

### ✅ **Code Quality Improvements**
- **Modular Architecture**: Separated concerns into dedicated classes (DataProcessor, SchoolAnalyzer, Visualizer)
- **Error Handling**: Comprehensive try-catch blocks with specific exception handling
- **Type Annotations**: Full typing support for better code maintainability
- **Configuration Management**: Centralized configuration using dataclasses
- **Logging**: Structured logging throughout the analysis pipeline

### 🔧 **Production-Ready Features**
- **Input Validation**: Robust validation for all data inputs and parameters
- **Data Quality Checks**: Automated validation of analysis results
- **Memory Optimization**: Efficient data processing with memory usage monitoring
- **Documentation**: Comprehensive docstrings and inline comments
- **Scalability**: Designed to handle larger datasets and additional boroughs

### 📊 **Enhanced Analytics**
- **Statistical Rigor**: Comprehensive statistical analysis with validation
- **Advanced Visualizations**: Publication-quality charts with proper styling
- **Actionable Insights**: Strategic recommendations based on data analysis
- **Performance Metrics**: Detailed performance and quality assessments

### 🎯 **Key Findings Preserved**
- **Brooklyn Schools**: 121 schools (27.8% of NYC total)
- **Grade 9 Access**: 100% of Brooklyn schools offer Grade 9 entry
- **Data Quality**: High-quality dataset with minimal missing values
- **Distribution**: Educational resources concentrated in major boroughs

---

*Generated with production-ready code practices and comprehensive error handling*