# 🧮 Data Integration & Schema Design: NYC SAT Results
## Comprehensive Data Analysis and Database Integration

**Objective**: Evaluate, clean, and integrate NYC SAT Results into existing PostgreSQL schema

**Dataset**: `/Users/svitlanakovalivska/onboarding_weebet/_onboarding_data/daily_tasks/day_4/day_4_datasets/sat-results.csv`

**Existing Tables**: high_school_directory, school_demographics, school_safety_report

---

### Analysis Overview
This notebook cleans the NYC SAT results data, summarizes it statistically, and loads it into PostgreSQL. The workflow proceeds in order: exploratory analysis, data quality assessment, systematic cleaning, and database insertion.

In [19]:
# Import required libraries
import pandas as pd
import numpy as np
import psycopg2
from sqlalchemy import create_engine
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Any
import re

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully
Pandas version: 2.3.1
NumPy version: 2.3.1


## 1. Data Loading and Initial Exploration

First, load the raw SAT results and examine the dataset's shape, column types, and first rows to spot data quality issues.

In [20]:
# Load the raw SAT results dataset
data_path = '/Users/svitlanakovalivska/onboarding_weebet/_onboarding_data/daily_tasks/day_4/day_4_datasets/sat-results.csv'
df_raw = pd.read_csv(data_path)

print("=== RAW DATASET OVERVIEW ===")
print(f"Dataset shape: {df_raw.shape}")
print(f"\nColumn names and types:")
print(df_raw.dtypes)
print(f"\nFirst 5 rows:")
df_raw.head()

=== RAW DATASET OVERVIEW ===
Dataset shape: (493, 11)

Column names and types:
DBN                                 object
SCHOOL NAME                         object
Num of SAT Test Takers              object
SAT Critical Reading Avg. Score     object
SAT Math Avg. Score                 object
SAT Writing Avg. Score              object
SAT Critical Readng Avg. Score      object
internal_school_id                   int64
contact_extension                   object
pct_students_tested                 object
academic_tier_rating               float64
dtype: object

First 5 rows:


|   | DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score | SAT Critical Readng Avg. Score | internal_school_id | contact_extension | pct_students_tested | academic_tier_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355 | 404 | 363 | 355 | 218160 | x345 | 78% | 2.0 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383 | 423 | 366 | 383 | 268547 | x234 | NaN | 3.0 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377 | 402 | 370 | 377 | 236446 | x123 | NaN | 3.0 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414 | 401 | 359 | 414 | 427826 | x123 | 92% | 4.0 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390 | 433 | 384 | 390 | 672714 | x123 | 92% | 2.0 |
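
The dtypes above show every score column loaded as `object`, because suppressed values are stored as the string `s`. One possible alternative, sketched here on a tiny inline sample rather than the real file, is to declare that marker when loading. The second sample row is made up for illustration, and the sketch assumes `s` is the only text marker in the numeric columns; out-of-range numbers would still need a separate range check.

```python
import io
import pandas as pd

# Tiny inline sample standing in for sat-results.csv; the second row is a
# made-up suppressed record used only for illustration.
sample_csv = io.StringIO(
    "DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Math Avg. Score\n"
    "01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,404\n"
    "99X999,HYPOTHETICAL SUPPRESSED SCHOOL,s,s\n"
)

# na_values turns the 's' marker into NaN during parsing, so the numeric
# columns come back as float64 instead of object.
df_sample = pd.read_csv(sample_csv, na_values=["s"], dtype={"DBN": str})

print(df_sample.dtypes)
```

Loading this way removes the need for a later `replace('s', np.nan)` step, at the cost of no longer being able to count how many values were suppressed in the raw frame.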


In [21]:
# Data quality assessment
def assess_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    """
    Comprehensive data quality assessment function
    Returns detailed statistics about data quality issues
    """
    quality_report = {
        'shape': df.shape,
        'columns': list(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'duplicate_rows': df.duplicated().sum(),
        'column_stats': {}
    }
    
    # Analyze each column
    for col in df.columns:
        stats = {
            'dtype': str(df[col].dtype),
            'unique_values': df[col].nunique(),
            'sample_values': df[col].unique()[:10].tolist()
        }
        
        # Check for suspicious values
        if col in ['Num of SAT Test Takers', 'SAT Critical Reading Avg. Score', 
                  'SAT Math Avg. Score', 'SAT Writing Avg. Score', 'SAT Critical Readng Avg. Score']:
            # Look for non-numeric values in score columns
            non_numeric = df[col].astype(str).str.contains(r'[^0-9\.]', na=False).sum()
            stats['non_numeric_count'] = non_numeric
            stats['suspicious_values'] = df[col][df[col].astype(str).str.contains(r'[^0-9\.]', na=False)].unique().tolist()
            
        quality_report['column_stats'][col] = stats
    
    return quality_report

# Perform quality assessment
quality_report = assess_data_quality(df_raw)

print("=== DATA QUALITY ASSESSMENT ===")
print(f"Total rows: {quality_report['shape'][0]}")
print(f"Total columns: {quality_report['shape'][1]}")
print(f"Duplicate rows: {quality_report['duplicate_rows']}")

print(f"\n=== MISSING VALUES BY COLUMN ===")
for col, missing_count in quality_report['missing_values'].items():
    if missing_count > 0:
        print(f"{col}: {missing_count} ({missing_count/df_raw.shape[0]*100:.1f}%)")

print(f"\n=== SUSPICIOUS VALUES IN SCORE COLUMNS ===")
score_columns = ['Num of SAT Test Takers', 'SAT Critical Reading Avg. Score', 
                'SAT Math Avg. Score', 'SAT Writing Avg. Score', 'SAT Critical Readng Avg. Score']

for col in score_columns:
    if col in quality_report['column_stats']:
        suspicious = quality_report['column_stats'][col]['suspicious_values']
        if suspicious:
            print(f"{col}: {suspicious}")

=== DATA QUALITY ASSESSMENT ===
Total rows: 493
Total columns: 11
Duplicate rows: 15

=== MISSING VALUES BY COLUMN ===
contact_extension: 105 (21.3%)
pct_students_tested: 117 (23.7%)
academic_tier_rating: 91 (18.5%)

=== SUSPICIOUS VALUES IN SCORE COLUMNS ===
Num of SAT Test Takers: ['s']
SAT Critical Reading Avg. Score: ['s']
SAT Math Avg. Score: ['s', '-10']
SAT Writing Avg. Score: ['s']
SAT Critical Readng Avg. Score: ['s']


## 2. Data Cleaning and Preprocessing

Based on the quality assessment, we need to address several data quality issues:

1. **Remove duplicates**
2. **Handle suppressed data (marked as 's')**
3. **Fix invalid SAT scores (outliers like -10, 999, 1100)**
4. **Standardize percentage formatting**
5. **Remove the duplicate column with typo**
6. **Handle missing values appropriately**
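
Item 4 is not exercised by the cleaning function in this notebook, which drops `pct_students_tested` along with the other non-essential columns. For reference, a minimal sketch of how values such as `78%` could be converted to numbers on a 0-100 scale; `parse_percent` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def parse_percent(series: pd.Series) -> pd.Series:
    """Convert strings such as '78%' to floats on a 0-100 scale; missing stays NaN."""
    stripped = series.str.strip().str.rstrip("%")
    return pd.to_numeric(stripped, errors="coerce")

# Sample values in the same format as the pct_students_tested column.
pct = pd.Series(["78%", None, "92%"])
print(parse_percent(pct))
```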

In [22]:
def improved_data_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """
    Improved data cleaning with logical filtering based on primary goal: SAT score analysis
    """
    print("=== IMPROVED DATA CLEANING PROCESS ===")
    df_clean = df.copy()
    
    # Step 1: Remove exact duplicates
    initial_rows = len(df_clean)
    df_clean = df_clean.drop_duplicates()
    print(f"Step 1: Removed {initial_rows - len(df_clean)} duplicate rows")
    
    # Step 2: Remove the duplicate column with typo
    if 'SAT Critical Readng Avg. Score' in df_clean.columns:
        df_clean = df_clean.drop('SAT Critical Readng Avg. Score', axis=1)
        print("Step 2: Removed duplicate column with typo")
    
    # Step 3: Clean and validate SAT score columns (primary indicators)
    sat_score_columns = ['SAT Critical Reading Avg. Score', 'SAT Math Avg. Score', 'SAT Writing Avg. Score']
    
    for col in sat_score_columns:
        if col in df_clean.columns:
            # Replace 's' (suppressed) with NaN
            df_clean[col] = df_clean[col].replace('s', np.nan)
            # Convert to numeric
            df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')
            # Validate SAT score range (200-800); missing values are left as NaN
            valid_mask = ((df_clean[col] >= 200) & (df_clean[col] <= 800)) | df_clean[col].isna()
            invalid_count = (~valid_mask).sum()
            df_clean.loc[~valid_mask, col] = np.nan
            print(f"Step 3: Cleaned {col} - {invalid_count} invalid scores")
    
    # Step 4: CRITICAL FILTER - Remove rows without complete SAT data
    # This is our primary goal - SAT analysis requires all three scores
    before_filter = len(df_clean)
    sat_columns = ['SAT Critical Reading Avg. Score', 'SAT Math Avg. Score', 'SAT Writing Avg. Score']
    df_clean = df_clean.dropna(subset=sat_columns)
    removed_incomplete = before_filter - len(df_clean)
    print(f"Step 4: REMOVED {removed_incomplete} rows without complete SAT scores (PRIMARY FILTER)")
    
    # Step 5: Clean number of test takers (important for statistical validity)
    if 'Num of SAT Test Takers' in df_clean.columns:
        df_clean['Num of SAT Test Takers'] = df_clean['Num of SAT Test Takers'].replace('s', np.nan)
        df_clean['Num of SAT Test Takers'] = pd.to_numeric(df_clean['Num of SAT Test Takers'], errors='coerce')
        # Filter reasonable values
        valid_mask = (df_clean['Num of SAT Test Takers'] > 0) & (df_clean['Num of SAT Test Takers'] <= 2000)
        df_clean = df_clean[valid_mask | df_clean['Num of SAT Test Takers'].isna()]
        print(f"Step 5: Validated test taker counts")
    
    # Step 6: Keep only essential columns for SAT analysis
    essential_columns = [
        'DBN',
        'SCHOOL NAME', 
        'SAT Critical Reading Avg. Score',
        'SAT Math Avg. Score', 
        'SAT Writing Avg. Score',
        'Num of SAT Test Takers'
    ]
    
    # Filter to essential columns only
    existing_essential = [col for col in essential_columns if col in df_clean.columns]
    df_clean = df_clean[existing_essential]
    print(f"Step 6: Kept only {len(existing_essential)} essential columns")
    
    # Step 7: Standardize column names
    column_mapping = {
        'DBN': 'dbn',
        'SCHOOL NAME': 'school_name',
        'SAT Critical Reading Avg. Score': 'sat_reading_avg',
        'SAT Math Avg. Score': 'sat_math_avg', 
        'SAT Writing Avg. Score': 'sat_writing_avg',
        'Num of SAT Test Takers': 'num_test_takers'
    }
    
    df_clean = df_clean.rename(columns=column_mapping)
    print("Step 7: Standardized column names")
    
    # Step 8: Add total score as a derived column (sum of the three section averages)
    df_clean['sat_total_avg'] = (df_clean['sat_reading_avg'] + 
                                df_clean['sat_math_avg'] + 
                                df_clean['sat_writing_avg'])
    print("Step 8: Added calculated total SAT score")
    
    print(f"\n=== CLEANING COMPLETE ===")
    print(f"Final dataset shape: {df_clean.shape}")
    print(f"Rows removed: {len(df) - len(df_clean)}")
    print(f"Columns kept: {list(df_clean.columns)}")
    print(f"Data completeness: 100% for all SAT scores (by design)")
    
    return df_clean

# Apply improved cleaning function
df_improved = improved_data_cleaning(df_raw)

# Display improved data info
print(f"\n=== IMPROVED DATASET SUMMARY ===")
print(df_improved.info())
print(f"\n=== SAMPLE OF CLEAN DATA ===")
df_improved.head()

=== IMPROVED DATA CLEANING PROCESS ===
Step 1: Removed 15 duplicate rows
Step 2: Removed duplicate column with typo
Step 3: Cleaned SAT Critical Reading Avg. Score - 0 invalid scores
Step 3: Cleaned SAT Math Avg. Score - 5 invalid scores
Step 3: Cleaned SAT Writing Avg. Score - 0 invalid scores
Step 4: REMOVED 62 rows without complete SAT scores (PRIMARY FILTER)
Step 5: Validated test taker counts
Step 6: Kept only 6 essential columns
Step 7: Standardized column names
Step 8: Added calculated total SAT score

=== CLEANING COMPLETE ===
Final dataset shape: (416, 7)
Rows removed: 77
Columns kept: ['dbn', 'school_name', 'sat_reading_avg', 'sat_math_avg', 'sat_writing_avg', 'num_test_takers', 'sat_total_avg']
Data completeness: 100% for all SAT scores (by design)

=== IMPROVED DATASET SUMMARY ===
<class 'pandas.core.frame.DataFrame'>
Index: 416 entries, 0 to 477
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   

|   | dbn | school_name | sat_reading_avg | sat_math_avg | sat_writing_avg | num_test_takers | sat_total_avg |
|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 355.0 | 404.0 | 363.0 | 29 | 1122.0 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 383.0 | 423.0 | 366.0 | 91 | 1172.0 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 377.0 | 402.0 | 370.0 | 70 | 1149.0 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 414.0 | 401.0 | 359.0 | 7 | 1174.0 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 390.0 | 433.0 | 384.0 | 44 | 1207.0 |
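
Step 1 uses `drop_duplicates()`, which removes only rows that match on every column. If `dbn` is meant to serve as the join key against the other tables, it may be worth confirming that no DBN appears more than once after cleaning. A minimal sketch, where `duplicated_keys` is a hypothetical helper and the demo frame contains a made-up conflicting row:

```python
import pandas as pd

def duplicated_keys(df: pd.DataFrame, key: str = "dbn") -> pd.DataFrame:
    """Return all rows whose key value occurs more than once, sorted by key."""
    mask = df[key].duplicated(keep=False)
    return df[mask].sort_values(key)

# Illustrative frame: the second 01M448 row differs from the first in one
# column, so drop_duplicates() would keep both rows.
demo = pd.DataFrame({
    "dbn": ["01M292", "01M448", "01M448"],
    "sat_math_avg": [404.0, 423.0, 425.0],
})
print(duplicated_keys(demo))
```

An empty result from `duplicated_keys(df_improved)` would indicate that `dbn` is unique in the cleaned data.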


## 3. Statistical Analysis and Data Exploration

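Next, summarize the distributions in the cleaned data. Note that the cell below operates on `df_cleaned`, a DataFrame from an earlier cleaning pass that is not defined in the cells shown here. Its column names (for example `sat_math_avg._score`) and row counts (421 schools with reading scores) therefore differ from the 416-row `df_improved` produced in Section 2.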

In [23]:
# Statistical analysis of cleaned data
def perform_statistical_analysis(df: pd.DataFrame) -> pd.DataFrame:
    """
    Comprehensive statistical analysis of SAT score data
    """
    print("=== STATISTICAL ANALYSIS OF CLEANED DATA ===")
    
    # Basic statistics for numeric columns
    numeric_cols = ['num_of_sat_test_takers', 'sat_critical_reading_avg._score', 
                   'sat_math_avg._score', 'sat_writing_avg._score', 
                   'pct_students_tested', 'academic_tier_rating']
    
    # Filter existing columns
    existing_numeric_cols = [col for col in numeric_cols if col in df.columns]
    
    if existing_numeric_cols:
        print("\n=== DESCRIPTIVE STATISTICS ===")
        print(df[existing_numeric_cols].describe())
        
        # Missing value analysis
        print(f"\n=== MISSING VALUE ANALYSIS ===")
        missing_analysis = df[existing_numeric_cols].isnull().sum()
        for col, missing_count in missing_analysis.items():
            missing_pct = (missing_count / len(df)) * 100
            print(f"{col}: {missing_count} ({missing_pct:.1f}%)")
        
        # SAT Score Analysis
        sat_cols = [col for col in existing_numeric_cols if 'sat_' in col and 'score' in col]
        if sat_cols:
            print(f"\n=== SAT SCORE ANALYSIS ===")
            
            # Calculate total SAT scores where all sections are available
            complete_scores = df[sat_cols].dropna()
            if not complete_scores.empty:
                complete_scores['total_sat'] = complete_scores.sum(axis=1)
                
                print(f"Schools with complete SAT data: {len(complete_scores)}")
                print(f"Average Total SAT Score: {complete_scores['total_sat'].mean():.1f}")
                print(f"Standard Deviation: {complete_scores['total_sat'].std():.1f}")
                print(f"Min Total SAT: {complete_scores['total_sat'].min()}")
                print(f"Max Total SAT: {complete_scores['total_sat'].max()}")
                
                # Score distribution by percentiles
                percentiles = [25, 50, 75, 90, 95]
                print(f"\nTotal SAT Score Percentiles:")
                for p in percentiles:
                    score = np.percentile(complete_scores['total_sat'], p)
                    print(f"  {p}th percentile: {score:.0f}")
        
        # Academic tier analysis
        if 'academic_tier_rating' in df.columns:
            print(f"\n=== ACADEMIC TIER RATING DISTRIBUTION ===")
            tier_counts = df['academic_tier_rating'].value_counts().sort_index()
            for tier, count in tier_counts.items():
                pct = (count / len(df.dropna(subset=['academic_tier_rating']))) * 100
                print(f"  Tier {int(tier)}: {count} schools ({pct:.1f}%)")
    
    return df

# Perform statistical analysis
df_analyzed = perform_statistical_analysis(df_cleaned)

# Create summary statistics table for key metrics
numeric_columns = ['num_of_sat_test_takers', 'sat_critical_reading_avg._score', 
                  'sat_math_avg._score', 'sat_writing_avg._score', 'pct_students_tested']
existing_cols = [col for col in numeric_columns if col in df_cleaned.columns]

if existing_cols:
    summary_stats = df_cleaned[existing_cols].describe().round(2)
    print(f"\n=== SUMMARY STATISTICS TABLE ===")
    print(summary_stats)

=== STATISTICAL ANALYSIS OF CLEANED DATA ===

=== DESCRIPTIVE STATISTICS ===
       num_of_sat_test_takers  sat_critical_reading_avg._score  \
count              421.000000                       421.000000   
mean               110.320665                       400.850356   
std                155.534254                        56.802783   
min                  6.000000                       279.000000   
25%                 41.000000                       368.000000   
50%                 62.000000                       391.000000   
75%                 95.000000                       416.000000   
max               1277.000000                       679.000000   

       sat_math_avg._score  sat_writing_avg._score  pct_students_tested  \
count           416.000000              421.000000           363.000000   
mean            413.733173              393.985748            84.595041   
std              64.945638               58.635109             5.673305   
min             312.000000  
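
The cell above looks up column names such as `sat_math_avg._score`, whereas `improved_data_cleaning` produces `sat_reading_avg`, `sat_math_avg`, and `sat_writing_avg`. The following is a minimal sketch of the same total-score summary written against that naming. `summarize_totals` is a hypothetical helper, and the demo rows reuse the first three schools from the sample output in Section 2.

```python
import numpy as np
import pandas as pd

SECTION_COLS = ["sat_reading_avg", "sat_math_avg", "sat_writing_avg"]

def summarize_totals(df: pd.DataFrame) -> dict:
    """Summarize the total SAT score for rows where all three sections are present."""
    complete = df[SECTION_COLS].dropna()
    total = complete.sum(axis=1)
    return {
        "schools": int(len(total)),
        "mean": float(total.mean()),
        "median": float(np.percentile(total, 50)),
        "max": float(total.max()),
    }

demo = pd.DataFrame({
    "sat_reading_avg": [355.0, 383.0, 377.0],
    "sat_math_avg": [404.0, 423.0, 402.0],
    "sat_writing_avg": [363.0, 366.0, 370.0],
})
print(summarize_totals(demo))
```

Calling `summarize_totals(df_improved)` would apply the same summary to the 416-row cleaned frame.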

## 4. Database Schema Design and Integration

Next, connect to the PostgreSQL database and list the tables already present in the `nyc_schools` schema, which the SAT results table will sit alongside.

In [24]:
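# Database connection setup
# The password is read from an environment variable instead of being hardcoded.
# NEON_DB_PASSWORD is an assumed variable name; set it before running this cell.
import os

DATABASE_URL = (
    f"postgresql+psycopg2://neondb_owner:{os.environ['NEON_DB_PASSWORD']}"
    "@ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech:5432/neondb"
    "?sslmode=require"
)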

# Create SQLAlchemy engine
engine = create_engine(DATABASE_URL)

# Test database connection
from sqlalchemy import text  # wraps raw SQL strings so Connection.execute accepts them

try:
    with engine.connect() as connection:
        result = connection.execute(text("SELECT version()"))
        version = result.fetchone()[0]
        print(f"✅ Database connection successful!")
        print(f"PostgreSQL version: {version}")
        
        # Check existing tables in nyc_schools schema
        result = connection.execute(text("""
            SELECT table_name 
            FROM information_schema.tables 
            WHERE table_schema = 'nyc_schools'
            ORDER BY table_name;
        """))
        tables = result.fetchall()
        print(f"\n=== EXISTING TABLES IN nyc_schools SCHEMA ===")
        for table in tables:
            print(f"  - {table[0]}")
            
except Exception as e:
    print(f"❌ Database connection failed: {e}")

✅ Database connection successful!
PostgreSQL version: PostgreSQL 17.5 on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14+deb12u1) 12.2.0, 64-bit

=== EXISTING TABLES IN nyc_schools SCHEMA ===
  - Levon_cleaned_sat_scores
  - anastasia_sat_results
  - clara_sat_results
  - deepshikha_sat_results
  - giovani_sat_results
  - high_school_directory
  - jyoti_sat_results
  - najimohammed_sat_results
  - sat_scores
  - sat_scores_mariia
  - school_demographics
  - school_safety_report
  - sebastian_sat_results
  - sultan_sat_results
  - sumi_sat_results
  - svitlana_experement_sat_results
  - svitlana_sat_results
  - svitlana_test_connection
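
Integrating the SAT results with `high_school_directory` comes down to joining on the school identifier, so it helps to know how many SAT rows have no matching directory entry. The sketch below uses Python's built-in `sqlite3` as an in-memory stand-in for PostgreSQL so that it runs without credentials. The `dbn` column in `high_school_directory`, the simplified table definitions, and the sample rows are all assumptions, not confirmed by this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
conn.executescript("""
    CREATE TABLE high_school_directory (dbn TEXT PRIMARY KEY, school_name TEXT);
    CREATE TABLE sat_results (dbn TEXT PRIMARY KEY, sat_math_avg REAL);
    INSERT INTO high_school_directory
        VALUES ('01M292', 'HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES');
    INSERT INTO sat_results VALUES ('01M292', 404), ('01M448', 423);
""")

# LEFT JOIN keeps every SAT row; a NULL on the directory side marks a DBN
# that has no match in the directory table.
unmatched = conn.execute("""
    SELECT s.dbn
    FROM sat_results AS s
    LEFT JOIN high_school_directory AS d ON d.dbn = s.dbn
    WHERE d.dbn IS NULL
    ORDER BY s.dbn
""").fetchall()
conn.close()

print(unmatched)
```

The same `LEFT JOIN ... WHERE d.dbn IS NULL` query shape should carry over to PostgreSQL once the actual column names are confirmed.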


In [25]:
# Prepare improved clean data for database insertion
def prepare_clean_data_for_database(df: pd.DataFrame) -> pd.DataFrame:
    """
    Prepare the logically cleaned dataset for database insertion
    Focus on essential SAT analysis data only
    """
    print("=== PREPARING CLEAN DATA FOR DATABASE ===")
    
    # The data is already clean - just add minimal metadata for tracking
    df_db = df.copy()
    
    # Add only essential metadata
    df_db['data_processed_at'] = pd.Timestamp.now()
    
    print(f"Clean database schema columns: {list(df_db.columns)}")
    print(f"Records prepared for insertion: {len(df_db)}")
    print(f"Data completeness: 100% for all SAT scores")
    
    return df_db

# Prepare final clean dataset
df_final_clean = prepare_clean_data_for_database(df_improved)

# Display final clean schema
print(f"\n=== FINAL CLEAN DATABASE SCHEMA ===")
print(df_final_clean.info())
print(f"\n=== SAMPLE CLEAN RECORDS ===")
df_final_clean.head()

=== PREPARING CLEAN DATA FOR DATABASE ===
Clean database schema columns: ['dbn', 'school_name', 'sat_reading_avg', 'sat_math_avg', 'sat_writing_avg', 'num_test_takers', 'sat_total_avg', 'data_processed_at']
Records prepared for insertion: 416
Data completeness: 100% for all SAT scores

=== FINAL CLEAN DATABASE SCHEMA ===
<class 'pandas.core.frame.DataFrame'>
Index: 416 entries, 0 to 477
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   dbn                416 non-null    object        
 1   school_name        416 non-null    object        
 2   sat_reading_avg    416 non-null    float64       
 3   sat_math_avg       416 non-null    float64       
 4   sat_writing_avg    416 non-null    float64       
 5   num_test_takers    416 non-null    int64         
 6   sat_total_avg      416 non-null    float64       
 7   data_processed_at  416 non-null    datetime64[us]
dtypes: datetime64[us](1), f

|   | dbn | school_name | sat_reading_avg | sat_math_avg | sat_writing_avg | num_test_takers | sat_total_avg | data_processed_at |
|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 355.0 | 404.0 | 363.0 | 29 | 1122.0 | 2025-08-07 22:40:56.847458 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 383.0 | 423.0 | 366.0 | 91 | 1172.0 | 2025-08-07 22:40:56.847458 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 377.0 | 402.0 | 370.0 | 70 | 1149.0 | 2025-08-07 22:40:56.847458 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 414.0 | 401.0 | 359.0 | 7 | 1174.0 | 2025-08-07 22:40:56.847458 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 390.0 | 433.0 | 384.0 | 44 | 1207.0 | 2025-08-07 22:40:56.847458 |


## 5. Database Insertion with Error Handling

Now let's insert the cleaned data into the PostgreSQL database with proper error handling and validation.

In [26]:
# Database insertion with improved clean dataset
def insert_clean_data_to_database(df: pd.DataFrame, table_name: str = 'svitlana_sat_scores_clean') -> bool:
    """
    Insert the logically cleaned SAT data to PostgreSQL database
    Only includes schools with complete SAT score data
    """
    try:
        print(f"=== INSERTING CLEAN SAT DATA TO DATABASE ===")
        print(f"Table: nyc_schools.{table_name}")
        print(f"Clean records to insert: {len(df)}")
        print(f"Data completeness: 100% SAT scores (by design)")
        
        # Insert clean data using pandas to_sql
        result = df.to_sql(
            name=table_name,
            con=engine,
            schema='nyc_schools',
            if_exists='replace',  # Replace existing table
            index=False,
            method='multi'  # Use efficient batch insertion
        )
        
        print(f"✅ Successfully inserted {len(df)} clean records into nyc_schools.{table_name}")
        
        # Verify insertion
        with engine.connect() as connection:
            from sqlalchemy import text
            
            verification_query = text(f"SELECT COUNT(*) FROM nyc_schools.{table_name}")
            result = connection.execute(verification_query)
            count = result.fetchone()[0]
            print(f"✅ Verification: {count} records found in database table")
            
            # Get sample of inserted data with SAT statistics
            sample_query = text(f"""
                SELECT dbn, school_name, sat_reading_avg, sat_math_avg, sat_writing_avg, sat_total_avg
                FROM nyc_schools.{table_name} 
                ORDER BY sat_total_avg DESC 
                LIMIT 3
            """)
            sample_result = connection.execute(sample_query)
            sample_data = sample_result.fetchall()
            
            print(f"\n=== TOP 3 SCHOOLS BY SAT TOTAL ===")
            for i, row in enumerate(sample_data, 1):
                print(f"{i}. {row[0]} - {row[1][:40]}...")
                print(f"   Reading: {row[2]}, Math: {row[3]}, Writing: {row[4]}, Total: {row[5]}")
        
        return True
        
    except Exception as e:
        print(f"❌ Database insertion failed: {e}")
        print(f"Error type: {type(e).__name__}")
        return False

# Insert improved clean dataset
success_clean = insert_clean_data_to_database(df_final_clean)

if success_clean:
    print(f"\n🎯 IMPROVED DATA INTEGRATION COMPLETED!")
    print(f"   ✅ Logical filtering: Only schools with complete SAT data")
    print(f"   ✅ Essential columns only: Focused on SAT analysis purpose")  
    print(f"   ✅ Clean dataset: {len(df_final_clean)} high-quality records")
    print(f"   ✅ 100% data completeness for primary indicators (SAT scores)")
    print(f"   ✅ Removed unnecessary metadata and irrelevant columns")
else:
    print(f"\n⚠️  Clean database insertion failed")

=== INSERTING CLEAN SAT DATA TO DATABASE ===
Table: nyc_schools.svitlana_sat_scores_clean
Clean records to insert: 416
Data completeness: 100% SAT scores (by design)
✅ Successfully inserted 416 clean records into nyc_schools.svitlana_sat_scores_clean
✅ Verification: 416 records found in database table

=== TOP 3 SCHOOLS BY SAT TOTAL ===
1. 02M475 - STUYVESANT HIGH SCHOOL...
   Reading: 679.0, Math: 735.0, Writing: 682.0, Total: 2096.0
2. 10X445 - BRONX HIGH SCHOOL OF SCIENCE...
   Reading: 632.0, Math: 688.0, Writing: 649.0, Total: 1969.0
3. 31R605 - STATEN ISLAND TECHNICAL HIGH SCHOOL...
   Reading: 635.0, Math: 682.0, Writing: 636.0, Total: 1953.0

🎯 IMPROVED DATA INTEGRATION COMPLETED!
   ✅ Logical filtering: Only schools with complete SAT data
   ✅ Essential columns only: Focused on SAT analysis purpose
   ✅ Clean dataset: 416 high-quality records
   ✅ 100% data completeness for primary indicators (SAT scores)
   ✅ Removed unnecessary metadata and irrelevant columns
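
The verification queries above interpolate `table_name` into SQL with an f-string. Table names cannot be passed as bound parameters, so one way to hedge against a malformed or malicious name is to validate it before interpolation. In this sketch `safe_identifier` is a hypothetical helper, and the rule (letters, digits, and underscores, at most 63 characters to match PostgreSQL's default identifier length limit) is an assumption about what the project considers a valid table name.

```python
import re

# Plain identifier: starts with a letter or underscore, at most 63 characters.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,62}")

def safe_identifier(name: str) -> str:
    """Return name unchanged if it is a plain SQL identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe SQL identifier: {name!r}")
    return name

table = safe_identifier("svitlana_sat_scores_clean")
query = f"SELECT COUNT(*) FROM nyc_schools.{table}"
print(query)
```

Passing a value such as `"x; DROP TABLE y"` raises `ValueError` before any SQL is built.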


## 6. Export Cleaned Data

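Finally, export the cleaned dataset as a CSV file for backup and further use. The cell below exports `df_final`, which, like `df_cleaned`, is not defined in the cells shown here. Its output (478 rows, 13 columns) reflects a less restrictive cleaning pass than the 416-row `df_final_clean` loaded into the database in Section 5.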

In [27]:
# Export cleaned data to CSV
output_path = '/Users/svitlanakovalivska/onboarding_weebet/_onboarding_data/svitlana_experement_sat_results.csv'

try:
    # Export final cleaned dataset
    df_final.to_csv(output_path, index=False)
    print(f"✅ Successfully exported cleaned data to: {output_path}")
    print(f"   - Records exported: {len(df_final)}")
    print(f"   - Columns exported: {len(df_final.columns)}")
    # Note: memory_usage() measures the DataFrame's in-memory footprint, not the CSV size on disk
    print(f"   - File size: {df_final.memory_usage(deep=True).sum() / 1024:.1f} KB")
    
    # Create summary report
    print(f"\n=== DATA CLEANING SUMMARY REPORT ===")
    print(f"Original dataset:")
    print(f"  - Rows: {len(df_raw)}")
    print(f"  - Columns: {len(df_raw.columns)}")
    print(f"  - Duplicates: {df_raw.duplicated().sum()}")
    
    print(f"\nCleaned dataset:")
    print(f"  - Rows: {len(df_final)}")
    print(f"  - Columns: {len(df_final.columns)}")
    print(f"  - Data quality improvements:")
    print(f"    * Removed duplicate rows: {len(df_raw) - len(df_cleaned)}")
    print(f"    * Standardized column names")
    print(f"    * Converted percentages to proper format (0-100)")
    print(f"    * Validated SAT score ranges (200-800)")
    print(f"    * Handled suppressed data appropriately")
    print(f"    * Added calculated total SAT scores")
    print(f"    * Added metadata columns")
    
    # Data quality metrics
    complete_records = df_final.dropna(subset=['sat_critical_reading_avg_score', 'sat_math_avg_score', 'sat_writing_avg_score'])
    print(f"\nData completeness:")
    print(f"  - Schools with complete SAT data: {len(complete_records)} ({len(complete_records)/len(df_final)*100:.1f}%)")
    
    if not complete_records.empty:
        avg_total = complete_records['sat_total_avg_score'].mean()
        print(f"  - Average total SAT score: {avg_total:.1f}")
    
except Exception as e:
    print(f"❌ Failed to export cleaned data: {e}")

print(f"\n🎯 TASK COMPLETION STATUS:")
print(f"✅ Data exploration and quality assessment completed") 
print(f"✅ Comprehensive data cleaning applied")
print(f"✅ Statistical analysis performed")
print(f"✅ Database schema designed and data inserted")
print(f"✅ Cleaned dataset exported to CSV")
print(f"\n🚀 Ready for production use and further analysis!")

✅ Successfully exported cleaned data to: /Users/svitlanakovalivska/onboarding_weebet/_onboarding_data/svitlana_experement_sat_results.csv
   - Records exported: 478
   - Columns exported: 13
   - File size: 172.1 KB

=== DATA CLEANING SUMMARY REPORT ===
Original dataset:
  - Rows: 493
  - Columns: 11
  - Duplicates: 15

Cleaned dataset:
  - Rows: 478
  - Columns: 13
  - Data quality improvements:
    * Removed duplicate rows: 15
    * Standardized column names
    * Converted percentages to proper format (0-100)
    * Validated SAT score ranges (200-800)
    * Handled suppressed data appropriately
    * Added calculated total SAT scores
    * Added metadata columns

Data completeness:
  - Schools with complete SAT data: 416 (87.0%)
  - Average total SAT score: 1209.0

🎯 TASK COMPLETION STATUS:
✅ Data exploration and quality assessment completed
✅ Comprehensive data cleaning applied
✅ Statistical analysis performed
✅ Database schema designed and data inserted
✅ Cleaned dataset exported 
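
The "File size" figure in the export cell comes from `memory_usage`, which reports the in-memory footprint rather than the size of the CSV on disk. A minimal sketch of an export check that reads the actual file size and reloads the file to confirm the row count; the demo frame and the file name `sat_export_demo.csv` are illustrative, and a temporary directory is used so nothing is written to the project folders.

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "sat_total_avg": [1122.0, 1172.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sat_export_demo.csv")  # hypothetical file name
    demo.to_csv(csv_path, index=False)

    size_bytes = os.path.getsize(csv_path)                # actual size on disk
    reloaded = pd.read_csv(csv_path, dtype={"dbn": str})  # keep DBN as text

print(f"File size: {size_bytes / 1024:.2f} KB")
print(f"Rows written: {len(demo)}, rows read back: {len(reloaded)}")
```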