# Day 4 – SAT Results Integration

In this notebook, I will clean and integrate the SAT results dataset into our PostgreSQL database.  
The goal is to simulate a small ETL (Extract → Transform → Load) workflow:

1. **Explore the dataset** – check structure and columns.  
2. **Clean the data** – normalize headers, fix dirty numeric values, remove invalid or duplicate rows.  
3. **Prepare the schema** – decide which columns to keep for the database.  
4. **Load the cleaned data** – save a clean CSV and upsert the results into PostgreSQL.  
5. **Validate** – run a few simple checks on the database to confirm data quality.  

At the end, I will have a reproducible and simple data pipeline that transforms raw CSV into a relational database table.
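
The five steps can be pictured as three small functions. This is only a minimal sketch under stated assumptions: the three-row CSV is a made-up sample, the table name `sat_scores_demo` is hypothetical, and an in-memory SQLite database stands in for PostgreSQL so that the example runs without a server.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical sample: the second row repeats the first DBN,
# the third row has a non-numeric score
RAW_CSV = """DBN,SCHOOL NAME,SAT Math Avg. Score
01M292,SCHOOL A,404
01M292,SCHOOL A,404
01M448,SCHOOL B,unknown
"""


def extract(source):
    """Read the raw CSV as strings so nothing is converted silently."""
    return pd.read_csv(source, dtype=str)


def transform(raw):
    """Rename headers, coerce dirty numbers to NaN, drop duplicate DBNs."""
    out = raw.rename(columns={
        "DBN": "dbn",
        "SCHOOL NAME": "school_name",
        "SAT Math Avg. Score": "sat_math_avg",
    })
    out["sat_math_avg"] = pd.to_numeric(out["sat_math_avg"], errors="coerce")
    return out.drop_duplicates(subset=["dbn"]).reset_index(drop=True)


def load(clean, conn, table="sat_scores_demo"):
    """Write the cleaned frame to the database and return the stored row count."""
    clean.to_sql(table, conn, index=False, if_exists="replace")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
clean = transform(extract(io.StringIO(RAW_CSV)))
loaded = load(clean, conn)
print("rows loaded:", loaded)
```

Keeping each stage in its own function makes it easy to rerun or test one stage without touching the others.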

## Importing libraries and loading the data

In [1]:
# Import all needed libraries
import pandas as pd
import re
from pathlib import Path
import psycopg2
from psycopg2.extras import execute_values
from sqlalchemy import create_engine, text

### 1. Data Inspection

In [2]:
# Define file paths
raw_path = Path("/Users/s.bangemann/Documents/Arbeit/Internship webeet.io/Work Area/_onboarding_data/daily_tasks/day_4/day_4_datasets/sat-results.csv")
clean_path = Path("/Users/s.bangemann/Documents/Arbeit/Internship webeet.io/Work Area/_onboarding_data/daily_tasks/day_4/day_4_task/cleaned_sat_results.csv")

# Load raw dataset
df_raw = pd.read_csv(raw_path)

# Show first rows of the raw data
df_raw.head()

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,SAT Critical Readng Avg. Score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,355,218160,x345,78%,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,383,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,377,236446,x123,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,414,427826,x123,92%,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,390,672714,x123,92%,2.0


In [3]:
# Show shape and column names
print("Shape:", df_raw.shape)
print("Columns:", df_raw.columns.tolist())

# Show column info
df_raw.info()

Shape: (493, 11)
Columns: ['DBN', 'SCHOOL NAME', 'Num of SAT Test Takers', 'SAT Critical Reading Avg. Score', 'SAT Math Avg. Score', 'SAT Writing Avg. Score', 'SAT Critical Readng Avg. Score', 'internal_school_id', 'contact_extension', 'pct_students_tested', 'academic_tier_rating']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              493 non-null    object 
 1   SCHOOL NAME                      493 non-null    object 
 2   Num of SAT Test Takers           493 non-null    object 
 3   SAT Critical Reading Avg. Score  493 non-null    object 
 4   SAT Math Avg. Score              493 non-null    object 
 5   SAT Writing Avg. Score           493 non-null    object 
 6   SAT Critical Readng Avg. Score   493 non-null    object 
 7   internal_school_id               493 non-null    

The dataset has only a few missing values.
One problem is visible immediately: the reading score appears twice, once as "SAT Critical Reading Avg. Score" and once as the misspelled "SAT Critical Readng Avg. Score". The misspelled duplicate should be dropped.
The test-taker and score columns are also stored as `object` rather than as a numeric type, so they have to be converted.
The cleaning step in the next part also normalizes the headers.
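
Before dropping the misspelled column, it may be worth confirming that it really carries the same values as the correctly spelled one. The following is a minimal sketch; `columns_match` is a hypothetical helper and the three-row frame is a small sample built from the rows shown above, not the full dataset.

```python
import pandas as pd


def columns_match(frame, left, right):
    """Return True if both columns hold the same values row by row.

    Series.equals treats missing values in the same position as equal.
    """
    return frame[left].equals(frame[right])


# Small sample shaped like the raw file (values taken from the head() output)
sample = pd.DataFrame({
    "SAT Critical Reading Avg. Score": ["355", "383", "377"],
    "SAT Critical Readng Avg. Score": ["355", "383", "377"],
})

same = columns_match(
    sample,
    "SAT Critical Reading Avg. Score",
    "SAT Critical Readng Avg. Score",
)
print("Duplicate column is identical:", same)
```

If the check returned `False`, the differing rows would need a closer look before one of the columns is discarded.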

### 2. Data Cleaning

In [5]:
# Normalize headers to snake_case
def to_snake(s: str) -> str:
    s = re.sub(r"[^\w\s]", " ", s.strip())  # replace punctuation with spaces
    s = re.sub(r"\s+", "_", s.lower())      # collapse spaces to underscores
    return s

df_raw.columns = [to_snake(c) for c in df_raw.columns]
df_raw.head(3)

Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,sat_critical_readng_avg_score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,355,218160,x345,78%,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,383,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,377,236446,x123,,3.0
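
One edge case of `to_snake`: a header that ends in punctuation (for example a trailing period) is turned into a name with a trailing underscore, because the punctuation becomes a space before the spaces are collapsed. None of the current headers end that way, so this is only a defensive sketch; `to_snake_strict` and the second test header are hypothetical.

```python
import re


def to_snake_strict(s: str) -> str:
    """Like to_snake, but strips again after punctuation has become spaces."""
    s = re.sub(r"[^\w\s]", " ", s.strip())       # punctuation -> space
    s = re.sub(r"\s+", "_", s.strip().lower())   # trim, then collapse whitespace
    return s


print(to_snake_strict("SAT Critical Reading Avg. Score"))
print(to_snake_strict("Pct. Tested (%)"))
```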


In [7]:
# Map original (normalized) columns to clean names
# I keep only the relevant fields for the DB
rename_map = {
    "dbn": "dbn",
    "school_name": "school_name",
    "num_of_sat_test_takers": "num_takers",
    "sat_critical_reading_avg_score": "sat_reading_avg",
    "sat_math_avg_score": "sat_math_avg",
    "sat_writing_avg_score": "sat_writing_avg",
    "pct_students_tested": "pct_students_tested",
}

# Columns intentionally left out (synthetic or not needed for SAT integration).
# Listed for documentation only: the sub-selection below already excludes them.
drop_candidates = [
    "sat_critical_readng_avg_score",  # misspelled duplicate of the reading column
    "internal_school_id",
    "contact_extension",
    "academic_tier_rating",
]

# Keep only the mapped columns that are present in the raw data
keep = [c for c in rename_map if c in df_raw.columns]
df = df_raw[keep].rename(columns=rename_map)

df.head(5)

Unnamed: 0,dbn,school_name,num_takers,sat_reading_avg,sat_math_avg,sat_writing_avg,pct_students_tested
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,78%
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,92%
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,92%


In [9]:
# Helper to clean messy numbers (%, commas, stray chars)
def clean_num(x):
    if pd.isna(x):
        return pd.NA
    x = str(x).replace("%", "").replace(",", "").strip()
    return pd.to_numeric(x, errors="coerce")

# Convert numeric columns
for col in ["num_takers", "sat_reading_avg", "sat_math_avg", "sat_writing_avg", "pct_students_tested"]:
    df[col] = df[col].map(clean_num)

# Scale pct_students_tested into range [0,1] instead of 0-100
df["pct_students_tested"] = df["pct_students_tested"] / 100

df.head(5)

Unnamed: 0,dbn,school_name,num_takers,sat_reading_avg,sat_math_avg,sat_writing_avg,pct_students_tested
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,0.78
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,
2,01M450,EAST SIDE COMMUNITY SCHOOL,70.0,377.0,402.0,370.0,
3,01M458,FORSYTH SATELLITE ACADEMY,7.0,414.0,401.0,359.0,0.92
4,01M509,MARTA VALLE HIGH SCHOOL,44.0,390.0,433.0,384.0,0.92
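
The same cleaning can also be done on a whole column at once instead of value by value. This is a sketch of an alternative, not the method used above; `clean_numeric_series` is a hypothetical name and the sample values are invented to cover the messy cases (percent sign, thousands separator, stray text, missing value).

```python
import pandas as pd


def clean_numeric_series(s):
    """Strip %, commas and whitespace, then coerce everything else to NaN."""
    as_text = s.astype(str).str.replace(r"[%,\s]", "", regex=True)
    # Anything that still is not a number (including "None"/"nan") becomes NaN
    return pd.to_numeric(as_text, errors="coerce")


messy = pd.Series(["78%", None, "1,234", "s", " 92% "])
result = clean_numeric_series(messy)
print(result)
```

Because `errors="coerce"` turns every unparseable entry into `NaN`, it is worth comparing the number of missing values before and after cleaning to see how many entries were discarded.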


In [10]:
# SAT subscores must be in [200, 800]
for col in ["sat_reading_avg", "sat_math_avg", "sat_writing_avg"]:
    df[col] = df[col].where(df[col].between(200, 800), pd.NA)

# num_takers must be non-negative
df["num_takers"] = df["num_takers"].where((df["num_takers"].isna()) | (df["num_takers"] >= 0), pd.NA)

df.head(5)

Unnamed: 0,dbn,school_name,num_takers,sat_reading_avg,sat_math_avg,sat_writing_avg,pct_students_tested
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,0.78
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,
2,01M450,EAST SIDE COMMUNITY SCHOOL,70.0,377.0,402.0,370.0,
3,01M458,FORSYTH SATELLITE ACADEMY,7.0,414.0,401.0,359.0,0.92
4,01M509,MARTA VALLE HIGH SCHOOL,44.0,390.0,433.0,384.0,0.92
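
Masking silently replaces invalid values, so it can help to count them first. A minimal sketch, assuming the score columns are already numeric; `out_of_range_counts` is a hypothetical helper and the sample frame contains invented values.

```python
import pandas as pd


def out_of_range_counts(frame, cols, low=200, high=800):
    """Count non-missing values outside [low, high] for each column."""
    return {
        c: int((frame[c].notna() & ~frame[c].between(low, high)).sum())
        for c in cols
    }


sample = pd.DataFrame({
    "sat_reading_avg": [355.0, 150.0, None, 900.0],  # two invalid, one missing
    "sat_math_avg": [404.0, 423.0, 402.0, 401.0],    # all valid
})
counts = out_of_range_counts(sample, ["sat_reading_avg", "sat_math_avg"])
print(counts)
```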


In [11]:
# Drop duplicate DBNs (keep first – simple rule)
# If you prefer a smarter rule: keep the row with most scores present, then highest num_takers
df = df.drop_duplicates(subset=["dbn"]).copy()

# Compute total SAT score (only if all three parts are present)
df["sat_total_avg"] = df[["sat_reading_avg", "sat_math_avg", "sat_writing_avg"]].sum(axis=1, min_count=3)

df.head(5)

Unnamed: 0,dbn,school_name,num_takers,sat_reading_avg,sat_math_avg,sat_writing_avg,pct_students_tested,sat_total_avg
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,0.78,1122.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,,1172.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70.0,377.0,402.0,370.0,,1149.0
3,01M458,FORSYTH SATELLITE ACADEMY,7.0,414.0,401.0,359.0,0.92,1174.0
4,01M509,MARTA VALLE HIGH SCHOOL,44.0,390.0,433.0,384.0,0.92,1207.0
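
The "smarter rule" mentioned in the comment (keep the row with the most scores present, then the highest `num_takers`) could look roughly like this. It is a sketch only: `dedupe_best_row` and the helper column `_n_scores` are hypothetical names, and the sample rows are invented.

```python
import pandas as pd

SCORE_COLS = ["sat_reading_avg", "sat_math_avg", "sat_writing_avg"]


def dedupe_best_row(frame):
    """Per DBN, keep the row with the most scores, then the most test takers."""
    ranked = frame.assign(_n_scores=frame[SCORE_COLS].notna().sum(axis=1))
    ranked = ranked.sort_values(
        ["dbn", "_n_scores", "num_takers"],
        ascending=[True, False, False],
        na_position="last",
    )
    # After sorting, the first row of each DBN is the preferred one
    return ranked.drop_duplicates(subset=["dbn"]).drop(columns="_n_scores")


sample = pd.DataFrame({
    "dbn": ["A", "A", "A", "B"],
    "num_takers": [10, 29, 40, 5],
    "sat_reading_avg": [355.0, 355.0, 355.0, 400.0],
    "sat_math_avg": [None, 404.0, 404.0, 400.0],
    "sat_writing_avg": [None, 363.0, 363.0, 400.0],
})
deduped = dedupe_best_row(sample)
print(deduped)
```

Note that the result comes back sorted by `dbn`, which differs from the original row order kept by a plain `drop_duplicates`.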


In [12]:
# Cast to integers where possible (nullable Int64 keeps missing values as <NA>)
for col in ["num_takers", "sat_reading_avg", "sat_math_avg", "sat_writing_avg", "sat_total_avg"]:
    df[col] = df[col].round().astype("Int64")

# Save clean CSV
clean_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(clean_path, index=False)
print("✅ Clean CSV saved:", clean_path, "| rows:", len(df))

✅ Clean CSV saved: /Users/s.bangemann/Documents/Arbeit/Internship webeet.io/Work Area/_onboarding_data/daily_tasks/day_4/day_4_task/cleaned_sat_results.csv | rows: 478


### 3. Database Table Creation and Validation

In [13]:
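import os

# Credentials should not be stored in the notebook: the password is read from
# the PGPASSWORD environment variable, which must be set before running this cell.
conn = psycopg2.connect(
    dbname="neondb",
    user="neondb_owner",
    password=os.environ["PGPASSWORD"],
    host="ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech",
    port="5432",
    sslmode="require"
)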
cur = conn.cursor()
print("✅ Connection established")

✅ Connection established


In [14]:
# 1. Create table if not exists
create_table_sql = """
CREATE TABLE IF NOT EXISTS "sebastian-bangemann_sat_scores" (
    dbn TEXT PRIMARY KEY,
    school_name TEXT,
    num_takers INTEGER,
    sat_reading_avg FLOAT,
    sat_math_avg FLOAT,
    sat_writing_avg FLOAT,
    sat_total_avg FLOAT,
    pct_students_tested FLOAT,
    loaded_at TIMESTAMPTZ DEFAULT NOW()
);
"""
cur.execute(create_table_sql)
conn.commit()
print("✅ Table ready")

✅ Table ready


In [15]:
# Convert pd.NA to None so psycopg2 can handle it
rows = df[[
    "dbn","school_name","num_takers",
    "sat_reading_avg","sat_math_avg","sat_writing_avg","sat_total_avg",
    "pct_students_tested"
]].astype(object).where(pd.notna(df), None).values.tolist()

# Double-check: first three rows
print(rows[:3])

[['01M292', 'HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES', 29, 355, 404, 363, 1122, 0.78], ['01M448', 'UNIVERSITY NEIGHBORHOOD HIGH SCHOOL', 91, 383, 423, 366, 1172, None], ['01M450', 'EAST SIDE COMMUNITY SCHOOL', 70, 377, 402, 370, 1149, None]]
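
The conversion of missing values to `None` can also be written out explicitly, which avoids relying on how `where` treats nullable dtypes. This is a minimal sketch; `to_db_rows` is a hypothetical helper and the two-row frame is an invented sample.

```python
import pandas as pd


def to_db_rows(frame):
    """Return one tuple per row, with pd.NA / NaN replaced by None."""
    return [
        tuple(None if pd.isna(v) else v for v in row)
        for row in frame.itertuples(index=False, name=None)
    ]


sample = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "num_takers": pd.array([29, None], dtype="Int64"),  # nullable integer
    "pct_students_tested": [0.78, float("nan")],        # plain float with NaN
})
db_rows = to_db_rows(sample)
print(db_rows)
```

Depending on the pandas version, the values may still be NumPy scalars rather than plain Python numbers, so it is worth checking the element types before handing the rows to a database driver.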


In [19]:
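# Fix schema: add pct_students_tested + cast avg columns to FLOAT
# (CREATE TABLE IF NOT EXISTS leaves an already existing table unchanged,
# so a table created with an older schema has to be migrated explicitly)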
alter_sql = """
ALTER TABLE "sebastian-bangemann_sat_scores"
  ADD COLUMN IF NOT EXISTS pct_students_tested FLOAT;

ALTER TABLE "sebastian-bangemann_sat_scores"
  ALTER COLUMN sat_reading_avg  TYPE FLOAT USING sat_reading_avg::float,
  ALTER COLUMN sat_math_avg     TYPE FLOAT USING sat_math_avg::float,
  ALTER COLUMN sat_writing_avg  TYPE FLOAT USING sat_writing_avg::float,
  ALTER COLUMN sat_total_avg    TYPE FLOAT USING sat_total_avg::float;
"""
cur.execute(alter_sql)
conn.commit()
print("✅ Table altered: schema updated")

✅ Table altered: schema updated


In [20]:
# 2. Insert cleaned DataFrame rows into the table (with NA handling)

# Columns I want to insert
cols = [
    "dbn","school_name","num_takers",
    "sat_reading_avg","sat_math_avg","sat_writing_avg","sat_total_avg",
    "pct_students_tested"
]

# Make a copy and replace pd.NA with None (Python's null)
df_ins = df[cols].copy().astype(object).where(pd.notna(df[cols]), None)

# Convert rows to a list of lists (one inner list per row)
rows = df_ins.values.tolist()

# SQL for bulk insert with UPSERT (update on conflict)
insert_sql = """
INSERT INTO "sebastian-bangemann_sat_scores"
    (dbn, school_name, num_takers, sat_reading_avg, sat_math_avg, sat_writing_avg, sat_total_avg, pct_students_tested)
VALUES %s
ON CONFLICT (dbn) DO UPDATE SET
    school_name = EXCLUDED.school_name,
    num_takers = EXCLUDED.num_takers,
    sat_reading_avg = EXCLUDED.sat_reading_avg,
    sat_math_avg = EXCLUDED.sat_math_avg,
    sat_writing_avg = EXCLUDED.sat_writing_avg,
    sat_total_avg = EXCLUDED.sat_total_avg,
    pct_students_tested = EXCLUDED.pct_students_tested;
"""

# Run bulk insert
execute_values(cur, insert_sql, rows)
conn.commit()
print("✅ Data inserted (bulk)")

✅ Data inserted (bulk)


In [21]:
# 3. Validation: Row count
cur.execute('SELECT COUNT(*) FROM "sebastian-bangemann_sat_scores";')
print("Row count:", cur.fetchone()[0])

# 4. Validation: Out-of-range SAT scores (should be 0)
cur.execute("""
    SELECT COUNT(*) 
    FROM "sebastian-bangemann_sat_scores"
    WHERE (sat_reading_avg  < 200 OR sat_reading_avg  > 800)
       OR (sat_math_avg     < 200 OR sat_math_avg     > 800)
       OR (sat_writing_avg  < 200 OR sat_writing_avg  > 800);
""")
print("Out-of-range rows:", cur.fetchone()[0])

# 5. Validation: Top 5 schools by total SAT score
cur.execute("""
    SELECT dbn, school_name, sat_total_avg
    FROM "sebastian-bangemann_sat_scores"
    ORDER BY sat_total_avg DESC NULLS LAST
    LIMIT 5;
""")
rows = cur.fetchall()

print("\nTop 5 schools by total SAT score:")
for r in rows:
    print(r)

Row count: 478
Out-of-range rows: 0

Top 5 schools by total SAT score:
('02M475', 'STUYVESANT HIGH SCHOOL', 2096.0)
('10X445', 'BRONX HIGH SCHOOL OF SCIENCE', 1969.0)
('31R605', 'STATEN ISLAND TECHNICAL HIGH SCHOOL', 1953.0)
('10X696', 'HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE', 1920.0)
('25Q525', 'TOWNSEND HARRIS HIGH SCHOOL', 1910.0)
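
A further consistency check follows from the definition of the total: since each part lies in [200, 800], `sat_total_avg` must equal reading + math + writing and lie in [600, 2400]. The sketch below runs this check on a DataFrame. `total_mismatches` is a hypothetical helper; the first two sample rows are taken from the output above and the third is deliberately wrong.

```python
import pandas as pd

PARTS = ["sat_reading_avg", "sat_math_avg", "sat_writing_avg"]


def total_mismatches(frame):
    """Return the DBNs whose total is not the sum of its parts or is out of range."""
    complete = frame.dropna(subset=PARTS + ["sat_total_avg"])
    expected = complete[PARTS].sum(axis=1)
    bad = (expected != complete["sat_total_avg"]) | ~complete["sat_total_avg"].between(600, 2400)
    return complete.loc[bad, "dbn"].tolist()


sample = pd.DataFrame({
    "dbn": ["01M292", "01M448", "XXBAD"],
    "sat_reading_avg": [355, 383, 355],
    "sat_math_avg": [404, 423, 404],
    "sat_writing_avg": [363, 366, 363],
    "sat_total_avg": [1122, 1172, 1000],  # last total is deliberately wrong
})
mismatches = total_mismatches(sample)
print("Inconsistent totals:", mismatches)
```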
