Day 4 — SAT Results ETL (Explore → Clean → Load)


Inspect the dataset (structure, columns, issues).

Clean & preprocess (handle duplicates, missing values, formatting).

Prepare cleaned .csv.

Provide Python script for cleaning + inserting into PostgreSQL.

Provide Markdown .md file with documentation.


In [12]:
import numpy as np
import pandas as pd
import psycopg2
from sqlalchemy import create_engine



In [23]:
# Load the dataset
file_path = "sat-results.csv"
df = pd.read_csv(file_path)

# Column names, non-null counts, and dtypes
# (df.info() prints its report directly and returns None)
df.info()

# Summary stats for every column, kept for later inspection
desc = df.describe(include="all")

# Shape and first ten rows
print("Shape:", df.shape)
df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              493 non-null    object 
 1   SCHOOL NAME                      493 non-null    object 
 2   Num of SAT Test Takers           493 non-null    object 
 3   SAT Critical Reading Avg. Score  493 non-null    object 
 4   SAT Math Avg. Score              493 non-null    object 
 5   SAT Writing Avg. Score           493 non-null    object 
 6   SAT Critical Readng Avg. Score   493 non-null    object 
 7   internal_school_id               493 non-null    int64  
 8   contact_extension                388 non-null    object 
 9   pct_students_tested              376 non-null    object 
 10  academic_tier_rating             402 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 42.5+ KB
Shape: (493, 11)


Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,SAT Critical Readng Avg. Score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,355,218160,x345,78%,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,383,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,377,236446,x123,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,414,427826,x123,92%,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,390,672714,x123,92%,2.0
5,01M515,LOWER EAST SIDE PREPARATORY HIGH SCHOOL,112,332,557,316,332,414951,x345,,3.0
6,01M539,"NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ...",159,522,574,525,522,697107,,78%,2.0
7,01M650,CASCADES HIGH SCHOOL,18,417,418,411,417,297600,,92%,4.0
8,01M696,BARD HIGH SCHOOL EARLY COLLEGE,130,624,604,628,624,881396,x234,,
9,02M047,47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECO...,16,395,400,387,395,751293,,78%,4.0


Cleaning Plan

Drop redundant column: SAT Critical Readng Avg. Score (a misspelled column whose values match SAT Critical Reading Avg. Score in the rows shown above).

Convert numeric fields: Num of SAT Test Takers, SAT Critical Reading Avg. Score, SAT Math Avg. Score, SAT Writing Avg. Score. Replace "s" and other invalid values with NaN, then cast to a numeric dtype.

Fix percentage formatting: convert pct_students_tested from "78%" → 0.78.

Remove duplicates: based on DBN.

Keep structure consistent for PostgreSQL insertion.
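
Before applying the plan, it can help to check its two assumptions: that the misspelled column really duplicates the correctly spelled one, and that the non-numeric entries are only placeholders such as "s". The following is a minimal sketch on a made-up three-row frame; the helper name `non_numeric_values` and the sample rows are illustrative and not taken from the dataset.

```python
import pandas as pd

def non_numeric_values(series: pd.Series) -> pd.Series:
    """Count the raw values that pd.to_numeric cannot parse."""
    converted = pd.to_numeric(series, errors="coerce")
    # A failed conversion is NaN after coercion but was not missing in the raw data
    failed = converted.isna() & series.notna()
    return series[failed].value_counts()

# Hypothetical sample mimicking the raw column layout
sample = pd.DataFrame({
    "SAT Critical Reading Avg. Score": ["355", "s", "383"],
    "SAT Critical Readng Avg. Score": ["355", "s", "383"],
})

# True only if both columns hold the same values in the same positions
same = sample["SAT Critical Reading Avg. Score"].equals(
    sample["SAT Critical Readng Avg. Score"]
)
bad = non_numeric_values(sample["SAT Critical Reading Avg. Score"])

print("Columns identical:", same)
print(bad)
```

Run against the real `df`, the same two checks show whether dropping the misspelled column loses information and which placeholder strings `errors="coerce"` will turn into NaN.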

In [20]:
# Drop redundant column
df_clean = df.drop(columns=["SAT Critical Readng Avg. Score"])

# Convert numeric SAT-related columns (replace 's' or non-numeric with NaN)
numeric_cols = [
    "Num of SAT Test Takers",
    "SAT Critical Reading Avg. Score",
    "SAT Math Avg. Score",
    "SAT Writing Avg. Score"
]

for col in numeric_cols:
    df_clean[col] = pd.to_numeric(df_clean[col], errors="coerce")

# Clean percentage column
df_clean["pct_students_tested"] = (
    df_clean["pct_students_tested"]
    .str.replace("%", "", regex=False)
    .astype(float) / 100
)

# Remove duplicates by DBN, keeping the first occurrence
df_clean = df_clean.drop_duplicates(subset=["DBN"], keep="first")

# Save cleaned dataset
output_path = "sat-results-cleaned.csv"
df_clean.to_csv(output_path, index=False)
print("After cleaning:", df_clean.shape)
df_clean.head(10)


After cleaning: (478, 10)


Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,218160,x345,0.78,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70.0,377.0,402.0,370.0,236446,x123,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7.0,414.0,401.0,359.0,427826,x123,0.92,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44.0,390.0,433.0,384.0,672714,x123,0.92,2.0
5,01M515,LOWER EAST SIDE PREPARATORY HIGH SCHOOL,112.0,332.0,557.0,316.0,414951,x345,,3.0
6,01M539,"NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ...",159.0,522.0,574.0,525.0,697107,,0.78,2.0
7,01M650,CASCADES HIGH SCHOOL,18.0,417.0,418.0,411.0,297600,,0.92,4.0
8,01M696,BARD HIGH SCHOOL EARLY COLLEGE,130.0,624.0,604.0,628.0,881396,x234,,
9,02M047,47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECO...,16.0,395.0,400.0,387.0,751293,,0.78,4.0
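
Dropping duplicates with keep="first" silently discards the later rows, so it may be worth looking at the affected DBNs before removing them. A small sketch on hypothetical rows (only the DBN values matter here):

```python
import pandas as pd

# Hypothetical rows; the first and third share a DBN
sample = pd.DataFrame({
    "DBN": ["01M292", "01M448", "01M292"],
    "SAT Math Avg. Score": [404, 423, 404],
})

# keep=False flags every member of a duplicate group, not just the repeats
dupes = sample[sample.duplicated(subset=["DBN"], keep=False)]

# Same call as in the cleaning cell: the first occurrence of each DBN survives
deduped = sample.drop_duplicates(subset=["DBN"], keep="first")

print(dupes.sort_values("DBN"))
print("Rows removed:", len(sample) - len(deduped))
```

If the duplicated rows disagree on any score, keeping the first occurrence is an arbitrary choice and may deserve a closer look.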


Python Script

Connects to PostgreSQL.

Appends the cleaned data to a target table (sat_results).

In [66]:
# Save the Python script into a .py file
script_content = """import os

import pandas as pd
import psycopg2  # driver SQLAlchemy uses for postgresql:// URLs
from sqlalchemy import create_engine

# === Step 1: Load raw dataset ===
raw_path = "sat-results.csv"
df = pd.read_csv(raw_path)

# === Step 2: Clean & preprocess ===
# Drop redundant column
df = df.drop(columns=["SAT Critical Readng Avg. Score"], errors="ignore")

# Convert numeric SAT-related columns ("s" and other non-numeric values become NaN)
numeric_cols = [
    "Num of SAT Test Takers",
    "SAT Critical Reading Avg. Score",
    "SAT Math Avg. Score",
    "SAT Writing Avg. Score"
]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Clean percentage column: "78%" -> 0.78; blanks and invalid values become NaN
if "pct_students_tested" in df.columns:
    df["pct_students_tested"] = pd.to_numeric(
        df["pct_students_tested"].astype(str).str.replace("%", "", regex=False),
        errors="coerce",
    ) / 100

# Remove duplicates
if "DBN" in df.columns:
    df = df.drop_duplicates(subset=["DBN"], keep="first")

# Save cleaned dataset
cleaned_path = "sat-results-cleaned.csv"
df.to_csv(cleaned_path, index=False)
print(f"Cleaned dataset saved to {cleaned_path}")

# === Step 3: Insert into PostgreSQL ===
# Update these settings for your environment.
# The password is read from the DB_PASS environment variable instead of being stored in the script.
DB_NAME = "neondb"
DB_USER = "neondb_owner"
DB_PASS = os.environ["DB_PASS"]
DB_HOST = "ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech"
DB_PORT = "5432"

engine = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}")

# Append to table 'sat_results'
df.to_sql("sat_results", engine, if_exists="append", index=False)
print("Data appended to PostgreSQL table 'sat_results'")
"""

# Write the script to disk
script_path = "/mnt/data/process_sat_results.py"
with open(script_path, "w") as f:
    f.write(script_content)
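
Because the script loads with if_exists="append", running it twice inserts every school twice. One possible guard is to append only rows whose DBN is not already in the table. The sketch below illustrates the idea with an in-memory SQLite database standing in for PostgreSQL so that it runs without credentials; the helper name `append_new_rows` and the sample rows are hypothetical.

```python
import sqlite3

import pandas as pd

def append_new_rows(df: pd.DataFrame, conn, table: str = "sat_results") -> int:
    """Append only rows whose DBN is not already present; return how many were added."""
    try:
        existing = pd.read_sql(f'SELECT "DBN" FROM {table}', conn)["DBN"]
    except Exception:
        # The table does not exist yet, so every row counts as new
        existing = pd.Series([], dtype=object)
    new_rows = df[~df["DBN"].isin(existing)]
    if not new_rows.empty:
        new_rows.to_sql(table, conn, if_exists="append", index=False)
    return len(new_rows)

# Stand-in for the PostgreSQL engine (assumption made for this sketch only)
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({
    "DBN": ["01M292", "01M448"],
    "SCHOOL NAME": ["HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES",
                    "UNIVERSITY NEIGHBORHOOD HIGH SCHOOL"],
})

first = append_new_rows(sample, conn)   # initial load
second = append_new_rows(sample, conn)  # re-run: nothing new to insert
total = pd.read_sql("SELECT COUNT(*) AS n FROM sat_results", conn)["n"].iloc[0]
print(first, second, total)
```

A primary key or unique constraint on DBN in the target table would enforce the same rule at the database level; the check above only avoids the duplicate insert from the client side.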


Quality checks (missing values, basic stats)

In [68]:
print("Row count:", len(df))
print("\nMissing values per column:")
print(df.isna().sum())

print("\nDescriptive statistics (numeric):")
display(df.describe())

print("\nSample rows:")
display(df.sample(5, random_state=42))

Row count: 478

Missing values per column:
DBN                                  0
SCHOOL NAME                          0
Num of SAT Test Takers              57
SAT Critical Reading Avg. Score     57
SAT Math Avg. Score                 57
SAT Writing Avg. Score              57
internal_school_id                   0
contact_extension                  100
pct_students_tested                115
academic_tier_rating                86
dtype: int64

Descriptive statistics (numeric):


Unnamed: 0,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,internal_school_id,pct_students_tested,academic_tier_rating
count,421.0,421.0,421.0,421.0,478.0,363.0,392.0
mean,110.320665,400.850356,418.173397,393.985748,560082.717573,0.84595,2.579082
std,155.534254,56.802783,88.210494,58.635109,259637.064755,0.056733,1.128053
min,6.0,279.0,-10.0,286.0,101855.0,0.78,1.0
25%,41.0,368.0,372.0,360.0,337012.5,0.78,2.0
50%,62.0,391.0,395.0,381.0,581301.5,0.85,3.0
75%,95.0,416.0,438.0,411.0,778312.75,0.92,4.0
max,1277.0,679.0,1100.0,682.0,999398.0,0.92,4.0



Sample rows:


Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
469,75M035,P.S. 035,,,,,861847,x123,,4.0
33,02M416,ELEANOR ROOSEVELT HIGH SCHOOL,127.0,572.0,594.0,592.0,799903,x345,0.92,2.0
131,07X548,URBAN ASSEMBLY SCHOOL FOR CAREERS IN SPORTS,44.0,387.0,411.0,383.0,875037,x345,,2.0
72,02M630,ART AND DESIGN HIGH SCHOOL,270.0,444.0,441.0,430.0,850641,x234,0.85,
78,03M403,THE GLOBAL LEARNING COLLABORATIVE,,,,,453954,x345,0.78,1.0
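
The descriptive statistics above show a minimum SAT Math Avg. Score of -10 and a maximum of 1100, while each SAT section is scored on a 200 to 800 scale, so at least two values are invalid. A minimal sketch of a range check that masks such values as NaN; the helper name `mask_out_of_range` and the sample rows are illustrative, and whether to mask, drop, or correct the rows is a judgment call this notebook does not make.

```python
import numpy as np
import pandas as pd

SCORE_COLS = [
    "SAT Critical Reading Avg. Score",
    "SAT Math Avg. Score",
    "SAT Writing Avg. Score",
]

def mask_out_of_range(df: pd.DataFrame, low: int = 200, high: int = 800) -> pd.DataFrame:
    """Return a copy with section scores outside [low, high] set to NaN."""
    out = df.copy()
    for col in SCORE_COLS:
        valid = out[col].between(low, high)  # NaN compares as False
        out[col] = out[col].where(valid)     # invalid values become NaN
    return out

# Hypothetical rows reproducing the two suspicious extremes
sample = pd.DataFrame({
    "SAT Critical Reading Avg. Score": [355.0, 383.0, np.nan],
    "SAT Math Avg. Score": [404.0, -10.0, 1100.0],
    "SAT Writing Avg. Score": [363.0, 366.0, 370.0],
})

checked = mask_out_of_range(sample)
print(checked["SAT Math Avg. Score"].tolist())
```

Applied before the load step, a check like this keeps impossible averages out of `sat_results` while leaving the rest of each row intact.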
