# Day 4: SAT Results Cleaning and Database Integration

## 🧠 Task Summary

The objective is to evaluate, clean, and integrate the `sat-results.csv` dataset into an existing PostgreSQL database.

**Our goals:**
1.  **Explore** the raw dataset to identify its structure, data types, and any issues.
2.  **Clean** the data by handling duplicates, invalid values (like 's'), outliers, and formatting.
3.  **Normalize** headers and drop fields that are not useful for analysis.
4.  **Save** the cleaned data to `cleaned_sat_results.csv`.
5.  **Append** this clean data to our PostgreSQL table.

In [40]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re

## 1. Exploration: Initial Load and Inspection

First, we'll load the raw `sat-results.csv` and use `.info()` and `.head()` to get our first look at the data. This will help us spot any immediate problems.

In [41]:
# Load the dataset

df = pd.read_csv('sat-results.csv')

print("--- DataFrame .info() ---")
df.info()

print("\n--- DataFrame .head() ---")
print(df.head())

--- DataFrame .info() ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              493 non-null    object 
 1   SCHOOL NAME                      493 non-null    object 
 2   Num of SAT Test Takers           493 non-null    object 
 3   SAT Critical Reading Avg. Score  493 non-null    object 
 4   SAT Math Avg. Score              493 non-null    object 
 5   SAT Writing Avg. Score           493 non-null    object 
 6   SAT Critical Readng Avg. Score   493 non-null    object 
 7   internal_school_id               493 non-null    int64  
 8   contact_extension                388 non-null    object 
 9   pct_students_tested              376 non-null    object 
 10  academic_tier_rating             402 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 42.5+ KB


## 2. Findings: Initial Assessment

The initial inspection reveals several key issues:

* **Bad Headers:** Column names have spaces and inconsistent capitalization (e.g., `SCHOOL NAME`).

* **Wrong Data Types:** The test-taker count and all SAT score columns (`Num of SAT Test Takers`, `SAT Critical Reading Avg. Score`, etc.) are `object` (text) type. They must be numeric.

* **Redundancy:** We have both `SAT Critical Reading Avg. Score` and `SAT Critical Readng Avg. Score`. We need to check if they are identical.

* **Inconsistent Formatting:** `pct_students_tested` is an `object` with '%' signs.

* **Missing Data:** Three columns have fewer non-null values than the 493 total rows: `contact_extension` (388), `pct_students_tested` (376), and `academic_tier_rating` (402).

## 3. Cleaning (Step 1): Normalize Headers

Let's fix the column names first. We'll convert them to lowercase `snake_case` for consistency and to make them database-friendly.

In [42]:
print("--- Original Columns ---")
print(df.columns)

# Standardize column names
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(" ", "_")
df.columns = df.columns.str.replace(r'[^a-z0-9_]', '', regex=True)

print("\n--- Cleaned Columns ---")
print(df.columns)

--- Original Columns ---
Index(['DBN', 'SCHOOL NAME', 'Num of SAT Test Takers',
       'SAT Critical Reading Avg. Score', 'SAT Math Avg. Score',
       'SAT Writing Avg. Score', 'SAT Critical Readng Avg. Score',
       'internal_school_id', 'contact_extension', 'pct_students_tested',
       'academic_tier_rating'],
      dtype='object')

--- Cleaned Columns ---
Index(['dbn', 'school_name', 'num_of_sat_test_takers',
       'sat_critical_reading_avg_score', 'sat_math_avg_score',
       'sat_writing_avg_score', 'sat_critical_readng_avg_score',
       'internal_school_id', 'contact_extension', 'pct_students_tested',
       'academic_tier_rating'],
      dtype='object')
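
One risk with stripping punctuation is that two distinct raw headers can collapse into the same cleaned name. That did not happen in this dataset, but a quick check is cheap. The sketch below uses made-up header names (not taken from `sat-results.csv`) to show one way to detect such collisions.

```python
import pandas as pd

# Hypothetical raw headers, chosen so that two of them normalize to the same name
raw = pd.Index(["SAT Math Avg. Score", "sat math avg score", "DBN"])

# Same three normalization steps as in the cell above
cleaned = (
    raw.str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace(r"[^a-z0-9_]", "", regex=True)
)

# Index.duplicated() flags every repeat after the first occurrence
collisions = cleaned[cleaned.duplicated()].tolist()
print(collisions)
```

An empty list means every cleaned header is unique and safe to use as a database column name.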


## 4. Cleaning (Step 2): Handle Duplicates

We need to ensure each school is represented only once. We'll check for complete row duplicates and then for duplicates on the `dbn` (District Borough Number) column, which should be our primary key.

In [43]:
# Check for full-row duplicates
full_duplicates = df.duplicated().sum()
print(f"Found {full_duplicates} complete row duplicates.")

# Drop all duplicates, keeping the first instance
df.drop_duplicates(inplace=True)

print(f"\nShape after dropping all duplicates: {df.shape}")

Found 15 complete row duplicates.

Shape after dropping all duplicates: (478, 11)
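
The cell above only covers full-row duplicates. A check on the `dbn` key could look like the sketch below. It runs on a small made-up frame (the score values are invented for illustration), and keeping the first row per `dbn` is an assumption, not a rule taken from the task.

```python
import pandas as pd

# Hypothetical sample: the same dbn appears twice with different scores
sample = pd.DataFrame({
    "dbn": ["01M292", "01M448", "01M292"],
    "sat_math_avg_score": ["404", "423", "410"],
})

# keep=False marks every row that shares a dbn, so conflicts can be inspected side by side
dbn_conflicts = sample[sample.duplicated(subset="dbn", keep=False)]
print(dbn_conflicts)

# After review, keep one row per dbn (here: the first occurrence)
deduped = sample.drop_duplicates(subset="dbn", keep="first")
print(deduped.shape)
```

Inspecting the conflicting rows before dropping them matters because, unlike full-row duplicates, rows that share a `dbn` may disagree on the scores.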


## 5. Cleaning (Step 3): Investigate and Handle Invalid Data

The SAT score columns are text. This is usually because of a non-numeric placeholder. Let's find it. We'll also check the redundant "readng" column before dropping it.

In [44]:
# Check if the typo column is identical to the correct one
are_identical = (df['sat_critical_reading_avg_score'] == df['sat_critical_readng_avg_score']).all()
print(f"Are the 'Reading' and 'Readng' columns identical? {are_identical}")

# Drop the typo column
df.drop(columns='sat_critical_readng_avg_score', inplace=True)

# Now, let's find the non-numeric value in the score columns
print("\n--- Value counts for 'num_of_sat_test_takers' ---")
print(df['num_of_sat_test_takers'].value_counts().head(10))

Are the 'Reading' and 'Readng' columns identical? True

--- Value counts for 'num_of_sat_test_takers' ---
num_of_sat_test_takers
s     57
54    10
9      8
72     8
48     8
29     7
61     7
52     7
49     7
69     6
Name: count, dtype: int64
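
A caveat on the identity check: element-wise `==` treats two missing values as unequal, so it can report `False` for columns that are in fact copies of each other. Both columns here are fully populated, so the result above is reliable. For columns with gaps, `Series.equals` is a safer comparison, as this small sketch with made-up values shows.

```python
import numpy as np
import pandas as pd

# Two hypothetical columns with identical content, including a missing value
a = pd.Series(["355", np.nan, "s"])
b = pd.Series(["355", np.nan, "s"])

elementwise_identical = bool((a == b).all())  # NaN == NaN evaluates to False
equals_identical = a.equals(b)                # NaNs in the same position count as equal

print(elementwise_identical, equals_identical)
```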


### Finding: The 's' Value

The value `'s'` is present in many of the score columns, likely for "Suppressed" or "Skipped". These rows are unusable for score analysis. We will **remove all rows** where `num_of_sat_test_takers` is `'s'`.

We'll also create a new `df_cleaned` DataFrame to protect our original `df` from here on.

In [45]:
# Create a new DataFrame and remove the 's' rows
df_cleaned = df[df['num_of_sat_test_takers'] != 's'].copy()

print(f"Original shape after deduplication: {df.shape}")
print(f"New shape after dropping 's' rows: {df_cleaned.shape}")

Original shape after deduplication: (478, 10)
New shape after dropping 's' rows: (421, 10)
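
Filtering on `num_of_sat_test_takers` alone assumes that `'s'` always appears in every score column of the same row. One way to verify that assumption is to count the placeholder per column and flag rows where it appears anywhere. The sketch below uses invented values, not rows from the real file.

```python
import pandas as pd

# Hypothetical sample: the last row has 's' in only one column
sample = pd.DataFrame({
    "num_of_sat_test_takers": ["29", "s", "91"],
    "sat_math_avg_score": ["404", "s", "423"],
    "sat_writing_avg_score": ["363", "s", "s"],
})
score_cols = list(sample.columns)

# Count of the 's' placeholder in each column
placeholder_counts = sample[score_cols].eq("s").sum()
print(placeholder_counts)

# True for rows where any score column holds the placeholder
rows_with_placeholder = sample[score_cols].eq("s").any(axis=1)
print(rows_with_placeholder.tolist())
```

If the per-column counts differ, some rows would survive the filter with `'s'` still present in another column, and the later `errors='coerce'` step would quietly turn those cells into `NaN`.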


## 6. Cleaning (Step 4): Convert Data Types

Now that the `'s'` values are gone, we can safely convert all the score columns to numeric. We'll use `errors='coerce'` to turn any other unexpected text into `NaN` (Not a Number).

In [48]:
numeric_columns = [
    'num_of_sat_test_takers',
    'sat_critical_reading_avg_score',
    'sat_math_avg_score',
    'sat_writing_avg_score'
]

# Apply pd.to_numeric to all columns in the list
df_cleaned[numeric_columns] = df_cleaned[numeric_columns].apply(pd.to_numeric, errors='coerce')

print("--- Info After Numeric Conversion ---")
df_cleaned.info()

--- Info After Numeric Conversion ---
<class 'pandas.core.frame.DataFrame'>
Index: 421 entries, 0 to 477
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   dbn                             421 non-null    object 
 1   school_name                     421 non-null    object 
 2   num_of_sat_test_takers          421 non-null    int64  
 3   sat_critical_reading_avg_score  421 non-null    int64  
 4   sat_math_avg_score              421 non-null    int64  
 5   sat_writing_avg_score           421 non-null    int64  
 6   internal_school_id              421 non-null    int64  
 7   contact_extension               334 non-null    object 
 8   pct_students_tested             317 non-null    object 
 9   academic_tier_rating            352 non-null    float64
dtypes: float64(1), int64(5), object(4)
memory usage: 36.2+ KB
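
Because `errors='coerce'` replaces unparseable text silently, it can help to list which raw values were turned into `NaN`. The output above shows no nulls in the score columns, so nothing was coerced in this run. The sketch below shows the check on a made-up series.

```python
import pandas as pd

# Hypothetical raw column mixing valid numbers with text placeholders
raw = pd.Series(["404", "s", "512", "n/a"])

converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but became NaN during conversion
newly_missing = raw[converted.isna() & raw.notna()]
print(newly_missing.tolist())
```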


## 7. Cleaning (Step 5): Handle Outliers (200-800)

The SAT scores are now numeric, but they must be within the valid 200-800 range. We'll find any rows with scores outside this range and remove them, as they are data entry errors.

In [50]:
# This query finds all rows that have an *invalid* score in *any* of the 3 sections
outliers = df_cleaned.query(
    'not (200 <= sat_critical_reading_avg_score <= 800) or '
    'not (200 <= sat_math_avg_score <= 800) or '
    'not (200 <= sat_writing_avg_score <= 800)'
)

print(f"Found {len(outliers)} rows with out-of-range SAT scores.")
if not outliers.empty:
    print(outliers[numeric_columns])

# We'll use an inverse query to *keep* only the rows where all scores are valid
df_cleaned = df_cleaned.query(
    '(200 <= sat_critical_reading_avg_score <= 800) and '
    '(200 <= sat_math_avg_score <= 800) and '
    '(200 <= sat_writing_avg_score <= 800)'
)

print(f"\nShape after dropping outliers: {df_cleaned.shape}")

Found 0 rows with out-of-range SAT scores.

Shape after dropping outliers: (416, 10)
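
An alternative to writing the range condition twice is to build one boolean mask with `Series.between` and derive both the outliers and the kept rows from it. This is a minimal sketch on invented scores, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical scores: row 1 is below the range, row 2 is above it
sample = pd.DataFrame({
    "sat_critical_reading_avg_score": [355, 199, 480],
    "sat_math_avg_score": [404, 450, 810],
    "sat_writing_avg_score": [363, 400, 470],
})
score_cols = list(sample.columns)

# between() is inclusive on both ends by default; a row is valid only if all three scores are
valid_mask = sample[score_cols].apply(lambda col: col.between(200, 800)).all(axis=1)

outliers = sample[~valid_mask]
kept = sample[valid_mask]
print(len(outliers), len(kept))
```

Since both subsets come from the same mask, the number of reported outliers plus the number of kept rows always equals the original row count, which makes the step easy to audit.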


## 8. Cleaning (Step 6): Format `pct_students_tested`

This column is text (e.g., "85%"). We'll convert it to a numeric float (0.85).

As per our strategy, we will **not** drop rows where this value is missing. We will convert existing values and leave the missing ones as `NaN`.

In [51]:
print("--- 'pct_students_tested' unique values (before) ---")
print(df_cleaned['pct_students_tested'].value_counts(dropna=False).head())

# 1. Replace '%' with nothing
# 2. Convert to numeric, coercing errors (like 'N/A') to NaN
# 3. Divide by 100 to get a float (e.g., 85 -> 0.85)
df_cleaned['pct_students_tested'] = pd.to_numeric(
    df_cleaned['pct_students_tested'].str.replace('%', '', regex=False),
    errors='coerce'
) / 100.0

print("\n--- 'pct_students_tested' unique values (after) ---")
print(df_cleaned['pct_students_tested'].value_counts(dropna=False).head())

--- 'pct_students_tested' unique values (before) ---
pct_students_tested
78%    111
85%    105
NaN    103
92%     97
Name: count, dtype: int64

--- 'pct_students_tested' unique values (after) ---
pct_students_tested
0.78    111
0.85    105
NaN     103
0.92     97
Name: count, dtype: int64


In [52]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 416 entries, 0 to 477
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   dbn                             416 non-null    object 
 1   school_name                     416 non-null    object 
 2   num_of_sat_test_takers          416 non-null    int64  
 3   sat_critical_reading_avg_score  416 non-null    int64  
 4   sat_math_avg_score              416 non-null    int64  
 5   sat_writing_avg_score           416 non-null    int64  
 6   internal_school_id              416 non-null    int64  
 7   contact_extension               331 non-null    object 
 8   pct_students_tested             313 non-null    float64
 9   academic_tier_rating            349 non-null    float64
dtypes: float64(2), int64(5), object(3)
memory usage: 35.8+ KB


## 9. Cleaning (Step 7): Final Column Selection

Finally, we'll drop columns that are purely synthetic or internal (`internal_school_id`, `contact_extension`) and are not needed for this analysis.

In [53]:
df_final = df_cleaned.drop(columns=['internal_school_id', 'contact_extension'])

print("--- Final DataFrame Info ---")
df_final.info()

print("\n--- Final DataFrame Head ---")
print(df_final.head())


--- Final DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
Index: 416 entries, 0 to 477
Data columns (total 8 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   dbn                             416 non-null    object 
 1   school_name                     416 non-null    object 
 2   num_of_sat_test_takers          416 non-null    int64  
 3   sat_critical_reading_avg_score  416 non-null    int64  
 4   sat_math_avg_score              416 non-null    int64  
 5   sat_writing_avg_score           416 non-null    int64  
 6   pct_students_tested             313 non-null    float64
 7   academic_tier_rating            349 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 29.2+ KB

--- Final DataFrame Head ---
      dbn                                    school_name  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES   
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH

## 10. Save Cleaned File

As required by the task, we'll save this cleaned DataFrame to a new CSV file.

In [54]:
df_final.to_csv('cleaned_sat_results.csv', index=False)
print("Successfully saved to 'cleaned_sat_results.csv'")

Successfully saved to 'cleaned_sat_results.csv'


## 11. Create Table on Postgres

In [55]:
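import os

host = "ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech"
port = "5432"
database = "neondb"
user = "neondb_owner"
# Read the password from the environment rather than hardcoding it
# (assumes PGPASSWORD has been set before the notebook runs)
password = os.environ["PGPASSWORD"]

# Create a connection engine
engine = create_engine(f"postgresql://{user}:{password}@{host}:{port}/{database}")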

In [56]:
df_final.to_sql(
    name='michael_sat_results',
    con=engine,
    schema='nyc_schools',
    if_exists='replace',  # drops and recreates the table; 'append' would add rows to an existing one
    index=False
)

416
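
The task summary speaks of appending, while the cell above uses `if_exists='replace'`. The difference can be seen without touching the real database. This sketch uses an in-memory SQLite connection as a stand-in for PostgreSQL, and the table name `sat_results_demo` and the score values are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, used only to illustrate the if_exists modes
conn = sqlite3.connect(":memory:")

batch = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "sat_math_avg_score": [404, 423],  # made-up values
})

# 'replace' drops any existing table and writes the frame fresh: 2 rows
batch.to_sql("sat_results_demo", conn, if_exists="replace", index=False)

# 'append' keeps the existing rows and adds the frame again: 4 rows
batch.to_sql("sat_results_demo", conn, if_exists="append", index=False)

row_count = int(pd.read_sql("SELECT COUNT(*) AS n FROM sat_results_demo", conn)["n"].iloc[0])
print(row_count)
conn.close()
```

With `'replace'`, re-running the notebook is idempotent. With `'append'`, each run adds another copy of the rows unless the table enforces a unique key on `dbn`.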