## SOLUTIONS FOR DAY 4  
*by Giovani Goltara*

# Cleaning Logic

- I used Data Wrangler to perform an initial assessment and identify cleaning needs.
- Normalized the column names for consistency.
- Dropped rows containing `'s'` values representing missing data.
- Removed duplicate columns.
- Corrected the data types for relevant columns.
- Since SAT section scores range from 200 to 800, I dropped rows whose math or critical reading average fell outside this range.
- Kept only the school code as the primary key.
- Dropped columns like school name, contact extension, and percent students tested, as they were either unnecessary for this analysis or had many missing values.
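
The placeholder and range steps above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual run: the rows in `toy` are made up, and only the column names follow the dataset. It shows that `'s'` becomes `<NA>` after coercion and that rows with `<NA>` or out-of-range scores are excluded by the range mask.

```python
import pandas as pd

# Toy rows invented for illustration (not taken from sat-results.csv)
toy = pd.DataFrame({
    "dbn": ["A1", "A2", "A3", "A4"],
    "sat_math_avg_score": ["404", "s", "1100", "423"],
    "sat_critical_reading_avg_score": ["355", "s", "390", "383"],
})
score_cols = ["sat_math_avg_score", "sat_critical_reading_avg_score"]

# 's' cannot be parsed as a number, so it is coerced to NaN and stored as <NA> in Int64
toy[score_cols] = toy[score_cols].apply(pd.to_numeric, errors="coerce").astype("Int64")

# Range checks on nullable integers yield <NA> where the score is missing;
# fillna(False) makes it explicit that such rows are dropped
math_ok = (toy["sat_math_avg_score"] >= 200) & (toy["sat_math_avg_score"] <= 800)
reading_ok = (toy["sat_critical_reading_avg_score"] >= 200) & (toy["sat_critical_reading_avg_score"] <= 800)
mask = (math_ok & reading_ok).fillna(False)

cleaned = toy[mask]
print(cleaned)
```

Here row `A2` is removed because its scores are missing, and row `A3` is removed because its math score exceeds 800.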


# SQL Schema / Integration Strategy

- The primary key for the table is **school_code** to uniquely identify each school.
- Removed columns that had low data quality or were redundant to simplify the schema.
- Ensured all numeric columns (like SAT scores) have appropriate data types (e.g., `INT` or `FLOAT`).
- Rows with invalid or missing critical values were dropped to maintain data integrity.
- This cleaned table can now be joined with other datasets on `school_code` for further analysis or reporting.
- Any downstream integrations should account for the absence of dropped columns (e.g., `school_name`, `contact_extension`).
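
As a rough sketch of the schema and join described above, the snippet below uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server. The table names `sat_results` and `school_directory`, the `borough` column, and the directory rows are hypothetical; the two SAT rows reuse values from the sample output. Note that `to_sql` does not declare constraints on its own, so a primary key like this would have to be added separately in the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, assumed for illustration only
conn.executescript("""
CREATE TABLE sat_results (
    school_code TEXT PRIMARY KEY,
    num_of_sat_test_takers INTEGER,
    sat_math_avg_score INTEGER CHECK (sat_math_avg_score BETWEEN 200 AND 800)
);
-- Hypothetical second dataset to join against
CREATE TABLE school_directory (
    school_code TEXT PRIMARY KEY,
    borough TEXT
);
""")

conn.executemany(
    "INSERT INTO sat_results VALUES (?, ?, ?)",
    [("01M292", 29, 404), ("01M448", 91, 423)],
)
conn.executemany(
    "INSERT INTO school_directory VALUES (?, ?)",
    [("01M292", "Manhattan"), ("01M448", "Manhattan")],
)

# Join the two tables on the shared key
rows = conn.execute("""
    SELECT s.school_code, d.borough, s.sat_math_avg_score
    FROM sat_results AS s
    JOIN school_directory AS d ON d.school_code = s.school_code
    ORDER BY s.school_code
""").fetchall()

# The primary key rejects a second row for the same school
try:
    conn.execute("INSERT INTO sat_results VALUES (?, ?, ?)", ("01M292", 30, 410))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Declaring the key in the database, rather than relying on the cleaning step alone, means later loads cannot silently introduce duplicate schools.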

In [84]:
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings("ignore")

In [85]:
DATABASE_URL = (
    "postgresql+psycopg2://neondb_owner:npg_CeS9fJg2azZD"
    "@ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech:5432/neondb"
    "?sslmode=require"
)

# Create engine and establish connection
engine = create_engine(DATABASE_URL)
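
Since the connection string embeds credentials, one option is to read it from the environment instead of the notebook. The sketch below assumes an environment variable named `DATABASE_URL`; both that name and the helper `get_database_url` are hypothetical and not part of the original notebook.

```python
import os


def get_database_url(env_var: str = "DATABASE_URL") -> str:
    """Return the connection string stored in the given environment variable."""
    url = os.environ.get(env_var)
    if not url:
        raise RuntimeError(f"Set {env_var} before creating the engine")
    return url


# Usage (commented out so the sketch runs without a database):
# engine = create_engine(get_database_url())
```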

In [86]:
df = pd.read_csv('/Users/giovanigoltara/Documents/webeet/onboarding-webeet/_onboarding_data-1/daily_tasks/day_4/day_4_datasets/sat-results.csv')
df.head()

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,SAT Critical Readng Avg. Score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,355,218160,x345,78%,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,383,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,377,236446,x123,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,414,427826,x123,92%,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,390,672714,x123,92%,2.0


In [87]:
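# Clean column names: lowercase, underscores for spaces and hyphens, no periods
# regex=False makes '.' a literal period rather than a regex wildcard
df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('-', '_').str.replace('.', '', regex=False)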
df.head()

Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,sat_critical_readng_avg_score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,355,218160,x345,78%,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,383,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,377,236446,x123,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,414,427826,x123,92%,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,390,672714,x123,92%,2.0


In [None]:
# Drop duplicate rows
new_df = df.drop_duplicates()

# Drop the duplicated reading-score column (spelled 'readng' in the source file)
new_df = new_df.drop(columns=['sat_critical_readng_avg_score'], errors='ignore')


# drop rows with 's' values
columns_to_check = [
    'num_of_sat_test_takers',
    'sat_critical_reading_avg_score',
    'sat_math_avg_score',
    'sat_writing_avg_score',
    'academic_tier_rating'
]

# Drop rows where every checked column holds 's'; any remaining 's' is coerced to NaN below
# .copy() avoids assigning into a slice of the original DataFrame
new_df = new_df[~(new_df[columns_to_check] == 's').all(axis=1)].copy()

# Convert columns to numeric (unparseable values become NaN), then to nullable Int64
new_df[columns_to_check] = new_df[columns_to_check].apply(pd.to_numeric, errors='coerce').astype('Int64')

# drop rows with values out of sat score range for math (200-800)
new_df = new_df[(new_df['sat_math_avg_score'] >= 200) & (new_df['sat_math_avg_score'] <= 800)]

# drop rows with values out of sat score range for critical reading (200-800)
new_df = new_df[(new_df['sat_critical_reading_avg_score'] >= 200) & (new_df['sat_critical_reading_avg_score'] <= 800)]

# drop unnecessary columns
new_df = new_df.drop(columns=['school_name', 'contact_extension', 'pct_students_tested'], errors='ignore')

In [89]:
# Write new_df to the database, replacing the table if it already exists
new_df.to_sql(
    name='giovani_sat_results',       
    con=engine,     
    schema='nyc_schools',
    if_exists='replace',    
    index=False            
)


416

In [96]:
# Save cleaned DataFrame to CSV in the day_4 folder
new_df.to_csv('/Users/giovanigoltara/Documents/webeet/onboarding-webeet/_onboarding_data-1/daily_tasks/day_4/sat-results-cleaned.csv', index=False)