In [33]:

import pandas as pd


In [34]:
# Step 1: Explore the Dataset
import pandas as pd


df_sat = pd.read_csv("sat-results.csv")


df_sat.head()


df_sat.info()


df_sat.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              493 non-null    object 
 1   SCHOOL NAME                      493 non-null    object 
 2   Num of SAT Test Takers           493 non-null    object 
 3   SAT Critical Reading Avg. Score  493 non-null    object 
 4   SAT Math Avg. Score              493 non-null    object 
 5   SAT Writing Avg. Score           493 non-null    object 
 6   SAT Critical Readng Avg. Score   493 non-null    object 
 7   internal_school_id               493 non-null    int64  
 8   contact_extension                388 non-null    object 
 9   pct_students_tested              376 non-null    object 
 10  academic_tier_rating             402 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 42.5+ KB


DBN                                  0
SCHOOL NAME                          0
Num of SAT Test Takers               0
SAT Critical Reading Avg. Score      0
SAT Math Avg. Score                  0
SAT Writing Avg. Score               0
SAT Critical Readng Avg. Score       0
internal_school_id                   0
contact_extension                  105
pct_students_tested                117
academic_tier_rating                91
dtype: int64

### Step 1: Explore the Dataset
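In this step, I loaded the SAT Results dataset and explored its structure before cleaning.
The dataset has 493 rows and 11 columns. Nine columns, including all SAT score fields, are stored as `object` rather than numeric, and three columns have missing values (`contact_extension`: 105, `pct_students_tested`: 117, `academic_tier_rating`: 91). The column `SAT Critical Readng Avg. Score` is a misspelled variant that appears to duplicate `SAT Critical Reading Avg. Score`.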
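
Since the score columns load as `object`, it can help to see which values block numeric conversion before coercing them. The following is a minimal sketch: `non_numeric_values` is a hypothetical helper, and the `"s"` placeholder in the sample is an assumption for illustration, not a value confirmed from the file.

```python
import pandas as pd


def non_numeric_values(series: pd.Series) -> list:
    """Return the distinct non-null values that cannot be parsed as numbers."""
    as_text = series.dropna().astype(str).str.strip()
    parsed = pd.to_numeric(as_text, errors="coerce")
    # Keep only the entries that became NaN during conversion.
    return sorted(as_text[parsed.isna()].unique())


# Hypothetical sample; the "s" placeholder is assumed, not taken from the dataset.
sample = pd.Series(["355", "383", "s", "414", None, "s"])
print(non_numeric_values(sample))
```

On the real data, this could be called as `non_numeric_values(df_sat["SAT Math Avg. Score"])` to decide whether `errors='coerce'` is a safe choice.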


In [35]:
# Step 2: Clean the Data

df_sat = df_sat.drop(columns=['internal_school_id', 'contact_extension'], errors='ignore')


df_sat.columns = (
    df_sat.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_')
    .str.replace('.', '', regex=False)
)

if 'pct_students_tested' in df_sat.columns:
    df_sat['pct_students_tested'] = (
        df_sat['pct_students_tested']
        .astype(str)
        .str.replace('%', '')
        .str.replace('N/A', '')
    )
    df_sat['pct_students_tested'] = pd.to_numeric(df_sat['pct_students_tested'], errors='coerce')


for col in ['sat_critical_reading_avg_score', 'sat_math_avg_score', 'sat_writing_avg_score']:
    df_sat[col] = pd.to_numeric(df_sat[col], errors='coerce')


for col in ['sat_critical_reading_avg_score', 'sat_math_avg_score', 'sat_writing_avg_score']:
    df_sat = df_sat[(df_sat[col] >= 200) & (df_sat[col] <= 800)]


df_sat = df_sat.drop_duplicates()


df_sat.head()


Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,sat_critical_readng_avg_score,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355.0,404.0,363.0,355,78.0,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383.0,423.0,366.0,383,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377.0,402.0,370.0,377,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414.0,401.0,359.0,414,92.0,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390.0,433.0,384.0,390,92.0,2.0


### Step 2: Clean the Data
In this step, I cleaned the dataset by:
- Removing irrelevant columns (`internal_school_id`, `contact_extension`)
- Standardizing column names
- Converting percentage and SAT score fields to numeric
- Filtering valid SAT scores (200–800), which also removes rows whose scores could not be converted to numbers
- Dropping duplicate rows
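
The range filter removes unparseable scores as well as out-of-range ones, because a comparison against `NaN` evaluates to `False`. Here is a small sketch of that behavior on a hypothetical three-row sample (the values are illustrative, not from the dataset). It uses `Series.between`, which is an equivalent way to express the `>= 200` and `<= 800` checks.

```python
import pandas as pd

# Hypothetical three-row sample; values are illustrative only.
raw = pd.DataFrame({
    "DBN": ["A", "B", "C"],
    "SAT Math Avg. Score": ["404", "s", "950"],
})

# Same column-name normalization as in the cleaning cell.
raw.columns = (
    raw.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
    .str.replace(".", "", regex=False)
)

# "s" cannot be parsed, so it becomes NaN.
raw["sat_math_avg_score"] = pd.to_numeric(raw["sat_math_avg_score"], errors="coerce")

# between() is inclusive on both ends by default; NaN yields False,
# so the unparseable row is dropped along with the out-of-range one.
valid = raw[raw["sat_math_avg_score"].between(200, 800)]

print(list(raw.columns))
print(valid["dbn"].tolist())
```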


In [36]:
# Step 3: Design the Schema


# Take an explicit copy so later changes do not trigger SettingWithCopyWarning
df_sat_final = df_sat[[
    'dbn',
    'school_name',
    'sat_math_avg_score',
    'sat_writing_avg_score',
    'sat_critical_reading_avg_score'
]].copy()


df_sat_final.head()


Unnamed: 0,dbn,school_name,sat_math_avg_score,sat_writing_avg_score,sat_critical_reading_avg_score
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,404.0,363.0,355.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,423.0,366.0,383.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,402.0,370.0,377.0
3,01M458,FORSYTH SATELLITE ACADEMY,401.0,359.0,414.0
4,01M509,MARTA VALLE HIGH SCHOOL,433.0,384.0,390.0


### Step 3: Design the Schema
In this step, I selected only the relevant columns that will be uploaded to the database:
- `dbn`
- `school_name`
- `sat_math_avg_score`
- `sat_writing_avg_score`
- `sat_critical_reading_avg_score`

Keeping `dbn` as the school identifier allows this table to be joined to the existing school directory table.
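
Before uploading, the selected columns can be checked against the intended schema. This is a minimal sketch under my own assumptions: `schema_problems` is a hypothetical helper, and the rule that `dbn` must be unique and non-null is an assumption about how the table will be joined, not something stated in the assignment.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "dbn",
    "school_name",
    "sat_math_avg_score",
    "sat_writing_avg_score",
    "sat_critical_reading_avg_score",
]
SCORE_COLUMNS = EXPECTED_COLUMNS[2:]


def schema_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable schema violations (empty if none)."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if df["dbn"].isna().any():
        problems.append("dbn contains nulls")
    if df["dbn"].duplicated().any():
        problems.append("dbn contains duplicates")
    for col in SCORE_COLUMNS:
        # between() is False for NaN, so missing scores are flagged too.
        if not df[col].between(200, 800).all():
            problems.append(f"{col} has values outside 200-800")
    return problems


# The first two rows shown in the Step 3 output.
sample = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "school_name": [
        "HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES",
        "UNIVERSITY NEIGHBORHOOD HIGH SCHOOL",
    ],
    "sat_math_avg_score": [404.0, 423.0],
    "sat_writing_avg_score": [363.0, 366.0],
    "sat_critical_reading_avg_score": [355.0, 383.0],
})
print(schema_problems(sample))
```

In the notebook, `schema_problems(df_sat_final)` would be expected to return an empty list before the upload cell runs.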


In [37]:
# Step 4: Append the Data to the Database

from sqlalchemy import create_engine
import psycopg2


DATABASE_URL = "postgresql+psycopg2://neondb_owner:a9Am7Yy5r9_T7h4OF2GN@ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech:5432/neondb"

engine = create_engine(DATABASE_URL)

try:
    with engine.begin() as connection:
        df_sat_final.to_sql(
            name='nuran_nalci_sat_scores',
            con=connection,
            if_exists='replace',
            index=False
        )
    print("✅ Data successfully appended to PostgreSQL database.")

except Exception as e:
    print("❌ Database connection failed or append unsuccessful.")
    print(e)


✅ Data successfully appended to PostgreSQL database.


### Step 4: Append the Data to the Database
In this step, I connected to the company's PostgreSQL database using SQLAlchemy with the psycopg2 driver.
The goal was to load the cleaned SAT dataset (`df_sat_final`) into a new table named `nuran_nalci_sat_scores` inside the nyc_schools database.

The process includes:
- Creating a database connection with the provided credentials (student user)
- Using the `to_sql()` method to write the cleaned data to PostgreSQL
- Applying `if_exists='replace'`, so the table is dropped and recreated on each run rather than appended to
- Managing the transaction with `engine.begin()`, which commits automatically on success and rolls back on error
- Handling any connection or data insertion errors with a `try`/`except` block
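
The difference between `if_exists='replace'` and `if_exists='append'` can be illustrated without touching the real database. The sketch below uses an in-memory SQLite connection as a stand-in for PostgreSQL; the table name `demo_sat_scores` and the `row_count` helper are hypothetical, and the two rows are taken from the Step 3 output.

```python
import sqlite3

import pandas as pd

rows = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "sat_math_avg_score": [404.0, 423.0],
})

# In-memory SQLite stands in for the PostgreSQL engine in this sketch.
conn = sqlite3.connect(":memory:")


def row_count(connection, table: str) -> int:
    """Count the rows currently stored in the given table."""
    result = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", connection)
    return int(result["n"].iloc[0])


# 'replace' drops and recreates the table, so running it twice keeps 2 rows.
rows.to_sql("demo_sat_scores", conn, if_exists="replace", index=False)
rows.to_sql("demo_sat_scores", conn, if_exists="replace", index=False)
count_after_replace = row_count(conn, "demo_sat_scores")

# 'append' adds to the existing table, so the same rows are now stored twice.
rows.to_sql("demo_sat_scores", conn, if_exists="append", index=False)
count_after_append = row_count(conn, "demo_sat_scores")

conn.close()
print(count_after_replace, count_after_append)
```

With `replace`, re-running the notebook is idempotent; with `append`, each run would add another copy of the data unless duplicates are handled separately.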

In [38]:
# Save the cleaned dataset
df_sat_final.to_csv("cleaned_sat_results.csv", index=False)
print("üíæ Cleaned dataset saved as 'cleaned_sat_results.csv'")


üíæ Cleaned dataset saved as 'cleaned_sat_results.csv'
