# Day 4. Data Integration & Schema Design: NYC SAT Results

Objective
Learn how to evaluate, clean, and integrate a real-world dataset into an existing PostgreSQL schema. You'll inspect the dataset, identify relational keys, clean inconsistencies, and write a Python-based script to append the data into the database.

Goals:

- Inspect and understand the structure of the dataset.
- Select meaningful and relational columns that link to existing tables.
- Identify issues in the data such as duplicates, outliers, or formatting inconsistencies.
- Clean and preprocess the data using Python.
- Prepare the data for database insertion.
- Write a Python script that connects to the database and appends the cleaned data.

**Instructions**

1. Explore the Dataset

- Open the CSV and review its structure.
- Refer to: daily_tasks/day_4/day_4_datasets/readme.md
- Identify which columns are useful and which are synthetic or dirty.

2. Clean the Data Using Python

- Handle duplicates, invalid SAT scores, inconsistent formatting (e.g., "85%"), outliers, and any other inconsistencies.
- Normalize headers and drop unrelated fields.

3. Design the Schema

- Choose the columns to upload to the database.

4. Write a Python Script to Append Data

- Use psycopg2 or sqlalchemy to connect.
- Append the cleaned data to your sat_scores table.
- Use parameterized queries and commit logic.

5. Save Your Work

In your branch (e.g., [your-name]/day-4), go to:
📁 daily_tasks/day_4/day_4_task/

Add:

- cleaned_sat_results.csv - the cleaned output as a CSV file
- sat_modeling.ipynb - your dataset cleaning and database insertion script

# 1. Explore the Dataset

In [20]:
#Import necessary libraries
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings("ignore")

In [21]:
# SQLAlchemy connection string format:
# postgresql+psycopg2://user:password@host:port/dbname?sslmode=require
# Read the URL from an environment variable so that credentials are not
# committed together with the notebook. The variable name DATABASE_URL is
# an assumption; use whatever name your environment defines.
import os

DATABASE_URL = os.environ["DATABASE_URL"]

# Create the engine (the actual connection is opened lazily on first use)
engine = create_engine(DATABASE_URL)

In [22]:
# Open the CSV and review its structure (pandas was already imported above)
df = pd.read_csv('/Users/svitlanakovalivska/onboarding_weebet/_onboarding_data-1/daily_tasks/day_4/day_4_datasets/sat-results.csv')
df

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,SAT Critical Readng Avg. Score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,355,218160,x345,78%,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,383,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,377,236446,x123,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,414,427826,x123,92%,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,390,672714,x123,92%,2.0
...,...,...,...,...,...,...,...,...,...,...,...
488,27Q480,JOHN ADAMS HIGH SCHOOL,403,391,409,392,391,863765,,92%,1.0
489,13K605,GEORGE WESTINGHOUSE CAREER AND TECHNICAL EDUCA...,85,406,391,392,406,937579,x234,,
490,05M304,MOTT HALL HIGH SCHOOL,54,413,399,398,413,296405,x123,78%,2.0
491,02M520,MURRY BERGTRAUM HIGH SCHOOL FOR BUSINESS CAREERS,264,407,440,393,407,892839,,92%,2.0


In [23]:
#Identify which columns are useful and which are synthetic or dirty

# Check for null values and data types
df.info()

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Check for unique values in each column
unique_values = {col: df[col].nunique() for col in df.columns}
print("Unique values in each column:")
for col, count in unique_values.items():
    print(f"{col}: {count}")        

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:")
print(missing_values)

# Check for outliers in numerical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
for col in numerical_cols:
    print(f"Descriptive statistics for {col}:")
    print(df[col].describe())
    print("\n")




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              493 non-null    object 
 1   SCHOOL NAME                      493 non-null    object 
 2   Num of SAT Test Takers           493 non-null    object 
 3   SAT Critical Reading Avg. Score  493 non-null    object 
 4   SAT Math Avg. Score              493 non-null    object 
 5   SAT Writing Avg. Score           493 non-null    object 
 6   SAT Critical Readng Avg. Score   493 non-null    object 
 7   internal_school_id               493 non-null    int64  
 8   contact_extension                388 non-null    object 
 9   pct_students_tested              376 non-null    object 
 10  academic_tier_rating             402 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 42.5+ KB
Number of duplicate rows: 

In [24]:
# Percentage of null values in each column
missing_percentage = df.isnull().mean() * 100
print("Percentage of missing values in each column:")
print(missing_percentage)

Percentage of missing values in each column:
DBN                                 0.000000
SCHOOL NAME                         0.000000
Num of SAT Test Takers              0.000000
SAT Critical Reading Avg. Score     0.000000
SAT Math Avg. Score                 0.000000
SAT Writing Avg. Score              0.000000
SAT Critical Readng Avg. Score      0.000000
internal_school_id                  0.000000
contact_extension                  21.298174
pct_students_tested                23.732252
academic_tier_rating               18.458418
dtype: float64


In [25]:
#Drop the duplicate rows
df = df.drop_duplicates()

In [26]:
#Column names review
print(f"Column names in the DataFrame: {df.columns} ")


#Check the duplicated column names  
duplicated_columns = df.columns[df.columns.duplicated()].tolist()
if duplicated_columns:
    print("Duplicated column names found:")
    print(duplicated_columns)
else:
    print("No duplicated column names found.")  

#Check if 'SAT Critical Reading Avg. Score' and 'SAT Critical Readng Avg. Score' columns are identical
if 'SAT Critical Reading Avg. Score' in df.columns and 'SAT Critical Readng Avg. Score' in df.columns:
    if df['SAT Critical Reading Avg. Score'].equals(df['SAT Critical Readng Avg. Score']):
        print("The columns 'SAT Critical Reading Avg. Score' and 'SAT Critical Readng Avg. Score' are identical.")
    else:
        print("The columns 'SAT Critical Reading Avg. Score' and 'SAT Critical Readng Avg. Score' are different.")
else:
    print("One or both of the columns 'SAT Critical Reading Avg. Score' and 'SAT Critical Readng Avg. Score' do not exist in the DataFrame.")   




Column names in the DataFrame: Index(['DBN', 'SCHOOL NAME', 'Num of SAT Test Takers',
       'SAT Critical Reading Avg. Score', 'SAT Math Avg. Score',
       'SAT Writing Avg. Score', 'SAT Critical Readng Avg. Score',
       'internal_school_id', 'contact_extension', 'pct_students_tested',
       'academic_tier_rating'],
      dtype='object') 
No duplicated column names found.
The columns 'SAT Critical Reading Avg. Score' and 'SAT Critical Readng Avg. Score' are identical.


Below is a description of the key columns in the NYC SAT results dataset. This reference will help you understand what each field represents as you clean, explore, and analyze the data.

| Column Name | Description |
| --- | --- |
| DBN | District Borough Number, a unique code identifying each school (e.g., 01M292) |
| SCHOOL NAME | The full official name of the high school |
| Num of SAT Test Takers | Number of students from the school who took the SAT exam |
| SAT Critical Reading Avg. Score | Average score achieved in the Critical Reading section (valid: 200–800) |
| SAT Math Avg. Score | Average score achieved in the Math section (valid: 200–800) |
| SAT Writing Avg. Score | Average score achieved in the Writing section (valid: 200–800) |
| SAT Critical Readng Avg. Score | Duplicate of the Critical Reading score with a typo in the column name |
| internal_school_id | Possibly a system-generated school ID (unverified) |
| contact_extension | Phone extension (e.g., "x234"); unchecked |
| pct_students_tested | Percentage of students tested (as a string, e.g., "85%", "N/A") |
| academic_tier_rating | Performance tier (scale 1–4); may contain nulls |
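
Because DBN is the natural key for linking to other tables, it can help to confirm that every value follows one layout before relying on it. The sketch below assumes a layout that is not stated in the dataset readme: a two-digit district, one borough letter (M, X, K, Q, or R), and a three-digit school number. The helper name `is_valid_dbn` is made up for illustration.

```python
import re

# Assumed layout (not confirmed by the dataset readme):
# 2-digit district, 1 borough letter (M, X, K, Q, R), 3-digit school number
DBN_PATTERN = re.compile(r"\d{2}[MXKQR]\d{3}")


def is_valid_dbn(value: str) -> bool:
    """Return True when the value matches the assumed DBN layout."""
    return DBN_PATTERN.fullmatch(value.strip().upper()) is not None


print(is_valid_dbn("01M292"))  # True
print(is_valid_dbn("1M292"))   # False: the district has only one digit
```

On the real DataFrame, something like `df[~df['DBN'].map(is_valid_dbn)]` would list the rows that do not match, if any.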


In [27]:
# Check why some columns have the object type: list the distinct values (only for columns where the number of unique values is less than the number of rows)
for col in df.columns:
    if df[col].dtype == 'object' and df[col].nunique() < len(df):
        print(f"Column '{col}' has object type with unique values:")
        print(df[col].unique())
        print("\n")
# Check for percentage columns
percentage_columns = [col for col in df.columns if df[col].dtype == 'object' and df[col].str.contains('%').any()]
if percentage_columns:
    print("Percentage columns found:")
    for col in percentage_columns:
        print(f"{col}: {df[col].unique()}")
else:
    print("No percentage columns found.")




Column 'Num of SAT Test Takers' has object type with unique values:
['29' '91' '70' '7' '44' '112' '159' '18' '130' '16' '62' '53' '58' '85'
 '48' '76' '50' '40' '69' '42' '60' '92' 's' '79' '263' '54' '94' '104'
 '114' '66' '103' '127' '144' '336' '84' '95' '59' '72' '49' '151' '832'
 '167' '25' '81' '264' '131' '73' '14' '78' '26' '77' '56' '30' '33' '121'
 '9' '335' '36' '83' '154' '191' '270' '61' '27' '41' '12' '32' '261'
 '531' '75' '35' '111' '43' '375' '51' '31' '20' '214' '101' '55' '63'
 '24' '228' '65' '34' '64' '28' '47' '52' '67' '39' '415' '6' '68' '80'
 '74' '38' '113' '86' '57' '443' '731' '109' '99' '10' '46' '97' '189'
 '37' '1277' '90' '105' '8' '13' '89' '185' '102' '134' '142' '141' '71'
 '165' '259' '17' '182' '456' '238' '694' '385' '475' '727' '448' '119'
 '824' '518' '236' '11' '155' '320' '241' '138' '396' '45' '558' '347'
 '278' '888' '934' '334' '708' '175' '87' '93' '404' '403' '194' '762'
 '462' '422' '98' '395' '392' '174' '148' '143' '135' '137' '107' '3

In [28]:
# Count the rows with s-values in each column, and the percentage of such rows per column.
# Note: str.contains('s') matches any value that contains a lowercase "s", not only values equal to 's'.
s_count = 0
for col in df.columns:
    if df[col].dtype == 'object':
        s_mask = df[col].str.contains('s', na=False)
        if s_mask.any():
            count = s_mask.sum()
            percentage = (count / len(df)) * 100
            print(f"Column '{col}' has {count} 's' values ({percentage:.2f}%)")
            s_count += count    
# Print total count of 's' values
print(f"Total 's' values across all columns: {s_count}")

Column 'SCHOOL NAME' has 7 's' values (1.46%)
Column 'Num of SAT Test Takers' has 57 's' values (11.92%)
Column 'SAT Critical Reading Avg. Score' has 57 's' values (11.92%)
Column 'SAT Math Avg. Score' has 57 's' values (11.92%)
Column 'SAT Writing Avg. Score' has 57 's' values (11.92%)
Column 'SAT Critical Readng Avg. Score' has 57 's' values (11.92%)
Total 's' values across all columns: 292


Note: about 12% of the rows (57 of 478) hold an s-value in the SAT columns. That is more than 5%, so we can't just delete these rows; we need to decide how to handle them.
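
The count above relies on `str.contains('s')`, which also matches longer strings that merely contain a lowercase "s". That would explain the 7 hits in SCHOOL NAME, although this has not been checked against the file. A minimal sketch on made-up rows shows the difference between a substring match and an exact match with `Series.eq`:

```python
import pandas as pd

# Made-up rows: one real placeholder and one school name that merely contains an "s"
sample = pd.DataFrame({
    "SCHOOL NAME": ["Bronx HS for the Visual Arts", "EAST SIDE COMMUNITY SCHOOL"],
    "SAT Math Avg. Score": ["s", "402"],
})

# Substring match: counts the school name because "Arts" contains an "s"
substring_hits = int(sample["SCHOOL NAME"].str.contains("s", na=False).sum())

# Exact match: counts only cells whose entire value is 's'
exact_hits = {col: int(sample[col].eq("s").sum()) for col in sample.columns}

print(substring_hits)  # 1
print(exact_hits)      # {'SCHOOL NAME': 0, 'SAT Math Avg. Score': 1}
```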

Results of Step 1:

The following columns are probably not useful for the SAT dataset and should be removed: SAT Critical Readng Avg. Score (a duplicate of the Critical Reading score with a typo in its name), internal_school_id (not needed for SAT analysis), and contact_extension (contact information is not needed, and about 21% of its values are missing).

The remaining column names should be normalized to one consistent format, and each column should get the correct data type. This includes removing the % sign, handling missing data, and converting the s-values to NaN first so that all rows stay in the dataset.
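
One way to get NaN for the s-values right at load time is the `na_values` parameter of `pd.read_csv`. The sketch below uses a tiny made-up CSV rather than the real file and limits the placeholder to the numeric columns so that text columns are left alone.

```python
import io
import pandas as pd

# Tiny made-up sample that mimics the layout of the real file
sample_csv = io.StringIO(
    "DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Math Avg. Score\n"
    "01M292,SCHOOL A,29,404\n"
    "01M448,SCHOOL B,s,s\n"
)

# Treat 's' as missing only in the numeric columns
numeric_columns = ["Num of SAT Test Takers", "SAT Math Avg. Score"]
sample = pd.read_csv(sample_csv, na_values={col: ["s"] for col in numeric_columns})

print(sample.dtypes)        # the two numeric columns load as float64
print(sample.isna().sum())  # one missing value in each numeric column
```

The columns come back as float64 because NaN cannot be stored in a plain integer column.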


# 2. Clean the Data Using Python

In [29]:
# Data cleaning strategy

# Work on a copy so the raw DataFrame stays available for comparison
df_cleaned_fixed = df.copy()

# 1. Remove the duplicate column with the misspelled name
df_cleaned_fixed = df_cleaned_fixed.drop(columns=['SAT Critical Readng Avg. Score'], errors='ignore')

# 2. Normalize column names to snake_case
df_cleaned_fixed.columns = (df_cleaned_fixed.columns
                           .str.strip()                         # Remove leading/trailing spaces
                           .str.replace(' ', '_', regex=False)  # Replace spaces with underscores
                           .str.replace('.', '', regex=False)   # Remove literal periods
                           .str.lower())                        # Convert to lowercase

print(f"\n Fixed column names:")
for i, col in enumerate(df_cleaned_fixed.columns, 1):
    print(f"{i}. '{col}'")

# 3. Convert the count and score columns to numbers.
# errors='coerce' turns the placeholder 's' (and any other non-numeric text) into NaN,
# so no separate replace step is needed.
numeric_cols = ['num_of_sat_test_takers', 'sat_critical_reading_avg_score',
                'sat_math_avg_score', 'sat_writing_avg_score']
for col in numeric_cols:
    df_cleaned_fixed[col] = pd.to_numeric(df_cleaned_fixed[col], errors='coerce')

# 4. Fix the percentage column: strip '%' and parse as float (missing values stay NaN here)
if 'pct_students_tested' in df_cleaned_fixed.columns:
    df_cleaned_fixed['pct_students_tested'] = df_cleaned_fixed['pct_students_tested'].str.replace('%', '', regex=False).astype(float)

# 5. Cast every float column to int.
# Caution: fillna(0) stores missing values as 0, so from here on 0 means "missing"
# in the count, score, percentage, and tier columns.
for col in df_cleaned_fixed.select_dtypes(include=['float64']).columns:
    df_cleaned_fixed[col] = df_cleaned_fixed[col].fillna(0).astype(int)

print(f"\n CORRECTED DATASET:")
print(f"Shape: {df_cleaned_fixed.shape}")
print(f"Columns: {list(df_cleaned_fixed.columns)}")
print(f"\nFirst 3 rows:")
df_cleaned_fixed.head(3)


 Fixed column names:
1. 'dbn'
2. 'school_name'
3. 'num_of_sat_test_takers'
4. 'sat_critical_reading_avg_score'
5. 'sat_math_avg_score'
6. 'sat_writing_avg_score'
7. 'internal_school_id'
8. 'contact_extension'
9. 'pct_students_tested'
10. 'academic_tier_rating'

 CORRECTED DATASET:
Shape: (478, 10)
Columns: ['dbn', 'school_name', 'num_of_sat_test_takers', 'sat_critical_reading_avg_score', 'sat_math_avg_score', 'sat_writing_avg_score', 'internal_school_id', 'contact_extension', 'pct_students_tested', 'academic_tier_rating']

First 3 rows:


Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,218160,x345,78,2
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,268547,x234,0,3
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,236446,x123,0,3
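
The cell above fills missing values with 0 before casting to int, which is why pct_students_tested shows 0 for schools that had no value. If keeping "missing" distinct from "zero" matters, pandas' nullable `Int64` dtype is one option. This is a sketch of an alternative on a made-up Series, not what the notebook does:

```python
import pandas as pd

raw = pd.Series(["355", "s", "404"], name="sat_math_avg_score")

# errors='coerce' maps 's' to NaN; the nullable Int64 dtype keeps it as <NA> instead of 0
scores = pd.to_numeric(raw, errors="coerce").astype("Int64")

# For comparison: the fill-with-zero approach
filled = pd.to_numeric(raw, errors="coerce").fillna(0).astype(int)

print(scores.isna().sum())  # 1
print(scores.mean())        # 379.5, the missing value is skipped
print(filled.mean())        # 253.0, the zero drags the mean down
```

The difference in the means illustrates why zero-filled placeholders distort summary statistics such as the ones produced by `describe()`.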


In [30]:
# Check the cleaned data

# Start with a copy of the cleaned data from the previous step
df_cleaned = df_cleaned_fixed.copy()

# 1. Check for duplicate rows once more
initial_rows = len(df_cleaned)
df_cleaned = df_cleaned.drop_duplicates()
duplicates_removed = initial_rows - len(df_cleaned)
print(f"   Removed {duplicates_removed} duplicate rows")
print(f"   Rows: {initial_rows} → {len(df_cleaned)}")


print("\n CLEANING RESULTS")
print(f" Final dataset shape: {df_cleaned.shape}")
print(f"Data types after cleaning:")
print(df_cleaned.dtypes)

print(f"\n Missing values after cleaning:")
missing_after = df_cleaned.isnull().sum()
print(missing_after[missing_after > 0])

   Removed 0 duplicate rows
   Rows: 478 → 478

 CLEANING RESULTS
 Final dataset shape: (478, 10)
Data types after cleaning:
dbn                               object
school_name                       object
num_of_sat_test_takers             int64
sat_critical_reading_avg_score     int64
sat_math_avg_score                 int64
sat_writing_avg_score              int64
internal_school_id                 int64
contact_extension                 object
pct_students_tested                int64
academic_tier_rating               int64
dtype: object

 Missing values after cleaning:
contact_extension    100
dtype: int64


In [31]:
# Anomalies validation


# 1. Check SAT score ranges are reasonable
print("\n SAT SCORE RANGE VALIDATION:")
sat_score_cols = ['sat_critical_reading_avg_score', 'sat_math_avg_score', 'sat_writing_avg_score']
for col in sat_score_cols:
    valid_scores = df_cleaned[col].dropna()
    min_score = valid_scores.min()
    max_score = valid_scores.max()
    mean_score = valid_scores.mean()
    
    # SAT scores should be between 200-800
    if min_score >= 200 and max_score <= 800:
        status = "Valid range"
    else:
        status = "Invalid range"
    
    print(f"{col}:")
    print(f"Range: {min_score:.0f} - {max_score:.0f} {status}")
    print(f"Mean: {mean_score:.1f}")

# 2. Check percentage values
print("\n PERCENTAGE VALIDATION:")
if 'pct_students_tested' in df_cleaned.columns:
    pct_valid = df_cleaned['pct_students_tested'].dropna()
    pct_min = pct_valid.min()
    pct_max = pct_valid.max()
    
    if pct_min >= 0 and pct_max <= 100:
        status = "Valid percentage range"
    else:
        status = "Invalid percentage range"
    
    print(f"pct_students_tested: {pct_min:.0f}% - {pct_max:.0f}% {status}")

# 3. Check for remaining data quality issues
print("\n REMAINING DATA QUALITY CHECKS:")
print(f"Duplicate rows: {df_cleaned.duplicated().sum()}")
print(f"Unique schools (DBN): {df_cleaned['dbn'].nunique()}")

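# 4. Summary and descriptive statistics for the cleaned data
# Note: the summary lines below are fixed text; they print regardless of the check results above.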

print("\n FINAL DATASET READY FOR DATABASE!")
print(f"Dataset shape: {df_cleaned.shape}")
print(f"All numerical columns properly typed")
print(f"No duplicate rows")
print(f"Valid SAT score ranges (200-800)")
print(f"Valid percentage ranges (0-100%)")
print(f"Missing values properly handled as NaN")
print("\n CLEANED DATA SUMMARY:")
df_cleaned.describe()


 SAT SCORE RANGE VALIDATION:
sat_critical_reading_avg_score:
Range: 0 - 679 Invalid range
Mean: 353.1
sat_math_avg_score:
Range: -10 - 1100 Invalid range
Mean: 368.3
sat_writing_avg_score:
Range: 0 - 682 Invalid range
Mean: 347.0

 PERCENTAGE VALIDATION:
pct_students_tested: 0% - 92% Valid percentage range

 REMAINING DATA QUALITY CHECKS:
Duplicate rows: 0
Unique schools (DBN): 478

 FINAL DATASET READY FOR DATABASE!
Dataset shape: (478, 10)
All numerical columns properly typed
No duplicate rows
Valid SAT score ranges (200-800)
Valid percentage ranges (0-100%)
Missing values properly handled as NaN

 CLEANED DATA SUMMARY:


Unnamed: 0,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,internal_school_id,pct_students_tested,academic_tier_rating
count,478.0,478.0,478.0,478.0,478.0,478.0,478.0
mean,97.165272,353.050209,368.307531,347.004184,560082.717573,64.242678,2.115063
std,150.270071,140.542605,158.920585,139.155359,259637.064755,36.532971,1.423575
min,0.0,0.0,-10.0,0.0,101855.0,0.0,0.0
25%,30.0,358.25,362.25,351.0,337012.5,78.0,1.0
50%,56.5,384.0,387.5,376.0,581301.5,78.0,2.0
75%,89.0,411.75,432.0,403.0,778312.75,85.0,3.0
max,1277.0,679.0,1100.0,682.0,999398.0,92.0,4.0
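
The summary lines in the cell above are fixed strings, so they report "Valid SAT score ranges" even though the range check found values such as -10 and 1100. A minimal sketch of a summary that is derived from the data instead; the helper name `score_range_report` and the sample rows are made up:

```python
import pandas as pd


def score_range_report(df: pd.DataFrame, columns, low: int = 200, high: int = 800) -> dict:
    """Map each column name to True when all of its values fall inside [low, high]."""
    return {col: bool(df[col].between(low, high).all()) for col in columns}


sample = pd.DataFrame({
    "sat_math_avg_score": [404, -10, 1100],
    "sat_writing_avg_score": [363, 366, 370],
})

report = score_range_report(sample, sample.columns)
for col, ok in report.items():
    print(f"{col}: {'valid' if ok else 'invalid'} range")
```

`Series.between` is inclusive on both ends by default, and it returns False for missing values, so a column containing NaN would be reported as invalid by this helper.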


In [32]:
# Count the out-of-range values (outside 200-800), as a number and a percentage, for each column in sat_score_cols
for col in sat_score_cols:
    out_of_range = df_cleaned[(df_cleaned[col] < 200) | (df_cleaned[col] > 800)]
    count_out_of_range = len(out_of_range)
    percentage_out_of_range = (count_out_of_range / len(df_cleaned)) * 100
    
    print(f"   {col} out of range: {count_out_of_range} ({percentage_out_of_range:.2f}%)")

# Drop the rows with out-of-range SAT scores.
# This also removes the rows whose scores were 's', because they were stored as 0 earlier.
for col in sat_score_cols:
    df_cleaned = df_cleaned[(df_cleaned[col] >= 200) & (df_cleaned[col] <= 800)]    

# Check for remaining anomalies after cleaning
print("\nREMAINING ANOMALIES CHECKS")
print(f"Remaining duplicate rows: {df_cleaned.duplicated().sum()}")
print(f"Unique schools (DBN): {df_cleaned['dbn'].nunique()}")
# Check for percentage values again
if 'pct_students_tested' in df_cleaned.columns:
    pct_valid = df_cleaned['pct_students_tested'].dropna()
    pct_min = pct_valid.min()
    pct_max = pct_valid.max()
    
    if pct_min >= 0 and pct_max <= 100:
        status = " Valid percentage range"
    else:
        status = " Invalid percentage range"
    
    print(f"   pct_students_tested: {pct_min:.0f}% - {pct_max:.0f}% {status}")


# Final validation checks

print("\n FINAL VALIDATION CHECKS AFTER CLEANING:")
print(f"Dataset shape: {df_cleaned.shape}")
print(f"Data types after cleaning:")
print(df_cleaned.dtypes)
print(f"\n Missing values after cleaning:")
missing_after = df_cleaned.isnull().sum()
print(missing_after[missing_after > 0]) 


print(f"All numerical columns properly typed")
print(f"No duplicate rows")
print(f"Valid SAT score ranges (200-800)")
print(f"Valid percentage ranges (0-100%)")


df_cleaned.describe()


   sat_critical_reading_avg_score out of range: 57 (11.92%)
   sat_math_avg_score out of range: 62 (12.97%)
   sat_writing_avg_score out of range: 57 (11.92%)

REMAINING ANOMALIES CHECKS
Remaining duplicate rows: 0
Unique schools (DBN): 416
   pct_students_tested: 0% - 92%  Valid percentage range

 FINAL VALIDATION CHECKS AFTER CLEANING:
Dataset shape: (416, 10)
Data types after cleaning:
dbn                               object
school_name                       object
num_of_sat_test_takers             int64
sat_critical_reading_avg_score     int64
sat_math_avg_score                 int64
sat_writing_avg_score              int64
internal_school_id                 int64
contact_extension                 object
pct_students_tested                int64
academic_tier_rating               int64
dtype: object

 Missing values after cleaning:
contact_extension    85
dtype: int64
All numerical columns properly typed
No duplicate rows
Valid SAT score ranges (200-800)
Valid percentage ranges (0

Unnamed: 0,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,internal_school_id,pct_students_tested,academic_tier_rating
count,416.0,416.0,416.0,416.0,416.0,416.0,416.0
mean,110.769231,401.067308,413.733173,394.175481,572765.245192,63.71875,2.163462
std,156.354878,57.017818,64.945638,58.91534,257828.614058,36.929242,1.397834
min,6.0,279.0,312.0,286.0,102816.0,0.0,0.0
25%,41.0,368.0,372.0,360.0,353089.5,78.0,1.0
50%,62.0,391.0,395.0,381.5,602509.5,78.0,2.0
75%,95.5,416.25,437.25,411.0,786460.0,85.0,3.0
max,1277.0,679.0,735.0,682.0,999398.0,92.0,4.0
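
Dropping the out-of-range rows reduces the dataset from 478 to 416 schools. If the schools themselves should stay in the table, one alternative is to blank out only the invalid cells. The following is a sketch of that option on made-up rows, using the nullable `Int64` dtype; it is not what the notebook does.

```python
import pandas as pd

sample = pd.DataFrame({
    "dbn": ["01M292", "01M448", "01M450"],
    "sat_math_avg_score": [404, 0, 1100],
    "sat_writing_avg_score": [363, 0, 370],
})
score_cols = ["sat_math_avg_score", "sat_writing_avg_score"]

# Keep every school; replace only the out-of-range cells with <NA>
masked = sample.copy()
for col in score_cols:
    in_range = masked[col].between(200, 800)
    masked[col] = masked[col].astype("Int64").where(in_range)

print(len(masked))                                # 3, no rows were dropped
print(masked[score_cols].isna().sum().to_dict())  # {'sat_math_avg_score': 2, 'sat_writing_avg_score': 1}
```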


In [33]:
# Display the cleaned dataset
print("CLEANED DATASET PREVIEW:")
df_cleaned.head()


CLEANED DATASET PREVIEW:


Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,218160,x345,78,2
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,268547,x234,0,3
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,236446,x123,0,3
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,427826,x123,92,4
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,672714,x123,92,2


# 3. Design the Schema

Now we'll choose which columns to upload to the database and design our table structure.

In [34]:
# Remove unwanted columns
columns_to_drop = ['internal_school_id', 'contact_extension']
df_cleaned = df_cleaned.drop(columns=columns_to_drop, errors='ignore')

In [35]:
#  SCHEMA DESIGN FOR DATABASE

print("DESIGNING DATABASE SCHEMA\n")
# Define the schema for the cleaned SAT results dataset
schema = {
    'dbn': 'VARCHAR(10) PRIMARY KEY',
    'school_name': 'VARCHAR(255)',
    'num_of_sat_test_takers': 'INTEGER',
    'sat_critical_reading_avg_score': 'INTEGER',
    'sat_math_avg_score': 'INTEGER',
    'sat_writing_avg_score': 'INTEGER',
    'pct_students_tested': 'INTEGER',
    'academic_tier_rating': 'INTEGER'
}
# Create the SQL CREATE TABLE statement
create_table_sql = "CREATE TABLE IF NOT EXISTS sat_results (\n"
for column, data_type in schema.items():
    create_table_sql += f"    {column} {data_type},\n"
create_table_sql = create_table_sql.rstrip(',\n') + "\n);"      
print(create_table_sql)


DESIGNING DATABASE SCHEMA

CREATE TABLE IF NOT EXISTS sat_results (
    dbn VARCHAR(10) PRIMARY KEY,
    school_name VARCHAR(255),
    num_of_sat_test_takers INTEGER,
    sat_critical_reading_avg_score INTEGER,
    sat_math_avg_score INTEGER,
    sat_writing_avg_score INTEGER,
    pct_students_tested INTEGER,
    academic_tier_rating INTEGER
);
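
The valid ranges checked in Python (200–800 for scores, 0–100 for the percentage) could also be enforced by the database itself through CHECK constraints. The sketch below builds such a statement for a shortened column list; the helper name `build_create_table` is made up, and the constraints are a suggestion rather than part of the schema above.

```python
# Shortened column list for illustration; the CHECK clauses are an assumption,
# not part of the schema designed above
schema_with_checks = {
    "dbn": "VARCHAR(10) PRIMARY KEY",
    "school_name": "VARCHAR(255)",
    "sat_math_avg_score": "INTEGER CHECK (sat_math_avg_score BETWEEN 200 AND 800)",
    "pct_students_tested": "INTEGER CHECK (pct_students_tested BETWEEN 0 AND 100)",
}


def build_create_table(table: str, columns: dict) -> str:
    """Join the column definitions into a single CREATE TABLE statement."""
    body = ",\n".join(f"    {name} {definition}" for name, definition in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{body}\n);"


ddl = build_create_table("sat_results", schema_with_checks)
print(ddl)
```

In PostgreSQL a CHECK constraint is satisfied when the expression evaluates to NULL, so missing scores stored as NULL would still be accepted, while a placeholder 0 would be rejected.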


In [36]:
# Draw the schema diagram

# Note: This is a placeholder for the schema diagram. In practice, you would use a tool like pgAdmin or an online ERD tool to visualize the schema.
print("\n SCHEMA DIAGRAM:")
print("┌─────────────────────────────┐")
print("│         sat_results          │")
print("├─────────────────────────────┤")
for column, data_type in schema.items():
    print(f"│ {column.ljust(25)} {data_type.ljust(10)} │")
print("└─────────────────────────────┘")            



 SCHEMA DIAGRAM:
┌─────────────────────────────┐
│         sat_results          │
├─────────────────────────────┤
│ dbn                       VARCHAR(10) PRIMARY KEY │
│ school_name               VARCHAR(255) │
│ num_of_sat_test_takers    INTEGER    │
│ sat_critical_reading_avg_score INTEGER    │
│ sat_math_avg_score        INTEGER    │
│ sat_writing_avg_score     INTEGER    │
│ pct_students_tested       INTEGER    │
│ academic_tier_rating      INTEGER    │
└─────────────────────────────┘
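
The box above is ragged because the fixed `ljust(25)` and `ljust(10)` widths are shorter than the longest column name and type. A small sketch that derives the widths from the content instead; the function name `render_table_box` is made up:

```python
def render_table_box(title: str, columns: dict) -> list:
    """Return the lines of a text box whose width adapts to the longest entry."""
    name_width = max(len(name) for name in columns)
    type_width = max(len(dtype) for dtype in columns.values())
    # One space of padding on each side of "name type"
    inner = max(name_width + 1 + type_width, len(title)) + 2

    lines = [
        "┌" + "─" * inner + "┐",
        "│" + title.center(inner) + "│",
        "├" + "─" * inner + "┤",
    ]
    for name, dtype in columns.items():
        row = f" {name.ljust(name_width)} {dtype}"
        lines.append("│" + row.ljust(inner) + "│")
    lines.append("└" + "─" * inner + "┘")
    return lines


demo_schema = {
    "dbn": "VARCHAR(10) PRIMARY KEY",
    "sat_critical_reading_avg_score": "INTEGER",
}
box_lines = render_table_box("sat_results", demo_schema)
print("\n".join(box_lines))
```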


# 4. Write a Python Script to Append Data

Upload the cleaned data to the PostgreSQL database using SQLAlchemy.

In [37]:
# DATABASE UPLOAD IMPLEMENTATION

try:
    # Upload cleaned data to database
    table_name = 'svitlana_sat_results'  
    schema_name = 'nyc_schools'
    
    print(f"Uploading data to table: {schema_name}.{table_name}")
    print(f"Uploading {len(df_cleaned)} rows with {len(df_cleaned.columns)} columns")
    
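    # Upload to database
    df_cleaned.to_sql(
        name=table_name,
        con=engine,
        schema=schema_name,
        if_exists='replace',    # drops and recreates the table on every run; use 'append' to add rows to an existing table
        index=False,            # do not write the DataFrame index as a column
        method='multi'          # send multiple rows per INSERT statement
    )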
    
    print("SUCCESS: Data uploaded to database!")
    print(f"Table created: {schema_name}.{table_name}")
    
    # Verify upload by counting rows (corrected code)
    from sqlalchemy import text
    verification_query = text(f"SELECT COUNT(*) FROM {schema_name}.{table_name}")
    
    with engine.connect() as connection:
        result = connection.execute(verification_query)
        row_count = result.fetchone()[0]
    
    print(f"VERIFICATION: {row_count} rows found in database table")
    
    if row_count == len(df_cleaned):
        print("All rows successfully uploaded!")
    else:
        print(f" Warning: Expected {len(df_cleaned)} rows, but found {row_count}")
        
except Exception as e:
    print(f" ERROR uploading to database: {str(e)}")
    print("Please check your database connection and permissions.")

Uploading data to table: nyc_schools.svitlana_sat_results
Uploading 416 rows with 8 columns
SUCCESS: Data uploaded to database!
Table created: nyc_schools.svitlana_sat_results
VERIFICATION: 416 rows found in database table
All rows successfully uploaded!
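
The task asks for parameterized queries and commit logic, while the upload above delegates everything to `to_sql`. A minimal sketch of an explicit append with SQLAlchemy follows. It assumes the target table already exists with `dbn` as its primary key (needed for `ON CONFLICT`), and the helper names `build_insert`, `to_native`, `to_rows`, and `append_rows` are made up for illustration.

```python
import pandas as pd


def build_insert(table: str, columns: list) -> str:
    """Build an INSERT with named placeholders (:col) instead of inlined values."""
    names = ", ".join(columns)
    placeholders = ", ".join(f":{col}" for col in columns)
    return (f"INSERT INTO {table} ({names}) VALUES ({placeholders}) "
            "ON CONFLICT (dbn) DO NOTHING")


def to_native(value):
    """Convert NaN to None and NumPy scalars to plain Python values."""
    if pd.isna(value):
        return None
    return value.item() if hasattr(value, "item") else value


def to_rows(df: pd.DataFrame) -> list:
    """Turn the DataFrame into a list of parameter dictionaries."""
    return [{k: to_native(v) for k, v in rec.items()} for rec in df.to_dict("records")]


def append_rows(engine, table: str, df: pd.DataFrame) -> None:
    """Run the insert in one transaction: committed on success, rolled back on error."""
    from sqlalchemy import text  # imported here so the helpers above work without a database
    with engine.begin() as connection:
        connection.execute(text(build_insert(table, list(df.columns))), to_rows(df))


sample = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "num_of_sat_test_takers": [29, 91],
    "pct_students_tested": [78.0, float("nan")],
})
insert_sql = build_insert("nyc_schools.sat_results", list(sample.columns))
rows = to_rows(sample)
print(insert_sql)
print(rows)
```

Only the values are passed as parameters. Table and column names are interpolated into the statement, so they should come from trusted code and never from user input.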


# 5. Save the Work

Export the cleaned dataset as a CSV file, as required by the task.

In [38]:
# EXPORT CLEANED DATA AS CSV

import os

# Create output directory if it doesn't exist
output_dir = '/Users/svitlanakovalivska/onboarding_weebet/_onboarding_data-1/daily_tasks/day_4'
os.makedirs(output_dir, exist_ok=True)
csv_filename = 'cleaned_sat_results.csv'
csv_path = os.path.join(output_dir, csv_filename)


try:
    # Export cleaned data to CSV
    df_cleaned.to_csv(csv_path, index=False)
    
    print(f"SUCCESS: Cleaned data exported to CSV!")
    print(f"File location: {csv_path}")
    print(f"Exported {len(df_cleaned)} rows and {len(df_cleaned.columns)} columns")
    
    # Verify file was created
    if os.path.exists(csv_path):
        file_size = os.path.getsize(csv_path)
        print(f"File size: {file_size:,} bytes")
    else:
        print("Warning: CSV file not found after export")
        
except Exception as e:
    print(f" ERROR exporting CSV: {str(e)}")



SUCCESS: Cleaned data exported to CSV!
File location: /Users/svitlanakovalivska/onboarding_weebet/_onboarding_data-1/daily_tasks/day_4/cleaned_sat_results.csv
Exported 416 rows and 8 columns
File size: 27,091 bytes
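
Beyond checking that the file exists, a quick round trip can confirm that the exported CSV reads back with the same content. The sketch below uses a temporary directory and a made-up two-row DataFrame as a stand-in for `df_cleaned`:

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for the cleaned DataFrame
cleaned = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "sat_math_avg_score": [404, 423],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sat_results.csv")
    cleaned.to_csv(path, index=False)
    # Read dbn back as text so the codes are never parsed as numbers
    reloaded = pd.read_csv(path, dtype={"dbn": str})

same_shape = reloaded.shape == cleaned.shape
same_ids = reloaded["dbn"].tolist() == cleaned["dbn"].tolist()
same_scores = reloaded["sat_math_avg_score"].tolist() == cleaned["sat_math_avg_score"].tolist()
print(same_shape, same_ids, same_scores)
```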
