# 🧮 Day 4 – Data Integration & Schema Design

* **Import the necessary libraries**

In [57]:
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings("ignore")

* **Connect to the PostgreSQL database**

In [58]:

# SQLAlchemy connection string format:
# postgresql+psycopg2://user:password@host:port/dbname

DATABASE_URL = (
    "postgresql+psycopg2://neondb_owner:a9Am7Yy5r9_T7h4OF2GN"
    "@ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech:5432/neondb"
    "?sslmode=require"
)

# Create the engine (SQLAlchemy opens the actual connection lazily, on first use)
engine = create_engine(DATABASE_URL)
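
Hard-coding the password in the notebook means it travels with every export of the file. Below is a minimal sketch of one alternative, assuming the credentials are supplied through environment variables. The variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) and the helper `build_database_url` are illustrative choices, not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Assemble a SQLAlchemy URL from environment variables (names are assumptions)."""
    user = env["PGUSER"]
    # Escape characters such as '@' or '/' that would otherwise break the URL
    password = quote_plus(env["PGPASSWORD"])
    host = env["PGHOST"]
    port = env.get("PGPORT", "5432")
    dbname = env["PGDATABASE"]
    return (
        f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"
        "?sslmode=require"
    )


# Placeholder values only, to show the resulting shape of the URL
example_env = {
    "PGUSER": "demo_user",
    "PGPASSWORD": "p@ss/word",
    "PGHOST": "localhost",
    "PGDATABASE": "demo_db",
}
example_url = build_database_url(example_env)
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `DATABASE_URL`.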



* **Load the dataset to a pandas dataframe**

In [19]:
df=pd.read_csv('/Users/biancaniemann/Documents/Webeet/Python/onboarding_data/_onboarding_data/daily_tasks/day_4/day_4_datasets/sat-results.csv')

* **

### **Inspect the Data**

- Show first 5 rows

In [59]:
df.head()

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score,SAT Critical Readng Avg. Score,internal_school_id,contact_extension,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363,355,218160,x345,78%,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366,383,268547,x234,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370,377,236446,x123,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359,414,427826,x123,92%,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384,390,672714,x123,92%,2.0


- Check column names, non-null counts and data types (the column names are inconsistent and need normalising, and several columns have incorrect data types that need fixing)

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              493 non-null    object 
 1   SCHOOL NAME                      493 non-null    object 
 2   Num of SAT Test Takers           493 non-null    object 
 3   SAT Critical Reading Avg. Score  493 non-null    object 
 4   SAT Math Avg. Score              493 non-null    object 
 5   SAT Writing Avg. Score           493 non-null    object 
 6   SAT Critical Readng Avg. Score   493 non-null    object 
 7   internal_school_id               493 non-null    int64  
 8   contact_extension                388 non-null    object 
 9   pct_students_tested              376 non-null    object 
 10  academic_tier_rating             402 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 42.5+ KB


- Get summary statistics for numeric columns (the score columns and 'pct_students_tested' are missing from the output because they are stored as object rather than as numeric types)

In [61]:
df.describe()

Unnamed: 0,internal_school_id,academic_tier_rating
count,493.0,402.0
mean,562172.943205,2.564677
std,262138.627055,1.126443
min,101855.0,1.0
25%,332013.0,2.0
50%,587220.0,3.0
75%,782993.0,4.0
max,999398.0,4.0


* Check the percentage of missing values in the 'academic_tier_rating' and 'pct_students_tested' columns to decide whether to keep or drop them
    - Both columns are less than 25% missing, so both are kept

In [21]:
df['academic_tier_rating'].isna().mean() * 100

np.float64(18.45841784989858)

In [22]:
df['pct_students_tested'].isna().mean() * 100

np.float64(23.732251521298174)
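
The two checks above can also be run for every column at once. Below is a minimal sketch on a small made-up frame; the sample values are illustrative and the 25% cut-off mirrors the threshold used above.

```python
import pandas as pd

# Small illustrative frame, not the real dataset
sample = pd.DataFrame(
    {
        "academic_tier_rating": [2.0, None, 3.0, 4.0, 2.0],
        "pct_students_tested": ["78%", None, None, "92%", "92%"],
    }
)

# isna() gives a boolean frame; the column mean is the share of missing values
missing_pct = sample.isna().mean() * 100

# Columns that stay under the 25% threshold
keep = missing_pct[missing_pct < 25].index.tolist()

print(missing_pct)
print(keep)
```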

* **

### **Clean the Data**

* **Remove columns & create a new dataframe to work with**
    - Remove 'SAT Critical Readng Avg. Score' as this is a duplicate with a spelling error in the column name
    - Remove 'internal_school_id' and 'contact_extension' as they serve no analytical purpose

In [None]:
# Check for duplicate columns
df['SAT Critical Reading Avg. Score'].equals(df['SAT Critical Readng Avg. Score'])

True

In [25]:
# Remove columns & create a new dataframe to work with
new_df = df.drop(columns=['SAT Critical Readng Avg. Score', 'internal_school_id', 'contact_extension'])

* **Normalise headers**
    - Remove extra whitespace
    - Replace " " with _
    - All lowercase
    - Remove any special characters


In [28]:
new_df.columns = (
    new_df.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)      # literal space, no regex needed
    .str.replace(r'[^\w]', '', regex=True)   # drop anything that is not a letter, digit or _
)
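
To see what the normalisation steps do, here is a small sketch that applies the same chain to a few of the original column names. The helper name `normalise_headers` is hypothetical. The uniqueness check at the end is an extra safeguard, since stripping characters could in principle make two headers collide.

```python
import pandas as pd


def normalise_headers(columns: pd.Index) -> pd.Index:
    """Strip, lowercase, replace spaces with underscores, drop special characters."""
    return (
        columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
        .str.replace(r"[^\w]", "", regex=True)
    )


raw = pd.Index(
    [
        "DBN",
        "SCHOOL NAME",
        "Num of SAT Test Takers",
        "SAT Math Avg. Score",
        "pct_students_tested",
    ]
)
clean = normalise_headers(raw)
print(list(clean))

# Guard against two different headers collapsing into the same name
headers_unique = clean.is_unique
```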

- **Remove duplicate rows**

In [None]:
# Look at the duplicate rows
duplicates = new_df[new_df.duplicated()]
print(duplicates)

        dbn                                        school_name  \
478  14K685            EL PUENTE ACADEMY FOR PEACE AND JUSTICE   
479  13K605  GEORGE WESTINGHOUSE CAREER AND TECHNICAL EDUCA...   
480  27Q480                             JOHN ADAMS HIGH SCHOOL   
481  07X221    SOUTH BRONX PREPARATORY: A COLLEGE BOARD SCHOOL   
482  19K420                       FRANKLIN K. LANE HIGH SCHOOL   
483  09X525               BRONX LEADERSHIP ACADEMY HIGH SCHOOL   
484  02M520   MURRY BERGTRAUM HIGH SCHOOL FOR BUSINESS CAREERS   
485  17K543  SCIENCE, TECHNOLOGY AND RESEARCH EARLY COLLEGE...   
486  02M419                               LANDMARK HIGH SCHOOL   
487  05M304                              MOTT HALL HIGH SCHOOL   
488  27Q480                             JOHN ADAMS HIGH SCHOOL   
489  13K605  GEORGE WESTINGHOUSE CAREER AND TECHNICAL EDUCA...   
490  05M304                              MOTT HALL HIGH SCHOOL   
491  02M520   MURRY BERGTRAUM HIGH SCHOOL FOR BUSINESS CAREERS   
492  07X22

In [31]:
# Remove duplicate rows
new_df.drop_duplicates(inplace=True)
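
Note that `duplicated()` flags only the second and later copies of a row, which is why the listing above starts at index 478. Below is a minimal sketch on made-up rows showing the difference from `keep=False`, which also flags the first occurrence.

```python
import pandas as pd

# Illustrative rows: one school appears three times
rows = pd.DataFrame(
    {
        "dbn": ["27Q480", "05M304", "27Q480", "27Q480"],
        "school_name": [
            "JOHN ADAMS HIGH SCHOOL",
            "MOTT HALL HIGH SCHOOL",
            "JOHN ADAMS HIGH SCHOOL",
            "JOHN ADAMS HIGH SCHOOL",
        ],
    }
)

later_copies = rows[rows.duplicated()]          # default keep='first': rows 2 and 3
all_copies = rows[rows.duplicated(keep=False)]  # rows 0, 2 and 3

# drop_duplicates keeps the first copy; reset_index gives a clean 0..n-1 index
deduped = rows.drop_duplicates().reset_index(drop=True)
print(deduped)
```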

* **Remove the % symbol from the pct_students_tested values**
    - Convert to numeric values (0 - 100)
    - Change data type to a float

In [36]:
new_df['pct_students_tested'] = new_df['pct_students_tested'].str.rstrip('%').astype(float)
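
`astype(float)` raises an error if any value still contains unexpected text. Below is a sketch of a slightly more defensive variant, assuming that unparseable entries should become NaN, followed by a check that the result stays within 0 to 100. The sample values are illustrative.

```python
import pandas as pd

pct_raw = pd.Series(["78%", None, "92%", "85%"])

# Strip the trailing % and convert; anything unparseable becomes NaN
pct = pd.to_numeric(pct_raw.str.rstrip("%"), errors="coerce")

# Sanity check on the non-missing values
in_range = pct.dropna().between(0, 100).all()

print(pct)
print(in_range)
```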

* **Replace 's' with NaN in the score columns**
    - Don't want to remove the whole row, as it still has pct_students_tested and academic_tier_rating values
    - Used a for loop to convert each column to numeric; errors='coerce' turns the non-numeric 's' into NaN

In [38]:
cols_to_convert = ['num_of_sat_test_takers', 'sat_critical_reading_avg_score', 'sat_math_avg_score', 'sat_writing_avg_score']

for col in cols_to_convert:
    new_df[col] = pd.to_numeric(new_df[col], errors='coerce')
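
Because `errors='coerce'` silently turns every unparseable value into NaN, it can be worth listing which tokens were lost. Below is a minimal sketch on made-up values; it is best run on the raw column before overwriting it.

```python
import pandas as pd

raw_scores = pd.Series(["355", "s", "404", "s", "383"])
converted = pd.to_numeric(raw_scores, errors="coerce")

# Values that were present in the raw data but did not parse as numbers
unparsed = raw_scores[converted.isna() & raw_scores.notna()]
unparsed_tokens = sorted(unparsed.unique())

print(unparsed_tokens)
print(converted.count(), "of", len(converted), "values parsed")
```

If the list contains anything other than 's', the data has another placeholder that deserves a closer look.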

In [43]:
new_df.head()

Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,78.0,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70.0,377.0,402.0,370.0,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7.0,414.0,401.0,359.0,92.0,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44.0,390.0,433.0,384.0,92.0,2.0


In [47]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 478 entries, 0 to 477
Data columns (total 8 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   dbn                             478 non-null    object 
 1   school_name                     478 non-null    object 
 2   num_of_sat_test_takers          421 non-null    float64
 3   sat_critical_reading_avg_score  421 non-null    float64
 4   sat_math_avg_score              421 non-null    float64
 5   sat_writing_avg_score           421 non-null    float64
 6   pct_students_tested             363 non-null    float64
 7   academic_tier_rating            392 non-null    float64
dtypes: float64(6), object(2)
memory usage: 33.6+ KB


In [48]:
new_df.describe()

Unnamed: 0,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,pct_students_tested,academic_tier_rating
count,421.0,421.0,421.0,421.0,363.0,392.0
mean,110.320665,400.850356,418.173397,393.985748,84.595041,2.579082
std,155.534254,56.802783,88.210494,58.635109,5.673305,1.128053
min,6.0,279.0,-10.0,286.0,78.0,1.0
25%,41.0,368.0,372.0,360.0,78.0,2.0
50%,62.0,391.0,395.0,381.0,85.0,3.0
75%,95.0,416.0,438.0,411.0,92.0,4.0
max,1277.0,679.0,1100.0,682.0,92.0,4.0


* **The describe output above shows that the min and max of 'sat_math_avg_score' are out of range**
    - (scores should be between 200 and 800)
    - Ran the check below to see which rows have incorrect values

In [51]:
new_df[(new_df['sat_math_avg_score'] < 200) | (new_df['sat_math_avg_score'] > 800)]

Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,pct_students_tested,academic_tier_rating
80,03M415,WADLEIGH SECONDARY SCHOOL FOR THE PERFORMING &...,32.0,371.0,850.0,370.0,78.0,4.0
181,10X225,THEATRE ARTS PRODUCTION COMPANY SCHOOL,59.0,405.0,-10.0,394.0,78.0,
288,15K656,BROOKLYN HIGH SCHOOL OF THE ARTS,141.0,426.0,999.0,411.0,,
422,28Q470,JAMAICA HIGH SCHOOL,90.0,342.0,999.0,353.0,92.0,3.0
434,29Q283,PREPARATORY ACADEMY FOR WRITERS: A COLLEGE BOA...,43.0,370.0,1100.0,363.0,85.0,3.0


* **Replace the incorrect values with NaN, so that only the bad value is removed and the row is kept**
    - Used loc to select the cells with the incorrect values
    - Assigned pd.NA, which pandas stores as NaN in a float column


In [52]:
new_df.loc[(new_df['sat_math_avg_score'] < 200) | (new_df['sat_math_avg_score'] > 800), 'sat_math_avg_score'] = pd.NA
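
The same range rule can be expressed with `between` and `where`, which makes it easy to apply to all three score columns. This is a sketch under that assumption; the notebook itself only corrects the math column. The three sample rows reuse values shown in the outputs above.

```python
import pandas as pd

score_cols = [
    "sat_critical_reading_avg_score",
    "sat_math_avg_score",
    "sat_writing_avg_score",
]

demo = pd.DataFrame(
    {
        "sat_critical_reading_avg_score": [355.0, 371.0, 405.0],
        "sat_math_avg_score": [404.0, 850.0, -10.0],
        "sat_writing_avg_score": [363.0, 370.0, 394.0],
    }
)

for col in score_cols:
    valid = demo[col].between(200, 800)  # inclusive on both ends; NaN counts as not valid
    demo[col] = demo[col].where(valid)   # keep valid values, set the rest to NaN

print(demo)
```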

* **Check the column again**

In [53]:
new_df[(new_df['sat_math_avg_score'] < 200) | (new_df['sat_math_avg_score'] > 800)]

Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,pct_students_tested,academic_tier_rating


In [55]:
new_df.describe()

Unnamed: 0,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,pct_students_tested,academic_tier_rating
count,421.0,421.0,416.0,421.0,363.0,392.0
mean,110.320665,400.850356,413.733173,393.985748,84.595041,2.579082
std,155.534254,56.802783,64.945638,58.635109,5.673305,1.128053
min,6.0,279.0,312.0,286.0,78.0,1.0
25%,41.0,368.0,372.0,360.0,78.0,2.0
50%,62.0,391.0,395.0,381.0,85.0,3.0
75%,95.0,416.0,437.25,411.0,92.0,4.0
max,1277.0,679.0,735.0,682.0,92.0,4.0


* **

### **Write the new dataframe to the nyc_schools schema**

In [62]:
new_df.to_sql(
    name='bianca_sat_results',       
    con=engine,     
    schema='nyc_schools',
    if_exists='replace',    
    index=False            
)

478
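
The `to_sql` call uses `if_exists='replace'`, which drops and recreates the table on every run instead of adding rows to it. Below is a small sketch of the difference from `'append'`, using an in-memory SQLite database as a stand-in for PostgreSQL. The table name `sat_demo` and the helper `row_count` are illustrative.

```python
import sqlite3

import pandas as pd

frame = pd.DataFrame(
    {"dbn": ["01M292", "01M448"], "sat_math_avg_score": [404.0, 423.0]}
)
conn = sqlite3.connect(":memory:")


def row_count(connection):
    return connection.execute("SELECT COUNT(*) FROM sat_demo").fetchone()[0]


# Running 'replace' twice leaves a single copy of the data
frame.to_sql("sat_demo", conn, if_exists="replace", index=False)
frame.to_sql("sat_demo", conn, if_exists="replace", index=False)
rows_after_replace = row_count(conn)

# 'append' adds the same rows again
frame.to_sql("sat_demo", conn, if_exists="append", index=False)
rows_after_append = row_count(conn)

print(rows_after_replace, rows_after_append)
```

With `'replace'` the notebook cell can be re-run safely; with `'append'` each re-run would add duplicate rows to the table.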

### **Create cleaned csv file**

In [56]:
new_df.to_csv('/Users/biancaniemann/Documents/Webeet/Python/onboarding_data/_onboarding_data/daily_tasks/day_4/cleaned_sat_results.csv', index=False)

* **

### **Test that the table in the schema works and run a few queries**

In [63]:
query = """
    SELECT *
    FROM nyc_schools.bianca_sat_results
    LIMIT 5;
"""
df_result = pd.read_sql(query, engine)
df_result

Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,pct_students_tested,academic_tier_rating
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29.0,355.0,404.0,363.0,78.0,2.0
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91.0,383.0,423.0,366.0,,3.0
2,01M450,EAST SIDE COMMUNITY SCHOOL,70.0,377.0,402.0,370.0,,3.0
3,01M458,FORSYTH SATELLITE ACADEMY,7.0,414.0,401.0,359.0,92.0,4.0
4,01M509,MARTA VALLE HIGH SCHOOL,44.0,390.0,433.0,384.0,92.0,2.0


* **Both the distinct dbn count and the total row count are 478, so no school is repeated in the dataset**

In [66]:
query = """
    SELECT count(DISTINCT dbn) AS total_schools
    FROM nyc_schools.bianca_sat_results;
"""
df_result = pd.read_sql(query, engine)
df_result

Unnamed: 0,total_schools
0,478


In [65]:
query = """
    SELECT count(*) AS total_rows
    FROM nyc_schools.bianca_sat_results;
"""
df_result = pd.read_sql(query, engine)
df_result

Unnamed: 0,total_rows
0,478
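
Both counts can be fetched in a single query, and a `GROUP BY ... HAVING` query lists any repeated `dbn` directly. Below is a minimal sketch against an in-memory SQLite table. The table `sat_demo` and its three rows are illustrative, and the SQL itself is standard, so it should carry over to PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sat_demo (dbn TEXT, school_name TEXT)")
conn.executemany(
    "INSERT INTO sat_demo VALUES (?, ?)",
    [
        ("01M292", "HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES"),
        ("01M448", "UNIVERSITY NEIGHBORHOOD HIGH SCHOOL"),
        ("01M450", "EAST SIDE COMMUNITY SCHOOL"),
    ],
)

# Total rows and distinct schools side by side
total_rows, total_schools = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT dbn) FROM sat_demo"
).fetchone()

# Any dbn that appears more than once (empty list means no repeats)
repeated = conn.execute(
    "SELECT dbn, COUNT(*) FROM sat_demo GROUP BY dbn HAVING COUNT(*) > 1"
).fetchall()

print(total_rows, total_schools, repeated)
```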


* **Top 10 schools by overall average score**

In [76]:
query = """
WITH total_avg AS (
    SELECT dbn,
        school_name,
        (sat_critical_reading_avg_score + sat_math_avg_score + sat_writing_avg_score) / 3.0 AS overall_avg_score
    FROM nyc_schools.bianca_sat_results
)
SELECT *,
    RANK() OVER (ORDER BY overall_avg_score DESC) AS rank_overall_avg_score
FROM total_avg
WHERE overall_avg_score IS NOT NULL
ORDER BY overall_avg_score DESC
LIMIT 10
;
"""
df_result = pd.read_sql(query, engine)
df_result

Unnamed: 0,dbn,school_name,overall_avg_score,rank_overall_avg_score
0,02M475,STUYVESANT HIGH SCHOOL,698.666667,1
1,10X445,BRONX HIGH SCHOOL OF SCIENCE,656.333333,2
2,31R605,STATEN ISLAND TECHNICAL HIGH SCHOOL,651.0,3
3,10X696,HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE,640.0,4
4,25Q525,TOWNSEND HARRIS HIGH SCHOOL,636.666667,5
5,28Q687,QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...,622.666667,6
6,01M696,BARD HIGH SCHOOL EARLY COLLEGE,618.666667,7
7,05M692,"HIGH SCHOOL FOR MATHEMATICS, SCIENCE AND ENGIN...",615.666667,8
8,13K430,BROOKLYN TECHNICAL HIGH SCHOOL,611.0,9
9,02M416,ELEANOR ROOSEVELT HIGH SCHOOL,586.0,10
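
In the query, adding a NULL score makes the whole sum NULL, so schools with any missing section score drop out of the ranking. Below is a sketch of the same logic in pandas for cross-checking, using three rows taken from the outputs above. `skipna=False` reproduces the NULL behaviour, and `rank(method="min")` is the closest match to SQL `RANK()`.

```python
import numpy as np
import pandas as pd

score_cols = [
    "sat_critical_reading_avg_score",
    "sat_math_avg_score",
    "sat_writing_avg_score",
]

demo = pd.DataFrame(
    {
        "school_name": [
            "HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES",
            "STUYVESANT HIGH SCHOOL",
            "BROOKLYN HIGH SCHOOL OF THE ARTS",
        ],
        "sat_critical_reading_avg_score": [355.0, 679.0, 426.0],
        "sat_math_avg_score": [404.0, 735.0, np.nan],
        "sat_writing_avg_score": [363.0, 682.0, 411.0],
    }
)

# Row-wise mean; any missing score makes the overall average NaN
demo["overall_avg_score"] = demo[score_cols].mean(axis=1, skipna=False)

ranked = demo.dropna(subset=["overall_avg_score"]).copy()
ranked["rank_overall_avg_score"] = (
    ranked["overall_avg_score"].rank(method="min", ascending=False).astype(int)
)
ranked = ranked.sort_values("overall_avg_score", ascending=False).head(10)

print(ranked[["school_name", "overall_avg_score", "rank_overall_avg_score"]])
```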


In [77]:
query = """
    SELECT *
    FROM nyc_schools.bianca_sat_results
    WHERE school_name = 'STUYVESANT HIGH SCHOOL';
"""
df_result = pd.read_sql(query, engine)
df_result

Unnamed: 0,dbn,school_name,num_of_sat_test_takers,sat_critical_reading_avg_score,sat_math_avg_score,sat_writing_avg_score,pct_students_tested,academic_tier_rating
0,02M475,STUYVESANT HIGH SCHOOL,832.0,679.0,735.0,682.0,,4.0
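
When the school name comes from a variable rather than being typed into the query, passing it as a parameter avoids quoting problems and SQL injection. Below is a minimal sketch with an in-memory SQLite table standing in for the PostgreSQL table. The `?` placeholder is SQLite's style; with a SQLAlchemy engine the usual approach is a named parameter in a `sqlalchemy.text` query, which is not shown here.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {
        "dbn": ["01M292", "02M475"],
        "school_name": [
            "HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES",
            "STUYVESANT HIGH SCHOOL",
        ],
        "sat_math_avg_score": [404.0, 735.0],
    }
).to_sql("sat_demo", conn, index=False)

school = "STUYVESANT HIGH SCHOOL"

# The value is bound by the driver instead of being pasted into the SQL text
result = pd.read_sql(
    "SELECT * FROM sat_demo WHERE school_name = ?",
    conn,
    params=(school,),
)
print(result)
```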
