# Onboarding Day 3 - NYC Schools SQL Analysis

In [1]:
from sqlalchemy import create_engine
import psycopg2
import pandas as pd

## Database Connection

In [2]:
# Database URL format: dialect+driver://username:password@host:port/dbname
DATABASE_URL = (
    "postgresql+psycopg2://neondb_owner:a9Am7Yy5r9_T7h4OF2GN@"                  # database postgresql with psycopg2 driver + username and password + @ (separate host and port)
    "ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech:5432/neondb"    # host + port + /database name
    "?sslmode=require"                                                          # SSL mode
)

# Create SQLAlchemy engine
engine = create_engine(DATABASE_URL)
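
The URL above embeds the password directly in the notebook. A minimal sketch of one alternative is shown here, assuming the credentials are exported as environment variables. The variable names (`NEON_USER`, `NEON_PASSWORD`, `NEON_HOST`, `NEON_PORT`, `NEON_DB`) and the helper `build_database_url` are hypothetical and not part of the original setup.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (hypothetical names)."""
    user = quote_plus(env["NEON_USER"])
    # Percent-encode characters such as '@' or '/' that would break the URL.
    password = quote_plus(env["NEON_PASSWORD"])
    host = env["NEON_HOST"]
    port = env.get("NEON_PORT", "5432")
    dbname = env["NEON_DB"]
    return (
        f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"
        "?sslmode=require"
    )


# Demo with placeholder values only (not real credentials).
example_env = {
    "NEON_USER": "demo_user",
    "NEON_PASSWORD": "p@ss/word",
    "NEON_HOST": "example.invalid",
    "NEON_DB": "demo_db",
}
print(build_database_url(example_env))
```

With the variables exported, the engine could then be created with `create_engine(build_database_url())`, which keeps the secret out of the notebook file.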

## Test Queries

The two cells below return the first five rows of the `high_school_directory` and `school_demographics` tables.

In [3]:
query = "SELECT * FROM nyc_schools.high_school_directory LIMIT 5;"
df = pd.read_sql(query, engine)
df

|  | dbn | school_name | borough | building_code | phone_number | fax_number | grade_span_min | grade_span_max | expgrade_span_min | expgrade_span_max | ... | number_programs | Location 1 | Community Board | Council District | Census Tract | Zip Codes | Community Districts | Borough Boundaries | City Council Districts | Police Precincts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 27Q260 | Frederick Douglass Academy VI High School | Queens | Q465 | 718-471-2154 | 718-471-2890 | 9.0 | 12 |  |  | ... | 1 | {'latitude': '40.601989336', 'longitude': '-73... | 14 | 31 | 100802 | 20529 | 51 | 3 | 47 | 59 |
| 1 | 21K559 | Life Academy High School for Film and Music | Brooklyn | K400 | 718-333-7750 | 718-333-7775 | 9.0 | 12 |  |  | ... | 1 | {'latitude': '40.593593811', 'longitude': '-73... | 13 | 47 | 306 | 17616 | 21 | 2 | 45 | 35 |
| 2 | 16K393 | Frederick Douglass Academy IV Secondary School | Brooklyn | K026 | 718-574-2820 | 718-574-2821 | 9.0 | 12 |  |  | ... | 1 | {'latitude': '40.692133704', 'longitude': '-73... | 3 | 36 | 291 | 18181 | 69 | 2 | 49 | 52 |
| 3 | 08X305 | Pablo Neruda Academy | Bronx | X450 | 718-824-1682 | 718-824-1663 | 9.0 | 12 |  |  | ... | 1 | {'latitude': '40.822303765', 'longitude': '-73... | 9 | 18 | 16 | 11611 | 58 | 5 | 31 | 26 |
| 4 | 03M485 | Fiorello H. LaGuardia High School of Music & A... | Manhattan | M485 | 212-496-0700 | 212-724-5748 | 9.0 | 12 |  |  | ... | 6 | {'latitude': '40.773670507', 'longitude': '-73... | 7 | 6 | 151 | 12420 | 20 | 4 | 19 | 12 |


In [4]:
query = "SELECT * FROM nyc_schools.school_demographics LIMIT 5;"
df = pd.read_sql(query, engine)
df

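|  | dbn | Name | schoolyear | fl_percent | frl_percent | total_enrollment | prek | k | grade1 | grade2 | ... | black_num | black_per | hispanic_num | hispanic_per | white_num | white_per | male_num | male_per | female_num | female_per |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20052006 | 89.4 |  | 281 | 15 | 36 | 40 | 33 | ... | 74 | 26.3 | 189 | 67.3 | 5 | 1.8 | 158 | 56.2 | 123 | 43.8 |
| 1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20062007 | 89.4 |  | 243 | 15 | 29 | 39 | 38 | ... | 68 | 28.0 | 153 | 63.0 | 4 | 1.6 | 140 | 57.6 | 103 | 42.4 |
| 2 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20072008 | 89.4 |  | 261 | 18 | 43 | 39 | 36 | ... | 77 | 29.5 | 157 | 60.2 | 7 | 2.7 | 143 | 54.8 | 118 | 45.2 |
| 3 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20082009 | 89.4 |  | 252 | 17 | 37 | 44 | 32 | ... | 75 | 29.8 | 149 | 59.1 | 7 | 2.8 | 149 | 59.1 | 103 | 40.9 |
| 4 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20092010 |  | 96.5 | 208 | 16 | 40 | 28 | 32 | ... | 67 | 32.2 | 118 | 56.7 | 6 | 2.9 | 124 | 59.6 | 84 | 40.4 |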


## Task Queries

### School Distribution

**Q**: How many schools are there in each borough?

**A**: From the `high_school_directory` table, I selected each borough together with the count of distinct `dbn` values, grouped by borough.

In [5]:
query = "SELECT borough, COUNT(DISTINCT dbn) FROM nyc_schools.high_school_directory GROUP BY borough;"
df = pd.read_sql(query, engine)
df

|  | borough | count |
| --- | --- | --- |
| 0 | Bronx | 118 |
| 1 | Brooklyn | 121 |
| 2 | Manhattan | 106 |
| 3 | Queens | 80 |
| 4 | Staten Island | 10 |


💡 **Insight**: Brooklyn and the Bronx host the largest number of high schools (121 and 118 respectively), while Staten Island has the fewest (10).

### Language Learners

**Q**: What is the average % of English Language Learners (ELL) per borough?

**A**: I calculated this in two steps using a CTE:

1.	In the CTE `school_level`, I joined `school_demographics` to `high_school_directory` on **dbn**. For each school, I selected its dbn, its borough, and the average ell_percent. I excluded NULL values and grouped by dbn and borough, so each school appears only once with its own average ELL%.

2.	From that CTE, I then took the average of those per-school ELL percentages by borough.
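
The two-step approach matters because schools can appear in a different number of yearly rows. The sketch below uses made-up rows (the school IDs `A` and `B` and their values are purely illustrative) to show how averaging per school first differs from averaging every row directly.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (dbn, borough, ell_percent) rows: school A has three yearly
# records, school B has one.
rows = [
    ("A", "Manhattan", 10.0),
    ("A", "Manhattan", 10.0),
    ("A", "Manhattan", 10.0),
    ("B", "Manhattan", 2.0),
]

# Step 1: one average per school (mirrors the school_level CTE).
per_school = defaultdict(list)
for dbn, borough, ell in rows:
    per_school[(dbn, borough)].append(ell)
school_avg = {key: mean(vals) for key, vals in per_school.items()}

# Step 2: average the per-school values within each borough.
per_borough = defaultdict(list)
for (dbn, borough), avg in school_avg.items():
    per_borough[borough].append(avg)
two_step = {b: mean(v) for b, v in per_borough.items()}

# Single-step alternative: average every row, so school A counts three times.
pooled = mean(ell for _, _, ell in rows)

print(two_step)  # {'Manhattan': 6.0}
print(pooled)    # 8.0
```

In the pooled version, school A dominates because it contributes three rows. The two-step version weights each school equally, which matches the "per school" wording of the answer.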

💡 **Insight**: After the join, only Manhattan schools had matching records in both tables with non-null ell_percent, so Manhattan is the only borough returned in the result. This suggests that either the other boroughs are missing demographic data, or their dbn values didn’t match across both tables.

In [6]:
query = """
WITH school_level AS (
    SELECT 
        hsd.dbn,
        hsd.borough,
        AVG(sd.ell_percent) AS avg_ell_per_school
    FROM nyc_schools.high_school_directory AS hsd
    INNER JOIN nyc_schools.school_demographics AS sd
        ON hsd.dbn = sd.dbn
    WHERE hsd.borough IS NOT NULL
      AND sd.ell_percent IS NOT NULL
    GROUP BY hsd.dbn, hsd.borough
)
SELECT 
    borough,
    AVG(avg_ell_per_school) AS avg_perc_ell
FROM school_level
GROUP BY borough
ORDER BY borough;
"""

df = pd.read_sql(query, engine)
df

|  | borough | avg_perc_ell |
| --- | --- | --- |
| 0 | Manhattan | 6.417347 |


💡 **Insight**: The average percentage of English Language Learners per school in Manhattan is 6.4%. Other boroughs had no matching demographic data.

#### Data Coverage Check

I verified how many rows and distinct school IDs (dbn) exist in each table, and how many records remain after joining. This helps confirm data coverage and explain why only one borough appears in the results.

* Distinct DBNs in `high_school_directory`: **435**

* Distinct DBNs in `school_demographics`: **32**

* Rows after join: **40**

💡 The `school_demographics` table contains data for only 32 unique schools, while the `high_school_directory` table contains 435. The inner join can therefore match at most 32 of the 435 directory schools, and it returns just 40 rows. This limited overlap is why the final query returns results only for Manhattan.
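
One way to see which boroughs a list of DBNs covers, without relying on the join, is to read the borough letter in the third character of each DBN. The sketch below assumes the usual DBN layout (two-digit district, borough letter, three-digit school number). The letters M, X, K and Q agree with the rows shown in this notebook, while `R` for Staten Island does not appear in the outputs here and is an assumption. The helper `borough_from_dbn` is hypothetical.

```python
from collections import Counter

# Borough letter found in the third character of a DBN (assumed mapping).
BOROUGH_BY_LETTER = {
    "M": "Manhattan",
    "X": "Bronx",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",  # assumption: not visible in this notebook's outputs
}


def borough_from_dbn(dbn):
    """Return the borough implied by a DBN such as '01M015', or None if unrecognized."""
    if len(dbn) < 3:
        return None
    return BOROUGH_BY_LETTER.get(dbn[2].upper())


# DBNs copied from the outputs shown in this notebook.
sample_dbns = ["01M015", "01M450", "01M292", "01M509", "27Q260", "21K559", "08X305"]
borough_counts = Counter(borough_from_dbn(d) for d in sample_dbns)
print(borough_counts)
```

Applied to the distinct `dbn` values of `school_demographics`, a tally like this would indicate whether the table simply holds no schools outside Manhattan or whether DBNs from other boroughs exist but fail to match.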


In [7]:
query = """
SELECT COUNT(*) AS rows_after_join
FROM nyc_schools.high_school_directory AS hsd
INNER JOIN nyc_schools.school_demographics AS sd
    ON hsd.dbn = sd.dbn;
"""
df = pd.read_sql(query, engine)
df

|  | rows_after_join |
| --- | --- |
| 0 | 40 |


In [8]:
query = "SELECT COUNT(DISTINCT dbn) AS demographics_dbns FROM nyc_schools.school_demographics;"
df = pd.read_sql(query, engine)
df

|  | demographics_dbns |
| --- | --- |
| 0 | 32 |


In [9]:
query = "SELECT COUNT(DISTINCT dbn) AS directory_dbns FROM nyc_schools.high_school_directory;"
df = pd.read_sql(query, engine)
df

|  | directory_dbns |
| --- | --- |
| 0 | 435 |
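
The three counts above describe overall coverage. A possible follow-up is to break coverage down by borough with a `LEFT JOIN`, so that boroughs without any demographic match still appear with a count of zero. The sketch below runs against an in-memory SQLite database as a stand-in for PostgreSQL, and all rows and DBNs in it are made up for illustration; they are not query results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE high_school_directory (dbn TEXT, borough TEXT);
CREATE TABLE school_demographics (dbn TEXT, schoolyear TEXT);
-- Toy rows with invented DBNs.
INSERT INTO high_school_directory VALUES
    ('00M001', 'Manhattan'), ('00M002', 'Manhattan'),
    ('00Q001', 'Queens'),    ('00X001', 'Bronx');
INSERT INTO school_demographics VALUES
    ('00M001', '20052006'), ('00M001', '20062007'),
    ('00M002', '20052006'), ('00M003', '20052006');
""")

# LEFT JOIN keeps every directory school; COUNT(DISTINCT sd.dbn) ignores the
# NULLs produced for schools without a demographic record.
coverage_sql = """
SELECT
    hsd.borough,
    COUNT(DISTINCT hsd.dbn) AS directory_schools,
    COUNT(DISTINCT sd.dbn)  AS schools_with_demographics
FROM high_school_directory AS hsd
LEFT JOIN school_demographics AS sd
    ON hsd.dbn = sd.dbn
GROUP BY hsd.borough
ORDER BY hsd.borough;
"""
for row in conn.execute(coverage_sql):
    print(row)
# ('Bronx', 1, 0)
# ('Manhattan', 2, 2)
# ('Queens', 1, 0)
```

Against the real database, the same statement should work once the table names carry the `nyc_schools.` schema prefix and the string is passed to `pd.read_sql(query, engine)`.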


### Schools Supporting Special Needs

**Q**: Using the data from the school demographics and high school directory tables, write a query to find the top 3 schools in each borough with the highest percentage of special education students (sped_percent).

**A**: I answered this question in three parts with two CTEs:

1. **school_level** CTE: I joined `school_demographics` to `high_school_directory` on **dbn**. For each school, I selected dbn, school_name, borough, and the average sped_percent. I grouped by school so that each school has a single average value instead of being counted once for every row/year in which it appears.

2. **ranked** CTE: I added a **ROW_NUMBER()** window function that partitions by borough and orders the schools in each borough by avg_sped_percent (highest first). This assigns a rank **rn** of 1, 2, 3, … within each borough.

3. **Final selection**: I filtered the ranked CTE to only keep rows where **rn <= 3**, returning the top 3 schools per borough by special education percentage.
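
The ranking step can be tried on toy data. The sketch below uses an in-memory SQLite database (window functions need SQLite 3.25 or newer) as a stand-in for PostgreSQL, with an invented `school_level` table whose DBNs and percentages are hypothetical. The `dbn` tie-breaker and the `RANK()` column are additions for comparison and are not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE school_level (dbn TEXT, borough TEXT, avg_sped_percent REAL);
-- Toy rows: two Manhattan schools tie for third place.
INSERT INTO school_level VALUES
    ('00M001', 'Manhattan', 26.0),
    ('00M002', 'Manhattan', 23.0),
    ('00M003', 'Manhattan', 22.0),
    ('00M004', 'Manhattan', 22.0),
    ('00Q001', 'Queens',    18.0);
""")

top_n_sql = """
WITH ranked AS (
    SELECT
        borough, dbn, avg_sped_percent,
        -- ROW_NUMBER gives every row a unique position; dbn breaks ties.
        ROW_NUMBER() OVER (
            PARTITION BY borough
            ORDER BY avg_sped_percent DESC, dbn
        ) AS rn,
        -- RANK gives tied rows the same position.
        RANK() OVER (
            PARTITION BY borough
            ORDER BY avg_sped_percent DESC
        ) AS rnk
    FROM school_level
)
SELECT borough, dbn, rn, rnk
FROM ranked
ORDER BY borough, rn;
"""
rows = conn.execute(top_n_sql).fetchall()
top3_row_number = [dbn for _, dbn, rn, _ in rows if rn <= 3]
top3_rank = [dbn for _, dbn, _, rnk in rows if rnk <= 3]
print(top3_row_number)  # ['00M001', '00M002', '00M003', '00Q001']
print(top3_rank)        # ['00M001', '00M002', '00M003', '00M004', '00Q001']
```

`ROW_NUMBER()` always keeps exactly three rows per borough, so one of two tied schools is dropped. Without a tie-breaker in the `ORDER BY`, which one is dropped is not guaranteed. `RANK()` keeps both tied schools at the cost of possibly returning more than three rows.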

💡 Since only a small subset of schools match across both tables, the final result returns only Manhattan schools. The output therefore does not cover all NYC boroughs: it lists the top 3 schools per borough among those with a valid sped_percent in the joined data, and Manhattan is the only borough represented there.

In [10]:
query = """
WITH school_level AS (
    SELECT 
        hsd.dbn,
        hsd.school_name,
        hsd.borough,
        AVG(sd.sped_percent) AS avg_sped_percent
    FROM nyc_schools.high_school_directory AS hsd
    INNER JOIN nyc_schools.school_demographics AS sd
        ON hsd.dbn = sd.dbn
    WHERE sd.sped_percent IS NOT NULL
    GROUP BY hsd.dbn, hsd.school_name, hsd.borough
),

ranked AS (
    SELECT
        borough,
        school_name,
        dbn,
        avg_sped_percent,
        ROW_NUMBER() OVER (
            PARTITION BY borough
            ORDER BY avg_sped_percent DESC
        ) AS rn
    FROM school_level
)

SELECT
    borough,
    school_name,
    dbn,
    avg_sped_percent
FROM ranked
WHERE rn <= 3
ORDER BY borough, avg_sped_percent DESC;
"""
df = pd.read_sql(query, engine)
df

|  | borough | school_name | dbn | avg_sped_percent |
| --- | --- | --- | --- | --- |
| 0 | Manhattan | East Side Community School | 01M450 | 26.285714 |
| 1 | Manhattan | Henry Street School for International Studies | 01M292 | 23.014286 |
| 2 | Manhattan | Marta Valle High School | 01M509 | 22.214285 |


💡 **Insight**: The top three Manhattan schools by special education share (East Side Community School, Henry Street School for International Studies, and Marta Valle High School) have average SPED percentages of roughly 22% to 26%.

## Insights


* The `high_school_directory` table contains **435** unique schools, while the `school_demographics` table only has **32** distinct dbn values.

* Joining the two tables on dbn produced **40 matching rows**, showing that only a small subset of schools appears in both datasets.

* Every matched record returned by the queries above belongs to **Manhattan**, so the borough-level analyses are effectively limited to that borough.

