# 🧠 Day 3 – SQL via Python: NYC School Data Exploration
In this notebook, you'll connect to a PostgreSQL database and execute SQL queries to explore NYC school data.

## 🔌 Step 1: Import Libraries

In [1]:
import pandas as pd
import psycopg2

## 🔐 Step 2: Connect to the Database

In [2]:
# DB connection setup using hardcoded credentials (for onboarding only)
conn = psycopg2.connect(
    dbname="neondb",
    user="neondb_owner",
    password="npg_CeS9fJg2azZD",
    host="ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech",
    port="5432",
    sslmode="require"
)
cur = conn.cursor()
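
The comment above marks the hardcoded credentials as onboarding-only. As a minimal sketch of one alternative, the settings could be read from environment variables. The variable names (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`, `PGSSLMODE`) follow libpq's conventions; the helper `load_db_config` is hypothetical and not part of the original notebook.

```python
import os

# Hypothetical mapping from libpq-style environment variables to the
# keyword arguments used in the psycopg2.connect(...) call above.
ENV_TO_KWARG = {
    "PGDATABASE": "dbname",
    "PGUSER": "user",
    "PGPASSWORD": "password",
    "PGHOST": "host",
    "PGPORT": "port",
    "PGSSLMODE": "sslmode",
}
REQUIRED = ("PGDATABASE", "PGUSER", "PGPASSWORD", "PGHOST")


def load_db_config(env=None):
    """Build connect() keyword arguments from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    config = {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items() if name in env}
    # Fall back to the port and sslmode used in the cell above.
    config.setdefault("port", "5432")
    config.setdefault("sslmode", "require")
    return config
```

With the variables exported in the shell, the connection cell would become `conn = psycopg2.connect(**load_db_config())`, and no password would be stored in the notebook.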

## 🔍 Step 3: Run a Test Query

In [3]:
# Note to self: the conn connection object was new to me.

In [4]:
query = "SELECT * FROM nyc_schools.high_school_directory LIMIT 5;"
df = pd.read_sql(query, conn)
df.head()



| | dbn | school_name | borough | building_code | phone_number | fax_number | grade_span_min | grade_span_max | expgrade_span_min | expgrade_span_max | ... | number_programs | Location 1 | Community Board | Council District | Census Tract | Zip Codes | Community Districts | Borough Boundaries | City Council Districts | Police Precincts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27Q260 | Frederick Douglass Academy VI High School | Queens | Q465 | 718-471-2154 | 718-471-2890 | 9.0 | 12 | | | ... | 1 | {'latitude': '40.601989336', 'longitude': '-73... | 14 | 31 | 100802 | 20529 | 51 | 3 | 47 | 59 |
| 1 | 21K559 | Life Academy High School for Film and Music | Brooklyn | K400 | 718-333-7750 | 718-333-7775 | 9.0 | 12 | | | ... | 1 | {'latitude': '40.593593811', 'longitude': '-73... | 13 | 47 | 306 | 17616 | 21 | 2 | 45 | 35 |
| 2 | 16K393 | Frederick Douglass Academy IV Secondary School | Brooklyn | K026 | 718-574-2820 | 718-574-2821 | 9.0 | 12 | | | ... | 1 | {'latitude': '40.692133704', 'longitude': '-73... | 3 | 36 | 291 | 18181 | 69 | 2 | 49 | 52 |
| 3 | 08X305 | Pablo Neruda Academy | Bronx | X450 | 718-824-1682 | 718-824-1663 | 9.0 | 12 | | | ... | 1 | {'latitude': '40.822303765', 'longitude': '-73... | 9 | 18 | 16 | 11611 | 58 | 5 | 31 | 26 |
| 4 | 03M485 | Fiorello H. LaGuardia High School of Music & A... | Manhattan | M485 | 212-496-0700 | 212-724-5748 | 9.0 | 12 | | | ... | 6 | {'latitude': '40.773670507', 'longitude': '-73... | 7 | 6 | 151 | 12420 | 20 | 4 | 19 | 12 |
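
Step 2 creates a cursor (`cur`), but the cells only go through `pd.read_sql`. As a sketch of the cursor route, the hypothetical helper `fetch_rows` below runs a query through a DB-API cursor and returns the column names plus the rows. An in-memory `sqlite3` table stands in for the PostgreSQL connection so the block runs on its own; its two rows are copied from the output above. A psycopg2 cursor offers the same `execute`, `description` and `fetchall` calls.

```python
import sqlite3


def fetch_rows(conn, query):
    """Run a query through a DB-API cursor; return (column_names, rows)."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        # description holds one tuple per result column; item 0 is the name.
        columns = [col[0] for col in cur.description]
        rows = cur.fetchall()
    finally:
        cur.close()
    return columns, rows


# Stand-in table (assumption: only two of the real columns are modelled).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE high_school_directory (dbn TEXT, borough TEXT)")
demo.executemany(
    "INSERT INTO high_school_directory VALUES (?, ?)",
    [("27Q260", "Queens"), ("21K559", "Brooklyn")],
)

columns, rows = fetch_rows(
    demo, "SELECT dbn, borough FROM high_school_directory ORDER BY dbn LIMIT 5"
)
print(columns)
print(rows)
```

`pd.read_sql` does the same work and also builds the DataFrame, which is why the notebook never needs `cur` directly.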


## ✅ Task Queries Below

In [5]:
# Question 1: How many schools are there in each borough?
# I will count how many unique schools (dbn) are present per borough.
# GROUP BY helps me get the count per borough.

query = """
SELECT 
    borough,
    COUNT(DISTINCT dbn) AS school_count
FROM nyc_schools.high_school_directory
GROUP BY borough;
"""
# Triple quotes (""") for multi-line query strings are really helpful in Python; that was also new to me.
# Run the query and load the result into a DataFrame.
df = pd.read_sql(query, conn)

# show the result
df





| | borough | school_count |
|---|---|---|
| 0 | Bronx | 118 |
| 1 | Brooklyn | 121 |
| 2 | Manhattan | 106 |
| 3 | Queens | 80 |
| 4 | Staten Island | 10 |
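
Why `COUNT(DISTINCT dbn)` rather than `COUNT(*)`: if a `dbn` appeared on more than one row, `COUNT(*)` would count it every time. Whether the real directory contains such repeats is not known from this notebook, so the sketch below uses made-up rows in an in-memory `sqlite3` table (the table name `directory` and all values are illustrative).

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE directory (dbn TEXT, borough TEXT)")
demo.executemany(
    "INSERT INTO directory VALUES (?, ?)",
    [
        ("01M001", "Manhattan"),
        ("01M001", "Manhattan"),  # deliberate duplicate row
        ("02M002", "Manhattan"),
        ("10X003", "Bronx"),
    ],
)

counts = demo.execute(
    """
    SELECT borough,
           COUNT(*) AS row_count,
           COUNT(DISTINCT dbn) AS school_count
    FROM directory
    GROUP BY borough
    ORDER BY borough;
    """
).fetchall()
print(counts)  # [('Bronx', 1, 1), ('Manhattan', 3, 2)]
```

The two counts only differ where a `dbn` repeats, so comparing them on the real table is a quick way to spot duplicate rows.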


In [6]:
# Question 2: average % of ELL students per borough
# I just calculate avg(ell_percent) and group by borough

query = """
SELECT 
    h.borough, 
    AVG(d.ell_percent) AS avg_ell_percent
FROM nyc_schools.school_demographics d
INNER JOIN nyc_schools.high_school_directory h 
    ON d.dbn = h.dbn
GROUP BY h.borough
ORDER BY avg_ell_percent DESC;
"""

df = pd.read_sql(query, conn)
print(df)



     borough  avg_ell_percent
0  Manhattan           7.5725
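
Only one borough comes back because an `INNER JOIN` keeps just the schools whose `dbn` appears in both tables. One way to check coverage is to start from the directory, `LEFT JOIN` the demographics, and count the matched schools per borough. The sketch below runs against in-memory `sqlite3` tables with made-up rows; the `ATTACH` line is only there so the `nyc_schools.` prefix works in SQLite.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# Attach a second in-memory database named nyc_schools so the
# schema-qualified table names from the notebook work unchanged.
demo.execute("ATTACH DATABASE ':memory:' AS nyc_schools")
demo.execute("CREATE TABLE nyc_schools.high_school_directory (dbn TEXT, borough TEXT)")
demo.execute("CREATE TABLE nyc_schools.school_demographics (dbn TEXT, ell_percent REAL)")
demo.executemany(
    "INSERT INTO nyc_schools.high_school_directory VALUES (?, ?)",
    [("01M001", "Manhattan"), ("02M002", "Manhattan"),
     ("10X003", "Bronx"), ("13K004", "Brooklyn")],
)
# Illustrative demographics: Manhattan schools only.
demo.executemany(
    "INSERT INTO nyc_schools.school_demographics VALUES (?, ?)",
    [("01M001", 5.0), ("01M001", 7.0), ("02M002", 9.0)],
)

coverage_query = """
SELECT
    h.borough,
    COUNT(DISTINCT h.dbn) AS schools_in_directory,
    COUNT(DISTINCT d.dbn) AS schools_with_demographics
FROM nyc_schools.high_school_directory h
LEFT JOIN nyc_schools.school_demographics d
    ON d.dbn = h.dbn
GROUP BY h.borough
ORDER BY h.borough;
"""
coverage = demo.execute(coverage_query).fetchall()
print(coverage)  # [('Bronx', 1, 0), ('Brooklyn', 1, 0), ('Manhattan', 2, 2)]
```

Against the real database, the same `coverage_query` could be passed to `pd.read_sql(coverage_query, conn)`. A borough with `schools_with_demographics` equal to 0 drops out of any inner join with the demographics table.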


In [7]:
df = pd.read_sql("SELECT * FROM nyc_schools.school_demographics", conn)
df.to_csv("school_demographics.csv", index=False)
# Export the full demographics table to CSV for offline inspection (thanks, AI).



In [8]:
df = pd.read_sql("SELECT * FROM nyc_schools.high_school_directory", conn)
df.to_csv("high_school_directory.csv", index=False)
# Looking at the exports, I understood from the dbn that only Manhattan data was provided here: dbn values with an M inside are Manhattan, X is Bronx, K is Brooklyn...
# It follows a clear logic; only Staten Island doesn't, those are R.

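
The DBN pattern noted in the comment above can be turned into a small check. This is a sketch that assumes the third character of a DBN is the borough letter, as in `27Q260`. The helper name `borough_from_dbn` is hypothetical, and `Q` for Queens is read from the Step 3 output rather than from the comment.

```python
# Borough letters as noted above; Q is taken from DBNs like 27Q260 (Queens).
BOROUGH_BY_LETTER = {
    "M": "Manhattan",
    "X": "Bronx",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",
}


def borough_from_dbn(dbn):
    """Return the borough encoded in a DBN, or None if it cannot be read."""
    if len(dbn) < 3:
        return None
    return BOROUGH_BY_LETTER.get(dbn[2].upper())


# DBNs taken from the Step 3 output.
sample = ["27Q260", "21K559", "08X305", "03M485"]
decoded = {dbn: borough_from_dbn(dbn) for dbn in sample}
print(decoded)
```

Applied to the exported demographics, for example with `df["dbn"].map(borough_from_dbn).value_counts()`, this would show at a glance which boroughs the table actually covers.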


In [9]:
# Question 3: Top 3 schools per borough with the highest % of special education students (sped_percent)
# I join school_demographics with high_school_directory to get borough and school_name
# I use ROW_NUMBER() to rank schools in each borough by sped_percent

query = """
WITH school_ranks AS (
    SELECT 
        h.borough,
        h.school_name,
        d.sped_percent,
        ROW_NUMBER() OVER (
            PARTITION BY h.borough
            ORDER BY d.sped_percent DESC
        ) AS rank_in_borough
    FROM nyc_schools.school_demographics d
    INNER JOIN nyc_schools.high_school_directory h 
        ON d.dbn = h.dbn
)
SELECT borough, school_name, sped_percent
FROM school_ranks
WHERE rank_in_borough <= 3
ORDER BY borough, rank_in_borough;
"""

df = pd.read_sql(query, conn)
df




| | borough | school_name | sped_percent |
|---|---|---|---|
| 0 | Manhattan | East Side Community School | 28.8 |
| 1 | Manhattan | East Side Community School | 27.7 |
| 2 | Manhattan | East Side Community School | 26.7 |
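
The same school fills all three slots, which suggests that `school_demographics` holds several rows per `dbn`. One row per school year is a plausible cause, but that is an assumption the notebook does not confirm. A sketch of one possible fix is to collapse each school to a single value first (here its highest `sped_percent`) and rank afterwards. It runs against in-memory `sqlite3` tables with made-up schools and needs SQLite 3.25 or newer for window functions; the PostgreSQL query would have the same shape.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# ATTACH makes the nyc_schools. prefix usable in SQLite.
demo.execute("ATTACH DATABASE ':memory:' AS nyc_schools")
demo.execute(
    "CREATE TABLE nyc_schools.high_school_directory (dbn TEXT, school_name TEXT, borough TEXT)"
)
demo.execute("CREATE TABLE nyc_schools.school_demographics (dbn TEXT, sped_percent REAL)")
demo.executemany(
    "INSERT INTO nyc_schools.high_school_directory VALUES (?, ?, ?)",
    [("01M001", "School A", "Manhattan"), ("02M002", "School B", "Manhattan"),
     ("03M003", "School C", "Manhattan"), ("04M004", "School D", "Manhattan")],
)
# School A has three rows, mimicking the repeated school in the output above.
demo.executemany(
    "INSERT INTO nyc_schools.school_demographics VALUES (?, ?)",
    [("01M001", 28.8), ("01M001", 27.7), ("01M001", 26.7),
     ("02M002", 20.0), ("03M003", 25.0), ("04M004", 10.0)],
)

dedup_query = """
WITH school_max AS (
    -- one row per school: its highest sped_percent
    SELECT dbn, MAX(sped_percent) AS sped_percent
    FROM nyc_schools.school_demographics
    GROUP BY dbn
),
school_ranks AS (
    SELECT
        h.borough,
        h.school_name,
        m.sped_percent,
        ROW_NUMBER() OVER (
            PARTITION BY h.borough
            ORDER BY m.sped_percent DESC
        ) AS rank_in_borough
    FROM school_max m
    INNER JOIN nyc_schools.high_school_directory h
        ON m.dbn = h.dbn
)
SELECT borough, school_name, sped_percent
FROM school_ranks
WHERE rank_in_borough <= 3
ORDER BY borough, rank_in_borough;
"""
top3 = demo.execute(dedup_query).fetchall()
print(top3)
```

Taking the maximum is only one choice. If the repeated rows are school years, filtering to the most recent year or averaging across years may answer the question better.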


In [10]:
query = """
SELECT school_name, sped_percent
FROM nyc_schools.school_demographics d
JOIN nyc_schools.high_school_directory h 
    ON d.dbn = h.dbn
WHERE h.borough = 'Bronx'
ORDER BY sped_percent DESC
LIMIT 3;
"""
df = pd.read_sql(query, conn)
df

# Just to confirm, but since I again only get Manhattan (no Bronx rows here),
# I already know sped_percent sits in the table that only has Manhattan data,
# so a more elaborate query won't give satisfying results because our dataset is incomplete.



| | school_name | sped_percent |
|---|---|---|


## 🧠 Insights

Most frustrating was that the connection dropped so often that I had to restart ⏩ all cells multiple times.
The heat today didn't make things easier. 🫠🧠

Quick wrap-up:

Q1 -

| | borough | school_count |
|---|---|---|
| 0 | Bronx | 118 |
| 1 | Brooklyn | 121 |
| 2 | Manhattan | 106 |
| 3 | Queens | 80 |
| 4 | Staten Island | 10 |

Q2 -

| | borough | avg_ell_percent |
|---|---|---|
| 0 | Manhattan | 7.57 % |

Q3 -

| | borough | school_name | sped_percent |
|---|---|---|---|
| 0 | Manhattan | East Side Community School | 28.8 |
| 1 | Manhattan | East Side Community School | 27.7 |
| 2 | Manhattan | East Side Community School | 26.7 |

Normally I'd put the Markdown at the top, but I'll leave it down here since the 🧠 Insights field is here.
The main insight is probably that one table / sheet is only populated with Manhattan data, which led to these dirty Q2 and Q3 results. So Zarko was right: there was a trick that led us astray. 🦊
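
For the dropped connections mentioned at the top of this section, one option is to reconnect and retry a single query instead of restarting all cells. The helper `run_with_reconnect` below is a hypothetical sketch, not something the notebook used. It takes a `connect` factory and the exception type to retry on. Which exception the drops actually raised is not known from the notebook, so that choice is left to the caller. A small fake connection stands in for psycopg2 so the block runs on its own.

```python
import time


def run_with_reconnect(connect, run, retry_on, attempts=3, delay=0.0):
    """Call run(conn) on a fresh connection; reconnect and retry on retry_on errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        conn = None
        try:
            conn = connect()
            return run(conn)
        except retry_on as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
        finally:
            # Always release the connection, whether the call worked or not.
            if conn is not None:
                conn.close()
    raise last_error


class FlakyConnection:
    """Stand-in connection whose first query fails, like a dropped link."""

    failures_left = 1

    def query(self):
        if FlakyConnection.failures_left > 0:
            FlakyConnection.failures_left -= 1
            raise ConnectionError("connection dropped")
        return "ok"

    def close(self):
        pass


result = run_with_reconnect(FlakyConnection, lambda c: c.query(), retry_on=ConnectionError)
print(result)  # ok
```

In the notebook, `connect` would wrap the `psycopg2.connect(...)` call from Step 2 and `run` would wrap `pd.read_sql(query, conn)`.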


