# 🧠 Day 3 – SQL via Python: NYC School Data Exploration

### In this notebook I connect to a PostgreSQL database and execute SQL queries to explore NYC school data.


In [1]:
import pandas as pd
import psycopg2

## 🔐 Step 2: Connect to the Database

In [2]:
# DB connection setup using hardcoded credentials (for onboarding only)
conn = psycopg2.connect(
    dbname="neondb",
    user="neondb_owner",
    password="npg_CeS9fJg2azZD",
    host="ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech",
    port="5432",
    sslmode="require"
)
cur = conn.cursor()
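
Hardcoded credentials are acceptable for a throwaway onboarding database, but they leak easily once a notebook is shared. A minimal sketch of building the same connection arguments from environment variables follows. The variable names (`PGDATABASE`, `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGSSLMODE`) follow the libpq convention, and `load_db_config` is a hypothetical helper that is not part of the original notebook.

```python
import os


def load_db_config(env=None):
    """Build keyword arguments for psycopg2.connect() from environment variables.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    required = {
        "dbname": "PGDATABASE",
        "user": "PGUSER",
        "password": "PGPASSWORD",
        "host": "PGHOST",
    }
    # Fail early with a clear message instead of a confusing connection error.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    config = {key: env[var] for key, var in required.items()}
    config["port"] = env.get("PGPORT", "5432")          # same default as the cell above
    config["sslmode"] = env.get("PGSSLMODE", "require")  # same default as the cell above
    return config


# Usage (assumes the variables were exported before the notebook kernel started):
# conn = psycopg2.connect(**load_db_config())
```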

In [3]:
# run_sql() is only defined in a later cell, so calling it here raises NameError.
# Until then, query through the psycopg2 connection opened above.
pd.read_sql("""
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_name IN ('high_school_directory','school_demographics','school_safety_report')
ORDER BY table_schema, table_name;
""", conn)

## 🔍 Step 3: Run a Test Query

In [None]:
query = "SELECT * FROM nyc_schools.high_school_directory LIMIT 5;"
df = pd.read_sql(query, conn)
df.head()


| | dbn | school_name | borough | building_code | phone_number | fax_number | grade_span_min | grade_span_max | expgrade_span_min | expgrade_span_max | ... | number_programs | Location 1 | Community Board | Council District | Census Tract | Zip Codes | Community Districts | Borough Boundaries | City Council Districts | Police Precincts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27Q260 | Frederick Douglass Academy VI High School | Queens | Q465 | 718-471-2154 | 718-471-2890 | 9.0 | 12 | | | ... | 1 | {'latitude': '40.601989336', 'longitude': '-73... | 14 | 31 | 100802 | 20529 | 51 | 3 | 47 | 59 |
| 1 | 21K559 | Life Academy High School for Film and Music | Brooklyn | K400 | 718-333-7750 | 718-333-7775 | 9.0 | 12 | | | ... | 1 | {'latitude': '40.593593811', 'longitude': '-73... | 13 | 47 | 306 | 17616 | 21 | 2 | 45 | 35 |
| 2 | 16K393 | Frederick Douglass Academy IV Secondary School | Brooklyn | K026 | 718-574-2820 | 718-574-2821 | 9.0 | 12 | | | ... | 1 | {'latitude': '40.692133704', 'longitude': '-73... | 3 | 36 | 291 | 18181 | 69 | 2 | 49 | 52 |
| 3 | 08X305 | Pablo Neruda Academy | Bronx | X450 | 718-824-1682 | 718-824-1663 | 9.0 | 12 | | | ... | 1 | {'latitude': '40.822303765', 'longitude': '-73... | 9 | 18 | 16 | 11611 | 58 | 5 | 31 | 26 |
| 4 | 03M485 | Fiorello H. LaGuardia High School of Music & A... | Manhattan | M485 | 212-496-0700 | 212-724-5748 | 9.0 | 12 | | | ... | 6 | {'latitude': '40.773670507', 'longitude': '-73... | 7 | 6 | 151 | 12420 | 20 | 4 | 19 | 12 |


In [None]:
from sqlalchemy import create_engine, text
import pandas as pd

DB_NAME = "neondb"
DB_USER = "neondb_owner"
DB_PASS = "npg_CeS9fJg2azZD"
DB_HOST = "ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech"
DB_PORT = 5432

url = f"postgresql+psycopg2://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}?sslmode=require"
engine = create_engine(url, pool_pre_ping=True)

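def run_sql(sql: str) -> pd.DataFrame:
    """Run a read-only query and return the result as a DataFrame."""
    with engine.connect() as conn:
        return pd.read_sql(text(sql), conn)

def exec_sql(sql: str, params: dict | None = None) -> None:
    """Run a statement inside a transaction that commits on success."""
    with engine.begin() as conn:
        conn.execute(text(sql), params or {})

# SET is session-scoped, so it may not carry over to other pooled connections;
# schema-qualified table names (nyc_schools.<table>) are the safer option.
exec_sql("SET search_path TO nyc_schools, public")
run_sql("SHOW search_path")

# sanity check
run_sql("SELECT 1 AS ok")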


| | ok |
|---|---|
| 0 | 1 |


In [None]:
run_sql("""
SELECT column_name
FROM information_schema.columns
WHERE table_schema='nyc_schools' AND table_name='high_school_directory'
ORDER BY ordinal_position
""")


| | column_name |
|---|---|
| 0 | dbn |
| 1 | school_name |
| 2 | borough |
| 3 | building_code |
| 4 | phone_number |
| ... | ... |
| 100 | Zip Codes |
| 101 | Community Districts |
| 102 | Borough Boundaries |
| 103 | City Council Districts |


# 🧮 1. Schools by borough

The query counts unique schools (`dbn`) per borough to show how high schools are distributed across the city.

**Result:**
- Brooklyn has the most schools (121), followed by the Bronx (118) and Manhattan (106).
- Queens has 80 schools.
- Staten Island has the fewest (10).

**Insight:** Brooklyn, the Bronx, and Manhattan together account for 345 of the 435 schools in the directory (about 79%), while Staten Island has only 10. School counts alone say nothing about enrollment or funding, so this shows where schools are located rather than how well they are resourced.


In [None]:
# Q1: number of distinct schools per borough
run_sql("""
SELECT borough, COUNT(DISTINCT dbn) AS school_count
FROM nyc_schools.high_school_directory
GROUP BY borough
ORDER BY school_count DESC, borough;
""")


| | borough | school_count |
|---|---|---|
| 0 | Brooklyn | 121 |
| 1 | Bronx | 118 |
| 2 | Manhattan | 106 |
| 3 | Queens | 80 |
| 4 | Staten Island | 10 |
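
The counting logic can be checked offline on toy data. The sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres, with made-up rows rather than the real directory, to show why `COUNT(DISTINCT dbn)` is the safer choice if a school were ever listed more than once.

```python
import sqlite3

# In-memory toy table; the rows are invented for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE high_school_directory (dbn TEXT, borough TEXT)")
con.executemany(
    "INSERT INTO high_school_directory VALUES (?, ?)",
    [("A1", "Brooklyn"), ("A1", "Brooklyn"), ("A2", "Brooklyn"), ("B1", "Queens")],
)

# COUNT(*) counts rows, COUNT(DISTINCT dbn) counts schools.
rows = con.execute("""
    SELECT borough, COUNT(*) AS row_count, COUNT(DISTINCT dbn) AS school_count
    FROM high_school_directory
    GROUP BY borough
    ORDER BY school_count DESC, borough
""").fetchall()
print(rows)  # [('Brooklyn', 3, 2), ('Queens', 1, 1)]
```

The duplicated `A1` row inflates `row_count` for Brooklyn but leaves `school_count` at 2.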


# 🎓 2. Average % of English Language Learners (ELL) per borough 

The query joins the high school directory with the demographics table on `dbn` and averages the percentage of English Language Learners (`ell_percent`) for each borough.

**Result:**
Average ELL% per borough (simple mean over joined demographics rows with a non-null `ell_percent`):
- Manhattan: 7.57% (n=40 rows, which are not necessarily 40 distinct schools)
- Bronx, Brooklyn, Queens, Staten Island: n/a (no joined rows with ELL data)

The `LEFT JOIN` keeps every borough in the output even when it has no ELL data. We avoid `COALESCE(..., 0)` because it would treat missing values as 0 and bias the average downward. `AVG()` in Postgres ignores NULL, which is correct here.
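
The effect of NULL handling on the average is easy to reproduce on toy values. This sketch uses the built-in `sqlite3` module as a stand-in for Postgres (skipping NULL in `AVG()` is standard SQL behavior); the table name `demo` and its three values are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (ell_percent REAL)")
# One of the three rows has no ELL value.
con.executemany("INSERT INTO demo VALUES (?)", [(10.0,), (None,), (20.0,)])

avg_skip_null, avg_null_as_zero, non_null_rows = con.execute("""
    SELECT AVG(ell_percent),               -- NULL rows are ignored
           AVG(COALESCE(ell_percent, 0)),  -- NULL rows count as 0
           COUNT(ell_percent)              -- coverage: rows with a value
    FROM demo
""").fetchone()
print(avg_skip_null, avg_null_as_zero, non_null_rows)  # 15.0 10.0 2
```

Replacing the missing value with 0 pulls the mean from 15.0 down to 10.0, which is why the coverage count is reported next to the average instead.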

In [None]:
pd.read_sql("""
SELECT
    h.borough,
    ROUND(AVG(d.ell_percent)::numeric, 2) AS avg_ell_percent,
    COUNT(d.ell_percent) AS rows_with_ell
FROM nyc_schools.high_school_directory h
LEFT JOIN nyc_schools.school_demographics d USING (dbn)
GROUP BY h.borough
ORDER BY h.borough;
""", engine)

| | borough | avg_ell_percent | rows_with_ell |
|---|---|---|---|
| 0 | Bronx | | 0 |
| 1 | Brooklyn | | 0 |
| 2 | Manhattan | 7.57 | 40 |
| 3 | Queens | | 0 |
| 4 | Staten Island | | 0 |


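# 🔗 3. Top-3 schools per borough by special-education share

The query ranks the joined directory and demographics rows within each borough by special-education share (`sped_percent`) using a window function, then keeps ranks 1 to 3.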

In [None]:
pd.read_sql("""
SELECT borough, school_name, dbn, ROUND(sped_percent::numeric, 2) AS sped_percent
FROM (
  SELECT
    h.borough, h.school_name, h.dbn, d.sped_percent,
    DENSE_RANK() OVER (
      PARTITION BY h.borough
      ORDER BY d.sped_percent DESC NULLS LAST
    ) AS rk
  FROM nyc_schools.high_school_directory h
  JOIN nyc_schools.school_demographics d USING (dbn)
) s
WHERE rk <= 3
ORDER BY borough, sped_percent DESC;
""", engine)


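| | borough | school_name | dbn | sped_percent |
|---|---|---|---|---|
| 0 | Manhattan | East Side Community School | 01M450 | 28.8 |
| 1 | Manhattan | East Side Community School | 01M450 | 27.7 |
| 2 | Manhattan | East Side Community School | 01M450 | 26.7 |

All three rows belong to the same school: `school_demographics` holds several rows per `dbn`, and each row is ranked separately. Only Manhattan appears because the inner join drops boroughs that have no matching demographics rows.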
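
One way to get one entry per school is to reduce the demographics table to a single row per `dbn` before ranking. The sketch below does this on invented data with the built-in `sqlite3` module (window functions need SQLite 3.25 or newer). The `schoolyear` column and the "latest year wins" rule are assumptions for illustration; the real table may need a different column or rule.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy tables with made-up rows; `schoolyear` is an assumed column name.
con.executescript("""
    CREATE TABLE directory (dbn TEXT, school_name TEXT, borough TEXT);
    CREATE TABLE demographics (dbn TEXT, schoolyear TEXT, sped_percent REAL);
    INSERT INTO directory VALUES
        ('S1', 'School One', 'Manhattan'), ('S2', 'School Two', 'Manhattan');
    INSERT INTO demographics VALUES
        ('S1', '2010', 28.8), ('S1', '2011', 27.7), ('S1', '2012', 26.7),
        ('S2', '2011', 12.0), ('S2', '2012', 15.5);
""")

rows = con.execute("""
    WITH latest AS (              -- step 1: newest demographics row per school
        SELECT dbn, sped_percent,
               ROW_NUMBER() OVER (PARTITION BY dbn ORDER BY schoolyear DESC) AS yr_rank
        FROM demographics
    ),
    ranked AS (                   -- step 2: rank schools within each borough
        SELECT h.borough, h.school_name, l.sped_percent,
               ROW_NUMBER() OVER (PARTITION BY h.borough
                                  ORDER BY l.sped_percent DESC) AS rk
        FROM directory h
        JOIN latest l ON l.dbn = h.dbn AND l.yr_rank = 1
    )
    SELECT borough, school_name, sped_percent
    FROM ranked
    WHERE rk <= 3
    ORDER BY borough, sped_percent DESC
""").fetchall()
print(rows)  # [('Manhattan', 'School One', 26.7), ('Manhattan', 'School Two', 15.5)]
```

Each school now appears once, represented by its most recent row, so the top-3 list contains distinct schools.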


# Summary


**Setup**
- Restored SQLAlchemy engine; `search_path = nyc_schools, public`.
- Join key: `high_school_directory.dbn` ↔ `school_demographics.dbn`.

**Q1: Schools per borough**
- Brooklyn **121**, Bronx **118**, Manhattan **106**, Queens **80**, Staten Island **10**.

**Q2: Avg ELL% per borough**
- `AVG(d.ell_percent)` (no `COALESCE`), `LEFT JOIN` on `dbn`; also report `COUNT(ell_percent)` as coverage.
- Manhattan **7.57%** (40 rows); no ELL data for the other boroughs.

**Q3: Top-3 sped% per borough**
- Use windowing with `PARTITION BY borough ORDER BY sped_percent DESC NULLS LAST` rather than `MAX`/`AVG`.
- The query in this notebook uses `DENSE_RANK()`, which keeps ties; switch to `ROW_NUMBER()` for exactly three rows per borough.
- Result: only Manhattan appears, and all three rows belong to East Side Community School (`01M450`), because the demographics table has several rows per `dbn`.
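
The difference between the two ranking functions shows up only when values tie. A small sketch on invented data, again using the built-in `sqlite3` module as a stand-in for Postgres (SQLite 3.25 or newer for window functions); the table `s` and its rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE s (school TEXT, sped_percent REAL)")
con.executemany(
    "INSERT INTO s VALUES (?, ?)",
    [("A", 30.0), ("B", 30.0), ("C", 25.0), ("D", 20.0), ("E", 10.0)],  # A and B tie
)

rows = con.execute("""
    SELECT school,
           ROW_NUMBER() OVER (ORDER BY sped_percent DESC, school) AS rn,  -- 1,2,3,4,5
           DENSE_RANK() OVER (ORDER BY sped_percent DESC)         AS dr   -- 1,1,2,3,4
    FROM s
    ORDER BY rn
""").fetchall()

top3_row_number = [school for school, rn, dr in rows if rn <= 3]
top3_dense_rank = [school for school, rn, dr in rows if dr <= 3]
print(top3_row_number)  # ['A', 'B', 'C']
print(top3_dense_rank)  # ['A', 'B', 'C', 'D']
```

`ROW_NUMBER()` always returns exactly three rows, while `DENSE_RANK()` returns every row in the top three distinct values, so the tie between A and B lets D in as well.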
