# 🧠 Day 3 – SQL via Python: NYC School Data Exploration
In this notebook, you'll connect to a PostgreSQL database and execute SQL queries to explore NYC school data.

## 🔌 Step 1: Import Libraries

In [5]:
import pandas as pd



pd.set_option('display.max_columns', None)


In [6]:
%pip install sqlalchemy psycopg2-binary


Successfully installed greenlet-3.2.3 psycopg2-binary-2.9.10 sqlalchemy-2.0.42 typing-extensions-4.14.1
Note: you may need to restart the kernel to use updated packages.
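After an install like this, it can help to confirm that the packages are importable before moving on. The snippet below is an optional sketch and not part of the original notebook; `missing_packages` is a hypothetical helper name. Note that `psycopg2-binary` is imported as `psycopg2`.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


# Import names, not pip names: psycopg2-binary provides the module `psycopg2`.
required = ["pandas", "sqlalchemy", "psycopg2"]
missing = missing_packages(required)
print("Missing:", missing if missing else "none")
```

If anything is reported as missing right after `%pip install`, restarting the kernel is usually the first thing to try.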





## 🔐 Step 2: Connect to the Database

In [7]:
""" DB connection setup using hardcoded credentials (for onboarding only)
conn = psycopg2.connect(
    dbname="neondb",
    user="neondb_owner",
    password="npg_CeS9fJg2azZD",
    host="ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech",
    port="5432",
    sslmode="require"
)
cur = conn.cursor()"""


In [8]:
from sqlalchemy import create_engine

# Format: 'postgresql+psycopg2://user:password@host:port/dbname'
db_url = 'postgresql+psycopg2://neondb_owner:npg_CeS9fJg2azZD@ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech:5432/neondb'

engine = create_engine(db_url, connect_args={"sslmode": "require"})
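The URL above embeds the password directly in the notebook. One common alternative is to assemble it from environment variables. The sketch below assumes the libpq-style names `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, and `PGDATABASE` are set; these names, the `build_db_url` helper, and the example values are assumptions for illustration, not something this notebook defines.

```python
import os
from urllib.parse import quote


def build_db_url(env=None):
    """Assemble a postgresql+psycopg2 URL from environment-style settings."""
    env = os.environ if env is None else env
    # Percent-encode user and password so characters like '@' or '/' are safe.
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    host = env["PGHOST"]
    port = env.get("PGPORT", "5432")  # default PostgreSQL port
    dbname = env["PGDATABASE"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"


# Made-up values, only to show the resulting format.
example_env = {
    "PGUSER": "demo_user",
    "PGPASSWORD": "p@ss/word",
    "PGHOST": "db.example.com",
    "PGDATABASE": "demo_db",
}
print(build_db_url(example_env))
```

With the variables set in the real environment, the engine could then be created with `create_engine(build_db_url(), connect_args={"sslmode": "require"})`.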


In [9]:
tables_df = pd.read_sql("""
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_schema NOT IN ('information_schema', 'pg_catalog')
    ORDER BY table_schema, table_name;
""", engine)

tables_df

| | table_schema | table_name |
|---|--------------|------------|
| 0 | dependency_example | departments |
| 1 | dependency_example | districts |
| 2 | dependency_example | employees |
| 3 | dependency_example | neighborhoods |
| 4 | nyc_schools | high_school_directory |
| 5 | nyc_schools | sat_scores |
| 6 | nyc_schools | school_demographics |
| 7 | nyc_schools | school_safety_report |
| 8 | public | departments |
| 9 | public | dependency_example.departments |


## 🔍 Step 3: Run a Test Query

In [10]:
query = "SELECT * FROM nyc_schools.high_school_directory LIMIT 5"
df = pd.read_sql(query, engine)
print(df.columns.tolist())

['dbn', 'school_name', 'borough', 'building_code', 'phone_number', 'fax_number', 'grade_span_min', 'grade_span_max', 'expgrade_span_min', 'expgrade_span_max', 'start_time', 'end_time', 'priority01', 'priority02', 'priority03', 'priority04', 'priority05', 'priority06', 'priority07', 'priority08', 'priority09', 'priority10', 'location', 'phone_number2', 'school_email', 'website', 'subway', 'bus', 'grades2018', 'finalgrades', 'total_students', 'extracurricular_activities', 'school_sports', 'attendance_rate', 'pct_stu_enough_variety', 'pct_stu_safe', 'school_accessibility_description', 'directions1', 'requirement1', 'requirement2', 'requirement3', 'requirement4', 'requirement5', 'program1', 'code1', 'interest1', 'method1', 'seats9ge1', 'grade9gefilledflag1', 'grade9geapplicants1', 'seats9swd1', 'grade9swdfilledflag1', 'grade9swdapplicants1', 'campus_name', 'building_borough', 'building_location', 'latitude', 'longitude', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl', 

In [11]:
df_demo = pd.read_sql("SELECT * FROM nyc_schools.school_demographics LIMIT 1", engine)
print(df_demo.columns.tolist())

['dbn', 'Name', 'schoolyear', 'fl_percent', 'frl_percent', 'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4', 'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent', 'ctt_num', 'selfcontained_num', 'asian_num', 'asian_per', 'black_num', 'black_per', 'hispanic_num', 'hispanic_per', 'white_num', 'white_per', 'male_num', 'male_per', 'female_num', 'female_per']


In [12]:
df_safety = pd.read_sql("SELECT * FROM nyc_schools.school_safety_report LIMIT 1", engine)
print(df_safety.columns.tolist())

['school_year', 'building_code', 'dbn', 'location_name', 'location_code', 'address', 'borough', 'geographical_district_code', 'register', 'building_name', 'num_schools', 'schools_in_building', 'major_n', 'oth_n', 'nocrim_n', 'prop_n', 'vio_n', 'engroupa', 'rangea', 'avgofmajor_n', 'avgofoth_n', 'avgofnocrim_n', 'avgofprop_n', 'avgofvio_n', 'borough_name', 'postcode', 'latitude', 'longitude', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl', 'nta', '_schools']
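Before joining, it is worth checking which column names two tables actually share. The sketch below compares the column lists printed above for school_demographics and school_safety_report; `shared_columns` is a hypothetical helper. The directory list is left out because its printed output is truncated.

```python
def shared_columns(cols_a, cols_b):
    """Return the columns of cols_a that also appear in cols_b, keeping order."""
    other = set(cols_b)
    return [c for c in cols_a if c in other]


# Column lists copied from the notebook outputs above.
demo_cols = ['dbn', 'Name', 'schoolyear', 'fl_percent', 'frl_percent', 'total_enrollment',
             'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4', 'grade5', 'grade6',
             'grade7', 'grade8', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_num',
             'ell_percent', 'sped_num', 'sped_percent', 'ctt_num', 'selfcontained_num',
             'asian_num', 'asian_per', 'black_num', 'black_per', 'hispanic_num',
             'hispanic_per', 'white_num', 'white_per', 'male_num', 'male_per',
             'female_num', 'female_per']
safety_cols = ['school_year', 'building_code', 'dbn', 'location_name', 'location_code',
               'address', 'borough', 'geographical_district_code', 'register',
               'building_name', 'num_schools', 'schools_in_building', 'major_n', 'oth_n',
               'nocrim_n', 'prop_n', 'vio_n', 'engroupa', 'rangea', 'avgofmajor_n',
               'avgofoth_n', 'avgofnocrim_n', 'avgofprop_n', 'avgofvio_n', 'borough_name',
               'postcode', 'latitude', 'longitude', 'community_board', 'council_district',
               'census_tract', 'bin', 'bbl', 'nta', '_schools']

shared = shared_columns(demo_cols, safety_cols)
print(shared)
```

Only `dbn` is common to both lists, which is why the joins in this notebook use it as the key. The year columns are spelled differently (`schoolyear` versus `school_year`), so they do not match by name.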


## ✅ Task Queries Below

In [13]:
query = """
SELECT borough, COUNT(*) AS num_schools
FROM nyc_schools.high_school_directory
GROUP BY borough
ORDER BY num_schools DESC;
"""

df_borough_counts = pd.read_sql(query, engine)
df_borough_counts


| | borough | num_schools |
|---|---------|-------------|
| 0 | Brooklyn | 121 |
| 1 | Bronx | 118 |
| 2 | Manhattan | 106 |
| 3 | Queens | 80 |
| 4 | Staten Island | 10 |


In [14]:
query = """
SELECT
    h.borough,
    ROUND(AVG(d.ell_percent)::numeric, 2) AS avg_ell_percent
FROM nyc_schools.school_demographics d
JOIN nyc_schools.high_school_directory h
    ON d.dbn = h.dbn
GROUP BY h.borough
ORDER BY avg_ell_percent DESC;
"""

df_ell = pd.read_sql(query, engine)
df_ell



| | borough | avg_ell_percent |
|---|---------|-----------------|
| 0 | Manhattan | 7.57 |


In [15]:
query = """SELECT DISTINCT h.borough
FROM nyc_schools.school_demographics d
JOIN nyc_schools.high_school_directory h
  ON d.dbn = h.dbn;
  """
print(pd.read_sql(query, engine))

     borough
0  Manhattan


In [16]:
query = """SELECT DISTINCT s.borough_name
FROM nyc_schools.school_demographics d
JOIN nyc_schools.school_safety_report s
  ON d.dbn = s.dbn;"""
print(pd.read_sql(query, engine))

  borough_name
0    MANHATTAN
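The two tables spell the borough differently: `Manhattan` in high_school_directory and `MANHATTAN` in school_safety_report. Any comparison on the borough text would need a common form first. A minimal sketch, where `normalize_borough` is a hypothetical helper:

```python
def normalize_borough(name):
    """Collapse extra whitespace and convert to title case."""
    return " ".join(name.split()).title()


for raw in ["MANHATTAN", "Manhattan", "  staten   island "]:
    print(repr(raw), "->", normalize_borough(raw))
```

On the SQL side, PostgreSQL's `INITCAP(borough_name)` gives a similar title-cased result.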


Conclusion:

The analysis of English Language Learner (ELL) percentages across boroughs was limited by the available data. The school_demographics table covers only a subset of schools, and in this dataset every demographic record that matches high_school_directory on dbn belongs to a Manhattan school.

As a result, average ELL percentages could be calculated only for Manhattan schools.
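An inner join silently drops schools that have no demographic rows, so a per-borough coverage count makes the gap visible. The sketch below uses an in-memory SQLite database with a few made-up rows (the `dbn` values and numbers are invented for illustration); only the table and column names mirror the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE high_school_directory (dbn TEXT, borough TEXT);
CREATE TABLE school_demographics (dbn TEXT, schoolyear TEXT, ell_percent REAL);

-- Toy rows, not real data.
INSERT INTO high_school_directory VALUES
    ('01M001', 'Manhattan'), ('01M002', 'Manhattan'),
    ('10X001', 'Bronx'), ('13K001', 'Brooklyn');
INSERT INTO school_demographics VALUES
    ('01M001', '20102011', 5.0), ('01M001', '20112012', 7.0),
    ('99Z999', '20112012', 3.0);
""")

# LEFT JOIN keeps every directory school; unmatched ones contribute NULL dbn values,
# which COUNT(DISTINCT d.dbn) ignores.
coverage = conn.execute("""
SELECT h.borough,
       COUNT(DISTINCT h.dbn) AS schools_in_directory,
       COUNT(DISTINCT d.dbn) AS schools_with_demographics
FROM high_school_directory h
LEFT JOIN school_demographics d ON d.dbn = h.dbn
GROUP BY h.borough
ORDER BY h.borough;
""").fetchall()

for row in coverage:
    print(row)
```

The same query shape should carry over to the PostgreSQL tables by prefixing them with `nyc_schools.`; boroughs with zero in the last column are the ones the inner join removes.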

In [17]:
query_sped_top3 = """
WITH joined_data AS (
    SELECT
        h.borough,
        h.school_name,
        d.sped_percent
    FROM nyc_schools.school_demographics d
    JOIN nyc_schools.high_school_directory h
        ON d.dbn = h.dbn
    WHERE d.sped_percent IS NOT NULL
)
SELECT borough, school_name, sped_percent
FROM (
    SELECT
        borough,
        school_name,
        sped_percent,
        ROW_NUMBER() OVER (PARTITION BY borough ORDER BY sped_percent DESC) AS rn
    FROM joined_data
) sub
WHERE rn <= 3
ORDER BY borough, sped_percent DESC;
"""

df_sped_top3 = pd.read_sql(query_sped_top3, engine)
df_sped_top3


| | borough | school_name | sped_percent |
|---|---------|-------------|--------------|
| 0 | Manhattan | East Side Community School | 28.8 |
| 1 | Manhattan | East Side Community School | 27.7 |
| 2 | Manhattan | East Side Community School | 26.7 |
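The same school fills all three slots, which suggests the ranking runs over one row per school per schoolyear. One way to rank distinct schools is to keep a single row per `dbn` first. The sketch below keeps the latest schoolyear per school, which is an assumption about what "top 3" should mean here, and it runs on toy rows in SQLite rather than the real database; the `schoolyear` format shown is also assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE high_school_directory (dbn TEXT, school_name TEXT, borough TEXT);
CREATE TABLE school_demographics (dbn TEXT, schoolyear TEXT, sped_percent REAL);

-- Toy rows, not real data.
INSERT INTO high_school_directory VALUES
    ('M1', 'School A', 'Manhattan'), ('M2', 'School B', 'Manhattan'),
    ('M3', 'School C', 'Manhattan'), ('M4', 'School D', 'Manhattan');
INSERT INTO school_demographics VALUES
    ('M1', '20102011', 28.0), ('M1', '20112012', 25.0),
    ('M2', '20112012', 20.0), ('M3', '20112012', 15.0),
    ('M4', '20112012', 10.0);
""")

top3 = conn.execute("""
WITH latest AS (
    -- Number each school's rows from newest to oldest schoolyear.
    SELECT dbn, sped_percent,
           ROW_NUMBER() OVER (PARTITION BY dbn ORDER BY schoolyear DESC) AS yr_rank
    FROM school_demographics
    WHERE sped_percent IS NOT NULL
),
ranked AS (
    -- Rank schools within each borough using only their newest row.
    SELECT h.borough, h.school_name, l.sped_percent,
           ROW_NUMBER() OVER (PARTITION BY h.borough
                              ORDER BY l.sped_percent DESC) AS rn
    FROM latest l
    JOIN high_school_directory h ON h.dbn = l.dbn
    WHERE l.yr_rank = 1
)
SELECT borough, school_name, sped_percent
FROM ranked
WHERE rn <= 3
ORDER BY borough, sped_percent DESC;
""").fetchall()

for row in top3:
    print(row)
```

Taking the maximum or the average per school instead of the latest year would be equally defensible; the point is to collapse to one row per school before applying `ROW_NUMBER()` per borough.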


## 🧠 Insights

# NYC School Analysis Summary (Day 3)

## 1. School Distribution by Borough

| Borough | Number of Schools |
|---------|-------------------|
| Brooklyn | 121 |
| Bronx | 118 |
| Manhattan | 106 |
| Queens | 80 |
| Staten Island | 10 |

Brooklyn (121) and the Bronx (118) have the most schools listed in high_school_directory, while Staten Island has the fewest (10).
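Raw counts are easier to compare as shares of the total. A small sketch using the counts from the table above:

```python
# Counts copied from the borough table above.
borough_counts = {
    "Brooklyn": 121,
    "Bronx": 118,
    "Manhattan": 106,
    "Queens": 80,
    "Staten Island": 10,
}

total = sum(borough_counts.values())
# Percentage of all listed schools, rounded to one decimal place.
shares = {b: round(100 * n / total, 1) for b, n in borough_counts.items()}

print("Total schools:", total)
for borough, share in shares.items():
    print(f"{borough}: {share}%")
```

Brooklyn and the Bronx together account for roughly 55% of the 435 listed schools.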

## 2. Average % of English Language Learners (ELL) per Borough

| Borough | Avg ELL % |
|---------|-----------|
| Manhattan | 7.57 |

❗ **Note:** ELL percentage data was available only for Manhattan schools. The following query, which joins the two tables on dbn, returns a single borough:

```sql
SELECT DISTINCT h.borough
FROM nyc_schools.school_demographics d
JOIN nyc_schools.high_school_directory h   
  ON d.dbn = h.dbn;
```

**Result:**
```
borough
0  Manhattan
```

## 3. ♿ Top 3 Schools per Borough by % of Special Education Students

| Borough | School Name | SPED % |
|---------|-------------|--------|
| Manhattan | East Side Community School | 28.8 |
| Manhattan | East Side Community School | 27.7 |
| Manhattan | East Side Community School | 26.7 |

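All three rows are the same school, East Side Community School, each with a different sped_percent. This is most likely because school_demographics stores one row per school per schoolyear, so the ranking is over school-year records rather than distinct schools. As with ELL data, special education (sped_percent) data was available only for Manhattan schools.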

## 4. Optional Exploration: Can school_safety_report Fill the Borough Gap?

We attempted to recover borough information for school_demographics by joining it with school_safety_report, which also contains a borough_name field. However, this join still resulted in only Manhattan records.

```sql
SELECT DISTINCT s.borough_name
FROM nyc_schools.school_demographics d
JOIN nyc_schools.school_safety_report s
  ON d.dbn = s.dbn;
```

**Result:**
```
borough_name
-------------
MANHATTAN
```

This shows that the alternate join path does not help: the demographic records that match on dbn all belong to Manhattan schools. The dataset appears to be intentionally scoped to that borough for onboarding purposes.

## 🧾 Final Summary

This analysis combined SQL and Python to explore school patterns across NYC boroughs using the provided database. While the distribution of schools was available across all boroughs, demographic data such as ELL and special education percentages were only present for Manhattan. This limitation restricts borough-level comparisons for those metrics.

Such data gaps are a common challenge in real-world data workflows. Recognizing and transparently communicating these issues is essential for accurate interpretation and decision-making.
