# Day 3 

#### ✅ Instructions

1. Set up a connection to the training database using Python (with psycopg2, sqlalchemy, or similar).
    Connection help is in `daily_tasks/day_3/day3_sql_combined_with_creds.ipynb`.

2. Open a new Jupyter Notebook and:

    - Connect to the database
    - Use pandas.read_sql() or cursor-based fetching to query and display results

3. Answer the following questions:

    🧮 School Distribution

    - How many schools are there in each borough?

    🎓 Language Learners

    - What is the average % of English Language Learners (ELL) per borough?

    🔗 Schools supporting special needs

    - Using the data from the school demographics and high school directory, write a query to find the top 3 schools in each borough with the highest percentage of special education students (sped_percent).


## Imports

In [82]:
import pandas as pd
import psycopg2
from sqlalchemy import create_engine, text
import warnings
warnings.filterwarnings('ignore')

## DB Connection

In [83]:
conn = psycopg2.connect(
    dbname = "neondb", 
    user = "neondb_owner",
    password = "a9Am7Yy5r9_T7h4OF2GN",
    host = "ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech",
    port="5432",
    sslmode="require"
)

cur = conn.cursor()

In [84]:
## test query
query = "SELECT * FROM nyc_schools.high_school_directory LIMIT 5;"
df = pd.read_sql(query, conn)
df.head()

Unnamed: 0,dbn,school_name,borough,building_code,phone_number,fax_number,grade_span_min,grade_span_max,expgrade_span_min,expgrade_span_max,...,number_programs,Location 1,Community Board,Council District,Census Tract,Zip Codes,Community Districts,Borough Boundaries,City Council Districts,Police Precincts
0,27Q260,Frederick Douglass Academy VI High School,Queens,Q465,718-471-2154,718-471-2890,9.0,12,,,...,1,"{'latitude': '40.601989336', 'longitude': '-73...",14,31,100802,20529,51,3,47,59
1,21K559,Life Academy High School for Film and Music,Brooklyn,K400,718-333-7750,718-333-7775,9.0,12,,,...,1,"{'latitude': '40.593593811', 'longitude': '-73...",13,47,306,17616,21,2,45,35
2,16K393,Frederick Douglass Academy IV Secondary School,Brooklyn,K026,718-574-2820,718-574-2821,9.0,12,,,...,1,"{'latitude': '40.692133704', 'longitude': '-73...",3,36,291,18181,69,2,49,52
3,08X305,Pablo Neruda Academy,Bronx,X450,718-824-1682,718-824-1663,9.0,12,,,...,1,"{'latitude': '40.822303765', 'longitude': '-73...",9,18,16,11611,58,5,31,26
4,03M485,Fiorello H. LaGuardia High School of Music & A...,Manhattan,M485,212-496-0700,212-724-5748,9.0,12,,,...,6,"{'latitude': '40.773670507', 'longitude': '-73...",7,6,151,12420,20,4,19,12


In [85]:
# SQLAlchemy connection string format:
# postgresql+psycopg2://user:password@host:port/dbname

DATABASE_URL = (
    "postgresql+psycopg2://neondb_owner:a9Am7Yy5r9_T7h4OF2GN"
    "@ep-falling-glitter-a5m0j5gk-pooler.us-east-2.aws.neon.tech:5432/neondb"
    "?sslmode=require"
)

# Create engine and establish connection
engine = create_engine(DATABASE_URL)
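
The connection details above are hard-coded in the notebook. As a minimal sketch, they could instead be read from environment variables. The variable names (`NEON_USER`, `NEON_PASSWORD`, `NEON_HOST`, `NEON_PORT`, `NEON_DB`) and the helper `build_database_url` are hypothetical and not part of the course material.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Assemble a SQLAlchemy URL from environment variables (names are assumptions)."""
    user = quote_plus(env["NEON_USER"])
    # Percent-encode the password so characters such as '@' or '/' do not break the URL.
    password = quote_plus(env["NEON_PASSWORD"])
    host = env["NEON_HOST"]
    port = env.get("NEON_PORT", "5432")  # fall back to the default PostgreSQL port
    dbname = env["NEON_DB"]
    return (
        f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{dbname}"
        "?sslmode=require"
    )
```

With the variables set in the shell or a local `.env` loader, the engine would be created as `create_engine(build_database_url())`, which keeps the password out of the notebook file.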

In [86]:
## test query
query = "SELECT * FROM nyc_schools.high_school_directory LIMIT 5;"
df_table = pd.read_sql(query, engine)
df_table.head()

Unnamed: 0,dbn,school_name,borough,building_code,phone_number,fax_number,grade_span_min,grade_span_max,expgrade_span_min,expgrade_span_max,...,number_programs,Location 1,Community Board,Council District,Census Tract,Zip Codes,Community Districts,Borough Boundaries,City Council Districts,Police Precincts
0,27Q260,Frederick Douglass Academy VI High School,Queens,Q465,718-471-2154,718-471-2890,9.0,12,,,...,1,"{'latitude': '40.601989336', 'longitude': '-73...",14,31,100802,20529,51,3,47,59
1,21K559,Life Academy High School for Film and Music,Brooklyn,K400,718-333-7750,718-333-7775,9.0,12,,,...,1,"{'latitude': '40.593593811', 'longitude': '-73...",13,47,306,17616,21,2,45,35
2,16K393,Frederick Douglass Academy IV Secondary School,Brooklyn,K026,718-574-2820,718-574-2821,9.0,12,,,...,1,"{'latitude': '40.692133704', 'longitude': '-73...",3,36,291,18181,69,2,49,52
3,08X305,Pablo Neruda Academy,Bronx,X450,718-824-1682,718-824-1663,9.0,12,,,...,1,"{'latitude': '40.822303765', 'longitude': '-73...",9,18,16,11611,58,5,31,26
4,03M485,Fiorello H. LaGuardia High School of Music & A...,Manhattan,M485,212-496-0700,212-724-5748,9.0,12,,,...,6,"{'latitude': '40.773670507', 'longitude': '-73...",7,6,151,12420,20,4,19,12


In [87]:
## checking columns
query="""
SELECT 
    table_schema, 
    table_name, 
    column_name, 
    data_type
FROM information_schema.columns
WHERE table_schema = 'nyc_schools'
ORDER BY ordinal_position;"""
df = pd.read_sql(query, engine)
df

| | table_schema | table_name | column_name | data_type |
|---|---|---|---|---|
| 0 | nyc_schools | mytable | somevalue | character varying |
| 1 | nyc_schools | thofa_tazkia_sat_results | dbn | text |
| 2 | nyc_schools | sat_scores_cleaned | dbn | text |
| 3 | nyc_schools | marianna_gokova_cleaned_sat_results | dbn | character varying |
| 4 | nyc_schools | anastasia_sat_results | dbn | text |
| ... | ... | ... | ... | ... |
| 342 | nyc_schools | high_school_directory | Zip Codes | character varying |
| 343 | nyc_schools | high_school_directory | Community Districts | character varying |
| 344 | nyc_schools | high_school_directory | Borough Boundaries | character varying |
| 345 | nyc_schools | high_school_directory | City Council Districts | character varying |


In [88]:
borough_tuple=('Bronx','Queens')
borough_tuple

('Bronx', 'Queens')

In [89]:
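# Pass the boroughs as a query parameter instead of formatting them into the SQL string;
# psycopg2 adapts a Python tuple to a SQL list, so a single-borough tuple also works.
query = """
SELECT *
FROM nyc_schools.high_school_directory
WHERE borough IN %s;"""
df = pd.read_sql(query, conn, params=(borough_tuple,))
df.head()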

Unnamed: 0,dbn,school_name,borough,building_code,phone_number,fax_number,grade_span_min,grade_span_max,expgrade_span_min,expgrade_span_max,...,number_programs,Location 1,Community Board,Council District,Census Tract,Zip Codes,Community Districts,Borough Boundaries,City Council Districts,Police Precincts
0,27Q260,Frederick Douglass Academy VI High School,Queens,Q465,718-471-2154,718-471-2890,9.0,12,,,...,1,"{'latitude': '40.601989336', 'longitude': '-73...",14,31,100802,20529,51,3,47,59
1,08X305,Pablo Neruda Academy,Bronx,X450,718-824-1682,718-824-1663,9.0,12,,,...,1,"{'latitude': '40.822303765', 'longitude': '-73...",9,18,16,11611,58,5,31,26
2,11X509,High School of Language and Innovation,Bronx,X415,718-944-3625,718-944-3641,9.0,12,,,...,1,"{'latitude': '40.859698316', 'longitude': '-73...",11,13,324,11607,59,5,12,32
3,08X348,Schuylerville Preparatory High School,Bronx,X405,718-904-4200,718-935-4209,9.0,11,9.0,12.0,...,1,"{'latitude': '40.840513977', 'longitude': '-73...",10,13,194,11270,43,5,12,28
4,24Q236,International High School for Health Sciences,Queens,Q455,718-595-8600,718-595-8605,9.0,11,9.0,12.0,...,1,"{'latitude': '40.7412052', 'longitude': '-73.8...",4,25,461,14784,66,3,5,68


## 🧮 School Distribution

How many schools are there in each **borough**?

In [92]:
query_schools = """
SELECT borough, COUNT(DISTINCT dbn) AS total_schools
FROM nyc_schools.high_school_directory
GROUP BY borough
ORDER BY total_schools DESC;
"""
schools_df = pd.read_sql(query_schools, engine)
schools_df

| | borough | total_schools |
|---|---|---|
| 0 | Brooklyn | 121 |
| 1 | Bronx | 118 |
| 2 | Manhattan | 106 |
| 3 | Queens | 80 |
| 4 | Staten Island | 10 |


## 🎓 Language Learners

What is the **average %** of English Language Learners (ELL) **per borough**?

In [None]:
## Average ELL percentage by borough (directory LEFT JOIN demographics).
## Note: in PostgreSQL, ORDER BY ... DESC sorts NULL averages first.
query_ell = """
SELECT borough, AVG(ell_percent) AS avg_ell_percent
FROM nyc_schools.high_school_directory hsd LEFT JOIN nyc_schools.school_demographics sd
ON hsd.dbn = sd.dbn
GROUP BY borough    
ORDER BY avg_ell_percent DESC
LIMIT 5;
"""
ell_df = pd.read_sql(query_ell, engine)
ell_df

| | borough | avg_ell_percent |
|---|---|---|
| 0 | Brooklyn | |
| 1 | Queens | |
| 2 | Staten Island | |
| 3 | Bronx | |
| 4 | Manhattan | 7.5725 |


In [93]:
## Number of schools per ELL program combination in each borough (every borough offers ELL programs)
query_ell_2 = """
SELECT borough, ell_programs, count(ell_programs) AS total_ell_programs 
FROM nyc_schools.high_school_directory
GROUP BY borough, ell_programs
ORDER BY borough;
"""
ell_df_2 = pd.read_sql(query_ell_2, engine)
ell_df_2

| | borough | ell_programs | total_ell_programs |
|---|---|---|---|
| 0 | Bronx | ESL; Transitional Bilingual Program: Spanish | 7 |
| 1 | Bronx | ESL | 109 |
| 2 | Bronx | ESL; Dual Language: Spanish | 2 |
| 3 | Brooklyn | ESL; Dual Language: Chinese; Transitional Bili... | 1 |
| 4 | Brooklyn | ESL; Transitional Bilingual Program: Arabic, C... | 1 |
| 5 | Brooklyn | ESL; Transitional Bilingual Program: Chinese, ... | 2 |
| 6 | Brooklyn | ESL; Transitional Bilingual Program: Chinese | 1 |
| 7 | Brooklyn | ESL; Dual Language: Haitian Creole, Russian; T... | 1 |
| 8 | Brooklyn | ESL; Transitional Bilingual Program: Spanish | 4 |
| 9 | Brooklyn | ESL | 111 |


### 💭 Additional Note

The query below was an idea of mine to address the issue of missing information in the *school_demographics* table.  
However, since the column **number_programs** does not specifically refer to **ELL programs**, the resulting percentage does not accurately answer the question.  
It reflects the total number of school programs, not just those related to English Language Learners, and therefore cannot be used as a reliable indicator for this analysis.

In [None]:
query_ell_3 = """
WITH ell_count AS (
  SELECT
    borough,
    SUM(
      CASE
        WHEN number_programs ~ '^[0-9]+$' THEN number_programs::numeric
        ELSE 0
      END
    ) AS total_ell_programs
  FROM nyc_schools.high_school_directory
  GROUP BY borough
)
SELECT
  borough,
  total_ell_programs,
  (ROUND(total_ell_programs / SUM(total_ell_programs) OVER (), 4))*100 AS share_of_total
FROM ell_count
ORDER BY total_ell_programs DESC;
"""
ell_df_3 = pd.read_sql(query_ell_3, engine)
ell_df_3

| | borough | total_ell_programs | share_of_total |
|---|---|---|---|
| 0 | Brooklyn | 224.0 | 30.15 |
| 1 | Queens | 170.0 | 22.88 |
| 2 | Bronx | 153.0 | 20.59 |
| 3 | Manhattan | 146.0 | 19.65 |
| 4 | Staten Island | 50.0 | 6.73 |
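
Since **number_programs** counts all programs, one possible alternative is to count the entries in the **ell_programs** column, which appears to hold a `;`-separated list per school. The sketch below assumes that each `;`-separated entry is one program type. The helper name `count_ell_programs` is hypothetical.

```python
def count_ell_programs(ell_programs):
    """Count ';'-separated entries in an ell_programs value.

    Assumption: each entry between semicolons is one ELL program type,
    e.g. 'ESL; Dual Language: Spanish' describes two programs.
    Missing values (None, NaN) count as zero.
    """
    if not isinstance(ell_programs, str):
        return 0
    # Ignore empty fragments that a trailing or doubled ';' would produce.
    return sum(1 for part in ell_programs.split(";") if part.strip())
```

Applied to the directory data, this could look like `df["ell_program_count"] = df["ell_programs"].map(count_ell_programs)` followed by a `groupby("borough")`. It would still measure program offerings, not the percentage of ELL students, so it does not answer the original question either.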


## 🔗 Schools supporting special needs

Using the data from the school demographics and high school directory, write a **query** to find the **top 3 schools in each borough with the highest percentage of special education students (sped_percent)**.

In [None]:
## Top 3 schools by special education percentage in each borough
query_sped = """
WITH ranked AS (
    SELECT
        d.borough,
        d.school_name,
        d.dbn,
        dem.sped_percent,
        RANK() OVER (PARTITION BY d.borough ORDER BY dem.sped_percent DESC) AS rank
    FROM nyc_schools.high_school_directory AS d
    JOIN nyc_schools.school_demographics AS dem
        ON d.dbn = dem.dbn
)
SELECT borough, school_name, sped_percent
FROM ranked
WHERE rank <= 3
ORDER BY borough, rank;
"""
sped_df = pd.read_sql(query_sped, engine)
sped_df

| | borough | school_name | sped_percent |
|---|---|---|---|
| 0 | Manhattan | East Side Community School | 28.8 |
| 1 | Manhattan | East Side Community School | 27.7 |
| 2 | Manhattan | East Side Community School | 26.7 |
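
The output above lists the same school three times, presumably because *school_demographics* holds several rows per `dbn` (an assumption, for example one row per school year), so the query ranks rows rather than schools. The sketch below shows one way to rank schools instead: reduce each school to a single value first, then apply `ROW_NUMBER()`. It runs on an in-memory SQLite database with made-up toy data. The table names, the `schoolyear` column, and the choice of `MAX(sped_percent)` as the per-school value are all assumptions.

```python
import sqlite3

# Toy stand-ins for the two tables; every value here is invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE directory (dbn TEXT, school_name TEXT, borough TEXT);
CREATE TABLE demographics (dbn TEXT, schoolyear TEXT, sped_percent REAL);
INSERT INTO directory VALUES
  ('A1', 'School A', 'Manhattan'),
  ('B1', 'School B', 'Manhattan'),
  ('C1', 'School C', 'Manhattan'),
  ('D1', 'School D', 'Manhattan');
INSERT INTO demographics VALUES
  ('A1', 'y1', 28.8), ('A1', 'y2', 27.7), ('A1', 'y3', 26.7),
  ('B1', 'y1', 20.0), ('C1', 'y1', 15.0), ('D1', 'y1', 10.0);
""")

TOP_SPED = """
WITH per_school AS (
    -- One row per school: keep its highest reported percentage.
    SELECT d.borough, d.school_name, d.dbn, MAX(dem.sped_percent) AS sped_percent
    FROM directory AS d
    JOIN demographics AS dem ON d.dbn = dem.dbn
    WHERE dem.sped_percent IS NOT NULL
    GROUP BY d.borough, d.school_name, d.dbn
),
ranked AS (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY borough ORDER BY sped_percent DESC
    ) AS rn
    FROM per_school
)
SELECT borough, school_name, sped_percent
FROM ranked
WHERE rn <= 3
ORDER BY borough, rn;
"""
top_rows = demo_conn.execute(TOP_SPED).fetchall()
```

On the toy data, School A appears once even though it has three demographic rows. Against the real tables, the same query shape should work in PostgreSQL with the `nyc_schools.` schema prefix, and taking the most recent year instead of the maximum may be the better choice if a year column exists.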


## Insights


Both the English Language Learners question and the special-needs question rely on data from the *school_demographics* table.  
However, the joins return usable values for **Manhattan** only: the LEFT JOIN yields an empty average for every other borough, and the inner join returns no rows for them at all.  

This suggests that *school_demographics* is incomplete for this analysis, at least for the columns **ell_percent** and **sped_percent**. Because the inner join drops the other boroughs entirely, it seems likely that their `dbn` values have no matching rows in the table, not just empty percentages.  
Because of this, borough-level comparisons are not reliable, and additional data sources (such as *high_school_directory* or other datasets) would be needed to obtain accurate insights.  

The Manhattan top 3 also lists East Side Community School three times with different percentages, which suggests several rows per school, so the query ranks rows rather than schools.  

Furthermore, since the dataset shows clear signs of missing information, we also cannot fully trust that the values available for **Manhattan** are entirely correct or representative. They might also be affected by incomplete reporting or data entry errors.
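
The coverage gap described above could be checked directly by counting, per borough, how many directory schools have at least one matching demographics row. The sketch below demonstrates the query shape on an in-memory SQLite database with invented toy rows. The unprefixed table names and all values are assumptions for illustration only.

```python
import sqlite3

# Toy data: only one Manhattan school has demographics; 'Z9' matches no directory school.
toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
CREATE TABLE directory (dbn TEXT, borough TEXT);
CREATE TABLE demographics (dbn TEXT, ell_percent REAL);
INSERT INTO directory VALUES
  ('M1', 'Manhattan'), ('M2', 'Manhattan'), ('Q1', 'Queens'), ('X1', 'Bronx');
INSERT INTO demographics VALUES ('M1', 7.5), ('M1', 8.0), ('Z9', 3.0);
""")

COVERAGE = """
SELECT
    d.borough,
    COUNT(DISTINCT d.dbn)   AS directory_schools,
    COUNT(DISTINCT dem.dbn) AS schools_with_demographics  -- NULLs are not counted
FROM directory AS d
LEFT JOIN demographics AS dem ON d.dbn = dem.dbn
GROUP BY d.borough
ORDER BY d.borough;
"""
coverage = toy_conn.execute(COVERAGE).fetchall()
```

A borough whose second count is zero has no demographics rows at all, which would distinguish "no matching schools" from "matching schools with empty percentages". Run against the real tables (with the `nyc_schools.` prefix), this would show how far the Manhattan-only result extends.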