# Project 3
# NYC High School Graduation Rates by Borough, 2012–2019
- **Dataset(s) to be used:** 
  - NYC Open Data: Graduation results for Cohorts 2012 to 2019 (Classes of 2016 to 2023): https://data.cityofnewyork.us/Education/Graduation-results-for-Cohorts-2012-to-2019-Classe/mjm3-8dw8/about_data

- **Analysis question:**  
  - How do NYC high school four-year graduation rates vary across boroughs over time?

- **Columns that will (likely) be used:**  
  - `Report Category` (to filter to borough-level rows)  
  - `Geographic Subdivision` (borough: Bronx, Brooklyn, Manhattan, Queens, Staten Island)  
  - `Category` (to filter to “All Students”)  
  - `Cohort Year` (2012–2019)  
  - `Cohort` (to filter to “4 year August”)  
  - `% Grads` (four-year graduation rate, which I will convert to a numeric column `pct_grads`)  

- **(If you're using multiple datasets) Columns to be used to merge/join them:**  
  -  I only plan to use one graduation dataset in this project.

- **Hypothesis:**  
  - I hypothesize that NYC's four-year graduation rate increased between the 2012 and 2019 cohorts. I expect Manhattan and Queens to have higher graduation rates than the other boroughs, even if all boroughs show improvement.
  - Manhattan and Queens are expected to have higher graduation rates because they generally serve students from more advantaged socioeconomic backgrounds and have greater access to educational resources.


## Step 1
### Import and Understand Data 

In [1]:
import pandas as pd
import plotly.express as px

grad_df = pd.read_csv("Graduation_results_for_Cohorts.csv")
#check the data 
grad_df.head()



Unnamed: 0,Report Category,Geographic Subdivision,School Name,Category,Cohort Year,Cohort,# Total Cohort,# Grads,% Grads,# Total Regents,...,% Local of Cohort,% Local of Grads,# Still Enrolled,% Still Enrolled,# Dropout,% Dropout,# SACC (IEP Diploma),% SACC (IEP Diploma) of Cohort,# TASC (GED),% TASC (GED) of Cohort
0,Borough,Bronx,,White,2016,6 year June,734,643,87.6,632,...,1.5,1.7,21,2.9,58,7.9,9,1.2,3,0.4
1,Borough,Bronx,,White,2015,6 year June,660,565,85.6,531,...,5.2,6,24,3.6,53,8,13,2,5,0.8
2,Borough,Bronx,,White,2014,6 year June,683,565,82.7,522,...,6.3,7.6,27,4,76,11.1,12,1.8,3,0.4
3,District,16,,Multi-Racial,2014,6 year June,1,s,s,s,...,s,s,s,s,s,s,s,s,s,s
4,Borough,Bronx,,White,2013,6 year June,651,529,81.3,501,...,4.3,5.3,38,5.8,76,11.7,5,0.8,3,0.5


In [2]:
# Get a summary of the dataset: number of rows/columns and data types
grad_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 321002 entries, 0 to 321001
Data columns (total 29 columns):
 #   Column                                Non-Null Count   Dtype 
---  ------                                --------------   ----- 
 0   Report Category                       321002 non-null  object
 1   Geographic Subdivision                321002 non-null  object
 2   School Name                           294543 non-null  object
 3   Category                              321002 non-null  object
 4   Cohort Year                           321002 non-null  int64 
 5   Cohort                                320163 non-null  object
 6   # Total Cohort                        321002 non-null  object
 7   # Grads                               321002 non-null  object
 8   % Grads                               321002 non-null  object
 9   # Total Regents                       320163 non-null  object
 10  % Total Regents of Cohort             320163 non-null  object
 11  % Total Regen

In [3]:
# Explore key columns to understand how to filter later

grad_df["Report Category"].unique()


array(['Borough', 'District', 'Citywide', 'School', 'Transfer School',
       'Charter School'], dtype=object)

In [4]:
grad_df["Category"].unique()


array(['White', 'Multi-Racial', 'All Students', 'Asian',
       'Native American', 'ELL', 'Former ELL', 'Not ELL', 'Current ELL',
       'Ever ELL', 'Never ELL', 'Not SWD', 'SWD', 'Black', 'Hispanic',
       'Female', 'Male', 'Neither Female nor Male', 'Female Asian',
       'Female Black', 'Female Hispanic', 'Female Multi-Racial',
       'Female Native American', 'Female White', 'Male Asian',
       'Male Black', 'Male Hispanic', 'Male Multi-Racial',
       'Male Native American', 'Male White',
       'Neither Female nor Male Black',
       'Neither Female nor Male Hispanic',
       'Neither Female nor Male Multi-Racial',
       'Neither Female nor Male White', 'Econ Disadv', 'Not Econ Disadv'],
      dtype=object)

In [5]:
grad_df["Cohort"].unique()


array(['6 year June', '4 year June', '4 year August', '5 year August',
       '5 year June', nan], dtype=object)

## Step 2
### Cleaning Dataset
In this step, I filter the raw dataset down to the rows that match my research question. I keep only borough-level rows (`Report Category == "Borough"`), the “All Students” category, and the “4 year August” cohort, since that represents on-time four-year graduation. I also convert the `% Grads` column from a string to a numeric `pct_grads` column so I can compute averages and trends. This cleaned subset is the basis for all of the plots below.


In [6]:
# Filter to borough-level, all students, 4-year August cohort
boro_mask = (
    (grad_df["Report Category"] == "Borough") &
    (grad_df["Category"] == "All Students") &
    (grad_df["Cohort"] == "4 year August")
)

df_boro_all_4yr = grad_df[boro_mask].copy()

df_boro_all_4yr.head()

Unnamed: 0,Report Category,Geographic Subdivision,School Name,Category,Cohort Year,Cohort,# Total Cohort,# Grads,% Grads,# Total Regents,...,% Local of Cohort,% Local of Grads,# Still Enrolled,% Still Enrolled,# Dropout,% Dropout,# SACC (IEP Diploma),% SACC (IEP Diploma) of Cohort,# TASC (GED),% TASC (GED) of Cohort
1156,Borough,Bronx,,All Students,2019,4 year August,12256,9733,79.4,9678,...,0.4,0.6,1381,11.3,957,7.8,74,0.6,91,0.7
1157,Borough,Bronx,,All Students,2018,4 year August,12487,9921,79.5,9874,...,0.4,0.5,1400,11.2,974,7.8,95,0.8,87,0.7
1158,Borough,Bronx,,All Students,2017,4 year August,13152,10220,77.7,10136,...,0.6,0.8,1761,13.4,1001,7.6,54,0.4,98,0.7
1159,Borough,Bronx,,All Students,2016,4 year August,13421,9938,74.0,9733,...,1.5,2.1,1976,14.7,1271,9.5,60,0.4,175,1.3
1160,Borough,Bronx,,All Students,2015,4 year August,13891,9752,70.2,8446,...,9.4,13.4,2124,15.3,1759,12.7,80,0.6,175,1.3


In [7]:
# Keep only the columns that I need
cols_keep = [
    "Geographic Subdivision",  # borough name
    "Cohort Year",
    "# Total Cohort",
    "# Grads",
    "% Grads",
]

df_boro_all_4yr = df_boro_all_4yr[cols_keep].copy()
df_boro_all_4yr.head()


Unnamed: 0,Geographic Subdivision,Cohort Year,# Total Cohort,# Grads,% Grads
1156,Bronx,2019,12256,9733,79.4
1157,Bronx,2018,12487,9921,79.5
1158,Bronx,2017,13152,10220,77.7
1159,Bronx,2016,13421,9938,74.0
1160,Bronx,2015,13891,9752,70.2


In [8]:
# Remove commas and convert the cohort size and grads to integers
for col in ["# Total Cohort", "# Grads"]:
    df_boro_all_4yr[col] = (
        df_boro_all_4yr[col]
        .astype(str)
        .str.replace(",", "", regex=False)
        .astype("int64")
    )

# Convert % Grads to a numeric float column
df_boro_all_4yr["pct_grads"] = df_boro_all_4yr["% Grads"].astype(float)

df_boro_all_4yr.head()


Unnamed: 0,Geographic Subdivision,Cohort Year,# Total Cohort,# Grads,% Grads,pct_grads
1156,Bronx,2019,12256,9733,79.4,79.4
1157,Bronx,2018,12487,9921,79.5,79.5
1158,Bronx,2017,13152,10220,77.7,77.7
1159,Bronx,2016,13421,9938,74.0,74.0
1160,Bronx,2015,13891,9752,70.2,70.2
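
The direct `astype(float)` conversion works here because the filtered borough-level rows contain no suppressed values. Other rows in the raw file use the marker `s` (as in the district row shown in Step 1), which `astype(float)` would reject. Below is a minimal sketch of a more tolerant conversion. It assumes `s` is the only non-numeric marker and uses a small made-up Series rather than the real file:

```python
import pandas as pd

# Hypothetical sample mimicking raw "% Grads" strings, including a suppressed "s" value
raw_pct = pd.Series(["79.4", "s", "87.6"], name="% Grads")

# errors="coerce" turns anything non-numeric (such as "s") into NaN instead of raising
pct_grads = pd.to_numeric(raw_pct, errors="coerce")

print(pct_grads.tolist())
print("suppressed rows:", int(pct_grads.isna().sum()))
```

Counting the resulting `NaN` values is a quick way to see how many rows were suppressed before any averages are computed.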


In [9]:
# Check shape
df_boro_all_4yr.shape


(40, 6)

In [10]:
# Check the list of boroughs
df_boro_all_4yr["Geographic Subdivision"].unique()


array(['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
      dtype=object)

In [11]:
# Check the cohort years included
sorted(df_boro_all_4yr["Cohort Year"].unique())


[np.int64(2012),
 np.int64(2013),
 np.int64(2014),
 np.int64(2015),
 np.int64(2016),
 np.int64(2017),
 np.int64(2018),
 np.int64(2019)]

## Step 3
### Compute Citywide Weighted Graduation Trend
Before comparing individual boroughs, I first want to see how NYC’s four-year graduation rate has changed citywide over time. Because my filtered dataset `df_boro_all_4yr` has one row per borough–cohort combination, I can combine these borough rows to get a weighted citywide graduation rate for each `Cohort Year`.

To do this, I:

- Group the data by `Cohort Year`.
- Sum the `# Total Cohort` and `# Grads` across all five boroughs for each year.
- Compute a new column `pct_grads_citywide` as: (total_grads / total_cohort) * 100

This gives me one citywide graduation rate per cohort year, weighted by the actual number of students. I then create a line chart of `pct_grads_citywide` over time to see whether NYC’s overall four-year graduation rate is increasing, flat, or decreasing across the 2012–2019 cohorts.
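
To illustrate why the rate is weighted by cohort size instead of averaging the borough percentages, here is a small sketch with two hypothetical boroughs. The numbers are made up for illustration and do not come from the dataset:

```python
import pandas as pd

# Hypothetical two-borough example (made-up numbers, not from the dataset)
toy = pd.DataFrame({
    "borough": ["A", "B"],
    "total_cohort": [1000, 250],
    "total_grads": [700, 225],
})
toy["pct_grads"] = toy["total_grads"] / toy["total_cohort"] * 100

# Simple average treats both boroughs equally, regardless of size
unweighted = toy["pct_grads"].mean()

# Weighted rate pools all students first, then divides
weighted = toy["total_grads"].sum() / toy["total_cohort"].sum() * 100

print(f"unweighted: {unweighted:.1f}, weighted: {weighted:.1f}")
```

In this toy case the simple average is 80.0% while the pooled rate is 74.0%, because the larger borough has the lower rate. Summing counts before dividing avoids that distortion.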


In [12]:
# Group by cohort year to get citywide totals and a weighted graduation rate
citywide_by_cohort = (
    df_boro_all_4yr
    .groupby("Cohort Year")
    .agg(
        total_cohort=("# Total Cohort", "sum"),
        total_grads=("# Grads", "sum")
    )
    .reset_index()
)

# Compute citywide weighted graduation rate (%) for each cohort year
citywide_by_cohort["pct_grads_citywide"] = (
    citywide_by_cohort["total_grads"] / citywide_by_cohort["total_cohort"] * 100
)


citywide_by_cohort


Unnamed: 0,Cohort Year,total_cohort,total_grads,pct_grads_citywide
0,2012,74172,54161,73.020816
1,2013,73154,54324,74.259781
2,2014,74948,56923,75.949992
3,2015,73772,57035,77.31253
4,2016,73565,58704,79.798817
5,2017,72663,60055,82.648666
6,2018,70912,59374,83.729129
7,2019,69893,58503,83.703661


In [13]:
citywide_by_cohort["pct_grads_citywide"].describe()


count     8.000000
mean     78.802924
std       4.284518
min      73.020816
25%      75.527439
50%      78.555674
75%      82.912415
max      83.729129
Name: pct_grads_citywide, dtype: float64

In [14]:
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook_connected"

fig = px.line(
    citywide_by_cohort,
    x="Cohort Year",
    y="pct_grads_citywide",
    markers=True,
    title="NYC Citywide 4-Year Graduation Rate by Cohort Year",
    labels={
        "Cohort Year": "Cohort Year",
        "pct_grads_citywide": "Graduation Rate (%)"
    }
)

fig.show()


The line chart shows that the citywide four-year graduation rate has generally increased across the 2012–2019 cohorts. The overall trend suggests that NYC high school graduation performance has improved, which provides context for the borough-level comparisons in the next step. In the following section, I break down this citywide pattern by borough to see which areas are consistently above or below the citywide average.
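
One way to check the "generally increased" reading is to look at year-over-year changes. Below is a minimal sketch that assumes the citywide totals from the table in Step 3 are re-entered by hand. The `yoy_change` column name is illustrative and not part of the notebook:

```python
import pandas as pd

# Citywide totals copied from the table in Step 3
citywide = pd.DataFrame({
    "Cohort Year": [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
    "total_cohort": [74172, 73154, 74948, 73772, 73565, 72663, 70912, 69893],
    "total_grads": [54161, 54324, 56923, 57035, 58704, 60055, 59374, 58503],
})
citywide["pct_grads_citywide"] = citywide["total_grads"] / citywide["total_cohort"] * 100

# Year-over-year change in percentage points (the first row is NaN by construction)
citywide["yoy_change"] = citywide["pct_grads_citywide"].diff()

print(citywide[["Cohort Year", "pct_grads_citywide", "yoy_change"]].round(2))
```

With these totals, every cohort gains on the previous one except 2019, which is essentially flat relative to 2018 (83.73% to 83.70%).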


## Step 4
### Compare Borough Graduation Trends Over Time

Next, I break the citywide pattern down by borough to directly address my research question: which boroughs have higher or lower four-year graduation rates, and are the gaps changing over time?

Using my filtered dataset `df_boro_all_4yr`, each row already represents a single borough–cohort year combination with a clean graduation rate in the `pct_grads` column. In this step, I:

- Use the `Geographic Subdivision` column as the borough name (Bronx, Brooklyn, Manhattan, Queens, Staten Island).
- Plot `pct_grads` against `Cohort Year` for each borough.
- Use a separate line color for each borough so that trends are easy to compare on one figure.

This visualization shows both the overall direction (whether each borough is improving) and the relative differences across boroughs in the same cohort year. It provides the main empirical evidence to test my hypothesis that Manhattan and Queens will have higher graduation rates than the Bronx and Brooklyn, even as all boroughs improve over time.


Before plotting, I reshape the data into a pivoted table where each row is a cohort year and each column is a borough’s graduation rate. This makes it easier to compare boroughs side by side for a given cohort year.

In [15]:
# Pivot table: rows = cohort years, columns = boroughs
boro_pivot = df_boro_all_4yr.pivot(
    index="Cohort Year",
    columns="Geographic Subdivision",
    values="pct_grads"
)

boro_pivot


Cohort Year,Bronx,Brooklyn,Manhattan,Queens,Staten Island
2012,64.9,72.8,74.7,76.1,79.5
2013,66.3,74.4,74.9,77.8,80.3
2014,67.4,76.6,76.7,79.5,80.8
2015,70.2,77.0,78.3,80.7,82.7
2016,74.0,79.3,80.1,82.8,84.7
2017,77.7,81.7,83.2,85.7,86.1
2018,79.5,83.1,83.4,86.8,86.3
2019,79.4,82.9,83.5,86.8,86.6


The pivot table puts all five boroughs side by side for each cohort year: the Bronx has the lowest rate in every row, while Staten Island leads through the 2017 cohort and Queens edges ahead in 2018 and 2019.


Next, I plot the borough trends as a line chart.

In [16]:
# Make sure data are sorted by cohort year inside each borough
df_boro_all_4yr_sorted = df_boro_all_4yr.sort_values(
    ["Geographic Subdivision", "Cohort Year"]
)

fig = px.line(
    df_boro_all_4yr_sorted,
    x="Cohort Year",
    y="pct_grads",
    color="Geographic Subdivision",
    markers=True,
    title="NYC 4-Year Graduation Rates by Borough (Cohorts 2012–2019)",
    labels={
        "Cohort Year": "Cohort Year",
        "pct_grads": "Graduation Rate (%)",
        "Geographic Subdivision": "Borough"
    }
)

fig.show()


### Interpretation:

The figure shows that four-year graduation rates increased in all five boroughs between the 2012 and 2019 cohorts. The Bronx starts with the lowest graduation rate (64.9%) and makes the largest improvement, rising to just under 80% by the 2019 cohort. Brooklyn and Manhattan also show steady gains, reaching about 83% by the 2019 cohort.

Contrary to my original expectation that Manhattan would be a top performer, the highest graduation rates throughout the period are in Queens and Staten Island. Staten Island starts with the highest graduation rate in 2012 and remains second highest in 2019. Overall, the gaps between boroughs narrow over time but still exist. The Bronx still lags behind Queens and Staten Island in 2019, even after making substantial progress.
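
One way to quantify how the gaps change is the difference between the highest and lowest borough rate in each cohort year. Below is a minimal sketch that assumes the rates from the pivot table above are re-entered by hand. The name `boro_rates` is illustrative and not part of the notebook:

```python
import pandas as pd

# Borough rates copied from the pivot table above
boro_rates = pd.DataFrame(
    {
        "Bronx": [64.9, 66.3, 67.4, 70.2, 74.0, 77.7, 79.5, 79.4],
        "Brooklyn": [72.8, 74.4, 76.6, 77.0, 79.3, 81.7, 83.1, 82.9],
        "Manhattan": [74.7, 74.9, 76.7, 78.3, 80.1, 83.2, 83.4, 83.5],
        "Queens": [76.1, 77.8, 79.5, 80.7, 82.8, 85.7, 86.8, 86.8],
        "Staten Island": [79.5, 80.3, 80.8, 82.7, 84.7, 86.1, 86.3, 86.6],
    },
    index=pd.Index(range(2012, 2020), name="Cohort Year"),
)

# Gap between the highest and lowest borough in each cohort year (percentage points)
gap = (boro_rates.max(axis=1) - boro_rates.min(axis=1)).round(1)

print(gap)
```

By this measure, the top-to-bottom gap falls from 14.6 percentage points in the 2012 cohort to 7.4 in the 2019 cohort.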



## Step 5
### Summarize Borough Differences: Levels and Changes
To summarize the borough differences more clearly, I look at two simple comparisons:

1. Graduation rates in the most recent cohort (2019), by borough.
2. Change in graduation rates from the first cohort in my sample (2012) to the last cohort (2019), by borough.

The first bar chart answers “which boroughs are doing best right now?” The second bar chart answers “which boroughs improved the most over time?”


In [17]:

latest_year = df_boro_all_4yr["Cohort Year"].max()
latest_df = df_boro_all_4yr[df_boro_all_4yr["Cohort Year"] == latest_year].copy()

fig_latest = px.bar(
    latest_df,
    x="Geographic Subdivision",
    y="pct_grads",
    title=f"NYC 4-Year Graduation Rates by Borough, Cohort {latest_year}",
    labels={
        "Geographic Subdivision": "Borough",
        "pct_grads": "Graduation Rate (%)"
    }
)

fig_latest.show()

### Interpretation: latest cohort (2019)

The 2019 bar chart shows that all five boroughs now have four-year graduation rates around or above 80 percent. Queens and Staten Island have the highest rates (around 87%), followed by Manhattan and Brooklyn (about 83%). The Bronx is still the lowest at about 79%, but the gap between the Bronx and the best-performing borough is now about 7 percentage points, roughly half of what it was for the 2012 cohort (14.6 points).


To measure improvement over time, I reshape the data into separate DataFrames for the first cohort (2012) and the last cohort (2019), and then use `merge()` to join them by borough. This merged table lets me compute the change in graduation rates (in percentage points) for each borough.


In [18]:
first_year = df_boro_all_4yr["Cohort Year"].min()
last_year = latest_year

# Create a dataframe for the first year (2012)
first_df = (
    df_boro_all_4yr[df_boro_all_4yr["Cohort Year"] == first_year]
    [["Geographic Subdivision", "pct_grads"]]
    .rename(columns={"pct_grads": "pct_first"})
)

# Create a dataframe for the last year (2019)
last_df = (
    df_boro_all_4yr[df_boro_all_4yr["Cohort Year"] == last_year]
    [["Geographic Subdivision", "pct_grads"]]
    .rename(columns={"pct_grads": "pct_last"})
)
change_df = first_df.merge(last_df, on="Geographic Subdivision")
change_df["pct_point_change"] = change_df["pct_last"] - change_df["pct_first"]

change_df


Unnamed: 0,Geographic Subdivision,pct_first,pct_last,pct_point_change
0,Bronx,64.9,79.4,14.5
1,Brooklyn,72.8,82.9,10.1
2,Manhattan,74.7,83.5,8.8
3,Queens,76.1,86.8,10.7
4,Staten Island,79.5,86.6,7.1
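
The merge above relies on each borough appearing exactly once in both the first-year and last-year tables. Below is a minimal sketch of how that assumption could be checked explicitly with the `validate` argument of `merge()`. It re-enters the values from the table above, and the name `change_check` is illustrative:

```python
import pandas as pd

# 2012 and 2019 rates copied from the change table above
boroughs = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
first_df = pd.DataFrame({
    "Geographic Subdivision": boroughs,
    "pct_first": [64.9, 72.8, 74.7, 76.1, 79.5],
})
last_df = pd.DataFrame({
    "Geographic Subdivision": boroughs,
    "pct_last": [79.4, 82.9, 83.5, 86.8, 86.6],
})

# validate="one_to_one" raises pandas.errors.MergeError if a borough is duplicated on either side
change_check = first_df.merge(last_df, on="Geographic Subdivision", validate="one_to_one")
change_check["pct_point_change"] = (change_check["pct_last"] - change_check["pct_first"]).round(1)

print(change_check)
```

If the filter ever let through duplicate borough rows (for example, two cohort definitions), the merge would fail loudly instead of silently multiplying rows.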


In [19]:
fig_change = px.bar(
    change_df,
    x="Geographic Subdivision",
    y="pct_point_change",
    title=f"Change in 4-Year Graduation Rates by Borough ({first_year}→{last_year})",
    labels={
        "Geographic Subdivision": "Borough",
        "pct_point_change": "Change in Graduation Rate (percentage points)"
    }
)

fig_change.show()


### Interpretation: change from 2012 to 2019

The change-over-time bar chart shows that the Bronx made the largest gain in graduation rates, increasing by 14.5 percentage points between the 2012 and 2019 cohorts. Brooklyn and Queens also improve, by 10.1 and 10.7 percentage points respectively. Manhattan and Staten Island gain somewhat less (8.8 and 7.1 percentage points), which may partly reflect their relatively higher starting baselines. This pattern suggests that lower-performing boroughs, especially the Bronx, have been catching up, even though they have not fully closed the gap with Queens and Staten Island.
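
A rough way to check the catching-up pattern is to correlate each borough's starting rate with its gain. Below is a minimal sketch using the values from the change table above. With only five boroughs this is purely descriptive and not a statistical test:

```python
import pandas as pd

# Values copied from the change table above
change = pd.DataFrame({
    "borough": ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
    "pct_first": [64.9, 72.8, 74.7, 76.1, 79.5],
    "pct_point_change": [14.5, 10.1, 8.8, 10.7, 7.1],
})

# Pearson correlation between starting rate and gain; negative means lower baselines gained more
r = change["pct_first"].corr(change["pct_point_change"])

print(f"correlation between 2012 rate and 2012-2019 gain: {r:.2f}")
```

The correlation is strongly negative (roughly -0.9), which is consistent with convergence. Queens is a partial exception, since it gains more than Manhattan despite starting higher.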


To summarize, these results partially support my hypothesis. While graduation rates improved in every borough, the highest rates in 2019 were in Queens and Staten Island rather than Manhattan. The Bronx has the lowest graduation rate but shows the largest improvement over time. This suggests that borough graduation rates are converging, although inequality between boroughs remains.


## Step 6 
### Limitations and Assumptions
In this project, I focus on borough-level averages for "All Students". First, this high level of aggregation hides important differences within boroughs. Large boroughs like Brooklyn and Queens include both very high- and low-performing neighborhoods and schools, but in my graphs they appear as a single average line. I also treat the five boroughs as comparable units, even though they differ in the number of schools, student demographics, and program types. Second, I focus on a specific outcome definition: the “4 year August” cohort. This is a standard NYC Department of Education (DOE) measure of on-time graduation, but it excludes students who graduate in five or six years and does not distinguish between diploma types. It also cannot capture college readiness or longer-term outcomes. Lastly, I only use the “All Students” category, so I do not control for changes in student composition over time or across boroughs. If a borough’s graduation rate increases, it is difficult to tell whether schools are doing better for similar students or whether the mix of students has changed. In addition, the analysis is descriptive: I do not control for variables such as school resources or policy changes. For these reasons, the results should be interpreted as high-level patterns in graduation rates, not as proof that one borough’s schools are “better” or that specific policies caused the improvements.
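
As a sketch of how the "All Students" restriction could be relaxed, the borough filter from Step 2 can be parameterized by category. The helper name `borough_rates_for` and the rows below are hypothetical. They only mimic the dataset's column layout, using two category labels that appear in the `Category` column:

```python
import pandas as pd

def borough_rates_for(df: pd.DataFrame, category: str) -> pd.DataFrame:
    """Hypothetical helper: borough-level 4-year August rows for one student category."""
    mask = (
        (df["Report Category"] == "Borough")
        & (df["Category"] == category)
        & (df["Cohort"] == "4 year August")
    )
    out = df.loc[mask, ["Geographic Subdivision", "Cohort Year", "% Grads"]].copy()
    # Coerce suppressed or non-numeric values to NaN rather than failing
    out["pct_grads"] = pd.to_numeric(out["% Grads"], errors="coerce")
    return out

# Made-up rows that only mimic the dataset's column layout
toy = pd.DataFrame({
    "Report Category": ["Borough", "Borough", "District"],
    "Geographic Subdivision": ["Bronx", "Bronx", "16"],
    "Category": ["Econ Disadv", "Not Econ Disadv", "Econ Disadv"],
    "Cohort Year": [2019, 2019, 2019],
    "Cohort": ["4 year August", "4 year August", "4 year August"],
    "% Grads": ["75.0", "85.0", "s"],
})

print(borough_rates_for(toy, "Econ Disadv"))
```

Calling the helper once per category would give comparable borough trends for each subgroup, which could then be plotted the same way as in Step 4.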

## Step 7
### Conclusion
This project shows how NYC four-year high school graduation rates vary across boroughs over time. Using NYC Department of Education data for cohorts 2012–2019, I focused on borough-level outcomes for “All Students” and the “4 year August” cohort, and used pandas to aggregate and compare trends across the five boroughs.

The first main finding is that graduation rates improved citywide over this period. The weighted citywide graduation rate rises from about 73.0% for the 2012 cohort to about 83.7% for the 2019 cohort, leveling off in the last two cohorts. This suggests that a larger share of students are earning a diploma within four years. The second finding is that distinct borough differences exist. The Bronx starts with the lowest graduation rate. Although it makes the largest gain in percentage points, it still remains below the other boroughs. Queens and Staten Island consistently appear at the top, with the highest graduation rates in the most recent cohort. Manhattan and Brooklyn fall in the middle.

Overall, the gaps between boroughs have narrowed but still exist. This partially supports my original hypothesis: graduation rates improve in every borough, but the pattern at the top is different from what I expected. Queens and Staten Island, not Manhattan, have the best performance. The Bronx remains the lowest even after substantial improvement. A future step for this research would be to look within boroughs by student subgroup or by individual school. Another would be to connect graduation rates to college enrollment or persistence to better understand how these trends translate into longer-term educational opportunities.