# Project 3 Prompt

**Dataset(s) to be used:**  
- NYC Air Quality Surveillance Data (Air_Quality_20251129.csv)  
- NYCgov Poverty Measure Microdata (2018)  

**Analysis question:**  
Are lower-income boroughs in New York City exposed to higher levels of PM2.5 air pollution?

**Columns that will (likely) be used:**  
From Air Quality dataset:  
- `Geo Type Name`  
- `Geo Join ID`  
- `Geo Place Name`  
- `Name` (indicator name)  
- `Measure`  
- `Measure Info`  
- `Data Value`

From Poverty dataset:  
- `Boro`  
- `NYCgov_Pov_Stat` (poverty indicator)

**Columns to be used to merge/join them:**  
- `AirQuality`: `Geo Type Name` = "Borough", `Geo Join ID` = borough code  
- `PovertyData`: `Boro` (borough code)

**Hypothesis:**  
Lower-income boroughs experience higher PM2.5 levels than wealthier boroughs.

In [110]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
from IPython.display import IFrame

# Use pure HTML-based rendering (compatible with GitHub Pages)
pio.renderers.default = "iframe"

import warnings
warnings.filterwarnings("ignore")

# Load datasets with correct filenames
air = pd.read_csv("Air_Quality_20251129.csv")
poverty = pd.read_csv("NYCgov_Poverty_Measure_Data_(2018)_20251129.csv")

# Check columns to ensure names match
print("Air Quality Columns:\n", air.columns, "\n")
print("Poverty Data Columns:\n", poverty.columns)


## About the Data

This project uses two datasets from NYC Open Data:

### 1. NYC Air Quality Surveillance Data  
This dataset provides pollutant measurements (PM2.5, ozone, nitrogen dioxide) across NYC boroughs, community districts, and UHF neighborhoods. Each observation includes:

- Geographic unit  
- Pollutant type  
- Measurement value  
- Time period  

For this project, I focus on **PM2.5 (fine particulate matter)** because it is linked to respiratory and cardiovascular disease and is a central concern in environmental justice research.

### 2. NYCgov Poverty Measure Microdata (2018)
This dataset contains household- and individual-level socioeconomic information, including:

- Income  
- Demographics  
- Work status  
- Official NYC poverty identification  

The key advantage is that it uses **NYC’s adjusted poverty threshold**, which accounts for housing, taxes, and cost of living. For analysis, I aggregate the microdata to compute **borough-level poverty rates**.

Both datasets carry a numeric borough code (`Geo Join ID` in the air quality data, `Boro` in the poverty data), so they can be merged at the borough level.

## Clean Air Quality Data

In [94]:
# Keep only Borough-level rows
air_boro = air[air["Geo Type Name"] == "Borough"].copy()

# Filter for PM2.5 rows (Fine Particulate Matter)
pm25 = air_boro[air_boro["Name"].str.contains("Fine partic", case=False)]

# Rename Data Value to PM25
pm25 = pm25.rename(columns={"Data Value": "PM25"})

pm25.head()

| | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | PM25 | Message |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 874374 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 1 | Bronx | Summer 2023 | 06/01/2023 | 9.288328 | |
| 73 | 874362 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 5 | Staten Island | Summer 2023 | 06/01/2023 | 8.378828 | |
| 128 | 874371 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 2 | Brooklyn | Summer 2023 | 06/01/2023 | 8.776138 | |
| 257 | 874368 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 3 | Manhattan | Summer 2023 | 06/01/2023 | 9.341717 | |
| 408 | 874365 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 4 | Queens | Summer 2023 | 06/01/2023 | 8.791796 | |
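
Because borough names are attached later with a hand-written dictionary, it may help to read the code-to-name pairs straight from the data first. The sketch below does this on a small hypothetical frame (`pm25_demo`) that only mimics the `Geo Join ID` and `Geo Place Name` columns shown above. It is not the real file.

```python
import pandas as pd

# Hypothetical stand-in for the filtered PM2.5 rows; only the two
# geography columns from the output above are reproduced here.
pm25_demo = pd.DataFrame({
    "Geo Join ID": [1, 5, 2, 3, 4, 1],
    "Geo Place Name": ["Bronx", "Staten Island", "Brooklyn",
                       "Manhattan", "Queens", "Bronx"],
})

# One row per (code, name) pair, then a plain dict keyed by code
code_to_name = (
    pm25_demo[["Geo Join ID", "Geo Place Name"]]
    .drop_duplicates()
    .set_index("Geo Join ID")["Geo Place Name"]
    .to_dict()
)

# Sanity check: every code should map to exactly one name
names_per_code = pm25_demo.groupby("Geo Join ID")["Geo Place Name"].nunique()
assert names_per_code.eq(1).all()

print(code_to_name)
```

Building the lookup this way keeps the borough labels tied to whatever coding the file actually uses, rather than to a dictionary typed from memory.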


## Compute Borough-level PM2.5

In [95]:
# Borough code → name dictionary.
# Codes follow the Geo Join ID / Geo Place Name pairs in the PM2.5 rows above
# (1 = Bronx, 2 = Brooklyn, 3 = Manhattan, 4 = Queens, 5 = Staten Island).
boro_map = {
    "1": "Bronx",
    "2": "Brooklyn",
    "3": "Manhattan",
    "4": "Queens",
    "5": "Staten Island"
}

# Average PM2.5 across all time periods for each borough code
pm25_boro = (
    pm25.groupby("Geo Join ID")["PM25"]
    .mean()
    .reset_index()
    .rename(columns={"Geo Join ID": "Boro"})
)

# Map Boro code to actual borough name
pm25_boro["Boro_Name"] = pm25_boro["Boro"].astype(str).map(boro_map)

pm25_boro

| | Boro | PM25 | Boro_Name |
|---|---|---|---|
| 0 | 1 | 9.067841 | Bronx |
| 1 | 2 | 8.756341 | Brooklyn |
| 2 | 3 | 10.20174 | Manhattan |
| 3 | 4 | 8.427122 | Queens |
| 4 | 5 | 8.051056 | Staten Island |


## Compute Borough-level Poverty Rate

In [96]:
# Borough code → borough name
# (assumes Boro uses the same coding as Geo Join ID in the air quality data)
boro_map = {
    "1": "Bronx",
    "2": "Brooklyn",
    "3": "Manhattan",
    "4": "Queens",
    "5": "Staten Island"
}

# Compute the borough-level poverty measure.
# NYCgov_Pov_Stat is a status code rather than a 0/1 flag, so this mean falls
# between 1 and 2 instead of being a proportion.
poverty_rate = (
    poverty.groupby("Boro")["NYCgov_Pov_Stat"]
    .mean()
    .reset_index()
    .rename(columns={"NYCgov_Pov_Stat": "PovertyRate"})
)

# Map Boro code to actual borough names
poverty_rate["Boro_Name"] = poverty_rate["Boro"].astype(str).map(boro_map)

poverty_rate

| | Boro | PovertyRate | Boro_Name |
|---|---|---|---|
| 0 | 1 | 1.742097 | Bronx |
| 1 | 2 | 1.8174 | Brooklyn |
| 2 | 3 | 1.861666 | Manhattan |
| 3 | 4 | 1.841589 | Queens |
| 4 | 5 | 1.861692 | Staten Island |
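
The mean of a status code is not itself a share of people in poverty. The sketch below shows one way a 0-to-1 poverty share could be derived, using toy microdata rather than the real file. Two things are assumptions to check against the data dictionary: that `NYCgov_Pov_Stat` is coded 1 = in poverty and 2 = not in poverty, and the `weight` column, which is a hypothetical stand-in for whatever person weight the microdata provides.

```python
import pandas as pd

# Toy microdata. ASSUMPTIONS: status 1 = in poverty, 2 = not in poverty;
# "weight" is a hypothetical person-weight column, not a confirmed field name.
people = pd.DataFrame({
    "Boro": [1, 1, 1, 1, 2, 2, 2, 2],
    "NYCgov_Pov_Stat": [1, 1, 2, 2, 1, 2, 2, 2],
    "weight": [10, 10, 10, 10, 30, 10, 10, 10],
})

# 0/1 indicator: its mean is a proportion between 0 and 1
people["in_poverty"] = (people["NYCgov_Pov_Stat"] == 1).astype(int)

# Unweighted share of sampled people in poverty, per borough
share = people.groupby("Boro")["in_poverty"].mean()

# Weighted share: weighted count in poverty divided by total weight
people["w_in_poverty"] = people["in_poverty"] * people["weight"]
sums = people.groupby("Boro")[["w_in_poverty", "weight"]].sum()
weighted_share = sums["w_in_poverty"] / sums["weight"]

# Under the assumed 1/2 coding, the mean of the raw code equals 2 - share
code_mean = people.groupby("Boro")["NYCgov_Pov_Stat"].mean()

print(pd.DataFrame({"share": share,
                    "weighted_share": weighted_share,
                    "code_mean": code_mean}))
```

If that coding holds, a higher mean code corresponds to a lower poverty share, which matters when reading the sign of any correlation computed from the raw mean.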


## Merge Datasets

In [97]:
# Borough lookup (integer keys match the Boro codes in both tables)
boro_names = {1: "Bronx", 2: "Brooklyn", 3: "Manhattan", 4: "Queens", 5: "Staten Island"}

# Merge on the borough code, keeping only the columns needed downstream
merged = pm25_boro[["Boro", "PM25"]].merge(
    poverty_rate[["Boro", "PovertyRate"]],
    on="Boro",
    how="left"
)

# Add a single clean borough name column
merged["Borough"] = merged["Boro"].map(boro_names)

# Sort for neatness
merged = merged.sort_values("Boro").reset_index(drop=True)

merged

| | Boro | PM25 | PovertyRate | Borough |
|---|---|---|---|---|
| 0 | 1 | 9.067841 | 1.742097 | Bronx |
| 1 | 2 | 8.756341 | 1.8174 | Brooklyn |
| 2 | 3 | 10.20174 | 1.861666 | Manhattan |
| 3 | 4 | 8.427122 | 1.841589 | Queens |
| 4 | 5 | 8.051056 | 1.861692 | Staten Island |
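
A borough-level join like the one above can fail silently if a key is duplicated or missing on one side. As a minimal sketch, the merge can be made to check itself with the `validate` and `indicator` options of `DataFrame.merge`. The two small frames are hypothetical stand-ins that reuse rounded values from the tables above.

```python
import pandas as pd

# Stand-in frames with rounded values from the borough tables above
pm25_demo = pd.DataFrame({
    "Boro": [1, 2, 3, 4, 5],
    "PM25": [9.07, 8.76, 10.20, 8.43, 8.05],
})
pov_demo = pd.DataFrame({
    "Boro": [1, 2, 3, 4, 5],
    "PovertyRate": [1.742, 1.817, 1.862, 1.842, 1.862],
})

# validate raises MergeError if either side has duplicate Boro codes;
# indicator adds a "_merge" column recording where each row came from.
checked = pm25_demo.merge(
    pov_demo,
    on="Boro",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows that did not find a partner in the poverty table
unmatched = checked[checked["_merge"] != "both"]
assert unmatched.empty, "some boroughs have no poverty data"

print(checked.drop(columns="_merge"))
```

With only five rows the result can be checked by eye, but the same two options scale to neighborhood-level joins where a mismatch is easy to miss.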


## Method Explained

1.	Filtered the air quality dataset to keep only borough-level observations.
The raw dataset includes measurements at multiple geographic levels (citywide, neighborhood, borough).
Since I aggregate the poverty microdata to the borough level (step 4), I restricted the analysis to rows where `Geo Type Name` is "Borough" to ensure comparability.

2.	Selected only observations for fine particulate matter (indicator name "Fine particles (PM 2.5)").
The air quality dataset reports multiple pollutants (Ozone, Nitrogen Dioxide, PM2.5, etc.).
For this project, I focused specifically on PM2.5 because it is one of the most widely used indicators of air pollution and has clear health implications.

3.	Computed the average PM2.5 value for each borough.
After filtering, multiple PM2.5 readings still existed per borough across time periods.
I calculated the mean value for each borough to obtain a single representative indicator of air pollution intensity.

4.	Aggregated the poverty microdata into borough-level poverty rates.
The NYC poverty microdata contains individual/household-level observations.
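I grouped the data by borough code and calculated the mean of the poverty status code (`NYCgov_Pov_Stat`). Because this column is a status code rather than a 0/1 flag, the resulting `PovertyRate` values fall between 1 and 2 and act as a borough-level poverty index rather than a percentage.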
	
5.	Merged the PM2.5 dataset and the poverty dataset using borough codes.
The air quality data stores the borough code in `Geo Join ID` and the poverty data stores it in `Boro`; both run from 1 to 5.
I mapped these codes to borough names for clarity and combined the two datasets into a unified analytic table.

6.	Conducted correlation analysis and visualized the relationship.
	•	Calculated the Pearson correlation coefficient between PM2.5 and poverty rate.
	•	Created a scatter plot with an OLS regression line to examine linear trends.
	•	Generated a bar chart comparing PM2.5 levels across boroughs.

These visualizations help explore whether higher poverty rates are associated with higher levels of air pollution.

7.	Interpreted results and revisited the hypothesis.
Finally, I examined whether the data supports the initial hypothesis that higher-poverty boroughs experience worse air quality. The correlation turned out to be very weak, suggesting that, based on this dataset, air pollution differences across boroughs are not strongly linked to poverty levels.

## Scatter Plot

In [None]:
# Scatter: Poverty Rate vs PM2.5
fig_scatter = px.scatter(
    merged,
    x="PovertyRate",
    y="PM25",
    text="Borough",
    trendline="ols",
    title="Poverty Rate vs PM2.5 Exposure by Borough",
    labels={
        "PovertyRate": "Poverty Rate",
        "PM25": "PM2.5 (mcg/m3)"
    }
)

fig_scatter.update_traces(textposition="top center")

# Save as standalone HTML
fig_scatter.write_html("plot_scatter_pm25_poverty.html", include_plotlyjs="cdn")

# Display in notebook
IFrame(src="plot_scatter_pm25_poverty.html", width=900, height=600)

## Bar Chart

In [None]:
# Bar chart: Average PM2.5 by Borough
fig_bar_pm25 = px.bar(
    merged,
    x="Borough",
    y="PM25",
    title="Average PM2.5 Levels by Borough",
    labels={"PM25": "PM2.5 (mcg/m3)"},
    text_auto=True
)

fig_bar_pm25.write_html("plot_bar_pm25.html", include_plotlyjs="cdn")

IFrame(src="plot_bar_pm25.html", width=900, height=600)

## Borough-level PM2.5 Ranking Bar Chart

This bar chart shows between-borough differences in average PM2.5. Manhattan (10.2 mcg/m3) and the Bronx (9.1 mcg/m3) have the highest averages, while Staten Island (8.1 mcg/m3) has the lowest.

In [None]:
# Bar chart: Poverty rate by Borough
fig_bar_poverty = px.bar(
    merged,
    x="Borough",
    y="PovertyRate",
    title="Poverty Rate by Borough",
    labels={"PovertyRate": "Poverty Rate (mean NYCgov_Pov_Stat)"},
    text_auto=True
)

fig_bar_poverty.write_html("plot_bar_poverty.html", include_plotlyjs="cdn")

IFrame(src="plot_bar_poverty.html", width=900, height=600)

## Dual-axis Chart: Poverty vs PM2.5

The dual-axis chart helps visualize whether boroughs with higher air pollution also exhibit higher poverty levels. In this dataset, the two patterns do not strongly align.

In [None]:
import plotly.graph_objects as go
from IPython.display import IFrame
import plotly.io as pio

# Use HTML-compatible renderer
pio.renderers.default = "iframe"

fig = go.Figure()

# Bar for PM2.5
fig.add_trace(go.Bar(
    x=merged["Borough"],
    y=merged["PM25"],
    name="PM2.5 (mcg/m3)",
    marker_color="steelblue"
))

# Line for Poverty rate
fig.add_trace(go.Scatter(
    x=merged["Borough"],
    y=merged["PovertyRate"],
    name="Poverty Rate",
    yaxis="y2",
    mode="lines+markers",
    marker=dict(size=10, color="firebrick")
))

# Layout
fig.update_layout(
    title="PM2.5 and Poverty Rate by Borough",
    xaxis=dict(title="Borough"),
    yaxis=dict(title="PM2.5 (mcg/m3)", side="left"),
    yaxis2=dict(title="Poverty Rate (mean NYCgov_Pov_Stat)",
                overlaying="y",
                side="right"),
    legend=dict(x=0.5, y=1.1, orientation="h")
)

# Save as HTML for GitHub Pages
fig.write_html("plot_dual_axis_pm25_poverty.html", include_plotlyjs="cdn")

# Display inside notebook
IFrame(src="plot_dual_axis_pm25_poverty.html", width=900, height=600)

## Correlation

In [102]:
corr = merged["PovertyRate"].corr(merged["PM25"])
corr

np.float64(-0.02505734926494508)
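
With only five boroughs, it is worth asking how often a correlation this small would arise from chance pairings alone. The sketch below is one possible check, not part of the original analysis: an exact permutation test that re-pairs the five PM2.5 values with the five poverty values in every possible order. The numbers are copied from the merged table, and `pearson` is a small helper written for this example.

```python
from itertools import permutations
from math import sqrt

# Borough values copied from the merged table above
poverty = [1.742097, 1.8174, 1.861666, 1.841589, 1.861692]
pm25 = [9.067841, 8.756341, 10.20174, 8.427122, 8.051056]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


observed = pearson(poverty, pm25)

# Exact two-sided permutation test: re-pair PM2.5 values with boroughs in
# all 5! = 120 possible orders and count correlations at least as extreme.
perm_r = [pearson(poverty, list(p)) for p in permutations(pm25)]
extreme = sum(abs(r) >= abs(observed) - 1e-12 for r in perm_r)
p_value = extreme / len(perm_r)

print(f"r = {observed:.3f}, permutation p = {p_value:.3f}")
```

Because the observed coefficient is so close to zero, most re-pairings produce a correlation at least as large in magnitude, so the data give no grounds to distinguish the observed pattern from chance.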

### Correlation Analysis Visualization

To illustrate the statistical relationship between poverty and PM2.5 exposure, I generated a correlation heatmap. The off-diagonal cells show the correlation between the two variables, which is close to zero (about -0.025). The heatmap therefore does not show that higher-poverty boroughs tend to have higher PM2.5 concentrations.

In [105]:
import plotly.express as px

corr_matrix = merged[["PovertyRate", "PM25"]].corr()

# Diverging scale pinned to [-1, 1] so a near-zero correlation reads as neutral
fig = px.imshow(
    corr_matrix,
    text_auto=True,
    color_continuous_scale="RdBu_r",
    zmin=-1,
    zmax=1,
    title="Correlation Heatmap: Poverty Rate vs PM2.5"
)

fig.show()

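## Regression Table

In [112]:
import warnings

import statsmodels.api as sm

# Predictor with an intercept column, and the response
X = sm.add_constant(merged["PovertyRate"])
y = merged["PM25"]

model = sm.OLS(y, X).fit()

# With only 5 observations, summary() emits small-sample diagnostic warnings;
# suppress them for this call only.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    summary = model.summary()

summary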

| | | | |
|---|---|---|---|
| Dep. Variable: | PM25 | R-squared: | 0.001 |
| Model: | OLS | Adj. R-squared: | -0.332 |
| Method: | Least Squares | F-statistic: | 0.001885 |
| Date: | Sat, 29 Nov 2025 | Prob (F-statistic): | 0.968 |
| Time: | 16:55:37 | Log-Likelihood: | -5.5412 |
| No. Observations: | 5 | AIC: | 15.08 |
| Df Residuals: | 3 | BIC: | 14.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 9.6544 | 17.363 | 0.556 | 0.617 | -45.601 | 64.910 |
| PovertyRate | -0.4129 | 9.511 | -0.043 | 0.968 | -30.683 | 29.857 |

| | | | |
|---|---|---|---|
| Omnibus: | | Durbin-Watson: | 2.061 |
| Prob(Omnibus): | | Jarque-Bera (JB): | 0.611 |
| Skew: | 0.811 | Prob(JB): | 0.737 |
| Kurtosis: | 2.449 | Cond. No. | 97.4 |


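## Results & Interpretation

My findings show that the borough-level relationship between poverty and PM2.5 exposure in New York City is extremely weak (correlation ≈ -0.025). The regression agrees: the slope on the poverty measure is -0.41 with a p-value of 0.968 and an R-squared of 0.001, and with only five observations even the sign of these estimates is not reliable. While existing literature in environmental justice often shows that low-income communities face higher pollution burdens, this analysis does not reproduce that pattern at the borough scale.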

This result should not be interpreted as evidence against environmental inequality. Instead, it illustrates a well-known methodological issue in urban research: geographic aggregation bias. Each NYC borough contains highly diverse neighborhoods—ranging from affluent areas with clean air to disadvantaged areas with significantly higher pollution. When we average these conditions across an entire borough, the internal disparities are smoothed out, producing an artificially weak or misleading correlation.

Therefore, the near-zero correlation in our analysis does not contradict the broader environmental justice literature. Rather, it highlights that:

1.	Boroughs are too large to capture meaningful exposure variation.
Neighborhood-level (e.g., census tract or ZIP code) data is required to reveal inequities.

2.	Environmental burdens may cluster within specific pockets, especially along highways, industrial corridors, or areas with dense traffic—patterns that borough averages cannot show.

3.	Future work should use more granular spatial units and potentially incorporate additional variables (e.g., race, housing conditions, asthma hospitalization rates) to better identify environmental injustice hotspots.

In summary, the weak correlation found here underscores a key empirical challenge: the environmental inequality documented in the literature may well be present in NYC, but five borough-level averages are too coarse to detect it. More detailed spatial analysis is needed to uncover the true distribution of pollution burdens across NYC communities.
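
The aggregation argument can be made concrete with a small synthetic example. The numbers below are invented for illustration and are not NYC data. Each of five hypothetical boroughs contains four neighborhoods in which PM2.5 rises with poverty, while the borough baselines are set so that the borough averages show almost no relationship. `pearson` is a helper written for this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# SYNTHETIC borough baselines (poverty share, PM2.5), chosen so that the
# baselines themselves are nearly unrelated across boroughs.
base_poverty = [0.10, 0.12, 0.14, 0.16, 0.18]
base_pm25 = [8.0, 8.4, 7.8, 8.3, 8.0]

neigh_poverty, neigh_pm25 = [], []
boro_poverty, boro_pm25 = [], []
for bp, bpm in zip(base_poverty, base_pm25):
    # Within each borough, poorer neighborhoods have higher PM2.5
    pov = [bp + 0.05 * i for i in range(4)]
    pm = [bpm + 0.6 * i for i in range(4)]
    neigh_poverty += pov
    neigh_pm25 += pm
    # Borough-level values are simple averages of the neighborhoods
    boro_poverty.append(sum(pov) / len(pov))
    boro_pm25.append(sum(pm) / len(pm))

r_neigh = pearson(neigh_poverty, neigh_pm25)
r_boro = pearson(boro_poverty, boro_pm25)

print(f"neighborhood-level r = {r_neigh:.2f}")
print(f"borough-level r      = {r_boro:.2f}")
```

In this constructed case the neighborhood-level correlation is strongly positive while the borough-level correlation is close to zero, which is the pattern the interpretation above describes. Whether the real NYC data behave this way can only be settled with neighborhood-level measurements.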