# Project 3 Prompt

**Dataset(s) to be used:**  
- NYC Air Quality Surveillance Data (Air_Quality_20251129.csv)  
- NYCgov Poverty Measure Microdata (2018)  

**Analysis question:**  
Are lower-income boroughs in New York City exposed to higher levels of PM2.5 air pollution?

**Columns that will (likely) be used:**  
From Air Quality dataset:  
- `Geo Type Name`  
- `Geo Join ID`  
- `Geo Place Name`  
- `Name` (indicator name)  
- `Measure`  
- `Measure Info`  
- `Data Value`

From Poverty dataset:  
- `Boro`  
- `NYCgov_Pov_Stat` (poverty status indicator)

**Columns to be used to merge/join them:**  
- `AirQuality`: `Geo Type Name` = "Borough", `Geo Join ID` = borough code  
- `PovertyData`: `Boro` (borough code)

**Hypothesis:**  
Lower-income boroughs experience higher PM2.5 levels than wealthier boroughs.

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px

# Load datasets with correct filenames
air = pd.read_csv("Air_Quality_20251129.csv")
poverty = pd.read_csv("NYCgov_Poverty_Measure_Data_(2018)_20251129.csv")

# Check columns to ensure names match
print("Air Quality Columns:\n", air.columns, "\n")
print("Poverty Data Columns:\n", poverty.columns)

Air Quality Columns:
 Index(['Unique ID', 'Indicator ID', 'Name', 'Measure', 'Measure Info',
       'Geo Type Name', 'Geo Join ID', 'Geo Place Name', 'Time Period',
       'Start_Date', 'Data Value', 'Message'],
      dtype='object') 

Poverty Data Columns:
 Index(['SERIALNO', 'SPORDER', 'PWGTP', 'WGTP', 'AGEP', 'CIT', 'REL', 'SCH',
       'SCHG', 'SCHL', 'SEX', 'ESR', 'LANX', 'ENG', 'MSP', 'MAR', 'WKW',
       'WKHP', 'DIS', 'JWTR', 'NP', 'TEN', 'HHT', 'AgeCateg', 'Boro',
       'CitizenStatus', 'EducAttain', 'EST_Childcare', 'EST_Commuting',
       'EST_EITC', 'EST_FICAtax', 'EST_HEAP', 'EST_Housing', 'EST_IncomeTax',
       'EST_MOOP', 'EST_Nutrition', 'EST_PovGap', 'EST_PovGapIndex',
       'Ethnicity', 'FamType_PU', 'FTPTWork', 'INTP_adj', 'MRGP_adj',
       'NYCgov_Income', 'NYCgov_Pov_Stat', 'NYCgov_REL', 'NYCgov_Threshold',
       'Off_Pov_Stat', 'Off_Threshold', 'OI_adj', 'PA_adj', 'Povunit_ID',
       'Povunit_Rel', 'PreTaxIncome_PU', 'RETP_adj', 'RNTP_adj', 'SEMP_adj',
     



## About the Data

This project uses two datasets from NYC Open Data:

### 1. NYC Air Quality Surveillance Data  
This dataset provides pollutant measurements (PM2.5, ozone, nitrogen dioxide) across NYC boroughs, community districts, and UHF neighborhoods. Each observation includes:

- Geographic unit  
- Pollutant type  
- Measurement value  
- Time period  

For this project, I focus on **PM2.5 (fine particulate matter)** because it is linked to respiratory disease and cardiovascular problems, and because unequal exposure to it is a central concern in environmental justice research.

### 2. NYCgov Poverty Measure Microdata (2018)
This dataset contains household- and individual-level socioeconomic information, including:

- Income  
- Demographics  
- Work status  
- Poverty status under the NYCgov and official measures (`NYCgov_Pov_Stat`, `Off_Pov_Stat`)  

The key advantage is that it uses **NYC’s adjusted poverty threshold**, which accounts for housing, taxes, and cost of living. For analysis, I aggregate the microdata to compute **borough-level poverty rates**.

Both datasets include a common geographic identifier, the borough code (`Geo Join ID` in the air quality data, `Boro` in the poverty data), making them suitable for merging.

Clean Air Quality Data

In [22]:
# Keep only Borough-level rows
air_boro = air[air["Geo Type Name"] == "Borough"].copy()

# Filter for PM2.5 rows (Fine Particulate Matter)
pm25 = air_boro[air_boro["Name"].str.contains("Fine partic", case=False)]

# Rename Data Value to PM25
pm25 = pm25.rename(columns={"Data Value": "PM25"})

pm25.head()

|  | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | PM25 | Message |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 874374 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 1 | Bronx | Summer 2023 | 06/01/2023 | 9.288328 |  |
| 73 | 874362 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 5 | Staten Island | Summer 2023 | 06/01/2023 | 8.378828 |  |
| 128 | 874371 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 2 | Brooklyn | Summer 2023 | 06/01/2023 | 8.776138 |  |
| 257 | 874368 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 3 | Manhattan | Summer 2023 | 06/01/2023 | 9.341717 |  |
| 408 | 874365 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 4 | Queens | Summer 2023 | 06/01/2023 | 8.791796 |  |


Compute Borough-level PM2.5

In [29]:
pm25_boro = (
    pm25.groupby("Geo Join ID")["PM25"]
    .mean()
    .reset_index()
    .rename(columns={"Geo Join ID": "Boro"})
)
pm25_boro

|  | Boro | PM25 |
|---|---|---|
| 0 | 1 | 9.067841 |
| 1 | 2 | 8.756341 |
| 2 | 3 | 10.20174 |
| 3 | 4 | 8.427122 |
| 4 | 5 | 8.051056 |
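The mean above pools every `Time Period` present in the file, so seasonal rows (such as the "Summer 2023" rows shown earlier) are averaged together with any other periods. The sketch below shows one way to restrict the average to a single period type. It runs on made-up toy values, and the "Annual Average ..." and "Winter ..." labels are assumptions about how the real file names its periods; check `pm25["Time Period"].unique()` before relying on them.

```python
import pandas as pd

# Toy rows with hypothetical values. Only the "Summer 2023" label is taken
# from the notebook output; the other period labels are assumed.
toy_pm25 = pd.DataFrame({
    "Geo Join ID": [1, 1, 1, 2, 2, 2],
    "Time Period": ["Annual Average 2022", "Summer 2023", "Winter 2022-23"] * 2,
    "PM25": [8.0, 9.0, 10.0, 7.0, 8.0, 9.0],
})

# Mean over every period (what the cell above does).
pooled_mean = toy_pm25.groupby("Geo Join ID")["PM25"].mean()

# Mean over annual-average rows only.
annual_rows = toy_pm25[toy_pm25["Time Period"].str.startswith("Annual Average")]
annual_mean = annual_rows.groupby("Geo Join ID")["PM25"].mean()

print(pooled_mean)
print(annual_mean)
```

Which period type to keep is a design choice; the point is only that the comparison across boroughs should use the same kind of period for each.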


Compute Borough-level Poverty Rate

In [30]:
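# NYCgov_Pov_Stat is a status code, not a 0/1 flag, so this mean is the
# average code per borough rather than a proportion. Assuming the data
# dictionary's coding (1 = in poverty, 2 = not in poverty), a LOWER value
# here means MORE poverty. The mean is also unweighted (PWGTP is ignored).
poverty_rate = (
    poverty.groupby("Boro")["NYCgov_Pov_Stat"]
    .mean()
    .reset_index()
    .rename(columns={"NYCgov_Pov_Stat": "PovertyRate"})
)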

poverty_rate

|  | Boro | PovertyRate |
|---|---|---|
| 0 | 1 | 1.742097 |
| 1 | 2 | 1.8174 |
| 2 | 3 | 1.861666 |
| 3 | 4 | 1.841589 |
| 4 | 5 | 1.861692 |
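The values above fall between 1 and 2, which shows that `NYCgov_Pov_Stat` holds status codes rather than a 0/1 flag, so its mean is not a proportion. A minimal sketch of a weighted poverty share follows, on made-up toy rows. It assumes the coding 1 = in poverty and 2 = not in poverty (worth confirming against the data dictionary) and uses `PWGTP`, the person weight column listed in the poverty file, as the weight.

```python
import pandas as pd

# Toy microdata with hypothetical values, not taken from the real file.
# Assumed coding: NYCgov_Pov_Stat == 1 is "in poverty", 2 is "not in poverty".
toy_poverty = pd.DataFrame({
    "Boro": [1, 1, 1, 2, 2, 2],
    "NYCgov_Pov_Stat": [1, 2, 2, 1, 1, 2],
    "PWGTP": [100, 50, 50, 30, 30, 140],  # person weights
})

# Turn the status code into a 0/1 indicator, then weight it.
toy_poverty["in_poverty"] = (toy_poverty["NYCgov_Pov_Stat"] == 1).astype(int)
toy_poverty["weighted_poor"] = toy_poverty["in_poverty"] * toy_poverty["PWGTP"]

# Weighted share = weighted count in poverty / total weight, per borough.
sums = toy_poverty.groupby("Boro")[["weighted_poor", "PWGTP"]].sum()
poverty_share = (
    (sums["weighted_poor"] / sums["PWGTP"])
    .rename("PovertyShare")
    .reset_index()
)
print(poverty_share)
```

With this definition a higher value means more poverty, which matches how the `PovertyRate` axis is read in the charts.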


Merge Datasets

In [31]:
merged = pm25_boro.merge(poverty_rate, on="Boro", how="left")

# Add borough names for readability.
# Codes follow the air quality output above (Geo Join ID 1 = Bronx, ...).
boro_names = {
    1: "Bronx",
    2: "Brooklyn",
    3: "Manhattan",
    4: "Queens",
    5: "Staten Island"
}

merged["Borough"] = merged["Boro"].map(boro_names)

merged

|  | Boro | PM25 | PovertyRate | Borough |
|---|---|---|---|---|
| 0 | 1 | 9.067841 | 1.742097 | Bronx |
| 1 | 2 | 8.756341 | 1.8174 | Brooklyn |
| 2 | 3 | 10.20174 | 1.861666 | Manhattan |
| 3 | 4 | 8.427122 | 1.841589 | Queens |
| 4 | 5 | 8.051056 | 1.861692 | Staten Island |
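Borough labels can also be read from the air quality table's own `Geo Place Name` column instead of being typed by hand, which keeps each name tied to its `Geo Join ID`. The sketch below rebuilds the lookup from the five borough rows shown in the `pm25.head()` output; `air_boro_sample` and `boro_names_from_data` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Code/name pairs copied from the pm25.head() output earlier in the notebook.
air_boro_sample = pd.DataFrame({
    "Geo Join ID": [1, 5, 2, 3, 4],
    "Geo Place Name": ["Bronx", "Staten Island", "Brooklyn", "Manhattan", "Queens"],
})

# One row per code, indexed by code, sorted for readability.
name_lookup = (
    air_boro_sample.drop_duplicates("Geo Join ID")
    .set_index("Geo Join ID")["Geo Place Name"]
    .sort_index()
)

# Guard against a code that maps to more than one name.
assert name_lookup.index.is_unique

boro_names_from_data = name_lookup.to_dict()
print(boro_names_from_data)
```

In the notebook itself, the same lookup could be built from `air_boro` and passed to `merged["Boro"].map(...)`. This assumes the poverty file's `Boro` column uses the same codes, which is worth confirming in its data dictionary.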


Method explained
1. Filtered air quality data to Borough-level observations.  
2. Selected only “Fine particles (PM 2.5)” measurements.  
3. Computed the average PM2.5 value for each borough.  
4. Aggregated microdata to calculate borough-level poverty rates.  
5. Merged both datasets on borough code.  
6. Conducted correlation analysis and visualized the relationship using:
   - Scatter plot with OLS regression line
   - Bar chart comparing PM2.5 across boroughs  
7. Interpreted results and revisited the hypothesis.

Scatter Plot

In [32]:
fig = px.scatter(
    merged,
    x="PovertyRate",
    y="PM25",
    text="Borough",
    trendline="ols",
    title="Poverty Rate vs PM2.5 Exposure by Borough",
    labels={"PovertyRate": "Poverty Rate", "PM25": "PM2.5 (mcg/m3)"}
)

fig.update_traces(textposition="top center")
fig.show()

Bar Chart

In [33]:
fig2 = px.bar(
    merged,
    x="Borough",
    y="PM25",
    color="PovertyRate",
    title="Average PM2.5 Concentration by Borough",
    labels={"PM25": "PM2.5 (mcg/m3)"},
    color_continuous_scale="Reds"
)

fig2.show()

Correlation

In [34]:
corr = merged["PovertyRate"].corr(merged["PM25"])
corr

np.float64(-0.02505734926494508)
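With only five boroughs, it helps to ask how unusual this coefficient is compared with what chance alone would produce. The sketch below runs an exact permutation test on the five pairs copied from the merged table: it recomputes Pearson's r for all 120 orderings of the PM2.5 values and reports the share with an absolute r at least as large as the observed one. This is an illustrative check added here, not part of the original analysis.

```python
from itertools import permutations

import numpy as np

# Borough-level values copied from the merged table above (Boro 1..5).
poverty_stat_mean = np.array([1.742097, 1.8174, 1.861666, 1.841589, 1.861692])
pm25_mean = np.array([9.067841, 8.756341, 10.20174, 8.427122, 8.051056])


def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


observed = pearson_r(poverty_stat_mean, pm25_mean)

# Exact test: every reordering of PM2.5 against the fixed poverty values.
perm_rs = [
    pearson_r(poverty_stat_mean, pm25_mean[list(order)])
    for order in permutations(range(5))
]

# Two-sided p-value: share of orderings at least as extreme as observed.
p_value = float(np.mean([abs(r) >= abs(observed) - 1e-12 for r in perm_rs]))

print(f"observed r = {observed:.3f}, permutation p = {p_value:.3f}")
```

A p-value close to 1 means that almost any random pairing of boroughs produces a stronger correlation than the observed one, so the data give no evidence of a linear relationship at this level of aggregation.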

### Correlation Analysis Visualization

To display the statistical relationship between the poverty measure and PM2.5 exposure, I generated a correlation heatmap. The diagonal cells equal 1 by definition; the off-diagonal cells show r ≈ -0.025, so the heatmap indicates essentially no linear association between the two variables at the borough level.

In [37]:
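import plotly.express as px

corr_matrix = merged[["PovertyRate", "PM25"]].corr()

# Correlations range from -1 to 1, so fix the color range and use a
# diverging scale (red = positive, blue = negative, white = zero).
fig = px.imshow(
    corr_matrix,
    text_auto=".3f",
    zmin=-1,
    zmax=1,
    color_continuous_scale="RdBu_r",
    title="Correlation Heatmap: Poverty Rate vs PM2.5"
)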

fig.show()

Results & Interpretation
Prior research suggests that low-income communities often bear higher environmental burdens, but this borough-level analysis shows almost no linear correlation between the poverty measure and PM2.5 (r ≈ -0.025).
With only five boroughs, a coefficient this small cannot be distinguished from zero. The result likely also reflects geographic aggregation: each borough contains substantial internal variation, so borough averages can mask neighborhood-level disparities.

The near-zero correlation does not imply that environmental inequality does not exist.
Detecting such patterns requires finer geographic units, such as NTAs or census tracts.
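The aggregation effect can be illustrated with synthetic numbers. In the sketch below, which uses made-up neighborhoods in three hypothetical boroughs A, B, and C, poverty and PM2.5 rise together inside every borough, yet the correlation of the borough averages is approximately zero. None of these values come from the NYC data.

```python
import pandas as pd

# Synthetic neighborhoods (made-up numbers, purely illustrative).
toy_neighborhoods = pd.DataFrame({
    "borough": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "poverty": [0.10, 0.15, 0.20, 0.25,
                0.20, 0.25, 0.30, 0.35,
                0.30, 0.35, 0.40, 0.45],
    "pm25": [7.0, 8.0, 9.0, 10.0,
             7.1, 8.1, 9.1, 10.1,
             7.0, 8.0, 9.0, 10.0],
})

# Correlation inside each borough (neighborhood level).
within = {
    name: group["poverty"].corr(group["pm25"])
    for name, group in toy_neighborhoods.groupby("borough")
}

# Correlation of borough averages (what a borough-level analysis sees).
boro_means = toy_neighborhoods.groupby("borough")[["poverty", "pm25"]].mean()
between = boro_means["poverty"].corr(boro_means["pm25"])

print("within-borough r:", within)
print("borough-level r:", between)
```

The within-borough coefficients are essentially 1 while the borough-level coefficient is essentially 0, which is why a null result on five borough averages says little about neighborhood-level exposure.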

## Results

### Scatter Plot  
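The scatter plot shows no clear trend across the five boroughs. The OLS trendline is nearly flat, with a very slight negative slope.

### Correlation  
The correlation coefficient (printed above) is about -0.025, which is effectively zero. With five data points it does not indicate a meaningful relationship between the poverty measure and PM2.5. Because `PovertyRate` is the mean of a status code (assumed 1 = in poverty, 2 = not in poverty), converting it to a share in poverty would flip the sign to about +0.025, which is still negligible.

### Borough Comparison  
Borough names follow the `Geo Join ID` codes in the air quality output (1 = Bronx, 2 = Brooklyn, 3 = Manhattan, 4 = Queens, 5 = Staten Island), assuming `Boro` uses the same codes.

- **Manhattan** has the highest average PM2.5 (10.20 mcg/m3) and **Staten Island** the lowest (8.05 mcg/m3).  
- **Bronx** (9.07), **Brooklyn** (8.76), and **Queens** (8.43) fall in between.  
- **Bronx** has the lowest mean poverty status code (1.742), which under the assumed coding means the highest share in poverty, but it does not have the highest PM2.5.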

---

## Hypothesis Revisited

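**The hypothesis is not supported at the borough level.**  
The correlation between the poverty measure and PM2.5 is near zero (r ≈ -0.025), so this analysis provides no evidence that lower-income boroughs experience higher PM2.5 levels. Environmental inequality literature points to disparities at finer geographic scales, which this analysis cannot detect.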

---

## Policy Implications

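Because the borough-level analysis finds no clear relationship, these results alone do not identify which boroughs to prioritize. Prior research on environmental burdens in low-income communities still motivates interventions that could be evaluated with finer-grained data.  
For example: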

- Increasing pollution monitoring in the Bronx  
- Expanding clean air programs  
- Prioritizing low-income areas for green infrastructure  
- Reducing traffic and industrial emissions in vulnerable communities  

Air pollution in NYC may be a social equity issue as well as an environmental one, but confirming that requires neighborhood-level analysis.