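This notebook filters weather station data and derives heat exposure variables from five stations in the NYC area. Using several stations rather than one makes the citywide series more robust and stable, since no single station's gaps or local quirks dominate.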

The CSV file was manually downloaded from NOAA's Global Surface Summary of the Day (GSOD) search page, covering 2018 through 2025 and restricted to June through August (defined here as the summer months).

https://www.ncei.noaa.gov/access/search/data-search/global-summary-of-the-day?startDate=2018-06-01T00:00:00&endDate=2025-08-31T23:59:59&pageNum=1&dataTypes=MAX&dataTypes=MIN&dataTypes=PRCP&dataTypes=TEMP&bbox=40.930,-74.294,40.487,-73.821&pageSize=100

The five stations used are:

- Central Park
- JFK Airport
- LaGuardia Airport
- The Battery
- Port Authority

Two other weather stations, near Staten Island and Kings Point, were excluded. Both are strongly marine-influenced, which could introduce a large "cooling" bias into the overall temperatures, and they are not representative of NYC's dense urban heat canopy.

The code cells below parse the CSV file and compute new columns for the regression model. Extreme heat days are flagged using the 95th percentile of the summer citywide daily maximum as the threshold, following other studies.

- **MAX_CP:** Max temperature for Central Park.
- **MAX_JFK:** Max temperature for JFK Airport.
- **MAX_LGA:** Max temperature for LaGuardia Airport.
- **MAX_BAT:** Max temperature for The Battery.
- **MAX_PORT:** Max temperature for Port Authority.

- **TEMP_CP:** Avg temperature for Central Park.
- **TEMP_JFK:** Avg temperature for JFK Airport.
- **TEMP_LGA:** Avg temperature for LaGuardia Airport.
- **TEMP_BAT:** Avg temperature for The Battery.
- **TEMP_PORT:** Avg temperature for Port Authority.

- **MIN_CP:** Min temperature for Central Park.
- **MIN_JFK:** Min temperature for JFK Airport.
- **MIN_LGA:** Min temperature for LaGuardia Airport.
- **MIN_BAT:** Min temperature for The Battery.
- **MIN_PORT:** Min temperature for Port Authority.

- **PRCP_CP:** Precipitation for Central Park.
- **PRCP_JFK:** Precipitation for JFK Airport.
- **PRCP_LGA:** Precipitation for LaGuardia Airport.
- **PRCP_BAT:** Precipitation for The Battery.
- **PRCP_PORT:** Precipitation for Port Authority.

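- **TMAX_CITY:** Citywide daily max temperature (mean of the station maxima).
- **TMIN_CITY:** Citywide daily min temperature (mean of the station minima).
- **TMEAN_CITY:** Citywide daily avg temperature (mean of the station averages).
- **PRCP_CITY:** Citywide daily precipitation (mean across stations).
- **EXTREME_HEAT:** Binary flag: 1 if TMAX_CITY is at or above the summer 95th-percentile threshold, otherwise 0.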
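
The citywide columns and the heat flag can be written compactly as follows. This is only a restatement of what the code cells below compute. The symbols $S_d$ (the set of stations reporting a value on day $d$) and $q_{0.95}$ (the 95th percentile of TMAX_CITY over the summer days) are notation introduced here for illustration, not names from the dataset.

```latex
\begin{align}
\mathrm{TMAX\_CITY}_d &= \frac{1}{|S_d|} \sum_{s \in S_d} \mathrm{MAX}_{s,d} \\
\mathrm{EXTREME\_HEAT}_d &= \mathbf{1}\!\left[\,\mathrm{TMAX\_CITY}_d \ge q_{0.95}\,\right]
\end{align}
```

TMIN_CITY, TMEAN_CITY, and PRCP_CITY follow the same averaging pattern with MIN, TEMP, and PRCP in place of MAX.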

In [1]:
import time
from datetime import datetime
from pathlib import Path

import geopandas as gpd
import numpy as np
import pandas as pd
import requests
from shapely.geometry import Point

In [10]:
raw = pd.read_csv("data/nyc_weather_stations.csv")

In [11]:
# Ensure uppercase column names first, so the "DATE" lookup below cannot fail on casing.
raw.columns = [c.upper() for c in raw.columns]

# Convert DATE to datetime.
raw["DATE"] = pd.to_datetime(raw["DATE"])
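
One thing worth checking before averaging across stations is whether the file contains GSOD's missing-value sentinels (9999.9 for the temperature fields and 99.99 for precipitation, per the GSOD documentation). The sketch below is a minimal illustration under the assumption that those sentinels appear in this CSV. The helper name `mask_gsod_missing` and the toy rows are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Sentinels GSOD documents for missing observations (assumed to be present in the CSV).
GSOD_MISSING = {"MAX": 9999.9, "MIN": 9999.9, "TEMP": 9999.9, "PRCP": 99.99}


def mask_gsod_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with GSOD missing-value sentinels replaced by NaN."""
    out = df.copy()
    for col, sentinel in GSOD_MISSING.items():
        if col in out.columns:
            # np.isclose avoids relying on exact float equality.
            out[col] = out[col].mask(np.isclose(out[col], sentinel))
    return out


# Toy rows for illustration only, not real observations.
demo = pd.DataFrame({
    "MAX": [88.0, 9999.9],
    "MIN": [70.0, 71.1],
    "TEMP": [79.2, 9999.9],
    "PRCP": [0.0, 99.99],
})
cleaned = mask_gsod_missing(demo)
print(cleaned)
```

If the sentinels are present, something like `raw = mask_gsod_missing(raw)` would need to run before any averaging, so that placeholder values are skipped rather than averaged in.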

In [None]:
station_keywords = {
    "CP": "NY CITY CENTRAL PARK, NY US",
    "JFK": "JFK INTERNATIONAL AIRPORT, NY US",
    "LGA": "LAGUARDIA AIRPORT, NY US",
    "BAT": "THE BATTERY, NY US",
    "PORT": "PORT AUTH DOWNTN MANHATTAN WALL ST HEL, NY US"
}

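def detect_station(name):
    """Return the short station key whose full name appears in `name`, else None."""
    if not isinstance(name, str):
        # Missing names (NaN) cannot match any station.
        return None
    for key, word in station_keywords.items():
        if word in name.upper():
            return key
    return None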

raw["STATION_ID"] = raw["NAME"].apply(detect_station)

filtered = raw[raw["STATION_ID"].notna()].copy()
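
Because the match relies on exact station name strings, a quick sanity check on the result can catch a spelling mismatch before it silently drops a station. The sketch below is one possible check. `summarize_station_matches` is a hypothetical helper, and the toy frame (including the unmatched name) is made up for illustration.

```python
import pandas as pd


def summarize_station_matches(df: pd.DataFrame) -> tuple[pd.Series, list[str]]:
    """Return (row counts per matched STATION_ID, sorted list of unmatched NAME values)."""
    matched = df["STATION_ID"].value_counts()  # NaN keys are dropped by default
    unmatched = sorted(df.loc[df["STATION_ID"].isna(), "NAME"].dropna().unique())
    return matched, unmatched


# Toy frame for illustration; the third name is a made-up unmatched station.
demo = pd.DataFrame({
    "NAME": ["THE BATTERY, NY US", "THE BATTERY, NY US", "SOME OTHER STATION, NY US"],
    "STATION_ID": ["BAT", "BAT", None],
})
matched, unmatched = summarize_station_matches(demo)
print(matched)
print(unmatched)
```

On the real data, calling `summarize_station_matches(raw)` should show all five keys with similar row counts, and the unmatched list should contain only the stations that were meant to be excluded.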

In [None]:
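# One row per date, one column per (variable, station) pair.
# pivot_table averages duplicate rows by default (aggfunc="mean").
wide = filtered.pivot_table(
    index = "DATE",
    columns = "STATION_ID",
    values = ["MAX", "MIN", "TEMP", "PRCP"]
)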

In [None]:
# Flatten the (variable, station) MultiIndex into names like MAX_CP.
wide.columns = [f"{var}_{station}" for var, station in wide.columns]

wide = wide.sort_index()

In [15]:
stations = ["CP", "JFK", "LGA", "BAT", "PORT"]

# Citywide daily max: mean of the station maxima.
wide["TMAX_CITY"] = wide[[f"MAX_{s}" for s in stations]].mean(axis=1)

# Citywide daily min: mean of the station minima.
wide["TMIN_CITY"] = wide[[f"MIN_{s}" for s in stations]].mean(axis=1)

# Citywide daily mean: mean of the station averages.
wide["TMEAN_CITY"] = wide[[f"TEMP_{s}" for s in stations]].mean(axis=1)

# Citywide daily precipitation: mean across stations.
wide["PRCP_CITY"] = wide[[f"PRCP_{s}" for s in stations]].mean(axis=1)
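
Note that `mean(axis=1)` skips missing values, so on days when some stations did not report, the citywide figure is the average of fewer than five stations. The sketch below shows one way to record how many stations contributed each day. The column name `N_MAX_STATIONS` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

stations = ["CP", "JFK", "LGA", "BAT", "PORT"]

# Toy maxima for two days; on the second day only Central Park reports.
demo = pd.DataFrame(
    {
        "MAX_CP": [90.0, 85.0],
        "MAX_JFK": [88.0, np.nan],
        "MAX_LGA": [91.0, np.nan],
        "MAX_BAT": [87.0, np.nan],
        "MAX_PORT": [89.0, np.nan],
    },
    index=pd.to_datetime(["2020-07-01", "2020-07-02"]),
)

max_cols = [f"MAX_{s}" for s in stations]
demo["TMAX_CITY"] = demo[max_cols].mean(axis=1)        # NaN values are skipped
demo["N_MAX_STATIONS"] = demo[max_cols].count(axis=1)  # non-missing stations per day
print(demo[["TMAX_CITY", "N_MAX_STATIONS"]])
```

A count column like this makes it possible to drop or flag days where the citywide value rests on only one or two stations.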

In [None]:
# Restrict to the summer months (June through August).
summer = wide.loc[(wide.index.month >= 6) & (wide.index.month <= 8)]

# Set the threshold at the 95th percentile of summer TMAX_CITY, which is general practice.
threshold_95 = summer["TMAX_CITY"].quantile(0.95)

# Display the threshold.
threshold_95

np.float64(93.46124999999999)

In [None]:
# Flag extreme heat days: 1 if the citywide max is at or above the threshold, else 0.
wide["EXTREME_HEAT"] = (wide["TMAX_CITY"] >= threshold_95).astype(int)
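
A quick way to check the flag is to count flagged days per year. The sketch below assumes a DataFrame with a datetime index and an `EXTREME_HEAT` column like the one built above. The helper `extreme_days_per_year` and the toy data are hypothetical.

```python
import pandas as pd


def extreme_days_per_year(df: pd.DataFrame, flag_col: str = "EXTREME_HEAT") -> pd.Series:
    """Sum the 0/1 flag within each calendar year of the datetime index."""
    return df.groupby(df.index.year)[flag_col].sum()


# Toy flags for illustration only.
demo = pd.DataFrame(
    {"EXTREME_HEAT": [1, 0, 1, 1]},
    index=pd.to_datetime(["2018-07-01", "2018-07-02", "2019-07-20", "2019-07-21"]),
)
per_year = extreme_days_per_year(demo)
print(per_year)
```

With a 95th-percentile threshold, roughly 5% of the summer days overall should be flagged, though the count can vary considerably from year to year.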

In [None]:
# Write the wide table, including the derived columns, to CSV.
wide.to_csv("data/nyc_heat_exposure_2018_2025.csv")