# Phase 1: Data Collection

**Objective**: Download and organize data for NYC energy analysis (2021-2023)

## Data Sources:
1. **NYC Open Data Portal** - Electricity consumption by borough
2. **EIA** - Power generation data (New York State)
3. **NOAA** - Weather data (NYC)
4. **U.S. Census Bureau** - Borough population data
5. **EIA** - Average electricity rates (or assume $0.20/kWh)

## Data Limitations:
- Generation data covers **New York State**, not NYC specifically
- Surplus is therefore an approximation: NY State generation minus NYC consumption
- Because statewide generation also serves demand outside the city, this difference overstates the surplus available to NYC; the analysis acknowledges this constraint
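
As a minimal sketch of this approximation, the arithmetic reduces to a single subtraction once both series share a unit. The function name `approximate_surplus_mwh`, the choice of MWh as the common unit, and the figures in the example call are assumptions made for illustration; they are not values from the datasets.

```python
KWH_PER_MWH = 1_000


def approximate_surplus_mwh(state_generation_mwh: float, nyc_consumption_kwh: float) -> float:
    """Return NY State generation minus NYC consumption, expressed in MWh."""
    # Convert consumption from kWh to MWh so both terms share a unit.
    return state_generation_mwh - nyc_consumption_kwh / KWH_PER_MWH


# Placeholder figures for illustration only (not real data).
example = approximate_surplus_mwh(state_generation_mwh=10_000.0, nyc_consumption_kwh=4_000_000.0)
print(f"Approximate surplus: {example:,.0f} MWh")
```

Whatever units the downloaded files actually use should be checked before applying a conversion like this one.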

## Expected Outputs:
- `data/raw/nyc_consumption.csv`
- `data/raw/eia_generation.csv`
- `data/raw/noaa_weather.csv`
- `data/raw/census_population.csv`
- `data/raw/electricity_rates.csv` (or hardcoded value)

In [3]:
import pandas as pd
import numpy as np
import os 
from pathlib import Path 
import warnings
warnings.filterwarnings("ignore")

# Create directories at the project root (this notebook runs from notebooks/)
Path('../data/raw').mkdir(parents=True, exist_ok=True)
Path('../data/processed').mkdir(parents=True, exist_ok=True)

print("Setup complete.")
print(f"Current working directory: {os.getcwd()}")

Setup complete.
Current working directory: /Users/haneuljang/Desktop/nyc-surplus-energy-analysis/notebooks


In [6]:
print("="*60)
print("DATA 1: NYC ELECTRICITY CONSUMPTION")
print("="*60)

print("""
MANUAL DOWNLOAD INSTRUCTIONS:

1. Visit: https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2010-Feb-2025-/jr24-e7cr/about_data

2. Click 'Export' → 'CSV'

3. Filter for years 2021-2023 if possible

4. Save as: data/raw/nyc_consumption.csv

Key columns needed:
- Borough
- Revenue Month
- Consumption (KWH)
- KW Charges
""")

DATA 1: NYC ELECTRICITY CONSUMPTION

MANUAL DOWNLOAD INSTRUCTIONS:

1. Visit: https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2010-Feb-2025-/jr24-e7cr/about_data

2. Click 'Export' → 'CSV'

3. Filter for years 2021-2023 if possible

4. Save as: data/raw/nyc_consumption.csv

Key columns needed:
- Borough
- Revenue Month
- Consumption (KWH)
- KW Charges
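
If the portal export cannot be filtered by year, the 2021-2023 filter from step 3 could be applied in pandas after download. The sketch below assumes the CSV keeps the column names listed above and that `Revenue Month` parses as a date; the helper name `filter_consumption_years` is hypothetical.

```python
import pandas as pd


def filter_consumption_years(df: pd.DataFrame, start_year: int = 2021, end_year: int = 2023) -> pd.DataFrame:
    """Keep rows whose 'Revenue Month' falls within [start_year, end_year]."""
    out = df.copy()
    # Values that cannot be parsed become NaT and fail the range check below.
    out["Revenue Month"] = pd.to_datetime(out["Revenue Month"], errors="coerce")
    in_range = out["Revenue Month"].dt.year.between(start_year, end_year)
    return out.loc[in_range].reset_index(drop=True)


# Possible usage (assumes the file from step 4 exists; usecols limits memory use):
# cols = ["Borough", "Revenue Month", "Consumption (KWH)", "KW Charges"]
# df_raw = pd.read_csv("../data/raw/nyc_consumption.csv", usecols=cols)
# df_2021_2023 = filter_consumption_years(df_raw)
```

Reading only the needed columns is a design choice for a large export; drop `usecols` if other fields turn out to be useful.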



In [5]:
print("="*60)
print("DATA 2: NEW YORK STATE POWER GENERATION")
print("="*60)

print("""
MANUAL DOWNLOAD INSTRUCTIONS:

1. Visit: https://www.eia.gov/electricity/data/browser/

2. Select the following:
   - Geography: New York
   - Sector: All Sectors
   - Energy Source:
     * Conventional Hydroelectric
     * Wind
     * Solar Photovoltaic
     * All Fuels (utility-scale)
   - Time Period: Jan 2021 - Dec 2023
   - Frequency: Monthly

3. Click 'Download' → CSV format

4. Save as: data/raw/eia_generation.csv

Note: This is NY STATE data, not NYC specifically
""")

DATA 2: NEW YORK STATE POWER GENERATION

MANUAL DOWNLOAD INSTRUCTIONS:

1. Visit: https://www.eia.gov/electricity/data/browser/

2. Select the following:
   - Geography: New York
   - Sector: All Sectors
   - Energy Source:
     * Conventional Hydroelectric
     * Wind
     * Solar Photovoltaic
     * All Fuels (utility-scale)
   - Time Period: Jan 2021 - Dec 2023
   - Frequency: Monthly

3. Click 'Download' → CSV format

4. Save as: data/raw/eia_generation.csv

Note: This is NY STATE data, not NYC specifically



In [7]:
print("="*60)
print("DATA 3: NYC WEATHER DATA")
print("="*60)

print("""
MANUAL DOWNLOAD INSTRUCTIONS:

1. Visit: https://www.ncdc.noaa.gov/cdo-web/search

2. Search for: "New York Central Park"

3. Select:
   - Date Range: 2021-01-01 to 2023-12-31
   - Dataset: Daily Summaries
   - Data Types:
     * TAVG (Average Temperature)
     * PRCP (Precipitation)
     * AWND (Average Wind Speed)

4. Add to Cart → Submit Order

5. You'll receive download link via email

6. Save as: data/raw/noaa_weather.csv
""")

DATA 3: NYC WEATHER DATA

MANUAL DOWNLOAD INSTRUCTIONS:

1. Visit: https://www.ncdc.noaa.gov/cdo-web/search

2. Search for: "New York Central Park"

3. Select:
   - Date Range: 2021-01-01 to 2023-12-31
   - Dataset: Daily Summaries
   - Data Types:
     * TAVG (Average Temperature)
     * PRCP (Precipitation)
     * AWND (Average Wind Speed)

4. Add to Cart → Submit Order

5. You'll receive download link via email

6. Save as: data/raw/noaa_weather.csv
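
Because the consumption and generation data are monthly, the daily NOAA rows will likely need to be rolled up to months before they can be joined. A minimal sketch, assuming the export has a `DATE` column plus one column per requested data type (`TAVG`, `PRCP`, `AWND`); the helper name `monthly_weather` and the choice of mean versus sum per variable are assumptions:

```python
import pandas as pd


def monthly_weather(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily weather rows to one row per calendar month."""
    df = daily.copy()
    df["DATE"] = pd.to_datetime(df["DATE"])
    # Label each day with its calendar month, e.g. 2021-01.
    df["month"] = df["DATE"].dt.to_period("M")
    monthly = df.groupby("month").agg(
        TAVG=("TAVG", "mean"),  # mean of daily average temperature
        PRCP=("PRCP", "sum"),   # total precipitation in the month
        AWND=("AWND", "mean"),  # mean of daily average wind speed
    )
    return monthly.reset_index()


# Possible usage (assumes the file from step 6 exists):
# df_weather = monthly_weather(pd.read_csv("../data/raw/noaa_weather.csv"))
```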



In [8]:
print("="*60)
print("DATA 4: NYC BOROUGH POPULATION")
print("="*60)

print("""
MANUAL DOWNLOAD INSTRUCTIONS:

Option A - Quick Reference (2020 Census):
Use these 2020 Census counts:

Borough         | Population (2020)
----------------|------------------
Bronx           | 1,472,654
Brooklyn        | 2,736,074
Manhattan       | 1,694,251
Queens          | 2,405,464
Staten Island   | 495,747

You can create the CSV manually, or:

Option B - Official Download:
1. Visit: https://data.census.gov/
2. Search: "New York City population by borough"
3. Download 2020-2023 estimates
4. Save as: data/raw/census_population.csv

Format should be:
Borough,Population
Bronx,1472654
Brooklyn,2736074
...
""")

DATA 4: NYC BOROUGH POPULATION

MANUAL DOWNLOAD INSTRUCTIONS:

Option A - Quick Reference (2020 Census):
Use these 2020 Census counts:

Borough         | Population (2020)
----------------|------------------
Bronx           | 1,472,654
Brooklyn        | 2,736,074
Manhattan       | 1,694,251
Queens          | 2,405,464
Staten Island   | 495,747

You can create the CSV manually, or:

Option B - Official Download:
1. Visit: https://data.census.gov/
2. Search: "New York City population by borough"
3. Download 2020-2023 estimates
4. Save as: data/raw/census_population.csv

Format should be:
Borough,Population
Bronx,1472654
Brooklyn,2736074
...



In [12]:
# 2020 Census counts per borough (Option A above)
population_data = {
    'Borough': ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
    'Population': [1472654, 2736074, 1694251, 2405464, 495747]
}
df_population = pd.DataFrame(population_data)
df_population.to_csv('../data/raw/census_population.csv', index=False)

In [15]:
print("="*60)
print("DATA 5: ELECTRICITY RATES")
print("="*60)

print("""
APPROACH:

Option A - Use Average Rate (Recommended for simplicity):
We'll assume: $0.20 per kWh (NYC average 2021-2023)

Option B - Download Official Rates:
1. Visit: https://www.eia.gov/electricity/monthly/
2. Look for "Average Retail Price of Electricity"
3. Filter: New York, Residential sector
4. Save as: data/raw/electricity_rates.csv

For this project, we'll use Option A unless you prefer precision.
""")

# Create a simple rates file from the assumed flat rate (Option A)
rates_data = {
    'Year': [2021, 2022, 2023],
    'Rate_per_kWh': [0.20, 0.20, 0.20]  # Assumed flat rate in $/kWh, not a measured average
}
df_rates = pd.DataFrame(rates_data)
df_rates.to_csv('../data/raw/electricity_rates.csv', index=False)
print("Created electricity_rates.csv with assumed rate of $0.20/kWh")

DATA 5: ELECTRICITY RATES

APPROACH:

Option A - Use Average Rate (Recommended for simplicity):
We'll assume: $0.20 per kWh (NYC average 2021-2023)

Option B - Download Official Rates:
1. Visit: https://www.eia.gov/electricity/monthly/
2. Look for "Average Retail Price of Electricity"
3. Filter: New York, Residential sector
4. Save as: data/raw/electricity_rates.csv

For this project, we'll use Option A unless you prefer precision.

Created electricity_rates.csv with assumed rate of $0.20/kWh
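
To illustrate how the assumed rate would be applied, the sketch below converts a consumption figure in kWh to an estimated dollar cost. The lookup table mirrors the `electricity_rates.csv` created above; the function name `estimated_cost_usd` and the example consumption figure are hypothetical.

```python
# Flat rate assumed in this notebook (Option A), keyed by year, in $/kWh.
ASSUMED_RATES = {2021: 0.20, 2022: 0.20, 2023: 0.20}


def estimated_cost_usd(consumption_kwh: float, year: int) -> float:
    """Estimate cost as consumption (kWh) times the assumed rate for that year."""
    return consumption_kwh * ASSUMED_RATES[year]


# Placeholder consumption figure, for illustration only.
print(f"${estimated_cost_usd(1_500.0, 2022):,.2f}")
```

If Option B rates are downloaded instead, the dictionary could be replaced by values read from that file.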


In [26]:
print("="*60)
print("FILE VERIFICATION")
print("="*60)

files_to_check = {
    '1. NYC Consumption': '../data/raw/nyc_consumption.csv',
    '2. EIA Generation': '../data/raw/eia_generation.csv', 
    '3. NOAA Weather': '../data/raw/noaa_weather.csv',
    '4. Census Population': '../data/raw/census_population.csv',
    '5. Electricity Rates': '../data/raw/electricity_rates.csv'
}


missing_files = []
for name, path in files_to_check.items():
    if os.path.exists(path):
        size = os.path.getsize(path) / 1024  # KB
        print(f"✓ {name}: Found ({size:.2f} KB)")
    else:
        print(f"✗ {name}: NOT FOUND - Please download")
        missing_files.append(name)

print("\n" + "="*60)
if not missing_files:
    print("✓ ALL FILES PRESENT!")
    print("Ready to proceed to: 02_data_preprocessing.ipynb")
else:
    print(f"✗ Missing {len(missing_files)} file(s):")
    for f in missing_files:
        print(f"  - {f}")
    print("Please download the missing files before proceeding.")

FILE VERIFICATION
✓ 1. NYC Consumption: Found (110910.33 KB)
✓ 2. EIA Generation: Found (2.38 KB)
✓ 3. NOAA Weather: Found (133.03 KB)
✓ 4. Census Population: Found (0.10 KB)
✓ 5. Electricity Rates: Found (0.04 KB)

✓ ALL FILES PRESENT!
Ready to proceed to: 02_data_preprocessing.ipynb
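
The existence check above does not catch a download that was saved as an empty file. A small sketch of a stricter check, using only the standard library; the helper name `find_missing_or_empty` and its `base_dir` parameter are hypothetical:

```python
from pathlib import Path


def find_missing_or_empty(files: dict, base_dir: str = ".") -> list:
    """Return the labels whose file is absent or has zero bytes."""
    problems = []
    for label, rel_path in files.items():
        path = Path(base_dir) / rel_path
        # A file counts as a problem if it does not exist or contains no data.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(label)
    return problems


# Possible usage with the dictionary from the verification cell:
# print(find_missing_or_empty(files_to_check))
```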
