# Coastal Labor-Resilience Engine
## Phase 1: Data Exploration Notebook

This notebook demonstrates how to use the data ingestion pipeline to:
1. Fetch NOAA water level data for Santa Barbara
2. Identify extreme coastal events (potential "shocks")
3. Load and explore EPA EJScreen demographic data (outlined under Next Steps)
4. Align environmental events with workforce changes

In [None]:
# Setup and imports
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Project imports
from src.data.noaa_baseline import NOAAEnvironmentalClient
from src.data.demographic_overlay import DemographicOverlay
from src.data.cleaning import DataAligner, DataCleaner

print("Imports successful!")

## 1. NOAA Water Level Data

Fetch water level data from the Santa Barbara NOAA station (9411340).
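
For orientation, here is a minimal sketch of the kind of request a client like `NOAAEnvironmentalClient` presumably issues. The client's internals are not shown in this notebook, so this is an assumption: it targets the public NOAA CO-OPS Data API directly. The helper name `build_high_low_query`, the `MLLW` datum choice, and the `application` label are all hypothetical. The sketch only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Public NOAA CO-OPS Data API endpoint
COOPS_URL = "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter"


def build_high_low_query(station_id, begin_date, end_date):
    """Return query parameters for a high/low tide request (hypothetical helper)."""
    return {
        "product": "high_low",      # verified high/low tide observations
        "station": station_id,
        "begin_date": begin_date,   # YYYYMMDD
        "end_date": end_date,       # YYYYMMDD
        "datum": "MLLW",            # assumed reference datum
        "units": "metric",          # water levels in meters
        "time_zone": "gmt",
        "format": "json",
        "application": "coastal_labor_resilience",  # hypothetical caller label
    }


params = build_high_low_query("9411340", "20240101", "20240131")
request_url = f"{COOPS_URL}?{urlencode(params)}"
print(request_url)
```

Passing `request_url` to an HTTP client would return JSON that can be loaded into a DataFrame. The project client may use different parameters or add retries and caching.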

In [None]:
# Initialize NOAA client
noaa = NOAAEnvironmentalClient(station_id="9411340")

# Fetch station datums (reference levels)
datums = noaa.get_station_datums()
print("Station Datums (meters):")
for name, value in sorted(datums.items()):
    print(f"  {name}: {value:.3f}")

In [None]:
# Fetch recent high/low tide data (last 30 days)
end_date = datetime.now()
begin_date = end_date - timedelta(days=30)

tides = noaa.fetch_high_low_data(
    begin_date=begin_date.strftime("%Y%m%d"),
    end_date=end_date.strftime("%Y%m%d")
)

print(f"Retrieved {len(tides)} tide observations")
tides.head(10)

In [None]:
# Basic statistics
print("Water Level Statistics (meters):")
print(f"  Min:  {tides['water_level_m'].min():.3f}")
print(f"  Max:  {tides['water_level_m'].max():.3f}")
print(f"  Mean: {tides['water_level_m'].mean():.3f}")
print(f"  Std:  {tides['water_level_m'].std():.3f}")

## 2. Identify Extreme Events

Find historical extreme water level events that could represent coastal "shocks".
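
As a rough illustration of the idea, the sketch below ranks high tides by water level and reports each one's height above a reference level. How `identify_highest_observed_tides` actually defines `anomaly_m` is not shown here, so the definition used (water level minus a fixed reference), the helper name `top_extreme_tides`, the tide-type codes, and the toy values are all assumptions.

```python
import pandas as pd


def top_extreme_tides(tides, reference_level_m, top_n=3):
    """Return the top_n highest high tides with an anomaly column (hypothetical helper)."""
    # Keep only high tides; "H" and "HH" are assumed codes for high and higher-high water
    highs = tides[tides["tide_type"].isin(["H", "HH"])].copy()
    # Assumed anomaly definition: height above a fixed reference level
    highs["anomaly_m"] = highs["water_level_m"] - reference_level_m
    return highs.nlargest(top_n, "water_level_m").reset_index(drop=True)


# Synthetic observations for illustration only (not real station data)
toy_tides = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-10 08:00", "2024-01-10 14:00", "2024-01-11 08:30",
        "2024-01-12 09:00", "2024-01-12 15:30", "2024-01-13 09:40",
    ]),
    "water_level_m": [1.62, 0.21, 2.05, 1.71, -0.10, 2.18],
    "tide_type": ["H", "L", "HH", "H", "LL", "HH"],
})

# 1.60 m is an illustrative reference level, not an actual station datum
extremes = top_extreme_tides(toy_tides, reference_level_m=1.60, top_n=3)
print(extremes[["timestamp", "water_level_m", "anomaly_m", "tide_type"]])
```

In practice the reference level would come from the station datums fetched earlier rather than a hard-coded number.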

In [None]:
# Identify highest observed tides (last 5 years)
# Note: This may take a moment as it fetches historical data
extreme_events = noaa.identify_highest_observed_tides(
    years_back=5,
    top_n=20
)

print(f"Found {len(extreme_events)} extreme events")
extreme_events[['timestamp', 'water_level_m', 'anomaly_m', 'tide_type']].head(10)

## 3. Data Alignment Demo

Demonstrate how to align environmental events with workforce changes.
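
Conceptually, alignment means putting both datasets on the same clock and the same time resolution. The sketch below shows one way to do that in plain pandas. It is not the `DataAligner` implementation: the helper `monthly_counts`, the assumption that event timestamps arrive in UTC, and the toy data are all hypothetical.

```python
import pandas as pd


def monthly_counts(df, time_col, label, source_tz=None):
    """Count rows per calendar month in Pacific local time (hypothetical helper)."""
    ts = pd.to_datetime(df[time_col])
    if source_tz is not None:
        # Interpret naive timestamps in source_tz, convert to Pacific, then drop the tz
        ts = (
            ts.dt.tz_localize(source_tz)
              .dt.tz_convert("America/Los_Angeles")
              .dt.tz_localize(None)
        )
    return ts.dt.to_period("M").value_counts().sort_index().rename(label)


# Synthetic events, assumed to be recorded in UTC
toy_events = pd.DataFrame({
    "timestamp": ["2024-01-11 16:30", "2024-02-01 03:00", "2024-02-09 17:12"],
})
# Job start dates, assumed to already be local calendar dates
toy_workforce = pd.DataFrame({
    "job_start_date": ["2024-01-15", "2024-02-20", "2024-03-10"],
})

timeline_sketch = (
    pd.concat(
        [
            monthly_counts(toy_events, "timestamp", "events", source_tz="UTC"),
            monthly_counts(toy_workforce, "job_start_date", "job_starts"),
        ],
        axis=1,
    )
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(timeline_sketch)
```

The second event falls on 1 February in UTC but on 31 January in Pacific time, so it is counted in January. That shift is why timestamps are standardized before aggregation.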

In [None]:
# Create sample workforce data for demonstration
sample_workforce = pd.DataFrame({
    'first_name': ['Alice', 'Bob', 'Carlos', 'Diana', 'Elena'],
    'job_title': ['Server', 'Fisher', 'Manager', 'Nurse', 'Driver'],
    'industry': ['Hospitality', 'Fishing', 'Hospitality', 'Healthcare', 'Transportation'],
    'zip_code': ['93101', '93103', '93101', '93105', '93109'],
    'job_start_date': pd.to_datetime([
        '2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05', '2024-05-01'
    ])
})

print("Sample Workforce Data:")
sample_workforce

In [None]:
# Initialize data aligner
aligner = DataAligner()

# Standardize timestamps to Pacific Time
if not extreme_events.empty:
    events_standardized = aligner.standardize_noaa_data(extreme_events)
    print("Timestamps standardized to Pacific Time")
    print(f"Sample: {events_standardized['timestamp'].iloc[0]}")

In [None]:
# Create a unified timeline
if not extreme_events.empty:
    datasets = {
        'events': (extreme_events, 'timestamp'),
        'workforce': (sample_workforce, 'job_start_date')
    }
    
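    timeline = aligner.create_unified_timeline(datasets, resolution='M')
    print("Monthly Timeline:")
    # A bare expression inside an `if` block is not rendered by Jupyter, so display it explicitly
    display(timeline)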

## 4. Next Steps

To complete the Phase 1 data pipeline:

1. **Get API credentials**: Set up Live Data Technologies API access
2. **Download EJScreen data**: Download the EJScreen CSV from the EPA website
3. **Run full pipeline**: Use `load_all_data()` to fetch all datasets
4. **Explore correlations**: Analyze relationships between events and job changes
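
As a starting point for step 4, one simple descriptive measure is the number of job starts that fall within a fixed window after each event. The sketch below is an assumption about how that exploration might begin, not part of the project code: the helper `job_starts_after_events`, the 30-day window, and the event dates are hypothetical, while the job start dates reuse the sample workforce values from above.

```python
import pandas as pd


def job_starts_after_events(events, job_starts, window_days=30):
    """Count job starts in the window (event, event + window_days] (hypothetical helper)."""
    window = pd.Timedelta(days=window_days)
    rows = []
    for event_time in events:
        in_window = (job_starts > event_time) & (job_starts <= event_time + window)
        rows.append({
            "event_time": event_time,
            "job_starts_in_window": int(in_window.sum()),
        })
    return pd.DataFrame(rows)


# Synthetic event dates for illustration only
toy_event_times = pd.Series(pd.to_datetime(["2024-01-11", "2024-02-15"]))
toy_job_starts = pd.Series(pd.to_datetime([
    "2024-01-15", "2024-02-20", "2024-03-10", "2024-04-05", "2024-05-01",
]))

window_counts = job_starts_after_events(toy_event_times, toy_job_starts, window_days=30)
print(window_counts)
```

Counts like these are descriptive only. Comparing them against windows with no event would be a natural next check before reading anything into them.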

In [None]:
# Example: Full data pipeline (requires API credentials and data files)
# 
# from src.data import load_all_data
# 
# data = load_all_data(
#     begin_date='20190101',
#     end_date='20240101',
#     include_workforce=True
# )
# 
# print(f"Water levels: {len(data['water_levels'])} records")
# print(f"High/low tides: {len(data['high_low_tides'])} records")
# print(f"EJScreen tracts: {len(data['ejscreen'])} records")
# print(f"Workforce profiles: {len(data['workforce'])} records")