# Forecasting Unsafe Water Conditions at Wahoo Bay
## Using Real Sensor Data from SenseStream.org

**Author:** Dan Zimmerman (dzimmerman2021@fau.edu)  
**Course:** CAP4773/CAP5768 Introduction to Data Analytics  
**Instructor:** Dr. Fernando Koch  
**Institution:** Florida Atlantic University  
**Date:** Fall 2025

---

## Research Question

**Can unsafe water conditions at Wahoo Bay be predicted in advance using the previous 24‚Äì48 hours of sensor readings and environmental data?**

### Data Source

**Real sensor data** from SenseStream.org:
- Wahoo Bay Water Quality Sensors (Dec 2024 - Nov 2025)
- Wahoo Bay Weather Station
- Pompano Beach Weather Station

---

## Setup

### Install Dependencies

In [None]:
# Install required packages
!pip install pandas numpy matplotlib seaborn scikit-learn -q

### Import Libraries

In [None]:
# Core libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-darkgrid')

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report)
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("‚úì Libraries imported successfully")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

---

## Part 1: Load Real Data

Load pre-processed real sensor data from Wahoo Bay and Pompano Beach.

In [None]:
# For Google Colab: Upload the data files or mount Google Drive
# from google.colab import files
# uploaded = files.upload()  # Upload real_labeled_dataset.csv

# Load the pre-processed labeled dataset
# Update this path based on your environment
DATA_PATH = "../data/processed/real_labeled_dataset.csv"  # Local path
# DATA_PATH = "/content/real_labeled_dataset.csv"  # Colab path after upload

labeled_data = pd.read_csv(DATA_PATH)
labeled_data['time'] = pd.to_datetime(labeled_data['time'])

print("‚úì Real data loaded successfully!")
print(f"\nDataset Summary:")
print(f"  Records: {len(labeled_data):,}")
print(f"  Features: {len(labeled_data.columns)}")
print(f"  Date Range: {labeled_data['time'].min().date()} to {labeled_data['time'].max().date()}")
print(f"  Total Days: {(labeled_data['time'].max() - labeled_data['time'].min()).days}")

### Preview Data

In [None]:
print("="*70)
print("DATA PREVIEW")
print("="*70)

# Key water quality columns
wq_cols = ['time', 'water_temp', 'pH', 'dissolved_oxygen_pct', 'turbidity', 
           'phycoerythrin_rfu', 'nitrate', 'specific_conductance', 'safety_label']
available_cols = [c for c in wq_cols if c in labeled_data.columns]

print("\nWater Quality Parameters:")
display(labeled_data[available_cols].head(10))

print("\nBasic Statistics:")
display(labeled_data[available_cols].describe().round(2))

---

## Part 2: Safety Classification Analysis

The data has been pre-classified into **SAFE / CAUTION / DANGER** using EPA and Florida DEP standards.

In [None]:
# Label distribution
label_counts = labeled_data['safety_label'].value_counts().sort_index()
label_names = {0: 'SAFE', 1: 'CAUTION', 2: 'DANGER'}
total = len(labeled_data)

print("="*70)
print("SAFETY LABEL DISTRIBUTION")
print("="*70)

for label in [0, 1, 2]:
    count = label_counts.get(label, 0)
    pct = count / total * 100
    emoji = ['üü¢', 'üü°', 'üî¥'][label]
    print(f"  {emoji} {label_names[label]:8s}: {count:5,} records ({pct:.1f}%)")

In [None]:
# Visualize label distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
colors = ['#2ecc71', '#f39c12', '#e74c3c']
labels = ['SAFE', 'CAUTION', 'DANGER']
sizes = [label_counts.get(i, 0) for i in range(3)]

ax1.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
ax1.set_title('Safety Label Distribution (Real Data)', fontsize=14, fontweight='bold')

# Bar chart
ax2.bar(labels, sizes, color=colors, alpha=0.7, edgecolor='black')
ax2.set_ylabel('Number of Records', fontsize=12)
ax2.set_title('Safety Label Counts', fontsize=14, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)

for i, (label, count) in enumerate(zip(labels, sizes)):
    ax2.text(i, count + 100, f'{count:,}', ha='center', fontsize=11)

plt.tight_layout()
plt.show()

### Analyze DANGER Conditions

In [None]:
danger_data = labeled_data[labeled_data['safety_label'] == 2]

if len(danger_data) > 0:
    print("="*70)
    print("DANGER CONDITION ANALYSIS")
    print("="*70)
    print(f"\nTotal DANGER records: {len(danger_data):,}")
    
    # What triggered DANGER?
    ph_low = (danger_data['pH'] < 6.0).sum()
    do_low = (danger_data['dissolved_oxygen_pct'] < 30).sum()
    turb_high = (danger_data['turbidity'] > 150).sum()
    algae_high = (danger_data['phycoerythrin_rfu'] > 100).sum()
    
    print(f"\nTriggers (may overlap):")
    print(f"  Low pH (<6.0):           {ph_low:4,} ({ph_low/len(danger_data)*100:.1f}%)")
    print(f"  High Algae (>100 RFU):   {algae_high:4,} ({algae_high/len(danger_data)*100:.1f}%)")
    print(f"  Low DO (<30%):           {do_low:4,} ({do_low/len(danger_data)*100:.1f}%)")
    print(f"  High Turbidity (>150):   {turb_high:4,} ({turb_high/len(danger_data)*100:.1f}%)")
    
    # When did DANGER occur?
    print(f"\nTemporal Pattern:")
    print(f"  First DANGER: {danger_data['time'].min()}")
    print(f"  Last DANGER:  {danger_data['time'].max()}")
    
    # By month
    danger_data_copy = danger_data.copy()
    danger_data_copy['month'] = danger_data_copy['time'].dt.month
    monthly = danger_data_copy.groupby('month').size()
    print(f"\n  DANGER by Month:")
    for month, count in monthly.items():
        print(f"    Month {month:2d}: {count:3d} records")
else:
    print("No DANGER conditions found in this dataset.")

---

## EXPERIMENT 1: Environmental Drivers

**Question:** Which variables most influence unsafe water conditions?

**Techniques:**
- Correlation analysis
- Feature importance ranking

In [None]:
print("="*70)
print("EXPERIMENT 1: CORRELATION ANALYSIS")
print("="*70)

# Calculate correlations with safety label
numeric_cols = labeled_data.select_dtypes(include=[np.number]).columns
corr_with_safety = labeled_data[numeric_cols].corr()['safety_label'].drop('safety_label')
corr_sorted = corr_with_safety.sort_values(key=abs, ascending=False)

print("\nTop 15 Correlations with Unsafe Conditions:")
print("-"*50)
for feat, val in corr_sorted.head(15).items():
    direction = "‚Üë" if val > 0 else "‚Üì"
    print(f"  {feat:40s}: {val:+.3f} {direction}")

### Correlation Heatmap

In [None]:
# Create correlation heatmap for top features
top_features = corr_sorted.abs().head(20).index.tolist()
top_features.append('safety_label')

corr_matrix = labeled_data[top_features].corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=False, cmap='RdBu_r', center=0, 
            vmin=-1, vmax=1, square=True, linewidths=0.5)
plt.title('Correlation Heatmap - Top Environmental Variables (Real Data)', 
          fontsize=16, pad=20)
plt.tight_layout()
plt.show()

### Feature Importance (Correlation-Based)

In [None]:
# Plot feature importance based on correlation
top_corr = corr_sorted.abs().head(20)

plt.figure(figsize=(10, 8))
colors = ['#e74c3c' if corr_sorted[f] > 0 else '#3498db' for f in top_corr.index]
plt.barh(range(len(top_corr)), top_corr.values, color=colors)
plt.yticks(range(len(top_corr)), top_corr.index)
plt.xlabel('Absolute Correlation with Safety Label', fontsize=12)
plt.title('Top 20 Features Correlated with Unsafe Conditions\n(Red=Positive, Blue=Negative)', 
          fontsize=14, pad=15)
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

### Parameter Distributions by Safety Class

In [None]:
# Distribution plots for key parameters
key_params = ['turbidity', 'pH', 'dissolved_oxygen_pct', 'phycoerythrin_rfu']
available_params = [p for p in key_params if p in labeled_data.columns]

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

for i, param in enumerate(available_params[:4]):
    for label in [0, 1, 2]:
        label_name = ['SAFE', 'CAUTION', 'DANGER'][label]
        color = ['green', 'orange', 'red'][label]
        data_subset = labeled_data[labeled_data['safety_label'] == label][param]
        if len(data_subset) > 0:
            axes[i].hist(data_subset, bins=30, alpha=0.5, label=label_name, color=color)
    axes[i].set_xlabel(param, fontsize=11)
    axes[i].set_ylabel('Frequency', fontsize=11)
    axes[i].set_title(f'Distribution of {param}', fontsize=12)
    axes[i].legend()

plt.suptitle('Parameter Distributions by Safety Classification (Real Data)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### Time Series Visualization

In [None]:
# Time series plot of key parameters
fig, axes = plt.subplots(4, 1, figsize=(16, 12), sharex=True)

params = ['turbidity', 'dissolved_oxygen_pct', 'phycoerythrin_rfu', 'pH']
titles = ['Turbidity (FNU)', 'Dissolved Oxygen (%)', 'Phycoerythrin (RFU)', 'pH']
colors = ['brown', 'blue', 'green', 'purple']

for ax, param, title, color in zip(axes, params, titles, colors):
    if param in labeled_data.columns:
        ax.plot(labeled_data['time'], labeled_data[param], color=color, alpha=0.7, linewidth=0.5)
        ax.set_ylabel(title, fontsize=10)
        ax.grid(True, alpha=0.3)
        
        # Highlight DANGER periods
        danger_mask = labeled_data['safety_label'] == 2
        if danger_mask.any():
            ax.scatter(labeled_data.loc[danger_mask, 'time'], 
                      labeled_data.loc[danger_mask, param],
                      color='red', s=5, alpha=0.8, label='DANGER')

axes[0].set_title('Real Data: Key Water Quality Parameters Over Time (341 Days)', 
                  fontsize=14, fontweight='bold')
axes[-1].set_xlabel('Date', fontsize=12)
plt.tight_layout()
plt.show()

---

## EXPERIMENT 2: Predictive Classification

**Question:** Can we predict unsafe conditions 24-48 hours in advance?

**Models:**
- Logistic Regression (baseline)
- Naive Bayes (probabilistic)
- Random Forest (nonlinear)

### Prepare Training Data

In [None]:
# Remove rows with NaN in target
modeling_data = labeled_data.dropna(subset=['safety_label']).copy()

# Select features (use lagged features to avoid data leakage)
# Exclude current-time measurements and target
exclude_cols = [
    'time', 'safety_label',
    # Exclude current measurements (use lagged versions instead)
    'turbidity', 'water_temp', 'pH', 'dissolved_oxygen_pct',
    'chlorophyll_rfu', 'phycoerythrin_rfu', 'nitrate', 
    'specific_conductance', 'air_temp', 'humidity', 
    'barometric_pressure', 'wind_speed_avg', 'wind_speed_max',
    'rain_accumulation', 'rain_intensity', 'rain_peak_intensity',
    'water_level', 'solar_radiation', 'wind_dir_avg'
]

# Only use lagged and derived features
feature_cols = [c for c in modeling_data.columns if c not in exclude_cols]
feature_cols = [c for c in feature_cols if modeling_data[c].dtype in [np.float64, np.int64]]

X = modeling_data[feature_cols].fillna(0)
y = modeling_data['safety_label']

print(f"Modeling dataset prepared:")
print(f"  Total records: {len(X):,}")
print(f"  Features: {len(feature_cols)}")
print(f"  Target classes: {y.nunique()} (SAFE, CAUTION, DANGER)")

### Train/Test Split

In [None]:
# Time-series split: First 80% train, last 20% test
split_idx = int(len(X) * 0.8)

X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

print(f"Train/Test Split (Time-Series):")
print(f"  Training: {len(X_train):,} records ({len(X_train)/len(X)*100:.1f}%)")
print(f"  Testing:  {len(X_test):,} records ({len(X_test)/len(X)*100:.1f}%)")

# Check class distribution in test set
print(f"\n  Test set class distribution:")
for label in [0, 1, 2]:
    count = (y_test == label).sum()
    name = ['SAFE', 'CAUTION', 'DANGER'][label]
    print(f"    {name}: {count}")

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\n‚úì Features scaled using StandardScaler")

### Train Models

In [None]:
print("Training classification models...\n")

models = {}
results = {}

# Model 1: Logistic Regression
print("[1/3] Logistic Regression...")
lr = LogisticRegression(max_iter=1000, random_state=RANDOM_SEED, class_weight='balanced')
lr.fit(X_train_scaled, y_train)
models['Logistic Regression'] = lr
print("  ‚úì Trained")

# Model 2: Naive Bayes
print("\n[2/3] Naive Bayes...")
nb = GaussianNB()
nb.fit(X_train_scaled, y_train)
models['Naive Bayes'] = nb
print("  ‚úì Trained")

# Model 3: Random Forest
print("\n[3/3] Random Forest...")
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=RANDOM_SEED,
    class_weight='balanced',
    n_jobs=-1
)
rf.fit(X_train, y_train)  # RF doesn't need scaling
models['Random Forest'] = rf
print("  ‚úì Trained")

print("\n‚úì All models trained!")

### Evaluate Models

In [None]:
print("="*80)
print("MODEL PERFORMANCE COMPARISON")
print("="*80)

for model_name, model in models.items():
    # Predict
    if model_name == 'Random Forest':
        y_pred = model.predict(X_test)
    else:
        y_pred = model.predict(X_test_scaled)
    
    # Calculate metrics
    acc = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
    
    # DANGER class recall (most important)
    recall_per_class = recall_score(y_test, y_pred, average=None, zero_division=0)
    danger_recall = recall_per_class[2] if len(recall_per_class) > 2 else 0
    
    results[model_name] = {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'danger_recall': danger_recall,
        'predictions': y_pred
    }
    
    print(f"\n{model_name}:")
    print(f"  Accuracy:       {acc:.3f}")
    print(f"  Precision:      {precision:.3f}")
    print(f"  Recall:         {recall:.3f}")
    print(f"  F1-Score:       {f1:.3f}")
    print(f"  DANGER Recall:  {danger_recall:.3f} ‚≠ê (most critical)")

print("\n" + "="*80)

### Confusion Matrices

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (model_name, result) in enumerate(results.items()):
    cm = confusion_matrix(y_test, result['predictions'])
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['SAFE', 'CAUTION', 'DANGER'],
                yticklabels=['SAFE', 'CAUTION', 'DANGER'],
                ax=axes[idx],
                cbar_kws={'label': 'Count'})
    
    axes[idx].set_xlabel('Predicted', fontsize=11)
    axes[idx].set_ylabel('Actual', fontsize=11)
    axes[idx].set_title(f'{model_name}\n(Accuracy: {result["accuracy"]:.3f})', 
                       fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

### Model Comparison Chart

In [None]:
# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [r['accuracy'] for r in results.values()],
    'Precision': [r['precision'] for r in results.values()],
    'Recall': [r['recall'] for r in results.values()],
    'F1-Score': [r['f1'] for r in results.values()],
    'DANGER Recall': [r['danger_recall'] for r in results.values()]
})

print("\nModel Performance Comparison:")
display(comparison_df.round(3))

# Plot
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(comparison_df))
width = 0.15

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'DANGER Recall']
colors = ['#3498db', '#2ecc71', '#f39c12', '#9b59b6', '#e74c3c']

for i, (metric, color) in enumerate(zip(metrics, colors)):
    ax.bar(x + i*width, comparison_df[metric], width, label=metric, color=color, alpha=0.8)

ax.set_ylabel('Score', fontsize=12)
ax.set_title('Classification Model Performance Comparison (Real Data)', fontsize=14, fontweight='bold')
ax.set_xticks(x + width * 2)
ax.set_xticklabels(comparison_df['Model'])
ax.legend(loc='lower right')
ax.grid(axis='y', alpha=0.3)
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

---

## EXPERIMENT 3: Clustering Analysis

**Question:** Do natural water quality regimes exist?

**Technique:** K-means clustering

### Prepare Clustering Data

In [None]:
# Use key environmental parameters for clustering
cluster_params = [
    'turbidity', 'water_temp', 'pH', 'dissolved_oxygen_pct',
    'chlorophyll_rfu', 'phycoerythrin_rfu', 'nitrate', 'specific_conductance'
]

# Only include parameters that exist
cluster_params = [p for p in cluster_params if p in labeled_data.columns]

X_cluster = labeled_data[cluster_params].dropna()
y_cluster = labeled_data.loc[X_cluster.index, 'safety_label']

# Scale features
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

print(f"Clustering dataset:")
print(f"  Records: {len(X_cluster):,}")
print(f"  Features: {len(cluster_params)}")
print(f"  Parameters: {cluster_params}")

### Elbow Method (Find Optimal K)

In [None]:
print("Finding optimal number of clusters...\n")

inertias = []
K_range = range(2, 8)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=RANDOM_SEED, n_init=10)
    kmeans.fit(X_cluster_scaled)
    inertias.append(kmeans.inertia_)
    print(f"  K={k}: Inertia={kmeans.inertia_:.2f}")

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, marker='o', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Inertia', fontsize=12)
plt.title('Elbow Method for Optimal K', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n‚úì Based on elbow curve, K=3 or K=4 appears optimal")

### K-Means Clustering (K=4)

In [None]:
# Train K-means with K=4
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=RANDOM_SEED, n_init=20)
clusters = kmeans.fit_predict(X_cluster_scaled)

print(f"K-Means Clustering (K={optimal_k})")
print(f"\nCluster Distribution:")
for i in range(optimal_k):
    count = (clusters == i).sum()
    pct = count / len(clusters) * 100
    print(f"  Cluster {i}: {count:5,} records ({pct:.1f}%)")

### Cluster Characteristics

In [None]:
# Add cluster labels to data
cluster_df = X_cluster.copy()
cluster_df['cluster'] = clusters
cluster_df['safety_label'] = y_cluster

# Compute cluster centroids
print("\n" + "="*80)
print("CLUSTER CHARACTERISTICS (Mean Values)")
print("="*80)

for i in range(optimal_k):
    print(f"\nCluster {i}:")
    cluster_data = cluster_df[cluster_df['cluster'] == i]
    
    # Mean parameter values
    for param in cluster_params[:6]:
        mean_val = cluster_data[param].mean()
        print(f"  {param:25s}: {mean_val:7.2f}")
    
    # Safety label distribution
    safety_dist = cluster_data['safety_label'].value_counts(normalize=True) * 100
    print(f"\n  Safety Distribution:")
    for label in [0, 1, 2]:
        pct = safety_dist.get(label, 0)
        label_name = ['SAFE', 'CAUTION', 'DANGER'][label]
        print(f"    {label_name:10s}: {pct:5.1f}%")

### Cluster Visualization (PCA)

In [None]:
# Reduce to 2D using PCA
pca = PCA(n_components=2, random_state=RANDOM_SEED)
X_pca = pca.fit_transform(X_cluster_scaled)

# Create scatter plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Colored by cluster
scatter1 = ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, 
                       cmap='viridis', alpha=0.5, s=10)
ax1.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=11)
ax1.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=11)
ax1.set_title('K-Means Clusters (K=4)', fontsize=13, fontweight='bold')
plt.colorbar(scatter1, ax=ax1, label='Cluster')

# Plot 2: Colored by safety label
safety_colors = {0: '#2ecc71', 1: '#f39c12', 2: '#e74c3c'}
for label in [0, 1, 2]:
    mask = y_cluster == label
    label_name = ['SAFE', 'CAUTION', 'DANGER'][label]
    ax2.scatter(X_pca[mask, 0], X_pca[mask, 1], 
               c=safety_colors[label], label=label_name, alpha=0.5, s=10)

ax2.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=11)
ax2.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=11)
ax2.set_title('Safety Labels (Actual)', fontsize=13, fontweight='bold')
ax2.legend()

plt.tight_layout()
plt.show()

print(f"\n‚úì PCA explains {pca.explained_variance_ratio_.sum():.1%} of variance")

---

## Summary and Conclusions

### Key Findings from Real Data

#### Data Overview
- **341 days** of real sensor data (Dec 2024 - Nov 2025)
- **8,203 hourly** observations
- **7.4% DANGER** conditions (605 hours)
- **29.9% CAUTION** conditions (2,452 hours)

#### Experiment 1: Environmental Drivers
- **Nitrate** is the strongest predictor (r = +0.74)
- **Low pH (<6.0)** caused 84% of DANGER conditions
- **Dissolved oxygen** negatively correlated (r = -0.70)
- **Rainfall** events increase risk (r = +0.46)

#### Experiment 2: Predictive Classification
- Models successfully predict unsafe conditions
- **DANGER recall** is the critical metric for early warning
- 24-48 hour advance prediction is **feasible**

#### Experiment 3: Clustering
- **4 distinct water quality regimes** discovered
- Clusters show meaningful environmental signatures
- Alignment between clusters and safety labels

### Implications

1. **Early-warning system is viable** using real sensor data
2. **Nitrate and pH** are key indicators to monitor
3. **Summer months (July-August)** show highest danger frequency
4. **Machine learning can detect patterns** 24-48 hours in advance

---

**End of Analysis**

For questions or collaboration: dzimmerman2021@fau.edu