# 🌊 Flood Frequency Analysis Tool – Overview & Documentation

This tool reads annual peak flow data downloaded from the USGS NWIS database and fits 10 probability distributions commonly used in flood frequency analysis to estimate flood magnitudes for selected return periods (e.g., 2-year, 100-year). It evaluates goodness of fit statistically and provides an interactive interface for viewing the flood frequency curve of each distribution.

---

## 🔧 What the Tool Does

- ✅ Reads annual peak discharge data from a NWIS `.txt` file
- ✅ Fits multiple statistical distributions to the observed peak flows
- ✅ Computes estimated flood quantiles for specific return periods (2, 5, 10, 25, 50, 100 years)
- ✅ Calculates RMSE and Kolmogorov–Smirnov (KS) goodness-of-fit metrics
- ✅ Allows the user to interactively select a distribution and view:
  - Estimated peak flows
  - Distribution parameters
  - GOF statistics
  - A flood frequency curve plotted against return period on a logarithmic axis
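
The return-period step listed above can be sketched as follows. A T-year flood has an annual exceedance probability of 1/T, so its magnitude is the distribution quantile at the non-exceedance probability p = 1 - 1/T. This example uses the closed-form inverse CDF of the Gumbel distribution; the function names and the location and scale values are illustrative assumptions, not taken from the tool.

```python
import math

RETURN_PERIODS = [2, 5, 10, 25, 50, 100]  # years, as listed above


def non_exceedance_probability(return_period):
    """Annual non-exceedance probability p = 1 - 1/T for a T-year event."""
    if return_period <= 1:
        raise ValueError("return period must be greater than 1 year")
    return 1.0 - 1.0 / return_period


def gumbel_quantile(p, loc, scale):
    """Inverse CDF of the Gumbel (EV1) distribution: x = loc - scale * ln(-ln p)."""
    return loc - scale * math.log(-math.log(p))


# Hypothetical parameters (cfs), chosen only to illustrate the calculation.
loc, scale = 600.0, 200.0

quantiles = {
    T: gumbel_quantile(non_exceedance_probability(T), loc, scale)
    for T in RETURN_PERIODS
}
for T, q in quantiles.items():
    print(f"{T:>3}-year flood: {q:8.1f} cfs")
```

The same mapping from return period to probability applies to every distribution; only the quantile function changes.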

---

## 🧭 How to Use

1. **Prepare Input File**  
   - Download annual peak streamflow data from the [USGS NWIS Peak Flow site](https://waterdata.usgs.gov/nwis/peak)
   - Save as a tab-delimited `.txt` file (e.g., `07022500_nwis_peak.txt`)

2. **Run the Script in Jupyter Notebook**
   - Place the file in your working directory
   - Modify the line `usgs_file = "07022500_nwis_peak.txt"` to match your filename
   - Run the script cell-by-cell

3. **Explore Results**
   - View the summary table of fitted distribution parameters and their statistical performance
   - Use the dropdown selector to compare estimated flood flows and curves for each distribution

---

## 📚 Theoretical Background: Distributions Used

Each distribution estimates the probability of rare flood events based on historical data. Here's a quick reference:

| Distribution           | Description                                                                 | Parameters                        |
|------------------------|-----------------------------------------------------------------------------|-----------------------------------|
| **Gumbel (EV1)**        | Models block maxima (e.g., annual max). Skewed right.                      | Location (μ), Scale (β)           |
| **Log-Pearson III**     | Pearson Type III fitted to log-transformed flows. Recommended by U.S. federal guidelines (Bulletin 17C). | Shape (α), Location (μ), Scale    |
| **GEV**                 | General form for block maxima. Includes Gumbel, Fréchet, and reversed Weibull as special cases. | Shape (ξ), Location, Scale        |
| **Normal**              | Symmetric bell curve. May misrepresent skewed flood data.                  | Mean (μ), Std. dev. (σ)           |
| **Lognormal**           | Data is normally distributed after log transform. Skewed right.            | Shape (σ), Location, Scale        |
| **Weibull (Type III)**  | Useful for extreme minimums or upper tails.                                | Shape (k), Location, Scale        |
| **Exponential**         | Special case of Weibull; constant failure rate (rarely used for floods).   | Rate (λ) or Scale                 |
| **Gamma**               | General skewed distribution; flexible fit for hydrologic data.             | Shape (k), Scale (θ), Location    |
| **Loglogistic (Fisk)**  | Skewed right, like lognormal but heavier tail.                             | Shape (c), Location, Scale        |
| **Generalized Pareto**  | Models excesses over a threshold (POT approach).                           | Shape, Location, Scale            |
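
As a minimal sketch of how a few of the distributions in the table might be fitted, the example below uses `scipy.stats` with maximum-likelihood `fit` and reads return-period flows from `ppf`. The mapping from table names to SciPy objects, the synthetic sample, and the choice of three candidates are assumptions made for illustration; the tool's actual fitting code may differ. Note that SciPy's `genextreme` shape parameter `c` has the opposite sign to the ξ convention used in much of the hydrology literature.

```python
import numpy as np
from scipy import stats

# Synthetic annual peaks (cfs) standing in for the 'peak_va' column; not real gage data.
rng = np.random.default_rng(42)
peaks = rng.lognormal(mean=6.4, sigma=0.4, size=60)

# Assumed mapping from table names to scipy.stats distributions.
candidates = {
    "Gumbel (EV1)": stats.gumbel_r,
    "GEV": stats.genextreme,
    "Lognormal": stats.lognorm,
}

return_periods = np.array([2, 5, 10, 25, 50, 100])
probs = 1 - 1 / return_periods  # non-exceedance probabilities

results = {}
for name, dist in candidates.items():
    params = dist.fit(peaks)                 # maximum-likelihood parameter estimates
    results[name] = dist.ppf(probs, *params)  # flows at each return period

for name, flows in results.items():
    print(name, np.round(flows, 0))
```

Comparing the 100-year estimates across candidates shows how strongly the choice of distribution affects the upper tail, even when the fits look similar near the median.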

---

## 📏 Performance Evaluation Criteria

Two statistical metrics assess how well each distribution fits the observed data:

- ### 🔹 Root Mean Squared Error (RMSE)
  Measures the typical error between the sorted observed peak flows and the fitted distribution's quantiles at the same empirical probabilities:
  $$
  \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( Q_{\text{obs},i} - Q_{\text{est},i} \right)^2 }
  $$
  Lower values indicate a better fit.

- ### 🔹 Kolmogorov–Smirnov (KS) Statistic
  Measures the maximum difference between the empirical cumulative distribution function (ECDF) and the theoretical CDF:
  $$
  D = \sup_x |F_n(x) - F(x)|
  $$
  - Returns both the **KS statistic** and a **p-value**
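  - If p-value > 0.05: the test does not reject the distribution at the 5% significance level (✅ Pass). This means the fit is plausible, not that it is proven correct. Because the parameters are estimated from the same data, the p-value tends to be optimistic.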
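
Both metrics can be sketched in a few lines. The example below assumes Gringorten plotting positions for the empirical probabilities (the self-assessment questions later in this notebook mention them) and uses a Gumbel fit to synthetic data; the variable names and sample are illustrative, not the tool's own.

```python
import numpy as np
from scipy import stats

# Synthetic annual peaks (cfs); not real gage data.
rng = np.random.default_rng(0)
peaks = rng.lognormal(mean=6.4, sigma=0.4, size=60)

# Fit one candidate distribution (Gumbel here).
loc, scale = stats.gumbel_r.fit(peaks)

# Gringorten plotting positions for the sorted sample (rank 1 = smallest):
# non-exceedance probability p_i = (i - 0.44) / (n + 0.12)
q_obs = np.sort(peaks)
n = q_obs.size
ranks = np.arange(1, n + 1)
p_emp = (ranks - 0.44) / (n + 0.12)

# RMSE between observed flows and fitted quantiles at the same probabilities
q_est = stats.gumbel_r.ppf(p_emp, loc=loc, scale=scale)
rmse = np.sqrt(np.mean((q_obs - q_est) ** 2))

# KS statistic D and p-value against the fitted CDF
ks_stat, p_value = stats.kstest(peaks, "gumbel_r", args=(loc, scale))

print(f"RMSE = {rmse:.1f} cfs, D = {ks_stat:.3f}, p = {p_value:.3f}")
```

RMSE is expressed in flow units and weights large floods heavily, while D is unitless and most sensitive near the middle of the distribution, so the two metrics can rank candidates differently.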

---

## 🎯 Output Summary

- A sorted summary table of all distributions including:
  - Fitted parameters
  - RMSE
  - KS statistic and p-value
  - Pass/fail interpretation
- Interactive flood frequency plots with return period on a logarithmic x-axis
- Ability to choose which distribution best represents the dataset

---

## 💡 Applications

- Floodplain mapping
- Hydraulic structure design (culverts, bridges, dams)
- Return period–based risk estimation
- Hydrologic modeling calibration

---


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gumbel_r
from ipywidgets import interact, Dropdown
def read_nwis_peak_file(file_path):
    """
    Reads a USGS NWIS peak flow .txt file and extracts peak flow data.
    Handles comment lines and parses peak date and discharge values.
    
    Parameters:
        file_path (str): Path to NWIS peak flow text file
    
    Returns:
        DataFrame with columns ['site_no', 'peak_dt', 'peak_va'] (site, date, peak flow)
    """
    try:
        # NWIS files begin with '#' comment lines followed by a header row.
        # comment='#' skips the comments, so header=0 picks up the column names.
        # Reading every column as text preserves leading zeros in site_no.
        df = pd.read_csv(
            file_path,
            sep='\t',
            comment='#',
            header=0,
            dtype=str,
            engine='python'
        )

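        # Clean and convert key columns. errors='coerce' turns unparseable values
        # (including the column-format row under the header) into NaT/NaN.
        df.columns = df.columns.str.strip()
        df['peak_dt'] = pd.to_datetime(df['peak_dt'], format='%Y-%m-%d', errors='coerce')
        df['peak_va'] = pd.to_numeric(df['peak_va'], errors='coerce')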

        # Drop rows with invalid data
        df_clean = df[['site_no', 'peak_dt', 'peak_va']].dropna()

        return df_clean

    except Exception as e:
        print(f"❌ Error reading file: {e}")
        return pd.DataFrame()

# 🔍 Example usage
# --- Step 1: Load USGS Peak Flow Data (Annual Max Series) ---
# Replace this with the path to your downloaded NWIS peak flow file (tab-delimited .txt)

usgs_file = "07022500_nwis_peak.txt"
peak_df = read_nwis_peak_file(usgs_file)

# 📊 Preview the data
print(peak_df.head())



    site_no    peak_dt  peak_va
2  07022500 1953-03-03    780.0
3  07022500 1954-01-20    520.0
4  07022500 1955-03-20    846.0
5  07022500 1956-02-02    440.0
6  07022500 1957-01-22    707.0
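
The preview above starts at index 2 rather than 0 because some rows are dropped during cleaning. The sketch below reproduces that behavior on a small in-memory sample. It assumes the NWIS tab-delimited layout: `#` comment lines, a header row, then a column-format row such as `5s`, `15s`, `10d`. The site number, flows, reduced column set, and the partial date `1991-00-00` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical excerpt in the NWIS tab-delimited layout (all values are made up).
sample = (
    "# comment lines describing the file\n"
    "agency_cd\tsite_no\tpeak_dt\tpeak_va\n"
    "5s\t15s\t10d\t8s\n"
    "USGS\t00000000\t1990-03-03\t780\n"
    "USGS\t00000000\t1991-00-00\t650\n"
    "USGS\t00000000\t1992-01-20\t520\n"
)

# Same parsing steps as read_nwis_peak_file, applied to the sample text.
df = pd.read_csv(io.StringIO(sample), sep="\t", comment="#", dtype=str)
df["peak_dt"] = pd.to_datetime(df["peak_dt"], format="%Y-%m-%d", errors="coerce")
df["peak_va"] = pd.to_numeric(df["peak_va"], errors="coerce")

# Row 0 (the format row) and row 2 (the unparseable date) become NaT/NaN and are dropped.
clean = df[["site_no", "peak_dt", "peak_va"]].dropna()
print(clean)
```

Rows dropped this way are silent, so comparing `len(df)` with `len(clean)` is a quick check on how many peaks were excluded from the analysis.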




# 📘 Self-Assessment: Flood Frequency Analysis Tool

Use these prompts and questions to evaluate your understanding of the tool and its underlying hydrologic and statistical concepts.

---

## 🧠 Conceptual Questions

1. **Why are return periods plotted on a logarithmic scale in flood frequency analysis?**
   - *Hint: Think about how frequent vs. rare events are distributed.*

2. **What is the purpose of fitting multiple distributions to the same peak flow dataset?**
   - *Hint: No single distribution fits all scenarios equally well.*

3. **How do Gringorten plotting positions help in flood frequency analysis?**
   - *Hint: They're used to assign empirical probabilities to ordered data.*

4. **What assumptions underlie the use of the Gumbel distribution in hydrology?**
   - *Hint: It’s designed to model block maxima like annual peak flows.*

5. **How do parametric and non-parametric flood frequency methods differ in their approach?**
   - *Hint: Consider how the data distribution is treated.*

---

## 🔍 Reflective Prompts

1. **If two distributions yield similar RMSE but different KS p-values, which metric is more important for selecting a model—and why?**

2. **Can a statistically good-fitting distribution be inappropriate for design applications? Provide an example.**

3. **How would you adapt this tool to process data from multiple gage stations simultaneously?**

4. **What limitations might this tool face when applied to future climate-affected streamflow patterns?**

5. **How would the analysis change if you used partial-duration series instead of annual maxima?**

---

## ✅ Quiz Questions

**Q1.** The Gumbel distribution is commonly used to model:  
A. Rainfall intensity  
B. Annual maximum values  
C. Median flow durations  
D. Baseflow during drought  
✅ **Correct:** B

---

**Q2.** The Kolmogorov–Smirnov test compares:  
A. Log and normal distributions  
B. ECDF and theoretical CDF  
C. Mean annual rainfall  
D. Number of peaks above threshold  
✅ **Correct:** B

---

**Q3.** In the Generalized Extreme Value distribution, the shape parameter controls:  
A. Peak discharge  
B. Tail behavior  
C. Cumulative runoff  
D. Frequency of low flows  
✅ **Correct:** B

---

**Q4.** A high KS p-value and low RMSE suggest:  
A. Overfitting  
B. Good model fit  
C. Poor data resolution  
D. Statistical bias  
✅ **Correct:** B

---

**Q5.** Which distribution is least appropriate for positively skewed hydrologic data?  
A. Gumbel  
B. Lognormal  
C. Normal  
D. Log-Pearson III  
✅ **Correct:** C