
# Module 5 Project: Radar Parameters and Rainfall Prediction
### By: Nathan Makowski (Student in ATMS 523 Fall 2025)


In the **Module 5 Project** we analyze polarimetric radar data to estimate rainfall intensity. Using measurements collected from disdrometers in **Huntsville, Alabama**, this notebook explores how various radar-derived parameters relate to rain rate.


In this project, I will:


1. Load and explore the dataset `radar_parameters.csv` (8 columns, 18,969 rows).
2. Split the dataset into **70% training** and **30% testing** subsets.
3. Train and evaluate multiple regression models, including:
   - **Multiple Linear Regression**
   - **Polynomial Regression** (with grid search over polynomial degree 0–9, using 7-fold cross-validation)
   - **Random Forest Regressor** (with grid search over a specified hyperparameter grid)
4. Compare each model’s performance to a **baseline empirical relationship** between radar reflectivity and rain rate:
\( Z = 200R^{1.6} \), where \( Z \) is the linear reflectivity factor (mm⁶ m⁻³) and \( R \) is the rain rate (mm/hr).
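
As a quick illustration of how the baseline is applied, the sketch below inverts the relation for a single reflectivity value. The 40 dBZ input is an arbitrary example, not a value taken from the dataset, and `rain_rate_from_dbz` is a hypothetical helper name used only here.

```python
def rain_rate_from_dbz(zh_dbz, a=200.0, b=1.6):
    """Invert Z = a * R**b for R (mm/hr), given reflectivity in dBZ."""
    z_linear = 10.0 ** (zh_dbz / 10.0)  # dBZ -> linear Z (mm^6 m^-3)
    return (z_linear / a) ** (1.0 / b)


# Example value chosen for illustration only (not from the dataset)
r_40 = rain_rate_from_dbz(40.0)
print(f"40 dBZ -> {r_40:.2f} mm/hr")
```

For 40 dBZ this gives roughly 11.5 mm/hr. Because of the logarithmic dBZ scale, each additional 10 dBZ multiplies Z by 10 and the estimated rain rate by about 4.2.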


Performance metrics used include:
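- **Coefficient of Determination (R²)**: the fraction of variance in the observed rain rate that the model explains (1 is a perfect fit).
- **Root Mean Square Error (RMSE)**: the typical size of the prediction error, in mm/hr. Squaring the errors weights large misses more heavily.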
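
Both metrics can be computed directly from their definitions. The following is a minimal sketch on toy numbers (not project data). `r2_and_rmse` is a hypothetical helper, and the notebook itself relies on scikit-learn's `r2_score` and `mean_squared_error` instead.

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Return (R2, RMSE) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return r2, rmse


# Toy values for illustration only
r2_toy, rmse_toy = r2_and_rmse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(r2_toy, rmse_toy)
```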


By the end of this analysis, I identify which model most accurately estimates rainfall rate and whether it can outperform the baseline relationship.

In [1]:
# Imports
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
import joblib

In [2]:
# Print the current working directory to confirm where relative paths resolve
print(os.getcwd())

/home/njm12/ATMS_523/Module_5/ATMS-523-Module-5


In [4]:
# 0) Load data (adjust path if necessary)
csv_path = '/home/njm12/ATMS_523/Module_5/ATMS-523-Module-5/homework/radar_parameters.csv'
try:
    df = pd.read_csv(csv_path)
except FileNotFoundError as err:
    # Re-raise with a hint about the working directory, keeping the original error chained
    raise FileNotFoundError(
        f"Could not find {csv_path}. If the notebook's working directory is not "
        "the repo root, change csv_path or inspect os.getcwd()."
    ) from err


# Quick peek
print('rows, cols:', df.shape)
print(df.columns.tolist())

rows, cols: (18969, 8)
['Unnamed: 0', 'Zh (dBZ)', 'Zdr (dB)', 'Ldr (dB)', 'Kdp (deg km-1)', 'Ah (dBZ/km)', 'Adr (dB/km)', 'R (mm/hr)']


In [7]:
# Features and target
feature_cols = ['Zh (dBZ)', 'Zdr (dB)', 'Ldr (dB)', 'Kdp (deg km-1)', 'Ah (dBZ/km)', 'Adr (dB/km)']
# Drop any unnamed index column if present
if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=['Unnamed: 0'])


expected_cols = feature_cols + ['R (mm/hr)']
if not all(col in df.columns for col in expected_cols):
    raise ValueError('Input CSV missing expected columns. Expected columns include: ' + ','.join(expected_cols))


X = df[feature_cols].values
y = df['R (mm/hr)'].values

#print(X)
#print(y)

In [8]:
# 1) Split 70/30 (train/test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print('Train size:', X_train.shape[0], 'Test size:', X_test.shape[0])


# Helper functions


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


# Baseline: Z = 200 R^1.6  -->  R = (Z/200)^(1/1.6)
# Zh is in dBZ, so convert to linear reflectivity first: Z = 10^(Zh/10)


def baseline_R_from_Zh(Zh_array):
    """Estimate rain rate (mm/hr) from reflectivity in dBZ using Z = 200 R^1.6."""
    Z = 10**(Zh_array / 10.0)  # dBZ -> linear reflectivity
    R_hat = (Z / 200.0) ** (1.0 / 1.6)
    return R_hat


# Compute baseline on train and test (column 0 of X is 'Zh (dBZ)')
R_baseline_train = baseline_R_from_Zh(X_train[:, 0])
R_baseline_test = baseline_R_from_Zh(X_test[:, 0])


# Baseline metrics
baseline_train_r2 = r2_score(y_train, R_baseline_train)
baseline_test_r2 = r2_score(y_test, R_baseline_test)
baseline_train_rmse = rmse(y_train, R_baseline_train)
baseline_test_rmse = rmse(y_test, R_baseline_test)


print('\nBaseline (Z=200 R^1.6) metrics:')
print(f' Train R2: {baseline_train_r2:.4f}, RMSE: {baseline_train_rmse:.4f}')
print(f' Test R2: {baseline_test_r2:.4f}, RMSE: {baseline_test_rmse:.4f}')

Train size: 13278 Test size: 5691

Baseline (Z=200 R^1.6) metrics:
 Train R2: 0.2756, RMSE: 7.1440
 Test R2: 0.3566, RMSE: 7.1893
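
The split sizes printed above can be reproduced by hand. This sketch assumes that `train_test_split` rounds the test count up and gives the remainder to training, which is consistent with the 13278/5691 split reported.

```python
import math

n_samples = 18969   # rows reported by df.shape
test_size = 0.30

n_test = math.ceil(test_size * n_samples)   # 5690.7 rounded up
n_train = n_samples - n_test                # the remainder goes to training
print(n_train, n_test)
```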


In [9]:
# Target variance in each split (np.var uses ddof=0); R2 measures error relative to this
print(np.var(y_train), np.var(y_test))

70.44801283780265 80.33838182548162
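
These variances link the two metrics: R² = 1 − MSE / Var(y), with MSE = RMSE². The sketch below checks this against the baseline numbers, using the variances printed above and the six-decimal baseline RMSE values from the summary table later in the notebook.

```python
# Values copied from this notebook's printed outputs
var_train, var_test = 70.44801283780265, 80.33838182548162
rmse_train, rmse_test = 7.143950, 7.189316   # baseline RMSE (mm/hr)

# R2 = 1 - MSE / Var(y), using the population variance (ddof=0)
r2_train = 1.0 - rmse_train**2 / var_train
r2_test = 1.0 - rmse_test**2 / var_test
print(f"train R2 = {r2_train:.4f}, test R2 = {r2_test:.4f}")
```

This reproduces the baseline R² values and explains why the test R² is higher than the train R² even though the test RMSE is slightly larger: the test split has more variance to explain.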


In [10]:
# 2) Multiple Linear Regression (using all features)
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('linreg', LinearRegression())
])


pipe_lr.fit(X_train, y_train)


# Predictions
y_train_pred_lr = pipe_lr.predict(X_train)
y_test_pred_lr = pipe_lr.predict(X_test)


lr_train_r2 = r2_score(y_train, y_train_pred_lr)
lr_test_r2 = r2_score(y_test, y_test_pred_lr)
lr_train_rmse = rmse(y_train, y_train_pred_lr)
lr_test_rmse = rmse(y_test, y_test_pred_lr)


print('\nLinear Regression metrics:')
print(f' Train R2: {lr_train_r2:.4f}, RMSE: {lr_train_rmse:.4f}')
print(f' Test R2: {lr_test_r2:.4f}, RMSE: {lr_test_rmse:.4f}')


Linear Regression metrics:
 Train R2: 0.9879, RMSE: 0.9229
 Test R2: 0.9891, RMSE: 0.9358


In [None]:
# 3) Polynomial features grid search (degrees 0-9) with 7-fold CV
# Pipeline: scaler -> PolynomialFeatures(degree=d, include_bias=False) -> LinearRegression
# Note: LinearRegression fits the intercept, so the bias column is left out of the
# polynomial features. With include_bias=False, degree=0 produces no features at all,
# so scikit-learn rejects that candidate: its 7 fits fail and are scored as NaN.
# The search therefore effectively compares degrees 1-9.


pipeline_poly = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('linreg', LinearRegression())
])


param_grid = {
    'poly__degree': list(range(0, 10))
}


grid_poly = GridSearchCV(pipeline_poly, param_grid, cv=7, scoring='r2', n_jobs=-1, verbose=1)
grid_poly.fit(X_train, y_train)


print('\nPolynomial GridSearch best params:', grid_poly.best_params_)
print('Best CV R2:', grid_poly.best_score_)


best_poly = grid_poly.best_estimator_
# Evaluate best on train and test
y_train_pred_poly = best_poly.predict(X_train)
y_test_pred_poly = best_poly.predict(X_test)


poly_train_r2 = r2_score(y_train, y_train_pred_poly)
poly_test_r2 = r2_score(y_test, y_test_pred_poly)
poly_train_rmse = rmse(y_train, y_train_pred_poly)
poly_test_rmse = rmse(y_test, y_test_pred_poly)


print('\nBest Polynomial model metrics:')
print(f" degree = {grid_poly.best_params_['poly__degree']}")
print(f' Train R2: {poly_train_r2:.4f}, RMSE: {poly_train_rmse:.4f}')
print(f' Test R2: {poly_test_r2:.4f}, RMSE: {poly_test_rmse:.4f}')


Fitting 7 folds for each of 10 candidates, totalling 70 fits

Polynomial GridSearch best params: {'poly__degree': 2}
Best CV R2: 0.9969985736490088

Best Polynomial model metrics:
 degree = 2
 Train R2: 0.9996, RMSE: 0.1672
 Test R2: 0.9996, RMSE: 0.1836


7 fits failed out of a total of 70.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
7 fits failed with the following error:
Traceback (most recent call last):
  File "/home/njm12/ATMS_523/envs/xarray-climate/lib/python3.13/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/njm12/ATMS_523/envs/xarray-climate/lib/python3.13/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/njm12/ATMS_523/envs/xarray-climate/lib/python3.13/site-packages/sklearn/pipeline.py", line 655, in fit
    Xt = self._fit(X, y, routed_params, raw_params=param

In [12]:
# 4) Random Forest Regressor grid search (use provided param_grid), 7-fold CV
rf = RandomForestRegressor(random_state=42)
param_grid_rf = {
    "bootstrap": [True, False],
    "max_depth": [10, 100],
    "max_features": ["sqrt", 1.0],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
    "n_estimators": [200, 1000]
}


grid_rf = GridSearchCV(rf, param_grid_rf, cv=7, scoring='r2', n_jobs=-1, verbose=1)
# Warning: this search can be computationally heavy depending on data size and your machine.
grid_rf.fit(X_train, y_train)


print('\nRandom Forest GridSearch best params:', grid_rf.best_params_)
print('Best CV R2:', grid_rf.best_score_)


best_rf = grid_rf.best_estimator_
# Evaluate best on train and test
y_train_pred_rf = best_rf.predict(X_train)
y_test_pred_rf = best_rf.predict(X_test)


rf_train_r2 = r2_score(y_train, y_train_pred_rf)
rf_test_r2 = r2_score(y_test, y_test_pred_rf)
rf_train_rmse = rmse(y_train, y_train_pred_rf)
rf_test_rmse = rmse(y_test, y_test_pred_rf)


print('\nBest Random Forest metrics:')
print(f' Train R2: {rf_train_r2:.4f}, RMSE: {rf_train_rmse:.4f}')
print(f' Test R2: {rf_test_r2:.4f}, RMSE: {rf_test_rmse:.4f}')

Fitting 7 folds for each of 64 candidates, totalling 448 fits





Random Forest GridSearch best params: {'bootstrap': True, 'max_depth': 100, 'max_features': 1.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best CV R2: 0.9813631013370073

Best Random Forest metrics:
 Train R2: 0.9974, RMSE: 0.4252
 Test R2: 0.9879, RMSE: 0.9858
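
The "64 candidates, 448 fits" line in the output follows from the grid size. A small sketch of the count, using the same grid as the cell above:

```python
import math

# Same hyperparameter grid as in the Random Forest cell
param_grid_rf = {
    "bootstrap": [True, False],
    "max_depth": [10, 100],
    "max_features": ["sqrt", 1.0],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
    "n_estimators": [200, 1000],
}
n_folds = 7

# Every combination of values is one candidate; each candidate is fit once per fold
n_candidates = math.prod(len(values) for values in param_grid_rf.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

Adding one more two-valued hyperparameter would double the number of fits, which is why this search is the slowest step in the notebook.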


In [13]:
# 5) Summary comparison table
results = pd.DataFrame({
    'model': ['Baseline (Z->R)', 'LinearRegression', f'Poly deg={grid_poly.best_params_["poly__degree"]}', 'RandomForest (best)'],
    'train_r2': [baseline_train_r2, lr_train_r2, poly_train_r2, rf_train_r2],
    'test_r2': [baseline_test_r2, lr_test_r2, poly_test_r2, rf_test_r2],
    'train_rmse': [baseline_train_rmse, lr_train_rmse, poly_train_rmse, rf_train_rmse],
    'test_rmse': [baseline_test_rmse, lr_test_rmse, poly_test_rmse, rf_test_rmse]
})


print('\nComparison of models:')
print(results)


# Save the fitted models to disk so they can be reloaded without retraining
joblib.dump(pipe_lr, 'best_linear_model.joblib')
joblib.dump(best_poly, 'best_polynomial_model.joblib')
joblib.dump(best_rf, 'best_random_forest_model.joblib')


print('\nSaved best models to disk: best_linear_model.joblib, best_polynomial_model.joblib, best_random_forest_model.joblib')


Comparison of models:
                 model  train_r2   test_r2  train_rmse  test_rmse
0      Baseline (Z->R)  0.275551  0.356643    7.143950   7.189316
1     LinearRegression  0.987909  0.989099    0.922940   0.935812
2           Poly deg=2  0.999603  0.999581    0.167173   0.183564
3  RandomForest (best)  0.997433  0.987904    0.425229   0.985775

Saved best models to disk: best_linear_model.joblib, best_polynomial_model.joblib, best_random_forest_model.joblib


In [15]:
# Print file sizes
print('\nSaved model file sizes:')
for file in ['best_linear_model.joblib', 'best_polynomial_model.joblib', 'best_random_forest_model.joblib']:
    size_mb = os.path.getsize(file) / (1024 * 1024)
    print(f" {file}: {size_mb:.2f} MB")


Saved model file sizes:
 best_linear_model.joblib: 0.00 MB
 best_polynomial_model.joblib: 0.00 MB
 best_random_forest_model.joblib: 230.65 MB
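
A saved model can be reloaded with `joblib.load` and used for prediction without retraining. The sketch below shows the round trip on a tiny stand-in model with made-up data. The file name `demo_model.joblib` is hypothetical, and `compress=3` is an optional setting that was not used in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; in the notebook this would be pipe_lr, best_poly, or best_rf
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
y_demo = X_demo @ np.arange(1.0, 7.0)
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(model, path, compress=3)  # compression trades speed for a smaller file
    reloaded = joblib.load(path)

same_predictions = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded model reproduces predictions:", same_predictions)
```

Compression may be worth trying for the Random Forest file, which is by far the largest of the three. Reloading generally requires a scikit-learn version compatible with the one used for saving.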


# Model Performance Comparison and Discussion


**Answer:**


All three machine learning models (**Linear**, **Polynomial**, **Random Forest**) dramatically outperform the baseline \( Z = 200R^{1.6} \) model.


- The **baseline’s test R² ≈ 0.36** means it explains only about 36% of the variance in rain rate, whereas each of the three fitted models explains more than 98%.
- Its **test RMSE (~7.19 mm/hr)** is roughly 7 to 40 times larger than that of the fitted models (**~0.18–0.99 mm/hr**).
- **Linear regression** already performs extremely well (**test R² ≈ 0.989**, RMSE ≈ 0.94 mm/hr).
- **Polynomial regression (degree = 2)** improves clearly on linear:
  - Test R² rises to **0.9996**
  - Test RMSE drops to **0.18 mm/hr**, about one fifth of the linear model’s error, with no sign of overfitting (train and test scores are nearly identical).
- **Random Forest** fits the training data very well but does not beat the polynomial model on the test set:
  - Test R² ≈ **0.988** (slightly below both linear and polynomial)
  - Test RMSE ≈ **0.986 mm/hr** (far worse than polynomial and slightly worse than linear). This is more than double its training RMSE of 0.425 mm/hr, which suggests some overfitting.


**Therefore:**
- The **best polynomial model (degree = 2)** outperforms the baseline, linear regression, and tuned Random Forest models in both **R²** and **RMSE** on the test set, making it the most accurate rain-rate estimator of those tested.
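
One simple way to quantify the overfitting comparison is the ratio of test RMSE to train RMSE for each model. The sketch below uses the values from the comparison table. The ratio is an informal diagnostic chosen for illustration, not a metric the assignment specifies.

```python
# Train/test RMSE (mm/hr) copied from the comparison table
rmse_table = {
    "Baseline (Z->R)": (7.143950, 7.189316),
    "LinearRegression": (0.922940, 0.935812),
    "Poly deg=2": (0.167173, 0.183564),
    "RandomForest (best)": (0.425229, 0.985775),
}

# A ratio near 1 means the model performs about as well on unseen data as on training data
ratios = {name: test / train for name, (train, test) in rmse_table.items()}
for name, ratio in ratios.items():
    print(f"{name:20s} test/train RMSE = {ratio:.2f}")

largest_gap = max(ratios, key=ratios.get)
print("Largest train-to-test gap:", largest_gap)
```

The baseline, linear, and polynomial models all stay within about 10% of a ratio of 1, while the Random Forest's test error is about 2.3 times its training error.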