**Zachary Torstrick**  
**HW 5**  
**ATMS 523**  
**Fall 2025**  

## Module 5 Project

Fork this repository, and submit this code as a pull request back to GitHub by the date and time listed in Canvas.

For this assignment, use the dataset called `radar_parameters.csv` provided in the GitHub repository in the folder `homework`.

### Dataset Description

The training data consist of polarimetric radar parameters calculated from several years of disdrometer measurements in Huntsville, Alabama. (A disdrometer is an instrument that measures raindrop sizes, shapes, and rainfall rate.) A model called `pytmatrix` is used to calculate the polarimetric radar parameters from the drop observations, which makes it possible to compare what a remote sensing instrument would see against the observed rainfall.

### Data columns

Features (radar measurements):

`Zh` - radar reflectivity factor (dBZ), related to the linear reflectivity factor $Z$ by $dBZ = 10\log_{10}(Z)$

`Zdr` - differential reflectivity

`Ldr` - linear depolarization ratio

`Kdp` - specific differential phase

`Ah` - specific attenuation

`Adp` - differential attenuation

Target:

`R` - rain rate (mm/hr)
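The decibel relation given for `Zh` can be inverted to recover the linear reflectivity factor, which the Z-R baseline in this notebook needs. A minimal sketch of the round trip, where the helper names `linear_to_dbz` and `dbz_to_linear` are illustrative and not part of the dataset or the assignment:

```python
import math


def linear_to_dbz(z_linear):
    """Convert linear reflectivity factor Z to decibels: dBZ = 10 * log10(Z)."""
    return 10.0 * math.log10(z_linear)


def dbz_to_linear(dbz):
    """Invert the decibel relation: Z = 10 ** (dBZ / 10)."""
    return 10.0 ** (dbz / 10.0)


z = dbz_to_linear(30.0)
print(z)                 # 1000.0
print(linear_to_dbz(z))  # back to 30 dBZ, up to floating-point rounding
```

Every 10 dBZ step corresponds to a factor of 10 in linear $Z$, so the linear values span several orders of magnitude across a typical rain dataset.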

In [2]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv('homework/radar_parameters.csv')

X = df.drop('R (mm/hr)', axis=1)
y = df['R (mm/hr)']   

##############################################################################
## 1. Split the data into a 70-30 split for training and testing data.
##############################################################################
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    random_state=0,
    train_size=0.7)

##############################################################################
### 2. Using the split created in (1), train a multiple linear regression model 
###    using the training dataset, and validate it using the testing dataset.  
###    Compare the $R^2$ and root mean square errors of the model on the training and 
###    testing sets to a baseline prediction of rain rate using the formula 
###    $Z = 200 R^{1.6}$.
##############################################################################

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
predict_test = model.predict(X_test)
predict_train = model.predict(X_train)

# Evaluate R² and RMSE
r2_train = r2_score(y_train, predict_train)
rmse_train = np.sqrt(mean_squared_error(y_train, predict_train))
r2_test = r2_score(y_test, predict_test)
rmse_test = np.sqrt(mean_squared_error(y_test, predict_test))

# Convert dBZ to linear Z
# dBZ = 10 * log10(Z)  →  Z = 10^(dBZ/10)
Zh_train_dBZ = X_train['Zh (dBZ)']
Zh_test_dBZ = X_test['Zh (dBZ)']
Z_train = 10 ** (Zh_train_dBZ / 10)
Z_test = 10 ** (Zh_test_dBZ / 10)

# Solve for R: R = (Z / 200)^(1/1.6)
baseline_train = (Z_train / 200) ** (1 / 1.6)
baseline_test = (Z_test / 200) ** (1 / 1.6)

# Score the baseline predictions against the observed rain rates
r2_baseline_train = r2_score(y_train, baseline_train)
rmse_baseline_train = np.sqrt(mean_squared_error(y_train, baseline_train))
r2_baseline_test = r2_score(y_test, baseline_test)
rmse_baseline_test = np.sqrt(mean_squared_error(y_test, baseline_test))

print("=== Baseline (Z-R Relationship) ===")
print(f"Training R²: {r2_baseline_train:.4f}, RMSE: {rmse_baseline_train:.4f}")
print(f"Testing R²: {r2_baseline_test:.4f}, RMSE: {rmse_baseline_test:.4f}")
print('\n')
print("=== Linear Regression ===")
print(f"Training R²: {r2_train:.4f}, RMSE: {rmse_train:.4f}")
print(f"Testing R²: {r2_test:.4f}, RMSE: {rmse_test:.4f}")


=== Baseline (Z-R Relationship) ===
Training R²: 0.3325, RMSE: 7.0725
Testing R²: 0.2266, RMSE: 7.3524


=== Linear Regression ===
Training R²: 0.9888, RMSE: 0.9147
Testing R²: 0.9869, RMSE: 0.9583


In [3]:
##############################################################################
### 3. Repeat 1 with a grid search over polynomial orders 0-9, using 
###    7-fold cross-validation.  Does the best polynomial model in terms of 
###    $R^2$ outperform the baseline and the linear regression model in terms 
###    of $R^2$ and root mean square error?
##############################################################################

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

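# NOTE: np.arange(0, 9) yields degrees 0 through 8. The prompt asks for
# orders 0-9, which would need np.arange(0, 10); the run recorded below
# used the 0-8 grid (9 candidates).
param_grid = {
    'polynomialfeatures__degree': np.arange(0, 9)
}

# Set up GridSearchCV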
grid = GridSearchCV(
    PolynomialRegression(),  # The model pipeline
    param_grid,              # Parameters to search
    cv=7,                    # 7-fold cross-validation
    scoring='r2',            # Optimize for R²
    verbose=2                # Show progress
)

# Fit the grid search
grid.fit(X_train, y_train)

# Get the best model
best_poly_model = grid.best_estimator_
best_degree = grid.best_params_['polynomialfeatures__degree']
print(f"Best polynomial degree: {best_degree}")

# Make predictions with the best model
grid_train = best_poly_model.predict(X_train)
grid_test = best_poly_model.predict(X_test)
r2_grid_train = r2_score(y_train, grid_train)
rmse_grid_train = np.sqrt(mean_squared_error(y_train, grid_train))
r2_grid_test = r2_score(y_test, grid_test)
rmse_grid_test = np.sqrt(mean_squared_error(y_test, grid_test))

print("\n=== Best Polynomial Model ===")
print(f"Training R²: {r2_grid_train:.4f}, RMSE: {rmse_grid_train:.4f}")
print(f"Testing R²: {r2_grid_test:.4f}, RMSE: {rmse_grid_test:.4f}")

# Compare all three models on the test set
print("\n=== COMPARISON ===")
print(f"Baseline:          R²={r2_baseline_test:.4f}, RMSE={rmse_baseline_test:.4f}")
print(f"Linear Regression: R²={r2_test:.4f}, RMSE={rmse_test:.4f}")
print(f"Polynomial (deg {best_degree}): R²={r2_grid_test:.4f}, RMSE={rmse_grid_test:.4f}")


Fitting 7 folds for each of 9 candidates, totalling 63 fits
[CV] END .......................polynomialfeatures__degree=0; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=0; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=0; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=0; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=0; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=0; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=0; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=1; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=1; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=1; total time=   0.0s
[CV] END .......................polynomialfeatures__degree=1; total time=   0.0s
[CV] END .......................polynomialfeature
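One thing to keep in mind when reading the grid search results is that the number of columns produced by `PolynomialFeatures` grows quickly with degree. For $n$ input features and degree $d$, with the bias column included (the default), the expansion has $\binom{n+d}{d}$ terms. The sketch below tabulates this under the assumption that `X` holds exactly the six radar features listed under "Data columns"; the helper name `n_poly_terms` is illustrative.

```python
from math import comb

N_FEATURES = 6  # assumed: Zh, Zdr, Ldr, Kdp, Ah, Adp


def n_poly_terms(n_features, degree):
    """Number of monomials of total degree <= degree in n_features variables."""
    return comb(n_features + degree, degree)


# range(0, 9) mirrors np.arange(0, 9): degrees 0 through 8 (9 is excluded)
for degree in range(0, 9):
    print(degree, n_poly_terms(N_FEATURES, degree))
```

Under that assumption, degree 1 gives 7 columns and degree 8 gives 3003, so the higher-degree candidates have far more coefficients to fit and are the most prone to overfitting within each fold.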

In [4]:
##############################################################################
### 4. Repeat 1 with a Random Forest Regressor, 
###    and perform a grid_search on the following parameters:
###   
###   ```python
###   param_grid = {
###     "bootstrap": [True, False],
###     "max_depth": [10, 100],
###     "max_features": ["sqrt", 1.0],  
###     "min_samples_leaf": [1, 4],
###     "min_samples_split": [2, 10],
###     "n_estimators": [200, 1000]}
###   ```
###  Can you beat the baseline, or the linear regression, or best polynomial 
###  model with the best optimized Random Forest Regressor in terms of $R^2$ and 
###  root mean square error?
##############################################################################

param_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 100],
    "max_features": ["sqrt", 1.0],  
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
    "n_estimators": [200, 1000]
}

rf_model = RandomForestRegressor(random_state=0)

# Set up GridSearchCV with 7-fold cross-validation
grid_rf = GridSearchCV(
    rf_model,
    param_grid,
    cv=7,
    scoring='r2',
    verbose=2,  # Shows progress
    n_jobs=-1   # Use all CPU cores to speed it up
)

# Fit the grid search
grid_rf.fit(X_train, y_train)

best_rf_model = grid_rf.best_estimator_
best_params = grid_rf.best_params_
print(best_params)

# Make predictions with best model
rf_train_pred = best_rf_model.predict(X_train)
rf_test_pred = best_rf_model.predict(X_test)

# Evaluate
r2_rf_train = r2_score(y_train, rf_train_pred)
rmse_rf_train = np.sqrt(mean_squared_error(y_train, rf_train_pred))
r2_rf_test = r2_score(y_test, rf_test_pred)
rmse_rf_test = np.sqrt(mean_squared_error(y_test, rf_test_pred))

print("\n=== Best Random Forest Model ===")
print(f"Training R²: {r2_rf_train:.4f}, RMSE: {rmse_rf_train:.4f}")
print(f"Testing R²: {r2_rf_test:.4f}, RMSE: {rmse_rf_test:.4f}")

# Final comparison of all models
print("\n" + "="*50)
print("FINAL MODEL COMPARISON")
print("="*50)
print(f"Baseline:          R²={r2_baseline_test:.4f}, RMSE={rmse_baseline_test:.4f}")
print(f"Linear Regression: R²={r2_test:.4f}, RMSE={rmse_test:.4f}")
print(f"Polynomial (deg {best_degree}): R²={r2_grid_test:.4f}, RMSE={rmse_grid_test:.4f}")
print(f"Random Forest:     R²={r2_rf_test:.4f}, RMSE={rmse_rf_test:.4f}")

Fitting 7 folds for each of 64 candidates, totalling 448 fits


[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.3s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.4s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.4s
[CV] END bootstrap=True, max_depth=10, max_featu



[CV] END bootstrap=True, max_depth=100, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=1000; total time=  34.8s
[CV] END bootstrap=True, max_depth=100, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=1000; total time=  34.9s
[CV] END bootstrap=True, max_depth=100, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=1000; total time=  35.2s
[CV] END bootstrap=True, max_depth=100, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=   5.5s
[CV] END bootstrap=True, max_depth=100, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=   5.4s
[CV] END bootstrap=True, max_depth=100, max_features=sqrt, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=   5.4s
[CV] END bootstrap=True, max_depth=100, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=1000; total time=  35.7s
[CV] END bootstrap=True, max_depth
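The log header "Fitting 7 folds for each of 64 candidates, totalling 448 fits" follows from the grid itself: six hyperparameters with two values each give $2^6 = 64$ combinations, and each combination is fit once per fold. A standard-library sketch that enumerates the same combinations, using `itertools.product` only to keep the example dependency-free:

```python
from itertools import product

param_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 100],
    "max_features": ["sqrt", 1.0],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
    "n_estimators": [200, 1000],
}
n_folds = 7

# Every combination of one value per hyperparameter
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

print(len(candidates))            # 64
print(len(candidates) * n_folds)  # 448
```

The `n_estimators=1000` fits dominate the run time: in the log above they take roughly 35 s each, compared with about 4 to 5 s for the 200-tree fits.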