# ATMS 523

## Module 5 Project
### Sarah Henry

Fork this repository, and submit this code as a pull request back to GitHub by the date and time listed in Canvas.

For this assignment, use the dataset called `radar_parameters.csv` provided in the GitHub repository in the folder `homework`.

## Dataset Description

The training data consist of polarimetric radar parameters calculated from several years of disdrometer measurements in Huntsville, Alabama. (A disdrometer is an instrument that measures raindrop sizes, shapes, and rainfall rate.) A model called `pytmatrix` is used to calculate the polarimetric radar parameters from the droplet observations, which makes it possible to compare what a remote sensing instrument would see with the observed rainfall.

## Data columns

Features (radar measurements):

`Zh` - radar reflectivity factor (dBZ) - use the formula $dBZ = 10\log_{10}(Z)$

`Zdr` - differential reflectivity

`Ldr` - linear depolarization ratio

`Kdp` - specific differential phase

`Ah` - specific attenuation

`Adp` - differential attenuation

Target:

`R` - rain rate

In [9]:
# import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


1. Split the data into a 70-30 split for training and testing data.

In [2]:
# load data
df = pd.read_csv("./homework/radar_parameters.csv")
target = ["R (mm/hr)"]
predictors = df.drop(columns="R (mm/hr)").columns.tolist()

# split data
X = df[predictors].values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(len(X_train), len(X_test))

13278 5691


2. Using the split created in (1), train a multiple linear regression model using the training dataset, and validate it using the testing dataset.  Compare the $R^2$ and root mean square errors of the model on the training and testing sets to a baseline prediction of rain rate using the formula $Z = 200 R^{1.6}$.

In [3]:
# train
model = LinearRegression()
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# validate
def rmse(a, b):
    return np.sqrt(mean_squared_error(a, b))

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

rmse_train = rmse(y_train, y_train_pred)
rmse_test = rmse(y_test, y_test_pred)

print("validation:")
print(f"R^2 train: {r2_train}, R^2 test: {r2_test}")
print(f"RMSE train: {rmse_train}, RMSE test: {rmse_test}")

# compare to baseline: invert Z = 200 R^1.6 to get R = (Z / 200)^(1 / 1.6),
# where Z = 10^(dBZ / 10) converts reflectivity from dBZ to linear units
zh_index = predictors.index("Zh (dBZ)")
Zh_train = X_train[:, zh_index]
Zh_test = X_test[:, zh_index]

R_baseline_train = (10 ** (Zh_train / 10) / 200) ** (1 / 1.6)
R_baseline_test = (10 ** (Zh_test / 10) / 200) ** (1 / 1.6)

r2_base_train = r2_score(y_train, R_baseline_train)
r2_base_test = r2_score(y_test, R_baseline_test)

rmse_base_train = rmse(y_train, R_baseline_train)
rmse_base_test = rmse(y_test, R_baseline_test)

print("\ncomparison to baseline:")
print(f"R^2 train: {r2_base_train}, R^2 test: {r2_base_test}")
print(f"RMSE train: {rmse_base_train}, RMSE test: {rmse_base_test}")

validation:
R^2 train: 0.9888357865565246, R^2 test: 0.9868605147786397
RMSE train: 0.9146705347774785, RMSE test: 0.9583373917841838

comparison to baseline:
R^2 train: 0.3325181229088806, R^2 test: 0.22661047398943468
RMSE train: 7.072457766822083, RMSE test: 7.3523877227693095
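
The baseline above inverts the Z-R relation: with $Z = 10^{dBZ/10}$ in linear units, $Z = 200 R^{1.6}$ gives $R = (Z/200)^{1/1.6}$. A minimal sketch of that conversion as a reusable function is shown here; the name `zr_baseline_rain_rate` and its keyword arguments `a` and `b` are illustrative and not part of the original notebook.

```python
import numpy as np


def zr_baseline_rain_rate(zh_dbz, a=200.0, b=1.6):
    """Estimate rain rate R from reflectivity in dBZ by inverting Z = a * R**b.

    Hypothetical helper for illustration; the defaults follow Z = 200 R^1.6.
    """
    # Convert reflectivity from dBZ to linear units: Z = 10^(dBZ / 10)
    z_linear = 10.0 ** (np.asarray(zh_dbz, dtype=float) / 10.0)
    # Invert the power law: R = (Z / a)^(1 / b)
    return (z_linear / a) ** (1.0 / b)


# Sanity check: Z = 200 (about 23 dBZ) should map to a rain rate of 1
dbz_for_unit_rain = 10.0 * np.log10(200.0)
print(zr_baseline_rain_rate(dbz_for_unit_rain))
```

Because the function accepts arrays, it could be applied directly to the `Zh (dBZ)` column of the training or testing features.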


3. Repeat 1 using polynomial regression, with a grid search over polynomial orders 0-9 and 7-fold cross-validation.  For the best polynomial model in terms of $R^2$, does it outperform the baseline and the linear regression model in terms of $R^2$ and root mean square error?


In [15]:
def PolynomialRegression(degree=2, **kwargs):
    include_bias = (degree == 0)
    return make_pipeline(
        PolynomialFeatures(degree, include_bias=include_bias),
        LinearRegression(**kwargs)
    )

# only degrees 0-2 are re-run in this cell;
# the result of the full 0-9 search is reported below the output
degrees = np.arange(0, 2 + 1)
val_r2 = []

for d in degrees:
    print(f"Evaluating degree {d}...")
    model = PolynomialRegression(degree=d)
    scores = cross_val_score(model, X_train, y_train.ravel(), cv=7, scoring='r2', n_jobs=-1)
    val_r2.append(scores.mean())

best_degree = degrees[np.argmax(val_r2)]
print(f"\nBest degree (CV R^2): {best_degree}")
print(f"Mean CV R^2 for best degree: {max(val_r2)}")

Evaluating degree 0...
Evaluating degree 1...
Evaluating degree 2...

Best degree (CV R^2): 2
Mean CV R^2 for best degree: 0.9989829546605608


Output from all degrees 0-9:

Best degree (CV R²): 2

Mean CV R² for best degree: 0.999
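
The cross-validated $R^2$ above selects the degree, but comparing against the linear regression and the baseline also requires $R^2$ and RMSE on the held-out test set. The sketch below shows one way to do that with `GridSearchCV`. It is an assumption-laden illustration: it uses small synthetic data rather than `radar_parameters.csv`, searches degrees 1-3 only, and the names `X_demo`, `y_demo`, and `best_poly` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption: NOT the radar dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(300, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] ** 2 + rng.normal(0, 0.05, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Pipeline: polynomial feature expansion followed by ordinary least squares
pipe = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())

# Grid search over the polynomial degree with 7-fold cross-validation
grid = GridSearchCV(
    pipe,
    param_grid={"polynomialfeatures__degree": [1, 2, 3]},
    cv=7,
    scoring="r2",
)
grid.fit(X_tr, y_tr)

# best_estimator_ is refit on the full training split; score it on the test split
best_poly = grid.best_estimator_
y_te_pred = best_poly.predict(X_te)
poly_r2_test = r2_score(y_te, y_te_pred)
poly_rmse_test = np.sqrt(mean_squared_error(y_te, y_te_pred))

print("Best degree:", grid.best_params_["polynomialfeatures__degree"])
print(f"R^2 test: {poly_r2_test:.3f}, RMSE test: {poly_rmse_test:.3f}")
```

Applied to the real `X_train`, `X_test`, `y_train`, and `y_test`, the same pattern would give test-set numbers that can be compared directly with those from question 2.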


4. Repeat 1 with a Random Forest Regressor, and perform a grid search on the following parameters:
   
   ```python
   param_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 100],
    "max_features": ["sqrt", 1.0],  
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
    "n_estimators": [200, 1000]}
   ```
  Can you beat the baseline, or the linear regression, or best polynomial model with the best optimized Random Forest Regressor in terms of $R^2$ and root mean square error?

In [14]:
# Full grid search from the assignment, left commented out because it took too long to run.
# param_grid = {
#     "bootstrap": [True, False],
#     "max_depth": [10, 100],
#     "max_features": ["sqrt", 1.0],
#     "min_samples_leaf": [1, 4],
#     "min_samples_split": [2, 10],
#     "n_estimators": [200, 1000]
# }

# rf = RandomForestRegressor(random_state=0)

# grid_rf = GridSearchCV(
#     estimator=rf,
#     param_grid=param_grid,
#     cv=7,
#     scoring='r2',
#     n_jobs=-1,
#     verbose=1
# )

# grid_rf.fit(X_train, y_train.ravel())

# best_rf = grid_rf.best_estimator_
# print("Best RF params:")
# print(grid_rf.best_params_)

# y_train_rf = best_rf.predict(X_train)
# y_test_rf = best_rf.predict(X_test)

# r2_rf_train = r2_score(y_train, y_train_rf)
# r2_rf_test = r2_score(y_test, y_test_rf)
# rmse_rf_train = rmse(y_train, y_train_rf)
# rmse_rf_test = rmse(y_test, y_test_rf)

# print("\nRandom Forest Regressor Performance:")
# print(f"R^2 train: {r2_rf_train}, R^2 test: {r2_rf_test}")
# print(f"RMSE train: {rmse_rf_train}, RMSE test: {rmse_rf_test}")

# Faster substitute for the full grid search above:
# a randomized search over a reduced parameter space
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

rf = RandomForestRegressor(random_state=0)

# note: randint(low, high) samples integers in [low, high), so high is excluded
param_dist = {
    "bootstrap": [True],
    "max_depth": randint(5, 30),           # smaller trees = faster
    "max_features": ["sqrt"],
    "min_samples_leaf": randint(1, 4),     # 1 to 3
    "min_samples_split": randint(2, 5),    # 2 to 4
    "n_estimators": randint(50, 150)       # fewer trees = much faster
}

rand_rf = RandomizedSearchCV(
    rf,
    param_distributions=param_dist,
    n_iter=10, # fewer combinations
    cv=3, # fewer folds
    scoring='r2',
    n_jobs=-1,
    random_state=0,
    verbose=1
)

rand_rf.fit(X_train, y_train.ravel())

best_rf = rand_rf.best_estimator_
print("Best RF params:")
print(rand_rf.best_params_)

y_train_rf = best_rf.predict(X_train)
y_test_rf = best_rf.predict(X_test)

r2_rf_train = r2_score(y_train, y_train_rf)
r2_rf_test = r2_score(y_test, y_test_rf)
rmse_rf_train = rmse(y_train, y_train_rf)
rmse_rf_test = rmse(y_test, y_test_rf)

print("\nRandom Forest Regressor Performance:")
print(f"R^2 train: {r2_rf_train:.3f}, R^2 test: {r2_rf_test:.3f}")
print(f"RMSE train: {rmse_rf_train:.3f}, RMSE test: {rmse_rf_test:.3f}")


Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best RF params:
{'bootstrap': True, 'max_depth': 14, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 119}

Random Forest Regressor Performance:
R^2 train: 0.996, R^2 test: 0.970
RMSE train: 0.547, RMSE test: 1.459


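No, the random forest model does not beat the degree-2 polynomial model: its test $R^2$ (0.970) is below the polynomial model's mean cross-validated $R^2$ (0.999). It is also slightly worse than the linear regression (test $R^2$ 0.987, RMSE 0.958 versus 1.459), although it beats the baseline by a wide margin (test $R^2$ 0.227, RMSE 7.352). I would guess that there might be some overfitting with the degree-2 polynomial model, but since the degree is low and it cross-validates well, it looks solid. The random forest model is still robust, and if I had been able to run the full grid search (it was taking far too long), it might improve a little.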
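
For a sense of why the full grid search was so slow, the number of model fits it requires can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that only counts combinations and trains nothing; the variable names `n_candidates`, `n_folds`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Parameter grid from the assignment
param_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 100],
    "max_features": ["sqrt", 1.0],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
    "n_estimators": [200, 1000],
}

# Every combination of the six two-valued parameters is one candidate
n_candidates = len(ParameterGrid(param_grid))
n_folds = 7
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Half of those candidates train 1000 trees each, whereas the randomized search used here needed only 30 fits of at most 149 trees.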