For this assignment, use the dataset called `radar_parameters.csv` provided in the GitHub repository in the folder `homework`.

## Dataset Description

The training data consist of polarimetric radar parameters calculated from several years of measurements by a disdrometer (an instrument that measures raindrop sizes, shapes, and rainfall rate) in Huntsville, Alabama. A model called `pytmatrix` is used to calculate the polarimetric radar parameters from the droplet observations, which makes it possible to compare what a remote sensing instrument would see with the observed rainfall.

## Data columns

Features (radar measurements):

`Zh` - radar reflectivity factor (dBZ); to convert from linear reflectivity $Z$, use the formula $dBZ = 10\log_{10}(Z)$

`Zdr` - differential reflectivity

`Ldr` - linear depolarization ratio

`Kdp` - specific differential phase

`Ah` - specific attenuation

`Adp` - differential attenuation

Target:

`R` - rain rate

## Your assignment

1. Split the data into a 70-30 split for training and testing data.

2. Using the split created in (1), train a multiple linear regression model using the training dataset, and validate it using the testing dataset. Compare the $R^2$ and root mean square error of the model on the training and testing sets to a baseline prediction of rain rate using the formula $Z = 200 R^{1.6}$.

3. Repeat (1), then run a grid search over polynomial orders 0-9 using 7-fold cross-validation. Does the best polynomial model in terms of $R^2$ outperform the baseline and the linear regression model in terms of $R^2$ and root mean square error?

4. Repeat (1) with a Random Forest Regressor, and perform a grid search over the following parameters:
   
   ```python
   {'bootstrap': [True, False],  
   'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],  
   'max_features': ['auto', 'sqrt'],  
   'min_samples_leaf': [1, 2, 4],  
   'min_samples_split': [2, 5, 10],  
   'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
   ```
  Can you beat the baseline, or the linear regression, or best polynomial model with the best optimized Random Forest Regressor in terms of $R^2$ and root mean square error?
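
A worked inversion of the baseline in item 2 may help, since `Zh` is stored in dBZ while the formula uses linear reflectivity $Z$. The units below ($Z$ in mm$^6$ m$^{-3}$, $R$ in mm/hr) are the conventional ones for this relation and are an assumption here, because the assignment text does not state them.

$$
Z = 10^{\mathrm{dBZ}/10}, \qquad Z = 200\,R^{1.6} \;\Longrightarrow\; R = \left(\frac{Z}{200}\right)^{1/1.6}
$$

For example, a reflectivity of 40 dBZ gives $Z = 10^{4}$, so $R = (10^{4}/200)^{1/1.6} = 50^{0.625} \approx 11.5$ mm/hr.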


### Q1

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('/home/kiaracr2/module-5-assignment/ATMS-523-Module-5/homework/radar_parameters.csv').dropna()

# Separate features (X) and target (y)
X = df.drop(columns='R (mm/hr)')   # all predictors
y = df['R (mm/hr)']                # target variable

# Split into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Confirm the split
print("Training set:", X_train.shape, y_train.shape)
print("Testing set:", X_test.shape, y_test.shape)


Training set: (13278, 7) (13278,)
Testing set: (5691, 7) (5691,)
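
The printed shapes show seven predictor columns, one more than the six features listed in the dataset description. One possible explanation, which is an assumption and has not been verified against `radar_parameters.csv`, is a saved row index that pandas reads back as a column named `Unnamed: 0`. The sketch below uses a small made-up CSV (the values, the `Zdr (dB)` header, and the `csv_text` name are illustrative only) to show how such a column appears and how `index_col=0` keeps it out of the predictors.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: the empty first header is a saved row index.
csv_text = """,Zh (dBZ),Zdr (dB),R (mm/hr)
0,20.0,0.5,1.2
1,35.0,1.1,6.8
2,42.0,1.8,15.3
"""

# Default read: the unnamed index is loaded as an ordinary column.
df_default = pd.read_csv(io.StringIO(csv_text))
print(list(df_default.columns))

# Treat the first column as the row index so it cannot leak into X.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
X_demo = df_indexed.drop(columns="R (mm/hr)")
print(list(X_demo.columns))
```

Printing `X.columns` on the real data would confirm or rule out this explanation before any model is trained.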


### Q2

In [11]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# --- 1. Train multiple linear regression ---
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Predictions
y_train_pred = linreg.predict(X_train)
y_test_pred = linreg.predict(X_test)

# --- 2. Evaluate model ---
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

print("Linear Regression Performance:")
print(f"Train R^2: {r2_score(y_train, y_train_pred):.3f}, RMSE: {rmse(y_train, y_train_pred):.3f}")
print(f"Test  R^2: {r2_score(y_test, y_test_pred):.3f}, RMSE: {rmse(y_test, y_test_pred):.3f}")

# --- 3. Baseline using Z-R relationship ---
# Convert dBZ to linear Z
Z_train_linear = 10 ** (X_train['Zh (dBZ)'] / 10)
Z_test_linear = 10 ** (X_test['Zh (dBZ)'] / 10)

# Predict rain rate from Z using the empirical formula
y_train_baseline = (Z_train_linear / 200) ** (1/1.6)
y_test_baseline = (Z_test_linear / 200) ** (1/1.6)

print("\nBaseline (Z=200R^1.6) Performance:")
print(f"Train R^2: {r2_score(y_train, y_train_baseline):.3f}, RMSE: {rmse(y_train, y_train_baseline):.3f}")
print(f"Test  R^2: {r2_score(y_test, y_test_baseline):.3f}, RMSE: {rmse(y_test, y_test_baseline):.3f}")


Linear Regression Performance:
Train R^2: 0.988, RMSE: 0.923
Test  R^2: 0.989, RMSE: 0.936

Baseline (Z=200R^1.6) Performance:
Train R^2: 0.276, RMSE: 7.144
Test  R^2: 0.357, RMSE: 7.189
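
For reference, the two scores printed above can be computed directly from their definitions: $R^2 = 1 - SS_{res}/SS_{tot}$, and RMSE is the square root of the mean squared residual. The sketch below uses NumPy only; the function names `r2_manual` and `rmse_manual` and the toy arrays are illustrative and not part of the assignment.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return float(1.0 - ss_res / ss_tot)


def rmse_manual(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values, not taken from the radar dataset.
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(f"R^2: {r2_manual(y_obs, y_hat):.3f}, RMSE: {rmse_manual(y_obs, y_hat):.3f}")
```

Because $R^2$ measures improvement over always predicting the mean, a fixed formula such as the Z-R baseline can score low when its errors are large relative to the variance of the observed rain rate.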


### Q3

In [14]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import r2_score, mean_squared_error

# --- 1. Build pipeline: PolynomialFeatures + LinearRegression ---
pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),   # scaling helps with higher-order terms
    ("linreg", LinearRegression())
])

# --- 2. Define parameter grid for polynomial degree 0-9 ---
param_grid = {
    "poly__degree": list(range(0, 10))
}

# --- 3. Cross-validation setup (7 folds) ---
cv = KFold(n_splits=7, shuffle=True, random_state=42)

# --- 4. Grid search ---
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="r2", n_jobs=-1)
grid.fit(X_train, y_train)

# --- 5. Best polynomial model ---
best_model = grid.best_estimator_
best_degree = grid.best_params_["poly__degree"]
print(f"Best polynomial degree: {best_degree}")

# Predictions
y_train_poly = best_model.predict(X_train)
y_test_poly = best_model.predict(X_test)

# --- 6. Evaluate best polynomial model ---
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

print("\nBest Polynomial Model Performance:")
print(f"Train R^2: {r2_score(y_train, y_train_poly):.3f}, RMSE: {rmse(y_train, y_train_poly):.3f}")
print(f"Test  R^2: {r2_score(y_test, y_test_poly):.3f}, RMSE: {rmse(y_test, y_test_poly):.3f}")

7 fits failed out of a total of 70.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
7 fits failed with the following error:
Traceback (most recent call last):
  File "/home/kiaracr2/envs/xarray-climate/lib/python3.13/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kiaracr2/envs/xarray-climate/lib/python3.13/site-packages/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/kiaracr2/envs/xarray-climate/lib/python3.13/site-packages/sklearn/pipeline.py", line 655, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
  File "/home/k

Best polynomial degree: 2

Best Polynomial Model Performance:
Train R^2: 1.000, RMSE: 0.166
Test  R^2: 1.000, RMSE: 0.183


Key Takeaway:

The grid search picked polynomial degree 2 as best.

That model gave very high $R^2$ (~1.0) and very low RMSE on both train and test.

The best polynomial model clearly outperforms both the baseline (Z-R formula) and the plain linear regression in terms of $R^2$ and RMSE.

### Q4

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# --- 1. Define the model ---
rf = RandomForestRegressor(random_state=42)

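# --- 2. Define the parameter grid (a reduced subset of the assignment's grid) ---
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 100, None],
    # 1.0 uses all features, which is what 'auto' meant for regressors
    # before it was removed from scikit-learn
    'max_features': ['sqrt', 1.0],
    'min_samples_leaf': [1, 4],
    'min_samples_split': [2, 10],
    'n_estimators': [200, 1000]
}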

# --- 3. Cross-validation setup ---
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# --- 4. Grid search ---
grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=cv,
    scoring='r2',
    n_jobs=-1,
    verbose=2
)

grid.fit(X_train, y_train)

# --- 5. Best model ---
best_rf = grid.best_estimator_
print("Best parameters:", grid.best_params_)

# --- 6. Evaluate ---
y_train_rf = best_rf.predict(X_train)
y_test_rf = best_rf.predict(X_test)

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

print("\nRandom Forest Performance:")
print(f"Train R^2: {r2_score(y_train, y_train_rf):.3f}, RMSE: {rmse(y_train, y_train_rf):.3f}")
print(f"Test  R^2: {r2_score(y_test, y_test_rf):.3f}, RMSE: {rmse(y_test, y_test_rf):.3f}")


Fitting 5 folds for each of 96 candidates, totalling 480 fits
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.2s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.2s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.3s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.3s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   4.2s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=1000; total time=  21.2s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=1000; tot
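
The log reports 96 candidates because the grid in this cell is a reduced version of the one given in the assignment. A quick count using only the standard library shows the difference in cost. The dictionary names `assignment_grid` and `reduced_grid` and the helper `n_candidates` are illustrative, and the 5-fold figure matches the `KFold` setup used in this cell.

```python
from math import prod

# Grid as written in the assignment statement.
assignment_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features": ["auto", "sqrt"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

# Reduced grid actually searched in this cell.
reduced_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 100, None],
    "max_features": ["sqrt", 1.0],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
    "n_estimators": [200, 1000],
}


def n_candidates(grid):
    """Number of parameter combinations an exhaustive grid search tries."""
    return prod(len(values) for values in grid.values())


n_folds = 5
for name, grid in [("assignment", assignment_grid), ("reduced", reduced_grid)]:
    count = n_candidates(grid)
    print(f"{name}: {count} candidates, {count * n_folds} fits")
```

The full grid is roughly 40 times larger than the reduced one, which is a plausible reason for trimming it.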

Key takeaways:

- Yes: Random Forest beats the baseline and linear regression.

- No: It doesn't quite beat the best polynomial model in this dataset.

- But: Random Forest is often the safer choice in real-world scenarios, since it handles noise and nonlinearities without exploding feature counts.
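
To make the point about exploding feature counts concrete: a full polynomial expansion of $n$ inputs up to degree $d$ has $\binom{n+d}{d} - 1$ non-constant terms. The sketch below assumes $n = 7$, taken from the predictor shape printed in Q1; the function name `n_poly_terms` is illustrative.

```python
from math import comb


def n_poly_terms(n_features, degree):
    """Number of polynomial terms up to `degree`, excluding the constant term."""
    return comb(n_features + degree, degree) - 1


# Growth in column count for 7 inputs across a few polynomial degrees.
for degree in (1, 2, 5, 9):
    print(degree, n_poly_terms(7, degree))
```

At degree 2 the expansion is still small (35 columns), but at degree 9 it exceeds eleven thousand columns, which approaches the number of training rows.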