---
title: Random Forest Hyperparameter Tuning
author: Andrei Akopian
date: 2025-10-06
format:
  html:
    code-fold: true
    code-summary: "Show the code"
  pdf:
    code-overflow: wrap
    echo: false
    output: true

---

# Random Forest Hyperparameter Tuning

Notion page: https://utat-ss.notion.site/Random-Forest-Hyperparameter-Tuning-2843e028b0ea80ff8d8bd341b8e7cda5

## Abstract

This report tests whether a Random Forest model can estimate npv, gv, and soil abundances from spectra, as part of the unmixing efforts at UTAT Science. The analysis was done in Python with the Pandas and Scikit-learn libraries on `simpler_data.csv` (no noise). Hyperparameters were tuned manually and with randomized search; an earlier grid search is covered under Background Information. The results are disappointing: Random Forest is prone to overfitting, with scores of $r^2<0.74$ on all abundances and $r^2<0.64$ on npv abundance alone. Curiously, noise (SNR = 180) has only a minor effect on Random Forest. Overall, Random Forest produces poor results and should be phased out.

## Background Information

Zoe previously performed a major grid search to tune Random Forest ([see Notion](https://utat-ss.notion.site/FAE-RF-Random-Forest-1b63e028b0ea80e3afcad34492232512)). That search did not produce a model significantly different from the ones I found with randomized and manual searches.

In [1]:
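import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import sklearn.ensemble          # submodules are not guaranteed to load with `import sklearn`
import sklearn.model_selection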

In [92]:
filenames = [
    "unmixing/simpler_data.csv",
    "unmixing/simpler_data_SNR_180.csv"
]
data = pd.read_csv(filenames[0])

In [83]:
#| code-summary: "Helper functions"
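def take_subset(df, start=1500, end=1650,
                abundances=("npv_fraction", "gv_fraction", "soil_fraction")):
    """Return (targets, spectra, spectra_sources) for bands with start <= name <= end."""
    # Spectral band columns are the ones whose names are plain integers
    wanted = [c for c in df.columns if c.isdigit() and start <= int(c) <= end]
    targets = df[list(abundances)]       # abundance columns to predict
    spectra = df[wanted]                 # model inputs
    spectra_sources = df[["Spectra"]]    # the `Spectra` column, kept as a DataFrame
    return targets, spectra, spectra_sources


def validate(model, train_X, validate_X, train_y, validate_y):
    """Print the model's R^2 on the training and validation splits."""
    print("Training R^2:", round(model.score(train_X, train_y), 4))
    print("Validation R^2:", round(model.score(validate_X, validate_y), 4))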

In [93]:
#| code-summary: "Data Preparation"
npv_fractions, spectra, spectra_sources = take_subset(data,start=900,end=1700,abundances=["npv_fraction"])
train_X, validate_X, train_y, validate_y = sklearn.model_selection.train_test_split(spectra, npv_fractions, test_size=0.2, random_state=42)

## Manual Tuning
More details on the rationale: https://utat-ss.notion.site/Random-Forest-Hyperparameter-Tuning-2843e028b0ea80ff8d8bd341b8e7cda5

I manually checked how the score responds to minor changes in each parameter. I concluded that every parameter has a simple effect: the score either has a single peak or simply plateaus. The parameters don't seem to interact in any interesting way, and changing one does not shift the peak location of another.
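
A one-parameter-at-a-time check like this can also be scripted. The following is a minimal sketch, not the procedure used in this report: it assumes synthetic data from `make_regression` in place of the spectra, and the swept values in `depths`, the small `n_estimators`, and `cv=3` are illustrative choices only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the spectra (assumption: not the report's data)
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

# Hypothetical sweep values for a single parameter
depths = [2, 4, 8, 16]

# Vary max_depth only, holding every other parameter fixed
train_scores, val_scores = validation_curve(
    RandomForestRegressor(n_estimators=30, random_state=42),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
    scoring="r2",
)

# Average over folds: one mean score per swept value
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for depth, tr, va in zip(depths, mean_train, mean_val):
    print(f"max_depth={depth:2d}  train R^2={tr:.3f}  validation R^2={va:.3f}")
```

A single peak or a plateau in the validation column would match the behavior described above; repeating the sweep for another `param_name` covers the remaining parameters.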

Best parameters:
```py
n_estimators=200,
max_depth=18,
max_features = 13,
min_samples_split=3,
min_samples_leaf=1,
min_impurity_decrease=0.0,
min_weight_fraction_leaf=0.0,
random_state = 42,
```

In [94]:
#| code-fold: false
#| echo: true
#| output: false
rf = sklearn.ensemble.RandomForestRegressor(
    n_estimators=200,
    max_depth=18,
    max_features = 13,
    min_samples_split=3,
    min_samples_leaf=1,
    min_impurity_decrease=0.0,
    min_weight_fraction_leaf=0.0,
    random_state = 42,
)
rf.fit(train_X, train_y)


In [95]:
#| echo: true
#| output: true
validate(rf,train_X,validate_X,train_y,validate_y)

Training R^2: 0.9392
Validation R^2: 0.6309


I cross-validated the model to inspect how much its scores vary.

Cross-validation scores of my best manual model (validation $r^2=0.6309$) on its own training data:

`array([0.59907805, 0.60968461, 0.56844624, 0.47771526, 0.61500608])`

All are lower than the validation $r^2$. One possible reason is that each fold is fitted on only four fifths of the training data, but I have not confirmed this.

The fold scores range from 0.48 to 0.62; even ignoring the low fourth fold, they vary by about $\pm 0.02$, so fighting for minor increases in $r^2$ is a pointless endeavor.

In [None]:
#| echo: true
sklearn.model_selection.cross_val_score(rf, train_X, train_y, cv=5)
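
The spread between folds can be summarized with a mean and a standard deviation. This is a minimal sketch that only assumes NumPy and reuses the five fold scores reported for the manual model; the variable names are illustrative.

```python
import numpy as np

# Fold scores reported for the best manual model
scores = np.array([0.59907805, 0.60968461, 0.56844624, 0.47771526, 0.61500608])

mean = scores.mean()
std = scores.std(ddof=1)  # sample standard deviation across the 5 folds

print(f"mean R^2 = {mean:.3f}, std = {std:.3f}")
print(f"min = {scores.min():.3f}, max = {scores.max():.3f}")
```

Any difference between two models that is smaller than this standard deviation is hard to distinguish from fold-to-fold noise.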

## Systematic Tuning

I mostly used `RandomizedSearchCV` for tuning because it is faster than grid search, and random sampling would quickly show signs of a good model if one exists.

Unfortunately, the randomized search found nothing with the potential to outperform my best manual model, so I never moved on to grid search.

Rough ranges I covered:

- `n_estimators`: tested 50 - 400; values above 200 show only minor improvements.
- `max_depth`: tested 5 - 20; the 13 - 18 range seems to work best, but nothing interesting.
- `max_features`: checked the full range of fractions from 0.1 to 1; apart from longer runtimes, nothing significant.
- `min_samples_split`: checked 2 - 10; low values work best, which agrees with my manual tuning.
- `min_samples_leaf`: checked 1 - 5; 1 is the best.
- `criterion`: checked the criteria available for Scikit-learn's model; they all perform about the same.
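
The ranges in the list above could be written as sampling distributions rather than fixed lists. This is a hedged sketch, not the search that was actually run: it assumes SciPy is available, uses synthetic data from `make_regression` in place of the spectra, and keeps `n_iter` and `cv` small so it finishes quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the report's spectra)
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

# Ranges taken from the list above; randint's upper bound is exclusive
param_distributions = {
    "n_estimators": randint(50, 401),     # 50 - 400
    "max_depth": randint(5, 21),          # 5 - 20
    "max_features": uniform(0.1, 0.9),    # fractions in [0.1, 1.0]
    "min_samples_split": randint(2, 11),  # 2 - 10
    "min_samples_leaf": randint(1, 6),    # 1 - 5
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,  # illustrative; kept small so the sketch runs quickly
    cv=3,
    random_state=42,
)
search.fit(X, y)

best = search.best_params_
print(best)
print("best mean CV R^2:", round(search.best_score_, 4))
```

Sampling from distributions lets each iteration try a new value of every parameter, which suits a quick look for signs of a good model.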

In [None]:
#| echo: true
#| output: false
#| code-overflow: wrap
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [15, 16, 17, 20],
    'max_features': [0.4,0.6],
    'min_samples_split': [2,3],
    'min_samples_leaf': [1, 2, 3],
    'criterion': ['squared_error']
}
# n_iter is left at its default, so 10 parameter combinations are sampled
random_reg = sklearn.model_selection.RandomizedSearchCV(
    sklearn.ensemble.RandomForestRegressor(),
    param_grid,
    cv=5,
    n_jobs=-1,
    random_state=42,
)
random_reg.fit(train_X,train_y)

In [77]:
#| output: true
#| echo: true
random_reg.best_params_

{'n_estimators': 200,
 'min_samples_split': 3,
 'min_samples_leaf': 1,
 'max_features': 0.4,
 'max_depth': 17,
 'criterion': 'squared_error'}

In [None]:
#| echo: true
# Validation of the tuning produced model
validate(random_reg,train_X,validate_X,train_y,validate_y)

Training R^2: 0.9634
Validation R^2: 0.7331


Cross-validation again shows variation between folds, which reassures me that chasing minor changes in $r^2$ is pointless.

Fold scores for the model with validation $r^2 = 0.7331$ (fitted on all abundances):

`array([0.7238703 , 0.72821315, 0.72174309, 0.69880877, 0.74258825])`

In [49]:
# cross_val_score takes no random_state argument, so it is not passed here
sklearn.model_selection.cross_val_score(random_reg.best_estimator_, train_X, train_y, cv=5)

array([0.7238703 , 0.72821315, 0.72174309, 0.69880877, 0.74258825])

### Manual after the Auto

I tried some manual tuning on all the models produced by the randomized searches. These attempts confirmed that the search was tuning for insignificant increases: neither the searches nor the resulting models produced anything interesting.

In [35]:
params = {'n_estimators': 200,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 0.8,
 'max_depth': 16,
 'criterion': 'squared_error'}

In [None]:
rf2 = sklearn.ensemble.RandomForestRegressor(**params)
rf2.fit(train_X, train_y)

In [37]:
validate(rf2,train_X, validate_X, train_y, validate_y)

Training R^2: 0.9369
Validation R^2: 0.6274


### Clarification on Results

The outputs above may be confusing, because I was switching between several variations (target columns and datasets) while running the cells.

Please refer to this summary table instead:

| Target \ Data | `simpler_data.csv` (no noise) | `simpler_data_SNR_180.csv` (SNR = 180) |
| --- | --- | --- |
| npv abundance only | $r^2<0.64$ | $r^2<0.62$ |
| all abundances | $r^2<0.74$ | $r^2<0.73$ |
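
The two rows are not measured on quite the same scale. For a multi-output regressor, `score` in recent scikit-learn releases returns the uniform average of the per-target $r^2$ values, so the "all abundances" figure mixes npv with the other targets. The sketch below illustrates this on synthetic three-target data. It does not use the report's data, and the names `combined` and `per_target` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic three-target stand-in for (npv, gv, soil); not the report's data
X, y = make_regression(
    n_samples=300, n_features=20, n_targets=3, noise=10.0, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Single number, as reported for "all abundances"
combined = model.score(X_val, y_val)

# One R^2 per target column
per_target = r2_score(y_val, model.predict(X_val), multioutput="raw_values")

print("combined R^2:  ", round(combined, 4))
print("per-target R^2:", np.round(per_target, 4))
```

Reporting the per-target values alongside the combined score would make the npv row directly comparable with the npv component of the multi-target model.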

Best model (other top models don't differ from it significantly):

```py
rf = sklearn.ensemble.RandomForestRegressor(
    n_estimators=200,
    max_depth=18,
    max_features = 13,
    min_samples_split=3,
    min_samples_leaf=1,
    min_impurity_decrease=0.0,
    min_weight_fraction_leaf=0.0,
    random_state = 42,
)
```

## Conclusion

The evidence shows that Random Forest has little potential here and produces poor results even on noise-free data.

My lingering concern is that I am not an expert in Random Forest or hyperparameter tuning, and there were some unexplained behaviors in cross-validation. Also, the random forests predicted the validation set best when they overfit the training data, which in theory means there is some unused potential. However, I reached my best models about 15 minutes after I started fitting random forests and could not make any breakthroughs in 3 hours of searching.

I believe it is safe to assume that Random Forest is a bad model for unmixing.