# Exploring RandomForest Models for Crop Classification - DLBS 

## Introduction

This notebook evaluates RandomForest models for crop classification on the ZueriCrop dataset. It compares several configurations, varying the number of estimators and the class-weight settings, to see how these hyperparameters affect accuracy and generalization. Data loading is handled by the `DeepModel_Trainer` class, and evaluation metrics are logged to WandB.


In [None]:
import os
print(os.getcwd())
if os.getcwd().endswith("modelling"):
    os.chdir("..")

import numpy as np
import matplotlib.pyplot as plt
from skimage import future
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from src.modelling import DeepModel_Trainer
import torch.utils.data
import torch
import wandb
import seaborn as sns

# set numpy random state
random_state = 123
np.random.seed(random_state)

# silence wandb
os.environ["WANDB_SILENT"] = "true"
# HDF5 file path
file_name_zueri = r'D:\Temp\AgroLuege\raw_data\ZueriCrop\ZueriCrop.hdf5'

## Functions for Random Forest Fitting

The functions below form the training and evaluation pipeline. `setup_wandb_run` starts a WandB run. `load_data_train` and `load_data_test` collect all batches from the training and test loaders and concatenate them into single tensors. `prepare_data_fold` trims the data to a multiple of 10 samples and reshapes inputs and targets into one large pixel grid. `fit_rf` trains a RandomForest model on the training data, and `predict_rf` uses the trained model to predict labels for new data. `evaluate_log` computes accuracy, per-class F1 scores, and a confusion matrix plot, and logs them to WandB. `train_evaluate_rf` chains these steps for a single model.
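
To make the reshaping and fitting steps more concrete, here is a minimal sketch on synthetic NumPy data rather than ZueriCrop. It mimics what `skimage.future.fit_segmenter` and `predict_segmenter` do with image-shaped features: pixels are flattened into rows, and pixels labelled 0 are treated as unlabeled and left out of fitting. The helper names `fit_pixelwise` and `predict_pixelwise`, the array sizes, and the random data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_pixelwise(clf, features, labels):
    """Hypothetical helper: fit `clf` on labelled pixels only (label 0 = unlabeled)."""
    mask = labels > 0          # boolean mask of shape (height, width)
    X = features[mask]         # (n_labelled_pixels, n_channels)
    y = labels[mask]           # (n_labelled_pixels,)
    clf.fit(X, y)
    return clf


def predict_pixelwise(clf, features):
    """Hypothetical helper: predict one label per pixel and restore the image shape."""
    height, width, channels = features.shape
    flat = features.reshape(-1, channels)
    return clf.predict(flat).reshape(height, width)


rng = np.random.default_rng(123)
features = rng.normal(size=(48, 240, 4))      # synthetic stand-in: 4 values per pixel
labels = rng.integers(0, 6, size=(48, 240))   # classes 0..5, where 0 means "unknown"

clf = RandomForestClassifier(n_estimators=10, random_state=123)
clf = fit_pixelwise(clf, features, labels)
pred = predict_pixelwise(clf, features)

print(pred.shape)     # same spatial shape as the label image
print(clf.classes_)   # class 0 is absent because those pixels were never fitted
```

Because class 0 never reaches the classifier in this sketch, the model cannot predict it, which is worth keeping in mind when reading per-class scores for `0_unknown`.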

In [None]:
def setup_wandb_run(
    project_name: str,
    run_group: str,
    fold: int,
    model_architecture: str,
    batchsize: int,
    seed: int,
    entity: str = "dlbs_crop",
):
    """
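    Set up a new WandB run (used for k-fold).
    :param str project_name: Name of the project in wandb.
    :param str run_group: Name of the run group in wandb.
    :param int fold: Number of the executing fold.
    :param str model_architecture: Model type (architecture) of the model.
    :param int batchsize: Batch size to record in the run config.
    :param int seed: Random seed to record in the run config.
    :param str entity: WandB entity (team or user) that owns the project.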
    """
    # init wandb
    run = wandb.init(
        settings=wandb.Settings(start_method="thread"),
        project=project_name,
        entity=entity,
        name=f"{fold}-Fold",
        group=run_group,
        config={
            "model architecture": model_architecture,
            "batchsize": batchsize,
            "seed": seed
        },
    )
    return run


def load_data_train(model_trainer: DeepModel_Trainer):
    """
    Load and concatenate training data batches.

    Args:
        model_trainer (DeepModel_Trainer): Trainer object containing the training data loader.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: Loaded and concatenated input and target tensors for training.
    """
    # Initialize an empty list to store batches
    all_data_input = []
    all_data_target = []

    # Iterate through the DataLoader
    for batch in model_trainer.train_loader:
        input, _, target_2, _ = batch
        all_data_input.append(input)
        all_data_target.append(target_2)

    # Concatenate all batches into a single tensor along the batch dimension (dim=0)
    input_train = torch.cat(all_data_input, dim=0)
    target_train = torch.cat(all_data_target, dim=0)
    return input_train, target_train


def load_data_test(model_trainer: DeepModel_Trainer):
    """
    Load and concatenate testing data batches.

    Args:
        model_trainer (DeepModel_Trainer): Trainer object containing the testing data loader.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: Loaded and concatenated input and target tensors for testing.
    """
    # Initialize an empty list to store batches
    all_data_input = []
    all_data_target = []

    # Iterate through the DataLoader
    for batch in model_trainer.test_loader:
        input, _, target_2, _ = batch
        all_data_input.append(input)
        all_data_target.append(target_2)

    # Concatenate all batches into a single tensor along the batch dimension (dim=0)
    input_test = torch.cat(all_data_input, dim=0)
    target_test = torch.cat(all_data_target, dim=0)
    return input_test, target_test


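def prepare_data_fold(input: torch.Tensor, target: torch.Tensor):
    """
    Prepare the input and target data for training or testing.

    Trims the data to a multiple of 10 samples and stacks the samples into one
    large pixel grid. The reshape assumes that each sample contributes
    24 * 24 target pixels and 4 input values per pixel.

    Args:
        input (torch.Tensor): Input data tensor.
        target (torch.Tensor): Target data tensor.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: Input of shape (24 * k, 240, 4) and
        target of shape (24 * k, 240), where k = len(input) // 10.
    """
    # Drop trailing samples so that the sample count is a multiple of 10
    reshape_factor = len(input) // 10
    input = input[0:reshape_factor * 10]
    target = target[0:reshape_factor * 10]
    # Arrange all pixels in one grid; the last input axis holds the 4 features
    reshaped_tensor = input.reshape(24 * reshape_factor, 24 * 10, 4)
    reshaped_target = target.reshape(24 * reshape_factor, 24 * 10)

    return reshaped_tensor, reshaped_target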


def fit_rf(clf: RandomForestClassifier, reshaped_tensor_train: torch.Tensor, reshaped_target_train: torch.Tensor):
    """
    Fit a RandomForest model on the training data.

    Note: `future.fit_segmenter` treats label 0 as unlabeled, so pixels whose
    target is 0 are not used for fitting.

    Args:
        clf (RandomForestClassifier): RandomForest model.
        reshaped_tensor_train (torch.Tensor): Reshaped input tensor for training.
        reshaped_target_train (torch.Tensor): Reshaped target tensor for training.

    Returns:
        RandomForestClassifier: Trained RandomForest model.
    """
    clf = future.fit_segmenter(
        reshaped_target_train, reshaped_tensor_train, clf)
    return clf


def predict_rf(model: RandomForestClassifier, reshaped_tensor: torch.Tensor):
    """
    Make predictions using a trained RandomForest model.

    Args:
        model (RandomForestClassifier): Trained RandomForest model.
        reshaped_tensor (torch.Tensor): Reshaped input tensor for prediction.

    Returns:
        np.ndarray: Predicted labels.
    """
    y_pred = future.predict_segmenter(reshaped_tensor, model)
    return y_pred


def evaluate_log(reshaped_target, y_pred, wandbrun, verbose: bool = False,
                 class_names_cm=("0_unknown", "Field crops", "Forest",
                                 "Grassland", "Orchards", "Special crops"),
                 is_test: bool = False):
    """
    Evaluate and log performance metrics using WandB.

    Args:
        reshaped_target (torch.Tensor): Reshaped target tensor (training or test set).
        y_pred (np.ndarray): Predicted labels.
        wandbrun (wandb.Run): WandB run object.
        verbose (bool, optional): Whether to print verbose information. Defaults to False.
        class_names_cm (Sequence[str], optional): Class names for the confusion matrix, ordered by class id. Defaults to ("0_unknown", "Field crops", "Forest", "Grassland", "Orchards", "Special crops").
        is_test (bool, optional): Whether the evaluation is on the test set. Defaults to False.
    """
    class_names = list(class_names_cm)
    label_ids = list(range(len(class_names)))
    y_true = reshaped_target.numpy().ravel()
    y_pred = y_pred.ravel()
    split = "test" if is_test else "train"

    # Pass explicit labels so that scores and matrix rows stay aligned with
    # the class names even if a class is missing from the data
    accuracy = accuracy_score(y_true, y_pred)
    conf_matrix = confusion_matrix(y_true, y_pred, labels=label_ids)
    f1score = f1_score(y_true, y_pred, labels=label_ids, average=None)

    # Log the accuracy and the F1 score of each class once
    f1_scores_dict = {
        f"F1-Score_{split}_{name}": score
        for name, score in zip(class_names, f1score)}
    wandbrun.log({f"accuracy_{split}": accuracy})
    wandbrun.log(f1_scores_dict)
    if verbose:
        print("Accuracy:", accuracy)
        print("f1scores:", f1_scores_dict)

    # Plot the confusion matrix and log it as an image
    plt.figure(figsize=(12, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
                cbar=False, xticklabels=class_names,
                yticklabels=class_names)
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    wandbrun.log({f"confusion_matrix_{split}": wandb.Image(plt)})
    plt.close()


def train_evaluate_rf(clf: RandomForestClassifier, trainer: DeepModel_Trainer, run_group="RandomForest-Baseline"):
    """
    Train and evaluate a RandomForest model using WandB for logging.

    Args:
        clf (RandomForestClassifier): RandomForest model to be trained.
        trainer (DeepModel_Trainer): Trainer object containing training and testing data loaders.
        run_group (str, optional): Name of the run group in WandB. Defaults to "RandomForest-Baseline".
    """
    # Initialize the WandB run
    run = setup_wandb_run(project_name="dlbs_crop-rf",
                          run_group=run_group,
                          fold='all', model_architecture="RandomForest",
                          batchsize='Full', seed=random_state)

    # Create data loaders
    trainer.create_loader()

    # Load and prepare training data
    input_train, target_train = load_data_train(trainer)
    reshaped_tensor_train, reshaped_target_train = prepare_data_fold(
        input_train, target_train)

    # Fit RandomForest model on training data
    rfmodel = fit_rf(clf, reshaped_tensor_train, reshaped_target_train)

    # Predict on the training set
    y_pred_train = predict_rf(rfmodel, reshaped_tensor_train)

    # Evaluate and log training performance
    evaluate_log(reshaped_target_train, y_pred_train, run, verbose=True)

    # Load and prepare testing data
    input_test, target_test = load_data_test(trainer)
    reshaped_tensor_test, reshaped_target_test = prepare_data_fold(
        input_test, target_test)

    # Predict on the test set
    y_pred_test = predict_rf(rfmodel, reshaped_tensor_test)

    # Evaluate and log testing performance
    evaluate_log(reshaped_target_test, y_pred_test,
                 run, verbose=True, is_test=True)

    # Finish WandB run
    wandb.finish()
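
As a small illustration of the per-class metrics computed in `evaluate_log`, the sketch below uses made-up label arrays instead of real predictions. It passes an explicit `labels` list to scikit-learn so that each score stays aligned with its class name even when a class does not occur. The toy arrays and the `zero_division=0` choice are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

class_names = ["0_unknown", "Field crops", "Forest",
               "Grassland", "Orchards", "Special crops"]
label_ids = list(range(len(class_names)))

# Toy ground truth and predictions; class 4 ("Orchards") does not occur at all
y_true = np.array([0, 1, 1, 2, 3, 3, 5, 5])
y_pred = np.array([1, 1, 1, 2, 3, 2, 5, 5])

# Passing `labels` fixes the output to one entry per class id 0..5
f1_per_class = f1_score(y_true, y_pred, labels=label_ids,
                        average=None, zero_division=0)
conf_matrix = confusion_matrix(y_true, y_pred, labels=label_ids)

f1_by_name = {f"F1-Score_{name}": float(score)
              for name, score in zip(class_names, f1_per_class)}
print(f1_by_name)
print(conf_matrix.shape)
```

Without the `labels` argument, scikit-learn returns scores only for the classes that actually appear, so indexing the result by position could attach a score to the wrong class name.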


## Train & Evaluate Random Forests

The first cell below loops over several numbers of estimators (300, 200, 150, 100, 50, 20, and 10). For each value, a `DeepModel_Trainer` loads the data and labels from the HDF5 file, a RandomForest model is trained with a near-zero class weight (1e-10) for class 0 (`0_unknown`), and the results are logged to WandB under a separate run group. The second cell trains one additional model with 150 estimators and no class weights for comparison.
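
The same kind of sweep can be tried quickly on synthetic data, without ZueriCrop or WandB. The sketch below is only an illustration under assumed settings: the dataset comes from scikit-learn's `make_classification`, and the estimator counts, split ratio, and seed are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 features per sample, 5 classes
X, y = make_classification(n_samples=2000, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=5, n_clusters_per_class=1,
                           random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123)

# Train one forest per estimator count and record train and test accuracy
results = {}
for n_estimators in [10, 50, 100]:
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=123)
    clf.fit(X_train, y_train)
    results[n_estimators] = {
        "accuracy_train": accuracy_score(y_train, clf.predict(X_train)),
        "accuracy_test": accuracy_score(y_test, clf.predict(X_test)),
    }

for n_estimators, scores in results.items():
    print(n_estimators, scores)
```

Comparing the train and test accuracy per estimator count gives a rough view of how much each configuration overfits.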

In [None]:
# Iterate over different numbers of estimators
for estimators in [300, 200, 150, 100, 50, 20, 10]:
    # Initialize DeepModel_Trainer with data and labels
    trainer = DeepModel_Trainer(file_name_zueri, 'labels.csv', None, 'cpu')

    # Create RandomForestClassifier with specified parameters
    clf = RandomForestClassifier(
        n_estimators=estimators, n_jobs=-1, class_weight={0: 1e-10, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1})

    # Train and evaluate RandomForest model
    train_evaluate_rf(clf, trainer, f"RandomForest-Baseline-{estimators}")


In [None]:
# Train and evaluate RandomForest model with 150 estimators and no class weights
trainer = DeepModel_Trainer(file_name_zueri, 'labels.csv', None, 'cpu')
clf = RandomForestClassifier(n_estimators=150, n_jobs=-1)

# Train and evaluate RandomForest model without class weights
train_evaluate_rf(clf, trainer, "RandomForest-Baseline-150-No-Weight")
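
To get a feel for what the class weight of 1e-10 used in the sweep means, the following sketch converts a `class_weight` dictionary into per-sample weights with scikit-learn's `compute_sample_weight`. The label array is made up for illustration, and the sketch only shows the resulting weights, not a full training run.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Same weighting scheme as in the sweep: class 0 gets a near-zero weight
class_weight = {0: 1e-10, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1}

# Made-up labels in which every class occurs at least once
y = np.array([0, 0, 1, 2, 3, 4, 5, 5])

# Each sample receives the weight of its class
sample_weight = compute_sample_weight(class_weight, y)
print(sample_weight)

# Share of the total weight that falls on class 0 samples
share_unknown = sample_weight[y == 0].sum() / sample_weight.sum()
print(share_unknown)
```

With such a small weight, class 0 samples contribute almost nothing to the weighted training objective, whereas the unweighted run treats all classes it sees equally.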
