# Feature Evaluation and Selection for Random Forest Classifier of Symptomatic and Non-Symptomatic Sweet Chestnut

### Imports

In [3]:
import random
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_selection import RFE
import numpy as np
from collections import defaultdict

### Feature Selection with Leave-Location-Out Cross-Validation (LLO CV) and Recursive Feature Elimination (RFE)

#### Leave-Location-Out Cross-Validation (LLO CV) for Model Testing

To test and validate the performance of the Random Forest (RF) classifier, Leave-Location-Out Cross-Validation (LLO CV) was used. I defined 9 spatial folds based on a visual assessment of the spatial distribution of the sampled trees. From each fold, 10 samples are held out for the test set; the remaining samples form the training set.

In [None]:
# Load the input file with the sampled tree data into a dataframe
# Each sample is labelled with its corresponding spatial fold (1 to 9)
df = pd.read_csv("input/input_data.csv", sep=";")

trainSets_List = []
testSets_List = []

# For each spatial fold: shuffle the samples, hold out 10 for the test set and assign the rest to the training set
for i in range(1,10):
    temp_df = pd.DataFrame(df[(df['Spat_fold']==i)]).drop(columns=['Spat_fold']).sample(frac=1).reset_index(drop=True)
    testSets_List.append(temp_df.tail(10))
    trainSets_List.append(temp_df.iloc[:-10])
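The leave-location-out idea can also be expressed with scikit-learn's `LeaveOneGroupOut` splitter, which holds out one group (here: one spatial fold) per iteration. The following is a minimal sketch on made-up toy data, not the thesis data; the column `feature_a` and the three-fold layout are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# Toy data (hypothetical): 6 samples spread over 3 spatial folds
toy = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "Class": ["symptomatic", "non-symptomatic"] * 3,
    "Spat_fold": [1, 1, 2, 2, 3, 3],
})

logo = LeaveOneGroupOut()
splits = []
# Each iteration holds out all samples of exactly one spatial fold
for train_idx, val_idx in logo.split(toy[["feature_a"]], toy["Class"], groups=toy["Spat_fold"]):
    held_out = toy["Spat_fold"].iloc[val_idx].unique().tolist()
    splits.append((held_out, len(train_idx), len(val_idx)))
    print(f"held-out fold {held_out}: {len(train_idx)} training / {len(val_idx)} validation samples")
```

The manual loop used in this notebook achieves the same separation by location; the splitter is only an alternative way to write it.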

#### Pre-Selection of Features

For computational efficiency, I preselect 32 of the 64 available features with a single Recursive Feature Elimination (RFE) run. RFE requires an estimator, in this case the Random Forest (RF) model. Reducing the feature set to 32 input features made it computationally feasible to evaluate 180 different feature combinations.

In [None]:
# Initialise the RF classifier 
selection_model = RandomForestClassifier()

# Apply RFE with the RF classifier as an estimator
# Select the 32 most important features by removing 1 feature per iteration
rfe = RFE(estimator=selection_model, n_features_to_select=32, step=1)

# Fit RFE on the training set, using all columns except the class label ("symptomatic" / "non-symptomatic") as predictors
rfe.fit(pd.concat(trainSets_List).drop('Class', axis=1), pd.concat(trainSets_List)['Class'])

# Store the names of the preselected input features (a pandas Index of column names)
preselected_features = pd.concat(trainSets_List).drop('Class', axis=1).columns[rfe.support_]
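To make the RFE outputs easier to read, here is a small sketch on synthetic data (generated with `make_classification`, so none of it comes from the thesis dataset). It shows how `support_` marks the retained features and how `ranking_` records the elimination order. The sizes (8 features, 4 selected) and the fixed `random_state` are assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data: 120 samples, 8 features
X, y = make_classification(n_samples=120, n_features=8, n_informative=3, random_state=0)

# Keep 4 features, removing the least important feature in each iteration
rfe_demo = RFE(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=4,
    step=1,
)
rfe_demo.fit(X, y)

print(rfe_demo.support_)   # boolean mask, True for the retained features
print(rfe_demo.ranking_)   # 1 for retained features, higher values were eliminated earlier
```

Fixing `random_state` on the estimator, as in this sketch, would make a pre-selection repeatable between notebook runs.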

#### Find Optimal Number and Combination of Features with Recursive Feature Elimination (RFE) and Leave-Location-Out Cross-Validation (LLO CV)

Using a large number of features as input to the RF model can lead to high correlations between features, which may bias the evaluation of feature importances (Dobrinić et al., 2022). By reducing the feature set to 32, I aim to reduce such correlations and obtain a more reliable assessment of feature relevance. The next step is to find the feature combination that optimises the performance of the Random Forest (RF) model, which requires evaluating both which features and how many of them should be used. I evaluate subsets of 6 to 23 features (n = 6, 7, ..., 23). For each number of features n, 10 feature combinations are built from the pre-selection of 32 features, giving 18 × 10 = 180 combinations. This procedure allows the model’s performance to be compared across different numbers of features at a manageable computational cost.

Source: Dobrinić, D., Gašparović, M., & Medak, D. (2022). Evaluation of Feature Selection Methods for Vegetation Mapping Using Multitemporal Sentinel Imagery. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLIII-B3-2022, 485–491. https://doi.org/10.5194/isprs-archives-XLIII-B3-2022-485-2022
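The size of this search can be checked with a few lines of arithmetic. The sketch below only counts feature combinations and the RF models trained during LLO CV (one per combination and fold); it does not count the additional model fits that RFE performs internally.

```python
# Feature counts evaluated: 6, 7, ..., 23
feature_counts = range(6, 24)
runs_per_count = 10   # RFE runs per feature count
n_folds = 9           # spatial folds used in LLO CV

n_combinations = len(feature_counts) * runs_per_count
n_cv_fits = n_combinations * n_folds

print(n_combinations)  # 180
print(n_cv_fits)       # 1620
```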

In [None]:
best_accuracy = 0
best_num_features = 0
results = []

# Set range for number of features [6, ..., 23]
range_features_rfe = range(6, 24)

# Perform RFE before LLO CV to select input features
for num_features in range_features_rfe:

    # Initialise list to store the 10 feature combinations generated for this number of features
    # (180 combinations in total: 10 combinations * 18 feature counts)
    all_selected_features = []

    # Perform RFE 10 times to generate different feature combinations for n input features [6, ..., 23] and store the selected features
    # A new random seed per run introduces variability in the RFE selection
    for _ in range(10):
        random_seed = random.randint(0, 10000)
        selection_model = RandomForestClassifier(random_state=random_seed)
        rfe = RFE(estimator=selection_model, n_features_to_select=num_features, step=1)
        rfe.fit(pd.concat(trainSets_List)[preselected_features], pd.concat(trainSets_List)['Class'])
        selected_features = pd.concat(trainSets_List)[preselected_features].columns[rfe.support_]
        all_selected_features.append(selected_features)

    # Initialise list to store mean accuracy for each combination of features
    mean_accuracies = []

    # Loop through each RFE-selected feature combination, train a RF model and validate its performance with LLO CV
    for selected_features in all_selected_features:
        fold_accuracies = []
        # defaultdict(list) creates an empty list for each new key, so no explicit key check or initialisation is needed
        feature_importance_tracker = defaultdict(list) 

        # Perform LLO CV
        for i in range(9):
            # Define the validation and training sets
            df_val = trainSets_List[i]
            X_val = df_val.loc[:, selected_features]
            y_val = df_val['Class']

            df_train = pd.concat([trainSets_List[j] for j in range(9) if j != i])
            X_train = df_train.loc[:, selected_features]
            y_train = df_train['Class']

            # Train the model with selected features
            classifier_model = RandomForestClassifier()
            classifier_model.fit(X_train, y_train)

            # Make predictions
            y_pred = classifier_model.predict(X_val)

            # Calculate accuracy for each fold and store it
            accuracy = accuracy_score(y_val, y_pred)
            fold_accuracies.append(accuracy)

            # Track feature importances for selected features
            for feature, importance in zip(selected_features, classifier_model.feature_importances_):
                feature_importance_tracker[feature].append(importance)

        # Calculate the mean accuracy across all folds for this feature combination 
        mean_accuracy = np.mean(fold_accuracies)
        mean_accuracies.append(mean_accuracy)

        # If the current model has a better mean accuracy, update the best performing model and accuracy parameter
        if mean_accuracy > best_accuracy:
            best_accuracy = mean_accuracy
            best_num_features = num_features

        # Store the collected results (mean and per-fold accuracies, averaged importances, selected features) for each feature combination
        averaged_importances = {feature: np.mean(importances) for feature, importances in feature_importance_tracker.items()}
        results.append({
            'num_features': num_features,
            'mean_accuracy': mean_accuracy,
            'feature_importances': averaged_importances,
            'CV_accuracies': fold_accuracies,
            'selected_features': selected_features
        })

        print(f"RFE with {num_features} features - Mean CV Accuracy: {mean_accuracy:.4f}")

    print(f"Best accuracy for {num_features} features across 10 RFE runs: {max(mean_accuracies):.4f}")

print(f"Best CV Accuracy: {best_accuracy:.4f} with {best_num_features} features")
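Because every feature count is evaluated with 10 independently seeded RFE runs, it can be informative to see how often each feature is selected across runs. The following is a minimal sketch with placeholder feature names (`feat_a` to `feat_d` are hypothetical, not features of the thesis dataset); in the notebook, the same counting could be applied to the `selected_features` entries collected in `results`.

```python
from collections import Counter

# Hypothetical selections from three RFE runs (placeholder feature names)
runs = [
    ["feat_a", "feat_b", "feat_c"],
    ["feat_a", "feat_b", "feat_d"],
    ["feat_a", "feat_c", "feat_d"],
]

# Count in how many runs each feature was selected
selection_counts = Counter(feature for run in runs for feature in run)

for feature, count in selection_counts.most_common():
    print(f"{feature}: selected in {count} of {len(runs)} runs")
```

Features that are selected in most runs are less sensitive to the random seed of the estimator than those that appear only occasionally.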


#### Results of Feature Evaluation

This section gives an overview of the feature evaluation by displaying a dataframe with all collected results. The dataframe contains the performance of an RF model for 10 different feature combinations per number of features [6, ..., 23]. For each feature combination, the following information is presented:

- The **number of features** selected for training the model
- The **mean accuracy** of the LLO CV
- The **individual LLO CV accuracies** for each fold
- The **selected features** for that particular combination
- The **sorted feature importances** for that particular combination

In [None]:
# Convert the results list into a dataframe for easier display
results_df = pd.DataFrame(results)

# Rank feature importances for better overview
results_df['sorted_feature_importances'] = results_df['feature_importances'].apply(
    lambda x: {k: v for k, v in sorted(x.items(), key=lambda item: item[1], reverse=True)})

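# Drop the unsorted importances column (columns= is required; drop() targets row labels by default)
results_df.drop(columns='feature_importances', inplace=True)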
display(results_df)

In [None]:
# Extract and display details of the best performing feature combination
# (the single row with the highest mean accuracy, not just the first row with best_num_features)
best_row = results_df.loc[results_df['mean_accuracy'].idxmax()]
importances = best_row['sorted_feature_importances']
CV_accuracies = best_row['CV_accuracies']

importances_df = pd.DataFrame(importances.items(), columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=False).reset_index(drop=True)
CV_accuracies_df = pd.DataFrame(CV_accuracies, columns=['CV_Accuracy'])

CV_acc_num_features_df = results_df[['num_features', 'mean_accuracy']]

display(importances_df)
display(CV_accuracies_df)
display(CV_acc_num_features_df)
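Since each feature count appears in 10 rows of the results dataframe, filtering by the number of features alone does not identify the best combination. The sketch below illustrates the difference on a toy dataframe; the accuracy values are made up for the example.

```python
import pandas as pd

# Toy results (hypothetical values): two combinations with 6 features, one with 7
toy_results = pd.DataFrame([
    {"num_features": 6, "mean_accuracy": 0.71},
    {"num_features": 6, "mean_accuracy": 0.78},
    {"num_features": 7, "mean_accuracy": 0.75},
])

# Row with the highest mean accuracy over all combinations
best = toy_results.loc[toy_results["mean_accuracy"].idxmax()]

# First row that merely shares the best number of features
first_of_count = toy_results[toy_results["num_features"] == 6].iloc[0]

print(best["mean_accuracy"])            # 0.78
print(first_of_count["mean_accuracy"])  # 0.71
```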


### Final Random Forest (RF) Model Training
With the best performing feature combination identified by the combined RFE and LLO CV approach, I train the final RF model on the full training set for the classification task of my thesis.

In [None]:
# Put the best performing feature combination from LLO CV into a list
top_importances = importances_df['Feature'].head(best_num_features).tolist()
top_importances

In [None]:
# Map class names to corresponding integers
class_mapping = {'non-symptomatic': 0, 'symptomatic': 1}

# Define the testing and training sets and map class names to integers
trainSet = pd.concat(trainSets_List)
y_train = trainSet['Class'].map(class_mapping)
X_train = trainSet.drop('Class', axis=1)

testSet = pd.concat(testSets_List, ignore_index=True)
y_Test = testSet['Class'].map(class_mapping) 
X_Test = testSet.drop('Class', axis=1)

# Reduce testing and training sets to only the selected input features
selected_features = top_importances
X_train_rfe = X_train.loc[:, selected_features]
X_test_rfe = X_Test.loc[:, selected_features]

# Train the model with the final selected input features
final_classifier = RandomForestClassifier()
final_classifier.fit(X_train_rfe, y_train)

# Make predictions
y_pred = final_classifier.predict(X_test_rfe)

# Check class distributions (is this check needed?)
print("True Class Distribution:")
print(y_Test.value_counts())
print("Predicted Class Distribution:")
print(pd.Series(y_pred).value_counts())

# Calculate the accuracy on the testing set for getting the final accuracy score
accuracy_TestSet = accuracy_score(y_Test, y_pred)

# Calculate the class-specific accuracy metrics on the testing set for getting the final class-wise accuracy scores
precision_TestSet_class = precision_score(y_Test, y_pred, average=None, labels=[0, 1], zero_division=0)
recall_TestSet_class = recall_score(y_Test, y_pred, average=None, labels=[0, 1], zero_division=0)
f1_TestSet_class = f1_score(y_Test, y_pred, average=None, labels=[0, 1], zero_division=0)

# Print the results
print(f"Test Set Accuracy: {accuracy_TestSet:.4f}")
print("")
print(f"Symptomatic (Class 1) Precision: {precision_TestSet_class[1]:.4f}")
print(f"Symptomatic (Class 1) Recall: {recall_TestSet_class[1]:.4f}")
print(f"Symptomatic (Class 1) F1 Score: {f1_TestSet_class[1]:.4f}")
print("")
print(f"Non-Symptomatic (Class 0) Precision: {precision_TestSet_class[0]:.4f}")
print(f"Non-Symptomatic (Class 0) Recall: {recall_TestSet_class[0]:.4f}")
print(f"Non-Symptomatic (Class 0) F1 Score: {f1_TestSet_class[0]:.4f}")
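The class-wise precision, recall and F1 scores all derive from the confusion matrix, which can be inspected directly. This is a minimal sketch on made-up labels (not the thesis test set), using the same 0 = non-symptomatic, 1 = symptomatic encoding as above.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration: 0 = non-symptomatic, 1 = symptomatic
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision and recall for the symptomatic class, computed by hand
precision_symptomatic = tp / (tp + fp)
recall_symptomatic = tp / (tp + fn)
print(precision_symptomatic, recall_symptomatic)  # 0.75 0.75

# Compact per-class summary
print(classification_report(
    y_true_demo, y_pred_demo,
    labels=[0, 1],
    target_names=["non-symptomatic", "symptomatic"],
    zero_division=0,
))
```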

### Feature Evaluation of Final Random Forest (RF) Model

I export the feature importances of the final RF model for later analysis in my thesis.

In [None]:
# Create a sorted dataframe of the final model's feature importances and export it to a CSV file
final_features = final_classifier.feature_importances_
final_features_df = pd.DataFrame({
    'Feature': X_train_rfe.columns,
    'Importance (Best Model)': final_features})
final_features_df.sort_values(by='Importance (Best Model)', ascending=False, inplace=True)
display(final_features_df)
final_features_df.to_csv('output/final_feature_importances.csv')
