# Random forest classification tutorial

Classification Task: Predicting Soil Behavior (Binary Classification)

The task is a binary classification problem: predict whether a soil sample will exhibit contractive or dilative behavior, as defined by the `state` parameter. If `state >= -0.05`, the soil tends to be contractive; if `state < -0.05`, it tends to be dilative. We first train the model on labeled training examples (feature-label pairs), then test it on unseen data to assess whether it correctly classifies the soil behavior (contractive or dilative).
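
The labeling rule above can be sketched in a few lines. This is a minimal illustration on made-up `state` values (the numbers and the `demo` DataFrame are assumptions, not taken from the dataset), using 1 for contractive and 0 for dilative.

```python
import pandas as pd

# Hypothetical state values for illustration only (not from state_prediction.csv)
demo = pd.DataFrame({"state": [-0.20, -0.05, 0.03, 0.10, -0.051]})

# 1 = contractive (state >= -0.05), 0 = dilative (state < -0.05)
demo["binary_state"] = (demo["state"] >= -0.05).astype(int)
demo["behavior"] = demo["binary_state"].map({1: "contractive", 0: "dilative"})

print(demo)
```

Note that a sample sitting exactly at `-0.05` falls on the contractive side because the rule uses `>=`.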

## Data loading

Load the dataset and display its first few rows and information.



In [None]:
import pandas as pd

try:
    df = pd.read_csv('state_prediction.csv')
    display(df.head())
    df.info()  # prints its summary directly and returns None, so no display() needed
except FileNotFoundError:
    print("Error: 'state_prediction.csv' not found.")
    df = None

## Data exploration

Explore the dataset by visualizing relationships between the features and the target variable `state` using a pair plot, a correlation-matrix heatmap, and histograms of each feature.



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Determine the shape of the DataFrame
print(f"Shape of the DataFrame: {df.shape}")

# 2. Generate a pair plot
sns.pairplot(df, hue='state')
plt.show()

# 3 & 4. Calculate and display the correlation matrix as a heatmap
correlation_matrix = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()

# 5. Describe the distributions of each feature
df.hist(figsize=(16, 12), bins=20)
plt.suptitle("Distribution of Features", fontsize=16)
plt.show()

# Analyze the correlation between features and the target variable 'state'
print("Correlation with 'state':")
print(correlation_matrix['state'].sort_values(ascending=False))

## Data preparation


Remove the "gray zone" data points where `state` is 0 and create a new binary target variable `binary_state`: 1 (contractive) if `state >= -0.05`, 0 (dilative) otherwise. Then drop the original `state` column and verify the changes.



In [None]:
# Remove rows where 'state' is 0; .copy() avoids pandas' SettingWithCopyWarning below
df = df[df['state'] != 0].copy()

# Create the 'binary_state' column: 1 = contractive (state >= -0.05), 0 = dilative
df['binary_state'] = 0
df.loc[df['state'] >= -0.05, 'binary_state'] = 1

# Drop the original 'state' column
df = df.drop('state', axis=1)

# Verify the changes
display(df.head())
display(df['binary_state'].value_counts())

## Data splitting

Split the data into training, validation, and test sets (70/15/15) using `train_test_split`, stratifying on the target so that each subset keeps the same class proportions.



In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = df.drop('binary_state', axis=1)
y = df['binary_state']

# First split: combined training/validation and testing sets
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Second split: training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val,
    y_train_val,
    test_size=0.15 / (0.15 + 0.7),  # Adjust test_size for the desired 70/15/15 split
    stratify=y_train_val,
    random_state=42,
)
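
To check that the two-step split really yields roughly 70/15/15 with matching class balance, the same calls can be run on stand-in data. The sketch below uses randomly generated features and labels (an assumption, since the real dataset is not available here); the names `X_demo`, `y_demo`, and the `_tr`/`_va`/`_te` suffixes are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 samples, 4 features, ~30% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.3).astype(int)

# Same two-step split as above: hold out 15% for testing first ...
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=42
)
# ... then take 0.15 / 0.85 of the remainder for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.15 / (0.15 + 0.7), stratify=y_tv, random_state=42
)

n = len(y_demo)
for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: {len(part) / n:.1%} of data, positive rate {part.mean():.3f}")
```

The second `test_size` is a fraction of the remaining 85%, not of the full dataset, which is why it is `0.15 / 0.85` rather than `0.15`.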

## Model training

Train a random forest model using the training data.



In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
# your code here:

# Train the model
# your code here:
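
One possible way to fill in the two placeholders above is sketched below. It runs on synthetic data from `make_classification` as a stand-in for the soil features (an assumption), and the hyperparameter values and the name `rf_model` are illustrative choices rather than the tutorial's prescribed answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the soil features (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Initialize the Random Forest model (values are illustrative)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training split only
rf_model.fit(X_tr, y_tr)

# Compare training and validation accuracy to get a first sense of overfitting
print(f"Training accuracy:   {rf_model.score(X_tr, y_tr):.3f}")
print(f"Validation accuracy: {rf_model.score(X_va, y_va):.3f}")
```

In the notebook itself, the same two calls would use `X_train` and `y_train` from the data splitting step, with `X_val` and `y_val` available for the accuracy comparison.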


## Model optimization

Optimize the random forest model using GridSearchCV with k-fold cross-validation.


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid for Random Forest
# your code here:

# Create a Random Forest model
# your code here:

# Create the GridSearchCV object
# your code here:

# Fit the GridSearchCV object to the training data
# your code here:

# Print the best hyperparameters
# your code here:

# Get the best estimator
# your code here:
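
The placeholders above could be filled in along the following lines. This is a sketch on synthetic data (an assumption), with a deliberately small, illustrative parameter grid and 5-fold cross-validation; the grid values are not taken from the tutorial and would normally be tuned to the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=42
)

# Define the parameter grid (illustrative values, kept small for speed)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}

# Create a Random Forest model
rf = RandomForestClassifier(random_state=42)

# Create the GridSearchCV object with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring="accuracy")

# Fit the GridSearchCV object to the training data
grid_search.fit(X_demo, y_demo)

# Print the best hyperparameters and the mean cross-validated score they achieved
print("Best hyperparameters:", grid_search.best_params_)
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")

# Get the best estimator (refit on all of the data passed to fit)
best_rf_model = grid_search.best_estimator_
```

Because `GridSearchCV` refits the best configuration by default, `best_rf_model` is ready to use for prediction without another call to `fit`.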


## Model evaluation


Evaluate the best model on the test set: generate a classification report and a confusion matrix, plot the ROC curve, and compute the AUC.



In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns

# Predict on the test set
# your code here:

# Generate the classification report
print(classification_report(y_test, y_pred))

# Compute and plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Calculate the ROC curve and AUC
# your code here:

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Print the AUC value
print(f"AUC: {roc_auc}")

# Plot the ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
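
The evaluation cell relies on two variables that the placeholders are expected to define: `y_pred` (hard class labels) and `y_prob` (scores for the positive class). A minimal sketch of how they might be obtained is shown below, using a model trained on synthetic data (an assumption); the names `model`, `X_te`, and `y_te` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model (assumption, not the tutorial's dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Hard class labels, as used by classification_report and confusion_matrix
y_pred = model.predict(X_te)

# Probability of the positive class (column 1), as used by roc_curve
y_prob = model.predict_proba(X_te)[:, 1]

# roc_auc_score computes the area under the ROC curve directly from the scores
print(f"AUC: {roc_auc_score(y_te, y_prob):.3f}")
```

Passing `y_pred` instead of `y_prob` to `roc_curve` would collapse the curve to a single operating point, so the probability column is the one to use there.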

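## Model interpretation

Visualize the first decision tree of the optimized forest with Graphviz to inspect the splits it learned.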

In [None]:
from sklearn.tree import export_graphviz
import graphviz
from IPython.display import display

# Assumes 'best_rf_model' holds the best estimator from the optimization step
tree = best_rf_model.estimators_[0]  # Select the first tree in the forest

dot_data = export_graphviz(tree,
                            out_file=None,
                            feature_names=list(X_train.columns),  # Plain list of feature names
                            class_names=['0', '1'],  # 0 = dilative, 1 = contractive
                            filled=True,
                            rounded=True,
                            special_characters=True)

graph = graphviz.Source(dot_data)
display(graph)
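
A single tree shows only part of what the forest has learned. As a complementary view, the sketch below ranks features by the forest's impurity-based `feature_importances_`. It uses synthetic data and hypothetical feature names (both assumptions); in the notebook, the same two lines at the end would be applied to `best_rf_model` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (assumption)
feature_names = [f"feature_{i}" for i in range(6)]
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, averaged over all trees and normalized to sum to 1
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measure of influence.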
