# Coffee Leaf Diseases Prediction

## Overview
This notebook reproduces the coffee leaf disease classification method described in the research paper referenced below, which applies machine learning techniques to RGB and CMY color features.

## References

### Research Paper
- **Title**: Comparative Analysis of the Performance of the Decision Tree and K-Nearest Neighbors Methods in Classifying Coffee Leaf Diseases
- **Authors**: Adie Suryadi, Murhaban Murhaban, Rivansyah Suhendra
- **Affiliation**: Department of Information Technology, Teuku Umar University, Indonesia
- **URL**: [https://aptikom-journal.id/conferenceseries/article/view/649/272](https://aptikom-journal.id/conferenceseries/article/view/649/272)

### Dataset
- **Dataset**: Coffee Leaf Diseases
- **Source**: Kaggle
- **URL**: [https://www.kaggle.com/datasets/badasstechie/coffee-leaf-diseases/code](https://www.kaggle.com/datasets/badasstechie/coffee-leaf-diseases/code)

## Methodology
This implementation extracts color-based features from coffee leaf images:
- **RGB features**: Mean and standard deviation for each R, G, B channel (6 features)
- **CMY features**: Mean and standard deviation for each C, M, Y channel (6 features)
- **Total**: 12 color-based features per image
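The 12 features listed above can be sketched as follows. This is a minimal illustration, not the notebook's `utils` implementation: the function name `extract_color_features` is hypothetical, the input is assumed to be an 8-bit RGB array, and CMY is assumed to be the complement of normalized RGB.

```python
import numpy as np


def extract_color_features(image_rgb: np.ndarray) -> np.ndarray:
    """Return 12 values: RGB means, RGB stds, CMY means, CMY stds (3 each)."""
    # Assumption: uint8 input in RGB channel order, scaled here to [0, 1].
    rgb = image_rgb.astype(np.float64) / 255.0
    # Assumption: CMY is taken as the complement of normalized RGB.
    cmy = 1.0 - rgb

    features = []
    for channels in (rgb, cmy):
        pixels = channels.reshape(-1, 3)      # one row per pixel
        features.extend(pixels.mean(axis=0))  # per-channel mean
        features.extend(pixels.std(axis=0))   # per-channel standard deviation
    return np.array(features)


# A tiny pure-red image as a sanity check.
demo_image = np.zeros((4, 4, 3), dtype=np.uint8)
demo_image[..., 0] = 255
demo_features = extract_color_features(demo_image)
print(demo_features.shape)
```

For a uniformly colored image every standard deviation is zero, so only the six mean values carry information.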

The features are then used to classify coffee leaves into four categories:
- Miner
- Phoma
- Rust
- No disease

## Preprocessing Data
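The preprocessing cell in this section fits `StandardScaler` on the training split only and reuses those statistics for the validation and test data. The sketch below shows the same idea with plain NumPy on synthetic data; all `*_demo` names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_valid_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Statistics come from the training split only (population std, ddof=0,
# which matches what StandardScaler computes).
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)

# The same training statistics are applied to both splits.
X_train_std_demo = (X_train_demo - train_mean) / train_std
X_valid_std_demo = (X_valid_demo - train_mean) / train_std

print(X_train_std_demo.mean(axis=0).round(6))
print(X_valid_std_demo.mean(axis=0).round(3))
```

The training split ends up with zero mean and unit variance per feature, while the validation split is only approximately standardized. That is expected: recomputing the statistics on validation data would leak information about it into the evaluation.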

In [None]:
from utils import (
    load_and_extract_features,
    show_evaluation_results,
    plot_confusion_matrix_single_label,
    plot_roc_curve_single_label
)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pandas as pd

# Load and extract features (raw pixel data with single label)
train_features, train_labels = load_and_extract_features('train', (100, 50), use_raw_data=True, single_label=True)
test_features, test_labels = load_and_extract_features('test', (100, 50), use_raw_data=True, single_label=True)

X_train, X_valid, y_train, y_valid = train_test_split(
    train_features, 
    train_labels,
    test_size=0.2,
    stratify=train_labels,
    random_state=123
)

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_valid_encoded = label_encoder.transform(y_valid)
test_labels_encoded = label_encoder.transform(test_labels)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
test_features_scaled = scaler.transform(test_features)

## Building and Evaluating Models

### Using the parameters described in the paper
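The KNN settings used in this section (`n_neighbors=5`, `metric='minkowski'` with `p=2`, `weights='uniform'`) amount to a majority vote among the nearest training samples by Euclidean distance. A simplified sketch of that rule follows; `knn_predict` is a hypothetical helper, and its tie-breaking is not guaranteed to match scikit-learn's.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, X_query, k=5):
    """Classify each query row by majority vote among its k nearest training rows."""
    predictions = []
    for query in X_query:
        # Euclidean distance (Minkowski with p=2) to every training sample.
        distances = np.linalg.norm(X_train - query, axis=1)
        nearest = np.argsort(distances)[:k]
        # Uniform weights: each neighbor contributes exactly one vote.
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)


X_knn_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_knn_demo = np.array([0, 0, 0, 1, 1, 1])
knn_demo_pred = knn_predict(X_knn_demo, y_knn_demo,
                            np.array([[0.1, 0.1], [5.0, 5.1]]), k=3)
print(knn_demo_pred)
```

Because the vote depends directly on distances, features with large ranges would dominate without the scaling applied during preprocessing.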

#### Without SMOTE

##### Predict validation set

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# ---------- Decision Tree ----------
dt = DecisionTreeClassifier(
    criterion='gini',
    max_depth=None,
    max_features=None,
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=123,
    splitter='best'
)
dt.fit(X_train_scaled, y_train_encoded)
y_pred_valid_dt = dt.predict(X_valid_scaled)

show_evaluation_results("Decision Tree", y_pred_valid_dt, y_valid_encoded)

# ---------- KNN ----------
knn = KNeighborsClassifier(
    algorithm='auto',
    leaf_size=30,
    metric='minkowski',
    n_jobs=-1,
    n_neighbors=5,
    p=2,
    weights='uniform'
)
knn.fit(X_train_scaled, y_train_encoded)
y_pred_valid_knn = knn.predict(X_valid_scaled)

show_evaluation_results("KNN", y_pred_valid_knn, y_valid_encoded)

In [None]:
# Confusion Matrix Heatmap
labels_to_display = label_encoder.classes_

# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_pred_valid_dt, y_valid_encoded, labels_to_display, 'Validation Set')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_pred_valid_knn, y_valid_encoded, labels_to_display, 'Validation Set')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', dt, X_valid_scaled, y_valid, label_encoder, 'Validation Set')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', knn, X_valid_scaled, y_valid, label_encoder, 'Validation Set')

##### Predict test set

In [None]:
# ---------- Decision Tree ----------
y_pred_test_dt = dt.predict(test_features_scaled)
show_evaluation_results("Decision Tree", y_pred_test_dt, test_labels_encoded)

# ---------- KNN ----------
y_pred_test_knn = knn.predict(test_features_scaled)
show_evaluation_results("KNN", y_pred_test_knn, test_labels_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_pred_test_dt, test_labels_encoded, labels_to_display, 'Test Set')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_pred_test_knn, test_labels_encoded, labels_to_display, 'Test Set')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', dt, test_features_scaled, test_labels, label_encoder, 'Test Set')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', knn, test_features_scaled, test_labels, label_encoder, 'Test Set')

#### With SMOTE
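SMOTE balances classes by creating synthetic minority samples: each new point lies on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified NumPy illustration of that interpolation step, not the `imblearn` implementation; `smote_like_oversample` is a hypothetical name.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=5, seed=123):
    """Generate n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    n_samples = len(X_minority)
    k = min(k, n_samples - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances; a sample is never its own neighbor.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n_samples)
        neighbor = neighbors[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])
    return synthetic


X_minority_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic_demo = smote_like_oversample(X_minority_demo, n_new=6, k=2)
print(synthetic_demo.shape)
```

When SMOTE is placed inside an `imblearn` pipeline, as in the cells of this section, resampling happens only while fitting. Validation and test data pass through unchanged at prediction time.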

##### Predict validation set

In [None]:
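from sklearn.base import clone

# ---------- Decision Tree ----------
# Clone the estimator: a pipeline fits its final step in place, so passing `dt`
# directly would overwrite the model trained without SMOTE.
pipeline_dt = ImbPipeline([
    ('smote', SMOTE(random_state=123)),
    ('model', clone(dt))
])
pipeline_dt.fit(X_train_scaled, y_train_encoded)
y_pred_valid_dt_smote = pipeline_dt.predict(X_valid_scaled)

show_evaluation_results("Decision Tree", y_pred_valid_dt_smote, y_valid_encoded)

# ---------- KNN ----------
# Same reasoning: keep `knn` (no SMOTE) separate from the pipeline's copy.
pipeline_knn = ImbPipeline([
    ('smote', SMOTE(random_state=123)),
    ('model', clone(knn))
])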
pipeline_knn.fit(X_train_scaled, y_train_encoded)
y_pred_valid_knn_smote = pipeline_knn.predict(X_valid_scaled)

show_evaluation_results("KNN", y_pred_valid_knn_smote, y_valid_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_pred_valid_dt_smote, y_valid_encoded, labels_to_display, 'Validation Set')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_pred_valid_knn_smote, y_valid_encoded, labels_to_display, 'Validation Set')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', pipeline_dt, X_valid_scaled, y_valid, label_encoder, 'Validation Set')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', pipeline_knn, X_valid_scaled, y_valid, label_encoder, 'Validation Set')

##### Predict test set

In [None]:
# ---------- Decision Tree ----------
y_pred_test_dt_smote = pipeline_dt.predict(test_features_scaled)
show_evaluation_results("Decision Tree", y_pred_test_dt_smote, test_labels_encoded)

# ---------- KNN ----------
y_pred_test_knn_smote = pipeline_knn.predict(test_features_scaled)
show_evaluation_results("KNN", y_pred_test_knn_smote, test_labels_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_pred_test_dt_smote, test_labels_encoded, labels_to_display, 'Test Set')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_pred_test_knn_smote, test_labels_encoded, labels_to_display, 'Test Set')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', pipeline_dt, test_features_scaled, test_labels, label_encoder, 'Test Set')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', pipeline_knn, test_features_scaled, test_labels, label_encoder, 'Test Set')

### Hyperparameter Tuning
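The grid searches in this section rank candidates with `scoring='f1_macro'`, the unweighted mean of the per-class F1 scores. Every class counts equally regardless of how many samples it has. The following sketch computes it by hand; `f1_macro` is an illustrative helper, not the scikit-learn function.

```python
import numpy as np


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either array."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return float(np.mean(scores))


y_true_demo = [0, 0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0]
macro_demo = f1_macro(y_true_demo, y_pred_demo)
print(round(macro_demo, 4))
```

In this example the per-class scores are 0.75, 0.8, and 0.0. The single missed sample of the rare class pulls the macro average down to about 0.52, even though six of seven predictions are correct.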

#### Without SMOTE

In [None]:
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline as SKPipeline

def build_model(model_type, model, param_grid, X_data, y_data, use_smote=False):
    """Grid-search a PCA -> (optional SMOTE) -> model pipeline; return the best estimator."""
    if use_smote:
        # The imblearn pipeline applies SMOTE during fitting only, inside each CV fold
        pipeline = ImbPipeline([
            ('pca', PCA(random_state=123)),
            ('smote', SMOTE(random_state=123)),
            ('model', model)
        ])
    else:
        pipeline = SKPipeline([
            ('pca', PCA(random_state=123)),
            ('model', model)
        ])

    # 10-fold cross-validation, ranking candidates by macro-averaged F1
    grid = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        scoring='f1_macro',
        cv=10,
        n_jobs=-1
    )

    grid.fit(X_data, y_data)
    print(f"Best parameters for {model_type}: {grid.best_params_}")
    print(f"Best F1 Macro Score for {model_type}: {grid.best_score_}")

    return grid.best_estimator_

# ---------- Decision Tree ----------
best_singlelabel_dt = build_model(
    'Decision Tree',
    DecisionTreeClassifier(random_state=123),  # fixed seed for reproducible trees
    {
        'pca__n_components': [10, 20, 50],
        'model__criterion': ['gini', 'entropy'],
        'model__max_depth': [5, 8, 13, 18, None],
        'model__min_samples_split': [2, 5],
        'model__min_samples_leaf': [1, 3],
        'model__class_weight': ['balanced', None],
        'model__min_impurity_decrease': [0.0, 0.001, 0.01]
    },
    X_train_scaled,
    y_train_encoded
)

# ---------- KNN ----------
best_singlelable_knn = build_model(
    'KNN',
    KNeighborsClassifier(),
    {
        'pca__n_components': [10, 20, 50],
        'model__n_neighbors': [1, 3, 5, 7, 9],
        'model__metric': ['euclidean', 'manhattan', 'cosine'],
        'model__weights': ['uniform', 'distance']
    },
    X_train_scaled,
    y_train_encoded
)

##### Predict validation set

In [None]:
# ---------- Decision Tree ----------
y_pred_valid_dt_best = best_singlelabel_dt.predict(X_valid_scaled)
show_evaluation_results("Decision Tree", y_pred_valid_dt_best, y_valid_encoded)

# ---------- KNN ----------
y_pred_valid_knn_best = best_singlelable_knn.predict(X_valid_scaled)
show_evaluation_results("KNN", y_pred_valid_knn_best, y_valid_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_pred_valid_dt_best, y_valid_encoded, labels_to_display, 'Validation Set')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_pred_valid_knn_best, y_valid_encoded, labels_to_display, 'Validation Set')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', best_singlelabel_dt, X_valid_scaled, y_valid, label_encoder, 'Validation Set')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', best_singlelable_knn, X_valid_scaled, y_valid, label_encoder, 'Validation Set')

##### Predict test set

In [None]:
# ---------- Decision Tree ----------
y_pred_test_dt_best = best_singlelabel_dt.predict(test_features_scaled)
show_evaluation_results("Decision Tree", y_pred_test_dt_best, test_labels_encoded)

# ---------- KNN ----------
y_pred_test_knn_best = best_singlelable_knn.predict(test_features_scaled)
show_evaluation_results("KNN", y_pred_test_knn_best, test_labels_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_pred_test_dt_best, test_labels_encoded, labels_to_display, 'Test Set')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_pred_test_knn_best, test_labels_encoded, labels_to_display, 'Test Set')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', best_singlelabel_dt, test_features_scaled, test_labels, label_encoder, 'Test Set')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', best_singlelable_knn, test_features_scaled, test_labels, label_encoder, 'Test Set')

#### With SMOTE

In [None]:
# ---------- Decision Tree ----------
best_singlelabel_dt_smote = build_model(
    'Decision Tree',
    DecisionTreeClassifier(random_state=123),  # fixed seed for reproducible trees
    {
        'pca__n_components': [10, 20, 50],
        'smote__k_neighbors': [3, 5, 7],
        'model__criterion': ['gini', 'entropy'],
        'model__max_depth': [5, 8, 13, 18, None],
        'model__min_samples_split': [2, 5],
        'model__min_samples_leaf': [1, 3],
        'model__class_weight': ['balanced', None],
        'model__min_impurity_decrease': [0.0, 0.001, 0.01]
    },
    X_train_scaled,
    y_train_encoded,
    True
)

# ---------- KNN ----------
best_singlelable_knn_smote = build_model(
    'KNN',
    KNeighborsClassifier(),
    {
        'pca__n_components': [10, 20, 50],
        'smote__k_neighbors': [3, 5, 7],
        'model__n_neighbors': [1, 3, 5, 7, 9],
        'model__metric': ['euclidean', 'manhattan', 'cosine'],
        'model__weights': ['uniform', 'distance']
    },
    X_train_scaled,
    y_train_encoded,
    True
)

##### Predict validation set

In [None]:
# ---------- Decision Tree ----------
y_pred_valid_dt_best_smote = best_singlelabel_dt_smote.predict(X_valid_scaled)
show_evaluation_results("Decision Tree", y_pred_valid_dt_best_smote, y_valid_encoded)

# ---------- KNN ----------
y_pred_valid_knn_best_smote = best_singlelable_knn_smote.predict(X_valid_scaled)
show_evaluation_results("KNN", y_pred_valid_knn_best_smote, y_valid_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_pred_valid_dt_best_smote, y_valid_encoded, labels_to_display, 'Validation Set')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_pred_valid_knn_best_smote, y_valid_encoded, labels_to_display, 'Validation Set')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', best_singlelabel_dt_smote, X_valid_scaled, y_valid, label_encoder, 'Validation Set')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', best_singlelable_knn_smote, X_valid_scaled, y_valid, label_encoder, 'Validation Set')

##### Predict test set

In [None]:
# ---------- Decision Tree ----------
y_pred_test_dt_best_smote = best_singlelabel_dt_smote.predict(test_features_scaled)
show_evaluation_results("Decision Tree", y_pred_test_dt_best_smote, test_labels_encoded)

# ---------- KNN ----------
y_pred_test_knn_best_smote = best_singlelable_knn_smote.predict(test_features_scaled)
show_evaluation_results("KNN", y_pred_test_knn_best_smote, test_labels_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_pred_test_dt_best_smote, test_labels_encoded, labels_to_display, 'Test Set')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_pred_test_knn_best_smote, test_labels_encoded, labels_to_display, 'Test Set')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', best_singlelabel_dt_smote, test_features_scaled, test_labels, label_encoder, 'Test Set')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', best_singlelable_knn_smote, test_features_scaled, test_labels, label_encoder, 'Test Set')

## Combine All Train and Test Images

### Checking label distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load multi-label data for label distribution check
train_features_ml, train_labels_ml = load_and_extract_features('train', (100, 50), use_raw_data=True, single_label=False)
test_features_ml, test_labels_ml = load_and_extract_features('test', (100, 50), use_raw_data=True, single_label=False)

label_cols = ['miner', 'phoma', 'rust']
train_perc = (train_labels_ml[label_cols].sum() / len(train_labels_ml) * 100).reset_index()
train_perc.columns = ['Label', 'Percentage']
train_perc['Dataset'] = 'Train'

test_perc = (test_labels_ml[label_cols].sum() / len(test_labels_ml) * 100).reset_index()
test_perc.columns = ['Label', 'Percentage']
test_perc['Dataset'] = 'Test'

combined_perc = pd.concat([train_perc, test_perc])

plt.figure(figsize=(10, 6))
sns.barplot(data=combined_perc, x='Label', y='Percentage', hue='Dataset', palette='magma')

plt.title('Label Percentage Comparison: Train vs Test')
plt.ylabel('Percentage of Samples (%)')
plt.show()

### Preprocessing Data

In [None]:
# Combine all images (use already loaded data)
all_features = np.vstack([train_features, test_features])
all_labels = pd.concat([train_labels, test_labels], axis=0).reset_index(drop=True)

X_all_train, X_all_test, y_all_train, y_all_test = train_test_split(
    all_features,
    all_labels,
    test_size=0.2,
    stratify=all_labels,
    random_state=123
)

y_all_train_encoded = label_encoder.transform(y_all_train)
y_all_test_encoded = label_encoder.transform(y_all_test)

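# Use a separate scaler so the one fitted on the original training split stays intact
scaler_all = StandardScaler()
X_all_train_scaled = scaler_all.fit_transform(X_all_train)
X_all_test_scaled = scaler_all.transform(X_all_test)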

### Building and Evaluating Models

#### Using the parameters described in the paper

##### Without SMOTE

In [None]:
# ---------- Decision Tree ----------
dt.fit(X_all_train_scaled, y_all_train_encoded)
y_all_pred_dt = dt.predict(X_all_test_scaled)
show_evaluation_results("Decision Tree", y_all_pred_dt, y_all_test_encoded)

# ---------- KNN ----------
knn.fit(X_all_train_scaled, y_all_train_encoded)
y_all_pred_knn = knn.predict(X_all_test_scaled)
show_evaluation_results("KNN", y_all_pred_knn, y_all_test_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_all_pred_dt, y_all_test_encoded, labels_to_display, 'Test Set Split from All Images')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_all_pred_knn, y_all_test_encoded, labels_to_display, 'Test Set Split from All Images')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', dt, X_all_test_scaled, y_all_test, label_encoder, 'Test Set Split from All Images')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', knn, X_all_test_scaled, y_all_test, label_encoder, 'Test Set Split from All Images')

##### With SMOTE

In [None]:
# ---------- Decision Tree ----------
pipeline_dt.fit(X_all_train_scaled, y_all_train_encoded)
y_all_pred_dt_smote = pipeline_dt.predict(X_all_test_scaled)
show_evaluation_results("Decision Tree", y_all_pred_dt_smote, y_all_test_encoded)

# ---------- KNN ----------
pipeline_knn.fit(X_all_train_scaled, y_all_train_encoded)
y_all_pred_knn_smote = pipeline_knn.predict(X_all_test_scaled)
show_evaluation_results("KNN", y_all_pred_knn_smote, y_all_test_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_all_pred_dt_smote, y_all_test_encoded, labels_to_display, 'Test Set Split from All Images')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_all_pred_knn_smote, y_all_test_encoded, labels_to_display, 'Test Set Split from All Images')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', pipeline_dt, X_all_test_scaled, y_all_test, label_encoder, 'Test Set Split from All Images')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', pipeline_knn, X_all_test_scaled, y_all_test, label_encoder, 'Test Set Split from All Images')

#### Hyperparameter Tuning

##### Without SMOTE

In [None]:
# ---------- Decision Tree ----------
best_all_singlelabel_dt = build_model(
    'Decision Tree',
    DecisionTreeClassifier(random_state=123),  # fixed seed for reproducible trees
    {
        'pca__n_components': [10, 20, 50],
        'model__criterion': ['gini', 'entropy'],
        'model__max_depth': [5, 8, 13, 18, None],
        'model__min_samples_split': [2, 5],
        'model__min_samples_leaf': [1, 3],
        'model__class_weight': ['balanced', None],
        'model__min_impurity_decrease': [0.0, 0.001, 0.01]
    },
    X_all_train_scaled,
    y_all_train_encoded
)

# ---------- KNN ----------
best_all_singlelabel_knn = build_model(
    'KNN',
    KNeighborsClassifier(),
    {
        'pca__n_components': [10, 20, 50],
        'model__n_neighbors': [1, 3, 5, 7, 9],
        'model__metric': ['euclidean', 'manhattan', 'cosine'],
        'model__weights': ['uniform', 'distance']
    },
    X_all_train_scaled,
    y_all_train_encoded
)

In [None]:
# ---------- Decision Tree ----------
y_all_pred_dt_best = best_all_singlelabel_dt.predict(X_all_test_scaled)
show_evaluation_results("Decision Tree", y_all_pred_dt_best, y_all_test_encoded)

# ---------- KNN ----------
y_all_pred_knn_best = best_all_singlelabel_knn.predict(X_all_test_scaled)
show_evaluation_results("KNN", y_all_pred_knn_best, y_all_test_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_all_pred_dt_best, y_all_test_encoded, labels_to_display, 'Test Set Split from All Images')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_all_pred_knn_best, y_all_test_encoded, labels_to_display, 'Test Set Split from All Images')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', best_all_singlelabel_dt, X_all_test_scaled, y_all_test, label_encoder, 'Test Set Split from All Images')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', best_all_singlelabel_knn, X_all_test_scaled, y_all_test, label_encoder, 'Test Set Split from All Images')

##### With SMOTE

In [None]:
# ---------- Decision Tree ----------
best_all_singlelabel_dt_smote = build_model(
    'Decision Tree',
    DecisionTreeClassifier(random_state=123),  # fixed seed for reproducible trees
    {
        'pca__n_components': [10, 20, 50],
        'smote__k_neighbors': [3, 5, 7],
        'model__criterion': ['gini', 'entropy'],
        'model__max_depth': [5, 8, 13, 18, None],
        'model__min_samples_split': [2, 5],
        'model__min_samples_leaf': [1, 3],
        'model__class_weight': ['balanced', None],
        'model__min_impurity_decrease': [0.0, 0.001, 0.01]
    },
    X_all_train_scaled,
    y_all_train_encoded,
    True
)

# ---------- KNN ----------
best_all_singlelable_knn_smote = build_model(
    'KNN',
    KNeighborsClassifier(),
    {
        'pca__n_components': [10, 20, 50],
        'smote__k_neighbors': [3, 5, 7],
        'model__n_neighbors': [1, 3, 5, 7, 9],
        'model__metric': ['euclidean', 'manhattan', 'cosine'],
        'model__weights': ['uniform', 'distance']
    },
    X_all_train_scaled,
    y_all_train_encoded,
    True
)

In [None]:
# ---------- Decision Tree ----------
y_all_pred_dt_best_smote = best_all_singlelabel_dt_smote.predict(X_all_test_scaled)
show_evaluation_results("Decision Tree", y_all_pred_dt_best_smote, y_all_test_encoded)

# ---------- KNN ----------
y_all_pred_knn_best_smote = best_all_singlelable_knn_smote.predict(X_all_test_scaled)
show_evaluation_results("KNN", y_all_pred_knn_best_smote, y_all_test_encoded)

In [None]:
# Confusion Matrix Heatmap
# ---------- Decision Tree ----------
plot_confusion_matrix_single_label('Decision Tree', y_all_pred_dt_best_smote, y_all_test_encoded, labels_to_display, 'Test Set Split from All Images')

# ---------- KNN ----------
plot_confusion_matrix_single_label('KNN', y_all_pred_knn_best_smote, y_all_test_encoded, labels_to_display, 'Test Set Split from All Images')

In [None]:
# ROC-AUC Curves
# ---------- Decision Tree ----------
plot_roc_curve_single_label('Decision Tree', best_all_singlelabel_dt_smote, X_all_test_scaled, y_all_test, label_encoder, 'Test Set Split from All Images')

# ---------- KNN ----------
plot_roc_curve_single_label('KNN', best_all_singlelable_knn_smote, X_all_test_scaled, y_all_test, label_encoder, 'Test Set Split from All Images')

## Save Models

To save the scikit-learn models, we use `joblib`, which handles objects containing large NumPy arrays more efficiently than the standard `pickle` module:

In [None]:
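import os
import joblib

# Create the output directory if it does not already exist
os.makedirs('models', exist_ok=True)
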
joblib.dump(dt, 'models/decision_tree_model_improved.pkl')
joblib.dump(knn, 'models/knn_model_improved.pkl')
joblib.dump(pipeline_dt, 'models/decision_tree_model_improved_smote.pkl')
joblib.dump(pipeline_knn, 'models/knn_model_improved_smote.pkl')
joblib.dump(best_singlelabel_dt, 'models/best_singlelabel_dt.pkl')
joblib.dump(best_singlelable_knn, 'models/best_singlelabel_knn.pkl')
joblib.dump(best_singlelabel_dt_smote, 'models/best_singlelabel_dt_smote.pkl')
joblib.dump(best_singlelable_knn_smote, 'models/best_singlelabel_knn_smote.pkl')
joblib.dump(best_all_singlelabel_dt, 'models/best_all_singlelabel_dt.pkl')
joblib.dump(best_all_singlelabel_knn, 'models/best_all_singlelabel_knn.pkl')
joblib.dump(best_all_singlelabel_dt_smote, 'models/best_all_singlelabel_dt_smote.pkl')
joblib.dump(best_all_singlelable_knn_smote, 'models/best_all_singlelabel_knn_smote.pkl')
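
A saved model expects inputs scaled the same way as its training data, so one option is to persist the fitted scaler alongside the estimator. The sketch below shows a save/load round trip on synthetic data. Bundling both objects in a dictionary is an assumption made for illustration, not what the cell above does, and all `*_demo` names and the file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(123)
X_save_demo = rng.normal(size=(40, 4))
y_save_demo = (X_save_demo[:, 0] > 0).astype(int)

# Fit a scaler and a model on the synthetic data.
scaler_demo = StandardScaler().fit(X_save_demo)
model_demo = DecisionTreeClassifier(random_state=123).fit(
    scaler_demo.transform(X_save_demo), y_save_demo
)
original_pred = model_demo.predict(scaler_demo.transform(X_save_demo))

# Save both objects together, then load them back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "demo_bundle.pkl")
    joblib.dump({"scaler": scaler_demo, "model": model_demo}, bundle_path)
    restored = joblib.load(bundle_path)

restored_pred = restored["model"].predict(restored["scaler"].transform(X_save_demo))
print(np.array_equal(original_pred, restored_pred))
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and with library versions compatible with those used to save them.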