# Coffee Leaf Diseases Prediction

## Overview
This notebook reproduces the coffee leaf disease classification method described in the research paper referenced below. It trains Decision Tree and K-Nearest Neighbors (KNN) classifiers on RGB and CMY color features extracted from leaf images.

## References

### Research Paper
- **Title**: Comparative Analysis of the Performance of the Decision Tree and K-Nearest Neighbors Methods in Classifying Coffee Leaf Diseases
- **Authors**: Adie Suryadi, Murhaban Murhaban, Rivansyah Suhendra
- **Affiliation**: Department of Information Technology, Teuku Umar University, Indonesia
- **URL**: [https://aptikom-journal.id/conferenceseries/article/view/649/272](https://aptikom-journal.id/conferenceseries/article/view/649/272)

### Dataset
- **Dataset**: Coffee Leaf Diseases
- **Source**: Kaggle
- **URL**: [https://www.kaggle.com/datasets/badasstechie/coffee-leaf-diseases/code](https://www.kaggle.com/datasets/badasstechie/coffee-leaf-diseases/code)

## Methodology
This implementation extracts color-based features from coffee leaf images:
- **RGB features**: Mean and standard deviation for each R, G, B channel (6 features)
- **CMY features**: Mean and standard deviation for each C, M, Y channel (6 features)
- **Total**: 12 color-based features per image
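
The feature list above can be sketched in NumPy as follows. This is a minimal illustration, not the notebook's actual `load_and_extract_features` helper, whose source is not shown here. The function name `extract_color_features`, the scaling of pixel values to [0, 1], and the complement conversion CMY = 1 - RGB are all assumptions.

```python
import numpy as np


def extract_color_features(image):
    """Return 12 color features for an H x W x 3 uint8 RGB image.

    Hypothetical helper: assumes 8-bit RGB input and the simple
    complement conversion C = 1 - R, M = 1 - G, Y = 1 - B.
    """
    rgb = np.asarray(image, dtype=np.float64) / 255.0  # scale to [0, 1]
    cmy = 1.0 - rgb                                     # complement of RGB

    pixels_rgb = rgb.reshape(-1, 3)  # one row per pixel
    pixels_cmy = cmy.reshape(-1, 3)

    return np.concatenate([
        pixels_rgb.mean(axis=0),  # mean of R, G, B
        pixels_rgb.std(axis=0),   # standard deviation of R, G, B
        pixels_cmy.mean(axis=0),  # mean of C, M, Y
        pixels_cmy.std(axis=0),   # standard deviation of C, M, Y
    ])


# Usage on a synthetic 4 x 4 image (random pixels, for illustration only)
rng = np.random.default_rng(123)
demo_image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
features = extract_color_features(demo_image)
print(features.shape)  # (12,)
```

Under this particular conversion the CMY means are one minus the RGB means and the CMY standard deviations equal the RGB ones, so a different CMY definition in the original helper would yield different values.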

The features are then used to classify coffee leaves into four categories:
- Miner
- Phoma
- Rust
- No disease

## Preprocessing Data

In [None]:
from utils import (
    load_and_extract_features,
    show_evaluation_results,
    plot_confusion_matrix_single_label,
    plot_roc_curve_single_label
)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder 

# Load and extract features with single label
train_features, train_labels = load_and_extract_features('train', (410, 205), single_label=True)
test_features, test_labels = load_and_extract_features('test', (410, 205), single_label=True)

X_train, X_valid, y_train, y_valid = train_test_split(
    train_features, 
    train_labels,
    test_size=0.2,
    stratify=train_labels,
    random_state=123
)

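# Encode class labels as integers; fit on the training split only
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_valid_encoded = label_encoder.transform(y_valid)
test_labels_encoded = label_encoder.transform(test_labels)

# Standardize features with statistics learned from the training split only,
# then apply the same transformation to the validation and test sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
test_features_scaled = scaler.transform(test_features)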

## Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# ---------- Decision Tree ----------
dt = DecisionTreeClassifier(max_features=None, random_state=123, splitter='best')
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(1, 21)) + [None],
    'min_samples_split': range(2, 11),
    'min_samples_leaf': range(1, 6)
}
grid_search_dt = GridSearchCV(dt, param_grid_dt, cv=10)
grid_search_dt.fit(X_train_scaled, y_train_encoded)

best_model_dt = grid_search_dt.best_estimator_
print(grid_search_dt.best_params_)
print(grid_search_dt.best_score_)

# ---------- KNN ----------
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, n_jobs=-1, p=2, weights='uniform')
param_grid_knn = {
    'metric': ['euclidean', 'manhattan'],
    'n_neighbors': range(1, 21),
}
grid_search_knn = GridSearchCV(knn, param_grid_knn, cv=10)
grid_search_knn.fit(X_train_scaled, y_train_encoded)

best_model_knn = grid_search_knn.best_estimator_
print(grid_search_knn.best_params_)
print(grid_search_knn.best_score_)

The grid search selected the following parameters for the Decision Tree:
- criterion: 'entropy'
- max_depth: 13
- min_samples_leaf: 1
- min_samples_split: 2

The grid search selected the following parameters for KNN:
- metric: 'euclidean'
- n_neighbors: 1
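
A single best setting hides how close the alternatives were. As a minimal sketch of how the runners-up could be inspected, the snippet below ranks the candidates through the `cv_results_` attribute of a fitted `GridSearchCV`. It runs on synthetic data from `make_classification`, an assumption made only to keep the example self-contained. In the notebook, the same lines would apply to `grid_search_knn` or `grid_search_dt`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 12 color features and 4 classes (illustration only)
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=6, n_classes=4, random_state=123
)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': range(1, 6)}, cv=5)
search.fit(X, y)

# Sort candidates by their cross-validation rank (1 = best mean score)
results = search.cv_results_
order = np.argsort(results['rank_test_score'])
top = [
    (results['params'][i], results['mean_test_score'][i], results['std_test_score'][i])
    for i in order[:3]
]

for params, mean, std in top:
    print(params, f"{mean:.3f} +/- {std:.3f}")
```

If the top candidates differ by less than their fold-to-fold standard deviation, the choice among them is not strongly supported by the cross-validation scores alone.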

## Find the Best Model
### Using the parameters described in the paper

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# ---------- Evaluation on Validation Set ----------
# ---------- Decision Tree ----------
dt = DecisionTreeClassifier(
    criterion='gini',
    max_depth=None,
    max_features=None,
    min_samples_leaf=1,
    min_samples_split=2,
    random_state=123,
    splitter='best'
)
dt.fit(X_train_scaled, y_train_encoded)
y_pred_valid_dt = dt.predict(X_valid_scaled)

show_evaluation_results('Decision Tree', y_pred_valid_dt, y_valid_encoded)

# ---------- KNN ----------
knn = KNeighborsClassifier(
    algorithm='auto',
    leaf_size=30,
    metric='minkowski',
    n_jobs=-1,
    n_neighbors=5,
    p=2,
    weights='uniform'
)
knn.fit(X_train_scaled, y_train_encoded)
y_pred_valid_knn = knn.predict(X_valid_scaled)

show_evaluation_results('KNN', y_pred_valid_knn, y_valid_encoded)

In [None]:
# ---------- Evaluation on Test Set ----------
# ---------- Decision Tree ----------
y_pred_test_dt = dt.predict(test_features_scaled)
show_evaluation_results('Decision Tree', y_pred_test_dt, test_labels_encoded)

# ---------- KNN ----------
y_pred_test_knn = knn.predict(test_features_scaled)
show_evaluation_results('KNN', y_pred_test_knn, test_labels_encoded)

#### Confusion Matrix Heatmap

In [None]:
# Class names in the same order as the encoded integer labels
labels = list(label_encoder.classes_)

# Decision Tree
plot_confusion_matrix_single_label('Decision Tree', y_pred_valid_dt, y_valid_encoded, labels, 'Validation Set')
plot_confusion_matrix_single_label('Decision Tree', y_pred_test_dt, test_labels_encoded, labels, 'Test Set')

# KNN
plot_confusion_matrix_single_label('KNN', y_pred_valid_knn, y_valid_encoded, labels, 'Validation Set')
plot_confusion_matrix_single_label('KNN', y_pred_test_knn, test_labels_encoded, labels, 'Test Set')

#### ROC-AUC Curves

In [None]:
# Decision Tree
plot_roc_curve_single_label('Decision Tree', dt, X_valid_scaled, y_valid, label_encoder, 'Validation Set')
plot_roc_curve_single_label('Decision Tree', dt, test_features_scaled, test_labels, label_encoder, 'Test Set')

# KNN
plot_roc_curve_single_label('KNN', knn, X_valid_scaled, y_valid, label_encoder, 'Validation Set')
plot_roc_curve_single_label('KNN', knn, test_features_scaled, test_labels, label_encoder, 'Test Set')

### Using the best parameters from CV

In [None]:
# ---------- Evaluation on Validation Set ----------
# ---------- Decision Tree ----------
y_pred_valid_dt_best = best_model_dt.predict(X_valid_scaled)
show_evaluation_results('Decision Tree', y_pred_valid_dt_best, y_valid_encoded)

# ---------- KNN ----------
y_pred_valid_knn_best = best_model_knn.predict(X_valid_scaled)
show_evaluation_results('KNN', y_pred_valid_knn_best, y_valid_encoded)

In [None]:
# ---------- Evaluation on Test Set ----------
# ---------- Decision Tree ----------
y_pred_test_dt_best = best_model_dt.predict(test_features_scaled)
show_evaluation_results('Decision Tree', y_pred_test_dt_best, test_labels_encoded)

# ---------- KNN ----------
y_pred_test_knn_best = best_model_knn.predict(test_features_scaled)
show_evaluation_results('KNN', y_pred_test_knn_best, test_labels_encoded)

#### Confusion Matrix Heatmap

In [None]:
# Decision Tree
plot_confusion_matrix_single_label('Decision Tree', y_pred_valid_dt_best, y_valid_encoded, labels, 'Validation Set')
plot_confusion_matrix_single_label('Decision Tree', y_pred_test_dt_best, test_labels_encoded, labels, 'Test Set')

# KNN
plot_confusion_matrix_single_label('KNN', y_pred_valid_knn_best, y_valid_encoded, labels, 'Validation Set')
plot_confusion_matrix_single_label('KNN', y_pred_test_knn_best, test_labels_encoded, labels, 'Test Set')

#### ROC-AUC Curves

In [None]:
# Decision Tree
plot_roc_curve_single_label('Decision Tree', best_model_dt, X_valid_scaled, y_valid, label_encoder, 'Validation Set')
plot_roc_curve_single_label('Decision Tree', best_model_dt, test_features_scaled, test_labels, label_encoder, 'Test Set')

# KNN
plot_roc_curve_single_label('KNN', best_model_knn, X_valid_scaled, y_valid, label_encoder, 'Validation Set')
plot_roc_curve_single_label('KNN', best_model_knn, test_features_scaled, test_labels, label_encoder, 'Test Set')

## Save Models

To save the scikit-learn models, we use `joblib`, which handles objects containing large NumPy arrays more efficiently than the standard `pickle` module:

In [None]:
import joblib

joblib.dump(best_model_knn, 'best_model_knn.pkl')
joblib.dump(best_model_dt, 'best_model_dt.pkl')
joblib.dump(dt, 'decision_tree_model.pkl')
joblib.dump(knn, 'knn_model.pkl')
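
Loading a saved model back is the mirror operation. The sketch below round-trips a small classifier through `joblib` in a temporary directory. The synthetic data and the file name `demo_model.pkl` are assumptions for illustration. Because the notebook's models were trained on standardized features and encoded labels, the fitted `scaler` and `label_encoder` would presumably need to be saved and restored the same way before predicting on new images.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 12 color features and 4 classes (illustration only)
X, y = make_classification(
    n_samples=100, n_features=12, n_informative=6, n_classes=4, random_state=123
)
model = DecisionTreeClassifier(random_state=123).fit(X, y)

# Write the model to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```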