# Coffee Leaf Diseases Prediction - Additional Models

## Overview
This notebook extends the baseline model (`coffee-leaf-diseases-prediction.ipynb`) by adding **Logistic Regression** and **Neural Network (MLP)** classifiers to the experimentation process.

## Approach
- **Baseline**: Decision Tree and KNN (from research paper)
- **Additional Models**: Logistic Regression and Neural Network (MLPClassifier)
- **Feature Extraction**: RGB/CMY color features (12 features per image)
- **Label Type**: Single-label classification

## Models
| Model | Source |
|-------|--------|
| Decision Tree | Research Paper (Baseline) |
| K-Nearest Neighbors | Research Paper (Baseline) |
| Logistic Regression | Additional |
| Neural Network (MLP) | Additional |

## Methodology
This implementation extracts color-based features from coffee leaf images:
- **RGB features**: Mean and standard deviation for each R, G, B channel (6 features)
- **CMY features**: Mean and standard deviation for each C, M, Y channel (6 features)
- **Total**: 12 color-based features per image

The features are then used to classify coffee leaves into four categories:
- Miner
- Phoma
- Rust
- No disease
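
As a rough illustration of the twelve features listed above, the sketch below computes the per-channel mean and standard deviation for RGB and CMY. It is not the notebook's `load_and_extract_features` (that lives in `utils` and is not shown here). The function name `extract_color_features`, the feature order, and the conversion CMY = 255 - RGB for 8-bit images are all assumptions.

```python
import numpy as np

def extract_color_features(image_rgb):
    """Hypothetical helper: 12 color features for one H x W x 3 8-bit RGB image."""
    rgb = np.asarray(image_rgb, dtype=np.float64)
    # Assumed conversion: CMY is the complement of 8-bit RGB.
    cmy = 255.0 - rgb
    features = []
    for channels in (rgb, cmy):
        pixels = channels.reshape(-1, 3)      # one row per pixel
        features.extend(pixels.mean(axis=0))  # three channel means
        features.extend(pixels.std(axis=0))   # three channel standard deviations
    return np.array(features)

# Tiny 2 x 2 image: top row pure red, bottom row black.
demo = np.array([[[255, 0, 0], [255, 0, 0]],
                 [[0, 0, 0], [0, 0, 0]]], dtype=np.uint8)
demo_features = extract_color_features(demo)
print(demo_features.shape)
```

Under this assumed conversion, each CMY mean is 255 minus the matching RGB mean and each CMY standard deviation equals the RGB one, so a different CMY definition in `utils` would change how much the second block of six features adds.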

## Workflow
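1. **Validation Set**: Tune each model with a 10-fold cross-validated grid search on 80% of the training data, then validate on the held-out 20%
2. **Test Set**: Evaluate the tuned models on the separate test data
3. **Combined Set**: Pool the train and test data, re-split 80/20, and repeat tuning and validation; the models from this stage are the ones saved at the end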

## Preprocessing Data

In [None]:
from utils import (
    load_and_extract_features,
    show_evaluation_results,
    plot_confusion_matrix_single_label,
    plot_roc_curve_single_label
)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## 1. Validation Set

In [None]:
# Load and extract features with single label
train_features, train_labels = load_and_extract_features('train', (410, 205), single_label=True)

X_train, X_valid, y_train, y_valid = train_test_split(
    train_features,
    train_labels,
    test_size=0.2,
    stratify=train_labels,
    random_state=123
)

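# Fit the label encoder on the training labels only; it is passed to the ROC helper later.
# The classifiers below are fit on the original (unencoded) labels.
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_valid_encoded = label_encoder.transform(y_valid)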

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
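
The `stratify` argument used in the split above keeps the class proportions the same in both parts, which matters when some diseases are rarer than others. The sketch below shows the effect on made-up labels; the class counts are invented for illustration and are not the dataset's real distribution.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented, imbalanced label list (100 samples) with dummy one-column features.
toy_labels = ["rust"] * 50 + ["miner"] * 30 + ["phoma"] * 10 + ["no disease"] * 10
toy_features = [[i] for i in range(len(toy_labels))]

_, _, toy_y_train, toy_y_valid = train_test_split(
    toy_features, toy_labels, test_size=0.2, stratify=toy_labels, random_state=123
)

# Each class keeps its 50/30/10/10 share in the 20-sample validation part.
valid_counts = Counter(toy_y_valid)
print(valid_counts)
```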

### Hyperparameter Tuning and Grid Search

In [None]:
# ---------- Decision Tree ----------
dt = DecisionTreeClassifier(max_features=None, random_state=123, splitter='best')
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(1, 21)) + [None],
    'min_samples_split': range(2, 11),
    'min_samples_leaf': range(1, 6)
}

grid_search_dt = GridSearchCV(dt, param_grid_dt, cv=10)
grid_search_dt.fit(X_train_scaled, y_train)

best_model_dt = grid_search_dt.best_estimator_
print("Decision Tree best params:", grid_search_dt.best_params_)
print("Decision Tree CV score:", grid_search_dt.best_score_)
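
To get a feel for the cost of this search, the sketch below counts the parameter combinations in the Decision Tree grid and multiplies by the ten folds. It copies the grid from the cell above; the variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from math import prod

# Same grid as the Decision Tree cell above.
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(1, 21)) + [None],
    'min_samples_split': range(2, 11),
    'min_samples_leaf': range(1, 6),
}

# A grid search tries every combination, so the count is the product of the list lengths.
n_candidates = prod(len(values) for values in param_grid_dt.values())
n_fits = n_candidates * 10  # cv=10 means ten fits per candidate
print(n_candidates, n_fits)  # 1890 18900
```

`GridSearchCV` then performs one more fit of the best candidate on the whole training split, because `refit=True` is the default. Trees on 12 features train quickly, but passing `n_jobs=-1` (as the Logistic Regression and Neural Network searches do) would spread these fits across cores.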

In [None]:
# ---------- KNN ----------
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, n_jobs=-1, p=2, weights='uniform')
param_grid_knn = {
    'metric': ['euclidean', 'manhattan'],
    'n_neighbors': range(1, 21),
}
grid_search_knn = GridSearchCV(knn, param_grid_knn, cv=10)
grid_search_knn.fit(X_train_scaled, y_train)

best_model_knn = grid_search_knn.best_estimator_
print("KNN best params:", grid_search_knn.best_params_)
print("KNN CV score:", grid_search_knn.best_score_)

In [None]:
# ---------- Logistic Regression ----------
lr = LogisticRegression(max_iter=1000, random_state=123)
param_grid_lr = {
    'solver': ['lbfgs', 'saga'],
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2']
}
grid_search_lr = GridSearchCV(lr, param_grid_lr, cv=10, scoring='accuracy', n_jobs=-1)
grid_search_lr.fit(X_train_scaled, y_train)

best_model_lr = grid_search_lr.best_estimator_
print("Logistic Regression best params:", grid_search_lr.best_params_)
print("Logistic Regression CV score:", grid_search_lr.best_score_)

In [None]:
# ---------- Neural Network ----------
nn = MLPClassifier(max_iter=5000, random_state=123)
param_grid_nn = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'solver': ['adam', 'lbfgs'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01]
}
grid_search_nn = GridSearchCV(estimator=nn, param_grid=param_grid_nn, cv=10, scoring='accuracy', n_jobs=-1)
grid_search_nn.fit(X_train_scaled, y_train)

best_model_nn = grid_search_nn.best_estimator_
print("Neural Network best params:", grid_search_nn.best_params_)
print("Neural Network CV score:", grid_search_nn.best_score_)

In [None]:
# Summary of GridSearch results
results = []
for name, gs in [("Decision Tree", grid_search_dt),
                 ("KNN", grid_search_knn),
                 ("Logistic Regression", grid_search_lr),
                 ("Neural Network", grid_search_nn)]:
    results.append({
        "Model": name,
        "Best Params": gs.best_params_,
        "CV Score": gs.best_score_
    })
pd.DataFrame(results)
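
The searches above run on features that were already scaled using statistics from the whole training split, so each cross-validation fold's held-out part has influenced the scaling. One common alternative, sketched below on synthetic data, is to put the scaler and the classifier in a `Pipeline` so that scaling is refit inside every fold. This is an optional variation, not what the notebook does, and the synthetic dataset merely stands in for the 12 color features and 4 classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the coffee leaf features).
X_demo, y_demo = make_classification(n_samples=200, n_features=12, n_informative=6,
                                     n_classes=4, random_state=123)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=123)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)  # raw features in; the scaler is fit per fold
valid_accuracy = search.score(X_va, y_va)
print(search.best_params_, valid_accuracy)
```

A side benefit is that `search.best_estimator_` bundles the fitted scaler with the model, so only one object needs to be saved and loaded.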

### Evaluation using the best parameters

In [None]:
# Predictions on validation set
y_pred_valid_dt_best = best_model_dt.predict(X_valid_scaled)
y_pred_valid_knn_best = best_model_knn.predict(X_valid_scaled)
y_pred_valid_lr_best = best_model_lr.predict(X_valid_scaled)
y_pred_valid_nn_best = best_model_nn.predict(X_valid_scaled)

# Show evaluation results
show_evaluation_results("Decision Tree", y_pred_valid_dt_best, y_valid)
show_evaluation_results("KNN", y_pred_valid_knn_best, y_valid)
show_evaluation_results("Logistic Regression", y_pred_valid_lr_best, y_valid)
show_evaluation_results("Neural Network", y_pred_valid_nn_best, y_valid)

In [None]:
# Metrics summary for validation set
results_valid = []
for name, y_pred in [
    ("Decision Tree", y_pred_valid_dt_best),
    ("KNN", y_pred_valid_knn_best),
    ("Logistic Regression", y_pred_valid_lr_best),
    ("Neural Network", y_pred_valid_nn_best)
]:
    results_valid.append({
        "Model": name,
        "Accuracy": accuracy_score(y_valid, y_pred),
        "Precision (micro)": precision_score(y_valid, y_pred, average='micro', zero_division=0),
        "Recall (micro)": recall_score(y_valid, y_pred, average='micro', zero_division=0),
        "F1-score (micro)": f1_score(y_valid, y_pred, average='micro', zero_division=0),
        "Precision (macro)": precision_score(y_valid, y_pred, average='macro', zero_division=0),
        "Recall (macro)": recall_score(y_valid, y_pred, average='macro', zero_division=0),
        "F1-score (macro)": f1_score(y_valid, y_pred, average='macro', zero_division=0),
    })
pd.DataFrame(results_valid)
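
In single-label multiclass classification, micro-averaged precision, recall, and F1 all reduce to accuracy, so the three micro columns in the table above repeat the Accuracy column. The macro columns weight every class equally and are the ones that expose a weak minority class. The toy labels below are made up to show this and are not notebook results.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented example: six "rust" leaves all correct, one of two "miner" leaves missed.
toy_true = ["rust"] * 6 + ["miner"] * 2
toy_pred = ["rust"] * 6 + ["rust", "miner"]

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_micro = [
    precision_score(toy_true, toy_pred, average="micro", zero_division=0),
    recall_score(toy_true, toy_pred, average="micro", zero_division=0),
    f1_score(toy_true, toy_pred, average="micro", zero_division=0),
]
toy_macro_recall = recall_score(toy_true, toy_pred, average="macro", zero_division=0)

print(toy_accuracy, toy_micro)  # all four values are 7/8 = 0.875
print(toy_macro_recall)         # (6/6 + 1/2) / 2 = 0.75
```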

### Confusion Matrix Heatmap

In [None]:
labels = sorted(y_valid.unique())

plot_confusion_matrix_single_label('Decision Tree', y_pred_valid_dt_best, y_valid, labels, 'Validation Set')
plot_confusion_matrix_single_label('KNN', y_pred_valid_knn_best, y_valid, labels, 'Validation Set')
plot_confusion_matrix_single_label('Logistic Regression', y_pred_valid_lr_best, y_valid, labels, 'Validation Set')
plot_confusion_matrix_single_label('Neural Network', y_pred_valid_nn_best, y_valid, labels, 'Validation Set')

### ROC-AUC Curves

In [None]:
plot_roc_curve_single_label('Decision Tree', best_model_dt, X_valid_scaled, y_valid, label_encoder, 'Validation Set')
plot_roc_curve_single_label('KNN', best_model_knn, X_valid_scaled, y_valid, label_encoder, 'Validation Set')
plot_roc_curve_single_label('Logistic Regression', best_model_lr, X_valid_scaled, y_valid, label_encoder, 'Validation Set')
plot_roc_curve_single_label('Neural Network', best_model_nn, X_valid_scaled, y_valid, label_encoder, 'Validation Set')
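
ROC curves are defined for binary problems, so a four-class plot is usually built one-vs-rest: each class in turn is treated as positive and scored with the model's predicted probability for that class. The internals of `plot_roc_curve_single_label` are not shown in this notebook, so the sketch below is only one plausible way such a helper could compute per-class AUC values; it uses synthetic data and omits the plotting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

# Synthetic stand-in data with 12 features and 4 classes (not the coffee leaf features).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, n_informative=6,
                                     n_classes=4, random_state=123)
model = LogisticRegression(max_iter=1000).fit(X_demo[:200], y_demo[:200])

proba = model.predict_proba(X_demo[200:])  # one column per class, ordered as model.classes_
y_bin = label_binarize(y_demo[200:], classes=model.classes_)  # one 0/1 column per class

per_class_auc = {}
for k, cls in enumerate(model.classes_):
    # Class k versus the rest: positives are samples whose true class is cls.
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])
    per_class_auc[cls] = auc(fpr, tpr)
print(per_class_auc)
```

This approach needs `predict_proba`, which all four classifiers in this notebook provide. For a fully grown Decision Tree or KNN with `n_neighbors=1` the probabilities are only 0 or 1, so their curves have very few points.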

## 2. Test Set

In [None]:
# Load test data
test_features, test_labels = load_and_extract_features('test', (410, 205), single_label=True)

y_test = test_labels
y_test_encoded = label_encoder.transform(y_test)
X_test_scaled = scaler.transform(test_features)

### Evaluation using the best parameters

In [None]:
# Predictions on test set
y_pred_test_dt_best = best_model_dt.predict(X_test_scaled)
y_pred_test_knn_best = best_model_knn.predict(X_test_scaled)
y_pred_test_lr_best = best_model_lr.predict(X_test_scaled)
y_pred_test_nn_best = best_model_nn.predict(X_test_scaled)

# Show evaluation results
show_evaluation_results("Decision Tree", y_pred_test_dt_best, y_test)
show_evaluation_results("KNN", y_pred_test_knn_best, y_test)
show_evaluation_results("Logistic Regression", y_pred_test_lr_best, y_test)
show_evaluation_results("Neural Network", y_pred_test_nn_best, y_test)

In [None]:
# Metrics summary for test set
results_test = []
for name, y_pred in [
    ("Decision Tree", y_pred_test_dt_best),
    ("KNN", y_pred_test_knn_best),
    ("Logistic Regression", y_pred_test_lr_best),
    ("Neural Network", y_pred_test_nn_best)
]:
    results_test.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision (micro)": precision_score(y_test, y_pred, average='micro', zero_division=0),
        "Recall (micro)": recall_score(y_test, y_pred, average='micro', zero_division=0),
        "F1-score (micro)": f1_score(y_test, y_pred, average='micro', zero_division=0),
        "Precision (macro)": precision_score(y_test, y_pred, average='macro', zero_division=0),
        "Recall (macro)": recall_score(y_test, y_pred, average='macro', zero_division=0),
        "F1-score (macro)": f1_score(y_test, y_pred, average='macro', zero_division=0),
    })
pd.DataFrame(results_test)

### Confusion Matrix Heatmap

In [None]:
labels = sorted(y_test.unique())

plot_confusion_matrix_single_label('Decision Tree', y_pred_test_dt_best, y_test, labels, 'Test Set')
plot_confusion_matrix_single_label('KNN', y_pred_test_knn_best, y_test, labels, 'Test Set')
plot_confusion_matrix_single_label('Logistic Regression', y_pred_test_lr_best, y_test, labels, 'Test Set')
plot_confusion_matrix_single_label('Neural Network', y_pred_test_nn_best, y_test, labels, 'Test Set')

### ROC-AUC Curves

In [None]:
plot_roc_curve_single_label('Decision Tree', best_model_dt, X_test_scaled, y_test, label_encoder, 'Test Set')
plot_roc_curve_single_label('KNN', best_model_knn, X_test_scaled, y_test, label_encoder, 'Test Set')
plot_roc_curve_single_label('Logistic Regression', best_model_lr, X_test_scaled, y_test, label_encoder, 'Test Set')
plot_roc_curve_single_label('Neural Network', best_model_nn, X_test_scaled, y_test, label_encoder, 'Test Set')

## 3. Combined Set (Train + Test)

In [None]:
# Load both train and test data
train_features, train_labels = load_and_extract_features('train', (410, 205), single_label=True)
test_features, test_labels = load_and_extract_features('test', (410, 205), single_label=True)

# Combine all features and labels
all_features = np.vstack([train_features, test_features])
all_labels = pd.concat([train_labels, test_labels], axis=0, ignore_index=True)

X_train, X_valid, y_train, y_valid = train_test_split(
    all_features,
    all_labels,
    test_size=0.2,
    stratify=all_labels,
    random_state=123
)

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_valid_encoded = label_encoder.transform(y_valid)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

### Hyperparameter Tuning and Grid Search

In [None]:
# ---------- Decision Tree ----------
dt = DecisionTreeClassifier(max_features=None, random_state=123, splitter='best')
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(1, 21)) + [None],
    'min_samples_split': range(2, 11),
    'min_samples_leaf': range(1, 6)
}
grid_search_dt = GridSearchCV(dt, param_grid_dt, cv=10)
grid_search_dt.fit(X_train_scaled, y_train)

best_model_dt = grid_search_dt.best_estimator_
print("Decision Tree best params:", grid_search_dt.best_params_)
print("Decision Tree CV score:", grid_search_dt.best_score_)

In [None]:
# ---------- KNN ----------
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, n_jobs=-1, p=2, weights='uniform')
param_grid_knn = {
    'metric': ['euclidean', 'manhattan'],
    'n_neighbors': range(1, 21),
}
grid_search_knn = GridSearchCV(knn, param_grid_knn, cv=10)
grid_search_knn.fit(X_train_scaled, y_train)

best_model_knn = grid_search_knn.best_estimator_
print("KNN best params:", grid_search_knn.best_params_)
print("KNN CV score:", grid_search_knn.best_score_)

In [None]:
# ---------- Logistic Regression ----------
lr = LogisticRegression(max_iter=1000, random_state=123)
param_grid_lr = {
    'solver': ['lbfgs', 'saga'],
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2']
}
grid_search_lr = GridSearchCV(lr, param_grid_lr, cv=10, scoring='accuracy', n_jobs=-1)
grid_search_lr.fit(X_train_scaled, y_train)

best_model_lr = grid_search_lr.best_estimator_
print("Logistic Regression best params:", grid_search_lr.best_params_)
print("Logistic Regression CV score:", grid_search_lr.best_score_)

In [None]:
# ---------- Neural Network ----------
nn = MLPClassifier(max_iter=5000, random_state=123)
param_grid_nn = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'solver': ['adam', 'lbfgs'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01]
}

grid_search_nn = GridSearchCV(nn, param_grid_nn, cv=10, scoring='accuracy', n_jobs=-1)
grid_search_nn.fit(X_train_scaled, y_train)

best_model_nn = grid_search_nn.best_estimator_
print("Neural Network best params:", grid_search_nn.best_params_)
print("Neural Network CV score:", grid_search_nn.best_score_)

In [None]:
# Summary of GridSearch results
results_combine = []
for name, gs in [("Decision Tree", grid_search_dt),
                 ("KNN", grid_search_knn),
                 ("Logistic Regression", grid_search_lr),
                 ("Neural Network", grid_search_nn)]:
    results_combine.append({
        "Model": name,
        "Best Params": gs.best_params_,
        "CV Score": gs.best_score_
    })

pd.DataFrame(results_combine)

### Evaluation using the best parameters

In [None]:
# Predictions on combined validation set
y_pred_valid_dt_best = best_model_dt.predict(X_valid_scaled)
y_pred_valid_knn_best = best_model_knn.predict(X_valid_scaled)
y_pred_valid_lr_best = best_model_lr.predict(X_valid_scaled)
y_pred_valid_nn_best = best_model_nn.predict(X_valid_scaled)

# Show evaluation results
show_evaluation_results("Decision Tree (Combine)", y_pred_valid_dt_best, y_valid)
show_evaluation_results("KNN (Combine)", y_pred_valid_knn_best, y_valid)
show_evaluation_results("Logistic Regression (Combine)", y_pred_valid_lr_best, y_valid)
show_evaluation_results("Neural Network (Combine)", y_pred_valid_nn_best, y_valid)

In [None]:
# Metrics summary for combined validation set
results_valid_combine = []
for name, y_pred in [
    ("Decision Tree", y_pred_valid_dt_best),
    ("KNN", y_pred_valid_knn_best),
    ("Logistic Regression", y_pred_valid_lr_best),
    ("Neural Network", y_pred_valid_nn_best)
]:
    results_valid_combine.append({
        "Model": name,
        "Accuracy": accuracy_score(y_valid, y_pred),
        "Precision (micro)": precision_score(y_valid, y_pred, average='micro', zero_division=0),
        "Recall (micro)": recall_score(y_valid, y_pred, average='micro', zero_division=0),
        "F1-score (micro)": f1_score(y_valid, y_pred, average='micro', zero_division=0),
        "Precision (macro)": precision_score(y_valid, y_pred, average='macro', zero_division=0),
        "Recall (macro)": recall_score(y_valid, y_pred, average='macro', zero_division=0),
        "F1-score (macro)": f1_score(y_valid, y_pred, average='macro', zero_division=0),
    })
pd.DataFrame(results_valid_combine)

### Confusion Matrix Heatmap

In [None]:
labels = sorted(y_valid.unique())

plot_confusion_matrix_single_label('Decision Tree', y_pred_valid_dt_best, y_valid, labels, 'Combined Valid')
plot_confusion_matrix_single_label('KNN', y_pred_valid_knn_best, y_valid, labels, 'Combined Valid')
plot_confusion_matrix_single_label('Logistic Regression', y_pred_valid_lr_best, y_valid, labels, 'Combined Valid')
plot_confusion_matrix_single_label('Neural Network', y_pred_valid_nn_best, y_valid, labels, 'Combined Valid')

### ROC-AUC Curves

In [None]:
plot_roc_curve_single_label('Decision Tree', best_model_dt, X_valid_scaled, y_valid, label_encoder, 'Combined Valid')
plot_roc_curve_single_label('KNN', best_model_knn, X_valid_scaled, y_valid, label_encoder, 'Combined Valid')
plot_roc_curve_single_label('Logistic Regression', best_model_lr, X_valid_scaled, y_valid, label_encoder, 'Combined Valid')
plot_roc_curve_single_label('Neural Network', best_model_nn, X_valid_scaled, y_valid, label_encoder, 'Combined Valid')

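## Save Models

We save the scikit-learn models with `joblib`, which handles objects containing large NumPy arrays more efficiently than plain `pickle`. The fitted scaler and label encoder are saved as well, because new data must go through the same preprocessing before prediction: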

In [None]:
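import os
import joblib

# Create the output directory first; joblib.dump does not create missing folders
os.makedirs('models', exist_ok=True)

# These are the models, scaler, and label encoder fit in the Combined Set section
joblib.dump(best_model_dt, 'models/best_model_dt.pkl')
joblib.dump(best_model_knn, 'models/best_model_knn.pkl')
joblib.dump(best_model_lr, 'models/best_model_lr.pkl')
joblib.dump(best_model_nn, 'models/best_model_nn.pkl')
joblib.dump(scaler, 'models/scaler.pkl')
joblib.dump(label_encoder, 'models/label_encoder.pkl')

print("Models saved successfully!")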
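
For completeness, here is a minimal sketch of the reverse direction: loading a saved model and scaler with `joblib.load` and predicting on new features. It uses toy two-feature data and a temporary directory instead of the notebook's `models/` folder, so every name and value in it is illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the fitted scaler and model (fit on raw string labels, as in the notebook).
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
toy_y = np.array(["miner", "miner", "rust", "rust"])
toy_scaler = StandardScaler().fit(toy_X)
toy_model = KNeighborsClassifier(n_neighbors=1).fit(toy_scaler.transform(toy_X), toy_y)

with tempfile.TemporaryDirectory() as model_dir:
    joblib.dump(toy_model, os.path.join(model_dir, "model.pkl"))
    joblib.dump(toy_scaler, os.path.join(model_dir, "scaler.pkl"))

    # Later, or in another process: load both objects back.
    loaded_model = joblib.load(os.path.join(model_dir, "model.pkl"))
    loaded_scaler = joblib.load(os.path.join(model_dir, "scaler.pkl"))

# New samples must be scaled with the saved scaler before prediction.
new_sample = np.array([[4.8, 5.1]])
prediction = loaded_model.predict(loaded_scaler.transform(new_sample))
print(prediction)
```

Files written by `joblib` should be loaded with a scikit-learn version compatible with the one that created them, and only from sources you trust, since loading can execute arbitrary code.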