# Problem Definition

## Project Title: Agricultural Plant Health Classification

### Context
Early detection of plant diseases is crucial for maintaining crop health and maximizing yield. Manually inspecting plants is time-consuming and error-prone. This project focuses on developing a classification system to automatically distinguish between healthy and unhealthy plants based on their photographs.

### Objective
Develop a machine learning model to automatically classify plant images into:
- **Healthy Plants**: No disease or stress
- **Unhealthy Plants**: Showing signs of disease, pest damage, or stress

### Expected Outcome
A working proof-of-concept system that predicts the health status of a plant based on images.

### Application
Farmers and agricultural experts can use this system for timely intervention to prevent crop damage.

# Data Collection

### Source
- A publicly available dataset from Mendeley Data: [Bangladesh Dataset](https://data.mendeley.com/datasets/3wby28tkcp/2)


## Dataset Selection and Preparation

### Focus Crop: Bean

The dataset used in this project is the **Vegetables Dataset**, which contains images of four different vegetable crops, each divided into healthy and unhealthy categories:
```
Vegetables Dataset/
├── Malabar/
│ ├── Healthy/
│ └── Unhealthy/
├── Brinjal/
│ ├── Healthy/
│ └── Unhealthy/
├── Cauliflower/
│ ├── Healthy/
│ └── Unhealthy/
└── Bean/
  ├── Healthy/
  └── Unhealthy/
```

For this **proof-of-concept project**, we focus only on the **Bean** crop, performing a **binary classification**:

- **Healthy Beans → label 0**  
- **Unhealthy Beans → label 1**

### Reasoning

Using only a single crop type has several advantages:

1. **Reduces variability**: Different crops have different leaf shapes, sizes, and colors. Combining all crops into one model can confuse the classifier and reduce predictive accuracy.
2. **Simplifies the baseline model**: By focusing on Bean, the model learns patterns specific to this crop, allowing us to establish a strong baseline before expanding to multiple crops.
3. **Demonstrates ML workflow**: The goal is to show end-to-end model building — loading data, preprocessing, feature extraction, training, and evaluation — on a manageable and interpretable subset.

### Folder Structure for Bean

The **Bean crop images** are organized as follows:
```
data/
└── Bean/
    ├── Healthy/ # Images of healthy bean plants
    └── Unhealthy/ # Images of diseased or stressed bean plants
```

Each subfolder corresponds to a **class label**, which will be used during model training.

 

By focusing on this single crop type, the model can learn clear distinguishing patterns between healthy and unhealthy plants, making this a clean **baseline for future expansion** to multi-class or multi-crop classification.
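
Because each subfolder name doubles as a class label, a quick count of files per folder is a cheap sanity check before any modeling. The sketch below is illustrative only: the helper name `count_images_per_class`, the `data/Bean` path, and the list of image extensions are assumptions, not part of the original notebook.

```python
import os

# Assumed set of image extensions; adjust to match the actual dataset files
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(base_dir, categories=("Healthy", "Unhealthy")):
    """Return {category: number of image files} for each class subfolder."""
    counts = {}
    for category in categories:
        folder = os.path.join(base_dir, category)
        if not os.path.isdir(folder):
            # A missing folder is reported as zero images rather than an error
            counts[category] = 0
            continue
        counts[category] = sum(
            1
            for name in os.listdir(folder)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
        )
    return counts


# Hypothetical location; prints zeros if the folder does not exist
print(count_images_per_class("data/Bean"))
```

Running a check like this before training makes a skewed class distribution visible early.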



# Data Preprocessing

In this section, we prepare the Bean plant images for machine learning. Since we are performing **binary classification** (Healthy vs. Unhealthy), we need to convert the raw images into a format suitable for a **classical ML model** like Logistic Regression.  

Steps taken:

1. **Image Resizing**  
   - All images are resized to a uniform size (64x64 pixels) to ensure consistency across the dataset.  
   - Resizing helps reduce computational load while retaining enough detail for the model to learn.

2. **Color Conversion and Flattening**  
   - Images are converted to RGB to ensure all have 3 color channels.  
   - Each image is flattened into a 1-dimensional array of pixel values so that classical ML models can process them.  
   - Flattening turns a 64x64x3 image into a 12,288-length feature vector.

3. **Label Encoding**  
   - Folder names (`Healthy` and `Unhealthy`) are mapped to numeric labels:  
     - Healthy - 0  
     - Unhealthy - 1  

4. **Feature and Label Arrays**  
   - All processed images are stored in `X` (features).  
   - Corresponding labels are stored in `y` (target).  

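5. **Verification**  
   - We check the shape of `X` and `y` to ensure the data is loaded correctly:  
     - `X.shape` - `(number_of_images, 12288)`  
     - `y.shape` - `(number_of_images,)`  

6. **Normalization**  
   - Pixel values are divided by 255 so that every feature lies in the range [0, 1].  
   - In the code this runs right after the arrays are built, before the shapes are printed.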

This preprocessing ensures the images are ready for training a **binary classification model** while keeping the workflow simple and interpretable.
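
The flattening arithmetic can be checked on a single synthetic image. This is a minimal sketch that uses random pixels as a stand-in for a real resized photo (an assumption made only so the example runs without the dataset). It also shows that the flattened vector can be reshaped back into an image without loss.

```python
import numpy as np

IMG_SIZE = 64

# Stand-in for one resized RGB image: random pixel values, not real data
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)

# Flatten to a 1-D feature vector and scale pixel values to [0, 1]
features = image.flatten() / 255.0

# Undo the scaling and reshape back to height x width x channels
restored = (features * 255.0).round().astype(np.uint8).reshape(IMG_SIZE, IMG_SIZE, 3)

print(features.shape)  # (12288,)
print(np.array_equal(image, restored))
```

The pixel values survive the round trip, but a model that sees only the 1-D vector has no notion of which features were neighbouring pixels.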


In [2]:
import os
import cv2
import numpy as np
from tqdm import tqdm # adds a nice loading bar

# 1. Define Paths
BASE_PATH = '/home/shaddy/Downloads/Dataset on Bangladeshi Healthy and Unhealthy Veget/Vegetables/Bean'
CATEGORIES = ['Healthy', 'Unhealthy']
IMG_SIZE = 64

X = []
y = []

# 2. The Wrangling Loop
print("Starting Image Processing...")
for category in CATEGORIES:
    path = os.path.join(BASE_PATH, category)
    label = CATEGORIES.index(category) # Healthy=0, Unhealthy=1
    
    for img_name in tqdm(os.listdir(path), desc=f"Loading {category}"):
        try:
            # Read and Convert
            img_path = os.path.join(path, img_name)
            img = cv2.imread(img_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            
            # Step 1: Resize
            img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
            
            # Step 2: Flattening (64*64*3 = 12288)
            flattened_img = img.flatten()
            
            X.append(flattened_img)
            y.append(label)
        except Exception as e:
            # Report unreadable files instead of silently skipping them
            print(f"Skipping {img_name}: {e}")

# Step 4: Convert to Arrays
X = np.array(X)
y = np.array(y)

# Step 6: Normalization
X = X / 255.0

# Step 5: Verification
print("\n--- Verification ---")
print(f"Features shape (X): {X.shape}") # Expect (num_images, 12288)
print(f"Labels shape (y): {y.shape}")     # Expect (num_images,)
print(f"Sample Label (first image): {y[0]}")

Starting Image Processing...


Loading Healthy: 100%|████████████████████████| 632/632 [01:52<00:00,  5.64it/s]
Loading Unhealthy: 100%|████████████████████████| 13/13 [00:02<00:00,  6.36it/s]


--- Verification ---
Features shape (X): (645, 12288)
Labels shape (y): (645,)
Sample Label (first image): 0





**Note on Class Imbalance**

After loading the images, we notice that the dataset is highly imbalanced:

- Healthy: 632 images
- Unhealthy: 13 images

This imbalance can affect model performance, especially on the minority class.  
In a production scenario, techniques such as **class weighting**, **oversampling the minority class**, or **collecting more Unhealthy images** could help mitigate this issue.  

For this proof-of-concept, we proceed with the dataset as-is, but the imbalance is noted for consideration during evaluation.
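
To give a sense of scale for class weighting, the sketch below computes weights with the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` is illustrative, and the label list is rebuilt from the counts above rather than from the real `y` array.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Labels rebuilt from the reported counts: 632 Healthy (0), 13 Unhealthy (1)
labels = [0] * 632 + [1] * 13
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

With these counts, each Unhealthy image carries roughly 49 times the weight of a Healthy one. Weighting rescales the loss, but it cannot add variety to a class that has only 13 examples.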


# Modeling

We train a Logistic Regression model to classify Bean plant images as Healthy (0) or Unhealthy (1).

- The training and test sets are split with an 80/20 ratio, preserving class distribution (`stratify=y`).
- Logistic Regression is used with `class_weight='balanced'` to partially account for the imbalanced dataset.
- Evaluation uses Accuracy, the Confusion Matrix, and classification metrics (Precision, Recall, F1-score) to understand model performance on both classes.


## Train/Test Split

We split the data into training and testing sets so that performance can be evaluated on images the model has not seen:

In [3]:
from sklearn.model_selection import train_test_split

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)


Training set: (516, 12288) (516,)
Test set: (129, 12288) (129,)
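
A back-of-the-envelope estimate shows how few Unhealthy images end up in each split. The helper below simply rounds each class count times the test fraction. It is a rough approximation for intuition, not scikit-learn's exact stratified allocation, and the name `approx_stratified_counts` is hypothetical.

```python
def approx_stratified_counts(class_counts, test_size=0.2):
    """Rough per-class train/test counts under a stratified split."""
    result = {}
    for cls, n in class_counts.items():
        n_test = round(n * test_size)  # simple rounding, for intuition only
        result[cls] = {"train": n - n_test, "test": n_test}
    return result


split = approx_stratified_counts({"Healthy": 632, "Unhealthy": 13})
print(split)
```

With only about 3 Unhealthy images in the test set, minority-class recall can only take the values 0, 1/3, 2/3, or 1, so a single prediction moves it by a third.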


### Logistic Regression Implementation
- Logistic Regression is trained on the flattened pixel features with `max_iter=1000` and `class_weight='balanced'`.

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize Logistic Regression
lr = LogisticRegression(max_iter=1000, class_weight='balanced')  # 'balanced' helps with class imbalance

# Train
lr.fit(X_train, y_train)

# Predict
y_pred = lr.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.9457364341085271

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.97      0.97       126
           1       0.00      0.00      0.00         3

    accuracy                           0.95       129
   macro avg       0.49      0.48      0.49       129
weighted avg       0.95      0.95      0.95       129


Confusion Matrix:
 [[122   4]
 [  3   0]]


# Model Evaluation

The Logistic Regression model achieved 94.6% accuracy. However, with only 3 Unhealthy images in the test set (against 126 Healthy), the model fails to correctly classify any Unhealthy plants, even with `class_weight='balanced'`.

- All 3 Unhealthy plants in the test set were misclassified as Healthy, and 4 Healthy plants were flagged as Unhealthy.
- Class weighting alone was not enough here. Other options include oversampling the minority class or, better, collecting more Unhealthy images.
- While overall accuracy is high, Recall for the Unhealthy class is 0%. Each missed Unhealthy plant is a false negative, which in an agricultural context is the most costly error because a diseased plant goes untreated. The result suggests the model is mostly reflecting the class distribution of the data rather than learning features of the disease.

For this proof-of-concept, I document the results and note that the model performs well on the majority class but poorly on the minority class.
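
The metrics for the Unhealthy class can be recomputed by hand from the confusion matrix, which makes the gap between accuracy and recall explicit. This sketch assumes the scikit-learn layout (rows are true labels, columns are predicted labels) with Unhealthy (1) as the positive class. The helper name `binary_metrics` is illustrative.

```python
def binary_metrics(cm):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Confusion matrix reported for Logistic Regression above
lr_cm = [[122, 4], [3, 0]]
print(binary_metrics(lr_cm))
```

The accuracy term is dominated by the 122 true negatives, while every Unhealthy-class metric is zero because there are no true positives.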


# Trying Other Classifiers

### SVM Classifier

In [5]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize SVM
svm_model = SVC(
    kernel='rbf',          # Non-linear decision boundary
    class_weight='balanced',  # Handle class imbalance
    random_state=42
)

# Train
svm_model.fit(X_train, y_train)

# Predict
y_pred_svm = svm_model.predict(X_test)

# Evaluation
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
# zero_division=0 reports 0.00 without a warning when a class is never predicted
print("\nClassification Report (SVM):\n", classification_report(y_test, y_pred_svm, zero_division=0))
print("Confusion Matrix (SVM):\n", confusion_matrix(y_test, y_pred_svm))


SVM Accuracy: 0.9767441860465116

Classification Report (SVM):
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       126
           1       0.00      0.00      0.00         3

    accuracy                           0.98       129
   macro avg       0.49      0.50      0.49       129
weighted avg       0.95      0.98      0.97       129

Confusion Matrix (SVM):
 [[126   0]
 [  3   0]]




### Random Forest Classifier

In [6]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(
    n_estimators=200,
    class_weight='balanced',
    random_state=42
)

# Train
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluation
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
# zero_division=0 reports 0.00 without a warning when a class is never predicted
print("\nClassification Report (RF):\n", classification_report(y_test, y_pred_rf, zero_division=0))
print("Confusion Matrix (RF):\n", confusion_matrix(y_test, y_pred_rf))


Random Forest Accuracy: 0.9767441860465116

Classification Report (RF):
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       126
           1       0.00      0.00      0.00         3

    accuracy                           0.98       129
   macro avg       0.49      0.50      0.49       129
weighted avg       0.95      0.98      0.97       129

Confusion Matrix (RF):
 [[126   0]
 [  3   0]]




## Model Comparisons

I evaluated multiple classification models to assess their performance on the plant health dataset. Due to class imbalance, accuracy alone is not sufficient, so precision, recall, F1-score, and confusion matrices are also considered.

#### Logistic Regression

**Accuracy:** 94.6%

The Logistic Regression model serves as the baseline classifier.

**Observations from Confusion Matrix:**
- Performs well on the majority class (Healthy)
- Fails to correctly classify any samples from the minority class (Unhealthy)
- High accuracy is misleading due to severe class imbalance

#### Support Vector Machine (SVM)

**Accuracy:** 97.7%

**Observations from Confusion Matrix:**
- Correctly classifies all healthy samples
- Does not detect any unhealthy samples
- Demonstrates stronger bias toward the majority class

#### Random Forest Classifier

**Accuracy:** 97.7%

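**Observations from Confusion Matrix:**
- Correctly classifies all 126 Healthy samples, matching the SVM
- Still fails to identify any of the 3 Unhealthy samples
- The accuracy gain over Logistic Regression comes entirely from the majority class and does not improve minority-class recall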


### Key Insight

Although all models achieve high accuracy, none successfully classify the minority (unhealthy) class. This highlights the limitations of traditional machine learning models when applied to highly imbalanced image datasets.

### Conclusion

- Accuracy alone is not a reliable metric for imbalanced classification problems
- Traditional classifiers struggle with raw image features
- Class weighting was already applied and did not help on its own; approaches such as oversampling, data augmentation, or Convolutional Neural Networks (CNNs) are likely required for better performance
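
Random oversampling, mentioned earlier as one way to handle the imbalance, can be sketched as follows. This is a minimal illustration and was not run in this project. The helper name `oversample_minority` is hypothetical, and the training counts (506 Healthy, 10 Unhealthy) are derived from the reported split rather than read from `y_train`.

```python
import random
from collections import Counter


def oversample_minority(labels, seed=42):
    """Return sample indices in which every class appears as often as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    target = max(len(ix) for ix in by_class.values())
    indices = []
    for ix in by_class.values():
        indices.extend(ix)  # keep every original sample once
        # Draw extra samples with replacement until the class reaches the target
        indices.extend(rng.choices(ix, k=target - len(ix)))
    rng.shuffle(indices)
    return indices


# Assumed training labels: 506 Healthy (0) and 10 Unhealthy (1)
train_labels = [0] * 506 + [1] * 10
resampled = oversample_minority(train_labels)
print(Counter(train_labels[i] for i in resampled))
```

Oversampling should be applied to the training split only. Duplicating images before the split would leak copies of the same photo into the test set and inflate the scores.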


### Trying a simple CNN

In [8]:
import tensorflow as tf
from tensorflow.keras import layers, models

# 1. Reshape data back to images (since CNNs need 3D shapes, not flat lines)
X_train_cnn = X_train.reshape(-1, 64, 64, 3)
X_test_cnn = X_test.reshape(-1, 64, 64, 3)

# 2. Build a Simple CNN
cnn_model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid') # Binary output
])

# 3. Compile and Train
cnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Use class_weight to help with the imbalance
cnn_model.fit(X_train_cnn, y_train, epochs=10, class_weight={0: 1, 1: 50})



Epoch 1/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 8s 216ms/step - accuracy: 0.7364 - loss: 1.6412
Epoch 2/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 4s 218ms/step - accuracy: 0.9806 - loss: 1.3875
Epoch 3/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 4s 231ms/step - accuracy: 0.9806 - loss: 1.4007
Epoch 4/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 3s 196ms/step - accuracy: 0.5097 - loss: 1.3513
Epoch 5/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 4s 207ms/step - accuracy: 0.5659 - loss: 1.3505
Epoch 6/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 4s 211ms/step - accuracy: 0.9787 - loss: 1.3544
Epoch 7/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 4s 220ms/step - accuracy: 0.9535 - loss: 1.3368
Epoch 8/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 4s 209ms/step - accuracy: 0.9089 - loss: 1.2207
Epoch 9/10
17/17 ━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7987d83e7f20>

## Conclusion, Reflection, and Challenges

### Conclusion

This project demonstrates an end-to-end workflow for binary image classification of Bean plant health, progressing from classical machine learning baselines to an exploratory deep learning approach.

The objective was to classify images as **Healthy (0)** or **Unhealthy (1)** to support early detection of plant stress or disease.


### Dataset and Preprocessing Summary

- Images were loaded from a folder-based structure (`Healthy`, `Unhealthy`) using OpenCV.
- All images were:
  - Converted to RGB
  - Resized to **64×64**
  - Normalized to the range **[0, 1]**
- For classical machine learning models, images were **flattened into 1D feature vectors**.
- Final dataset shape:
  - `X.shape = (645, 12288)`
  - `y.shape = (645,)`
- A **severe class imbalance** was observed:
  - Healthy: 632 samples
  - Unhealthy: 13 samples

This imbalance strongly influenced model behavior and evaluation.


### Modeling Approach

To align with my current learning stage and ensure clarity, I followed a **progressive modeling strategy**:

#### Classical Machine Learning (Baseline)
- **Logistic Regression**
- **Support Vector Machine (SVM)**
- **Random Forest Classifier**

These models were chosen to:
- Establish interpretable baselines
- Apply concepts learned from regression (data splitting, scaling, evaluation)
- Highlight limitations when applied to image data

#### Deep Learning (Exploratory)
- Implemented a **simple Convolutional Neural Network (CNN)** to assess the suitability of deep learning for this task.
- The CNN serves as a proof-of-concept, not a final production model.


### Model Evaluation and Insights

#### Classical Models
- All classical models achieved **high overall accuracy (94–98%)**.
- However, **none successfully detected the Unhealthy class**.
- Confusion matrices and classification reports showed:
  - Perfect or near-perfect performance on the majority class
  - Zero recall for the minority class

This reinforces an important lesson:
 **Accuracy alone is misleading for imbalanced classification problems.**

#### CNN Experiment
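- CNNs can learn spatial features directly from images, but this run reports training metrics only; the model was not evaluated on the test set.
- Training accuracy fluctuated significantly across the visible epochs (roughly 0.51 to 0.98).
- Loss stayed high (about 1.2 to 1.6) and did not settle, likely due to:
  - Small dataset size
  - Extreme class imbalance
  - Lack of validation split and regularization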

The CNN results highlight both the potential and sensitivity of deep learning models when applied to limited data.
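
Evaluating the CNN with the same confusion-matrix metrics as the classical models would first require turning its sigmoid outputs into class labels. The sketch below shows only that thresholding step, on made-up probabilities. In practice the probabilities would come from something like `cnn_model.predict(X_test_cnn).ravel()`, which was not run in this notebook, and the helper name `probs_to_labels` is hypothetical.

```python
def probs_to_labels(probs, threshold=0.5):
    """Map sigmoid outputs to labels: 1 (Unhealthy) at or above the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probs]


# Made-up probabilities for illustration, not model output
example_probs = [0.02, 0.40, 0.55, 0.91]

print(probs_to_labels(example_probs))                 # [0, 0, 1, 1]
print(probs_to_labels(example_probs, threshold=0.3))  # [0, 1, 1, 1]
```

Lowering the threshold trades more false alarms for fewer missed Unhealthy plants, which may be acceptable when false negatives are the costlier error.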


### Reflection and Learning Outcomes

This project strengthened my understanding of:

- How classification workflows closely mirror regression workflows:
  - Data preprocessing
  - Train/test splitting
  - Model selection
  - Evaluation and interpretation
- The importance of choosing metrics beyond accuracy
- The practical limitations of classical ML on image data
- Why CNNs are the preferred approach for computer vision tasks, even though they require:
  - More data
  - Careful tuning
  - Regularization strategies


### Challenges Encountered

#### 1. Data Availability and Imbalance
- Locating and isolating Bean images from a larger dataset required manual inspection.
- The extreme imbalance (13 Unhealthy samples against 632 Healthy) limited meaningful learning for the minority class.

#### 2. Transition from Regression to Classification
- My prior experience was primarily regression-focused.
- I had to quickly learn:
  - Binary classification concepts
  - Precision, recall, F1-score
  - Confusion matrix interpretation

#### 3. Image-to-Feature Conversion
- Classical models required images to be flattened, which:
  - Loses spatial information
  - Motivated exploration of CNNs
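
A toy example makes the loss of spatial information concrete. Two tiny images contain the same vertical stripe, shifted by one pixel. Once flattened, their feature vectors share no active positions, so a model working on raw pixel positions sees them as unrelated. The 4×4 size and the helper names are illustrative assumptions.

```python
def flatten(image):
    """Flatten a 2-D list of pixels into a 1-D list, row by row."""
    return [pixel for row in image for pixel in row]


def stripe_image(size, column):
    """Square image with a single bright vertical stripe at the given column."""
    return [[1 if c == column else 0 for c in range(size)] for _ in range(size)]


a = flatten(stripe_image(4, 1))
b = flatten(stripe_image(4, 2))  # same stripe, shifted one pixel to the right

# Dot product counts positions where both vectors are bright
overlap = sum(x * y for x, y in zip(a, b))
print(overlap)  # 0
```

Convolutional layers slide the same filter across the image, so a shifted stripe still triggers the same feature detector.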

Due to time constraints and my current learning stage, this project intentionally focuses on:
- Building a strong classical ML baseline
- Demonstrating awareness of limitations