# Problem Definition

## Project Title: Agricultural Plant Health Classification

### Context
Early detection of plant diseases is crucial for maintaining crop health and maximizing yield. Manually inspecting plants is time-consuming and error-prone. This project focuses on developing a classification system to automatically distinguish between healthy and unhealthy plants based on their photographs.

### Objective
Develop a machine learning model to automatically classify plant images into:
- **Healthy Plants**: No disease or stress
- **Unhealthy Plants**: Showing signs of disease, pest damage, or stress

### Expected Outcome
A working proof-of-concept system that predicts the health status of a plant based on images.

### Application
Farmers and agricultural experts can use this system for timely intervention to prevent crop damage.

# Data Collection

### Source
- The images come from a publicly available dataset on Mendeley Data: [Bangladesh Dataset](https://data.mendeley.com/datasets/3wby28tkcp/2)


## Dataset Selection and Preparation

### Focus Crop: Bean

The dataset used in this project is the **Vegetables Dataset**, which contains images of four different vegetable crops, each divided into healthy and unhealthy categories:
```
Vegetables Dataset/
├── Malabar/
│ ├── Healthy/
│ └── Unhealthy/
├── Brinjal/
│ ├── Healthy/
│ └── Unhealthy/
├── Cauliflower/
│ ├── Healthy/
│ └── Unhealthy/
└── Bean/
  ├── Healthy/
  └── Unhealthy/
```

For this **proof-of-concept project**, we focus only on the **Bean** crop, performing a **binary classification**:

- **Healthy Beans → label 0**  
- **Unhealthy Beans → label 1**

### Reasoning

Using only a single crop type has several advantages:

1. **Reduces variability**: Different crops have different leaf shapes, sizes, and colors. Combining all crops into one model can confuse the classifier and reduce predictive accuracy.
2. **Simplifies the baseline model**: By focusing on Bean, the model learns patterns specific to this crop, allowing us to establish a strong baseline before expanding to multiple crops.
3. **Demonstrates ML workflow**: The goal is to show end-to-end model building — loading data, preprocessing, feature extraction, training, and evaluation — on a manageable and interpretable subset.

### Folder Structure for Bean

The **Bean crop images** are organized as follows:
```
data/
└── Bean/
    ├── Healthy/ # Images of healthy bean plants
    └── Unhealthy/ # Images of diseased or stressed bean plants
```

Each subfolder corresponds to a **class label**, which will be used during model training.

By focusing on this single crop type, the model can learn clear distinguishing patterns between healthy and unhealthy plants, making this a clean **baseline for future expansion** to multi-class or multi-crop classification.



# Data Preprocessing

In this section, we prepare the Bean plant images for machine learning. Since we are performing **binary classification** (Healthy vs. Unhealthy), we need to convert the raw images into a format suitable for a **classical ML model** like Logistic Regression.  

Steps taken:

1. **Image Resizing**  
   - All images are resized to a uniform size (64x64 pixels) to ensure consistency across the dataset.  
   - Resizing helps reduce computational load while retaining enough detail for the model to learn.

2. **Color Conversion and Flattening**  
   - Images are converted to RGB to ensure all have 3 color channels.  
   - Each image is flattened into a 1-dimensional array of pixel values so that classical ML models can process them.  
   - Flattening turns a 64x64x3 image into a 12,288-length feature vector.

3. **Label Encoding**  
   - Folder names (`Healthy` and `Unhealthy`) are mapped to numeric labels:  
     - Healthy → 0  
     - Unhealthy → 1  

4. **Feature and Label Arrays**  
   - All processed images are stored in `X` (features).  
   - Corresponding labels are stored in `y` (target).  

5. **Verification**  
   - We check the shape of `X` and `y` to ensure the data is loaded correctly:  
     - `X.shape` should be `(number_of_images, 12288)`  
     - `y.shape` should be `(number_of_images,)`  

6. **Normalization**  
   - Pixel values are divided by 255 so that every feature lies in the range [0, 1].  
   - In the code this happens right after the arrays are built, before the shapes are printed.

This preprocessing ensures the images are ready for training a **binary classification model** while keeping the workflow simple and interpretable.


In [13]:
import os
import cv2
import numpy as np
from tqdm import tqdm # adds a nice loading bar

# 1. Define Paths
BASE_PATH = '/home/shaddy/Downloads/Dataset on Bangladeshi Healthy and Unhealthy Veget/Vegetables/Bean'
CATEGORIES = ['Healthy', 'Unhealthy']
IMG_SIZE = 64

X = []
y = []

# 2. The Wrangling Loop
print("Starting Image Processing...")
for category in CATEGORIES:
    path = os.path.join(BASE_PATH, category)
    label = CATEGORIES.index(category) # Healthy=0, Unhealthy=1
    
    for img_name in tqdm(os.listdir(path), desc=f"Loading {category}"):
        # Read the image; cv2.imread returns None instead of raising
        # when a file is missing or is not a readable image
        img_path = os.path.join(path, img_name)
        img = cv2.imread(img_path)
        if img is None:
            tqdm.write(f"Skipping unreadable file: {img_path}")
            continue

        # OpenCV loads images as BGR; convert to RGB
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Step 1: Resize
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))

        # Step 2: Flattening (64*64*3 = 12288)
        flattened_img = img.flatten()

        X.append(flattened_img)
        y.append(label)

# Step 4: Convert to Arrays
X = np.array(X)
y = np.array(y)

# Step 6: Normalization
X = X / 255.0

# Step 5: Verification
print("\n--- Verification ---")
print(f"Features shape (X): {X.shape}") # Expect (num_images, 12288)
print(f"Labels shape (y): {y.shape}")     # Expect (num_images,)
print(f"Sample Label (first image): {y[0]}")

Starting Image Processing...


Loading Healthy: 100%|██████████| 632/632 [01:43<00:00,  6.09it/s]
Loading Unhealthy: 100%|██████████| 13/13 [00:01<00:00,  6.65it/s]


--- Verification ---
Features shape (X): (645, 12288)
Labels shape (y): (645,)
Sample Label (first image): 0





**Note on Class Imbalance**

After loading the images, we see that the dataset is highly imbalanced:

- Healthy: 632 images
- Unhealthy: 13 images

Only about 2% of the images (13 of 645) are Unhealthy, so a model can reach high accuracy while performing poorly on the minority class.  
In a production scenario, techniques such as **class weighting, oversampling the minority class, or collecting more minority-class images** could help mitigate this issue.  

For this proof-of-concept, we keep the dataset as-is (no resampling) and rely only on class weighting in the model. The imbalance is taken into account when interpreting the evaluation.
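
As a rough illustration of the oversampling idea mentioned above, the sketch below repeats minority-class indices at random until both classes are the same size. The helper `oversample_indices` and its seed are assumptions made for this example; it is not part of the pipeline used in this notebook.

```python
import random
from collections import Counter


def oversample_indices(labels, seed=0):
    """Return sample indices in which every class appears as often as the largest class.

    Hypothetical helper for illustration only: minority samples are repeated
    (drawn with replacement); no new images are created.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        # Draw the missing samples with replacement to reach the target size
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    rng.shuffle(balanced)
    return balanced


# Toy labels with the same class counts as the Bean dataset
labels = [0] * 632 + [1] * 13
balanced = oversample_indices(labels)
counts = Counter(labels[i] for i in balanced)
print(counts[0], counts[1])
```

If used, oversampling should be applied to the training split only, so that repeated images never end up in both the training and the test set.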


# Modeling

We train a Logistic Regression model to classify Bean plant images as Healthy (0) or Unhealthy (1).

- The training and test sets are split with an 80/20 ratio, preserving class distribution (`stratify=y`).
- Logistic Regression is used with `class_weight='balanced'` to partially account for the imbalanced dataset.
- Evaluation is done using Accuracy, the Confusion Matrix, and per-class metrics (Precision, Recall, F1-score) to understand model performance on both classes.
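
To make `class_weight='balanced'` more concrete: scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python for the class counts of the full Bean dataset. The helper name `balanced_class_weights` is made up for this example, and the model itself derives its weights from the training labels it is fitted on, so the exact values there differ slightly.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Same class counts as the Bean dataset: 632 Healthy (0), 13 Unhealthy (1)
labels = [0] * 632 + [1] * 13
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

With these counts, one Unhealthy image carries roughly 49 times the weight of one Healthy image in the loss. Reweighting changes how much each sample matters, but it cannot add information that 13 images do not contain.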


## Train/Test Split

We split the data into training and test sets so that performance can be evaluated on images the model has not seen:

In [14]:
from sklearn.model_selection import train_test_split

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)


Training set: (516, 12288) (516,)
Test set: (129, 12288) (129,)
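
The sketch below illustrates what stratification means for a class with only 13 images. It is a simplified per-class split written for this example (the function `stratified_counts` is hypothetical and is not how scikit-learn implements `stratify`), but it arrives at the same test-set class counts that appear in the evaluation output of this notebook: 126 Healthy and 3 Unhealthy.

```python
def stratified_counts(class_counts, test_size=0.2):
    """Split each class separately so both sets keep the class ratio.

    Simplified illustration: the per-class test share is rounded to the
    nearest integer.
    """
    split = {}
    for label, n in class_counts.items():
        n_test = round(n * test_size)
        split[label] = {"train": n - n_test, "test": n_test}
    return split


split = stratified_counts({"Healthy": 632, "Unhealthy": 13})
for label, parts in split.items():
    print(label, parts)
```

With only 3 Unhealthy test images, each one accounts for a third of the minority-class recall, so the metrics reported for that class are very coarse.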


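### Logistic Regression Implementation
- The model is scikit-learn's `LogisticRegression`, trained on the flattened, normalized pixel vectors with `max_iter=1000` and `class_weight='balanced'`.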

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize Logistic Regression
lr = LogisticRegression(max_iter=1000, class_weight='balanced')  # 'balanced' helps with class imbalance

# Train
lr.fit(X_train, y_train)

# Predict
y_pred = lr.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.9457364341085271

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.97      0.97       126
           1       0.00      0.00      0.00         3

    accuracy                           0.95       129
   macro avg       0.49      0.48      0.49       129
weighted avg       0.95      0.95      0.95       129


Confusion Matrix:
 [[122   4]
 [  3   0]]


# Model Evaluation

The Logistic Regression model achieved 94.6% accuracy on the test set (122 of 129 images correct). However, with such severe class imbalance (126 Healthy vs 3 Unhealthy in the test set), this number is misleading: the model did not correctly classify a single Unhealthy plant.

- All 3 Unhealthy plants in the test set were misclassified as Healthy (false negatives), so Recall for the Unhealthy class is 0%. In an agricultural context this is the most costly error, because a diseased plant goes untreated.
- The 4 images that the model flagged as Unhealthy were in fact Healthy (false positives), so Precision for the Unhealthy class is also 0%.
- A trivial classifier that labels every image Healthy would score 126/129 ≈ 97.7% accuracy, which is higher than this model. The high accuracy therefore reflects the class distribution rather than learned features of the disease, even with `class_weight='balanced'`.
- This highlights the importance of handling class imbalance with more than class weighting alone, e.g., through oversampling or collecting more Unhealthy images.

For this proof-of-concept, I document the results and note that the model performs well for the majority class but fails on the minority class.
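
The per-class numbers can be recomputed by hand from the confusion matrix reported above. The sketch below does this in plain Python and adds two reference points that are not in the original output: the accuracy of a trivial "always Healthy" baseline and the balanced accuracy (mean of the per-class recalls). Variable names are chosen for this example.

```python
# Confusion matrix from the run above: rows = true class, columns = predicted class
cm = [[122, 4],
      [3, 0]]
tn, fp = cm[0]  # true Healthy: predicted Healthy, predicted Unhealthy
fn, tp = cm[1]  # true Unhealthy: predicted Healthy, predicted Unhealthy

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Guard against division by zero when a class is never predicted or never present
recall_unhealthy = tp / (tp + fn) if (tp + fn) else 0.0
precision_unhealthy = tp / (tp + fp) if (tp + fp) else 0.0
recall_healthy = tn / (tn + fp)

# Mean of the per-class recalls; not inflated by the majority class
balanced_accuracy = (recall_healthy + recall_unhealthy) / 2

# Accuracy of a trivial baseline that labels every image Healthy
baseline_accuracy = (tn + fp) / total

print(f"accuracy:            {accuracy:.3f}")
print(f"baseline accuracy:   {baseline_accuracy:.3f}")
print(f"balanced accuracy:   {balanced_accuracy:.3f}")
print(f"recall (Unhealthy):  {recall_unhealthy:.3f}")
print(f"precision (Unhealthy): {precision_unhealthy:.3f}")
```

The balanced accuracy comes out just below 0.5, which is a more honest summary of this model than the headline accuracy.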


# Conclusion, Reflection, and Challenges

## Conclusion

This project demonstrates an end-to-end workflow for binary image classification of Bean plant health using classical machine learning. The key steps and observations are summarized below:

1. **Problem Definition**
   - Objective: Automatically classify Bean plant images as Healthy (0) or Unhealthy (1) to support early detection of plant stress or disease.
   - Dataset: Bean images from the Vegetables Dataset, divided into Healthy and Unhealthy categories.

2. **Data Loading**
   - Images were loaded from folder structure (`Healthy` and `Unhealthy`) using OpenCV.
   - Ensured all images were RGB and resized to 64x64 pixels for consistency.
   - Flattened images into 1D arrays suitable for classical ML models.

3. **Data Preprocessing**
   - Converted images into feature vectors (`X`) and labels (`y`).
   - Normalized pixel values to range [0,1].
   - Verified shapes: `X.shape = (645, 12288)`, `y.shape = (645,)`.
   - Noted class imbalance (632 Healthy vs 13 Unhealthy), which could impact model performance.

4. **Modeling**
   - Applied Logistic Regression, a classical ML model that matches my current skill set.
   - Split data into training and test sets (80/20), with stratification to maintain class distribution.
   - Used `class_weight='balanced'` to partially account for class imbalance.

5. **Evaluation**
   - Overall accuracy: **94.6%**
   - Confusion matrix and classification report reveal that all Unhealthy plants were misclassified, highlighting the effect of class imbalance.
   - Demonstrates the importance of evaluating metrics beyond accuracy, especially with imbalanced datasets.



## Reflection and Future Directions

After completing this proof-of-concept, I have noted that using **Computer Vision techniques and deep learning models (e.g., Convolutional Neural Networks)** could improve performance significantly. Unlike classical ML models, deep learning:

- Automatically learns complex features from images such as shapes, textures, and patterns.  
- Reduces the need for manual feature engineering or flattening of images.  
- Can better handle variability in leaf shapes, lighting conditions, and background noise.  
- May improve classification of minority classes (e.g., Unhealthy plants) if combined with data augmentation techniques.

If I were to extend this project, I could:

- Collect more Unhealthy plant images to balance the dataset.  
- Experiment with CNN architectures to directly process images without flattening.  
- Apply data augmentation to create a richer training set and improve model generalization.  
- Compare classical ML performance with deep learning to evaluate trade-offs in simplicity vs accuracy.
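
As a small sketch of the data augmentation idea, the example below creates flipped copies of an image before it is flattened. The function `augment_flips` and the random stand-in image are assumptions for illustration; whether flips (or other transforms such as rotations or brightness changes) are appropriate for the Bean images would need to be checked against the actual data.

```python
import numpy as np


def augment_flips(img):
    """Return the original image plus horizontally and vertically flipped copies.

    `img` is an (H, W, 3) array, i.e. the image before flattening.
    """
    return [img, img[:, ::-1, :], img[::-1, :, :]]


# Random stand-in for one resized 64x64 RGB image
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

augmented = augment_flips(img)
# Flatten and normalize each copy the same way as in the preprocessing step
features = np.stack([a.reshape(-1) for a in augmented]) / 255.0
print(features.shape)
```

Augmented copies should be generated from training images only, so that the test set still consists of original, unseen photographs.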



## Challenges Encountered

During this project, I faced several challenges, which reflect both technical and learning aspects:

1. **Data Collection and Preparation**
   - The dataset was downloaded from Mendeley, but I had to locate the specific Bean crop images and make sure the folder structure matched the labels.
   - The imbalanced classes (632 Healthy vs 13 Unhealthy) made model evaluation difficult to interpret.

2. **Learning and Applying Classification**
   - My prior experience was mainly with regression; I had to quickly learn the basics of binary classification, evaluation metrics (precision, recall, F1-score), and how to interpret confusion matrices.
   - Figuring out how to convert images into numerical features suitable for Logistic Regression (flattening, normalization) required research and experimentation.

3. **Time Constraints**
   - The project had to be completed in 3 days, which meant I had to learn, implement, and test quickly while ensuring a coherent, reproducible workflow.

4. **Balancing Simplicity vs Performance**
   - Choosing Logistic Regression was a deliberate decision to stay within my known skill set, even though my research showed that deep learning could perform better.
   - Highlighting the class imbalance and outlining future improvements shows awareness of the model's limitations without overcomplicating the current implementation.

5. **Technical Hurdles**
   - Reading images consistently  
   - Resizing and flattening images for classical ML  
   - Not a hurdle: the dataset loaded without errors, which saved time


## Production-Oriented Approach 

In a production setting where computer vision is a core requirement, this system would be extended using a CNN-based architecture. A pretrained model (e.g. ResNet or MobileNet) could be fine-tuned on the Bean dataset to automatically learn spatial and texture-based features.

This approach would:
- Remove the need for manual feature flattening
- Improve robustness to lighting and background variation
- Improve minority class detection through data augmentation

Due to time constraints and my current learning stage, this project focuses on a classical ML baseline while clearly outlining the path to a production-ready CV system.
