**PREPROCESSING DATA**
--
Data comes from the PlantVillage dataset.

**ROADMAP**
Before loading data into this Jupyter notebook:
- Download raw data from PlantVillage. This data is already split into training and validation.

After loading data into this Jupyter notebook:
1. Inspect the split of the PlantVillage data.
2. Use ImageDataGenerator to preprocess images into Keras-compatible arrays. Recall that the batch size is 32 and the image resolution is 256x256.
3. Save preprocessed data as numpy arrays.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator



**CHECKING SPLIT**

In [3]:
''' 
Checking number of files in each folder.
    Note: Files were not pushed to git because they were too large. Instead, image files were kept on local device.
'''
import os

train_dir = '../PlantVillage/train'
val_dir = '../PlantVillage/val'

def count_files(split_dir):
    """Count the files inside every class folder of split_dir."""
    total = 0
    for class_name in os.listdir(split_dir):
        # Get each class directory in the split directory
        class_path = os.path.join(split_dir, class_name)

        if not os.path.isdir(class_path):
            continue

        # Count the number of files in the class directory and then add to the running sum
        total += sum(1 for entry in os.scandir(class_path) if entry.is_file())
    return total

train_count = count_files(train_dir)
val_count = count_files(val_dir)

print(f"Total images in train set: {train_count}")
print(f"Total images in val set: {val_count}")

total_count = train_count+val_count
print(f"Train-Val Split: {[train_count/total_count, val_count/total_count]}")

Total images in train set: 32000
Total images in val set: 8000
Train-Val Split: [0.8, 0.2]
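
The totals above confirm the 80/20 split, but they do not show how images are distributed across classes. The following is a minimal sketch of a per-class count, assuming the same one-folder-per-class layout. The helper name `count_images_per_class` and the demo class names are hypothetical; the demo builds a throwaway directory so it runs without the dataset.

```python
import os
import tempfile

def count_images_per_class(split_dir):
    """Return {class_name: number of files} for one split directory."""
    counts = {}
    for entry in os.scandir(split_dir):
        if entry.is_dir():
            counts[entry.name] = sum(1 for f in os.scandir(entry.path) if f.is_file())
    return counts

# Demo on a temporary directory tree (hypothetical class names, empty placeholder files)
with tempfile.TemporaryDirectory() as root:
    for class_name, n_files in [("class_a", 3), ("class_b", 1)]:
        os.makedirs(os.path.join(root, class_name))
        for i in range(n_files):
            open(os.path.join(root, class_name, f"img_{i}.jpg"), "w").close()
    demo_counts = count_images_per_class(root)

print(demo_counts)
```

In the notebook, the same helper could be called as `count_images_per_class(train_dir)` to check for class imbalance.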


**PREPROCESSING**

In [4]:
''' 
Creating data generators for training and validation data.
    Note: the paper does not state which augmentation steps were used to add randomness during training.
'''

train_datagen = ImageDataGenerator(
    rescale=1./255, # Normalize pixel values from [0, 255] to [0.0, 1.0]
    rotation_range=20, # Randomly rotates images by up to 20 degrees
    zoom_range=0.1, # Randomly zooms in or out by up to 10%
    width_shift_range=0.1, # Randomly shifts image horizontally right or left by up to 10%
    height_shift_range=0.1, # Randomly shifts image vertically up or down by up to 10%
)

# Validation data is only rescaled; it is never randomly augmented
val_datagen = ImageDataGenerator(rescale=1./255)

In [5]:
'''
Apply data generators to PlantVillage data.
    Note: Preprocessed data is not saved after this point. 

Parameters as per paper:
    image size = [256, 256]

Each item yielded by train_prepro is a tuple of arrays with shapes ([batch size, 256, 256, 3], [batch size, number of classes]),
    where the first element holds the preprocessed images and the second holds their one-hot encoded classes. The next cell lists
    each class name with its encoded index. The full PlantVillage dataset has 38 classes, but the generators report 33 classes
    for this local copy (see the output below), so each label vector here has length 33.
'''

target_size = (256, 256)

train_prepro = train_datagen.flow_from_directory(
    train_dir,
    target_size=target_size,
    batch_size=32, # Batch size from the roadmap; each next() call yields one batch of up to 32 images
    class_mode='categorical',
    shuffle=True
)

val_prepro = val_datagen.flow_from_directory(
    val_dir,
    target_size=target_size,
    batch_size=32,
    class_mode='categorical',
    shuffle=False
)


Found 32000 images belonging to 33 classes.
Found 8000 images belonging to 33 classes.
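
As a quick sanity check on the numbers above, the batch count per pass over each split follows from the image totals and the batch size. This is a small arithmetic sketch; the variable names are illustrative only.

```python
import math

batch_size = 32
train_images_total = 32000  # from the flow_from_directory output above
val_images_total = 8000     # from the flow_from_directory output above

# A final partial batch still counts as a batch, hence the ceiling
train_steps = math.ceil(train_images_total / batch_size)
val_steps = math.ceil(val_images_total / batch_size)

print(train_steps, val_steps)  # 1000 250
```

These values should match `len(train_prepro)` and `len(val_prepro)` in the notebook, since the Keras iterators report their length in batches.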


In [None]:
'''  
Printing out the indices for the encoded classes (just for reference)

How to interpret these indices:
    'Apple___Cedar_apple_rust': 2 means that the encoded label for a cedar apple rust image is [0 0 1 0 ... 0 0 0]
'''
print(train_prepro.class_indices)


{'Apple___Apple_scab': 0, 'Apple___Black_rot': 1, 'Apple___Cedar_apple_rust': 2, 'Apple___healthy': 3, 'Cherry_(including_sour)___Powdery_mildew': 4, 'Cherry_(including_sour)___healthy': 5, 'Corn_(maize)___Cercospora_leaf_spot Gray_leaf_spot': 6, 'Corn_(maize)___Common_rust_': 7, 'Corn_(maize)___Northern_Leaf_Blight': 8, 'Corn_(maize)___healthy': 9, 'Grape___Black_rot': 10, 'Grape___Esca_(Black_Measles)': 11, 'Grape___Leaf_blight_(Isariopsis_Leaf_Spot)': 12, 'Grape___healthy': 13, 'Peach___Bacterial_spot': 14, 'Peach___healthy': 15, 'Pepper,_bell___Bacterial_spot': 16, 'Pepper,_bell___healthy': 17, 'Potato___Early_blight': 18, 'Potato___Late_blight': 19, 'Potato___healthy': 20, 'Strawberry___Leaf_scorch': 21, 'Strawberry___healthy': 22, 'Tomato___Bacterial_spot': 23, 'Tomato___Early_blight': 24, 'Tomato___Late_blight': 25, 'Tomato___Leaf_Mold': 26, 'Tomato___Septoria_leaf_spot': 27, 'Tomato___Spider_mites Two-spotted_spider_mite': 28, 'Tomato___Target_Spot': 29, 'Tomato___Tomato_Yellow
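
To go the other way, from a one-hot label back to a class name, the mapping can be inverted. The sketch below uses only the first four entries of the dictionary above as a stand-in for `train_prepro.class_indices`; the helper name `decode_label` is hypothetical.

```python
import numpy as np

# Stand-in for train_prepro.class_indices (first four entries from the output above)
class_indices = {
    'Apple___Apple_scab': 0,
    'Apple___Black_rot': 1,
    'Apple___Cedar_apple_rust': 2,
    'Apple___healthy': 3,
}

# Invert the mapping: index -> class name
index_to_class = {index: name for name, index in class_indices.items()}

def decode_label(one_hot):
    """Return the class name whose index holds the 1 in a one-hot vector."""
    return index_to_class[int(np.argmax(one_hot))]

example = np.array([0., 0., 1., 0.], dtype=np.float32)
decoded = decode_label(example)
print(decoded)  # Apple___Cedar_apple_rust
```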

**SAVING PREPROCESSED DATA**

In [7]:
train_images, train_labels = next(train_prepro) # One batch (up to 32 images) as numpy arrays, not the full training set
val_images, val_labels = next(val_prepro) # One batch (up to 32 images) as numpy arrays, not the full validation set

In [None]:
43444*(1-0.125)
train_images[0]
train_labels[0]




array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)
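
The cells above pull a single batch from each generator and stop before roadmap step 3. One possible way to finish that step is sketched below, assuming batches are gathered with repeated `next()` calls and written with `np.savez_compressed`. The helper `collect_batches`, the stand-in generator `fake_batches`, and the file name are all hypothetical; the stand-in yields small random batches so the sketch runs without TensorFlow or the dataset.

```python
import os
import tempfile
import numpy as np

def collect_batches(batch_iter, num_batches):
    """Pull num_batches (images, labels) pairs and stack them into two arrays."""
    images, labels = [], []
    for _ in range(num_batches):
        x, y = next(batch_iter)
        images.append(x)
        labels.append(y)
    return np.concatenate(images), np.concatenate(labels)

def fake_batches(batch_size=4, height=8, width=8, num_classes=3):
    """Hypothetical stand-in for a Keras iterator: yields random images and one-hot labels forever."""
    rng = np.random.default_rng(0)
    while True:
        x = rng.random((batch_size, height, width, 3), dtype=np.float32)
        y = np.eye(num_classes, dtype=np.float32)[rng.integers(0, num_classes, batch_size)]
        yield x, y

images, labels = collect_batches(fake_batches(), num_batches=5)

# Save to a temporary folder and read the arrays back to confirm the round trip
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "train_preprocessed.npz")
    np.savez_compressed(out_path, images=images, labels=labels)
    with np.load(out_path) as data:
        reloaded_images = data["images"]
        reloaded_labels = data["labels"]

print(images.shape, labels.shape)  # (20, 8, 8, 3) (20, 3)
```

With the real generators, `num_batches` would likely be `len(val_prepro)` or `len(train_prepro)`, because the Keras iterators keep yielding batches indefinitely. Note that 32000 float32 images at 256x256x3 occupy roughly 25 GB in memory, so the training set may need to be saved in several chunks rather than as one array.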