In [9]:
import os
import tensorflow as tf
import matplotlib
matplotlib.use("TkAgg")
import matplotlib.pyplot as plt

HOME_DIRECTORY = os.path.expanduser("~")
PISTACHIO_FILE_PATH = os.path.join(HOME_DIRECTORY, "Documents/"
                                            "ML@Purdue/"
                                            "Pistachio Classifier/"
                                            "Pistachio_Image_Dataset/"
                                            "Pistachio_Image_Dataset")
BATCH_SIZE = 9
EPOCHS = 10

In this cell, we import os (for file path handling), tensorflow (for the CNN), and matplotlib (for plotting). matplotlib.use("TkAgg") is called because the default matplotlib backend doesn't fully support RGB rendering. After the imports, the path to the data files is built with the os library, and a batch size of 9 and an epoch count of 10 are chosen arbitrarily.

In [10]:
def load_images():
    all_data = tf.keras.utils.image_dataset_from_directory(PISTACHIO_FILE_PATH,
                                                           labels='inferred',
                                                           label_mode="int",
                                                           shuffle=True,
                                                           batch_size=BATCH_SIZE)
    return all_data


def split_data(dataset):
    data_size = tf.data.Dataset.cardinality(dataset).numpy()
    train_size = int(0.7 * data_size)
    val_size = int(0.15 * data_size)

    training_set = dataset.take(train_size)
    validation_set = dataset.skip(train_size).take(val_size)
    test_set = dataset.skip(train_size + val_size)

    return training_set, validation_set, test_set


The load_images() function uses tf.keras' built-in image_dataset_from_directory() function to read the image data directly into a tf.data.Dataset object, the data type the CNN takes as input. The arguments are worth noting: labels="inferred" labels each image automatically from the name of its subdirectory, label_mode="int" ensures integer labels, shuffle=True shuffles the dataset, and batch_size sets the batch size for the CNN. Then, split_data() splits the dataset into a training set, a validation set, and a test set containing roughly 70%, 15%, and 15% of the batches respectively. Because the dataset is already batched, the split is made over whole batches rather than individual images.
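
The split arithmetic can be checked by hand. The sketch below mirrors the integer rounding in split_data(), assuming the 2148 files reported when the dataset is loaded and the BATCH_SIZE of 9 defined earlier. split_sizes is a hypothetical helper written for this illustration and is not part of the notebook.

```python
import math


def split_sizes(num_images, batch_size, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) batch counts, mirroring split_data()."""
    # A batched dataset holds ceil(num_images / batch_size) batches;
    # the last batch may contain fewer than batch_size images.
    num_batches = math.ceil(num_images / batch_size)
    train = int(train_frac * num_batches)
    val = int(val_frac * num_batches)
    # skip(train + val) keeps whatever batches are left over.
    test = num_batches - train - val
    return train, val, test


print(split_sizes(2148, 9))  # (167, 35, 37)
```

The 37 test batches agree with the 37/37 progress counter printed when the model is evaluated.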

In [11]:
def get_model():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Conv2D(filters=32,
                                     kernel_size=(2, 2),
                                     activation="relu",
                                     data_format="channels_last",
                                     input_shape=(256, 256, 3)))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))
    model.add(tf.keras.layers.Conv2D(filters=64,
                                     kernel_size=(2, 2),
                                     activation="relu"))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))
    model.add(tf.keras.layers.Conv2D(filters=128,
                                     kernel_size=(2, 2),
                                     activation="relu"))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))
    model.add(tf.keras.layers.Conv2D(filters=256,
                                     kernel_size=(2, 2),
                                     activation="relu"))
    model.add(tf.keras.layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.summary()
    return model


MAIN_MODEL = get_model()

This function defines the model architecture. It is a sequential model made of convolutional layers with ReLU activation alternating with max-pooling layers, followed by global average pooling, dropout, and a final dense neuron with a sigmoid activation. Each convolutional layer slides a small 2x2 window (the kernel) over its input and identifies patterns such as edges or textures (e.g. cracks in the pistachios). What is passed to the next layer is a set of feature maps that highlight the specific features the kernels found. The pooling layers downsample ("pixelate") the feature maps, condensing the information from the previous layer to make the prediction more resilient to small shifts/distortions in the input. After the last convolution, global average pooling reduces each of the 256 feature maps to a single value, dropout randomly zeroes 30% of these values during training, and the resulting 256 values are fed into one dense neuron with one output.
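
To make the layer-by-layer shapes concrete, here is a rough sketch that traces the feature-map sizes and counts the trainable parameters by hand. It assumes the Keras defaults that get_model() leaves unspecified: "valid" padding and stride 1 for the convolutions, and stride 2 for the 2x2 pooling. The helper names are hypothetical, and model.summary() remains the authoritative source.

```python
def conv_out(size, kernel=2):
    # "valid" padding, stride 1: the window fits in size - kernel + 1 positions
    return size - kernel + 1


def pool_out(size, pool=2):
    # 2x2 max pooling with stride 2 floors the division
    return size // pool


def conv_params(in_channels, filters, kernel=2):
    # one kernel x kernel x in_channels weight block plus one bias per filter
    return (kernel * kernel * in_channels + 1) * filters


size, channels = 256, 3
shapes, total_params = [], 0
for i, filters in enumerate([32, 64, 128, 256]):
    total_params += conv_params(channels, filters)
    size, channels = conv_out(size), filters
    shapes.append((size, size, channels))
    if i < 3:  # the last convolution feeds global average pooling instead
        size = pool_out(size)
        shapes.append((size, size, channels))

total_params += channels + 1  # Dense(1): 256 weights + 1 bias

for shape in shapes:
    print(shape)
print("trainable parameters:", total_params)
```

Under these assumptions the last convolution produces 30x30 maps with 256 channels, which global average pooling collapses into the 256 values passed to the dense neuron.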

In [12]:
DATASET = load_images()
TRAIN_SET, VAL_SET, TEST_SET = split_data(DATASET)
PISTA_NAMES = DATASET.class_names

Found 2148 files belonging to 2 classes.


This cell initializes the training, validation, and test sets and retrieves the names of the pistachio classes.

In [13]:
MAIN_MODEL.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                   loss="binary_crossentropy",
                   metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
training_bool = input("Do you want to train? Y/N  ")
if training_bool == "Y":
    train = MAIN_MODEL.fit(TRAIN_SET, epochs=EPOCHS,
                           validation_data=VAL_SET)
    MAIN_MODEL.save("pistachio_cnn.keras")
elif training_bool == "N":
    # Load the same file that the training branch saves
    MAIN_MODEL = tf.keras.models.load_model("pistachio_cnn.keras")



Here we compile the model with the Adam optimizer and a learning rate of 0.001. This learning rate is a common default: it is usually small enough that the minimization does not diverge, but large enough that it is less likely to get stuck in a local minimum. We use the Adam optimizer because it carries over "momentum" from previous gradients by keeping a running average of them. It also adapts the step size for each parameter based on the magnitude of recent gradients, so updates don't get too large or too small. Alongside accuracy, we track the area under the ROC curve (AUC), where 0.5 corresponds to random classification and 1 to perfect classification. Finally, so that training doesn't have to run every time, the user is given the option to train or not. If the input is Y, the model trains, is saved, and is then tested. If the input is N, the program simply loads the most recently saved model and tests that.
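
The momentum and adaptive step size described above can be illustrated with a minimal sketch of the textbook Adam update for a single scalar parameter. This is not Keras' internal implementation. adam_step is a hypothetical helper, and the beta and epsilon values are assumed to match the Keras defaults.

```python
import math


def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter x at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients ("momentum")
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


# The first step has a size close to lr whether the gradient is tiny or huge.
first_steps = []
for grad in (0.01, 100.0):
    x_new, _, _ = adam_step(0.0, grad, m=0.0, v=0.0, t=1)
    first_steps.append(abs(x_new))
print(first_steps)

# Repeated steps on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2.0 * (x - 3.0), m, v, t, lr=0.01)
print(x)
```

Dividing by the square root of the squared-gradient average is what keeps the step size roughly independent of the raw gradient scale.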

In [14]:
test_loss, test_acc, test_auc = MAIN_MODEL.evaluate(TEST_SET)
print(f"Test loss: {test_loss:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")

for images, labels in TEST_SET.take(1):
    probabilities = MAIN_MODEL.predict_on_batch(images)
    predictions = (probabilities >= 0.5).astype(int).ravel()

    plt.figure(figsize=(10.0, 10.0))
    for i in range(len(images)):
        new_sub = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype(int))
        new_sub.axis("off")
        true_label = int(labels[i].numpy())
        pred_label = int(predictions[i])
        title = (f"Prediction: {PISTA_NAMES[pred_label]} ({probabilities[i][0]:.3f})\n"
                 f"Actual: {PISTA_NAMES[true_label]}")
        color = "green" if (true_label == pred_label) else "red"
        plt.title(title, color=color, fontsize=9)
    plt.show()

[1m37/37[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 34ms/step - accuracy: 0.8364 - auc: 0.9063 - loss: 0.4245
Test loss: 0.4245
Test accuracy: 0.8364
Test AUC: 0.9063


2025-09-10 18:13:21.093084: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In this final cell, we evaluate the model using the test set and create a plot that shows the images of one batch of the test set, along with the predicted and actual classifications. If the classification is right, the text is green, and if it is wrong the text is red.
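
The two reported metrics measure different things: accuracy depends on the 0.5 threshold applied to the sigmoid output, while AUC depends only on how the scores are ranked. The sketch below uses small made-up labels and scores (not results from this model) and two hypothetical helpers. tf.keras.metrics.AUC approximates the curve over a fixed set of thresholds, so its value can differ slightly from this exact pairwise calculation.

```python
def accuracy_at_threshold(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def pairwise_auc(labels, probs):
    """Probability that a random positive scores higher than a random negative."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Illustrative data: every positive outranks every negative,
# but all scores sit above the 0.5 threshold.
labels = [0, 0, 1, 1]
probs = [0.6, 0.7, 0.8, 0.9]
print(accuracy_at_threshold(labels, probs))  # 0.5
print(pairwise_auc(labels, probs))           # 1.0
```

In this toy case the ranking is perfect, so AUC is 1.0, yet both negatives are misclassified at the 0.5 threshold, so accuracy is only 0.5.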