# PepsiCo Lab Potato Chips Quality Control

**Written by**: Sai Machiraju and Dylan Winer

## Introduction
This dataset was provided by Frito-Lay, a subsidiary of PepsiCo, through a Kaggle competition two years ago in an effort to improve its chip quality control. We decided to tackle this challenge to sharpen our deep learning and computer vision skills. During the chip-making process, some chips get burnt, and a chip is only passable if its damage stays below a maximum threshold. Acting as PepsiCo quality control engineers, we built a deep learning model to detect defective chips.
[Link to Kaggle](https://www.kaggle.com/datasets/concaption/pepsico-lab-potato-quality-control/data "Kaggle Data")

### Dataset
The dataset, hosted on Kaggle, is a balanced collection of 961 JPG images labeled as either defective or non-defective. Our deep learning model therefore performs binary classification, categorizing each chip image as defective or non-defective.

### Introduction to PyTorch and TorchVision
Computer vision enables machines to interpret and understand the visual world, providing a vital source of input data to train and test deep learning models. PyTorch is one of the most popular open-source deep learning libraries, and its companion library, TorchVision, is dedicated to computer vision tasks. TorchVision offers a wide range of functionality, from image preprocessing to dataset handling to evaluation. PyTorch and TorchVision serve as the backbone of our deep learning model, allowing us to build a robust and efficient binary classifier.

**First Attempt at Image Pre-Processing**
In our first attempt at loading the data, we tried to traverse the `Train` and `Test` folders, build `X_train` and `X_test` matrices and `y_train` and `y_test` vectors from every sample in the dataset, and store them as NumPy arrays. However, loading every sample from both folders into memory at once was extremely memory-intensive, so we opted for a different approach.

```python
import os
from PIL import Image
import numpy as np

# Initialize X_train and X_test lists
X_train = []
X_test = []
base_path = os.path.join("pepsico-lab-potato-quality-control", "Pepsico RnD Potato Lab Dataset")

# Encoding: 1 represents defective, 0 represents non-defective
for folder in ['Train', 'Test']:
    for category in ['Non-Defective', 'Defective']:
        folder_path = os.path.join(base_path, folder, category)
        for image_path in os.listdir(folder_path):
            # Skip anything that is not a JPG image
            if not image_path.endswith('.jpg'):
                continue
            img = Image.open(os.path.join(folder_path, image_path))
            if folder == 'Train':
                X_train.append(np.asarray(img))
            else:
                X_test.append(np.asarray(img))

# Convert the lists to numpy arrays
X_train = np.array(X_train)
X_test = np.array(X_test)

print(f'X_test Shape: {X_test.shape}')
print(f'X_train Shape: {X_train.shape}')

with open('X_train.npy', 'wb') as file:
    np.save(file, X_train)

with open('X_test.npy', 'wb') as file:
    np.save(file, X_test)
```

**Background Removal Attempt**

Based on previous experience with OpenCV, we elected to import the OpenCV library to aid in the image filtering and analysis process. The photos of the chips did not all have a consistent background, so we wanted to improve performance by converting the background to white. 

Therefore, we searched for background removal algorithms utilizing OpenCV methods, and we selected a function from FreedomVC. [Link to FreedomVC Site](https://www.freedomvc.com/index.php/2022/01/17/basic-background-remover-with-opencv/ "FreedomVC")

**Background Removal Function**

This function removes the background from an input image using color-based segmentation in the HSV color space. It builds a binary mask that separates the foreground from the background based on the image's saturation and brightness channels. The function first converts the image from BGR (OpenCV's default channel order) to the HSV (Hue, Saturation, Value) color space using `cv2.cvtColor`. It then extracts the saturation channel (S) and creates a mask (`s`) in which values below a threshold (80) are set to 0 and values at or above the threshold are set to 1. Next, it increases the brightness of the Value channel (V) by adding 80 to each value and applies a modulo of 255 to keep the values in the valid 0-255 range. A second mask (`v`) is created in which values above a threshold (80) are set to 1 and all other values are set to 0. The saturation and value masks (`s` and `v`) are combined into a single binary foreground mask: a pixel is considered part of the foreground if either its saturation or its shifted brightness passes the threshold.

The function then inverts the foreground mask to obtain the background mask, in which pixels that are not part of the foreground are set to 255 (white). `cv2.bitwise_and` applies the foreground mask to the original image, setting the background pixels to 0 and leaving the foreground pixels unchanged. Finally, the function adds the white background to the masked foreground, converts the result to a NumPy array (`img_np`), and returns it.
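To make the thresholding rule concrete, here is a minimal per-pixel sketch in plain Python. It is an illustration, not the function we used: `hsv_s_v` and `is_foreground` are hypothetical helper names, the saturation and value formulas approximate how 8-bit HSV is scaled to 0-255, and the sample colors are made up. Because saturation and value depend only on the largest and smallest channel, the channel order (RGB or BGR) does not matter here.

```python
def hsv_s_v(r, g, b):
    """Approximate 8-bit saturation and value, both scaled to 0-255."""
    v = max(r, g, b)
    s = 0 if v == 0 else round(255 * (v - min(r, g, b)) / v)
    return s, v


def is_foreground(r, g, b, threshold=80):
    """Apply the saturation/value rule described above to a single pixel."""
    s, v = hsv_s_v(r, g, b)
    s_mask = s >= threshold
    # Assumption: the image is 8-bit, so adding 80 wraps around at 256
    # before the modulo by 255 is applied.
    v_shifted = ((v + 80) % 256) % 255
    v_mask = v_shifted > threshold
    return s_mask or v_mask


# Hypothetical pixel colors chosen for illustration
samples = {
    "white background": (255, 255, 255),
    "saturated chip-like yellow": (200, 160, 60),
    "dark grey": (40, 40, 40),
    "black": (0, 0, 0),
}
for name, color in samples.items():
    print(f"{name}: foreground={is_foreground(*color)}")
```

Under these assumptions, a bright, unsaturated pixel such as a white background fails both tests and is treated as background, while a saturated chip-colored pixel passes the saturation test.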

_Note on Return_:

The function originally returned an OpenCV image; however, after testing, we needed the function to return the image as an explicit NumPy array to prevent future errors.

**Implementation**

We initially passed each image through the `bgremove` function during the initialization step. However, after comparing accuracy with and without this step, we found that the model performed better on the original images. Regardless, this attempt deepened our understanding of the OpenCV library, and the function was effective at segmenting the foreground and background of the chip images.

```python
import cv2
import numpy as np

def bgremove(myimage):
    # Convert from BGR (OpenCV's default channel order) to HSV
    myimage_hsv = cv2.cvtColor(myimage, cv2.COLOR_BGR2HSV)

    # Saturation mask: any pixel with S below 80 is excluded
    s = myimage_hsv[:,:,1]
    s = np.where(s < 80, 0, 1)

    # Increase the brightness by 80 (uint8 addition wraps around at 256), then mod by 255
    v = (myimage_hsv[:,:,2] + 80) % 255
    v = np.where(v > 80, 1, 0)  # Any value above 80 will be part of our mask

    # Combine our two masks based on S and V into a single "Foreground"
    foreground = np.where(s+v > 0, 1, 0).astype(np.uint8)  # Cast back into 8-bit integers

    background = np.where(foreground==0,255,0).astype(np.uint8) # Invert foreground to get background in uint8
    background = cv2.cvtColor(background, cv2.COLOR_GRAY2BGR)  # Convert background back into BGR space
    foreground = cv2.bitwise_and(myimage,myimage,mask=foreground) # Apply our foreground map to original image
    finalimage = background+foreground # Combine foreground and background
    # Return the result as an explicit NumPy array
    img_np = np.array(finalimage)
    return img_np
```

**Model**

We implemented the LeNet Convolutional Neural Network (CNN) architecture, using the code from [this PyImageSearch article](https://pyimagesearch.com/2021/07/19/pytorch-training-your-first-convolutional-neural-network-cnn/). One key modification, though, was that we excluded the final softmax layer, so the model returns the raw logits (unnormalized scores) directly. Because our classifier has a single output, we apply the sigmoid function to the logits in our evaluation code, while `BCEWithLogitsLoss` applies it internally during training.

We chose LeNet because of its simplicity: We were able to reach 95% accuracy using a dataset with only 700 training examples. Additionally, LeNet is ubiquitous: LeNet-5 was first documented in a [transformative 1998 paper.](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)

>**Note:** The LeNet model module is available [on our GitHub repository.](https://github.com/saimachi/AI-Quality-Control)

**Libraries**

We imported libraries to handle the complex tasks of image pre-processing, model creation, optimization, and implementation.

* The os library in Python provides a way to interact with the operating system by offering functions for file and directory manipulation, path operations, and more, facilitating file management within our Jupyter Lab environment.
* The torch library, part of the PyTorch framework, provides data structures for efficient multi-dimensional tensor computations and automatic differentiation for building and training neural networks.
* BCEWithLogitsLoss (Binary Cross Entropy with Logits Loss) is a loss function for binary classification. This loss function combines the sigmoid activation function with the binary cross-entropy loss.
* Adam (Adaptive Moment Estimation) is an optimization algorithm for training deep neural networks. It combines RMSprop, which adapts learning rates for each parameter individually, and Momentum, which adds a moving average of past gradients to smooth the optimization trajectory. Adam maintains two moving averages for each parameter, which gives each parameter its own effective learning rate and makes it well-suited for deep learning tasks.
* Sigmoid (sigmoid activation function) is used to squash input values to a range between 0 and 1, which produces a likelihood of a sample belonging to a certain class (i.e., being defective).
* PIL (Python Imaging Library or Pillow) opens, manipulates, and saves images. It has many image processing capabilities, including resizing, cropping, rotating, and filtering.
* NumPy is a numerical computing library for Python used in various domains, including machine learning, data analysis, and numerical simulations. Its key feature is the numpy.array data structure, which allows efficient and fast operations on large datasets, including images converted to arrays.
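To illustrate what "combining the sigmoid with the binary cross-entropy loss" means, the sketch below compares a naive two-step computation with a commonly used numerically stable single-step formula. This is a plain-Python illustration of the math, not PyTorch's source code; the function names and sample values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_naive(logit, target):
    """Sigmoid first, then binary cross-entropy on the probability."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# Hypothetical (logit, label) pairs
for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0)]:
    print(logit, target, bce_naive(logit, target), bce_with_logits(logit, target))
```

Both functions agree for moderate logits. The combined form stays finite for very large logits, where the naive version would attempt to take the logarithm of zero.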

**Dataset Class**

**Constructor (`__init__`)**

The constructor initializes attributes of the `Dataset` instance based on the provided parameters, setting up the initial state necessary for creating batches of training or validation data. Notably, we apply data augmentation transforms to the training dataset, but not the testing dataset.

**`__len__`**

The `__len__` special method returns the number of samples in the dataset, which is what `len(dataset)` reports.

**`__getitem__`**

The `__getitem__` special method retrieves a specific image and its label from the dataset based on an index, which is what `dataset[index]` calls.
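As a minimal sketch of these two methods, the toy class below follows the same protocol without touching any image files. `ToyDataset` and its file names are hypothetical stand-ins for illustration only; the real class additionally reads and transforms each image.

```python
class ToyDataset:
    """Minimal map-style dataset holding file names and labels only."""

    def __init__(self, file_ids, labels):
        if len(file_ids) != len(labels):
            raise ValueError("file_ids and labels must have the same length")
        self.file_ids = file_ids
        self.labels = labels

    def __len__(self):
        # Called by len(dataset)
        return len(self.file_ids)

    def __getitem__(self, index):
        # Called by dataset[index]; returns a (sample, label) tuple
        return self.file_ids[index], self.labels[index]


toy = ToyDataset(["a.jpg", "b.jpg", "c.jpg"], [1, 0, 1])
print(len(toy))
print(toy[1])
```

Any object that implements these two methods can be indexed sample by sample, which is what a PyTorch `DataLoader` relies on when it assembles batches.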

**`generate_dataset`**

This function creates an instance of the `Dataset` class based on the directory structure of the PepsiCo RnD Potato Lab dataset. It distinguishes between the training and testing datasets using the `is_train` parameter.

>**Note:** We renamed `Test/Not Defective` in the Kaggle dataset download to `Test/Non-Defective` to simplify the data loading process.

```python
# Import necessary libraries
import os
import random
import torch
from torch.utils.data import DataLoader, Subset
from torchvision.io import read_image
from torchvision import transforms
from lenet import LeNet
from torch.nn import BCEWithLogitsLoss
from torch.optim import Adam
from torch.nn.functional import sigmoid
from PIL import Image
import numpy as np

# Define a custom Dataset class for handling image data
class Dataset(torch.utils.data.Dataset):
    def __init__(self, file_ids, labels, base_path, is_training):
        """
        Constructor for the Dataset.

        Parameters
        ----------
        file_ids : list[str]
            List of file names (not paths)
        labels : list[int]
            List of class identifiers, corresponding to the list of file IDs
        base_path : str
            Path to the folder containing the class subfolders that hold the files in `file_ids`
        is_training : bool
            Whether to apply data augmentation transforms (training only)
        """
        self.file_ids = file_ids
        self.labels = labels
        self.base_path = base_path
        # Define a list of image transformations using torchvision.transforms
        transform_list = [
            transforms.ToPILImage(),
            # Resize so that the shorter edge is 28 pixels
            transforms.Resize(28)
        ]
        # If it is a training dataset, include additional data augmentation transformations
        if is_training:
            transform_list.extend([
                transforms.RandomHorizontalFlip(), # Randomly flip the image horizontally
                transforms.RandomVerticalFlip(), # Randomly flip the image vertically
                transforms.RandomRotation(degrees=20), # Randomly rotate the image by up to 20 degrees
                # Uncomment the line below to include color jittering
                # However, from our tests, its inclusion worsened the model's accuracy
                #transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2)
            ])
        # Add the final transformation to convert the image to a PyTorch tensor
        transform_list.append(transforms.ToTensor())
        self.transform = transforms.Compose(transform_list)

    def __len__(self):
        """
        Return the size of the dataset.

        Returns
        -------
        int
        """
        return len(self.file_ids)

    def __getitem__(self, index):
        """
        Get a feature and label tuple based on the index.

        Parameters
        ----------
        index : int
            This function does not check the bounds of the Dataset

        Returns
        -------
        (torch.Tensor, int) : Image tensor and class
        """
        # Get the filename and label for the given index
        filename = self.file_ids[index]
        label = self.labels[index]
        # Read the image using torchvision.io.read_image
        img = read_image(os.path.join(self.base_path, "Defective" if label == 1 else "Non-Defective", filename))
        # Apply the defined transformations to the image
        return self.transform(img), label


# Function to generate training and testing datasets
def generate_dataset(is_train):
    """
    Generate the training or testing dataset for the PepsiCo folder structure.

    Parameters
    ----------
    is_train : bool
        Whether the dataset to be generated is for training (True) or testing (False).

    Returns
    -------
    Dataset
        Instance of the class representing the generated dataset
    """
    # Root directory of the PepsiCo dataset
    base_path = os.path.join("pepsico-lab-potato-quality-control", "Pepsico RnD Potato Lab Dataset")
    # Pick the Train or Test folder based on whether the dataset is for training or testing
    train_path = os.path.join(base_path, "Train" if is_train else "Test")
    file_ids = []
    labels = []

    # Iterate through categories (Defective and Non-Defective) and collect filenames and labels
    for category in ["Defective", "Non-Defective"]:
        for filename in os.listdir(os.path.join(train_path, category)):
            # Skip anything that is not a JPG image
            if not filename.endswith(".jpg"):
                continue
            file_ids.append(filename)
            labels.append(1 if category == "Defective" else 0)

    # Create and return a Dataset instance with the collected data
    return Dataset(file_ids, labels, train_path, is_train)
```

**Data Loaders**

This portion of the code sets up data loaders for training, validation, and testing based on the generated datasets. It uses the PyTorch `DataLoader` and `Subset` classes to create iterable data loaders that provide mini-batches of data during the training and evaluation processes. Batches ensure that GPU memory is used efficiently.

**Test Data Loader**

The `generate_dataset()` function is called to create a testing dataset (`raw_test`) by setting `is_train=False`. The DataLoader is then used to create an iterable data loader (`test_loader`) for the testing dataset. It loads mini-batches of size 32, doesn't shuffle the data (`shuffle=False`), and utilizes pinned memory for faster data transfers from host memory to GPU VRAM (`pin_memory=True`).

**Training Data Loader**

A training dataset (`raw_train`) is generated by the same dataset call, but this time setting `is_train=True`. Random indices are generated using `random.sample()` to split the training dataset into 80% for training (`train_indices`) and 20% for validation (`val_indices`). Subsets (train and validation) of the training dataset are created using the `Subset` class based on the selected indices. After every training epoch, we run the validation set through the model to verify that it is not overfitting.

```python
# Loader consisting of raw testing data in mini-batches of 32, without shuffling
raw_test = generate_dataset(is_train=False)
test_loader = DataLoader(raw_test, batch_size=32, shuffle=False, pin_memory=True)

# Raw training data: split into 80% for training and 20% for validation
raw_train = generate_dataset(is_train=True)
raw_train_len = len(raw_train)
indices = random.sample(range(raw_train_len), raw_train_len)
train_indices = indices[:int(raw_train_len * 0.8)]
val_indices = list(set(indices).difference(set(train_indices)))
train = Subset(raw_train, train_indices)
validation = Subset(raw_train, val_indices)

# Loaders for training and validation based on training images
train_loader = DataLoader(train, batch_size=32, shuffle=True, pin_memory=True)
validation_loader = DataLoader(validation, batch_size=32, shuffle=False, pin_memory=True)
```
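The 80/20 split can also be checked in isolation. The sketch below is a self-contained illustration on ten hypothetical sample indices; `split_indices` is a hypothetical helper, and the `seed` parameter is an assumption added for reproducibility rather than something our notebook code uses. It takes the validation indices as the tail of the shuffled list, which selects the same members as the set difference in our code.

```python
import random


def split_indices(n, train_fraction=0.8, seed=None):
    """Shuffle the indices 0..n-1 and cut them into train and validation parts."""
    rng = random.Random(seed)
    indices = rng.sample(range(n), n)  # a random permutation of 0..n-1
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, val_idx = split_indices(10, seed=0)
print(len(train_idx), len(val_idx))
# The two parts never overlap and together cover every sample
print(set(train_idx).isdisjoint(val_idx))
```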

**Epochs**

The number of epochs is a hyperparameter that we specified before training begins. It represents how many times the entire training dataset is processed by the model. Trial and error was required to determine the right number of epochs to ensure the model learns from the data without overfitting (learning noise in the data) or underfitting (not capturing the underlying patterns). We monitored the training and validation performance over epochs and stopped training when performance plateaued or started decreasing. After a few tests, we identified 10 as the optimal number of epochs.
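We chose the number of epochs by hand, but the same "stop when validation performance plateaus" rule can be automated with early stopping. The sketch below is a hypothetical illustration: `best_epoch` and the `patience` parameter are names introduced here, and the accuracy history is made up rather than taken from our runs.

```python
def best_epoch(val_accuracies, patience=3):
    """Return (epoch, accuracy) of the best epoch, stopping once
    `patience` consecutive epochs bring no improvement."""
    best_acc = float("-inf")
    best_idx = -1
    since_improvement = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_idx = acc, epoch
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                break  # performance has plateaued or started decreasing
    return best_idx, best_acc


# Made-up validation accuracies (in percent), one per epoch
history = [70.0, 82.5, 90.0, 93.1, 92.8, 93.0, 92.5, 94.0]
print(best_epoch(history))
```

A larger `patience` tolerates longer plateaus before stopping, at the cost of extra training time.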

**Epoch Process**

At the beginning of training, the model's parameters (weights and biases) are initialized. The training dataset is divided into smaller batches to facilitate efficient computation and parameter updates. The training process is organized into a series of epochs. During each epoch:
* The model is presented with each batch of 32 training examples sequentially
* For each batch:
  * The model computes predictions based on the current parameters
  * The loss is calculated by comparing the model's predictions to the actual labels in the training data
  * The optimizer adjusts the model's parameters to minimize the loss
* The model is validated using data it hasn't been trained on
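The steps above can be sketched end to end on a toy problem. The example below is a simplified stand-in, not our training code: it trains a one-feature logistic regression in plain Python on made-up data, and it uses plain gradient descent in place of Adam. All names and hyperparameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical 1-D toy data: the label is 1 when the feature is positive
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
data = [(x, 1.0 if x > 0 else 0.0) for x in xs]
train, val = data[:80], data[80:]

w, b = 0.0, 0.0  # parameters are initialized before training
lr, batch_size, epochs = 0.5, 8, 20

for epoch in range(epochs):
    rng.shuffle(train)
    for start in range(0, len(train), batch_size):
        batch = train[start:start + batch_size]
        # Forward pass: predictions from the current parameters
        preds = [sigmoid(w * x + b) for x, _ in batch]
        # Gradient of the mean binary cross-entropy loss w.r.t. w and b
        grad_w = sum((p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
        grad_b = sum(p - y for p, (_, y) in zip(preds, batch)) / len(batch)
        # Parameter update (plain gradient descent rather than Adam)
        w -= lr * grad_w
        b -= lr * grad_b
    # Validation on held-out data: no parameter updates here
    correct = sum((sigmoid(w * x + b) >= 0.5) == (y == 1.0) for x, y in val)
    val_acc = 100 * correct / len(val)

print(f"w={w:.2f}, b={b:.2f}, validation accuracy={val_acc:.1f}%")
```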

**Training Phase**

Within each epoch, the training data loader (`train_loader`) is iterated over in mini-batches. Model parameters are updated based on the computed loss and backpropagation.

**Validation Phase**

After each epoch, the model is evaluated on the validation dataset (`validation_loader`) without updating its parameters. The validation accuracy is computed to monitor the model's performance on unseen data during the training process.

**Testing Phase**

After completing all epochs, the model is evaluated on the test dataset (`test_loader`) without updating its parameters. The test set accuracy is calculated to assess the final performance of the trained model.
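Since the model outputs logits, accuracy is computed by passing them through the sigmoid and thresholding at 0.5. The sketch below illustrates that calculation in plain Python; the `accuracy` helper, the logits, and the labels are hypothetical values for illustration, not results from our model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(logits, labels, threshold=0.5):
    """Percentage of samples whose thresholded probability matches the label."""
    preds = [1 if sigmoid(z) >= threshold else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)


# Hypothetical logits and labels (1 = defective, 0 = non-defective)
logits = [2.3, -0.7, 0.1, -1.9, 0.0]
labels = [1, 0, 0, 0, 1]
print(accuracy(logits, labels))
```

For the sample values here, thresholding the probability at 0.5 gives the same predictions as checking whether the logit is non-negative, because the sigmoid maps 0 to 0.5 and is increasing.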

```python
# Create an instance of the LeNet model for binary classification (3 input channels, 1 output logit)
model = LeNet(3, 1)
# Use a GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The LeNet model is moved to the device for faster computation during training and inference
model = model.to(device=device)
# Binary Cross Entropy with Logits Loss is chosen as the loss function, suitable for binary classification
loss = BCEWithLogitsLoss()
# The Adam optimizer is used to update the model parameters during the training process
optimizer = Adam(model.parameters())

# The training process is repeated for a specified number of epochs (10)
epochs = 10
for epoch in range(epochs):
    print(f"Epoch: {epoch}")

    # Training phase (each batch contains up to 32 images)
    model.train()
    for current_batch, batch in enumerate(train_loader):
        if current_batch % 4 == 0:
            print(f"Batch: {current_batch}")
        # batch[0] contains features, batch[1] contains labels
        # Even though the labels are 0 or 1, they must be floats for BCEWithLogitsLoss()
        X_batch, y_batch = batch[0].to(device), batch[1].to(device).unsqueeze(1).float()
        output = model(X_batch)
        loss_value = loss(output, y_batch)
        optimizer.zero_grad()
        loss_value.backward()
        optimizer.step()
    # The loss of the last mini-batch is printed after each epoch to monitor convergence
    print(f"Train Loss: {loss_value.item():.4f}")

    # Validation phase
    # Disable gradient computation (no backward pass)
    model.eval()
    with torch.no_grad():
        correct = 0
        for val_batch in validation_loader:
            X_validation, y_validation = val_batch[0].to(device), val_batch[1].to(device).unsqueeze(1)
            val_preds = model(X_validation)
            # The final layer of the LeNet module does not apply the sigmoid function
            # Probability >= 0.5 => Defective, otherwise non-defective
            correct += int(((sigmoid(val_preds) >= 0.5) == y_validation).sum())
        acc = correct * 100 / len(val_indices)
        # The validation accuracy is printed after each epoch to assess the model's generalization to unseen data
        print(f"Validation Accuracy: {acc:.2f}%")
    print()

# Testing phase
model.eval()
with torch.no_grad():
    correct = 0
    for test_batch in test_loader:
        X_test, y_test = test_batch[0].to(device), test_batch[1].to(device).unsqueeze(1)
        test_preds = model(X_test)
        # Probability >= 0.5 => Defective, otherwise non-defective
        correct += int(((sigmoid(test_preds) >= 0.5) == y_test).sum())
    acc = correct * 100 / len(raw_test)
    # After completing all epochs, the test set accuracy is printed, providing a final evaluation of the model's
    # performance on completely unseen data
    print(f"Test Set Accuracy: {acc:.2f}%")
```

**Comment on Accuracy**

The highest validation and test accuracies we recorded were 96.75% and 95.83% respectively, which demonstrates the model's ability to generalize and make predictions on unseen data examples. For the application of chip quality control, 95.83% represents a solid success rate that a company like PepsiCo could utilize to automate and streamline their production process.