# Vegetable Image Classification

## Exercise 1: Building a Convnet from Scratch

In this exercise, we will build a classifier model from scratch that is able to distinguish among 15 different types of vegetables. Similarly to our last lab, we will:

1. Explore the example data
2. Build a small convnet from scratch to solve our classification problem
3. Evaluate training and validation accuracy

Let's go!

## Explore the Example Data

Let's start by downloading our example data, a .zip of 21,000 JPG pictures of vegetables, and extracting it locally in `/tmp`. These data are replicated from the [Kaggle Vegetable Image Dataset](https://www.kaggle.com/datasets/misrakahmed/vegetable-image-dataset).

In [None]:
!if ! [ -f /tmp/vegetables.zip ]; then \
  wget --no-check-certificate \
    https://cdn.c18l.org/vegetables.zip \
    -O /tmp/vegetables.zip; \
fi

In [None]:
import os
import zipfile

local_zip = '/tmp/vegetables.zip'
# The context manager closes the archive even if extraction fails
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp')

The contents of the .zip are extracted to the base directory `/tmp/Vegetable Images`, which contains `train`, `test`, and `validation` subdirectories for the training, test, and validation datasets (see the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/validation/check-your-intuition) for a refresher on training, validation, and test sets). Each of these in turn contains one subdirectory per image class we'll be trying to predict. Let's define each of these directories:

In [None]:
from pathlib import Path

base_dir = Path('/tmp/Vegetable Images')
train_dir = base_dir / 'train'
test_dir = base_dir / 'test'
validation_dir = base_dir / 'validation'

image_classes = [x.name for x in train_dir.iterdir() if x.is_dir()]
image_classes
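`Path.iterdir()` yields directory entries in arbitrary order, so the order of `image_classes` (and therefore which class sits at index 0) can differ between machines. As a minimal sketch, assuming the directory layout described above, the hypothetical helper `list_classes` below (not part of the lab) returns the class names in sorted order so that indices are reproducible:

```python
from pathlib import Path


def list_classes(split_dir):
    """Return the class subdirectory names of split_dir in sorted order.

    Hypothetical helper: assumes every subdirectory of split_dir is one class.
    """
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


# Only runs against the real data if it has already been extracted
train_dir = Path('/tmp/Vegetable Images/train')
if train_dir.is_dir():
    print(list_classes(train_dir))
```

With a sorted list, a class can also be looked up by name rather than by position, which avoids depending on the listing order at all.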

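Now, let's see what the filenames look like for one of our vegetable classes, `image_classes[0]`, in each of the three directories (the variable names below assume this class is radish; file naming conventions are the same in the `train`, `test`, and `validation` directories):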

In [None]:
train_radish_fnames = os.listdir(train_dir / image_classes[0])
print(train_radish_fnames[:10])

test_radish_fnames = os.listdir(test_dir / image_classes[0])
print(test_radish_fnames[:10])

validation_radish_fnames = os.listdir(validation_dir / image_classes[0])
print(validation_radish_fnames[:10])

Let's find out the number of images for this class in the `train`, `test`, and `validation` directories:

In [None]:
print('total training radish images:', len(os.listdir(train_dir / image_classes[0])))
print('total testing radish images:', len(os.listdir(test_dir / image_classes[0])))
print('total validation radish images:', len(os.listdir(validation_dir / image_classes[0])))

For each image class, we have 1,000 training images, 200 test images, and 200 validation images.
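The cell above counts a single class only. A minimal sketch for checking every class in every split follows; `count_images_per_class` is a hypothetical helper (not part of the lab), and it assumes that each class is a subdirectory of the split directory and that every regular file inside it is an image:

```python
from pathlib import Path


def count_images_per_class(split_dir):
    """Return {class_name: number_of_files} for one split directory.

    Hypothetical helper: assumes one subdirectory per class and that
    every regular file inside a class subdirectory is an image.
    """
    split_dir = Path(split_dir)
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }


base_dir = Path('/tmp/Vegetable Images')
for split in ('train', 'test', 'validation'):
    split_dir = base_dir / split
    if not split_dir.is_dir():
        # Data not downloaded/extracted yet; nothing to count
        print(f'{split}: directory not found, skipping')
        continue
    counts = count_images_per_class(split_dir)
    print(f'{split}: {len(counts)} classes, {sum(counts.values())} images')
```

If the per-class figures above hold for all 15 classes, this should report 15,000 training, 3,000 test, and 3,000 validation images, which add up to the 21,000 pictures in the archive.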

Now let's take a look at a few pictures to get a better sense of what the dataset looks like. First, configure the matplotlib parameters:

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Parameters for our graph; we'll output images in a 4x4 configuration
nrows = 4
ncols = 4

# Index for iterating over images
pic_index = 0

Now, display a batch of 8 radish pictures. You can rerun the cell to see a fresh batch each time:

In [None]:
# Set up matplotlib fig, and size it to fit 4x4 pics
fig = plt.gcf()
fig.set_size_inches(ncols * 4, nrows * 4)

pic_index += 8
next_pix = [
    os.path.join(train_dir / image_classes[0], fname)
    for fname in train_radish_fnames[pic_index-8:pic_index]
]

for i, img_path in enumerate(next_pix):
    sp = plt.subplot(nrows, ncols, i + 1)
    sp.axis('off')  # Don't show axes (or gridlines)

    img = mpimg.imread(img_path)
    plt.imshow(img)

plt.show()
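Rerunning the cell shows new pictures because `pic_index` grows by 8 on each run and the slice `[pic_index-8:pic_index]` selects the next 8 filenames. Once `pic_index` passes the end of the list, that slice returns an empty list and nothing is drawn. The sketch below isolates the slicing logic in a hypothetical helper, `next_batch` (not part of the lab), which instead wraps back to the start of the list; the filenames are stand-ins for illustration:

```python
def next_batch(fnames, pic_index, batch_size=8):
    """Return (batch, new_pic_index) for the next batch of filenames.

    Hypothetical helper: wraps to the start of the list when fewer than
    batch_size names remain after pic_index.
    """
    if not fnames:
        return [], 0
    if pic_index + batch_size > len(fnames):
        pic_index = 0  # not enough names left: start over
    batch = fnames[pic_index:pic_index + batch_size]
    return batch, pic_index + batch_size


# Stand-in filenames; in the notebook this would be train_radish_fnames
fnames = [f'{i:04d}.jpg' for i in range(20)]
pic_index = 0
for _ in range(3):
    batch, pic_index = next_batch(fnames, pic_index)
    print(batch[0], '...', batch[-1])
```

Wrapping is a design choice for interactive browsing; it skips any leftover names at the end of the list that do not fill a whole batch.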

## Building a Small Convnet from Scratch