# 🐶 End-to-end Multi-class Dog Breed Classification

This notebook builds an end-to-end multi-class image classifier using TensorFlow 2.x and TensorFlow Hub.

## 1. Problem
Identifying the breed of a dog given an image of a dog.

When I'm sitting at the cafe and I take a photo of a dog, I want to know what breed of dog it is.

## 2. Data
The data we're using is from Kaggle's [dog breed identification competition](https://www.kaggle.com/competitions/dog-breed-identification/data).

## 3. Evaluation
The [evaluation](https://www.kaggle.com/competitions/dog-breed-identification/overview/evaluation) is a file with prediction probabilities for each dog breed of each test image.
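
As a rough sketch of what such a file could look like, the snippet below writes one row per image and one probability column per breed. The image IDs, the three breed names, and the probabilities are made up for illustration (the real file would have a column for each of the 120 breeds), and the exact column layout should be checked against the competition's sample submission.

```python
import csv
import io

# Hypothetical values for illustration only: the real file has 120 breed columns
breeds = ["beagle", "boxer", "pug"]
predictions = {
    "img_001": [0.7, 0.2, 0.1],
    "img_002": [0.1, 0.1, 0.8],
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id"] + breeds)  # header: image ID, then one column per breed
for image_id, probs in predictions.items():
    # Each row's probabilities should sum to 1 across all breeds
    assert abs(sum(probs) - 1.0) < 1e-9
    writer.writerow([image_id] + probs)

submission_csv = buffer.getvalue()
print(submission_csv)
```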

## 4. Features
* We're dealing with images (unstructured data) so it's probably best we use deep learning/transfer learning.
* There are 120 breeds of dogs (this means there are 120 different classes).
* There are around 10,000+ images in the training set (these images have labels).
* There are around 10,000+ images in the test set (these images have no labels because we'll want to predict them).

In [1]:
# Unzip the uploaded data into Google Drive
# !unzip "/drive/MyDrive/Dog Vision/dog-breed-identification.zip" -d "/drive/MyDrive/Dog Vision/"

### Get our workspace ready

* Import TensorFlow 2.x ✅
* Import TensorFlow Hub
* Make sure we're using a GPU

In [2]:
# # Import TensorFlow into Colab
# import tensorflow as tf
# print("TensorFlow version:", tf.__version__)

In [3]:
# # In case TensorFlow version is less than 2.x
# # Import TensorFlow 2.x manually
# try:
#   # %tensorflow_version only exists in Google Colab
#   %tensorflow_version 2.x
# except Exception:
#   pass

In [4]:
# conda create --prefix .env tensorflow tensorflow-hub jupyter

In [5]:
# Import necessary tools
import tensorflow as tf
import tensorflow_hub as hub
print("TF version", tf.__version__)
print("TF Hub version", hub.__version__)

# Check if there's a GPU available
print("GPU", "available (YESSSS!!!!!!)" 
      if tf.config.list_physical_devices("GPU") 
      else "not available :(")

TF version 2.10.0
TF Hub version 0.8.0
GPU not available :(


In [6]:
# If GPU not available, check if the runtime is set to use a GPU
# Runtime > Change runtime type >  Hardware accelerator > GPU

## Getting our data ready (turning it into Tensors)

With all ML models, our data has to be in numerical format. So that's what we'll be doing here, turning our images into Tensors (numerical representations).

Let's start by accessing our data and checking out the labels.

In [8]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  0


In [7]:
# Checkout the labels of our data
import pandas as pd

labels_csv = pd.read_csv('/content/drive/MyDrive/Dog Vision/dog-breed-identification/labels.csv')
print(labels_csv.describe())
print(labels_csv.head())

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Dog Vision/dog-breed-identification/labels.csv'

In [None]:
labels_csv.head()

In [None]:
labels_csv["breed"].value_counts()

In [None]:
# How many images are there of each breed?
labels_csv["breed"].value_counts().plot.bar(figsize=(20, 10));

In [None]:
labels_csv["breed"].value_counts().median()

In [None]:
# Let's view an image
from IPython.display import Image
Image("/content/drive/MyDrive/Dog Vision/dog-breed-identification/train/\
001513dfcb2ffafc82cccf4d8bbaba97.jpg")

### Getting images and their labels

Let's get a list of all of our image file pathnames.

In [None]:
labels_csv.head()

In [None]:
# Create pathnames from image IDs - part 1
filenames = [fname for fname in labels_csv["id"]]

# Check the first 5
filenames[:5]

In [None]:
# Create pathnames from image IDs - part 2
filenames = ["/content/drive/MyDrive/Dog Vision/dog-breed-identification/train/"\
             + fname + ".jpg" for fname in labels_csv["id"]]

# Check the first 5
filenames[:5]

In [None]:
# Check whether number of filenames matches number of actual image files
import os
if len(os.listdir("/content/drive/MyDrive/Dog Vision/dog-breed-identification/train/")) == len(filenames):
  print("Filenames match actual amount of files in our ../train/ folder!\nProceed with TF! ^_^")
else:
  print("Filenames do not match actual amount of files, check the target directory.")

In [None]:
# One more check
Image(filenames[9000])

In [None]:
labels_csv["breed"][9000]

Since we've now got our training image filepaths in a list, let's prepare our labels.

In [None]:
import numpy as np

labels = labels_csv["breed"]
labels = np.array(labels)
# labels = labels_csv["breed"].to_numpy() # does same thing as above
labels, len(labels)

In [None]:
# See if number of labels matches number of filenames
if len(labels) == len(filenames):
  print("Number of labels matches number of filenames!")
else:
  print("Number of labels does not match number of filenames, check data directories!")

In [None]:
# Find the unique label values
unique_breeds = np.unique(labels)
len(unique_breeds), unique_breeds[:10]

In [None]:
# Turn a single label into an array of booleans
print(labels[0])
labels[0] == unique_breeds

In [None]:
# Turn every label into a boolean array
boolean_labels = [label == unique_breeds for label in labels]
boolean_labels[:2]

In [None]:
len(boolean_labels)

In [None]:
# Example: turning boolean array into integers
print(labels[0]) # original label
print(np.where(unique_breeds == labels[0])) # index where label occurs
print(boolean_labels[0].argmax()) # index where label occurs in boolean array
print(boolean_labels[0].astype(int)) # there will be a 1 where the sample label occurs
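
The same label-to-boolean-array idea can be tried on a tiny, self-contained example. The four labels below are made up for illustration; the notebook's real labels come from `labels.csv`.

```python
import numpy as np

# Toy labels, made up for illustration (the notebook's labels come from labels.csv)
labels = np.array(["pug", "beagle", "pug", "boxer"])
unique_breeds = np.unique(labels)  # sorted unique values

# Comparing one label against every unique breed gives a boolean (one-hot) row
boolean_labels = [label == unique_breeds for label in labels]

first = boolean_labels[0]
print(first)                          # True only at the position of "pug"
print(first.astype(int))              # the same row as 0s and 1s
print(first.argmax())                 # index of the single True value
print(unique_breeds[first.argmax()])  # back to the original label
```

Because `np.unique` returns its values sorted, the position of each breed in the boolean row is stable as long as the same `unique_breeds` array is reused when decoding predictions.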

### Creating our own validation set

Since the dataset from Kaggle doesn't come with a validation set, we're going to create our own.

In [None]:
# Setup X & y variables
X = filenames
y = boolean_labels

In [None]:
len(filenames)

We're going to start off experimenting with ~1000 images and increase as needed.

In [None]:
# Set number of images to use for experimenting
NUM_IMAGES = 1000 #@param {type:"slider", min:1000, max:10000, step:1000}

In [None]:
# Let's split our data into train and validation sets
from sklearn.model_selection import train_test_split

# Split data into training and validation of total size NUM_IMAGES
X_train, X_val, y_train, y_val = train_test_split(X[:NUM_IMAGES],
                                                  y[:NUM_IMAGES],
                                                  test_size=0.2,
                                                  random_state=42)

# Check our train and validation sets' length
len(X_train), len(y_train), len(X_val), len(y_val)
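
To make the 80/20 arithmetic concrete, here is a plain-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation, and `simple_split`, `X_demo` and `y_demo` are hypothetical names used only for this illustration.

```python
import random

def simple_split(items, labels, val_fraction=0.2, seed=42):
    """Shuffle paired items/labels together, then hold out a fraction for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_val = int(round(len(items) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Hypothetical stand-ins for 1,000 file paths and their labels
X_demo = [f"image_{i}.jpg" for i in range(1000)]
y_demo = [i % 120 for i in range(1000)]

X_tr, X_va, y_tr, y_va = simple_split(X_demo, y_demo)
print(len(X_tr), len(y_tr), len(X_va), len(y_va))
```

With 1,000 samples and a validation fraction of 0.2, this leaves 800 samples for training and 200 for validation, and each file path stays paired with its label.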

In [None]:
# Let's have a geez at the training data
X_train[:5], y_train[:2]

## Preprocessing Images (Turning Images into Tensors)

To preprocess our images into Tensors we're going to write a function which does a few things:
1. Take an image filepath as input
2. Use TensorFlow to read the file and save it to a variable, e.g. `image`
3. Turn our `image` (a jpg) into Tensors
4. Normalize our `image` (convert colour channel values from 0-255 to 0-1).
5. Resize the `image` to be a shape of (224, 224)
6. Return the modified `image`

TensorFlow documentation on loading data:
* [tf.data: Build TensorFlow input pipelines](https://www.tensorflow.org/guide/data)
* [Load and preprocess images](https://www.tensorflow.org/tutorials/load_data/images)

Before we do, let's see what importing an image looks like.

In [None]:
# Convert an image to a NumPy array
from matplotlib.pyplot import imread

image = imread(filenames[42])
image.shape

In [None]:
image[:2]

In [None]:
# each image is composed of pixels made up of RGB color values between 0 and 255
image.min(), image.max()

In [None]:
# turn that same image into a Tensor that can be run on a GPU
tf.constant(image)[:2]

Now we've seen what an image looks like as a Tensor, let's make a function to preprocess them.
1. Take an image filepath as input
2. Use TensorFlow to read the file and save it to a variable, e.g. `image`
3. Turn our `image` (a jpg) into Tensors
4. Normalize our `image` (convert colour channel values from 0-255 to 0-1).
5. Resize the `image` to be a shape of (224, 224)
6. Return the modified `image`

In [None]:
# Define image size
IMG_SIZE = 224

# Create a function for preprocessing images
def process_image(image_path, img_size=IMG_SIZE):
  """
  Takes an image file path and image size as inputs and turns
  the image into a Tensor.
  """
  # Read an image file
  image = tf.io.read_file(image_path)
  # Turn the jpeg image into numerical Tensor with 3 colour channels (RGB)
  image = tf.image.decode_jpeg(image, channels=3)
  # Convert the colour channel values from 0-255 to 0-1 values (normalization)
  image = tf.image.convert_image_dtype(image, tf.float32)
  # Resize the image to our desired value (224, 224)
  image = tf.image.resize(image, size=[img_size, img_size])

  return image
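
To see what the normalization step does to pixel values, here is a NumPy-only sketch on a made-up 2x2 image. It mimics the 0-255 to 0-1 rescaling by dividing by 255; it is an illustration of the idea, not TensorFlow's own code path.

```python
import numpy as np

# A made-up 2x2 RGB "image" with 8-bit colour values (0-255)
image_uint8 = np.array([[[0, 128, 255], [64, 64, 64]],
                        [[255, 255, 255], [10, 20, 30]]], dtype=np.uint8)

# Scale to the 0-1 range as 32-bit floats
image_float = image_uint8.astype(np.float32) / 255.0

print(image_uint8.min(), image_uint8.max())  # range before scaling
print(image_float.min(), image_float.max())  # range after scaling
print(image_float.shape, image_float.dtype)  # shape is unchanged, dtype is float32
```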

In [None]:
# Break down the process_image function line-by-line to see what it does

In [None]:
# # Check the contents of one Tensor
# tensor = tf.io.read_file(filenames[26])
# tensor

In [None]:
# # Check the output of decoding one Tensor
# tensor = tf.image.decode_jpeg(tensor, channels=3)

In [None]:
# # Normalization
# # Check the output of a decoded tensor converted from 0-255 to 0-1 values
# tf.image.convert_image_dtype(tensor, tf.float32)[:2]

Yann LeCun, a renowned computer scientist and AI researcher, tweeted in April 2018 that "Friends don’t let friends use mini-batches larger than 32" ¹. However, in May 2021, he tweeted that the optimal batch size is 128 ³.

In general, the optimal batch size will be lower than 32 ¹. However, there is no magic batch size number that works for all cases as it depends on the complexity of your data and the GPU constraints you have ⁵. 

Source: Conversation with Bing, 4/5/2023
1. Neural Networks - How do I choose the optimal batch size? - Artificial .... https://bing.com/search?q=yann+lecun+optimal+batch+size.
2. Yann LeCun on Twitter: "No contrastive samples, no huge batch size .... https://twitter.com/ylecun/status/1391164045902888967.
3. How to get 4x speedup and better generalization using the right batch size. https://towardsdatascience.com/implementing-a-batch-size-finder-in-fastai-how-to-get-a-4x-speedup-with-better-generalization-813d686f6bdf.
4. Neural Networks - How do I choose the optimal batch size? - Artificial .... https://ai.stackexchange.com/questions/8560/how-do-i-choose-the-optimal-batch-size.
5. Training Nets with Large Batch Size - GitHub Pages. https://samliu.github.io/2019/03/11/large-batch-sizes.html.

## Turning Our Data into Batches

Why turn our data into batches?

Let's say you're trying to process 10k+ images in one go. They might not all fit into memory at once (e.g., 16GB RAM, 8GB VRAM). That's why we process a small group of images at a time, here 32 (the batch size), so only one batch needs to be in memory at any moment.

In order to use TensorFlow effectively, we need our data in the form of Tensor tuples which look like this: `(image, label)`.
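
As a quick, framework-free illustration of what batching means, the sketch below cuts a list of 1,000 made-up file paths into chunks of 32. The `chunk` helper and the file names are hypothetical and only serve to show the arithmetic.

```python
import math

def chunk(items, batch_size=32):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-in for 1,000 image file paths
paths = [f"image_{i}.jpg" for i in range(1000)]
batches = list(chunk(paths, batch_size=32))

print(len(batches))      # number of batches
print(len(batches[0]))   # size of a full batch
print(len(batches[-1]))  # the last batch holds whatever is left over

# The batch count is the sample count divided by the batch size, rounded up
assert len(batches) == math.ceil(len(paths) / 32)
```

Since 1,000 is not a multiple of 32, the final batch is smaller than the others: 31 full batches of 32 plus one batch of 8.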

In [None]:
# Create a simple function to return a tuple (image, label) of Tensors
def get_image_label(image_path, label):
  """
  Takes an image file path name and the associated label,
  processes the image and returns a tuple of (image, label).
  """ 
  
  return process_image(image_path), label

In [None]:
# # Demo of the above function - tuple of (image, label) pair in the form of Tensors
# (process_image(X[42]), tf.constant(y[42]))

Now we've got a way to turn our data into tuples of Tensors in the form: 
`(image, label)`, let's make a function to turn all of our data (`X` & `y`)
into batches!

In [None]:
# Define the batch size, 32 is a good start
BATCH_SIZE = 32

# Create a function to turn data into batches
def create_data_batches(X, y=None, batch_size=BATCH_SIZE, 
                        valid_data=False, test_data=False):
  """
  Creates batches of data out of image (X) and label (y) pairs.             
  It shuffles the data if it's training data, but doesn't shuffle if it's 
  validation data. Also accepts test data as input (no labels).
  """
  # If the data is a test dataset, we don't have labels
  if test_data:
    print("Creating test data batches...")
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X))) # only filepaths
    data_batch = data.map(process_image).batch(batch_size) # only filepaths
    return data_batch

  # If the data is a valid dataset, we don't need to shuffle it
  elif valid_data:
    print("Creating validation data batches...")
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X), # filepaths
                                               tf.constant(y))) # labels
    data_batch = data.map(get_image_label).batch(batch_size)
    return data_batch

  # If not a test or valid dataset, then it's a training batch
  else:
    print("Creating training data batches...")
    # Turn filepaths and labels into Tensors
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X),
                                               tf.constant(y)))
    # Shuffling filenames and labels before mapping image processor function \
    # is faster than shuffling images
    data = data.shuffle(buffer_size=len(X)) # shuffle the whole lot (len(X))
    
    # Create (image, label) tuples \
    # This also turns the image path into a preprocessed image
    data = data.map(get_image_label)

    # Turn the training data into batches
    data_batch = data.batch(batch_size)
    return data_batch
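
The reason for shuffling before mapping can be shown with a plain-Python sketch: shuffling short `(path, label)` pairs is cheap, while loading images is the expensive part and can be deferred until each batch is needed. This is a conceptual illustration only, not how `tf.data` works internally, and `load_image` and `batched_pipeline` are hypothetical names.

```python
import random

def load_image(path):
    """Hypothetical stand-in for the expensive step (reading and decoding a file)."""
    return f"pixels_of_{path}"

def batched_pipeline(paths, labels, batch_size=32, shuffle=True, seed=0):
    """Shuffle cheap (path, label) pairs first, then load images lazily, batch by batch."""
    pairs = list(zip(paths, labels))
    if shuffle:
        random.Random(seed).shuffle(pairs)  # only short strings move around here
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        images = [load_image(p) for p, _ in chunk]  # loading happens per batch
        yield images, [label for _, label in chunk]

paths = [f"image_{i}.jpg" for i in range(100)]
labels = list(range(100))

first_images, first_labels = next(batched_pipeline(paths, labels))
print(len(first_images), len(first_labels))
```

Passing `shuffle=False` keeps the original order, which mirrors how the validation and test branches above skip the shuffle step.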
                              

In [None]:
%%time
# Create training and validation data batches \
# Test our create_data_batches function
train_data = create_data_batches(X_train, y_train)
val_data = create_data_batches(X_val, y_val, valid_data=True)

In [None]:
# Check out the different attributes of our data batches
train_data.element_spec, val_data.element_spec

## Visualizing Data Batches

Our data is now in batches, but batches can be a little hard to understand,
so let's visualize them.

In [None]:
import matplotlib.pyplot as plt

# Create a function for viewing images in a data batch
def show_25_images(images, labels):
  """
  Displays a plot of 25 images and their labels from a data batch.
  """
  # Setup the figure
  plt.figure(figsize=(10, 10))
  # Loop through 25 (for displaying 25 images)
  for i in range(25):
    # Create subplots (5 rows, 5 columns)
    ax = plt.subplot(5, 5, i+1)
    # Display an image
    plt.imshow(images[i])
    # Add the image label as the title
    plt.title(unique_breeds[labels[i].argmax()])
    # Turn the grid lines off
    plt.axis("off")

`.argmax()` is a NumPy array method (also available as the function `np.argmax`) that returns the index of the maximum value, optionally along a given axis. If the maximum occurs more than once, the index of the first occurrence is returned. Here's an example:

```python
import numpy as np

a = np.array([1, 2, 3, 2, 1])
print(np.argmax(a))
```

This will output `2`, which is the index of the maximum value in array `a`.
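
Building on that, `argmax` can also be applied row by row with `axis=1`, which is handy for a whole batch of boolean labels. The rows and class names below are made up for illustration.

```python
import numpy as np

# Three made-up one-hot label rows, as they would appear in a batch
batch_labels = np.array([[False, False, True],
                         [True, False, False],
                         [False, True, False]])

# axis=1 finds the position of the maximum within each row
class_indices = batch_labels.argmax(axis=1)
print(class_indices)

# Hypothetical class names, indexed by those positions
class_names = np.array(["beagle", "boxer", "pug"])
print(class_names[class_indices])
```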

In [None]:
# Index 19 holds the maximum value in the first row of y, i.e. the value True (1)
y[0].argmax()

In [None]:
unique_breeds[y[0].argmax()]

In [None]:
unique_breeds[:10]

In [None]:
train_data

In [None]:
# Grab the first batch from train_data and split it into train_images and train_labels
train_images, train_labels = next(train_data.as_numpy_iterator()) # next() returns the next item (one batch) from the iterator
# train_images[:5], train_labels[:5]

In [None]:
len(train_images), len(train_labels)

In [None]:
# Now let's visualize the data in a training batch
show_25_images(train_images, train_labels)

In [None]:
# Now let's visualize our validation set
val_images, val_labels = next(val_data.as_numpy_iterator())
show_25_images(val_images, val_labels)

In [None]:
# from PIL import Image

# # Open an image file
# with Image.open("/content/drive/MyDrive/Dog Vision/tensor-flow.png") as im:
#     # Display image
#     im.show()

## Building a Model

Before we build a model, there are a few things we need to define:
* The input shape (our image's shape, in the form of Tensors) of our model.
* The output shape (image labels, in the form of Tensors) of our model.
* The URL of the model we want to use.

In [None]:
from IPython.display import Image
Image("/content/drive/MyDrive/Dog Vision/tensor-flow.png")

In [None]:
Image("/content/drive/MyDrive/Dog Vision/tensor-flow_1.png")

In [None]:
# Setup input shape to the model
INPUT_SHAPE = [None, IMG_SIZE, IMG_SIZE, 3] # batch, height, width, colour channels

# Setup output shape of our model
OUTPUT_SHAPE = len(unique_breeds)

# Setup model URL from TensorFlow Hub
MODEL_URL = ...

## **Optional: How machines learn and what's going on behind the scenes?**

Massive effort getting the data ready for use with a machine learning model! This is one of the most important steps in any machine learning project.

Now you've got the data ready, you're about to dive headfirst into writing deep learning code with TensorFlow 2.x.

Since we're focused on writing code first and foremost, these videos are optional but they're here for those who want to start to get an understanding of what goes on behind the scenes.

**How Machines Learn**

The first is a video called [How Machines Learn by CGP Grey on YouTube](https://www.youtube.com/watch?v=R9OHn5ZF4Uo).

It's a non-technical narrative explaining how some of the biggest tech companies in the world use data to improve their businesses. In short, they're leveraging techniques like the ones you've been learning. Instead of trying to think of every possible rule to code, they collect data and then use machines to figure out the patterns for them.

**What actually is a neural network?**

You're going to be writing code which builds a neural network (a type of machine learning model) so you might start to wonder, what's going on when you run the code?

When you pass inputs (often data and labels) to a neural network and it figures out patterns between them, how is it doing so?

When it tries to make predictions and gets them wrong, how does it improve itself?

The [deep learning series by 3Blue1Brown on YouTube](https://www.youtube.com/watch?v=aircAruvnKk) contains a technical deep-dive into what's going on behind the code you're writing.

Be warned though, it isn't for the faint of heart. The videos explain the topics in a beautiful way but it doesn't mean the topics aren't still difficult to comprehend.

If you're up for it, a good idea would be to watch 1 video in the series one day and then another the day after and so on.

Remember, you don't need to know all of these things to get started writing machine learning code. Focus on solving problems first (like we're doing in this project) and then dive deeper when you need to.

And since these videos are optional, feel free to bookmark them for now, continue with the course and come back later!

Links:

* (Non-technical) How Machines Learn by CGP Grey: https://www.youtube.com/watch?v=R9OHn5ZF4Uo
* (Technical) Deep Learning series by 3Blue1Brown: https://www.youtube.com/watch?v=aircAruvnKk