# Dog Breed Classifier

**Name:** Dog Breed Classifier

**Author:** Sharome Burton

**Date:** 07/20/2021

**Description:** Machine learning model used to determine the breed of a dog from a given image.

**Kaggle:** https://www.kaggle.com/sharomeethan/disaster-tweet-classifier

**Colab:** https://colab.research.google.com/drive/1wLBBuwKx4a9w3jTectO0nzCzU9d3w_BB?usp=sharing

<img src="https://raw.githubusercontent.com/koulkoudakis/dog-breed-classifier/main/dog-breed-classifier.png"
     alt="dog-breed-classifier"
     style="float: left; margin-right: 10px;" />

## 1. Problem definition
> How well can we identify the breed of a dog from a given image?

## 2. Data
We are provided with a training set and a test set of images of dogs. Each image has a filename that is its unique `id`. The dataset comprises 120 breeds of dogs.
   
* `train.zip` - the training set; the breed of each dog is provided
* `test.zip` - the test set; we must predict the probability of each breed for each image
* `sample_submission.csv` - a sample submission file in the correct format
* `labels.csv` - the breeds for the images in the train set

There are 10,000+ images in each set; only the training images are labeled.
    
source: https://www.kaggle.com/c/dog-breed-identification/data

## 3. Features

   * `id` - a unique identifier for each image
   * `breed` - the breed of the dog, e.g.
    * affenpinscher
    * afghan_hound
    * african_hunting_dog
    * airedale
    * american_staffordshire_terrier 
   
## 4. Evaluation 

> **Goal:** Determine the breed of a dog in a given image with >75% accuracy.

The submission is a file containing predicted probabilities for every dog breed for each test image. Submissions are evaluated on multi-class log loss between the predicted probabilities and the observed targets.

source: https://www.kaggle.com/c/dog-breed-identification/overview/evaluation
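
As a rough illustration of the metric named above, the sketch below computes multi-class log loss with NumPy. The function name `multiclass_log_loss`, the clipping constant `eps` and the toy arrays are assumptions made for this example; Kaggle's own scoring code may differ in details such as clipping.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: one-hot array of shape (n_samples, n_classes)
    y_prob: predicted probabilities of the same shape
    eps:    hypothetical clipping constant that avoids log(0)
    """
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # Re-normalize each row so it still sums to 1 after clipping
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# Toy example with 3 classes (made-up numbers)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = multiclass_log_loss(y_true, y_prob)

# A model that guesses uniformly over 120 breeds scores ln(120)
uniform_loss = multiclass_log_loss(np.eye(120), np.full((120, 120), 1 / 120))
```

Lower is better: a confident correct prediction contributes almost nothing to the loss, while a confident wrong prediction is penalized heavily.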




## Loading Dataset

In [1]:
# Unzipping dataset
# !unzip "drive/MyDrive/ML Projects/dog-breed-identification.zip" -d "drive/MyDrive/ML Projects/"

## Getting tools ready

* Import TensorFlow 2.x 
* Import TensorFlow Hub
* Ensure access to GPU


In [1]:
# TensorFlow
import tensorflow as tf
print("Tf version:", tf.__version__)
# Tensorflow Hub
import tensorflow_hub as hub
print("TF Hub version:", hub.__version__)


# Check GPU availability (the function must be called; the bare function object is always truthy)
print("GPU: ", "available" if tf.config.list_physical_devices("GPU") else "not available")

## Getting data ready
With all machine learning models, data has to be in numerical format. Here we must convert our images into tensors.


In [1]:
# Display labels
import pandas as pd

labels_csv = pd.read_csv("/content/drive/MyDrive/ML Projects/labels.csv")


# Summary statistics, then preview the table
print(labels_csv.describe())
labels_csv


In [1]:
labels_csv["breed"].value_counts()

In [1]:
labels_csv["breed"].value_counts().plot.bar(figsize=(30,10))

In [1]:
labels_csv["breed"].value_counts().median()

In [1]:
# View image sample
from IPython.display import Image
Image("/content/drive/MyDrive/ML Projects/train/ffe5f6d8e2bff356e9482a80a6e29aac.jpg")

### Fetching images and labels



In [1]:
labels_csv.tail()

In [1]:
# Create pathnames from image IDs
filenames = ["drive/MyDrive/ML Projects/train/" + fname + ".jpg" for fname in labels_csv["id"]]

# Check first 10
filenames[:10]

In [1]:
# Check whether number of filenames matches number of images
import os
if len(os.listdir("drive/MyDrive/ML Projects/train/")) != len(filenames):
  print("Mismatched number of filenames and files, check target directory.")
else:
  print("Identical number of filenames and files.")

In [1]:
Image(filenames[9000])

In [1]:
labels_csv["breed"][9000]

Since we have all training filepaths in a list, let's prepare labels.

In [1]:
import numpy as np
labels = labels_csv["breed"]
labels

In [1]:
len(labels)

In [1]:
# Check for missing files
if len(labels) == len(filenames):
  print("Number of labels matches number of filenames")
else:
  print("Number of labels does not match number of filenames")

In [1]:
# Find unique label values
unique_breeds = np.unique(labels)
unique_breeds[:10]

In [1]:
len(unique_breeds)

In [1]:
# turn single label into array of booleans
labels[0] == unique_breeds

In [1]:
# Convert every label into boolean array
boolean_labels = [label == unique_breeds for label in labels]
boolean_labels[:2]


In [1]:
len(boolean_labels)

In [1]:
# Converting boolean array into integers
print(labels[0]) # original label
print(np.where(unique_breeds == labels[0])) # index where label occurs
print(boolean_labels[0].argmax()) # index where label occurs in boolean array
print(boolean_labels[0].astype(int)) # value 1 where sample label occurs
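
The label encoding above can also be done in one vectorized step. The following is a minimal sketch on a made-up list of four labels (the breed names and variable names here are illustrative, not taken from `labels.csv`); it relies on NumPy broadcasting to compare every label against every unique breed at once.

```python
import numpy as np

# Hypothetical labels standing in for labels_csv["breed"]
toy_labels = np.array(["beagle", "pug", "beagle", "collie"])

# np.unique returns the sorted unique values: beagle, collie, pug
toy_unique = np.unique(toy_labels)

# Broadcasting (4, 1) against (1, 3) gives a (4, 3) boolean one-hot matrix
one_hot = toy_labels[:, None] == toy_unique[None, :]

# argmax along each row recovers the integer class index
class_indices = one_hot.argmax(axis=1)

# Indexing the unique values with those indices recovers the original labels
recovered = toy_unique[class_indices]
```

Each row of `one_hot` contains exactly one `True`, which is the format `CategoricalCrossentropy` expects once the booleans are cast to numbers.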

### Creating validation set
Since the dataset from Kaggle has no validation set, we will create our own.

In [1]:
# Set up X & y variables
X = filenames
y = boolean_labels

In [1]:
# Set number of images to use for experimentation
NUM_IMAGES = 1000 #@param {type:"slider", min:1000, max:10000, step:1}

### Splitting data

In [1]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X[:NUM_IMAGES],
                                                  y[:NUM_IMAGES],
                                                  test_size=0.2,
                                                  random_state=18)

len(X_train), len(y_train), len(X_val), len(y_val)

In [1]:
X_train[:2], y_train[:2]
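
To show what the split above does conceptually, here is a small NumPy-only sketch of a shuffled 80/20 split. The helper `shuffled_split` is hypothetical and will not reproduce the exact ordering that scikit-learn's `train_test_split` produces for `random_state=18`; it only illustrates the idea and the resulting sizes.

```python
import numpy as np

def shuffled_split(items, labels, test_size=0.2, seed=18):
    """Shuffle indices once, then carve off the first `test_size` share for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(items))
    n_val = int(np.ceil(len(items) * test_size))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return ([items[i] for i in train_idx], [items[i] for i in val_idx],
            [labels[i] for i in train_idx], [labels[i] for i in val_idx])

# Stand-ins for the first 1000 filenames and labels
toy_X = [f"img_{i}.jpg" for i in range(1000)]
toy_y = [i % 120 for i in range(1000)]

tr_X, va_X, tr_y, va_y = shuffled_split(toy_X, toy_y)
split_sizes = (len(tr_X), len(va_X))
```

With 1000 samples and `test_size=0.2`, this leaves 800 samples for training and 200 for validation, and the same index order is applied to images and labels so the pairs stay aligned.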

## Preprocessing Images (turning images into Tensors)

To preprocess our images into Tensors, we will write a function which does these things:
1. Take an image filepath as input
2. Use TensorFlow to read the file and save it to a variable, `image`
3. Turn our `image` (.jpg) into Tensors
4. Normalize our image (convert color channel values from 0-255 to 0-1)
5. Resize the `image` to be a shape of (224,224)
6. Return the modified `image`

Let's see what importing an image looks like:

In [1]:
# Convert image to NumPy array
from matplotlib.pyplot import imread
image = imread(filenames[0])
image.shape

In [1]:
image.max(), image.min()

In [1]:
tf.constant(image)[0]

In [1]:
# Define image size
IMG_SIZE = 224

# Create function for preprocessing images
def process_image(image_path, img_size=IMG_SIZE):
  """
  Takes an image file path and turns image into a tensor
  """
  # Read in image file
  image = tf.io.read_file(image_path)
  # Turn .jpg image into numerical tensor with 3 color channels
  image = tf.image.decode_jpeg(image, channels=3)
  # Convert color channel values from 0-255 to 0-1 values
  image = tf.image.convert_image_dtype(image, tf.float32)
  # Resize image to our desired value(224,224)
  image = tf.image.resize(image, size=[img_size, img_size])

  return image
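
To make the normalization and resizing steps concrete without TensorFlow, here is a NumPy-only sketch. It is a simplification: `normalize_and_resize` is a hypothetical helper, the input is a random fake image, and it uses nearest-neighbor sampling, whereas `tf.image.resize` defaults to bilinear interpolation.

```python
import numpy as np

def normalize_and_resize(image_uint8, size=224):
    """Scale uint8 pixels to [0, 1] and resize to (size, size) by nearest-neighbor sampling."""
    # Dividing by 255 maps the 0-255 range onto 0-1
    image = image_uint8.astype(np.float32) / 255.0
    height, width = image.shape[:2]
    # For each output row/column, pick the source row/column it maps to
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return image[rows][:, cols]

# Fake 375x500 RGB image standing in for a decoded .jpg
fake_image = np.random.default_rng(0).integers(0, 256, size=(375, 500, 3), dtype=np.uint8)
processed = normalize_and_resize(fake_image)
```

Whatever the original dimensions, every image leaves this step with the same shape and value range, which is what allows images to be stacked into batches.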

## Turning our data into batches

To use TensorFlow effectively, we need data in the form of tensor tuples: `(image, label)`.



In [1]:
# Create a simple function to return a tuple (image, label)
def get_image_label(image_path, label):
  """
  Takes an image file path name and associated label, processes
  the image and returns a tuple of (image, label).
  """
  image = process_image(image_path)
  return image, label

# Demo of above function
get_image_label(X[18], tf.constant(y[18]))


In [1]:
# Define the batch size
BATCH_SIZE = 32

# Create function to convert data to batches
def create_data_batches(X, y=None, batch_size=BATCH_SIZE, valid_data=False, test_data=False):
  """
  Creates batches of data out of image (X) and label (y) pairs.
  Shuffles the data if it is training data but does not shuffle
  validation data.
  Also accepts test data as input (no labels).
  """
  # If test data, we don't have labels
  if test_data:
    print("Creating test data batches...")
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X))) # only filepaths (no labels)
    data_batch = data.map(process_image).batch(batch_size)
    print("Test data batches created")
    return data_batch

  # If data is a validation dataset, we don't need to shuffle it
  elif valid_data:
    print("Creating validation data batches...")
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X), # filepaths
                                               tf.constant(y))) # labels
    data_batch = data.map(get_image_label).batch(batch_size)
    print("Validation data batches created")
    return data_batch

  else:
    print("Creating training data batches...")
    # Turn filepaths and labels into tensors
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X),
                                               tf.constant(y)))
    # Shuffling pathnames and labels before mapping image processor function
    # is faster than shuffling images
    data = data.shuffle(buffer_size=len(X))

    # Create (image, label) tuples (also turns image path into preprocessed image)
    data = data.map(get_image_label)

    # Turn training data into batches (use the batch_size argument, not the global)
    data_batch = data.batch(batch_size)

    print("Training data batches created")

  return data_batch


In [1]:
# Create training and validation data batches
train_data = create_data_batches(X_train, y_train)
val_data = create_data_batches(X_val, y_val, valid_data=True)

In [1]:
# Check attributes of data batches
train_data.element_spec, val_data.element_spec

In [1]:
len(train_data)
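
The length of a batched dataset is the number of batches, not the number of images. As a quick sanity check, the arithmetic can be sketched in plain Python. The helper names below are made up for this example, and the sketch assumes the default `tf.data` behavior of keeping a smaller final batch rather than dropping it.

```python
import math

def num_batches(num_samples, batch_size=32):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)

def last_batch_size(num_samples, batch_size=32):
    """Size of the final batch, which may be smaller than batch_size."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size

# With NUM_IMAGES = 1000 and test_size = 0.2: 800 training and 200 validation images
train_batches = num_batches(800)
val_batches = num_batches(200)
val_tail = last_batch_size(200)
```

So for the 1000-image experiment, 800 training images at a batch size of 32 divide evenly, while the 200 validation images end with one smaller batch.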

## Visualizing data batches

In [1]:
import matplotlib.pyplot as plt
import math

# Create a function for viewing images in data batch
def show_images(images, labels, size=25):
  """
  Displays a plot of up to 32 images and their labels from a data batch
  """
  # Setup figure
  plt.figure(figsize=(10,10))
  # Loop through size
  for i in range(size):
    # Create subplots (dimension*dimension grid)
    dimension = int(math.sqrt(size-1))+1
    ax = plt.subplot(dimension,dimension,i+1)
    # Display an image
    plt.imshow(images[i])
    # Add image label as title
    plt.title(unique_breeds[labels[i].argmax()])
    # Turn grid lines off
    plt.axis("off")




In [1]:
train_images, train_labels = next(train_data.as_numpy_iterator())
# train_images, train_labels

In [1]:
len(train_images), len(train_labels)

In [1]:
# Now let's visualize data in training batch
show_images(train_images, train_labels, size=16)

In [1]:
# Visualize validation set
val_images, val_labels = next(val_data.as_numpy_iterator())
show_images(val_images, val_labels)

## Building a model
Before building a model, there are a few things to define:
* The input shape (image shape, in the form of tensors) to our model.
* The output shape (image labels, in the form of tensors) of our model.
* (optional) the URL of the model we want to use from TensorFlow Hub - https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5

In [1]:
# Setup input shape to the model
INPUT_SHAPE = [None, IMG_SIZE, IMG_SIZE, 3] # batch, height, width, color channels

# Setup output shape
OUTPUT_SHAPE = len(unique_breeds)

# Setup model URL from TensorFlow Hub
MODEL_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5"

Let's create a function which:
* Takes input shape, output shape and model we've chosen as parameters
* Defines the layers in a Keras model in sequential fashion (do this first, then this, then that).
* Compiles the model (says how it should be evaluated and improved)
* Builds model (tells the model the input shape to expect)
* Returns the model

Steps may be found here: https://www.tensorflow.org/guide/keras/sequential_model

In [1]:
# Create a function which builds a Keras model
def create_model(input_shape=INPUT_SHAPE, output_shape=OUTPUT_SHAPE, model_url=MODEL_URL):
  """
  Creates a Keras model with the specified input shape and output shape,
  using the model at the specified TensorFlow Hub URL.
  """
  print("Building model with:", model_url)

  # Setup model layers
  model = tf.keras.Sequential([
                              hub.KerasLayer(model_url), # Layer 1 (input layer)
                              tf.keras.layers.Dense(units=output_shape,
                              activation="softmax") # Layer 2 (output layer)
                              ])
  # Compile model
  model.compile(
      loss=tf.keras.losses.CategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.Adam(),
      metrics=["accuracy"]
  )

  # Build model
  model.build(input_shape) # use the function argument rather than the global

  return model


In [1]:
model = create_model()
model.summary()
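
To illustrate what the final `Dense` layer with softmax activation computes, here is a NumPy sketch with random weights. It assumes that the TensorFlow Hub classification model emits 1001 ImageNet logits per image (an assumption for this example; check `model.summary()` for the real shape), and the weights below are random stand-ins, not trained values.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)
hub_outputs = rng.normal(size=(4, 1001))             # assumed: 4 images, 1001 logits each
weights = rng.normal(scale=0.01, size=(1001, 120))   # random stand-in for the Dense kernel
bias = np.zeros(120)                                 # random stand-in for the Dense bias

# Dense layer: affine transform, then softmax turns 120 scores into probabilities
probs = softmax(hub_outputs @ weights + bias)

# Under the 1001-input assumption, the Dense layer holds this many trainable parameters
dense_params = weights.size + bias.size
```

Because softmax outputs are non-negative and sum to 1 for each image, each row can be read as a probability distribution over the 120 breeds.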

## Creating model callbacks
Callbacks are helper functions a model can use during training to do such things as save its progress, check its progress or stop training early if a model stops improving.

We will create two callbacks:
* One for TensorBoard, which helps track our model's progress
* Another for stopping training early to prevent overfitting

### TensorBoard Callback

To set up a TensorBoard callback, we need to do 3 things:
1. Load the TensorBoard notebook extension
2. Create a TensorBoard callback which is able to save logs to a directory and pass it to our model's `fit()` function
3. Visualize our model's training logs with the `%tensorboard` magic function

In [1]:
# Load TensorBoard notebook extension
%load_ext tensorboard

In [1]:
import datetime

# Create a function to build a TensorBoard callback
def create_tensorboard_callback():
  # Create a log directory for storing TensorBoard logs
  logdir = os.path.join("drive/MyDrive/ML Projects/logs",
                        # Make it so that the logs get tracked when
                        # we run an experiment
                        datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  return tf.keras.callbacks.TensorBoard(logdir)

### Early Stopping Callback

Early stopping helps stop our model from overfitting by stopping training if a certain evaluation metric stops improving.

Link: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping

In [1]:
# Create early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                                  patience=3)
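
The effect of `patience=3` can be sketched in plain Python. This is a simplified model of the idea, not the Keras implementation: the helper `early_stopping_epoch` is hypothetical, the accuracy values are made up, and options such as `min_delta` or `restore_best_weights` are ignored.

```python
def early_stopping_epoch(history, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    `history` is a list of monitored values (higher is better, e.g. validation accuracy).
    """
    best = float("-inf")
    wait = 0  # epochs since the last improvement
    for epoch, value in enumerate(history, start=1):
        if value > best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies: the best value is first reached at epoch 3
toy_val_accuracy = [0.30, 0.45, 0.50, 0.49, 0.50, 0.48, 0.47]
stop_epoch = early_stopping_epoch(toy_val_accuracy, patience=3)
```

In this toy run, matching the previous best does not count as an improvement, so training stops after three consecutive epochs without a new best value.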

## Training model (on subset of data)

Our first model will train on 1000 images in order to make sure everything is working.

In [1]:
NUM_EPOCHS = 100 #@param {type:"slider", min:"10", max:"100"}

In [1]:
# Check GPU availability (the function must be called; the bare function object is always truthy)
print("GPU: ", "available" if tf.config.list_physical_devices("GPU") else "not available")

We will:
* Create a model using `create_model()`
* Setup a TensorBoard callback using `create_tensorboard_callback()`
* Call the `fit()` function on our model, passing it the training data (`train_data`), validation data (`val_data`), number of epochs to train for (`NUM_EPOCHS`) and the callbacks we would like to use
* Return the model

In [1]:
# Build a function to train and return a trained model
def train_model():
  """
  Creates a new model, trains it and returns the trained version.
  """
  # Create model
  model = create_model()

  # Create new TensorBoard session each time we train a model
  tensorboard = create_tensorboard_callback()

  # Fit model to data 
  model.fit(x=train_data,
            epochs=NUM_EPOCHS,
            validation_data=val_data,
            validation_freq=1,
            callbacks=[tensorboard, early_stopping])
  # Return fitted model
  return model

In [1]:
# Fit model to data
model = train_model()

It looks like our model is overfitting because its performance on the training set far exceeds its performance on the validation set.

**Note**: Overfitting at the beginning is good; it means our model is learning.

### Checking TensorBoard logs

The TensorBoard magic function (`%tensorboard`) will access the logs directory and visualize its contents

In [1]:
%tensorboard --logdir drive/MyDrive/ML\ Projects/logs

## Making and evaluating predictions using trained model

In [1]:
# Make predictions on validation data
predictions = model.predict(val_data, verbose=1)
predictions

In [1]:
predictions.shape

In [1]:
predictions[0]

In [1]:
np.sum(predictions[0])

In [1]:
predictions[0][predictions[0].argmax()]

In [1]:
# Inspect a single prediction

index = 2
print(predictions[index])
print(f'Max value (probability of prediction): {np.max(predictions[index])}')
print(f'Sum: {np.sum(predictions[index])}')
print(f'Max index: {np.argmax(predictions[index])}')
print(f'Predicted label: {unique_breeds[np.argmax(predictions[index])]}')

**Note:** Prediction probabilities are also known as confidence scores; they are not confidence intervals in the statistical sense.

In [1]:
# Turn prediction probabilities into their respective labels (easier to understand)
def get_pred_label(prediction_probabilities):
  """
  Turns an array of prediction probabilities into a label
  """
  return unique_breeds[np.argmax(prediction_probabilities)]

# Get predicted label based on the array of predictions probabilities
pred_label = get_pred_label(predictions[18])
pred_label

In [1]:
val_data

Since our validation data is still in a batched dataset, we must unbatch it to recover the validation images and their truth labels, which we can then compare against the predictions.

In [1]:
# Create a function to unbatch a batch dataset
def unbatch_data(data):
  """
  Takes a batched dataset of (image,label) tensors and returns separate arrays
  of images and labels.
  """
  images = []
  labels = []
  # Loop through unbatched data
  for image, label in data.unbatch().as_numpy_iterator():
    images.append(image)
    labels.append(unique_breeds[np.argmax(label)])
  return images, labels

# Unbatch the validation data
val_images, val_labels = unbatch_data(val_data)
val_images[3], val_labels[3]

In [1]:
# images_ = []
# labels_ = []

# # Loop through unbatched data
# for image, label in val_data.unbatch().as_numpy_iterator():
#   images_.append(image)
#   labels_.append(label)

# images_[0], labels_[0]

In [1]:
val_labels[3] # truth label (already converted to a breed name by unbatch_data())

In [1]:
get_pred_label(predictions[3])

We now have ways to obtain:
* Prediction labels
* Validation labels (truth labels)
* Validation images
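
With predicted probabilities and truth labels in hand, the accuracy targeted in the evaluation goal can be computed directly. The following is a minimal NumPy sketch on made-up data; `accuracy_from_probs` and the toy arrays are illustrative names, not part of the notebook.

```python
import numpy as np

def accuracy_from_probs(probs, true_labels, class_names):
    """Fraction of samples whose highest-probability class matches the truth label."""
    predicted = class_names[np.argmax(probs, axis=1)]
    return float(np.mean(predicted == np.asarray(true_labels)))

# Toy stand-ins for unique_breeds, predictions and val_labels
toy_breeds = np.array(["beagle", "collie", "pug"])
toy_probs = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.2, 0.7],
                      [0.6, 0.3, 0.1]])
toy_truth = ["beagle", "collie", "pug", "collie"]

toy_accuracy = accuracy_from_probs(toy_probs, toy_truth, toy_breeds)
```

The same call should work on the notebook's `predictions`, `val_labels` and `unique_breeds`, provided the validation data was not shuffled between predicting and unbatching.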

We will now make some functions to visualize the results.

We will make a function which:
* Takes an array of prediction probabilities, an array of truth labels, an array of images and an integer index.
* Converts the prediction probabilities to a predicted label.
* Plots the predicted label, its predicted probability, the truth label and the target image on a single plot.

In [1]:
def plot_pred(prediction_probabilities, labels, images, n=1):
  """
  View the prediction, ground truth and image for sample n
  """
  pred_prob, true_label, image = prediction_probabilities[n], labels[n], images[n]

  # Get pred label
  pred_label = get_pred_label(pred_prob)

  # Plot image and remove ticks
  plt.imshow(image)
  plt.xticks([])
  plt.yticks([])

  # Change color of the title depending on if prediction is right or wrong
  if pred_label == true_label:
    color = "green"
  else:
    color = "red"

  # change plot title to be predicted, probability of prediction and truth label
  plt.title("{} {:2.0f}% {}".format(pred_label,
                                    np.max(pred_prob)*100,
                                    true_label),
                                    color=color
                                    )
                                    

In [1]:
plot_pred(prediction_probabilities=predictions,
          labels=val_labels,
          images=val_images,
          n=18)

Now that we have one function to visualize our model's top prediction, we will make another to view our model's top 10 predictions.

This function will:
* Take an array of prediction probabilities, an array of ground truth labels and an integer as input
* Find the predicted label using `get_pred_label()`
* Find the top 10:
  * Prediction probability indices
  * Prediction probability values
  * Prediction labels
* Plot the top 10 prediction probability values and labels, coloring the true label green

In [1]:
def plot_pred_conf(prediction_probabilities, labels, n=1):
  """
  Plot the top 10 highest prediction confidences along
  with the truth label for sample n.
  """
  pred_prob, true_label = prediction_probabilities[n], labels[n]

  # Get predicted label
  pred_label = get_pred_label(pred_prob)

  # Find top 10 prediction confidence indices
  top_10_pred_indices = pred_prob.argsort()[-10:][::-1]
  # Find top 10 prediction confidence values
  top_10_pred_values = pred_prob[top_10_pred_indices]
  # Find top 10 prediction labels
  top_10_pred_labels = unique_breeds[top_10_pred_indices]

  # Setup plot
  top_plot = plt.bar(np.arange(len(top_10_pred_labels)),
                     top_10_pred_values,
                     color="gray")
  plt.xticks(np.arange(len(top_10_pred_labels)),
             labels=top_10_pred_labels,
             rotation="vertical")
  
  # Change color of true label
  if np.isin(true_label, top_10_pred_labels):
    top_plot[np.argmax(top_10_pred_labels == true_label)].set_color("green")
  else:
    pass
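
The `argsort()[-10:][::-1]` idiom used above can be checked on a tiny example. This sketch uses made-up class names and probabilities; `top_k` is a hypothetical helper written for illustration.

```python
import numpy as np

def top_k(pred_prob, class_names, k=10):
    """Return (label, probability) pairs for the k highest probabilities, best first."""
    # argsort is ascending, so take the last k indices and reverse them
    top_indices = pred_prob.argsort()[-k:][::-1]
    return [(str(class_names[i]), float(pred_prob[i])) for i in top_indices]

toy_names = np.array(["beagle", "collie", "pug", "boxer"])
toy_pred = np.array([0.10, 0.60, 0.05, 0.25])

top_two = top_k(toy_pred, toy_names, k=2)
top_two_labels = [label for label, _ in top_two]
```

Requesting more entries than there are classes simply returns every class, sorted from most to least likely.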

In [1]:
plot_pred_conf(prediction_probabilities=predictions,
               labels = val_labels,
               n=25)

In [1]:
plot_pred(prediction_probabilities=predictions,
          labels=val_labels,
          images=val_images,
          n=25)

Now that we have some functions to help us visualize our predictions and evaluate our model, let's check out a few.

In [1]:
i_multiplier = 18
num_rows = 2
num_cols = 2
num_images = num_rows*num_cols
plt.figure(figsize=(10*num_cols, 5*num_rows))
for i in range(num_images):
  plt.subplot(num_rows, 2*num_cols, 2*i+1)
  plot_pred(prediction_probabilities=predictions,
            labels=val_labels,
            images=val_images,
            n=i+i_multiplier)
  plt.subplot(num_rows, 2*num_cols, 2*i+2)
  plot_pred_conf(prediction_probabilities=predictions,
                 labels=val_labels,
                 n=i+i_multiplier)

plt.tight_layout(h_pad=1.0)
plt.show()

## Saving and loading a trained model

In [1]:
# Create a function to save model
def save_model(model, suffix=None):
  """
  Saves a given model in a models directory and appends a suffix (string).
  """
  # Create a model directory pathname with current time
  modeldir = os.path.join("drive/MyDrive/ML Projects/models", 
                          datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  model_path = modeldir + ("-" + suffix if suffix else "") + ".h5" # Save format of model (suffix is optional)
  print(f"Saving model to: {model_path}...")
  model.save(model_path)
  return model_path

In [1]:
# Create a function to load a trained model
def load_model(model_path):
  """
  Loads a saved model from a specified path.
  """
  print(f"Loading saved model from: {model_path}")
  model = tf.keras.models.load_model(model_path,
                                     custom_objects={"KerasLayer":hub.KerasLayer})
  return model

In [1]:
save_model(model, suffix="1000-images-mobilenetv2-Adam")

In [1]:
# Load a trained model
loaded_1000_image_model = load_model("drive/MyDrive/ML Projects/models/20210724-164447-1000-images-mobilenetv2-Adam.h5")

In [1]:
# Evaluate the in-memory model (before saving and loading)
model.evaluate(val_data)

In [1]:
# Evaluate saved model
loaded_1000_image_model.evaluate(val_data)

## Training model on full dataset

In [1]:
len(X), len(y)

In [1]:
# Create data batches with full dataset
full_data = create_data_batches(X,y)

In [1]:
full_data

In [1]:
# Create a model for full dataset
full_model = create_model()

In [1]:
# Create full model callbacks
full_model_tensorboard = create_tensorboard_callback()
# No validation set when training on all data, so we cannot monitor validation accuracy
full_model_early_stopping = tf.keras.callbacks.EarlyStopping(monitor="accuracy",
                                                             patience=3)

**Note:** Running the cell below will take longer on the first epoch because the GPU has to load all of the images into memory.

In [1]:
# Fit full model to full dataset
full_model.fit(x=full_data,
               epochs=NUM_EPOCHS,
               callbacks=[full_model_tensorboard, full_model_early_stopping])

In [1]:
# Save trained full model
save_model(full_model, suffix="full-imageset-mobilenetv2-Adam")

In [1]:
# Load in full model
loaded_full_model = load_model("drive/MyDrive/ML Projects/models/20210724-173146-full-imageset-mobilenetv2-Adam.h5")

## Making predictions on test dataset

Since our model has been trained on images in the form of tensor batches, we have to get the test data into the same format to make predictions on it.

Earlier we created `create_data_batches()`, which can take a list of filenames as input and convert them into tensor batches.

To make predictions on the test data, we will:
* Get the test image filenames
* Convert the filenames into test data batches using `create_data_batches` and setting the `test_data` parameter to `True` (the test data does not have labels)
* Make a predictions array by passing the test batches to the `predict()` method of our model

In [1]:
# Load test image filenames
test_path = "drive/MyDrive/ML Projects/test/"
test_filenames = [test_path +  fname for fname in os.listdir(test_path)]
test_filenames[:10]

In [1]:
len(test_filenames)

In [1]:
# Create test data batch from filenames
test_data = create_data_batches(test_filenames, test_data=True)


In [1]:
test_data

**Note:** Calling `predict()` on our full model and passing it the test data batch will take a long time to run.

In [1]:
# Make predictions on test data batch using loaded full model
test_predictions = loaded_full_model.predict(test_data, verbose=1)

In [1]:
# Save predictions to csv file for later access
np.savetxt("drive/MyDrive/ML Projects/predictions/pred_array.csv",
           test_predictions,
           delimiter=",")

In [1]:
# Load predictions from csv file
test_predictions = np.loadtxt("drive/MyDrive/ML Projects/predictions/pred_array.csv",
                              delimiter=",")

In [1]:
test_predictions.shape

## Preparing test dataset predictions for Kaggle

From the Kaggle sample submission, we find that the model prediction probability must be output in a DataFrame with an ID and a column for each dog breed. 

link: https://www.kaggle.com/c/dog-breed-identification/overview/evaluation

To get data in this format, we will:
* Create a pandas DataFrame with an ID column as well as a column for each dog breed
* Add data to the ID column by extracting test image IDs from their filepaths
* Add data (prediction probabilities) to each of the dog breed columns
* Export DataFrame as a CSV to submit to Kaggle


In [1]:
# Create a pandas DataFrame with empty columns
preds_df = pd.DataFrame(columns=["id"] + list(unique_breeds))
preds_df.head()

In [1]:
# Extract test image IDs from the same filenames used to build test_data (keeps IDs aligned with predictions)
test_ids = [os.path.splitext(os.path.basename(path))[0] for path in test_filenames]
test_ids[:10]

In [1]:
preds_df["id"] = test_ids

In [1]:
os.path.splitext(test_filenames[0])

In [1]:
# Add prediction probabilities to each dog breed column
preds_df[list(unique_breeds)] = test_predictions
preds_df.head()
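
Before exporting, it can be worth sanity-checking the submission contents. The sketch below is one possible check, written with NumPy only; `check_submission`, its messages and the tolerance are assumptions for this example rather than Kaggle requirements.

```python
import numpy as np

def check_submission(ids, probs, num_classes, atol=1e-3):
    """Return a list of problems found in a submission; an empty list means none found."""
    probs = np.asarray(probs, dtype=float)
    problems = []
    # One row per image ID, one column per breed
    if probs.shape != (len(ids), num_classes):
        problems.append("shape mismatch")
    # Every image should appear exactly once
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    # Softmax outputs should sum to roughly 1 per row
    if probs.ndim == 2 and not np.allclose(probs.sum(axis=1), 1.0, atol=atol):
        problems.append("rows do not sum to 1")
    return problems

# Toy two-class examples (made-up IDs and probabilities)
good = check_submission(["a", "b"], [[0.5, 0.5], [0.9, 0.1]], num_classes=2)
bad = check_submission(["a", "a"], [[0.5, 0.6], [0.9, 0.1]], num_classes=2)
```

In the notebook this could be called as `check_submission(test_ids, test_predictions, len(unique_breeds))`.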


In [1]:
# Export to .csv for submission to Kaggle
preds_df.to_csv("drive/MyDrive/ML Projects/predictions/full_model_predictions_1_mobilenetV2.csv",
                index=False)

## Making predictions on custom images

To make predictions on custom images, we must:
* Get the filepaths of our own images
* Turn the filepaths into data batches using `create_data_batches()`. Since our custom images don't have labels, we set the `test_data` parameter to `True`.
* Pass the custom image data batch to our model's `predict()` method.
* Convert prediction output probabilities to prediction labels.
* Compare predicted labels to custom images

In [1]:
# Get custom image filepaths
custom_path = "drive/MyDrive/ML Projects/custom images/"
custom_image_paths = [custom_path + fname for fname in os.listdir(custom_path)]

In [1]:
custom_image_paths

In [1]:
# Turn custom images into batch datasets
custom_data = create_data_batches(custom_image_paths, test_data=True)
custom_data

In [1]:
# Make predictions on custom data
custom_preds = loaded_full_model.predict(custom_data)
custom_preds.shape


In [1]:
# Get custom image prediction labels
custom_pred_labels = [get_pred_label(custom_preds[i]) for i in range(len(custom_preds))]
custom_pred_labels

In [1]:
# Get custom images (unbatch_data() function will not work since there are no labels)
custom_images = []
# Loop through unbatched data
for image in custom_data.unbatch().as_numpy_iterator():
  custom_images.append(image)

In [1]:
# Check custom image predictions
n = len(custom_image_paths)

plt.figure(figsize=(10,10))
for i, image in enumerate(custom_images):
  plt.subplot(int(math.sqrt(n))+1,int(math.sqrt(n))+1, i+1)
  plt.xticks([])
  plt.yticks([])
  plt.title(custom_pred_labels[i])
  plt.imshow(image)
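
The grid size above, `int(math.sqrt(n))+1`, allocates one more row and column than needed whenever `n` is a perfect square (for example a 3x3 grid for 4 images). A possible alternative, sketched here with a hypothetical helper `grid_side`, computes the smallest square grid that fits `n` images using integer arithmetic.

```python
import math

def grid_side(n):
    """Smallest side length s such that an s x s grid holds n images."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # isqrt avoids floating-point rounding; this equals ceil(sqrt(n)) for n >= 1
    return math.isqrt(n - 1) + 1

sides = {n: grid_side(n) for n in (1, 4, 5, 16, 17, 25)}
```

This matches the `int(math.sqrt(size-1))+1` expression already used in `show_images()`, so both plotting cells could share the same sizing rule.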
