# Dog Breed Recognition Project

## Project Basics

#### Problem

Our goal is to identify a dog's breed from a photo of the dog.  
The project is taken from [Kaggle Dog Breed Identification Competition](https://www.kaggle.com/c/dog-breed-identification/data).  
The machine learning problem is **supervised learning > multiclass classification**.  
Our task is to build a neural network image classifier using TensorFlow and TensorFlow Hub.

#### Evaluation

The evaluation metric set for the competition is Multiclass Log Loss.  
Our target matrix has shape N dogs x M breeds: 1 for the true breed, 0 for all others.  
Our model predicts a probability matrix with the same dimensions.  
Multiclass Log Loss measures the error of model predictions (the lower the better).  
Multiclass Log Loss is commonly used in image classification, natural language processing, and recommendation systems.
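
As a rough illustration of the metric, the sketch below computes Multiclass Log Loss for a made-up case with 2 dogs and 3 breeds: the mean negative log of the probability the model assigned to each true breed. The function name `multiclass_log_loss`, the toy values, and the clipping constant `eps` are assumptions for illustration only; they are not taken from the competition code.

```python
import math

def multiclass_log_loss(targets, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for target_row, prob_row in zip(targets, probabilities):
        for target, prob in zip(target_row, prob_row):
            # clip probabilities away from 0 and 1 so log() stays finite
            prob = min(max(prob, eps), 1 - eps)
            total += target * math.log(prob)
    return -total / len(targets)

# toy one-hot targets: 2 dogs x 3 breeds (hypothetical values)
targets = [[1, 0, 0],
           [0, 1, 0]]
# toy predicted probabilities with the same dimensions
probabilities = [[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]]

loss = multiclass_log_loss(targets, probabilities)
print(round(loss, 4))  # 0.2899
```

A confident, correct prediction contributes almost nothing to the loss, while a confident, wrong prediction is penalized heavily.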

#### Data Source

Data is acquired from [Kaggle Dog Breed Identification Competition](https://www.kaggle.com/c/dog-breed-identification/data).

#### Features / Data Dictionary

Our model analyzes image files (unstructured data) > deep learning / transfer learning.  
There are 120 unique dog breeds in the training set > multiclass classification with 120 classes.  
There are 10 222 images in the training set.  
There are 10 357 images in the test set.

## Importing Libraries

In [None]:
### importing tensorflow
import tensorflow
print(tensorflow.__version__)

### checking gpu availability
print(tensorflow.config.list_physical_devices())

### importing tensorflow hub
import tensorflow_hub as tfhub
print(tfhub.__version__)

### importing sklearn tools
from sklearn.model_selection import train_test_split

### other imports
from typing import List, Any
from pathlib import Path
import numpy
from pandas import read_csv, Series, DataFrame, concat, get_dummies
from IPython.display import Image

## Data Acquisition

#### Uploading Data

In [None]:
### unzipping project data
#!unzip "drive/MyDrive/Colab Data/dog-recognition.zip" -d "drive/MyDrive/Colab Data/"

#### Importing Labels

In [None]:
### importing labels
iLabels_df: DataFrame = read_csv(filepath_or_buffer="drive/MyDrive/Colab Data/Dog Recognition/labels.csv")

#### Exploring Labels

In [None]:
### dataframe head
iLabels_df.head()

In [None]:
### dataframe info
iLabels_df.info()

In [None]:
### unique breeds
iUnique_breeds: List[str] = iLabels_df["breed"].unique().tolist()
len(iUnique_breeds)

In [None]:
### images / breed
iLabels_df["breed"].value_counts()

In [None]:
### mean of images / breed
round(number=iLabels_df["breed"].value_counts().mean(), ndigits=3)

Google recommends at least 10 images per class.  
We have adequate data with ~85 images per class on average.

## Preparing Data

#### Creating Image Filepaths

In [None]:
### counting number of images in train folder
iImage_list: List[Path] = list(Path("drive/MyDrive/Colab Data/Dog Recognition/train").iterdir())
len(iImage_list)

In [None]:
### creating image filepaths from image ids
iLabels_df["imagepath"] = "drive/MyDrive/Colab Data/Dog Recognition/train/" + iLabels_df["id"] + ".jpg"

In [None]:
### exploring imagepaths: head
iLabels_df.head()

In [None]:
### exploring imagepaths: info
iLabels_df.info()

In [None]:
### exploring imagepaths: checking validity of random imagepath
print(iLabels_df.loc[9000, "breed"])
print()
print(iLabels_df.loc[9000, "imagepath"])
print()
Image(iLabels_df.loc[9000, "imagepath"])

#### Encoding Labels

In [None]:
### one hot encoding with pandas
iEncode_df: DataFrame = get_dummies(data=iLabels_df, columns=["breed"], dtype=int)
iEncode_df.drop(columns=["id","imagepath"], inplace=True)
iLabels_df = concat(objs=[iLabels_df,iEncode_df], axis="columns")

In [None]:
### exploring encoding: head
iLabels_df.head()

In [None]:
### exploring encoding: info
iLabels_df.info()

In [None]:
### exploring encoding: nan
iLabels_df.isna().any(axis="index").any()

#### Reducing and Splitting

In [None]:
### dataframe inits
work_df = DataFrame()
train_df = DataFrame()
valid_df = DataFrame()

In [None]:
### creating train and valid datasets: 12 random images per breed, 10 for train and 2 for valid
for breed in iUnique_breeds:
  work_df = iLabels_df.loc[iLabels_df["breed"] == breed].copy(deep=True)
  work_df = work_df.sample(n=12, random_state=42, ignore_index=True)
  ### .loc slicing is label-based and inclusive: rows 0-9 go to train, rows 10-11 go to valid
  train_df = concat(objs=[train_df, work_df.loc[:9]], ignore_index=True, copy=True)
  valid_df = concat(objs=[valid_df, work_df.loc[10:]], ignore_index=True, copy=True)

In [None]:
### verifying dimensions of train and valid datasets
train_df.shape, valid_df.shape

#### Creating Tensors

All machine learning algorithms require data in numerical format.  
So the first task is to turn images and labels into tensors.  
A tensor is an n-dimensional numerical array, similar to a NumPy ndarray.
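
To make the shapes concrete, here is a minimal sketch using NumPy arrays as stand-ins for tensors. The variable names and the zero-filled placeholder values are assumptions for illustration; the shapes follow the 224 x 224 RGB image size and the 120 breeds used in this project.

```python
import numpy

# stand-in for one RGB image: height x width x color channels
image_like = numpy.zeros(shape=(224, 224, 3), dtype=numpy.float32)

# stand-in for one one-hot label: one slot per breed
label_like = numpy.zeros(shape=(120,), dtype=numpy.int8)
label_like[5] = 1  # arbitrary breed index, for illustration only

# stacking 32 images adds a leading batch dimension
batch_like = numpy.stack([image_like] * 32)

print(image_like.ndim, image_like.shape)  # 3 (224, 224, 3)
print(label_like.ndim, label_like.shape)  # 1 (120,)
print(batch_like.ndim, batch_like.shape)  # 4 (32, 224, 224, 3)
```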

In [None]:
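### function creating image tensor
def imageTensor(pImage_path=str(), pImage_size=224):
  """
  Creates an image tensor from image filepath.
  Returns a float32 tensor of shape (pImage_size, pImage_size, 3) with values scaled to 0-1.
  """
  ### reading raw file bytes
  image_tensor = tensorflow.io.read_file(filename=pImage_path)
  ### decoding jpeg into uint8 tensor with 3 color channels (RGB)
  image_tensor = tensorflow.image.decode_jpeg(contents=image_tensor, channels=3)
  ### converting uint8 values 0-255 into float32 values 0-1
  image_tensor = tensorflow.image.convert_image_dtype(image=image_tensor, dtype=tensorflow.float32)
  ### resizing to square image
  image_tensor = tensorflow.image.resize(images=image_tensor, size=[pImage_size, pImage_size])
  return image_tensor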

In [None]:
### testing image tensor function
imageTensor(pImage_path=iLabels_df.iloc[42]["imagepath"], pImage_size=224)

In [None]:
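### function creating label tensor
def labelTensor(pBreed=str()):
  """
  Creates a one-hot label tensor from breed name.
  """
  breed_index = iUnique_breeds.index(pBreed)
  ### vector length follows number of unique breeds (120 in this dataset)
  label_array = numpy.zeros(shape=len(iUnique_breeds), dtype="int8")
  label_array[breed_index] = 1
  return tensorflow.constant(value=label_array)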

In [None]:
### testing label tensor function
labelTensor(pBreed=iLabels_df.iloc[42]["breed"])

#### Data Batches

GPUs have a limited amount of memory.  
The entire training dataset may not fit into GPU memory.  
To resolve this, we split our datasets into batches of 32 samples (image and label pairs).  
The neural network sees only one batch at a time.
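
The batching idea can be sketched in plain Python, independent of TensorFlow. The helper `make_batches` and the integer stand-ins for samples are hypothetical; in a TensorFlow pipeline this is usually handled by `tf.data.Dataset.batch` rather than written by hand.

```python
def make_batches(samples, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# hypothetical stand-in for 1200 training samples (120 breeds x 10 images)
sample_ids = list(range(1200))
batches = list(make_batches(sample_ids, batch_size=32))

# 37 full batches of 32 plus one final batch with the remaining 16 samples
print(len(batches), len(batches[0]), len(batches[-1]))  # 38 32 16
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.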

In [None]:
### function creating data batches
def dataBatches(features=DataFrame(), labels=DataFrame(), batch_size=32, batch_type="Train"):
  ### creating test batches
  if batch_type == "Test":

#### Reducing Data: Working Subset

In [None]:
### splitting data working / rest
PERCENT_IMAGES = 0.1 #@param {type:"slider", min:0.1, max:1.0, step:0.1}
rest_features, work_features, rest_targets, work_targets = train_test_split(
    features_series,
    targets_df,
    test_size=PERCENT_IMAGES,
    random_state=42)