# Dog Breed Recognition Project

## 1. Problem

Our goal is to identify a dog's breed from a photo of the dog.  
The machine learning problem is **supervised learning > multiclass classification**.  
Our task is to build a neural network image classifier using TensorFlow and TensorFlow Hub.

## 2. Evaluation

The evaluation metric set for the competition is Multiclass Log Loss.  
Our target matrix has shape N dogs x M breeds, with 1 for the true breed and 0 everywhere else.  
Our model predicts a probability matrix with the same structure.  
Multiclass Log Loss measures the error of model predictions (the lower the better).  
Multiclass Log Loss is also used in image classification, natural language processing, and recommendation systems.
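
As a rough illustration of the metric, the sketch below computes Multiclass Log Loss with NumPy, assuming the standard definition (the negative mean log-probability assigned to the true class). The function name `multiclass_log_loss`, the clipping constant `eps`, and the toy matrices are assumptions made for this example and are not part of the competition code.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability of the true class (hypothetical helper)."""
    # clipping avoids log(0) when a predicted probability is exactly 0
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # only the true-class column contributes, because y_true is one-hot
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# toy target matrix: 2 dogs x 3 breeds
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])

# a fairly confident, correct model vs. a model that guesses uniformly
confident = np.array([[0.90, 0.05, 0.05],
                      [0.10, 0.80, 0.10]])
uniform = np.full((2, 3), 1 / 3)

loss_confident = multiclass_log_loss(y_true, confident)
loss_uniform = multiclass_log_loss(y_true, uniform)
print(loss_confident, loss_uniform)
```

The confident model scores lower than the uniform guess, whose loss equals `log(3)` for three classes.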

## 3. Data

The data comes from the [Kaggle Dog Breed Identification Competition](https://www.kaggle.com/c/dog-breed-identification/data).

## 4. Features

#### Data Dictionary

* Our model analyzes images (unstructured data) > deep learning / transfer learning.
* There are 120 dog breeds in the training set > multiclass classification with 120 classes.
* There are 10 222 images in the training set.
* There are 10 357 images in the test set.

#### Importing the Tools

In [None]:
### importing tensorflow
import tensorflow
print(tensorflow.__version__)

### importing tensorflow hub
import tensorflow_hub as tfhub
print(tfhub.__version__)

### checking gpu availability
print(tensorflow.config.list_physical_devices())

### importing sklearn tools
from sklearn.model_selection import train_test_split

### other imports
from pathlib import Path
from pandas import read_csv, Series, get_dummies
from IPython.display import Image

#### Uploading Data

In [None]:
### unzipping project data
#!unzip "drive/MyDrive/Colab Data/dog-recognition.zip" -d "drive/MyDrive/Colab Data/"

#### Importing and Exploring the Target Variable

In [None]:
### importing labels
labels_df = read_csv(filepath_or_buffer="drive/MyDrive/Colab Data/labels.csv")

In [None]:
### exploring labels: dataframe head
labels_df.head()

In [None]:
### exploring labels: dataframe info
labels_df.info()

In [None]:
### exploring labels: unique breeds
labels_df["breed"].unique().size

In [None]:
### exploring labels: visualizing instances / breed
labels_df["breed"].value_counts().plot.bar(figsize=(15,5));

In [None]:
### exploring labels: mean of instances / breed
labels_df["breed"].value_counts().mean()

Google recommends at least 10 images per class.  
We have adequate data with ~85 images per class on average.
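
An average can hide under-represented classes, so it may be worth checking the smallest class as well. The sketch below does this on a made-up Series; `toy_breeds` and `MIN_IMAGES` are hypothetical names, and the threshold of 10 is simply the recommendation quoted above.

```python
from pandas import Series

# made-up labels: 12 beagles, 3 pugs, 15 boxers
toy_breeds = Series(["beagle"] * 12 + ["pug"] * 3 + ["boxer"] * 15)
counts = toy_breeds.value_counts()

MIN_IMAGES = 10  # threshold taken from the recommendation above

print(counts.mean())  # average images per breed
print(counts.min())   # images in the smallest breed

# breeds that fall below the threshold despite an acceptable average
too_small = counts[counts < MIN_IMAGES]
print(too_small)
```

On the real data, the same check would be `labels_df["breed"].value_counts().min()`.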

## 5. Modeling

#### Preparing Data: Image Filepaths (Features)

In [None]:
### counting number of images in train folder
image_list = list(Path("drive/MyDrive/Colab Data/train").iterdir())
len(image_list)

In [None]:
### creating filepaths of images from image ids
features_series = "drive/MyDrive/Colab Data/train/" + labels_df["id"] + ".jpg"

In [None]:
### exploring imagepaths: series head
features_series.head()

In [None]:
### exploring imagepaths: series info
features_series.info()

In [None]:
### exploring imagepaths: checking validity of random imagepath
print(labels_df["breed"][9000])
print()
print(features_series[9000])
print()
Image(features_series[9000])

#### Preparing Data: Encoding Labels (Targets)

All machine learning algorithms require data in numerical format.  
So both the training images and their labels must be turned into tensors.  
A tensor is an n-dimensional numerical array, like a NumPy ndarray.  
We start with the labels, one-hot encoding each breed into a row of 0s and 1s.
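
To make the label encoding concrete, here is a minimal sketch of one-hot encoding on a toy table with the same `id` / `breed` columns as `labels.csv`. The table contents and the names `toy_labels`, `toy_targets`, and `recovered` are invented for illustration.

```python
from pandas import DataFrame, get_dummies

# toy version of labels.csv: one row per image
toy_labels = DataFrame({
    "id": ["a1", "b2", "c3"],
    "breed": ["beagle", "pug", "beagle"],
})

# one column per breed, 1 for the true breed and 0 elsewhere
toy_targets = get_dummies(data=toy_labels, columns=["breed"], dtype=int)
toy_targets = toy_targets.drop(columns="id")
print(toy_targets)

# the encoding is reversible: the column holding the 1 names the breed
recovered = toy_targets.idxmax(axis="columns").str.replace("breed_", "", regex=False)
print(list(recovered))
```

Each row sums to 1, which matches the target matrix described in the Evaluation section.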

In [None]:
### one hot encoding with pandas
targets_df = get_dummies(data=labels_df, columns=["breed"], dtype=int)
targets_df = targets_df.drop(columns="id")

In [None]:
### exploring targets: dataframe head
targets_df.head()

In [None]:
### exploring targets: dataframe info
targets_df.info()

In [None]:
### exploring targets: nan
targets_df.isna().any(axis="index").any()

#### Reducing Data: Working Subset

In [None]:
### splitting data working / rest
PERCENT_IMAGES = 0.1 #@param {type:"slider", min:0.1, max:1.0, step:0.1}
rest_features, work_features, rest_targets, work_targets = train_test_split(
    features_series,
    targets_df,
    test_size=PERCENT_IMAGES,
    random_state=42)

In [None]:
### exploring working datasets
work_features.shape, work_targets.shape

#### Preparing Data: Training / Validation Split

In [None]:
### splitting data train / valid
train_features, valid_features, train_targets, valid_targets = train_test_split(
    work_features,
    work_targets,
    test_size=0.2,
    random_state=42)

In [None]:
### exploring train and valid datasets
train_features.shape, train_targets.shape, valid_features.shape, valid_targets.shape

#### Preparing Data: `(Feature,Target)` Tensor Tuple

In [None]:
### function returning tensor tuple
def tensorTuple(pImage_path, pIimage_size, pLabel):

  ### reading the image file into string tensor
  iImage = tensorflow.io.read_file(filename=pImage_path)

  ### decoding the jpg bytes (string tensor) into a uint8 tensor with 3 color channels
  iImage = tensorflow.image.decode_jpeg(contents=iImage, channels=3)

  ### normalizing color channels to 0-1 range
  iImage = tensorflow.image.convert_image_dtype(image=iImage, dtype=tensorflow.float32)

  ### resizing image
  iImage = tensorflow.image.resize(images=iImage, size=[pIimage_size, pIimage_size])

  ### returning tensor tuple
  return iImage, tensorflow.constant(value=pLabel)

In [None]:
tensorTuple(pImage_path=train_features.iloc[42], pIimage_size=224, pLabel=train_targets.iloc[42])

#### Preparing Data: Batches

GPUs have a limited amount of memory.  
The entire training dataset may not fit into GPU memory.  
To resolve this, we split our training dataset into batches of ~32 `(image, label)` tensor tuples.  
The neural network sees only one batch at a time during training.
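
The idea of batching can be sketched without any framework. The helper `make_batches` and the file names below are hypothetical; in TensorFlow, the same effect is typically achieved with `tf.data.Dataset.batch`.

```python
def make_batches(items, batch_size=32):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 100 made-up image paths standing in for the training set
paths = [f"image_{i}.jpg" for i in range(100)]

batches = make_batches(paths, batch_size=32)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [32, 32, 32, 4]
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size.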