# Image Classification - Lab

## Introduction

Note: This is a truncated lab. Due to time constraints, I was not able to download the data properly, and my manual attempt did not bring in all the relevant files. I am leaving it as is for now and may come back at a later date to complete the exercise.

Now that you have a working knowledge of CNNs and have practiced implementing the associated techniques in Keras, it's time to put all of those skills together. In this lab, you'll work to complete a [Kaggle competition](https://www.kaggle.com/c/dog-breed-identification) on classifying dog breeds.


## Objectives

In this lab you will: 

- Compare and apply multiple techniques for tuning a model using data augmentation and pretrained models  

## Download and Load the Data

Start by downloading the data locally and loading it into a Pandas DataFrame. Be forewarned that this dataset is fairly large, so it is advisable to close other memory-intensive applications. Note: Part of the purpose of this lab is to use Jupyter Notebook locally.

The data can be found [here](https://www.kaggle.com/c/dog-breed-identification/data).

It's easiest if you download the data into this directory on your local computer. From there, be sure to uncompress the folder and subfolders. If you download the data elsewhere, be sure to modify the file path when importing the file below.

In [None]:
# No code per se, but download and decompress the data
# Brought in manually into a "Data" folder

## Preprocessing

Now that you've downloaded the data, it's time to prepare it for some model building! You'll notice that the structure provided is not the same as the preprocessed folders you've been given to date. Instead, you have one large training folder of images and a CSV file containing the label associated with each image file.

Use this to create a directory substructure for a train-validation-test split as we have done previously. Also recall that you'll want to use one-hot encoding, as you are now presented with a multi-class problem as opposed to simple binary classification.
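
To make the one-hot encoding step concrete, here is a minimal sketch in plain Python. The helper name `one_hot_encode` is hypothetical and only for illustration. In the lab itself, the Keras generators produce these vectors for you when `class_mode='categorical'` is set, using the class folders in sorted order.

```python
# Hypothetical helper for illustration only; Keras generators do this
# internally when class_mode='categorical'.
def one_hot_encode(labels):
    """Map each label to a one-hot vector, with classes in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    encoded = []
    for label in labels:
        vector = [0] * len(classes)   # one slot per class
        vector[index[label]] = 1      # mark the slot of this label
        encoded.append(vector)
    return classes, encoded


breeds = ['boston_bull', 'toy_poodle', 'scottish_deerhound', 'boston_bull']
classes, encoded = one_hot_encode(breeds)
print(classes)
print(encoded)
```

Each image label becomes a vector with a single 1, so a three-breed problem needs a three-unit softmax output layer.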

In [16]:
# import
import pandas as pd
import numpy as np
import os
import shutil
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [5]:
# open the labels.csv
df = pd.read_csv('dog_breeds/labels.csv')
df.head()

|   | id | breed |
|---|---|---|
| 0 | 000bec180eb18c7604dcecc8fe0dba07 | boston_bull |
| 1 | 001513dfcb2ffafc82cccf4d8bbaba97 | dingo |
| 2 | 001cdf01b096e06d78e9e5112d419397 | pekinese |
| 3 | 00214f311d5d2247d5dfe4fe24b2303d | bluetick |
| 4 | 0021f9ceb3235effd7fcde7f7538ed62 | golden_retriever |


In [6]:
# List the first 5 files in the directory
files = os.listdir('dog_breeds/')[:5]
print(files)


['labels.csv', 'sample_submission.csv']



In order to input the data into our standard pipeline, you'll need to organize the image files into a nested folder structure. At the top level will be a folder for the training data, a folder for the validation data, and a folder for the test data. Within these top directory folders, you'll then need to create a folder for each of the categorical classes (in this case, dog breeds). Finally, within these category folders you'll then place each of the associated image files. To save time, do this for just 3 of the dog breeds such as `'boston_bull'`, `'toy_poodle'`, and `'scottish_deerhound'`.

Your nested file structure should look like this:
* train
    * category_1
    * category_2
    * category_3
    ...
* val
    * category_1
    * category_2
    * category_3
    ...
* test 
    * category_1
    * category_2
    * category_3
    ...  
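
As a rough sketch of how this tree could be created, the snippet below builds the split and breed folders with `os.makedirs()`. It assumes a temporary directory as a stand-in for your project folder, so it can be run safely anywhere; in the lab you would point `root` at your own data directory instead.

```python
import os
import tempfile

splits = ['train', 'val', 'test']
breeds = ['boston_bull', 'toy_poodle', 'scottish_deerhound']

# Assumption for this demo: a temporary directory stands in for the project folder
root = tempfile.mkdtemp()

# makedirs creates intermediate folders and, with exist_ok=True, can be re-run safely
for split in splits:
    for breed in breeds:
        os.makedirs(os.path.join(root, split, breed), exist_ok=True)

# Show the class folders inside each split folder
for split in splits:
    print(split, sorted(os.listdir(os.path.join(root, split))))
```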

> **Hint**: To do this, you can use the `os` module, which lets you execute many common bash commands straight from your Python interpreter. For example, here's how you could make a new folder:

```python
import os
os.mkdir('New_Folder_Name')
```
Start by creating top-level folders for the train, validation, and test sets. Then, use your pandas DataFrame to split the example images for each breed of dog into an 80% train set, a 10% validation set, and a 10% test set. Use `os.path.join()` with the information from the DataFrame to construct the relevant file paths. With these, copy the relevant images into the appropriate directories using `shutil.copy()`.

>> **Note**: It is worthwhile to try this exercise on your own, but you can also use the images stored under the `'data_org_subset/'` folder of this repository, in which the Kaggle dataset has already been subset and preprocessed.
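
The 80/10/10 split can be sketched with nothing more than a shuffle and two cut points. The helper `split_ids` and the placeholder ids below are hypothetical. Applying the split to one breed at a time, as described above, keeps the class proportions roughly equal across the three sets.

```python
import random


def split_ids(ids, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the ids, then cut them into train, validation, and test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    train_end = int(train_frac * len(ids))
    val_end = int((train_frac + val_frac) * len(ids))
    return ids[:train_end], ids[train_end:val_end], ids[val_end:]


# Placeholder ids standing in for the image ids of a single breed
example_ids = [f'img_{i:03d}' for i in range(87)]
train, val, test = split_ids(example_ids)
print(len(train), len(val), len(test))
```

Because the cut points are truncated to integers, the three sets will not always be exactly 80%, 10%, and 10% of a breed's images.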

In [11]:
# Your code here; transform the image files and then load them into Keras as tensors 
# (be sure to perform a train-val-test split)
# Select only the 3 specified dog breeds
selected_breeds = ['boston_bull', 'toy_poodle', 'scottish_deerhound']
selected_data = df[df['breed'].isin(selected_breeds)]

# Split the data into train, validation, and test sets (80%, 10%, 10%)
train_data, test_val_data = train_test_split(selected_data, test_size=0.2, random_state=42)
val_data, test_data = train_test_split(test_val_data, test_size=0.5, random_state=42)

# Define the top-level directories
base_dir = 'dog_data'
train_dir = os.path.join(base_dir, 'train')
val_dir = os.path.join(base_dir, 'val')
test_dir = os.path.join(base_dir, 'test')

# Create 'train', 'val', 'test' directories if they don't exist
for directory in [train_dir, val_dir, test_dir]:
    os.makedirs(directory, exist_ok=True)

# Function to copy images into per-breed subfolders if the file exists
# (flow_from_directory expects one subfolder per class)
def copy_images(data, src_directory, dst_directory):
    for index, row in data.iterrows():
        img_filename = row['id'] + '.jpg'
        src = os.path.join(src_directory, img_filename)
        class_dir = os.path.join(dst_directory, row['breed'])
        if os.path.isfile(src):  # Check if the file exists before copying
            os.makedirs(class_dir, exist_ok=True)
            shutil.copy(src, os.path.join(class_dir, img_filename))

# Copy images to train, validation, and test directories
copy_images(train_data, 'dog_breeds/train/', train_dir)
copy_images(val_data, 'dog_breeds/train/', val_dir)
copy_images(test_data, 'dog_breeds/train/', test_dir)

In [12]:
# Directories for train, validation, and test sets
train_dir = 'dog_data/train'
val_dir = 'dog_data/val'
test_dir = 'dog_data/test'

# Image data generators with augmentation for train set
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

# Image data generators for validation and test sets (only rescaling)
val_test_datagen = ImageDataGenerator(rescale=1./255)

# Flow training images in batches using train_datagen generator
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical'
)

# Flow validation images in batches using val_test_datagen generator
val_generator = val_test_datagen.flow_from_directory(
    val_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical'
)

# Flow test images in batches using val_test_datagen generator
test_generator = val_test_datagen.flow_from_directory(
    test_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical'
)

Found 0 images belonging to 1 classes.
Found 0 images belonging to 0 classes.
Found 0 images belonging to 0 classes.


In [13]:
print('Number of unique breeds:', df.breed.nunique())
print(df.breed.value_counts()[:10])

Number of unique breeds: 120
breed
scottish_deerhound      126
maltese_dog             117
afghan_hound            116
entlebucher             115
bernese_mountain_dog    114
shih-tzu                112
great_pyrenees          111
pomeranian              111
basenji                 110
samoyed                 109
Name: count, dtype: int64


In [18]:
old_dir = 'dog_breeds/train/'
new_root_dir = 'data_org_subset/'

# Create new_root_dir if it doesn't exist
os.makedirs(new_root_dir, exist_ok=True)

dir_names = ['train', 'val', 'test']
for d in dir_names:
    new_dir = os.path.join(new_root_dir, d)
    os.makedirs(new_dir, exist_ok=True)
    
for breed in ['boston_bull', 'toy_poodle', 'scottish_deerhound']:
    print('Moving {} pictures.'.format(breed))
    # Create sub_directories
    for d in dir_names:
        new_dir = os.path.join(new_root_dir, d, breed)
        os.makedirs(new_dir, exist_ok=True)
    
    # Subset dataframe into train, validate, and test sets
    # Split is performed here to maintain class distributions
    temp = df[df.breed == breed]
    train, validate, test = np.split(temp.sample(frac=1), [int(.8*len(temp)), int(.9*len(temp))])
    print('Split {} imgs into {} train, {} val, and {} test examples.'.format(len(temp),
                                                                              len(train),
                                                                              len(validate),
                                                                              len(test)))
    for i, temp_split in enumerate([train, validate, test]):
        for row in temp_split.index:
            filename = temp_split['id'][row] + '.jpg'
            origin = os.path.join(old_dir, filename)
            destination = os.path.join(new_root_dir, dir_names[i], breed, filename)
            
            # Check if the file exists before copying
            if os.path.isfile(origin):
                shutil.copy(origin, destination)
            else:
                print(f"File {origin} not found.")

Moving boston_bull pictures.
Split 87 imgs into 69 train, 9 val, and 9 test examples.
File dog_breeds/train/309bf67309851334acebe0003c21b180.jpg not found.
File dog_breeds/train/faa2c24c801b37aca93ac744da51c2c3.jpg not found.
File dog_breeds/train/8200d65aa11022a2d877afa6cb632a7e.jpg not found.
File dog_breeds/train/c825a17e20a29f767bf4b915d036c502.jpg not found.
File dog_breeds/train/146be641443a270dd8116f65d53d0c9d.jpg not found.
File dog_breeds/train/20be3ce1dd7db9194c856726c2f154a3.jpg not found.
File dog_breeds/train/69406cd21bac290f6c6eaa6b0a968fc1.jpg not found.
File dog_breeds/train/d0abd5df53b9a41a506ea217a70a961a.jpg not found.
File dog_breeds/train/a55afed96114f7dcc48ee9ed9731f7da.jpg not found.
File dog_breeds/train/8233f609f7a39f0f834894e761d73aa0.jpg not found.
File dog_breeds/train/99fbeea51938834952e2bd18d3244dff.jpg not found.
File dog_breeds/train/6f0da929ad7f82be1c274dc0d2f56a4c.jpg not found.
File dog_breeds/train/d6857f130c251f01ab973358cbfccce1.jpg not found.
File

In [20]:
import os

directory = 'dog_breeds'  # Replace 'dog_breeds' with the path to your directory

# List all files and directories in the specified directory
contents = os.listdir(directory)
print("Contents of the directory:")
print(contents)

# Read the first few lines of each file in the directory
for file_name in contents:
    file_path = os.path.join(directory, file_name)
    if os.path.isfile(file_path):
        print(f"\nFirst few lines of file '{file_name}':")
        with open(file_path, 'r') as file:
            # Read the first 5 lines of each file
            for _ in range(5):
                line = file.readline()
                if line:
                    print(line.strip())  # Print each line without leading/trailing whitespace
                else:
                    break  # If end of file reached, break the loop


Contents of the directory:
['labels.csv', 'sample_submission.csv']

First few lines of file 'labels.csv':
id,breed
000bec180eb18c7604dcecc8fe0dba07,boston_bull
001513dfcb2ffafc82cccf4d8bbaba97,dingo
001cdf01b096e06d78e9e5112d419397,pekinese
00214f311d5d2247d5dfe4fe24b2303d,bluetick

First few lines of file 'sample_submission.csv':
id,affenpinscher,afghan_hound,african_hunting_dog,airedale,american_staffordshire_terrier,appenzeller,australian_terrier,basenji,basset,beagle,bedlington_terrier,bernese_mountain_dog,black-and-tan_coonhound,blenheim_spaniel,bloodhound,bluetick,border_collie,border_terrier,borzoi,boston_bull,bouvier_des_flandres,boxer,brabancon_griffon,briard,brittany_spaniel,bull_mastiff,cairn,cardigan,chesapeake_bay_retriever,chihuahua,chow,clumber,cocker_spaniel,collie,curly-coated_retriever,dandie_dinmont,dhole,dingo,doberman,english_foxhound,english_setter,english_springer,entlebucher,eskimo_dog,flat-coated_retriever,french_bulldog,german_shepherd,german_short-haired_poin

## Optional: Build a Baseline CNN

This is an optional step. Adapting a pretrained model will produce better results, but it may be interesting to create a CNN from scratch as a baseline. If you wish to, do so here.

In [None]:
# Create a baseline CNN model
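
One possible baseline is sketched below. It is not part of the original lab solution: the layer sizes, the `adam` optimizer, and the function name `build_baseline_cnn` are all assumptions chosen for illustration, with the input shape matching the `target_size=(150, 150)` used by the generators and three output units for the three selected breeds.

```python
from tensorflow.keras import layers, models


def build_baseline_cnn(input_shape=(150, 150, 3), num_classes=3):
    """Small CNN trained from scratch; all layer sizes are illustrative."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        # Softmax over the classes pairs with the one-hot labels
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model


baseline = build_baseline_cnn()
baseline.summary()
```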

## Loading a Pretrained CNN

## Feature Engineering with the Pretrained Model

As you may well have guessed, adapting a pretrained model will likely produce better results than a fresh CNN, given the limited size of the training data. Import a pretrained model such as VGG-19 to use as a convolutional base. Use this to transform the dataset into a rich feature space, and add a few fully connected layers on top of the pretrained layers to build a classification model. (Be sure to leave the pretrained model frozen!)

In [None]:
# Your code here; add fully connected layers on top of the convolutional base
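
A minimal sketch of this step follows, assuming VGG-19 with ImageNet weights as the frozen convolutional base and a single hidden dense layer on top. The function name `build_transfer_model`, the 256-unit layer, and the optimizer are assumptions rather than the lab's reference solution. Note that the first call with `weights='imagenet'` downloads the pretrained weights.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG19


def build_transfer_model(input_shape=(150, 150, 3), num_classes=3,
                         weights='imagenet'):
    """Frozen VGG-19 base with a small dense classifier on top (illustrative)."""
    conv_base = VGG19(weights=weights, include_top=False,
                      input_shape=input_shape)
    conv_base.trainable = False  # freeze the pretrained layers

    inputs = keras.Input(shape=input_shape)
    x = conv_base(inputs, training=False)  # run the base in inference mode
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation='relu')(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

Once the image folders are populated, `model = build_transfer_model()` followed by `model.fit(train_generator, validation_data=val_generator)` with an epoch count of your choosing would train only the two dense layers.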

## Visualize History

Now fit the model and visualize the training and validation accuracy/loss functions over successive epochs.

In [None]:
# Your code here; visualize the training / validation history associated with fitting the model
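
The plotting step could look like the sketch below. It assumes the model was compiled with `metrics=['accuracy']`, so that the `history.history` dictionary returned by `model.fit()` contains the keys `accuracy`, `val_accuracy`, `loss`, and `val_loss`. The function name `plot_history` is hypothetical.

```python
import matplotlib.pyplot as plt


def plot_history(history_dict):
    """Plot training and validation accuracy and loss side by side."""
    epochs = range(1, len(history_dict['loss']) + 1)
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 4))

    ax_acc.plot(epochs, history_dict['accuracy'], label='Training accuracy')
    ax_acc.plot(epochs, history_dict['val_accuracy'], label='Validation accuracy')
    ax_acc.set_xlabel('Epoch')
    ax_acc.set_ylabel('Accuracy')
    ax_acc.legend()

    ax_loss.plot(epochs, history_dict['loss'], label='Training loss')
    ax_loss.plot(epochs, history_dict['val_loss'], label='Validation loss')
    ax_loss.set_xlabel('Epoch')
    ax_loss.set_ylabel('Loss')
    ax_loss.legend()

    return fig
```

After training with `history = model.fit(...)`, call `plot_history(history.history)`. A widening gap between the training and validation curves is a common sign of overfitting.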

In [None]:
# Save model

## Final Model Evaluation

Now that you've trained and validated the model, perform a final evaluation of the model on the test set.

In [None]:
# Your code here
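
A small wrapper around `model.evaluate()` is one way to report the final numbers. The sketch below assumes the model was compiled with exactly one metric (accuracy), so that `evaluate()` returns the loss followed by the accuracy. The function name `evaluate_on_test` is hypothetical.

```python
def evaluate_on_test(model, test_generator):
    """Return test loss and accuracy as a dictionary.

    Assumes the model was compiled with a single metric (accuracy),
    so evaluate() returns [loss, accuracy].
    """
    loss, accuracy = model.evaluate(test_generator, verbose=0)
    return {'loss': float(loss), 'accuracy': float(accuracy)}
```

With a trained model this would be called as `evaluate_on_test(model, test_generator)`. Keep the test set out of all tuning decisions so that this number remains an honest estimate of performance on unseen images.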

## Summary

Congratulations! In this lab, you brought all of your prior deep learning skills together, from preprocessing (including one-hot encoding) to adapting a pretrained model. There are always ongoing advancements in CNN architectures and best practices, but you have a solid foundation and understanding at this point.