### Deyanira Taupier
### AD 470 - May 14, 2020
### Assignment: Presentation on a kernel 
###### Note: The following is a mixture of my notes on the kernel, explanations for the presentation, and the author's code.

In [None]:
import warnings
warnings.filterwarnings("ignore")

## The Playground Prediction Competition
#### Description
    Who's a good dog? Who likes ear scratches? Well, it seems those fancy deep neural networks don't have all the answers. However, maybe they can answer that ubiquitous question we all ask when meeting a four-legged stranger: what kind of good pup is that?
    In this playground competition, you are provided a strictly canine subset of ImageNet in order to practice fine-grained image categorization. How well you can tell your Norfolk Terriers from your Norwich Terriers? With 120 breeds of dogs and a limited number training images per class, you might find the problem more, err, ruff than you anticipated.
#### Acknowledgments
    We extend our gratitude to the creators of the Stanford Dogs Dataset for making this competition possible: Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li.


# Dog Breed Identification
## Determine the breed of a dog in an image
Last Updated: 3 years ago

###  About this Competition
You are provided with a training set and a test set of images of dogs. Each image has a filename that is its unique id. The dataset comprises 120 breeds of dogs. The goal of the competition is to create a classifier capable of determining a dog's breed from a photo. 

#### The list of breeds is as follows:
    affenpinscher
    afghan_hound
    african_hunting_dog
    airedale
    american_staffordshire_terrier
    appenzeller
    australian_terrier
    basenji
    basset
    beagle
    bedlington_terrier
    bernese_mountain_dog
    black-and-tan_coonhound
    blenheim_spaniel
    bloodhound
    bluetick
    border_collie
    border_terrier
    borzoi
    boston_bull
    bouvier_des_flandres
    boxer
    brabancon_griffon
    briard
    brittany_spaniel
    bull_mastiff
    cairn
    cardigan
    chesapeake_bay_retriever
    chihuahua
    chow
    clumber
    cocker_spaniel
    collie
    curly-coated_retriever
    dandie_dinmont
    dhole
    dingo
    doberman
    english_foxhound
    english_setter
    english_springer
    entlebucher
    eskimo_dog
    flat-coated_retriever
    french_bulldog
    german_shepherd
    german_short-haired_pointer
    giant_schnauzer
    golden_retriever
    gordon_setter
    great_dane
    great_pyrenees
    greater_swiss_mountain_dog
    groenendael
    ibizan_hound
    irish_setter
    irish_terrier
    irish_water_spaniel
    irish_wolfhound
    italian_greyhound
    japanese_spaniel
    keeshond
    kelpie
    kerry_blue_terrier
    komondor
    kuvasz
    labrador_retriever
    lakeland_terrier
    leonberg
    lhasa
    malamute
    malinois
    maltese_dog
    mexican_hairless
    miniature_pinscher
    miniature_poodle
    miniature_schnauzer
    newfoundland
    norfolk_terrier
    norwegian_elkhound
    norwich_terrier
    old_english_sheepdog
    otterhound
    papillon
    pekinese
    pembroke
    pomeranian
    pug
    redbone
    rhodesian_ridgeback
    rottweiler
    saint_bernard
    saluki
    samoyed
    schipperke
    scotch_terrier
    scottish_deerhound
    sealyham_terrier
    shetland_sheepdog
    shih-tzu
    siberian_husky
    silky_terrier
    soft-coated_wheaten_terrier
    staffordshire_bullterrier
    standard_poodle
    standard_schnauzer
    sussex_spaniel
    tibetan_mastiff
    tibetan_terrier
    toy_poodle
    toy_terrier
    vizsla
    walker_hound
    weimaraner
    welsh_springer_spaniel
    west_highland_white_terrier
    whippet
    wire-haired_fox_terrier
    yorkshire_terrier

#### File descriptions
    train.zip - the training set, you are provided the breed for these dogs
    test.zip - the test set, you must predict the probability of each breed for each image
    sample_submission.csv - a sample submission file in the correct format
    labels.csv - the breeds for the images in the train set


## Kernel title: Pupper Keras CNN
The author starts by listing the folders in the `input` directory to see what we are working with.

Inside the `dog-breed-identification` folder we have `labels.csv` and `sample_submission.csv`.

The author then uses `pandas` and its `.read_csv()` function to read the labels in as the training data and the sample submission as the test data.

The head of each dataframe is displayed so we can get familiar with the contents.

Afterwards, the author cleans the data and builds a model to train.

Finally, the predictions are made! The training history is plotted, the predictions are displayed, and the results are saved: the predictions to a `.csv` file and the model to an `.h5` file.

This notebook has some good scripts in it for a simple approach to the stated problem of a limited number of training images per class [as noted in the competition description].

Because I could follow the flow of the data, this turned out to be fairly similar to what we have been doing in class.

I also believe this is a good example of an attempt to solve a modern problem: analyzing image data of non-human subjects.

In [None]:
# import packages

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

import tensorflow.keras as keras
# import everything from tensorflow.keras so the layers, models and metrics all come from the same Keras
from tensorflow.keras import regularizers
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.metrics import categorical_accuracy, top_k_categorical_accuracy, categorical_crossentropy
import os
from tqdm import tqdm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import cv2
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Importing the Data

In [None]:
# directory of input data
print(os.listdir("../input"))

In [None]:
# grab the train and test data into dataframes
df_train = pd.read_csv('../input/dog-breed-identification/labels.csv') 
df_test = pd.read_csv('../input/dog-breed-identification/sample_submission.csv') 

In [None]:
print('labels : train dataframe')
df_train.head()

In [None]:
print('samples : test dataframe')
df_test.head()

In [None]:
# the breed is what we are after for each dog sample
labels = df_train['breed']
# but we need to get dummy variables for the breed columns
one_hot = pd.get_dummies(labels, sparse = True)
print('get dummies for breed data')
one_hot

In [None]:
print('one hot encode the labels into numpy array')
one_hot_labels = np.asarray(one_hot)
one_hot_labels
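
To make the one hot encoding step easier to explain, here is a minimal sketch in plain Python of what `pd.get_dummies()` produces for the breed column. The function name `one_hot_encode` and the toy breed list are my own for illustration, not the author's, and I am assuming the columns come out in sorted order as they do for plain string labels.

```python
def one_hot_encode(labels):
    """Return (sorted class names, one row of 0/1 flags per label)."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1  # exactly one flag is set per sample
        rows.append(row)
    return classes, rows


toy_breeds = ["pug", "beagle", "pug", "boxer"]  # hypothetical labels
classes, encoded = one_hot_encode(toy_breeds)
print(classes)
for breed, row in zip(toy_breeds, encoded):
    print(breed, row)
```

Each row has a single 1 in the column of its breed, which is the shape of target the softmax output layer is trained against.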

In [None]:
# the images will be resized to 64x64
im_resize = 64
print('sample the dog images after resize efforts')
#visualize a dogger
dogger1 = df_train['id'][0]
dogger2 = df_train['id'][1]
dogger3 = df_train['id'][2]
dogger4 = df_train['id'][3]
pupper1 = cv2.imread('../input/dog-breed-identification/train/{}.jpg'.format(dogger1))
pupper2 = cv2.imread('../input/dog-breed-identification/train/{}.jpg'.format(dogger2))
pupper3 = cv2.imread('../input/dog-breed-identification/train/{}.jpg'.format(dogger3), cv2.IMREAD_GRAYSCALE)
pupper4 = cv2.imread('../input/dog-breed-identification/train/{}.jpg'.format(dogger4), cv2.IMREAD_GRAYSCALE)
pupper4 = cv2.resize(pupper4, (im_resize, im_resize))
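f, axarr = plt.subplots(2,2)
# cv2.imread() returns BGR channel order; convert to RGB so matplotlib shows true colors
axarr[0,0].imshow(cv2.cvtColor(pupper1, cv2.COLOR_BGR2RGB))
axarr[0,1].imshow(cv2.cvtColor(pupper2, cv2.COLOR_BGR2RGB))
axarr[1,0].imshow(pupper3,cmap="gray", vmin=0, vmax=255)
axarr[1,1].imshow(pupper4,cmap="gray", vmin=0, vmax=255)
# hide the ticks on every subplot (plt.xticks/yticks only affect the last one)
for ax in axarr.ravel():
    ax.set_xticks([])
    ax.set_yticks([])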

In [None]:
im_size = pupper1.shape
print('Image original shape', im_size)
print('Image resized shape', cv2.resize(pupper1, (im_resize, im_resize)).shape)


In [None]:
x_train = []
y_train = []
x_test = []
print('arrays to hold x/y/train/test data initialized')

## Cleaning the Data

In [None]:
i = 0 
for f, breed in tqdm(df_train.values):
    img = cv2.imread('../input/dog-breed-identification/train/{}.jpg'.format(f))
    img_resized = cv2.resize(img, (im_resize, im_resize))
    x_train.append(img_resized)
    label = one_hot_labels[i]
    y_train.append(label)
    i += 1
print('\nImages split into training sets, resized and one hot encoding labels appended.')

In [None]:
# no need for the big dataframe anymore
del df_train
print('df_train used; cleaned up.')

In [None]:
for f in tqdm(df_test['id'].values):
    img = cv2.imread('../input/dog-breed-identification/test/{}.jpg'.format(f))
    img_resized = cv2.resize(img, (im_resize, im_resize))
    x_test.append(img_resized)
print('\nTest data read and resized into x_test[].')

In [None]:
num_class = 120 #static ftw
print('Number of classes: ', num_class)

In [None]:
X_train, X_valid, Y_train, Y_valid = train_test_split(x_train, y_train, shuffle=True,  test_size=0.1)
print('Split into new values with train_test_split()')
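
As a rough illustration of what a shuffled 90/10 split does, here is a dependency-free sketch. `holdout_split` is a hypothetical helper of mine, not scikit-learn's implementation, and the exact rounding of the validation size in `train_test_split()` may differ from the `round()` used here.

```python
import random


def holdout_split(samples, labels, test_size=0.1, seed=0):
    """Shuffle the indices, then hold out a fraction of the pairs for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_valid = max(1, round(len(samples) * test_size))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    x_tr = [samples[i] for i in train_idx]
    x_va = [samples[i] for i in valid_idx]
    y_tr = [labels[i] for i in train_idx]
    y_va = [labels[i] for i in valid_idx]
    return x_tr, x_va, y_tr, y_va


toy_samples = list(range(20))             # stand-ins for images
toy_labels = [s % 3 for s in toy_samples]  # stand-ins for breeds
x_tr, x_va, y_tr, y_va = holdout_split(toy_samples, toy_labels)
print(len(x_tr), "training samples,", len(x_va), "validation samples")
```

The important property is that each sample keeps its own label after shuffling and that no sample lands in both sets.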

In [None]:
del x_train, y_train
print('Initial values used; cleaned up.')

In [None]:
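# top-3 accuracy: a prediction counts as correct when the true breed is among
# the three highest-probability classes (defined by the author, but not passed to model.compile() below)
def top_3_accuracy(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=3)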
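
To show what a top-k metric measures, here is a small plain-Python sketch. It is my own illustration of the idea, not the Keras implementation, and the toy predictions are made up.

```python
def top_k_accuracy(y_true, y_pred, k=3):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for true_row, pred_row in zip(y_true, y_pred):
        true_class = true_row.index(1)
        ranked = sorted(range(len(pred_row)), key=lambda c: pred_row[c], reverse=True)
        if true_class in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# hypothetical one hot labels and predicted probabilities for 4 classes
y_true = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
y_pred = [
    [0.20, 0.50, 0.25, 0.05],  # true class is ranked third: top-3 hit, top-1 miss
    [0.40, 0.30, 0.20, 0.10],  # true class is ranked last: miss either way
    [0.10, 0.60, 0.20, 0.10],  # true class is ranked first: hit either way
]
print("top-1:", top_k_accuracy(y_true, y_pred, k=1))
print("top-3:", top_k_accuracy(y_true, y_pred, k=3))
```

With 120 similar-looking breeds, top-3 accuracy is a more forgiving measure than plain accuracy, because a close second or third guess still counts.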

## Modeling the Data

In [None]:
print('ImageDataGenerator being used to fit X_train.')

datagen = ImageDataGenerator(width_shift_range=0.2,
                            height_shift_range=0.2,
                            zoom_range=0.2,
                            rotation_range=30,
                            vertical_flip=False,
                            horizontal_flip=True)


datagen.fit(X_train)
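
To picture what two of these augmentations do, here is a sketch on a tiny "image" stored as nested lists. The helpers `horizontal_flip` and `shift_right` are hypothetical names of mine. The sketch fills the emptied pixels with a constant, which is a simplification: `ImageDataGenerator` defaults to `fill_mode='nearest'` and picks its shifts, zooms and rotations at random for every batch.

```python
def horizontal_flip(image):
    """Mirror every row left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Move the content right by `pixels` columns, padding the left edge with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[:width - pixels] for row in image]


toy_image = [[1, 2, 3],
             [4, 5, 6]]  # hypothetical 2x3 single-channel image
print(horizontal_flip(toy_image))
print(shift_right(toy_image, 1))
```

A flipped or slightly shifted dog is still the same breed, so each transformed copy acts as an extra training example for free.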

In [None]:
print('ResNet50 will load the `resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5` file included as pretrained model.\n')

base_model = ResNet50(weights="../input/keras-pretrained-models/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5",include_top=False, input_shape=(im_resize, im_resize, 3))

x = base_model.output
x = Flatten()(x) # Flattens the input. Does not affect the batch size.
x = Dense(512, activation='relu')(x)
x = Dropout(0.25)(x)
# Dropout consists in randomly setting
#     a fraction `rate` of input units to 0 at each update during training time,
#     which helps prevent overfitting.

logits = Dense(num_class, activation='softmax')(x)

model = Model(base_model.input, logits)

model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=[categorical_crossentropy, categorical_accuracy])

print('\nThis base model has Flatten(), Dense(relu), Dropout() applied before a Dense(softmax)')
print('\nFinally coming to a Model(), where it is compiled with adam optimizer, loss of categorical_crossentropy, looking for accuracy as well.')
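
For the presentation, the last layer and the loss can be explained with a short plain-Python sketch of softmax and categorical cross-entropy. This is my own illustration under the usual textbook definitions, not the Keras code, and the input scores are made up. Note that the variable the author calls `logits` already holds softmax probabilities.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs, eps=1e-7):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


probs = softmax([2.0, 1.0, 0.1])        # hypothetical scores for 3 classes
loss = cross_entropy([1, 0, 0], probs)  # the true class is the first one
print("probabilities:", [round(p, 3) for p in probs])
print("loss:", round(loss, 3))
```

The loss is small when the model puts most of its probability on the true breed, and a uniform guess over `n` classes gives a loss of `log(n)`.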

In [None]:
print('Graph generating helper function defined...')
def gen_graph(history, title):
    plt.plot(history.history['categorical_accuracy'])
    plt.plot(history.history['val_categorical_accuracy'])
    plt.title('Accuracy ' + title)
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
    plt.plot(history.history['categorical_crossentropy'])
    plt.plot(history.history['val_categorical_crossentropy'])
    plt.title('Loss ' + title)
    plt.ylabel('MLogLoss')
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

In [None]:
print('.flow() is used to generate augmented batches from the X and Y arrays.')

train_generator = datagen.flow(np.array(X_train), np.array(Y_train), 
                               batch_size=32) 
# Takes data & label arrays, generates batches of augmented data.

In [None]:
print('.fit_generator() on the model using the train_generator just made and get the history.\n')

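batch_size = 512
# note: train_generator yields batches of 32, so with steps computed from 512
# each epoch covers only a fraction of the training images

history_rmsprop = model.fit_generator(
    train_generator,
    epochs=30, steps_per_epoch=len(X_train) // batch_size,  # steps must be a whole number
    # validate on the held-out split, not on the data the model was trained on
    validation_data=(np.array(X_valid), np.array(Y_valid)))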
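
The relationship between the number of samples, the batch size and the steps per epoch can be sketched with a little arithmetic. The sample count of 9000 is a made-up round number for illustration, and `steps_per_epoch` here is my own helper, not a Keras function.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_samples = 9000  # hypothetical training set size
full_pass = steps_per_epoch(n_samples, 32)     # generator batch size
short_pass = steps_per_epoch(n_samples, 512)   # batch size used for the step count
seen = short_pass * 32                         # samples actually drawn per epoch
print(full_pass, "steps cover the whole set with batches of 32")
print(short_pass, "steps of 32 only draw", seen, "samples per epoch")
```

So when the generator batch size and the batch size used for the step count disagree, an "epoch" is shorter than a full pass over the data, which is worth keeping in mind when reading the plots.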


In [None]:
print('Use the model to run .predict() on x_test.')

preds = model.predict(np.array(x_test), verbose=1)

## Visualizing the Data

In [None]:
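print('Accuracy and loss plotted for the training history.\n')

# plot the accuracy and loss curves
# (the model was compiled with the Adam optimizer, despite the "rmsprop" variable name)
gen_graph(history_rmsprop,
          "ResNet50 Adam")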

In [None]:
print('Dataframe is created with the prediction data and displayed.')

sub = pd.DataFrame(preds)
col_names = one_hot.columns.values
sub.columns = col_names
sub.insert(0, 'id', df_test['id'])
sub
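
The layout of the submission file can be sketched without pandas: an `id` column followed by one probability column per breed. `build_submission` is a hypothetical helper of mine, the ids and numbers are made up, and the four-decimal formatting is just a choice for readability.

```python
import csv
import io


def build_submission(ids, breed_names, predictions):
    """Return CSV text with an id column plus one probability column per breed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id"] + list(breed_names))
    for image_id, probs in zip(ids, predictions):
        writer.writerow([image_id] + [f"{p:.4f}" for p in probs])
    return buffer.getvalue()


# hypothetical ids and predictions for two breeds
text = build_submission(["a1", "b2"], ["beagle", "pug"], [[0.9, 0.1], [0.25, 0.75]])
print(text)
```

In the real notebook the same table has 120 breed columns, taken from the one hot column names so that they line up with the model's output order.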


In [None]:
print('The output dataframe created is saved to `output_rmsprop_aug.csv` and model saved as `rmsprop_v2_augmentation.h5`')

sub.to_csv('output_rmsprop_aug.csv', index=False)

model.save('rmsprop_v2_augmentation.h5')

When the validation loss (`val_loss`) starts decreasing and the validation accuracy (`val_acc`) starts increasing, that is a good sign: it means the model is learning and working as intended.
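
One simple way to read the validation curve is to find the epoch with the lowest validation loss and count how long ago that was. The sketch below uses made-up numbers shaped like the `history.history` dictionary from Keras, and both helper names are mine.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


def epochs_since_best(val_losses):
    """How many epochs have passed without a new validation-loss minimum."""
    return len(val_losses) - 1 - best_epoch(val_losses)


# hypothetical numbers, not results from this notebook
toy_history = {
    "categorical_crossentropy": [4.6, 4.0, 3.5, 3.0, 2.6],
    "val_categorical_crossentropy": [4.5, 4.1, 3.9, 4.0, 4.2],
}
val = toy_history["val_categorical_crossentropy"]
print("best epoch:", best_epoch(val))
print("epochs since best:", epochs_since_best(val))
```

If the training loss keeps falling while the validation loss has not improved for several epochs, as in these toy numbers, the model is probably starting to overfit.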