# 1.0 Standard Bank Tech Impact Challenge: Animal classification

## Defining the question

### Specifying the Question

The objective of the challenge is to create a machine learning model to accurately predict the likelihood that an image contains a zebra, as opposed to an elephant. 

Challenge: https://zindi.africa/competitions/sbtic-animal-classification/data


### Metric for success
- Log loss
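
To make the metric concrete, here is a minimal sketch of binary log loss in plain Python. It assumes labels are coded 1 for zebra and 0 for elephant; the label coding, the function name `binary_log_loss`, and the sample values are illustrative and not taken from the challenge.

```python
import math

def binary_log_loss(y_true, p_zebra, eps=1e-15):
    """Mean binary log loss; y_true is 1 for zebra, 0 for elephant (assumed coding)."""
    total = 0.0
    for y, p in zip(y_true, p_zebra):
        # Clip the probability so that log(0) is never evaluated
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and predicted zebra probabilities
y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.5, 0.5, 0.5, 0.5]

print(binary_log_loss(y_true, confident))
print(binary_log_loss(y_true, uncertain))  # equals ln(2)
```

Lower is better: confident correct predictions score close to 0, while always predicting 0.5 scores ln(2).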

### Understanding the context

The full dataset contains more than 18,000 images of zebras and elephants, sampled from the Snapshot Serengeti collection of more than 6 million animals. The data was retrieved from the Data Repository for the University of Minnesota, https://doi.org/10.13020/D6T11K, under a Creative Commons license, from a study titled: Camera Trap Images used in "Identifying Animal Species in Camera Trap Images using Deep Learning and Citizen Science".

### Recording the experimental design

The CRISP-DM methodology will be applied. The following steps will be undertaken to create the classifier.

- Business understanding - understanding the background
- Data understanding 
- Exploratory data analysis
- Feature engineering
- Data modelling
- Model interpretation

### Data relevance


## 2.0 Libraries Importation

In [None]:
#Data Manipulation Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re #regular expressions
#Progress bar
from tqdm import tqdm
from datetime import datetime
#Read Images
import os
from skimage import io
from PIL import Image
import cv2 # OpenCV raised an error when converting an image to an array; using Pillow eliminated the error.

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#Image copy
from shutil import copyfile
import random # use random.seed() and random.random(); names imported directly would be shadowed by the later "import random"


#Model Pre-processing
from sklearn.model_selection import train_test_split # needed for the train/validation split in section 4.0

#Modelling
import tensorflow as tf
import sys
from matplotlib import pyplot
from keras.models import Sequential
from keras.utils import to_categorical
from keras.applications.vgg16 import VGG16
from keras.layers import Conv2D,MaxPooling2D,Dense,Flatten,Dropout
from keras.models import Model
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import  r2_score,roc_auc_score,f1_score,recall_score,precision_score,classification_report, confusion_matrix,log_loss
import random

In [None]:
# Increase rows and columns visible on the notebook
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 50)
pd.set_option('max_colwidth', 100)

# import required libraries
import warnings
warnings.filterwarnings("ignore")

### Explore Images in Directories

Check main directory

In [None]:
image_path = '../input/sbtic-animal-classification/SBTIC/'
os.listdir(image_path)

How many images in each of the directories

In [None]:
# How many images in directories
categories = ['test',  'train_zebras', 'train_elephants']

for category in categories:
    full_image_path = image_path +  category + "/" +category + "/"
    
    print(category,len(os.listdir(full_image_path)))
    

Create a dataframe with images

In [None]:
image_categories = []
file_names =[]
image_names = []
# Loop across the directories having images.
for category in categories:        
    full_image_path = image_path +  category + "/" +category + "/"
    image_file_names = [os.path.join(full_image_path, f) for f in os.listdir(full_image_path)] # Retrieve the filenames from the all the  directories. OS package used.
    excempt = full_image_path + '.DS_Store' # .DS_Store is a hidden file created by macOS; it is not a jpg and would break image loading, so remove it
    if excempt in image_file_names:
        image_file_names.remove(excempt)
    for file in image_file_names:         # Read the labels and load them into an array
        file_name = os.path.basename(file) ## Eliminate path from file name
        image_categories.append(category)
        file_names.append(file)
        image_names.append(file_name)

In [None]:
print(len(file_names))
print(len(image_names))
print(len(image_categories))

In [None]:
# df = pd.DataFrame(file_names,image_names)
# df_cat = pd.DataFrame(image_categories)

df = pd.DataFrame({'file_names': file_names, 'image_names': image_names,'image_categories':image_categories}, columns=['file_names', 'image_names','image_categories'])

# result = pd.merge(df,df_cat, how='outer')

In [None]:
#Delete directory if it exists.
import shutil

def ignore_absent_file(func, path, exc_inf):
    except_instance = exc_inf[1]
    if isinstance(except_instance, FileNotFoundError):
        return
    raise except_instance

shutil.rmtree('/kaggle/working/SBTIC/test', onerror=ignore_absent_file)

In [None]:
# create directories
dataset_home = 'SBTIC/'
subdirs = ['train/', 'validation/']
for subdir in subdirs:
    # create label subdirectories
    labeldirs = ['train_elephants/', 'train_zebras/']
    for labldir in labeldirs:
        newdir = dataset_home + subdir + labldir
        os.makedirs(newdir, exist_ok=True)

In [None]:
output_path = '/kaggle/working'
os.listdir(output_path)

In [None]:
# Copy files from input to output train and validation directories and their corresponding class directories
random.seed(1) # seed the generator so that the split is reproducible
val_ratio = 0.25
for index, row in df.iterrows():
    if row['image_categories'] != 'test':
        src = row['file_names']
        if random.random() < val_ratio:
            dst = '/kaggle/working/SBTIC/validation'+ '/' + row['image_categories'] + '/' +row['image_names']
        else:
            dst = '/kaggle/working/SBTIC/train'+ '/' + row['image_categories'] + '/' +row['image_names']
        copyfile(src, dst)
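
The random train/validation assignment above can be sketched in isolation. This is a simplified illustration rather than the notebook's own code: it assumes a private `random.Random` generator so the split does not depend on global state, and the helper name `assign_splits` and the file names are hypothetical.

```python
import random

def assign_splits(file_names, val_ratio=0.25, seed=1):
    """Return {file_name: 'train' | 'validation'} using a seeded generator."""
    rng = random.Random(seed)  # private generator, independent of global state
    return {
        name: ('validation' if rng.random() < val_ratio else 'train')
        for name in file_names
    }

# Hypothetical file names standing in for df['image_names']
names = ['img_%04d.jpeg' % i for i in range(1000)]
first = assign_splits(names)
second = assign_splits(names)

print(first == second)  # True: the same seed gives the same assignment
val_share = sum(v == 'validation' for v in first.values()) / len(names)
print(round(val_share, 3))  # close to val_ratio, but not exactly equal
```

Because each image is assigned independently, the validation share only approximates `val_ratio`; an exact split would need a shuffle followed by a slice.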

In [None]:
# How many images in directories
categories = ['train_zebras', 'train_elephants']
output_path = dst = '/kaggle/working/SBTIC/'
for category in categories:
    full_image_path = output_path +  'validation' + "/" +category + "/"
    print(category,len(os.listdir(full_image_path)))
for category in categories:
    full_image_path = output_path +  'train' + "/" +category + "/"
    print(category,len(os.listdir(full_image_path)))

### c) Load Training Images

In [None]:
#Function to upload and if need be resize the training images
def upload_train_images(image_path, categories ,height, width):
    images = []
    labels = []
    file_names =[]
    # Loop across the directories having images.
    for category in categories:
        
        # Append the  category directory into the main path
        full_image_path = image_path +  category + "/" +category + "/"
        # Retrieve the file names in this category directory. OS package used.
        image_file_names = [os.path.join(full_image_path, f) for f in os.listdir(full_image_path)]
        
        # .DS_Store is a hidden file created by macOS; it is not a jpg and causes a failure, so remove it
        excempt = full_image_path + '.DS_Store'
        if excempt in image_file_names:
            image_file_names.remove(excempt)
            
        # Read the images and load them into an array
        for file in image_file_names[0:100]:         
            image=io.imread(file) #io package from SKimage package
            # Resize?
            #image_from_array = Image.fromarray(image, 'RGB')
            ##Resize image
            #size_image = image_from_array.resize((height, width)) # no resize
            #Append image into list
            images.append(np.array(image))
            # Label for each image as per directory
            labels.append(category)
            file_names.append(file)
        
    return images, labels, file_names

## Invoke the function

#Image resize parameters if needed. Not resizing in this case so code below just a boilerplate incase resizing needed
height = 256
width = 256

categories = ['train_zebras', 'train_elephants'] 
train_images, train_categories, train_file_names  = upload_train_images('/kaggle/input/sbtic-animal-classification/SBTIC/',categories,height,width)
#Size and dimension of output image and labels
train_images = np.array(train_images)
train_categories = np.array(train_categories)
train_file_names = np.array(train_file_names)

#Check properties of uploaded images
print("Shape of training images is " + str(train_images.shape))
print("Shape of training labels is " + str(train_categories.shape))
print("Shape of training file names is " + str(train_file_names.shape))

In [None]:
## Eliminate path from file name
# use os.path.basename to extract the name of each image
image_names = []
for i in train_file_names:
    fname = os.path.basename(i)
    image_names.append(fname)

#View images
image_names = np.array(image_names)
print(len(image_names))
image_names[0:5]


### c) Display sample training images

a) Individual images

In [None]:
import random
def show_train_images(images, train_categories, train_file_names,image_names,images_count):
    for i in range(images_count):
        
        index = int(random.random() * len(images))
        plt.axis('off')
        plt.imshow(images[index])
        plt.show()
        
        print("Size of this image is " + str(images[index].shape))
        print("Class of the image is " + str(train_categories[index]))
        print("Image path is " + str(train_file_names[index]))        
        print("Image name is " + str(image_names[index]))   

#Execute the function
print("Train images, sizes and class labels")
show_train_images(train_images, train_categories,train_file_names,image_names, 10)

b) Display batch images

In [None]:
title = train_categories[1],image_names[1]
title

In [None]:
# a function to show the image batch
def show_batch_train_images(images,train_categories,image_names):
    plt.figure(figsize=(20,15))
    for n in range(20):
        ax = plt.subplot(5,5,n+1)
        index = int(random.random() * len(images))
        plt.imshow(images[index])
        title = train_categories[index],image_names[index]
        plt.title(title)
#         plt.title(CLASS_NAMES[labels[n]==1][0].title())
#         print("Size of this image is " + str(images[index].shape))
        plt.axis('off')

show_batch_train_images(train_images,train_categories,image_names)
plt.show()

### d) Categories of Training Images

In [None]:
#Categories of Images
pd.Series(train_categories).value_counts().reset_index().values.tolist()

Visualize the images distribution per label

In [None]:
# Plot chart
sns.countplot(x='image_categories', data=df) # keyword arguments work across seaborn versions
# plt.show()
# df.image_categories

The chart above shows that the two training classes are balanced.

## Modelling

### Baseline CNN Model. 

A CNN with three convolutional layers, each using 3 by 3 filters and the ReLU activation function.
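
The size of this network can be worked out by hand. The sketch below takes its layer settings from the `define_model` cell that follows (330 by 330 by 3 input, 32/64/128 filters with `padding='same'`, 2 by 2 max pooling, then `Dense(128)` and `Dense(2)`); the helper name `conv_params` is illustrative.

```python
def conv_params(kernel, in_channels, out_channels):
    # kernel*kernel*in_channels weights per filter, plus one bias per filter
    return (kernel * kernel * in_channels + 1) * out_channels

size, channels = 330, 3
total_params = 0
for filters in (32, 64, 128):
    total_params += conv_params(3, channels, filters)
    channels = filters
    # A 'same' convolution keeps height and width; 2x2 max pooling halves them (rounding down)
    size = size // 2
    print(filters, 'filters ->', (size, size, channels))

flat = size * size * channels
total_params += (flat + 1) * 128  # Dense(128): weights plus biases
total_params += (128 + 1) * 2     # Dense(2): weights plus biases
print('flattened length:', flat)
print('trainable parameters:', total_params)
```

Almost all of the parameters sit in the first dense layer, because the flattened feature map is still large after only three pooling steps.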

In [None]:
# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(330, 330, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    # compile model
    
    opt = SGD(learning_rate=0.001, momentum=0.9) # "lr" is a deprecated alias of learning_rate
 
    #Compile the model
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
 
# plot diagnostic learning curves
def summarize_diagnostics(history):
    # plot loss
    plt.subplot(211)
    plt.title('Cross Entropy Loss')
    plt.plot(history.history['loss'], color='blue', label='train')
    plt.plot(history.history['val_loss'], color='orange', label='test')
    # plot accuracy
    plt.subplot(212)
    plt.title('Classification Accuracy')
    plt.plot(history.history['accuracy'], color='blue', label='train')
    plt.plot(history.history['val_accuracy'], color='orange', label='test')
    # save plot to file
    filename = sys.argv[0].split('/')[-1]
    plt.savefig(filename + '_plot.png')
    plt.close()
    
# run the test harness for evaluating a model
def run_test_harness():
    # define model
    print("Define Model")
    model = define_model()
    # create data generator
    print("Creating Image Data Generator")
    datagen = ImageDataGenerator(rescale=1.0/255.0)
    
    # prepare iterators
    print("Preparing iterators")
    train_it = datagen.flow_from_directory('/kaggle/working/SBTIC/train/', class_mode='binary', batch_size=64, target_size=(330, 330))
    test_it = datagen.flow_from_directory('/kaggle/working/SBTIC/validation/', class_mode='binary', batch_size=64, target_size=(330, 330))
    
    # fit model
    print("Fitting the model")
    history = model.fit_generator(train_it, steps_per_epoch=len(train_it),validation_data=test_it, validation_steps=len(test_it), epochs=5, verbose=1) # 10 epochs were used before
    
    print("Testing the model")
    # evaluate model
    _, acc = model.evaluate_generator(test_it, steps=len(test_it), verbose=1)
    print('> %.3f' % (acc * 100.0))
    # learning curves
    summarize_diagnostics(history)
    return(history)


In [None]:
#Execute the Model
model_history = run_test_harness()

Baseline output

In [None]:
# plot Loss and classification accuracy
plt.subplot(211)
plt.title('Cross Entropy Loss')
plt.plot(model_history.history['loss'], color='blue', label='train')
plt.plot(model_history.history['val_loss'], color='orange', label='test')
# plot accuracy
plt.subplot(212)
plt.title('Classification Accuracy')
plt.plot(model_history.history['accuracy'], color='blue', label='train')
plt.plot(model_history.history['val_accuracy'], color='orange', label='test')
plt.show()

#### Image Augmentation
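
As a rough illustration of what horizontal flips and width shifts do to an image, here is a simplified plain-Python sketch. It is not how `ImageDataGenerator` is implemented: it assumes whole-pixel shifts, repeats the edge pixel to fill gaps, and works on a 2-D list; all function names are hypothetical.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2-D image given as a list of rows."""
    return [row[::-1] for row in image]

def shift_width(image, shift):
    """Shift columns right by `shift` pixels (left if negative), repeating the edge pixel."""
    width = len(image[0])
    return [
        [row[min(max(col - shift, 0), width - 1)] for col in range(width)]
        for row in image
    ]

def augment(image, rng, max_shift_fraction=0.1):
    """Randomly flip and shift one image; the input list is left untouched."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    max_shift = int(len(image[0]) * max_shift_fraction)
    return shift_width(image, rng.randint(-max_shift, max_shift))

# Hypothetical 2 x 10 greyscale image
image = [list(range(10)), list(range(10, 20))]
print(horizontal_flip(image)[0])  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(shift_width(image, 2)[0])   # [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
print(augment(image, random.Random(0)))
```

Each call to `augment` yields a slightly different view of the same animal, which is why augmentation is applied to the training generator only and not to the validation generator.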

In [None]:
# Apply data augmentation on baseline model above.
# Create cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(330, 330, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(2, activation='softmax')) # softmax, as in the baseline, so that the two outputs form a probability distribution
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
 
# plot diagnostic learning curves
def summarize_diagnostics(history):
    # plot loss
    pyplot.subplot(211)
    pyplot.title('Cross Entropy Loss')
    pyplot.plot(history.history['loss'], color='blue', label='train')
    pyplot.plot(history.history['val_loss'], color='orange', label='test')
    # plot accuracy
    pyplot.subplot(212)
    pyplot.title('Classification Accuracy')
    pyplot.plot(history.history['accuracy'], color='blue', label='train')
    pyplot.plot(history.history['val_accuracy'], color='orange', label='test')
    # save plot to file
    filename = sys.argv[0].split('/')[-1]
    pyplot.savefig(filename + '_plot.png')
    pyplot.close()
 
# run the test harness for evaluating a model
def run_test_harness():
    # define model
    model = define_model()
    # create data generators
    train_datagen = ImageDataGenerator(rescale=1.0/255.0, width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)
    test_datagen = ImageDataGenerator(rescale=1.0/255.0)
    
    # prepare iterators
    train_it = train_datagen.flow_from_directory('/kaggle/working/SBTIC/train/',class_mode='binary', batch_size=64, target_size=(330, 330))
    test_it = test_datagen.flow_from_directory('/kaggle/working/SBTIC/validation/',class_mode='binary', batch_size=64, target_size=(330, 330))
    
    # fit model
    history = model.fit_generator(train_it, steps_per_epoch=len(train_it),validation_data=test_it, validation_steps=len(test_it), epochs=5, verbose=1) # Were 10 epochs earlier
    # evaluate model
    _, acc = model.evaluate_generator(test_it, steps=len(test_it), verbose=1)
    print('> %.3f' % (acc * 100.0))
    # learning curves
    summarize_diagnostics(history)
    return(history)


Image Augmentation Results

In [None]:
da_model_history = run_test_harness()

In [None]:
plt.subplot(211)
plt.title('Cross Entropy Loss')
plt.plot(da_model_history.history['loss'], color='blue', label='train')
plt.plot(da_model_history.history['val_loss'], color='orange', label='test')
# plot accuracy
plt.subplot(212)
plt.title('Classification Accuracy')
plt.plot(da_model_history.history['accuracy'], color='blue', label='train')
plt.plot(da_model_history.history['val_accuracy'], color='orange', label='test')
plt.show()

### Transfer Learning : VGG 16

In [None]:
 # Create cnn model
def vgg_model():
    # load model
    model = VGG16(weights='imagenet',include_top=False, input_shape=(330, 330, 3)) #weights='imagenet'. Crosscheck before and after
    # mark loaded layers as not trainable
    for layer in model.layers:
        layer.trainable = False
    # add new classifier layers
    flat1 = Flatten()(model.layers[-1].output)
    class1 = Dense(128, activation='relu', kernel_initializer='he_uniform')(flat1)
    output = Dense(2, activation='softmax')(class1)
    # define new model
    model = Model(inputs=model.inputs, outputs=output)
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9) # "lr" is a deprecated alias of learning_rate
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy']) #sparse_categorical_crossentropy
    return model
 
# plot diagnostic learning curves
def summarize_diagnostics(history):
    # plot loss
    pyplot.subplot(211)
    pyplot.title('Cross Entropy Loss')
    pyplot.plot(history.history['loss'], color='blue', label='train')
    pyplot.plot(history.history['val_loss'], color='orange', label='test')
    # plot accuracy
    pyplot.subplot(212)
    pyplot.title('Classification Accuracy')
    pyplot.plot(history.history['accuracy'], color='blue', label='train')
    pyplot.plot(history.history['val_accuracy'], color='orange', label='test')
    # save plot to file
    filename = sys.argv[0].split('/')[-1]
    pyplot.savefig(filename + '_plot.png')
    pyplot.close()
 
# run the test harness for evaluating a model
def run_test_harness():
    # define model
    model = vgg_model()
    # create data generators
    #train_datagen = ImageDataGenerator(rescale=1.0/255.0, width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)
    train_datagen = ImageDataGenerator(rescale=1./255,rotation_range=40,width_shift_range=0.2,height_shift_range=0.2,shear_range=0.2,zoom_range=0.2,horizontal_flip=True,fill_mode='nearest')
    test_datagen = ImageDataGenerator(rescale=1.0/255.0)
    # prepare iterators                   
    train_it = train_datagen.flow_from_directory('/kaggle/working/SBTIC/train/',class_mode='categorical', batch_size=64, target_size=(330, 330))
    test_it = test_datagen.flow_from_directory('/kaggle/working/SBTIC/validation/',class_mode='categorical', batch_size=64, target_size=(330, 330))
    # fit model
    history = model.fit_generator(train_it, steps_per_epoch=len(train_it),validation_data=test_it, validation_steps=len(test_it), epochs=1, verbose=1) #Were 50 epochs earlier
    # evaluate model
    _, acc = model.evaluate_generator(test_it, steps=len(test_it), verbose=1)
    print('> %.3f' % (acc * 100.0))
    # learning curves
    summarize_diagnostics(history)
    return(history)

Transfer learning results

In [None]:
vgg_model().summary() # build the model and list its layers

In [None]:
tl_model_history = run_test_harness()

In [None]:
plt.subplot(211)
plt.title('Cross Entropy Loss')
plt.plot(tl_model_history.history['loss'], color='blue', label='train')
plt.plot(tl_model_history.history['val_loss'], color='orange', label='test')
# plot accuracy
plt.subplot(212)
plt.title('Classification Accuracy')
plt.plot(tl_model_history.history['accuracy'], color='blue', label='train')
plt.plot(tl_model_history.history['val_accuracy'], color='orange', label='test')
plt.show()

### Train on the whole dataset (train and validation combined) using the transfer learning and image augmentation model

Create a combined directory holding all the train and validation images used earlier. The final training run uses all of these images.

In [None]:
#Create a directory combining both train and validation dataset
dataset_home = 'SBTIC/'
subdirs = ['combined/']
for subdir in subdirs:
    # create label subdirectories
    labeldirs = ['train_elephants/', 'train_zebras/']
    for labldir in labeldirs:
        newdir = dataset_home + subdir + labldir
        os.makedirs(newdir, exist_ok=True)

In [None]:
output_path = '/kaggle/working/SBTIC/combined/train_zebras'
os.listdir(output_path)

Copy images into the combined directory

In [None]:
# Copy files from input to combined directory. No random split is involved here, so no seed is needed.
for index, row in df.iterrows():
    if row['image_categories'] != 'test':
        src = row['file_names']
        dst = '/kaggle/working/SBTIC/combined'+ '/' + row['image_categories'] + '/' +row['image_names']
        copyfile(src, dst)

Count images and their corresponding class

In [None]:
# How many images in directories
categories = ['train_zebras', 'train_elephants']
output_path = dst = '/kaggle/working/SBTIC/combined'
for category in categories:
    full_image_path = output_path +   "/" +category + "/"
    print(full_image_path)
    print(category,len(os.listdir(full_image_path)))

In [None]:
# RUN the model on full dataset.
def run_final_model():
    # define model
    model = vgg_model()
    # create data generator
    datagen = ImageDataGenerator(featurewise_center=True)
    # specify imagenet mean values for centering, matching the centering applied in load_image at prediction time
    datagen.mean = [123.68, 116.779, 103.939]
    # prepare iterator
    train_it = datagen.flow_from_directory('/kaggle/working/SBTIC/combined/',class_mode='categorical', batch_size=64, target_size=(330, 330))
    print("Fitting the model")
    # fit model
    model.fit_generator(train_it, steps_per_epoch=len(train_it), epochs=1, verbose=0) #Were 11 epochs
    # save model
    model.save('marine.h5')
    class_dictionary = train_it.class_indices
    print(train_it.classes)
    print(class_dictionary)
    return(train_it)


Execute the model

In [None]:
# Execute the model
train_it = run_final_model()

## Subject the Model to Test Data

Load Test Data

In [None]:
#Import test data from test path
t_file_names =[]
t_file_path =[]
test_image_path = '../input/sbtic-animal-classification/SBTIC/test/test/'
test_image_file_names = [os.path.join(test_image_path, f) for f in os.listdir(test_image_path)] # Retrieve the filenames from the all the  directories. OS package used.
for tfile in test_image_file_names:         # Read the labels and load them into an array
        FILE = os.path.basename(tfile) ## Eliminate path from file name
        t_file_names.append(FILE)    
        t_file_path.append(tfile)
print(len(t_file_names))
print(len(t_file_path))

In [None]:
t_file_names[1]
t_file_path[1]

In [None]:
#Create Test Dataframe
df_test = pd.DataFrame({'t_file_names': t_file_names,'t_file_path':t_file_path}, columns=['t_file_names','t_file_path'])
df_test

Prediction for one image

In [None]:
# make a prediction for a new image.
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.models import load_model
 
# load and prepare the image
def load_image(filename):
    # load the image
    img = load_img(filename, target_size=(330, 330))
    # convert to array
    img = img_to_array(img)
    # reshape into a single sample with 3 channels
    img = img.reshape(1, 330, 330, 3)
    # center pixel data
    img = img.astype('float32')
    img = img - [123.68, 116.779, 103.939]
    return img
 
# load an image and predict the sample image
def run_sample_prediction():
    # load the image
    img = load_image('../input/sbtic-animal-classification/SBTIC/test/test/ASG001e15q_2.jpeg')
    # load model
    model = load_model('marine.h5')
    # predict the class
    y_predicted = model.predict(img) # original
    y_classes = y_predicted.argmax(axis=-1)
    #y_classes = keras.np_utils.probas_to_classes(y_predicted)
    print("Prediction",y_predicted)
    #print("class",y_classes)
    print("Class index",y_classes)
    return(y_predicted,y_classes)
 

In [None]:
#Check the Prediction_Result 
y_predicted,y_classes = run_sample_prediction()

Predict whole test set

In [None]:
df_test[0:3]

In [None]:
def run_test_prediction():
    test_images =[]
    predictions =[]
    # load the model once, rather than on every iteration
    model = load_model('marine.h5')
    # only the first 3 test images are predicted here; remove the [0:3] slice to cover the whole test set
    for index, row in df_test[0:3].iterrows():
        # load the image
        img = load_image(row['t_file_path'])
        test_images.append(row['t_file_names'])
        print(index)
        print(test_images)
        # predict the class
        y_predicted = model.predict(img) 
        predictions.append(y_predicted)
        y_classes = y_predicted.argmax(axis=-1)
        print("Prediction",y_predicted)
        print("Class index",y_classes)
    return(y_predicted,y_classes,test_images,predictions)

In [None]:
##Full test set prediction

y_predicted,y_classes,test_images,predictions = run_test_prediction()

Map labels to prediction

In [None]:
column_names = []
labels = (train_it.class_indices)
dict_labels = dict((v,k) for k,v in labels.items())
for key, value in dict_labels.items():
    print(key, '->', value)
    column_names.append(value)
column_names.insert(0, 'FILE')
column_names
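
The mapping from class indices back to column names can be sketched without Keras. The example below assumes a `class_indices` dictionary of the shape returned by `flow_from_directory` (class name to integer index, assigned in alphabetical order); the helper name `build_submission_rows`, the file names, and the probabilities are hypothetical.

```python
def build_submission_rows(file_names, predictions, class_indices):
    """Pair each file name with its per-class probabilities, keyed by class name."""
    # Invert {'class_name': index} into {index: 'class_name'}
    index_to_class = {index: name for name, index in class_indices.items()}
    ordered_classes = [index_to_class[i] for i in sorted(index_to_class)]
    rows = []
    for file_name, probs in zip(file_names, predictions):
        row = {'FILE': file_name}
        row.update(zip(ordered_classes, probs))
        rows.append(row)
    return ['FILE'] + ordered_classes, rows

# Hypothetical values standing in for train_it.class_indices and model output
class_indices = {'train_elephants': 0, 'train_zebras': 1}
file_names = ['a.jpeg', 'b.jpeg']
predictions = [[0.8, 0.2], [0.1, 0.9]]

columns, rows = build_submission_rows(file_names, predictions, class_indices)
print(columns)  # ['FILE', 'train_elephants', 'train_zebras']
print(rows[1])
```

Ordering the columns by class index, rather than by dictionary order, keeps each probability under the class the model actually assigned it to.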


In [None]:
predictions

Create Dataframe from predictions and Columns above

In [None]:
df_FILE = pd.DataFrame(test_images)
df_FILE
df_predicted = pd.DataFrame(np.concatenate(predictions))
df_predicted
result = pd.concat([df_FILE, df_predicted], axis=1)
result.columns = column_names # assign the flat list; wrapping it in another list would create a MultiIndex
result
#  df_predicted = pd.DataFrame(test_images,np.concatenate(predictions), columns =column_names)


Check Sample Predicted Image Visually

In [None]:
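img = load_img('../input/sbtic-animal-classification/SBTIC/test/test/ASG001e15q_2.jpeg') # load_img returns a displayable PIL image; load_image returns a mean-centered 4-D batch that imshow cannot draw
plt.imshow(img)
plt.show()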

In [None]:
labels = (train_it.class_indices)
# labels = dict((v,k) for k,v in labels.items())
# predictions = [labels[k] for k in predicted_class_indices]

## 4.0 Images Pre-processing

In addition to the image resizing available during import, the following preparation steps are carried out before modelling.

#### a) Label Encoding. 

The training labels are strings with two distinct values. They are encoded as integers.

In [None]:
#Label encoding to change the string labels into integers
print(np.unique(train_categories))
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
train_labels_enc = label_encoder.fit_transform(train_categories)
print(np.unique(train_labels_enc))
# Elephant = 0, zebra =1

Convert the encoded labels to one-hot (categorical) vectors. The softmax output layer and the categorical cross-entropy loss used below expect the targets in this form.
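
The two encoding steps can be illustrated in plain Python. This is a sketch of the idea, not the scikit-learn or Keras implementation; it assumes classes are numbered in sorted order, and the function names are hypothetical.

```python
def encode_labels(labels):
    """Map string labels to integers, numbering the classes in sorted order."""
    classes = sorted(set(labels))
    lookup = {name: i for i, name in enumerate(classes)}
    return classes, [lookup[name] for name in labels]

def one_hot(encoded, num_classes):
    """Turn integer labels into one-hot rows."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)] for label in encoded]

labels = ['train_zebras', 'train_elephants', 'train_zebras']
classes, encoded = encode_labels(labels)
print(classes)  # ['train_elephants', 'train_zebras']
print(encoded)  # [1, 0, 1]
print(one_hot(encoded, len(classes)))  # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

With sorted numbering, elephants become class 0 and zebras class 1, so the second column of the one-hot matrix marks zebras.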

In [None]:
#Convert the predicted labels to categorical type
train_labels_cat = to_categorical(train_labels_enc)
print(train_categories)
print(train_labels_enc)
print(train_labels_cat)
##Display the categorical training labels
#Zebra (images are loaded zebras first, then elephants)
print(train_labels_cat[0])
#Elephant
print(train_labels_cat[-1])

#### b) Normalization

Benefits of normalization
1. Reduces the effect of illumination differences.
2. A CNN converges faster on [0..1] data than on [0..255].
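
A small sketch of the scaling step, with a check worth running in any pipeline: that the division by 255 is applied exactly once. The pixel values and the function name `normalize` are illustrative.

```python
def normalize(pixels):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [value / 255.0 for value in pixels]

raw = [0, 64, 128, 255]    # hypothetical 8-bit pixel values
once = normalize(raw)
twice = normalize(once)    # an accidental second pass

print(max(once))   # 1.0
print(max(twice))  # about 0.0039: almost all contrast is lost
```

Printing the minimum and maximum of the training array after preprocessing is a cheap way to catch a missing or repeated scaling step.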

In [None]:
#Function to upload and if need be resize the training images
def upload_train_images(image_path, categories ,height, width):
    images = []
    labels = []
    file_names =[]
    # Loop across the directories having images.
    for category in categories:
        
        # Append the  category directory into the main path
        full_image_path = image_path +  category + "/" +category + "/"
        # Retrieve the file names in this category directory. OS package used.
        image_file_names = [os.path.join(full_image_path, f) for f in os.listdir(full_image_path)]
        
        # .DS_Store is a hidden file created by macOS; it is not a jpg and causes a failure, so remove it
        excempt = full_image_path + '.DS_Store'
        if excempt in image_file_names:
            image_file_names.remove(excempt)
            
        # Read the images and load them into an array
        for file in image_file_names[0:200]:         
            image=io.imread(file) #io package from SKimage package
            # Resize?
            #image_from_array = Image.fromarray(image, 'RGB')
            ##Resize image
            #size_image = image_from_array.resize((height, width)) # no resize
            #Append image into list
            image = image.astype('float32')/255
            images.append(np.array(image))
            # Label for each image as per directory
            labels.append(category)
            file_names.append(file)
        
    return images, labels, file_names

## Invoke the function

#Image resize parameters if needed. Not resizing in this case so code below just a boilerplate incase resizing needed
height = 256
width = 256

categories = ['train_zebras', 'train_elephants'] 
train_images, train_categories, train_file_names  = upload_train_images('/kaggle/input/sbtic-animal-classification/SBTIC/',categories,height,width)
#Size and dimension of output image and labels
train_images = np.array(train_images)
train_categories = np.array(train_categories)
train_file_names = np.array(train_file_names)

#Check properties of uploaded images
print("Shape of training images is " + str(train_images.shape))
print("Shape of training labels is " + str(train_categories.shape))
print("Shape of training file names is " + str(train_file_names.shape))

In [None]:
#Pixels were already scaled to [0, 1] inside upload_train_images above, so they are not divided by 255 a second time
train_images = train_images.astype('float32')

#### c) Split into training and validation sets.

The validation set will be used to check for overfitting in our model. The test images cannot be used as they do not have labels.

In [None]:
# Training to have 90% and validation 10%. High value of training taken so that we have ample training images. 
# The more the images, the better the model
X_train,X_valid,Y_train,Y_valid = train_test_split(train_images,train_labels_cat,test_size = 0.1,random_state=None)

print("X Train count is ",len(X_train),"Shape",X_train.shape, " and Y train count ",len(Y_train), "Shape", Y_train.shape )
print("X validation count is ",len(X_valid), "Shape",X_valid.shape," and Y validation count ", len(Y_valid), "Shape",Y_valid.shape)

## 5.0 Baseline Model

### Define the CNN model
Convolutional Neural Networks (CNNs) were designed to map image data to an output variable, which makes them a natural choice for this task.

The benefit of using CNNs is their ability to develop an internal representation of an n-dimensional image. This allows the model to learn position and scale across different images, which is important when working with images.
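
To show what a convolutional layer and a pooling layer compute, here is a minimal plain-Python sketch. It assumes a single channel, no padding, stride 1, and a hand-picked kernel; a real `Conv2D` layer learns its kernels during training. The function names and the toy image are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]

# Hypothetical 5x5 image with a vertical edge between columns 1 and 2
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # responds where intensity rises from left to right

features = conv2d_valid(image, vertical_edge)
print(features)                # every row is [0, 2, 0, 0]
print(max_pool_2x2(features))  # [[2, 0], [2, 0]]
```

The same kernel fires wherever the edge appears, and pooling keeps that response while halving the feature map, which is the sense in which the network's features tolerate changes in position.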

In [None]:
#Define the CNN Model
#Sequential API to add one layer at a time starting from the input.
model = Sequential()
# Convolution layer with 32 filters first Conv2D layer.  
# Each filter transforms a part of the image using the kernel filter. The kernel filter matrix is applied on the whole image.
# Relu activation function used to add non linearity to the network.
model.add(Conv2D(filters=32, kernel_size=(5,5), activation='relu', input_shape=X_train.shape[1:]))
# Convolution layer with 64 filters second Conv2D layer 
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
# Max pooling applied. Halves the height and width of the feature map. It is a downsampling filter that looks at each 2 by 2 block of neighboring pixels and keeps the maximal value
model.add(MaxPooling2D(pool_size=(2, 2)))
# Dropout applied as a regularization method, where a proportion of nodes in the layer are randomly ignored by setting their outputs to zero for each training sample.
# This drops randomly a proportion of the network and forces the network to learn features in a distributed way. This improves generalization and reduces overfitting.
model.add(Dropout(rate=0.25))
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(rate=0.25))
# Flatten to convert the final feature maps into a one single 1D vector. Needed so as to make use of fully connected layers after some convolutional/maxpool layers.
# It combines all the found local features of the previous convolutional layers.
model.add(Flatten())
#Dense layer applied to create a fully-connected artificial neural networks classifier.
model.add(Dense(256, activation='relu'))
model.add(Dropout(rate=0.5))
#Neural net outputs distribution of probability of each class.
model.add(Dense(2, activation='softmax')) # 2 output classes
model.summary()
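
To make the output of `model.summary()` easier to check by hand, the sketch below traces the feature-map size through the baseline layers. It assumes a 256x256 input (the size used for the test images later in the notebook; the training size is not shown in this section) and relies on the Keras defaults of `padding='valid'` and a stride of 1 for `Conv2D`. The helper names are illustrative.

```python
def conv_out(size, kernel, stride=1):
    """Output width/height of a 'valid' convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, pool):
    """Output width/height of max pooling with stride equal to the pool size."""
    return size // pool

size = 256                  # assumed input height and width
size = conv_out(size, 5)    # Conv2D 5x5
size = conv_out(size, 3)    # Conv2D 3x3
size = pool_out(size, 2)    # MaxPool2D 2x2
size = conv_out(size, 3)    # Conv2D 3x3
size = pool_out(size, 2)    # MaxPool2D 2x2

flat_units = size * size * 64           # 64 filters in the last Conv2D layer
dense_params = flat_units * 256 + 256   # weights plus biases of Dense(256)
print(size, flat_units, dense_params)
```

Under these assumptions almost all of the parameters sit in the first dense layer, which is one reason a global pooling layer is often used instead of `Flatten` for large inputs.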

### Optimize and compile the model

OPTIMIZER: Adam, applied to minimize the loss function.

LOSS: categorical_crossentropy, the multi-class log loss, which is also the metric for success of the challenge.

METRICS: Categorical accuracy, as this is a classification problem.
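
As a quick illustration of what the loss measures, here is a minimal NumPy sketch of the categorical cross-entropy (log loss) for one-hot labels. The function name and the toy values are assumptions made for this example; the clipping constant only guards against `log(0)`.

```python
import numpy as np

def categorical_log_loss(y_true_onehot, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(np.sum(y_true_onehot * np.log(y_prob), axis=1)))

# Toy example (hypothetical values): two images, two classes
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])

loss = categorical_log_loss(y_true, y_prob)
print(round(loss, 4))
```

Confident wrong predictions are penalised heavily: swapping the two probability rows above gives a much larger loss.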

In [None]:
#Compilation of the model
#The learning rate is passed as learning_rate (the older lr argument is deprecated)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=tf.keras.losses.categorical_crossentropy,
              metrics=[tf.keras.metrics.categorical_accuracy])

Training.

In [None]:
#Train for 12 epochs and keep the loss and accuracy of each epoch in history
#Note: X_train[1:10] selects only 9 training images; pass the full X_train and Y_train for a real training run
history = model.fit(X_train[1:10], Y_train[1:10], batch_size=32, epochs=12,
                    validation_data=(X_valid, Y_valid)) # optional extras: validation_split=0.2, callbacks=callbacks

#To rectify class imbalance, add the class weight parameter: class_weight=class_weights
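
How `class_weights` was built is not shown in this section. One common choice, sketched below as an assumption rather than the notebook's actual method, is the "balanced" heuristic `n_samples / (n_classes * class_count)`. It returns a dict keyed by class index, which is the format Keras expects for `class_weight`. The function name and the toy labels are illustrative, and every class is assumed to appear at least once.

```python
import numpy as np

def balanced_class_weights(labels_onehot):
    """Weight each class inversely to its frequency."""
    n_samples, n_classes = labels_onehot.shape
    class_ids = np.argmax(labels_onehot, axis=1)
    counts = np.bincount(class_ids, minlength=n_classes)
    return {c: float(n_samples / (n_classes * counts[c])) for c in range(n_classes)}

# Toy one-hot labels (hypothetical): 75 images of class 0 and 25 of class 1
toy_labels = np.eye(2)[np.array([0] * 75 + [1] * 25)]
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer class receives the larger weight, so its errors contribute more to the loss.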

In [None]:
#Display of the accuracy and the loss values
plt.figure(0)
plt.plot(history.history['categorical_accuracy'], label='training accuracy')
plt.plot(history.history['val_categorical_accuracy'], label='val accuracy')
plt.title('Accuracy')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()

plt.figure(1)
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.title('Loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend()
plt.show()

Baseline Model Accuracy

In [None]:
from sklearn.metrics import roc_auc_score, log_loss, f1_score, recall_score, precision_score

# Create dictionary and dataframe to hold results for various models
# (named results_dict to avoid shadowing the built-in dict)
results_dict = {'Model':['Baseline CNN' ,'Mobile Net V2', 'Data Augmentation'],
                'AUC': [0.0,0.0,0.0],
                'Log Loss':[0.0,0.0,0.0],
                'F1 score':[0.0,0.0,0.0],
                'Recall':[0.0,0.0,0.0],
                'Precision':[0.0,0.0,0.0]}
df_results = pd.DataFrame(results_dict,columns = ['Model','Log Loss','AUC','F1 score','Recall','Precision'])


# Function to calculate results for each model
def model_results(model_type,y_test_data, y_prediction_data, y_test_class, y_pred_class):

    index_val = df_results[df_results['Model']==model_type].index

    #Assign scores to dataframe
    df_results.loc[index_val,'AUC'] = roc_auc_score(y_test_data, y_prediction_data)
    #Use the labels passed to the function rather than the global Y_valid
    df_results.loc[index_val,'Log Loss'] = log_loss(y_test_data, y_prediction_data)
    df_results.loc[index_val,'F1 score'] = f1_score(y_test_class, y_pred_class,average='weighted')
    df_results.loc[index_val,'Recall'] = recall_score(y_test_class, y_pred_class,average='weighted')
    df_results.loc[index_val,'Precision'] = precision_score(y_test_class, y_pred_class,average='weighted')

    return df_results

In [None]:
#Baseline Prediction
y_prediction = model.predict(X_valid) # make predictions

#Baseline Results
dominant_y_valid=np.argmax(Y_valid, axis=1)
dominant_y_predict=np.argmax(y_prediction, axis=1)

model_results('Baseline CNN',Y_valid, y_prediction,dominant_y_valid,dominant_y_predict)

In [None]:
#Confusion Matrix
import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=75) 
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm = confusion_matrix(dominant_y_valid , dominant_y_predict)
# One tick label per class found in the confusion matrix
class_names = range(cm.shape[0])
plt.figure(2, figsize=(5,5))
plot_confusion_matrix(cm, classes=class_names, title='Confusion matrix')
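
The confusion matrix also gives the per-class recall and precision directly. The sketch below builds the matrix with NumPy and derives both; the function name and the toy labels are assumptions for illustration, and rows are true classes while columns are predicted classes, as in the plot above.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Toy class ids (hypothetical)
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0])

cm_toy = confusion_counts(y_true, y_pred, n_classes=2)
recall = np.diag(cm_toy) / cm_toy.sum(axis=1)      # correct / all true samples of the class
precision = np.diag(cm_toy) / cm_toy.sum(axis=0)   # correct / all predictions of the class
print(cm_toy, recall, precision)
```

Dividing each row by its sum, as the `normalize=True` option of the plotting function does, puts the per-class recall on the diagonal.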

## 6.0 Challenging the solution
### 6.1 Transfer Learning : Model to use is MobileNetV2

With transfer learning, instead of starting the learning process from scratch, you start from patterns that were learned while solving a different problem, and so reuse what the network already knows.

More about MobileNetV2: https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html

a) Import the MobileNetV2 from keras

In [None]:
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=X_train.shape[1:],
                                               include_top=False,
                                               weights='imagenet')

b) Train The model

In [None]:
#Freeze the base model so that its pre-trained ImageNet weights are not updated during training
base_model.trainable = False 

#Define the pre-trained model: frozen base, global average pooling, and a softmax layer with one unit per class
pretrained_model = tf.keras.Sequential([base_model,
                                        tf.keras.layers.GlobalAveragePooling2D(),
                                        tf.keras.layers.Dense(Y_train.shape[1], activation="softmax")])

pretrained_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=tf.keras.losses.categorical_crossentropy, 
                         metrics = [tf.keras.metrics.categorical_accuracy])

pretrained_model.summary()

c) Fitting

In [None]:
#Fit the pretrained model to the  data
history_trf = pretrained_model.fit(X_train, Y_train, epochs=5,batch_size=32 , 
                validation_data=(X_valid, Y_valid), class_weight=class_weights)

Graph of accuracy and loss for training and validation

In [None]:
#Display of the accuracy and the loss values
plt.figure(0)
plt.plot(history_trf.history['categorical_accuracy'], label='training accuracy')
plt.plot(history_trf.history['val_categorical_accuracy'], label='val accuracy')
plt.title('Accuracy')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()

plt.figure(1)
plt.plot(history_trf.history['loss'], label='training loss')
plt.plot(history_trf.history['val_loss'], label='val loss')
plt.title('Loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend()
plt.show()

#### Mobile Net V2 Transfer Running Results
#### a) AUC and Log Loss

In [None]:
#Mobile Net V2 Prediction
y_prediction_trf = pretrained_model.predict(X_valid) # make predictions

#Mobile Net V2 Results
dominant_y_valid=np.argmax(Y_valid, axis=1)
dominant_y_predict=np.argmax(y_prediction_trf, axis=1)

model_results('Mobile Net V2',Y_valid, y_prediction_trf,dominant_y_valid,dominant_y_predict)

AUC measures the degree of separability: it tells how well the model is able to distinguish between the classes. The higher the AUC, the better the model is at telling the classes apart.

A low log loss means that the predicted probabilities are close to the true labels, i.e. the model is both confident and correct.
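
One way to read the AUC is as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing all positive/negative pairs; the function name and the toy scores are assumptions for illustration, and the pairwise approach is only practical for small validation sets.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Toy labels and scores (hypothetical)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.75: three of the four pairs are ranked correctly
```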


#### b) Classification Report

In [None]:
print(classification_report(dominant_y_valid , dominant_y_predict))

#### Confusion Matrix

In [None]:
#Confusion Matrix
import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=75) 
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j],
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm = confusion_matrix(dominant_y_valid , dominant_y_predict)
# One tick label per class found in the confusion matrix
class_names = range(cm.shape[0])
plt.figure(2, figsize=(5,5))
plot_confusion_matrix(cm, classes=class_names, title='Mobile Net V2 Confusion matrix')

### 6.2 Image Data Augmentation

We will generate more image data using ImageDataGenerator. The ImageDataGenerator class artificially creates additional training images by applying one or more random transformations, such as rotations, shifts, shears and flips.

In [None]:
image_gen = ImageDataGenerator(
    #featurewise_center=True,
    #featurewise_std_normalization=True,
    rescale=1./255,
    rotation_range=15,
    width_shift_range=.15,
    height_shift_range=.15,
    horizontal_flip=True)

#fit() computes dataset statistics; these are only required when the featurewise_* options above are enabled
image_gen.fit(X_train, augment=True)
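
To show what a random flip and shift do to a single image, here is a minimal NumPy sketch. It is a simplification and not the Keras implementation: the function name is an assumption, and shifted pixels wrap around the border via `np.roll`, whereas `ImageDataGenerator` fills the gap according to its `fill_mode`.

```python
import numpy as np

def augment(image, rng, max_shift_fraction=0.15):
    """Randomly flip an H x W x C image horizontally and shift it by a few pixels."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip: reverse the width axis
    h, w = out.shape[:2]
    max_dy = int(h * max_shift_fraction)
    max_dx = int(w * max_shift_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    # Shift rows by dy and columns by dx (wrap-around simplification)
    return np.roll(out, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))  # toy image with values in [0, 1)
aug = augment(image, rng)
print(aug.shape)
```

Each call produces a slightly different view of the same image, which is what lets the generator enlarge the effective training set.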

In [None]:
#Continue training the pretrained model on the augmented images
#Model.fit accepts generators directly in TensorFlow 2.1 and later
history_idg = pretrained_model.fit(train_generator,
                                   epochs = 10,
                                   shuffle = False, 
                                   steps_per_epoch=3,
                                   validation_steps=1,
                                   validation_data=val_generator,
                                   class_weight=class_weights)

In [None]:
#Display of the accuracy and the loss values
plt.figure(0)
plt.plot(history_idg.history['categorical_accuracy'], label='training accuracy')
# plt.plot(history_idg.history['val_categorical_accuracy'], label='val accuracy')
plt.title('Accuracy')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()

plt.figure(1)
plt.plot(history_idg.history['loss'], label='training loss')
# plt.plot(history_idg.history['val_loss'], label='val loss')
plt.title('Loss')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.legend()
plt.show()

In [None]:
# Prediction
y_prediction_idg = pretrained_model.predict(X_valid) # make predictions

logloss = log_loss(Y_valid, y_prediction_idg)
logloss

## 7.0 Subject the model to test data

a) Import the test data from test directory

In [None]:
#Function to load the test images
def upload_test_images(image_path, height, width):
    test_images = []
    test_image_paths = []
    # Retrieve the file names from the test directory
    test_image_file_names = [os.path.join(image_path, f) for f in os.listdir(image_path)]
    # Read the image pixels
    for file in test_image_file_names:
        test_image=io.imread(file)
        # Convert the pixel array to a PIL image
        test_image_from_array = Image.fromarray(test_image, 'RGB')
        # Resize image (PIL expects the size as (width, height))
        test_size_image = test_image_from_array.resize((width, height))
        # Append the resized image and its path to the lists
        test_images.append(np.array(test_size_image))
        test_image_paths.append(file)
    return test_images,test_image_paths

## Invoke the function
#Image resize parameters
height = 256
width = 256
test_images,test_image_paths = upload_test_images('/kaggle/input/cgiar-computer-vision-for-crop-disease/ICLR/test/test/',height,width)
# Scale pixels to [0, 1], matching the normalization applied to the training images
test_images = np.array(test_images).astype('float32')/255

In [None]:
#Size and dimension of test image
print("Shape of test images is " + str(test_images.shape))
# Check image paths
test_image_paths[0:5]

The image name is part of the full image path, as shown above. We will separate the name from the path as below.

In [None]:
# Use regular expressions to extract the name of the image
image_names = []
for i in test_image_paths:
    # Keep only capital letters and digits from the path
    i = re.sub("[^A-Z0-9]", "", str(i))
    # Strip the file extension
    i = i.replace("JPG", "")
    i = i.replace("PNG", "")
    i = i.replace("JPEG", "")
    i = i.replace("JFIF", "")
    image_names.append(i)

#View images
image_names[0:5]
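
An alternative to the regular expression above is to let `os.path` split off the directory and the extension. The sketch below is an assumption about what the IDs should look like, not the notebook's method: it keeps lower-case letters and underscores and drops the directory names entirely, so its output differs from the regex version. The example paths are hypothetical.

```python
import os

def image_id_from_path(path):
    """Return the file name without its directory and extension."""
    return os.path.splitext(os.path.basename(path))[0]

# Hypothetical paths used only for illustration
example_paths = ["/data/test/ABC123.JPG", "/data/test/xyz_9.jpeg"]
ids = [image_id_from_path(p) for p in example_paths]
print(ids)  # ['ABC123', 'xyz_9']
```

Which form is correct depends on the ID column of the competition's sample submission file.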

In [None]:
#Prediction for all images
#The softmax output of predict() already gives the class probabilities
y_prediction = model.predict(test_images) # make predictions
y_prediction[400:500]

In [None]:
# Prediction for each test image, one image at a time
test_images = np.array(test_images)
preds = []
for img in tqdm(test_images):
    img = img[np.newaxis,:] # add a batch dimension
    prediction = pretrained_model.predict(img) # softmax output gives the class probabilities
    preds.append(prediction) 
preds

In [None]:
# Class indices: healthy_wheat = 0, leaf_rust = 1, stem_rust = 2
# Create a placeholder dataframe with one row per test image
n_images = len(preds)
healthy_wheat = pd.Series(range(n_images), name="healthy_wheat", dtype=np.float32)
stem_rust = pd.Series(range(n_images), name="stem_rust", dtype=np.float32)
leaf_rust = pd.Series(range(n_images), name="leaf_rust", dtype=np.float32)
submission = pd.concat([healthy_wheat,stem_rust,leaf_rust], axis=1)

# Each entry of preds has shape (1, n_classes); take its single row
for i in range(len(preds)):
    submission.loc[i] = preds[i][0]

In [None]:
#Append the image names to the result output
submission["ID"] = image_names

In [None]:
submission.head(10)

In [None]:
cols = submission.columns.tolist()
cols = cols[-1:] + cols[:-1]
submission = submission[cols]

In [None]:
submission.columns

In [None]:
submission[submission['ID'] == 'ICLRELRIT5']

In [None]:
submission['ID1'] = submission['ID']

In [None]:
submission['ID'] = submission['ID'].str[1:]

In [None]:
# write to csv
submission.to_csv("sub.csv", index=False)

### Challenges

a) The image data is very large: it took 15.5GB of the available 16GB, which caused the kernel to crash.

b) The images supplied by Zindi were mixed up, so the dataset had to be reloaded.

### Conclusion

Actions

a) Upload the above submission to Zindi to get the results on the test data.

b) Optimize the combined data augmentation and transfer learning model.

c) Consider other transfer learning models, e.g. ResNet.

d) Conflicts between stem rust and leaf rust were noted in the model. Consider re-running the model at a higher resolution with batch uploads.

### Improvements
Class weight

Transfer learning

MTL

Adaptive images --sic

Oversampling/downsampling

Having a validation set

Ensemble in image classification


Training augmentations

Random resized crop preserving aspect with scale ~ uniform(0.5, 1) using nearest-neighbor interpolation

Random horizontal and vertical flip, and 90 degrees rotation

Normalizing each image channel to N(0, 1)

For each channel: channel = channel * a + b, where a ~ N(1, 0.1), b ~ N(0, 0.1)
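
The last two training augmentations can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the function names are hypothetical, the 0.1 in N(1, 0.1) and N(0, 0.1) is read as a standard deviation, and the statistics are computed per image rather than over the whole dataset.

```python
import numpy as np

def standardize_channels(image, eps=1e-7):
    """Shift and scale each channel of an H x W x C image to zero mean and unit variance."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    std = image.std(axis=(0, 1), keepdims=True)
    return (image - mean) / (std + eps)

def jitter_channels(image, rng, scale_sd=0.1, shift_sd=0.1):
    """Apply channel = channel * a + b with a ~ N(1, scale_sd) and b ~ N(0, shift_sd)."""
    n_channels = image.shape[-1]
    a = rng.normal(1.0, scale_sd, size=n_channels)
    b = rng.normal(0.0, shift_sd, size=n_channels)
    return image * a + b  # a and b broadcast over the channel axis

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))  # toy image
std_img = standardize_channels(image)
jittered = jitter_channels(std_img, rng)
print(std_img.mean(axis=(0, 1)), std_img.std(axis=(0, 1)))
```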

Test-time augmentations

Horizontal and vertical flip, and 90 degrees rotation
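
Test-time augmentation averages the predictions over several transformed copies of each image. The sketch below is a minimal version assuming square images in a batch of shape (N, H, W, C); `predict_fn` stands in for a call such as a model's `predict`, and `dummy_predict` is a hypothetical placeholder used only so the example runs without a trained model.

```python
import numpy as np

def predict_with_tta(predict_fn, images):
    """Average class probabilities over the original, flipped and rotated views."""
    views = [
        images,                             # original
        images[:, :, ::-1, :],              # horizontal flip
        images[:, ::-1, :, :],              # vertical flip
        np.rot90(images, k=1, axes=(1, 2))  # 90 degree rotation (needs square images)
    ]
    probs = [predict_fn(view) for view in views]
    return np.mean(probs, axis=0)

def dummy_predict(batch):
    """Placeholder model: uses mean brightness as the probability of class 0."""
    p = batch.mean(axis=(1, 2, 3))
    return np.stack([p, 1.0 - p], axis=1)

rng = np.random.default_rng(0)
batch = rng.random((4, 8, 8, 3))
tta_probs = predict_with_tta(dummy_predict, batch)
print(tta_probs.shape)
```

With a real model the views give slightly different probabilities, and their average is usually more stable than a single prediction.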

https://www.kaggle.com/c/recursion-cellular-image-classification/discussion/110457
- random crop 384x384
- random flip
- random rotation by a multiple of 90 degrees

Data augmentation
https://www.kaggle.com/c/recursion-cellular-image-classification/discussion/110337


https://www.hackerearth.com/practice/machine-learning/advanced-techniques/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3/tutorial/

Image classification: Here you can do scaling, resizing, removing noise (smoothening), annotating, etc.

STEPS
- load train and test datasets
- set up train/test image transforms
- set up train/test data loaders

### References

https://zindi.africa/competitions/sbtic-animal-classification/data