# Aerial Cactus Identification

This notebook documents the steps I took for this Kaggle competition. The goal is to build a classifier that predicts whether an image contains a cactus.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import cv2
import matplotlib.pyplot as plt
import os
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten, BatchNormalization
from keras import regularizers

from sklearn.utils import shuffle
print(os.listdir("../input"))

# First Look at the Data
The dataset contains a large number of 32x32 thumbnail images. The labels for the pictures are provided in the **'train.csv'** file. First, we will load the CSV file into a pandas dataframe, and then take a look at a few sample images.

In [None]:
#Load CSV file into pandas
train_df = pd.read_csv('../input/train.csv')
train_df.head()

In [None]:
print("Prepping Data...")
image_directory = '../input/train/train/'

#Loading images and labels in the order given by train.csv so they stay aligned
#(assumes every id listed in train.csv has a matching file in image_directory)
X_train = np.array([cv2.imread(image_directory + filename) for filename in train_df['id']])
y_train = train_df['has_cactus'].values

X_train, y_train = shuffle(X_train, y_train, random_state = 2019)

print("Complete!")
print("X_train shape = {}".format(X_train.shape))
print("y_train shape = {}".format(y_train.shape))

In [None]:
#Show some random images
labels = ['No Cactus', 'Yes Cactus']

plt.figure(figsize=(10,10))
for position, index in enumerate([50, 57, 60], start=1):
    plt.subplot(1, 3, position)
    plt.title(labels[y_train[index]])
    # cv2.imread returns BGR but imshow expects RGB, so reverse the channel order
    plt.imshow(X_train[index][..., ::-1], interpolation='bilinear')

Just by looking at the images, it is pretty hard to tell whether there is a cactus or not, because the images were heavily downsized from the original dataset. Nevertheless, I think a simple convolutional neural network (CNN) will do the trick!

# Image Preprocessing: Normalize Images
First we are going to normalize the images to zero mean and unit variance. Normalizing the inputs generally helps CNNs train faster and more reliably. As a rule of thumb, when you are dealing with computer vision tasks, you should normalize your images first.

In [None]:
X_train = (X_train - X_train.mean()) / X_train.std()
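
One detail worth keeping in mind: whatever statistics are used to normalize the training images should also be applied to any images the model sees later. Here is a minimal sketch of that idea, using random arrays as stand-ins for the real images. The names `train_images`, `test_images`, `train_mean`, and `train_std` are hypothetical and not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(2019)
# Stand-ins for real image batches: (num_images, height, width, channels)
train_images = rng.integers(0, 256, size=(100, 32, 32, 3)).astype(np.float64)
test_images = rng.integers(0, 256, size=(20, 32, 32, 3)).astype(np.float64)

# Compute the statistics once, on the training images only
train_mean = train_images.mean()
train_std = train_images.std()

# Apply the same shift and scale to both sets
train_norm = (train_images - train_mean) / train_std
test_norm = (test_images - train_mean) / train_std

print(train_norm.mean(), train_norm.std())
```

The normalized training set has mean close to 0 and standard deviation close to 1. The test set is only approximately so, which is expected because it is scaled with the training statistics.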

# Keras Model: Simple CNN
For our first model, we will use a single convolutional layer followed by two fully connected layers. Since we will be experimenting with the model architecture, it is a good idea to start as simple as possible and work our way up.

In [None]:
model = Sequential()
model.add(Conv2D(64, (3,3), input_shape = (32,32,3), activation='relu'))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
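
To make sense of what `model.summary()` reports, the layer sizes can be worked out by hand. The sketch below assumes the Keras defaults for `Conv2D` (stride 1, `'valid'` padding). The helper functions are hypothetical and only there for illustration.

```python
def conv2d_output_size(input_size, kernel_size):
    """Spatial size after a convolution with stride 1 and no padding."""
    return input_size - kernel_size + 1

def conv2d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

side = conv2d_output_size(32, 3)   # width and height of the conv feature map
flat = side * side * 64            # length of the vector produced by Flatten
total = conv2d_params(3, 3, 64) + dense_params(flat, 64) + dense_params(64, 1)
print(side, flat, total)
```

Nearly all of the parameters sit in the first `Dense` layer, because the flattened feature map is so large.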

In [None]:
model.fit(X_train, y_train, epochs=5, validation_split = 0.25)
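
As far as I understand the Keras documentation, `validation_split` holds out the last fraction of the samples, taken before any per-epoch shuffling, which is one reason it helps that the data was shuffled up front. Here is a rough NumPy equivalent of that split, using a hypothetical helper `split_last_fraction` and dummy arrays:

```python
import numpy as np

def split_last_fraction(x, y, validation_split):
    """Hold out the last `validation_split` fraction of the samples."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Dummy data: 20 samples with alternating labels
x = np.arange(20).reshape(20, 1)
y = np.arange(20) % 2
(x_fit, y_fit), (x_val, y_val) = split_last_fraction(x, y, 0.25)
print(len(x_fit), len(x_val))
```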

As you can see, just by using a simple CNN and normalizing the data, we were able to achieve strong results (over 95% accuracy). Let's try to improve the performance with a couple of tricks.

# Batch Normalization
Batch normalization is a common trick used in today's deep learning networks. In simple terms, its purpose is to normalize the values found **WITHIN** the hidden layers. Remember how we first normalized our images before feeding them into the network? Batch normalization does much the same thing for the hidden layers inside the network.

Here is a YouTube video that explains it much better than I can:
[Link](https://www.youtube.com/watch?time_continue=427&v=dXB-KQYkzNU)
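
To make the idea concrete, here is a minimal NumPy sketch of the training-time forward pass only. It is not how the Keras `BatchNormalization` layer is implemented: the real layer also learns `gamma` and `beta` and keeps moving averages for inference. The function name `batch_norm_forward` and the `eps` value are assumptions made for illustration.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # scale and shift (learned in practice)

rng = np.random.default_rng(0)
# Fake hidden-layer activations: batch of 128 samples, 64 features each
activations = rng.normal(loc=5.0, scale=3.0, size=(128, 64))
out = batch_norm_forward(activations, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])
```

With `gamma` set to ones and `beta` set to zeros, every feature comes out with mean close to 0 and standard deviation close to 1, regardless of the scale of the incoming activations.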

In [None]:
model = Sequential()
model.add(Conv2D(64, (3,3), input_shape = (32,32,3), activation='relu'))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

In [None]:
model.fit(X_train, y_train, epochs=5, validation_split = 0.25)

Alright, so we can see that batch normalization did not really help the model. This is probably because the network is relatively small, so the error propagation caused by non-normalized hidden layers is minimal.

# Submission
Alright, so we are ready to make our submission. First we are going to retrain our model on all of the training data (we previously reserved 25% for validation). Then, we are going to use the model to make predictions on the test images.

In [None]:
model = Sequential()
model.add(Conv2D(64, (3,3), input_shape = (32,32,3), activation='relu'))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

In [None]:
model.fit(X_train, y_train, epochs=5)

In [None]:
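#Prep the testing data
print("Prepping Testing Data...")
testing_directory = '../input/test/test/'
# Load the test images in the order of the submission file so predictions line up with the ids
test_ids = pd.read_csv('../input/sample_submission.csv')['id']
X_test = np.array([cv2.imread(testing_directory + filename) for filename in test_ids])
# The model was trained on normalized images, so normalize the test images too.
# This uses the test set's own statistics as an approximation of the training ones.
X_test = (X_test - X_test.mean()) / X_test.std()
print("Complete!")
print("X_test shape = {}".format(X_test.shape))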

In [None]:
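#Making Predictions
# predict_classes is not available in recent Keras versions, so threshold the sigmoid output instead
predictions = (model.predict(X_test) > 0.5).astype(int).flatten()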
submission = pd.read_csv('../input/sample_submission.csv')
submission['has_cactus'] = predictions
submission.sample(5)
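
With a single sigmoid output, the model produces one probability per image, and a class label is obtained by thresholding it. Here is a minimal sketch with made-up probabilities. The 0.5 cutoff is an assumption, a common default rather than something tuned here.

```python
import numpy as np

# Made-up sigmoid outputs with shape (num_images, 1), like a Dense(1) layer would produce
probabilities = np.array([[0.03], [0.97], [0.50], [0.61]])

# Anything strictly above the cutoff counts as "has cactus"
predicted = (probabilities > 0.5).astype(int).flatten()
print(predicted)
```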

In [None]:
# Keep the header row and drop the dataframe index so the file matches the sample submission format
submission.to_csv('./submissions.csv', index=False)