# Creation of additional data using Keras ImageDataGenerator

There have been several great Kernels created so far that use Keras to build convolutional neural networks that classify an image as containing an iceberg or not. I have been using a Keras-built cnn myself, and my work right now focuses on two goals: (1) tweaking the nn setup to improve predictive ability, and (2) creating more data on which to train the nn.

In this kernel I discuss how I'm going about goal number 2. This is an easy job with Keras, but there is a little catch: Keras ImageDataGenerator() is designed to take 1, 3, or 4 channels of data, and here we have two input channels. To get around this, I first add a dummy channel to the input data, and then, once the new data has been generated, I pull the dummy channel off and go about my nn training.


For the first section (the read-in and the train/test split) I draw from 'Exploring the Icebergs with skimage and Keras' by Kevin Mader. (You can pair the new data you make with that Kernel's cnn for some pretty good results!)

In [None]:
import pandas as pd
import numpy as np


from sklearn.model_selection import train_test_split
# keras.utils exposes to_categorical directly; the np_utils submodule is gone in newer Keras releases
from keras.utils import to_categorical


In [None]:
# load function from: https://www.kaggle.com/kmader/exploring-the-icebergs-with-skimage-and-keras
# b/c I didn't want to reinvent the wheel
def load_and_format(in_path):
	""" take the input data in .json format and return a df with the data and an np.array for the pictures """
	out_df = pd.read_json(in_path)
	out_images = out_df.apply(lambda c_row: [np.stack([c_row['band_1'],c_row['band_2']], -1).reshape((75,75,2))],1)
	out_images = np.stack(out_images).squeeze()
	return out_df, out_images


train_df, train_images = load_and_format('../input/train.json')

test_df, test_images = load_and_format('../input/test.json')

Note that I split off the validation set before creating the additional training instances. This is a conservative measure: you could generate new data from all of the provided training instances by moving the train/test split further down the notebook, but be wary of overfitting!
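
To make the overfitting concern a bit more concrete, here is a small sketch with made-up index arrays (no real images; `n_images`, `copies`, and the interleaved split are assumptions purely for illustration). It tracks which source image each augmented sample came from. Splitting first keeps validation images out of the augmented training pool, while augmenting first and then splitting lets copies of the same image land on both sides.

```python
import numpy as np

n_images = 10   # hypothetical tiny dataset
copies = 3      # hypothetical number of augmented copies per image

# Option A: split first, then augment only the training half.
source_ids = np.arange(n_images)
train_ids, val_ids = source_ids[:5], source_ids[5:]
aug_train_sources = np.repeat(train_ids, copies)
leak_split_first = set(aug_train_sources.tolist()) & set(val_ids.tolist())

# Option B: augment everything, then split the augmented pool
# without regard to which image each copy came from.
aug_all_sources = np.repeat(source_ids, copies)
train_pool, val_pool = aug_all_sources[0::2], aug_all_sources[1::2]
leak_augment_first = set(train_pool.tolist()) & set(val_pool.tolist())

print("source images on both sides, split first:  ", len(leak_split_first))
print("source images on both sides, augment first:", len(leak_augment_first))
```

When the same source image sits on both sides, the validation score can look better than it should, because the model has effectively seen those images already.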

In [None]:

#also from https://www.kaggle.com/kmader/exploring-the-icebergs-with-skimage-and-keras
X_train, X_test, y_train, y_test = train_test_split(train_images,
                                                    to_categorical(train_df['is_iceberg']),
                                                    random_state = 42,
                                                    test_size = 0.5
                                                   )
print('Train', X_train.shape, y_train.shape)
print('Validation', X_test.shape, y_test.shape)


Below I create a dummy channel of all zeros (with the same height and width as the two true channels, and one entry per training image).

In [None]:
# size the dummy channel from X_train instead of hard-coding the 802 training images
dummy_dat = np.zeros(X_train.shape[:3] + (1,), dtype=np.float32)


This dummy channel is simply concatenated along the fourth axis (axis = 3, since numpy counts from zero)... upvote if you struggle with visualizing 4 dimensions as much as I do!

In [None]:
fudge_X_train = np.concatenate((X_train, dummy_dat), axis = 3)
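
As a sanity check on the shapes, the sketch below uses a small random array as a stand-in for X_train (the name `toy_X` and its size are made up). It needs nothing beyond NumPy. It shows that `np.pad` on the last axis gives the same result as concatenating an explicit zero channel, and that slicing off the last channel recovers the original two bands.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(4, 75, 75, 2)).astype(np.float32)  # stand-in for X_train

# Approach used in this kernel: build a zero channel and concatenate on axis 3.
toy_dummy = np.zeros(toy_X.shape[:3] + (1,), dtype=toy_X.dtype)
toy_concat = np.concatenate((toy_X, toy_dummy), axis=3)

# Equivalent one-liner: pad one zero-filled slot at the end of the channel axis.
toy_padded = np.pad(toy_X, ((0, 0), (0, 0), (0, 0), (0, 1)), mode="constant")

print(toy_concat.shape)                            # (4, 75, 75, 3)
print(np.array_equal(toy_concat, toy_padded))      # True
print(np.array_equal(toy_concat[..., :2], toy_X))  # True
```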

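Below we instantiate the ImageDataGenerator. The params I use can be tweaked to your liking. Note that rescale is applied only to the generated batches, so the original images that get combined with them later keep their original scale.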

In [None]:
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')



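The generator is then fit to the data (with the dummy channel). Strictly speaking, fit() only matters when featurewise centering, featurewise std normalization, or ZCA whitening is switched on, which is not the case here, so the call is harmless.

I initialize the output arrays with the 'real' inputs, and the generated data gets appended to them.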

In [None]:
datagen.fit(fudge_X_train)

x_batches = fudge_X_train
y_batches = y_train

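Here we loop through using the .flow() function to generate batches of modified training data.
The kernel uses a batch size of 5 and 5 epochs. Each epoch makes one pass over the training set, so increasing the number of epochs makes more data (the batch size only changes how many images come out per step)!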

Note the break statement to leave the generator... otherwise it loops forever!

In [None]:

epochs = 5

for e in range(epochs):
	print('Epoch', e)
	batches = 0
	per_batch = 5
	for x_batch, y_batch in datagen.flow(fudge_X_train, y_train, batch_size=per_batch):
		x_batches = np.concatenate((x_batches, x_batch), axis = 0)
		y_batches = np.concatenate((y_batches, y_batch), axis = 0)
		batches += 1
		if batches >= len(fudge_X_train) / per_batch:
			# we need to break the loop by hand because
			# the generator loops indefinitely
			break
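
Calling np.concatenate inside the loop copies the whole growing array on every batch. A possible alternative, sketched here with a hypothetical stand-in generator (`endless_batches` is not a Keras function, just an imitation of how .flow() keeps yielding batches forever, and it does no augmentation), is to collect the batches in a list and concatenate once at the end. The stopping rule is the same as in the loop above.

```python
import math
import numpy as np

def endless_batches(x, y, batch_size):
    """Hypothetical stand-in for datagen.flow(): yields batches forever."""
    n = len(x)
    while True:
        for start in range(0, n, batch_size):
            yield x[start:start + batch_size], y[start:start + batch_size]

# Small made-up arrays in place of fudge_X_train / y_train.
toy_x = np.zeros((12, 75, 75, 3), dtype=np.float32)
toy_y = np.zeros((12, 2), dtype=np.float32)

epochs, per_batch = 2, 5
batches_per_epoch = math.ceil(len(toy_x) / per_batch)  # 3 batches: 5 + 5 + 2

x_parts, y_parts = [toy_x], [toy_y]  # start from the original images
for e in range(epochs):
    batches = endless_batches(toy_x, toy_y, per_batch)
    for i, (x_batch, y_batch) in enumerate(batches, start=1):
        x_parts.append(x_batch)
        y_parts.append(y_batch)
        if i >= batches_per_epoch:
            break  # leave the endless generator by hand

# One concatenate at the end instead of one per batch.
x_all = np.concatenate(x_parts, axis=0)
y_all = np.concatenate(y_parts, axis=0)
print(x_all.shape, y_all.shape)
```

With only a few thousand 75x75 images the difference is small, so treat this as a tidy-up rather than a necessity.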



Next we drop the dummy channel and can go about training our cnn with an expanded training data set!

In [None]:
x_train_new = x_batches[:,:,:,:2]
x_train_new.shape

In [None]:
y_batches.shape
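
For reference, the expected number of rows can be worked out by hand. This is a back-of-the-envelope sketch that assumes the split leaves 802 training images and that the last batch of each pass holds only the leftover images, which is how I understand .flow() to behave.

```python
import math

n_train = 802      # assumed size of X_train after the 50/50 split
per_batch = 5
epochs = 5

# The loop breaks once batches >= 802 / 5 = 160.4, i.e. after this many batches.
batches_per_epoch = math.ceil(n_train / per_batch)
# Size of the final, partial batch in each pass.
leftover = n_train - (batches_per_epoch - 1) * per_batch
generated_per_epoch = (batches_per_epoch - 1) * per_batch + leftover
# Original images plus everything generated across all epochs.
total_rows = n_train + epochs * generated_per_epoch

print(batches_per_epoch, leftover, total_rows)
```

Under those assumptions, x_train_new.shape should come out as (4812, 75, 75, 2) and y_batches.shape as (4812, 2).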

Here is one of the original images:

In [None]:
import matplotlib.pyplot as plt

plt.imshow(x_train_new[500,:,:,0])

Do you think that is an iceberg or a boat?

How about one of the ones we have generated? Fake iceberg or fake boat?

In [None]:

plt.imshow(x_train_new[-32,:,:,0])
plt.show()


I am currently working to rejig the cnn params to classify more effectively when presented with these additional data.

I hope you can take this and use it to improve the performance of your own models by beefing up your training set... good luck to everyone! Take this larger training data set and pipe it into your best cnn!
