## Data Augmentation

<font color=blue>Two things matter in data augmentation:

*a. Which kind of augmentation is applied, e.g. horizontal flip, rotation.*

*b. How many augmented versions are generated: usually one per image per epoch, to avoid near-duplicate copies.*
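
To make point (a) concrete, here is a minimal sketch of what a horizontal and a vertical flip do to an image. It uses a tiny nested list with made-up pixel values instead of a real image, and the helper names `horizontal_flip` and `vertical_flip` are hypothetical, not Keras functions.

```python
# Toy 2x3 "image" as nested lists (hypothetical pixel values).
image = [
    [1, 2, 3],
    [4, 5, 6],
]

def horizontal_flip(img):
    """Mirror left-to-right: reverse every row."""
    return [row[::-1] for row in img]

def vertical_flip(img):
    """Mirror top-to-bottom: reverse the order of the rows."""
    return img[::-1]

print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
print(vertical_flip(image))    # [[4, 5, 6], [1, 2, 3]]
```

Each call returns one new version of the image, which matches point (b): one augmented copy per original.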

In [4]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

In [5]:
import os

In [6]:
from tensorflow.keras.preprocessing import image

*Note*

Data augmentation is applied only during training. It plays no role at test time.

As seen below, augmentation is configured only for the training generator. The test generator only rescales the pixel values.

In [7]:
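# Training generator: random flips, rotations of up to 30 degrees, and rescaling to [0, 1].
train_gen = ImageDataGenerator(vertical_flip=True,
                               horizontal_flip=True,
                               rotation_range=30,
                               rescale=1./255)

# Test generator: rescaling only, no augmentation.
test_gen = ImageDataGenerator(rescale=1./255)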

In [8]:
v_batch_size = 35   # 735 training samples divide evenly into batches of 35

In [None]:
# Number of batches = 735 / 35 = 21

*<font color=blue>Note*
    
If you also had a validation split, as in the commented code below, the batch size could differ between training and validation data. Instead of a single v_batch_size you could use separate values such as v_train_size and v_val_size.


In [1]:
# model.fit(train_set, epochs=5,
#           steps_per_epoch=train_set.samples // v_batch_size,
#           validation_data=val_set,
#           validation_steps=val_set.samples // v_batch_size)
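
A minimal sketch of how the step counts could be computed when the two splits use different batch sizes. The sample counts (an 80/20 split of the 735 images) and the batch sizes are hypothetical; only the names v_train_size and v_val_size come from the note above.

```python
import math

# Hypothetical sample counts for an 80/20 split of 735 images.
n_train, n_val = 588, 147

# Hypothetical batch sizes, one per split.
v_train_size, v_val_size = 28, 21

# Floor division for training: only whole batches per epoch.
steps_per_epoch = n_train // v_train_size

# Ceiling for validation: a final partial batch is still evaluated.
validation_steps = math.ceil(n_val / v_val_size)

print(steps_per_epoch, validation_steps)  # 21 7
```

Using the ceiling for validation is a design choice in this sketch, so that no validation sample is left out when the count does not divide evenly.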

Training data is usually shuffled, while test data is not, so that predictions stay in the same order as test_data.classes.

In [9]:
train_data=train_gen.flow_from_directory('data/waffle_pancakes_ds/train',
                                          shuffle=True,
                                          seed=0,
                                          batch_size=v_batch_size,
                                          target_size=(224,224),
                                          class_mode='binary')

Found 735 images belonging to 2 classes.


In [10]:
test_data=test_gen.flow_from_directory('data/waffle_pancakes_ds/test',
                                        shuffle=False,
                                        batch_size = v_batch_size,
                                        target_size=(224,224),
                                        class_mode='binary')

Found 389 images belonging to 2 classes.


In [11]:
model = Sequential()

model.add(Conv2D(64,(3,3),input_shape = (224,224,3), activation = 'relu')) 
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64,(3,3), activation = 'relu')) 
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64,(3,3), activation = 'relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))



In [12]:
model.add(Flatten())  
 
model.add(Dense( activation = 'relu', units=64))
model.add(Dense( activation = 'sigmoid', units=1)) 

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 222, 222, 64)      1792      
                                                                 
 max_pooling2d (MaxPooling2D  (None, 111, 111, 64)     0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 109, 109, 64)      36928     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 54, 54, 64)       0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 52, 52, 64)        36928     
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 26, 26, 64)       0
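
The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in the model above: 'valid' padding with stride 1 for Conv2D, and a pooling stride equal to the pool size. The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Width/height after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Width/height after max pooling with stride equal to the pool size."""
    return size // pool

def conv_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

size, channels = 224, 3
rows = []
for filters in (64, 64, 64):
    conv_size = conv_out(size)
    params = conv_params(3, channels, filters)
    size = pool_out(conv_size)
    rows.append((conv_size, size, params))
    channels = filters  # the next layer sees 64 input channels

for conv_size, pooled, params in rows:
    print(f"conv: {conv_size}x{conv_size}x64  pool: {pooled}x{pooled}x64  params: {params}")
# conv: 222x222x64  pool: 111x111x64  params: 1792
# conv: 109x109x64  pool: 54x54x64  params: 36928
# conv: 52x52x64  pool: 26x26x64  params: 36928
```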

In [14]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

    - train_data is an iterator over the training dataset.
    - For every batch, images are read from the specified directory and augmented on the fly.
    - Each batch here contains 35 (batch_size) augmented training images.
    - Set the steps_per_epoch argument of the fit method to n_train_samples // batch_size, where n_train_samples is the total number of training images.
    - This ensures that in each epoch every training sample is augmented only once, so n_train_samples transformed images are generated per epoch.
    - Without a bound on the number of steps, an endlessly looping generator could pass several augmented versions of the same image within one epoch, which amounts to training on near-duplicate copies.

Please note:

    - Augmentation does not increase the number of training images per epoch.
    - A different random transformation is applied to each image in every epoch.
    - Hence, if we train for 5 epochs, the model has seen 5 different augmented versions of each original image. With 100 original images, for example, that is 100 * 5 = 500 different images over the whole training run, instead of just the 100 originals.
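
The counting argument above can be illustrated with a small simulation. This is only a sketch: a random angle stands in for the random transformation, image ids stand in for images, and the numbers (100 images, 5 epochs) are the ones from the example above.

```python
import random

n_images, epochs = 100, 5
rng = random.Random(0)  # fixed seed so the run is repeatable

seen = set()            # every (image, transformation) pair used in training
images_per_epoch = []

for epoch in range(epochs):
    count = 0
    for image_id in range(n_images):
        # Stand-in for a random rotation within rotation_range=30.
        angle = rng.uniform(-30, 30)
        seen.add((image_id, angle))
        count += 1
    images_per_epoch.append(count)

print(images_per_epoch)  # [100, 100, 100, 100, 100]
print(len(seen))         # distinct augmented versions over the whole run
```

The per-epoch count stays at 100, while the number of distinct versions seen over the run grows with the number of epochs.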

*<font color=blue>steps_per_epoch is the number of batches*
    
This argument ensures that all 735 images are augmented once and only once per epoch, provided the number of samples is evenly divisible by batch_size.

e.g. with a batch size of 35, 735 / 35 = 21 batches, so all training data is used and augmented once.

e.g. with a batch size of 30, 735 / 30 = 24.5 batches, which floor division rounds down to 24. So 24 * 30 = 720 images are used, while the remaining 15 are skipped in that epoch and the training set is not fully covered.

If the sample count is not evenly divisible, pick the closest batch size that covers as many samples as possible.
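
The arithmetic above can be checked with a short helper. The function name `epoch_coverage` and the search range for batch sizes are assumptions made for this sketch.

```python
def epoch_coverage(n_samples, batch_size):
    """Return (steps, images used, images left over) for one epoch."""
    steps = n_samples // batch_size
    used = steps * batch_size
    return steps, used, n_samples - used

print(epoch_coverage(735, 35))  # (21, 735, 0)
print(epoch_coverage(735, 30))  # (24, 720, 15)

# Batch sizes between 16 and 64 that divide 735 exactly.
exact = [b for b in range(16, 65) if 735 % b == 0]
print(exact)  # [21, 35, 49]
```

Any batch size in `exact` covers the whole training set with no leftover images.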
    
    

In [21]:
model.fit(train_data, steps_per_epoch=train_data.samples // v_batch_size, epochs=2)

# Floor division (//) rounds down to a whole number of batches. It is the safer
# choice when the batch size is not a factor of the number of training samples.

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f93dc3ffe50>

Accuracy on unseen test data is generally expected to be higher when the model is trained with augmented data, although this is not guaranteed.

In [16]:
model.evaluate(test_data)



[0.8320676684379578, 0.6041131019592285]
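
The two numbers returned by evaluate are the loss and the accuracy, in the order given to compile. As a rough sketch of what they measure, the code below computes binary cross-entropy and accuracy at a 0.5 threshold in plain Python. The labels and probabilities are made up for illustration and are not taken from this model.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)); p is clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [0, 0, 1, 1]          # hypothetical true classes
probs = [0.1, 0.6, 0.8, 0.4]   # hypothetical sigmoid outputs

print(round(binary_crossentropy(labels, probs), 4))  # 0.5403
print(accuracy(labels, probs))                       # 0.5
```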

In [17]:
result = model.predict(test_data)



In [18]:
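# Caution: the model has a single sigmoid output, so result[30] has length 1 and
# argmax() always returns 0. This check is True whenever the true class is 0.
# Use int(result[30][0] > 0.5) to get the predicted class instead.
result[30].argmax() == test_data.classes[30]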

True

In [19]:
# argmax() over a single sigmoid output is always 0; threshold at 0.5 to get the class.
result[150].argmax() == test_data.classes[150]

True
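
A note on the two checks above: with one sigmoid unit, each prediction row holds a single probability, so taking the argmax of a row always gives index 0. The sketch below contrasts that with thresholding at 0.5, using made-up probabilities and labels and a hypothetical pure-Python `argmax` helper.

```python
probs = [[0.91], [0.12], [0.55]]  # hypothetical sigmoid outputs, one value per row
labels = [1, 0, 1]                # hypothetical true classes

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

# argmax over a length-1 row can only ever return 0.
argmax_preds = [argmax(row) for row in probs]

# Thresholding the probability gives the predicted class.
threshold_preds = [int(row[0] > 0.5) for row in probs]

print(argmax_preds)     # [0, 0, 0]
print(threshold_preds)  # [1, 0, 1]
print([p == y for p, y in zip(threshold_preds, labels)])  # [True, True, True]
```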

*Note on model.fit*

If we had set the number of batches (steps_per_epoch) to 35 with a batch size of 35, fit would ask for 35 * 35 = 1225 samples per epoch, more than the 735 available, and we would get the warning 'Your input ran out of data' while fitting.

In [20]:
model.fit(train_data,steps_per_epoch= 35, epochs=2)


Epoch 1/2


<keras.callbacks.History at 0x7f93dc3f84c0>
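
The mismatch behind this warning can be spotted before calling fit. The check below is a plain arithmetic sketch with a hypothetical helper name. It does not call Keras and only compares the requested steps with the number of batches in one pass over the data.

```python
import math

def batches_per_pass(n_samples, batch_size):
    """Number of batches in one full pass, counting a final partial batch."""
    return math.ceil(n_samples / batch_size)

n_samples, batch_size = 735, 35
available = batches_per_pass(n_samples, batch_size)

report = []
for steps in (21, 35):
    requested_images = steps * batch_size
    status = "ok" if steps <= available else "exceeds one pass"
    report.append((steps, requested_images, status))
    print(steps, requested_images, status)
# 21 735 ok
# 35 1225 exceeds one pass
```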