# Data Augmentation

Since this is a small dataset, I used data augmentation to create more images.

Data augmentation also lets us address the class imbalance (61% of the images belong to the tumorous class) by generating more samples for the smaller class.

## Import Necessary Modules

In [17]:
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator
import cv2
import imutils
import matplotlib.pyplot as plt
from os import listdir
import time

%matplotlib inline

In [18]:
# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return f"{h}:{m}:{round(s,1)}"

## Data Augmentation Function

### Inputs:
- **file_dir**: Directory containing the original images to be augmented.
- **n_generated_samples**: Number of augmented samples to generate for each original image.
- **save_to_dir**: Directory to save the generated augmented images.

### Data Augmentation Techniques:
- Rotation by a maximum of 10 degrees.
- Horizontal and vertical shifts by a maximum of 10% of the image dimensions.
- Shear transformation with a maximum shear angle of 0.1 degrees.
- Brightness adjustment by a random factor between 0.3 and 1.0 (values below 1.0 darken the image).
- Horizontal and vertical flips.
- Nearest-neighbor filling mode for pixels outside the boundaries.
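
To make these ranges concrete, the sketch below draws one random set of transformation parameters using only the standard library. It illustrates what each range means and is not the Keras implementation; `sample_augmentation_params` and its dictionary keys are hypothetical names, and the 240x240 image size is an assumed example.

```python
import random

def sample_augmentation_params(height, width, rng=random):
    """Draw one random set of augmentation parameters (illustrative only)."""
    return {
        # rotation angle in degrees, at most 10 in either direction
        "rotation_deg": rng.uniform(-10, 10),
        # shifts in pixels, at most 10% of the corresponding dimension
        "shift_x_px": rng.uniform(-0.1, 0.1) * width,
        "shift_y_px": rng.uniform(-0.1, 0.1) * height,
        # shear intensity, at most 0.1 in either direction
        "shear": rng.uniform(-0.1, 0.1),
        # brightness factor: values below 1.0 darken the image
        "brightness": rng.uniform(0.3, 1.0),
        # each flip is assumed to be applied with probability 0.5
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
    }

# assumed example size: a 240x240 image
params = sample_augmentation_params(height=240, width=240)
print(params)
```

Every call returns a different combination, which is why the same original image can yield several distinct augmented samples.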

### Processing:
- Iterates through each file in the specified directory.
- Loads the original image using OpenCV.
- Adds a leading batch dimension to the image, giving it the shape `(1, height, width, channels)`, and sets a filename prefix for the generated samples.
- Uses an image data generator to apply random transformations to generate augmented samples.
- Saves the augmented samples with prefixed filenames in the specified output directory.

### Output:
- Augmented images saved in the specified directory for each original image.
- For example, `n_generated_samples=5` produces 5 augmented samples for each original image.

In essence, this function is a convenient way to create additional training data by applying various transformations to the original images, which can help improve the robustness and generalization of a machine learning model.


In [19]:
def augment_data(file_dir, n_generated_samples, save_to_dir):
    """
    Arguments:
        file_dir: A string representing the directory where the images to augment are found.
        n_generated_samples: An integer representing the number of samples to generate from each image.
        save_to_dir: A string representing the directory in which the generated images will be saved.
    """
    
    data_gen = ImageDataGenerator(rotation_range=10, 
                                  width_shift_range=0.1, 
                                  height_shift_range=0.1, 
                                  shear_range=0.1, 
                                  brightness_range=(0.3, 1.0),
                                  horizontal_flip=True, 
                                  vertical_flip=True, 
                                  fill_mode='nearest'
                                 )

    
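    for filename in listdir(file_dir):
        # load the image; cv2.imread returns None for files it cannot decode
        image = cv2.imread(file_dir + '/' + filename)
        if image is None:
            continue
        # add a leading batch dimension: (height, width, channels) -> (1, height, width, channels)
        image = image.reshape((1,) + image.shape)
        # prefix of the names for the generated samples (filename without its extension)
        save_prefix = 'aug_' + filename.rsplit('.', 1)[0]
        # generate 'n_generated_samples' sample images; each iteration saves one image
        i = 0
        for batch in data_gen.flow(x=image, batch_size=1, save_to_dir=save_to_dir,
                                   save_prefix=save_prefix, save_format='jpg'):
            i += 1
            # stop once 'n_generated_samples' images have been saved
            if i >= n_generated_samples:
                break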
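
A portable way to build the image path and the save prefix is to use `os.path.join` and `os.path.splitext` from the standard library. The sketch below is one possible helper, not part of the notebook; the name `build_paths` and the example file names are hypothetical.

```python
import os

def build_paths(file_dir, filename):
    """Return the full image path and the save prefix for one file."""
    # join with the separator of the current operating system
    image_path = os.path.join(file_dir, filename)
    # drop the extension whatever its length ('.jpg', '.jpeg', '.png', ...)
    stem, _extension = os.path.splitext(filename)
    return image_path, 'aug_' + stem

print(build_paths('Tumor', 'Y1.jpg')[1])    # aug_Y1
print(build_paths('Tumor', 'Y2.jpeg')[1])   # aug_Y2
```

Using `splitext` keeps the prefix correct for extensions that are not exactly three characters long.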

Remember that 61% of the data (155 images) are tumorous and 39% of the data (98 images) are non-tumorous.

So, in order to balance the data, we can generate 9 new images for every image that belongs to the 'no' class and 6 new images for every image that belongs to the 'yes' class.
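
The arithmetic behind these multipliers can be checked directly. The sketch below assumes that the original images are counted together with the generated ones when the class sizes are compared; the variable names are illustrative.

```python
# original class sizes, from the text above
n_yes, n_no = 155, 98

# new images generated per original image
per_yes, per_no = 6, 9

# class sizes after augmentation, counting originals plus generated images
total_yes = n_yes + n_yes * per_yes   # 155 + 930 = 1085
total_no = n_no + n_no * per_no       # 98 + 882 = 980

share_yes = 100 * total_yes / (total_yes + total_no)
print(f"yes: {total_yes}, no: {total_no}, share of yes: {share_yes:.1f}%")
# yes: 1085, no: 980, share of yes: 52.5%
```

Under this assumption the two classes end up close to an even split.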

In [28]:
start_time = time.time()



# augment data for the examples with label 'yes', representing tumorous examples
augment_data(file_dir="C:/Users/lenovo/Desktop/Siar-dataset/Tumor",
             n_generated_samples=6,
             save_to_dir="C:/Users/lenovo/Desktop/Siar-dataset/yes")
# augment data for the examples with label 'no', representing non-tumorous examples
augment_data(file_dir="C:/Users/lenovo/Desktop/Siar-dataset/Normal",
             n_generated_samples=9,
             save_to_dir="C:/Users/lenovo/Desktop/Siar-dataset/no")

end_time = time.time()
execution_time = (end_time - start_time)
print(f"Elapsed time: {hms_string(execution_time)}")

Elapsed time: 0:34:1.5


Let's see how many tumorous and non-tumorous examples there are after performing data augmentation:

In [29]:
def data_summary(main_path):
    """
    Print the number and percentage of positive and negative examples.
    main_path: directory that contains the 'yes' and 'no' folders; it must end with a path separator.
    """
    yes_path = main_path + 'yes'
    no_path = main_path + 'no'

    # number of files (images) in the folder named 'yes', which holds tumorous (positive) examples
    m_pos = len(listdir(yes_path))
    # number of files (images) in the folder named 'no', which holds non-tumorous (negative) examples
    m_neg = len(listdir(no_path))
    # number of all examples
    m = m_pos + m_neg

    pos_perc = (m_pos * 100.0) / m
    neg_perc = (m_neg * 100.0) / m

    print(f"Number of examples: {m}")
    print(f"Percentage of positive examples: {pos_perc}%, number of pos examples: {m_pos}")
    print(f"Percentage of negative examples: {neg_perc}%, number of neg examples: {m_neg}")
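
`listdir` counts every entry in a folder, including files that are not images. A possible variant, sketched below under the assumption that the images use the listed extensions, counts only image files; `count_images` and `IMAGE_EXTENSIONS` are hypothetical names, and the temporary folder only stands in for a real dataset folder.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed set of image extensions

def count_images(folder):
    """Count the files in `folder` whose extension looks like an image."""
    return sum(
        1 for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )

# small demonstration on a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ('aug_Y1_0_1.jpg', 'aug_Y1_0_2.JPG', 'Thumbs.db'):
        open(os.path.join(tmp, name), 'w').close()
    n_images = count_images(tmp)

print(n_images)  # 2
```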

In [31]:
augmented_data_path="C:/Users/lenovo/Desktop/Siar-dataset/"

In [32]:
data_summary(augmented_data_path)

Number of examples: 60361
Percentage of positive examples: 37.08851742018853%, number of pos examples: 22387
Percentage of negative examples: 62.91148257981147%, number of neg examples: 37974


That's it for this notebook. Now, we can use the augmented data to train our convolutional neural network.

# END.