# Data preparation using TensorFlow

I recently tried my skill in a Kaggle competition called "Global Wheat Detection". To be fair, I am fairly new to the field, and as a beginner I focused on the models and how to train them. But a crucial step that most online courses pay little attention to is data pre-processing, also called the data pipeline. It covers formatting your data so that it can be fed to your model.

Each model is different, and each requires data in a specific format. Reading images with cv2.imread() and appending them to a list will do the job, but not efficiently: it eats up so much RAM that you will have no choice but to reduce batch_size. An image of shape (128,128,3) takes somewhere around 10-30 KB as a file, but once you load it as a NumPy array its size depends on the array's data type. Normalized images are usually fed to the model as float32 (also referred to as single-precision floating point), and one float32 value takes 4 bytes. Hence, after normalizing, your image will take:

> total values in an image = 128 × 128 × 3 = 49152
>
> size of NumPy array = 49152 × 4 = 196608 bytes = 192 KB

The dataset used in this notebook (Google Landmark Recognition 2020) has 1580470 images, which means:

> size required = 192 KB × 1580470 = 303450240 KB ≈ 289 GB of memory

I don't think any consumer-grade machine has that much RAM, let alone a GPU. As you can see, this is why loading everything into NumPy arrays is not optimal.
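The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `array_size_bytes` is made up for illustration, and it assumes every image is stored as a dense array with a fixed number of bytes per value (4 for float32, 1 for uint8).

```python
def array_size_bytes(height, width, channels, bytes_per_value=4):
    """Bytes needed to hold one dense image array (float32 by default)."""
    return height * width * channels * bytes_per_value


num_images = 1580470  # image count quoted above for this dataset

per_image = array_size_bytes(128, 128, 3)  # float32, 4 bytes per value
total_float32 = per_image * num_images
total_uint8 = array_size_bytes(128, 128, 3, bytes_per_value=1) * num_images

print(f"one image : {per_image / 1024:.0f} KB")
print(f"float32   : {total_float32 / 1024**3:.1f} GB")
print(f"uint8     : {total_uint8 / 1024**3:.1f} GB")
```

Keeping the raw uint8 pixels and normalizing later would cut the footprint to a quarter, but that is still far more than the RAM of a typical machine.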

In that earlier competition I did exactly this. I used 256×256×3 images, but there were only 3422 of them, which I augmented to 6844 images in total. That took 6-7 GB of memory, and I had to train the network with a batch size of 16. Ultimately I failed miserably, but I learned a lot from it and started looking for better ways to implement this. Here I will share an approach that is suitable for all kinds of classification tasks. I hope you will enjoy it.

If you liked it, hit the little " **^** " icon at the upper right (I guess that's an upvote!).
So let's dive in!

# Step 1 - Import the required libraries

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from IPython.display import Image, display
import tensorflow as tf
import pandas as pd

# Step 2 - Set directories

* csv_path - path to the train.csv provided in the dataset
* base_directory - path to the train folder, which contains all the sub-directories

In [None]:
csv_path = "../input/landmark-recognition-2020/train.csv"
base_directory = "../input/landmark-recognition-2020/train"

# Step 3 - Read train.csv into pandas dataframe
* columns - the names of the CSV columns we want to load
* annotations - train.csv loaded as a pandas data frame
* data_frame - a dictionary of lists in which we will save the processed values from annotations

In [None]:
columns = ['id','landmark_id']
annotations = pd.read_csv(csv_path, usecols=columns)

data_frame = {"image_dir":[],"landmark_id":[]}

# Step 4 - Process the annotations data frame
* First, we loop over all rows of the data frame, storing the "id" value of the current row in image_id and the "landmark_id" value in land_id.
* Then, on the second line, we build the full path to the image. As explained in the data section, the first three characters of the id give the sub-directory structure, and that is what we use here (refer to the data tab on the competition's homepage).
* On the third and fourth lines, we store this path and its label in the "data_frame" dictionary we defined earlier.

In [None]:
for image_id, land_id in zip(annotations["id"],annotations["landmark_id"]):
    image_dir = "{}/{}/{}/{}/{}.jpg".format(base_directory,image_id[0],image_id[1],image_id[2],image_id)
    data_frame["image_dir"].append(image_dir)
    data_frame["landmark_id"].append(land_id)
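The path construction in the loop can also be pulled into a small helper, which makes it easy to try on a single id. This is only a sketch: `build_image_path` is a hypothetical name, the id in the example is made up, and it assumes, as described above, that the first three characters of the id name the nested sub-directories.

```python
def build_image_path(base_directory, image_id):
    """Return base/<c0>/<c1>/<c2>/<id>.jpg for an image id."""
    if len(image_id) < 3:
        raise ValueError(f"id too short to derive sub-directories: {image_id!r}")
    first, second, third = image_id[:3]
    return f"{base_directory}/{first}/{second}/{third}/{image_id}.jpg"


# Made-up id, used only to show the resulting layout.
example_path = build_image_path("../input/landmark-recognition-2020/train", "17660ef415d37059")
print(example_path)
```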

# Step 5 - Save the new CSV 
* On the first line, we convert our dictionary into a pandas data frame called "df"
* On the second line, we save it as "train_data.csv" (we set index=False because we don't want an index column in our CSV)

In [None]:
df = pd.DataFrame(data_frame)
df.to_csv("train_data.csv",index=False)

# Verify the algorithm
After you build something, it is always good practice to verify it on a small sample of the data. Let's do that.
The following cell loads our new "train_data.csv" into a pandas data frame called data_csv.
(Note: make sure you set dtype=str if the data frame is used to generate data for training.)

In [None]:
data_csv_path = "./train_data.csv"

columns = ['image_dir','landmark_id']
data_csv = pd.read_csv(data_csv_path, usecols=columns,dtype=str)
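The reason for `dtype=str` is that pandas would otherwise parse the label column as integers, while `flow_from_dataframe` with `class_mode="categorical"` expects the values of `y_col` to be strings (or lists/tuples). A small sketch with made-up rows (not taken from the real CSV) shows the difference:

```python
import io

import pandas as pd

# Two made-up rows in the same layout as train_data.csv.
csv_text = "image_dir,landmark_id\ntrain/a/b/c/abc1.jpg,1\ntrain/d/e/f/def2.jpg,27\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Without dtype=str the label is parsed as an integer type.
print(type(inferred["landmark_id"].iloc[0]).__name__)
# With dtype=str the label stays a Python string.
print(type(as_text["landmark_id"].iloc[0]).__name__)
```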

Let us check whether everything is assigned properly.
The following cell prints:
* the first 7 entries of train.csv
* the first 7 entries of train_data.csv
* the last 7 entries of train.csv
* the last 7 entries of train_data.csv

We'll manually check these 7 values to see whether each id-label pair is correct.

In [None]:
# Let us check whether everything is assigned properly.
# This line just tells pandas to print the whole string instead of truncating it.
pd.options.display.max_colwidth = 100
print("First 7 entries of original train.csv")
print(annotations.head(7))
print("First 7 entries of our processed train_data.csv")
print(data_csv.head(7))
print("Last 7 entries of original train.csv")
print(annotations.tail(7))
print("Last 7 entries of our processed train_data.csv")
print(data_csv.tail(7))
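Eyeballing seven rows works, but the same check can be automated over every row. The sketch below uses a hypothetical helper, `find_mismatches`, and made-up sample values. It assumes each path ends in `<id>.jpg` and that both files list the rows in the same order.

```python
import os


def find_mismatches(ids, labels, image_dirs, new_labels):
    """Return the row positions where the id or label does not line up."""
    mismatches = []
    rows = zip(ids, labels, image_dirs, new_labels)
    for position, (image_id, label, path, new_label) in enumerate(rows):
        file_stem = os.path.splitext(os.path.basename(path))[0]
        # Labels are compared as text because train_data.csv is read with dtype=str.
        if file_stem != image_id or str(label) != str(new_label):
            mismatches.append(position)
    return mismatches


# Made-up rows: the second path points at the wrong file on purpose.
sample_ids = ["abc1", "def2"]
sample_labels = [1, 27]
sample_dirs = ["train/a/b/c/abc1.jpg", "train/d/e/f/xyz9.jpg"]
sample_new_labels = ["1", "27"]
print(find_mismatches(sample_ids, sample_labels, sample_dirs, sample_new_labels))
```

In this notebook it could be called as `find_mismatches(annotations["id"], annotations["landmark_id"], data_csv["image_dir"], data_csv["landmark_id"])`; an empty list would mean every row matches.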

Hurray! Our algorithm seems to be working perfectly.
Since the labels are verified, let's verify the directories.
The following cell displays images. If the images are displayed, the directories are also correct.

In [None]:
# let's visualize some images from our csv
tail_part = data_csv.tail(5)
for image,label in zip(tail_part["image_dir"],tail_part["landmark_id"]):
    display(Image(image))
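Displaying a handful of images confirms that those paths resolve, but with this many files it may also be worth checking a larger sample without rendering anything. A minimal sketch, assuming a plain existence check is enough (it does not verify that a file is a valid JPEG). `missing_files` is a hypothetical helper, and the demo uses a temporary directory instead of the real dataset.

```python
import os
import tempfile


def missing_files(paths):
    """Return the paths that do not point at an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


# Demo on a temporary directory so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    present = os.path.join(folder, "present.jpg")
    absent = os.path.join(folder, "absent.jpg")
    with open(present, "wb") as handle:
        handle.write(b"placeholder")  # content is irrelevant for this check
    not_found = missing_files([present, absent])

print(len(not_found))
```

On the real data, something like `missing_files(data_csv["image_dir"].sample(1000))` would check a random sample of paths.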

## Great! Everything seems to be working fine
Now let's summarize how many images and classes the data contains. We will compare these numbers after we generate the image flow.

In [None]:
no_of_classes=len(annotations["landmark_id"].unique())
no_of_images=len(annotations["id"].unique())
print("There are total {} images belonging to {} classes".format(no_of_images,no_of_classes))

Great!
Now we know that there are 1580470 images, which belong to 81313 classes.

# Step 6 - Prepare data for training

TensorFlow is very generous: it provides a very handy tool called ImageDataGenerator.
This tool mainly performs two functions:
* Augment the images as specified by the user
* Load images in batches rather than loading them all at once

We will look into both in the following cells.

### To use this feature, follow these steps

1. Create a datagen object, which dictates how the images will be augmented.

    The following cell does just that. I have used only some of the augmentation options TensorFlow provides. What each argument does is pretty much self-explanatory. For full information on each one, and on the other options available, click [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator).



Also, do not forget to import the class (we already did this in Step 1) using

>from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.1,
    fill_mode='nearest')

2. Now that augmentation is taken care of, let's look at how to load images from our "data_csv" data frame. Basically, this method loads only a specified number of images at a time instead of all of them at once. How many images are loaded is controlled by the hyper-parameter batch_size. For example, if you specify batch_size = 32, the method first loads 32 images, discards them once training on them is finished, and then loads the next 32. This continues until all images have been used, which completes one epoch. Also, in some datasets the images have different sizes. This is a problem because CNN models require images of the same shape; the method takes care of this too by resizing every image to target_size.

    Once you understand that, let's look at the method.

    To generate this flow of images from the data frame, we call the "flow_from_dataframe" method on the "datagen" object we created in the previous cell. Like ImageDataGenerator, this method has several parameters; you can learn more about them [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_dataframe).

    Some of the important ones are:

* dataframe - put your data frame here

* x_col - the name of the column that contains the full paths of the images

* y_col - the name of the column that contains the respective class ids

* target_size - all images will be resized to the size you define here

* color_mode - the color mode; see [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_dataframe)

* class_mode - the class mode; see [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_dataframe)

* batch_size - the batch size

* subset - whether "training" or "validation" (only works if the validation_split parameter was provided when creating the datagen object; see [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_dataframe))
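Conceptually, the batch loading described above behaves like a Python generator that only materializes one batch at a time. The sketch below is not how Keras implements it. `iter_batches` is a hypothetical helper and the strings stand in for image files, purely to illustrate why memory use depends on `batch_size` rather than on the size of the dataset.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Ten placeholder file names standing in for images on disk.
fake_paths = [f"image_{index}.jpg" for index in range(10)]

batch_lengths = [len(batch) for batch in iter_batches(fake_paths, batch_size=4)]
print(batch_lengths)
```

Only the current slice exists in memory at any moment; the last batch is simply shorter when the item count is not a multiple of the batch size.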

    

    

Run the following cell. Note that this will take a while, so if it appears frozen, it's not! Let it work. After all, it is working on 1.5 million images!!

In [None]:
batch_size = 32
target_shape=(64,64)

train_generator = datagen.flow_from_dataframe(
        dataframe = data_csv,
        x_col = "image_dir",
        y_col = "landmark_id",
        target_size = target_shape,
        color_mode = "rgb",
        class_mode = "categorical",
        batch_size = batch_size,
        subset = 'training'
)
validation_generator = datagen.flow_from_dataframe(
        dataframe = data_csv,
        x_col = "image_dir",
        y_col = "landmark_id",
        target_size = target_shape,
        color_mode = "rgb",
        class_mode = "categorical",
        batch_size = batch_size,
        subset = 'validation'
)

Remember that earlier we counted 1580470 images belonging to 81313 classes?

Let's verify that everything is loaded accurately:

> classes = 81313 (verified)

> images = training images + validation images = 1422423 + 158047 = 1580470 (verified)
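These two counts can be predicted before running the generators. The sketch below assumes that the validation subset is the first `validation_split` fraction of the rows (truncated to an integer) and that the training subset is the remainder. Treat that as an assumption about the implementation rather than documented behavior. `split_counts` is a hypothetical helper.

```python
def split_counts(num_rows, validation_split):
    """Return (training_rows, validation_rows) for a row-order split."""
    validation_rows = int(validation_split * num_rows)
    return num_rows - validation_rows, validation_rows


train_rows, validation_rows = split_counts(1580470, 0.1)
print(train_rows, validation_rows)
```

If the split does follow row order rather than a random shuffle, the validation subset is only representative when the CSV rows are not sorted by class.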

Congratulations! Everything is working perfectly.

Now the question is: how do we train on this data?

The procedure is almost the same as usual, with some adjustments.

Let's create a super simple model.

Note that even such a simple model will have lots of parameters.

In [None]:
mymodel = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(64,64,3)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(filters=4, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(filters=4, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(filters=8, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="sigmoid"),
    tf.keras.layers.Dense(no_of_classes, activation="softmax"),
])
mymodel.summary()
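To see where the parameters come from, the counts can be worked out by hand. This sketch assumes the Keras defaults used above (no padding and stride 1 for the convolutions, pooling that floors odd sizes, one bias per filter and per unit). The helper names are made up for illustration.

```python
def conv_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a square 2D convolution."""
    return kernel_size * kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


no_of_classes = 81313  # class count found earlier in the notebook

# Spatial size: 64 -> pool 32 -> conv 30 -> pool 15 -> conv 13 -> pool 6 -> conv 4 -> pool 2
flattened = 2 * 2 * 8

layer_params = {
    "conv_1": conv_params(3, 3, 4),
    "conv_2": conv_params(3, 4, 4),
    "conv_3": conv_params(3, 4, 8),
    "dense_hidden": dense_params(flattened, 256),
    "dense_output": dense_params(256, no_of_classes),
}
total_params = sum(layer_params.values())
print(layer_params)
print(total_params)
```

Almost all of the parameters sit in the final `Dense` layer, because it needs one column of weights for each of the 81313 classes.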

## Then compile the model

In [None]:
mymodel.compile(optimizer="Adam", loss="categorical_crossentropy", metrics=['categorical_accuracy'])

## Finally! Train the model ...

Here the syntax is a little different:

* the first argument is the "train_generator" created in the previous sections

* epochs - the number of epochs to train your model

* steps_per_epoch - this should be equal to number of images // batch_size

    Avoid hardcoding these values; to avoid errors, use train_generator.samples to find the number of images

* validation_data - specify the validation generator object

* validation_steps - should be equal to (validation_generator.samples // batch_size)
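Note that floor division leaves out the last, partial batch of each epoch. Here is a small sketch of the difference, using the sample counts reported earlier. Whether to round down or up is a design choice, and `steps_for` is a hypothetical helper.

```python
import math


def steps_for(samples, batch_size, keep_partial_batch=False):
    """Number of batches per epoch, with or without the final partial batch."""
    if keep_partial_batch:
        return math.ceil(samples / batch_size)
    return samples // batch_size


batch_size = 32
train_steps = steps_for(1422423, batch_size)
train_steps_all = steps_for(1422423, batch_size, keep_partial_batch=True)
validation_steps = steps_for(158047, batch_size)
print(train_steps, train_steps_all, validation_steps)
```

With a batch size of 32, rounding down skips 23 training images per epoch, which is negligible here but can matter on small datasets.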



I have trained for only 1 epoch because training an accurate model is outside the scope of this notebook; that is something you have to figure out. This is a competition, after all. Also, even one epoch will take a long time to train.

## Convert this cell into a code cell to run
I am not running the following command because of the time it requires: with this much data, even 1 epoch takes about 3 hours to complete.

You can use the fit function as shown below to train the network:

mymodel.fit(train_generator, 
            epochs=1,
            steps_per_epoch=train_generator.samples//batch_size, 
            validation_data = validation_generator,
            validation_steps=validation_generator.samples//batch_size)

Congratulations!! You have learned how to train a simple classification model. But do note that this notebook is about loading data. You can reuse the data-loading part without any problem, but the model created in this notebook won't reach much accuracy here. Classifying landmarks into 81k classes is no joke, and it cannot be achieved with a simple classification model. You will have to implement things like DELF (DEep Local Features) or think of something new. But don't be discouraged: keep trying new things and pushing the boundaries of deep learning.



## ALL THE BEST FOR THE COMPETITION!!