# 1. Data Pre-Processing

This notebook contains my version of the image data pre-processing. The dataset I used to train the model is the [FUNSD Dataset](https://guillaumejaume.github.io/FUNSD/dataset.zip). I downloaded the dataset and saved it to my local directory.

The main goal of this notebook is to create a directory containing a degraded (low-quality) version of my training dataset, with each degraded image labeled by its good version (the original image).

In [27]:
# Importing things
import os
import time
import traceback
import numpy
from numpy import asarray
from pathlib import Path
from PIL import Image
from datetime import datetime
from tensorflow.keras.utils import img_to_array
from tensorflow.keras.preprocessing.image import load_img

Here we define some parameters:

| Parameter Name | Description |
|---|---|
| `SCALE` | Resizing scale factor |
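| `INPUT_DIM` | Side length, in pixels, of the square input patches |
| `LABEL_SIZE` | Side length, in pixels, of the square output (target) patches |
| `PAD` | Margin between the edge of an input patch and the smaller output patch centered inside it, computed as `(INPUT_DIM - LABEL_SIZE) / 2` |
| `STRIDE` | Number of pixels the patch window moves between extractions. On dzlab's GitHub, `STRIDE` is explained as *"the stride which is the number of pixels we'll slide both in the horizontal and vertical axes to extract patches"*. A larger `STRIDE` skips more pixels between patches, so fewer patches with less overlap are extracted. |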

In [28]:
SCALE = 2.0
INPUT_DIM = 33
LABEL_SIZE = 21
PAD = int((INPUT_DIM - LABEL_SIZE) / 2.0)
STRIDE = 14
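
To make these numbers concrete, here is a small sketch of the arithmetic behind `PAD` and of how many patches a given image size yields with this `STRIDE`. The helper `count_patches` and the image sizes are hypothetical and used only for illustration; the loop bounds are assumed to match the ones used later in this notebook.

```python
INPUT_DIM = 33
LABEL_SIZE = 21
STRIDE = 14

# Margin on each side of the target patch inside the input patch
PAD = (INPUT_DIM - LABEL_SIZE) // 2


def count_patches(height, width, input_dim=INPUT_DIM, stride=STRIDE):
    """Hypothetical helper: number of patch positions for an image size."""
    rows = len(range(0, height - input_dim + 1, stride))
    cols = len(range(0, width - input_dim + 1, stride))
    return rows * cols


print(PAD)                      # 6
print(count_patches(100, 100))  # 25 (5 rows x 5 columns)
```

An image smaller than `INPUT_DIM` on either side yields no patches at all, because the `range` is empty.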

### 1.1 Making support functions
This part defines the support functions that reduce the image quality of the dataset images.

The functions to be defined:

1. `resize_image`
As its name suggests, this function resizes the image by the specified factor.

To downsample the image, set the factor below 1, for example 1 / x or x / 100.
To upsample the image, set the factor above 1, for example x.

2. `convert_image_to_array`
As its name suggests, this function converts an image file to a NumPy array. You may notice that it contains only one line, but as someone still learning (me), I find that wrapping it in a named function makes the code easier to understand.

3. `downsize_upsize_image`
This function downsamples the image and then upsamples it back to its original size. Downsampling discards fine detail that upsampling cannot recover, so the result is a degraded version of the image.

4. `tight_crop_image`
This function crops the image so that its height and width are both divisible by the scale factor.

5. `crop_input`
This function slices a square patch of side `INPUT_DIM` out of the input image, starting at position (x, y).

6. `crop_output`
This function slices a square patch of side `LABEL_SIZE` out of the target image, offset by `PAD` from position (x, y).

7. `write_log`
This function appends a timestamped message to the log file of the current session.
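
To see why a downsample-then-upsample round trip degrades an image, here is a minimal NumPy-only sketch. It is not the PIL-based code used in this notebook: block averaging and pixel repetition stand in for `Image.resize`, and the helper names `downsample_mean` and `upsample_repeat` are made up for this example.

```python
import numpy as np


def downsample_mean(image, scale):
    """Shrink a 2D array by averaging each scale x scale block."""
    h, w = image.shape
    return image.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))


def upsample_repeat(image, scale):
    """Enlarge a 2D array by repeating each pixel scale times per axis."""
    return np.repeat(np.repeat(image, scale, axis=0), scale, axis=1)


# An 8x8 checkerboard of 0 and 255: the finest detail an image can hold
original = (np.indices((8, 8)).sum(axis=0) % 2) * 255.0

degraded = upsample_repeat(downsample_mean(original, 2), 2)
mean_abs_error = np.abs(original - degraded).mean()

print(degraded.shape)   # (8, 8)
print(mean_abs_error)   # 127.5
```

The size is back to 8x8, but every pixel is now a flat gray of 127.5: the checkerboard pattern was averaged away during downsampling and cannot be rebuilt by upsampling.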

In [29]:
# resize_image
def resize_image(image_array, factor):
    original_image = Image.fromarray(image_array)

    new_size = numpy.array(original_image.size) * factor
    new_size = new_size.astype(numpy.int32)
    new_size = tuple(new_size)

    resized = original_image.resize(new_size)
    resized = img_to_array(resized)
    resized = resized.astype(numpy.uint8)
    
    return resized


# convert image to array
# This function will convert an image to numpy array spatial data.
# @param
#  - str image_path
# 
# @return 
#  - numpy array
def convert_image_to_array(image_path):
    return asarray(Image.open(image_path))

def downsize_upsize_image(image, scale):
    scaled = resize_image(image, 1.0 / scale)
    # The reference passes `scale / 1.0` here. Dividing by 1.0 only converts
    # the value to a float, so passing `scale` directly gives the same result.
    scaled = resize_image(scaled, scale)

    return scaled

def tight_crop_image(image, scale):
    height, width = image.shape[:2]

    width -= int(width % scale)
    height -= int(height % scale)

    return image[:height, :width]

def crop_input(image, x, y):
    x_slice = slice(x, x + INPUT_DIM)
    y_slice = slice(y, y + INPUT_DIM)
    return image[y_slice, x_slice]

def crop_output(image, x, y):
    x_slice = slice(x + PAD, x + PAD + LABEL_SIZE)
    y_slice = slice(y + PAD, y + PAD + LABEL_SIZE)
    
    return image[y_slice, x_slice]

def write_log(log, level, session):
    # Appends one line in the form: [2023-10-01 00:00:00][INFO] Some message
    # Log files go in ../../logs
    # File name is the current session time: Notebook - Session <Ymd HM>.log
    log_dir = Path("../../logs")
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / ("Notebook - Session " + session + ".log")

    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")

    # Mode "a" creates the file if it is missing and appends otherwise
    with open(log_path, "a") as log_file:
        log_file.write("[" + timestamp + "]" + "[" + level + "] " + log + "\n")
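
As a quick sanity check on the cropping helpers, the sketch below runs `tight_crop_image`, `crop_input`, and `crop_output` on a small synthetic array and confirms that the target patch is the center of the input patch. The helpers are copied from the cell above so the example runs on its own; the array size and the patch position are arbitrary choices.

```python
import numpy as np

INPUT_DIM = 33
LABEL_SIZE = 21
PAD = (INPUT_DIM - LABEL_SIZE) // 2


def tight_crop_image(image, scale):
    height, width = image.shape[:2]
    width -= int(width % scale)
    height -= int(height % scale)
    return image[:height, :width]


def crop_input(image, x, y):
    return image[y:y + INPUT_DIM, x:x + INPUT_DIM]


def crop_output(image, x, y):
    return image[y + PAD:y + PAD + LABEL_SIZE, x + PAD:x + PAD + LABEL_SIZE]


# Synthetic 51 x 61 "image" whose pixel values are all distinct
image = np.arange(51 * 61).reshape(51, 61)

tight = tight_crop_image(image, 2.0)
patch = crop_input(tight, 14, 0)
label = crop_output(tight, 14, 0)

print(tight.shape)  # (50, 60)
print(patch.shape)  # (33, 33)
print(label.shape)  # (21, 21)

# The label is the central LABEL_SIZE region of the input patch
centered = np.array_equal(label, patch[PAD:PAD + LABEL_SIZE, PAD:PAD + LABEL_SIZE])
print(centered)     # True
```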


### 1.2 Making the bad and the good image version.
Since the code from dzlab is written mainly for Google Colab, and I'm working on Windows, I had to change how the code interacts with the image files.

The steps below explain my methodology:

1. Set the directory using `Path` from `pathlib`.
2. For every file within the directory:
    1. Pre-process the image using Keras `img_to_array`.
    2. Save the original image array to disk.
    3. Make low-resolution patches from the normal-quality image.
    4. Save the low-resolution patches and their target patches to disk.
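
The patch-extraction part of these steps can be sketched on a synthetic image as follows. This is only an illustration under several assumptions: the random array stands in for a real document image, `original // 2` is a placeholder for the degraded copy, and storing all patches of one image in a single `.npz` file is a substitute for the per-patch files written by this notebook. The names `extract_patches` and `example_patches.npz` are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

INPUT_DIM, LABEL_SIZE, STRIDE = 33, 21, 14
PAD = (INPUT_DIM - LABEL_SIZE) // 2


def extract_patches(degraded, original):
    """Collect aligned (input, target) patches from two same-sized images."""
    inputs, targets = [], []
    height, width = original.shape[:2]
    for y in range(0, height - INPUT_DIM + 1, STRIDE):
        for x in range(0, width - INPUT_DIM + 1, STRIDE):
            inputs.append(degraded[y:y + INPUT_DIM, x:x + INPUT_DIM])
            targets.append(original[y + PAD:y + PAD + LABEL_SIZE,
                                    x + PAD:x + PAD + LABEL_SIZE])
    return np.stack(inputs), np.stack(targets)


rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
degraded = original // 2  # placeholder, not a real low-resolution image

inputs, targets = extract_patches(degraded, original)
print(inputs.shape)   # (9, 33, 33, 3)
print(targets.shape)  # (9, 21, 21, 3)

# Write every patch of this image to one archive, then read it back
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "example_patches.npz"
    np.savez(out_path, inputs=inputs, targets=targets)
    with np.load(out_path) as loaded:
        loaded_inputs_shape = loaded["inputs"].shape
        loaded_targets_shape = loaded["targets"].shape
```

Whichever storage layout is used, every patch needs its own slot (an index in a stacked array here, or a unique file name), otherwise later patches replace earlier ones.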

In [30]:
from tqdm import tqdm

def load_image():
    directory = Path("E:\\New folder\\TrainTest")
    current_session = time.strftime("%Y%m%d %H%M")

    for file in tqdm(os.listdir(directory)):
        # Generate target filename (numpy.save appends ".npy" to it)
        file_name = datetime.now().strftime("%Y%m%d%H%M%S") + "_" + file + ".numpy"
        
        try:
            full_path = os.path.join(directory, file)

            # Step 2.1: load the image as a uint8 array
            image = img_to_array(load_img(full_path))
            image = image.astype(numpy.uint8)
            
            # Step 2.2: save the original image array
            numpy.save("../resources/np_image_original/original_" + file_name, image)

            # Step 2.3: crop to a size divisible by SCALE, then degrade
            image = tight_crop_image(image, SCALE)
            scaled = downsize_upsize_image(image, SCALE)

            height, width = image.shape[:2]

            # Step 2.4: save every input patch with its target patch
            for y in range(0, height - INPUT_DIM + 1, STRIDE):
                for x in range(0, width - INPUT_DIM + 1, STRIDE):
                    crop = crop_input(scaled, x, y)
                    target = crop_output(image, x, y)

                    # Put the patch position in the name; with one shared name
                    # each patch would overwrite the previous one
                    patch_name = file_name + "_x" + str(x) + "_y" + str(y) + ".np"
                    numpy.save("../resources/np_image_input/input_" + patch_name, crop)
                    numpy.save("../resources/np_image_output/target_" + patch_name, target)

            write_log("Successfully writing numpy array from " + file_name, "INFO", current_session)
        except Exception:
            write_log("Skipping writing numpy array from " + file_name + ": " + traceback.format_exc(), "ERROR", current_session)

        

In [31]:
# Execute the load_image command
load_image()

100%|██████████| 2000/2000 [2:00:58<00:00,  3.63s/it]  
