# Text Recognition with CTC

In [1]:
import tensorflow as tf
import cv2
import numpy as np
import matplotlib.pyplot as plt
import pickle
import pandas as pd
import re
from os.path import exists

In [None]:
import ctc_utils as utils
from importlib import reload 
import warnings

# Reload ctc_utils so that edits to the module are picked up
# without restarting the kernel
reload(utils)

In [2]:
img_size = (32, 128)


## Model

### CNN Part


The idea here is to split our image into a sequence of feature vectors, each of which can be interpreted as describing a smaller area of the picture.

<img src="imgs/cnn_result.png" />

To do so, we will use a succession of:
- **Convolution** to extract local features
- **Batch Normalization** to stabilize and speed up training (it also has a mild regularizing effect, which often reduces the need for Dropout)

#### Batch normalization

TLDR: mitigate **internal covariate shift** and make training faster and more stable.

**Normalization** rescales the numeric variables of a dataset to a common scale without distorting the differences within their ranges of values.

**Batch normalization** is a technique for training very deep neural networks that normalizes the inputs to a layer for every mini-batch.

In neural networks, the output of the first layer feeds into the second layer, the output of the second layer feeds into the third, and so on. When the parameters of a layer change, so does **the distribution of inputs to subsequent layers**.

> We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

These shifts in input distributions can be problematic for neural networks, especially deep neural networks that could have a large number of layers.

Batch normalization is a method intended to mitigate internal covariate shift for neural networks.

<img src="imgs/batch_normalization.jpeg" />

This stabilizes the learning process and can drastically decrease the number of training epochs required to train deep neural networks.
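
As a rough illustration of the normalization step, here is a minimal NumPy sketch of what batch normalization computes in training mode for one mini-batch. The function name `batch_norm` and the parameters `gamma`, `beta` and `eps` are illustrative choices; Keras's `BatchNormalization` layer additionally learns the scale and shift and tracks moving averages for inference, which this sketch leaves out.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

# Hypothetical mini-batch: 4 samples, 2 features on very different scales
batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])
normalized = batch_norm(batch)
print(normalized.mean(axis=0))  # close to 0 for each feature
print(normalized.std(axis=0))   # close to 1 for each feature
```

Both features end up on the same scale, whatever the scale of the incoming activations was.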

Sources: 
- https://machinelearning.wtf/terms/internal-covariate-shift/
- https://towardsdatascience.com/batch-normalisation-in-deep-neural-network-ce65dd9e8dbf#:~:text=Batch%20normalization%20solves%20a%20major,you%20can%20often%20remove%20dropout.
- https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766

#### Conv

<img src="imgs/conv_padding.gif" />

- Padding '**valid**' is the first figure. The filter window stays inside the image and no padding is added, so the output is smaller than the input. Some information can be lost: elements on the right and the bottom of the image that do not fit into a full window are ignored. How many elements are ignored depends on the size of the kernel and the stride.

- Padding '**same**' is the third figure. The input is zero-padded so that, with a stride of 1, the output has the same size as the input.
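
To make the difference concrete, the sketch below computes the output length along one axis for both padding modes, following the convention TensorFlow documents for its convolutions. The helper name `conv_output_size` is made up for this example.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one axis for a convolution (TensorFlow convention)."""
    if padding == "same":
        return math.ceil(input_size / stride)
    if padding == "valid":
        return math.ceil((input_size - kernel_size + 1) / stride)
    raise ValueError(f"unknown padding: {padding}")

# Width of our images (128) with a 5x5 kernel
print(conv_output_size(128, 5, padding="valid"))  # 124: the border is lost
print(conv_output_size(128, 5, padding="same"))   # 128: size is preserved
```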


#### Pooling/Flattening

**Pooling** merges neighbouring values into a single one (for example by keeping only their maximum). Its purpose is to **reduce the size of the data**.

<img src="imgs/pooling_flattening.png" />

**Flattening** is converting the data into a 1-dimensional array for inputting it to the next layer. We flatten the output of the convolutional layers to create a single long feature vector.
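
The two operations can be sketched in plain NumPy on a tiny, made-up feature map. The helper `max_pool_2x2` is hypothetical and only handles a single 2D array with even sides; Keras's `MaxPooling2D` does the same thing per channel and per sample.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (height, width) array with even sides."""
    h, w = feature_map.shape
    # Group the array into non-overlapping 2x2 blocks, then keep each block's maximum
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 8]])

pooled = max_pool_2x2(feature_map)  # 4x4 -> 2x2
flat = pooled.flatten()             # 2x2 -> vector of length 4
print(pooled)  # [[4 2]
               #  [2 8]]
print(flat)    # [4 2 2 8]
```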

Sources: 

- https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-convolutional-neural-network-3607be47480

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, BatchNormalization, MaxPooling2D, LeakyReLU, Dropout


hidden_layer_count = 256

model = tf.keras.Sequential()

# Layer 1
model.add(Conv2D(
        filters=32,
        kernel_size=(5,5),
        padding='SAME',
        input_shape = (img_size[0], img_size[1], 1)
    )
)
model.add(BatchNormalization())
model.add(LeakyReLU())
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

# Layer 2
model.add(Conv2D(filters=64, kernel_size=(5,5), padding='SAME'))
model.add(BatchNormalization())
model.add(LeakyReLU())
model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))

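# Layer 3
# From here on, pool only along the height (axis 1) so that the width,
# which is the reading direction, keeps its 32 steps
model.add(Conv2D(filters=128, kernel_size=(3,3), padding='SAME'))
model.add(BatchNormalization())
model.add(LeakyReLU())
model.add(MaxPooling2D(pool_size=(2,1), strides=(2,1)))

# Layer 4
model.add(Conv2D(filters=128, kernel_size=(3,3), padding='SAME'))
model.add(BatchNormalization())
model.add(LeakyReLU())
model.add(MaxPooling2D(pool_size=(2,1), strides=(2,1)))

# Layer 5
model.add(Conv2D(filters=256, kernel_size=(3,3), padding='SAME'))
model.add(BatchNormalization())
model.add(LeakyReLU())
model.add(MaxPooling2D(pool_size=(2,1), strides=(2,1)))

# The output is now (1, 32, 256) = (height, width, channels).
# Put the width first so that it becomes the time axis and the
# collapsed height ends up on axis 2: (32, 1, 256)
model.add(tf.keras.layers.Permute((2, 1, 3)))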


At this point, we have cut our input image into X smaller parts. Those parts can be read as a time series.

Indeed, by construction there is a strong connection between one part and the next (a word is a succession of letters).

We can therefore now make use of an **RNN** in our model.

NB: a part does not correspond exactly to a letter; that is only the intuition behind it. It could be interesting to study the impact of the size of each part (the number of cuts).
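
The number of cuts is set by how much the pooling layers shrink the width. The sketch below compares two hypothetical pooling setups for a `(32, 128)` image, assuming 'same' convolutions (which keep the size) and non-overlapping pooling. The helper `feature_map_size` and both pooling lists are illustrative. Keep in mind that CTC needs at least as many time steps as there are characters in the label, so too few cuts would make longer words impossible to predict.

```python
def feature_map_size(input_size, pool_sizes):
    """Height and width left after a stack of 'same' convolutions and max pools."""
    height, width = input_size
    for pool_h, pool_w in pool_sizes:
        # 'same' convolutions keep the size; each pooling divides it
        height //= pool_h
        width //= pool_w
    return height, width

img_size = (32, 128)  # (height, width)

# Hypothetical setup A: the last three poolings shrink the width
pools_a = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2)]
# Hypothetical setup B: the last three poolings shrink the height instead
pools_b = [(2, 2), (2, 2), (2, 1), (2, 1), (2, 1)]

print(feature_map_size(img_size, pools_a))  # (8, 4): only 4 parts along the width
print(feature_map_size(img_size, pools_b))  # (1, 32): 32 parts along the width
```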


### RNN part

In [None]:
from tensorflow.keras.layers import GRU, Bidirectional, Dense, Lambda

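# Remove axis 2 (it must have size 1 here) to get a (time, features) sequence
model.add(Lambda(lambda x: tf.squeeze(x, axis=2)))
# Bidirectional RNN
model.add(Bidirectional(GRU(hidden_layer_count, return_sequences=True)))
# One logit per class at every time step (100 classes, including the CTC blank)
model.add(Dense(100))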
model.summary()

TODO: would an LSTM be a better choice than the GRU?

## Connectionist Temporal Classification

In [None]:
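def encode_labels(labels, charList):
    """Map every character of every label to its index in charList (sparse output)."""
    # Hash table: character -> id, unknown characters map to -1
    table = tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(
            charList,
            np.arange(len(charList)),
            value_dtype=tf.int32
        ),
        -1,
        name='char2id'
    )
    # Split each label into single characters, then look each one up
    return table.lookup(
        tf.compat.v1.string_split(labels, delimiter=''))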

a = encode_labels(y_train[:5], charList)
tf.sparse.to_dense(a)

dataset = tf.data.Dataset.from_tensor_slices((np.expand_dims(X_train,-1), y_train))
dataset = dataset.shuffle(1000).batch(64)
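
The encoding performed by `encode_labels` can be pictured with a plain-Python equivalent for a single word. The helper `encode_label` and the alphabet below are made up for illustration; the TensorFlow version above does the same lookup for a whole batch and returns a sparse tensor. Characters missing from the alphabet map to -1, so it is worth checking for them before training.

```python
def encode_label(word, char_list, unknown_id=-1):
    """Return the index in char_list of every character of word."""
    char2id = {c: i for i, c in enumerate(char_list)}
    return [char2id.get(c, unknown_id) for c in word]

# Hypothetical alphabet: lowercase letters only
char_list = list("abcdefghijklmnopqrstuvwxyz")

print(encode_label("cab", char_list))  # [2, 0, 1]
print(encode_label("a-b", char_list))  # [0, -1, 1]: '-' is not in the alphabet
```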

In [None]:
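def loss(labels, logits):
    """Mean CTC loss over the batch; labels is the sparse output of encode_labels."""
    return tf.reduce_mean(
            tf.nn.ctc_loss(
                labels = labels,
                logits = logits,
                # Every sample uses all the time steps of the model output
                logit_length = [logits.shape[1]]*logits.shape[0],
                # Not needed when labels is a SparseTensor
                label_length = None,
                # logits are (batch, time, classes)
                logits_time_major = False,
                # The last class is the CTC blank
                blank_index=-1
            )
        )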

def train_op(model, inputs, targets):
    with tf.GradientTape() as tape:
        # Forward pass: prediction of our model
        y_pred = model(inputs, training=True)
        # Compute the error of our model (loss already averages over the batch)
        loss_value = loss(targets, y_pred)

    # Compute the gradient of the loss function
    grads = tape.gradient(loss_value, model.trainable_variables)
    # Gradient descent step
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Return the value of the loss function
    return loss_value.numpy()
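
To see what the blank class is for, here is a minimal sketch of the CTC collapsing rule, as used by greedy decoding: take the most likely class at each time step, merge consecutive repeats, then drop the blanks. The helper `ctc_greedy_collapse`, the three-letter alphabet and the frame values are hypothetical; the blank is placed last to match `blank_index=-1` above.

```python
from itertools import groupby

def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeats, then drop blanks (CTC collapsing rule)."""
    merged = [key for key, _ in groupby(frame_ids)]
    return [i for i in merged if i != blank_id]

char_list = ["a", "b", "c"]  # hypothetical alphabet
blank_id = len(char_list)    # the blank is the last class

# Hypothetical per-frame argmax of the logits for 8 time steps
frames = [0, 0, 3, 0, 1, 1, 3, 2]
decoded = ctc_greedy_collapse(frames, blank_id)
print("".join(char_list[i] for i in decoded))  # aabc
```

The blank between the first two runs of `a` is what allows the double letter to survive the merge.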