# Introduction

The goal of this notebook is to classify an image of a handwritten letter into one of 33 categories/letters of the Russian alphabet using deep learning technologies (computer vision).

References:
- The dataset was prepared and uploaded by [Olga Belitskaya](https://www.kaggle.com/olgabelitskaya) and includes a total of 14190 images.

Letter Symbols => Letter Labels:

а=>1, б=>2, в=>3, г=>4, д=>5, е=>6, ё=>7, ж=>8, з=>9, и=>10,
й=>11, к=>12, л=>13, м=>14, н=>15, о=>16, п=>17, р=>18, с=>19, т=>20,
у=>21, ф=>22, х=>23, ц=>24, ч=>25, ш=>26, щ=>27, ъ=>28, ы=>29, ь=>30,
э=>31, ю=>32, я=>33

- Most of the explanations are borrowed from the [Kaggle *Deep Learning* course](https://www.kaggle.com/learn/deep-learning) and this [kernel](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6/output#Introduction-to-CNN-Keras---Acc-0.997-(top-8%)).

I will develop a Sequential Convolutional Neural Network (CNN) for this project. I've chosen to build it with the Keras API (TensorFlow backend), which is very intuitive. First I will prepare the data (images of handwritten letters), then I will focus on CNN modeling and evaluation.

This Notebook has four main parts:

1. Data preparation
2. Model creation and tuning
3. Model evaluation
4. Model prediction on test data

The logbook of this project can be found [here](https://docs.google.com/spreadsheets/d/15L4IlWvsdMmVphFHvqlhz3lmE25VTatBQBejyFZUyK0/edit?usp=sharing).
It includes the following tabs:
* Time spent on the project
* Ideas to improve the model
* Logbook of different configurations I tested
* Lessons learnt from the project

The **GitHub repository** of this project is [here](https://github.com/TatianaSnauwaert/Deep_RU_letters).

# 1. Data Preparation
## 1.1 Load libraries

In [None]:
# load the libraries 

import sys
import seaborn as sns
import numpy as np
np.set_printoptions(threshold=sys.maxsize)
import pandas as pd
import os
import h5py
import PIL
import cv2
import tensorflow as tf
import tensorflow.keras as keras

from PIL import Image

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

from tensorflow.keras.preprocessing import image
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator

import matplotlib.pylab as plt
from matplotlib import cm
%matplotlib inline
import matplotlib.pylab as pylab
pylab.rcParams["figure.figsize"] = (14,8)

## 1.2 Load input folder and see an example of input data

In [None]:
input_folder = '/kaggle/input/russian-handwritten-letters/all_letters_image/all_letters_image/'
all_letters_filename = os.listdir(input_folder)
len(all_letters_filename)

In [None]:
i = Image.open("/kaggle/input/russian-handwritten-letters/all_letters_image/all_letters_image/20_102.png")
i

This is one of our images. Each image has a size of 32 by 32 pixels. We then convert each image into a 3D NumPy array.

In [None]:
i_arr = np.array(i)
i_arr

* All 32 matrices inside this array represent one image. 
* Each matrix represents 1 line of this image. 
* One line of the image is 32 pixels long, so each matrix has 32 rows. 
* Each row inside a matrix has 4 columns and represents 1 pixel. For that pixel, the first three columns hold the color values (how red, green and blue it is) and the last column holds the opacity.  

That is why each matrix is 32 by 4. The total number of pixels inside one image is 32 * 32 = 1024.

Each color value lies in the range [0, 255], which means there are 256 shades of each color. In total, the combinations of these colors give us 256^3 = 16,777,216 possible colors.
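
To make the layout described above concrete, here is a minimal sketch that builds a synthetic RGBA image of the same size and checks the numbers quoted in the text. The array `fake_img` is filled with random values and is only a stand-in for a real dataset image.

```python
import numpy as np

# Synthetic stand-in for one 32x32 RGBA image (random values, for illustration only)
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(32, 32, 4), dtype=np.uint8)

n_rows, n_cols, n_channels = fake_img.shape
n_pixels = n_rows * n_cols   # 32 * 32
n_colors = 256 ** 3          # combinations of red, green and blue

print(fake_img.shape)   # (rows, pixels per row, channels)
print(fake_img[0, 0])   # one pixel: [red, green, blue, alpha]
print(n_pixels, n_colors)
```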

## 1.3 Convert images to tensors

In [None]:
# Helper functions to preprocess images into a tensor.
# We use the default RGB mode instead of RGBA,
# as the opacity doesn't seem to be important in this task.

def img_to_array(img_name, input_folder):
    """Load one image as RGB and return it as an array of shape (1, 32, 32, 3)."""
    img = image.load_img(input_folder + img_name, target_size=(32, 32))
    x = image.img_to_array(img)
    # Add a leading batch dimension so the arrays can be stacked later
    return np.expand_dims(x, axis=0)


def data_to_tensor(img_names, input_folder):
    """Stack all images into one tensor of shape (number of images, 32, 32, 3)."""
    list_of_tensors = [img_to_array(img_name, input_folder) for img_name in img_names]
    return np.vstack(list_of_tensors)

In [None]:
data = pd.read_csv("../input/russian-handwritten-letters/all_letters_info.csv")
image_names = data['file']
letters = data['letter']
backgrounds = data['background'].values
targets = data['label'].values
tensors = data_to_tensor(image_names, input_folder)
tensors[0]

In [None]:
# Print the shape 
print ('Tensor shape:', tensors.shape)
print ('Target shape', targets.shape)

In [None]:
# Read from files and display images using OpenCV
def display_images(img_path, ax):
    img = cv2.imread(input_folder + img_path)
    ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    
fig = plt.figure(figsize=(16, 4))
for i in range(12):
    ax = fig.add_subplot(2, 6, i + 1, xticks=[], yticks=[], title=letters[i*100])
    display_images(image_names[i*100], ax)

Let's see the labels distribution.

In [None]:
g = sns.countplot(x=targets)

The classes are perfectly balanced, which is very important for a classification model. If classes are imbalanced, the model tends to maximize the accuracy on the majority classes while neglecting the others, leading to less accurate predictions for minority classes. 
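
Besides reading the plot, the balance can also be checked numerically. The sketch below uses a hypothetical helper `class_counts` and a small toy label array; in this notebook the same call could be applied to `targets`.

```python
import numpy as np

def class_counts(labels):
    """Return a dict mapping each label to the number of samples with that label."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), counts.tolist()))

# Toy labels for illustration; replace with `targets` in the notebook
toy_targets = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
counts = class_counts(toy_targets)

# The classes are balanced when every label has the same count
is_balanced = len(set(counts.values())) == 1
print(counts, is_balanced)
```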

## 1.4 Data preprocessing
### 1.4.1 Normalization
We normalize the data to reduce the effect of illumination differences.

Moreover, a CNN converges faster on data in [0, 1] than on data in [0, 255].
We convert the input data to the float type and then divide by 255 (the maximum brightness of each color). 

In [None]:
X = tensors.astype("float32")/255

In [None]:
arr = X[0]
arr_ = np.squeeze(arr)
plt.imshow(arr_)
plt.show()

In [None]:
targets[0]

In [None]:
y = targets

img_rows, img_cols = 32, 32 # because our pictures are 32 by 32 pixels
num_classes = 33 # because there are 33 letters in the Russian alphabet

# targets - 1 because our labels start at 1 and not at 0 as expected by Keras
y = keras.utils.to_categorical(y - 1, num_classes)

In [None]:
print(X.shape)
print(y.shape)

### 1.4.2 Letter detection to remove background and other noise

The function below is adapted from [an answer to a Stack Overflow question](https://stackoverflow.com/questions/24385714/detect-text-region-in-image-using-opencv).

P.S.: In the end this algorithm did not work for this dataset (see below).

In [None]:
def captch_ex(file_name):
    img = cv2.imread(file_name)
    img_final = cv2.imread(file_name)
    img2gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, mask = cv2.threshold(img2gray, 180, 255, cv2.THRESH_BINARY)
    image_final = cv2.bitwise_and(img2gray, img2gray, mask=mask)
    ret, new_img = cv2.threshold(image_final, 180, 255, cv2.THRESH_BINARY)  # for black text , cv.THRESH_BINARY_INV

    # Remove the noisy portion.
    # The kernel shape controls the direction of dilation:
    # a larger x dilates more horizontally, a larger y dilates more vertically.
    kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
    dilated = cv2.dilate(new_img, kernel, iterations=9)  # more iterations mean more dilation

    # OpenCV 4.x returns two values here (OpenCV 3.x returned three)
    contours, hierarchy = cv2.findContours(new_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

    for contour in contours:
        # get rectangle bounding contour
        [x, y, w, h] = cv2.boundingRect(contour)

        # Don't plot small false positives that aren't text
        if w < 35 and h < 35:
            continue

        # draw rectangle around contour on original image
        cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 255), 2)

        '''
        #you can crop image and send to OCR  , false detected will return no text :)
        cropped = img_final[y :y +  h , x : x + w]

        s = file_name + '/crop_' + str(index) + '.png' 
        cv2.imwrite(s , cropped)
        index = index + 1

        '''
    # display the original image with the added contours
    cv2.imshow('captcha_result', img)
    cv2.waitKey()


In [None]:
file_name = '/kaggle/input/russian-handwritten-letters/all_letters_image/all_letters_image/04_100.png'
# captch_ex(file_name)

It looks like this algorithm is unable to identify the contours of a letter, as it returns None for this line:

contours, hierarchy = cv2.findContours(new_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)  

After careful examination I found that the pixel intensities of a letter do not differ significantly from those of the rest of the image. I suspect this is the reason for the letter detection failure. A different algorithm could be tried later.

Next I will try to center the letters and crop the images.

### 1.4.3 Greyscale images

Converting the images to greyscale didn't improve the model; it even worsened the validation and training scores by a few percent. I decided to go back to RGB in the final model.

In [None]:
# Grayscaled tensors
X_grey = np.dot(X[...,:3], [0.299, 0.587, 0.114])
# X_grey = tf.expand_dims(X_grey, axis=3)
print ('Grayscaled Tensor shape:', X_grey.shape)

In [None]:
plt.imshow(X_grey[0], cmap=plt.get_cmap("gray"))

# 2. Model creation and tuning
## 2.1 Train_test split

In [None]:
# Split the data into train, validation and test sets.

X_train_whole, X_test, y_train_whole, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train_whole, y_train_whole, test_size=0.1, random_state=1)
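
Because the second split is taken from the 90% that remains after the first one, the resulting shares are not exactly 80/10/10. The short calculation below, which only assumes the two `test_size=0.1` values used above, works them out.

```python
test_fraction = 0.1   # first split: held out for testing
val_fraction = 0.1    # second split: taken from the remaining data

train_whole_share = 1.0 - test_fraction
val_share = train_whole_share * val_fraction
train_share = train_whole_share - val_share

print(f"train: {train_share:.0%}, validation: {val_share:.0%}, test: {test_fraction:.0%}")
```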

## 2.2 Data augmentation

A very straightforward way to understand why data augmentation works is by thinking of it as a way to artificially expand our dataset. As is the case with deep learning applications, the more data, the merrier.

Each time the neural network sees the same image, it's a bit different due to the stochastic data augmentation being applied to it.  This difference can be seen as noise being added to our data sample each time, and this noise forces the neural network to learn generalised features instead of overfitting on the dataset.

### 2.2.1 Horizontal flipping of part of the dataset

Horizontal flipping is a data augmentation technique. Usually part of the data is randomly flipped horizontally to provide a wider set of images, so that a model can better learn the patterns. Our dataset contains letters, so not all of them can be flipped. If we flip letters that are not symmetrical, it will confuse the algorithm, as the flipped letter won't be recognizable.

Below I have tried to flip only symmetrical letters. The full list includes 10 letters: ж=>8, л=>13, м=>14, н=>15, о=>16, п=>17, т=>20, ф=>22, х=>23, ш=>26.

For that purpose I selected all the rows of the training set that correspond to these 10 letters and flipped these images with a TensorFlow function.

In [None]:
# flip_labels = [8,13,14,15,16,17,20,22,23,26]
# flip_labels = 8

In [None]:
# mask = np.isin(y_train, flip_labels)
# X_train_to_flip = X_train[mask]
# flipped_y_train = y_train[mask]
# len(flipped_y_train)
# flipped_X_train = tf.image.flip_left_right(X_train_to_flip)

In [None]:
# plt.imshow(X_train_to_flip[15])
# plt.show()

In [None]:
# plt.imshow(flipped_X_train[15])
# plt.show()

In [None]:
# aug_X_train = np.concatenate((X_train, flipped_X_train), axis=0)
# len(aug_X_train)

In [None]:
# aug_y_train = np.concatenate((y_train, flipped_y_train), axis=0)
# len(aug_y_train)

We transform our labels into categorical data using Keras tools:
one-hot encoding is done automatically, turning the target column into 33 columns, one for each letter. 

Each column contains ones for the samples of that column's letter and zeros for all others. 
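
As an illustration of what one-hot encoding produces, here is a minimal NumPy sketch. The helper `one_hot` is hypothetical and only mimics the result of `keras.utils.to_categorical` for integer labels; it is not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """One-hot encode integer labels that start at 0."""
    labels = np.asarray(labels)
    return np.eye(num_classes, dtype="float32")[labels]

# Labels in this dataset start at 1, so subtract 1 before encoding
toy_labels = np.array([1, 3, 33])
encoded = one_hot(toy_labels - 1, num_classes=33)

print(encoded.shape)    # one row per sample, one column per letter
print(encoded[1][:5])   # the letter with label 3 has its 1 in column index 2
```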

In [None]:
# img_rows, img_cols = 32, 32 # because our pictures are 32 by 32 pixels
# num_classes = 33 # because there are 33 letters in the Russian alphabet
# aug_y_train = keras.utils.to_categorical(aug_y_train-1, num_classes) # targets-1 because our labels start at 1 and not at 0 as expected by Keras

After testing this change on the training and validation data I found that the training score had increased slightly but the validation score had dropped significantly, from 87.8% to 0.35%. I then tested flipping only 1 letter, and the validation score dropped by only a few percent. I concluded that augmenting the data for only some classes in the dataset caused a class imbalance issue and consequently such a dramatic drop in the validation score. The model wasn't generalizing well.

A lot of time was spent on this idea, but in the end I had to remove it from the model. 
Note for the future: think an idea through before implementing it - what are the possible outcomes and how much time might it take?

### 2.2.2 Automatic random data augmentation

Below I will use a built-in Keras class (ImageDataGenerator) to perform different kinds of random data augmentation.

By applying just a couple of these transformations to our training data, we can easily double or triple the number of training examples and create a very robust model.

In [None]:
datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample/image mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False)  # randomly flip images

# grey_X_train = tf.expand_dims(X_train, axis=3)
# grey_X_val = tf.expand_dims(X_val, axis=3)
# grey_X_test = tf.expand_dims(X_test, axis=3)

datagen.fit(X_train)

For the data augmentation, I chose to:

* Randomly rotate some training images by up to 10 degrees;
* Randomly zoom some training images by up to 10%;
* Randomly shift images horizontally by up to 10% of the width;
* Randomly shift images vertically by up to 10% of the height.
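
To give an intuition for the shift transformations in this list, here is a rough NumPy sketch of a random shift of up to 10% of the image size. The function `random_shift` is hypothetical and is not how ImageDataGenerator is implemented; in particular, filling the uncovered border with edge values is an assumption made for this example.

```python
import numpy as np

def random_shift(img, max_fraction=0.1, rng=None):
    """Shift an (H, W, C) image by a random offset of up to max_fraction of its size."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    max_dy, max_dx = int(h * max_fraction), int(w * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))   # vertical offset in pixels
    dx = int(rng.integers(-max_dx, max_dx + 1))   # horizontal offset in pixels
    # Pad with edge values, then cut out a window displaced by (dy, dx)
    padded = np.pad(img, ((max_dy, max_dy), (max_dx, max_dx), (0, 0)), mode="edge")
    top, left = max_dy - dy, max_dx - dx
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3)).astype("float32")
shifted = random_shift(img, max_fraction=0.1, rng=rng)
print(shifted.shape)   # same shape as the input image
```

For a 32 by 32 image, 10% corresponds to an offset of at most 3 whole pixels in each direction in this sketch.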

## 2.3 Define the model

First we always instantiate the Sequential model. We then add the first Conv2D layer, in which we specify the number of filters/convolutions and the kernel_size (the shape of the convolution itself: an integer if height and width are the same, or a tuple). 

We need to specify an activation function as well.

Rectified Linear Units (ReLU) is the most commonly used activation function in deep learning models. 

f(x)=max(0,x) 

Activation functions serve two primary purposes: 1) Help a model account for interaction effects. 2) Help a model account for non-linear effects (the effect of increasing the predictor by one is different at different values of that predictor). 

[ReLU kaggle](https://www.kaggle.com/dansbecker/rectified-linear-units-relu-in-deep-learning)
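
The formula f(x)=max(0,x) translates directly into code. The following NumPy sketch (the function name `relu` is chosen here for illustration) applies it element-wise.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: negative inputs become 0, positive inputs pass through."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
```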

We must also specify the input_shape only in the first layer (number of pixel rows, pixel columns and number of color channels). 

The second important layer in a CNN is the pooling (MaxPooling2D) layer. This layer simply acts as a downsampling filter. With pool_size=(2,2) it looks at each 2 by 2 block of neighboring pixels and picks the maximum value. Pooling layers are used to reduce the computational cost and, to some extent, to reduce overfitting. We have to choose the pooling size (i.e. the area pooled each time): the higher the pooling dimension, the more significant the downsampling.
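
As a small illustration of the downsampling, here is a NumPy sketch of 2 by 2 max pooling with stride 2 on a single-channel array. The helper `max_pool_2x2` is hypothetical and assumes even height and width; it is not the Keras layer itself.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a single-channel (H, W) array with even H and W."""
    h, w = feature_map.shape
    # Split the array into non-overlapping 2x2 blocks and take the maximum of each
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
print(pooled)   # each value is the maximum of one 2x2 block
```

Height and width are both halved, so a 32 by 32 feature map becomes 16 by 16.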

Dropout is a regularization method in which a proportion of the nodes in a layer is randomly ignored (their outputs are set to zero) for each training sample. This randomly drops a proportion of the network and forces the network to learn features in a distributed way. This technique also improves generalization and reduces overfitting.
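
The idea can be sketched in a few lines of NumPy. This is a simplified illustration with a hypothetical `dropout` function, not the Keras layer: it zeroes a random fraction of the values and rescales the remaining ones by 1 / (1 - rate), so that the expected sum stays the same. Dropout is only active during training.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of values and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)
dropped = dropout(activations, rate=0.25, rng=rng)

print((dropped == 0).mean())   # fraction of zeroed values, close to the rate
print(dropped.mean())          # rescaling keeps the average close to the original
```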

Then we add as many layers as we like, followed by the Flatten layer (which reduces the number of dimensions from 4D to 2D).
At the end we add the Dense layer, which connects all the neurons (unlike the previous layers, where not all neurons are connected) and is basically just an artificial neural network (ANN) classifier.

In [None]:
# Define the model architecture

deep_RU_model = Sequential()

deep_RU_model.add(Conv2D(filters=32, kernel_size=(5, 5), padding='same',
                         activation='relu', input_shape=(img_rows, img_cols, 3)))
deep_RU_model.add(Conv2D(filters=32, kernel_size=(5, 5), padding='same',
                         activation='relu'))
deep_RU_model.add(MaxPooling2D(pool_size=(2, 2)))
deep_RU_model.add(Dropout(0.25))


deep_RU_model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='same',
                         activation='relu'))
deep_RU_model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='same',
                         activation='relu'))
deep_RU_model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
deep_RU_model.add(Dropout(0.25))


deep_RU_model.add(Flatten())
deep_RU_model.add(Dense(256, activation="relu"))
deep_RU_model.add(Dropout(0.5))
deep_RU_model.add(Dense(33, activation="softmax"))
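
To see how the tensor shapes and the number of trainable parameters follow from this architecture, the plain-Python sketch below redoes the arithmetic by hand. The helpers `conv_params` and `dense_params` are hypothetical names introduced here; the result can be compared against the output of `deep_RU_model.summary()`.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return inputs * units + units

side = 32                      # 'same' padding keeps height and width unchanged
params = conv_params(5, 3, 32) + conv_params(5, 32, 32)
side //= 2                     # first 2x2 max pooling: 32 -> 16
params += conv_params(3, 32, 64) + conv_params(3, 64, 64)
side //= 2                     # second 2x2 max pooling: 16 -> 8
flat_features = side * side * 64   # size of the vector produced by Flatten
params += dense_params(flat_features, 256) + dense_params(256, 33)

print(flat_features, params)
```

Most of the parameters sit in the first Dense layer, which connects all flattened features to 256 units.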

## 2.4 Compile the model
In this section we define **HOW** the model gets its weights. 
In other words we configure the model for training. 

3 important concepts:
- loss function
- gradient descent
- backward propagation

The loss function measures how good our model's predictions are:

Loss = f(actual, prediction)

We try to minimize the loss: the closer the prediction is to the actual value, the lower the loss.
The value of the loss function changes when we change the weights. We use the "categorical_crossentropy" loss function for multiclass (>2 classes) classification.
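
As an illustration, categorical cross-entropy can be written in NumPy as follows. This is a simplified sketch with a hypothetical function name and clipping constant, not the Keras implementation; it shows that a confident correct prediction gets a lower loss than a wrong one.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[0.0, 1.0, 0.0]])           # the actual class is the second one
good_pred = np.array([[0.05, 0.90, 0.05]])     # confident and correct
bad_pred = np.array([[0.70, 0.20, 0.10]])      # confident and wrong

print(categorical_crossentropy(y_true, good_pred))  # small loss
print(categorical_crossentropy(y_true, bad_pred))   # larger loss
```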

The model finds the best weights through the "gradient descent": it takes one step and predicts in which direction it goes downhill the fastest, then it goes that direction one more step and repeats until it can't go down anymore.
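
The "walking downhill" idea can be shown on a toy problem with a single weight. The sketch below is an illustration only (the function name, the toy loss and the step settings are assumptions made for this example), not the optimizer used by Keras.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3
w_final = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(w_final)
```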

To see which way is downhill, we use "backward propagation".
First consider the weights in the layer before the final Dense layer. Since we know the actual label of each image in the training data, we slightly increase the weights that lead to the correct label and slightly decrease the weights that lead to the wrong label. Then we continue the same way, going back layer by layer until we reach the first layer after the input data.

The size of each weight change is determined by the learning rate.
optimizer="adam" is a variation of gradient descent that adapts the learning rate automatically during training.

The most important function is the optimizer. This function iteratively improves the parameters (filter kernel values, weights and biases of neurons, ...) in order to minimise the loss.

I chose RMSprop (with default values), it is a very effective optimizer. The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. We could also have used Stochastic Gradient Descent ('sgd') optimizer, but it is slower than RMSprop.

A metric is a function that is used to judge the performance of a model. 

Accuracy is the fraction of predictions that our model got right = Number of correct predictions / Total number of predictions.

In [None]:
# Define the optimizer
optimizer = RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-08)

In [None]:
# Compile the model: 

deep_RU_model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

**Annealing method of the learning rate**

In order to make the optimizer converge faster and closer to the global minimum of the loss function, we can use an annealing method for the learning rate (LR).

The LR is the step by which the optimizer walks through the 'loss landscape'. The higher the LR, the bigger the steps and the quicker the convergence. However, the sampling is very poor with a high LR and the optimizer could fall into a local minimum.

It's better to have a learning rate that decreases during training to reach the global minimum of the loss function efficiently.

To keep the advantage of the fast computation time with a high LR, I decrease the LR dynamically, but only when it is necessary (when the accuracy has stopped improving).

With the ReduceLROnPlateau callback from Keras, I chose to halve the LR if the validation accuracy has not improved after 3 epochs.
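
The rule described above can be sketched in plain Python. This is a simplified illustration with a hypothetical function and made-up accuracy values, not the actual Keras implementation (which also supports options such as `cooldown` and `min_delta`).

```python
def reduce_lr_on_plateau(val_accuracies, lr=0.001, factor=0.5, patience=3, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best = float("-inf")
    wait = 0
    schedule = []
    for acc in val_accuracies:
        if acc > best:
            best, wait = acc, 0          # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:         # no improvement for `patience` epochs
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule

# Made-up validation accuracies with a plateau at 0.6
schedule = reduce_lr_on_plateau([0.5, 0.6, 0.6, 0.6, 0.6, 0.7])
print(schedule)
```

In this toy run the learning rate is halved once, after the third consecutive epoch without improvement.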

In [None]:
# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)

## 2.5 Early stopping and Model Checkpoint

Reference:
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. **Early stopping** is a method that allows you to specify an arbitrarily large number of training epochs and stop training once the model performance stops improving on a hold-out validation dataset.

Keras supports the early stopping of training via a callback called EarlyStopping.

This callback allows you to specify the performance measure to monitor, the trigger, and once triggered, it will stop the training process.

Often, the first sign of no further improvement may not be the best time to stop training. This is because the model may coast into a plateau of no improvement or even get slightly worse before getting much better.

We can account for this by adding a delay to the trigger in terms of the number of epochs on which we would like to see no improvement. This can be done by setting the “patience” argument.

The EarlyStopping callback will stop training once triggered, but the model at the end of training may not be the model with best performance on the validation dataset.

An additional callback is required that will save the best model observed during training for later use. This is the **ModelCheckpoint** callback.

The ModelCheckpoint callback is flexible in the way it can be used, but in this case we will use it only to save the best model observed during training as defined by a chosen performance measure on the validation dataset.

Saving and loading models requires that HDF5 support has been installed on your workstation. 
It may be interesting to know the value of the performance measure and at what epoch the model was saved. This can be printed by the callback by setting the “verbose” argument to “1“.
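
How the two callbacks work together can be sketched in plain Python. The function below is a hypothetical simplification with made-up accuracy values: it remembers the best epoch (the role of ModelCheckpoint with save_best_only=True) and stops after `patience` epochs without improvement (the role of EarlyStopping).

```python
def train_with_early_stopping(val_accuracies, patience):
    """Return (epoch at which training stops, best epoch, best validation accuracy)."""
    best_acc, best_epoch, wait = float("-inf"), None, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            # Improvement: this is where the best model would be saved
            best_acc, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch, best_acc
    return len(val_accuracies) - 1, best_epoch, best_acc

# Made-up validation accuracies; the late jump to 0.9 is never reached with patience=2
stopped_at, best_epoch, best_acc = train_with_early_stopping([0.5, 0.7, 0.6, 0.65, 0.9], patience=2)
print(stopped_at, best_epoch, best_acc)
```

The toy run also shows why a small patience can stop training too early, which is the reason for choosing a generous value.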

In [None]:
es = EarlyStopping(monitor='val_accuracy', mode='max', verbose=1, patience=50)
mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)

## 2.6 Fit the model

The model doesn't use all of the data for each step; it only takes part of it. This can be regulated with the batch_size parameter.
One pass through the whole training data is called an epoch.

In [None]:
history = deep_RU_model.fit(datagen.flow(X_train,y_train, batch_size=90), validation_data = (X_val, y_val),
                            epochs=139, callbacks=[learning_rate_reduction, es, mc])

# 3. Model evaluation
### Training and validation curves

In [None]:
# load the saved model
# saved_model = load_model('/kaggle/input/deep-ru-letters-cnn-tutorial/best_model.h5')
saved_model = load_model('best_model.h5')

# evaluate the model
_, train_acc = saved_model.evaluate(X_train, y_train, verbose=0)
_, valid_acc = saved_model.evaluate(X_val, y_val, verbose=0)

print('Train: %.3f, Valid: %.3f' % (train_acc, valid_acc))

In [None]:
# Plot the loss and accuracy curves for training and validation 
fig, ax = plt.subplots(2, 1)
ax[0].plot(history.history['loss'], color='b', label="Training loss")
ax[0].plot(history.history['val_loss'], color='r', label="Validation loss")
legend = ax[0].legend(loc='best', shadow=True)

ax[1].plot(history.history['accuracy'], color='b', label="Training accuracy")
ax[1].plot(history.history['val_accuracy'], color='r', label="Validation accuracy")
legend = ax[1].legend(loc='best', shadow=True)

The model reaches almost 96% accuracy on the validation dataset after 139 epochs. The validation accuracy is greater than the training accuracy almost every time during training. That means that our model does not overfit the training set. The number of epochs could be reduced: there would probably be a slight drop in accuracy (a few percent), but the computation would not take as long (~2h15). Alternatively, a GPU could be used to speed it up.

# 4. Model prediction on test data

In [None]:
_, test_acc = saved_model.evaluate(X_test, y_test, verbose=0)
print('Test: %.3f' % (test_acc))

### Confusion matrix

In [None]:
y_pred = saved_model.predict(X_test)  # use the saved best model, as in the evaluation above
y_pred = np.argmax(y_pred, axis=1)
y_test = np.argmax(y_test, axis=1)
confusion_mtx = confusion_matrix(y_test, y_pred) 
sns.heatmap(confusion_mtx, annot=True, fmt='d')

Here we can see that our CNN performs very well on all letters, with only a few errors considering the size of the test set.

Let's investigate the most important errors. For that purpose we need to get the difference between the probabilities of a real value and the predicted one.

In [None]:
# Display some error results 

# y_test was already converted from one-hot vectors to labels in the previous cell
Y_true = y_test

# Predict the values from the test dataset
Y_pred = saved_model.predict(X_test)
# Convert predicted class probabilities to labels
Y_pred_classes = np.argmax(Y_pred,axis = 1) 

# Errors are difference between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)

Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_test[errors]

def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1

# Probabilities of the wrongly predicted letters
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors 
most_important_errors = sorted_delta_errors[-6:]

# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)
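
The step that extracts the predicted probability of the true class may be easier to follow on a tiny example. The sketch below uses made-up probabilities for 3 misclassified samples and 4 classes, and shows an equivalent way to select one value per row with integer indexing.

```python
import numpy as np

# Made-up predicted probabilities for 3 misclassified samples and 4 classes
probs = np.array([[0.6, 0.3, 0.1, 0.0],
                  [0.2, 0.1, 0.7, 0.0],
                  [0.1, 0.5, 0.0, 0.4]])
true_labels = np.array([1, 0, 3])

predicted_prob = probs.max(axis=1)
true_prob = probs[np.arange(len(true_labels)), true_labels]   # one value per row
delta = predicted_prob - true_prob

# Same result as the np.diagonal(np.take(...)) expression used in the notebook
same = np.allclose(true_prob, np.diagonal(np.take(probs, true_labels, axis=1)))

# Indices of the errors, from the largest to the smallest probability gap
worst_first = np.argsort(delta)[::-1]
print(true_prob, delta, same, worst_first)
```

A large gap means the model was confident in a wrong letter while giving the true letter a low probability.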

During data processing we shifted the letters' labels by 1 to make them start from 0. 
The new ordering is as follows:

а=>0, б=>1, в=>2, г=>3, д=>4, е=>5, ё=>6, ж=>7, з=>8, и=>9,
й=>10, к=>11, л=>12, м=>13, н=>14, о=>15, п=>16, р=>17, с=>18, т=>19,
у=>20, ф=>21, х=>22, ц=>23, ч=>24, ш=>25, щ=>26, ъ=>27, ы=>28, ь=>29,
э=>30, ю=>31, я=>32

Most of the errors, especially the second row, could be easily made by a human the same way. 

# Conclusion

This is a 5-layer Sequential Convolutional Neural Network for the recognition of Russian handwritten letters. The dataset includes 14190 images with the following split (two successive 10% splits): 
* about 81% train data;
* about 9% validation data;
* 10% test data.

My CNN model's architecture: In -> [[Conv2D->relu]* 2 -> MaxPool2D -> Dropout] * 2 -> Flatten -> Dense -> Dropout -> Out

I achieved 95.6% accuracy with this CNN trained for ~2h15 on a single CPU. If you have a GPU with compute capability >= 3.0 (from the GTX 650 to recent GPUs), you can use tensorflow-gpu with Keras. Computation will be much faster!

Converting images to grey scale didn't improve the algorithm.
I've changed the optimizer from "adam" to RMSprop, used an annealing method to find the optimal learning rate, applied several random data augmentation techniques and early stopping to figure out the optimal number of epochs to run the model for. ModelCheckpoint callback is used to save the best model (including the best weights), so it can be used as a pretrained model in the future.  

The model makes mistakes mostly in those cases where a human eye would've probably guessed wrong as well.

Possible future improvement ideas are the following:
* try multilabel (add background as a target) classification;
* improve loading function;
* remove the backgrounds and then from every sample create many new ones with different backgrounds (as a way of data augmentation);
* visualize the model to see what else can be improved (how it makes a decision);
* try monitoring val_loss instead of val_accuracy