This is my best private-score kernel (0.99280),
so I would like to share what I did.
The key point is the method described in [Keras Learning Rate Finder](https://www.pyimagesearch.com/2019/08/05/keras-learning-rate-finder/),
which finds an effective learning rate range for the model.
I then trained the model with values from that range.

## Contents
1. [Preparation](#Preparation)
1. [Making Model](#MakingModel)
1. [Data Augmentation](#DataAugmentation)
1. [Finding Effective Learning Rate](#FindingEffectiveLearningRate)
1. [Training](#Training)
1. [Submit Prediction](#SubmitPrediction)
1. [Reference](#Reference)

<div id='Preparation'>
## 1. Preparation

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development
# How can I obtain reproducible results using Keras during development?
import random as rn
import tensorflow as tf

rand_seed = 53
# maybe no effect, since this should be set before the program starts.
%env PYTHONHASHSEED=0
np.random.seed(rand_seed)
rn.seed(rand_seed)

tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.random.set_seed(rand_seed)

In [None]:
def read_data(file_name):
    file_path = '../input/Kannada-MNIST/' + file_name
    data_df = pd.read_csv(file_path)
    pixels_df = data_df.drop(columns='label')
    pixels_array = pixels_df.to_numpy(dtype=np.uint8)
    reshaped_pixels_array = pixels_array.reshape(-1, 28, 28, 1)
    labels_array = data_df.label.values
    return (reshaped_pixels_array, labels_array)

For [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html),
I specified the stratify option.
This ensures that each split has the same label distribution
as the array passed to stratify.
In this case, train_test_split splits the whole data
into training and test sets that both keep the label distribution of the whole data.
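
As a minimal sketch of what stratification does, the toy example below splits an imbalanced label array with `stratify`. The array size and the 9:1 class ratio are made up for illustration; they are not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 90 samples of class 0 and 10 samples of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy)

# With stratify, both parts keep the 9:1 class ratio of the full array.
train_counts = np.bincount(y_tr, minlength=2)
test_counts = np.bincount(y_te, minlength=2)
print(train_counts, test_counts)
```

Without `stratify`, the 10 test samples are drawn at random, so a rare class can easily be missing from the test split.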

In [None]:
from sklearn.model_selection import train_test_split

X, y = read_data('train.csv')
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.1, random_state=rand_seed, stratify=y)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

def draw_image(im_array, subplot=(1, 1, 1)):
    im = Image.fromarray(im_array.reshape(28, 28))
    plt.subplot(*subplot)
    plt.imshow(im, cmap='gray')
    plt.axis("off")

In [None]:
draw_image(X_train[0])
plt.show()

In [None]:
unique_y = np.unique(y_train)
print(unique_y)

In [None]:
label_count = len(unique_y)
print(label_count)

In [None]:
def draw_hist(x, title):
    # Bin edges at -0.5, 0.5, 1.5, ... center each bar on its integer label,
    # so the bars line up with the x ticks.
    bins = np.arange(label_count + 1) - 0.5
    plt.hist(x, bins=bins, rwidth=0.8)
    plt.title(title)
    plt.xlabel('labels')
    plt.ylabel('counts')
    plt.xticks(np.arange(label_count))
    plt.show()

In [None]:
draw_hist(y_train, "Distributions of all train labels")

In [None]:
draw_hist(y_test, "Distributions of all test labels")

<div id='MakingModel'>
## 2. Making Model

For a convolution block, I used "Conv2D --> ReLU --> BatchNormalization".
I specified he_normal as the [kernel_initializer](https://keras.io/initializers/).
Without this, the loss of the model sometimes grew larger and larger and did not converge.
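
For a sense of scale, Keras documents he_normal as a truncated normal distribution with standard deviation sqrt(2 / fan_in). The sketch below only computes that standard deviation for the 3x3 kernels used here; the helper name `he_normal_stddev` is made up for this illustration, and whether this scaling is the actual reason for the better convergence is my assumption.

```python
import math

def he_normal_stddev(kernel_height, kernel_width, in_channels):
    # fan_in of a Conv2D kernel = kernel area * number of input channels
    fan_in = kernel_height * kernel_width * in_channels
    return math.sqrt(2.0 / fan_in)

# 3x3 kernels as in make_conv_layer; input channel counts that occur in this model.
for in_channels in (1, 64, 128, 256):
    print(in_channels, round(he_normal_stddev(3, 3, in_channels), 4))
```

The initial weights get smaller as the number of input channels grows, which keeps the scale of the activations roughly stable from layer to layer.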

In [None]:
from keras.layers import Conv2D, BatchNormalization
from keras.initializers import he_normal

def make_conv_layer(filter_size, suffix, inputs):
    x = Conv2D(
        filter_size, kernel_size=(3, 3), padding='same', activation='relu',
        kernel_initializer=he_normal(seed=rand_seed), name='conv_' + suffix)(inputs)
    outputs = BatchNormalization(name='bn_' + suffix)(x)
    return outputs

The model consists of the following blocks:
1. Scaling by dividing by 255.0.
1. 3 convolutions with 64 filters, then MaxPooling and SpatialDropout2D(0.25)
1. 3 convolutions with 128 filters, then MaxPooling and SpatialDropout2D(0.25)
1. 3 convolutions with 256 filters, then GlobalAveragePooling
1. Dense with 256 units, then Dropout(0.25) and outputs with Dense(label_count).
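
The tensor shapes implied by these blocks can be traced by hand. The sketch below does that arithmetic and also counts the parameters of the 256-unit Dense layer. The Flatten variant is a hypothetical alternative for comparison only; it is not used in this kernel.

```python
# Trace (height, width, channels) through the blocks listed above.
# 'same' padding keeps height and width; each 2x2 max pooling halves them.
shape = (28, 28, 1)
trace = []
for filters, pooled in ((64, True), (128, True), (256, False)):
    h, w, _ = shape
    shape = (h, w, filters)                # three 3x3 'same' convolutions
    if pooled:
        shape = (h // 2, w // 2, filters)  # MaxPooling2D with pool_size (2, 2)
    trace.append(shape)

h, w, c = shape  # shape just before the global average pooling

def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

gap_params = dense_params(c, 256)              # input is the pooled vector
flatten_params = dense_params(h * w * c, 256)  # hypothetical Flatten instead
print(trace)
print(gap_params, flatten_params)
```

With global average pooling the Dense layer sees only one value per channel, so it needs about 49 times fewer weights than it would after flattening the 7x7 feature maps.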

Some points are:
* Scaling is done inside the model, for two reasons:
    * It is written just once and applies to training, validation, and prediction alike.
    * It saves memory. Image data is stored as unsigned integers (0..255), 1 byte per pixel. Dividing by the floating-point number 255.0 produces floating-point results of 4 or 8 bytes per pixel, so scaling the whole dataset up front would consume a lot of memory.
* [GlobalAveragePooling2D](https://keras.io/layers/pooling/) is used
to connect the last convolution block to the Dense layer.
This reduces the number of parameters in the Dense layer, and the performance seems to be much the same.
* sparse_categorical_crossentropy is used for the [loss function](https://keras.io/losses/).
This function accepts the label values directly, so one-hot encoding is not necessary.
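
To put rough numbers on the memory point, the sketch below computes the size of a 60,000-image array of 28x28x1 pixels for three dtypes, and checks what dtype NumPy produces when a uint8 array is divided by 255.0. It is plain arithmetic and does not load the competition data.

```python
import numpy as np

n_images, height, width, channels = 60000, 28, 28, 1
n_values = n_images * height * width * channels

for dtype in (np.uint8, np.float32, np.float64):
    megabytes = n_values * np.dtype(dtype).itemsize / 1e6
    print("{0}: {1:.2f} MB".format(np.dtype(dtype).name, megabytes))

# Dividing a uint8 array by a Python float promotes the result to float64.
sample = np.zeros((2, 28, 28, 1), dtype=np.uint8)
scaled = sample / 255.0
print(scaled.dtype, scaled.nbytes // sample.nbytes)
```

Keeping the arrays as uint8 and scaling inside the model means only one batch at a time is held in floating point.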

In [None]:
from keras.layers import Input, Lambda, MaxPooling2D, SpatialDropout2D, Dense
from keras.layers import GlobalAveragePooling2D, Dropout
from keras.models import Model

def make_model():
    inputs = Input(shape=(28, 28, 1), name="input")
    x = Lambda(lambda v: v / 255.0, name='scaling')(inputs)
    
    x = make_conv_layer(64, '1_1', x)
    x = make_conv_layer(64, '1_2', x)
    x = make_conv_layer(64, '1_3', x)
    x = MaxPooling2D(pool_size=(2, 2), name="maxpool_1")(x)
    x = SpatialDropout2D(0.25, name='sp_dpout_1')(x)

    x = make_conv_layer(128, '2_1', x)
    x = make_conv_layer(128, '2_2', x)
    x = make_conv_layer(128, '2_3', x)
    x = MaxPooling2D(pool_size=(2, 2), name='maxpool_2')(x)
    x = SpatialDropout2D(0.25, name='sp_dpout_2')(x)

    x = make_conv_layer(256, '3_1', x)
    x = make_conv_layer(256, '3_2', x)
    x = make_conv_layer(256, '3_3', x)
    x = GlobalAveragePooling2D(name='gblavgpool')(x)

    x = Dense(256, activation='relu', name='dense')(x)
    x = Dropout(0.25, name='dropout')(x)
    outputs = Dense(label_count, activation='softmax', name='outputs')(x)

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy'])
    
    return model

In [None]:
train_model = make_model()
train_model.summary()

<div id='DataAugmentation'>
## 3. Data Augmentation

I used [ImageDataGenerator](https://keras.io/preprocessing/image/)
to make variations of the training data.
I drew heatmaps to check how the significant pixels are distributed
in the original and generated images.

In [None]:
import seaborn as sns

def draw_sum_heatmap(X):
    X_sum = np.sum(X, axis=0, dtype=np.float32) / 255.0
    X_reshaped_sum = np.reshape(X_sum, (28, 28))
    sns.heatmap(X_reshaped_sum)
    plt.show()

In [None]:
draw_sum_heatmap(X)

In [None]:
from keras.preprocessing.image import ImageDataGenerator

train_image_generator = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=10,
    zoom_range=0.2,
    fill_mode='constant',
    cval=0,
    data_format='channels_last')
test_image_generator = ImageDataGenerator()

Each call to next() generates batch_size images.
Here batch_size is set to len(X), which is 60,000.

In [None]:
sample_image_flow = train_image_generator.flow(
    X, y, batch_size=len(X), shuffle=False, seed=rand_seed)
X_generated_sample, _ = next(sample_image_flow)
print(X_generated_sample.shape)
draw_sum_heatmap(X_generated_sample)

In [None]:
batch_size = 256
steps_per_epoch = (X_train.shape[0] + batch_size - 1) // batch_size

train_image_flow = train_image_generator.flow(
    X_train, y_train, batch_size=batch_size, shuffle=True, seed=rand_seed)
test_image_flow = test_image_generator.flow(
    X_test, y_test, batch_size=batch_size, shuffle=True, seed=rand_seed)

print("batch_size: {0}".format(batch_size))
print("steps_per_epoch: {0}".format(steps_per_epoch))

<div id='FindingEffectiveLearningRate'>
## 4. Finding Effective Learning Rate

I referred to [Keras Learning Rate Finder](https://www.pyimagesearch.com/2019/08/05/keras-learning-rate-finder/) to make this part.
The idea is:
* Sweep the learning rate from far too small to far too large.
* Monitor the loss while sweeping.
* The point where the loss starts decreasing is the minimum usable learning rate.
* The point where the loss stops decreasing is the maximum usable learning rate.
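
The sweep multiplies the learning rate by a constant factor after every batch, so the rates are evenly spaced on a log scale. The sketch below reproduces only that arithmetic without training anything. It uses the same start, end, and epoch settings as this notebook, and assumes 54,000 training images (90% of 60,000) with a batch size of 256.

```python
# Settings mirror the notebook; the image count is an assumption (90% of 60,000).
start_lr, end_lr = 1e-10, 1.0
epochs = 3
steps_per_epoch = (54000 + 256 - 1) // 256  # ceiling division
total_batches = epochs * steps_per_epoch

# Constant factor chosen so that start_lr * lr_mult ** total_batches == end_lr.
lr_mult = (end_lr / start_lr) ** (1.0 / total_batches)

lr = start_lr
lrs = []
for _ in range(total_batches):
    lrs.append(lr)  # rate used for this batch
    lr *= lr_mult   # rate for the next batch

print(steps_per_epoch, total_batches, lr_mult)
print(lrs[0], lrs[-1], lr)
```

The last rate actually used is one factor below `end_lr`, because the multiplication happens after each batch has been trained.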

In [None]:
from keras import backend as K

def get_lr(model):
    return K.get_value(model.optimizer.lr)

def set_lr(model, lr):
    K.set_value(model.optimizer.lr, lr)

In [None]:
def find_lr_on_batch_end(model, logs, lr_list, loss_list, lr_mult):
    lr = get_lr(model)
    lr_list.append(lr)
    loss = logs['loss']
    loss_list.append(loss)
    
    set_lr(model, lr * lr_mult)

In [None]:
find_lr_start_lr = 1e-10
find_lr_end_lr = 1.0

find_lr_epochs = 3
find_lr_total_batch_count = find_lr_epochs * steps_per_epoch
find_lr_lr_mult = (find_lr_end_lr / find_lr_start_lr) ** (1.0 / find_lr_total_batch_count)

print("find_lr_epochs: {0}".format(find_lr_epochs))
print("find_lr_total_batch_count: {0}".format(find_lr_total_batch_count))
print("find_lr_lr_mult: {0}".format(find_lr_lr_mult))

In [None]:
from keras.callbacks import LambdaCallback

find_lr_model = make_model()
find_lr_lr_list = []
find_lr_loss_list = []

find_lr_callback = LambdaCallback(
    on_batch_end=lambda batch, logs: find_lr_on_batch_end(
        find_lr_model, logs, find_lr_lr_list, find_lr_loss_list, find_lr_lr_mult))

In [None]:
set_lr(find_lr_model, find_lr_start_lr)
find_lr_history = find_lr_model.fit_generator(
    train_image_flow,
    steps_per_epoch=steps_per_epoch,
    epochs=find_lr_epochs,
    validation_data=test_image_flow,
    callbacks=[find_lr_callback],
    verbose=2)

In [None]:
plt.plot(find_lr_lr_list, find_lr_loss_list)
plt.xscale('log')
plt.ylim(0, 4)
plt.title('Learning Rate vs Loss')
plt.xlabel('Learning Rate (Log Scale)')
plt.ylabel('Loss')
plt.show()

<div id='Training'>
## 5. Training

* I used 3 learning rates: the maximum, medium, and minimum of the range
found in the previous step.
* I specified explicitly the epochs at which the learning rate changes.
It's simple.
In the loss and accuracy plots below,
both improve at the points where the rate changes.
* 150 epochs seems a bit too long, but it sometimes gives a fantastic result,
because the training step is not deterministic.
* I chose the best model by validation accuracy.
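
To see how many epochs run at each rate, the sketch below replays the schedule without training. It uses the same rates and switching epochs as the scheduler in this notebook; the function name `lr_for_epoch` and the starting rate of 1e-3 (Adam's default in Keras) are assumptions made for this illustration.

```python
def lr_for_epoch(epoch_index, current_lr):
    # Same switching points as the notebook's schedule (0-based epoch index).
    if epoch_index == 0:
        return 1e-3
    if epoch_index == 49:
        return 3e-4
    if epoch_index == 99:
        return 1e-4
    return current_lr

lr = 1e-3  # assumed starting rate
rates = []
for epoch_index in range(150):
    lr = lr_for_epoch(epoch_index, lr)
    rates.append(lr)

# Count how many epochs run at each rate, largest rate first.
epochs_per_rate = {
    rate: rates.count(rate) for rate in sorted(set(rates), reverse=True)}
print(epochs_per_rate)
```

Because the epoch index is 0-based, the changes take effect at the start of epochs 50 and 100 as shown in the training log.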

In [None]:
from keras.callbacks import LearningRateScheduler

train_epochs = 150

def lr_schedule(epoch_index, current_lr):
    if epoch_index == 0:
        new_lr = 1e-3
    elif epoch_index == 49:
        new_lr = 3e-4
    elif epoch_index == 99:
        new_lr = 1e-4
    else:
        new_lr = current_lr

    if new_lr != current_lr:
        print(
            "Epoch {0}: Learning rate changed from {1:.5f} to {2:.5f}".format(
            epoch_index + 1, current_lr, new_lr))
    return new_lr

lr_scheduler = LearningRateScheduler(lr_schedule, verbose=0)

In [None]:
from keras.callbacks import ModelCheckpoint

best_model_file_name = "best_model.hdf5"
model_check_point = ModelCheckpoint(
    best_model_file_name, monitor='val_sparse_categorical_accuracy', mode='max',
    verbose=0, save_best_only=True, save_weights_only=True, period=1)

In [None]:
train_history = train_model.fit_generator(
    train_image_flow,
    steps_per_epoch=steps_per_epoch,
    epochs=train_epochs,
    validation_data=test_image_flow,
    callbacks=[lr_scheduler, model_check_point],
    verbose=2)

In [None]:
best_train_model = make_model()
best_train_model.load_weights(best_model_file_name)

In [None]:
train_result = best_train_model.evaluate(X_test, y_test)
print(train_result)

In [None]:
def draw_loss(history, ylim):
    plt.figure(figsize=(12, 4))
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.title('Loss and Val Loss')
    plt.xlabel('epochs')
    plt.ylabel('loss')
    plt.ylim(*ylim)
    plt.legend()
    plt.show()

In [None]:
draw_loss(train_history, (0.0, 0.05))

In [None]:
def draw_acc(history, ylim):
    plt.figure(figsize=(12, 4))
    plt.plot(history.history['sparse_categorical_accuracy'], label='acc')
    plt.plot(history.history['val_sparse_categorical_accuracy'], label='val_acc')
    plt.title('Acc and Val Acc')
    plt.xlabel('epochs')
    plt.ylabel('accuracy')
    plt.ylim(*ylim)
    plt.legend()
    plt.show()

In [None]:
draw_acc(train_history, (0.99, 1.0))

<div id='SubmitPrediction'>
## 6. Submit Prediction

In [None]:
test_df = pd.read_csv('../input/Kannada-MNIST/test.csv')
test_df.head()

In [None]:
test_pixels_df = test_df.drop(columns='id')
test_pixels_array = test_pixels_df.to_numpy(dtype=np.uint8)
test_images = test_pixels_array.reshape(-1, 28, 28, 1)
print(test_images.shape)

In [None]:
draw_image(test_images[0])

In [None]:
model_preds = best_train_model.predict(test_images)
print(model_preds.shape)

In [None]:
pred_labels = np.argmax(model_preds, axis=1)
draw_hist(pred_labels, "Distributions of prediction labels")

In [None]:
sample_submission_df = pd.read_csv('../input/Kannada-MNIST/sample_submission.csv')
sample_submission_df.head()

In [None]:
sample_submission_df['label'] = pred_labels
sample_submission_df.head()

In [None]:
sample_submission_df.to_csv('submission.csv', index=False)
print('Done!')

<div id='Reference'>
## 7. Reference

I referred to the following documents and kernels.
Many thanks to their authors.

* [How to use pre-trained models in kernels on Kaggle](https://www.kaggle.com/paultimothymooney/how-to-use-pre-trained-models-in-kernels-on-kaggle) -- at first, I planned to use transfer learning.
* [Indian way to learn CNN](https://www.kaggle.com/shahules/indian-way-to-learn-cnn) -- I referred to this for reading and handling the data.
* [Keras Learning Rate Finder](https://www.pyimagesearch.com/2019/08/05/keras-learning-rate-finder/) -- thanks to this, I could specify the learning rate with confidence.
* [Cyclical Learning Rates with Keras and Deep Learning](https://www.pyimagesearch.com/2019/07/29/cyclical-learning-rates-with-keras-and-deep-learning/) -- I tried this.
* [An implementation of DropConnect Layer in Keras](https://github.com/andry9454/KerasDropconnect),
[Fork of Keras CNN - DropConnect](https://www.kaggle.com/naraque/fork-of-keras-cnn-dropconnect) -- I tried this too.
* [Deep Dive in KannadaMnist with tfkeras](https://www.kaggle.com/xiejialun/deep-dive-in-kannadamnist-with-tfkeras) -- the "Symmetric Cross Entropy" used in this kernel is interesting, and I tried it.