<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/computer-vision-case-studies/handwriting-recognition/01_handwriting_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Handwriting recognition

**Authors:** [A_K_Nain](https://twitter.com/A_K_Nain), [Sayak Paul](https://twitter.com/RisingSayak)<br>
**Date created:** 2021/08/16<br>
**Last modified:** 2021/08/16<br>
**Description:** Training a handwriting recognition model with variable-length sequences.

**Blog reference:** https://keras.io/examples/vision/handwriting_recognition/

##Introduction

This example shows how the [Captcha OCR](https://keras.io/examples/vision/captcha_ocr/)
example can be extended to the
[IAM Dataset](https://fki.tic.heia-fr.ch/databases/iam-handwriting-database),
which has variable-length ground-truth targets. Each sample in the dataset is an image of some
handwritten text, and its corresponding target is the string present in the image.
The IAM Dataset is widely used across many OCR benchmarks, so we hope this example can serve as a
good starting point for building OCR systems.

##Setup

In [2]:
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
from tensorflow import keras

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import os

np.random.seed(42)
tf.random.set_seed(42)

##Data collection

In [1]:
%%shell

wget -q https://git.io/J0fjL -O IAM_Words.zip
unzip -qq IAM_Words.zip

mkdir data
mkdir data/words
tar -xf IAM_Words/words.tgz -C data/words
mv IAM_Words/words.txt data

rm -rf IAM_Words.zip
rm -rf IAM_Words



Preview how the dataset is organized. Lines that begin with "#" are metadata.

In [3]:
!head -20 data/words.txt

#--- words.txt ---------------------------------------------------------------#
#
# iam database word information
#
# format: a01-000u-00-00 ok 154 1 408 768 27 51 AT A
#
#     a01-000u-00-00  -> word id for line 00 in form a01-000u
#     ok              -> result of word segmentation
#                            ok: word was correctly
#                            er: segmentation of word can be bad
#
#     154             -> graylevel to binarize the line containing this word
#     1               -> number of components for this word
#     408 768 27 51   -> bounding box around this word in x,y,w,h format
#     AT              -> the grammatical tag for this word, see the
#                        file tagset.txt for an explanation
#     A               -> the transcription for this word
#
a01-000u-00-00 ok 154 408 768 27 51 AT A
a01-000u-00-01 ok 154 507 766 213 48 NN MOVE
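
To make the record layout concrete, here is a minimal sketch of pulling the useful fields out of one entry. The helper name `parse_words_line` is hypothetical and not part of the original example. Note that the two data lines in the preview carry nine space-separated fields while the format comment lists ten, so this sketch only relies on the first three fields and the last one.

```python
def parse_words_line(line):
    """Split one non-comment line of words.txt into named fields.

    Hypothetical helper for illustration; it assumes space-separated
    fields with the transcription in the last position.
    """
    parts = line.strip().split(" ")
    return {
        "word_id": parts[0],          # e.g. a01-000u-00-01
        "segmentation": parts[1],     # result of word segmentation
        "gray_level": int(parts[2]),  # binarization threshold
        "transcription": parts[-1],   # the target string
    }


sample = "a01-000u-00-01 ok 154 507 766 213 48 NN MOVE"
record = parse_words_line(sample)
print(record["word_id"], record["segmentation"], record["transcription"])
```

Reading the transcription from the end of the line rather than from a fixed index keeps the sketch independent of how many fields sit in the middle.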


##Dataset splitting

In [5]:
base_path = "data"
words_list = []

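# Read the annotation file; the context manager closes it afterwards
with open(f"{base_path}/words.txt", "r") as f:
  words = f.readlines()

for line in words:
  # Skip the metadata header lines
  if line[0] == "#":
    continue
  # We don't need to deal with errored entries
  if line.split(" ")[1] != "err":
    words_list.append(line)
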
print(len(words_list))
np.random.shuffle(words_list)

96456


We will split the dataset into three subsets with a 90:5:5 ratio (train:validation:test).

In [6]:
split_index = int(0.9 * len(words_list))
train_samples = words_list[:split_index]
test_samples = words_list[split_index:]

val_split_index = int(0.5 * len(test_samples))
validation_samples = test_samples[:val_split_index]
test_samples = test_samples[val_split_index:]

assert len(words_list) == len(train_samples) + len(validation_samples) + len(test_samples)

print(f"Total training samples: {len(train_samples)}")
print(f"Total validation samples: {len(validation_samples)}")
print(f"Total test samples: {len(test_samples)}")

Total training samples: 86810
Total validation samples: 4823
Total test samples: 4823
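
The sizes above follow directly from the integer truncation in the split. As a quick check, the sketch below reproduces the same arithmetic on a bare count; `split_sizes` is a hypothetical helper introduced only for this illustration.

```python
def split_sizes(total, train_frac=0.9):
    """Return (train, validation, test) sizes.

    Hypothetical helper: takes int(train_frac * total) samples for
    training, then halves the remainder between validation and test.
    """
    train = int(train_frac * total)
    held_out = total - train
    validation = int(0.5 * held_out)
    test = held_out - validation
    return train, validation, test


sizes = split_sizes(96456)
print(sizes)  # (86810, 4823, 4823)
```

When the held-out remainder is odd, the extra sample ends up in the test subset, because the validation size is rounded down.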


##Data input pipeline

We start building our data input pipeline by first preparing the image paths.