<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/computer-vision-case-studies/handwriting-recognition/01_handwriting_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Handwriting recognition

**Authors:** [A_K_Nain](https://twitter.com/A_K_Nain), [Sayak Paul](https://twitter.com/RisingSayak)<br>
**Date created:** 2021/08/16<br>
**Last modified:** 2021/08/16<br>
**Description:** Training a handwriting recognition model with variable-length sequences.

**Blog reference:** https://keras.io/examples/vision/handwriting_recognition/

##Introduction

This example shows how the [Captcha OCR](https://keras.io/examples/vision/captcha_ocr/)
example can be extended to the
[IAM Dataset](https://fki.tic.heia-fr.ch/databases/iam-handwriting-database),
which has variable-length ground-truth targets. Each sample in the dataset is an image of some
handwritten text, and its target is the string present in the image.
The IAM Dataset is widely used in OCR benchmarks, so we hope this example can serve as a
good starting point for building OCR systems.

##Setup

In [1]:
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
from tensorflow import keras

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import os

np.random.seed(42)
tf.random.set_seed(42)

##Data collection

In [2]:
%%shell

wget -q https://git.io/J0fjL -O IAM_Words.zip
unzip -qq IAM_Words.zip

mkdir -p data/words
tar -xf IAM_Words/words.tgz -C data/words
mv IAM_Words/words.txt data

rm -rf IAM_Words.zip
rm -rf IAM_Words



Preview how the dataset is organized. Lines beginning with "#" are metadata.

In [3]:
!head -20 data/words.txt

#--- words.txt ---------------------------------------------------------------#
#
# iam database word information
#
# format: a01-000u-00-00 ok 154 1 408 768 27 51 AT A
#
#     a01-000u-00-00  -> word id for line 00 in form a01-000u
#     ok              -> result of word segmentation
#                            ok: word was correctly
#                            er: segmentation of word can be bad
#
#     154             -> graylevel to binarize the line containing this word
#     1               -> number of components for this word
#     408 768 27 51   -> bounding box around this word in x,y,w,h format
#     AT              -> the grammatical tag for this word, see the
#                        file tagset.txt for an explanation
#     A               -> the transcription for this word
#
a01-000u-00-00 ok 154 408 768 27 51 AT A
a01-000u-00-01 ok 154 507 766 213 48 NN MOVE
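
As a minimal sketch of how one of these entries can be read, the hypothetical helper `parse_word_entry` below (not part of the original example) splits a line on single spaces and keeps only the fields the rest of this example relies on. It assumes the word id is the first field, the segmentation flag the second, and the transcription the last, which matches the two data lines in the preview.

```python
def parse_word_entry(line):
    """Split one non-comment line of words.txt into the fields used later.

    Assumption: fields are separated by single spaces, with the word id
    first, the segmentation flag second, and the transcription last.
    """
    fields = line.strip().split(" ")
    return {
        "word_id": fields[0],
        "segmentation": fields[1],
        "transcription": fields[-1],
    }


entry = parse_word_entry("a01-000u-00-01 ok 154 507 766 213 48 NN MOVE\n")
print(entry["word_id"], entry["segmentation"], entry["transcription"])
# a01-000u-00-01 ok MOVE
```

Taking the transcription from the last field, rather than from a fixed index, keeps the sketch independent of how many numeric fields sit in between.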


##Dataset splitting

In [16]:
base_path = "data"
words_list = []

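# Read all lines, closing the file once we are done with it.
with open(f"{base_path}/words.txt", "r") as f:
  words = f.readlines()

for line in words:
  # Lines starting with "#" are metadata, not samples.
  if line[0] == "#":
    continue
  # Skip entries whose word segmentation is flagged as erroneous.
  if line.split(" ")[1] != "err":
    words_list.append(line)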
  
print(len(words_list))
np.random.shuffle(words_list)

96456


We will split the dataset into three subsets with a 90:5:5 ratio (train:validation:test).

In [17]:
split_index = int(0.9 * len(words_list))
train_samples = words_list[:split_index]
test_samples = words_list[split_index:]

val_split_index = int(0.5 * len(test_samples))
validation_samples = test_samples[:val_split_index]
test_samples = test_samples[val_split_index:]

assert len(words_list) == len(train_samples) + len(validation_samples) + len(test_samples)

print(f"Total training samples: {len(train_samples)}")
print(f"Total validation samples: {len(validation_samples)}")
print(f"Total test samples: {len(test_samples)}")

Total training samples: 86810
Total validation samples: 4823
Total test samples: 4823
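
The counts printed above can be checked by hand. The sketch below repeats the same arithmetic on the sample count alone. `split_sizes` is a hypothetical helper introduced only for illustration, and it assumes the held-out 10% is halved between validation and test, as in the cell above.

```python
def split_sizes(num_samples, train_fraction=0.9):
    """Return (train, validation, test) sizes for a 90:5:5-style split."""
    # int() truncates, so any fractional sample goes to the held-out part.
    num_train = int(train_fraction * num_samples)
    num_held_out = num_samples - num_train
    # The held-out part is halved; an odd remainder would go to the test set.
    num_validation = int(0.5 * num_held_out)
    num_test = num_held_out - num_validation
    return num_train, num_validation, num_test


print(split_sizes(96456))
# (86810, 4823, 4823)
```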


In [26]:
train_samples[:10]

['r06-076-07-06 ok 177 1807 2010 76 53 CC or\n',
 'n01-004-01-01 ok 180 614 906 246 69 JJ unable\n',
 'g06-011f-00-03 ok 203 778 721 46 70 INO of\n',
 'f04-011-07-01 ok 145 504 1976 118 78 BEDZ was\n',
 'e04-103-01-01 ok 174 471 916 205 123 VB plank\n',
 'g06-047g-04-05 ok 182 924 1430 193 67 NP Europe\n',
 'm06-056-04-11 ok 158 2061 1537 11 21 , ,\n',
 'j06-026-03-04 ok 185 1593 1416 341 129 NN sunlight\n',
 'm06-019-01-12 ok 189 1837 949 142 50 CD three\n',
 'a04-043-02-05 ok 186 1906 1113 59 68 INO of\n']

##Data input pipeline

We start building our data input pipeline by first preparing the image paths.

In [28]:
base_image_path = os.path.join(base_path, "words")

def get_image_paths_and_labels(samples):
  paths = []
  corrected_samples = []

  for file_line in samples:
    line_split = file_line.strip().split(" ")

    # The first field is the word id, and the image for it is stored at
    # <form prefix>/<form id>/<word id>.png,
    # e.g. a01/a01-000u/a01-000u-00-00.png
    image_name = line_split[0]
    partI = image_name.split("-")[0]
    partII = image_name.split("-")[1]
    img_path = os.path.join(base_image_path, partI, partI + "-" + partII, image_name + ".png")

    # Keep only non-empty image files.
    if os.path.getsize(img_path):
      paths.append(img_path)
      corrected_samples.append(file_line.split("\n")[0])

  return paths, corrected_samples
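
To make the directory layout concrete, here is a small standalone sketch of the same path construction applied to a single word id. `word_image_path` is a hypothetical helper, not part of the original example, and it uses `PurePosixPath` so the printed result does not depend on the operating system. No file access is involved.

```python
from pathlib import PurePosixPath


def word_image_path(word_id, root="data/words"):
    """Build the image path for a word id such as 'a01-000u-00-00'."""
    # The first two dash-separated parts identify the form, e.g. a01 and 000u.
    form_prefix, form_number = word_id.split("-")[:2]
    form_id = f"{form_prefix}-{form_number}"
    return str(PurePosixPath(root) / form_prefix / form_id / f"{word_id}.png")


print(word_image_path("a01-000u-00-00"))
# data/words/a01/a01-000u/a01-000u-00-00.png
```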

In [29]:
train_img_paths, train_labels = get_image_paths_and_labels(train_samples)
validation_img_paths, validation_labels = get_image_paths_and_labels(validation_samples)
test_img_paths, test_labels = get_image_paths_and_labels(test_samples)

Then we prepare the ground-truth labels.

In [30]:
# Find maximum length and the size of the vocabulary in the training data.
train_labels_cleaned = []
characters = set()
max_len = 0

for label in train_labels:
  label = label.split(" ")[-1].strip()
  for char in label:
    characters.add(char)
  
  max_len = max(max_len, len(label))
  train_labels_cleaned.append(label)

print("Maximum length: ", max_len)
print("Vocab size: ", len(characters))

# Check some label samples
train_labels_cleaned[:10]

Maximum length:  21
Vocab size:  78


['or',
 'unable',
 'of',
 'was',
 'plank',
 'Europe',
 ',',
 'sunlight',
 'three',
 'of']
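
The same two quantities can be computed on just the ten labels previewed above, which makes the bookkeeping easy to verify by eye. This is only an illustrative sketch: `label_stats` is a hypothetical helper, and the numbers it prints describe this ten-label sample, not the full training set.

```python
def label_stats(labels):
    """Return (maximum label length, number of distinct characters)."""
    # Union of all characters across labels; case and punctuation are kept.
    characters = set().union(*labels) if labels else set()
    max_len = max((len(label) for label in labels), default=0)
    return max_len, len(characters)


preview_labels = ["or", "unable", "of", "was", "plank",
                  "Europe", ",", "sunlight", "three", "of"]
print(label_stats(preview_labels))
# (8, 19)
```

Note that uppercase and lowercase letters count as different characters, so "E" in "Europe" adds an entry even though "e" is already present.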

Now we clean the validation and the test labels as well.

In [32]:
def clean_labels(labels):
  cleaned_labels = []
  for label in labels:
    label = label.split(" ")[-1].strip()
    cleaned_labels.append(label)
  return cleaned_labels

In [34]:
validation_labels_cleaned = clean_labels(validation_labels)
test_labels_cleaned = clean_labels(test_labels)

##Building the character vocabulary