<h2>HandWritten Text Recognition</h2>


<h4>S Meena Padnekar<br>
M.Tech, Department of Computer Science<br>
CUSAT</h4>



<h4>AIM</h4><br>
The main objective of this project is to classify the individual words in a handwritten document, so that the handwritten text can be translated to digital form. A Convolutional Neural Network (CNN) is used to extract features from the word images, and Long Short-Term Memory (LSTM) networks are used to handle the sequence of characters in each word. LSTM is an artificial recurrent neural network (RNN) architecture used in the field of deep learning.

<h4>DATASET</h4><br>
Words from the IAM Handwriting Dataset are used to train the model.<br>
The IAM Handwriting Dataset is a collection of handwritten passages by several writers. It provides the handwritten text at several levels: forms (where a form is an image with lines of text from different writers), sentences, lines, and words. The words have been segmented, and the associated label metadata for each form is provided in a text file.


<h3>IMPLEMENTATION</h3><br>
The network outputs 80 classes per time-step: the 79 characters that occur in the word labels plus one CTC blank label.

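<h4>Image Preprocessing Module</h4><br>
A grayscale image of size 128x32 is the input to the model.<br>
Images are resized to fit 128x32 while keeping their aspect ratio, placed on a white background, transposed, and normalised to zero mean and unit standard deviation. During training, a random horizontal stretch is applied as data augmentation.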

In [None]:
import cv2
import random
import numpy as np

def preprocess(img,imageSize,dataAug=False):

    if img is None:
        # unreadable file: fall back to a blank image of shape (height, width)
        img = np.zeros([imageSize[1], imageSize[0]])
    
    if dataAug:
        stretch = (random.random()-0.5)
        wStretched = max(int(img.shape[1] * (1 + stretch)), 1)
        img = cv2.resize(img, (wStretched, img.shape[0]))

    (wt,ht) = imageSize
    (h,w)   = img.shape
    fx = w/wt
    fy = h/ht
    f = max(fx,fy)
    newSize = (max(min(wt, int(w / f)), 1), max(min(ht, int(h / f)), 1))
    img = cv2.resize(img,newSize)

    target = np.ones([ht, wt]) * 255
    target[0:newSize[1], 0:newSize[0]]=img

    img = cv2.transpose(target)
    (m, s) = cv2.meanStdDev(img)

    m = m[0][0]
    s = s[0][0]
    img = img - m
    img = img / s if s>0 else img

    return img
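
The resize step above can be checked by hand with a small sketch of the same arithmetic. The helper name `fitted_size` is hypothetical and not part of the project; it only repeats the scale-factor computation from `preprocess` without touching OpenCV.

```python
def fitted_size(w, h, target=(128, 32)):
    """Return the (width, height) an image is resized to before padding."""
    wt, ht = target
    # the larger of the two ratios decides the scale, so both sides fit
    f = max(w / wt, h / ht)
    return (max(min(wt, int(w / f)), 1), max(min(ht, int(h / f)), 1))


print(fitted_size(256, 32))   # wide word: (128, 16)
print(fitted_size(50, 100))   # tall word: (16, 32)
```

In both cases the remaining area of the 128x32 target stays white, which is why the target array is initialised with 255.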

<h4>DataLoader Module</h4><br>
This module is used to fetch the images for training and validation.

In [None]:
import os
import random
import numpy as np
import cv2
# from imagePreprocessing import preprocess

class Sample:
    # sample from the dataset
    def __init__(self, gtText, filePath):
        self.gtText = gtText
        self.filePath = filePath

class Batch:
    # batch containing images and ground truth texts
    def __init__(self, gtTexts, imgs):
        self.imgs = np.stack(imgs, axis=0)
        self.gtTexts = gtTexts

class DataLoader:
    # loads data which corresponds to IAM format, see: http://www.fki.inf.unibe.ch/databases/iam-handwriting-database

    def __init__(self, filePath, batchSize, imgSize, maxTextLen):
        "loader for dataset at given location, preprocess images and text according to parameters"

        assert filePath[-1]=='/'

        self.dataAugmentation = False
        self.currIdx = 0
        self.batchSize = batchSize
        self.imgSize = imgSize
        self.samples = []
    
        f=open(filePath+'words.txt')
        chars = set()
        bad_samples = []
        bad_samples_reference = ['a01-117-05-02.png', 'r06-022-03-05.png']
        for line in f:
            # ignore comment line
            if not line or line[0]=='#':
                continue

            lineSplit = line.strip().split(' ')
            # assert len(lineSplit) >= 9

            # filename: part1-part2-part3 --> part1/part1-part2/part1-part2-part3.png
            fileNameSplit = lineSplit[0].split('-')
            fileName = filePath + 'words/' + fileNameSplit[0] + '/' + fileNameSplit[0] + '-' + fileNameSplit[1] + '/' + lineSplit[0] + '.png'

            # GT text is everything from the 9th column (index 8) onwards
            gtText = self.truncateLabel(' '.join(lineSplit[8:]), maxTextLen)
            chars = chars.union(set(list(gtText)))

            #  check if image is not empty
            if not os.path.getsize(fileName):
                bad_samples.append(lineSplit[0] + '.png')
                continue

            # put sample into list
            self.samples.append(Sample(gtText, fileName))

        # some images in the IAM dataset are known to be damaged, don't show warning for them
        if set(bad_samples) != set(bad_samples_reference):
            print("Warning, damaged images found:", bad_samples)
            print("Damaged images expected:", bad_samples_reference)

        # split into training and validation set: 95% - 5%
        splitIdx = int(0.95 * len(self.samples))
        self.trainSamples = self.samples[:splitIdx]
        self.validationSamples = self.samples[splitIdx:]

        # put words into lists
        self.trainWords = [x.gtText for x in self.trainSamples]
        self.validationWords = [x.gtText for x in self.validationSamples]

        # number of randomly chosen samples per epoch for training 
        self.numTrainSamplesPerEpoch = 25000 

        # start with train set
        self.trainSet()

        # list of all chars in dataset
        self.charList = sorted(list(chars))


    def truncateLabel(self, text, maxTextLen):
        # ctc_loss can't compute loss if it cannot find a mapping between text label and input 
        # labels. Repeat letters cost double because of the blank symbol needing to be inserted.
        # If a too-long label is provided, ctc_loss returns an infinite gradient
        cost = 0
        for i in range(len(text)):
            if i != 0 and text[i] == text[i-1]:
                cost += 2
            else:
                cost += 1
            if cost > maxTextLen:
                return text[:i]
        return text

    def trainSet(self):
        # switch to randomly chosen subset of training set
        self.dataAugmentation = True
        self.currIdx = 0
        random.shuffle(self.trainSamples)
        self.samples = self.trainSamples[:self.numTrainSamplesPerEpoch]


    def validationSet(self):
        # switch to validation set
        self.dataAugmentation = False
        self.currIdx = 0
        self.samples = self.validationSamples


    def getIteratorInfo(self):
        # current batch index and overall number of batches
        return (self.currIdx // self.batchSize + 1, len(self.samples) // self.batchSize)


    def hasNext(self):
        # "iterator"
        return self.currIdx + self.batchSize <= len(self.samples)

    
    def getNext(self):
        # iterator
        batchRange = range(self.currIdx, self.currIdx + self.batchSize)
        gtTexts = [self.samples[i].gtText for i in batchRange]
        imgs = [preprocess(cv2.imread(self.samples[i].filePath, cv2.IMREAD_GRAYSCALE), self.imgSize, self.dataAugmentation) for i in batchRange]
        self.currIdx += self.batchSize
        return Batch(gtTexts, imgs)
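
The label truncation in `truncateLabel` is easier to follow with numbers. The sketch below is a standalone copy of the same cost rule (a repeated letter costs 2 because CTC needs a blank between the two copies); the names `ctc_cost` and `truncate_label` are chosen for this example only.

```python
def ctc_cost(text):
    """Number of time-steps CTC needs at minimum to emit `text`."""
    cost = 0
    for i, ch in enumerate(text):
        # a repeated character needs an extra blank in between
        cost += 2 if i > 0 and ch == text[i - 1] else 1
    return cost


def truncate_label(text, max_len):
    """Cut `text` at the first character that would exceed `max_len` steps."""
    cost = 0
    for i, ch in enumerate(text):
        cost += 2 if i > 0 and ch == text[i - 1] else 1
        if cost > max_len:
            return text[:i]
    return text


print(ctc_cost("hello"))            # 6: the second 'l' costs 2
print(truncate_label("aaaa", 5))    # 'aaa'
```

With `maxTextLen = 32`, a label is only shortened when its cost exceeds the 32 time-steps the network produces.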

<h4>Model</h4><br>
Input to the neural network is an image of size 128*32<br>
<h5>CNN Layers</h5>
There are 5 CNN layers, trained to extract relevant features from the image. A 5x5 kernel is applied in the first 2 layers and a 3x3 kernel in the last 3 layers, each followed by batch normalisation and the non-linear ReLU function. A max pooling layer after each convolution summarises image regions and downsizes the feature map: by 2 in both dimensions in the first 2 layers, and by 2 along the image height only in the last 3 layers.<br>
The output of the CNN layers is a feature map of size 32x256 (32 time-steps with 256 features each).<br>
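
The 32x256 figure can be reproduced from the pooling sizes and feature counts used in `setupCNN`. This is only shape bookkeeping, assuming 'SAME' convolutions (which keep width and height) and pooling windows equal to their strides; the function name `cnn_output_shapes` is made up for the example.

```python
def cnn_output_shapes(width=128, height=32):
    """Track (width, height, channels) after each conv + pool layer."""
    pools = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2)]
    features = [32, 64, 128, 128, 256]
    shapes = []
    for (pool_w, pool_h), channels in zip(pools, features):
        # the convolution keeps the size, the pooling divides it
        width //= pool_w
        height //= pool_h
        shapes.append((width, height, channels))
    return shapes


for shape in cnn_output_shapes():
    print(shape)
# last line printed: (32, 1, 256)
```

The height collapses to 1, so squeezing that axis leaves a sequence of 32 time-steps with 256 features each.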
<h5>RNN Layers</h5><br>
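LSTM (a variation of RNN) is used in this model. There are 2 LSTM layers, applied in both directions over the sequence. The feature sequence contains 256 features per time-step, and the RNN layers propagate relevant information through this sequence. The output sequence is a matrix of size 32x80: one score per character class (including the CTC blank) for each of the 32 time-steps.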
<h5>CTC Layers</h5><br>
In the training phase, the RNN output and the ground-truth text are used to calculate the CTC loss. During prediction, the CTC layer decodes the RNN output matrix to get the final text.
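
To make the decoding step concrete, here is a minimal pure-Python sketch of best path (greedy) decoding: take the most likely class per time-step, merge repeats, then drop blanks. It is an illustration under the assumption that the blank is the last class, as in the model; the function name and the toy scores are invented for the example, and the project itself relies on TensorFlow's decoders.

```python
def best_path_decode(scores, char_list):
    """scores: T rows with len(char_list) + 1 values; last index is the blank."""
    blank = len(char_list)
    # most likely class at every time-step
    best = [max(range(len(row)), key=row.__getitem__) for row in scores]
    decoded = []
    prev = None
    for label in best:
        # keep a label only if it differs from the previous step and is not blank
        if label != prev and label != blank:
            decoded.append(char_list[label])
        prev = label
    return "".join(decoded)


A, B, BLANK = [0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]
toy_scores = [A, A, BLANK, A, B, B]      # path: a a - a b b
print(best_path_decode(toy_scores, "ab"))  # aab
```

The blank between the two runs of `a` is what allows a doubled letter to survive the merge step.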

In [None]:
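# written against the TensorFlow 1.x API (tf.placeholder, tf.contrib)
import os  # needed by dumpNNOutput
import sys
import numpy as np
import tensorflow as tf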


class DecoderType:
    BestPath = 0
    BeamSearch = 1
    WordBeamSearch = 2


class Model:

    batchSize = 50
    imageSize = (128, 32)
    maxTextLen = 32

    def __init__(self, charList, decoderType=DecoderType.BestPath, mustRestore=False, dump=False):
        self.dump = dump
        self.charList = charList
        self.decoderType = decoderType
        self.mustRestore = mustRestore
        self.snapID = 0

        self.is_train = tf.placeholder(tf.bool, name='is_train')

        self.inputImg = tf.placeholder(tf.float32, shape=(
            None, Model.imageSize[0], Model.imageSize[1]))

        self.setupCNN()
        self.setupRNN()
        self.setupCTC()

        self.batchesTrained = 0
        self.learningRate = tf.placeholder(tf.float32, shape=[])
        self.update_op = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(self.update_op):
            self.optimizer = tf.train.RMSPropOptimizer(
                self.learningRate).minimize(self.loss)

        (self.sess, self.saver) = self.setupTF()

    def setupCNN(self):
        # create cnn layer and return output of these layers
        cnnIn4d = tf.expand_dims(input=self.inputImg, axis=3)

        # list of parameters for the layers
        kernelVals = [5, 5, 3, 3, 3]
        featureVals = [1, 32, 64, 128, 128, 256]
        strideVals = poolVals = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2)]
        numLayers = len(strideVals)
        # CNN layer
        pool = cnnIn4d

        for i in range(numLayers):
            kernel = tf.Variable(tf.truncated_normal(
                [kernelVals[i], kernelVals[i], featureVals[i], featureVals[i + 1]], stddev=0.1))
            conv = tf.nn.conv2d(
                pool, kernel, padding='SAME', strides=(1, 1, 1, 1))
            conv_norm = tf.layers.batch_normalization(
                conv, training=self.is_train)
            relu = tf.nn.relu(conv_norm)
            pool = tf.nn.max_pool(relu, (1, poolVals[i][0], poolVals[i][1], 1), (
                1, strideVals[i][0], strideVals[i][1], 1), 'VALID')

        self.cnnOut4d = pool

    def setupRNN(self):
        # create rnn layers and return output of these layers
        rnnIn3d = tf.squeeze(self.cnnOut4d, axis=[2])

        numHidden = 256
        cell = [tf.contrib.rnn.LSTMCell(
            num_units=numHidden, state_is_tuple=True) for _ in range(2)]  # 2 layers

        stacked = tf.contrib.rnn.MultiRNNCell(cell, state_is_tuple=True)

        # bidirectional rnn
        ((fw, bw), _) = tf.nn.bidirectional_dynamic_rnn(
            cell_fw=stacked, cell_bw=stacked, inputs=rnnIn3d, dtype=rnnIn3d.dtype)

        concat = tf.expand_dims(tf.concat([fw, bw], 2), 2)

        kernel = tf.Variable(tf.truncated_normal(
            [1, 1, numHidden * 2, len(self.charList) + 1], stddev=0.1))

        self.rnnOut3d = tf.squeeze(tf.nn.atrous_conv2d(
            value=concat, filters=kernel, rate=1, padding='SAME'), axis=[2])

    def setupCTC(self):
        # calculate loss, decode the word and return
        self.ctcIn3d = tf.transpose(self.rnnOut3d, [1, 0, 2])
        self.gtText = tf.SparseTensor(tf.placeholder(tf.int64, shape=[None, 2]), tf.placeholder(
            tf.int32, shape=[None]), tf.placeholder(tf.int64, shape=[2]))

        # loss for batch
        self.seqLen = tf.placeholder(tf.int32, [None])
        self.loss = tf.reduce_mean(tf.nn.ctc_loss(labels=self.gtText, inputs=self.ctcIn3d, sequence_length=self.seqLen, ctc_merge_repeated=True))

        # loss for each element
        self.savedCtcInput = tf.placeholder(
            tf.float32, shape=[Model.maxTextLen, None, len(self.charList)+1])
        self.lossPerElement = tf.nn.ctc_loss(
            labels=self.gtText, inputs=self.savedCtcInput, sequence_length=self.seqLen, ctc_merge_repeated=True)

        # decoder
        if self.decoderType == DecoderType.BestPath:
            self.decoder = tf.nn.ctc_greedy_decoder(
                inputs=self.ctcIn3d, sequence_length=self.seqLen)
        elif self.decoderType == DecoderType.BeamSearch:
            self.decoder = tf.nn.ctc_beam_search_decoder(
                inputs=self.ctcIn3d, sequence_length=self.seqLen, beam_width=50, merge_repeated=False)
        elif self.decoderType == DecoderType.WordBeamSearch:
            word_beam_search_module = tf.load_op_library('TFWordBeamSearch.so')

            # prepare information about language (dictionary, characters in dataset, characters forming words)
            chars = str().join(self.charList)
            wordChars = open('model/wordCharList.txt').read().splitlines()[0]
            corpus = open('data/corpus.txt').read()

            # decode using the "Words" mode of word beam search
            self.decoder = word_beam_search_module.word_beam_search(tf.nn.softmax(
                self.ctcIn3d, dim=2), 50, 'Words', 0.0, corpus.encode('utf8'), chars.encode('utf8'), wordChars.encode('utf8'))

    def setupTF(self):
        # print('Python: '+sys.version)
        # print('Tensorflow: '+tf.__version__)
        # TF session
        sess = tf.Session()  
        saver = tf.train.Saver(max_to_keep=1,reshape=True)  # saver saves model to file
        modelDir = 'model/'     
        latestSnapshot = tf.train.latest_checkpoint(modelDir)  # is there a saved model?

        # if model must be restored (for inference), there must be a snapshot
        if self.mustRestore and not latestSnapshot:
            raise Exception('No saved model found in: ' + modelDir)

        # load saved model if available
        if latestSnapshot:
            print('Init with stored values from ' + latestSnapshot)
            saver.restore(sess, latestSnapshot)
        else:
            print('Init with new values')
            sess.run(tf.global_variables_initializer())

        return (sess, saver)

    def toSparse(self, texts):
    
        indices = []
        values = []
        shape = [len(texts), 0] # last entry must be max(labelList[i])

        # go over all texts
        for (batchElement, text) in enumerate(texts):
            # convert to string of label (i.e. class-ids)
            labelStr = [self.charList.index(c) for c in text]
            # sparse tensor must have size of max. label-string
            if len(labelStr) > shape[1]:
                shape[1] = len(labelStr)
            # put each label into sparse tensor
            for (i, label) in enumerate(labelStr):
                indices.append([batchElement, i])
                values.append(label)

        return (indices, values, shape)

    def decoderOutputToText(self, ctcOutput, batchSize):
    
        # contains string of labels for each batch element
        encodedLabelStrs = [[] for i in range(batchSize)]

        # word beam search: label strings terminated by blank
        if self.decoderType == DecoderType.WordBeamSearch:
            blank = len(self.charList)
            for b in range(batchSize):
                for label in ctcOutput[b]:
                    if label == blank:
                        break
                    encodedLabelStrs[b].append(label)

        # TF decoders: label strings are contained in sparse tensor
        else:
            # ctc returns tuple, first element is SparseTensor
            decoded = ctcOutput[0][0]

            # go over all indices and collect the labels of each batch element
            for (idx, idx2d) in enumerate(decoded.indices):
                label = decoded.values[idx]
                batchElement = idx2d[0]  # index according to [b,t]
                encodedLabelStrs[batchElement].append(label)

        # map labels to chars for all batch elements
        return [str().join([self.charList[c] for c in labelStr]) for labelStr in encodedLabelStrs]

    def trainBatch(self, batch):
        # feed a batch into the NN to train it
        numBatchElements = len(batch.imgs)
        sparse = self.toSparse(batch.gtTexts)
        rate = 0.01 if self.batchesTrained < 10 else (0.001 if self.batchesTrained < 10000 else 0.0001)  # decay learning rate
        evalList = [self.optimizer, self.loss]
        feedDict = {self.inputImg: batch.imgs, self.gtText: sparse, self.seqLen: [Model.maxTextLen] * numBatchElements, self.learningRate: rate, self.is_train: True}
        (_, lossVal) = self.sess.run(evalList, feedDict)
        self.batchesTrained += 1
        return lossVal

    def dumpNNOutput(self, rnnOutput):
        # dump the output of the NN to CSV file(s)
        dumpDir = 'dump/'
        if not os.path.isdir(dumpDir):
            os.mkdir(dumpDir)

        # iterate over all batch elements and create a CSV file for each one
        maxT, maxB, maxC = rnnOutput.shape
        for b in range(maxB):
            csv = ''
            for t in range(maxT):
                for c in range(maxC):
                    csv += str(rnnOutput[t, b, c]) + ';'
                # one CSV row per time-step
                csv += '\n'
            # write the file once per batch element
            fn = dumpDir + 'rnnOutput_'+str(b)+'.csv'
            print('Write dump of NN to file: ' + fn)
            with open(fn, 'w') as f:
                f.write(csv)

    def inferBatch(self, batch, calcProbability=False, probabilityOfGT=False):
        # feed a batch into the NN to recognize the texts

        # decode, optionally save RNN output
        numBatchElements = len(batch.imgs)
        evalRnnOutput = self.dump or calcProbability
        evalList = [self.decoder] + ([self.ctcIn3d] if evalRnnOutput else [])
        feedDict = {self.inputImg: batch.imgs, self.seqLen: [
        Model.maxTextLen] * numBatchElements, self.is_train: False}
        evalRes = self.sess.run(evalList, feedDict)
        decoded = evalRes[0]
        texts = self.decoderOutputToText(decoded, numBatchElements)

        # feed RNN output and recognized text into CTC loss to compute labeling probability
        probs = None
        if calcProbability:
            sparse = self.toSparse(batch.gtTexts) if probabilityOfGT else self.toSparse(texts)
            ctcInput = evalRes[1]
            evalList = self.lossPerElement
            feedDict = {self.savedCtcInput: ctcInput, self.gtText: sparse, self.seqLen: [Model.maxTextLen] * numBatchElements, self.is_train: False}
            lossVals = self.sess.run(evalList, feedDict)
            probs = np.exp(-lossVals)

        # dump the output of the NN to CSV file(s)
        if self.dump:
            self.dumpNNOutput(evalRes[1])

        return (texts, probs)

    def save(self):
        # save model to file
        self.snapID += 1
        self.saver.save(self.sess, 'model/snapshot', global_step=self.snapID)
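
The sparse label format built by `toSparse` can be inspected without TensorFlow. The sketch below is a standalone version of the same conversion with a tiny, made-up character list; `to_sparse` and its arguments are names chosen for this example.

```python
def to_sparse(texts, char_list):
    """Return (indices, values, shape) describing the labels of a batch."""
    indices, values = [], []
    max_len = 0
    for batch_element, text in enumerate(texts):
        # map every character to its class id
        labels = [char_list.index(c) for c in text]
        max_len = max(max_len, len(labels))
        for position, label in enumerate(labels):
            indices.append([batch_element, position])
            values.append(label)
    # dense shape: number of texts x length of the longest text
    return indices, values, [len(texts), max_len]


print(to_sparse(["ab", "c"], "abc"))
# ([[0, 0], [0, 1], [1, 0]], [0, 1, 2], [2, 2])
```

Only the positions that hold a character are stored, so labels of different lengths need no padding symbol.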
 


In [None]:
from __future__ import division
import numpy as np
import os 
import argparse
import cv2

import editdistance
# from imagePreprocessing import preprocess
# from model import Model, DecoderType
# from DataProcessing import DataLoader, Batch

# File path 
class FilePath:
    input = 'data/test.png'
    charList = 'model/charList.txt'
    accuracy = 'model/accuracy.txt'
    train = 'data/'
    corpus = 'data/corpus.txt'

def train(model, loader):
    epoc = 0
    bestCharErrorRate = float('inf')
    noImprovement = 0
    earlyStopping = 5

    while True:
        epoc += 1
        print('Epoch', epoc)
        loader.trainSet()
        print('Training Neural Network')
        while loader.hasNext():
            iterInfo = loader.getIteratorInfo()
            # lower-case name so the Batch class is not shadowed
            batch = loader.getNext()
            loss = model.trainBatch(batch)
            print('Batch : ',iterInfo[0],'/',iterInfo[1],' Loss =',loss)
        
        print('Validate')
        charErrorRate = validate(model,loader)

        if charErrorRate < bestCharErrorRate:
            print('Increase in accuracy. Saving Model')
            bestCharErrorRate = charErrorRate
            noImprovement = 0
            model.save()
            with open(FilePath.accuracy, 'w') as f:
                f.write('Validation character error rate of the saved model: %f%%' % (bestCharErrorRate * 100))
        else:
            print('No increase in Accuracy')
            noImprovement +=1
        
        # stop early if the character error rate has not improved for 5 epochs
        if noImprovement>=earlyStopping:
            break




def validate(model, loader):
    loader.validationSet()
    numCharErr = 0
    numCharTotal = 0
    numWordOK = 0
    numWordTotal = 0
    while loader.hasNext():
        iterInfo = loader.getIteratorInfo()
        print('Batch:', iterInfo[0],'/', iterInfo[1])
        batch = loader.getNext()
        (recognized,_) = model.inferBatch(batch)

        for i in range(len(recognized)):
            # Batch stores the labels in gtTexts
            numWordOK += 1 if batch.gtTexts[i] == recognized[i] else 0
            numWordTotal += 1
            dist = editdistance.eval(recognized[i], batch.gtTexts[i])
            numCharErr += dist
            numCharTotal += len(batch.gtTexts[i])
            print('[OK]' if dist==0 else '[ERR:%d]' % dist,'"' + batch.gtTexts[i] + '"', '->', '"' + recognized[i] + '"')

    # print validation result
    charErrorRate = numCharErr/numCharTotal if numCharTotal !=0 else 0
    wordAccuracy = numWordOK/numWordTotal if numWordTotal !=0 else 0
    print('Character error rate: %f%%. Word accuracy: %f%%.' % (charErrorRate*100.0, wordAccuracy*100.0))
    return charErrorRate

def recognize(model,InImage):
    img = preprocess(cv2.imread(InImage,cv2.IMREAD_GRAYSCALE),Model.imageSize)
    batch = Batch(None,[img])
    (recognized,probability) = model.inferBatch(batch, True)
    print('Recognized:', '"' + recognized[0] + '"')
    print('Probability:', probability[0])


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument('--train', help='train the Neural network', action='store_true')
    parser.add_argument('--validate', help='test the Neural network', action='store_true')
    parser.add_argument('--beamsearch', help='use beam search instead of best path decoding', action='store_true')
    parser.add_argument('--wordbeamsearch', help='use word beam search instead of best path decoding', action='store_true')
    parser.add_argument('--dump', help='dump the NN output to CSV file(s)', action='store_true')

    args = parser.parse_args()
    
    decoderType = DecoderType.BestPath
    if args.beamsearch:
        decoderType = DecoderType.BeamSearch
    elif args.wordbeamsearch:
        decoderType = DecoderType.WordBeamSearch

    if args.train or args.validate :
        # load training data
        # execute training and validation
        loader = DataLoader(FilePath.train,Model.batchSize,Model.imageSize,Model.maxTextLen)
        open(FilePath.charList, 'w').write(str().join(loader.charList))
        open(FilePath.corpus,'w').write(str(' ').join(loader.trainWords + loader.validationWords))

        if args.train:
            # training
            model = Model(loader.charList,decoderType)
            train(model, loader)
        elif args.validate:
            # validate
            model = Model(loader.charList, decoderType, mustRestore=True)
            validate(model, loader)
    else:
        print(open(FilePath.accuracy).read())
        model = Model(open(FilePath.charList).read(), decoderType, mustRestore=True, dump=args.dump)
        recognize(model,FilePath.input)
            
if __name__ == '__main__':
    main()

<h4>RESULT</h4><br>
A model that can recognize words of up to 32 characters was implemented.<br>
Validation Character error rate of the saved model is 16.507551%<br>
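
For readers unfamiliar with the metric, the character error rate is the summed edit distance between recognised and ground-truth words divided by the total number of ground-truth characters. The sketch below uses a small pure-Python Levenshtein distance in place of the `editdistance` package; the word pairs are hypothetical and are not results from this model.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def char_error_rate(pairs):
    """pairs: iterable of (ground_truth, recognised) strings."""
    errors = sum(edit_distance(rec, gt) for gt, rec in pairs)
    total = sum(len(gt) for gt, _ in pairs)
    return errors / total if total else 0.0


# hypothetical examples: 2 character errors over 13 characters
pairs = [("hello", "hallo"), ("word", "word"), ("text", "tex")]
print('%f%%' % (char_error_rate(pairs) * 100))
```

Because the rate is normalised by characters rather than words, one wrong letter in a long word weighs less than in a short one.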


<h4>REFERENCE</h4><br>
<p>1. Handwritten Text Recognition, Batuhan Balci, Dan Saadati, Dan Shiferaw</p>
<h4>LINKS</h4><br>
Full code : https://github.com/smeenapadnekar/HandWritten-Text-Recognition <br>
Dataset : http://www.fki.inf.unibe.ch/databases/iam-handwriting-database <br>
TUTORIALS : https://www.tensorflow.org/tutorials/images/cnn
