Data Science Immersive - Galvanize Final Capstone Project

Handwriting Image Recognition using Keras

Author: Rosie M Martinez, ScD, MPH

This project was created as the final capstone for the Galvanize Data Science Immersive program in Denver, CO. This repo will not be monitored regularly.

Last update: 4/18/2019


The goal of this capstone was to caption images of handwritten text using a CNN-LSTM Seq-2-Seq method. The project was completed over two months; the first month was spent getting a model up and running and understanding the data. For more information on that, click here.

The Data:

The IAM Handwriting Database was first published in 1999 and was developed for handwriting recognition. It contains forms of unconstrained handwritten text. Each form can be divided into lines of text, and each line can be divided into individual words.

Back to Top

In order to access this data, you will need to create a user login and download the words data and the words.txt file from here. Put the entire words folder and the words.txt file into a folder labeled data.

Your file structure should look similar to this:

  ├── checkpoints
  ├── data
  │   ├── labels.json
  │   ├── words.txt
  │   ├── words
  │   │   ├── a01
  │   │   │   ├── a01-000u
  │   │   │   │   ├── a01-000u-00-00.png
  │   │   │   │   └── ...   
  │   │   │   └── ...     
  │   │   ├── a02
  │   │   └── ...
  ├── models
  ├── src
  │   ├── config.yml
  │   ├──
  │   ├──
  │   ├──
  │   ├──
  │   ├──
  │   ├──
  │   └──

The IAM handwriting dataset contains 1,539 forms written by 657 unique individuals. In total, there are 13,353 lines and 115,320 separated words.
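
The file structure above implies that each word image sits under a folder named for the first part of its ID and a subfolder named for the first two parts. As a minimal sketch of that mapping (the helper name `word_image_path` and its `data_dir` parameter are illustrative and not part of this repo):

```python
from pathlib import Path


def word_image_path(word_id: str, data_dir: str = "data") -> Path:
    """Build the PNG path for an IAM word ID such as 'a01-000u-00-00'."""
    parts = word_id.split("-")
    if len(parts) < 4:
        raise ValueError(f"unexpected word ID: {word_id!r}")
    form_group = parts[0]              # e.g. 'a01'
    form_id = "-".join(parts[:2])      # e.g. 'a01-000u'
    return Path(data_dir) / "words" / form_group / form_id / f"{word_id}.png"


print(word_image_path("a01-000u-00-00").as_posix())
# prints: data/words/a01/a01-000u/a01-000u-00-00.png
```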

Back to Top

Data Cleaning:

For the purposes of this capstone, the data were cleaned based on a few features.

  1. Images were first examined, and any image with a width of 900px or greater was excluded (202 words).
  • Many of these extra-wide images were errors: they were images of entire sentences rather than of a single word.
  2. Images listed in the words.txt file as 'err' were excluded (18,662 words).
  • This was a personal preference. I chose to run a sensitivity analysis comparing the results with only the filter from cleaning step 1 against the results with this cleaning step added. With both filters, my word error rate and character error rate were each reduced by 2% and my model ran faster (an hour less).
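
The two cleaning steps can be sketched roughly as follows. This is not the code used in this repo. It assumes the usual words.txt layout (ID, segmentation status, gray level, x, y, w, h, grammatical tag, transcription) and uses the bounding-box width from words.txt as a stand-in for the measured image width. The sample lines are made up for illustration.

```python
MAX_WIDTH = 900  # images this wide or wider are excluded (cleaning step 1)


def parse_words_line(line: str):
    """Parse one non-comment words.txt line into (word_id, status, width, text)."""
    parts = line.split()
    # Assumed field order: id, status, graylevel, x, y, w, h, tag, transcription
    word_id, status = parts[0], parts[1]
    width = int(parts[5])
    text = " ".join(parts[8:])
    return word_id, status, width, text


def clean_words(lines):
    """Return {word_id: transcription} for the words that pass both filters."""
    kept = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        word_id, status, width, text = parse_words_line(line)
        if width >= MAX_WIDTH:
            continue  # step 1: extra-wide images
        if status == "err":
            continue  # step 2: words flagged 'err'
        kept[word_id] = text
    return kept


sample = [
    "# illustrative lines only, not real records",
    "x01-000-00-00 ok 154 408 768 27 51 AT A",
    "x01-000-00-01 err 154 507 766 213 48 NN MOVE",
    "x01-000-00-02 ok 154 100 900 950 60 NN sentence",
]
print(clean_words(sample))  # prints: {'x01-000-00-00': 'A'}
```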

After the data were clean, I ended up with:

  • 96,456 words
    • 78,129 in my training set
    • 8,681 in my validation set
    • 9,646 in my testing set

If you want to use these same splits, make sure that the 'labels_file' entry in your config.yml file uses the data/labels.json path. If you want to create your own split, run the file (making sure the words.txt file is in the data folder).
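
If you create your own split, one simple approach is to shuffle the cleaned word IDs with a fixed seed, hold out a test set, and then hold out a validation set from the remainder. The sketch below is an assumption about how such a split could be done, not the script in this repo; the 10% fractions, the seed, and the `split_ids` name are all illustrative, and the layout of labels.json is not reproduced here.

```python
import random


def split_ids(word_ids, test_frac=0.1, val_frac=0.1, seed=42):
    """Shuffle IDs reproducibly and split them into train/val/test lists."""
    ids = sorted(word_ids)               # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_frac)
    n_val = round((len(ids) - n_test) * val_frac)  # fraction of what remains after the test set
    return {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],
    }


splits = split_ids(f"word-{i:04d}" for i in range(1000))
print({name: len(items) for name, items in splits.items()})
```

Fixing the seed keeps the split stable between runs, so error rates from different training runs stay comparable.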

Back to Top

CNN-LSTM Modeling:

This model uses a convolutional neural network that feeds into a sequential LSTM network using the Seq-2-Seq framework found here.
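
At prediction time, a Seq-2-Seq decoder produces one character per step, feeding each predicted character back in until an end token appears. The sketch below shows only that loop. The `step_fn` argument stands in for the trained LSTM decoder conditioned on the CNN image features, and the toy step function, the token names, and `max_len` are all hypothetical.

```python
START, END = "<s>", "</s>"


def greedy_decode(step_fn, initial_state, max_len=32):
    """Repeatedly ask step_fn for the next character until END or max_len."""
    token, state = START, initial_state
    chars = []
    for _ in range(max_len):
        token, state = step_fn(token, state)
        if token == END:
            break
        chars.append(token)
    return "".join(chars)


def make_toy_step(word):
    """Stand-in for a trained decoder: spells out `word`, then emits END."""
    def step(prev_token, state):
        # Here `state` is just the index of the next character to emit.
        next_token = word[state] if state < len(word) else END
        return next_token, state + 1
    return step


print(greedy_decode(make_toy_step("hello"), initial_state=0))  # prints: hello
```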

Back to Top


In order to predict using this framework, you will need to train the model first. There are two ways you can do this.

  1. iPython or Python method:
  • From the command line:

      python src/
  • From ipython

      run src/
  2. Command Line Interface:

       python -c src/config.yml --train


Once the model has been trained and weights have been saved in the models folder, you can predict on your testing dataset.

  1. iPython or Python method:
  • From the command line:

      python src/
  • From ipython

      run src/
  2. Command Line Interface:

       python -c src/config.yml --predict
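
For readers curious how flags like these might be handled, here is a minimal argparse sketch. It is an assumption about the shape of such an entry point, not this repo's script: the long option name `--config` and the choice to make `--train` and `--predict` mutually exclusive are illustrative.

```python
import argparse


def build_parser():
    """Parser accepting a config path plus exactly one of --train / --predict."""
    parser = argparse.ArgumentParser(description="Train or predict (illustrative sketch)")
    parser.add_argument("-c", "--config", required=True, help="path to the YAML config file")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--train", action="store_true", help="train the model")
    mode.add_argument("--predict", action="store_true", help="predict on the test set")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["-c", "src/config.yml", "--train"])
print(args.config, args.train, args.predict)  # prints: src/config.yml True False
```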

Back to Top


My results based on the model I trained were:

Character Error Rate (CER) = 10.3%
Word Error Rate (WER) = 24.4%
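
The exact formulas behind these two numbers are not spelled out here. The sketch below uses common definitions, which are an assumption: CER as total character-level edit distance divided by the total number of characters in the true labels, and WER as the share of single-word images whose prediction does not match the true label exactly. The example pairs are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """pairs: iterable of (true_label, predicted_label) strings."""
    pairs = list(pairs)
    errors = sum(edit_distance(true, pred) for true, pred in pairs)
    total_chars = sum(len(true) for true, _ in pairs)
    return errors / total_chars


def word_error_rate(pairs):
    """Share of words whose prediction differs from the true label."""
    pairs = list(pairs)
    return sum(1 for true, pred in pairs if true != pred) / len(pairs)


examples = [("the", "the"), ("quick", "quack"), ("fox", "fax")]
print(f"CER = {character_error_rate(examples):.3f}")  # 2 edits / 11 characters
print(f"WER = {word_error_rate(examples):.3f}")       # 2 wrong words / 3 words
```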

Examples of output from prediction:

Back to Top

Examples of words with their true labels and predicted labels:

Back to Top

Final Thoughts:

While this model is far from perfect, there was a lot of progress between the capstone 2 project and this final capstone.

This particular portion of the project focused on generating 'captions' for images of handwritten words. I modified previous work done by giovanniguidi after cleaning the data. I trained a CNN-LSTM based model on this data, using Seq2Seq based logic to predict the next character.

I tried to account for many of the mislabeled words, but there were too many to go through by hand. Regardless, the model worked well and still produced a low character error rate.

One topic I did not address in this project was data augmentation, in particular dealing with slanted or cursive words. Many authors in the literature found that these models have a hard time predicting cursive or slanted words, so a next step would be to add that step to my image processing pipeline so that I get as "clean" an image as possible.

Next Steps:

  • Improve metrics, reducing CER and/or WER for character based model
  • Scale the model up from words to lines on a portion of the data
  • Run CNN-LSTM model on full dataset for words to line
  • Attempt a working demo of the CNN-LSTM words to line model
  • Examine other methods of handwriting recognition (CTC loss)

Back to Top


  1. giovanniguidi's GitHub
  2. Google's Seq2Seq
  3. Tensorhub Github Neural Machine Translation Tutorial
  4. Manish Chabliani's article on Seq2Seq
  5. Cole Murray's article on Building an image caption generator
  6. Sequence to Sequence Learning with Neural Networks paper by Sutskever I, Vinyals O, and Le QV