Skip to content
Optical Character Recognition
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.
Images_Tests Add images for presentation Jun 9, 2016
header Paragraph detection Jun 9, 2016
src Paragraph detection Jun 9, 2016
tests Paragraph detection Jun 9, 2016
.gitignore Doxygen comments Jun 6, 2016
Doxyfile Doxygen comments Jun 6, 2016
Makefile Doxygen comments Jun 6, 2016 Fix wording Jul 11, 2016
dataset.tar.gz Add new characters in dataset May 30, 2016


  • Jeanselme Vincent
  • Clin Matthieu


This project has been undertook from May 24th to June 10th 2016. We want to recognize text from a scanned image. However, we do not use Neural Network in order to understand every issue and complexity of the problem.

Pre Processing

This section presents the different techniques used in order to clean the image and correct its rotation.


Several filters exist, however few have good results on test images where characters are considered as noise. It is why the bilateral and the median filters that are implemented are not used in the current version due to poor results. Linears filters present more interesting results. The software uses a linear filter with a mask which corresponds to a gaussian correction. This way, it reduces the noise.


A global Otsu binarization is used in order to remove all the softened noise. However this method could be more accurate by a local binarization.


First, the projection in the Hough space is computed. Thus, we implemented an heuristic in order to find the rotation needed in order to have the highest number of straight lines parallel to the bottom of the page.

Further work

  • Enhance the text by increasing the contrast.
  • Find other filters in order to delete all the remaining noise.
  • Cut the frame of the text.
  • Correct other distortions.

Segmentation algorithms

This section will present various segmentation algorithms implemented in this project. Algorithms are presented in their order of implementation (which also means complexity in this case).

Sliding window

The sliding window algorithm is the easiest way to dissociate lines and characters but it is also the most limited one. The principle is to look for a row (or column) where all pixels are white (meaning that there is a space between two lines of text or two characters).

###The pros :

  • Easy implementation
  • Able to spot paragraphs and words (bigger space)

###The cons :

  • Noise will be considered as line
  • Low DPI that leads to missing pixels (cutting a character in two) will be considered as two distinct characters.
  • A noise pixel which links two characters will not be cut by this method

Overlapping segmentation

The overlapping segmentation algorithm comes from the publication : Separation of overlapping character N. Papamarkos and T. Koutalianos Department of Electrical and Computer Engineering Democritus University of Thrace, 67100 Xanthi, Greece.. This algorithm shows good result when two characters are overlapping. Overlapping here means that the second character starts before the first one is finished (looking from a vertical line point of view).

###The pros :

  • When there is at least one pixel between two characters, it can separate them
  • Keeps accents on letter (and dot on 'i' and 'j' for example)

###The cons :

  • Characters have to be 'full' (no missing pixels) overwise the algorithm will separate them
  • It cannot separated two merged characters.


The oversegmentation algorithm is a dynamic algorithm. We first try to detect where the cuts are more likely to appear and then we try all the outcomes. This algorithm is bound to the recognition part as it needs it to weight(ponderate ?) the cuts. That's why the results of this algorithm is strongly dependent on the quality of the recognition algorithm and make it tricky to use.


There are three kinds of segmentation :

  • Undersegmentation : The segmentation performed can still group more than one character in a box
  • Segmentation : The segmentation is always correct. One cut means one character
  • Oversegmentation : One character can be cut several times

###The pros :

  • Can segment touching characters (->meaning ?)
  • As seen in the litterature, gives really good results when the recognition part is handled by a neural network

###The cons :

  • Due to oversegmentation, it can easily find two words where there was only one ('m' becomes 'rn')
  • The compute cost grow exponentially if there is a lot of possible cuts
  • Some heuristics have to be made and we haven't been able to found publications on it for now as there all using a neural network to learn the best way to cut..

Character Recognition

This section presents the recognition process.

Clustering - KMeans

Due to the large number of images of each category, it is necessary to compute K clusters for each label. For each letter, the algorithm gathers closest images in an average image. To compute the different clusters, the KMeans algorithm is used.

Recognition - KNN

After computing the clusters, we compare the image obtained by segmentation to the different clusters. The K identic closest pictures define the label of the character. The MSE distance is used, even if Chamfer distance has been tried with and without skeletonization. (-> I didn't understand, but 'has been tried' seems hazardous)

Further work

  • Neural Network is the solution for recognition

Post Processing

Some imperfections remain after the recognition, it is why it is necessary to correct the obtained text.


In order to correct incoherences in the text, one needs to know the language of the text, it is why this software could be used only on english text. The hidden states are the real letters and the observed states are the result of the recogntion. The Viterbi algorithm computes the most probable word respecting the transition between letters in english.


Even after the HMM, there are some mistakes. The algorithm computes the Levenshtein distance between the result and the dictionnary and replaces the word by the closest word. In order to have this correction, add -DLEVENSHTEIN in the Makefile. (the README is not a garbage can >.<)

Further work

  • Change the Levenshtein distance in order to allow more insertion of letters.
  • Take account of the error of over-segmentation in the HMM by creating new states.


The dataset used in this project can be found on The Chars74K dataset page. However we add some special characters as '.', ',', ''', '\', '/' ... In order to insert new characters, you have to create a new directory in which you include new images, then you have to change the bijections between the directory's name and the character label.


Compilation options

  • DSKELETONIZATION : Computes the skeltonization on each charcters.
  • DLEVENSHTEIN : Computes at the end of the HMM analysis the most probable word.
  • DCHAMFER : Computes the distance of Chamfer on each characters.
  • DKNN : Computes the KNN algorithm for recognition.

How to install and run OCR ?

  • Unzip the dataset at the root of the project.
  • Change the options of compilation in the Makefile.
  • And type : "make" at the root.


  • At the first running, there is a long learning phase, which could go on 3 hours.
  • If you change the compilation options, do "make clean; make" and delete the result_dataset and the hmm_save file.
You can’t perform that action at this time.