GitHub - techiewonk/awesome-ocr

1. Software
- 1.1. OCR engines
- 1.2. Older and possibly abandoned OCR engines
- 1.3. OCR file formats
  - 1.3.1. hOCR
  - 1.3.2. ALTO XML
  - 1.3.3. TEI
  - 1.3.4. PAGE XML
- 1.4. OCR CLI
2. Deskewing and Dewarping
- 2.1. OCR GUI
3. Text detection and localization
- 3.1. OCR Preprocessing
4. Segmentation
- 4.1. Line Segmentation
- 4.2. Character Segmentation
- 4.3. Word Segmentation
- 4.4. Document Segmentation
- 4.5. Form Segmentation
5. Handwritten
6. Table detection
7. Language detection
- 7.1. OCR as a Service
- 7.2. OCR evaluation
- 7.3. OCR libraries by programming language
  - 7.3.1. Crystal
  - 7.3.2. Elixir
  - 7.3.3. Go
  - 7.3.4. Java
  - 7.3.5. .Net
  - 7.3.6. Object Pascal
  - 7.3.7. PHP
  - 7.3.8. Python
  - 7.3.9. Javascript
  - 7.3.10. Ruby
  - 7.3.11. Rust
  - 7.3.12. R
  - 7.3.13. Swift
- 7.4. OCR training tools
8. Datasets
- 8.1. Ground Truth
9. Video Text Spotting
10. Font detection
11. Optical Character Recognition Engines and Frameworks
12. Awesome lists
13. Proprietary OCR Engines
14. Cloud based OCR Engines (SaaS)
15. File formats and tools
16. Datasets
17. Data augmentation and Synthetic data generation
18. Pre OCR Processing
19. Post OCR Correction
20. Benchmarks
21. misc
22. Literature
- 22.1. OCR-related publication and link lists
- 22.2. Blog Posts and Tutorials
- 22.3. OCR Showcases
- 22.4. Academic articles
  - 22.4.1. 2011 and before
  - 22.4.2. 2012
  - 22.4.3. 2013
  - 22.4.4. 2014
  - 22.4.5. 2015
  - 22.4.6. 2016
  - 22.4.7. 2017
  - 22.4.8. 2018
  - 22.4.9. 2019
  - 22.4.10. 2020

# Awesome OCR

This list contains links to great software tools and libraries and literature related to Optical Character Recognition (OCR).

Contributions are welcome, as is feedback.

1. Software

1.1. OCR engines

tesseract - The definitive Open Source OCR engine Apache 2.0
EasyOCR - OCR engine built on PyTorch by JaidedAI, Apache 2.0
ocropus - OCR engine based on LSTM, Apache 2.0
ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
kraken - Ocropus fork with sane defaults
gocr - OCR engine under the GNU Public License led by Joerg Schulenburg.
Ocrad - The GNU OCR. GPL
ocular - Machine-learning OCR for historic documents
SwiftOCR - fast and simple OCR library written in Swift
attention-ocr - OCR engine using visual attention mechanisms
RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
simple-ocr-opencv and its fork - A simple pythonic OCR engine using opencv and numpy
Calamari - OCR Engine based on OCRopy and Kraken
doctr - A seamless & high-performing OCR library powered by Deep Learning

1.2. Older and possibly abandoned OCR engines

Clara OCR - Open source OCR in C GPL
Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
Eye - an experimental Java OCR (image-to-text) application
kognition - An omnifont OCR software for KDE
OCRchie - Modular Optical Character Recognition Software
ocre - o.c.r. easy
xplab - A GTK 2 tool for pattern matching
hebOCR - Hebrew character recognition library (previously named hocr, see Wikipedia article) GPL

1.3. OCR file formats

1.3.1. hOCR

hocr-tools - Tools for doing various useful things with hOCR files, Apache 2.0
hocr-spec - hOCR 1.2 specification
ocr-transform - CLI tool to convert between hOCR and ALTO, MIT
hocr-parser - hOCR Specification Python Parser
hOCRTools - hOCR to ALTO conversion XSLT
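
For orientation, hOCR embeds layout information in ordinary HTML: each element carries a class such as `ocrx_word` and a `title` attribute with properties like `bbox x0 y0 x1 y1` and `x_wconf`. The following is a minimal sketch using only the Python standard library. The sample markup and the `parse_hocr_words` helper are made up for illustration and are not part of any tool listed above; the sketch also assumes the file is well-formed XHTML, which not every hOCR producer guarantees.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment, for illustration only.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class="ocr_page" title="bbox 0 0 800 600">
   <span class="ocr_line" title="bbox 10 20 300 50">
    <span class="ocrx_word" title="bbox 10 20 120 50; x_wconf 96">Hello</span>
    <span class="ocrx_word" title="bbox 130 20 300 50; x_wconf 91">world</span>
   </span>
  </div>
 </body>
</html>"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+)")


def parse_hocr_words(hocr_text):
    """Return a (text, bbox, confidence) tuple for each ocrx_word element."""
    root = ET.fromstring(hocr_text)
    words = []
    for elem in root.iter():
        if elem.get("class") != "ocrx_word":
            continue
        title = elem.get("title", "")
        bbox_match = BBOX_RE.search(title)
        conf_match = CONF_RE.search(title)
        # Missing properties are reported as None rather than guessed.
        bbox = tuple(int(v) for v in bbox_match.groups()) if bbox_match else None
        conf = int(conf_match.group(1)) if conf_match else None
        words.append(("".join(elem.itertext()).strip(), bbox, conf))
    return words


words = parse_hocr_words(SAMPLE_HOCR)
for text, bbox, conf in words:
    print(text, bbox, conf)
```

For real-world files, the dedicated parsers listed above are likely a better choice than regular expressions over the `title` attribute.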

1.3.2. ALTO XML

ALTO XML Schema - XML Schema and development of the ALTO XML format
ALTO XML Documentation - Documentation and use cases for ALTO
alto-tools - Various tools to work with ALTO files, Python
AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
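
As a rough illustration of the format, ALTO stores recognized words as `String` elements with a `CONTENT` attribute, grouped into `TextLine` elements. The sketch below reads the text line by line with the Python standard library. The sample document is simplified (it is not schema-valid) and the `alto_lines` helper is a hypothetical name; neither comes from the tools listed above. Namespaces are stripped so the same code works across ALTO versions, which is a convenience choice rather than a recommended practice.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical ALTO fragment, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1" WIDTH="800" HEIGHT="600">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Optical" HPOS="10" VPOS="20" WIDTH="90" HEIGHT="30"/>
            <SP/>
            <String CONTENT="character" HPOS="110" VPOS="20" WIDTH="120" HEIGHT="30"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="recognition" HPOS="10" VPOS="60" WIDTH="150" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_text):
    """Return the text of each TextLine, joining String contents with spaces."""
    root = ET.fromstring(alto_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


lines = alto_lines(SAMPLE_ALTO)
print(lines)
```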

1.3.3. TEI

TEI-OCR - TEI customization for OCR generated layout and content information
TEI SIG on Libraries - Best Practices for TEI in Libraries
GDZ - METS/TEI-based GDZ document format

1.3.4. PAGE XML

PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
py-pagexml - Python library for handling PAGE XML and OPF files.

1.4. OCR CLI

OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
Ocrocis - Project manager interface for Ocropy, see also external project homepage
tesseract-recognize - Tesseract-based tool that outputs result in Page XML format (docker image).

2. Deskewing and Dewarping

MORAN_v2 (paper:2019) - A Multi-Object Rectified Attention Network for Scene Text Recognition
thomasjhaung/deep-learning-for-document-dewarping
unproject_text - Perspective recovery of text using transformed ellipses
unpaper - a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies.
deskew - Library used to deskew a scanned document
deskewing - Contains code to deskew images using MLPs, LSTMs and LLS transformations
skew_correction - De-skewing images with slanted content by finding the deviation using Canny Edge Detection.
page_dewarp - Page dewarping and thresholding using a "cubic sheet" model
text_deskewing - Rotate text images if they are not straight for better text detection and recognition.
galfar/deskew - Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.
xellows1305/Document-Image-Dewarping - No code :(
https://github.com/RaymondMcGuire/BOOK-CONTENT-SEGMENTATION-AND-DEWARPING
Docuwarp
Alyn
DewarpNet
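
Several entries above, such as galfar/deskew, estimate the skew angle with a Hough transform over text lines. The idea can be sketched in pure Python as follows. This is not the code of any listed tool: the `estimate_skew_degrees` function, its parameters, and the sum-of-squares scoring of the accumulator are assumptions made for this illustration, and the input is a list of synthetic points rather than a real scanned image.

```python
import math


def estimate_skew_degrees(points, max_angle=15.0, step=0.5, rho_bin=1.0):
    """Estimate the text-line angle by Hough-style voting.

    points: list of (x, y) coordinates of foreground pixels.
    For each candidate angle, every point votes for the offset
    rho = y*cos(a) - x*sin(a) of the line through it at that angle.
    When the angle matches the text lines, the votes pile up in a few
    bins, so the sum of squared bin counts is largest.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for k in range(steps + 1):
        angle = -max_angle + k * step
        a = math.radians(angle)
        cos_a, sin_a = math.cos(a), math.sin(a)
        votes = {}
        for x, y in points:
            key = round((y * cos_a - x * sin_a) / rho_bin)
            votes[key] = votes.get(key, 0) + 1
        score = sum(count * count for count in votes.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic "page": three parallel text baselines tilted by 5 degrees.
true_angle = math.radians(5.0)
points = [
    (x, x * math.tan(true_angle) + offset)
    for offset in (100, 140, 180)
    for x in range(0, 1000, 5)
]
skew = estimate_skew_degrees(points)
print(f"estimated skew: {skew} degrees")
```

Rotating the image by the negative of the estimated angle would then make the lines horizontal. Production tools typically work on binarized images and use optimized accumulators.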

2.1. OCR GUI

moz-hocr-editor - Firefox Addon for editing hOCR files Discontinued
qt-box-editor - QT4 editor of tesseract-ocr box files.
ocr-gt-tools - Client-Server application for editing OCR ground truth.
Paperwork - Using scanners and OCR to grep paper documents the easy way.
Paperless - Scan, index, and archive all of your paper documents.
gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
VietOCR - A Java/.NET GUI frontend for Tesseract OCR engine, including jTessBoxEditor a graphical Tesseract box data editor
PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
PRImA PAGE Viewer - Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR.
LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in @jbaiter/archiscribe-corpus.
nw-page-editor - Simple app for visual editing of Page XML files. Provides desktop and server docker-based versions.

3. Text detection and localization

DB
DeepReg
CornerText - (paper:2018) - Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation
RRPN - (paper:2018) - Arbitrary-Oriented Scene Text Detection via Rotation Proposals
MASTER-TF - (paper:2021) - TensorFlow reimplementation of "MASTER: Multi-Aspect Non-local Network for Scene Text Recognition" (Pattern Recognition 2021).
MaskTextSpotterV3 - (paper:2020) - Mask TextSpotter v3 is an end-to-end trainable scene text spotter that adopts a Segmentation Proposal Network (SPN) instead of an RPN.
TextFuseNet - (paper:2020) A PyTorch implementation of "TextFuseNet: Scene Text Detection with Richer Fused Features".
SATRN - (paper:2020) - Official Tensorflow Implementation of Self-Attention Text Recognition Network (SATRN) (CVPR Workshop WTDDLE 2020).
cvpr20-scatter-text-recognizer - (paper:2020) - Unofficial implementation of CVPR 2020 paper "SCATTER: Selective Context Attentional Scene Text Recognizer"
seed - (paper:2020, https://arxiv.org/pdf/2005.10977.pdf) - This is the implementation of the paper "SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition"
vedastr - A scene text recognition toolbox based on PyTorch
AutoSTR - (paper:2020) Efficient Backbone Search for Scene Text Recognition
Decoupled-attention-network - (paper:2019) Pytorch implementation for "Decoupled attention network for text recognition".
Bi-STET - (paper:2020) Implementation of Bidirectional Scene Text Recognition with a Single Decoder
kiss - (paper:2019)
Deformable Text Recognition - (paper:2019)
MaskTextSpotter - (paper:2019)
CUTIE - (paper:2019)
AttentionOCR - (paper:2019)
crpn - (paper:2019)
Scene-Text-Detection-with-SPECNET - Repository for Scene Text Detection with Supervised Pyramid Context Network with tensorflow.
Character-Region-Awareness-for-Text-Detection
Real-time-Scene-Text-Detection-and-Recognition-System - End-to-end pipeline for real-time scene text detection and recognition.
ocr_attention - Robust Scene Text Recognition with Automatic Rectification.
masktextspotter.caffee2 - The code of "Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes".
InceptText-Tensorflow - An implementation of the algorithm in the paper IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection.
textspotter - An End-to-End TextSpotter with Explicit Alignment and Attention
RRD - RRD: Rotation-Sensitive Regression for Oriented Scene Text Detection.
crpn - Corner-based Region Proposal Network.
SSTDNet - Implement 'Single Shot Text Detector with Regional Attention, ICCV 2017 Spotlight'.
R2CNN - caffe re-implementation of R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection.
RRPN - Source code of RRPN ---- Arbitrary-Oriented Scene Text Detection via Rotation Proposals
Tensorflow_SceneText_Oriented_Box_Predictor - This project modifies the TensorFlow Object Detection API code to predict oriented bounding boxes. It can be used for scene text detection.
DeepSceneTextReader - A C++ project deploying a deep scene text reading pipeline with TensorFlow. It reads text from natural scene images using frozen TensorFlow graphs. The detector finds scene text locations, and the recognizer reads the word in each detected bounding box.
DeRPN - A novel region proposal network for more general object detection ( including scene text detection ).
Bartzi/see - SEE: Towards Semi-Supervised End-to-End Scene Text Recognition
Bartzi/stn-ocr - Code for the paper STN-OCR: A single Neural Network for Text Detection and Text Recognition
beacandler/R2CNN - caffe re-implementation of R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection
HsiehYiChia/Scene-text-recognition - Scene text detection and recognition based on Extremal Region(ER)
R2CNN_Faster-RCNN_Tensorflow - Rotational region detection based on Faster-RCNN.
corner - Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation
Corner_Segmentation_TextDetection - Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation.
TextSnake.pytorch - A PyTorch implementation of ECCV2018 Paper: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
AON - Implementation for CVPR 2018 text recognition Paper by Tensorflow: "AON: Towards Arbitrarily-Oriented Text Recognition"
pixel_link - Implementation of our paper 'PixelLink: Detecting Scene Text via Instance Segmentation' in AAAI2018
seglink - An implementation of the seglink algorithm in the paper Detecting Oriented Text in Natural Images by Linking Segments (=> pixel_link)
SSTD - Single Shot Text Detector with Regional Attention
MORAN_v2 - MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition
Curve-Text-Detector - This repository provides train & test code, dataset, det. & rec. annotation, evaluation script, annotation tool, and ranking table.
HCIILAB/DeRPN - A novel region proposal network for more general object detection ( including scene text detection ).
TextField - TextField: Learning A Deep Direction Field for Irregular Scene Text Detection (TIP 2019)
tensorflow-TextMountain - TextMountain: Accurate Scene Text Detection via Instance Segmentation
Bartzi/see - Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
bgshih/aster - Recognizing cropped text in natural images.
ReceiptParser - A fuzzy receipt parser written in Python.
vedastr

3.1. OCR Preprocessing

NoiseRemove.java in MathOCR - Java implementation of Adaptive degraded document image binarization by B. Gatos , I. Pratikakis, S.J. Perantonis
binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
typeface-corpus - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities.
binarizewolfjolion - Comparison of binarization algorithms. Blog post
crop_morphology.py in oldnyc - Cropping a page to just the text block
Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
Fred's ImageMagick script textcleaner - Processes a scanned document of text to clean the text background
localcontrast - Fast O(1) local contrast optimization
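
Several of the binarization entries above build on Sauvola's method, which computes a separate threshold for every pixel from the local mean m and standard deviation s of its neighbourhood: T = m * (1 + k * (s / R - 1)). The sketch below is a deliberately naive pure-Python version for small images. The function name, the window radius, and the values k = 0.5 and R = 128 (commonly cited defaults for 8-bit images) are assumptions for this illustration and are not taken from any of the listed implementations.

```python
import math


def sauvola_binarize(image, radius=1, k=0.5, R=128.0):
    """Binarize a grayscale image with per-pixel Sauvola thresholds.

    image: list of rows of grayscale values (0 = black, 255 = white).
    Returns rows of booleans, True where the pixel counts as foreground
    (darker than its local threshold). Windows are clipped at the borders.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            variance = sum((v - mean) ** 2 for v in window) / len(window)
            threshold = mean * (1 + k * (math.sqrt(variance) / R - 1))
            row.append(image[y][x] < threshold)
        result.append(row)
    return result


# Tiny synthetic page: one dark "ink" pixel on a light background.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
mask = sauvola_binarize(page)
for mask_row in mask:
    print("".join("#" if fg else "." for fg in mask_row))
```

This version recomputes every window from scratch, so it is only suitable for demonstration. Practical implementations usually rely on integral images to get the local statistics in constant time per pixel.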

4. Segmentation

4.1. Line Segmentation

ARU-Net - Deep Learning Chinese Word Segment
sbb_textline_detector

4.2. Character Segmentation

4.3. Word Segmentation

4.4. Document Segmentation

LayoutParser
eynollah
chulwoopack/docstrum
LAREX - LAREX is a semi-automatic open-source tool for layout analysis on early printed books.
leonlulu/DeepLayout - Deep learning based page layout analysis
dhSegment
Pay20Y/Layout_Analysis
rbaguila/document-layout-analysis
P2PaLA - Page to PAGE Layout Analysis Tool
ocroseg - This is a deep learning model for page layout analysis / segmentation.
DIVA-DIA/DIVA_Layout_Analysis_Evaluator - Layout Analysis Evaluator for the ICDAR 2017 competition on Layout Analysis for Challenging Medieval Manuscripts
ocrsegment - a deep learning model for page layout analysis / segmentation.
ARU-Net
xy-cut-tree
ocrd_segment
LayoutLM
LayoutLMv2
eynollah

4.5. Form Segmentation

https://github.com/doxakis/form-segmentation

5. Handwritten

https://github.com/arthurflor23/handwritten-text-recognition
https://github.com/awslabs/handwritten-text-recognition-for-apache-mxnet
https://github.com/0x454447415244/HandwritingRecognitionSystem
https://github.com/SparshaSaha/Handwritten-Number-Recognition-With-Image-Segmentation
https://github.com/ThomasDelteil/HandwrittenTextRecognition_MXNet
SimpleHTR - Handwritten Text Recognition (HTR) system implemented with TensorFlow.
handwriting-ocr - OCR software for recognition of handwritten text
AWSLabs: handwritten text recognition for Apache MXNet
vloison/Handwritten_Text_Recognition
https://github.com/sushant097/Handwritten-Line-Text-Recognition-using-Deep-Learning-with-Tensorflow
https://github.com/qurator-spk/sbb_textline_detection

6. Table detection

TableNet - Unofficial implementation of ICDAR 2019 paper : TableNet: Deep Learning model for end-to-end Table detection and Tabular data extraction from Scanned Document Images.
image-table-ocr
TreeStructure - Table Extraction Tool
TableTrainNet - Table recognition inside documents using neural networks.
table_layout_detection_research
TableBank
Camelot
ocr-table - Extract tables from scanned image PDFs using Optical Character Recognition.
ExtractTable-py
image-table-ocr

7. Language detection

lingua - The most accurate natural language detection library for Java and other JVM languages, suitable for long and short text alike
langdetect
whatthelang - Lightning Fast Language Prediction
wiki-lang-detect

7.1. OCR as a Service

Open OCR - Run Tesseract in Docker containers
tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
docker-ocropy - A Docker container for running the ocropy OCR system.
ABBYY Cloud OCR SDK Code samples - Code samples for using the proprietary commercial ABBYY OCR API.
nidaba - An expandable and scalable OCR pipeline
gamera - A meta-framework for building document processing applications, e.g. OCR
ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
ocrad-docker - Run the ocrad OCR engine in a docker container
kraken-docker - Run the kraken OCR engine in a docker container
Konfuzio - Free Online OCR up to 2,000 pages per month and OCR API by @atraining, see https://youtu.be/NZKUrKyFVA8 (code is not open)
ocr.space - Free Online OCR and OCR API by @a9t9 based on Tesseract (code is not open)
OCR4all - Provides OCR services through web applications. Included Projects: LAREX, OCRopus, calamari and nashi.

7.2. OCR evaluation

ISRI OCR Evaluation Tools with a User Guide from 1996
- isri-ocr-evaluation-tools - further development by @eddieantonio (2015, 2016)
- ancientgreekocr-evaluation-tools - further development by @nickjwhite (2013, 2014)
ocrevalUAtion - Cross-format evaluation, CLI and GUI
ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
quack - Quality-Assurance-tool for scans with corresponding ALTO-files
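
The tools above compare OCR output against ground truth. One widely used metric is the character error rate (CER): the edit distance between the two texts divided by the length of the ground truth. The following minimal sketch shows the computation. The function names are chosen for this example, and the listed tools may normalize text or count errors differently.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical example: one substituted character in a seven-character word.
cer = character_error_rate("Fraktur", "Fraftur")
print(f"CER: {cer:.3f}")
```

Because insertions count as errors, the CER can exceed 1.0 when the OCR output is much longer than the ground truth.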

7.3. OCR libraries by programming language

7.3.1. Crystal

tesseract-ocr - A Crystal wrapper for tesseract-ocr.

7.3.2. Elixir

tesseract_ocr - Elixir library wrapping the tesseract executable.

7.3.3. Go

gosseract - Golang OCR library, wrapping Tesseract-ocr.

7.3.4. Java

Tess4J - Java Native Access bindings to Tesseract.
tess-two - Tools for compiling Tesseract on Android and Java API.

7.3.5. .Net

tesseract for .net - A .Net wrapper for tesseract-ocr.

7.3.6. Object Pascal

TTesseractOCR4 - Object Pascal binding for tesseract-ocr 4.x.

7.3.7. PHP

Tesseract OCR for PHP - Tesseract PHP bindings.

7.3.8. Python

pytesseract - A Python wrapper for Google Tesseract.
pyocr - A Python wrapper for Tesseract and Cuneiform.
ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
tesserocr - A Python wrapper for the tesseract-ocr API

7.3.9. Javascript

ocracy - pure javascript lstm rnn implementation based on ocropus
gocr.js - Javascript port (emscripten) of gocr
ocrad.js - Javascript port (emscripten) of ocrad
tesseract.js - Javascript port (emscripten) of Tesseract
node-tesseract-ocr - A simple wrapper for the Tesseract OCR package.
node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.

7.3.10. Ruby

rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
ocr_space - API wrapper for free ocr service ocr.space. Includes CLI

7.3.11. Rust

tesseract.rs - Rust bindings for tesseract OCR.
leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.

7.3.12. R

tesseract - R bindings for tesseract OCR.

7.3.13. Swift

Tesseract OCR iOS - Swift and Objective-C wrapper for Tesseract OCR.
SwiftOCR - Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes.

7.4. OCR training tools

glyph-miner - A system for extracting glyphs from early typeset prints
ocrodeg - Document image degradation for OCR data augmentation

8. Datasets

8.1. Ground Truth

archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via archiscribe CC-BY 4.0
CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for PoCoTo

Rescribe - Transcriptions of Caroline Minuscule Manuscripts PDM 1.0

CLTK - Corpora from Classical Language Toolkit PDM 1.0
DIVA-HisDB - 150 pages (PAGE XML) of three medieval manuscripts CC-BY-NC 3.0
EarlyPrintedBooks - ~8,800 lines from several early printed books CC-BY-NC-SA 4.0
EEBO-TCP - 25,363 EEBO documents transcribed by TCP PDM 1.0
ECCO-TCP - 2,188 ECCO documents transcribed by TCP PDM 1.0
eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by eMOP PDM 1.0
Evans-TCP - 4,977 Evans documents transcribed by TCP
FDHN - Finnish Digitised Historical Newspapers, Paper, (free) registration required, Terms of Use
FROC-MSS - 4 Old French Medieval Manuscripts CC-BY 4.0
GERMANA - 764 Spanish manuscript pages, (free) registration required non-commercial use only
GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin CC-BY 4.0
imagessan - Sanskrit images & ground truth (Devanagari script)
IMPACT-BHL - 2,418 pages (PAGE XML) from the Biodiversity Heritage Library, XML@GitHub CC-BY 3.0
IMPACT-BL - 294 pages (PAGE XML) from the British Library, (free) registration required PDM 1.0
IMPACT-BNE - 215 pages (PAGE XML) from the National Library of Spain, (free) registration required, XML@GitHub CC-BY-NC-SA 4.0
IMPACT-BNF - 151 pages (PAGE XML) from the National Library of France, (free) registration required CC-BY-NC-SA 4.0
IMPACT-KB - 142 pages (PAGE XML) from the National Library of the Netherlands CC-BY 4.0
IMPACT-NKC - 187 pages (PAGE XML) from the Czech National Library, (free) registration required CC-BY-NC-SA 4.0
IMPACT-NLB - 19 pages (PAGE XML) from the National Library of Bulgaria, (free) registration required CC-BY-NC-ND 4.0
IMPACT-NUK - 209 pages (PAGE XML) from the National Library of Slovenia, (free) registration required CC-BY-NC-SA 4.0
IMPACT-PSNC - 478 pages (PAGE XML) from four Polish digital libraries, XML@GitHub CC-BY 3.0
LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
MJSynth - 9m synthetic images covering 90k English words
OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via Text+Berg digital CC-BY 4.0
OCR-D - 180 pages (PAGE XML) of German historical prints from OCR-D CC-BY-SA 4.0
OCR_GS_Data - Double-checked Arabic Gold Standard from OpenITI
old-books - 322 old books from Project Gutenberg GPL 3.0
PRImA-ENP - 528 pages (PAGE XML) of historic newspapers from Europeana Newspapers, (free) registration required PDM 1.0
RODRIGO - 853 Spanish manuscript pages, (free) registration required non-commercial use only
Toebler-OCR - (Kraken) Ground Truth transcription of few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch

9. Video Text Spotting

10. Font detection

typefont - The first open-source library that detects the font of a text in a image.

11. Optical Character Recognition Engines and Frameworks

DAVAR-lab-OCR
CRNN.tf2
ocr.pytorch
PytorchOCR
MMOCR
doctr
Master OCR
xiaofengShi/CHINESE-OCR
PaddleOCR
Urdu-Ocr
ocr.pytorch
ocular - Ocular is a state-of-the-art historical OCR system.
OCR++
pytextrator - Python OCR using Tesseract with the EAST OpenCV detector
OCR-D
ocrd_tesserocr
Deeplearning-OCR
PICCL
cnn_lstm_ctc_ocr - Tensorflow-based CNN+LSTM trained with CTC-loss for OCR.
PassportScanner - Scan the MRZ code of a passport and extract the first name, last name, passport number, nationality, date of birth, expiration date and personal number.
pannous/tensorflow-ocr - OCR using tensorflow with attention.
BowieHsu/tensorflow_ocr - OCR detection implemented with tensorflow v1.4.
GRCNN-for-OCR - This is the implementation of the paper "Gated Recurrent Convolution Neural Network for OCR"
go-ocr - A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.
insightocr - MXNet OCR implementation. Including text recognition and detection.
ocr_densenet - First place in the first Xi'an Jiaotong University Artificial Intelligence Practice Contest (2018 AI Practice Contest - Picture Text Recognition); uses only DenseNet to recognize the Chinese characters
CNN_LSTM_CTC_Tensorflow - CNN+LSTM+CTC based OCR implemented using tensorflow.
tmbdev/clstm - A small C++ implementation of LSTM networks, focused on OCR.
VistaOCR
tesseract.js
Tesseract
kaldi
ocropus3 - Repository collecting all the submodules for the new PyTorch-based OCR System.
calamari
ocropy - Python-based tools for document analysis and OCR
chinese_ocr
deep_ocr - make a better chinese character recognition OCR than tesseract.
ocular
textDetectionWithScriptID
Transkribus
FastText - Library for efficient text classification and representation learning
GOCR
Ocrad
franc - Natural language detection
ocrfeeder
emedvedev/attention-ocr - A Tensorflow model for text recognition (CNN + seq2seq with visual attention) available as a Python package and compatible with Google Cloud ML Engine.
da03/attention-ocr - Visual Attention based OCR
dhlab-epfl/dhSegment - Generic framework for historical document processing
https://github.com/mawanda-jun/TableTrainNet
https://github.com/kermitt2/delft
https://github.com/chulwoopack/docstrum
grobid - A machine learning software for extracting information from scholarly documents
lapdftext - LA-PDFText is a system for extracting accurate text from PDF-based research articles
https://github.com/beratkurar/textline-segmentation-using-fcn
https://github.com/OCR4all
https://github.com/OCR4all/LAREX
https://github.com/OCR4all/OCR4all
https://github.com/andbue/nashi
http://kraken.re/
kraken
gosseract - Go package for OCR (Optical Character Recognition), by using Tesseract C++ library.
EasyOCR - Ready-to-use OCR with 40+ languages supported including Chinese, Japanese, Korean and Thai.
invoice-scanner-react-native
Arabic-OCR

12. Awesome lists

13. Proprietary OCR Engines

14. Cloud based OCR Engines (SaaS)

15. File formats and tools

nw-page-editor - Simple app for visual editing of Page XML files
hocr
alto
PageXML
ocr-fileformat - Validate and transform various OCR file formats
hocr-tools - Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

16. Datasets

http://www.iapr-tc11.org/mediawiki/index.php/Datasets_List
https://icdar2019.org/competitions-2/
https://rrc.cvc.uab.es/#
https://lionbridge.ai/datasets/15-best-ocr-handwriting-datasets/
https://github.com/xylcbd/ocr-open-dataset
ICDAR datasets
https://github.com/OpenArabic/OCR_GS_Data
https://github.com/cs-chan/Total-Text-Dataset
scenetext - This is a synthetically generated dataset, in which word instances are placed in natural scene images, while taking into account the scene layout.
Total-Text-Dataset
ocr-open-dataset

17. Data augmentation and Synthetic data generation

DocCreator - DIAR software for synthetic document image and groundtruth generation, with various degradation models for data augmentation.
Scene-Text-Image-Transformer - Scene Text Image Transformer
Belval/TextRecognitionDataGenerator - A synthetic data generator for text recognition
Sanster/text_renderer
awesome-SynthText
Text-Image-Augmentation
UnrealText
SynthText_Chinese_version

18. Pre OCR Processing

ajgalleo/document-image-binarization
PRLib - Pre-Recognize Library - library with algorithms for improving OCR quality.
sbb_binarization

19. Post OCR Correction

KBNLresearch/ochre - Toolbox for OCR post-correction
cisocrgroup/PoCoTo - The CIS OCR PostCorrectionTool
afterscan

20. Benchmarks

TedEval
clovaai/deep-text-recognition-benchmark - Text recognition (optical character recognition) with deep learning methods.
dinglehopper - dinglehopper is an OCR evaluation tool and reads ALTO, PAGE and text files.
CLEval

21. misc

ocrodeg - a small Python library implementing document image degradation for data augmentation for handwriting recognition and OCR applications.
scantailor - Scan Tailor is an interactive post-processing tool for scanned pages.
jlsutherland/doc2text - Helps researchers fix OCR errors and extract the highest-quality text possible from their PDFs.
mauvilsa/nw-page-editor - Simple app for visual editing of Page XML files.
Transkribus - Transkribus is a comprehensive platform for the digitisation, AI-powered recognition, transcription and searching of historical documents.
http://projectnaptha.com/
https://github.com/4lex4/scantailor-advanced
open-semantic-search - Open Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
ocrserver - A simple OCR API server, seriously easy to be deployed by Docker, on Heroku as well
cosc428-structor - ~1000 book pages + OpenCV + python = page regions identified as paragraphs, lines, images, captions, etc.
nidaba - An expandable and scalable OCR pipeline
https://github.com/MaybeShewill-CV/CRNN_Tensorflow
OCRmyPDF

22. Literature

22.1. OCR-related publication and link lists

IMPACT: Tools for text digitisation - List of tools and software projects for text digitisation, some related to OCR
OCR-D - List of OCR-related academic articles in the context of the OCR-D project. 🇩🇪
Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
Wikipedia: Comparison of optical character recognition software
OCR [and Deep Learning] by @handong1587
Ocropus Wiki: Publications

22.2. Blog Posts and Tutorials

Tesseract Blends Old and New OCR Technology (2016) @theraysmith
- Tutorial@DAS2016, Updated "What You Always Wanted to Know" slides
What You Always Wanted To Know About Tesseract (2014) @theraysmith
- Tutorial@DAS2014, includes demos
Extracting text from an image using Ocropus (2015)
Training an Ocropus OCR model (2015) @danvk
Ocropus Wiki: Compute errors and confusions (2016) @zuphilip
Ocropus Wiki: Working with Ground Truth (2016) @zuphilip
OCRopus (2016) @jze
- mostly on column separation in ocropus
10 Tips for making your OCR project succeed (2013) @cneud
- general things to consider for OCR projects
Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
- feature list for a commercial image pre-processing library; has nice before-after samples for pre-processing steps related to OCR
Extracting Text from PDFs; Doing OCR; all within R @shawngraham
- How to work with OCR from PDFs in the R programming environment
Tutorial: Command-line OCR on a Mac @bmschmidt
- Tutorial on how to run tesseract in Mac OSX
Practical Experience with OCRopus Model Training (2016) @jze
Homemade Manuscript OCR (1): OCRopy (2017) @Jean-Baptiste-Camps
- Tutorial on applying OCR to medieval manuscripts with OCRopy
Optimizing Binarization for OCRopus (2017) @jze
Prototype demo for OCR postfix in Danish Newspapers (2016) @thomasegense
How Can I OCR My Dictionary? (2016) @JessedeDoes
"Needlessly complex" blog (2016) @mzucker. Several image processing how-tos (Python based), particularly:
- Page dewarping (code)
- Compressing and enhancing hand-written notes (code)
- Unprojecting text with ellipses (code)
(Open-Source-)OCR-Workflows (2017) @wrznr 🇩🇪 overview of the state of the art in open source OCR and related technologies (binarisation, deskewing, layout recognition, etc.), lots of example images and information on the @OCR-D project.
A gentle introduction to OCR (2018) @shgidi
Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR (2019) @eliaskreyenbuehl 🇩🇪 A reflection/criticism on OCR quality, OCR pitfalls in Fraktur fonts.

22.3. OCR Showcases

abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
MathOCR - A printed scientific document recognition system, pre-alpha

22.4. Academic articles

22.4.7. 2017

Telugu OCR Framework using Deep Learning (2015/2017) Achanta, Hastie
- see also TeluguOCR, banti_telugu_ocr, chamanti_ocr, #49
EAST(official) - (tf1/py2) A tensorflow implementation of EAST text detector
AdvancedEAST - (tf1/py2) An algorithm for scene image text detection, primarily based on EAST, with significant improvements that make long text predictions more accurate.
kurapan/EAST - Implementation of EAST scene text detector in Keras
songdejia/EAST - This is a pytorch re-implementation of EAST: An Efficient and Accurate Scene Text Detector.
HaozhengLi/EAST_ICPR - Forked from argman/EAST for the ICPR MTWI 2018 CHALLENGE
deepthinking-qichao/EAST_ICPR2018
SakuraRiven/EAST
EAST-Detector-for-text-detection-using-OpenCV - Text Detection from images using OpenCV
easy-EAST

22.4.8. 2018

A Two-Stage Method for Text Line Detection in Historical Documents (2018) Grüning, Leifert, Strauß, Labahn. Code available at https://github.com/TobiasGruening/ARU-Net
tensorflow_PSENet - This is a tensorflow re-implementation of PSENet: Shape Robust Text Detection with Progressive Scale Expansion Network
PAN-PSEnet
PSENet - Shape Robust Text Detection with Progressive Scale Expansion Network.
FOTS paper:2018
FOTS - An Implementation of the FOTS: Fast Oriented Text Spotting with a Unified Network.
FOTS_OCR
TextBoxes++ paper:2018
TextBoxes_plusplus (official) TextBoxes++: A Single-Shot Oriented Scene Text Detector
Shun14/TextBoxes_plusplus_Tensorflo - Textboxes_plusplus implementation with Tensorflow (python)

22.4.9. 2019

CRAFT paper:2019
CRAFT-pytorch (official) - Pytorch implementation of CRAFT text detector.
autonise/CRAFT-Remade
s3nh/pytorch-text-recognition
backtime92/CRAFT-Reimplementation
fcakyon/craft-text-detector - PyTorch implementation of CRAFT
YongWookHa/craft-text-detector
faustomorales/keras-ocr - A packaged and flexible version of the CRAFT text detector and Keras CRNN recognition model.
fcakyon/craft-text-detector
