Bindings to Tesseract OCR engine
Latest commit 6a38448 on Dec 15, 2016 by @jeroenooms (Update Travis)

tesseract

Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise and horizontal text.

Hello World

Simple example

library(tesseract)
text <- ocr("http://jeroenooms.github.io/images/testocr.png")
cat(text)

Roundtrip test: render PDF to image and OCR it back to text

library(tesseract)
library(pdftools)
library(tiff)

# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]

# Render the first page of the PDF to a bitmap and save it as a TIFF image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")

# Extract text from images
out <- ocr("page.tiff")
cat(out)

Installation

On Windows and macOS the binary package can be installed from CRAN:

install.packages("tesseract")

Installation from source on Linux or macOS requires the Tesseract library (see below).

Install from source

On Debian or Ubuntu, install libtesseract-dev and libleptonica-dev. Also install tesseract-ocr-eng to run the English examples.

sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng

On Fedora we need tesseract-devel and leptonica-devel:

sudo yum install tesseract-devel leptonica-devel

On RHEL and CentOS we need tesseract-devel and leptonica-devel from EPEL:

sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel

On macOS use tesseract from Homebrew:

brew install tesseract

Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages you can install the training data from your distribution. For example, the Spanish training data is packaged as tesseract-ocr-spa on Debian and Ubuntu, and as tesseract-langpack-spa on Fedora.
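
As a rough sketch of how distribution packages for training data tend to be named, the helper below prints the package that usually carries the data for a given language code. The function name tess_lang_pkg is hypothetical, and the naming schemes are assumptions based on the Debian/Ubuntu and Fedora/EPEL conventions, so check your distribution's package index before relying on them. The script only prints the install commands; it does not run them.

```shell
# Hypothetical helper: print the package name that usually provides
# Tesseract training data for a three-letter language code.
tess_lang_pkg() {
  family="$1"   # "debian" (Debian, Ubuntu) or "fedora" (Fedora, RHEL/CentOS with EPEL)
  lang="$2"     # three-letter language code, e.g. "spa" for Spanish
  case "$family" in
    debian) echo "tesseract-ocr-$lang" ;;
    fedora) echo "tesseract-langpack-$lang" ;;
    *) echo "unknown distribution family: $family" >&2; return 1 ;;
  esac
}

# Print the install command for Spanish on Debian or Ubuntu
echo "sudo apt-get install -y $(tess_lang_pkg debian spa)"
# Print the install command for Spanish on Fedora
echo "sudo yum install $(tess_lang_pkg fedora spa)"
```

Swapping spa for another language code (for example deu for German) gives the corresponding package name under the same assumed scheme.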

On other platforms you can manually download training data from GitHub and store it in a directory on disk, then pass that directory in the datapath parameter. Alternatively, you can set a default path via the TESSDATA_PREFIX environment variable.
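
Setting the default path could look roughly like the sketch below. The directory name and the example file name are assumptions chosen for illustration, and the download step is left as a comment because the exact file location depends on the training data repository you use.

```shell
# Hypothetical directory for manually downloaded training data
TESSDATA_DIR="$HOME/tessdata"
mkdir -p "$TESSDATA_DIR"

# Move a downloaded file such as spa.traineddata into that directory, e.g.:
#   mv ~/Downloads/spa.traineddata "$TESSDATA_DIR/"

# Make the directory the default search path for this shell session.
# Assumption: some Tesseract versions expect the parent directory of
# "tessdata" here instead; adjust if the language is not found.
export TESSDATA_PREFIX="$TESSDATA_DIR"
echo "TESSDATA_PREFIX=$TESSDATA_PREFIX"
```

An R session started from the same shell inherits the variable. To make the setting persistent, the export line can go in a shell profile.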