Bindings to Tesseract OCR engine for R
Latest commit 36c1593 Aug 26, 2018

tesseract

Extract text from an image. Requires that you have training data for the language you are reading. Works best for images with high contrast, little noise and horizontal text.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Hello World

Simple example

library(tesseract)

# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

# Get XML HOCR output
xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)
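When the full hOCR markup is more than needed, word-level results may be easier to handle as a data frame. The sketch below assumes your installed version provides ocr_data(), which returns one row per recognized word with a numeric confidence column. The helper names low_confidence_words() and show_low_confidence(), and the threshold of 80, are hypothetical and chosen only for illustration.

```r
# Keep only the words recognized below a confidence threshold.
# `words` is assumed to be a data frame with a numeric `confidence` column,
# as returned by tesseract::ocr_data(); the helper name is hypothetical.
low_confidence_words <- function(words, threshold = 80) {
  words[words$confidence < threshold, , drop = FALSE]
}

# OCR an image and print the words that may need manual review.
# Not called here, because it needs the tesseract package and an image.
show_low_confidence <- function(image, threshold = 80) {
  words <- tesseract::ocr_data(image)
  print(low_confidence_words(words, threshold))
}
```

Calling `show_low_confidence("https://jeroen.github.io/images/testocr.png")` would list the words from the test image that fall below the threshold, if any.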

Roundtrip test: render PDF to image and OCR it back to text

# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)

# Extract text from the rendered image
text <- ocr(img_file)
unlink(img_file)
cat(text)
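The roundtrip example keeps the original page text in orig but does not compare it with the OCR result. One rough way to do so, sketched here in base R, is to measure what share of the distinct words in the original also appear in the OCR output. The function word_overlap() is a hypothetical helper, not part of the package, and a word-level overlap is only a coarse proxy for OCR accuracy.

```r
# Share of distinct words in `reference` that also occur in `candidate`.
# Comparison is case-insensitive and ignores punctuation and whitespace.
word_overlap <- function(reference, candidate) {
  split_words <- function(x) {
    words <- unlist(strsplit(tolower(paste(x, collapse = " ")), "[^[:alnum:]]+"))
    unique(words[nzchar(words)])
  }
  ref <- split_words(reference)
  cand <- split_words(candidate)
  if (length(ref) == 0) {
    return(NA_real_)  # nothing to compare against
  }
  mean(ref %in% cand)
}
```

With the objects from the roundtrip example, `word_overlap(orig, text)` returns a value between 0 and 1, where values close to 1 suggest that most words survived the roundtrip.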

Installation

On Windows and macOS the binary package can be installed from CRAN:

install.packages("tesseract")

Installation from source on Linux or macOS requires the Tesseract library (see below).

Install from source

On Debian or Ubuntu install libtesseract-dev and libleptonica-dev. Also install tesseract-ocr-eng to run examples.

sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng

On Fedora we need tesseract-devel and leptonica-devel:

sudo yum install tesseract-devel leptonica-devel

On RHEL and CentOS we need tesseract-devel and leptonica-devel from EPEL:

sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel

On macOS use tesseract from Homebrew:

brew install tesseract

Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR results for other languages you can install the appropriate training data. On Windows and macOS you can do this in R using tesseract_download():

tesseract_download('fra')
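Downloading the training data does not by itself change which language is used, since the engine defaults to English. A minimal sketch of selecting the language explicitly follows. It assumes that tesseract_info() reports the installed languages in its available field and that tesseract() takes a language code. The helpers missing_languages() and ocr_language() are hypothetical names, and the sketch handles a single language code only.

```r
# Return the language codes from `wanted` that are not yet installed.
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

# OCR an image with a specific language, downloading training data if needed.
# Not called here, because it needs the tesseract package and an image.
ocr_language <- function(image, language) {
  available <- tesseract::tesseract_info()$available
  for (lang in missing_languages(language, available)) {
    tesseract::tesseract_download(lang)
  }
  engine <- tesseract::tesseract(language = language)
  tesseract::ocr(image, engine = engine)
}
```

For example, `ocr_language("french-scan.png", "fra")` would read a French scan, where the file name is a placeholder. On Linux the download step may not apply, in which case the training data comes from the distribution packages instead.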

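On Linux you need to install the appropriate training data from your distribution. For example, to install the Spanish training data, use the tesseract-ocr-spa package on Debian or Ubuntu, or tesseract-langpack-spa on Fedora and EPEL.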

Alternatively, you can manually download training data from GitHub and store it in a directory on disk, then either pass that directory in the datapath parameter or set it as the default via the TESSDATA_PREFIX environment variable. Note that Tesseract 4 and Tesseract 3 use different training data formats. Make sure to download training data from the branch that matches your libtesseract version.
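The manual route can be sketched as follows, assuming that tesseract_info() exposes the libtesseract version as a string in its version field. The helpers tesseract_major() and engine_from_dir() are hypothetical, and the directory passed to engine_from_dir() is a placeholder for wherever you saved the downloaded files.

```r
# Extract the major version from a version string such as "4.0.0-beta.1".
tesseract_major <- function(version) {
  as.integer(sub("^([0-9]+).*$", "\\1", version))
}

# Create an engine that reads training data from a custom directory.
# Reports the libtesseract major version so the matching training data
# branch can be checked. Not called here, because it needs the package.
engine_from_dir <- function(tessdata_dir, language = "eng") {
  major <- tesseract_major(tesseract::tesseract_info()$version)
  message("libtesseract major version: ", major)
  tesseract::tesseract(language = language, datapath = tessdata_dir)
}
```

An engine created this way can then be passed to ocr() through its engine argument, for example `ocr(img_file, engine = engine_from_dir("~/tessdata"))`.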