php-tesseract

This extension is currently under development.

Installation

Initial Setup

$ git clone https://github.com/insecia/php-tesseract.git
$ cd php-tesseract
$ docker-compose build extension-builder

Compile Extension

$ docker-compose run --rm extension-builder
$ docker-compose build php

Run Tests

$ docker-compose run --rm php make test

Run Script

$ docker-compose run --rm php php script_name.php
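
As an illustration of what such a script might contain, here is a minimal sketch. It uses only the `fromFile()` and `getText()` calls shown under Usage; the file name `script_name.php`, the `main()` helper, and the convention of passing the image path as the first command-line argument are assumptions made for this example, not part of the extension.

```php
<?php
// script_name.php: hypothetical example script that prints the text
// recognized in the image given as the first command-line argument.

function main(array $argv): int
{
    // $argv[0] is the script name; the image path is expected at index 1.
    if (count($argv) < 2) {
        fwrite(STDERR, "Usage: php script_name.php <image-file>\n");
        return 1;
    }

    // Fail with a clear message if the extension has not been loaded.
    if (!class_exists(\Tesseract\Tesseract::class)) {
        fwrite(STDERR, "The Tesseract extension is not loaded.\n");
        return 2;
    }

    echo \Tesseract\Tesseract::fromFile($argv[1])->getText(), PHP_EOL;
    return 0;
}

// The returned status could be passed to exit() in a real script.
$status = main($argv ?? []);
```

With this sketch, the image path would be appended to the command above, for example `php script_name.php image.jpg`. Whether the image is visible inside the container depends on how the Compose file mounts the project directory.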

Usage

Basic usage

$tesseract = \Tesseract\Tesseract::fromFile('image.jpg');
$textContent = $tesseract->getText();

Text extraction can also be restricted to a rectangular area of the image.

$tesseract = \Tesseract\Tesseract::fromFile('image.jpg');
$textContent = $tesseract->getRectangle(500, 500, 1000, 1000)->getText();

A Tesseract instance can also be created from a string that contains the binary content of an image, which avoids creating a temporary file.

$textContent = \Tesseract\Tesseract::fromString($imageContent)->getText();
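
The snippet above assumes `$imageContent` already holds the image bytes. One way to obtain them is sketched below with PHP's built-in `file_get_contents()`; the `readImageBytes()` helper and the `image.jpg` path are hypothetical names used only for illustration. The same pattern would apply to bytes coming from an upload or an HTTP response.

```php
<?php
// Sketch: load raw image bytes into memory and pass them to fromString().

function readImageBytes(string $path): string
{
    // file_get_contents() returns false on failure; turn that into an exception.
    $bytes = @file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Could not read image: $path");
    }
    return $bytes;
}

// Only call into the extension when it is actually loaded.
if (class_exists(\Tesseract\Tesseract::class)) {
    $imageContent = readImageBytes('image.jpg'); // placeholder path
    echo \Tesseract\Tesseract::fromString($imageContent)->getText(), PHP_EOL;
}
```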

One or more languages can also be specified. The language data files for the specified languages must be installed; refer to the Dockerfile for setup under Alpine, or to the tesseract-ocr documentation.

$tesseract = \Tesseract\Tesseract::fromFile('image.jpg', [
    \Tesseract\Language\GERMAN,
    \Tesseract\Language\ENGLISH
]);
$textContent = $tesseract->getText();

It is also possible to choose a different page segmentation mode.

$tesseract = \Tesseract\Tesseract::fromFile('image.jpg');
echo $tesseract->setPageSegMode(\Tesseract\PageSegMode\SINGLE_WORD)->getText();

About

A PHP 7.1+ extension for working with the Tesseract OCR library.
