const tesseract = require('tesseractocr')
Pass an image to Tesseract to recognize the text in it. This method spawns a new tesseract process with the given options and streams the source into it. The callback is fired when the child process has finished.
source
can be a file path, a stream.Readable or a Buffer
options
is an object containing options for the recognition process. All of the following options are optional:
  execPath: specify the path of the tesseract executable. Defaults to the result of which tesseract (searches PATH).
  language: specify the language(s) used for OCR. This can be a string or an array of strings.
  tessdataDir: specify the tessdata path.
  userWords: specify the location of the user words file.
  userPatterns: specify the location of the user patterns file.
  config: set values for config variables. This can be a string, an array of strings or an object specifying key-value pairs.
  configfiles: one or more file paths pointing to tesseract configuration files. This can be a string or an array of strings.
  psm: specify the page segmentation mode.
  oem: specify the OCR Engine mode.
  output: specify the output format. Valid values are "alto", "hocr", "pdf", "tsv" or "txt". It's an alias for the configfiles option which makes the JS API more expressive. Check out the CONFIGFILES part of the tesseract docs for more information.
callback(err, text)
an optional error-first callback function
tesseract.recognize(`${__dirname}/image.png`, (err, text) => { /* ... */ })
tesseract.recognize(Buffer.from(/* ... */), (err, text) => { /* ... */ })
tesseract.recognize(fs.createReadStream(/* ... */), (err, text) => { /* ... */ })
tesseract.recognize('image.jpeg', { language: 'swe' }, (err, text) => { /* ... */ })
tesseract.recognize('image.tiff').then(console.log, console.error)
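Several of the options described above can be combined in a single object. The following is a minimal sketch rather than an excerpt from the module's documentation: the helper name recognizeAsTsv and the specific option values are illustrative assumptions, and the recognizer is passed in as a parameter so the snippet does not depend on how the module is loaded.

```javascript
// Hypothetical helper: runs OCR with a fixed set of documented options and
// returns whatever the recognizer returns (a Promise when no callback is given).
// `recognize` is expected to be the recognize() function of tesseractocr.
function recognizeAsTsv (recognize, source) {
  const options = {
    language: ['eng', 'swe'], // a string or an array of strings
    psm: 6, // page segmentation mode (value chosen for illustration)
    config: { tessedit_do_invert: 0 }, // string, array of strings, or key-value object
    output: 'tsv' // one of "alto", "hocr", "pdf", "tsv" or "txt"
  }
  return recognize(source, options)
}
```

With the module loaded as tesseract, this could be called as recognizeAsTsv(tesseract.recognize, 'image.png').then(console.log, console.error).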
Factory to create custom recognizer functions with fixed options.
Parameters are identical to recognize().
const recognize = tesseract.withOptions({
  psm: 4,
  language: ['fin', 'eng'],
  config: ['tessedit_do_invert=0']
})
recognize(`${__dirname}/image.png`, (err, text) => { /* ... */ })
recognize(Buffer.from(/* ... */), (err, text) => { /* ... */ })
recognize(fs.createReadStream(/* ... */), (err, text) => { /* ... */ })
recognize('image.jpeg', (err, text) => { /* ... */ })
recognize('image.tiff').then(console.log, console.error)
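Conceptually, a factory like this is a closure that remembers the fixed options and forwards every call to the underlying recognizer. The sketch below illustrates the idea only; it is not the module's actual implementation, withFixedOptions is a hypothetical name, and it assumes the returned function takes just a source and an optional callback.

```javascript
// Illustrative only: a generic "fixed options" factory built with a closure.
// `recognize` stands for any function with the signature
// recognize(source, options, callback).
function withFixedOptions (recognize, fixedOptions) {
  // Copy the options so later changes to the caller's object have no effect.
  const options = Object.assign({}, fixedOptions)
  return function recognizeWithFixedOptions (source, callback) {
    return recognize(source, options, callback)
  }
}
```

The design choice here is to copy the options once at creation time, so every call made through the returned function sees the same settings.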
promise
the Promise instance returned by the recognize() call that initiated the given recognition process.
Abort an ongoing recognition process.
const proc = tesseract.recognize(source)
// ... e.g. in an HTTP server ...
request.on('timeout', () => tesseract.recognize.cancel(proc))
Abort all the ongoing recognition processes.
for (let i = 0; i < 10; i++) {
  tesseract
    .recognize(source + i)
    .then(console.log)
}
process.on('exit', () => tesseract.recognize.cancelAll())
Get the list of the installed languages on the current system.
execPath
specify the path of the tesseract executable. Defaults to the result of which tesseract (searches PATH).
callback(err, list)
an optional error-first callback function
tesseract.listLanguages((err, list) => { /* ... */ })
tesseract.listLanguages('/usr/local/bin/tesseract', (err, list) => { /* ... */ })
tesseract.listLanguages().then(console.log, console.error)
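One practical use of the language list is to check, before starting recognition, that every language you intend to pass in the language option is installed. This is a minimal sketch, assuming the list is an array of language code strings (the description above does not spell out its shape); missingLanguages is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the wanted language codes that are absent
// from the installed list. Assumes `installed` is an array of strings.
function missingLanguages (installed, wanted) {
  const available = new Set(installed)
  return wanted.filter((code) => !available.has(code))
}

// Example with a hard-coded list standing in for the listLanguages() result.
const missing = missingLanguages(['eng', 'fin', 'osd'], ['fin', 'swe'])
console.log(missing) // [ 'swe' ]
```

With the module loaded as tesseract, the installed list would come from tesseract.listLanguages().then((list) => missingLanguages(list, ['fin', 'swe'])).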
The module's main export is the recognize() method itself, so you can simply call it directly:
const recognize = require('tesseractocr')
recognize('source', (err, text) => { /* ... */ })
All the methods of this module support both callback-based and promise-based usage. That means they work well with async functions:
async function parseTextFromImage(source) {
  let text = await recognize(source)
  // do stuff with recognized text...
  text = text.split(' ')
  return text
}
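Because the promise form composes with await, several images can be processed one after another in a plain loop. This is a minimal sketch under stated assumptions: recognizeSequentially is a hypothetical helper, and the recognizer is injected as a parameter and assumed to return a Promise when called without a callback.

```javascript
// Hypothetical helper: recognizes several sources one after another, so the
// next recognition only starts once the previous one has resolved.
// `recognize` is injected and must return a Promise.
async function recognizeSequentially (recognize, sources) {
  const results = []
  for (const source of sources) {
    results.push(await recognize(source))
  }
  return results
}
```

Running the recognitions sequentially keeps at most one Tesseract child process alive at a time, at the cost of total throughput; if any recognition rejects, the returned promise rejects and the remaining sources are not processed.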
Recognition can be aborted in the event of e.g. a client timeout. This terminates the given Tesseract child process:
const proc = recognize(source)
request.on('timeout', () => recognize.cancel(proc))
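The same cancellation call can back a generic time limit that is independent of any HTTP request. The following is a sketch, not part of the module: recognizeWithTimeout is a hypothetical helper, the recognizer is injected as a parameter, and it is assumed to return a Promise and to expose cancel(promise) as described above.

```javascript
// Hypothetical helper: cancels the recognition and rejects if it takes longer
// than `ms` milliseconds. `recognize` must return a Promise and expose
// a cancel(promise) method.
function recognizeWithTimeout (recognize, source, ms) {
  const proc = recognize(source)
  let timer
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => {
      recognize.cancel(proc) // terminates the Tesseract child process
      reject(new Error(`recognition timed out after ${ms} ms`))
    }, ms)
  })
  // Whichever promise settles first wins; the timer is always cleared afterwards.
  return Promise.race([proc, timeout]).finally(() => clearTimeout(timer))
}
```

With the module loaded as shown earlier, this could be used as recognizeWithTimeout(recognize, 'image.png', 5000).then(console.log, console.error).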
It's also possible to terminate all in-progress Tesseract processes in the event of e.g. a forced shutdown:
process.on('exit', () => recognize.cancelAll())