Model used for trained data #42

cmjordan42 · 2024-01-10T15:59:50Z

It doesn't seem like the Tesseract trained data set is configurable (i.e. 'fast' vs 'best'), and as far as I can tell, you are using 'fast'. Is that the case?

There may also be corruption somewhere in the trained data you have (at least for eng): I just noticed a totally nonsensical series of characters in the conversion of a single basic word when the subtitle is multi-line. Something like...

The brown fox jumps over the lazy
qj2]a%sLo1


Tentacule · 2024-01-10T22:32:22Z

In the Docker image, the trained data is downloaded from https://github.com/tesseract-ocr/tessdata/. From the README, it looks like a version between 'best' and 'fast'.

If you want to try other data (e.g. https://github.com/tesseract-ocr/tessdata_best), you'll have to download it and run PgsToSrt manually.
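For anyone who wants to script the download step, here is a minimal sketch in Python. It assumes the tessdata_best repository serves its files from a `main` branch under GitHub's `/raw/` path, and the helper names (`traineddata_url`, `download_traineddata`) and the `tessdata` destination folder are made up for illustration. It only fetches the file; pointing PgsToSrt at the folder is still a manual step.

```python
import urllib.request
from pathlib import Path

# Assumption: tessdata_best exposes its models on a "main" branch.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main"


def traineddata_url(lang: str, base_url: str = BASE_URL) -> str:
    """Build the download URL for one language model, e.g. 'eng'."""
    return f"{base_url}/{lang}.traineddata"


def download_traineddata(lang: str, dest_dir: str = "tessdata") -> Path:
    """Download <lang>.traineddata into dest_dir and return the local path."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{lang}.traineddata"
    # urlretrieve follows GitHub's redirect to the raw file content.
    urllib.request.urlretrieve(traineddata_url(lang), str(target))
    return target


# Usage (performs a network request, so it is left commented out):
# path = download_traineddata("eng")
```

Keeping the URL builder separate from the download makes it easy to swap in another repository (for example the default tessdata one) by passing a different `base_url`.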

It would be helpful if you could provide me with a sample of the file you are using.

GFoley83 · 2024-04-16T04:50:43Z

The eng.traineddata from tessdata_best, coupled with the latest version of PgsToSrt (1.4.5 at time of writing), should give you pretty much perfect results @cmjordan42.

See my comment here: #41 (comment)
