Model used for trained data #42

Open
cmjordan42 opened this issue Jan 10, 2024 · 2 comments

Comments

@cmjordan42

It doesn't seem like the Tesseract trained data set is configurable (e.g. 'fast' vs 'best'), and as far as I can tell you are using 'fast'. Is that the case?

There may also be corruption somewhere in the trained data you have (at least for eng): I just noticed a totally nonsensical series of characters in place of a single basic word when the subtitle is multi-line. Something like...

The brown fox jumps over the lazy
qj2]a%sLo1
@Tentacule
Owner

In the Docker image, the trained data is downloaded from https://github.com/tesseract-ocr/tessdata/. From the readme, it looks like a version between 'best' and 'fast'.

If you want to try other data (e.g. https://github.com/tesseract-ocr/tessdata_best), you'll have to download it and run PgsToSrt manually.

It would be helpful if you could provide a sample of the file you are using.
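For anyone who wants to script the download step, here is a minimal sketch in Python. It assumes the trained data files sit at the root of each tesseract-ocr repository on a branch named `main`, and that a local `tessdata` folder is where you want them; the function names and the `REPOS` mapping are made up for illustration and are not part of PgsToSrt. How to point PgsToSrt at that folder is covered in its own readme.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Variant label -> repository under the tesseract-ocr GitHub organisation.
# The labels are illustrative; "standard" is the repo the Docker image uses.
REPOS = {"fast": "tessdata_fast", "standard": "tessdata", "best": "tessdata_best"}


def traineddata_url(lang: str = "eng", variant: str = "best", branch: str = "main") -> str:
    """Build the raw-file URL for <lang>.traineddata (branch name is an assumption)."""
    if variant not in REPOS:
        raise ValueError(f"unknown variant {variant!r}, expected one of {sorted(REPOS)}")
    return f"https://github.com/tesseract-ocr/{REPOS[variant]}/raw/{branch}/{lang}.traineddata"


def download_traineddata(lang: str = "eng", variant: str = "best", dest_dir: str = "tessdata") -> Path:
    """Download the trained data file into dest_dir and return its local path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"{lang}.traineddata"
    urlretrieve(traineddata_url(lang, variant), str(target))
    return target


if __name__ == "__main__":
    # Fetches eng.traineddata from tessdata_best into ./tessdata
    print(download_traineddata("eng", "best"))
```

Keeping each variant in its own folder (for example `tessdata_best/` next to `tessdata/`) makes it easy to compare results on the same file.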

@GFoley83

GFoley83 commented Apr 16, 2024

The eng.traineddata from tessdata_best, coupled with the latest version of PgsToSrt (1.4.5 at time of writing), should give you pretty much perfect results, @cmjordan42.

See my comment here: #41 (comment)
