Tesseract version? #9
Comments
pgsrip uses whichever tesseract is installed on the system. You should also define where the tessdata to use is located with an environment variable. If you're using the docker image, the image was built three weeks ago using the latest Debian bullseye, installing Could you share examples of the output differences?
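To compare two installations (for example one on Windows and one inside WSL), it can help to print which tesseract binary and which tessdata directory each environment resolves to. The following is a minimal sketch, not part of pgsrip: `describe_tesseract` is a hypothetical helper, and it assumes the environment variable in question is `TESSDATA_PREFIX`, which is the variable tesseract itself reads but which is not named in this thread.

```python
import os
import shutil
import subprocess


def describe_tesseract(env=None):
    """Report which tesseract binary and tessdata directory would be used.

    Hypothetical helper for illustration; pgsrip does not ship this function.
    """
    env = os.environ if env is None else env
    # Look the binary up on the PATH of the given environment.
    binary = shutil.which("tesseract", path=env.get("PATH"))
    version = None
    if binary is not None:
        # `tesseract --version` writes to stdout or stderr depending on the release.
        result = subprocess.run(
            [binary, "--version"], capture_output=True, text=True, check=False
        )
        output = (result.stdout or result.stderr).strip()
        version = output.splitlines()[0] if output else None
    return {
        "binary": binary,
        "version": version,
        # Assumption: TESSDATA_PREFIX is the variable that selects the trained data.
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),
    }
```

Calling `describe_tesseract()` with no arguments inspects the current process environment; running it in both environments and comparing the results shows whether the two tools share a binary and trained data.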
Here's a zip with two files: one with the pgsrip output and one generated by SubtitleEdit (tesseract 5.2). Mostly, the accuracy seems to be off in the pgsrip version: there are missed spacings, incorrect punctuation, and sometimes certain speaker names are not included (see subtitle #5). I was not using docker, just running locally. Appreciate the help.
Now I understand and I can see the differences. Since you're not using the docker image, I'm assuming pgsrip is using exactly the same tesseract installation that SubtitleEdit is using. And I'm also assuming it is using the very same trained data as SubtitleEdit. Let's go through the points one by one:
For instance, the example below shows
You can control what tags you want to use with the option
And you can even specify your own cleanit rules if needed:
For a given subtitle track, we create a single image (if the image gets too big, we could have 2 or more) and then we apply some image processing to prepare the input data for tesseract. It's mainly to make it monochromatic and give it clear edges. This is not perfect, but it enhances tesseract's accuracy. In initial versions I was calling tesseract once for each subtitle entry. Then I realized that putting all subtitles in a single image is better for tesseract's AI, so tesseract can see multiple occurrences of a given character. But the results can vary a lot depending on the subtitle font/style and also on the image processing part. The only way to improve this is to fine-tune some parameters, but I need to have the PGS to do that. Another common OCR issue is when the wrong language is passed to tesseract. But I assume this is not happening, since pgsrip takes the language information from the track metadata. I also see that
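The idea described above (make each subtitle image monochromatic, then place all of them on one sheet before running OCR) can be sketched as follows. This is an illustration only and not pgsrip's actual code: `binarize` and `compose_sheet` are hypothetical names, the threshold and padding values are arbitrary, and plain Python lists of 0-255 grayscale values stand in for a real imaging library.

```python
def binarize(image, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to pure black (0) or white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]


def compose_sheet(images, padding=2, background=255):
    """Stack images vertically on one canvas, separated by blank padding rows."""
    # The sheet is as wide as the widest image plus a margin on both sides.
    width = max(len(img[0]) for img in images) + 2 * padding
    sheet = [[background] * width for _ in range(padding)]
    for img in images:
        for row in img:
            # Left-align each row and fill the remainder with background.
            right = width - padding - len(row)
            sheet.append([background] * padding + list(row) + [background] * right)
        # Blank rows keep neighbouring subtitles from touching each other.
        sheet.extend([background] * width for _ in range(padding))
    return sheet


# Two tiny fake "subtitle images" of different sizes.
first = binarize([[10, 200], [250, 0]])
second = binarize([[0, 0, 0]])
sheet = compose_sheet([first, second], padding=1)
```

The resulting sheet would then be handed to tesseract in a single call, so that every subtitle entry is recognized in the same pass.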
I would like to improve And one final thing, only if you're willing to try something else, you could check if running
I'm looking Since tesseract is very powerful and generic, probably
I'm wondering in your case when running I suspect your results are not optimal because of the trained data.
I'm working on a Windows PC, so I installed pgsrip into a WSL Ubuntu VM using the instructions from the repo README.md. It should therefore be completely separated from SubtitleEdit, which is installed on Windows. Hence, I don't think pgsrip and SubtitleEdit are using the same tesseract data.
Thanks for the information. I published a new release with a few things:
If you're still willing to help, you could execute with the following options:
There should be some output pointing to a temporary folder where you could find the extracted PGS and 1 or more PNG files.
Here's the output from the debug command
I tweaked pgsrip to increase accuracy:
Releasing a new version with it
Thanks for the great work on this. I was wondering what version of Tesseract this uses?
Reason is I tested pgsrip via CLI against SubtitleEdit which uses Tesseract 5.2, and the latter seems to be more accurate with fewer errors, but I suppose that could also just be due to the config of tesseract in pgsrip.
Appreciate the help. Thanks.