Tesseract version? #9

Closed
imaadh opened this issue Feb 2, 2023 · 10 comments

@imaadh

imaadh commented Feb 2, 2023

Thanks for the great work on this. I was wondering what version of Tesseract this uses?

Reason is I tested pgsrip via CLI against SubtitleEdit which uses Tesseract 5.2, and the latter seems to be more accurate with fewer errors, but I suppose that could also just be due to the config of tesseract in pgsrip.

Appreciate the help. Thanks.

@ratoaq2
Owner

ratoaq2 commented Feb 2, 2023

pgsrip uses whichever tesseract is installed on your system. You should also set an environment variable that points to the tessdata to use.
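
As a rough way to check which tesseract binary and trained data a local run would pick up, a small script can look up the binary on PATH and read the tessdata variable. This is only a sketch: it assumes the variable in question is `TESSDATA_PREFIX` (the one Tesseract itself reads), and `describe_tesseract_setup` is a hypothetical helper, not part of pgsrip.

```python
import os
import shutil


def describe_tesseract_setup(environ=None, which=shutil.which):
    """Report which tesseract binary and tessdata folder would be used.

    Assumption: the tessdata location comes from TESSDATA_PREFIX.
    """
    environ = os.environ if environ is None else environ
    return {
        "binary": which("tesseract"),                # None if not on PATH
        "tessdata": environ.get("TESSDATA_PREFIX"),  # None if unset
    }


if __name__ == "__main__":
    setup = describe_tesseract_setup()
    print("tesseract binary:", setup["binary"] or "not found on PATH")
    print("tessdata folder:", setup["tessdata"] or "TESSDATA_PREFIX is not set")
```

Running it in the same shell used for pgsrip should show whether both pieces are visible to the tool.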

In case you're using the docker image, the image was built three weeks ago using latest Debian bullseye, installing tesseract-ocr amd64 4.1.1-2.1 and using the trained data from https://github.com/tesseract-ocr/tessdata_best

Do you have examples of the output differences?
If the difference is due to the version, I could check how to install the latest one in our docker image.
Otherwise, I'd need to look at the differences in detail to see how to improve.

@imaadh
Author

imaadh commented Feb 4, 2023

Here's a zip with two files, one with the pgsrip output, and one that was generated by SubtitleEdit (tesseract 5.2)

srt_files.zip

Mostly, the accuracy seems to be off in the pgsrip version: there are missing spaces, incorrect punctuation, and sometimes speaker names are not included (see subtitle #5).

I was not using docker, just running locally.

Appreciate the help.

@ratoaq2
Owner

ratoaq2 commented Feb 4, 2023

Now I understand and I can see the differences. Since you're not using the docker image, I'm assuming pgsrip is using exactly the same tesseract installation that subtitleedit is using. And I'm also assuming it is using the very same trained data as subtitleedit.

Let's go through the points one by one:

  1. pgsrip uses cleanit to post-process the extracted subtitle. The default rules/tags used by cleanit are:
  • ocr: Fix common OCR errors
  • tidy: Fix common formatting issues (e.g.: extra/missing spaces after punctuation)
  • no-sdh: Remove SDH descriptions
  • no-lyrics: Remove lyrics
  • no-spam

For instance, the example below shows cleanit removing the SDH descriptions from the subtitle:

subtitleedit:
194
00:10:12,027 --> 00:10:13,195
ELLIOT:
Are you still there?

pgsrip:
177
00:10:12,028 --> 00:10:13,196
Are you still there?

You can control what tags you want to use with the option -t, --tag:

  -t, --tag TEXT                  Rule tags to be used, e.g. ocr, tidy, no-
                                  sdh, no-style, no-lyrics, no-spam (can be
                                  used multiple times).

And you can even specify your own cleanit rules if needed:

  -c, --config PATH               cleanit configuration path to be used
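
To make the tag selection concrete, here is a minimal sketch that builds a pgsrip command line with every default tag except `no-sdh`, which should leave speaker names such as `ELLIOT:` in place. It only uses the `-t` and `-c` options quoted above; `pgsrip_command` and `movie.mkv` are hypothetical names used for illustration.

```python
import shlex

# Default tags as listed earlier in this comment.
DEFAULT_TAGS = ["ocr", "tidy", "no-sdh", "no-lyrics", "no-spam"]


def pgsrip_command(media_path, tags, config_path=None):
    """Build a pgsrip argument list with one -t option per tag."""
    args = ["pgsrip"]
    for tag in tags:
        args += ["-t", tag]
    if config_path is not None:
        args += ["-c", config_path]  # optional custom cleanit configuration
    args.append(media_path)
    return args


# Keep every default tag except no-sdh, so SDH descriptions are not removed.
tags = [tag for tag in DEFAULT_TAGS if tag != "no-sdh"]
print(shlex.join(pgsrip_command("movie.mkv", tags)))
# prints: pgsrip -t ocr -t tidy -t no-lyrics -t no-spam movie.mkv
```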

  2. Spacing issues / OCR errors:
pgsrip
388
00:25:04,420 --> 00:25:06,422
pointing toalistener
on Tyrell's machine.

For a given subtitle track, we create a single image (if the image gets too big, we may create 2 or more) and then apply some image processing to prepare the input for tesseract, mainly to make it monochromatic with clear edges. This is not perfect, but it improves tesseract's accuracy. In the initial versions I was calling tesseract separately for each subtitle entry. Then I realized that putting all subtitles in a single image works better for tesseract's AI, since it can see multiple occurrences of a given character. But the results can vary a lot depending on the subtitle's font/style and on the image processing. The only way to improve this is to fine-tune some parameters, and I need the PGS to do that.
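
The idea of combining all entries into one monochrome image can be sketched roughly as follows. This is not pgsrip's actual code: images are modeled as plain lists of grayscale pixel rows, and `stack_subtitles`, `to_monochrome`, `gap`, `border` and `threshold` are hypothetical names and parameters chosen for the illustration.

```python
WHITE = 255


def stack_subtitles(images, gap=2, border=1):
    """Stack grayscale images (lists of pixel rows) vertically into one.

    Each image is left-aligned on a white canvas, separated from the next
    by `gap` white rows and surrounded by a `border` of white pixels.
    """
    width = max(len(row) for image in images for row in image) + 2 * border
    blank = [WHITE] * width
    canvas = [list(blank) for _ in range(border)]  # top border
    for index, image in enumerate(images):
        if index > 0:
            canvas += [list(blank) for _ in range(gap)]  # gap between entries
        for row in image:
            padded = [WHITE] * border + list(row)
            padded += [WHITE] * (width - len(padded))  # right padding
            canvas.append(padded)
    canvas += [list(blank) for _ in range(border)]  # bottom border
    return canvas


def to_monochrome(image, threshold=128):
    """Binarize: pixels below the threshold become black, the rest white."""
    return [[0 if pixel < threshold else WHITE for pixel in row] for row in image]


# Two tiny fake "subtitle" images combined into one sheet for a single OCR call.
first = [[0, 200], [90, 255]]
second = [[30, 30, 30]]
sheet = to_monochrome(stack_subtitles([first, second]))
```

A real implementation would work on actual bitmaps and would also need to map the recognized text back to each subtitle entry, which this sketch leaves out.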

Another common OCR issue is when the wrong language is used/passed to tesseract. But I assume this is not happening, since pgsrip takes the language information from the track metadata.

I also see that subtitleedit does some post-processing on the extracted subtitle, since some entries that have 3 lines are changed to 2 lines. They probably also apply some common OCR fixes to the subtitle.
Some of the issues I see in pgsrip could be solved by a new rule in cleanit. The following error is a good example:

182
00:10:29,671 --> 00:10:32,382
If I'm alive,
! must have been right.
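
A fix for this kind of error could look roughly like the sketch below. It is a standalone Python illustration of the idea, not a rule in cleanit's own configuration format, and `fix_misread_i` is a hypothetical name.

```python
import re

# A "!" at the start of a line, followed by a space and a lowercase letter,
# is very likely a misread capital "I".
LEADING_BANG = re.compile(r"^!(?= [a-z])", flags=re.MULTILINE)


def fix_misread_i(text):
    """Replace a line-initial "!" that looks like a misread "I"."""
    return LEADING_BANG.sub("I", text)


print(fix_misread_i("If I'm alive,\n! must have been right."))
```

The lookahead keeps the rule narrow, so a legitimate exclamation mark at the end of a sentence is left alone.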

I would like to improve pgsrip, but I would need the PGS used for that.

And one final thing, only if you're willing to try something else, you could check if running pgsrip from the docker image produces the same result or not.

@ratoaq2
Owner

ratoaq2 commented Feb 4, 2023

I'm looking at the subtitleedit code and it seems they use their own trained data:
https://github.com/SubtitleEdit/support-files/tree/master/tessdata

Since tesseract is very powerful and generic, subtitleedit probably created their own trained data by feeding the AI only with subtitles, which seems like a good idea.

@ratoaq2
Owner

ratoaq2 commented Feb 4, 2023

I'm wondering which trained data tesseract is using in your case when running pgsrip...
Did you install tesseract yourself? Did you download any trained data? How are you executing pgsrip?

I suspect your results are not optimal because of the trained data.

@imaadh
Author

imaadh commented Feb 5, 2023

I'm working on a Windows PC, so I had installed pgsrip using the instructions from the repo README.md into a WSL Ubuntu VM, so it should be completely logically separated from SubtitleEdit, which is installed on Windows. Hence, I don't think pgsrip and SubtitleEdit are using the same tesseract data.

@ratoaq2
Owner

ratoaq2 commented Feb 5, 2023

Thanks for the information.

I published a new release with a few things:

  • I updated the instructions in order to install the latest tesseract.
  • I also tested a pure windows installation and updated the instructions for it
  • I added some more options to the cli to keep the extracted PGS file and to dump the generated image (that can help with troubleshooting and with further optimizing the image before handing it over to tesseract)

If you're still willing to help, you could execute with the following options:

pgsrip --keep-temp-files --debug -vvv <your_media_path>

There should be some output pointing to a temporary folder where you can find the extracted PGS and 1 or more PNG files.
The PNG files could give us a hint of what the image we're passing to tesseract looks like. Depending on the subtitle font, the image processing may need to be enhanced, and with that I could improve this tool.

@ratoaq2
Owner

ratoaq2 commented Feb 5, 2023

I'm trying subtitleedit myself. I see there's plenty of options there...

What I found is that when using pure tesseract 5.3.0, without fallbacks and fixes, I'm getting this error:
[image]

And if I set it to fall back to tesseract 3.02, it parses correctly.
It seems subtitleedit has a dictionary to validate what was parsed, and if there's a strange word it tries fallbacks such as tesseract 3.02. This old version of tesseract seems to be more accurate when parsing some font types/styles.
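
The dictionary-plus-fallback behaviour described here could be sketched roughly like this. It is a guess at the general technique, not subtitleedit's actual implementation: the engines are modeled as plain callables, and `unknown_words`, `ocr_with_fallback` and the tiny dictionary are hypothetical.

```python
import re


def unknown_words(text, dictionary):
    """Return the words in `text` that are not in `dictionary`."""
    words = re.findall(r"[A-Za-z']+", text)
    return [word for word in words if word.lower() not in dictionary]


def ocr_with_fallback(image, engines, dictionary):
    """Try each OCR engine in order and stop at the first clean result.

    `engines` is a list of callables taking an image and returning text.
    If every engine produces unknown words, keep the result with the fewest.
    """
    best_text, best_unknown = None, None
    for engine in engines:
        text = engine(image)
        unknown = len(unknown_words(text, dictionary))
        if unknown == 0:
            return text
        if best_unknown is None or unknown < best_unknown:
            best_text, best_unknown = text, unknown
    return best_text


# Fake engines standing in for a primary and a fallback OCR engine.
def primary(image):
    return "pointing toalistener"


def fallback(image):
    return "pointing to a listener"


dictionary = {"pointing", "to", "a", "listener"}
print(ocr_with_fallback(None, [primary, fallback], dictionary))
# prints: pointing to a listener
```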

@imaadh
Author

imaadh commented Feb 7, 2023

Here's the output from the debug command

1.en.srt-13-0.zip

@ratoaq2
Owner

ratoaq2 commented Feb 12, 2023

I tweaked pgsrip to increase accuracy:

  • Added some border to the png image
  • Increased gaps between subtitle entries in png image
  • Switched tesseract from PSM 11 to PSM 6, since the text and font are uniform

Releasing a new version with it
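
For anyone who wants to compare the two page segmentation modes on a dumped PNG, here is a minimal sketch that calls the tesseract command line directly. It does not show how pgsrip invokes tesseract internally; `tesseract_command` and `run_ocr` are hypothetical helpers, and the image path is whatever PNG you dumped.

```python
import shutil
import subprocess


def tesseract_command(image_path, language="eng", psm=6):
    """Build a tesseract CLI call that prints the recognized text to stdout.

    --psm 6 treats the image as a single uniform block of text;
    --psm 11 looks for sparse text in no particular order.
    """
    return ["tesseract", image_path, "stdout", "-l", language, "--psm", str(psm)]


def run_ocr(image_path, language="eng", psm=6):
    """Run tesseract if it is installed; return None otherwise."""
    if shutil.which("tesseract") is None:
        return None
    command = tesseract_command(image_path, language, psm)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `run_ocr` twice on the same image, once with `psm=11` and once with `psm=6`, gives a quick side-by-side view of what the switch changes.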
