Tesseract version? #9

imaadh · 2023-02-02T11:57:27Z

Thanks for the great work on this. I was wondering what version of Tesseract this uses?

Reason is I tested pgsrip via CLI against SubtitleEdit which uses Tesseract 5.2, and the latter seems to be more accurate with fewer errors, but I suppose that could also just be due to the config of tesseract in pgsrip.

Appreciate the help. Thanks.

ratoaq2 · 2023-02-02T16:44:00Z

pgsrip uses whichever tesseract is installed on your system. You should also set an environment variable that points to the tessdata you want to use.
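To check which installation and trained data would be picked up, something like the following could help. It is only a sketch: it assumes the environment variable in question is `TESSDATA_PREFIX` (the variable tesseract itself reads), and `describe_tesseract` is a hypothetical helper, not part of pgsrip.

```python
import os
import shutil
import subprocess


def describe_tesseract(env=None):
    """Report which tesseract binary and tessdata directory would be picked up."""
    env = os.environ if env is None else env
    binary = shutil.which("tesseract", path=env.get("PATH"))
    version = None
    if binary is not None:
        # Depending on the build, the version banner may land on stdout or stderr.
        result = subprocess.run([binary, "--version"], capture_output=True, text=True)
        banner = (result.stdout or result.stderr).strip()
        version = banner.splitlines()[0] if banner else None
    return {
        "binary": binary,
        "version": version,
        # Assumption: TESSDATA_PREFIX is the variable that points at the trained data.
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    print(describe_tesseract())
```

Running this in the same shell that runs pgsrip shows whether the binary and the trained data are the ones you expect.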

In case you're using the docker image, the image was built three weeks ago using latest Debian bullseye, installing tesseract-ocr amd64 4.1.1-2.1 and using the trained data from https://github.com/tesseract-ocr/tessdata_best

Could you share examples of the output differences?
If the difference is due to the version, I could look into installing the latest tesseract in our docker image.
Otherwise, I would need to look at the differences in detail to see how to improve.

imaadh · 2023-02-04T10:17:26Z

Here's a zip with two files, one with the pgsrip output, and one that was generated by SubtitleEdit (tesseract 5.2).

srt_files.zip

Mostly it seems like the accuracy is off in the pgsrip version, there are missed spacings, incorrect punctuation, and also sometimes certain speaker names are not included (see subtitle #5).

I was not using docker, just running locally.

Appreciate the help.

ratoaq2 · 2023-02-04T16:22:27Z

Now I understand and I can see the differences. Since you're not using the docker image, I'm assuming pgsrip is using exactly the same tesseract installation that subtitleedit is using. And I'm also assuming it is using the very same trained data as subtitleedit.

Let's go through the points one by one:

pgsrip uses cleanit to post-process the extracted subtitle. The default rules/tags used by cleanit are:

ocr: Fix common OCR errors
tidy: Fix common formatting issues (e.g.: extra/missing spaces after punctuation)
no-sdh: Remove SDH descriptions
no-lyrics: Remove lyrics
no-spam

For instance, the example below shows cleanit removing the SDH descriptions from the subtitle:

subtitleedit
194
00:10:12,027 --> 00:10:13,195
ELLIOT:
Are you still there?

pgsrip
177
00:10:12,028 --> 00:10:13,196
Are you still there?

You can control what tags you want to use with the option -t, --tag:

  -t, --tag TEXT                  Rule tags to be used, e.g. ocr, tidy, no-
                                  sdh, no-style, no-lyrics, no-spam (can be
                                  used multiple times).
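As a rough illustration of repeating `-t`, the sketch below builds such a command line. The file name `episode.mkv` and the helper `build_pgsrip_command` are made up for the example, and it assumes that passing an explicit tag list replaces the defaults, so that leaving out `no-sdh` keeps speaker names.

```python
import shlex


def build_pgsrip_command(media_path, tags):
    """Build a pgsrip command line that passes one -t option per cleanit tag."""
    command = ["pgsrip"]
    for tag in tags:
        # The help text says -t/--tag can be used multiple times.
        command += ["-t", tag]
    command.append(media_path)
    return command


# Keep the OCR and formatting fixes, but leave SDH descriptions alone
# (assumption: explicit tags replace the default tag set).
command = build_pgsrip_command("episode.mkv", ["ocr", "tidy"])
print(shlex.join(command))
```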

And you can even specify your own cleanit rules if needed:

  -c, --config PATH               cleanit configuration path to be used

Spacing issues / OCR errors:

pgsrip
388
00:25:04,420 --> 00:25:06,422
pointing toalistener
on Tyrell's machine.

For a given subtitle track, we create a single image (if the image gets too big, we could have 2 or more) and then we apply some image processing to prepare the input data for tesseract. It's mainly to make it monochromatic and with clear edges. This is not perfect, but it enhances tesseract's accuracy. In initial versions I was calling tesseract once for each subtitle entry. Then I realized that putting all subtitles in a single image is better for tesseract's AI, so tesseract can see multiple occurrences of a given character. But the results can vary a lot depending on the subtitles' font/style and also the image processing part. The only way to improve this is to fine-tune some parameters, but I need to have the PGS to do that.
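To make the idea concrete, here is a toy sketch of the two steps described above: thresholding each subtitle bitmap to pure black and white, then stacking the bitmaps on one canvas. It is not pgsrip's actual implementation; the function names, the threshold of 128 and the plain lists of pixel values are all assumptions made for illustration.

```python
def binarize(image, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to pure black (0) or white (255)."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]


def stack_images(images, gap=2, background=255):
    """Stack images vertically on one canvas, padding widths and leaving a gap between them."""
    width = max(len(row) for image in images for row in image)
    canvas = []
    for index, image in enumerate(images):
        if index > 0:
            # Blank rows separate one subtitle entry from the next.
            canvas.extend([[background] * width for _ in range(gap)])
        for row in image:
            # Right-pad narrower images so every row has the same width.
            canvas.append(list(row) + [background] * (width - len(row)))
    return canvas


# Two tiny hypothetical "subtitle" bitmaps with soft (gray) edges.
first = [[250, 90, 30, 200]]
second = [[10, 140], [220, 60]]
sheet = stack_images([binarize(first), binarize(second)], gap=1)
for row in sheet:
    print(row)
```

A real implementation would work on actual image buffers, but the structure (clean up each entry, then lay them all out on one sheet) is the same.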

Another common OCR issue is when the wrong language is used/passed to tesseract. But I assume this is not happening, since pgsrip takes the language information from the track metadata.

I also see that subtitleedit does some post-processing on the extracted subtitle, since some entries that are 3 lines are changed to 2 lines. They probably also apply some common OCR fixes to the subtitle.
Some of the issues that I see in pgsrip could be solved by a new rule in cleanit. The following error is a good example:

182
00:10:29,671 --> 00:10:32,382
If I'm alive,
! must have been right.
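Such a fix could look roughly like the sketch below. It uses a plain Python regular expression rather than cleanit's own rule format, and the heuristic itself (a line-initial "!" followed by a space and a lowercase letter is a misread "I") is an assumption that would need checking against real subtitles.

```python
import re

# Hypothetical rule: a line starting with "!" followed by a space and a
# lowercase letter is most likely a misread capital "I".
LEADING_BANG = re.compile(r"^!(?= [a-z])", re.MULTILINE)


def fix_leading_bang(text):
    """Replace a line-initial '!' that stands in for the pronoun 'I'."""
    return LEADING_BANG.sub("I", text)


print(fix_leading_bang("If I'm alive,\n! must have been right."))
```

Exclamation marks at the end of a line are left untouched, since the pattern only matches at the start of a line.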

I would like to improve pgsrip, but I would need the PGS used for that.

And one final thing, only if you're willing to try something else, you could check if running pgsrip from the docker image produces the same result or not.

ratoaq2 · 2023-02-04T16:34:20Z

I'm looking at the subtitleedit code and it seems they use their own trained data:
https://github.com/SubtitleEdit/support-files/tree/master/tessdata

Since tesseract is very powerful and generic, subtitleedit probably created their own trained data by feeding the AI only with subtitles, which seems like a good idea.

ratoaq2 · 2023-02-04T16:39:36Z

I'm wondering which trained data tesseract is using in your case when running pgsrip...
Did you install tesseract yourself? Did you download any trained data? How are you executing pgsrip?

I suspect your results are not optimal because of the trained data.

imaadh · 2023-02-05T11:44:02Z

I'm working on a Windows PC, so I had installed pgsrip using the instructions from the repo README.md into a WSL Ubuntu VM, so it should be completely logically separated from SubtitleEdit, which is installed on Windows. Hence, I don't think pgsrip and SubtitleEdit are using the same tesseract data.

ratoaq2 · 2023-02-05T21:54:41Z

Thanks for the information.

I published a new release with a few things:

I updated the instructions in order to install the latest tesseract.
I also tested a pure windows installation and updated the instructions for it
I added some more options to the cli to keep the extracted PGS file and to dump the generated image (that can help with troubleshooting and with further optimizing the image before handing it over to tesseract)

If you're still willing to help, you could execute with the following options:

pgsrip --keep-temp-files --debug -vvv <your_media_path>

There should be some output pointing to a temporary folder where you can find the extracted PGS and 1 or more PNG files.
The PNG files could give us a hint of what the image that we're passing to tesseract looks like. Maybe, depending on the subtitle font, the image processing needs to be enhanced, and with that I could improve this tool.

ratoaq2 · 2023-02-05T22:39:01Z

I'm trying subtitleedit myself. I see there's plenty of options there...

What I found is that when using pure tesseract 5.3.0, without fallbacks and fixes I'm getting this error:

And if I set it to fall back to tesseract 3.02, it parses correctly.
It seems subtitleedit has some dictionary to validate what was parsed and if there's some strange word, it tries fallbacks, like tesseract 3.02, and this old version of tesseract seems to be more accurate when parsing some font types/styles.
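The dictionary-plus-fallback idea could be sketched as below. This is not subtitleedit's code: the engines are stand-in callables, the word list is a toy, and `ocr_with_fallback` is a hypothetical name.

```python
import re


def unknown_words(text, dictionary):
    """Return the words in text that are not in the dictionary."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [word for word in words if word not in dictionary]


def ocr_with_fallback(image, engines, dictionary):
    """Try each engine in order; return the first clean result, else the least bad one."""
    best_text, best_unknown = None, None
    for engine in engines:
        text = engine(image)
        unknown = len(unknown_words(text, dictionary))
        if unknown == 0:
            return text
        if best_unknown is None or unknown < best_unknown:
            best_text, best_unknown = text, unknown
    return best_text


# Stand-in engines that return fixed strings instead of running real OCR.
primary = lambda image: "pointing toalistener"
fallback = lambda image: "pointing to a listener"
dictionary = {"pointing", "to", "a", "listener"}
print(ocr_with_fallback(None, [primary, fallback], dictionary))
```

The primary result contains the unknown word "toalistener", so the fallback engine's output is used instead.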

imaadh · 2023-02-07T11:01:53Z

Here's the output from the debug command

1.en.srt-13-0.zip

ratoaq2 · 2023-02-12T13:17:13Z

I tweaked pgsrip to increase accuracy:

Added some border to the png image
Increased gaps between subtitle entries in png image
Switched tesseract from PSM 11 to PSM 6, since the text and font is uniform
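For reference, the page segmentation mode is selected with tesseract's `--psm` option: mode 6 assumes a single uniform block of text, while mode 11 looks for sparse text. The sketch below only builds an equivalent command line; the image name is made up, and pgsrip may invoke tesseract differently internally.

```python
def build_tesseract_command(image_path, language="eng", psm=6):
    """Build a tesseract CLI call that prints recognized text to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "-l", language,
        # 6 = assume a single uniform block of text; 11 = sparse text.
        "--psm", str(psm),
    ]


# Hypothetical image name, used only for illustration.
print(" ".join(build_tesseract_command("subtitles.png")))
```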

Releasing a new version with it

ratoaq2 closed this as completed in d7579b2 Feb 12, 2023
