Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for PSM and OEM Tesseract options #22

Merged
merged 1 commit into from
May 5, 2021
Merged

Conversation

MusikAnimal
Copy link
Member

@MusikAnimal MusikAnimal commented Apr 28, 2021

This also includes validations for the new options. In order to use
OcrException outside EngineBase::getText(), we replace the try/catch
approach with an exception listener. Consequently default parameter
values known only to the controller have to be exposed as a public
static property.

Bug: T280213

@MusikAnimal MusikAnimal changed the title First draft, still plenty to do. Add support for PSM and OEM Tesseract options Apr 28, 2021
@MusikAnimal
Copy link
Member Author

MusikAnimal commented Apr 28, 2021

This unnecessarily includes the frontend bits, too (I didn't read the ticket properly!), but still has some of the backend needed for the API.

I thought about having TesseractEngine have setters like setPsm() but we both instantiate the OCR engine and do the OCRing all in EngineBase::getText() right now. So I thought to add an array $extras to the signature, to leave room for whatever options other engines might have? I have other ideas, too.

Copy link
Member

@samwilson samwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about having TesseractEngine have setters like setPsm() but we both instantiate the OCR engine and do the OCRing all in EngineBase::getText() right now. So I thought to add an array $extras to the signature, to leave room for whatever options other engines might have? I have other ideas, too.

I like the idea of setters like TesseractEngine::setPsm(), and calling them in the controller. Maybe all optional options (not to be tautological or anything) should be separate methods — i.e. add setLangs(). This would mean the engine objects would have a bunch of setters, but at any point it'd still be possible to call getText($image) and it still make sense.

We've already got a conditional in OcrController::__construct() for google vs tesseract, so maybe that can be expanded (from a ternary) and contain all the engine-specific bits?

* @return string
* @throws OcrException
*/
public function getText(string $imageUrl, ?array $langs = null): string
public function getText(string $imageUrl, ?array $langs = null, ?array $_extras = null): string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo underscore in $_extras.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was to indicate $extras isn't used in this method, we just had to have it as an arg due to the signature in the abstract EngineBase. But, I'll go with your suggestion of having setters instead of an $extras arg.

templates/_tesseract_options.html.twig Outdated Show resolved Hide resolved
@MusikAnimal MusikAnimal force-pushed the tesseract-options branch 2 times, most recently from afa76b5 to fb45ad2 Compare May 4, 2021 02:32
@MusikAnimal
Copy link
Member Author

@samwilson Let me know what you think now. As suggested I went the route of adding Engine-specific setters, set in the controller. However for validation purposes, I want to throw an OcrException when currently we're only doing our try/catch when calling getText(). So instead, I created an ExceptionListener so we can throw OcrExceptions anywhere and they should be properly handled. The drawback is I lose all contextual bits from the controller (such as $this->params). To get around this, I changed OcrController to have a static class property of defaults. The URL query params get merged into them inside the ExceptionListener. The static class property is what I'm not completely happy with, but I'm not sure about alternative ways to do this since we can't magically 'resume' execution in the controller after intercepting an exception in an event listener. On the bright side, our exception handling logic lives in one place, and also sort of like how we can set default values for params in one place OcrController::$params).

@@ -20,11 +20,21 @@ class OcrController extends AbstractController
/** @var Intuition */
protected $intuition;

/** @var EngineBase */
/** @var TesseractEngine|GoogleCloudVisionEngine */
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to do this to appease Phan, otherwise it complains about calling the Engine-specific setters like setPsm()

return;
}

$isApi = str_contains($this->request->getPathInfo(), 'api.php');
Copy link
Member Author

@MusikAnimal MusikAnimal May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun discovery: Symfony comes with PHP 8 polyfills!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, TIL! :-)

This also includes validations for the new options. In order to use
OcrException outside EngineBase::getText(), we replace the try/catch
approach with an exception listener. Consequently default parameter
values known only to the controller have to be exposed as a public
static property.

Bug: T280213
@MusikAnimal MusikAnimal marked this pull request as ready for review May 4, 2021 02:58
Copy link
Member

@samwilson samwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, works well. I'm happy to merge as-is, but just wondering about the DPI option.

return;
}

$isApi = str_contains($this->request->getPathInfo(), 'api.php');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, TIL! :-)

* Note this is used by the ExceptionListener.
* @var mixed[]
*/
public static $params = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we look at adding the dpi option? We currently get warnings on most images:

Warning: Invalid resolution 0 dpi. Using 70 instead.

I'm happy to hold that over for a future patch, but will it work with this system? It won't have a max value (well, I guess it will in practice, but it's a different sort of value).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From playing around, it seems there is a valid DPI range from 70 to 2400. I added a DPI option, but depending on what other options you use, you still get the same error. For instance, with a PSM of 0, and OEM of 3, and DPI of 300, I see:

Error! The command did not produce any output.

Generated command:
"tesseract" - "/tmp/ocrXMeZIy" --psm 0 --oem 3 --dpi 300

Returned message:
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

Seems like there might be an issue with thiagoalessio/tesseract_ocr. But either way, I have no idea what options require you to set a DPI, and I suspect many of our users won't, either. Do we know what the conditions are where DPI is needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah good points. I think the DPI is probably most useful for things like newspapers, which might also have a scan resolution set in the Index page. Let's leave it for now.

@samwilson samwilson merged commit 3cc37e1 into main May 5, 2021
@samwilson samwilson deleted the tesseract-options branch May 5, 2021 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants