-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for PSM and OEM Tesseract options #22
Conversation
f11bb0f
to
7d833ad
Compare
This unnecessarily includes the frontend bits, too (I didn't read the ticket properly!), but still has some of the backend needed for the API. I thought about having |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about having TesseractEngine have setters like setPsm() but we both instantiate the OCR engine and do the OCRing all in EngineBase::getText() right now. So I thought to add an array $extras to the signature, to leave room for whatever options other engines might have? I have other ideas, too.
I like the idea of setters like TesseractEngine::setPsm()
, and calling them in the controller. Maybe all optional options (not to be tautological or anything) should be separate methods — i.e. add setLangs()
. This would mean the engine objects would have a bunch of setters, but at any point it'd still be possible to call getText($image)
and it still make sense.
We've already got a conditional in OcrController::__construct()
for google vs tesseract, so maybe that can be expanded (from a ternary) and contain all the engine-specific bits?
* @return string | ||
* @throws OcrException | ||
*/ | ||
public function getText(string $imageUrl, ?array $langs = null): string | ||
public function getText(string $imageUrl, ?array $langs = null, ?array $_extras = null): string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo underscore in $_extras
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was to indicate $extras
isn't used in this method, we just had to have it as an arg due to the signature in the abstract EngineBase. But, I'll go with your suggestion of having setters instead of an $extras
arg.
afa76b5
to
fb45ad2
Compare
@samwilson Let me know what you think now. As suggested I went the route of adding Engine-specific setters, set in the controller. However for validation purposes, I want to throw an |
@@ -20,11 +20,21 @@ class OcrController extends AbstractController | |||
/** @var Intuition */ | |||
protected $intuition; | |||
|
|||
/** @var EngineBase */ | |||
/** @var TesseractEngine|GoogleCloudVisionEngine */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to do this to appease Phan, otherwise it complains about calling the Engine-specific setters like setPsm()
return; | ||
} | ||
|
||
$isApi = str_contains($this->request->getPathInfo(), 'api.php'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun discovery: Symfony comes with PHP 8 polyfills!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, TIL! :-)
fb45ad2
to
9302650
Compare
This also includes validations for the new options. In order to use OcrException outside EngineBase::getText(), we replace the try/catch approach with an exception listener. Consequently default parameter values known only to the controller have to be exposed as a public static property. Bug: T280213
9302650
to
1ec2c1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me, works well. I'm happy to merge as-is, but just wondering about the DPI option.
return; | ||
} | ||
|
||
$isApi = str_contains($this->request->getPathInfo(), 'api.php'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, TIL! :-)
* Note this is used by the ExceptionListener. | ||
* @var mixed[] | ||
*/ | ||
public static $params = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we look at adding the dpi
option? We currently get warnings on most images:
Warning: Invalid resolution 0 dpi. Using 70 instead.
I'm happy to hold that over for a future patch, but will it work with this system? It won't have a max value (well, I guess it will in practice, but it's a different sort of value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From playing around, it seems there is a valid DPI range from 70 to 2400. I added a DPI option, but depending on what other options you use, you still get the same error. For instance, with a PSM of 0, and OEM of 3, and DPI of 300, I see:
Error! The command did not produce any output.
Generated command:
"tesseract" - "/tmp/ocrXMeZIy" --psm 0 --oem 3 --dpi 300
Returned message:
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Seems like there might be an issue with thiagoalessio/tesseract_ocr. But either way, I have no idea what options require you to set a DPI, and I suspect many of our users won't, either. Do we know what the conditions are where DPI is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah good points. I think the DPI is probably most useful for things like newspapers, which might also have a scan resolution set in the Index page. Let's leave it for now.
This also includes validations for the new options. In order to use
OcrException outside EngineBase::getText(), we replace the try/catch
approach with an exception listener. Consequently default parameter
values known only to the controller have to be exposed as a public
static property.
Bug: T280213