This is a first implementation of support for the open source kraken OCR engine. I created this draft pull request to allow public review and comments, although the implementation is still incomplete. OCR for cropped images still needs testing, and there are currently no unit tests for the new engine.
Also, would you mind opening a Phabricator task for this work, so it can be tracked there? Thanks!
Done, see https://phabricator.wikimedia.org/T345055. I also updated the PR here to resolve a merge conflict.
d773984 to a704816
See also my test installation.
f1764a1 to 6d69032
The test installation is now available at https://kraken-ocr.wmcloud.org/.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Neither is language-specific; both support historic and current scripts used by many European languages. Signed-off-by: Stefan Weil <sw@weilnetz.de>
Kraken is an Open Source OCR engine with trainable segmentation and OCR models. It can work with printed and handwritten texts. This initial implementation comes with two generic OCR models that can be used for a wide range of German publications, and also for other languages based on Latin script. Signed-off-by: Stefan Weil <sw@weilnetz.de>
Cropping is also implemented now, but still untested. Signed-off-by: Stefan Weil <sw@weilnetz.de>
…test) Signed-off-by: Stefan Weil <sw@weilnetz.de>
Segmentation models are currently only supported for kraken. All other OCR engines return an empty list. Signed-off-by: Stefan Weil <sw@weilnetz.de>
Parthiv-M left a comment
This review focuses on the following main points:
- Files related to Transkribus have also been modified, which I think is not ideal
- A new Transkribus model has also been added, which should be removed
- Text normalisation has been added here for all engines (we would want that in a separate PR)
- The newly added API route should be renamed
Overall, we'd like to move the non-Kraken changes out of this PR before testing Kraken once again.
/**
 * Get a list of available segmentation models for use with a specific OCR engine.
 *
 * @Route("/api/available_segmentation_models", name="apiSegmentationModels", methods={"GET"})
Since this route is related to Kraken, it should be named /api/kraken/available_segmentation_models
Having segmentation models is not kraken-specific. All OCR processes require a segmentation step, and if that step uses AI, it also requires a model. That's why I did not use a route with "kraken" here. So even though it is currently only used for kraken, I'd suggest using a generic route.
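To illustrate the idea of a generic route, here is a minimal sketch of an engine-independent lookup. It is written in Python for brevity and is not the PHP code in this PR; the registry, the function name, and the model identifier are all hypothetical placeholders. Engines without segmentation models simply yield an empty list, so the same route could serve every engine.

```python
from typing import Dict, List

# Hypothetical registry: engine name -> segmentation model identifiers.
# The model name is a placeholder, not one shipped with this PR.
SEGMENTATION_MODELS: Dict[str, List[str]] = {
    "kraken": ["example-segmentation-model"],
}

# Engines named in the service description.
KNOWN_ENGINES = ("google", "kraken", "tesseract", "transkribus")


def available_segmentation_models(engine: str) -> List[str]:
    """Return the segmentation models for an engine, or [] if it has none."""
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"Unknown OCR engine: {engine}")
    # Engines without an entry in the registry fall through to an empty list.
    return list(SEGMENTATION_MODELS.get(engine, []))
```

With this shape, adding segmentation models for another engine later would only require a new registry entry, not a new route.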
"german-fraktur-19th-20th-century": {
    "transkribus": {
        "htr": 37738
    }
},
This Transkribus model should not be added along with the Kraken changes; it would be better to separate it out.
$points = '';
if ( $crop ) {
    $x = $crop['x'];
We'd prefer to move changes unrelated to Kraken into a separate PR!
$this->transkribusEngine = $this->instantiateEngine( 'transkribus' );

$this->transkribusEngine = $this->instatiateEngine( 'transkribus' );
$this->krakenEngine = $this->instantiateEngine( 'kraken' );
description: A web service for Tesseract, Google and Transkribus OCR engines.
version: 1.0.0
description: A web service for Kraken, Tesseract, Google and Transkribus OCR engines.
version: 1.4.0
I believe it has been bumped up to 1.4.4 now.
path_patterns:
    - ^/api$
    - ^/api/available_langs$
    - ^/api/available_segmentation_models$
This will need to change in accordance with my comment on the route path.
{
    "@metadata": {},
    "title": "WikimediaOCR",
    "title": "WikimediaOCR – Kraken Test",
This should remain as WikimediaOCR
"regenerator-runtime": "^0.13.11",
"select2": "^4.0.13",
"select2-bootstrap-theme": "0.1.0-beta.10",
"stylelint": "^15.10.3",
stylelint can be removed from this PR as well
    "test": "grunt test"
  }
},
"dependencies": {}
Empty entries should be removed
'fro' => 'Franceis, François, Romanz (1400-1600)',
'ger-hd-m1' => 'Transkribus German handwriting M1',
'ger-15' => '15th-16th century German',
'german-fraktur-19th-20th-century' => 'German Fraktur 19th-20th century',
This is a Transkribus model and needs to be removed from this PR
It is actually a pity that this PR never got merged. It would have been a great tool for transcribing texts on Wikisource.
So let me fix the conflicts first.