Add kraken OCR engine #89

Draft

stweil wants to merge 27 commits into wikimedia:main

Contributor

stweil commented Aug 19, 2023

No description provided.

stweil marked this pull request as draft

August 19, 2023 11:13

Contributor Author

stweil commented Aug 19, 2023

This is a first implementation of support for the open-source kraken OCR engine.

I created this draft pull request to allow public review and comments, although the implementation is still incomplete. OCR for cropped images still needs testing, and there is currently no unit test code for the new engine.

stweil force-pushed the kraken branch from 9e5607e to 4227fae Compare

August 22, 2023 09:31

Member

samwilson commented Aug 28, 2023

Also, would you mind opening a Phabricator task for this work, so it can be tracked there? Thanks!

Contributor Author

stweil commented Aug 28, 2023

> Also, would you mind opening a Phabricator task for this work, so it can be tracked there? Thanks!

Done, see https://phabricator.wikimedia.org/T345055. I also updated this PR to resolve a merge conflict.

stweil force-pushed the kraken branch 2 times, most recently from d773984 to a704816 Compare

August 28, 2023 08:39

Contributor Author

stweil commented Aug 28, 2023 (edited)

See also my test installation.

stweil force-pushed the kraken branch from bca0c1e to 0edc944 Compare

August 31, 2023 12:13

stweil force-pushed the kraken branch 15 times, most recently from f1764a1 to 6d69032 Compare

September 22, 2023 16:02

Contributor Author

stweil commented Sep 22, 2023

The test installation is now available at https://kraken-ocr.wmcloud.org/.

stweil added 5 commits

October 12, 2023 13:30


          Fix typo in name of newly introduced method (instatiate -> instantiate)

b0c99b1

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add OCR models Fraktur and Latin for Tesseract

4c0565d

Neither is language-specific; both support historic and current scripts
used by many European languages.

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add new OCR engine kraken

f4e2bd8

Kraken is an Open Source OCR engine with trainable segmentation and
OCR models. It can work with printed and handwritten texts.

This initial implementation comes with two generic OCR models which
can be used on a wide range of German publications, but also with
other languages which are based on Latin script.

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          WebProfilerBundle

2c3fb6b

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Remove unneeded code for Transkribus OCR engine

803fe24

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil added 21 commits

October 12, 2023 11:45


          Add script for kraken OCR

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Update KrakenEngine to support language selection

b4683e4

Cropping is also implemented now, but still untested.

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add models for kraken OCR

28f8db2

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Suppress warning from phpcs because usage of popen

166ad14

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add austriannewspapers model for kraken

e597b4d

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add missing documentation for new OCR engine kraken (required for CI test)

a54c70c

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Update package.json

d039754

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Update package-lock.json

a0c799c

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Support segmentation model for kraken OCR engine

22703d5

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Update API version for new release with kraken OCR engine

bc93361

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add new API /api/available_segmentation_models

c9723b7

Segmentation models are currently only supported for kraken.
All other OCR engines return an empty list.

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add segmentation model for kraken OCR

85775c5

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add more OCR models for Tesseract

c6fe3c8

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          npm: Add missing dependency stylelint

3f316bb

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Fix iteration over Transkribus line models

903ccfb

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Fix description of OpenAPI parameters langs and crop

b1427e8

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add Transkribus model german-fraktur-19th-20th-century

d7fc002

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Fix kraken_ocr script

4d67cfd

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Add new OCR parameter to normalize the result text

dac4739

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Modify title shown on test web page

b5640ec

Signed-off-by: Stefan Weil <sw@weilnetz.de>


          Fix code injection

f281e6e

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil force-pushed the kraken branch from 6d69032 to f281e6e Compare

October 12, 2023 12:07


          Improve code to fix code injection

fa67a34

Signed-off-by: Stefan Weil <sw@weilnetz.de>

Parthiv-M requested changes

Collaborator

Parthiv-M left a comment

This review focuses on the following main points:

- Files related to Transkribus have also been modified, which I think is not ideal.
- A new Transkribus model has also been added, which should be removed.
- Text normalisation has been added here for all engines (we would want that in a separate PR).
- The newly added API route should be renamed.

Overall, we'd like the non-Kraken changes moved out of this PR before we test Kraken again.

src/Controller/OcrController.php

+              	/**
+              	 * Get a list of available segmentation models for use with a specific OCR engine.
+              	 *
+              	 * @Route("/api/available_segmentation_models", name="apiSegmentationModels", methods={"GET"})

Collaborator

Parthiv-M Dec 7, 2023

Since this route is related to Kraken, it should be named /api/kraken/available_segmentation_models.

Contributor Author

stweil Dec 12, 2023

Having segmentation models is not kraken-specific. All OCR processes require a segmentation step, and if that step uses AI, it also requires a model. That's why I did not use a route with "kraken" here. So even if it is currently only used for kraken, I'd suggest using a generic route.
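For illustration, a minimal sketch of the generic approach, written in Python rather than the project's PHP. The registry, the function name, and the model name are all hypothetical and not taken from this PR; the only behaviour borrowed from the PR is that engines without segmentation models return an empty list:

```python
# Hypothetical registry: engine name -> segmentation model identifiers.
# The model name is a placeholder, not a real kraken model.
SEGMENTATION_MODELS: dict[str, list[str]] = {
    "kraken": ["example-segmentation-model"],
}

# Engine names as listed in the API description of this PR.
KNOWN_ENGINES = ("kraken", "tesseract", "google", "transkribus")


def available_segmentation_models(engine: str) -> list[str]:
    """Return the segmentation models for an engine, or [] if it has none."""
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"Unknown OCR engine: {engine}")
    # Return a copy so callers cannot mutate the registry.
    return list(SEGMENTATION_MODELS.get(engine, []))
```

With a lookup like this, a generic route would need no change if another engine gained segmentation models later; only the registry would grow.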

public/langs.json

Comment on lines +289 to +293

+                  "german-fraktur-19th-20th-century": {
+                      "transkribus": {
+                          "htr": 37738
+                      }
+                  },

Collaborator

Parthiv-M Dec 7, 2023

This Transkribus model should not be added along with the Kraken changes; it would be better to separate it out.

src/Engine/TranskribusEngine.php

               		$points = '';
               		if ( $crop ) {
               			$x = $crop['x'];

Collaborator

Parthiv-M Dec 7, 2023

We'd prefer to isolate changes other than those related to Kraken to another PR!

tests/Engine/EngineBaseTest.php

Comment on lines +39 to +41

    
              		$this->transkribusEngine = $this->instantiateEngine( 'transkribus' );

              		$this->transkribusEngine = $this->instatiateEngine( 'transkribus' );

              		$this->krakenEngine = $this->instantiateEngine( 'kraken' );

Collaborator

Parthiv-M Dec 7, 2023

instantiate() fixed in #115

config/packages/nelmio_api_doc.yaml

    
-                          description: A web service for Tesseract, Google and Transkribus OCR engines.
-                          version: 1.0.0
+                          description: A web service for Kraken, Tesseract, Google and Transkribus OCR engines.
+                          version: 1.4.0

Collaborator

Parthiv-M Dec 7, 2023

I believe it has been bumped up to 1.4.4 now.

config/packages/nelmio_api_doc.yaml

                       path_patterns:
                           - ^/api$
                           - ^/api/available_langs$
+                          - ^/api/available_segmentation_models$

Collaborator

Parthiv-M Dec 7, 2023

This will need to change in accordance with my comment on the route path.

i18n/en.json

               {
                   "@metadata": {},
-                  "title": "WikimediaOCR",
+                  "title": "WikimediaOCR – Kraken Test",

Collaborator

Parthiv-M Dec 7, 2023

This should remain as WikimediaOCR

package.json

                       "regenerator-runtime": "^0.13.11",
                       "select2": "^4.0.13",
                       "select2-bootstrap-theme": "0.1.0-beta.10",
+                      "stylelint": "^15.10.3",

Collaborator

Parthiv-M Dec 7, 2023

stylelint can be removed from this PR as well

package.json

                       "test": "grunt test"
-                  }
+                  },
+                  "dependencies": {}

Collaborator

Parthiv-M Dec 7, 2023

Empty entries should be removed

src/Engine/EngineBase.php

               		'fro' => 'Franceis, François, Romanz (1400-1600)',
               		'ger-hd-m1' => 'Transkribus German handwriting M1',
               		'ger-15' => '15th-16th century German',
+              		'german-fraktur-19th-20th-century' => 'German Fraktur 19th-20th century',

Collaborator

Parthiv-M Dec 7, 2023

This is a Transkribus model and needs to be removed from this PR

wrznr commented May 2, 2025

It is actually a pity that this PR never got merged. It would have been a great tool for the transcription of texts on Wikisource.

Contributor Author

stweil commented May 2, 2025

So let me fix the conflicts first.
