How to train OCR? #33

Paladond · 2019-07-30T08:14:03Z

How to train OCR?
Please describe each step and everything that is needed (images, JSON, how to run the scripts, and so on).
Thank you.

ApelSYN · 2019-07-31T07:48:17Z

First, you need to extract license plate samples from your set of car photos. You can adapt this example to your needs: https://github.com/ria-com/nomeroff-net/blob/master/tools/avto-nomer-tool/py/avto-nomer_grab.ipynb
To create JSON descriptions for the extracted plate samples, run this Node.js script from the avto-nomer-tool toolkit (short documentation: https://github.com/ria-com/nomeroff-net/tree/master/tools/avto-nomer-tool):

./console.js --section=default --action=createAnnotations  --opt.baseDir=../../datasets/ocr/your_dataset
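
After this step it can help to confirm that every plate image actually received a JSON description. The following is only a minimal sketch in Python, not part of the toolkit: it assumes that an image and its description share the same base file name, and the function name `find_unannotated` and the extension list are hypothetical.

```python
from pathlib import PurePath

# Assumption: adjust to the image formats used in your dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def find_unannotated(image_names, annotation_names):
    """Return image file names that have no JSON file with the same base name."""
    annotated = {
        PurePath(name).stem
        for name in annotation_names
        if PurePath(name).suffix.lower() == ".json"
    }
    return sorted(
        name
        for name in image_names
        if PurePath(name).suffix.lower() in IMAGE_EXTENSIONS
        and PurePath(name).stem not in annotated
    )


missing = find_unannotated(
    ["a1.png", "a2.png", "a3.jpg"],
    ["a1.json", "a3.json"],
)
print(missing)  # ['a2.png']
```

In practice the two lists would come from listing the image and description folders of your dataset; the exact folder layout is described in the toolkit documentation.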

Next, you can try recognizing the plates with the European model, since it is the most general-purpose one for such cases. An example script is here:
https://github.com/ria-com/nomeroff-net/blob/master/tools/avto-nomer-tool/py/ocr_dataset_checker.ipynb
You can skip this step if you cannot get this example to work.
The most time-consuming task is moderating the dataset. For this we built a dedicated Node.js admin panel; the installation documentation is here:
https://github.com/ria-com/nomeroff-net/tree/master/moderation
The moderated dataset must be split into test, training, and validation sets. You can do this with the toolkit mentioned above:
https://github.com/ria-com/nomeroff-net/tree/master/tools/avto-nomer-tool
Example:

./console.js --section=default --action=dataSplit --opt.splitRate=0.2  --opt.srcDir=../../datasets/ocr/draft --opt.targetDir=../../datasets/ocr/test
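
To illustrate what a split rate of 0.2 means, here is a minimal sketch in Python. It is not the toolkit's implementation: the function name `split_names`, the fixed seed, and the rounding rule are assumptions, and it only selects sample names rather than moving any files.

```python
import random


def split_names(names, split_rate, seed=0):
    """Split names into (kept, moved); about split_rate of them end up in moved."""
    if not 0.0 <= split_rate <= 1.0:
        raise ValueError("split_rate must be between 0 and 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    n_moved = round(len(shuffled) * split_rate)
    return shuffled[n_moved:], shuffled[:n_moved]


samples = [f"plate_{i:04d}" for i in range(10)]
train, test = split_names(samples, split_rate=0.2)
print(len(train), len(test))  # 8 2
```

Running the same kind of split a second time on the remaining samples would produce the validation set.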

For good results, the training set should contain at least 5,000 examples, but we recommend 10,000.
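
Because part of the moderated data goes to the test and validation sets, you need more than 5,000 moderated samples in total. A rough Python sketch of the arithmetic follows, assuming both rates are fractions of the whole dataset; the 0.2 validation rate and the function name `samples_needed` are illustrative assumptions, not values from the toolkit.

```python
import math
from fractions import Fraction

MIN_TRAIN = 5_000           # minimum training set size
RECOMMENDED_TRAIN = 10_000  # recommended training set size


def samples_needed(train_target, test_rate, val_rate):
    """Total moderated samples needed so the training part reaches train_target."""
    # Fractions built from the decimal text avoid floating-point rounding surprises.
    train_share = 1 - Fraction(str(test_rate)) - Fraction(str(val_rate))
    if train_share <= 0:
        raise ValueError("test_rate + val_rate must be below 1")
    return math.ceil(train_target / train_share)


print(samples_needed(MIN_TRAIN, 0.2, 0.2))          # 8334
print(samples_needed(RECOMMENDED_TRAIN, 0.2, 0.2))  # 16667
```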
Create a script based on the examples in the train folder (https://github.com/ria-com/nomeroff-net/blob/master/train), for example https://github.com/ria-com/nomeroff-net/blob/master/train/ocr-ge.ipynb, and train the model on your dataset.

Paladond · 2019-08-02T14:44:16Z

Thanks, it's a good description.

Paladond closed this as completed Aug 2, 2019
