Would it be possible to train a German model? #82

Open
gitapii opened this issue Jan 20, 2024 · 3 comments

Comments

gitapii commented Jan 20, 2024

Hi,

I recently tested this repo as a NuGet package, and it seems to be a very good Paddle OCR solution for .NET. Would it also be possible to train or fine-tune a German model (maybe locally), or to use the inference model from 'PaddlePaddle/PaddleOCR#1048'?

It's quite similar to English, but you have 4 more characters (ä, ö, ü, ß). At the moment, the model recognizes the umlauts as a, o, u, without the dots above. Support for these characters would be great.

Kind regards,

Contributor

n0099 commented Jan 20, 2024

> It's quite similar to English, but you have 4 more characters (ä, ö, ü, ß).

https://en.wikipedia.org/wiki/List_of_Latin-script_letters
https://en.wikipedia.org/wiki/Template:ISO_15924_script_codes_and_related_Unicode_data
https://knowyourmeme.com/memes/theyre-the-same-picture

In fact, a model can ONLY recognize characters from the dictionary that was fixed at training time, because recognition just outputs a list of indexes into that dictionary, one per character. If you pair the model with a dict.txt other than the one used during training, the indexes won't line up and the output will be meaningless characters.
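To make the index mismatch concrete, here is a minimal sketch in plain C#. The two dictionaries and the predicted indexes are made up for illustration, and the decoding is simplified (a real recognizer's output also involves a blank class and collapsing of repeated indexes, which is ignored here).

```csharp
using System;
using System.Linq;

// Hypothetical, tiny dictionaries; real dict files list one character per line.
string[] trainingDict = { "a", "o", "u", "ä", "ö", "ü", "ß" };
string[] wrongDict = { "a", "b", "c", "d", "e", "f", "g" };

// Pretend recognizer output: one dictionary index per recognized character.
int[] predictedIndexes = { 3, 1, 6 };

// Decoding is only a lookup: index -> entry of whichever dictionary is supplied.
string Decode(int[] indexes, string[] dict) =>
    string.Concat(indexes.Select(i => dict[i]));

// With the dictionary used at training time, the indexes decode to "äoß".
Console.WriteLine(Decode(predictedIndexes, trainingDict));

// With a different dictionary, the same indexes decode to "dbg".
Console.WriteLine(Decode(predictedIndexes, wrongDict));
```

The same lookup also shows why a character that is absent from the training dictionary can never appear in the output: no index points to it.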
https://github.com/PaddlePaddle/PaddleOCR/blob/1bc550064457b9ab7821f92f16ac5629239ae95a/doc/doc_ch/models_list.md?plain=1#L45 They describe the latest v4 model, ch_PP-OCRv4_det, as "[Latest] original ultra-lightweight model, supporting Chinese, English, and multilingual text detection", but some Latin letter variants are missing from the dictionary, so they will never be recognized correctly:

public static LocalDictOnlineRecognizationModel ChineseV4 => new("ch_PP-OCRv4_rec", "ppocr_keys_v1.txt", new Uri("https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_rec_infer.tar"), ModelVersion.V4);

and there's a v3 model trained with latin_dict.txt, which contains more letter variants:

public static LocalDictOnlineRecognizationModel LatinV3 => new("latin_PP-OCRv3_rec", "latin_dict.txt", new Uri("https://paddleocr.bj.bcebos.com/PP-OCRv3/multilingual/latin_PP-OCRv3_rec_infer.tar"), ModelVersion.V3);

If you want to use the oldest v2 model, german_mobile_v2.0_rec_infer from PaddlePaddle/PaddleOCR#1048 (comment), which seems to have been trained with german_dict.txt, you could define

public static LocalDictOnlineRecognizationModel GermanV2 => new("german_mobile_v2.0_rec_infer", "german_dict.txt", new Uri("https://paddleocr.bj.bcebos.com/dygraph_v2.0/multilingual/german_mobile_v2.0_rec_infer.tar"), ModelVersion.V2);

but this shouldn't work, because all the dictionaries copied from PaddleOCR/ppocr/utils/dict for use in Sdcb.PaddleOCR.Models.Local* and Sdcb.PaddleOCR.Models.Online are bundled into Sdcb.PaddleOCR.Models.Shared as assembly resources, and german_dict.txt is not among them:

<ItemGroup>
<EmbeddedResource Include="dicts\arabic_dict.txt" />
<EmbeddedResource Include="dicts\chinese_cht_dict.txt" />
<EmbeddedResource Include="dicts\cyrillic_dict.txt" />
<EmbeddedResource Include="dicts\devanagari_dict.txt" />
<EmbeddedResource Include="dicts\en_dict.txt" />
<EmbeddedResource Include="dicts\japan_dict.txt" />
<EmbeddedResource Include="dicts\ka_dict.txt" />
<EmbeddedResource Include="dicts\korean_dict.txt" />
<EmbeddedResource Include="dicts\latin_dict.txt" />
<EmbeddedResource Include="dicts\ppocr_keys_v1.txt" />
<EmbeddedResource Include="dicts\table_structure_dict.txt" />
<EmbeddedResource Include="dicts\table_structure_dict_ch.txt" />
<EmbeddedResource Include="dicts\ta_dict.txt" />
<EmbeddedResource Include="dicts\te_dict.txt" />
</ItemGroup>
so that
public static List<string> LoadDicts(string dictName)
can read them by name when it is called from
return new StreamDictFileRecognizationModel(RootDirectory, SharedUtils.LoadDicts(DictName), Version);
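As a rough illustration of why a dictionary name that is not bundled cannot be resolved, here is a sketch of an embedded-resource lookup using only the .NET base class library. `LoadEmbeddedDict` is a hypothetical stand-in, not the library's actual LoadDicts implementation, and the suffix-matching rule is an assumption about how resource names are formed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for a LoadDicts-style helper; not the library's code.
static List<string> LoadEmbeddedDict(Assembly assembly, string dictName)
{
    // Embedded resource names carry a namespace/folder prefix,
    // so this sketch matches on the file-name suffix.
    string? resourceName = assembly.GetManifestResourceNames()
        .FirstOrDefault(name => name.EndsWith("." + dictName, StringComparison.Ordinal));
    if (resourceName is null)
    {
        throw new FileNotFoundException(
            $"'{dictName}' is not embedded in {assembly.GetName().Name}.");
    }

    using Stream stream = assembly.GetManifestResourceStream(resourceName)!;
    using StreamReader reader = new StreamReader(stream);
    List<string> lines = new List<string>();
    while (reader.ReadLine() is string line)
    {
        lines.Add(line); // one dictionary entry per line
    }
    return lines;
}

try
{
    // This demo assembly embeds no dictionaries, so the lookup fails,
    // much like asking for a dictionary that was never bundled.
    LoadEmbeddedDict(Assembly.GetExecutingAssembly(), "german_dict.txt");
}
catch (FileNotFoundException ex)
{
    Console.WriteLine(ex.Message);
}
```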

Author

gitapii commented Jan 20, 2024

Thank you very much! I had overlooked latin_dict. The four characters I mentioned are in it and are recognized as well, so it already works when LocalFullModels.LatinV3 is selected as the model. I'm going to optimize it, thx!
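For anyone landing here, selecting that model might look roughly like the sketch below. Only LocalFullModels.LatinV3 comes from this thread; the other type names, namespaces, and the package list follow the usage pattern from the project's README as far as I recall it, so treat them as assumptions and check them against the version you have installed. The image path is a placeholder.

```csharp
// Assumed NuGet packages: Sdcb.PaddleOCR, Sdcb.PaddleOCR.Models.Local,
// Sdcb.PaddleInference plus a native runtime package,
// and OpenCvSharp4 plus a native runtime package.
using System;
using OpenCvSharp;
using Sdcb.PaddleInference;
using Sdcb.PaddleOCR;
using Sdcb.PaddleOCR.Models;
using Sdcb.PaddleOCR.Models.Local;

// The Latin v3 model; per the discussion above, its recognizer is paired with latin_dict.txt.
FullOcrModel model = LocalFullModels.LatinV3;

// Assumption: CPU inference via MKL-DNN, as in the README examples.
using PaddleOcrAll ocr = new PaddleOcrAll(model, PaddleDevice.Mkldnn());

// "german-sample.png" is a placeholder path for an image containing German text.
using Mat image = Cv2.ImRead("german-sample.png");

PaddleOcrResult result = ocr.Run(image);
Console.WriteLine(result.Text);
```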

@juvebogdan

@gitapii Hi. I wanted to use German as well. Did you have to fine-tune, or does it work out of the box?
