Historical Multilingual and Monolingual TEAMS Models. The following languages (and pretraining corpora) are covered:
- English (British Library Corpus - Books)
- German (Europeana Newspaper)
- French (Europeana Newspaper)
- Finnish (Europeana Newspaper, Digilib)
- Swedish (Europeana Newspaper, Digilib)
- Dutch (Delpher Corpus)
- Norwegian (NCC Corpus)
We pretrain a "Training ELECTRA Augmented with Multi-word Selection" (TEAMS) model:
We pretrain the hmTEAMS model on a v3-32 TPU Pod. All details can be found here.
We perform experiments on various historic NER datasets, such as HIPE-2022 and ICDAR-Europeana; a fine-tuning sketch is shown below. All results, including hyper-parameters, can be found here.
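A minimal fine-tuning sketch using the Flair library, which ships a HIPE-2022 dataset loader; the pretrained model identifier is an assumption and may differ from the actual repository name on the Model Hub:

```python
from flair.datasets import NER_HIPE_2022
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Load the AjMC English split of HIPE-2022 (shipped with Flair)
corpus = NER_HIPE_2022(dataset_name="ajmc", language="en")
label_dict = corpus.make_label_dictionary(label_type="ner")

# Assumed model ID for the pretrained hmTEAMS checkpoint -- check the
# hmTEAMS organization on the Model Hub for the exact repository name.
embeddings = TransformerWordEmbeddings(
    model="hmteams/teams-base-historic-multilingual-discriminator",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)

# Plain fine-tuning head: no CRF, no RNN, no reprojection
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/hipe2022-ajmc-en",
    learning_rate=5e-5,
    mini_batch_size=16,
    max_epochs=10,
)
```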
Our pretrained hmTEAMS model can be obtained from the Hugging Face Model Hub:
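A minimal loading sketch with the 🤗 Transformers library, assuming a hypothetical model identifier (check the hmTEAMS organization on the Model Hub for the exact repository name):

```python
from transformers import AutoModel, AutoTokenizer

# Assumed model ID -- the exact repository name may differ.
model_id = "hmteams/teams-base-historic-multilingual-discriminator"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a short historical snippet and inspect the contextual embeddings.
inputs = tokenizer("Die Kaiserin ist nach München gereist.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
```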
We release the following models, fine-tuned on various historic NER datasets (HIPE-2020, HIPE-2022, ICDAR-Europeana):
| Language | Model(s) |
|----------|----------|
| English  | AjMC (HIPE-2022), TopRes19th (HIPE-2022) |
| German   | AjMC (HIPE-2022), NewsEye, HIPE-2020 |
| French   | AjMC (HIPE-2022), ICDAR-Europeana, LeTemps (HIPE-2022), NewsEye, HIPE-2020 |
| Finnish  | NewsEye (HIPE-2022) |
| Swedish  | NewsEye (HIPE-2022) |
| Dutch    | ICDAR-Europeana |
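As a usage sketch, a released tagger can be loaded directly with Flair; the model identifier below is a hypothetical placeholder, standing in for one of the models in the table above:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Hypothetical model ID -- substitute one of the released taggers from the table.
tagger = SequenceTagger.load("hmteams/flair-hipe-2022-ajmc-en")

sentence = Sentence("Uebersetzung der Bacchen des Euripides.")
tagger.predict(sentence)

# Print all recognized named entities with their labels and scores
for entity in sentence.get_spans("ner"):
    print(entity)
```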
- 25.09.2024: All hmTEAMS models are now released under permissive Apache 2.0 license.
- 08.09.2023: Evaluation on German and French HIPE-2020 datasets added here.
- 01.09.2023: Evaluation on German and French NewsEye datasets added here.
- 28.08.2023: Evaluation on TopRes19th dataset added here.
- 27.08.2023: Evaluation on LeTemps dataset added here.
- 06.08.2023: Evaluation on various historic NER datasets is completed. Results can be found here.
- 01.08.2023: hmTEAMS organization can be found on the Model Hub. More information on how to access trained hmTEAMS models is coming soon.
- 25.05.2023: Initial version of this repo.
We thank Luisa März, Katharina Schmid and Erion Çano for their fruitful discussions about Historical Language Models.
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️