stefan-it/historic-domain-adaptation-icdar

Data Centric Domain Adaptation for Historical Text with OCR Errors

This repository contains the code and datasets used in our paper "Data Centric Domain Adaptation for Historical Text with OCR Errors" by Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth and Hinrich Schütze. The publicly accessible preprint can be found here.

Changelog

  • 24.08.2022: Add license and instructions to use the datasets in Flair.
  • 14.08.2022: Mention corpus stats for French and Dutch. Add BibTeX entry.
  • 07.12.2021: Release of French and Dutch data used for our experiments.
  • 16.07.2021: Initial version of this repo.

Datasets

The data used for our experiments can be found in the data folder of this repository.

Stats

The following table shows an overview of the corpus stats for each language:

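| Language | Training Sentences | Development Sentences | Test Sentences |
|----------|--------------------|-----------------------|----------------|
| French   | 7,936              | 992                   | 992            |
| Dutch    | 5,777              | 722                   | 723            |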

These stats can be calculated with the flair_stats.py script using Flair (commit: 7578403).
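
As a rough illustration of what such a count involves, the sketch below counts sentences in column-formatted (CoNLL-style) text. It assumes one token per line and a blank line between sentences. It is not the flair_stats.py script itself: the function name `count_sentences` and the sample lines are made up for illustration, and the actual file layout in the data folder may differ.

```python
from typing import Iterable


def count_sentences(lines: Iterable[str]) -> int:
    """Count blank-line-separated sentences in CoNLL-style lines."""
    count = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            # Non-empty line: a token row belonging to the current sentence.
            in_sentence = True
        elif in_sentence:
            # Blank line closes the sentence that was open.
            count += 1
            in_sentence = False
    # A file may end without a trailing blank line.
    if in_sentence:
        count += 1
    return count


# Hypothetical sample (assumption), not taken from the released data.
sample = [
    "Le O",
    "Havre B-LOC",
    "",
    "Amsterdam B-LOC",
    "",
]
print(count_sentences(sample))
```

To count a real file, the same function could be applied to an open file handle, for example `count_sentences(open(path, encoding="utf-8"))`, where `path` points to one of the split files.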

Code

Code for training our models will be released in the near future.

Usage in Flair

The latest Flair master branch adds native support for our released datasets. They can be loaded with the following lines of code:

from flair.datasets import NER_ICDAR_EUROPEANA

french_corpus = NER_ICDAR_EUROPEANA(language="fr")
dutch_corpus  = NER_ICDAR_EUROPEANA(language="nl")
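
Once a corpus is loaded, its split sizes can be compared against the stats table above. The following is a minimal sketch, not part of this repository or of Flair. The helper names `split_sizes` and `matches_reported_stats` are hypothetical. The only assumption about the corpus object is that it exposes `train`, `dev` and `test` splits that support `len()`. A stand-in object is used so that the example runs without Flair installed.

```python
from types import SimpleNamespace

# Sentence counts from the stats table above: (train, dev, test).
EXPECTED_SIZES = {
    "fr": (7936, 992, 992),
    "nl": (5777, 722, 723),
}


def split_sizes(corpus) -> tuple:
    """Return (train, dev, test) sentence counts of a corpus-like object."""
    return (len(corpus.train), len(corpus.dev), len(corpus.test))


def matches_reported_stats(corpus, language: str) -> bool:
    """Check whether the split sizes equal the counts reported for a language."""
    return split_sizes(corpus) == EXPECTED_SIZES[language]


# Stand-in for a loaded corpus (assumption): only the split lengths matter here.
fake_french = SimpleNamespace(train=range(7936), dev=range(992), test=range(992))
print(matches_reported_stats(fake_french, "fr"))
```

With Flair installed, calling `matches_reported_stats(french_corpus, "fr")` would run the same check on the loaded corpus. Whether the counts match may depend on the Flair version used.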

License

We release the data under the CC0 1.0 Universal (CC0 1.0) license, the same license used for the Europeana NER Corpora.

Citation

You can use the following BibTeX entry for citing our paper/data:

@InProceedings{10.1007/978-3-030-86331-9_48,
    author="M{\"a}rz, Luisa
    and Schweter, Stefan
    and Poerner, Nina
    and Roth, Benjamin
    and Sch{\"u}tze, Hinrich",
    editor="Llad{\'o}s, Josep
    and Lopresti, Daniel
    and Uchida, Seiichi",
    title="Data Centric Domain Adaptation for Historical Text with OCR Errors",
    booktitle="Document Analysis and Recognition -- ICDAR 2021",
    year="2021",
    publisher="Springer International Publishing",
    address="Cham",
    pages="748--761",
    abstract="We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.",
    isbn="978-3-030-86331-9"
}
