Skip to content

Augraphy: Creating Realistic Document Image Datasets with Data Augmentation

Notifications You must be signed in to change notification settings

sparkfish/augraphy-paper

Repository files navigation

Augraphy: Creating Realistic Document Image Datasets with Data Augmentation

This paper introduces Augraphy, a Python library for constructing data augmentation pipelines which produce distortions commonly seen in real-world document image datasets. Augraphy stands apart from other data augmentation tools by providing many different strategies to produce augmented versions of clean document images that appear as if they have been altered by standard office operations, such as printing, scanning, and faxing through old or dirty machines, degradation of ink over time, and handwritten markings. This paper discusses the Augraphy tool, and shows how it can be used both as a data augmentation tool for producing diverse training data for tasks such as document denoising, and also for generating challenging test data to evaluate model robustness on document image modeling tasks.

Why Augraphy

Document denoising and binarization are fundamental problems in the document processing space, but current datasets are often too small and lack sufficient complexity to effectively train and benchmark modern data-driven machine learning models. To fill this gap, we introduce Augraphy, a tool designed for creating new document image datasets for training and benchmarking document denoisers and binarizers.

Augraphy provides a pipeline for generating a wide variety of noise and distortions that simulate real-world conditions commonly seen in real-world document images. As a result, the tool helps create more realistic datasets and promote more accurate modeling of real-world document images by creating datasets that are augmented versions of clean, "born digital" images to render collections with clean/dirty pairs. In this paper, we discuss the creation process of Augraphy and demonstrate its utility by training convolutional denoisers which remove real noise features with a high degree of human-perceptible fidelity, establishing baseline performance for a new Augraphy benchmark.

Comparison with Related Work

Augraphy provides several strategies to produce augmented versions of clean document images, including printing, scanning, and faxing through old or dirty machines, degradation of ink over time, and handwritten markings. These techniques are designed to produce augmented images that appear as if they have been altered by standard office operations. Augraphy's unique approach to data augmentation sets it apart from other tools available.

Library Number of Augmentations Document Centric Pipeline Based Python License
Augmentor 27 MIT
Albumentations 216 MIT
imgaug 168 MIT
Augly 34 MIT
DocCreator 12 LGPL-3.0
Augraphy (ours) 30+ MIT

Applications

Below is a real-life sample from RVL-CDIP that was re-created using Augraphy to demonstrate how it can be used to introduce noisy perturbations to document images like the original noise seen here. This shows a clean reproduction based on the dirty sample from RVL-CDIP. The final image shows how we recreated a noisy version of this document by applying Augraphy augmentations to the clean reproduction.

Noisy Original Clean Reproduction Augraphy Reproduction
Examples of Augraphy-generated document images

The Augraphy tool can be used for both training and testing document image modeling tasks. By using the generated data, one can train a document denoiser that removes real noise features with a high degree of human-perceptible fidelity, which establishes baseline performance for a new benchmark.

The tool can also be used to generate challenging test data to evaluate model robustness on document image modeling tasks such as image classification, segmentation, and OCR, which are critical tasks in document analysis. The tool can also be used for the creation of synthetic document images for testing and evaluation, which can be especially useful in cases where data privacy or other concerns prevent the use of real-world documents.

A variety of research projects have used Augraphy to improve the accuracy and effectiveness of document processing algorithms and systems. By providing a means to generate complex document images with various distortions, Augraphy can help address the challenge of developing robust, real-world document processing applications.

Conclusion

Augraphy is a powerful tool that enables the creation of realistic document image datasets for more accurate modeling of real-world document image data. Its unique data augmentation strategies and versatility make it a valuable resource for researchers

The Augraphy Paper

This repository is used to maintain the notes used for the official Augraphy paper to be published in a forthcoming conference.

If you used Augraphy in your research, please cite the project and/or the paper as appropriate.

BibTeX:

@inproceedings{augraphy_paper,
  author = {Groleau, Alexander and Chee, Kok Wei and Larson, Stefan and Maini, Samay and Boarman, Jonathan},
  title = {Augraphy: A Data Augmentation Library for Document Images},
  booktitle = {Proceedings of the 17th International Conference on Document Analysis and Recognition ({ICDAR})},
  year = {2023},
  url = {https://arxiv.org/pdf/2208.14558.pdf}
}

@software{augraphy_library,
  author = {The Augraphy Project},
  title = {Augraphy: an augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes},
  url = {https://github.com/sparkfish/augraphy},
  version = {8.1.0}
}