
yungyDPR/TrainingData

 
 


Training datasets for GROBID sale catalogues models

Each directory of this repository contains datasets created to train GROBID models for sale catalogues. Datasets are grouped first by the institution that holds the original documents, then by author or auction house.

Annotated files are in the TEI-XML format.

Naming convention

  • BnF files are named with their Gallica ark identifier.
  • INHA files are named with their digital identifier ("identifiant numérique") provided in their online notice.
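Because BnF files carry their Gallica ark identifier, a file name can be mapped back to its source document. The sketch below is illustrative only: it assumes the part of the file name before the first dot is the ark name, and that the document lives under the BnF naming authority `12148` on Gallica. The function name `gallica_url` and the sample file name are hypothetical, not part of this repository.

```python
from pathlib import Path

# Assumption: Gallica documents resolve under the BnF ark authority 12148.
GALLICA_BASE = "https://gallica.bnf.fr/ark:/12148/"


def gallica_url(filename: str) -> str:
    """Build a Gallica URL from a training file name (hypothetical helper).

    Assumes everything before the first dot in the file name is the ark name.
    """
    ark_name = Path(filename).name.split(".", 1)[0]
    return GALLICA_BASE + ark_name


if __name__ == "__main__":
    # Placeholder file name, not a real catalogue.
    print(gallica_url("bpt6k0000000x.training.segmentation.tei.xml"))
```

INHA identifiers are not covered here, since their format is not described above.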

GROBID models

  • Segmentation: the segmentation model aims to produce a high-level segmentation of the catalogues.

Data quality

Before being pushed to the main branch, annotated files are proofread at least once and validated against an XSD schema by a GitHub Action.
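As a quick local pre-check before pushing, a file can be tested for XML well-formedness with the Python standard library alone. This is a minimal sketch and not the repository's validation tool: it does not validate against the XSD (which would need a schema-aware library), and the function name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML (hypothetical helper).

    This only checks well-formedness, not validity against the XSD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Toy fragments for illustration, not real annotated catalogues.
    print(is_well_formed("<TEI><text>lot 1</text></TEI>"))
    print(is_well_formed("<TEI><text>lot 1</TEI>"))
```

A file that fails this check will also fail schema validation, so it can catch unclosed or mismatched tags early.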

Toolbox

This repository also contains a set of tools that can be used on the training sets.

  • PDF Preprocessing
  • Quality assessment
  • XML validity checker (used by a GitHub Action)

DataCatalogue organization information


Logo by Alix Chagué, inspired by Loading Artist.
