Read and extract text and other content from PDFs in C# (port of PDFBox)
Updated Nov 2, 2024 - C#
OCR engine for all the languages
Document Layout Analysis resources for development with PdfPig.
Page to PAGE Layout Analysis Tool
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
A powerful CLI tool for visualization and encoding of PAGE-XML files
The repo gt_structure_1_4 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.
OCR-D guidelines for Ground Truth production
OCR-D wrapper for page-xml-draw
The repo gt_structure_1_3 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.
The repo gt_structure_1_2 is part of the OCR-D Ground Truth Structure corpus. Only the structure of the printed page is annotated. The corpus was created as a result of the DFG project OCR-D.
The GBN Dataset consists of German-Brazilian historical newspapers, along with their digital and binarized images and ground truth files.