A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
Updated Sep 20, 2024 - Python
OCR engine for all the languages
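For document images, integrates the results of layout analysis, text recognition, table recognition, and formula recognition to restore the layout information.
A toolbox of OCR models and algorithms based on MindSpore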
A Unified Toolkit for Deep Learning Based Document Image Analysis
Layout analysis for Chinese and English documents
YOLO models trained on DocLayNet - power your document intelligence with layout analysis
PdfDet aims to simplify PDF layout detection tasks for users.
A Python package to structure files using visual and style information
An official implementation of the paper "Paragraph2Graph: A Language-independent GNN-based framework for layout analysis"
[ICDAR 2023] SelfDocSeg: A self-supervised vision-based approach towards Document Segmentation (Oral)
Trained Detectron2 object detection models for document layout analysis based on the PubLayNet dataset
OCR-D compliant toolset for optical layout recognition on historical German-language documents published in Brazil
A powerful CLI tool for visualization and encoding of PAGE-XML files
OCR-D wrapper for page-xml-draw
This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.
A more complete example of programming with PDFMiner, which continues where the default documentation stops
BA-thesis in history.