📝 Docs Parsing Techniques

A curated collection of Jupyter notebooks for experimenting with state-of-the-art OCR, document parsing, table extraction, and chart understanding techniques. This repository enables easy benchmarking and practical usage of the latest open-source and cloud-based solutions for document image processing.

🚀 Notebooks Overview

Notebook	Description
bytedance-dolphin-image-parsing.ipynb	Document page parsing with Dolphin by ByteDance
docling-documents-parsing-and-tables-extraction.ipynb	Parsing and table extraction with Docling
florence-2-large-ocr-documents-pages.ipynb	OCR of document pages using Florence-2 Large
florence-2-large-ocr-images-real-life-scenarios.ipynb	OCR of real-life scene images using Florence-2 Large
gemini-2-5-pro-on-chart-and-table-extraction.ipynb	Chart/table extraction using Gemini 2.5 Pro
got-ocr2-0-docs-parsing.ipynb	Document page parsing with GOT-OCR2.0 and Gemini 2.5 Flash
marker-docs-parsing.ipynb	Marker-based document parsing experiments
mistralocr-docs-parsing.ipynb	Document parsing using MistralOCR
monkeyocr-docs-pages-parsing.ipynb	Document parsing with MonkeyOCR
Nanonets-OCR-s_docs_parsing.ipynb	Document parsing using Nanonets-OCR-s
ollama-llama3-2-vision-usage.ipynb	Using Llama 3.2 Vision through Ollama for document parsing
paddleocr-3-0-docs-parsing.ipynb	Parsing with PaddleOCR 3.0 PP-StructureV3
pix2text-docs-pages-parsing.ipynb	Document parsing using Pix2Text
smoldocling-documents-understanding.ipynb	Document understanding with SmolDocling
zerox-pdf-parsing.ipynb	PDF parsing experiments with Zerox
qwen2-vl-2b-docs-parsing.ipynb	Document page parsing with Qwen2-VL-2B

📖 Project Goals

Benchmark different OCR/document parsing models on real documents.
Demonstrate table, chart, and text extraction workflows.
Compare open-source and commercial solutions.
Provide ready-to-use code snippets for rapid prototyping.
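
As an illustration of the benchmarking goal, here is a minimal sketch of a comparison harness. It is not taken from the notebooks: the `benchmark` function, the parser callables, and the choice of a character-level similarity score from the standard library's `difflib` are all assumptions made for this example. Each parser is assumed to be a function that takes a file path and returns the extracted text.

```python
import time
from difflib import SequenceMatcher
from typing import Callable


def benchmark(
    parsers: dict[str, Callable[[str], str]],
    document_path: str,
    reference_text: str,
) -> list[dict]:
    """Run each parser once and score its output against a reference transcription.

    `parsers` maps a label to a hypothetical callable (path -> extracted text).
    The similarity score is difflib's ratio in [0, 1]; it is a rough proxy,
    not an established OCR metric such as character error rate.
    """
    results = []
    for name, parse in parsers.items():
        start = time.perf_counter()
        output = parse(document_path)
        elapsed = time.perf_counter() - start
        similarity = SequenceMatcher(
            None, reference_text, output, autojunk=False
        ).ratio()
        results.append(
            {"parser": name, "seconds": elapsed, "similarity": similarity}
        )
    # Best-matching parser first.
    return sorted(results, key=lambda r: r["similarity"], reverse=True)
```

To use it, wrap the inference code from any notebook in a function that returns plain text and pass it in under a label of your choice. A single run per parser gives only a rough timing; repeat runs if latency matters for your comparison.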

🛠️ Usage

Clone the repository:

git clone https://github.com/AdemBoukhris457/Docs_Parsing_Techniques.git

Install dependencies as needed for each notebook (see the first cells of each .ipynb for requirements).
Launch Jupyter Notebook or JupyterLab and open any notebook of interest.
Run the cells and adapt the code for your documents.
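
Since each notebook declares its own requirements in its first cells, it can help to list them before opening anything. The sketch below reads the `.ipynb` files as JSON and prints lines that look like pip installs. The helper names, the three-cell limit, and the assumed clone location `Docs_Parsing_Techniques` are choices made for this example, and the install-line prefixes it matches are a guess at how the notebooks declare dependencies.

```python
import json
from pathlib import Path


def first_code_cells(notebook: dict, limit: int = 3) -> list[str]:
    """Return the source text of the first `limit` code cells of a parsed .ipynb."""
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # The notebook format stores source as one string or a list of lines.
        sources.append(src if isinstance(src, str) else "".join(src))
        if len(sources) == limit:
            break
    return sources


def install_lines(notebook: dict, limit: int = 3) -> list[str]:
    """Collect lines that look like pip installs from the first code cells."""
    found = []
    for source in first_code_cells(notebook, limit):
        for line in source.splitlines():
            stripped = line.strip()
            if stripped.startswith(("!pip install", "%pip install", "pip install")):
                found.append(stripped)
    return found


if __name__ == "__main__":
    repo = Path("Docs_Parsing_Techniques")  # assumed clone location
    for path in sorted(repo.glob("*.ipynb")):
        nb = json.loads(path.read_text(encoding="utf-8"))
        print(path.name)
        for line in install_lines(nb):
            print("   ", line)
```

Run the script from the directory that contains the clone. A notebook that prints no install lines may still import packages that need to be installed, so check its first cells by hand as well.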

📌 Notes

Some notebooks require model weights or API keys; check the comments in each notebook for details.
Results, insights, and sample outputs are provided inline.
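
For the notebooks that call a hosted API, one common pattern is to read the key from an environment variable rather than pasting it into a cell. The sketch below shows that pattern under stated assumptions: `require_api_key` is a hypothetical helper, and `EXAMPLE_API_KEY` is a placeholder name, so use whichever variable the notebook or provider actually expects.

```python
import os


def require_api_key(env_var: str) -> str:
    """Return the API key stored in `env_var`, or fail with a clear message."""
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return value


# Example usage (placeholder variable name):
# api_key = require_api_key("EXAMPLE_API_KEY")
```

Failing early with an explicit message is easier to debug than an authentication error raised deep inside a client library, and it keeps keys out of saved notebook outputs.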

🔗 Related Resources

📂 You can find more notebooks, experiments, and datasets related to document parsing and OCR on my Kaggle profile: 👉 https://www.kaggle.com/ademboukhris/code
