Wizard Extract

WizardExtract is a Python library for reliable text extraction from PDFs, Office documents, and images. It supports local OCR with Tesseract and cloud OCR with Azure Document Intelligence. It provides page and sheet selection, hybrid PDF handling that combines native text with OCR, and deterministic I/O. With the Azure prebuilt-layout model it can also return tables and key-value pairs.

Installation

Requires Python 3.9+.

pip install wizardextract

Optional extras:

Azure OCR: pip install "wizardextract[azure]"

Local OCR requires Tesseract to be installed on your system.
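Before enabling OCR it can help to confirm that Tesseract is reachable. The following is a minimal sketch using only the standard library. The helper name `tesseract_available` is illustrative and not part of WizardExtract, and it assumes the executable is called `tesseract` and is expected on PATH.

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found:", shutil.which("tesseract"))
    else:
        print("Tesseract not found; install it before using ocr=True.")
```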

Quick start

import wizardextract as we

text = we.extract_text("example.pdf")
print(text)

API overview

Method	Purpose
`extract_text`	Local text extraction with optional Tesseract OCR
`extract_text_azure`	Cloud extraction via Azure (text, tables, key-value)

Text extraction

Parameters

input_data: The input document, given as a file path (str or Path) or as raw bytes.
extension: The file extension, required only if input_data is bytes.
pages: Page/sheet selection.
• Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]
• Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
ocr: Enables OCR using Tesseract. Applies to PDF/DOCX and image-based files.
language_ocr: Language code for OCR. Defaults to 'eng'.
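To make the paged selection syntax concrete, here is a small sketch of how a value such as [1, 3, "5-8"] could expand into individual page numbers. This is not WizardExtract's implementation: `expand_pages` is a hypothetical helper, it assumes pages are 1-based, and it covers only paged formats (Excel sheet names are out of scope). It also accepts comma-separated strings such as "1,3,5-7".

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand a page selection into a sorted list of unique page numbers."""
    # A single int or string is treated as a one-element selection.
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # Strings may hold comma-separated parts, each a number or an "a-b" range.
        for part in item.split(","):
            part = part.strip()
            if "-" in part:
                start, end = (int(p) for p in part.split("-", 1))
                pages.update(range(start, end + 1))  # ranges are inclusive
            else:
                pages.add(int(part))
    return sorted(pages)


print(expand_pages(1))               # [1]
print(expand_pages("1-3"))           # [1, 2, 3]
print(expand_pages([1, 3, "5-8"]))   # [1, 3, 5, 6, 7, 8]
```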

Examples

Basic:

import wizardextract as we

txt = we.extract_text("docs/report.pdf")

From bytes:

from pathlib import Path
import wizardextract as we

raw = Path("img.png").read_bytes()
txt_img = we.extract_text(raw, extension="png")
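When the bytes come from a file whose name is known, the extension does not have to be hard-coded. The sketch below derives it from the filename in the dot-less form used above ("png"). `extension_of` is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def extension_of(filename: str) -> str:
    """Return the file extension without the leading dot, lowercased."""
    return Path(filename).suffix.lstrip(".").lower()


print(extension_of("scans/img.PNG"))   # png
print(extension_of("archive.tar.gz"))  # gz
print(repr(extension_of("README")))    # ''
```

The result can then be passed alongside the bytes, for example extension=extension_of("img.png").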

Paged selection and OCR:

import wizardextract as we

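# Extract only pages 1, 3, and 5 through 7.
sel = we.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
# Run OCR on a scanned TIFF using the Italian language data.
ocr_txt = we.extract_text("scan.tiff", ocr=True, language_ocr="ita")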

Supported Formats

Format	OCR Option
PDF	Optional
DOC	No
DOCX	Optional
XLSX	No
XLS	No
TXT	No
CSV	No
JSON	No
HTML	No
HTM	No
TIF	Default
TIFF	Default
JPG	Default
JPEG	Default
PNG	Default
GIF	Default
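The table above can be turned into a lookup when a script needs to decide whether passing ocr=True is meaningful for a given file. This is a sketch only: `OCR_MODE` and `ocr_mode` are hypothetical names, and reading "Default" as "OCR is applied without being requested" is an assumption.

```python
from pathlib import Path

# OCR behaviour per extension, copied from the Supported Formats table.
OCR_MODE = {
    "pdf": "optional", "docx": "optional",
    "doc": "no", "xlsx": "no", "xls": "no", "txt": "no",
    "csv": "no", "json": "no", "html": "no", "htm": "no",
    "tif": "default", "tiff": "default", "jpg": "default",
    "jpeg": "default", "png": "default", "gif": "default",
}


def ocr_mode(filename: str) -> str:
    """Return 'optional', 'no', or 'default'; raise ValueError if unsupported."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext not in OCR_MODE:
        raise ValueError(f"Unsupported format: {filename!r}")
    return OCR_MODE[ext]


print(ocr_mode("report.pdf"))  # optional
print(ocr_mode("scan.TIFF"))   # default
print(ocr_mode("data.csv"))    # no
```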

Azure OCR

Parameters

input_data: The input document, given as a file path (str or Path) or as raw bytes.
extension: File extension when bytes are passed.
language_ocr: OCR language code (ISO-639).
pages: Page selection (int, "1,3,5-7", or list).
azure_endpoint: Azure Document Intelligence endpoint URL.
azure_key: Azure API key.
azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).
hybrid: If True, PDFs are processed in hybrid mode: native text is read with PyMuPDF and images are sent to OCR.
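Since azure_endpoint and azure_key are credentials, one common approach is to keep them out of source code and read them from the environment. The sketch below collects them into keyword arguments. The variable names AZURE_DI_ENDPOINT, AZURE_DI_KEY, and AZURE_DI_MODEL and the helper `azure_settings` are assumptions made for this example, not something WizardExtract defines.

```python
import os
from typing import Dict, Mapping


def azure_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect Azure keyword arguments from environment variables.

    The variable names are hypothetical and chosen for this example only.
    """
    required = ("AZURE_DI_ENDPOINT", "AZURE_DI_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "azure_endpoint": env["AZURE_DI_ENDPOINT"],
        "azure_key": env["AZURE_DI_KEY"],
        # Fall back to the text-only model when no model is configured.
        "azure_model_id": env.get("AZURE_DI_MODEL", "prebuilt-read"),
    }


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {
    "AZURE_DI_ENDPOINT": "https://<resource>.cognitiveservices.azure.com/",
    "AZURE_DI_KEY": "<KEY>",
}
print(azure_settings(demo_env)["azure_model_id"])  # prebuilt-read
```

The returned dictionary can be unpacked into the call, for example we.extract_text_azure("invoice.pdf", **azure_settings()).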

Example

import wizardextract as we

res = we.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)

print(res.text)               # extracted text
print(res.pretty_tables[:1])  # first table only (prebuilt-layout)
print(res.key_value)          # key-value pairs (prebuilt-layout)

License

AGPL-3.0-or-later.

RESOURCES

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com
