WizardExtract is a Python library for reliable text extraction from PDFs, Office documents, and images. It supports local OCR with Tesseract and cloud OCR with Azure Document Intelligence. It provides page and sheet selection, hybrid PDF handling that combines native text with OCR, and deterministic I/O. With Azure prebuilt-layout it can also return tables and key-value pairs.
Requires Python 3.9+.
pip install wizardextract

Optional extras:
- Azure OCR:
pip install "wizardextract[azure]"
For OCR capabilities, ensure you have Tesseract installed on your system.
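A quick way to confirm that Tesseract is reachable before enabling OCR is to look for its executable on PATH. This is a minimal sketch using only the standard library; `find_tesseract` is a hypothetical helper, not part of WizardExtract, and it assumes the binary is discovered through PATH rather than through a custom configuration.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract was not found on PATH; install it before using ocr=True.")
else:
    print(f"Tesseract found at: {tesseract_path}")
```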
import wizardextract as we
text = we.extract_text("example.pdf")
print(text)

| Method | Purpose |
|---|---|
| extract_text | Local text extraction with optional Tesseract OCR |
| extract_text_azure | Cloud extraction via Azure (text, tables, key-value) |
- input_data: [str, bytes, Path]
- extension: The file extension, required only if input_data is bytes.
- pages: Page/sheet selection.
  - Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]
  - Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
- ocr: Enables OCR using Tesseract. Applies to PDF/DOCX and image-based files.
- language_ocr: Language code for OCR. Defaults to 'eng'.
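To make the paged selection forms concrete, the sketch below expands a value such as 1, "1-3", or [1, 3, "5-8"] into an explicit list of page numbers. `expand_pages` is a hypothetical helper written for illustration; it is not the library's own parser, and it assumes pages are 1-based and ranges are inclusive. It also accepts comma-separated strings such as "1,3,5-7".

```python
from typing import List, Union

PageSpec = Union[int, str, List[Union[int, str]]]


def expand_pages(spec: PageSpec) -> List[int]:
    """Expand a page selection into a sorted list of unique page numbers."""
    items = spec if isinstance(spec, list) else [spec]
    pages = set()
    for item in items:
        if isinstance(item, int):
            pages.add(item)
            continue
        # Strings may hold single pages and inclusive ranges, separated by commas.
        for part in str(item).split(","):
            part = part.strip()
            if "-" in part:
                start, end = (int(p) for p in part.split("-", 1))
                if start > end:
                    raise ValueError(f"invalid range: {part!r}")
                pages.update(range(start, end + 1))
            else:
                pages.add(int(part))
    # Assumption: page numbers start at 1.
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are assumed to be 1-based")
    return sorted(pages)


print(expand_pages([1, 3, "5-8"]))  # [1, 3, 5, 6, 7, 8]
```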
Basic:
import wizardextract as we
txt = we.extract_text("docs/report.pdf")

From bytes:
from pathlib import Path
import wizardextract as we
raw = Path("img.png").read_bytes()
txt_img = we.extract_text(raw, extension="png")

Paged selection and OCR:
import wizardextract as we
sel = we.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = we.extract_text("scan.tiff", ocr=True, language_ocr="ita")

| Format | OCR Option |
|---|---|
| PDF | Optional |
| DOC | No |
| DOCX | Optional |
| XLSX | No |
| XLS | No |
| TXT | No |
| CSV | No |
| JSON | No |
| HTML | No |
| HTM | No |
| TIF | Default |
| TIFF | Default |
| JPG | Default |
| JPEG | Default |
| PNG | Default |
| GIF | Default |
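The format table can be mirrored as a small lookup, which may help when deciding whether passing ocr=True makes sense for a given file. This is only a sketch: `OCR_SUPPORT` and `ocr_mode` are hypothetical names, the values are copied from the table, and PDF is assumed to be "optional" based on the description of the ocr parameter.

```python
from pathlib import Path

# OCR behaviour per extension, transcribed from the table above.
# Assumption: PDF is "optional", as the ocr parameter applies to PDF/DOCX.
OCR_SUPPORT = {
    "pdf": "optional", "docx": "optional",
    "doc": "no", "xlsx": "no", "xls": "no", "txt": "no",
    "csv": "no", "json": "no", "html": "no", "htm": "no",
    "tif": "default", "tiff": "default", "jpg": "default",
    "jpeg": "default", "png": "default", "gif": "default",
}


def ocr_mode(filename: str) -> str:
    """Return 'optional', 'default', or 'no' for the file's extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in OCR_SUPPORT:
        raise ValueError(f"extension not listed in the format table: {ext!r}")
    return OCR_SUPPORT[ext]


print(ocr_mode("docs/report.pdf"))  # optional
print(ocr_mode("scan.TIFF"))        # default
```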
- input_data: [str, bytes, Path]
- extension: File extension when bytes are passed.
- language_ocr: OCR language code (ISO-639).
- pages: Page selection (int, "1,3,5-7", or list).
- azure_endpoint: Azure Document Intelligence endpoint URL.
- azure_key: Azure API key.
- azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).
- hybrid: If True, PDFs are processed in hybrid mode: native text is read via PyMuPDF and images go through OCR.
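Rather than hard-coding the endpoint and key, the Azure parameters can be collected from environment variables and passed as keyword arguments. This is a sketch under assumptions: the variable names AZURE_DI_ENDPOINT and AZURE_DI_KEY and the helper `azure_kwargs` are hypothetical and not defined by WizardExtract; only the keyword names azure_endpoint, azure_key, and azure_model_id come from the parameter list.

```python
import os
from typing import Dict


def azure_kwargs(model_id: str = "prebuilt-read") -> Dict[str, str]:
    """Build the Azure keyword arguments from environment variables."""
    # Hypothetical variable names; choose whatever fits your deployment.
    endpoint = os.environ.get("AZURE_DI_ENDPOINT")
    key = os.environ.get("AZURE_DI_KEY")
    if not endpoint or not key:
        raise RuntimeError("Set AZURE_DI_ENDPOINT and AZURE_DI_KEY first.")
    return {
        "azure_endpoint": endpoint,
        "azure_key": key,
        "azure_model_id": model_id,
    }


# Intended usage (left commented so the sketch runs without the library):
# import wizardextract as we
# res = we.extract_text_azure("invoice.pdf", **azure_kwargs("prebuilt-layout"))
```

Keeping credentials out of source files also makes it easier to switch between resources without editing code.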
import wizardextract as we
res = we.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)
print(res.text)
print(res.pretty_tables[:1])
print(res.key_value)

Author: Mattia Rubino
Email: textwizard.dev@gmail.com
