TextWizard is a Python library to extract, clean, and analyze text from PDFs, DOCX, images, CSV, HTML/XML, and more. It includes local OCR (Tesseract), cloud OCR with Azure Document Intelligence, multi-backend NER, language detection, lexical statistics, and HTML utilities.
- Installation
- Quick start
- API overview
- Text extraction
- Azure OCR
- HTML cleaning
- XML cleaning
- CSV cleaning
- Named-Entity Recognition (NER)
- Spell checking
- Language detection
- Text statistics
- Text similarity
- Beautiful HTML
- HTML to Markdown
- License
- Resources
TextWizard is a Python toolkit for end-to-end text ingestion: it extracts, cleans, and analyzes content from PDFs, Office documents, images, HTML/XML, CSV, and plain text. It unifies local OCR (Tesseract) and Azure Document Intelligence, normalizes noisy markup, and exposes text, tables, and key-value pairs through one consistent API.
It targets production pipelines: deterministic I/O, page selection and hybrid PDF handling, multi-backend NER (spaCy, Stanza), language detection for 160+ languages, compact spell-checking tries, lexical statistics, and HTML utilities (sanitization, pretty-print, HTML→Markdown). The goal is to be a dependable, high-level building block for practical text extraction and cleanup in Python.
## Installation

Requires Python 3.9+.
```bash
pip install textwizard
```

Optional extras:

- Azure OCR: `pip install "textwizard[azure]"`
- NER: `pip install "textwizard[ner]"`
- Everything: `pip install "textwizard[all]"`
For OCR capabilities, ensure you have Tesseract installed on your system.
For spaCy models, e.g.: `python -m spacy download en_core_web_sm`.
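If you plan to use `ocr=True`, a quick pre-flight check that the Tesseract binary is actually on `PATH` can save a confusing failure later (a convenience sketch, not part of TextWizard's API):

```python
import shutil

# textwizard's local OCR path relies on a system-wide Tesseract install;
# fail fast if the binary is missing.
if shutil.which("tesseract") is None:
    raise RuntimeError("Tesseract not found on PATH; install it before using ocr=True")
```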
## Quick start

```python
import textwizard as tw
text = tw.extract_text("example.pdf")
print(text)
```

## API overview

| Method | Purpose |
|---|---|
| `extract_text` | Local text extraction with optional Tesseract OCR |
| `extract_text_azure` | Cloud extraction via Azure (text, tables, key-value) |
| `clean_html` | High-level HTML cleaning with semantic flags |
| `clean_xml` | XML cleanup and normalization |
| `clean_csv` | CSV cleanup with configurable dialect |
| `extract_entities` | NER via spaCy / Stanza / spaCy-Stanza |
| `correctness_text` | Spell checking |
| `lang_detect` | Language detection |
| `analyze_text_statistics` | Lexical metrics (entropy, Zipf, Gini, …) |
| `text_similarity` | Similarity: cosine, jaccard, levenshtein |
| `beautiful_html` | Pretty-print HTML |
| `html_to_markdown` | Convert HTML → Markdown |
## Text extraction

Parameters

- `input_data`: `str | bytes | Path`.
- `extension`: the file extension, required only if `input_data` is `bytes`.
- `pages`: page/sheet selection.
  - Paged formats (PDF, DOCX, TIFF): `1`, `"1-3"`, `[1, 3, "5-8"]`.
  - Excel (XLSX/XLS): sheet index (`int`), name (`str`), or a mixed list (see the sheet-selection sketch after the format table below).
- `ocr`: enables OCR using Tesseract. Applies to PDF/DOCX and image-based files.
- `language_ocr`: language code for OCR. Defaults to `'eng'`.
Basic:

```python
import textwizard as tw
txt = tw.extract_text("docs/report.pdf")
```

From bytes:

```python
from pathlib import Path
import textwizard as tw
raw = Path("img.png").read_bytes()
txt_img = tw.extract_text(raw, extension="png")
```

Paged selection and OCR:

```python
import textwizard as tw
sel = tw.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = tw.extract_text("scan.tiff", ocr=True, language_ocr="ita")
```

Supported formats:

| Format | OCR Option |
|---|---|
| PDF | Optional |
| DOC | No |
| DOCX | Optional |
| XLSX | No |
| XLS | No |
| TXT | No |
| CSV | No |
| JSON | No |
| HTML | No |
| HTM | No |
| TIF | Default |
| TIFF | Default |
| JPG | Default |
| JPEG | Default |
| PNG | Default |
| GIF | Default |
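For Excel files the same `pages` parameter selects sheets rather than pages; a minimal sketch (the file and sheet names are illustrative, and whether sheet indices are 0- or 1-based follows the library's paged convention shown above):

```python
import textwizard as tw

# Select sheets by index, by name, or a mix of both in one list.
one_sheet = tw.extract_text("book.xlsx", pages=1)           # sheet index
by_name = tw.extract_text("book.xlsx", pages="Summary")     # sheet name
mixed = tw.extract_text("book.xlsx", pages=[1, "Summary"])  # mixed list
```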
## Azure OCR

Parameters

- `input_data`: `str | bytes | Path`.
- `extension`: file extension when `bytes` are passed.
- `language_ocr`: OCR language code (ISO-639).
- `pages`: page selection (`int`, `"1,3,5-7"`, or list).
- `azure_endpoint`: Azure Document Intelligence endpoint URL.
- `azure_key`: Azure API key.
- `azure_model_id`: `"prebuilt-read"` (text only) or `"prebuilt-layout"` (text + tables + key-value).
- `hybrid`: if `True`, PDFs are processed in hybrid mode: native text via PyMuPDF, embedded images via OCR.

```python
import textwizard as tw
res = tw.extract_text_azure(
"invoice.pdf",
language_ocr="ita",
azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
azure_key="<KEY>",
azure_model_id="prebuilt-layout",
hybrid=True,
)
print(res.text)
print(res.pretty_tables[:1])
print(res.key_value)
```
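The endpoint and key are inlined above for brevity; in practice you would read them from the environment (the variable names below are placeholders, not anything TextWizard requires):

```python
import os
import textwizard as tw

res = tw.extract_text_azure(
    "invoice.pdf",
    azure_endpoint=os.environ["AZURE_DI_ENDPOINT"],  # placeholder variable name
    azure_key=os.environ["AZURE_DI_KEY"],            # placeholder variable name
    azure_model_id="prebuilt-read",                  # text-only model
)
print(res.text)
```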
## HTML cleaning

Parameters

- `text`: `str` HTML input.
- `remove_script`: remove executable tags (`<script>`, `<template>`).
- `remove_metadata_tags`: remove metadata (`<link>`, `<meta>`, `<base>`, `<noscript>`, `<style>`, `<title>`).
- `remove_flow_tags`: remove flow content (`<address>`, `<div>`, `<input>`, …).
- `remove_sectioning_tags`: remove sectioning content (`<article>`, `<aside>`, `<nav>`, …).
- `remove_heading_tags`: remove heading tags (`<h1>`–`<h6>`).
- `remove_phrasing_tags`: remove phrasing content (`<audio>`, `<code>`, `<textarea>`, …).
- `remove_embedded_tags`: remove embedded content (`<iframe>`, `<embed>`, `<img>`).
- `remove_interactive_tags`: remove interactive content (`<button>`, `<input>`, `<select>`).
- `remove_palpable`: remove palpable elements (`<address>`, `<math>`, `<table>`, …).
- `remove_doctype`: remove `<!DOCTYPE html>`.
- `remove_comments`: remove HTML comments.
- `remove_specific_attributes`: remove specific attributes (supports wildcards).
- `remove_specific_tags`: remove specific tags (supports wildcards).
- `remove_empty_tags`: drop empty tags.
- `remove_content_tags`: remove the content of given tags.
- `remove_tags_and_contents`: remove tags and their contents.

There are three modes with different return types:
| Mode | How to trigger | Output | Description |
|---|---|---|---|
| A – text-only | No parameters provided (all `None`) | `str` (plain text) | Extracts text, skips script-supporting tags, inserts safe spaces. |
| B – structural clean | At least one flag is `True` | `str` (serialized HTML) | Removes/unwraps per flags. Supports wildcard tag/attribute removal, content stripping, empty-tag pruning. |
| C – text with preservation | Parameters present and all `False` | `str` (text + preserved markup) | Extracts text but preserves groups explicitly set to `False` (and comments/doctype if set `False`). |
A) Text-only (no params)

```python
import textwizard as tw
txt = tw.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt)  # -> "Hello"
```

B) Structural clean (HTML out)

```python
import textwizard as tw
html = """
<html><head><title>x</title><script>evil()</script></head>
<body>
<article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article><!-- comment -->
</body></html>
"""
out = tw.clean_html(
html,
remove_script=True,
remove_metadata_tags=True,
remove_embedded_tags=True,
remove_specific_attributes=["id", "on*"],
remove_empty_tags=True,
remove_comments=True,
remove_doctype=True,
)
print(out)
```

Output

```
<html>
<body>
<article><h1>Title</h1><p>hello</p></article>
</body></html>
```

C) Text with preservation (`False` flags)

```python
import textwizard as tw
html = "<html><body><article><h1>T</h1><p>Body</p><!-- c --></article></body></html>"
txt = tw.clean_html(
html,
remove_sectioning_tags=False, # keep <article> in output
remove_heading_tags=False, # keep <h1> in output
remove_comments=False, # keep comments
)
print(txt)
```

Output

```
<article><h1>T</h1>Body<!-- c --></article>
```

Wildcard selectors

```python
import textwizard as tw
html = '<div id="hero" data-track="x" onclick="h()"><img src="a.png"></div>'
out = tw.clean_html(
html,
remove_specific_attributes=["id", "data-*", "on*"],
remove_specific_tags=["im_"],
)
print(out)
```

Output

```
<html><head></head><body><div></div></body></html>
```
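The content-level flags from the parameter list (`remove_content_tags`, `remove_tags_and_contents`) target named tags directly, analogous to the wildcard example above; a minimal sketch (the exact serialization may differ):

```python
import textwizard as tw

html = "<div><style>.a{}</style><p>keep</p><aside>ads</aside></div>"
out = tw.clean_html(
    html,
    remove_content_tags=["style"],       # keep the <style> tag, drop its content
    remove_tags_and_contents=["aside"],  # drop <aside> together with its content
)
print(out)
```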
## XML cleaning

Parameters

- `text`: `str | bytes` XML input.
- `remove_comments`: remove `<!-- ... -->`.
- `remove_processing_instructions`: remove `<? ... ?>`.
- `remove_cdata_sections`: unwrap `<![CDATA[...]]>`.
- `remove_empty_tags`: drop empty elements.
- `remove_namespaces`: drop prefixes and `xmlns`.
- `remove_duplicate_siblings`: keep only the first identical sibling.
- `collapse_whitespace`: collapse runs of whitespace.
- `remove_specific_tags`: delete tags (supports wildcards).
- `remove_content_tags`: keep the tag but delete its inner content.
- `remove_attributes`: delete attributes (supports wildcards).
- `remove_declaration`: drop `<?xml ...?>`.
- `normalize_entities`: convert entities like `&amp;` → `&`.

```python
import textwizard as tw
xml = "<root xmlns='ns'><a/><b>ok</b><!-- x --></root>"
fixed = tw.clean_xml(
xml,
remove_namespaces=True,
remove_empty_tags=True,
remove_comments=True,
normalize_entities=True,
)
print(fixed)
```

Output

```
<root><b>ok</b></root>
```
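`remove_specific_tags` accepts wildcard patterns, as in the HTML cleaner; a minimal sketch (the expected output is illustrative):

```python
import textwizard as tw

xml = "<root><log1>a</log1><log2>b</log2><data>ok</data></root>"
out = tw.clean_xml(xml, remove_specific_tags=["log*"])  # matches <log1> and <log2>
print(out)  # expected: <root><data>ok</data></root>
```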
## CSV cleaning

Behavior

- Columns can be removed by name (with a header) or by 0-based index.
- `remove_row_index` uses 0-based indices over the parsed rows. If a header exists, it is row `0`.
- `remove_values` blanks matching cells. Supports wildcards `*` and `?`.
- `remove_empty_columns` / `remove_empty_rows` run after the other edits.
- Output is serialized with the provided dialect (`delimiter`, `quotechar`, `quoting`, etc.).
Parameters
- `text`: raw CSV string.
- Dialect options: `delimiter`, `quotechar`, `escapechar`, `doublequote`, `skipinitialspace`, `lineterminator`, `quoting`.
- `remove_columns`: name or 0-based index (or a list).
- `remove_row_index`: 0-based index (or a list).
- `remove_values`: literal values or wildcard patterns to blank out.
- `remove_duplicates_rows`: remove duplicate rows.
- `trim_whitespace`: strip whitespace inside fields.
- `remove_empty_columns`: drop empty columns.
- `remove_empty_rows`: drop empty rows.
Example

```python
import textwizard as tw
csv_data = """id,name,age,city,salary
1,John,30,New York,50000
2,Jane,25,,40000
3,,35,Los Angeles,60000
4,Mark,45,,70000
5,Sarah,40,New York,
1,John,30,New York,50000
"""
out = tw.clean_csv(
csv_data,
delimiter=",",
remove_columns=["id", "salary"],
remove_values=["John", "50000"],
trim_whitespace=True,
remove_empty_columns=True,
remove_empty_rows=True,
remove_duplicates_rows=True,
)
print(out)
```

Output

```
name,age,city
,30,New York
Jane,25,
,35,Los Angeles
Mark,45,
Sarah,40,New York
```
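`remove_values` also accepts the `*`/`?` wildcard patterns mentioned under Behavior; a minimal sketch (the expected output is illustrative):

```python
import textwizard as tw

csv_data = "code,qty\nAB-1,10\nAB-2,20\nZZ-9,30\n"
out = tw.clean_csv(csv_data, remove_values=["AB-*"])  # blanks every AB-… code
print(out)
# expected:
# code,qty
# ,10
# ,20
# ZZ-9,30
```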
## Named-Entity Recognition (NER)

Parameters

- `text`: `str` input.
- `engine`: `'spacy' | 'stanza' | 'spacy_stanza'` (default `'spacy'`).
- `model`: spaCy model name or path (spaCy engine only).
- `language`: ISO code for the Stanza engines.
- `device`: `'auto' | 'cpu' | 'gpu'` (default `'auto'`).

```python
import textwizard as tw
sample = (
"Alex Rivera traveled to Springfield to meet the research team at Northstar Analytics on 14 March 2025. "
"The next day, he signed a pilot agreement with Horizon Bank and gave a talk at the University of Westland at 10:30 AM."
)
res = tw.extract_entities(sample)
print([e.text for e in res.entities["PERSON"]])
print([e.text for e in res.entities["GPE"]])
print([e.text for e in res.entities["ORG"]])Output
['Alex Rivera']
['Springfield']
['Northstar Analytics', 'Horizon Bank', 'the University of Westland']
```
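Switching backends is the same call with a different `engine`; a minimal sketch (the Stanza engine downloads its models on first use, so this needs network access the first time):

```python
import textwizard as tw

res = tw.extract_entities(
    "Alex Rivera visited Springfield.",
    engine="stanza",   # or "spacy_stanza"
    language="en",     # ISO code selects the Stanza model
)
print(sorted(res.entities.keys()))
```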
## Spell checking

Parameters

- `text`: string to analyze.
- `language`: ISO code.
- `dict_dir`: folder with `*.marisa.zst` dictionaries. If `None`, the user data dir is used and dictionaries are downloaded on demand.
- `use_mmap`: `True` to memory-map the uncompressed trie.

```python
import textwizard as tw
check = tw.correctness_text("Thiss sentense has a typo.", language="en")
print(check)
```

Output

```
{'errors_count': 2, 'errors': ['thiss', 'sentense']}
```
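For long-running processes, the documented `dict_dir` and `use_mmap` options keep dictionaries local and reduce resident memory; a minimal sketch (the directory path is illustrative):

```python
from pathlib import Path
import textwizard as tw

check = tw.correctness_text(
    "Thiss sentense has a typo.",
    language="en",
    dict_dir=Path("/opt/textwizard/dicts"),  # illustrative folder of *.marisa.zst files
    use_mmap=True,                           # memory-map the uncompressed trie
)
print(check["errors_count"])
```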
## Language detection

Language identification via character n-gram profiles. Candidates are gated using priors and linguistic cues, then a probability is estimated for each language. Supports 161 languages. Returns a top-1 ISO code or a probability-ordered list.

Parameters

- `text`: input string (Unicode).
- `top_k`: number of candidates to return (default `3`).
- `profiles_dir`: override the bundled profiles directory.
- `use_mmap`: if `True`, memory-map the profile tries (lower RAM; first access may be slightly slower).
- `return_top1`: if `True`, return only the best language code; otherwise a list of `(lang, prob)` pairs.
Top-1 (single code)

```python
import textwizard as tw
text = "Ciao, come stai oggi?"
lang = tw.lang_detect(text, return_top1=True)
print(lang)
```

Output

```
it
```

Top-k distribution

```python
import textwizard as tw
text = "The quick brown fox jumps over the lazy dog."
langs = tw.lang_detect(text, top_k=5, return_top1=False)
print(langs)
```

Output

```
[('en', 0.9999376335362183), ('mg', 4.719212057614953e-05), ('fy', 1.4727973350205069e-05), ('rm', 2.8718519851832537e-07), ('la', 1.5918465665694727e-07)]
```

Batch example

```python
import textwizard as tw
tests = [
"これは日本語のテスト文です。",
"Alex parle un peu français, aber nicht so viel.",
"¿Dónde está la estación de tren?",
]
for s in tests:
print("TOP1:", tw.lang_detect(s, return_top1=True))Output
TOP1: ja
TOP1: fr
TOP1: es
```

Custom profiles & mmap

```python
from pathlib import Path
import textwizard as tw
langs = tw.lang_detect(
"Buongiorno a tutti!",
profiles_dir=Path("/opt/textwizard/profiles"), # custom profiles
use_mmap=True, # lower RAM
top_k=3,
)
print(langs)
```

## Text statistics

Computes: `entropy`, `zipf.slope`, `zipf.r2`, `vocab_gini`, `type_token_ratio`, `hapax_ratio`, `simpson_index`, `yule_k`, `avg_word_length`.
Tokens are lower-cased and split on whitespace.

```python
import textwizard as tw
stats = tw.analyze_text_statistics("a a a b b c d e f g")
print(stats)
```

Output

```
{'entropy': 2.646, 'zipf': {'slope': -0.605, 'r2': 0.838}, 'vocab_gini': 0.229, 'type_token_ratio': 0.7, 'hapax_ratio': 0.5, 'simpson_index': 0.82, 'yule_k': 800.0, 'avg_word_length': 1.0}
```
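The `entropy` value is the Shannon entropy (in bits) of the token frequency distribution; a quick cross-check of the example above:

```python
from collections import Counter
from math import log2

tokens = "a a a b b c d e f g".split()
counts = Counter(tokens)
n = len(tokens)
# H = -sum(p * log2(p)) over token relative frequencies
entropy = -sum((c / n) * log2(c / n) for c in counts.values())
print(round(entropy, 3))  # 2.646 — matches the library's output
```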
## Text similarity

Compute a similarity score between two strings using one of three measures. Returns a float in `[0.0, 1.0]` (`1.0` ≡ identical).
Parameters
- `a`, `b`: strings to compare.
- `method`: `"cosine" | "jaccard" | "levenshtein"` (default `"cosine"`).
Notes
- Tokenization for cosine/jaccard uses lowercase word tokens matched by `\w+` (Unicode letters, digits, underscore).
- Quick guide:
| Method | Best for | Trade-offs |
|---|---|---|
| cosine | “bag of words” overlap incl. short phrases | needs some tokens; bigram TF helps with order |
| jaccard | set overlap (unique words) | ignores frequency; robust to duplicates |
| levenshtein | character-level edits | O(len(a)·len(b)); great for short strings |
Example

```python
import textwizard as tw
s1 = tw.text_similarity("kitten", "sitting", method="levenshtein")
s2 = tw.text_similarity("hello world", "hello brave world", method="jaccard")
s3 = tw.text_similarity("abc def", "abc xyz", method="cosine")
print(s1, s2, s3)
```

Output

```
0.5714285714285714 0.6666666666666666 0.33333333333333337
```
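The levenshtein score behaves like a normalized edit distance: "kitten" → "sitting" needs 3 edits over a maximum length of 7, giving 1 − 3/7 ≈ 0.571, the first value above. A cross-check under that assumed normalization:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

d = levenshtein("kitten", "sitting")               # 3
print(1 - d / max(len("kitten"), len("sitting")))  # 0.5714285714285714
```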
## Beautiful HTML

Pretty-print raw HTML without changing its semantics. Controls indentation, attribute quoting/sorting, whitespace normalization, and optional DOCTYPE insertion.
Parameters

- `html`: raw HTML string.
- `indent`: spaces per indentation level (default `2`).
- `quote_attr_values`: `"always" | "spec" | "legacy"` (default `"spec"`).
- `quote_char`: `"` or `'` (default `"`).
- `use_best_quote_char`: if `True`, auto-pick the quote char that needs fewer escapes.
- `minimize_boolean_attributes`: if `True`, render compact booleans (e.g., `disabled`).
- `use_trailing_solidus`: if `True`, add a trailing slash on void elements (`<br />`).
- `space_before_trailing_solidus`: add a space before that slash when used.
- `escape_lt_in_attrs`: escape `<` and `>` inside attribute values.
- `escape_rcdata`: escape within RCDATA (`<script>`, `<style>`, `<textarea>`).
- `resolve_entities`: prefer named entities when serializing.
- `alphabetical_attributes`: sort attributes alphabetically.
- `strip_whitespace`: trim/collapse whitespace in text nodes.
- `include_doctype`: prepend `<!DOCTYPE html>` if missing.
- `expand_mixed_content`: put each child of mixed-content nodes on its own line.
- `expand_empty_elements`: render empty non-void elements on two lines.

```python
import textwizard as tw
html = """
<body>
<button id='btn1' class="primary" disabled="disabled">
Click <b>me</b>
</button>
<img alt="Logo" src="/static/logo.png">
</body>
"""
pretty = tw.beautiful_html(
html=html,
indent=4,
alphabetical_attributes=True,
minimize_boolean_attributes=True,
quote_attr_values="always",
strip_whitespace=True,
include_doctype=True,
expand_mixed_content=True,
expand_empty_elements=True,
)
print(pretty)
```

Output

```
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<button class="primary" disabled id="btn1">
Click
<b>
me
</b>
</button>
<img alt="Logo" src="/static/logo.png">
</body>
</html>
```

## HTML to Markdown

Parameters

- `html`: raw HTML input.

```python
import textwizard as tw
md = tw.html_to_markdown("<h1>Hello</h1><p>World</p>")
print(md)
```

Output

```
# Hello
World
```

Author: Mattia Rubino
Email: textwizard.dev@gmail.com