
TextWizard


TextWizard is a Python library to extract, clean, and analyze text from PDFs, DOCX, images, CSV, HTML/XML, and more. It includes local OCR (Tesseract), cloud OCR with Azure Document Intelligence, multi-backend NER, language detection, lexical statistics, and HTML utilities.




What is TextWizard?

TextWizard is a Python toolkit for end-to-end text ingestion: it extracts, cleans, and analyzes content from PDFs, Office documents, images, HTML/XML, CSV, and plain text. It unifies local OCR (Tesseract) and Azure Document Intelligence, normalizes noisy markup, and exposes text, tables, and key-value pairs through one consistent API.

It targets production pipelines: deterministic I/O, page selection and hybrid PDF handling, multi-backend NER (spaCy, Stanza), language detection across 160+ languages, compact spell-checking tries, lexical statistics, and HTML utilities (sanitization, pretty-print, HTML→Markdown). The goal is to be a dependable, high-level building block for practical text extraction and cleanup in Python.


Installation

Requires Python 3.9+.

pip install textwizard

Optional extras:

  • Azure OCR: pip install "textwizard[azure]"
  • NER: pip install "textwizard[ner]"
  • Everything: pip install "textwizard[all]"

For OCR, Tesseract must be installed on your system.
For spaCy NER, download a model first, e.g. python -m spacy download en_core_web_sm.
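
To sanity-check the optional dependencies, a minimal sketch (assumes the [ner] extra and the en_core_web_sm model; shutil.which only confirms the Tesseract binary is on PATH):

import shutil
import spacy

# Local OCR requires the Tesseract binary on PATH.
assert shutil.which("tesseract"), "Tesseract not found on PATH"

# spacy.load raises OSError if the model has not been downloaded yet.
nlp = spacy.load("en_core_web_sm")
print("OCR and NER prerequisites are available")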


Quick start

import textwizard as tw

text = tw.extract_text("example.pdf")
print(text)

API overview

Method Purpose
extract_text Local text extraction with optional Tesseract OCR
extract_text_azure Cloud extraction via Azure (text, tables, key-value)
clean_html High-level HTML cleaning with semantic flags
clean_xml XML cleanup and normalization
clean_csv CSV cleanup with configurable dialect
extract_entities NER via spaCy / Stanza / spaCy-Stanza
correctness_text Spell checking
lang_detect Language detection
analyze_text_statistics Lexical metrics (entropy, Zipf, Gini, …)
text_similarity Similarity: cosine, jaccard, levenshtein
beautiful_html Pretty-print HTML
html_to_markdown Convert HTML → Markdown

Text extraction

Parameters

  • input_data: str | bytes | Path
  • extension: The file extension, required only if input_data is bytes.
  • pages: Page/sheet selection.
    • Paged (PDF, DOCX, TIFF): 1, "1-3", [1, 3, "5-8"]
    • Excel (XLSX/XLS): sheet index (int), name (str), or mixed list
  • ocr: Enables OCR using Tesseract. Applies to PDF/DOCX and image-based files.
  • language_ocr: Language code for OCR. Defaults to 'eng'.

Examples

Basic:

import textwizard as tw

txt = tw.extract_text("docs/report.pdf")

From bytes:

from pathlib import Path
import textwizard as tw

raw = Path("img.png").read_bytes()
txt_img = tw.extract_text(raw, extension="png")

Paged selection and OCR:

import textwizard as tw

sel = tw.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
ocr_txt = tw.extract_text("scan.tiff", ocr=True, language_ocr="ita")
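
Excel sheets use the same pages parameter; a sketch based on the parameter notes above (file and sheet names are hypothetical):

import textwizard as tw

# Select sheets by index (int), by name (str), or as a mixed list.
one_sheet = tw.extract_text("book.xlsx", pages=1)
by_name = tw.extract_text("book.xlsx", pages="Summary")
mixed = tw.extract_text("book.xlsx", pages=[1, "Summary"])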

Supported formats

Format OCR Option
PDF Optional
DOC No
DOCX Optional
XLSX No
XLS No
TXT No
CSV No
JSON No
HTML No
HTM No
TIF Default
TIFF Default
JPG Default
JPEG Default
PNG Default
GIF Default

Azure OCR

Parameters

  • input_data: str | bytes | Path
  • extension: File extension when bytes are passed.
  • language_ocr: OCR language code (ISO-639).
  • pages: Page selection (int, "1,3,5-7", or list).
  • azure_endpoint: Azure Document Intelligence endpoint URL.
  • azure_key: Azure API key.
  • azure_model_id: "prebuilt-read" (text only) or "prebuilt-layout" (text + tables + key-value).
  • hybrid: PDF only. If True, extracts native text via PyMuPDF and runs OCR only on embedded images.

Example

import textwizard as tw

res = tw.extract_text_azure(
    "invoice.pdf",
    language_ocr="ita",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
    azure_model_id="prebuilt-layout",
    hybrid=True,
)

print(res.text)
print(res.pretty_tables[:1])
print(res.key_value)
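
Credentials are best kept out of source code. A minimal sketch reading them from environment variables (the variable names are arbitrary, not part of the API):

import os
import textwizard as tw

res = tw.extract_text_azure(
    "invoice.pdf",
    azure_endpoint=os.environ["AZURE_DI_ENDPOINT"],
    azure_key=os.environ["AZURE_DI_KEY"],
    azure_model_id="prebuilt-read",  # text only; use "prebuilt-layout" for tables/key-value
)
print(res.text)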

HTML cleaning

Parameters

  • text: str HTML input.
  • remove_script: Remove executable tags (<script>, <template>).
  • remove_metadata_tags: Remove metadata (<link>, <meta>, <base>, <noscript>, <style>, <title>).
  • remove_flow_tags: Remove flow content (<address>, <div>, <input>, …).
  • remove_sectioning_tags: Remove sectioning content (<article>, <aside>, <nav>, …).
  • remove_heading_tags: Remove heading tags (<h1>–<h6>).
  • remove_phrasing_tags: Remove phrasing content (<audio>, <code>, <textarea>, …).
  • remove_embedded_tags: Remove embedded content (<iframe>, <embed>, <img>).
  • remove_interactive_tags: Remove interactive content (<button>, <input>, <select>).
  • remove_palpable: Remove palpable elements (<address>, <math>, <table>, …).
  • remove_doctype: Remove <!DOCTYPE html>.
  • remove_comments: Remove HTML comments.
  • remove_specific_attributes: Remove specific attributes (supports wildcards).
  • remove_specific_tags: Remove specific tags (supports wildcards).
  • remove_empty_tags: Drop empty tags.
  • remove_content_tags: Remove content of given tags.
  • remove_tags_and_contents: Remove tags and their contents.

Behavior

There are three modes with different return types:

Mode How to trigger Output Description
A – text-only No parameters provided (all None) str (plain text) Extracts text, skips script-supporting tags, inserts safe spaces.
B – structural clean At least one flag is True str (serialized HTML) Removes/unwraps per flags. Supports wildcard tag/attribute removal, content stripping, empty-tag pruning.
C – text with preservation Parameters present and all False str (text + preserved markup) Extracts text but preserves groups explicitly set to False (and comments/doctype if set False).

Examples

A) Text-only (no params)

import textwizard as tw
txt = tw.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt)  # -> "Hello"

B) Structural clean (HTML out)

import textwizard as tw

html = """
<html><head><title>x</title><script>evil()</script></head>
<body>
  <article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article><!-- comment -->
</body></html>
"""
out = tw.clean_html(
    html,
    remove_script=True,
    remove_metadata_tags=True,
    remove_embedded_tags=True,
    remove_specific_attributes=["id", "on*"],
    remove_empty_tags=True,
    remove_comments=True,
    remove_doctype=True,
)
print(out)

Output

<html>
<body>
  <article><h1>Title</h1><p>hello</p></article>

</body></html>

C) Text with preservation (False flags)

import textwizard as tw

html = "<html><body><article><h1>T</h1><p>Body</p><!-- c --></article></body></html>"
txt = tw.clean_html(
    html,
    remove_sectioning_tags=False,   # keep <article> in output
    remove_heading_tags=False,      # keep <h1> in output
    remove_comments=False,          # keep comments
)
print(txt)

Output

<article><h1>T</h1>Body<!-- c --></article>

Wildcard selectors

import textwizard as tw
html = '<div id="hero" data-track="x" onclick="h()"><img src="a.png"></div>'
out = tw.clean_html(
    html,
    remove_specific_attributes=["id", "data-*", "on*"],
    remove_specific_tags=["im_"],
)
print(out) 

Output

<html><head></head><body><div></div></body></html>

XML cleaning

Parameters

  • text: str | bytes XML input.
  • remove_comments: Remove <!-- ... -->.
  • remove_processing_instructions: Remove <? ... ?>.
  • remove_cdata_sections: Unwrap <![CDATA[...]]>.
  • remove_empty_tags: Drop empty elements.
  • remove_namespaces: Drop prefixes and xmlns.
  • remove_duplicate_siblings: Keep only the first identical sibling.
  • collapse_whitespace: Collapse runs of whitespace.
  • remove_specific_tags: Delete tags (supports wildcards).
  • remove_content_tags: Keep tag but delete inner content.
  • remove_attributes: Delete attributes (supports wildcards).
  • remove_declaration: Drop <?xml ...?>.
  • normalize_entities: Convert entities, e.g. &amp; → &.

Examples

import textwizard as tw

xml = "<root xmlns='ns'><a/><b>ok</b><!-- x --></root>"
fixed = tw.clean_xml(
    xml,
    remove_namespaces=True,
    remove_empty_tags=True,
    remove_comments=True,
    normalize_entities=True,
)
print(fixed)

Output

<root><b>ok</b></root>
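
Wildcards work for tags and attributes as in clean_html; a sketch using the parameters listed above (input is illustrative):

import textwizard as tw

xml = '<cfg><secret token="abc">hide me</secret><opt-a/><opt-b/><name>svc</name></cfg>'
out = tw.clean_xml(
    xml,
    remove_content_tags=["secret"],  # keep <secret>, drop its inner text
    remove_specific_tags=["opt-*"],  # wildcard tag removal
    remove_attributes=["tok*"],      # wildcard attribute removal
)
print(out)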

CSV cleaning

Behavior

  • Columns can be removed by name (with header) or 0-based index.
  • remove_row_index uses 0-based indices over the parsed rows. If a header exists, it is row 0.
  • remove_values blanks matching cells. Supports wildcards * and ?.
  • remove_empty_columns / remove_empty_rows run after other edits.
  • Output is serialized with the provided dialect (delimiter, quotechar, quoting, etc.).

Parameters

  • text: Raw CSV string.
  • delimiter, quotechar, escapechar, doublequote, skipinitialspace, lineterminator, quoting.
  • remove_columns: Name or 0-based index (or list).
  • remove_row_index: 0-based index (or list).
  • remove_values: Literal values or wildcard patterns to blank out.
  • remove_duplicates_rows: Remove duplicate rows.
  • trim_whitespace: Strip whitespace inside fields.
  • remove_empty_columns: Drop empty columns.
  • remove_empty_rows: Drop empty rows.

Example

import textwizard as tw

csv_data = """id,name,age,city,salary
1,John,30,New York,50000
2,Jane,25,,40000
3,,35,Los Angeles,60000
4,Mark,45,,70000
5,Sarah,40,New York,
1,John,30,New York,50000
"""
out = tw.clean_csv(
    csv_data,
    delimiter=",",
    remove_columns=["id", "salary"],
    remove_values=["John", "50000"],
    trim_whitespace=True,
    remove_empty_columns=True,
    remove_empty_rows=True,
    remove_duplicates_rows=True,
)
print(out)

Output

name,age,city
,30,New York
Jane,25,
,35,Los Angeles
Mark,45,
Sarah,40,New York
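
The same dialect options drive parsing and re-serialization. A sketch for semicolon-delimited input (data is illustrative):

import textwizard as tw

data = "a;b;c\n1;2;3\n1;2;3\n4;5;6\n"
out = tw.clean_csv(
    data,
    delimiter=";",
    remove_duplicates_rows=True,
)
print(out)  # duplicate data row dropped; output keeps the ";" delimiter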

Named-Entity Recognition (NER)

Parameters

  • text: str input.
  • engine: 'spacy' | 'stanza' | 'spacy_stanza' (default 'spacy').
  • model: spaCy model name or path (spaCy engine only).
  • language: ISO code for Stanza engines.
  • device: 'auto' | 'cpu' | 'gpu' (default 'auto').

Examples

import textwizard as tw

sample = (
    "Alex Rivera traveled to Springfield to meet the research team at Northstar Analytics on 14 March 2025. "
    "The next day, he signed a pilot agreement with Horizon Bank and gave a talk at the University of Westland at 10:30 AM."
)
res = tw.extract_entities(sample)
print([e.text for e in res.entities["PERSON"]])
print([e.text for e in res.entities["GPE"]])
print([e.text for e in res.entities["ORG"]])

Output

['Alex Rivera']
['Springfield']
['Northstar Analytics', 'Horizon Bank', 'the University of Westland']
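
Switching backends is a matter of the engine parameter. A sketch using Stanza (requires the [ner] extra; Stanza models may be downloaded on first use):

import textwizard as tw

res = tw.extract_entities(
    "Maria Rossi lavora a Milano per Acme S.p.A.",
    engine="stanza",
    language="it",  # ISO code, used by the Stanza engines
    device="cpu",
)
print(res.entities)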

Spell checking

Parameters

  • text: String to analyze.
  • language: ISO code.
  • dict_dir: Folder with *.marisa.zst dictionaries. If None, uses the user data directory and downloads dictionaries on demand.
  • use_mmap: True to memory-map the uncompressed trie.

Example

import textwizard as tw

check = tw.correctness_text("Thiss sentense has a typo.", language="en")
print(check)

Output

{'errors_count': 2, 'errors': ['thiss', 'sentense']}
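
For offline or memory-constrained deployments, the dictionary location and mmap behavior can be pinned; a sketch (the path is hypothetical):

from pathlib import Path
import textwizard as tw

check = tw.correctness_text(
    "Thiss sentense has a typo.",
    language="en",
    dict_dir=Path("/opt/textwizard/dicts"),  # folder with pre-downloaded *.marisa.zst files
    use_mmap=True,                           # memory-map the uncompressed trie
)
print(check["errors"])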

Language detection

Language identification via character n-gram profiles. Candidates are first gated by priors and linguistic cues, then a probability is estimated for each language. Supports 161 languages. Returns a top-1 ISO code or a probability-ordered list.

Parameters

  • text: Input string (Unicode).
  • top_k: Number of candidates to return (default 3).
  • profiles_dir: Override the bundled profiles directory.
  • use_mmap: If True, memory-map the profile tries (lower RAM; first access may be slightly slower).
  • return_top1: If True, return only the best language code; otherwise a list of (lang, prob).

Examples

Top-1 (single code)

import textwizard as tw

text = "Ciao, come stai oggi?"
lang = tw.lang_detect(text, return_top1=True)
print(lang) 

Output

it

Top-k distribution

import textwizard as tw

text = "The quick brown fox jumps over the lazy dog."
langs = tw.lang_detect(text, top_k=5, return_top1=False)
print(langs) 

Output

[('en', 0.9999376335362183), ('mg', 4.719212057614953e-05), ('fy', 1.4727973350205069e-05), ('rm', 2.8718519851832537e-07), ('la', 1.5918465665694727e-07)]

Batch example

import textwizard as tw

tests = [
    "これは日本語のテスト文です。",
    "Alex parle un peu français, aber nicht so viel.",
    "¿Dónde está la estación de tren?",
]
for s in tests:
    print("TOP1:", tw.lang_detect(s, return_top1=True))

Output

TOP1: ja
TOP1: fr
TOP1: es

Custom profiles & mmap

from pathlib import Path
import textwizard as tw

langs = tw.lang_detect(
    "Buongiorno a tutti!",
    profiles_dir=Path("/opt/textwizard/profiles"),  # custom profiles
    use_mmap=True,                                   # lower RAM
    top_k=3,
)
print(langs)

Text statistics

Metrics

Computes: entropy, zipf.slope, zipf.r2, vocab_gini, type_token_ratio, hapax_ratio, simpson_index, yule_k, avg_word_length.
Tokens are lower-cased and split on whitespace.

Example

import textwizard as tw

stats = tw.analyze_text_statistics("a a a b b c d e f g")
print(stats)

Output

{'entropy': 2.646, 'zipf': {'slope': -0.605, 'r2': 0.838}, 'vocab_gini': 0.229, 'type_token_ratio': 0.7, 'hapax_ratio': 0.5, 'simpson_index': 0.82, 'yule_k': 800.0, 'avg_word_length': 1.0}
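
The result is a plain dict, so individual metrics are read directly (shape as in the output above):

import textwizard as tw

stats = tw.analyze_text_statistics("a a a b b c d e f g")
print(stats["entropy"])        # 2.646
print(stats["zipf"]["slope"])  # -0.605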

Text similarity

Compute a similarity score between two strings using one of three measures.
Returns a float in [0.0, 1.0] (1.0 ≡ identical).

Parameters

  • a, b: Strings to compare.
  • method: "cosine" | "jaccard" | "levenshtein" (default "cosine").

Notes

  • Tokenization for cosine/jaccard uses lowercase word tokens matched by \w+ (Unicode letters, digits, underscore).
  • Quick guide:
Method Best for Trade-offs
cosine “bag of words” overlap incl. short phrases needs some tokens; bigram TF helps with order
jaccard set overlap (unique words) ignores frequency; robust to duplicates
levenshtein character-level edits O(len(a)·len(b)); great for short strings

Example

import textwizard as tw

s1 = tw.text_similarity("kitten", "sitting", method="levenshtein")
s2 = tw.text_similarity("hello world", "hello brave world", method="jaccard")
s3 = tw.text_similarity("abc def", "abc xyz", method="cosine")
print(s1, s2, s3)

Output

0.5714285714285714
0.6666666666666666
0.33333333333333337

Beautiful HTML

Pretty-print raw HTML without changing its semantics. Controls indentation, attribute quoting/sorting, whitespace normalization, and optional DOCTYPE insertion.

Parameters

  • html: Raw HTML string.
  • indent: Spaces per indentation level (default 2).
  • quote_attr_values: "always" | "spec" | "legacy" (default "spec").
  • quote_char: " or ' (default ").
  • use_best_quote_char: If True, auto-pick the quote char that needs fewer escapes.
  • minimize_boolean_attributes: If True, render compact booleans (e.g., disabled).
  • use_trailing_solidus: If True, add a trailing slash on void elements (<br />).
  • space_before_trailing_solidus: Add a space before that slash when used.
  • escape_lt_in_attrs: Escape < and > inside attribute values.
  • escape_rcdata: Escape within RCData (<script>, <style>, <textarea>).
  • resolve_entities: Prefer named entities when serializing.
  • alphabetical_attributes: Sort attributes alphabetically.
  • strip_whitespace: Trim/collapse whitespace in text nodes.
  • include_doctype: Prepend <!DOCTYPE html> if missing.
  • expand_mixed_content: Put each child of mixed-content nodes on its own line.
  • expand_empty_elements: Render empty non-void elements on two lines.

Example

import textwizard as tw

html = """
<body>
  <button id='btn1' class="primary" disabled="disabled">
    Click   <b>me</b>
  </button>
  <img alt="Logo" src="/static/logo.png">
</body>
"""

pretty = tw.beautiful_html(
    html=html,
    indent=4,
    alphabetical_attributes=True,
    minimize_boolean_attributes=True,
    quote_attr_values="always",
    strip_whitespace=True,
    include_doctype=True,
    expand_mixed_content=True,
    expand_empty_elements=True,
)

print(pretty)

Output

<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>

        <button class="primary" disabled id="btn1">
            Click
            <b>
                me
            </b>

        </button>

        <img alt="Logo" src="/static/logo.png">

    </body>
</html>

HTML to Markdown

Parameters

  • html: Raw HTML input.

Example

import textwizard as tw

md = tw.html_to_markdown("<h1>Hello</h1><p>World</p>")
print(md)

Output

# Hello

World

License

AGPL-3.0-or-later.

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com
