Skip to content

textwizard-dev/WizardSpell

Repository files navigation

WizardSpell Banner


WizardSpell

PyPI - Version PyPI - Downloads/month License

WizardSpell is a Python library for Dictionary-based spell checking with Unicode-aware tokenization and light text normalization. Supports 62 languages via compressed Marisa-Trie dictionaries. Returns a compact report with the total number of misspellings and the list of offending tokens.


Contents


Installation

Requires Python 3.9+.

pip install wizardspell

Quick start

import wizardspell as ws

res = ws.spell_checking("Thiss sentense has a typo.", language="en")
print(res)

Spell Checking

Behavior

  • Normalizes common Unicode quirks (e.g., smart quotes, zero-width joiners).
  • Ignores numbers and leading/trailing punctuation when deciding correctness.
  • Treats ' / variants as equivalent.
  • Looks up each token against the selected language dictionary.

Parameters

Parameter Description
text (str) Raw input text.
language (str, default "en") ISO-639 code.
dict_dir (str | Path | None) Directory containing one or more *.marisa.zst (or decompressed *.marisa) dictionaries. If None: uses a per-user cache directory and auto-downloads the required dictionary if missing.
use_mmap (bool, default False) True → memory-map the on-disk .marisa file (lowest RAM; fastest startup). False → load the entire trie into RAM (higher RAM; highest steady-state throughput).

Return value

dict with:

  • errors_countint total misspellings
  • errorslist[str] of misspelled tokens (normalized/case-folded)
import wizardspell as ws

check = ws.spell_checking("Thiss sentense has a typo.", language="en")
print(check)

Output

{"errors_count": 2, "errors": ["thiss", "sentense"]}

Examples

Basic

import wizardspell as ws

res = ws.spell_checking("Thiss sentense has a typo.", language="en")
print(res)

Output

{"errors_count": 2, "errors": ["thiss", "sentense"]}

Italian example

import wizardspell as ws
print(ws.spell_checking("Queso è un tes , di preva.", language="it"))

Output

{"errors_count": 3, "errors": ["queso", "tes", "preva."]}

Custom dictionary directory & mmap

import wizardspell as ws
from pathlib import Path

res = ws.spell_checking(
    "Coloar centre thetre",
    language="en",
    dict_dir=Path("~/WizardSpell_dicts"),
    use_mmap=True,
)
print(res)

Output

{"errors_count": 2, "errors": ["coloar", "thetre"]}

Operational notes

  • Cache location (when dict_dir=None): a per-user data directory is used. You can override it via the first existing of: WIZARDSPELL_DATA_DIR / WIZARDSPELL_DICT_DIR / WIZARDSPELL_HOME (environment variables).
  • Auto-download: when a dictionary is missing and dict_dir is not set, WizardSpell downloads the compressed *.marisa.zst once and reuses it subsequently.
  • File formats:
    • *.marisa.zst files are decompressed on the fly (into memory) or to an adjacent *.marisa file when use_mmap=True.
    • If you already have an uncompressed *.marisa file in dict_dir, it is used directly.
  • Performance:
    • use_mmap=True → minimal RAM, fastest startup; excellent for large dictionaries or constrained environments.
    • use_mmap=False → maximal throughput once loaded; best when RAM is plentiful.
  • Chinese requires jieba; all other languages work out-of-the-box.
  • Output tokens in errors are normalized/case-folded; they may differ in casing from the original text.

Available dictionaries

Code Language Code Language
af Afrikaans an Aragonese
ar Arabic as Assamese
be Belarusian bg Bulgarian
bn Bengali bo Tibetan
br Breton bs Bosnian
ca Catalan cs Czech
da Danish de German
el Greek en English
eo Esperanto es Spanish
fa Persian fr French
gd Scottish Gaelic gn Guarani
gu Gujarati (gu_IN) he Hebrew
hi Hindi hr Croatian
id Indonesian is Icelandic
it Italian ja Japanese
kmr Kurmanji Kurdish kn Kannada
ku Central Kurdish lo Lao
lt Lithuanian lv Latvian
mr Marathi nb Norwegian Bokmål
ne Nepali nl Dutch
nn Norwegian Nynorsk oc Occitan
or Odia pa Punjabi
pl Polish pt Portuguese (EU)
ro Romanian ru Russian
sa Sanskrit si Sinhala
sk Slovak sl Slovenian
sq Albanian sr Serbian
sv Swedish sw Swahili
ta Tamil te Telugu
th Thai tr Turkish
uk Ukrainian vi Vietnamese

License

AGPL-3.0-or-later

Resources

Contact & Author

Author: Mattia Rubino
Email: textwizard.dev@gmail.com

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages