### Introduction

Persian text preprocessing poses distinct challenges for natural language processing (NLP): inconsistent orthography, pseudo-spaces, complex morphology, and a shortage of standardized tools. Handling these issues properly is essential for building accurate and reliable language models and downstream NLP applications.

**[Shekar](https://github.com/amirivojdan/shekar)** is an open-source Python library designed to simplify and enhance Persian text preprocessing. It offers a modular and efficient pipeline for a variety of tasks including normalization, punctuation and stopword removal, stemming and lemmatization, spell correction, and word embedding generation.

This notebook demonstrates practical examples of how to use Shekar for preprocessing Persian text. By the end, you’ll be able to integrate Shekar into your own NLP workflows with ease and clarity.

To install Shekar, run the following cell:

In [None]:
!pip install shekar -U



### Preprocessing with Shekar

The `shekar.preprocessing` module provides a rich set of building blocks for cleaning, normalizing, and transforming Persian text. These classes form the foundation of text preprocessing workflows and can be used independently or combined in a `Pipeline`.

Here are some of the key text transformers available in the module:

- **`SpacingStandardizer`**: Removes extra spaces and adjusts spacing around punctuation.
- **`AlphabetNormalizer`**: Converts Arabic characters to standard Persian forms.
- **`NumericNormalizer`**: Converts English and Arabic numerals into Persian digits.
- **`PunctuationNormalizer`**: Standardizes punctuation symbols.
- **`EmojiRemover`**: Removes emojis.
- **`EmailMasker` / `URLMasker`**: Mask or remove emails and URLs.
- **`DiacriticsRemover`**: Removes Persian/Arabic diacritics.
- **`PunctuationRemover`**: Removes all punctuation characters.
- **`RedundantCharacterRemover`**: Shrinks repeated characters like "سسسلام".
- **`ArabicUnicodeNormalizer`**: Converts Arabic presentation forms (e.g., ﷽) into Persian equivalents.
- **`StopWordRemover`**: Removes frequent Persian stopwords.
- **`NonPersianRemover`**: Removes all non-Persian content (optionally keeps English).
- **`HTMLTagRemover`**: Cleans HTML tags but retains content.
- **`SpacingNormalizer`**: Standardizes spacing, for example by fixing misplaced spaces around quotation marks and punctuation.
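
To make a couple of these descriptions concrete, here is a minimal pure-Python sketch of the kind of character mapping that alphabet and numeral normalization involve. It does not use Shekar and is not its implementation: the mapping tables and the `normalize_chars` helper are illustrative assumptions that cover only two letters and the digits.

```python
# Illustrative only: a tiny stand-in for alphabet and numeral normalization.
# The tables are assumptions for this sketch, not Shekar's actual rules.
ARABIC_TO_PERSIAN = {
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
}

DIGIT_MAP = {}
for i in range(10):
    persian_digit = chr(0x06F0 + i)          # Persian (Extended Arabic-Indic) digit
    DIGIT_MAP[str(i)] = persian_digit        # ASCII digit
    DIGIT_MAP[chr(0x0660 + i)] = persian_digit  # Arabic-Indic digit

# One translation table handles both letters and digits in a single pass.
TABLE = str.maketrans({**ARABIC_TO_PERSIAN, **DIGIT_MAP})


def normalize_chars(text: str) -> str:
    """Map the characters covered by TABLE and leave everything else untouched."""
    return text.translate(TABLE)
```

A real normalizer covers far more characters; the point here is only that each step is a deterministic string-to-string mapping.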

##### Example 1: Remove Emojis and Punctuation

In [2]:
from shekar.preprocessing import EmojiRemover, PunctuationRemover

emoji_remover = EmojiRemover()
punct_remover = PunctuationRemover()

text = "ایران سرای من است! 🌍😊"
text = emoji_remover.fit_transform(text)
text = punct_remover.fit_transform(text)
text = text.strip()

print(text)

ایران سرای من است


In [3]:
from shekar.preprocessing import SpacingNormalizer

punct_spacing_standardizer = SpacingNormalizer()
text = "شرکت « گوگل »اعلام کرد ."
print("before standardization:", text)
text = punct_spacing_standardizer.fit_transform(text).strip()
print("after standardization:", text)

before standardization: شرکت « گوگل »اعلام کرد .
after standardization: شرکت «گوگل» اعلام کرد.


In Shekar, all preprocessing transformers implement both `fit_transform()` and `__call__()`, so you can use them like functions: calling a transformer directly is equivalent to calling `.fit_transform()`.

Example 1 could therefore be rewritten as follows:

In [4]:
from shekar.preprocessing import EmojiRemover, PunctuationRemover

emoji_remover = EmojiRemover()
punct_remover = PunctuationRemover()

text = "ایران سرای من است! 🌍😊"
text = punct_remover(emoji_remover(text)).strip()

print(text)

ایران سرای من است


This version is more concise and produces the exact same output!
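
For readers curious how a callable transformer can be built, the following sketch shows the general pattern in plain Python. `TinyTransformer` is a hypothetical stand-in and not Shekar's base class; it only illustrates `__call__` delegating to `fit_transform()`.

```python
class TinyTransformer:
    """Hypothetical stand-in showing the callable pattern (not part of Shekar)."""

    def fit(self, text: str) -> "TinyTransformer":
        # A stateless rule-based step has nothing to learn.
        return self

    def transform(self, text: str) -> str:
        # The actual rule: drop exclamation marks.
        return text.replace("!", "")

    def fit_transform(self, text: str) -> str:
        return self.fit(text).transform(text)

    def __call__(self, text: str) -> str:
        # Calling the object is a shortcut for fit_transform().
        return self.fit_transform(text)


remover = TinyTransformer()
same = remover("hello!") == remover.fit_transform("hello!")
```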

##### Example 2: Normalize Persian Characters

In [5]:
from shekar.preprocessing import AlphabetNormalizer

alphabet_normalizer = AlphabetNormalizer()
text = "نشان‌دهندة قائدة"
normalized = alphabet_normalizer(text)
print(normalized)

نشان‌دهنده قائده


##### Example 3: Remove Stopwords

In [6]:
from shekar.preprocessing import StopWordRemover

stopword_remover = StopWordRemover()
text = "این یک جملهٔ نمونه است"
cleaned = stopword_remover(text)
print(cleaned)

جملهٔ نمونه


#### Creating Custom Transformers

In Shekar, you can define your own text transformation logic by subclassing `BaseTextTransform`. This lets you integrate any custom rule-based or pattern-based transformation into the Shekar pipeline system.

All you need to do is implement the `_function(self, text: str) -> str` method, which takes a string and returns the transformed version.

The `_function()` method is invoked automatically when you call the transformer, so in most cases defining it is sufficient. If you need more control over the transformation logic (such as managing state, performing setup, or handling input types differently), you can also override `__init__()`, `fit()`, `transform()`, and `fit_transform()` directly.

##### Example: WhitespaceStripper

This custom transformer removes leading and trailing whitespace from input strings.

In [7]:
from shekar.base import BaseTextTransform


class WhitespaceStripper(BaseTextTransform):
    def _function(self, text: str) -> str:
        return text.strip()

You can now use it like any other Shekar component:

In [8]:
text = "   سلام دنیا!   "
whitespace_stripper = WhitespaceStripper()

print(whitespace_stripper(text))

سلام دنیا!
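
The body of `_function` can be any string-to-string rule. As a rough sketch, the plain function below collapses runs of three or more identical characters with a regular expression. The name `collapse_repeats` and the threshold of three are assumptions for illustration, and the function is written standalone so that it runs without Shekar.

```python
import re

# A character followed by two or more copies of itself (three or more in total).
_REPEATS = re.compile(r"(.)\1{2,}")


def collapse_repeats(text: str) -> str:
    """Replace each run of 3+ identical characters with a single character."""
    return _REPEATS.sub(r"\1", text)
```

Wrapping this rule in a `BaseTextTransform` subclass would follow the same shape as `WhitespaceStripper` above, with `_function` returning `collapse_repeats(text)`.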


#### Pipelines: Chaining Text Transformations

Shekar's `Pipeline` class allows you to chain multiple text preprocessing steps together into a seamless and reusable workflow. Inspired by Unix-style piping, Shekar also supports the `|` operator for combining transformers, making your code not only more readable but also expressive and modular.

##### Why Pipelines?

Text preprocessing often involves applying several transformations in sequence. Instead of writing nested function calls or multiple intermediate steps, Shekar’s `Pipeline` lets you define a clean and testable chain of operations.

For example, instead of writing:

In [9]:
text = "ایران سرای من است! 🌍😊"
text = whitespace_stripper(punct_remover(emoji_remover(text)))

print(text)

ایران سرای من است


The same sequence of transformations can be constructed using the `|` operator, creating a concise and expressive pipeline.

In [10]:
text = "ایران سرای من است! 🌍😊"
pipeline = EmojiRemover() | PunctuationRemover() | WhitespaceStripper()
output = pipeline(text)
print(output)

ایران سرای من است


This approach clearly shows the order of transformations: first remove emojis, then punctuation, and finally trim whitespace. It reads naturally and makes the preprocessing flow easy to understand at a glance.
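
In Python, `|` chaining of this kind is typically enabled by the `__or__` special method. The sketch below shows one way such an operator could work. `Step` and `Chain` are hypothetical classes written for this illustration, and Shekar's own implementation may differ.

```python
from typing import Callable, List


class Step:
    """Hypothetical wrapper that makes a str -> str function pipeable."""

    def __init__(self, func: Callable[[str], str]):
        self.func = func

    def __call__(self, text: str) -> str:
        return self.func(text)

    def __or__(self, other: "Step") -> "Chain":
        # step_a | step_b builds a two-step chain.
        return Chain([self, other])


class Chain:
    """Hypothetical container that applies its steps from left to right."""

    def __init__(self, steps: List[Step]):
        self.steps = steps

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text

    def __or__(self, other: Step) -> "Chain":
        # chain | step returns a new, longer chain and leaves this one unchanged.
        return Chain(self.steps + [other])


chain = Step(lambda s: s.replace("!", "")) | Step(str.strip) | Step(str.upper)
print(chain("  hello!  "))  # HELLO
```

Because each `|` returns a new object, the left-to-right order in the expression is exactly the order in which the steps run.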

The same transformation chain can also be written explicitly using the `Pipeline` class:

In [11]:
from shekar import Pipeline
from shekar.preprocessing import EmojiRemover, PunctuationRemover

pipeline = Pipeline(
    [
        ("emoji", EmojiRemover()),
        ("punct", PunctuationRemover()),
        ("strip", WhitespaceStripper()),
    ]
)

text = "ایران سرای من است! 🌍😊"
output = pipeline(text)
print(output)

ایران سرای من است


##### Batch Processing with Pipelines

Note that Pipelines also support batch processing. You can pass a list (or any iterable) of strings to the pipeline, and it will apply the transformations to each item in sequence.

In [12]:
texts = ["درود! 🌟", "چطوری؟! 😄"]
cleaned_texts = pipeline(texts)
cleaned_texts

<generator object Pipeline.fit_transform.<locals>.generator at 0x0000029271C58200>

Keep in mind that the result is a generator, not a list. This makes the pipeline more memory-efficient, especially when processing large datasets. You can convert the output to a list if needed:

In [13]:
print(list(cleaned_texts))

['درود', 'چطوری']
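
One consequence of getting a generator back is that it can be consumed only once. The sketch below uses a plain Python generator rather than Shekar to show this behavior; `clean_all` is a hypothetical helper.

```python
def clean_all(texts):
    """Lazily yield a stripped copy of each input string."""
    for text in texts:
        yield text.strip()


cleaned = clean_all(["  a ", " b"])
first_pass = list(cleaned)   # consumes the generator
second_pass = list(cleaned)  # nothing left to yield

print(first_pass)   # ['a', 'b']
print(second_pass)  # []
```

If the results are needed more than once, store them in a list or call the pipeline again on the original inputs.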



##### Using Pipelines as Decorators
You can apply a pipeline to specific arguments in a function using the `.on_args()` method:

In [15]:
@pipeline.on_args(["first_name", "last_name"])
def process(first_name: str, last_name: str) -> str:
    return f"{first_name} {last_name}"


processed_texts = process(first_name="🌟علی", last_name="!احمدی")
print(processed_texts)

علی احمدی
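
A decorator that transforms selected arguments can be sketched with the standard library alone. The helper below, `apply_on_args`, is a hypothetical illustration of the idea and not Shekar's `.on_args()` implementation. It binds the call to the function signature and rewrites only the named arguments.

```python
import functools
import inspect


def apply_on_args(transform, arg_names):
    """Hypothetical helper: run `transform` on the named arguments of a function."""

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Resolve positional and keyword arguments to parameter names.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name in arg_names:
                if name in bound.arguments:
                    bound.arguments[name] = transform(bound.arguments[name])
            return func(*bound.args, **bound.kwargs)

        return wrapper

    return decorator


@apply_on_args(str.strip, ["first_name", "last_name"])
def full_name(first_name: str, last_name: str) -> str:
    return f"{first_name} {last_name}"
```

Binding through `inspect.signature` means the decorator works whether the caller passes the arguments positionally or by keyword.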


##### Summary

- Pipelines let you chain transformations cleanly.
- You can build them explicitly or using the `|` operator.
- Pipelines accept single strings and batches of strings, and can be applied to function arguments as decorators.
- The result is more modular, testable, and elegant preprocessing code.

In [16]:
from shekar import SentenceTokenizer

text = "هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم."
sentence_tokenizer = SentenceTokenizer()
sentences = sentence_tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)

هدف ما کمک به یکدیگر است!
ما می‌توانیم با هم کار کنیم.


In [17]:
from shekar import WordTokenizer

tokenizer = WordTokenizer()

text = "چه سیب‌های قشنگی! حیات نشئهٔ تنهایی است."
tokens = tokenizer.tokenize(text)
print(tokens)

<generator object WordTokenizer._function.<locals>.<genexpr> at 0x0000029271C58970>

As the output shows, `tokenize()` returns a generator here. Wrap it in `list()` to see the tokens themselves.


In [21]:
from shekar.transforms import Flatten

flatten = Flatten()
text = [["سلام", "دنیا"], ["این", "یک", "جمله"]]
list(flatten(text))


['سلام', 'دنیا', 'این', 'یک', 'جمله']
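
For comparison, one level of flattening can be expressed with the standard library's `itertools.chain.from_iterable`. This is a sketch of the general idea, not a description of how `Flatten` is implemented, and `flatten_once` is a hypothetical name.

```python
from itertools import chain


def flatten_once(nested):
    """Lazily flatten one level of nesting, e.g. a list of token lists."""
    return chain.from_iterable(nested)


tokens_per_sentence = [["a", "b"], ["c", "d", "e"]]
flat = list(flatten_once(tokens_per_sentence))
print(flat)  # ['a', 'b', 'c', 'd', 'e']
```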

In [19]:
text = "هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم."

pipeline = SentenceTokenizer() | WordTokenizer() | Flatten()
output = pipeline(text)
print(list(output))

['هدف', 'ما', 'کمک', 'به', 'یکدیگر', 'است', '!', 'ما', 'می\u200cتوانیم', 'با', 'هم', 'کار', 'کنیم', '.']
