
# Natural Language Processing Homework 2  

Sharif University of Technology


### 1. Proper Correction of Ezafe Usage

**Objective:**  
Develop a system that identifies and corrects common errors in the usage of the Persian Ezafe (the connecting vowel “ِ”) and the attached letter “ه.” These errors are prevalent in everyday writing and can lead to mispronunciations or misunderstandings.

**Requirements:**  
- The system must accept an input Persian text.
- Identify the errors where the Ezafe is missing or misused (e.g., incorrect use of “ه” instead of “هٔ” or missing "ـه").
- For each detected issue, return:
  - The corrected full sentence.
  - The **range** (start and end indices) of each erroneous word.
  - The **corrected form** of each erroneous word.

---

**Examples:**

| Input (Persian Text) | Output (JSON Response) |
|----------------------|------------------------|
| حاله من خیلی خوب.    | ```json<br>{<br>  "correct": "حال من خیلی خوبه.",<br>  "حاله": [0, 4],<br>  "خوب": [13, 16]<br>}<br>``` |
| می‌خواهم پرنده‌ از شاخه، روی بامه خانه بپر. | ```json<br>{<br>  "correct": "می‌خواهم پرنده از شاخه، روی بام خانه بپرد.",<br>  "بامه": [27, 31],<br>  "بپر": [37, 40]<br>}<br>``` |
| دیروز آن مرده قوی مرد. | ```json<br>{<br>  "correct": "دیروز آن مرد قوی مرد.",<br>  "مرده": [9, 13]<br>}<br>``` |

These examples show how the system:
- Detects incorrect forms like “حاله” and corrects them to “حالِ”
- Restores missing Ezafe (e.g., “بامه” → “بامِ”)
- Fixes improper verb endings or morphological issues.
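
The response format shown in the examples can be assembled once the errors are known. The sketch below is only an illustration of that last step: `build_response` is a hypothetical helper, and it assumes the detector has already produced each error as a `(start, end, corrected_form)` triple with end-exclusive indices into the original text. Detection itself is not covered.

```python
def build_response(text, errors):
    """Build the response dict from (start, end, corrected_form) triples.

    Indices are end-exclusive offsets into the original text.
    """
    corrected = text
    # Apply right to left so earlier offsets stay valid after each replacement.
    for start, end, fixed in sorted(errors, reverse=True):
        corrected = corrected[:start] + fixed + corrected[end:]

    response = {"correct": corrected}
    # Report the ranges in reading order, keyed by the erroneous word.
    for start, end, _ in sorted(errors):
        response[text[start:end]] = [start, end]
    return response


text = "حاله من خیلی خوب."
result = build_response(text, [(0, 4, "حال"), (13, 16, "خوبه")])
```

Applying the replacements from the end of the string backwards keeps every reported range expressed in the coordinates of the original input, which is what the examples above use.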

---

### 2. Word and Sentence Tokenization

**Objective:**  
Preprocessing in natural language processing starts with dividing the text into sentences and words. You are to create tokenizers that segment an input Persian text into its component sentences and then further into individual words.

**Requirements:**  
- Your module should take a single Persian text string as input.
- **Output:**  
  - A list of sentences extracted from the text.  
  - A nested list (or similar structure) where each sublist contains the tokens (words) of the corresponding sentence.

**Example:**  
- **Input:**  
  ```
  امروز هوا خوب است. فردا شاید بارانی باشد!
  ```
- **Output:**  
  ```json
  {
    "sentences": [
      "امروز هوا خوب است.",
      "فردا شاید بارانی باشد!"
    ],
    "tokens": [
      ["امروز", "هوا", "خوب", "است", "."],
      ["فردا", "شاید", "بارانی", "باشد", "!"]
    ]
  }
  ```
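
A rule-based tokenizer along these lines can be sketched with regular expressions. This is a minimal sketch, not a reference solution: the function names `split_sentences` and `split_words` and the set of sentence terminators are assumptions, and abbreviations or decimal numbers would be split incorrectly.

```python
import re

# Assumed set of sentence-final punctuation marks.
TERMINATORS = re.escape(".!?؟")


def split_sentences(text):
    # A sentence is a run of non-terminators followed by any terminators.
    pattern = rf"[^{TERMINATORS}]+[{TERMINATORS}]*"
    candidates = (m.group().strip() for m in re.finditer(pattern, text))
    return [sentence for sentence in candidates if sentence]


def split_words(sentence):
    # Words may contain ZWNJ (U+200C) and Arabic diacritics (U+064B-U+0652);
    # every other non-space symbol becomes a token of its own.
    return re.findall(r"[\w\u200c\u064B-\u0652]+|[^\w\s]", sentence)


text = "امروز هوا خوب است. فردا شاید بارانی باشد!"
sentences = split_sentences(text)
tokens = [split_words(sentence) for sentence in sentences]
```

Keeping ZWNJ inside the word pattern matters for Persian, since forms such as «می‌خواهم» would otherwise be split into two tokens.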


---

### 3. Creating a Text Normalizer

**Objective:**  
Implement a normalization module that standardizes Persian text. Normalization can include removing extra spaces, converting variant forms of characters (for example, replacing Arabic variants with Persian ones), and transforming informal spellings into their formal equivalents.

**Requirements:**  
- Build a module (or extend an existing one) to normalize text.
- Demonstrate the transformation on sample texts by showing before-and-after examples.
- Include multiple test cases that highlight the strengths and potential weaknesses of your normalization approach.

**Example:**  
- **Input:**  
  ```
  میخوااام اینو درست کنم!!!
  ```
- **Output:**  
  ```
  می‌خواهم این را درست کنم!
  ```
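
The character-level part of such a normalizer can be sketched as follows. This is a minimal sketch under stated assumptions: the `normalize` function and the small `CHAR_MAP` table are illustrative, and the rule that collapses repeated characters is a heuristic that can damage words containing legitimate repetitions.

```python
import re

# Arabic code points mapped to their Persian counterparts.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06cc",  # alef maksura -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
})


def normalize(text):
    text = text.translate(CHAR_MAP)
    # Collapse any character repeated three or more times into a single one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Squeeze runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


print(normalize("میخوااام   اینو درست کنم!!!"))
```

On the sample input this removes the elongation and the repeated exclamation marks, but it does not convert informal forms to formal ones («میخوام» to «می‌خواهم», «اینو» to «این را»); that step needs a lexicon or a dedicated informal normalizer.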


---

### 4. Detection of Illegal Words

**Objective:**  
Some automated systems (bots) are tasked with detecting “illegal” or forbidden words in text. However, if extra characters—such as non-Persian letters, numbers, or special characters—are inserted between the letters, these systems may fail. Your task is to create a system that can robustly detect these words despite such modifications.

**Function Signature:**
```python
def run(input: str, illegal_words: list): ...
```

**Requirements:**  
- **Input:**  
  - A Persian text string.  
  - A list of illegal words to search for.
- **Output:**  
  - A list or dictionary indicating which illegal words were detected, along with the positions (or ranges) in the text where they were found.  
  - Your solution must ignore extraneous characters (e.g., “#”, “…”, spaces, etc.) inserted between the letters of a potentially illegal word.

---

**Examples:**

| **Input**                                   | **Illegal Words**         | **Output (JSON)**                                              |
|--------------------------------------------|---------------------------|-----------------------------------------------------------------|
| `این گفت:#ر ی# ازخوشم`                     | `["تفنگ"]`                | ```json<br>{<br>  "تفنگ": [4, 14]<br>}<br>```                  |
| `با ما رفتم، به #گ...غ غذا نخورد.`         | `["قاشق", "چنگال"]`       | ```json<br>{<br>  "قاشق": [3, 10],<br>  "چنگال": [14, 23]<br>}<br>``` |

In these examples, the system successfully detects the words `"تفنگ"`, `"قاشق"`, and `"چنگال"` despite the presence of additional characters interspersed between the Persian letters.

---

**Additional Example**  
Consider the following input, which has punctuation between each letter:

- **Input:**  
  ```
  م..ی..خ..ر..ی
  ```  
- **Illegal Words:**  
  ```
  ["میخری"]
  ```  
- **Output (illustrative):**  
  ```json
  {
    "detections": [
      { "range": [0, 13], "matched_word": "میخری" }
    ]
  }
  ```

Here, the system identifies that `"م..ی..خ..ر..ی"` effectively matches the illegal word `"میخری"`, ignoring the extra dots in between.
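
One way to ignore inserted characters is to build, for each illegal word, a regular expression that allows any run of non-Persian-letter characters between consecutive letters. The sketch below assumes a particular definition of "Persian letter" (the code point ranges in `PERSIAN_LETTER`) and returns end-exclusive `[start, end]` ranges in a dictionary; both choices are assumptions, not part of the assignment.

```python
import re

# Assumed definition of a Persian letter: the basic Arabic-script letters
# plus the Persian-specific letters (peh, cheh, zheh, keheh, gaf, farsi yeh).
PERSIAN_LETTER = r"\u0621-\u063A\u0641-\u064A\u067E\u0686\u0698\u06A9\u06AF\u06CC"


def run(input: str, illegal_words: list) -> dict:
    detections = {}
    # Anything that is not a Persian letter may appear between two letters.
    filler = rf"[^{PERSIAN_LETTER}]*"
    for word in illegal_words:
        pattern = filler.join(re.escape(letter) for letter in word)
        spans = [[m.start(), m.end()] for m in re.finditer(pattern, input)]
        if spans:
            detections[word] = spans
    return detections


detections = run("م..ی..خ..ر..ی", ["میخری"])
```

Because spaces count as filler, this approach also matches an illegal word whose letters are spread over adjacent words, so it trades some precision for robustness.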


---

### 5. Extraction of Conjunctions and Subsentences

**Objective:**  
Persian sentences are often compound and may be composed of multiple clauses connected by conjunctions (such as "و", "یا", "اما", etc.). Your task is to:

1. Split the text into its constituent subsentences.  
2. Extract the conjunction words used to connect these clauses.  
3. Identify the type of each conjunction (e.g., coordinating, subordinating, adversative, dual) and record the position (index range) of each occurrence.  

If a conjunction appears more than once, differentiate each occurrence (e.g., “و1,” “و2,” etc.).

---

**Requirements:**  
- **Input:**  
  - A single Persian text string.  
- **Output:**  
  1. A list (or array) of subsentences.  
  2. A data structure (e.g., a list or dictionary) that includes each extracted conjunction, its type, and its location (start and end indices) within the text.

---

**Examples:**

| **Input**                                                                                                 | **Output (JSON)**                                                                                                                                                                                              |
|-----------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| امروز در خبرها آمده است که هوای شهر نیمه ابری است. <br> او درس خواند اما نمره‌اش خوب نشد و کلّی یاد گرفت. | ```json<br>{<br>  "splits": [<br>    "امروز در خبرها آمده است که هوای شهر نیمه ابری است.",<br>    "او درس خواند",<br>    "اما نمره‌اش خوب نشد",<br>    "و کلّی یاد گرفت."<br>  ],<br>  "conjunctions": [<br>    {<br>      "word": "اما",<br>      "type": "adversative",<br>      "range": [26, 28]<br>    },<br>    {<br>      "word": "و",<br>      "type": "coordinating",<br>      "range": [45, 45]<br>    }<br>  ]<br>}<br>``` |
| رفت به دوستش گفت بیا با هم بازی کنیم.                                                                     | ```json<br>{<br>  "splits": [<br>    "رفت به دوستش گفت",<br>    "بیا با هم بازی کنیم."<br>  ],<br>  "conjunctions": []<br>}<br>```                                                                                 |
| هم فال بود هم تماشا.                                                                                      | ```json<br>{<br>  "splits": [<br>    "هم فال بود",<br>    "هم تماشا."<br>  ],<br>  "conjunctions": [<br>    {<br>      "word": "هم",<br>      "type": "dual",<br>      "range": [0, 1]<br>    },<br>    {<br>      "word": "هم",<br>      "type": "dual",<br>      "range": [11, 12]<br>    }<br>  ]<br>}<br>``` |

---

### Explanation of the Examples

1. **Multiple Conjunctions in a Single Paragraph**  
   - **Input:**  
     ```
     امروز در خبرها آمده است که هوای شهر نیمه ابری است. او درس خواند اما نمره‌اش خوب نشد و کلّی یاد گرفت.
     ```
   - **Output:**  
     - **splits:** Each sentence or clause is separated based on punctuation or recognized conjunction words.  
     - **conjunctions:** The words “اما” and “و” are identified, along with their indices in the text. Their types are labeled as “adversative” and “coordinating,” respectively.

2. **No Conjunction Present**  
   - **Input:**  
     ```
     رفت به دوستش گفت بیا با هم بازی کنیم.
     ```
   - **Output:**  
     - **splits:** Two main clauses are identified.  
     - **conjunctions:** An empty list, since no explicit conjunction words (“و,” “یا,” “اما,” etc.) are found.

3. **Repeated Conjunction**  
   - **Input:**  
     ```
     هم فال بود هم تماشا.
     ```
   - **Output:**  
     - **splits:** The text is split into two clauses: “هم فال بود” and “هم تماشا.”  
     - **conjunctions:** The word “هم” appears twice and is considered a dual conjunction. The positions (ranges) for each occurrence are recorded separately.

---

### Implementation Tips

- You can use a combination of **regular expressions** and **string processing** to detect conjunctions.
- Decide on a set of known conjunctions (e.g., “و,” “یا,” “اما,” “هم,” “ولی,” etc.) and classify them as coordinating, subordinating, adversative, or dual.
- Track each conjunction’s **start** and **end** indices in the original string to fill out the `"conjunctions"` data structure.
- For splitting into subsentences, look for:
  1. **Conjunctions**  
  2. **Punctuation** (e.g., “.”, “!”, “؟”)  
  3. **Possible multi-sentence boundaries** (some Persian texts omit explicit punctuation)
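
The conjunction-detection tip can be sketched with a lookup table and a whole-word regular expression. This is a minimal sketch: the `CONJUNCTIONS` table and its type labels are hypothetical, and the ranges use inclusive end indices to match the examples in this section.

```python
import re

# Hypothetical lookup table; extend and reclassify as needed.
CONJUNCTIONS = {
    "و": "coordinating",
    "یا": "coordinating",
    "اما": "adversative",
    "ولی": "adversative",
    "هم": "dual",
}


def extract_conjunctions(text):
    # Longest alternatives first, so a longer conjunction is never shadowed.
    alternatives = "|".join(
        sorted(map(re.escape, CONJUNCTIONS), key=len, reverse=True)
    )
    # Whole words only: not adjacent to a word character or ZWNJ.
    pattern = rf"(?<![\w\u200c])(?:{alternatives})(?![\w\u200c])"
    return [
        {
            "word": m.group(),
            "type": CONJUNCTIONS[m.group()],
            "range": [m.start(), m.end() - 1],  # inclusive end index
        }
        for m in re.finditer(pattern, text)
    ]


text = "هم فال بود هم تماشا."
found = extract_conjunctions(text)
```

A pure lookup cannot tell a conjunction from a homograph: in «با هم بازی کنیم» the word «هم» is not a conjunction, so a POS tagger or context rules are needed to filter such matches.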


---

### 6. Extraction of Food Order Features from a Message

**Objective:**  
Develop a system that analyzes a text message to extract features related to a food order. The message may include details such as the type of food, size or quantity, and any special instructions or extras.

**Requirements:**  
1. **Input Text Analysis:**  
   - Identify **food items** (e.g., “پیتزا,” “کباب,” “سبزی پلو,” “ماهی,” “سالاد,” etc.).  
   - Detect **sizes** or **quantities** (e.g., “یک,” “دوتا,” “بزرگ,” “کوچک,” etc.).  
   - Extract **extras** or **special instructions** (e.g., “سس اضافه,” “نوشابه,” “خیلی خوب شسته شده,” “چرخ کرده,” etc.).  
2. **Output:**  
   - A structured representation (such as JSON) that clearly specifies the details of the order:  
     - **food**: List of all mentioned dishes or items.  
     - **quantity/size**: If provided (single or multiple).  
     - **extras/notes**: Any additional preferences or instructions.

---

**Examples:**

1. **Example 1**  
   - **Input:**  
     ```
     لطفاً یک پیتزای بزرگ با سس اضافه بیار
     ```
   - **Output:**  
     ```json
     {
       "food": "پیتزا",
       "size": "بزرگ",
       "extras": "سس اضافه"
     }
     ```

2. **Example 2**  
   - **Input:**  
     ```
     من دوتا کباب میخوام، یکی معمولی و یکی تند.
     ```
   - **Output:**  
     ```json
     {
       "food": "کباب",
       "quantity": 2,
       "types": ["معمولی", "تند"]
     }
     ```

3. **Example 3 (from Screenshot)**  
   - **Input:**  
     ```
     بی زحمت یک سبزی پلو با ماهی و اگر آب دریا خوب بوده، لطفا قزل آلا بگذار. اگر سبزی پاک نشده باشد.
     ```
   - **Output:**  
     ```json
     {
       "food": [
         "سبزی پلو با ماهی",
         "قزل آلا"
       ],
       "extras": [
         "سبزی پاک نشده"
       ]
     }
     ```
   In this text, the system detects two main dishes (“سبزی پلو با ماهی” and “قزل آلا”) and an additional note or instruction regarding “سبزی پاک نشده.”

4. **Example 4 (from Screenshot)**  
   - **Input:**  
     ```
     سلام یک پیتزای قارچ و گوشت میخواستم لطفا خیلی خوب شسته شده باشد و گوشت چرخ کرده باشد.
     ```
   - **Output:**  
     ```json
     {
       "food": [
         "پیتزای قارچ و گوشت"
       ],
       "extras": [
         "خیلی خوب شسته شده",
         "گوشت چرخ کرده"
       ]
     }
     ```
   Here, the primary food item is “پیتزای قارچ و گوشت,” and there are two separate instructions or notes: “خیلی خوب شسته شده” and “گوشت چرخ کرده.”

---

### Implementation Suggestions

- **Tokenization & Keyword Detection**:  
  Use either a custom tokenizer or an existing Persian tokenizer to split the text into meaningful tokens. Then, look for known food keywords (e.g., “پیتزا,” “ماهی,” etc.) and special instruction phrases.

- **Regular Expressions**:  
  - Can help capture patterns like “سس اضافه,” “خیلی خوب شسته شده,” or any numeric/quantifier phrase (“یک,” “دوتا,” etc.).  
  - Identify patterns that might indicate multiple items, such as “و” (and) or punctuation.

- **Parsing Logic**:  
  - Once a food item is found, note it in the `"food"` field.  
  - If a size/quantity keyword is encountered, store it in `"size"` or `"quantity"`.  
  - Any leftover descriptive or directive phrases go into `"extras"` or `"notes"`.
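
The parsing logic above can be sketched with small keyword lexicons. This is a minimal sketch under stated assumptions: the lists `FOODS`, `SIZES`, `QUANTITIES`, and `EXTRAS` and the function `extract_order` are hypothetical, and real messages need far larger lexicons plus handling of free-form instructions.

```python
# Hypothetical keyword lexicons; a real system would need far larger lists.
FOODS = ["پیتزا", "کباب", "سالاد"]
SIZES = ["بزرگ", "کوچک"]
QUANTITIES = {"یک": 1, "دوتا": 2}
EXTRAS = ["سس اضافه", "نوشابه"]


def extract_order(message):
    order = {}
    # Substring search also catches Ezafe forms such as "پیتزای".
    foods = [food for food in FOODS if food in message]
    if foods:
        order["food"] = foods
    # Quantities and sizes are matched as whole tokens.
    for token in message.split():
        if token in QUANTITIES:
            order["quantity"] = QUANTITIES[token]
        elif token in SIZES:
            order["size"] = token
    extras = [extra for extra in EXTRAS if extra in message]
    if extras:
        order["extras"] = extras
    return order


order = extract_order("لطفاً یک پیتزای بزرگ با سس اضافه بیار")
```

Fields are added only when something was found, so a message without a size simply has no `"size"` key.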



---

### 7. Detection of a Sequence of Actions

**Objective:**  
Many texts describe a series of actions or instructions that need to be performed in order. Your task is to:

1. Detect sequencing keywords such as **"ابتدا"** (first), **"سپس"** (then), **"در نهایت"** (finally), and similar phrases that imply sequential actions.  
2. Split the text into individual steps that reflect the sequence of actions.  
3. **Output** a structured list or dictionary containing the ordered steps.

---

**Implementation Hints:**

- **Parsing Sequential Cues**: Look for words like “ابتدا,” “بعد,” “سپس,” “در نهایت,” or any similar adverbial phrase indicating order.  
- **Segmentation**: Use these keywords and punctuation to split the text into discrete steps.  
- **Normalization**: You may need to normalize or simplify certain phrases (e.g., “می‌توانیم سمنو اضافه کنیم” → “سمنو اضافه کن”) to store them consistently in your output.

---

**Examples:**

#### Example 1

| **Input**                                                                                                                              | **Output (JSON)**                                                                                                              |
|----------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| برای پخت غذای عید لازم است سیر سرخ کنیم ولی قبلش باید سیب بخوریم، در گام سوم باید سمنو را هم اضافه کنیم و بعد سیرها را با آن‌ها ترکیب می‌کنیم. | ```json<br>{<br>  "goal": [<br>    "پخت غذای عید",<br>    "سرخ سیر",<br>    "سیب بخور",<br>    "سمنو اضافه کن",<br>    "سیرها ترکیب کن"<br>  ]<br>}<br>``` |

Explanation:
1. **پخت غذای عید** is identified as the first overall goal or action.
2. **سرخ سیر** is the second step.
3. **سیب بخور** is inferred from “باید سیب بخوریم.”
4. **سمنو اضافه کن** is extracted from “سمنو را هم اضافه کنیم.”
5. **سیرها ترکیب کن** is derived from “سیرها را با آن‌ها ترکیب می‌کنیم.”

---

#### Example 2

| **Input**                                                                                              | **Output (JSON)**                                                                                                  |
|--------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|
| جهت انجام تکالیف، ابتدا پشت میز بنشینید، سپس روی آن فکر کنید و درنهایت زمان کافی برای حل سوالات بگذارید. | ```json<br>{<br>  "goal": [<br>    "انجام تکالیف",<br>    "پشت میز بنشین",<br>    "روی آن فکر کن",<br>    "زمان کافی برای حل سوالات بگذار"<br>  ]<br>}<br>``` |

Explanation:
1. **انجام تکالیف** is the main goal.
2. **پشت میز بنشین** is identified from “ابتدا پشت میز بنشینید.”
3. **روی آن فکر کن** corresponds to “روی آن فکر کنید.”
4. **زمان کافی برای حل سوالات بگذار** is derived from “در نهایت زمان کافی برای حل سوالات بگذارید.”

---

#### Example 3 (Basic)

- **Input:**  
  ```
  ابتدا باید آرد را آماده کنم، بعد خمیر درست کنم و در نهایت آن را بپزم.
  ```
- **Output:**  
  ```json
  {
    "actions": [
      "آرد را آماده کن",
      "خمیر درست کن",
      "آن را بپز"
    ]
  }
  ```

In this simpler example, the keywords **"ابتدا"**, **"بعد"**, and **"در نهایت"** guide the segmentation of the text into three clear steps.

---

### Final Notes

- Ensure that each **step** or **goal** is stored in a concise, imperative form (e.g., “بخور,” “بپز,” “بنشین”) to maintain consistency.  
- You may want to handle **edge cases** where sequential keywords are implied but not explicitly stated.  
- Combining **regular expressions**, **tokenization**, and a **keyword dictionary** can be very effective for this task.
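
The segmentation step can be sketched by splitting on a list of sequencing cues. This is a minimal sketch: the `CUES` list and the `split_steps` function are assumptions, and the conversion of each step to the imperative form is not attempted here.

```python
import re

# Assumed cue list; multi-word cues come first.
CUES = ["در نهایت", "درنهایت", "ابتدا", "سپس", "بعد"]


def split_steps(text):
    cue_pattern = "|".join(map(re.escape, CUES))
    # Split on cues that stand as whole words; keep the text between them.
    parts = re.split(rf"(?<![\w\u200c])(?:{cue_pattern})(?![\w\u200c])", text)
    steps = []
    for part in parts:
        # Trim whitespace and punctuation at the clause edges.
        step = part.strip(" \t\n،,.؛!؟")
        # Drop a dangling "و" left in front of the next cue.
        step = re.sub(r"\s+و$", "", step).strip(" ،")
        if step:
            steps.append(step)
    return steps


steps = split_steps(
    "ابتدا باید آرد را آماده کنم، بعد خمیر درست کنم و در نهایت آن را بپزم."
)
```

For the sentence from Example 3 this produces three raw clauses, which a later normalization step would still have to rewrite as «آرد را آماده کن», «خمیر درست کن», and «آن را بپز».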


---

### 8. Extraction of English Words Written in Persian

**Objective:**  
Occasionally, English words appear within Persian text, sometimes written phonetically in Persian script. Your task is to identify these words and, if possible, provide their standard English equivalents.

**Requirements:**  
1. **Analyze** an input Persian text for tokens that are likely English words.  
2. **Output:**  
   - A list of detected words, along with their positions (start and end indices) in the text.  
   - When possible, include a mapping to the standard English word.

---

**Implementation Tips:**

- **Tokenization:** Use a tokenizer (custom or existing) to split the text into words.  
- **Heuristic or Dictionary Matching:**  
  - Create a dictionary or heuristic rules for identifying potential English-origin words (e.g., words that contain certain phonetic sequences or that are known loanwords).  
  - Alternatively, use a machine-learning or statistical approach if you have a labeled dataset of transliterated English words in Persian.
- **Mapping to English:**  
  - Maintain a lookup table of common transliterated words (e.g., “کامپیوتر” → “computer,” “سیستم” → “system,” “هارد” → “hard,” etc.).  
  - For words not in your lookup, you can leave the `standard_english` field empty or attempt a best-guess transliteration.
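
The lookup-table approach can be sketched as follows. This is a minimal sketch: the `LOOKUP` table and the `find_english_words` function are hypothetical, the ranges are end-exclusive, and the prefix matching used to handle suffixed forms can produce false positives on native Persian words.

```python
import re

# Hypothetical lookup table of transliterated words.
LOOKUP = {
    "کامپیوتر": "computer",
    "سیستم": "system",
    "هارد": "hard",
    "تنکس": "thanks",
}


def find_english_words(text):
    results = []
    # Tokens are runs of letters (no digits, underscores, or ZWNJ).
    for match in re.finditer(r"[^\W\d_]+", text):
        token = match.group()
        # Try the full token first, then shorter prefixes, so a suffixed
        # form such as "هاردی" still maps to "هارد".
        for end in range(len(token), 1, -1):
            stem = token[:end]
            if stem in LOOKUP:
                results.append({
                    "word_in_persian": stem,
                    "range": [match.start(), match.start() + end],
                    "standard_english": LOOKUP[stem],
                })
                break
    return results


found = find_english_words("سیستم کامپیوتر خراب شده‌است.")
```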

---

**Examples:**

#### Example 1

- **Input:**  
  ```
  امروز یک کار خیلی هاردی داشتیم. ولی تو کانتریبیوشن خوبی داشتی. تنکس.
  ```
- **Output (JSON):**  
  ```json
  {
    "english_words": [
      {
        "word_in_persian": "هارد",
        "range": [18, 22],
        "standard_english": "hard"
      },
      {
        "word_in_persian": "کانتریبیوشن",
        "range": [39, 50],
        "standard_english": "contribution"
      },
      {
        "word_in_persian": "تنکس",
        "range": [63, 67],
        "standard_english": "thanks"
      }
    ]
  }
  ```

Explanation:  
- The system detects “هارد,” “کانتریبیوشن,” and “تنکس” as transliterated English words.  
- Each entry includes its position (index range) in the text and a possible standard English equivalent.

---

#### Example 2

- **Input:**  
  ```
  سیستم کامپیوتر خراب شده‌است.
  ```
- **Output (JSON):**  
  ```json
  {
    "english_words": [
      {
        "word_in_persian": "سیستم",
        "range": [0, 5],
        "standard_english": "system"
      },
      {
        "word_in_persian": "کامپیوتر",
        "range": [6, 14],
        "standard_english": "computer"
      }
    ]
  }
  ```

Explanation:  
- The words “سیستم” and “کامپیوتر” are recognized as likely English-origin terms, mapped here to “system” and “computer,” respectively.

---

### Key Takeaways

- **Position Tracking:** Always record the exact indices (start and end) for each detected English-origin word in the text.  
- **Flexible Matching:** Some words may have multiple acceptable transliterations (e.g., “سیستم” could sometimes appear as “سیستوم”). Decide on a consistent approach for mapping them back to English.  
- **Coverage vs. Precision:** Consider balancing the breadth of your dictionary (catching as many English words as possible) with the precision of your detection (minimizing false positives).

By following these steps, your system should be able to detect English words embedded in Persian text, map them to their English forms, and return a clear, structured output.

---

## Final Notes

- Each task may be implemented either by creating a custom module or by extending existing open-source libraries.
- It is recommended that you provide comprehensive test cases and examples to demonstrate the robustness of your solutions.
- Ensure your final submission adheres to the guidelines regarding code execution, file uploads, and documentation.




<font face="'vazirmatn', 'Vazir', 'B Nazanin', 'XB Zar'" size=4><div dir='rtl' align='justify'>
# **The Heh-Kasra Error**
In written Persian, the Heh-Kasra error arises when the kasra mark is not used correctly.
Although the sound "e" in Persian corresponds to several kinds of morphemes, only two written symbols are available to represent it. When "ه/ـه" is used in place of a kasra (ـــِ), or vice versa, a Heh-Kasra grammatical error results. In this exercise, we have implemented a service that receives a Persian text, detects its Heh-Kasra errors, and returns the corrected text in the response. The rest of this report describes the implementation details and the method used to detect Heh-Kasra errors.




<font face="'vazirmatn', 'Vazir', 'B Nazanin', 'XB Zar'" size=4><div dir='rtl' align='justify'>
### **Installing the Required Packages and Tools**

The main libraries used in this exercise are Hazm and DadmaTools. Hazm is used for POS tagging, and DadmaTools is used to check word similarity.

In [None]:
# Install each dependency only if it cannot already be imported.
try:
    import sklearn
except ImportError:
    %pip install -U scikit-learn numpy

try:
    import hazm
except ImportError:
    %pip install hazm

try:
    import dadmatools
except ImportError:
    %pip install dadmatools

try:
    import fasttext
except ImportError:
    %pip install fasttext

try:
    import wapiti
except ImportError:
    %pip install wapiti





## **Code Synchronization with the parsi-io Library**

In implementing the code for this exercise, an effort was made to align the service's implementation with the design pattern used in the parsi-io library.  
The `HeKasraExtractor` class is the main component of this service, and it has been developed using a framework compatible with the aforementioned library. However, due to the exercise upload deadline, there wasn’t an opportunity to submit a pull request to the library’s repository. Once the contribution guidelines and rules are fully met in that project’s repository, we will attempt to submit a pull request to the parsi-io repository in the near future.

# **Identifying and Correcting the Heh-Kasra Errors**

In Persian writing, the *Heh-Kasra* error occurs according to specific patterns. Generally, these patterns are fairly rule-based, and despite some exceptions, *Heh-Kasra* mistakes can be categorized into a few common types.  
This [blog post](https://blog.irandargah.com/%D8%BA%D9%84%D8%B7%E2%80%8C%D9%87%D8%A7%DB%8C-%D9%86%DA%AF%D8%A7%D8%B1%D8%B4%DB%8C-%D9%88-%D8%A7%D9%85%D9%84%D8%A7%DB%8C%DB%8C%D8%8C-%D9%82%D8%A7%D8%AA%D9%84-%D8%A7%D8%B9%D8%AA%D8%A8%D8%A7%D8%B1/) offers a brief and useful overview of the various types of Heh-Kasra errors in Persian.  
In this exercise, the implementation for detecting Heh-Kasra errors has also been based on such resources. Below is a brief explanation of the common patterns associated with this spelling mistake.

---

### Common Patterns of the Heh-Kasra Mistake

1. **Adjective-Noun or Ezafe Constructions**  
   In descriptive or possessive constructions, an Ezafe (ــِ) must be used between the two relevant words. Using "ه" or "ـه" instead in these contexts is incorrect.  
   * **Exception: When the morpheme “ه” is part of the word**  
   There are cases where the letter “ه/ـه” is actually an integral part of the word, appearing at the end and known as a *silent Heh*. In such cases, the “ه/ـه” should not be removed, and replacing it with a Kasra is inappropriate.

2. **The morpheme “ه” as a definiteness marker**  
   In some cases, "ه" is added to the end of words to make them definite (i.e., to refer to a specific person or object known to the speaker). Using a Kasra (ـِ) instead of this “ه” is entirely incorrect.

3. **The morpheme “ه” as a substitute for a verb**  
   In spoken Persian, sometimes the sound “e” is used instead of the verb *“ast”* (is) or *“hast”* (exists). In these cases, “ه” should be used, not a Kasra.  
   Also, for third-person verbs in colloquial speech, instead of ending them with “ـَد”, the sound “e” is sometimes used. In such cases too, the correct usage is “ه” rather than a Kasra.




## **Implementation of the Heh-Kasra Error Detection System**

---

### The `HeKasraCorrection` Class

An object of this class holds the processed text (along with the original raw text). The methods of this object are called by the components of the Heh-Kasra detection pipeline.  
If a method in the pipeline detects a Heh-Kasra error, it reports the error as a potential correction using the `vote_for_correction` function. The `order` argument of this function sets the priority of the correction: judgements are sorted in ascending `order`, so the lowest value wins (the default is 10, and a veto uses 0).

---

The `veto_correction` function in this class, when triggered by a component in the pipeline, vetoes the corrections suggested by previous components.  
For example, in the phrase “خانه زیبا” (*beautiful house*), an early function in the pipeline might wrongly detect a Heh-Kasra error in this descriptive compound. But a later function recognizes that the “ه” is part of the word “خانه” (*house*), meaning it's not an error. Therefore, the earlier error detection is vetoed.

---

Finally, the `finalize` function is called at the end, after all the modules in the pipeline have cast their votes regarding the Heh-Kasra errors in the text. This function gathers the errors, applies them based on their priority, generates the corrected text, and identifies the errors and their corresponding ranges in the original input.



In [None]:
from collections import defaultdict


class HeKasraCorrection:
    def __init__(self, processed_text):
        self.processed_text = processed_text
        self.corrections = {
            'correct': processed_text['raw_text'],
        }
        self.correction_judgements = defaultdict(list)

    def vote_for_correction(self, invalid_token, corrected_token, str_index, order=10):
        self.correction_judgements[str_index].append({
            'invalid_token': invalid_token,
            'corrected_token': corrected_token,
            'str_index': str_index,
            'order': order,
        })
        return self.correction_judgements[str_index]

    def veto_correction(self, already_correct_token, str_index):
        self.correction_judgements[str_index].append({
            'invalid_token': already_correct_token,
            'corrected_token': already_correct_token,
            'str_index': str_index,
            'order': 0,
        })
        return self.correction_judgements[str_index]

    def apply_correction_judgements(self, token, str_index):
        judgements = self.correction_judgements[str_index]
        if len(judgements) == 0:
            return

        # The judgement with the lowest `order` wins, so a veto (order 0) beats any vote.
        sorted_judgements = sorted(judgements, key=lambda x: x['order'])
        prioritized_correction = sorted_judgements[0]
        end_index = str_index + len(prioritized_correction['invalid_token'])
        corrected_form = (
            self.corrections['correct'][:str_index]
            + prioritized_correction['corrected_token']
            + self.corrections['correct'][end_index:]
        )
        self.corrections['correct'] = corrected_form
        # Record a range only when the winning judgement actually changes the token.
        if token != prioritized_correction['corrected_token']:
            self.corrections[prioritized_correction['invalid_token']] = [int(str_index), int(end_index)]

    def finalize(self):
        # Apply corrections from right to left so that a length-changing replacement
        # does not shift the indices of the corrections still to be applied.
        for str_index in sorted(self.correction_judgements, reverse=True):
            judgements = self.correction_judgements[str_index]
            # Pass the token originally found at this position.
            self.apply_correction_judgements(judgements[0]['invalid_token'], str_index)
        if self.corrections['correct'] == self.processed_text['raw_text']:
            self.corrections = {}
        return {
            **self.processed_text,
            'correction': self.corrections,
        }


**__init__(processed_text):** This constructor initializes the instance by storing the processed text (including the raw text) and setting up a corrections dictionary with the raw text as its initial value, while also preparing a default dictionary to gather correction judgements indexed by string positions.

**vote_for_correction(invalid_token, corrected_token, str_index, order=10):** This method records a suggested correction for a detected error by appending a dictionary—containing the invalid token, its proposed correction, its position, and a priority order (default 10)—to the list of judgements at the specified string index.

**veto_correction(already_correct_token, str_index):** This function acts to override previous correction suggestions by appending a veto entry that marks the token as correct (using an order of 0) at the given index, ensuring that any earlier corrections for that token are effectively negated.

**apply_correction_judgements(token, str_index):** This method applies the correction by first retrieving and sorting all judgements at a given string index by priority, then updating the corrected text by replacing the identified invalid token with the highest priority correction; if the token changes, it also records the error's location.

**finalize():** This function iterates over all stored correction judgements to apply the prioritized corrections to the text, and if no changes are detected (i.e., the corrected text remains identical to the raw text), it clears the corrections before returning the processed text merged with any corrections made.
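
To make the interplay between votes and vetoes concrete, the standalone sketch below mirrors the priority rule described above: the judgements for one position are sorted by `order` and the lowest value wins. The helper name `resolve` is hypothetical and exists only for illustration; it is not part of the `HeKasraCorrection` class.

```python
def resolve(judgements):
    """Return the winning judgement: the one with the smallest `order`."""
    return sorted(judgements, key=lambda judgement: judgement["order"])[0]


# An early pipeline stage suspects that the final "ه" of "خانه" is an error...
vote = {"invalid_token": "خانه", "corrected_token": "خان", "order": 10}
# ...but a later stage recognises the silent Heh and vetoes the change.
veto = {"invalid_token": "خانه", "corrected_token": "خانه", "order": 0}

winner = resolve([vote, veto])
```

Since the veto carries `order` 0, it wins regardless of the order in which the pipeline stages ran, and the token is left unchanged.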

In [10]:
from dadmatools.embeddings import get_embedding
# Some downloading, so separate the cell
embeddings = get_embedding('word2vec-conll')

#### `HeKasraExtractor` Class

This class serves as the core component of the service. It takes a Persian text as input, detects Heh-Kasra errors, and returns the corrected text along with the exact spans of the identified errors. To detect Heh-Kasra mistakes, the `run` function of this class executes a pipeline that includes preprocessing the text, annotating it, and identifying various types of Heh-Kasra errors.

In [None]:
# Download and unpack the Hazm v0.5 resources archive.
!wget https://github.com/roshan-research/hazm/releases/download/v0.5/resources-0.5.zip
!unzip resources-0.5.zip

In [4]:
from hazm import WordTokenizer, POSTagger, Normalizer, InformalNormalizer, SentenceTokenizer, Lemmatizer
import re

normalizer = Normalizer()
inf_normalizer = InformalNormalizer(seperation_flag=True)
sent_tokenizer = SentenceTokenizer()
tokenizer = WordTokenizer(join_verb_parts=True)


In [5]:
tagger = POSTagger(model="./pos_tagger.model")


The line `tagger = POSTagger(model="./pos_tagger.model")` creates an instance of the `POSTagger` class from a pre-trained model located at `./pos_tagger.model`. The `tagger` object assigns part-of-speech (POS) tags to the tokens of a text, which identifies their grammatical roles (e.g., noun, verb, adjective) and is a key step in detecting Heh-Kasra errors.

In [6]:
import math

class HeKasraExtractor:
    def __init__(self):
        pass

    def preprocess(self, text):
        normalized_text = normalizer.normalize(text)
        tokens = tokenizer.tokenize(normalized_text)
        tagged_tokens = tagger.tag(tokens)

        return {
            'raw_text': text,
            'pipe_text': text,
            'normalized_text': normalized_text,
            'tokens': tokens,
            'pos_tags': tagged_tokens
        }


    def vote_n_adj_he_kasra(self, he_kasra_correction, processed_text):
        # POS tags that can take part in a noun/adjective/pronoun construction.
        first_token_roles = ('N', 'Ne', 'PRO', 'AJ', 'AJe')
        second_token_roles = ('N', 'Ne', 'AJ', 'PRO', 'AJe')

        # Walk over every pair of adjacent (token, tag) tuples.
        pos_pairs = zip(processed_text['pos_tags'][:-1], processed_text['pos_tags'][1:])
        for (token_1, tag_1), (token_2, tag_2) in pos_pairs:
            if tag_1 in first_token_roles and tag_2 in second_token_roles:
                if token_1.endswith('ه'):
                    # Note: str.index returns the first occurrence of the token in the raw text.
                    he_kasra_correction.vote_for_correction(
                        token_1, token_1[:-1], processed_text['raw_text'].index(token_1))


    def check_word_contains_he(self, word, next_word, text):
        normalized_word = normalizer.normalize(word)

        if not normalized_word.endswith('ه'):
            return False

        vocab = embeddings.get_vocab()
        plural_form = word + 'ها'
        if plural_form not in vocab:
          return False

        return True

    def check_sent_has_verb(self, processed_text):
      return 'V' in [t[1] for t in processed_text['pos_tags']]

    def check_sent_has_too_many_res(self, processed_text):
      is_res = [r[1] == 'RES' for r in processed_text['pos_tags']]
      if not is_res:
        # Empty input: avoid dividing by zero.
        return False
      ratio = is_res.count(True) / len(is_res)
      return ratio > 0.75

    def check_sent_should_have_he_as_verb(self, he_kasra_correction, processed_text):
      if self.check_sent_has_verb(processed_text):
        return False
      if not self.check_sent_has_too_many_res(processed_text):
        return False

      new_sent = processed_text['raw_text'] + 'ه'
      new_pr = self.preprocess(new_sent)
      if self.check_sent_has_verb(new_pr) and not self.check_sent_has_too_many_res(new_pr):
        latest_token = processed_text['pos_tags'][-1][0]
        he_kasra_correction.vote_for_correction(latest_token, latest_token + 'ه', processed_text['raw_text'].index(latest_token))
        processed_text['pos_tags'] = new_pr['pos_tags']
        processed_text['tokens'] = new_pr['tokens']


    def veto_if_word_he_part_of_word(self, he_kasra_correction, processed_text):
        pos_pairs = zip(processed_text['pos_tags'][:-1], processed_text['pos_tags'][1:])

        for (token_1, _), (token_2, _) in pos_pairs:
            if self.check_word_contains_he(token_1, token_2, processed_text):
                he_kasra_correction.veto_correction(token_1, processed_text['raw_text'].index(token_1))

    def run(self, input_sentence):
        prep_text = self.preprocess(input_sentence)
        he_kasra_correction = HeKasraCorrection(prep_text)
        pipe = [
            self.check_sent_should_have_he_as_verb,
            self.vote_n_adj_he_kasra,
            self.veto_if_word_he_part_of_word,
        ]

        for func in pipe:
            func(he_kasra_correction, prep_text)

        result = he_kasra_correction.finalize()
        return result['correction']




**__init__(self):**  
This constructor sets up an instance of the extractor without initializing any parameters, serving as a placeholder for future attributes or methods.

**preprocess(self, text):**  
This method normalizes the input text, tokenizes it, and assigns part-of-speech tags to the tokens, returning a dictionary that includes the original text, normalized text, token list, and their corresponding POS tags for further processing.

**vote_n_adj_he_kasra(self, he_kasra_correction, processed_text):**  
This function inspects adjacent token pairs in the text. If both tokens carry a noun, adjective, or pronoun tag (`N`, `Ne`, `AJ`, `AJe`, `PRO`) and the first token ends with "ه", it votes for a correction that removes the final "ه", thereby addressing a potential Heh-Kasra error in descriptive constructions.
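
The pair check can be sketched on hand-tagged data, without the tagger. In the sketch below the function name `find_he_kasra_candidates` and the tags in the usage example are assumptions for illustration only. Unlike `str.index`, which always returns the first occurrence, the sketch keeps a running cursor so that a repeated token is mapped to its own position in the text.

```python
NOMINAL_TAGS = {"N", "Ne", "AJ", "AJe", "PRO"}


def find_he_kasra_candidates(tagged_tokens, text):
    """Return (token, proposed_form, start) for suspicious tokens.

    A token is suspicious when it and its successor both carry a nominal
    tag and the token ends with 'ه'. Hypothetical helper for illustration.
    """
    candidates = []
    cursor = 0  # search position, so repeated tokens get distinct offsets
    for i, (token, tag) in enumerate(tagged_tokens):
        start = text.find(token, cursor)
        if start == -1:
            continue  # token not found verbatim (e.g. changed by normalization)
        cursor = start + len(token)
        if i + 1 == len(tagged_tokens):
            continue  # the last token has no successor to pair with
        next_tag = tagged_tokens[i + 1][1]
        if tag in NOMINAL_TAGS and next_tag in NOMINAL_TAGS and token.endswith("ه"):
            candidates.append((token, token[:-1], start))
    return candidates


# Tags below are assigned by hand for the example, not produced by a tagger.
tagged = [("مرده", "N"), ("و", "CONJ"), ("مرده", "N"), ("قوی", "AJ")]
print(find_he_kasra_candidates(tagged, "مرده و مرده قوی"))
```

Only the second "مرده" is followed by a nominal token, so it is the only candidate, and its offset points at the second occurrence rather than the first.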

**check_word_contains_he(self, word, next_word, text):**  
This helper method checks if a given word ends with "ه" after normalization and confirms its validity by ensuring that the plural form (word+"ها") exists in the vocabulary, thus determining if the "ه" is an integral part of the word rather than an error.
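
The plural-form heuristic can be tried on a toy vocabulary. The sketch below is an assumption-laden variant, not the notebook's method: the name `he_is_part_of_word` and the `toy_vocab` set are hypothetical, and the sketch also accepts the plural written with a zero-width non-joiner, which the original check does not do.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, often written before the plural suffix


def he_is_part_of_word(word, vocab):
    """Guess whether a final 'ه' belongs to the word itself.

    Heuristic: if the plural (word + 'ها', with or without ZWNJ) is a known
    vocabulary entry, the word is treated as a real lexical item.
    """
    if not word.endswith("ه"):
        return False
    return (word + "ها") in vocab or (word + ZWNJ + "ها") in vocab


# Hypothetical vocabulary standing in for the embedding vocabulary.
toy_vocab = {"خانه" + ZWNJ + "ها", "کتاب" + ZWNJ + "ها", "کتابها"}

print(he_is_part_of_word("خانه", toy_vocab))   # True: the plural of "خانه" is known
print(he_is_part_of_word("کتابه", toy_vocab))  # False: no plural built on "کتابه"
```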

**check_sent_has_verb(self, processed_text):**  
This function verifies whether the processed text contains any verb (tagged as 'V') within its POS tags, helping to decide if additional corrections, such as appending "ه" as a verb, are necessary.

**check_sent_has_too_many_res(self, processed_text):**  
This method calculates the ratio of tokens tagged as 'RES' (the tagger's residual category) in the sentence and returns True if more than 75% of the tokens fall into this category. The result feeds into the decision of whether a verb-like "ه" is missing from the sentence.
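
A minimal sketch of the ratio test on hand-tagged input is given below. The helper names `res_ratio` and `has_too_many_res` are assumptions for this example. It makes two details explicit: the comparison is strict, so a ratio of exactly 0.75 does not pass, and an empty token list is treated as ratio 0.

```python
def res_ratio(tagged_tokens):
    """Fraction of (token, tag) pairs whose tag is 'RES'; 0.0 for empty input."""
    if not tagged_tokens:
        return 0.0
    return sum(tag == "RES" for _, tag in tagged_tokens) / len(tagged_tokens)


def has_too_many_res(tagged_tokens, threshold=0.75):
    # Strictly greater than the threshold, as in the class method.
    return res_ratio(tagged_tokens) > threshold


# Tags are assigned by hand for the example.
three_of_four = [("a", "RES"), ("b", "RES"), ("c", "RES"), ("d", "N")]
four_of_five = three_of_four[:3] + [("e", "RES"), ("d", "N")]
print(has_too_many_res(three_of_four))  # False: ratio is exactly 0.75
print(has_too_many_res(four_of_five))   # True: ratio is 0.8
```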

**check_sent_should_have_he_as_verb(self, he_kasra_correction, processed_text):**  
This method determines whether the sentence is missing a verb by checking for the absence of a verb and a high ratio of 'RES' tags; it then simulates adding "ه" at the end, reprocesses the sentence, and if the new version shows a valid verb without excessive 'RES' tags, it votes to append "ه" to the last token and updates the token and POS tag lists accordingly.

**veto_if_word_he_part_of_word(self, he_kasra_correction, processed_text):**  
This function iterates over adjacent token pairs and, using the check for integral "ه" in a word, vetoes any correction on tokens where the "ه" is confirmed as an essential part of the word, ensuring that correct formations are not mistakenly altered.

**run(self, input_sentence):**  
This is the main execution method that orchestrates the entire pipeline: it preprocesses the input sentence, initializes a HeKasraCorrection object with the processed data, runs a sequence of error-checking and correction functions, finalizes the corrections, and returns the final correction results.

## System Performance Evaluation

The code snippet below evaluates the system by creating an instance of the `HeKasraExtractor` class and running it on a number of sample inputs. The object returned by the `run` function is printed for each input in this cell's output. The samples cover several types of Heh-Kasra errors, including those in descriptive and possessive constructions, the definite marker "ه", and the verb-like "ه". In the output recorded below, every response is an empty dictionary (`{}`), i.e. no correction was reported for any sample in this run.

In [13]:
import json
import numpy as np
hkasra_extractor = HeKasraExtractor()
input_samples = [
    {
      'text_input': 'کتابه جدید',
      'expected_corrected_text': 'کتاب جدید',
      'correct_input': False
    },
    {
      'text_input': 'خانه‌ی بزرگه',
      'expected_corrected_text': 'خانه‌ی بزرگ',
      'correct_input': False
    },
    {
      'text_input': 'دوستِ عزیزه',
      'expected_corrected_text': 'دوست عزیز',
      'correct_input': False
    },
    {
      'text_input': 'این فیلمه جذابه',
      'expected_corrected_text': 'این فیلم جذابه',
      'correct_input': False
    },
    {
      'text_input': 'سرشار از امیده',
      'expected_corrected_text': 'سرشار از امید',
      'correct_input': False
    },
    {
      'text_input': 'شعرای معاصر',
      'expected_corrected_text': 'شعرای معاصر',
      'correct_input': True
    },
    {
      'text_input': 'ماشینهٔ جدیده',
      'expected_corrected_text': 'ماشینهٔ جدید',
      'correct_input': False
    },
    {
      'text_input': 'پسرک بازیگوشه',
      'expected_corrected_text': 'پسرک بازیگوش',
      'correct_input': False
    },
    {
      'text_input': 'گل‌های رنگینه',
      'expected_corrected_text': 'گل‌های رنگین',
      'correct_input': False
    },
    {
      'text_input': 'آبِ تمیزه',
      'expected_corrected_text': 'آب تمیز',
      'correct_input': False
    },
    {
      'text_input': 'سرورِ دلنشینه',
      'expected_corrected_text': 'سرور دلنشین',
      'correct_input': False
    },
    {
      'text_input': 'کتابخانه‌ی عمومی',
      'expected_corrected_text': 'کتابخانه‌ی عمومی',
      'correct_input': True
    }
]


evaluation = np.zeros((len(input_samples), 5), dtype=object)
for index, sample in enumerate(input_samples):
  response = hkasra_extractor.run(sample['text_input'])
  # An empty response means no correction was made, so fall back to the input text.
  corrected_text = response.get('correct', sample['text_input'])
  print('Text Input: %s' % sample['text_input'])
  print('Service Response', response)
  print('********')
  evaluation[index] = [
      sample['text_input'],
      sample['expected_corrected_text'],
      corrected_text,
      sample['expected_corrected_text'] == corrected_text,
      sample['text_input'] == sample['expected_corrected_text'],
  ]

Text Input: کتابه جدید
Service Response {}
********
Text Input: خانه‌ی بزرگه
Service Response {}
********
Text Input: دوستِ عزیزه
Service Response {}
********
Text Input: این فیلمه جذابه
Service Response {}
********
Text Input: سرشار از امیده
Service Response {}
********
Text Input: شعرای معاصر
Service Response {}
********
Text Input: ماشینهٔ جدیده
Service Response {}
********
Text Input: پسرک بازیگوشه
Service Response {}
********
Text Input: گل‌های رنگینه
Service Response {}
********
Text Input: آبِ تمیزه
Service Response {}
********
Text Input: سرورِ دلنشینه
Service Response {}
********
Text Input: کتابخانه‌ی عمومی
Service Response {}
********


## Results on Test Data

The following code prints a table that displays, for each input, both the expected (correct) output and the service's output. In the next cell, the model's accuracy is computed overall and separately for inputs with and without Heh-Kasra errors.

In [None]:
import pandas as pd
evaluation_df = pd.DataFrame(evaluation, columns=['Raw_Input', 'Expected_Output', 'Model_Output', 'Correct_Prediction', 'No_HeKasra_Error'])
evaluation_df

|   | Raw_Input | Expected_Output | Model_Output | Correct_Prediction | No_HeKasra_Error |
|---|-----------|-----------------|--------------|--------------------|------------------|
| 0 | کوروشه کبیر | کوروش کبیر | کوروش کبیر | True | False |
| 1 | حال من خوب است. | حال من خوب است. | حال من خوب است. | True | True |
| 2 | حاله من خوبه | حال من خوبه | حال من خوبه | True | False |
| 3 | حاله من خوب | حال من خوبه | حال من خوبه | True | False |
| 4 | من اگه کتابه تو رو داشتم | من اگه کتاب تو رو داشتم | من اگه کتابه تو رو داشتم | False | False |
| 5 | پسره داشت میرفت مدرسه | پسره داشت میرفت مدرسه | پسره داشت میرفت مدرسه | True | True |
| 6 | این دختره دیوانه کار دستمون داد | این دختر دیوانه کار دستمون داد | این دختر ددیوانهکار دستمون داد | False | False |
| 7 | گل زیبا | گل زیبا | گل زیبا | True | True |
| 8 | گله زیبایی را تقدیم کردم | گل زیبایی را تقدیم کردم | گل زیبایی را تقدیم کردم | True | False |
| 9 | درختِ بزرگ | درختِ بزرگ | درختِ بزرگ | True | True |


In [None]:
accuracy = evaluation_df['Correct_Prediction'].mean()
he_kasra_acc = evaluation_df.query('No_HeKasra_Error == False')['Correct_Prediction'].mean()
he_kasra_free_acc = evaluation_df.query('No_HeKasra_Error == True')['Correct_Prediction'].mean()

print("Model Accuracy: %f" % accuracy)
print("Model Accuracy When HeKasra Error Occurred: %f" % he_kasra_acc)
print("Model Accuracy When Input Was HeKasra Error Free: %f" % he_kasra_free_acc)

Model Accuracy: 0.863636
Model Accuracy When HeKasra Error Occurred: 0.666667
Model Accuracy When Input Was HeKasra Error Free: 1.000000
