### Group 1

### Alireza Mousavizadeh - 97106284

### Fatemeh Tohidian - 97100354

### Amin Kashiri - 97101026

# Initialization

In [1]:
%pip install spacy torch stanza spacy-stanza transformers nltk hazm black pyspellchecker


In [2]:
!python -m spacy download en_core_web_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180942 sha256=6baf0a0f04273cd67f9ab9dd1541e9ead9bc1c875e7c390491890615af9dccdf
  Stored in directory: /tmp/pip-ephem-wheel-cache-qfdnu71l/wheels/11/95/ba/2c36cc368c0bd339b44a791c2c1881a1fb714b78c29a4cb8f5
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


# Import Required Libraries

In this project, we use the **transformers** library (from **huggingface.co**) to load pre-trained masked language models: BERT and RoBERTa for English, and BERT and ALBERT for Persian.


In [1]:
import editdistance
import pandas as pd
import string

from spacy.lang.fa import Persian
from spacy.lang.en import English
from spellchecker import SpellChecker
from transformers import (
    pipeline,
    BertTokenizer,
    BertForMaskedLM,
    AlbertTokenizer,
    AlbertForMaskedLM,
    RobertaTokenizer,
    RobertaModel,
)

from spacy.tokens.token import Token

# Models

The following is a brief description of these Transformer models and how they differ from, and resemble, the base BERT model:

1. **ALBERT**: BERT base has 110 million parameters, which makes it computationally intensive, so a lighter version with fewer parameters was needed. ALBERT base has about 12 million parameters, with a hidden size of 768 and an embedding size of 128. The lighter model reduces memory use and training time (every layer is still executed at inference, so the saving there is smaller). To reach the smaller parameter count, two techniques are used: **cross-layer parameter sharing** and **factorized embedding parameterization**.

2. **RoBERTa**: RoBERTa stands for "Robustly Optimized BERT Pre-training Approach". In many ways it is an improved version of the BERT model. The key differences are as follows:

    - **Dynamic Masking**: BERT uses static masking, i.e. the same part of a sentence is masked in every epoch. In contrast, RoBERTa uses dynamic masking, where different parts of the sentence are masked in different epochs. This makes the model more robust.

    - **Remove NSP Task**: The next sentence prediction (NSP) task was observed to be of little use for pre-training BERT. Therefore, RoBERTa is pre-trained with the masked language modeling (MLM) task only.

    - **More Data Points**: BERT is pre-trained on the Toronto BookCorpus and English Wikipedia datasets, about 16 GB of data in total. In addition to these two datasets, RoBERTa was also trained on other datasets such as CC-News (Common Crawl News) and OpenWebText, for a total of around 160 GB.

    - **Large Batch Size**: To improve the speed and performance of the model, RoBERTa uses a batch size of 8,000 with 300,000 steps. In comparison, BERT uses a batch size of 256 with 1 million steps.
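
The effect of factorized embedding parameterization can be checked with a little arithmetic. The sketch below takes the 768 and 128 figures quoted above for ALBERT, read as hidden size and embedding size, and assumes an illustrative vocabulary of 30,000 tokens (not a figure from this notebook). `embedding_params` is a hypothetical helper, and only the embedding table is counted, not the rest of the network.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int, embedding_size: Optional[int] = None) -> int:
    """Count the parameters of the token-embedding table.

    Without factorization the table is V x H.
    With factorization it becomes V x E plus an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 768, 128  # V is an assumed, illustrative vocabulary size

bert_style = embedding_params(V, H)       # 30,000 * 768
albert_style = embedding_params(V, H, E)  # 30,000 * 128 + 128 * 768

print(f"unfactorized: {bert_style:,}")   # 23,040,000
print(f"factorized:   {albert_style:,}") # 3,938,304
```

Under these assumptions the embedding table shrinks by a factor of almost six, which is one of the two sources of ALBERT's smaller parameter count.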


In [11]:
# torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch_device = "cpu"
print(f"Torch Device: {torch_device}")

language = 'fa'
model_type = 'bert'

# language = "en"
# model_type = "roberta"

if language == "en":
    if model_type == "bert":
        model_name = "bert-large-uncased"  # Bert large
        # model_name = "bert-base-uncased" # Bert base
    elif model_type == "roberta":
        model_name = "roberta-large"  # Roberta
    else:
        raise ValueError(f"{model_type} model not found.")

elif language == "fa":
    if model_type == "bert":
        # model_name = "HooshvareLab/bert-fa-base-uncased"  # BERT V2
        model_name = "HooshvareLab/bert-fa-zwnj-base"  # BERT V3
    elif model_type == "albert":
        model_name = "HooshvareLab/albert-fa-zwnj-base-v2"  # Albert
    else:
        raise ValueError(f"{model_type} model not found.")

else:
    raise ValueError(f"{language} language not found.")

if model_type == "bert":
    MASK = "[MASK]"
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForMaskedLM.from_pretrained(model_name).to(torch_device)
    unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

elif model_type == "albert":
    MASK = "[MASK]"
    tokenizer = AlbertTokenizer.from_pretrained(model_name)
    model = AlbertForMaskedLM.from_pretrained(model_name).to(torch_device)
    unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

elif model_type == "roberta":
    MASK = "<mask>"
    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    unmasker = pipeline("fill-mask", model=model_name, tokenizer=tokenizer)

else:
    raise ValueError(f"{model_type} not found.")

vocab: set = set(tokenizer.get_vocab().keys())

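if model_type == "roberta":
    # RoBERTa's BPE marks word-initial tokens with a leading "Ġ"; strip only that marker.
    vocab = {s[1:] if s.startswith("Ġ") else s for s in vocab}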

print(f"len vocab: {len(vocab)}")
print(f"{language} {model_type} Model Loaded ...")


Torch Device: cpu


len vocab: 42000
fa bert Model Loaded ...


# Spacy

spaCy's tokenization is non-destructive: it always represents the original input text and never adds or deletes anything. This is a core principle of the `Doc` object: you should always be able to reconstruct the original input text from it.
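
A small check of this property, assuming spaCy is installed as in the setup cell. The sample sentence is borrowed from the English test cases. Each token records its character offset in `token.idx`, which is what the correction code later relies on to mask a token in place.

```python
from spacy.lang.en import English

tokenizer = English().tokenizer
text = "When he was walking on the roog, he saw a start."
doc = tokenizer(text)

# Joining every token with its trailing whitespace restores the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)

# token.idx is the character offset of the token in the original text,
# so slicing the text at that offset gives back the token itself.
for token in doc:
    start = token.idx
    end = start + len(token)
    assert text[start:end] == token.text
```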

## Setup

We use the **spaCy** library for the Persian and English tokenizers.

In [12]:
if language == "fa":
    TOKENIZER = Persian().tokenizer

elif language == "en":
    TOKENIZER = English().tokenizer

else:
    raise ValueError(f"{language} not supported.")


# Half-Space Handling

In Persian text, the half-space (zero-width non-joiner, ZWNJ) plays a key role. Unfortunately, the pre-trained Persian models do not support half-spaces, and their predicted words never contain one.
The function below checks whether a predicted word differs from the original word in the input only by half-spaces; if so, we leave the original word unchanged.

In [4]:
def half_space_case(predicted: str, current: str) -> bool:
    # True when `current` equals `predicted` once half-spaces (ZWNJ, U+200C) are removed.
    wo_half_space_current = current.replace("\u200c", "")
    return wo_half_space_current == predicted
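
A short usage sketch of this check. The function is restated so the snippet runs on its own, and the half-space is written as the explicit escape `\u200c` (assumed here to be the character meant by "half-space"). The example word is illustrative and not taken from the test set.

```python
def half_space_case(predicted: str, current: str) -> bool:
    # True when the two words differ only by half-spaces (ZWNJ, U+200C).
    return current.replace("\u200c", "") == predicted


prefix, stem = "می", "روم"
with_half_space = prefix + "\u200c" + stem  # the form a careful writer types
without_half_space = prefix + stem          # the form a model without ZWNJ predicts

# Same word apart from the half-space: the input token should be kept as is.
print(half_space_case(without_half_space, with_half_space))

# A genuinely different word is not treated as a half-space case.
print(half_space_case(stem, with_half_space))
```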



# Spell Correction pipeline

## Lexical Correction:

1. For each token in the input text, we check whether it exists in the tokenizer vocabulary.
    If it does, we do not treat it as a lexical error and move on to the next token.
    If it does not, the token is probably misspelled. We mask it and predict the masked token with the language model (in this project, Transformer models or n-grams).
    Each predicted token is then scored using the following factors:
   1. Score (the language model's prediction score): $s$
   2. Edit distance (between the predicted token and the input token): $e$
   3. Token length (length of the input token): $l$
   4. $\alpha$ (a hyperparameter that weights edit distance against the model's prediction score)

   For each predicted token, we compute the following objective function:

    $$f(s,e,l) = \left(\frac{l}{e + 1}\right)^\alpha s$$

2. Next, we keep only the predicted tokens whose edit distance is at most the `MAX_EDIT_DISTANCE` hyperparameter, sort them in descending order of the objective function value from the previous step, and select the first one as the result. If no tokens remain after filtering, the lexically closest word is selected from the whole vocabulary, regardless of context.

3. If the token changed in the previous step, the process restarts from the beginning of the input text; otherwise, it continues with the next token until it reaches the end of the input text.
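
To get a feel for the objective function, the sketch below evaluates it for two candidates, using rounded scores from the sample output at the end of this notebook (a four-letter input token, $\alpha = 30$). `total_score` is a hypothetical helper name. Because the scores are rounded, the results only approximate the logged `total_score` column.

```python
def total_score(score: float, edit_distance: int, token_length: int, alpha: float) -> float:
    """f(s, e, l) = (l / (e + 1)) ** alpha * s"""
    return (token_length / (edit_distance + 1)) ** alpha * score


# A candidate one edit away from the input token, with a fairly high model score.
close = total_score(score=0.018338, edit_distance=1, token_length=4, alpha=30)

# A candidate two edits away, with a lower model score.
far = total_score(score=0.001177, edit_distance=2, token_length=4, alpha=30)

print(f"one edit away:  {close:.3e}")
print(f"two edits away: {far:.3f}")
```

With a large $\alpha$, one extra edit lowers the score by several orders of magnitude, so the model score mainly breaks ties between candidates at the same edit distance.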

## Contextual Correction

- As in the previous part, for each token in the input text, we mask the token and predict it with the language model (in this project, Transformer models or n-grams). We keep only the predicted tokens whose edit distance is at most the `MAX_EDIT_DISTANCE` hyperparameter, sort them in descending order of the objective function value, and select the first one as the result. If no tokens remain after filtering, the token is left unchanged, and the process continues with the next token until it reaches the end of the input text.
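
The filter-then-rank step can be sketched in plain Python. This is a simplified stand-in, not the class below: `levenshtein` and `select_candidate` are hypothetical helpers, the candidate scores are made up for illustration, and the real code takes its candidates from the fill-mask pipeline.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def select_candidate(token, candidates, alpha=5, max_edit_distance=2):
    """Return the best-scoring candidate within the edit-distance limit,
    or the token itself when no candidate survives the filter."""
    scored = []
    for word, score in candidates:
        e = levenshtein(token, word)
        if e <= max_edit_distance:
            scored.append(((len(token) / (e + 1)) ** alpha * score, word))
    if not scored:
        return token
    return max(scored)[1]


# Hypothetical model predictions for the masked position of "start".
predictions = [("star", 0.40), ("light", 0.30), ("start", 0.01)]
print(select_candidate("start", predictions))        # "star" outranks keeping "start"

# No candidate is close enough, so the token is left unchanged.
print(select_candidate("roof", [("house", 0.50)]))
```

Note that the input token itself usually appears among the predictions with edit distance 0, so a replacement only wins when its model score is high enough to offset the edit-distance penalty.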

In [43]:

class SpellCorrector:
    def __init__(self, tokenizer, alpha=5, max_edit_distance=2, verbose=False, top_k=50):
        self.tokenizer = tokenizer
        self.alpha = alpha
        self.max_edit_distance = max_edit_distance
        self.verbose = verbose
        self.top_k = top_k
        self.spell_checker = SpellChecker()

    def print_summary(self, type):
        text = self.text
        current_token = self.current_token
        start_char_index = self.start_char_index
        end_char_index = self.end_char_index

        if self.some_token_corrected or self.verbose:
            print("*" * 50)
            print(f"Token: {current_token.text}")

            if self.verbose:
                print("Filtered Predicts: \n")
                if current_token.text in string.punctuation:
                    print(self.filtered_predicts[["token_str", "score"]])
                else:
                    print(self.filtered_predicts[["token_str", "score", "total_score"]])

            if self.some_token_corrected:
                print(f"{current_token.text} -> {self.selected_predict} : {type}")
                typo_correction_details = {
                    "raw": current_token.text,
                    "corrected": self.selected_predict,
                    "span": f"[{start_char_index}, {end_char_index}]",
                    # Clamp the start so a token near the beginning cannot yield a negative index.
                    "around": text[max(start_char_index - 10, 0) : end_char_index + 10],
                    "type": type,
                }

                print(typo_correction_details)
            print("#" * 50)


    def set_predictions(self):
        start_char_index: int = self.current_token.idx
        end_char_index = start_char_index + len(self.current_token)

        masked_text = (
            self.text[:start_char_index] + MASK + self.text[end_char_index:]
        )

        predicts = unmasker(masked_text, top_k=self.top_k)
        predicts = pd.DataFrame(predicts)
        
        self.predicts = predicts
        self.start_char_index = start_char_index
        self.end_char_index = end_char_index
        self.masked_text = masked_text
        return predicts

    def set_filtered_predictions(self):
        predicts = self.predicts
        predicts.loc[:, "token_str"] = predicts["token_str"].apply(
            lambda tk: tk.replace(" ", "")
        )
        predicts.loc[:, "edit_distance"] = predicts["token_str"].apply(
            lambda tk: editdistance.eval(self.current_token.text, tk)
        )

        # Keep only tokens whose edit distance is at most max_edit_distance
        filtered_predicts = predicts.loc[
            predicts["edit_distance"] <= self.max_edit_distance, :
        ].copy()

        # Apply total score function
        # e: edit distance + 1
        # l: token length
        filtered_predicts.loc[:, "e_to_l"] = (
            filtered_predicts.loc[:, "edit_distance"] + 1
        ) / len(self.current_token.text)

        filtered_predicts.loc[:, "total_score"] = (
            filtered_predicts.loc[:, "score"]
            / filtered_predicts.loc[:, "e_to_l"] ** self.alpha
        )

        filtered_predicts = filtered_predicts.sort_values(
            "total_score", ascending=False
        )
        self.filtered_predicts = filtered_predicts

    def correct_predict(self, selected_predict):
        if selected_predict != self.current_token.text:
            if not half_space_case(selected_predict, self.current_token.text):
                self.some_token_corrected = True
                self.text = self.masked_text.replace(MASK, selected_predict, 1)

            else:
                vocab.add(self.current_token.text)

        self.selected_predict = selected_predict


    def correct_lexico_typo(self):
        while True:
            self.some_token_corrected = False
            self.tokens = list(self.tokenizer(self.text))
            for index, current_token in enumerate(self.tokens):
                self.current_token: Token = current_token

                if current_token.text not in vocab:
                    self.set_predictions()

                    try:
                        if current_token.text in string.punctuation:
                            selected_predict = self.predicts["token_str"].iloc[0]
                        elif any(c.isdigit() for c in current_token.text):
                            # Leave tokens that contain digits unchanged.
                            selected_predict = current_token.text
                        else:
                            self.set_filtered_predictions()
                            selected_predict_row = self.filtered_predicts.iloc[0, :]
                            selected_predict = selected_predict_row["token_str"]
                    except Exception as e:
                        print(
                            f"Error: {e} From {current_token.text} Filtered Predictions Length: {len(self.filtered_predicts)}"
                        )
                        # Context-free fallback; keep the token if no suggestion is returned.
                        selected_predict = (
                            self.spell_checker.correction(current_token.text)
                            or current_token.text
                        )

                    self.correct_predict(selected_predict)
                    self.print_summary('lexical')

                    if self.some_token_corrected:
                        break

            if not self.some_token_corrected:
                break


    def correct_contextual_typo(self):
        index = 0
        while True:
            self.some_token_corrected = False
            self.tokens = list(self.tokenizer(self.text))
            for j in range(index, len(self.tokens)):
                current_token: Token = self.tokens[j]
                self.current_token = current_token
                self.set_predictions()

                try:
                    if current_token.text in string.punctuation:
                        self.filtered_predicts = self.predicts.loc[
                            self.predicts["token_str"].apply(lambda tk: tk in string.punctuation), :
                        ].copy()
                        selected_predict = self.filtered_predicts["token_str"].iloc[0]
                    elif any(c.isdigit() for c in current_token.text):
                        selected_predict = current_token.text
                    else:
                        self.set_filtered_predictions()
                        selected_predict_row = self.filtered_predicts.iloc[0, :]
                        selected_predict = selected_predict_row["token_str"]

                except Exception as e:
                    selected_predict = current_token.text
                    print(
                        f"Error: {e} From {current_token.text} Filtered Predictions Length: {len(self.filtered_predicts)}"
                    )

                self.correct_predict(selected_predict)
                self.print_summary('contextual')
                index += 1
                if self.some_token_corrected:
                    break

            if not self.some_token_corrected:
                break


    def correction_pipeline(self):
        print(f"Lexico Correction ... . text = {self.text}") 
        self.correct_lexico_typo()

        print(f"Contextual Correction ... . text = {self.text}")
        self.correct_contextual_typo()


    def __call__(self, text, *args, **kwargs):
        self.text = text
        self.correction_pipeline()
        return self.text


# Test on Sample Texts

In [44]:
if language == "en":
    test_cases = [
        {
            "input_text": """
            When he was walking on the roog, he saw a start that was very shining.
        """,
            "true_text": """
            When he was walking on the roof, he saw a star that was very shining.
        """,
        },
        {
            "input_text": """
            Being one of the larges cities in word, tehran is always crowded with pollution walking left and write.
        """,
            "true_text": """
            Being one of the largest cities in world, tehran is always crowded with population walking left and right.
        """,
        },    
        {
            "input_text": """
            I was playing fotball, but then I broke my legg. The goal keeper saved a very powerfull shout. It as a very good hatch.
        """,
            "true_text": """
            I was playing football, but then I broke my leg. The goal keeper saved a very powerful shot. It was a very good match.
        """,
        },    
        {
            "input_text": """
            The quantity thoery of money also assume that the quantity of money in an economy has a large influense on its level of economic activity. So, a change in the money supply results in either a change in the price levels or a change in the sopply of gods and services, or both. In addition, the theory assumes that changes in the money supply are the primary reason for changes in spending.
        """,
            "true_text": """
            The quantity theory of money also assumes that the quantity of money in an economy has a large influence on its level of economic activity. So, a change in the money supply results in either a change in the price levels or a change in the supply of goods and services, or both. In addition, the theory assumes that changes in the money supply are the primary reason for changes in spending.
        """,
        },
        {
            "input_text": """
            Does it privent Iran from getting nuclear weapens. Many exports say that if all parties adhered to their pledges, the deal almost certainly could have achieved that goal for longer than a dekade!
        """,
            "true_text": """
            Does it prevent Iran from getting nuclear weapons? Many experts say that if all parties adhere to their pledges, the deal almost certainly could have achieved that goal for longer than a decade.
        """,
        },
        {
            "input_text": """
            The Federal Reserve monitor risks to the financal system and works to help insure the system supports a haelthy economy for US households, communities, and busineses.
        """,
            "true_text": """
            The Federal Reserve monitors risks to the financial system and works to help ensure the system supports a healthy economy for US households, communities, and businesses.
        """,
        },
        {
            "input_text": """
            Bitcoin is a decentrallized digital curency that can be transfered on the peer-to-peer bitcoin network. Bitcoin transactions are veryfied by network nodes throgh cryptography and recorded in a public distributed ledger called a blockchain. The criptocurrency was invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The curency began use in 2009 when its implemntation was released as open-source software.
        """,
            "true_text": """
            Bitcoin is a decentralized digital currency that can be transferred on the peer-to-peer bitcoin network. Bitcoin transactions are verified by network nodes through cryptography and recorded in a public distributed ledger called a blockchain. The cryptocurrency was invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software.
        """,
        },
        {
            "input_text": """
            The 2022 FILA World Cup is scheduled to be the 22nd running of the FILA World Cup competition, the quadrennial international men's football championship contested by the national teams of the member associations of FIFA. It is scheduled to take place in Qatar from 21 Novamber to 18 Decamber 2022.
        """,
            "true_text": """
            The 2022 FIFA World Cup is scheduled to be the 22nd running of the FIFA World Cup competition, the quadrennial international men's football championship contested by the national teams of the member associations of FIFA. It is scheduled to take place in Qatar from 21 November to 18 December 2022.
        """,
        },
        {
            "input_text": """
            President Daneld Trump annonced on Tuesday he well withdraw the United States from the Iran nuclear deal and restore far-reaching sanktions aimed at withdrawal Iran from the global finansial system.
        """,
            "true_text": """
            President Donald Trump announced on Tuesday he will withdraw the United States from the Iran nuclear deal and restore far-reaching sanctions aimed at withdrawal Iran from the global financial system.
        """,
        },
        {
            "input_text": """
            Cars has very sweet features. It has two beautifull eye, adorable tiny paws, sharp claws, and two fury ear which are very sensitive to sounds. It has a tiny body covered with sot fur and it has a furry tail as well. Cats have an adorable face with a tiny nose.
        """,
            "true_text": """
            Cat has very sweet features. It has two beautiful eyes, adorable tiny paws with sharp claws, and two furry ears which are very sensitive to sounds. It has a tiny body covered with soft fur and it has a furry tail as well. Cats have an adorable face with a tiny nose.
        """,
        },
    ]

    if model_type != "roberta":
        for test_case in test_cases:
            test_case["input_text"] = test_case["input_text"].lower()
            test_case["true_text"] = test_case["true_text"].lower()

elif language == "fa":
    test_cases = [
        {
            "input_text": "وقتی قیمت گوست قرمز یا صفید در کشورهای دیگر بیشتر شده است، ممکن است در جیران هم گرا شود.",
            "true_text": """""",
        },
        {
            "input_text": " بسیاری از مباحث علوم غیرطبیعی با استفاده از فیزیک دنیای مادی ابل توجیح نیست و برای یادگیری باید به فلسفه‌های خاصی رجو کرد.",
            "true_text": """""",
        },  
        {
            "input_text": "پس از سال‌ها تلاش، رازی موفق به کسف الکل شد. این دانشمند تیرانی باعث افتخار در تاریخ کور است.",
            "true_text": """""",
        },
        {
            "input_text": "در هفته گذشته قیمت تلا تغییر چندانی نداشت، و در همان محدوده 1850 دلاری کار خود را به پایان رساند. ",
            "true_text": """""",
        },
        {
            "input_text": "بر اساس مسوبه سران قوا، معاملات فردایی طلا همانند معاملات فردایی ارض، ممنوع و غیرقانونی شناخته شد و فعالان این بازار به جرم اخلال اقتصادی، تحت پیگرد قرار خواهند گرفت. در نتیجه تانک مرکزی در بازار فردایی مداخله نخواهد کرد",
            "true_text": """""",
        },
        {
            "input_text": """
        با نزدیک شدن قیمت دار غیر رسمی به سفف خود در روز قبل، تحلیلگران در بازار برای هفته بعد هشدار میدادند که باید احطیاط کرد و اقدامات امنیتی در بازار افزایش خواهد یافت.
        """,
            "true_text": """""",
        },
        {
            "input_text": """
        با تولانی شدن جنگ روسیه و اوکراین و سهم قابل توجهی که این دو کشور در تأمین کندم جهان داشتند، بازار کندم با نوسانات زیادی مواجه شد و قیمت محصولاتی که مواد اولیه‌شان کندم بود، در همه جای جهان افزایش یافت.
        """,
            "true_text": """""",
        },
        {
            "input_text": """علت واقعی تعویق در مزاکرات وین چیست.""",
            "true_text": """""",
        },
    ]

else:
    raise ValueError(f"{language} language not found.")

ALPHA = 5 if language == "en" else 30
MAX_EDIT_DISTANCE = 3 if language == "en" else 2
TOP_K = 500 if language == "en" else 5000
VERBOSE = False

for test_case in test_cases:
    test_case["input_text"] = test_case["input_text"].strip()
    test_case["true_text"] = test_case["true_text"].strip()

spell_corrector = SpellCorrector(TOKENIZER, ALPHA, MAX_EDIT_DISTANCE, VERBOSE, TOP_K)

for idx in range(len(test_cases)):
    test_case = test_cases[idx]

    input_text = test_case["input_text"]

    output_text = spell_corrector(input_text)


    print('Is output corrected: ', output_text == test_case["true_text"])

    print('Corrected text: ', output_text)

    print("\n")
    print("* " * 50)
    print(" *" * 50)
    print("\n")
    break


Lexico Correction ... . text = وقتی قیمت گوست قرمز یا صفید در کشورهای دیگر بیشتر شده است، ممکن است در جیران هم گرا شود.
**************************************************
Token: صفید
Filtered Predicts: 

     token_str     score   total_score
3         سفید  0.018338  1.968992e+07
3307       صید  0.000029  3.096770e+04
75         فیل  0.001177  6.588904e+00
225        دید  0.000452  2.532494e+00
245       شاید  0.000417  2.334473e+00
340       سپید  0.000303  1.699325e+00
344       جدید  0.000300  1.678344e+00
568      سفیدی  0.000187  1.047382e+00
637        عید  0.000168  9.410656e-01
729       خرید  0.000149  8.333981e-01
1046      سعید  0.000101  5.654131e-01
1089       شید  0.000097  5.448662e-01
1425       صفر  0.000072  4.051298e-01
1485      شدید  0.000069  3.865040e-01
2241       کید  0.000045  2.525382e-01
2350       صمد  0.000043  2.398875e-01
2686       فین  0.000037  2.051143e-01
2818      واید  0.000035  1.945062e-01
3755      صدفی  0.000025  1.381866e-01
3839       فیک  

In [None]:
f"الکل" in vocab