# Question 1:

Before anything else, we import the required libraries to run the code:

In [1]:
import hazm
import nltk
from nltk import bigrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import flatten
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
import random

We read the English and Persian text files into the text_en and text_fa variables:

In [2]:
with open("hp_en.txt", "r", encoding="utf-8") as f:
    text_en = f.read()

In [3]:
with open("hp_fa.txt", "r", encoding="utf-8") as f:
    text_fa = f.read()

### part a) Applying the necessary preprocessing:

We normalize the text by using the "hazm" library, which is for processing Persian texts:

In [4]:
normalizer = hazm.Normalizer()
normalized_text = normalizer.normalize(text_fa)

With the help of this library, we segment the sentences:

In [5]:
sentences = hazm.sent_tokenize(normalized_text)
print("Number of sentences: " + str(len(sentences)))

Number of sentences: 7021


Next, we tokenize each sentence into words. This gives the list-of-token-lists format that the library guide uses as input (https://www.nltk.org/api/nltk.lm.html).

In [6]:
new_sentences = []
for i in sentences:
    tokens = nltk.word_tokenize(i)
    new_sentences.append(tokens)

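Each sentence is padded with the start symbol `<s>` and the end symbol `</s>`. The call below only illustrates the padding: it returns a lazy iterator that is not stored, and the padding actually used for training is done by padded_everygram_pipeline in part b: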

In [7]:
flatten(pad_both_ends(sent, n=2) for sent in new_sentences)

<itertools.chain at 0x2c883ac2590>

### part b) Training language model:

Bigrams of the first sentence:

In [8]:
print(list(bigrams(pad_both_ends(new_sentences[0], n=2))))

[('<s>', '\ufeffآقا'), ('\ufeffآقا', 'و'), ('و', 'خانم'), ('خانم', 'دورسلی'), ('دورسلی', 'ساکن'), ('ساکن', 'خانه'), ('خانه', 'شماره'), ('شماره', 'چهار'), ('چهار', 'خیابان'), ('خیابان', 'پریوت'), ('پریوت', 'درایو'), ('درایو', 'بودند'), ('بودند', '.'), ('.', '</s>')]
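
The first token in the output above starts with '\ufeff'. This is a byte order mark from the beginning of hp_fa.txt that survives decoding with encoding="utf-8" and becomes part of the first word. The sketch below uses a short made-up byte string, not the real file, to show the difference between the "utf-8" and "utf-8-sig" codecs. The outputs in this notebook were produced without this fix.

```python
import codecs

# A made-up stand-in for the first bytes of a UTF-8 file saved with a BOM.
raw = codecs.BOM_UTF8 + "آقا و خانم".encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM is kept as the character '\ufeff'
without_bom = raw.decode("utf-8-sig")  # BOM is dropped while decoding

print(repr(with_bom[0]))
print(with_bom.lstrip("\ufeff") == without_bom)
```

Opening the file with encoding="utf-8-sig", or stripping '\ufeff' from the text after reading, would keep the mark out of the vocabulary.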


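padded_everygram_pipeline applies the default preprocessing to the whole sequence of sentences: it pads each sentence and extracts all n-grams up to order 2: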

In [9]:
train_2, vocab_2 = padded_everygram_pipeline(2, new_sentences)
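
As a rough illustration of what the pipeline prepares, the sketch below pads one sentence and lists all of its n-grams up to a given order in plain Python. The helper names pad_sentence and everygrams_of are hypothetical, and the order in which the n-grams are listed may differ from NLTK's.

```python
def pad_sentence(tokens, n, start="<s>", end="</s>"):
    """Add n-1 start symbols and n-1 end symbols around a token list."""
    pad = n - 1
    return [start] * pad + list(tokens) + [end] * pad


def everygrams_of(tokens, max_n):
    """All n-grams of order 1..max_n, shortest order first."""
    return [
        tuple(tokens[i:i + k])
        for k in range(1, max_n + 1)
        for i in range(len(tokens) - k + 1)
    ]


padded = pad_sentence(["خانم", "دورسلی"], n=2)
print(padded)
print(everygrams_of(padded, 2))
```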

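We define a bigram model with add-one (Laplace) smoothing. The argument of Laplace is the n-gram order; the additive constant is fixed at 1 by the class itself:

In [10]:
lm_2 = Laplace(2)  # order 2 = bigram model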

Now we can train the model on our text and print the vocabulary size:

In [11]:
lm_2.fit(train_2, vocab_2)
print(len(lm_2.vocab))

9467


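The counter stores counts for both orders (unigrams and bigrams), so the total below is the number of counted 1-grams and 2-grams together, not only the 2-grams: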

In [12]:
print(lm_2.counts)

<NgramCounter with 2 ngram orders and 212119 ngrams>


For example, we want to know the probability that “هوا” (weather) follows “امسال” (this year), i.e. P(هوا | امسال):

In [13]:
print(lm_2.score("هوا", ["امسال"]))

0.00010552975939214858
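
The score above can be checked against the add-one formula, P(w | prev) = (count(prev, w) + 1) / (count(prev) + V), with V = 9467 taken from the vocabulary size printed earlier. The two counts used below (0 for the bigram, 9 for the context) are inferred from the reported value and were not read from the notebook; other count combinations could in principle give the same number.

```python
import math


def laplace_score(ngram_count, context_count, vocab_size):
    """Add-one smoothed conditional probability."""
    return (ngram_count + 1) / (context_count + vocab_size)


vocab_size = 9467                    # len(lm_2.vocab) printed above
reported = 0.00010552975939214858    # lm_2.score output above

# Assumed counts: the bigram never occurs, the context word occurs 9 times.
candidate = laplace_score(ngram_count=0, context_count=9, vocab_size=vocab_size)
print(candidate, math.isclose(candidate, reported, rel_tol=1e-9))
```

If these counts are right, the bigram is unseen in the book and its whole probability comes from the smoothing term.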


### part c) Importance of Laplace smoothing:

An n-gram model estimated from raw relative frequencies assigns probability zero to every n-gram that never occurs in the training text. Since the probability of a sentence is the product of its n-gram probabilities, a single unseen n-gram makes the probability of the whole sentence zero and its perplexity infinite.

Laplace (add-one) smoothing avoids this problem by adding 1 to every n-gram count and adding the vocabulary size to the denominator, which moves a small amount of probability mass from seen n-grams to unseen ones. Every n-gram then has a non-zero probability, and the model is less overfitted to the exact word sequences of the training data.
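
A minimal sketch of the zero-frequency problem on a made-up two-sentence corpus (not the book text) follows. The function names are hypothetical, and the context count is approximated by the unigram count of the previous word.

```python
from collections import Counter

# Toy corpus, already padded for a bigram model.
corpus = [
    ["<s>", "harry", "saw", "hagrid", "</s>"],
    ["<s>", "hagrid", "saw", "harry", "</s>"],
]

unigrams = Counter(tok for sent in corpus for tok in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
vocab_size = len(unigrams)


def mle(prev, word):
    """Unsmoothed relative frequency."""
    return bigram_counts[(prev, word)] / unigrams[prev]


def laplace(prev, word):
    """Add-one smoothed estimate."""
    return (bigram_counts[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def sentence_prob(sent, prob):
    """Product of the bigram probabilities of a padded sentence."""
    p = 1.0
    for prev, word in zip(sent, sent[1:]):
        p *= prob(prev, word)
    return p


# ("harry", "hagrid") never occurs in the toy corpus.
unseen = ["<s>", "harry", "hagrid", "</s>"]
print(sentence_prob(unseen, mle))      # zero: one unseen bigram cancels the product
print(sentence_prob(unseen, laplace))  # small but non-zero
```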

### part d) Generating new sentences:

In [14]:
print(lm_2.generate(random.randint(12, 24)))

['یادمون', 'رفت', 'تعقیبش', 'کنه', 'که', 'وقت', 'چنین', 'استنباط', 'کرد', 'و', 'غول', 'پیکر', 'که', 'جادوگر', 'بوده', '.', '</s>', 'را', 'به', 'طبقه\u200cی', 'بالا', 'گرفت', 'واقعیت']


In [15]:
print(lm_2.generate(random.randint(12, 24)))

[':', '-', 'هیس', '!', '</s>', 'اتومبیلی', 'را', 'شناخت', 'و', 'منتظر', 'چیست', '.', '</s>', '…']


In [16]:
print(lm_2.generate(random.randint(12, 24)))

['دست', 'چاقش', 'که', 'از', 'ساعت', 'یازده', 'سیکل', 'نقره', 'ایه', '.', '</s>', 'چیزی', 'میدانست', 'کجا', 'فهمیدین', 'ولی', 'بعد', 'متوجه', 'حالت', 'رئیس', 'این', 'قدر']


In [17]:
print(lm_2.generate(random.randint(12, 24)))

['تونسته', 'راه', 'می\u200cرفت', 'و', 'سر', 'داد', '.', '</s>', 'ما', 'توی', 'پاتیل', 'درزدار', 'برد', 'دوباره', 'جام', 'قهرمانی', 'صعود']


In [18]:
print(lm_2.generate(random.randint(12, 24)))

['همه', 'دوستشون', 'دارن', 'بمبارون', 'میکنن؟', '</s>', 'خوردن', 'شیرینی', 'تعارف', 'کرد', '.', '</s>', '</s>', '<s>', 'توی', 'اسلیترین', 'میتونه', 'جلوتونو', 'بگیره', '.', '</s>', 'و', 'احمق', '!']


### part e) 3-grams and 5-grams:

For 3-gram we have:

In [19]:
train_3, vocab_3 = padded_everygram_pipeline(3, new_sentences)

In [20]:
lm_3 = Laplace(3)  # order 3 = trigram model

In [21]:
lm_3.fit(train_3, vocab_3)
print(len(lm_3.vocab))

9467


In [22]:
print(lm_3.generate(random.randint(12, 24)))

['را', 'به', 'چنگ', 'آورد', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [23]:
print(lm_3.generate(random.randint(12, 24)))

['</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [24]:
print(lm_3.generate(random.randint(12, 24)))

['باید', 'به', 'بدعنق', 'فرصت', 'بدیم؟', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [25]:
print(lm_3.generate(random.randint(12, 24)))

['</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [26]:
print(lm_3.generate(random.randint(12, 24)))

['همیشه', 'اعضای', 'تیمش', 'را', 'به', 'سوی', 'آن\u200cها', 'می\u200cآمد', 'پوزخندی', 'زد', 'و', 'به', 'سوی', 'آشپزخانه', 'رفت', 'و', 'پچ', 'پچ', 'ضعیفی', 'را']


For 5-gram we have:

In [27]:
train_5, vocab_5 = padded_everygram_pipeline(5, new_sentences)

In [28]:
lm_5 = Laplace(5)  # order 5 = 5-gram model

In [29]:
lm_5.fit(train_5, vocab_5)
print(len(lm_5.vocab))

9467


In [30]:
print(lm_5.generate(random.randint(12, 24)))

['<s>', '<s>', '<s>', '-', 'تو', 'نباید', 'شب', 'توی', 'مدرسه', 'پرسه', 'بزنی', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [31]:
print(lm_5.generate(random.randint(12, 24)))

['</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [32]:
print(lm_5.generate(random.randint(12, 24)))

['برداشت', 'کنیم', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [33]:
print(lm_5.generate(random.randint(12, 24)))

['وقتی', 'آن', 'موجود', 'در', 'نور', 'مهتاب', 'قرار', 'گرفت', 'آن', 'را', 'ببینند', '.', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


In [34]:
print(lm_5.generate(random.randint(12, 24)))

['</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
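
Several of the 3-gram and 5-gram samples above consist mostly of padding symbols, because generation continues after an end-of-sentence symbol has been produced. One possible way to make the samples easier to compare is to keep only the first non-empty sentence and drop the padding. The helper below is a hypothetical post-processing step and was not used to produce the outputs in this notebook.

```python
def trim_generated(tokens, start="<s>", end="</s>"):
    """Return the first non-empty sentence of a generated token list, without padding."""
    sentence = []
    for tok in tokens:
        if tok == start:
            continue              # skip start-of-sentence padding
        if tok == end:
            if sentence:
                break             # first real sentence is complete
            continue              # leading end symbols: keep looking
        sentence.append(tok)
    return sentence


sample = ['را', 'به', 'چنگ', 'آورد', '.', '</s>', '</s>', '</s>']
print(trim_generated(sample))
print(trim_generated(['</s>', '</s>', '</s>']))
```

A sample that contains only padding symbols becomes an empty list, which makes the degenerate cases easy to count.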


### part f) Comparing sentences generated by the three models:

The 2-gram model is simpler and faster to compute, but it cannot capture longer-range dependencies between words, so its sentences tend to lose coherence after a few words.

The 5-gram model is more complex and computationally expensive, and it can capture longer-range relationships between words. However, it cannot generate good new sentences when the training corpus is too small, which is the case in this problem: most 5-word contexts are rare, so the model tends to either reproduce fragments of the training text or emit only padding symbols.

The 3-gram model captures some longer-range dependencies between words and makes a good trade-off between the 2-gram and 5-gram models.

### part g) Perplexity of model:

The first sentence is similar in content to the text on which the model was trained, whereas the second sentence differs from that text in style. The model therefore assigns a higher probability to the first sentence than to the second. Since perplexity decreases as probability increases, the perplexity of the second sentence is higher.
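
The relation between probability and perplexity can be sketched as follows. The per-token probabilities are made-up numbers chosen only to illustrate an in-domain and an out-of-domain sentence; they are not scores from the trained models. For the real models, nltk.lm provides a perplexity method that takes the n-grams of the test sentence.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the scored tokens."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


# Hypothetical conditional probabilities of each token given its context.
in_domain = [0.2, 0.1, 0.25, 0.2]
out_of_domain = [0.01, 0.002, 0.01, 0.005]

print(perplexity(in_domain))
print(perplexity(out_of_domain))
```

Lower per-token probabilities give a higher perplexity, which matches the argument above.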

# Question 2:

In [35]:
import spacy
from tokenizers import ByteLevelBPETokenizer
from tabulate import tabulate
from tokenizers.decoders import ByteLevel

### part a) Three tokenization methods:

- The white space tokenizer simply splits text into tokens based on white space characters like spaces, tabs, and newlines.

- The spaCy tokenizer segments the text into words, punctuation marks and so on by applying rules specific to each language. For example, punctuation at the end of a sentence is split off, whereas “U.K.” remains one token.

- Byte Pair Encoding (BPE) is a type of subword tokenizer that learns to represent words as a sequence of smaller subword units. This can be useful for handling rare or out-of-vocabulary words, as well as for languages with complex morphological structure.
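
To make the BPE description more concrete, here is a minimal character-level sketch of how merges are learned: the most frequent adjacent pair of symbols is merged repeatedly. It is a simplification written for illustration (the function names are hypothetical), not the byte-level implementation in the tokenizers library.

```python
from collections import Counter


def most_frequent_pair(words):
    """Most frequent adjacent symbol pair in a {symbol tuple: frequency} table."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of the pair by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to num_merges merges from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = learn_bpe("low low low lower", num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "low" is a single symbol, while the rarer "lower" is still split into "low", "e" and "r".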

### part b) Text tokenization:

White space tokenization:

In [36]:
tokens_1_en = text_en.split()
tokens_1_fa = text_fa.split()

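spaCy tokenization:

In [37]:
# Run once to download the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Note: the English pipeline is applied to both books, including the Persian one.
doc_en = nlp(text_en)
doc_fa = nlp(text_fa)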

tokens_2_en = [token.text for token in doc_en]
tokens_2_fa = [token.text for token in doc_fa]

BPE tokenization:

In [38]:
bpe_tokenizer_en = ByteLevelBPETokenizer()
bpe_tokenizer_fa = ByteLevelBPETokenizer()


bpe_tokenizer_en.train_from_iterator([text_en])
bpe_tokenizer_fa.train_from_iterator([normalized_text])


bpe_encoded_en = bpe_tokenizer_en.encode(text_en)
bpe_encoded_fa = bpe_tokenizer_fa.encode(normalized_text)

We use the 'tabulate' library to print the token counts as a table:

In [39]:
data = [["White Space", len(tokens_1_fa), len(tokens_1_en)], 
        ["spacy", len(tokens_2_fa), len(tokens_2_en)], 
        ["BPE", len(bpe_encoded_fa), len(bpe_encoded_en)]]
col_names = ["Algorithm", "Number of Tokens for Persian Book", "Number of Tokens for English Book"]

print(tabulate(data, headers=col_names, tablefmt="fancy_grid"))

╒═════════════╤═════════════════════════════════════╤═════════════════════════════════════╕
│ Algorithm   │   Number of Tokens for Persian Book │   Number of Tokens for English Book │
╞═════════════╪═════════════════════════════════════╪═════════════════════════════════════╡
│ White Space │                               96294 │                               78443 │
├─────────────┼─────────────────────────────────────┼─────────────────────────────────────┤
│ spacy       │                              124822 │                              102406 │
├─────────────┼─────────────────────────────────────┼─────────────────────────────────────┤
│ BPE         │                              116168 │                              107701 │
╘═════════════╧═════════════════════════════════════╧═════════════════════════════════════╛


### part c) Tokenizing test inputs:

In [40]:
en_input = 'This question is about tokenization and shows several tokenizer algorithms.Hopefully, you will be able to understand how they are trained and generate tokens.'
fa_input = 'این سوال در مورد قطعه بندی جملات است و چندین الگوریتم توکنایز کردن متن را نشان می دهد. امیدواریم بتوانید نحوه آموزش آنها و تولید توکن ها را درک کنید.'

Tokenizing inputs with the white space tokenization algorithm:

In [41]:
test_tokens_en_1 = en_input.split()
test_tokens_fa_1 = fa_input.split()

print(test_tokens_en_1)
print(test_tokens_fa_1)

['This', 'question', 'is', 'about', 'tokenization', 'and', 'shows', 'several', 'tokenizer', 'algorithms.Hopefully,', 'you', 'will', 'be', 'able', 'to', 'understand', 'how', 'they', 'are', 'trained', 'and', 'generate', 'tokens.']
['این', 'سوال', 'در', 'مورد', 'قطعه', 'بندی', 'جملات', 'است', 'و', 'چندین', 'الگوریتم', 'توکنایز', 'کردن', 'متن', 'را', 'نشان', 'می', 'دهد.', 'امیدواریم', 'بتوانید', 'نحوه', 'آموزش', 'آنها', 'و', 'تولید', 'توکن', 'ها', 'را', 'درک', 'کنید.']


Tokenizing inputs with spaCy:

In [42]:
test_en = nlp(en_input)
test_fa = nlp(fa_input)

test_tokens_en_2 = [token.text for token in test_en]
test_tokens_fa_2 = [token.text for token in test_fa]

print(test_tokens_en_2)
print(test_tokens_fa_2)

['This', 'question', 'is', 'about', 'tokenization', 'and', 'shows', 'several', 'tokenizer', 'algorithms', '.', 'Hopefully', ',', 'you', 'will', 'be', 'able', 'to', 'understand', 'how', 'they', 'are', 'trained', 'and', 'generate', 'tokens', '.']
['این', 'سوال', 'در', 'مورد', 'قطعه', 'بندی', 'جملات', 'است', 'و', 'چندین', 'الگوریتم', 'توکنایز', 'کردن', 'متن', 'را', 'نشان', 'می', 'دهد', '.', 'امیدواریم', 'بتوانید', 'نحوه', 'آموزش', 'آنها', 'و', 'تولید', 'توکن', 'ها', 'را', 'درک', 'کنید', '.']


Tokenizing inputs with the BPE tokenization algorithm:

In [43]:
encoding_en = bpe_tokenizer_en.encode(en_input)
encoding_fa = bpe_tokenizer_fa.encode(fa_input)

# Byte-level BPE tokens are stored in a byte-to-character alphabet, so Persian
# tokens look garbled; decode each token back to readable UTF-8 text:
decoder = ByteLevel()
repaired = [decoder.decode([elem]) for elem in encoding_fa.tokens]

print(encoding_en.tokens)
print(repaired)

['This', 'Ġquestion', 'Ġis', 'Ġabout', 'Ġto', 'ken', 'iz', 'ation', 'Ġand', 'Ġshows', 'Ġseveral', 'Ġto', 'ken', 'iz', 'er', 'Ġal', 'g', 'or', 'ith', 'm', 's', '.', 'Hope', 'fully', ',', 'Ġyou', 'Ġwill', 'Ġbe', 'Ġable', 'Ġto', 'Ġunderstand', 'Ġhow', 'Ġthey', 'Ġare', 'Ġtrain', 'ed', 'Ġand', 'Ġg', 'en', 'er', 'ate', 'Ġto', 'ken', 's', '.']
['این', ' سو', 'ال', ' در', ' مورد', ' قط', 'عه', ' بندی', ' جم', 'لات', ' است', ' و', ' چندین', ' ال', 'گ', 'وری', 'تم', ' تو', 'کن', 'ای', 'ز', ' کردن', ' متن', ' را', ' نشان', ' می', ' دهد', '.', ' امید', 'و', 'اریم', ' ب', 'توان', 'ید', ' نح', 'وه', ' آموزش', ' آنها', ' و', ' تو', 'ل', 'ید', ' تو', 'کن', ' ها', ' را', ' درک', ' کن', 'ید', '.']
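
The 'Ġ' prefix in the English tokens and the need for the ByteLevel decoder in the Persian case both come from the byte-to-character alphabet used by byte-level BPE. The sketch below rebuilds that table in the GPT-2 style, which we assume is the convention ByteLevel follows, and shows how a space and a Persian letter appear before decoding. The function name bytes_to_unicode is our own.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character (GPT-2 style table)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are moved above U+00FF.
            table[b] = chr(256 + shift)
            shift += 1
    return table


table = bytes_to_unicode()
inverse = {char: byte for byte, char in table.items()}

text = " سو"
as_shown = "".join(table[b] for b in text.encode("utf-8"))
restored = bytes(inverse[c] for c in as_shown).decode("utf-8")

print(as_shown)   # how the raw token would look before decoding
print(restored)   # readable text after mapping back to bytes
```

Each Persian letter is two bytes in UTF-8 and therefore two characters in this alphabet, which is why the raw token strings are unreadable until they are mapped back to bytes.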


BPE splits words into sub-word units that it has learned as frequent character sequences. For example, the word "tokenization" is divided into four sub-words: "to", "ken", "iz", and "ation". Because BPE learns its merges from the training text rather than from language-specific rules, it also works for Persian.

The main problem with white space tokenization is that words that are not separated by a space are mistakenly treated as a single token, and punctuation stays attached to the neighbouring word. For example, this happened at the end of the first English sentence ('algorithms.Hopefully,').
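
The difference can be reproduced on a fragment of the English test input with a simple regular expression that separates word characters from punctuation. The pattern is only a stand-in chosen for illustration; it is not how spaCy tokenizes.

```python
import re

fragment = "several tokenizer algorithms.Hopefully, you will be able"

whitespace_tokens = fragment.split()
# Runs of word characters, or any single character that is neither word nor space.
regex_tokens = re.findall(r"\w+|[^\w\s]", fragment)

print(whitespace_tokens)
print(regex_tokens)
```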

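Although spaCy cannot split words into sub-words, it performs better at separating words that are not delimited by spaces, and it also recognizes punctuation marks. However, it does not perform well on the Persian text here, where the English pipeline (en_core_web_sm) was used: apart from splitting off the sentence-final periods, its output is the same as that of white space tokenization.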