*Part Two:*
---
### 1_1
WordPiece is a subword tokenization algorithm introduced by Google and closely related to BPE. It lets a model cover a large, morphologically rich vocabulary with a limited number of tokens. Training starts from a vocabulary of the individual characters in the training corpus plus special tokens (such as [UNK] for unknown words); characters and subwords that do not start a word carry the ## prefix. The corpus is split into words, each word is split into these characters, and token frequencies are counted. The core of the algorithm is the merging process: every adjacent pair of tokens is assigned a score, commonly described as the pair's frequency divided by the product of the frequencies of its two parts, and the pair with the highest score is merged and added to the vocabulary. This repeats until the vocabulary reaches a predefined size or the highest score falls below a certain threshold. Once the final vocabulary is built, new text is tokenized by greedily matching the longest vocabulary entry from the start of each word.
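The tokenization step that follows training can be sketched as a greedy longest-match-first search. This is a minimal illustration, not the code used by any particular library; the function name `wordpiece_encode` and the toy vocabulary are assumptions made up for the example.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry '##'
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"hug", "##s", "##ging", "h", "##u", "##g"}

print(wordpiece_encode("hugs", toy_vocab))
print(wordpiece_encode("hugging", toy_vocab))
print(wordpiece_encode("bug", toy_vocab))
```

A word with no matching initial piece, such as "bug" here, maps to a single [UNK] rather than to a partial split.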

### 1_2
Popular models such as BERT, DistilBERT (a lighter and faster version of BERT), and Google's Multilingual BERT use WordPiece as their tokenizer. ALBERT (A Lite BERT) is often listed alongside them, but its released models use a SentencePiece tokenizer instead.

### 1_3
WordPiece and BPE are very similar, but they differ in a few ways:

        1-Token selection method:
        BPE merges the most frequent pair of tokens at each step, whereas WordPiece scores each pair by its frequency relative to the frequencies of its two parts and merges the highest-scoring pair.

        2-Efficiency and speed:
        WordPiece training involves an extra statistical calculation for every candidate pair, so it may be somewhat slower; BPE only needs the raw pair frequencies.

        3-Quality and results:
        WordPiece can give better segmentations for morphologically complex languages or large datasets, because its score penalizes pairs whose parts are common on their own; how large the difference is depends on the data.

        4-Usage in language models:
        BPE is mostly used in GPT variants, while WordPiece is typically used in BERT and its derivatives.
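The difference in token selection can be made concrete on a tiny corpus. The sketch below assumes the commonly described WordPiece score (pair frequency divided by the product of the two parts' frequencies) and, for brevity, ignores the ## prefix; the word frequencies and the helper `pair_statistics` are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: each word pre-split into characters, with its frequency.
word_freqs = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}


def pair_statistics(word_freqs):
    """Count how often each symbol and each adjacent symbol pair occurs."""
    unit_counts, pair_counts = Counter(), Counter()
    for symbols, freq in word_freqs.items():
        for symbol in symbols:
            unit_counts[symbol] += freq
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return unit_counts, pair_counts


unit_counts, pair_counts = pair_statistics(word_freqs)

# BPE: merge the most frequent pair.
bpe_choice = max(pair_counts, key=pair_counts.get)

# WordPiece (assumed score): frequency of the pair relative to its parts.
wordpiece_scores = {
    pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
    for pair, count in pair_counts.items()
}
wordpiece_choice = max(wordpiece_scores, key=wordpiece_scores.get)

print("BPE merges:", bpe_choice)
print("WordPiece merges:", wordpiece_choice)
```

On this data BPE picks the pair containing the very common "u", while the WordPiece score favors a rarer pair whose parts mostly occur together.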

In [None]:
# 2_1:      ## Inspired by a YouTube video on training a WordPiece tokenizer

import os
import random

from tqdm.auto import tqdm
from tokenizers import BertWordPieceTokenizer

# Create the output folder if it does not exist yet
os.makedirs('./Ferdowsi', exist_ok=True)
with open('./data/ferdowsi.txt', 'r', encoding='utf-8') as fp:
    data = fp.readlines()


# Keep the second '|'-separated field of every line and
# write the result out in chunks of 5000 lines per file.
text_data = []
file_count = 0
for sample in tqdm(data):
    sample = sample.rstrip('\n').split('|')
    text_data.append(sample[1])
    if len(text_data) == 5000:
        with open(f'./Ferdowsi/file_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))

        text_data = []
        file_count += 1

# Write the remaining lines (fewer than 5000), if any
if text_data:
    with open(f'./Ferdowsi/file_{file_count}.txt', 'w', encoding='utf-8') as fp:
        fp.write('\n'.join(text_data))


paths = [os.path.join('./Ferdowsi', file) for file in os.listdir('./Ferdowsi')]


tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    strip_accents=False
)

tokenizer.train(
    files=paths,
    vocab_size=20000,
    min_frequency=10,
    special_tokens=[
        '[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'
    ],
    wordpieces_prefix='##'
)

# save_model expects the target folder to exist
os.makedirs('./tokenizer', exist_ok=True)
tokenizer.save_model('./tokenizer')


  0%|          | 0/99218 [00:00<?, ?it/s]

100%|██████████| 99218/99218 [00:00<00:00, 998863.37it/s]


['./tokenizer\\vocab.txt']

In [None]:
# 2_2:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('./tokenizer')
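# Line i of vocab.txt is the token with id i; readlines() keeps each trailing
# newline, which is why the loop below prints one token per line.
with open('./tokenizer/vocab.txt', 'r', encoding='utf-8') as fp:
    vocab = fp.readlines()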

text = 'در نخستین ساعت شب،  در اطاق چوبیش تنها، زن چینی در سرش اندیشه های هولناکی دور می گیرد'
for i in tokenizer(text)['input_ids']:
    print(vocab[i], end=' ')

[CLS]
 در
 نخستین
 سا
 ##عت
 شب
 ،
 در
 ا
 ##ط
 ##اق
 چوب
 ##یش
 تنها
 ،
 زن
 چینی
 در
 سرش
 اندیشه
 های
 هول
 ##ن
 ##اک
 ##ی
 دور
 می
 گیرد
 [SEP]
 

*Part Three:*
---
### 3_3
Increasing n in an n-gram model leads to several important problems, including:

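        1-Data sparsity: the number of possible n-grams grows exponentially with n, so far more data is needed to cover them; most n-grams end up with zero counts, which breaks perplexity calculation unless smoothing is applied.
        
        2-Increased computational and memory cost.
        
        3-Overfitting: the model memorizes long sequences from the training text and fails to generalize to new data.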
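The sparsity problem is easy to observe even on a toy example. The sketch below measures the fraction of test n-grams that never occur in the training text as n grows; the two sentences and the helper names `ngrams` and `unseen_fraction` are made up for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unseen_fraction(train_tokens, test_tokens, n):
    """Share of test n-grams that do not appear anywhere in the training tokens."""
    seen = set(ngrams(train_tokens, n))
    test = ngrams(test_tokens, n)
    return sum(gram not in seen for gram in test) / len(test)


# Hypothetical toy data: same words, slightly different order.
train = "the cat sat on the mat and the dog sat on the rug".split()
test = "the dog sat on the mat and the cat sat on the rug".split()

fractions = [unseen_fraction(train, test, n) for n in (1, 2, 3, 4)]
for n, fraction in zip((1, 2, 3, 4), fractions):
    print(f"n={n}: {fraction:.2f} of test n-grams unseen")
```

Every test word is known to the model, yet the share of unseen sequences rises as n increases, which is the zero-count problem described in the list.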

In [None]:
# 3_1:
from collections import Counter
import math
import random  # used by generate_text

class NGram:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def tokenize(self, text):
        return self.tokenizer.tokenize(text)

    def calculate_ngram_probabilities(self, train_tokens, test_tokens, n, k=1):
        ## I was not sure how to build the probability matrix (to cover all
        ## possible combinations), so I got some help from GPT, but this is my code.
        # Add-k smoothing: P(w | prefix) = (count(prefix + w) + k) / (count(prefix) + k*V)
        V = len(set(train_tokens))  # number of distinct tokens seen in training
        
        ngram_counts = Counter(tuple(train_tokens[i:i+n]) for i in range(len(train_tokens)-n+1))
        # Count each prefix once per n-gram it starts, so both ranges must be identical
        n_minus_one_gram_counts = Counter(tuple(train_tokens[i:i+n-1]) for i in range(len(train_tokens)-n+1))
        ngram_probabilities = {}

        for ngram in ngram_counts:
            prefix = ngram[:-1]
            # ngram_counts[ngram] += k
            # n_minus_one_gram_counts[prefix] += k
            ngram_probabilities[ngram] = (ngram_counts[ngram] + k) / (n_minus_one_gram_counts[prefix] + k*V)


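        # Unseen test n-grams also get a smoothed probability so perplexity can be computed.
        # Note: generate_text samples from this same table, so it can emit test-set n-grams too.
        for i in range(len(test_tokens)-n+1):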
            ngram = tuple(test_tokens[i:i+n])
            if ngram not in ngram_counts:
                ngram_counts[ngram] = 0
                prefix = ngram[:-1]
                if prefix not in n_minus_one_gram_counts:
                    n_minus_one_gram_counts[prefix] = 0

                ngram_probabilities[ngram] = (ngram_counts[ngram] + k) / (n_minus_one_gram_counts[prefix] + k*V)
        
        return ngram_probabilities



    def generate_text(self, ngram_probabilities, n, num_words=200): 
        ## This part was mostly generated by ChatGPT, but it was my idea. ##
        
        
        # Start the text generation with a random n-gram as a seed
        current_ngram = random.choice(list(ngram_probabilities.keys()))
        generated_tokens = list(current_ngram)

        for _ in range(num_words - n):  # Generate the remaining tokens
            prefix = tuple(generated_tokens[-(n-1):])  # Get the last (n-1) tokens as the prefix
            possible_ngrams = {ngram: prob for ngram, prob in ngram_probabilities.items() 
                                if ngram[:-1] == prefix}

            if not possible_ngrams:
                break  # If no possible next tokens, stop generation
            
            # Choose the next token based on the probabilities
            next_ngram = random.choices(list(possible_ngrams.keys()), 
                                        weights=list(possible_ngrams.values()))[0]
            generated_tokens.append(next_ngram[-1])  # Append the last token of the selected n-gram


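        clean_generated_tokens = []  ## From here on is mine ##
        for token in generated_tokens:
            if token.startswith('##'):
                if clean_generated_tokens:
                    # Glue a continuation piece onto the previous token
                    clean_generated_tokens[-1] += token[2:]
                else:
                    # The random seed may start with a '##' piece that has nothing to attach to
                    clean_generated_tokens.append(token[2:])

            else:
                clean_generated_tokens.append(token)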

        return ' '.join(clean_generated_tokens)


ngram_model = NGram(tokenizer)


with open('data/ferdowsi.txt', 'r', encoding='utf-8') as fp:
    train_text = fp.read()


with open('data/hafez.txt', 'r', encoding='utf-8') as fp:
    test_text = fp.read()

train_tokens = ngram_model.tokenize(train_text)
test_tokens = ngram_model.tokenize(test_text)

bigram_probabilities = ngram_model.calculate_ngram_probabilities(train_tokens, test_tokens, 2)
fourgram_probabilities = ngram_model.calculate_ngram_probabilities(train_tokens, test_tokens, 4)
eightgram_probabilities = ngram_model.calculate_ngram_probabilities(train_tokens, test_tokens, 8)


In [None]:
# 3_2:
# Generate a text of up to 200 tokens (merged '##' pieces make the word count lower)
generated_text = ngram_model.generate_text(fourgram_probabilities, 4, num_words=200)
with open('sample_generated.txt', 'w', encoding='utf-8') as fp:
    fp.write(generated_text)
print(generated_text)

چون رفتی امروز و چون امدی [UNK] بر ان برترین نام یزدان پاک [UNK] سر دشمنان را بگاز اورم [UNK] بشد با گرازه به اورد رفتند پیچان عنان [UNK] همان نیزه و خود و گوپال و زین [UNK] بزر افسر و خسروانی نگین [UNK] چو دانا توانا بد و دادگر [UNK] چنان کن که نیکاختر و رای تست [UNK] زمانه بزیر کف پای تست [UNK] یکی نامه بنوشت با درد دل سام پیر [UNK] اگر هست بیهوده منمای دست [UNK] سرت پر ز تیزی و ارام من [UNK] ز پرده بگسترد بر انجمن [UNK] جهاندار بر شاد و رد بزرگ [UNK] نوشته همه نام تو بر نگین [UNK] هران بند کز دست تو کس نرست [UNK] به هر سو دورانی میکرد عشق گرم خون بخورند ناکسم گر به [UNK] سوی [UNK] روم بعد از روزگار ما بسی گردش کند گردون بسی [UNK] و نهار ارد عماری دار [UNK] را که صدر مجلس عشرت اشارتی چشمی بدان دو گوشه ابرو [UNK] ما مصلحت و فوق وجود خودم از [UNK] [UNK]


*Part Four:*
---
### 4_1
Perplexity is an intrinsic measure for evaluating a language model. It is the inverse of the geometric mean of the probabilities the model assigns to the tokens of a test text, and it can be read as the model's effective branching factor: the average number of equally likely options it is choosing among for the next token. A low perplexity means the model concentrates probability on the tokens that actually occur, so it is less surprised by the text.
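A small worked example may help with the branching-factor reading. The function below is a minimal sketch that takes the probabilities a model assigned to each token of a text; the probability lists are invented for illustration.

```python
import math


def perplexity(probabilities):
    """Inverse geometric mean of the probabilities, computed in log space."""
    average_negative_log2 = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** average_negative_log2


# A model that always spreads its belief evenly over 8 options.
uniform_model = [1 / 8] * 5

# A model that is fairly sure about most tokens (hypothetical values).
confident_model = [0.5, 0.5, 0.25, 0.5, 0.5]

print(perplexity(uniform_model))
print(perplexity(confident_model))
```

The uniform model's perplexity equals its number of options, while the more confident model scores lower.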


### 4_2
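As the results below show, increasing n worsens perplexity because of data sparsity: the longer the n-gram, the fewer test n-grams were ever seen in training. In addition, the model is trained only on Ferdowsi and evaluated on Hafez and a modern poet, and an n-gram model generalizes poorly to text outside its training data, so perplexity is high in every setting. With add-k smoothing, an n-gram whose prefix was never seen gets probability 1/V, so once almost every test n-gram is unseen the perplexity levels off near the vocabulary size V; this is consistent with the 4-gram and 8-gram values both settling around 5250.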
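One way to see why the scores stop growing at high n is to look at what add-k smoothing does with n-grams that never occurred in training. This is a minimal sketch; `add_k_probability` is a hypothetical helper that mirrors the smoothing formula, and the vocabulary size of 1000 is an arbitrary illustrative value, not the one from the experiments.

```python
import math


def add_k_probability(ngram_count, prefix_count, vocab_size, k=1):
    """Add-k smoothed estimate of P(last token | prefix)."""
    return (ngram_count + k) / (prefix_count + k * vocab_size)


V = 1000  # hypothetical vocabulary size

# An n-gram that was seen 3 times after a prefix seen 10 times.
seen = add_k_probability(3, 10, V)

# An n-gram whose prefix never occurred: the estimate collapses to 1/V.
unseen = add_k_probability(0, 0, V)

# Perplexity of a text in which every n-gram is unseen.
probabilities = [unseen] * 20
saturated_perplexity = 2 ** (-sum(math.log2(p) for p in probabilities) / len(probabilities))

print(seen, unseen, saturated_perplexity)
```

When every n-gram falls back to 1/V, the perplexity is V regardless of n, so larger n cannot push it any higher.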

In [None]:
# 4_2:
def calculate_perplexity(test_tokens, ngram_probabilities, n):  
    """Perplexity of one token sequence: 2 ** (mean negative log2 probability of its n-grams)."""
    log_probability_sum = 0
    ngram_count = 0
    
    for i in range(len(test_tokens)-n+1):
        ngram = tuple(test_tokens[i:i+n])
        log_probability_sum += math.log2(ngram_probabilities[ngram])
        ngram_count += 1
    
    average_log_probability = -log_probability_sum / ngram_count
    perplexity = math.pow(2, average_log_probability)
    
    return perplexity



# Tokenize the training text once; it is the same for every dataset and every n
ngram_model = NGram(tokenizer)
with open('data/ferdowsi.txt', 'r', encoding='utf-8') as fp:
    train_tokens = ngram_model.tokenize(fp.read())

for dataset_address, dataset_name in zip(['data/hafez.txt', 'data/modern_poet.txt'], 
                                         ['Hafez', 'Modern Poet']):
    with open(dataset_address, 'r', encoding='utf-8') as fp:
        lines = fp.readlines()

    # Tokens of the whole test file, used to add its unseen n-grams to the table
    test_tokens = ngram_model.tokenize(''.join(lines))

    for n in [2, 4, 8]:
        ngram_probabilities = ngram_model.calculate_ngram_probabilities(train_tokens, test_tokens, n)

        # Average the per-line perplexities over lines with at least n tokens
        total_perplexity = 0
        line_count = 0
        for line in lines:
            line_tokens = ngram_model.tokenize(line.strip())

            if len(line_tokens) >= n:
                total_perplexity += calculate_perplexity(line_tokens, ngram_probabilities, n)
                line_count += 1

        if line_count > 0:
            average_perplexity = total_perplexity / line_count
            print(f'{dataset_name}:')
            print(f'\t{n}gram:')
            print(f'\t\tAverage Perplexity: {average_perplexity}')


Hafez:
	2gram:
		Average Perplexity: 2807.9586820143427
Hafez:
	4gram:
		Average Perplexity: 5227.7204785414915
Hafez:
	8gram:
		Average Perplexity: 5249.969919794957
Modern Poet:
	2gram:
		Average Perplexity: 3164.262025015781
Modern Poet:
	4gram:
		Average Perplexity: 5239.414227557773
Modern Poet:
	8gram:
		Average Perplexity: 5249.998681621938
