## Initialization of Libraries

In [4]:
!pip install bpe

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Imports

In [5]:
#Imports
from nltk.tokenize import WhitespaceTokenizer
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
from spacy.lang.fa import Persian
from bpe import Encoder

In [6]:
# WhitespaceTokenizer is already imported above; create one shared instance.
whiteSpace_tokenizer = WhitespaceTokenizer()

## A:

### 1) White Space Tokenization

Whitespace tokenization is a simple method of breaking a text into words or tokens by using whitespace characters such as spaces, tabs, and newlines as delimiters. 

This approach assumes that words are separated by whitespace only, so punctuation marks stay attached to the neighbouring word.

For example, the sentence "The quick brown fox jumped over the lazy dog" would be tokenized into a list of individual words: ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"].

While this method is easy to implement, it does not handle all tokenization scenarios. For instance, a contraction like "don't" or a hyphenated word like "self-driving" is kept as one token, and a trailing period or comma stays glued to the word before it. As a result, more sophisticated methods such as the spaCy tokenizer or BPE subword tokenization are often preferred for NLP applications.
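The limitations above can be reproduced without any library. This is a minimal sketch that uses Python's built-in `str.split()` as a stand-in for a whitespace tokenizer (an assumption made for illustration; the notebook itself uses NLTK's `WhitespaceTokenizer`). The sample sentence is made up.

```python
# str.split() with no argument splits on runs of whitespace
# (spaces, tabs, newlines), which is the whitespace-tokenization idea.
text = "I don't own a self-driving car,\tdo you?\nNo."
tokens = text.split()

# The contraction and the hyphenated word stay whole,
# and punctuation stays attached to the preceding word.
print(tokens)
```

Note how "car,", "you?" and "No." keep their punctuation, which inflates the vocabulary with variants of the same word.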

Whitespace tokenization works reasonably well for a simple sentence (although the final period stays attached to the last word):

In [7]:
print(WhitespaceTokenizer().tokenize("The quick brown fox jumped over the lazy dog."))

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']


But it cannot split a contraction such as "I'd":

In [8]:
print(WhitespaceTokenizer().tokenize("I'd jump over the lazy dog."))

["I'd", 'jump', 'over', 'the', 'lazy', 'dog.']


### 2) Spacy Tokenizer

The spaCy tokenizer is a highly customizable tokenizer provided by the spaCy NLP library.

It is rule-based: it first splits the text on whitespace and then applies language-specific rules (special-case exceptions plus prefix, suffix, and infix patterns) to each chunk. This lets it handle more complex scenarios such as separating punctuation from words and splitting contractions. It also ships with pre-built rules for many languages.

One of the key advantages of the spaCy tokenizer is its customizability. Users can add their own rules or adjust existing ones to suit their specific tokenization needs. Additionally, the tokenizer is designed to work with the other components of the spaCy library, such as the part-of-speech tagger and named entity recognizer, which consume the tokens it produces.

Overall, the spaCy tokenizer is a powerful tool for text tokenization, particularly for more complex NLP applications.
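To make the idea of rule-based splitting concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the function `rule_tokenize`, the exception table `SPECIAL_CASES`, and the punctuation set `PUNCT` are hypothetical names invented for this illustration. It only shows the general pattern of splitting on whitespace, peeling punctuation off both ends of each chunk, and consulting an exception table.

```python
# Hypothetical exception table: whole strings mapped to a fixed split.
SPECIAL_CASES = {"I'd": ["I", "'d"], "don't": ["do", "n't"]}
# Hypothetical set of characters treated as standalone punctuation tokens.
PUNCT = set('.,!?;:"()')

def rule_tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel punctuation from the front, unless the chunk is a known exception.
        while chunk and chunk not in SPECIAL_CASES and chunk[0] in PUNCT:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        # Peel punctuation from the back in the same way.
        while chunk and chunk not in SPECIAL_CASES and chunk[-1] in PUNCT:
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(prefixes)
        if chunk:
            # Use the exception split if one exists, otherwise keep the chunk.
            tokens.extend(SPECIAL_CASES.get(chunk, [chunk]))
        # Suffixes were collected right-to-left, so restore their order.
        tokens.extend(reversed(suffixes))
    return tokens

print(rule_tokenize("I'd jump over the lazy dog."))
```

Adding a new entry to the exception table changes how that string is split, which is the kind of customization the paragraphs above describe.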

Here is an example of the spaCy tokenizer on the same sentence:

In [9]:
tokens = English().tokenizer("I'd jump over the lazy dog.")
print(*tokens, sep=', ')

I, 'd, jump, over, the, lazy, dog, .


### 3) SubWord Tokenization

Subword tokenization breaks words into smaller subword units. Byte Pair Encoding (BPE), the method used here, is one of the most common subword algorithms; it builds its units from the frequency of character sequences in the training corpus. It is particularly useful for handling out-of-vocabulary (OOV) words, that is, words that are not present in the training data.

The BPE algorithm begins by initializing the vocabulary with all of the individual characters in the training corpus. It then repeatedly finds the most frequent pair of adjacent symbols and merges it into a new subword unit, adding that unit to the vocabulary each time. This process continues until a predetermined vocabulary size (or number of merges) is reached. For example, the word "university" might end up split into "un", "i", "ver", and "sity".
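The merge loop described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code of the `bpe` package used in this notebook: the function `learn_bpe` and the toy word list are assumptions made for the example, and ties between equally frequent pairs are resolved by taking the first pair encountered.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Each distinct word starts as a tuple of single characters,
    # with its frequency in the corpus.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus (made up for illustration).
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(corpus, num_merges=3)
print(merges)
print(list(vocab))
```

On this toy corpus the frequent ending "est" becomes a single unit after two merges, which shows how common character sequences turn into subwords.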

One advantage of subword tokenization is that it can segment OOV words into known subwords on the fly, as long as their characters were seen during training, without needing a separate unknown-word token for the whole word. This can be particularly useful in low-resource settings where training data is limited. The trade-off is that text is usually split into more tokens than with word-level tokenization, so sequences get longer, and the quality of the subwords depends on the corpus the merges were learned from.
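As a small illustration of segmenting an unseen word, the sketch below applies an already-learned merge list to a new word. The helper `apply_merges` and the merge list are hypothetical and hard-coded for the example; they are not taken from the `bpe` package.

```python
def apply_merges(word, merges):
    # Start from single characters and replay the merges in learned order.
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list, as if learned from words like "low" and "newest".
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

# "lowest" is assumed not to occur in the training words,
# yet it decomposes into two known subwords.
print(apply_merges("lowest", merges))
```

A word whose characters never form a learned pair simply stays as single characters, so the output is always some sequence of known symbols.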

Subword tokenization is commonly used in neural machine translation and other natural language processing tasks.



Here we first fit the encoder on the sentence and then tokenize the same example sentence:

In [11]:
enc = Encoder()
enc.fit("I'd jump over the lazy dog.".split('\n'))
print(enc.tokenize("I'd jump over the lazy dog."))

['i', "'", 'd', 'jump', 'over', 'the', 'lazy', 'dog', '.']


## B:

Here we run the three tokenizers on English and Persian text and then fill in the given table with the token counts.

### English Harry Potter

In [14]:
# First, read the English Harry Potter text (the file is closed automatically).
with open("data/hp_en.txt", "r", encoding="utf-8") as f:
    hp_en = f.read()

#### White Space Tokenizer


In [17]:
# Tokenize once and reuse the result for both the count and the printout.
tokens_ws_en = whiteSpace_tokenizer.tokenize(hp_en)
num_whiteSpace_tokenizer = len(tokens_ws_en)
print("Number of Tokens for WhiteSpace Tokenizer for HP_EN is :", num_whiteSpace_tokenizer)
print(tokens_ws_en)


Number of Tokens for WhiteSpace Tokenizer for HP_EN is : 78443


#### Spacy Tokenizer

In [22]:
Spacy_tokenizer_english = English().tokenizer(hp_en)
num_Spacy_tokenizer_english = len(Spacy_tokenizer_english)
print("Number of Tokens for Spacy Tokenizer for HP_EN is :", num_Spacy_tokenizer_english)

Number of Tokens for Spacy Tokenizer for HP_EN is : 102406


#### SubWord Tokenizer or BPE

In [31]:
enc_en = Encoder()
enc_en.fit(hp_en.split('\n'))
# Tokenize once and reuse the result for both the count and the printout.
tokens_bpe_en = enc_en.tokenize(hp_en)
len_BPE = len(tokens_bpe_en)
print("Number of Tokens for SubWord Tokenizer for HP_EN is :", len_BPE)
print(tokens_bpe_en)

Number of Tokens for SubWord Tokenizer for HP_EN is : 100012


### Persian Harry Potter

In [12]:
# Then read the Persian Harry Potter text (the file is closed automatically).
with open("data/hp_fa.txt", "r", encoding="utf-8") as f:
    hp_fa = f.read()

#### White Space Tokenizer


In [26]:
tokens_ws_fa = whiteSpace_tokenizer.tokenize(hp_fa)
print("Number of Tokens for WhiteSpace Tokenizer for HP_FA is :", len(tokens_ws_fa))
print(tokens_ws_fa)

Number of Tokens for WhiteSpace Tokenizer for HP_FA is : 96294
['\ufeffآقا', 'و', 'خانم', 'دورسلي', 'ساکن', 'خانه', 'شماره', 'چهار', 'خيابان', 'پريوت', 'درايو', 'بودند.', 'خانواده', 'آنها', 'بسيار', 'معمولي', 'و', 'عادي', 'بود', 'و', 'آن', 'ها', 'از', 'اين', 'بابت', 'بسيار', 'راضي', 'و', 'خوشنود', 'بودند.', 'اين', 'خانواده', 'به', 'هيچ', 'وجه', 'با', 'امور', 'مرموز', 'و', 'اسرار', 'آميز', 'سروکار', 'نداشتند', 'زيرا', 'سحر', 'و', 'جادو', 'را', 'امر', 'مهمل', 'و', 'بيهوده', 'اي', 'مي', 'پنداشتند', 'و', 'علاقه', 'اي', 'به', 'اين', 'گونه', 'مسائل', 'نداشت', 'ن', 'آقاي', 'دورسلي', 'مدير', 'شرکت', 'دريل', 'سازي', 'گرونينگز', '،', 'مردي', 'درشت', 'اندام', 'و', 'قوي', 'هيکل', 'بود', 'با', 'گردني', 'بسيار', 'کوتاه', 'که', 'سبيل', 'بلندي', 'نيز', 'داشت', '.', 'همسر', 'او،', 'خانم', 'دورسلي', 'زني', 'لاغر', 'اندام', 'بود', 'با', 'موهاي', 'بور', 'و', 'گردني', 'کشيده', 'و', 'بلند.', 'بلندي', 'گردنش', 'بسيار', 'برايش', 'مفيد', 'بود', 'زيرا', 'بيش', 'تر', 'وقتش', 'را', 'صرف', 'سرک', 'کشيدن', 'به', 'خ

#### Spacy Tokenizer

In [13]:
Spacy_tokenizer_persian = Persian().tokenizer(hp_fa)
len_Spacy_tokenizer_persian = len(Spacy_tokenizer_persian)
print("Number of Tokens for Spacy Tokenizer for HP_FA is : ", len_Spacy_tokenizer_persian)

Number of Tokens for Spacy Tokenizer for HP_FA is :  125677


#### SubWord Tokenizer or BPE

In [27]:
enc_fa = Encoder()
enc_fa.fit(hp_fa.split('\n'))
# Tokenize once and reuse the result for both the count and the printout.
tokens_bpe_fa = enc_fa.tokenize(hp_fa)
len_enc_fa = len(tokens_bpe_fa)
print("Number of Tokens for SubWord Tokenizer for HP_FA is :", len_enc_fa)
print(tokens_bpe_fa)

Number of Tokens for SubWord Tokenizer for HP_FA is : 106734
['\ufeff', 'آقا', 'و', 'خانم', 'دورسلي', 'ساکن', 'خانه', 'شماره', 'چهار', 'خيابان', 'پريوت', 'درايو', 'بودند', '.', 'خانواده', 'آنها', 'بسيار', 'معمولي', 'و', 'عادي', 'بود', 'و', 'آن', 'ها', 'از', 'اين', 'بابت', 'بسيار', 'راضي', 'و', 'خوشنود', 'بودند', '.', 'اين', 'خانواده', 'به', 'هيچ', 'وجه', 'با', 'امور', 'مرموز', 'و', 'اسرار', 'آميز', 'سروکار', 'نداشتند', 'زيرا', 'سحر', 'و', 'جادو', 'را', 'امر', 'مهمل', 'و', 'بيهوده', 'اي', 'مي', 'پنداشتند', 'و', 'علاقه', 'اي', 'به', 'اين', 'گونه', 'مسائل', 'نداشت', 'ن', 'آقاي', 'دورسلي', 'مدير', 'شرکت', 'دريل', 'سازي', 'گرونينگز', '،', 'مردي', 'درشت', 'اندام', 'و', 'قوي', 'هيکل', 'بود', 'با', 'گردني', 'بسيار', 'کوتاه', 'که', 'سبيل', 'بلندي', 'نيز', 'داشت', '.', 'همسر', 'او', '،', 'خانم', 'دورسلي', 'زني', 'لاغر', 'اندام', 'بود', 'با', 'موهاي', 'بور', 'و', 'گردني', 'کشيده', 'و', 'بلند', '.', 'بلندي', 'گردنش', 'بسيار', 'برايش', 'مفيد', 'بود', 'زيرا', 'بيش', 'تر', 'وقتش', 'را', 'صرف', 'سر

### Results

| Algorithm Used | Number of Tokens in Persian | Number of Tokens in English |
| --- | --- | --- |
| White Space | 96294 | 78443 |
| Spacy | 125677 | 102406 |
| BPE | 106734 | 100012 |
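To compare the tokenizers more directly, the counts in the table can be expressed relative to the whitespace baseline. The snippet below is a small arithmetic sketch; the numbers are copied from the table, and the dictionary layout and variable names are just one convenient choice.

```python
# Token counts copied from the results table.
counts = {
    "White Space": {"fa": 96294, "en": 78443},
    "Spacy": {"fa": 125677, "en": 102406},
    "BPE": {"fa": 106734, "en": 100012},
}

# Express every count as a multiple of the whitespace count for that language.
baseline = counts["White Space"]
ratios = {
    name: {lang: round(n / baseline[lang], 2) for lang, n in row.items()}
    for name, row in counts.items()
}

for name, row in ratios.items():
    print(f"{name:12s} fa x{row['fa']:.2f}  en x{row['en']:.2f}")
```

Both spaCy ratios come out at about 1.31, while the BPE encoder adds proportionally more tokens on the English text (about 1.27) than on the Persian text (about 1.11).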

## C:

### English Sentence

In [29]:
en_input = "This question is about tokenization and shows several tokenizer algorithms.Hopefully, you will be able to understand how they are trained and generate tokens."

**WhiteSpace Tokenizer**

In [30]:
print(whiteSpace_tokenizer.tokenize(en_input))
print('number of tokens:', len(whiteSpace_tokenizer.tokenize(en_input)))


['This', 'question', 'is', 'about', 'tokenization', 'and', 'shows', 'several', 'tokenizer', 'algorithms.Hopefully,', 'you', 'will', 'be', 'able', 'to', 'understand', 'how', 'they', 'are', 'trained', 'and', 'generate', 'tokens.']
number of tokens: 23


**Spacy Tokenizer**

In [35]:
tokens = English().tokenizer(en_input)
print(*tokens, sep=', ')
print('number of tokens:', len(tokens))

This, question, is, about, tokenization, and, shows, several, tokenizer, algorithms, ., Hopefully, ,, you, will, be, able, to, understand, how, they, are, trained, and, generate, tokens, .
number of tokens: 27


**BPE**

In [33]:
print(enc_en.tokenize(en_input))
print('number of tokens:', len(enc_en.tokenize(en_input)))


['this', 'question', 'is', 'about', '__sow', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__eow', 'and', 'shows', 'several', '__sow', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__eow', '__sow', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__eow', '.', 'hopefully', ',', 'you', 'will', 'be', 'able', 'to', 'understand', 'how', 'they', 'are', 'trained', 'and', '__sow', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__unk', '__eow', 'tokens', '.']
number of tokens: 70


### Persian Sentence

In [36]:
fa_input= "این سوال در مورد قطعه بندی جملات است و چندین الگوریتم توکنایز کردن متن را نشان می دهد. امیدواریم بتوانید نحوه آموزش آنها و تولید توکن ها را درک کنید."

**WhiteSpace Tokenizer**

In [39]:
print(whiteSpace_tokenizer.tokenize(fa_input))
print('number of tokens:', len(whiteSpace_tokenizer.tokenize(fa_input)))


['این', 'سوال', 'در', 'مورد', 'قطعه', 'بندی', 'جملات', 'است', 'و', 'چندین', 'الگوریتم', 'توکنایز', 'کردن', 'متن', 'را', 'نشان', 'می', 'دهد.', 'امیدواریم', 'بتوانید', 'نحوه', 'آموزش', 'آنها', 'و', 'تولید', 'توکن', 'ها', 'را', 'درک', 'کنید.']
number of tokens: 30


As noted earlier, the main problem with the whitespace tokenizer is that it cannot separate tokens that are not divided by whitespace: in the English sentence, "algorithms.Hopefully," is kept as a single token.

In the Persian sentence the opposite happens: "می دهد" is a single verb, but because it is written with a space it is split into two tokens.
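The "algorithms.Hopefully," case can be handled with a simple punctuation-aware split. The snippet below is a minimal sketch using Python's `re` module rather than any of the three tokenizers in this notebook; the pattern is an assumption chosen for illustration and treats every non-word, non-space character as its own token.

```python
import re

en_input = "This question is about tokenization and shows several tokenizer algorithms.Hopefully, you will be able to understand how they are trained and generate tokens."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", en_input)

print(tokens)
print("number of tokens:", len(tokens))
```

This yields 27 tokens for the English sentence, with "algorithms", ".", "Hopefully" and "," separated. It does nothing for the Persian "می دهد" case, where the issue is an extra space rather than missing whitespace.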



**Spacy Tokenizer**

In [37]:
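# Note: this cell applies the English tokenizer to the Persian sentence;
# part B used Persian() for the Persian text.
tokens = English().tokenizer(fa_input)
print(*tokens, sep=', ')
print('number of tokens:', len(tokens))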

این, سوال, در, مورد, قطعه, بندی, جملات, است, و, چندین, الگوریتم, توکنایز, کردن, متن, را, نشان, می, دهد, ., امیدواریم, بتوانید, نحوه, آموزش, آنها, و, تولید, توکن, ها, را, درک, کنید, .
number of tokens: 32


**BPE**

In [38]:
print(enc_fa.tokenize(fa_input))
print('number of tokens:', len(enc_fa.tokenize(fa_input)))

['__sow', 'ا', '__unk', 'ن', '__eow', '__sow', 'سو', 'ال', '__eow', 'در', 'مورد', '__sow', 'قط', 'عه', '__eow', '__sow', 'بن', 'د', '__unk', '__eow', '__sow', 'جم', 'لا', 'ت', '__eow', 'است', 'و', '__sow', 'چن', 'د', '__unk', 'ن', '__eow', '__sow', 'ال', 'گو', 'ر', '__unk', 'تم', '__eow', '__sow', 'تو', 'کن', 'ا', '__unk', 'ز', '__eow', 'کردن', '__sow', 'مت', 'ن', '__eow', 'را', 'نشان', '__sow', 'م', '__unk', '__eow', 'دهد', '.', '__sow', 'ام', '__unk', 'دو', 'ار', '__unk', 'م', '__eow', '__sow', 'بت', 'وا', 'ن', '__unk', 'د', '__eow', '__sow', 'ن', 'حو', 'ه', '__eow', 'آموزش', 'آنها', 'و', '__sow', 'تو', 'ل', '__unk', 'د', '__eow', '__sow', 'تو', 'کن', '__eow', 'ها', 'را', 'درک', '__sow', 'کن', '__unk', 'د', '__eow', '.']
number of tokens: 102
