<h1><span style="text-align:center">Chapter 3</span></h1>
 
 ---

In the previous notebooks, we saw two methods for segmenting running text into tokens.

1. Tokenization using regular expressions.
2. Tokenization using word segmentation.

---

There's a third approach to tokenization. Instead of defining tokens as words separated by spaces or as individual characters, we can let the data tell us automatically what size a token should be.

For example, sometimes we want space-delimited tokens, sometimes larger multi-word tokens (such as New York), and sometimes units smaller than a word.

##### Subword

> A subword is in between a word and a character.

For example, we can split the word 'subword' into 'sub' and 'word'.

---
If our training corpus contains, say, the words low and lowest but not lower, and the word lower then appears in our test corpus, our system will not know what to do with it.
A solution to this problem is to use a kind of tokenization in which most tokens are words, but some tokens are frequent morphemes or other subwords like -er, so that an unseen word can be represented by combining the parts.
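
The unseen-word problem can be illustrated with a small sketch. The two vocabularies below are hypothetical and hand-picked, and `segment` is an illustrative greedy longest-match helper, not BPE itself; it only shows how an unseen word can be assembled from known parts.

```python
# Hypothetical vocabularies, hand-picked for illustration only.
word_vocab = {"low", "lowest"}
subword_vocab = {"low", "est", "er"}

def segment(word, vocab):
    """Greedy longest-match segmentation; returns None if some part is not in vocab."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return None  # no known piece starts at this position
    return pieces

print("lower" in word_vocab)            # False: the word-level vocabulary cannot represent it
print(segment("lower", subword_vocab))  # ['low', 'er']
```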

---

### Byte Pair Encoding for Tokenization

* We start with a dictionary that maps each word to its frequency, {word: frequency}, and a vocabulary made up of every character that appears in those words.
* The end of each word is marked by _.
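
As a rough sketch, both starting structures could be derived from raw text as follows. The corpus string and the `toy_` names are assumptions made up for this illustration; they are not the data used in the example that follows.

```python
from collections import Counter

# Hypothetical toy corpus, chosen only for illustration.
toy_corpus = "low low lowest newer newer newer wider new"

# Map each word, with the end-of-word marker appended, to its frequency.
toy_dictionary = Counter(word + "_" for word in toy_corpus.split())

# The initial vocabulary is every character seen, including the marker "_".
toy_vocabulary = sorted({ch for word in toy_dictionary for ch in word})

print(dict(toy_dictionary))
print(*toy_vocabulary)
```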

Consider the following example,


In [11]:
dictionary = {'low_':5,'lowest_':2,'newer_':6,'wider_':3,'new_':2}
print("Dictionary\n------")
for word, freq in dictionary.items():
    print(f"{word} : {freq}")
    
print("\nVocabulary\n------")
vocabulary = ['_', 'd', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w']
print(*vocabulary)

Dictionary
------
low_ : 5
lowest_ : 2
newer_ : 6
wider_ : 3
new_ : 2

Vocabulary
------
_ d e i l n o r s t w


The most frequent pair of symbols is 'r_': it occurs 6 times in newer_ and 3 times in wider_, 9 times in total. (The pair 'er' is tied with it at 9; we pick 'r_' here.)
We now merge it, treat it as a single symbol, and count again.
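
The count can be checked with a short sketch that tallies every adjacent pair of symbols, weighted by word frequency. The name `pair_counts` is introduced here only for illustration.

```python
from collections import Counter

dictionary = {'low_': 5, 'lowest_': 2, 'newer_': 6, 'wider_': 3, 'new_': 2}

pair_counts = Counter()
for word, freq in dictionary.items():
    symbols = list(word)  # before any merge, every character is its own symbol
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[left, right] += freq  # each occurrence counts once per word token

# Show the three most frequent pairs.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Two pairs share the top count of 9, `('e', 'r')` and `('r', '_')`, so which one is merged first depends on how ties are broken.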

In [12]:
print("Updated vocabulary\n-------")
vocabulary.append('r_')
print(*vocabulary)

Updated vocabulary
-------
_ d e i l n o r s t w r_


Now the most frequent pair is 'e' followed by 'r_' (9 occurrences), which we merge into 'er_'; our system has learned that there should be a token for word-final er, represented as 'er_'.

In [13]:
print("Updated vocabulary\n-------")
vocabulary.append('er_')
print(*vocabulary)

Updated vocabulary
-------
_ d e i l n o r s t w r_ er_


Now the most frequent pair is 'ew' (8 occurrences: 6 in newer_ and 2 in new_).

In [14]:
print("Updated vocabulary\n-------")
vocabulary.append('ew')
print(*vocabulary)

Updated vocabulary
-------
_ d e i l n o r s t w r_ er_ ew


This continues until we have performed the chosen number of merges; the resulting symbols form the learned vocabulary.

In [15]:
print("Updated vocabulary\n-------")
vocabulary.extend(['new','lo','low','newer_' ,'low_'])
print(*vocabulary)

Updated vocabulary
-------
_ d e i l n o r s t w r_ er_ ew new lo low newer_ low_


When we need to tokenize a test sentence, we just run the merges we have learned, greedily, in the order we learned them, on the test data. (Thus the frequencies in the test data don't play a role, just the frequencies in the training data.) So first we segment each test sentence word into characters. Then we apply the first rule: replace every instance of r _ in the test corpus with r_, and then the second rule: replace every instance of e r_ in the test corpus with er_, and so on. By the end, if the test corpus contained the word n e w e r _, it would be tokenized as a full word. But a new (unknown) word like l o w e r _ would be merged into the two tokens low er_.
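
A minimal sketch of this replay step, assuming the merge order from the walkthrough above. The helper name `apply_merges` is hypothetical and introduced only for this illustration.

```python
# Merge rules in the order they were learned in the walkthrough above.
merges = [('r', '_'), ('e', 'r_'), ('e', 'w'), ('n', 'ew'),
          ('l', 'o'), ('lo', 'w'), ('new', 'er_'), ('low', '_')]

def apply_merges(word, merges):
    """Tokenize one word (already ending in '_') by replaying the merges in order."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Replace every adjacent (left, right) pair with the merged symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(apply_merges('newer_', merges))  # ['newer_']
print(apply_merges('lower_', merges))  # ['low', 'er_']
```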

In [31]:
import re
import collections

def get_stats(vocab):
    """Count each adjacent pair of symbols, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def mergevocab(pair, v_in):
    """Merge every occurrence of the given symbol pair in the vocabulary."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only when it is made of whole, space-delimited symbols.
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# Symbols are separated by spaces; '</w>' plays the role of the end-of-word marker '_'.
vocab = {'l o w </w>': 5, 'l o w e s t </w>': 2, 'n e w e r </w>': 6,
         'w i d e r </w>': 3, 'n e w </w>': 2}
num_merges = 8
for i in range(num_merges):
    pairs = get_stats(vocab)
    # max() returns the first pair among ties, so ('e', 'r') is merged before ('r', '</w>').
    best = max(pairs, key=pairs.get)
    vocab = mergevocab(best, vocab)
    print(best)

('e', 'r')
('er', '</w>')
('n', 'e')
('ne', 'w')
('l', 'o')
('lo', 'w')
('new', 'er</w>')
('low', '</w>')


There are many other algorithms for learning subword vocabularies, such as WordPiece and the unigram language model.

### Methods of tokenization using Python
---

* Using [NLTK](https://www.nltk.org/)
> NLTK (Natural Language Toolkit) is a suite of libraries for working with natural language data.

In [33]:
import nltk 

In [35]:
sentence = 'The town was fairly large with a dozen or\
            so business buildings on each side of the street but, as I said, most were closed.'

In [37]:
tokens = nltk.word_tokenize(sentence)
print(tokens)

['The', 'town', 'was', 'fairly', 'large', 'with', 'a', 'dozen', 'or', 'so', 'business', 'buildings', 'on', 'each', 'side', 'of', 'the', 'street', 'but', ',', 'as', 'I', 'said', ',', 'most', 'were', 'closed', '.']


NLTK's word tokenizer is simple and effective, but NLP tooling has improved considerably over the years.
The more recent tokenizers library by Hugging Face is built for speed and ships the subword tokenizers used by modern models.

Some of the key features of [Hugging Face](https://huggingface.co) tokenizers are:

- Performance (“takes less than 20 seconds to tokenize a GB of text on a server’s CPU”)
- Provides access to the latest tokenizers for research and production use cases (BPE/byte-level BPE/WordPiece/SentencePiece…)
- Aims for ease-of-use and versatility
- Tracks alignments, so the part of the original text that corresponds to a given token can always be recovered
- Applies pre-processing best practices such as truncating, padding, etc.
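
To make the alignment-tracking idea concrete, here is a pure-Python sketch. It does not use the tokenizers library: the regex split is a simple stand-in, and `tokenize_with_offsets` is a hypothetical helper name. It only shows what it means for each token to carry the character span it came from.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, (start, end)) pairs; end is exclusive."""
    # Words and single punctuation marks, each with its span in the original text.
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "Most were closed."
for token, (start, end) in tokenize_with_offsets(text):
    # Slicing the original text with the offsets gives back the token.
    assert text[start:end] == token
    print(token, (start, end))
```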

---

* Using [tokenizers](https://github.com/huggingface/tokenizers)

In [44]:
from tokenizers import BertWordPieceTokenizer
# The relative path means bert-base-uncased-vocab.txt must be in the working directory.
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

In [45]:
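from datetime import datetime

def textTokenizer(text):
    """Tokenize text, print the tokens, and print the elapsed time in seconds."""
    start = datetime.now()
    # Note: the measured time includes printing the tokens.
    print(tokenizer.encode(text).tokens)
    end = datetime.now()
    print("Time taken - {}".format((end - start).total_seconds()))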

In [46]:
textTokenizer(sentence)

['[CLS]', 'the', 'town', 'was', 'fairly', 'large', 'with', 'a', 'dozen', 'or', 'so', 'business', 'buildings', 'on', 'each', 'side', 'of', 'the', 'street', 'but', ',', 'as', 'i', 'said', ',', 'most', 'were', 'closed', '.', '[SEP]']
Time taken - 0.00049


---
* Using [spaCy](https://spacy.io/api/tokenizer/)

In [71]:
import spacy
spacy_nlp = spacy.load('en_core_web_sm')

doc = spacy_nlp(sentence)
tokens = [token.text for token in doc]
print(tokens)

['The', 'town', 'was', 'fairly', 'large', 'with', 'a', 'dozen', 'or', '           ', 'so', 'business', 'buildings', 'on', 'each', 'side', 'of', 'the', 'street', 'but', ',', 'as', 'I', 'said', ',', 'most', 'were', 'closed', '.']
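
Unlike the other two tokenizers, spaCy keeps a run of spaces as a token of its own. The spaces come from how `sentence` was defined: a backslash line continuation inside the quotes makes the next line's indentation part of the string. The shortened strings below are illustrative stand-ins for the full sentence; they show the effect and one way to avoid it.

```python
# Backslash continuation inside the quotes: the indentation of the next
# line becomes part of the string.
with_continuation = 'a dozen or\
            so business buildings'

# Implicit concatenation of adjacent literals keeps the indentation out.
clean = ('a dozen or '
         'so business buildings')

print(repr(with_continuation))
print(repr(clean))
```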
