# Tokenizer Basic Theory

1. A quick rundown of tokenization: tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods such as Count Vectorizer and advanced deep-learning architectures such as Transformers. Tokens are the building blocks of natural language.
2. Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

### Why:

Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect. If you’ve ever picked up a language that wasn’t your mother tongue, you’ll relate to this! There are so many layers to peel off and syntaxes to consider – it’s quite a challenge.

And that’s exactly the way with our machines. To get a computer to understand any text, we need to break that text down into units the machine can work with. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in.

## Where we use it:
- In NLP pipelines, to pre-process raw text before it is used for training.

## What it does
- As tokens are the building blocks of natural language, the most common way of processing raw text is at the token level.
- Tokenizer libraries also handle the pre-processing: truncating, padding, and adding the special tokens your model needs.
- Creating a vocabulary is the ultimate goal of tokenization.
- Traditional NLP approaches such as Count Vectorizer and TF-IDF use the vocabulary as features. Each word in the vocabulary is treated as a unique feature.
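
To illustrate the last point, here is a minimal sketch of how a vocabulary turns into count features. It uses plain Python rather than a library vectorizer, and the two toy documents and the helper name `count_vector` are assumptions made up for this example.

```python
from collections import Counter

# Toy corpus (illustrative only).
docs = ["the cat sat on the mat", "the dog sat"]

# The vocabulary is the sorted set of unique whitespace-separated words.
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [count_vector(doc) for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each column corresponds to one vocabulary word, which is what "each word is a unique feature" means in practice.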

## Types of Tokenization
### Word tokens
1. Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a certain delimiter. Depending on the delimiters, different word-level tokens are formed. Pretrained word embeddings such as Word2Vec and GloVe come under word tokenization.
2. One of the major issues with word tokens is dealing with Out Of Vocabulary (OOV) words. OOV words are new words encountered at test time that do not exist in the vocabulary. Hence, these methods fail to handle OOV words.
3. A small trick can rescue word tokenizers from OOV words. The trick is to form the vocabulary from the top K most frequent words and replace the rare words in the training data with an unknown token (UNK). This helps the model learn a representation of OOV words in terms of the UNK token.
4. The drawback is that every OOV word maps to the same UNK token, so the model loses the meaning of those words. A word-level vocabulary can also become very large.
5. That's why we have character tokenization.
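
The top-K trick and its drawback can be sketched in a few lines. This is a minimal illustration, assuming whitespace splitting as the word tokenizer; the toy training sentences, `K = 3`, and the `<unk>` spelling are choices made for this example only.

```python
from collections import Counter

# Toy training data (illustrative only).
train = ["the cat sat", "the cat ran", "the dog sat"]
K = 3          # keep only the K most frequent words
UNK = "<unk>"  # placeholder for everything else

counts = Counter(word for sentence in train for word in sentence.split())
vocab = {word for word, _ in counts.most_common(K)}

def encode(sentence):
    """Replace every word outside the vocabulary with the UNK token."""
    return [word if word in vocab else UNK for word in sentence.split()]

print(sorted(vocab))           # ['cat', 'sat', 'the']
print(encode("the bird sat"))  # ['the', '<unk>', 'sat']
print(encode("the dog ran"))   # ['the', '<unk>', '<unk>']
```

In the last line, "dog" and "ran" collapse into the same token, which is exactly the loss of meaning described in point 4.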

### Character tokenization
1. Character tokenization splits a piece of text into a sequence of characters. It overcomes the drawbacks of word tokenization that we saw above.

2. Character tokenizers handle OOV words coherently by preserving the information in the word. They break the OOV word down into characters and represent the word in terms of these characters.
3. It also limits the size of the vocabulary. Want to take a guess at the size of the vocabulary? For lowercase English text it is just the 26 letters (plus digits, punctuation, and whitespace if those are kept), since the vocabulary contains only the unique characters.
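
A short sketch makes both properties visible: the vocabulary stays tiny, and an unseen word is still fully representable. The toy corpus and the word "slowest" are assumptions picked for illustration.

```python
# Toy corpus (illustrative only).
corpus = "low lower lowest"

# The character vocabulary is just the set of unique characters (space included).
char_vocab = sorted(set(corpus))

# A word that never appeared in the corpus.
oov_word = "slowest"
tokens = list(oov_word)

print(char_vocab)  # [' ', 'e', 'l', 'o', 'r', 's', 't', 'w']
print(tokens)      # ['s', 'l', 'o', 'w', 'e', 's', 't']
print(all(ch in char_vocab for ch in tokens))  # True

# The price: sequences get much longer than with word tokens.
print(len(corpus.split()), len(list(corpus)))  # 3 16
```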

### Subword tokenization

1. Character tokens solve the OOV problem but the length of the input and output sentences increases rapidly as we are representing a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationship between the characters to form meaningful words.

2. This brings us to another approach known as subword tokenization, which sits between word and character tokenization.
3. Subword tokenization splits a piece of text into subwords (or character n-grams). For example, words like lower can be segmented as low-er, smartest as smart-est, and so on.

4. Transformer-based models, the state of the art (SOTA) in NLP, rely on subword tokenization algorithms to build their vocabulary. One of the most popular subword tokenization algorithms is Byte Pair Encoding (BPE).

> At test time, the OOV word is split into sequences of characters. Then the learned operations are applied to merge the characters into larger known symbols.
>
> – Neural Machine Translation of Rare Words with Subword Units, 2016
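
The quoted procedure can be sketched as a toy BPE learner. This is a minimal illustration, not the implementation used by any library: the four-word corpus with frequencies, the `</w>` end-of-word marker, the choice of three merges, and the helper names are all assumptions for this example. Ties between equally frequent pairs are broken by first occurrence.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse every occurrence of the pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def segment(word, merges):
    """Split an unseen word into characters, then replay the learned merges."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Toy corpus: words pre-split into characters, with frequencies.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair, first seen wins ties
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)                     # [('e', 's'), ('es', 't'), ('est', '</w>')]
print(segment("lowest", merges))  # ['l', 'o', 'w', 'est</w>']
```

"lowest" never appears in the toy corpus, yet it is segmented into known symbols instead of becoming an unknown token.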

# Tokenization using spaCy

In [None]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

In [15]:
import spacy
from spacy.symbols import ORTH

# direct use
nlp = spacy.load("en_core_web_sm")

text = '''Apple is looking at buying "U.K." startup for $1 billion!'''
doc = nlp(text)
print([w.text for w in doc])

['Apple', 'is', 'looking', 'at', 'buying', '"', 'U.K.', '"', 'startup', 'for', '$', '1', 'billion', '!']


In [8]:
doc2=nlp("Sachin_gupta is working as Leaddatascientist")
doc2

Sachin_gupta is working as Leaddatascientist

In [9]:
print("\n======= Tokens =======")
# tokens
for token in doc2:
    print(token.text)


Sachin_gupta
is
working
as
Leaddatascientist


In [6]:
print("\n======= Tokens =======")
# tokens
for token in doc:
    print(token.text)


Apple
is
looking
at
buying
"
U.K.
"
startup
for
$
1
billion
!


In [7]:
# token explanation
print("\n======= Tokenization explanation =======")
tok_exp = nlp.tokenizer.explain(text)
for t in tok_exp:
    print(t[1], "\t", t[0])


Apple 	 TOKEN
is 	 TOKEN
looking 	 TOKEN
at 	 TOKEN
buying 	 TOKEN
" 	 PREFIX
U.K. 	 TOKEN
" 	 SUFFIX
startup 	 TOKEN
for 	 TOKEN
$ 	 PREFIX
1 	 TOKEN
billion 	 TOKEN
! 	 SUFFIX


In [10]:
# NOTE: Detokenization without doc is difficult in spacy. 

print("\n======= Tokens information =======")
# spacy offers a lot of other information along with tokens
for token in doc:
    print(f"""token: {token.text},\
    lemmatization: {token.lemma_},\
    pos: {token.pos_},\
    is_alpha: {token.is_alpha},\
    is_stopword: {token.is_stop}""")


token: Apple,    lemmatization: Apple,    pos: PROPN,    is_alpha: True,    is_stopword: False
token: is,    lemmatization: be,    pos: AUX,    is_alpha: True,    is_stopword: True
token: looking,    lemmatization: look,    pos: VERB,    is_alpha: True,    is_stopword: False
token: at,    lemmatization: at,    pos: ADP,    is_alpha: True,    is_stopword: True
token: buying,    lemmatization: buy,    pos: VERB,    is_alpha: True,    is_stopword: False
token: ",    lemmatization: ",    pos: PUNCT,    is_alpha: False,    is_stopword: False
token: U.K.,    lemmatization: U.K.,    pos: PROPN,    is_alpha: False,    is_stopword: False
token: ",    lemmatization: ",    pos: PUNCT,    is_alpha: False,    is_stopword: False
token: startup,    lemmatization: startup,    pos: NOUN,    is_alpha: True,    is_stopword: False
token: for,    lemmatization: for,    pos: ADP,    is_alpha: True,    is_stopword: True
token: $,    lemmatization: $,    pos: SYM,    is_alpha: False,    is_stopword: False
to

In [11]:
print("\n======= Customization =======")
# customization
doc = nlp("gimme that")  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']


['gimme', 'that']


In [12]:
# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

In [13]:
# Check new tokenization
print([w.text for w in nlp("gimme that")])  # ['gim', 'me', 'that']

['gim', 'me', 'that']


In [14]:
# The special case rules have precedence over the punctuation splitting
doc = nlp(".....gimme!!!! that")    # phrase to tokenize
print([w.text for w in doc])    # ['.....', 'gim', 'me', '!', '!', '!', '!', 'that']

['.....', 'gim', 'me', '!', '!', '!', '!', 'that']
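
The behaviour seen above (whitespace splitting, peeling off prefixes and suffixes, and special cases that override the default) can be imitated in plain Python. The sketch below is a simplified stand-in and not spaCy's actual implementation: the `PREFIXES`, `SUFFIXES`, and `SPECIAL_CASES` tables and the `tokenize` function are assumptions written for this illustration.

```python
# Hypothetical rule tables, far smaller than what a real tokenizer ships with.
PREFIXES = set('"$(')
SUFFIXES = set('"!?.,)')
SPECIAL_CASES = {"U.K.": ["U.K."], "gimme": ["gim", "me"]}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel punctuation off both ends until a special case or a plain word remains.
        while chunk and chunk not in SPECIAL_CASES:
            if chunk[0] in PREFIXES:
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIXES:
                suffixes.append(chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefixes)
        if chunk:
            tokens.extend(SPECIAL_CASES.get(chunk, [chunk]))
        tokens.extend(reversed(suffixes))
    return tokens

print(tokenize('Apple is looking at buying "U.K." startup for $1 billion!'))
print(tokenize("gimme that"))  # ['gim', 'me', 'that']
```

The special-case lookup runs before any further peeling, which is why "U.K." keeps its final period while "billion!" loses its exclamation mark.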


# TorchText Tokenization

In [None]:
#Setting up pytorch environment
!pip install torch===1.4.0 torchvision===0.5.0 -f https://download.pytorch.org/whl/torch_stable.

# Then install torchtext
!pip install torchtext

In [2]:
import os
import sentencepiece as spm

import torchtext
from torchtext.data import get_tokenizer

In [None]:
tokenizer = get_tokenizer("spacy")
spacy_tokens = tokenizer("You can now install TorchText using pip!")
print(f"Spacy tokens: {spacy_tokens}")  # ['You', 'can', 'now', 'install', 'TorchText', 'using', 'pip', '!']


In [None]:

tokenizer = get_tokenizer("basic_english")
basic_english_tokens = tokenizer("You can now install TorchText using pip!")
print(f"Basic English tokens: {basic_english_tokens}") # ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
# note that all the tokens are converted into lowercase

In [None]:

tokenizer = get_tokenizer("moses")
moses_tokens = tokenizer("You can now install TorchText using pip!")
print(f"Moses tokens: {moses_tokens}")  # ['You', 'can', 'now', 'install', 'TorchText', 'using', 'pip', '!']

### Custom tokenizer
Let's see how to plug a SentencePiece tokenizer into torchtext.

In [None]:
DATAFILE = '../data/pg16457.txt'
MODELDIR = 'models'
os.makedirs(MODELDIR, exist_ok=True)  # make sure the output directory exists before training

spm.SentencePieceTrainer.train(f'''\
    --model_type=bpe\
    --input={DATAFILE}\
    --model_prefix={MODELDIR}/bpe\
    --vocab_size=500''')

sp = spm.SentencePieceProcessor()
sp.load(os.path.join(MODELDIR, 'bpe.model'))

In [None]:
def custom_tokenizer(sentence):
    return sp.encode_as_pieces(sentence)

In [None]:
# A custom tokenizer must be a callable that takes a single string
# and returns the list of tokens for that string.
tokenizer = get_tokenizer(custom_tokenizer)
sp_tokens = tokenizer("You can now install TorchText using pip!")
print(f"sp tokens: {sp_tokens}")  # ['▁', 'Y', 'ou', '▁can', '▁now', '▁in', 'st', 'all', '▁T', 'or', 'ch', 'T', 'e', 'x', 't', '▁us', 'ing', '▁p', 'i', 'p', '!']
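
To make that contract concrete, here is a minimal stand-in for the dispatch shown in the cells above, written without torchtext so it runs anywhere. The names `get_tokenizer_sketch` and `basic_english_sketch` are hypothetical, and the simplified rule (lowercase, then split off a few punctuation marks) only approximates what the real `basic_english` tokenizer does.

```python
import re

def basic_english_sketch(text):
    """Rough approximation: lowercase and isolate common punctuation."""
    text = re.sub(r"([.,!?()])", r" \1 ", text.lower())
    return text.split()

def get_tokenizer_sketch(tokenizer):
    """Return callables unchanged; map known names to built-in functions."""
    if callable(tokenizer):
        return tokenizer
    if tokenizer == "basic_english":
        return basic_english_sketch
    raise ValueError(f"unknown tokenizer: {tokenizer!r}")

sentence = "You can now install TorchText using pip!"

by_name = get_tokenizer_sketch("basic_english")
print(by_name(sentence))
# ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']

# Any function from str to list of str satisfies the custom-tokenizer contract.
by_callable = get_tokenizer_sketch(str.split)
print(by_callable(sentence))
# ['You', 'can', 'now', 'install', 'TorchText', 'using', 'pip!']
```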

In [2]:
#!pip install sentencepiece



Type of Subword Embedding
- https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46


In [4]:
import os
import sentencepiece as spm

In [5]:
DATAFILE = 'E:/Sachin/Learning/AI_Learning/7.NLP/100DaysNLP/100-Days-of-NLP-master/100-Days-of-NLP-master/data/pg16457.txt'
MODELDIR = 'models'
os.makedirs(MODELDIR, exist_ok=True)  # make sure the output directory exists before training


# Subword embedding
It helps to resolve the following two problems:
1. Out-of-vocabulary (OOV) words
2. Spelling mistakes

## Byte Pair Encoding tokenizer

In [6]:
spm.SentencePieceTrainer.train(f'''\
    --model_type=bpe\
    --input={DATAFILE}\
    --model_prefix={MODELDIR}/bpe\
    --vocab_size=500''')

In [7]:
sp = spm.SentencePieceProcessor()
sp.load(os.path.join(MODELDIR, 'bpe.model'))

True

In [8]:
input_string = "This is a test"

# encode: text => id
print(sp.encode_as_pieces(input_string))    # ['▁T', 'h', 'is', '▁is', '▁a', '▁t', 'est']
print(sp.encode_as_ids(input_string))       # [72, 435, 26, 101, 5, 3, 153]

['▁T', 'h', 'is', '▁is', '▁a', '▁t', 'est']
[72, 435, 26, 101, 5, 3, 153]


In [9]:
# decode: id => text
print(sp.decode_pieces(['▁T', 'h', 'is', '▁is', '▁a', '▁t', 'est']))    # This is a test
print(sp.decode_ids([72, 435, 26, 101, 5, 3, 153]))                       # This is a test

This is a test
This is a test


In [10]:
# returns vocab size
print(f"vocab size: {sp.get_piece_size()}")

# id <=> piece conversion
print(f"id 101 to piece: {sp.id_to_piece(101)}")
print(f"Piece ▁is to id: {sp.piece_to_id('▁is')}")

vocab size: 500
id 101 to piece: ▁is
Piece ▁is to id: 101


- You can see from the code that we used the `id_to_piece` function, which turns the ID of a token into its corresponding textual representation.

- This is important since SentencePiece enables the subword process to be reversible.
- You can encode your test sentence as IDs or as subword tokens; which you use is up to you.
- The key is that you can decode either the IDs or the tokens perfectly back into the original sentence, including the original spaces.
- Previously this was not possible with other tokenizers, since they just provided the tokens and it was not clear exactly what encoding scheme was used, e.g. how they dealt with spaces or punctuation. This is a big selling point for SentencePiece.

In [None]:
tokens = ['▁T', 'h', 'is', '▁is', '▁a', '▁t', 'est']
merged = "".join(tokens).replace('▁', " ").strip()
assert merged == input_string, "Input string and detokenized sentence didn't match"
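
Why marking spaces makes decoding lossless can be shown without any trained model. The sketch below is an illustration of the idea only: it replaces each space with the meta symbol, then cuts the string into arbitrary fixed-size chunks as a stand-in for real subword segmentation. The functions `to_pieces` and `from_pieces` are hypothetical, and the sketch assumes the input text does not itself contain the meta symbol.

```python
META = "\u2581"  # the meta symbol used to mark spaces

def to_pieces(text, size=3):
    """Mark spaces (plus a dummy leading one), then chunk arbitrarily."""
    marked = META + text.replace(" ", META)
    return [marked[i:i + size] for i in range(0, len(marked), size)]

def from_pieces(pieces):
    """Concatenate, restore spaces, and drop the dummy leading space."""
    return "".join(pieces).replace(META, " ")[1:]

original = "This  is a test"  # note the double space
pieces = to_pieces(original)
restored = from_pieces(pieces)

print(pieces)
print(restored == original)  # True
```

Because whitespace is carried inside the pieces, the round trip works no matter where the piece boundaries fall, including the double space.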

In [14]:
# <unk>, <s>, </s> are defined by default. Their ids are (0, 1, 2)
# <s> and </s> are defined as 'control' symbol.
# control symbol: We only reserve ids for these tokens. Even if these tokens appear in the input text, 
#they are not handled as one token. User needs to insert ids explicitly after encoding.
for id in range(5):
  print(sp.id_to_piece(id), sp.is_control(id))

<unk> False
<s> True
</s> True
▁t False
he False


- We can define special tokens (symbols) to tweak the DNN behavior through the tokens. Typical examples are BERT's special symbols, e.g., [SEP] and [CLS].

- There are two types of special tokens:

  - User-defined symbols: always treated as one token in any context. These symbols can appear in the input sentence.
  - Control symbols: only ids are reserved for these tokens. Even if they appear in the input text, they are not handled as one token. The user needs to insert their ids explicitly after encoding.
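
The difference between the two kinds of symbols can be pictured with a small pre-splitting step. This is a conceptual sketch, not how SentencePiece is implemented internally: the `USER_DEFINED` list and the `protect_user_symbols` function are assumptions for illustration.

```python
import re

# Symbols that must always survive as a single token (hypothetical list).
USER_DEFINED = ["<sep>", "<cls>"]
_pattern = re.compile("(" + "|".join(re.escape(s) for s in USER_DEFINED) + ")")

def protect_user_symbols(text):
    """Split text so that each user-defined symbol becomes its own span."""
    return [part for part in _pattern.split(text) if part]

print(protect_user_symbols("this is a test<sep> hello world<cls>"))
# ['this is a test', '<sep>', ' hello world', '<cls>']

# A control symbol such as <s> gets no such protection: in the text it is
# ordinary characters, so it has to be added by id after encoding.
print(protect_user_symbols("<s>hello"))
# ['<s>hello']
```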

### Refer to this for more details: https://colab.research.google.com/github/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb#scrollTo=dngckiPMcWbA

In [16]:
# Example of user-defined symbols
spm.SentencePieceTrainer.train(f'''\
    --model_type=bpe\
    --input={DATAFILE}\
    --model_prefix={MODELDIR}/bpe_user\
    --user_defined_symbols=<sep>,<cls>\
    --vocab_size=500''')
sp_user = spm.SentencePieceProcessor()
sp_user.load(os.path.join(MODELDIR, 'bpe_user.model'))

True

In [17]:
# ids are reserved in both modes.
# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4
# User-defined symbols are allowed to appear in the text.
print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>')) # ['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁wor', 'ld', '<cls>']
print(sp_user.piece_to_id('<sep>'))  # 3
print(sp_user.piece_to_id('<cls>'))  # 4
print('3=', sp_user.decode_ids([3]))  # decoded to <sep>
print('4=', sp_user.decode_ids([4]))  # decoded to <cls>

['▁this', '▁is', '▁a', '▁t', 'est', '<sep>', '▁he', 'll', 'o', '▁wor', 'ld', '<cls>']
3
4
3= <sep>
4= <cls>


In [18]:
print('bos=', sp_user.bos_id())     # 1
print('eos=', sp_user.eos_id())     # 2
print('unk=', sp_user.unk_id())     # 0
print('pad=', sp_user.pad_id())     # -1, disabled by default

bos= 1
eos= 2
unk= 0
pad= -1


In [19]:
print(sp_user.encode_as_ids('Hello world'))     # [189, 320, 430, 233, 71]

[189, 320, 430, 233, 71]


In [20]:
# Prepend or append bos/eos ids.
print([sp_user.bos_id()] + sp_user.encode_as_ids('Hello world') + [sp_user.eos_id()])   # [1, 189, 320, 430, 233, 71, 2]

[1, 189, 320, 430, 233, 71, 2]
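
Adding the boundary ids usually goes together with truncation and padding to a fixed length. The helper below is a minimal sketch of that step in plain Python; the function name `prepare` is hypothetical, and the pad id `3` is purely illustrative (padding is disabled in the models trained above, where `pad_id()` returns -1).

```python
def prepare(ids, bos_id, eos_id, pad_id, max_len):
    """Truncate, wrap in bos/eos, then right-pad to exactly max_len ids."""
    if max_len < 2:
        raise ValueError("max_len must leave room for bos and eos")
    body = ids[: max_len - 2]  # keep room for the two boundary ids
    seq = [bos_id] + body + [eos_id]
    return seq + [pad_id] * (max_len - len(seq))

ids = [189, 320, 430, 233, 71]  # 'Hello world' as encoded above

print(prepare(ids, bos_id=1, eos_id=2, pad_id=3, max_len=10))
# [1, 189, 320, 430, 233, 71, 2, 3, 3, 3]
print(prepare(ids, bos_id=1, eos_id=2, pad_id=3, max_len=5))
# [1, 189, 320, 430, 2]
```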


### BPE Example in BERT Tokenization
https://colab.research.google.com/github/pythonvirus/AI-Learning/blob/master/Inspect_BERT_Vocabulary.ipynb

## UniGram Tokenization

In [4]:
spm.SentencePieceTrainer.train(f'''\
    --model_type=unigram\
    --input={DATAFILE}\
    --model_prefix={MODELDIR}/uni\
    --vocab_size=500''')

In [5]:
sp = spm.SentencePieceProcessor()
sp.load(os.path.join(MODELDIR, 'uni.model'))

True

In [6]:
input_string = "This is a test"

In [7]:
# encode: text => id
# A space is encoded as the meta symbol '▁' (U+2581), which is not a plain underscore.
# By default, a space is added at the start of the input sentence.
print(sp.encode_as_pieces(input_string))    # ['▁This', '▁is', '▁a', '▁t', 'est']
print(sp.encode_as_ids(input_string))       # [371, 77, 13, 101, 181]

['▁This', '▁is', '▁a', '▁t', 'est']
[371, 77, 13, 101, 181]


In [9]:
# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))    # This is a test
print(sp.decode_ids([371, 77, 13, 101, 181]))      

This is a test
This is a test


In [10]:
# returns vocab size
print(f"vocab size: {sp.get_piece_size()}")

vocab size: 500


In [11]:
# id <=> piece conversion
print(f"id 371 to piece: {sp.id_to_piece(371)}")
print(f"Piece ▁This to id: {sp.piece_to_id('▁This')}")

id 371 to piece: ▁This
Piece ▁This to id: 371


### Summary
- SentencePiece makes the subword process reversible.
- You can encode your test sentence as IDs or as subword tokens; which you use is up to you.
- The key is that you can decode either the IDs or the tokens perfectly back into the original sentence, including the original spaces.
- Previously this was not possible with other tokenizers, since they just provided the tokens and it was not clear exactly what encoding scheme was used, e.g. how they dealt with spaces or punctuation. This is a big selling point for SentencePiece.

In [12]:

tokens = ['▁This', '▁is', '▁a', '▁t', 'est']
merged = "".join(tokens).replace('▁', " ").strip()
assert merged == input_string, "Input string and detokenized sentence didn't match"

In [13]:
merged

'This is a test'

In [8]:
sp.tokenize('This is demo')

[371, 77, 94, 21, 9]

In [9]:
sp.tokenize('Sachin Gupta')

[138, 11, 110, 39, 323, 272, 8, 11]

In [10]:
# Unlike word-level tokenizers, this does not fail on out-of-vocabulary words:
# an unseen word is split into smaller pieces that are in the vocabulary.
for i in [138, 11, 110, 39, 323, 272, 8, 11]:
    print(sp.decode_ids([i]))

S
a
ch
in
G
up
t
a


In [11]:
for i in sp.tokenize('Banctec Datascience Team Rocks'):
    print(sp.decode_ids([i]))

B
an
c
t
ec
D
at
as
ci
ence
T
e
a
m

R
o
ck
s
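
The fallback to smaller pieces can be imitated with a greedy longest-match segmenter. Note the swap: SentencePiece's unigram model chooses a segmentation by probability, whereas this sketch simply takes the longest known piece at each position, so it is a simplified stand-in. The `PIECES` set reuses a few pieces from the output above, and `greedy_segment` is a hypothetical helper.

```python
def greedy_segment(word, pieces, unk="<unk>"):
    """At each position take the longest known piece; emit unk for unknown characters."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:  # no piece starts here, not even a single character
            out.append(unk)
            i += 1
    return out

# A tiny piece inventory (illustrative subset).
PIECES = {"S", "a", "ch", "in", "G", "up", "t"}

print(greedy_segment("Sachin", PIECES))  # ['S', 'a', 'ch', 'in']
print(greedy_segment("Gupta", PIECES))   # ['G', 'up', 't', 'a']
print(greedy_segment("Sam", PIECES))     # ['S', 'a', '<unk>']
```

Only a character missing from the inventory altogether falls back to the unknown token, which is far rarer than an unseen whole word.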
