# Tokenization 


Tokenization is a fundamental step in Natural Language Processing (NLP) that involves splitting text into smaller units called tokens. These tokens can be words, subwords, or characters, and they are the building blocks for processing textual data in machine learning models.  The tokenizers library by HuggingFace offers a fast and efficient way to tokenize text, handling large datasets and integrating seamlessly with the transformers library.

However, tokenization can be challenging, particularly when dealing with punctuation and special characters. 

Standard tokenizers often split text at punctuation marks, which can lead to the loss of meaningful tokens, such as emoticons (e.g., :), ;)) and specific emoji representations (e.g., :thumbsup:).


Here I only highlight one problem that is specific to this use case. For a much more in-depth take on the matter, see Andrej Karpathy's amazing video about tokenization:

https://www.youtube.com/watch?v=zduSFxRajkE



In [2]:
%pip install tensorflow keras pandas scikit-learn nltk transformers datasets emoji


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Handling Smileys and Punctuation in Tokenization

The main issue with smileys in tokenization is that many tokenizers treat their characters as punctuation marks, and punctuation marks as delimiters.

This means that sequences like :) or ;) might be split into separate tokens, which can alter their intended meaning. 

For example, :) could be tokenized into : and ), or even removed entirely, losing the smiley's semantic value.
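
To make the splitting behaviour concrete, here is a minimal sketch that uses only Python's `re` module. The `naive_tokenize` function is a hypothetical stand-in for a punctuation-splitting tokenizer; it is not the implementation of any particular library.

```python
import re

def naive_tokenize(text):
    """Hypothetical tokenizer: words stay whole, each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

happy = naive_tokenize("Oh what a day :)")
sad = naive_tokenize("Oh what a day :(")

# The smiley never appears as a single token, only as its separate characters
print(happy)
print(sad)
```

Here the two sentences still differ, but only through a lone `)` or `(` token, which carries very little meaning by itself.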

## Example using tensorflow.keras Tokenizer

Let's fit a tokenizer and test it on two sentences with opposite sentiments:
    
```text
Oh what a day :)
Oh what a day :(
```

In [102]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "Oh what a day :)",
    "Oh what a day :(",
]

# create a tokenizer
bad_tokenizer = Tokenizer(num_words=256, oov_token="<UNK>")

# train the tokenizer on the sentences
bad_tokenizer.fit_on_texts(sentences)

# tokenize the sentences
tokenized =  bad_tokenizer.texts_to_sequences(sentences)

print(sentences[0], "==>", tokenized[0])
print(sentences[1], "==>", tokenized[1])

# the two sentences are not equal but the tokenized versions are...
# hmmm we've just lost the meaning in the process
print("Are equal : ",   tokenized[0] == tokenized[1])


Oh what a day :) ==> [2, 3, 4, 5]
Oh what a day :( ==> [2, 3, 4, 5]
Are equal :  True


## Let's Try to Add Them to the Vocabulary

By adding them to the vocabulary we would expect the tokenizer to now consider them as their own tokens.

So let's try that out...

In [105]:
additional_tokens = [
    # smileys
    ":)", ";)", ":P", ":D", ":(", ":'(", ":O", ":/", ":|", ":*", ":@", ">:(", 
]


bad_tokenizer = Tokenizer(num_words=256, oov_token="<UNK>")
bad_tokenizer.fit_on_texts(sentences)
bad_tokenizer.word_index.update({token: len(bad_tokenizer.word_index) + i + 1 for i, token in enumerate(additional_tokens)})

# lets take a look at our word index
bad_tokenizer.word_index

{'<UNK>': 1,
 'oh': 2,
 'what': 3,
 'a': 4,
 'day': 5,
 ':)': 6,
 ';)': 7,
 ':P': 8,
 ':D': 9,
 ':(': 10,
 ":'(": 11,
 ':O': 12,
 ':/': 13,
 ':|': 14,
 ':*': 15,
 ':@': 16,
 '>:(': 17}

In [108]:
# The word index above looks good: each smiley has its own unique index

# re-tokenize with the updated word index (the result must be stored, otherwise we print stale values)
tokenized = bad_tokenizer.texts_to_sequences(sentences)

print(sentences[0], "==>", tokenized[0])
print(sentences[1], "==>", tokenized[1])

# the two sentences are not equal but the tokenized versions still are...
print("Are equal : ",   tokenized[0] == tokenized[1], " <--- arfff... they are equal again")

Oh what a day :) ==> [2, 3, 4, 5]
Oh what a day :( ==> [2, 3, 4, 5]
Are equal :  True  <--- arfff... they are equal again


So the problem remains despite our attempt to extend the vocabulary... Why?

Because ; : ( and ) are treated as punctuation, that is, as separators between words, and are not tokenized themselves. This tokenizer considers them equivalent to whitespace, so the smileys are gone before the word index is ever consulted.
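
The effect can be reproduced without TensorFlow. The sketch below assumes the Keras `Tokenizer` defaults (a `filters` string of punctuation characters, lowercasing, and splitting on whitespace). `keras_like_split` is a hypothetical helper that imitates that behaviour; it is not the library's actual code.

```python
# Assumed default value of the Keras Tokenizer `filters` argument
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def keras_like_split(text, filters=KERAS_DEFAULT_FILTERS):
    """Imitate the default cleaning: lowercase, turn filtered characters into spaces, split."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

happy = keras_like_split("Oh what a day :)")
sad = keras_like_split("Oh what a day :(")

print(happy)
print(sad)
print(happy == sad)

# With an empty filter string the smiley survives as a single token
print(keras_like_split("Oh what a day :)", filters=""))
```

Passing a narrower `filters` string to the real `Tokenizer` is one possible workaround, but punctuation then stays glued to neighbouring words (for example `day,`), which is why a preprocessing step is often the more predictable option.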

## Solution 1 : Text Preprocessing

To address this, one effective approach is to preprocess the text by substituting special tokens with placeholders before tokenization.

This ensures that these tokens are treated as single units and preserved during the tokenization process.

Preprocessing involves scanning the text for special tokens and replacing them with unique placeholders.

These placeholders are then tokenized as single units. After tokenization, the placeholders can be mapped back to their original forms if needed.

This method uses regular expressions (regex) to identify and replace the tokens efficiently.

Steps:
1. Define Special Tokens: List all special tokens (e.g., :), ;), :thumbsup:).
2. Create Placeholders: Generate unique placeholders for each special token.
3. Replace Tokens with Placeholders: Use regex to substitute special tokens in the text with their corresponding placeholders.
4. Tokenize: Apply the tokenizer to the preprocessed text.
5. Map Placeholders Back: Optionally, convert placeholders back to the original tokens after tokenization.
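
The five steps above can be sketched end to end with the standard library alone. The token list, the `SPT{i}` placeholder format, and the `protect` / `restore` helpers are hypothetical names chosen for illustration, and `str.split` stands in for a real tokenizer.

```python
import re

# Steps 1-2: hypothetical special tokens and their placeholders
special_tokens = [":)", ":(", ":thumbsup:"]
to_placeholder = {tok: f"SPT{i}" for i, tok in enumerate(special_tokens)}
from_placeholder = {ph: tok for tok, ph in to_placeholder.items()}

pattern = re.compile("|".join(re.escape(tok) for tok in special_tokens))

def protect(text):
    """Step 3: swap each special token for its placeholder."""
    return pattern.sub(lambda m: to_placeholder[m.group(0)], text)

def restore(tokens):
    """Step 5: map placeholders back to the original special tokens."""
    return [from_placeholder.get(tok, tok) for tok in tokens]

protected = protect("Nice work :thumbsup: :)")
tokens = protected.split()  # Step 4: stand-in for a real tokenizer
restored = restore(tokens)

print(protected)
print(restored)
```

The placeholders here are purely alphanumeric so that they survive punctuation filtering unchanged. The trade-off is that a placeholder could collide with a real word in the corpus, so it is worth picking a string that is unlikely to occur in the data.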

In [128]:
from tensorflow.keras.preprocessing.text import Tokenizer
import re

# Assuming additional_tokens is already defined

# Define a function to preprocess texts and preserve special tokens
def preprocess_texts(texts, additional_tokens):
    token_dict = {token: f"<|SPT{i}|>" for i, token in enumerate(additional_tokens)}
    pattern = re.compile(r'(' + '|'.join(re.escape(token) for token in additional_tokens) + r')')
    
    def replace_tokens(text):
        return pattern.sub(lambda match: token_dict[match.group(0)], text)
    
    preprocessed_texts = [replace_tokens(text) for text in texts]
    return preprocessed_texts, token_dict

# Prepare the tokenizer
tokenizer = Tokenizer(num_words=100000, oov_token="<UNK>")

# Replace the smileys in the sentences with placeholders so they survive tokenization
preprocessed_sentences, token_dict = preprocess_texts(sentences, additional_tokens)

# Fit the tokenizer on the preprocessed sentences
tokenizer.fit_on_texts(preprocessed_sentences)

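# Keras lowercases text and strips punctuation, so "<|SPT0|>" is stored in the
# word index as "spt0". Leave those keys untouched (texts_to_sequences needs them)
# and build a reverse map from index to the original smiley instead.
index_to_smiley = {
    tokenizer.word_index[placeholder.strip("<|>").lower()]: token
    for token, placeholder in token_dict.items()
    if placeholder.strip("<|>").lower() in tokenizer.word_index
}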


print("Vocabulary : ", tokenizer.word_index)

# Preprocess the sample texts
preprocessed_sample_texts, _ = preprocess_texts(sentences, additional_tokens)
print("Preprocessed : ", preprocessed_sample_texts)
# Tokenize the preprocessed sample texts
tokenized = tokenizer.texts_to_sequences(preprocessed_sample_texts)

# Print the tokenized sequences
print("Tokenized : ", tokenized)

print("Are equal : ",   tokenized[0] == tokenized[1], " <--- seems like now we know if you had a good or bad day")

Vocabulary :  {'<UNK>': 1, 'oh': 2, 'what': 3, 'a': 4, 'day': 5, 'spt0': 6, 'spt4': 7}
Preprocessed :  ['Oh what a day <|SPT0|>', 'Oh what a day <|SPT4|>']
Tokenized :  [[2, 3, 4, 5, 6], [2, 3, 4, 5, 7]]
Are equal :  False  <--- seems like now we know if you had a good or bad day
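
One detail of the pattern built in `preprocess_texts` deserves a small sketch: regex alternation tries its alternatives from left to right, so if one special token is a prefix of another, the shorter one can win. The token list below is hypothetical, and sorting by length is one common way to handle this; it is not something the code above does.

```python
import re

tokens = [":)", ":))"]  # hypothetical: the first token is a prefix of the second

def build_pattern(special_tokens, longest_first):
    """Build an alternation pattern, optionally putting longer tokens first."""
    ordered = sorted(special_tokens, key=len, reverse=True) if longest_first else special_tokens
    return re.compile("|".join(re.escape(tok) for tok in ordered))

text = "great :))"
as_listed = build_pattern(tokens, longest_first=False).findall(text)
sorted_matches = build_pattern(tokens, longest_first=True).findall(text)

print(as_listed)       # the shorter token matches and a stray ")" is left behind
print(sorted_matches)  # the longer token matches as a whole
```

The smiley list used in this notebook does not contain such a prefix pair, so the original pattern behaves as intended there.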


## Solution 2 : Using the Transformers Library

The transformers library by HuggingFace provides robust tools for tokenization, including the ability to add and preserve custom tokens. 

This solution reuses an already trained tokenizer, so we don't have to find enough training text to avoid too many <UNK> tokens.

Here's how you can use the transformers library to handle special tokens effectively:

In [136]:
from transformers import BertTokenizerFast

# Initialize the tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Add additional tokens to the tokenizer
tokenizer.add_tokens(additional_tokens)

# Tokenize sample texts
tokenized_texts = tokenizer(sentences, is_split_into_words = False)


# Print the tokenized sequences
print(tokenized_texts[0].ids)
print(tokenized_texts[1].ids)

# compare the ids produced by this tokenizer (not the `tokenized` list left over from Solution 1)
print("Are equal : ",   tokenized_texts[0].ids == tokenized_texts[1].ids, " <--- seems like now we know if you had a good or bad day")

# note that the tokenized sequences now also start with [CLS] and end with [SEP] tokens to indicate the beginning and end of the sequences

[101, 2821, 2054, 1037, 2154, 30522, 102]
[101, 2821, 2054, 1037, 2154, 30526, 102]
Are equal :  False  <--- seems like now we know if you had a good or bad day
