# What is Tokenization?

In **natural language processing (NLP)**, **tokenization** is the process of breaking down a text into smaller units, called **tokens**. These tokens can be:

- **Words** (e.g., "I love programming" → ["I", "love", "programming"])
- **Subwords** or morphemes (e.g., "unbelievable" → ["un", "believ", "able"])
- **Characters** (e.g., "hello" → ["h", "e", "l", "l", "o"])
- **Sentences** (splitting text into sentence-level tokens)

Tokenization is a crucial preprocessing step for text data because most machine learning models cannot process raw text directly. They need numerical input, and tokenization is the first step toward it: the text is split into tokens, and each token is then mapped to an integer ID.
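
To make the idea of numerical input concrete, here is a minimal sketch of how tokens might be mapped to integer IDs. The helpers `build_vocab` and `encode` and the `<unk>` placeholder are illustrative names chosen for this example, not part of any library.

```python
def build_vocab(tokens, unk_token="<unk>"):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {unk_token: 0}  # ID 0 is reserved for unknown tokens (an assumption of this sketch)
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="<unk>"):
    """Replace each token with its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


vocab = build_vocab(["I", "love", "programming"])
ids = encode(["I", "love", "NLP"], vocab)
print(ids)  # [1, 2, 0] because "NLP" is not in the vocabulary
```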

---

### Why is Tokenization Important?

1. **Understanding Structure**: It helps the model recognize the components of a text.
2. **Standardization**: Tokenization standardizes text input for further processing.
3. **Feature Extraction**: Tokens serve as the features that models use to learn patterns in the data.
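
As a rough illustration of tokens acting as features, the sketch below counts how often each token occurs (a simple bag-of-words representation). The function name `bag_of_words` and the lowercasing step are assumptions made for this example.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count occurrences of each lowercased token."""
    return Counter(token.lower() for token in tokens)


features = bag_of_words(["I", "love", "NLP", "and", "I", "love", "Python"])
print(features["love"])    # 2
print(features["python"])  # 1
```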

---

### Types of Tokenization

#### 1. **Word Tokenization**
   - Splits text into words.
   - Example: `"I love NLP"` → `["I", "love", "NLP"]`
   - Challenges:
     - Handling contractions like "don't" → ["do", "n't"].
     - Treating punctuation as separate tokens.
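
One possible way to handle both challenges with a single regular expression is sketched below. It uses only the standard library, and unlike the `["do", "n't"]` split shown above, this particular pattern keeps contractions whole; that is a design choice of the sketch, not a standard.

```python
import re

# A word, optionally followed by an apostrophe part, OR a single punctuation mark.
WORD_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into words and standalone punctuation tokens."""
    return WORD_PATTERN.findall(text)


tokens = simple_word_tokenize("I don't love bugs!")
print(tokens)  # ['I', "don't", 'love', 'bugs', '!']
```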

#### 2. **Subword Tokenization**
   - Breaks words into smaller units when the word is not in the vocabulary.
   - Common in deep learning models like BERT or GPT.
   - Example: `"unbelievable"` → `["un", "##believ", "##able"]` (BERT-style; illustrative only, since the actual split depends on the model's vocabulary)
   - Benefits:
     - Reduces the size of the vocabulary.
     - Handles unseen words effectively.
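
The sketch below shows the greedy longest-match-first idea behind WordPiece-style tokenization, using a tiny hand-made vocabulary. The vocabulary, the function name `wordpiece_tokenize`, and the `[UNK]` fallback are assumptions for illustration; a real model ships with a vocabulary of tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return [unk_token]  # no piece matched, so give up on the whole word
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"un", "##believ", "##able", "##a", "##ble"}
pieces = wordpiece_tokenize("unbelievable", toy_vocab)
print(pieces)  # ['un', '##believ', '##able']
```

Stripping the `##` markers and joining the pieces gives back the original word, which is why unseen words can still be represented.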

#### 3. **Character Tokenization**
   - Breaks text into individual characters.
   - Example: `"Hello"` → `["H", "e", "l", "l", "o"]`
   - Useful for languages with large character sets or when spelling matters.
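
In Python, character tokenization needs no library at all. The short sketch below also builds a character-to-ID table to show that the vocabulary stays small while the sequence gets longer; the variable names are illustrative.

```python
text = "Hello"
chars = list(text)  # one token per character

# Build a tiny vocabulary from the distinct characters, sorted for determinism.
char_to_id = {ch: i for i, ch in enumerate(sorted(set(chars)))}
char_ids = [char_to_id[ch] for ch in chars]

print(chars)     # ['H', 'e', 'l', 'l', 'o']
print(char_ids)  # [0, 1, 2, 2, 3]
```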

#### 4. **Sentence Tokenization**
   - Splits text into sentences.
   - Example: `"I love NLP. It’s fascinating."` → `["I love NLP.", "It’s fascinating."]`
   - Often requires handling punctuation correctly.
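
A naive rule, splitting after `.`, `!`, or `?` when whitespace follows, can be sketched as below. The function name `split_sentences` is made up for this example, and the second call shows why punctuation handling matters: abbreviations break the rule.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(split_sentences("I love NLP. It’s fascinating."))
# ['I love NLP.', 'It’s fascinating.']

print(split_sentences("Dr. Smith loves NLP."))
# ['Dr.', 'Smith loves NLP.']  (the abbreviation is wrongly treated as a sentence end)
```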

---

### Methods of Tokenization

#### **Rule-based Tokenization**
   - Uses predefined rules like splitting on spaces or punctuation.
   - Simple but struggles with edge cases (e.g., abbreviations, numbers, etc.).

#### **Statistical Tokenization**
   - Uses probabilities and patterns to decide where to split.
   - Example: WordPiece and Byte Pair Encoding (BPE).
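
To give a feel for how BPE learns its splits, the toy sketch below repeatedly merges the most frequent adjacent pair of symbols in a small word-frequency table. The function names and the three-word corpus are assumptions for illustration; production implementations add end-of-word markers, byte-level handling, and much larger corpora.

```python
from collections import Counter


def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Rewrite every word so that occurrences of `pair` become one symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merges and the corpus after applying them."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus


merges, corpus = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(corpus)  # {('low',): 5, ('low', 'e', 'r'): 2, ('low', 'e', 's', 't'): 3}
```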

#### **Neural Tokenization**
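   - Relies on machine learning models to learn how to tokenize.
   - Related tool: SentencePiece learns its subword vocabulary directly from raw text, without language-specific pre-tokenization. Its underlying algorithms (BPE and the unigram language model) are statistical rather than neural.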

---

### Applications of Tokenization

1. **Search Engines**: Breaking queries and documents into tokens for efficient matching.
2. **Machine Translation**: Tokenization ensures consistent splitting of words for alignment between languages.
3. **Chatbots and Assistants**: Tokenization helps analyze and process user input.
4. **Text Summarization and Classification**: Tokens serve as input for machine learning models.
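
As a small illustration of the search-engine case, the sketch below builds an inverted index that maps each token to the documents containing it. The helper names, the whitespace tokenizer, and the three sample documents are all assumptions made for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


docs = {
    1: "tokenization splits text",
    2: "search engines index text",
    3: "models read tokens",
}
index = build_index(docs)
print(search(index, "text"))        # {1, 2}
print(search(index, "index text"))  # {2}
```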

---

### Code Example in Python

Here’s how tokenization works with Python libraries like **NLTK** and **Hugging Face Transformers**:

#### Using NLTK
```python
import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data; recent NLTK releases may need 'punkt_tab' instead.
nltk.download('punkt')

text = "I love NLP. It’s amazing!"
tokens = word_tokenize(text)
print(tokens)
# Output: ['I', 'love', 'NLP', '.', 'It', '’', 's', 'amazing', '!']
```


#### Using Hugging Face Tokenizer
```python
from transformers import AutoTokenizer

# Load the pretrained tokenizer that ships with bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "I love NLP. It’s amazing!"
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['i', 'love', 'nl', '##p', '.', 'it', '’', 's', 'amazing', '!']
```

Note that the tokenizer lowercases the text and splits "nlp" into the subwords `nl` and `##p`, because the whole word is not in the model's vocabulary.

---

### Challenges in Tokenization

1. **Ambiguous Word Boundaries**: Languages and dialects mark word boundaries differently (e.g., Chinese doesn’t use spaces), so splitting on whitespace does not work everywhere.
2. **Compound Words**: Splitting or keeping compound words intact (e.g., "ice-cream").
3. **Context Sensitivity**: Properly splitting tokens in context (e.g., "U.S.A." vs "USA").
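
Two of these challenges are easy to reproduce with the standard library. The sketch below shows whitespace splitting failing on a Chinese sentence, and two alternative regular expressions that either keep or split a hyphenated compound; the sample sentence and both patterns are chosen only for illustration.

```python
import re

chinese = "我爱自然语言处理"  # "I love natural language processing"
whitespace_tokens = chinese.split()  # no spaces, so the whole sentence is one token
character_tokens = list(chinese)     # character-level fallback
print(len(whitespace_tokens), len(character_tokens))  # 1 8

compound = "ice-cream"
kept_whole = re.findall(r"\w+(?:-\w+)*", compound)  # treat hyphenated words as one token
split_up = re.findall(r"\w+|-", compound)           # treat the hyphen as its own token
print(kept_whole)  # ['ice-cream']
print(split_up)    # ['ice', '-', 'cream']
```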

---

### Summary

Tokenization is a foundational step in NLP, transforming raw text into manageable pieces for analysis. Advanced techniques such as subword tokenization and learned, data-driven methods have significantly improved the handling of diverse text data in modern AI systems.