<a href="https://colab.research.google.com/github/shivasmi07/NLP/blob/main/lab02/lab02_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 02: POS Tagging, Morphology, Lemmatization, Dependency Parsing, and Tokenization

## Introduction  
In this tutorial, we will explore some core NLP tasks using **spaCy**, a powerful and efficient Python library for NLP. Additionally, we will examine tokenization techniques used in modern language models.  

### Topics Covered:  

1. **Tokenization**  
2. **Part-of-Speech (POS) Tagging**  
3. **Lemmatization**  
4. **Morphology**  
5. **Dependency Parsing**  
6. **Subword Tokenization**  

## Prerequisites  

Before we begin, ensure that you have **spaCy** installed in your environment. If you are using the `NLP2025` environment, make sure it is activated. You can install **spaCy** using the following command:



In [None]:
!pip install spacy


Next, download the spaCy model for the English language:

In [None]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 86.5 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Importing spaCy
Let's start by importing the spaCy library and loading the English language model.

In [None]:
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')

Before beginning, let's define an example string that we can look at.

In [None]:
text = "The University of Surrey is a U.K. university founded in 1966, with a budget of £314.0 million."

## 1. Tokenization  

**Tokenization** is the process of breaking down text input into smaller units called **tokens**, which can be **words, punctuation marks, or other meaningful elements**. This is a fundamental step in NLP as it enables structured text analysis.  

### Tokenization in spaCy  

In **spaCy**, tokenization is performed using language-specific grammatical rules. For example:  
- Punctuation at the end of a sentence is **split off** as a separate token.  
- Abbreviations like **"U.K."** retain their periods within a single token.  

### How spaCy Handles Tokenization  

- The **input** to the tokenizer is **Unicode text**.  
- The **output** is a **Doc object**, which consists of individual tokens.  
- We can **iterate** over tokens and access attributes such as `token.text`.  
- spaCy's tokenizer is **non-destructive**, meaning it preserves the original text while providing structured access to tokens.  

This efficient tokenization process enables deeper linguistic analysis while maintaining the integrity of the original text.  
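To make the non-destructive property concrete, here is a minimal sketch that uses only the standard library. The regular expression is a toy rule and not spaCy's tokenizer (it splits `U.K.` into four tokens, for instance); it only illustrates the idea of storing each token's offset and trailing whitespace so the input can be rebuilt exactly. In spaCy, the corresponding attributes are `token.idx` and `token.text_with_ws`.

```python
import re

text = "The University of Surrey is a U.K. university founded in 1966, with a budget of £314.0 million."

# Toy rule (an assumption for illustration): a token is a run of word
# characters or a single punctuation mark; the second group captures
# whatever whitespace follows it.
TOKEN_RE = re.compile(r"(\w+|[^\w\s])(\s*)")

# Each entry keeps the token text, its start offset, and its trailing whitespace.
tokens = [(m.group(1), m.start(1), m.group(2)) for m in TOKEN_RE.finditer(text)]

# Offsets point back into the untouched original string.
for tok, start, _ in tokens:
    assert text[start:start + len(tok)] == tok

# Joining every token with its trailing whitespace restores the input exactly.
rebuilt = "".join(tok + ws for tok, _, ws in tokens)
print(rebuilt == text)
```

Because nothing is thrown away, any later annotation can always be mapped back to a character span in the original text.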


In [None]:
# tabulate formats a list of rows as a plain-text table
import tabulate

doc = nlp(text)

# Build one row per token: 1-based position and token text
table = []
for count, token in enumerate(doc):
    table.append([count + 1, token.text])

print(tabulate.tabulate(table, headers=['Position','Text']))

  Position  Text
----------  ----------
         1  The
         2  University
         3  of
         4  Surrey
         5  is
         6  a
         7  U.K.
         8  university
         9  founded
        10  in
        11  1966
        12  ,
        13  with
        14  a
        15  budget
        16  of
        17  £
        18  314.0
        19  million
        20  .


## 2. Part-of-Speech (POS) Tagging  

**Part-of-Speech (POS) tagging** is the process of assigning grammatical tags to individual words in a sentence, indicating their role, such as **noun, verb, adjective,** etc. This helps in understanding the **syntactic structure** of a sentence and is fundamental in many NLP tasks.  

### Using spaCy for POS Tagging  

Since we have already processed the text with **spaCy**, we can retrieve the POS tag for each token through the `token.pos_` attribute:


In [None]:
POS_Tags = []
for count, token in enumerate(doc):
    POS_Tags.append([count + 1, token.text, token.pos_])

print(tabulate.tabulate(POS_Tags, headers=['Position','Text', 'POS Tag']))

  Position  Text        POS Tag
----------  ----------  ---------
         1  The         DET
         2  University  PROPN
         3  of          ADP
         4  Surrey      PROPN
         5  is          AUX
         6  a           DET
         7  U.K.        PROPN
         8  university  NOUN
         9  founded     VERB
        10  in          ADP
        11  1966        NUM
        12  ,           PUNCT
        13  with        ADP
        14  a           DET
        15  budget      NOUN
        16  of          ADP
        17  £           SYM
        18  314.0       NUM
        19  million     NUM
        20  .           PUNCT


Based on the example above, we can see several **POS Tags**. Some common examples include:

- **DET**: Determiner  
- **PROPN**: Proper Noun  
- **ADP**: Adposition  

These tags represent different parts of speech in a sentence and are crucial for understanding the syntactic structure of the language.

### Why Use POS Tagging?

In NLP, understanding the grammatical structure of sentences can be extremely valuable for many tasks. POS tagging helps computers to identify the roles that different words play within a sentence, such as subjects, objects, or actions.

However, in some tasks, it may also be useful to **discard certain words** based on their POS tags. For example:

- **Sentiment Analysis**:  
  In sentiment analysis, words like **articles** (e.g., *"the"*, *"a"*) and **pronouns** (e.g., *"he"*, *"she"*) might be discarded because they contribute little to the overall sentiment of the text.

By filtering out less relevant POS tags, the model can focus on words that carry more meaning and help improve task performance.
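As a rough sketch of this kind of filtering, the snippet below drops determiners and pronouns from a list of `(text, POS)` pairs. The pairs are copied from the first rows of the tagging table above; the set of tags to drop is a hypothetical choice for illustration, and the right set depends on the task.

```python
# (text, POS tag) pairs copied from the first nine rows of the table above.
tagged = [
    ("The", "DET"), ("University", "PROPN"), ("of", "ADP"),
    ("Surrey", "PROPN"), ("is", "AUX"), ("a", "DET"),
    ("U.K.", "PROPN"), ("university", "NOUN"), ("founded", "VERB"),
]

# Hypothetical filter: articles are tagged DET and pronouns PRON.
DROP_TAGS = {"DET", "PRON"}

kept = [word for word, pos in tagged if pos not in DROP_TAGS]
print(kept)
```

With a spaCy `doc`, the same idea would be a comprehension over `token.pos_`.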


## 3. Lemmatization  

**Lemmatization** is the process of reducing words to their **base** (dictionary) form, known as a **lemma**. For example, *is* becomes *be* and *founded* becomes *found*. This helps in **text normalization** by converting different inflectional forms of a word into a single standardized form.  

Lemmatization is particularly useful in NLP tasks such as:  
- Improving text **search and retrieval**  
- Enhancing **sentiment analysis**  
- Reducing **dimensionality** in text-based models  
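The dimensionality point can be sketched with a tiny, hand-written lemma lookup. The mapping and the token list below are hypothetical and exist only for illustration; a real pipeline would take lemmas from `token.lemma_`.

```python
from collections import Counter

# Hypothetical lemma lookup, for illustration only.
LEMMAS = {
    "is": "be", "was": "be", "are": "be",
    "founded": "found", "founds": "found",
    "universities": "university",
}

tokens = ["is", "was", "are", "founded", "founds", "university", "universities"]

surface_counts = Counter(tokens)
# Words missing from the lookup are treated as their own lemma.
lemma_counts = Counter(LEMMAS.get(tok, tok) for tok in tokens)

print(len(surface_counts), "distinct surface forms")
print(len(lemma_counts), "distinct lemmas")
```

Fewer distinct entries means a smaller feature space for bag-of-words style models, and a search for *found* can match *founded* as well.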


In [None]:
# One row per token: 1-based position, surface text, and lemma
lemma_rows = []
for count, token in enumerate(doc):
    lemma_rows.append([count + 1, token.text, token.lemma_])

print(tabulate.tabulate(lemma_rows, headers=['Position','Text','Lemma']))

  Position  Text        Lemma
----------  ----------  ----------
         1  The         the
         2  University  University
         3  of          of
         4  Surrey      Surrey
         5  is          be
         6  a           a
         7  U.K.        U.K.
         8  university  university
         9  founded     found
        10  in          in
        11  1966        1966
        12  ,           ,
        13  with        with
        14  a           a
        15  budget      budget
        16  of          of
        17  £           £
        18  314.0       314.0
        19  million     million
        20  .           .


## 4. Morphology  

**Morphology** is the study of the structure of words and their components, such as **prefixes, suffixes,** and **roots**. In particular, it describes how the base form (lemma) of a word is modified by adding prefixes or suffixes, altering its meaning or grammatical function.  

In **spaCy**, we can access detailed morphological information for each token, which includes features such as:  
- **Number** (singular or plural)  
- **Tense** (present, past, etc.)  
- **Mood**: Indicates the mode or manner in which the action is expressed (e.g., **indicative**, **imperative**, or **subjunctive**).  
  - Example: *"She eats"* (indicative) vs. *"Eat!"* (imperative)
- **Aspect**: Describes the temporal flow or completion of an action (e.g., **perfect**, **progressive**, or **habitual**).  
  - Example: *"I am eating"* (progressive) vs. *"I have eaten"* (perfect, tagged `Aspect=Perf`)  

This morphological analysis is essential for understanding how words relate to one another in context and is crucial for tasks such as syntactic parsing and word generation.  


In [None]:
Morphs = []
for count, token in enumerate(doc):
    Morphs.append([count + 1, token.text, token.morph])

print(tabulate.tabulate(Morphs, headers=['Position','Text', 'Morphology']))

  Position  Text        Morphology
----------  ----------  -----------------------------------------------------
         1  The         Definite=Def|PronType=Art
         2  University  Number=Sing
         3  of
         4  Surrey      Number=Sing
         5  is          Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
         6  a           Definite=Ind|PronType=Art
         7  U.K.        Number=Sing
         8  university  Number=Sing
         9  founded     Aspect=Perf|Tense=Past|VerbForm=Part
        10  in
        11  1966        NumType=Card
        12  ,           PunctType=Comm
        13  with
        14  a           Definite=Ind|PronType=Art
        15  budget      Number=Sing
        16  of
        17  £
        18  314.0       NumType=Card
        19  million     NumType=Card
        20  .           PunctType=Peri
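The strings in the Morphology column follow a `Feature=Value|Feature=Value` layout, which is easy to unpack. The helper below is a minimal sketch with a hypothetical name (`parse_morph`); in spaCy itself, `token.morph.to_dict()` and `token.morph.get("Tense")` give direct access to the same information.

```python
def parse_morph(morph_string):
    """Split a 'Feature=Value|Feature=Value' string into a dict."""
    if not morph_string:
        return {}  # tokens such as 'of' carry no morphological features
    return dict(pair.split("=", 1) for pair in morph_string.split("|"))

# The string shown for the token 'is' in the table above.
features = parse_morph("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(features["Tense"], features["Person"])
print(parse_morph(""))
```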


## 5. Dependency Parsing  

Dependency parsing analyzes the grammatical structure of a sentence by establishing relationships between **head** words and the words that **modify** them. Every word except the root is attached to exactly one head by a labeled, directed link, so the sentence decomposes into a set of word-to-word relations. These relationships are typically represented as a **tree structure**, illustrating how words depend on one another.  

### Example  

**Sentence:**  
*"I prefer the morning flight through Denver."*  

The diagram below visualizes the sentence's dependency structure:  

![Dependency Parsing](https://cdn.analyticsvidhya.com/wp-content/uploads/2024/09/29920Screenshot-127.webp)  
*[Source](https://www.analyticsvidhya.com/blog/2021/12/dependency-parsing-in-natural-language-processing-with-examples/)*  

### Understanding the Dependency Structure  

In the diagram:  

- **Directed arcs** illustrate grammatical relationships between words in the sentence.  
- The **root** of the tree, *prefer*, serves as the central unit of the sentence.  
- Each dependency is labeled with a **dependency tag**, which specifies the relationship between two words.  

For instance, in the phrase **"flight through Denver"**, the noun *Denver* modifies the meaning of *flight*. This creates a **dependency** where:  

- *Flight* is the **head** (governing word).  
- *Denver* is the **dependent** (child node).  
- This relationship is marked by the **nmod** (nominal modifier) tag, indicating that *Denver* provides additional information about *flight*.  

Dependency parsing plays a crucial role in natural language processing (NLP), helping models understand syntactic structures and improving tasks such as named entity recognition, question answering, and machine translation.  


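All of this can be done easily with spaCy. Note that spaCy's English models use their own dependency label set, so a prepositional modifier like the one above appears as `prep` and `pobj` rather than `nmod`: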

In [None]:
# One row per token: position, text, dependency label, and the texts of its children
dependency_rows = []
for count, token in enumerate(doc):
    dependency_rows.append([count + 1, token.text, token.dep_, [child.text for child in token.children]])

print(tabulate.tabulate(dependency_rows, headers=['Position','Text', 'Dependency', 'Children']))

  Position  Text        Dependency    Children
----------  ----------  ------------  -----------------------------------------
         1  The         det           []
         2  University  nsubj         ['The', 'of']
         3  of          prep          ['Surrey']
         4  Surrey      pobj          []
         5  is          ROOT          ['University', 'university', 'with', '.']
         6  a           det           []
         7  U.K.        compound      []
         8  university  attr          ['a', 'U.K.', 'founded', ',']
         9  founded     acl           ['in']
        10  in          prep          ['1966']
        11  1966        pobj          []
        12  ,           punct         []
        13  with        prep          ['budget']
        14  a           det           []
        15  budget      pobj          ['a', 'of']
        16  of          prep          ['million']
        17  £           quantmod      []
        18  314.0       compound      []
        19  mi

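This table can be hard to read, which is why spaCy offers a quick way to view the tree structure:

In [None]:
from spacy import displacy

# displacy.serve starts a local web server that hosts the rendered tree.
# Inside Jupyter or Colab, displacy.render(doc, style="dep", jupyter=True)
# draws the same diagram inline instead.
displacy.serve(doc, style="dep")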


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
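The Children column of the dependency table can also be inverted to recover each token's head, which makes the tree explicit without a diagram. The sketch below transcribes the visible rows by position; the rows that are cut off in the output above are left out rather than guessed. In spaCy, the same information is available directly as `token.head`.

```python
# Token texts by position (index 0 holds position 1), from the tokenization table.
words = ["The", "University", "of", "Surrey", "is", "a", "U.K.", "university",
         "founded", "in", "1966", ",", "with", "a", "budget", "of", "£",
         "314.0", "million", "."]

# head position -> child positions, transcribed from the visible table rows.
children = {
    2: [1, 3],           # University -> The, of
    3: [4],              # of -> Surrey
    5: [2, 8, 13, 20],   # is (ROOT) -> University, university, with, .
    8: [6, 7, 9, 12],    # university -> a, U.K., founded, ,
    9: [10],             # founded -> in
    10: [11],            # in -> 1966
    13: [15],            # with -> budget
    15: [14, 16],        # budget -> a, of
    16: [19],            # of -> million
}

# Invert the mapping: every child has exactly one head.
head = {child: parent for parent, kids in children.items() for child in kids}

def path_to_root(position):
    """Follow head links upward until a token without a recorded head is reached."""
    path = [position]
    while path[-1] in head:
        path.append(head[path[-1]])
    return path

print([words[p - 1] for p in path_to_root(11)])
```

Walking upward from *1966* passes through *in*, *founded*, and *university* before reaching the root *is*, which matches the nesting visible in the table.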


## 6. Subword Tokenization  

Tokenization is the process of breaking text into smaller units so that models can process it as a sequence of discrete tokens rather than as one continuous string. In previous sections, you used spaCy for tokenization, which primarily segments text into individual words. This approach is efficient, but a model with a word-level vocabulary has no representation for uncommon or out-of-vocabulary (OOV) words that it never saw during training.  

To address this limitation, modern tokenization techniques predominantly use **subword-based methods**. Instead of strictly segmenting text into words, these approaches break words into smaller subword units when necessary. For example, the word *unhappiness* might be tokenized into *un* and *happiness*. This strategy offers several advantages:  

- **Improved Handling of Rare Words** – By decomposing words into meaningful subunits, the model can recognize and generate words that were not explicitly seen during training.  
- **Compact Vocabulary** – Instead of storing an extensive vocabulary of all possible words, subword tokenization relies on a smaller set of subunits, which can be combined to form complex words.  
- **Efficient Representation** – By balancing whole-word tokens with subword segments, this method optimizes both memory usage and model performance.  

(**Note**: Tokenizers typically keep frequently used words whole and split rare words into smaller subwords.)

We will therefore explore three subword tokenization techniques:  

1. **WordPiece**  
2. **Byte-Pair Encoding (BPE)**  
3. **SentencePiece**  

These tokenization methods have become standard in modern NLP models and are widely used in recent Large Language Models (LLMs).  


Before starting, let's define a simple string that we will be tokenizing.

In [None]:
text = "Natural Language Processing is incontrovertibly a good module."

## 6.1. WordPiece Tokenization

**WordPiece Tokenization** is a subword tokenization technique used in models like BERT (Bidirectional Encoder Representations from Transformers). It breaks words into subword units so that complex, unknown, or out-of-vocabulary (OOV) words can still be represented with pieces from a fixed vocabulary.

WordPiece builds its vocabulary by starting from individual characters and iteratively merging pairs of subword units found in a large corpus. Unlike BPE, which always merges the most frequent pair, WordPiece is usually described as choosing the merge that most increases the likelihood of the training data. The resulting subwords capture the language's most common word components, which keeps the vocabulary small while still covering the language.

### Example: Tokenizing Text with BERT’s WordPiece Tokenizer

In this section, we will use the Hugging Face `transformers` library to showcase how the BERT tokenizer works. We’ll tokenize a sample sentence, convert the tokens into token IDs, and then decode those IDs back into a human-readable string.


In [None]:
# First, install the required library
!pip install transformers

In [None]:
# Importing the necessary module from Hugging Face transformers
from transformers import BertTokenizer

# Step 1: Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Step 2: Tokenize the text into subword tokens
tokens = tokenizer.tokenize(text)
print("\nBERT Tokens:", tokens)

# Step 3: Convert tokens to their corresponding token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("\nBERT Token IDs:", token_ids)

# Step 4: Decode the token IDs back to human-readable text
decoded_text = tokenizer.decode(token_ids)
print("\nDecoded Text:", decoded_text)


BERT Tokens: ['natural', 'language', 'processing', 'is', 'inc', '##ont', '##rove', '##rti', '##bly', 'a', 'good', 'module', '.']

BERT Token IDs: [3019, 2653, 6364, 2003, 4297, 12162, 17597, 28228, 6321, 1037, 2204, 11336, 1012]

Decoded Text: natural language processing is incontrovertibly a good module.


In the example above, the word **"incontrovertibly"**, which is quite rare, is split into **5 subwords** by the tokenizer. This process of splitting words into smaller subunits is particularly useful for handling rare or out-of-vocabulary (OOV) words.

Each subword is represented as a **token**, and you can see that certain tokens are prefixed with `##`. This notation indicates that these subwords are continuations of a previous subword (i.e., they do not start a new word). The tokenizer has broken down the word into smaller, more frequent subwords that are part of the model’s vocabulary.
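One way to see where the `##` pieces come from is the greedy longest-match-first lookup that BERT-style WordPiece tokenizers apply to each word at tokenization time. The function below is a simplified sketch of that idea; the vocabulary is a hypothetical toy set chosen so that the split matches the output above, not BERT's real vocabulary.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedily take the longest vocabulary entry at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate from the right and retry
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"in", "inc", "##ont", "##rove", "##rti", "##bly", "good"}

print(wordpiece_split("incontrovertibly", toy_vocab))
print(wordpiece_split("good", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

Note that `inc` wins over `in` because the longest match is tried first, and that a word with no matching pieces falls back to the unknown token.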

### Why Do We Use Token IDs?

As shown above, the tokens are also associated with **token IDs**. These token IDs are numerical representations of the words or subwords. In the context of machine learning and NLP models, it's crucial to convert words into numbers because models operate on numerical data.

Each token is mapped to a unique ID in the model’s vocabulary, which allows the model to process text efficiently. This conversion is essential because:

- **Models can't understand raw text**: Machine learning models, including NLP models, don't process text directly. Instead, they process **numerical representations** of words.
- **Token IDs map to model parameters**: The model's vocabulary is essentially a map of tokens (words or subwords) to unique IDs. These IDs are used by the model to look up the corresponding word embeddings (vector representations) in the model’s parameters.
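The lookup described above can be sketched in a few lines. The token-to-ID pairs are copied from the BERT output earlier in this section; the embedding vectors are random placeholders and the dimension of 4 is a hypothetical choice (a real model stores a learned matrix with one row per vocabulary entry).

```python
import random

# Token -> ID pairs copied from the BERT output above.
vocab = {"natural": 3019, "language": 2653, "processing": 6364, "is": 2003}
id_to_token = {token_id: token for token, token_id in vocab.items()}

# Placeholder embedding table: random vectors stand in for learned parameters.
random.seed(0)
EMBED_DIM = 4  # hypothetical size, chosen for readability
embeddings = {
    token_id: [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for token_id in vocab.values()
}

tokens = ["language", "is", "natural"]
token_ids = [vocab[t] for t in tokens]        # text -> numbers
vectors = [embeddings[i] for i in token_ids]  # numbers -> vectors the model consumes
decoded = [id_to_token[i] for i in token_ids] # numbers -> text again

print(token_ids)
print(decoded)
```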

## 6.2. Byte-Pair Encoding (BPE) Tokenization

**Byte-Pair Encoding (BPE)** is another popular subword tokenization technique, used in models like GPT (Generative Pre-trained Transformer) and other transformer-based architectures.
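At its core, BPE learns its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. The snippet below is a minimal character-level sketch of that training loop on a hypothetical toy corpus; real implementations add details such as byte-level alphabets, end-of-word or space markers, and explicit tie-breaking rules.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def apply_merge(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # learn three merges
    pair_counts = count_pairs(words)
    best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
    merges.append(best)
    words = apply_merge(words, best)

print(merges)
print(list(words))
```

After a few merges the frequent suffix `est` becomes a single symbol, which is how common word parts end up in the vocabulary while rare words stay split.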

### Byte-Level BPE Tokenization

Instead of treating text as sequences of **Unicode characters** (such as 'a', 'b', 'c', etc.), **byte-level BPE** tokenizes text at the **byte level**. Each character, word, and symbol is first converted into its corresponding **byte representation**.

The **base vocabulary** for byte-level BPE is small and fixed: just the **256 possible byte values**. Since every string can be encoded as a sequence of bytes, any character can be represented without resorting to an **unknown token** for out-of-vocabulary (OOV) words.

This approach allows models like **GPT-2** and **RoBERTa** to handle any character or symbol, including those from different languages, special symbols, or rare characters, without needing additional vocabularies or dealing with OOV issues.
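The claim that 256 base symbols are enough can be checked directly with Python's built-in UTF-8 encoding. This sketch only shows the byte decomposition; it does not perform any BPE merges, and the sample characters are an arbitrary choice.

```python
samples = ["a", "£", "é", "語"]

for char in samples:
    raw = char.encode("utf-8")  # the character as a sequence of bytes
    print(repr(char), "->", list(raw))

# Every byte is an integer in the range 0..255, whatever the script.
all_bytes = [b for char in samples for b in char.encode("utf-8")]
print(all(0 <= b <= 255 for b in all_bytes))

# The mapping is lossless: decoding the bytes returns the original text.
print("£314.0".encode("utf-8").decode("utf-8") == "£314.0")
```

A character outside ASCII simply takes more than one byte, so it costs more base symbols but never needs an unknown token.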


### Example: Tokenizing Text with BPE Tokenizer

In this section, we will use the Hugging Face `transformers` library to demonstrate how a BPE tokenizer works.

In [None]:
from transformers import GPT2Tokenizer

# Step 1: Load the pre-trained BPE tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Step 2: Tokenize the text into subword tokens
tokens = tokenizer.tokenize(text)
print("\nBPE Tokens:", tokens)

# Step 3: Convert tokens to their corresponding token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("\nBPE Token IDs:", token_ids)

# Step 4: Decode the token IDs back to human-readable text
decoded_text = tokenizer.decode(token_ids)
print("\nDecoded Text:", decoded_text)


BPE Tokens: ['Natural', 'ĠLanguage', 'ĠProcessing', 'Ġis', 'Ġinc', 'ont', 'ro', 'vert', 'ibly', 'Ġa', 'Ġgood', 'Ġmodule', '.']

BPE Token IDs: [35364, 15417, 28403, 318, 753, 756, 305, 1851, 3193, 257, 922, 8265, 13]

Decoded Text: Natural Language Processing is incontrovertibly a good module.


### The Ġ Character in Byte-Pair Encoding (BPE)

The output above reveals a noticeable difference: the **Ġ** character. In **byte-level BPE** as used by GPT-2, this character stands in for a **space** at the start of a token, so `Ġis` means the word *is* preceded by a space. Because whitespace is stored inside the tokens, the tokenizer can tell word-initial pieces (`Ġinc`) from word-internal ones (`ont`) and can reproduce the original spacing exactly when decoding.
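A quick way to convince yourself of this is to undo the marker by hand. The sketch below takes the token list printed above, concatenates it, and swaps `Ġ` back to a space; it covers only the space marker, whereas the real tokenizer maps every byte to a printable stand-in character.

```python
# Tokens copied from the GPT-2 output above.
bpe_tokens = ['Natural', 'ĠLanguage', 'ĠProcessing', 'Ġis', 'Ġinc', 'ont',
              'ro', 'vert', 'ibly', 'Ġa', 'Ġgood', 'Ġmodule', '.']

# Ġ is U+0120, i.e. the space byte 0x20 shifted up by 0x100 so it is printable.
print(hex(ord("Ġ")), hex(ord(" ")))

# Concatenate the pieces and turn the marker back into a space.
restored = "".join(bpe_tokens).replace("Ġ", " ")
print(restored)

# Pieces without the marker continue the previous word.
word_starts = [tok for tok in bpe_tokens if tok.startswith("Ġ")]
print(word_starts)
```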


## 6.3. SentencePiece Tokenization

**SentencePiece** is another widely used subword tokenizer, found in models like T5 (Text-to-Text Transfer Transformer) and other transformer-based architectures. Strictly speaking, it is a tokenization library rather than a single algorithm: it operates on raw text, treats whitespace as an ordinary symbol (shown as `▁`), and can train either unigram or BPE models.

In this section, we will use the Hugging Face `transformers` library to demonstrate how a pre-trained **SentencePiece tokenizer** works. Afterwards, we will train our own SentencePiece tokenizer.


In [None]:
from transformers import T5Tokenizer

# Step 1: Load a pre-trained SentencePiece tokenizer (T5 model)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Step 2: Tokenize the text into subword tokens using SentencePiece
tokens = tokenizer.tokenize(text)
print("\nSentencePiece Tokens:", tokens)

# Step 3: Convert tokens to their corresponding token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("\nSentencePiece Token IDs:", token_ids)

# Step 4: Decode the token IDs back to human-readable text
decoded_text = tokenizer.decode(token_ids)
print("\nDecoded Text:", decoded_text)



SentencePiece Tokens: ['▁Natural', '▁Language', '▁Processing', '▁is', '▁in', 'contro', 'vert', 'ibly', '▁', 'a', '▁good', '▁module', '.']

SentencePiece Token IDs: [6869, 10509, 19125, 19, 16, 23862, 3027, 15596, 3, 9, 207, 6008, 5]

Decoded Text: Natural Language Processing is incontrovertibly a good module.
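In the tokens above, the `▁` character (U+2581) marks where a space occurred in the input. Since whitespace is kept inside the pieces, detokenization reduces to concatenating them and replacing the marker, as this small sketch on the printed token list shows. It handles only this simple case and ignores special tokens.

```python
# Tokens copied from the T5 output above.
sp_tokens = ['▁Natural', '▁Language', '▁Processing', '▁is', '▁in', 'contro',
             'vert', 'ibly', '▁', 'a', '▁good', '▁module', '.']

# Concatenate, turn each marker back into a space, and drop the leading space.
restored_sp = "".join(sp_tokens).replace("▁", " ").lstrip()
print(restored_sp)

# The marker is an ordinary Unicode character, not an underscore.
print(hex(ord("▁")))
```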


## Training a SentencePiece Model

In this section, we'll walk through how to **train a SentencePiece model** from a text corpus using **Byte-Pair Encoding (BPE)**. As seen from above, SentencePiece is a subword tokenization technique that efficiently handles rare or out-of-vocabulary (OOV) words by splitting them into smaller, manageable units.

### Training Process Overview:
1. **Input Corpus**: We use a text file (e.g., **Shakespear_1_10.txt**) as input.
2. **Model Parameters**:
   - **Vocabulary size**: Set to **2000**.
   - **Model type**: We use **BPE**.
3. **Training**: The model is trained using `SentencePieceTrainer.train()` to learn subword units.
4. **Output**: The model and vocabulary files are saved with the specified prefix (e.g., `mymodel.model`, `mymodel.vocab`).


In [None]:
!pip install sentencepiece



In [None]:
import sentencepiece as spm

# Step 1: Define the input corpus file (a large text file)
corpus_file = 'Shakespear_1_10.txt'

# Step 2: Define the model output directory and parameters
model_prefix = 'mymodel'
vocab_size = 2000
model_type = 'bpe'  # BPE model (could also be 'unigram', 'char', etc.)

# Step 3: Train the SentencePiece model
spm.SentencePieceTrainer.train(
    input=corpus_file,
    model_prefix=model_prefix,
    vocab_size=vocab_size,
    model_type=model_type,
    character_coverage=0.9995,  # Coverage for character set (default is 0.9995)
    input_format='text'  # Format of input (usually plain text)
)

print(f"Model trained and saved with prefix: {model_prefix}")


Model trained and saved with prefix: mymodel


sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: Shakespear_1_10.txt
  input_format: text
  model_prefix: mymodel
  model_type: BPE
  vocab_size: 2000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
 

486 active=1032 piece=ell
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=360 all=1506 active=1052 piece=▁In
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=380 all=1516 active=1062 piece=king
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=400 all=1524 active=1070 piece=▁giv
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=2 min_freq=0
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=420 all=1522 active=997 piece=▁Look
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=440 all=1509 active=984 piece=▁lies
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=460 all=1500 active=975 piece=▁fresh
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=2 size=480 all=1481 active=956 piece=▁gentle
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=1 size=500 all=1465 active=940 piece=,’
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=1 min_freq=0
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=1 size=520 all=1476 active=1012 p

## SentencePiece Tokenization and Detokenization

Once you’ve trained your **SentencePiece** model, you can use it to tokenize and detokenize sentences. The process involves converting a sentence into subword units (tokens) and then reconstructing the sentence from those tokens.


In [None]:
# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load('mymodel.model')

# Tokenize a sentence
sentence = "I have successfully trained a SentencePiece model."
tokens = sp.encode(sentence, out_type=str)  # or out_type=int for token IDs
print(f"\nTokenized sentence: {tokens}")

# Detokenize the sentence
detokenized = sp.decode(tokens)
print(f"\nDetokenized sentence: {detokenized}")


Tokenized sentence: ['▁I', '▁ha', 've', '▁su', 'ccess', 'fu', 'll', 'y', '▁tra', 'in', 'ed', '▁a', '▁S', 'ent', 'en', 'ce', 'P', 'ie', 'ce', '▁mo', 'de', 'l', '.']

Detokenized sentence: I have successfully trained a SentencePiece model.
