<a href="https://colab.research.google.com/github/saikiran37130204/NLP-in-class-activities/blob/main/Activity_3_Normalization_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Activity 3: Normalization Solution**
**Instructions:**

---
* Please download the provided IPython Notebook (ipynb) file and open it in Google Colab. Once opened, enter your code in the same file directly beneath the relevant question's code block.


* The purpose of this activity is to practice and gain hands-on experience (ungraded activity).

* No dataset is required for this activity.

# **Text Preprocessing Beyond Tokenization**



## **Word normalization**
**Word normalization** is a text preprocessing technique used in natural language processing (NLP) to standardize and simplify words or tokens in a text document. The goal of word normalization is to make text data more consistent and manageable for analysis. This process can involve various transformations, such as converting all text to lowercase, removing punctuation, expanding contractions, and performing tasks like stemming or lemmatization to reduce words to their base or dictionary forms. Word normalization helps improve the accuracy and effectiveness of NLP tasks by reducing the complexity of text data and ensuring that similar words are treated as equivalent.
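
The transformations listed above can be sketched with the standard library alone. This is a minimal illustration, not a production normalizer: the `CONTRACTIONS` table and the `normalize` function are hypothetical names introduced here, and the table only covers three contractions written with a straight apostrophe.

```python
import re

# Hypothetical, deliberately tiny contraction table (assumption for illustration).
CONTRACTIONS = {"it's": "it is", "let's": "let us", "don't": "do not"}

def normalize(text):
    """Lowercase, expand a few contractions, strip punctuation, tidy whitespace."""
    text = text.lower()
    # Expand the contractions we know about.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop every character that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

print(normalize("Let's see: it's WORKING, don't worry!"))
# let us see it is working do not worry
```

Stemming and lemmatization, covered later in this notebook, would be applied after steps like these.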

In [None]:
import re

# for using NLTK
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords

# for using SpaCy
import spacy

# for HuggingFace
!pip install transformers
# !pip install ftfy

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...




In [None]:
# trick to wrap text to the viewing window for this notebook
# Ref: https://stackoverflow.com/questions/58890109/line-wrapping-in-collaboratory-google-results
#helps improve the readability and formatting of code output in the notebook.
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

### **Revisiting Tokenization: TreebankWordTokenizer**

The **Treebank Word Tokenizer** is a text processing tool used in natural language processing (NLP) to split text into individual words or tokens. It follows the tokenization conventions of the Penn Treebank corpus.

In [None]:
from nltk.tokenize import TreebankWordTokenizer
text="Hello everyone. Welcome to NLP Course."
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)

['Hello', 'everyone.', 'Welcome', 'to', 'NLP', 'Course', '.']
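
Note that `'everyone.'` keeps its period in the output above while the final `'.'` is split off. NLTK's documentation states that this tokenizer assumes the text has already been segmented into sentences, so only a period at the very end of the input is separated. The sketch below imitates just that one rule with a regular expression; `split_final_period` is a hypothetical helper, not part of NLTK, and it ignores every other Treebank convention.

```python
import re

def split_final_period(text):
    """Separate a period only when it ends the input, then split on whitespace."""
    spaced = re.sub(r"([^.])(\.)\s*$", r"\1 \2", text)
    return spaced.split()

text = "Hello everyone. Welcome to NLP Course."

# Whole text at once: only the last period is split off.
print(split_final_period(text))
# ['Hello', 'everyone.', 'Welcome', 'to', 'NLP', 'Course', '.']

# One sentence at a time: every sentence-final period is split off.
sentences = ["Hello everyone.", "Welcome to NLP Course."]
print([tok for s in sentences for tok in split_final_period(s)])
# ['Hello', 'everyone', '.', 'Welcome', 'to', 'NLP', 'Course', '.']
```

In practice this means sentence segmentation should usually run before Treebank-style word tokenization.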

## **Using Regular Expression**

**RegexpTokenizer** is a text processing tool provided by the Natural Language Toolkit (NLTK) library in Python. It is used to tokenize (split) text into individual tokens (words or phrases) based on a specified regular expression pattern.

In [None]:
from nltk.tokenize import RegexpTokenizer

# [\w']+ matches runs of word characters (letters, digits, underscores) and apostrophes;
# a raw string (r"...") keeps Python from treating \w as a string escape
tokenizer = RegexpTokenizer(r"[\w']+")
text = "Let's see how it's working. We also have digits like 123 and 010"
tokenizer.tokenize(text)

["Let's",
 'see',
 'how',
 "it's",
 'working',
 'We',
 'also',
 'have',
 'digits',
 'like',
 '123',
 'and',
 '010']
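
With its default settings, `RegexpTokenizer` returns the non-overlapping matches of its pattern, so a comparable result can be obtained with `re.findall` from the standard library. The sketch below is an illustration of that idea rather than a replacement for the NLTK class.

```python
import re

text = "Let's see how it's working. We also have digits like 123 and 010"

# Same pattern as above: word characters and apostrophes.
tokens = re.findall(r"[\w']+", text)
print(tokens)

# Changing only the pattern changes what counts as a token.
digits = re.findall(r"\d+", text)
print(digits)
# ['123', '010']
```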

# **(Tutorial) Tokenizing text using SpaCy**

Following is a dummy sample of text to demonstrate tokenization in SpaCy.

In [None]:
dummy_text1 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text1)

Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.



In [None]:
# loads a trained English pipeline with specific preprocessing components
nlp = spacy.load('en_core_web_sm')

# using SpaCy's tokenizer...
doc = nlp(dummy_text1)      # applies the processing pipeline on the text
for token in doc:
  print(token.text)

Here
is
the
First
Paragraph
and
this
is
the
First
Sentence
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
first
paragaraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


Now
,
it
is
the
Second
Paragraph
and
its
First
Sentence
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
second
paragraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


Finally
,
this
is
the
Third
Paragraph
and
is
the
First
Sentence
of
this
paragraph
.
Here
is
the
Second
Sentence
.
Now
is
the
Third
Sentence
.
This
is
the
Fourth
Sentence
of
the
third
paragaraph
.
This
paragraph
is
ending
now
with
a
Fifth
Sentence
.


4th
paragraph
just
has
one
sentence
in
it
.




**Question 1:** Copy the RegexpTokenizer snippet from the "Using Regular Expression" section and modify its regular expression so that it matches only **digits** in the given text.  

In [None]:
text = """
In the year 2024, advancements in technology continued to revolutionize industries across the globe. Artificial intelligence, machine learning, and data science played critical roles in optimizing processes, predicting outcomes, and driving innovation. Companies like Google, Amazon, and Tesla invested billions in research to stay ahead of the competition.

One major breakthrough was the development of quantum computing, which promised to solve complex problems exponentially faster than traditional computers. Researchers at IBM and Microsoft reported significant progress in creating stable qubits, a key component in quantum systems.

Meanwhile, in healthcare, personalized medicine reached new heights. Genetic testing became more accessible, allowing doctors to tailor treatments to individual patients. CRISPR technology made headlines with successful gene-editing trials, curing diseases that were previously considered incurable.

However, with all these advancements came new challenges. Cybersecurity threats continued to evolve, with ransomware attacks becoming more sophisticated and targeting critical infrastructure. Governments and organizations were forced to bolster their defenses and invest in new cybersecurity technologies to protect sensitive information.

As the world faced the effects of climate change, renewable energy sources like solar and wind became increasingly important. Countries around the world set ambitious goals to reduce carbon emissions, with a focus on transitioning to cleaner energy alternatives.

By the end of 2025, society had made great strides in many fields, but the rapid pace of change also raised ethical questions. The role of artificial intelligence in decision-making, the implications of gene editing, and the balance between technological progress and privacy remained hotly debated topics as the world prepared for the next wave of innovations.
"""


In [None]:
#CODE HERE
tokenizer = RegexpTokenizer(r"\d+")  # \d+ matches runs of one or more digits
print(tokenizer.tokenize(text))

['2024', '2025']



---


## **(Tutorial) Stemming and Lemmatization using NLTK**

**Stemming** is a text normalization technique in natural language processing and information retrieval. It involves reducing words to their root or base form, often by removing suffixes or prefixes. The goal of stemming is to convert words with the same meaning but different forms into a common base form so that they can be treated as equivalent during text analysis and retrieval. Stemming helps improve information retrieval and text processing tasks by reducing the complexity of words while maintaining their core meaning. Common stemming algorithms include the Porter Stemmer and Snowball Stemmer.
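
To make the idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter or Snowball algorithm; `SUFFIXES`, `toy_stem`, and `min_stem_len` are hypothetical names chosen for this sketch, and the rule set is far smaller than any real stemmer's.

```python
# Suffixes are tried in this order; the first one that fits is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip one known suffix, provided the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "jumped", "jumps", "boxes", "running", "ran"]:
    print(w, "->", toy_stem(w))
# jumping -> jump
# jumped -> jump
# jumps -> jump
# boxes -> box
# running -> runn
# ran -> ran
```

The last two lines show the limits of naive rules: "running" becomes "runn" because nothing handles the doubled consonant, and the irregular form "ran" is untouched. Real stemmers add further rules for cases like the first; the second is generally left to lemmatization.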

## **Porter Stemmer**
The **Porter Stemmer** is a well-known stemming algorithm in natural language processing. It reduces words to a root form by stripping common suffixes according to a fixed sequence of rules. For example, it maps "running" and "runs" to the common stem "run", while an irregular form such as "ran" is left unchanged because no suffix rule applies to it.

Let's see how we can perform stemming and lemmatization using the NLTK library...

In [None]:
# importing PorterStemmer class from nltk.stem module
from nltk.stem import PorterStemmer
porter = PorterStemmer()    # instantiating an object of the PorterStemmer class

stem = porter.stem('cats')    # calling the stemmer algorithm on the desired word
print(f"'cats' after stemming: {stem}")

stem = porter.stem('better')
print(f"'better' after stemming: {stem}")

stem = porter.stem('abaci')
print(f"'abaci' after stemming: {stem}")

stem = porter.stem('aardwolves')
print(f"'aardwolves' after stemming: {stem}")

stem = porter.stem('generically')
print(f"'generically' after stemming: {stem}")

'cats' after stemming: cat
'better' after stemming: better
'abaci' after stemming: abaci
'aardwolves' after stemming: aardwolv
'generically' after stemming: gener


## **Lemmatization**
**Lemmatization** is a natural language processing technique that reduces words to their base or dictionary form, known as a "lemma." Unlike stemming, which often involves removing suffixes to approximate a word's root, lemmatization considers the word's context and grammatical meaning. The goal is to transform different inflected forms of a word into a common base form. Lemmatization is particularly useful for maintaining the grammatical correctness of words in text analysis and information retrieval tasks.
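
The role of grammatical information can be illustrated with a toy dictionary lookup keyed by the word and its part of speech. This is only a sketch: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical, the table holds three hand-picked entries, and the plural rule is a crude stand-in for real morphological analysis.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech); illustrative only.
LEMMA_TABLE = {
    ("better", "a"): "good",    # adjective
    ("abaci", "n"): "abacus",   # irregular noun plural
    ("ran", "v"): "run",        # irregular verb form
}

def toy_lemmatize(word, pos="n"):
    """Look up an irregular form first, then fall back to a simple plural rule."""
    word = word.lower()
    if (word, pos) in LEMMA_TABLE:
        return LEMMA_TABLE[(word, pos)]
    # Regular plural nouns: drop a final "s" (but not from words ending in "ss").
    if pos == "n" and word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

print(toy_lemmatize("cats"))         # cat
print(toy_lemmatize("better"))       # better (treated as a noun by default)
print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("ran", "v"))     # run
```

The same word can therefore have different lemmas depending on the part of speech that is supplied, which is why lemmatizers typically accept it as a parameter.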

## **WordNet Lemmatizer**
The **WordNet Lemmatizer** is a lemmatization tool based on WordNet, a lexical database of the English language. WordNet groups words into sets of synonyms called "synsets" and provides a rich lexical and semantic structure for the English language. The WordNet Lemmatizer uses this information to reduce words to their base or dictionary form (lemma).

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# importing the WordNet-based lemmatizer class from nltk.stem module
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()    # instantiating an object of the WordNetLemmatizer class

lemma = lemmatizer.lemmatize('cats')    # calling the lemmatization algorithm on the desired word
print(f"'cats' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('better')
print(f"'better' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('abaci')
print(f"'abaci' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('aardwolves')
print(f"'aardwolves' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('generically')
print(f"'generically' after lemmatization: {lemma}")

print("\n\n\n")
lemma = lemmatizer.lemmatize('better', pos='a')   # 'a' denotes the ADJECTIVE part of speech
print(f"'better' (as an adjective) after lemmatization: {lemma}")

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


'cats' after lemmatization: cat
'better' after lemmatization: better
'abaci' after lemmatization: abacus
'aardwolves' after lemmatization: aardwolf
'generically' after lemmatization: generically




'better' (as an adjective) after lemmatization: good


### **Stemming on text string**




In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

def stem_text(text):
    # Initialize the Porter Stemmer
    stemmer = PorterStemmer()

    # Tokenize the text into words
    words = word_tokenize(text)

    # Apply the stemmer to each word and join them back into a text
    stemmed_text = ' '.join([stemmer.stem(word) for word in words])

    return stemmed_text

# Example usage:
text = "He is jumping, and he jumped over the jumps."
stemmed_text = stem_text(text)
print(stemmed_text)


he is jump , and he jump over the jump .


In [None]:
# This is the text on which you have to perform stemming; taken from Wikipedia.
text = "In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root."
print("Given text:")
print(text)

Given text:
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.


# **Tutorial**

In [None]:
#CODE BLOCK 1
en_stopwords = set(stopwords.words('english'))

def remove_punc(text_string):
  # lowercase the text, then drop every character that is not a letter, digit, or space
  return re.sub(r'[^a-zA-Z0-9 ]', '', text_string.lower())

def remove_stopwords(text_string):
  # split on single spaces and keep only the tokens that are not English stopwords
  return [ token for token in text_string.split(' ') if token not in en_stopwords ]

# applying punctuation removal to the text
unpunc_text = remove_punc(text)
print("After punctuation removal:")
print(unpunc_text)

# applying stopword removal to the text
clean_text = remove_stopwords(unpunc_text)
print("\n\nAfter stopword removal:")
print(clean_text)

After punctuation removal:
in linguistic morphology and information retrieval stemming is the process of reducing inflected or sometimes derived words to their word stem base or root form generally a written word form the stem need not be identical to the morphological root of the word it is usually sufficient that related words map to the same stem even if this stem is not in itself a valid root


After stopword removal:
['linguistic', 'morphology', 'information', 'retrieval', 'stemming', 'process', 'reducing', 'inflected', 'sometimes', 'derived', 'words', 'word', 'stem', 'base', 'root', 'form', 'generally', 'written', 'word', 'form', 'stem', 'need', 'identical', 'morphological', 'root', 'word', 'usually', 'sufficient', 'related', 'words', 'map', 'stem', 'even', 'stem', 'valid', 'root']


#### **Question 2. Perform stemming on the given cleaned text using the Porter Stemmer from NLTK.**

**Hint:** import PorterStemmer from nltk.stem

In [None]:
text = "In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root."

stemmer = PorterStemmer()
text_no_punct = re.sub(r'[^\w\s]', '', text)    # remove punctuation
words = word_tokenize(text_no_punct.lower())    # lowercase, then tokenize
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]    # remove stopwords
stemmed_words = [stemmer.stem(word) for word in filtered_words]        # stem each remaining word
print(stemmed_words)

['linguist', 'morpholog', 'inform', 'retriev', 'stem', 'process', 'reduc', 'inflect', 'sometim', 'deriv', 'word', 'word', 'stem', 'base', 'root', 'form', 'gener', 'written', 'word', 'form', 'stem', 'need', 'ident', 'morpholog', 'root', 'word', 'usual', 'suffici', 'relat', 'word', 'map', 'stem', 'even', 'stem', 'valid', 'root']


# **Lemmatization on text string**


In [None]:
from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Example text to lemmatize
text = "There are like more than 100 foxes and lions in this forest."

# Tokenize the text into words
words = text.split()

# Lemmatize each word and join them back into a sentence
lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in words])

# Print the original and lemmatized text
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)


Original Text: There are like more than 100 foxes and lions in this forest.
Lemmatized Text: There are like more than 100 fox and lion in this forest.


#### **Question 3. Perform lemmatization on the given cleaned text above using NLTK's lemmatizer.**

**Hint:** import WordNetLemmatizer from nltk.stem

In [None]:
# apply NLTK's lemmatizer on the cleaned text (after punctuation and stopwords are removed) below this comment
#CODE HERE
text = "In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form; generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root."

lemmatizer = WordNetLemmatizer()
text_no_punct = re.sub(r'[^\w\s]', '', text)
words = word_tokenize(text_no_punct.lower())
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

print(lemmatized_words)

['linguistic', 'morphology', 'information', 'retrieval', 'stemming', 'process', 'reducing', 'inflected', 'sometimes', 'derived', 'word', 'word', 'stem', 'base', 'root', 'form', 'generally', 'written', 'word', 'form', 'stem', 'need', 'identical', 'morphological', 'root', 'word', 'usually', 'sufficient', 'related', 'word', 'map', 'stem', 'even', 'stem', 'valid', 'root']
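
Placing the two results side by side makes the difference between the techniques visible. The snippet below copies the first ten items from the stemming output and the lemmatization output shown in this notebook (it does not call NLTK) and lists the positions where they disagree.

```python
# First ten items copied from the Porter stemmer output above.
stems = ['linguist', 'morpholog', 'inform', 'retriev', 'stem',
         'process', 'reduc', 'inflect', 'sometim', 'deriv']
# First ten items copied from the WordNet lemmatizer output above.
lemmas = ['linguistic', 'morphology', 'information', 'retrieval', 'stemming',
          'process', 'reducing', 'inflected', 'sometimes', 'derived']

# Keep only the pairs where the stem and the lemma differ.
changed = [(s, l) for s, l in zip(stems, lemmas) if s != l]
for stem, lemma in changed:
    print(f"{lemma:<12} stem: {stem}")
print(len(changed), "of", len(stems), "items differ")
# 9 of 10 items differ
```

The stems are often not dictionary words ("morpholog", "sometim"), whereas the lemmatizer, called without a part-of-speech argument, leaves most of these tokens as they were.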


---


## **(Tutorial) Sentence Segmentation using SpaCy**

Following is a dummy paragraph of text to demonstrate how to use SpaCy to segment text into sentences.

In [None]:
dummy_text3 = """Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.
"""

print(dummy_text3)

Here is the First Paragraph and this is the First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the first paragaraph. This paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the second paragraph. This paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. Here is the Second Sentence. Now is the Third Sentence. This is the Fourth Sentence of the third paragaraph. This paragraph is ending now with a Fifth Sentence.
4th paragraph just has one sentence in it.



In [None]:
nlp = spacy.load('en_core_web_sm')

# performing sentence splitting...
doc = nlp(dummy_text3)
for sentence in doc.sents:
  print(sentence)

Here is the First Paragraph and this is the First Sentence.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the first paragaraph.
This paragraph is ending now with a Fifth Sentence.

Now, it is the Second Paragraph and its First Sentence.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the second paragraph.
This paragraph is ending now with a Fifth Sentence.

Finally, this is the Third Paragraph and is the First Sentence of this paragraph.
Here is the Second Sentence.
Now is the Third Sentence.
This is the Fourth Sentence of the third paragaraph.
This paragraph is ending now with a Fifth Sentence.

4th paragraph just has one sentence in it.
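
For contrast with the trained pipeline used above, a purely rule-based splitter can be written in a few lines. This is a sketch under a strong assumption, namely that every `.`, `!`, or `?` followed by whitespace ends a sentence; `naive_sentences` is a hypothetical helper, not a SpaCy function.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentences("Here is the First Sentence. Here is the Second Sentence."))
# ['Here is the First Sentence.', 'Here is the Second Sentence.']

# The rule breaks on abbreviations, which is one reason to prefer a trained model.
print(naive_sentences("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```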



## **(Tutorial) Subword Tokenization using HuggingFace**

**Hugging Face** gives NLP practitioners access to pre-trained models together with the subword tokenizers those models were trained with. Its `transformers` library offers pre-trained models and tokenizers built on methods such as Byte Pair Encoding (BPE) and SentencePiece, which are widely used for subword tokenization, and the companion `tokenizers` library (used below) provides standalone tokenizer implementations.

**Subword tokenization** is a text processing technique used in natural language processing (NLP) to break words down into smaller units. This approach is particularly useful for languages with complex morphology and for handling out-of-vocabulary words. Methods such as Byte-Pair Encoding (BPE) and SentencePiece divide text into units that range from single characters to frequent word pieces, which lets a model cover an open-ended vocabulary with a fixed set of tokens. This improves the handling of rare words and helps NLP models across a wide range of languages and tasks.

In [None]:
!pip install tokenizers
#This is a JSON file that contains the vocabulary (i.e., the set of words and subword pieces) used by the GPT-2 model
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json
!wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

--2024-09-09 16:20:52--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 16.182.96.248, 52.217.227.80, 52.216.36.128, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|16.182.96.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [application/json]
Saving to: ‘gpt2-medium-vocab.json’


2024-09-09 16:20:54 (925 KB/s) - ‘gpt2-medium-vocab.json’ saved [1042301/1042301]

--2024-09-09 16:20:54--  https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 16.182.96.248, 52.217.227.80, 52.216.36.128, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|16.182.96.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [text/plain]
Saving to: ‘gpt2-merges.txt’


2024-09-09 16:20:56 (607 KB/s) - ‘gpt2-merges.txt’ saved [456318/456318]



**Byte Pair Encoding (BPE)** is a subword tokenization technique used in natural language processing (NLP) and text processing. Starting from individual characters (or bytes), it repeatedly merges the most frequent pair of adjacent symbols into a new symbol, which builds a flexible vocabulary of subword units adapted to the training text.
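
BPE learns its vocabulary by repeatedly merging the most frequent pair of adjacent symbols. The sketch below shows that training loop on a made-up four-word corpus. The function names and the `word_freqs` data are hypothetical, ties are broken by first occurrence, and the code omits details of real implementations such as byte-level input and end-of-word markers.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Return the learned merges and the corpus segmented with them."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first one wins ties
        merges.append(best)
        corpus = merge_pair(best, corpus)
    return merges, corpus

# Made-up word frequencies, for illustration only.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_freqs, num_merges=3)
print(merges)
# [('e', 's'), ('es', 't'), ('l', 'o')]
print(list(corpus))
# [('lo', 'w'), ('lo', 'w', 'e', 'r'), ('n', 'e', 'w', 'est'), ('w', 'i', 'd', 'est')]
```

After three merges the frequent ending "est" has become a single symbol, while rarer words are still spelled out from smaller pieces. A merges file such as `gpt2-merges.txt` stores an ordered list of merges of this kind.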

In [None]:
from tokenizers import ByteLevelBPETokenizer
gpt2vocab = "gpt2-medium-vocab.json"
gpt2merges = "gpt2-merges.txt"

bpe = ByteLevelBPETokenizer(gpt2vocab, gpt2merges)
bpe_encoding = bpe.encode("The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789.")
print(bpe_encoding.ids)
print(bpe_encoding.tokens)


[464, 2183, 286, 13630, 281, 2209, 319, 554, 7493, 3924, 3596, 2067, 351, 262, 845, 717, 554, 7493, 3924, 960, 20191, 2669, 447, 247, 82, 960, 261, 3035, 1542, 11, 1596, 4531, 13]
['The', 'Ġcustom', 'Ġof', 'Ġdelivering', 'Ġan', 'Ġaddress', 'Ġon', 'ĠIn', 'aug', 'uration', 'ĠDay', 'Ġstarted', 'Ġwith', 'Ġthe', 'Ġvery', 'Ġfirst', 'ĠIn', 'aug', 'uration', 'âĢĶ', 'George', 'ĠWashington', 'âĢ', 'Ļ', 's', 'âĢĶ', 'on', 'ĠApril', 'Ġ30', ',', 'Ġ17', '89', '.']
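
The unusual characters in the tokens above (`Ġ`, `âĢĶ`) come from the byte-level step: the text is encoded as UTF-8 and each byte is displayed as a printable stand-in character. The sketch below re-creates that byte-to-character table as we understand the GPT-2 scheme; it is a reimplementation for illustration, not code from the `tokenizers` library, and the helper names are hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(0xA1, 0xAC + 1))
            + list(range(0xAE, 0xFF + 1)))
    mapping = {b: chr(b) for b in keep}
    # Every remaining byte gets the next unused code point from 256 upward.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_byte_level(text):
    """Show text the way byte-level BPE tokens display it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_byte_level(symbols):
    """Invert to_byte_level and recover the original text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")

print(to_byte_level(" custom"))     # Ġcustom  (the space byte is shown as Ġ)
print(to_byte_level("\u2014"))      # âĢĶ      (the three UTF-8 bytes of the long dash)
print(from_byte_level("ĠWashington") == " Washington")  # True
```

Because the mapping is reversible, no information is lost, and a non-ASCII character such as the curly apostrophe can be split across several tokens, as in `'âĢ', 'Ļ'` above.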


**Question 4**: Take the encoding ids generated by the snippet above and decode them to recover the original text string.

**Hint**: use decode() method

In [None]:
from tokenizers import ByteLevelBPETokenizer

gpt2vocab = "gpt2-medium-vocab.json"
gpt2merges = "gpt2-merges.txt"
bpe = ByteLevelBPETokenizer(gpt2vocab, gpt2merges)
text = "The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789."
bpe_encoding = bpe.encode(text)
print("Encoded IDs:", bpe_encoding.ids)
print("Tokens:", bpe_encoding.tokens)
decoded_text = bpe.decode(bpe_encoding.ids)
print("Decoded Text:", decoded_text)


Encoded IDs: [464, 2183, 286, 13630, 281, 2209, 319, 554, 7493, 3924, 3596, 2067, 351, 262, 845, 717, 554, 7493, 3924, 960, 20191, 2669, 447, 247, 82, 960, 261, 3035, 1542, 11, 1596, 4531, 13]
Tokens: ['The', 'Ġcustom', 'Ġof', 'Ġdelivering', 'Ġan', 'Ġaddress', 'Ġon', 'ĠIn', 'aug', 'uration', 'ĠDay', 'Ġstarted', 'Ġwith', 'Ġthe', 'Ġvery', 'Ġfirst', 'ĠIn', 'aug', 'uration', 'âĢĶ', 'George', 'ĠWashington', 'âĢ', 'Ļ', 's', 'âĢĶ', 'on', 'ĠApril', 'Ġ30', ',', 'Ġ17', '89', '.']
Decoded Text: The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789.
