# Tokenization in NLP


Tokenization is the process of breaking down a text into smaller units called "tokens." These tokens can be words, subwords, phrases, or even individual characters, depending on the specific task and tokenizer used. It's often the first step in an NLP pipeline, as most NLP models require input in tokenized form.

Types of Tokenization:

* Word Tokenization: Splits text into individual words.
* Sentence Tokenization: Splits text into individual sentences.
* Subword Tokenization (e.g., WordPiece, BPE): Breaks words into smaller, more common subword units, especially useful for handling out-of-vocabulary words and morphologically rich languages.
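
Word and sentence tokenization are demonstrated with NLTK in the cells that follow. For subword tokenization, here is a toy sketch of how a word can be split, assuming a greedy longest-match-first strategy in the style of WordPiece, with `##` marking word-internal pieces. The function name `wordpiece_tokenize` and the vocabulary `toy_vocab` are made up for illustration; real tokenizers learn their vocabulary from a corpus.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword units."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical hand-written vocabulary, for illustration only
toy_vocab = {"token", "##ization", "##ize", "un", "##break", "##able"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unbreakable", toy_vocab))   # ['un', '##break', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

A word that never appeared as a whole, such as "unbreakable", is still represented by known pieces, which is why subword schemes cope well with out-of-vocabulary words.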


In [1]:
# !pip install nltk



In [2]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Artificial Intelligence is developing at a rapid pace. So it is immensely rewarding to work in an emerging and challenging space."

# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokens: ", word_tokens)

# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence Tokens: ", sentence_tokens)
print("Last Sentence: ", sentence_tokens[-1])

Word Tokens:  ['Artificial', 'Intelligence', 'is', 'developing', 'at', 'a', 'rapid', 'pace', '.', 'So', 'it', 'is', 'immensely', 'rewarding', 'to', 'work', 'in', 'an', 'emerging', 'and', 'challenging', 'space', '.']
Sentence Tokens:  ['Artificial Intelligence is developing at a rapid pace.', 'So it is immensely rewarding to work in an emerging and challenging space.']
Last Sentence:  So it is immensely rewarding to work in an emerging and challenging space.


# Stemming and Lemmatization

Both stemming and lemmatization are text normalization techniques that reduce words to a base or root form. They differ in approach and in the quality of the output:

* Stemming: A crude, rule-based process that chops suffixes off words, often producing strings that are not actual dictionary words. It's faster but can be less accurate.
    * Example: "running" -> "run", "flying" -> "fli" (not a real word)
* Lemmatization: A more sophisticated process that uses a vocabulary and morphological analysis to return a word's dictionary base form (lemma). It's slower but more accurate because it takes the word's part of speech into account.
    * Example: "running" -> "run", "caring" -> "care", "better" -> "good"
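
To make the contrast concrete, here is a deliberately naive sketch. The suffix list and the lookup table are invented for illustration; this is not the Porter algorithm and not WordNet, only the general idea of rule-based chopping versus dictionary lookup.

```python
# Illustrative suffix list only; not Porter's rule set
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary standing in for a real lexical resource
TOY_LEMMAS = {"running": "run", "flies": "fly", "better": "good"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["running", "flies", "better"]:
    print(f"{w}: stem={naive_stem(w)}, lemma={toy_lemmatize(w)}")
# running: stem=runn, lemma=run
# flies: stem=fli, lemma=fly
# better: stem=better, lemma=good
```

The rule-based version is cheap but produces non-words such as "runn" and "fli", and it cannot relate "better" to "good"; the lookup handles both at the cost of needing a dictionary.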


In [3]:
# Using NLTK for stemming and spaCy for lemmatization

In [4]:
# !pip install spacy



In [5]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
import spacy

In [6]:
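nltk.download('punkt')    # tokenizer models used by word_tokenize and sent_tokenize
nltk.download('wordnet')  # WordNet data used by WordNetLemmatizer
nltk.download('omw-1.4')  # Open Multilingual WordNet (version 1.4): WordNet support for multiple languages
# nltk.pos_tag (used below) needs a tagger model as well; uncomment if it is missing.
# Recent NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
# nltk.download('averaged_perceptron_tagger')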


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [7]:
# Download SpaCy English model
# python -m spacy download en_core_web_sm

In [8]:
# Stemming
porter_stemmer = PorterStemmer()
words = ["running","runs", "ran","flying","flew","caring","cars"]
stemmed_words = [porter_stemmer.stem(word) for word in words]
print("Stemmed words: ", stemmed_words)

Stemmed words:  ['run', 'run', 'ran', 'fli', 'flew', 'care', 'car']


In [9]:
# Lemmatization (using NLTK's WordNetLemmatizer, which requires POS tag for better accuracy)

lemmatizer = WordNetLemmatizer()
# A simple function to get Part of Speech (POS) tag for lemmatizer
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        "J" : wordnet.ADJ,
        "N" : wordnet.NOUN,
        "V" : wordnet.VERB,
        "R" : wordnet.ADV
    }
    return tag_dict.get(tag, wordnet.NOUN) # Default to Noun if not found

lemmatized_words_nltk = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]
print("Lemmatized Words (NLTK):", lemmatized_words_nltk)

Lemmatized Words (NLTK): ['run', 'run', 'ran', 'fly', 'flew', 'care', 'car']


In [10]:
# Lemmatization (using spaCy, which tags each token's part of speech from the sentence context)
nlp = spacy.load("en_core_web_sm")
text_for_lemmatization = "He was running very fast and caring for his family."
doc = nlp(text_for_lemmatization)
lemmatized_words_spacy = [token.lemma_ for token in doc]
print("Lemmatized words (Spacy) :", " ".join(lemmatized_words_spacy))

Lemmatized words (Spacy) : he be run very fast and care for his family .


# Removal of Stop Words

In [11]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords') # Download stopwords

text = "This is a very good example to show how stop words are removed from a text."
# stopwords.words('english') returns a list; converting it to a set makes each membership test fast
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)
filtered_words = [word for word in word_tokens if word.lower() not in stop_words]

print("Original Words: ", word_tokens)
print("Filtered Words (after removing stop words): ", filtered_words)



Original Words:  ['This', 'is', 'a', 'very', 'good', 'example', 'to', 'show', 'how', 'stop', 'words', 'are', 'removed', 'from', 'a', 'text', '.']
Filtered Words (after removing stop words):  ['good', 'example', 'show', 'stop', 'words', 'removed', 'text', '.']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
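
The filtered list above still contains the `.` token, because punctuation is not part of the stop word list. A minimal sketch of one way to drop it, assuming a small hand-written stop list (`toy_stop_words`, a stand-in for NLTK's English list so the snippet runs without downloads) and `str.isalpha` as the punctuation test:

```python
# Hypothetical mini stop list standing in for stopwords.words('english')
toy_stop_words = {"this", "is", "a", "very", "to", "how", "are", "from"}

tokens = ['This', 'is', 'a', 'very', 'good', 'example', 'to', 'show', 'how',
          'stop', 'words', 'are', 'removed', 'from', 'a', 'text', '.']

# Keep a token only if it is purely alphabetic and not a stop word
content_words = [t for t in tokens if t.isalpha() and t.lower() not in toy_stop_words]
print(content_words)
# ['good', 'example', 'show', 'stop', 'words', 'removed', 'text']
```

Note that `str.isalpha` also discards tokens such as numbers or contractions, so whether this filter is appropriate depends on the task.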


# TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is widely used in information retrieval and text mining.

* Term Frequency (TF): How often a word occurs in a given document.
* Inverse Document Frequency (IDF): A weight that grows as the word appears in fewer documents of the corpus.
* TF-IDF Score: The product of TF and IDF. A high TF-IDF score indicates that a word is frequent in a specific document but rare across the entire corpus, suggesting it's a good descriptor for that document.
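
To see where the numbers come from, the following sketch computes the scores by hand for the same three example sentences used in the scikit-learn cell. It assumes the variant that, to my understanding, `TfidfVectorizer` uses by default: raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names `tokenize` and `tfidf` are illustrative, not part of any library.

```python
import math
import re
from collections import Counter

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog again.",
    "The brown dog is quick and lazy.",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(d) for d in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF scores for one tokenized document."""
    tf = Counter(tokens)
    raw = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

scores = tfidf(tokenized[0])
print(round(scores["the"], 4), round(scores["fox"], 4), round(scores["dog"], 4))
# 0.4893 0.4142 0.2446
```

In the first document "the" scores highest even though it appears in every document: its IDF is the minimum of 1, but it occurs twice. Rare words such as "fox" outrank common ones such as "dog" when their counts are equal.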
    

In [12]:
# !pip install scikit-learn

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog again.",
    "The brown dog is quick and lazy."
]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix
print("TF-IDF Matrix: ")
print(tfidf_matrix.toarray())

# Map features to their TF-IDF scores for the first document
print("\nTF-IDF scores for the first document: ")
doc1_tfidf_scores = dict(zip(feature_names, tfidf_matrix.toarray()[0]))
print(sorted(doc1_tfidf_scores.items(), key=lambda x: x[1], reverse=True))  # sort (word, score) pairs by score, highest first

TF-IDF Matrix: 
[[0.         0.         0.31502724 0.24464675 0.41422296 0.
  0.         0.41422296 0.24464675 0.         0.31502724 0.31502724
  0.4892935 ]
 [0.46499651 0.         0.         0.27463443 0.         0.
  0.46499651 0.         0.27463443 0.46499651 0.35364183 0.
  0.27463443]
 [0.         0.48775955 0.37095371 0.28807865 0.         0.48775955
  0.         0.         0.28807865 0.         0.         0.37095371
  0.28807865]]

TF-IDF scores for the first document: 
[('the', np.float64(0.4892935045993933)), ('fox', np.float64(0.4142229588893787)), ('jumps', np.float64(0.4142229588893787)), ('brown', np.float64(0.31502723701987084)), ('over', np.float64(0.31502723701987084)), ('quick', np.float64(0.31502723701987084)), ('dog', np.float64(0.24464675229969665)), ('lazy', np.float64(0.24464675229969665)), ('again', np.float64(0.0)), ('and', np.float64(0.0)), ('is', np.float64(0.0)), ('jump', np.float64(0.0)), ('never', np.float64(0.0))]
