# Feature Extraction for NLP: Using the Bag of Words Model

In this notebook, we will learn how to perform **feature extraction** for Natural Language Processing (NLP). Each token (e.g., word or sub-word) in a text is treated as a feature.

Feature extraction is a critical step in NLP, involving the transformation of raw data (e.g., text) into numerical features suitable for machine learning (ML). This process is also known as feature or token embedding.

## Steps in Feature Extraction

Typically, feature extraction involves four key steps:

1. **Text Standardization** (stemming & lemmatization)
2. **Text Preprocessing** (removing stop words & tokenization)
3. **Vocabulary Construction and Indexing**
4. **Vectorization of the Features**

There are two primary types of feature vectorization models:

- **Word order agnostic**: the Bag of Words (BoW) model
- **Word order preserving**: sequence models

In this notebook, we will specifically use the BoW model for feature vectorization.


## Python Libraries for the BoW-based Feature Extraction

We will use Python libraries such as the **Natural Language Tool Kit (NLTK)** and Scikit-Learn to extract numerical features from text content.

# <font color=blue> 1. Text Standardization by Stemming & Lemmatization </font>

Before we do text preprocessing (e.g., tokenize, remove stop words, etc.) and convert to vectors of numbers, sometimes it is useful to standardize the text.


## What is Text Standardization?

Languages we speak and write are made up of many words, often **derived from one another**. When a language contains words that are modified based on their use in speech, it is called an inflected language.

Text standardization reduces words to their root form, as shown in the examples below.

- The boy's cars are different colors --> the boy car be differ color
- Playing, Plays, Played --> Play (common root)
- am, are, is --> be (common root)
- Car, cars, car's, cars' --> car


In **NLP** there are two commonly used text standardization techniques:
- Stemming
- Lemmatization 

Stemming and Lemmatization have been studied, and algorithms have been developed in Computer Science since the 1960s.


#### But stemming and Lemmatization do standardization in different ways!



### Stemming

Stemming is the process of reducing inflected words to their **root forms** (stems) by stripping affixes. A group of related words is mapped to the same stem, even if the stem itself is <strong><font color=red>not a valid word</font></strong> in the language.

For example, books --> book, looked --> look, and believing --> believ (not a valid word).

Two commonly used stemming algorithms are:
- Porter stemming algorithm (removes common morphological and inflexional endings from words)
- Lancaster stemming algorithm (a more aggressive stemming algorithm)

The Porter stemmer is the older of the two, originally developed in 1979. The Lancaster stemmer was developed in 1990 and strips suffixes more aggressively than the Porter stemmer.

### Lemmatization
 
Lemmatization, unlike stemming, reduces inflected words in a way that ensures the <strong><font color=red>root word belongs to the language</font></strong>. It does not simply chop off inflections. Instead, it uses lexical knowledge bases (e.g., a dictionary) to get the correct base forms of words.

In lemmatization, a root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, **dictionary form**, or citation form of a set of words.


## Stemming & Lemmatization using Python

The Natural Language Tool Kit (NLTK) is a Python library for building programs that work with natural language. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as the <strong><font color=blue size=4>WordNet</font></strong> word repository. The library supports operations such as tokenizing, stemming, classification, parsing, tagging, and semantic reasoning.


### Installing NLTK:
To install nltk use the pip installer:
- pip install nltk


## Stemming vs. Lemmatization

Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a **lemma is an actual word in the language**.

Stemming follows a fixed algorithm that applies rules directly to words, making it faster. In contrast, lemmatization looks words up in the WordNet corpus to produce the lemma, which makes it **slower than stemming**. Additionally, lemmatization needs the part of speech of a word to generate the correct lemma.

**So, when should each method be used?**

The choice depends on the application's needs. If speed is a priority, stemming is preferable since a lemmatizer must look words up in a corpus, which requires more time and processing. However, if linguistic accuracy is essential, such as in applications where correct word forms matter, lemmatization is more suitable because it matches words to their root forms through a linguistic corpus.
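
The contrast above can be illustrated with a deliberately tiny sketch. The suffix rules and the lookup table below are hypothetical stand-ins made up for this example; they are not the Porter algorithm or WordNet. They only show that rule-based suffix stripping can produce non-words, while a dictionary lookup returns valid words.

```python
# Toy illustration only: these rules and this lookup table are made up for the example.
SUFFIXES = ("ing", "ed", "es", "s")   # tried in this order
LEMMA_TABLE = {                        # stand-in for a lexical knowledge base
    "believing": "believe",
    "received": "receive",
    "feet": "foot",
    "are": "be",
}

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

words = ["believing", "received", "feet", "are"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

for w, s, l in zip(words, stems, lemmas):
    print("%-10s stem: %-8s lemma: %s" % (w, s, l))
```

The stemmer needs no external resource, which is why it is fast, but "believ" and "receiv" are not words. The lemmatizer returns valid words only because the table already contains them.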


For more details see the following URL:
https://www.datacamp.com/community/tutorials/stemming-lemmatization-python

In [1]:
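import ssl
# Workaround for SSL certificate errors when downloading NLTK data on some systems.
# Note: this disables HTTPS certificate verification for the whole process.
ssl._create_default_https_context = ssl._create_unverified_context

# Download the NLTK resources used in this notebook
import nltk
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize
nltk.download('omw-1.4')    # Open Multilingual WordNet
nltk.download('punkt')      # tokenizer models
nltk.download('wordnet')    # WordNet corpus used by the lemmatizer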

from nltk.stem import WordNetLemmatizer, PorterStemmer, LancasterStemmer

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/mhasan2/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/mhasan2/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mhasan2/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/mhasan2/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Lemmatizing Using the WordNetLemmatizer

While doing lemmatization, it is useful to remember:
- Lemmatization works **only on individual words**. Thus, we need to tokenize a document first.
- Unlike the stemmers, the WordNetLemmatizer **does not lowercase its input, so capitalized words are often left unchanged**. Thus, we need to convert a word to lowercase before performing lemmatization.

In [2]:
# Create the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize Single Words

print("Some lemmatized words:")
print("bats -> %s" % lemmatizer.lemmatize("bats"))

print("are -> %s" % lemmatizer.lemmatize("are"))

print("feet -> %s" % lemmatizer.lemmatize("feet"))

print("plays -> %s" % lemmatizer.lemmatize("plays"))

print("blasphemious -> %s" % lemmatizer.lemmatize("blasphemious"))

print("BLASHEPHEMERS -> %s" % lemmatizer.lemmatize("BLASHEPHEMERS"))

print("blashephemers -> %s" % lemmatizer.lemmatize("blashephemers"))

print("believing -> %s" % lemmatizer.lemmatize("believing"))



# Lemmatization works only on individual words. Thus, we need to tokenize a document/sentence first.

# Define a sentence to be lemmatized
sentence = "The students received grades from the Professor's webpage."
print("\nExample Sentence: ", sentence)

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print("\nTokenized Sentence:")
print(word_list)


# Lemmatize the list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print("\nLemmatized Tokens:")
print(lemmatized_output)

Some lemmatized words:
bats -> bat
are -> are
feet -> foot
plays -> play
blasphemious -> blasphemious
BLASHEPHEMERS -> BLASHEPHEMERS
blashephemers -> blashephemers
believing -> believing

Example Sentence:  The students received grades from the Professor's webpage.

Tokenized Sentence:
['The', 'students', 'received', 'grades', 'from', 'the', 'Professor', "'s", 'webpage', '.']

Lemmatized Tokens:
The student received grade from the Professor 's webpage .


## Stemming Using the PorterStemmer

In [3]:
# Create the Porter stemmer
stemmer = PorterStemmer()


print("Some stemmed words:")
print("bats -> %s" % stemmer.stem("bats"))

print("are -> %s" % stemmer.stem("are"))

print("feet -> %s" % stemmer.stem("feet"))

print("plays -> %s" % stemmer.stem("plays"))

print("blasphemious -> %s" % stemmer.stem("blasphemious"))

print("BLASHEPHEMERS -> %s" % stemmer.stem("BLASHEPHEMERS"))

print("blashephemers -> %s" % stemmer.stem("blashephemers"))

print("believing -> %s" % stemmer.stem("believing"))

word_list = nltk.word_tokenize(sentence)
print("\nTokenized Sentence:")
print(word_list)


stemmed_output = ' '.join([stemmer.stem(w) for w in word_list])
print("\nStemmed Tokens (Porter):")
print(stemmed_output)

Some stemmed words:
bats -> bat
are -> are
feet -> feet
plays -> play
blasphemious -> blasphemi
BLASHEPHEMERS -> blashephem
blashephemers -> blashephem
believing -> believ

Tokenized Sentence:
['The', 'students', 'received', 'grades', 'from', 'the', 'Professor', "'s", 'webpage', '.']

Stemmed Tokens (Porter):
the student receiv grade from the professor 's webpag .


## Stemming Using the LancasterStemmer

In [4]:
# Create the Lancaster stemmer
stemmer = LancasterStemmer()

print("Some stemmed words:")
print("bats -> %s" % stemmer.stem("bats"))

print("are -> %s" % stemmer.stem("are"))

print("feet -> %s" % stemmer.stem("feet"))

print("plays -> %s" % stemmer.stem("plays"))

stemmed_output = ' '.join([stemmer.stem(w) for w in word_list])
print("\nStemmed Tokens (Lancaster):")
print(stemmed_output)

Some stemmed words:
bats -> bat
are -> ar
feet -> feet
plays -> play

Stemmed Tokens (Lancaster):
the stud receiv grad from the profess 's webp .


## <font color=maroon> Observation about Stemming & Lemmatization </font>

We can make two important observations about stemming and lemmatization in the context of text classification:

- Lemmatization is a **more suitable technique** for token standardization in text classification. Unlike stemming, which may produce non-existent or truncated forms of words, lemmatization reduces inflected words to their canonical forms (lemmas) that are recognized in the language.

- If stemming is to be used, the **Porter stemmer is the preferred choice**. The Lancaster stemmer is more aggressive in its approach and may lead to over-stemming, which can negatively impact the quality of the text classification.



# <font color=blue> 2. Text Preprocessing (removing stop words & tokenization), 3. Vocabulary Construction and Indexing, and 4. Vectorization of the features using the BoW Model</font>


## Bag of Words Model

In a Bag of Words (BoW) model, documents are represented by the occurrences of words or tokens. This model completely disregards the relative positions of the words or tokens within the document. It is referred to as a "bag" of words because all information regarding the order or structure of words is discarded. The focus of the BoW model is solely on whether known words appear in the document, rather than their specific locations. 

The underlying intuition is that documents are similar if they have similar content. Furthermore, the content alone can tell us something about the type of document.

The BoW model can vary in complexity, which arises from two key considerations: 
- Designing the vocabulary of known words (or tokens)
- Scoring the presence of these known words

In the BoW model, tokenization typically employs the n-gram method, where tokens are formed by grouping n consecutive words. Tokens can represent single words (unigrams), pairs of words (bigrams), sequences of three words (trigrams), or even individual characters. 

Within the BoW framework, there are two primary methods for vectorizing features:
- **Count Vector**: This represents the binary count of tokens or the frequency of token occurrences.
- **TF-IDF Vector**: This method adjusts the word frequency by considering the importance of words across the entire document collection.

First, we will present the count vectorization BoW technique.


## Count Vectorization BoW Technique

We will explore two techniques for count vectorization:

- **Counting the frequency of tokens**
- **Using binary counts of tokens**

### Steps for Count Vectorization

1. **Word Assignment**: Assign a fixed integer ID to each word that appears in any document within the training set. This is achieved by constructing a dictionary that maps each word to its corresponding integer index.

2. **Counting Occurrences or Binary Count**: For each document $i$, either count the number of occurrences of each word $w$ or assign a binary count (1 if the word is present, 0 if it is not). Store this value in $X[i, j]$ for feature $j$, where $j$ is the index of word $w$ in the dictionary.

The Bag of Words representation implies that the number of features, $n_{\text{features}}$, equals the total number of distinct words in the corpus.
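
The two steps above can be sketched in plain Python. This is an illustrative sketch, not how CountVectorizer is implemented; the tokenize helper and its regular expression are assumptions chosen to mimic a simple word tokenizer that ignores punctuation and single-character tokens.

```python
import re

docs = ["this book is good .", "good book are good to read ."]

def tokenize(text):
    # Assumption: lowercase the text and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: assign an integer index to every distinct word (sorted for reproducibility)
all_words = sorted({token for d in docs for token in tokenize(d)})
vocabulary = {word: j for j, word in enumerate(all_words)}

# Step 2: fill X_counts[i][j] with the number of times word j occurs in document i
X_counts = [[0] * len(vocabulary) for _ in docs]
for i, d in enumerate(docs):
    for token in tokenize(d):
        X_counts[i][vocabulary[token]] += 1

# Binary variant: 1 if the word is present, 0 otherwise
X_binary = [[1 if count > 0 else 0 for count in row] for row in X_counts]

print("Vocabulary:", vocabulary)
print("Counts:", X_counts)
print("Binary:", X_binary)
```

Each row has one entry per vocabulary word, so the number of columns equals the number of distinct words in the corpus.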

## Text Preprocessing and BoW Feature Vectorization: Counting the Frequency of Tokens

The Scikit-Learn CountVectorizer() object implements **both text preprocessing and BoW feature vectorization** in a single class.

It converts a collection of text documents to a matrix of token counts. It produces a sparse representation of the counts using scipy.sparse.csr_matrix.

For example, it creates a set of $d$ unique words (referred to as tokens) from the collection of documents. Then, each document is represented by a d-dimensional feature vector. Each component of this vector represents the occurrence count of the feature (term or word) in that document. 

In [5]:
# A function to lemmatize a sentence
def sentenceLemmatizer(sentence):
    # Tokenize: Split the sentence into words
    word_list = nltk.word_tokenize(sentence)
    
    # Lemmatize the list of words and join
    lemmatized_sentence = ' '.join([lemmatizer.lemmatize(w.lower()) for w in word_list])
    
    return lemmatized_sentence
    
    

# # Define a sentence to be lemmatized
# sentence = "The students received grades from the Professor's WEBPAGES."
# print("\nExample Sentence: ", sentence)


# lemmatized_output = sentenceLemmatizer(sentence)
# print("\nLemmatized Output:")
# print(lemmatized_output)

In [6]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Create a set of documents
documents = [
    "This book is good.",
    "Good books are good to read.",
]

# Display the original documents
print("Original Documents:")
for j, doc in enumerate(documents, start=1):
    print("Document %d: %s" % (j, doc))


# Lemmatize the documents using the "sentenceLemmatizer" function defined above.
# A plain list is used instead of a NumPy string array: NumPy string arrays have a
# fixed item width, so assigning a longer lemmatized string would be silently truncated.
documents = [sentenceLemmatizer(doc) for doc in documents]


# Display the lemmatized documents
print("\nLemmatized Documents:")
for j, doc in enumerate(documents, start=1):
    print("Document %d: %s" % (j, doc))


# BoW feature vectorization: Create a count vectorizer object
count_vect = CountVectorizer(lowercase=True)


# Create a matrix representation of the documents
# Each row represents a single document
# Each column represents the term frequency for each feature
# fit_transform returns a sparse matrix; toarray() converts it to a dense NumPy array
document_counts = count_vect.fit_transform(documents).toarray()



print("\nFeature Names:")
print(count_vect.get_feature_names_out())



print("\nVocabulary: ", count_vect.vocabulary_)
print("Note: After each word the index of that word is given. It's not word count.")

print("\nGet the index of the words from the vocabulary:")
print("Vocabulary - Index of good: ", count_vect.vocabulary_.get("good"))
print("Vocabulary - Index of awesome: ", count_vect.vocabulary_.get("awesome"))



print("\nCount Vector Matrix (Dense Matrix):")
print(document_counts)

Original Documents:
Document 1: This book is good.
Document 2: Good books are good to read.

Lemmatized Documents:
Document 1: this book is good .
Document 2: good book are good to read .

Feature Names:
['are' 'book' 'good' 'is' 'read' 'this' 'to']

Vocabulary:  {'this': 5, 'book': 1, 'is': 3, 'good': 2, 'are': 0, 'to': 6, 'read': 4}
Note: After each word the index of that word is given. It's not word count.

Get the index of the words from the vocabulary:
Vocabulary - Index of good:  2
Vocabulary - Index of awesome:  None

Count Vector Matrix (Dense Matrix):
[[0 1 1 1 0 1 0]
 [1 1 2 0 1 0 1]]


## Text Preprocessing: Removing Stop Words

Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text.

Stop words may be removed so that they are not construed as signal for prediction. Sometimes, however, such words are useful for prediction, for example when classifying writing style or personality.

To remove the stop words, set the "stop_words" parameter of the CountVectorizer to 'english'.

Note that if the value is set to ‘english’, a built-in stop word list for English is used.

However, there are several known issues with the built-in ‘english’ list, so it is worth considering an alternative. For details, see:

URL: http://aclweb.org/anthology/W18-2502


In [7]:
# Create a set of documents
documents = [
    "This book is good.",
    "Good books are good to read.",
]

# Display the original documents
print("Original Documents:")
for j, doc in enumerate(documents, start=1):
    print("Document %d: %s" % (j, doc))


# Lemmatize the documents
documents = [sentenceLemmatizer(doc) for doc in documents]


# Display the lemmatized documents
print("\nLemmatized Documents:")
for j, doc in enumerate(documents, start=1):
    print("Document %d: %s" % (j, doc))


# Fit the vectorizer on the lemmatized documents, removing English stop words
count_vect = CountVectorizer(lowercase=True, stop_words='english')
document_counts = count_vect.fit_transform(documents)


print("\nFeature Names:")
print(count_vect.get_feature_names_out())


print("\nVocabulary: ", count_vect.vocabulary_)


print("\nCount Vector Matrix (Dense Matrix):")
print(document_counts.toarray())

Original Documents:
Document 1: This book is good.
Document 2: Good books are good to read.

Lemmatized Documents:
Document 1: this book is good .
Document 2: good book are good to read .

Feature Names:
['book' 'good' 'read']

Vocabulary:  {'book': 0, 'good': 1, 'read': 2}

Count Vector Matrix (Dense Matrix):
[[1 1 0]
 [1 2 1]]


# BoW Feature Vectorization: Using Binary Counts of Tokens

In certain scenarios, we are particularly interested in binary occurrence markers for features.

For instance, very short texts may yield noisy term frequency–inverse document frequency (tf-idf) values, while binary occurrence information tends to be more stable.

Additionally, some estimators, such as **Multivariate Bernoulli Naive Bayes**, explicitly model discrete Boolean random variables.

To record binary occurrences of features, we can set the "binary" parameter of the CountVectorizer to True.

In [8]:
# Lemmatize the documents
documents = [sentenceLemmatizer(doc) for doc in documents]


# Display the documents
for j, doc in enumerate(documents, start=1):
    print("Document %d: %s" % (j, doc))


# binary : boolean, default=False
# If True, all non zero counts are set to 1.
# This is useful for discrete probabilistic models that model binary events rather than integer counts.
count_vect = CountVectorizer(lowercase=True, binary=True, stop_words='english')
document_counts = count_vect.fit_transform(documents)



print("\nFeature Names:")
print(count_vect.get_feature_names_out())


print("\nVocabulary: ", count_vect.vocabulary_)

print("\nSize of Vocabulary: ", len(count_vect.vocabulary_))


print("\nCount Vector Matrix")
print(document_counts.toarray())


print("\nDimension of Count Vector Matrix: ", document_counts.toarray().shape)


print("\nRow 1 of the Matrix: ", document_counts.toarray()[0, :])


Document 1: this book is good .
Document 2: good book are good to read .

Feature Names:
['book' 'good' 'read']

Vocabulary:  {'book': 0, 'good': 1, 'read': 2}

Size of Vocabulary:  3

Count Vector Matrix
[[1 1 0]
 [1 1 1]]

Dimension of Count Vector Matrix:  (2, 3)

Row 1 of the Matrix:  [1 1 0]


# BoW Feature Vectorization: The TF-IDF Technique

To this point, we have utilized the count vectorization method (which involves binary counts of tokens or frequency counts) for BoW vectorization. However, this approach has its limitations.

In a large text corpus, certain words (such as “the,” “a,” and “is” in English) tend to appear very frequently, contributing minimal meaningful information about the actual content of the documents. If we were to feed this raw count data directly into a classifier, these common terms would overshadow the frequencies of rarer but more informative terms.

To address this issue, we can **re-weight the count features into floating-point values** that are more suitable for use by a classifier. This is typically achieved using the **term frequency–inverse document frequency (tf-idf)** transformation.

There are two main ways to implement the tf-idf transformation:

1. First, compute the occurrence counts and then apply the tf-idf transformer using separate components (CountVectorizer followed by TfidfTransformer).
2. Alternatively, use the TfidfVectorizer, which integrates CountVectorizer and TfidfTransformer into a single model.
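
A short sketch comparing the two routes on a pair of already lemmatized documents. It assumes default settings for all three classes; with non-default parameters, the same values would have to be passed on both routes for the results to match.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["this book is good .", "good book are good to read ."]

# Option 1: compute the counts first, then re-weight them
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts).toarray()

# Option 2: a single vectorizer that performs both steps
tfidf_one_step = TfidfVectorizer().fit_transform(docs).toarray()

same = np.allclose(tfidf_two_step, tfidf_one_step)
print("Both routes give the same matrix:", same)
```

The two-step route is handy when the raw counts are needed as well; the single vectorizer is more compact when only the tf-idf values matter.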



In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a set of documents
documents = [
    "This book is good.",
    "Good books are good to read.",
]

# Display the original documents
print("Original Documents:")
for j, doc in enumerate(documents, start=1):
    print("Document %d: %s" % (j, doc))


# Lemmatize the documents
documents = [sentenceLemmatizer(doc) for doc in documents]


# Display the lemmatized documents
print("\nLemmatized Documents:")
for j, doc in enumerate(documents, start=1):
    print("Document %d: %s" % (j, doc))


# Fit the TF-IDF vectorizer on the lemmatized documents
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents).toarray()



print("\nFeature Names:")
print(tfidf_vectorizer.get_feature_names_out())


print("\nVocabulary:")
print(tfidf_vectorizer.vocabulary_)


print("\nTF-IDF Matrix:")
print(tfidf_matrix)


Original Documents:
Document 1: This book is good.
Document 2: Good books are good to read.

Lemmatized Documents:
Document 1: this book is good .
Document 2: good book are good to read .

Feature Names:
['are' 'book' 'good' 'is' 'read' 'this' 'to']

Vocabulary:
{'this': 5, 'book': 1, 'is': 3, 'good': 2, 'are': 0, 'to': 6, 'read': 4}

TF-IDF Matrix:
[[0.         0.40993715 0.40993715 0.57615236 0.         0.57615236
  0.        ]
 [0.42519636 0.30253071 0.60506143 0.         0.42519636 0.
  0.42519636]]
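
The values in this matrix can be reproduced by hand. The sketch below assumes scikit-learn's default TfidfVectorizer settings, namely a smoothed idf of $\ln\frac{1 + n}{1 + \text{df}(t)} + 1$ (with $n$ the number of documents and $\text{df}(t)$ the number of documents containing term $t$), followed by L2 normalization of each row. The tokenizing regular expression is likewise an assumption meant to mimic the default tokenizer; with other settings the numbers would differ.

```python
import math
import re

docs = ["this book is good .", "good book are good to read ."]

# Assumption: tokens are runs of two or more word characters, lowercased
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = {t: sum(t in tokens for tokens in tokenized) for t in vocab}

# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

tfidf = []
for tokens in tokenized:
    row = [tokens.count(t) * idf[t] for t in vocab]   # term frequency * idf
    norm = math.sqrt(sum(v * v for v in row))         # L2 norm of the row
    tfidf.append([v / norm for v in row])

print(vocab)
for row in tfidf:
    print([round(v, 8) for v in row])
```

Terms that occur in both documents ("book", "good") get the minimum idf of 1, so within a document they weigh less than terms that occur in only one document ("is", "this").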


# Limitations of the Bag of Words Representation

The BoW model considered so far is a collection of unigrams. However, it has several limitations:

- It cannot capture phrases or multi-word expressions.
- It effectively disregards any dependency on word order.
- It does not account for potential misspellings or word derivations.

A more sophisticated approach for feature representation is the n-grams model. Instead of constructing a simple collection of unigrams (n=1), one might opt for a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.

The n-grams model provides a set of co-occurring words within a given context. When computing n-grams, we slide a window of $n$ words over the text, moving forward one word at a time. For example, consider the sentence "The woods are lovely, dark and deep." If $n = 2$ (bigrams), the resulting n-grams would be:

- the woods
- woods are
- are lovely
- lovely dark
- dark and
- and deep

Alternatively, we might consider a collection of character n-grams, which offers resilience against misspellings and word derivations.
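
The sliding-window idea can be sketched in a few lines of plain Python. The helper names word_ngrams and char_ngrams are hypothetical, and the simple regular expression tokenizer is an assumption made for this illustration.

```python
import re

def word_ngrams(text, n):
    """Slide a window of n words over the text, one word at a time."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Slide a window of n characters over a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

bigrams = word_ngrams("The woods are lovely, dark and deep.", 2)
print(bigrams)

# Character trigrams of a word and of a misspelling of it still overlap
shared = set(char_ngrams("lovely", 3)) & set(char_ngrams("lovley", 3))
print(shared)
```

As word-level tokens, "lovely" and "lovley" have nothing in common, whereas their character trigrams still share "lov". This is the kind of resilience to misspellings that character n-grams offer.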

## How to Use the n-grams Model

To implement the n-grams model, we need to set the following two parameters of the CountVectorizer:

- **ngram_range**: tuple (min_n, max_n)  
  Specifies the lower and upper boundaries of the range of n-values for different n-grams to be extracted. All values of $n$ such that $\text{min\_n} \leq n \leq \text{max\_n}$ will be used.

- **analyzer**: string, {‘word’, ‘char’, ‘char_wb’} or callable  
  Indicates whether the features should consist of word or character n-grams. The option ‘char_wb’ creates character n-grams only from text within word boundaries; n-grams at the edges of words are padded with spaces.

In [10]:
# To extract both unigrams and bigrams, set "ngram_range" to the tuple (1, 2)
count_vect = CountVectorizer(lowercase=True, analyzer="word", ngram_range=(1, 2))
document_counts = count_vect.fit_transform(documents)

j = 1
for i in documents:
    print("Document %d: %s" % (j, i))
    j += 1



print("\nFeature Names:")
print(count_vect.get_feature_names_out())


print("\nVocabulary: ", count_vect.vocabulary_)


print("\nCount Vector Matrix")
print(document_counts.toarray())

Document 1: this book is good .
Document 2: good book are good to read .

Feature Names:
['are' 'are good' 'book' 'book are' 'book is' 'good' 'good book' 'good to'
 'is' 'is good' 'read' 'this' 'this book' 'to' 'to read']

Vocabulary:  {'this': 11, 'book': 2, 'is': 8, 'good': 5, 'this book': 12, 'book is': 4, 'is good': 9, 'are': 0, 'to': 13, 'read': 10, 'good book': 6, 'book are': 3, 'are good': 1, 'good to': 7, 'to read': 14}

Count Vector Matrix
[[0 0 1 0 1 1 0 0 1 1 0 1 1 0 0]
 [1 1 1 1 0 2 1 1 0 0 1 0 0 1 1]]
