### **7. Encodings**

The next step in text preprocessing is to vectorize the filtered tokens so they can be used to build models. Vectorization assigns numeric values to the words in a corpus, and there are several techniques for doing so.

#### A) One-Hot Encoding

One-hot encoding is a way to represent words (or categorical data) as **binary vectors**. Each word is assigned a **unique index**, and its vector has **1 at its position** while all other positions are **0**.

For example, if we have the words:  
**["cat", "dog", "fish"]**, we can represent them as:

| Word     | cat | dog | fish |
| -------- | --- | --- | ---- |
| **cat**  | 1   | 0   | 0    |
| **dog**  | 0   | 1   | 0    |
| **fish** | 0   | 0   | 1    |
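
The table above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming a fixed vocabulary list; the helper name `one_hot` is hypothetical and not part of any library.

```python
def one_hot(word, vocabulary):
    """Return a binary vector with a 1 at the index of `word` in `vocabulary`."""
    index = vocabulary.index(word)  # raises ValueError for out-of-vocabulary words
    return [1 if i == index else 0 for i in range(len(vocabulary))]


vocabulary = ["cat", "dog", "fish"]

# Map every word in the vocabulary to its one-hot vector
table = {word: one_hot(word, vocabulary) for word in vocabulary}

for word, vector in table.items():
    print(word, vector)
```

Each vector is as long as the vocabulary, so this representation grows quickly as the number of unique words increases.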

Implementing one-hot encoding with scikit-learn takes only a few lines.

In [1]:
corpus = [
    "I love NLP",
    "NLP is awesome",
    "I love Machine Learning"
]

In [5]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

corpus = ["I love NLP", "NLP is awesome", "I love Machine Learning"]

# Convert text into a list of unique words
unique_words = list(set(" ".join(corpus).split()))

# Convert words into a one-hot encoded format
one_hot_encoder = OneHotEncoder(sparse_output=False)  # return a dense array (requires scikit-learn >= 1.2)
one_hot_encoded = one_hot_encoder.fit_transform(np.array(unique_words).reshape(-1, 1))

# Display results
for word, encoding in zip(unique_words, one_hot_encoded):
    print(f"{word}: {encoding}")


Machine: [0. 0. 1. 0. 0. 0. 0.]
love: [0. 0. 0. 0. 0. 0. 1.]
I: [1. 0. 0. 0. 0. 0. 0.]
is: [0. 0. 0. 0. 0. 1. 0.]
Learning: [0. 1. 0. 0. 0. 0. 0.]
NLP: [0. 0. 0. 1. 0. 0. 0.]
awesome: [0. 0. 0. 0. 1. 0. 0.]
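
The position of the 1 in each vector does not follow the order in which words appear in the corpus. `OneHotEncoder` sorts the categories it finds (they are available as `one_hot_encoder.categories_` after fitting), and uppercase letters sort before lowercase ones, which is why `I` gets index 0 and `love` gets index 6. The order of the printed lines, by contrast, comes from the Python `set` and can change between runs. The sketch below reproduces the index assignment in plain Python as an illustration; it is not a description of scikit-learn's internals, and the name `word_to_index` is hypothetical.

```python
corpus = ["I love NLP", "NLP is awesome", "I love Machine Learning"]

# Sorting makes the word-to-index mapping reproducible across runs
vocabulary = sorted(set(" ".join(corpus).split()))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(word_to_index)
```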


#### B) Bag of Words (BoW)

Bag of Words (BoW) is a simple way to **convert text into numerical features** for machine learning models. It represents **how frequently words appear** in a document **without considering their order** or context.

**📌 How BoW Works**

1️⃣ **Create a Vocabulary** (a set of unique words in the dataset).  
2️⃣ **Count Word Occurrences** in each document.  
3️⃣ **Convert into a Vector Representation.**

**Example**:  
Consider these two sentences:  
📌 **Sentence 1:** "I love NLP and Machine Learning."  
📌 **Sentence 2:** "I love Deep Learning and NLP."

🔹 **Step 1: Create a Vocabulary**  
Unique words in both sentences →  
📌 **["I", "love", "NLP", "and", "Machine", "Learning", "Deep"]**

🔹 **Step 2: Create Word Frequency Vectors**

|                | I   | love | NLP | and | Machine | Learning | Deep |
| -------------- | --- | ---- | --- | --- | ------- | -------- | ---- |
| **Sentence 1** | 1   | 1    | 1   | 1   | 1       | 1        | 0    |
| **Sentence 2** | 1   | 1    | 1   | 1   | 0       | 1        | 1    |

Each row is a **numerical representation of a sentence** based on word counts.
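
The three steps can also be written out by hand. The following is a minimal sketch using only the standard library, assuming a naive tokenizer that strips the trailing period and splits on whitespace; the variable names are illustrative.

```python
from collections import Counter

sentences = [
    "I love NLP and Machine Learning.",
    "I love Deep Learning and NLP.",
]

# Naive tokenization: drop the trailing period and split on whitespace
tokenized = [s.rstrip(".").split() for s in sentences]

# Step 1: build the vocabulary in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Steps 2 and 3: count occurrences and lay them out as one vector per sentence
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

A real tokenizer would also need to handle case, punctuation inside sentences, and similar details, which this sketch leaves out.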

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow_encoded = vectorizer.fit_transform(corpus)

# Print the learned vocabulary, then the count matrix as a dense array for readability
print(vectorizer.get_feature_names_out())
print(bow_encoded.toarray())

['awesome' 'is' 'learning' 'love' 'machine' 'nlp']
[[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]


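Here, each row corresponds to one sentence in `corpus`, and each column to one word in the vocabulary printed above it:

- `[0 0 0 1 0 1]` → Sentence 1: contains 'love' and 'nlp'
- `[1 1 0 0 0 1]` → Sentence 2: contains 'awesome', 'is', and 'nlp'
- `[0 0 1 1 1 0]` → Sentence 3: contains 'learning', 'love', and 'machine'

By default, `CountVectorizer` lowercases the text and ignores single-character tokens, so "I" does not appear in the vocabulary.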
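
To see why "I" is missing from the vocabulary, the sketch below mimics `CountVectorizer`'s default tokenization with the standard `re` module. It assumes the documented defaults (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`); the constant names and the alternative pattern are illustrative choices, not part of scikit-learn.

```python
import re

# Default token pattern documented for CountVectorizer: two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens
KEEP_SHORT_PATTERN = r"(?u)\b\w+\b"

sentence = "I love NLP".lower()  # CountVectorizer lowercases by default

default_tokens = re.findall(DEFAULT_PATTERN, sentence)
short_tokens = re.findall(KEEP_SHORT_PATTERN, sentence)

print(default_tokens)  # 'i' is dropped
print(short_tokens)    # 'i' is kept
```

If single-character words matter for a task, passing a pattern like the second one as the `token_pattern` argument of `CountVectorizer` is one way to keep them.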