### **7. Encodings**

The next step in text preprocessing is to vectorize the filtered tokens, that is, to map the words in a corpus to numeric values that a model can consume. Several techniques exist for this.

#### A) One Hot Encoding

One-hot encoding is a way to represent words (or categorical data) as **binary vectors**. Each word is assigned a **unique index**, and its vector has **1 at its position** while all other positions are **0**.

For example, if we have the words:  
**["cat", "dog", "fish"]**, we can represent them as:

| Word     | cat | dog | fish |
| -------- | --- | --- | ---- |
| **cat**  | 1   | 0   | 0    |
| **dog**  | 0   | 1   | 0    |
| **fish** | 0   | 0   | 1    |

With scikit-learn, one-hot encoding takes only a few lines.

In [1]:
corpus = [
    "I love NLP",
    "NLP is awesome",
    "I love Machine Learning"
]

In [5]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

corpus = ["I love NLP", "NLP is awesome", "I love Machine Learning"]

# Convert text into a list of unique words
unique_words = list(set(" ".join(corpus).split()))

# Convert words into a one-hot encoded format
# sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2; older versions use sparse=False)
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(np.array(unique_words).reshape(-1, 1))

# Display results
for word, encoding in zip(unique_words, one_hot_encoded):
    print(f"{word}: {encoding}")


Machine: [0. 0. 1. 0. 0. 0. 0.]
love: [0. 0. 0. 0. 0. 0. 1.]
I: [1. 0. 0. 0. 0. 0. 0.]
is: [0. 0. 0. 0. 0. 1. 0.]
Learning: [0. 1. 0. 0. 0. 0. 0.]
NLP: [0. 0. 0. 1. 0. 0. 0.]
awesome: [0. 0. 0. 0. 1. 0. 0.]
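
The same mapping can also be built by hand, which makes the "unique index per word" idea explicit. The block below is a minimal sketch in plain Python, not scikit-learn's implementation; the names `vocabulary`, `word_to_index`, and `one_hot` are illustrative. The vocabulary is sorted so that every run assigns the same index to the same word.

```python
# Hand-rolled one-hot encoding (illustrative sketch, not scikit-learn's implementation)
corpus = ["I love NLP", "NLP is awesome", "I love Machine Learning"]

# Sort the vocabulary so that each word gets a stable, reproducible index
vocabulary = sorted(set(" ".join(corpus).split()))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list with 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

for word in vocabulary:
    print(f"{word}: {one_hot(word)}")
```

Each vector is as long as the vocabulary, so this representation grows quickly with corpus size.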


#### B) Bag of Words (BoW)

Bag of Words (BoW) is a simple way to **convert text into numerical features** for machine learning models. It represents **how frequently words appear** in a document **without considering their order** or context.

**📌 How BoW Works**

1️⃣ **Create a Vocabulary** (a set of unique words in the dataset).  
2️⃣ **Count Word Occurrences** in each document.  
3️⃣ **Convert into a Vector Representation.**

**Example**:  
Consider these two sentences:  
📌 **Sentence 1:** "I love NLP and Machine Learning."  
📌 **Sentence 2:** "I love Deep Learning and NLP."

🔹 **Step 1: Create a Vocabulary**  
Unique words in both sentences →  
📌 **["I", "love", "NLP", "and", "Machine", "Learning", "Deep"]**

🔹 **Step 2: Create Word Frequency Vectors**

|                | I   | love | NLP | and | Machine | Learning | Deep |
| -------------- | --- | ---- | --- | --- | ------- | -------- | ---- |
| **Sentence 1** | 1   | 1    | 1   | 1   | 1       | 1        | 0    |
| **Sentence 2** | 1   | 1    | 1   | 1   | 0       | 1        | 1    |

Each row is a **numerical representation of a sentence** based on word counts.
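
The three steps above can be sketched in plain Python before reaching for a library. This is a minimal illustration: the `tokenize` helper is a naive assumption (it only strips periods and splits on whitespace), and the vocabulary keeps the order of first appearance so that it matches the table.

```python
from collections import Counter

sentences = [
    "I love NLP and Machine Learning.",
    "I love Deep Learning and NLP.",
]

def tokenize(text):
    # Naive tokenizer (assumption): drop periods, then split on whitespace
    return text.replace(".", "").split()

# Step 1: build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for token in tokenize(sentence):
        if token not in vocabulary:
            vocabulary.append(token)

# Steps 2 and 3: count tokens per sentence and lay the counts out as vectors
bow_vectors = []
for sentence in sentences:
    counts = Counter(tokenize(sentence))
    bow_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```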

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow_encoded = vectorizer.fit_transform(corpus)

# Convert to array for readability
print(vectorizer.get_feature_names_out())
print(bow_encoded.toarray())

['awesome' 'is' 'learning' 'love' 'machine' 'nlp']
[[0 0 0 1 0 1]
 [1 1 0 0 0 1]
 [0 0 1 1 1 0]]


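Here, each row corresponds to one sentence and each column to one vocabulary word:

- `[0 0 0 1 0 1]`: Sentence 1 contains 'love' and 'nlp'
- `[1 1 0 0 0 1]`: Sentence 2 contains 'awesome', 'is', and 'nlp'
- `[0 0 1 1 1 0]`: Sentence 3 contains 'learning', 'love', and 'machine'

Note that `CountVectorizer` lowercases the text and, by default, ignores single-character tokens, which is why "I" is missing from the vocabulary.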
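
The vocabulary printed above has no entry for "I", even though the word occurs in two sentences. A minimal sketch of the reason: by default `CountVectorizer` lowercases the text and tokenizes it with the regular expression `(?u)\b\w\w+\b`, which only matches words of two or more characters. The block applies that pattern directly with Python's `re` module; the constant names are illustrative.

```python
import re

# Default token pattern of CountVectorizer: words of two or more characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps one-character words
KEEP_SHORT_PATTERN = r"(?u)\b\w+\b"

sentence = "I love NLP".lower()  # CountVectorizer lowercases by default

default_tokens = re.findall(DEFAULT_PATTERN, sentence)
all_tokens = re.findall(KEEP_SHORT_PATTERN, sentence)

print(default_tokens)  # tokens the default settings would count
print(all_tokens)      # tokens when one-character words are kept
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should keep such short tokens in the vocabulary, if they matter for the task.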

#### C) TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is widely used in **text mining, search engines, and NLP tasks** to find the most relevant words in a document.

**Formula Breakdown**

TF-IDF is the product of two components:

**1. Term Frequency (TF)**

Measures how often a term appears in a document.

$$
TF = \frac{\text{Number of times a term appears in a document}}{\text{Total number of terms in the document}}
$$

- **Example**: If the word "machine" appears **3 times** in a document with **100 words**, then:
  $$
  TF = \frac{3}{100} = 0.03
  $$

**2. Inverse Document Frequency (IDF)**

Measures how rare a term is across the corpus, which reduces the weight of commonly used words (e.g., "the", "is"). Here, log is the natural logarithm.

$$
IDF = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing the term}} + 1 \right)
$$

- **Example**: If we have **10,000 documents** and the word "machine" appears in **1,000** of them:
  $$
  IDF = \log \left( \frac{10{,}000}{1{,}000} + 1 \right) = \log(11) \approx 2.4
  $$

**3. TF-IDF Calculation**

$$
\text{TF-IDF} = TF \times IDF
$$

- If **TF = 0.03** and **IDF = 2.4**, then:
  $$
  \text{TF-IDF} = 0.03 \times 2.4 = 0.072
  $$
- Higher values indicate that the word is **important** in the document but **rare** across the corpus.
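
The worked example can be checked with a few lines of Python. This is a minimal sketch of the formulas exactly as written in this section (natural logarithm, `+ 1` inside the log); the function names are illustrative and not part of any library.

```python
import math

def term_frequency(term_count, total_terms):
    # Share of the document's terms taken up by this term
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    # Variant used in this section: natural log of (N / df + 1)
    return math.log(total_docs / docs_with_term + 1)

tf = term_frequency(3, 100)                      # "machine": 3 of 100 words
idf = inverse_document_frequency(10_000, 1_000)  # found in 1,000 of 10,000 documents
tf_idf = tf * idf

print(f"TF = {tf:.2f}")
print(f"IDF = {idf:.4f}")
print(f"TF-IDF = {tf_idf:.4f}")
```

Without rounding the IDF to 2.4 first, the product comes out marginally below 0.072.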

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
corpus = [
    "I love NLP and machine learning.",
    "NLP is amazing and I love learning about it.",
    "Machine learning is a key part of AI."
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
tfidf_matrix = vectorizer.fit_transform(corpus)

# Convert to array and display results
print("Vocabulary:", vectorizer.get_feature_names_out())  # Unique words
print("\nTF-IDF Matrix:\n", tfidf_matrix.toarray())  # TF-IDF scores


Vocabulary: ['about' 'ai' 'amazing' 'and' 'is' 'it' 'key' 'learning' 'love' 'machine'
 'nlp' 'of' 'part']

TF-IDF Matrix:
 [[0.         0.         0.         0.46609584 0.         0.
  0.         0.361965   0.46609584 0.46609584 0.46609584 0.
  0.        ]
 [0.42024133 0.         0.42024133 0.31960436 0.31960436 0.42024133
  0.         0.2482013  0.31960436 0.         0.31960436 0.
  0.        ]
 [0.         0.4261835  0.         0.         0.32412354 0.
  0.4261835  0.25171084 0.         0.32412354 0.         0.4261835
  0.4261835 ]]


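The matrix has three rows, one for each sentence in the corpus, and thirteen columns, one for each vocabulary term. The scores do not match the hand formulas above exactly, because `TfidfVectorizer` by default applies a smoothed IDF and then scales each row to unit length.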
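
To connect the printed matrix to a formula, the first row can be recomputed by hand. The sketch below assumes the default `TfidfVectorizer` settings: raw counts as TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each row. The document frequencies are counted from the three sentences in the corpus above.

```python
import math

n_docs = 3
# Document frequencies of the terms in sentence 1, counted from the corpus above
doc_freq = {"and": 2, "learning": 3, "love": 2, "machine": 2, "nlp": 2}

# Smoothed IDF (TfidfVectorizer default): ln((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Every term occurs once in sentence 1, so the raw TF-IDF weight equals the IDF
raw_weights = dict(idf)

# L2-normalize the row so that its Euclidean length is 1
norm = math.sqrt(sum(w * w for w in raw_weights.values()))
row = {term: w / norm for term, w in raw_weights.items()}

for term, weight in row.items():
    print(f"{term}: {weight:.8f}")
```

The terms found in two documents come out at about 0.466 and "learning", which is in all three, at about 0.362, in line with the first row of the matrix.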