<a href="https://colab.research.google.com/github/shartazkhan/nlp_fundamentals/blob/main/NLP_Text_Representation%20(Machine_Learning).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Representation / Vectorization / Feature Extraction in NLP

In Natural Language Processing (NLP), **text representation** refers to the techniques used to convert human language (text) into a format that computers can understand and process. Since computers work with numbers, text representation methods transform words, sentences, or documents into numerical vectors or matrices.

Key goals of text representation include:

*   **Capturing meaning:** Representing the semantic relationships between words.
*   **Managing dimensionality:** Keeping the representation compact enough to store and compute with efficiently.
*   **Enabling analysis:** Making text data suitable for machine learning algorithms.

Common techniques include:

*   **Bag-of-Words (BoW):** Represents text as a collection of word counts, ignoring word order.
*   **TF-IDF (Term Frequency-Inverse Document Frequency):** Weighs words based on their frequency in a document and rarity across a corpus.
*   **Word Embeddings (e.g., Word2Vec, GloVe):** Dense vector representations where words with similar meanings are closer in vector space.
*   **Sentence and Document Embeddings (e.g., from transformer models such as BERT):** More advanced methods that capture contextual information and represent longer pieces of text.

Choosing the right representation method depends on the specific NLP task and the characteristics of the text data.



---

**Things we will try today**

* One-Hot Encoding (OHE)
* Bag-of-Words (BoW)
* N-grams
* Term Frequency-Inverse Document Frequency (TF-IDF)
* Custom Features



---



# One-Hot Encoding (OHE)



> Before you start, learn the **key terms in NLP**:

*   **Corpus (C):** A large and structured collection of texts. It can be a collection of documents, books, articles, or any body of written or spoken language used for linguistic analysis and model training.
*   **Vocabulary (V):** The set of all unique words that appear in a corpus. It's the complete list of distinct tokens that the NLP model or technique will work with.
*   **Document (D):** A single text unit within a corpus. This could be a sentence, a paragraph, an article, a book, or any other defined piece of text.
*   **Word (W):** The basic unit of text. In NLP, words are often referred to as tokens, especially after the text has been processed (e.g., lowercased, punctuation removed).
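
To make these terms concrete, here is a minimal sketch that builds a vocabulary from a tiny corpus. The `tokenize` helper and the two example sentences are assumptions made up for illustration; real pipelines usually rely on a more careful tokenizer.

```python
import re

# A toy corpus (C): a list of documents (D). The sentences are invented for this example.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

def tokenize(document):
    """Lowercase a document and split it into word tokens (W), dropping punctuation."""
    return re.findall(r"[a-z0-9]+", document.lower())

# Vocabulary (V): the sorted set of unique tokens across the whole corpus.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

print(vocabulary)       # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(len(vocabulary))  # 7
```

Note that "the" appears four times in the corpus but only once in the vocabulary.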



* * *

## Advantages and Disadvantages of One-Hot Encoding (OHE) in NLP

**Advantages:**

* **Simple to Understand and Implement:** OHE is conceptually straightforward and easy to implement.
* **Preserves Uniqueness:** Each unique word is assigned a unique vector, ensuring that distinct words are represented differently.

**Disadvantages:**

* **High Dimensionality:** For a large vocabulary, the resulting vectors are very sparse and high-dimensional, leading to increased memory usage and computational complexity.
* **No Semantic Relationship Captured:** OHE treats each word as independent and does not capture any semantic similarity or relationship between words. All one-hot vectors are mutually orthogonal, so "king" is exactly as far from "queen" as it is from any other word.
* **Out-of-Vocabulary (OOV) Words:** OHE cannot handle words that were not present in the vocabulary during training.
* **No Fixed Size:** A text is encoded as one vector per word, so texts with different numbers of words produce matrices of different shapes. This makes OHE awkward to use in models that require fixed-size input.

## How One-Hot Encoding Works

Imagine you have a list of unique words from some text. One-Hot Encoding is like giving each unique word its own special box.

Here's how it works:

1.  **Create a list of all unique words (Vocabulary):** Go through all your text and make a list of every word that appears, but only include each word once. This is your vocabulary.

2.  **Create a box for each word:** For every word in your vocabulary, create a box (or a space in a line of numbers).

3.  **Put a "1" in the word's box and "0" in others:** To represent a specific word, you go to its special box and put a "1" in it. All the other boxes for the other words get a "0".

Think of it like a checklist where you check off the word you are looking at.

**Example:**

Let's say our unique words (vocabulary) are: "cat", "dog", "fish".

*   To represent "cat", we'd have: \[1, 0, 0] (1 for cat, 0 for dog, 0 for fish)
*   To represent "dog", we'd have: \[0, 1, 0] (0 for cat, 1 for dog, 0 for fish)
*   To represent "fish", we'd have: \[0, 0, 1] (0 for cat, 0 for dog, 1 for fish)

Each word gets a unique "hot" spot (the 1) in a long list of zeros.
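
The three steps above can be sketched in a few lines of plain Python. The helper names `one_hot` and `encode_document` are hypothetical, chosen only for this illustration.

```python
vocabulary = ["cat", "dog", "fish"]

# Step 2: give every word its own "box" (a position in the vector).
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Step 3: put a 1 in the word's box and 0 everywhere else.

    Raises KeyError for a word that is not in the vocabulary.
    """
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

def encode_document(document):
    """Encode a whitespace-separated document as one one-hot row per word."""
    return [one_hot(word) for word in document.split()]

print(one_hot("cat"))               # [1, 0, 0]
print(one_hot("dog"))               # [0, 1, 0]
print(encode_document("fish cat"))  # [[0, 0, 1], [1, 0, 0]]
```

A document with two words yields two rows, a document with ten words yields ten rows, so the encoded size follows the length of the text.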

# Bag-of-Words (BoW)


Bag-of-Words (BoW) is another way to turn text into numbers that computers can understand. Imagine you have a bag, and you put all the words from a piece of text into that bag. The **Bag-of-Words** model just counts how many times each word appears in the text, without caring about the order of the words.

Here's the simple idea:

1.  **Create a list of all unique words (Vocabulary):** Just like with One-Hot Encoding, you start by finding all the unique words in your entire collection of texts.
2.  **Count words in each document:** For each piece of text (document), you go through it and count how many times each word from your vocabulary appears.
3.  **Create a vector:** You then create a list of numbers (a vector) for each document. Each number in the vector corresponds to a word in your vocabulary, and its value is the count of that word in the document.

**Why use Bag-of-Words?**

*   **Simplicity:** It's a very easy concept to understand and implement.
*   **Good for basic tasks:** It can be effective for tasks where word order doesn't matter as much, like text classification (e.g., spam detection) or topic modeling.
*   **Provides a numerical representation:** It converts text into a format that can be used by many machine learning algorithms.

**When to use Bag-of-Words?**

*   When you need a simple and quick way to represent text numerically.
*   When the task doesn't heavily rely on understanding the exact sequence of words (like sentiment analysis where the presence of certain words is more important than their order).
*   As a baseline model before trying more complex techniques.

**Example:**

Let's say our vocabulary is: \["cat", "dog", "the", "quick", "brown", "fox"]

Document 1: "the quick brown fox"
BoW representation: \[0, 0, 1, 1, 1, 1] (Counts of "cat", "dog", "the", "quick", "brown", "fox")

Document 2: "the quick brown cat"
BoW representation: \[1, 0, 1, 1, 1, 0] (Counts of "cat", "dog", "the", "quick", "brown", "fox")

Notice that the order of words in the original document doesn't affect the final vector. The vector only contains the counts of each word from the vocabulary.
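
Before reaching for a library, the example above can be reproduced by hand. This is a minimal sketch that assumes simple whitespace tokenization; the `bag_of_words` function is a hypothetical helper, not a library API.

```python
from collections import Counter

vocabulary = ["cat", "dog", "the", "quick", "brown", "fox"]

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that never appeared in the document.
    return [counts[word] for word in vocabulary]

print(bag_of_words("the quick brown fox", vocabulary))  # [0, 0, 1, 1, 1, 1]
print(bag_of_words("the quick brown cat", vocabulary))  # [1, 0, 1, 1, 1, 0]
print(bag_of_words("fox brown quick the", vocabulary))  # [0, 0, 1, 1, 1, 1]
```

The first and third calls return the same vector even though the words are shuffled, which is exactly the "order is ignored" property.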

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
df = pd.DataFrame({
    'text': [
        'this is the first document.',
        'this is the second document.',
        'this second is the   document.',
        'document is this the first ',
    ], 'label': [1,0,1,0]
})

In [None]:
count_vectorizer = CountVectorizer()

In [None]:
bow = count_vectorizer.fit_transform(df['text'])

In [None]:
print(count_vectorizer.vocabulary_)

{'this': 5, 'is': 2, 'the': 4, 'first': 1, 'document': 0, 'second': 3}


In [None]:
print(bow[0].toarray())
print(bow[1].toarray())

[[1 1 1 0 1 1]]
[[1 0 1 1 1 1]]


In [None]:
count_vectorizer.transform(['this second document is the first document']).toarray()

array([[2, 1, 1, 1, 1, 1]])

**What if we use a new word?**


In [None]:
count_vectorizer.transform(['here is the second and the first document']).toarray()

array([[1, 1, 1, 1, 2, 0]])

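The new words "here" and "and" do not show up anywhere in the result: the vector still has six columns, one per vocabulary word.

Out-of-vocabulary words are silently ignored by `transform`.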
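
Because unknown words are dropped without any warning, it can be useful to check which tokens were lost. The sketch below is one possible way to do that, using the vectorizer's own `build_analyzer()` so that the tokenization matches what `transform` does internally. The two-document corpus is a shortened stand-in for the DataFrame used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'this is the first document.',
    'this is the second document.',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# build_analyzer() returns the preprocessing + tokenizing function
# that the vectorizer applies to every document.
analyze = vectorizer.build_analyzer()

new_text = 'here is the second and the first document'
tokens = analyze(new_text)

# Tokens missing from the fitted vocabulary are the ones transform() drops.
oov_tokens = [token for token in tokens if token not in vectorizer.vocabulary_]

print(oov_tokens)  # ['here', 'and']
```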

Learn more in the [`CountVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

## Advantages and Disadvantages of Bag-of-Words (BoW)

**Advantages:**

*   **Simplicity and Ease of Implementation:** BoW is a straightforward model that is easy to understand and implement.
*   **Effective for certain tasks:** It can be quite effective for tasks like text classification and topic modeling where the presence and frequency of words are more important than their order.
*   **Provides a numerical representation:** It successfully converts text data into a numerical format that can be used by various machine learning algorithms.

**Disadvantages:**

*   **Loss of Word Order/Context:** The most significant disadvantage is that BoW completely ignores the order and context of words, which can be crucial for understanding the meaning of a sentence (e.g., "the dog bit the man" and "the man bit the dog" would have the same BoW representation).
*   **High Dimensionality and Sparsity:** Similar to OHE, for a large vocabulary, the resulting vectors can be very high-dimensional and sparse (mostly zeros), leading to increased memory usage and computational costs.
*   **Out-of-Vocabulary (OOV) Words:** Words that were not present in the training vocabulary are dropped at transform time, so any information they carry is lost.
*   **Doesn't Capture Semantic Meaning:** It doesn't capture the semantic relationships between words (e.g., "king" and "queen" are treated as completely unrelated).

# N-grams


N-grams are a way to represent text that considers sequences of words, not just individual words like in Bag-of-Words.

Think of it like this: instead of just looking at single words in a sentence, you look at groups of 'N' words that appear right next to each other.

Here's the simple idea:

1.  **Choose a value for 'N':** This is how many words you want to group together.
    *   If N=1, you get unigrams (single words) - this is like Bag-of-Words.
    *   If N=2, you get bigrams (pairs of words).
    *   If N=3, you get trigrams (groups of three words).
    *   And so on...

2.  **Slide a window of size 'N' across your text:** Start at the beginning of your text and take the first 'N' words. Then, move one word over and take the next 'N' words, and keep doing this until you reach the end of the text.

3.  **Collect all the N-grams:** The groups of words you collected are your N-grams.

**Why use N-grams?**

*   **Captures some word order:** Unlike Bag-of-Words, N-grams keep some information about the sequence of words, which can be important for understanding context.
*   **Useful for tasks like:**
    *   **Text generation:** Predicting the next word based on the previous N-1 words.
    *   **Spelling correction:** Identifying common sequences of characters or words.
    *   **Language identification:** Different languages have different common N-grams.

**Example:**

Let's take the sentence: "The quick brown fox"

*   **Unigrams (N=1):** "The", "quick", "brown", "fox"
*   **Bigrams (N=2):** "The quick", "quick brown", "brown fox"
*   **Trigrams (N=3):** "The quick brown", "quick brown fox"
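
The sliding-window idea can be written as a short function. This is a minimal sketch that assumes whitespace tokenization; `ngrams` is a hypothetical helper name.

```python
def ngrams(text, n):
    """Return all n-grams of a whitespace-separated text as strings."""
    words = text.split()
    # Slide a window of n words across the text, one word at a time.
    # If the text has fewer than n words, the range is empty and so is the result.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox"
print(ngrams(sentence, 1))  # ['The', 'quick', 'brown', 'fox']
print(ngrams(sentence, 2))  # ['The quick', 'quick brown', 'brown fox']
print(ngrams(sentence, 3))  # ['The quick brown', 'quick brown fox']
```

A text with `k` words produces `k - n + 1` n-grams, which is why four words give three bigrams and two trigrams.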

We can use scikit-learn's `CountVectorizer` again, but this time we'll specify the `ngram_range` parameter.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

df = pd.DataFrame({
    'text': [
        'this is the first document.',
        'this is the second document.',
        'this second is the   document.',
        'document is this the first ',
    ], 'label': [1,0,1,0]
})

# Using CountVectorizer to get N-grams
# Let's try bigrams (N=2)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2))

# Fit and transform the text data
X_ngram = ngram_vectorizer.fit_transform(df['text'])

# Print the vocabulary of bigrams
print("Bigram Vocabulary:", ngram_vectorizer.vocabulary_)

Bigram Vocabulary: {'this is': 9, 'is the': 2, 'the first': 7, 'first document': 1, 'the second': 8, 'second document': 4, 'this second': 10, 'second is': 5, 'the document': 6, 'document is': 0, 'is this': 3, 'this the': 11}


In [None]:
# Print the bigram representation of the first document
print("\nBigram representation of the first document:")
print(X_ngram[0].toarray())


Bigram representation of the first document:
[[0 1 1 0 0 0 0 1 0 1 0 0]]


In [None]:
# Let's try trigrams (N=3)
ngram_vectorizer_3 = CountVectorizer(ngram_range=(3, 3))
X_ngram_3 = ngram_vectorizer_3.fit_transform(df['text'])
print("\nTrigram Vocabulary:", ngram_vectorizer_3.vocabulary_)


Trigram Vocabulary: {'this is the': 8, 'is the first': 2, 'the first document': 6, 'is the second': 3, 'the second document': 7, 'this second is': 9, 'second is the': 5, 'is the document': 1, 'document is this': 0, 'is this the': 4, 'this the first': 10}




> So Bag-of-Words is simply the unigram (N=1) case of N-gram counting.
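
A quick way to confirm this: `CountVectorizer` uses `ngram_range=(1, 1)` by default, so the default vocabulary and the explicit unigram vocabulary should match. The sketch also shows a range such as `(1, 2)`, which keeps the unigrams and adds the bigrams on top.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'this is the first document.',
    'this is the second document.',
    'this second is the   document.',
    'document is this the first ',
]

default_vocab = CountVectorizer().fit(texts).vocabulary_
unigram_vocab = CountVectorizer(ngram_range=(1, 1)).fit(texts).vocabulary_
print(default_vocab == unigram_vocab)  # True

# ngram_range=(1, 2) extracts both unigrams and bigrams.
mixed_vocab = CountVectorizer(ngram_range=(1, 2)).fit(texts).vocabulary_
print(len(unigram_vocab), len(mixed_vocab))  # 6 18
```

The jump from 6 to 18 features on four tiny documents hints at how quickly the feature space grows on a real corpus.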



## Advantages and Disadvantages of N-grams

**Advantages:**

*   **Captures some word order and context:** Unlike Bag-of-Words, N-grams preserve some sequential information, which can be important for tasks where word order matters.
*   **Useful for various tasks:** N-grams are effective in applications like text generation, spelling correction, and language identification.
*   **Can capture short-range dependencies:** They can identify patterns and relationships between words that appear close together.

**Disadvantages:**

*   **High Dimensionality:** As the value of 'N' increases, the number of possible N-grams grows exponentially, leading to very high-dimensional and sparse feature vectors. This can increase computational cost and memory usage.
*   **Data Sparsity:** Many possible N-grams will not appear in the corpus, resulting in a sparse representation.
*   **Limited long-range dependency capture:** N-grams with a fixed 'N' cannot capture dependencies between words that are far apart in the text.
*   **Out-of-Vocabulary (OOV) N-grams:** Similar to BoW, N-gram models cannot handle N-grams that were not present in the training data.

# Term Frequency-Inverse Document Frequency (TF-IDF)



Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used in NLP to evaluate how important a word is to a document within a collection of documents (corpus). It's not just about how often a word appears in a single document (like in Bag-of-Words), but also how unique the word is across all documents.

Think of it this way:

*   **Term Frequency (TF):** How often a word appears in a specific document. If a word appears many times in a document, its TF is high.
*   **Inverse Document Frequency (IDF):** How rare a word is across the entire collection of documents. If a word appears in many documents, its IDF is low. If a word appears in only a few documents, its IDF is high.

**TF-IDF combines these two:**

TF-IDF score for a word in a document = TF (of the word in that document) * IDF (of the word across all documents)

A high TF-IDF score means the word is frequent in the document but rare in the corpus, making it likely a key term for that specific document. Words that are very common across all documents (like "the", "is", "a") will have a low IDF and thus a low TF-IDF score, even if they appear frequently in a single document.
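
Here is a minimal from-scratch sketch of that product. It assumes one common textbook variant, with TF as the relative frequency of the term in the document and IDF as `log(N / df)`; libraries such as scikit-learn use slightly different formulas, so the numbers will not match theirs. The three toy documents are invented for illustration.

```python
import math

# Toy corpus, already tokenized into lists of words.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing term).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))            # 0.0
print(round(tf_idf("mat", docs[0], docs), 3))  # 0.183
```

"the" is the most frequent word in the first document, yet it scores zero because it appears in every document; "mat" appears once but only in this document, so it scores higher.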

**Why use TF-IDF?**

*   **Highlights important words:** It helps identify words that are particularly relevant to a specific document compared to the rest of the corpus.
*   **Reduces the impact of common words:** It gives less weight to words that appear very often across all documents, which are usually less informative.
*   **Provides a numerical representation:** Like BoW, it converts text into a numerical format suitable for machine learning.

**When to use TF-IDF?**

*   **Information Retrieval:** To rank documents based on how relevant they are to a query (the query terms with high TF-IDF in a document are good indicators of relevance).
*   **Text Summarization:** To identify the most important words in a document.
*   **Text Classification:** As features to train classifiers, where important words can help distinguish between categories.
*   **Topic Modeling:** To understand the key terms associated with different topics.

In essence, TF-IDF helps us find the words that are uniquely characteristic of a document, making it a valuable technique for many text analysis tasks.
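
As an illustration of the information-retrieval use case, the sketch below ranks documents against a query by cosine similarity of their TF-IDF vectors. The query string `'second document'` is an assumption chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    'this is the first document.',
    'this is the second document.',
    'this second is the   document.',
    'document is this the first ',
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Represent the query in the same TF-IDF space as the documents.
query_vector = vectorizer.transform(['second document'])

# One similarity score per document; higher means more relevant to the query.
scores = cosine_similarity(query_vector, doc_vectors)[0]

for score, text in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {text!r}")
```

The two documents that contain "second" should come out on top, because that word is rarer in the corpus than "document" and therefore carries more weight.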

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

df = pd.DataFrame({
    'text': [
        'this is the first document.',
        'this is the second document.',
        'this second is the   document.',
        'document is this the first ',
    ], 'label': [1,0,1,0]
})

# Using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_vectorizer.fit_transform(df['text']).toarray()


array([[0.39896105, 0.60276058, 0.39896105, 0.        , 0.39896105,
        0.39896105],
       [0.39896105, 0.        , 0.39896105, 0.60276058, 0.39896105,
        0.39896105],
       [0.39896105, 0.        , 0.39896105, 0.60276058, 0.39896105,
        0.39896105],
       [0.39896105, 0.60276058, 0.39896105, 0.        , 0.39896105,
        0.39896105]])

In [None]:
display(tfidf_vectorizer.idf_)
display(tfidf_vectorizer.get_feature_names_out())

array([1.        , 1.51082562, 1.        , 1.51082562, 1.        ,
       1.        ])

array(['document', 'first', 'is', 'second', 'the', 'this'], dtype=object)
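
The numbers above can be reproduced by hand, which helps show what `TfidfVectorizer` does with its default settings: raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The sketch below assumes those defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = [
    'this is the first document.',
    'this is the second document.',
    'this second is the   document.',
    'document is this the first ',
]

# Term counts per document, and the number of documents containing each term.
counts = CountVectorizer().fit_transform(texts).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF: a term that occurs in every document gets exactly 1.0, not 0.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF * IDF, then scale every row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(texts).toarray()
print(np.round(idf, 8))
print(np.allclose(manual, reference))  # True
```

This also explains why words such as "the" still get a non-zero score here: with the `+ 1` in the IDF formula, terms that appear in every document are down-weighted relative to rarer terms rather than removed entirely.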

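Learn more in the [`TfidfTransformer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html), which describes the weighting that `TfidfVectorizer` applies on top of the word counts.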

# Custom Features