# Unit 3 Introduction to TF-IDF Vectorization in NLP

### **Introduction to Text Feature Extraction**

Welcome back to the "Foundations of NLP Data Processing" course. In the previous lesson, we explored the **Bag of Words (BoW)** model, a foundational technique for text feature extraction that represents a document by counting how often each word occurs in it. BoW is simple and effective, but it treats every word as equally informative, no matter how common that word is across the corpus. In this lesson, we turn to **TF-IDF**, which stands for **Term Frequency-Inverse Document Frequency**. TF-IDF weighs a word's frequency within a document against the number of documents in the corpus that contain it, so words that are distinctive for a document score higher than words that appear everywhere.

### **Understanding TF-IDF Vectorization**

**TF-IDF** is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (or corpus). In this context:

  * A **document** refers to a single piece of text data, such as a sentence, paragraph, or article. It is the unit of text for which we calculate the term frequency (TF).
  * A **corpus** is a collection of documents. It represents the entire dataset of text data that we are analyzing. The inverse document frequency (IDF) is calculated based on the entire corpus to determine the significance of a term across all documents.

TF-IDF combines two metrics: **Term Frequency (TF)** and **Inverse Document Frequency (IDF)**. By multiplying these two metrics, TF-IDF assigns higher scores to terms that are frequent in a document but rare in the corpus, highlighting their significance.

### **TF-IDF Calculations**

  * **Term Frequency (TF):** This measures how frequently a term appears in a document. It is calculated as:
    $$TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

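  * **Inverse Document Frequency (IDF):** This measures how rare a term is across the entire corpus: the fewer documents contain the term, the higher its IDF. The base of the logarithm is a convention; the worked example in this lesson uses base 10. It is calculated as:
    $$IDF(t,D) = \log\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t}\right)$$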

  * **TF-IDF:** The TF-IDF score for a term is the product of its TF and IDF scores:
    $$TF\text{-}IDF(t,d,D) = TF(t,d) \times IDF(t,D)$$

These formulas help quantify the importance of a term within a document and across a corpus, providing a more nuanced representation of text data compared to simple frequency counts.

### **Example: Calculating TF-IDF**

Consider two documents:

  * Document 1: "The cat sat on the mat."
  * Document 2: "The dog chased the cat."

For simplicity, we'll demonstrate the calculations for a few terms, but in practice, you would calculate TF-IDF for all words.

  * **Step 1: Term Frequency (TF)**

      * Document 1: `TF(cat) = 1/6`, `TF(sat) = 1/6`
      * Document 2: `TF(cat) = 1/5`, `TF(dog) = 1/5`

  * **Step 2: Inverse Document Frequency (IDF)**

      * `IDF(cat) = log(2/2) = 0`
      * `IDF(sat) = log(2/1) = 0.301`
      * `IDF(dog) = log(2/1) = 0.301`

  * **Step 3: TF-IDF**

      * Document 1: `TF-IDF(cat) = 0`, `TF-IDF(sat) = 0.050`
      * Document 2: `TF-IDF(cat) = 0`, `TF-IDF(dog) = 0.060`

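This example shows how TF-IDF separates distinctive terms from shared ones (logarithms here are base 10). `cat` appears in both documents, so its IDF is 0 and its TF-IDF score is 0 everywhere. `sat` and `dog` each appear in only one document, so they receive positive scores in that document.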
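As a rough check on the arithmetic above, the calculation can be reproduced in a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the helper names `tf`, `idf`, and `tf_idf` are illustrative and not part of any library, the documents are tokenized by hand, and the logarithm is assumed to be base 10 to match `log(2/1) = 0.301`.

```python
import math

# Hand-tokenized versions of the two example documents
# (lowercased, punctuation removed)
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

def tf(term, doc):
    """Share of the document's tokens that equal `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Base-10 log of (number of documents / documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

# Recompute the four scores from the worked example
for term, doc_index in [("cat", 0), ("sat", 0), ("cat", 1), ("dog", 1)]:
    score = tf_idf(term, docs[doc_index], docs)
    print(f"Document {doc_index + 1}: TF-IDF({term}) = {score:.3f}")
```

As written, `idf` divides by zero for a term that occurs in no document, so it should only be called with terms drawn from the corpus.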

### **Exploring N-grams**

An **n-gram** is a contiguous sequence of n items from a given text. Unigrams are single words, while bigrams are pairs of consecutive words, and trigrams are sequences of three consecutive words. Using n-grams can help capture more context in text data by considering combinations of words rather than individual words alone.

The `ngram_range` parameter of `TfidfVectorizer` takes a `(min_n, max_n)` tuple giving the smallest and largest n-gram sizes to extract. With `ngram_range=(1,2)`, the vectorizer produces features for both unigrams (single words) and bigrams (pairs of consecutive words), so the TF-IDF matrix scores individual words as well as two-word combinations. The default, `(1, 1)`, extracts unigrams only.
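To make the idea of an n-gram range concrete, here is a small plain-Python sketch. The helpers `ngrams` and `ngram_range_features` are hypothetical names used only for illustration; scikit-learn's own analyzer also handles tokenization, lowercasing, and stop-word filtering, which this sketch leaves out.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range_features(tokens, min_n, max_n):
    """Collect the n-grams for every n from min_n to max_n inclusive."""
    features = []
    for n in range(min_n, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

tokens = ["the", "dog", "chased", "the", "cat"]

# Equivalent in spirit to ngram_range=(1, 2): unigrams first, then bigrams
unigrams_and_bigrams = ngram_range_features(tokens, 1, 2)
print(unigrams_and_bigrams)
```

A five-token sentence yields five unigrams and four bigrams, which shows how quickly the feature space grows as the range widens.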

### **Example 1: Implementing TF-IDF with Unigrams**

Let's start with a simple example using only unigrams to understand how TF-IDF vectorization works.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example Documents
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat."
]

# TF-IDF Vectorizer with unigrams
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

# Convert to DataFrame for better visualization
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
```

**Analyzing the Output**

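The output of the code above is a DataFrame of TF-IDF scores: each column is a unigram and each row is a document. Because of `stop_words='english'`, common words such as "the" and "on" are removed and do not appear as columns. The values also differ from the hand calculation earlier in this lesson (where `cat` scored 0) because `TfidfVectorizer`, with its default settings, uses raw term counts for TF, a smoothed natural-log IDF of the form `ln((1 + n) / (1 + df(t))) + 1`, and then scales each row to unit Euclidean (L2) length.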

```
        cat    chased       dog       mat       sat
0  0.449436  0.000000  0.000000  0.631667  0.631667
1  0.449436  0.631667  0.631667  0.000000  0.000000
```
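The numbers in this table can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults (raw counts for TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization) and starts from the tokens that remain after stop-word removal. The function names are illustrative, not library API.

```python
import math

# Tokens left in each document after lowercasing and English stop-word removal
docs = [["cat", "sat", "mat"], ["dog", "chased", "cat"]]
vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, the smoothed IDF assumed in this sketch."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, then scale the row to unit L2 length."""
    weights = [doc.count(term) * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({term: round(score, 6) for term, score in zip(vocabulary, row)})
```

Under these assumptions `cat` keeps a non-zero weight even though it appears in every document, because the `+ 1` in the smoothed IDF prevents any term's weight from dropping to zero.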

### **Example 2: Implementing TF-IDF with Unigrams and Bigrams**

Now, let's extend the example to include both unigrams and bigrams.

```python
# TF-IDF Vectorizer with unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

# Convert to DataFrame for better visualization
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
```

**Analyzing the Output**

The output now includes both unigrams and bigrams, providing a richer representation of the text data.

```
        cat   cat sat    chased  chased cat       dog  dog chased       mat       sat   sat mat
0  0.335176  0.471078  0.000000    0.000000  0.000000    0.000000  0.471078  0.471078  0.471078
1  0.335176  0.000000  0.471078    0.471078  0.471078    0.471078  0.000000  0.000000  0.000000
```

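In this output, every feature that is unique to one document, including the bigrams "cat sat" and "dog chased", scores 0.471, while `cat`, which appears in both documents, drops to 0.335. Note that stop words are removed before the n-grams are formed, which is why "sat mat" appears as a bigram even though the original sentence reads "sat on the mat".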

### **Summary and Next Steps**

In this lesson, you learned about TF-IDF vectorization, a powerful technique for transforming text into numerical features. We covered the basics of TF-IDF, the role of n-grams, and saw practical examples of implementing TF-IDF using **scikit-learn** and visualizing the results with **pandas**. As you move on to the practice exercises, apply these concepts to different datasets and experiment with various n-gram settings to deepen your understanding. This knowledge will be invaluable as you continue to explore more advanced NLP techniques in future lessons.

## Uncover Key Terms with TF-IDF

You've done well exploring TF-IDF with unigrams and bigrams. Now, let's focus on a practical task using unigrams.

Your objective is to create a TF-IDF vectorizer using scikit-learn's TfidfVectorizer on a set of example texts about NLP.

  * Apply the vectorizer with only unigrams.
  * Fit the vectorizer to the documents.
  * Print the TF-IDF scores for each term in each document using a DataFrame.

This exercise will help you see how TF-IDF highlights important words. Dive in and discover which terms stand out!

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example Documents
docs = [
    "Natural Language Processing is amazing!",
    "Text processing is an essential part of NLP.",
    "TF-IDF is used for text feature extraction."
]

# TODO: Create a TF-IDF Vectorizer with unigrams

# TODO: Convert to DataFrame for better visualization
# TODO: Print DataFrame

```

Here's the completed Python code to solve the exercise.

-----

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example Documents
docs = [
    "Natural Language Processing is amazing!",
    "Text processing is an essential part of NLP.",
    "TF-IDF is used for text feature extraction."
]

# Create a TF-IDF Vectorizer with unigrams
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Convert to DataFrame for better visualization
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Print DataFrame
print(df)

```

### **Explanation**

The code above completes the `TODO` sections of the exercise.

First, the `TfidfVectorizer()` object is created. By default, it's set to work with **unigrams** (single words), so you don't need to specify an `ngram_range`. Then, the `fit_transform()` method processes the documents, both learning the vocabulary and creating the TF-IDF scores in one step.

Finally, the resulting TF-IDF matrix is converted into a **pandas** DataFrame. This makes the output easy to read, with columns for each unique term and rows for each document, showing you exactly which terms stand out based on their TF-IDF scores.

## Enhance Text Analysis with Bigrams

Nice progress on exploring TF-IDF with unigrams! Now, let's enhance your skills by including bigrams in the mix.

Your task is to:

  * Modify the TF-IDF vectorizer to include both unigrams and bigrams by setting `ngram_range` to `(1,2)`.
  * Fit the vectorizer to the same set of movie reviews.
  * Extract and compare the top 5 features from both the unigram-only and the unigram-and-bigram implementations.

This will help you understand how bigrams can enrich text representation. Dive in and see the difference!

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example Documents
docs = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, not great.",
    "I did not like the movie, it was boring."
]

# TF-IDF Vectorizer with unigrams
vectorizer_unigram = TfidfVectorizer(stop_words='english')
tfidf_matrix_unigram = vectorizer_unigram.fit_transform(docs)

# Convert to DataFrame for better visualization
df_unigram = pd.DataFrame(tfidf_matrix_unigram.toarray(), columns=vectorizer_unigram.get_feature_names_out())
print("Unigram TF-IDF:")
print(df_unigram)

# TODO: Modify the TF-IDF Vectorizer to include both unigrams and bigrams
vectorizer_bigram = TfidfVectorizer(stop_words='english')
tfidf_matrix_bigram = vectorizer_bigram.fit_transform(docs)

# Convert to DataFrame for better visualization
df_bigram = pd.DataFrame(tfidf_matrix_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())
print("\nUnigram and Bigram TF-IDF:")
print(df_bigram)

# TODO: Extract and compare the top 5 features from the unigram and bigram implementations

```

Here is the completed code, which modifies the TF-IDF vectorizer to include bigrams and compares the results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example Documents
docs = [
    "I loved the movie, it was fantastic!",
    "The movie was okay, not great.",
    "I did not like the movie, it was boring."
]

# TF-IDF Vectorizer with unigrams
vectorizer_unigram = TfidfVectorizer(stop_words='english')
tfidf_matrix_unigram = vectorizer_unigram.fit_transform(docs)

# Convert to DataFrame for better visualization
df_unigram = pd.DataFrame(tfidf_matrix_unigram.toarray(), columns=vectorizer_unigram.get_feature_names_out())
print("Unigram TF-IDF:")
print(df_unigram)

# Modify the TF-IDF Vectorizer to include both unigrams and bigrams
vectorizer_bigram = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
tfidf_matrix_bigram = vectorizer_bigram.fit_transform(docs)

# Convert to DataFrame for better visualization
df_bigram = pd.DataFrame(tfidf_matrix_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())
print("\nUnigram and Bigram TF-IDF:")
print(df_bigram)

# Extract and compare the top 5 features from the unigram and bigram implementations
unigram_features = vectorizer_unigram.get_feature_names_out()
bigram_features = vectorizer_bigram.get_feature_names_out()

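# Sum each feature's TF-IDF scores across documents. toarray() gives a dense
# 2-D array, so the column sums form a 1-D array that argsort can rank directly
# (summing the sparse matrix itself can return a 2-D result that breaks the loops below).
unigram_scores_sum = tfidf_matrix_unigram.toarray().sum(axis=0)
bigram_scores_sum = tfidf_matrix_bigram.toarray().sum(axis=0)

# Get the indices of the top 5 features, highest total score first
top_unigram_indices = unigram_scores_sum.argsort()[-5:][::-1]
top_bigram_indices = bigram_scores_sum.argsort()[-5:][::-1]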

print("\nTop 5 Unigram Features:")
for i in top_unigram_indices:
    print(f"- {unigram_features[i]}")

print("\nTop 5 Unigram and Bigram Features:")
for i in top_bigram_indices:
    print(f"- {bigram_features[i]}")

```

-----

### **Explanation**

The updated script successfully includes both unigrams and bigrams by setting the `ngram_range=(1, 2)` parameter in the `TfidfVectorizer`.

When you run the code, you'll see a clear difference between the two outputs:

  * The **unigram** output shows individual words like `movie`, `fantastic`, and `boring`.
  * The **unigram and bigram** output adds two-word phrases built from the tokens that remain after stop-word removal, such as `loved movie` and `movie boring`.

By including bigrams, the model can capture word combinations in addition to individual words, which gives a richer representation of the content. Keep in mind, however, that `stop_words='english'` removes words such as `not` and `was` before the n-grams are formed, so a phrase like "not great" never appears as a feature here. Capturing that kind of negation, which carries a much more specific sentiment than "great" alone, would require dropping or customizing the stop-word list.

## Trigram Analysis with TF-IDF

You've done a great job exploring TF-IDF with unigrams and bigrams. Now, let's dive into a practical task focusing on trigrams.

Your objective is to create a TF-IDF vectorizer using scikit-learn's TfidfVectorizer on a set of example texts about NLP.

  * Set the vectorizer to use trigrams by adjusting the `ngram_range` to `(3,3)`.
  * Fit the vectorizer to the documents.
  * Extract and print the top 5 trigrams with the highest TF-IDF scores for each document.

This exercise will help you see how trigrams capture more context and highlight important patterns. Jump in and uncover the key trigrams!

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example Documents
docs = [
    "Natural Language Processing is amazing!",
    "Text processing is an essential part of NLP.",
    "TF-IDF is used for text feature extraction."
]

# TODO: Set up the TF-IDF Vectorizer to use trigrams

# TODO: Convert to DataFrame for better visualization

# TODO: Extract and print top 5 trigrams with highest TF-IDF scores for each document

```

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Example Documents
docs = [
    "Natural Language Processing is amazing!",
    "Text processing is an essential part of NLP.",
    "TF-IDF is used for text feature extraction."
]

# Set up the TF-IDF Vectorizer to use trigrams
vectorizer = TfidfVectorizer(ngram_range=(3, 3), stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

# Convert to DataFrame for better visualization
df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("Trigram TF-IDF Scores:")
print(df)

# Extract and print top 5 trigrams with highest TF-IDF scores for each document
print("\nTop Trigrams for Each Document:")
for i, doc_scores in df.iterrows():
    print(f"\n--- Document {i+1} ---")
    
    # Get the top 5 trigrams with non-zero scores for the current document
    top_trigrams = doc_scores.sort_values(ascending=False)
    top_trigrams = top_trigrams[top_trigrams > 0].head(5)
    
    for trigram, score in top_trigrams.items():
        print(f"'{trigram}': {score:.4f}")

```

### **Explanation**

The code above accomplishes your task by:

1.  Setting up the `TfidfVectorizer` to use only **trigrams** by setting `ngram_range=(3,3)`.
2.  Using `fit_transform()` to generate the TF-IDF matrix.
3.  Converting the TF-IDF matrix into a DataFrame for easy visualization.
4.  Iterating through each row (document) in the DataFrame to identify and print the trigrams with the highest TF-IDF scores.

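The result lists, for each document, the trigrams with non-zero TF-IDF scores, such as `natural language processing` and `text processing essential`. Because `stop_words='english'` filters tokens before the n-grams are built, a trigram like `text processing essential` joins words that were separated by "is an" in the original sentence. Documents that yield fewer than five trigrams simply print fewer entries.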

## Comparing BoW and TF-IDF