# Unit 2 Navigating the Weights of Words: Analyzing TF-IDF Scores in NLP

---
## Navigating the Weights of Words: Analyzing TF-IDF Scores in NLP

### Introduction

Hello again! Today, we're taking a closer look at **Term Frequency-Inverse Document Frequency**, or **TF-IDF**. As you may recall from earlier lessons, **TF-IDF** is a statistical measure that tells us how important a word is to a document in a collection or corpus.

Here, we explore **TF-IDF** scores and their relevance, as they can guide us in identifying words that carry significant value in determining the context or theme of a document. We'll use Python's **Scikit-learn** library, a machine learning tool that comes with built-in capabilities for calculating **TF-IDF**. Let's embark on this exciting journey of analyzing **TF-IDF** scores on our **SMS Spam Collection** dataset.

---

### Identifying Top Features Based on TF-IDF Scores

Now that we've learned how to compute the **TF-IDF** matrix, we'd like to determine which words (or features) have the highest scores. A high score means a word carries a lot of weight in at least one message, which makes these words good candidates for closer inspection.

Let's understand how the following code snippet helps identify these top features:

```python
# ... The dataset was loaded and converted to the DataFrame `df` earlier

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Find the indices of the top 10 features based on global maximum TF-IDF scores
top_features_idx = np.argsort(X_tfidf.max(axis=0))[-10:]

# Extract the top 10 features
top_features = feature_names[top_features_idx]

# Print the top 10 TF-IDF features
print("Top 10 TF-IDF features:")
print(top_features)
```

In this code, we first transform the dataset's messages into a matrix of **TF-IDF** features, in which each row is a message and each column is a word. `X_tfidf.max(axis=0)` then takes, for every word, the highest score it reaches in any single message. `np.argsort` returns the column indices ordered from lowest to highest score, so the slice `[-10:]` keeps the 10 highest-scoring words, listed in ascending order. Finally, we look up those indices in `feature_names` and print the words. Note that this ranks a word by its single best message, not by its weight across the whole corpus.

The output of the above code will be:

```
Top 10 TF-IDF features:
['anytime' 'yup' '645' 'where' 'ok' 'alrite' 'thank' 'okie' 'thanx' 'nite']
```

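This output shows an eclectic mix of words with the highest **TF-IDF** scores in our **SMS Spam Collection** dataset: common words such as `ok` and `where` sit next to informal spellings such as `thanx` and `nite` and a bare number, `645`. The reason lies in how the score is computed. By default, `TfidfVectorizer` scales each message's vector to unit length (L2 normalization), so a message that contains only one token gives that token a score of 1.0, the maximum possible. Ranking by the global maximum therefore tends to surface words that dominate very short messages, and many words can tie at the top. These terms are worth inspecting, but they are not necessarily the words that best describe the corpus as a whole.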

---

### Lesson Summary and Practice

Remarkable job! Today, we've learned how to extract insights from **TF-IDF** scores. We delved deeper into the meaning of these scores, and we coded a Python script using the **Scikit-learn** library to calculate **TF-IDF**. Furthermore, we built on this by writing code to identify the top words based on their **TF-IDF** scores.

While it may seem like a lot, remember that practice makes perfect. Continue working with different corpora to get comfortable with the process. Use the upcoming practice exercises to reinforce your understanding and enhance your skills. Knowing how to interpret and analyze **TF-IDF** scores forms the backbone of numerous advanced NLP tasks including document classification, sentiment analysis, topic modeling, and many more. Keep going!

## Identifying High-Impact Words with TF-IDF Vectorization

In this exercise, you'll use TF-IDF vectorization to identify the 10 words with the highest TF-IDF scores in the SMS Spam Collection dataset. Distinguishing the words that contribute most to the context of a document is particularly useful in tasks such as spam detection. Just run the provided code as is to see these key terms come to life.

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Find the indices of the top 10 features based on global maximum TF-IDF scores
top_features_idx = np.argsort(X_tfidf.max(axis=0))[-10:]

# Extract the top 10 features
top_features = feature_names[top_features_idx]

# Print the top 10 TF-IDF features 
print("Top 10 TF-IDF features:")
print(top_features)

```

## Unveiling Other Significant Words

Building on your newfound understanding from analyzing the TF-IDF scores, this task invites you to broaden your perspective by looking at more features. Adjust the current code to identify the top 20 TF-IDF features, rather than just 10. This exercise aims to give you a more rounded view of word importance across the dataset. By examining a larger set of highly important words, you'll deepen your understanding of TF-IDF vectorization and appreciate the nuanced significance of these words within the context of text analysis.

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Find the indices of the top 20 features based on global maximum TF-IDF scores
top_features_idx = np.argsort(X_tfidf.max(axis=0))[-10:]

# Extract the top 20 features
top_features = feature_names[top_features_idx]

# Print the top 20 TF-IDF features 
print("Top TF-IDF features:")
print(top_features)

```

To identify the top 20 TF-IDF features, you only need to modify one line of the provided code: change the slice `[-10:]` to `[-20:]` when finding the indices of the top features. The slice then keeps the last 20 indices returned by `np.argsort`, which correspond to the 20 highest scores.

Here's the updated code:

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Find the indices of the top 20 features based on global maximum TF-IDF scores
top_features_idx = np.argsort(X_tfidf.max(axis=0))[-20:] # Changed from -10 to -20

# Extract the top 20 features
top_features = feature_names[top_features_idx]

# Print the top 20 TF-IDF features
print("Top TF-IDF features:")
print(top_features)
```

## Debugging the TF-IDF Vectorizer

In this task, you'll troubleshoot a piece of code intended to identify the top 10 TF-IDF features from a dataset. Despite its aim, the code currently falls short due to a bug. Your challenge is to debug the Python code. Pay close attention to syntax and function utilization. Correcting this issue will not only solidify your grasp on TF-IDF vectorization but also prime you for more complex text analysis tasks ahead.

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Find the indices of the top 10 features based on global maximum TF-IDF scores
top_features_idx = np.argsort(X_tfidf.max(axis=0))[10:]

# Extract the top 10 features
top_features = feature_names[top_features_idx]

# Print the top 10 TF-IDF features 
print("Top 10 TF-IDF features:")
print(top_features)

```

The bug in the provided code lies in the line where `top_features_idx` is calculated:

```python
top_features_idx = np.argsort(X_tfidf.max(axis=0))[10:]
```

`np.argsort()` returns the indices that would sort an array in ascending order, so the indices of the largest values sit at the end. The slice `[10:]` skips the first 10 indices (the 10 lowest-scoring features) and returns every remaining one, which is nearly the whole vocabulary rather than the top 10. To get the 10 largest values, use the negative slice `[-10:]`, which takes the last 10 elements.

Here's the corrected code:

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Find the indices of the top 10 features based on global maximum TF-IDF scores
# BUG FIX: Changed [10:] to [-10:] to get the last 10 (i.e., top 10) elements
top_features_idx = np.argsort(X_tfidf.max(axis=0))[-10:]

# Extract the top 10 features
top_features = feature_names[top_features_idx]

# Print the top 10 TF-IDF features
print("Top 10 TF-IDF features:")
print(top_features)
```
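
To see the difference between the two slices without loading the dataset, here is a small sketch on a hand-made array of eight scores (the values are made up for illustration), using a top 3 instead of a top 10.

```python
import numpy as np

# Made-up scores for eight hypothetical features
scores = np.array([0.2, 0.9, 0.1, 0.7, 0.4, 0.8, 0.3, 0.6])

# Indices ordered from the lowest score to the highest
order = np.argsort(scores)

# Buggy slice: drops the 3 lowest-scoring indices and keeps all the rest
buggy = order[3:]

# Correct slice: keeps only the 3 highest-scoring indices (ascending order)
fixed = order[-3:]

# Optional: reverse so the best feature comes first
best_first = order[::-1][:3]

print("argsort order:", order)
print("order[3:]    :", buggy)
print("order[-3:]   :", fixed)
print("best first   :", best_first)
```

The positive slice returns five indices here, while the negative slice returns exactly the three with the largest scores.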

## Discovering Top TF-IDF Terms

In this task, you'll use Python to extract the top 10 terms from a text corpus using TF-IDF vectorization. To do so, complete the missing line so that it sorts the features by their maximum TF-IDF scores and selects the indices of the top 10.

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# TODO: Find the indices of the top 10 features based on global maximum TF-IDF scores
top_features_idx = ________________________________

# Extract the top 10 features
top_features = feature_names[top_features_idx]

# Print the top 10 TF-IDF features
print("Top 10 TF-IDF features:")
print(top_features)
```

Here's the completed code, with the missing line filled in:

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Find the indices of the top 10 features based on global maximum TF-IDF scores
top_features_idx = np.argsort(X_tfidf.max(axis=0))[-10:]

# Extract the top 10 features
top_features = feature_names[top_features_idx]

# Print the top 10 TF-IDF features
print("Top 10 TF-IDF features:")
print(top_features)
```

## Mastering TF-IDF Feature Extraction

After exploring and practicing with TF-IDF vectorization, it's time for the final challenge. You will write the entire portion of the code that identifies the top 10 significant words. This exercise solidifies your understanding of text analysis using TF-IDF, preparing you for more complex tasks in future projects.

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# TODO: Get the feature names

# TODO: Find the indices of the top 10 features based on global maximum TF-IDF scores

# TODO: Extract the top 10 features

# TODO: Print the top 10 TF-IDF features
```

Here's one possible solution:

```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sms_spam['train'])

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform the messages into a matrix of TF-IDF features
X_tfidf = vectorizer.fit_transform(df['message']).toarray()

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

# Calculate the global maximum TF-IDF score for each feature
# We take the max along axis 0 (columns) to get the max TF-IDF score for each word across all documents
max_tfidf_scores = np.max(X_tfidf, axis=0)

# Find the indices of the top 10 features based on global maximum TF-IDF scores
# Use argsort to get indices that would sort the array, then take the last 10 (largest)
top_10_indices = max_tfidf_scores.argsort()[-10:]

# Extract the top 10 features
top_10_features = [feature_names[i] for i in top_10_indices]

# Print the top 10 TF-IDF features
print("Top 10 TF-IDF features:")
for feature in top_10_features:
    print(feature)

```