# Unit 4 Optimizing TF-IDF Vectorization by Eliminating Stop Words

# Topic Overview
---

Welcome! In this lesson, we're going to explore **removing stop words in TF-IDF Vectorization**. As you learn how to filter out these common words during the vectorization process, you'll uncover how to reveal more meaningful information in your text data.

## Understanding Stop-words in NLP (Recap and Significance in Vectorization)
---

As we've already navigated through the terrain of stop words in a previous lesson, it's crucial to recall their role as we dive deeper into the specifics of TF-IDF vectorization. **Stop words**, often the most frequently occurring words in a language, do not carry significant meaning on their own within a text — words like "the", "is", "at", and "which". Removing these words during the vectorization process is not merely a cleansing step but a methodical approach to refine our data for more insightful analysis.

By filtering out stop words, we **reduce the dimensionality of our data**: every stop word dropped removes one column from the vocabulary, and because these words appear in most documents, the resulting matrix also has far fewer nonzero entries to store and process. Excluding them also reduces noise in the text data, letting our NLP models focus on the words that actually distinguish one text from another. In practice, this tends to speed up computation and can improve the accuracy of NLP algorithms, although the size of the gain depends on the task and the dataset.
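To make the dimensionality argument concrete, here is a minimal sketch that vectorizes the same texts with and without stop word filtering and compares the results. The three sentences are a made-up mini-corpus used only for illustration, not data from the lesson.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, made up for illustration only
docs = [
    "the offer is free and the prize is waiting",
    "call me at the office when you are free",
    "is the meeting at noon or at one",
]

# Vectorize the same corpus without and with English stop word filtering
X_all = TfidfVectorizer().fit_transform(docs)
X_filtered = TfidfVectorizer(stop_words='english').fit_transform(docs)

# shape[1] is the vocabulary size; nnz is the number of stored nonzero scores
print("Without stop word removal:", X_all.shape, "nonzeros:", X_all.nnz)
print("With stop word removal:   ", X_filtered.shape, "nonzeros:", X_filtered.nnz)
```

The number of rows stays the same in both cases; only the number of columns and stored values shrinks.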

This recap underscores the strategic importance of stop word removal within the realm of text vectorization, setting the stage for our exploration into implementing this process with TF-IDF vectorization.

## Implementing Stopwords Removal with TF-IDF Vectorization
---

Scikit-Learn's `TfidfVectorizer` handles stop words through its `stop_words` parameter, which accepts either the name of a predefined list or a custom list of words. By default it is `None`, meaning no stop words are removed. Let's break down both approaches so you can apply whichever fits your needs.

### Using Pre-defined Stop Words
---

For many applications, the **predefined list of English stop words** built into Scikit-Learn is sufficient. To use it, set the `stop_words` parameter to `'english'`, which is currently the only built-in list; for other languages, you need to supply your own list.
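If you want to see what the built-in list actually contains before relying on it, the following sketch inspects it. It uses `ENGLISH_STOP_WORDS` and the vectorizer's `get_stop_words()` method from Scikit-Learn; the variable name `active_stop_words` is just an illustrative choice.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# The built-in English list is exposed as a frozenset of lowercase words
print("Number of built-in stop words:", len(ENGLISH_STOP_WORDS))
print("A few examples:", sorted(ENGLISH_STOP_WORDS)[:10])

# A vectorizer configured with 'english' reports the list it will apply
vectorizer = TfidfVectorizer(stop_words='english')
active_stop_words = vectorizer.get_stop_words()
print("'the' will be filtered:", 'the' in active_stop_words)
```

Skimming the list is worthwhile because it may contain words that matter for your particular task.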

Let's employ the predefined English stop words to vectorize the 'message' column of an **SMS Spam Collection** dataset loaded into a Pandas DataFrame named `df`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer with English stop words
vectorizer = TfidfVectorizer(stop_words='english')

# Tokenize and build vocab
X_tfidf = vectorizer.fit_transform(df['message'])

# Output the shape of the TF-IDF matrix
print(X_tfidf.shape)
```

This script prints the shape of the TF-IDF matrix built after the stop words have been removed:

```
(5572, 8444)
```

Here, the 5,572 rows correspond to the messages in the dataset, and the 8,444 columns are the unique terms that remain in the vocabulary after stop words are excluded.

Each row of the matrix corresponds to one text message and each column to one term. The value in a cell is the TF-IDF score of that term in that message; because most messages contain only a handful of the vocabulary's terms, the matrix is stored in a sparse format.

### Applying Custom Stop Words
---

If your analysis requires a more tailored approach, `TfidfVectorizer` also accepts a **custom list of stop words**. This is particularly useful when dealing with domain-specific jargon or with texts in languages other than English, for which Scikit-Learn has no built-in list.

The following example demonstrates how to vectorize the same dataset messages while applying a custom list of stop words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a custom list of stop words
custom_stop_words = ['custom', 'list', 'of', 'stop', 'words']

# Initialize the TfidfVectorizer with the custom stop words
vectorizer = TfidfVectorizer(stop_words=custom_stop_words)

# Tokenize and build vocab
X_tfidf_custom = vectorizer.fit_transform(df['message'])
```

As before, the shape of the resulting matrix reflects the vocabulary left after the custom stop words are excluded. How much it changes depends on how many of the listed words actually occur in your text. The words in the list above are placeholders; in practice, you would list terms that are frequent but uninformative in your own domain.

Both methods of removing stop words enhance the relevancy and quality of your dataset for NLP tasks by eliminating unnecessary noise. Whether using the inbuilt English stop words functionality for quick analysis or going the extra mile with a custom list for specialized needs, `TfidfVectorizer` equips you with the flexibility to adapt your data preprocessing steps accordingly.

---
## Lesson Summary
---

Today, you learned about **stop words**, their influence in NLP, and how to remove them using the **TF-IDF vectorizer** from Python's Scikit-Learn library. This process is vital in reducing the dimensionality of your text data, improving computational efficiency, and enhancing the performance of NLP algorithms.

By practicing the removal of stopwords from different types of text and datasets, you'll extend your skills and create more sophisticated NLP models. So, let's keep advancing — every text you analyze is an opportunity to improve your model's performance.

## Run TF-IDF With Stop Words Removal

In this exercise, we apply TF-IDF vectorization with stop word removal to the SMS Spam Collection dataset using Scikit-Learn's `TfidfVectorizer`. Run the given code and check the printed shape: the second number is the size of the vocabulary left after common English words have been filtered out.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Remove English stop words during TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer.fit_transform(df['message'])

print(X_tfidf.shape)

```

The provided code already sets `stop_words='english'` when creating the `TfidfVectorizer` instance, so it runs as-is and prints the shape of the filtered TF-IDF matrix. Omitting this parameter would leave the default, `stop_words=None`, and no stop words would be removed.

Here's the complete code for reference:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Remove English stop words during TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer.fit_transform(df['message'])

print(X_tfidf.shape)
```

## Fixing TF-IDF Vectorization Stop Words

Stellar Navigator, your initial journey into the realm of TF-IDF Vectorization utilized a predefined list of English stopwords. However, the cosmos of text data often requires a more tailored approach. Now, let's enhance your code by transitioning from the default 'english' stopword list to a custom list that filters out only specific terms: 'a', 'an', 'the', and 'and'. Apply these changes to your code to see the impact of using a more personalized stopword list on your analysis.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from datasets import load_dataset

# load dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam['train'])

# TODO: Define a custom list of stop words

# Initialize the TfidfVectorizer with the custom stop words
vectorizer = TfidfVectorizer(stop_words="english")

# Perform TF-IDF vectorization
X_tfidf = vectorizer.fit_transform(df['message'])

# Output the shape of the TF-IDF matrix
print(X_tfidf.shape)

```

To implement a custom list of stop words, you need to define a Python list containing the words you want to exclude and then pass this list to the `stop_words` parameter of the `TfidfVectorizer`.

Here's the corrected code:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from datasets import load_dataset

# load dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam['train'])

# Define a custom list of stop words
custom_stop_words = ['a', 'an', 'the', 'and']

# Initialize the TfidfVectorizer with the custom stop words
vectorizer = TfidfVectorizer(stop_words=custom_stop_words) # Changed "english" to custom_stop_words

# Perform TF-IDF vectorization
X_tfidf = vectorizer.fit_transform(df['message'])

# Output the shape of the TF-IDF matrix
print(X_tfidf.shape)
```

## Customizing Stop Words for TF-IDF Vectorization

Brilliant, Space Explorer! Let's continue building your expertise in handling stop words with TF-IDF Vectorization. Remember, we aim to eliminate those common words to improve the relevancy of our data. Add the code to remove English stop words during vectorization.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from datasets import load_dataset

# Loading dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam['train'])

# TODO: Initialize a vectorizer that filters out english stop words

# Fitting vectorizer
X_tfidf = vectorizer.fit_transform(df['message'])

# Printing the shape of the TF-IDF matrix
print(X_tfidf.shape)

```

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from datasets import load_dataset

# Loading dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam['train'])

# Initialize a vectorizer that filters out English stop words
vectorizer = TfidfVectorizer(stop_words='english')

# Fitting vectorizer
X_tfidf = vectorizer.fit_transform(df['message'])

# Printing the shape of the TF-IDF matrix
print(X_tfidf.shape)
```

## Adding Stop Words Removal to TF-IDF Vectorizer

Building on the insights and techniques you've learned, this final challenge asks you to encapsulate your newfound skills into one cohesive task. You've already seen how removing stop words can streamline TF-IDF vectorization, making your data more relevant for analysis. Now, you'll apply this knowledge hands-on by preparing the SMS Spam Collection dataset for further NLP tasks. This step is essential for enhancing the performance of NLP models, as it allows them to focus on the more meaningful content of the dataset.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Loading the SMS Spam Collection dataset 
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Converting the dataset to a pandas DataFrame for ease of handling
df = pd.DataFrame(sms_spam['train'])

# TODO: Initialize TfidfVectorizer with stop words set to remove English stop words

# TODO: Fit the vectorizer to the 'message' column and transform it into a TF-IDF matrix

# TODO: Print the shape of the TF-IDF matrix to see the number of messages and terms


```

To complete the task, you need to add the following lines of code:

1.  Initialize `TfidfVectorizer` with `stop_words='english'`.
2.  Fit and transform the 'message' column of your DataFrame using the vectorizer.
3.  Print the shape of the resulting TF-IDF matrix.

Here's the completed code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# Loading the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Converting the dataset to a pandas DataFrame for ease of handling
df = pd.DataFrame(sms_spam['train'])

# Initialize TfidfVectorizer with stop words set to remove English stop words
vectorizer = TfidfVectorizer(stop_words='english')

# Fit the vectorizer to the 'message' column and transform it into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(df['message'])

# Print the shape of the TF-IDF matrix to see the number of messages and terms
print("Shape of TF-IDF matrix:", tfidf_matrix.shape)
```