# Unit 4 Demystifying Stop Words in Natural Language Processing

---

# Lesson Overview

Hello and welcome! In this lesson, we'll dive into a crucial step in text data preprocessing in **Natural Language Processing (NLP)** — removing stop words using Python and NLTK.

**Stop words** are the most commonly used words in a language. Despite their high frequency, they carry little meaningful information and are often filtered out of text data. By the end of this lesson, you'll understand what stop words are, why removing them matters in NLP, and how to remove them from your text data.

---

# Understanding Stop Words in NLP

So what exactly are stop words? You can think of them as **background noise** in your text data. They're words like *is*, *the*, *in*, and *and*: words that don't carry much meaning on their own.

Why remove them? Machine learning models look for signals to make decisions. Words that appear in almost every document give the model little to distinguish one document from another. We therefore usually remove them early in the preprocessing pipeline, so the model can focus on words that are more distinctive.
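To make the "little useful information" point concrete, here is a minimal sketch that counts how many messages each word appears in. The three messages are made up for illustration and are not taken from the lesson's dataset.

```python
from collections import Counter

# Three made-up messages, used only for illustration.
messages = [
    "the prize is waiting for you in the city",
    "the meeting is in the morning",
    "the free prize is in the mail",
]

# Count in how many messages each word appears (document frequency).
doc_freq = Counter()
for message in messages:
    doc_freq.update(set(message.split()))

# Words present in every message cannot help tell the messages apart.
in_every_message = sorted(w for w, n in doc_freq.items() if n == len(messages))
print(in_every_message)
```

In this toy example, the words shared by all three messages are exactly the kind of words a stop words list contains, while a word like `prize` appears in only some of them.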

---

# Identification and Removal of Stop Words

To remove stop words, we first have to identify them. NLTK ships a built-in English stop words list for this purpose. Before using it, make sure the `stopwords` corpus is downloaded by running `nltk.download('stopwords')`. Once it is available, `nltk.corpus.stopwords.words('english')` returns the English stop words as a list of lowercase strings, which we can use to filter our text data.

---

# Practical Example: Code Explanation

To better illustrate this, let's walk through the code block below:

```python
import pandas as pd
from datasets import load_dataset
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

sms_spam = load_dataset('codesignal/sms-spam-collection')

df = pd.DataFrame(sms_spam['train'])

df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

stop_words = stopwords.words('english')

print("Some stop words:", stop_words[:3])

df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

print(df['filtered_tokens'].head())
```

In this block of code, we start by importing the necessary libraries and downloading the required NLTK resources. Next, we load the SMS Spam Collection with `load_dataset` and convert its `train` split into a pandas DataFrame, which makes it easier to handle.

We then **tokenize** the `message` column of our DataFrame and store the result in a new column, `tokens`. Tokenizing means breaking text into individual tokens, such as words and punctuation marks. At this point, we don't worry about stop words; our only goal is to split the sentences into tokens.

We then load NLTK's built-in English stop words list and assign it to a variable, `stop_words`. Following that, we print the first 3 stop words as examples.

Now that we have both our tokens and our stop words, we can remove the stop words from our tokens. We use a **lambda function** containing a list comprehension that checks each token against `stop_words` and keeps only the tokens that are not in the list.

We apply this process to our DataFrame, and our final output is a DataFrame with another new column — the `filtered_tokens` column. This column contains our tokenized messages, sans the stop words.

The output will be:

```
Some stop words: ['i', 'me', 'my']
0    [Go, jurong, point, ,, crazy, .., Available, b...
1             [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3    [U, dun, say, early, hor, ..., U, c, already, ...
4    [Nah, I, n't, think, goes, usf, ,, lives, arou...
Name: filtered_tokens, dtype: object
```

This output shows the first few entries of the `filtered_tokens` column from our DataFrame, demonstrating the result of removing stop words from the tokenized messages. Each entry corresponds to tokenized, filtered text from our initial dataset, showcasing how common stop words are excluded, leaving more meaningful words.
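One detail in the output above is worth a closer look: the token `I` survives in row 4. NLTK's English list is lowercase and the `not in` check is case-sensitive, so capitalized tokens never match. The sketch below shows one possible way to handle this by lowercasing each token only for the comparison. The tiny `stop_words` set is a hand-written stand-in for `stopwords.words('english')`, so the example runs without any downloads.

```python
# Hand-written stand-in for stopwords.words('english'); the real list is much longer.
stop_words = {"i", "he", "here"}

tokens = ["Nah", "I", "think", "he", "lives", "around", "here"]

# Case-sensitive check, as in the lesson code: "I" is kept because only "i" is listed.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercase each token for the comparison only; kept tokens retain their original casing.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to normalize case depends on the task: lowercasing removes more stop words, but capitalization can itself be a useful signal in some datasets.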

---

# Lesson Summary and Practice

Phew! We made it. You should now understand what stop words are, why they matter in **Natural Language Processing**, and why we usually remove them. You can now remove stop words from your text data using Python and the NLTK library, which is a pretty neat skill to have.

Remember to keep practicing. Challenge yourself with different text data and try to remove the stop words. There's no better way to learn than through constant hands-on experience. So, go ahead and start analyzing some data!

## Stop Words Demystified in NLP

Explore the removal of stop words in NLP by practicing with Python and NLTK on the SMS Spam Collection dataset. This task involves tokenizing messages and filtering out English stop words to highlight meaningful data. Simply run the provided code to see the impact of this preprocessing step firsthand.




```python
import pandas as pd
from datasets import load_dataset
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Tokenize the messages
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

# Define stop words
stop_words = stopwords.words('english')

# Print some stop words
print("Some stop words:", stop_words[:3]) 

# Remove stop words from tokens
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])

# Print filtered tokens
print(df['filtered_tokens'].head())
```
Output of the code:
```
Some stop words: ['i', 'me', 'my']
0    [Go, jurong, point, ,, crazy, .., Available, b...
1             [Ok, lar, ..., Joking, wif, u, oni, ...]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3    [U, dun, say, early, hor, ..., U, c, already, ...
4    [Nah, I, n't, think, goes, usf, ,, lives, arou...
Name: filtered_tokens, dtype: object
```
The code successfully demonstrates the removal of stop words from the SMS Spam Collection dataset using NLTK. 

Here's a breakdown of what happened:
* The `sms-spam-collection` dataset was loaded.
* Each message in the dataset was tokenized into individual words.
* A list of common English stop words (like 'i', 'me', 'my') was defined.
* A new column `filtered_tokens` was created, containing the tokens from which these stop words have been removed.

If you compare the original messages with the `filtered_tokens` output, you can see that words that are typically less meaningful for text analysis (e.g., articles, prepositions, common verbs) are removed, leaving behind words that are likely more significant for tasks like spam detection. This preprocessing step helps to reduce noise and focus on the most informative words in the text.

## Adapting Stop Words Removal for Spanish

With the knowledge gained from understanding stop words in English, let's pivot and explore their impact in Spanish. This step involves modifying the given Python code to filter out stop words from a generated dataset containing both English and Spanish text. This task will test your understanding of applying text processing routines across different languages by altering the language parameter in the stop words function.


```python
import pandas as pd
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Generated dataset with English and Spanish words
data = {
    'message': [
        'Hola cómo estás?',
        'I am fine and you?',
        'Estoy bien gracias por preguntar',
        'Good morning have a nice day',
        'Buenos días que tengas un buen día'
    ]
}

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(data)

# Tokenize the messages into individual words
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x.lower()))

# Define stop words
stop_words = stopwords.words('english') # TODO: Change to use 'spanish' stop words

# Print some stop words
print("Some stop words:", stop_words[:3]) 

# Remove stop words from tokens
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df[['message', 'filtered_tokens']].head())

```

To adapt the stop words removal for Spanish, you need to change the language parameter in the `stopwords.words()` function from `'english'` to `'spanish'`.

Here's the modified code:

```python
import pandas as pd
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Generated dataset with English and Spanish words
data = {
    'message': [
        'Hola cómo estás?',
        'I am fine and you?',
        'Estoy bien gracias por preguntar',
        'Good morning have a nice day',
        'Buenos días que tengas un buen día'
    ]
}

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(data)

# Tokenize the messages into individual words
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x.lower()))

# Define stop words
stop_words = stopwords.words('spanish') # Changed to use 'spanish' stop words

# Print some stop words
print("Some stop words:", stop_words[:3]) 

# Remove stop words from tokens
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df[['message', 'filtered_tokens']].head())
```

**Output of the code:**

```
Some stop words: ['de', 'la', 'que']
                              message                   filtered_tokens
0                    Hola cómo estás?                   [hola, cómo, ?]
1                  I am fine and you?        [i, am, fine, and, you, ?]
2    Estoy bien gracias por preguntar        [bien, gracias, preguntar]
3        Good morning have a nice day  [good, morning, have, nice, day]
4  Buenos días que tengas un buen día         [buenos, días, buen, día]
```

**Explanation of changes and impact:**

By changing `stopwords.words('english')` to `stopwords.words('spanish')`, the code now loads the Spanish stop words list. Exact results may differ slightly if your installed NLTK data ships a different version of the list.

* You can see the printed "Some stop words:" now shows Spanish stop words like 'de', 'la', 'que'.
* In the `filtered_tokens` output, observe how Spanish stop words have been removed from the Spanish sentences. For example, "Estoy bien gracias por preguntar" becomes `[bien, gracias, preguntar]`, because both "estoy" and "por" are in the Spanish list. English stop words such as "i", "am", and "and" remain in the English sentences.
* The English word "a" disappears from "Good morning have a nice day" only because "a" also happens to be a Spanish preposition in the list. A stop words list matches strings, not languages, so it is best applied to text in the language it was built for.
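The effect of the language parameter can be sketched without NLTK. In the example below, the `STOP_WORDS` dictionary and the `remove_stop_words` helper are hypothetical names, and the two tiny sets are hand-written stand-ins for `stopwords.words('english')` and `stopwords.words('spanish')`.

```python
# Tiny hand-written lists standing in for NLTK's per-language stop words lists.
STOP_WORDS = {
    "english": {"i", "am", "and", "you", "have"},
    "spanish": {"de", "la", "que", "por", "un"},
}


def remove_stop_words(tokens, language):
    """Keep only the tokens that are not in the chosen language's list."""
    stop_words = STOP_WORDS[language]
    return [t for t in tokens if t not in stop_words]


english_tokens = ["i", "am", "fine", "and", "you", "?"]

with_english_list = remove_stop_words(english_tokens, "english")
with_spanish_list = remove_stop_words(english_tokens, "spanish")

print(with_english_list)
print(with_spanish_list)
```

With the English stand-in list, only `fine` and `?` remain; with the Spanish stand-in list, the English sentence passes through unchanged, which mirrors what happens to the English rows in the exercise.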

## Debugging Stop Words Removal

Next, we'll tackle a common hiccup you might encounter along the way. Your task is to diagnose and fix a bug in a given block of code. The code aims to tokenize text from our dataset and remove stop words; however, it's not functioning as intended. Identifying and fixing such bugs is crucial for ensuring the reliability of your text preprocessing pipeline.


```python
import pandas as pd
from datasets import load_dataset
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Tokenize the messages into individual words
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

# Define stop words
stop_words = stopwords.words('english')

# This line should remove stop words but it's including only stop words instead.
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word in stop_words])
print(df['filtered_tokens'].head())

```

The bug in the provided code lies in the line where stop words are supposed to be removed. Currently, it's including *only* stop words instead of filtering them out.

The problematic line is:
`df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word in stop_words])`

To fix this, you need to change `if word in stop_words` to `if word not in stop_words`. This will ensure that only words that are *not* in the stop words list are kept.

Here's the corrected code:

```python
import pandas as pd
from datasets import load_dataset
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Tokenize the messages into individual words
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

# Define stop words
stop_words = stopwords.words('english')

# This line should remove stop words, so it should exclude words IN stop_words
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['filtered_tokens'].head())

```
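The difference between the two conditions is easy to see on a single token list. This is a small sketch with a hand-written two-word `stop_words` set standing in for NLTK's list.

```python
# Hand-written stand-in for stopwords.words('english').
stop_words = {"in", "a"}

tokens = ["Free", "entry", "in", "a", "weekly", "competition"]

# Buggy condition: keeps ONLY the stop words.
only_stop_words = [word for word in tokens if word in stop_words]

# Fixed condition: keeps everything EXCEPT the stop words.
filtered = [word for word in tokens if word not in stop_words]

print(only_stop_words)
print(filtered)
```

A quick check like this on one hand-picked message is often enough to catch an inverted condition before running the filter over a whole DataFrame.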

## Setting the Stage for Stop Words Removal in Text Data

Building on your newfound understanding of removing stop words, this practice aims to solidify your skills. You will fill in the blank in a Python script that prepares SMS data by tokenizing messages and filtering out stop words. By completing this exercise, you will become adept at cleaning textual data, ensuring it is primed for extracting insights.

```python
import pandas as pd
from datasets import load_dataset
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Tokenize the messages into individual words
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

# TODO: Define stop words

# Remove stop words from tokens
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['filtered_tokens'].head())

```

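To complete the script, define `stop_words` from `stopwords.words('english')`, which provides NLTK's list of common English stop words. The solution below also wraps the list in `set(...)`. This is optional and does not change the result, but membership checks against a set are faster than scanning a list.

Here's the completed script: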

```python
import pandas as pd
from datasets import load_dataset
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Tokenize the messages into individual words
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

# Define stop words (a set makes the membership checks below fast)
stop_words = set(stopwords.words('english'))

# Remove stop words from tokens
df['filtered_tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['filtered_tokens'].head())

```
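As a quick sanity check that converting the list to a set does not change what gets filtered, here is a minimal sketch. The five-word list is a hand-written stand-in for `stopwords.words('english')`.

```python
# Hand-written stand-in for stopwords.words('english').
stop_word_list = ["i", "me", "my", "the", "is"]
stop_word_set = set(stop_word_list)

tokens = ["my", "phone", "is", "the", "best"]

# Same filter, once against the list and once against the set.
from_list = [word for word in tokens if word not in stop_word_list]
from_set = [word for word in tokens if word not in stop_word_set]

print(from_list)
print(from_set)
```

Both filters keep the same tokens in the same order. The set version looks each token up by hash rather than comparing it against every entry, which adds up when the filter runs over thousands of messages.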

## Mastering Stop Words Removal

Your task now involves implementing the code to identify and omit stop words using the NLTK library. This final practice piece within this unit serves as a testament to your acquired skills in text preprocessing, which are crucial for NLP tasks. Let’s dive in, reinforcing your abilities and ensuring a strong grasp of the concepts.

```python
import pandas as pd
from datasets import load_dataset
import nltk
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')

# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Tokenize the messages into individual words
df['tokens'] = df['message'].apply(lambda x: word_tokenize(x))

# TODO: Define and set the stop words

# TODO: Remove stop words from the previously tokenized messages

# TODO: Print the first five entries of the cleaned (stop words removed) tokens


```