# Lesson 3: Removing Stop Words and Stemming in Text Preprocessing

Hello and welcome to this lesson on **Removing Stop Words and Stemming**! In this lesson, we will dive deep into two essential steps for preparing text data for machine learning models: removing stop words and stemming. These techniques will help us improve the efficiency and accuracy of our models. Let's get started!

## Understanding Stop Words

In Natural Language Processing (NLP), **stop words** refer to the most common words in a language that do not provide significant meaning. Examples include "and", "the", "is", and others. These words are often removed to speed up processing without losing crucial information. Python's Natural Language Toolkit (NLTK) provides a pre-defined list of stop words. Let's take a look:

```python
from nltk.corpus import stopwords

# Defining the stop words
stop_words = set(stopwords.words('english'))

# Print 5 stop words
examples_of_stopwords = list(stop_words)[:5]
print(f"Examples of stop words: {examples_of_stopwords}")
```

Because a Python set is unordered, the five words shown will vary from run to run. One possible output is:

```
Examples of stop words: ['or', 'some', 'couldn', 'hasn', 'after']
```

Here, the `stopwords.words('english')` function returns a list of English stop words. You might sometimes need to add domain-specific stop words to this list based on the nature of your text data.
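As a minimal sketch of extending the list, the snippet below merges a few extra words into a stop word set. The tiny `base_stop_words` set is only a stand-in for `set(stopwords.words('english'))`, so the example runs without the downloaded corpus, and the `domain_stop_words` entries are hypothetical choices for email-like text.

```python
# Stand-in for set(stopwords.words('english')): a tiny hand-picked subset
base_stop_words = {"the", "is", "and", "a", "of", "to"}

# Hypothetical domain-specific additions for email-like text
domain_stop_words = {"subject", "re", "organization", "lines"}

# Union of both sets; the original sets are left unchanged
stop_words_extended = base_stop_words | domain_stop_words

tokens = ["subject", "the", "penguins", "fans", "reaction"]
kept = [token for token in tokens if token not in stop_words_extended]
print(kept)  # ['penguins', 'fans', 'reaction']
```

With the real NLTK list, the same union (or `stop_words.update(...)`) applies unchanged.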

## Introduction to Stemming

**Stemming** is a technique that reduces a word to its root form. Although the stemmed word may not always be a real or grammatically correct word in English, it helps consolidate different forms of the same word into a common base form, reducing the complexity of text data. This simplification leads to quicker computation and potentially better performance when implementing NLP algorithms, as there are fewer unique words to consider.

For example, the words "run", "runs", and "running" might all be stemmed to the common root "run". This helps our algorithm understand that these words are related and carry a similar semantic meaning.

Let's illustrate this with the **Porter Stemmer**, a well-known stemming algorithm from the NLTK library:

```python
from nltk.stem import PorterStemmer

# Stemming with NLTK Porter Stemmer
stemmer = PorterStemmer()

stemmed_word = stemmer.stem('running')
print(f"Stemmed word: {stemmed_word}")
```

The output of the above code will be:

```
Stemmed word: run
```

The `PorterStemmer` class comes with the `stem` method that takes in a word and returns its root form. In this case, "running" is correctly stemmed to its root word "run". This form of preprocessing, although it may lead to words that are not recognizable, is a standard practice in text preprocessing for NLP tasks.
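To see what "not recognizable" means in practice, here is a small sketch that stems a handful of words from the example sentence used later in this lesson. It only assumes that NLTK is installed; the `PorterStemmer` itself needs no downloaded data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the example sentence used later in this lesson
words = ["example", "demonstrate", "removal", "words", "stemming"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as "exampl" and "demonstr" are not English words, but every inflected form of the same word maps to the same stem, which is what matters for counting and matching.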

## Stop Words Removal and Stemming in Action

Having understood stop words and stemming, let's develop a function that removes stop words and applies stemming to a given text. We will tokenize the text (split it into individual words) and apply these transformations word by word.

```python
from nltk.tokenize import word_tokenize

def remove_stopwords_and_stem(text):
    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word not in stop_words]
    return " ".join(filtered_text)

example_text = "This is an example text to demonstrate the removal of stop words and stemming."

print(f"Original Text: {example_text}")
print(f"Processed Text: {remove_stopwords_and_stem(example_text)}")
```

The output of the above code will be:

```
Original Text: This is an example text to demonstrate the removal of stop words and stemming.
Processed Text: thi exampl text demonstr remov stop word stem .
```

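The `remove_stopwords_and_stem` function tokenizes the text, drops stop words, and stems what remains. Two details in the output are worth noting: "This" survives as "thi" because the NLTK stop word list is lowercase and the membership check is case-sensitive, and the final "." is kept because punctuation is not part of the stop word list.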
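One way to avoid the two leftovers in the output above ("thi" and the trailing ".") is to lowercase the text before filtering and to skip punctuation tokens. The sketch below is an illustration under assumptions, not the lesson's function: `remove_stopwords_and_stem_ci` is a hypothetical name, `demo_stop_words` is a small stand-in for the NLTK list, and `wordpunct_tokenize` is swapped in for `word_tokenize` because it is regex-based and needs no downloaded data.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize  # regex-based, no downloaded data needed

def remove_stopwords_and_stem_ci(text, stop_words, stemmer):
    """Lowercase, tokenize, drop stop words and punctuation, then stem."""
    tokens = wordpunct_tokenize(text.lower())
    kept = [t for t in tokens if t.isalnum() and t not in stop_words]
    return " ".join(stemmer.stem(t) for t in kept)

# Stand-in for set(stopwords.words('english'))
demo_stop_words = {"this", "is", "an", "to", "the", "of", "and"}

result = remove_stopwords_and_stem_ci(
    "This is an example text to demonstrate the removal of stop words and stemming.",
    demo_stop_words,
    PorterStemmer(),
)
print(result)  # exampl text demonstr remov stop word stem
```

Passing the stop word set and stemmer as arguments, instead of relying on module-level names, also makes the function easier to test.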

## Stop Words Removal and Stemming on a Dataset

Let's implement the above concepts on a real-world text dataset – the **20 Newsgroups Dataset**.

```python
from sklearn.datasets import fetch_20newsgroups

# Fetching 20 newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='all')

# Limit to first 100 data points for efficient code execution
newsgroups_data = newsgroups_data['data'][:100]

processed_newsgroups_data = [remove_stopwords_and_stem(text) for text in newsgroups_data]

# Print first 100 characters of the first document
print("First 100 characters of the first processed document:")
print(processed_newsgroups_data[0][:100])
```

The output of the above code will be:

```
First 100 characters of the first processed document:
from : mamatha devineni ratnam < mr47+ @ andrew.cmu.edu > subject : pen fan reaction organ : post of
```

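This process can take a while for large datasets, but the resulting text has fewer distinct word forms, which makes it easier for a machine learning model to work with. As the sample shows, header fields and punctuation still remain, since only stop word removal and stemming were applied.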
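If stemming time becomes noticeable on a large corpus, one possible mitigation is to cache stems, since the same words recur across documents. The following is a sketch of that idea using the standard library's `functools.lru_cache`; `cached_stem` is a hypothetical helper, not part of NLTK.

```python
from functools import lru_cache

from nltk.stem import PorterStemmer

_porter = PorterStemmer()

@lru_cache(maxsize=None)
def cached_stem(word):
    """Stem a word, reusing the result if the same word was seen before."""
    return _porter.stem(word)

tokens = ["running", "runs", "running", "run", "running"]
stems = [cached_stem(token) for token in tokens]

print(stems)
print(cached_stem.cache_info())  # hits count the repeated lookups
```

Whether this helps in practice depends on how repetitive the vocabulary is, so it is worth measuring before adopting it.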

## Summary and Conclusion

And that's a wrap! In today's lesson, we've learned about stop words and stemming as crucial steps in text preprocessing for machine learning models. We used Python's NLTK library to work with stop words and perform stemming. We processed some example sentences and a real-world dataset to practice these concepts.

As we proceed to more advanced NLP tasks, preprocessing techniques like removing stop words and stemming will serve as a solid foundation. In the upcoming lessons, we will delve deeper into handling missing text data and learn about reshaping textual data for analysis. Let's keep going!

## Switch from LancasterStemmer to PorterStemmer

Excellent work so far, Stellar Navigator! Now it's time to tweak the implemented text preprocessing method. Replace the use of `LancasterStemmer` with `PorterStemmer`. Remember to import `PorterStemmer` from `nltk.stem`. After adjusting your code, run it and observe the differences in your processed text.

```python
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize

# Defining the stop words
stop_words = set(stopwords.words('english'))

# Creating an instance of LancasterStemmer
stemmer = LancasterStemmer()

def remove_stopwords_and_stem(text):
    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word not in stop_words]
    return " ".join(filtered_text)

example_text = "This is an example text to demonstrate the removal of stop words and stemming."
processed_text = remove_stopwords_and_stem(example_text)
print(f"Original Text: {example_text}")
print(f"Processed Text: {processed_text}")
```

To modify the provided code by replacing the `LancasterStemmer` with `PorterStemmer`, you will need to import `PorterStemmer` from `nltk.stem` and create an instance of it. Below is the adjusted code:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer  # Importing PorterStemmer
from nltk.tokenize import word_tokenize

# Defining the stop words
stop_words = set(stopwords.words('english'))

# Creating an instance of PorterStemmer
stemmer = PorterStemmer()

def remove_stopwords_and_stem(text):
    tokenized_text = word_tokenize(text)
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word not in stop_words]
    return " ".join(filtered_text)

example_text = "This is an example text to demonstrate the removal of stop words and stemming."
processed_text = remove_stopwords_and_stem(example_text)
print(f"Original Text: {example_text}")
print(f"Processed Text: {processed_text}")
```

### Explanation of Changes:
1. **Import Statement**: Changed the import from `LancasterStemmer` to `PorterStemmer`.
2. **Stemmer Instance**: Created an instance of `PorterStemmer` instead of `LancasterStemmer`.

### Expected Output:
When you run the modified code, you should see the processed text reflecting the stemming done by the Porter Stemmer. The output will look something like this:

```
Original Text: This is an example text to demonstrate the removal of stop words and stemming.
Processed Text: thi exampl text demonstr remov stop word stem .
```

### Differences in Processed Text:
The main difference you might observe between the outputs of `LancasterStemmer` and `PorterStemmer` is in how they handle certain words. The Porter Stemmer is generally more conservative and may produce different stemmed forms compared to the Lancaster Stemmer, which is more aggressive in its stemming approach. 
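A side-by-side run makes the comparison concrete. This sketch stems the same words with both classes from `nltk.stem`; the word list is an arbitrary choice taken from the example sentence, and no particular Lancaster output is claimed here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Arbitrary sample words from the example sentence
words = ["example", "demonstrate", "removal", "stemming"]

porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]

# Print one row per word so the two stems can be compared directly
for word, p_stem, l_stem in zip(words, porter_stems, lancaster_stems):
    print(f"{word:12} porter={p_stem:10} lancaster={l_stem}")
```

Rows where the two columns differ show where the more aggressive Lancaster rules cut further than Porter's.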

If you run the code, you can compare the outputs to see how the two stemming algorithms differ in their results.

## Removing Stop Words and Punctuation from Text

Well done on completing the first practice, Stellar Navigator! Now it's time to put your skills further to the test.

The skeleton code for this task is already written for you. You just need to fill in the blank spots (____) to make the code run without errors.

First, you'll need to create a set of English stop words.
Then, tokenize the sample text using an appropriate function from the NLTK tokenization module.
Take on this challenge and let's see how you do!

```python
from nltk.tokenize import casual_tokenize
from nltk.corpus import stopwords
import string

stop_words = set(____.____('english'))
# Adding string punctuation to stop words to filter them out
stop_words.update(string.punctuation)

ex_text = "Here's a sample sentence to remove stop words from. It has generic and specific words."

# Tokenize the sample sentence with casual_tokenize
ex_words = _____(ex_text)

# Remove stop words and punctuation from the list of tokenized words
filtered_words = [word for word in ex_words  if not word.lower() in stop_words]

print("Filtered words:", filtered_words)

```

To complete the provided code, you need to fill in the blanks with the appropriate functions and methods from the NLTK library. Here’s the completed code with the necessary modifications:

```python
from nltk.tokenize import casual_tokenize
from nltk.corpus import stopwords
import string

# Create a set of English stop words
stop_words = set(stopwords.words('english'))

# Adding string punctuation to stop words to filter them out
stop_words.update(string.punctuation)

ex_text = "Here's a sample sentence to remove stop words from. It has generic and specific words."

# Tokenize the sample sentence with casual_tokenize
ex_words = casual_tokenize(ex_text)

# Remove stop words and punctuation from the list of tokenized words
filtered_words = [word for word in ex_words if word.lower() not in stop_words]

print("Filtered words:", filtered_words)
```

### Explanation of the Filled Blanks:
1. **Creating the Set of Stop Words**:
   - `stop_words = set(stopwords.words('english'))`: This line creates a set of English stop words using the `stopwords` module from NLTK.

2. **Tokenizing the Sample Text**:
   - `ex_words = casual_tokenize(ex_text)`: This line tokenizes the sample text using the `casual_tokenize` function from the NLTK tokenization module.
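The line `stop_words.update(string.punctuation)` works because `set.update` iterates over its argument, so passing a string adds each character as a separate element. A short sketch of that behavior, with `demo_stop_words` as a stand-in for the NLTK set:

```python
import string

demo_stop_words = {"a", "the"}  # stand-in for the NLTK stop word set

# update() iterates over the string, adding one punctuation character at a time
demo_stop_words.update(string.punctuation)

has_period = "." in demo_stop_words
has_ellipsis = "..." in demo_stop_words
print(has_period, has_ellipsis)  # True False
```

Only single punctuation characters are added this way, so a multi-character token such as "..." would still pass through the filter.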

### Expected Output:
When you run the completed code, you should see the filtered words printed out, which will exclude any stop words and punctuation from the original text. The output will look something like this:

```
Filtered words: ['sample', 'sentence', 'remove', 'stop', 'words', 'generic', 'specific', 'words']
```
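The filtered list you get may also contain "Here's". `casual_tokenize` keeps contractions together as single tokens, and whether "here's" counts as a stop word depends on the stop word list shipped with your NLTK version. The sketch below only inspects the tokenizer's output on a shortened sentence and assumes nothing about the stop word list.

```python
from nltk.tokenize import casual_tokenize

# casual_tokenize is regex-based and needs no downloaded data
tokens = casual_tokenize("Here's a sample sentence.")
print(tokens)

# Check whether the contraction survived as one token
contraction_kept = "Here's" in tokens
print("Contraction kept as one token:", contraction_kept)
```

If the contraction should be removed as well, adding its lowercase form to `stop_words` is one option.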

### Note:
Make sure you have the necessary NLTK resources downloaded, especially the stop words. If you encounter any errors related to missing resources, you can download them using:

```python
import nltk
nltk.download('stopwords')
```

This will ensure that the `stopwords` corpus is available for use in your code.

## Stemming Words with PorterStemmer

Stellar work so far, Space Voyager! Now it's time to hone your skills in stemming. Fill in the blank spots (____) to make the code functional. Stem the provided words, and print the stemmed versions. Let's get processing!

```python
from nltk.stem import ____

# List of sample words
example_words = ["connection", "connections", "connective", "connected", "connecting", "connection"]

# Create object of the Porter Stemmer Class
stemmer = ____

# Stem each word in the list of words
stemmed_words = [stemmer.____(word) for word in example_words]

print("Stemmed words: ", stemmed_words)

```

To complete the provided code for stemming words using the `PorterStemmer`, you need to fill in the blanks with the appropriate class and method from the NLTK library. Here’s the completed code:

```python
from nltk.stem import PorterStemmer  # Fill in the first blank

# List of sample words
example_words = ["connection", "connections", "connective", "connected", "connecting", "connection"]

# Create object of the Porter Stemmer Class
stemmer = PorterStemmer()  # Fill in the second blank

# Stem each word in the list of words
stemmed_words = [stemmer.stem(word) for word in example_words]  # Fill in the third blank

print("Stemmed words: ", stemmed_words)
```

### Explanation of the Filled Blanks:
1. **Importing the PorterStemmer**:
   - `from nltk.stem import PorterStemmer`: This line imports the `PorterStemmer` class from the NLTK library.

2. **Creating an Instance of PorterStemmer**:
   - `stemmer = PorterStemmer()`: This line creates an instance of the `PorterStemmer` class.

3. **Stemming Each Word**:
   - `stemmed_words = [stemmer.stem(word) for word in example_words]`: This line applies the `stem` method of the `PorterStemmer` instance to each word in the `example_words` list.

### Expected Output:
When you run the completed code, you should see the stemmed versions of the provided words printed out. The output will look something like this:

```
Stemmed words:  ['connect', 'connect', 'connect', 'connect', 'connect', 'connect']
```

This output shows that all variations of the word "connect" have been reduced to their root form "connect" by the Porter Stemmer.
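The practical effect is a smaller vocabulary. As a quick sketch, the snippet below counts distinct entries before and after stemming for the same word list:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

example_words = ["connection", "connections", "connective", "connected", "connecting", "connection"]
stemmed_words = [stemmer.stem(word) for word in example_words]

# Compare the number of distinct forms before and after stemming
unique_before = len(set(example_words))
unique_after = len(set(stemmed_words))
print(f"Unique forms before: {unique_before}, after: {unique_after}")  # before: 5, after: 1
```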

## Implementing Stopword Removal and Stemming Function

Celestial Traveler, let’s put our text preprocessing skills to the test. Can you add the missing code inside the function `remove_stopwords_and_stem` to tokenize the given text, remove stop words, and stem the remaining words, so that the function can return the processed text? Give it a shot!

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Defining the stop words
stop_words = set(stopwords.words('english'))

# Creating an instance of PorterStemmer
stemmer = PorterStemmer()

def remove_stopwords_and_stem(text):
    # TODO: Add the code to tokenize the text, remove stopwords and apply stemming
    return " ".join(filtered_text)

example_text = "This is an example text to demonstrate the removal of stop words and stemming."
processed_text = remove_stopwords_and_stem(example_text)
print(f"Original Text: {example_text}")
print(f"Processed Text: {processed_text}")

```

To complete the `remove_stopwords_and_stem` function, you need to add the code that tokenizes the input text, removes the stop words, stems the remaining words, and then returns the processed text. Here’s the completed code:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Defining the stop words
stop_words = set(stopwords.words('english'))

# Creating an instance of PorterStemmer
stemmer = PorterStemmer()

def remove_stopwords_and_stem(text):
    # Tokenize the text
    tokenized_text = word_tokenize(text)
    
    # Remove stop words and apply stemming
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word.lower() not in stop_words]
    
    return " ".join(filtered_text)

example_text = "This is an example text to demonstrate the removal of stop words and stemming."
processed_text = remove_stopwords_and_stem(example_text)
print(f"Original Text: {example_text}")
print(f"Processed Text: {processed_text}")
```

### Explanation of the Added Code:
1. **Tokenization**:
   - `tokenized_text = word_tokenize(text)`: This line tokenizes the input text into individual words.

2. **Removing Stop Words and Stemming**:
   - `filtered_text = [stemmer.stem(word) for word in tokenized_text if word.lower() not in stop_words]`: This line iterates over each tokenized word, checks if it is not in the set of stop words (case insensitive), and applies stemming using the `stem` method of the `PorterStemmer` instance.

3. **Returning Processed Text**:
   - `return " ".join(filtered_text)`: This line joins the filtered and stemmed words into a single string and returns it.

### Expected Output:
When you run the completed code, you should see the original text and the processed text printed out. The output will look something like this:

```
Original Text: This is an example text to demonstrate the removal of stop words and stemming.
Processed Text: exampl text demonstr remov stop word stem .
```

## Cleaning and Processing the First Newsgroup Article

Good job, Stellar Navigator! Could you now define the stop words using NLTK's predefined English list and instantiate the `PorterStemmer`? Then, in the function `remove_stopwords_and_stem`, implement the tokenization of the text, remove stop words, apply stemming, and return the processed text. After doing so, you will preprocess the first newsgroup article and display it. The stars are in your favor!

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.datasets import fetch_20newsgroups

# TODO: Define the stop words and instantiate the PorterStemmer

def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces
    text = remove_stopwords_and_stem(text)

    return text

def remove_stopwords_and_stem(text):
    # TODO: Implement the process to tokenize the text, remove stopwords and apply stemming

    return " ".join(filtered_text)

# Fetching 20 newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='all')['data'][:1]

# Process and display the first newsgroup article
cleaned_data = clean_text(newsgroups_data[0])
print("First newsgroup article after cleaning and textual preprocessing:")
print(cleaned_data)
```

Abacus AI ChatLLM:
Sure! Let's complete the code by defining the stop words and instantiating the `PorterStemmer`. We will also implement the `remove_stopwords_and_stem` function to tokenize the text, remove stop words, apply stemming, and return the processed text. Here’s the complete code:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.datasets import fetch_20newsgroups

# Define the stop words and instantiate the PorterStemmer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces
    text = remove_stopwords_and_stem(text)

    return text

def remove_stopwords_and_stem(text):
    # Tokenize the text
    tokenized_text = word_tokenize(text)
    
    # Remove stop words and apply stemming
    filtered_text = [stemmer.stem(word) for word in tokenized_text if word.lower() not in stop_words]
    
    return " ".join(filtered_text)

# Fetching 20 newsgroups dataset
newsgroups_data = fetch_20newsgroups(subset='all')['data'][:1]

# Process and display the first newsgroup article
cleaned_data = clean_text(newsgroups_data[0])
print("First newsgroup article after cleaning and textual preprocessing:")
print(cleaned_data)
```

### Explanation of the Code:
1. **Stop Words and Stemmer Initialization**:
   - `stop_words = set(stopwords.words('english'))`: This line defines a set of English stop words using NLTK's predefined list.
   - `stemmer = PorterStemmer()`: This line creates an instance of the `PorterStemmer`.

2. **Text Cleaning Function**:
   - The `clean_text` function processes the input text by converting it to lowercase, removing email addresses, URLs, punctuation, digits, and extra spaces. Finally, it calls `remove_stopwords_and_stem` to further process the text.

3. **Stop Words Removal and Stemming Function**:
   - The `remove_stopwords_and_stem` function tokenizes the cleaned text, removes stop words, applies stemming to the remaining words, and returns the processed text as a single string.

4. **Fetching and Processing Newsgroup Data**:
   - The code fetches the first article from the 20 Newsgroups dataset and processes it using the `clean_text` function. The cleaned article is then printed.
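To make the regex steps in `clean_text` easier to follow, here is a sketch that applies the same substitutions to a short made-up string, without the stop word and stemming stage. The function name `basic_clean` and the sample text are hypothetical; only the standard library is needed.

```python
import re

def basic_clean(text):
    """Apply the same regex steps as clean_text, minus stop words and stemming."""
    text = text.lower()                     # lowercase everything
    text = re.sub(r'\S*@\S*\s?', '', text)  # remove email addresses
    text = re.sub(r'http\S+', '', text)     # remove URLs
    text = re.sub(r'\W', ' ', text)         # punctuation and symbols become spaces
    text = re.sub(r'\d', ' ', text)         # digits become spaces
    text = re.sub(r'\s\s+', ' ', text)      # collapse runs of whitespace
    return text

sample = "Contact John.Doe@Example.com or visit http://example.com/page now! Call 555-1234."
cleaned = basic_clean(sample)
print(cleaned.split())  # ['contact', 'or', 'visit', 'now', 'call']
```

The order matters: email addresses and URLs are removed before `\W` is applied, because once punctuation is replaced by spaces those patterns can no longer be matched.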

### Expected Output:
When you run the completed code, you should see the first newsgroup article printed after cleaning and preprocessing. The output will show the text with stop words removed and words stemmed, making it more suitable for further analysis or modeling. 

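Make sure you have the necessary NLTK resources downloaded, especially the stop words and the Punkt tokenizer data. If you encounter any errors related to missing resources, you can download them as shown below. On recent NLTK releases, `word_tokenize` may instead ask for the `punkt_tab` resource, which is downloaded the same way: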

```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
```