# 2.3 Stopwords

In this lesson we'll use the NLTK package to remove stop words from text.

Stop words are common words in a language that carry little meaning on their own, e.g. "and", "of", "a", "to".

We remove these words to reduce noise and complexity in the data. Because they add little meaning to the text, removing them leaves a smaller, cleaner dataset. A smaller dataset speeds up processing and can improve accuracy for some machine learning tasks.
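
As a rough illustration of the size reduction, the sketch below counts tokens before and after filtering. The names `sample_stopwords` and `sample_text` are hand-written stand-ins made up for this example, not the NLTK list, so the exact numbers are only illustrative.

```python
# Hypothetical stand-in for a real stop word list, kept tiny so the
# example runs without downloading anything.
sample_stopwords = {"and", "of", "a", "to", "the", "in"}

sample_text = "the cat sat in a box and went to sleep in the corner of the room"

# Split on whitespace, then keep only tokens that are not stop words
tokens = sample_text.split()
kept = [t for t in tokens if t not in sample_stopwords]

print(len(tokens), "tokens before filtering")
print(len(kept), "tokens after filtering")
print(kept)
```

On this toy sentence more than half of the tokens are dropped; how much real text shrinks depends on the corpus and on the stop word list used.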

In [None]:
# Import packages
import nltk
from nltk.corpus import stopwords

# Download the stop word lists (only needed once per environment)
nltk.download('stopwords')

In [11]:
# Save the English stop word list in a variable
en_stopwords = stopwords.words('english')

In [12]:
# Print the list of stop words to see what we will be removing
print(en_stopwords)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [13]:
sentence = "it was too far to go to the shop and he did not want her to walk"

In [None]:
# Filter out stop words: split on whitespace, keep the other words, rejoin
sentence_no_stopwords = ' '.join([word for word in sentence.split() if word not in en_stopwords])
print(sentence_no_stopwords)

far go shop want walk
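
The filter above checks each word against a list, which is scanned item by item. For longer texts, one common variation is to convert the list to a set once, since set membership tests take constant time on average. The sketch below is one way to do that; `remove_stopwords`, `demo_stopwords`, and `demo_sentence` are hypothetical names, and the short list stands in for `en_stopwords` so the example runs without NLTK.

```python
def remove_stopwords(text, stopword_list):
    """Return text without the whitespace-separated tokens found in stopword_list."""
    # Build the set once; membership tests on a set are O(1) on average
    stopword_set = set(stopword_list)
    return " ".join(word for word in text.split() if word not in stopword_set)


# Hypothetical short list standing in for en_stopwords
demo_stopwords = ["it", "was", "too", "to", "the", "and", "he", "did", "not", "her"]
demo_sentence = "it was too far to go to the shop and he did not want her to walk"

print(remove_stopwords(demo_sentence, demo_stopwords))
# far go shop want walk
```

The result is the same as with the list; only the lookup cost changes, which matters mostly when filtering many documents.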


In [15]:
# Remove "did" and "not" from the stop word list so the negation is kept
en_stopwords.remove("did")
en_stopwords.remove("not")

In [16]:
# Add custom stop words
en_stopwords.append("go")

In [None]:
# Filter the sentence again with the customized list
sentence_no_stopwords_custom = ' '.join([word for word in sentence.split() if word not in en_stopwords])
print(sentence_no_stopwords_custom)

far shop did not want walk
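
Note that `list.remove()` raises a `ValueError` if the word is not in the list, and that `remove()` and `append()` change the list in place. A possible alternative, sketched below, builds a new customized set with set operations and leaves the original untouched. The helper `customize_stopwords` and its `keep` and `extra` parameters are hypothetical names, and the short list stands in for `stopwords.words('english')`.

```python
def customize_stopwords(base, keep=(), extra=()):
    """Return a new set: base minus the words in keep, plus the words in extra."""
    # Set difference silently ignores words in keep that are not in base
    return (set(base) - set(keep)) | set(extra)


# Hypothetical short list standing in for stopwords.words('english')
base_stopwords = ["it", "was", "too", "to", "the", "and", "he", "did", "not", "her"]

# "never" is not in base_stopwords, but no error is raised
custom = customize_stopwords(base_stopwords, keep=["did", "not", "never"], extra=["go"])

demo_sentence = "it was too far to go to the shop and he did not want her to walk"
result = " ".join(word for word in demo_sentence.split() if word not in custom)
print(result)
# far shop did not want walk
```

Because `base_stopwords` is never modified, the same base list can be reused to build different customized sets for different tasks.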


## Another Example

In [21]:
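# Lowercase the sentence so it matches the lowercase stop word list
my_sentence = "She didn't go to school because she felt sick and tired"
my_sentence = my_sentence.lower()

# Start from a fresh copy of the English list, keep the negation, drop "felt"
my_stopwords = stopwords.words('english')
my_stopwords.remove("didn't")
my_stopwords.append("felt")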

cleaned = ' '.join([word for word in my_sentence.split() if word not in my_stopwords])
print("Cleaned: ")
print(cleaned)

Cleaned: 
didn't go school sick tired


## What I Learned

- **Stop word lists** are commonly used to remove non-informative words from text data.

- You can customize stopword lists by **removing or adding** items for task-specific filtering.

- Be careful with negations like "not" or "didn't": removing them can flip the meaning of a sentence and hurt tasks like sentiment analysis.

- Using split() and a list comprehension is a simple way to filter stop words from text, but split() only breaks on whitespace, so capitalization and punctuation need separate handling.
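
One limitation of the split() approach is that tokens such as "She" or "school," do not match the lowercase, punctuation-free entries in a stop word list. The sketch below shows one possible workaround using a regular expression; `tokenize` and `demo_stopwords` are hypothetical names, the small set stands in for the NLTK list, and the pattern only handles plain ASCII letters and straight apostrophes.

```python
import re


def tokenize(text):
    """Lowercase text and return word tokens, keeping internal apostrophes."""
    # Matches runs of letters, optionally joined by apostrophes (e.g. "didn't")
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


# Hypothetical small set standing in for the NLTK stop word list
demo_stopwords = {"she", "to", "because", "and", "the"}
text = "She didn't go to school, because she felt sick. And tired!"

# Naive whitespace split: "She" and "And" slip through, punctuation sticks
naive = [t for t in text.split() if t not in demo_stopwords]
print(naive)

# Regex tokenization: case and punctuation no longer block the match
filtered = [t for t in tokenize(text) if t not in demo_stopwords]
print(filtered)
```

With the naive split, "She" and "And" survive and "school," keeps its comma; with the regex tokenizer the remaining tokens are "didn't", "go", "school", "felt", "sick", and "tired".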