# 🔹 Step 2: Stopword Removal

**Concept: Removing very common words that carry little meaning on their own.**

- (Examples: the, is, of, and, this)


🧠 Why Remove Stopwords?

Let’s say your sentence is:

**“The weather is very nice today and the sun is shining.”**

After removing stopwords:

**“weather nice today sun shining”**

- ✅ Makes the text shorter while keeping the words that carry the content
- ✅ Reduces noise for algorithms like TF-IDF or classification models
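
To make the noise reduction concrete, here is a minimal sketch that counts word frequencies before and after filtering the example sentence. It uses plain `str.split()` and a small hand-picked set, `demo_stop_words`, which is an illustrative assumption and not NLTK's full list, so it runs without any downloads.

```python
from collections import Counter

sentence = "The weather is very nice today and the sun is shining."

# Illustrative subset only (assumption); NLTK's English list is much longer.
demo_stop_words = {"the", "is", "very", "and"}

# Lowercase, drop the trailing period, and split on whitespace.
tokens = sentence.lower().rstrip(".").split()

counts_before = Counter(tokens)
counts_after = Counter(t for t in tokens if t not in demo_stop_words)

print("Counts before filtering:", dict(counts_before))
print("Counts after filtering: ", dict(counts_after))
```

Before filtering, the most frequent tokens are "the" and "is" (two occurrences each), which say nothing about the topic. After filtering, only the content words remain.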


✅ We’ll learn:

- How to use NLTK’s stopword list
- How to customize it for your use case

# Redoing Step 1 (Tokenization) Before Removing Stopwords

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "I am going to the store because I need to buy some things for my family and I will be back soon."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:", sentences)

# Word Tokenization
words_tokens = word_tokenize(text)
print("Word Tokenization:", words_tokens)


Sentence Tokenization: ['I am going to the store because I need to buy some things for my family and I will be back soon.']
Word Tokenization: ['I', 'am', 'going', 'to', 'the', 'store', 'because', 'I', 'need', 'to', 'buy', 'some', 'things', 'for', 'my', 'family', 'and', 'I', 'will', 'be', 'back', 'soon', '.']


### 📝 Note:
`word_tokenize` keeps punctuation as separate tokens (which can be useful later).
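
If the punctuation tokens are not needed for a given task, they can be dropped with a simple filter. The sketch below is one possible approach, assuming that punctuation shows up as single-character tokens (as the `'.'` does above). The token list is copied from the output above so the snippet runs on its own.

```python
import string

# Tokens copied from the word_tokenize output shown above.
words_tokens = ['I', 'am', 'going', 'to', 'the', 'store', 'because', 'I',
                'need', 'to', 'buy', 'some', 'things', 'for', 'my', 'family',
                'and', 'I', 'will', 'be', 'back', 'soon', '.']

# Set of single ASCII punctuation characters, e.g. '.', ',', '!'.
punct = set(string.punctuation)

# Keep a token only if it is not one of those punctuation characters.
no_punct = [tok for tok in words_tokens if tok not in punct]

print(no_punct)
```

Multi-character punctuation tokens (for example `'...'`) would pass through this filter; a check such as `tok.isalpha()` is a stricter alternative, at the cost of also dropping numbers.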

# STEP 2: Downloading the STOP WORDS

In [1]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     c:\Users\jsril\anaconda3\envs\nlp_env\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [21]:
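from nltk.corpus import stopwords

# Load NLTK's English stopword list (a plain Python list of lowercase words)
stopwrd = stopwords.words("english")
print("STOP WORDS IN NLTK :\n", stopwrd)

# Keep only the tokens whose lowercase form is not a stopword
filtered_words = [word for word in words_tokens if word.lower() not in stopwrd]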

STOPWORDS IN NLTK:
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she'd", "she'll", "she's", 'should', 'shouldn', "shouldn't", "should've", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', "we'd", "we'll", "we're", 'were', 'weren', "weren't", "we've", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've"]

In [22]:
filtered_words

['going', 'store', 'need', 'buy', 'things', 'family', 'back', 'soon', '.']
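
The `word.lower()` call in the filter matters because every entry in the stopword list is lowercase. The sketch below compares a case-sensitive and a case-insensitive filter on the same tokens. `demo_stop_words` is an assumption made for illustration: it holds only the entries from the list printed above that occur in this sentence, and it is stored as a `set` so membership checks do not scan a whole list.

```python
# Tokens copied from the word_tokenize output shown earlier.
words_tokens = ['I', 'am', 'going', 'to', 'the', 'store', 'because', 'I',
                'need', 'to', 'buy', 'some', 'things', 'for', 'my', 'family',
                'and', 'I', 'will', 'be', 'back', 'soon', '.']

# Illustrative subset of the stopword list printed above (assumption).
demo_stop_words = {"i", "am", "to", "the", "because", "some",
                   "for", "my", "and", "will", "be"}

# Without lowercasing, the capitalized 'I' does not match 'i' and survives.
case_sensitive = [w for w in words_tokens if w not in demo_stop_words]

# With lowercasing, 'I' is compared as 'i' and is removed.
case_insensitive = [w for w in words_tokens if w.lower() not in demo_stop_words]

print("Case-sensitive:  ", case_sensitive)
print("Case-insensitive:", case_insensitive)
```

The case-insensitive result matches `filtered_words` above, while the case-sensitive one still contains three `'I'` tokens.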

# Overall code

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "The weather is very nice today and the sun is shining."
words = word_tokenize(text)

# English stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Original Words:", words)
print("Filtered Words:", filtered_words)
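
The introduction mentions customizing the stopword list for your use case. One way to do that is to add domain-specific words and to take back words you want to keep, such as "not" in sentiment analysis. The sketch below is an illustration under stated assumptions: `build_stop_words` is a hypothetical helper (not part of NLTK), and `base_stop_words` is a tiny stand-in for `set(stopwords.words('english'))` so the snippet runs without the corpus download.

```python
def build_stop_words(base, extra=(), keep=()):
    """Hypothetical helper: return base plus `extra`, minus `keep` (all lowercased)."""
    extra_set = {w.lower() for w in extra}
    keep_set = {w.lower() for w in keep}
    return (set(base) | extra_set) - keep_set

# Stand-in for set(stopwords.words('english')); tiny subset for illustration.
base_stop_words = {"the", "is", "not", "very", "and", "this"}

# Example use case: keep the negation "not", treat "movie" as a stopword.
custom_stop_words = build_stop_words(base_stop_words, extra=["movie"], keep=["not"])

tokens = ["This", "movie", "is", "not", "very", "good"]
filtered = [t for t in tokens if t.lower() not in custom_stop_words]

print("Custom filtered:", filtered)
```

With the unmodified base set, "not" would be removed and the sentence would reduce to "movie good", which reverses its meaning. With the custom set the result is `['not', 'good']`.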
