# Stopwords

Stopwords are **common words** in a language that usually **don’t carry significant meaning** in text analysis.

Examples in English:
`["is", "am", "are", "the", "in", "on", "at", "a", "of", "this", "that"]`

👉 In the sentence:
`"The cat is on the mat."`
The important words are: `["cat", "mat"]`
Words like `"the", "is", "on"` are stopwords.
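
The sentence above can be filtered in a few lines of plain Python. This is only a minimal sketch: `STOPWORDS` is the small example list from this section rather than a real library list, and `remove_stopwords` is a hypothetical helper name.

```python
# Tiny illustrative stopword set, copied from the example list above
# (real libraries ship much longer lists).
STOPWORDS = {"is", "am", "are", "the", "in", "on", "at", "a", "of", "this", "that"}

def remove_stopwords(sentence):
    """Lowercase each word, strip surrounding punctuation, drop stopwords."""
    tokens = [w.strip(".,!?").lower() for w in sentence.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

print(remove_stopwords("The cat is on the mat."))  # ['cat', 'mat']
```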

---

## 🔹 Why Remove Stopwords?

* They **don’t add much meaning** to most NLP tasks (like classification, topic modeling).
* They **inflate vocabulary size and computation**, often without improving accuracy.
* Removing them helps models focus on meaningful words.
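
To get a rough feel for the size reduction, the sketch below counts tokens and distinct words before and after filtering. The three-sentence `corpus` and the short `STOPWORDS` set are made up for illustration; real corpora will give different ratios.

```python
# Hypothetical toy corpus and stopword set, for illustration only.
STOPWORDS = {"is", "am", "are", "the", "in", "on", "at", "a", "of", "this", "that"}
corpus = [
    "the cat is on the mat",
    "this is a cat",
    "that dog is in the house",
]

# Flatten the corpus into one token list, then filter it.
tokens = [w for doc in corpus for w in doc.split()]
kept = [w for w in tokens if w not in STOPWORDS]

print("tokens:", len(tokens), "->", len(kept))                  # tokens: 16 -> 5
print("vocabulary:", len(set(tokens)), "->", len(set(kept)))    # vocabulary: 11 -> 4
```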

⚠️ **BUT**: In some tasks (e.g., translation, question answering), stopwords **must be kept**, because they affect grammar and meaning.
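
A small sketch of how meaning can be lost: once a negation is treated as a stopword, two opposite sentences become indistinguishable. The `STOPWORDS` set here is a hypothetical hand-picked one (NLTK's English list does include `"not"`), and `content_words` is an illustrative helper.

```python
# Hypothetical stopword set that, like NLTK's English list, contains "not".
STOPWORDS = {"this", "is", "the", "a", "not"}

def content_words(sentence):
    """Return the lowercase words that are not in STOPWORDS."""
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

positive = content_words("This movie is good")
negative = content_words("This movie is not good")

print(positive)              # ['movie', 'good']
print(negative)              # ['movie', 'good']
print(positive == negative)  # True: the negation has been lost
```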

---

## 🔹 How to Remove Stopwords?

### 1. **Using NLTK**

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer data and the stopword lists (only needed once)
nltk.download("punkt")      # newer NLTK releases may ask for "punkt_tab" instead
nltk.download("stopwords")

text = "This is a simple example showing the removal of stopwords in NLP."

# Tokenize
words = word_tokenize(text)

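# Build the stopword set once (fast lookups), then drop stopwords and punctuation tokens
stop_words = set(stopwords.words("english"))
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]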

print("Original:", words)
print("After Stopword Removal:", filtered_words)
```

✅ **Output**

```
Original: ['This', 'is', 'a', 'simple', 'example', 'showing', 'the', 'removal', 'of', 'stopwords', 'in', 'NLP', '.']
After Stopword Removal: ['simple', 'example', 'showing', 'removal', 'stopwords', 'NLP']
```

---

### 2. **Using spaCy**

```python
import spacy

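# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a simple example showing the removal of stopwords in NLP.")

# is_stop flags stopwords; is_punct is needed to drop the trailing "."
filtered_words = [token.text for token in doc if not token.is_stop and not token.is_punct]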
print(filtered_words)
```

✅ **Output**

```
['simple', 'example', 'showing', 'removal', 'stopwords', 'NLP']
```

---

### 3. **Using Custom Stopwords List**

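You can add words to or remove words from the default list. This snippet reuses `stopwords` and `words` from the NLTK example above:

```python
custom_stopwords = set(stopwords.words("english"))
custom_stopwords.update(["example", "showing"])  # add extra words
custom_stopwords.discard("not")                  # remove a word you want to keep

filtered_words = [w for w in words if w.lower() not in custom_stopwords]
print(filtered_words)
```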

---

**Summary**

* **Stopwords** = common words with little meaning.
* Removing them simplifies text and can improve speed or accuracy in many NLP tasks.
* Different libraries (NLTK, spaCy, sklearn) provide stopword lists.
* You can always create a **custom list** depending on your dataset.



---

**Full Example (NLTK)**

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required resources (run once)
nltk.download("punkt")
nltk.download("stopwords")

# Example text
text = "This is a great movie and I really enjoyed it!"

# Tokenize text into words
words = word_tokenize(text)

# Load English stopwords
stop_words = set(stopwords.words("english"))

# Remove stopwords
filtered_words = [w for w in words if w.lower() not in stop_words and w.isalpha()]

print("Original Text:", text)
print("Tokenized Words:", words)
print("After Stopword Removal:", filtered_words)
```

✅ **Output**

```
Original Text: This is a great movie and I really enjoyed it!
Tokenized Words: ['This', 'is', 'a', 'great', 'movie', 'and', 'I', 'really', 'enjoyed', 'it', '!']
After Stopword Removal: ['great', 'movie', 'really', 'enjoyed']
```