# Stopwords

Stopwords are common words that are usually filtered out during text processing, especially in Natural Language Processing (NLP). They carry little meaning on their own and are often removed to improve the efficiency, and sometimes the accuracy, of text analysis applications such as search engines, sentiment analysis, and other text mining techniques.

Examples of stopwords include "the," "is," "in," "and," "a," and "of": words that appear frequently in text but contribute little to its meaning or content.

## Why Remove Stopwords?
- Reduce noise: Stopwords do not add much value to the understanding of a sentence or document.
- Improve performance: Removing stopwords helps to reduce the size of the text data and focus on the more meaningful words.
- Save processing power: Algorithms can process documents faster and more efficiently when irrelevant words are excluded.
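
To get a feel for how much text stopwords account for, the share of tokens that would be dropped can be measured directly. The sketch below is only illustrative: `STOP_WORDS` is a small hand-written set and `reduction_ratio` is a hypothetical helper, neither of which comes from NLTK.

```python
# Illustrative stopword set (an assumption for this sketch, not NLTK's full list)
STOP_WORDS = {"the", "is", "in", "and", "a", "of", "on", "to"}

def reduction_ratio(tokens, stop_words):
    """Return the fraction of tokens that would be dropped as stopwords."""
    if not tokens:
        return 0.0
    dropped = sum(1 for tok in tokens if tok.lower() in stop_words)
    return dropped / len(tokens)

# Simple whitespace tokenization keeps the example dependency-free
tokens = "The cat is on the mat and the dog is in the garden".split()
ratio = reduction_ratio(tokens, STOP_WORDS)
print(f"{len(tokens)} tokens, {ratio:.0%} are stopwords")
```

On real corpora the ratio depends on the stopword list and the tokenizer, so it is worth measuring rather than assuming.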

In [1]:
from nltk.corpus import stopwords

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\satee\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [4]:
str(stopwords.words('english'))

'[\'i\', \'me\', \'my\', \'myself\', \'we\', \'our\', \'ours\', \'ourselves\', \'you\', "you\'re", "you\'ve", "you\'ll", "you\'d", \'your\', \'yours\', \'yourself\', \'yourselves\', \'he\', \'him\', \'his\', \'himself\', \'she\', "she\'s", \'her\', \'hers\', \'herself\', \'it\', "it\'s", \'its\', \'itself\', \'they\', \'them\', \'their\', \'theirs\', \'themselves\', \'what\', \'which\', \'who\', \'whom\', \'this\', \'that\', "that\'ll", \'these\', \'those\', \'am\', \'is\', \'are\', \'was\', \'were\', \'be\', \'been\', \'being\', \'have\', \'has\', \'had\', \'having\', \'do\', \'does\', \'did\', \'doing\', \'a\', \'an\', \'the\', \'and\', \'but\', \'if\', \'or\', \'because\', \'as\', \'until\', \'while\', \'of\', \'at\', \'by\', \'for\', \'with\', \'about\', \'against\', \'between\', \'into\', \'through\', \'during\', \'before\', \'after\', \'above\', \'below\', \'to\', \'from\', \'up\', \'down\', \'in\', \'out\', \'on\', \'off\', \'over\', \'under\', \'again\', \'further\', \'then\', 

In [5]:
# Sample text
text = "This is a simple example demonstrating the use of stopwords in NLTK."

In [12]:
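import nltk
from nltk.tokenize import word_tokenize
# word_tokenize needs the Punkt tokenizer data
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
# Tokenize the text
words = word_tokenize(text)
words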

['This',
 'is',
 'a',
 'simple',
 'example',
 'demonstrating',
 'the',
 'use',
 'of',
 'stopwords',
 'in',
 'NLTK',
 '.']

In [10]:
# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

In [11]:
# Remove stopwords from the tokenized words
filtered_words = [word for word in words if word.lower() not in stop_words]

In [13]:
filtered_words

['simple', 'example', 'demonstrating', 'use', 'stopwords', 'NLTK', '.']

After removing stopwords, we are left with the more meaningful words. Note that the punctuation token `'.'` is still present, because it is not in the stopword list.
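
If punctuation tokens should be dropped as well, one option is to keep only tokens that contain at least one letter or digit. This is a minimal sketch that reuses the filtered list shown above; the name `content_words` is introduced here for illustration.

```python
# Tokens as produced by the stopword filter above
filtered_words = ['simple', 'example', 'demonstrating', 'use', 'stopwords', 'NLTK', '.']

# Keep only tokens that contain at least one letter or digit,
# which removes pure punctuation such as '.'
content_words = [w for w in filtered_words if any(ch.isalnum() for ch in w)]
print(content_words)
```

Checking for any alphanumeric character, rather than calling `str.isalpha()` on the whole token, keeps tokens such as hyphenated words or numbers that a stricter test would discard.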

## When to Remove Stopwords:
- **Text Classification:** Removing stopwords can improve the performance of models by focusing on the important words.
- **Sentiment Analysis:** Stopwords can dilute the sentiment score, so removing them can help scoring. Negations such as "not" are often on stopword lists, however, and usually need to be kept.
- **Information Retrieval/Search Engines:** Helps refine search results by ignoring frequently occurring but irrelevant words.
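
For sentiment analysis, a common adjustment is to leave negation words out of the stopword set before filtering. The sketch below assumes a small hand-written `base_stop_words` set and a hypothetical `NEGATIONS` set; with NLTK, the base would typically be `set(stopwords.words('english'))`.

```python
# Small illustrative stopword set (an assumption for this sketch)
base_stop_words = {"this", "is", "a", "the", "not", "no", "at", "all"}

# Words that can flip the polarity of a phrase (hypothetical, extend as needed)
NEGATIONS = {"not", "no", "nor", "never"}

# Stopword set that leaves negations in the text
sentiment_stop_words = base_stop_words - NEGATIONS

tokens = "this movie is not good at all".split()
naive = [t for t in tokens if t.lower() not in base_stop_words]
filtered = [t for t in tokens if t.lower() not in sentiment_stop_words]

print(naive)     # negation lost
print(filtered)  # negation preserved
```

With the naive filter the phrase collapses to "movie good", which reads as positive; keeping "not" preserves the original meaning.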

## Key Points:
- **Removing stopwords** improves the relevance of text data by filtering out less meaningful words.
- **Stopword lists** are language-specific, and NLTK provides lists for multiple languages.
- You can **customize the stopwords** list according to your needs (e.g., adding domain-specific stopwords).
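
Customizing the list usually amounts to a set union with domain-specific terms. The following is a minimal sketch: `build_stop_words` is a hypothetical helper, and both `base` and `domain_extra` are made-up examples rather than anything shipped with NLTK.

```python
def build_stop_words(base, extra=()):
    """Return a new lowercase stopword set containing `base` plus `extra`."""
    return {w.lower() for w in base} | {w.lower() for w in extra}

# Illustrative base list; in practice this could be stopwords.words('english')
base = {"the", "is", "a", "of", "in"}

# Hypothetical domain-specific terms for a medical corpus
domain_extra = {"patient", "hospital"}

custom = build_stop_words(base, extra=domain_extra)

tokens = "The patient is in the hospital with a fever".split()
filtered = [t for t in tokens if t.lower() not in custom]
print(filtered)
```

Building a new set instead of modifying the base in place keeps the original list reusable for other tasks.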