# ***Stopwords***

Stopwords are common words (like "the," "is," "in," "and," "of," etc.) that appear frequently in a language but carry little meaning on their own in many Natural Language Processing (NLP) tasks. These words are often removed during text preprocessing to improve efficiency and keep the focus on more informative terms.

Why Remove Stopwords?

Reduce Dimensionality: Eliminating stopwords decreases the number of unique tokens, making models more efficient.

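Improve Model Performance: Many NLP tasks (like text classification and topic modeling) benefit from focusing on more relevant words. This is not universal: NLTK's English list includes negations such as "not" and "no", so removing stopwords blindly can hurt tasks like sentiment analysis.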

Speed Up Search: In search engines, removing stopwords shrinks the index, which can speed up indexing and retrieval.
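
The dimensionality point can be illustrated with a small sketch. It uses plain Python and a tiny hand-picked stopword set (`SAMPLE_STOPWORDS` is an assumption made for this example, not NLTK's actual list) to compare token and vocabulary counts before and after filtering.

```python
# Tiny illustrative stopword set (an assumption for this sketch, not NLTK's full list).
SAMPLE_STOPWORDS = {"the", "is", "in", "and", "of", "a", "to"}

text = "the cat is in the garden and the dog is in the house"
tokens = text.split()  # naive whitespace tokenization, enough for this toy sentence

# Keep only the tokens that are not in the stopword set.
filtered = [t for t in tokens if t not in SAMPLE_STOPWORDS]

# The vocabulary is the set of unique tokens.
vocab_before = set(tokens)
vocab_after = set(filtered)

print("tokens:", len(tokens), "->", len(filtered))
print("vocabulary:", len(vocab_before), "->", len(vocab_after))
```

On a real corpus the relative reduction in vocabulary size is usually much smaller than in this toy sentence, because stopwords are few in number but very frequent.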


To remove stopwords from our corpus, we can use NLTK, which ships ready-made stopword lists for English and several other languages.

In [None]:
corpus = """Natural Language Processing (NLP) is a fascinating field of artificial intelligence. 
It enables computers to understand, interpret, and generate human language. 
Stopwords are frequently removed during text preprocessing to improve the efficiency of NLP models. 
Machine learning techniques, along with deep learning, have significantly improved the accuracy of language models."""

In [None]:
import nltk

In [17]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("punkt_tab")
nltk.download("wordnet")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [23]:
# First, split the corpus into sentences.
# Note: despite its name, `words` holds a list of sentences.
words = sent_tokenize(corpus)
# Also initialize the lemmatizer.
lemmatizer = WordNetLemmatizer()

In [28]:
# Store the English stopwords in a set for fast membership checks.
stpWords = set(stopwords.words('english'))
for i in range(len(words)):
    # Split the current sentence into word tokens.
    sentence = word_tokenize(words[i])
    # Drop stopwords, then lemmatize the remaining tokens as verbs (pos="v").
    # The comparison is case-sensitive, so capitalized tokens such as "It" are kept.
    sentence = [lemmatizer.lemmatize(word, pos="v") for word in sentence if word not in stpWords]
    # Replace the original sentence with its processed tokens, joined by spaces.
    words[i] = ' '.join(sentence)
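
The loop above drops every token found in the stopword set. For sentiment-style tasks it may be worth keeping negations, since NLTK's English list contains words like "not" and "no". The following is a minimal sketch of that idea, assuming a small hand-written set (`SAMPLE_STOPWORDS` and `NEGATIONS` are illustrative names, not part of NLTK):

```python
# Illustrative stopword set (an assumption; NLTK's real list is much longer).
SAMPLE_STOPWORDS = {"the", "is", "not", "no", "a", "this", "was"}

# Words we choose to keep because they flip the meaning of a sentence.
NEGATIONS = {"not", "no", "nor", "never"}

# Customized set: the original stopwords minus the negations.
custom_stopwords = SAMPLE_STOPWORDS - NEGATIONS

tokens = "this movie was not good".split()

plain = [t for t in tokens if t not in SAMPLE_STOPWORDS]
keep_negations = [t for t in tokens if t not in custom_stopwords]

print(plain)           # negation is lost
print(keep_negations)  # negation is preserved
```

With the real NLTK list, the same set difference could be applied to `stpWords` before running the loop.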
    

In [None]:
# Finally, join the processed sentences back into a single string.
corpus = ' '.join(words)
print(corpus)  # the corpus is now lemmatized, with stopwords removed

Natural Language Processing ( NLP ) fascinate field artificial intelligence . It enable computers understand , interpret , generate human language . Stopwords frequently remove text preprocessing improve efficiency NLP model . Machine learn techniques , along deep learn , significantly improve accuracy language model .
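
The output above still contains "It" and standalone punctuation tokens, because the stopword check is case-sensitive and punctuation is not in the stopword list. One possible refinement is sketched below in plain Python. The helper `remove_stopwords` and the small `SAMPLE_STOPWORDS` set are hypothetical names introduced for this example; in the notebook, `stpWords` could be passed in instead.

```python
import string

# Illustrative stopword set (an assumption; not NLTK's full list).
SAMPLE_STOPWORDS = {"it", "to", "and", "the", "of", "is", "a"}


def remove_stopwords(tokens, stopword_set):
    """Drop stopwords (case-insensitively) and punctuation-only tokens."""
    kept = []
    for tok in tokens:
        # Lowercase only for the comparison, so the original casing is preserved.
        if tok.lower() in stopword_set:
            continue
        # Skip tokens made up entirely of punctuation, such as "," or ".".
        if all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept


tokens = ["It", "enables", "computers", "to", "understand", ",",
          "interpret", ",", "and", "generate", "human", "language", "."]
cleaned = remove_stopwords(tokens, SAMPLE_STOPWORDS)
print(cleaned)
```

Lowercasing only for the comparison keeps acronyms such as "NLP" intact in the output while still catching sentence-initial stopwords.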
