# ***Stopwords***

Stopwords are common words (like "the," "is," "in," "and," "of," etc.) that appear frequently in a language but usually do not carry significant meaning in Natural Language Processing (NLP) tasks. These words are often removed during text preprocessing to improve efficiency and focus on more meaningful terms.

Why Remove Stopwords?

Reduce Dimensionality: Eliminating stopwords decreases the number of unique tokens, making models more efficient.

Improve Model Performance: Many NLP tasks (like text classification, sentiment analysis) benefit from focusing on more relevant words.

Improve Search Efficiency: In search engines, removing stopwords shrinks the index and can speed up indexing and retrieval.
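The dimensionality point can be illustrated with a minimal sketch. It assumes a small hand-picked stopword set (`SAMPLE_STOPWORDS`, a stand-in for illustration rather than NLTK's full list) and plain whitespace splitting, and compares the vocabulary size before and after filtering.

```python
# Illustrative stopword set; an assumption standing in for a real list such as NLTK's.
SAMPLE_STOPWORDS = {"the", "is", "in", "and", "of", "a", "to"}

text = "the cat is in the garden and the dog is in the house"
tokens = text.split()

# The vocabulary is the set of unique tokens.
vocab_before = set(tokens)
vocab_after = {t for t in tokens if t not in SAMPLE_STOPWORDS}

print(len(tokens), len(vocab_before), len(vocab_after))  # 13 8 4
print(sorted(vocab_after))  # ['cat', 'dog', 'garden', 'house']
```

In this toy sentence, half of the unique tokens are stopwords, so filtering halves the vocabulary a model would have to represent.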


NLTK ships a ready-made English stopword list that we can use to remove stopwords from our corpus.

In [3]:
corpus = """Natural Language Processing (NLP) is a fascinating field of artificial intelligence. 
It enables computers to understand, interpret, and generate human language. 
Stopwords are frequently removed during text preprocessing to improve the efficiency of NLP models. 
Machine learning techniques, along with deep learning, have significantly improved the accuracy of language models."""

In [4]:
import nltk

In [5]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("punkt_tab")
nltk.download("wordnet")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# First, split the corpus into sentences.
# Note: despite its name, `words` holds a list of sentences at this point.
words = sent_tokenize(corpus)
# Also initialize the lemmatizer.
lemmatizer = WordNetLemmatizer()

In [7]:
# Store the English stopwords in a set for fast membership checks.
stpWords = set(stopwords.words('english'))
for i in range(len(words)):
    # Split the current sentence into word tokens.
    sentence = nltk.word_tokenize(words[i])
    # Drop stopwords, then lemmatize the remaining tokens as verbs (pos="v").
    # The check is case-sensitive, so capitalized stopwords such as "It" are kept.
    sentence = [lemmatizer.lemmatize(word, pos="v") for word in sentence if word not in stpWords]
    # Replace the original sentence with the processed tokens joined by spaces.
    words[i] = ' '.join(sentence)

In [8]:
# Finally, join the processed sentences back into a single string.
corpus = ' '.join(words)
print(corpus)  # the corpus is now lemmatized, with stopwords removed

Natural Language Processing ( NLP ) fascinate field artificial intelligence . It enable computers understand , interpret , generate human language . Stopwords frequently remove text preprocessing improve efficiency NLP model . Machine learn techniques , along deep learn , significantly improve accuracy language model .
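In the output above, "It" survives even though "it" is a stopword, because NLTK's English list is all lowercase and the membership test is case-sensitive. Punctuation tokens are kept as well. A minimal sketch of a stricter filter follows; `SAMPLE_STOPWORDS` is a small hand-picked stand-in for `stopwords.words('english')`, `keep` is a hypothetical helper, and the token list is typed in by hand so the snippet runs without any downloads.

```python
import string

# Stand-in for set(stopwords.words('english')); an assumption for illustration.
SAMPLE_STOPWORDS = {"it", "to", "and", "is", "a", "of", "the"}

# Tokens of the second corpus sentence, as word_tokenize returns them.
tokens = ["It", "enables", "computers", "to", "understand", ",",
          "interpret", ",", "and", "generate", "human", "language", "."]

def keep(token):
    """Keep a token unless it is a stopword (ignoring case) or pure punctuation."""
    if token.lower() in SAMPLE_STOPWORDS:
        return False
    if all(ch in string.punctuation for ch in token):
        return False
    return True

filtered = [t for t in tokens if keep(t)]
print(filtered)
# ['enables', 'computers', 'understand', 'interpret', 'generate', 'human', 'language']
```

Lowercasing only for the comparison keeps the original casing of the surviving tokens, which matters if a later step such as POS tagging relies on capitalization.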


# ***Parts of Speech (POS) in NLP***

POS refers to the categories of words based on their grammatical properties. The main categories, with their Penn Treebank tags in parentheses, are:

1. Nouns (NN): Words that refer to people, places, things, or ideas (e.g., "dog", "city", "happiness").
2. Verbs (VB): Words that express actions, events, or states (e.g., "run", "eat", "be").
3. Adjectives (JJ): Words that modify or describe nouns or pronouns (e.g., "big", "happy", "blue").
4. Adverbs (RB): Words that modify or describe verbs, adjectives, or other adverbs (e.g., "quickly", "very", "well").
5. Pronouns (PRP): Words that replace nouns in a sentence (e.g., "he", "she", "it").
6. Prepositions (IN): Words that show relationships between words or phrases (e.g., "in", "on", "at").
7. Conjunctions (CC): Words that connect words, phrases, or clauses (e.g., "and", "but", "or").
8. Interjections (UH): Words that express emotion or feeling (e.g., "oh", "wow", "ouch").
9. Determiners (DT): Words that introduce nouns and indicate whether they are specific or general, including the articles (e.g., "the", "a", "an").
10. Cardinal numbers (CD): Words that express numerical values (e.g., "one", "two", "three").
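The codes in parentheses are Penn Treebank tags, and a tagger usually returns finer-grained variants of them (for example `NNS` for plural nouns or `VBZ` for third-person singular verbs). A minimal sketch of collapsing such tags back to the categories listed above; the `coarse_pos` helper and its prefix table are hypothetical names introduced here for illustration.

```python
# Prefix of a Penn Treebank tag -> coarse category from the list above.
# Each tag matches at most one of these prefixes.
COARSE_BY_PREFIX = [
    ("NN", "Noun"), ("VB", "Verb"), ("JJ", "Adjective"), ("RB", "Adverb"),
    ("PRP", "Pronoun"), ("IN", "Preposition"), ("CC", "Conjunction"),
    ("UH", "Interjection"), ("DT", "Determiner/Article"), ("CD", "Number"),
]

def coarse_pos(tag):
    """Map a fine-grained tag such as 'NNS' to its coarse category."""
    for prefix, name in COARSE_BY_PREFIX:
        if tag.startswith(prefix):
            return name
    return "Other"  # tags outside the ten categories, e.g. 'TO' or punctuation

for tag in ["NNS", "NNP", "VBZ", "JJ", "PRP", "TO"]:
    print(tag, "->", coarse_pos(tag))
```

Tags that fall outside the ten categories, such as `TO` or the punctuation tags, map to "Other" in this sketch.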



NLTK provides built-in functions for finding the POS tag of each word.

In [9]:
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize

nltk.download("punkt_tab")


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [10]:

corpus2 = """Natural Language Processing (NLP) is a fascinating field of artificial intelligence. 
It enables computers to understand, interpret, and generate human language. 
Stopwords are frequently removed during text preprocessing to improve the efficiency of NLP models. 
Machine learning techniques, along with deep learning, have significantly improved the accuracy of language models."""


In [20]:
tokens = word_tokenize(corpus2)
print(tokens)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'artificial', 'intelligence', '.', 'It', 'enables', 'computers', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Stopwords', 'are', 'frequently', 'removed', 'during', 'text', 'preprocessing', 'to', 'improve', 'the', 'efficiency', 'of', 'NLP', 'models', '.', 'Machine', 'learning', 'techniques', ',', 'along', 'with', 'deep', 'learning', ',', 'have', 'significantly', 'improved', 'the', 'accuracy', 'of', 'language', 'models', '.']


In [None]:
from nltk.tag import pos_tag


To use pos_tag, we first need to download the tagger model:

In [14]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

The input to pos_tag must be a list of tokens. Passing a bare string raises an error, so to tag a single word, wrap it in a list and pass that list to pos_tag.
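One way to see why a bare string is a problem: a Python string is itself a sequence of characters, not a sequence of tokens. A minimal sketch of a guard that normalizes the input before tagging; `as_token_list` is a hypothetical helper, not part of NLTK, and the snippet deliberately avoids calling `pos_tag` so it runs without any downloads.

```python
def as_token_list(text_or_tokens):
    """Return a list of tokens: wrap a bare string, copy any other iterable."""
    if isinstance(text_or_tokens, str):
        return [text_or_tokens]
    return list(text_or_tokens)

print(list("hello"))                    # ['h', 'e', 'l', 'l', 'o']
print(as_token_list("hello"))           # ['hello']
print(as_token_list(["Taj", "mahal"]))  # ['Taj', 'mahal']
```

The result of `as_token_list(...)` is always a list, so it could be passed to `pos_tag` regardless of whether the caller supplied one word or many.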

In [19]:
pos_tag(tokens)     

[('Natural', 'JJ'),
 ('Language', 'NNP'),
 ('Processing', 'NNP'),
 ('(', '('),
 ('NLP', 'NNP'),
 (')', ')'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('fascinating', 'JJ'),
 ('field', 'NN'),
 ('of', 'IN'),
 ('artificial', 'JJ'),
 ('intelligence', 'NN'),
 ('.', '.'),
 ('It', 'PRP'),
 ('enables', 'VBZ'),
 ('computers', 'NNS'),
 ('to', 'TO'),
 ('understand', 'VB'),
 (',', ','),
 ('interpret', 'VB'),
 (',', ','),
 ('and', 'CC'),
 ('generate', 'VB'),
 ('human', 'JJ'),
 ('language', 'NN'),
 ('.', '.'),
 ('Stopwords', 'NNS'),
 ('are', 'VBP'),
 ('frequently', 'RB'),
 ('removed', 'VBN'),
 ('during', 'IN'),
 ('text', 'JJ'),
 ('preprocessing', 'NN'),
 ('to', 'TO'),
 ('improve', 'VB'),
 ('the', 'DT'),
 ('efficiency', 'NN'),
 ('of', 'IN'),
 ('NLP', 'NNP'),
 ('models', 'NNS'),
 ('.', '.'),
 ('Machine', 'NNP'),
 ('learning', 'VBG'),
 ('techniques', 'NNS'),
 (',', ','),
 ('along', 'IN'),
 ('with', 'IN'),
 ('deep', 'JJ'),
 ('learning', 'NN'),
 (',', ','),
 ('have', 'VBP'),
 ('significantly', 'RB'),
 ('improved',

In [21]:
sentence = "Taj mahal is a beautiful monument"
pos_tag(word_tokenize(sentence))

[('Taj', 'NNP'),
 ('mahal', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('monument', 'NN')]
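Because `pos_tag` returns plain `(word, tag)` tuples, the result can be filtered or counted with ordinary Python. A minimal sketch, with the tagged pairs copied by hand from the output above so that it runs without NLTK:

```python
from collections import Counter

# Tagged pairs copied from the pos_tag output above.
tagged = [("Taj", "NNP"), ("mahal", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("beautiful", "JJ"), ("monument", "NN")]

# All noun tags (NN, NNS, NNP, NNPS) start with "NN"; adjective tags start with "JJ".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)                      # ['Taj', 'mahal', 'monument']
print(adjectives)                 # ['beautiful']
print(tag_counts.most_common(1))  # [('NN', 2)]
```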

In [23]:
pos_tag(["heelo"])

[('heelo', 'NN')]
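POS tags can also feed back into the lemmatization step from the stopwords section, which passed `pos="v"` for every word. `WordNetLemmatizer.lemmatize` accepts `"n"`, `"v"`, `"a"`, and `"r"` for nouns, verbs, adjectives, and adverbs, so one possible approach is to translate each Penn Treebank tag first. The sketch below assumes a hypothetical helper, `penn_to_wordnet`, and uses pairs copied from the tagger output earlier so that it runs without any downloads.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Pairs copied from the pos_tag output for corpus2.
tagged = [("enables", "VBZ"), ("computers", "NNS"),
          ("frequently", "RB"), ("fascinating", "JJ")]

for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))

# With NLTK and the wordnet data available, this could be used as:
# lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```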