# Stopwords

In Natural Language Processing (NLP), stopwords are common words that are often removed from text before analysis. Words such as "the", "a", "is", "and", and "in" occur very frequently in a language and carry little semantic meaning on their own.

# Tokens

Tokenization is the process of breaking a continuous stream of text into smaller, manageable units called tokens.

These tokens can be words, characters, numbers, or punctuation, depending on the specific task and how the text is segmented.
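
To make the idea of different token granularities concrete, here is a minimal sketch that uses only the Python standard library. The sample string and the variable names are illustrative assumptions, not part of the notebook's data.

```python
sample = "NLP in 2024"

# Word-level tokens: split on whitespace
word_tokens = sample.split()

# Character-level tokens: every character, spaces included
char_tokens = list(sample)

# Numeric tokens: the word-level tokens made up only of digits
number_tokens = [tok for tok in word_tokens if tok.isdigit()]

print(word_tokens)       # ['NLP', 'in', '2024']
print(len(char_tokens))  # 11
print(number_tokens)     # ['2024']
```

Which granularity is appropriate depends on the task: word-level tokens suit stopword removal, while character-level tokens are more common in spelling or language-identification models.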

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download the tokenizer models and the stopword lists used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
text = "This is an example of a sentence showing how to remove the stopwords using NLTK"

In [None]:
tokens = word_tokenize(text)
tokens

['This',
 'is',
 'an',
 'example',
 'of',
 'a',
 'sentence',
 'showing',
 'how',
 'to',
 'remove',
 'the',
 'stopwords',
 'using',
 'NLTK']

In [None]:
# Plain whitespace splitting gives the same tokens here because the sentence has no punctuation
text.split()

['This',
 'is',
 'an',
 'example',
 'of',
 'a',
 'sentence',
 'showing',
 'how',
 'to',
 'remove',
 'the',
 'stopwords',
 'using',
 'NLTK']
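
The two outputs above match only because the example sentence contains no punctuation. The sketch below shows how whitespace splitting behaves once punctuation is present, using a hypothetical sentence and a simple regular expression as a rough stand-in for a real tokenizer. The regex is an assumption made for illustration; it is not how `word_tokenize` works internally, and `word_tokenize` may split contractions differently.

```python
import re

punctuated = "Stopwords are common, aren't they?"

# Whitespace splitting leaves punctuation attached to the neighbouring word
whitespace_tokens = punctuated.split()

# Regex: a word (optionally with an apostrophe suffix) or a single punctuation mark
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", punctuated)

print(whitespace_tokens)  # ['Stopwords', 'are', 'common,', "aren't", 'they?']
print(regex_tokens)       # ['Stopwords', 'are', 'common', ',', "aren't", 'they', '?']
```

Tokens such as `'common,'` would never match an entry in a stopword list, which is one reason a dedicated tokenizer is usually preferred over `str.split()` for real text.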

In [None]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [None]:
# Convert the list to a set so each membership check is a fast hash lookup
stop_words = set(stopwords.words('english'))

In [None]:
# Lowercase each token before the lookup, because the stopword list is all lowercase
cleaned_text = [word for word in tokens if word.lower() not in stop_words]
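
The call to `.lower()` matters because the stopword list contains only lowercase entries. The following sketch illustrates the difference with a small hand-picked stopword set; `demo_tokens` and `stop_subset` are hypothetical names, and the set is only an assumed subset used for illustration, not the full NLTK list.

```python
demo_tokens = ['This', 'is', 'an', 'example', 'of', 'a', 'sentence', 'showing',
               'how', 'to', 'remove', 'the', 'stopwords', 'using', 'NLTK']

# Hand-picked lowercase stopwords (illustrative subset only)
stop_subset = {'this', 'is', 'an', 'of', 'a', 'how', 'to', 'the'}

# Case-sensitive check: the capitalised 'This' is not recognised as a stopword
case_sensitive = [w for w in demo_tokens if w not in stop_subset]

# Case-insensitive check: lowercase each token before the lookup
case_insensitive = [w for w in demo_tokens if w.lower() not in stop_subset]

print(case_sensitive[0])    # This
print(case_insensitive[0])  # example
```

Note that only the comparison is lowercased; the tokens that are kept retain their original casing, which is why `'NLTK'` stays uppercase.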

In [None]:
text

'This is an example of a sentence showing how to remove the stopwords using NLTK'

In [None]:
cleaned_text

['example', 'sentence', 'showing', 'remove', 'stopwords', 'using', 'NLTK']
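
The filtering step can be wrapped in a small reusable function and the result joined back into a string. This is a minimal sketch under stated assumptions: `remove_stopwords` is a hypothetical helper, and `demo_stop_words` is a small hand-picked set standing in for the full NLTK list so that the example runs without any downloads.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Hand-picked lowercase stopwords (illustrative stand-in for the NLTK list)
demo_stop_words = {"this", "is", "an", "of", "a", "how", "to", "the"}

demo_text = "This is an example of a sentence showing how to remove the stopwords using NLTK"

kept = remove_stopwords(demo_text.split(), demo_stop_words)
cleaned_sentence = " ".join(kept)

print(cleaned_sentence)  # example sentence showing remove stopwords using NLTK
```

In the notebook itself, the same function could be called with `tokens` and `stop_words` in place of the demo values.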