In Natural Language Processing (NLP), **stopwords** are common words such as "the," "is," "in," and "at" that typically carry little meaning on their own in text analysis. They are frequently removed during text preprocessing so that later steps focus on more meaningful terms, which can improve both efficiency and relevance in tasks like search, machine learning models, or text summarization.

The specific list of stopwords can vary depending on the language or the task at hand. Removing stopwords can help reduce noise and allow algorithms to focus on the content that differentiates texts from one another.
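
As a rough illustration of how a list can be tailored to a task, the sketch below starts from a tiny hand-written stopword set (a stand-in, not NLTK's actual list) and adjusts it. A sentiment task might keep negations such as "not", while a film-review corpus might treat "movie" as noise. The names `BASE_STOPWORDS` and `remove_stopwords` are hypothetical.

```python
# A tiny illustrative stopword set; real lists (e.g. NLTK's) are much longer.
BASE_STOPWORDS = {"the", "is", "in", "at", "a", "not", "this"}

def remove_stopwords(tokens, stopword_set):
    """Return the tokens whose lowercase form is not in stopword_set."""
    return [tok for tok in tokens if tok.lower() not in stopword_set]

tokens = "This movie is not good".split()

# Default set: the negation is dropped, which hides the negative sentiment.
print(remove_stopwords(tokens, BASE_STOPWORDS))       # ['movie', 'good']

# Task-specific set: keep "not", but drop "movie" as corpus-specific filler.
sentiment_stopwords = (BASE_STOPWORDS - {"not"}) | {"movie"}
print(remove_stopwords(tokens, sentiment_stopwords))  # ['not', 'good']
```

Passing the stopword set as an argument keeps the filtering logic the same while the list itself changes per language or task.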

In [None]:
paragraph = """*Pride and Prejudice* by Jane Austen is a timeless masterpiece that captures the beauty of human emotions, relationships, and societal dynamics. Set in early 19th-century England, the novel intricately portrays the tension between individual desires and societal expectations. Its beauty lies in the way Austen explores complex themes such as love, class, and personal growth with wit and irony. The story centers around Elizabeth Bennet and Mr. Darcy, whose initially strained relationship gradually evolves as they both confront their own prejudices and misconceptions.
Austen’s vivid characters breathe life into the narrative, from Elizabeth's spirited independence to Darcy's quiet transformation. The novel’s charm lies in its delicate balance of humor, romance, and social critique, making it a rich reflection on human nature. The beauty also extends to its exploration of themes that remain relevant today—how first impressions can deceive, how pride can blind, and how love can transcend social boundaries.
Austen’s elegant prose and sharp dialogue make the novel both a delightful read and a deep commentary on society. *Pride and Prejudice* endures as a literary gem, showcasing the beauty of personal growth and the triumph of love over pride and prejudice."""

In [None]:
from nltk.stem import PorterStemmer

In [None]:
from nltk.corpus import stopwords

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
stopwords.words('french')

['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'ils',
 'je',
 'la',
 'le',
 'les',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',


In [None]:
stopwords.words('spanish')

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'más',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'sí',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'también',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mí',
 'antes',
 'algunos',
 'qué',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'él',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tú',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mío',
 'mía',
 'míos',
 'mías',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',
 'vuestro'

In [None]:
stemmer = PorterStemmer()
sentences = nltk.sent_tokenize(paragraph)

# Next: tokenize each sentence, drop the stopwords, and stem the remaining words

In [None]:
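stop_words = set(stopwords.words('english'))  # build the set once, not once per word

for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # Stem the words that are not stopwords. The membership test is case-sensitive,
  # so capitalized stopwords such as "The" are kept.
  words = [stemmer.stem(word) for word in words if word not in stop_words]
  # Join the stemmed words back into a sentence
  sentences[i] = ' '.join(words)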

In [None]:
sentences

['* pride prejudic * jane austen timeless masterpiec captur beauti human emot , relationship , societ dynam .',
 'set earli 19th-centuri england , novel intric portray tension individu desir societ expect .',
 'it beauti lie way austen explor complex theme love , class , person growth wit ironi .',
 'the stori center around elizabeth bennet mr. darci , whose initi strain relationship gradual evolv confront prejudic misconcept .',
 "austen ’ vivid charact breath life narr , elizabeth 's spirit independ darci 's quiet transform .",
 'the novel ’ charm lie delic balanc humor , romanc , social critiqu , make rich reflect human natur .',
 'the beauti also extend explor theme remain relev today—how first impress deceiv , pride blind , love transcend social boundari .',
 'austen ’ eleg prose sharp dialogu make novel delight read deep commentari societi .',
 '* pride prejudic * endur literari gem , showcas beauti person growth triumph love pride prejudic .']
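
In the output above, several sentences still begin with `it` or `the`. A likely reason is that the membership test is case-sensitive: `'The'` is not in the lowercase stopword list, so it passes the filter and is only lowercased afterwards by the stemmer. The sketch below shows the difference using a small hand-written stopword set (an assumption, standing in for `stopwords.words('english')`) and plain `str.split` instead of NLTK tokenization.

```python
# Hand-written stand-in for stopwords.words('english'); all entries are lowercase.
stop_words = {"the", "its", "in", "of", "and"}

tokens = "The story centers around Elizabeth Bennet".split()

# Case-sensitive test: "The" != "the", so the capitalized stopword survives.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing before the test removes it regardless of capitalization.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'story', 'centers', 'around', 'Elizabeth', 'Bennet']
print(case_insensitive)  # ['story', 'centers', 'around', 'Elizabeth', 'Bennet']
```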

The Porter stemmer output does not look that good, so we try the Snowball stemmer next.

In [None]:
from nltk.stem import SnowballStemmer

In [None]:
snowballStemmer = SnowballStemmer('english')

In [None]:
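stop_words = set(stopwords.words('english'))  # build the set once, not once per word

# Note: `sentences` already holds the Porter-stemmed text from the previous loop,
# so this pass stems those stems again rather than the original words.
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # Stem the words that are not stopwords
  words = [snowballStemmer.stem(word) for word in words if word not in stop_words]
  # Join the stemmed words back into a sentence
  sentences[i] = ' '.join(words)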

In [None]:
sentences

['* pride prejud * jane austen timeless masterpiec captur beauti human emot , relationship , societ dynam .',
 'set ear 19th-centuri england , novel intric portray tension individu desir societ expect .',
 'beauti lie way austen explor complex theme love , class , person growth wit ironi .',
 'stori center around elizabeth bennet mr. darci , whose initi strain relationship gradual evolv confront prejud misconcept .',
 "austen ’ vivid charact breath life narr , elizabeth 's spirit independ darci 's quiet transform .",
 'novel ’ charm lie delic balanc humor , romanc , social critiqu , make rich reflect human natur .',
 'beauti also extend explor theme remain relev today—how first impress deceiv , pride blind , love transcend social boundari .',
 'austen ’ eleg prose sharp dialogu make novel delight read deep commentari societi .',
 '* pride prejud * endur literari gem , showca beauti person growth triumph love pride prejud .']
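
Stems such as `prejud`, `ear`, and `showca` in this output appear to come from stemming text that was already stemmed: the loop reads from `sentences`, which the Porter pass had overwritten. Below is a minimal sketch of the effect on single words, assuming NLTK is installed (the two stemmers themselves need no downloaded corpora).

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["prejudice", "early", "showcasing"]:
    once = porter.stem(word)     # first pass, as in the Porter loop
    twice = snowball.stem(once)  # second pass on the already-stemmed token
    print(f"{word} -> {once} -> {twice}")
```

To compare the two stemmers on equal terms, each would need to run on freshly tokenized sentences from `paragraph` rather than on the other's output.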

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
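stop_words = set(stopwords.words('english'))  # build the set once, not once per word

# Note: `sentences` still holds stemmed text, so the lemmatizer sees stems, not dictionary words.
for i in range(len(sentences)):
  sentences[i] = sentences[i].lower()
  words = nltk.word_tokenize(sentences[i])
  # Lemmatize (as verbs, pos='v') the words that are not stopwords
  words = [lemmatizer.lemmatize(word, pos='v') for word in words if word not in stop_words]
  # Join the lemmatized words back into a sentence
  sentences[i] = ' '.join(words)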

In [None]:
sentences

['* pride prejud * jane austen timeless masterpiec captur beauti human emot , relationship , societ dynam .',
 'set ear 19th-centuri england , novel intric portray tension individu desir societ expect .',
 'beauti lie way austen explor complex theme love , class , person growth wit ironi .',
 'stori center around elizabeth bennet mr. darci , whose initi strain relationship gradual evolv confront prejud misconcept .',
 "austen ’ vivid charact breath life narr , elizabeth 's spirit independ darci 's quiet transform .",
 'novel ’ charm lie delic balanc humor , romanc , social critiqu , make rich reflect human natur .',
 'beauti also extend explor theme remain relev today—how first impress deceiv , pride blind , love transcend social boundari .',
 'austen ’ eleg prose sharp dialogu make novel delight read deep commentari societi .',
 '* pride prejud * endur literari gem , showca beauti person growth triumph love pride prejud .']