# Natural Language Processing

### **1. Tokenization**

In [1]:
import nltk # type: ignore

In [2]:
corpus = """Hello world! This is a sample text for natural language processing. 
We are going to perform tokenization on it using NLTK's tokenizer functionality."""
# tokenizes by sentence.
sentences = nltk.sent_tokenize(corpus)
sentences

['Hello world!',
 'This is a sample text for natural language processing.',
 "We are going to perform tokenization on it using NLTK's tokenizer functionality."]

In [3]:
len(sentences) # number of sentences in the corpus

3
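To see why a dedicated sentence tokenizer is useful, compare it with a naive rule that breaks after every `.`, `!`, or `?` followed by whitespace. The sketch below uses only Python's `re` module. `naive_sent_split` is a hypothetical helper written for this illustration; it is not part of NLTK and is not how `sent_tokenize` works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(naive_sent_split("Hello world! This is a sample text."))
# ['Hello world!', 'This is a sample text.']

# Abbreviations break the naive rule: "Dr." is treated as a sentence end.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The second call shows the weakness of the naive rule: every abbreviation that ends in a period produces a false sentence boundary.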

In [4]:
tokens = nltk.word_tokenize(corpus)
print(tokens)

['Hello', 'world', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.', 'We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'s", 'tokenizer', 'functionality', '.']


In [5]:
len(tokens) # number of tokens (words and punctuation marks) in the corpus

27

In [6]:
# print words of each sentence
for sentence in sentences:
    print(nltk.word_tokenize(sentence))

['Hello', 'world', '!']
['This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.']
['We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'s", 'tokenizer', 'functionality', '.']


In [7]:
from nltk.tokenize import wordpunct_tokenize
tokens = wordpunct_tokenize(corpus)
print(tokens)

['Hello', 'world', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.', 'We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'", 's', 'tokenizer', 'functionality', '.']


The difference between `word_tokenize` and `wordpunct_tokenize` shows up at the apostrophe: `word_tokenize` kept `'s` together as a single token, while `wordpunct_tokenize` split it into `'` and `s`.

#### TreebankWordTokenizer

In [8]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 'world',
 '!',
 'This',
 'is',
 'a',
 'sample',
 'text',
 'for',
 'natural',
 'language',
 'processing.',
 'We',
 'are',
 'going',
 'to',
 'perform',
 'tokenization',
 'on',
 'it',
 'using',
 'NLTK',
 "'s",
 'tokenizer',
 'functionality',
 '.']

NLTK provides several word tokenizers, each suited for different use cases. Here are the key tokenizers and their differences:

##### 1. **`word_tokenize` (Recommended)**
   - **Implementation:** Splits the text into sentences with the Punkt sentence tokenizer, then applies an improved Treebank-style word tokenizer to each sentence.
   - **Features:** Handles punctuation, contractions, and special cases like "U.S." correctly.
   - **Example:**
     ```python
     from nltk.tokenize import word_tokenize
     text = "I'm going to the U.S. next week!"
     print(word_tokenize(text))
     ```
     **Output:**
     ```python
     ['I', "'m", 'going', 'to', 'the', 'U.S.', 'next', 'week', '!']
     ```
   - **Use Case:** General-purpose word tokenization.

---

##### 2. **`TreebankWordTokenizer`**
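   - **Implementation:** Applies the Penn Treebank rules directly to the whole string, without splitting it into sentences first. Only a period at the very end of the input becomes its own token, which is why `'processing.'` stays intact in the output above.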
   - **Features:** Splits contractions (e.g., "can't" → ["ca", "n't"]), handles punctuation.
   - **Example:**
     ```python
     from nltk.tokenize import TreebankWordTokenizer
     tokenizer = TreebankWordTokenizer()
     print(tokenizer.tokenize("Can't won't don't"))
     ```
     **Output:**
     ```python
     ['Ca', "n't", 'wo', "n't", 'do', "n't"]
     ```
   - **Use Case:** When working with text where contractions need to be split.

A **contraction** is a shortened form of one or more words where missing letters are replaced by an apostrophe (`'`). Contractions are commonly used in informal writing and speech.  

##### **Examples of Contractions:**
| Full Form | Contraction |
|-----------|------------|
| I am | I'm |
| You are | You're |
| He is / He has | He's |
| They are | They're |
| Cannot | Can't |
| Will not | Won't |
| Do not | Don't |
| Should not | Shouldn't |
| Would have | Would've |

##### **Why Do Contractions Matter in NLP?**
- Some tokenizers **split contractions** into separate words (`"can't"` → `["ca", "n't"]`).
- Others **keep contractions intact** (`"can't"` → `["can't"]`).
- Handling contractions correctly is important for **sentiment analysis**, **text preprocessing**, and **machine learning models**.
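One common preprocessing choice is to expand contractions before tokenizing. The sketch below is a minimal illustration using only the standard library. The `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical names, and the mapping is deliberately tiny; ambiguous forms such as "he's" (he is / he has) are left out because they cannot be resolved by lookup alone.

```python
import re

# Hypothetical, deliberately tiny mapping; real projects need a much larger table.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
    "they're": "they are",
}

# One alternation of all keys; re.escape keeps each key literal.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace known contractions with their full forms (replacements are lowercase)."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure they're late, but I can't wait."))
# i am sure they are late, but I cannot wait.
```

Whether to expand, split, or keep contractions depends on the downstream task, so treat this as one option rather than a required step.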




##### 3. **`WordPunctTokenizer`**
   - **Implementation:** Splits words and punctuation separately.
   - **Features:** Breaks contractions at the apostrophe (e.g., "can't" → ["can", "'", "t"]).
   - **Example:**
     ```python
     from nltk.tokenize import WordPunctTokenizer
     tokenizer = WordPunctTokenizer()
     print(tokenizer.tokenize("I'm excited!"))
     ```
     **Output:**
     ```python
     ['I', "'", 'm', 'excited', '!']
     ```
   - **Use Case:** When punctuation needs to be treated as separate tokens.
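`WordPunctTokenizer` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`: either a run of word characters, or a run of characters that are neither word characters nor whitespace. The sketch below reproduces that rule with `re.findall` alone so the splitting behaviour is easy to inspect. `wordpunct_sketch` is a hypothetical name used only here.

```python
import re

def wordpunct_sketch(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_sketch("I'm excited!"))
# ['I', "'", 'm', 'excited', '!']
print(wordpunct_sketch("can't"))
# ['can', "'", 't']
print(wordpunct_sketch("Wait... what?!"))
# ['Wait', '...', 'what', '?!']
```

Note that adjacent punctuation characters stay together as one token (`'...'`, `'?!'`), because the second alternative matches a whole run.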

---
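NLTK also ships other tokenizers, such as `ToktokTokenizer`, `RegexpTokenizer`, and `MWETokenizer`. The summary below includes them alongside the three covered above.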

##### **Summary**
| Tokenizer | Handles Contractions? | Splits Punctuation? | Use Case |
|-----------|------------------|----------------|---------|
| `word_tokenize` | Yes | Mostly | General NLP tasks |
| `TreebankWordTokenizer` | Yes (splits into `ca` + `n't`) | Mostly (only a period at the end of the input is split off) | Text already split into sentences |
| `ToktokTokenizer` | No | Yes | Fast and simple tokenization |
| `RegexpTokenizer` | Custom | Custom | Custom rules-based tokenization |
| `WordPunctTokenizer` | No (breaks them at the apostrophe) | Yes (separates all punctuation) | When punctuation should be separate |
| `MWETokenizer` | No | No | Preserve multi-word expressions |
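The last row of the table refers to multi-word expressions, which the examples above do not demonstrate. The idea can be sketched in plain Python: after tokenizing, known token sequences are re-joined into a single token. `merge_mwes` below is a hypothetical helper that illustrates the concept; it is not NLTK's `MWETokenizer` implementation.

```python
def merge_mwes(tokens, expressions, separator="_"):
    """Greedily join known multi-word expressions into single tokens."""
    # Try longer expressions first so ("New", "York", "City") wins over ("New", "York").
    # Empty expressions are skipped because they would never advance the index.
    ordered = sorted((e for e in expressions if e), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for expr in ordered:
            if tuple(tokens[i:i + len(expr)]) == tuple(expr):
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts here, so keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "I moved to New York City for natural language processing".split()
expressions = [("New", "York"), ("New", "York", "City"), ("natural", "language", "processing")]
print(merge_mwes(tokens, expressions))
# ['I', 'moved', 'to', 'New_York_City', 'for', 'natural_language_processing']
```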


### **2. Stemming**

**Stemming** is the process of reducing a word to its root form (also called a "stem") by **removing suffixes**. It helps normalize words so that variations of a word are treated as the same.  

For example:  
- **"running" → "run"**  
- **"flies" → "fli"** (incorrect but common with some stemmers)  
- **"happily" → "happili"**  

Stemming is a **rule-based** approach and doesn't always produce real words. It just chops off endings based on predefined rules.
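To make "chopping off endings based on predefined rules" concrete, here is a toy suffix stripper. It is a sketch under stated assumptions: the `RULES` list and `toy_stem` are hypothetical and cover only a handful of suffixes, so this is not the Porter algorithm, which applies several ordered phases with extra conditions.

```python
# Ordered (suffix, replacement) rules: a toy subset, NOT the full Porter algorithm.
RULES = [
    ("sses", "ss"),   # caresses -> caress
    ("ies", "i"),     # flies -> fli
    ("ing", ""),      # running -> runn
    ("ed", ""),       # worked -> work
    ("s", ""),        # cats -> cat
]

def toy_stem(word):
    """Apply the first matching rule; leave very short words untouched."""
    word = word.lower()
    for suffix, replacement in RULES:
        # Require at least two characters to remain before the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "flies", "running", "worked", "cats", "is"]:
    print(w, "→", toy_stem(w))
```

Even this small rule set shows both effects described above: "flies" becomes the non-word "fli", and "running" keeps its doubled consonant because no rule handles it.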

---

**Limitations of Stemming**
- Doesn't always produce real words (e.g., *"flies"* → *"fli"*)  
- Different words may map to the same stem incorrectly  
- Over-stemming (too aggressive) or under-stemming (not aggressive enough)  

---

**Stemming vs Lemmatization**

If you need more **accurate** root words, **lemmatization** is a better choice because it uses a **dictionary-based** approach instead of just chopping off suffixes.


In [9]:
words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized"]

In [10]:
# Porter Stemming
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
for word in words:
    print(word, "→", porter_stemmer.stem(word))

eating → eat
eats → eat
eaten → eaten
writing → write
writes → write
programming → program
programs → program
history → histori
finally → final
finalized → final


The change from "history" → "histori" shows a major disadvantage of stemming: the stem is not a real word, so the output can be hard to read or to look up in a dictionary.

In [11]:
# Another example
porter_stemmer.stem("congratulations")

'congratul'

"congratul" is not a real word either. The stemmer strips suffixes by rule and never checks the result against a dictionary.

In [12]:
# Regexp Stemming
from nltk.stem import RegexpStemmer

# Define a regex pattern to remove common suffixes (-ing, -ed, -ly)
regexp_stemmer = RegexpStemmer(r'ing$|ed$|ly$', min=4)

print(regexp_stemmer.stem("running"))   # Output: runn
print(regexp_stemmer.stem("happily"))   # Output: happi
print(regexp_stemmer.stem("worked"))    # Output: work

runn
happi
work


In [13]:
regexp_stemmer = RegexpStemmer('ing|ed$|ly$')
print(regexp_stemmer.stem("ingeating")) #eat
regexp_stemmer = RegexpStemmer('ing$|ed$|ly$')
print(regexp_stemmer.stem("ingeating")) #ingeat

eat
ingeat
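The two results differ only because of the `$` anchor. Roughly speaking, `RegexpStemmer` deletes every match of the pattern and returns words shorter than `min` unchanged. The sketch below mirrors that behaviour with `re.sub` so the effect of the anchor and of the length threshold can be tested in isolation. `regexp_stem` and its `min_len` parameter are hypothetical names for this illustration.

```python
import re

def regexp_stem(word, pattern, min_len=0):
    """Delete every match of `pattern`, unless the word is shorter than `min_len`."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

# Unanchored 'ing' matches anywhere, so the leading "ing" is removed too.
print(regexp_stem("ingeating", r"ing|ed$|ly$"))         # eat
# Anchored 'ing$' only matches at the end of the word.
print(regexp_stem("ingeating", r"ing$|ed$|ly$"))        # ingeat
# Without a length threshold, short words can be destroyed.
print(regexp_stem("red", r"ing$|ed$|ly$"))              # r
print(regexp_stem("red", r"ing$|ed$|ly$", min_len=4))   # red
```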


In [14]:
# snowball stemmer
from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer(language='english')
print(snowball_stemmer.stem("happily"))
print(snowball_stemmer.stem("worked"))

happili
work


In [15]:
for word in words:
    print(word, "→", snowball_stemmer.stem(word))

eating → eat
eats → eat
eaten → eaten
writing → write
writes → write
programming → program
programs → program
history → histori
finally → final
finalized → final


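**Comparing all three stemmers**

Note that `regexp_stemmer` here is the last one defined above (pattern `'ing$|ed$|ly$'`, no `min`).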

In [16]:
porter_stemmer.stem("fairly"), porter_stemmer.stem("sportingly")

('fairli', 'sportingli')

In [17]:
regexp_stemmer.stem("fairly"), regexp_stemmer.stem("sportingly")

('fair', 'sporting')

In [18]:
snowball_stemmer.stem("fairly"), snowball_stemmer.stem("sportingly")

('fair', 'sport')

**1️⃣ Porter Stemmer**

✅ **Pros:**  
- One of the most widely used stemming algorithms.  
- Uses a **set of heuristic rules** to remove common suffixes.  
- Efficient and relatively fast.  

❌ **Cons:**  
- Sometimes over-stems, merging unrelated words (e.g., `"universal"` and `"university"` both become `"univers"`).  
- Does not always produce real words.  
- **Not customizable** (fixed rules).  

---

**2️⃣ Regex Stemmer**

✅ **Pros:**  
- **Customizable**: You define the regex pattern to remove suffixes.  
- Useful for **domain-specific** text processing.  
- Can prevent over-stemming by **setting minimum word length** (`min=` parameter).  

❌ **Cons:**  
- Requires **manual regex tuning** for different datasets.  
- May **miss irregular word forms** (e.g., `"better"` won’t stem to `"good"`).  
- **Not language-aware** (simply removes predefined suffixes).  

---

**3️⃣ Snowball Stemmer**

✅ **Pros:**  
- **Improved version of Porter Stemmer**.  
- Supports **multiple languages** (e.g., English, French, Spanish).  
- More **accurate and flexible** than Porter.  

❌ **Cons:**  
- **Slower than Porter Stemmer** due to additional rules.  
- Not as customizable as Regex Stemmer.  

---

**Comparison Table**

| Feature          | **Porter Stemmer** | **Regex Stemmer** | **Snowball Stemmer** |
|-----------------|------------------|-----------------|------------------|
| **Algorithm**   | Rule-based       | Regex-based    | Rule-based (Improved Porter) |
| **Customizable?** | ❌ No | ✅ Yes | ❌ No |
| **Language Support** | ❌ English Only | ❌ Manual | ✅ Multiple Languages |
| **Speed** | ✅ Fast | ✅ Fast | ❌ Slower (More Rules) |
| **Accuracy** | ❌ Can over-stem | ✅ Depends on Regex | ✅ More accurate than Porter |
| **Best Use Case** | General NLP tasks | Domain-specific text | Multi-language support |

---

**Which One Should You Use?**

🔹 **Use Porter Stemmer** → If you want a simple, fast stemming method for **English text**.  
🔹 **Use Regex Stemmer** → If you need **full control** over stemming rules for **custom datasets**.  
🔹 **Use Snowball Stemmer** → If you want **better accuracy** and support for **multiple languages**.  

### **3. Lemmatization**

**Lemmatization** is the process of reducing a word to its **base or dictionary form (lemma)** while ensuring it remains a real word. Unlike **stemming**, which just chops off suffixes, lemmatization considers **grammatical meaning** using a **lexical database** like WordNet.

---

**How Does Lemmatization Work?**

- Uses the **part of speech (POS)** of a word. With NLTK's `WordNetLemmatizer` the caller supplies the POS; it is not inferred from context.  
- Uses a **dictionary lookup** to find the root form (lemma).  
- Ensures the output is a valid word.  
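The dictionary-lookup idea can be sketched with a plain Python mapping keyed by word and part of speech. The `LEXICON` below is a hypothetical mini-lexicon invented for this example; in NLTK, WordNet plays that role and the real lookup is more involved.

```python
# Hypothetical mini-lexicon keyed by (word, POS); WordNet plays this role in NLTK.
LEXICON = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return LEXICON.get((word, pos), word)

print(toy_lemmatize("mice"))           # mouse
print(toy_lemmatize("running"))        # running (no noun entry, so unchanged)
print(toy_lemmatize("running", "v"))   # run
print(toy_lemmatize("better", "a"))    # good
```

The fallback branch is the important part: when the lookup is made with the wrong part of speech, the word comes back unchanged rather than being mangled.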

🔹 **Example:**  
| Word | Stemmed (Porter) | Lemmatized (WordNet) |
|------|----------------|------------------|
| Running | run | run |
| Better | better | good |
| Studies | studi | study |
| Mice | mice | mouse |

---

🔹 **Explanation:**  

- `"running"` (verb) → `"run"` ✅  
- `"better"` (adjective) → `"good"` ✅  
- `"mice"` (noun) → `"mouse"` ✅  
- `"studies"` (noun) → `"study"` ✅  

🚨 **POS Tagging is Important!**  

- If no `pos` is provided, it assumes the word is a **noun**.  
- `"running"` → `"running"` (incorrect)  
- `"running", pos="v"` → `"run"` (correct)  
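In practice the part of speech usually comes from a tagger such as `nltk.pos_tag`, which emits Penn Treebank tags (`VBG`, `JJR`, `NNS`, ...), while `lemmatize` expects the single-letter WordNet codes. A small mapping function bridges the two. The helper name `penn_to_wordnet` is an assumption made for this sketch, and the tagged pairs are written by hand so the example runs without any NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS code ('n', 'v', 'a', 'r')."""
    if tag.startswith("J"):
        return "a"   # JJ, JJR, JJS: adjectives
    if tag.startswith("V"):
        return "v"   # VB, VBD, VBG, ...: verbs
    if tag.startswith("RB"):
        return "r"   # RB, RBR, RBS: adverbs
    return "n"       # everything else falls back to noun, the lemmatizer's default

# Hand-written (word, tag) pairs standing in for tagger output.
tagged = [("running", "VBG"), ("better", "JJR"), ("mice", "NNS"), ("fairly", "RB")]
for word, tag in tagged:
    print(word, tag, "→", penn_to_wordnet(tag))
```

With a lemmatizer available, the result would be passed along as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.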

---

**Stemming vs Lemmatization**

| Feature | **Stemming** | **Lemmatization** |
|---------|-------------|------------------|
| **Method** | Removes suffixes | Uses dictionary lookup |
| **Grammar Aware?** | ❌ No | ✅ Yes |
| **Produces Real Words?** | ❌ Not always | ✅ Yes |
| **Computational Cost** | ✅ Fast | ❌ Slower |
| **Example ("better")** | **"better"** → **"better"** | **"better"** → **"good"** |

---

**When to Use Lemmatization?**

✅ **Linguistic Accuracy Required** (e.g., chatbots, search engines).  
✅ **When Meaning Matters** (e.g., sentiment analysis).  
✅ **If You Have Computational Resources** (since it's slower than stemming).  

In [21]:
from nltk.stem import WordNetLemmatizer  # needs the WordNet data: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

In [22]:
for word in words:
    print(word, "→", lemmatizer.lemmatize(word))

eating → eating
eats → eats
eaten → eaten
writing → writing
writes → writes
programming → programming
programs → program
history → history
finally → finally
finalized → finalized


Almost nothing changes (only "programs" → "program") because `lemmatize` treats every word as a noun by default. Passing the `pos` argument tells it which part of speech to use.

In [26]:
# WordNet POS codes accepted by lemmatize():
#   noun 'n' (the default), verb 'v', adjective 'a', adverb 'r'
for word in words:
    print(word, "→", lemmatizer.lemmatize(word, pos='v'))

eating → eat
eats → eat
eaten → eat
writing → write
writes → write
programming → program
programs → program
history → history
finally → finally
finalized → finalize


In [25]:
for word in words:
    print(word, "→", lemmatizer.lemmatize(word, pos='a'))

eating → eating
eats → eats
eaten → eaten
writing → writing
writes → writes
programming → programming
programs → programs
history → history
finally → finally
finalized → finalized


In [28]:
lemmatizer.lemmatize("goes", pos='v')

'go'

In [30]:
lemmatizer.lemmatize("fairly", pos='a'), lemmatizer.lemmatize("sportingly", pos = 'v')

('fairly', 'sportingly')

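**Advantage of Lemmatization over Stemming**

The stemmer leaves the irregular comparative "better" unchanged, while the lemmatizer maps it to "good" once it is told the word is an adjective.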

In [31]:
snowball_stemmer.stem("better")

'better'

In [33]:
lemmatizer.lemmatize("better", 'a')

'good'

### **4. Stopwords**

Stopwords are **common words** in a language that are **filtered out** in Natural Language Processing (NLP) because they **do not carry much meaning**. These include words like *"the," "is," "and," "in," "of,"* etc.  

Since stopwords appear **frequently** in text but add little value to analysis, they are often **removed** to improve efficiency in NLP tasks like text classification, search engines, and machine learning models.
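Stopword removal itself is just a membership test per token. The sketch below uses a tiny hand-written set, which is an assumption for illustration only (NLTK's English list is much longer), and compares the lowercase form of each token so that capitalized stopwords such as "We" are caught as well.

```python
# Tiny hand-written stopword set for illustration; NLTK's English list is much longer.
STOPWORDS = {"the", "is", "and", "in", "of", "a", "we", "have", "not", "i"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is a stopword; other tokens keep their casing."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ["We", "have", "not", "conquered", "anyone", "."]
print(remove_stopwords(tokens))
# ['conquered', 'anyone', '.']
```

Using a `set` keeps each membership test fast, which matters once the token list grows to a whole corpus.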

---

**Examples of Stopwords in English** 
 
| Category | Examples |
|----------|----------|
| Articles | the, a, an |
| Pronouns | he, she, it, they |
| Conjunctions | and, or, but, nor |
| Prepositions | in, on, at, with |
| Auxiliary Verbs | is, are, was, were, be |
| Other Common Words | this, that, those, these |

---

**Stopwords in NLTK**

NLTK provides a built-in list of stopwords for multiple languages.  

In [49]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

In [39]:
# removing stopwords
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english') # list of English stopwords (all lowercase)

In [40]:
len(eng_stopwords) # number of stopwords

198

In [41]:
stopwords.words('spanish') # displays the Spanish stopword list

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'más',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'sí',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'también',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mí',
 'antes',
 'algunos',
 'qué',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'él',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tú',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mío',
 'mía',
 'míos',
 'mías',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',
 'vuestro'

In [42]:
stopwords.words('arabic') 

['إذ',
 'إذا',
 'إذما',
 'إذن',
 'أف',
 'أقل',
 'أكثر',
 'ألا',
 'إلا',
 'التي',
 'الذي',
 'الذين',
 'اللاتي',
 'اللائي',
 'اللتان',
 'اللتيا',
 'اللتين',
 'اللذان',
 'اللذين',
 'اللواتي',
 'إلى',
 'إليك',
 'إليكم',
 'إليكما',
 'إليكن',
 'أم',
 'أما',
 'أما',
 'إما',
 'أن',
 'إن',
 'إنا',
 'أنا',
 'أنت',
 'أنتم',
 'أنتما',
 'أنتن',
 'إنما',
 'إنه',
 'أنى',
 'أنى',
 'آه',
 'آها',
 'أو',
 'أولاء',
 'أولئك',
 'أوه',
 'آي',
 'أي',
 'أيها',
 'إي',
 'أين',
 'أين',
 'أينما',
 'إيه',
 'بخ',
 'بس',
 'بعد',
 'بعض',
 'بك',
 'بكم',
 'بكم',
 'بكما',
 'بكن',
 'بل',
 'بلى',
 'بما',
 'بماذا',
 'بمن',
 'بنا',
 'به',
 'بها',
 'بهم',
 'بهما',
 'بهن',
 'بي',
 'بين',
 'بيد',
 'تلك',
 'تلكم',
 'تلكما',
 'ته',
 'تي',
 'تين',
 'تينك',
 'ثم',
 'ثمة',
 'حاشا',
 'حبذا',
 'حتى',
 'حيث',
 'حيثما',
 'حين',
 'خلا',
 'دون',
 'ذا',
 'ذات',
 'ذاك',
 'ذان',
 'ذانك',
 'ذلك',
 'ذلكم',
 'ذلكما',
 'ذلكن',
 'ذه',
 'ذو',
 'ذوا',
 'ذواتا',
 'ذواتي',
 'ذي',
 'ذين',
 'ذينك',
 'ريث',
 'سوف',
 'سوى',
 'شتان',
 'عدا',
 'عسى',
 'عل'

In [57]:
speech_sentences = nltk.sent_tokenize(paragraph)
speech_sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my \n               first vision is that of freedom.',
 'I believe that India got its first vision of \n               this in 1857, when we started the War of Independence.',
 'It is this freedom that\n               we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s developme

In [58]:
speech_words = nltk.word_tokenize(paragraph)
speech_words

['I',
 'have',
 'three',
 'visions',
 'for',
 'India',
 '.',
 'In',
 '3000',
 'years',
 'of',
 'our',
 'history',
 ',',
 'people',
 'from',
 'all',
 'over',
 'the',
 'world',
 'have',
 'come',
 'and',
 'invaded',
 'us',
 ',',
 'captured',
 'our',
 'lands',
 ',',
 'conquered',
 'our',
 'minds',
 '.',
 'From',
 'Alexander',
 'onwards',
 ',',
 'the',
 'Greeks',
 ',',
 'the',
 'Turks',
 ',',
 'the',
 'Moguls',
 ',',
 'the',
 'Portuguese',
 ',',
 'the',
 'British',
 ',',
 'the',
 'French',
 ',',
 'the',
 'Dutch',
 ',',
 'all',
 'of',
 'them',
 'came',
 'and',
 'looted',
 'us',
 ',',
 'took',
 'over',
 'what',
 'was',
 'ours',
 '.',
 'Yet',
 'we',
 'have',
 'not',
 'done',
 'this',
 'to',
 'any',
 'other',
 'nation',
 '.',
 'We',
 'have',
 'not',
 'conquered',
 'anyone',
 '.',
 'We',
 'have',
 'not',
 'grabbed',
 'their',
 'land',
 ',',
 'their',
 'culture',
 ',',
 'their',
 'history',
 'and',
 'tried',
 'to',
 'enforce',
 'our',
 'way',
 'of',
 'life',
 'on',
 'them',
 '.',
 'Why',
 '?',
 '

Next, filter the stopwords out of each sentence and then apply stemming or lemmatization to the remaining words.

1) Using Porter Stemmer

In [48]:
# Stem each sentence with the Porter stemmer after dropping stopwords.
# The stopword check is case-sensitive, so capitalized stopwords ("I", "We") are kept.
porter_sentences = []
for sentence in speech_sentences:
    kept = [porter_stemmer.stem(word) for word in nltk.word_tokenize(sentence)
            if word not in eng_stopwords]
    porter_sentences.append(' '.join(kept))  # join the stems back into a sentence
porter_sentences

['i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom others.that first vision freedom .',
 'i believ india got first vision 1857 , start war independ .',
 'it freedom must protect nurtur build .',
 'if free , one respect us .',
 'my second vision india ’ develop .',
 'for fifti year develop nation .',
 'it time see develop nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recognis today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onl

2) Using Snowball Stemmer

In [52]:
# Stem each sentence with the Snowball stemmer after dropping stopwords.
# The stopword check is case-sensitive, so capitalized stopwords ("I", "We") are kept.
snowball_sentences = []
for sentence in speech_sentences:
    kept = [snowball_stemmer.stem(word) for word in nltk.word_tokenize(sentence)
            if word not in eng_stopwords]
    snowball_sentences.append(' '.join(kept))  # join the stems back into a sentence
snowball_sentences

['i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom others.that first vision freedom .',
 'i believ india got first vision 1857 , start war independ .',
 'it freedom must protect nurtur build .',
 'if free , one respect us .',
 'my second vision india ’ develop .',
 'for fifti year develop nation .',
 'it time see develop nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recognis today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onl

3) Using Lemmatization

In [60]:
# Lemmatize each sentence (default POS is noun) after dropping stopwords.
# The stopword check is case-sensitive, so capitalized stopwords ("I", "We") are kept.
lemma_sentences = []
for sentence in speech_sentences:
    kept = [lemmatizer.lemmatize(word.lower()) for word in nltk.word_tokenize(sentence)
            if word not in eng_stopwords]
    lemma_sentences.append(' '.join(kept))  # join the lemmas back into a sentence
lemma_sentences

['i three vision india .',
 'in 3000 year history , people world come invaded u , captured land , conquered mind .',
 'from alexander onwards , greek , turk , mogul , portuguese , british , french , dutch , came looted u , took .',
 'yet done nation .',
 'we conquered anyone .',
 'we grabbed land , culture , history tried enforce way life .',
 'why ?',
 'because respect freedom others.that first vision freedom .',
 'i believe india got first vision 1857 , started war independence .',
 'it freedom must protect nurture build .',
 'if free , one respect u .',
 'my second vision india ’ development .',
 'for fifty year developing nation .',
 'it time see developed nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverty level falling .',
 'our achievement globally recognised today .',
 'yet lack self-confidence see developed nation , self-reliant self-assured .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'becaus


**Why Remove Stopwords?**

✅ **Reduces dataset size** → Speeds up processing.  
✅ **Can improve accuracy** → Less noise in text analysis, although the effect depends on the task.  
✅ **Enhances model performance** → Focuses on important words.  

---

**When NOT to Remove Stopwords?**

❌ If stopwords **carry meaning** (e.g., in **sentiment analysis** or **question classification**).  
❌ If you are working with **short texts** (e.g., chatbot responses).  
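The sentiment case is easy to demonstrate. In the sketch below, `STOPWORDS` and `NEGATIONS` are small hand-written sets chosen for illustration (assumptions, not NLTK's lists), and `filter_tokens` is a hypothetical helper. Removing the negation flips the apparent meaning of the review, while subtracting the negation words from the stopword set preserves it.

```python
# Illustrative stopword set that, like many standard lists, contains negations.
STOPWORDS = {"the", "is", "was", "this", "not", "no", "a"}
NEGATIONS = {"not", "no", "nor", "never"}

def filter_tokens(tokens, stopwords):
    """Keep tokens whose lowercase form is not in the given stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

review = "This movie was not good".split()

print(filter_tokens(review, STOPWORDS))
# ['movie', 'good']  -> the negation is gone, so the meaning flips

print(filter_tokens(review, STOPWORDS - NEGATIONS))
# ['movie', 'not', 'good']
```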

---

**Customizing Stopwords List**

You can **add or remove** stopwords based on your use case.

```python
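from nltk.corpus import stopwords

# stopwords.words() returns a list; a set supports add/discard and fast lookups
stop_words = set(stopwords.words('english'))

# Add a custom stopword
stop_words.add("example")

# Remove a stopword (discard does not raise if the word is absent)
stop_words.discard("not")  # Useful if you need negation in sentiment analysis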
```