# Natural Language Processing

### **1. Tokenization**

In [1]:
import nltk # type: ignore

In [2]:
corpus = """Hello world! This is a sample text for natural language processing. 
We are going to perform tokenization on it using NLTK's tokenizer functionality."""
# tokenizes by sentence.
sentences = nltk.sent_tokenize(corpus)
sentences

['Hello world!',
 'This is a sample text for natural language processing.',
 "We are going to perform tokenization on it using NLTK's tokenizer functionality."]

In [3]:
len(sentences) # number of sentences in the corpus

3

In [4]:
tokens = nltk.word_tokenize(corpus)
print(tokens)

['Hello', 'world', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.', 'We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'s", 'tokenizer', 'functionality', '.']


In [5]:
len(tokens) # number of word and punctuation tokens in the corpus

27

In [6]:
# print words of each sentence
for sentence in sentences:
    print(nltk.word_tokenize(sentence))

['Hello', 'world', '!']
['This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.']
['We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'s", 'tokenizer', 'functionality', '.']


In [7]:
from nltk.tokenize import wordpunct_tokenize
tokens = wordpunct_tokenize(corpus)
print(tokens)

['Hello', 'world', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.', 'We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'", 's', 'tokenizer', 'functionality', '.']


The difference between `word_tokenize` and `wordpunct_tokenize` lies in how they treat the apostrophe. In the previous example, `word_tokenize` kept `'s` together as a single token. Here, `wordpunct_tokenize` separates it into `'` and `s`.
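The splitting behaviour of `wordpunct_tokenize` can be reproduced with one regular expression. The sketch below assumes the pattern `\w+|[^\w\s]+` (a run of word characters, or a run of punctuation), which is the pattern NLTK documents for `WordPunctTokenizer`. The helper name `wordpunct_like` is made up for illustration.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    """Hypothetical helper: return every match of the pattern, in order."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_like("using NLTK's tokenizer."))
# ['using', 'NLTK', "'", 's', 'tokenizer', '.']
```

Because the apostrophe is punctuation and `s` is a word character, the two can never land in the same token under this pattern.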

#### TreebankWordTokenizer

In [8]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 'world',
 '!',
 'This',
 'is',
 'a',
 'sample',
 'text',
 'for',
 'natural',
 'language',
 'processing.',
 'We',
 'are',
 'going',
 'to',
 'perform',
 'tokenization',
 'on',
 'it',
 'using',
 'NLTK',
 "'s",
 'tokenizer',
 'functionality',
 '.']

NLTK provides several word tokenizers, each suited for different use cases. Here are the key tokenizers and their differences:

##### 1. **`word_tokenize` (Recommended)**
   - **Implementation:** Splits the text into sentences with the Punkt sentence tokenizer, then applies a Penn Treebank-style word tokenizer to each sentence.
   - **Features:** Handles punctuation, contractions, and special cases like "U.S." correctly.
   - **Example:**
     ```python
     from nltk.tokenize import word_tokenize
     text = "I'm going to the U.S. next week!"
     print(word_tokenize(text))
     ```
     **Output:**
     ```python
     ['I', "'m", 'going', 'to', 'the', 'U.S.', 'next', 'week', '!']
     ```
   - **Use Case:** General-purpose word tokenization.

---

##### 2. **`TreebankWordTokenizer`**
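   - **Implementation:** Applies the Penn Treebank tokenization rules to the whole string without splitting it into sentences first, which is why `'processing.'` keeps its period in the output above.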
   - **Features:** Splits contractions (e.g., "can't" → ["ca", "n't"]), handles punctuation.
   - **Example:**
     ```python
     from nltk.tokenize import TreebankWordTokenizer
     tokenizer = TreebankWordTokenizer()
     print(tokenizer.tokenize("Can't won't don't"))
     ```
     **Output:**
     ```python
     ['Ca', "n't", 'wo', "n't", 'do', "n't"]
     ```
   - **Use Case:** When working with text where contractions need to be split.
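To make the contraction rule concrete, here is a simplified sketch of Treebank-style splitting written with plain regular expressions. It is not NLTK's actual rule set; the function name `split_contractions` and the two patterns are assumptions chosen to mirror the output shown above.

```python
import re

def split_contractions(text):
    """Hypothetical helper: Treebank-style contraction splitting, simplified."""
    # Peel off "n't" first, so "can't" becomes "ca" + "n't".
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    # Then split common clitics such as 's, 'm, 're, 've, 'll and 'd.
    text = re.sub(r"(?i)(\w)('(?:s|m|re|ve|ll|d))\b", r"\1 \2", text)
    return text.split()

print(split_contractions("Can't won't don't"))
# ['Ca', "n't", 'wo', "n't", 'do', "n't"]
print(split_contractions("I'm sure NLTK's fine"))
# ['I', "'m", 'sure', 'NLTK', "'s", 'fine']
```

The sketch ignores punctuation entirely; it only shows where the contraction boundary is placed.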

A **contraction** is a shortened form of one or more words where missing letters are replaced by an apostrophe (`'`). Contractions are commonly used in informal writing and speech.  

##### **Examples of Contractions:**
| Full Form | Contraction |
|-----------|------------|
| I am | I'm |
| You are | You're |
| He is / He has | He's |
| They are | They're |
| Cannot | Can't |
| Will not | Won't |
| Do not | Don't |
| Should not | Shouldn't |
| Would have | Would've |

##### **Why Do Contractions Matter in NLP?**
- Some tokenizers **split contractions** into separate words (`"can't"` → `["ca", "n't"]`).
- Others **keep contractions intact** (`"can't"` → `["can't"]`).
- Handling contractions correctly is important for **sentiment analysis**, **text preprocessing**, and **machine learning models**.
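One common preprocessing option is to expand contractions before tokenizing, so the tokenizer never sees them. The sketch below uses a small hand-written lookup table based on the examples above; `CONTRACTIONS` and `expand_contractions` are hypothetical names, and the replacement text is always lower-case apart from "I", so the original capitalization is not preserved.

```python
import re

# Small lookup table built from the examples above. "he's" is left out on
# purpose: it can mean "he is" or "he has", so expanding it needs context.
CONTRACTIONS = {
    "i'm": "I am",
    "you're": "you are",
    "they're": "they are",
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "shouldn't": "should not",
    "would've": "would have",
}

# One alternation of all known contractions, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Hypothetical helper: replace known contractions with their full forms."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure they won't mind"))
# I am sure they will not mind
```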




##### 3. **`WordPunctTokenizer`**
   - **Implementation:** Splits words and punctuation separately.
   - **Features:** Splits at every boundary between word characters and punctuation, so contractions are broken at the apostrophe (e.g., "can't" → ["can", "'", "t"]).
   - **Example:**
     ```python
     from nltk.tokenize import WordPunctTokenizer
     tokenizer = WordPunctTokenizer()
     print(tokenizer.tokenize("I'm excited!"))
     ```
     **Output:**
     ```python
     ['I', "'", 'm', 'excited', '!']
     ```
   - **Use Case:** When punctuation needs to be treated as separate tokens.

---
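NLTK also ships other tokenizers, such as `ToktokTokenizer`, `RegexpTokenizer`, and `MWETokenizer`. They are included in the summary below.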

##### **Summary**
| Tokenizer | Handles Contractions? | Splits Punctuation? | Use Case |
|-----------|-----------------------|---------------------|----------|
| `word_tokenize` | Yes (`"can't"` → `ca`, `n't`) | Yes, sentence by sentence | General NLP tasks |
| `TreebankWordTokenizer` | Yes (same rules as `word_tokenize`) | Yes, but a period is split only at the end of the string | Text already split into sentences |
| `ToktokTokenizer` | No | Yes | Fast and simple tokenization |
| `RegexpTokenizer` | Custom | Custom | Custom rules-based tokenization |
| `WordPunctTokenizer` | No (splits at the apostrophe) | Yes (separates all punctuation) | When punctuation should be separate |
| `MWETokenizer` | No | No | Preserve multi-word expressions |
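Two of the tokenizers in the table have no example earlier in this notebook, so here is a brief sketch of `RegexpTokenizer` and `MWETokenizer`. The sample strings and the multi-word expression are chosen for illustration only.

```python
from nltk.tokenize import MWETokenizer, RegexpTokenizer

# RegexpTokenizer: keep only runs of word characters, dropping punctuation.
regexp_tokenizer = RegexpTokenizer(r"\w+")
regexp_tokens = regexp_tokenizer.tokenize("Hello world! NLP is fun.")
print(regexp_tokens)
# ['Hello', 'world', 'NLP', 'is', 'fun']

# MWETokenizer works on an already tokenized list and merges the listed
# multi-word expressions into single tokens joined by the separator.
mwe_tokenizer = MWETokenizer([("natural", "language", "processing")], separator="_")
mwe_tokens = mwe_tokenizer.tokenize("a sample text for natural language processing".split())
print(mwe_tokens)
# ['a', 'sample', 'text', 'for', 'natural_language_processing']
```

Neither class needs downloaded NLTK data, unlike `word_tokenize` and `sent_tokenize`.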


### **2. Stemming**

**Stemming** is the process of reducing a word to its root form (also called a "stem") by **removing suffixes**. It helps normalize words so that variations of a word are treated as the same.  

For example:  
- **"running" → "run"**  
- **"flies" → "fli"** (not a real word, which is common with rule-based stemmers)  
- **"happily" → "happili"**  

Stemming is a **rule-based** approach and doesn't always produce real words. It just chops off endings based on predefined rules.

---

**Limitations of Stemming**
- Doesn't always produce real words (e.g., *"flies"* → *"fli"*)  
- Different words may map to the same stem incorrectly  
- Over-stemming (too aggressive) or under-stemming (not aggressive enough)  
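These limitations are easy to reproduce with a toy suffix stripper. The sketch below is not one of NLTK's stemmers; `SUFFIXES`, `naive_stem`, and the `min_stem` threshold are made up to show each failure mode in a few lines.

```python
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Toy suffix stripper (not a real stemmer): drop the first matching suffix."""
    for suffix in SUFFIXES:
        # Only strip if at least `min_stem` characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Not a real word: the rule fires without consulting a dictionary.
print(naive_stem("flies"))                        # fli
# Over-stemming: two unrelated words end up with the same stem.
print(naive_stem("news"), naive_stem("new"))      # new new
# Under-stemming: two forms of the same verb keep different stems.
print(naive_stem("running"), naive_stem("runs"))  # runn run
```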

---

**Stemming vs Lemmatization**

If you need more **accurate** root words, **lemmatization** is a better choice because it uses a **dictionary-based** approach instead of just chopping off suffixes.


In [9]:
words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized"]

In [10]:
# Porter Stemming
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
for word in words:
    print(word, "→", porter_stemmer.stem(word))

eating → eat
eats → eat
eaten → eaten
writing → write
writes → write
programming → program
programs → program
history → histori
finally → final
finalized → final


The change from "history" to "histori" shows the main disadvantage of stemming: the stem is not a valid English word. Note also that "finally" and "finalized" are both reduced to "final", even though they are different words.

In [11]:
# Another example
porter_stemmer.stem("congratulations")

'congratul'

"congratul" is not a real word. A stem like this works as a normalized key for matching, but it is not readable text.

In [12]:
# Regexp Stemming
from nltk.stem import RegexpStemmer

# Define a regex pattern to remove common suffixes (-ing, -ed, -ly)
regexp_stemmer = RegexpStemmer(r'ing$|ed$|ly$', min=4)

print(regexp_stemmer.stem("running"))   # Output: runn
print(regexp_stemmer.stem("happily"))   # Output: happi
print(regexp_stemmer.stem("worked"))    # Output: work

runn
happi
work


In [13]:
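# Without the `$` anchor, "ing" is removed wherever it occurs, including at the start of the word.
regexp_stemmer = RegexpStemmer(r'ing|ed$|ly$')
print(regexp_stemmer.stem("ingeating"))  # eat
# With `ing$`, only a trailing "ing" is removed.
regexp_stemmer = RegexpStemmer(r'ing$|ed$|ly$')
print(regexp_stemmer.stem("ingeating"))  # ingeat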

eat
ingeat
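The outputs above are consistent with a very simple mechanism: delete every match of the pattern, unless the word is shorter than `min`. The sketch below reproduces that with the standard `re` module. It assumes this is how `RegexpStemmer` behaves; `regexp_stem` and `min_length` are hypothetical names.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Hypothetical re-implementation: delete every match of `pattern`,
    unless the word is shorter than `min_length`."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

print(regexp_stem("ingeating", r"ing|ed$|ly$"))   # eat
print(regexp_stem("ingeating", r"ing$|ed$|ly$"))  # ingeat
print(regexp_stem("red", r"ing$|ed$|ly$"))        # r
print(regexp_stem("red", r"ing$|ed$|ly$", 4))     # red
```

The last two lines show why a minimum length is useful: without it, the short word "red" is reduced to a single letter.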


In [14]:
# snowball stemmer
from nltk.stem.snowball import SnowballStemmer
snowball_stemmer = SnowballStemmer(language='english')
print(snowball_stemmer.stem("happily"))
print(snowball_stemmer.stem("worked"))

happili
work


In [15]:
for word in words:
    print(word, "→", snowball_stemmer.stem(word))

eating → eat
eats → eat
eaten → eaten
writing → write
writes → write
programming → program
programs → program
history → histori
finally → final
finalized → final


**Comparing all three stemmers**

In [16]:
porter_stemmer.stem("fairly"), porter_stemmer.stem("sportingly")

('fairli', 'sportingli')

In [17]:
regexp_stemmer.stem("fairly"), regexp_stemmer.stem("sportingly")

('fair', 'sporting')

In [18]:
snowball_stemmer.stem("fairly"), snowball_stemmer.stem("sportingly")

('fair', 'sport')
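The three cells above can also be collapsed into a single comparison. This is only a convenience sketch: the dictionary names `stemmers` and `comparison` are made up, and the regex pattern is the anchored one used earlier.

```python
from nltk.stem import PorterStemmer, RegexpStemmer
from nltk.stem.snowball import SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "regexp": RegexpStemmer(r"ing$|ed$|ly$"),
    "snowball": SnowballStemmer(language="english"),
}

# Map each word to the stem produced by every stemmer.
comparison = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in ["fairly", "sportingly"]
}

for word, stems in comparison.items():
    print(word, "→", stems)
```

Putting the results side by side makes it easier to see that only Snowball reduces "sportingly" all the way to "sport".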

**1️⃣ Porter Stemmer**

✅ **Pros:**  
- One of the most widely used stemming algorithms.  
- Uses a **set of heuristic rules** to remove common suffixes.  
- Efficient and relatively fast.  

❌ **Cons:**  
- Sometimes over-stems words (e.g., `"flies"` → `"fli"`).  
- Does not always produce real words.  
- **Not customizable** (fixed rules).  

---

**2️⃣ Regex Stemmer**

✅ **Pros:**  
- **Customizable**: You define the regex pattern to remove suffixes.  
- Useful for **domain-specific** text processing.  
- Can prevent over-stemming by **setting minimum word length** (`min=` parameter).  

❌ **Cons:**  
- Requires **manual regex tuning** for different datasets.  
- May **miss irregular word forms** (e.g., `"better"` won’t stem to `"good"`).  
- **Not language-aware** (simply removes predefined suffixes).  

---

**3️⃣ Snowball Stemmer**

✅ **Pros:**  
- **Improved version of Porter Stemmer**.  
- Supports **multiple languages** (e.g., English, French, Spanish).  
- More **accurate and flexible** than Porter.  

❌ **Cons:**  
- **Slower than Porter Stemmer** due to additional rules.  
- Not as customizable as Regex Stemmer.  
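The multi-language support can be explored as follows. The non-English sample words are illustrative only, and no expected stems are listed for them because they depend on each language's rule set.

```python
from nltk.stem.snowball import SnowballStemmer

# Tuple of language names accepted by the constructor.
print(SnowballStemmer.languages)

# One stemmer per language; the word lists are illustrative only.
samples = {
    "english": ["running", "happily"],
    "spanish": ["corriendo", "felizmente"],
    "french": ["courant", "heureusement"],
}
for language, sample_words in samples.items():
    stemmer = SnowballStemmer(language)
    print(language, [stemmer.stem(w) for w in sample_words])
```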

---

**Comparison Table**

| Feature          | **Porter Stemmer** | **Regex Stemmer** | **Snowball Stemmer** |
|-----------------|------------------|-----------------|------------------|
| **Algorithm**   | Rule-based       | Regex-based    | Rule-based (Improved Porter) |
| **Customizable?** | ❌ No | ✅ Yes | ❌ No |
| **Language Support** | ❌ English Only | ❌ Manual | ✅ Multiple Languages |
| **Speed** | ✅ Fast | ✅ Fast | ❌ Slower (More Rules) |
| **Accuracy** | ❌ Can over-stem | ✅ Depends on Regex | ✅ More accurate than Porter |
| **Best Use Case** | General NLP tasks | Domain-specific text | Multi-language support |

---

**Which One Should You Use?**

🔹 **Use Porter Stemmer** → If you want a simple, fast stemming method for **English text**.  
🔹 **Use Regex Stemmer** → If you need **full control** over stemming rules for **custom datasets**.  
🔹 **Use Snowball Stemmer** → If you want **better accuracy** and support for **multiple languages**.  

### **3. Lemmatization**

**Lemmatization** is the process of reducing a word to its **base or dictionary form (lemma)** while ensuring it remains a real word. Unlike **stemming**, which just chops off suffixes, lemmatization considers **grammatical meaning** using a **lexical database** like WordNet.

---

**How Does Lemmatization Work?**

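- Uses the **part of speech (POS)** of the word, which you pass in through the `pos` argument; NLTK's `WordNetLemmatizer` does not infer it from the surrounding context.  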
- Uses a **dictionary lookup** to find the root form (lemma).  
- Ensures the output is a valid word.  
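The lookup idea can be sketched without WordNet. The toy lemmatizer below is a deliberately tiny stand-in, not NLTK's implementation: `EXCEPTIONS`, `RULES`, `VOCAB`, and `toy_lemmatize` are all hypothetical, and the vocabulary holds only a handful of words.

```python
# Irregular forms are looked up directly, keyed by (word, pos).
EXCEPTIONS = {("better", "a"): "good", ("mice", "n"): "mouse"}

# Suffix rewrite rules per part of speech: (suffix to remove, replacement).
RULES = {
    "n": [("ies", "y"), ("s", "")],
    "v": [("ies", "y"), ("ing", ""), ("s", "")],
    "a": [],
}

# Tiny stand-in for the dictionary; a candidate is accepted only if listed here.
VOCAB = {
    "n": {"study", "mouse", "program"},
    "v": {"eat", "study", "program"},
    "a": {"good"},
}

def toy_lemmatize(word, pos="n"):
    """Hypothetical helper: exception lookup, then dictionary-checked suffix rules."""
    word = word.lower()
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCAB.get(pos, set()):
                return candidate
    return word  # nothing matched: return the input unchanged

print(toy_lemmatize("studies"))          # study
print(toy_lemmatize("eating"))           # eating (treated as a noun)
print(toy_lemmatize("eating", pos="v"))  # eat
print(toy_lemmatize("better", pos="a"))  # good
```

The dictionary check is what separates this from stemming: a rewritten form is returned only if it is a known word for that part of speech.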

🔹 **Example:**  
| Word | Stemmed (Porter) | Lemmatized (WordNet, with the correct `pos`) |
|------|----------------|------------------|
| Running | run | run |
| Better | better | good |
| Studies | studi | study |
| Mice | mice | mouse |

---

🔹 **Explanation:**  

- `"running"` (verb) → `"run"` ✅  
- `"better"` (adjective) → `"good"` ✅  
- `"mice"` (noun) → `"mouse"` ✅  
- `"studies"` (noun) → `"study"` ✅  

🚨 **POS Tagging is Important!**  

- If no `pos` is provided, it assumes the word is a **noun**.  
- `"running"` → `"running"` (incorrect)  
- `"running", pos="v"` → `"run"` (correct)  
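In practice the `pos` value usually comes from a part-of-speech tagger rather than being typed by hand. The sketch below assumes Penn Treebank tags (the tag set returned by `nltk.pos_tag`) and maps them to the single-letter codes that `lemmatize` expects. The helper `penn_to_wordnet` is hypothetical, and the tagged pairs are written by hand so that the example needs no downloaded tagger model.

```python
def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet `pos` letter."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, ...
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else fall back to the default

# Example (word, tag) pairs written by hand; a tagger would normally supply them.
tagged = [("running", "VBG"), ("better", "JJR"), ("mice", "NNS"), ("fairly", "RB")]
for word, tag in tagged:
    print(word, tag, "→", penn_to_wordnet(tag))
```

With a lemmatizer instance at hand, the mapped value would be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.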

---

**Stemming vs Lemmatization**

| Feature | **Stemming** | **Lemmatization** |
|---------|-------------|------------------|
| **Method** | Removes suffixes | Uses dictionary lookup |
| **Grammar Aware?** | ❌ No | ✅ Yes |
| **Produces Real Words?** | ❌ Not always | ✅ Yes |
| **Computational Cost** | ✅ Fast | ❌ Slower |
| **Example ("better")** | **"better"** → **"better"** | **"better"** → **"good"** |

---

**When to Use Lemmatization?**

✅ **Linguistic Accuracy Required** (e.g., chatbots, search engines).  
✅ **When Meaning Matters** (e.g., sentiment analysis).  
✅ **If You Have Computational Resources** (since it's slower than stemming).  

In [21]:
from nltk.stem import WordNetLemmatizer  # needs the WordNet corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

In [22]:
for word in words:
    print(word, "→", lemmatizer.lemmatize(word))

eating → eating
eats → eats
eaten → eaten
writing → writing
writes → writes
programming → programming
programs → program
history → history
finally → finally
finalized → finalized


Almost nothing changes (only "programs" → "program") because `lemmatize` treats every word as a noun unless told otherwise. Passing a `pos` argument to `lemmatize` fixes this.

In [26]:
# POS tags accepted by lemmatize():
#   n = noun (the default), v = verb, a = adjective, r = adverb
for word in words:
    print(word, "→", lemmatizer.lemmatize(word, pos='v'))

eating → eat
eats → eat
eaten → eat
writing → write
writes → write
programming → program
programs → program
history → history
finally → finally
finalized → finalize


In [25]:
for word in words:
    print(word, "→", lemmatizer.lemmatize(word, pos='a'))

eating → eating
eats → eats
eaten → eaten
writing → writing
writes → writes
programming → programming
programs → programs
history → history
finally → finally
finalized → finalized


In [28]:
lemmatizer.lemmatize("goes", pos='v')

'go'

In [30]:
lemmatizer.lemmatize("fairly", pos='a'), lemmatizer.lemmatize("sportingly", pos='v')

('fairly', 'sportingly')

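**Advantage of Lemmatization over Stemming**

A stemmer leaves the irregular form "better" untouched, whereas the lemmatizer maps it to "good" when told that it is an adjective.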

In [31]:
snowball_stemmer.stem("better")

'better'

In [33]:
lemmatizer.lemmatize("better", 'a')

'good'