# Natural Language Processing

### **1. Tokenization**

In [1]:
import nltk # type: ignore

In [2]:
corpus = """Hello world! This is a sample text for natural language processing. 
We are going to perform tokenization on it using NLTK's tokenizer functionality."""
# tokenizes by sentence.
sentences = nltk.sent_tokenize(corpus)
sentences

['Hello world!',
 'This is a sample text for natural language processing.',
 "We are going to perform tokenization on it using NLTK's tokenizer functionality."]

In [3]:
len(sentences) # number of sentences in the corpus

3

In [4]:
tokens = nltk.word_tokenize(corpus)
print(tokens)

['Hello', 'world', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.', 'We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'s", 'tokenizer', 'functionality', '.']


In [5]:
len(tokens) # number of word and punctuation tokens in the corpus

27

In [6]:
# print words of each sentence
for sentence in sentences:
    print(nltk.word_tokenize(sentence))

['Hello', 'world', '!']
['This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.']
['We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'s", 'tokenizer', 'functionality', '.']


In [8]:
from nltk.tokenize import wordpunct_tokenize
tokens = wordpunct_tokenize(corpus)
print(tokens)

['Hello', 'world', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'natural', 'language', 'processing', '.', 'We', 'are', 'going', 'to', 'perform', 'tokenization', 'on', 'it', 'using', 'NLTK', "'", 's', 'tokenizer', 'functionality', '.']


The difference between `word_tokenize` and `wordpunct_tokenize` is in how they treat the apostrophe: `word_tokenize` kept `'s` together as a single token, whereas `wordpunct_tokenize` splits the apostrophe off as a token of its own.
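
The apostrophe split follows from the regular expression behind `wordpunct_tokenize`. Below is a minimal sketch using only the standard-library `re` module, assuming the pattern `\w+|[^\w\s]+` that NLTK's `WordPunctTokenizer` is built on. The helper name `wordpunct_like` is made up for illustration and is not part of NLTK.

```python
import re

# Assumed pattern: runs of word characters, OR runs of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    """Illustrative helper (not an NLTK function): split text with the pattern above."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_like("using NLTK's tokenizer functionality."))
# ['using', 'NLTK', "'", 's', 'tokenizer', 'functionality', '.']
```

Because the apostrophe is not a word character, it can never be part of a `\w+` match, so it always ends up as its own token.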

#### TreebankWordTokenizer

In [9]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 'world',
 '!',
 'This',
 'is',
 'a',
 'sample',
 'text',
 'for',
 'natural',
 'language',
 'processing.',
 'We',
 'are',
 'going',
 'to',
 'perform',
 'tokenization',
 'on',
 'it',
 'using',
 'NLTK',
 "'s",
 'tokenizer',
 'functionality',
 '.']

Note that `'processing.'` keeps its period here. `TreebankWordTokenizer` assumes its input is a single sentence, so it only splits off a period at the very end of the string. `word_tokenize` avoids this by running the sentence tokenizer first.

NLTK provides several word tokenizers, each suited for different use cases. Here are the key tokenizers and their differences:

##### 1. **`word_tokenize` (Recommended)**
   - **Implementation:** Splits the text into sentences with the Punkt sentence tokenizer, then applies Penn Treebank-style word tokenization rules to each sentence.
   - **Features:** Handles punctuation, contractions, and special cases like "U.S." correctly.
   - **Example:**
     ```python
     from nltk.tokenize import word_tokenize
     text = "I'm going to the U.S. next week!"
     print(word_tokenize(text))
     ```
     **Output:**
     ```python
     ['I', "'m", 'going', 'to', 'the', 'U.S.', 'next', 'week', '!']
     ```
   - **Use Case:** General-purpose word tokenization.

---

##### 2. **`TreebankWordTokenizer`**
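   - **Implementation:** Applies the Penn Treebank tokenization rules directly, without the sentence-splitting step that `word_tokenize` performs first, so it expects one sentence at a time.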
   - **Features:** Splits contractions (e.g., "can't" → ["ca", "n't"]), handles punctuation.
   - **Example:**
     ```python
     from nltk.tokenize import TreebankWordTokenizer
     tokenizer = TreebankWordTokenizer()
     print(tokenizer.tokenize("Can't won't don't"))
     ```
     **Output:**
     ```python
     ['Ca', "n't", 'wo', "n't", 'do', "n't"]
     ```
   - **Use Case:** When working with text where contractions need to be split.

A **contraction** is a shortened form of one or more words where missing letters are replaced by an apostrophe (`'`). Contractions are commonly used in informal writing and speech.  

##### **Examples of Contractions:**
| Full Form | Contraction |
|-----------|------------|
| I am | I'm |
| You are | You're |
| He is / He has | He's |
| They are | They're |
| Cannot | Can't |
| Will not | Won't |
| Do not | Don't |
| Should not | Shouldn't |
| Would have | Would've |

##### **Why Do Contractions Matter in NLP?**
- Some tokenizers **split contractions** into separate words (`"can't"` → `["ca", "n't"]`).
- Others **keep contractions intact** (`"can't"` → `["can't"]`).
- Handling contractions correctly is important for **sentiment analysis**, **text preprocessing**, and **machine learning models**.
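
To make the two behaviours concrete, here is a rough sketch using only the standard library. The function `split_contractions` and its two regular expressions are simplified assumptions written for this example. They imitate the Treebank-style splitting shown above but are not NLTK's actual rules.

```python
import re

def split_contractions(text):
    """Simplified, illustrative contraction splitter (not NLTK's implementation)."""
    # "can't" -> "ca n't": detach the negation clitic n't
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # "I'm" -> "I 'm", "they're" -> "they 're", and similar clitics
    text = re.sub(r"(\w)('(?:s|m|d|ll|re|ve))\b", r"\1 \2", text, flags=re.IGNORECASE)
    return text.split()

sentence = "I can't go"

# Plain whitespace splitting keeps the contraction intact.
print(sentence.split())
# ['I', "can't", 'go']

# The rule-based splitter separates it into two tokens.
print(split_contractions(sentence))
# ['I', 'ca', "n't", 'go']
```

Whichever behaviour you choose, apply it consistently, since a model trained on `"n't"` tokens will not recognise an unsplit `"can't"`.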




##### 3. **`WordPunctTokenizer`**
   - **Implementation:** Regex-based; it matches the pattern `\w+|[^\w\s]+`, i.e. runs of word characters or runs of punctuation.
   - **Features:** Breaks contractions at the apostrophe into three tokens (e.g., "can't" → ["can", "'", "t"]).
   - **Example:**
     ```python
     from nltk.tokenize import WordPunctTokenizer
     tokenizer = WordPunctTokenizer()
     print(tokenizer.tokenize("I'm excited!"))
     ```
     **Output:**
     ```python
     ['I', "'", 'm', 'excited', '!']
     ```
   - **Use Case:** When punctuation needs to be treated as separate tokens.

---
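NLTK also provides other tokenizers, such as `ToktokTokenizer`, `RegexpTokenizer`, and `MWETokenizer`; they are included in the summary.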

##### **Summary**
| Tokenizer | Handles Contractions? | Splits Punctuation? | Use Case |
|-----------|------------------|----------------|---------|
| `word_tokenize` | Yes | Mostly | General NLP tasks |
| `TreebankWordTokenizer` | Yes (same rules as `word_tokenize`) | Yes (periods only at the end of the input) | Text already split into sentences |
| `ToktokTokenizer` | No | Yes | Fast and simple tokenization |
| `RegexpTokenizer` | Custom | Custom | Custom rules-based tokenization |
| `WordPunctTokenizer` | Splits at the apostrophe (`can`, `'`, `t`) | Yes (each run of punctuation becomes its own token) | When punctuation should be separate |
| `MWETokenizer` | No | No | Preserve multi-word expressions |
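
Two of the tokenizers in the table have not been shown yet. The sketch below assumes NLTK is installed and uses `RegexpTokenizer` and `MWETokenizer` from `nltk.tokenize`. The pattern `\w+` and the expression `("New", "York")` are arbitrary choices for illustration.

```python
from nltk.tokenize import MWETokenizer, RegexpTokenizer

# RegexpTokenizer: the pattern decides what counts as a token.
# Here only runs of word characters are kept, so punctuation is dropped.
word_only = RegexpTokenizer(r"\w+")
print(word_only.tokenize("I'm excited!"))
# ['I', 'm', 'excited']

# MWETokenizer works on an existing list of tokens and merges the
# multi-word expressions it was given into single tokens.
mwe = MWETokenizer([("New", "York")], separator="_")
print(mwe.tokenize("I live in New York".split()))
# ['I', 'live', 'in', 'New_York']
```

Note that `MWETokenizer` does not split raw text itself, which is why the sentence is split on whitespace before being passed in.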


### **2. Stemming**

**Stemming** is the process of reducing a word to its root form (also called a "stem") by **removing suffixes**. It helps normalize words so that variations of a word are treated as the same.  

For example:  
- **"running" → "run"**  
- **"flies" → "fli"** (not a real word, but typical stemmer output)  
- **"happily" → "happili"**  

Stemming is a **rule-based** approach and doesn't always produce real words. It just chops off endings based on predefined rules.
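
To show what "chopping off endings based on predefined rules" can look like, here is a deliberately tiny toy stemmer. It is not the Porter algorithm or any NLTK stemmer. The function `toy_stem`, its suffix list, and the `min_stem_len` parameter are all assumptions made up for this sketch.

```python
def toy_stem(word, min_stem_len=3):
    """Toy suffix stripper for illustration only (not a real stemming algorithm)."""
    word = word.lower()
    # Try each suffix in order and strip the first one that matches.
    for suffix in ("ing", "ed", "es", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            # Collapse a doubled final consonant: "runn" -> "run"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # No rule applied: return the word unchanged.
    return word

for word in ["running", "flies", "jumped", "is"]:
    print(word, "->", toy_stem(word))
# running -> run
# flies -> fli
# jumped -> jump
# is -> is
```

Even this small rule set reproduces the "flies" → "fli" behaviour: the rules know nothing about the vocabulary, only about character patterns.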

---

**Common Stemming Algorithms in NLTK**
1️⃣ **Porter Stemmer** (Most Common)  
Uses a set of heuristic rules to remove suffixes. It is simple but sometimes over-stems words.  
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))   # Output: run
print(stemmer.stem("flies"))     # Output: fli
print(stemmer.stem("happiness")) # Output: happi
```
✅ **Pros:** Fast, widely used  
❌ **Cons:** Sometimes removes too much, resulting in non-real words  

---
2️⃣ **Lancaster Stemmer** (Aggressive)  
More aggressive than Porter Stemmer, often reducing words excessively.  
```python
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
print(stemmer.stem("running"))   # Output: run
print(stemmer.stem("flies"))     # Output: fli
print(stemmer.stem("happiness")) # Output: happy
```
✅ **Pros:** Produces shorter stems, so more word variants collapse together  
❌ **Cons:** Can distort words too much  

---

3️⃣ **Snowball Stemmer** (Improved Porter Stemmer)  
A more advanced version of the Porter Stemmer, supporting multiple languages.  
```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
print(stemmer.stem("running"))   # Output: run
print(stemmer.stem("flies"))     # Output: fli
print(stemmer.stem("happiness")) # Output: happi
```
✅ **Pros:** Better than Porter, supports many languages  
❌ **Cons:** Still rule-based, may not handle all cases correctly  

---

**Limitations of Stemming**
- Doesn't always produce real words (e.g., *"flies"* → *"fli"*)  
- Different words may map to the same stem incorrectly  
- Over-stemming (too aggressive) or under-stemming (not aggressive enough)  
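
One way to spot over-stemming is to group words by their stem and look for groups that mix unrelated meanings. The sketch below assumes NLTK is installed. The helper `group_by_stem` is a made-up name for this example, and the word list is just one commonly cited case.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

def group_by_stem(words, stem):
    """Illustrative helper: map each stem to the list of words that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)

stemmer = PorterStemmer()
groups = group_by_stem(["universe", "university", "universal"], stemmer.stem)
print(groups)
```

With the Porter stemmer these three words should all end up under the same stem, even though "universe" and "university" mean quite different things.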

---

**Stemming vs Lemmatization**
If you need more **accurate** root words, **lemmatization** is a better choice because it uses a **dictionary-based** approach instead of just chopping off suffixes.
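
The dictionary-based idea can be sketched with a plain lookup table. This is only a toy: `LEMMA_TABLE` is a tiny hand-written dictionary invented for this example, whereas a real lemmatizer consults a full lexical resource and usually takes the part of speech into account.

```python
# Hypothetical hand-written table, for illustration only.
LEMMA_TABLE = {
    "flies": "fly",
    "running": "run",
    "ran": "run",
    "mice": "mouse",
}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is not in the table."""
    return LEMMA_TABLE.get(word.lower(), word)

for word in ["flies", "ran", "mice", "table"]:
    print(word, "->", toy_lemmatize(word))
# flies -> fly
# ran -> run
# mice -> mouse
# table -> table
```

Unlike a suffix-stripping stemmer, the lookup returns real words and can handle irregular forms such as "ran" and "mice", at the cost of needing a dictionary that covers the vocabulary.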
