```{contents}
```

# Text Preprocessing

### **Lowercasing**

* Convert all text into lowercase to avoid treating `"Apple"` and `"apple"` as different words.
* Example: `"Natural Language Processing"` → `"natural language processing"`

```python
text = "Natural Language Processing"
text_lower = text.lower()
print(text_lower)  # 'natural language processing'
```

---

### **Tokenization**

* Split text into smaller units (tokens), typically sentences or words.
* Example: `"I love NLP."` → `['I', 'love', 'NLP', '.']`

```python
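import nltk
from nltk.tokenize import word_tokenize
nltk.download("punkt")  # tokenizer models; newer NLTK releases may ask for "punkt_tab" instead
print(word_tokenize("I love NLP."))  # ['I', 'love', 'NLP', '.']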
```

---

### **Removing Punctuation / Special Characters**

* Punctuation often doesn’t add much meaning in text classification tasks.
* Example: `"Hello!!! How are you??"` → `"Hello How are you"`

```python
import re
import string
from nltk.tokenize import word_tokenize

# Regex approach: drop every character that is not a word character or whitespace.
text = "Hello!!! How are you??"
cleaned = re.sub(r'[^\w\s]', '', text)
print(cleaned)  # Hello How are you

# Token approach: drop tokens made up entirely of punctuation (this also covers '...').
text = "Hello!!! How are you?? I'm fine... thanks :)"
tokens = word_tokenize(text)
tokens = [tok for tok in tokens if not all(ch in string.punctuation for ch in tok)]
print(tokens)
```

---

### **Stopword Removal**

* Stopwords are common words (e.g., "is", "the", "and") that carry little meaning.
* Example: `"This is a good book"` → `"good book"`

```python
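import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))  # build the set once for fast lookups
words = word_tokenize("This is a good book")
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['good', 'book']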
```

---

### **Stemming**

* Reduce words to their **root form** by stripping suffixes; the result is not always a valid word.
* Example: `"playing" → "play", "studies" → "studi"`

```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("playing"))  # play
print(stemmer.stem("studies"))  # studi
```

---

### **Lemmatization**

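* Similar to stemming, but uses a vocabulary and part-of-speech information, so it produces valid dictionary words (lemmas).
* Example: `"studies" → "study", "better" → "good"` (the second needs the adjective part-of-speech tag)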

```python
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # study (words are treated as nouns by default)
print(lemmatizer.lemmatize("better", pos="a"))  # good (pos="a" marks the word as an adjective)
```

---

### **Handling Numbers**

* Whether numbers are useful depends on the task, so there are two common options.

  * Option 1: Remove numbers → `"I bought 3 apples"` → `"I bought apples"`
  * Option 2: Keep numbers but normalize → `"3"` → `"three"`
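
Both options can be sketched with the standard `re` module. This is a minimal illustration, not a complete solution: the `DIGIT_WORDS` table and both function names are hypothetical, and the second function only handles standalone single digits.

```python
import re

# Hypothetical lookup table for illustration; it only covers single digits.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def remove_numbers(text):
    """Option 1: drop digit runs, then collapse the leftover whitespace."""
    without_digits = re.sub(r"[0-9]+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()

def spell_out_digits(text):
    """Option 2: replace standalone single digits with their word form."""
    return re.sub(r"\b[0-9]\b", lambda m: DIGIT_WORDS[m.group(0)], text)

print(remove_numbers("I bought 3 apples"))    # I bought apples
print(spell_out_digits("I bought 3 apples"))  # I bought three apples
```

Multi-digit numbers, decimals, and dates are left untouched by `spell_out_digits` or handled only crudely, so real data would likely need a more careful approach.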

---

### **Handling Emojis / Emoticons (Optional)**

* Emojis can carry meaning in sentiment analysis.

  * Example: convert the emoji to a word so its meaning is kept: `"😊"` → `"happy"`
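
One simple way to keep the signal an emoji carries is to swap it for a descriptive word before further processing. The sketch below assumes a hand-written mapping; `EMOJI_WORDS` and `replace_emojis` are hypothetical names, and the table is far too small for real use.

```python
# Hypothetical mapping for illustration; a real project would need a much larger table.
EMOJI_WORDS = {
    "😊": "happy",
    "😢": "sad",
    ":)": "happy",
    ":(": "sad",
}

def replace_emojis(text):
    """Replace each known emoji or emoticon with a descriptive word."""
    for symbol, word in EMOJI_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    # Collapse the extra spaces introduced by the padding above.
    return " ".join(text.split())

print(replace_emojis("Just got the job 😊"))  # Just got the job happy
```

If emojis are irrelevant to the task, the same loop can replace each symbol with an empty string instead.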

---

### **Spelling Correction**

* Example: `"I lve NLP"` → `"I love NLP"`

Libraries like `TextBlob` or `SymSpell` can fix spelling.
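
To show the basic idea without extra dependencies, here is a rough sketch that uses the standard-library `difflib` module as a stand-in. It is not how `TextBlob` or `SymSpell` work internally; the `VOCABULARY` list and both function names are hypothetical, and the 0.6 similarity cutoff is an arbitrary choice.

```python
import difflib

# Hypothetical toy vocabulary; real spell checkers use large frequency dictionaries.
VOCABULARY = ["i", "love", "nlp", "text", "language"]

def correct_word(word):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    if word.lower() in VOCABULARY:
        return word  # already a known word, keep the original casing
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=0.6)
    return matches[0] if matches else word

def correct_text(text):
    """Correct a whitespace-separated sentence word by word."""
    return " ".join(correct_word(w) for w in text.split())

print(correct_text("I lve NLP"))  # I love NLP
```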

---

### **Text Normalization**

* Expanding contractions → `"don't"` → `"do not"`
* Normalizing slang → `"u"` → `"you"`
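
Both kinds of normalization can be approximated with lookup tables applied token by token. The sketch below makes that assumption; `CONTRACTIONS`, `SLANG`, and `normalize` are hypothetical names with deliberately tiny tables, and the text is lowercased first so the lookups match.

```python
import re

# Hypothetical lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am"}
SLANG = {"u": "you", "r": "are", "gr8": "great"}

def normalize(text):
    """Lowercase the text, then expand known contractions and slang tokens."""
    def expand(match):
        token = match.group(0)
        return CONTRACTIONS.get(token, SLANG.get(token, token))
    # Match whole tokens (letters, digits, apostrophes) so "u" inside "but" is untouched.
    return re.sub(r"[a-z0-9']+", expand, text.lower())

print(normalize("I don't know u"))  # i do not know you
```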

---

### **Vectorization**

* Final step: convert the cleaned tokens into numerical vectors that a model can work with.

  * **Bag of Words (BoW)**
  * **TF-IDF (Term Frequency – Inverse Document Frequency)**
  * **Word Embeddings (Word2Vec, GloVe, BERT, etc.)**
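
To make the first two options concrete, here is a pure-Python sketch on a toy corpus. It uses the plain textbook formula `tf * log(N / df)`; libraries such as scikit-learn (`CountVectorizer`, `TfidfVectorizer`) apply smoothing and normalization by default, so their numbers will differ. The function names and the corpus are hypothetical.

```python
import math
from collections import Counter

docs = [["good", "book"], ["bad", "book"]]  # already tokenized toy corpus

# Vocabulary: one vector position per distinct token, in sorted order.
vocab = sorted({tok for doc in docs for tok in doc})

def bag_of_words(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def tf_idf(doc):
    """Weight each term count by how rare the term is across the corpus."""
    counts = Counter(doc)
    vector = []
    for term in vocab:
        tf = counts[term] / len(doc)            # term frequency in this document
        df = sum(1 for d in docs if term in d)  # number of documents containing the term
        idf = math.log(len(docs) / df)          # unsmoothed inverse document frequency
        vector.append(tf * idf)
    return vector

print(vocab)                  # ['bad', 'book', 'good']
print(bag_of_words(docs[0]))  # [0, 1, 1]
print(tf_idf(docs[0]))        # 'book' gets weight 0.0 because it appears in every document
```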

---

**Workflow Summary**

1. Lowercasing
2. Tokenization (sentence/word level)
3. Cleaning (punctuation, numbers, special chars)
4. Stopword removal
5. Normalization (stemming/lemmatization)
6. Spelling correction / Slang normalization (if needed)
7. Convert to vectors (BoW, TF-IDF, embeddings)
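
Steps 1 to 4 can be chained into a single function. The sketch below uses only the standard library so it runs without downloads; the `STOPWORDS` set and the `preprocess` name are hypothetical, and steps 5 to 7 are left out for brevity.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "and", "is", "the", "this"}

def preprocess(text):
    """Lowercase, tokenize, clean, and remove stopwords from a raw string."""
    text = text.lower()                                 # 1. lowercasing
    tokens = re.findall(r"[a-z]+", text)                # 2-3. tokenize; punctuation and digits are dropped
    tokens = [t for t in tokens if t not in STOPWORDS]  # 4. stopword removal
    return tokens

print(preprocess("This is a GOOD book!!! I bought 3 copies."))
# ['good', 'book', 'i', 'bought', 'copies']
```

The `[a-z]+` pattern is ASCII-only and splits contractions such as "don't", so it is only a starting point for English text.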

---

⚡ In practice: preprocessing steps **depend on the task**.

* For **sentiment analysis**, emojis might matter.
* For **legal/medical NLP**, numbers and special terms matter.
* For **chatbots**, spelling correction and contraction expansion are crucial.

````{dropdown} Click here for Sections
```{tableofcontents}
```
````