# Handle Numbers in NLP

* Raw numbers (e.g., `12345`, `3.14159`, `2025`) **carry little semantic meaning on their own** for most text models: each distinct value is just another token.
* Numbers may represent:

  * **Quantities** (e.g., "10 apples")
  * **Dates** (e.g., "2025-09-18")
  * **Money** (e.g., "\$100")
  * **IDs and phone numbers** (usually irrelevant as features)
* If left untreated, every distinct number becomes its own vocabulary entry, so the vocabulary grows unnecessarily and model efficiency suffers.

---

## Common Approaches to Handle Numbers

### 1. **Remove Numbers**

* Use this when numbers are irrelevant to the task (e.g., in a review such as “Good movie 10/10”, the digits may not matter).
* Example:
  `"The price is 100 dollars"` → `"The price is dollars"`
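
A minimal sketch of this approach, assuming "number" means an integer or a simple decimal. The names `NUMBER` and `remove_numbers` are illustrative, not from any library. Deleting digits alone leaves a double space behind, so the sketch also collapses whitespace:

```python
import re

# Assumed definition of a number: digits with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def remove_numbers(text: str) -> str:
    """Delete numeric tokens, then collapse the whitespace they leave behind."""
    without_numbers = NUMBER.sub("", text)
    return re.sub(r"\s+", " ", without_numbers).strip()

print(remove_numbers("The price is 100 dollars"))  # The price is dollars
```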

---

### 2. **Replace with a Special Token**

* Replace any number with a placeholder like `<NUM>`.
* Example:
  `"The price is 100 dollars"` → `"The price is <NUM> dollars"`
* Useful in deep learning: the model still sees that a number occurred, but treats all numbers uniformly instead of learning a separate token for each value.
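
One way to sketch the replacement so that decimals and thousands separators become a single token rather than several. The pattern and the helper name `mask_numbers` are assumptions made for illustration:

```python
import re

# Assumed pattern: digits, optional ",ddd" thousands groups, optional decimal part.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def mask_numbers(text: str, token: str = "<NUM>") -> str:
    """Replace each numeric token with a single placeholder."""
    return NUMBER.sub(token, text)

print(mask_numbers("The price is 100 dollars"))   # The price is <NUM> dollars
print(mask_numbers("It cost 1,250.50 in total"))  # It cost <NUM> in total
```

With a plain `\d+` pattern, `1,250.50` would turn into three placeholders separated by punctuation, which is usually not what you want.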

---

### 3. **Normalize Numbers**

* Convert numbers into a canonical spelled-out form:

  * `"100" → "one hundred"`
  * `"3.14" → "three point one four"`
* Useful in speech or translation systems.
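
The conversion can be sketched in plain Python. This is a deliberately small illustration that only covers non-negative numbers below 1000 and reads decimal digits one by one. The helpers `int_to_words` and `number_to_words` are hypothetical names; a dedicated library would be a better fit for real use.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + int_to_words(rest) if rest else "")

def number_to_words(s: str) -> str:
    """Spell out a numeric string; digits after the decimal point are read singly."""
    whole, _, frac = s.partition(".")
    words = int_to_words(int(whole))
    if frac:
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return words

print(number_to_words("100"))   # one hundred
print(number_to_words("3.14"))  # three point one four
```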

---

### 4. **Bucket / Categorize**

* Group numbers into ranges:

  * Age: `23` → `20-30`
  * Salary: `85000` → `80k-90k`
* Example:
  `"He is 27 years old"` → `"He is [20-30] years old"`
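
A sketch of fixed-width bucketing, assuming half-open ranges (so `30` falls into `30-40`, not `20-30`). The function names and the default width of 10 are illustrative choices:

```python
import re

def bucket(value: int, width: int = 10) -> str:
    """Return a label such as "20-30" for the half-open range [20, 30)."""
    low = (value // width) * width
    return f"{low}-{low + width}"

def bucket_numbers(text: str, width: int = 10) -> str:
    """Replace every integer in the text with its bracketed bucket label."""
    return re.sub(r"\d+", lambda m: f"[{bucket(int(m.group()), width)}]", text)

print(bucket(23))                            # 20-30
print(bucket_numbers("He is 27 years old"))  # He is [20-30] years old
```

A different width per field is usually needed, e.g. `bucket(85000, 10_000)` gives `80000-90000` for salaries; shortening that label to `80k-90k` would be an extra formatting step.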

---

### 5. **Keep as Features (when meaningful)**

* In tasks like financial text mining, numbers are **critical**.
* Instead of discarding them, extract them as **separate numeric features** and feed them to the ML model alongside the text features.
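
As a rough illustration, the numbers in a text can be pulled out into a small feature dictionary. The feature names (`num_count`, `num_max`, `num_sum`) are assumptions chosen for this sketch; which statistics are useful depends on the task.

```python
import re

# Assumed definition of a number: digits with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numeric_features(text: str) -> dict:
    """Extract simple summary statistics of the numbers found in the text."""
    values = [float(m) for m in NUMBER.findall(text)]
    return {
        "num_count": len(values),
        "num_max": max(values, default=0.0),
        "num_sum": sum(values, 0.0),
    }

print(numeric_features("Revenue rose 12.5 percent to 340 million"))
# {'num_count': 2, 'num_max': 340.0, 'num_sum': 352.5}
```

These values can then be concatenated with a bag-of-words or embedding vector, while the text itself is cleaned with one of the other approaches.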

---

### 6. **Advanced Contextual Handling**

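* Use regular expressions (optionally combined with an NLP entity recognizer) to detect the type of each number and replace it with a type-specific token: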

  * Dates: `2025-09-18` → `<DATE>`
  * Percentages: `85%` → `<PERCENT>`
  * Currency: `$100` → `<MONEY>`
* Example:
  `"I paid $500 on 2025-09-18"` → `"I paid <MONEY> on <DATE>"`
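
The typed replacements above can be sketched as an ordered list of patterns. The order matters: specific patterns must run before generic ones, otherwise the digits inside a date would be consumed first. The patterns, the `<NUM>` fallback, and the name `tag_numbers` are assumptions for illustration and only cover the simple formats shown here.

```python
import re

# Specific patterns first, generic number fallback last.
PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "<DATE>"),   # ISO-style dates, e.g. 2025-09-18
    (re.compile(r"\$\d+(?:\.\d+)?"), "<MONEY>"),    # dollar amounts, e.g. $100
    (re.compile(r"\d+(?:\.\d+)?%"), "<PERCENT>"),   # percentages, e.g. 85%
    (re.compile(r"\d+(?:\.\d+)?"), "<NUM>"),        # any remaining number
]

def tag_numbers(text: str) -> str:
    """Apply each pattern in order, replacing matches with its token."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(tag_numbers("I paid $500 on 2025-09-18"))        # I paid <MONEY> on <DATE>
print(tag_numbers("Sales rose 85% across 3 regions"))  # Sales rose <PERCENT> across <NUM> regions
```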

---

## ✅ Example in Python

```python
import re

text = "I bought 3 apples for $10 on 18/09/2025."

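# 1. Remove every run of digits (symbols such as "$" and "/" are left behind)
remove_numbers = re.sub(r'\d+', '', text)

# 2. Replace every run of digits with <NUM>
replace_numbers = re.sub(r'\d+', '<NUM>', text)

# 3. Replace typed patterns: DD/MM/YYYY dates with <DATE>, then dollar amounts with <MONEY>
#    (plain numbers such as "3" are left untouched)
custom_replace = re.sub(r'\d{2}/\d{2}/\d{4}', '<DATE>', text)
custom_replace = re.sub(r'\$\d+', '<MONEY>', custom_replace)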

print("Original:", text)
print("Remove:", remove_numbers)
print("Replace <NUM>:", replace_numbers)
print("Custom Replace:", custom_replace)
```

---

### ✅ Output

```
Original: I bought 3 apples for $10 on 18/09/2025.
Remove: I bought  apples for $ on //.
Replace <NUM>: I bought <NUM> apples for $<NUM> on <NUM>/<NUM>/<NUM>.
Custom Replace: I bought 3 apples for <MONEY> on <DATE>.
```

---

**Summary**:

* If numbers are irrelevant → **remove them or replace them with `<NUM>`**.
* If numbers carry meaning → **normalize or categorize them**.
* If they are domain-specific (finance, medicine, dates) → **extract them and treat them separately**.

