# Text Processing for AI Data (Multilingual & RTL)

## Purpose
This notebook focuses on preparing raw text for AI training and annotation,
with special attention to Arabic, Urdu, and other RTL languages.

## Why this matters
AI models learn patterns from the text they are trained on.
Inconsistent spacing, punctuation, or encoding becomes noise in those patterns and makes the resulting models less reliable.


In [1]:
raw_texts = [
    "   میں   پائتھن   سیکھ   رہا ہوں   ",
    "أُحِبُّ   البرمجة!!!",
    "I   am   learning    Python!!!",
    "\n\nهذا نص يحتوي على مسافات زائدة\t"
]

raw_texts


['   میں   پائتھن   سیکھ   رہا ہوں   ',
 'أُحِبُّ   البرمجة!!!',
 'I   am   learning    Python!!!',
 '\n\nهذا نص يحتوي على مسافات زائدة\t']

## Removing Extra Whitespace

Real-world data often contains:
- Extra spaces
- Newlines
- Tabs

These must be normalized before annotation or training.


In [2]:
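cleaned_texts = []

for text in raw_texts:
    # split() with no arguments splits on any run of whitespace (spaces,
    # tabs, newlines) and ignores leading/trailing whitespace, so a
    # separate strip() call is not needed.
    text = " ".join(text.split())
    cleaned_texts.append(text)

cleaned_texts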


['میں پائتھن سیکھ رہا ہوں',
 'أُحِبُّ البرمجة!!!',
 'I am learning Python!!!',
 'هذا نص يحتوي على مسافات زائدة']
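The cell above handles ordinary whitespace, including non-breaking spaces, because `str.split()` treats every Unicode whitespace character as a separator. Invisible format characters such as the zero-width space (U+200B) or a stray byte order mark (U+FEFF) are not whitespace and pass through untouched. The sketch below is one possible extension; the helper name `normalize_whitespace` and the choice of which characters to drop are assumptions, not part of the original notebook.

```python
import re

# Assumption: these invisible characters are noise in this dataset.
#   U+200B ZERO WIDTH SPACE
#   U+FEFF BYTE ORDER MARK / ZERO WIDTH NO-BREAK SPACE
# U+200C (zero-width non-joiner) is deliberately kept, because Urdu and
# Persian text uses it to control how letters join.
INVISIBLE = re.compile("[\u200b\ufeff]")


def normalize_whitespace(text: str) -> str:
    """Drop selected invisible characters, then collapse whitespace runs."""
    text = INVISIBLE.sub("", text)
    return " ".join(text.split())


samples = [
    "\ufeffI\u00a0am   learning\u200b Python\t",
    "a\u200cb",  # the zero-width non-joiner survives
]
print([normalize_whitespace(s) for s in samples])
```

Whether to remove or keep a given invisible character depends on the language and the downstream tokenizer, so the character class is worth reviewing per dataset.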

### Impact on AI systems

- Prevents near-duplicate entries that differ only in spacing
- Gives the tokenizer consistent input
- Keeps labels consistent, because each text appears in a single form
- Reduces noise in training data
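To illustrate the first point, the sketch below counts unique texts before and after whitespace cleanup. The records and their labels are made up for the example, and `collapse_whitespace` is a hypothetical helper that repeats the `" ".join(text.split())` step from the previous cell.

```python
def collapse_whitespace(text: str) -> str:
    """Collapse every run of whitespace to one space and trim the ends."""
    return " ".join(text.split())


# Hypothetical annotated records: (text, label).
records = [
    ("I am learning Python", "positive"),
    ("I   am learning  Python ", "positive"),
    ("\tI am learning Python\n", "positive"),
]

raw_unique = {text for text, _ in records}
clean_unique = {collapse_whitespace(text) for text, _ in records}

print(len(raw_unique), len(clean_unique))  # 3 1
```

Without cleanup, the same sentence would be annotated and counted three times.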


In [3]:
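import re

# Runs of common Latin and Arabic-script punctuation, including the
# Arabic question mark and comma and the Urdu full stop.
# Extend the character class to match your data.
PUNCTUATION = re.compile(r"[!?.,؟،۔]+")

normalized_texts = []

for text in cleaned_texts:
    text = PUNCTUATION.sub("", text)
    normalized_texts.append(text)

normalized_texts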


['میں پائتھن سیکھ رہا ہوں',
 'أُحِبُّ البرمجة',
 'I am learning Python',
 'هذا نص يحتوي على مسافات زائدة']
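A hand-written character class only covers the punctuation marks someone remembered to list. One alternative, sketched here under the assumption that all punctuation should be dropped, is to filter on the Unicode general category with the standard `unicodedata` module. The function name `strip_punctuation` is hypothetical.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove every character whose Unicode general category starts with 'P'."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


print(strip_punctuation("Hello, world!"))  # Hello world
# Arabic question mark, Arabic comma, and Urdu full stop are all category Po.
print(strip_punctuation("a\u061f\u060c\u06d4b"))  # ab
```

This approach is broader than the regex in the cell above: it also removes hyphens, quotation marks, apostrophes, and underscores, which may or may not be wanted. Arabic diacritics are combining marks (category `Mn`), not punctuation, so they are left in place.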

## RTL Language Safety

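Python 3 strings are sequences of Unicode code points stored in logical (reading) order, so RTL text needs no special string type. Right-to-left layout is applied only when the text is displayed.
Normalization still matters, because the same visible text can be encoded in more than one way, for example with or without diacritics. Note that `len()` counts code points, so each diacritic in أُحِبُّ counts separately.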
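As a sketch of what normalization can mean for Arabic-script text, the example below applies Unicode NFKC normalization and optionally strips Arabic short-vowel marks (tashkeel). The function name `normalize_unicode`, the `strip_diacritics` flag, and the decision to treat diacritics as removable are assumptions for illustration; some tasks need the diacritics kept.

```python
import re
import unicodedata

# Arabic tashkeel marks, U+064B (fathatan) through U+0652 (sukun).
TASHKEEL = re.compile("[\u064b-\u0652]")


def normalize_unicode(text: str, strip_diacritics: bool = False) -> str:
    """Apply NFKC, then optionally remove Arabic short-vowel marks."""
    # NFKC folds compatibility characters, such as Arabic presentation-form
    # ligatures, into their standard letters.
    text = unicodedata.normalize("NFKC", text)
    if strip_diacritics:
        text = TASHKEEL.sub("", text)
    return text


ligature = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
print(normalize_unicode(ligature) == "\u0644\u0627")  # True

word = "أُحِبُّ"
print(len(word), len(normalize_unicode(word, strip_diacritics=True)))
```

NFKC is a fairly aggressive form: it also rewrites full-width Latin letters, superscripts, and similar compatibility characters. NFC is the gentler choice when only canonical equivalence is wanted.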


In [4]:
for text in normalized_texts:
    print(len(text), "→", text)


23 → میں پائتھن سیکھ رہا ہوں
15 → أُحِبُّ البرمجة
20 → I am learning Python
29 → هذا نص يحتوي على مسافات زائدة
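The lengths above are code-point counts, which is why أُحِبُّ البرمجة reports 15 even though it shows fewer visible letters. The sketch below compares three ways of measuring a string. `describe_length` is a hypothetical helper, and counting non-combining characters is only an approximation of user-perceived characters, not full grapheme segmentation.

```python
import unicodedata


def describe_length(text: str) -> dict:
    """Report code points, non-combining characters, and UTF-8 bytes."""
    return {
        "code_points": len(text),
        # combining() returns 0 for base characters and non-zero for marks
        # such as Arabic damma, kasra, and shadda.
        "base_chars": sum(1 for ch in text if not unicodedata.combining(ch)),
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(describe_length("I am learning Python"))
# Three letters carrying four diacritics, written with escapes for clarity.
print(describe_length("\u0623\u064f\u062d\u0650\u0628\u0651\u064f"))
```

The distinction matters when a length limit is enforced: a limit in code points, visible characters, or bytes will truncate Arabic and Urdu text at different places.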


## What I learned

- Text cleaning is essential before annotation
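- Python 3 strings are Unicode, so Arabic and Urdu need no special string type
- Simple, rule-based cleaning steps apply the same way to every record, which makes them practical for large datasets
- Cleaner data supports more reliable AI behavior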
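One way the cleaning rules could be applied to a larger dataset is with a generator that processes one line at a time instead of building lists in memory. This is a minimal sketch, not a tuned pipeline; `clean_stream` is a hypothetical name and the punctuation pattern is similar to the one used earlier in this notebook.

```python
import re
from typing import Iterable, Iterator

PUNCTUATION = re.compile(r"[!؟،,.]+")


def clean_stream(lines: Iterable[str]) -> Iterator[str]:
    """Lazily clean each line and skip lines that end up empty."""
    for line in lines:
        # Remove punctuation first so that any gap it leaves is collapsed
        # by the whitespace step.
        line = PUNCTUATION.sub("", line)
        line = " ".join(line.split())
        if line:
            yield line


sample = ["  I   am   learning    Python!!!  ", "\n\t"]
print(list(clean_stream(sample)))  # ['I am learning Python']
```

Because `clean_stream` accepts any iterable of strings, a file opened with `encoding="utf-8"` can be passed in directly and consumed line by line.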
