# 02. Text Preprocessing & Data Cleaning
Text preprocessing and data cleaning turn raw text into a consistent, structured form that Natural Language Processing (NLP) models and analyses can work with.

### What You'll Learn:
- Why text preprocessing matters
- Text cleaning techniques
- Tokenization methods
- Normalization strategies
- Stemming vs Lemmatization
- Real-world examples and best practices

## Why is Text Preprocessing Important?

Raw text is messy. It typically contains:
- **Noise**: special characters, URLs, HTML tags
- **Inconsistency**: mixed case, extra whitespace
- **Irrelevant data**: stopwords and punctuation (for many tasks)
- **Variations**: running, runs, ran (different forms of the same word)

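**Without preprocessing**: surface variants such as `Run`, `run`, and `running` count as unrelated tokens, which inflates the vocabulary and often lowers accuracy.

**With preprocessing**: variants collapse into shared tokens, which usually improves model performance.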

In [1]:
import re
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk

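# Download required data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases look up 'punkt_tab' instead of 'punkt'
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)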

print('✓ Libraries loaded')

✓ Libraries loaded


## Step 1: Remove Special Characters

In [2]:
raw_text = 'Hello! This is TEST... @#$% with 123 numbers!!!'
print('Original:', raw_text)

# Keep only letters, digits, and whitespace; drop everything else
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', raw_text)
print('Cleaned:', cleaned)

Original: Hello! This is TEST... @#$% with 123 numbers!!!
Cleaned: Hello This is TEST  with 123 numbers
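
The cleaned string still has a double space where `... @#$%` was removed, and its case is still mixed. A minimal normalization sketch follows, assuming lowercase, single-spaced text is what the downstream step wants. The helper name `normalize` is hypothetical and not part of the original notebook.

```python
import re

def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to one space."""
    text = text.lower()
    # \s+ matches spaces, tabs, and newlines alike
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

cleaned = 'Hello This is TEST  with 123 numbers'
print(normalize(cleaned))  # hello this is test with 123 numbers
```

Lowercasing is not always safe: it merges `US` with `us` and `Apple` with `apple`, so whether to apply it depends on the task.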


## Step 2: Remove URLs and Emails

In [3]:
text = 'Visit https://example.com or email support@example.com'
print('Original:', text)

# Remove URLs (anything starting with http or www)
text = re.sub(r'http\S+|www\S+', '', text)
# Remove email addresses (non-space characters around an @)
text = re.sub(r'\S+@\S+', '', text)
print('Cleaned:', text)

Original: Visit https://example.com or email support@example.com
Cleaned: Visit  or email 
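
Deleting URLs and emails outright leaves gaps (`Visit  or email `) and loses the fact that a link or address was present. One alternative, sketched here as an assumption rather than the notebook's method, is to replace them with placeholder tokens. The names `URL_RE`, `EMAIL_RE`, `mask_urls_and_emails`, and the `<URL>` / `<EMAIL>` placeholders are all hypothetical.

```python
import re

# Slightly stricter patterns than the cell above: require "://" or "www."
URL_RE = re.compile(r'https?://\S+|www\.\S+')
EMAIL_RE = re.compile(r'\S+@\S+')

def mask_urls_and_emails(text: str) -> str:
    """Replace URLs and email addresses with placeholder tokens."""
    # URLs go first so an '@' inside a URL is not mistaken for an email
    text = URL_RE.sub('<URL>', text)
    text = EMAIL_RE.sub('<EMAIL>', text)
    return text

print(mask_urls_and_emails('Visit https://example.com or email support@example.com'))
# Visit <URL> or email <EMAIL>
```

If special characters are stripped afterwards, the angle brackets would be removed too, so a placeholder made only of letters may be a better fit in that pipeline.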


## Step 3: Tokenization

In [5]:
text = 'Hello world! How are you?'
print('Original:', text)

# Word tokenization
words = word_tokenize(text, language="english", preserve_line=False)
print('\nWord tokens:', words)

# Sentence tokenization
sentences = sent_tokenize(text, language="english")
print('Sentence tokens:', sentences)

Original: Hello world! How are you?

Word tokens: ['Hello', 'world', '!', 'How', 'are', 'you', '?']
Sentence tokens: ['Hello world!', 'How are you?']
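
For comparison, here is a sketch of two simpler tokenization methods that need only the standard library. The function name `regex_tokenize` is hypothetical; the pattern is a rough approximation, not a reimplementation of `word_tokenize`.

```python
import re

text = 'Hello world! How are you?'

# Whitespace splitting keeps punctuation glued to the neighbouring word
print(text.split())
# ['Hello', 'world!', 'How', 'are', 'you?']

def regex_tokenize(text: str) -> list[str]:
    """Split into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character
    return re.findall(r'\w+|[^\w\s]', text)

print(regex_tokenize(text))
# ['Hello', 'world', '!', 'How', 'are', 'you', '?']
```

The regex version matches the NLTK output on this sentence but diverges on contractions: it splits `don't` into `don`, `'`, `t`, whereas `word_tokenize` yields `do` and `n't`.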


## Step 4: Remove Stopwords

In [6]:
text = 'the quick brown fox jumps'
tokens = word_tokenize(text.lower())

stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w not in stop_words]

print('Original tokens:', tokens)
print('Filtered tokens:', filtered)

Original tokens: ['the', 'quick', 'brown', 'fox', 'jumps']
Filtered tokens: ['quick', 'brown', 'fox', 'jumps']
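
Stopword removal can change meaning when the list contains negations, and NLTK's English list includes words such as `not` and `no`. The sketch below uses a small hand-picked stopword set so that it runs without NLTK data; `STOP_WORDS`, `NEGATIONS`, and `remove_stopwords` are hypothetical names for illustration.

```python
# Hypothetical mini stopword list, for illustration only
STOP_WORDS = {'the', 'is', 'a', 'this', 'not', 'no'}
NEGATIONS = {'not', 'no'}

def remove_stopwords(tokens, keep=frozenset()):
    """Drop stopwords, except any token listed in `keep`."""
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

tokens = 'this movie is not good'.split()
print(remove_stopwords(tokens))                  # ['movie', 'good']
print(remove_stopwords(tokens, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

For sentiment-style tasks the second form is usually the safer choice, since `movie good` and `movie not good` should not look identical to the model.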


## Step 5: Lemmatization vs Stemming

Stemming strips suffixes with fixed rules and can produce non-words (`studi`); lemmatization maps a word to its dictionary form (`study`) but needs to know the part of speech.

In [7]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['running', 'caring', 'studies']

print('Word | Stemming | Lemmatization')
for word in words:
    stem = stemmer.stem(word)
    # pos='v' treats each word as a verb; the default pos='n' would leave 'running' unchanged
    lemma = lemmatizer.lemmatize(word, pos='v')
    print(f'{word} | {stem} | {lemma}')

Word | Stemming | Lemmatization
running | run | run
caring | care | care
studies | studi | study
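
Steps 1 through 4 can be chained into one function, and the order matters: URLs must be removed before special characters, otherwise `https://example.com` degrades into the junk token `httpsexamplecom`. This is a minimal standard-library sketch under stated assumptions: `preprocess` and `STOP_WORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and stemming or lemmatization is left out.

```python
import re

# Hypothetical mini stopword list; stopwords.words('english') could be swapped in
STOP_WORDS = {'the', 'a', 'an', 'is', 'or', 'at', 'us'}

def preprocess(text: str) -> list[str]:
    """Clean, lowercase, tokenize, and filter a raw string."""
    # 1. Strip URLs and emails first, while their punctuation is still intact
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = re.sub(r'\S+@\S+', ' ', text)
    # 2. Replace remaining special characters with spaces, then lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text).lower()
    # 3. Tokenize on whitespace (split() also discards the extra spaces)
    tokens = text.split()
    # 4. Filter stopwords
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess('Visit us at https://example.com!!! The QUICK brown fox.'))
# ['visit', 'quick', 'brown', 'fox']
```

Replacing special characters with a space rather than an empty string keeps words such as `end.Start` from fusing into one token.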
