# Text Cleaning and Manipulation

Prepare text data for analysis.

## Common Tasks
- **Normalization:** Lowercasing, trimming extra whitespace.
- **Removal:** Stopwords, punctuation, special characters.
- **Extraction:** Dates, emails, phone numbers (using regular expressions).
- **Encoding:** Converting text to numeric features for ML.


In [None]:
import pandas as pd
import numpy as np
import re

# Create messy text data
df = pd.DataFrame({
    'Feedback': [
        '  Great Product!  ',
        'Loving it, 5 stars.',
        'Not worth $50...',
        'delivery was LATE!!!',
        np.nan
    ],
    'Contact': [
        'john@email.com',
        'support@company.org',
        '123-456-7890',
        'No email',
        'jane.doe@gmail.com'
    ]
})

print(df)

## 1. Basic Normalization
Lowercase the text and strip leading and trailing whitespace.

In [None]:
# Fill NaN first so later steps (e.g. .apply with a lambda) always receive strings
df['Feedback'] = df['Feedback'].fillna('')

# Lowercase + Strip whitespace
df['Cleaned'] = df['Feedback'].str.lower().str.strip()

print(df[['Feedback', 'Cleaned']])
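
Note that `.str.strip()` only trims the ends of a string; runs of spaces or tabs inside the text are left alone. A minimal sketch of one way to collapse them as well, using a small hypothetical Series rather than the `df` above:

```python
import pandas as pd

# Hypothetical sample with irregular internal spacing, a tab, and a missing value
s = pd.Series(['  Great   Product!  ', 'delivery\twas  LATE!!!', None])

normalized = (
    s.fillna('')                            # replace missing values with empty strings
     .str.lower()                           # lowercase everything
     .str.replace(r'\s+', ' ', regex=True)  # collapse any run of whitespace into one space
     .str.strip()                           # trim leading and trailing whitespace
)

print(normalized.tolist())
```

Collapsing before stripping means a leading run of spaces becomes a single space that `strip()` then removes.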

## 2. Removing Punctuation
Using Regular Expressions (Regex).

In [None]:
# Remove every character that is NOT a word character (letter, digit, underscore) or whitespace
df['Cleaned'] = df['Cleaned'].str.replace(r'[^\w\s]', '', regex=True)
print(df['Cleaned'])
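
The pattern `[^\w\s]` has two side effects worth knowing: underscores survive (they count as word characters) and symbols such as `$` are dropped, so `$50` becomes `50`. The sketch below compares it with a variant that keeps `$`; the sample strings and the choice to keep `$` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample strings
s = pd.Series(['not worth $50...', 'user_name rocks!', '5-star service'])

# Original pattern: drops '$', '.', '!' and '-', but keeps '_'
basic = s.str.replace(r'[^\w\s]', '', regex=True)

# Variant: add '$' to the set of characters to keep so prices stay recognizable
keep_dollar = s.str.replace(r'[^\w\s$]', '', regex=True)

print(basic.tolist())
print(keep_dollar.tolist())
```

Which characters to keep depends on the analysis; the point is that the character class is where that decision is made.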

## 3. Extracting Information (Regex)
Pull emails or other patterns out of free text with `str.extract`.

In [None]:
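# Regex for email; the parentheses define the capture group that str.extract returns
email_pattern = r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'

# expand=False returns a Series for a single group; rows without a match become NaN
df['Email_Extracted'] = df['Contact'].str.extract(email_pattern, expand=False)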
print(df[['Contact', 'Email_Extracted']])
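
The same approach works for phone numbers. The sketch below assumes a single hyphen-separated `ddd-ddd-dddd` format; the pattern and the sample values are hypothetical, and real phone data usually needs a more permissive pattern.

```python
import pandas as pd

# Hypothetical contact values
contacts = pd.Series([
    'john@email.com',
    '123-456-7890',
    'call 555-010-9999 today',
    'No email',
])

# Three digits, three digits, four digits, separated by hyphens
phone_pattern = r'(\d{3}-\d{3}-\d{4})'

# Non-matching rows become NaN
phones = contacts.str.extract(phone_pattern, expand=False)
print(phones)
```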

## 4. Feature Engineering from Text

In [None]:
# Character count
df['Length'] = df['Cleaned'].str.len()

# Word count
df['Word_Count'] = df['Cleaned'].str.split().str.len()

# Identify simple sentiment (keyword based)
# Match whole words via split(); a plain substring test would also flag 'goods' for 'good'
positive_words = ['great', 'loving', 'good']
df['Is_Positive'] = df['Cleaned'].apply(
    lambda x: any(word in x.split() for word in positive_words)
)

print(df[['Cleaned', 'Length', 'Word_Count', 'Is_Positive']])
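
Counts and flags are one kind of numeric feature; another is to encode the words themselves. A minimal sketch of a binary bag-of-words (one column per word, 1 if the word appears in the row) using `Series.str.get_dummies`, on a hypothetical Series of already-cleaned text:

```python
import pandas as pd

# Hypothetical cleaned text (lowercase, no punctuation)
cleaned = pd.Series(['great product', 'loving it 5 stars', 'great great value', ''])

# Split on spaces and build one indicator column per distinct token
bow = cleaned.str.get_dummies(sep=' ')
print(bow)
```

This records presence only, so a word repeated within a row still gets a 1. For counts or TF-IDF weights, a dedicated vectorizer from an ML library is the usual choice.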

## 5. Fuzzy Matching (Advanced)
Fixing typos (e.g., 'Aplle' -> 'Apple').

In [None]:
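# Requires: pip install fuzzywuzzy python-Levenshtein
# (the project is now maintained under the name "thefuzz")
# from fuzzywuzzy import process

# choices = ['Apple', 'Banana', 'Orange']
# process.extractOne('Aplle', choices)
# Returns a (best_match, score) tuple; here the best match is 'Apple',
# with a score below 100 because the strings are not identical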
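
If installing an extra package is not an option, the standard library's `difflib` offers a similar idea. This is a swapped-in technique, not fuzzywuzzy itself: it uses `difflib.get_close_matches`, and the helper `fix_typo` and its `cutoff` default are hypothetical names chosen for this sketch.

```python
import difflib

choices = ['Apple', 'Banana', 'Orange']

def fix_typo(word, choices, cutoff=0.6):
    """Return the closest choice, or the original word if nothing is similar enough."""
    matches = difflib.get_close_matches(word, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(fix_typo('Aplle', choices))  # close enough to 'Apple' to be corrected
print(fix_typo('Kiwi', choices))   # no similar choice, so it is returned unchanged
```

Raising `cutoff` makes corrections more conservative; lowering it risks mapping unrelated words onto a choice.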

## Practice Exercise
Extract titles (Mr, Mrs, Miss, ...) from the `Name` column of the Titanic dataset.

In [None]:
# Load titanic
# Extract 'Mr.', 'Mrs.', etc. using Regex
# Check counts of each title
# Your code here
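
As a hint, here is one possible starting point on three hand-typed names in the Titanic `Name` format (`Surname, Title. Given names`). The pattern is an assumption that the title sits between the comma and the first period; loading the full dataset and checking for unusual titles is left to the exercise.

```python
import pandas as pd

# Sample names typed by hand in the Titanic "Surname, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Capture everything between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

print(titles.value_counts())
```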

## Key Takeaways

✅ **Normalization** - Critical first step.
✅ **Regex** - Learn basics (`^`, `$`, `\d`, `\w`, `+`, `*`).
✅ **Encoding** - Converting text to numbers is needed for ML.

**Next:** [Data Cleaning Project](README.md) →