### **Module 7: Working with Text Data**
Text columns usually need cleaning before they can be analyzed, particularly in machine-learning applications such as natural language processing (NLP). This module covers common text-processing tasks in pandas.

#### **Topics:**
- **String Operations:**
  - Using the `.str` accessor for string manipulation (e.g., `str.lower()`, `str.upper()`, `str.split()`, `str.contains()`).

- **Removing Whitespace and Special Characters:**
  - Stripping extra spaces and cleaning up special characters.
  
- **Extracting Information:**
  - Extracting patterns from text using regular expressions.

#### **Hands-on Lab:**
- Clean a dataset's text column by removing special characters, converting text to lowercase, and splitting values.

---

## **1. String Operations**

### **Real-world Scenario 1: Customer Feedback Text Cleaning**  
You have a **customer feedback dataset** in which the text has mixed case and inconsistent formatting. You want to standardize it by **converting everything to lowercase** and then **checking whether certain keywords** (e.g., "refund") are present.

### **Example: String Manipulation in Feedback**  

In [1]:
import pandas as pd

# Sample customer feedback data
feedback_data = {
    'customer_id': [1, 2, 3],
    'feedback': ['Excellent product! Totally worth it!', 'Requesting a REFUND ASAP!!!', 'delivery was Late and Damaged.']
}

df = pd.DataFrame(feedback_data)

# Convert feedback to lowercase
df['feedback_cleaned'] = df['feedback'].str.lower()

# Check if the feedback contains the word "refund"
df['contains_refund'] = df['feedback_cleaned'].str.contains('refund')

print(df[['customer_id', 'feedback_cleaned', 'contains_refund']])

   customer_id                      feedback_cleaned  contains_refund
0            1  excellent product! totally worth it!            False
1            2           requesting a refund asap!!!             True
2            3        delivery was late and damaged.            False
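
Real feedback columns often contain missing values, and `str.contains` returns a missing value for those rows, which gets in the way of boolean filtering. The sketch below is a minimal variation on the example above, assuming a hypothetical row with no feedback (it is not part of the original dataset). It uses `case=False` to match regardless of case without a separate lowercase column, and `na=False` to treat missing text as "no match".

```python
import pandas as pd

# Hypothetical feedback data with one missing entry (illustrative only)
feedback = pd.Series(
    ['Excellent product! Totally worth it!', 'Requesting a REFUND ASAP!!!', None],
    name='feedback',
)

# case=False ignores letter case; na=False maps missing values to False;
# regex=False treats 'refund' as a literal substring rather than a pattern
contains_refund = feedback.str.contains('refund', case=False, na=False, regex=False)

print(contains_refund.tolist())  # [False, True, False]
```

Because the result contains only `True` and `False`, it can be used directly as a filter, for example `feedback[contains_refund]`.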


## **2. Removing Whitespace and Special Characters**

### **Real-world Scenario 2: Cleaning Product Descriptions**  
You have a dataset of **product descriptions** where some entries contain special characters (such as `#`, `@`, `*`, `$`) and leading or trailing spaces. You need to **strip whitespace and remove special characters** to clean the data.

### **Example: Clean Product Descriptions** 

In [2]:
import re

# Sample product descriptions
product_data = {
    'product_id': [101, 102, 103],
    'description': ['  Best-Quality Laptop@@ ', ' High-Performance**Phone!', ' Budget-Friendly*  Tablet']
}

df = pd.DataFrame(product_data)

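# Remove every character that is not a letter, digit, underscore, or whitespace
# (this also drops hyphens), then strip leading and trailing whitespace.
# Runs of spaces inside the text are left as they are.
df['description_cleaned'] = df['description'].str.replace(r'[^\w\s]', '', regex=True).str.strip()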

print(df[['product_id', 'description', 'description_cleaned']])

   product_id                description     description_cleaned
0         101     Best-Quality Laptop@@       BestQuality Laptop
1         102   High-Performance**Phone!    HighPerformancePhone
2         103   Budget-Friendly*  Tablet  BudgetFriendly  Tablet
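
In the output above, the pattern `[^\w\s]` also removes the hyphens ("BestQuality"), and the double space in the third row survives because `str.strip()` only trims the ends. If hyphenated words should stay intact and internal spacing should be normalized, one possible variation is sketched below. The choice to keep hyphens is an assumption about what the cleaned data should look like, not a requirement of the scenario.

```python
import pandas as pd

descriptions = pd.Series(
    ['  Best-Quality Laptop@@ ', ' High-Performance**Phone!', ' Budget-Friendly*  Tablet']
)

cleaned = (
    descriptions
    .str.replace(r'[^\w\s-]', '', regex=True)  # keep word characters, whitespace, and hyphens
    .str.replace(r'\s+', ' ', regex=True)      # collapse runs of whitespace into one space
    .str.strip()                               # trim leading and trailing whitespace
)

print(cleaned.tolist())
# ['Best-Quality Laptop', 'High-PerformancePhone', 'Budget-Friendly Tablet']
```

Note that "High-PerformancePhone" still lacks a space: deleting `**` joins the two words. Replacing special characters with a space instead of an empty string would avoid that.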


## **3. Extracting Information**

### **Real-world Scenario 3: Extracting Dates from Text**  
You have a dataset of **support tickets** where the text contains dates. You want to **extract dates** from the text using regular expressions to create a new "date" column.

### **Example: Extracting Dates from Support Tickets**  

In [3]:
# Sample support tickets with dates in text
ticket_data = {
    'ticket_id': [201, 202, 203],
    'description': ['Reported on 2025-01-15: Server crashed', 
                    'Issue detected on 2025/02/05 - Login failure', 
                    'Maintenance scheduled for 2025.03.01 - Network upgrade']
}

df = pd.DataFrame(ticket_data)

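# Extract the first date in YYYY-MM-DD, YYYY/MM/DD, or YYYY.MM.DD form.
# expand=False returns a Series (instead of a one-column DataFrame) for a single capture group.
df['extracted_date'] = df['description'].str.extract(r'(\d{4}[-/.]\d{2}[-/.]\d{2})', expand=False)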

print(df[['ticket_id', 'description', 'extracted_date']])

   ticket_id                                        description extracted_date
0        201             Reported on 2025-01-15: Server crashed     2025-01-15
1        202       Issue detected on 2025/02/05 - Login failure     2025/02/05
2        203  Maintenance scheduled for 2025.03.01 - Network...     2025.03.01
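
The extracted values are still strings, and each one keeps its original separator. To sort or filter by date, they usually need to be converted to a datetime type. A minimal sketch, assuming the three separator styles shown above are the only ones present:

```python
import pandas as pd

extracted = pd.Series(['2025-01-15', '2025/02/05', '2025.03.01'])

# Replace '/' and '.' with '-' so every value shares one format
normalized = extracted.str.replace(r'[/.]', '-', regex=True)

# Parse with an explicit format; errors='coerce' turns unparseable strings into NaT
parsed = pd.to_datetime(normalized, format='%Y-%m-%d', errors='coerce')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2025-01-15', '2025-02-05', '2025-03-01']
```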


## **4. Full Hands-on Lab: Text Data Cleaning**  
### **Scenario 4: Cleaning a Product Reviews Dataset**  
You have a dataset of **product reviews** with issues such as special characters, mixed case, and extra spaces. You want to:
1. **Remove special characters** and **extra spaces**.
2. Convert reviews to **lowercase**.
3. Split the cleaned review into **individual words**.

### **Example: Cleaning and Splitting Text**

In [6]:
# Sample product reviews data
reviews_data = {
    'review_id': [1, 2, 3],
    'review': [' Amazing @ Quality! Phone   #Value for Money!', 'Horrible** experience. Never again.', 'Very SLOW delivery!!!  Disappointing.']
}

df = pd.DataFrame(reviews_data)

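# Clean the review text: drop special characters, collapse runs of whitespace,
# convert to lowercase, and trim the ends
df['cleaned_review'] = (
    df['review']
    .str.replace(r'[^\w\s]', '', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.lower()
    .str.strip()
)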

# Split cleaned reviews into words
df['split_review'] = df['cleaned_review'].str.split()

display(df[['review_id', 'review', 'cleaned_review', 'split_review']])

|   | review_id | review | cleaned_review | split_review |
|---|-----------|--------|----------------|--------------|
| 0 | 1 | Amazing @ Quality! Phone #Value for Money! | amazing quality phone value for money | [amazing, quality, phone, value, for, money] |
| 1 | 2 | Horrible\*\* experience. Never again. | horrible experience never again | [horrible, experience, never, again] |
| 2 | 3 | Very SLOW delivery!!! Disappointing. | very slow delivery disappointing | [very, slow, delivery, disappointing] |
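
Once a review has been split into a list of words, simple features such as the word count are easy to derive. The sketch below starts from the cleaned strings shown above. `word_count` is a hypothetical feature name used only for illustration.

```python
import pandas as pd

cleaned_reviews = pd.Series([
    'amazing quality phone value for money',
    'horrible experience never again',
    'very slow delivery disappointing',
])

# str.split() with no arguments splits on any run of whitespace
tokens = cleaned_reviews.str.split()

# On a Series of lists, str.len() returns the number of elements in each list
word_count = tokens.str.len()

print(word_count.tolist())  # [6, 4, 4]
```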


## **5. Hands-on Lab: Real-world Scenarios**  

### **Scenario 5: Cleaning Emails for Spam Detection**  
You have a dataset of **emails** where you need to clean the subject lines by **removing URLs, special characters, and extra spaces**, and converting the text to lowercase.

In [7]:
# Sample email subject lines
email_data = {
    'email_id': [1, 2, 3],
    'subject_line': ['WIN BIG $$$ Visit http://promo.com', 'Limited Offer!!! Click here NOW', 'Your Invoice #12345 Ready']
}

df = pd.DataFrame(email_data)

# Remove URLs first, then special characters, and collapse the leftover runs of spaces
df['cleaned_subject'] = df['subject_line'].str.replace(r'http\S+', '', regex=True)  # Remove URLs
df['cleaned_subject'] = (
    df['cleaned_subject']
    .str.replace(r'[^\w\s]', '', regex=True)  # Remove special characters
    .str.replace(r'\s+', ' ', regex=True)     # Collapse repeated whitespace
    .str.lower()
    .str.strip()
)

print(df[['email_id', 'subject_line', 'cleaned_subject']])

   email_id                        subject_line               cleaned_subject
0         1  WIN BIG $$$ Visit http://promo.com                 win big visit
1         2     Limited Offer!!! Click here NOW  limited offer click here now
2         3           Your Invoice #12345 Ready      your invoice 12345 ready


### **Scenario 6: Extracting Phone Numbers from Text**  
You have a dataset of **customer messages** where phone numbers are mentioned in different formats. You want to **extract the phone numbers**.

In [8]:
# Sample messages with phone numbers
messages_data = {
    'message_id': [1, 2, 3],
    'message': ['Call me at 987-654-3210 for details', 'My number is (123) 456-7890', 'Contact: 1234567890']
}

df = pd.DataFrame(messages_data)

# Extract phone numbers
df['extracted_phone'] = df['message'].str.extract(r'(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})')

print(df[['message_id', 'message', 'extracted_phone']])

   message_id                              message extracted_phone
0           1  Call me at 987-654-3210 for details    987-654-3210
1           2          My number is (123) 456-7890  (123) 456-7890
2           3                  Contact: 1234567890      1234567890
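
The extracted numbers still come in three different formats, which makes them hard to compare or deduplicate. One way to standardize them is to keep only the digits and then re-insert separators. This is a sketch that assumes every number has exactly ten digits, as in the sample data.

```python
import pandas as pd

phones = pd.Series(['987-654-3210', '(123) 456-7890', '1234567890'])

# \D matches any non-digit character, so this keeps digits only
digits_only = phones.str.replace(r'\D', '', regex=True)

# Re-insert dashes using capture groups and backreferences
formatted = digits_only.str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)

print(digits_only.tolist())  # ['9876543210', '1234567890', '1234567890']
print(formatted.tolist())    # ['987-654-3210', '123-456-7890', '123-456-7890']
```

After normalization, the second and third messages turn out to contain the same number, which was not obvious from the raw text.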


### **Scenario 7: Identifying Spam Keywords**  
You have a dataset of **user comments**. You need to **flag comments** that contain spammy keywords such as "win", "offer", or "prize".

In [10]:
import pandas as pd

# Sample user comments
comments_data = {
    'comment_id': [1, 2, 3],
    'comment': ['You have WON a prize!', 'Limited time OFFER available now!', 'Happy with the product.']
}

df = pd.DataFrame(comments_data)

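# The non-capturing group (?:...) keeps pandas from warning about match groups in str.contains.
# \b matches whole words only, so "WON" does not match "win"; row 0 is flagged because of "prize".
df['is_spam'] = df['comment'].str.lower().str.contains(r'\b(?:win|offer|prize)\b')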

print(df[['comment_id', 'comment', 'is_spam']])

   comment_id                            comment  is_spam
0           1              You have WON a prize!     True
1           2  Limited time OFFER available now!     True
2           3            Happy with the product.    False
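
When the keyword list grows, it is easier to maintain as a Python list and build the pattern from it. The sketch below does that, using `re.escape` so that keywords containing regex metacharacters are matched literally. The list itself is an assumption for illustration; "won" is added here because the word-boundary pattern does not treat it as a form of "win".

```python
import re
import pandas as pd

comments = pd.Series([
    'You have WON a prize!',
    'Limited time OFFER available now!',
    'Happy with the product.',
])

# Hypothetical keyword list; extend as needed
spam_keywords = ['win', 'won', 'offer', 'prize']

# Build \b(?:win|won|offer|prize)\b, escaping each keyword
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in spam_keywords) + r')\b'

# case=False makes the match case-insensitive without lowercasing the text first
is_spam = comments.str.contains(pattern, case=False, regex=True)

# findall shows which keywords triggered the flag
matched = comments.str.findall(pattern, flags=re.IGNORECASE)

print(is_spam.tolist())  # [True, True, False]
print(matched.tolist())  # [['WON', 'prize'], ['OFFER'], []]
```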


### **Scenario 8: Cleaning Social Media Post Content**  
You have a dataset of **social media posts** with hashtags, mentions, and emojis. You need to **remove hashtags, mentions (@username), and emojis**.

In [14]:
# Sample social media posts
social_media_data = {
    'post_id': [1, 2, 3],
    'content': ['Great day! #happy @john 😊', 'Just bought a new phone! 📱🎉', 'Lunch time! #yummy']
}

df = pd.DataFrame(social_media_data)

# Remove hashtags, mentions, and emojis
df['cleaned_content'] = df['content'].str.replace(r'@\w+', '', regex=True)  # Remove mentions
df['cleaned_content'] = df['cleaned_content'].str.replace(r'#\w+', '', regex=True)  # Remove hashtags
df['cleaned_content'] = df['cleaned_content'].str.replace(r'[^\w\s]', '', regex=True).str.strip()  # Remove emojis and remaining punctuation

print(df[['post_id', 'content', 'cleaned_content']])

   post_id                      content          cleaned_content
0        1    Great day! #happy @john 😊                Great day
1        2  Just bought a new phone! 📱🎉  Just bought a new phone
2        3           Lunch time! #yummy               Lunch time
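
The last step above removes punctuation along with the emojis, because `[^\w\s]` matches both. If punctuation should be kept, the emojis have to be targeted more narrowly. The sketch below uses a few Unicode ranges that cover the emojis in the sample data. These ranges are an approximation chosen for illustration, not a complete definition of "emoji".

```python
import pandas as pd

posts = pd.Series([
    'Great day! #happy @john 😊',
    'Just bought a new phone! 📱🎉',
    'Lunch time! #yummy',
])

# Approximate emoji ranges (an assumption, not exhaustive):
# pictographs and emoticons, misc symbols and dingbats, and the variation selector
emoji_pattern = '[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]'

cleaned = (
    posts
    .str.replace(r'[@#]\w+', '', regex=True)     # remove mentions and hashtags
    .str.replace(emoji_pattern, '', regex=True)  # remove emojis only
    .str.replace(r'\s+', ' ', regex=True)        # collapse leftover whitespace
    .str.strip()
)

print(cleaned.tolist())
# ['Great day!', 'Just bought a new phone!', 'Lunch time!']
```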


### **Scenario 9: Extracting Hashtags from Tweets**  
You have a dataset of **tweets** and want to **extract hashtags** to analyze trending topics.

In [12]:
# Sample tweets
tweet_data = {
    'tweet_id': [1, 2, 3],
    'tweet': ['#Python is amazing! #AI', 'Learning #DataScience with #Python', 'No hashtags here!']
}

df = pd.DataFrame(tweet_data)

# Extract hashtags
df['hashtags'] = df['tweet'].str.findall(r'#\w+')

print(df[['tweet_id', 'tweet', 'hashtags']])

   tweet_id                               tweet                 hashtags
0         1             #Python is amazing! #AI           [#Python, #AI]
1         2  Learning #DataScience with #Python  [#DataScience, #Python]
2         3                   No hashtags here!                       []
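
To analyze trending topics, the per-tweet lists need to be flattened and counted. A minimal sketch using `explode()` and `value_counts()`; lowercasing the hashtags first is an assumption, made so that `#Python` and `#python` would be counted together.

```python
import pandas as pd

tweets = pd.Series([
    '#Python is amazing! #AI',
    'Learning #DataScience with #Python',
    'No hashtags here!',
])

hashtags = tweets.str.findall(r'#\w+')

# explode() produces one row per hashtag; empty lists become missing values,
# which dropna() removes before counting
hashtag_counts = hashtags.explode().dropna().str.lower().value_counts()

print(hashtag_counts)
```

Here `#python` appears twice, while `#ai` and `#datascience` appear once each.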


## **Summary of Examples:**

| **Scenario**                              | **Key Methods**                                | **Description**                                            |
|-------------------------------------------|------------------------------------------------|------------------------------------------------------------|
| Customer Feedback Cleaning                | `str.lower()`, `str.contains()`                | Standardize case and check for keywords.                   |
| Product Description Cleaning              | `str.replace()`, `str.strip()`                 | Remove special characters and trim whitespace.             |
| Support Ticket Date Extraction            | `str.extract()`                                | Extract dates in several formats using regex.              |
| Product Review Cleaning                   | `str.replace()`, `str.lower()`, `str.split()`  | Clean reviews and split them into individual words.        |
| Email Subject Cleaning for Spam Detection | `str.replace()` with URL and character regexes | Clean subject lines by removing URLs and special chars.    |
| Phone Number Extraction                   | `str.extract()` with a phone-number regex      | Extract phone numbers from customer messages.              |
| Spam Keyword Flagging                     | `str.contains()` with a word-boundary regex    | Flag comments containing spammy keywords.                  |
| Social Media Post Cleaning                | `str.replace()`                                | Remove mentions, hashtags, and emojis.                     |
| Hashtag Extraction                        | `str.findall()`                                | Extract hashtags from tweets for trend analysis.           |

