# Lesson 2: Mastering Text Cleaning for NLP: Techniques and Applications

### 🚀 **Text Cleaning in NLP: Master the Basics!**  

#### 📚 **Introduction**  
- **Objective**: Learn how to clean textual data using Python for better Natural Language Processing (NLP) results.  
- **Importance**: Clean input data leads to more accurate NLP model outputs.

---

#### 🧹 **Understanding Text Cleaning**  
1. **Why Text Cleaning?**  
   - Handles noisy data (e.g., slang, abbreviations, emojis).  
   - Prepares text for machine comprehension by removing distractions like punctuation and stop words.  

2. **Key Tool**:  
   - **Python's Regex (`re`)**: Simplifies pattern-based string replacement using `re.sub(pattern, repl, string)`.
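
As a rough illustration of `re.sub(pattern, repl, string)`, the sketch below runs two substitutions on a made-up input string. The variable names and the sample text are assumptions chosen for this example, not part of the lesson's code.

```python
import re

# re.sub(pattern, repl, string) returns a new string; the input is unchanged.
sample = "Hello,   World!!"  # hypothetical input

no_punct = re.sub(r"[^\w\s]", "", sample)  # drop anything that is not a word char or whitespace
collapsed = re.sub(r"\s+", " ", no_punct)  # collapse each whitespace run into one space

print(repr(no_punct))
print(repr(collapsed))
```

Because strings are immutable, each call produces a new value, which is why the cleaning function reassigns `text` after every step.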

---

#### 🛠 **Text Cleaning Process**  
Using the Python function `clean_text`:  

```python
import re

def clean_text(text):
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove special characters and punctuation
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s\s+', ' ', text)  # Remove extra spaces
    return text
```

**Steps Explained**:  
- **Lowercase**: Normalizes case so that `The` and `the` are treated as the same token.  
- **Remove Emails**: Drops any whitespace-delimited token containing `@`, which covers email addresses.  
- **Remove URLs**: Drops links that start with `http` (or `https`).  
- **Special Characters**: Replaces symbols and punctuation with spaces.  
- **Numbers**: Replaces each digit with a space.  
- **Extra Spaces**: Collapses runs of two or more whitespace characters into one space.
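
To see what each step contributes, one option is to apply the same substitutions one at a time and record the intermediate strings. This is only a sketch: `trace_clean`, `STEPS`, and the sample sentence are hypothetical names introduced here, while the patterns are copied from `clean_text`.

```python
import re

# The same substitutions as clean_text, applied one at a time so each
# intermediate string can be inspected. The sample sentence is made up.
STEPS = [
    ("emails", r'\S*@\S*\s?', ''),
    ("urls", r'http\S+', ''),
    ("special", r'\W', ' '),
    ("digits", r'\d', ' '),
    ("spaces", r'\s\s+', ' '),
]

def trace_clean(text):
    """Return a list of (step name, text after that step)."""
    text = text.lower()
    history = [("lowercase", text)]
    for name, pattern, repl in STEPS:
        text = re.sub(pattern, repl, text)
        history.append((name, text))
    return history

trace = trace_clean("Mail bob@site.org or visit http://x.io in 2024!")
for name, value in trace:
    print(f"{name:>9}: {value!r}")
```

Printing with `!r` makes leftover whitespace visible, which is useful here because this version of the function does not strip the ends of the string.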

---

#### 🔍 **Demo Example**  

Input:  
```python
print(clean_text('Check out the course at www.codesignal.com/course123'))
```

Output (the link has no `http` prefix, so it is split into words rather than removed):  
```
check out the course at www codesignal com course
```
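
If links without a scheme should also be dropped, one possible extension is a pattern that accepts a bare `www.` prefix as well. The sketch below is an assumption layered on top of the lesson, not part of its `clean_text`; `URL_PATTERN` and `remove_urls` are hypothetical names.

```python
import re

# Assumed variant, not part of the lesson's clean_text: match links that
# start with http://, https://, or a bare www. prefix.
URL_PATTERN = re.compile(r'(?:https?://|www\.)\S+')

def remove_urls(text):
    """Strip http(s) and www-style links, leaving the rest untouched."""
    return URL_PATTERN.sub('', text)

demo = 'check out the course at www.codesignal.com/course123'
print(repr(re.sub(r'http\S+', '', demo)))  # lesson pattern: link survives
print(repr(remove_urls(demo)))             # extended pattern: link removed
```

Links written without either prefix (for example `codesignal.com`) would still pass through, so this remains a heuristic.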

---

#### 📊 **Dataset Implementation**  
Apply `clean_text` to a dataset using Pandas:  

```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load dataset
newsgroups_data = fetch_20newsgroups(subset='train')
nlp_df = pd.DataFrame(newsgroups_data.data, columns=['text'])

# Apply cleaning function
nlp_df['text'] = nlp_df['text'].apply(clean_text)

print(nlp_df.head())
```

**Result**: A cleaned DataFrame ready for further NLP tasks.  
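
For a quick check that does not require downloading the dataset, the same `apply` call can be tried on a tiny in-memory DataFrame. In this sketch `sample_docs` and `small_df` are made-up stand-ins for `newsgroups_data.data` and `nlp_df`, and the cleaned text goes into a separate `clean` column so the raw text stays available for comparison.

```python
import re
import pandas as pd

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\S*@\S*\s?', '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s\s+', ' ', text)
    return text

# Hypothetical stand-in for newsgroups_data.data: a short list of strings.
sample_docs = ['From: ann@uni.edu Subject: Re: GPUs', 'See http://example.com NOW!!!']
small_df = pd.DataFrame(sample_docs, columns=['text'])

# Keep the raw column and write the result next to it for comparison.
small_df['clean'] = small_df['text'].apply(clean_text)
print(small_df)
```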

---

#### 🧪 **Testing the Cleaning Function**  
Example with various inputs:  

```python
test_texts = ['This is an EXAMPLE!', 'Another ex:ample123 with $#@!', 'example@mail.com is an email.']

for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
```

**Output** (this version of the function does not call `strip()`, so a trailing space can remain):  
- Original: `This is an EXAMPLE!` → Cleaned: `this is an example `  
- Original: `Another ex:ample123 with $#@!` → Cleaned: `another ex ample with `  
- Original: `example@mail.com is an email.` → Cleaned: `is an email `  
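
The token `$#@!` disappears entirely because the email pattern matches any run of non-whitespace characters that contains `@`, not only well-formed addresses. The sketch below uses `re.findall` on made-up inputs to show what the pattern would remove; `EMAIL_PATTERN`, `samples`, and `removed` are names assumed for this example.

```python
import re

EMAIL_PATTERN = r'\S*@\S*\s?'  # pattern used by the lesson's clean_text

# Hypothetical inputs: only the first contains a real email address.
samples = ['write to example@mail.com today', 'symbols $#@! here', 'follow @handle now']

# findall lists exactly the substrings that re.sub would delete.
removed = [re.findall(EMAIL_PATTERN, s) for s in samples]
for s, hits in zip(samples, removed):
    print(f'{s!r} -> removed {hits}')
```

If social-media handles or similar tokens need to be kept, a stricter pattern would be required.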

---

### 📝 **Lesson Summary**  
- **What You Learned**:  
  - Importance of text cleaning in NLP.  
  - Implementing a Python function to clean data.  
  - Application to real datasets using Pandas.  

- **Next Steps**:  
  - Practice cleaning datasets to build better NLP pipelines!  

**Happy Cleaning!** 🧹

## Update String and Clean Text

Well done, Space Voyager! Now, to further explore the workings of our text cleaning function, let's use a different sentence. Replace the first sentence in the test_texts list with the phrase "I love learning at CodeSignal; it's so interactive and fun!". Then run the clean_text function with the updated list.

```python
import re

def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and special characters but leave spaces
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace to single spaces

    return text.strip()  # Strip leading and trailing spaces

test_texts = ['This is an EXAMPLE!', 
              'Another ex:ample123 with special characters $#@!', 
              'example@mail.com is an email address.']
for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
    print('--')
    
```

Here's the updated code and output with the first sentence replaced as per your instructions:

```python
import re

def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and special characters but leave spaces
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace to single spaces

    return text.strip()  # Strip leading and trailing spaces

# Updated test_texts list
test_texts = ['I love learning at CodeSignal; it\'s so interactive and fun!', 
              'Another ex:ample123 with special characters $#@!', 
              'example@mail.com is an email address.']

# Test the clean_text function with the updated list
for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
    print('--')
```

### **Output:**

```
Original: I love learning at CodeSignal; it's so interactive and fun!
Cleaned: i love learning at codesignal its so interactive and fun
--
Original: Another ex:ample123 with special characters $#@!
Cleaned: another example with special characters
--
Original: example@mail.com is an email address.
Cleaned: is an email address
--
```

### **Explanation of Updates:**
1. The first sentence was replaced with `"I love learning at CodeSignal; it's so interactive and fun!"`.
2. The `clean_text` function:
   - Converted the sentence to lowercase.
   - Removed punctuation (`;`, `'`, `!`) without inserting spaces, so `it's` becomes `its`.
   - Normalized spaces and stripped leading/trailing whitespace.
3. Output reflects cleaned versions of the updated text inputs.
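
The replacement string matters as much as the pattern. A minimal sketch on a made-up string, comparing deletion (`''`) with substitution by a space (`' '`) for the same character class:

```python
import re

word = 'ex:ample and problem-solving'  # hypothetical sample

joined = re.sub(r'[^\w\s]', '', word)   # delete punctuation: neighbours fuse
split = re.sub(r'[^\w\s]', ' ', word)   # substitute a space: neighbours stay apart

print(joined)
print(split)
```

Deleting keeps contractions such as `it's` together as `its`, while substituting a space keeps hyphenated or colon-separated parts as separate words. Which behaviour is preferable depends on the downstream task.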

## Filling in Python Functions and Regex Patterns

Superb job! Now, let's ensure you have grasped the core Python functions and regex patterns used in text cleaning.

Fill in the blanks (____) to transform all text to lowercase and remove extra spaces.

```python
import re

def clean_text(text):
    text = text.____()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\____\s+', ' ', text)  # Remove extra spaces
    return text.strip()

test_texts = ['Fill me In! I enjoy problem-solving tasks like these; they are quite engaging.', 
              'Another ex:ample123 with special characters $#@!', 
              'example@mail.com is an email address.']
for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
    print('--')

```

Here’s the filled-in code:

```python
import re

def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    return text.strip()

test_texts = ['Fill me In! I enjoy problem-solving tasks like these; they are quite engaging.', 
              'Another ex:ample123 with special characters $#@!', 
              'example@mail.com is an email address.']
for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
    print('--')
```

### **Explanation of the Filled Blanks:**
1. `text.lower()` was filled in to convert the text to lowercase. This ensures that all words are treated uniformly regardless of their original casing.
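2. The whitespace blank was completed so that the final substitution uses `\s+`, which matches one or more whitespace characters and collapses each run into a single space. Filling the blank literally with `s` would give `\s\s+` (two or more whitespace characters), the pattern used in the lesson; combined with `strip()`, both produce the same output for these test inputs.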
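
The two whitespace patterns differ only when a single tab or newline stands alone. A small sketch on a made-up string (the variable names are assumptions for this example):

```python
import re

raw = 'tabs\tand\nnewlines  and   spaces'  # hypothetical sample

two_plus = re.sub(r'\s\s+', ' ', raw)  # touches only runs of 2+ whitespace chars
one_plus = re.sub(r'\s+', ' ', raw)    # rewrites every run, including a lone tab or newline

print(repr(two_plus))
print(repr(one_plus))
```

For text that may contain tabs or line breaks, `\s+` is therefore the more thorough choice.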

### **Output:**
```
Original: Fill me In! I enjoy problem-solving tasks like these; they are quite engaging.
Cleaned: fill me in i enjoy problem solving tasks like these they are quite engaging
--
Original: Another ex:ample123 with special characters $#@!
Cleaned: another ex ample with special characters
--
Original: example@mail.com is an email address.
Cleaned: is an email address
--
```

This output confirms that the `clean_text` function works as intended after filling in the blanks!

## Mastering Text Cleaning with Python Regex

Charming maneuvering, Stellar Navigator!

Now, I want to see more of your Python prowess. Replace the missing parts (____), ensuring that they perform the right operations. We aim to transform all text to lowercase and eliminate email addresses, URLs, special characters, digits, and extra spaces.

```python
import re

def clean_text(text):
    text = text.____()  # Convert text to lower case
    text = re.sub(r'____', '', text)  # Remove email addresses
    text = re.sub(r'____', '', text)  # Remove URLs
    text = re.sub(r'____', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'____', '', text)  # Remove digits
    text = re.sub(r'____', ' ', text)  # Remove extra spaces

    return text.strip()

test_texts = ['Sample Text with EMAIL! mailsample@mail.com', 
              'URL example: https://www.codesignal.com', 
              'Special characters #formed.', 
              'Digits included: 1234!!',
              'Extra spaces   included.']
for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
    print('--')

```

Here's how to complete the `clean_text` function to transform all text to lowercase, remove email addresses, URLs, special characters, digits, and extra spaces:

```python
import re

def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^\w\s]', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d+', '', text)  # Remove digits
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces

    return text.strip()

test_texts = ['Sample Text with EMAIL! mailsample@mail.com', 
              'URL example: https://www.codesignal.com', 
              'Special characters #formed.', 
              'Digits included: 1234!!',
              'Extra spaces   included.']
for text in test_texts:
    print(f'Original: {text}')
    print(f'Cleaned: {clean_text(text)}')
    print('--')
```

### Explanation of the regex patterns:
1. **`text.lower()`**: Converts the entire text to lowercase.
2. **`r'\S*@\S*\s?'`**: Removes every whitespace-delimited token that contains `@`, plus one trailing whitespace character. This covers email addresses such as `user@example.com`, but also any other token with an `@` in it.
3. **`r'http\S+'`**: Removes URLs that start with "http" (including "https"), up to the next whitespace character.
4. **`r'[^\w\s]'`**: Replaces each punctuation mark or special character with a space, leaving only word characters (letters, digits, underscores) and whitespace.
5. **`r'\d+'`**: Removes runs of digits from the text.
6. **`r'\s+'`**: Collapses each run of whitespace into a single space.

This code will ensure that text is clean, with only necessary elements left for further processing or analysis.
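
One detail of `\w` that is easy to overlook: it includes the underscore, so `[^\w\s]` leaves underscores in place. The sketch below, on a made-up string, contrasts that with `[\W_]`, an assumed alternative that treats underscores as separators too.

```python
import re

sample = 'snake_case #tag v2.0'  # hypothetical input

kept = re.sub(r'[^\w\s]', ' ', sample)       # underscore counts as a word character
no_underscore = re.sub(r'[\W_]', ' ', sample)  # assumed variant: underscore removed as well

print(repr(kept))
print(repr(no_underscore))
```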

## Implement Text Cleaning on Dataset

Off to a flying start, Celestial Traveler! Your new task requires you to insert missing code into two sections. First, write code that transforms the collected dataset into a DataFrame. Then, ensure that you apply the clean_text function to the DataFrame to clean up the textual data. Bon Voyage!

```python
import re
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace to single spaces

    return text.strip()  # Strip leading and trailing spaces

# Fetch the dataset
newsgroups_data = fetch_20newsgroups(subset='train')

# Create DataFrame from the dataset
# TODO: Write the required code to create a dataframe from the dataset. Name the dataframe as nlp_df and the column as 'text'.

# Apply the clean_text function to the text data in the DataFrame
# TODO: Write the code to apply the clean_text function to the 'text' column in the dataframe. 

# Check the cleaned text
print(nlp_df.head())

```

To complete the task, let's proceed with the following steps:

1. Create a DataFrame from the `newsgroups_data` and store it in `nlp_df`, with the relevant column being `'text'`.
2. Apply the `clean_text` function to the `'text'` column to clean the textual data.

Here’s the completed code:

```python
import re
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

def clean_text(text):
    text = text.lower()  # Convert text to lower case
    text = re.sub(r'\S*@\S*\s?', '', text)  # Remove email addresses
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\W', ' ', text)  # Remove punctuation and special characters
    text = re.sub(r'\d', ' ', text)  # Remove digits
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace to single spaces

    return text.strip()  # Strip leading and trailing spaces

# Fetch the dataset
newsgroups_data = fetch_20newsgroups(subset='train')

# Create DataFrame from the dataset
nlp_df = pd.DataFrame(newsgroups_data.data, columns=['text'])

# Apply the clean_text function to the 'text' column in the dataframe
nlp_df['text'] = nlp_df['text'].apply(clean_text)

# Check the cleaned text
print(nlp_df.head())
```

### Explanation:
- **Step 1**: We fetch the `20newsgroups` dataset using `fetch_20newsgroups(subset='train')` and store the raw data in `newsgroups_data`. We then create a DataFrame (`nlp_df`) using `pd.DataFrame()` where the data is placed under the column `'text'`.
  
- **Step 2**: We apply the `clean_text` function to the `'text'` column using the `apply()` method, which processes each row (i.e., each piece of text) in the column.

### Expected Output:
The first few rows of the cleaned DataFrame will be displayed, where the text has been transformed, with all unwanted characters and patterns removed.
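
Rather than inspecting rows by eye, a rough sanity check can confirm that a string has the shape the cleaning function is meant to produce. This is a sketch under an assumed definition of "clean" (lowercase words of letters or underscores, single spaces, no edge whitespace); `CLEAN_PATTERN` and `looks_clean` are hypothetical names.

```python
import re

# Assumed definition of "clean" for this sketch: lowercase words made of
# letters or underscores, separated by single spaces, no edge whitespace.
CLEAN_PATTERN = re.compile(r'(?:[^\W\d]+(?: [^\W\d]+)*)?')

def looks_clean(text):
    """Return True if text already has the shape clean_text should produce."""
    return text == text.lower() and CLEAN_PATTERN.fullmatch(text) is not None

checks = ['is an email address', 'Has Capitals', 'two  spaces', 'digit 7', 'trailing ']
for candidate in checks:
    print(f'{candidate!r}: {looks_clean(candidate)}')
```

On a DataFrame, the same helper could be passed to `apply` on the cleaned column and combined with `.all()` to check every row at once.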

## Mastering Text Cleaning with Python Regex on a Dataset

Excellent work so far, Stellar Navigator! This time, you'll master how to clean textual data. We aim to transform pieces of text into cleaner versions, making them more suitable for further analysis. Let's put the focus on the cleaning functions you've built previously. Implement a combined function that runs all these cleaning operations in sequence: converting text to lowercase, then removing email addresses, URLs, special characters, digits, and extra spaces. Finally, apply this unified cleaning function to a dataset.

```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import re

def lower_text(text):
    return text.lower()

def remove_email(text):
    return re.sub(r'\S*@\S*\s?', '', text)

def remove_url(text):
    return re.sub(r'http\S+', '', text)

def remove_special_chars(text):
    return re.sub(r'\W', ' ', text)

def remove_digits(text):
    return re.sub(r'\d', ' ', text)

def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# TODO: Implement a function clean_text that runs all the functions above in a sequence

# Fetch the dataset
newsgroups_data = fetch_20newsgroups(subset='train')

# TODO: Create a DataFrame from the dataset

# TODO: Apply the cleaning functions to the DataFrame

# TODO: Check the cleaned text

```

To implement a unified `clean_text` function that combines all the text cleaning functions and applies them in sequence, we'll follow these steps:

1. **Combine all functions**: We'll create a `clean_text` function that calls each of the existing functions one after the other.
2. **Create the DataFrame**: We'll create a DataFrame from the dataset to hold the text.
3. **Apply the function**: We'll apply `clean_text` to the `'text'` column of the DataFrame.

Here’s the completed code:

```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import re

def lower_text(text):
    return text.lower()

def remove_email(text):
    return re.sub(r'\S*@\S*\s?', '', text)

def remove_url(text):
    return re.sub(r'http\S+', '', text)

def remove_special_chars(text):
    return re.sub(r'\W', ' ', text)

def remove_digits(text):
    return re.sub(r'\d', ' ', text)

def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Combine all cleaning functions into a single clean_text function
def clean_text(text):
    text = lower_text(text)  # Convert to lowercase
    text = remove_email(text)  # Remove email addresses
    text = remove_url(text)  # Remove URLs
    text = remove_special_chars(text)  # Remove special characters
    text = remove_digits(text)  # Remove digits
    text = remove_extra_spaces(text)  # Remove extra spaces
    return text

# Fetch the dataset
newsgroups_data = fetch_20newsgroups(subset='train')

# Create a DataFrame from the dataset
nlp_df = pd.DataFrame(newsgroups_data.data, columns=['text'])

# Apply the cleaning functions to the DataFrame
nlp_df['text'] = nlp_df['text'].apply(clean_text)

# Check the cleaned text
print(nlp_df.head())
```

### Explanation:
1. **Clean Text Function**: 
   - The `clean_text` function now combines all the individual functions (`lower_text`, `remove_email`, `remove_url`, `remove_special_chars`, `remove_digits`, and `remove_extra_spaces`) into a single pipeline.
   
2. **DataFrame Creation**:
   - We create a `DataFrame` (`nlp_df`) with the `newsgroups_data.data`, assigning the text to the column `'text'`.
   
3. **Apply Cleaning**:
   - The `apply()` method is used to apply the `clean_text` function to every text entry in the `'text'` column of the DataFrame.

### Expected Output:
The first few rows of the cleaned text will be displayed, showing that all unwanted elements (such as emails, URLs, special characters, digits, and excessive spaces) have been removed. This dataset will now be ready for further processing or analysis.
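
The order of the steps is part of the design: emails and URLs have to be removed before the special-character step strips `@`, `:`, and `/`. The sketch below expresses the pipeline as a tuple of functions so the order can be changed and compared; `PIPELINE`, `reordered`, and the sample sentence are assumptions for this illustration, while the helper functions repeat the ones from the exercise so the block runs on its own.

```python
import re

# Same helpers as in the exercise, repeated so the block runs on its own.
def lower_text(text):
    return text.lower()

def remove_email(text):
    return re.sub(r'\S*@\S*\s?', '', text)

def remove_url(text):
    return re.sub(r'http\S+', '', text)

def remove_special_chars(text):
    return re.sub(r'\W', ' ', text)

def remove_digits(text):
    return re.sub(r'\d', ' ', text)

def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

PIPELINE = (lower_text, remove_email, remove_url,
            remove_special_chars, remove_digits, remove_extra_spaces)

def clean_text(text, steps=PIPELINE):
    """Apply each cleaning step in the given order."""
    for step in steps:
        text = step(text)
    return text

sample = 'Mail me: Bob@Site.org or see https://x.io, room 42!'  # made-up input

# Hypothetical reordering: special characters are stripped first, so the
# email and URL patterns no longer find anything complete to remove.
reordered = (lower_text, remove_special_chars, remove_email,
             remove_url, remove_digits, remove_extra_spaces)

print(clean_text(sample))
print(clean_text(sample, steps=reordered))
```

With the original order the address and link vanish; with the reordered steps their fragments (`bob`, `site`, `org`, `x`, `io`) are left behind as ordinary words.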