# Unit 1 

# Introduction to NLP Data Processing

Welcome to the first lesson of the "Foundations of NLP Data Processing" course. In this lesson, we will explore the essential techniques for **cleaning and normalizing text data**, which are crucial steps in preparing data for **Natural Language Processing (NLP)** models. Text preprocessing helps in removing noise and ensuring that the data is in a consistent format, making it easier for NLP models to understand and analyze. By the end of this lesson, you will be able to create a text-cleaning pipeline that effectively prepares text data for further processing.

-----

## Setting Up the Environment

Before we dive into text cleaning, let's set up our environment. We will use several Python libraries: **nltk**, **autocorrect**, and **re**. These libraries are pre-installed in CodeSignal environments, so you don't need to worry about installation here. However, on your own device, you can install them using `pip`:

```bash
pip install nltk autocorrect
```

The `nltk` library is a toolkit for working with human language data. It provides tools for text processing, including stopword removal, stemming, and lemmatization. Note that installing `nltk` via `pip` does not install the data packages (such as corpora and word lists) that some of its tools depend on; you download those separately from within your Python code. For example, to use the WordNet lemmatizer, you need to download the WordNet data:

```python
import nltk
nltk.download('wordnet')
```

Similarly, for stopwords removal, you need to download the stopwords data:

```python
nltk.download('stopwords')
```

The `autocorrect` library helps in correcting misspelled words, and `re` is a built-in Python library for working with regular expressions, which we will use to remove unwanted text elements.

-----

## Removing Unwanted Text Elements

In text data, you often encounter unwanted elements such as URLs, email addresses, special characters, numbers, and punctuation. These elements can introduce noise and affect the performance of NLP models. We will use **regular expressions (re)** to remove these unwanted elements. By crafting specific patterns, we can efficiently identify and eliminate these elements from the text, leaving behind only the relevant words.

-----

## Text Normalization Techniques

Text normalization is the process of converting text into a standard format, which is essential for consistent and accurate text processing. This involves several key steps:

  * **Unicode Normalization**: Text data can come from various sources and may contain characters from different languages and scripts. Unicode normalization ensures that these characters are represented consistently across the text. We will use the `unicodedata` library to perform this normalization. This step is crucial for accurate text processing, especially when dealing with multilingual data or text from diverse sources.
      * **Example**: Consider the character "é", which can be represented in two ways: as a single precomposed code point (U+00E9) or as the letter "e" (U+0065) followed by a combining acute accent (U+0301). The two representations look identical on screen but compare as different strings. Normalizing them ensures they are treated as the same character.

```python
import unicodedata
text1 = "\u00e9"   # Single precomposed character: é
text2 = "e\u0301"  # 'e' followed by a combining acute accent
print(text1 == text2)  # False (they look the same but are different)

# Normalize both to NFC (composed form)
normalized_text1 = unicodedata.normalize("NFC", text1)
normalized_text2 = unicodedata.normalize("NFC", text2)
print(normalized_text1 == normalized_text2)  # True (now they are the same)
```

  * **Lowercasing**: Converting text to lowercase is a simple yet effective normalization technique. It helps in maintaining uniformity across the text data by ensuring that words are treated the same regardless of their case. For example, "Natural Language Processing" and "natural language processing" would be considered the same phrase after lowercasing, which is important for tasks like text classification and sentiment analysis.
      * **Example**: "Natural Language Processing" → "natural language processing".
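As a small illustration of why lowercasing matters, the sketch below counts tokens before and after lowercasing. The sample sentence and variable names are made up for this example, and `str.casefold()` is shown only as a stricter alternative that the rest of this lesson does not use.

```python
from collections import Counter

# Made-up sample text for illustration
text = "NLP is fun. nlp is useful. Nlp is everywhere."

# Drop the periods and split on whitespace
tokens = text.replace(".", "").split()

# Without lowercasing, each casing is counted as a separate token
raw_counts = Counter(tokens)

# After lowercasing, all three variants collapse into one token
lower_counts = Counter(token.lower() for token in tokens)

print(raw_counts["NLP"], raw_counts["nlp"], raw_counts["Nlp"])  # 1 1 1
print(lower_counts["nlp"])  # 3

# casefold() is a more aggressive variant intended for caseless matching
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```

For English-only text, `lower()` is usually sufficient; `casefold()` may be worth considering when the data contains other languages.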

-----

## Stemming and Lemmatization

Stemming and lemmatization are further normalization techniques. Both reduce words to a base or root form, so that variants such as "run" and "running" are treated as the same term.

  * **Stemming**: This process involves removing suffixes from words to obtain their root form. It is a rule-based approach and may not always produce a valid word. For example, "running" becomes "run" and "better" becomes "better" (no change).
  * **Lemmatization**: This process reduces words to their base or dictionary form, known as the lemma. It relies on a vocabulary lookup and the word's part of speech, which generally gives more accurate results than stemming. NLTK's `WordNetLemmatizer` does not infer the part of speech from context: you supply it through the `pos` argument of the `lemmatize` method, and it defaults to noun (`'n'`) when omitted. For example, "running" becomes "run" when treated as a verb, and "better" becomes "good" when treated as an adjective. The `pos` argument can be set to:
      * `'v'`: Verb
      * `'n'`: Noun
      * `'a'`: Adjective
      * `'r'`: Adverb
      * By specifying the correct part of speech, the lemmatizer can more accurately reduce words to their base forms.

We will use the `nltk` library for both stemming and lemmatization.

### Example Code

```python
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk

# Download WordNet data for lemmatization
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
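words = ["running", "better", "flies"]
# Part of speech for each word: "better" only maps to "good" as an adjective
pos_tags = ["v", "a", "v"]
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in zip(words, pos_tags)]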
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)
```

**Output**:

```text
Stemmed Words: ['run', 'better', 'fli']
Lemmatized Words: ['run', 'good', 'fly']
```

By incorporating stemming and lemmatization, we can further enhance the normalization process, ensuring that words are consistently represented in their base forms.
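To make the difference between the two approaches concrete without depending on NLTK, here is a deliberately simplified sketch. The `toy_stem` function and the `TOY_LEMMAS` table are hypothetical stand-ins invented for this illustration. They are far cruder than `PorterStemmer` and `WordNetLemmatizer` and only mimic the general idea of rule-based suffix stripping versus dictionary lookup.

```python
# Hypothetical, highly simplified stand-ins (NOT the NLTK algorithms)
SUFFIXES = ("ing", "ies", "es", "s")

# Tiny hand-written lookup table keyed by (word, part of speech)
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("flies", "v"): "fly",
    ("better", "a"): "good",
}

def toy_stem(word):
    """Strip the first matching suffix without checking the result is a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look the word up for the given part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in [("running", "v"), ("flies", "v"), ("better", "a")]:
    print(f"{word}: stem={toy_stem(word)} lemma={toy_lemmatize(word, pos)}")
```

The rule-based function produces fragments such as "runn" and "fli", while the lookup returns real words but only for entries, and parts of speech, that the table knows about.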

-----

## Stopwords Removal and Spell Checking

**Stopwords** are common words like "and," "the," and "is" that usually do not contribute much to the meaning of a sentence. Removing stopwords can help in reducing the size of the text data and focusing on the more meaningful words. We will use the `nltk` library to remove stopwords from our text. Additionally, we will use the `autocorrect` library to correct any misspelled words, ensuring that the text data is clean and accurate.
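The filtering idea can be sketched with a small hand-picked stopword set. The set and the function name `remove_stopwords` below are assumptions made for this example; the set is much shorter than NLTK's English list, which the pipeline in this lesson uses.

```python
# Hypothetical mini stopword list (NLTK's English list is far longer)
SAMPLE_STOPWORDS = {"and", "the", "is", "a", "of"}

def remove_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    """Return the lowercase tokens of `text` that are not stopwords."""
    return [word for word in text.lower().split() if word not in stop_words]

sentence = "The cat and the dog is a pair of friends"
print(remove_stopwords(sentence))
# ['cat', 'dog', 'pair', 'friends']
```

Storing the stopwords in a `set` keeps each membership test fast, which matters when filtering large amounts of text.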

### Example: Cleaning a Text Sample

Let's break down the text-cleaning pipeline into smaller steps to better understand each aspect of the process. Consider the following example text:

```text
"Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"
```

This text contains various elements that we need to clean and normalize, such as email addresses, URLs, special characters, and misspellings.

-----

### Step 1: Remove URLs, Emails, and Unwanted Characters

In this step, we will use regular expressions to remove URLs, email addresses, special characters, numbers, and punctuation from the text. The `re.sub` function from the `re` library is a powerful tool for this task. It allows us to search for specific patterns in the text and replace them with a desired string, which in this case is an empty string to remove the unwanted elements.

Here's a deeper explanation of how `re.sub` works:

  * **Pattern**: The first argument to `re.sub` is the pattern we want to search for. This pattern is defined using regular expressions, which are sequences of characters that form a search pattern. For example, `r'http\S+'` matches any substring that starts with "http" followed by any non-whitespace characters, effectively capturing URLs.
  * **Replacement**: The second argument is the replacement string. In our case, we use an empty string `''` to remove the matched patterns from the text.
  * **String**: The third argument is the input string where the search and replace operation will be performed.

Let's apply this to our example text:

```python
import re

# Example text
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

# Remove URLs and emails
text_sample = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text_sample)

# Remove special characters, numbers, and punctuation
text_sample = re.sub(r'[^a-zA-Z\s]', '', text_sample)

# Output after Step 1
print("After Removing URLs, Emails, and Unwanted Characters:", text_sample)
```

**Explanation of Patterns**:

  * `r'http\S+|www\S+|[\w.-]+@[\w.-]+'`: This pattern matches URLs and email addresses.
      * `http\S+` matches any URL starting with "http" followed by non-whitespace characters.
      * `www\S+` matches URLs starting with "www".
      * `[\w.-]+@[\w.-]+` matches email addresses by looking for a sequence of word characters, dots, or hyphens followed by an "@" symbol and another sequence of word characters, dots, or hyphens.
  * `r'[^a-zA-Z\s]'`: This pattern matches any character that is not a letter (both uppercase and lowercase) or a whitespace. The `^` inside the square brackets negates the character class, so it matches anything that is not a letter or space, effectively removing special characters, numbers, and punctuation.

**Output**:

```text
After Removing URLs, Emails, and Unwanted Characters: Hello Email me at  Visit  Natural language processing is amzing
```

By using `re.sub`, we efficiently clean the text by removing unwanted elements, leaving behind only the relevant words for further processing.
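To check what each pattern actually matches, `re.findall` can be run on the same sample text. This is only an inspection aid and not part of the lesson's pipeline. Note that both matches include the trailing period, because `\S` and `[\w.-]` each accept a dot.

```python
import re

text_sample = (
    "Hello! Email me at example@mail.com. Visit https://example.com. "
    "Natural language processing is amzing! 😊"
)

# List every URL or email match, in order of appearance
matches = re.findall(r'http\S+|www\S+|[\w.-]+@[\w.-]+', text_sample)
print(matches)
# ['example@mail.com.', 'https://example.com.']

# List every single character the second pattern would remove
removed = re.findall(r'[^a-zA-Z\s]', text_sample)
print(removed)
```

Inspecting matches this way before calling `re.sub` can help catch patterns that remove more, or less, than intended.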

-----

### Step 2: Normalize Unicode and Convert to Lowercase

Next, we normalize the Unicode text to ensure consistent character representation and convert the text to lowercase to maintain uniformity:

```python
import unicodedata

# Normalize Unicode
text_sample = unicodedata.normalize("NFKC", text_sample)

# Convert to lowercase
text_sample = text_sample.lower()

# Output after Step 2
print("After Unicode Normalization and Lowercasing:", text_sample)
```

**Output**:

```text
After Unicode Normalization and Lowercasing: hello email me at  visit  natural language processing is amzing
```

Unicode normalization helps in handling characters from different languages and scripts consistently. Lowercasing ensures that words are treated the same regardless of their case.
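One caveat is that order matters. Step 1 has already removed every character outside `a-z`, `A-Z`, and whitespace by the time Step 2 runs, so normalization has little left to work on. The sketch below uses a made-up string containing the ligature `ﬁ` (U+FB01) and fullwidth letters to compare the two orders. The helper name `strip_non_letters` is invented for this example, and which order suits a given dataset remains a design decision.

```python
import re
import unicodedata

# Made-up sample: the ligature U+FB01 followed by "ne", then "cafe" in fullwidth letters
raw = "\ufb01ne \uff43\uff41\uff46\uff45"

def strip_non_letters(text):
    """Remove everything except ASCII letters and whitespace (same pattern as Step 1)."""
    return re.sub(r'[^a-zA-Z\s]', '', text)

# Order A: strip first, then normalize (the ligature and fullwidth letters are lost)
strip_then_normalize = unicodedata.normalize("NFKC", strip_non_letters(raw))

# Order B: normalize first, then strip (compatibility characters become ASCII first)
normalize_then_strip = strip_non_letters(unicodedata.normalize("NFKC", raw))

print(repr(strip_then_normalize))  # 'ne '
print(repr(normalize_then_strip))  # 'fine cafe'
```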

-----

### Step 3: Remove Stopwords, Correct Misspellings, and Apply Stemming/Lemmatization

Finally, we remove stopwords, correct any misspellings, and apply stemming or lemmatization to clean the text further:

```python
import nltk
from nltk.corpus import stopwords
from autocorrect import Speller
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download stopwords and WordNet data
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize spell checker, stopwords, stemmer, and lemmatizer
spell = Speller(lang='en')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Remove stopwords, spell-check, and apply stemming/lemmatization
words = text_sample.split()
words = [spell(word) for word in words if word not in stop_words]
stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]

# Output after Step 3
print("Cleaned Text with Stemming:", " ".join(stemmed_words))
print("Cleaned Text with Lemmatization:", " ".join(lemmatized_words))
```

**Output**:

```text
Cleaned Text with Stemming: hello email visit natur languag process amaz
Cleaned Text with Lemmatization: hello email visit natural language process amaze
```

Note that the misspelling "amzing" has been autocorrected to "amazing" during this step. The spell checker helps in ensuring that the text data is clean and accurate by fixing such errors. Additionally, stemming reduces "amazing" to "amaz" while lemmatization changes it to "amaze," further normalizing the text.

By breaking down the process into these steps, you can focus on each aspect of text cleaning and normalization, making it easier to understand and apply these techniques in practice.

-----

## Summary and Next Steps

In this lesson, we covered the foundational techniques for cleaning and normalizing text data. We explored how to set up the environment, remove unwanted elements, normalize text, handle stopwords and misspellings, and apply stemming and lemmatization. These preprocessing steps are crucial for preparing text data for NLP models, as they help in reducing noise and ensuring consistency. As you move on to the practice exercises, apply these techniques to clean and prepare text data effectively. This will set a strong foundation for more advanced NLP tasks in the subsequent lessons.

## Text Cleaning with Regular Expressions

You've just learned about removing unwanted text elements using regular expressions. Now, let's put that knowledge into practice!

Your task is to write a function that cleans a given text by removing:

  * URLs
  * Email addresses
  * Special characters
  * Punctuation
  * Digits

This will help you understand how to handle these specific types of noise early in the text-cleaning pipeline. Dive in and see how clean you can make the text!

```python
import re

def clean_text(text):
    # TODO: Remove URLs and emails
    # TODO: Remove special characters, numbers, and punctuation
    return text

# Example usage
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

cleaned_text = clean_text(text_sample)
print("Cleaned Text:", cleaned_text)
```

### Step 1: Remove URLs and Emails

In the `clean_text` function, your first task is to remove URLs and email addresses. You can achieve this using the `re.sub()` function with the same regular expression patterns we discussed in the lesson.

```python
import re

def clean_text(text):
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text)
    # TODO: Remove special characters, numbers, and punctuation
    return text

# Example usage
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

cleaned_text = clean_text(text_sample)
print("After removing URLs and emails:", cleaned_text)
```

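The output of this step will be: `After removing URLs and emails: Hello! Email me at  Visit  Natural language processing is amzing! 😊`

Notice that the periods following the email address and the URL are removed as well, because both `[\w.-]+` and `\S+` match a trailing dot.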

-----

### Step 2: Remove Special Characters, Numbers, and Punctuation

Now, add the second part of the solution to the `clean_text` function to remove special characters, numbers, and punctuation.

```python
import re

def clean_text(text):
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text)
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

# Example usage
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

cleaned_text = clean_text(text_sample)
print("Final Cleaned Text:", cleaned_text)
```

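The output after completing the function will be: `Final Cleaned Text: Hello Email me at  Visit  Natural language processing is amzing`. Extra spaces remain where the email address and the URL were removed, and a trailing space is left at the end where the emoji used to be.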

This shows how you can chain `re.sub()` calls to perform multiple cleaning operations on a single string, effectively preparing it for the next stages of NLP preprocessing.
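The cleaned string still contains double spaces where the URL and the email address used to be. One common follow-up, which is not part of the exercise itself, is a third `re.sub()` call that collapses runs of whitespace. The function name `clean_and_collapse` is made up for this sketch.

```python
import re

def clean_and_collapse(text):
    # Same two steps as clean_text above
    text = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Extra step: replace any run of whitespace with one space, then trim the ends
    return re.sub(r'\s+', ' ', text).strip()

text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"
print(clean_and_collapse(text_sample))
# Hello Email me at Visit Natural language processing is amzing
```

If the text is later split with `str.split()`, the extra spaces are harmless, but collapsing them keeps printed or stored strings tidy.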

## Text Normalization in Action

Nice job on learning about text normalization! Now, let's enhance your skills by applying what you've learned.

Your task is to improve the text processing pipeline by:

  * Converting all text to lowercase.
  * Using `unicodedata.normalize()` to ensure proper Unicode representation.

This will help you ensure that visually identical but differently encoded characters match after normalization. Give it a try and see the difference it makes!

```python
import re
import unicodedata

# Example text
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

def clean_text(text):
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text)
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # TODO: Convert to lowercase
    # TODO: Normalize Unicode

    return text

# Clean the text using the function
cleaned_text = clean_text(text_sample)

# Print cleaned text
print("Cleaned Text:", cleaned_text)

```

That's a great task to practice your text cleaning skills. The existing `clean_text` function already handles removing URLs, emails, and special characters. Your goal is to add the final touches by converting the text to lowercase and normalizing the Unicode.

Here's how you can complete the code by adding those two steps. You can place them right before the `return` statement.

```python
import re
import unicodedata

# Example text
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

def clean_text(text):
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text)
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Normalize Unicode
    text = unicodedata.normalize('NFKC', text)
    
    return text

# Clean the text using the function
cleaned_text = clean_text(text_sample)

# Print cleaned text
print("Original Text:", text_sample)
print("Cleaned Text:", cleaned_text)
```

**Explanation of the changes:**

  * `text = text.lower()`: This line converts the entire string to lowercase. This is a fundamental step in text normalization to ensure that words like "Hello" and "hello" are treated as the same word.
  * `text = unicodedata.normalize('NFKC', text)`: This line performs Unicode normalization. `NFKC` (Normalization Form Compatibility Composition) is a common form used for text processing because it converts compatibility characters (such as Roman numeral symbols, superscript numbers, and ligatures) into their standard equivalents, for example `ﬁ` to `fi`. Keep in mind that in this function the regular expression has already removed every non-ASCII character before normalization runs, so such characters are deleted rather than converted. Normalizing before the regex steps, as the next exercise does, preserves them.

By adding these two lines, your `clean_text` function will be more robust and ready for further natural language processing tasks.
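To see how `NFKC` differs from the `NFC` form used earlier in the lesson, the following sketch normalizes a few characters with both forms. The sample characters are chosen for illustration only.

```python
import unicodedata

# Sample characters written as escapes so their code points are unambiguous
samples = {
    "ligature fi": "\ufb01",
    "superscript two": "\u00b2",
    "Roman numeral four": "\u2163",
    "decomposed e-acute": "e\u0301",
}

for name, char in samples.items():
    nfc = unicodedata.normalize("NFC", char)
    nfkc = unicodedata.normalize("NFKC", char)
    # NFC only composes; NFKC also replaces compatibility characters
    print(f"{name}: NFC={nfc!r} NFKC={nfkc!r}")
```

`NFC` leaves the ligature, the superscript, and the Roman numeral symbol unchanged, while `NFKC` rewrites them as `fi`, `2`, and `IV`. Both forms compose the decomposed "é" into a single code point.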

## Refine Your Text Cleaning Skills

Well done on mastering text normalization! Now, let's take it a step further by incorporating stopwords removal into your text-cleaning pipeline.

Your task is to enhance the pipeline by:

  * Using the NLTK stopwords list to filter out common, insignificant words.
  * Ensuring the text is spell-checked after the stopwords are removed.

This will refine your text data, making it more meaningful for NLP tasks. Dive in and see how much cleaner your text becomes!

```python
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from autocorrect import Speller
nltk.download('stopwords', quiet=True)

# Initialize spell checker and stopwords
spell = Speller(lang='en')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Normalize Unicode
    text = unicodedata.normalize("NFKC", text)
    # Convert to lowercase
    text = text.lower()
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text)
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-z\s]', '', text)
    
    # TODO: Split the text into words
    # TODO: Remove stopwords from the list of words and spell-check each word
    # TODO: Join the words back into a single string
    
    return text

# Example text
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

# Print cleaned text
print("Cleaned Text:", clean_text(text_sample))

```

Excellent! Incorporating stopwords removal and spell-checking will significantly improve the quality of your text data for any subsequent NLP analysis.

Here is the completed code with the new steps integrated into the `clean_text` function.

```python
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from autocorrect import Speller
nltk.download('stopwords', quiet=True)

# Initialize spell checker and stopwords
spell = Speller(lang='en')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Normalize Unicode
    text = unicodedata.normalize("NFKC", text)
    # Convert to lowercase
    text = text.lower()
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text)
    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Split the text into words
    words = text.split()

    # Remove stopwords and spell-check each word
    cleaned_words = [spell(word) for word in words if word not in stop_words]

    # Join the words back into a single string
    text = " ".join(cleaned_words)
    
    return text

# Example text
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

# Print cleaned text
print("Original Text:", text_sample)
print("Cleaned Text:", clean_text(text_sample))
```

**Explanation of the new code:**

1.  **`words = text.split()`**: This line splits the input string into a list of individual words, making it easy to iterate over them.
2.  **`cleaned_words = [spell(word) for word in words if word not in stop_words]`**: This is a powerful list comprehension that performs two key tasks simultaneously:
      * `if word not in stop_words`: It filters out any word that is present in the NLTK stopwords set.
      * `spell(word)`: For every word that is not a stopword, it applies the `autocorrect` spell checker.
3.  **`text = " ".join(cleaned_words)`**: Finally, this line joins the list of `cleaned_words` back into a single string, with each word separated by a space.

When you run this code, you'll see a much cleaner output that is free of irrelevant words and contains a correctly spelled word ("amazing" instead of "amzing").
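One subtlety of this comprehension is that the stopword test runs on the original word, before spell-checking. The sketch below illustrates the effect with a hypothetical `toy_spell` function backed by a two-entry correction table and a tiny stopword set. Both are stand-ins invented here, not the `autocorrect` or NLTK data.

```python
# Hypothetical stand-ins for autocorrect's Speller and NLTK's stopword list
TOY_CORRECTIONS = {"amzing": "amazing", "teh": "the"}
TOY_STOPWORDS = {"the", "is"}

def toy_spell(word):
    """Return the corrected spelling if known, else the word unchanged."""
    return TOY_CORRECTIONS.get(word, word)

words = "teh language is amzing".split()

# Same shape as in clean_text: filter on the raw word, then correct it
filter_then_correct = [toy_spell(w) for w in words if w not in TOY_STOPWORDS]

# Alternative: correct first, then filter the corrected words
corrected = [toy_spell(w) for w in words]
correct_then_filter = [w for w in corrected if w not in TOY_STOPWORDS]

print(filter_then_correct)  # ['the', 'language', 'amazing']
print(correct_then_filter)  # ['language', 'amazing']
```

A misspelled stopword slips through the first version because it is only recognized after correction. Whether that matters depends on how noisy the input is.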

## Stemming vs Lemmatization Showdown

You've done a fantastic job learning about text cleaning and normalization! Now, let's dive deeper into understanding stemming and lemmatization.

Your task is to compare these two techniques by:

  * Applying both stemming and lemmatization to a sample text.
  * Examining the differences in the resulting words.

This exercise will help you see the strengths and weaknesses of each approach in terms of accuracy and context preservation. Let's get started and see how these techniques transform your text!

```python
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from autocorrect import Speller
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Initialize spell checker and stopwords
spell = Speller(lang='en')
stop_words = set(stopwords.words('english'))

# Example text
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

# Remove URLs and emails
text_sample = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text_sample)
# Remove special characters, numbers, and punctuation
text_sample = re.sub(r'[^a-zA-Z\s]', '', text_sample)
# Normalize Unicode
text_sample = unicodedata.normalize("NFKC", text_sample)
# Convert to lowercase
text_sample = text_sample.lower()
# Remove stopwords and spell-check
words = text_sample.split()
words = [spell(word) for word in words if word not in stop_words]
text_sample = " ".join(words)

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# TODO: Apply stemming to the words

# TODO: Apply lemmatization to the words

# Print results
print("Original Words:", words)
print("Stemmed Words:", _____)
print("Lemmatized Words:", _____)
```

Excellent! This exercise highlights the key differences between these two common techniques. Stemming is a more aggressive, rule-based approach, while lemmatization is more conservative and relies on a dictionary to find the base form.

Here is the completed code that applies both stemming and lemmatization to the pre-processed text, allowing you to compare the results directly.

```python
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from autocorrect import Speller
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Initialize spell checker and stopwords
spell = Speller(lang='en')
stop_words = set(stopwords.words('english'))

# Example text
text_sample = "Hello! Email me at example@mail.com. Visit https://example.com. Natural language processing is amzing! 😊"

# Remove URLs and emails
text_sample = re.sub(r'http\S+|www\S+|[\w.-]+@[\w.-]+', '', text_sample)
# Remove special characters, numbers, and punctuation
text_sample = re.sub(r'[^a-zA-Z\s]', '', text_sample)
# Normalize Unicode
text_sample = unicodedata.normalize("NFKC", text_sample)
# Convert to lowercase
text_sample = text_sample.lower()
# Remove stopwords and spell-check
words = text_sample.split()
words = [spell(word) for word in words if word not in stop_words]
text_sample = " ".join(words)

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Apply stemming to the words
stemmed_words = [stemmer.stem(word) for word in words]

# Apply lemmatization to the words
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Print results
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)
```

**What the Output Shows:**

  * **Original Words**: `['hello', 'email', 'visit', 'natural', 'language', 'processing', 'amazing']`
      * This is the list of words after cleaning, stopword removal, and spell-checking.
  * **Stemmed Words**: `['hello', 'email', 'visit', 'natur', 'languag', 'process', 'amaz']`
      * The stemmer applies a series of simple rules, often resulting in word fragments that are not actual dictionary words (e.g., `natur`, `languag`). This is because it truncates suffixes without considering the word's meaning.
  * **Lemmatized Words**: `['hello', 'email', 'visit', 'natural', 'language', 'processing', 'amazing']`
      * The lemmatizer uses a dictionary to find the canonical base form (lemma) of the word. Because no `pos` argument is passed here, every word is looked up as a noun. `processing` is therefore kept as is, and `amazing` is not recognized as a form of the verb "amaze", so it is unchanged too. Passing `pos='v'`, as in the lesson, would give `process` and `amaze` instead. This output shows how lemmatization aims to return a valid word, unlike stemming.

This example demonstrates the trade-off: stemming is faster and simpler but can produce non-dictionary words, while lemmatization returns valid dictionary words but is only as accurate as the part of speech it is given.