# Preparing Unstructured Text Data

This notebook walks through common techniques for preparing unstructured text data for analysis: corpus cleaning, removal of punctuation, stopwords, numbers, and URLs, and lowercasing. These preprocessing steps make the text clean and consistent enough for downstream NLP tasks.

## 1. Corpus Cleaning

Corpus cleaning standardizes the raw text so that equivalent words are represented the same way. It includes:

- **Removing Numbers**: Numbers add noise when they are not relevant to the analysis. Removing them keeps the focus on the textual content.

- **Correcting Spelling**: Spelling errors inflate the vocabulary, because each misspelling is counted as a separate token. Correcting them improves the quality of the text data, although an automatic corrector can also alter valid but unusual tokens, so its output is worth spot-checking.

- **Harmonizing Case**: Inconsistent casing (uppercase, lowercase, mixed case) causes the same word to be treated as different tokens. Converting the text to a single case (usually lowercase) standardizes the data.
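
As a minimal sketch of the number-removal and case-harmonization steps, the snippet below uses only the Python standard library on two of this notebook's sample sentences. The helper name `strip_numbers_and_case` is made up for illustration, and spelling correction is left out because it needs a third-party library.

```python
import re

samples = [
    'This is a sample text! It includes numbers like 123 and URLs like https://example.com.',
    'Final example: punctuation, numbers (456), and mixedCASE text.',
]

def strip_numbers_and_case(text):
    # Delete every run of digits
    text = re.sub(r'\d+', '', text)
    # Collapse the double spaces left where a number used to be
    text = re.sub(r'\s+', ' ', text).strip()
    # Harmonize case
    return text.lower()

for sample in samples:
    print(strip_numbers_and_case(sample))
# this is a sample text! it includes numbers like and urls like https://example.com.
# final example: punctuation, numbers (), and mixedcase text.
```

Note that deleting digits leaves the surrounding punctuation behind (the empty `()` in the second sentence), which is why punctuation is handled in a separate step.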

In [1]:
# Importing necessary libraries
import re
import string
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from autocorrect import Speller
import nltk

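# Download necessary NLTK data files
# (newer NLTK releases may additionally require nltk.download('punkt_tab')
# before word_tokenize works)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')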

# Sample data
data = {
    'Text': [
        'This is a sample text! It includes numbers like 123 and URLs like https://example.com.',
        'Another example text with UPPERCASE letters and misspelled wrds.',
        'Final example: punctuation, numbers (456), and mixedCASE text.'
    ]
}
df = pd.DataFrame(data)


In [2]:
# Display the original data
print("Original Data:\n", df)

Original Data:
                                                 Text
0  This is a sample text! It includes numbers lik...
1  Another example text with UPPERCASE letters an...
2  Final example: punctuation, numbers (456), and...


In [3]:
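# Corpus Cleaning

# Create the spell checker once instead of on every call (English by default)
spell = Speller()

# Function to clean corpus
def clean_text(text):
    # Remove numbers (every run of digits)
    text = re.sub(r'\d+', '', text)
    # Correct spelling word by word; splitting and re-joining also
    # collapses repeated whitespace
    text = ' '.join(spell(word) for word in text.split())
    # Harmonize case
    text = text.lower()
    return text

df['Cleaned_Text'] = df['Text'].apply(clean_text)
print("\nAfter Corpus Cleaning:\n", df)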





After Corpus Cleaning:
                                                 Text  \
0  This is a sample text! It includes numbers lik...   
1  Another example text with UPPERCASE letters an...   
2  Final example: punctuation, numbers (456), and...   

                                        Cleaned_Text  
0  this is a sample text! it includes numbers lik...  
1  another example text with uppercase letters an...  
2  final example: punctuation, numbers (), and mi...  


## 2. Removing Punctuation, Stopwords, Numbers, and URLs

Removing elements such as punctuation, stopwords, numbers, and URLs keeps the focus on the relevant parts of the text data. Here's a breakdown:

- **Punctuation**: Punctuation marks are often irrelevant for text analysis and can be removed to simplify the text.

- **Stopwords**: Stopwords are common words (e.g., "the", "is", "in") that carry little meaning on their own and can be removed to focus on the more informative words in the text. Stopword lists are language-specific; NLTK, for example, provides lists for English, French, Spanish, German, and other languages.

- **Numbers**: As mentioned earlier, numbers can add noise to the text data. In this notebook they were already removed during corpus cleaning (Section 1), so the code in this section does not remove them again.

- **URLs**: URLs rarely carry useful information for text analysis and can be removed to clean the text further.
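
The order in which these elements are removed can change the result. The sketch below illustrates this with a tiny hand-written stopword set and plain `str.split` tokenization; both are assumptions made for illustration (NLTK's real list is much longer, and `word_tokenize` splits contractions differently), and the two function names are hypothetical.

```python
import string

# Tiny illustrative stopword set, not NLTK's actual list
STOP_WORDS = {"the", "is", "in", "and", "it", "don't"}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def punctuation_then_stopwords(text):
    # Stripping punctuation first turns "Don't" into "Dont",
    # which no longer matches the stopword "don't"
    words = text.translate(PUNCT_TABLE).split()
    return [w for w in words if w.lower() not in STOP_WORDS]

def stopwords_then_punctuation(text):
    # Filtering first catches "Don't", but a stopword glued to
    # punctuation (e.g. "it,") would slip through instead
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return ' '.join(words).translate(PUNCT_TABLE).split()

sentence = "Don't worry, the garden is in bloom and it is lovely!"
print(punctuation_then_stopwords(sentence))
# ['Dont', 'worry', 'garden', 'bloom', 'lovely']
print(stopwords_then_punctuation(sentence))
# ['worry', 'garden', 'bloom', 'lovely']
```

Neither order is correct in every case, so it is worth inspecting a few processed rows before settling on one.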

In [4]:
# Removing Punctuation, Stopwords, Numbers, and URLs

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# Function to remove URLs, punctuation, and stopwords
# (numbers were already removed by clean_text above)
def preprocess_text(text):
    # Remove URLs first, while they are still intact tokens
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    word_tokens = word_tokenize(text)
    text = ' '.join(word for word in word_tokens if word.lower() not in stop_words)
    return text

df['Preprocessed_Text'] = df['Cleaned_Text'].apply(preprocess_text)
print("\nAfter Removing Punctuation, Stopwords, Numbers, and URLs:\n", df)



After Removing Punctuation, Stopwords, Numbers, and URLs:
                                                 Text  \
0  This is a sample text! It includes numbers lik...   
1  Another example text with UPPERCASE letters an...   
2  Final example: punctuation, numbers (456), and...   

                                        Cleaned_Text  \
0  this is a sample text! it includes numbers lik...   
1  another example text with uppercase letters an...   
2  final example: punctuation, numbers (), and mi...   

                                   Preprocessed_Text  
0         sample text includes numbers like us like   
1  another example text uppercase letters misspel...  
2   final example punctuation numbers mixedcase text  


## 3. Converting to Lowercase

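Converting all text to lowercase ensures uniformity and reduces the complexity of the text data. In this notebook the text was already lowercased during corpus cleaning (Section 1), so repeating the step here leaves the output unchanged; it is shown separately for pipelines that skip that earlier step. Lowercasing matters because of: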

- **Uniformity**: Treats words with different cases (e.g., "Apple" vs. "apple") as the same token, reducing redundancy.

- **Simplification**: Simplifies the text data, making it easier to work with in subsequent analysis steps.
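
A minimal sketch of the uniformity point, using a made-up token list and the standard library's `Counter`: before lowercasing, each casing variant is counted separately; afterwards the counts merge.

```python
from collections import Counter

# Hypothetical tokens with inconsistent casing
tokens = ["Apple", "apple", "APPLE", "pie", "Pie"]

# Each casing variant is a distinct key
counts_raw = Counter(tokens)

# Lowercasing merges the variants into one key per word
counts_lower = Counter(token.lower() for token in tokens)

print(len(counts_raw), len(counts_lower))  # 5 2
print(counts_lower["apple"], counts_lower["pie"])  # 3 2
```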

In [5]:
# Converting to Lowercase

# Function to convert text to lowercase
def to_lowercase(text):
    return text.lower()

df['Lowercased_Text'] = df['Preprocessed_Text'].apply(to_lowercase)
print("\nAfter Converting to Lowercase:\n", df)



After Converting to Lowercase:
                                                 Text  \
0  This is a sample text! It includes numbers lik...   
1  Another example text with UPPERCASE letters an...   
2  Final example: punctuation, numbers (456), and...   

                                        Cleaned_Text  \
0  this is a sample text! it includes numbers lik...   
1  another example text with uppercase letters an...   
2  final example: punctuation, numbers (), and mi...   

                                   Preprocessed_Text  \
0         sample text includes numbers like us like    
1  another example text uppercase letters misspel...   
2   final example punctuation numbers mixedcase text   

                                     Lowercased_Text  
0         sample text includes numbers like us like   
1  another example text uppercase letters misspel...  
2   final example punctuation numbers mixedcase text  
