### Using NLP for Text Data Quality
**Objective**: Enhance text data quality using NLP techniques.

**Task**: Handling Noisy Text Data

**Steps**:
1. Dataset: Obtain (or simulate) a dataset of customer reviews containing noise (e.g., random characters).
2. Clean Data: Use regex patterns to remove the noise from the text.
3. Evaluate: Compare the text before and after cleaning to check how much noise was removed.
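
The evaluation step can be made quantitative as well as visual. The sketch below is one possible approach, not part of the original exercise: `noise_ratio` is a hypothetical helper that assumes "noise" means any non-whitespace character that is not an ASCII letter. The two sample strings are taken from the notebook's own review data and printed output.

```python
import re

def noise_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are not ASCII letters.

    Assumption: anything other than A-Z / a-z counts as noise.
    """
    chars = re.sub(r"\s", "", text)  # ignore whitespace entirely
    if not chars:
        return 0.0
    noisy = len(re.findall(r"[^A-Za-z]", chars))
    return noisy / len(chars)

# One review from the dataset, before and after cleaning
before = "Co0l 123 product#@! very gooodddd..."
after = "col product very gooodddd"

print(f"before: {noise_ratio(before):.2f}, after: {noise_ratio(after):.2f}")
```

A ratio of zero after cleaning only says that no non-letter characters remain. It does not say the words are still intact, so it is best read alongside the side-by-side comparison.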

In [1]:
import pandas as pd
import re

# Step 1: Simulate a dataset with noisy customer reviews
def generate_noisy_reviews():
    return pd.DataFrame({
        "Customer_Review": [
            "Amaaaazing!!! ***** Great pr0duct :)",
            "Worst@@@ experience ever!!! D0n't buy###",
            "L0ved it...!! will def@initely buy again.",
            "Terrrrible serv1ce!!!! W0uldn't rec0mmend..",
            "S0-so product, quite av#erage. meh",
            "9999Stars*****!!!, absolutely w0rth it :)",
            "BAD!!! SH1TTY pr0duct... waste of $$$",
            "best.best.best.!!! <3 loved itttt!",
            "No issues... but n0t perfect ;)",
            "Co0l 123 product#@! very gooodddd..."
        ]
    })

# Step 2: Clean noisy text using regex and basic NLP preprocessing
def clean_text(text):
    """Lowercase the text, drop URLs, and keep only letters and single spaces."""
    try:
        text = text.lower()  # Normalize case
        text = re.sub(r"https?://\S+|www\.\S+", "", text)  # Remove URLs
        # Delete everything except a-z and whitespace. Digits are deleted too,
        # so leetspeak such as "pr0duct" becomes "prduct", not "product".
        text = re.sub(r"[^a-z\s]", "", text)
        text = re.sub(r"\s+", " ", text).strip()  # Normalize whitespace
        return text
    except Exception as e:
        raise ValueError(f"Text cleaning failed: {e}") from e

# Step 3: Apply cleaning to the entire dataset
def apply_text_cleaning(df):
    try:
        df = df.copy()  # Avoid mutating the caller's DataFrame
        df['Cleaned_Review'] = df['Customer_Review'].apply(clean_text)
        return df
    except Exception as e:
        raise ValueError(f"Failed to clean dataset: {e}") from e

# Step 4: Display comparison
def compare_reviews(df):
    print("\n--- Noisy vs Cleaned Reviews ---")
    for original, cleaned in zip(df['Customer_Review'], df['Cleaned_Review']):
        print(f"\nOriginal: {original}\nCleaned : {cleaned}")

# Main workflow
def main():
    print("Generating noisy customer reviews...")
    df = generate_noisy_reviews()

    print("Cleaning text data...")
    df_cleaned = apply_text_cleaning(df)

    compare_reviews(df_cleaned)

# Run script
main()


Generating noisy customer reviews...
Cleaning text data...

--- Noisy vs Cleaned Reviews ---

Original: Amaaaazing!!! ***** Great pr0duct :)
Cleaned : amaaaazing great prduct

Original: Worst@@@ experience ever!!! D0n't buy###
Cleaned : worst experience ever dnt buy

Original: L0ved it...!! will def@initely buy again.
Cleaned : lved it will definitely buy again

Original: Terrrrible serv1ce!!!! W0uldn't rec0mmend..
Cleaned : terrrrible servce wuldnt recmmend

Original: S0-so product, quite av#erage. meh
Cleaned : sso product quite average meh

Original: 9999Stars*****!!!, absolutely w0rth it :)
Cleaned : stars absolutely wrth it

Original: BAD!!! SH1TTY pr0duct... waste of $$$
Cleaned : bad shtty prduct waste of

Original: best.best.best.!!! <3 loved itttt!
Cleaned : bestbestbest loved itttt

Original: No issues... but n0t perfect ;)
Cleaned : no issues but nt perfect

Original: Co0l 123 product#@! very gooodddd...
Cleaned : col product very gooodddd
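
The output above shows two side effects of deleting every non-letter character. Digits standing in for letters vanish ("pr0duct" becomes "prduct"), and punctuation between words glues them together ("best.best.best." becomes "bestbestbest", "S0-so" becomes "sso"). One possible refinement is sketched below. `clean_text_v2` and `LEET_MAP` are hypothetical names that do not appear in the original notebook, and the digit-to-letter mapping is an illustrative assumption, not a standard.

```python
import re

# Assumed mapping of digit look-alikes to letters; adjust for your data.
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"})

def clean_text_v2(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    # Translate a digit only when it touches a letter (e.g. "pr0duct"),
    # so standalone numbers such as "123" are left to be removed later.
    text = re.sub(
        r"(?<=[a-z])\d|\d(?=[a-z])",
        lambda m: m.group(0).translate(LEET_MAP),
        text,
    )
    # Drop stray symbols wedged between two letters ("def@initely", "don't").
    text = re.sub(r"(?<=[a-z])[@#*'](?=[a-z])", "", text)
    # Replace every remaining non-letter with a space so words stay separated.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Heuristic: shorten runs of 3+ identical letters to 2 ("terrrrible").
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)
    return re.sub(r"\s+", " ", text).strip()

for review in [
    "S0-so product, quite av#erage. meh",
    "Worst@@@ experience ever!!! D0n't buy###",
    "best.best.best.!!! <3 loved itttt!",
]:
    print(f"Original: {review}\nCleaned : {clean_text_v2(review)}\n")
```

These rules are heuristics with trade-offs. The digit rule would also rewrite legitimate tokens such as "mp3", and shortening repeated letters leaves forms like "itt" or "goodd" that still need a dictionary or spell-checker to resolve.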
