### Using NLP for Text Data Quality
**Objective**: Enhance text data quality using NLP techniques.

**Task**: Removing Stopwords

**Steps**:
1. Data Set: Use a dataset of text product descriptions.
2. Stopword Removal: Use an NLP library (e.g., NLTK) to remove stopwords from the descriptions.
3. Assess Impact: Compare word frequencies before and after removal to gauge the effect.
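
The three steps above can be sketched without any external dependency. This is only a minimal illustration, not the NLTK-based approach itself: it assumes a tiny hand-picked stopword set (`SAMPLE_STOPWORDS`, a hypothetical name, far smaller than NLTK's English list) and a simple regex tokenizer (`tokenize`, also hypothetical) standing in for `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword set for illustration only;
# NLTK's English list is much larger.
SAMPLE_STOPWORDS = {"this", "is", "a", "the", "and", "with", "for", "of", "are", "your"}

def tokenize(text):
    """Lowercase the text and extract word-like tokens (hyphens and apostrophes allowed inside)."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

def remove_stopwords_simple(text, stop_words=SAMPLE_STOPWORDS):
    """Return the text with stopwords and punctuation dropped."""
    return " ".join(tok for tok in tokenize(text) if tok not in stop_words)

# Step 1: one sample product description
description = "This is a fantastic new smartphone with a large screen and amazing camera features."

# Step 2: remove stopwords
cleaned = remove_stopwords_simple(description)

# Step 3: compare word frequencies before and after removal
before = Counter(tokenize(description))
after = Counter(tokenize(cleaned))

print(cleaned)
print(before.most_common(3))
print(after.most_common(3))
```

Because the regex keeps only word-like tokens, punctuation never reaches the frequency counts. A real stopword list and tokenizer, such as those in NLTK, would cover many more cases than this sketch.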

In [1]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import nltk

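# Download the required NLTK resources if they are not already present
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    word_tokenize("example")
except LookupError:
    nltk.download('punkt')
    nltk.download('punkt_tab')  # tokenizer data expected by newer NLTK releases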

# Step 1: Use a dataset of text product descriptions.
# Generate sample product descriptions
data = {
    'ProductDescription': [
        "This is a fantastic new smartphone with a large screen and amazing camera features.",
        "The comfortable and stylish running shoes are perfect for your daily workout routine.",
        "A high-quality leather wallet that is both durable and has plenty of space for cards and cash.",
        "This set of colorful and vibrant art markers is ideal for artists of all skill levels.",
        "The lightweight and portable Bluetooth speaker delivers powerful sound for any occasion."
    ]
}
df = pd.DataFrame(data)

print("Original DataFrame with Product Descriptions:")
print(df)

# Step 2: Stopword Removal: Utilize an NLP library (e.g., NLTK) to remove stopwords from the descriptions.
stop_words = set(stopwords.words('english'))

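def remove_stopwords(text):
    """Lowercase and tokenize the text, then drop stopwords and punctuation-only tokens."""
    word_tokens = word_tokenize(text.lower())
    filtered_words = [
        word for word in word_tokens
        if word not in stop_words and any(ch.isalnum() for ch in word)
    ]
    return " ".join(filtered_words)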

df['Description_NoStopwords'] = df['ProductDescription'].apply(remove_stopwords)

print("\nDataFrame after Stopword Removal:")
print(df)

# Step 3: Assess Impact: Examine the effectiveness by analyzing word frequency before and after removal.

# Analyze word frequency before stopword removal
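all_words_original = []
for desc in df['ProductDescription']:
    word_tokens = word_tokenize(desc.lower())
    # Skip punctuation-only tokens so the counts reflect words
    all_words_original.extend(tok for tok in word_tokens if any(ch.isalnum() for ch in tok))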

word_frequency_original = Counter(all_words_original)
print("\nTop 20 Most Frequent Words (Original):")
print(word_frequency_original.most_common(20))

# Analyze word frequency after stopword removal
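all_words_nostopwords = []
for desc in df['Description_NoStopwords']:
    word_tokens = word_tokenize(desc)  # text was already lowercased during stopword removal
    # Skip punctuation-only tokens so the counts reflect words
    all_words_nostopwords.extend(tok for tok in word_tokens if any(ch.isalnum() for ch in tok))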

word_frequency_nostopwords = Counter(all_words_nostopwords)
print("\nTop 20 Most Frequent Words (After Stopword Removal):")
print(word_frequency_nostopwords.most_common(20))

print("\nObservations:")
print("Common English words such as 'and', 'for', 'a', 'is', and 'of' dominate the original word counts, while others such as 'with', 'that', 'are', 'your', and 'both' add little meaning.")
print("After removing stopwords, the remaining words are content-related, such as 'smartphone', 'screen', 'camera', 'features', 'comfortable', 'running', 'shoes', 'workout', etc.")
print("This demonstrates how stopword removal helps focus the analysis on the more informative words in the text.")


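**Note**: Running this cell without NLTK installed fails with `ModuleNotFoundError: No module named 'nltk'`. Install the library first (for example, `pip install nltk`).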
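
Beyond comparing the most-frequent lists, the impact of stopword removal can be summarized as the share of tokens removed. The following is a rough sketch under stated assumptions: `removal_summary` is a hypothetical helper, and the token list and stopword set below are small made-up inputs rather than NLTK output.

```python
from collections import Counter

def removal_summary(tokens, stop_words):
    """Summarize how many tokens a stopword filter would remove."""
    kept = [tok for tok in tokens if tok not in stop_words]
    removed = Counter(tok for tok in tokens if tok in stop_words)
    # Guard against an empty token list to avoid division by zero
    share_removed = 1 - len(kept) / len(tokens) if tokens else 0.0
    return {
        "total": len(tokens),
        "kept": len(kept),
        "share_removed": share_removed,
        "removed_counts": removed,
    }

# Hypothetical inputs for illustration
tokens = ["the", "lightweight", "and", "portable", "bluetooth", "speaker",
          "delivers", "powerful", "sound", "for", "any", "occasion"]
sample_stop_words = {"the", "and", "for", "any"}

summary = removal_summary(tokens, sample_stop_words)
print(summary["total"], summary["kept"], round(summary["share_removed"], 2))
```

In the notebook above, the same helper could be called with `all_words_original` and `stop_words` to get one figure for the whole dataset.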