# Text Classification in Natural Language Processing

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/nlp-learning-journey/blob/main/examples/text-classification.ipynb)

## Overview

Text classification is the task of assigning predefined categories or labels to text documents. It is one of the most common NLP applications, with uses that include spam detection, sentiment analysis, and topic categorization.

## What You'll Learn

- Text preprocessing for classification
- Feature extraction techniques
- Traditional ML algorithms
- Deep learning approaches
- Transformer-based classification
- Evaluation metrics
- Real-world applications

## Prerequisites

You should have a basic understanding of Python, machine learning, and NLP preprocessing.

In [None]:
# Environment Detection and Setup
import sys
import subprocess

# Detect the runtime environment
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules
IS_LOCAL = not (IS_COLAB or IS_KAGGLE)

print("Environment detected:")
print(f"  - Local: {IS_LOCAL}")
print(f"  - Google Colab: {IS_COLAB}")
print(f"  - Kaggle: {IS_KAGGLE}")

# Platform-specific system setup
if IS_COLAB:
    print("\nSetting up Google Colab environment...")
    !apt update -qq
    !apt install -y -qq libpq-dev
elif IS_KAGGLE:
    print("\nSetting up Kaggle environment...")
    # Kaggle usually has most packages pre-installed
else:
    print("\nSetting up local environment...")

# Install required packages for this notebook
required_packages = [
    "scikit-learn",
    "pandas",
    "matplotlib",
    "seaborn",
    "transformers",
    "torch",
    "nltk",
    "wordcloud"
]

print("\nInstalling required packages...")
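for package in required_packages:
    if IS_COLAB or IS_KAGGLE:
        !pip install -q {package}
        print(f"✓ {package}")
    else:
        # Check the exit code so a failed install is not reported as a success
        result = subprocess.run([sys.executable, "-m", "pip", "install", "-q", package],
                                capture_output=True, text=True)
        if result.returncode == 0:
            print(f"✓ {package}")
        else:
            print(f"⚠️  Failed to install {package}: {result.stderr.strip()}")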

print("\n🎉 Environment setup complete!")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re

# NLP and ML libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

import nltk

# Download NLTK data with error handling
nltk_datasets = ['punkt', 'punkt_tab', 'stopwords']
print("Downloading NLTK datasets...")
for dataset in nltk_datasets:
    try:
        nltk.download(dataset, quiet=True)
        print(f"✓ {dataset}")
    except Exception as e:
        print(f"⚠️  Failed to download {dataset}: {e}")

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

plt.style.use('default')
sns.set_palette("husl")

# Note: Transformers models will be loaded when needed due to potential network requirements

## Sample Dataset Creation

Let's create a sample dataset for text classification.

In [None]:
def create_sample_dataset():
    """Create a sample text classification dataset"""
    
    # Technology articles
    tech_texts = [
        "Artificial intelligence is revolutionizing the tech industry with machine learning algorithms.",
        "New smartphone features include advanced camera technology and faster processors.",
        "Cloud computing services are becoming more popular among businesses worldwide.",
        "Cybersecurity threats are increasing as more companies move to digital platforms.",
        "Software development practices are evolving with new programming languages and frameworks.",
        "Data science and analytics are driving business intelligence and decision making.",
        "Virtual reality and augmented reality technologies are creating new user experiences.",
        "Internet of Things devices are connecting everyday objects to the digital world."
    ]
    
    # Sports articles
    sports_texts = [
        "The football team won the championship after a thrilling final match.",
        "Basketball players are training hard for the upcoming tournament season.",
        "Olympic athletes are preparing for the games with intensive workout routines.",
        "Soccer fans are excited about the world cup matches this summer.",
        "Tennis players compete in grand slam tournaments around the world.",
        "Swimming records were broken during the international competition last week.",
        "Baseball season starts with teams showing strong performance in training.",
        "Golf championship attracts professional players from different countries."
    ]
    
    # Health articles
    health_texts = [
        "Regular exercise and healthy diet are essential for maintaining good health.",
        "Medical research shows the benefits of preventive care and early diagnosis.",
        "Mental health awareness is increasing with better access to therapy and counseling.",
        "Vaccination programs are crucial for preventing infectious disease outbreaks.",
        "Nutrition experts recommend balanced meals with fruits and vegetables.",
        "Healthcare systems are adapting to serve aging populations better.",
        "Medical technology advances are improving patient care and treatment outcomes.",
        "Public health initiatives focus on community wellness and disease prevention."
    ]
    
    # Business articles
    business_texts = [
        "Stock market analysis shows positive trends in technology sector investments.",
        "Corporate earnings reports indicate strong financial performance this quarter.",
        "Business strategies are adapting to changing market conditions and consumer behavior.",
        "Startup companies are securing funding through venture capital and investor networks.",
        "International trade agreements are affecting global supply chain operations.",
        "Marketing campaigns are leveraging social media platforms for brand awareness.",
        "Economic indicators suggest steady growth in manufacturing and services sectors.",
        "Business leadership focuses on innovation and sustainable development practices."
    ]
    
    # Create dataset
    data = []
    
    for text in tech_texts:
        data.append({'text': text, 'category': 'Technology'})
    for text in sports_texts:
        data.append({'text': text, 'category': 'Sports'})
    for text in health_texts:
        data.append({'text': text, 'category': 'Health'})
    for text in business_texts:
        data.append({'text': text, 'category': 'Business'})
    
    return pd.DataFrame(data)

# Create and explore dataset
df = create_sample_dataset()
print(f"Dataset shape: {df.shape}")
print("\nCategory distribution:")
print(df['category'].value_counts())

# Display sample texts
print("\nSample texts:")
for category in df['category'].unique():
    sample_text = df[df['category'] == category]['text'].iloc[0]
    print(f"{category}: {sample_text}")

## Text Preprocessing for Classification

In [None]:
def preprocess_text(text, remove_stopwords=True, lowercase=True):
    """Preprocess text for classification"""
    # Convert to lowercase
    if lowercase:
        text = text.lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        words = word_tokenize(text)
        words = [word for word in words if word not in stop_words]
        text = ' '.join(words)
    
    return text

# Apply preprocessing
df['text_processed'] = df['text'].apply(preprocess_text)

print("Preprocessing comparison:")
for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"Original: {df['text'].iloc[i]}")
    print(f"Processed: {df['text_processed'].iloc[i]}")

## Feature Extraction Techniques

In [None]:
# Prepare data
X = df['text_processed']
y = df['category']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 1. Bag of Words (Count Vectorizer)
count_vectorizer = CountVectorizer(max_features=1000)
X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

# 2. TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("Feature extraction results:")
print(f"Count Vectorizer - Train shape: {X_train_counts.shape}, Test shape: {X_test_counts.shape}")
print(f"TF-IDF Vectorizer - Train shape: {X_train_tfidf.shape}, Test shape: {X_test_tfidf.shape}")

# Inspect a sample of the extracted feature names
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"\nSample features: {list(feature_names[:10])}")

# Visualize feature distributions
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
counts_sum = np.array(X_train_counts.sum(axis=0)).flatten()
plt.hist(counts_sum, bins=30, alpha=0.7)
plt.title('Count Vectorizer Feature Distribution')
plt.xlabel('Feature Count')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
tfidf_sum = np.array(X_train_tfidf.sum(axis=0)).flatten()
plt.hist(tfidf_sum, bins=30, alpha=0.7)
plt.title('TF-IDF Feature Distribution')
plt.xlabel('TF-IDF Score Sum')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
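
To make the TF-IDF weighting used above more concrete, here is a minimal sketch that computes it by hand for a tiny made-up corpus. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The corpus and the names `doc_freq`, `idf`, and `tfidf_vector` are illustrative and not part of the notebook.

```python
import math
from collections import Counter

# Tiny illustrative corpus (hypothetical, not the notebook's dataset)
docs = [
    "football team wins match",
    "team trains for match",
    "cloud software update",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    term_counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in fewer documents (such as "football") receive a larger weight than terms shared across documents (such as "team"), which is the effect TF-IDF is designed to produce.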

## Traditional Machine Learning Approaches

In [None]:
# Define classifiers
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM': SVC(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100)
}

# Train and evaluate classifiers
results = {}

print("Traditional ML Classification Results:")
print("=" * 60)

for name, classifier in classifiers.items():
    # Train on TF-IDF features
    classifier.fit(X_train_tfidf, y_train)
    
    # Predictions
    y_pred = classifier.predict(X_test_tfidf)
    
    # Accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Cross-validation score
    cv_scores = cross_val_score(classifier, X_train_tfidf, y_train, cv=3)
    
    results[name] = {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'predictions': y_pred
    }
    
    print(f"\n{name}:")
    print(f"  Test Accuracy: {accuracy:.3f}")
    print(f"  CV Score: {cv_scores.mean():.3f} (±{cv_scores.std():.3f})")

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Accuracy comparison
methods = list(results.keys())
accuracies = [results[method]['accuracy'] for method in methods]
cv_means = [results[method]['cv_mean'] for method in methods]

x = np.arange(len(methods))
width = 0.35

axes[0].bar(x - width/2, accuracies, width, label='Test Accuracy', alpha=0.7)
axes[0].bar(x + width/2, cv_means, width, label='CV Mean', alpha=0.7)
axes[0].set_xlabel('Classifier')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Classifier Performance Comparison')
axes[0].set_xticks(x)
axes[0].set_xticklabels(methods, rotation=45)
axes[0].legend()

# Confusion matrix for best performer
best_classifier = max(results.keys(), key=lambda k: results[k]['accuracy'])
cm = confusion_matrix(y_test, results[best_classifier]['predictions'])

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=classifiers[best_classifier].classes_,
            yticklabels=classifiers[best_classifier].classes_,
            ax=axes[1])
axes[1].set_title(f'Confusion Matrix - {best_classifier}')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

# Detailed classification report for best classifier
print(f"\nDetailed Classification Report - {best_classifier}:")
print(classification_report(y_test, results[best_classifier]['predictions']))
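
The cross-validation above scores each classifier on features from a vectorizer that was already fitted on the whole training split. One common alternative, sketched below under simplifying assumptions, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary and IDF weights are re-learned inside every fold. The eight-sentence corpus and the name `text_clf` are made up for illustration, and the fold count of 2 is chosen only because the toy data is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (hypothetical, not the notebook's dataset)
texts = [
    "team wins football match",
    "players train for tournament",
    "athletes break swimming record",
    "fans cheer soccer team",
    "new smartphone camera released",
    "cloud software for business data",
    "artificial intelligence algorithms improve",
    "cybersecurity software update released",
]
labels = ["Sports"] * 4 + ["Technology"] * 4

# The vectorizer is fitted only on the training part of each fold
text_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(text_clf, texts, labels, cv=cv)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Because the pipeline takes raw strings, the same object can later be fitted once on all training text and used to call `predict` on new documents without a separate vectorization step.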

## Transformer-Based Classification

In [None]:
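# Initialize zero-shot classification pipeline
zero_shot_classifier = None
try:
    # Import here so a missing or offline transformers install degrades gracefully
    from transformers import pipeline
    zero_shot_classifier = pipeline("zero-shot-classification",
                                    model="facebook/bart-large-mnli")
    print("Loaded BART model for zero-shot classification")
except Exception as e:
    print(f"Could not load facebook/bart-large-mnli: {e}")
    try:
        from transformers import pipeline
        zero_shot_classifier = pipeline("zero-shot-classification")
        print("Loaded default zero-shot classification model")
    except Exception as fallback_error:
        print(f"Could not load transformer classification model: {fallback_error}")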

def transformer_classification(texts, candidate_labels):
    """Classify texts using transformer model"""
    if zero_shot_classifier is None:
        return [{'labels': ['N/A'], 'scores': [0.0]} for _ in texts]
    
    results = []
    for text in texts:
        try:
            result = zero_shot_classifier(text, candidate_labels)
            results.append(result)
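        except Exception as e:
            # Keep going on per-text failures, but report what went wrong
            print(f"Classification failed: {e}")
            results.append({'labels': ['Error'], 'scores': [0.0]})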
    
    return results

if zero_shot_classifier:
    # Test transformer classification
    candidate_labels = ['Technology', 'Sports', 'Health', 'Business']
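    # Use the original sentences: the transformer expects natural text,
    # not the lowercased, stopword-stripped version used for TF-IDF
    test_texts = df.loc[X_test.index, 'text'].tolist()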
    
    print("\nTransformer-Based Classification:")
    print("=" * 50)
    
    # Classify a few test examples
    sample_results = transformer_classification(test_texts[:5], candidate_labels)
    
    transformer_predictions = []
    transformer_confidences = []
    
    for i, (text, result) in enumerate(zip(test_texts[:5], sample_results)):
        predicted_label = result['labels'][0]
        confidence = result['scores'][0]
        actual_label = y_test.iloc[i]
        
        transformer_predictions.append(predicted_label)
        transformer_confidences.append(confidence)
        
        print(f"\nText: {text[:100]}...")
        print(f"Actual: {actual_label}")
        print(f"Predicted: {predicted_label} (confidence: {confidence:.3f})")
        print(f"All scores: {dict(zip(result['labels'], result['scores']))}")
    
    # Calculate accuracy for the sample
    sample_accuracy = sum(1 for pred, actual in zip(transformer_predictions, y_test.iloc[:5]) 
                         if pred == actual) / len(transformer_predictions)
    
    print(f"\nTransformer Sample Accuracy: {sample_accuracy:.3f}")
    print(f"Average Confidence: {np.mean(transformer_confidences):.3f}")
else:
    print("Transformer classification not available")

## Feature Importance Analysis

In [None]:
# Analyze feature importance using the best traditional classifier
best_model = classifiers[best_classifier]

def get_top_features(classifier, vectorizer, class_labels, top_n=5):
    """Get top features for each class"""
    feature_names = vectorizer.get_feature_names_out()
    
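    if hasattr(classifier, 'coef_'):
        # For linear models
        coef = classifier.coef_
    elif hasattr(classifier, 'feature_log_prob_'):
        # For Naive Bayes models: per-class log-probability of each feature
        coef = classifier.feature_log_prob_
    elif hasattr(classifier, 'feature_importances_'):
        # For tree-based models
        coef = classifier.feature_importances_.reshape(1, -1)
    else:
        # No per-feature scores available (e.g. a non-linear SVM)
        return {}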
    
    top_features = {}
    
    if len(coef.shape) > 1 and coef.shape[0] > 1:
        # Multi-class
        for i, class_label in enumerate(class_labels):
            if i < coef.shape[0]:
                top_indices = coef[i].argsort()[-top_n:][::-1]
                top_features[class_label] = [
                    (feature_names[idx], coef[i][idx]) 
                    for idx in top_indices
                ]
    else:
        # Binary or single feature importance
        feature_scores = coef.flatten()
        top_indices = feature_scores.argsort()[-top_n:][::-1]
        top_features['overall'] = [
            (feature_names[idx], feature_scores[idx]) 
            for idx in top_indices
        ]
    
    return top_features

# Get top features
class_labels = best_model.classes_
top_features = get_top_features(best_model, tfidf_vectorizer, class_labels)

print(f"Top Features Analysis - {best_classifier}:")
print("=" * 50)

for class_label, features in top_features.items():
    print(f"\n{class_label}:")
    for feature, score in features:
        print(f"  {feature}: {score:.3f}")

# Visualize feature importance
if len(top_features) > 1:
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    axes = axes.ravel()
    
    for i, (class_label, features) in enumerate(top_features.items()):
        if i < len(axes):
            words = [f[0] for f in features]
            scores = [f[1] for f in features]
            
            axes[i].barh(words, scores)
            axes[i].set_title(f'Top Features - {class_label}')
            axes[i].set_xlabel('Feature Importance')
    
    plt.tight_layout()
    plt.show()

## Error Analysis and Model Interpretation

In [None]:
# Analyze misclassified examples
best_predictions = results[best_classifier]['predictions']
misclassified_indices = [i for i, (true, pred) in enumerate(zip(y_test, best_predictions)) 
                        if true != pred]

print(f"Error Analysis - {best_classifier}:")
print("=" * 50)
print(f"Total misclassified: {len(misclassified_indices)} out of {len(y_test)}")
print(f"Error rate: {len(misclassified_indices)/len(y_test):.3f}")

if misclassified_indices:
    print("\nMisclassified Examples:")
    for i in misclassified_indices[:3]:  # Show first 3 errors
        text_idx = y_test.index[i]
        original_text = df.loc[text_idx, 'text']
        true_label = y_test.iloc[i]
        pred_label = best_predictions[i]
        
        print(f"\nText: {original_text}")
        print(f"True Label: {true_label}")
        print(f"Predicted Label: {pred_label}")

# Prediction confidence analysis
if hasattr(best_model, 'predict_proba'):
    prediction_probabilities = best_model.predict_proba(X_test_tfidf)
    max_probabilities = np.max(prediction_probabilities, axis=1)
    
    # Analyze confidence distribution
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.hist(max_probabilities, bins=20, alpha=0.7)
    plt.xlabel('Maximum Prediction Probability')
    plt.ylabel('Frequency')
    plt.title('Prediction Confidence Distribution')
    
    # Confidence vs accuracy
    plt.subplot(1, 2, 2)
    correct_predictions = (best_predictions == y_test).astype(int)
    
    # Bin by confidence
    confidence_bins = np.linspace(0, 1, 11)
    bin_centers = (confidence_bins[:-1] + confidence_bins[1:]) / 2
    bin_accuracies = []
    
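    n_bins = len(confidence_bins) - 1
    for i in range(n_bins):
        lower, upper = confidence_bins[i], confidence_bins[i+1]
        if i == n_bins - 1:
            # Include probabilities of exactly 1.0 in the last bin
            mask = (max_probabilities >= lower) & (max_probabilities <= upper)
        else:
            mask = (max_probabilities >= lower) & (max_probabilities < upper)
        if np.sum(mask) > 0:
            bin_accuracy = np.mean(correct_predictions[mask])
        else:
            # Empty bin: use NaN so it is not plotted as 0% accuracy
            bin_accuracy = np.nan
        bin_accuracies.append(bin_accuracy)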
    
    plt.plot(bin_centers, bin_accuracies, 'o-')
    plt.xlabel('Prediction Confidence')
    plt.ylabel('Accuracy')
    plt.title('Confidence vs Accuracy')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nAverage prediction confidence: {np.mean(max_probabilities):.3f}")
    print(f"Minimum confidence: {np.min(max_probabilities):.3f}")
    print(f"Maximum confidence: {np.max(max_probabilities):.3f}")

## Real-World Applications

In [None]:
# Application 1: Email Classification
def email_classifier_demo():
    """Demo email classification system"""
    email_texts = [
        "Congratulations! You've won $1 million. Click here to claim your prize now!",
        "Meeting scheduled for tomorrow at 2 PM in conference room A. Please confirm attendance.",
        "Your account has been compromised. Verify your identity immediately by clicking this link.",
        "Weekly team report attached. Please review and provide feedback by Friday.",
        "Special offer: 50% off all products! Limited time only. Shop now and save big!"
    ]
    
    email_categories = ['spam', 'work', 'phishing', 'work', 'promotional']
    
    return email_texts, email_categories

# Application 2: Customer Feedback Classification
def feedback_classifier_demo():
    """Demo customer feedback classification"""
    feedback_texts = [
        "The product quality is excellent and delivery was fast. Highly recommend!",
        "Terrible customer service. My issue was not resolved after multiple calls.",
        "Good value for money. Product works as expected but could be improved.",
        "Outstanding experience! Will definitely purchase again. Five stars!",
        "Product broke after one week. Very disappointed with the quality."
    ]
    
    sentiment_categories = ['positive', 'negative', 'neutral', 'positive', 'negative']
    
    return feedback_texts, sentiment_categories

# Application 3: News Category Classification
def classify_new_documents(texts, classifier, vectorizer):
    """Classify new documents using trained model"""
    # Preprocess texts
    processed_texts = [preprocess_text(text) for text in texts]
    
    # Vectorize
    text_vectors = vectorizer.transform(processed_texts)
    
    # Predict
    predictions = classifier.predict(text_vectors)
    
    if hasattr(classifier, 'predict_proba'):
        probabilities = classifier.predict_proba(text_vectors)
        confidences = np.max(probabilities, axis=1)
    else:
        # No probability estimates available; report NaN rather than a made-up 1.0
        confidences = np.full(len(predictions), np.nan)
    
    return predictions, confidences

# Test applications
print("Real-World Application Demos:")
print("=" * 50)

# Demo 1: News classification with our trained model
new_articles = [
    "Scientists develop new vaccine that shows promising results in clinical trials.",
    "Cryptocurrency prices surge as institutional investors increase their holdings.",
    "Olympic games feature exciting competitions with athletes breaking world records.",
    "New smartphone app uses artificial intelligence to improve user productivity."
]

predictions, confidences = classify_new_documents(new_articles, best_model, tfidf_vectorizer)

print("1. News Article Classification:")
for text, pred, conf in zip(new_articles, predictions, confidences):
    print(f"\nText: {text}")
    print(f"Predicted Category: {pred} (confidence: {conf:.3f})")

# Demo 2: Email classification (zero-shot with transformer)
if zero_shot_classifier:
    email_texts, _ = email_classifier_demo()
    email_labels = ['spam', 'work', 'personal', 'promotional']
    
    print("\n2. Email Classification (Transformer):")
    email_results = transformer_classification(email_texts[:3], email_labels)
    
    for text, result in zip(email_texts[:3], email_results):
        print(f"\nEmail: {text[:60]}...")
        print(f"Classification: {result['labels'][0]} (score: {result['scores'][0]:.3f})")

# Performance summary
print("\n3. Model Performance Summary:")
print(f"Best Traditional Model: {best_classifier}")
print(f"Test Accuracy: {results[best_classifier]['accuracy']:.3f}")
print(f"Number of Features: {X_train_tfidf.shape[1]}")
print(f"Training Examples: {X_train_tfidf.shape[0]}")
print(f"Test Examples: {X_test_tfidf.shape[0]}")

## Exercises

1. **Multi-label Classification**: Modify the approach for documents that can belong to multiple categories
2. **Imbalanced Dataset Handling**: Implement techniques for dealing with uneven class distributions
3. **Feature Engineering**: Create custom features like text length, readability scores, etc.
4. **Model Ensemble**: Combine multiple classifiers for improved performance

## Key Takeaways

- **Preprocessing matters**: Proper text cleaning and normalization significantly impact performance
- **Feature extraction choices**: TF-IDF often outperforms simple bag-of-words for classification
- **Model selection**: Different algorithms work better for different types of text data
- **Evaluation beyond accuracy**: Consider precision, recall, and F1-score for each class
- **Transformer models**: Provide excellent performance but require more computational resources
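
As a rough illustration of the "evaluation beyond accuracy" point, the sketch below computes per-class precision, recall, and F1 by hand. The label lists and the helper `per_class_metrics` are made up for this example; in practice `classification_report` from scikit-learn, already used in this notebook, reports the same quantities.

```python
from collections import Counter

# Hypothetical true and predicted labels for six documents
actual_labels = ["Sports", "Sports", "Health", "Health", "Business", "Technology"]
predicted_labels = ["Sports", "Health", "Health", "Health", "Technology", "Technology"]

def per_class_metrics(actual, predicted):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[a] += 1  # actual class gains a false negative

    metrics = {}
    for label in sorted(set(actual) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        actual_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / actual_count if actual_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

metrics = per_class_metrics(actual_labels, predicted_labels)
accuracy = sum(a == p for a, p in zip(actual_labels, predicted_labels)) / len(actual_labels)

print(f"Accuracy: {accuracy:.3f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label:12s} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the overall accuracy is 4 out of 6, yet the Business class has zero recall, which a single accuracy number would hide.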

## Best Practices

1. **Start simple**: Begin with basic features and traditional algorithms
2. **Cross-validation**: Always use cross-validation to get reliable performance estimates
3. **Error analysis**: Examine misclassified examples to understand model limitations
4. **Feature engineering**: Domain-specific features can significantly improve performance
5. **Class imbalance**: Handle uneven class distributions with appropriate techniques
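
One simple technique for the class imbalance point is to weight classes inversely to their frequency. The sketch below assumes the heuristic that scikit-learn documents for `class_weight="balanced"`, namely n_samples / (n_classes * class_count). The label list is made up for illustration.

```python
from collections import Counter

# Hypothetical imbalanced label set: 8 legitimate emails, 2 spam
labels = ["ham"] * 8 + ["spam"] * 2

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Rare classes receive proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: count={counts[label]}, weight={weight:.3f}")
```

With scikit-learn estimators such as `LogisticRegression`, `SVC`, or `RandomForestClassifier`, the same weighting can be requested by passing `class_weight="balanced"` instead of computing the values manually.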

## Applications

- **Spam detection**: Classify emails as spam or legitimate
- **Sentiment analysis**: Determine emotional tone of text
- **Topic categorization**: Automatically organize documents by subject
- **Intent classification**: Understand user intentions in chatbots
- **Content moderation**: Identify inappropriate or harmful content

## Next Steps

- Learn about deep learning approaches (LSTM, CNN for text)
- Explore advanced transformer architectures (BERT, RoBERTa)
- Study multi-label and hierarchical classification
- Practice with real-world datasets from your domain
- Learn about active learning and few-shot classification

## Resources

- [Scikit-learn Text Classification](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
- [Hugging Face Transformers](https://huggingface.co/transformers/)
- [BERT Paper](https://arxiv.org/abs/1810.04805)
- [Text Classification Datasets](https://www.tensorflow.org/datasets/catalog/overview#text)
- [GLUE Benchmark](https://gluebenchmark.com/)