# Text Classification

Common use cases include:
1. Spam detection
2. Topic labeling
3. Sentiment analysis

### When to use each model

|Method|Best For|Complexity|Data Requirement|
|---|---|:---:|:---:|
|Rule-Based|Simple keyword-based tasks|Low|None|
|Naïve Bayes|Spam detection, sentiment analysis|Medium|Small dataset|
|SVM/Logistic Regression|Topic classification|Medium|Medium-sized dataset|
|Deep Learning (BERT, LSTMs)|Large-scale, complex text tasks|High|Large labeled dataset|


## 1. Rule-based method

Pros:
1. Simple
2. Interpretable

Cons:
1. Rules need manual updates as spam wording changes
2. Generalizes poorly to text the rules did not anticipate

```python
def classify_email(email):
    """Label an email as spam if it contains any known spam keyword."""
    email = email.lower()
    spam_keywords = ["free money", "lottery", "prize", "win", "jackpot"]
    for keyword in spam_keywords:
        # Plain substring match: "win" would also match "window" or "winter"
        if keyword in email:
            return "spam"
    return "not spam"


sample_emails = [
    "Get your free money!",
    "Obtain huge jackpot!",
    "Guaranteed to win 50x",
    "Please resubmit your form",
    "You have won a prize",
    "You have been selected",
]

print("Classified email as:", list(map(classify_email, sample_emails)))
# Output: ['spam', 'spam', 'spam', 'not spam', 'spam', 'not spam']
```

Classified email as: ['spam', 'spam', 'spam', 'not spam', 'spam', 'not spam']
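
The rule-based classifier above matches keywords as plain substrings, so a keyword such as "win" would also fire on "window". The following is a minimal sketch of one possible refinement, assuming whole-word matching is what is wanted; the name `classify_email_strict` and the second test sentence are illustrative and not part of the original notebook.

```python
import re

# Same keyword list as the rule-based example; only the matching rule changes.
SPAM_KEYWORDS = ["free money", "lottery", "prize", "win", "jackpot"]

# \b anchors each keyword to word boundaries, so "win" does not match "window".
SPAM_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SPAM_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def classify_email_strict(email):
    """Return "spam" if any keyword appears as a whole word or phrase."""
    return "spam" if SPAM_PATTERN.search(email) else "not spam"


print(classify_email_strict("Guaranteed to win 50x"))    # spam
print(classify_email_strict("Please close the window"))  # not spam
```

Whole-word matching reduces false positives but still shares the main weakness of rule-based methods: every new spam phrase needs a new rule.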


## 2. ML-based method
Common algorithms:
1. Naïve Bayes: A fast probabilistic baseline; good for spam filtering and topic classification.
2. Support Vector Machines (SVM): Works well with small datasets and high-dimensional features.
3. Logistic Regression: Simple and effective for binary classification.
4. Random Forest: Combines many decision trees to improve accuracy over a single tree.
5. Deep Learning (LSTMs, CNNs, Transformers): Best for complex text data.

### Algorithm differences

Model|How It Works|Pros|Cons|Best Use Cases
---|---|---|---|---
Naïve Bayes (NB)|Uses Bayes' theorem to compute the probability that a text belongs to each class|Fast, works well with small datasets, handles noisy data|Assumes word independence, doesn’t capture relationships between words|Spam filtering, sentiment analysis, simple text classification
Support Vector Machines (SVM)|Finds the hyperplane that separates classes with the largest margin in high-dimensional space|Works well with small datasets, resists overfitting in high-dimensional feature spaces|Computationally expensive for large data, doesn’t capture word order|Topic classification, sentiment analysis, document categorization
Logistic Regression|Applies a sigmoid (or softmax) to a linear function of the features to model class probabilities|Simple and interpretable, efficient on small datasets|Learns only linear decision boundaries, doesn’t handle long-range dependencies|Binary classification (spam vs. not spam), fake news detection
Random Forest|Combines the votes of many decision trees trained on random subsets of the data and features|Captures non-linear feature interactions, robust to noise|Can be slow on large datasets, requires feature engineering|Multi-class classification, author attribution, keyword-based classification
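
To make the Naïve Bayes row concrete, here is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing. It is an illustration under simplifying assumptions (whitespace tokenization, a made-up four-document dataset, hypothetical function names `train_nb` and `predict_nb`), not how scikit-learn implements it.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Count words per class; these counts are all the model needs."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab


def predict_nb(text, word_counts, class_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids log(0) when a class never saw this word
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy data, for illustration only.
toy_texts = ["free prize inside", "win free money", "meeting at noon", "lunch at noon"]
toy_labels = ["spam", "spam", "ham", "ham"]
nb_params = train_nb(toy_texts, toy_labels)

print(predict_nb("free money now", *nb_params))    # spam
print(predict_nb("meeting at lunch", *nb_params))  # ham
```

Each word contributes to the score independently of its neighbors, which is exactly the "word independence" assumption listed under Cons.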

### 2.1. Sentiment analysis using Naïve Bayes
Pros:
1. Scales to large datasets
2. Usually more accurate than the rule-based method

Cons:
1. Requires labeled data
2. Needs training

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Sample dataset: texts[i] is labeled by labels[i]
texts = [
    "I love this product",
    "This is awesome",
    "I feel great",
    "I am so happy",
    "I am not happy",
    "I feel terrible",
    "I hate this product",
    "I don't like it",
    "I'm terribly in love with this product",
    "I loved the fact that I wasted money on this product",
    "This product broke after 1 day",
    "I love that this product last for 2 years",
    "This product last for eternity",
]
labels = [
    "positive",
    "positive",
    "positive",
    "positive",
    "negative",
    "negative",
    "negative",
    "negative",
    "positive",
    "negative",
    "negative",
    "positive",
    "positive",
]

# Pipeline: bag-of-words counts, then multinomial Naïve Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Predict the sentiment of new texts (all three are sarcastic or negative)
new_text = [
    "I feel happy that I wasted 10 hours in line",
    "I hate that I wasted my time and money",
    "I love that this product broke after 2 days",
]

print("Predicted sentiment:", model.predict(new_text))
# Expected output: ['negative', 'negative', 'negative']
# Actual output: ['negative', 'negative', 'positive']
# The last sentence is sarcastic; word counts alone see "love" and miss the context.
```

Predicted sentiment: ['negative' 'negative' 'positive']


## 3. Deep learning-based method
Uses neural networks for text classification.

Popular models:
1. RNN (LSTMs, GRUs): Handles sequential text well.
2. CNN for Text: Extracts local n-gram features from text for classification.
3. Transformers (BERT, GPT, RoBERTa): State-of-the-art models.

### Algorithm differences

Model|How It Works|Pros|Cons|Best Use Cases
---|---|---|---|---
Recurrent Neural Networks (RNNs)|Processes text sequentially, maintaining a memory of previous words|Captures context better than ML models, good for sequence-based tasks|Struggles with long sentences due to vanishing gradients|Sequential text classification (e.g., chatbot intent recognition)
Long Short-Term Memory (LSTM)|A type of RNN that retains long-term dependencies using gates|Handles longer sequences, mitigates the vanishing gradient problem|Computationally expensive, slow training|Sentiment analysis, document classification, named entity recognition
Gated Recurrent Units (GRU)|A simplified LSTM with fewer parameters|Faster training than LSTM, performs well on medium-length text|Still slower than CNNs and Transformers|Similar use cases as LSTM but with better efficiency
Convolutional Neural Networks (CNNs) for Text|Uses filters to detect patterns in word embeddings (n-grams)|Faster training than RNNs, works well with short text|Captures only local context, misses long-range dependencies|Short text classification (e.g., fake news detection, sentiment analysis)
Transformers (BERT, GPT, RoBERTa, etc.)|Uses attention mechanisms to capture long-range dependencies in text|State-of-the-art performance, captures deep contextual meaning|Computationally heavy: pretraining needs large corpora and GPUs, and inference is slower than for smaller models|Any advanced NLP task: document classification, chatbots, translation
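
The "attention mechanisms" in the Transformers row can be illustrated with a stripped-down scaled dot-product self-attention in plain Python. This is a sketch under strong simplifications: real Transformers apply learned query, key, and value projections and use multiple heads, and the 2-dimensional embeddings below are made-up numbers.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, regardless of distance
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical embeddings for a 3-token sentence (illustrative values only).
tokens = ["not", "very", "good"]
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
_, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Every token attends to every other token in a single step, which is why attention handles long-range dependencies more directly than an RNN that must carry information through each intermediate position.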

### 3.1. Text classification with DistilBERT (a smaller BERT variant)

```python
from transformers import pipeline

# DistilBERT fine-tuned on SST-2 for binary sentiment classification
classifier = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"
)

sample_texts = [
    "I love this product",
    "I hate this product",
    "I don't like it",
    "I'm terribly in love with this product",
    "I loved the fact that I wasted money on this product",
    "This product broke after 1 day",
    "I love that this product last for 2 years",
    "I feel happy that I wasted 10 hours in line",
    "I feel happy that I spend 10 hours in line",
    "I hate that I wasted my time and money",
    "I love that this product broke after 2 days",
]

# Each prediction is a dict with a "label" and a confidence "score"
prediction = classifier(sample_texts)

print("Predicted sentiment:", [p["label"] for p in prediction])
# Expected output: ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE']
# Actual output: ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE', 'POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE']
```

Predicted sentiment: ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE', 'POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE']
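
The expected and actual labels listed in the cell above can be compared programmatically to see where the model disagrees with the intended sentiment. The sketch below hard-codes both lists exactly as recorded in this notebook rather than re-running the model, so it needs no model download; the variable names are illustrative.

```python
sample_texts = [
    "I love this product",
    "I hate this product",
    "I don't like it",
    "I'm terribly in love with this product",
    "I loved the fact that I wasted money on this product",
    "This product broke after 1 day",
    "I love that this product last for 2 years",
    "I feel happy that I wasted 10 hours in line",
    "I feel happy that I spend 10 hours in line",
    "I hate that I wasted my time and money",
    "I love that this product broke after 2 days",
]
# Labels copied from the "Expected output" and "Actual output" comments above.
expected = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE",
            "POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
actual = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE",
          "POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# Collect every sentence where the prediction differs from the intended label
mismatches = [
    (text, want, got)
    for text, want, got in zip(sample_texts, expected, actual)
    if want != got
]
accuracy = 1 - len(mismatches) / len(sample_texts)

print(f"Agreement: {accuracy:.0%}")  # Agreement: 91%
for text, want, got in mismatches:
    print(f"{text!r}: expected {want}, got {got}")
```

The only disagreement is the sarcastic sentence "I feel happy that I wasted 10 hours in line", while the Naïve Bayes model in section 2.1 also missed a sarcastic sentence on a smaller test set.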


## ML vs DL

### Key differences

Feature|Traditional ML (NB, SVM, RF, etc.)|Deep Learning (LSTM, CNN, Transformers)
---|---|---
Feature Engineering|Requires manual features (TF-IDF, n-grams)|Learns features automatically
Performance on Small Data|Works well|Needs large data for good performance
Context Awareness|Doesn’t capture relationships between words well|Captures deep word relationships
Training Time|Fast|Slow (especially Transformers)
Computational Cost|Low|High (GPUs often needed)
Generalization|Works well on simple tasks|Works well on complex tasks
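
As an example of the manual feature engineering mentioned in the first row, here is a minimal from-scratch TF-IDF sketch using the textbook formula tf × log(N / df). It assumes whitespace tokenization and made-up documents; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Made-up documents, for illustration only.
docs = ["this product is great", "this product broke", "great value"]
doc_weights = tf_idf(docs)
for doc, w in zip(docs, doc_weights):
    print(doc, {k: round(v, 2) for k, v in w.items()})
```

Words that appear in many documents (such as "this") receive lower weights than words unique to one document (such as "is" or "broke"), which is the signal a traditional classifier then learns from.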

### When to Use Which Model?
1. ✅ If you have a small dataset → Use Naïve Bayes, SVM, or Logistic Regression
2. ✅ If you need fast training → Use Naïve Bayes, Logistic Regression, or Random Forest; among neural models, a CNN
3. ✅ If you need context understanding → Use LSTM, GRU, or Transformers
4. ✅ If you have lots of data & computing power → Use BERT, GPT, or other Transformers