<a href="https://colab.research.google.com/github/sahanyafernando/My_NLP_Learning/blob/main/NLP_Learning/7_TextClassificationWithTransformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demonstration: Text Classification using Transformers (Sentence Transformers)

This notebook demonstrates text classification with **Sentence Transformers**: a pre-trained model turns each review into a dense semantic embedding, and classical classifiers are then trained on those embeddings. Embeddings often capture meaning that TF-IDF's word-frequency features miss, such as paraphrases and synonyms.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

### Step 1: Loading the Dataset

In [None]:
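# Read the file once, then keep the sampled training rows out of the test set
df = pd.read_csv("sentiment_analysis.csv", encoding='latin-1')
train_df = df.sample(10, random_state=42)  # small training sample
test_df = df.drop(train_df.index)          # avoid evaluating on training rows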

print("Dataset Loaded:")
print(train_df.head())

### Step 2: Selecting relevant columns

In [None]:
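# .copy() gives independent frames, so the label assignment later does not write into a view
train_df = train_df[['review', 'sentiment']].copy()
test_df = test_df[['review', 'sentiment']].copy()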

print("Selected columns:")
print(train_df.head())

### Step 3: Standardizing labels

In [None]:
def standardize_sentiment(label):
    """Map a raw sentiment label to 1 (positive), 0 (negative), or 2 (neutral)."""
    if label in ['Positive', 'Extremely Positive']:
        return 1
    elif label in ['Negative', 'Extremely Negative']:
        return 0
    else:
        return 2  # Neutral, and any label not listed above

train_df['sentiment'] = train_df['sentiment'].apply(standardize_sentiment)
test_df['sentiment'] = test_df['sentiment'].apply(standardize_sentiment)

print("Standardized labels:")
print(train_df.head())
print("\nLabel distribution:")
print(train_df['sentiment'].value_counts().sort_index())
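
The mapping above sends every unrecognized value to the neutral class, including case or whitespace variants such as `positive`. Below is a minimal sketch of a stricter variant. It assumes the raw labels are the four strings used above plus `Neutral`; the actual values in `sentiment_analysis.csv` are not shown in this notebook. `LABEL_MAP` and `standardize_sentiment_strict` are hypothetical names.

```python
# Hypothetical lookup table; the label strings are assumed, not read from the CSV.
LABEL_MAP = {
    "extremely negative": 0,
    "negative": 0,
    "positive": 1,
    "extremely positive": 1,
    "neutral": 2,
}

def standardize_sentiment_strict(label):
    """Map a raw label to 0/1/2, raising on values outside LABEL_MAP."""
    key = str(label).strip().lower()  # tolerate case and surrounding whitespace
    if key not in LABEL_MAP:
        raise ValueError(f"Unexpected sentiment label: {label!r}")
    return LABEL_MAP[key]

print(standardize_sentiment_strict(" Positive "))          # 1
print(standardize_sentiment_strict("Extremely Negative"))  # 0
```

Raising on unknown values surfaces typos or unexpected categories early instead of quietly inflating the neutral class.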

### Step 4: Install Sentence Transformers Library

In [None]:
!pip install -q sentence-transformers

### Step 5: Generate Sentence Embeddings using Transformers

In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
# This model creates 384-dimensional embeddings that capture semantic meaning
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Sentence Transformer model loaded successfully!")
print(f"Model embedding dimension: {model.get_sentence_embedding_dimension()}")

In [None]:
# Generate embeddings for training data
print("Generating embeddings for training data...")
train_embeddings = model.encode(train_df['review'].tolist(), show_progress_bar=True, batch_size=16)

# Generate embeddings for test data
print("\nGenerating embeddings for test data...")
test_embeddings = model.encode(test_df['review'].tolist(), show_progress_bar=True, batch_size=16)

print(f"\nTraining embeddings shape: {train_embeddings.shape}")
print(f"Test embeddings shape: {test_embeddings.shape}")
print(f"\nSample embedding (first 10 dimensions): {train_embeddings[0][:10]}")
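
Embeddings are usually compared with cosine similarity: vectors pointing in similar directions score close to 1. The sketch below uses made-up 3-dimensional vectors rather than real model output, so it runs without downloading the model. The function and the vectors are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (not produced by the model)
great_movie = [0.9, 0.1, 0.0]
fantastic_film = [0.8, 0.2, 0.1]
terrible_service = [-0.7, 0.1, 0.6]

# The two "similar" vectors score higher than the dissimilar pair
print(cosine_similarity(great_movie, fantastic_film) >
      cosine_similarity(great_movie, terrible_service))  # True
```

The same function should work on rows of `train_embeddings`, since NumPy arrays can be iterated element by element.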

### Step 6: Prepare Training and Test Data

In [None]:
X_train = train_embeddings
X_test = test_embeddings
Y_train = train_df['sentiment'].values
Y_test = test_df['sentiment'].values

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"\nTraining label distribution:")
print(pd.Series(Y_train).value_counts().sort_index())
print(f"\nTest label distribution:")
print(pd.Series(Y_test).value_counts().sort_index())

### Step 7: Training Classification Models

We'll train multiple classifiers on the transformer embeddings and compare their performance.

In [None]:
def train_and_evaluate(clf, name, X_train, Y_train, X_test, Y_test):
    """Fit clf on the training embeddings, print test metrics, and return accuracy."""
    print(f"\n{'='*60}")
    print(f"Training {name}...")
    print(f"{'='*60}")

    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)
    accuracy = accuracy_score(Y_test, Y_pred)
    
    print(f"\n{name} Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    print("\nClassification Report:")
    print(classification_report(Y_test, Y_pred, zero_division=0))
    print("\nConfusion Matrix:")
    print(confusion_matrix(Y_test, Y_pred))
    
    return accuracy
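
To make the printed metrics concrete, here is a small plain-Python sketch of what accuracy and a confusion matrix compute. The labels are made up for illustration and are not results from this dataset. Rows are true classes and columns are predicted classes, which matches the layout of scikit-learn's `confusion_matrix`. The helper names are hypothetical.

```python
def accuracy_by_hand(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_by_hand(y_true, y_pred, n_classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Illustrative labels only: 0 = negative, 1 = positive, 2 = neutral
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(accuracy_by_hand(y_true, y_pred))      # 4 of 6 correct, about 0.667
print(confusion_by_hand(y_true, y_pred, 3))  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

Off-diagonal entries show which classes get confused with each other, which a single accuracy number hides.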

In [None]:
# Define models to train
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Support Vector Machine (Linear)": SVC(kernel='linear', random_state=42, probability=False),
    "Support Vector Machine (RBF)": SVC(kernel='rbf', random_state=42, probability=False),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
}

results = {}

# Use `clf` as the loop variable so the SentenceTransformer stored in `model` is not overwritten
for name, clf in models.items():
    acc = train_and_evaluate(clf, name, X_train, Y_train, X_test, Y_test)
    results[name] = acc

### Step 8: Comparing Models to Identify the Best One

In [None]:
print("\n" + "="*60)
print("Summary of Model Accuracies:")
print("="*60)

sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
for name, acc in sorted_results:
    print(f"{name}: {acc:.4f}")

best_model_name = sorted_results[0][0]
print(f"\n{'='*60}")
print(f"Best Performing Model: {best_model_name}")
print(f"{'='*60}")
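
Accuracy alone can favor a model that mostly predicts the majority class. The classification report printed for each model already includes a macro-averaged F1. The sketch below shows how that figure is derived, using made-up labels, in case you prefer to rank models by it. `macro_f1` is a hypothetical helper and not part of the notebook.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 (0.0 for a class absent from both lists)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Illustrative, imbalanced labels: class 2 is rare and always missed
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {acc:.3f}")                                  # accuracy = 0.875
print(f"macro F1 = {macro_f1(y_true, y_pred, [0, 1, 2]):.3f}")  # macro F1 = 0.630
```

In the notebook itself, `sklearn.metrics.f1_score(Y_test, Y_pred, average='macro')` computes the same quantity.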

### Step 9: Key Advantages of Transformer-Based Classification

**Why use Sentence Transformers instead of TF-IDF?**

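1. **Semantic Understanding**: Embeddings encode meaning and context, so paraphrases land close together even when they share no words; TF-IDF sees only term frequencies
2. **Dense Embeddings**: Each text becomes a fixed 384-dimensional vector, compared with a sparse TF-IDF matrix whose width grows with the vocabulary
3. **Pre-trained Knowledge**: The encoder reuses what it learned from large text corpora, so no embedding training is needed here
4. **Better Generalization**: Often holds up better on unseen wording, though this depends on the dataset and is worth checking against the TF-IDF baseline
5. **Context-Aware**: Word order and sentence structure influence the embedding, unlike a bag-of-words representation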
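
The first point can be seen even without a model: two reviews that mean the same thing may share no words, so any feature built purely from word counts treats them as unrelated. A toy sketch follows; the sentences and the `shared_tokens` helper are illustrative assumptions.

```python
def shared_tokens(a, b):
    """Lowercased, whitespace-split tokens that appear in both texts."""
    return set(a.lower().split()) & set(b.lower().split())

review_a = "The film was fantastic"
review_b = "A truly great movie"

# No overlapping tokens, so count-based vectors for these two reviews
# have no non-zero dimension in common.
print(shared_tokens(review_a, review_b))  # set()
```

A sentence embedding model is expected to place such paraphrases close together, though how close depends on the model and the text.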

**Comparison with Notebook 5 (TF-IDF approach):**
- Notebook 5: Uses sparse TF-IDF features → classical ML models
- This notebook: Uses dense transformer embeddings → classical ML models
- Both approaches can use the same classifiers, but embeddings provide richer semantic features