# Demonstration: Text Classification using Transformers (Sentence Transformers)

This notebook demonstrates text classification using **Sentence Transformers** to create semantic embeddings, which are then used as features for classical classification models. Unlike TF-IDF, which scores documents by word counts, these embeddings are designed so that sentences with similar meaning map to nearby vectors even when they share few words.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

### Step 1: Loading the Dataset

In [2]:
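# NOTE: both frames are read from the same file, so the 10 sampled training rows
# also appear in test_df. Scores measured on test_df are therefore optimistic.
train_df = pd.read_csv("sentiment_analysis.csv", encoding='latin-1').sample(10, random_state=42)
test_df = pd.read_csv("sentiment_analysis.csv", encoding='latin-1')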

print("Dataset Loaded:")
print(train_df.head())

Dataset Loaded:
                                        review sentiment
9           I am worried about the second wave  Negative
11           Recovered patients are increasing  Positive
0         I loved the movie, it was fantastic!  positive
13  Mask wearing is mandatory in public places   Neutral
5               Covid cases are rising rapidly  Negative
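
Both frames above are read from the same CSV, so every sampled training row also appears in the test frame. One way to keep the two sets disjoint is a stratified split with `train_test_split` (imported in the first cell). The sketch below uses made-up placeholder reviews and labels, not the notebook's data, purely to show the call.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data standing in for sentiment_analysis.csv
reviews = [f"review {i}" for i in range(15)]
labels = ["Negative"] * 3 + ["Positive"] * 4 + ["Neutral"] * 8

# Stratify so each class keeps roughly the same share in both splits
train_reviews, test_reviews, train_labels, test_labels = train_test_split(
    reviews, labels, test_size=5, stratify=labels, random_state=42
)

# No review should appear in both splits
overlap = set(train_reviews) & set(test_reviews)
print(len(train_reviews), len(test_reviews), len(overlap))
```

With a split like this, test accuracy reflects rows the classifier never saw during training.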


### Step 2: Selecting relevant columns

In [3]:
train_df = train_df[['review', 'sentiment']]
test_df = test_df[['review', 'sentiment']]

print("Selected columns:")
print(train_df.head())

Selected columns:
                                        review sentiment
9           I am worried about the second wave  Negative
11           Recovered patients are increasing  Positive
0         I loved the movie, it was fantastic!  positive
13  Mask wearing is mandatory in public places   Neutral
5               Covid cases are rising rapidly  Negative


### Step 3: Standardizing labels

In [4]:
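def standardize_sentiment(sentiment):
    # Map text labels to integers: 1 = positive, 0 = negative, 2 = neutral.
    # Matching is case-sensitive, so a lowercase label such as 'positive'
    # falls through to the neutral branch.
    if sentiment in ['Positive', 'Extremely Positive']:
        return 1
    elif sentiment in ['Negative', 'Extremely Negative']:
        return 0
    else:
        return 2 # Neutral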

train_df['sentiment'] = train_df['sentiment'].apply(standardize_sentiment)
test_df['sentiment'] = test_df['sentiment'].apply(standardize_sentiment)

print("Standardized labels:")
print(train_df.head())
print("\nLabel distribution:")
print(train_df['sentiment'].value_counts().sort_index())

Standardized labels:
                                        review  sentiment
9           I am worried about the second wave          0
11           Recovered patients are increasing          1
0         I loved the movie, it was fantastic!          2
13  Mask wearing is mandatory in public places          2
5               Covid cases are rising rapidly          0

Label distribution:
sentiment
0    2
1    3
2    5
Name: count, dtype: int64
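
In the output above, row 0 ("I loved the movie, it was fantastic!") carries the lowercase label `positive` and ends up as class 2, because the mapping compares exact strings. A case-insensitive variant could look like the sketch below. The name `standardize_sentiment_ci` is hypothetical and not part of the original notebook.

```python
def standardize_sentiment_ci(label):
    """Case-insensitive label mapping: 1 = positive, 0 = negative, 2 = neutral."""
    # Normalize whitespace and case before comparing
    key = str(label).strip().lower()
    if key in {"positive", "extremely positive"}:
        return 1
    if key in {"negative", "extremely negative"}:
        return 0
    return 2  # anything else is treated as neutral


for raw in ["Positive", "positive", " Extremely Negative ", "Neutral"]:
    print(repr(raw), "->", standardize_sentiment_ci(raw))
```

Applying this variant would change the label distribution shown above, so downstream outputs would need to be regenerated.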


### Step 4: Install Sentence Transformers Library

In [5]:
!pip install -q sentence-transformers

### Step 5: Generate Sentence Embeddings using Transformers

In [None]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
# This model creates 384-dimensional embeddings that capture semantic meaning
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Sentence Transformer model loaded successfully!")
print(f"Model embedding dimension: {model.get_sentence_embedding_dimension()}")

In [None]:
# Generate embeddings for training data
print("Generating embeddings for training data...")
train_embeddings = model.encode(train_df['review'].tolist(), show_progress_bar=True, batch_size=16)

# Generate embeddings for test data
print("\nGenerating embeddings for test data...")
test_embeddings = model.encode(test_df['review'].tolist(), show_progress_bar=True, batch_size=16)

print(f"\nTraining embeddings shape: {train_embeddings.shape}")
print(f"Test embeddings shape: {test_embeddings.shape}")
print(f"\nSample embedding (first 10 dimensions): {train_embeddings[0][:10]}")
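
Embeddings such as `train_embeddings` are usually compared with cosine similarity: vectors that point in similar directions score close to 1, unrelated ones close to 0. The sketch below uses small hand-written vectors in place of real 384-dimensional model outputs, so the numbers are illustrative only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-ins for sentence embeddings (assumed values, not model output)
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```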

### Step 6: Prepare Training and Test Data

In [8]:
X_train = train_embeddings
X_test = test_embeddings
Y_train = train_df['sentiment'].values
Y_test = test_df['sentiment'].values

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"\nTraining label distribution:")
print(pd.Series(Y_train).value_counts().sort_index())
print(f"\nTest label distribution:")
print(pd.Series(Y_test).value_counts().sort_index())

Training set size: (10, 384)
Test set size: (15, 384)

Training label distribution:
0    2
1    3
2    5
Name: count, dtype: int64

Test label distribution:
0    3
1    4
2    8
Name: count, dtype: int64
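
The test labels above are imbalanced (8 of 15 rows are class 2), so it helps to know what a classifier that always predicts the majority class would score. The sketch below derives that baseline from the counts printed above. It is a reference point for reading the accuracies that follow, not part of the original notebook.

```python
from collections import Counter

# Test label counts copied from the output above
test_counts = Counter({0: 3, 1: 4, 2: 8})

# A majority-class baseline always predicts the most frequent label
majority_class, majority_count = test_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(test_counts.values())

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```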


### Step 7: Training Classification Models

We'll train multiple classifiers on the transformer embeddings and compare their performance.

In [9]:
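def train_and_evaluate(clf, name, X_train, Y_train, X_test, Y_test):
    """Fit clf on the training data, print test metrics, and return test accuracy."""
    print(f"\n{'='*60}")
    print(f"Training {name}...")
    print(f"{'='*60}")

    # Fit the classifier, then predict on the test embeddings
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)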
    accuracy = accuracy_score(Y_test, Y_pred)

    print(f"\n{name} Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    print("\nClassification Report:")
    print(classification_report(Y_test, Y_pred, zero_division=0))
    print("\nConfusion Matrix:")
    print(confusion_matrix(Y_test, Y_pred))

    return accuracy

In [10]:
# Define models to train
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Support Vector Machine (Linear)": SVC(kernel='linear', random_state=42, probability=False),
    "Support Vector Machine (RBF)": SVC(kernel='rbf', random_state=42, probability=False),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
}

results = {}

# Use `clf` as the loop variable so the SentenceTransformer stored in `model` is not overwritten
for name, clf in models.items():
    acc = train_and_evaluate(clf, name, X_train, Y_train, X_test, Y_test)
    results[name] = acc


Training Logistic Regression...

Logistic Regression Performance:
Accuracy: 0.8667

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.67      0.80         3
           1       1.00      0.75      0.86         4
           2       0.80      1.00      0.89         8

    accuracy                           0.87        15
   macro avg       0.93      0.81      0.85        15
weighted avg       0.89      0.87      0.86        15


Confusion Matrix:
[[2 0 1]
 [0 3 1]
 [0 0 8]]

Training Support Vector Machine (Linear)...

Support Vector Machine (Linear) Performance:
Accuracy: 0.8667

Classification Report:
              precision    recall  f1-score   support

           0       0.67      0.67      0.67         3
           1       0.80      1.00      0.89         4
           2       1.00      0.88      0.93         8

    accuracy                           0.87        15
   macro avg       0.82      0.85      0.83        15
weighted avg       0.88      0.87      0.87        15

...
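
The per-class precision and recall in a classification report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below recomputes them for the Logistic Regression matrix printed above, as a way to read the report rather than a replacement for `classification_report`.

```python
import numpy as np

# Logistic Regression confusion matrix from the output above
# (rows = true class, columns = predicted class)
cm = np.array([[2, 0, 1],
               [0, 3, 1],
               [0, 0, 8]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by predicted counts per class
recall = true_positives / cm.sum(axis=1)     # divide by true counts per class
accuracy = true_positives.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("accuracy: ", round(float(accuracy), 4))
```

The rounded values match the report: precision 1.00, 1.00, 0.80 and recall 0.67, 0.75, 1.00, with 13 of 15 rows classified correctly.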

### Step 8: Comparing Models to Identify the Best One

In [11]:
print("\n" + "="*60)
print("Summary of Model Accuracies:")
print("="*60)

sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
for name, acc in sorted_results:
    print(f"{name}: {acc:.4f}")

best_model_name = sorted_results[0][0]
print(f"\n{'='*60}")
print(f"Best Performing Model: {best_model_name}")
print(f"{'='*60}")


Summary of Model Accuracies:
Random Forest: 0.9333
Logistic Regression: 0.8667
Support Vector Machine (Linear): 0.8667
Support Vector Machine (RBF): 0.8667

Best Performing Model: Random Forest
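
With only 15 test rows, the gap between 0.9333 and 0.8667 is a single example (14 versus 13 correct), so the ranking above is fragile. Stratified k-fold cross-validation is one common way to get a steadier estimate. The sketch below runs on synthetic features from `make_classification` as a stand-in for real embeddings. The sample count, feature count, and number of folds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for sentence embeddings (assumption)
X, y = make_classification(
    n_samples=90, n_features=20, n_informative=5, n_classes=3, random_state=42
)

# Each fold keeps roughly the same class proportions as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

Reporting the mean together with the spread across folds makes it easier to tell whether one classifier is actually ahead of another.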


### Step 9: Key Advantages of Transformer-Based Classification

**Why use Sentence Transformers instead of TF-IDF?**

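1. **Semantic Understanding**: Embeddings encode meaning and context, not just word frequencies, so paraphrases with no shared words can still look similar
2. **Dense Embeddings**: Fixed 384-dimensional vectors regardless of vocabulary size, versus sparse TF-IDF matrices with one column per vocabulary term
3. **Pre-trained Knowledge**: Reuses what the model learned during pre-training on large text corpora, which helps when labeled data is scarce
4. **Better Generalization**: Often handles unseen wording better, although the gain depends on the dataset and should be checked on held-out data
5. **Context-Aware**: Takes word order and relationships between words into account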

**Comparison with Notebook 5 (TF-IDF approach):**
- Notebook 5: Uses sparse TF-IDF features → classical ML models
- This notebook: Uses dense transformer embeddings → classical ML models
- Both approaches can use the same classifiers, but embeddings provide richer semantic features
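
To make the semantic-understanding point concrete, the sketch below shows what TF-IDF does with two phrases that mean nearly the same thing but share no words. The example phrases are made up for illustration. Only scikit-learn is used, so no embedding model is needed to run it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical paraphrases with no words in common
sentences = ["great film", "excellent movie"]

# One column per vocabulary term; the result is a sparse matrix
tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"TF-IDF matrix shape: {tfidf.shape}")
print(f"TF-IDF cosine similarity: {similarity:.2f}")
```

Because the phrases share no vocabulary, their TF-IDF vectors are orthogonal and the similarity is zero. A sentence embedding model would typically place such paraphrases close together, which is the behaviour item 1 in the list above refers to.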