<a href="https://colab.research.google.com/github/sahanyafernando/My_NLP_Learning/blob/main/Public_Response_Analysis/notebooks/07_sentence_transformer_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 07 – Sentence Transformer-Based Text Classification

This notebook uses **Sentence Transformers** to create dense embeddings and then trains sentiment classifiers on the multilingual policy dataset. Unlike the TF-IDF approach in notebook 04, which represents each post by weighted word counts, this approach uses embeddings from a pretrained transformer that are intended to place sentences with similar meaning close together, even across languages.

## Load preprocessing artifacts

Load the outputs saved by `01_data_loading_and_preprocessing.ipynb`. If the artifacts file is missing, run that notebook first.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pickle, pathlib

artifacts_root = pathlib.Path("/content/drive/MyDrive/My_NLP_Learning/Public_Response_Analysis")
artifacts_path = artifacts_root / "artifacts/preprocessing_outputs.pkl"

if artifacts_path.exists():
    with open(artifacts_path, "rb") as f:
        artifacts = pickle.load(f)
    df = artifacts["df"]
    print("Loaded preprocessing artifacts and DataFrame.")
    print(f"Dataset shape: {df.shape}")
    print(f"Sentiment labels: {df['sentiment_label'].value_counts().to_dict()}")
else:
    raise FileNotFoundError(
        "Artifacts not found. Please run 01_data_loading_and_preprocessing.ipynb first "
        "and execute the 'Save preprocessing artifacts' cell."
    )
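
The later cells read the `text`, `sentiment_label`, and `language` columns, so it can help to confirm they are present right after loading. The following is a minimal sketch of such a check; the helper name `missing_columns` and the example column `post_id` are hypothetical and not part of the original notebook.

```python
# Columns that later cells in this notebook read from the DataFrame.
REQUIRED_COLUMNS = ("text", "sentiment_label", "language")


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]


# Stand-in for list(df.columns); "post_id" is a made-up extra column.
example_columns = ["text", "language", "post_id"]
missing = missing_columns(example_columns)
print(f"Missing columns: {missing}")
```

In the notebook itself, the call would be `missing_columns(df.columns)`, and a non-empty result suggests the artifacts were produced by a different version of notebook 01.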

## Install and import Sentence Transformers

The `sentence-transformers` library provides pretrained models for producing sentence embeddings, including multilingual ones. We'll use it to embed every post in the dataset.

In [None]:
!pip install -q sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

## Generate Sentence Embeddings

We'll use a multilingual Sentence Transformer model to convert each text post into a dense vector representation. The model `paraphrase-multilingual-MiniLM-L12-v2` supports 50+ languages and creates 384-dimensional embeddings.

In [None]:
# Initialize the multilingual sentence transformer model
# This model supports 50+ languages including en, es, fr, de, hi (languages in our dataset)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

print("Sentence Transformer model loaded successfully!")
print(f"Model embedding dimension: {model.get_sentence_embedding_dimension()}")

In [None]:
# Generate embeddings for all texts
print("Generating sentence embeddings...")
texts = df['text'].tolist()
embeddings = model.encode(texts, show_progress_bar=True, batch_size=16)

print(f"Generated embeddings shape: {embeddings.shape}")
print(f"Sample embedding (first 10 dimensions): {embeddings[0][:10]}")
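
A quick way to sanity-check the vectors is to compare a few pairs with cosine similarity, a common measure for sentence embeddings. The sketch below spells out the arithmetic in plain Python on toy 3-dimensional vectors; the function name `cosine_similarity` is defined here for illustration, and returning 0.0 for all-zero vectors is a convention assumed for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Toy vectors standing in for rows of `embeddings`.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(f"same direction: {same_direction:.3f}, orthogonal: {orthogonal:.3f}")
```

The same function accepts rows of the real array, for example `cosine_similarity(embeddings[0], embeddings[1])`. For larger comparisons, the library's own `sentence_transformers.util.cos_sim` is likely the better choice.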

## Prepare Training Data

Split the data into training and testing sets using the sentiment labels as targets.

In [None]:
X = embeddings
y = df['sentiment_label'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print("\nTraining label distribution:")
print(pd.Series(y_train).value_counts())
print("\nTest label distribution:")
print(pd.Series(y_test).value_counts())
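
Passing `stratify=y` asks the split to keep roughly the same class proportions in both sets. One way to verify this is to compare label proportions directly, as in the sketch below. The helper `label_proportions` and the label names are made up for illustration; the real labels come from `df['sentiment_label']`.

```python
from collections import Counter


def label_proportions(labels):
    """Map each label to its share of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels standing in for y_train and y_test.
toy_train = ["positive"] * 8 + ["negative"] * 8 + ["neutral"] * 4
toy_test = ["positive"] * 2 + ["negative"] * 2 + ["neutral"] * 1

train_share = label_proportions(toy_train)
test_share = label_proportions(toy_test)

# Largest difference in class share between the two sets.
max_gap = max(
    abs(train_share[label] - test_share.get(label, 0.0)) for label in train_share
)
print(f"Largest proportion gap: {max_gap:.3f}")
```

With the real arrays, `label_proportions(y_train)` and `label_proportions(y_test)` should differ only by small rounding effects when stratification is on.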

## Train Classification Models

We'll train multiple classifiers on the sentence transformer embeddings and compare their performance.

In [None]:
# Define models to train
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "SVM (Linear)": SVC(kernel='linear', random_state=42, probability=True),
    "SVM (RBF)": SVC(kernel='rbf', random_state=42, probability=True),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
}

results = {}

for name, clf in models.items():
    print(f"\n{'='*60}")
    print(f"Training {name}...")
    print(f"{'='*60}")
    
    # Train the model
    clf.fit(X_train, y_train)
    
    # Make predictions
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    acc = accuracy_score(y_test, y_pred)
    results[name] = acc
    
    print(f"\n{name} Accuracy: {acc:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

print(f"\n{'='*60}")
print("Summary of Model Accuracies:")
print(f"{'='*60}")
for name, acc in sorted(results.items(), key=lambda x: x[1], reverse=True):
    print(f"{name}: {acc:.4f}")
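
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label; the helper name `majority_baseline_accuracy` and the toy labels are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Made-up labels standing in for y_train and y_test.
toy_train = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
toy_test = ["neutral", "neutral", "positive", "negative"]

label, baseline = majority_baseline_accuracy(toy_train, toy_test)
print(f"Always predicting '{label}' scores {baseline:.2f}")
```

Calling `majority_baseline_accuracy(y_train, y_test)` in the notebook gives the floor that the classifiers above need to beat. scikit-learn's `DummyClassifier(strategy="most_frequent")` offers the same baseline with the usual estimator interface.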

## Language-wise Performance Analysis

Evaluate how well the best model performs across different languages in the dataset.

In [None]:
# Select the best model
best_model_name = max(results, key=results.get)
print(f"Best model: {best_model_name}")
best_clf = models[best_model_name]

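# Evaluate on held-out rows only: predicting on the full dataset would include
# training rows and inflate the per-language scores.
# Re-running the split on row positions with the same test_size, random_state,
# and stratify arguments reproduces the test rows used above.
_, test_idx = train_test_split(
    np.arange(len(df)), test_size=0.2, random_state=42, stratify=y
)

df_eval = df.iloc[test_idx].copy()
df_eval['predicted_sentiment'] = best_clf.predict(X[test_idx])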

print("\n" + "="*60)
print("Language-wise Performance:")
print("="*60)

for lang in sorted(df_eval['language'].unique()):
    lang_subset = df_eval[df_eval['language'] == lang]
    lang_acc = accuracy_score(lang_subset['sentiment_label'], lang_subset['predicted_sentiment'])
    print(f"\nLanguage: {lang.upper()}")
    print(f"  Accuracy: {lang_acc:.4f}")
    print(f"  Number of samples: {len(lang_subset)}")
    print("  Classification Report:")
    print(classification_report(lang_subset['sentiment_label'], lang_subset['predicted_sentiment'], 
                                zero_division=0))

## Comparison with TF-IDF Baseline (Notebook 04)

This section lists the expected advantages of sentence embeddings over TF-IDF for multilingual text classification. To check whether they hold on this dataset, run notebook 04 and compare its accuracies with the results above.

**Key advantages of Sentence Transformers:**
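- **Semantic understanding**: Embeddings encode sentence meaning, so paraphrases with different wording can still land close together
- **Multilingual**: A single model maps all supported languages into one shared vector space
- **Context-aware**: Each sentence is encoded as a whole, so word order and surrounding words affect the result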
- **Dense embeddings**: Compact 384-dimensional vectors vs sparse TF-IDF matrices
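
To make the multilingual point concrete, the sketch below compares sentences using raw term counts, a simplified stand-in for TF-IDF (no IDF weighting, whitespace tokenization). The example sentences and helper names are made up for illustration. Because the English and Spanish sentences share no tokens, their count vectors have zero overlap, however close their meaning.

```python
import math
from collections import Counter


def term_counts(sentence):
    """Bag-of-words counts using lowercase whitespace tokens."""
    return Counter(sentence.lower().split())


def sparse_cosine(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[token] * counts_b[token] for token in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


english = term_counts("the new policy is good")
english_variant = term_counts("the new policy is great")
spanish = term_counts("la nueva política es buena")

same_language = sparse_cosine(english, english_variant)  # four of five tokens shared
cross_language = sparse_cosine(english, spanish)  # no shared tokens
print(f"same language: {same_language:.2f}, cross language: {cross_language:.2f}")
```

A multilingual embedding model is designed to give the cross-language pair a high similarity instead, which is the property the classifiers in this notebook rely on.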