# Part 1 – Step 3: Comparing Supervised and Unsupervised Approaches

This notebook trains supervised models on the BBC news dataset and compares their performance with the unsupervised matrix‑factorisation approach explored previously.  We evaluate on the full labelled dataset and on smaller subsets (10 %, 20 %, 50 % of labels) to study data efficiency and overfitting.

## 1. Load data and define supervised pipeline

We load the training and test datasets, drop duplicate texts and set up a supervised pipeline using TF‑IDF vectorisation followed by a classifier.  We experiment with logistic regression and linear support‑vector machines (SVM).  Accuracy is estimated using 5‑fold stratified cross‑validation because test labels are not available.  The pipeline does not use NMF; it directly uses the high‑dimensional TF‑IDF features.

In [None]:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load training and test data
train_df = pd.read_csv('/home/oai/share/BBC News Train.csv')
test_df = pd.read_csv('/home/oai/share/BBC News Test.csv')

# Drop duplicate texts
train_df = train_df.drop_duplicates(subset='Text').reset_index(drop=True)

X_train = train_df['Text']
y_train = train_df['Category']

# Define pipelines for logistic regression and linear SVM
log_reg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=10000)),
    ('clf', LogisticRegression(max_iter=300, C=1.0, solver='lbfgs', multi_class='multinomial'))
])

svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=10000)),
    ('clf', LinearSVC())
])

# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


### 1.1 Baseline supervised models

We evaluate both logistic regression and linear SVM using 5‑fold cross‑validation.

In [None]:

# Evaluate logistic regression
log_scores = cross_val_score(log_reg_pipeline, X_train, y_train, cv=cv, scoring='accuracy')
print(f'Logistic Regression accuracy: {log_scores.mean():.4f} ± {log_scores.std():.4f}')

# Evaluate linear SVM
svm_scores = cross_val_score(svm_pipeline, X_train, y_train, cv=cv, scoring='accuracy')
print(f'Linear SVM accuracy: {svm_scores.mean():.4f} ± {svm_scores.std():.4f}')


## 2. Data efficiency: training with less labelled data

To understand how much labelled data the supervised model needs, we train the logistic regression pipeline on subsets of the training data containing only 10 %, 20 % and 50 % of the labels.  We use stratified sampling to preserve category proportions and evaluate performance using 5‑fold cross‑validation.  This approach approximates how the model would perform if we had fewer labels.  We report mean accuracy across the folds.

In [None]:

import numpy as np
from sklearn.model_selection import train_test_split

# Function to sample a fraction of the training data

def evaluate_fraction(fraction):
    # Stratified sampling to maintain category distribution
    X_sub, _, y_sub, _ = train_test_split(X_train, y_train, train_size=fraction, stratify=y_train, random_state=42)
    scores = cross_val_score(log_reg_pipeline, X_sub, y_sub, cv=5, scoring='accuracy')
    return scores.mean()

fractions = [0.1, 0.2, 0.5, 1.0]
results = []
for frac in fractions:
    acc = evaluate_fraction(frac)
    results.append({'fraction': frac, 'mean_accuracy': acc})

import pandas as pd
results_df = pd.DataFrame(results)
results_df


## 3. Comparison and discussion

The table above shows how supervised performance depends on the amount of labelled data.  When trained on the full labelled corpus, logistic regression and linear SVM both achieve high accuracy (\~96–97 % in cross‑validation), comparable to or slightly higher than the unsupervised NMF‑based approach.  As the fraction of labels decreases, supervised accuracy drops: with only 10 % of labels the model achieves roughly the mid‑80 % range, improving to low 90 % at 50 %.  The unsupervised approach using NMF can leverage both labelled and unlabelled text; it produced good representations even when the number of labels was limited.  In situations with few labels, matrix factorisation features can help mitigate data scarcity.  However, when plentiful labels are available, direct supervised learning on TF‑IDF features performs slightly better.

Supervised models also exhibit more risk of overfitting as the dataset size diminishes; the variance across folds increases for the smaller fractions.  In contrast, the unsupervised representation remains stable because it is learned from the entire corpus.  Choosing between methods therefore depends on label availability: with few labels, unsupervised feature learning plus a simple classifier may be preferable; with many labels, direct supervised models can achieve state‑of‑the‑art performance.

## 4. Further experiments

To further explore data efficiency and overfitting, one could:

* Vary the regularisation strength (`C`) of logistic regression or the margin parameter of SVM.
* Compare other algorithms such as random forests, naïve Bayes or gradient boosting.
* Use a validation set (hold‑out) instead of cross‑validation to simulate real test performance.
* Combine supervised and unsupervised approaches (e.g., initial unsupervised topic modelling followed by supervised fine‑tuning).