
# 20 Newsgroups — End‑to‑End Experiments with MLflow

This notebook contains two independent workflows on the 20 Newsgroups dataset:

1. **PyCaret + MLflow** with TF‑IDF → SVD features (150 components), automated model comparison and tuning, and artifact logging.
2. **PyTorch MLP + MLflow** with TF‑IDF → SVD features (200 components), a manual training loop with early stopping, and metric and artifact logging.
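
Both workflows start from the same TF‑IDF weighting. As a rough illustration of what `TfidfVectorizer(sublinear_tf=True)` computes, here is a minimal pure-Python sketch. It assumes scikit-learn's documented defaults (smoothed idf and L2 row normalisation); the whitespace tokenisation, the lack of stop-word removal, and the helper name `tfidf_rows` are simplifications made for this example, and the SVD step is not covered.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    # Simplification: lowercase + whitespace split (not scikit-learn's tokenizer).
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Sublinear tf: 1 + ln(count) instead of the raw count.
        weights = {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

rows = tfidf_rows(["gpu driver crash gpu", "driver update", "theology debate"])
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

In the first toy document, the repeated and rare term `gpu` ends up with the largest weight, while `driver`, which also appears in the second document, gets the smallest.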

> **Notes (Windows users):**
> - Keep artifact paths short if you run a local MLflow server to avoid long‑path issues.
> - If PyCaret/MLflow versions in your environment are incompatible, pin `mlflow==2.12.1` and use a recent PyCaret (e.g., 3.3.x). 
> - If PyCaret raises a `sklearn` private API error, uncomment the small shim in the PyCaret section.
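
Before pinning anything, it can help to see which versions are actually installed. The following is a small sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative choices, not part of the notebook.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(["mlflow", "pycaret", "scikit-learn"]).items():
    print(f"{name}: {version or 'not installed'}")
```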


## 1) PyCaret + MLflow (TF‑IDF → SVD)

In [2]:

import os, json
from pathlib import Path
import joblib
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

import mlflow

EXPERIMENT_NAME = "20NG-PyCaret"
mlflow.set_experiment(EXPERIMENT_NAME)

ART_DIR = Path("artifacts"); ART_DIR.mkdir(exist_ok=True)

def close_all_runs():
    while mlflow.active_run() is not None:
        mlflow.end_run()

close_all_runs()
print("Experiment ready:", EXPERIMENT_NAME)


Experiment ready: 20NG-PyCaret


In [3]:

dataset = fetch_20newsgroups(subset='all')
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tfidf = TfidfVectorizer(stop_words='english', max_features=30000, sublinear_tf=True)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf  = tfidf.transform(X_test)

svd = TruncatedSVD(n_components=150, random_state=42)
X_train_svd = svd.fit_transform(X_train_tfidf)
X_test_svd  = svd.transform(X_test_tfidf)

joblib.dump(tfidf, ART_DIR / "tfidf_20ng.joblib")
joblib.dump(svd,   ART_DIR / "svd_20ng_150.joblib")

cols = [f"svd_{i}" for i in range(X_train_svd.shape[1])]
train_df = pd.DataFrame(X_train_svd, columns=cols); train_df["label"] = y_train
test_df  = pd.DataFrame(X_test_svd,  columns=cols); test_df["label"]  = y_test

len(train_df), len(test_df), train_df.shape[1]-1 


(15076, 3770, 150)

In [4]:

def _filter_available(models):
    avail = set(models)
    try:
        import lightgbm  
    except Exception:
        avail.discard("lightgbm")
    try:
        import xgboost  
    except Exception:
        avail.discard("xgboost")
    return list(avail)

include_models = _filter_available(["lr", "ridge", "nb", "rf", "lightgbm", "xgboost"])
include_models


['rf', 'nb', 'lightgbm', 'xgboost', 'ridge', 'lr']

In [5]:

from pycaret.classification import (
    setup, compare_models, tune_model, finalize_model, predict_model, save_model, pull
)

clf = setup(
    data=train_df,
    target="label",
    session_id=42,
    fold=3,
    html=False,
    log_experiment=True,
    experiment_name=EXPERIMENT_NAME,
    experiment_custom_tags={"dataset":"20newsgroups","features":"tfidf+svd"},
    log_plots=True,
    log_profile=False,
    log_data=False,
    verbose=True
)

best = compare_models(
    include=include_models,
    n_select=1,
    turbo=False,
    budget_time=300
)

leaderboard = pull()
lb_path = ART_DIR / "leaderboard.csv"
leaderboard.to_csv(lb_path, index=False)
leaderboard.head()


                    Description            Value
0                    Session id               42
1                        Target            label
2                   Target type       Multiclass
3           Original data shape     (15076, 151)
4        Transformed data shape     (15076, 151)
5   Transformed train set shape     (10553, 151)
6    Transformed test set shape      (4523, 151)
7              Numeric features              150
8                    Preprocess             True
9               Imputation type           simple
10           Numeric imputation             mean
11       Categorical imputation             mode
12               Fold Generator  StratifiedKFold
13                  Fold Number                3
14                     CPU Jobs               -1
15                      Use GPU            False
16               Log Experiment     MlflowLogger
17              Experiment Name     20NG-PyCaret
18                          USI             d0d9


                                                           

                             Model  Accuracy     AUC  Recall   Prec.      F1  \
lr             Logistic Regression    0.8199  0.0000  0.8199  0.8230  0.8156   
ridge             Ridge Classifier    0.8184  0.0000  0.8184  0.8182  0.8119   
xgboost  Extreme Gradient Boosting    0.8082  0.9842  0.8082  0.8087  0.8076   
rf        Random Forest Classifier    0.7916  0.9760  0.7916  0.7951  0.7901   
nb                     Naive Bayes    0.6751  0.9435  0.6751  0.7001  0.6802   

          Kappa     MCC  TT (Sec)  
lr       0.8102  0.8106    0.1400  
ridge    0.8087  0.8091    0.6267  
xgboost  0.7980  0.7981   10.4333  
rf       0.7805  0.7807    1.8100  
nb       0.6577  0.6587    0.7000  
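
The row counts in the setup summary follow from two nested splits: the notebook's own 80/20 split of the 15076 + 3770 = 18846 documents, and a further hold-out that PyCaret takes from `train_df`. The sketch below reproduces the arithmetic, assuming the test share is rounded up and that PyCaret's default `train_size` of 0.7 applies (it is not set explicitly in `setup`); `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

# Outer split made by the notebook's train_test_split(test_size=0.2).
outer_train, outer_test = split_sizes(18846, 0.2)
# Inner hold-out taken by PyCaret from train_df (assumed default: 30%).
inner_train, inner_test = split_sizes(outer_train, 0.3)

print(outer_train, outer_test)  # 15076 3770
print(inner_train, inner_test)  # 10553 4523
```

Under these assumptions, the leaderboard scores are 3-fold cross-validation results on the 10553-row inner training set, and `test_df` is not touched until `predict_model` is called on it.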




In [6]:

best_tuned = tune_model(best, optimize="Accuracy", n_iter=20, choose_better=True)
final_model = finalize_model(best_tuned)

test_preds = predict_model(final_model, data=test_df)
pred_path = ART_DIR / "test_predictions_head.csv"
test_preds.head(50).to_csv(pred_path, index=False)
test_preds.head()



Fitting 3 folds for each of 20 candidates, totalling 60 fits


                                                         

      Accuracy  AUC  Recall   Prec.      F1   Kappa     MCC
Fold                                                       
0       0.8368  0.0  0.8368  0.8399  0.8376  0.8281  0.8282
1       0.8474  0.0  0.8474  0.8484  0.8476  0.8392  0.8393
2       0.8388  0.0  0.8388  0.8438  0.8400  0.8302  0.8304
Mean    0.8410  0.0  0.8410  0.8440  0.8417  0.8325  0.8326
Std     0.0046  0.0  0.0046  0.0035  0.0043  0.0048  0.0048
                 Model  Accuracy     AUC  Recall   Prec.      F1   Kappa  \
0  Logistic Regression    0.8674  0.9919  0.8674  0.8681  0.8674  0.8603   

      MCC  
0  0.8604  


Unnamed: 0,svd_0,svd_1,svd_2,svd_3,svd_4,svd_5,svd_6,svd_7,svd_8,svd_9,...,svd_143,svd_144,svd_145,svd_146,svd_147,svd_148,svd_149,label,prediction_label,prediction_score
0,0.051136,-0.044209,0.013876,-0.014207,0.010195,0.018256,0.002139,0.019463,0.00418,-0.007518,...,0.013087,0.012224,0.010218,0.01989,0.00767,0.009534,0.016458,1,1,0.6053
1,0.14895,-0.022029,-0.051318,-0.045028,0.060533,-0.028787,0.058711,0.029263,-0.000823,-0.011804,...,0.003848,0.037028,0.018487,-0.008069,-0.015455,-0.01396,0.015685,19,19,0.275
2,0.127395,-0.00219,0.041384,-0.001889,-0.019644,-0.022073,-0.007373,0.001285,0.006122,-0.001303,...,0.01873,0.022513,0.012194,-0.015511,-0.039467,0.005832,0.024015,5,5,0.4727
3,0.154591,0.003464,0.014266,0.020361,-0.034787,-0.003228,-0.012234,-0.059677,0.039645,0.025397,...,-0.027196,0.03685,0.000714,0.00893,-0.060703,-0.016321,-0.039191,7,7,0.3642
4,0.155006,0.042561,-0.049434,-0.031397,0.037215,0.020427,0.065599,0.01982,-0.005767,-0.027879,...,0.004671,0.023195,0.014267,0.00162,-0.002615,-0.031948,-0.003028,17,17,0.7586
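
Values such as 0.275 in `prediction_score` can look low, but with 20 classes a uniform guess would score only 0.05. The sketch below illustrates this, assuming that `prediction_score` is the probability of the predicted class and that the probabilities come from a softmax over per-class scores; the scores used here are hypothetical, not taken from the model.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(scores):
    """Return (class_index, probability) for the most likely class."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

# Hypothetical scores: one class moderately ahead of 19 equal rivals.
scores = [1.5] + [0.0] * 19
label, score = top_prediction(scores)
print(label, round(score, 4))
```

Here the winning class is clearly ahead of every individual rival, yet its probability stays well below 0.5 because the remaining mass is spread over 19 other classes.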


In [7]:

save_model(final_model, str(ART_DIR / "best_20ng_pycaret"))

close_all_runs() 

with mlflow.start_run(run_name="extra_artifacts_attach"):
    mlflow.log_params({
        "vectorizer":"tfidf",
        "tfidf_stop_words":"english",
        "tfidf_max_features":30000,
        "tfidf_sublinear_tf":True,
        "svd_n_components":150,
        "test_size":0.2,
        "random_state":42
    })
    mlflow.log_artifact(str(ART_DIR / "tfidf_20ng.joblib"), artifact_path="preprocessing")
    mlflow.log_artifact(str(ART_DIR / "svd_20ng_150.joblib"), artifact_path="preprocessing")
    mlflow.log_artifact(str(ART_DIR / "leaderboard.csv"), artifact_path="reports")
    mlflow.log_artifact(str(ART_DIR / "test_predictions_head.csv"), artifact_path="reports")
    mlflow.log_artifact(str(ART_DIR / "best_20ng_pycaret.pkl"), artifact_path="model_pickles")

print("PyCaret section completed. Check MLflow experiment:", EXPERIMENT_NAME)


Transformation Pipeline and Model Successfully Saved
PyCaret section completed. Check MLflow experiment: 20NG-PyCaret


## 2) PyTorch MLP + MLflow (TF‑IDF → SVD)

In [8]:

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import accuracy_score, classification_report
import joblib
from pathlib import Path

import mlflow
import mlflow.pytorch

mlflow.set_experiment("20ng-mlp-svd") 

dataset = fetch_20newsgroups(subset='all')
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

tfidf = TfidfVectorizer(stop_words='english', max_features=30000, sublinear_tf=True)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf  = tfidf.transform(X_test)

svd = TruncatedSVD(n_components=200, random_state=42)
X_train_svd = svd.fit_transform(X_train_tfidf)
X_test_svd  = svd.transform(X_test_tfidf)

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_classes = len(np.unique(y_train))
in_dim = X_train_svd.shape[1]

Xtr = torch.tensor(X_train_svd, dtype=torch.float32)
ytr = torch.tensor(y_train, dtype=torch.long)
Xte = torch.tensor(X_test_svd,  dtype=torch.float32)
yte = torch.tensor(y_test,  dtype=torch.long)

train_loader = DataLoader(TensorDataset(Xtr, ytr), batch_size=256, shuffle=True)
test_loader  = DataLoader(TensorDataset(Xte, yte), batch_size=512, shuffle=False)

class MLP(nn.Module):
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )
    def forward(self, x):
        return self.net(x)

model = MLP(in_dim, num_classes).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

best_acc, patience, wait = 0.0, 5, 0

with mlflow.start_run(run_name="mlp-svd-baseline"):
    mlflow.log_params({
        "tfidf_stop_words": "english",
        "tfidf_max_features": 30000,
        "tfidf_sublinear_tf": True,
        "svd_n_components": 200,
        "model_hidden_1": 512,
        "model_hidden_2": 256,
        "dropout": 0.3,
        "optimizer": "Adam",
        "lr": 1e-3,
        "weight_decay": 1e-4,
        "batch_size": 256,
        "patience": patience,
        "device": str(device),
        "num_classes": int(num_classes),
        "input_dim": int(in_dim),
        "random_state": 42,
        "test_size": 0.2,
    })

    for epoch in range(50):
        model.train()
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad()
            logits = model(xb)
            loss = loss_fn(logits, yb)
            loss.backward()
            opt.step()

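        # Evaluate after every epoch. Note: the test split is used here for early
        # stopping and checkpoint selection, so the reported accuracy is optimistic;
        # a separate validation split would give an unbiased estimate.
        model.eval()
        with torch.no_grad():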
            logits = model(Xte.to(device))
            preds = logits.argmax(dim=1)
            acc = (preds.cpu() == yte).float().mean().item()

        print(f"Epoch {epoch+1:02d} | test acc={acc:.4f}")
        mlflow.log_metric("test_accuracy", acc, step=epoch+1)

        if acc > best_acc + 1e-4:
            best_acc, wait = acc, 0
            torch.save(model.state_dict(), "mlp_svd_best.pt")
            mlflow.log_metric("best_accuracy", best_acc, step=epoch+1)
            mlflow.log_artifact("mlp_svd_best.pt", artifact_path="checkpoints")
        else:
            wait += 1
            if wait >= patience:
                print("Early stop")
                break

    try:
        state = torch.load("mlp_svd_best.pt", map_location=device, weights_only=True)
    except TypeError:
        state = torch.load("mlp_svd_best.pt", map_location=device)
    model.load_state_dict(state)

    model.eval()
    with torch.no_grad():
        logits = model(Xte.to(device))
        preds = logits.argmax(dim=1).cpu().numpy()

    final_acc = accuracy_score(y_test, preds)
    print(f"MLP test accuracy: {final_acc:.4f}")
    mlflow.log_metric("final_test_accuracy", final_acc)

    report_str = classification_report(y_test, preds, target_names=dataset.target_names, digits=3)
    print(report_str)
    with open("classification_report.txt", "w", encoding="utf-8") as f:
        f.write(report_str)
    mlflow.log_artifact("classification_report.txt", artifact_path="reports")

    joblib.dump(tfidf, "tfidf_20ng.joblib")
    joblib.dump(svd,   "svd_20ng_200.joblib")
    mlflow.log_artifact("tfidf_20ng.joblib",   artifact_path="preprocessing")
    mlflow.log_artifact("svd_20ng_200.joblib", artifact_path="preprocessing")

    mlflow.pytorch.log_model(model, artifact_path="model", registered_model_name=None)

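def predict_texts(texts):
    """Classify raw strings with the fitted TF-IDF, SVD, and MLP; return newsgroup names."""
    model.eval()  # keep dropout disabled during inference
    X = tfidf.transform(texts)
    X = svd.transform(X)
    Xt = torch.tensor(X, dtype=torch.float32).to(device)
    with torch.no_grad():
        logits = model(Xt)
        labels = logits.argmax(dim=1).cpu().numpy().tolist()
    return [dataset.target_names[i] for i in labels]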

predict_texts([
    "GPU driver fails on my Mac laptop",
    "Theology debate about atheism and religion",
])


Epoch 01 | test acc=0.6061
Epoch 02 | test acc=0.8003
Epoch 03 | test acc=0.8361
Epoch 04 | test acc=0.8467
Epoch 05 | test acc=0.8560
Epoch 06 | test acc=0.8578
Epoch 07 | test acc=0.8621
Epoch 08 | test acc=0.8621
Epoch 09 | test acc=0.8674
Epoch 10 | test acc=0.8679
Epoch 11 | test acc=0.8708
Epoch 12 | test acc=0.8690
Epoch 13 | test acc=0.8706
Epoch 14 | test acc=0.8740
Epoch 15 | test acc=0.8732
Epoch 16 | test acc=0.8756
Epoch 17 | test acc=0.8767
Epoch 18 | test acc=0.8782
Epoch 19 | test acc=0.8777
Epoch 20 | test acc=0.8753
Epoch 21 | test acc=0.8767
Epoch 22 | test acc=0.8759
Epoch 23 | test acc=0.8796
Epoch 24 | test acc=0.8809
Epoch 25 | test acc=0.8785
Epoch 26 | test acc=0.8804
Epoch 27 | test acc=0.8825
Epoch 28 | test acc=0.8782
Epoch 29 | test acc=0.8801
Epoch 30 | test acc=0.8809
Epoch 31 | test acc=0.8809
Epoch 32 | test acc=0.8812
Early stop
MLP test accuracy: 0.8825
                          precision    recall  f1-score   support

             alt.atheism      0.



['comp.sys.mac.hardware', 'alt.atheism']
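
The training log above stops at epoch 32 although the loop allows 50. The sketch below replays the same patience rule (an improvement must exceed the best accuracy by more than `1e-4`, and `patience` is 5) on the accuracies as printed in the log. The function name `replay_early_stopping` is made up for this illustration, and the inputs are the rounded values from the log rather than the exact floats seen during training.

```python
def replay_early_stopping(accuracies, patience=5, min_delta=1e-4):
    """Return (best_epoch, best_acc, stop_epoch) using 1-based epochs."""
    best_acc, best_epoch, wait = 0.0, 0, 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc + min_delta:
            # New best: remember it and reset the patience counter.
            best_acc, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, best_acc, epoch
    return best_epoch, best_acc, len(accuracies)

# Test accuracies copied from the epoch log above.
logged = [
    0.6061, 0.8003, 0.8361, 0.8467, 0.8560, 0.8578, 0.8621, 0.8621,
    0.8674, 0.8679, 0.8708, 0.8690, 0.8706, 0.8740, 0.8732, 0.8756,
    0.8767, 0.8782, 0.8777, 0.8753, 0.8767, 0.8759, 0.8796, 0.8809,
    0.8785, 0.8804, 0.8825, 0.8782, 0.8801, 0.8809, 0.8809, 0.8812,
]
print(replay_early_stopping(logged))  # (27, 0.8825, 32)
```

The best checkpoint comes from epoch 27, and the five following epochs fail to beat it, which is why the reloaded model reports 0.8825.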

Metrics for the best neural network checkpoint (test accuracy 0.8825):

![Neural network metrics](screenshots/nn_metrics.png)

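PyCaret selected Logistic Regression as the best model. After tuning and finalizing, it reached 0.8674 accuracy on the held-out test set. Its metrics are:

![Logistic Regression metrics, part 1](screenshots/lr_1.png)

![Logistic Regression metrics, part 2](screenshots/lr_2.png)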

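Feature importances for Logistic Regression (the features are SVD components, not individual words):

![Logistic Regression feature importances](screenshots/features_importances_lr.png)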

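Confusion matrix for Logistic Regression:

![Logistic Regression confusion matrix](screenshots/lr_cm.png)

Classification report for the neural network:

![Neural network classification report](screenshots/cr_nn.png)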