# Artificial Neural Networks (MLPs) for Tabular Data

An **artificial neural network** for tabular data is usually a **multi-layer perceptron (MLP)**: stacks of `Linear` layers with nonlinear activations (ReLU, GELU, …).

MLPs are a great learning tool because you can understand them end-to-end:

- a forward pass is just matrix multiplications + activations
- training is “just” gradient descent on a loss (via backprop)

On real-world tabular problems, **tree-based models** (XGBoost/LightGBM/CatBoost) are often the strongest baseline; MLPs tend to shine when you have **lots of data**, can use **learned embeddings** for categorical features, or need to combine tabular data with other modalities.

---

## Learning goals

By the end, you should be able to:

- explain how an MLP turns features into predictions
- implement a 2-layer MLP in **NumPy** (forward + backprop)
- train it with mini-batch SGD and visualize learning curves
- build the same model in **PyTorch** and compare results
- diagnose common tabular-MLP pitfalls (scaling, overfitting, LR)

## Notation (quick)

- Features: $X \in \mathbb{R}^{n\times d}$ (rows are samples)
- Labels (binary): $y \in \{0,1\}^n$
- First layer: $z_1 = XW_1 + b_1$, $a_1 = \mathrm{ReLU}(z_1)$
- Output logits: $\ell = a_1W_2 + b_2$ (probability via sigmoid)

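As a quick sanity check on these shapes, here is a minimal NumPy sketch of one forward pass. The sizes `n=4`, `d=3`, `h=5` and the random weights are arbitrary values chosen for illustration; they are not taken from the notebook. The only point is to see how each matrix product changes the array shape.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 4, 3, 5  # illustrative sizes (assumed, not from the notebook)

X = rng.normal(size=(n, d))    # features: one row per sample
W1 = rng.normal(size=(d, h))   # first-layer weights
b1 = np.zeros(h)               # first-layer bias, broadcast over rows
W2 = rng.normal(size=(h, 1))   # output weights
b2 = np.zeros(1)               # output bias

z1 = X @ W1 + b1                       # (n, h) pre-activations
a1 = np.maximum(0.0, z1)               # (n, h) ReLU activations
logits = a1 @ W2 + b2                  # (n, 1) one logit per sample
probs = 1.0 / (1.0 + np.exp(-logits))  # (n, 1) sigmoid probabilities

print(z1.shape, a1.shape, logits.shape, probs.shape)
```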
---

## Table of contents

1. What makes tabular data special?
2. A tiny nonlinear dataset + why scaling matters
3. Baseline: logistic regression (linear boundary)
4. From scratch: a 2-layer MLP in NumPy
5. Practical: the same model in PyTorch
6. Compare models + diagnostics
7. Practical tips for real tabular data
8. Exercises + references


In [None]:
import os
import warnings

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

SEED = 42
rng = np.random.default_rng(SEED)

torch.manual_seed(SEED)
with warnings.catch_warnings():
    # `message` is a regex matched against the start of the warning text.
    warnings.filterwarnings("ignore", message="CUDA initialization", category=UserWarning)
    has_cuda = torch.cuda.is_available()
if has_cuda:
    torch.cuda.manual_seed_all(SEED)

device = torch.device("cuda" if has_cuda else "cpu")
device


## 1) What makes tabular data special?

Tabular data usually means:

- each row is an entity (customer, transaction, patient)
- columns are heterogeneous features (numeric and categorical, often with missing values)

Compared to images/text, tabular datasets are often smaller and noisier, and the “right” inductive bias is less obvious.

For MLPs specifically, two habits matter a lot:

- **standardize numeric features** (helps optimization)
- treat **categorical features** carefully (often via embeddings)

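To make the embedding idea concrete, here is a minimal PyTorch sketch, not a recommended architecture. The class name `TabularMLP`, the cardinalities `[10, 4]`, and the embedding size are hypothetical choices for illustration. Each categorical column gets its own `nn.Embedding`, and the embedded vectors are concatenated with the (already standardized) numeric features before the MLP.

```python
import torch
import torch.nn as nn


class TabularMLP(nn.Module):
    """Hypothetical model: one embedding per categorical column plus numeric inputs."""

    def __init__(self, cardinalities, emb_dim, n_numeric, hidden_dim=32):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(num_embeddings=c, embedding_dim=emb_dim) for c in cardinalities]
        )
        in_dim = emb_dim * len(cardinalities) + n_numeric
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_cat) integer codes; x_num: (batch, n_numeric) floats
        embedded = [emb(x_cat[:, j]) for j, emb in enumerate(self.embeddings)]
        features = torch.cat(embedded + [x_num], dim=1)
        return self.net(features)  # (batch, 1) logits


torch.manual_seed(0)
model = TabularMLP(cardinalities=[10, 4], emb_dim=3, n_numeric=2)

# Fake batch of 8 rows: two categorical columns (integer codes) and two numeric columns.
x_cat = torch.stack([torch.randint(0, 10, (8,)), torch.randint(0, 4, (8,))], dim=1)
x_num = torch.randn(8, 2)

logits = model(x_cat, x_num)
print(logits.shape)
```

The integer codes must lie in `[0, cardinality)` for each column, so in practice categories are mapped to codes on the training split and unseen categories need a reserved index.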

## 2) A tiny nonlinear dataset + why scaling matters

We’ll use a simple 2D dataset so we can visualize the decision boundary.

Even though it’s 2D, it’s still “tabular”: each row is a sample, and the two columns are features.

To make the scaling issue obvious, we’ll intentionally stretch one feature.

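One way to see why the stretch hurts optimization is to look at the very first gradient of a linear logit. The sketch below uses synthetic data and a hypothetical helper `grad_at_zero`; it is a simplified linear model, not the MLP trained later. At $w = 0$ every prediction is 0.5, so the gradient of the mean cross-entropy is $X^\top(0.5 - y)/n$, and multiplying a column by 3 multiplies that weight's gradient by 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(np.float64)


def grad_at_zero(X, y):
    """Gradient of mean BCE for a linear logit X @ w, evaluated at w = 0 (so p = 0.5)."""
    return X.T @ (0.5 - y) / len(y)


X_stretched = X.copy()
X_stretched[:, 1] *= 3.0  # same stretch factor as the notebook uses

g = grad_at_zero(X, y)
g_stretched = grad_at_zero(X_stretched, y)

print("balanced :", g)
print("stretched:", g_stretched)
```

The curvature of the loss along the stretched weight grows by a factor of 9 (the square of the stretch), so a single learning rate is either too large for that direction or too small for the other. Standardizing removes the mismatch.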

In [None]:
# Dataset
n_samples = 2000
X_raw, y = make_moons(n_samples=n_samples, noise=0.25, random_state=SEED)

# Force a scale mismatch (common in real tabular datasets)
X_raw = X_raw.astype(np.float64)
X_raw[:, 1] *= 3.0
y = y.astype(np.int64)

# Train/val/test split
X_train_raw, X_temp_raw, y_train, y_temp = train_test_split(
    X_raw,
    y,
    test_size=0.30,
    random_state=SEED,
    stratify=y,
)

X_val_raw, X_test_raw, y_val, y_test = train_test_split(
    X_temp_raw,
    y_temp,
    test_size=0.50,
    random_state=SEED,
    stratify=y_temp,
)

# Standardize using train split only
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_val = scaler.transform(X_val_raw)
X_test = scaler.transform(X_test_raw)

X_train.shape, X_val.shape, X_test.shape


In [None]:
fig = px.scatter(
    x=X_raw[:, 0],
    y=X_raw[:, 1],
    color=y.astype(str),
    title="Raw features (note the scale mismatch)",
    labels={"x": "feature_1", "y": "feature_2", "color": "class"},
)
fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.show()


In [None]:
X_all = scaler.transform(X_raw)
fig = px.scatter(
    x=X_all[:, 0],
    y=X_all[:, 1],
    color=y.astype(str),
    title="Standardized features (zero mean, unit variance)",
    labels={"x": "z(feature_1)", "y": "z(feature_2)", "color": "class"},
)
fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.show()


In [None]:
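def decision_boundary_figure(X2d, y, prob_fn, title, grid_n=250, pad=0.6):
    """Plot P(class=1) over a 2D grid, the 0.5 decision contour, and the data points.

    prob_fn maps an (m, 2) array of grid points to m probabilities.
    """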
    X2d = np.asarray(X2d)
    y = np.asarray(y)

    x0_min, x0_max = X2d[:, 0].min() - pad, X2d[:, 0].max() + pad
    x1_min, x1_max = X2d[:, 1].min() - pad, X2d[:, 1].max() + pad

    xs = np.linspace(x0_min, x0_max, grid_n)
    ys = np.linspace(x1_min, x1_max, grid_n)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.c_[xx.ravel(), yy.ravel()]

    probs = prob_fn(grid).reshape(xx.shape)

    fig = go.Figure()

    # Probability surface
    fig.add_trace(
        go.Contour(
            x=xs,
            y=ys,
            z=probs,
            zmin=0.0,
            zmax=1.0,
            colorscale="RdBu",
            reversescale=True,
            opacity=0.75,
            colorbar=dict(title="P(class=1)"),
            contours=dict(start=0.0, end=1.0, size=0.1),
        )
    )

    # Decision boundary line at 0.5
    fig.add_trace(
        go.Contour(
            x=xs,
            y=ys,
            z=probs,
            contours=dict(start=0.5, end=0.5, size=0.5, coloring="lines"),
            line=dict(color="black", width=3),
            showscale=False,
        )
    )

    # Points
    fig.add_trace(
        go.Scatter(
            x=X2d[:, 0],
            y=X2d[:, 1],
            mode="markers",
            marker=dict(color=y, colorscale="Viridis", size=5, opacity=0.75),
            name="data",
        )
    )

    fig.update_layout(
        title=title,
        xaxis_title="feature_1 (standardized)",
        yaxis_title="feature_2 (standardized)",
        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
    )
    return fig


## 3) Baseline: logistic regression (linear boundary)

Logistic regression is a **linear** classifier: its decision boundary is a hyperplane, which in 2D is a single straight line.

Our dataset needs a curved boundary, so logistic regression should underfit.

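The straight line can be read directly off the fitted coefficients. The sketch below is self-contained and fits its own small model on unscaled `make_moons` data (sample size and seed are arbitrary). It solves $w_0 x_0 + w_1 x_1 + b = 0$ for $x_1$ and checks that points on that line receive probability 0.5.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X, y)

w0, w1 = clf.coef_[0]
b = clf.intercept_[0]

# The boundary is where the logit is zero: w0 * x0 + w1 * x1 + b = 0.
x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x1 = -(w0 * x0 + b) / w1
boundary_points = np.column_stack([x0, x1])

# Every point on that line should get probability 0.5 for class 1.
probs_on_line = clf.predict_proba(boundary_points)[:, 1]
print(probs_on_line)
```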

In [None]:
log_reg = LogisticRegression(max_iter=2000, random_state=SEED)
log_reg.fit(X_train, y_train)

def eval_sklearn_binary(model, X, y):
    probs = model.predict_proba(X)[:, 1]
    preds = (probs >= 0.5).astype(np.int64)
    return {
        "acc": float(accuracy_score(y, preds)),
        "logloss": float(log_loss(y, probs)),
    }

baseline_metrics = {
    "train": eval_sklearn_binary(log_reg, X_train, y_train),
    "val": eval_sklearn_binary(log_reg, X_val, y_val),
    "test": eval_sklearn_binary(log_reg, X_test, y_test),
}
baseline_metrics


In [None]:
fig = decision_boundary_figure(
    X_train,
    y_train,
    prob_fn=lambda X: log_reg.predict_proba(X)[:, 1],
    title="Logistic regression decision boundary (linear)",
)
fig.show()


## 4) From scratch: a 2-layer MLP in NumPy

A 2-layer MLP is:

1. a linear layer that mixes the input features
2. a nonlinearity (ReLU)
3. another linear layer to produce a logit

Even this small network can produce a **piecewise-linear** decision boundary that bends around the data.


### Forward pass (binary classification)

Hidden layer:

$$
z_1 = XW_1 + b_1\quad\Rightarrow\quad a_1 = \mathrm{ReLU}(z_1)
$$

Output logit:

$$
\ell = a_1 W_2 + b_2
$$

Probability:

$$
p = \sigma(\ell) = \frac{1}{1 + e^{-\ell}}
$$

Loss (binary cross-entropy, computed stably from logits):

$$
\mathcal{L} = \frac{1}{n}\sum_i \left[\log(1+e^{\ell_i}) - y_i\ell_i\right]
$$

Key gradient fact (per sample; the $1/n$ comes from averaging over the batch):

$$
\frac{\partial \mathcal{L}}{\partial \ell_i} = \frac{1}{n}\left(\sigma(\ell_i) - y_i\right)
$$

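For reference, here is one way to write out the full chain rule in the notation above. It is a sketch of the standard derivation, with $\delta$ and $\odot$ (elementwise product) introduced here for convenience. Differentiating a single term of the loss gives

$$
\frac{d}{d\ell}\left[\log(1+e^{\ell}) - y\,\ell\right] = \frac{e^{\ell}}{1+e^{\ell}} - y = \sigma(\ell) - y
$$

so for the mean loss the gradient with respect to the logits is $\delta = \frac{1}{n}\left(\sigma(\ell) - y\right) \in \mathbb{R}^{n\times 1}$. Propagating $\delta$ backward through the output layer:

$$
\frac{\partial \mathcal{L}}{\partial W_2} = a_1^\top \delta,\qquad
\frac{\partial \mathcal{L}}{\partial b_2} = \sum_i \delta_i,\qquad
\frac{\partial \mathcal{L}}{\partial a_1} = \delta\, W_2^\top
$$

Through the ReLU, which passes gradient only where $z_1 > 0$:

$$
\frac{\partial \mathcal{L}}{\partial z_1} = \frac{\partial \mathcal{L}}{\partial a_1} \odot \mathbb{1}[z_1 > 0]
$$

And through the first layer:

$$
\frac{\partial \mathcal{L}}{\partial W_1} = X^\top \frac{\partial \mathcal{L}}{\partial z_1},\qquad
\frac{\partial \mathcal{L}}{\partial b_1} = \sum_i \left(\frac{\partial \mathcal{L}}{\partial z_1}\right)_{i,:}
$$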

In [None]:
def relu(x):
    return np.maximum(0.0, x)


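def sigmoid(x):
    """Logistic function, written to avoid overflow warnings for large-magnitude inputs."""
    # sigmoid(x) = exp(-log(1 + exp(-x))); logaddexp evaluates the log term stably.
    return np.exp(-np.logaddexp(0.0, -x))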


def bce_with_logits_loss(logits, y):
    """Mean binary cross-entropy, but computed stably from logits.

    logits: (n, 1)
    y:      (n, 1) in {0,1}
    """
    logits = np.asarray(logits)
    y = np.asarray(y)
    return float((np.logaddexp(0.0, logits) - y * logits).mean())


def accuracy_from_logits(logits, y):
    probs = sigmoid(logits)
    preds = (probs >= 0.5).astype(np.int64)
    return float((preds.ravel() == y.ravel()).mean())


In [None]:
def init_mlp(in_dim, hidden_dim, rng):
    """He initialization is a good default for ReLU networks."""
    W1 = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, hidden_dim))
    b1 = np.zeros((hidden_dim,), dtype=np.float64)

    W2 = rng.normal(0.0, np.sqrt(2.0 / hidden_dim), size=(hidden_dim, 1))
    b2 = np.zeros((1,), dtype=np.float64)

    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}


def mlp_forward(X, params):
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]

    z1 = X @ W1 + b1
    a1 = relu(z1)
    logits = a1 @ W2 + b2

    cache = {"X": X, "z1": z1, "a1": a1}
    return logits, cache


def mlp_loss_and_grads(X, y, params, weight_decay=0.0):
    """Return loss and gradients for a 2-layer MLP."""
    y = y.reshape(-1, 1).astype(np.float64)
    logits, cache = mlp_forward(X, params)

    loss = bce_with_logits_loss(logits, y)
    if weight_decay:
        loss += 0.5 * weight_decay * (np.sum(params["W1"] ** 2) + np.sum(params["W2"] ** 2))

    probs = sigmoid(logits)
    n = X.shape[0]

    # dL/dlogits = (sigmoid(logits) - y) / n
    dlogits = (probs - y) / n

    dW2 = cache["a1"].T @ dlogits
    db2 = dlogits.sum(axis=0)

    da1 = dlogits @ params["W2"].T
    dz1 = da1 * (cache["z1"] > 0.0)

    dW1 = cache["X"].T @ dz1
    db1 = dz1.sum(axis=0)

    if weight_decay:
        dW1 = dW1 + weight_decay * params["W1"]
        dW2 = dW2 + weight_decay * params["W2"]

    grads = {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
    return loss, grads

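Hand-written backprop is easy to get subtly wrong, so a finite-difference gradient check is a useful habit. The sketch below is self-contained: it re-implements a compact version of the 2-layer MLP (without weight decay) rather than calling the notebook's functions, and the helper names `forward_loss`, `analytic_grads`, and `numeric_grads` are hypothetical. It compares backprop gradients against central differences on a tiny random problem.

```python
import numpy as np


def forward_loss(X, y, p):
    """Mean BCE-with-logits loss of a 2-layer ReLU MLP (no weight decay)."""
    a1 = np.maximum(0.0, X @ p["W1"] + p["b1"])
    logits = a1 @ p["W2"] + p["b2"]
    return float((np.logaddexp(0.0, logits) - y * logits).mean())


def analytic_grads(X, y, p):
    """Backprop gradients, using the same chain-rule steps as mlp_loss_and_grads."""
    z1 = X @ p["W1"] + p["b1"]
    a1 = np.maximum(0.0, z1)
    logits = a1 @ p["W2"] + p["b2"]
    dlogits = (1.0 / (1.0 + np.exp(-logits)) - y) / X.shape[0]
    dz1 = (dlogits @ p["W2"].T) * (z1 > 0.0)
    return {
        "W1": X.T @ dz1,
        "b1": dz1.sum(axis=0),
        "W2": a1.T @ dlogits,
        "b2": dlogits.sum(axis=0),
    }


def numeric_grads(X, y, p, eps=1e-6):
    """Central finite differences, perturbing one parameter entry at a time."""
    grads = {}
    for name, value in p.items():
        g = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            old = value[idx]
            value[idx] = old + eps
            loss_plus = forward_loss(X, y, p)
            value[idx] = old - eps
            loss_minus = forward_loss(X, y, p)
            value[idx] = old  # restore the original entry
            g[idx] = (loss_plus - loss_minus) / (2.0 * eps)
        grads[name] = g
    return grads


rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))
y = rng.integers(0, 2, size=(16, 1)).astype(np.float64)
params = {
    "W1": rng.normal(size=(2, 4)),
    "b1": rng.normal(size=(4,)),
    "W2": rng.normal(size=(4, 1)),
    "b2": rng.normal(size=(1,)),
}

ga = analytic_grads(X, y, params)
gn = numeric_grads(X, y, params)
max_err = max(float(np.abs(ga[k] - gn[k]).max()) for k in params)
print(f"max abs difference: {max_err:.2e}")
```

A small difference (many orders of magnitude below the gradient values themselves) suggests the backward pass matches the loss. The same comparison could be run against `mlp_loss_and_grads` with `weight_decay=0.0`.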

In [None]:
def train_numpy_mlp(
    X_train,
    y_train,
    X_val,
    y_val,
    *,
    hidden_dim=32,
    lr=0.1,
    epochs=200,
    batch_size=128,
    weight_decay=1e-4,
    seed=SEED,
):
    rng_local = np.random.default_rng(seed)
    params = init_mlp(in_dim=X_train.shape[1], hidden_dim=hidden_dim, rng=rng_local)

    history = {
        "epoch": [],
        "train_loss": [],
        "val_loss": [],
        "train_acc": [],
        "val_acc": [],
    }

    y_train_col = y_train.reshape(-1, 1)
    y_val_col = y_val.reshape(-1, 1)

    for epoch in range(1, epochs + 1):
        idx = rng_local.permutation(X_train.shape[0])

        for start in range(0, X_train.shape[0], batch_size):
            batch_idx = idx[start : start + batch_size]
            Xb = X_train[batch_idx]
            yb = y_train_col[batch_idx]

            _, grads = mlp_loss_and_grads(Xb, yb, params, weight_decay=weight_decay)

            params["W1"] -= lr * grads["W1"]
            params["b1"] -= lr * grads["b1"]
            params["W2"] -= lr * grads["W2"]
            params["b2"] -= lr * grads["b2"]

        train_logits, _ = mlp_forward(X_train, params)
        val_logits, _ = mlp_forward(X_val, params)

        train_loss = bce_with_logits_loss(train_logits, y_train_col)
        val_loss = bce_with_logits_loss(val_logits, y_val_col)

        train_acc = accuracy_from_logits(train_logits, y_train_col)
        val_acc = accuracy_from_logits(val_logits, y_val_col)

        history["epoch"].append(epoch)
        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)

    return params, history


In [None]:
params_np, hist_np = train_numpy_mlp(
    X_train,
    y_train,
    X_val,
    y_val,
    hidden_dim=32,
    lr=0.1,
    epochs=200,
    batch_size=128,
    weight_decay=1e-4,
)

def eval_numpy_mlp(params, X, y):
    logits, _ = mlp_forward(X, params)
    probs = sigmoid(logits).ravel()
    preds = (probs >= 0.5).astype(np.int64)
    return {
        "acc": float(accuracy_score(y, preds)),
        "logloss": float(log_loss(y, probs)),
    }

numpy_metrics = {
    "train": eval_numpy_mlp(params_np, X_train, y_train),
    "val": eval_numpy_mlp(params_np, X_val, y_val),
    "test": eval_numpy_mlp(params_np, X_test, y_test),
}
numpy_metrics


In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["train_loss"], name="train"))
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["val_loss"], name="val"))
fig.update_layout(
    title="NumPy MLP: loss over epochs",
    xaxis_title="epoch",
    yaxis_title="binary cross-entropy",
)
fig.show()


In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["train_acc"], name="train"))
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["val_acc"], name="val"))
fig.update_layout(
    title="NumPy MLP: accuracy over epochs",
    xaxis_title="epoch",
    yaxis_title="accuracy",
    yaxis=dict(range=[0.0, 1.0]),
)
fig.show()


In [None]:
fig = decision_boundary_figure(
    X_train,
    y_train,
    prob_fn=lambda X: sigmoid(mlp_forward(X, params_np)[0]).ravel(),
    title="NumPy MLP decision boundary (nonlinear)",
)
fig.show()


In [None]:
probs_np_test = sigmoid(mlp_forward(X_test, params_np)[0]).ravel()
preds_np_test = (probs_np_test >= 0.5).astype(np.int64)
cm = confusion_matrix(y_test, preds_np_test)

fig = px.imshow(
    cm,
    text_auto=True,
    color_continuous_scale="Blues",
    title="NumPy MLP: confusion matrix (test)",
    labels=dict(x="predicted", y="true", color="count"),
)
fig.update_xaxes(tickmode="array", tickvals=[0, 1])
fig.update_yaxes(tickmode="array", tickvals=[0, 1])
fig.show()


## 5) Practical: the same model in PyTorch

PyTorch gives you:

- automatic differentiation (no manual backprop)
- battle-tested optimizers (Adam, SGD+momentum)
- easy batching with `DataLoader`

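We’ll build the *same* architecture and train it on the same standardized data. The optimizer differs, though: the NumPy model used plain mini-batch SGD with He initialization, while this one uses Adam with PyTorch’s default `nn.Linear` initialization, so expect similar final metrics rather than identical learning curves.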

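To check that the two implementations really describe the same function, one option is to copy NumPy weights into an `nn.Sequential` and compare logits. The sketch below uses random stand-in parameters with the same shapes as the from-scratch model (an assumption, so the snippet runs on its own). The only detail to watch is that `nn.Linear` stores its weight as `(out_features, in_features)`, the transpose of the `(in_dim, hidden_dim)` layout used in the NumPy code.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
in_dim, hidden_dim = 2, 32

# Stand-in NumPy parameters with the same shapes as the from-scratch model.
W1 = rng.normal(size=(in_dim, hidden_dim))
b1 = rng.normal(size=(hidden_dim,))
W2 = rng.normal(size=(hidden_dim, 1))
b2 = rng.normal(size=(1,))

# Use float64 so the comparison is not dominated by float32 rounding.
model = nn.Sequential(
    nn.Linear(in_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, 1),
).double()

with torch.no_grad():
    # nn.Linear stores weight as (out_features, in_features), hence the transpose.
    model[0].weight.copy_(torch.from_numpy(W1.T))
    model[0].bias.copy_(torch.from_numpy(b1))
    model[2].weight.copy_(torch.from_numpy(W2.T))
    model[2].bias.copy_(torch.from_numpy(b2))

X = rng.normal(size=(5, in_dim))
logits_np = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
with torch.no_grad():
    logits_torch = model(torch.from_numpy(X)).numpy()

print(np.allclose(logits_np, logits_torch))
```

In the notebook, the trained `params_np["W1"]`, `params_np["b1"]`, and so on could be substituted for the random stand-ins.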

In [None]:
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train.reshape(-1, 1), dtype=torch.float32)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val.reshape(-1, 1), dtype=torch.float32)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test.reshape(-1, 1), dtype=torch.float32)

train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val_t, y_val_t), batch_size=256, shuffle=False)

torch_model = nn.Sequential(
    nn.Linear(X_train.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 1),
).to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(torch_model.parameters(), lr=0.03, weight_decay=1e-4)

def run_epoch(model, loader, *, train=False):
    if train:
        model.train()
    else:
        model.eval()

    total_loss = 0.0
    total_correct = 0.0
    n = 0

    for xb, yb in loader:
        xb = xb.to(device)
        yb = yb.to(device)

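        # Only build the autograd graph when we are actually going to backprop.
        with torch.set_grad_enabled(train):
            logits = model(xb)
            loss = criterion(logits, yb)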

        if train:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        with torch.no_grad():
            probs = torch.sigmoid(logits)
            preds = (probs >= 0.5).float()
            total_correct += (preds == yb).float().sum().item()

        total_loss += loss.item() * xb.shape[0]
        n += xb.shape[0]

    return total_loss / n, total_correct / n


In [None]:
torch_hist = {
    "epoch": [],
    "train_loss": [],
    "val_loss": [],
    "train_acc": [],
    "val_acc": [],
}

epochs = 120
for epoch in range(1, epochs + 1):
    train_loss, train_acc = run_epoch(torch_model, train_loader, train=True)
    val_loss, val_acc = run_epoch(torch_model, val_loader, train=False)

    torch_hist["epoch"].append(epoch)
    torch_hist["train_loss"].append(float(train_loss))
    torch_hist["val_loss"].append(float(val_loss))
    torch_hist["train_acc"].append(float(train_acc))
    torch_hist["val_acc"].append(float(val_acc))

@torch.no_grad()
def torch_predict_proba(model, X):
    model.eval()
    Xt = torch.tensor(X, dtype=torch.float32, device=device)
    probs = torch.sigmoid(model(Xt)).detach().cpu().numpy().ravel()
    return probs

probs_torch_test = torch_predict_proba(torch_model, X_test)
preds_torch_test = (probs_torch_test >= 0.5).astype(np.int64)

def eval_torch_binary(model, X, y):
    """Accuracy and log loss of a PyTorch binary classifier on NumPy inputs."""
    probs = torch_predict_proba(model, X)
    preds = (probs >= 0.5).astype(np.int64)
    return {
        "acc": float(accuracy_score(y, preds)),
        "logloss": float(log_loss(y, probs)),
    }

torch_metrics = {
    "train": eval_torch_binary(torch_model, X_train, y_train),
    "val": eval_torch_binary(torch_model, X_val, y_val),
    "test": eval_torch_binary(torch_model, X_test, y_test),
}
torch_metrics


In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["train_loss"], name="train"))
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["val_loss"], name="val"))
fig.update_layout(
    title="PyTorch MLP: loss over epochs",
    xaxis_title="epoch",
    yaxis_title="binary cross-entropy",
)
fig.show()


In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["train_acc"], name="train"))
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["val_acc"], name="val"))
fig.update_layout(
    title="PyTorch MLP: accuracy over epochs",
    xaxis_title="epoch",
    yaxis_title="accuracy",
    yaxis=dict(range=[0.0, 1.0]),
)
fig.show()


In [None]:
fig = decision_boundary_figure(
    X_train,
    y_train,
    prob_fn=lambda X: torch_predict_proba(torch_model, X),
    title="PyTorch MLP decision boundary (nonlinear)",
)
fig.show()


In [None]:
cm = confusion_matrix(y_test, preds_torch_test)
fig = px.imshow(
    cm,
    text_auto=True,
    color_continuous_scale="Blues",
    title="PyTorch MLP: confusion matrix (test)",
    labels=dict(x="predicted", y="true", color="count"),
)
fig.update_xaxes(tickmode="array", tickvals=[0, 1])
fig.update_yaxes(tickmode="array", tickvals=[0, 1])
fig.show()


## 6) Compare models + diagnostics

On this toy dataset, both MLPs should learn a nonlinear boundary and outperform logistic regression.

We’ll compare test accuracy and log loss (probabilistic quality).

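Accuracy and log loss can disagree, which is why both are reported. The toy sketch below uses made-up probabilities for two hypothetical models: they make identical thresholded predictions, so accuracy is the same, but log loss rewards the one that is confidently right.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([0, 0, 1, 1])

# Two hypothetical models with identical predictions after thresholding at 0.5.
p_confident = np.array([0.05, 0.10, 0.90, 0.95])
p_hesitant = np.array([0.45, 0.40, 0.60, 0.55])

results = {}
for name, p in [("confident", p_confident), ("hesitant", p_hesitant)]:
    acc = accuracy_score(y_true, (p >= 0.5).astype(int))
    ll = log_loss(y_true, p)
    results[name] = (acc, ll)
    print(f"{name}: acc={acc:.2f} logloss={ll:.3f}")
```

Log loss also punishes confident mistakes heavily, so a model can gain accuracy while its log loss gets worse if it becomes overconfident.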

In [None]:
models = ["log_reg", "numpy_mlp", "torch_mlp"]
test_acc = [
    baseline_metrics["test"]["acc"],
    numpy_metrics["test"]["acc"],
    torch_metrics["test"]["acc"],
]
test_logloss = [
    baseline_metrics["test"]["logloss"],
    numpy_metrics["test"]["logloss"],
    torch_metrics["test"]["logloss"],
]

fig = go.Figure(go.Bar(x=models, y=test_acc))
fig.update_layout(title="Test accuracy", xaxis_title="model", yaxis_title="accuracy", yaxis=dict(range=[0.0, 1.0]))
fig.show()

fig = go.Figure(go.Bar(x=models, y=test_logloss))
fig.update_layout(title="Test log loss (lower is better)", xaxis_title="model", yaxis_title="log loss")
fig.show()


## 7) Practical tips for real tabular data

- **Standardize numeric features** (and keep the scaler fitted on train only).
- **Categorical features**: try learned embeddings (`nn.Embedding`) instead of one-hot for high-cardinality columns.
- **Missing values**: add missingness indicators; don’t just impute and hope.
- **Overfitting** is common: use weight decay, dropout, early stopping, and a strong validation protocol.
- **Learning rate** often matters more than small architecture changes. When in doubt, sweep `lr` on a log scale and start with Adam.
- **Baselines first**: compare against logistic regression and strong tree-based models.
- **Calibration**: if predicted probabilities will drive decisions, track log loss and check calibration (for example with a reliability curve), not just accuracy.

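As one concrete instance of the missing-values tip, scikit-learn's `SimpleImputer` can append indicator columns alongside the imputed values via `add_indicator=True`. The sketch below uses a made-up 4×2 table; in a real pipeline the imputer would be fitted on the training split only, like the scaler.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric table with missing entries (values are made up for illustration).
X = np.array(
    [
        [1.0, 10.0],
        [np.nan, 12.0],
        [3.0, np.nan],
        [5.0, 14.0],
    ]
)

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Columns: imputed feature 0, imputed feature 1, missing flag 0, missing flag 1.
print(X_out)
```

The flag columns let the network learn that "was missing" is itself informative, instead of treating the imputed median as an ordinary observation.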

## 8) Exercises

1. Add another hidden layer (2 hidden layers total). Does it help? Does it overfit?
2. Replace ReLU with `tanh`. What changes in training speed / final accuracy?
3. Implement **dropout** in the NumPy model.
4. Turn this into a **multiclass** problem (softmax + cross-entropy).
5. Try a real tabular dataset (e.g., UCI) and compare with a tree baseline.


## References

- PyTorch: https://pytorch.org/docs/stable/index.html
- Goodfellow, Bengio, and Courville, *Deep Learning* (MLP and backprop chapters)
- scikit-learn MLPClassifier docs (for a practical baseline)
