# Jaccard Score (Jaccard Similarity / Intersection-over-Union)

The **Jaccard score** measures similarity between two sets:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

In ML, you'll often see the same idea as **Intersection-over-Union (IoU)** for binary masks.

## Goals

- Build intuition for **intersection vs union** (and why true negatives don't matter).
- Derive the classification form: $\displaystyle \frac{TP}{TP+FP+FN}$.
- Implement Jaccard from scratch in NumPy (binary, multiclass, multilabel).
- Use Plotly to visualize how thresholds and errors change the score.
- Optimize a tiny logistic regression model with a differentiable **soft Jaccard** loss.

## Quick import (scikit-learn)

```python
from sklearn.metrics import jaccard_score
```


In [None]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

versions = {
    'numpy': np.__version__,
    'plotly': __import__('plotly').__version__,
}
try:
    import sklearn

    versions['sklearn'] = sklearn.__version__
except Exception:
    versions['sklearn'] = None

versions


## Prerequisites & notation

- Binary labels: $y \in \{0,1\}^n$
- Predicted labels: $\hat{y} \in \{0,1\}^n$
- Predicted probabilities: $p \in [0,1]^n$
- Confusion counts: $TP$, $FP$, $FN$, $TN$

We'll interpret the "positive set" as the indices where a vector equals 1:
$A = \{ i : y_i = 1 \}$ and $B = \{ i : \hat{y}_i = 1 \}$.
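
As a minimal sketch of this notation, the positive sets can be built from 0/1 vectors with `np.flatnonzero`. The arrays and the helper name `positive_set` are made up for illustration and are not part of any library. The set-based ratio and the mask-based ratio should agree:

```python
import numpy as np


def positive_set(labels):
    """Return the set of indices where a 0/1 vector equals 1 (hypothetical helper)."""
    return set(np.flatnonzero(np.asarray(labels)).tolist())


# Illustrative vectors, chosen by hand
y_example = np.array([1, 0, 1, 1, 0, 0])
y_hat_example = np.array([1, 1, 0, 1, 0, 0])

A_idx = positive_set(y_example)      # indices of actual positives
B_idx = positive_set(y_hat_example)  # indices of predicted positives

# Jaccard on Python sets
j_sets = len(A_idx & B_idx) / len(A_idx | B_idx)

# Same quantity computed directly on boolean masks
a_mask = y_example.astype(bool)
b_mask = y_hat_example.astype(bool)
j_masks = (a_mask & b_mask).sum() / (a_mask | b_mask).sum()

print(A_idx, B_idx, j_sets, j_masks)
```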


## 1) Set intuition

Think of two sets:

- $A$: the "true" items
- $B$: the "predicted" items

The Jaccard score is:

$$
J(A,B) = \frac{|A \cap B|}{|A \cup B|}
$$

- Numerator: what both agree on (**overlap**)
- Denominator: everything that appears in either (**coverage**)

So Jaccard is high only when the overlap is large *and* the union isn't bloated by extras.


In [None]:
A = {1, 2, 3, 5, 8}
B = {2, 3, 4, 8, 9}

intersection = A & B
union = A | B

jaccard = len(intersection) / len(union)

A, B, intersection, union, jaccard


In [None]:
universe = np.arange(0, 10)

A_mask = np.isin(universe, sorted(A))
B_mask = np.isin(universe, sorted(B))

# 0: neither, 1: A only, 2: B only, 3: both
cat = A_mask.astype(int) + 2 * B_mask.astype(int)

colorscale = [
    [0.00, '#ffffff'],
    [0.249999, '#ffffff'],  # neither
    [0.25, '#ff7f0e'],
    [0.499999, '#ff7f0e'],  # A only
    [0.50, '#1f77b4'],
    [0.749999, '#1f77b4'],  # B only
    [0.75, '#2ca02c'],
    [1.00, '#2ca02c'],  # both (intersection)
]

fig = go.Figure(
    data=go.Heatmap(
        z=cat[np.newaxis, :],
        x=universe,
        y=['elements'],
        colorscale=colorscale,
        zmin=-0.5,
        zmax=3.5,
        colorbar=dict(
            title='category',
            tickmode='array',
            tickvals=[0, 1, 2, 3],
            ticktext=['neither', 'A only', 'B only', 'A ∩ B'],
        ),
        hovertemplate='element=%{x}<br>category=%{z}<extra></extra>',
    )
)

fig.update_layout(
    title=f'Jaccard = |A ∩ B| / |A ∪ B| = {len(intersection)}/{len(union)} = {jaccard:.3f}',
    height=220,
    margin=dict(l=20, r=20, t=60, b=20),
)

fig.show()


## 2) Binary classification view (TP / FP / FN)

For binary classification, focus on the **positive class**:

- $A = \{ i : y_i = 1 \}$ (actual positives)
- $B = \{ i : \hat{y}_i = 1 \}$ (predicted positives set)

Then:

- $|A \cap B| = TP$
- $|A \cup B| = TP + FP + FN$

So the Jaccard score becomes:

$$
J = \frac{TP}{TP + FP + FN}
$$

Notice what's missing: **true negatives** $TN$.
If your dataset has tons of negatives, accuracy can look great while Jaccard stays low.
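
To make this concrete, here is a small illustrative case with made-up numbers: 10 positives among 1,000 samples, and a "model" that always predicts the negative class. Accuracy looks excellent, yet no positive is recovered, so Jaccard is 0.

```python
import numpy as np

# Made-up imbalanced data: 10 positives, 990 negatives
y_imb = np.array([1] * 10 + [0] * 990)
y_all_neg = np.zeros_like(y_imb)  # never predicts the positive class

tp_imb = int(((y_imb == 1) & (y_all_neg == 1)).sum())
fp_imb = int(((y_imb == 0) & (y_all_neg == 1)).sum())
fn_imb = int(((y_imb == 1) & (y_all_neg == 0)).sum())

acc_imb = float((y_imb == y_all_neg).mean())   # counts the 990 true negatives
jac_imb = tp_imb / (tp_imb + fp_imb + fn_imb)  # ignores them

print(f'accuracy = {acc_imb:.2f}, Jaccard = {jac_imb:.2f}')
```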


In [None]:
def confusion_counts_binary(y_true, y_pred):
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = np.logical_and(y_true, y_pred).sum()
    fp = np.logical_and(~y_true, y_pred).sum()
    fn = np.logical_and(y_true, ~y_pred).sum()
    tn = np.logical_and(~y_true, ~y_pred).sum()

    return int(tp), int(fp), int(fn), int(tn)


def jaccard_score_binary(y_true, y_pred, *, zero_division=0.0):
    tp, fp, fn, _ = confusion_counts_binary(y_true, y_pred)
    denom = tp + fp + fn
    if denom == 0:
        return float(zero_division)
    return tp / denom


def accuracy_score_binary(y_true, y_pred):
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    return (y_true == y_pred).mean()


# quick sanity check
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])

tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred)
(tp, fp, fn, tn), jaccard_score_binary(y_true, y_pred), accuracy_score_binary(y_true, y_pred)


### 2.1 IoU for segmentation (same formula)

If $y$ and $\hat{y}$ are **binary masks** (pixels in/out of an object), then:

- intersection = pixels correctly predicted as object
- union = pixels that are object in either mask

So IoU = Jaccard on the set of "object pixels".


In [None]:
h, w = 40, 40
yy, xx = np.mgrid[0:h, 0:w]


def circle_mask(*, cx, cy, r):
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= r**2


true_mask = circle_mask(cx=14, cy=20, r=10)
pred_mask = circle_mask(cx=18, cy=20, r=10)

# 0: background, 1: true-only (FN), 2: pred-only (FP), 3: overlap (TP)
cat = true_mask.astype(int) + 2 * pred_mask.astype(int)
iou = jaccard_score_binary(true_mask.ravel(), pred_mask.ravel(), zero_division=1.0)

colorscale = [
    [0.00, '#ffffff'],
    [0.249999, '#ffffff'],
    [0.25, '#d62728'],
    [0.499999, '#d62728'],  # true-only (red)
    [0.50, '#1f77b4'],
    [0.749999, '#1f77b4'],  # pred-only (blue)
    [0.75, '#2ca02c'],
    [1.00, '#2ca02c'],  # overlap (green)
]

fig = go.Figure(
    data=go.Heatmap(
        z=cat,
        colorscale=colorscale,
        zmin=-0.5,
        zmax=3.5,
        showscale=True,
        colorbar=dict(
            title='pixel',
            tickmode='array',
            tickvals=[0, 1, 2, 3],
            ticktext=['background', 'true only (FN)', 'pred only (FP)', 'overlap (TP)'],
        ),
        hovertemplate='x=%{x}<br>y=%{y}<br>category=%{z}<extra></extra>',
    )
)

fig.update_layout(
    title=f'IoU (Jaccard) on a toy mask: {iou:.3f}',
    width=520,
    height=520,
    yaxis=dict(scaleanchor='x', autorange='reversed'),
    margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()


### 2.2 Why true negatives don't matter

Hold $TP$, $FP$, $FN$ fixed and add more and more true negatives.

- Accuracy goes up (because it counts $TN$).
- Jaccard stays exactly the same (because it ignores $TN$).


In [None]:
tp, fp, fn = 10, 5, 5
y_true_core = np.array([1] * tp + [1] * fn + [0] * fp, dtype=int)
y_pred_core = np.array([1] * tp + [0] * fn + [1] * fp, dtype=int)

tn_sizes = np.arange(0, 2001, 100)
accs = []
jaccs = []

for tn in tn_sizes:
    y_true_full = np.concatenate([y_true_core, np.zeros(tn, dtype=int)])
    y_pred_full = np.concatenate([y_pred_core, np.zeros(tn, dtype=int)])

    accs.append(accuracy_score_binary(y_true_full, y_pred_full))
    jaccs.append(jaccard_score_binary(y_true_full, y_pred_full))

fig = go.Figure()
fig.add_trace(go.Scatter(x=tn_sizes, y=accs, mode='lines+markers', name='accuracy'))
fig.add_trace(go.Scatter(x=tn_sizes, y=jaccs, mode='lines+markers', name='jaccard'))

fig.update_layout(
    title=f'Add more TN with TP={tp}, FP={fp}, FN={fn}: Jaccard stays constant',
    xaxis_title='number of added true negatives (TN)',
    yaxis_title='score',
    yaxis=dict(range=[0, 1]),
)
fig.show()


## 3) How FP and FN move Jaccard

For fixed $TP$, Jaccard shrinks as you add either false positives or false negatives:

$$
J = \frac{TP}{TP + FP + FN}
$$


In [None]:
TP = 10
FP_vals = np.arange(0, 31)
FN_vals = np.arange(0, 31)

Z = np.zeros((len(FN_vals), len(FP_vals)), dtype=float)
for i, fn in enumerate(FN_vals):
    for j, fp in enumerate(FP_vals):
        Z[i, j] = TP / (TP + fp + fn)

fig = px.imshow(
    Z,
    x=FP_vals,
    y=FN_vals,
    origin='lower',
    aspect='auto',
    labels={'x': 'FP', 'y': 'FN', 'color': 'Jaccard'},
    title=f'Jaccard for fixed TP={TP}',
)
fig.show()


## 4) Relationship to precision/recall/F1

- Precision: $\displaystyle P = \frac{TP}{TP+FP}$
- Recall: $\displaystyle R = \frac{TP}{TP+FN}$
- F1: $\displaystyle F_1 = \frac{2TP}{2TP+FP+FN}$

Jaccard uses the same ingredients but with a different denominator:

$$
J = \frac{TP}{TP+FP+FN}
$$

A useful identity links Jaccard and F1:

$$
J = \frac{F_1}{2 - F_1}
\quad\Longleftrightarrow\quad
F_1 = \frac{2J}{1 + J}
$$
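
The identity follows from the count forms: writing $S = FP + FN$, we get $F_1/(2 - F_1) = 2TP/(2TP + 2S) = TP/(TP + S) = J$. The sketch below checks both directions numerically on a few arbitrary $(TP, FP, FN)$ triples; the triples are made up for illustration.

```python
import numpy as np

# Arbitrary (TP, FP, FN) triples with TP > 0
count_triples = [(10, 5, 5), (3, 0, 7), (50, 20, 1), (1, 1, 1)]

identity_errors = []
for tp_c, fp_c, fn_c in count_triples:
    j_c = tp_c / (tp_c + fp_c + fn_c)
    f1_c = 2 * tp_c / (2 * tp_c + fp_c + fn_c)

    # Both directions of the identity should agree up to floating-point error
    identity_errors.append(abs(j_c - f1_c / (2 - f1_c)))
    identity_errors.append(abs(f1_c - 2 * j_c / (1 + j_c)))

print('max deviation:', max(identity_errors))
```

Because the mapping is strictly increasing on $[0, 1]$, ranking predictions by F1 or by Jaccard gives the same order for a single binary problem. This does not carry over automatically to averaged scores (`macro`, `samples`).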


In [None]:
f1 = np.linspace(0, 1, 501)
j_from_f1 = f1 / (2 - f1)

fig = go.Figure()
fig.add_trace(go.Scatter(x=f1, y=j_from_f1, mode='lines', name='J = F1/(2-F1)'))
fig.update_layout(
    title='Mapping between F1 and Jaccard',
    xaxis_title='F1',
    yaxis_title='Jaccard',
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
)
fig.show()


## 5) Multilabel and multiclass

### Multilabel
Each sample can have **multiple** positive labels.
If $y, \hat{y} \in \{0,1\}^{n\times L}$, you can compute Jaccard:

- per-sample and average (**`samples`**)
- per-label and average (**`macro`**)
- globally over all entries (**`micro`**)

### Multiclass
With mutually exclusive classes, a common definition is to compute a **one-vs-rest** score per class and then average.
This matches the way `sklearn.metrics.jaccard_score` generalizes Jaccard when `average != 'binary'`.


In [None]:
def _safe_divide(num, den, *, zero_division=0.0):
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)

    out = np.full_like(num, float(zero_division), dtype=float)
    mask = den != 0
    out[mask] = num[mask] / den[mask]
    return out


def jaccard_score_multilabel(y_true, y_pred, *, average='samples', zero_division=0.0):
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    if y_true.ndim != 2:
        raise ValueError('Expected y_true with shape (n_samples, n_labels)')
    if y_pred.shape != y_true.shape:
        raise ValueError('y_pred must have the same shape as y_true')

    if average == 'micro':
        inter = np.logical_and(y_true, y_pred).sum()
        uni = np.logical_or(y_true, y_pred).sum()
        return float(_safe_divide(inter, uni, zero_division=zero_division))

    if average in (None, 'none'):
        inter_l = np.logical_and(y_true, y_pred).sum(axis=0)
        uni_l = np.logical_or(y_true, y_pred).sum(axis=0)
        return _safe_divide(inter_l, uni_l, zero_division=zero_division)

    if average in ('macro', 'weighted'):
        inter_l = np.logical_and(y_true, y_pred).sum(axis=0)
        uni_l = np.logical_or(y_true, y_pred).sum(axis=0)
        label_scores = _safe_divide(inter_l, uni_l, zero_division=zero_division)
        if average == 'macro':
            return float(label_scores.mean())

        supports = y_true.sum(axis=0)
        if supports.sum() == 0:
            return float(zero_division)
        return float(np.average(label_scores, weights=supports))

    if average == 'samples':
        inter_s = np.logical_and(y_true, y_pred).sum(axis=1)
        uni_s = np.logical_or(y_true, y_pred).sum(axis=1)
        sample_scores = _safe_divide(inter_s, uni_s, zero_division=zero_division)
        return float(sample_scores.mean())

    raise ValueError("average must be one of {'samples','micro','macro','weighted',None}")


def jaccard_score_multiclass(y_true, y_pred, *, average='macro', labels=None, zero_division=0.0):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.ndim != 1 or y_pred.ndim != 1:
        raise ValueError('Expected 1D label arrays')
    if y_pred.shape != y_true.shape:
        raise ValueError('y_pred must have the same shape as y_true')
    if len(y_true) == 0:
        return float(zero_division)

    if labels is None:
        labels = np.unique(np.concatenate([y_true, y_pred]))

    scores = []
    supports = []
    for lab in labels:
        t = y_true == lab
        p = y_pred == lab
        tp = np.logical_and(t, p).sum()
        fp = np.logical_and(~t, p).sum()
        fn = np.logical_and(t, ~p).sum()
        denom = tp + fp + fn

        score = float(zero_division) if denom == 0 else float(tp / denom)
        scores.append(score)
        supports.append(t.sum())

    scores = np.asarray(scores, dtype=float)
    supports = np.asarray(supports, dtype=float)

    if average == 'macro':
        return float(scores.mean())
    if average == 'weighted':
        if supports.sum() == 0:
            return float(zero_division)
        return float(np.average(scores, weights=supports))
    if average == 'micro':
        correct = (y_true == y_pred).sum()
        union = 2 * len(y_true) - correct
        return float(zero_division) if union == 0 else float(correct / union)
    if average in (None, 'none'):
        return scores

    raise ValueError("average must be one of {'micro','macro','weighted',None}")


# examples
y_true_ml = np.array(
    [
        [1, 0, 1],
        [0, 1, 0],
        [1, 1, 0],
        [0, 0, 0],
    ],
    dtype=int,
)
y_pred_ml = np.array(
    [
        [1, 1, 1],
        [0, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
    ],
    dtype=int,
)

scores = {
    'samples': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='samples', zero_division=1.0),
    'micro': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='micro', zero_division=1.0),
    'macro': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='macro', zero_division=1.0),
    'weighted': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='weighted', zero_division=1.0),
    'per-label': jaccard_score_multilabel(y_true_ml, y_pred_ml, average=None, zero_division=1.0),
}

y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_pred_mc = np.array([0, 2, 2, 1, 1, 0])

scores, {
    'multiclass_macro': jaccard_score_multiclass(y_true_mc, y_pred_mc, average='macro'),
    'multiclass_micro': jaccard_score_multiclass(y_true_mc, y_pred_mc, average='micro'),
    'multiclass_per_class': jaccard_score_multiclass(y_true_mc, y_pred_mc, average=None),
}


In [None]:
try:
    from sklearn.metrics import jaccard_score as sk_jaccard_score

    print('Binary (sklearn):', sk_jaccard_score(y_true, y_pred))
    print('Binary (ours):   ', jaccard_score_binary(y_true, y_pred))

    print('Multilabel macro (sklearn):', sk_jaccard_score(y_true_ml, y_pred_ml, average='macro', zero_division=1.0))
    print('Multilabel macro (ours):   ', jaccard_score_multilabel(y_true_ml, y_pred_ml, average='macro', zero_division=1.0))

    print('Multiclass macro (sklearn):', sk_jaccard_score(y_true_mc, y_pred_mc, average='macro'))
    print('Multiclass macro (ours):   ', jaccard_score_multiclass(y_true_mc, y_pred_mc, average='macro'))
except Exception as e:
    # Covers a missing install as well as version differences in the sklearn API
    print('sklearn comparison skipped:', e)


## 6) Thresholding probabilities

Jaccard is defined on *sets* / *hard labels*.
If your model outputs probabilities $p$, you typically choose a threshold $t$ and set:

$$
\hat{y}_i = \mathbf{1}[p_i \ge t]
$$

Different thresholds trade off $FP$ vs $FN$, so they can change Jaccard a lot.


In [None]:
n = 400
y_true_thr = rng.binomial(1, 0.15, size=n)

# simulate a "model score": positives tend to have higher logits
logits = rng.normal(loc=0.0, scale=1.0, size=n) + 1.5 * y_true_thr
p_thr = 1 / (1 + np.exp(-logits))

thresholds = np.linspace(0.0, 1.0, 201)
j_scores = np.array(
    [jaccard_score_binary(y_true_thr, (p_thr >= t).astype(int), zero_division=0.0) for t in thresholds]
)
best_idx = int(j_scores.argmax())
best_t = float(thresholds[best_idx])
best_j = float(j_scores[best_idx])

fig = px.line(
    x=thresholds,
    y=j_scores,
    labels={'x': 'threshold', 'y': 'Jaccard'},
    title=f'Jaccard vs threshold (best t≈{best_t:.2f}, J≈{best_j:.3f})',
)
fig.add_vline(x=best_t, line_dash='dash', line_color='black')
fig.update_layout(yaxis=dict(range=[0, 1]))
fig.show()
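

A fixed grid of thresholds can step over the best cutoff, because the hard predictions only change when $t$ crosses one of the predicted scores. As a hedged alternative, the sketch below sorts the scores once and uses cumulative sums to get Jaccard at every distinct score value. The helper name `jaccard_at_all_cutoffs` and the demo arrays are hypothetical, and the function assumes non-empty inputs.

```python
import numpy as np


def jaccard_at_all_cutoffs(y_true, scores):
    """Jaccard for every distinct cutoff t, predicting 1 where scores >= t.

    Hypothetical helper; assumes non-empty 1D inputs of equal length.
    """
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)

    order = np.argsort(-scores, kind='stable')  # descending scores
    s_sorted = scores[order]
    y_sorted = y_true[order]

    tp_cum = np.cumsum(y_sorted)                # TP if the top-k items are predicted positive
    k = np.arange(1, len(scores) + 1)           # predicted positives, i.e. TP + FP
    n_pos = y_true.sum()                        # actual positives, i.e. TP + FN

    # Tied scores enter together, so keep only the last index of each run
    last_of_run = np.append(s_sorted[1:] != s_sorted[:-1], True)

    union = k + n_pos - tp_cum                  # TP + FP + FN, at least 1 because k >= 1
    return s_sorted[last_of_run], (tp_cum / union)[last_of_run]


# Made-up demo data with one tie at 0.8
y_demo = np.array([1, 0, 1, 1, 0])
s_demo = np.array([0.9, 0.8, 0.8, 0.4, 0.1])

cutoffs_demo, j_demo = jaccard_at_all_cutoffs(y_demo, s_demo)
best_demo = int(j_demo.argmax())
print(cutoffs_demo, j_demo, cutoffs_demo[best_demo])
```

The sort makes this $O(n \log n)$ overall, whereas re-thresholding for each of $m$ grid points costs $O(nm)$.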


## 7) Using Jaccard in optimization: a soft Jaccard loss

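The "hard" Jaccard score uses thresholded predictions, so it is piecewise constant in the model parameters: its gradient is zero almost everywhere and undefined at the jumps. In practice it is **not differentiable** in any way that gradient-based training can use.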

A common trick (especially in segmentation) is to replace hard predictions with probabilities $p$:

- Soft intersection: $I = \sum_i y_i p_i$
- Soft union: $U = \sum_i y_i + \sum_i p_i - \sum_i y_i p_i$

Soft Jaccard:

$$
J_{soft}(y,p) = \frac{I + \varepsilon}{U + \varepsilon}
$$

Soft Jaccard loss:

$$
\mathcal{L}_{IoU}(y,p) = 1 - J_{soft}(y,p)
$$

Gradient w.r.t. a probability $p_i$:

$$
\frac{\partial J_{soft}}{\partial p_i}
= \frac{y_i (U+\varepsilon) - (I+\varepsilon)(1-y_i)}{(U+\varepsilon)^2}
$$

Then apply the chain rule for logistic regression, where $p_i = \sigma(x_i^\top w)$ and $\partial p_i / \partial w = p_i (1 - p_i)\, x_i$.
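
Before relying on a hand-derived gradient, it is worth comparing it with central finite differences. The sketch below does that for the formula above on random made-up data. The function names `soft_jaccard` and `soft_jaccard_grad`, the step size, and the separate random generator are choices made for this illustration.

```python
import numpy as np


def soft_jaccard(y, p, eps=1e-12):
    """Soft Jaccard J_soft = (I + eps) / (U + eps)."""
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    return (I + eps) / (U + eps)


def soft_jaccard_grad(y, p, eps=1e-12):
    """Analytic dJ_soft/dp_i, using dI/dp_i = y_i and dU/dp_i = 1 - y_i."""
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    return (y * (U + eps) - (I + eps) * (1 - y)) / (U + eps) ** 2


# Separate generator so the notebook's main `rng` stream is left untouched
check_rng = np.random.default_rng(0)
y_chk = check_rng.integers(0, 2, size=8).astype(float)
p_chk = check_rng.uniform(0.05, 0.95, size=8)

fd_step = 1e-6
numeric_grad = np.zeros_like(p_chk)
for i in range(len(p_chk)):
    bump = np.zeros_like(p_chk)
    bump[i] = fd_step
    # Central difference in coordinate i
    numeric_grad[i] = (
        soft_jaccard(y_chk, p_chk + bump) - soft_jaccard(y_chk, p_chk - bump)
    ) / (2 * fd_step)

analytic_grad = soft_jaccard_grad(y_chk, p_chk)
max_abs_err = float(np.max(np.abs(numeric_grad - analytic_grad)))
print('max |numeric - analytic| =', max_abs_err)
```

A deviation far above the finite-difference noise level would point to a sign or indexing mistake in the analytic expression.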


In [None]:
# Synthetic 2D binary classification (imbalanced)
n0, n1 = 900, 100
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 1.0], size=(n0, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=[1.0, 1.0], size=(n1, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n0 + [1] * n1, dtype=int)

# shuffle
perm = rng.permutation(len(y))
X = X[perm]
y = y[perm]

# train/test split (pure NumPy)
test_size = 0.30
n_test = int(len(y) * test_size)

X_test = X[:n_test]
y_test = y[:n_test]
X_train = X[n_test:]
y_train = y[n_test:]

# standardize (fit on train)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma

# add bias column
Xb_train = np.c_[np.ones(len(y_train)), X_train_s]
Xb_test = np.c_[np.ones(len(y_test)), X_test_s]

fig = px.scatter(
    x=X_train_s[:, 0],
    y=X_train_s[:, 1],
    color=y_train.astype(str),
    title='Training data (standardized)',
    labels={'color': 'class'},
)
fig.show()

Xb_train.shape, Xb_test.shape, float(y_train.mean()), float(y_test.mean())


In [None]:
def sigmoid(z):
    z = np.clip(z, -60, 60)
    return 1 / (1 + np.exp(-z))


def log_loss(y, p, *, eps=1e-12):
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def soft_jaccard_loss(y, p, *, eps=1e-12):
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    return float(1.0 - (I + eps) / (U + eps))


def soft_jaccard_grad_p(y, p, *, eps=1e-12):
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    Ieps = I + eps
    Ueps = U + eps
    dJdp = (y * Ueps - Ieps * (1 - y)) / (Ueps**2)
    return -dJdp


def fit_logreg_gd(Xb, y, *, loss='log', lr=0.1, n_iter=400, l2=0.0, record_every=5):
    y = np.asarray(y, dtype=float)
    w = np.zeros(Xb.shape[1], dtype=float)

    history = {'iter': [], 'loss': [], 'jaccard@0.5': []}

    for t in range(n_iter):
        z = Xb @ w
        p = sigmoid(z)

        if loss == 'log':
            L = log_loss(y, p) + 0.5 * l2 * np.sum(w[1:] ** 2)
            grad = Xb.T @ (p - y) / len(y)
            grad[1:] += l2 * w[1:]
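        elif loss == 'soft_jaccard':
            L = soft_jaccard_loss(y, p) + 0.5 * l2 * np.sum(w[1:] ** 2)
            dLdp = soft_jaccard_grad_p(y, p)
            dLdz = dLdp * p * (1 - p)
            # The soft Jaccard loss is a ratio of sums, not a per-sample mean,
            # so its gradient is not divided by len(y).
            grad = Xb.T @ dLdz
            grad[1:] += l2 * w[1:]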
        else:
            raise ValueError("loss must be 'log' or 'soft_jaccard'")

        w -= lr * grad

        if (t % record_every) == 0 or t == (n_iter - 1):
            y_hat = (p >= 0.5).astype(int)
            j = jaccard_score_binary(y.astype(int), y_hat, zero_division=0.0)
            history['iter'].append(t)
            history['loss'].append(float(L))
            history['jaccard@0.5'].append(float(j))

    return w, history


def best_threshold_for_jaccard(y_true, p, thresholds):
    scores = np.array(
        [jaccard_score_binary(y_true, (p >= t).astype(int), zero_division=0.0) for t in thresholds], dtype=float
    )
    best_idx = int(scores.argmax())
    return float(thresholds[best_idx]), float(scores[best_idx]), scores


In [None]:
# Train two models:
# - standard logistic regression (log-loss)
# - logistic regression with a soft Jaccard loss

w_log, hist_log = fit_logreg_gd(
    Xb_train,
    y_train,
    loss='log',
    lr=0.2,
    n_iter=400,
    l2=0.01,
    record_every=5,
)
w_iou, hist_iou = fit_logreg_gd(
    Xb_train,
    y_train,
    loss='soft_jaccard',
    lr=1.0,
    n_iter=400,
    l2=0.01,
    record_every=5,
)

# Evaluate on test
p_test_log = sigmoid(Xb_test @ w_log)
p_test_iou = sigmoid(Xb_test @ w_iou)

j05_log = jaccard_score_binary(y_test, (p_test_log >= 0.5).astype(int), zero_division=0.0)
j05_iou = jaccard_score_binary(y_test, (p_test_iou >= 0.5).astype(int), zero_division=0.0)

ths = np.linspace(0.01, 0.99, 99)
best_t_log, best_j_log, curve_log = best_threshold_for_jaccard(y_test, p_test_log, ths)
best_t_iou, best_j_iou, curve_iou = best_threshold_for_jaccard(y_test, p_test_iou, ths)

summary = {
    'log_loss': {'J@0.5': j05_log, 'best_t': best_t_log, 'best_J': best_j_log},
    'soft_jaccard': {'J@0.5': j05_iou, 'best_t': best_t_iou, 'best_J': best_j_iou},
}
summary


In [None]:
# Training curves (loss)
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_log['iter'], y=hist_log['loss'], mode='lines', name='log-loss (train)'))
fig.add_trace(go.Scatter(x=hist_iou['iter'], y=hist_iou['loss'], mode='lines', name='soft Jaccard loss (train)'))
fig.update_layout(title='Training loss curves (different scales)', xaxis_title='iteration', yaxis_title='loss')
fig.show()

# Training curves (Jaccard at threshold 0.5)
fig = go.Figure()
fig.add_trace(
    go.Scatter(x=hist_log['iter'], y=hist_log['jaccard@0.5'], mode='lines', name='log-loss model')
)
fig.add_trace(
    go.Scatter(x=hist_iou['iter'], y=hist_iou['jaccard@0.5'], mode='lines', name='soft Jaccard model')
)
fig.update_layout(
    title='Training: Jaccard@0.5 over iterations',
    xaxis_title='iteration',
    yaxis_title='Jaccard@0.5',
    yaxis=dict(range=[0, 1]),
)
fig.show()

# Threshold sweep on the test set, for illustration only
# (in practice, tune the threshold on a separate validation split)
fig = go.Figure()
fig.add_trace(go.Scatter(x=ths, y=curve_log, mode='lines', name='log-loss model'))
fig.add_trace(go.Scatter(x=ths, y=curve_iou, mode='lines', name='soft Jaccard model'))
fig.add_vline(x=best_t_log, line_dash='dash', line_color='black')
fig.add_vline(x=best_t_iou, line_dash='dash', line_color='gray')
fig.update_layout(
    title='Test: Jaccard vs threshold (vertical lines = best thresholds)',
    xaxis_title='threshold',
    yaxis_title='Jaccard',
    yaxis=dict(range=[0, 1]),
)
fig.show()


## 8) Pros, cons, and where Jaccard shines

### Pros
- **Interpretable** overlap measure in $[0,1]$.
- **Ignores true negatives**, which is great when negatives are overwhelming (e.g. segmentation background).
- Natural fit for **sets**, **sparse binary features**, **multi-label** problems.
- Symmetric: $J(A,B)=J(B,A)$.

### Cons
- Because it ignores $TN$, it can be misleading when correct negatives matter.
- The "hard" metric is **non-differentiable**, so you usually optimize a surrogate.
- For **small objects** in segmentation, a small boundary shift can drop IoU a lot.
- For multiclass/multilabel, results depend heavily on the averaging choice (`micro` vs `macro` vs `samples`).
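
The small-object sensitivity is easy to see on toy masks. The sketch below applies the same 2-pixel shift to a small and a large disk and compares the resulting IoU. The helpers `disk` and `iou_masks`, the grid size, and the radii are all made up for illustration.

```python
import numpy as np


def disk(size, cx, cy, r):
    """Boolean mask of a filled disk on a size x size grid (illustrative helper)."""
    rows, cols = np.mgrid[0:size, 0:size]
    return (cols - cx) ** 2 + (rows - cy) ** 2 <= r ** 2


def iou_masks(m1, m2):
    """IoU of two boolean masks; assumes the union is non-empty."""
    union_px = np.logical_or(m1, m2).sum()
    return np.logical_and(m1, m2).sum() / union_px


# Same horizontal shift applied to a small and a large object
shift = 2
iou_small = iou_masks(disk(64, 32, 32, 3), disk(64, 32 + shift, 32, 3))
iou_large = iou_masks(disk(64, 32, 32, 20), disk(64, 32 + shift, 32, 20))

print(f'small disk (r=3):  IoU = {iou_small:.3f}')
print(f'large disk (r=20): IoU = {iou_large:.3f}')
```

The shift is identical in pixels, but it is a much larger fraction of the small object's extent, so the small disk loses far more overlap.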

### Good use cases
- Image segmentation / detection masks (IoU)
- Multi-label classification (tags)
- Information retrieval and matching (set overlap)


## 9) Pitfalls & diagnostics

- **Union = 0 edge case**: if both sets are empty, Jaccard is undefined ($0/0$). Decide a convention (`zero_division`).
- **Threshold choice**: Jaccard can change a lot with the threshold; tune it on a validation set.
- **Averaging**:
  - `micro` pools counts over all labels/classes, so frequent ones dominate the score
  - `macro` treats each label/class equally (often better for rare labels)
  - `samples` answers: "how good are we per example?" (multilabel)
- **Compare with precision/recall** to see whether low IoU comes from extra positives (FP) or missed positives (FN).
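
As a sketch of the last diagnostic, the hypothetical helper `diagnose_binary` below reports precision, recall, and Jaccard together. The two made-up predictions have the same Jaccard score for opposite reasons, and the third call shows the empty-union convention in action.

```python
import numpy as np


def diagnose_binary(y_true, y_pred, zero_division=0.0):
    """Return precision, recall, and Jaccard from 0/1 arrays (hypothetical helper)."""
    t = np.asarray(y_true).astype(bool)
    p = np.asarray(y_pred).astype(bool)
    tp = int((t & p).sum())
    fp = int((~t & p).sum())
    fn = int((t & ~p).sum())

    def ratio(num, den):
        # Fall back to the chosen convention when the denominator is empty
        return num / den if den > 0 else float(zero_division)

    return {
        'precision': ratio(tp, tp + fp),
        'recall': ratio(tp, tp + fn),
        'jaccard': ratio(tp, tp + fp + fn),
    }


y_ref = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Over-predicting: every positive is found, but with many extras (FP-driven)
over = diagnose_binary(y_ref, np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))
# Under-predicting: no extras, but half of the positives are missed (FN-driven)
under = diagnose_binary(y_ref, np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))
# Both empty: the union is 0, so the result is whatever convention was chosen
empty = diagnose_binary(np.zeros(5), np.zeros(5), zero_division=1.0)

print('over :', over)
print('under:', under)
print('empty:', empty)
```

Low precision with high recall points to false positives; the reverse points to false negatives.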


## 10) Exercises

1. Create two predictions with the same accuracy but very different Jaccard. Explain the difference using $TP/FP/FN/TN$.
2. For multilabel data, build a case where `micro` is high but `macro` is low. What does that imply?
3. Implement a **soft Dice** (F1) loss and compare its behavior to soft Jaccard on the same toy dataset.


## References

- scikit-learn `jaccard_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html
- Jaccard index / IoU (Wikipedia overview): https://en.wikipedia.org/wiki/Jaccard_index
