# F1 Score (`f1_score`)

The **F1 score** (a.k.a. *F-measure*) summarizes performance on the **positive class** by combining:

- **Precision**: *when we predict positive, how often are we correct?*
- **Recall**: *of all actual positives, how many did we find?*

It’s especially common when:
- the positive class is **rare** (class imbalance)
- **false positives** and **false negatives** both matter (roughly equally)

## Goals

- Derive the F1 formula from the confusion matrix.
- Build a from-scratch NumPy implementation (binary + multiclass averages).
- Visualize how **thresholding** changes precision, recall, and F1.
- Use F1 to **tune** a simple logistic regression classifier.

## Quick import

```python
from sklearn.metrics import f1_score
```


In [None]:
import os

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score as sk_f1_score
from sklearn.metrics import precision_score as sk_precision_score
from sklearn.metrics import recall_score as sk_recall_score

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)

In [None]:
import sklearn
import plotly

print('numpy  :', np.__version__)
print('sklearn:', sklearn.__version__)
print('plotly :', plotly.__version__)

## 1) Confusion matrix → precision, recall

Assume **binary** classification:

- true label: $y \in \{0, 1\}$ (1 = *positive*)
- predicted label: $\hat{y} \in \{0, 1\}$

The confusion matrix counts:

|            | $\hat{y}=1$ | $\hat{y}=0$ |
|------------|-------------|-------------|
| $y=1$      | TP          | FN          |
| $y=0$      | FP          | TN          |

From these:

$$
\text{precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\qquad
\text{recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}
$$

- Precision asks: *how noisy are our positive predictions?*
- Recall asks: *how many positives did we miss?*

In [None]:
# A tiny example
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))

tp, fp, fn, tn
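
As a quick check of the formulas above, the counts from this tiny example can be plugged in directly. This is only a minimal sketch: it repeats the toy arrays so it runs on its own, and `precision` and `recall` are plain local variables, not library calls.

```python
import numpy as np

# Same toy labels as in the cell above, repeated so this snippet runs on its own.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted 1, actually 0
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted 0, actually 1

precision = tp / (tp + fp)  # share of positive predictions that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(f"tp={tp} fp={fp} fn={fn}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here one of the three positive predictions is wrong and one of the three actual positives is missed, so precision and recall happen to coincide.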

## 2) The F1 score

The **F1 score** is the **harmonic mean** of precision and recall:

$$
F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}}
= \frac{2\,\text{precision}\,\text{recall}}{\text{precision}+\text{recall}}
$$

Substituting the confusion-matrix definitions gives a very useful form:

$$
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
$$
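
The substitution is short enough to write out. Assuming $\mathrm{TP} > 0$ so that both reciprocals are defined, the denominator of the harmonic mean becomes:

$$
\frac{1}{\text{precision}} + \frac{1}{\text{recall}}
= \frac{\mathrm{TP}+\mathrm{FP}}{\mathrm{TP}} + \frac{\mathrm{TP}+\mathrm{FN}}{\mathrm{TP}}
= \frac{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}{\mathrm{TP}}
$$

Dividing 2 by this quantity gives $2\,\mathrm{TP} / (2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN})$. The count-based form is also well defined when $\mathrm{TP}=0$ as long as $\mathrm{FP}+\mathrm{FN}>0$, in which case it evaluates to 0.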

Key intuition:
- **Harmonic mean punishes imbalance**: if precision is high but recall is near zero (or vice versa), $F_1$ is near zero.
- **True negatives do not appear** in the formula. This makes F1 insensitive to how many easy negatives there are, which is useful under class imbalance, but it also means F1 says nothing about performance on the negative class.
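
To make the point about true negatives concrete, here is a small sketch with made-up counts: TP, FP and FN stay fixed while TN grows. The helper names `f1_from_counts` and `accuracy_from_counts` are hypothetical and used only for this illustration.

```python
# Hypothetical counts: positive-class results stay fixed, only TN grows.
tp, fp, fn = 8, 2, 2


def f1_from_counts(tp, fp, fn):
    # Count-based F1; TN is not an argument because it never enters the formula.
    return 2 * tp / (2 * tp + fp + fn)


def accuracy_from_counts(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


for tn in (10, 1_000, 100_000):
    f1 = f1_from_counts(tp, fp, fn)
    acc = accuracy_from_counts(tp, fp, fn, tn)
    print(f"tn={tn:>7d}  F1={f1:.3f}  accuracy={acc:.4f}")
```

F1 stays at 0.8 in every row, while accuracy climbs toward 1 as negatives are added, even though the classifier's behavior on positives has not changed.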

A generalization is the $F_\beta$ score:

$$
F_\beta = (1+\beta^2)\,\frac{\text{precision}\,\text{recall}}{\beta^2\,\text{precision}+\text{recall}}
$$

- $\beta>1$ emphasizes recall
- $\beta<1$ emphasizes precision
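
A quick numeric sketch of how $\beta$ shifts the score, working directly from a (precision, recall) pair. The function name `f_beta_from_pr` and the operating point are assumptions made for illustration, and returning 0.0 when the denominator is zero is just one possible convention.

```python
def f_beta_from_pr(precision, recall, beta):
    """F-beta from a (precision, recall) pair; returns 0.0 if the denominator is zero."""
    b2 = beta ** 2
    den = b2 * precision + recall
    return 0.0 if den == 0 else (1 + b2) * precision * recall / den


# Hypothetical operating point where recall is higher than precision.
precision, recall = 0.5, 0.9

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F_beta={f_beta_from_pr(precision, recall, beta):.3f}")
```

Because recall is the stronger of the two numbers here, the score rises as $\beta$ grows and recall receives more weight. With precision and recall swapped, the ordering would reverse.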

In [None]:
# Harmonic mean vs arithmetic mean
ps = np.linspace(0.001, 0.999, 400)
r_fixed = 0.2

f1 = 2 * ps * r_fixed / (ps + r_fixed)
am = 0.5 * (ps + r_fixed)

fig = go.Figure()
fig.add_trace(go.Scatter(x=ps, y=f1, mode='lines', name='F1 (harmonic mean)'))
fig.add_trace(go.Scatter(x=ps, y=am, mode='lines', name='Arithmetic mean', line=dict(dash='dash')))
fig.update_layout(
    title='Same recall, changing precision: harmonic vs arithmetic mean',
    xaxis_title='Precision',
    yaxis_title='Score',
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='left', x=0),
)
fig.show()

In [None]:
# F1 as a function of precision and recall
precision_grid = np.linspace(0, 1, 201)
recall_grid = np.linspace(0, 1, 201)
P, R = np.meshgrid(precision_grid, recall_grid)

den = P + R

F1 = np.zeros_like(den, dtype=float)
np.divide(2 * P * R, den, out=F1, where=den != 0)

fig = px.imshow(
    F1,
    x=precision_grid,
    y=recall_grid,
    origin='lower',
    aspect='auto',
    labels={'x': 'Precision', 'y': 'Recall', 'color': 'F1'},
    title='F1 surface (heatmap) over precision/recall',
)
fig.update_layout(coloraxis_colorbar=dict(tickformat='.2f'))
fig.show()

## 3) NumPy implementation (from scratch)

Below is a minimal implementation that mirrors common `sklearn.metrics.f1_score` behavior:

- **binary** F1 via confusion-matrix counts
- safe handling of **zero division** (precision is undefined with no predicted positives, recall with no actual positives, and F1 only when both hold)
- multiclass averages: `macro`, `micro`, `weighted`

Convention:

- when a denominator is zero, we return `zero_division` (default `0.0`)
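
The three ways a denominator can vanish are easiest to see on raw counts. The sketch below uses hypothetical helpers (`ratio_or_default`, `prf_from_counts`) that apply the convention above to scalar counts. It illustrates this notebook's count-based convention and makes no claim about edge-case behavior in any particular library version.

```python
def ratio_or_default(num, den, zero_division=0.0):
    # Hypothetical helper: plain division, falling back to `zero_division` when den == 0.
    return num / den if den != 0 else float(zero_division)


def prf_from_counts(tp, fp, fn, zero_division=0.0):
    precision = ratio_or_default(tp, tp + fp, zero_division)
    recall = ratio_or_default(tp, tp + fn, zero_division)
    f1 = ratio_or_default(2 * tp, 2 * tp + fp + fn, zero_division)
    return precision, recall, f1


cases = {
    "no predicted positives": (0, 0, 3),  # precision undefined, F1 = 0
    "no actual positives": (0, 2, 0),     # recall undefined, F1 = 0
    "neither": (0, 0, 0),                 # precision, recall and F1 all undefined
}
for name, (tp, fp, fn) in cases.items():
    print(f"{name:24s}", prf_from_counts(tp, fp, fn, zero_division=1.0))
```

Only the last case makes the F1 denominator itself zero. In the first two, F1 is a genuine 0 regardless of the `zero_division` setting.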

In [None]:
def _as_1d(a):
    a = np.asarray(a)
    return a.ravel()


def _safe_divide(num, den, zero_division=0.0):
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)

    out = np.full(np.broadcast(num, den).shape, float(zero_division), dtype=float)
    np.divide(num, den, out=out, where=den != 0)
    return out


def confusion_counts_binary(y_true, y_pred, *, pos_label=1):
    y_true = _as_1d(y_true)
    y_pred = _as_1d(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    yt = y_true == pos_label
    yp = y_pred == pos_label

    tp = np.sum(yt & yp)
    fp = np.sum(~yt & yp)
    fn = np.sum(yt & ~yp)
    tn = np.sum(~yt & ~yp)

    return tp, fp, fn, tn


def precision_recall_f1_from_counts(tp, fp, fn, *, zero_division=0.0):
    precision = _safe_divide(tp, tp + fp, zero_division=zero_division)
    recall = _safe_divide(tp, tp + fn, zero_division=zero_division)
    f1 = _safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division)
    return precision, recall, f1


def f1_score_binary(y_true, y_pred, *, pos_label=1, zero_division=0.0):
    tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred, pos_label=pos_label)
    _, _, f1 = precision_recall_f1_from_counts(tp, fp, fn, zero_division=zero_division)
    return float(f1)


def f1_score_multiclass(y_true, y_pred, *, labels=None, average='macro', zero_division=0.0):
    '''Multiclass/single-label F1 via one-vs-rest counts.

    average: {'macro','micro','weighted', None}
    '''

    y_true = _as_1d(y_true)
    y_pred = _as_1d(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    if labels is None:
        labels = np.unique(np.concatenate([y_true, y_pred]))
    labels = np.asarray(labels)

    tps = []
    fps = []
    fns = []
    supports = []

    for lab in labels:
        tp = np.sum((y_true == lab) & (y_pred == lab))
        fp = np.sum((y_true != lab) & (y_pred == lab))
        fn = np.sum((y_true == lab) & (y_pred != lab))

        tps.append(tp)
        fps.append(fp)
        fns.append(fn)
        supports.append(np.sum(y_true == lab))

    tps = np.asarray(tps)
    fps = np.asarray(fps)
    fns = np.asarray(fns)
    supports = np.asarray(supports)

    per_class_f1 = _safe_divide(2 * tps, 2 * tps + fps + fns, zero_division=zero_division)

    if average is None:
        return labels, per_class_f1

    average = str(average).lower()
    if average == 'macro':
        return float(np.mean(per_class_f1))
    if average == 'weighted':
        w = _safe_divide(supports, supports.sum(), zero_division=0.0)
        return float(np.sum(w * per_class_f1))
    if average == 'micro':
        tp = tps.sum()
        fp = fps.sum()
        fn = fns.sum()
        return float(_safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division))

    raise ValueError("average must be one of: 'macro', 'micro', 'weighted', None")

In [None]:
# Quick sanity checks vs sklearn
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)

ours = f1_score_binary(y_true, y_pred)
sk = sk_f1_score(y_true, y_pred, zero_division=0)
print('binary  f1: ours=', ours, 'sklearn=', sk)

y_true_mc = rng.integers(0, 3, size=300)
y_pred_mc = rng.integers(0, 3, size=300)

for avg in ['macro', 'micro', 'weighted']:
    ours = f1_score_multiclass(y_true_mc, y_pred_mc, average=avg)
    sk = sk_f1_score(y_true_mc, y_pred_mc, average=avg, zero_division=0)
    print(f"multiclass {avg:8s}: ours={ours:.6f} sklearn={sk:.6f}")

## 4) Thresholding: why F1 depends on the decision rule

Many classifiers output a **score** or a **probability** $\hat{p}(y=1\mid x)$.

To produce hard labels we pick a threshold $t$:

$$
\hat{y}(t) = \mathbb{1}[\hat{p} \ge t]
$$

Changing $t$ changes FP/FN, therefore precision/recall, therefore F1.

A common way to **use F1 for optimization** is to choose $t$ (and other hyperparameters) to maximize validation-set F1:

$$
t^* \in \arg\max_{t\in[0,1]} F_1\bigl(y,\ \mathbb{1}[\hat{p}\ge t]\bigr)
$$

This is practical because:
- F1 is **not differentiable** in the model parameters: it is piecewise constant and only jumps when a point crosses the threshold, so gradients carry no signal
- but it’s easy to optimize over a 1D threshold via a grid search
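
Since the predictions only change when $t$ passes one of the observed scores, F1 is piecewise constant in $t$. One consequence is that it is enough to evaluate F1 at the distinct score values; any threshold above the largest score predicts no positives at all. The sketch below does this on a made-up toy example. The function name `best_f1_threshold` and the data are assumptions for illustration; a uniform grid is a simpler alternative that approximates the same search.

```python
import numpy as np


def best_f1_threshold(y_true, scores):
    """Return (threshold, F1) maximizing F1 over the distinct score values."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    candidates = np.unique(scores)  # sorted distinct scores

    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = scores >= t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        den = 2 * tp + fp + fn
        f1 = 2 * tp / den if den > 0 else 0.0
        if f1 > best_f1:  # strict '>' keeps the lowest threshold among ties
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Toy labels and scores (made up for illustration)
y_toy = [0, 0, 1, 0, 1, 1]
s_toy = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

t_star, f1_star = best_f1_threshold(y_toy, s_toy)
print(f"best threshold={t_star}, F1={f1_star:.3f}")
```

On this toy data the best threshold sits at the score of the lowest-ranked positive, which keeps all three positives while admitting only one false positive.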

In [None]:
# Synthetic imbalanced dataset (2D for visualization)
X, y = make_classification(
    n_samples=2500,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    class_sep=1.4,
    random_state=7,
)

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=7
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=7
)

# Standardize with training-set statistics only, so no information leaks from val/test
mean_ = X_train.mean(axis=0)
std_ = X_train.std(axis=0)
std_ = np.where(std_ == 0, 1.0, std_)

X_train_s = (X_train - mean_) / std_
X_val_s = (X_val - mean_) / std_
X_test_s = (X_test - mean_) / std_

fig = px.scatter(
    x=X_train_s[:, 0],
    y=X_train_s[:, 1],
    color=y_train.astype(str),
    opacity=0.7,
    labels={'x': 'x1 (standardized)', 'y': 'x2 (standardized)', 'color': 'class'},
    title='Training data (imbalanced)',
)
fig.show()

print('class balance (train):', np.bincount(y_train) / y_train.size)

In [None]:
def add_intercept(X: np.ndarray) -> np.ndarray:
    X = np.asarray(X, dtype=float)
    return np.c_[np.ones((X.shape[0], 1)), X]


def sigmoid(z):
    # Numerically stable sigmoid: each branch only exponentiates non-positive values.
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out


def log_loss_from_proba(y_true, p, eps=1e-15):
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


def fit_logistic_regression_gd(
    X,
    y,
    *,
    lr=0.2,
    max_iter=2000,
    alpha=0.0,
    tol=1e-8,
):
    '''Binary logistic regression with gradient descent + optional L2 penalty.'''

    Xb = add_intercept(X)
    y = np.asarray(y, dtype=float).ravel()

    n, d = Xb.shape
    w = np.zeros(d)
    history = []

    for _ in range(max_iter):
        p = sigmoid(Xb @ w)
        loss = log_loss_from_proba(y, p) + 0.5 * alpha * np.sum(w[1:] ** 2)
        history.append(loss)

        grad = (Xb.T @ (p - y)) / n
        grad[1:] += alpha * w[1:]

        w_new = w - lr * grad

        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new

    return w, np.asarray(history)


def predict_proba_logreg(X, w):
    Xb = add_intercept(X)
    return sigmoid(Xb @ w)

In [None]:
w, loss_hist = fit_logistic_regression_gd(X_train_s, y_train, lr=0.2, max_iter=3000, alpha=0.05)

fig = go.Figure()
fig.add_trace(go.Scatter(y=loss_hist, mode='lines', name='train log-loss'))
fig.update_layout(title='Training curve (log-loss)', xaxis_title='Iteration', yaxis_title='Log-loss')
fig.show()

w

In [None]:
def precision_recall_f1_at_thresholds(y_true, y_score, thresholds, *, zero_division=0.0):
    y_true = np.asarray(y_true).astype(int).ravel()
    y_score = np.asarray(y_score, dtype=float).ravel()
    thresholds = np.asarray(thresholds, dtype=float)

    y_true_pos = y_true == 1
    pred_pos = y_score[:, None] >= thresholds[None, :]

    tp = np.sum(pred_pos & y_true_pos[:, None], axis=0)
    fp = np.sum(pred_pos & ~y_true_pos[:, None], axis=0)
    fn = np.sum(~pred_pos & y_true_pos[:, None], axis=0)

    precision = _safe_divide(tp, tp + fp, zero_division=zero_division)
    recall = _safe_divide(tp, tp + fn, zero_division=zero_division)
    f1 = _safe_divide(2 * tp, 2 * tp + fp + fn, zero_division=zero_division)

    return precision, recall, f1, tp, fp, fn


p_val = predict_proba_logreg(X_val_s, w)
thresholds = np.linspace(0.0, 1.0, 401)

prec_t, rec_t, f1_t, tp_t, fp_t, fn_t = precision_recall_f1_at_thresholds(
    y_val, p_val, thresholds, zero_division=0.0
)

best_idx = int(np.argmax(f1_t))
t_best = float(thresholds[best_idx])

print('best threshold (val):', t_best)
print('F1 at best threshold (val):', float(f1_t[best_idx]))

In [None]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=prec_t, mode='lines', name='precision'))
fig.add_trace(go.Scatter(x=thresholds, y=rec_t, mode='lines', name='recall'))
fig.add_trace(go.Scatter(x=thresholds, y=f1_t, mode='lines', name='F1', line=dict(width=3)))

fig.add_vline(x=t_best, line_width=2, line_dash='dash', line_color='black')
fig.update_layout(
    title='Precision / Recall / F1 vs threshold (validation set)',
    xaxis_title='Threshold t',
    yaxis_title='Score',
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='left', x=0),
)
fig.show()

In [None]:
def confusion_matrix_from_threshold(y_true, y_score, t):
    y_pred = (np.asarray(y_score) >= t).astype(int)
    tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred, pos_label=1)
    mat = np.array([[tn, fp], [fn, tp]])
    return mat, (tp, fp, fn, tn)


mat_05, counts_05 = confusion_matrix_from_threshold(y_val, p_val, 0.5)
mat_best, counts_best = confusion_matrix_from_threshold(y_val, p_val, t_best)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        f't=0.50 (F1={f1_score_binary(y_val, (p_val>=0.5).astype(int)):.3f})',
        f't={t_best:.2f} (F1={f1_score_binary(y_val, (p_val>=t_best).astype(int)):.3f})',
    ),
)

for col, mat in enumerate([mat_05, mat_best], start=1):
    fig.add_trace(
        go.Heatmap(
            z=mat,
            x=['Pred 0', 'Pred 1'],
            y=['True 0', 'True 1'],
            text=mat,
            texttemplate='%{text}',
            colorscale='Blues',
            showscale=False,
        ),
        row=1,
        col=col,
    )

fig.update_layout(title='Confusion matrices on validation set')
fig.show()

counts_05, counts_best

In [None]:
# Precision-Recall curve with iso-F1 lines
# (each point corresponds to one threshold)

fig = go.Figure()
fig.add_trace(go.Scatter(x=rec_t, y=prec_t, mode='lines', name='PR curve'))
fig.add_trace(
    go.Scatter(
        x=[rec_t[best_idx]],
        y=[prec_t[best_idx]],
        mode='markers',
        marker=dict(size=10, color='red'),
        name=f'Best F1 (t={t_best:.2f})',
    )
)

f_levels = [0.2, 0.4, 0.6, 0.8]
p_line = np.linspace(0.001, 1.0, 400)
for f in f_levels:
    # Keep only precisions whose implied recall lies in (0, 1]:
    # r = f*p / (2p - f) <= 1  <=>  p >= f / (2 - f)
    mask = p_line >= f / (2 - f)
    p = p_line[mask]
    r = (f * p) / (2 * p - f)

    fig.add_trace(
        go.Scatter(
            x=r,
            y=p,
            mode='lines',
            line=dict(dash='dot', width=1),
            name=f'F1={f}',
            hoverinfo='skip',
        )
    )

fig.update_layout(
    title='Precision–Recall curve (validation) with iso-F1 lines',
    xaxis_title='Recall',
    yaxis_title='Precision',
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
)
fig.show()

In [None]:
# How the threshold changes the *linear* decision boundary
# p = sigmoid(z) >= t  <=>  z >= log(t/(1-t))

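def boundary_line(w, t, x1, eps=1e-6):
    # Clip t away from 0 and 1 so the logit stays finite.
    t = float(np.clip(t, eps, 1 - eps))
    z_thr = np.log(t / (1 - t))
    if np.isclose(w[2], 0.0):
        return None  # vertical boundary: cannot be written as x2 = f(x1)
    x2 = (z_thr - w[0] - w[1] * x1) / w[2]
    return x2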


x1 = np.linspace(X_train_s[:, 0].min() - 0.5, X_train_s[:, 0].max() + 0.5, 200)
x2_05 = boundary_line(w, 0.5, x1)
x2_best = boundary_line(w, t_best, x1)

fig = px.scatter(
    x=X_train_s[:, 0],
    y=X_train_s[:, 1],
    color=y_train.astype(str),
    opacity=0.6,
    labels={'x': 'x1 (standardized)', 'y': 'x2 (standardized)', 'color': 'class'},
    title='Logistic regression: threshold shifts the decision boundary',
)

if x2_05 is not None:
    fig.add_trace(go.Scatter(x=x1, y=x2_05, mode='lines', name='t=0.50', line=dict(color='black')))
if x2_best is not None:
    fig.add_trace(go.Scatter(x=x1, y=x2_best, mode='lines', name=f't={t_best:.2f}', line=dict(color='red')))

fig.show()

### Evaluate on the test set

We picked $t^*$ on the **validation** set, so the test set stays untouched and gives an honest estimate of how the tuned threshold generalizes.

Now compare:

- default $t=0.5$
- tuned $t=t^*$

In [None]:
p_test = predict_proba_logreg(X_test_s, w)

def report_binary(y_true, p, t):
    y_hat = (p >= t).astype(int)
    tp, fp, fn, tn = confusion_counts_binary(y_true, y_hat)
    prec, rec, f1 = precision_recall_f1_from_counts(tp, fp, fn)
    return {
        'threshold': float(t),
        'precision': float(prec),
        'recall': float(rec),
        'f1': float(f1),
        'tp': int(tp),
        'fp': int(fp),
        'fn': int(fn),
        'tn': int(tn),
    }

rep_05 = report_binary(y_test, p_test, 0.5)
rep_best = report_binary(y_test, p_test, t_best)

rep_05, rep_best

## 5) Using F1 for model selection (simple “optimization” loop)

F1 is typically used as a **selection criterion** rather than a differentiable training loss.

Example: tune L2 strength $\alpha$ for logistic regression by:

1) fit the model for each $\alpha$
2) pick the threshold $t$ that maximizes validation F1
3) choose the best $(\alpha, t)$ pair

In [None]:
alphas = [0.0, 0.01, 0.05, 0.2, 1.0]
thresholds = np.linspace(0.0, 1.0, 401)

results = []
for a in alphas:
    w_a, _ = fit_logistic_regression_gd(X_train_s, y_train, lr=0.2, max_iter=3000, alpha=a)
    p_val_a = predict_proba_logreg(X_val_s, w_a)

    _, _, f1_a, _, _, _ = precision_recall_f1_at_thresholds(y_val, p_val_a, thresholds)
    best_idx_a = int(np.argmax(f1_a))

    results.append(
        {
            'alpha': float(a),
            't_best': float(thresholds[best_idx_a]),
            'f1_val_best': float(f1_a[best_idx_a]),
        }
    )

results

In [None]:
alpha_vals = np.array([r['alpha'] for r in results])
f1_vals = np.array([r['f1_val_best'] for r in results])

best = results[int(np.argmax(f1_vals))]

fig = go.Figure()
fig.add_trace(go.Scatter(x=alpha_vals, y=f1_vals, mode='lines+markers', name='best val F1'))
fig.update_layout(
    title='Validation F1 after threshold tuning vs L2 strength',
    xaxis_title='alpha (L2 strength)',
    yaxis_title='best validation F1',
)
fig.show()

best

## 6) Multiclass F1: macro vs micro vs weighted

For multiclass single-label classification, F1 is usually computed by turning each class into a one-vs-rest problem.

- **macro**: average F1 across classes (treat each class equally)
- **weighted**: average F1 across classes weighted by class support
- **micro**: compute global TP/FP/FN across classes before computing F1

Note: in *single-label* multiclass classification, **micro F1 equals accuracy** as long as every class is included in the average.
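
The reason is a counting argument: each misclassified sample adds exactly one FP (for the predicted class) and one FN (for the true class), so the pooled FP and FN both equal the number of errors $E$, and $2\,\mathrm{TP}/(2\,\mathrm{TP}+2E) = \mathrm{TP}/(\mathrm{TP}+E)$. A minimal sketch on a small made-up label set, pooling the counts by hand:

```python
import numpy as np

y_true_mc = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_mc = np.array([0, 2, 0, 1, 0, 2, 2, 1, 2])

labels = np.unique(np.concatenate([y_true_mc, y_pred_mc]))

# Pool one-vs-rest counts over all classes.
tp = sum(int(np.sum((y_true_mc == c) & (y_pred_mc == c))) for c in labels)
fp = sum(int(np.sum((y_true_mc != c) & (y_pred_mc == c))) for c in labels)
fn = sum(int(np.sum((y_true_mc == c) & (y_pred_mc != c))) for c in labels)

micro_f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = float(np.mean(y_true_mc == y_pred_mc))

print(f"pooled tp={tp} fp={fp} fn={fn}")
print(f"micro F1={micro_f1:.4f} accuracy={accuracy:.4f}")
```

The pooled FP and FN counts match the three misclassified samples, and the two printed scores agree.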

In [None]:
y_true_mc = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_mc = np.array([0, 2, 0, 1, 0, 2, 2, 1, 2])

labels, per_class = f1_score_multiclass(y_true_mc, y_pred_mc, average=None)
print('labels:', labels)
print('per-class F1:', per_class)

for avg in ['macro', 'micro', 'weighted']:
    ours = f1_score_multiclass(y_true_mc, y_pred_mc, average=avg)
    sk = sk_f1_score(y_true_mc, y_pred_mc, average=avg, zero_division=0)
    print(f"{avg:8s}: ours={ours:.6f} sklearn={sk:.6f}")

## Pros / cons and when to use F1

**Pros**
- Good default when the positive class is **rare** and you care about both FP and FN.
- Single number that summarizes the precision–recall tradeoff.
- Common in information retrieval, detection tasks, and many imbalanced classification settings.

**Cons / limitations**
- Ignores **true negatives**: can be misleading if performance on the negative class matters.
- **Threshold-dependent**: you must pick a threshold (or compare across thresholds).
- Not a **proper scoring rule** (unlike log-loss / Brier): it only sees hard labels, so it cannot tell you whether the predicted probabilities are well calibrated.
- Not differentiable in model parameters → usually not used as a direct training loss.
- Can hide tradeoffs: the same F1 can come from very different (precision, recall) pairs.
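
The last point is easy to check numerically. In this sketch the three (precision, recall) pairs are hypothetical operating points chosen so that they share the same F1; `f1_from_pr` is a throwaway helper.

```python
def f1_from_pr(precision, recall):
    # Harmonic mean of precision and recall (both assumed > 0 here).
    return 2 * precision * recall / (precision + recall)


# Hypothetical operating points: very different behavior, same F1.
pairs = [(0.5, 1.0), (1.0, 0.5), (2 / 3, 2 / 3)]

for p, r in pairs:
    print(f"precision={p:.3f} recall={r:.3f} -> F1={f1_from_pr(p, r):.3f}")
```

A model that flags everything twice as often as needed and a model that misses half the positives look identical under F1, which is a reason to report precision and recall alongside it.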

**Good use cases**
- Highly imbalanced binary classification where the negative class is huge (fraud, churn, defect detection).
- Search / ranking systems after choosing an operating point.
- Segmentation / detection tasks (F1 is closely related to the Dice coefficient).
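
For binary masks the connection to Dice is direct: with $A$ the set of true positives-class pixels and $B$ the predicted ones, $|A| = \mathrm{TP}+\mathrm{FN}$ and $|B| = \mathrm{TP}+\mathrm{FP}$, so $2|A\cap B|/(|A|+|B|)$ is the count-based F1. A small sketch with two made-up masks:

```python
import numpy as np

# Two hypothetical binary masks: ground truth and prediction.
truth = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 0]], dtype=bool)
pred = np.array([[1, 0, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0]], dtype=bool)

# Dice coefficient: 2 |A ∩ B| / (|A| + |B|)
dice = 2 * np.sum(truth & pred) / (np.sum(truth) + np.sum(pred))

# F1 from pixel-wise confusion counts
tp = np.sum(truth & pred)
fp = np.sum(~truth & pred)
fn = np.sum(truth & ~pred)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"Dice={dice:.4f} F1={f1:.4f}")
```

Both expressions give the same value on these masks, since they are the same ratio written in set notation and in confusion-matrix notation.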

## Common pitfalls + diagnostics

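- **Undefined divisions**: precision is undefined when the model predicts no positives, and recall is undefined when there are no actual positives. Decide on a policy up front (`zero_division=0` is common).
- **Wrong averaging** in multiclass: `macro` weights every class equally, so rare classes count as much as frequent ones; `weighted` weights each class by its support, so it follows the overall class distribution.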
- **Class imbalance** doesn’t magically disappear: F1 helps compared to accuracy, but you still need proper validation and often threshold tuning.
- If you need to compare models *as rankers*, prefer PR curves / average precision instead of a single F1 at one threshold.
- If FP and FN have different costs, prefer $F_\beta$ or an explicit cost-sensitive metric.

## Exercises

1) Implement $F_\beta$ in NumPy and verify it against `sklearn.metrics.fbeta_score`.
2) For the logistic regression example, compare the threshold that maximizes F1 vs the threshold that maximizes accuracy.
3) Create an extremely imbalanced dataset (e.g. 99.5% negatives) and compare accuracy vs F1.
4) For multiclass, create a dataset with one rare class and compare `macro` vs `weighted` F1.

## References

- scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
- scikit-learn user guide (precision/recall/F-score): https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics