# Root Mean Squared Logarithmic Error (RMSLE) — Regression Metric (From Scratch)

RMSLE measures error **in log space**: it is the RMSE between $\log(1 + y)$ and $\log(1 + \hat y)$.

It is most useful when targets are **non-negative**, span **orders of magnitude**, and you care about **multiplicative / percentage-like** errors.

**Goals**
- Build intuition with numeric examples + Plotly visuals
- Write RMSLE/MSLE in clear notation (including domain constraints)
- Implement `root_mean_squared_log_error` in NumPy (from scratch) and validate vs scikit-learn
- Show how RMSLE naturally leads to optimizing a model on a `log1p`-transformed target
- Summarize pros/cons, good use cases, and common pitfalls

## Quick import

```python
from sklearn.metrics import root_mean_squared_log_error
```

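For single-output targets this is equivalent to `np.sqrt(mean_squared_log_error(...))`. With multiple outputs, the root is taken per output before averaging, so the two can differ. `root_mean_squared_log_error` requires scikit-learn 1.4 or later.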


In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.metrics import (
    mean_squared_error,
    mean_squared_log_error,
    root_mean_squared_error,
    root_mean_squared_log_error,
)

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)


## Prerequisites

- Regression setup: true targets $y$ and predictions $\hat y$
- Logarithms and the `log1p` / `expm1` trick:
  - `log1p(y) = log(1 + y)` is stable when $y$ is near 0
  - `expm1(z) = exp(z) - 1` is the inverse of `log1p`
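
To make the stability point concrete, here is a small illustrative check. The value `1e-17` is an arbitrary choice below float64 machine epsilon, not something taken from the data used later:

```python
import numpy as np

y_tiny = 1e-17  # arbitrary value far below float64 machine epsilon (~2.2e-16)

# 1.0 + 1e-17 rounds to exactly 1.0 in float64, so the naive form loses y entirely.
naive = np.log(1.0 + y_tiny)

# log1p evaluates log(1 + y) without forming 1 + y first, so the small value survives.
stable = np.log1p(y_tiny)

# expm1 maps the result back to the original scale.
roundtrip = np.expm1(stable)

print("naive log(1 + y):", naive)
print("log1p(y):        ", stable)
print("expm1(log1p(y)): ", roundtrip)
```

The naive form returns zero for any `y` that is lost when added to 1, while `log1p` and `expm1` round-trip it.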


## 1) Definition and notation

Given $n$ samples with **non-negative** targets $y_i \ge 0$ and predictions $\hat y_i \ge 0$, define the log-transformed values:

$$t_i = \log(1 + y_i), \qquad \hat t_i = \log(1 + \hat y_i)$$

The **mean squared logarithmic error** (MSLE) is:

$$\mathrm{MSLE}(y, \hat y) = \frac{1}{n}\sum_{i=1}^n (\hat t_i - t_i)^2 = \frac{1}{n}\sum_{i=1}^n \left(\log(1 + \hat y_i) - \log(1 + y_i)\right)^2$$

The **root mean squared logarithmic error** (RMSLE) is:

$$\mathrm{RMSLE}(y, \hat y) = \sqrt{\mathrm{MSLE}(y, \hat y)}$$

Weighted variant with sample weights $w_i \ge 0$:

$$\mathrm{MSLE}_w = \frac{\sum_{i=1}^n w_i (\hat t_i - t_i)^2}{\sum_{i=1}^n w_i}, \qquad \mathrm{RMSLE}_w = \sqrt{\mathrm{MSLE}_w}$$

Key identity (what makes this metric convenient):

$$\mathrm{RMSLE}(y, \hat y) = \mathrm{RMSE}(\log(1+y), \log(1+\hat y))$$

Notes:
- `log` is the natural logarithm; using another base just scales the metric by a constant.
- For multi-output regression, implementations typically compute RMSLE per output and then average.
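
As a quick sanity check of the formulas above, the sketch below evaluates the unweighted and weighted definitions directly in NumPy and confirms the base-change note. The arrays and weights are made-up illustrative values:

```python
import numpy as np

y = np.array([0.0, 3.0, 50.0, 400.0])      # illustrative targets
y_hat = np.array([1.0, 2.0, 60.0, 300.0])  # illustrative predictions
w = np.array([1.0, 1.0, 2.0, 4.0])         # illustrative sample weights

# Per-sample squared difference in log1p space.
sq = (np.log1p(y_hat) - np.log1p(y)) ** 2

rmsle = np.sqrt(sq.mean())
rmsle_w = np.sqrt(np.sum(w * sq) / np.sum(w))

# With equal weights the weighted formula reduces to the unweighted one.
ones = np.ones_like(y)
rmsle_equal_w = np.sqrt(np.sum(ones * sq) / np.sum(ones))

# Base-10 logs: log10(1 + y) = ln(1 + y) / ln(10), so RMSLE scales by 1 / ln(10).
sq10 = (np.log10(1.0 + y_hat) - np.log10(1.0 + y)) ** 2
rmsle_10 = np.sqrt(sq10.mean())

print("RMSLE:            ", rmsle)
print("weighted RMSLE:   ", rmsle_w)
print("base-10 RMSLE:    ", rmsle_10)
print("RMSLE / ln(10):   ", rmsle / np.log(10.0))
```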


## 2) Domain constraints and edge cases

- **Non-negativity**: Most definitions (and scikit-learn) require $y \ge 0$ and $\hat y \ge 0$.
- **Zeros are fine**: `log1p(0) = 0`, which is why `log(1 + y)` is used instead of `log(y)`.
- **Negative predictions**: A linear model can output negative values. Common remedies:
  - use a model that enforces $\hat y \ge 0$ (e.g., predict in log space), or
  - clip at evaluation time: $\hat y \leftarrow \max(\hat y, 0)$ (common in practice).
- **Near zero, it behaves like squared error**: for small $y$, $\log(1+y) \approx y$, so log residuals are close to raw residuals.
- **For large values, it behaves like a squared log-ratio**: for large $y$, $\log(1+y) \approx \log(y)$, so the residual is about $\log(\hat y / y)$, which is close to the relative error when that error is small.
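
The clipping remedy and the near-zero behaviour can be sketched as follows. The arrays are hypothetical; `raw_pred` stands in for the output of any unconstrained model:

```python
import numpy as np

y_true = np.array([0.0, 2.0, 5.0])
raw_pred = np.array([-0.3, 1.5, 6.0])  # hypothetical output of an unconstrained model

# Clip into the metric's domain before taking logs.
y_pred = np.clip(raw_pred, 0.0, None)

rmsle = float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))
print("clipped predictions:", y_pred)
print("RMSLE after clipping:", rmsle)

# Near zero, log1p(y) stays within y^2 / 2 of y, so log residuals track raw residuals.
small = np.array([0.001, 0.01, 0.1])
gap = np.log1p(small) - small
print("log1p(small) - small:", gap)
```

Clipping changes the predictions being scored, so it is worth reporting how many values were clipped alongside the metric.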


In [None]:
vals = np.array([0.0, 0.1, 1.0, 10.0, 100.0])
pd.DataFrame(
    {
        "y": vals,
        "log1p(y)": np.log1p(vals),
        "expm1(log1p(y))": np.expm1(np.log1p(vals)),
    }
)


## 3) Intuition: RMSLE cares about *ratios* (mostly)

For large targets, the `+1` becomes negligible and:

$$\log(1 + \hat y) - \log(1 + y) \approx \log(\hat y) - \log(y) = \log\left(\frac{\hat y}{y}\right)$$

So for large $y$, a prediction that is off by a factor of $c$ (i.e., $\hat y = c y$) has a per-sample squared log error of approximately:

$$\left(\log(c)\right)^2$$

This means:
- For large $y$, overpredicting by $\times 2$ and underpredicting by $\div 2$ incur nearly the **same** penalty (because $\log(2)$ and $\log(1/2) = -\log(2)$ square to the same value). For small $y$, the `+1` breaks this symmetry.
- The metric is much less dominated by very large targets than RMSE/MSE.

For small targets, $\log(1+y) \approx y$, so the metric behaves closer to squared error on the original scale.
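
A short numeric check of the ratio intuition, using a few arbitrary target sizes: it compares the penalty for a $\times 2$ overprediction with the penalty for a $\div 2$ underprediction. The helper name `sq_log_err` is introduced only for this sketch:

```python
import numpy as np

def sq_log_err(y_true, y_pred):
    """Per-sample squared log error: (log1p(y_pred) - log1p(y_true))^2."""
    return (np.log1p(y_pred) - np.log1p(y_true)) ** 2

penalties = {}
for y_val in [1.0, 10.0, 1000.0]:
    over = float(sq_log_err(y_val, 2.0 * y_val))   # prediction is twice the target
    under = float(sq_log_err(y_val, 0.5 * y_val))  # prediction is half the target
    penalties[y_val] = (over, under)
    print(f"y={y_val:>6g} | x2: {over:.4f} | /2: {under:.4f}")

print(f"(log 2)^2 = {np.log(2.0) ** 2:.4f}")
```

For the largest target both penalties sit close to $(\log 2)^2$; for the smallest one they differ noticeably, which is the effect of the `+1`.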


In [None]:
ratios = np.logspace(-2, 2, 500)  # 0.01 .. 100
y_trues = [0.1, 1.0, 10.0, 100.0, 1000.0]

parts = []
for y in y_trues:
    y_pred = ratios * y
    parts.append(
        pd.DataFrame(
            {
                "ratio": ratios,
                "sq_log_error": (np.log1p(y_pred) - np.log1p(y)) ** 2,
                "series": f"y_true={y:g}",
            }
        )
    )

# large-y approximation: (log ratio)^2
parts.append(
    pd.DataFrame(
        {
            "ratio": ratios,
            "sq_log_error": (np.log(ratios)) ** 2,
            "series": "(log ratio)^2 (large-y approx)",
        }
    )
)

df_ratio = pd.concat(parts, ignore_index=True)

fig = px.line(
    df_ratio,
    x="ratio",
    y="sq_log_error",
    color="series",
    log_x=True,
    title="Per-sample squared log error vs multiplicative ratio",
    labels={
        "ratio": "ratio = y_pred / y_true",
        "sq_log_error": "(log1p(y_pred) - log1p(y_true))^2",
        "series": "curve",
    },
)
fig.add_vline(x=1.0, line_dash="dash", line_color="black")
fig

## 4) A tiny worked example

We'll compute RMSLE step-by-step and compare against scikit-learn.


In [None]:
y_true = np.array([0.0, 1.0, 10.0, 100.0])
y_pred = np.array([0.0, 2.0, 8.0, 120.0])

t_true = np.log1p(y_true)
t_pred = np.log1p(y_pred)
diff = t_pred - t_true

msle = float(np.mean(diff**2))
rmsle = float(np.sqrt(msle))

print("t_true:", t_true)
print("t_pred:", t_pred)
print("diff:", diff)
print("MSLE:", msle)
print("RMSLE:", rmsle)

print("sklearn MSLE:", mean_squared_log_error(y_true, y_pred))
print("sklearn RMSLE:", root_mean_squared_log_error(y_true, y_pred))


In [None]:
df_example = pd.DataFrame(
    {
        "i": np.arange(len(y_true)),
        "y_true": y_true,
        "y_pred": y_pred,
        "log1p(y_true)": t_true,
        "log1p(y_pred)": t_pred,
        "sq_log_error": diff**2,
    }
)

fig = px.bar(
    df_example,
    x="i",
    y="sq_log_error",
    hover_data=["y_true", "y_pred", "log1p(y_true)", "log1p(y_pred)"],
    title="Per-sample MSLE contribution (squared log error)",
    labels={"i": "sample index", "sq_log_error": "(log1p(y_pred) - log1p(y_true))^2"},
)
fig

## 5) RMSLE vs RMSE: what changes when you take logs?

Consider targets that span orders of magnitude.

- With **RMSE**, a fixed *relative* error (say +20%) produces much larger absolute residuals for large targets, so large targets dominate the metric.
- With **RMSLE**, a fixed *relative* error produces approximately the same log residual, so the contributions are more balanced.


In [None]:
y_true_scale = np.array([1.0, 10.0, 100.0, 1000.0])

# Scenario A: same relative error (20% over)
y_pred_rel = 1.2 * y_true_scale

# Scenario B: same absolute error (+10)
y_pred_abs = y_true_scale + 10.0

def sq_error(y_t, y_p):
    return (y_p - y_t) ** 2

def sq_log_error(y_t, y_p):
    return (np.log1p(y_p) - np.log1p(y_t)) ** 2

df_scale = pd.concat(
    [
        pd.DataFrame(
            {
                "scenario": "20% over",
                "y_true": y_true_scale,
                "y_pred": y_pred_rel,
                "squared error": sq_error(y_true_scale, y_pred_rel),
                "squared log error": sq_log_error(y_true_scale, y_pred_rel),
            }
        ),
        pd.DataFrame(
            {
                "scenario": "+10 absolute",
                "y_true": y_true_scale,
                "y_pred": y_pred_abs,
                "squared error": sq_error(y_true_scale, y_pred_abs),
                "squared log error": sq_log_error(y_true_scale, y_pred_abs),
            }
        ),
    ],
    ignore_index=True,
)

df_long = df_scale.melt(
    id_vars=["scenario", "y_true", "y_pred"],
    value_vars=["squared error", "squared log error"],
    var_name="term",
    value_name="contribution",
)

fig = px.bar(
    df_long,
    x="y_true",
    y="contribution",
    color="term",
    barmode="group",
    facet_col="scenario",
    log_y=True,
    title="Per-sample contributions: RMSE/MSE vs RMSLE/MSLE",
    labels={"y_true": "target (y_true)", "contribution": "contribution (log scale)"},
)
fig.show()

for name, yp in [("20% over", y_pred_rel), ("+10 absolute", y_pred_abs)]:
    rmse = root_mean_squared_error(y_true_scale, yp)
    rmsle = root_mean_squared_log_error(y_true_scale, yp)
    print(f"{name:>12} | RMSE={rmse:.4f} | RMSLE={rmsle:.4f}")


## 6) NumPy implementation (from scratch)

We'll implement MSLE and RMSLE with scikit-learn-like handling:

- 1D and 2D targets (`(n_samples,)` or `(n_samples, n_outputs)`)
- Optional `sample_weight`
- `multioutput` ∈ {`"raw_values"`, `"uniform_average"`} or explicit output weights


In [None]:
def _as_2d(y):
    y = np.asarray(y, dtype=float)
    if y.ndim == 1:
        return y.reshape(-1, 1)
    if y.ndim == 2:
        return y
    raise ValueError("y must be 1D or 2D (n_samples,) or (n_samples, n_outputs).")


def _check_non_negative(y, *, name):
    if np.any(y < 0):
        raise ValueError(f"{name} contains negative values; RMSLE/MSLE require y >= 0.")


def mean_squared_log_error_np(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
    """Mean squared logarithmic error (MSLE).

    MSLE(y, y_hat) = mean((log1p(y_hat) - log1p(y))^2)
    """
    y_true_2d = _as_2d(y_true)
    y_pred_2d = _as_2d(y_pred)

    if y_true_2d.shape != y_pred_2d.shape:
        raise ValueError(f"shape mismatch: y_true{y_true_2d.shape} vs y_pred{y_pred_2d.shape}")

    _check_non_negative(y_true_2d, name="y_true")
    _check_non_negative(y_pred_2d, name="y_pred")

    t_true = np.log1p(y_true_2d)
    t_pred = np.log1p(y_pred_2d)
    residual = t_pred - t_true

    if sample_weight is None:
        msle_per_output = np.mean(residual**2, axis=0)
    else:
        w = np.asarray(sample_weight, dtype=float)
        if w.ndim != 1:
            raise ValueError("sample_weight must be 1D of shape (n_samples,).")
        if w.shape[0] != y_true_2d.shape[0]:
            raise ValueError("sample_weight length must match n_samples.")
        w = w.reshape(-1, 1)
        msle_per_output = np.sum(w * residual**2, axis=0) / np.sum(w, axis=0)

    if multioutput == "raw_values":
        return msle_per_output
    if multioutput == "uniform_average":
        return float(np.mean(msle_per_output))

    weights = np.asarray(multioutput, dtype=float)
    if weights.shape != (msle_per_output.shape[0],):
        raise ValueError("multioutput weights must match n_outputs.")
    return float(np.average(msle_per_output, weights=weights))


def root_mean_squared_log_error_np(
    y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"
):
    """Root mean squared logarithmic error (RMSLE): sqrt(MSLE)."""
    msle_per_output = mean_squared_log_error_np(
        y_true,
        y_pred,
        sample_weight=sample_weight,
        multioutput="raw_values",
    )
    rmsle_per_output = np.sqrt(msle_per_output)

    if multioutput == "raw_values":
        return rmsle_per_output
    if multioutput == "uniform_average":
        return float(np.mean(rmsle_per_output))

    weights = np.asarray(multioutput, dtype=float)
    if weights.shape != (rmsle_per_output.shape[0],):
        raise ValueError("multioutput weights must match n_outputs.")
    return float(np.average(rmsle_per_output, weights=weights))


In [None]:
y_true_rand = rng.lognormal(mean=1.2, sigma=0.9, size=(60, 3))
y_pred_rand = y_true_rand * rng.lognormal(mean=0.0, sigma=0.3, size=y_true_rand.shape)

print("ours raw:", root_mean_squared_log_error_np(y_true_rand, y_pred_rand, multioutput="raw_values"))
print("sk   raw:", root_mean_squared_log_error(y_true_rand, y_pred_rand, multioutput="raw_values"))

sample_w = rng.uniform(0.5, 2.0, size=y_true_rand.shape[0])
print("ours weighted:", root_mean_squared_log_error_np(y_true_rand, y_pred_rand, sample_weight=sample_w))
print("sk   weighted:", root_mean_squared_log_error(y_true_rand, y_pred_rand, sample_weight=sample_w))

assert np.allclose(
    root_mean_squared_log_error_np(y_true_rand, y_pred_rand, multioutput="raw_values"),
    root_mean_squared_log_error(y_true_rand, y_pred_rand, multioutput="raw_values"),
)
assert np.isclose(
    root_mean_squared_log_error_np(y_true_rand, y_pred_rand, sample_weight=sample_w),
    root_mean_squared_log_error(y_true_rand, y_pred_rand, sample_weight=sample_w),
)

# Negative values should raise (to match sklearn)
try:
    root_mean_squared_log_error_np([0.0, 1.0], [0.0, -0.1])
except ValueError as e:
    print("caught:", e)


## 7) RMSLE as an objective: gradients and optimization

Because the square root is monotonic, minimizing RMSLE is equivalent to minimizing MSLE.

Let $\Delta_i = \log(1+\hat y_i) - \log(1+y_i)$. Then:

$$\mathrm{MSLE} = \frac{1}{n}\sum_{i=1}^n \Delta_i^2$$

Derivative w.r.t. a prediction $\hat y_i$ (for $\hat y_i > -1$):

$$\frac{\partial\,\mathrm{MSLE}}{\partial\hat y_i} = \frac{2}{n}\,\Delta_i\,\frac{1}{1+\hat y_i}$$

For RMSLE, by the chain rule (valid where $\mathrm{RMSLE} > 0$):

$$\frac{\partial\,\mathrm{RMSLE}}{\partial\hat y_i} = \frac{1}{n\,\mathrm{RMSLE}}\,\Delta_i\,\frac{1}{1+\hat y_i}$$

Practical takeaway:
- The factor $\frac{1}{1+\hat y_i}$ scales each gradient: for the same log residual $\Delta_i$, the gradient is larger for small predictions and shrinks as predictions grow.
- A very common training trick is to optimize in log space: fit a model to $t = \log(1+y)$ using standard squared error, then transform back with `expm1`.
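
The MSLE derivative above can be checked against central finite differences. This is a minimal sketch on made-up values; the function names `msle` and `msle_grad` are local to the example:

```python
import numpy as np

def msle(y_true, y_pred):
    return np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)

def msle_grad(y_true, y_pred):
    """Analytic gradient of MSLE with respect to each prediction."""
    delta = np.log1p(y_pred) - np.log1p(y_true)
    return (2.0 / y_true.size) * delta / (1.0 + y_pred)

y_true = np.array([0.5, 3.0, 20.0, 150.0])  # illustrative targets
y_pred = np.array([1.0, 2.0, 25.0, 100.0])  # illustrative predictions

# Central finite differences, one coordinate at a time.
h = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.size):
    step = np.zeros_like(y_pred)
    step[i] = h
    numeric[i] = (msle(y_true, y_pred + step) - msle(y_true, y_pred - step)) / (2.0 * h)

analytic = msle_grad(y_true, y_pred)

# Chain rule for the square root: dRMSLE = dMSLE / (2 * RMSLE).
rmsle_value = np.sqrt(msle(y_true, y_pred))
rmsle_grad = analytic / (2.0 * rmsle_value)

print("analytic MSLE grad:", analytic)
print("numeric  MSLE grad:", numeric)
print("RMSLE grad:        ", rmsle_grad)
```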


In [None]:
# Synthetic data with multiplicative noise (log-normal in y)
n = 400
x = rng.uniform(0.0, 6.0, size=n)

# True relationship in log1p-space
t = 1.5 + 1.0 * x + rng.normal(0.0, 0.35, size=n)  # t = log1p(y)
y = np.expm1(t)

# Train/test split
perm = rng.permutation(n)
cut = int(0.8 * n)
tr, te = perm[:cut], perm[cut:]

x_tr, y_tr = x[tr], y[tr]
x_te, y_te = x[te], y[te]


In [None]:
fig = px.scatter(
    x=x_tr,
    y=y_tr,
    opacity=0.7,
    title="Synthetic regression data (y spans a wide range)",
    labels={"x": "feature x", "y": "target y"},
)
fig.update_yaxes(type="log")
fig

In [None]:
def predict_linear(x, w, b):
    x = np.asarray(x, dtype=float)
    return w * x + b


def fit_linear_mse_gd(x, y, *, lr=5e-4, steps=600):
    """Fit y ≈ w x + b by minimizing MSE on y (gradient descent)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    w = 0.0
    b = 0.0
    n = x.shape[0]

    hist = {"mse": [], "rmsle": [], "w": [], "b": []}

    for _ in range(steps):
        y_hat = predict_linear(x, w, b)
        r = y_hat - y

        mse = float(np.mean(r**2))

        # RMSLE isn't defined for negative predictions in sklearn; clip for evaluation.
        y_hat_clip = np.maximum(y_hat, 0.0)
        rmsle = float(root_mean_squared_log_error_np(y, y_hat_clip))

        grad_w = (2.0 / n) * float(np.dot(r, x))
        grad_b = (2.0 / n) * float(np.sum(r))

        w -= lr * grad_w
        b -= lr * grad_b

        hist["mse"].append(mse)
        hist["rmsle"].append(rmsle)
        hist["w"].append(w)
        hist["b"].append(b)

    return w, b, hist


def fit_log1p_mse_gd(x, y, *, lr=0.05, steps=600):
    """Fit log1p(y) ≈ w x + b (equivalent to optimizing MSLE/RMSLE)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.log1p(y)

    w = 0.0
    b = 0.0
    n = x.shape[0]

    hist = {"mse_log": [], "mse_y": [], "rmsle": [], "w": [], "b": []}

    for _ in range(steps):
        t_hat = predict_linear(x, w, b)  # model predicts log1p(y)
        r = t_hat - t

        mse_log = float(np.mean(r**2))

        y_hat = np.expm1(t_hat)
        y_hat = np.maximum(y_hat, 0.0)

        mse_y = float(np.mean((y_hat - y) ** 2))
        rmsle = float(root_mean_squared_log_error_np(y, y_hat))

        grad_w = (2.0 / n) * float(np.dot(r, x))
        grad_b = (2.0 / n) * float(np.sum(r))

        w -= lr * grad_w
        b -= lr * grad_b

        hist["mse_log"].append(mse_log)
        hist["mse_y"].append(mse_y)
        hist["rmsle"].append(rmsle)
        hist["w"].append(w)
        hist["b"].append(b)

    return w, b, hist


In [None]:
w_y, b_y, hist_y = fit_linear_mse_gd(x_tr, y_tr)
w_t, b_t, hist_t = fit_log1p_mse_gd(x_tr, y_tr)

y_hat_te_mse = np.maximum(predict_linear(x_te, w_y, b_y), 0.0)
y_hat_te_log = np.maximum(np.expm1(predict_linear(x_te, w_t, b_t)), 0.0)

print("Test RMSLE (fit MSE on y):     ", root_mean_squared_log_error_np(y_te, y_hat_te_mse))
print("Test RMSLE (fit on log1p(y)):", root_mean_squared_log_error_np(y_te, y_hat_te_log))

print("Test RMSE  (fit MSE on y):     ", root_mean_squared_error(y_te, y_hat_te_mse))
print("Test RMSE  (fit on log1p(y)):", root_mean_squared_error(y_te, y_hat_te_log))


In [None]:
df_hist = pd.DataFrame(
    {
        "step": np.arange(len(hist_y["rmsle"])),
        "RMSLE (fit MSE on y)": hist_y["rmsle"],
        "RMSLE (fit on log1p(y))": hist_t["rmsle"],
    }
)
df_hist_long = df_hist.melt(id_vars="step", var_name="model", value_name="rmsle")

fig = px.line(
    df_hist_long,
    x="step",
    y="rmsle",
    color="model",
    title="Training curves (RMSLE evaluated on the train set)",
    labels={"rmsle": "RMSLE"},
)
fig

In [None]:
df_pred = pd.DataFrame(
    {
        "y_true": np.concatenate([y_te, y_te]),
        "y_pred": np.concatenate([y_hat_te_mse, y_hat_te_log]),
        "model": np.repeat(
            ["fit MSE on y (linear)", "fit on log1p(y)"],
            repeats=len(y_te),
        ),
    }
)

eps = 1e-6
min_v = float(np.minimum(df_pred["y_true"].min(), df_pred["y_pred"].min()))
max_v = float(np.maximum(df_pred["y_true"].max(), df_pred["y_pred"].max()))
min_v = max(min_v, eps)

fig = px.scatter(
    df_pred,
    x="y_true",
    y="y_pred",
    color="model",
    opacity=0.7,
    title="Test predictions: y_true vs y_pred",
    labels={"y_true": "true y", "y_pred": "predicted y"},
)
fig.add_trace(
    go.Scatter(
        x=[min_v, max_v],
        y=[min_v, max_v],
        mode="lines",
        name="y = x",
        line=dict(color="black", dash="dash"),
    )
)
fig.update_xaxes(type="log")
fig.update_yaxes(type="log")
fig

## 8) Practical usage notes (scikit-learn)

- If you want to optimize for RMSLE, a common baseline is:
  1) transform targets with `log1p`
  2) fit a standard regression model
  3) invert predictions with `expm1`
- To **avoid invalid values**, clip predictions to $\hat y \ge 0$ before computing RMSLE.

Scikit-learn's `TransformedTargetRegressor` makes the transform explicit: it applies `func` to the targets before fitting and `inverse_func` to the predictions.


In [None]:
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X_tr = x_tr.reshape(-1, 1)
X_te = x_te.reshape(-1, 1)

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_tr, y_tr)

y_pred_te = model.predict(X_te)
y_pred_te = np.clip(y_pred_te, 0.0, None)

print("sklearn RMSLE:", root_mean_squared_log_error(y_te, y_pred_te))


## 9) Pros, cons, and when to use RMSLE

**Pros**

- Focuses on *multiplicative* errors: being off by a factor matters more than being off by a constant
- Handles targets spanning orders of magnitude (less dominated by large absolute values)
- Natural when noise is approximately log-normal / heteroscedastic (variance grows with the mean)
- Easy to optimize by modeling $\log(1+y)$ and using squared error there

**Cons**

- Requires non-negative targets and predictions (not suitable when $y$ can be negative)
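- Can overweight small targets: relative to RMSE, a fixed absolute miss costs far more on a small target than on a large one
- Reported value is in log units (less directly interpretable than RMSE/MAE)
- If you train in log space and then invert with `expm1`, predictions track the conditional *median* rather than the *mean* in the original space when log-space noise is symmetric, so they tend to underestimate the mean of right-skewed targets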

**Good default when**

- Targets are counts/prices/sales/traffic/demand and you care about relative error
- Targets have a heavy right tail and you want evaluation that doesn't get dominated by the largest cases
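
The median-versus-mean caveat listed under Cons can be illustrated with simulated data. This is a sketch under the assumption that `log1p(y)` is Gaussian; the generator name, seed, and distribution parameters are arbitrary choices for the demonstration:

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # separate generator, arbitrary seed

# Targets whose log1p is Gaussian, so y is right-skewed on the original scale.
t_demo = rng_demo.normal(loc=3.0, scale=1.0, size=200_000)
y_demo = np.expm1(t_demo)

# The best constant under squared error in log1p space is mean(t);
# mapping it back with expm1 gives a prediction on the original scale.
back_transformed = float(np.expm1(t_demo.mean()))
median_y = float(np.median(y_demo))
mean_y = float(y_demo.mean())

print("expm1(mean(log1p(y))):", back_transformed)
print("median(y):            ", median_y)
print("mean(y):              ", mean_y)
```

The back-transformed value lands near the median and well below the mean, so a correction is needed if the mean on the original scale is what you have to report.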


## 10) Common pitfalls and diagnostics

- **Invalid negatives**: RMSLE is not defined for negative values in most libraries; enforce $\hat y \ge 0$ (model choice or clipping).
- **Zero-heavy targets**: inspect performance separately on $y=0$ vs $y>0$; near zero, $\log(1+y) \approx y$, so RMSLE acts like an absolute-error metric rather than a relative one.
- **Compare metrics**: report RMSLE alongside RMSE/MAE; choose based on the cost of absolute vs relative errors.
- **Inspect residuals in log space**: if you optimize for RMSLE, plot $\log(1+\hat y) - \log(1+y)$, not only $\hat y - y$.
- **Remember the +1**: the "relative error" intuition is best when targets are not tiny.
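
Two of the diagnostics above, splitting by $y=0$ vs $y>0$ and inspecting log-space residuals, might look like the sketch below. The data are simulated and the predictions are hypothetical (multiplicative noise plus a constant offset of 0.2), chosen only to make the split visible:

```python
import numpy as np

rng_diag = np.random.default_rng(1)  # separate generator, arbitrary seed
n_obs = 1000

# Zero-inflated targets: roughly 30% zeros, the rest right-skewed.
y_obs = np.where(rng_diag.random(n_obs) < 0.3, 0.0, rng_diag.lognormal(2.0, 1.0, n_obs))

# Hypothetical predictions: multiplicative noise plus a small positive offset.
y_fit = y_obs * rng_diag.lognormal(0.0, 0.3, n_obs) + 0.2

# Residuals in log1p space, the quantity RMSLE actually aggregates.
log_resid = np.log1p(y_fit) - np.log1p(y_obs)

def rmsle_from_resid(r):
    return float(np.sqrt(np.mean(r ** 2)))

is_zero = y_obs == 0.0
rmsle_all = rmsle_from_resid(log_resid)
rmsle_zero = rmsle_from_resid(log_resid[is_zero])
rmsle_pos = rmsle_from_resid(log_resid[~is_zero])

print("RMSLE overall:", rmsle_all)
print("RMSLE on y=0 :", rmsle_zero)
print("RMSLE on y>0 :", rmsle_pos)

# A mean log residual far from 0 suggests systematic over- or underprediction.
print("mean log residual:", float(log_resid.mean()))
```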


## Exercises

1) Add `sample_weight` and explicit `multioutput` weights to the plotting examples and check how the reported RMSLE changes when some samples or outputs matter more.
2) Create a dataset where the true noise is additive (not multiplicative) and compare RMSE vs RMSLE behavior.
3) Show that for large $y$, the per-sample squared log error is approximately the squared log-ratio: $(\log(\hat y/y))^2$.


## References

- scikit-learn metrics API: https://scikit-learn.org/stable/api/sklearn.metrics.html
- Kaggle discussions on RMSLE (common for count/price targets)
