This notebook compares the behavior of regression models trained with different loss functions (a.k.a. objectives) on a toy dataset.

The loss functions considered here include

* mean absolute error (aka. MAE, L1)
* mean squared error (aka. MSE, L2)
* Poisson loss (Poisson regression)
* pinball loss with $\alpha=0.25$ (quantile regression)
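
For reference, the per-example losses can be written as below, with $y$ the label and $\hat{y}$ the prediction. These are the standard textbook forms rather than anything specific to this notebook: the Poisson loss is shown up to terms that do not depend on $\hat{y}$ and assumes $\hat{y} > 0$.

$$
\begin{aligned}
L_{\text{MAE}}(y, \hat{y}) &= |y - \hat{y}| \\
L_{\text{MSE}}(y, \hat{y}) &= (y - \hat{y})^2 \\
L_{\text{Poisson}}(y, \hat{y}) &= \hat{y} - y \log \hat{y} \\
L_{\text{pinball}}^{\alpha}(y, \hat{y}) &= \max\bigl(\alpha (y - \hat{y}),\ (\alpha - 1)(y - \hat{y})\bigr)
\end{aligned}
$$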

With this experiment, we verify that

* minimizing MAE predicts the conditional median
* minimizing MSE predicts the conditional mean
* minimizing Poisson loss predicts the conditional mean
* minimizing pinball loss predicts the conditional $\alpha$-th quantile
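
Before involving any model, the four claims above can be checked in the simplest setting: a single constant prediction for a skewed sample. The sketch below is not part of the original experiment. It draws lognormal labels, scans a grid of candidate constants, and compares the minimizer of each average loss with the corresponding sample statistic. The grid range, sample size, and helper name `average_losses` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)  # skewed, strictly positive labels
alpha = 0.25

# Candidate constant predictions on a grid with a step of 0.01.
candidates = np.linspace(0.05, 4.0, 396)


def average_losses(c: float) -> dict:
    """Average of each loss over the sample when always predicting the constant c."""
    diff = y - c
    return {
        "l1": np.mean(np.abs(diff)),
        "l2": np.mean(diff ** 2),
        "poisson": np.mean(c - y * np.log(c)),
        "quantile": np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)),
    }


rows = [average_losses(c) for c in candidates]
best = {
    name: candidates[np.argmin([row[name] for row in rows])]
    for name in rows[0]
}

print("l1 minimizer      :", best["l1"], "vs sample median  :", np.median(y))
print("l2 minimizer      :", best["l2"], "vs sample mean    :", y.mean())
print("poisson minimizer :", best["poisson"], "vs sample mean    :", y.mean())
print("quantile minimizer:", best["quantile"], "vs sample quantile:", np.quantile(y, alpha))
```

Each minimizer should agree with its sample statistic up to the grid resolution.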

The toy dataset has a single feature $x$, and the label is generated by sampling from a lognormal distribution ($\sigma=1$) conditioned on $x$. We chose a lognormal distribution because

* Poisson regression requires non-negative labels, and lognormal samples are strictly positive
* the conditional mean and the conditional median are clearly separated under a skewed distribution, whereas they coincide under a symmetric one.
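
To make the gap between mean and median concrete, here is a small sketch (not from the original notebook) that compares the closed-form mean $e^{\mu + \sigma^2/2}$ and median $e^{\mu}$ of a lognormal distribution with sample estimates. The values $\mu=0$ and the sample size are arbitrary choices for illustration.

```python
import numpy as np

mu, sigma = 0.0, 1.0
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

closed_form_mean = np.exp(mu + sigma ** 2 / 2)  # exp(0.5), roughly 1.65
closed_form_median = np.exp(mu)                 # exactly 1.0 when mu = 0

print("mean  : closed form", closed_form_mean, "sample", samples.mean())
print("median: closed form", closed_form_median, "sample", np.median(samples))
```

With $\sigma=1$ the mean is about 65% larger than the median, which is what makes the two targets easy to tell apart in the plots.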

We use LightGBM for modeling.

# Sampling the dataset

In [1]:
from IPython.display import HTML, Image, display

display(
    HTML(
        data="""
<style>
   div#notebook-container    { width: 100%; }
   div#menubar-container     { width: 100%; }
   div#maintoolbar-container { width: 100%; }
</style>
"""
    )
)

In [2]:
import altair as alt
import lightgbm
import numpy as np
import pandas as pd
import scipy.special
from tqdm import tqdm
from sklearn.model_selection import train_test_split

In [3]:
np.random.seed(42)

In [4]:
# Controls the variance of the log-normal distribution.
SELECT_SIGMA = 1
# The quantile level alpha used for quantile regression.
SELECT_ALPHA = 0.25
# Number of examples in the dataset
NUM_EXAMPLES = 50000

In [5]:
def cond_quantile(mean: np.ndarray, sigma: float, alpha: float) -> np.ndarray:
    """Calculates the conditional quantile of a lognormal distribution."""
    return np.exp(mean + np.sqrt(2) * sigma * scipy.special.erfinv(2 * alpha - 1))
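
As a sanity check on the closed-form quantile, the sketch below compares the same formula with `scipy.stats.lognorm.ppf`. This check is an addition, not part of the original notebook. The function is repeated so the snippet runs on its own, and the test points in `xs` are arbitrary.

```python
import numpy as np
import scipy.special
import scipy.stats


def cond_quantile(mean, sigma, alpha):
    """Same closed-form lognormal quantile as in the cell above."""
    return np.exp(mean + np.sqrt(2) * sigma * scipy.special.erfinv(2 * alpha - 1))


xs = np.array([-1.0, 0.0, 0.5, 1.0])
sigma, alpha = 1.0, 0.25

ours = cond_quantile(xs, sigma, alpha)
# scipy parameterizes the lognormal with s = sigma and scale = exp(mean of log y).
reference = scipy.stats.lognorm.ppf(alpha, s=sigma, scale=np.exp(xs))

print(np.allclose(ours, reference))
```

At `alpha=0.5` the `erfinv` term vanishes, so the formula reduces to the median `np.exp(mean)`.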

In [6]:
def sample_dataset(sigma: float, alpha: float, size: int) -> pd.DataFrame:
    """Sample a dataset from the log-normal distribution.

    Args:
        sigma: the parameter that tunes the variance of the log-normal distribution.
        alpha: alpha for calculating the true alpha-th quantile.
        size: the size of the dataset to sample.

    Returns:
        a dataframe with xs, ys, true conditional mean, median and alpha-th quantile.
    """

    xs = np.random.uniform(-1, 1, size=size)
    # Use the `sigma` argument rather than the global SELECT_SIGMA.
    ys = np.random.lognormal(mean=xs, sigma=sigma, size=size)

    df = pd.DataFrame({"x": xs, "y": ys}).assign(
        cond_mean=lambda df: np.exp(df.x + sigma ** 2 / 2),
        cond_median=lambda df: np.exp(df.x),
        cond_quantile=lambda df: cond_quantile(mean=df.x.to_numpy(), sigma=sigma, alpha=alpha),
    )
    return df

In [7]:
df_data = sample_dataset(sigma=SELECT_SIGMA, alpha=SELECT_ALPHA, size=NUM_EXAMPLES)

In [8]:
df_data.sample(5, random_state=42)

| | x | y | cond_mean | cond_median | cond_quantile |
|---|---|---|---|---|---|
| 33553 | -0.440548 | 0.233526 | 1.061255 | 0.643684 | 0.327903 |
| 9427 | 0.688419 | 0.968335 | 3.281889 | 1.990566 | 1.014027 |
| 199 | 0.559751 | 0.548493 | 2.885653 | 1.750237 | 0.891599 |
| 12447 | 0.996614 | 4.706682 | 4.466539 | 2.709093 | 1.380056 |
| 39489 | 0.210857 | 2.46192 | 2.035736 | 1.234736 | 0.628995 |


In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    df_data[["x", "cond_mean", "cond_median", "cond_quantile"]],
    df_data["y"],
    test_size=0.33,
    random_state=42,
)

In [10]:
model_kwargs = dict(
    learning_rate=0.02, n_estimators=300, n_jobs=8, 
)

models = {}
models['l1'] = lightgbm.LGBMModel(objective='l1', **model_kwargs)
models['l2'] = lightgbm.LGBMModel(objective='l2', **model_kwargs)
models['poisson'] = lightgbm.LGBMModel(objective='poisson', **model_kwargs)
models['quantile'] = lightgbm.LGBMModel(objective='quantile', alpha=SELECT_ALPHA, **model_kwargs)

In [11]:
fit_kwargs = dict(
    X=X_train[["x"]],
    y=y_train,
    eval_metric=["l1", "l2", "poisson", "quantile"],
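    eval_names=["eval"],
    eval_set=[(X_test[["x"]], y_test)],
    # `verbose` is accepted by older LightGBM releases; newer releases may require
    # a logging callback such as lightgbm.log_evaluation instead.
    verbose=False,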
)

for model in tqdm(models.values()):
    model.fit(**fit_kwargs)

100%|██████████| 4/4 [00:01<00:00,  2.81it/s]


In [12]:
df_eval = pd.concat(
    [
        pd.DataFrame(model.evals_result_["eval"])
        .stack()
        .to_frame(name="metric_value")
        .rename_axis(["n_estimators", "metric_name"])
        .reset_index()
        .assign(model_objective=objective)
        for objective, model in models.items()
    ]
)
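
The reshaping above turns a wide table (one column per metric, one row per boosting iteration) into a long table that is convenient for faceted plotting. A minimal sketch on a made-up two-metric history shows the effect. The numbers in `toy_evals` are invented for illustration and are not results from the models.

```python
import pandas as pd

# Hypothetical evaluation history: metric name -> value after each boosting iteration.
toy_evals = {"l1": [0.9, 0.8, 0.7], "l2": [2.0, 1.5, 1.2]}

df_long = (
    pd.DataFrame(toy_evals)           # wide: rows are iterations, columns are metrics
    .stack()                          # long: one row per (iteration, metric) pair
    .to_frame(name="metric_value")
    .rename_axis(["n_estimators", "metric_name"])
    .reset_index()
)

print(df_long)
```

The result has one row per (iteration, metric) pair with columns `n_estimators`, `metric_name`, and `metric_value`. Note that `n_estimators` here is the zero-based iteration index.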

In [13]:
df_eval.shape

(4800, 4)

In [14]:
alt.Chart(df_eval, width=300, height=250).mark_line().encode(
    x="n_estimators:Q",
    y=alt.Y("metric_value", scale=alt.Scale(zero=False)),
    color="model_objective:N",
).facet(facet="metric_name").resolve_scale(y="independent")

Note:

* The metrics are correlated: all of them decrease during training even though each model minimizes only one of them as its objective.
* Each metric generally decreases fastest for the model that uses it as its objective. One exception occurs early in training in the Poisson panel, where the L2-objective model reduces the error faster than the Poisson-objective model. A plausible reason is that minimizing either the L2 or the Poisson objective predicts the conditional mean. The curves for these two objectives also stay close to each other in every panel.

In [15]:
dfs_test = []

for objective, model in models.items():
    for label_metric_col in [
        "cond_median",
        "cond_mean",
        "cond_quantile",
    ]:
        X_test_copy = (
            (
                X_test[["x", label_metric_col]]
                .rename(columns={label_metric_col: "label"})
                .assign(
                    label_metric=label_metric_col,
                    model_objective=objective,
                )
            )
            .assign(predicted=model.predict(X_test[["x"]]))
            .assign(err=lambda df: df["predicted"] - df["label"])
        )
        dfs_test.append(X_test_copy)
df_test = pd.concat(dfs_test)

In [16]:
# Fix random_state so the sampled points, and hence the plot, are reproducible.
df_plot = df_test.loc[lambda df: df["label"] < 3].sample(5000, random_state=42).assign(zero=0)

circles = (
    alt.Chart()
    .mark_circle(size=10)
    .encode(
        x="label:Q",
        y=alt.Y("err:Q", title="Residual (predicted - label)"),
        color="model_objective",
    )
)

lines = (
    alt.Chart()
    .mark_rule()
    .encode(
        y="zero",
    )
)

(circles + lines).facet(facet="label_metric", data=df_plot)

Note:

* The models with the L2 and Poisson objectives predict the conditional mean, so their residuals against the conditional mean ground truth are close to 0.
* The model with the L1 objective predicts the conditional median, so its residuals against the conditional median ground truth are close to 0.
* The model with the quantile objective predicts the conditional quantile, so its residuals against the conditional quantile ground truth are close to 0.
* The circles for the L2 and Poisson objectives overlap because minimizing either of them predicts the conditional mean (see the derivations in [these notebooks](https://github.com/zyxue/sutton-barto-rl-exercises/tree/master/supervised/loss_function_properties)).
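
A short sketch of why both objectives target the conditional mean: take a constant prediction $c > 0$ at a fixed $x$, and assume expectation and differentiation can be exchanged. Setting the derivative of each expected loss to zero gives

$$
\begin{aligned}
\frac{d}{dc}\, E\bigl[(y - c)^2 \mid x\bigr] &= -2\bigl(E[y \mid x] - c\bigr) = 0 &&\Rightarrow\quad c = E[y \mid x] \\
\frac{d}{dc}\, E\bigl[c - y \log c \mid x\bigr] &= 1 - \frac{E[y \mid x]}{c} = 0 &&\Rightarrow\quad c = E[y \mid x]
\end{aligned}
$$

Both expected losses are convex in $c$ when $E[y \mid x] > 0$, so the stationary point is the minimizer in each case.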