In [32]:
import numpy as np

## Loss Functions

### Outline:
- This notebook will derive the forward (loss function calculation) and backward (derivative of loss function with respect to the prediction $\hat{y}$) mathematical formulas of various loss functions in machine learning.
- Python implementation of both forward and backward

### Motivation for loss functions

The role of a loss function is to measure the "gap" between true labels and predicted labels. The term "gap" is intentionally vague, because every loss function defines it differently. Once we have measured the gap, the model needs to adjust itself so that it is closer to correct next time. This is where the "backward" pass, i.e. the derivative of the loss function, comes into play.

### How loss functions help the model learn from its mistake

For a model with parameters $\theta$, the model parameter update rule (a.k.a Gradient Descent Update) is:

$$
\theta^{(t+1)} = \theta^{(t)} - \eta \, \frac{\partial L}{\partial \theta}
$$

or
$$
\theta_{new} = \theta_{old} - \eta \, \frac{\partial L}{\partial \theta}
$$

where:
- $\theta^{(t)}$ is the parameter/weight vector at iteration $t$,
- $\eta$ is the learning rate,
- $\frac{\partial L}{\partial \theta}$ is the gradient of the loss function $L$ with respect to the parameters $\theta$.


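The gradient with respect to the parameters, $\frac{\partial L}{\partial \theta}$, and the gradient with respect to the prediction, $\frac{\partial L}{\partial \hat{y}}$, are connected through the chain rule. In a real model, you never update $\hat{y}$ directly; you update the parameters $\theta$ so that the output $\hat{y} = f(x; \theta)$ gets closer to the target $y$.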

Remember, by the chain rule, we have:

$$
\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \theta}
$$

And we know that $\hat{y} = f(x; \theta)$, so if $x$ doesn't change, we are modeling the relationship between $\theta$ and $\hat{y}$.

If you imagine a case where the relationship between $\theta$ and $\hat{y}$ is simple (say, $\hat{y}$ is a linear function of $\theta$ so that $\frac{\partial \hat{y}}{\partial \theta}$ behaves like an identity function), then the update on $\theta$ results in a change in $\hat{y}$ that is (up to scaling) the same as directly subtracting $\eta \frac{\partial L}{\partial \hat{y}}$. That is, conceptually you can write

$$
\begin{align*}
\hat{y}_{new} &= f(x; \theta_{new}) \\
&= f(x; \theta_{old} - \eta \, \frac{\partial L}{\partial \theta}) \\
&\approx \hat{y}_{\text{old}} - \eta \frac{\partial L}{\partial \hat{y}} \\
\end{align*}
$$

**Note:** The last equation is an oversimplified version, which doesn't always hold if $f$ is non-linear, which is often the case. It's used to conceptually illustrate the model learning process with respect to the gradients of the loss functions $L$.
- If $\frac{\partial L}{\partial \hat{y}}$ is positive (indicating $\hat{y}$ is too high), the update subtracts a positive quantity, thus decreasing $\hat{y}$.
- If $\frac{\partial L}{\partial \hat{y}}$ is negative (indicating $\hat{y}$ is too low), subtracting a negative value increases $\hat{y}$.
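
The chain-rule update above can be sketched numerically. The snippet assumes a hypothetical one-parameter model $\hat{y} = \theta x$ with a squared-error loss; the model, the data point, and the learning rate are all illustrative choices rather than anything defined elsewhere in this notebook.

```python
import numpy as np

# Hypothetical setup (assumption): y_hat = theta * x and L = (y_hat - y)^2
x, y = 2.0, 3.0
theta, lr = 0.5, 0.05

for step in range(5):
    y_hat = theta * x                      # forward pass of the model
    dL_dyhat = 2.0 * (y_hat - y)           # backward of the loss w.r.t. the prediction
    dyhat_dtheta = x                       # backward of the model w.r.t. the parameter
    dL_dtheta = dL_dyhat * dyhat_dtheta    # chain rule
    theta = theta - lr * dL_dtheta         # gradient descent update
    print(f"step={step} y_hat={y_hat:.4f} dL/dy_hat={dL_dyhat:+.4f} theta={theta:.4f}")
```

Because $\hat{y}$ starts below $y$, $\frac{\partial L}{\partial \hat{y}}$ is negative at every step, so each update raises $\theta$ and therefore $\hat{y}$.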

### Binary Cross Entropy
Also known as: "Log Loss" or "negative log-likelihood"

#### Forward
$$
- \frac{1}{N} \sum_{i=1}^N \left[ y_i \log{\hat{y_i}} + (1 - y_i) \log{(1 - \hat{y_i})} \right]
$$

Where 
- $N$ is the total number of samples
- $y_i$ is the true label (0 or 1) for the $i$-th sample
- $\hat{y_i}$ is the predicted probability for the $i$-th sample

The formulation of this loss function actually comes from how we model the probability mass function (PMF) of a Bernoulli-distributed random variable $Y$, which represents a binary outcome of 0 or 1.
The likelihood of $Y$ taking on a specific value of $y$ (either 0 or 1) is:

$$
P(Y = y) = p^y (1 - p)^{1-y}
$$

Then if we take the log-likelihood of $Y$:
$$
\log P(Y = y) = y \log (p) + (1-y) \log (1 - p)
$$

In our machine learning world, we want to maximize this likelihood for our model to correctly model this distribution. But since gradient descent usually is concerned with minimizing the loss function, we will convert this problem from: **maximize log likelihood** to **minimize negative log likelihood**, so we take the negation of that, which gives us exactly the first formula known as the "Binary Cross Entropy" loss:
$$
- \left[ y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})} \right]
$$

To get the batch loss, we average this per-sample loss over the $N$ samples in an iteration. I will omit the averaging here since the per-sample form is otherwise exactly the same.

In [33]:
y_true = np.ones((4,))
y_pred = np.random.rand(4)
y_pred[0] = 0.99 # make the first prediction close to true label of 1
print(y_true, y_pred)

def binary_cross_entropy(y_true, y_pred, reduce="mean"):
    raw_bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    print(f"{raw_bce=}")
    if reduce == "mean":
        return np.mean(raw_bce)
    elif reduce == "sum":
        return np.sum(raw_bce)

binary_cross_entropy(y_true, y_pred) # the first sample's loss should be close to 0, while the others are higher

[1. 1. 1. 1.] [0.99       0.46340523 0.14644962 0.91515728]
raw_bce=array([0.01005034, 0.76915337, 1.92107383, 0.08865934])


np.float64(0.6972342200251623)
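
One practical caveat: if a prediction is exactly 0 or 1, the formula evaluates `0 * log(0)`, which NumPy returns as `nan`. A common workaround is to clip the predictions away from the boundaries first. The sketch below is one way to do that; the function name and the `eps` value are assumptions made for illustration.

```python
import numpy as np

def binary_cross_entropy_clipped(y_true, y_pred, eps=1e-12):
    # Clip predictions into [eps, 1 - eps] so np.log never receives exactly 0.
    # eps is an assumed value for this sketch, not a setting used elsewhere here.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

labels = np.array([1.0, 0.0, 1.0])
probs = np.array([1.0, 0.0, 0.5])   # two "perfect" predictions sit exactly on 1 and 0
safe_loss = binary_cross_entropy_clipped(labels, probs)
print(safe_loss)                    # finite, dominated by the 0.5 prediction
```

The clipping slightly biases the loss for extreme predictions, which is usually an acceptable trade for avoiding `nan` values.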

#### Backward

Recall
$$
\text{BCE} = - \left[ y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})} \right] \\
$$

Compute the derivative separately for the two terms:
$$
\begin{align*}
\frac{\partial}{\partial \hat{y}} \bigl[ y \log \hat{y} \bigr] &= y \cdot \frac{1}{\hat{y}} \\
\frac{\partial}{\partial \hat{y}} \bigl[ (1 - y) \log{(1 - \hat{y})} \bigr] &= (1 - y) \cdot \frac{-1}{1-\hat{y}}
\end{align*}
$$

Combine terms:
$$
\begin{align*}
\frac{\partial \text{BCE}}{\partial \hat{y}} &= - \left( y \cdot \frac{1}{\hat{y}} + (1 - y) \cdot \frac{-1}{1-\hat{y}} \right) \\
&= - \frac{y}{\hat{y}} + \frac{1 - y}{1-\hat{y}} \\
&= \frac{(1 - y)\,\hat{y} - y\,(1 - \hat{y})}{\hat{y} (1 - \hat{y})} \\
&= \frac{\hat{y} - y}{\hat{y} (1 - \hat{y})}
\end{align*}
$$

Now we can see that:

- When $y = 1$, the gradient simplifies to $-\frac{1}{\hat{y}}$, which is very negative when $\hat{y}$ is small, pushing $\hat{y}$ upwards toward $y = 1$.
- When $y = 0$, the gradient simplifies to $\frac{1}{1 - \hat{y}}$, which is very positive when $\hat{y}$ is large, pushing $\hat{y}$ downwards toward $y = 0$.

The pushing of $\hat{y}$ can be seen with this update function
$$
\hat{y}_{new} = \hat{y}_{old} - \text{learning rate} \cdot \frac{\partial \text{BCE}}{\partial \hat{y}}
$$

- When the gradient is very positive, subtracting it pushes $\hat{y}$ downwards, closer to $y = 0$.
- When the gradient is very negative, subtracting it pushes $\hat{y}$ upwards, closer to $y = 1$.



In [34]:
def bce_backward(y_true, y_pred):
    return (y_pred - y_true) / (y_pred * (1 - y_pred))

print(y_true, y_pred)
bce_backward(y_true, y_pred) # Every gradient is negative because y = 1; the further y_pred is from 1, the more negative it is, so the update pushes y_pred up toward 1

[1. 1. 1. 1.] [0.99       0.46340523 0.14644962 0.91515728]


array([-1.01010101, -2.15793851, -6.82828697, -1.09270835])
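
A quick way to gain confidence in a derived gradient is to compare it against a central finite difference. The sketch below re-declares the per-sample loss and its gradient under new names so it runs on its own; the test points and the step size `h` are assumed values.

```python
import numpy as np

def bce_per_sample(y, p):
    # Per-sample binary cross entropy (no reduction)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad(y, p):
    # Analytic derivative w.r.t. the predicted probability p
    return (p - y) / (p * (1 - p))

y_chk = np.array([1.0, 0.0, 1.0, 0.0])
p_chk = np.array([0.9, 0.2, 0.3, 0.6])
h = 1e-6  # finite-difference step (assumption)

# Central difference: (f(p + h) - f(p - h)) / (2h)
numeric = (bce_per_sample(y_chk, p_chk + h) - bce_per_sample(y_chk, p_chk - h)) / (2 * h)
analytic = bce_grad(y_chk, p_chk)
print(numeric)
print(analytic)
print(np.allclose(numeric, analytic, atol=1e-5))
```

The signs also match the interpretation above: negative where $y = 1$ and positive where $y = 0$.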

### Cross Entropy

Let's generalize to the cross entropy loss formula, which can be used to compute the difference between two probability distributions.

We will try to keep the notation consistent. In case we get lost, these notations are equivalent:
- $p_{true}$ = $y_{true}$
- $p_{pred}$ = $y_{pred}$ = $\hat{y}$

There are a couple of ways to compute the cross entropy loss, depending on the type of inputs we pass in. We will discuss each one separately.

##### 1. probabilities for $y_{pred}$ and one-hot encoding for $y_{true}$

Inputs:
- $y_{true}$: A **one-hot** encoded vector representing the true class (e.g., [0, 1, 0] for class 2).
- $y_{pred}$: A probability distribution over classes (typically obtained by applying softmax to logits).

What It Does:
- The formula  $-\sum_{c} y_{\text{true},c} \log(y_{\text{pred},c})$  is the general definition of cross entropy.
- In a **one-hot** scenario, only the term corresponding to the correct class contributes to the loss, so it simplifies to  $-\log(\hat{y}_{\text{correct}})$.

Loss:
$$
\text{CE} = -\sum_{c} y_{\text{true}, c} \,\log \bigl(y_{\text{pred}, c}\bigr).
$$

In a one-hot scenario, only the term corresponding to the correct class contributes, so it simplifies to

$$
-\log \bigl(y_{\text{pred}, t}\bigr)
$$

where $t$ is the target class index.

#### Forward

Let's recall the binary cross entropy formula, which is the case $C = 2$:

$$
\begin{gather*}
y_1 = y \; \text{and} \; y_0 = 1-y \\
\hat{y}_1 = \hat{y} \; \text{and} \; \hat{y}_0 = 1-\hat{y} \\
\text{Cross Entropy} = \text{Binary Cross Entropy} = -\Bigl[y \log(\hat{y}) + (1-y) \log(1-\hat{y})\Bigr]
\end{gather*}
$$

When we have more than 2 classes, let's denote the number of classes as $C$ and sum over all of them:

$$
\text{Cross Entropy (CE)}
= -\sum_{c=1}^{C} y_{\text{true}, c} \,\log \bigl(y_{\text{pred}, c}\bigr)
$$

In [35]:
def cross_entropy_one_hot(y_true, y_pred):
    return - np.sum(y_true * np.log(y_pred))

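ce = cross_entropy_one_hot(y_true, y_pred)
# Use reduce="sum" so the BCE total is comparable to the summed cross entropy.
# With every label equal to 1, the (1 - y) term vanishes, so the two totals match.
bce = binary_cross_entropy(y_true, y_pred, reduce="sum")
print(ce, bce)
np.isclose(bce, ce)

raw_bce=array([0.01005034, 0.76915337, 1.92107383, 0.08865934])
2.7889368801006493 2.7889368801006493


np.True_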
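
For a single sample with a genuine one-hot label, the general sum and the shortcut $-\log(y_{\text{pred}, t})$ give the same number. The sketch below uses an assumed three-class probability vector purely for illustration.

```python
import numpy as np

one_hot = np.array([0.0, 1.0, 0.0])      # true class is index 1
probs_3 = np.array([0.2, 0.7, 0.1])      # assumed example distribution, sums to 1

ce_full = -np.sum(one_hot * np.log(probs_3))   # general definition: sum over all classes
ce_short = -np.log(probs_3[1])                 # only the target-class term survives
print(ce_full, ce_short)
```

The zeros in the one-hot vector remove every term except the target class, which is why the two values agree.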

##### 2. logits for $y_{pred}$ and target index for $y_{true}$

Inputs:
- $\mathbf{z} = (z_1, \dots, z_C)$ are the logits (raw scores) from the model.
- $t$ (an integer) is the index of the correct class.

What happens internally:
- Numerical stability: the implementation subtracts the maximum logit from all logits before exponentiating. This shift (the idea behind the log-sum-exp trick) helps prevent numerical overflow and does not change the softmax output.
- Softmax: the shifted logits are converted into a probability distribution $\hat{y}_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}$



Since the true label is given as an index, the loss is computed as:

$$
\text{Cross Entropy} = -\log(y_{\text{pred}, t}) = -\log \biggl( \text{softmax}(\mathbf{z})_t \biggr) = -\log \biggl( \frac{e^{z_t}}{\sum_j e^{z_j}} \biggr)
$$

Where $t$ is the target class index, and $j$ ranges over all the classes.

This matches the generic cross entropy definition in the one-hot case (only the target class term contributes).

In [36]:
def cross_entropy_logits(logits, target_index):
    max_logit = np.max(logits)
    stable_logits = logits - max_logit
    
    exp_logits = np.exp(stable_logits)
    softmax = exp_logits / np.sum(exp_logits)
    
    loss = - np.log(softmax[target_index])
    return loss, softmax

loss, softmax = cross_entropy_logits(np.array([1,2,2]), 1)

print(loss, softmax)

0.8619948040582511 [0.1553624 0.4223188 0.4223188]
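
To see why the max-subtraction matters, compare a naive softmax with a shifted one on very large logits. The function names and logit values below are assumptions chosen to trigger overflow in `float64`.

```python
import numpy as np

def softmax_naive(z):
    # Exponentiates the raw logits directly; overflows for large values
    e = np.exp(z)
    return e / np.sum(e)

def softmax_shifted(z):
    # Subtracting the max makes the largest exponent exp(0) = 1
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

big_logits = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):  # silence the overflow warnings
    naive = softmax_naive(big_logits)
shifted = softmax_shifted(big_logits)
print(naive)     # exp(1000) overflows to inf, and inf / inf is nan
print(shifted)   # a valid probability distribution
```

The shifted version gives the same result as softmax on `[0, 1, 2]`, since adding a constant to every logit leaves softmax unchanged.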


#### Backward

We first consider the derivative of the cross entropy with respect to the predicted probability $y_{\text{pred}}$ for the target class t:

$$
\frac{\partial \text{CE}}{\partial y_{\text{pred}, t}}
= - \frac{\partial}{\partial y_{\text{pred}, t}} \bigl[\log(y_{\text{pred}, t})\bigr]
= -\frac{1}{y_{\text{pred}, t}}.
$$

For any other class $j \neq t$, we have $y_{\text{true}, j} = 0$, so the loss does not directly depend on $y_{\text{pred}, j}$:

$$
\frac{\partial \text{CE}}{\partial y_{\text{pred}, j}}
= 0 \quad \text{(for } j \neq t\text{)}.
$$

**Gradient with Respect to Logits ($z_i$)**

Recall that

$$
\text{CE} = -\log\!\Bigl(\frac{e^{z_t}}{\sum_{j} e^{z_j}}\Bigr)
$$

and

$$
y_{\text{pred}, i} = \frac{e^{z_i}}{\sum_{k} e^{z_k}}
$$

By applying the chain rule (and using the known derivative of softmax), we arrive at the well-known result for all classes i:

$$
\frac{\partial \text{CE}}{\partial z_i} = y_{\text{pred}, i} - y_{\text{true}, i}
$$

**In more detail:**

1.	Softmax Derivative:

$$
\frac{\partial y_{\text{pred}, i}}{\partial z_j} = y_{\text{pred}, i}\,\bigl(\delta_{ij} - y_{\text{pred}, j}\bigr)
$$


where $\delta_{ij}$ is the Kronecker delta (1 if $i=j$, 0 otherwise).

2.	Chain Rule:

$$
\frac{\partial \text{CE}}{\partial z_i}
= \sum_{j} \frac{\partial \text{CE}}{\partial y_{\text{pred}, j}}
\cdot \frac{\partial y_{\text{pred}, j}}{\partial z_i}
$$

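Only the term for $j = t$ is nonzero in $\frac{\partial \text{CE}}{\partial y_{\text{pred}, j}}$, where it equals $-\frac{1}{y_{\text{pred}, t}}$. Substituting the softmax derivative and simplifying leads to

$$
\begin{align*}
\frac{\partial \text{CE}}{\partial z_i} &= -\frac{1}{y_{\text{pred}, t}} \cdot y_{\text{pred}, t}\,\bigl(\delta_{ti} - y_{\text{pred}, i}\bigr) \\
&= y_{\text{pred}, i} - \delta_{it} \\
&= y_{\text{pred}, i} - y_{\text{true}, i}
\end{align*}
$$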

**Interpretation**

Since this gradient is taken with respect to the logits, the pushing is easiest to see on $z$ with this update function
$$
z_{new} = z_{old} - \text{learning rate} \cdot \frac{\partial \text{CE}}{\partial z}
$$

1.	Target Class ($i = t$):
    $$
    \frac{\partial \text{CE}}{\partial z_t} = y_{\text{pred}, t} - 1
    $$

    If $y_{\text{pred}, t} < 1$, this gradient is negative, indicating $z_t$ should increase (raising the probability of the correct class).

2.	Non-Target Classes ($i \neq t$):
    $$
    \frac{\partial \text{CE}}{\partial z_i} = y_{\text{pred}, i}
    $$

    A positive gradient means $z_i$ should decrease to lower the probability for incorrect classes.

In [37]:
def grad_cross_entropy(softmax, target_index):
    grad = softmax.copy()
    grad[target_index] -= 1  # Subtract 1 for the target index
    return grad

grad = grad_cross_entropy(softmax, target_index=1)
print(grad, softmax)

[ 0.1553624 -0.5776812  0.4223188] [0.1553624 0.4223188 0.4223188]
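
The shortcut $y_{\text{pred}} - y_{\text{true}}$ can be cross-checked by carrying out the chain rule explicitly with the softmax Jacobian. This is a minimal sketch; it reuses the logits `[1, 2, 2]` and target index 1 from the earlier cell, while the variable names are new.

```python
import numpy as np

z = np.array([1.0, 2.0, 2.0])   # logits
t = 1                           # target class index

p = np.exp(z - np.max(z))
p /= np.sum(p)                  # softmax probabilities

# Softmax Jacobian: J[i, j] = d p_i / d z_j = p_i * (delta_ij - p_j)
J = np.diag(p) - np.outer(p, p)

# dCE/dp is -1/p_t at the target class and 0 elsewhere
dCE_dp = np.zeros_like(p)
dCE_dp[t] = -1.0 / p[t]

# Chain rule: dCE/dz_i = sum_j dCE/dp_j * dp_j/dz_i
grad_chain = J.T @ dCE_dp

one_hot_t = np.zeros_like(p)
one_hot_t[t] = 1.0
print(grad_chain)
print(p - one_hot_t)
```

Both printed vectors should agree, and their entries sum to zero because the probabilities sum to 1.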


### Mean Squared Error
Also known as: "L2 loss"

The Mean Squared Error measures the average of the squares of the errors between the predicted values and the actual values. Usually used in solving regression problems.

1.	Punishing Outliers
    - Because of the squaring, MSE penalizes larger deviations heavily, making the model try very hard to avoid large mistakes. This can be good if you truly want to push the model to be accurate across all samples, but it can also make MSE sensitive to outliers.
2.	Relationship to Variance
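    - Minimizing MSE is related to variance: the constant prediction that minimizes MSE is the mean of the targets, and the MSE at that point equals their variance (in the simplest case of linear models, it also ties in neatly with least squares solutions).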
3.	Common Use Cases
    - Classic linear regression.
    - Neural networks for regression outputs (e.g., predicting continuous values).
    - Any scenario where the average squared difference is a sensible measure of error.

#### Forward
$$
\text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2
$$

Where:
- $N$ is the total number of samples
- $y_i$ is the true label (regression target) for the $i$-th sample
- $\hat{y}_i$ is the predicted value for the $i$-th sample

In [38]:
y_true = np.array([3.0, -1.0, 2.0, 7.0])
y_pred = np.array([2.5, -2.0, 2.0, 8.0])

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

mse_value = mean_squared_error(y_true, y_pred)
print("MSE:", mse_value)

MSE: 0.5625


#### Backward

We want the derivative of MSE with respect to $\hat{y}_i$:
$$
\frac{\partial \text{MSE}}{\partial \hat{y}_i}
= \frac{\partial}{\partial \hat{y}_i} \left( \frac{1}{N} \sum_{j=1}^N (y_j - \hat{y}_j)^2 \right)
= \frac{2}{N} (\hat{y}_i - y_i)
$$

In vectorized form for all samples:

$$
\nabla_{\hat{y}} \text{MSE} = \frac{2}{N} (\hat{y} - y)
$$

**Interpretation:**

The pushing of $\hat{y}$ can be seen with this update function
$$
\hat{y}_{new} = \hat{y}_{old} - \text{learning rate} \cdot \frac{\partial \text{MSE}}{\partial \hat{y}}
$$


- Proportional Error Correction: The gradient is directly proportional to the difference $\hat{y} - y$. If the prediction $\hat{y}$ is much higher than the target $y$, the gradient is large and positive (by the formula a couple of lines above), signaling the model to decrease its prediction. Conversely, if $\hat{y}$ is too low, the gradient is large and negative, pushing the prediction upward.
- Smooth and Symmetric: Because the error is squared, the penalty increases quadratically with the error size. This smooth, continuous gradient helps the optimizer make fine adjustments, especially as predictions get closer to the true values.
- Diminishing Updates: As the model improves and the error decreases, the gradient becomes smaller, resulting in smaller updates. This is helpful for fine-tuning the model as it converges toward the optimal solution.

In [39]:
def mse_backward(y_true, y_pred):
    # N is the number of samples
    N = y_true.size
    return 2.0 * (y_pred - y_true) / N

mse_grad = mse_backward(y_true, y_pred)
print("MSE gradient:", mse_grad)

MSE gradient: [-0.25 -0.5   0.    0.5 ]
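
The "diminishing updates" behaviour can be seen by applying the conceptual update to the predictions a few times. The sketch below copies the example values under new names and uses an assumed learning rate of 0.5.

```python
import numpy as np

targets = np.array([3.0, -1.0, 2.0, 7.0])
preds = np.array([2.5, -2.0, 2.0, 8.0])
lr = 0.5  # assumed learning rate

mse_history = []
for step in range(5):
    g = 2.0 * (preds - targets) / targets.size          # MSE gradient w.r.t. predictions
    mse_history.append(np.mean((targets - preds) ** 2))
    print(f"step={step} mse={mse_history[-1]:.6f} max|grad|={np.max(np.abs(g)):.6f}")
    preds = preds - lr * g                               # conceptual update on y_hat
```

Each step shrinks every error by the same factor, so both the loss and the gradient magnitude decay geometrically.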


### Hinge Loss

Commonly used in support vector machines and other “maximum-margin” methods for classification problems. It assumes labels in $\{-1, +1\}$.

We will try to give an overview without going into Support Vector Machine details.

**Key characteristics:**
1.	Margin Enforcement
    - The condition $y \hat{y} \ge 1$ means the data point is not just classified correctly ($y \hat{y}>0$) but also lies on or outside the margin boundary (distance $\ge \frac{1}{\|\mathbf{w}\|}$ from the decision boundary).
    - When $y \hat{y} \ge 1$, the hinge loss is 0, indicating “no penalty” if the point is on the correct side with some margin.
2.	Focus on Hard Examples
    - If a sample is correctly classified and beyond the margin, it doesn’t contribute to the loss or gradient.
    - Only misclassified or within-margin points incur a penalty, making the training more focused on the “hard” or borderline cases.
3.	Linear Gradient
    - For points within the margin or misclassified, the gradient with respect to $\hat{y}$ is the constant $-y$. The loss has a kink at $y \hat{y} = 1$, so it is optimized with sub-gradient methods (as in SVM solvers).
4.	Use Cases
    - SVM classification (especially linear or kernel SVMs).
    - Sometimes in neural nets for classification, though logistic-based or cross-entropy losses are more common there.

#### Forward
The Hinge Loss is defined as:

$$
\begin{align*}
\text{Hinge}(y, \hat{y}) &= \max\Bigl(0, 1 - y \cdot \hat{y}\Bigr) \\
&= \max\Bigl(0,\, 1 - y\,(\mathbf{w}\cdot \mathbf{x} + b)\Bigr)
\end{align*}
$$


Where:
- $y$ is the true label in $\{-1, +1\}$
- $\hat{y}$ is the predicted score (often the output before a sign function); for a linear model, $\hat{y} = \mathbf{w}\cdot \mathbf{x} + b$
- $\mathbf{x}$ is the input sample
- $\mathbf{w}$ is the weight/parameter vector of the model
- $b$ is the bias term of the model



In [40]:
# For demonstration, true labels must be either +1 or -1
y_true = np.array([1, 1, -1, -1])
y_pred = np.array([2.0, -0.5, 0.5, -2.0])  # some raw scores

def hinge_loss(y_true, y_pred):
    # hinge loss for each sample
    losses = np.maximum(0, 1 - y_true * y_pred)
    return losses

hl = hinge_loss(y_true, y_pred)
print("Hinge Loss (per sample):", hl)
print("Average Hinge Loss:", np.mean(hl))

Hinge Loss (per sample): [0.  1.5 1.5 0. ]
Average Hinge Loss: 0.75


#### Backward

To find the derivative of $\text{Hinge}(y, \hat{y})$ with respect to $\hat{y}$:

$$
\frac{\partial \text{Hinge}}{\partial \hat{y}} =
\begin{cases}
0, & \text{if } y \cdot \hat{y} \geq 1 \\
- y, & \text{if } y \cdot \hat{y} < 1
\end{cases}
$$

**Interpretation**

The pushing of $\hat{y}$ can be seen with this update function
$$
\hat{y}_{new} = \hat{y}_{old} - \text{learning rate} \cdot \frac{\partial \text{Hinge}}{\partial \hat{y}}
$$


- Focus on Violations: The gradient is non-zero only when the prediction is within the margin or misclassified. This means the model only gets “punished” (and thus updated) when it makes an error or when the prediction is not confident enough. No update is applied if the prediction is correct and sufficiently confident (i.e., it exceeds the margin).
- Constant Correction: The gradient value $-y$ is constant (it doesn’t depend on how far the prediction is from the margin). This creates a uniform push in the direction of the correct class. For instance, if a positive sample is not confident enough, the gradient will always be -1, signaling the need to increase the score regardless of the degree of error.
- Sparsity in Updates: Because the gradient is zero for well-classified examples, the learning algorithm focuses on the “hard” examples where the decision boundary is ambiguous or incorrect, which is especially useful in classification tasks like those tackled by support vector machines.

In [41]:
def hinge_loss_backward(y_true, y_pred):
    # derivative of hinge loss for each sample
    grad = np.zeros_like(y_pred)
    mask = (1 - y_true * y_pred) > 0  # where hinge is active
    grad[mask] = -y_true[mask]
    return grad

hl_grad = hinge_loss_backward(y_true, y_pred)
print("Hinge Loss gradient:", hl_grad)

Hinge Loss gradient: [ 0. -1.  1.  0.]
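
To connect the hinge gradient back to model parameters, the sketch below runs sub-gradient descent on a linear scorer $\hat{y} = \mathbf{w}\cdot\mathbf{x} + b$. The toy dataset, learning rate, and epoch count are all assumptions made for illustration, and no regularization term is included (a full SVM objective would add one).

```python
import numpy as np

# Hypothetical toy data (assumption): two features, labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(20):
    scores = X @ w + b
    active = (1 - y * scores) > 0            # samples that violate the margin
    dL_dscore = np.where(active, -y, 0.0)    # hinge sub-gradient w.r.t. y_hat
    w -= lr * (dL_dscore @ X) / len(y)       # chain rule: d y_hat / d w = x
    b -= lr * np.mean(dL_dscore)             # chain rule: d y_hat / d b = 1

final_loss = np.mean(np.maximum(0, 1 - y * (X @ w + b)))
print(w, b, final_loss)
```

Once every sample clears the margin, the active mask is empty and the parameters stop changing, which is the sparsity-in-updates behaviour described above.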


### Quantile Loss

Also known as the “pinball loss,” used in quantile regression for a chosen quantile level $\tau \in (0,1)$.


1.	Predicting a Quantile Instead of the Mean
    - Unlike MSE (which approximates the mean of the target variable), quantile loss tries to fit a particular quantile $\tau\in(0,1)$. For example:
      - $\tau=0.5$ = median regression (the model’s prediction is the median of y given the features)
      - $\tau=0.9$ = the model aims to be accurate at the 90th-percentile “high” end of y
2.	Asymmetric Penalty
    - When $y > \hat{y}$, the error contributes $\tau \times  (y - \hat{y}) $
    - When $y < \hat{y}$, the error contributes $(\tau-1)\times (y - \hat{y}) $
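    - Because $\tau-1$ and $y - \hat{y}$ are both negative in the second case, the penalty is positive and equals $(1-\tau)\,|y - \hat{y}|$. Unless $\tau = 0.5$, overestimates and underestimates are therefore penalized with different weights.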
3.	Robustness and Distribution Insight
    - By shifting $\tau$, you can get a more complete picture of the conditional distribution of y. For instance, you might want to estimate the 0.05 (5th) or 0.95 (95th) quantiles for “worst-case” or “best-case” scenarios
    - Median regression ($\tau=0.5$) is more robust to outliers than MSE, because absolute deviations grow more slowly than squared deviations

#### Forward
$$
\text{Quantile}(y, \hat{y}; \tau)
= \max\biggl(\tau (y - \hat{y}), (\tau - 1) (y - \hat{y})\biggr)
$$

Another common way to write it is piecewise:
$$
\text{Quantile}(y, \hat{y}; \tau) =
\begin{cases}
\tau \cdot (y - \hat{y}) & \text{if } y \ge \hat{y}, \\
(\tau - 1) \cdot (y - \hat{y}) & \text{otherwise}.
\end{cases}
$$

Where:
- $y$ is the true value,
- $\hat{y}$ is the predicted value,
- $\tau$ is the quantile (e.g., 0.5 for median).

In [42]:
y_true = np.array([5.0, 0.0, 3.0])
y_pred = np.array([4.0, 1.0, 2.5])

def quantile_loss(y_true, y_pred, quantile=0.5):
    # e is the residual (error)
    e = y_true - y_pred
    # apply piecewise definition
    loss = np.where(e >= 0, quantile * e, (quantile - 1) * e)
    return np.mean(loss)

ql_value = quantile_loss(y_true, y_pred, quantile=0.5)
print(f"{y_pred=}, {y_true=}")
print("Quantile Loss:", ql_value)

y_pred=array([4. , 1. , 2.5]), y_true=array([5., 0., 3.])
Quantile Loss: 0.4166666666666667
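
The claim that this loss targets a quantile can be checked empirically: among constant predictions, the one with the lowest average pinball loss should sit near the sample quantile. The sketch below uses synthetic normal data and a grid search, both assumptions chosen for illustration.

```python
import numpy as np

def pinball(y, c, tau):
    # Mean pinball loss of predicting the constant c for every sample
    e = y - c
    return np.mean(np.where(e >= 0, tau * e, (tau - 1) * e))

rng = np.random.default_rng(0)
samples = rng.normal(size=2001)          # synthetic data (assumption)
candidates = np.linspace(-3, 3, 601)     # constant predictions to try, spaced 0.01 apart

results = {}
for tau in (0.1, 0.5, 0.9):
    losses = [pinball(samples, c, tau) for c in candidates]
    best = candidates[int(np.argmin(losses))]
    results[tau] = (best, np.quantile(samples, tau))
    print(f"tau={tau}: argmin={best:.2f}, np.quantile={results[tau][1]:.2f}")
```

The grid minimizer and `np.quantile` should agree up to roughly the grid spacing.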


#### Backward

The derivative of the quantile loss with respect to $\hat{y}$ depends on the sign of $y - \hat{y}$. Differentiating each branch of the piecewise definition gives $-\tau$ for the first branch and $-(\tau - 1) = 1 - \tau$ for the second:

$$
\frac{\partial \text{Quantile}}{\partial \hat{y}_i} =
\begin{cases}
-\tau, & \text{if } y_i > \hat{y}_i, \\
1 - \tau, & \text{if } y_i < \hat{y}_i, \\
\text{any value in } [-\tau,\, 1-\tau], & \text{if } y_i = \hat{y}_i.
\end{cases}
$$

In [43]:
def quantile_loss_backward(y_true, y_pred, quantile=0.5):
    error = y_true - y_pred
    grad = np.zeros_like(y_pred)
    grad[error > 0] = -quantile
    grad[error < 0] = (1 - quantile)
    # error == 0: any value in [-quantile, 1 - quantile] is a valid sub-gradient; 0 is kept here
    return grad

ql_grad = quantile_loss_backward(y_true, y_pred, quantile=0.5)
print(f"{y_pred=}, {y_true=}")
print("Quantile Loss gradient:", ql_grad)

y_pred=array([4. , 1. , 2.5]), y_true=array([5., 0., 3.])
Quantile Loss gradient: [-0.5  0.5 -0.5]
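
With $\tau = 0.5$ both branches have the same magnitude, so the asymmetry is easier to see at another quantile. The sketch below checks the per-sample gradient at an assumed $\tau = 0.9$ against a central finite difference; the helper names and step size are illustrative.

```python
import numpy as np

def pinball_per_sample(y, y_hat, tau):
    # Per-sample pinball loss (no reduction)
    e = y - y_hat
    return np.where(e >= 0, tau * e, (tau - 1) * e)

def pinball_grad(y, y_hat, tau):
    # -tau when under-predicting, (1 - tau) when over-predicting, 0 chosen at the kink
    e = y - y_hat
    return np.where(e > 0, -tau, np.where(e < 0, 1 - tau, 0.0))

y_q = np.array([5.0, 0.0, 3.0])
p_q = np.array([4.0, 1.0, 2.5])
tau, h = 0.9, 1e-6   # quantile level and finite-difference step (assumptions)

numeric_q = (pinball_per_sample(y_q, p_q + h, tau)
             - pinball_per_sample(y_q, p_q - h, tau)) / (2 * h)
analytic_q = pinball_grad(y_q, p_q, tau)
print(numeric_q)
print(analytic_q)
```

Under-predictions receive a gradient of magnitude 0.9 while the over-prediction receives only 0.1, so the updates push predictions upward much harder than downward, toward the 90th percentile.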
