## Abstract

One of the challenges of this competition is its unusual metric. We have to return a regression prediction and its confidence simultaneously, whereas usually we are asked for either a class prediction with a probability or a regression prediction alone. This notebook presents a method for solving this task with a Splitted Loss: the task is split into two parts, and for each part a particular loss is presented that correlates highly with the target Laplace Loss. A mathematical derivation of the Splitted Loss is given, along with the corresponding implementation and plots for some cases. Moreover, some insights into the Laplace Loss are covered that can help in this competition even without using the Splitted Loss.

## 1. Introduction

This competition poses some problems because of the target loss mentioned above. Under these circumstances there are two ways to solve the problem:
- Adding one layer to predict Sigma together with the FVC prediction.
- Splitting the challenge into two easier ones: the first for the FVC prediction and the second for the Sigma prediction (with the calculated FVC prediction as one of the features).

Each approach has its own pros and cons. The first approach is easy to implement with the original loss, but it rules out many popular gradient boosting methods, because even a custom objective requires a 1D vector as the model output. On the other hand, the second method has an even more serious problem: what type of loss should we use in each task to achieve the best result in terms of the original loss?

The next sections are about the second approach, with the first task for the FVC prediction and the second for the Sigma prediction. The Sigma prediction task looks easier, because at that point the delta from the Laplace Loss is already a constant, so the loss depends on only two arguments ($\Delta$ and $\sigma$) and can be used directly. But the FVC prediction has an undefined Sigma. Choosing some particular loss like MAE or RMSE can help you find an estimate of the exact solution, but there is no guarantee that such losses will correlate with the Laplace Loss. It would be convenient to split the target Laplace Loss additively into two losses and find the FVC estimate by minimizing the first term and Sigma by minimizing the second. In these terms the target optimization comes down to two highly correlated tasks, and an improvement in either task will, with high probability, improve the target loss. This notebook introduces one way to do that.

## 2. Splitted Loss

In this competition we need to optimize the Laplace Log Likelihood, which corresponds to the following loss function:

$$
Loss = \frac{\sqrt{2} \cdot \Delta}{\sigma} + \log({\sqrt{2} \cdot \sigma}),
$$

where 
- $\Delta$ is an abbreviation for  $|y_{true} - y_{pred}|$,
- $y_{true}$ is a known label,
- $y_{pred}$ and $\sigma$ are the model outputs.

Let's introduce the following notation:

$$
\begin{cases}
   \sigma = \sqrt{2} \cdot \Delta_{pred}, \\
   \Delta = \Delta_{true}.
 \end{cases}
$$

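It's easy to prove that the optimal $\sigma$ for a fixed $\Delta$ is $\sigma = \sqrt{2} \cdot \Delta$ (set the derivative of the loss with respect to $\sigma$, $-\frac{\sqrt{2} \cdot \Delta}{\sigma^2} + \frac{1}{\sigma}$, to zero), so the substitution above turns this equality into a simpler one: $\Delta_{true} = \Delta_{pred}$. Moreover, it makes the variables more meaningful ($\Delta_{pred}$ is a clear estimate of the absolute error rather than some vague $\sigma$ parameter) and makes the formula easier to read (here and below, $\sim$ denotes equality up to an additive constant, which does not affect the minimization):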

$$
Loss = \frac{\sqrt{2} \cdot \Delta}{\sigma} + \ln({\sqrt{2} \cdot \sigma}) = 
\frac{\sqrt{2} \cdot \Delta_{true}}{\sqrt{2} \cdot \Delta_{pred}} + \ln(\sqrt{2} \cdot \sqrt{2} \cdot \Delta_{pred}) = 
\frac{\Delta_{true}}{\Delta_{pred}} + \ln(2 \cdot \Delta_{pred}) \sim
\frac{\Delta_{true}}{\Delta_{pred}} + \ln(\Delta_{pred}).
$$
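The two claims above (the optimum at $\sigma = \sqrt{2} \cdot \Delta$ and the constant offset hidden behind $\sim$) can be checked numerically. The following is a minimal sketch; the value of `delta` and the search grid are arbitrary illustrative choices, and `laplace_loss` is a helper name introduced only for this check.

```python
import numpy as np

def laplace_loss(delta, sigma):
    """Loss from the formula above: sqrt(2) * delta / sigma + log(sqrt(2) * sigma)."""
    return np.sqrt(2) * delta / sigma + np.log(np.sqrt(2) * sigma)

delta = 150.0                              # illustrative |y_true - y_pred|
sigmas = np.linspace(50.0, 600.0, 5501)    # search grid with step 0.1

# Grid search for the sigma that minimizes the loss at fixed delta.
best_sigma = sigmas[np.argmin(laplace_loss(delta, sigmas))]

# Reparametrized form with sigma = sqrt(2) * delta_pred.
delta_pred = sigmas / np.sqrt(2)
reparametrized = delta / delta_pred + np.log(delta_pred)

# Difference between the original and the reparametrized loss on the grid.
offset = laplace_loss(delta, sigmas) - reparametrized

print("grid optimum:", best_sigma, "analytic optimum:", np.sqrt(2) * delta)
print("offset range:", offset.min(), offset.max(), "ln 2:", np.log(2))
```

The grid optimum should agree with $\sqrt{2} \cdot \Delta$ up to the grid step, and the offset should be the same constant $\ln 2$ for every $\sigma$.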

Now let's take a step further by adding and subtracting the same term:

$$
Loss = \frac{\Delta_{true}}{\Delta_{pred}} + \ln(\Delta_{pred}) \pm \ln(\Delta_{true}) =
\ln(\Delta_{true}) + \left[ \frac{\Delta_{true}}{\Delta_{pred}} + \ln(\Delta_{pred}) - \ln(\Delta_{true}) \right] = \\
\ln(\Delta_{true}) + \left[ \frac{\Delta_{true}}{\Delta_{pred}} + \ln{\frac{\Delta_{pred}}{\Delta_{true}}} \right] = 
Loss_1 + Loss_2
$$

In terms of dependencies:

$$
Loss(y_{true}, y_{pred}, \sigma_{pred}) = Loss_1(y_{true}, y_{pred}) + Loss_2(\Delta_{true}, \Delta_{pred}),
$$

So we've got the following formula for the Splitted Loss:

$$
\begin{cases}
   Loss_1(y_{true}, y_{pred}) = \log(|y_{true} - y_{pred}|), \\
   Loss_2(\Delta_{true}, \Delta_{pred}) = \frac{\Delta_{true}}{\Delta_{pred}} + \log{\frac{\Delta_{pred}}{\Delta_{true}}}.
 \end{cases}
$$

## 3. Related Losses

We can write the averaged $Loss_1$ as follows:

$$
\frac{1}{N} \cdot \sum_{i=1}^{N} Loss_1(y_{true}^{i}, y_{pred}^{i}) =
\frac{1}{N} \sum_{i=1}^{N} \ln(\Delta_{true}^{i}) = 
\ln\left[ \prod_{i=1}^{N} \Delta_{true}^{i}\right]^{1/N} =
\ln G(\Delta_{true}^{1}, ..., \Delta_{true}^{N})
$$

where $G$ is the *geometric mean*. So minimizing $Loss_1$ is equivalent to minimizing $G$ of the absolute errors, and you can use one of the available losses to train your model without writing a custom one. But the second loss, $Loss_2$, has more sophisticated behaviour, and we have to implement it from scratch in order to use the theory below.
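The identity between the averaged $Loss_1$ and the logarithm of the geometric mean can be verified on synthetic data. This is only a sketch: the FVC-like values and the noise level below are made-up illustrative numbers, not competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic targets and noisy predictions (illustrative values only).
y_true = rng.uniform(1000.0, 4000.0, size=50)
y_pred = y_true + rng.normal(0.0, 200.0, size=50)

delta_true = np.abs(y_true - y_pred)

# Left-hand side: the average of Loss_1 over the sample.
mean_loss_1 = np.mean(np.log(delta_true))

# Right-hand side: geometric mean computed directly as the N-th root of the product.
geometric_mean = np.prod(delta_true) ** (1.0 / len(delta_true))

print(mean_loss_1, np.log(geometric_mean))
```

For large $N$ the direct product can overflow, so in practice the log-average form on the left is the safer way to compute the same quantity.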

## 4. Properties and Observations

Here are some properties that help in understanding the Splitted Loss, together with observations that give some insight:

1. $Loss_1(y_{true}, y_{pred})$ is minimal for $y_{true} = y_{pred}$ with optimal value $-\infty$.
2. $Loss_2(\Delta_{true}, \Delta_{pred})$ is minimal for $\Delta_{true} = \Delta_{pred}$ with optimal value $1$:

    Denoting $x = \frac{\Delta_{true}}{\Delta_{pred}}$:
    $$
    Loss_2' = (x + \ln\frac{1}{x})' = (x - \ln x)' = 1 - \frac{1}{x} = 0 \rightarrow x_{opt} = 1, \\
    Loss_2'' = (1 - \frac{1}{x})' = \frac{1}{x^2} > 0 \rightarrow x_{opt} \in argmin(Loss_2)
    $$
    
3. $Loss_1(y_{true}, y_{pred})$ is a lower bound for $Loss(y_{true}, y_{pred}, \sigma_{pred})$ (up to an additive constant):

    $$
    Loss(y_{true}, y_{pred}, \sigma_{pred}) \geq
    Loss(y_{true}, y_{pred}, \sigma_{pred}^{opt}) = 
    Loss(y_{true}, y_{pred}, \sigma_{true}) \sim 
    \frac{\Delta_{true}}{\Delta_{true}} + \ln(\Delta_{true}) \sim \\
    \sim \ln(\Delta_{true}) =
    \ln(|y_{true} - y_{pred}|) = Loss_1(y_{true}, y_{pred})
    $$
    
    So when we minimize $Loss_1$ in the first task, we minimize a lower bound of the target Laplace Loss. That is another way to look at the loss splitting: first we minimize a lower bound of the Laplace Loss, and then we try to reach this lower bound by minimizing $Loss_2$.


4. $Loss_2(\Delta_{true}, \Delta_{pred})$ is asymmetric around its minimum at $\Delta_{true} = \Delta_{pred}$:

    - Case $\Delta_{pred} < \Delta_{true}$ ($x = \frac{\Delta_{true}}{\Delta_{pred}} > 1$):   $Loss_2(\Delta_{true}, \Delta_{pred}) = x - \ln x$.
    
    - Case $\Delta_{pred} \geq \Delta_{true}$ ($x = \frac{\Delta_{pred}}{\Delta_{true}} \geq 1$):   $Loss_2(\Delta_{true}, \Delta_{pred}) = \frac{1}{x} + \ln x \sim  \ln x$.
    
    So we can see that the penalty for $\Delta_{pred} < \Delta_{true}$ is bigger than for $\Delta_{pred} \geq \Delta_{true}$, because $x - \ln x > \ln x$ (this is obvious for $x \rightarrow \infty$, but for small $x$ it needs a proof):
    
    $$
    f(x) = x - 2 \cdot \ln x, \\
    f'(x) = 1 - \frac{2}{x} = 0 \rightarrow x_{opt} = 2, \\
    f''(x) = (1 - \frac{2}{x})' = \frac{2}{x^2} > 0 \rightarrow f(x) \geq f(x_{opt}) = 2 - 2 \ln 2 \approx 0.614 > 0
    $$
    
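    It means that $Loss_2(\Delta_{true}, \Delta_{pred})$ is an asymmetric function and the penalty for an underestimated $\Delta_{pred}$ is bigger than for an overestimated one. The same holds without the $\frac{1}{x} \rightarrow 0$ approximation: the exact gap $x - \frac{1}{x} - 2 \ln x$ is zero at $x = 1$ and its derivative $(1 - \frac{1}{x})^2$ is non-negative, so the gap is non-negative for every $x \geq 1$. So when the Sigma prediction is uncertain, you can use this fact and choose one of the higher candidate estimates. It will become clearer with the pictures in the next section.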
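The asymmetry in property 4 can also be checked numerically, using the exact penalties on both sides rather than the large-$x$ approximation. In this sketch $x \geq 1$ is the factor by which $\Delta_{pred}$ misses $\Delta_{true}$; the helper `loss_2_from_ratio` and the grid are illustrative choices.

```python
import numpy as np

def loss_2_from_ratio(ratio):
    """Loss_2 as a function of ratio = delta_true / delta_pred."""
    return ratio - np.log(ratio)

x = np.linspace(1.0, 50.0, 4901)     # miss factor, x >= 1

# Underestimate: delta_pred = delta_true / x, so the ratio is x.
under = loss_2_from_ratio(x)
# Overestimate: delta_pred = delta_true * x, so the ratio is 1 / x.
over = loss_2_from_ratio(1.0 / x)

# Extra penalty for underestimating by the same factor: x - 1/x - 2 * ln(x).
gap = under - over

for factor in (1.0, 2.0, 5.0, 10.0):
    i = np.argmin(np.abs(x - factor))
    print(f"x={x[i]:5.2f}  under={under[i]:.3f}  over={over[i]:.3f}  gap={gap[i]:.3f}")
```

The gap starts at zero for $x = 1$ and grows with $x$, so missing low by a given factor always costs at least as much as missing high by the same factor.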

## 5. Implementation

This section consists of implementations of both the Laplace Loss and each part of the Splitted Loss:

In [None]:
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import numpy as np

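def loss_1(y_true, y_pred):
    # First part of the Splitted Loss: log of the absolute error.
    return np.log(np.abs(y_true - y_pred))

def loss(y_true, y_pred, delta_pred):
    # Laplace Loss in the delta_pred parametrization (sigma = sqrt(2) * delta_pred),
    # with the additive constant ln(2) dropped.
    delta_true = np.abs(y_true - y_pred)
    return delta_true / delta_pred + np.log(delta_pred)

def loss_2(delta_true, delta_pred):
    # Second part of the Splitted Loss, written via x = delta_true / delta_pred.
    x = delta_true / delta_pred
    return x - np.log(x)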

Check the additivity property (the full loss equals the sum of the two parts):

In [None]:
def check_correctness(n=100, eps=1e-6):
    y_true = np.random.rand(n)
    y_pred = np.random.rand(n)

    delta_true = np.abs(y_true - y_pred)
    delta_pred = np.random.rand(n)
    
    l1 = loss_1(y_true, y_pred)
    l2 = loss_2(delta_true, delta_pred)
    l = loss(y_true, y_pred, delta_pred)
    
    assert np.all(np.abs(l - l1 - l2) < eps)

In [None]:
for _ in range(1000):
    check_correctness()
print("OK!")
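For the second task, $Loss_2$ has to be supplied as a custom objective. Gradient boosting libraries typically expect per-sample first and second derivatives for this, and the exact function signature depends on the library, so it is not shown here. The sketch below rests on one assumption that is not part of the derivation above: the model predicts $z = \ln \Delta_{pred}$ instead of $\Delta_{pred}$ itself. In the raw parametrization the second derivative, $\frac{2 \Delta_{true}}{\Delta_{pred}^3} - \frac{1}{\Delta_{pred}^2}$, becomes negative once $\Delta_{pred} > 2 \Delta_{true}$, whereas in log space it stays positive. The function names are illustrative.

```python
import numpy as np

def loss_2_log_space(z, delta_true):
    """Loss_2 with delta_pred = exp(z): delta_true * exp(-z) + z - log(delta_true)."""
    return delta_true * np.exp(-z) + z - np.log(delta_true)

def loss_2_grad_hess(z, delta_true):
    """First and second derivatives of Loss_2 with respect to z = log(delta_pred)."""
    ratio = delta_true * np.exp(-z)   # delta_true / delta_pred
    grad = 1.0 - ratio                # zero exactly when delta_pred == delta_true
    hess = ratio                      # positive whenever delta_true > 0
    return grad, hess

# Finite-difference check of the analytic gradient on random inputs.
rng = np.random.default_rng(0)
delta_true = rng.uniform(10.0, 500.0, size=5)
z = np.log(rng.uniform(10.0, 500.0, size=5))
h = 1e-5
numeric_grad = (loss_2_log_space(z + h, delta_true)
                - loss_2_log_space(z - h, delta_true)) / (2 * h)
grad, hess = loss_2_grad_hess(z, delta_true)

print(np.max(np.abs(numeric_grad - grad)))
```

With this parametrization the final prediction would be recovered as $\Delta_{pred} = e^{z}$ and then $\sigma = \sqrt{2} \cdot \Delta_{pred}$.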

Now let's take a look at the behavior of the losses. The first part of the Splitted Loss looks as expected:

In [None]:
y_preds = np.arange(70, 800)
y_trues = [70, 100, 200, 300, 500, 1000]
colors = ["blue", "red", "green", "brown", "black"]

plt.figure(figsize=(16, 8))
plt.title("Loss_1")
# zip stops at the shorter list, so only the first five y_true values are plotted
for y_true, color in zip(y_trues, colors):
    plt.plot(y_preds, loss_1(y_true, y_preds), label="y_true={}".format(y_true), color=color)
plt.xlabel("y_pred")
plt.ylabel("loss value")
_ = plt.legend()

The same with the second part:

In [None]:
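delta_preds = np.arange(70, 1000)
delta_trues = [70, 100, 300, 500, 1000, 2000]
# five colors for six values: zip below stops at the shorter list, so delta_true=2000 is not plotted
colors = ["blue", "green", "red", "brown", "black"]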
plt.figure(figsize=(16, 8))

plt.title("Loss_2")
for delta_true, color in zip(delta_trues, colors):
    plt.plot(delta_preds, loss_2(delta_true, delta_preds), label="delta_true={}".format(delta_true), color=color)
    plt.scatter(delta_true, loss_2(delta_true, delta_true), color=color)
plt.xlabel("delta_pred")
plt.ylabel("value")
_ = plt.legend()

In the second plot we can see the observation from the section above: the penalty for an underestimated $\Delta_{pred}$ is bigger than for an overestimated one. Moreover, the bigger $\Delta_{true}$ is, the stronger the penalty for an underestimated $\Delta_{pred}$ can be. On the other hand, the penalty for an overestimated $\Delta_{pred}$ doesn't change much even for a relatively large $\Delta_{pred}$. This is easy to explain: in this case the loss is roughly $\ln \frac{\Delta_{pred}}{\Delta_{true}}$, so the division and the logarithm keep the loss small.
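One way to see why a higher estimate is the safer choice under uncertainty is to average $Loss_2$ over several plausible values of $\Delta_{true}$. The sketch below assumes, purely for illustration, that the uncertainty is described by a handful of equally likely values; the numbers are hypothetical. Since the averaged loss is $\frac{\overline{\Delta_{true}}}{\Delta_{pred}} + \ln \Delta_{pred}$ plus a constant, its minimizer is the arithmetic mean of the candidates, which lies above the median for a right-skewed set.

```python
import numpy as np

def loss_2(delta_true, delta_pred):
    # Same definition as in the implementation section above.
    x = delta_true / delta_pred
    return x - np.log(x)

# Hypothetical, right-skewed set of equally likely delta_true values.
delta_true_samples = np.array([50.0, 80.0, 100.0, 150.0, 620.0])

# Candidate predictions on an integer grid.
candidates = np.arange(20.0, 1000.0, 1.0)

# Average Loss_2 over the plausible delta_true values for each candidate.
expected = np.array([loss_2(delta_true_samples, c).mean() for c in candidates])
best = candidates[np.argmin(expected)]

print("best delta_pred:", best)
print("mean:", delta_true_samples.mean(), "median:", np.median(delta_true_samples))
```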