# Cross entropy

Given the logit (log-odds) $t$ of a probability $p$:

\begin{align*}
t &= \ln \frac{p}{1 - p} \\
p & = \frac{e^t}{1 + e^t} \\
1 - p &= \frac{1}{1 + e^t}
\end{align*}
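The three identities above can be checked numerically. The following is a minimal sketch using only the standard library; the function names `logit` and `sigmoid` are chosen here for illustration and do not come from the original notes.

```python
import math

def logit(p: float) -> float:
    """Log-odds t = ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(t: float) -> float:
    """Inverse of logit: p = e^t / (1 + e^t), arranged to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

p = 0.8                                  # illustrative probability
t = logit(p)
p_back = sigmoid(t)                      # should recover p
one_minus_p = 1.0 / (1.0 + math.exp(t))  # third identity: 1 - p

print(t, p_back, one_minus_p)
```

The branch on the sign of `t` is a design choice for numerical safety: it keeps the argument of `math.exp` non-positive, so very large scores do not overflow.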

Cross-entropy for a label $y$, using $\ln p = t - \ln (1 + e^t)$ and $\ln (1 - p) = - \ln (1 + e^t)$:

\begin{align*}
L 
&= - \Big[ y \ln p + (1 - y ) \ln (1 - p) \Big ] \\
&= - \Big[ y (t - \ln (1 + e^t)) - (1 - y) \ln (1 + e^t) \Big] \\
&= - \Big[ yt - y \ln (1 + e^t) - \ln (1 + e^t) + y \ln (1 + e^t) \Big] \\
&= - \Big[ yt - \ln (1 + e^t) \Big]
\end{align*}
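As a sanity check on the simplification, the probability form and the logit form of the loss can be compared numerically. This is a sketch under the assumption that `softplus` (a name introduced here) stands for $\ln (1 + e^t)$; the test values of `t` and `y` are arbitrary.

```python
import math

def softplus(t: float) -> float:
    """ln(1 + e^t), rearranged so that exp never overflows."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def xent_from_prob(y: float, p: float) -> float:
    """First line of the derivation: -[y ln p + (1 - y) ln(1 - p)]."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def xent_from_logit(y: float, t: float) -> float:
    """Last line of the derivation: -[y t - ln(1 + e^t)]."""
    return -(y * t - softplus(t))

t = 0.7                                # arbitrary logit
p = 1.0 / (1.0 + math.exp(-t))
for y in (0.0, 1.0, 0.3):
    # Both forms should agree up to floating-point error.
    print(y, xent_from_prob(y, p), xent_from_logit(y, t))
```

The logit form has the practical advantage that it never takes the logarithm of a probability that has rounded to 0 or 1.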

Then the gradient:

\begin{align*}
\frac{\partial L}{ \partial t} 
&= -y + \frac{e^t}{1 + e^t} \\
&= p - y
\end{align*}

The Hessian:

\begin{align*}
\frac{\partial^2 L}{ \partial t^2} 
&= \frac{e^t (1 + e^t) - e^t e^t }{(1 + e^t)^2} \\
&= \frac{e^t}{(1 + e^t)^2} \\
&= \frac{e^t}{1 + e^t} \frac{1}{1 + e^t} \\
&= p ( 1 - p)
\end{align*}

The result matches the LightGBM implementation at https://github.com/microsoft/LightGBM/blob/4971a06668df7eabeb7d4bb1987abb442f2970c9/src/objective/xentropy_objective.hpp#L83-L84, where `score` corresponds to $t$ and `z` corresponds to $p$.
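The closed forms $p - y$ and $p(1 - p)$ can also be checked against central finite differences of the loss. The sketch below is an independent check written for these notes, not a copy of the LightGBM code; the step size `h` and the sample point are arbitrary choices.

```python
import math

def loss(y: float, t: float) -> float:
    """Cross-entropy in logit form: -[y t - ln(1 + e^t)]."""
    return -(y * t - math.log(1.0 + math.exp(t)))

def grad_hess(y: float, t: float):
    """Analytic first and second derivatives with respect to t."""
    p = 1.0 / (1.0 + math.exp(-t))
    return p - y, p * (1.0 - p)

def numeric_grad_hess(y: float, t: float, h: float = 1e-4):
    """Central-difference approximations of the same derivatives."""
    g = (loss(y, t + h) - loss(y, t - h)) / (2.0 * h)
    hs = (loss(y, t + h) - 2.0 * loss(y, t) + loss(y, t - h)) / (h * h)
    return g, hs

y, t = 1.0, 0.5
print(grad_hess(y, t))
print(numeric_grad_hess(y, t))
```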

So the *important conclusion* is that, with cross entropy as the objective function in binary classification, the ensemble model predicts the logit $t$, not the probability $p$.
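To illustrate that conclusion, the sketch below sums made-up per-tree outputs into a raw score and only then maps it to a probability. The values in `tree_outputs` are hypothetical and do not come from any trained model.

```python
import math

# Hypothetical outputs of three trees for a single sample (illustrative only).
tree_outputs = [0.4, -0.1, 0.25]

raw_score = sum(tree_outputs)                      # ensemble prediction: a logit
probability = 1.0 / (1.0 + math.exp(-raw_score))   # sigmoid recovers p from t

print(f"logit = {raw_score:.3f}, probability = {probability:.3f}")
```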

### In terms of $t = c + f$, where $c$ is a constant and $f$ is the new tree to fit

\begin{align*}
L 
&= - \Big[ yt - \ln (1 + e^t) \Big] \\
&= - \Big[ y(c + f) - \ln \left(1 + e^{c + f} \right) \Big]
\end{align*}

\begin{align*}
\frac{\partial L}{\partial f}
&= - y + \frac{e^{c + f}}{1 + e^{c + f}} \\
&= p - y
\end{align*}

\begin{align*}
\frac{\partial^2 L}{\partial f^2}
&= \frac{e^{c + f}(1 + e^{c + f}) - e^{c + f} e^{c + f}}{(1 + e^{c + f})^2} \\
&= \frac{e^{c + f}}{1 + e^{c + f}} \frac{1}{1 + e^{c + f}} \\
&= p ( 1 - p)
\end{align*}
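One way to check these two derivatives is to plug them into a second-order Taylor expansion of the loss around $f = 0$ and compare it with the exact loss for a small $f$. The sketch below does that; the values of `y`, `c`, and `f` are illustrative assumptions.

```python
import math

def loss(y: float, t: float) -> float:
    """Cross-entropy in logit form: -[y t - ln(1 + e^t)]."""
    return -(y * t - math.log(1.0 + math.exp(t)))

y, c = 1.0, 0.3                      # label and fixed current score (illustrative)
p = 1.0 / (1.0 + math.exp(-c))       # probability at f = 0
grad = p - y                         # dL/df at f = 0
hess = p * (1.0 - p)                 # d^2L/df^2 at f = 0

f = 0.1                              # small hypothetical tree output
exact = loss(y, c + f)
approx = loss(y, c) + grad * f + 0.5 * hess * f * f

print(exact, approx)
```

The two numbers should agree closely for small `f`, with the gap growing as `f` moves further from zero.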

# MSE

\begin{align*}
L &= \frac{1}{2}(t - y)^2 \\
\frac{\partial L}{\partial t} &= t - y \\
\frac{\partial^2 L}{\partial t^2} &= 1
\end{align*}
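The same finite-difference check applies to the squared error. This is a minimal sketch with arbitrary values for `y` and `t`; the function names are introduced here for illustration.

```python
def mse_loss(y: float, t: float) -> float:
    """Squared error with the 1/2 factor: 0.5 * (t - y)^2."""
    return 0.5 * (t - y) ** 2

def mse_grad_hess(y: float, t: float):
    """Analytic derivatives: gradient t - y, constant Hessian 1."""
    return t - y, 1.0

y, t, h = 2.0, 0.5, 1e-4
num_grad = (mse_loss(y, t + h) - mse_loss(y, t - h)) / (2.0 * h)
num_hess = (mse_loss(y, t + h) - 2.0 * mse_loss(y, t) + mse_loss(y, t - h)) / (h * h)
grad, hess = mse_grad_hess(y, t)

print(grad, num_grad)
print(hess, num_hess)
```

The factor $\frac{1}{2}$ in the loss is what makes the gradient exactly $t - y$ and the Hessian exactly 1.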


### In terms of $t = c + f$, where $c$ is a constant and $f$ is the new tree to fit

\begin{align*}
L &= \frac{1}{2}(c + f - y)^2 \\
\frac{\partial L}{\partial f} &= c + f - y\\
\frac{\partial^2 L}{\partial f^2} &= 1
\end{align*}


So the derivatives with respect to $t$ and with respect to $f$ are identical: because $c$ is a constant, $\partial t / \partial f = 1$.