# Logistic Regression

#### Initial definitions

![Feature Matrix](../Images/X_matrix.png)

$$
\mathbf{y}=\left(\begin{array}{c}{y_{1}} \\ {y_{2}} \\ {\vdots} \\ {y_{m}}\end{array}\right)
$$

$$
\mathbf{\theta}=\left(\begin{array}{c}{\theta_{1}} \\ {\theta_{2}} \\ {\vdots} \\ {\theta_{n}}\end{array}\right)
$$

#### $g(z)$: Logistic (sigmoid) function

$$h_{\theta}(x)=g\left(\theta^{T} x\right)=\frac{1}{1+e^{-\theta^{T} x}},$$
where
$$g(z)=\frac{1}{1+e^{-z}}$$

<img src="../Images/sigmoid.png" alt="Sigmoid" width="500"/>
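
The definitions above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `hypothesis` are chosen here for illustration, and the two-branch form is an added assumption that avoids overflow in `math.exp` for large negative $z$.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the algebraically equivalent form
    # exp(z) / (1 + exp(z)), so exp() never sees a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(theta, x) -> float:
    """h_theta(x) = g(theta^T x) for equal-length sequences theta and x."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))
```

With all parameters at zero, `hypothesis` returns 0.5 for any input, which matches $g(0) = 1/2$.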

#### Useful property of the derivative of the sigmoid function:
$$
\begin{aligned}
g^{\prime}(z) &=\frac{d}{d z} \frac{1}{1+e^{-z}} \\
&=\frac{1}{\left(1+e^{-z}\right)^{2}}\left(e^{-z}\right) \\
&=\frac{1}{\left(1+e^{-z}\right)} \cdot\left(1-\frac{1}{\left(1+e^{-z}\right)}\right) \\
&=g(z)(1-g(z))
\end{aligned}
$$
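
One way to sanity-check the identity $g'(z) = g(z)(1-g(z))$ is to compare it with a central finite difference. The sketch below assumes a step size of `1e-6` and a handful of arbitrary test points; the helper names are illustrative.

```python
import math


def g(z: float) -> float:
    """Sigmoid, written directly from its definition (fine for moderate z)."""
    return 1.0 / (1.0 + math.exp(-z))


def g_prime(z: float) -> float:
    """Closed-form derivative from the identity g'(z) = g(z) * (1 - g(z))."""
    return g(z) * (1.0 - g(z))


def central_difference(f, z: float, eps: float = 1e-6) -> float:
    """Numerical derivative (f(z + eps) - f(z - eps)) / (2 * eps)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


# Largest disagreement between the closed form and the numerical estimate.
max_gap = max(
    abs(g_prime(z) - central_difference(g, z))
    for z in (-3.0, -1.0, 0.0, 0.5, 2.0)
)
```

If the identity holds, `max_gap` should be on the order of floating-point noise.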

#### Let us assume that
$$
\begin{array}{l}
{P(y=1  | x ; \theta)=h_{\theta}(x)} \\
{P(y=0  | x ; \theta)=1-h_{\theta}(x)}
\end{array}
$$

#### More compactly 
$$
p(y | x ; \theta)=\left(h_{\theta}(x)\right)^{y}\left(1-h_{\theta}(x)\right)^{1-y}
$$
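
As a quick check (a worked substitution added for illustration), setting $y=1$ and $y=0$ in the compact form recovers the two assumed probabilities:

$$
\begin{aligned}
y=1: &\quad p(1 | x ; \theta)=\left(h_{\theta}(x)\right)^{1}\left(1-h_{\theta}(x)\right)^{0}=h_{\theta}(x) \\
y=0: &\quad p(0 | x ; \theta)=\left(h_{\theta}(x)\right)^{0}\left(1-h_{\theta}(x)\right)^{1}=1-h_{\theta}(x)
\end{aligned}
$$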

#### Likelihood:
$$
\begin{aligned}
L(\theta) &=p(\mathbf{y} | X ; \theta) \\
&=\prod_{i=1}^{m} p\left(y^{(i)} | x^{(i)} ; \theta\right) \\
&=\prod_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)\right)^{y^{(i)}}\left(1-h_{\theta}\left(x^{(i)}\right)\right)^{1-y^{(i)}}
\end{aligned}
$$

#### Log-likelihood:
$$
\begin{aligned}
\ell(\theta) &=\log L(\theta) \\
&=\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)
\end{aligned}
$$

The log-likelihood needs to be maximized. Maximizing $\ell(\theta)$ is equivalent to minimizing $-\ell(\theta)$, so the negative log-likelihood, averaged over the $m$ examples, serves as the loss.

#### Loss function:
$$
J(\theta)= -\frac{1}{m} \ell(\theta)  =-\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
$$
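
The loss can be computed directly from the sum above. The following is a minimal sketch in plain Python: `X` is assumed to be a list of feature rows and `y` a list of 0/1 labels, and the clipping constant `eps` is an added assumption (not part of the formula) that keeps `math.log` away from zero when a prediction saturates.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, evaluated in an overflow-safe way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def loss(theta, X, y, eps: float = 1e-12) -> float:
    """J(theta): negative log-likelihood averaged over the m examples."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        # Assumption: clip h into [eps, 1 - eps] so log() never receives 0.
        h = min(max(h, eps), 1.0 - eps)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m
```

With $\theta = 0$ every prediction is 0.5, so the loss equals $\log 2$ regardless of the labels, which is a convenient starting value to check against.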

#### Loss function in matrix form:
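$$
J(\theta)=\frac{1}{m} \cdot\left[-y^{T} \log (h_{\theta}(X))-(1-y)^{T} \log (1-h_{\theta}(X))\right]
$$

Here $h_{\theta}(X)=g(X \theta)$, with $g$ and $\log$ applied elementwise to the $m$-vector $X\theta$.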

#### Gradient descent

$$
\theta:=\theta - \alpha \nabla_{\theta} J(\theta)
$$

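For a single training example $(x, y)$, the chain rule together with $g^{\prime}(z)=g(z)(1-g(z))$ gives:

$$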
\begin{aligned}
\frac{\partial}{\partial \theta_{j}} \ell(\theta) &=\left(y \frac{1}{g\left(\theta^{T} x\right)}-(1-y) \frac{1}{1-g\left(\theta^{T} x\right)}\right) \frac{\partial}{\partial \theta_{j}} g\left(\theta^{T} x\right) \\
&=\left(y \frac{1}{g\left(\theta^{T} x\right)}-(1-y) \frac{1}{1-g\left(\theta^{T} x\right)}\right) g\left(\theta^{T} x\right)\left(1-g\left(\theta^{T} x\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x \\
&=\left(y\left(1-g\left(\theta^{T} x\right)\right)-(1-y) g\left(\theta^{T} x\right)\right) x_{j} \\
&=\left(y-h_{\theta}(x)\right) x_{j}
\end{aligned}
$$

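Since $J(\theta)=-\frac{1}{m} \ell(\theta)$, summing over all $m$ examples flips the sign to $h_{\theta}(x^{(i)})-y^{(i)}$, and the update for each component becomes:

$$
\theta_{j}:=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)}
$$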
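
The update rule can be sketched as a short batch gradient descent loop. Everything specific here is an assumption made for illustration: the toy dataset, the constant first column acting as an intercept feature, the learning rate `alpha=0.1`, and the fixed count of 2000 iterations.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, evaluated in an overflow-safe way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(theta, x) -> float:
    """h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


def gradient_step(theta, X, y, alpha: float):
    """One simultaneous update of every theta_j."""
    m = len(y)
    # Errors h_theta(x_i) - y_i are computed once from the current theta,
    # so all components are updated from the same predictions.
    errors = [predict(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
    return [
        t_j - alpha / m * sum(e * x_i[j] for e, x_i in zip(errors, X))
        for j, t_j in enumerate(theta)
    ]


# Hypothetical toy data: column 0 is a constant 1.0 (intercept feature).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1)
```

The labels in this toy set overlap, so the data are not linearly separable and the parameters settle to finite values instead of growing without bound. In practice a stopping test on the change in $J(\theta)$ would usually replace the fixed iteration count.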

#### Gradient of loss function in matrix form:
$$
\nabla_{\theta} J(\theta)=\frac{1}{m} \cdot X^{T}(h_{\theta}(X)-y)
$$
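
A useful check on the matrix-form gradient is to compare it against a numerical derivative of $J(\theta)$. The sketch below uses plain Python lists in place of a matrix library to stay self-contained, treating each column of `X` (obtained with `zip(*X)`) as a row of $X^{T}$. The toy data, the test point `theta`, and the step size are assumptions for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, evaluated in an overflow-safe way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def loss(theta, X, y) -> float:
    """J(theta), the averaged negative log-likelihood."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / len(y)


def gradient(theta, X, y):
    """(1/m) * X^T (h_theta(X) - y), computed column by column."""
    m = len(y)
    residual = [
        sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for x_i, y_i in zip(X, y)
    ]
    # zip(*X) iterates over the columns of X, i.e. the rows of X^T.
    return [sum(c * r for c, r in zip(col, residual)) / m for col in zip(*X)]


def numerical_gradient(theta, X, y, eps: float = 1e-6):
    """Central finite differences of the loss, one coordinate at a time."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((loss(up, X, y) - loss(down, X, y)) / (2 * eps))
    return grad


# Hypothetical toy data: column 0 is a constant 1.0 (intercept feature).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]
theta = [0.3, -0.2]

max_gap = max(
    abs(a - n)
    for a, n in zip(gradient(theta, X, y), numerical_gradient(theta, X, y))
)
```

A small `max_gap` indicates that the analytic gradient and the loss agree. With an array library, the same gradient would typically be written as a single matrix-vector product.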