<a href="https://colab.research.google.com/github/salilathalye/chats-with-austin/blob/main/CWA_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Deliberate Practice - Binary Logistic Regression
<p>Dependent Variable


*   Categorical
*   Binary-valued e.g. Case | No Case, 1 | 0, Success | Failure

<p>Independent Variables


*   Continuous 
*   Binary
*   Categorical coded as dummy/indicator variables
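
As a small illustration of the last bullet, the sketch below dummy-codes a categorical variable in plain Python. The function name `dummy_code`, the `drop_first` flag, and the color data are hypothetical and not part of the notebook. Dropping one level as the reference category is an assumed convention; it keeps the indicators from being perfectly collinear with the intercept.

```python
def dummy_code(values, drop_first=True):
    """Turn a list of category labels into 0/1 indicator columns.

    Returns the kept level names and one row of indicators per input value.
    With drop_first=True the alphabetically first level becomes the
    reference category and gets no column of its own.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


# Hypothetical categorical predictor with three levels
colors = ["red", "green", "blue", "green"]
columns, rows = dummy_code(colors)

print(columns)
for color, row in zip(colors, rows):
    print(color, row)
```

Here `blue` is the reference level, so a row of all zeros stands for `blue`.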

<p> Approach
Treats the dependent variable as an outcome of a Bernoulli trial



###Bernoulli Trial
$$Y_i|x_{1,i},x_{2,i},x_{3,i},\ldots,x_{m,i} \sim Bernoulli(p_i)$$
Outcomes $Y_i$ conditioned on the explanatory variables follow a Bernoulli distribution with parameter $p_i$

$$E[Y_i|x_{1,i},x_{2,i},x_{3,i},\ldots,x_{m,i}] = p_i$$
The expected value of $Y_i$ equals the probability of success $p_i$

$$Pr(Y_i = y_i|x_{1,i},x_{2,i},x_{3,i},\ldots,x_{m,i}) = p_i^{y_i}(1-p_i)^{(1-y_i)}$$
The Probability Mass Function (PMF) of the Bernoulli Trial takes the values $p_i$ for Success $(y_i = 1)$ or $(1-p_i)$ for Failure $(y_i = 0)$.
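
The PMF above can be checked with a few lines of Python. This is a minimal sketch; the function name `bernoulli_pmf` and the value `p_success = 0.3` are illustrative assumptions, not taken from the notebook.

```python
def bernoulli_pmf(y: int, p: float) -> float:
    """Return Pr(Y = y) = p^y * (1 - p)^(1 - y) for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return (p ** y) * ((1.0 - p) ** (1 - y))


p_success = 0.3  # illustrative success probability
pmf_success = bernoulli_pmf(1, p_success)  # the exponent on (1 - p) is 0, leaving p
pmf_failure = bernoulli_pmf(0, p_success)  # the exponent on p is 0, leaving 1 - p

print(pmf_success, pmf_failure)
```

The exponents act as a switch: exactly one of the two factors survives for each outcome.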

####Binomial Distribution
#####Binomial Likelihood Function
<p>n trials, r successes, given probability p of success
<p>
$$L(p) = {n \choose r} p^r(1-p)^{(n-r)}$$
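
To see the likelihood as a function of p, the sketch below evaluates $L(p)$ on a grid and picks the largest value. The counts (10 trials, 7 successes) and the grid resolution are made up for illustration; a grid search is used only because it is easy to read, not because it is how the maximum is normally found.

```python
from math import comb


def binomial_likelihood(p: float, n: int, r: int) -> float:
    """L(p) = C(n, r) * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * (p ** r) * ((1.0 - p) ** (n - r))


n_trials, n_successes = 10, 7  # hypothetical data

# Evaluate the likelihood at p = 0.01, 0.02, ..., 0.99 and keep the best p.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: binomial_likelihood(p, n_trials, n_successes))

print(best_p, n_successes / n_trials)
```

The grid maximum coincides with the sample proportion r / n, which is the maximum likelihood estimate of p for binomial data.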

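####Odds
p is the probability of the positive event. The odds compare the probability that the event occurs with the probability that it does not. For $0 \lt p \lt 1$ the odds range over all positive real numbers:
$$0 \lt \frac{p}{(1-p)} \lt \infty$$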

####Logit aka Log-Odds
$$logit(p) = log_e \frac{p}{(1-p)}$$

$log_e$, also written as $ln$, is the natural logarithm. The inverse of $log_e(x)$ is $e^{x}$, where $e$ is Euler's Number.



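Why are we concerned with log-odds? A probability is confined to the interval (0, 1) and the odds to the positive reals, but the natural log of the odds ranges over the entire real line. The logit therefore serves as a link function: it maps the probability of the binary outcome onto a continuous, unbounded scale that a linear function of the predictors can model.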
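
A quick numerical look at odds and log-odds may help here. The helper names `odds` and `logit` and the sample probabilities are assumptions made for this sketch.

```python
import math


def odds(p: float) -> float:
    """Odds of the positive event: p / (1 - p), defined for 0 < p < 1."""
    return p / (1.0 - p)


def logit(p: float) -> float:
    """Log-odds: the natural log of the odds."""
    return math.log(odds(p))


# Illustrative probabilities spread across (0, 1)
for p in (0.1, 0.5, 0.9):
    print(f"p={p:.1f}  odds={odds(p):.4f}  logit={logit(p):+.4f}")
```

Probabilities below 0.5 give negative log-odds, 0.5 gives exactly zero, and probabilities above 0.5 give positive log-odds, with `logit(p)` and `logit(1 - p)` mirroring each other around zero.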

###Linear Functions of Explanatory Variables
$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_nx_n$$

$$y = w_0 + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

###Logistic Sigmoid Function aka Sigmoid Function

The logistic function $F(t)$ takes values strictly between 0 and 1 for any real $t$

$$F(t) = \frac{e^t}{e^t + 1}$$

Dividing the numerator and denominator by $e^t$ we get

$$F(t) = \frac{1}{1 + e^{-t}}$$


$$\phi(z) = \frac{1}{1+e^{-z}}$$

If we take t to be a linear function of the explanatory variables x
$$F(t) = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \ldots + \beta_nx_n)}}$$
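
The two algebraically equivalent forms of the logistic function are useful together in code: each one is numerically safe on one side of zero. The sketch below is one possible implementation under that assumption; the names `sigmoid` and `linear_predictor` and the coefficient values are hypothetical.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function, choosing the form that avoids overflow in exp."""
    if t >= 0:
        # 1 / (1 + e^{-t}): e^{-t} is at most 1 here
        return 1.0 / (1.0 + math.exp(-t))
    # e^{t} / (e^{t} + 1): e^{t} is below 1 here
    e_t = math.exp(t)
    return e_t / (e_t + 1.0)


def linear_predictor(beta, x):
    """t = beta_0 + beta_1 * x_1 + ... + beta_n * x_n (beta[0] is the intercept)."""
    return beta[0] + sum(b * x_j for b, x_j in zip(beta[1:], x))


beta = [-1.0, 2.0, 0.5]  # hypothetical coefficients: intercept, then two slopes
x = [0.5, 0.0]           # hypothetical feature values x_1, x_2

t = linear_predictor(beta, x)
prob = sigmoid(t)
print(t, prob)
```

With these made-up numbers the linear predictor is exactly zero, so the predicted probability sits at the midpoint 0.5. Calling `math.exp` on a large positive argument raises `OverflowError`, which is why the branch on the sign of `t` is there.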

####Model
Logistic Regression models the log odds as a linear function of k factors (predictors). 

$${logit}(p(y=1|\mathbf x)) = 
\beta_0x_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_kx_k = 
\sum\limits_{i=0}^{k} {\beta_ix_i} = \boldsymbol \beta^\intercal \mathbf x$$

Here $p(y=1|\mathbf x)$ represents the conditional probability that y belongs to class 1, given its k features represented in $\mathbf x$.

$\beta_0$ is the intercept and $x_0$ is set to 1.

Here z is the net input i.e. the linear combination of the weights and input features $z = \boldsymbol \beta^\intercal \mathbf x$.


$$\hat y =
\left\{
    \begin{array}{ll}
        1  & \mbox{if } \phi(z) \geq 0.5 \\
        0 & \mbox {otherwise}
    \end{array}
\right.$$
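
The decision rule can be sketched as a short function. The names `predict` and `threshold`, and the weights below, are illustrative assumptions; the plain form of the sigmoid is used because the inputs here are small.

```python
import math


def sigmoid(z: float) -> float:
    """phi(z) = 1 / (1 + e^{-z}); adequate for moderate values of z."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(beta, x, threshold=0.5):
    """Return class 1 if phi(z) >= threshold, else 0.

    x holds the features x_1..x_k; x_0 = 1 is prepended for the intercept.
    """
    z = sum(b * x_j for b, x_j in zip(beta, [1.0] + list(x)))
    return 1 if sigmoid(z) >= threshold else 0


beta = [-3.0, 1.0]  # hypothetical: intercept -3, one slope of 1
labels = [predict(beta, [value]) for value in (1.0, 3.0, 5.0)]
print(labels)
```

Because $\phi(z) \geq 0.5$ exactly when $z \geq 0$, the rule is equivalent to checking the sign of the net input; the middle point above has $z = 0$ and falls into class 1.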

###Learning the weights $\beta_i$ - Maximum Likelihood Estimation


The likelihood is the probability of observing the data given the model parameters $\theta$, assuming the data follow a chosen distribution.

$$P(X;\theta)$$

Find the set of parameters $\theta$ that maximize the likelihood

$$\max_{\theta} L(X; \theta)$$

We have to do this across the set of n samples. Assuming the samples are independent, the joint likelihood is the product of the per-sample probabilities.

$$\max_{\theta} \prod\limits_{i=1}^{n} P(x_i;\theta)$$
When we multiply hundreds of small probabilities together we can encounter [arithmetic underflow](https://en.wikipedia.org/wiki/Arithmetic_underflow). Taking the natural log converts the product into a sum, and it does not affect the computation of the argmax (why?). The result is called the log-likelihood function. Rather than maximize a function, we prefer to minimize one (the first derivative is 0 at the minimum), so we also negate the function. We have converted the problem from finding the $\theta$ that maximizes the joint probability into one of minimizing the negative log-likelihood.
$$\min_{\theta} - \sum\limits_{i=1}^{n} log P(x_i;\theta)$$
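
The underflow problem is easy to reproduce. In the sketch below the 200 per-sample probabilities of 0.01 are made up; the point is only that their product is smaller than the smallest positive double, while the sum of their logs is an ordinary number.

```python
import math

# Hypothetical per-sample probabilities: 200 samples, each with probability 0.01
probs = [0.01] * 200

# Direct product: the true value is 1e-400, which a 64-bit float cannot represent
product = math.prod(probs)

# Sum of logs: the same information on a scale that does not underflow
log_likelihood = sum(math.log(p) for p in probs)
negative_log_likelihood = -log_likelihood

print(product)
print(negative_log_likelihood)
```

The product collapses to zero and can no longer distinguish one parameter setting from another, whereas the negative log-likelihood stays finite and comparable.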

### Cost Function for Negative Log-Likelihood
$$L(\mathbf {w}) = P(\mathbf y | \mathbf x; \mathbf w) = \prod\limits_{i=1}^{n}P(y^{(i)}|x^{(i)};\mathbf w) = \prod\limits_{i=1}^{n}(\phi(z^{(i)}))^{y^{(i)}}(1-\phi(z^{(i)}))^{1 - y^{(i)}}$$

Taking the natural logarithm, we convert the product into sum

$$log L(\mathbf w) = \sum\limits_{i}^{}\left[y^{(i)}log(\phi(z^{(i)}))+(1-y^{(i)})log(1-\phi(z^{(i)}))\right]$$

Reframing the problem to minimize the negative log-likelihood
$$J(\mathbf {w}) = - log L(\mathbf w)$$

$$J(\mathbf {w}) = - \sum\limits_{i}^{}\left[y^{(i)}log(\phi(z^{(i)}))+(1-y^{(i)})log(1-\phi(z^{(i)}))\right]$$

$$J(\mathbf {w}) = \sum\limits_{i}^{}\left[-y^{(i)}log(\phi(z^{(i)}))-(1-y^{(i)})log(1-\phi(z^{(i)}))\right]$$

$\mathbf w$ are the weights we wish to learn; the superscript $(i)$ indexes the $i$-th sample.
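
The cost $J(\mathbf w)$ translates almost term by term into code. The sketch below is one way to write it in plain Python; the function name, the tiny data set, and the two weight vectors are hypothetical. The `eps` clipping is an added assumption (not part of the formula) that keeps `log` away from zero.

```python
import math


def sigmoid(z: float) -> float:
    """phi(z) = 1 / (1 + e^{-z}); adequate for moderate values of z."""
    return 1.0 / (1.0 + math.exp(-z))


def negative_log_likelihood(w, X, y, eps=1e-12):
    """J(w) = sum_i [ -y_i * log(phi(z_i)) - (1 - y_i) * log(1 - phi(z_i)) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z_i = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i))  # net input w^T x
        phi = min(max(sigmoid(z_i), eps), 1.0 - eps)        # assumed safeguard for log
        total += -y_i * math.log(phi) - (1 - y_i) * math.log(1.0 - phi)
    return total


# Hypothetical data: the first column is x_0 = 1 for the intercept
X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
y = [0, 1, 0]

cost_zero = negative_log_likelihood([0.0, 0.0], X, y)     # every phi is 0.5
cost_better = negative_log_likelihood([-1.0, 1.0], X, y)  # weights that fit the labels better

print(cost_zero, cost_better)
```

With all weights at zero every sample contributes $-log(0.5)$, so the cost is $3\,log\,2$; weights that push $\phi(z)$ toward the correct label lower the cost.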

###Partial Derivative of the Sigmoid Function

$$\frac{\partial {}}{\partial z}\phi(z) = \frac{\partial {}}{\partial z}\frac{1}{1+e^{-z}} = \frac{1}{(1+e^{-z})^2}e^{-z}$$

This is because

$$\frac{\partial {}}{\partial z}\frac{1}{1+e^{-z}} = \frac{\partial {}}{\partial z}({1+e^{-z}})^{-1} $$

$$= -1({1+e^{-z}})^{-2}\frac{\partial {}}{\partial z}({1+e^{-z}})$$

$$= -1({1+e^{-z}})^{-2}(-e^{-z})$$

$$= \frac{1}{(1+e^{-z})^2}e^{-z}$$

$$= \frac{1}{(1+e^{-z})^2}((1 + e^{-z}) -1) = \frac{1}{1+e^{-z}} \left(1 - \frac{1}{1+e^{-z}}\right) = \phi(z)(1-\phi(z))$$

Therefore

$$\frac{\partial {}}{\partial z}\phi(z) = \phi(z)(1-\phi(z))$$
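
The closed form can be sanity-checked against a finite-difference approximation. This is a rough numerical check, not a proof; the helper names, the step size `h`, and the test points are assumptions chosen for the sketch.

```python
import math


def sigmoid(z: float) -> float:
    """phi(z) = 1 / (1 + e^{-z}); adequate for moderate values of z."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Closed-form derivative: phi(z) * (1 - phi(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_difference(f, z: float, h: float = 1e-5) -> float:
    """Numerical derivative: (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


test_points = (-3.0, -1.0, 0.0, 0.5, 2.0)
max_gap = max(
    abs(sigmoid_grad(z) - central_difference(sigmoid, z)) for z in test_points
)

print(sigmoid_grad(0.0), max_gap)
```

The derivative peaks at $z = 0$, where $\phi(z) = 0.5$ gives $0.5 \times 0.5 = 0.25$, and the gap between the two estimates stays tiny at every test point.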

###References
*   Raschka, Mirjalili, Python Machine Learning, Third Edition, Packt
*   Burkov, Andriy, The Hundred-Page Machine Learning Book
*   Thakur, Abhishek, Approaching (Almost) Any Machine Learning Problem
*   Géron, Aurélien, Hands-On Machine Learning with Scikit-Learn & TensorFlow
*   Strickland, Jeffrey, Logistic Regression Inside-Out