# Logistic Regression

**odds, odds ratio, and probability**

$$odds(p) = (\frac{p}{1-p})$$
$$odds\ ratio = \frac{\frac{X_A}{X_B}}{\frac{Y_A}{Y_B}}$$
$$relative\_risk = \frac{\frac{X_A}{X_A+X_B}}{\frac{Y_A}{Y_A+Y_B}}$$
where X is treated, Y is control, A is impacted, B is not impacted

For example, odds of 4 (4 to 1) correspond to
$$probability = \frac{odds}{1+odds} = \frac{4}{1+4} = 0.8$$
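
The conversions above can be tried out on a small 2x2 table. This is a minimal sketch: the function names and the counts are hypothetical, chosen only for illustration, and relative risk is taken as the risk among the treated, $X_A/(X_A+X_B)$, divided by the risk among the controls, $Y_A/(Y_A+Y_B)$.

```python
def odds(p):
    """Convert a probability p into odds p / (1 - p)."""
    return p / (1 - p)


def probability(o):
    """Convert odds o back into a probability o / (1 + o)."""
    return o / (1 + o)


def odds_ratio(x_a, x_b, y_a, y_b):
    """Odds of being impacted among treated (X) over the odds among control (Y)."""
    return (x_a / x_b) / (y_a / y_b)


def relative_risk(x_a, x_b, y_a, y_b):
    """Risk of being impacted among treated (X) over the risk among control (Y)."""
    return (x_a / (x_a + x_b)) / (y_a / (y_a + y_b))


# Hypothetical counts: 20 of 100 treated and 40 of 100 controls are impacted.
x_a, x_b, y_a, y_b = 20, 80, 40, 60

print("odds(0.8)      =", odds(0.8))
print("probability(4) =", probability(4))
print("odds ratio     =", odds_ratio(x_a, x_b, y_a, y_b))
print("relative risk  =", relative_risk(x_a, x_b, y_a, y_b))
```

With these counts the odds ratio (about 0.375) and the relative risk (about 0.5) differ noticeably, which is typical when the outcome is not rare.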

**Distribution of logistic regression predictor and outcome variables**

$$Z = logit(P) = log(odds) = log(\frac{P}{1-P}) = \theta^Tx = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$
$$e^Z = \frac{P}{1-P}$$
$$P = \frac{e^Z}{1+e^Z} = \frac{1}{1+e^{-Z}}$$
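
As a quick sanity check, the logit and its inverse should undo each other. The sketch below assumes natural logarithms and uses hypothetical helper names; it reuses the odds-of-4 example, where $P = 0.8$.

```python
import math


def logit(p):
    """Log-odds Z = log(P / (1 - P)), using the natural logarithm."""
    return math.log(p / (1 - p))


def inverse_logit(z):
    """Map log-odds Z back to a probability P = 1 / (1 + e^-Z)."""
    return 1 / (1 + math.exp(-z))


z = logit(0.8)            # log-odds of P = 0.8, i.e. ln(4)
p = inverse_logit(z)      # should recover roughly 0.8

print("Z =", z)
print("P =", p)
```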

**Sigmoid function (logistic function for binary classification and a neuron activation function)**

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

**Derivative of sigmoid function (this can be extended to softmax)**

$$\frac{d}{dx}\sigma(x)=\frac{e^{-x}}{(1+e^{-x})^2}$$
or 
$$\sigma'(x) = \sigma(x)(1-\sigma(x)) $$ 
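
The two forms of the derivative can be checked numerically against a finite-difference estimate. This is only an illustrative sketch; the helper names and the step size `h` are assumptions, not part of the notes.

```python
import math


def sigma(x):
    """Sigmoid 1 / (1 + e^-x)."""
    return 1 / (1 + math.exp(-x))


def sigma_prime(x):
    """Derivative written as sigma(x) * (1 - sigma(x))."""
    s = sigma(x)
    return s * (1 - s)


def central_difference(f, x, h=1e-5):
    """Numerical derivative estimate (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, 0.0, 3.0):
    closed_form = math.exp(-x) / (1 + math.exp(-x)) ** 2
    print(x, sigma_prime(x), closed_form, central_difference(sigma, x))
```

All three columns should agree to several decimal places, and the derivative peaks at $x = 0$, where $\sigma'(0) = 0.25$.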

**Logistic Regression Definition (putting the above concepts together)**

* Hypothesis function $h_{\theta}(x)$
  Logit: $Z = \theta^Tx$
  $$h_{\theta}(x) = \frac{1}{1+e^{-Z}} = \frac{1}{1+e^{-\theta^T x}}$$

* Decision Boundary:
  $$h_{\theta}(x) \geq 0.5  \to y = 1$$
  $$h_{\theta}(x) < 0.5  \to y = 0$$
  or, equivalently, since $h_{\theta}(x) \geq 0.5$ exactly when $\theta^T x \geq 0$,
  $$\theta^T x \geq 0 \to y = 1$$
  $$\theta^T x < 0 \to y = 0$$

* Cost Function (measures how well the hypothesis fits across all $m$ data samples)
  $$J(\theta) = \frac{1}{m} \sum^m_{i=1}Cost(h_\theta(x^{(i)}), y^{(i)})$$
  $$J(\theta) = \frac{1}{m} \sum^m_{i=1}(-y^{(i)}log(h_\theta(x^{(i)})) - (1-y^{(i)})log(1-h_\theta(x^{(i)})) )$$
  $$J(\theta) = -\frac{1}{m} \sum^m_{i=1}(y^{(i)}log(h_\theta(x^{(i)})) + (1-y^{(i)})log(1-h_\theta(x^{(i)})) )$$
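
The hypothesis, decision boundary, and cost can be sketched together in NumPy. This is a minimal illustration rather than a full training loop. The toy data, the parameter vectors, the helper names, and the `eps` clipping are all assumptions made for the example, and `log` is taken to be the natural logarithm.

```python
import numpy as np


def hypothesis(theta, X):
    """h_theta(x) = 1 / (1 + exp(-theta^T x)) for every row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))


def predict(theta, X):
    """Decision boundary: theta^T x >= 0 gives y = 1, otherwise y = 0."""
    return (X @ theta >= 0).astype(int)


def cost(theta, X, y, eps=1e-12):
    """Average cross-entropy cost J(theta) over the m samples."""
    h = np.clip(hypothesis(theta, X), eps, 1 - eps)  # keep log() away from 0
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


# Hypothetical toy data; the first column is the bias feature x_0 = 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

theta_zero = np.zeros(2)            # every prediction is 0.5
theta_good = np.array([0.0, 2.0])   # separates the two classes

print("J(theta_zero) =", cost(theta_zero, X, y))
print("J(theta_good) =", cost(theta_good, X, y))
print("predictions   =", predict(theta_good, X))
```

With all-zero parameters every $h_\theta(x)$ is 0.5, so the cost is $\ln 2$; parameters that separate the classes give a lower cost.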

In [7]:
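import numpy as np

def logit(P):
    # Log-odds of probability P (works on scalars and NumPy arrays)
    return np.log(P / (1 - P))

def sigmoid(p, x):
    # p: parameter vector theta, x: feature vector; Z is the logit theta^T x
    Z = p.T @ x
    return 1 / (1 + np.exp(-Z))

def d_sigmoid(p, x):
    # Derivative of the sigmoid with respect to Z: sigma(Z) * (1 - sigma(Z))
    s = sigmoid(p, x)
    return s * (1 - s)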

# Probabilistic Programming and Bayesian DL

* Bayesian Statistics
    * Beta Distribution
    * Binomial likelihood
* Probability Theory
* Probabilistic programming libraries (PyMC3, Stan)
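
One way the Beta distribution and the binomial likelihood fit together is conjugacy: a Beta($\alpha$, $\beta$) prior combined with $k$ successes in $n$ binomial trials gives a Beta($\alpha + k$, $\beta + n - k$) posterior. The sketch below is a plain-Python illustration with hypothetical function names and made-up counts; it does not use PyMC3 or Stan.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (1, 1)  # Beta(1, 1) is the uniform prior on [0, 1]
posterior = beta_binomial_update(*prior, successes=7, failures=3)

print("posterior parameters:", posterior)
print("posterior mean:", beta_mean(*posterior))
```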

# Information Theory

  * PMI (Pointwise Mutual Information): how much knowing one outcome tells you about another
    * $$\text{PMI}(x, y) = \log_2\frac{p(x, y)}{p(x)\ p(y)}$$
    * if x and y are independent, PMI = 0, because $p(x, y) = p(x)\ p(y)$ and $\log_2 1 = 0$

  * Entropy (Shannon entropy) measures how uncertain the outcome of an experiment is.
    * The more uncertain the outcome, the more spread out the distribution and the higher the entropy
    * $$\text{Entropy}(X) = H(X) = -\Sigma_x\ p(x) \log_2 p(x)$$
    * equivalently, it is the expected value $E[-\log_2 p(x)]$ under the probability distribution
    * Example: BinaryEntropy (coin flip)
      * BinaryEntropy(p = 0) = 0.0: the coin always lands tails, so there is no uncertainty
      * BinaryEntropy(p = 0.5) = 1.0: maximum uncertainty for two outcomes (with $n$ equally likely outcomes entropy is $\log_2 n$, so it grows without bound as $n$ increases)
    * entropy is the average number of bits per message sent across the wire, given an optimally designed encoding that spends $-\log_2 p(x)$ bits on message $x$
  * Cross Entropy
    * The expected value for the number of bits you'd put on the wire in the case where you send messages with probability $P(X)$ but designed an optimal code with $Q(X)$
    * $H(P, Q) = CrossEntropy(P, Q) = -\sum_x P(x) \log_2 Q(x)$
    * cross entropy equals $H(P)$, its minimum, when $Q$ matches $P$; it is 0 only when $P$ is also one-hot (the prediction matches a hard class label exactly)
  * KL Divergence
    * The size of the *penalty* for using the wrong distribution to optimize our code is the gap between the cross entropy and the entropy. That difference is known as the [Kullback–Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), or KL divergence for short.
    * It is a measure of how different two probability distributions are.  The more $Q$ differs from $P$, the worse the penalty would be, and thus the higher the KL divergence.
    * $D_{KL}(P\ ||\ Q) = CrossEntropy(P, Q) - H(P)$
    * $D_{KL}(P\ ||\ Q) \ne D_{KL}(Q\ ||\ P)$ not symmetric
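
The four quantities in this section can be computed directly from their definitions. The sketch below uses base-2 logarithms and hypothetical function names. Distributions are passed as plain lists of probabilities, and terms with zero probability under $P$ are skipped, following the convention $0 \log 0 = 0$.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information log2(p(x, y) / (p(x) p(y)))."""
    return math.log2(p_xy / (p_x * p_y))


def entropy(p):
    """Shannon entropy H(P) in bits."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected bits when data follows P but the code is optimized for Q."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = CrossEntropy(P, Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)


fair = [0.5, 0.5]      # fair coin
biased = [0.9, 0.1]    # hypothetical biased coin

print("PMI (independent events):", pmi(0.25, 0.5, 0.5))
print("H(fair)                 :", entropy(fair))
print("H(biased)               :", entropy(biased))
print("CrossEntropy(fair, biased):", cross_entropy(fair, biased))
print("KL(fair || biased)      :", kl_divergence(fair, biased))
print("KL(biased || fair)      :", kl_divergence(biased, fair))
```

The last two lines print different values, which illustrates that KL divergence is not symmetric.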


In [None]:
import math
import numpy

print(math.log(1.0/0.98))     # Natural log (ln)
print(numpy.log(1.0/0.02))    # Natural log (ln)

print(math.log10(1.0/0.98))   # Common log (base 10)
print(numpy.log10(1.0/0.02))  # Common log (base 10)

print(math.log2(1.0/0.98))    # Binary log (base 2)
print(numpy.log2(1.0/0.02))   # Binary log (base 2)