# Statistical Learning

---

## Statistical Learning
- Represent the inputs as a random vector $X$ whose components $x_{i}$ are random variables, and the quantifiable outcome as a random vector $Y$, with joint probability $P(X, Y)$
- The random variables $x_{i}$ have joint density $f(X)$ and joint distribution $F(X)$:
$$f(X) = f(x_{1}, \ldots, x_{n})$$ and $$F(X) = F(x_{1}, \ldots, x_{n})$$
- Start with an empirical distribution $F(X)$ that has empirical parameters $\theta$; a major objective of statistics is to describe the observed examples as exactly as possible based on estimates of these parameters
- Predict $x$ vs. Estimate $\theta$

## Structure of Machine Learning
- Given a representation of some inputs, calculate something about them
- Assume that there is a function $f$ that describes the approximate relationship between $Y$ and $X$:
$$f(x) = E(Y | X = x, \theta)$$
where $\theta$ is a parameter of the data distribution

## Point Estimation
- Point Estimation is the attempt to provide the single “best” prediction of some quantity of interest.
- Distinguish estimates of parameters from their true values by denoting the estimate of a parameter $\theta$ by $\hat{\theta}$, which is itself a random variable (a function of the random variables $x_{i}$)
- Let $\{x^1, x^2, \ldots, x^n\}$ be a set of $n$ independent and identically distributed data points. A **point estimator** or **statistic** is any function of the data:
$$\hat{\theta}_{n} = g(X) = g(x^1 , x^2 , ..., x^n)$$

## Point Estimation Example
- If the function $g(X)$ is properly selected, then the estimation error $\theta - \hat{\theta}_{n}$ tends to decrease as the number $n$ of examples increases
- Assume $\mu$ denotes the mean grade point average of all college students, and the sample space is {1,2,3,4}. If $x_i$ are the observed grades of a sample of 88 students, then:
$$\hat{\mu} = \frac{1}{88} \sum_{i=1}^{88} x_{i} = 3.12$$
is a point estimate of $\mu$, the mean grade point average of all the students in the population
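
The behavior of the sample mean as a point estimator can be sketched with a small simulation. This is only an illustration: the grade probabilities `PROBS` are an assumption made up for the example (the notes give no distribution), so the "true" mean here is hypothetical, and the sample sizes other than 88 are arbitrary.

```python
import random

def sample_mean(xs):
    """Point estimator g(X): the arithmetic mean of the observations."""
    return sum(xs) / len(xs)

# Hypothetical grade distribution over the sample space {1, 2, 3, 4}.
GRADES = [1, 2, 3, 4]
PROBS = [0.05, 0.15, 0.40, 0.40]  # assumed for illustration, not from the notes
TRUE_MU = sum(g * p for g, p in zip(GRADES, PROBS))  # mean of the assumed distribution

rng = random.Random(0)  # fixed seed so the run is repeatable
errors = {}
for n in (88, 8800, 88000):
    sample = rng.choices(GRADES, weights=PROBS, k=n)
    mu_hat = sample_mean(sample)
    errors[n] = abs(TRUE_MU - mu_hat)  # absolute estimation error for this sample
    print(n, round(mu_hat, 3), round(errors[n], 4))
```

With a fixed seed the error is not guaranteed to shrink at every step, but it tends to get smaller as $n$ grows.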

## Function Estimation
- Approximate $f$ with a model or estimate $\hat{f}$ chosen from a carefully specified set of functions, known as the hypothesis space
- Decide how to represent the data set and select the hypothesis space
- Generally this space is indexed by a set of parameters $\theta$ that can be tuned to create different machines:
$$H : \{f(Y|X, \theta)\}$$

---

## MLE
- Goal is to find the **“Maximum Likelihood Estimate”** of the parameter $\theta$: the optimal way to fit a distribution to the data.
- Consider a set of samples $X$ drawn from one of a family of probability distributions whose parameters we don’t know. Define the Likelihood function as the joint probability distribution of the samples $X$, viewed as a function of $\theta$: $$L(\theta | X) = f(X; \theta) = P(X = x; \theta)$$
- Let $P_{model}(x; \theta)$ be a family of probability distributions over the same space, indexed by $\theta$. The maximum likelihood estimator for $\theta$ is defined as follows (the last step uses the independence of the samples):
$$\theta_{ML} = \arg\max_{\theta} L(\theta|X)$$
$$\theta_{ML} = \arg\max_{\theta} P_{model}(x^1, x^2, \ldots, x^n; \theta)$$
$$\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^n P_{model}(x^i; \theta)$$

## MLE Binomial Example:
- To maximize the Likelihood, take the derivative of the likelihood function with respect to $\theta$ and set it to 0
- Bernoulli distribution formula for one observation $x_i \in \{0, 1\}$ is: $P_{model}(x_i; \theta) = \theta^{x_i}(1 - \theta)^{1 - x_i}$
- $L(\theta|x) = \prod_{i=1}^n P_{model}(x_i; \theta)$
- If $\text{sum} = \sum^{n}_{i=1} x_i$, then $L(\theta|x) = \theta^{\text{sum}}(1 - \theta)^{n - \text{sum}}$
- Find $\theta$ which maximizes $L(\theta | x)$ by setting the first derivative of $L$ (or, equivalently, of $\ln L$) equal to 0
- Solution is: $$\hat{\theta}_{ML} = \frac{\sum^{n}_{i=1} x_i}{n}$$

- For $X = \{1,0,0,0,1,1,0,0,0,1\}$, $\hat{\theta}_{ML} = 4/10$
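
The closed-form result can be checked numerically. The sketch below computes $\hat{\theta}_{ML}$ for the sample above and compares it with a brute-force search over a grid of candidate values. The function names and the grid resolution are choices made for this illustration.

```python
import math

def bernoulli_log_likelihood(theta, xs):
    """ln L(theta | x) = sum * ln(theta) + (n - sum) * ln(1 - theta)."""
    s = sum(xs)
    n = len(xs)
    return s * math.log(theta) + (n - s) * math.log(1 - theta)

def bernoulli_mle(xs):
    """Closed-form maximum likelihood estimate: the fraction of ones."""
    return sum(xs) / len(xs)

xs = [1, 0, 0, 0, 1, 1, 0, 0, 0, 1]
theta_hat = bernoulli_mle(xs)

# Sanity check: search a grid of theta values strictly inside (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, xs))

print(theta_hat, theta_grid)
```

The grid search works on the log-likelihood rather than the likelihood itself. Both peak at the same $\theta$, and the log form avoids multiplying many small probabilities.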

## MLE Normal Distribution Example:
- Formula for Normal Distribution:
$$p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- $\mu$ determines the distribution's **Mean**
- $\sigma$ determines **Standard Deviation**
- Use the likelihood of the Normal Distribution to find optimal values for $\mu$ and $\sigma$ given data $X = \{x_1, \ldots, x_n\}$. For a single measurement $x_i$:
$$L(\mu, \sigma | x_i) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
- For $n$ independent measurements, the individual likelihoods multiply:
$$L(\mu, \sigma | X) = L(\mu, \sigma | x_1, x_2, \ldots, x_n) = L(\mu, \sigma | x_1) \times L(\mu, \sigma | x_2) \times \ldots \times L(\mu, \sigma | x_n)$$
- To find the maximum likelihood estimate of a parameter ($\mu$, $\sigma$), find where the slope of its likelihood function is 0
- Take 2 different derivatives:
    - One with respect to $\mu$ when $\sigma$ is constant, to find MLE where derivative is 0
    - One with respect to $\sigma$ when $\mu$ is constant, to find MLE where derivative is 0
1. Take the log of the likelihood function (the log of the likelihood peaks at the same parameter values, because $\ln$ is monotonically increasing)
2. The log turns the product into a sum: $\ln(L(\mu, \sigma | X)) = \ln\left(\frac{1}{\sqrt{2\pi \sigma^2}}e^{-(x_1-\mu)^2/(2\sigma^2)}\right) + \ldots + \ln\left(\frac{1}{\sqrt{2\pi \sigma^2}}e^{-(x_n-\mu)^2/(2\sigma^2)}\right) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$
- The derivative of the Log function with respect to $\mu$ is:
$$\frac{\partial}{{\partial\mu}} \ln(L(\mu, \sigma | X)) = \frac{1}{\sigma^2}((x_1+x_2+...+x_n) - n\mu)$$
- Set the derivative to 0 and solve for $\mu$
$$\frac{1}{\sigma^2}((x_1+x_2+...+x_n) - n\mu) = 0$$
$$\mu = \frac{(x_1+x_2+...+x_n)}{n}$$ 
- MLE for $\mu$ is the **Mean** of the measurements 
- The derivative of the Log function with respect to $\sigma$ is:
$$\frac{\partial}{{\partial\sigma}} \ln(L(\mu, \sigma | X)) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}((x_1-\mu)^2 + ... + (x_n - \mu)^2)$$
- Set the derivative to 0 and solve for $\sigma$
$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}((x_1-\mu)^2 + ... + (x_n - \mu)^2) = 0$$
$$\sigma = \sqrt{\frac{(x_1-\mu)^2+...+(x_n - \mu)^2}{n}}$$
- MLE for $\sigma$ is the **Standard Deviation** of the measurements (the form that divides by $n$, not $n - 1$)
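
The two closed-form estimates can be sketched in a few lines. The measurements below are made up for illustration, and the log-likelihood helper exists only to confirm that nearby parameter values score lower.

```python
import math

def normal_mle(xs):
    """Return (mu_hat, sigma_hat): the sample mean and the divide-by-n standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

def normal_log_likelihood(mu, sigma, xs):
    """ln L(mu, sigma | X) for independent normal measurements."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements
mu_hat, sigma_hat = normal_mle(xs)
best = normal_log_likelihood(mu_hat, sigma_hat, xs)

print(mu_hat, sigma_hat, best)
# Perturbing either parameter lowers the log-likelihood.
print(normal_log_likelihood(mu_hat + 0.1, sigma_hat, xs) < best)
print(normal_log_likelihood(mu_hat, sigma_hat + 0.1, xs) < best)
```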

---

## Entropy 
- Entropy is a measure of uncertainty about the occurrence or nonoccurrence of any event from a collection of mutually exclusive events
- For a discrete random variable $X$ with $K$ states, the pmf is $P[x=k]$
- Quantify the uncertainty of the event $x_k = \{x=k\}$. The uncertainty is maximized when all states are equally likely, $P[x = k] = 1/K$
- A measure of uncertainty with the required properties is: $I(x=k) = -\ln P[x = k]$
- Define “Information for an event”: $I(x) = -\ln P(x)$. With the natural log it is measured in nats; with $\log_2$ it is measured in bits or shannons
- The entropy of a random variable $X$ with distribution $P$ is defined as the expected value of the uncertainty of its outcomes:
$$H(X) = E[I(x)] = -\sum^{K}_{k=1} P[x=k] \ln P[x=k]$$
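
A minimal sketch of the entropy formula for a discrete distribution given as a list of probabilities. The `base` parameter is an addition for this example: the natural log gives nats, base 2 gives bits. Terms with zero probability are skipped, following the usual convention that $0 \ln 0 = 0$.

```python
import math

def entropy(probs, base=math.e):
    """H(X) = -sum_k P[x=k] * log P[x=k], skipping zero-probability states."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # K = 4 equally likely states
skewed = [0.70, 0.10, 0.10, 0.10]
certain = [1.0, 0.0, 0.0, 0.0]       # no uncertainty at all

print(entropy(uniform, base=2))  # maximal for K = 4: log2(4) bits
print(entropy(skewed, base=2))
print(entropy(certain, base=2))
```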

## Binary Entropy
- For the special case of a binary random variable $X \in \{0,1\}$:
$$p(X=1) = \theta \quad \text{and} \quad p(X=0) = 1 - \theta$$
$$H(X) = -[p(X = 1) \log_2 p(X=1) + p(X=0) \log_2 p(X = 0)]$$
$$H(X) = -[\theta \log_2 \theta + (1 - \theta)\log_2 (1 - \theta)]$$
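
As a quick illustration, the binary entropy can be evaluated over a few values of $\theta$. It is symmetric around $\theta = 0.5$, where it reaches its maximum of 1 bit. The endpoint handling ($H = 0$ at $\theta = 0$ and $\theta = 1$) follows the convention $0 \log 0 = 0$ and is a choice made for this sketch.

```python
import math

def binary_entropy(theta):
    """H(X) in bits for a binary variable with p(X=1) = theta."""
    if theta in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) = 0
    return -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))

for theta in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(theta, round(binary_entropy(theta), 4))
```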

## KL Divergence
- Measure how different two distributions $P(x)$ and $Q(x)$ are using **Kullback-Leibler (KL) divergence**:
$$D_{KL} (P||Q) = E_{x \sim P}[\log \frac{P(x)}{Q(x)}]$$
$$ D_{KL} (P||Q) = E_{x \sim P}[\log P(x) - \log Q(x)]$$
- The KL divergence has useful properties:
    - Nonnegative, and equal to 0 only when $P$ and $Q$ are the same distribution
    - Not symmetric: in general $D_{KL} (P||Q) \neq D_{KL} (Q||P)$
- Because of this asymmetry, KL is not a true distance measure between two distributions. The choice of which direction to use for KL is problem dependent
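
Both properties can be seen on a small discrete example. This sketch assumes the distributions are given as lists of probabilities over the same states and that $Q(x) > 0$ wherever $P(x) > 0$. The two distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), in nats.

    Assumes q[i] > 0 wherever p[i] > 0; states with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # D_KL(P || Q)
print(kl_divergence(q, p))  # D_KL(Q || P), a different value
print(kl_divergence(p, p))  # zero when the distributions match
```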

## Cross Entropy
- Cross Entropy between two distributions $P$ and $Q$:
$$H(P,Q) = H(P) + D_{KL} (P||Q)$$
$$H(P,Q) = - E_{x \sim P} [\log Q(x)]$$
- Because $D_{KL} (P||Q) \geq 0$, Cross Entropy is never smaller than Entropy, and the two are equal only when $P$ and $Q$ are the same
- $H(P)$ does not depend on $Q$, so minimizing Cross Entropy with respect to $Q$ minimizes $D_{KL} (P||Q)$ and moves $Q$ closer to the desired distribution $P$
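
The identity $H(P,Q) = H(P) + D_{KL}(P||Q)$ can be checked numerically. The sketch below uses natural logs throughout and two made-up discrete distributions, and it assumes $Q(x) > 0$ wherever $P(x) > 0$.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -E_{x ~ P}[log Q(x)] in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]        # hypothetical target distribution
q_far = [0.1, 0.2, 0.7]    # poor approximation of p
q_near = [0.6, 0.3, 0.1]   # better approximation of p

for q in (q_far, q_near, p):
    lhs = cross_entropy(p, q)
    rhs = entropy(p) + kl_divergence(p, q)
    print(round(lhs, 6), round(rhs, 6))
```

The cross entropy is smallest for `q = p`, where it equals the entropy of `p`.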

---

## Neural Network Activations and Common Activation Functions
- In artificial neurons, an activation function propagates the output of one layer’s nodes forward to the next layer
- Activation functions are scalar-to-scalar functions that introduce non-linearity into a NN, letting it model relationships that linear functions cannot
- Applied in Deep Learning models after the linear combination of inputs; they mimic neuron firing
- Also used to calculate parameters of the probability distributions used in probabilistic DL models
- Simplest Linear function: $f(x) = wx$
    - With $w = 1$ (the identity), the function passes the signal through unchanged
    - Mostly used in the input layer of neural networks

## Rectified Linear Unit - ReLU
- Activates a node only if the input is above a certain threshold:
$$f(x) = \max(0,kx)$$
- In this example, the threshold is 0, $x$ is the input to the neuron, and $k > 0$ is the slope of the active region (the standard ReLU uses $k = 1$)
- ReLU is very common in current state-of-the-art DNNs
- A popular modification used in MobileNet is ReLU6, which behaves like the standard ReLU but saturates the output at 6: $f(x) = \min(\max(0, x), 6)$
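
A minimal scalar sketch of both functions. The slope argument `k` mirrors the formula above and defaults to 1, the standard ReLU. Real frameworks apply these element-wise to tensors, which is not shown here.

```python
def relu(x, k=1.0):
    """f(x) = max(0, k * x); k = 1 is the standard ReLU."""
    return max(0.0, k * x)

def relu6(x):
    """ReLU whose output is capped at 6."""
    return min(max(0.0, x), 6.0)

for x in (-2.0, 0.0, 3.0, 10.0):
    print(x, relu(x), relu6(x))
```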

## Logistic Sigmoid
- Logistic Sigmoid Function:
$$ \sigma(x) = \frac{1}{1 + e^{-x}}$$
- Saturates when its argument is very positive or very negative, where it becomes **insensitive** to small changes in the input
- Used for binary classification
- Commonly used to produce the $p$ parameter of a Bernoulli distribution (between 0 and 1):
$$p(y=1|x;\theta) = \sigma(\theta^T x)$$
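
The sketch below implements the sigmoid so that it does not overflow for large negative inputs, and uses it to compute $p(y=1|x;\theta)$. The helper name `bernoulli_p` and the example weights and inputs are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x}), written to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)

def bernoulli_p(theta, x):
    """p(y = 1 | x; theta) = sigmoid(theta^T x) for plain Python lists."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

theta = [0.5, -1.0, 2.0]   # hypothetical parameters
x = [1.0, 2.0, 0.5]        # hypothetical input; theta^T x = -0.5

print(bernoulli_p(theta, x))
# Saturation: far from 0 the output barely moves when the input changes.
print(sigmoid(10.0), sigmoid(11.0))
print(sigmoid(-1000.0), sigmoid(1000.0))
```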

## Softmax Function
- Softmax function, a generalization of the logistic function (a soft version of a maximum), is:
$$ g(x_i) = \frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}}, \quad i = 1, 2, \ldots, k $$
- Useful property: each calculated probability is in the range $(0,1)$, and the sum of all probabilities is equal to 1
- A higher input value gets a higher probability
- It is a squashing function: the largest inputs dominate the output, giving a soft “winner-take-all” behavior
- Used for multiclass (multinomial) logistic regression models
- Works well with MLE used as estimator when model is learning parameters
- It has multiple output values that can saturate. The output is unchanged when the same scalar $c$ is added to every input:
$$\text{softmax}(z) = \text{softmax}(z + c)$$
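
The shift invariance is what makes a numerically stable implementation possible: subtracting the largest input (that is, choosing $c = -\max_j z_j$) keeps every exponent at or below 0 without changing the result. A minimal sketch on plain Python lists, with made-up inputs:

```python
import math

def softmax(z):
    """Softmax of a list of scores, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # every exponent is <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
probs = softmax(z)
shifted = softmax([v + 100.0 for v in z])  # same scalar added to every input

print(probs, sum(probs))
print(shifted)
print(softmax([1000.0, 1001.0, 1002.0]))  # would overflow without the shift
```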

---