# Statistical Learning

---

## Statistical Learning
- Represent the inputs as a random vector $X$ whose components $x_{i}$ are random variables, and the quantifiable outcome as a random vector $Y$, with joint probability $P(X, Y)$
- The random variables $x_{i}$ have joint density $f(X)$ and joint distribution $F(X)$:
$$f(X) = f(x_{1}, \dots, x_{n}) \quad \text{and} \quad F(X) = F(x_{1}, \dots, x_{n})$$
- Start with an empirical distribution $F(X)$ that has empirical parameters $\theta$; a major objective of statistics is to give an exact description of the sample based on estimates of those parameters
- Predict $x$ vs. Estimate $\theta$

## Structure of Machine Learning
- Given a representation of some inputs, calculate something about them
- Assume that there is a function $f(X)$ that describes the approximate relationship between $Y$ and $X$:
$$f(X) = E(Y|X = x, \theta)$$
where $\theta$ is a parameter of the data distribution
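
As a rough illustration of $f(X) = E(Y|X = x, \theta)$, the sketch below estimates the conditional mean of $Y$ for each value of a discrete input by averaging the observed outcomes. The data are synthetic and the generating rule ($Y = 2X$ plus Gaussian noise) is an assumption made only for this example; the helper name `conditional_mean` is likewise hypothetical.

```python
import random
from collections import defaultdict

def conditional_mean(xs, ys):
    """Estimate E(Y | X = x) for each observed discrete x by averaging y."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}

# Synthetic data (assumption for illustration): Y = 2 * X + Gaussian noise
rng = random.Random(0)
xs = [rng.choice([0, 1, 2, 3]) for _ in range(4000)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

f_hat = conditional_mean(xs, ys)
for x in sorted(f_hat):
    print(x, round(f_hat[x], 3))
```

Each printed value should lie close to $2x$, the conditional mean used to generate the data.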

## Point Estimation
- Point estimation is the attempt to provide the “best” prediction of some quantity of interest.
- Distinguish estimates of parameters from their true values by denoting the estimate of a parameter $\theta$ by $\hat{\theta}$, which is itself a random variable (a function of the random variables $x_{i}$)
- Let $\{x^1, x^2, \dots, x^n\}$ be a set of $n$ independent and identically distributed data points. A **point estimator** or **statistic** is any function of the data:
$$\hat{\theta}_{n} = g(X) = g(x^1 , x^2 , ..., x^n)$$

## Point Estimation Example
- If the function $g(X)$ is properly selected, then the estimation error $\theta - \hat{\theta}_{n}$ decreases as the number $n$ of examples increases
- Assume $\mu$ denotes the mean grade point average of all college students, and the sample space is {1,2,3,4}. If $x_i$ are the observed grades of a sample of 88 students, then:
$$\hat{\mu} = \frac{1}{88} \sum_{i=1}^{88} x_{i} = 3.12$$
is a point estimate of $\mu$, the mean grade point average of all the students in the population
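
The claim that the estimation error shrinks with $n$ can be checked with a small simulation. This is a minimal sketch, not the data behind the 3.12 figure: the population weights over the grades {1,2,3,4} are hypothetical, chosen so that the true mean is 3.0, and the error is averaged over repeated samples because any single sample can be lucky or unlucky.

```python
import random
import statistics

def sample_mean(data):
    """Point estimator g(x^1, ..., x^n): the arithmetic mean of the sample."""
    return sum(data) / len(data)

rng = random.Random(42)
grades = [1, 2, 3, 4]
weights = [0.1, 0.2, 0.3, 0.4]  # hypothetical population distribution
true_mu = sum(g * w for g, w in zip(grades, weights))  # population mean, 3.0

def mean_abs_error(n, trials=200):
    """Average |mu_hat - mu| over repeated samples of size n."""
    errs = []
    for _ in range(trials):
        sample = rng.choices(grades, weights=weights, k=n)
        errs.append(abs(sample_mean(sample) - true_mu))
    return statistics.mean(errs)

errors = {n: mean_abs_error(n) for n in (10, 88, 1000)}
for n, err in errors.items():
    print(n, round(err, 4))
```

The average error should fall as the sample size grows from 10 to 88 to 1000.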

## Function Estimation
- Approximate $f$ with a model or estimate $\hat{f}$ chosen from a hypothesis space
- Select functions from a carefully specified set, known as the hypothesis space
- Decide how to represent data set and select hypothesis space
- Generally this space is indexed by a set of parameters $\theta$ that can be tuned to create different machines:
$$H : \{f(Y|X, \theta)\}$$
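
To make the idea of a parameter-indexed hypothesis space concrete, the sketch below uses the family of straight lines $f(x; \theta) = \theta_0 + \theta_1 x$ as an assumed example: each value of $\theta$ picks out one member of the family, and ordinary least squares selects $\hat{\theta}$. The deterministic linear family, the synthetic data, and all function names are illustrative assumptions, not something the notes prescribe.

```python
import random

def make_hypothesis(theta):
    """Return the member f(x; theta) = theta[0] + theta[1] * x of the linear family."""
    t0, t1 = theta
    return lambda x: t0 + t1 * x

def squared_error(theta, xs, ys):
    """Mean squared error of the hypothesis indexed by theta on the data."""
    f = make_hypothesis(theta)
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_least_squares(xs, ys):
    """Closed-form least-squares choice of (intercept, slope) within the family."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    t1 = sxy / sxx
    return (my - t1 * mx, t1)

# Synthetic data (assumption): y = 1.5 + 0.7 * x + Gaussian noise
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [1.5 + 0.7 * x + rng.gauss(0, 0.5) for x in xs]

theta_hat = fit_least_squares(xs, ys)
print(theta_hat, squared_error(theta_hat, xs, ys))
```

Tuning $\theta$ moves between members of the family; a different hypothesis space (for example, polynomials) would change `make_hypothesis` but not the overall recipe.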

## MLE
- Goal is to find the **maximum likelihood estimate** of the parameter $\theta$
- Consider a set of samples $X$ drawn from one member of a family of probability distributions whose parameters we don’t know. We define the likelihood function $$L(\theta | X) = f(X; \theta) = P(X = x; \theta)$$ as the joint probability distribution of the samples $X$, viewed as a function of $\theta$.
- Let $P_{model}(X; \theta)$ be a family of probability distributions indexed by the parameter $\theta$. The maximum likelihood estimator for $\theta$ is defined as:
$$\theta_{ML} = \arg\max_{\theta} L(\theta|X)$$
$$\theta_{ML} = \arg\max_{\theta} P_{model}(x^1, x^2, \dots, x^m; \theta)$$
$$\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^m P_{model}(x^i; \theta)$$
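
The product form relies on the samples being independent. Because the logarithm is monotone, maximizing the product is equivalent to maximizing the sum of log-probabilities, which is numerically safer. The sketch below assumes a Bernoulli model with $P(x = 1) = \theta$ (a choice made only for illustration, with a hypothetical true parameter of 0.3) and compares a brute-force grid search over $\theta$ with the known closed-form Bernoulli MLE, the sample mean.

```python
import math
import random

def log_likelihood(theta, data):
    """Sum of log P_model(x^i; theta) for a Bernoulli model with P(x = 1) = theta."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

# Synthetic i.i.d. samples (assumption): Bernoulli with a hypothetical true theta
rng = random.Random(7)
true_theta = 0.3
data = [1 if rng.random() < true_theta else 0 for _ in range(2000)]

# Brute-force arg max over a grid of candidate parameters in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, data))

# Closed-form MLE for the Bernoulli model: the fraction of ones
theta_closed = sum(data) / len(data)

print(theta_grid, theta_closed)
```

The two estimates should agree up to the grid resolution of 0.001, and both should sit near the parameter used to generate the data.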