## Table of Contents

* [Ensemble Learning](#Ensemble-Learning)
* [SVM](#SVM)
* [Naive Bayes Classifier and Density Estimator](#Naive-Bayes-Classifier-and-Density-Estimator)

### Ensemble Learning

#### Bagging

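* Create k bootstrap samples: each one is drawn from the training set uniformly at random with replacement, and usually has the same size as the training set
* Train a distinct classifier on each bootstrap sample
* Classify a test point by majority vote (classification) or by averaging the predictions (regression)

The out-of-bag (OOB) error can be used instead of the cross-validation (CV) error: each training point is predicted using only the classifiers whose bootstrap sample did not contain it, and the resulting errors are averaged.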
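The bagging steps and the OOB error can be sketched as below. This is a minimal illustration, not a reference implementation: the 1-D threshold "stump" base learner, the toy dataset, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D threshold classifier: predict `right` if x > t else `left`."""
    best = None
    for t, _ in points:
        for left, right in ((-1, 1), (1, -1)):
            errors = sum((right if x > t else left) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: right if x > t else left

def bagging(data, k, rng):
    """Train k stumps, each on a bootstrap sample drawn with replacement."""
    n = len(data)
    models = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit_stump([data[i] for i in idx])
        models.append((model, set(idx)))  # remember which points were "in the bag"
    return models

def predict(models, x):
    """Majority vote over all models."""
    votes = Counter(model(x) for model, _ in models)
    return votes.most_common(1)[0][0]

def oob_error(models, data):
    """Error rate where each point is judged only by models that never saw it."""
    mistakes, counted = 0, 0
    for i, (x, y) in enumerate(data):
        votes = Counter(m(x) for m, seen in models if i not in seen)
        if votes:
            counted += 1
            mistakes += votes.most_common(1)[0][0] != y
    return mistakes / counted if counted else float("nan")

# Hypothetical toy data: label is -1 below x = 5.0 and +1 from 5.0 upward.
rng = random.Random(0)
data = [(x / 10, -1 if x < 50 else 1) for x in range(100)]
models = bagging(data, k=25, rng=rng)
err = oob_error(models, data)
```

Using an odd `k` avoids tied votes in the two-class case.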

#### Random Forest

* Bag the training set and train n trees (same as normal bagging).
* In addition, at each split within a tree, only a random sample of m features is considered instead of all the features.
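The per-split feature sampling can be sketched as follows. The Gini impurity criterion, the function names, and the toy data are assumptions for illustration; the notes above do not fix a particular split criterion.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y, m, rng):
    """Pick the best (feature, threshold) among m randomly chosen features."""
    d = len(X[0])
    features = rng.sample(range(d), m)  # the random forest twist: only m candidates
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # a split must send points to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return features, best

# Hypothetical data: feature 0 separates the classes, feature 1 is constant,
# feature 2 is uninformative.
X = [[0, 5, 1], [1, 5, 0], [2, 5, 1], [3, 5, 0]]
y = [0, 0, 1, 1]
features_all, best_all = best_split(X, y, m=3, rng=random.Random(0))
features_one, best_one = best_split(X, y, m=1, rng=random.Random(0))
```

With `m` equal to the number of features this reduces to an ordinary tree split; smaller `m` decorrelates the trees.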

#### AdaBoost

1. Train a model $h_{t}$ using instance weights $w_{t}$
2. Compute the weighted training error $\epsilon_{t}$
3. Choose $\beta_{t} = \frac{1}{2}\ln(\frac{1-\epsilon_{t}}{\epsilon_{t}})$
4. Update instance weights $w_{t+1,i} = w_{t,i}\exp(-\beta_{t}y^{(i)}h_{t}(x^{(i)}))$, then normalize them so they sum to 1. Misclassified instances gain weight, so they count for more in the next model.
5. Repeat steps 1-4 T times. The final model is a weighted combination of all these models: $H(x) = \text{sign}(\Sigma^{T}_{t=1}\beta_{t}h_{t}(x))$.

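AdaBoost works best with "weak" learners (models that are only slightly better than chance, such as decision stumps). In practice it is often resistant to overfitting, although it can overfit noisy data. If every weak learner does better than chance on the weighted data, the training error decreases exponentially in T, so it can be driven to zero.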
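A minimal sketch of the AdaBoost loop, assuming 1-D inputs, threshold stumps as the weak learner, and labels in {-1, +1}. The function names and the toy dataset are illustrative assumptions.

```python
import math

def fit_weighted_stump(xs, ys, w):
    """Return (error, threshold, sign) minimising the weighted error.

    The stump predicts `sign` if x > threshold, otherwise `-sign`.
    """
    best = None
    for t in xs:
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (sign if x > t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, T):
    n = len(xs)
    w = [1.0 / n] * n                          # start with uniform weights
    ensemble = []
    for _ in range(T):
        err, t, sign = fit_weighted_stump(xs, ys, w)      # steps 1 and 2
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        beta = 0.5 * math.log((1 - err) / err)            # step 3
        ensemble.append((beta, t, sign))
        # Step 4: up-weight mistakes, down-weight correct points, renormalise.
        w = [wi * math.exp(-beta * y * (sign if x > t else -sign))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the beta-weighted vote of all stumps."""
    score = sum(beta * (sign if x > t else -sign) for beta, t, sign in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
one_round = adaboost(xs, ys, T=1)
three_rounds = adaboost(xs, ys, T=3)
```

On this toy set a single stump misclassifies three points, while three boosted stumps fit all ten training points.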

### SVM

Line (2-dimensions): $\theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} = 0$

Hyperplane (d-dimensions): $\theta_{0} + \theta_{1}x_{1} + ... + \theta_{d}x_{d} = 0$

Looking for a hyperplane where:

$\theta_{0} + \theta_{1}x^{(i)}_{1} + ... + \theta_{d}x^{(i)}_{d} > 0\text{ if }y^{(i)} = 1$

$\theta_{0} + \theta_{1}x^{(i)}_{1} + ... + \theta_{d}x^{(i)}_{d} < 0\text{ if }y^{(i)} = -1$

#### Classifier formulation 1

*Note*: $y$ = 1 for the positive class, $y$ = -1 for the negative class.

$\text{max M}$

$y^{(i)}(\theta^{T}x^{(i)}) \geq M \text{ } \forall i$

$||\theta||_{2} = 1$ &larr; Normalization constraint

#### Classifier formulation 2

$\text{Min} ||\theta||^{2}$

$y^{(i)}(\theta^{T}x^{(i)}) \geq 1 \text{ } \forall i$

* This is the easier formulation to optimize, and is equivalent.
* Maximum margin classifier given by solution $\theta$ to this optimization problem.

#### Adding slack

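A hard maximum margin is not always best: it is sensitive to outliers, and it has no solution when the classes are not linearly separable. Slack variables $\epsilon_{i}$ allow some points to violate the margin.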

$\text{max M}$

$y^{(i)}(\theta^{T}x^{(i)}) \geq M(1 - \epsilon_{i}) \text{ } \forall i$

$||\theta||_{2} = 1$

$\epsilon_{i} \geq 0, \Sigma_{i}\epsilon_{i} \leq C$

C is the error budget hyperparameter: the total amount of margin violation that is allowed.

#### Adding slack (formulation 2)

$\text{Min } ||\theta||^{2} + C\text{ }\Sigma_{i}\epsilon_{i}$

$y^{(i)}(\theta^{T}x^{(i)}) \geq 1 - \epsilon_{i} \text{ } \forall i$

$\epsilon_{i} \geq 0$

Final classifier: $f(z) = \theta_{0} + \Sigma_{i}\alpha_{i}\langle z,x^{(i)}\rangle$. This is a linear combination of the inner products of the point $z$ with the training points; $\alpha_{i}$ is nonzero only for the support vectors.

#### Properties

* With slack, SVM is resilient to outliers.
* Finds "max margin classifier".

#### Hinge Loss

$J(\theta) = C\text{ }\Sigma^{n}_{i=1} \max(0,1 - y^{(i)}h(x^{(i)})) + \Sigma^{d}_{j=1}\theta^{2}_{j}$
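The objective can be evaluated directly. The sketch below assumes a linear model $h(x) = \theta_{0} + \Sigma_{j}\theta_{j}x_{j}$ and leaves $\theta_{0}$ out of the penalty, matching the sum over $j = 1..d$; the function name and the sample values are hypothetical.

```python
def hinge_objective(theta0, theta, X, y, C):
    """J(theta) = C * sum_i max(0, 1 - y_i * h(x_i)) + sum_j theta_j^2."""
    def h(x):
        # Assumed linear model: h(x) = theta0 + theta . x
        return theta0 + sum(t * xj for t, xj in zip(theta, x))

    data_term = sum(max(0.0, 1.0 - yi * h(xi)) for xi, yi in zip(X, y))
    penalty = sum(t * t for t in theta)   # theta0 is not penalised
    return C * data_term + penalty

# Hypothetical 1-D example: two points beyond the margin, one inside it.
X = [[2.0], [-2.0], [0.5]]
y = [1, -1, 1]
J = hinge_objective(theta0=0.0, theta=[1.0], X=X, y=y, C=1.0)
```

Only the third point contributes to the data term here (its margin is 0.5, so its hinge loss is 0.5); points with margin at least 1 contribute nothing.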

#### Kernels

A kernel $K(z, x^{(i)})$ can be substituted for the inner product $\langle z, x^{(i)}\rangle$ in the linear combination over the support vectors, which yields a non-linear decision boundary.

* Polynomial kernel of degree m
    * $K(a,b) = (1 + \Sigma^{d}_{i=1}a_{i}b_{i})^{m}$
* Radial Basis Function (RBF) (Gaussian kernel)
    * $K(a,b) = \exp(-\gamma\Sigma^{d}_{i=1}(a_{i}-b_{i})^{2})$
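Both kernels are short to write out. The sketch below uses the standard Gaussian form $\exp(-\gamma\|a-b\|^{2})$ for the RBF kernel and plugs a kernel into the classifier $f(z) = \theta_{0} + \Sigma_{i}\alpha_{i}K(z, x^{(i)})$; the support vectors and coefficients are made-up values for illustration, not the output of a trained SVM.

```python
import math

def polynomial_kernel(a, b, m):
    """K(a, b) = (1 + <a, b>)^m."""
    return (1 + sum(ai * bi for ai, bi in zip(a, b))) ** m

def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def decision_function(z, support_vectors, alphas, theta0, kernel):
    """f(z) = theta0 + sum_i alpha_i * K(z, x_i) over the support vectors."""
    return theta0 + sum(a * kernel(z, x) for a, x in zip(alphas, support_vectors))

# Hypothetical support vectors and coefficients, for illustration only.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [-1.0, 1.0]
f = decision_function([2.0, 2.0], support_vectors, alphas, 0.0,
                      lambda a, b: rbf_kernel(a, b, gamma=0.5))
```

The sign of `f` gives the predicted class. Prediction cost grows with the number of support vectors, not with the size of the full training set.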
    
Pros:

* Non-linear features
* More flexible decision boundary
* Testing is computationally efficient
    
Cons:

* Kernels need to be tuned (additional hyperparameters)
* Training with RBF or polynomial kernels takes longer than training a linear SVM

### Naive Bayes Classifier and Density Estimator

#### Bayes' Rule

$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$
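A short worked example of the rule, using made-up numbers for a diagnostic test: $A$ is "has the condition" and $B$ is "tests positive". All three input probabilities are assumptions chosen for illustration.

```python
# Hypothetical inputs.
p_a = 0.01              # prior P(A)
p_b_given_a = 0.90      # P(B|A)
p_b_given_not_a = 0.05  # P(B|not A)

# P(B) by the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b
```

Here the posterior works out to 2/13 (about 0.154): even after a positive test, the condition is still unlikely because the prior is so small.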

#### Prior and Joint Probabilities

**Prior probability**: Degree of belief without any other evidence

**Joint probability**: Table giving the probability of every combination of values of a set of variables

#### Density Estimation

A density estimator learns a mapping from a set of attributes to a probability.

A density estimator can tell you how likely a dataset is, assuming that all records were independently generated.

$\hat{P}(x_{1} \land x_{2} \land ... \land x_{n} | M) = \Pi^{n}_{i=1}\hat{P}(x_{i}|M)$

For large datasets, this product will usually underflow (become too small to represent in floating point), so log probabilities are used and summed instead:

$\log \hat{P}(x_{1} \land x_{2} \land ... \land x_{n} | M) = \Sigma^{n}_{i=1}\log\hat{P}(x_{i}|M)$
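The underflow is easy to reproduce. In this sketch the per-record probabilities are hypothetical; the point is only that the raw product collapses to zero while the sum of log probabilities stays usable.

```python
import math

# Hypothetical per-record probabilities under some model M.
probs = [1e-5] * 100

# Direct product: the true value (1e-500) is below the smallest positive float.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale, comfortably representable.
log_likelihood = sum(math.log(p) for p in probs)
```

`product` ends up as exactly `0.0`, whereas `log_likelihood` is about -1151.3 and can still be compared between models.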

**Pros**

* Density estimators can learn distribution of training data
* Can compute probability for a record
* Can do inference (predict likelihood of a record)

**Cons**

* Can overfit to the training data and not generalize to test data
* Curse of dimensionality

The Naive Bayes classifier mitigates these cons.

#### Naive Bayes Classifier

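Uses training data to estimate $P(X|Y)$ and $P(Y)$, then uses Bayes' rule to infer $P(Y|X_{new})$.

Need to assume that the features are conditionally independent given the label: $P(X|Y) = \Pi_{j}P(X_{j}|Y)$.

Some estimated probabilities can be 0: this happens when there are no training examples with label $y$ and feature value $x_{i}=z$. A single zero factor makes the whole product zero. Fix this with Laplace smoothing.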

##### Laplace Smoothing

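Essentially, add 1 to each count so that no probability can be 0 (only close to 0): $\hat{P}(x_{i}=z|y) = \frac{\text{count}(x_{i}=z, y) + 1}{\text{count}(y) + k}$, where $k$ is the number of distinct values feature $x_{i}$ can take.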

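As described, the Naive Bayes classifier gives predictions (the label with the highest score $P(X|Y)P(Y)$), not probabilities, because the denominator $P(X)$ of Bayes' rule is the same for every label and is ignored. Normalizing the scores over all labels recovers probabilities.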
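A minimal sketch of a categorical Naive Bayes classifier with add-one smoothing, working in log space to avoid underflow. The function names and the toy weather data are assumptions for illustration, and $k$ is taken to be the number of distinct values seen in training for each feature.

```python
import math
from collections import Counter, defaultdict

def train(X, y):
    """Collect the counts needed to estimate P(Y) and P(X_j | Y)."""
    label_counts = Counter(y)
    feature_counts = defaultdict(Counter)  # (label, j) -> counts of each value
    values = defaultdict(set)              # j -> distinct values seen for feature j
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            feature_counts[(label, j)][v] += 1
            values[j].add(v)
    return label_counts, feature_counts, values

def log_scores(model, row):
    """log P(Y) + sum_j log P(x_j | Y) for every label (unnormalised)."""
    label_counts, feature_counts, values = model
    n = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        s = math.log(count / n)
        for j, v in enumerate(row):
            num = feature_counts[(label, j)][v] + 1  # Laplace: add 1 to the count
            den = count + len(values[j])             # ... and k to the total
            s += math.log(num / den)
        scores[label] = s
    return scores

def predict(model, row):
    """Return the label with the highest score; P(X) is never computed."""
    scores = log_scores(model, row)
    return max(scores, key=scores.get)

# Hypothetical training data: (outlook, temperature) -> play?
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
y = ["no", "no", "yes", "yes"]
model = train(X, y)
scores = log_scores(model, ("sunny", "cool"))
```

The combination `("sunny", "cool")` never appears with either label, yet every score stays finite thanks to the smoothing.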
