# Table of Contents

* [Probability and Linear Algebra](#Probability-and-Linear-Algebra)
    * [Random Variables](#Random-Variables)
    * [Axioms/Theorems of Probability/Set Theory](#Axioms/Theorems-of-Probability/Set-Theory)
    * [Discrete Random Variables](#Discrete-Random-Variables)
    * [Continuous Random Variables](#Continuous-Random-Variables)
    * [Expectation and Variance](#Expectation-and-Variance)
    * [Conditional Probability and Bayes Theorem](#Conditional-Probability-and-Bayes-Theorem)
    * [Vectors and Matrices](#Vectors-and-Matrices)
    * [Matrix Products](#Matrix-Products)
    * [Matrix Properties](#Matrix-Properties)
    * [Linear Independence](#Linear-Independence)
* [Linear Regression](#Linear-Regression)
    * [Simple LR](#Simple-LR)
    * [Terminology and Metrics](#Terminology-and-Metrics)
    * [Closed form solution to simple LR](#Closed-form-solution-to-simple-LR)
    * [Multiple LR](#Multiple-LR)
    * [Closed form solution to multiple LR](#Closed-form-solution-to-multiple-LR)
    * [Feature Standardization](#Feature-Standardization)
    * [Categorical Variables](#Categorical-Variables)
    * [Gradient Descent](#Gradient-Descent)
    * [GD for Linear Regression](#GD-for-Linear-Regression)
* [Perceptrons](#Perceptrons)
    * [Online Learning vs Batch Learning](#Online-Learning-vs-Batch-Learning)
    * [Perceptron Limitations and Improvements](#Perceptron-Limitations-and-Improvements)
* [KNN](#KNN)
* [Cross Validation](#Cross-Validation)
    * [K-fold CV](#K-fold-CV)
* [Classification Metrics](#Classification-Metrics)
    * [Confusion Matrix](#Confusion-Matrix)
* [Logistic Regression](#Logistic-Regression)
    * [Maximum Likelihood Estimation](#Maximum-Likelihood-Estimation)
    * [Gradient Descent for Logistic Regression](#Gradient-Descent-for-Logistic-Regression)
* [LDA](#LDA)

### Probability and Linear Algebra

#### Random Variables

A random variable $A$ represents an event that can take place.

*Example*

* $A$ = I have a headache
* $A$ = Sally will be president in 2020

$P(A)$ is the probability $A$ will be true.

#### Axioms/Theorems of Probability/Set Theory
$P(A \lor B) = P(A) + P(B) - P(A \land B)$

$P(A \lor B) \leq P(A) + P(B)$

$P(\lnot A) = 1 - P(A)$

$P(A) = P(A \land B) + P(A \land \lnot B)$

#### Discrete Random Variables

Discrete random variables (DRV) take on a finite or countably infinite number of values. A discrete uniform random variable is a DRV in which each possible value has an equal probability.

Bernoulli random variables are DRV with exactly 2 possible outcomes.

Binomial random variables are DRV that count how many times a Bernoulli random variable comes up positive in $n$ independent trials. The probability of exactly $k$ positive outcomes is given by the following formula (assume $p$ is the probability of the positive case):

${n\choose k}p^{k}(1 - p)^{n - k}$
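
The formula can be checked numerically. Below is a minimal sketch using only Python's standard library; the function name `binomial_pmf` and the coin-flip numbers are illustrative assumptions, not part of the original notes.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k positive outcomes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical example: exactly 3 heads in 5 flips of a fair coin.
prob = binomial_pmf(3, 5, 0.5)
print(prob)  # 0.3125
```

Summing `binomial_pmf(k, n, p)` over all `k` from 0 to `n` should give 1, which is a handy sanity check.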

#### Continuous Random Variables

$X$ is a continuous random variable (CRV) if $X$ can take on an uncountably infinite number of values, such as any real number in an interval.

The **cumulative distribution function CDF** $F$ for $X$ is defined for every value $x$ by:

$F(x) = Pr(X \leq x)$

The **probability density function PDF** $f(x)$ for $X$ is

$f(x) = \frac{dF(x)}{dx}$

Think of the PDF as the probability density at a point (for a CRV, the probability of any single exact value is 0), and the CDF as the probability that the variable is at most that point.
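
To make the relationship $f(x) = \frac{dF(x)}{dx}$ concrete, here is a small sketch that compares a known PDF with a numerical derivative of its CDF. The exponential distribution with rate 1 is chosen purely as an assumed example, and the helper names are hypothetical.

```python
import math

def cdf(x):
    """CDF of an exponential distribution with rate 1: F(x) = 1 - e^(-x) for x >= 0."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def pdf(x):
    """PDF of the same distribution: f(x) = e^(-x) for x >= 0."""
    return math.exp(-x) if x >= 0 else 0.0

def numeric_derivative(func, x, h=1e-5):
    """Central-difference approximation of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

x = 1.0
print(pdf(x), numeric_derivative(cdf, x))  # the two values should nearly agree
print(cdf(2.0) - cdf(1.0))                 # P(1 < X <= 2), an area under the PDF
```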

#### Expectation and Variance

Expectation: The weighted average value for a random variable.

*Properties*:
* $E[ag(X)] = aE[g(X)] \text{ (a is constant)}$
* $E[f(X) + g(X)] = E[f(X)] + E[g(X)]$

Variance: The average squared distance from the mean value. Can be calculated as follows:

$Var(X) = E[X^{2}] - E[X]^{2}$
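
As a worked example for a discrete random variable, the sketch below computes the expectation and variance of a fair six-sided die. The die and the helper `expectation` are assumptions for illustration; `Fraction` is used so the results are exact.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

def expectation(g, values, probs):
    """E[g(X)] = sum over all values x of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in zip(values, probs))

mean = expectation(lambda x: x, values, probs)               # E[X]
second_moment = expectation(lambda x: x * x, values, probs)  # E[X^2]
variance = second_moment - mean**2                           # E[X^2] - E[X]^2

print(mean, variance)  # 7/2 35/12
```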

Here is a good video for finding these values for CRV: https://youtu.be/Ro7dayHU5DQ

#### Conditional Probability and Bayes Theorem


Conditional probability: $P(A|B) = \frac{P(A \land B)}{P(B)}$

Bayes' theorem: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
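
A short numeric sketch of Bayes' theorem follows. All of the probabilities are made-up numbers for a hypothetical medical test, where $A$ = "has the condition" and $B$ = "test is positive".

```python
# Hypothetical numbers for illustration only.
p_a = 0.01              # P(A): prior probability of having the condition
p_b_given_a = 0.95      # P(B|A): test is positive given the condition
p_b_given_not_a = 0.05  # P(B|not A): false positive rate

# P(B) = P(B and A) + P(B and not A) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.161
```

Even with a fairly accurate test, the posterior stays low because the prior $P(A)$ is small.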

#### Vectors and Matrices

Vectors are ordered sets of numbers. Row vectors are of dimensions $1\times n$, column vectors are of dimensions $n\times 1$.

Norms are a measure of the "length" of a vector.

* L1 norm: $||x||_{1} = \sum_{i=1}^{n}|x_{i}|$
* L2 norm: $||x||_{2} = \sqrt{\sum_{i=1}^{n}x_{i}^{2}}$
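
Both norms are easy to compute directly. This is a minimal sketch with hypothetical helper names, using plain Python lists as vectors.

```python
import math

def l1_norm(x):
    """Sum of the absolute values of the entries."""
    return sum(abs(xi) for xi in x)

def l2_norm(x):
    """Square root of the sum of the squared entries."""
    return math.sqrt(sum(xi * xi for xi in x))

v = [3, -4]
print(l1_norm(v), l2_norm(v))  # 7 5.0
```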

#### Matrix Products

Vector dot (inner) product: let $r$ be a row vector and $c$ be a column vector, both of length $n$. Then $rc = \sum_{i=1}^{n}r_{i}c_{i}$

In order to multiply matrices, their inner dimensions must match: if $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$, then $AB \in \mathbb{R}^{m\times p}$.
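
The dimension rule can be seen in a small pure-Python sketch, where matrices are lists of rows. The function `matmul` is a hypothetical helper, not a library call; each output entry is the dot product of a row of the first matrix with a column of the second.

```python
def matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B (both lists of rows)."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner dimensions must match")
    p = len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(n)) for j in range(p)] for row in A]

A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]         # 3 x 2
print(matmul(A, B))  # [[4, 5], [10, 11]]  (2 x 2)
```

Note that `matmul(B, A)` is also defined here but gives a 3 x 3 result, so the order of the factors matters.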

#### Matrix Properties

Associative: $(AB)C = A(BC)$

Distributive: $A(B + C) = AB + AC$

**NOT** Commutative: in general, $AB \neq BA$

Transpose: Think of it as flipping a matrix. A row vector transpose is a column vector. $m\times n$ &rarr; $n \times m$

#### Linear Independence

A set of vectors is linearly independent if none of them can be written as a linear combination of the others.

### Linear Regression

One of the most widely used ML techniques.

#### Simple LR

$h_{\theta}(x) = \theta_{0} + \theta_{1}x$

In this case, x is single dimensional. This is essentially finding the "best fit line".

Cost function: $J(\theta) = \frac{1}{n}\sum^{n}_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$ - This is the Mean Squared Error (MSE)

#### Terminology and Metrics

Residuals - Difference between predictions and actual values.

$R^{(i)} = y^{(i)} - \hat{y}^{(i)}\text{, }\hat{y}^{(i)} = h_{\theta}(x^{(i)})$

Residual Sum of Squares (RSS)

$RSS = \sum^{n}_{i=1}(y^{(i)} - \hat{y}^{(i)})^{2}$

Residual Standard Error (RSE)

$RSE = \sqrt{\frac{RSS}{n - 2}}$
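
These metrics can be computed from a list of actual values and a list of predictions. The sketch below uses made-up numbers and hypothetical helper names; the $n - 2$ in the denominator reflects the two parameters estimated by simple LR.

```python
import math

def rss(ys, y_hats):
    """Residual sum of squares."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))

def rse(ys, y_hats):
    """Residual standard error for simple LR (two estimated parameters)."""
    return math.sqrt(rss(ys, y_hats) / (len(ys) - 2))

# Hypothetical actual values and predictions.
ys = [1.0, 3.0, 5.0, 7.0]
y_hats = [1.5, 2.5, 5.5, 6.5]
print(rss(ys, y_hats))  # 1.0
print(rse(ys, y_hats))  # sqrt(1.0 / 2), roughly 0.707
```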

#### Closed form solution to simple LR

$\theta_{0} = \bar{y} - \theta_{1}\bar{x}$

$\theta_{1} = \frac{\sum(x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum(x^{(i)} - \bar{x})^{2}}$

$\bar{x},\bar{y} = \text{mean of x, y respectively}$

This was found by taking the partial derivatives of the cost function with respect to $\theta_{0}$ and $\theta_{1}$ and setting them equal to 0.
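
The closed form translates almost line for line into code. This is a minimal sketch; `fit_simple_lr` is a hypothetical name and the data is made up so that it lies exactly on a line.

```python
def fit_simple_lr(xs, ys):
    """Return (theta0, theta1) from the closed-form solution for simple LR."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    theta1 = numerator / denominator
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_simple_lr(xs, ys)
print(theta0, theta1)  # 1.0 2.0
```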


#### Multiple LR

$h_{\theta}(x) = \theta^{T}x$

$J(\theta) = \frac{1}{n}\sum^{n}_{i = 1}(h_{\theta}(x^{(i)}) - y^{(i)})^{2}$

#### Closed form solution to multiple LR

$\theta = (X^{T}X)^{-1}X^{T}y$

$X$ is the matrix where row $i$ is the number 1 followed by the features $x_{1}^{(i)}, \ldots, x_{d}^{(i)}$, and $y$ is the vector of target values.
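
Below is a dependency-free sketch of the closed-form solution. As an implementation choice (an assumption, not something from these notes), it solves the linear system $(X^{T}X)\theta = X^{T}y$ by Gauss-Jordan elimination instead of forming the inverse explicitly, which gives the same $\theta$ when $X^{T}X$ is invertible. All helper names and the data are hypothetical.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_normal_equation(features, ys):
    X = [[1.0] + list(row) for row in features]  # prepend the constant 1 to each row
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * y for x, y in zip(col, ys)) for col in Xt]
    return solve(XtX, Xty)

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2.
features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
ys = [1, 3, 4, 6, 8]
theta = fit_normal_equation(features, ys)
print([round(t, 6) for t in theta])  # [1.0, 2.0, 3.0]
```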

#### Feature Standardization

$\mu_{j},s_{j} = \text{mean, standard deviation of feature j}$

$x_{j}^{(i)} \leftarrow \frac{x_{j}^{(i)} - \mu_{j}}{s_{j}}$

This rescales each feature to have mean 0 and variance 1. This helps because different features can be on very different scales, which in particular slows the convergence of gradient descent.
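
A minimal sketch of standardizing a single feature column follows. The function name and the data are hypothetical, and it assumes the population standard deviation (dividing by $n$); some libraries divide by $n - 1$ instead.

```python
import math

def standardize(column):
    """Rescale one feature column to mean 0 and variance 1."""
    n = len(column)
    mu = sum(column) / n
    s = math.sqrt(sum((x - mu) ** 2 for x in column) / n)  # population std (assumption)
    return [(x - mu) / s for x in column]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical feature values
z = standardize(heights_cm)
print(z)
```

The same $\mu_{j}$ and $s_{j}$ computed on the training data should be reused when transforming validation or test data.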

#### Categorical Variables

Example: State - has 50 values. To encode, we create 49 indicator variables; the 50th state is represented by all 49 indicators being 0.

$x_{MA} = 1 \text{ if State = MA and 0 otherwise}$

This can make it so that data becomes too sparse.
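
The encoding can be sketched as follows. The helper name and the three-state subset are assumptions for illustration; the last category in the list is treated as the baseline that gets no indicator of its own.

```python
def encode_k_minus_1(value, categories):
    """Indicator encoding with k - 1 variables; the last category is the baseline."""
    return [1 if value == c else 0 for c in categories[:-1]]

states = ["CA", "MA", "NY"]            # hypothetical 3-state subset
print(encode_k_minus_1("MA", states))  # [0, 1]
print(encode_k_minus_1("NY", states))  # [0, 0]  (baseline: all indicators are 0)
```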

#### Gradient Descent

Repeat the following:

$\theta_{j} \leftarrow \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)$

Intuition: The derivative gives the slope of $J$ at the current point, which indicates whether theta is too high or too low. Stepping against the slope moves theta toward a minimum, provided the learning rate $\alpha$ is small enough.

#### GD for Linear Regression

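$\theta \leftarrow \theta - \alpha \sum^{n}_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}$

With $J$ defined as the MSE, the gradient carries a constant factor of $\frac{2}{n}$, which is absorbed into the learning rate $\alpha$ here.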
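
The update rule can be sketched in plain Python. The function name, learning rate, iteration count, and data are all assumptions chosen so that this small example converges; each row of `X` is assumed to already start with the constant 1.

```python
def gradient_descent_lr(X, ys, alpha=0.01, iters=5000):
    """Batch gradient descent for linear regression."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [0.0] * len(theta)
        for x, y in zip(X, ys):
            error = sum(t * xj for t, xj in zip(theta, x)) - y  # h_theta(x) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical data lying exactly on y = 1 + 2x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent_lr(X, ys)
print([round(t, 3) for t in theta])  # approximately [1.0, 2.0]
```

If `alpha` is too large the updates diverge, so in practice the learning rate usually needs tuning.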



### Perceptrons

Classification algorithm where y=1 for the positive case, y=-1 for the negative case.

$h(x) = sign(\theta^{T}x)$

Perceptron update rule:

$\theta_{j} \leftarrow \theta_{j} - \frac{1}{2}(h(x^{(i)}) - y^{(i)})x_{j}^{(i)}$

If $x^{(i)}$ is misclassified, do the following: $\theta \leftarrow \theta + y^{(i)}x^{(i)}$
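
Here is a minimal sketch of online perceptron training using the misclassification update above. The names, the epoch limit, the choice to treat $\theta^{T}x = 0$ as positive, and the data are all assumptions; each `x` starts with a constant 1 so that $\theta_{0}$ acts as a bias.

```python
def sign(z):
    return 1 if z >= 0 else -1  # treating 0 as positive is an assumption

def predict(theta, x):
    return sign(sum(t * xj for t, xj in zip(theta, x)))

def train_perceptron(X, ys, epochs=100):
    """Update theta after every misclassified observation."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(X, ys):
            if predict(theta, x) != y:
                theta = [t + y * xj for t, xj in zip(theta, x)]  # theta <- theta + y x
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors, so stop
            break
    return theta

# Hypothetical linearly separable data: two clusters in 2-D.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 2, 2], [1, 2, 3], [1, 3, 2]]
ys = [-1, -1, -1, 1, 1, 1]
theta = train_perceptron(X, ys)
print([predict(theta, x) for x in X] == ys)  # True
```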

#### Online Learning vs Batch Learning

Online Learning - Model update is performed after every observation.

Batch Learning - Model update is performed on entire training set. Updates are performed by computing the average update, and then updating theta with that.

#### Perceptron Limitations and Improvements

<p style="text-decoration: underline">Limitations</p>

* Dependent on starting point
* Could take a long time to converge
* Can overfit data
* Many different decision boundaries are possible

<p style="text-decoration: underline">Improvements</p>

* Use a combination of multiple Perceptrons
* Averaged Perceptron: Average intermediate perceptrons

### KNN

Find the K nearest samples in the training data to a test point, classify by majority vote.

Typical metric to determine distance is Euclidean distance: $\sqrt{\sum^{d}_{i=1}(x_{i} - y_{i})^{2}}$, where $d$ is the number of features.

Cons:

* Does not learn any model
* Instance learner - needs all data at test time.

* k=1 &rarr; overfit
* k=n &rarr; potential underfit, takes much longer to classify

Choose k through cross validation.
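
A small sketch of KNN classification with Euclidean distance and majority vote follows. The function name and the toy data are assumptions; `math.dist` requires Python 3.8 or newer, and vote ties are broken arbitrarily here.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two clusters with labels "a" and "b".
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (1, 1), k=3))  # a
print(knn_predict(train_X, train_y, (5, 4), k=3))  # b
```

Every prediction scans the whole training set, which is why KNN needs all of the data at test time.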

### Cross Validation
* Split training into training and validation data.
* Hold out validation data from training, and measure error with it. This way you test on data the algorithm hasn't seen before.

#### K-fold CV

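Split data into k partitions of equal size. For each partition, train the model (with the same hyperparameters) on the other k-1 partitions and validate on the held-out partition. Each partition is used as the validation set exactly once, and the k validation errors are averaged.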
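
Here is a minimal sketch of generating the splits, where each partition serves once as the validation set while the remaining partitions form the training set. The function name is hypothetical, and the sketch does not shuffle; in practice the data is usually shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n))
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

for train, val in k_fold_indices(6, 3):
    print(train, val)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```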

### Classification Metrics

$accuracy = \frac{\text{\# correct predictions}}{\text{\# total instances}}$

$error = 1 - accuracy$

#### Confusion Matrix

<p style='text-align: center'>Predicted Class</p>

|             |     | Yes | No |
|-------------|-----|-----|----|
|Actual Class | Yes | TP  | FN |
|             | No  | FP  | TN |

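$precision = \frac{TP}{TP + FP}$ Precision is the fraction of predicted positives that are actually positive. It is high when we had few false positives.

$recall = \frac{TP}{TP + FN}$ Recall is the fraction of actual positives that we predicted as positive. It is high when we had few false negatives.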
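
The confusion matrix entries and the metrics built on them can be computed as in the sketch below. The labels are made up, 1 is assumed to be the positive ("Yes") class, and the helper name is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fn, fp, tn)               # 3 1 1 5
print(accuracy, precision, recall)  # 0.8 0.75 0.75
```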


### Logistic Regression

$h_{\theta}(x) = g(\theta^{T}x)$

$g(z) = \frac{1}{1 + e^{-z}}$ This is known as the sigmoid function.

The sigmoid function has the nice property of always being between 0 and 1. This allows logistic regression to be a probabilistic classifier, where the higher $h(x)$ is, the more confident it is that the label is 1.

To make predictions we use a threshold. If $h(x)$ > threshold, we predict 1, else we predict 0.
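
The hypothesis and the threshold rule can be sketched as follows. The parameter values are made up, the default threshold of 0.5 is an assumption, and `x[0]` is assumed to be the constant 1.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x, threshold=0.5):
    """Return 1 if h_theta(x) exceeds the threshold, else 0."""
    h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return 1 if h > threshold else 0

theta = [-1.0, 2.0]                # hypothetical parameters
print(sigmoid(0.0))                # 0.5
print(predict(theta, [1.0, 2.0]))  # 1, since theta^T x = 3 and g(3) > 0.5
print(predict(theta, [1.0, 0.0]))  # 0, since theta^T x = -1 and g(-1) < 0.5
```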

#### Maximum Likelihood Estimation

What is the likelihood of training data for parameter $\theta$?

$\max_{\theta}L(\theta)\text{, where }L(\theta) = P[Y|X;\theta]$

Assumption: training points are independent

$L(\theta) = \prod^{n}_{i=1}P[y^{(i)}|x^{(i)};\theta]$

Maximizing the likelihood is equivalent to maximizing the log of the likelihood (log likelihood), because log is monotonically increasing.

$\log L(\theta) = \sum^{n}_{i=1}\log P[y^{(i)}|x^{(i)};\theta]$

Expand right side

$\theta_{MLE} = argmax_{\theta}\sum^{n}_{i=1}\left[y^{(i)}\log(h_{\theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\theta}(x^{(i)}))\right]$

Make this negative for the cost function, as we want to minimize it.

$J(\theta) = -\sum^{n}_{i=1}\left[y^{(i)}\log(h_{\theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\theta}(x^{(i)}))\right]$

**Intuition**: If y = 1, then the second term of cost = 0. If y = 0, the first term = 0.

#### Gradient Descent for Logistic Regression

$\theta \leftarrow \theta - \alpha \sum^{n}_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}$

Same as linear regression! Except h is different.
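
The update can be sketched in plain Python together with the cost $J(\theta)$, so the decrease in cost is visible. The names, learning rate, iteration count, and data are assumptions for illustration; the first entry of each `x` is the constant 1, and the data is deliberately not perfectly separable so the cost stays finite.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, ys):
    """Negative log likelihood J(theta)."""
    total = 0.0
    for x, y in zip(X, ys):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total

def gradient_step(theta, X, ys, alpha):
    """One batch update: theta <- theta - alpha * sum((h(x) - y) * x)."""
    grad = [0.0] * len(theta)
    for x, y in zip(X, ys):
        error = sigmoid(sum(t * xj for t, xj in zip(theta, x))) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return [t - alpha * g for t, g in zip(theta, grad)]

# Hypothetical 1-D data with overlapping classes.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
ys = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost(theta, X, ys)
for _ in range(1000):
    theta = gradient_step(theta, X, ys, alpha=0.05)
print(initial_cost, cost(theta, X, ys))  # the cost should decrease
```

The inner loop is identical to the linear regression version apart from the call to `sigmoid`.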

### LDA
