<h1><center>Machine Learning - Week 3</center></h1>

<h2><center>Classification Problems</center></h2>

**Classification:** The output variable $y$ takes discrete values

**Binary Classification:** $y \in \{0,1\}$ where 0 = Negative Class (e.g. benign tumor), 1 = Positive Class (e.g. malignant tumor)

<h2><center>Logistic Regression</center></h2>

### Hypothesis Representation

The logistic regression hypothesis is bounded so that its output can be read as a probability: $0 \leq h_\theta(x) \leq 1$

**Hypothesis:** $h_\theta(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}$

**Sigmoid/Logistic Function:** $g(z) = \frac{1}{1+e^{-z}}$

![](img/sigmoid.png)

$h_\theta(x)$ is the estimated probability that $y = 1$ on input $x$
 - Mathematically, $h_\theta(x) = P(y=1 \space | \space x;\theta)$: the probability that $y=1$ given $x$, parameterized by $\theta$
 - Predict $y = 1$ if $h_\theta(x) \geq 0.5$, $y = 0$ if $h_\theta(x) < 0.5$
 - Equivalently, since $g(z) \geq 0.5$ exactly when $z \geq 0$: Predict $y = 1$ if $\theta^Tx \geq 0$, $y = 0$ if $\theta^Tx < 0$
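The hypothesis and the 0.5 threshold rule can be sketched in a few lines of plain Python. The parameter values and the names `hypothesis` and `predict` are illustrative assumptions, not part of the course material, and the sketch ignores overflow for very negative $z$.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must include the bias feature x_0 = 1."""
    z = sum(t * v for t, v in zip(theta, x))
    return sigmoid(z)

def predict(theta, x, threshold=0.5):
    """Return 1 if the estimated P(y=1 | x; theta) reaches the threshold, else 0."""
    return 1 if hypothesis(theta, x) >= threshold else 0

theta = [-3.0, 1.0, 1.0]   # hypothetical parameters
x = [1.0, 2.0, 2.0]        # x_0 = 1 followed by two feature values

print(hypothesis(theta, x))  # theta^T x = 1 > 0, so this is above 0.5
print(predict(theta, x))     # 1
```

Because $\theta^Tx = 1 \geq 0$ here, the thresholded prediction is $y = 1$ without evaluating the sigmoid at all.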

### Decision Boundary
**Decision Boundary:** The curve defined by $\theta^Tx = 0$, which separates the region where we predict $y = 1$ from the region where we predict $y = 0$ - not always linear
 - Linear Decision Boundary ![](img/linear_decision_boundary.png)
 - Non-Linear Decision Boundary ![](img/non-linear_decision_boundary.png)
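As a rough illustration of both cases, the sketch below evaluates $\theta^Tx$ for one linear and one non-linear boundary. The parameter values are assumptions chosen for the example: the first gives the line $x_1 + x_2 = 3$, the second (with added features $x_1^2$ and $x_2^2$) gives the unit circle $x_1^2 + x_2^2 = 1$.

```python
def linear_score(theta, x1, x2):
    """theta^T x for the feature vector [1, x1, x2]."""
    return theta[0] + theta[1] * x1 + theta[2] * x2

def circular_score(theta, x1, x2):
    """theta^T x for the feature vector [1, x1, x2, x1^2, x2^2]."""
    return (theta[0] + theta[1] * x1 + theta[2] * x2
            + theta[3] * x1 ** 2 + theta[4] * x2 ** 2)

theta_linear = [-3.0, 1.0, 1.0]             # boundary: x1 + x2 = 3
theta_circle = [-1.0, 0.0, 0.0, 1.0, 1.0]   # boundary: x1^2 + x2^2 = 1

# A point exactly on the linear boundary scores 0.
print(linear_score(theta_linear, 1.0, 2.0))

# Points with score >= 0 are predicted y = 1, the rest y = 0.
print(linear_score(theta_linear, 2.0, 2.0) >= 0)     # True: above the line
print(circular_score(theta_circle, 0.0, 0.0) >= 0)   # False: inside the circle
print(circular_score(theta_circle, 1.0, 1.0) >= 0)   # True: outside the circle
```

The boundary is always linear in the features fed to the hypothesis; it only looks non-linear in the original inputs when polynomial features are added.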





### Cost Function

Existing Cost Function (from linear regression): $J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}),y^{(i)})$ where $\text{Cost}(h_\theta(x), y) = \frac{1}{2}(h_\theta(x) - y)^2$
- With the sigmoid hypothesis, this produces a *non-convex* curve with many local optima ![](img/non-convex.png)
- We want a *convex* curve, so that gradient descent converges to the global minimum ![](img/convex.png)

Instead: $\text{Cost}(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{ if } y=1 \\ -\log(1-h_\theta(x)) & \text{ if } y=0 \end{cases} \quad \leftrightarrow \quad -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$

![](img/logistic_regression_1.png) ![](img/logistic_regression_2.png)

**Cost Function:** $J(\theta) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)}\log (h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$

**Vectorized Cost Function:** $J(\theta) = \frac{1}{m}\cdot(-y^T\log(h)-(1-y)^T\log(1-h))$ where $h = g(X\theta)$
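A minimal sketch of the (non-vectorized) cost function, assuming a tiny made-up dataset with one feature plus the bias term. It uses plain Python lists rather than a matrix library, and it does not guard against $h$ being exactly 0 or 1, where the logarithm is undefined.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1 - y)*log(1 - h))."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m

# Hypothetical training set: each row is [x_0 = 1, x_1].
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

# With theta = 0 every prediction is 0.5, so J(theta) = log(2).
print(cost([0.0, 0.0], X, y))

# Parameters that separate the two classes give a lower cost.
print(cost([-4.0, 2.0], X, y))
```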

### Gradient Descent

Goal: $\min_\theta J(\theta)$, by repeatedly updating every $\theta_j$ simultaneously

$\text{Repeat } \{ \\
\space \space \space \space \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta) = \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)} \\
\}$

**Vectorized Gradient Descent:** $\theta := \theta - \frac{\alpha}{m}X^T(g(X\theta)-y)$
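The update rule can be sketched as a plain Python loop. The dataset, the learning rate `alpha=0.1`, and the iteration count are assumptions for illustration. The gradient is computed from the old $\theta$ before any component is changed, which is what "simultaneous update" means.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    """Partial derivatives (1/m) * sum((h - y) * x_j), one per theta_j."""
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        error = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += error * x_ij / m
    return grad

def gradient_descent(theta, X, y, alpha=0.1, iterations=5000):
    for _ in range(iterations):
        grad = gradient(theta, X, y)  # uses the old theta for every j
        theta = [t - alpha * g for t, g in zip(theta, grad)]  # simultaneous update
    return theta

# Hypothetical training set: each row is [x_0 = 1, x_1].
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

theta = gradient_descent([0.0, 0.0], X, y)
predictions = [1 if sum(t * v for t, v in zip(theta, x_i)) >= 0 else 0 for x_i in X]
print(theta)
print(predictions)
```

On this small separable dataset the learned parameters classify all four training examples correctly.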

### Advanced Optimization

Other optimization algorithms:
 - *Conjugate Gradient*
 - *BFGS*
 - *L-BFGS*

Advantages:
 - No need to manually pick $\alpha$; a suitable step size is chosen automatically on each iteration
 - Often faster than gradient descent

Disadvantages:
 - More complex
 
*`fminunc` is the Octave function for this; in Python, `scipy.optimize.minimize` offers the `CG`, `BFGS`, and `L-BFGS-B` methods*

### Multiclass Classification

**One vs. All:** Take multiple classes and convert each to a binary classification subproblem
- Train logistic classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict probability that $y=i$
- On each new input $x$, pick the class $i$ with the highest predicted probability: $\arg\max_i h_\theta^{(i)}(x)$
![](img/one_vs_all.png)
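The prediction step of one-vs-all might look like the sketch below. It assumes the per-class classifiers have already been trained; the three parameter vectors are made-up values, and classes are indexed from 0 for convenience.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class index i whose classifier h_theta^(i)(x) is largest."""
    scores = [sigmoid(sum(t * v for t, v in zip(theta, x))) for theta in thetas]
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical parameters for three already-trained binary classifiers.
thetas = [
    [2.0, -1.0, -1.0],   # class 0 vs. rest
    [-2.0, 1.0, 0.0],    # class 1 vs. rest
    [-2.0, 0.0, 1.0],    # class 2 vs. rest
]

print(predict_one_vs_all(thetas, [1.0, 0.0, 0.0]))  # 0
print(predict_one_vs_all(thetas, [1.0, 4.0, 0.0]))  # 1
print(predict_one_vs_all(thetas, [1.0, 0.0, 4.0]))  # 2
```

Since the sigmoid is monotonic, comparing the raw scores $\theta^{(i)T}x$ would select the same class.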

<h2><center>Solving Overfitting</center></h2>

### Problem of Overfitting

**Underfitting:** The hypothesis does not fit the data very well - *high bias* (e.g. fitting a line to data that follows a quadratic function)

**Overfitting:** The hypothesis fits the training data too exactly but fails to generalize to new examples - *high variance*, typically caused by too many features

We can either:
- Reduce # of features
     - Manually select which features to keep
     - Model selection algorithm
- Regularization: 
    - Keep all features but reduce magnitude of parameters $\theta_j$
    - Works well with lots of features, each of which is useful for predicting $y$

### Regularization

Small values for the parameters $\theta_0, \theta_1, \ldots, \theta_n$ lead to:
 - Simpler hypothesis (smoother)
 - Less prone to overfitting

### Regularized Cost Function

$J(\theta) = \frac{1}{2m}[\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2]$
- *Regularization parameter* $\lambda$: Controls the tradeoff between fitting the training data (cost term) and simplifying the hypothesis (regularization term)
- *Regularization term* $\lambda \sum_{j=1}^n \theta_j^2$: the sum starts at $j=1$, so $\theta_0$ is not penalized
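A small sketch of this cost for linear regression, assuming a made-up dataset that the line $y = 2x$ fits exactly. It shows that the penalty alone raises the cost even when the training error is zero.

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = len(y)
    squared_error = sum(
        (sum(t * v for t, v in zip(theta, x_i)) - y_i) ** 2
        for x_i, y_i in zip(X, y)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta_0 is skipped
    return (squared_error + penalty) / (2 * m)

# Hypothetical data: each row is [x_0 = 1, x_1], and y = 2 * x_1 exactly.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.0, 2.0]

print(regularized_cost(theta, X, y, lam=0.0))  # 0.0: perfect fit, no penalty
print(regularized_cost(theta, X, y, lam=3.0))  # 2.0: penalty 3 * 2^2 = 12, over 2m = 6
```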

### Regularized Linear Regression

#### Gradient Descent

Originally:

$\text{Repeat } \{ \\
\space \space \space \space \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\
\space \space \space \space \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \\
\}$

Now:

$\text{Repeat } \{ \\
\space \space \space \space \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\
\space \space \space \space \theta_j := \theta_j - \alpha [(\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}) + \frac{\lambda}{m}\theta_j] \\
\}$

Note: the term in square brackets is $\frac{\partial}{\partial \theta_j}J(\theta)$ of the regularized $J(\theta)$ for $j = 1, \ldots, n$; $\theta_0$ is updated separately because it is not penalized

Equivalently:

$\text{Repeat } \{ \\
\space \space \space \space \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\
\space \space \space \space \theta_j := \theta_j(1-\alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}\\
\}$

Note: for $\alpha, \lambda, m > 0$, the factor $1-\alpha \frac{\lambda}{m}$ is less than 1 (usually only slightly), so multiplying $\theta_j$ by it shrinks the parameter's magnitude on every update
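To check that the two ways of writing the update agree, the sketch below performs one step in each form for linear regression. The data, starting parameters, `alpha`, and `lam` are arbitrary values chosen for illustration, and the function names are hypothetical.

```python
def gradient_terms(theta, X, y):
    """Unregularized terms (1/m) * sum((h - y) * x_j) with linear h = theta^T x."""
    m = len(y)
    errors = [sum(t * v for t, v in zip(theta, x_i)) - y_i for x_i, y_i in zip(X, y)]
    return [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(len(theta))]

def step_bracket_form(theta, X, y, alpha, lam):
    """theta_j := theta_j - alpha * [gradient term + (lam/m) * theta_j] for j >= 1."""
    m = len(y)
    grads = gradient_terms(theta, X, y)
    return [
        t - alpha * (g + (lam / m * t if j > 0 else 0.0))
        for j, (t, g) in enumerate(zip(theta, grads))
    ]

def step_shrink_form(theta, X, y, alpha, lam):
    """theta_j := theta_j * (1 - alpha*lam/m) - alpha * gradient term for j >= 1."""
    m = len(y)
    grads = gradient_terms(theta, X, y)
    return [
        t * (1.0 - alpha * lam / m if j > 0 else 1.0) - alpha * g
        for j, (t, g) in enumerate(zip(theta, grads))
    ]

X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.5, 1.0]

a = step_bracket_form(theta, X, y, alpha=0.1, lam=3.0)
b = step_shrink_form(theta, X, y, alpha=0.1, lam=3.0)
print(a)
print(b)  # same values up to floating-point rounding
```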

#### Normal Equation

Normal equation now becomes ($\lambda > 0$):

$\theta = (X^TX + \lambda L)^{-1}X^Ty$

where $L = \begin{bmatrix}0\\&1\\&&1\\&&&\ddots\\&&&&1\end{bmatrix}$

The new term is $\lambda$ multiplied by an $(n+1) \times (n+1)$ matrix

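Note: For $\lambda > 0$, $X^TX + \lambda L$ is invertible even when $X^TX$ is not (e.g. when $m \leq n$), so regularization removes issues related to non-invertibility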
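For a single feature plus the bias term, $X^TX + \lambda L$ is a $2 \times 2$ matrix and the system can be solved by hand. The sketch below does this with Cramer's rule. The function name and data are assumptions for illustration, and a general implementation would use a linear algebra library instead.

```python
def regularized_normal_equation_1d(xs, ys, lam):
    """Solve (X^T X + lam * L) theta = X^T y for rows [1, x] and L = diag(0, 1)."""
    m = len(xs)
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    # X^T X + lam * L = [[m, s_x], [s_x, s_xx + lam]] and X^T y = [s_y, s_xy]
    det = m * (s_xx + lam) - s_x ** 2
    if det == 0:
        raise ValueError("X^T X + lambda * L is singular")
    theta_0 = (s_y * (s_xx + lam) - s_x * s_xy) / det
    theta_1 = (m * s_xy - s_x * s_y) / det
    return theta_0, theta_1

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(regularized_normal_equation_1d(xs, ys, lam=0.0))  # unregularized fit y = 2x
print(regularized_normal_equation_1d(xs, ys, lam=1.0))  # slope shrinks toward 0

# One example and two parameters: X^T X is singular, but adding lam * L fixes it.
print(regularized_normal_equation_1d([2.0], [4.0], lam=1.0))
```

With a single training example the unregularized system has no unique solution, while any $\lambda > 0$ makes it solvable.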

### Regularized Logistic Regression

**Regularized Cost Function:** $J(\theta) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)}\log (h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$

**Regularized Gradient Descent:**

$\text{Repeat } \{ \\
\space \space \space \space \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)} \\
\space \space \space \space \theta_j := \theta_j - \alpha [(\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}) + \frac{\lambda}{m}\theta_j] \\
\}$

Note: This looks the same as regularized linear regression, but differs because the hypothesis is $h_\theta(x) = g(\theta^Tx)$ rather than $\theta^Tx$
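A single regularized update for logistic regression could be sketched as follows. The dataset, starting parameters, `alpha`, and `lam` are made-up values; the only change from the linear regression update is the sigmoid inside the error term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_step(theta, X, y, alpha, lam):
    """One simultaneous update; theta_0 gets no penalty term."""
    m = len(y)
    errors = [
        sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for x_i, y_i in zip(X, y)
    ]
    new_theta = []
    for j, t_j in enumerate(theta):
        grad_j = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        if j > 0:
            grad_j += lam / m * t_j  # derivative of (lam / 2m) * theta_j^2
        new_theta.append(t_j - alpha * grad_j)
    return new_theta

# Hypothetical training set: each row is [x_0 = 1, x_1].
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = [0.0, 1.0]

plain = regularized_step(theta, X, y, alpha=0.1, lam=0.0)
shrunk = regularized_step(theta, X, y, alpha=0.1, lam=4.0)
print(plain)
print(shrunk)  # theta_0 matches; theta_1 is lower by alpha * (lam/m) * theta_1 = 0.1
```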