# Logistic Regression
Logistic regression is used when the dependent variable (target) is **categorical**.

Starting from the linear hypothesis $h_{\Theta}(x) = \Theta^Tx$, we threshold the output:
- if $h_{\Theta}(x) \geq 0.5$, predict $y = 1$

- if $h_{\Theta}(x) < 0.5$, predict $y = 0$

Since we want $0 \leq h_{\Theta}(x) \leq 1$, we pass the linear term through a function $g$: $h_{\Theta}(x) = g(\Theta^Tx)$.

The function $g(z) = \frac{1}{1+e^{-z}}$ is called the **sigmoid function** or **logistic function**.

We can interpret $h_{\Theta}(x)$ as the estimated probability that $y=1$ on input $x$, that is, $P(y=1|x;\Theta)$.
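The hypothesis and the 0.5 threshold can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `sigmoid`, `hypothesis`, `predict` and the parameter values in `theta` are assumptions made for the example, and `x` is assumed to carry a leading bias term $x_0 = 1$.

```python
import math
from typing import Sequence

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    """h_theta(x) = g(theta^T x); read as the estimated P(y = 1 | x; theta)."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return sigmoid(z)

def predict(theta: Sequence[float], x: Sequence[float], threshold: float = 0.5) -> int:
    """Predict y = 1 when the estimated probability reaches the threshold, else y = 0."""
    return 1 if hypothesis(theta, x) >= threshold else 0

theta = [-3.0, 1.0, 1.0]  # hypothetical parameters: bias, weight of x1, weight of x2

print(hypothesis(theta, [1.0, 2.0, 2.0]))  # theta^T x = 1, so about 0.731
print(predict(theta, [1.0, 2.0, 2.0]))     # 1
print(predict(theta, [1.0, 0.5, 0.5]))     # theta^T x = -2, so 0
```

The branch on the sign of `z` is only a numerical safeguard; both branches compute the same function.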

## Decision boundary
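A decision boundary is the region of the input space in which the output label of a classifier is ambiguous; it separates the points predicted as one class from the points predicted as the other. For logistic regression with a 0.5 threshold, it is the set of points where $\Theta^Tx = 0$, since $g(0) = 0.5$.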
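As a small numeric check of the condition $\Theta^Tx = 0$ (where $g(0) = 0.5$), the sketch below uses hypothetical parameters $\Theta = (-3, 1, 1)$, which put the boundary on the line $x_1 + x_2 = 3$. The parameter values and the helper `h` are assumptions chosen for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function; fine for the small |z| used here."""
    return 1.0 / (1.0 + math.exp(-z))

theta = [-3.0, 1.0, 1.0]  # hypothetical: bias, weight of x1, weight of x2

def h(x1: float, x2: float) -> float:
    """h_theta(x) for a two-feature input with an implicit bias term x0 = 1."""
    return sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)

on_boundary = h(1.0, 2.0)  # x1 + x2 = 3, so theta^T x = 0
above = h(2.0, 2.0)        # x1 + x2 > 3, so theta^T x > 0
below = h(1.0, 1.0)        # x1 + x2 < 3, so theta^T x < 0

print(on_boundary)  # 0.5: the label is ambiguous exactly on the line
print(above > 0.5)  # True: predicted y = 1
print(below < 0.5)  # True: predicted y = 0
```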

## Cost function
Src: https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html
\+ http://neuralnetworksanddeeplearning.com/chap3.html

In linear regression we minimized the mean squared error (MSE) cost $J(\Theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\Theta}(x^i) - y^i)^2$.

This time, we cannot use the same loss function as before, since the **prediction function is non-linear**: it passes through the sigmoid. Squaring this prediction, as MSE does, results in a non-convex function with many local minima, which means gradient descent may not find the global minimum.

### Cross-Entropy
$J(\Theta) = \frac{1}{m}\sum_{i=1}^m \mathrm{Cost}(h_{\Theta}(x^i),y^i)$, where

- $\mathrm{Cost}(h_{\Theta}(x^i),y^i) = -\log(h_{\Theta}(x^i))$, when $y^i = 1$

- $\mathrm{Cost}(h_{\Theta}(x^i),y^i) = -\log(1 - h_{\Theta}(x^i))$, when $y^i = 0$

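For $y = 1$, $\mathrm{Cost} = 0$ if $h_{\Theta}(x) = 1$, but as $h_{\Theta}(x) \to 0$, $\mathrm{Cost} \to \infty$. The case $y = 0$ is symmetric: the cost grows without bound as $h_{\Theta}(x) \to 1$.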

The cost function **penalizes confident and wrong** predictions more than it rewards confident and right predictions.

### Simplified Cost Function
Because $y^i$ is always 0 or 1, the two cases combine into a single expression:

$\mathrm{Cost}(h_{\Theta}(x^i),y^i) = -y^i\log(h_{\Theta}(x^i)) -(1-y^i)\log(1 - h_{\Theta}(x^i))$
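To make the asymmetry concrete, here is a minimal sketch of the per-example cost and the averaged cost $J(\Theta)$. The function names `cost` and `J` and the clipping constant `eps` are assumptions for this example; the clipping only guards against `log(0)` and is not part of the formula.

```python
import math
from typing import Sequence

def cost(h: float, y: int) -> float:
    """Per-example cross-entropy: -y*log(h) - (1 - y)*log(1 - h)."""
    eps = 1e-15  # assumption: clip h away from 0 and 1 so log() stays finite
    h = min(max(h, eps), 1.0 - eps)
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

def J(predictions: Sequence[float], labels: Sequence[int]) -> float:
    """Average of the per-example costs over the m examples."""
    m = len(labels)
    return sum(cost(h, y) for h, y in zip(predictions, labels)) / m

print(cost(0.9, 1))  # confident and right: about 0.105
print(cost(0.1, 1))  # confident and wrong: about 2.303
print(J([0.9, 0.1], [1, 0]))  # both predictions are right: about 0.105
```

A confident wrong prediction costs roughly twenty times more than a confident right one here, and the gap keeps growing as the wrong prediction approaches 0.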

## Gradient Descent

### The algorithm
Repeat until convergence, updating all $\Theta_j$ simultaneously:

$\Theta_j := \Theta_j - \alpha\frac{\partial}{\partial \Theta_j}J(\Theta)$

### Replacing the Cost Function
$\Theta_j := \Theta_j - \frac{\alpha}{m}\sum_{i=1}^m(h_{\Theta}(x^i) - y^i)x^i_j$
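The update rule can be sketched as batch gradient descent in plain Python, with the gradient averaged over the $m$ examples (the $\frac{1}{m}$ factor inherited from $J(\Theta)$). The function name `gradient_descent`, the toy data, the learning rate and the iteration count are all assumptions chosen for illustration; a vectorized library such as NumPy would be the usual choice in practice.

```python
import math
from typing import List, Sequence

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta: Sequence[float], x: Sequence[float]) -> float:
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def gradient_descent(X: Sequence[Sequence[float]], y: Sequence[int],
                     alpha: float = 0.3, iterations: int = 5000) -> List[float]:
    """Batch gradient descent on the cross-entropy cost, starting from theta = 0."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Prediction error h_theta(x^i) - y^i for every example.
        errors = [hypothesis(theta, row) - yi for row, yi in zip(X, y)]
        # Partial derivative of J with respect to each theta_j.
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update: the new theta is built from the old one in one step.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data (hypothetical): x0 = 1 is the bias term, x1 is a single feature.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
predictions = [1 if hypothesis(theta, row) >= 0.5 else 0 for row in X]
print(theta)
print(predictions)
```

Computing the whole gradient before touching `theta` is what makes the update simultaneous; updating each $\Theta_j$ in place inside the loop over $j$ would mix old and new parameters.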

## Multiclass classification

### One vs All (One vs Rest)
Train one logistic regression classifier $h_{\Theta}^i(x)$ for each class $i$ to predict the probability that $y = i$, using the samples of class $i$ as positives and all other samples as negatives. At prediction time, run all $N$ classifiers on the input and assign it to the class whose classifier gives the highest score, $\max_i h_{\Theta}^i(x)$.
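The one-vs-rest procedure can be sketched as follows: relabel the data once per class, train a binary classifier on each relabeling, and pick the class with the highest score. The helper names (`train_binary`, `train_one_vs_rest`, `predict_one_vs_rest`), the hyperparameters and the three-class toy data are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def score(theta: Sequence[float], x: Sequence[float]) -> float:
    """Estimated probability that x belongs to the classifier's positive class."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def train_binary(X: Sequence[Sequence[float]], y: Sequence[int],
                 alpha: float = 0.2, iterations: int = 5000) -> List[float]:
    """Binary logistic regression fitted with batch gradient descent."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [score(theta, row) - yi for row, yi in zip(X, y)]
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def train_one_vs_rest(X: Sequence[Sequence[float]],
                      labels: Sequence[str]) -> Dict[str, List[float]]:
    """Train one classifier per class: that class is positive, all others negative."""
    models = {}
    for c in sorted(set(labels)):
        binary_labels = [1 if label == c else 0 for label in labels]
        models[c] = train_binary(X, binary_labels)
    return models

def predict_one_vs_rest(models: Dict[str, List[float]], x: Sequence[float]) -> str:
    """Return the class whose classifier gives the highest score."""
    return max(models, key=lambda c: score(models[c], x))

# Toy data (hypothetical): [bias, x1, x2], three well-separated clusters.
X = [[1, 0.0, 0.0], [1, 0.5, 0.5], [1, 1.0, 0.0],   # class "a"
     [1, 4.0, 0.0], [1, 4.5, 0.5], [1, 5.0, 1.0],   # class "b"
     [1, 0.0, 4.0], [1, 0.5, 4.5], [1, 1.0, 5.0]]   # class "c"
labels = ["a"] * 3 + ["b"] * 3 + ["c"] * 3

models = train_one_vs_rest(X, labels)
print([predict_one_vs_rest(models, row) for row in X])
print(predict_one_vs_rest(models, [1, 4.2, 0.3]))  # a new point near the "b" cluster
```

Taking the maximum score rather than thresholding each classifier at 0.5 guarantees that exactly one class is returned, even when several classifiers (or none) are above 0.5.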
