# Classification

Classification is a supervised learning task: given the input features, predict which class (category) the input belongs to, out of a finite set of possible outputs. One algorithm for this is logistic regression.

## Logistic regression
Logistic regression is a supervised learning algorithm used for **binary classification** problems. For a given input, it outputs a value between 0 and 1 that indicates whether the input belongs to a particular class or not.
$$
f_{\vec{w},b}(\vec{x}) = g(z) = \frac{1}{1 + e^{-z}}
$$
where $z = \vec{w} \cdot \vec{x} + b$ and $g(z) = \frac{1}{1 + e^{-z}}$ is called the **sigmoid (logistic) function**.
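
As a rough illustration of the model equation, the sketch below computes $z = \vec{w} \cdot \vec{x} + b$ and passes it through the sigmoid. The function name `predict_proba` and the values of `w`, `b`, and `x` are made up for this example and are not taken from any dataset.

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    # z = w . x + b (dot product of weights and features, plus bias)
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input, chosen only for illustration
w = [1.5, -0.5]
b = -1.0
x = [2.0, 1.0]

p = predict_proba(x, w, b)  # here z = 3.0 - 0.5 - 1.0 = 1.5
print(f"model output: {p:.4f}")
```

The output is always strictly between 0 and 1, whatever values `w`, `b`, and `x` take.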

### Sigmoid function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
A mathematical function that, for any real value $z$, outputs a value strictly between $0$ and $1$:
1.  When $z = 0$, $\sigma(z) = 0.5$
2.  As $z$ approaches positive infinity, $\sigma(z)$ approaches $1$
3.  As $z$ approaches negative infinity, $\sigma(z)$ approaches $0$<br>
Plotting the sigmoid function gives an S-shaped curve that crosses the y-axis at 0.5.
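
The three properties listed above can be checked numerically. This is a minimal sketch using only the standard library; the branch on the sign of `z` is an implementation choice made here to avoid overflow in `math.exp` for very negative inputs, not something the formula itself requires.

```python
import math

def sigmoid(z):
    # Same function as 1 / (1 + e^(-z)), split into two branches so that
    # exp() is only ever called on a non-positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Sample points across the S-curve
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}   sigmoid(z) = {sigmoid(z):.5f}")
```

The printed values rise monotonically from near 0 to near 1 and pass through exactly 0.5 at `z = 0`.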

## Decision boundary
The decision boundary is the set of inputs at which the sigmoid output equals a chosen threshold $c$.
$$\sigma(z) = c$$
When $\sigma(z) \geq c$, the model predicts that the input belongs to class 1. When $\sigma(z) < c$, the model predicts that the input belongs to class 0.<br> A common choice is $c = 0.5$, which corresponds to $z = \vec{w} \cdot \vec{x} + b = 0$, but the threshold doesn't necessarily need to be 0.5.
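
The thresholding rule can be sketched as follows. The helper names `predict_class` and `boundary_z` are hypothetical. `boundary_z` inverts the sigmoid (the logit function) to find the value of $z$ at which the output equals the threshold.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    # Class 1 when the sigmoid output reaches the threshold, else class 0
    return 1 if sigmoid(z) >= threshold else 0

def boundary_z(threshold):
    # Value of z where sigmoid(z) == threshold (inverse of the sigmoid)
    return math.log(threshold / (1.0 - threshold))

print(predict_class(0.3))                 # sigmoid(0.3) is above 0.5
print(predict_class(0.3, threshold=0.7))  # same z, stricter threshold
print(boundary_z(0.5))                    # boundary sits at z = 0
print(boundary_z(0.7))                    # boundary shifts to a positive z
```

Raising the threshold moves the boundary toward larger $z$, so fewer inputs are assigned to class 1.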


### Cost function ($J(f(x), y)$)

For logistic regression, MSE (Mean Squared Error) is a poor choice of cost function. Applying MSE to the sigmoid output gives a non-convex, wavy cost curve with multiple local minima, so the model can get stuck in any one of them and fail to converge to the optimal values of its parameters. Logistic regression therefore needs a cost function suited to its 0-or-1 targets, which is as follows:
$$
J(f_{\vec{w}, b}(\vec{x})) = \frac{1}{m}\sum_{i=1}^{m}L(f_{\vec{w}, b}(\vec{x}^{i}), y^{i})
$$
where the loss $L$ is defined as follows,
$$
L(f_{\vec{w}, b}(\vec{x}^{i}), y^{i}) = \begin{cases} -\log(f_{\vec{w}, b}(\vec{x}^{i})), & \text{for } y^{i} = 1 \\ -\log(1 - f_{\vec{w}, b}(\vec{x}^{i})), & \text{for } y^{i} = 0 \end{cases}
$$

1.  **Case 1 ($y = 1$):** Plotting the loss $L(f(x^{i}), y^{i})$ against $f(x^{i})$ gives the curve $-\log(f(x^{i}))$, which touches the x-axis at 1. Since the model outputs values only between 0 and 1, consider the portion of the curve from $x = 0$ to $x = 1$. Given $y = 1$, when the model predicts a value close to 1 ($\hat{y} \approx 1$), the loss is close to zero. As the prediction moves away from 1 toward 0 along the x-axis, the loss increases without bound.
2.  **Case 2 ($y = 0$):** Plotting the loss $L(f(x^{i}), y^{i})$ against $f(x^{i})$ gives the curve $-\log(1 - f(x^{i}))$, which touches the x-axis at the origin. Since the model outputs values only between 0 and 1, consider the portion of the curve from $x = 0$ to $x = 1$. Given $y = 0$, when the model predicts a value close to 0 ($\hat{y} \approx 0$), the loss is close to zero. As the prediction moves away from 0 toward 1 along the x-axis, the loss increases without bound.

Averaged over the training set, this loss gives a convex cost curve (cost vs. model parameters), which lets the model converge to the global minimum.
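
To make the two cases concrete, the sketch below evaluates the piecewise loss for a few predicted values. The function name `loss` and the sample predictions are chosen only for illustration.

```python
import math

def loss(f, y):
    # Piecewise log loss for a single example:
    #   y == 1 -> -log(f)
    #   y == 0 -> -log(1 - f)
    # f is the model output and must lie strictly between 0 and 1.
    if y == 1:
        return -math.log(f)
    return -math.log(1.0 - f)

for f in (0.1, 0.5, 0.9):
    print(f"f = {f:.1f}   loss if y=1: {loss(f, 1):.4f}   loss if y=0: {loss(f, 0):.4f}")
```

For `y = 1` the loss shrinks as `f` approaches 1, and for `y = 0` it shrinks as `f` approaches 0. A confident wrong prediction is penalized much more heavily than an uncertain one.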

### Simplified cost function

To make gradient descent easier to implement, the loss $L$ given above can be rewritten as a single expression:
$$
L(f_{\vec{w}, b}(\vec{x}^{i}), y^{i}) = -y^{i}\log(f_{\vec{w}, b}(\vec{x}^{i})) - (1 - y^{i})\log(1 - f_{\vec{w}, b}(\vec{x}^{i}))
$$
When:
1.  $y^{i} = 1$: Substituting $y^{i} = 1$ in the above equation gives $L(f_{\vec{w}, b}(\vec{x}^{i}), y^{i}) = -\log(f_{\vec{w}, b}(\vec{x}^{i}))$
2.  $y^{i} = 0$: Substituting $y^{i} = 0$ in the above equation gives $L(f_{\vec{w}, b}(\vec{x}^{i}), y^{i}) = -\log(1 - f_{\vec{w}, b}(\vec{x}^{i}))$<br>
Plugging the above loss function into the cost function, we get:
$$
J(f_{\vec{w}, b}(\vec{x})) = -\frac{1}{m}\sum_{i=1}^{m}[y^{i}\log(f_{\vec{w}, b}(\vec{x}^{i})) + (1 - y^{i})\log(1 - f_{\vec{w}, b}(\vec{x}^{i}))]
$$
This is the cost function of logistic regression.
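
A minimal sketch of this cost function in plain Python is shown below, averaging the single-expression loss over $m$ training examples. The function name `cost`, the toy dataset `X`, `y`, and the parameter values are all assumptions made for illustration. The code also assumes the model output never reaches exactly 0 or 1, since `math.log(0)` would raise an error.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(X, y, w, b):
    # J = -(1/m) * sum( y*log(f) + (1 - y)*log(1 - f) )
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        f = sigmoid(z)
        total += y_i * math.log(f) + (1 - y_i) * math.log(1.0 - f)
    return -total / m

# Toy data (hypothetical): two features per example, binary labels
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]

untrained = cost(X, y, [0.0, 0.0], 0.0)  # every prediction is 0.5
better = cost(X, y, [1.0, 1.0], -3.0)    # parameters that separate the classes
print(f"cost with zero parameters: {untrained:.4f}")
print(f"cost with w=[1, 1], b=-3:  {better:.4f}")
```

With all parameters at zero, every prediction is 0.5 and the cost equals $\log 2$. Parameters that separate the two classes give a lower cost, which is what gradient descent exploits.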