# Classification 

### Basic Intuition

As opposed to regression where we predicted some real value y, in classification we need to categorize an observation into classes. 

E.g. we want to predict whether a review is good or bad, so we have two classes: positive and negative. Let's say we have two features: the counts of the words "awesome" and "awful". Our prediction can be w1*(no. of awesome) + w2*(no. of awful). If this weighted sum is > 0 we say the review is positive; if it is < 0 we say it is negative. We can find the w's that suit our data best. This gives us a linear classifier. 
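The rule above can be sketched in a few lines. This is only an illustration: the weights are hand-picked (hypothetical) rather than learned, and a score of exactly 0, which the rule above leaves open, is treated as negative here.

```python
def score(review, weights):
    """Weighted sum of word counts: w1*(no. of awesome) + w2*(no. of awful)."""
    words = review.lower().split()
    return sum(w * words.count(word) for word, w in weights.items())


def classify(review, weights):
    """Positive if the score is > 0, otherwise negative (ties count as negative)."""
    return "positive" if score(review, weights) > 0 else "negative"


# Hypothetical hand-picked weights; in practice they are learned from data.
weights = {"awesome": 1.0, "awful": -1.5}

print(classify("awesome food awesome service", weights))
print(classify("awful food but awesome view", weights))
```

The second review contains one "awesome" and one "awful", so its score is 1.0 - 1.5 = -0.5 and it lands in the negative class.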

But ideally we would also want to know how confident we are in our prediction, e.g. we are 55% confident that it was a positive review. So rather than a hard label in {0,1}, we can output a probability, which is always bounded in \[0,1\].


So we have our weighted sum, which can take values from -$\infty$ to +$\infty$, and we want a link function that squishes it to \[0,1\]. One such link function is the sigmoid.  

For a given fit, with $\hat y_i$ denoting the predicted probability of class 1, 
$$ P(y_i = 1|x_i, w) = \hat y_i = \frac{1}{1+e^{-w^{T}x_i}} $$
$$ P(y_i = 0|x_i, w) = 1 - \hat y_i = \frac{e^{-w^{T}x_i}}{1+e^{-w^{T}x_i}}$$

$$
 P(y_i|x_i, w) = 
\begin{cases} 
      \frac{1}{1+e^{-w^{T}x_i}} & y_i=1 \\
      1- \frac{1}{1+e^{-w^{T}x_i}} & y_i=0 
   \end{cases}
$$

To put these two cases together in a single equation, we can rewrite it as 

$$ P(y_i|x_i, w) = (\hat y_i)^{y_i}(1-\hat y_i)^{(1-y_i)} $$
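A minimal sketch of the sigmoid and the combined probability formula, using only the standard library. The weights and feature counts are made-up values for illustration.

```python
import math


def sigmoid(z):
    """Squish any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def prob(y, x, w):
    """P(y | x, w) = y_hat**y * (1 - y_hat)**(1 - y), for y in {0, 1}."""
    z = sum(wj * xj for wj, xj in zip(w, x))  # weighted sum w^T x
    y_hat = sigmoid(z)
    return (y_hat ** y) * ((1.0 - y_hat) ** (1 - y))


w = [1.0, -1.5]  # illustrative weights for (awesome, awful)
x = [2, 1]       # 2 x "awesome", 1 x "awful"

p_pos = prob(1, x, w)
p_neg = prob(0, x, w)
print(p_pos, p_neg, p_pos + p_neg)
```

Setting y = 1 makes the second factor equal to 1, and setting y = 0 does the same to the first factor, so the single formula picks out the right case. The two probabilities always add up to 1.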



### Probability vs Likelihood

**P(data | distribution) vs L(distribution | data)** 

* E.g. (probability): Given the distribution ($\mu _1, \sigma _1 $) of Indian weights, what is the probability that Ritika weighs 60 kg? 
    The distribution of Indian weights stays fixed; we get different probabilities for different observations (weights).
* E.g. (likelihood): Given that Ritika weighs 60 kg, what is the likelihood that Indian weights follow a certain distribution ($\mu _2, \sigma _2 $)? 
    Our observation stays the same; we get different likelihoods for different distributions ($\mu _x, \sigma _x $). 


In our classification formulation, using the sigmoid as our link function: 

* If w and x are known, we get different 'probabilities' for the different outputs, i.e. y=0 and y=1.
* If the observations x and the outputs y are known, we get different 'likelihoods' for different w's. 
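The distinction can be checked numerically. In this sketch (single feature, made-up numbers) the same function is read in two ways: once with w and x fixed and y varying, once with (x, y) fixed and w varying.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def prob(y, x, w):
    """P(y | x, w) for a single feature x and a single weight w."""
    y_hat = sigmoid(w * x)
    return y_hat if y == 1 else 1.0 - y_hat


# Probability: w and x are fixed, the outcome y varies. The values sum to 1.
w, x = 2.0, 0.5
probs = [prob(y, x, w) for y in (0, 1)]

# Likelihood: the observation (x, y) is fixed, the candidate weight varies.
# These values do not have to sum to 1 across candidate weights.
x_obs, y_obs = 0.5, 1
likelihoods = {w_c: prob(y_obs, x_obs, w_c) for w_c in (-2.0, 0.0, 2.0)}

print(probs, sum(probs))
print(likelihoods, sum(likelihoods.values()))
```

For this observation the candidate w = 2.0 has the highest likelihood of the three, which is the sense in which it "explains" the data best.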

## Formulation 

We have N known, independent observations. Now we want to find a function (i.e. the right values of the w's) that fits these observations best (i.e. for a given $x_i$ we want to be as close as possible to 100% confident of class 1 if $y_i = 1$, and of class 0 if $y_i = 0$).

Thus, we want to maximize the likelihood of the w's given x and y. 


For a single observation, the likelihood of w is the probability the model assigns to the observed label:

$$ l(w|x_i, y_i) = P(y_i|x_i, w) $$ 

Since the N observations are independent (obs1 & obs2 ... & obsN: multiply the probabilities), the likelihood over all of them is 

$$ l(w|x, y) = \prod_{i=1} ^{N}  l(w|x_i, y_i) $$  

We want to maximize this quantity. For convenience, since log is monotonic, we can equivalently maximize the log likelihood. 

$$ \log l(w|x, y) = \log\left(\prod_{i=1} ^{N}  l(w|x_i, y_i)\right) $$  
$$ ll(w|x, y) = \sum \limits_{i=1} ^{N}  \log l(w|x_i, y_i) $$  

(using log(ab) = log a + log b)
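Besides convenience, there is a practical reason to prefer the sum of logs: a product of many probabilities underflows in floating point. A small sketch with made-up per-observation likelihoods (`math.prod` needs Python 3.8+):

```python
import math

# Hypothetical per-observation likelihoods, each a probability in (0, 1).
per_point = [0.9, 0.8, 0.95, 0.7]

likelihood = math.prod(per_point)                      # product form
log_likelihood = sum(math.log(p) for p in per_point)   # sum-of-logs form
print(math.log(likelihood), log_likelihood)            # same value, two routes

# With many observations the product underflows to 0.0; the log sum does not.
many = [0.5] * 2000
product_many = math.prod(many)                         # 0.5**2000 is below float range
log_sum_many = sum(math.log(p) for p in many)          # 2000 * log(0.5)
print(product_many, log_sum_many)
```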




$$ ll(w) = \sum \limits_{i=1} ^{N} \log\left[(\hat y_i)^{y_i}(1-\hat y_i)^{(1-y_i)}\right] $$

(The term inside the log looks exactly like the probability equation, but here w is the variable. There w was fixed and we were finding the probability of y.)

$$ ll(w) = \sum \limits_{i=1} ^{N} \left[ y_i\log\left(\frac{1}{1+e^{-w^{T}x_i}}\right)+(1-y_i)\log\left(\frac{e^{-w^{T}x_i}}{1+e^{-w^{T}x_i}}\right) \right] $$
$$ ll(w) = \sum \limits_{i=1} ^{N} \left[ -y_i\log(1+e^{-w^{T}x_i})+(1-y_i)\log(e^{-w^{T}x_i})-(1-y_i)\log(1+e^{-w^{T}x_i}) \right] $$
$$ ll(w) = \sum \limits_{i=1} ^{N} \left[ -(1-y_i)w^{T}x_i-\log(1+e^{-w^{T}x_i}) \right] $$


For one data point (x, y), 
$$ \frac{\partial ll}{\partial w} = -(1-y)x-\frac{e^{-w^{T}x}(-x)}{1+e^{-w^{T}x}}$$
$$ \frac{\partial ll}{\partial w} = [-(1-y)x+(1-\hat y)x]$$
$$ \frac{\partial ll}{\partial w} = [x(y-\hat y)]$$

Summing over all observations, the matrix form (with one observation per row of X) becomes
$$ \frac{\partial ll}{\partial w} = X^{T}.(Y-\hat Y)$$
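A quick way to trust a derivation like this is to compare the analytic gradient with a finite-difference estimate. The sketch below does that with plain lists on made-up toy data; with NumPy arrays the analytic part would typically be written as `X.T @ (Y - Y_hat)`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, X, Y):
    """ll(w) = sum_i [ y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i) ]."""
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total += y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return total


def gradient(w, X, Y):
    """Analytic gradient X^T (Y - Y_hat), accumulated one observation at a time."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += xj * (y - y_hat)
    return grad


# Toy data (made up); the first column is a constant 1 for the intercept.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
Y = [1, 0, 1, 0]
w = [0.1, -0.2]

analytic = gradient(w, X, Y)

# Central finite differences: (ll(w + eps*e_j) - ll(w - eps*e_j)) / (2*eps).
eps = 1e-6
numeric = []
for j in range(len(w)):
    w_plus = list(w)
    w_plus[j] += eps
    w_minus = list(w)
    w_minus[j] -= eps
    diff = log_likelihood(w_plus, X, Y) - log_likelihood(w_minus, X, Y)
    numeric.append(diff / (2 * eps))

print(analytic)
print(numeric)
```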

 

Since we are maximizing the likelihood, we repeatedly move the weights a small step in the direction of the gradient. This is called gradient ascent, and $\alpha$ is the learning rate.
$$ w = w + \alpha.\frac{\partial ll}{\partial w} $$
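A minimal gradient ascent loop, assuming a fixed learning rate and a fixed number of epochs (both arbitrary choices here). The toy data is made up and deliberately not separable by the single feature, so the weights settle instead of growing forever.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def log_likelihood(w, X, Y):
    return sum(y * math.log(predict(w, x)) + (1 - y) * math.log(1.0 - predict(w, x))
               for x, y in zip(X, Y))


def gradient_ascent(X, Y, alpha=0.1, epochs=200):
    w = [0.0] * len(X[0])
    history = [log_likelihood(w, X, Y)]
    for _ in range(epochs):
        # Gradient of the log likelihood: sum_i x_i * (y_i - y_hat_i).
        grad = [sum(x[j] * (y - predict(w, x)) for x, y in zip(X, Y))
                for j in range(len(w))]
        # Ascent step: move the weights *along* the gradient.
        w = [wj + alpha * gj for wj, gj in zip(w, grad)]
        history.append(log_likelihood(w, X, Y))
    return w, history


# Toy data (made up): intercept column plus one feature, classes overlap.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
Y = [0, 0, 1, 0, 1, 1]

w, history = gradient_ascent(X, Y)
print(w)
print(history[0], history[-1])
```

The `history` list records the log likelihood after every epoch, which is handy for checking that the chosen learning rate actually makes it go up.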


#### Note
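Mean squared error isn't used as the loss in logistic regression [because the resulting formulation is non-convex](https://towardsdatascience.com/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c). The log likelihood used above, by contrast, is concave in w, so gradient ascent does not get stuck in local optima.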
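The non-convexity can be seen on a single made-up observation (x = 1, y = 1). A convex function has non-negative curvature everywhere, so one point with negative curvature is enough to rule it out. The sketch estimates the curvature with a second difference; the step size `h` is an arbitrary choice.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_loss(w, x=1.0, y=1.0):
    """Squared error between the label and the sigmoid output."""
    return (y - sigmoid(w * x)) ** 2


def log_loss(w, x=1.0, y=1.0):
    """Negative log likelihood of one observation."""
    y_hat = sigmoid(w * x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def second_difference(f, w, h=1e-3):
    """Numerical estimate of the second derivative of f at w."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)


# At w = -3 the model is confidently wrong about this observation.
curv_mse = second_difference(mse_loss, -3.0)
curv_log = second_difference(log_loss, -3.0)
print(curv_mse, curv_log)
```

At this point the MSE curvature comes out negative while the log loss curvature stays positive. The MSE surface is also nearly flat there, so a confidently wrong prediction produces only a tiny corrective gradient.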

## Overfitting in classification is twice as bad 

### Overfit classifier -> overly confident predictions

[Visualizing_Overfitting](detailed_notebooks/Visualize_Overfitting_in_LogReg.ipynb)

<br>
<br>

Maximum likelihood estimation prefers the model that is most certain. Thus, even for borderline data points, where ideally we should have low confidence, MLE pushes the predicted probability towards either 0 or 1! This can cause severe overfitting, which shows up in the traits below: 

* Coefficients can go to infinity for linearly separable data. (If the data is linearly separable, multiplying all the weights by a large constant leaves the decision boundary unchanged, e.g. w1x + w0 = 0 and 1000w1x + 1000w0 = 0 describe the same boundary. But the larger weights make the sigmoid steeper and the likelihood higher, so MLE keeps increasing the magnitude of the weights.)
* Narrow range where the model is uncertain. So we are really confident in places where we can easily go wrong. 
* Overly complex decision boundaries. (If we have a lot of features, we will almost always find some hyperplane that makes our data linearly separable, and the coefficients can then go to infinity in this high-dimensional space.) 
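The first trait is easy to check numerically. In this sketch the toy data (made up) is perfectly separated at x = 0, and the same boundary is scored with weights scaled by 1, 10 and 100. The log likelihood keeps increasing towards 0, so there is no finite best weight.

```python
import math


def log_likelihood(w0, w1, xs, ys):
    """Log likelihood of a one-feature logistic model, in a numerically safe form."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = w0 + w1 * x
        # log(sigmoid(z)) = -log(1 + e^-z) and log(1 - sigmoid(z)) = -log(1 + e^z)
        total += -math.log1p(math.exp(-z)) if y == 1 else -math.log1p(math.exp(z))
    return total


# Linearly separable toy data: every negative x is class 0, every positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

scales = (1, 10, 100)
lls = [log_likelihood(0.0, 1.0 * s, xs, ys) for s in scales]
for s, ll in zip(scales, lls):
    print(s, ll)
```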


Simple logistic regression using one feature of the iris data shows the following:
<img src="helper/LC3.JPG" alt="Drawing" style="width: 800px;"/> 

For the same iris dataset, if we use polynomial features, we are even more prone to overfitting. 

<img src="helper/LC4.JPG" alt="Drawing" style="width: 1200px;"/> 

#### Observation 

In fig 1, without any regularization, the sigmoid becomes very steep, so our region of uncertainty is really narrow. Also, the weights keep increasing as we increase the number of epochs. 

In fig 2, as the degree of the polynomial increases, our decision boundary gets overly complex. 

## Tackle overfitting with L2 regularization 

<br>
To avoid overfitting we need to include the magnitude of the weights in our objective in addition to the error. In L2 (Ridge) regularization we use the squared L2 norm, as given below:
$$ ll_{reg}(w) = \left[\sum \limits_{i=1} ^{N}  \log l(w|x_i, y_i)\right] - \frac{\lambda }{2}||w||^2_2 $$

(Since we are maximizing our likelihood, we **reduce** it by the regularization term.)

Thus the gradient ascent update becomes (where $ll$ is the unregularized log likelihood):
$$ w = w + \alpha(\frac{\partial ll}{\partial w} -\lambda w)$$ 
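A sketch of this update on separable toy data (made up), with a single weight and no intercept to keep it short. Without the penalty the weight keeps creeping up with more epochs; with the penalty it settles at a finite value. The learning rate, epoch counts and $\lambda$ are arbitrary choices.

```python
import math


def sigmoid(z):
    # Split by sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit(xs, ys, lam, alpha=0.1, epochs=5000):
    """Gradient ascent on the L2-regularized log likelihood (one weight)."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(x * (y - sigmoid(w * x)) for x, y in zip(xs, ys))
        w = w + alpha * (grad - lam * w)   # the -lam*w term pulls w back towards 0
    return w


# Linearly separable toy data.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_plain_short = fit(xs, ys, lam=0.0, epochs=500)
w_plain = fit(xs, ys, lam=0.0)
w_ridge = fit(xs, ys, lam=1.0)
print(w_plain_short, w_plain, w_ridge)
```

At the regularized solution the two forces balance: the likelihood gradient equals $\lambda w$, which is why the weight stops growing.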

Note: in sklearn, C = 1/$\lambda$ (the inverse of the regularization strength), so a smaller C means stronger regularization.



<br>
<br>

<img src="helper/LC2.JPG" alt="Drawing" style="width: 800px;"/> 



## Obtaining sparsity with L1 regularization

<br>

[Detailed_code](detailed_notebooks/L1_Reg_Feature_importance.ipynb)
<img src="helper/LC1.PNG" alt="Drawing" style="width: 800px;"/> 

* Case 1 - No reg: All coefficients are non-zero. 
* Case 2 - With L1 reg: Knocks down unimportant features. Plotting the two remaining features, we observe that the classes look linearly separable using them. This seems to be the right fit. 
* Case 3 - With heavy L1 reg: L1 reg retains only one feature (the most important one), as seen from the parallel plot. However, our test predictions are bad, so the model is now underfitting (high bias).
* Case 4 - With very heavy L1 reg: All coefficients are zero. 
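The same progression can be reproduced on a tiny made-up dataset. This sketch is not the method of the linked notebook: it uses proximal gradient ascent (a gradient step followed by soft-thresholding), which is one common way to get exact zeros with an L1 penalty. The data, the $\lambda$ values and the step size are all assumptions chosen for illustration; the second feature is small-scale noise.

```python
import math


def sigmoid(z):
    # Split by sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Shrink v towards 0 by t; anything within [-t, t] becomes exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1(X, Y, lam, alpha=0.1, epochs=2000):
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        y_hat = [sigmoid(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
        grad = [sum(x[j] * (y - p) for x, y, p in zip(X, Y, y_hat))
                for j in range(len(w))]
        # Gradient ascent step on the log likelihood, then the L1 shrinkage.
        w = [soft_threshold(wj + alpha * gj, alpha * lam)
             for wj, gj in zip(w, grad)]
    return w


# Made-up data: feature 1 carries the signal, feature 2 is small-scale noise.
X = [[-2.0, 0.1], [-1.0, -0.1], [-0.5, 0.1], [0.5, -0.1], [1.0, 0.1], [2.0, 0.1]]
Y = [0, 0, 1, 0, 1, 1]

w_none = fit_l1(X, Y, lam=0.0)    # no regularization
w_l1 = fit_l1(X, Y, lam=1.0)      # L1 regularization
w_heavy = fit_l1(X, Y, lam=3.0)   # very heavy L1 regularization
print(w_none, w_l1, w_heavy)
```

With no penalty both weights are non-zero, with a moderate penalty only the informative feature survives, and with a very heavy penalty every weight is exactly zero.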