# Linear Regression

## Assumptions of Linear Regression 

###  Linearity 
* y is linearly related to x 


### Homoscedasticity
* The variance of the residuals is the same for every predicted value.
* $y = w_{0} + w_{1}x + \epsilon$

$\epsilon$ is equally likely to fall above or below the fitted line.
![](./helper/1_2.JPG)  


### Independence 
* Observations are independent of each other. 
E.g. we can't use LR on time series, because each observation may depend on past values.


### Normality 
* The residuals are normally distributed.



## LR Formulation

$$ \hat y = {\theta}^{T}.X $$


$\hat y$ is the prediction,

$\theta$ is the vector of feature weights,

and $X$ is the array of features.



$y = h(\theta.x) + \epsilon$, where $\epsilon$ is the irreducible error.

To train, we need a cost function that measures the gap $y - \hat y$. Our loss function is RSS (residual sum of squares). We can also use the mean squared error (MSE), which is RSS averaged over the observations.

$$ RSS = \sum({y- \hat y})^2 $$ 

$$ MSE = \frac{1}{m}\sum({y- \hat y})^2 $$    where m is total number of observations 



In matrix form:
$$ \text{Loss Function} = \frac{1}{2m}(Y-\hat Y)^T(Y-\hat Y) $$
Note: the denominator may or may not include the 2, depending on the formulation (there is no consensus).
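The three loss variants above differ only by a constant factor. A minimal sketch with NumPy, using made-up toy values for `y` and `y_hat` (the function names are illustrative, not from the linked notebooks):

```python
import numpy as np

def rss(y, y_hat):
    """Residual sum of squares: sum of (y - y_hat)^2."""
    r = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(r @ r)

def mse(y, y_hat):
    """Mean squared error: RSS divided by the number of observations m."""
    return rss(y, y_hat) / len(y)

def half_mse(y, y_hat):
    """Matrix-form loss with the optional factor of 2 in the denominator."""
    return rss(y, y_hat) / (2 * len(y))

# Toy values, purely for illustration
y = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]
print(rss(y, y_hat), mse(y, y_hat), half_mse(y, y_hat))
```

Since all three are positive multiples of each other, they are minimized by the same weights; the scaling only affects the size of the gradient.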

## Solving LR

### Normal/Closed Form Solution

[Derivation and limitations](./LinearRegression/LR_OLS.ipynb)


Differentiating loss function and equating to zero, we get

$$ \theta = (X^{T}X)^{-1}X^{T}Y $$ 

Note: The inverse exists when $X$ has full column rank, i.e. the number of independent observations is at least the number of features, and the features are linearly independent of each other. 


Adv:
* Direct solution. No hyperparameter tuning required.

Disadv:
Computational complexity:
* Inverting $X^T X$ (an $n \times n$ matrix, where $n$ is the number of features) costs between $O(n^{2.4})$ and $O(n^3)$, depending on the implementation. For example, if you double the number of features, the time increases by a factor of $2^{2.4}$ to $2^{3}$. Forming $X^T X$ in the first place also gets more expensive as the number of observations $m$ grows.
* For the pseudo-inverse, the SVD-based approach has a complexity of about $O(n^2)$ in the number of features.
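As a rough illustration of the closed form, the sketch below fits synthetic data (the data, seed, and `true_theta` are assumptions made up for this example) using both the normal equation and the pseudo-inverse. It assumes `X` is stored with one observation per row.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3                                   # observations, features
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # first column = intercept
true_theta = np.array([1.0, 2.0, -3.0, 0.5])    # made-up ground truth
y = X @ true_theta + rng.normal(scale=0.1, size=m)

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system avoids forming the inverse explicitly.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudo-inverse route (SVD based); also works when X^T X is singular.
theta_pinv = np.linalg.pinv(X) @ y

print(theta_normal)
print(theta_pinv)
```

Both routes should agree closely here because `X` has full column rank; they differ only when $X^TX$ is singular or badly conditioned.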
 
### Gradient Descent

[Derivation](./LinearRegression/LR_Gradient_Descent.ipynb)



Repeat while $\|\nabla RSS\| > \epsilon$ (here $\epsilon$ is a small convergence tolerance, not the noise term):
$$ \theta = \theta + \frac{\alpha}{m}(Y-\hat Y)X^{T} $$

**Interpretation of the formula:**


E.g. house price vs. number of rooms. We start with a small weight that underpredicts, so $Y-\hat Y$ is positive on average. What will decrease the residual? Increasing $\hat Y$, and for that the weight should be higher. 

So, positive error --> increase weights (hence the + sign).

For the weight update, we multiply the residual by the respective feature (intuition: a feature with a larger magnitude has more influence on the prediction, so its weight receives a proportionally larger update).
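The update rule can be sketched as follows. This is a minimal example on synthetic data, not the linked notebook's implementation. `X` is stored with one observation per row, so the product of residuals and features appears as `X.T @ residual`. The learning rate, the tolerance, and stopping on the size of the step (which is proportional to the gradient) are assumptions chosen for this toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # intercept + one feature
y = X @ np.array([4.0, 3.0]) + rng.normal(scale=0.1, size=m)

alpha = 0.1          # learning rate (needs tuning in general)
tolerance = 1e-8     # stop once the update step becomes this small
theta = np.zeros(X.shape[1])

for _ in range(10_000):
    residual = y - X @ theta                  # Y - Y_hat
    step = (alpha / m) * (X.T @ residual)     # positive error -> increase weights
    theta = theta + step
    if np.linalg.norm(step) < tolerance:
        break

# Reference solution from least squares, for comparison
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta, theta_lstsq)
```

With a learning rate that is too large the iterates diverge, and with one that is too small the loop needs many more iterations, which is the tuning cost noted in the disadvantages.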


* Adv: 
Well suited when we have a large number of observations or features.
* Disadv: 
The learning rate needs tuning.


[Hands On for GD and OLS vs Sklearn](./LinearRegression/LR_Hands_On_GD_OLS.ipynb)

## Interpreting LR Results

A particular coefficient $w_{i}$ can be interpreted as the change in y for a unit change in $x_{i}$, with all other x held constant. 
E.g. in house price prediction, we may have square feet, no. of rooms, etc. as features. The coefficient for no. of rooms may be negative. This can be counterintuitive: we expect a higher price for a higher number of rooms. But since we also have square feet as a feature, if we hold the square footage constant and add more rooms, such a house with many tiny rooms may have a lower price. Thus, the coefficient of no. of rooms might be negative. If we built our model with only no. of rooms, it may have a positive coefficient. Thus, coefficients don't make sense in silos, only when seen together with the others. 
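The sign flip described above can be reproduced on synthetic data. In this sketch the data-generating numbers (price per square foot, the penalty per extra room, the noise levels) are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_houses = 200
sqft = rng.uniform(500, 3000, size=n_houses)
rooms = sqft / 500 + rng.normal(scale=0.5, size=n_houses)   # bigger houses have more rooms
# Made-up price model: more area helps, more rooms at fixed area hurts
price = 100 * sqft - 5000 * rooms + rng.normal(scale=1000, size=n_houses)

def fit(features, target):
    """Least squares fit; returns [intercept, coefficient_1, ...]."""
    A = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef

rooms_only = fit([rooms], price)
both = fit([sqft, rooms], price)
print("rooms coefficient, rooms only:", rooms_only[1])
print("rooms coefficient, with sqft: ", both[2])
```

With rooms as the only feature, its coefficient also absorbs the effect of area and comes out positive. Once square feet is in the model, the rooms coefficient reflects "more rooms at the same area" and turns negative.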

There might be some points in the dataset which do not follow the general trend and can influence our predictions. Mainly: 


### High Leverage Points 
Extreme values of x where there are no other observations. They heavily affect the least squares line, because the centre of mass of x is strongly influenced by that point. If a high leverage point follows the trend of the other data, it may not cause a big problem. But if it doesn't, it can heavily influence the resulting fit. 

### Influential Observation
Observations whose removal changes the fit significantly. High leverage points are one candidate. Observations with a typical x value but a strongly outlying y are another.
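A small sketch of both cases for a high leverage point, using a toy dataset invented for this example (ten points on the line y = 2x, plus one extra point far out at x = 30):

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = 2.0 * x                                   # clean data on the line y = 2x
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# High leverage point that does NOT follow the trend
x_lev = np.append(x, 30.0)
y_lev = np.append(y, 0.0)
slope_lev, intercept_lev = np.polyfit(x_lev, y_lev, deg=1)

# High leverage point that DOES follow the trend
x_ok = np.append(x, 30.0)
y_ok = np.append(y, 60.0)
slope_ok, intercept_ok = np.polyfit(x_ok, y_ok, deg=1)

print(slope_clean, slope_lev, slope_ok)
```

A single off-trend point at an extreme x is enough to pull the slope from 2 to below zero, while the on-trend point at the same x leaves the fit unchanged.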

[Influential points hands on](./LinearRegression/LR_High_Leverage_Points.ipynb)

# Multiple Regression


In simple linear regression, we had $y = wx$. We don't need to be restricted to only x; we can derive features from our inputs. So this can be written as  

$$ \hat y = h(\theta.x)= {\theta}^{T}.X $$

(Each feature may or may not depend on all the x values.)


### Time Series 

E.g. in time series data, to model the seasonality we can add a sin(x) term. $\phi$ is the phase, which controls where the seasonal cycle starts. We can keep adding multiple features derived from x. (**Doubt:** so the assumption that observations are independent doesn't need to hold now?)

$y = w_1 x + w_2 \sin(2\pi x + \phi)$
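One practical wrinkle: the model is not linear in $\phi$. Using the identity $\sin(a + \phi) = \cos\phi \sin a + \sin\phi \cos a$, the seasonal term can be rewritten with two linear weights on $\sin(2\pi x)$ and $\cos(2\pi x)$, and the amplitude and phase recovered afterwards. A sketch on noise-free synthetic data (the true values below are made up):

```python
import numpy as np

x = np.linspace(0.0, 5.0, 200)
true_w1, true_w2, true_phi = 0.5, 2.0, 0.3        # made-up ground truth
y = true_w1 * x + true_w2 * np.sin(2 * np.pi * x + true_phi)

# Derived features: trend, sine and cosine of the seasonal frequency
A = np.column_stack([x, np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])
(w1, a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# a = w2*cos(phi), b = w2*sin(phi)  ->  recover amplitude and phase
w2 = np.hypot(a, b)
phi = np.arctan2(b, a)
print(w1, w2, phi)
```

This keeps the problem a plain linear regression in the derived features, which is the point of the section.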

### Polynomial Regression 
We create features from products (including powers) of the original input features.
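A minimal sketch of the idea with two inputs: the design matrix is extended with the product $x_1 x_2$ and the square $x_1^2$, and ordinary least squares is run on the result. The target function and its coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=100)
x2 = rng.uniform(-2, 2, size=100)
# Made-up noise-free target containing an interaction and a squared term
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + 3.0 * x1**2

# Columns: intercept, x1, x2, x1*x2, x1^2
A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)
```

The model is nonlinear in the inputs but still linear in the weights, so the same solvers (closed form or gradient descent) apply unchanged.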

# Error

#### Training Error 
Loss on the training set. Might not be a good representation unless the training data includes the entire population.

#### Generalization error           
What we really want: the ideal out-of-sample error we would get if we had data for the entire population. But we don't have all of it. 

[notes](Errors/Errors.ipynb)

#### Testing error       
What we can actually compute



### 3 Sources of error

#### Bias (reducible)

* low complexity -> high bias
* Inflexibility of our model to capture the true relationship. 

#### Variance (reducible) 

* Low complexity model -> low variance
* How sensitive is the model to the data samples considered. 

#### Noise (irreducible)          
Data is inherently noisy. We can't remove this.



# Overfitting And Regularization

### How to observe overfitting

We can find whether a model is overfitting (not generalizing well) by comparing the training loss and the test loss. 
But is there anything else that is representative of overfitting? 

In [Relation of overfitting and weight magnitude](./Regularization/Overfitting&Weights.ipynb) we observe that when a model overfits, its coefficients become very large. 

Overfitting can occur in two scenarios: 
1. With a large number of features (the model has a lot of flexibility to explain the data)
2. With increased polynomial order in a feature (same reason)

How the number of observations affects overfitting:           
1. Small data: the model overfits rapidly as model complexity increases.                
2. Large data: it is hard for the model to fit every point, so it is harder to overfit.  

We need to reduce the coefficient values and in turn keep the model simple, while maintaining similar bias.

Thus the cost equation: Measure of Fit (small implies good fit) + Measure of Coeff Magnitude (small implies not overfit)       

### How to quantify the coeff magnitude?            
1. Sum: doesn't work if one coeff is large and positive and another is large and negative. They cancel, and the sum can be close to zero.            
2. Sum of absolute values (L1 norm):
$$ |w_0| + |w_1| + \dots + |w_n|  = ||W||_1$$         
3. Sum of squares (squared L2 norm):
$$ w_0^2 + w_1^2 + \dots + w_n^2  = ||W||^2_2$$      
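The three candidate measures can be compared on a toy weight vector (values chosen for illustration) in which two large coefficients have opposite signs:

```python
def coefficient_sum(w):
    """Plain sum: large positive and negative weights cancel out."""
    return sum(w)

def l1_norm(w):
    """Sum of absolute values."""
    return sum(abs(wi) for wi in w)

def squared_l2_norm(w):
    """Sum of squares."""
    return sum(wi * wi for wi in w)

w = [100.0, -100.0, 0.5]       # toy weights with two large, opposite-sign entries
print(coefficient_sum(w))      # small, despite the large weights
print(l1_norm(w))
print(squared_l2_norm(w))
```

The plain sum reports a small magnitude for what is clearly a large-weight model, while both norms flag it.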

## Ridge Regression

[Detailed_Derivation](./Regularization/Ridge_Regression_Derivation.ipynb)


$$ Cost = RSS(w) + \lambda ||W||^2_2 $$


### Case 1 : Closed form Solution 

In OLS, for the $X^TX$ matrix to be invertible we need at least as many observations as features (and linearly independent features).             

For ridge, **a closed form solution always exists provided $\lambda$ > 0**, so it also works when no. of observations < no. of features. 


$$ w = (X^TX + \lambda I)^{-1} X^Ty $$
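A sketch of the case with fewer observations than features, on random data generated just for this example. $X^TX$ is rank deficient, so the OLS normal equation has no unique solution, but adding $\lambda I$ makes the system solvable.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 10                      # fewer observations than features
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
lam = 0.1                         # any lambda > 0 works; value chosen arbitrarily

# X^T X is n x n but has rank at most m, so it cannot be inverted
rank = np.linalg.matrix_rank(X.T @ X)

# Ridge closed form: w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Gradient of RSS(w) + lambda * ||w||^2 should vanish at the solution
gradient = -2 * X.T @ (y - X @ w_ridge) + 2 * lam * w_ridge
print(rank, np.abs(gradient).max())
```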

### Case 2 : Gradient Descent

$$ w = (1 - 2\alpha \lambda)w + 2\alpha *[(Y- \hat Y)X^T] $$

At every step we first shrink the weights to $(1 - 2\alpha \lambda)w$ and then add the RSS component. Thus, we keep reducing model complexity by reducing the magnitude of the weights. 
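The shrink-then-update step can be checked against the closed form. In this sketch the data is synthetic, `X` holds one observation per row (so the RSS component is `X.T @ residual`), and the step size is derived from the largest eigenvalue of $X^TX$ to keep the iteration stable. That choice is an assumption of this example, not something prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=m)
lam = 1.0

# Step size small enough for convergence (assumption for this example)
alpha = 0.5 / (np.linalg.eigvalsh(X.T @ X).max() + lam)

w = np.zeros(n)
for _ in range(2000):
    residual = y - X @ w
    # First shrink the weights, then add the RSS component
    w = (1 - 2 * alpha * lam) * w + 2 * alpha * (X.T @ residual)

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
print(w, w_closed)
```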


### Prevent Regularization of Intercept 

If we use the generic regularizer equation, it penalizes the intercept as well (i.e., it shrinks the intercept!), which need not always be correct. 
So we may exclude the weight corresponding to the intercept from the regularization. 

(The options below use ridge as the example.)
#### Option 1: 
Don't add $W_0^2$ to the penalty:
$$ W_0 = W_0 - \alpha \nabla_{W_0} RSS $$
$$ W_j = (1 - 2\alpha \lambda)W_j - \alpha \nabla_{W_j} RSS $$
for j > 0 (the remaining features)

#### Option 2: With Centering       
When the data is centered about 0, the intercept is close to zero anyway, so shrinking it doesn't matter. You can proceed as normal. 
Steps:          
1. Transform y to have zero mean.          
2. Run ridge as normal (closed form / GD).
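The two options can be compared numerically. The sketch below goes slightly beyond the steps above: it centers the features as well as y, which is what makes the two options coincide exactly. The data and $\lambda$ are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(loc=5.0, size=(m, n))           # features deliberately not centered
y = 50.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=m)
lam = 1.0

# Option 1: explicit intercept column, excluded from the penalty
Xb = np.column_stack([np.ones(m), X])
penalty = lam * np.eye(n + 1)
penalty[0, 0] = 0.0                            # do not penalize W_0
w_opt1 = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

# Option 2: center the data, run plain ridge, recover the intercept afterwards
x_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - x_mean, y - y_mean
w_centered = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n), Xc.T @ yc)
intercept = y_mean - x_mean @ w_centered

print(w_opt1[0], intercept)
print(w_opt1[1:], w_centered)
```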


[same_text](./Regularization/Normalize_Input_For_Regularization.ipynb)

# Why Feature Selection

1. Efficient computation (if we have say 1 billion features)
2. Easy interpretation

Some Techniques: 
### All Subsets 
We start with zero features (random noise), then increase the number of features and build a model for each combination. 
E.g. for no. of features = 1, we build a model for each feature separately. Then for no. of features = 2, we build a model for every combination of 2 features, and so on. At the end we select the model with the best cross-validation score. 


Disadv: For D features, $2^{D}$ models have to be evaluated.

### Greedy Methods 
#### Forward selection
Instead of trying every subset, at each step we keep the features selected so far, add the one remaining feature that improves performance the most, and repeat. 
Across the steps we evaluate D, (D-1), ..., 1 candidate models, i.e. D(D+1)/2 ~ $D^2$ in total, and $O(D^2) \ll 2^{D}$.

There are other methods like backward stepwise selection, combination of both etc. 
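A compact sketch of forward selection on synthetic data. For brevity it scores candidates by training RSS rather than a cross-validation score, which is a simplification. The helper names and the data are made up for this example.

```python
import numpy as np

def fit_rss(X, y, cols):
    """RSS of a least squares fit using an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def forward_selection(X, y, k):
    """Greedily pick k features, adding the best remaining one at each step."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda c: fit_rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only features 2 and 0 matter in this made-up target
y = 3.0 * X[:, 2] + 1.5 * X[:, 0] + rng.normal(scale=0.1, size=100)

order = forward_selection(X, y, 2)
print(order)
```

In practice the scoring function would be a validation or cross-validation error, since training RSS always decreases as features are added and cannot tell us when to stop.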

# Lasso Regression 

### Why Ridge + Thresholding doesn't work: 

In ridge we minimize the squares of the weights. Thus, if we have two highly correlated features, rather than assigning a large weight to one feature and zero to the other, it will assign smaller weights to both. ($4^2+4^2 = 32$, $8^2+0^2 = 64$; when minimizing the penalty, 32 looks better.) 

If we now apply a threshold and knock off the small coefficients, both correlated features will be knocked off. At least one should be preserved; otherwise valuable information is lost. 
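The weight-splitting behaviour is easy to reproduce with two identical features. In this sketch the data, $\lambda$, and the threshold of 5 are arbitrary values chosen to mirror the 4 + 4 vs. 8 + 0 example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x])        # two perfectly correlated (identical) features
y = 8.0 * x                        # a single feature would need a weight of 8
lam = 1.0

# Ridge closed form
w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(w)                           # roughly 4 and 4, not 8 and 0

# Hypothetical threshold: keep only coefficients larger than 5 in magnitude
kept = np.abs(w) > 5.0
print(kept)
```

Both coefficients fall below the threshold, so the signal carried by the pair is dropped entirely even though it explains the target.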


### Why Lasso
Ridge doesn't make coefficients exactly zero, only very close to zero. If the majority of coefficients are exactly zero:
* predictions are faster to compute, since only the features with non-zero coefficients are needed. 
* fewer features implies better interpretability.

Lasso knocks out features by making their coefficients exactly zero.

### Cost Function 

$$ Cost = RSS(w) + \lambda ||W||_1 $$

[Lasso Derivation](Lasso_Regularization_Derivation.ipynb)

$|w|$ isn't differentiable at zero, hence there is no closed form solution. 

We use coordinate descent. Instead of computing gradients for all features together, we minimize the cost function over one coefficient at a time while keeping the others fixed. Then we move to the next coefficient and cycle through all of them repeatedly. 

For plain least squares (no penalty), with features normalized so that $\sum_i h_j(x_i)^2 = 1$, the coordinate descent update looks like 

$$ w_{j} = \rho_j $$
$$ \rho_j = \sum \limits _{i=1} ^{N} h_{j}(x_i)\Big(y_{i}- \sum \limits _{k \neq j} w_{k}h_{k}(x_i)\Big)  $$ 

#### Coordinate Descent
* No learning rate hyperparameter. 
* We minimize the cost function over one coefficient at a time, keeping the others fixed. Do this for each feature. 
* Stopping criterion: cycle through all coordinates until the maximum step in a full cycle is < $\epsilon$ (a step is the change in a coefficient during its update).
* Intuition of the equation: we compute the residuals without the $j^{th}$ term. If the leftover residual is large and lines up with feature $j$, this feature is important, hence we assign a large value to its weight. We multiply by the feature value because we want the update to be proportional to it.


$$w_j = \rho_j + \lambda/2 \hspace{0.5cm} \text{if} \hspace{0.5cm}\rho_j < -\lambda/2 $$
$$w_j = 0   \hspace{1.5cm} \text{if} \hspace{0.5cm} \rho_j \in [-\lambda/2, \lambda/2] $$
$$w_j = \rho_j - \lambda/2\hspace{0.5cm} \text{if} \hspace{0.5cm}\rho_j > \lambda/2 $$
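The three-case rule above is a soft-thresholding of $\rho_j$. A minimal coordinate descent sketch built on it, assuming the feature columns are normalized to unit L2 norm (the function names, data, and $\lambda$ are illustrative, not taken from the linked notebooks):

```python
import numpy as np

def soft_threshold(rho, lam):
    """Lasso update for one coefficient, given rho_j and lambda."""
    if rho < -lam / 2:
        return rho + lam / 2
    if rho > lam / 2:
        return rho - lam / 2
    return 0.0

def lasso_coordinate_descent(H, y, lam, tol=1e-8, max_cycles=1000):
    """H: (N, D) feature matrix whose columns have unit L2 norm (assumed)."""
    w = np.zeros(H.shape[1])
    for _ in range(max_cycles):
        max_step = 0.0
        for j in range(H.shape[1]):
            # Residual computed without feature j's contribution
            partial_residual = y - H @ w + H[:, j] * w[j]
            rho = float(H[:, j] @ partial_residual)
            new_wj = soft_threshold(rho, lam)
            max_step = max(max_step, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_step < tol:          # largest step in a full cycle is tiny: stop
            break
    return w

rng = np.random.default_rng(0)
H = rng.normal(size=(100, 3))
H = H / np.linalg.norm(H, axis=0)                 # normalize columns
y = 10.0 * H[:, 0] + rng.normal(scale=0.01, size=100)   # only feature 0 matters

w = lasso_coordinate_descent(H, y, lam=1.0)
print(w)
```

The irrelevant features end up with coefficients of exactly zero, and the relevant one is shrunk by about $\lambda/2$ relative to its least squares value, which is the bias that the debiasing step later addresses.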

![](Regularization/helper/L7.JPG) 

[Ridge vs Lasso w.r.t Coefficients](Ridge_vs_Lasso_Coefficients_Poly_Reg.ipynb)


[Feature_Selection_n_Lasso](./Regularization/Feature_Selection_n_Lasso.ipynb)


# Visualizing Contours

### Ridge 
![title](./Regularization/helper/Ridge_Contour.PNG)

$$ RSS(w) = \sum (y_i - w_0h_0 - w_1h_1)^2 $$ 
(Equation of ellipse) 

$$ \text{Ridge penalty} = \lambda (w_0^2 + w_1^2)$$
(Equation of a circle)



* The centre of the ellipses (b) is the optimal point w.r.t. RSS. The origin (a) is optimal w.r.t. the L2 penalty. 
* Since both terms are present in our cost function, we need to find the point where these two sets of contours meet (touch).
* Curve R2 has lower RSS but a higher L2 penalty. R1 has higher RSS but a lower L2 penalty. 


### Lasso
![title](./Regularization/helper/Lasso_Contour.PNG) 

In lasso, the RSS contours are much more likely to hit a sharp corner of the diamond-shaped L1 contour, where one coefficient is exactly zero, than they are to hit an axis on the smooth circular contour of ridge. Thus, lasso tends towards sparse solutions (feature coefficients going to zero).

[same text](./Regularization/Visualize_Ridge_Lasso.ipynb)

```python
from IPython.display import Video

Video("./Regularization/helper/ridge_intiution_video.mp4")
```

```python
from IPython.display import Video

Video("./Regularization/helper/lasso_intiution_video.mp4")
```

## Lasso Disadv:
### Debiasing Lasso
Lasso shrinks the weights and can potentially cause a high-bias situation. To avoid that:
1. Run lasso to select the optimal features.         
2. Run least squares with only those selected features, so that their coefficients are no longer shrunk as they were in the lasso output.


### Correlated Variables:
If you have a collection of strongly correlated features, lasso will tend to select among them pretty much arbitrarily. A small tweak in the data might change which variable is included. E.g. in housing price prediction, square feet and lot size. 

### Ridge outperforms Lasso
It's been shown empirically that in many cases, ridge regression actually outperforms lasso in terms of predictive performance. 

Elastic net: combines the ridge and lasso objectives.




# Elastic net

$$ \text{Penalty} = \alpha||w||^2_{2} + (1-\alpha)||w||_{1} $$

![](Regularization/helper/EN1.JPG)
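A small sketch of the mixed penalty, assuming the L2 term is the squared norm as in the ridge cost and that $\alpha$ weights the ridge part as in the formula above (libraries often use the opposite convention, so check before comparing values):

```python
def elastic_net_penalty(w, alpha):
    """alpha * squared L2 norm + (1 - alpha) * L1 norm, for 0 <= alpha <= 1."""
    l1 = sum(abs(wi) for wi in w)
    l2_squared = sum(wi * wi for wi in w)
    return alpha * l2_squared + (1 - alpha) * l1

w = [3.0, -4.0]                          # toy weights
print(elastic_net_penalty(w, 1.0))       # pure ridge penalty
print(elastic_net_penalty(w, 0.0))       # pure lasso penalty
print(elastic_net_penalty(w, 0.5))       # equal mix
```

Setting $\alpha$ to 1 or 0 recovers the ridge and lasso penalties respectively, so elastic net can be tuned anywhere between the two.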

TODO:
* Hands-on for debiasing lasso; check whether ridge outperforms lasso in terms of MSE.
* MSE vs RMSE vs MAE vs Huber
* $R^2$ and adjusted $R^2$