## Multiple Linear Regression

The R-squared will always either increase or remain the same when you add more variables. The model already has the predictive power of the existing variables, and least squares can always assign a coefficient of zero to a new variable, so the fit on the training data cannot get worse. A new variable, no matter how insignificant it might be, therefore cannot decrease the value of R-squared.


Most of the concepts in multiple linear regression are quite similar to those in simple linear regression. The formulation for predicting the response variable now becomes:

$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{p}X_{p} + \epsilon$

Apart from the formulation, the following points carry over from simple linear regression:

* The model is still a linear fit, although it is now a hyperplane instead of a line
* Coefficients are still obtained by minimising the sum of squared errors (the least squares criterion)
* For inference, the assumptions from simple linear regression still hold: zero-mean, independent and normally distributed error terms with constant variance
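As a minimal sketch of the least squares criterion with several predictors, the snippet below fits the coefficients with NumPy. The helper name `fit_ols` and the synthetic data are assumptions made for illustration; they are not part of the course material.

```python
import numpy as np

def fit_ols(X, y):
    """Return [beta_0, beta_1, ..., beta_p] minimising the sum of squared errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta_0 acts as the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Synthetic data (assumed for illustration): y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

beta = fit_ols(X, y)
print(np.round(beta, 2))  # estimates of beta_0, beta_1, beta_2
```

With little noise and 200 rows, the estimates should land close to the values used to generate the data.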



>The new aspects to consider when moving from simple to multiple linear regression are:

**Overfitting**

* As you keep adding variables, the model may become far too complex
* It may end up memorising the training data and fail to generalise
* A model is generally said to overfit when the training accuracy is high while the test accuracy is very low

**Multicollinearity**

* Associations between predictor variables, which you will study later

**Feature selection**

* Selecting the optimal set from a pool of given features, many of which might be redundant, becomes an important task


## Multicollinearity

Multicollinearity refers to the phenomenon of having related predictor variables in the input dataset. In simple terms, in a model built using several independent variables, some of these variables might be interrelated, which makes the presence of one or more of them redundant. Dropping some of these related independent variables is one way of dealing with multicollinearity.

#### Multicollinearity affects:
---
**Interpretation:**
* Does “change in Y, when all others are held constant” apply?

**Inference:** 
* Coefficients swing wildly, signs can invert
* p-values are, therefore, not reliable
---

Multicollinearity is, thus, a big issue when you are trying to **interpret the model.** It is essential to detect and deal with the multicollinearity present in the model.

You saw two basic ways of detecting multicollinearity:

**Looking at pairwise correlations**

1. Look at the correlation between different pairs of independent variables

**Checking the Variance Inflation Factor (VIF)**

1. Sometimes pairwise correlations aren't enough
2. Instead of depending on just one variable, an independent variable might depend upon a combination of other variables
3. VIF calculates how well one independent variable is explained by all the other independent variables combined

The VIF is given by:

$$ VIF_{i} = \frac{1}{1-R_{i}^2}$$

where $R_{i}^2$ is the R-squared obtained when the i-th independent variable is regressed on (that is, represented as a linear combination of) the rest of the independent variables. You'll see VIF in action during the Python demonstration on multiple linear regression.
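The formula can be sketched directly in NumPy: regress each predictor on the others, take that R-squared, and plug it into 1 / (1 - R²). The helper names `r_squared` and `vif` and the synthetic data are assumptions for illustration. In this data, `x3` is built as roughly `x1 + x2`, so its pairwise correlation with either one is only about 0.7 by construction, while the VIFs are far above 10.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X (with intercept)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on the rest."""
    X = np.asarray(X, dtype=float)
    values = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        values.append(1.0 / (1.0 - r_squared(others, X[:, i])))
    return np.array(values)

# Synthetic predictors (assumed for illustration)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)  # nearly a combination of x1 and x2

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

This is the case the list above describes: a variable that depends on a combination of others, which a pairwise correlation check alone can understate.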

The common heuristic we follow for the VIF values is:

**> 10:  Definitely a high VIF value; the variable should be eliminated.**

**> 5 (up to 10):  Can be okay, but it is worth inspecting.**

**< 5:  Good VIF value. No need to eliminate this variable.**


**Effects of Multicollinearity**

Question: Which of the following is not affected by multicollinearity, i.e. if you add more variables that turn out to be dependent on already included variables?

Answer: R-squared value

Feedback: The predictive power given by the R-squared value is not affected because, even though you might have redundant variables in your model, they play no role in affecting the R-squared. Recall the thought experiment that Rahim had conducted in one of the lectures. Suppose you have two variables, X1 and X2, which are exactly the same. Using either 10X1 or (4X1 + 6X2) will then give you the same result. In the second case, even though you have added one variable, the predictive power remains the same.
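The thought experiment can be reproduced in a few lines. This is a sketch on synthetic data (the data and the helper `r_squared` are assumptions for illustration): `x2` is an exact copy of `x1`, and the two sets of coefficients give the same predictions and therefore the same R-squared.

```python
import numpy as np

def r_squared(y, pred):
    """R-squared of a set of predictions against the observed values."""
    rss = np.sum((y - pred) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1.copy()  # X2 is exactly the same as X1
y = 10.0 * x1 + rng.normal(scale=0.5, size=100)

pred_one = 10.0 * x1             # model using X1 only
pred_two = 4.0 * x1 + 6.0 * x2   # model using both X1 and X2

r2_one = r_squared(y, pred_one)
r2_two = r_squared(y, pred_two)
print(np.allclose(pred_one, pred_two), r2_one, r2_two)
```

Note that any pair of coefficients summing to 10 would work equally well here, which is why individual coefficients become hard to interpret under multicollinearity even though the predictive power is unchanged.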

**VIF**

VIF measures how well a predictor variable can be predicted using all the other predictor variables (the target variable is excluded).

**Some methods that can be used to deal with multicollinearity are:**

**1. Dropping variables**
* Drop the variable which is highly correlated with others
* Keep the variable that is easier to interpret from a business perspective

**2. Creating new variables using the interactions of the older variables**
* Add interaction features, i.e. features derived using some of the original features

**3. Variable transformations**
* Principal Component Analysis (covered in a later module)



#### Dealing with Categorical Variables

So far, you have worked with numerical variables. But many times, you will have non-numeric variables in the datasets. These variables are also known as categorical variables. Obviously, these variables can't be used directly in the model since they are non-numeric.


When you have a categorical variable with say 'n' levels, the idea of dummy variable creation is to build 'n-1' variables, indicating the levels. For a variable say, 'Relationship' with three levels namely, 'Single', 'In a relationship', and 'Married', you would create a dummy table like the following:


| Relationship Status | Single | In a relationship | Married |
|---------------------|--------|-------------------|---------|
| Single              | 1      | 0                 | 0       |
| In a relationship   | 0      | 1                 | 0       |
| Married             | 0      | 0                 | 1       |


But you can clearly see that there is no need to define three separate dummy variables. If you drop one of them, say 'Single', the remaining two are still enough to identify all three levels.

Let's drop the dummy variable 'Single' from the columns and see what the table looks like:

| Relationship Status | In a relationship | Married |
|---------------------|-------------------|---------|
| Single              | 0                 | 0       |
| In a relationship   | 1                 | 0       |
| Married             | 0                 | 1       |


If both the dummy variables namely 'In a relationship' and 'Married' are equal to zero, that means that the person is single. If 'In a relationship' is one and 'Married' is zero, that means that the person is in a relationship and finally, if 'In a relationship' is zero and 'Married' is 1, that means that the person is married.
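The table above can be reproduced with a small plain-Python sketch. The helper `dummy_encode` is hypothetical and written only for illustration; it builds one 0/1 indicator per level except the dropped one.

```python
def dummy_encode(values, levels, drop):
    """Encode each value as 0/1 indicators for every level except `drop`."""
    kept = [level for level in levels if level != drop]
    return [{level: int(value == level) for level in kept} for value in values]

levels = ["Single", "In a relationship", "Married"]
people = ["Single", "In a relationship", "Married"]

# Dropping 'Single' leaves n - 1 = 2 dummy variables for n = 3 levels.
rows = dummy_encode(people, levels, drop="Single")
for person, row in zip(people, rows):
    print(person, row)
```

In practice, `pandas.get_dummies(..., drop_first=True)` is a common way to get the same n-1 encoding; it drops the first level rather than one you name.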

### Feature Scaling

#### Why do we need to scale the features?

1. Ease of interpretation (if all variables are on the same scale, it is easy to compare the coefficients)

2. Faster convergence of the gradient descent method

>Which of these will change when you scale features?
 1. p-values
 2. Model Accuracy
 3. Both
 4. None

Answer: None (scaling changes only the coefficients)

#### Will you scale dummy variables?
This question arises because the values of dummy variables already lie between 0 and 1.

Answer: Both options are fine, but scaling them is helpful for LASSO regression (find out why).


**It is important to note that scaling just affects the coefficients and none of the other parameters like t-statistic, F-statistic, p-values, R-squared, etc.**
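The statement above can be checked numerically. The sketch below (synthetic data and helper names are assumptions for illustration) fits the same least squares model on raw features and on features rescaled to the 0 to 1 range, then compares coefficients, predictions and R-squared.

```python
import numpy as np

def fit_predict(X, y):
    """Fit ordinary least squares with an intercept; return coefficients and fitted values."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta, design @ beta

def r_squared(y, pred):
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Two features on very different scales (assumed for illustration)
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 10000, 200)])
y = 1.0 + 2.0 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Rescale every column to the range 0 to 1
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

beta_raw, pred_raw = fit_predict(X, y)
beta_scaled, pred_scaled = fit_predict(X_scaled, y)

print("coefficients (raw):   ", beta_raw)
print("coefficients (scaled):", beta_scaled)
print("R-squared:", r_squared(y, pred_raw), r_squared(y, pred_scaled))
```

The coefficients differ between the two fits, while the fitted values and R-squared agree up to floating-point error.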

There are two major methods to scale the variables: standardisation and MinMax scaling.

**Standardisation** rescales the data so that it has mean zero and standard deviation one. It does not change the shape of the distribution, so the result follows a standard normal distribution only if the original data was normally distributed.

**MinMax scaling**, on the other hand, brings all of the data into the range 0 to 1. The formulae used in the background for each of these methods are given below:


$$ \text{Standardisation: } x = \frac{x - \text{mean}(x)}{\text{sd}(x)} $$

$$ \text{MinMax scaling: } x = \frac{x - \min(x)}{\max(x) - \min(x)} $$
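Both formulae translate almost directly into NumPy. The function names below are assumptions for illustration, and `x.std()` uses the population standard deviation (dividing by n), which is one of the two common conventions.

```python
import numpy as np

def standardise(x):
    """Subtract the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max_scale(x):
    """Map the smallest value to 0 and the largest value to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
z = standardise(x)
m = min_max_scale(x)
print(z)
print(m)
```

Both functions assume the input is not constant; a constant column would make the denominator zero.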


To know more about dummy variables ([Link](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-dummy-coding/))

Why it's necessary to create dummy variables ([Link](https://stats.stackexchange.com/questions/89533/convert-a-categorical-variable-to-a-numerical-variable-prior-to-regression))

When to Normalise data and when to standardise? ([Link](https://stackoverflow.com/questions/32108179/linear-regression-normalization-vs-standardization))

Various scaling techniques ([Link](https://en.wikipedia.org/wiki/Feature_scaling))

![Image](https://i.stack.imgur.com/hcP4l.png)

### How to handle a categorical variable when it has more than 10 levels?




### Model Assessment and Comparison

Now, for the assessment, you have a lot of new considerations to make. Besides, selecting the best model to obtain decent predictions becomes quite subjective. You need to maintain a balance between keeping the model simple and explaining the highest variance (which means that you would want to keep as many variables as possible). This can be done using the key idea that a model can be penalised for keeping a large number of predictor variables. 

 

Hence, there are two new parameters that come into the picture:

$$ \text{Adjusted } R^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1} $$

$$ \text{AIC} = n \log\left(\frac{RSS}{n}\right) + 2p $$

Here, n is the sample size, i.e. the number of rows in the dataset, and p is the number of predictor variables.
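As a quick arithmetic sketch, the two formulae can be written as small Python functions. The function names and the example numbers are assumptions for illustration, and the logarithm is taken as the natural log.

```python
import math

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted on n rows."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def aic(rss, n, p):
    """AIC as defined above: n * log(RSS / n) + 2p."""
    return n * math.log(rss / n) + 2 * p

# Same R-squared and sample size, different numbers of predictors
few = adjusted_r2(0.80, n=100, p=5)
many = adjusted_r2(0.80, n=100, p=20)
print(few, many)

# Same RSS and sample size, different numbers of predictors
print(aic(rss=250.0, n=100, p=5), aic(rss=250.0, n=100, p=20))
```

For a fixed R-squared (or RSS), the model with more predictors gets a lower adjusted R-squared and a higher AIC, which is the penalty described in the text.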

#### R-squared vs Adjusted R-squared

Why do you think it is better to use adjusted R-squared in the case of multiple linear regression?

The major difference between R-squared and Adjusted R-squared is that R-squared doesn't penalise the model for having more variables. Thus, if you keep on adding variables to the model, the R-squared will always increase (**or remain the same when the new variable explains none of the variation left unexplained by the variables already in the model**). In effect, R-squared treats every added variable as if it increased the predictive power.

Adjusted R-squared, on the other hand, penalises models based on the number of variables present in them. So if you add a variable and the Adjusted R-squared drops, it is a strong sign that the variable adds little to the model and is a candidate for removal. In the case of multiple linear regression, you should therefore look at the adjusted R-squared value in order to keep redundant variables out of your regression model.

Adjusted $R^2$ adjusts the value of $R^2$ such that a model with a larger number of variables is penalized.
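The difference can be seen by adding a predictor that is pure noise. This is a sketch on synthetic data (the data and the helper `r2_and_adjusted` are assumptions for illustration): R-squared cannot go down when the noise column is added, while adjusted R-squared is free to fall.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R-squared, adjusted R-squared) of a least squares fit with intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=50)
noise = rng.normal(size=(50, 1))  # unrelated to y by construction

r2_small, adj_small = r2_and_adjusted(x, y)
r2_big, adj_big = r2_and_adjusted(np.column_stack([x, noise]), y)

print("one predictor:     ", r2_small, adj_small)
print("plus noise column: ", r2_big, adj_big)
```

Whether the adjusted value actually drops in a given run depends on how much variation the noise column happens to pick up by chance, so compare the printed pairs rather than assume a direction.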

[AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion)

[BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion)

[Mallows' CP](https://en.wikipedia.org/wiki/Mallows%27s_Cp)


### Feature Selection

[Feature Selection](http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/)

When building a multiple linear regression model, you might have quite a few potential predictor variables; selecting just the right ones becomes an extremely important exercise.

To get the optimal model, you can always try all the possible combinations of independent variables and see which model fits the best. But with p candidate variables there are $2^p$ possible subsets, so this method quickly becomes time-consuming and infeasible. Hence, you need some other method to get a decent model. This is where manual feature elimination comes in, where you:

* Build the model with all the features
* Drop the features that are least helpful in prediction (high p-value)
* Drop the features that are redundant (using correlations and VIF)
* Rebuild model and repeat

Note that the second and third steps go hand in hand, and the choice of which features to eliminate first is very subjective. You'll see this during the hands-on demonstration of multiple linear regression in Python in the next session.


**Model Assessment**

Question: After performing inferences on a linear model built with several variables, you concluded that the variable ‘r’ was insignificant. This meant that the variable ‘r’:

1. Had a high p-value
2. Had a low p-value
3. Had a high VIF
4. Had a low VIF

Answer: Had a high p-value

Feedback: A high p-value means that the variable is not significant, and hence, doesn't help much in prediction.


Now, manual feature elimination might work when you have a relatively low number of potential predictor variables, say, ten or even twenty. But it is not a practical approach once you have a large number of features, say 100. In such a case, you automate the feature selection (or elimination) process. Let's see how.


##### Recursive feature elimination

Recursive feature elimination (RFE) is based on the idea of repeatedly constructing a model (for example, an SVM or a regression model) and choosing either the best or worst performing feature (for example, based on coefficients), setting the feature aside and then repeating the process with the rest of the features. This process is applied until all the features in the dataset are exhausted. Features are then ranked according to when they were eliminated. As such, it is a greedy optimisation for finding the best performing subset of features.
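The procedure can be sketched with NumPy alone. This is a simplified illustration under stated assumptions, not a reference implementation: the helper `rfe_ranking` is hypothetical, the model is ordinary least squares, features are standardised so that coefficient sizes are comparable, and the "worst" feature in each round is the one with the smallest absolute coefficient.

```python
import numpy as np

def rfe_ranking(X, y):
    """Rank features by recursive elimination; rank 1 is the feature kept the longest."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # comparable coefficient scales
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        design = np.column_stack([np.ones(len(Xs)), Xs[:, remaining]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        weakest = int(np.argmin(np.abs(beta[1:])))  # ignore the intercept
        eliminated.append(remaining.pop(weakest))   # set the feature aside
    ranking = np.empty(X.shape[1], dtype=int)
    for rank, feature in enumerate(reversed(eliminated), start=1):
        ranking[feature] = rank
    return ranking

# Synthetic data (assumed): only features 0 and 2 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

ranking = rfe_ranking(X, y)
print(ranking)
```

scikit-learn ships a ready-made version of this idea as `sklearn.feature_selection.RFE`, which is usually preferable to a hand-rolled loop.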