# Multiple Linear Regression

Welcome to the session on **Multiple Linear Regression**. So far, we have discussed simple linear regression, where the model is built using only one independent variable. But what if you have **multiple independent variables**?

How do you make a predictive model in such a case? 

Building a multiple linear regression model on such data is one solution.

## In this session

You will reuse the sales prediction example from the previous session, where sales were predicted from the TV marketing budget, to build a multiple linear regression model. Instead of just one variable, you will now deal with three, because the **marketing budget is split across three marketing channels: TV marketing, radio marketing, and newspaper marketing**. You will see how adding more variables introduces new problems and how to approach them. Finally, you will learn about feature selection and feature elimination to build an optimal model.



This session is almost entirely theoretical, covering multiple linear regression and its various aspects. Don't worry if you don't grasp everything on the first pass: the next session is a Python demonstration of multiple linear regression, where you will see each of these aspects in action and things will become clearer.



## Motivation

The term 'multiple' in multiple linear regression is largely self-explanatory. The model represents the **relationship between two or more independent input variables and a response variable**. Multiple linear regression is needed when a single variable is not sufficient to build a good model and make accurate predictions.

![49.png](attachment:4665c585-4da7-4a78-912d-9735567d4e47.png)


#### Q1: In the simple linear regression model between TV and sales, the accuracy, or the 'model fit', as measured by R-squared was about 0.81. But, when you brought in the radio and the newspaper variables along with TV, the R-squared increased to 0.91 and 0.83, respectively. Do you think the R-squared value will always increase (or at least remain the same) when you add more variables?

    Yes

    No

    Can't say

You saw that multiple linear regression proved useful in creating a better model, as the value of R-squared changed noticeably. Recall that the R-squared for simple linear regression using 'TV' as the input variable was 0.816. With two input variables, 'Newspaper' and 'TV', the R-squared increases to 0.836. Using 'Radio' along with 'TV' increases it to 0.910. So it seems that adding a new variable helps explain the variance in the data better.



It is recommended that you check the R-squared after adding these variables to see how much the model has improved.
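
The check recommended above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the advertising dataset used in this course: the budgets, the coefficients, the noise level and the helper name `r_squared` are all assumptions made up for the example, and 'newspaper' is deliberately given no real effect on sales.

```python
import numpy as np

def r_squared(X, y):
    """Fit least squares with an intercept and return R-squared on the same data."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares coefficients
    ss_res = np.sum((y - A @ beta) ** 2)           # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)           # total variation
    return 1.0 - ss_res / ss_tot

# Synthetic marketing budgets (assumed ranges, for illustration only)
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)

# Assumed data-generating process: newspaper has no effect on sales
sales = 3.0 + 0.05 * tv + 0.2 * radio + rng.normal(0, 1.5, n)

r2_tv = r_squared(tv.reshape(-1, 1), sales)
r2_tv_news = r_squared(np.column_stack([tv, newspaper]), sales)
r2_tv_radio = r_squared(np.column_stack([tv, radio]), sales)

print(f"TV only        : {r2_tv:.3f}")
print(f"TV + newspaper : {r2_tv_news:.3f}")
print(f"TV + radio     : {r2_tv_radio:.3f}")
```

Comparing `r2_tv_news` and `r2_tv_radio` against `r2_tv` shows how much each added variable improves the fit on this made-up data.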



Let's now look at the formulation of multiple linear regression. Multiple linear regression is just an extension of simple linear regression, so most of the concepts carry over and the formulation is largely the same. The formulation for predicting the response variable now becomes:

![49.png](https://latex.upgrad.com/render?formula=Y%20%3D%20%5Cbeta_%7B0%7D%20%2B%20%5Cbeta_%7B1%7D%20X_%7B1%7D%20%2B%20%5Cbeta_%7B2%7D%20X_%7B2%7D%20%2B%20%5Cldots%20%2B%20%5Cbeta_%7Bp%7D%20X_%7Bp%7D%20%2B%20%5Cepsilon)

Apart from the formulation, several other ideas carry over from simple linear regression:

* The model is fitted in the same way, except that it now fits a hyperplane instead of a line
* Coefficients are still obtained by minimising the sum of squared errors, i.e., the least squares criterion
* For inference, the assumptions from simple linear regression still hold: zero-mean, independent and normally distributed error terms with constant variance
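
The least squares fit described above can be sketched as follows. This is a rough illustration on synthetic data with two predictors; the true coefficients, the noise level and all variable names are assumptions chosen for the example, and `numpy.linalg.lstsq` stands in for whatever fitting routine the Python demonstration uses.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two synthetic predictors X1 and X2 (assumed, for illustration only)
X = rng.normal(size=(n, 2))

# Assumed true model: Y = 1 + 2*X1 - 3*X2 + error
true_beta = np.array([1.0, 2.0, -3.0])    # beta_0, beta_1, beta_2
eps = rng.normal(0, 0.5, n)               # zero-mean, constant-variance errors
y = true_beta[0] + X @ true_beta[1:] + eps

# Design matrix with a leading column of ones for the intercept beta_0
A = np.column_stack([np.ones(n), X])

# Coefficients that minimise the sum of squared errors
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta_hat

print("estimated coefficients:", np.round(beta_hat, 3))
print("sum of squared errors :", round(float(np.sum(residuals ** 2)), 3))
```

The fitted coefficients define a plane in the space of (X1, X2, Y); with more predictors the same code fits a hyperplane, because only the number of columns in `A` changes.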

#### Q2: Which of the following assumptions changes for multiple linear regression?

    The error terms should be normally distributed.
    
    The error terms are centred at zero.
    
    The error terms have constant variance.
    
    None of the above.
    

## Moving from SLR to MLR: New Considerations

The new aspects to consider when moving from simple to multiple linear regression are:

1. **Overfitting**

* As you keep adding the variables, the model may become far too complex
* It may end up memorising the training data and will fail to generalise
* A model is generally said to overfit when the training accuracy is high while the test accuracy is very low

2. **Multicollinearity**

* Associations between predictor variables, which you will study later

3. **Feature selection**
* Selecting the optimal set from a pool of given features, many of which might be redundant, becomes an important task
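
The overfitting point in the list above can be illustrated with a small experiment. This is a sketch under stated assumptions, not a result from the course dataset: the data is synthetic, only the first feature actually influences the response, and the remaining 25 features are pure noise. The helper names `fit_ols`, `r_squared` and `make_data` are made up for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients (intercept first) for y regressed on X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r_squared(beta, X, y):
    """R-squared of an already fitted model on the given data."""
    A = np.column_stack([np.ones(len(y)), X])
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(7)
n_train, n_test, n_noise = 30, 200, 25

def make_data(n):
    """One informative feature followed by n_noise irrelevant features."""
    signal = rng.normal(size=(n, 1))
    noise_features = rng.normal(size=(n, n_noise))
    y = 2.0 * signal[:, 0] + rng.normal(size=n)
    return np.hstack([signal, noise_features]), y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

small = fit_ols(X_train[:, :1], y_train)   # uses only the informative feature
big = fit_ols(X_train, y_train)            # uses all 26 features

train_r2_small = r_squared(small, X_train[:, :1], y_train)
test_r2_small = r_squared(small, X_test[:, :1], y_test)
train_r2_big = r_squared(big, X_train, y_train)
test_r2_big = r_squared(big, X_test, y_test)

print(f"1 feature   : train R2 = {train_r2_small:.3f}, test R2 = {test_r2_small:.3f}")
print(f"26 features : train R2 = {train_r2_big:.3f}, test R2 = {test_r2_big:.3f}")
```

With almost as many coefficients as training rows, the larger model can fit the training data very closely, yet it scores worse on the held-out rows. That gap between training and test performance is the signature of overfitting, and it is why selecting a smaller set of features matters.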


The link below explains the concept of overfitting in more detail.

[Overfitting](https://elitedatascience.com/overfitting-in-machine-learning)



#### Q3: Which of these two models would be a better fit to the data?

![50.png](attachment:4e2a366c-bd2b-4e63-8e23-e7dba513b296.png)

![51.png](attachment:6417863c-a30f-4929-bd58-75c20af20470.png)

**Answer**:

The first one

Correct! The first model seems to generalise well on the dataset, so if more similar data is introduced, the accuracy should not drop. The second model, however, clearly seems to have memorised all the data points in the dataset. It is overfitting, so its accuracy is likely to drop when new data points are introduced.

## Multicollinearity

In the last segment, you learned about the new considerations that arise when moving to multiple linear regression. Rahim has already talked about **overfitting**. Let's now look at the next aspect, i.e., **multicollinearity**.

**Multicollinearity** refers to the phenomenon of having related predictor variables in the input dataset. In simple terms, in a model built using **several independent variables**, some of these **variables might be interrelated**, which makes the presence of those variables in the model redundant. One **way of dealing with multicollinearity is to drop some of these related independent variables**.


**Multicollinearity affects:**

* Interpretation:
    * Does “change in Y, when all others are held constant” apply?
* Inference: 
    * Coefficients swing wildly, signs can invert
    * p-values are, therefore, not reliable

Multicollinearity is, thus, a big issue when you are trying to interpret the model. It is essential to detect and deal with the multicollinearity present in the model.
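
The effect on inference listed above can be seen in a small simulation. This is a hedged sketch on synthetic data: the true coefficients (both equal to 1), the sample size, the noise levels and the helper names `fit_coefficients` and `simulate_b1` are assumptions made for illustration. The same model is refitted on many fresh samples, once with unrelated predictors and once with a second predictor that is almost a copy of the first.

```python
import numpy as np

def fit_coefficients(x1, x2, y):
    """Least squares fit of y on x1 and x2 with an intercept; returns (b0, b1, b2)."""
    A = np.column_stack([np.ones(len(y)), x1, x2])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def simulate_b1(collinear, n_runs=300, n=50, seed=0):
    """Refit on n_runs fresh synthetic samples and collect the x1 coefficient."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_runs):
        x1 = rng.normal(size=n)
        if collinear:
            x2 = x1 + 0.05 * rng.normal(size=n)   # x2 is almost a copy of x1
        else:
            x2 = rng.normal(size=n)               # x2 is unrelated to x1
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # true coefficients are both 1
        estimates.append(fit_coefficients(x1, x2, y)[1])
    return np.array(estimates)

b1_independent = simulate_b1(collinear=False)
b1_collinear = simulate_b1(collinear=True)

# How much the estimate of the same coefficient varies from sample to sample
spread_independent = b1_independent.std()
spread_collinear = b1_collinear.std()

# Share of runs in which the estimated sign is wrong (the true value is +1)
sign_flips_independent = np.mean(b1_independent < 0)
sign_flips_collinear = np.mean(b1_collinear < 0)

print(f"independent: spread = {spread_independent:.3f}, sign flips = {sign_flips_independent:.1%}")
print(f"collinear  : spread = {spread_collinear:.3f}, sign flips = {sign_flips_collinear:.1%}")
```

When the two predictors are nearly identical, the data cannot tell their individual contributions apart, so the estimate for either one varies far more between samples and sometimes comes out with the wrong sign.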

![52.png](attachment:6609a5f1-f047-4bd1-937e-4e67c16d8ee6.png)

## Detect multicollinearity

Let's see how you can detect multicollinearity in the model.

**Two basic ways of detecting multicollinearity:**

1. Looking at pairwise correlations
    * Looking at the correlation between different pairs of independent variables
2. Checking the Variance Inflation Factor (VIF)
    * Sometimes pairwise correlations aren't enough
    * An independent variable might depend on a combination of other variables rather than on just one
    * VIF calculates how well one independent variable is explained by all the other independent variables combined

    The VIF is given by:
   ![vif](https://latex.upgrad.com/render?formula=V%20I%20F_%7Bi%7D%20%3D%20%5Cfrac%7B1%7D%7B1%20-%20%5Cleft%28R_%7Bi%7D%5Cright%29%5E%7B2%7D%7D)

where 'i' refers to the i-th variable, which is being represented as a linear combination of the rest of the independent variables, and (R_i)^2 is the R-squared of that regression. You'll see VIF in action during the Python demonstration on multiple linear regression.
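
Both detection methods can be sketched with NumPy alone. This is an illustrative, hand-rolled version under stated assumptions, not the routine used in the Python demonstration: the function name `vif` and the synthetic variables `x1` to `x4` are made up for the example, and `x3` is deliberately built as a combination of `x1` and `x2`.

```python
import numpy as np

def vif(X):
    """Return the VIF of every column of X (rows are observations)."""
    n, p = X.shape
    values = []
    for i in range(p):
        target = X[:, i]                      # the i-th variable
        others = np.delete(X, i, axis=1)      # all remaining variables
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        ss_res = np.sum((target - A @ beta) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2_i = 1.0 - ss_res / ss_tot          # R-squared of variable i on the rest
        values.append(1.0 / (1.0 - r2_i))     # VIF_i = 1 / (1 - R_i^2)
    return np.array(values)

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.3 * rng.normal(size=n)   # depends on a combination of x1 and x2
x4 = rng.normal(size=n)                   # unrelated to the others
X = np.column_stack([x1, x2, x3, x4])

# Method 1: pairwise correlations between the independent variables
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.max(np.abs(corr - np.eye(4)))   # largest off-diagonal entry

# Method 2: variance inflation factors
vif_values = vif(X)

print("largest pairwise correlation:", round(float(max_pairwise), 3))
print("VIF per variable            :", np.round(vif_values, 2))
```

In this construction no single pair of variables is very strongly correlated, yet `x3` is almost fully explained by `x1` and `x2` together, which is the situation where VIF is more informative than pairwise correlations.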

The common heuristic we follow for the VIF values is:

    > 10:  Definitely high VIF value and the variable should be eliminated.
    
    > 5:  Can be okay, but it is worth inspecting.
    
    < 5: Good VIF value. No need to eliminate this variable.

But once you have detected the multicollinearity present in the dataset, how exactly do you deal with it?


![53.png](attachment:e8cfb559-6699-4dea-a6e8-82367576ae98.png)