# Multiple Linear Regression

Welcome to the session on **Multiple Linear Regression**. So far, we have discussed simple linear regression, where the model is built using only one independent variable. But what if you have **multiple independent variables**?

How do you make a predictive model in such a case? 

Building a multiple linear regression model on such data is one solution.

**In this session**

You will reuse the sales-prediction example from the previous session, which used the TV marketing budget, to build a multiple linear regression model. But now, instead of just one variable, you will have three variables to deal with. The **marketing budget will be split into three marketing channels: TV marketing, radio marketing, and newspaper marketing**. You will see how adding more variables brings in new problems and how to approach them. In the end, you will learn about feature selection and feature elimination to build an optimal model.



This session is almost entirely theoretical, covering multiple linear regression and its various aspects. Don't worry if you don't get everything in the first go: you will see each of these aspects in action in the next session, a Python demonstration of multiple linear regression, where things will become clearer.



## Motivation

The term 'multiple' in multiple linear regression gives you a fair idea in itself. It represents the **relationship between two or more independent input variables and a response variable**. Multiple linear regression is needed when one variable might not be sufficient to create a good model and make accurate predictions.

![49.png](attachment:4665c585-4da7-4a78-912d-9735567d4e47.png)


#### Q1: In the simple linear regression model between TV and sales, the accuracy, or the 'model fit', as measured by R-squared was about 0.81. But, when you brought in the radio and the newspaper variables along with TV, the R-squared increased to 0.91 and 0.83, respectively. Do you think the R-squared value will always increase (or at least remain the same) when you add more variables?

    Yes

    No

    Can't say

You saw that multiple linear regression proved useful in creating a better model, as the value of R-squared changed. Recall that the R-squared for simple linear regression using 'TV' as the input variable was 0.816. With two input variables, 'Newspaper' and 'TV', the R-squared increases to 0.836. Using 'Radio' along with 'TV' increases it to 0.910. So it seems that adding a new variable helps explain the variance in the data better.



It is recommended that you check the R-squared after adding these variables to see how much the model has improved.
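
The effect of adding predictors on R-squared can be sketched in a few lines. The snippet below is a minimal illustration on synthetic data, not the advertising dataset from the session: the variable names `tv`, `radio` and `newspaper`, the coefficients used to generate `sales`, and the helper `r_squared` are all assumptions made for the example.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)      # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return 1 - rss / tss

# Synthetic budgets (assumed ranges); newspaper has no effect on sales here
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = rng.uniform(0, 100, n)
sales = 3 + 0.05 * tv + 0.2 * radio + rng.normal(0, 1.5, n)

r2_tv = r_squared(tv.reshape(-1, 1), sales)
r2_tv_news = r_squared(np.column_stack([tv, newspaper]), sales)
r2_tv_radio = r_squared(np.column_stack([tv, radio]), sales)

print(f"TV only        : {r2_tv:.3f}")
print(f"TV + newspaper : {r2_tv_news:.3f}")
print(f"TV + radio     : {r2_tv_radio:.3f}")
```

In this synthetic setup `newspaper` barely moves the R-squared while `radio` raises it clearly. The exact values depend on the simulated data and will not match the figures quoted from the session.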



Let's now look at the formulation of multiple linear regression. Multiple linear regression is just an extension of simple linear regression, so most of the concepts carry over and the formulation is largely the same. The formulation for predicting the response variable now becomes:

![49.png](https://latex.upgrad.com/render?formula=Y%20%3D%20%5Cbeta_%7B0%7D%20%2B%20%5Cbeta_%7B1%7D%20X_%7B1%7D%20%2B%20%5Cbeta_%7B2%7D%20X_%7B2%7D%20%2B%20%5Cldots%20%2B%20%5Cbeta_%7Bp%7D%20X_%7Bp%7D%20%2B%20%5Cepsilon)

Some aspects carry over from simple linear regression, and one changes:

* The model now fits a hyperplane instead of a line
* Coefficients are still obtained by minimising the sum of squared errors (the least squares criterion)
* For inference, the assumptions from simple linear regression still hold: zero-mean, independent, and normally distributed error terms with constant variance
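
The least squares fit of a hyperplane can be sketched with NumPy alone. This is a minimal example on simulated data: the "true" coefficients in `true_beta` and the noise level are assumptions chosen so that the recovered estimates can be compared against known values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3

# Simulated predictors X_1..X_3 and assumed true coefficients [beta_0, beta_1, beta_2, beta_3]
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, 1.5, -0.7, 0.3])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.5, n)

# Prepend a column of ones so that beta_0 (the intercept) is estimated too
A = np.column_stack([np.ones(n), X])

# lstsq returns the coefficients that minimise the sum of squared errors
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("true     :", true_beta)
print("estimated:", np.round(beta_hat, 3))
```

The estimates should land close to the assumed coefficients, with small differences caused by the noise term.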

#### Q2: Which of the following assumptions changes for multiple linear regression?

    The error terms should be normally distributed.
    
    The error terms are centred at zero.
    
    The error terms have constant variance.
    
    None of the above.
    

## Moving from SLR to MLR: New Considerations

The new aspects to consider when moving from simple to multiple linear regression are:

1. **Overfitting**
    * As you keep adding variables, the model may become far too complex
    * It may end up memorising the training data and fail to generalise
    * A model is generally said to overfit when the training accuracy is high while the test accuracy is very low
2. **Multicollinearity**
    * Associations between predictor variables, which you will study later
3. **Feature selection**
    * Selecting the optimal set from a pool of given features, many of which might be redundant, becomes an important task


You can read more about overfitting at the link below.

[Overfitting](https://elitedatascience.com/overfitting-in-machine-learning)



#### Q3: Which of these two models would be a better fit to the data?

![50.png](attachment:4e2a366c-bd2b-4e63-8e23-e7dba513b296.png)

![51.png](attachment:6417863c-a30f-4929-bd58-75c20af20470.png)

**Answer**:

The first one

Correct! The first model seems to be generalising well on the dataset. So if more such similar data is introduced, the accuracy will not drop. But the second model clearly seems to have memorised all the data points in the dataset and hence, is displaying overfitting which might not be good if new data points are introduced.
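
The gap between training and test performance that signals overfitting can be reproduced on simulated data. In this sketch, one feature truly drives the response and 25 additional features are pure noise. The sample sizes, the number of noise features and the helper names are assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r_squared(beta, X, y):
    """R-squared of the fitted coefficients on the given data."""
    A = np.column_stack([np.ones(len(y)), X])
    rss = np.sum((y - A @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

rng = np.random.default_rng(2)
n_train, n_test, n_noise = 30, 200, 25

def make_data(n):
    signal = rng.normal(size=(n, 1))           # the one feature that matters
    noise = rng.normal(size=(n, n_noise))      # irrelevant features
    y = 2 * signal[:, 0] + rng.normal(size=n)
    return signal, np.hstack([signal, noise]), y

sig_tr, all_tr, y_tr = make_data(n_train)
sig_te, all_te, y_te = make_data(n_test)

beta_small = fit_ols(sig_tr, y_tr)   # simple model: 1 feature
beta_big = fit_ols(all_tr, y_tr)     # complex model: 26 features on 30 rows

train_small, test_small = r_squared(beta_small, sig_tr, y_tr), r_squared(beta_small, sig_te, y_te)
train_big, test_big = r_squared(beta_big, all_tr, y_tr), r_squared(beta_big, all_te, y_te)

print(f"simple model : train R2 = {train_small:.2f}, test R2 = {test_small:.2f}")
print(f"complex model: train R2 = {train_big:.2f}, test R2 = {test_big:.2f}")
```

The complex model fits the training rows almost perfectly but does much worse on unseen rows, which is the memorising behaviour described above.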

## Multicollinearity

In the last segment, you learned about the new considerations that arise when moving to multiple linear regression. Rahim has already talked about **overfitting**. Let's now look at the next aspect, i.e., **multicollinearity**.

**Multicollinearity** refers to the phenomenon of having related predictor variables in the input dataset. In simple terms, in a model which has been built using **several independent variables**, some of these **variables might be interrelated**, which makes the presence of some of them in the model redundant. One way of dealing with multicollinearity is to **drop some of these related independent variables**.

**Multicollinearity happens when independent variables (features) are highly correlated with each other, which can make the model unstable and the estimated coefficients unreliable.**

**Multicollinearity affects:**

* Interpretation:
    * Does “change in Y, when all others are held constant” apply?
* Inference: 
    * Coefficients swing wildly, signs can invert
    * p-values are, therefore, not reliable

Multicollinearity is, thus, a big issue when you are trying to interpret the model. It is essential to detect and deal with the multicollinearity present in the model.
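
The claim that coefficients swing wildly under multicollinearity can be checked with a small simulation. The sketch below refits the same model on many fresh samples and measures how much the estimated coefficient of `x1` varies, once with uncorrelated predictors and once with predictors whose correlation is about 0.99. The function name, sample sizes and coefficients are assumptions for the example.

```python
import numpy as np

def spread_of_beta1(rho, n_runs=200, n=100, seed=3):
    """Standard deviation of the estimated coefficient of x1 over repeated samples."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_runs):
        x1 = rng.normal(size=n)
        # x2 is built to have a correlation of about rho with x1
        x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
        y = 1 + 2 * x1 + 2 * x2 + rng.normal(size=n)
        A = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        estimates.append(beta[1])
    return float(np.std(estimates))

spread_uncorrelated = spread_of_beta1(rho=0.0)
spread_correlated = spread_of_beta1(rho=0.99)

print(f"spread of beta_1, uncorrelated predictors     : {spread_uncorrelated:.3f}")
print(f"spread of beta_1, highly correlated predictors: {spread_correlated:.3f}")
```

The true coefficient is 2 in both settings. Only the stability of its estimate changes, which is why interpretation and p-values become unreliable.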

![52.png](attachment:6609a5f1-f047-4bd1-937e-4e67c16d8ee6.png)

## Detecting Multicollinearity

Let's see how you can detect multicollinearity in the model.

**Two basic ways of detecting multicollinearity:**

1. Looking at pairwise correlations
    * Looking at the correlation between different pairs of independent variables
2. Checking the Variance Inflation Factor (VIF)
    * Sometimes pairwise correlations aren't enough
    * Instead of depending on just one variable, an independent variable might depend upon a combination of other variables
    * VIF calculates **how well one independent variable is explained by all the other independent variables** combined

The VIF is given by:

![vif](https://latex.upgrad.com/render?formula=V%20I%20F_%7Bi%7D%20%3D%20%5Cfrac%7B1%7D%7B1%20-%20%5Cleft%28R_%7Bi%7D%5Cright%29%5E%7B2%7D%7D)

where 'i' refers to the i-th variable, which is represented as a linear combination of the rest of the independent variables, and the R-squared in the denominator comes from that regression. You'll see VIF in action during the Python demonstration on multiple linear regression.
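
The VIF formula can be applied directly by regressing each predictor on all the others. The following is a minimal NumPy sketch. The helper `vif` and the simulated columns are assumptions for illustration, and `x3` is deliberately built from a combination of `x1` and `x2`, so no single pairwise correlation tells the whole story.

```python
import numpy as np

def vif(X):
    """VIF of each column of X, obtained by regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    result = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(target)), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        rss = np.sum((target - A @ beta) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        r2_i = 1 - rss / tss                  # R_i squared from the formula
        result.append(1.0 / (1.0 - r2_i))
    return np.array(result)

rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.3 * rng.normal(size=n)   # depends on a combination of x1 and x2
x4 = rng.normal(size=n)                   # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3, x4]))
print(np.round(vifs, 2))
```

Here `x3` should show a VIF well above 10, while the unrelated `x4` stays close to 1.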

The common heuristic we follow for the VIF values is:

    > 10:  Definitely high VIF value and the variable should be eliminated.
    
    > 5:  Can be okay, but it is worth inspecting.
    
    < 5: Good VIF value. No need to eliminate this variable.

But once you have detected the multicollinearity present in the dataset, how exactly do you deal with it?


![53.png](attachment:e8cfb559-6699-4dea-a6e8-82367576ae98.png)

### What to Do if VIF is High?
1. Dropping variables
    * Drop the variable which is highly correlated with others
    * Keep the variable that is more interpretable from a business point of view
2. Creating a new variable using the interactions of the older variables
    * Add interaction features, i.e. features derived using some of the original features
3. Variable transformations
    * Principal Component Analysis (covered in a later module)
  
**Additional Reading**

[Partial Least Squares (PLS)](https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/supporting-topics/partial-least-squares-regression/what-is-partial-least-squares-regression/)

#### Q1: Which of the following is not affected by multicollinearity i.e., if you add more variables that turn out to be dependent on already included variables?
    
    p-values
    
    Coefficients
    
    R-squared value

Answer: R-squared value

The predictive power given by the R-squared value is not affected because, even though you might have redundant variables in your model, they play no role in changing the R-squared. Recall the thought experiment that Rahim conducted in one of the lectures. Suppose you have two variables, X1 and X2, which are exactly the same. Then using either 10X1 or (4X1 + 6X2) will give you the same result. In the second case, even though you have added one more variable, the predictive power remains the same.
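
The thought experiment above is easy to reproduce. In this sketch, `x2` is an exact copy of `x1`, and the R-squared is compared with and without the duplicate. The data are simulated, and the helper `r_squared` is an assumption for the example.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    # lstsq still returns a valid least squares solution when columns are duplicated
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1.copy()                       # exactly the same variable
y = 10 * x1 + rng.normal(size=n)

r2_one = r_squared(x1.reshape(-1, 1), y)
r2_both = r_squared(np.column_stack([x1, x2]), y)

print(f"R-squared with X1 only  : {r2_one:.6f}")
print(f"R-squared with X1 and X2: {r2_both:.6f}")
```

The two values agree up to floating-point error. The individual coefficients of `x1` and `x2`, however, are no longer uniquely determined, which is exactly why coefficients and p-values suffer while R-squared does not.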

#### Q2: VIF is a measure of:

    How well a predictor variable is correlated with all the other variables, including the target variable
    
    How well a predictor variable is correlated with all the other variables, excluding the target variable
    
    How well a target variable is correlated with all the other predictor variables

Answer: How well a predictor variable is correlated with all the other variables, excluding the target variable

VIF measures how well a predictor variable can be predicted using all other predictor variables

#### Q3: When calculating the VIF for one variable using a group of variables, the R-squared came out to be 0.75. What will the approximate VIF for this variable be?
    
    1
    
    2
    
    4
    
    5

#### Q4: Is the VIF obtained in the previous case a good VIF value?

    Yes
    
    No
    
    It is okay, but still worth inspecting

Answer: Yes

The common heuristic for VIF values is that if it is greater than 10, it is definitely high. If the value is greater than 5, it is okay but worth inspecting. And anything less than 5 is definitely okay.
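
As a quick check of the arithmetic behind the previous two questions, assuming the value 0.75 in Q3 is the R-squared obtained when regressing the variable on the other predictors:

$$
VIF = \frac{1}{1 - R^{2}} = \frac{1}{1 - 0.75} = 4
$$

A VIF of 4 is below 5, which is why it counts as a good value under the heuristic.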

## Dealing with Categorical Variables

So far, you have worked with numerical variables. But many times, you will have **non-numeric variables** in the data sets. These variables are also **known as categorical variables**. Obviously, these variables **can't be used directly in the model** since they are non-numeric.

When you have a **categorical variable with say 'n' levels**, the idea of dummy variable creation is to **build 'n-1' variables**, indicating the levels. For a variable say, 'Relationship' with three levels namely, 'Single', 'In a relationship', and 'Married', you would create a dummy table like the following:

| Relationship Status | Single | In a relationship | Married |
|---------------------|--------|-------------------|---------|
| Single              | 1      | 0                 | 0       |
| In a relationship   | 0      | 1                 | 0       |
| Married             | 0      | 0                 | 1       |

But you can clearly see that there is **no need to define three different levels**. If you drop a level, say 'Single' (remember: **build 'n-1' variables**), you would still be able to explain the three levels.


Let's drop the dummy variable 'Single' from the columns and see what the table looks like:

| Relationship Status | In a relationship | Married |
|---------------------|-------------------|---------|
| Single              | 0                 | 0       |
| In a relationship   | 1                 | 0       |
| Married             | 0                 | 1       |


If both the dummy variables namely 'In a relationship' and 'Married' are equal to zero, that means that the person is single. If 'In a relationship' is one and 'Married' is zero, that means that the person is in a relationship and finally, if 'In a relationship' is zero and 'Married' is 1, that means that the person is married.
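
The same encoding can be produced with pandas. This is a small sketch rather than the code used in the course: the four sample rows are made up, and listing 'Single' first in `levels` is a choice made here so that `drop_first=True` drops that level, matching the table above.

```python
import pandas as pd

# Listing 'Single' first makes it the baseline level that drop_first removes
levels = ["Single", "In a relationship", "Married"]
relationship = pd.Series(
    pd.Categorical(
        ["Single", "In a relationship", "Married", "Single"],
        categories=levels,
    ),
    name="Relationship",
)

# n = 3 levels -> n - 1 = 2 dummy columns
dummies = pd.get_dummies(relationship, drop_first=True, dtype=int)
print(dummies)
```

Rows where both columns are 0 correspond to 'Single', the level that was dropped.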

![54.png](attachment:01403c04-17f3-4c87-a8d3-7ee682cc35e7.png)

#### Q5: The creation of dummy variables to convert a categorical variable into a numeric variable is an important step in data preparation. Consider a case where a categorical variable is a factor with 22 levels. How many dummy variables will be required to represent this categorical variable while developing the linear regression model?

    20
    
    21
    
    22
    
    23

Answer: 21

A categorical variable with n levels needs n-1 dummy variables, so 22 levels need 21.


## Scaling the Variables

Before you move on to the next segment, there’s one concept that needs to be addressed, the concept of scaling the variables. 

### Why do we need to scale the variables?
Different variables can have very different ranges: one may run from 1 to 100 and another from 1 to 1000. Bringing them to a comparable scale helps with:

1. Ease of interpretation
2. Faster convergence of gradient descent methods

### Which parameters change with scaling?

It is important to note that **scaling just affects the coefficients** and none of the other parameters like t-statistic, F-statistic, p-values, R-squared, etc.
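
One part of this claim, that the coefficients change while R-squared does not, can be checked numerically. The sketch below fits the same simulated data before and after rescaling each predictor by subtracting its mean and dividing by its standard deviation. The data, the ranges and the helper `fit_ols` are assumptions for the example, and the t-statistics and p-values are not computed here.

```python
import numpy as np

def fit_ols(X, y):
    """Return (coefficients, R-squared) of a least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return beta, 1 - rss / tss

rng = np.random.default_rng(5)
n = 200
x1 = rng.uniform(1, 100, n)      # ranges from 1 to 100
x2 = rng.uniform(1, 1000, n)     # ranges from 1 to 1000
y = 5 + 0.5 * x1 + 0.02 * x2 + rng.normal(0, 3, n)

X = np.column_stack([x1, x2])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

beta_raw, r2_raw = fit_ols(X, y)
beta_scaled, r2_scaled = fit_ols(X_scaled, y)

print("raw coefficients   :", np.round(beta_raw, 3), " R2 =", round(r2_raw, 4))
print("scaled coefficients:", np.round(beta_scaled, 3), " R2 =", round(r2_scaled, 4))
```

After scaling, the coefficients become comparable in magnitude across variables, while the quality of the fit is unchanged.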


### How do we scale?
There are two major methods to scale the variables:

1. **Standardization**: Rescales each variable to have mean zero and standard deviation one. It does not change the shape of the distribution, so the result is standard normal only if the variable was normally distributed to begin with.
2. **MinMax scaling**: Brings all of the data into the range 0 to 1.

The formulas in the background used for each of these methods are as given below:

* Standardisation: ![x](https://latex.upgrad.com/render?formula=x%20%3D%20%5Cfrac%7Bx%20-%20m%20e%20a%20n%20%5Cleft%28%5Cright.%20x%20%5Cleft.%5Cright%29%7D%7Bs%20d%20%5Cleft%28%5Cright.%20x%20%5Cleft.%5Cright%29%7D)
* MinMax Scaling: ![x](https://latex.upgrad.com/render?formula=x%20%3D%20%5Cfrac%7Bx%20-%20m%20i%20n%20%5Cleft%28%5Cright.%20x%20%5Cleft.%5Cright%29%7D%7Bm%20a%20x%20%5Cleft%28%5Cright.%20x%20%5Cleft.%5Cright%29%20-%20m%20i%20n%20%5Cleft%28%5Cright.%20x%20%5Cleft.%5Cright%29%7D)
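
Both formulas translate directly into NumPy. The function names below are assumptions for this sketch. Note that `np.std` uses the population standard deviation by default, which is one of two common conventions.

```python
import numpy as np

def standardise(x):
    """Subtract the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max_scale(x):
    """Map the smallest value to 0 and the largest value to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

z = standardise(x)
m = min_max_scale(x)

print("standardised:", np.round(z, 3), "mean =", round(z.mean(), 3), "sd =", round(z.std(), 3))
print("min-max     :", m)
```

For this input, min-max scaling gives evenly spaced values from 0 to 1, and the standardised values have mean 0 and standard deviation 1.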

### Do we need to scale dummy variables?
A dummy variable encodes a level of a categorical variable as 0 or 1. Since dummy variables already lie in the range 0 to 1, they generally do not need scaling, although they can be scaled in some cases.

**Additional Reading**

* To know more about dummy variables ([here](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-dummy-coding/))
* Why it's necessary to create dummy variables ([here](https://stats.stackexchange.com/questions/89533/convert-a-categorical-variable-to-a-numerical-variable-prior-to-regression))
* When to Normalise data and when to standardise? ([here](https://stackoverflow.com/questions/32108179/linear-regression-normalization-vs-standardization))
* Various scaling techniques ([here](https://en.wikipedia.org/wiki/Feature_scaling))

## Model Assessment and Comparison

Once the model is built, you would want to assess it in terms of its predictive powers. For multiple linear regression, you may build more than one model, with different combinations of the independent variables. In such a case, you would also need to compare these models with one another to check which one yields optimal results. 

![55.png](attachment:90bdcc5f-de10-4864-9e50-d9c4285ed7de.png)


Now, for the assessment, you have a lot of new considerations to make. Besides, selecting the best model to obtain decent predictions becomes quite subjective. You need to maintain a balance between keeping the model simple (fewer features) and explaining the highest variance, i.e. a high R-squared (which means that you would want to keep as many variables as possible). This can be done using the key idea that a model can be penalised for keeping a large number of predictor variables.

 
Hence, two new parameters come into the picture, the adjusted R-squared and the Akaike Information Criterion (AIC):

$$
\text{Adjusted } R^{2} = 1 - \frac{(1 - R^{2})(n - 1)}{n - p - 1}
$$

$$
AIC = n \times \log\left(\frac{RSS}{n}\right) + 2p
$$

Here, **n** is the sample size, i.e. the number of rows in the dataset, **p** is the number of predictor variables, and **RSS** is the residual sum of squares.
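
The two penalised measures defined above (adjusted R-squared and AIC) are simple enough to write as small helper functions. The function names and the numbers plugged in are assumptions chosen for illustration. The natural logarithm is used for AIC.

```python
import math

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n rows and p predictor variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def aic(rss, n, p):
    """AIC from the residual sum of squares, n rows and p predictor variables."""
    return n * math.log(rss / n) + 2 * p

# Illustrative values: R-squared of 0.9 with 51 rows and 5 predictors
adj = adjusted_r2(0.9, n=51, p=5)

# Illustrative values: RSS of 200 with 100 rows and 3 predictors
score = aic(200.0, n=100, p=3)

print(f"adjusted R-squared = {adj:.4f}")
print(f"AIC                = {score:.4f}")
```

Adding a predictor raises p, so the adjusted R-squared improves only if R-squared rises enough to offset the penalty, and AIC improves only if RSS falls enough to offset the extra 2.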

### Adjusted R-squared vs AIC


| Metric               | Adjusted R-squared                          | AIC (Akaike Information Criterion)                |
|----------------------|----------------------------------------------|--------------------------------------------------|
| **Measures**         | Model fit with penalty for complexity        | Model likelihood with penalty for complexity     |
| **Higher/Lower Better?** | Higher is better                          | Lower is better                                  |
| **Penalizes Complexity** | Yes (based on number of predictors)       | Yes (based on number of parameters)              |
| **Scale**            | At most 1; can be negative for very poor fits | Can be any real number (lower = better)          |
| **Model Type**       | Linear regression                            | Any likelihood-based model (linear, logistic, etc.) |
| **Use case**         | Explaining variance                          | Comparing models and information loss            |


### ✅ When is Adjusted R-squared better?
* You want to understand how much variance is explained by your predictors.
* You’re working only with linear regression models.
* You're focused on interpretability: Adjusted R² is more intuitive (it typically lies between 0 and 1).

### ✅ When is AIC better?
* You’re comparing different models, not just with different variables, but possibly:
    * Different transformations
    * Different likelihood functions (e.g., logistic regression, Poisson, etc.)
* You care more about predictive accuracy and model parsimony.
* You want a general-purpose metric for models beyond linear regression.

### 🔁 Example Scenario
Say you're building multiple linear regression models with different variable combinations:

* Adjusted R² tells you how well the model explains the variation in your data (fit).
* AIC tells you how good the model is in terms of the trade-off between information loss and overfitting: even if it fits slightly worse, a simpler model may win.


| Goal                                               | Prefer This Metric |
| -------------------------------------------------- | ------------------ |
| Explain how much variance is captured              | Adjusted R-squared |
| Compare models (even with different types)         | AIC (or BIC)       |
| Work only with linear regression models            | Adjusted R-squared |
| Work with non-linear models or logistic regression | AIC                |
| Want to penalize complexity more heavily           | AIC/BIC            |




**Additional Reading**

The following links provide a detailed study of AIC and other parameters used in automatic feature selection:

* [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion)
* [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion)
* [Mallows' CP](https://en.wikipedia.org/wiki/Mallows%27s_Cp)


#### Q6: When a model was built from a dataset with 101 samples and 10 predictor variables, the R-squared value was found to be 0.7. What will the value of the adjusted R-squared be for the same model?

    0.46
    
    0.50
    
    0.67
    
    0.73

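Answer: 0.67

Applying the adjusted R-squared formula with n = 101, p = 10 and R² = 0.7: 1 - (1 - 0.7) × (101 - 1) / (101 - 10 - 1) = 1 - 0.3 × 100 / 90 ≈ 0.67.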


#### Q7: Why do you think it is better to use adjusted R-squared in the case of multiple linear regression?
Adjusted R-squared is better for multiple linear regression because it accounts for model complexity and prevents overfitting by penalizing unnecessary variables as compared to R-squared.

**The Problem with Regular R-squared**: It rewards model complexity — which can lead to overfitting.

R-squared **always increases or stays the same** when you **add more variables** to the model — **even if** those **variables** are irrelevant or **don’t improve prediction**.

So, a higher R-squared doesn’t always mean a better model.

**What Adjusted R-squared Does**
Adjusted R-squared **adjusts the R-squared** value **based on the number of predictors** in the model.

It penalizes the model for adding features that do not improve the model significantly.

**Interpretation**

| Situation                      | R²     | Adjusted R² | Meaning               |
|-------------------------------|--------|--------------|------------------------|
| Add useful variable            | ↑      | ↑            | Model improved         |
| Add useless variable           | ↑ or ↔ | ↓ or ↔       | Model not improved     |
| Model overfits with many vars | ↑      | ↔ or ↓       | Adjusted R² warns you  |
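
The second row of the table can be tried out on simulated data. In this sketch a predictor that is unrelated to the response by construction is added to a model that already contains a useful one. The helper name and the data are assumptions for the example. R-squared cannot go down, whereas the adjusted value will often, though not always, drop.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R-squared, adjusted R-squared) of a least squares fit with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(7)
n = 60
useful = rng.normal(size=(n, 1))
useless = rng.normal(size=(n, 1))      # unrelated to y by construction
y = 3 * useful[:, 0] + rng.normal(size=n)

r2_before, adj_before = r2_and_adjusted(useful, y)
r2_after, adj_after = r2_and_adjusted(np.hstack([useful, useless]), y)

print(f"before: R2 = {r2_before:.4f}, adjusted R2 = {adj_before:.4f}")
print(f"after : R2 = {r2_after:.4f}, adjusted R2 = {adj_after:.4f}")
```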



## Feature Selection

When building a multiple linear regression model, you might have quite a few potential predictor variables; selecting just the right ones becomes an extremely important exercise.

### How can you select the optimal features for building a good model?

To get the optimal model, you can always try all the possible combinations of independent variables (2^p models for p features) and see which model fits the best. But this method is obviously time-consuming and infeasible. Hence, you need some other method to get a decent model.

This is where manual feature elimination comes in, where you:

1. Build the model with all the features
2. Drop the features that are least helpful in prediction (high p-value)
3. Drop the features that are redundant (high VIF, checked along with correlations)
4. Rebuild the model and repeat

Note that the second and third steps go hand in hand, and the choice of which features to eliminate first is very subjective. You'll see this during the hands-on demonstration of multiple linear regression in Python in the next session.
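
The build, drop, rebuild loop can be sketched in code. The example below covers only the p-value step (the VIF step is left out for brevity) and makes several assumptions that are not from the session: the data are simulated, the threshold `alpha=0.05` is a common convention, and p-values are computed with a normal approximation to the t-test so that only NumPy and the standard library are needed.

```python
import math
import numpy as np

def approx_p_values(X, y):
    """Approximate two-sided p-value for each predictor (normal approximation)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])          # estimated error variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t_stats = beta / std_err
    p = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return p[1:]                                       # skip the intercept

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the feature with the highest p-value until all are below alpha."""
    names = list(names)
    while names:
        p = approx_p_values(X, y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        X = np.delete(X, worst, axis=1)                # drop it, then rebuild the model
        names.pop(worst)
    return names

rng = np.random.default_rng(8)
n = 300
X = rng.normal(size=(n, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)    # only x0 and x1 matter

kept = backward_eliminate(X, y, ["x0", "x1", "x2", "x3", "x4"])
print("features kept:", kept)
```

The two informative features survive. A noise feature can occasionally survive too, simply by chance, which is one reason the manual process calls for judgement.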


#### Q8: After performing inferences on a linear model built with several variables, you concluded that the variable ‘r’ was insignificant. This meant that the variable ‘r’:
    
    Had a high p-value
    
    Had a low p-value
    
    Had a high VIF
    
    Had a low VIF

Answer: Had a high p-value

A high p-value means that the variable is not significant, and hence, doesn't help much in prediction.

'Had a high VIF' is not the answer: VIF measures the relationship of one variable with all the other variables. It doesn't determine whether a variable has predictive power or not.


**Manual feature elimination** might work when you have a relatively low number of potential predictor variables, say, ten or even twenty. But it is **not a practical approach once you have a large number of features, say 100**. In such a case, you automate the feature selection (or elimination) process. Let's see how.

![56.png](attachment:02e99c04-f58b-4c83-a66a-5ac58284d777.png)

You need to combine the manual and the automated approaches in order to get an optimal model relevant to the business. Hence, you first do an automated elimination (coarse tuning), and when you have a small set of potential variables left to work with, you can use your expertise and subjectivity to eliminate a few other features (fine tuning).

### 🔍 What is RFE?
**RFE (Recursive Feature Elimination)** is a feature selection technique used in machine learning to select the most important features from a dataset.

🧠 Core Idea
**RFE:**

* **Fits a model** (like linear regression, logistic regression, decision tree, etc.).
* **Ranks features** based on importance (e.g., coefficient size or impurity reduction).
* **Removes the least important feature**(s).
* Repeats the process **recursively** until the desired number of features is left.

📉 **It eliminates** one (or more) features at each step based on the feature-importance ranking from the fitted model.

### Why use RFE?
* Reduces **overfitting** by removing irrelevant features.
* Can improve **model accuracy** and **interpretability**.
* Helps in **dimensionality reduction**.

### ⚙️ How RFE Works (Step-by-Step):
1. **Train** a model on all features.
2. Compute feature importance.
3. Eliminate the **least important** feature.
4. Repeat steps 1–3 until only the desired number of features remains.

```python
# Python example

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Create dummy data: 100 samples, 10 features
# (no random seed is set, so the result varies from run to run)
X, y = make_regression(n_samples=100, n_features=10, noise=0.1)

# Base model whose coefficients are used to rank the features
model = LinearRegression()

# Apply RFE to select the top 5 features
rfe = RFE(estimator=model, n_features_to_select=5)
rfe = rfe.fit(X, y)

# support_ is a boolean mask of the selected features;
# ranking_ is 1 for selected features, larger for those eliminated earlier
print("Selected Features:", rfe.support_)
print("Feature Ranking:", rfe.ranking_)
```

Output from one run:

```
Selected Features: [ True  True False False  True False  True  True False False]
Feature Ranking: [1 1 5 3 1 2 1 1 6 4]
```


#### Q9: Suppose you have to build five multiple linear regression models for five different datasets. You're planning to use about 10 variables for each of these models. The number of potential variables in each of these datasets are 15, 30, 65, 10, and 100. In which of these cases would you definitely need to use RFE?
    
    1st and 4th cases
    
    1st, 2nd, and 4th cases
    
    3rd and 5th cases
    
    2nd, 3rd, and 5th cases

## 📘 Summary

Here’s a brief summary of what you learned in this session:

### 🔹 When One Variable Might Not Be Enough
- A lot of variance isn’t explained by just one feature.
- This can lead to inaccurate predictions.

### 🔹 Formulation of Multiple Linear Regression (MLR)
- MLR helps us understand how much the dependent variable changes when we change the independent variables.

### 🔹 New Considerations in Moving from SLR to MLR
- **Overfitting**: When the model becomes too complex, performs well on training data, but poorly on testing data.
- **Multicollinearity**: Detect if independent variables are correlated with each other to reduce redundancy.
- **Feature Selection**: Identify and retain only the most relevant features; drop redundant or irrelevant ones.

### 🔹 Dealing with Categorical Variables
- **Dummy Variables**: Used for encoding a categorical variable with 'n' levels as 'n-1' numeric 0/1 variables (e.g., the relationship status example).

### 🔹 Feature Scaling
- **Standardization**: Rescales data to mean zero and standard deviation one by subtracting the mean and dividing by the standard deviation.
- **Min-Max Scaling**: Scales values to a fixed range, typically 0–1.
- **Scaling Categorical Variables**: Dummy variables already take the values 0 and 1, so they generally do not need scaling.

### 🔹 Model Assessment and Comparison
- **Adjusted R-squared**: Increases only if a new term improves the model more than expected by chance.
- **AIC, BIC**: Criteria used for automatic feature selection by balancing model fit and complexity.

### 🔹 Feature Selection Techniques
- **Manual Feature Selection**: Time-consuming process of choosing the right variables manually.
- **Automated Feature Selection**:
  - Select top 'n' features
  - Use Forward, Backward, or Stepwise selection based on AIC
- **Regularization**: Adds penalty terms to the loss function to prevent overfitting.
- **Balancing Both**: A mix of manual and automated feature selection often gives the best results.

---

## 🚀 Coming Up

Now that you have gained an understanding of the theoretical considerations in building a multiple linear regression model, you will create one such model in Python in the next session.




#### Q10.  Suppose you built a model with some features. Now you go and add another variable to the model.  Which of the following statements would be true? *(More than one option may be correct)*

- [ ] The R-squared value may decrease or increase  
- [ ] The R-squared value will either increase or remain the same  
- [ ] The Adjusted R-squared value may increase or decrease  
- [ ] The Adjusted R-squared value will either increase or remain the same

Answers:

✔️ The R-squared value will either increase or remain the same

R² never decreases when you add more predictors: it either increases or stays the same (even if the predictor is useless).

✔️ The Adjusted R-squared value may increase or decrease

Adjusted R² adjusts for model complexity. If the new variable doesn't contribute meaningfully, it can decrease.


#### Q12. Overfitting is more probable when:

- [ ] The number of data points is large
- [ ] The number of data points is small
- [ ] The number of data points doesn't matter for overfitting

Answer: The number of data points is small

Overfitting is the condition wherein the model is so complex that it ends up memorising almost all the data points in the training set. Hence, this condition is more probable if the number of data points is small, since it becomes easier for the model to pass through almost every point.

#### Q13. VIF is a measure of:

- [ ] How well a predictor variable is correlated with all the other variables, including the target variable
- [ ] How well a predictor variable is correlated with all the other variables, excluding the target variable
- [ ] How well a target variable is correlated with all the other predictor variables

Answer:  How well a predictor variable is correlated with all the other variables, excluding the target variable

#### Q14. Suppose you were predicting the sales of a company using two variables 'Social Media Marketing' and 'TV Marketing'. You found out that the correlation between 'Social Media Marketing' and 'TV Marketing' is 0.9. What will be the approximate value of VIF for either of them?

- [ ]  0.81
- [ ]  1.23
- [ ]  5.26
- [ ]  10

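Answer: 5.26

With only two predictors, the R-squared from regressing one on the other equals the square of their correlation coefficient:

r (correlation coefficient) = 0.9

R-squared = r^2 = 0.9^2 = 0.81

VIF = 1/(1 - R^2) = 1/(1 - 0.81) ≈ 5.26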


#### Q15: Suppose you have 'n' categorical variables, each with 'm' levels. How many dummy variables would you need to represent all the levels of all the categorical variables?

- [ ]  m * n
- [ ]  (m+1) * n
- [ ]  m * (n-1)
- [ ]  (m-1) * n

Answer: (m-1) * n

Each categorical variable has 'm' levels, so representing one categorical variable requires (m-1) dummy variables. Hence, to represent 'n' categorical variables, you would need (m-1)*n dummy variables.

#### Q16: Which of the following is/are an example of an automated approach to feature selection for linear regression?

- [ ] Recursive Feature Elimination
- [ ] Stepwise Selection using AIC
- [ ] Regularisation

Answer: All three

#### Q17: After performing inferences on a linear model built with several variables, you concluded that the variable 'r' was almost being described by other feature variables. This meant that the variable 'r':

- [ ] Had a high p-value
- [ ] Had a low p-value
- [ ] Had a high VIF
- [ ] Had a low VIF

Answer: Had a high VIF

If the variable is being described well by the rest of the feature variables, it means it has a high VIF meaning it is redundant in the presence of the other variables.

# Graded questions

**Comprehension:**

You are given a multiple linear regression model: ![Y](https://latex.upgrad.com/render?formula=Y%20%3D%20%5Cbeta%20_%7B0%7D%2B%20%5Cbeta%20_%7B1%7Dx_%7B1%7D%20%2B%20%5Cbeta%20_%7B2%7Dx_%7B2%7D%20%2B%20%5Cbeta%20_%7B3%7Dx_%7B3%7D)

Recall that the null hypothesis states that the variable is insignificant. Thus, if you fail to reject the null hypothesis, you can say that the predictor is insignificant.

For example, if you fail to reject the null hypothesis for ![x](https://latex.upgrad.com/render?formula=x_%7B1%7D), you can say that ![x](https://latex.upgrad.com/render?formula=x_%7B1%7D) is insignificant. In other words, the data do not provide enough evidence that the coefficient for ![x](https://latex.upgrad.com/render?formula=x_%7B1%7D), i.e., ![x](https://latex.upgrad.com/render?formula=%5Cbeta%20_%7B1%7D), differs from 0.

In other words, the null hypothesis tests if the predictor's coefficient, i.e ![x](https://latex.upgrad.com/render?formula=%5Cbeta%20_%7Bi%7D)  = 0.
If the null hypothesis is rejected then ![x](https://latex.upgrad.com/render?formula=%5Cbeta_%7Bi%7D%20%5C%5Cneq%200) 


#### Q18: If  ![](https://latex.upgrad.com/render?formula=%5Cbeta_%7B1%7D%20%3D%20%5Cbeta_%7B2%7D%20%3D%200)  holds and β3 = 0 fails to hold, then what can you conclude?

- [ ] There is high correlation between x1 and x2
- [ ] There is a linear relationship between the outcome variable(Y) and x3
- [ ] There is a linear relationship between the outcome variable and x1, x2

**Explanation**

**💡 Background: Multiple Linear Regression**

In a multiple linear regression model, the goal is to explain the relationship between a **dependent variable $Y$** and multiple **independent variables $x_1, x_2, x_3, \ldots$** using a linear equation:

$$
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon
$$

Each $\beta_i$ represents the **coefficient** (or weight) of the variable $x_i$. Hypothesis testing on these coefficients helps us determine whether each $x_i$ contributes meaningfully to predicting $Y$.

---

**🧪 What does hypothesis testing do here?**

For each coefficient $\beta_i$, we test the **null hypothesis** against the alternative:

$$
H_0: \beta_i = 0 \quad \text{(Variable is NOT significant)}
$$

$$
H_a: \beta_i \ne 0 \quad \text{(Variable IS significant)}
$$

---

**✅ Given in the question:**

- $\beta_1 = 0$ → **True** → variable $x_1$ is **not significant**
- $\beta_2 = 0$ → **True** → variable $x_2$ is **not significant**
- $\beta_3 = 0$ → **False** → variable $x_3$ **is significant**

So from statistical testing:

- The **p-values for $\beta_1$ and $\beta_2$** are high → we **fail to reject** the null hypothesis → they **do not explain $Y$ significantly**.
- The **p-value for $\beta_3$** is low → we **reject the null** → $x_3$ **is useful** in explaining $Y$.

---

**🎯 Conclusion:**

Because $x_3$ is the only variable with a significant coefficient:

👉 **There is a linear relationship between the outcome variable $Y$ and $x_3$.**

The other variables, $x_1$ and $x_2$, do **not** show evidence of a relationship with $Y$ (at least not a statistically significant one in this model).

---

**⚠️ Why other options are wrong:**

**❌ "There is a linear relationship between Y and x1, x2"**
- No, because their coefficients were **not significant**. We have no evidence to support a relationship.

**❌ "There is high correlation between x1 and x2"**
- This is about **multicollinearity**, which is **not** inferred directly from the hypothesis test results of individual $\beta_i$.
- To check multicollinearity, we use a **correlation matrix** or **VIF** (Variance Inflation Factor), not p-values or hypothesis tests.

---

**✅ Summary**

| Coefficient | Significance     | Interpretation                         |
|-------------|------------------|----------------------------------------|
| $\beta_1$  | Not significant | $x_1$ does **not** help predict $Y$ |
| $\beta_2$  | Not significant | $x_2$ does **not** help predict $Y$ |
| $\beta_3$  | Significant     | $x_3$ **does** help predict $Y$     |

➡️ So the only **valid conclusion**:  
✔️ There is a linear relationship between **$Y$** and **$x_3$**.
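
The situation in this question can be reproduced on synthetic data. The sketch below assumes a made-up data set in which only $x_3$ drives $Y$, fits ordinary least squares with NumPy, and computes the t-statistic of each coefficient (coefficient divided by its standard error). It stops at t-statistics rather than p-values; a large absolute t-statistic corresponds to a low p-value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                              # columns play x1, x2, x3
y = 1.0 + 3.0 * X[:, 2] + rng.normal(scale=1.0, size=n)  # only x3 drives y

design = np.column_stack([np.ones(n), X])                # add intercept column
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ beta
dof = n - design.shape[1]                                # n - (p + 1)
sigma2 = residuals @ residuals / dof                     # residual variance estimate
cov_beta = sigma2 * np.linalg.inv(design.T @ design)     # covariance of the estimates
std_err = np.sqrt(np.diag(cov_beta))
t_stats = beta / std_err

for name, b, t in zip(["const", "x1", "x2", "x3"], beta, t_stats):
    print(f"{name}: coef={b:.3f}, t={t:.2f}")
```

With this setup the t-statistic for `x3` should be far larger in magnitude than those for `x1` and `x2`, which is the numerical counterpart of rejecting the null hypothesis only for $\beta_3$.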

---

#### Q19: If β1 = β2 = β3 = 0 holds true, then what can you conclude?

- [ ] There is no linear relationship between y and any of the 3 independent variables
- [ ] There is a linear relationship between y and all of the 3 independent variables
- [ ] There is a linear relationship between x1, x2 and x3

Answer: There is no linear relationship between y and any of the 3 independent variables

#### Q20: An analyst observes a positive relationship between digital marketing expenses and online sales for a firm. However, she intuitively feels that she should add an additional predictor variable, one which has a high correlation with marketing expenses.

If the analyst adds this independent variable to the model, which of the following could happen? More than one choice could be correct.

- [ ] The model’s R-squared will decrease
- [ ] The model’s adjusted R-squared could decrease
- [ ] The beta coefficient for the predictor, digital marketing expenses, will remain the same
- [ ] The relationship between marketing expenses and sales can become insignificant

Answer: The model’s adjusted R-squared could decrease; the relationship between marketing expenses and sales can become insignificant


### 🧠 Explanation:

| Option                              | Outcome | Reason                                                                                                                                      |
|-------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------|
| **Adjusted R² could decrease**      | ✅      | Adjusted R² penalizes unnecessary variables. If the new variable doesn't add real predictive power, adjusted R² might drop.                |
| **Relationship can become insignificant** | ✅      | Due to multicollinearity, standard errors of coefficients increase, making previously significant predictors appear insignificant.         |
| **R² will decrease**                | ❌      | R² never goes down with added variables.                                                                                                   |
| **Beta for marketing remains same** | ❌      | Multicollinearity can change the coefficients significantly.                                                                               |
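
These effects can be seen in a small simulation. The sketch below uses synthetic data and a hypothetical helper `ols`; the variable names `marketing`, `correlated`, and `sales` are stand-ins for the quantities in the question, not real data.

```python
import numpy as np


def ols(X, y):
    """Return coefficients, standard errors, and R-squared (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = len(y) - design.shape[1]
    sigma2 = residuals @ residuals / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return beta, std_err, r2


rng = np.random.default_rng(1)
n = 100
marketing = rng.normal(size=n)
# Hypothetical second predictor, highly correlated with marketing.
correlated = marketing + rng.normal(scale=0.1, size=n)
sales = 2.0 * marketing + rng.normal(scale=1.0, size=n)

beta_1, se_1, r2_1 = ols(marketing.reshape(-1, 1), sales)
beta_2, se_2, r2_2 = ols(np.column_stack([marketing, correlated]), sales)

print(f"one predictor : beta={beta_1[1]:.2f}, se={se_1[1]:.2f}, R2={r2_1:.3f}")
print(f"two predictors: beta={beta_2[1]:.2f}, se={se_2[1]:.2f}, R2={r2_2:.3f}")
```

R² does not drop when the second predictor is added, but the standard error of the marketing coefficient grows considerably, which is what can push a previously significant predictor towards insignificance.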

#### Q21: Suppose you need to build a model on a data set which contains 2 categorical variables with 2 and 4 levels respectively. How many dummy variables should you create for model building?

- [ ] 4
- [ ] 5
- [ ] 6
- [ ] 8

Answer: 4

To determine how many dummy variables to create for model building, you apply the rule:

For a categorical variable with k levels, you need k - 1 dummy variables.

Given:
* First categorical variable: 2 levels → requires 1 dummy variable
* Second categorical variable: 4 levels → requires 3 dummy variables

Total: 1 + 3 = 4 dummy variables
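
The same count can be checked with pandas. This is a minimal sketch on a made-up data frame; the column names `gender` and `region` and their values are assumptions chosen only to give 2 and 4 levels.

```python
import pandas as pd

# Hypothetical data: `gender` has 2 levels, `region` has 4 levels.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "region": ["north", "south", "east", "west"],
})

# drop_first=True keeps k - 1 dummies per variable; the first level is the baseline.
dummies = pd.get_dummies(df, drop_first=True)
print(list(dummies.columns))
print(dummies.shape[1])  # 4
```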
  
#### Q22: If one of the feature variables, say, A, is being explained well by some of the other feature variables, this would mean that the variable A has:

- [ ] A high p-value
- [ ] A low p-value
- [ ] A high VIF
- [ ] A low VIF

Answer: A high VIF

📘 Explanation:
* VIF (Variance Inflation Factor) measures how much the variance of a regression coefficient is inflated due to multicollinearity.
* If feature A can be predicted from other features, its VIF will be high, indicating redundancy and instability in the regression model.

#### Q23: Given different Rsq values of linear regression models on the same training dataset, which model would you choose as the best predictor? Assume that all these models are multiple linear regression models 

[Hint: Do you think only Rsq values are sufficient here?]

- [ ] 0.86
- [ ] 0.76
- [ ] 0.94
- [ ] Rsq values alone are insufficient to answer this question.

Answer: Rsq values alone are insufficient to answer this question.


While R-squared tells you how much of the variation in the dependent variable is explained by the model, it does not account for:

1. Overfitting – A higher R² might just mean you've added more variables, not necessarily better prediction.
2. Number of predictors – More predictors artificially inflate R².
3. Predictive Power on New Data – R² is calculated on the training set and doesn't guarantee generalization to test data.

**What should you look at instead?**

* Adjusted R² – Adjusts for the number of predictors.
* Test set performance (e.g., RMSE, MAE) – Measures generalization.
* Cross-validation scores – More robust estimate of real-world performance.
* Residual plots – Helps detect patterns or violations of assumptions.
* Multicollinearity checks (e.g., VIF) – High R² may hide redundant features.
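
To make the first point concrete, here is a small sketch of the adjusted R² formula. The sample size and predictor counts below are hypothetical, since the question does not give them; they are chosen only to show that a model with a lower R² can have a higher adjusted R².

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical models fitted on the same 50 training rows.
print(round(adjusted_r2(0.94, n_samples=50, n_predictors=40), 3))  # 0.673
print(round(adjusted_r2(0.86, n_samples=50, n_predictors=3), 3))   # 0.851
```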

#### Q24: State true or false:

Each time you add a feature to a model, the R-squared increases or remains the same, even if it is by chance. It never decreases.

- [ ] True
- [ ] False

Answer: True
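
The statement can be explored with a quick experiment: fit ordinary least squares on synthetic data, then keep appending features that are pure noise and record the training R² each time. The helper `r_squared` and all data here are made up for illustration.

```python
import numpy as np


def r_squared(X, y):
    """Training R-squared of an OLS fit of y on X (an intercept is added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(7)
n = 60
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)

X = x
scores = [r_squared(X, y)]
for _ in range(5):
    noise = rng.normal(size=(n, 1))  # feature unrelated to y
    X = np.hstack([X, noise])
    scores.append(r_squared(X, y))

print([round(s, 4) for s in scores])
```

The recorded values form a non-decreasing sequence even though the added features carry no information about `y`, because least squares can always set a new coefficient to zero and do no worse on the training data.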