## Linear Regression

- It's a simple approach to supervised learning. It assumes that the dependence of $Y$ on $X_1,X_2,X_3,...,X_p$ is linear.
- True regression functions are never linear. Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

<br>

## Linear Regression for the advertising data

Consider the advertising data, which records sales alongside the budgets for three kinds of advertising.
<br><br>
<div style="text-align:center">
    <img src="advertising_data.png" alt="Description of image">
</div>
<br><br>

There are some questions you may want to ask: 
- Is there a relationship between advertising budget and sales?
- How strong is the relationship between advertising budget and sales?
- Which media contributes to sales?
- How accurately can we predict sales?
- Is the relationship linear?
- Is there synergy among the advertising media?

## Simple linear regression using a single predictor *X*

- We assume a model
$$Y = \beta_{0} + \beta_{1}X + \epsilon,$$
 where $\beta_{0}$ and $\beta_{1}$ are two unknown constants that represent the *intercept* and *slope*, also known as *coefficients* or *parameters*, and $\epsilon$ is the error term.
- Given some estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ for the model coefficients, we predict future sales using
$$\hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}}x,$$
where $\hat{y}$ indicates a prediction of *Y* on the basis of *X = x*. The hat symbol denotes an estimated value.

## Estimation of parameters by least squares

- Let $\hat{y_i} = \hat{\beta_{0}} + \hat{\beta_{1}}x_i$ be the prediction of *Y* based on the *i*th value of *X*. Then $e_i = y_i - \hat{y_i}$ represents the *i*th *residual*.
- We define the *Residual Sum of Squares* (RSS) as
$$RSS = e_{1}^2 + e_{2}^2 + \cdots + e_{n}^2$$
- The least squares approach chooses $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ to minimise *RSS*. The minimising values can be shown to be
$$\hat{\beta_1} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2},$$

$$\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x},$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ are the sample means.
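
The closed-form estimates above translate almost directly into code. The following is a minimal sketch in plain Python; the function name `least_squares` and the toy data are illustrative assumptions, not the advertising data.

```python
from statistics import mean

def least_squares(x, y):
    """Return (beta0_hat, beta1_hat) for the simple regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical toy data (assumption), chosen so the numbers are easy to check.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

beta0_hat, beta1_hat = least_squares(x, y)

# Prediction for a new value x = 6 using y_hat = beta0_hat + beta1_hat * x.
y_hat_new = beta0_hat + beta1_hat * 6.0
```

For this toy data the slope works out to roughly 1.97 and the intercept to roughly 0.09.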

## Assessing the accuracy of the Coefficient Estimates

- The standard error of an estimator reflects how it varies under repeated sampling. We have
$$\text{SE}(\hat{\beta_0})^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right],$$
$$\text{SE}(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2},$$
where $\sigma^2 = Var(\epsilon)$.

- Having obtained the estimates, we now want to know how precise they are. Look at the second formula: the numerator is the noise variance and the denominator is the spread of the $x_i$ around their mean. The more noise there is around the line, the less precise the estimate; the greater the spread of the $x_i$, the more tightly the slope is pinned down. In the image below, if all the points were concentrated in a small region of $x$, the fitted slope would vary a lot from sample to sample. Greater spread therefore improves precision.
<br><br>
<div style="text-align:center">
    <img src="img2.png" alt="Description of image">
</div>

- These standard errors can be used to compute **confidence intervals**. A *95%* confidence interval is defined as a range of values such that, with *95%* probability, the range will contain the true unknown value of the parameter. It has the approximate form
$$\hat{\beta_{1}} \pm 2SE(\hat{\beta_{1}})$$
- That is, there is approximately a 95% chance that the interval
$$[\hat{\beta_{1}} - 2SE(\hat{\beta_{1}}), \hat{\beta_{1}} + 2SE(\hat{\beta_{1}})]$$
will contain the true value of $\beta_{1}$ (in the sense that, over repeated samples like the present one, about 95% of the intervals constructed this way would contain it).
- For the advertising data (TV), the 95% confidence interval for $\beta_{1}$ is [0.042, 0.053], i.e. the true slope is very likely greater than zero, and TV advertising has a positive effect on sales.
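
The standard error formulas and the $\pm 2\,\text{SE}$ interval can be sketched as follows. In practice $\sigma^2$ is unknown, so this sketch assumes it is estimated from the residuals as $\text{RSS}/(n-2)$; the function name and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean

def coefficient_standard_errors(x, y):
    """Return (beta0, beta1, se_beta0, se_beta1) for simple linear regression."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)  # assumption: sigma^2 estimated from the residuals
    se_beta0 = sqrt(sigma2_hat * (1 / n + x_bar ** 2 / sxx))
    se_beta1 = sqrt(sigma2_hat / sxx)
    return beta0, beta1, se_beta0, se_beta1

# Hypothetical toy data (assumption), not the advertising data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

beta0, beta1, se_beta0, se_beta1 = coefficient_standard_errors(x, y)

# Approximate 95% confidence interval for the slope: estimate +/- 2 * SE.
ci_lower = beta1 - 2 * se_beta1
ci_upper = beta1 + 2 * se_beta1
```

Spreading the `x` values further apart increases `sxx` and so shrinks `se_beta1`, which is the precision argument made above.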

## Hypothesis Testing and Confidence Intervals

- Standard errors can also be used to perform **hypothesis testing** on the coefficients. The most common hypothesis test involves the **null hypothesis** of <br><br>
$H_0$ : There is no relationship between X and Y versus the alternative hypothesis <br><br>
$H_A$ : There is some relationship between X and Y <br><br>

- Mathematically this corresponds to<br><br>
$$H_0: \beta_1 = 0$$
versus
$$H_A: \beta_1 \neq 0$$
- To test the null hypothesis, we compute a **t-statistic**, given by
$$t = \frac{\hat{\beta_1} - 0}{SE(\hat{\beta_1})}$$
- This will have a *t*-distribution with *n*-2 degrees of freedom, assuming $\beta_1 = 0$.
- Using statistical software, it is easy to compute the probability of observing any value equal to |*t*| or larger in absolute value, assuming $\beta_1 = 0$. We call this probability the **p-value**.
<br><br>
<div style="text-align:center">
    <img src="img3.png" alt="Description of image">
</div>

- How to interpret this result?<br>
The second line measures the effect of TV advertising on sales. It says that the probability of observing a *t-statistic* as extreme as 17.67 under the *null hypothesis* (TV advertising has no effect on sales) is less than $10^{-4}$, i.e. possible but very unlikely.<br>
**Our conclusion therefore is that TV advertising has an effect on sales.**
- There is also a relationship between confidence intervals and hypothesis testing.<br>
1. If we reject the null hypothesis at the 5% level, then the 95% confidence interval constructed for that coefficient will not contain *zero*.
2. But if we can't reject the null hypothesis, then the confidence interval will contain *zero*.
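
The link between the test and the interval can be checked numerically. The sketch below uses the rule-of-thumb multiplier 2 for both the interval and the rejection threshold (an assumption standing in for the exact *t* quantile), and two hypothetical toy datasets: one with a clear trend and one without.

```python
from math import sqrt
from statistics import mean

def slope_t_test(x, y, multiplier=2.0):
    """Return (t_stat, ci, reject_null, ci_excludes_zero) for H0: beta1 = 0."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_beta1 = sqrt(rss / (n - 2) / sxx)  # sigma^2 estimated by RSS / (n - 2)
    t_stat = (beta1 - 0.0) / se_beta1
    ci = (beta1 - multiplier * se_beta1, beta1 + multiplier * se_beta1)
    reject_null = abs(t_stat) > multiplier
    ci_excludes_zero = ci[0] > 0 or ci[1] < 0
    return t_stat, ci, reject_null, ci_excludes_zero

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_trend = [2.0, 4.1, 5.9, 8.2, 9.8]  # hypothetical data with a clear slope
y_flat = [3.0, 1.0, 2.5, 1.5, 2.9]   # hypothetical data with no visible slope

t_trend, ci_trend, reject_trend, excl_trend = slope_t_test(x, y_trend)
t_flat, ci_flat, reject_flat, excl_flat = slope_t_test(x, y_flat)
```

Because $|t| > 2$ is algebraically the same condition as $\hat{\beta_1} \pm 2\,\text{SE}$ excluding zero, `reject_null` and `ci_excludes_zero` always agree. An exact p-value would additionally need the tail probability of the *t*-distribution, which is left out here to keep the sketch dependency-free.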

## Assessing the overall accuracy of the model

- We compute the **Residual Standard Error**
$$\text{RSE} = \sqrt{\frac{1}{n-2}\text{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y_i})^2},$$
where the **residual sum of squares** is $\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$.
- **R-Squared** or fraction of variance explained is
$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS},$$
where $\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2$ is the **total sum of squares.**
- It can be shown that in this simple linear regression setting $R^2 = r^2$, where *r* is the correlation between *X* and *Y*:
$$\text{r} = \text{Cor}(X,Y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$
<br><br>
<div style="text-align:center">
    <img src="img4.png" alt="Description of image">
</div>
<br><br>




- What does that $R^2$ value tell us? <br>
Using the TV advertising budget as a predictor, we reduced the unexplained variance in sales by almost 60%.
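
The quantities in this section, and the identity $R^2 = r^2$, can be verified on a small example. This is a sketch on hypothetical toy data; the function name `fit_summary` is an assumption.

```python
from math import sqrt
from statistics import mean

def fit_summary(x, y):
    """Return (rse, r_squared, r) for the simple linear regression of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rse = sqrt(rss / (n - 2))
    r_squared = 1 - rss / tss
    r = sxy / (sqrt(sxx) * sqrt(tss))  # sample correlation between x and y
    return rse, r_squared, r

# Hypothetical toy data (assumption), not the advertising data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

rse, r_squared, r = fit_summary(x, y)
```

Here `r_squared` and `r ** 2` agree up to floating-point rounding, as the identity predicts.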

## Multiple Linear Regression

- Here our model is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$

- We interpret $\beta_j$ as the average effect on *Y* of a one unit increase in $X_j$, holding all other predictors fixed. In the advertising example the model becomes <br>
**sales = $\beta_0$ + $\beta_1$ × TV + $\beta_2$ × radio + $\beta_3$ × newspaper + $\epsilon$**
<br><br>
<div style="text-align:center">
    <img src="img5.png" alt="Description of image">
</div>
<br><br>
Earlier, with a single predictor, the fit was a line; now it is a hyperplane, as shown in the image above for two predictors.

## Interpreting Regression Coefficients

- The ideal scenario is when predictors are uncorrelated - a balanced design: <br>
    1. Each coefficient can be estimated and tested separately. <br> 
    2. Interpretations such as *" a unit change in $X_j$ is associated with a $\beta_j$ change in Y, while all other variables stay fixed "*, are possible.
- Correlations amongst predictors cause problems: <br>
    1. The variance of all coefficients tends to increase, sometimes dramatically. <br>
    2. Interpretations become hazardous - when $X_j$ changes, everything else changes.
- **Claims of causality** should be avoided for observational data.

## Estimation and Prediction for Multiple Regression

- Given estimates $\hat{\beta_0},\hat{\beta_1},...,\hat{\beta_p},$ we can make predictions using the formula
$$\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + \cdots + \hat{\beta_p}x_p$$
- We estimate $\beta_0,\beta_1,...,\beta_p,$ as the values that minimize the sum of squared residuals
$$RSS = \sum_{i = 1}^n (y_i - \hat{y_i})^2$$
The values that minimize *RSS* are the multiple least squares regression coefficients. <br><br>
Following is the result of the advertising data:
<br><br>
<div style="text-align:center">
    <img src="img6.png" alt="Description of image">
</div>
<br><br>
On its own, newspaper advertising may appear to have a significant effect on sales, but not in the presence of TV and radio. There is a correlation between radio and newspaper, so it may be that radio has soaked up the effect of newspaper, which is then no longer needed in the model.
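
The multiple least squares fit, and the way a correlated predictor can lose its apparent effect, can be sketched with NumPy. The data below are synthetic (an assumption, not the advertising data): `newspaper` is constructed to be correlated with `radio` but to have no direct effect on `sales`.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Return [beta0_hat, beta1_hat, ..., betap_hat] minimizing the RSS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # column of ones = intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Synthetic data (assumption): newspaper tracks radio but has no direct effect.
rng = np.random.default_rng(0)
n = 500
tv = rng.normal(size=n)
radio = rng.normal(size=n)
newspaper = 0.6 * radio + 0.8 * rng.normal(size=n)
sales = 3.0 + 0.5 * tv + 1.0 * radio + rng.normal(scale=0.5, size=n)

# Newspaper alone: its slope picks up the effect of the correlated radio budget.
coef_alone = fit_multiple_regression(newspaper.reshape(-1, 1), sales)

# All three predictors together: the newspaper coefficient shrinks towards zero.
coef_joint = fit_multiple_regression(np.column_stack([tv, radio, newspaper]), sales)

# Predictions from the joint fit: y_hat = beta0 + beta1*x1 + ... + betap*xp.
sales_hat = coef_joint[0] + np.column_stack([tv, radio, newspaper]) @ coef_joint[1:]
```

In the single-predictor fit the newspaper slope is clearly positive, while in the joint fit it is close to zero, mirroring the interpretation given above.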

## Some important questions

1. Is at least one of the predictors $X_1,X_2,...,X_p$ useful in predicting the response?
2. Do all the predictors help to explain *Y*, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

## Is at least one predictor useful?
For this question we can use the **F-statistic**, which tests the null hypothesis that all of $\beta_1, \ldots, \beta_p$ are zero:
$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \sim F_{p,n-p-1}$$

<div style="text-align:center">
    <img src="img7.png" alt="Description of image" width="300">
</div>
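
The F-statistic can be computed directly from TSS and RSS. The following sketch uses NumPy and synthetic data (an assumption): one response that truly depends on the predictors and one that is pure noise.

```python
import numpy as np

def f_statistic(X, y):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) for a fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(((tss - rss) / p) / (rss / (n - p - 1)))

# Synthetic data (assumption): three predictors, 200 observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_signal = 3.0 + 0.5 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
y_noise = rng.normal(size=200)  # unrelated to X

f_signal = f_statistic(X, y_signal)
f_noise = f_statistic(X, y_noise)
```

When no predictor is related to the response, F is expected to be close to 1; a value far above 1, as for `y_signal`, is evidence that at least one predictor is useful.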

## Deciding on the important variables

- The most direct approach is called **all subsets** or **best subsets** regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
- However, we often can't examine all models, since there are $2^p$ of them; for example, when p = 40 there are over a trillion models ($2^{40} \approx 1.1 \times 10^{12}$)! <br>
Instead we need an automated approach that searches through a subset of them. Two such approaches follow:

### Forward Selection

- Begin with a **null model** - a model that contains an intercept but no predictors.
- Fit *p* simple linear regressions and add to the null model that variable that results in the lowest RSS.
- Add to that model the variable that results in the lowest RSS amongst all two variable models.
- Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold.
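
The forward selection steps above can be sketched as a greedy loop. For simplicity this sketch stops after a fixed number of variables (`max_vars`, an assumption replacing the p-value stopping rule); the helper names and the synthetic data are hypothetical.

```python
import numpy as np

def rss_of(X, y, columns):
    """RSS of the least squares fit using an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

def forward_selection(X, y, max_vars):
    """Greedily add the variable that lowers the RSS the most at each step."""
    selected = []                          # start from the null model
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda j: rss_of(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (assumption): only columns 2 and 0 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 + 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

chosen = forward_selection(X, y, max_vars=2)
```

On this data the strongest predictor (column 2) enters first, followed by column 0. Each step fits at most *p* models, so the search examines on the order of $p^2$ models rather than $2^p$.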

### Backward Selection

- Start with all variables in the model.
- Remove the variable with the largest p-value i.e. the variable that is the least statistically significant.
- The new (p-1) variable model is fit, and the variable with largest p-value is removed.
- Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some significance threshold.
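
Backward selection can be sketched in a similar way. Within a single fit, the variable with the largest p-value is the one with the smallest $|t|$, so this sketch ranks variables by $|t|$ and stops once every remaining $|t|$ reaches a threshold of 2 (an assumption standing in for a p-value threshold). The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def t_statistics(X, y, columns):
    """t-statistics of the non-intercept coefficients for the given columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    sigma2_hat = rss / (n - design.shape[1])      # RSS / (n - p - 1)
    cov = sigma2_hat * np.linalg.inv(design.T @ design)
    return coef[1:] / np.sqrt(np.diag(cov)[1:])   # skip the intercept

def backward_selection(X, y, t_threshold=2.0):
    """Repeatedly drop the least significant variable until all pass the threshold."""
    columns = list(range(X.shape[1]))             # start with all variables
    while columns:
        t_abs = np.abs(t_statistics(X, y, columns))
        weakest = int(np.argmin(t_abs))           # smallest |t| = largest p-value
        if t_abs[weakest] >= t_threshold:
            break
        columns.pop(weakest)
    return columns

# Synthetic data (assumption): only columns 0 and 2 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 + 3.0 * X[:, 2] + 1.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

kept = backward_selection(X, y)
```

The two truly relevant columns always survive here because their t-statistics are very large; a pure-noise column can occasionally survive by chance, which is one reason the more systematic criteria mentioned below are useful.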

*NOTE: Later we discuss more systematic criteria for choosing an "optimal" member in the path of models produced by forward or backward stepwise selection.
These include* **Mallows' $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted $R^2$** and **cross-validation**.

## Other Considerations in the Regression Model

### Qualitative Predictors
- Some predictors aren't quantitative but qualitative, taking a discrete set of values.
- These are also called **categorical** predictors or **factor variables**.
- In the example below, in addition to the 7 quantitative variables there are four qualitative variables: **gender, student status, marital status, and ethnicity**.
<div style="text-align:center">
    <img src="img8.png" alt="Description of image">
</div>


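- This leads to the creation of **dummy variables**: a qualitative predictor with $k$ levels is represented by $k - 1$ indicator (0/1) variables, with the remaining level serving as the baseline.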
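
Dummy variable coding can be sketched in a few lines of plain Python. The function name `dummy_encode` and the category labels are hypothetical; one level is chosen as the baseline and receives no column of its own.

```python
def dummy_encode(values, baseline):
    """Encode a qualitative variable as 0/1 columns, one per non-baseline level."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Two-level predictor (e.g. student status): a single dummy variable.
student_levels, student_rows = dummy_encode(["Yes", "No", "No", "Yes"], baseline="No")

# Three-level predictor with hypothetical labels: two dummy variables.
group_levels, group_rows = dummy_encode(["a", "b", "c", "a"], baseline="a")
```

Observations at the baseline level get a row of zeros, so each dummy coefficient is interpreted as the average difference in the response between that level and the baseline.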

## Extensions of the Linear Model

### Interactions

- In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.
- For example, the linear model
$$\widehat{\text{sales}} = \beta_{0} + \beta_{1} \times \text{TV} + \beta_{2} \times \text{radio} + \beta_{3} \times \text{newspaper}$$
states that the average effect on sales of a one-unit increase in TV is always $\beta_1$, regardless of the amount spent on radio.
- But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
- In that case, given a fixed budget of \$100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or radio.
- In marketing this is known as a **synergy** effect, and in statistics it is referred to as an **interaction** effect.
<div style="text-align:center">
    <img src="img9.png" alt="Description of image">
</div> 

- When the levels of either TV or radio are low, the true sales are lower than predicted by the linear model, but when advertising is split between TV and radio, the model tends to underestimate sales.

### Modelling Interactions - Advertising Data

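The model takes the form <br><br>
$$\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{radio} \times \text{TV}) + \epsilon$$
which can be rewritten as
$$\text{sales} = \beta_0 + (\beta_1 + \beta_3 \times \text{radio}) \times \text{TV} + \beta_2 \times \text{radio} + \epsilon$$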

<div style="text-align:center">
    <img src="img10.png" alt="Description of image" width = "600">
</div>

### Interpretation
- The results in this table suggest that interactions are important.
- The p-value of the interaction term **TV × radio** is extremely low, indicating that there is strong evidence for $H_A: \beta_3 \neq 0$.
- The $R^2$ for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction.
- This means that (96.8 - 89.7)/(100 - 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.
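
Fitting an interaction model only requires adding the product of the two predictors as an extra column. The sketch below uses NumPy and synthetic data with a built-in synergy effect (an assumption, not the advertising data) and repeats the "share of remaining variability" calculation from above.

```python
import numpy as np

def r_squared(design, y):
    """Return (R^2, coefficients) of the least squares fit of y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(1 - rss / tss), coef

# Synthetic data (assumption): sales depend on tv, radio and their product.
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, size=n)
radio = rng.uniform(0, 50, size=n)
sales = 6.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(size=n)

ones = np.ones(n)
additive = np.column_stack([ones, tv, radio])
with_interaction = np.column_stack([ones, tv, radio, tv * radio])

r2_additive, _ = r_squared(additive, sales)
r2_interaction, coef_interaction = r_squared(with_interaction, sales)

# Share of the variability left over by the additive model that the
# interaction term explains.
share_explained = (r2_interaction - r2_additive) / (1 - r2_additive)

# The same calculation with the R^2 values quoted above (in percent).
share_quoted = (96.8 - 89.7) / (100 - 89.7)  # roughly 0.69
```

The fitted coefficient on the `tv * radio` column estimates $\beta_3$, i.e. how much the slope for TV increases per unit of radio spending.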

### Hierarchy 
- Sometimes an interaction term has a very small p-value, but the associated main effects (here TV and radio) do not.
- The hierarchy principle: <br>
*"If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients aren't significant."*
- The rationale for this principle is that interactions are hard to interpret in a model without main effects - their meaning is changed.

### Interactions between qualitative and quantitative variables
Consider the credit card data set and suppose that we wish to predict balance using income (quantitative) and student (qualitative). Without an interaction term the model takes the form: <br>

<div style="text-align:center">
    <img src="img11.png" alt="Description of image" width = "600">
</div>

With interactions it takes the following form:
<div style="text-align:center">
    <img src="img12.png" alt="Description of image" width = "600">
</div>
<br> <br>
<div style="text-align:center">
    <img src="img13.png" alt="Description of image" width = "600">
</div>


### Non-linear effects of predictors

<div style="text-align:center">
    <img src="img14.png" alt="Description of image" width = "600">
</div>

The figure suggests that the model
$$\text{mpg} = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon$$
may provide a better fit.
<div style="text-align:center">
    <img src="img15.png" alt="Description of image" width = "600">
</div>

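In the above example we created an extra variable, the square of horsepower, to accommodate the polynomial term. The model is still linear in the coefficients, so it can be fit by ordinary least squares.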
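
The quadratic fit can be sketched by adding a squared column to the design matrix. The data below are synthetic, and the coefficients used to generate them are made up for illustration (an assumption, not estimates from the real data).

```python
import numpy as np

def fit_and_rss(design, y):
    """Return (coefficients, RSS) of the least squares fit of y on design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return coef, rss

# Synthetic data (assumption): a curved relationship between horsepower and mpg.
rng = np.random.default_rng(0)
n = 300
horsepower = rng.uniform(50, 230, size=n)
mpg = 60.0 - 0.5 * horsepower + 0.0013 * horsepower ** 2 + rng.normal(scale=2.0, size=n)

ones = np.ones(n)
linear_design = np.column_stack([ones, horsepower])
quadratic_design = np.column_stack([ones, horsepower, horsepower ** 2])  # extra variable

coef_linear, rss_linear = fit_and_rss(linear_design, mpg)
coef_quadratic, rss_quadratic = fit_and_rss(quadratic_design, mpg)
```

The quadratic model has a noticeably lower RSS on this curved data, and its third coefficient recovers the curvature used to generate it.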

## Potential Problems
When we fit a linear regression model to a particular dataset, many problems may occur. The most common among these are the following:

1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.

