# Chapter 3. Multiple Regression Analysis: Estimation

In Chapter 2 we learned how to use simple regression analysis to explain a dependent variable y as a function of a single independent variable x. The primary drawback of simple regression analysis for empirical work is that it is very difficult to draw ceteris paribus conclusions about how x affects y: the key assumption, SLR.4, that all other factors affecting y are uncorrelated with x, is often unrealistic.

Multiple regression analysis is more amenable to ceteris paribus analysis because it allows us to explicitly control for many other factors that simultaneously affect the dependent variable. In addition, if we add more factors to our model that are useful for explaining y, then more of the variation in y can be explained. Thus multiple regression analysis can be used to build better models for predicting the dependent variable.

An additional advantage of multiple regression analysis is that it can incorporate fairly general functional form relationships.

## 3-1 The Model with k Independent Variables

The general multiple linear regression (MLR) model can be written in the population as

\begin{equation}
y=\beta_0+\beta_1*x_1+\beta_2*x_2+\beta_3*x_3+\dots+\beta_k*x_k+u
\end{equation}

where $\beta_0$ is the intercept, $\beta_1$ is the parameter associated with $x_1$, $\beta_2$ is the parameter associated with $x_2$, and so on.

u is the error term or disturbance. It contains factors other than $x_1,x_2,\dots,x_k$ that affect y. No matter how many explanatory variables we include in our model, there will always be factors we cannot include, and these are collectively contained in u.

## 3-2 Mechanics and Interpretation of Ordinary Least Squares

### 3-2a Obtaining the OLS Estimates

In the general case with k independent variables, we seek estimates, $\hat{\beta_0},\hat{\beta_1},\dots,\hat{\beta_k}$, in the equation

\begin{equation}
\tag{3.11}
\hat{y}=\hat{\beta_0}+\hat{\beta_1}*x_1+\hat{\beta_2}*x_2+\dots+\hat{\beta_k}*x_k
\end{equation}

The OLS estimates, k+1 of them, are chosen to minimize the sum of squared residuals:

\begin{equation}
\tag{3.12}
\sum_{i=1}^n (y_i-\hat{\beta_0}-\hat{\beta_1}x_{i1}-\hat{\beta_2}x_{i2}-\dots-\hat{\beta_k}x_{ik})^2
\end{equation}
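The minimization problem in (3.12) has a closed-form solution given by the normal equations. The following is a minimal sketch on simulated data; the variable names, sample size and coefficient values are assumptions made purely for illustration and do not come from any Wooldridge data set. It solves the normal equations directly and compares the result with `lm()`.

```r
# Simulated data: all names and parameter values are illustrative assumptions
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) b = X'y
b_normal <- drop(solve(crossprod(X), crossprod(X, y)))

# Same estimates from lm()
b_lm <- coef(lm(y ~ x1 + x2))

# Sum of squared residuals as a function of candidate coefficients
ssr <- function(b) sum((y - X %*% b)^2)

print(cbind(normal_eq = b_normal, lm = b_lm))

# Moving any coefficient away from the OLS value raises the SSR
print(ssr(b_normal) < ssr(b_normal + c(0, 0.01, 0)))
```

In practice `lm()` uses a QR decomposition rather than inverting X'X, which is numerically more stable; the explicit solve is shown here only because it mirrors the first-order conditions.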

### 3-2b Interpreting the OLS Regression Equation

More important than the details underlying the computation of the $\hat{\beta_j}$ is the interpretation of the estimated equation. In the case of two independent variables:

\begin{equation}
\tag{3.14}
\hat{y}=\hat{\beta_0}+\hat{\beta_1}*x_1+\hat{\beta_2}*x_2
\end{equation}

The intercept $\hat{\beta_0}$ in equation (3.14) is the predicted value of y when $x_1=0$ and $x_2=0$.

The estimates $\hat{\beta_1}$ and $\hat{\beta_2}$ have partial effect, or ceteris paribus, interpretations. From equation (3.14) we have

\begin{equation}
\Delta\hat{y}=\hat{\beta_1}\Delta x_1+\hat{\beta_2}\Delta x_2
\end{equation}

In particular, when $x_2$ is held fixed, so that $\Delta x_2=0$, then

\begin{equation}
\Delta\hat{y}=\hat{\beta_1}\Delta x_1
\end{equation}

The key point is that by including $x_2$ in our model, we obtain a coefficient on $x_1$ with a ceteris paribus interpretation. This is why multiple regression analysis is so useful.

### Wooldridge. Example 3.1 Determinants of College GPA

The variables in GPA1 include the college grade point average (colGPA), high school GPA (hsGPA), and achievement test score (ACT) for a sample of 141 students from a large university. Both college and high school GPAs are on a four-point scale. The following code computes the OLS estimates.

```r
library(foreign)
gpa1 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/gpa1.dta?raw=true")

# Just obtain parameter estimates:
lm(colGPA ~ hsGPA+ACT, data=gpa1)

# Store results under "GPAres" and display full table:
GPAres <- lm(colGPA ~ hsGPA+ACT, data=gpa1)
summary(GPAres)
```

```
Call:
lm(formula = colGPA ~ hsGPA + ACT, data = gpa1)

Coefficients:
(Intercept)        hsGPA          ACT  
   1.286328     0.453456     0.009426  


Call:
lm(formula = colGPA ~ hsGPA + ACT, data = gpa1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85442 -0.24666 -0.02614  0.28127  0.85357 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.286328   0.340822   3.774 0.000238 ***
hsGPA       0.453456   0.095813   4.733 5.42e-06 ***
ACT         0.009426   0.010777   0.875 0.383297    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3403 on 138 degrees of freedom
Multiple R-squared:  0.1764,	Adjusted R-squared:  0.1645 
F-statistic: 14.78 on 2 and 138 DF,  p-value: 1.526e-06
```


How do we interpret these results?

The intercept 1.29 is the predicted college GPA if hsGPA and ACT are both set to zero. Since no one who attends college has either a zero high school GPA or a zero on the achievement test, the intercept in this equation is not, by itself, meaningful.

More interesting estimates are the slope coefficients on hsGPA and ACT. As expected, there is a positive partial relationship between colGPA and hsGPA. Holding ACT fixed, another point on hsGPA is associated with .453 of a point on the college GPA, or almost half a point. In other words, if we choose two students, A and B, and these students have the same ACT score, but the high school GPA of student A is one point higher than the high school GPA of student B, then we predict student A to have a college GPA .453 higher than that of student B.

The sign on ACT implies that, while holding hsGPA fixed, a change in the ACT score of 10 points (a very large change given that the maximum ACT score is 36 and its average is about 24) affects colGPA by less than one-tenth of a point. This is a small effect and it suggests that, once high school GPA is accounted for, the ACT score is not a strong predictor of college GPA.

Later we will show that the coefficient on ACT is not only small but also statistically insignificant.
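The size of the ACT effect discussed above follows directly from the estimated coefficient in the regression output. Holding hsGPA fixed and setting $\Delta ACT=10$:

\begin{equation}
\Delta\widehat{colGPA}=0.009426\times\Delta ACT=0.009426\times 10\approx 0.094<0.1
\end{equation}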

### Wooldridge. Example 3.2 Hourly Wage Equation

Using the 526 observations on workers in WAGE1, we include educ (years of education), exper (years of labor market experience) and tenure (years with current employer) in an equation explaining log(wage). 

```r
library(foreign)
wage1 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wage1.dta?raw=true")

# Just obtain parameter estimates:
lm(log(wage) ~ educ+exper+tenure, data=wage1)

# Store results under "WAGEres" and display full table:
WAGEres <- lm(log(wage) ~ educ+exper+tenure, data=wage1)
summary(WAGEres)
```

```
Call:
lm(formula = log(wage) ~ educ + exper + tenure, data = wage1)

Coefficients:
(Intercept)         educ        exper       tenure  
   0.284360     0.092029     0.004121     0.022067  


Call:
lm(formula = log(wage) ~ educ + exper + tenure, data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.05802 -0.29645 -0.03265  0.28788  1.42809 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.284360   0.104190   2.729  0.00656 ** 
educ        0.092029   0.007330  12.555  < 2e-16 ***
exper       0.004121   0.001723   2.391  0.01714 *  
tenure      0.022067   0.003094   7.133 3.29e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4409 on 522 degrees of freedom
Multiple R-squared:  0.316,	Adjusted R-squared:  0.3121 
F-statistic: 80.39 on 3 and 522 DF,  p-value: < 2.2e-16
```


The coefficient .092 means that, holding exper and tenure fixed, another year of education is predicted to increase log(wage) by .092, which translates into an approximate 9.2% increase in wage. Alternatively, if we take two people with the same levels of experience and job tenure, the coefficient on educ is the proportionate difference in predicted wage when their education levels differ by one year.

Whether it is a good estimate of the ceteris paribus return to another year of education requires us to study the statistical properties of OLS.
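The 9.2% figure above relies on the approximation $\%\Delta wage\approx 100\cdot\Delta\log(wage)$. As a quick check, the sketch below compares it with the exact percentage change implied by a log-level model, $100\,[\exp(\hat{\beta}_{educ})-1]$. The coefficient is copied from the regression output above; the variable names are illustrative.

```r
# Coefficient on educ, copied from the regression output above
b_educ <- 0.092029

# Approximate percentage change in wage for one more year of education
approx_pct <- 100 * b_educ

# Exact percentage change implied by the log-level model
exact_pct <- 100 * (exp(b_educ) - 1)

print(round(c(approximate = approx_pct, exact = exact_pct), 2))
```

The two figures are close because the coefficient is small; the gap between them widens as the coefficient grows.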

### 3-2c On the Meaning of "Holding Other Factors Fixed" in Multiple Regression

In Example 3.1 we observed that the coefficient on ACT measures the predicted difference in colGPA, holding hsGPA fixed. The power of multiple regression analysis is that it provides this ceteris paribus interpretation even though the data have not been collected in a ceteris paribus fashion. In giving the coefficient on ACT a partial effect interpretation, it may seem that we actually went out and sampled people with the same high school GPA but possibly with different ACT scores. This is not the case. The data are a random sample from a large university: there were no restrictions placed on the sample values of hsGPA or ACT in obtaining our sample. Rarely do we have the luxury of holding certain variables fixed in obtaining our sample. Multiple regression effectively allows us to mimic this situation without restricting the values of any independent variables.

The power of multiple regression analysis is that it allows us to do in nonexperimental environments what natural scientists are able to do in a controlled laboratory setting: keep other factors fixed.
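One way to see mechanically what "holding other factors fixed" means is the partialling-out result: the coefficient on $x_1$ in a multiple regression equals the slope from regressing y on the part of $x_1$ that is uncorrelated with $x_2$. The sketch below checks this on simulated data. The variable names and parameter values are assumptions for illustration only; they are not taken from GPA1 or any other data set used in this chapter.

```r
# Simulated data: names and parameter values are illustrative assumptions
set.seed(2)
n  <- 300
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)            # x1 and x2 are correlated by construction
y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(n)

# Coefficient on x1 from the multiple regression
b1_multiple <- coef(lm(y ~ x1 + x2))["x1"]

# Step 1: remove from x1 the part explained by x2
r1 <- resid(lm(x1 ~ x2))

# Step 2: regress y on what is left of x1
b1_partial <- coef(lm(y ~ r1))["r1"]

print(c(multiple = unname(b1_multiple), partialled_out = unname(b1_partial)))
```

The two numbers agree up to floating-point error, even though no observations were ever selected to have equal values of `x2`.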

### 3-2h Goodness-of-Fit

As with simple regression, we can define the total sum of squares (SST), the explained sum of squares (SSE) and the residual sum of squares or sum of squared residuals (SSR) as

\begin{equation}
\tag{3.24}
SST=\sum_{i=1}^n (y_i-\bar{y})^2
\end{equation}

\begin{equation}
\tag{3.25}
SSE=\sum_{i=1}^n (\hat{y}_i-\bar{y})^2
\end{equation}

\begin{equation}
\tag{3.26}
SSR=\sum_{i=1}^n \hat{u_i}^2
\end{equation}

Using the same argument as in the simple regression case, it can be shown that: $SST=SSE+SSR$

Just as in the simple regression case, the R-squared is defined to be

\begin{equation}
\tag{3.28}
R^2:= SSE/SST = 1-SSR/SST 
\end{equation}
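The definitions (3.24) through (3.28) can be checked numerically. The sketch below uses simulated data (names and parameter values are assumptions for illustration) and compares the hand-computed quantities with the R-squared reported by `summary()`.

```r
# Simulated data: names and parameter values are illustrative assumptions
set.seed(3)
n  <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

SST <- sum((y - mean(y))^2)            # total sum of squares
SSE <- sum((fitted(fit) - mean(y))^2)  # explained sum of squares
SSR <- sum(resid(fit)^2)               # residual sum of squares

R2_from_SSE <- SSE / SST
R2_from_SSR <- 1 - SSR / SST

print(c(SST = SST, SSE_plus_SSR = SSE + SSR))
print(c(from_SSE = R2_from_SSE, from_SSR = R2_from_SSR,
        summary_lm = summary(fit)$r.squared))
```

The decomposition SST = SSE + SSR relies on the regression including an intercept, as it does here.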

An important fact about $R^2$ is that it never decreases, and it usually increases when another independent variable is added to a regression and the same set of observations is used for both regressions. 

If two regressions use different sets of observations, then, in general, we cannot tell how the R-squareds will compare, even if one regression uses a subset of the regressors. Missing data can therefore be an important practical issue.

The fact that $R^2$ never decreases when any variable is added to a regression makes it a poor tool for deciding whether one variable or several variables should be added to the model. The factor that should determine whether an explanatory variable belongs in a model is whether the explanatory variable has a nonzero partial effect on y in the population.
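To illustrate why $R^2$ is a poor guide for variable selection, the sketch below adds a regressor that is pure noise, unrelated to y by construction, and compares the two R-squareds on the same observations. The data are simulated and all names and values are assumptions for illustration.

```r
# Simulated data: names and parameter values are illustrative assumptions
set.seed(4)
n     <- 150
x1    <- rnorm(n)
noise <- rnorm(n)              # unrelated to y by construction
y     <- 1 + 0.8 * x1 + rnorm(n)

r2_small <- summary(lm(y ~ x1))$r.squared
r2_big   <- summary(lm(y ~ x1 + noise))$r.squared

print(c(without_noise = r2_small, with_noise = r2_big))
```

The R-squared of the larger model is at least as large as that of the smaller one even though `noise` has a zero partial effect in the simulated population.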

### Wooldridge. Example 3.5 Explaining Arrest Records

CRIME1 contains data on arrests during the year 1986 and other information on 2,725 men born in either 1960 or 1961 in California. Each man in the sample was arrested at least once prior to 1986. The variable narr86 is the number of times the man was arrested during 1986. The variable pcnv is the proportion of arrests prior to 1986 that led to conviction, avgsen is the average sentence length served for prior convictions, ptime86 is months spent in prison in 1986, and qemp86 is the number of quarters during which the man was employed in 1986.

A linear model explaining arrests is

\begin{equation}
narr86=\beta_0 + \beta_1 * pcnv +\beta_2 * avgsen + \beta_3 * ptime86 +\beta_4 * qemp86 +u
\end{equation}

where pcnv is a proxy for the likelihood of being convicted of a crime and avgsen is a measure of the expected severity of punishment, if convicted. The variable ptime86 captures the incarcerative effects of crime. Labor market opportunities are crudely captured by qemp86.

First we estimate the model without the variable avgsen.

```r
library(foreign)
crime1<-read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/crime1.dta?raw=true")

# Model without avgsen:

summary( lm(narr86 ~ pcnv+ptime86+qemp86, data=crime1) )
```

```
Call:
lm(formula = narr86 ~ pcnv + ptime86 + qemp86, data = crime1)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7118 -0.4031 -0.2953  0.3452 11.4358 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.711772   0.033007  21.565  < 2e-16 ***
pcnv        -0.149927   0.040865  -3.669 0.000248 ***
ptime86     -0.034420   0.008591  -4.007 6.33e-05 ***
qemp86      -0.104113   0.010388 -10.023  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8416 on 2721 degrees of freedom
Multiple R-squared:  0.04132,	Adjusted R-squared:  0.04027 
F-statistic:  39.1 on 3 and 2721 DF,  p-value: < 2.2e-16
```


The results indicate that, as a group, the three variables pcnv, ptime86 and qemp86 explain about 4.1% of the variation in narr86.

If we increase pcnv by .50 (a large increase in the probability of conviction) then, holding the other factors fixed, $\Delta{\hat{narr86}}=-.150(.50)=-.075$. This means that when pcnv increases by .50, the predicted fall in arrests among 100 men is 7.5.

Similarly, a longer prison term leads to a lower predicted number of arrests. In fact, if ptime86 increases from 0 to 12, predicted arrests for a particular man fall by .034(12)=.408.

Another quarter in which legal employment is reported lowers predicted arrests by .104 (10.4 arrests among 100 men).

If avgsen is added to the model, we know that $R^2$ cannot decrease, and it will generally increase.

```r
library(foreign)
crime1<-read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/crime1.dta?raw=true")

# Model with avgsen:

summary( lm(narr86 ~ pcnv+avgsen+ptime86+qemp86, data=crime1) )
```

```
Call:
lm(formula = narr86 ~ pcnv + avgsen + ptime86 + qemp86, data = crime1)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9330 -0.4247 -0.2934  0.3506 11.4403 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.706756   0.033151  21.319  < 2e-16 ***
pcnv        -0.150832   0.040858  -3.692 0.000227 ***
avgsen       0.007443   0.004734   1.572 0.115993    
ptime86     -0.037391   0.008794  -4.252 2.19e-05 ***
qemp86      -0.103341   0.010396  -9.940  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8414 on 2720 degrees of freedom
Multiple R-squared:  0.04219,	Adjusted R-squared:  0.04079 
F-statistic: 29.96 on 4 and 2720 DF,  p-value: < 2.2e-16
```


We notice that adding the average sentence variable increases $R^2$ only from .0413 to .0422, a practically small change. The sign of the coefficient on avgsen is also unexpected: taken literally, it says that a longer average sentence length increases criminal activity. Note, however, that this estimate is not statistically significant at conventional levels (its p-value is about .116).

## 3-3 The Expected Value of the OLS Estimators

Wooldridge 2016 states and discusses four assumptions, which are direct extensions of the simple regression model assumptions, under which the OLS estimators are unbiased for the population parameters. We also explicitly obtain the bias in OLS when an important variable has been omitted from the regression.

### Assumption MLR.1 Linear in Parameters

The model in the population can be written as

\begin{equation}
y=\beta_0+\beta_1*x_1+\beta_2*x_2+\dots+\beta_k*x_k+u
\tag{3.31}
\end{equation}

where $\beta_0, \beta_1, \dots,\beta_k$ are the unknown parameters of interest and u is an unobserved random error or disturbance term.

### Assumption MLR.2 Random Sampling

We have a random sample of n observations $\{(x_{i1},x_{i2},\dots,x_{ik},y_i):i=1,2,\dots,n\}$ following the population model in Assumption MLR.1.

Under MLR.1 and MLR.2 the OLS estimators $\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2,\dots,\hat{\beta}_k$ from the regression of y on $x_1,x_2,\dots,x_k$ are now considered to be estimators of $\beta_0,\beta_1,\beta_2,\dots,\beta_k$. Still we did not include conditions under which the OLS estimates are well defined for a given sample. The next assumption fills that gap.

### Assumption MLR.3 No Perfect Collinearity

In the sample (and therefore in the population) none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

If an independent variable in (3.31) is an exact linear combination of the other independent variables, then we say the model suffers from perfect collinearity and it cannot be estimated by OLS.

It is important to note that Assumption MLR.3 does allow independent variables to be correlated. They just cannot be perfectly correlated.
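To see what a violation of MLR.3 looks like in practice, the sketch below builds a regressor that is an exact linear combination of two others. The data are simulated and the names and values are assumptions for illustration. R's `lm()` does not stop with an error in this situation; it reports `NA` for a coefficient it cannot identify.

```r
# Simulated data: names and parameter values are illustrative assumptions
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2                 # exact linear combination of x1 and x2
y  <- 1 + x1 + x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)
print(coef(fit))              # the coefficient on x3 is reported as NA

# x1 and x2 are correlated with x3, but not perfectly, so this model is fine
print(coef(lm(y ~ x1 + x3)))
```

The second regression shows the point made above: correlation among regressors is allowed, only an exact linear relationship is ruled out.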

### Assumption MLR.4 Zero Conditional Mean

The error u has an expected value of zero given any values of the independent variables. In other words

$E(u|x_1,x_2,\dots, x_k)=0$

One way that Assumption MLR.4 can fail is if the functional relationship between the explained and explanatory variables is misspecified in equation (3.31). Chapter 9 in Wooldridge discusses ways of detecting functional form misspecification.

Omitting an important factor that is correlated with any of $x_1,x_2,\dots,x_k$ also causes MLR.4 to fail.

### Theorem 3.1 Unbiasedness of OLS

Using Assumptions MLR.1 through MLR.4,

\begin{equation}
E(\hat{\beta}_j)=\beta_j,\quad j=0,1,\dots,k
\end{equation}
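Unbiasedness is a statement about the average of the estimates across many random samples, not about any single sample. The Monte Carlo sketch below illustrates this under assumed "true" parameter values; the data-generating process, sample size and number of replications are all choices made for illustration.

```r
# Monte Carlo sketch: all parameter values are illustrative assumptions
set.seed(6)
reps <- 1000
n    <- 100
beta <- c(1, 2, -1)           # assumed true intercept and slopes

estimates <- replicate(reps, {
  x1 <- rnorm(n)
  x2 <- 0.5 * x1 + rnorm(n)   # regressors may be correlated
  u  <- rnorm(n)              # E(u | x1, x2) = 0 by construction
  y  <- beta[1] + beta[2] * x1 + beta[3] * x2 + u
  coef(lm(y ~ x1 + x2))
})

# Average of each estimator across samples, next to the assumed true values
print(rbind(true = beta, mc_mean = rowMeans(estimates)))
```

Individual estimates vary from sample to sample, but their averages across replications should sit close to the assumed true values.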

### 3-3a Including Irrelevant Variables in a Regression Model

In terms of the unbiasedness of $\hat{\beta}_1$ and $\hat{\beta}_2$, adding a variable $x_3$ which is irrelevant (it has no effect on y after $x_1$ and $x_2$ have been controlled for) has no effect. However, it might affect the variances of the OLS estimators.

### 3-3b Omitted Variable Bias

Now suppose that, rather than including an irrelevant variable, we omit a variable that actually belongs in the true (or population) model. This is often called the problem of excluding a relevant variable or underspecifying the model. It can be shown (e.g. Wooldridge 2016, page 79) that this generally leads to biased OLS estimators.
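In the two-regressor case the source of the bias can be made explicit. A standard algebraic result, stated here without derivation, is $\tilde{\beta}_1=\hat{\beta}_1+\hat{\beta}_2\tilde{\delta}_1$, where $\tilde{\beta}_1$ is the slope from regressing y on $x_1$ alone, $\hat{\beta}_1$ and $\hat{\beta}_2$ come from regressing y on $x_1$ and $x_2$, and $\tilde{\delta}_1$ is the slope from regressing $x_2$ on $x_1$. The sketch below verifies the identity on simulated data; names and parameter values are assumptions for illustration.

```r
# Simulated data: names and parameter values are illustrative assumptions
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)          # omitted variable, correlated with x1
y  <- 1 + 1 * x1 + 2 * x2 + rnorm(n)

b_long  <- coef(lm(y ~ x1 + x2))   # correctly specified model
b_short <- coef(lm(y ~ x1))        # x2 omitted
delta1  <- coef(lm(x2 ~ x1))["x1"] # slope of x2 on x1

# Short-regression slope versus long-regression slope plus b2_hat * delta1
lhs <- unname(b_short["x1"])
rhs <- unname(b_long["x1"] + b_long["x2"] * delta1)
print(c(short = lhs, long_plus_bias = rhs))
```

Because `x2` has a positive effect on `y` and is positively correlated with `x1` in this simulation, the short regression overstates the effect of `x1`. The bias disappears if either the omitted variable has no effect or it is uncorrelated with the included regressor.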

## 3-4 The Variance of OLS Estimators

In addition to the central tendencies of the $\hat{\beta}_j$, we also need to obtain a measure of the variance of OLS estimators. In order to do so first we need to add a homoskedasticity assumption.

### Assumption MLR.5 Homoskedasticity

The error u has the same variance given any value of the explanatory variables. In other words,

\begin{equation}
Var(u|x_1,x_2,\dots,x_k)=\sigma^2
\end{equation}

Assumption MLR.5 means that the variance in the error term, u, conditional on the explanatory variables, is the same for all combinations of outcomes of the explanatory variables. If this assumption fails, then the model exhibits heteroskedasticity.

In the equation

$wage=\beta_0+\beta_1*educ+\beta_2*exper+\beta_3*tenure+u$

homoskedasticity requires that the variance of the unobserved error u does not depend on the levels of education, experience or tenure. That is,

$Var(u|educ, exper, tenure)=\sigma^2$ 

Assumptions MLR.1 through MLR.5 are collectively known as Gauss-Markov assumptions (for cross-sectional analysis). 

So far our assumptions are suitable only when applied to cross-sectional analysis with random sampling.

### Theorem 3.2 Sampling Variances of the OLS Slope Estimators

Under Assumptions MLR.1 through MLR.5, conditional on the sample values of the independent variables,

\begin{equation}
Var(\hat{\beta}_j)=\frac{\sigma^2}{SST_j(1-R_j^2)}
\tag{3.51}
\end{equation}

for $j=1,2,\dots,k$, where $SST_j=\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2$ is the total sample variation in $x_j$ and $R_j^2$ is the R-squared from regressing $x_j$ on all other independent variables (and including an intercept).

### 3-4a The Components of the OLS Variances. Multicollinearity

Equation (3.51) shows that the variance of $\hat{\beta_j}$ depends on three factors.

The Error Variance, $\sigma^2$. From equation (3.51), a larger $\sigma^2$ means larger sampling variances for the OLS estimators. Because the error variance is a feature of the population, it has nothing to do with the sample size. It is the one component of (3.51) that is unknown. For a given dependent variable, y, there is only one way to reduce the error variance, and that is to add more explanatory variables to the equation (i.e. take some factors out of the error term).

The Total Sample Variation in $x_j$, $SST_j$. From equation (3.51) we see that the larger the total variation in $x_j$, the smaller is $Var(\hat{\beta_j})$. Thus, everything else being equal, for estimating $\beta_j$ we prefer to have as much sample variation in $x_j$ as possible. In the social sciences it is rarely possible to choose the sample values of the independent variables. There is, however, a way to increase the sample variation in each of the independent variables: increase the sample size.

The Linear Relationships among the Independent Variables, $R_j^2$. It is important to note that this R-squared is distinct from the R-squared in the regression of y on the x's: $R_j^2$ is obtained from a regression involving only the independent variables. Consider first the case of two independent variables (k = 2). Then $R_1^2$ is the R-squared from the simple regression of $x_1$ on $x_2$, so a value of $R_1^2$ close to one indicates that $x_2$ explains much of the variation in $x_1$ in the sample. This means that the two variables are highly correlated.

In the general case, $R_j^2$ is the proportion of the total variation in $x_j$ that can be explained by the other independent variables appearing in the equation. When $R_j^2$ is "close" to one, we see from equation (3.51) that $Var(\hat{\beta_j})$ will tend to be large. High (but not perfect) correlation between two or more independent variables is called multicollinearity.

Just as a large value of $R_j^2$ can cause a large $Var(\hat{\beta_j})$, so can a small value of $SST_j$. In the social sciences, where we are usually passive collectors of data, there is no good way to reduce variances of unbiased estimators other than to collect more data.

Another important point is that a high degree of correlation between certain independent variables can be irrelevant as to how well we can estimate other parameters in the model. For example, consider a model with three independent variables:

\begin{equation}
y=\beta_0+\beta_1*x_1+\beta_2*x_2+\beta_3*x_3+u
\end{equation}

where $x_2$ and $x_3$ are highly correlated. Then $Var(\hat{\beta_2})$ and $Var(\hat{\beta_3})$ may be large, but the amount of correlation between $x_2$ and $x_3$ has no direct effect on $Var(\hat{\beta_1})$. In fact, if $x_1$ is uncorrelated with $x_2$ and $x_3$, then $R_1^2=0$ and $Var(\hat{\beta_1})=\sigma^2/SST_1$, regardless of how much correlation there is between $x_2$ and $x_3$.
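The last claim can be checked numerically. In the sketch below, $x_2$ and $x_3$ are strongly correlated, while $x_1$ is constructed to be exactly uncorrelated with both in the sample, so that $R_1^2=0$. The data are simulated, and all names and values are assumptions for illustration. Since $\sigma$ is unknown, the comparison uses the estimate of $\sigma$ that `summary()` reports.

```r
# Simulated data: names and parameter values are illustrative assumptions
set.seed(8)
n  <- 200
x2 <- rnorm(n)
x3 <- 0.95 * x2 + 0.1 * rnorm(n)     # x2 and x3 are highly correlated

# Make x1 exactly uncorrelated with x2 and x3 in the sample
z  <- rnorm(n)
x1 <- resid(lm(z ~ x2 + x3))

y   <- 1 + x1 + x2 + x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

sigma_hat <- summary(fit)$sigma
SST1      <- sum((x1 - mean(x1))^2)

# Formula (3.51) with R_1^2 = 0 and sigma replaced by its estimate
se_formula <- sigma_hat / sqrt(SST1)
se_all     <- summary(fit)$coefficients[, "Std. Error"]

print(c(formula = se_formula, lm = unname(se_all["x1"])))
print(se_all)
```

The standard errors on `x2` and `x3` are inflated by their mutual correlation, while the one on `x1` matches the simple formula.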

### 3-4c Estimating $\sigma^2$: Standard Errors of the OLS Estimators

We now show how to choose an unbiased estimator of $\sigma^2$, which then allows us to obtain unbiased estimators of $Var(\hat{\beta_j})$.

Because $\sigma^2=E(u^2)$, an unbiased "estimator" of $\sigma^2$ would be the sample average of the squared errors, $n^{-1}\sum_{i=1}^{n}u_i^2$. Unfortunately, this is not a true estimator because we do not observe the $u_i$.

Nevertheless, recall that the errors can be written as $u_i=y_i-\beta_0-\beta_1*x_{i1}-\beta_2*x_{i2}-\dots-\beta_k*x_{ik}$. Replacing each $\beta_j$ with its OLS estimator, we get the OLS residuals:

$\hat{u}_i=y_i-\hat{\beta}_0-\hat{\beta}_1*x_{i1}-\hat{\beta}_2*x_{i2}-\dots-\hat{\beta}_k*x_{ik}$

The unbiased estimator of $\sigma^2$ in the general multiple regression case is

\begin{equation}
\hat{\sigma}^2=\frac{\sum_{i=1}^{n}\hat{u}_i^2}{n-k-1}=SSR/(n-k-1)
\end{equation}
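The degrees-of-freedom adjustment can be reproduced by hand. The sketch below, on simulated data with assumed names and parameter values, computes $SSR/(n-k-1)$ and compares its square root with the "Residual standard error" that `summary()` reports.

```r
# Simulated data: names and parameter values are illustrative assumptions
set.seed(9)
n  <- 120
k  <- 3                                # number of slope parameters
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 - x3 + rnorm(n, sd = 2)
fit <- lm(y ~ x1 + x2 + x3)

SSR        <- sum(resid(fit)^2)
sigma2_hat <- SSR / (n - k - 1)        # divide by degrees of freedom, not n

print(c(by_hand = sqrt(sigma2_hat), lm = summary(fit)$sigma))
print(df.residual(fit))                # residual degrees of freedom, n - k - 1
```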

### Theorem 3.3 Unbiased Estimation of $\sigma^2$

Under the Gauss-Markov Assumptions MLR.1 through MLR.5, $E(\hat{\sigma}^2)=\sigma^2$.

For constructing confidence intervals and conducting tests we will need to estimate the standard deviation of $\hat{\beta_j}$, which is just the square root of the variance:

\begin{equation}
sd(\hat{\beta_j})=\sigma/[SST_j(1-R_j^2)]^{1/2}
\end{equation}

Since $\sigma$ is unknown, we replace it with its estimator $\hat{\sigma}$. This gives us the standard error of $\hat{\beta}_j$:

\begin{equation}
se(\hat{\beta_j})=\hat{\sigma}/[SST_j(1-R_j^2)]^{1/2}
\tag{3.58}
\end{equation}

Because (3.58) is obtained directly from the variance formula in (3.51), and because (3.51) relies on the homoskedasticity Assumption MLR.5, it follows that the standard error formula in (3.58) is not a valid estimator of $sd(\hat{\beta}_j)$ if the errors exhibit heteroskedasticity. Thus, while heteroskedasticity does not cause bias in the $\hat{\beta_j}$, it does lead to bias in the usual formula for $Var(\hat{\beta}_j)$, which then invalidates the standard errors.
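Formula (3.58) is what `summary()` reports in the "Std. Error" column. The sketch below rebuilds the standard error of one slope from its three ingredients, $\hat{\sigma}$, $SST_j$ and $R_j^2$, using simulated homoskedastic data; names and parameter values are assumptions for illustration.

```r
# Simulated data: names and parameter values are illustrative assumptions
set.seed(10)
n  <- 250
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)              # correlated regressors
y  <- 1 + x1 + x2 + rnorm(n)           # homoskedastic errors
fit <- lm(y ~ x1 + x2)

sigma_hat <- summary(fit)$sigma                  # estimate of sigma
SST1      <- sum((x1 - mean(x1))^2)              # total sample variation in x1
R2_1      <- summary(lm(x1 ~ x2))$r.squared      # R-squared of x1 on the other regressor

se_by_hand <- sigma_hat / sqrt(SST1 * (1 - R2_1))
se_lm      <- summary(fit)$coefficients["x1", "Std. Error"]

print(c(by_hand = se_by_hand, lm = se_lm))
```

The agreement holds whether or not the errors are homoskedastic, because it is purely a computation; what fails under heteroskedasticity is the interpretation of this number as a valid estimate of $sd(\hat{\beta}_j)$.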

## 3-5 Efficiency of OLS: The Gauss-Markov Theorem

Let $\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2,\dots,\hat{\beta}_k$ denote the OLS estimators in model (3.31). Under Assumptions MLR.1 through MLR.5, the Gauss-Markov theorem says that, in the class of linear unbiased estimators, OLS has the smallest variance.

### Theorem 3.4 Gauss-Markov Theorem

Under Assumptions MLR.1 through MLR.5, $\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2,\dots,\hat{\beta}_k$ are the best linear unbiased estimators (BLUEs) of $\beta_0,\beta_1,\beta_2,\dots,\beta_k$, respectively.
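A small Monte Carlo experiment can make the theorem concrete. The sketch below compares the OLS slope with one other linear unbiased estimator in a simple regression: the slope of the line through the first and last observations. This competitor is chosen here only for illustration, as are the parameter values and the fixed grid of regressor values.

```r
# Monte Carlo sketch: all parameter values are illustrative assumptions
set.seed(11)
n    <- 50
x    <- seq(0, 10, length.out = n)     # regressor values held fixed across samples
reps <- 2000

draws <- replicate(reps, {
  y <- 1 + 0.5 * x + rnorm(n)          # assumed true slope is 0.5
  ols <- coef(lm(y ~ x))[["x"]]
  endpoints <- (y[n] - y[1]) / (x[n] - x[1])   # linear in y and unbiased for the slope
  c(ols = ols, endpoints = endpoints)
})

print(rowMeans(draws))          # both averages should be near the assumed slope
print(apply(draws, 1, var))     # sampling variance of each estimator
```

Both estimators are centered on the same value, but the endpoint estimator discards most of the sample and so has a much larger sampling variance. The theorem says that no linear unbiased estimator can do better than OLS under MLR.1 through MLR.5.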