**Importing the needed libraries/modules**

In [1]:
import pandas as pd
from scipy import stats
from scipy.linalg import lstsq
from scipy.stats import f
import numpy as np

#**Solution example Xr17-01** <br>
A developer who specializes in summer cottage properties is considering purchasing a large tract of land adjoining a lake. The current owner of the tract has already subdivided the land into separate building plots and has prepared the plots by removing some of the trees. The developer wants to forecast the value of each plot. From previous experience, she knows that the most important factors affecting the price of a plot are size, number of mature trees and distance to the lake. From a nearby area, she gathers the relevant data for 60 recently sold plots.

##Load in the data

Link to data: <br>
https://github.com/saoter/AQM2023/raw/main/Workshop%207/data/Xr17-01.xlsx

**First load in the data to a pandas dataframe**

In [2]:
df = pd.read_excel('https://github.com/saoter/AQM2023/raw/main/Workshop%207/data/Xr17-01.xlsx')

In [3]:
df.head() #Showing first 5 data observations

| | Price | Lot size | Trees | Distance |
|---|---|---|---|---|
| 0 | 105.4 | 41.2 | 24 | 42 |
| 1 | 91.2 | 44.8 | 5 | 71 |
| 2 | 183.3 | 21.3 | 72 | 43 |
| 3 | 93.8 | 43.9 | 58 | 14 |
| 4 | 207.5 | 57.7 | 52 | 12 |


##**Objective a)** <br>
Find the regression equation

**This step selects the columns from the DataFrame df that represent your independent variables (Lot size, Trees, Distance) and stores them in a new DataFrame X. Think of X as a table of your predictor variables.**

In [4]:
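# .copy() gives X its own data, so adding a column later does not trigger a SettingWithCopyWarning
X = df[['Lot size', 'Trees', 'Distance']].copy()
X.head()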

| | Lot size | Trees | Distance |
|---|---|---|---|
| 0 | 41.2 | 24 | 42 |
| 1 | 44.8 | 5 | 71 |
| 2 | 21.3 | 72 | 43 |
| 3 | 43.9 | 58 | 14 |
| 4 | 57.7 | 52 | 12 |


**Here, we create a variable y and assign it the values from the 'Price' column in the DataFrame df. y is your dependent variable, the one you want to predict or explain.**

In [5]:
y = df['Price']
y.head()

0    105.4
1     91.2
2    183.3
3     93.8
4    207.5
Name: Price, dtype: float64

**As you may remember, the equation for multiple regression is:** <br>
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$
<br>
**Where $\beta_0$ is the intercept with the y-axis on a graph and k is the number of independent variables.**

**In this step, we add a new column to the X DataFrame called 'Intercept,' where all the values are set to 1. This is a common practice in regression to account for the constant (intercept) in the equation. Adding the column of 1s (the intercept) to your feature matrix allows your linear regression model to find both the slope (coefficients for other features) and the intercept (the point where the line crosses the Y-axis) during the fitting process. It ensures that the linear equation accounts for the situation when all independent variables are 0.**

In [6]:
X['Intercept'] = 1

Imagine you have a bunch of data points (like test scores and study hours) and you want to find a simple equation that best explains how they're related. This equation could help you predict one thing based on the other.

Here's a quick overview of the underlying math (not that it's important):

The Equation: You're trying to find an equation that looks like this:

$\text{Test Score} = \beta_0 + \beta_1 \times \text{Study Hours}$ <br>
"Test Score" is what you're trying to predict.<br>
"Study Hours" is the thing you think affects the test score.<br>
β0 and β1 are numbers you want to find that make the equation work best.<br><br>

**Matrix Math**: To do this, we use some math that involves matrices (don't worry if you're not familiar with these). The equation becomes:

$Y = X\beta$<br>
Y is all the test scores. <br>
X is a table with study hours and a bunch of rows (each row is a student).<br>
β is a list of numbers (one for the constant and one for study hours).

**Best Fit**: We want to find the best numbers in β that make our equation fit our data points as closely as possible. It's like adjusting the equation until it matches the real data.

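**Least Squares Method**: We use something called the "least squares method" to find these best numbers. It picks the β that minimizes the sum of squared differences between the observed values and the values the equation predicts. No trial and error is involved; the minimum can be computed directly with linear algebra.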

**Python Code**: The Python code provided below, coefficients, _, _, _ = lstsq(X, y), is like a magic tool that does the math for us. It finds the best numbers for β that make our equation fit our data points best.

**Result**: After running the code, you get the best numbers for β, which are the coefficients. These coefficients tell you how much study hours affect the test score (in this example). You can use this equation to make predictions based on new study hours.

**This line uses the least squares method (lstsq) to find the coefficients for the linear regression equation. The coefficients represent the relationship between the independent variables and the dependent variable. We're assigning these coefficients to the variable coefficients. The underscores _ are used as placeholders for values we don't need in this case.**

In [7]:
coefficients, _, _, _ = lstsq(X, y)

In [8]:
coefficients

array([ 0.69990446,  0.67881312, -0.3783608 , 51.39121643])
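To make the role of the column of 1s concrete, here is a minimal sketch on made-up data (the study hours and test scores are invented for illustration, not taken from the data set). It fits the same kind of design matrix with `lstsq` and cross-checks the result against the normal equations $(X^\top X)\beta = X^\top y$.

```python
import numpy as np
from scipy.linalg import lstsq

# Made-up data: test score = 50 + 5 * study hours, with no noise.
study_hours = np.array([1.0, 2.0, 3.0, 4.0])
test_score = 50.0 + 5.0 * study_hours

# Design matrix: the predictor column first, then a column of 1s for the intercept.
X_toy = np.column_stack([study_hours, np.ones_like(study_hours)])

# lstsq returns (solution, residues, rank, singular values); only the solution is needed.
beta_lstsq, _, _, _ = lstsq(X_toy, test_score)

# The same solution obtained by solving the normal equations directly.
beta_normal = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ test_score)

print("lstsq:           ", beta_lstsq)   # slope first, intercept last
print("normal equations:", beta_normal)
```

Because the intercept column was placed last, the intercept is the last entry of the solution, which is the same ordering used in this notebook.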

**intercept = coefficients[-1]**: Here, we extract the last value from the coefficients variable, which represents the intercept (β0) or constant term in the regression equation.

In [9]:
intercept = coefficients[-1]

In [10]:
intercept

51.39121642851501

**coefficients_array = coefficients[:-1]**: This line takes every value except the last from the coefficients variable and stores the result in coefficients_array, leaving you with the coefficients for the independent variables (Lot size, Trees, Distance) in the same order they were in the DataFrame X.

In [11]:
coefficients_array = coefficients[:-1]
coefficients_array

array([ 0.69990446,  0.67881312, -0.3783608 ])

**print(f"Intercept: {intercept}")**: Finally, we print out the intercept value, which is the constant in your regression equation. This is the value where the regression line crosses the y-axis when all independent variables are zero. <br><br>
**print(f"Coefficients: {coefficients_array}")**: We also print the coefficients for your independent variables. These values tell you how much the dependent variable changes for each unit change in the respective independent variable, while holding all other variables constant.

In [12]:
# Print the results
print(f"Intercept: {intercept}")
print(f"Coefficients: {coefficients_array}")

Intercept: 51.39121642851501
Coefficients: [ 0.69990446  0.67881312 -0.3783608 ]


Thus, our equation becomes (rounded coefficient values): <br>
$$Price = 51.39 + 0.70 \cdot \text{Lot size} + 0.68 \cdot \text{Trees} - 0.38 \cdot \text{Distance}$$


##**Objective b)** <br>
What is the standard error of the estimate?

**Calculate the number of observations (n)**

In [13]:
n = len(y)
n

60

**Calculate the number of predictors (k)**

In [14]:
k = X.shape[1] - 1  # Subtract 1 for the intercept
k

3

**Calculate the predicted values (y_hat) using the coefficients using matrix dot product**

In [15]:
y_hat = np.dot(X, coefficients)

**Calculate the residuals (differences between observed and predicted values)**

In [16]:
residuals = y - y_hat
residuals.head()

0    24.772359
1    31.922615
2    84.395789
3   -22.391132
4    84.966344
Name: Price, dtype: float64
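As a sanity check, the first residual can be reproduced by hand from the printed coefficients and the first row of `df.head()`. This is only a sketch using the rounded values displayed earlier, so the result should agree with the output above to about four decimals, not exactly.

```python
import numpy as np

# Coefficients as printed earlier: Lot size, Trees, Distance, Intercept.
coefficients_printed = np.array([0.69990446, 0.67881312, -0.3783608, 51.39121643])

# First observation from df.head(): Lot size 41.2, Trees 24, Distance 42, plus 1 for the intercept.
x_first = np.array([41.2, 24.0, 42.0, 1.0])
price_first = 105.4

# Predicted price is the dot product of the row with the coefficients.
y_hat_first = x_first @ coefficients_printed
residual_first = price_first - y_hat_first

print(f"Predicted: {y_hat_first:.2f}, residual: {residual_first:.2f}")
```

The residual comes out at about 24.77, matching the first entry of `residuals.head()`.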

**Calculate the residual sum of squares (SSE)**

In [17]:
sse = np.sum(residuals**2)
sse

90694.3330841533

**Calculate the standard error of the estimate using the provided formula** <br>


The standard error of the estimate is calculated as:

$$
\text{Standard Error} = \sqrt{\frac{SSE}{n - k - 1}}
$$

Where:
- \(SSE\) is the residual sum of squares.
- \(n\) is the number of observations.
- \(k\) is the number of predictors (independent variables).

In [18]:
std_error = np.sqrt(sse / (n - k - 1))

In [19]:
print(f"Standard Error of the Estimate: {std_error}")

Standard Error of the Estimate: 40.24352944532851


**Interpretation**: <br>
The standard error of the estimate (SE) is a measure of the accuracy of the regression model in predicting the dependent variable (in this case, "Price"). Specifically, it tells you the typical amount by which the actual observed values of the dependent variable are expected to differ from the predicted values produced by your regression model.

In this case, we've calculated a standard error of the estimate of approximately 40.24. Here's how you can interpret it:

- Accuracy of Predictions: On average, you can expect the predicted values of "Price" made by the regression model to be off by about 40.24 units from the actual observed values.

- Variability: A lower standard error of the estimate indicates that the predicted values are closer to the actual values, implying a more accurate model. Conversely, a higher SE suggests greater variability in the predictions, indicating a less accurate model.

- R-squared Relationship: The SE is related to the R-squared value of your regression model. R-squared measures the proportion of variance in the dependent variable that is explained by the independent variables. The SE is the square root of the residual sum of squares (SSE) divided by the degrees of freedom, where the degrees of freedom is (n - k - 1), with 'n' being the number of observations and 'k' being the number of predictors. R-squared is equal to 1 minus the ratio of SSE to the total sum of squares (SST). Both statistics are built from SSE, so for the same data a lower SE corresponds to a higher R-squared value, indicating a better-fitting model.

- Prediction Intervals: The SE can be used to construct approximate prediction intervals for individual predictions. For instance, if your model predicts a price of 200, a rough 95% interval is the prediction plus or minus about two standard errors, i.e. 200 ± 2 × 40.24 ≈ 200 ± 80.5 (assuming normally distributed errors and that the other regression assumptions hold).

In summary, a lower standard error of the estimate is desirable because it suggests that your regression model provides more accurate predictions of the dependent variable. Conversely, a higher SE indicates less accuracy and greater variability in the predictions.

##**Objective c)**<br>
What is the coefficient of determination? What does this statistic tell you?

**So we need to calculate the R-squared, aka the coefficient of determination**

The formula for calculating the coefficient of determination R-squared is:


$$R^2 = 1 - \frac{SSE}{SST}$$


Where:
- R-squared is the coefficient of determination.
- SSE is the residual sum of squares, representing unexplained variability.
- SST is the total sum of squares, representing total variability.

**We have already calculated SSE, so we only need to calculate SST**

The formula for calculating the total sum of squares SST is:


$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$


Where:
- SST is the total sum of squares.
- n is the number of observations.
- y_i represents the individual observed values of the dependent variable.
- y_bar is the mean (average) of the observed values.

In [20]:
#Calculate mean value of y (aka y_bar)
mean_y = np.mean(y)
#Calculate the squared sum of difference between observed values y and the y-bar (mean values)
sst = np.sum((y - mean_y)**2)

In [21]:
# Calculate the R-squared (coefficient of determination)
r_squared = 1 - (sse / sst)

In [22]:
# Print the R-squared value
print(f"R-squared (Coefficient of Determination): {r_squared}")

R-squared (Coefficient of Determination): 0.24247188773540462


**What does this statistic tell us?** <br>
The coefficient of determination, often denoted as R-squared, is a measure of how well the independent variables in your regression model explain the variability in the dependent variable. In this case, you've calculated an R-squared value of approximately 0.2425. Here's what this statistic tells you:

- **Proportion of Explained Variance**: R-squared measures the proportion of the total variability in the dependent variable ("Price" in your case) that is explained by the independent variables (Lot size, Trees, and Distance) included in your regression model. In your model, about 24.25% of the variability in plot prices is explained by these three independent variables.

- **Model Fit**: An R-squared value of 0.2425 suggests that your regression model explains a relatively modest proportion of the variability in plot prices. This means that there are other factors or sources of variation that are not accounted for by the model. In other words, the model does not fit the data very well.

- **Predictive Power**: The R-squared value can also serve as an indicator of the model's predictive power. A higher R-squared indicates that the model's predictions track the observed variation in plot prices more closely, while a lower R-squared suggests that the model's predictions have limited explanatory power.

- **Interpretation**: The remaining 75.75% of the variation in plot prices is due to other factors not included in the model or to random variation.

- **Considerations**: While R-squared provides valuable information, it's important to note that a low R-squared does not necessarily mean your model is useless. It depends on the context and the research question. Additionally, R-squared should be interpreted in conjunction with other diagnostic measures and domain knowledge.

In summary, an R-squared value of 0.2425 indicates that your regression model has limited explanatory power and does not account for a substantial portion of the variability in plot prices. Further exploration and consideration of additional variables may be necessary to improve the model's performance.

##**Objective d)**<br>
What is the coefficient of determination (R-squared), adjusted for degrees of freedom? Why does this value differ from the R-squared value? What does this tell us about the model?

The formula for calculating the adjusted coefficient of determination is:


$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2) \cdot (n - 1)}{n - k - 1}$$


Where:
- R^2_adj is the adjusted coefficient of determination.
- R^2 is the regular coefficient of determination.
- n is the number of observations.
- k is the number of predictors (independent variables) in the model.


We've already calculated all the variables needed for the equation, so we simply need to put them together like so:

In [23]:
r_squared_adj = 1 - ((1 - r_squared) * (n - 1) / (n - k - 1))

In [24]:
print(f"Adjusted R-squared: {r_squared_adj}")

Adjusted R-squared: 0.20189002457837268


**Why does this value differ from the R-squared value?** <br>
- **Regular R-squared**: Regular R-squared measures the proportion of the total variance in the dependent variable that is explained by the predictors in the model. It never decreases when you add more predictors, even if those additional predictors do not meaningfully improve the model's explanatory power. Therefore, regular $R^2$ may overestimate the goodness of fit when you include unnecessary predictors.

- **Adjusted R-squared**: Adjusted R-squared adjusts the regular R-squared to account for the number of predictors in the model. It penalizes unnecessary predictors by scaling the unexplained share $(1 - R^2)$ by $(n - 1)/(n - k - 1)$, a factor that grows with the number of predictors (k). Adjusted R-squared therefore rises only when a new predictor improves the fit by more than would be expected by chance, and it is never higher than regular R-squared once the model has at least one predictor.

**What does this tell us about the model?**<br>
- **Model Fit**: The adjusted $R^2$ value indicates the goodness of fit of the regression model. In this case, the adjusted $R^2$ is approximately 0.2019, which means that, after correcting for the number of predictors, the model explains about 20.19% of the total variance in the dependent variable (Price) based on the independent variables (Lot size, Trees, and Distance).

- **Model Complexity**: The adjusted $R^2$ adjusts for model complexity by penalizing the inclusion of unnecessary predictors. It takes into account both the explanatory power of the model and the number of predictors. Here the drop from 0.2425 to 0.2019 is the penalty for using three predictors with 60 observations; the model has some explanatory power, but not much.

- **Limited Explained Variance**: The adjusted $R^2$ value of 0.2019 suggests that the model explains a relatively modest portion of the variability in plot prices. This means that there are other factors or sources of variation not included in the model that contribute to the variation in prices. It's important to acknowledge that a substantial portion of the variation remains unexplained.

- **Room for Improvement**: An adjusted $R^2$ of 0.2019 indicates that there may be room for improvement in the model's predictive accuracy. You might consider exploring additional predictors or refining the model to better explain the variance in plot prices.

- **Interpretation**: In practical terms, this adjusted $R^2$ value means that only about 20.19% of the variability in plot prices can be attributed to Lot size, Trees, and Distance in the model. The remaining 79.81% of the variability is due to other factors or random variation.

In summary, the adjusted $R^2$ value of 0.2019 suggests that while your regression model has some explanatory power, it does not account for a large portion of the variability in plot prices. Further model refinement or the inclusion of additional predictors may be necessary to improve its performance. Additionally, it's important to consider domain knowledge and other diagnostic measures when interpreting the results.

##**Objective e)**<br>
Test the validity of the model. What does the p-value of the test statistic tell you?

The formula for the F-test is:

$$F = \frac{{\text{Mean Square Regression (MSR)}}}{{\text{Mean Square Error (MSE)}}} = \frac{{\left(\sum_{i=1}^{n} (y_i - \bar{y})^2 - \text{SSE}\right) / k}}{{\text{SSE} / (n - k - 1)}} $$

Where:
- \( F \) is the F-statistic.
- "Mean Square Regression (MSR)" represents the variance explained by the regression model.
- "Mean Square Error (MSE)" represents the unexplained variance (residuals) in the model.
- \( y_i \) represents individual observed values.
- \( \bar{y} \) represents the mean of observed values.
- "SSE" is the residual sum of squares.
- \( k \) is the number of predictor variables.
- \( n \) is the number of observations.

In [25]:
mean_square_regression = ((np.sum((y - np.mean(y))**2) - sse) / k)

In [26]:
mean_square_error = sse / (n - k - 1)

In [27]:
F_statistic = mean_square_regression / mean_square_error

In [28]:
print(f"F-statistic: {F_statistic}")

F-statistic: 5.974883085016509


In [29]:
# Set the desired significance level (e.g., alpha = 0.05 for a 95% confidence level)
alpha = 0.05
# Calculate the critical F-value for the given degrees of freedom
df2 = n - k - 1
# df1 is our k = 3 from earlier, df2 is calculated as: n - k - 1
critical_f_value = f.ppf(1 - alpha, k, df2)

print(f"Critical F-value: {critical_f_value}")

Critical F-value: 2.7694309320231345


**Null Hypothesis (H0)**: The null hypothesis for the F-test is that the model has no predictive power, meaning that all the slope coefficients (β1, β2, β3) are equal to zero. The intercept is not part of this test. In other words, none of the independent variables (Lot size, Trees, and Distance) is linearly related to the dependent variable (Price).

**Alternative Hypothesis (H1)**: The alternative hypothesis is that the model does have predictive power, meaning that at least one of the slope coefficients is not equal to zero. In this case, at least one of the independent variables is linearly related to the dependent variable.

In [30]:
if F_statistic > critical_f_value:
  print(f'Nullhypothesis H0 rejected due to F-value({F_statistic}) being higher than the critical F-value({critical_f_value})')
else:
  # A test never "accepts" H0; it only fails to reject it
  print(f'Nullhypothesis H0 not rejected due to F-value({F_statistic}) not exceeding the critical F-value({critical_f_value})')

Nullhypothesis H0 rejected due to F-value(5.974883085016509) being higher than the critical F-value(2.7694309320231345)


**Result of hypothesis test:**<br>
Since our calculated F-value is higher than the critical value, we have sufficient evidence to reject the null-hypothesis. This suggests that at least one of the independent variables has a significant effect on the dependent variable.
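The objective also asks about the p-value of the test statistic, which the critical-value comparison does not show directly. A minimal sketch, using the F-statistic and degrees of freedom already computed above and the survival function `f.sf` from `scipy.stats`:

```python
from scipy.stats import f

# Values taken from the earlier cells of this notebook.
F_value = 5.974883085016509
df_numerator = 3             # k, the number of predictors
df_denominator = 60 - 3 - 1  # n - k - 1

# P(F >= F_value) under H0: the area in the right tail of the F distribution.
p_value = f.sf(F_value, df_numerator, df_denominator)

print(f"p-value of the F-test: {p_value:.6f}")
```

The p-value is the probability of seeing an F-statistic at least this large if all slope coefficients were truly zero. A value below the chosen significance level (0.05 here) leads to the same conclusion as the critical-value comparison: reject H0.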

##**Objective f)**<br>
Interpret each of the coefficients

We can simply show the variable names like so:<br>
(I add [:-1] to not show the last index, which is the Intercept variable we added)

In [31]:
X.columns[:-1]

Index(['Lot size', 'Trees', 'Distance'], dtype='object')

Then we can show the coefficients_array we created earlier that contains our variable coefficients.

In [32]:
coefficients_array

array([ 0.69990446,  0.67881312, -0.3783608 ])

Alternatively we can show it in a more nicely formatted way;

In [33]:
print(f'Coefficient for {X.columns[0]}: {coefficients_array[0]}\n') # \n to add space between the prints
print(f'Coefficient for {X.columns[1]}: {coefficients_array[1]}\n') # \n to add space between the prints
print(f'Coefficient for {X.columns[2]}: {coefficients_array[2]}')

Coefficient for Lot size: 0.6999044587005427

Coefficient for Trees: 0.6788131166439563

Coefficient for Distance: -0.3783607979932825


**Interpretation of Lot size coefficient**: For each one-unit increase in Lot size (the sample values of roughly 20 to 60 suggest the unit is thousands of square feet), and while holding Trees and Distance constant, we expect Price to increase by β1 units. <br>
**Example**: Since β1 is 0.6999, on average, for every additional unit of Lot size, the Price is expected to increase by 0.6999 units, assuming Trees and Distance remain constant.<br><br>

**Interpretation of Trees coefficient**: For each one-unit increase in Trees (i.e. one more mature tree), and while holding Lot size and Distance constant, we expect Price to increase by β2 units. <br>
**Example**: Since β2 is 0.6788, on average, for every additional mature tree on the plot, the Price is expected to increase by 0.6788 units, assuming Lot size and Distance remain constant. <br><br>

**Interpretation of Distance coefficient**: For each one-unit increase in Distance (i.e. one foot farther from the lake), and while holding Lot size and Trees constant, we expect Price to change by β3 units.<br>
**Example**: Since β3 is -0.3784, on average, for every additional foot of distance from the lake, the Price is expected to decrease by 0.3784 units, assuming Lot size and Trees remain constant.

##**Objective g)** <br>
Test to determine whether each of the independent variables is linearly related to the price of the plot in the model.

Hypothesis test

- H0: The independent variable has no linear relationship with the dependent variable.
- H1: The independent variable has a linear relationship with the dependent variable.

`stats.pearsonr` returns the Pearson correlation coefficient r and the p-value for the test of no correlation, so the first value printed below is r, not a t-statistic. Note also that this tests each variable against Price on its own; it does not hold the other predictors constant the way a t-test on each coefficient of the multiple regression would.

In [34]:
r_lot_size, p_value_lot_size = stats.pearsonr(df['Lot size'], df['Price'])
# Print the results for Lot size
print("Lot size vs. Price:")
print(f" - correlation coefficient r: {r_lot_size:.4f}")
print(f" - p-value: {p_value_lot_size:.4f}")

Lot size vs. Price:
 - correlation coefficient r: 0.3035
 - p-value: 0.0184


Since the p-value is **lower** than our significance level (0.05 or 5%) we **can conclude** that there is a linear relationship between this independent variable and the dependent one. **We reject the null hypothesis for Lot size**

In [35]:
r_trees, p_value_trees = stats.pearsonr(df['Trees'], df['Price'])
# Print the results for Trees
print("Trees vs. Price:")
print(f" - correlation coefficient r: {r_trees:.4f}")
print(f" - p-value: {p_value_trees:.4f}")

Trees vs. Price:
 - correlation coefficient r: 0.3891
 - p-value: 0.0021


Since the p-value is **lower** than our significance level (0.05 or 5%) we **can conclude** that there is a linear relationship between this independent variable and the dependent one. **We reject the null hypothesis for Trees**

In [36]:
r_distance, p_value_distance = stats.pearsonr(df['Distance'], df['Price'])
# Print the results for Distance
print("Distance vs. Price:")
print(f" - correlation coefficient r: {r_distance:.4f}")
print(f" - p-value: {p_value_distance:.4f}")

Distance vs. Price:
 - correlation coefficient r: -0.2326
 - p-value: 0.0737


Since the p-value is **higher** than our significance level (0.05 or 5%) we **can not conclude** that there is a linear relationship between this independent variable and the dependent one. **We fail to reject the null hypothesis for Distance**
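The tests above look at one variable at a time. Since the objective asks about each variable "in the model", a t-test on each regression coefficient may be closer to what is wanted. Below is a minimal sketch on made-up data (the variables `x_a`, `x_b` and `y_toy` are invented for illustration). It uses the standard result that the estimated covariance matrix of the coefficients is $MSE \cdot (X^\top X)^{-1}$, with $t = b_i / s_{b_i}$ and $n - k - 1$ degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs = 60

# Made-up data: y depends strongly on x_a and not at all on x_b.
x_a = rng.normal(size=n_obs)
x_b = rng.normal(size=n_obs)
y_toy = 2.0 + 3.0 * x_a + rng.normal(scale=0.5, size=n_obs)

# Design matrix with the intercept column last, as in this notebook.
X_design = np.column_stack([x_a, x_b, np.ones(n_obs)])
k_toy = X_design.shape[1] - 1
beta, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

# Mean square error with n - k - 1 degrees of freedom.
dof = n_obs - k_toy - 1
mse = np.sum((y_toy - X_design @ beta) ** 2) / dof

# Standard errors are the square roots of the diagonal of MSE * (X'X)^-1.
se_beta = np.sqrt(np.diag(mse * np.linalg.inv(X_design.T @ X_design)))

# Two-tailed t-test of H0: coefficient = 0, for each coefficient.
t_stats = beta / se_beta
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, b, t, p in zip(["x_a", "x_b", "Intercept"], beta, t_stats, p_values):
    print(f"{name:>9}: coef = {b:.4f}, t = {t:.3f}, p = {p:.4f}")
```

Applied to this notebook, `X_design` would be `X.to_numpy()` and `y_toy` would be `y.to_numpy()`; the results are not shown here because they have not been computed.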

##**Objective h)** <br>
Predict with 90% confidence the selling price of a 40,000 square foot plot that has 50 mature trees and is 25 feet away from the lake

To keep things simple, we reuse the coefficients stored in our earlier variable:

In [37]:
coefficients

array([ 0.69990446,  0.67881312, -0.3783608 , 51.39121643])

In [38]:
coeff_lot_size = coefficients[0]
coeff_Trees = coefficients[1]
coeff_Distance = coefficients[2]
intercept = coefficients[3]

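We plug in the values from the assignment objective as their own variables. Note that the Lot size values shown by `df.head()` lie between 21.3 and 57.7, which suggests the column is recorded in thousands of square feet. If that is the case, a 40,000 square foot plot should be entered as `lot_size = 40`; entering 40000, as done below, extrapolates far outside the sample and inflates the predicted price accordingly.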

In [39]:
lot_size = 40000
trees = 50
distance = 25

In [40]:
predicted_price = intercept + coeff_lot_size * lot_size + coeff_Trees * trees + coeff_Distance * distance
predicted_price

28072.05120033259

We use the sse variable from earlier to calculate standard error of estimate (SE)

In [41]:
SE = np.sqrt(sse / (n - k - 1))

We set the significance level and calculate our critical value for a two-tailed interval (alpha / 2)

In [42]:
alpha = 0.10  # 10% significance level
critical_value = stats.t.ppf(1 - alpha / 2, df=n - k - 1)

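Calculate margin of error (ME). Note that critical value × SE is a simplification: the full prediction interval for a single plot multiplies SE by $\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}$, where $x_0$ is the row of predictor values (including the 1 for the intercept), so the interval computed here is too narrow.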

In [43]:
ME = critical_value * SE

In [44]:
# Calculate the lower and upper bounds of the confidence interval
lower_bound = predicted_price - ME
upper_bound = predicted_price + ME

In [45]:
print(f"Predicted Price: {predicted_price} \n")
print(f"90% Confidence Interval: ({lower_bound}, {upper_bound})")

Predicted Price: 28072.05120033259 

90% Confidence Interval: (28004.742999795257, 28139.359400869922)


##**Objective i)**<br>
Estimate with 90% confidence the average selling price of 50,000 square foot plots that have ten mature trees and are 75 feet from the lake

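We repeat the process from objective h), reusing many of the variables we created. Note that an interval for the *average* price should be narrower than the one for a single plot: it multiplies SE by $\sqrt{x_0^\top (X^\top X)^{-1} x_0}$, without the added 1. Reusing ME from objective h), as done below, therefore does not give the correct margin for the mean. Also, if Lot size is recorded in thousands of square feet, as the sample values of roughly 20 to 60 suggest, 50,000 square feet should be entered as 50.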

In [46]:
lot_size = 50000
trees = 10
distance = 75

In [47]:
predicted_price = intercept + coeff_lot_size * lot_size + coeff_Trees * trees + coeff_Distance * distance
predicted_price

35025.025222772594

In [48]:
# Calculate the lower and upper bounds of the confidence interval
lower_bound = predicted_price - ME
upper_bound = predicted_price + ME

In [49]:
print(f"Predicted Price: {predicted_price} \n")
print(f"90% Confidence Interval: ({lower_bound}, {upper_bound})")

Predicted Price: 35025.025222772594 

90% Confidence Interval: (34957.71702223526, 35092.33342330993)
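For comparison, here is a sketch of how the two kinds of interval could be computed with the leverage term $x_0^\top (X^\top X)^{-1} x_0$ included. The helper `regression_intervals` and the data are made up for illustration; it assumes the intercept column is part of the design matrix, as in this notebook.

```python
import numpy as np
from scipy import stats

def regression_intervals(X_design, y, x_new, alpha=0.10):
    """Return (prediction, prediction interval, confidence interval for the mean)."""
    n, p = X_design.shape                     # p = k + 1 (predictors plus intercept)
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    dof = n - p                               # same as n - k - 1
    se = np.sqrt(np.sum((y - X_design @ beta) ** 2) / dof)

    # Leverage of the new point: grows the further x_new lies from the sample.
    leverage = x_new @ np.linalg.inv(X_design.T @ X_design) @ x_new
    t_crit = stats.t.ppf(1 - alpha / 2, dof)

    y_new = x_new @ beta
    me_single = t_crit * se * np.sqrt(1 + leverage)   # one new observation
    me_mean = t_crit * se * np.sqrt(leverage)         # average of all such observations
    return (y_new,
            (y_new - me_single, y_new + me_single),
            (y_new - me_mean, y_new + me_mean))

# Made-up data with three predictors and 60 observations.
rng = np.random.default_rng(7)
n_obs = 60
X_pred = rng.uniform(0, 10, size=(n_obs, 3))
y_toy = 5.0 + X_pred @ np.array([1.0, 0.5, -0.8]) + rng.normal(scale=2.0, size=n_obs)
X_design = np.column_stack([X_pred, np.ones(n_obs)])

# New point: three predictor values followed by 1 for the intercept.
x_new = np.array([4.0, 5.0, 2.5, 1.0])
y_new, pred_interval, mean_interval = regression_intervals(X_design, y_toy, x_new)

print(f"Prediction: {y_new:.2f}")
print(f"90% prediction interval (single value): {pred_interval}")
print(f"90% confidence interval (mean value):   {mean_interval}")
```

The prediction interval is always the wider of the two, because it adds the variability of a single observation on top of the uncertainty in the estimated mean.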


#**I know that was a lot to process, but now it's your turn:**

##**Assignment Xr17-02**<br>
Pat Statsdud, a student ranking near the bottom of the statistics class, decided that a certain amount of studying could actually improve final grades. However, too much studying would not be warranted because Pat's ambition (if that's what one could call it) was to ultimately graduate with the absolute minimum level of work. Pat was registered in a statistics course that had only 3 weeks to go before the final exam and for which the final grade was determined in the following way:<br><br>
Total mark = 20% (Assignment) + 30% (Midterm test) + 50% (Final exam) <br><br>
To determine how much work to do in the remaining 3 weeks, Pat needed to be able to predict the final exam mark on the basis of the assignment mark (worth 20 points) and the midterm mark (worth 30 points). Pat's marks on these were 12/20 and 14/30 respectively. Accordingly, Pat undertook the following analysis. The final exam mark, assignment mark and midterm mark for 30 students who took the statistics course last year were collected.

First load in the data; <br>
link: https://github.com/saoter/AQM2023/raw/main/Workshop%207/data/Xr17-02.xlsx

**a)** Determine the regression equation

**b)** What is the standard error of the estimate? Briefly describe how you interpret this statistic

**c)** What is the coefficient of determination? (R-squared) What does this statistic tell you?

**d)** Test the validity of the model

**e)** Interpret each of the coefficients

**f)** Can Pat infer that the assignment mark is linearly related to the final grade in this model?

**g)** Can Pat infer that the midterm mark is linearly related to the final grade in this model?

**h)** Predict Pat's final exam mark with 95% confidence

**i)** Predict Pat's final exam mark with 90% confidence

##**Assignment Xr17-03 (for the quick ones)**: <br>
The CEO of a company that manufactures dry-wall wants to analyze the variables that affect demand for his product. Drywall is used to construct walls in houses and offices. The CEO decides to develop a regression model in which the dependent variable is monthly sales of drywall (in hundreds of 4x8 sheets) and the independent variables are as follows:
- Number of building permits issued in the country
- Five year mortgage rates (in percentage points)
- Vacancy rate in apartments (in percentage points)
- Vacancy rate in office buildings (in percentage points) <br>
To estimate a regression model he took monthly observations from the past 2 years.

Link to data: https://github.com/saoter/AQM2023/raw/main/Workshop%207/data/Xr17-03.xlsx

Load in the data

**a)** Analyse the data using multiple regression

**b)** What is the standard error of the estimate? Can you use this statistic to assess the model's fit? If so, how?

**c)** What is the coefficient of determination? (R-squared), and what does it tell you about the regression model?

**d)** Test the overall validity of the model

**e)** Interpret each of the coefficients

**f)** Test to determine whether each of the independent variables is linearly related to drywall demand in this model

**g)** Predict next month's drywall sales with 95% confidence if the number of building permits is 50, the 5 year mortgage rate is 9.0% and the vacancy rates are 3.6% in apartments and 14.3% in office buildings