# **Assignment 3**
**Data Science Tools and Advanced Modelling Techniques**

## Student Information

**Your name (as appears on Canvas):** Wei Ning Chan

**Your student ID:** 23385923

## Marks

**Total Marks earned:** 

**Score for Full Marks:**  70

**Bonus marks:** 20

## Assignment Outline

Throughout this course's assignment series, you will get the training needed to complete the final report. With each new assignment in this course, you will master the technical skills to implement analyses (Assignment 1), train reproducible research practices (Assignment 2), and learn key concepts regarding prediction modelling (Assignment 3).

In Assignment 3, you will learn more about key concepts of prediction modelling:

- Implementation of Basic Supervised Learning Methods (KNN and Linear Regression)

- Prediction Error Sources (Underfitting vs. Overfitting)

- Prediction Performance Measures

- Predictive Performance Assessment Methods (Hold-Out Set)

- Variable Selection



## Introduction

The goal of prediction modelling is to predict an outcome variable using other available variables (i.e. predictors) for new and previously unseen subjects/units. This makes prediction modelling desirable given its applicability to many different problems (e.g., medical patient outcomes, weather, image recognition, finances, etc.), and that is why machine learning has become a very popular discipline these past few decades.

## Question 1

**Objective**: We're going to apply linear regression modelling to the Ames, Iowa housing dataset and attempt to predict Sale Price using the following variables:

- Land Slope
- Lot Area
- Overall Condition
- First Floor Surface Area (Square Feet)
- Building Type

Please run this code before proceeding and examine the variables we'll be using:

In [2]:
set.seed(200)
library(tidyverse)
Ames_Data = read_csv("../data/ames.csv")
Ames_Data=Ames_Data[,c("Sale_Price","Land_Slope","Lot_Area","Overall_Cond","First_Flr_SF","Bldg_Type")]
head(Ames_Data)

Rows: 2925 Columns: 75
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (40): MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, Land_Contour, Ut...
dbl (35): Order, Lot_Frontage, Lot_Area, Year_Built, Year_Remod_Add, Mas_Vnr...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| Sale_Price | Land_Slope | Lot_Area | Overall_Cond | First_Flr_SF | Bldg_Type |
|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` |
| 215000 | Gtl | 31770 | Average | 1656 | OneFam |
| 105000 | Gtl | 11622 | Above_Average | 896 | OneFam |
| 172000 | Gtl | 14267 | Above_Average | 1329 | OneFam |
| 244000 | Gtl | 11160 | Average | 2110 | OneFam |
| 189900 | Gtl | 13830 | Average | 928 | OneFam |
| 195500 | Gtl | 9978 | Above_Average | 926 | OneFam |


Next, we want to divide our dataset into a training set (to train the model) and a test set (to test the model's performance). Let's proceed by using 80% of the original dataset as our train set:

In [3]:
Train_SI = sample(1:nrow(Ames_Data), size=ceiling(0.8*nrow(Ames_Data)))
Test_SI = which(!(1:nrow(Ames_Data) %in% Train_SI))
Ames_Train = Ames_Data[Train_SI,]
Ames_Test = Ames_Data[Test_SI,]
paste("Number of Observations in Training Set: ", nrow(Ames_Train), sep="")
paste("Number of Observations in Test Set: ", nrow(Ames_Test), sep="")

### Part A

Explain, in your own words, why we decided to split the dataset into two parts **before** creating our prediction model.

**Hint:** Think about the problem that happens if we don't split the dataset and use the entire dataset to both build and test our model.

(5 points) for providing the appropriate reasoning.

**[Answer Here]** 
* If we don't split the dataset, the model is evaluated on the same data it was trained on, so overfitting goes undetected: performance looks good on that data but the model may generalise poorly to new, unseen data. Holding out a test set lets us check that the model's predictive power is robust and not just tailored to the particular observations in the training data.
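
To make this concrete, here is a minimal sketch on simulated data (not the Ames data; all names and numbers are illustrative assumptions). Training error can only go down as a model becomes more flexible, so it cannot reveal overfitting, whereas a held-out set gives a separate estimate to compare against.

```r
# Simulated data: a noisy linear relationship (illustrative only).
set.seed(1)
n <- 60
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 2 * sim$x + rnorm(n, sd = 3)

# Hold out 20 of the 60 observations as a test set.
train_idx <- sample(seq_len(n), size = 40)
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

# A simple model and a deliberately flexible one (degree-10 polynomial).
simple   <- lm(y ~ x, data = sim_train)
flexible <- lm(y ~ poly(x, 10), data = sim_train)

mse <- function(model, data) mean((predict(model, data) - data$y)^2)

# Compare training and test error for both models.
results <- rbind(
  simple   = c(train = mse(simple, sim_train),   test = mse(simple, sim_test)),
  flexible = c(train = mse(flexible, sim_train), test = mse(flexible, sim_test))
)
results
```

The flexible model's training error is never worse than the simple model's, because the simple model is a special case of it. Only the test column tells us whether that extra flexibility helps on unseen data.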

### Part B

Let's fit a linear regression to model Sale Price in terms of the other variables listed above in the objective. Complete the following code (replace the `...` with appropriate code):

In [4]:
SalePrice_Model = lm(Sale_Price ~ Land_Slope + Lot_Area + Overall_Cond + First_Flr_SF + Bldg_Type, data=Ames_Train)
summary(SalePrice_Model)


Call:
lm(formula = Sale_Price ~ Land_Slope + Lot_Area + Overall_Cond + 
    First_Flr_SF + Bldg_Type, data = Ames_Train)

Residuals:
    Min      1Q  Median      3Q     Max 
-234977  -33943   -5897   28280  319981 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -5.533e+04  7.426e+03  -7.452 1.29e-13 ***
Land_SlopeMod              1.206e+04  5.546e+03   2.175 0.029744 *  
Land_SlopeSev             -4.603e+04  1.707e+04  -2.696 0.007073 ** 
Lot_Area                   1.024e+00  1.724e-01   5.937 3.34e-09 ***
Overall_CondAverage        3.951e+04  3.061e+03  12.908  < 2e-16 ***
Overall_CondBelow_Average -2.277e+04  6.453e+03  -3.529 0.000425 ***
Overall_CondExcellent      4.241e+04  1.009e+04   4.204 2.73e-05 ***
Overall_CondFair          -5.252e+04  8.559e+03  -6.137 9.88e-10 ***
Overall_CondGood           7.597e+03  4.011e+03   1.894 0.058329 .  
Overall_CondPoor          -3.359e+04  1.906e+04  -1.763 0.078083 .  
Overall_Co

Interpreting Model Output: 
- What is the coefficient value for First Floor Surface Area?
- Which coefficient is not statistically significant at the 5% level?
- What is the standard error for the coefficient corresponding to "Poor" Overall Condition?

**Hint:** Recall that "e+XX" indicates how many places to move the decimal point to the right. So 5.000e+04 is 50000.

(10 points) - 4 points for correct code completion, 2 points for each question answered correctly

**[Answer Here]**
* 118.2
* Overall_CondGood (p-value: 0.058329) and Overall_CondPoor (p-value: 0.078083)
* 19060
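
As a quick check of the scientific-notation conversion, a short sketch using the standard error reported for `Overall_CondPoor` in the summary output:

```r
# Scientific notation is just a compact way of writing the same number.
hint_value <- 5.000e+04
poor_se    <- 1.906e+04   # Std. Error for Overall_CondPoor from the summary

hint_value == 50000                            # TRUE
poor_se_text <- format(poor_se, scientific = FALSE)
poor_se_text                                   # "19060"
```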

### Part C

Now we want to test the predictive performance of this linear regression model. What is an appropriate choice of performance measures to use here?

**A)** Sensitivity

**B)** Mean Squared Error

**C)** Krippendorff's Alpha

**D)** Area Under the Curve

**E)** Matthews Correlation Coefficient

**Hint:** Think about the type of outcome we are predicting.

(5 points) for correct choice.

**[Answer Here]**
* B
* Since Sale_Price is a continuous variable, MSE is a suitable measure of the predictive performance of this linear regression model.
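
For illustration, MSE averages the squared differences between actual and predicted values. A minimal sketch with made-up numbers (the helper names `mse` and `rmse` are hypothetical, not part of the assignment code):

```r
# Hypothetical helpers; toy values chosen for illustration only.
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

actual    <- c(200000, 150000, 180000)
predicted <- c(210000, 140000, 180000)

toy_mse  <- mse(actual, predicted)    # in squared dollars
toy_rmse <- rmse(actual, predicted)   # back on the dollar scale
c(MSE = toy_mse, RMSE = toy_rmse)
```

Taking the square root (RMSE) puts the error back on the same scale as Sale_Price, which is often easier to interpret.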

### Part D

Let's choose to use mean absolute error (MAE) to assess the performance of our model. Complete the below code to compute this measure for both the training set and the test set (replace the `...` with appropriate code):

In [5]:
TrainingSetError = sum(abs(predict(SalePrice_Model, Ames_Train) - Ames_Train$Sale_Price))/nrow(Ames_Train)
TestSetError = sum(abs(predict(SalePrice_Model, Ames_Test) - Ames_Test$Sale_Price))/nrow(Ames_Test)

**Hint:** Recall the expression for mean absolute error (MAE). What absolute differences are we taking and summing together?

(5 points) for correct code completion
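
The sum-divided-by-n form above is the same as taking the mean of the absolute errors. A small sketch with toy numbers (the `mae` helper is a hypothetical name introduced here):

```r
# Hypothetical helper: mean absolute error.
mae <- function(actual, predicted) mean(abs(predicted - actual))

actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 330, 400)

toy_mae      <- mae(actual, predicted)
toy_mae_sum  <- sum(abs(predicted - actual)) / length(actual)  # form used above
c(mean_form = toy_mae, sum_form = toy_mae_sum)
```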

### Part E

In an effort to improve the linear regression model's performance, a student includes many of the other available variables as predictors in the model. The student also tries out many different combinations of those predictors and selects a final model based on the best performance on the test set. Name at least one possible problem with this modelling approach, if we assume to use the same above modelling procedure (i.e. every new model is repeatedly built using the same code as above, only the model predictors change). Please explain in a couple of sentences.

(5 points, 5 bonus) - For naming one problem correctly, bonus marks as well for naming two problems correctly.

**[Answer Here]** 
* One possible problem is overfitting the model to the test set. Because the final model is chosen for its test-set performance, it can perform extremely well on the test data without being the best model for new, unseen data. 
* Another problem is data snooping bias. This occurs when the test data influences the model selection process, leading to overly optimistic performance estimates. A separate validation set is needed for model selection so that the test set stays untouched until the final assessment.
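
The optimism can be sketched with simulated data in which the outcome is pure noise, so no predictor is truly useful. Everything below (sample sizes, the 50 candidate models, the names) is an assumption for illustration, not the Ames analysis.

```r
set.seed(42)
n_train <- 100; n_test <- 50; n_new <- 5000; p <- 50

# Predictors V1..V50 and an outcome y that is unrelated to all of them.
make_data <- function(n) {
  d <- as.data.frame(matrix(rnorm(n * p), nrow = n))
  d$y <- rnorm(n)
  d
}
sim_train <- make_data(n_train)
sim_test  <- make_data(n_test)
sim_fresh <- make_data(n_new)   # stands in for truly unseen data

mae <- function(model, data) mean(abs(predict(model, data) - data$y))

# Fit one single-predictor model per candidate variable.
predictors <- paste0("V", seq_len(p))
models <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "y"), data = sim_train)
})

# Select the model with the lowest test-set MAE (the flawed procedure).
test_mae <- sapply(models, mae, data = sim_test)
best <- which.min(test_mae)

c(selected_test_mae = test_mae[best],
  average_test_mae  = mean(test_mae),
  fresh_data_mae    = mae(models[[best]], sim_fresh))
```

The selected model's test error is, by construction, the smallest of the 50, so it tends to understate the error that model would make on fresh data.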

### Part F

Let's assume that the above student is interested in performing model selection (i.e. selecting predictors), and wants to select the final model that minimizes the performance measure they're using. How can the student update their modelling pipeline so that predictive performance of the selected final model can still be assessed accurately? 

Choose **all** that apply:

**A)** Use a training/validation/test split of the dataset

**B)** Use a bootstrap validation

**C)** Use a repeated cross-validation

**D)** Use a nested cross-validation

**E)** Use a LOESS instead of linear regression

(5 points) for correct choices selected, -2 marks for each incorrect choice

**[Answer Here]**
* A, C and D.
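
As an illustration of option A, a training/validation/test split could be set up along these lines. The 60/20/20 proportions and the variable names are assumptions; the row count is the one reported when the data were read in.

```r
set.seed(200)
n <- 2925                       # number of rows reported by read_csv
idx <- sample(seq_len(n))       # random permutation of the row indices

n_train <- floor(0.6 * n)       # assumed proportions: 60% / 20% / 20%
n_valid <- floor(0.2 * n)

train_idx <- idx[1:n_train]                            # fit candidate models
valid_idx <- idx[(n_train + 1):(n_train + n_valid)]    # choose between them
test_idx  <- idx[(n_train + n_valid + 1):n]            # assess the final model once

c(train = length(train_idx), valid = length(valid_idx), test = length(test_idx))
```

Model selection would use only the validation rows, leaving the test rows untouched until the final model is chosen.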

### Part G - (BONUS)

To tackle the problem of model selection, the student decides to use a LASSO model instead of a linear regression. What is the problem with using a LASSO model to build a prediction model here? Assume that the correct predictive performance assessment pipeline is being used.

**Hint:** Look at the variable types of the predictors.

(5 bonus) for correct identification of problem

**[Answer Here]**
* One problem with using LASSO here is that it has difficulty handling categorical variables, which can lead to incorrect or suboptimal selection of predictors. In the dataset, Land_Slope, Overall_Cond and Bldg_Type are all categorical variables. LASSO requires numerical input, so these categorical variables would first need to be transformed into a suitable numerical format (for example, dummy variables) before LASSO can be used properly.
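
A minimal sketch of what that transformation looks like in base R, using a toy data frame built from a few values like those shown earlier (not the full dataset):

```r
# Toy data frame; values are illustrative.
toy <- data.frame(
  Land_Slope = factor(c("Gtl", "Mod", "Sev", "Gtl")),
  Lot_Area   = c(31770, 11622, 14267, 11160)
)

# model.matrix() expands each factor into dummy (0/1) columns.
# The first level ("Gtl") becomes the reference; [, -1] drops the intercept.
X <- model.matrix(~ Land_Slope + Lot_Area, data = toy)[, -1]
colnames(X)
X
```

Each dummy column would then be penalised separately, so LASSO could keep one level of a categorical variable while shrinking another level of the same variable to zero, which makes the selection harder to interpret.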

## Question 2

**Objective**: Next, we're going to use k-nearest neighbours (k=9) to predict Central Air Conditioning (Y/N) using the following variables:

- Ground Living Area (Square Feet)
- Lot Area
- Total Basement Surface Area (Square Feet)
- First Floor Surface Area (Square Feet)
- Garage Area

Please run this code before proceeding and examine the variables we'll be using:

In [6]:
set.seed(300)
library(tidyverse)
library(class)
Ames_Data = read_csv("../data/ames.csv")
Ames_X_Data=as.data.frame(Ames_Data[,c("Gr_Liv_Area","Lot_Area","Total_Bsmt_SF","First_Flr_SF","Garage_Area")])
Ames_Y_Data=as.vector(data.frame(Ames_Data)[,"Central_Air"])
Ames_Y_Data=factor(Ames_Y_Data, levels=c("Y", "N"))
head(Ames_X_Data)
table(Ames_Y_Data)

Rows: 2925 Columns: 75
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (40): MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, Land_Contour, Ut...
dbl (35): Order, Lot_Frontage, Lot_Area, Year_Built, Year_Remod_Add, Mas_Vnr...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| | Gr_Liv_Area | Lot_Area | Total_Bsmt_SF | First_Flr_SF | Garage_Area |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1656 | 31770 | 1080 | 1656 | 528 |
| 2 | 896 | 11622 | 882 | 896 | 730 |
| 3 | 1329 | 14267 | 1329 | 1329 | 312 |
| 4 | 2110 | 11160 | 2110 | 2110 | 522 |
| 5 | 1629 | 13830 | 928 | 928 | 482 |
| 6 | 1604 | 9978 | 926 | 926 | 470 |


Ames_Y_Data
   Y    N 
2729  196 

And then again, we want to divide our dataset into a training set (to train the model) and a test set (to test the model's performance). Let's proceed by using 80% of the original dataset as our train set:

In [7]:
Train_SI = sample(1:nrow(Ames_X_Data), size=ceiling(0.8*nrow(Ames_X_Data)))
Test_SI = which(!(1:nrow(Ames_X_Data) %in% Train_SI))

Ames_X_Train = Ames_X_Data[Train_SI,]
Ames_X_Test = Ames_X_Data[Test_SI,]

Ames_Y_Train = Ames_Y_Data[Train_SI]
Ames_Y_Test = Ames_Y_Data[Test_SI]

paste("Number of Observations in Training Set: ", nrow(Ames_X_Train), sep="")
paste("Number of Observations in Test Set: ", nrow(Ames_X_Test), sep="")

Finally, we make predictions using k-nearest neighbours (selecting k=9) on our test set using the following code:

In [8]:
KNN_Test_Preds = knn(Ames_X_Train, Ames_X_Test, Ames_Y_Train, k=9)
table(KNN_Test_Preds)

KNN_Test_Preds
  Y   N 
579   6 

### Part A

Build the confusion matrix and then calculate the sensitivity, specificity, positive predictive value and negative predictive value by completing the following code (replace the `...` with appropriate code):

In [9]:
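Con_Mat = table(Ames_Y_Test, KNN_Test_Preds)

# Rows are actual classes, columns are predicted classes; "Y" is the positive class.
Sensitivity = Con_Mat[1,1]/(Con_Mat[1, 1]+Con_Mat[1, 2])  # TP/(TP+FN)
Specificity = Con_Mat[2,2]/(Con_Mat[2, 2]+Con_Mat[2, 1])  # TN/(TN+FP)

PPV = Con_Mat[1,1]/(Con_Mat[1, 1]+Con_Mat[2, 1])  # TP/(TP+FP)
NPV = Con_Mat[2,2]/(Con_Mat[2, 2]+Con_Mat[1, 2])  # TN/(TN+FN)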

(15 points) - 3 points for each correct line of code

(5 bonus) - Use the appropriate functions in caret to compute these same measures

In [11]:
### Bonus

library(caret)

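# caret's signature is confusionMatrix(data = predictions, reference = truth).
# The true classes are passed first here, so the "Prediction" rows in the output are the actual classes.
confusion_matrix = confusionMatrix(Ames_Y_Test, KNN_Test_Preds)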
confusion_matrix

Confusion Matrix and Statistics

          Reference
Prediction   Y   N
         Y 546   1
         N  33   5
                                          
               Accuracy : 0.9419          
                 95% CI : (0.9197, 0.9594)
    No Information Rate : 0.9897          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2133          
                                          
 Mcnemar's Test P-Value : 1.058e-07       
                                          
            Sensitivity : 0.9430          
            Specificity : 0.8333          
         Pos Pred Value : 0.9982          
         Neg Pred Value : 0.1316          
             Prevalence : 0.9897          
         Detection Rate : 0.9333          
   Detection Prevalence : 0.9350          
      Balanced Accuracy : 0.8882          
                                          
       'Positive' Class : Y               
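
As a cross-check, the four measures can be recomputed by hand from the counts printed above. This is a minimal sketch: the matrix is typed in from that output with rows as actual classes (as in `table(Ames_Y_Test, KNN_Test_Preds)`) and "Y" as the positive class, and the variable names are illustrative. Because `confusionMatrix` treats its first argument as the predictions and its second as the reference, the labels it prints depend on argument order, so it is worth confirming which orientation is in use.

```r
# Counts copied from the output above; rows = actual class, columns = predicted class.
Check_Mat <- matrix(c(546, 33, 1, 5), nrow = 2,
                    dimnames = list(Actual = c("Y", "N"), Predicted = c("Y", "N")))

TP <- Check_Mat["Y", "Y"]   # actual Y, predicted Y
FN <- Check_Mat["Y", "N"]   # actual Y, predicted N
FP <- Check_Mat["N", "Y"]   # actual N, predicted Y
TN <- Check_Mat["N", "N"]   # actual N, predicted N

check <- c(
  Sensitivity = TP / (TP + FN),   # share of actual "Y" that were predicted "Y"
  Specificity = TN / (TN + FP),   # share of actual "N" that were predicted "N"
  PPV         = TP / (TP + FP),   # share of predicted "Y" that were actually "Y"
  NPV         = TN / (TN + FN)    # share of predicted "N" that were actually "N"
)
round(check, 4)
```

With this orientation the specificity is 5/38, which matches the "only 5 out of 38" figure quoted in Part C.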
                              

### Part B

Explain in one or two sentences why using the "accuracy" measure for this problem is likely a bad idea.

**Hint:** Examine the outcome more closely.

(5 points) for correct reasoning

**[Answer Here]**
* The outcome variable (Central Air Conditioning) is highly imbalanced, with the majority of instances being 'Y' (2729) and only a minority being 'N' (196). Because of this imbalance, accuracy can be misleading: a high accuracy can be achieved simply by predicting the majority class every time, without ever capturing the minority class, so it fails to measure the model's true performance.
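
A short sketch of this point, using the class counts from `table(Ames_Y_Data)` and a hypothetical "classifier" that always predicts "Y":

```r
# Class counts from table(Ames_Y_Data) above.
n_Y <- 2729
n_N <- 196

actual   <- factor(c(rep("Y", n_Y), rep("N", n_N)), levels = c("Y", "N"))
always_Y <- factor(rep("Y", length(actual)), levels = c("Y", "N"))  # ignores all predictors

baseline_accuracy    <- mean(always_Y == actual)
baseline_specificity <- sum(always_Y == "N" & actual == "N") / sum(actual == "N")

c(accuracy = baseline_accuracy, specificity = baseline_specificity)
```

The accuracy is 2729/2925, roughly 0.93, even though not a single "N" observation is identified.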

### Part C

We can see from the confusion matrix results that prediction of the "N" class using our k-nearest neighbours approach is very poor (only 5 out of 38 are correctly predicted), likely because the class is rare. Technically speaking, we can't perform any further modelling now that we've seen the test error (unless we have additional unused data left over), because we otherwise risk overfitting (showing the importance of planning out your prediction modelling). 

However, for pedagogical purposes, let's assume we anticipated a problem with predicting this minority class in our data **before** we started modelling. Name a technique we could have used on the data to potentially improve predictive performance on the minority class.

(5 points, 5 bonus) for correctly identifying one of the methods (and explaining what it does), another 5 bonus for explaining which is better to use for this specific problem.

**[Answer Here]**
* Upsampling can be used to potentially improve predictive performance on the minority class. By randomly duplicating observations from the minority class until the majority and minority classes are even or close to even, it helps alleviate the severe class imbalance and so may improve predictive performance on the minority class.
* Downsampling can also be used to potentially improve predictive performance on the minority class. By randomly removing observations from the majority class until the majority and minority classes are even or close to even, it helps alleviate the severe class imbalance and so may improve predictive performance on the minority class.
* Upsampling is better for this specific problem, as the minority class is very small compared to the majority class. Randomly removing observations from the majority class until the two classes are even (downsampling) would discard most of the data, causing a major loss of information and potentially hurting the predictive performance of the model. Thus, upsampling is more suitable here.
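
For comparison with upsampling, downsampling could be sketched as follows on toy data. The class counts, the `area` predictor and the variable names are assumptions for illustration.

```r
set.seed(300)

# Toy imbalanced training data: 90 "Y" and 10 "N".
toy_y <- factor(c(rep("Y", 90), rep("N", 10)), levels = c("Y", "N"))
toy_X <- data.frame(area = rnorm(100))

Y_Inds <- which(toy_y == "Y")
N_Inds <- which(toy_y == "N")

# Keep every minority observation and an equally sized random subset of the majority.
Keep_Y <- sample(Y_Inds, length(N_Inds))
Keep   <- c(Keep_Y, N_Inds)

toy_X_Down <- toy_X[Keep, , drop = FALSE]
toy_y_Down <- toy_y[Keep]
table(toy_y_Down)
```

Here 80 of the 90 majority observations are thrown away, which illustrates the information loss mentioned above.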

### Part D

Here is the code for the application of this technique from Part C, as well as running k-nearest neighbours (k=9) once again:

In [12]:
Y_Count = table(Ames_Y_Train)[1]
N_Count = table(Ames_Y_Train)[2]
Diff = Y_Count-N_Count
N_Inds = which(Ames_Y_Train=="N")
Repeats = sample(N_Inds, Diff, replace=TRUE)

Ames_X_Train_New = rbind(Ames_X_Train, Ames_X_Train[Repeats,])
Ames_Y_Train_New = c(Ames_Y_Train, Ames_Y_Train[Repeats])

KNN_Test_Preds = knn(Ames_X_Train_New, Ames_X_Test, Ames_Y_Train_New, k=9)
table(KNN_Test_Preds)
table(Ames_Y_Test, KNN_Test_Preds)

KNN_Test_Preds
  Y   N 
433 152 

           KNN_Test_Preds
Ames_Y_Test   Y   N
          Y 421 126
          N  12  26

We can see that the model is better at predicting the minority class, but now prediction of the majority class suffers (126 "Y" observations being predicted as "N"). This highlights the problem with k-nearest neighbours handling imbalanced outcome data. 

The `knn` function from the `class` package in R only takes the k-nearest neighbours and assigns predictions based on majority vote (ties are broken **randomly**). So if 5 of the neighbours are "N" and 4 of the neighbours are "Y", then the prediction point will be classified as "N", regardless of how close the "Y" neighbours are. 

How can we use the distances of the k-nearest neighbours of any given prediction point to potentially improve predictions? This can be a general explanation, no need for mathematical equations/expressions.

(5 points) for correct explanation

**[Answer Here]**
* We can use the distances of the k-nearest neighbours to do distance-based weighting. Instead of taking a simple majority vote among the k-nearest neighbours, we weight each vote by the neighbour's distance, so that closer neighbours have a greater influence on the prediction than those further away. For each prediction, we assign a weight to each of the k-nearest neighbours based on its distance from the point being predicted, then sum the weights for each class and predict the class with the largest total. This makes the prediction more balanced, as it is less affected by distant neighbours and more sensitive to local patterns.
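
One way to sketch this idea in base R is shown below. The function `weighted_knn_vote`, the inverse-distance weights and the toy data are all assumptions made for illustration; this is not how `class::knn` works.

```r
# Hypothetical helper: classify one point by inverse-distance-weighted voting.
weighted_knn_vote <- function(train_X, train_y, new_x, k = 9) {
  # Euclidean distance from new_x to every training point.
  d  <- sqrt(rowSums(sweep(as.matrix(train_X), 2, as.numeric(new_x))^2))
  nn <- order(d)[seq_len(k)]            # indices of the k nearest neighbours
  w  <- 1 / (d[nn] + 1e-8)              # closer neighbours get larger weights
  votes <- tapply(w, train_y[nn], sum)  # total weight per class
  names(which.max(votes))
}

# Toy example: two "Y" points very close to the query, three "N" points far away.
toy_X <- data.frame(x = c(0.1, 0.2, 5.0, 5.1, 5.2))
toy_y <- factor(c("Y", "Y", "N", "N", "N"), levels = c("Y", "N"))

majority_vote <- names(which.max(table(toy_y)))   # all 5 points are neighbours when k = 5
weighted_vote <- weighted_knn_vote(toy_X, toy_y, new_x = 0, k = 5)

c(majority = majority_vote, weighted = weighted_vote)
```

With k = 5 the unweighted vote goes to "N" (3 against 2), while the weighted vote goes to "Y" because the two "Y" neighbours are much closer to the query point.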

### Part E

As you may have noticed, we have not conducted any hyperparameter tuning of the k hyperparameter, instead we've always set k=9 (an initial, hopefully reasonable choice). However, let's consider a modelling approach where we do incorporate hyperparameter tuning to select and test our final model. Specifically, let's say that we're going to perform 20-fold cross-validation on our training set, repeated 100 times, to ultimately select the hyperparameter k with the best performance (let's say average of sensitivity and specificity).

What is one possible concern with this approach (with respect to **specifically** the outcome class variable), and how can this problem be addressed?

**Hint:** What's going to happen to the outcome distribution of each fold? Will each fold always have "N" observations?

(5 points) for correctly identifying the problem and how it can be addressed.

**[Answer Here]**
* Problem: Each fold may not contain a representative sample of the minority class, leading to inconsistent and unreliable performance estimates for the minority class, skewing the overall evaluation of the model's performance such as sensitivity and specificity.
* Solution: Use stratified k-fold cross-validation, where each fold maintains the original class distribution. This ensures that each fold contains a proportionate number of observations for each class, preserving the balance between the majority and minority classes within each fold. Stratification helps provide a more accurate and consistent evaluation of the model’s performance across different classes.
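
Stratified fold assignment can be sketched in base R by shuffling fold labels within each class separately. The class counts below are hypothetical stand-ins for a training set, and the variable names are illustrative. The `createFolds` function in caret is documented to sample within the levels of a factor outcome in a similar spirit.

```r
set.seed(300)

# Hypothetical training outcome: 2180 "Y" and 160 "N".
toy_y <- factor(c(rep("Y", 2180), rep("N", 160)), levels = c("Y", "N"))
n_folds <- 20

# Assign fold labels separately within each class so every fold
# receives its proportional share of both classes.
fold <- integer(length(toy_y))
for (cls in levels(toy_y)) {
  cls_idx <- which(toy_y == cls)
  fold[cls_idx] <- sample(rep(seq_len(n_folds), length.out = length(cls_idx)))
}

fold_table <- table(fold, toy_y)
head(fold_table)
```

With these counts every fold contains 8 "N" observations, so sensitivity and specificity can be computed in every fold. For repeated cross-validation the same assignment would simply be redrawn for each repeat.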

## Upload your work from Assignment 3

- Each student will upload the Jupyter Notebook on Canvas Course 2: https://canvas.ubc.ca/courses/155856

 `Assignment_3C2_[team #]_[student name].ipynb`

EXAMPLE: `Assignment_3C2_Team1_Nikolas_Krstic.ipynb`

- Please state in the title who was responsible for writing each paragraph. 

Upload the word document on Canvas Course 2 under:
`Assignments -> Assignment 3` 
