# DSCI 100 Review Session 2 Worksheet

### Loading relevant packages for notebook

```
# I highly recommend using the DSCI 100 JupyterLab (access it through Canvas), since its packages are already up to date.
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6) # limits output of data frames to 6 rows
```

## Chapter 6: Classification I (Training and Predicting)

### 6.0 Important packages for chapter 6
___

* `tidymodels`
    * The K-nearest neighbours algorithm is implemented in the `parsnip` package, which is included in the `tidymodels` package collection.
    * The `tidymodels` package collection also provides `workflow()`, which chains together preprocessing and modelling steps (see 6.4).

### 6.1 Classification and Training Sets
___

* Classification is predicting a **categorical class** (sometimes called a label) for an observation given its other **quantitative variables** (sometimes called features). 
* Generally, a classifier assigns an observation (e.g. a new patient) to a class (e.g. diseased or healthy) on the basis of how similar it is to other observations for which we know the class (e.g. previous patients with known diseases and symptoms).
* These observations with known classes that we use as a basis for prediction are called a training set.
* We call them a “training set” because we use these observations to train, or teach, our classifier so that we can use it to make predictions on new data that we have not seen previously.

> **Training Data/Set:** It is a collection of observations for which we know the true classes. It can be used to explore and build our classifier.

### 6.3 K-Nearest Neighbours Algorithm
___

In order to classify a new observation using a K-nearest neighbour classifier, we have to:

1) Compute the distance between the new observation and each observation in the training set  
2) Sort the data table in ascending order according to the distances  
3) Choose the top K rows of the sorted table  
4) Classify the new observation based on a majority vote of the neighbour classes  
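
The four steps above can be sketched by hand in base R. This is only an illustration: the training values, the new observation, and K = 3 are all made up, and straight-line (Euclidean) distance is assumed.

```r
# Hypothetical training set: two standardized predictors and a known class
train <- data.frame(
  Perimeter = c(-1.0, -0.5, 0.2, 0.4, 1.0, 1.5),
  Concavity = c( 3.0,  4.0, 3.4, 0.5, 0.2, 3.9),
  Class     = c("B", "M", "M", "B", "B", "M")
)
new_obs <- c(Perimeter = 0, Concavity = 3.5)  # hypothetical new observation
k <- 3

# 1) distance between the new observation and each training observation
train$dist <- sqrt((train$Perimeter - new_obs[["Perimeter"]])^2 +
                   (train$Concavity - new_obs[["Concavity"]])^2)

# 2) sort the table in ascending order of distance
sorted <- train[order(train$dist), ]

# 3) keep the top K rows
neighbours <- head(sorted, k)

# 4) majority vote among the neighbour classes
votes <- table(neighbours$Class)
predicted_class <- names(votes)[which.max(votes)]
predicted_class
```

In practice the `tidymodels` functions in the rest of this section do all of this for you; the sketch is only meant to make the algorithm concrete.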

#### **<u>Example Code:</u>**

```
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>  #(1)
  set_engine("kknn") |>                                                      #(2)
  set_mode("classification")                                                 #(3)

knn_fit <- fit(knn_spec, target_variable ~ predictor_variables, data = df)   #(4)

new_obs <- tibble(Perimeter = 0, Concavity = 3.5)                            #(5)
predict(knn_fit, new_obs)                                                    #(6)
```

1) We create a model specification for K-nearest neighbours classification by calling the `nearest_neighbor()` function.  
   Here we specify that we want to use K = 5 neighbours.  
   The `weight_func` argument controls how neighbours vote when classifying a new observation.  
   Setting it to `"rectangular"` gives each of the K nearest neighbours exactly 1 vote, as described above.  
   (It controls the voting weights, not how the distance is measured.)

2) `set_engine()` specifies the computational engine that will be used to train the model.  
   In this case, we specify the `"kknn"` engine.

3) `set_mode()` specifies what type of problem this is.  
   In this case, it's a `"classification"` problem.  

4) In order to fit the model on the breast cancer data, we pass the model specification as the 1st argument.  
   The 2nd argument is a formula that specifies the target variable and the variables used to make the prediction.  
   (Note: if you want to use all the other variables/columns to predict, then type `target_variable ~ .` )  
   The 3rd argument is the data frame being used.  
   The fit object lists the function that trains the model as well as the “best” settings for the number of neighbours and weight function.  

5) `new_obs <- tibble(x_col = ..., y_col = ...)` creates a new observation. Its column names must match the predictor variables used to fit the model.

6) `predict()` uses the fitted model to predict the class of the new observation.

#### 6.3.1 Common problems using K-NN Classification

1) **Varying scales of each variable**  
When using K-nearest neighbour classification, the scale of each variable matters: variables with a large scale can have a much greater (unwanted) effect on the distances than variables with a small scale.

2) **Class Imbalance**  
Another potential issue in a data set for a classifier is class imbalance, i.e., when one label is much more common than another.  
If there are many more data points with one label overall, the algorithm is more likely to pick that label in general.


#### 6.3.2 Solution to these problems

1) **Scaling and Centering**  
When all variables in a data set have a mean (center) of 0 and a standard deviation (scale) of 1, we say that the data have been standardized.  
As a rule of thumb, standardizing your data should be a part of the preprocessing you do before any predictive modelling / analysis.

2) **Balancing**  
Rebalance the data by oversampling the rare class.  
We replicate rare observations multiple times in our data set to give them more voting power in the K-nearest neighbour algorithm.   

#### 6.3.3 Data Preprocessing

> **Note:** 
>* Scaling & Centering and Balancing are part of preprocessing the data
>* In the `tidymodels` framework, data preprocessing is done by using a Recipe.
>* `prep()` and `bake()` are used in conjunction with `recipe()` to preprocess data (e.g. centering and scaling data). 

**Explanation of the `prep()`, `bake()` workflow:**
- `prep()` calculates the standard deviations and means required to scale and center the data. If you print the recipe before calling `prep()`, it only lists the preprocessing steps it will take.
- `bake()` applies the results of `prep()` to the data. 
- You might be wondering, "why are these two separate functions, then?". Well, you might want to calculate the standard deviations and means for one data set and use those numbers to scale a DIFFERENT data set.
- For example, you might want to find the means and standard deviations for the training data set and use them to scale the testing data set. 
- This matters because training our model, or even standardizing our data, using the test data jeopardizes the validity of the test data and violates the golden rule of machine learning: never use any part of the test data to help make your model.
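
A minimal sketch of this train-then-apply pattern is shown below. The two small data frames are invented purely for illustration; only the `recipe()`, `step_scale()`, `step_center()`, `prep()` and `bake()` calls come from the worksheet.

```r
library(tidymodels)

# Hypothetical training and test sets (values invented for illustration)
train_df <- tibble(
  Class      = factor(c("B", "M", "B", "M")),
  Area       = c(400, 1200, 500, 900),
  Smoothness = c(0.08, 0.12, 0.09, 0.11)
)
test_df <- tibble(
  Class      = factor(c("B", "M")),
  Area       = c(450, 1000),
  Smoothness = c(0.085, 0.10)
)

# prep() estimates the means and standard deviations from the training data only
trained_recipe <- recipe(Class ~ ., data = train_df) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors()) |>
  prep()

# bake() applies those training-set numbers to whichever data set it is given
scaled_train <- bake(trained_recipe, train_df)
scaled_test  <- bake(trained_recipe, test_df)
```

Here `scaled_test` is standardized with the training means and standard deviations, so nothing about the test set leaks into the preprocessing.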

#### 6.3.4 Scaling and Centering

* When all variables in a data set have a mean (center) of 0 and a standard deviation (scale) of 1, we say that the data have been standardized.
* As a rule of thumb, standardizing your data should be a part of the preprocessing you do before any predictive modelling / analysis.

##### **<u>Example Code:</u>**

```
udf_recipe <- recipe(target_col ~ ., df) |>    #(1)
  step_scale(all_predictors()) |>              #(2)
  step_center(all_predictors()) |>             #(3)
  prep()                                       #(4)

scaled_df <- bake(udf_recipe, df)              #(5)
```

1) `recipe()` creates a Recipe for Preprocessing Data. Here we specify the target column/variable, and all other variables are predictors. (udf stands for unscaled dataframe)

2) `step_scale()` scales numeric data. `all_predictors()` applies it to all the predictor variables/columns.

3) `step_center()` centers numeric data.

4) `prep()` function finalizes the recipe by using the data to compute anything necessary to run the recipe (in this case, the column means and standard deviations).

5) `bake()` function applies the prepared recipe to the data frame, returning the scaled and centered data.


#### 6.3.5 Balancing

* Rebalance the data by oversampling the rare class.
* We will replicate rare observations multiple times in our data set to give them more voting power in the K-nearest neighbour algorithm.
* In order to do this, we add an oversampling step to a recipe with the `step_upsample()` function from the `themis` package.

##### **<u>Example Code:</u>**

```
library(themis)  # provides step_upsample()

ups_recipe <- recipe(target_col ~ ., data = df) |>                   #(1)
  step_upsample(target_col, over_ratio = n, skip = FALSE) |>         #(2)
  prep()                                                             #(3)

upsampled_df <- bake(ups_recipe, df)
```

1) `recipe()` creates a recipe for preprocessing data

2) `step_upsample()`
    * oversamples the minority class so that its frequency matches that of the majority class.
    * 1st argument selects the `target_col`.
    * `over_ratio` is a numeric value for the desired ratio of minority-to-majority frequencies. (DEFAULT: `over_ratio = 1`, i.e. equal frequencies)
    * `skip = FALSE` makes `bake()` apply the oversampling step. The data frame itself is passed to `recipe()` and `bake()`, not to `step_upsample()`.

3) `prep()` function finalizes the recipe by using the data to compute anything necessary to run the recipe

### 6.4 `workflow()`
___

>* We’re going to use this recipe in a `workflow()`, so we don’t need to worry about whether to `prep()` or not. 
>* If you want to explore what the recipe is doing to your data:
>    * You can first `prep()` the recipe to estimate the parameters needed for each step.
>    * Then `bake(new_data = NULL)` to pull out the training data with those steps applied.

##### **<u>Example Code:</u>**
```
# load the unscaled cancer data and make sure the target Class variable is a factor
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |> 
  mutate(Class = as_factor(Class))

# create the KNN model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |> 
  set_engine("kknn") |>
  set_mode("classification")

# create the centering / scaling recipe
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |> 
  step_scale(all_predictors()) |> 
  step_center(all_predictors())

knn_fit <- workflow() |> 
  add_recipe(uc_recipe) |> 
  add_model(knn_spec) |> 
  fit(data = unscaled_cancer)

knn_fit

# hypothetical new observation; the column names must match the predictors in the recipe
new_obs <- tibble(Area = 500, Smoothness = 0.075)
knnPred <- predict(knn_fit, new_obs)
```

#### 6.4.1 Advantage of using `workflow()`

- This is a simple way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps.
- We did not use the select function to extract the relevant variables from the data frame, and instead simply specified the relevant variables to use via the formula `Class ~ Area + Smoothness` (instead of `Class ~ .`) in the recipe.
- You will also notice that we did not call `prep()` on the recipe; this is unnecessary when it is placed in a workflow.
- We do not include a formula in the fit function. This is again because we included the formula in the recipe, so there is no need to respecify it.

## Chapter 7: Classification II (Evaluation and Tuning)

### 7.1 Common functions we may use in this chapter
___

* `bind_cols(col_object, df)`
    * binds the column/vector in argument 1 to dataframe in argument 2

* `rename(df, new_col_name = old_col_name)`
    * renames column name

### 7.2 Measures to assess the classifier
___

1) **Prediction Accuracy**:
$\frac{\text{total number of correct predictions}}{\text{total number of predictions}}$  
2) **Precision**  
3) **Recall**  
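
Precision and recall are listed above without formulas. The standard definitions are given below; they assume that one of the labels has been chosen as the "positive" class (which label counts as positive is a choice made by the analyst, not something stated in this worksheet).

$$\text{precision} = \frac{\text{number of correct positive predictions}}{\text{total number of positive predictions}}$$

$$\text{recall} = \frac{\text{number of correct positive predictions}}{\text{total number of positive test set observations}}$$

For example, if a classifier makes 3 positive predictions of which 2 are correct, and the test set contains 4 positive observations, then precision is $2/3$ and recall is $2/4$.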

### 7.3 Steps to assess the classifier
___

##### **(1) <u>Create the TRAIN SET and TEST SET</u>**
- The training set is typically between 50% and 95% of the data  
- The test set is the remaining 5% to 50% of the data  
- You want to trade off between:
    - training an accurate model (by using a larger **training** data set)
    - getting an accurate evaluation of its performance (by using a larger **test** data set).

- `initial_split(df, prop = ..., strata = target_column)`
    - 2nd argument is the proportion you want for training (e.g. 0.75)
    - 3rd argument is the column name of the target variable.
    - use `set.seed()` for reproducible results as `initial_split()` randomly samples from df.
    - use `training(split_object)` & `testing(split_object)` to assign the training and test sets to respective reference objects.
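
A short sketch of the split is given below. The built-in `iris` data set and the seed value are stand-ins chosen only for illustration; with your own data, replace `iris` and `Species` with your data frame and target column.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the random split is reproducible

# 75% of the rows go to training, stratified by the target variable
iris_split <- initial_split(iris, prop = 0.75, strata = Species)
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

nrow(iris_train)
nrow(iris_test)
```

Stratifying by the target keeps roughly the same proportion of each class in the training and test sets.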

##### **(2) <u>Pre-Process the data</u>**
- As we mentioned last chapter, K-NN is sensitive to the scale of the predictors, and so we should perform some preprocessing to standardize them.
- We should create the standardization preprocessor **using only the training data**.  
(This ensures that our test data does not influence any aspect of our model training)
- Once we have created the standardization preprocessor, we can then apply it **separately** to both the **training** and **test** datasets.

##### **(3) <u>Train the Classifier</u>**
- Create the K-nearest neighbour classifier with **only** the **training set**.  
(Use `set.seed()` here as well: in the K-nearest neighbour algorithm, if there is a tie for the majority neighbour class, the winner is randomly selected.)

##### **(4) <u>Create the labels in the Test set</u>**
- Predict the class labels for our **test set** using the `predict()` function
- use the `bind_cols()` to add the column of predictions to the original test data creating the predictions dataframe.


##### **(5) <u>Compute the accuracy</u>**
- To assess classifier's accuracy, we use the `metrics()` function.
- `metrics(df, truth = target_col_name, estimate = .pred_class)`
    - 2nd argument takes the name of the target variable/column
    - 3rd argument takes the name of the column with the predictions

- We can also look at the *confusion matrix* for the classifier, which shows the table of predicted labels and correct labels, using `conf_mat()`.

In a confusion matrix, the columns represent the actual class and the rows represent the predicted class (or vice versa). Shown below is an example of a confusion matrix.

|                  |          |  Actual Values |                |
|:----------------:|----------|:--------------:|:--------------:|
|                  |          |    Positive    |    Negative    |
|**Predicted Value**  | Positive |  True Positive | False Positive|
|                  | Negative | False Negative | True Negative  |


- A **true positive** is an outcome where the model correctly predicts the positive class.
- A **true negative** is an outcome where the model correctly predicts the negative class.
- A **false positive** is an outcome where the model incorrectly predicts the positive class.
- A **false negative** is an outcome where the model incorrectly predicts the negative class.

<br>

We can create a confusion matrix by using the `conf_mat` function. Similar to the `metrics` function, you will have to specify the `truth` and `estimate` arguments.

- `conf_mat(df, truth = Class, estimate = .pred_class)`
    - 2nd argument takes the name of the target variable/column
    - 3rd argument takes the name of the column with the predictions
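
The sketch below applies `metrics()` and `conf_mat()` to a small, made-up predictions data frame. In a real analysis this data frame would come from `predict()` plus `bind_cols()`; the labels here are invented so that the numbers can be checked by hand.

```r
library(tidymodels)

# Hypothetical predictions data frame: true labels next to predicted labels
predictions <- tibble(
  Class       = factor(c("M", "M", "M", "B", "B", "B", "B", "B"), levels = c("M", "B")),
  .pred_class = factor(c("M", "M", "B", "B", "B", "B", "M", "B"), levels = c("M", "B"))
)

# accuracy: 6 of the 8 predictions match the true label
acc <- predictions |>
  metrics(truth = Class, estimate = .pred_class) |>
  filter(.metric == "accuracy")

# confusion matrix: rows are predictions, columns are the true labels
cm <- predictions |>
  conf_mat(truth = Class, estimate = .pred_class)

acc
cm
```

Treating `"M"` as the positive class, the table contains 2 true positives, 1 false positive, 1 false negative and 4 true negatives.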

### 7.4 Tuning the model
___

- Predictive models in statistics and machine learning have parameters that you have to pick.
- For example, in the K-nearest neighbour classification algorithm we have had to pick the number of neighbours K for the class vote.
- Choosing the best value for such a parameter is called **tuning** the model.

#### 7.4.1 Cross Validation Method

- Instead of relying on a single random split of the training data, we want each observation in the training data to be used in a validation set exactly once.
- This strategy is called cross-validation.
- In cross-validation, we split our overall **Training data** into $V$ evenly-sized chunks/folds.
- We then iteratively use 1 chunk as the **Validation set** and combine the remaining $V-1$ chunks as the **Training (sub)set**.


$$\text{Cross-validation accuracy} = \frac{\sum_{v=1}^{V}{\text{accuracy of fold } v}}{V}$$
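
One possible way to tune K with cross-validation in `tidymodels` is sketched below. The built-in `iris` data set, the seed, the choice of 5 folds and the grid of K values are all assumptions made for illustration, and the `kknn` package must be installed.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed for reproducible splits and folds

# iris is a built-in data set, used here only as a stand-in
iris_split <- initial_split(iris, prop = 0.75, strata = Species)
iris_train <- training(iris_split)

iris_recipe <- recipe(Species ~ ., data = iris_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# neighbors = tune() marks K as the parameter to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# split the training data into V = 5 folds
iris_vfold <- vfold_cv(iris_train, v = 5, strata = Species)

# candidate values of K to try
k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# fit on each fold for each K and average the validation accuracies
cv_results <- workflow() |>
  add_recipe(iris_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = iris_vfold, grid = k_vals, metrics = metric_set(accuracy)) |>
  collect_metrics()

# K with the highest cross-validation accuracy
best_k <- cv_results |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)
```

The `mean` column of `cv_results` is the cross-validation accuracy from the formula above, computed once for each candidate K.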

### 7.5 Underfitting and Overfitting
___

<u>**Under-fitting:**</u>  
As we increase the number of neighbours, more and more of the training observations (and those that are farther and farther away from the point)
get a “say” in what the class of a new observation is. This causes a sort of “averaging effect” to take place, so the boundary between where
our classifier would predict a tumour to be malignant versus benign smooths out and becomes simpler.

In general, if the model isn’t influenced enough by the training data, it is said to underfit the data.


<u>**Over-fitting:**</u>  
In contrast, when we decrease the number of neighbours, each individual data point has a stronger and stronger vote regarding nearby points. 
Since the data themselves are noisy, this causes a more “jagged” boundary corresponding to a less simple model.
This is just as problematic as the large $K$ case, because the classifier becomes unreliable on new data:
if we had a different training set, the predictions would be completely different.

In general, if the model is influenced too much by the training data, it is said to overfit the data.

<center><img src="data_2/underover.JPG" width=500 height=400 /></center>

### 7.7 Advantages and Disadvantages of K-NN Classification
___

<u>**Advantages:**</u>  
* Simple and easy to understand
* No assumptions about what the data must look like
* Works easily for binary (two-class) and multi-class (> 2 classes) classification problems


<u>**Disadvantages:**</u>  
* As data gets bigger and bigger, K-nearest neighbour gets slower and slower, quite quickly
* Does not perform well with a large number of predictors
* Does not perform well when classes are imbalanced (when many more observations are in one of the classes compared to the others)

## Chapter 8: Regression I (K-NN Regression)

### 8.1 Introduction to K-NN regression
___

* Regression, like classification, is a predictive problem setting where we want to use past information to predict future observations.
* The goal is to predict numerical values instead of class labels.
* To predict a value of $Y$ for a new observation using K-NN regression, we identify the K nearest neighbours and then use the mean of their $Y$ values as the predicted value.
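
The prediction rule can be sketched in a few lines of base R. The house sizes, prices and K = 3 below are made up for illustration.

```r
# Hypothetical training data: house size (square feet) and sale price
train <- data.frame(
  sqft  = c(1000, 1500, 1800, 2000, 2500, 3000),
  price = c(150000, 200000, 230000, 250000, 310000, 380000)
)

new_sqft <- 1900  # hypothetical new observation
k <- 3

# indices of the K training rows whose size is closest to the new observation
nearest <- order(abs(train$sqft - new_sqft))[1:k]

# the prediction is the mean price of those K neighbours
predicted_price <- mean(train$price[nearest])
predicted_price
```

Here the three nearest houses are those of 1800, 2000 and 1500 square feet, so the prediction is the mean of their three prices.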


#### 8.1.1 Root Mean Squared Error (RMSE)

* Root Mean Squared Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the observed values are from the model's predictions; RMSE measures how spread out these residuals are. 

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left(\hat{y_i}-{y_i}\right)^2}}$$
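
As a small worked example of the formula, the base R sketch below computes RMSE for three made-up observations. The helper name `rmse` and the values are hypothetical.

```r
# Root mean squared error between predicted and observed values
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Hypothetical observed values and model predictions
observed  <- c(1, 4, 8)
predicted <- c(2, 4, 6)

# squared errors are 1, 0 and 4, so this is sqrt(5 / 3)
rmse_value <- rmse(predicted, observed)
rmse_value
```

The same calculation applied to test or validation data gives the RMSPE.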

#### 8.1.2 RMSE vs RMSPE

* When predicting and evaluating prediction quality on the **training data**, we say $\text{RMSE}$.
* By contrast, when predicting and evaluating prediction quality on the **testing data** or **validation data**, we say $\text{RMSPE}$.
> $\text{RMSE}$ is a measure of goodness of fit.  
$\text{RMSE}$ measures how well the model predicts on data it was trained with.  
$\text{RMSPE}$ is a measure of prediction quality.  
$\text{RMSPE}$ measures how well the model predicts on data it was not trained with.  

### 8.3 Strength and Limitations of K-NN Regression
___

<u>**Strengths:**</u>  
1) Simple and easy to understand  
2) No assumptions about what the data must look like  
3) Works well with non-linear relationships (i.e., if the relationship is not a straight line)  


<u>**Limitations:**</u>  
1) As data gets bigger and bigger, K-NN gets slower and slower, quite quickly  
2) Does not perform well with a large number of predictors unless the size of the training set is exponentially larger  
3) Does not predict well beyond the range of values input in your training data  

### 8.4 Overfitting vs Underfitting
___

<u>**Overfitting:**</u>  
Creates high variance and low bias.  
It has high variance because the flexible blue line follows the training observations very closely, 
and if we were to change any one of the training observation data points we would change the flexible blue line quite a lot. 
This means that the blue line matches the data we happen to have in this training data set. However, 
if we were to collect another training data set from the Sacramento real estate market, it likely wouldn’t match those observations as well.


<u>**Underfitting:**</u>  
Creates low variance and high bias as the blue line is extremely smooth, and almost flat.  
This happens because our predicted values for a given x value (here home size) depend on a very large number (450) of neighbouring observations.  
A model like this has low variance and high bias (intuitively, it provides very reliable, but generally very inaccurate predictions).  
It has low variance because the smooth, inflexible blue line does not follow the training observations very closely, and if we were to change any one of the training observation data points it really wouldn’t affect the shape of the smooth blue line at all.  
This means that although the blue line does not match the data we happen to have in this particular training data set perfectly, if we were to collect another training data set from the Sacramento real estate market it would likely match those observations as well as it matches those in this training data set.

<center><img src="data_2/underover.JPG" width=500 height=400 /></center>

## Chapter 9: Regression II (Linear Regression)

### 9.0 Important Packages for Chapter 9
___

* `tidymodels`
    * We can perform simple linear regression in R using tidymodels in a very similar manner to how we performed KNN regression.


### 9.1 Introduction to Linear Regression
___

- In KNN regression, we look at the K nearest neighbours and average over their values for a prediction. 
- In simple linear regression, we create a straight line of best fit through the training data and then “look up” the prediction using the line.
  A straight line is defined by an intercept $\beta_{0}$ and a slope $\beta_{1}$, so using the data to find the line of best fit is equivalent to finding the coefficients $\beta_{0}$ and $\beta_{1}$ that parametrize (correspond to) that line. 
- Simple linear regression chooses the straight line of best fit by choosing the line that minimizes the average squared vertical distance between itself and each of the observed data points in the training data.
- To assess the predictive accuracy of a simple linear regression model, we use $\text{RMSPE}$, the same measure of predictive performance we used with KNN regression.
- An additional difference from KNN regression is that we do not need to standardize (i.e., scale and center) our predictors.
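
A minimal sketch of fitting and assessing a simple linear regression with `tidymodels` is shown below. The built-in `mtcars` data set, the choice of `mpg ~ wt`, the seed and the 75% split are stand-ins for illustration only.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed for a reproducible split

# mtcars is a built-in data set, used here only as a stand-in
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# model specification: linear regression fitted with the "lm" engine
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# no scaling or centering steps are needed in the recipe
lm_recipe <- recipe(mpg ~ wt, data = car_train)

lm_fit <- workflow() |>
  add_recipe(lm_recipe) |>
  add_model(lm_spec) |>
  fit(data = car_train)

# RMSPE: the "rmse" metric computed on the test set
lm_rmspe <- lm_fit |>
  predict(car_test) |>
  bind_cols(car_test) |>
  metrics(truth = mpg, estimate = .pred) |>
  filter(.metric == "rmse") |>
  pull(.estimate)

lm_rmspe
```

Printing `lm_fit` shows the estimated intercept and slope, which correspond to $\beta_{0}$ and $\beta_{1}$.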

<p><center>
  <img src = "https://miro.medium.com/max/611/1*jopCO2kMEI84s6fiGKdXqg.png" width = "500"/>
</center></p>

### 9.3 Comparison of Linear Regression vs KNN Regression
___

<u>**Advantages of Linear Regression over KNN-regression:**</u>  
1) KNN regression does **NOT** predict well beyond the range of the predictors in the training data. Linear regression can be used to address this problem.
2) In KNN regression, the method gets significantly slower as the training dataset grows. Linear regression can be used to address this problem.
3) In linear regression, standardization does not affect the fit (it does affect the coefficients in the equation, though!)
4) A straight line can be defined by two numbers, the vertical intercept and the slope. The intercept tells us what the prediction is when all of the predictors are equal to 0; and the slope tells us what unit increase in the target/response variable we predict given a unit increase in the predictor variable. KNN regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line.

<u>**Disadvantages of Linear Regression when compared to KNN-regression:**</u>  
1) Linear regression performs poorly when the relationship between the target and the predictor is not linear, but instead some other shape (e.g. curved or oscillating). In these cases the prediction model from a simple linear regression will underfit (have high bias), meaning that the predicted values do not match the actual observed values very well. Such a model would probably have a quite high $\text{RMSE}$ when assessing model goodness of fit on the training data and a quite high $\text{RMSPE}$ when assessing model prediction quality on a test data set.

On such a data set, KNN regression may fare better.

### 9.4 Extrapolation
___

<u>**Extrapolation problems**</u>  
* Predicting outside the range of the observed data is known as extrapolation; KNN and linear models behave quite differently when extrapolating.  
* Depending on the application, the flat or constant slope trend may make more sense.  
* For example, if our housing data were slightly different, the linear model may have actually predicted a negative price for a small house (if the intercept $\beta_{0}$ was negative), which obviously does not match reality.  
* On the other hand, the trend of increasing house size corresponding to increasing house price probably continues for large houses, so the “flat” extrapolation of KNN likely does not match reality.
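
The difference in extrapolation behaviour can be seen in a small base R sketch. The training data below are made up and follow an exactly linear trend, and the hand-written `knn_predict()` helper is hypothetical.

```r
# Hypothetical training data with an exactly linear trend: y = 2x + 1
train <- data.frame(x = 1:10)
train$y <- 2 * train$x + 1

# K-NN regression by hand: mean of y over the K nearest x values
knn_predict <- function(x_new, k) {
  nearest <- order(abs(train$x - x_new))[1:k]
  mean(train$y[nearest])
}

# straight line of best fit
line_fit <- lm(y ~ x, data = train)

# predict far outside the observed range of x (1 to 10)
x_new <- 100
lm_pred  <- unname(predict(line_fit, newdata = data.frame(x = x_new)))
knn_pred <- knn_predict(x_new, k = 3)

lm_pred   # continues the linear trend
knn_pred  # stays at the mean of the three largest training points
```

The linear model keeps following the slope, while K-NN can only return an average of training values it has already seen, so its prediction goes "flat" beyond the edge of the data.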

### 9.6 Problems of Linear Regression
___

1) **Outliers**  
The problem with outliers is that they can have too much influence on the line of best fit.

2) **Collinearity problem in Multivariate linear regression**  
If collinearity between predictors is very high, then the plane of best fit will have regression coefficients that are very sensitive to the exact values in the data.  
We can design new predictors (by centering them) to tackle this problem.
