# K-NEAREST NEIGHBOURS (KNN) WITH WISCONSIN BREAST CANCER DIAGNOSTIC DATASET

- This exercise is adapted from [Chapter 3 of "Machine Learning with R" by Brett Lantz](https://books.google.com.tr/books?id=ZaJNCgAAQBAJ&printsec=frontcover&hl=tr&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)

- In this exercise we will utilize the Wisconsin Breast Cancer Diagnostic dataset from the UCI
Machine Learning Repository at http://archive.ics.uci.edu/ml.


- The dataset includes the measurements from digitized images of fine-needle aspirate of a breast mass. The
values represent the characteristics of the cell nuclei present in the digital image.


- The breast cancer data includes 569 examples of cancer biopsies, each with
32 features. One feature is an identification number, another is the cancer diagnosis,
and 30 are numeric-valued laboratory measurements. The diagnosis is coded as
"M" to indicate malignant or "B" to indicate benign. The other 30 numeric measurements comprise the mean, standard error, and worst (that is, largest) value for 10 different characteristics of the digitized cell nuclei.


These include:

- Radius
- Texture
- Perimeter
- Area
- Smoothness
- Compactness
- Concavity
- Concave points
- Symmetry
- Fractal dimension

Based on these names, all the features seem to relate to the shape and size of the cell
nuclei. 

## Preparing data

If you want to continue from a previously saved session state:

In [None]:
sessionfile <- "01_knn_01.RData"

if(file.exists(sessionfile)) load(sessionfile)

First of all, we load the libraries necessary for this exercise and set some useful options:

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(BBmisc) # for easy normalization of data
library(class) # for kNN classification algorithm 
library(gmodels) # for model evaluation
library(plotly) # for interactive visualization
options(warn = -1) # suppress warnings

We load the data into a data.frame/data.table "wbcd":

In [None]:
wbcd <- data.table::fread("../data/csv/01_01_wisc_bc_data.csv")

Let's view the structure of the data:

In [None]:
str(wbcd)

"diagnosis" column is of type character. The rest are numeric.

We don't need the id column, so we can drop it. Note that we use the in-place modification facility of data.table, so no reassignment is needed.

As a data.table, wbcd lets us handle column-wise operations inside the square brackets, which is more concise and efficient:

In [None]:
wbcd[, id := NULL] # drop the "id" column of the data.table in place (without assignment)

# We could also do it by position as wbcd[, 1 := NULL], but that wouldn't be "idempotent":
# erroneously executing the cell a second time would delete the next column

# wbcd <- wbcd[-1] # the usual data.frame way with assignment
#                  # (on a data.table, [-1] would drop the first row instead)

See "id" is dropped from column names:

In [None]:
# .SD is a shortcut for all columns.
# So this is, "return the names of all columns"
wbcd[,names(.SD)]

Now let's get a better understanding of the variable names.

Note the use of the "pipe" operator from the "tidyverse" suite of packages.

It redirects the output of the former statement into the first argument of the latter statement:

In [None]:
# Get the names of the variables except the first,
# and split each name at the underscore, giving a list of 30 items:
splitnames <- wbcd[,names(.SD)][-1] %>% strsplit("_")
str(splitnames)

In [None]:
# get only the first element of each list item and reduce to unique values
sapply(splitnames, "[", 1) %>% unique()

# get only the second element of each list item and reduce to unique values
sapply(splitnames, "[", 2) %>% unique()
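
The extraction idiom used here can be tried on a small example. The following is a minimal base R sketch; the vector `toy_names` is a made-up stand-in for the real column names and is only assumed to follow the same `<feature>_<statistic>` pattern.

```r
# Toy column names in the "<feature>_<statistic>" pattern (illustrative only)
toy_names <- c("radius_mean", "radius_se", "area_mean", "area_worst")

# strsplit() returns a list with one character vector per name
parts <- strsplit(toy_names, "_")

# "[" is the subsetting function itself, so sapply(parts, "[", 1)
# evaluates parts[[i]][1] for every list element
features   <- unique(sapply(parts, "[", 1))
statistics <- unique(sapply(parts, "[", 2))

print(features)
print(statistics)
```

Passing `"["` as the function keeps the call short; an anonymous function such as `function(p) p[1]` would do the same job more explicitly.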

So we have in fact 10 variables with 3 different measurements for each: mean, se (for "standard error") and worst.

The outcome we try to predict is diagnosis. It would be nice to see the distribution of this categorical variable:

In [None]:
wbcd[,table(diagnosis)]

It is better to recode the diagnosis variable with more informative labels.
Note that splitting long lines is good coding practice for readability:

In [None]:
wbcd[,diagnosis:=factor(diagnosis,
                       levels = c("B", "M"),
                       labels = c("Benign", "Malignant"))]

And we can check the new labels:

In [None]:
wbcd[,levels(diagnosis)]
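
How `levels` and `labels` pair up in `factor()` can be checked on a toy vector. This is a small base R sketch; the `codes` vector is made up and is not taken from the dataset.

```r
# Toy diagnosis codes (illustrative values, not taken from the dataset)
codes <- c("B", "M", "B", "B", "M")

# levels = the raw codes as stored; labels = the names shown instead,
# matched to the levels by position
diagnosis <- factor(codes,
                    levels = c("B", "M"),
                    labels = c("Benign", "Malignant"))

print(levels(diagnosis))
print(table(diagnosis))
```

Because the pairing is positional, swapping the order in only one of the two arguments would silently mislabel every case, so it is worth checking the counts against the raw codes.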

Now let's get the percentages of each category:

In [None]:
wbcd[,round(prop.table(table(diagnosis)) * 100, digits = 1)]

For illustrative purposes, let's get the statistical summary for three selected variables' "mean" measurements:

Note that, to compute the summary on multiple selected columns, we pass their names in the ".SDcols" argument. ".SD" then refers only to those selected columns:

In [None]:
wbcd[,summary(.SD),
     .SDcols = c("radius_mean", "area_mean", "smoothness_mean")]

## Normalizing data

The difference in the scales of variables may distort the "distance" calculation, the step at the heart of the kNN algorithm.

So we must normalize the variables so that they have the same scales.
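
To see how much a large-scale variable can dominate, here is a minimal base R sketch with two made-up observations. The values are assumptions chosen for illustration, not rows from the dataset.

```r
# Two toy observations with features on very different scales
# (values are made up for illustration)
a <- c(area = 1000, smoothness = 0.10)
b <- c(area = 1010, smoothness = 0.20)

# Euclidean distance on the raw values
raw_dist <- sqrt(sum((a - b)^2))

# Share of each feature in the squared distance
contrib <- (a - b)^2 / sum((a - b)^2)

print(raw_dist)
print(round(contrib, 6))
```

Here smoothness doubles while area changes by 1%, yet almost all of the squared distance comes from area.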

We will follow the "min-max normalization" approach: The minimum value in all variables will be 0, and the maximum value will be 1.

We can write a custom function as such:

```r
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
```
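
As a quick sanity check of the min-max formula, the sketch below applies it to two toy vectors that differ only by a constant factor. The vectors are made up, and the function is repeated so that the block runs on its own.

```r
# Same formula as the custom function, repeated so the block is self-contained
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Rescaling a vector by a constant factor does not change the normalized result
v_small <- c(1, 2, 3, 4, 5)
v_large <- v_small * 10

print(normalize(v_small))
print(normalize(v_large))
```

One caveat of this simple version: a constant vector makes the denominator zero and yields `NaN`, so such a column would need separate handling.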

However, the power of R comes from the ability to reuse functions from the vast corpus of packages. We will utilize the "normalize" function from BBmisc package. This function can handle various normalization methods and options.

In [None]:
# normalize all variables except the first column to 0-1 range and save into new object

wbcd_n <- wbcd[,BBmisc::normalize(.SD, "range"), .SDcols = -1]
wbcd_n

Now let's check whether the variables are really normalized:

In [None]:
# get summary statistics of area_mean
wbcd_n[,summary(area_mean)]

## Split data into train and test sets

First let's determine the number of observations in the dataset.

In a data.table, .N is a placeholder for the number of rows:

In [None]:
wbcd_n[,.N]

We split the data into two pieces as "train" and "test" sets.

We hold out the last 100 rows to test the predictive accuracy of the model. Thanks to .N, there is no need to refer to the end row of the train set explicitly!

In [None]:
# exclude last 100 rows and create train set
wbcd_train <- wbcd_n[1:(.N - 100),]

# confirm the dimensions
dim(wbcd_train)

In [None]:
# assign last 100 rows into test set
wbcd_test <- wbcd_n[.N - 99:0,]

# confirm the dimensions
dim(wbcd_test)
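
The index expressions rely on operator precedence, which is easy to misread. The following base R sketch reproduces the arithmetic with a plain number `n` standing in for `.N`; 569 is the row count stated for this dataset.

```r
n <- 569  # stands in for .N

# ":" binds more tightly than binary "-", so this is n - (99:0), not (n - 99):0
test_idx  <- n - 99:0
train_idx <- 1:(n - 100)

print(range(test_idx))
print(length(test_idx))

# The alternative parse would count DOWN from 470 to 0 instead
print(length((n - 99):0))
```

The train and test indices do not overlap and together cover every row, which is what the split is meant to achieve.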

We have to use the labels from the diagnosis variable and split them into train and test also:

In [None]:
wbcd_train_labels <- wbcd[1:(.N - 100)][[1]]
class(wbcd_train_labels)
length(wbcd_train_labels)

In [None]:
wbcd_test_labels <- wbcd[.N - 99:0][[1]]
class(wbcd_test_labels)
length(wbcd_test_labels)

## Training a model

Now we should select an arbitrary "k" value: the number of nearest neighbours that vote on the label.

With two classes, it should be odd in order to prevent tied votes.

We go for "21":

Note some style preferences in the below code:

- Although unnecessary when a package is loaded, functions should be called explicitly with their namespace (package), as in "class::knn"; otherwise it would be hard to track which package a called function comes from

- Operators like "=" should be surrounded by spaces for readability

- Splitting function arguments across multiple lines enhances readability

In [None]:
# apply the model on train and test datasets, train label and "k" value
wbcd_test_pred <- class::knn(train = wbcd_train,
                            test = wbcd_test,
                            cl = wbcd_train_labels,
                            k = 21)

In [None]:
str(wbcd_test_pred)
length(wbcd_test_pred)

So the output of the model is a factor vector of diagnosis labels with the same length as the test set.
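
To make the voting idea concrete, here is a minimal base R sketch of a kNN vote for a single query point. `knn_one` is a hypothetical helper written for illustration, not part of the class package, and the training values are made up; `class::knn` may differ in details such as tie handling.

```r
# Minimal kNN vote for one query point; knn_one is a hypothetical helper
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  # labels of the k closest rows
  nearest <- labels[order(d)[seq_len(k)]]
  # majority vote; with two classes and an odd k there is always a strict winner
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Toy training data on a 0-1 scale (made-up values)
train <- rbind(c(0.10, 0.20), c(0.20, 0.10), c(0.15, 0.15),
               c(0.80, 0.90), c(0.90, 0.80))
labels <- factor(c("Benign", "Benign", "Benign", "Malignant", "Malignant"))

pred <- knn_one(train, labels, query = c(0.2, 0.2), k = 3)
print(pred)
```

With two classes, an even k could split the vote evenly, which is why an odd value is preferred.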

## Evaluate model performance

Now we should evaluate the model performance: Did our model perform well in identifying the labels of the test set correctly?

We just compare the "true" labels of the test set with the "predicted" labels:

In [None]:
ct1 <- gmodels::CrossTable(x = wbcd_test_labels,
                           y = wbcd_test_pred,
                           prop.chisq = FALSE)
ct1


The top-left and bottom-right quadrants count the test cases that are correctly identified as Benign or Malignant.

The top-right and bottom-left quadrants count the test cases that are misidentified.
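
The quadrant layout can be reproduced with a plain base R `table`. The sketch below uses made-up labels, not the notebook's results, and keeps the same orientation (rows are actual labels, columns are predicted labels).

```r
# Toy true and predicted labels (made-up, not the notebook's results)
lv <- c("Benign", "Malignant")
truth <- factor(c("Benign", "Benign", "Benign",
                  "Malignant", "Malignant", "Malignant"), levels = lv)
pred  <- factor(c("Benign", "Benign", "Malignant",
                  "Malignant", "Malignant", "Benign"), levels = lv)

# Rows = actual, columns = predicted
cm <- table(actual = truth, predicted = pred)

accuracy        <- sum(diag(cm)) / sum(cm)   # correct cases sit on the diagonal
false_negatives <- cm["Malignant", "Benign"] # bottom left
false_positives <- cm["Benign", "Malignant"] # top right

print(cm)
print(accuracy)
```

In a diagnostic setting the two off-diagonal cells are not equally costly: a false negative means a malignant mass is reported as benign.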

Now with a small R hack, we can generate a report of the above table programmatically (so if the data or methodology changes, the report is updated automatically):

In [None]:
# Get diagnosis labels
labels <- toupper(wbcd[,levels(diagnosis)])

# outcome words: index 1 = "FALSELY", index 2 = "CORRECTLY"
cf <- toupper(c("falsely", "correctly"))

# Create a data frame of result reports
df1 <- data.frame(as.vector(outer(1:2, 1:2, Vectorize(function(x,y) sprintf("%s %s test cases are %s predicted as %s",
                                                           ct1$t[x,y],
                                                          labels[x],
                                                            cf[(x == y) + 1],
                                                          labels[y])))))
# Define a name for the report:               
colnames(df1)[1] <- sprintf("Out of %s test cases: ", testn <- wbcd_test[,.N])

df1
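
The reason for wrapping the function in `Vectorize` may not be obvious: `outer` calls its function once with whole index vectors, while a single cell has to be looked up per pair. The following base R sketch shows the pattern on a made-up count matrix; `counts`, `lab` and `describe` are illustrative names.

```r
# Toy 2 x 2 count matrix (made-up numbers): rows = actual, columns = predicted
counts <- matrix(c(5, 1,
                   2, 4), nrow = 2, byrow = TRUE)
lab <- c("BENIGN", "MALIGNANT")

# Vectorize() makes the function run once per (x, y) pair,
# so counts[x, y] is a single cell rather than a sub-matrix
describe <- Vectorize(function(x, y) {
  sprintf("%s %s -> %s", counts[x, y], lab[x], lab[y])
})

# outer() pairs every row index with every column index (column-major order)
msgs <- as.vector(outer(1:2, 1:2, describe))
print(msgs)
```

The messages come out in column-major order, that is, both rows of the first column before the second column.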

## Improve model

### Z-score standardization

With min-max normalization, extreme values are compressed into the 0-1 range.

With z-score standardization, outliers are better expressed:

In [None]:
wbcd_z <- wbcd[,BBmisc::normalize(.SD), .SDcols = -1]
wbcd_z

Note that the "scale" function from base R yields the same result; however, it coerces the data.table to a matrix, which we do not want for further analysis.

BBmisc::normalize keeps the data.table class of the object.
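
The difference between the two rescaling methods, and the matrix return type of `scale`, can be seen on a toy vector. This is a base R sketch with made-up values that include one outlier.

```r
# Toy vector with one outlier (made-up values)
x <- c(10, 11, 12, 13, 100)

# z-score by hand: centre on the mean, divide by the sample standard deviation
z <- (x - mean(x)) / sd(x)

# min-max for comparison: the outlier is pinned at 1
mm <- (x - min(x)) / (max(x) - min(x))

# base::scale() gives the same z-scores but returns a matrix
z_scale <- scale(x)

print(round(z, 3))
print(round(mm, 3))
print(is.matrix(z_scale))
```

The z-scores have no fixed upper bound, so the outlier keeps a value proportional to how far it lies from the mean, whereas min-max squeezes the other four values close to 0.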

Let's see whether variables are z-score normalized:

In [None]:
wbcd_z[,summary(area_mean)]

Let's repeat the steps again:

In [None]:
wbcd_train <- wbcd_z[1:(.N - 100),]
wbcd_test <- wbcd_z[.N - 99:0,]

# the labels did not change, so re-running the next two lines is optional:
wbcd_train_labels <- wbcd[1:(.N - 100)][[1]]
wbcd_test_labels <- wbcd[.N - 99:0][[1]]


In [None]:
# apply the model on train and test datasets, train label and "k" value
wbcd_test_pred <- class::knn(train = wbcd_train,
                            test = wbcd_test,
                            cl = wbcd_train_labels,
                            k = 21)

In [None]:
ct1 <- gmodels::CrossTable(x = wbcd_test_labels,
                           y = wbcd_test_pred,
                           prop.chisq = FALSE)
ct1

In [None]:
# Create a data frame of result reports
df1 <- data.frame(as.vector(outer(1:2, 1:2, Vectorize(function(x,y) sprintf("%s %s test cases are %s predicted as %s",
                                                           ct1$t[x,y],
                                                          labels[x],
                                                            cf[(x == y) + 1],
                                                          labels[y])))))
# Define a name for the report:               
colnames(df1)[1] <- sprintf("Out of %s test cases: ", testn <- wbcd_test[,.N])

df1

Here, the number of false negatives increased with z-score standardization.

### Testing alternative k values

In order to find the optimum k value, we should run a simulation of the model against a range of k values.

For this, we combine all steps into a function and call it with sapply for multiple k values:

In [None]:
k_batch <- function(kval = 21)
{
    # run prediction model
    wbcd_test_pred1 <- class::knn(train = wbcd_train,
                            test = wbcd_test,
                            cl = wbcd_train_labels,
                            k = kval)
    
    # count false negatives: actually malignant but predicted benign
    false_neg <- sum(wbcd_test_labels == "Malignant" & wbcd_test_pred1 == "Benign")

    # count false positives: actually benign but predicted malignant
    false_pos <- sum(wbcd_test_labels == "Benign" & wbcd_test_pred1 == "Malignant")
    
    # report findings
    c(kval, false_neg, false_pos, false_neg + false_pos)

}

# run the model for all k = 1 to 100
report <- t(sapply(1:100, k_batch))

# change column names
colnames(report)  <- c("k value", "False negatives", "False positives", "Total classified incorrectly")

# return the matrix object
report
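
The transpose step follows from how `sapply` assembles its results. The sketch below shows this in base R; `toy_batch` is a hypothetical stand-in for `k_batch` that returns a fixed-length numeric vector without running any model.

```r
# Hypothetical stand-in for k_batch(): one fixed-length numeric vector per k
toy_batch <- function(kval) {
  c(kval, kval %% 3, kval %% 2)
}

# sapply() binds each result as a COLUMN, giving a 3 x 5 matrix here
raw <- sapply(1:5, toy_batch)

# t() flips it so that each k value becomes one row
report_toy <- t(raw)
colnames(report_toy) <- c("k value", "a", "b")

print(dim(raw))
print(report_toy)
```

Without the transpose, the column names would have to be assigned as row names instead.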

Now, let's visualize the relationship between k value and model performance

- Plot total incorrect on y axis against k value on x axis
- Show as dashed lines
- The points should be smaller and blue except the point with minimum incorrect value which should be larger and red
- Total number of incorrect labelings should be shown as tooltip when hovered over points with the mouse

In [None]:
# the object should be a data frame, not a matrix
df1 <- as.data.frame(report)

# create ggplot object with line and point geoms, point color and sizes and tooltip text
# note the vectorized "ifelse" function to create vectors of colors and sizes
gp <- ggplot2::ggplot(df1, aes(x = `k value`, y = `Total classified incorrectly`)) +
    geom_line(linetype = "dashed") +
    geom_point(color = ifelse(df1[[4]] == min(df1[[4]]), "red", "blue"),
               size = ifelse(df1[[4]] == min(df1[[4]]), 6, 2),
               mapping = aes(text = paste("k value: ", df1[[1]], "\n", "incorrect: ", df1[[4]]))) +
    labs(x = "k value", y = "total incorrect")

# Convert to plotly object for interactive tooltip
plotly::ggplotly(gp, tooltip = c("text"))

Let's interpret the above chart programmatically:

In [None]:
sprintf("So, when the k value is %s, count of incorrect is at a minimum of %s",
        which.min(report[,4]),
        min(report[,4]))
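
One detail worth keeping in mind: `which.min` returns a row position, which equals the k value here only because k runs from 1 upwards in steps of 1. The base R sketch below uses a made-up matrix `toy` in which that is not the case, and also shows how tied minima could be listed.

```r
# Toy report matrix where k does not start at 1 (made-up error counts)
toy <- cbind(k = c(5, 7, 9, 11),
             incorrect = c(4, 2, 2, 3))

# which.min() returns the ROW POSITION of the first minimum, not the k value
pos <- which.min(toy[, "incorrect"])
best_k <- toy[pos, "k"]

# all k values that tie for the minimum
tied_k <- toy[toy[, "incorrect"] == min(toy[, "incorrect"]), "k"]

print(pos)
print(best_k)
print(tied_k)
```

Looking the k value up by position keeps the report correct if the range of tested k values is later changed, for example to odd values only.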

In [None]:
save.image(sessionfile)