# K-NEAREST NEIGHBOURS (KNN) WITH WINE DATASET

This exercise is adapted from Chapter 3 of "Machine Learning Made Easy With R" by N.D. Lewis

We will now use the Wine data set from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/wine

* These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars
* The analysis determined the quantities of 13 constituents found in each of the three types of wines. 
* The attributes are:
    - Alcohol
    - Malic acid
    - Ash
    - Alcalinity of ash
    - Magnesium
    - Total phenols
    - Flavanoids
    - Nonflavanoid phenols
    - Proanthocyanins
    - Color intensity
    - Hue
    - OD280/OD315 of diluted wines
    - Proline 

We will try to identify the cultivars of unlabeled wines.

In [None]:
sessionfile <- "01_knn_02.RData"

if(file.exists(sessionfile)) load(sessionfile)

First load necessary libraries for the exercise

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(BBmisc) # for easy normalization of data
library(class) # for kNN classification algorithm 
#library(knnGarden) # knn classification algorithm
library(gmodels) # for model evaluation
library(plotly) # for interactive visualization
library(rebmix) # to load necessary data
library(corrplot) # for correlation plots
library(reshape2) # to melt data for boxplots
options(warn = -1) # suppress warnings (messages are not affected)

And now load the data into a data.table:

In [None]:
data("wine", package = "rebmix")
wine <- as.data.table(wine)

In [None]:
str(wine)

Since Cultivar is the categorical variable that we want to label, let's look at the unique values (we can't query the levels since it is not a factor variable yet):

In [None]:
wine[,unique(Cultivar)]

Clearly there are 3 categories.

**EXERCISE 1:** Our first exercise task is to replace the integer Cultivar variable with factor levels "Cultivar 1", "Cultivar 2", "Cultivar 3" (3 minutes).

You can recycle the code from the previous WBCD example.

Take advantage of data.table syntax.

**SOLUTION 1:**

Replace integer with factors:

In [None]:
wine[,Cultivar:=factor(Cultivar,
                       levels = c(1, 2, 3),
                       labels = c("Cultivar 1", "Cultivar 2", "Cultivar 3"))]

Get factor levels for Cultivar:

In [None]:
wine[,levels(Cultivar)]

## Visualize the Data

Now we want to visualize the count of each Cultivar category, to see whether the data is distributed evenly across categories.

We will create a bar plot using the plotly library. In the previous example we first created a ggplot object and converted it to plotly. Here we will create a plotly object directly.

The advantage of plotly over ggplot is its interactive and 3D features.



In [None]:
plot_ly(wine, 
        x = ~Cultivar,
        type = "histogram")

Hovering over the bars, you can view the counts. You can also zoom and pan in plotly charts from the top right menu.

Although not equally distributed, the counts across categories are not too dispersed either.

Now we will create a correlation plot of all numeric variables in two steps:
- First get the correlation matrix of all columns except Cultivar
- Pipe it into the corrplot::corrplot.mixed function, with ellipses in the upper triangle and numbers in the lower triangle

In [None]:
cor(wine[,!"Cultivar"]) %>%
corrplot::corrplot.mixed(upper = "ellipse",
                         lower = "number",
                         tl.pos = "lt",
                         number.cex = .5,
                         lower.col = "black",
                         tl.cex = 0.7)

- The thinner the ellipse and the darker its colour, the stronger the relationship (negative or positive). Rounder and lighter-coloured ellipses denote correlations closer to zero
- Blue ellipses tilted towards the top right denote a positive relationship. Red ellipses tilted towards the top left denote a negative relationship

For example:
- Flavanoids and Total.Phenols have a strong positive relationship
- Hue and Malic.Acid have a moderately strong negative relationship
- Ash and Proanthocyanins have nearly zero correlation, so almost no linear relationship
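
To make the three cases concrete, here is a minimal sketch on toy vectors (hypothetical values, not the wine data). It builds one perfectly positive, one perfectly negative and one uncorrelated relationship and reads them off the correlation matrix, the same kind of matrix that is passed to corrplot above.

```r
# Toy vectors (hypothetical, not taken from the wine data)
x <- 1:10
toy <- data.frame(
  x    = x,
  up   = 2 * x + 1,         # rises with x: correlation +1
  down = -3 * x,            # falls as x rises: correlation -1
  bowl = (x - mean(x))^2    # symmetric around the mean of x: correlation 0
)

# Correlation matrix of all columns
cm <- cor(toy)
round(cm, 2)
```

Note that `bowl` depends on `x` completely, yet its correlation with `x` is zero: the coefficient only measures linear relationships.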

Now, are the scales of variables close to each other or too different?

We can create boxplot of all numeric variables and lay them side by side to see scale differences.

In order to do this, we first have to convert the wine data.table from wide to long format, or "melt" it.
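
As a rough illustration of what wide-to-long means, the sketch below reshapes a tiny toy data frame (hypothetical values) with base R's `stack()`, used here only as a stand-in for `melt`. Every column of the wide table becomes a block of rows in the long table, with one column for the values and one for the variable name.

```r
# Toy wide table: one column per variable (hypothetical values)
wide <- data.frame(Ash = c(2.4, 2.1, 2.7),
                   Hue = c(1.0, 1.1, 0.9))

# Long format: column "values" holds the numbers, column "ind" the variable name
long <- stack(wide)

dim(wide)   # 3 rows, 2 columns
dim(long)   # 6 rows, 2 columns
long
```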

The original data.table and its dimensions are:

In [None]:
dim(wine)
wine

Now we melt the data into long format:

In [None]:
wine_molten <- data.table::melt(wine,
                      id.vars = "Cultivar",
                      measure.vars = names(wine[,!"Cultivar"]))

And the new format and its dimensions:

In [None]:
dim(wine_molten)
wine_molten

Now we can plot multiple boxplots in one frame with a common scale:

In [None]:
# create ggplot, add boxplot and flip coordinates
ggplot(wine_molten) +
geom_boxplot(aes(x = variable, y = value)) +
coord_flip()

We see that Magnesium and especially Proline have much larger ranges than the other variables.

So we have to transform (rescale, normalize) the variables so that no single variable dominates the distance measure.
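
The effect of scale on distances can be sketched with three toy wines (hypothetical values, with column names borrowed from the data set). On the raw values the Euclidean distance is almost exactly the Proline difference. After z-scoring with base R's `scale()`, the Hue difference contributes as well.

```r
# Toy measurements for three wines (hypothetical values)
toy <- data.frame(Hue     = c(1.0, 1.1, 0.5),
                  Proline = c(1000, 600, 1010))
rownames(toy) <- c("a", "b", "c")

# Euclidean distances on the raw values and on Proline alone
raw_dist     <- as.matrix(dist(toy))
proline_only <- as.matrix(dist(toy["Proline"]))

# Euclidean distances after z-scoring each column
scaled_dist <- as.matrix(dist(scale(toy)))

round(raw_dist, 2)       # practically identical to proline_only
round(proline_only, 2)
round(scaled_dist, 2)    # the Hue gap between a and c now counts
```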

## Normalize the Data

**EXERCISE 2:** Exclude the Cultivar variable, normalize the numeric variables with z-scores and save the result into a data.table object named "data_sample" (3 minutes)
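
As a reminder, a z-score subtracts the mean from each value and divides by the standard deviation. The sketch below checks this by hand on a toy vector (hypothetical values) against base R's `scale()`. To our understanding, the default "standardize" method of BBmisc::normalize computes the same quantity, but treat that as an assumption and check the package documentation.

```r
# Toy vector (hypothetical values)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: centre on the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# The same with base R's scale(), which returns a one-column matrix
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
mean(z_manual)   # zero, up to floating point error
sd(z_manual)     # one
```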

**SOLUTION 2:**

In [None]:
data_sample <- wine[,BBmisc::normalize(.SD), .SDcols = !"Cultivar"]
data_sample

Is our data really normalized now?

The data summary says so: all means are 0, and the minimum and maximum values mostly lie within -/+ 3.

In [None]:
summary(data_sample)

However, it is better to show the normalization visually with boxplots, as we did above.

**EXERCISE 3:** Melt the data_sample, create box plots as above. You don't need to save the molten data.table in an interim object, you can just pipe into ggplot!

Note that when you pipe an object into a function, the object becomes the first argument of that function!
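
A minimal sketch of that rule, using the native pipe `|>` so that the example needs no packages (this assumes R 4.1 or later; for these simple calls `%>%` behaves the same way):

```r
x <- c(1, 4, 9)

# Nested call, read from the inside out
nested <- sum(sqrt(x))

# Piped call, read from left to right: x becomes the first argument of sqrt(),
# and the result becomes the first argument of sum()
piped <- x |> sqrt() |> sum()

# Further arguments stay in place: this is the same as round(3.14159, 2)
rounded <- 3.14159 |> round(2)

identical(nested, piped)
```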

And since we no longer have Cultivar in data_sample, we don't need to declare the "id.vars" argument.

(5 minutes)

**SOLUTION 3:**

In [None]:
# melt the data_sample, pipe into ggplot, add boxplot and flip coordinates
data.table::melt(data_sample,
                 measure.vars = names(data_sample)) %>%
ggplot() +
geom_boxplot(aes(x = variable, y = value)) +
coord_flip()

Now the scales are similar and the data is suitable for distance calculation and kNN.

## Split the Data Into Train and Test Samples

Now we will split the data into two pieces of equal length, train data and test data, to explore whether the prediction model is accurate.

But we will do this in a pure data.table way.

First we will draw "random" row indices, as many as half the total row count of data_sample.

For reproducibility we define a "seed", or starting point for randomness, so that the same numbers are drawn each time the code is run.

Remember that inside a data.table, .N is a shorthand for the total row count, or nrow(DT). To sample from the row indices, we need the vector of numbers 1:.N. Fortunately, ".I" is a shorthand for that vector!
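
The role of the seed can be sketched in base R with a toy row count (hypothetical, not the wine data). Here `seq_len(n)` plays the part of `.I` and `n` the part of `.N`. Resetting the seed before each draw gives the same indices, and negative indexing yields the complementary test rows.

```r
n <- 10   # toy row count (hypothetical)

# Same seed, same draw
set.seed(2016)
first_draw <- sample(seq_len(n), n / 2, replace = FALSE)
set.seed(2016)
second_draw <- sample(seq_len(n), n / 2, replace = FALSE)
identical(first_draw, second_draw)

# Rows that were not drawn form the test half
test_rows <- seq_len(n)[-first_draw]
test_rows
```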

In [None]:
set.seed(2016)
train <- data_sample[,sample(.I, .N / 2, replace = FALSE)]
train

## Run and Evaluate the Model

**EXERCISE 4:**
- Create "wine_train" and "wine_test" data.tables from data_sample and train vector
- Create "wine_train_labels" and "wine_test_labels" vectors from Cultivar column and train vector
- Run the knn model with k = 2 and save the results into wine_test_pred vector
- You can recycle any code from this or previous example
- Evaluate the model with a CrossTable and save the CrossTable into an object named ct1
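
Before working on the wine data, it may help to see the shape of a `class::knn` call on a one-dimensional toy problem (hypothetical points and labels). Each test point receives the majority label of its k nearest training points.

```r
library(class)

# Toy training set: three "low" points and three "high" points on a line
train_x      <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
train_labels <- factor(c("low", "low", "low", "high", "high", "high"))

# Two unlabeled points to classify
test_x <- matrix(c(2.5, 10.5), ncol = 1)

# The three nearest neighbours of 2.5 are 1, 2 and 3; those of 10.5 are 10, 11 and 12
pred <- knn(train = train_x, test = test_x, cl = train_labels, k = 3)
pred
```

An odd k is used here so that a two-class vote cannot end in a tie.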

(8 minutes)

**SOLUTION 4:**

In [None]:
wine_train <- data_sample[train]
wine_test <- data_sample[-train]
wine_train_labels <- wine[train,Cultivar]
wine_test_labels <- wine[-train,Cultivar]

In [None]:
wine_test_pred <- class::knn(train = wine_train,
                            test = wine_test,
                            cl = wine_train_labels,
                            k = 2)

In [None]:
ct1 <- gmodels::CrossTable(x = wine_test_labels,
                           y = wine_test_pred,
                           prop.chisq = FALSE)
ct1

Now it is time to interpret the results. The most important performance measure of a kNN model is the "error rate" or its companion, the "accuracy rate".

We know that the diagonal cells of the confusion matrix denote accurate predictions (row and column titles are the same), and off-diagonal cells denote errors.

So if we sum the table proportions of the diagonal cells, we get the accuracy rate. And if we subtract it from 1, we get the error rate.
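
The arithmetic can be sketched on six toy predictions (hypothetical labels), with base R's `table()` and `prop.table()` standing in for the CrossTable output:

```r
# Toy labels (hypothetical); shared levels keep the table square
lv        <- c("A", "B", "C")
actual    <- factor(c("A", "A", "B", "B", "C", "C"), levels = lv)
predicted <- factor(c("A", "B", "B", "B", "C", "A"), levels = lv)

# Confusion matrix of counts, then cell proportions of the grand total
confusion <- table(actual, predicted)
props     <- prop.table(confusion)

# Diagonal cells are the correct predictions: 4 out of 6 here
accuracy   <- sum(diag(props))
error_rate <- 1 - accuracy
c(accuracy = accuracy, error_rate = error_rate)
```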

**EXERCISE 5:**
- Using the ct1 "list", calculate the accuracy rate
- You may use "diag" function

(2 minutes)

**SOLUTION 5:**

In [None]:
# using base R notation. to interpret, read from the inside out (much harder)
sum(diag(ct1$prop.tbl))

# using the tidyverse piped notation. to interpret, read from left to right (much easier)
ct1$prop.tbl %>% diag() %>% sum()

# we can also pipe the list element extraction, using the functional form of the `[[` operator:
ct1 %>% "[["("prop.tbl") %>% diag() %>% sum()


## Simulate the Model and Visualize

Is k = 2 the optimal level? What if we run the model with different k values?

**EXERCISE 6:**
- Recycling the code from the previous example, get error rates of all models with k = 1:88
- We have two vectors: actual and predicted test labels. Error rate is the count of **"unequal"** pairs divided by total test size. Change the boolean formulation to get a concise solution
- Report a matrix of k-values and error rates (just two columns)
- You will have to adapt the code copied from the previous example, since the object dimensions may differ between the two examples
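
As a hint for the boolean formulation, comparing two vectors with `!=` gives a logical vector, and TRUE counts as 1 in arithmetic. A sketch on toy labels (hypothetical values):

```r
# Toy actual and predicted labels (hypothetical)
actual    <- c(1, 2, 3, 1, 2)
predicted <- c(1, 3, 3, 2, 2)

# TRUE wherever the pair is unequal
mismatch <- actual != predicted

# Count of unequal pairs divided by the number of pairs ...
error_by_count <- sum(mismatch) / length(actual)

# ... which is the same as the mean of the logical vector
error_by_mean <- mean(mismatch)

c(error_by_count, error_by_mean)
```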

(8 minutes)

**SOLUTION 6:**

In [None]:
k_batch <- function(kval = 2)
{
    # run prediction model
    wine_test_pred1 <- class::knn(train = wine_train,
                            test = wine_test,
                            cl = wine_train_labels,
                            k = kval)
    
    # count unequal pairs and divide by test size
    error_rate <- sum(wine_test_labels != wine_test_pred1) / length(wine_test_labels)

    # report findings
    c(kval, error_rate)

}

# run the model for all k = 1 to 88
report <- t(sapply(1:88, k_batch))

# change column names
colnames(report)  <- c("k value", "Error rate")

# return the matrix object
report

**EXERCISE 7:**
- Recycling code from previous example, create a similar plot of error rates vs. k values, highlighting the first minimum error rate value.
- Automatically report the k value for minimum error rate with sprintf
- You may have to change some parts of the code, since the dimensions and other attributes of the output objects may differ between the two examples
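
Two building blocks for this exercise, sketched on toy error rates (hypothetical values): `which.min()` returns the position of the first minimum even when the minimum occurs more than once, and `sprintf()` fills the placeholders of a template string.

```r
# Toy results (hypothetical): the minimum 0.04 occurs at k = 2 and k = 4
k_values    <- 1:6
error_rates <- c(0.10, 0.04, 0.06, 0.04, 0.08, 0.09)

# Position of the first minimum only
best_index <- which.min(error_rates)

# Positions of all minima, for comparison
all_minima <- which(error_rates == min(error_rates))

# %.2f formats a number with two decimals, %d formats an integer
msg <- sprintf("Lowest error rate %.2f at k = %d",
               error_rates[best_index], k_values[best_index])
msg
```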

(6 minutes)

**SOLUTION 7:**

In [None]:
# object should be a data frame, not a matrix
df1 <- as.data.frame(report)

# create ggplot object with line and point geoms, point color and sizes and tooltip text
# note the vectorized "ifelse" function to create vectors of colors and sizes
gp <- ggplot2::ggplot(df1, aes(x = `k value`, y = `Error rate`)) +
geom_line(linetype = "dashed") +
geom_point(color = ifelse(df1[[2]] == min(df1[[2]]), "red", "blue"),
        size = ifelse(df1[[2]] == min(df1[[2]]), 6, 2),
        mapping = aes(text = paste("k value: ", df1[[1]], "\n", "error rate: ", df1[[2]]))) +
        labs(x = "k value", y = "error rate")

# Convert to plotly object for interactive tooltip
plotly::ggplotly(gp, tooltip = c("text"))

In [None]:
sprintf("So, when the k value is %s, the error rate is at a minimum of %s",
        which.min(report[,2]),
        min(report[,2]))

In [None]:
save.image(sessionfile)