# HIERARCHICAL CLUSTERING WITH WINE DATA

Adapted from Lesmeister (2015) Chapter 8.

We will now use the Wine Data Set from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/wine

* These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.
* The analysis determined the quantities of 13 constituents found in each of the three types of wines.
* The attributes are:
    - Alcohol
    - Malic acid
    - Ash
    - Alcalinity of ash
    - Magnesium
    - Total phenols
    - Flavanoids
    - Nonflavanoid phenols
    - Proanthocyanins
    - Color intensity
    - Hue
    - OD280/OD315 of diluted wines
    - Proline 

## Load libraries and dataset

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(plotly) # for interactive visualizations
library(cluster) # for cluster analysis
library(compareGroups) # for building descriptive statistics tables
library(HDclassif) # for the dataset
library(NbClust) # for cluster validity measures
#library(sparcl) # colored dendrograms. Not available for R 3.4.4 - version at binder
library(heatmaply) # visualize clusters with heatmap and dendrograms
library(dendextend) # enhanced dendrograms
library(circlize) # circular visualization
library(factoextra) # visualizing distances, cluster, heatmap
library(fastcluster) # faster hclust implementation
library(microbenchmark) # performance benchmarking
library(caret) # for confusion matrix
library(gmodels) # for confusion matrix

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/data_ad454"

In [None]:
data("wine", package = "HDclassif")

In [None]:
wine_dt <- as.data.table(wine)

In [None]:
setnames(wine_dt,
         c("Class", "Alcohol", "MalicAcid", "Ash", "Alk_ash",
           "Magnesium", "T_phenols", "Flavanoids", "Non_flav",
           "Proantho", "C_Intensity", "Hue", "OD280_315", "Proline"))

## Explore and transform

In [None]:
str(wine_dt)

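Standardize the variables (all columns except `Class`) so that each has zero mean and unit variance, which is the default method of `BBmisc::normalize()`: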

In [None]:
wine_dtz <- wine_dt[,BBmisc::normalize(.SD), .SDcols = !"Class"]

Check the quantiles of each standardized variable to confirm that they are now on a comparable scale:

In [None]:
wine_dtz %>% sapply(quantile, na.rm = TRUE) %>% t()
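
Standardizing means subtracting each column's mean and dividing by its standard deviation. A minimal sketch on a small made-up matrix (the values are hypothetical, not taken from the wine data), using base R's `scale()` rather than `BBmisc::normalize()`:

```r
# Hypothetical toy data: two variables on very different scales
toy <- cbind(alcohol = c(12, 13, 14, 13),
             proline = c(500, 1000, 1500, 1000))

# z-score each column: (x - mean) / sd
toy_z <- scale(toy)

round(colMeans(toy_z), 10)   # column means are 0
apply(toy_z, 2, sd)          # column standard deviations are 1

# The same computation written out by hand for one column
manual <- (toy[, "proline"] - mean(toy[, "proline"])) / sd(toy[, "proline"])
all.equal(unname(toy_z[, "proline"]), manual)
```

Without this step, a variable with large raw values such as Proline would dominate the Euclidean distances used later.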

View the distribution of cultivar classes:

In [None]:
wine_factors <- wine_dt[,.(Class)] %>% # keep only the class column
    ggplot(aes(x = Class)) + # one bar per cultivar class
    geom_bar() # bar height = number of wines in the class

plotly::ggplotly(wine_factors)

## Optimal clusters

Common linkage methods for forming clusters are as follows:

- Ward: Minimizes the total within-cluster variance, measured as the sum of squared distances from each point in a cluster to that cluster's centroid
- Complete: Distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster
- Single: Distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster
- Average: Distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster
- Centroid: Distance between two clusters is the distance between the cluster centroids
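
The distance-based definitions in the list above can be computed by hand. A minimal sketch with two made-up clusters of 2-D points (`A` and `B` are hypothetical, not wine observations):

```r
# Two hypothetical clusters of 2-D points
A <- rbind(c(0, 0), c(1, 0))
B <- rbind(c(4, 0), c(4, 3))

# All pairwise Euclidean distances between a point in A and a point in B
d_all <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

linkage <- c(
  complete = max(d_all),    # farthest pair
  single   = min(d_all),    # closest pair
  average  = mean(d_all),   # mean over all pairs
  centroid = sqrt(sum((colMeans(A) - colMeans(B))^2))  # distance between centroids
)
linkage
```

Here complete linkage gives 5 and single linkage gives 3, with the average and centroid distances falling in between.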

We use two of these below. With complete linkage, the distance between two clusters is the largest distance between an observation in one cluster and an observation in the other. Ward's linkage instead merges, at each step, the pair of clusters whose merger gives the smallest increase in the within-cluster sum of squares.
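
Ward's criterion can be checked numerically. The sketch below uses two made-up clusters (`A` and `B` are hypothetical) and a small helper, `wss()`, that is not part of any package. It compares the directly computed increase in within-cluster sum of squares with the closed form $\frac{n_A n_B}{n_A + n_B}\lVert c_A - c_B \rVert^2$, and then relates that value to the merge height reported by `hclust()` with `method = "ward.D2"`:

```r
# Helper (hypothetical name): within-cluster sum of squared distances to the centroid
wss <- function(m) sum(sweep(m, 2, colMeans(m))^2)

# Two hypothetical clusters
A <- rbind(c(0, 0), c(2, 0))
B <- rbind(c(5, 4))

# Increase in total within-cluster SS if A and B are merged
delta_direct <- wss(rbind(A, B)) - (wss(A) + wss(B))

# Closed form: n_A * n_B / (n_A + n_B) * squared distance between centroids
n_a <- nrow(A); n_b <- nrow(B)
delta_formula <- n_a * n_b / (n_a + n_b) * sum((colMeans(A) - colMeans(B))^2)

all.equal(delta_direct, delta_formula)

# For Euclidean input, "ward.D2" reports merge heights on the distance scale,
# which here equals sqrt(2 * delta) for the final merge
hc_toy <- hclust(dist(rbind(A, B)), method = "ward.D2")
all.equal(hc_toy$height[2], sqrt(2 * delta_formula))
```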

In [None]:
numComplete <- NbClust::NbClust(wine_dtz,
                       distance="euclidean",
                       min.nc=2,
                       max.nc=6,
                       method="complete",
                       index="all")

Going by majority rule, we would select three clusters as the optimal solution, at least for hierarchical clustering. The call also produces two plots, each containing two graphs.

As the printed output states, you are looking for a significant knee in the left-hand graph and for the peak in the right-hand graph.

For the complete method, 23 different metrics are calculated for different values of k. The best number of clusters proposed by each index is as follows:

In [None]:
numComplete$Best.nc
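
The majority rule amounts to counting how many indices propose each k. A minimal sketch on a made-up stand-in for `numComplete$Best.nc` (the matrix `best_nc` and all its values are hypothetical and only mimic the layout, with one column per index and the proposed number of clusters in the first row):

```r
# Hypothetical stand-in for numComplete$Best.nc (values made up for illustration)
best_nc <- rbind(
  Number_clusters = c(KL = 5, CH = 3, Hartigan = 3, Silhouette = 3, Dunn = 2),
  Value_Index     = c(KL = 14.2, CH = 48.9, Hartigan = 27.8, Silhouette = 0.2, Dunn = 0.23)
)

# Count how many indices vote for each k and pick the most frequent one
votes <- table(best_nc["Number_clusters", ])
votes
as.integer(names(votes)[which.max(votes)])
```

On the real object, the same `table()` call on the first row should reproduce the vote counts that `NbClust` prints.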

## Clustering with Complete Linkage

### Distances

Now let's calculate the distance matrix, using the base stats package:

In [None]:
dis <- dist(wine_dtz, method = "euclidean")

Or with the factoextra package (Euclidean distance is the default):

In [None]:
dis2 <- factoextra::get_dist(wine_dtz)
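
To make explicit what the distance matrix contains, here is a small sketch on three made-up observations (the matrix `x` is hypothetical), comparing `dist()` with the Euclidean formula written out:

```r
# Hypothetical standardized observations (rows) on three variables
x <- rbind(a = c(0, 0, 0),
           b = c(1, 2, 2),
           c = c(1, 2, 0))

d <- dist(x, method = "euclidean")
d

# Distance between a and b written out: sqrt((0-1)^2 + (0-2)^2 + (0-2)^2) = 3
sqrt(sum((x["a", ] - x["b", ])^2))

# A dist object stores only the lower triangle; as.matrix() expands it
as.matrix(d)["a", "b"]
```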

And we can visualize the distances:

In [None]:
factoextra::fviz_dist(dis)

The three cultivars are nearly apparent in the distance matrix.

### Hierarchical clustering

We run the cluster algorithm with the complete method:

In [None]:
hc <- hclust(dis, method = "complete")

In [None]:
hc

And visualize as a dendrogram:

In [None]:
plot(hc, hang = -1, labels = FALSE, main = "Complete-Linkage")

Hierarchical clustering does not define specific clusters, but rather defines the dendrogram above.

From the dendrogram we can read off the distance between any two groups by looking at the height at which they split.

(http://genomicsclass.github.io/book/pages/clustering_and_heatmaps.html)
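
The merge heights can also be read programmatically. A minimal sketch on five made-up points on a line (`pts` is hypothetical), using `cophenetic()` from base `stats`, which returns for each pair of observations the height at which they are first joined:

```r
# Five hypothetical points on a line
pts <- matrix(c(0, 1, 5, 6, 20), ncol = 1,
              dimnames = list(c("p1", "p2", "p3", "p4", "p5"), NULL))
hc_toy <- hclust(dist(pts), method = "complete")

# Heights of the successive merges, from the bottom to the top of the tree
hc_toy$height

# Cophenetic distance: height of the merge that first joins two observations
coph <- as.matrix(cophenetic(hc_toy))
coph["p1", "p2"]  # merged early, low in the tree
coph["p1", "p5"]  # only joined at the root
```

With complete linkage, `p1` and `p2` join at height 1, while `p5` joins the rest only at height 20, the largest pairwise distance.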

We can also create a colored dendrogram to differentiate the clusters better:

In [None]:
comp3 <- cutree(hc, 3)
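
`cutree()` turns the tree into flat cluster labels, either by asking for a number of clusters (`k`) or by cutting at a height (`h`). A small sketch on made-up points (`pts` is hypothetical):

```r
# Five hypothetical points on a line; complete-linkage merges happen at heights 1, 1, 6, 20
pts <- matrix(c(0, 1, 5, 6, 20), ncol = 1)
hc_toy <- hclust(dist(pts), method = "complete")

# Ask for a number of clusters ...
by_k <- cutree(hc_toy, k = 3)
# ... or cut at a height: every merge above h is undone
by_h <- cutree(hc_toy, h = 3)

by_k              # cluster label per observation
table(by_k)       # cluster sizes
all(by_k == by_h) # both cuts give the same partition here
```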

In [None]:
# The version of R at binder deployment is 3.4.4 and sparcl package currently requires a higher version.

#sparcl::ColorDendrogram(hc,
#                       y = comp3,
#                       main = "Complete",
#                       branchlength = 50)

In [None]:
hc %>%
    as.dendrogram %>%
    dendextend::color_branches(k = 3) %>%
    plot

We can also draw a circular dendrogram using the dendextend package:

In [None]:
hc %>%
    as.dendrogram %>%
    dendextend::color_branches(k = 3) %>%
    dendextend::circlize_dendrogram()

## Cluster vs classes

Clustering is an unsupervised method: we don't try to predict classes but instead try to detect patterns in the data.

But in this case we have classes, so we can compare whether the three classes coincide with the clusters:

In [None]:
table(comp3, wine_dt$Class) %>% caret::confusionMatrix()

The clusters coincide with the classes 83.7% of the time.
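
One caveat when reading such a table: cluster numbers are arbitrary, so the diagonal only measures agreement if cluster 1 happens to correspond to class 1, and so on. A hedged sketch on made-up labels (`classes` and `clusters` are hypothetical) that maps each cluster to its most frequent class before counting matches:

```r
# Hypothetical labels: the partition matches the classes perfectly,
# but the cluster numbers are permuted
classes  <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
clusters <- c(2, 2, 2, 3, 3, 3, 1, 1, 1)

tab <- table(cluster = clusters, class = classes)

# Naive agreement reads the diagonal and is misleading here
sum(diag(tab)) / sum(tab)

# Map every cluster to its most frequent class, then count matches
majority  <- colnames(tab)[apply(tab, 1, which.max)]
relabeled <- majority[match(clusters, rownames(tab))]
mean(relabeled == classes)
```

This majority mapping is the usual "purity" measure. It can assign two clusters to the same class, so it is an optimistic summary rather than a replacement for the confusion matrix.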

## Clustering with Ward's Linkage

In [None]:
NbClust::NbClust(wine_dtz,
                    distance = "euclidean",
                    diss = NULL,
                    min.nc = 2,
                    max.nc = 6,
                    method = "ward.D2",
                    index = "all")

Once again, the majority rule favors a three-cluster solution.

### Hierarchical clustering

Run the cluster algorithm with Ward's linkage:

In [None]:
hcWard <- stats::hclust(dis, method = "ward.D2")

The fastcluster package provides the same functionality, but much faster:

In [None]:
hcWard2 <- fastcluster::hclust(dis, method = "ward.D2")

In [None]:
hcWard
hcWard2

Let's check whether the labels for 3 clusters are identical:

In [None]:
identical(cutree(hcWard2, 3), cutree(hcWard, 3))

And let's compare the performance:

In [None]:
microbenchmark::microbenchmark(stats::hclust(dis, method = "ward.D2"),
                               fastcluster::hclust(dis, method = "ward.D2"), times = 5) %>% summary() %>% t

In this benchmark, the fastcluster package is at least 3 times faster.

Define the cluster cuts:

In [None]:
ward3w <- cutree(hcWard, 3)

And plot the dendrogram:

In [None]:
# The version of R at binder deployment is 3.4.4 and sparcl package currently requires a higher version.

#sparcl::ColorDendrogram(hcWard,
#                       y = ward3w,
#                       main = "Ward",
#                       branchlength = 50)

In [None]:
hcWard %>%
    as.dendrogram %>%
    dendextend::color_branches(k = 3) %>%
    plot

And let's compare classes and clusters:

In [None]:
table(ward3w, wine_dt$Class) %>% caret::confusionMatrix()

Ward's method matches the actual classes better than the complete method.

We can also compare the two methods:

In [None]:
table(ward3w, comp3)

We can also compare the dendrograms from both methods:

In [None]:
wine_dends <- lapply(list(hc, hcWard),
                          function(x) as.dendrogram(x) %>%
                          dendextend::color_branches(k = 3)) %>%
                        dendextend::as.dendlist()

In [None]:
names(wine_dends) <- c("complete", "ward.D2")

In [None]:
wine_dends %>%
    dendextend::dendlist(which = 1:2) %>%
    dendextend::ladderize() %>%
    #set("branches_k_color", k=3) %>%
    #set("rank_branches") %>%
    dendextend::tanglegram(faster = TRUE)
    #tanglegram(common_subtrees_color_branches = TRUE)

We see that Ward's method produced a smaller cluster 1 (64 vs 69 for the complete method) and a larger cluster 3 (56 vs 51). Cluster 2 has the same size in both methods (58):

In [None]:
gmodels::CrossTable(ward3w, comp3, prop.r = FALSE, prop.c = FALSE,
           prop.t = FALSE, prop.chisq = FALSE)

## Explore data across clusters

Now that we have the cluster info, we can add the cluster labels back into the original data to explore the differences across clusters:

In [None]:
wine_dt[, c("ward3w", "comp3") := .(ward3w, comp3)]

In [None]:
wine_dt

In [None]:
wine_dt[,lapply(.SD, mean), by = ward3w, .SDcols = -c("Class", "comp3")]

In [None]:
wine_dt[,lapply(.SD, mean), by = comp3, .SDcols = -c("Class", "ward3w")]

Although they are quite similar, the values for the 1st cluster with the Ward method are mostly above those with the complete method.

In [None]:
colnms <- c("cluster", "method", "proline")

p1 <- rbindlist(
list(wine_dt[,.(as.factor(ward3w), "ward", Proline)] %>%
    magrittr::set_colnames(colnms),

    wine_dt[,.(as.factor(comp3), "comp", Proline)] %>%
    magrittr::set_colnames(colnms)
    ),
idcol = NULL
) %>%

ggplot() +
geom_boxplot(aes(x = cluster, y = proline)) +
coord_flip() +
facet_wrap(~ method, scales = "fixed")

plotly::ggplotly(p1)

We see that the complete method produced five outliers in the second cluster, while the Ward method did not.
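
The points drawn as outliers depend on the box they are compared with: `geom_boxplot()` flags observations lying more than 1.5 times the interquartile range beyond the box. Base R's `boxplot.stats()` applies a closely related rule (it uses hinges instead of quantiles, so results can differ slightly). A small sketch on made-up values (`x` is hypothetical, not the wine proline data):

```r
# Hypothetical proline-like values with one extreme observation
x <- c(500, 520, 540, 560, 580, 600, 620, 1500)

bp <- boxplot.stats(x)   # default coef = 1.5
bp$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out                   # points beyond the whiskers
```

The same observation can therefore be an outlier under one clustering and not under another, simply because it is grouped with different neighbors.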

## Visualization of clusters

We can fit the cluster model and visualize it with a heatmap and a dendrogram at the same time:

In [None]:
heatmaply::heatmaply(wine_dtz, hclust_method = "ward.D2",
                     k_row = 3,
                     Colv = NULL,
                     labRow = NULL)

Or, using factoextra, we can again run the model and visualize the clusters as polygons:

In [None]:
factoextra::eclust(wine_dtz,
                   FUNcluster = "hclust",
                   k = 3,
                   hc_metric = "euclidean",
                   hc_method = "ward.D2",
                   verbose = interactive()) %>%

factoextra::fviz_cluster()