<font size="6"><b>CLUSTERING PROVINCES ON METEOROLOGICAL DATA</b></font>

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(BBmisc) # for easy normalization of data
library(plotly) # for interactive visualization
library(cluster) # for cluster analysis
library(compareGroups) # for building descriptive statistics tables
library(HDclassif) # for the dataset
library(NbClust) # for cluster validity measures
library(heatmaply) # visualize clusters with heatmap and dendrograms
library(dendextend) # enhanced dendrograms
library(circlize) # circular visualization
library(factoextra) # visualizing distances, cluster, heatmap
library(fastcluster) # faster hclust implementation
library(microbenchmark) # performance benchmarking
library(caret) # for confusion matrix
library(formattable) # for number formatting
library(pheatmap) # heatmap
library(knitr) # pretty tables
library(kableExtra) # pretty tables
library(IRdisplay) # pretty tables
library(vegan) # cluster metrics
library(listviewer) # view list object

options(warn=-1) # suppress warnings

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/databa"

![xkcd](../imagesba/k_means_clustering.png)

(https://xkcd.com/2731/)

In this session, we will use a dataset scraped from the Turkish State Meteorological Service's website:

https://www.mgm.gov.tr/veridegerlendirme/il-ve-ilceler-istatistik.aspx

Using general knowledge and common sense, we might expect meteorological statistics such as temperature or precipitation (rain) levels to be similar within geographic regions and to vary across them.

# Data

The table below, shown here for ANKARA, was collected for all 81 provinces, merged with province-region and month-season lookups, and then wrangled:

<table xmlns:xalan="http://xml.apache.org/xalan">
  <thead>
    <tr>
      <th style="width:22%">ANKARA</th>
      <th style="width:6%">January</th>
      <th style="width:6%">February</th>
      <th style="width:6%">March</th>
      <th style="width:6%">April</th>
      <th style="width:6%">May</th>
      <th style="width:6%">June</th>
      <th style="width:6%">July</th>
      <th style="width:6%">August</th>
      <th style="width:6%">September</th>
      <th style="width:6%">October</th>
      <th style="width:6%">November</th>
      <th style="width:6%">December</th>
      <th style="width:6%">Annual</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border:none;"> </td>
      <th colspan="13">Measurement Period (1927 - 2020)</th>
    </tr>
    <tr>
      <th>Mean Temperature (°C)</th>
      <td id="d01">0.2</td>
      <td id="d02">1.7</td>
      <td id="d03">5.7</td>
      <td id="d04">11.2</td>
      <td id="d05">16.0</td>
      <td id="d06">20.0</td>
      <td id="d07">23.4</td>
      <td id="d08">23.4</td>
      <td id="d09">18.9</td>
      <td id="d10">13.2</td>
      <td id="d11">7.2</td>
      <td id="d12">2.5</td>
      <td id="d_top">11.9</td>
    </tr>
    <tr>
      <th>Mean Maximum Temperature (°C)</th>
      <td id="e01">4.2</td>
      <td id="e02">6.5</td>
      <td id="e03">11.5</td>
      <td id="e04">17.4</td>
      <td id="e05">22.4</td>
      <td id="e06">26.7</td>
      <td id="e07">30.3</td>
      <td id="e08">30.4</td>
      <td id="e09">26.1</td>
      <td id="e10">20.0</td>
      <td id="e11">13.0</td>
      <td id="e12">6.5</td>
      <td id="d_top2">17.9</td>
    </tr>
    <tr>
      <th>Mean Minimum Temperature (°C)</th>
      <td id="f01">-3.3</td>
      <td id="f02">-2.3</td>
      <td id="f03">0.7</td>
      <td id="f04">5.3</td>
      <td id="f05">9.7</td>
      <td id="f06">12.9</td>
      <td id="f07">15.8</td>
      <td id="f08">16.0</td>
      <td id="f09">11.8</td>
      <td id="f10">7.2</td>
      <td id="f11">2.5</td>
      <td id="f12">-0.8</td>
      <td id="d_top3">6.3</td>
    </tr>
    <tr>
      <th>Mean Sunshine Duration (hours)</th>
      <td id="g01">2.6</td>
      <td id="g02">3.8</td>
      <td id="g03">5.1</td>
      <td id="g04">6.6</td>
      <td id="g05">8.4</td>
      <td id="g06">10.1</td>
      <td id="g07">11.3</td>
      <td id="g08">10.8</td>
      <td id="g09">9.2</td>
      <td id="g10">6.7</td>
      <td id="g11">4.6</td>
      <td id="g12">2.6</td>
      <td id="d_top4">6.8</td>
    </tr>
    <tr>
      <th>Mean Number of Rainy Days</th>
      <td id="h01">14.7</td>
      <td id="h02">13.2</td>
      <td id="h03">14.3</td>
      <td id="h04">14.5</td>
      <td id="h05">16.1</td>
      <td id="h06">11.4</td>
      <td id="h07">5.6</td>
      <td id="h08">4.5</td>
      <td id="h09">5.6</td>
      <td id="h10">9.0</td>
      <td id="h11">10.6</td>
      <td id="h12">14.5</td>
      <td id="d_top5">134.0</td>
    </tr>
    <tr>
      <th>
                Mean Monthly Total Precipitation<span style="font-size:.8em;">
                  (mm)
                </span></th>
      <td id="i01">40.1</td>
      <td id="i02">35.4</td>
      <td id="i03">39.2</td>
      <td id="i04">42.4</td>
      <td id="i05">52.0</td>
      <td id="i06">35.3</td>
      <td id="i07">14.2</td>
      <td id="i08">12.5</td>
      <td id="i09">18.1</td>
      <td id="i10">27.9</td>
      <td id="i11">31.5</td>
      <td id="i12">44.6</td>
      <td id="d_top6">393.2</td>
    </tr>
    <tr>
      <td style="border:none;"> </td>
      <th colspan="13">
                  Measurement Period (1927 - 2020)
                </th>
    </tr>
    <tr>
      <th style="color:#dd4747;">Highest Temperature (°C)</th>
      <td id="j01" title="02.01.1995" style="color:#dd4747;">16.6</td>
      <td id="j02" title="18.02.2016" style="color:#dd4747;">21.3</td>
      <td id="j03" title="31.03.1952" style="color:#dd4747;">27.8</td>
      <td id="j04" title="23.04.1928" style="color:#dd4747;">31.6</td>
      <td id="j05" title="31.05.1935" style="color:#dd4747;">34.4</td>
      <td id="j06" title="27.06.1996" style="color:#dd4747;">37.0</td>
      <td id="j07" title="27.07.2012" style="color:#dd4747;">41.0</td>
      <td id="j08" title="07.08.2010" style="color:#dd4747;">40.4</td>
      <td id="j09" title="03.09.2020" style="color:#dd4747;">39.1</td>
      <td id="j10" title="03.10.1952" style="color:#dd4747;">33.3</td>
      <td id="j11" title="01.11.1932" style="color:#dd4747;">24.7</td>
      <td id="j12" title="02.12.1956" style="color:#dd4747;">20.4</td>
      <td style="color:#dd4747;" id="d_top7">41.0</td>
    </tr>
    <tr>
      <th style="color:#437ec1;">Lowest Temperature (°C)</th>
      <td id="k01" title="05.01.1942" style="color:#437ec1;">-24.9</td>
      <td id="k02" title="07.02.1932" style="color:#437ec1;">-24.2</td>
      <td id="k03" title="02.03.1985" style="color:#437ec1;">-19.2</td>
      <td id="k04" title="10.04.1929" style="color:#437ec1;">-7.2</td>
      <td id="k05" title="01.05.1981" style="color:#437ec1;">-1.6</td>
      <td id="k06" title="09.06.1958" style="color:#437ec1;">3.8</td>
      <td id="k07" title="11.07.1958" style="color:#437ec1;">4.5</td>
      <td id="k08" title="21.08.1949" style="color:#437ec1;">5.5</td>
      <td id="k09" title="29.09.1931" style="color:#437ec1;">-1.5</td>
      <td id="k10" title="30.10.1927" style="color:#437ec1;">-9.8</td>
      <td id="k11" title="29.11.1948" style="color:#437ec1;">-17.5</td>
      <td id="k12" title="31.12.1941" style="color:#437ec1;">-24.2</td>
      <td style="color:#437ec1;" id="d_top8">-24.9</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td colspan="13">
        <i>Hover the mouse cursor over the values to see the dates on which the highest and lowest temperatures occurred.</i>
      </td>
    </tr>
  </tfoot>
</table>

I skip the wrangling steps here and load the final version of the data:

In [None]:
meteo_data5 <- readRDS(sprintf("%s/rds/meteo_data5.rds", datapath))

In [None]:
meteo_data5 %>% str

In [None]:
head(meteo_data5)

Here, along with province name, geographic coordinates of the province center and respective geographic region name, we have some data wrangled from the original source and summarized across each of the four seasons:

- av_temp: Average daily temperature (mean of three months)
- temp_diff: Difference between average daily maximum and minimum temperatures (mean of three months)
- total_rain: Total precipitation (sum of three months)

Let's drop the identifier and coordinate columns and normalize the remaining features, so that no single feature dominates the distance calculation:

In [None]:
meteo_data6 <- meteo_data5 %>% keep(is.numeric) %>% dplyr::select(-c("lat", "lon")) %>% transmute_all(BBmisc::normalize)

In [None]:
meteo_data6 %>% str

We will use `meteo_data6` for clustering and use `meteo_data5` to interpret the clusters geographically.
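To see why normalization matters for distance-based clustering, here is a minimal sketch on made-up numbers. It assumes that `BBmisc::normalize` is used with its default `method = "standardize"`, which turns each column into z-scores; the `zscore` helper and the `toy` values are illustrative only and are not part of the meteorological data.

```r
# Toy data (hypothetical values): two features on very different scales
toy <- data.frame(
  av_temp    = c(5, 15, 25, 10),
  total_rain = c(100, 400, 150, 900)
)

# Manual z-score: subtract the column mean, divide by the column sd
zscore <- function(x) (x - mean(x)) / sd(x)
toy_z <- as.data.frame(lapply(toy, zscore))

# Each standardized column has mean 0 and standard deviation 1
round(colMeans(toy_z), 10)
sapply(toy_z, sd)

# Raw distances are driven almost entirely by total_rain;
# after standardization both features contribute on the same scale
round(dist(toy), 1)
round(dist(toy_z), 2)
```

Comparing the two distance matrices shows how the ranking of "closest" observations can change once the features share a common scale.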

# K-means Clustering

## Build and train model

We fit the model with 3 clusters:

In [None]:
centn <- 3

In [None]:
set.seed(2345)
meteoc <- kmeans(meteo_data6, centers = centn)

In [None]:
meteoc

The sizes of the clusters are:

In [None]:
meteoc$size

Let's combine the original data, including the geographic coordinates, with the cluster numbers:

In [None]:
meteo_data7 <- cbind(meteo_data5, regionnew = as.factor(meteoc$cluster))

Let's plot each province as a point with ggplot, with the original longitude on the x axis, the original latitude on the y axis, and the color set by the cluster number from the previous step. We can then check whether any province lies geographically far away from the rest of its cluster:

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

p1 <- meteo_data7 %>%
ggplot(aes(x = lon, y = lat, color = regionnew, label=province)) +
geom_point() +
geom_text()

In [None]:
p1 %>% ggplotly()

We see that:

- Region 1 provinces all lie along the Black Sea coast
- Region 2 provinces mostly lie along the Marmara, Aegean and Mediterranean coasts, together with some inland provinces in the south
- Region 3 provinces are mostly continental provinces in the Central and Eastern Anatolian Plateaus

The center values of each variable for each cluster are:

In [None]:
centers <- meteoc$centers %>% t %>% round(2)
centers

Let's highlight the significantly high or low values in each row with kableExtra:

In [None]:
apply(centers,
      1,
      function(x)
        {
          #zs <- (x - mean(x)) / sd(x);
          zs <- BBmisc::normalize(x);
          cell_spec(x,
                    color = ifelse(abs(zs) > 1, "white", "black"),
                            background = ifelse(zs > 1, "navy", ifelse(zs < -1, "red", "white"))
                   )
        }
    ) %>%
t %>%
magrittr::set_colnames(1:centn) %>% 
as.data.table(keep.rownames = T) %>%
setnames("rn", "feature") %>%
knitr::kable(escape = F, format = "html") %>%
kableExtra::kable_styling() %>%
as.character() %>%
IRdisplay::display_html()

We can also visualize distinctive cluster and variable matchings with a heatmap:

In [None]:
pheatmap::pheatmap(centers, cluster_rows = F, cluster_cols = F)

We can deduce that:

- Region 1 provinces have lower temperature differences and higher rain levels
- Region 2 provinces have higher spring and summer temperatures
- Region 3 provinces have lower autumn and winter temperatures and have higher temperature differences (continental climate)

We can also visualize the clusters' borders across dimensions using factoextra's `fviz_cluster`.

Note that when there are more than 2 dimensions, this function automatically conducts a PCA and plots the two components that explain most of the variance:

In [None]:
factoextra::fviz_cluster(meteoc, data = meteo_data6, labelsize = 0)
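The projection behind such a plot can be approximated in base R. The sketch below is a rough analogue, not a reproduction of `fviz_cluster` internals: it runs `prcomp` on made-up four-dimensional data (the `toy` matrix is hypothetical), keeps the first two components, and reports how much variance they carry.

```r
set.seed(1)
# Toy data (hypothetical): 60 points in 4 dimensions around three centers
toy <- rbind(
  matrix(rnorm(80, mean = 0), ncol = 4),
  matrix(rnorm(80, mean = 4), ncol = 4),
  matrix(rnorm(80, mean = 8), ncol = 4)
)
km <- kmeans(toy, centers = 3, nstart = 10)

# Project onto principal components and keep the first two
pca <- prcomp(toy)
scores <- pca$x[, 1:2]

# Share of total variance carried by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share[1:2], 3)

# Coordinates that a two-dimensional cluster plot would use
head(cbind(scores, cluster = km$cluster))
```

Calling `plot(scores, col = km$cluster)` on these coordinates would draw a basic version of the cluster plot.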

## Improve model performance

While conducting k-means analysis, what value should be provided as "k" - the number of clusters?

### Manual simulation

First let's dig into the model output:

In [None]:
meteoc %>% listviewer::jsonedit(mode = "form")

The critical values are:
- totss (total sum of squares)
- tot.withinss (total within groups sum of squares)
- betweenss (between groups sum of squares)

As k goes up, tot.withinss should shrink and betweenss should grow by the same amount, since totss does not depend on k.
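The relation between the three values is the identity totss = tot.withinss + betweenss. A minimal sketch on made-up data (the `toy` matrix and the `ss` table are illustrative names, not objects from this notebook) checks it for several values of k:

```r
set.seed(1)
# Toy data (hypothetical): 90 points in two dimensions around three centers
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)

# Collect the three sums of squares for k = 2, ..., 5
ss <- t(sapply(2:5, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  c(k = k, totss = fit$totss,
    tot.withinss = fit$tot.withinss, betweenss = fit$betweenss)
}))
ss <- as.data.frame(ss)

# totss is the same for every k; the other two always add up to it
ss$check <- ss$tot.withinss + ss$betweenss
round(ss, 1)
```

The `check` column should match `totss` in every row, whatever partition k-means ends up with.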

In [None]:
withinss <- sapply(1:15,
       function(x) kmeans(meteo_data6, centers = x) %>%
       "["(c("totss", "tot.withinss", "betweenss")) %>% unlist
       ) %>%
t %>%
as.data.table

withinss[, k := 1:15] # a data.table has no row names, so store k as a column

In [None]:
withinss %>% round

In [None]:
p2 <- withinss %>%
ggplot(aes(x = seq_along(tot.withinss), y = tot.withinss)) +
geom_line() +
xlab("Number of clusters") +
ylab("Within group sum of squares")

plotly::ggplotly(p2)

We cannot detect a clear elbow point at which to cut the number of clusters.
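For contrast, here is what a clear elbow looks like on made-up data with three well-separated groups. This is only a sketch under stated assumptions: the `toy` matrix is hypothetical, and `nstart = 25` is my own choice to reduce the effect of unlucky random starts, which a single run per k does not guard against.

```r
set.seed(2345)
# Toy data (hypothetical): three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)

# nstart = 25 keeps the best of 25 random starts for every k
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Relative drop in within-group sum of squares when moving from k - 1 to k
rel_drop <- c(NA, -diff(wss) / head(wss, -1))
round(data.frame(k = 1:6, wss = wss, rel_drop = rel_drop), 2)
```

An elbow shows up as a k after which `rel_drop` falls off sharply; when the drops decline smoothly, as with the meteorological data, the elbow method gives no clear answer.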

### Optimal k with vegan package

The vegan package also runs k-means over a range of k values and determines the optimal k based on the Calinski criterion:

In [None]:
modelx <- vegan::cascadeKM(meteo_data6, 1, 10, iter = 3)

In [None]:
modelx$results

The Calinski (Calinski-Harabasz) criterion is the ratio of between-cluster to within-cluster variance, so higher values indicate better separated clusters.
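For n observations split into k clusters, the index can be written as (betweenss / (k - 1)) / (tot.withinss / (n - k)); it is undefined for k = 1 because k - 1 is zero. The sketch below computes it directly from `kmeans` output on made-up data. The `ch_index` helper and the `toy` matrix are hypothetical names introduced for illustration, not part of vegan.

```r
set.seed(1)
# Toy data (hypothetical): 90 points in two dimensions around three centers
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2),
  matrix(rnorm(60, mean = 10), ncol = 2)
)
n <- nrow(toy)

# Calinski-Harabasz index for one k-means fit:
# (between SS / (k - 1)) divided by (within SS / (n - k))
ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

ch <- sapply(2:6, function(k) ch_index(kmeans(toy, centers = k, nstart = 10), n))
names(ch) <- 2:6
round(ch, 1)

# The k with the largest index
as.integer(names(ch)[which.max(ch)])
```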

In [None]:
p3 <- modelx$results %>%
t %>%
as.data.table %>%
ggplot(aes(x = 1:10, y = calinski)) +
geom_line()

plotly::ggplotly(p3)

The k with max calinski value should be selected:

In [None]:
calx <- which.max(modelx$results[2,])
calx

Let's run the model with that:

In [None]:
meteoc2 <- kmeans(meteo_data6, centers = calx)

And see the center values:

In [None]:
centers2 <- meteoc2$centers %>% t %>% round(2)
centers2

And emphasize the values well above or below the row average:

In [None]:
apply(centers2,
      1,
      function(x)
        {
          zs <- (x - mean(x)) / sd(x);
          cell_spec(x,
                    color = ifelse(abs(zs) > 0.5, "white", "black"),
                            background = ifelse(zs > 0.5, "navy", ifelse(zs < -0.5, "red", "white"))
                   )
        }
    ) %>%
t %>%
magrittr::set_colnames(seq_len(calx)) %>%
as.data.table(keep.rownames = T) %>%
setnames("rn", "feature") %>%
knitr::kable(escape = F, format = "html") %>%
kableExtra::kable_styling() %>%
as.character() %>%
IRdisplay::display_html()

Get cluster sizes:

In [None]:
meteoc2$size

In [None]:
meteo_data7b <- cbind(meteo_data5, regionnew = as.factor(meteoc2$cluster))

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

p4 <- meteo_data7b %>%
ggplot(aes(x = lon, y = lat, color = regionnew, label=province)) +
geom_point() +
geom_text()

In [None]:
p4 %>% ggplotly()

The cluster distinction now becomes coastal versus continental provinces.

### Optimal k with NbClust

The NbClust package provides 30 indices for determining the optimal number of clusters in a data set and proposes the best clustering scheme from the different results.

In [None]:
meteo_nb <- NbClust::NbClust(meteo_data6, min.nc = 2, max.nc = 8, index = "all", method = "kmeans")

The model output:

In [None]:
meteo_nb

The vote across the criteria can also be tallied manually:

In [None]:
meteo_nb$Best.nc[1,] %>% table
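The majority rule itself is just a frequency count. A minimal sketch with made-up proposals (the `best_k` vector is hypothetical and does not come from the NbClust output above):

```r
# Hypothetical "best k" proposals from ten different indices
best_k <- c(2, 3, 3, 2, 3, 4, 3, 2, 3, 6)

# Count how many indices propose each k
votes <- table(best_k)
votes

# The k with the most votes wins
winner <- as.integer(names(votes)[which.max(votes)])
winner
```

Note that `which.max` returns the first maximum, so ties are resolved in favor of the smaller k.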

The number of clusters in the best partition is:

In [None]:
max(meteo_nb$Best.partition)

# Hierarchical clustering

## Optimal clusters

The methods for merging clusters are as follows:

- Ward: Minimizes the total within-cluster variance, measured as the sum of squared distances from the points of each cluster to its centroid
- Complete: Distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster
- Single: Distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster
- Average: Distance between two clusters is the mean distance between an observation in one cluster and an observation in the other cluster
- Centroid: Distance between two clusters is the distance between the cluster centroids

We will work with two of these methods: complete linkage, which takes the distance between two clusters to be the largest distance between any observation in one and any observation in the other, and Ward's linkage, which merges clusters so as to minimize the within-cluster sum of squares.
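The distance-based definitions can be verified by hand on two tiny clusters. The sketch below uses made-up points (`a`, `b` and `linkage_d` are illustrative names) and computes the single, complete, average and centroid distances directly from the pairwise distances:

```r
# Two small hypothetical clusters in two dimensions
a <- rbind(c(0, 0), c(1, 0))
b <- rbind(c(4, 0), c(4, 3))

# All pairwise Euclidean distances between members of a (rows) and b (columns)
pair_d <- as.matrix(dist(rbind(a, b)))[1:2, 3:4]
pair_d

linkage_d <- c(
  single   = min(pair_d),                                # closest pair
  complete = max(pair_d),                                # farthest pair
  average  = mean(pair_d),                               # mean over all pairs
  centroid = sqrt(sum((colMeans(a) - colMeans(b))^2))    # distance between centroids
)
round(linkage_d, 3)
```

The same two clusters are therefore "closer" or "farther" depending on the linkage, which is why different methods can produce different dendrograms from the same distance matrix.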

In [None]:
numComplete <- NbClust::NbClust(meteo_data6,
                       distance="euclidean",
                       min.nc=2,
                       max.nc=6,
                       method="complete",
                       index="all")

Going with the majority rules method, we would select three clusters as the optimal solution, at least for hierarchical clustering. The two plots that are produced contain two graphs each.

As the preceding output states, you are looking for a significant knee in the graph on the left-hand side and a peak in the graph on the right-hand side.

For the complete method, 26 different metrics are calculated for each candidate number of clusters. The best number of clusters proposed by each index is as follows:

In [None]:
numComplete$Best.nc

In [None]:
bestc <- max(numComplete$Best.partition)
bestc

## Clustering with Complete Linkage

### Distances

Now let's calculate the distance matrix, using the base stats package:

In [None]:
dis <- dist(meteo_data6, method = "euclidean")

Or factoextra package:

In [None]:
dis2 <- factoextra::get_dist(meteo_data6)

And we can visualize the distances:

In [None]:
factoextra::fviz_dist(dis)

### Hierarchical clustering

We run the cluster algorithm with the complete method:

In [None]:
hc <- hclust(dis, method = "complete")

In [None]:
hc

And visualize as a dendrogram:

In [None]:
plot(hc, hang = -1, labels = F, main = "Complete-Linkage")

Hierarchical clustering does not define specific clusters, but rather defines the dendrogram above.

From the dendrogram we can read off the distance between any two groups by looking at the height at which they merge into a single branch.

(http://genomicsclass.github.io/book/pages/clustering_and_heatmaps.html)
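A small example makes the role of the heights concrete. The sketch below uses five made-up points on a line (`x` and `hc_toy` are illustrative names) and shows the merge heights, how cutting by number of clusters relates to cutting by height, and the heights at which pairs of observations first share a branch:

```r
# Five hypothetical points on a line
x <- c(a = 0, b = 1, c = 5, d = 6, e = 20)
hc_toy <- hclust(dist(x), method = "complete")

# Heights at which successive merges happen
hc_toy$height

# Cutting into 3 clusters and cutting at height 3 give the same partition here
by_k <- cutree(hc_toy, k = 3)
by_h <- cutree(hc_toy, h = 3)
rbind(by_k, by_h)

# Cophenetic distance: the height at which two observations first share a branch
round(as.matrix(cophenetic(hc_toy)), 1)
```

With complete linkage, the pair {a, b} and the pair {c, d} join at the largest distance between their members (6), and the outlier e joins last, at height 20.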

We can also create a colored dendrogram to differentiate the clusters better:

In [None]:
comp3 <- cutree(hc, bestc)

In [None]:
sparcl::ColorDendrogram(hc,
                       y = comp3,
                       main = "Complete",
                       branchlength = 50)

In [None]:
hc %>%
    as.dendrogram %>%
    dendextend::color_branches(k = bestc) %>%
    plot

We can also draw a circular dendrogram using the dendextend package:

In [None]:
hc %>%
    as.dendrogram %>%
    dendextend::color_branches(k = bestc) %>%
    dendextend::circlize_dendrogram()

In [None]:
comp3

Combine data with cluster labels:

In [None]:
meteo_data7c <- cbind(meteo_data5, regionnew = as.factor(comp3))

As before, we plot each province by longitude and latitude, colored by cluster, and check whether any province lies geographically far away from the rest of its cluster:

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

p5 <- meteo_data7c %>%
ggplot(aes(x = lon, y = lat, color = regionnew, label=province)) +
geom_point() +
geom_text()

In [None]:
p5 %>% ggplotly()

The problem here is that the third cluster consists of a single province.

## Clustering with Ward's Linkage

In [None]:
numward <- NbClust::NbClust(meteo_data6,
                    distance = "euclidean",
                    diss = NULL,
                    min.nc = 2,
                    max.nc = 6,
                    method = "ward.D2",
                    index = "all")

In [None]:
bestc2 <- max(numward$Best.partition)
bestc2

This time, too, the majority vote favors a three-cluster solution.

### Hierarchical clustering

Run the cluster algorithm with Ward's linkage:

In [None]:
hcWard <- stats::hclust(dis, method = "ward.D2")

The fastcluster package provides the same functionality, but much faster:

In [None]:
hcWard2 <- fastcluster::hclust(dis, method = "ward.D2")

In [None]:
hcWard
hcWard2

Let's compare whether labeling for 3 clusters is identical:

In [None]:
identical(cutree(hcWard2, bestc2), cutree(hcWard, bestc2))

Define the cluster cuts:

In [None]:
ward3w <- cutree(hcWard, bestc2)

And plot the dendrogram:

In [None]:
sparcl::ColorDendrogram(hcWard,
                       y = ward3w,
                       main = "Ward", # title now matches the linkage method
                       branchlength = 50)

In [None]:
hcWard %>%
    as.dendrogram %>%
    dendextend::color_branches(k = bestc2) %>%
    plot

Combine data with new cluster labels:

In [None]:
meteo_data7d <- cbind(meteo_data5, regionnew = as.factor(ward3w))

As before, we plot each province by longitude and latitude, colored by cluster, and check whether any province lies geographically far away from the rest of its cluster:

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

p6 <- meteo_data7d %>%
ggplot(aes(x = lon, y = lat, color = regionnew, label=province)) +
geom_point() +
geom_text()

In [None]:
p6 %>% ggplotly()

We can see that:

- Cluster 1 is comprised of southern provinces, either coastal or closer to coasts
- Cluster 2 is comprised of continental provinces, mostly in mid latitudes
- Cluster 3 is comprised of northern coastal provinces

Compare that to the 3 means case from k-means:

In [None]:
p1 %>% ggplotly()

And let's cross-tabulate the cluster assignments of the two methods:

In [None]:
table(ward3w, comp3) %>% caret::confusionMatrix()

This table shows how the Ward and complete-linkage assignments overlap; it compares the two clusterings with each other rather than with known classes.
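Because cluster numbers are arbitrary labels, two methods can find nearly the same groups under different numbers, and diagonal-based measures such as accuracy then understate the agreement. A minimal sketch with made-up assignments (`m1`, `m2` and the greedy relabelling are illustrative assumptions, not what `caret::confusionMatrix` does):

```r
# Hypothetical cluster assignments from two methods for ten observations
m1 <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
m2 <- c(2, 2, 2, 3, 3, 1, 1, 1, 1, 1)

tab <- table(m1, m2)
tab

# Naive agreement: share of observations on the diagonal
naive <- sum(diag(tab)) / sum(tab)

# Relabel each m2 cluster with the m1 cluster it overlaps most.
# This greedy mapping is simple but not guaranteed to be one-to-one.
best_row <- apply(tab, 2, which.max)
map <- setNames(as.integer(rownames(tab))[best_row], colnames(tab))
m2_relabelled <- unname(map[as.character(m2)])

# Agreement after aligning the labels
aligned <- mean(m1 == m2_relabelled)
c(naive = naive, aligned = aligned)
```

Here the naive agreement is 0 even though only one observation is actually grouped differently, which is why the cross-table itself is usually more informative than the accuracy figure.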

We can compare the dendrogram from both methods:

In [None]:
meteo_dends <- lapply(list(hc, hcWard),
                          function(x) as.dendrogram(x) %>%
                          dendextend::color_branches(k = bestc2)) %>%
                        dendextend::as.dendlist()

In [None]:
names(meteo_dends) <- c("complete", "ward.D2")

In [None]:
meteo_dends %>%
    dendextend::dendlist(which = 1:2) %>%
    dendextend::ladderize() %>%
    dendextend::tanglegram(faster = TRUE)

To understand tanglegrams better, we can look at a simple example built on the iris dataset:

![tangle](../imagesba/tanglegram.jpeg)

(https://academic.oup.com/view-large/figure/381336054/vbac014f1.jpeg)

(https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac014/6539778)

Observations that join the same cluster early are positioned close to each other. The leaf order that results from this proximity can differ between the two dendrograms. The lines in the middle connect the same observation in the two dendrograms.
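The information a tanglegram draws can be inspected without plotting: it is the leaf order of each tree plus the position of every observation in both. A minimal sketch on made-up data, using base R only (the `toy` matrix is hypothetical, and single linkage is used here just to get a second tree):

```r
set.seed(42)
# Toy data (hypothetical): eight labelled observations in two dimensions
toy <- matrix(rnorm(16), ncol = 2, dimnames = list(letters[1:8], NULL))
d <- dist(toy)

# Leaf order (left to right) of the dendrogram from each linkage method
ord_complete <- labels(as.dendrogram(hclust(d, method = "complete")))
ord_single   <- labels(as.dendrogram(hclust(d, method = "single")))
data.frame(position = 1:8, complete = ord_complete, single = ord_single)

# For each leaf of the first tree, its position in the second tree;
# a tanglegram draws one connecting line per entry
match(ord_complete, ord_single)
```

If the result of `match` is simply 1 to 8 in order, the connecting lines are parallel; the more it is shuffled, the more the lines cross.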

## Explore data across clusters

Now that we have the cluster assignments, we can add the cluster labels back to the original data to explore the differences across clusters:

In [None]:
meteo_data8 <- copy(meteo_data5)

In [None]:
meteo_data8[, c("ward3w", "comp3") := .(ward3w, comp3)]

In [None]:
head(meteo_data8)

First the map and summary data with Ward's linkage method:

In [None]:
p6 %>% ggplotly()

Size of clusters:

In [None]:
meteo_data8[, .N, by = ward3w]

And summary data:

In [None]:
meteo_data8[,lapply(.SD, mean), by = ward3w, .SDcols = -c("province", "lat", "lon", "region", "comp3")] %>% round(1)

And then the map and data using complete linkage method: 

In [None]:
p5 %>% ggplotly()

Size of clusters:

In [None]:
meteo_data8[, .N, by = comp3]

And summary data:

In [None]:
meteo_data8[,lapply(.SD, mean), by = comp3, .SDcols = -c("province", "lat", "lon", "region", "ward3w")] %>% round(1)

# Object Generating Code

In [None]:
student_id <- 2025000000
library(tidyverse)
library(data.table)
library(BBmisc)
library(NbClust)
library(factoextra)
library(pheatmap)
library(dendextend)
nvar <- 4
set.seed(student_id)
km <- sample(3:6, 1)
sizex <- 100
params <- lapply(1:nvar, function(x) list(means1 = rnorm(km, 0, 3), sds1 = rexp(km, 1.5)))
clstr <- sample(km, sizex, replace = T)
data1 <- lapply(1:nvar, function(x)
{
    datax <- data.table(clstr)
    param <- params[[x]]
    means1 <- param$means1
    sds1 <- param$sds1
    datax[, (c("meanx1", "sdx1")) := .(means1[clstr], sds1[clstr])]
    datax[, val := rnorm(.N, meanx1, sdx1)]
    datax$val
}
       )
colnamesx <- paste(sample(words, nvar), "1", sep = "")
setDT(data1)
setnames(data1, colnamesx)
data1 <- data1 %>% mutate_all(normalize) %>% copy()                 

## K-Means Clustering

Let's first conduct K-means clustering on data1.

First let's get the best number of clusters with NbClust:

In [None]:
km_nb <- NbClust(data1, min.nc = 2, max.nc = 8, index = "all", method = "kmeans")

In [None]:
max(km_nb$Best.partition)

Best number of clusters is 4.

Let's run `kmeans` function with 4 clusters (note that we set an arbitrary seed for full reproducibility):

In [None]:
set.seed(1)
modelkm <- kmeans(data1, centers = 4)

In [None]:
modelkm

Between clusters sum of squares is 75.4% of total sum of squares.

Let's get the cluster centroids:

In [None]:
kmcent <- modelkm$centers %>% round(2)

In [None]:
kmcent

Or visualize them:

In [None]:
pheatmap(modelkm$centers, cluster_rows = F, cluster_cols = F)

We can select an arbitrary cluster, for example Cluster 4 and summarize it:

For Cluster 4, provide1 mean is 0.58, art1 mean is 0.67, product1 mean is 0.34 and trade1 mean is 0.18.

Cluster sizes are:

In [None]:
modelkm$size

So the smallest cluster size is 12, while the largest cluster size is 36.

We can get the cluster assignments for all observations to be compared with the assignments from hierarchical clustering later:

In [None]:
kmc <- modelkm$cluster

In [None]:
kmc

We can visualize the clusters on the first two principal components:

In [None]:
fviz_cluster(modelkm, data = data1, labelsize = 0)

We see that clusters 1 and 4 are close to each other and clusters 2 and 3 are close to each other, while the two pairs are mostly apart. No clusters overlap.

## Hierarchical Clustering

Now let's conduct hierarchical clustering with complete linkage.

First get the best number of clusters using NbClust:

In [None]:
hc_nb <- NbClust(data1,
                       distance="euclidean",
                       min.nc = 2,
                       max.nc = 8,
                       method="complete",
                       index="all")

In [None]:
max(hc_nb$Best.partition)

The best number of clusters is 5.

Let's get the distance matrix:

In [None]:
dis <- get_dist(data1)

And visualize it:

In [None]:
fviz_dist(dis)

Run hclust function with complete linkage method (note that we set an arbitrary seed for full reproducibility):

In [None]:
set.seed(2)
hc <- hclust(dis, method = "complete")

Visualize the dendrogram, coloring by the best number of clusters, 5:

In [None]:
hc %>%
    as.dendrogram %>%
    color_branches(k = 5) %>%
    plot

Get the cluster assignments using the best number of clusters:

In [None]:
compc <- cutree(hc, 5)

Create a copy of the data:

In [None]:
data2 <- copy(data1)

Add the cluster assignments to the data:

In [None]:
data2[, cluster := compc]

Now get the means of variables across clusters:

In [None]:
data2[,lapply(.SD, mean), by = cluster] %>% round(2)

If we compare with the centroids from K-means:

In [None]:
modelkm$centers %>% round(2)

We will see that the centroid of Cluster 2 that we get from K-means is the same as the centroid of Cluster 1 that we get from hierarchical clustering.

Let's get a two-way contingency table across cluster assignments of two methods:

In [None]:
table(kmc, compc)

We see that Cluster 5 of hierarchical clustering overlaps partially with Clusters 1 and 4 of K-means. The other clusters from the two models overlap completely.