In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better workflow and more tools to wrangle and visualize the data
library(BBmisc) # for easy normalization of data
library(class) # for kNN classification algorithm 
library(gmodels) # for model evaluation
library(plotly) # for interactive visualization
options(warn=-1) # for suppressing warnings

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/data_ad454"

# CLUSTERING PROVINCES USING K-MEANS

In this session, we will revisit the dataset from the Turkish State Meteorological Service's website, available at the following link:

https://www.mgm.gov.tr/veridegerlendirme/il-ve-ilceler-istatistik.aspx

The table below, shown here for ANKARA, was collected for all 81 provinces, merged with the province-region and month-season correspondences, and wrangled:

<table xmlns:xalan="http://xml.apache.org/xalan">
  <thead>
    <tr>
      <th style="width:22%">ANKARA</th>
      <th style="width:6%">January</th>
      <th style="width:6%">February</th>
      <th style="width:6%">March</th>
      <th style="width:6%">April</th>
      <th style="width:6%">May</th>
      <th style="width:6%">June</th>
      <th style="width:6%">July</th>
      <th style="width:6%">August</th>
      <th style="width:6%">September</th>
      <th style="width:6%">October</th>
      <th style="width:6%">November</th>
      <th style="width:6%">December</th>
      <th style="width:6%">Annual</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border:none;"> </td>
      <th colspan="13">Measurement Period (1927 - 2020)</th>
    </tr>
    <tr>
      <th>Mean Temperature (°C)</th>
      <td id="d01">0.2</td>
      <td id="d02">1.7</td>
      <td id="d03">5.7</td>
      <td id="d04">11.2</td>
      <td id="d05">16.0</td>
      <td id="d06">20.0</td>
      <td id="d07">23.4</td>
      <td id="d08">23.4</td>
      <td id="d09">18.9</td>
      <td id="d10">13.2</td>
      <td id="d11">7.2</td>
      <td id="d12">2.5</td>
      <td id="d_top">11.9</td>
    </tr>
    <tr>
      <th>Mean Maximum Temperature (°C)</th>
      <td id="e01">4.2</td>
      <td id="e02">6.5</td>
      <td id="e03">11.5</td>
      <td id="e04">17.4</td>
      <td id="e05">22.4</td>
      <td id="e06">26.7</td>
      <td id="e07">30.3</td>
      <td id="e08">30.4</td>
      <td id="e09">26.1</td>
      <td id="e10">20.0</td>
      <td id="e11">13.0</td>
      <td id="e12">6.5</td>
      <td id="d_top2">17.9</td>
    </tr>
    <tr>
      <th>Mean Minimum Temperature (°C)</th>
      <td id="f01">-3.3</td>
      <td id="f02">-2.3</td>
      <td id="f03">0.7</td>
      <td id="f04">5.3</td>
      <td id="f05">9.7</td>
      <td id="f06">12.9</td>
      <td id="f07">15.8</td>
      <td id="f08">16.0</td>
      <td id="f09">11.8</td>
      <td id="f10">7.2</td>
      <td id="f11">2.5</td>
      <td id="f12">-0.8</td>
      <td id="d_top3">6.3</td>
    </tr>
    <tr>
      <th>Mean Sunshine Duration (hours)</th>
      <td id="g01">2.6</td>
      <td id="g02">3.8</td>
      <td id="g03">5.1</td>
      <td id="g04">6.6</td>
      <td id="g05">8.4</td>
      <td id="g06">10.1</td>
      <td id="g07">11.3</td>
      <td id="g08">10.8</td>
      <td id="g09">9.2</td>
      <td id="g10">6.7</td>
      <td id="g11">4.6</td>
      <td id="g12">2.6</td>
      <td id="d_top4">6.8</td>
    </tr>
    <tr>
      <th>Mean Number of Rainy Days</th>
      <td id="h01">14.7</td>
      <td id="h02">13.2</td>
      <td id="h03">14.3</td>
      <td id="h04">14.5</td>
      <td id="h05">16.1</td>
      <td id="h06">11.4</td>
      <td id="h07">5.6</td>
      <td id="h08">4.5</td>
      <td id="h09">5.6</td>
      <td id="h10">9.0</td>
      <td id="h11">10.6</td>
      <td id="h12">14.5</td>
      <td id="d_top5">134.0</td>
    </tr>
    <tr>
      <th>
                Mean Monthly Total Precipitation<span style="font-size:.8em;">
                  (mm)
                </span></th>
      <td id="i01">40.1</td>
      <td id="i02">35.4</td>
      <td id="i03">39.2</td>
      <td id="i04">42.4</td>
      <td id="i05">52.0</td>
      <td id="i06">35.3</td>
      <td id="i07">14.2</td>
      <td id="i08">12.5</td>
      <td id="i09">18.1</td>
      <td id="i10">27.9</td>
      <td id="i11">31.5</td>
      <td id="i12">44.6</td>
      <td id="d_top6">393.2</td>
    </tr>
    <tr>
      <td style="border:none;"> </td>
      <th colspan="13">
                  Measurement Period (1927 - 2020)
                </th>
    </tr>
    <tr>
      <th style="color:#dd4747;">Highest Temperature (°C)</th>
      <td id="j01" title="02.01.1995" style="color:#dd4747;">16.6</td>
      <td id="j02" title="18.02.2016" style="color:#dd4747;">21.3</td>
      <td id="j03" title="31.03.1952" style="color:#dd4747;">27.8</td>
      <td id="j04" title="23.04.1928" style="color:#dd4747;">31.6</td>
      <td id="j05" title="31.05.1935" style="color:#dd4747;">34.4</td>
      <td id="j06" title="27.06.1996" style="color:#dd4747;">37.0</td>
      <td id="j07" title="27.07.2012" style="color:#dd4747;">41.0</td>
      <td id="j08" title="07.08.2010" style="color:#dd4747;">40.4</td>
      <td id="j09" title="03.09.2020" style="color:#dd4747;">39.1</td>
      <td id="j10" title="03.10.1952" style="color:#dd4747;">33.3</td>
      <td id="j11" title="01.11.1932" style="color:#dd4747;">24.7</td>
      <td id="j12" title="02.12.1956" style="color:#dd4747;">20.4</td>
      <td style="color:#dd4747;" id="d_top7">41.0</td>
    </tr>
    <tr>
      <th style="color:#437ec1;">Lowest Temperature (°C)</th>
      <td id="k01" title="05.01.1942" style="color:#437ec1;">-24.9</td>
      <td id="k02" title="07.02.1932" style="color:#437ec1;">-24.2</td>
      <td id="k03" title="02.03.1985" style="color:#437ec1;">-19.2</td>
      <td id="k04" title="10.04.1929" style="color:#437ec1;">-7.2</td>
      <td id="k05" title="01.05.1981" style="color:#437ec1;">-1.6</td>
      <td id="k06" title="09.06.1958" style="color:#437ec1;">3.8</td>
      <td id="k07" title="11.07.1958" style="color:#437ec1;">4.5</td>
      <td id="k08" title="21.08.1949" style="color:#437ec1;">5.5</td>
      <td id="k09" title="29.09.1931" style="color:#437ec1;">-1.5</td>
      <td id="k10" title="30.10.1927" style="color:#437ec1;">-9.8</td>
      <td id="k11" title="29.11.1948" style="color:#437ec1;">-17.5</td>
      <td id="k12" title="31.12.1941" style="color:#437ec1;">-24.2</td>
      <td style="color:#437ec1;" id="d_top8">-24.9</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td colspan="13">
        <i>Hover the mouse cursor over the values to see the dates on which the highest and lowest temperatures occurred.</i>
      </td>
    </tr>
  </tfoot>
</table>

The wrangled dataset from the kNN lab is loaded below:

In [None]:
meteo_data4 <- readRDS(sprintf("%s/rds/11_01_meteo_data4.rds", datapath))

Now we will add a dataset including latitude and longitude information for all province centers:

In [None]:
coordinates <- fread(sprintf("%s/csv/11_02_coordinates.csv", datapath))

In [None]:
setnames(coordinates, c("province", "lat", "lon"))

In [None]:
coordinates[, province := toupper(province)]

In [None]:
coordinates %>% keep(is.numeric) %>% lapply(range)

In [None]:
setdiff(coordinates$province, meteo_data4$province)

In [None]:
setdiff(meteo_data4$province, coordinates$province)

And we combine the previous dataset with the coordinates:

In [None]:
meteo_data5 <- coordinates[meteo_data4, on = "province"]

In [None]:
meteo_data5 %>% str
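
The join above uses the data.table form `X[Y, on = ...]`, which keeps every row of `Y` and looks up matches in `X`. A minimal sketch on toy tables (the names `coords_toy` and `meteo_toy` and all values are made up for illustration) shows why the two `setdiff` checks matter: a province missing from the coordinates table would end up with `NA` coordinates.

```r
library(data.table)

# Hypothetical toy stand-ins for the coordinates and meteorological tables
coords_toy <- data.table(province = c("ANKARA", "IZMIR"),
                         lat = c(39.9, 38.4),
                         lon = c(32.9, 27.1))
meteo_toy <- data.table(province = c("ANKARA", "IZMIR", "VAN"),
                        temp = c(11.9, 17.9, 9.4))

# X[Y, on = ...]: all rows of Y (meteo_toy) are kept, columns of X are looked up
joined <- coords_toy[meteo_toy, on = "province"]
joined

# Rows of Y without a match in X get NA in the columns coming from X
joined[is.na(lat)]
```

If both `setdiff` calls return an empty vector, as one would hope for the real data, no such `NA` rows appear.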

Your task is to:

- Normalize the dataset (keep the original lat and lon values).
- Conduct a k-means clustering analysis to differentiate provinces based on meteorological data (excluding lat, lon and region information). Try different methods for finding the optimal k value and report the metrics for those k values.
- Visualize the provinces using ggplot so that each province is a point, with the original longitude on the x axis, the original latitude on the y axis, and colors showing the cluster number from the previous step. Interpret the plot: are there any provinces that lie geographically far away from the remaining provinces in the same cluster?
- Create a two-way contingency table showing the counts across regions and clusters (the number of clusters k may differ from 7, the number of regions). Interpret the table.
- Conduct another k-means clustering analysis, this time including the normalized lat and lon information, and repeat the steps. Is the optimal k different from 7? Are there still any geographical outliers?
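
One common way to look for an optimal k in k-means is the elbow method: run `kmeans` for a range of k values and compare the total within-cluster sum of squares. The sketch below is only an illustration on the built-in `USArrests` dataset, used as a stand-in for the normalized meteorological data; the seed, the range of k and `nstart = 25` are arbitrary assumptions.

```r
set.seed(454)            # k-means starts from random centers, so fix the seed
x <- scale(USArrests)    # built-in dataset, standardized, as a stand-in

ks <- 2:8
# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

data.frame(k = ks, tot_withinss = round(wss, 1))

# The "elbow" is the k after which the curve flattens out
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```

As far as its documentation states, `NbClust::NbClust` also accepts `method = "kmeans"`, which would give a second, index-based opinion on k.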

# Answer

In [None]:
meteo_data6 <- meteo_data5 %>% keep(is.numeric) %>% dplyr::select(-c("lat", "lon")) %>% transmute_all(BBmisc::normalize)
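
To our understanding, `BBmisc::normalize` defaults to `method = "standardize"`, that is, each column is centered to mean 0 and scaled to standard deviation 1. A minimal base R sketch of that transformation, using the first six monthly mean temperatures from the ANKARA table above as example input:

```r
# First six monthly mean temperatures from the ANKARA table
x <- c(0.2, 1.7, 5.7, 11.2, 16.0, 20.0)

# z-score: subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)
z

# Base R's scale() performs the same centering and scaling
all.equal(as.numeric(scale(x)), z)
```

Standardizing matters here because Euclidean distances would otherwise be dominated by variables measured on larger scales, such as precipitation in mm.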

In [None]:
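# Let the NbClust indices vote on the number of clusters (2 to 10),
# using Ward hierarchical clustering on Euclidean distances
numWard <- NbClust::NbClust(meteo_data6,
                            distance = "euclidean",
                            min.nc = 2,
                            max.nc = 10,
                            method = "ward.D2",
                            index = "all")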

In [None]:
numWard$Best.nc

In [None]:
dis <- dist(meteo_data6, method = "euclidean")

In [None]:
hc <- hclust(dis, method = "ward.D2")

In [None]:
plot(hc, hang = -1, labels = FALSE, main = "Ward.D2 Linkage")

In [None]:
hc %>%
    as.dendrogram %>%
    dendextend::color_branches(k = 3) %>%
    plot

In [None]:
comp3 <- cutree(hc, 3)

In [None]:
comp3
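
`cutree` returns one integer cluster label per observation, in the same order as the rows of the input data, which is what allows the labels to be bound back onto the original table. A small sketch on the built-in `USArrests` dataset (a stand-in; `hc_toy` and `labels3` are illustrative names):

```r
x <- scale(USArrests)  # built-in dataset, standardized, as a stand-in

# Ward hierarchical clustering on Euclidean distances, cut into 3 groups
hc_toy <- hclust(dist(x, method = "euclidean"), method = "ward.D2")
labels3 <- cutree(hc_toy, k = 3)

head(labels3)    # named integer vector, one label per row of x
table(labels3)   # cluster sizes
```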

In [None]:
meteo_data7 <- cbind(meteo_data5, regionnew = as.factor(comp3))

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

meteo_data7 %>%
    ggplot(aes(x = lon, y = lat, color = regionnew, label = province)) +
    geom_point() +
    geom_text()

In [None]:
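# One-way ANOVA of each numeric variable against the cluster label.
# The label is converted to a factor so that it is treated as a group, not a numeric regressor.
# Despite its name, fstats holds the p-values, sorted in ascending order.
fstats <- meteo_data7 %>% keep(is.numeric) %>%
    lapply(function(x) aov(x ~ cluster, data.frame(x, cluster = factor(comp3)))) %>%
    lapply(summary) %>% lapply(function(x) x[[1]]$`Pr(>F)`[1]) %>% unlist %>% sort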

In [None]:
meteo_data7 %>% dplyr::select(c("regionnew", names(fstats[1:6]))) %>% group_by(regionnew) %>% summarise_if(is.numeric, mean)

- Groups are mostly differentiated by latitude, although latitude was not a feature used in clustering.
- Group 1 is mostly comprised of southern provinces closer to the sea and southeastern provinces. Temperatures and day-night temperature differences are higher and summer rain is lower.
- Group 2 is mostly comprised of inner provinces, mostly in mid latitudes. Temperatures are lower, temperature differences are high and summer rain is in between.
- Group 3 is mostly comprised of higher-latitude provinces closer to the sea. Temperatures are slightly higher than in Group 2, temperature differences are lower and summer rain is higher.
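
The two-way contingency table of regions against clusters asked for in the task can be sketched with base R's `table`. The vectors below are hypothetical toy labels, not the real assignments; with the real data one would pass the region column and `regionnew` of `meteo_data7` instead (the name of the region column is an assumption here).

```r
# Hypothetical toy labels for six provinces
region_toy  <- c("Marmara", "Marmara", "Ege", "Ege", "Akdeniz", "Akdeniz")
cluster_toy <- factor(c(3, 3, 1, 3, 1, 1))

# Counts of provinces in each region/cluster combination
tab <- table(region = region_toy, cluster = cluster_toy)
tab

# Row proportions: how each region is split across clusters
prop.table(tab, margin = 1)
```

`gmodels::CrossTable`, from a package already loaded at the top of this notebook, is an alternative that prints counts and proportions together.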