# Evaluating clustering algorithms

This library contains utilities for evaluating clustering algorithms
(with or without ground truth labels). On top of the evaluation utilities, there are classes for
performing parameter sweeps and model selection. Here we give an overview of the most important functionality.


## Before running the notebook

If you haven't done so already, install the library and its dependencies with
```
pip install -e .
```
from the root directory. You can also execute this command directly in the notebook, but you will need to restart the
kernel afterwards.

In [None]:
# Note - this cell should be executed only once per session
%load_ext autoreload
%autoreload 2

import sys, os

# the config module is not part of the library, so we move to the parent directory and make it importable
os.chdir("..")
sys.path.append(os.path.abspath("."))

In [None]:
import numpy as np
import os
from pprint import pprint
from sklearn.cluster import DBSCAN
import seaborn as sns
import geopandas as gp
import matplotlib.pyplot as plt
import logging

from sensai.clustering.coordinate_clustering import SKLearnCoordinateClustering
from sensai.hyperopt import GridSearch
from sensai.evaluation.evaluator_clustering import ClusteringModelSupervisedEvaluator, \
    ClusteringModelUnsupervisedEvaluator
from sensai.evaluation.eval_stats import ClusteringUnsupervisedEvalStats, ClusteringSupervisedEvalStats, \
    AdjustedMutualInfoScore
from sensai.evaluation.clustering_ground_truth import PolygonAnnotatedCoordinates

from config import get_config

logging.basicConfig(level=logging.INFO)

In [None]:
# loading data and config
c = get_config(reload=True)
sampleFile = c.datafile_path("sample", stage=c.RAW)  # this can point to a directory or a shp/geojson file
coordinatesDF = gp.read_file(sampleFile)

## Evaluating a single model

For a single model that has already been fitted, evaluation statistics can be extracted with an eval stats class such
as `ClusteringUnsupervisedEvalStats`, as in the example below. The `evalStats` object can also be used to retrieve
evaluation results one by one.


In [None]:
dbscan = SKLearnCoordinateClustering(DBSCAN(eps=150, min_samples=20))
dbscan.fit(coordinatesDF)
evalStats = ClusteringUnsupervisedEvalStats.fromModel(dbscan)
pprint(evalStats.getAll())
plt.hist(evalStats.clusterSizeDistribution)
plt.show()
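
To make the kind of statistics reported above more concrete, here is a minimal sketch that computes comparable unsupervised metrics directly with scikit-learn. It is not the library's implementation: the synthetic blobs, the dictionary keys, and the choice to drop noise points (label `-1`) before scoring are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Synthetic stand-in for the coordinate data: three well-separated blobs
points, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=0.5,
    random_state=0,
)
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(points)

# DBSCAN marks noise with the label -1; excluding it before scoring is an assumption
is_clustered = labels != -1
cluster_ids, cluster_sizes = np.unique(labels[is_clustered], return_counts=True)

stats = {
    "numClusters": len(cluster_ids),
    "noiseFraction": float(np.mean(~is_clustered)),
    "CalinskiHarabaszScore": calinski_harabasz_score(points[is_clustered], labels[is_clustered]),
    "DaviesBouldinScore": davies_bouldin_score(points[is_clustered], labels[is_clustered]),
}
print(stats)
print("cluster sizes:", dict(zip(cluster_ids.tolist(), cluster_sizes.tolist())))
```

A higher Calinski-Harabasz score and a lower Davies-Bouldin score both indicate more compact, better separated clusters.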

## Model selection

For model selection we need to compare different (or differently parametrized) models that were
trained on the same dataset. The `ClusteringEvaluator` abstraction was designed with this goal in mind.
An evaluator can be used to obtain evaluation statistics for different models that are guaranteed
to be comparable with each other, since they are always computed by the same object in the same way. Here is an
example that evaluates the performance of a DBSCAN model on metrics that do not require ground truth labels.

In [None]:
modelEvaluator = ClusteringModelUnsupervisedEvaluator(coordinatesDF)

dbscanEvalStats = modelEvaluator.evalModel(dbscan, fit=False)  # dbscan was already fitted on this data

In [None]:
print("dbscan_performance: \n")
pprint(dbscanEvalStats.getAll())

One of the main purposes of evaluators is to be used within classes that perform a parameter sweep, e.g.
a `GridSearch`. All such objects return a data frame and can persist all evaluation results
in a CSV file (optional, but recommended).

In [None]:
parameterOptions = {
    "min_samples": [10, 20],
    "eps": [50, 150]
}

# for running the grid search in multiple processes, all objects need to be picklable.
# Therefore we pass a named function as model factory instead of a lambda
def dbscanFactory(**kwargs):
    return SKLearnCoordinateClustering(DBSCAN(**kwargs))

dbscanGridSearch = GridSearch(dbscanFactory, parameterOptions, csvResultsPath=os.path.join(c.temp, "dbscanGridSearchCsv"))
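
The picklability requirement mentioned in the comment above can be checked with the standard library alone. The sketch below is an illustration, not part of the library: `dict` stands in for a model class, and `functools.partial` plays the role of an importable factory. Like a named module-level function, it is pickled by reference to an importable name, whereas a lambda has no such name.

```python
import functools
import pickle

# An importable callable with bound arguments; `dict` is a stand-in for a model class
picklable_factory = functools.partial(dict, eps=150, min_samples=20)
restored_factory = pickle.loads(pickle.dumps(picklable_factory))
print(restored_factory())

# A lambda cannot be looked up by name on import, so pickling it fails
lambda_factory = lambda **kwargs: dict(**kwargs)
try:
    pickle.dumps(lambda_factory)
    lambda_is_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_is_picklable = False
print("lambda picklable:", lambda_is_picklable)
```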

In [None]:
# the results of the grid-search are saved as csv under the path provided above
resultDf = dbscanGridSearch.run(modelEvaluator, sortColumnName="numClusters", ascending=False)
resultDf.head()
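
Conceptually, such a sweep evaluates every combination of the parameter options and collects one row per combination. The following sketch shows this idea with plain scikit-learn and pandas. It does not reproduce the internals of `GridSearch`; the synthetic data, the `evaluate` helper, and the temporary CSV path are assumptions made for illustration.

```python
import itertools
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

points, _ = make_blobs(
    n_samples=300, centers=[(0, 0), (10, 10), (-10, 10)], cluster_std=0.5, random_state=0
)
parameter_options = {"min_samples": [5, 10], "eps": [0.5, 1.5]}


def evaluate(params):
    """Fit DBSCAN with the given parameters and return one result row (hypothetical helper)."""
    labels = DBSCAN(**params).fit_predict(points)
    is_clustered = labels != -1
    num_clusters = len(np.unique(labels[is_clustered]))
    # the score is only defined for at least two clusters
    if num_clusters >= 2:
        score = calinski_harabasz_score(points[is_clustered], labels[is_clustered])
    else:
        score = float("nan")
    return {**params, "numClusters": num_clusters, "CalinskiHarabaszScore": score}


names = list(parameter_options)
rows = [
    evaluate(dict(zip(names, values)))
    for values in itertools.product(*parameter_options.values())
]
result_df = pd.DataFrame(rows).sort_values("numClusters", ascending=False).reset_index(drop=True)

# persist the results so that a long sweep does not have to be repeated
csv_path = os.path.join(tempfile.mkdtemp(), "sweep_results.csv")
result_df.to_csv(csv_path, index=False)
print(result_df)
```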

The resulting data frame can be used to visualize the results through standard techniques,
e.g. pivoting and heatmaps.


In [None]:
print("calinskiHarabaszScores")
chScoreHeatmap = resultDf.pivot(index="min_samples", columns="eps", values="CalinskiHarabaszScore")
sns.heatmap(chScoreHeatmap, annot=True)
plt.show()

In [None]:
print("daviesBouldinScores")
dbScoreHeatmap = resultDf.pivot(index="min_samples", columns="eps", values="DaviesBouldinScore")
sns.heatmap(dbScoreHeatmap, cmap=sns.cm.rocket_r, annot=True)
plt.show()

In [None]:
print("numClusters")
# the explicit cast to int works around a datatype problem, possibly caused by zero clusters
numClustersHeatmap = resultDf.pivot(index="min_samples", columns="eps", values="numClusters").astype(int)
sns.heatmap(numClustersHeatmap, annot=True, fmt="d")  # fmt="d" prints the counts as plain integers
plt.show()
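
The pivoting step used in the cells above can be tried in isolation on a small frame. The values below are placeholders invented for illustration, not real evaluation results; only the column names are taken from this notebook.

```python
import pandas as pd

# Placeholder values for illustration only, not real evaluation results
long_df = pd.DataFrame({
    "min_samples": [10, 10, 20, 20],
    "eps": [50, 150, 50, 150],
    "numClusters": [12.0, 7.0, 5.0, 0.0],
})

# one row per min_samples value, one column per eps value
grid = long_df.pivot(index="min_samples", columns="eps", values="numClusters")
int_grid = grid.astype(int)
print(int_grid)
```

If the count column is stored as float, the pivoted grid is float as well, which is one possible reason for needing the cast. Note that `astype(int)` raises an error if the grid contains missing values, so incomplete sweeps would need to be filled or filtered first.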

## Dealing with ground truth labels


The evaluation classes can take ground truth labels for all coordinates and use them for calculating related metrics.
However, such labels are typically hard to come by, especially if the coordinates cover a large area. Therefore, the
library includes utilities for extracting labels from ground truth provided in the form of
__cluster polygons in a selected region__. The central class for dealing with this kind of data is
`clustering_ground_truth.PolygonAnnotatedCoordinates`, see the examples below.

In [None]:
# The polygons can be read directly from a file, see the documentation for more details
groundTruthClusters = PolygonAnnotatedCoordinates(coordinatesDF, c.datafile_path("sample", stage=c.GROUND_TRUTH))

As usual, the object has methods for plotting and exporting to geo data frames.
These can be very useful for inspecting the provided data.

In [None]:
groundTruthClusters.plot(markersize=0.2, cmap="plasma")
plt.show()

groundTruthClusters.toGeoDF().head()
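
The idea of deriving labels from cluster polygons can be sketched with shapely as follows. This is an illustration under stated assumptions rather than the implementation of `PolygonAnnotatedCoordinates`: the polygons, the coordinates, the `label_point` helper, and the convention of labelling points outside all polygons as noise (`-1`) are hypothetical.

```python
import numpy as np
from shapely.geometry import Point, Polygon

# Hypothetical annotated clusters, given as polygons
cluster_polygons = [
    Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
    Polygon([(5, 5), (7, 5), (7, 7), (5, 7)]),
]
coordinates = np.array([[1.0, 1.0], [6.0, 6.5], [3.5, 3.5], [0.5, 1.5]])


def label_point(xy, polygons, noise_label=-1):
    """Return the index of the first polygon containing the point, or noise_label."""
    point = Point(float(xy[0]), float(xy[1]))
    for label, polygon in enumerate(polygons):
        if polygon.contains(point):
            return label
    return noise_label


labels = np.array([label_point(xy, cluster_polygons) for xy in coordinates])
print(labels)
```

For large datasets, a spatial join (for example with geopandas) would usually be preferable to this explicit loop.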

We can extract the coordinates and labels for the annotated region and use them in evaluation. In the following,
we train our own adaptation of DBSCAN, a bounded DBSCAN with a minimum cluster size (`boundedDbscan` below), on the
data points in the ground truth region and evaluate the results against the true labels.

In [None]:
boundedDbscan = SKLearnCoordinateClustering(DBSCAN(eps=150, min_samples=20), minClusterSize=100)
groundTruthCoordinates, groundTruthLabels = groundTruthClusters.getCoordinatesLabels()
supervisedEvaluator = ClusteringModelSupervisedEvaluator(groundTruthCoordinates, trueLabels=groundTruthLabels)
supervisedEvalStats = supervisedEvaluator.evalModel(boundedDbscan)

print("Supervised evaluation metrics of bounded dbscan:")
pprint(supervisedEvalStats.getAll())
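
To build some intuition for supervised metrics such as the adjusted mutual information score, the sketch below calls scikit-learn's `adjusted_mutual_info_score` on toy labels. The label arrays are made up for illustration. The score does not depend on how clusters are named, but it decreases when true clusters are split into smaller ones.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

true_labels = np.repeat([0, 1, 2], 20)   # three true clusters of 20 points each
relabelled = (true_labels + 1) % 3       # same partition, different label names
over_split = np.arange(60) // 10         # every true cluster split into two halves

ami_relabelled = adjusted_mutual_info_score(true_labels, relabelled)
ami_over_split = adjusted_mutual_info_score(true_labels, over_split)
print("relabelled:", ami_relabelled)
print("over-split:", ami_over_split)
```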

In [None]:
print("Unsupervised evaluation metrics of annotated data:")
pprint(ClusteringUnsupervisedEvalStats(groundTruthCoordinates, groundTruthLabels).getAll())
print("")
print("Unsupervised evaluation metrics of bounded dbscan:")
pprint(ClusteringUnsupervisedEvalStats.fromModel(boundedDbscan).getAll())

The bounded DBSCAN performs reasonably well with the given parameters, although it splits clusters too
much and has a general tendency towards smaller clusters. These tendencies can be seen visually by comparing the ground
truth and the bounded DBSCAN cluster plots.

In [None]:
groundTruthClusters.plot(markersize=0.2, cmap="plasma", includeNoise=False)

boundedDbscan.plot(markersize=0.2, includeNoise=False)

## Supervised parameter estimation

We can now bring everything together by running a grid search and evaluating against ground truth. Very little code
is needed for that, so we write it entirely in the cells below.

In [None]:
parameterOptions = {
    "min_samples": [19, 20, 21],
    "eps": [140, 150, 160]
}

supervisedGridSearch = GridSearch(dbscanFactory, parameterOptions,
                                  csvResultsPath=os.path.join(c.temp, "bounded_dbscan_grid_search.csv"))

In [None]:
# we sort the results by the adjusted mutual information score
supervisedResultDf = supervisedGridSearch.run(supervisedEvaluator, sortColumnName=AdjustedMutualInfoScore.name,
                                              ascending=False)
supervisedResultDf

It seems that we were lucky to have already picked the optimal parameters for the DBSCAN above.
It is also interesting to note that the supervised scores are in
stark disagreement with the unsupervised ones.