# The Coordinate Clustering Module

In addition to supporting different clustering algorithms, sensAI provides methods specific to
clustering geospatial data. These include utilities for wrangling geometric data, for computing spanning trees, and for
persisting and visualizing results. The module interoperates with geopandas and shapely.
This notebook gives an overview of the main functions of the coordinate clustering module.


## Before running the notebook

If you haven't done so already, install the library and its dependencies with
```
pip install -e .
```
from the root directory. You can also execute this command directly in the notebook, but you will need to restart the
kernel afterwards.

In [None]:
# Note - this cell should be executed only once per session
%load_ext autoreload
%autoreload 2

import sys, os

# make the config module importable: it lives in the parent directory and is not part of the library
os.chdir("..")
sys.path.append(os.path.abspath("."))

In [None]:
import os
import geopandas as gp
from pprint import pprint
import numpy as np

from sklearn.cluster import DBSCAN

import logging
from sensai.util.graph import CoordinateSpanningTree
from sensai.clustering.coordinate_clustering import SKLearnCoordinateClustering
from sensai.util.geometry import alphaShape
from config import get_config

logging.basicConfig(level=logging.INFO)
c = get_config(reload=True)

## Loading and Fitting

The library contains utilities for loading coordinates from files and for wrapping arbitrary scikit-learn-compatible
clustering algorithms. Custom clustering algorithms can be implemented easily by inheriting from the base class
`ClusteringModel`.
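
To illustrate what "scikit-learn compatible" means in practice, here is a minimal sketch. It is not sensAI's implementation: it only assumes that the wrapped object exposes `fit(X)` and, afterwards, a `labels_` attribute in which `-1` marks noise (the convention of scikit-learn's `DBSCAN`). `GridClusterer` and `LabelGroupingWrapper` are hypothetical names invented for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


class GridClusterer:
    """Hypothetical stand-in for a scikit-learn style clusterer.

    Points falling into the same square grid cell share a label;
    cells with fewer than `min_samples` points are labelled -1 (noise).
    """

    def __init__(self, cell_size: float, min_samples: int = 2):
        self.cell_size = cell_size
        self.min_samples = min_samples
        self.labels_: List[int] = []

    def fit(self, X: Sequence[Point]) -> "GridClusterer":
        cells: List[Cell] = [
            (int(x // self.cell_size), int(y // self.cell_size)) for x, y in X
        ]
        counts: Dict[Cell, int] = defaultdict(int)
        for cell in cells:
            counts[cell] += 1
        cell_to_label: Dict[Cell, int] = {}
        self.labels_ = []
        for cell in cells:
            if counts[cell] < self.min_samples:
                self.labels_.append(-1)  # too sparse: treat as noise
                continue
            if cell not in cell_to_label:
                cell_to_label[cell] = len(cell_to_label)
            self.labels_.append(cell_to_label[cell])
        return self


class LabelGroupingWrapper:
    """Hypothetical wrapper: fits any object with fit/labels_ and groups points by label."""

    def __init__(self, clusterer):
        self.clusterer = clusterer
        self.points_by_label: Dict[int, List[Point]] = {}

    def fit(self, X: Sequence[Point]) -> "LabelGroupingWrapper":
        self.clusterer.fit(X)
        grouped: Dict[int, List[Point]] = defaultdict(list)
        for point, label in zip(X, self.clusterer.labels_):
            grouped[int(label)].append(point)
        self.points_by_label = dict(grouped)
        return self


points = [(0.1, 0.2), (0.4, 0.3), (0.8, 0.9), (5.2, 5.1), (5.6, 5.5), (9.5, 0.5)]
wrapper = LabelGroupingWrapper(GridClusterer(cell_size=1.0, min_samples=2)).fit(points)
print(sorted(wrapper.points_by_label))  # prints [-1, 0, 1]
```

Because the wrapper relies only on `fit` and `labels_`, any estimator following that contract could be passed in instead of the toy clusterer.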


In [None]:
sampleFile = c.datafile_path("sample", stage=c.RAW) # this can point to a directory or a shp/geojson file
sampleGeoDF = gp.read_file(sampleFile)
dbscan = SKLearnCoordinateClustering(DBSCAN(eps=150, min_samples=20))
dbscan.fit(sampleGeoDF)

The fitted `CoordinateClusteringAlgorithm` instance offers methods for inspecting the result.
You can retrieve clusters individually or through a generator, optionally filtered by a condition. The noise cluster can be accessed separately.
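
The idea of a generator with an optional condition can be sketched independently of sensAI. The function below is a hypothetical illustration, not the library's code: `iter_clusters`, `NOISE_LABEL` and the dictionary layout are assumptions made for this example.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Point = Tuple[float, float]
NOISE_LABEL = -1  # assumption: noise is labelled -1, as in scikit-learn's DBSCAN


def iter_clusters(
    points_by_label: Dict[int, List[Point]],
    condition: Optional[Callable[[List[Point]], bool]] = None,
) -> Iterator[Tuple[int, List[Point]]]:
    """Lazily yield (label, points) for non-noise clusters that satisfy `condition`."""
    for label in sorted(points_by_label):
        if label == NOISE_LABEL:
            continue  # noise is not a proper cluster and is handled separately
        members = points_by_label[label]
        if condition is None or condition(members):
            yield label, members


# toy clustering result: two clusters and one noise point
grouped = {
    0: [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0)],
    1: [(5.0, 5.0), (5.1, 5.1)],
    NOISE_LABEL: [(9.0, 0.0)],
}

large_clusters = list(iter_clusters(grouped, condition=lambda members: len(members) >= 3))
noise_points = grouped.get(NOISE_LABEL, [])
print([label for label, _ in large_clusters], len(noise_points))  # prints [0] 1
```

A generator avoids materializing all clusters at once, which matters when a fit produces many clusters and only a few pass the filter.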

In [None]:
print(f"Clusters found: {dbscan.numClusters}")

clustersMin50 = list(dbscan.clusters(condition=lambda x: len(x) >= 50))

print(f"Clusters with at least 50 members: {len(clustersMin50)}")

## Analysis and Visualization

Single clusters, which are instances of `CoordinateClusteringAlgorithm.Cluster`,
can be retrieved from the fitted dbscan object and visualized. Most objects, including dbscan itself, have a built-in plot method.

In [None]:
dbscan.plot(markersize=0.2)

We can also filter clusters by a condition before plotting and pass custom plotting arguments.

In [None]:
dbscan.plot(condition=lambda x: len(x) >= 50, cmap='plasma')

### Properties of a single cluster

Single clusters can be plotted too

In [None]:
sampleCluster = dbscan.getCluster(0)

sampleCluster.plot()

Clusters have an identifier and coordinates. It is easy to extract additional information,
e.g. via the `summaryDict` method.

In [None]:
pprint(sampleCluster.summaryDict())

A single cluster is just a wrapper around its coordinates. They can be
retrieved as a numpy array, a GeoDataFrame or a MultiPoint object.
The latter is useful for geometric operations, e.g. computing hulls.
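
For readers unfamiliar with hulls, the following sketch shows what a convex hull of a point set is. It uses Andrew's monotone chain algorithm in plain Python and is only an illustration, not how shapely computes `convex_hull`. The helper names `cross`, `convex_hull` and `polygon_area` are chosen for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Return the hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # drop the last vertex while it would create a clockwise or straight turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other one
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# four corners of a square plus one interior point
cluster_points = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
hull = convex_hull(cluster_points)
print(hull, polygon_area(hull))
```

The interior point `(1.0, 1.0)` is not part of the hull, and the hull of the square has area 4.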

In [None]:
clusterMultipoint = sampleCluster.asMultipoint()
clusterMultipoint.convex_hull

In [None]:
# we also provide a utility for computing alpha shapes for such objects

alphaShape(clusterMultipoint)

sensAI also provides utilities for computing trees, e.g. the minimum spanning tree shown here.
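
As background, a minimum spanning tree connects all points with the smallest possible total edge length. The sketch below computes one with Prim's algorithm on the complete Euclidean graph. It is a plain-Python illustration under that assumption and says nothing about how `CoordinateSpanningTree` is implemented. The function name `minimum_spanning_tree` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int, float]  # (index of point a, index of point b, edge length)


def minimum_spanning_tree(points: List[Point]) -> List[Edge]:
    """Prim's algorithm on the complete Euclidean graph, O(n^2) time."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known connection of each point to the tree
    best_parent = [-1] * n      # tree vertex that realizes that connection
    best_dist[0] = 0.0          # start growing the tree from point 0
    edges: List[Edge] = []
    for _ in range(n):
        # pick the point outside the tree that is cheapest to attach
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_parent[u] >= 0:
            edges.append((best_parent[u], u, best_dist[u]))
        # update connection costs of the remaining points
        for v in range(n):
            if not in_tree[v]:
                d = math.hypot(points[u][0] - points[v][0], points[u][1] - points[v][1])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_parent[v] = u
    return edges


coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (4.0, 1.0)]
tree_edges = minimum_spanning_tree(coords)
total_length = sum(length for _, _, length in tree_edges)
print(tree_edges, total_length)
```

The quadratic running time is acceptable for individual clusters of moderate size. For very large point sets one would typically restrict the candidate edges first, e.g. to a triangulation.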

In [None]:
sampleTree = CoordinateSpanningTree(sampleCluster)
sampleTree.plot()

Most objects provide a way to extract a summary, either as a dict or as a data frame.

In [None]:
print("cluster summary:")
pprint(sampleCluster.summaryDict())

In [None]:
dbscan.summaryDF().head()

## Saving and Loading

All of the objects used above can be exported to a GeoDataFrame using the `toGeoDF` method. This GeoDataFrame
can then be persisted as usual.

In addition, `CoordinateClusteringAlgorithm` has its own `save` method, which persists the object as a pickle.
An instance can be loaded using the `load` classmethod.
Persisting the fitted algorithm this way is _much more efficient and general_ than saving the corresponding GeoDataFrame.

Individual clusters also have saving and loading methods,
with the difference that they are persisted as (and instantiated from) shapefiles.
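
The save/load pattern with a classmethod can be sketched as follows. This is a hypothetical stand-in, not sensAI's code: the class `FittedClustering` and its attribute are invented for the example, and the sketch pickles the plain state rather than the whole instance.

```python
import os
import pickle
import tempfile
from typing import Dict, List, Tuple

Point = Tuple[float, float]


class FittedClustering:
    """Hypothetical stand-in for a fitted clustering object."""

    def __init__(self, points_by_label: Dict[int, List[Point]]):
        self.points_by_label = points_by_label

    def save(self, path: str) -> None:
        """Persist the fitted state as a pickle file."""
        with open(path, "wb") as f:
            pickle.dump(self.points_by_label, f)

    @classmethod
    def load(cls, path: str) -> "FittedClustering":
        """Re-create an instance from a file written by `save`."""
        with open(path, "rb") as f:
            state = pickle.load(f)
        if not isinstance(state, dict):
            raise TypeError(f"Unexpected content in {path}: {type(state).__name__}")
        return cls(state)


original = FittedClustering({0: [(0.0, 0.0), (0.1, 0.1)], -1: [(9.0, 0.0)]})

# write to a temporary directory so that nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clustering.pickle")
    original.save(path)
    restored = FittedClustering.load(path)

print(restored.points_by_label == original.points_by_label)  # prints True
```

Storing only the state keeps the file readable even if the class is later renamed or moved. Pickling the whole instance, as described above, is shorter to write but ties the file to the class's import path.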

In [None]:
dbscanGeoDF = dbscan.toGeoDF() # here again a condition for filtering clusters can be passed
clusterGeoDF = sampleCluster.toGeoDF()
treeGeoDF = sampleTree.toGeoDF()
dbscanGeoDF.head()

In [None]:
dbscanSavedPath = os.path.join(c.temp, f"{dbscan}_sample.pickle")
clusterSavedPath = os.path.join(c.temp, f"sampleCluster_{sampleCluster.identifier}")


dbscan.save(dbscanSavedPath)
sampleCluster.save(clusterSavedPath)

In [None]:
loadedDBSCAN = SKLearnCoordinateClustering.load(dbscanSavedPath)
loadedCluster = SKLearnCoordinateClustering.Cluster.load(clusterSavedPath)

In [None]:
# The loaded objects are equal to the ones we persisted

print(loadedCluster.identifier == sampleCluster.identifier)
print(np.array_equal(sampleCluster.datapoints, loadedDBSCAN.getCluster(0).datapoints))

# Cleaning up
import shutil

shutil.rmtree(clusterSavedPath)
os.remove(dbscanSavedPath)