When using CLOSE as a stability measure for over-time clusterings, several settings are available. First, CLOSE has to be initialized (test_data is a two-dimensional array containing the data):
import pandas
from ots_eval.stability_evaluation.close import CLOSE
data = pandas.DataFrame(test_data, columns=['object_id', 'time', 'cluster_id'])
rater = CLOSE(data, measure='mae', minPts=2, output=True, jaccard=True, weighting=True, exploitation_term=True)
Explanation of the parameters:
Parameter | Optional | Default | Datatype | Description |
---|---|---|---|---|
data | - | - | pandas.DataFrame | with the first column being the objectID, the second the timestamp, and the third the clusterID |
measure | optional | 'mae' | string or callable | the quality measure to use, given either as a name or as a cluster-measuring function |
minPts | optional | 2 | int | used for the density-based quality measure only |
output | optional | False | boolean | indicates whether intermediate results should be printed |
jaccard | optional | False | boolean | indicates whether the Jaccard index should be used in CLOSE |
weighting | optional | False | boolean | indicates whether the more distant past should be weighted lower than the nearer past |
exploitation_term | optional | False | boolean | indicates whether the exploitation term for penalizing outliers should be used |
The column names in the DataFrame are irrelevant; only their order matters. The DataFrame may contain further columns, but only the first three are considered.
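Since only the column order is significant, a table with a different layout has to be reordered before it is passed to CLOSE. The following is a minimal sketch using plain pandas; the column names (`ts`, `label`, `obj`, `value`) and the sample values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical raw table: columns are in the wrong order and
# include an additional feature column.
raw = pd.DataFrame({
    "ts": [1, 1, 2, 2],
    "label": [0, 0, 0, 1],
    "obj": ["a", "b", "a", "b"],
    "value": [0.1, 0.2, 0.15, 0.9],
})

# Move object id, timestamp and cluster id to the first three positions.
# Further columns may follow; per the text above they are not considered.
data = raw[["obj", "ts", "label", "value"]]

print(list(data.columns[:3]))
```

The resulting `data` frame has the expected layout regardless of what the columns are called.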
Now, the clustered data set can be evaluated with CLOSE. There are two variants of quality measures:
- quality measures for clusters
- quality measures for clusterings
With the first type of quality measure, the original formula of CLOSE is used by calling the function
clustering_score = rater.rate_clustering(start_time=None, end_time=None, return_measures=False)
where start_time and end_time define the time interval to be considered. If start_time and end_time are None, the first and last timestamps are used as the respective boundaries. return_measures indicates whether the individual components of the CLOSE formula should be returned.
The second type of quality measure is used with the modified formula of CLOSE by calling
clustering_score = rater.rate_time_clustering(start_time=None, end_time=None)
where start_time and end_time define the time interval to be considered. If they are None, the first and last timestamps are used as the respective boundaries.
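The default handling of the interval boundaries can be illustrated with a small sketch. The helper `resolve_interval` is hypothetical and not part of ots_eval; it only mirrors the behaviour described above, under the assumption that timestamps are comparable values such as integers.

```python
def resolve_interval(timestamps, start_time=None, end_time=None):
    """Return (start, end), falling back to the first and last timestamp."""
    ordered = sorted(set(timestamps))
    if not ordered:
        raise ValueError("no timestamps given")
    # None means "use the boundary of the data".
    start = ordered[0] if start_time is None else start_time
    end = ordered[-1] if end_time is None else end_time
    if start > end:
        raise ValueError("start_time must not be later than end_time")
    return start, end


timestamps = [3, 1, 2, 2, 5, 4]
full_range = resolve_interval(timestamps)
from_two = resolve_interval(timestamps, start_time=2)
print(full_range, from_two)
```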
The exploitation term was originally introduced to penalize outliers in CLOSE. It appends N_co / N_o to the CLOSE formula, where N_co is the number of clustered objects and N_o is the number of all objects.
When used as a penalization term in CLOSE, it is calculated globally for the whole over-time clustering.
It can also be used as a quality measure, for example when the clusters are computed by DBSCAN. In that case, it is computed per timestamp in order to evaluate the individual time clusterings.
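The difference between the global and the per-timestamp computation of N_co / N_o can be sketched in plain Python. This is not the library's implementation. It assumes that every row (one object at one timestamp) counts as one observation and that outliers carry a negative cluster id; both conventions are assumptions made for this example, as are the function names.

```python
from collections import defaultdict

# Rows follow the layout (object_id, time, cluster_id).
# Assumption: a negative cluster_id marks an outlier.
rows = [
    ("a", 1, 0), ("b", 1, 0), ("c", 1, -1),
    ("a", 2, 0), ("b", 2, 1), ("c", 2, 1),
]


def exploitation_global(rows):
    """N_co / N_o over the whole over-time clustering."""
    clustered = sum(1 for _, _, cid in rows if cid >= 0)
    return clustered / len(rows)


def exploitation_per_timestamp(rows):
    """N_co / N_o computed separately for each timestamp."""
    clustered = defaultdict(int)
    total = defaultdict(int)
    for _, time, cid in rows:
        total[time] += 1
        if cid >= 0:
            clustered[time] += 1
    return {time: clustered[time] / total[time] for time in total}


global_term = exploitation_global(rows)
per_time = exploitation_per_timestamp(rows)
print(global_term, per_time)
```

In this sample, one of six observations is an outlier, so the global term is 5/6, while the per-timestamp values are 2/3 for the first timestamp and 1.0 for the second.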
To use the exploitation term as a penalization term, set exploitation_term=True when creating the CLOSE object:
CLOSE(data, exploitation_term=True)
It is also possible to use the exploitation term as a quality measure. To do so, create the CLOSE object as follows:
CLOSE(data, measure="exploit")
Since the exploitation term then has to be calculated per timestamp, the modified formula of CLOSE for quality measures on time clusterings must be used. Instead of the usual function rate_clustering(), you have to call rate_time_clustering():
rater = CLOSE(data, 'exploit')
clustering_score = rater.rate_time_clustering()
CLOSE with DBSCAN, mean average error as quality measure and global exploitation term for outlier penalization:
rater = CLOSE(data, 'mae', exploitation_term=True)
clustering_score = rater.rate_clustering()