General Usage

CLOSE can be configured in several ways when used as a stability measure for over-time clusterings. First, CLOSE has to be initialized (test_data is a two-dimensional array containing the data):

import pandas
from ots_eval.stability_evaluation.close import CLOSE

data = pandas.DataFrame(test_data, columns=['object_id', 'time', 'cluster_id'])
rater = CLOSE(data, measure='mae', minPts=2, output=True, jaccard=True, weighting=True, exploitation_term=True)

Explanation of the parameters:

| Parameter | | Default | Datatype | Description |
|---|---|---|---|---|
| data | | -- | pandas.DataFrame | with first column being the objectID, second being the timestamp, third being the clusterID |
| measure | optional | 'mae' | string or callable | string describing the quality measure that should be used, or a cluster measuring function |
| minPts | optional | 2 | int | used for density-based quality measure only |
| output | optional | False | boolean | indicating if intermediate results should be printed |
| jaccard | optional | False | boolean | indicating if the Jaccard index should be used in CLOSE |
| weighting | optional | False | boolean | indicating if the more distant past should be weighted lower than the nearer past |
| exploitation_term | optional | False | boolean | indicating if the exploitation term for penalization of outliers should be used |

The column names in the DataFrame are not relevant; only their order matters. The DataFrame may contain further columns, but only the first three are considered.
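Since only the column order counts, a DataFrame whose columns arrive in a different order has to be rearranged before it is handed to CLOSE. The following is a minimal sketch of that step; the column names (`obj`, `t`, `label`, `feature`) and the values are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical clustering result with columns in an unsuitable order.
raw = pd.DataFrame({
    "label": [0, 0, -1, 1],             # cluster ID
    "t": [1, 1, 1, 2],                  # timestamp
    "obj": ["a", "b", "c", "a"],        # object ID
    "feature": [0.1, 0.2, 0.9, 0.3],    # additional column
})

# Reorder so that object ID, timestamp and cluster ID come first, in that order.
# Further columns may follow; according to the text above they are not considered.
data = raw[["obj", "t", "label", "feature"]]

print(list(data.columns[:3]))
```

The reordered `data` can then be passed as the first argument when creating the CLOSE object.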

Now, the clustered data set can be evaluated with CLOSE. There are two variants of quality measures:

  1. quality measures for clusters
  2. quality measures for clusterings

With the first type of quality measure, the original CLOSE formula is used by calling the function

clustering_score = rater.rate_clustering(start_time=None, end_time=None, return_measures=False)

where start_time and end_time delimit the time interval that should be considered. If start_time and end_time are None, the first and last timestamps are used as boundaries, respectively. return_measures indicates whether the individual components of the CLOSE formula should be returned.

The second type of quality measure is used with the modified CLOSE formula by calling

clustering_score = rater.rate_time_clustering(start_time=None, end_time=None)

where start_time and end_time delimit the time interval that should be considered. If they are None, the first and last timestamps are used as boundaries, respectively.
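The default behaviour of start_time and end_time can be illustrated with a small sketch. The helper `resolve_interval` is hypothetical and not part of ots_eval; it only mirrors the rule described above, and it does not make any statement about whether the library treats the boundaries as inclusive.

```python
def resolve_interval(timestamps, start_time=None, end_time=None):
    """Return (start, end), falling back to the first and last timestamp.

    Hypothetical helper for illustration; not part of ots_eval.
    """
    if not timestamps:
        raise ValueError("timestamps must not be empty")
    # None means: use the first / last timestamp present in the data.
    start = min(timestamps) if start_time is None else start_time
    end = max(timestamps) if end_time is None else end_time
    if start > end:
        raise ValueError("start_time must not be greater than end_time")
    return start, end


timestamps = [1, 2, 3, 4, 5]
full = resolve_interval(timestamps)                   # both boundaries default
partial = resolve_interval(timestamps, start_time=2)  # only end_time defaults
print(full, partial)
```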

Exploitation Term

The exploitation term was originally introduced to penalize outliers in CLOSE. It appends N_co / N_o to the CLOSE formula, where N_co is the number of clustered objects and N_o is the number of all objects. When used as a penalization term in CLOSE, it is calculated globally for the whole over-time clustering. It can also be used as a quality measure, for example when the clusters are calculated by DBSCAN. In that case, it is computed per timestamp in order to evaluate the individual time clusterings.
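The ratio N_co / N_o can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: outliers are assumed to carry the cluster ID -1 (the DBSCAN noise convention), and in the global variant every (object, timestamp) row is assumed to count as one object.

```python
from collections import defaultdict

NOISE = -1  # assumption: outliers are labelled -1, as DBSCAN does


def exploitation_global(rows):
    """N_co / N_o over all (object_id, time, cluster_id) rows."""
    if not rows:
        raise ValueError("rows must not be empty")
    clustered = sum(1 for _, _, cluster in rows if cluster != NOISE)
    return clustered / len(rows)


def exploitation_per_timestamp(rows):
    """N_co / N_o computed separately for every timestamp."""
    clustered = defaultdict(int)
    total = defaultdict(int)
    for _, time, cluster in rows:
        total[time] += 1
        if cluster != NOISE:
            clustered[time] += 1
    return {time: clustered[time] / total[time] for time in total}


# Hypothetical data: object "c" is an outlier at timestamp 1 only.
rows = [
    ("a", 1, 0), ("b", 1, 0), ("c", 1, NOISE),
    ("a", 2, 1), ("b", 2, 1), ("c", 2, 1),
]
global_term = exploitation_global(rows)    # 5 of 6 rows are clustered
per_time = exploitation_per_timestamp(rows)  # 2/3 at time 1, 1.0 at time 2
print(global_term, per_time)
```

The global value corresponds to the use as a penalization term, the per-timestamp values to the use as a quality measure for the individual time clusterings.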

How to use it?

You can use the exploitation term as a penalization term by setting exploitation_term=True when creating the CLOSE object:

CLOSE(data, exploitation_term=True)

It is also possible to use the exploitation term as a quality measure. To do so, create the CLOSE object as follows:

CLOSE(data, measure="exploit")

Since the exploitation term then has to be calculated per timestamp, the modified CLOSE formula for quality measures on time clusterings has to be used. Therefore, instead of the common function rate_clustering(), you have to call

rate_time_clustering()

Examples

CLOSE with DBSCAN with the exploitation term as quality measure:

rater = CLOSE(data, 'exploit')
clustering_score = rater.rate_time_clustering()

CLOSE with DBSCAN, mean absolute error (MAE) as quality measure and global exploitation term for outlier penalization:

rater = CLOSE(data, 'mae', exploitation_term=True)
clustering_score = rater.rate_clustering()