To use the transition-based outlier detection algorithm DOOTS on your clustered data, first initialize the detector object:
```python
import pandas
from ots_eval.outlier_detection.doots import DOOTS

data = pandas.DataFrame(data, columns=['object_id', 'time', 'cluster_id'])
detector = DOOTS(data, weighting=False, jaccard=False)
```
Explanation of the parameters:

Parameter | Optional | Default | Datatype | Description |
---|---|---|---|---|
data | - | - | pandas.DataFrame | DataFrame whose first column is the object ID, second the timestamp and third the cluster ID |
jaccard | optional | False | boolean | whether the Jaccard index should be used |
weighting | optional | False | boolean | whether the more distant past should be weighted lower than the nearer past |

The column names of the DataFrame are not relevant, only their order is. The DataFrame may contain further columns, but only the first three are considered.
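
Because only the column order matters, it can help to reorder an existing DataFrame before passing it to the detector. The following is a minimal sketch using plain pandas; the frame `raw` and its column names (`name`, `t`, `cluster`, `feature`) are made up for illustration and are not part of ots_eval.

```python
import pandas as pd

# Hypothetical input whose columns are in the wrong order for DOOTS.
raw = pd.DataFrame({
    "cluster": [1, 1, 2, 2],
    "t": [1, 2, 1, 2],
    "name": ["a", "a", "b", "b"],
    "feature": [0.1, 0.2, 0.8, 0.9],
})

# Reorder so that object ID, timestamp and cluster ID come first.
# Further columns (here "feature") may stay; only the first three are used.
ordered = raw[["name", "t", "cluster", "feature"]]

print(list(ordered.columns[:3]))
```

The reordered frame `ordered` could then be passed as `data` in the initialization shown earlier.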
The outliers can then be calculated by calling:

```python
outlier_result = detector.calc_outlier_degree()
clusters, outlier_result = detector.mark_outliers(tau=0.5)
```

The function `calc_outlier_degree` computes the degree of being an outlier for every subsequence. `mark_outliers` then marks all outliers according to the threshold parameter `tau`. It returns the data DataFrame with an additional column `'outlier'`, which indicates whether a data point is an outlier and of which type, together with the outlier result, a pandas.DataFrame with the columns `'object_id'`, `'start_time'`, `'end_time'`, `'cluster_end_time'`, `'rating'`, `'distance'` and `'outlier'`. The outlier types are:

- `-1`: transition-based outlier
- `-2`: intuitive outlier
- `-3`: transition-based as well as intuitive outlier
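
The type codes above can be turned into readable labels with plain pandas. This is only a sketch: the `clusters` frame below is a made-up stand-in for the DataFrame returned by `mark_outliers`, and the value `0` for regular data points is an assumption of this example, not something stated by the library.

```python
import pandas as pd

# Labels for the three outlier type codes.
OUTLIER_TYPES = {
    -1: "transition-based",
    -2: "intuitive",
    -3: "transition-based and intuitive",
}

# Made-up stand-in for the `clusters` DataFrame returned by mark_outliers.
# Assumption: regular (non-outlier) data points carry the value 0.
clusters = pd.DataFrame({
    "object_id": ["a", "a", "b", "b", "c"],
    "time": [1, 2, 1, 2, 1],
    "cluster_id": [1, 1, 2, 1, 2],
    "outlier": [0, -1, 0, -3, -2],
})

# Keep only rows whose 'outlier' value is one of the three type codes.
outliers = clusters[clusters["outlier"].isin(list(OUTLIER_TYPES))].copy()
outliers["outlier_type"] = outliers["outlier"].map(OUTLIER_TYPES)

# Count how many data points fall into each outlier type.
counts = outliers["outlier_type"].value_counts().to_dict()
print(counts)
```

Filtering with `isin` on the known codes avoids relying on whatever value the library actually uses for regular data points.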

With

```python
clusters, outlier_result = detector.get_outliers(tau=0.5)
```

the outliers are calculated immediately in a single call.