Significant DBSCAN

(This is the Matlab version. The Python implementation for data with arbitrary dimensions is now available at Significant-DBSCAN-python!)

Code for paper (Best Paper Award @ SSTD 2019):

Xie, Y. and Shekhar, S., 2019, August. Significant DBSCAN towards Statistically Robust Clustering. In Proceedings of the 16th International Symposium on Spatial and Temporal Databases (pp. 31-40). ACM link

@inproceedings{xie2019significant,
  title={Significant DBSCAN towards Statistically Robust Clustering},
  author={Xie, Yiqun and Shekhar, Shashi},
  booktitle={Proceedings of the 16th International Symposium on Spatial and Temporal Databases},
  pages={31--40},
  year={2019}
}

A recent survey on statistically-robust clustering:

Xie, Y., Shekhar, S. and Li, Y., 2022. Statistically-robust clustering techniques for mapping spatial hotspots: A survey. ACM Computing Surveys (CSUR), 55(2), pp.1-38. (Published version; arXiv preprint)

Description

This work aims to address a major limitation of traditional density-based clustering approach -- the lack of statistical rigor. This makes approaches such as DBSCAN tend to return many spurious clusters. For example, according to the HDBSCAN paper:

"small clusters of objects that may be highly similar to each other just by chance, that is, as a consequence of the natural randomness associated with the use of a finite data sample"

The code implements significant DBSCAN to automatically remove spurious clusters through point-process based statistical tests.

Significant DBSCAN works particularly well when data has high volume of noise.

At the moment, this MATLAB demo code (Python or Java version will come later) is primarily for 2D spatial data but can be easily edited to work for arbitrary dimensions (e.g., removing the grid index). It also works for data with clusters of different densities.

Please note that we just quickly edited and shared the code here due to requests, and will need to do a bit more testing soon when I get time in about a week or two (late August). Feel free to try for now if interested.

Example

The following figure shows an example of results comparing DBSCAN and significant DBSCAN.

Usage

Main function: significantDBSCAN()

[cluster_idx] = sigDBSCAN(data, dim1, dim2, m, siglvl, mode, eps_div)

Inputs: (as mentioned before, this demo code currently is primarily for 2D spatial data, but can be easily edited to work for arbitrary dimensions if needed; contact xiexx347 at umn dot edu if you have any questions)

data: N by 2 array (e.g., spatial coordinates of events)

dim1, dim2: scalars describing the size of the study area (data need to be pre-processed to set min to 0 for both dimensions)

m: number of Monte Carlo simulation trials to use for estimating test statistic distribution

siglvl: significance level (e.g., 0.01)

mode: Optional. Defines algorithm to use (e.g., with acceleration or not). Default is 3 (dual-convergence algorithm).

eps_div: Optional. Controls cell size to use for discrete scan (an acceleration method). Default is 0.25 (1/4 of current eps).

Output:

cluster_idx: N by 3 array with data points and corresponding cluster IDs (0 for noise). The order of points is likely different from original data input.

Result visualization:

2D visualization code is in visualization folder.

Synthetic Data

Synthetic data generator and base image files are in synthetic_data folder.

Example synthetic data with clusters (24 sets) and synthetic data with no cluster (4 sets) have been added to synthetic_data. Visualizations of data and Significant DBSCAN results will be added soon.

Synthetic data with clusters (24 sets)

Data have four clusters with background noises. The spatial dimensions are 100 x 100. Datasets come with different number of points and effect sizes (exact definition see paper). The naming format is:

ShapeName_EffectSize1_EffectSize2_#Points

ShapeName: "shape" or "test". Determines the shape of clusters (figures included in synthetic_data folder)

EffectSize1 and EffectSize2: Determines how many times inside probability density is as high as outside (e.g., an effect size of two means the probability density of getting a point at each location inside the corresponding cluster is twice that of the outside non-clustered region). EffectSize1 is used for two clusters and EffectSize2 for the rest. When effect size is one, that means the inside probability density is the same as outside. As a result, that means the inside region is not a true cluster (no underlying process to generate high density of points; if high density is observed, that is due to natural randomness).

#Points: Number of points in a dataset (e.g., 2,000, 10,000).

Synthetic data with no cluster (4 sets)

The datasets are generated by homogeneous point processes (effect size = 1 across the study area), and there is no true cluster in the data. Any clusters detected from this type of data are spurious detections. Robust clustering algorithms should tell there is no cluster in the data.

Running Significant DBSCAN on synthetic data

Same set of default parameters (m=100, siglvl=0.01, default optional parameters) can be used across all example datasets with different number of clusters, number of points and effect sizes.

Example comparison with other clustering techniques

In the following, the first figure shows results on complete random data where there is no true cluster, and any cluster detected is spurious (happen only due to natural randomness). In contrast, the second figure shows detections on truly clustered data (governed by clustered point process; please see paper for more details).

Complete random data (no true cluster exists)

Clustered data with noise in background (four shapes are true clusters; background are noises generated with homogeneous probability density)

License

The MIT License: AS IS (please see license file)

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
example_data_and_results		example_data_and_results
sigDBSCAN		sigDBSCAN
synthetic_data		synthetic_data
visualization		visualization
README.md		README.md
example_results.png		example_results.png
license.txt		license.txt
python_link.txt		python_link.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

example_data_and_results

example_data_and_results

sigDBSCAN

sigDBSCAN

synthetic_data

synthetic_data

visualization

visualization

README.md

README.md

example_results.png

example_results.png

license.txt

license.txt

python_link.txt

python_link.txt

Repository files navigation

Significant DBSCAN

Description

Example

Usage

Synthetic Data

Example comparison with other clustering techniques

License

About

Releases

Packages

Languages

License

yqthanks/Significant-DBSCAN-matlab

Folders and files

Latest commit

History

Repository files navigation

Significant DBSCAN

Description

Example

Usage

Synthetic Data

Example comparison with other clustering techniques

License

About

Resources

License

Stars

Watchers

Forks

Languages