45 changes: 42 additions & 3 deletions .github/workflows/build_wheels.yml
@@ -1,6 +1,6 @@
# https://github.com/pypa/cibuildwheel/blob/main/examples/github-deploy.yml
# except no Windows
name: Build and upload to PyPI
name: build

# Build on every branch push, tag push, and pull request change:
on: [push, pull_request]
@@ -13,18 +13,54 @@ on: [push, pull_request]
# - published

jobs:
build-and-test:
name: Build executable and run test
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- name: Build
run: |
mkdir build
cd build
cmake ..
make -j
- name: Test
run: |
cd build
ctest --no-tests=error --output-on-failure

build_wheels:
name: Build wheels on ${{ matrix.os }}
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-20.04, macos-11]
os: [ubuntu-latest, macos-latest, windows-latest]

steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0

- name: Set up QEMU
if: runner.os == 'Linux' && (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v'))
uses: docker/setup-qemu-action@v2
with:
platforms: all

- name: Build wheels
- name: Build wheels (development)
if: github.ref != 'refs/heads/master' && !startsWith(github.ref, 'refs/tags/v')
uses: pypa/cibuildwheel@v2.11.2
env:
CIBW_ARCHS_MACOS: "arm64"

- name: Build wheels (production)
if: github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v')
uses: pypa/cibuildwheel@v2.11.2
env:
CIBW_ARCHS_MACOS: "x86_64 arm64"
CIBW_ARCHS_LINUX: "auto aarch64"

- uses: actions/upload-artifact@v3
with:
@@ -33,8 +69,11 @@
build_sdist:
name: Build source distribution
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v')
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0

- name: Build sdist
run: pipx run build --sdist
1 change: 1 addition & 0 deletions .gitignore
@@ -40,3 +40,4 @@ dbscan/build/
.DS_Store
dbscan.egg-info/
__pycache__
pythonmodule/_version.py
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -1 +1,3 @@
recursive-include include *
global-exclude *.py[co] .DS_Store
exclude src/dbscan
114 changes: 59 additions & 55 deletions README.md
@@ -1,4 +1,9 @@
# Overview
# Theoretically-Efficient and Practical Parallel DBSCAN

[![arXiv](https://img.shields.io/badge/arXiv-1912.06255-b31b1b.svg)](https://arxiv.org/abs/1912.06255)
[![build](https://github.com/wangyiqiu/dbscan-python/actions/workflows/build_wheels.yml/badge.svg)](https://github.com/wangyiqiu/dbscan-python/actions/workflows/build_wheels.yml)

## Overview

This repository hosts fast parallel DBSCAN clustering code for low-dimensional Euclidean space. The code automatically uses the available threads on a parallel shared-memory machine to speed up DBSCAN clustering. It stems from a paper presented at SIGMOD'20: [Theoretically Efficient and Practical Parallel DBSCAN](https://dl.acm.org/doi/10.1145/3318464.3380582).

@@ -11,9 +16,9 @@ Data sets with dimensionality 2 - 20 are supported by default, which can be modi
<img src="https://github.com/wangyiqiu/dbscan-python/blob/master/example.png" alt="example" width="300"/>
</p>

# Tutorial
## Tutorial

## Option 1: Use the binary executable
### Option 1: Use the binary executable

Compile and run the program:

@@ -28,75 +33,30 @@ make -j # this will take a while

The `<data-file>` can be any CSV-like point data file, where each line contains one data point -- see an example [here](https://github.com/wangyiqiu/hdbscan/blob/main/example-data.csv). The data file can be with or without a header. The cluster output `clusters.txt` will contain a cluster ID on each line (other than the first-line header), giving a cluster assignment in the same ordering as the input file. A noise point will have a cluster ID of `-1`.
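
For illustration, a tiny 2-d input and a plausible corresponding output might look as follows (the points and cluster IDs are made up, and the exact header text in `clusters.txt` may differ):

```
# input.csv -- one 2-d point per line, header optional
1.0,2.0
1.1,2.1
0.9,1.9
10.0,10.0

# clusters.txt -- first line is a header, then one cluster ID per input point
cluster
0
0
0
-1
```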

## Option 2: Use the Python binding
### Option 2: Use the Python binding

There are two ways to install it:

* Compile it yourself: First install dependencies ``pip3 install -r src/requirements.txt`` and ``sudo apt install libpython3-dev``. Run ``python3 setup.py build --inplace``, The compilation will take a few minutes, and generate a ``.so`` library containing the ``DBSCAN`` module.
* ***OR*** Install it using PyPI: ``pip3 install --user dbscan`` (the latest version is 0.0.9)

An example for using the Python module is provided in ``src/example.py``. If the dependencies above are installed, simply run ``python3 example.py`` from ``src/`` to reproduce the plots above.

* Install it using PyPI: ``pip3 install --user dbscan`` (you can find the wheels [here](https://pypi.org/project/dbscan/#files))
* (harder and not recommended) Compile it yourself: First install the dependencies with ``pip3 install -r src/requirements.txt`` and ``sudo apt install libpython3-dev``, then run ``python3 setup.py build --inplace``. The compilation will take a few minutes and generate a ``.so`` library containing the ``DBSCAN`` module.
To create a wheel that works universally across many Python versions for a given OS, run ``python setup.py bdist_wheel`` in an environment containing the oldest numpy version available for the Python version that you are compiling for. For example, for Python 3.8, use numpy 1.17 to compile the wheel. The wheel will then work on all Python and numpy versions newer than that for your given OS. This is done automatically when installing via pip.
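
A sketch of that manual wheel build for Python 3.8 (the exact numpy pin is an assumption; pick the oldest release that supports your target Python):

```
python3.8 -m venv wheel-env && . wheel-env/bin/activate
pip install "numpy==1.17.*" setuptools wheel   # oldest numpy for this Python
pip install -r src/requirements.txt
python setup.py bdist_wheel                    # the wheel lands in dist/
```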

## Option 3: Include directly in your own C++ program

Create your own caller header and source file by instantiating the DBSCAN template function in "dbscan/algo.h".
An example of using the Python module is provided in ``example.py``. If the dependencies above are installed, simply run ``python3 example.py`` from the root directory to reproduce the plots above.

dbscan.h:
```c++
template<int dim>
int DBSCAN(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);

// equivalent to
// int DBSCAN(intT n, floatT PF[n][dim], double epsilon, intT minPts, bool coreFlagOut[n], intT coreFlag[n], intT cluster[n])
// if C++ syntax was a little more flexible

template<>
int DBSCAN<3>(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);
```

dbscan.cpp:
```c++
#include "dbscan/algo.h"
#include "dbscan.h"
```

Calling the instantiated function:
```c++
int n = ...; // number of data points
double data[n][3] = ...; // data points
int labels[n]; // label ids get saved here
bool core_samples[n]; // a flag determining whether or not the sample is a core sample is saved here
{
int ignore[n];
DBSCAN<3>(n, (void*)data, 70, 100, core_samples, ignore, labels);
}
```

Doing this will only compile the function for the number of dimensions that you want, which saves on compilation time.

You can also include the "dbscan/capi.h" and define your own ``DBSCAN_MIN_DIMS`` and ``DBSCAN_MAX_DIMS`` macros the same way the Python extension uses it. The function exported has the following signature.
```c++
extern "C" int DBSCAN(int dim, int n, double* PF, double epsilon, int minPts, bool* coreFlag, int* cluster);
```

Right now, the only two files that are guaranteed to remain in the C/C++ API are "dbscan/algo.h" and "dbscan/capi.h" and the functions named DBSCAN within.

### Python API
#### Python API

```
from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)
```

##### Input
#### Input

* ``X``: A 2-D Numpy array (``dtype=np.float64``) containing the input data points. The first dimension of ``X`` is the number of data points ``n``, and the second dimension is the data set dimensionality (the maximum supported dimensionality is 20).
* ``eps``: The epsilon parameter (default 0.5).
* ``min_samples``: The minPts parameter (default 5).

##### Output
#### Output

* ``labels``: A length ``n`` Numpy array (``dtype=np.int32``) containing cluster IDs of the data points, in the same ordering as the input data. Noise points are given a pseudo-ID of ``-1``.
* ``core_samples_mask``: A length ``n`` Numpy array (``dtype=np.bool_``) masking the core points, in the same ordering as the input data.
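
As a quick end-to-end sketch (made-up points; the printed values are what one would expect, not guaranteed output):

```
import numpy as np
from dbscan import DBSCAN

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [10.0, 10.0]])
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=2)
print(labels)             # expected: [ 0  0  0 -1] -- the far point is noise
print(core_samples_mask)  # expected: [ True  True  True False]
```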
@@ -146,6 +106,50 @@ plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
```

### Option 3: Include directly in your own C++ program

Create your own caller header and source file by instantiating the DBSCAN template function in "dbscan/algo.h".

dbscan.h:
```c++
template<int dim>
int DBSCAN(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);

// equivalent to
// int DBSCAN(intT n, floatT PF[n][dim], double epsilon, intT minPts, bool coreFlagOut[n], intT coreFlag[n], intT cluster[n])
// if C++ syntax was a little more flexible

template<>
int DBSCAN<3>(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);
```

dbscan.cpp:
```c++
#include "dbscan/algo.h"
#include "dbscan.h"
```

Calling the instantiated function:
```c++
int n = ...; // number of data points
double data[n][3] = ...; // data points (pseudocode; use a real allocation in practice)
int labels[n]; // cluster ids get saved here
bool core_samples[n]; // per-sample flag marking whether it is a core sample
{
int ignore[n];
DBSCAN<3>(n, (double*)data, 70, 100, core_samples, ignore, labels); // epsilon = 70, minPts = 100
}
```

Doing this will only compile the function for the number of dimensions that you want, which saves on compilation time.

You can also include "dbscan/capi.h" and define your own ``DBSCAN_MIN_DIMS`` and ``DBSCAN_MAX_DIMS`` macros the same way the Python extension uses them. The exported function has the following signature.
```c++
extern "C" int DBSCAN(int dim, int n, double* PF, double epsilon, int minPts, bool* coreFlag, int* cluster);
```

Right now, the only two files that are guaranteed to remain in the C/C++ API are "dbscan/algo.h" and "dbscan/capi.h", along with the functions named DBSCAN within them.
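
For illustration, a minimal caller of this C API might look like the sketch below (the macro values and parameters are illustrative, and treating the ``int`` return value as a status code is an assumption, not a documented guarantee):

```c++
// a minimal sketch, assuming dim = 2 falls inside the compiled range
#define DBSCAN_MIN_DIMS 2
#define DBSCAN_MAX_DIMS 2
#include "dbscan/capi.h"

int main() {
  const int dim = 2, n = 4;
  double pts[n * dim] = {1.0, 2.0, 1.1, 2.1, 0.9, 1.9, 10.0, 10.0}; // row-major points
  bool core[n];       // filled with per-point core flags
  int cluster[n];     // filled with cluster IDs, -1 for noise
  return DBSCAN(dim, n, pts, /*epsilon=*/0.3, /*minPts=*/2, core, cluster);
}
```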

## Citation

If you use our work in a publication, we would appreciate citations:
1 change: 0 additions & 1 deletion executable/main.cpp
@@ -4,7 +4,6 @@
#include "dbscan/point.h"
#include "dbscan/geometryIO.h"
#include "dbscan/pbbs/parallel.h"
#include "dbscan/pbbs/gettime.h"
#include "dbscan/pbbs/parseCommandLine.h"


3 changes: 1 addition & 2 deletions include/dbscan/algo.h
@@ -5,7 +5,7 @@
#include "dbscan/shared.h"
#include "dbscan/grid.h"
#include "dbscan/coreBccp.h"
#include "dbscan/pbbs/gettime.h"
// #include "dbscan/pbbs/gettime.h"
#include "dbscan/pbbs/parallel.h"
#include "dbscan/pbbs/sampleSort.h"
#include "dbscan/pbbs/unionFind.h"
@@ -96,7 +96,6 @@ int DBSCAN(intT n, floatT* PF, double epsilon, intT minPts, bool* coreFlagOut, i

auto uf = unionFind(G->numCell());

timing t1;
parallel_for(0, G->numCell(), [&](intT i) {
if (ccFlag[i]) {
auto procTj = [&](cellT* cj) {
3 changes: 2 additions & 1 deletion include/dbscan/kdNode.h
@@ -47,7 +47,8 @@
}}

inline void boundingBoxParallel() {
intT P = getWorkers()*8;
// intT P = getWorkers()*8;
static const intT P = 36 * 8;
intT blockSize = (n+P-1)/P;
pointT localMin[P];
pointT localMax[P];
3 changes: 3 additions & 0 deletions include/dbscan/pbbs/gettime.h
@@ -1,6 +1,7 @@
#ifndef GETTIME_H
#define GETTIME_H

/*
#include <stdlib.h>
#include <sys/time.h>
#include <iomanip>
@@ -92,4 +93,6 @@ struct timing {
// #define nextTime(_string) _tm.reportNext(_string);
// #define nextTimeN() _tm.reportT(_tm.next());

*/

#endif
4 changes: 2 additions & 2 deletions include/dbscan/pbbs/sequence.h
@@ -28,8 +28,8 @@
#include "utils.h"

// For fast popcount
#include <immintrin.h>
#include <x86intrin.h>
// #include <immintrin.h>
// #include <x86intrin.h>

using namespace std;

6 changes: 4 additions & 2 deletions include/dbscan/pbbs/unionFind.h
@@ -38,7 +38,8 @@ struct unionFind {
v = find(v);
if(u == v) break;
if(u > v) swap(u,v);
if(hooks[u] == intMax() && __sync_bool_compare_and_swap(&hooks[u], intMax(), u)){
// if(hooks[u] == intMax() && __sync_bool_compare_and_swap(&hooks[u], intMax(), u)){
if(hooks[u] == intMax() && utils::myCAS(&hooks[u], intMax(), u)){
parents[u]=v;
break;
}}
@@ -79,7 +80,8 @@ edgeUnionFind(intT nn): n(nn) {
v = find(v);
if(u == v) break;
if(u > v) swap(u,v);
if(hooks[u].first == intMax() && __sync_bool_compare_and_swap(&hooks[u].first, intMax(), c_from)){
// if(hooks[u].first == intMax() && __sync_bool_compare_and_swap(&hooks[u].first, intMax(), c_from)){
if(hooks[u].first == intMax() && utils::myCAS(&hooks[u].first, intMax(), c_from)){
parents[u]=v;
hooks[u].second=c_to;
break;
3 changes: 2 additions & 1 deletion include/dbscan/pbbs/utils.h
@@ -26,7 +26,7 @@
#include <algorithm>
#include "parallel.h"


/*
#if defined(__APPLE__)
#define PTCMPXCH " cmpxchgl %2,%1\n"
#else
@@ -39,6 +39,7 @@
static int __ii = mallopt(M_MMAP_MAX,0);
static int __jj = mallopt(M_TRIM_THRESHOLD,-1);
#endif
*/

#define newA(__E,__n) (__E*) malloc((__n)*sizeof(__E))

3 changes: 2 additions & 1 deletion include/dbscan/shared.h
@@ -126,7 +126,8 @@ point<dim> pMinSerial(point<dim>* items, intT n) {
template<int dim>
point<dim> pMinParallel(point<dim>* items, intT n) {
point<dim> pMin = point<dim>(items[0].x);
intT P = getWorkers()*8;
// intT P = getWorkers()*8;
static const intT P = 36 * 8;
intT blockSize = (n+P-1)/P;
point<dim> localMin[P];
for (intT i=0; i<P; ++i) {