
WIP: Initial commit of DBScan #12

Merged (11 commits, Dec 27, 2019)

Conversation

@xd009642 (Member)

I looked at the k-means implementation to keep the design consistent. Comments and tests still need to be filled in. Also, distance is currently just Euclidean distance, but it's probably a good idea to have some sort of distance trait or enum so people can pass in whichever distance function they want, for this or other algorithms.

I'll carry on filling in the rest; I just figured early feedback was better than later 😄
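To illustrate the distance-trait idea, here is a minimal sketch. All names here are hypothetical and not part of linfa's actual API; it just shows how a trait would let callers swap in a different metric:

```rust
/// Hypothetical distance abstraction; names are illustrative only.
trait Distance {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64;
}

struct Euclidean;

impl Distance for Euclidean {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        // L2 norm of the difference vector
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).powi(2))
            .sum::<f64>()
            .sqrt()
    }
}

struct Manhattan;

impl Distance for Manhattan {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        // L1 norm of the difference vector
        a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
    }
}

fn main() {
    let a = [0.0, 0.0];
    let b = [3.0, 4.0];
    assert!((Euclidean.distance(&a, &b) - 5.0).abs() < 1e-12);
    assert!((Manhattan.distance(&a, &b) - 7.0).abs() < 1e-12);
    println!("ok");
}
```

An algorithm could then take a `D: Distance` type parameter instead of hard-coding the Euclidean norm; an enum over a fixed set of metrics would be the simpler, less extensible alternative.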

@xd009642 xd009642 mentioned this pull request Dec 24, 2019
24 tasks
@xd009642 (Member, Author)

Also, the DBSCAN implementation in sklearn isn't one that you can fit then predict: fitting is part of the prediction stage, so I've replicated that. Incremental DBSCAN is an extension of the original algorithm.

@codecov (bot) commented Dec 24, 2019

Codecov Report

Merging #12 into master will decrease coverage by 2.22%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #12      +/-   ##
==========================================
- Coverage   96.68%   94.46%   -2.23%     
==========================================
  Files           7       10       +3     
  Lines         181      271      +90     
==========================================
+ Hits          175      256      +81     
- Misses          6       15       +9
Impacted Files                                    Coverage Δ
linfa-clustering/examples/kmeans.rs               100% <ø> (ø)
linfa-clustering/examples/dbscan.rs               100% <100%> (ø)
linfa-clustering/src/dbscan/hyperparameters.rs    86.66% <86.66%> (ø)
linfa-clustering/src/dbscan/algorithm.rs          95.38% <95.38%> (ø)
linfa-clustering/src/k_means/algorithm.rs         90.9% <0%> (-4.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ec5d491...d7f8236. Read the comment docs.

@LukeMathWalker (Contributor) commented Dec 24, 2019

Nice! I'll have a look straight away 😁

There are other interesting implementations in Python-land that we should keep in mind, in particular HDBSCAN and pyclustering.
HDBSCAN also supports assigning a cluster to a new point after the initial fitting round - see here.

* Remove Sync trait bounds
* Use ndarray_stats l2_norm
* Actually use observation in search queue to get neighbours
* Add two tests for noise points and nested dense clusters
@LukeMathWalker (Contributor)

It would be interesting to add a benchmark for DBSCAN as well. I believe we can do some optimisations in a couple of places, but I'd avoid starting on them before we can measure whether the gain is real 👍

@xd009642 (Member, Author)

While writing a quick benchmark I realised that predict was a bad name for a free function because of the public re-exports, so I've renamed it to dbscan for now. Benchmark coming soon (basically a copy of the k_means one).

@LukeMathWalker (Contributor)

Yeah, we can work on the naming. I would probably suggest wrapping it in a struct anyway, for ease of saving/loading as well as future extensibility. But we can figure this out once we've nailed down the algorithm implementation (I think we are close 😁)
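As a rough sketch of what that wrapping could look like (hypothetical names, loosely following the hyperparameter pattern used elsewhere in the crate): the hyperparameters live in a type that validates its inputs on construction and could later derive Serialize/Deserialize for saving and loading.

```rust
/// Hypothetical hyperparameter struct; not linfa's actual API.
/// Deriving Serialize/Deserialize later would give save/load for free.
#[derive(Debug, Clone, PartialEq)]
pub struct DbscanHyperParams {
    eps: f64,
    min_points: usize,
}

impl DbscanHyperParams {
    /// Validate once at construction so the algorithm can trust its inputs.
    pub fn new(eps: f64, min_points: usize) -> Self {
        assert!(eps > 0.0, "eps must be positive");
        assert!(min_points > 1, "min_points must be greater than 1");
        Self { eps, min_points }
    }

    pub fn eps(&self) -> f64 {
        self.eps
    }

    pub fn min_points(&self) -> usize {
        self.min_points
    }
}

fn main() {
    let params = DbscanHyperParams::new(0.5, 4);
    assert_eq!(params.eps(), 0.5);
    assert_eq!(params.min_points(), 4);
    println!("{:?}", params);
}
```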

@xd009642 (Member, Author)

Right, benchmark added; it should look very familiar.

@xd009642 (Member, Author)

Here's a picture of the change for 10,000 points:

[Screenshot from 2019-12-24 17-04-37: benchmark results]

* Change to return reference to the neighbour data as well to avoid lookups
@LukeMathWalker (Contributor)

I think we are there from an algorithmic point of view 😀

Next steps to get this ready to be merged:

  • an example in the examples category;
  • wrapping it in a struct for consistency/serialisation/future extensibility.

Then we are good to go 🚀

@xd009642 (Member, Author)

> I think we are there from an algorithmic point of view

I realised a mistake in my implementation of the algorithm! It's only minor, but the number of neighbours needs to be taken into account for each element in the search queue. I've added that, plus an example. I'm currently writing the doc comments, then I'll push something.
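A compact sketch of the corrected expansion loop (hypothetical and simplified to plain slices rather than ndarray; not the PR's actual code): a point popped off the search queue is always labelled, but only expands the cluster further if its *own* neighbourhood also meets `min_points`. Otherwise it is a border point and must not enqueue its neighbours.

```rust
// Naive O(n^2) neighbourhood query: indices of all points within eps of idx.
fn region_query(points: &[[f64; 2]], idx: usize, eps: f64) -> Vec<usize> {
    points
        .iter()
        .enumerate()
        .filter(|(_, p)| {
            let dx = p[0] - points[idx][0];
            let dy = p[1] - points[idx][1];
            (dx * dx + dy * dy).sqrt() <= eps
        })
        .map(|(i, _)| i)
        .collect()
}

// One-shot clustering: returns a cluster label per point, None = noise.
fn dbscan(points: &[[f64; 2]], eps: f64, min_points: usize) -> Vec<Option<usize>> {
    let mut labels: Vec<Option<usize>> = vec![None; points.len()];
    let mut visited = vec![false; points.len()];
    let mut cluster = 0;
    for i in 0..points.len() {
        if visited[i] {
            continue;
        }
        visited[i] = true;
        let neighbours = region_query(points, i, eps);
        if neighbours.len() < min_points {
            continue; // noise for now; may later be claimed as a border point
        }
        labels[i] = Some(cluster);
        let mut queue = neighbours;
        while let Some(j) = queue.pop() {
            if labels[j].is_none() {
                labels[j] = Some(cluster);
            }
            if visited[j] {
                continue;
            }
            visited[j] = true;
            let further = region_query(points, j, eps);
            // The fix described above: check the neighbour count for each
            // queued element; only core points keep expanding the cluster.
            if further.len() >= min_points {
                queue.extend(further);
            }
        }
        cluster += 1;
    }
    labels
}

fn main() {
    let points = [
        [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], // one dense cluster
        [10.0, 10.0],                       // an isolated noise point
    ];
    let labels = dbscan(&points, 0.5, 2);
    assert_eq!(labels, vec![Some(0), Some(0), Some(0), None]);
    println!("{:?}", labels);
}
```

Without the core-point check, border points would enqueue their own neighbours and clusters could leak across sparse regions.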

* Rename DBScan to Dbscan because the whole thing is an acronym
@LukeMathWalker LukeMathWalker merged commit 6fc1f69 into rust-ml:master Dec 27, 2019
@LukeMathWalker (Contributor)

Merged - thanks for all your work here @xd009642! 🙏
