
WIP: Initial commit of DBScan #12

Merged (11 commits, Dec 27, 2019)

Conversation

@xd009642 (Member)

I looked at the k-means implementation to keep the design consistent. Comments and tests still need to be filled in. Also, distance is currently just Euclidean distance, but it's probably a good idea to have some sort of distance trait or enum so people can pass in whichever distance function they want, for this or other algorithms.

I'll carry on filling in the rest; I just figured early feedback was better than later 😄
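To illustrate the distance-trait idea, here is a minimal sketch. All names here are hypothetical and not part of linfa's actual API; it just shows how a trait would let callers swap in a different metric:

```rust
/// Hypothetical distance abstraction; names are illustrative only.
trait Distance {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64;
}

struct Euclidean;

impl Distance for Euclidean {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        // L2 norm of the difference vector
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).powi(2))
            .sum::<f64>()
            .sqrt()
    }
}

struct Manhattan;

impl Distance for Manhattan {
    fn distance(&self, a: &[f64], b: &[f64]) -> f64 {
        // L1 norm of the difference vector
        a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
    }
}

fn main() {
    let a = [0.0, 0.0];
    let b = [3.0, 4.0];
    assert!((Euclidean.distance(&a, &b) - 5.0).abs() < 1e-12);
    assert!((Manhattan.distance(&a, &b) - 7.0).abs() < 1e-12);
    println!("ok");
}
```

An algorithm could then take a `D: Distance` type parameter instead of hard-coding the Euclidean norm; an enum over a fixed set of metrics would be the simpler, less extensible alternative.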

@xd009642 xd009642 mentioned this pull request Dec 24, 2019
24 tasks
@xd009642 (Member, Author)

Also, the DBSCAN implementation in sklearn isn't one that you can fit then predict: fitting is part of the prediction stage, so I've replicated that. Incremental DBSCAN is an extension of the original algorithm.

@codecov (bot) commented Dec 24, 2019

Codecov Report

Merging #12 into master will decrease coverage by 2.22%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #12      +/-   ##
==========================================
- Coverage   96.68%   94.46%   -2.23%     
==========================================
  Files           7       10       +3     
  Lines         181      271      +90     
==========================================
+ Hits          175      256      +81     
- Misses          6       15       +9
Impacted Files                                    Coverage Δ
linfa-clustering/examples/kmeans.rs               100% <ø> (ø)
linfa-clustering/examples/dbscan.rs               100% <100%> (ø)
linfa-clustering/src/dbscan/hyperparameters.rs    86.66% <86.66%> (ø)
linfa-clustering/src/dbscan/algorithm.rs          95.38% <95.38%> (ø)
linfa-clustering/src/k_means/algorithm.rs         90.9% <0%> (-4.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ec5d491...d7f8236. Read the comment docs.

@LukeMathWalker (Contributor) commented Dec 24, 2019

Nice! I'll have a look straight away 😁

There are other interesting implementations in Python-land that we should keep in mind, in particular HDBSCAN and pyclustering.
HDBSCAN also supports assigning a cluster to a new point after the initial fitting round - see here.

* Remove Sync trait bounds
* Use ndarray_stats l2_norm
* Actually use observation in search queue to get neighbours
* Add two tests for noise points and nested dense clusters
@LukeMathWalker (Contributor)

It would be interesting to add a benchmark for DBSCAN as well. I believe we can do some optimisations in a couple of places, but I'd avoid starting on them before we can measure whether the gain is real 👍

@xd009642 (Member, Author)

While writing a quick benchmark I realised that predict was a bad name for a free function because of the public re-exports, so I've renamed it to dbscan for now. Benchmark coming soon (basically a copy of the k_means one).

@LukeMathWalker (Contributor)

Yeah, we can work on the naming. I would probably suggest wrapping it in a struct anyway, for ease of saving/loading as well as future extensibility. But we can figure this out once we've nailed down the algorithm implementation (I think we are close 😁)
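As a rough sketch of what that wrapping could look like (hypothetical names, loosely following the hyperparameter pattern used elsewhere in the crate): the hyperparameters live in a type that validates its inputs on construction and could later derive Serialize/Deserialize for saving and loading.

```rust
/// Hypothetical hyperparameter struct; not linfa's actual API.
/// Deriving Serialize/Deserialize later would give save/load for free.
#[derive(Debug, Clone, PartialEq)]
pub struct DbscanHyperParams {
    eps: f64,
    min_points: usize,
}

impl DbscanHyperParams {
    /// Validate once at construction so the algorithm can trust its inputs.
    pub fn new(eps: f64, min_points: usize) -> Self {
        assert!(eps > 0.0, "eps must be positive");
        assert!(min_points > 1, "min_points must be greater than 1");
        Self { eps, min_points }
    }

    pub fn eps(&self) -> f64 {
        self.eps
    }

    pub fn min_points(&self) -> usize {
        self.min_points
    }
}

fn main() {
    let params = DbscanHyperParams::new(0.5, 4);
    assert_eq!(params.eps(), 0.5);
    assert_eq!(params.min_points(), 4);
    println!("{:?}", params);
}
```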

@xd009642 (Member, Author)

Right, benchmark added; it should look very familiar.

@xd009642 (Member, Author)

Here's a picture of the change for 10,000 points:

[Screenshot from 2019-12-24 17-04-37: benchmark results]

* Change to return reference to the neighbour data as well to avoid lookups
@LukeMathWalker (Contributor)

I think we are there from an algorithmic point of view 😀

Next steps to get this ready to be merged:

  • an example in the examples category;
  • wrapping it in a struct for consistency/serialisation/future extensibility.

Then we are good to go 🚀

@xd009642 (Member, Author)

> I think we are there from an algorithmic point of view

I realised a mistake in my implementation of the algorithm! It's only minor, but the number of neighbours needs to be taken into account for each element in the search queue. I've added that, plus an example. I'm currently writing the doc comments, then I'll push something.
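A compact sketch of the corrected expansion loop (hypothetical and simplified to plain slices rather than ndarray; not the PR's actual code): a point popped off the search queue is always labelled, but only expands the cluster further if its *own* neighbourhood also meets `min_points`. Otherwise it is a border point and must not enqueue its neighbours.

```rust
// Naive O(n^2) neighbourhood query: indices of all points within eps of idx.
fn region_query(points: &[[f64; 2]], idx: usize, eps: f64) -> Vec<usize> {
    points
        .iter()
        .enumerate()
        .filter(|(_, p)| {
            let dx = p[0] - points[idx][0];
            let dy = p[1] - points[idx][1];
            (dx * dx + dy * dy).sqrt() <= eps
        })
        .map(|(i, _)| i)
        .collect()
}

// One-shot clustering: returns a cluster label per point, None = noise.
fn dbscan(points: &[[f64; 2]], eps: f64, min_points: usize) -> Vec<Option<usize>> {
    let mut labels: Vec<Option<usize>> = vec![None; points.len()];
    let mut visited = vec![false; points.len()];
    let mut cluster = 0;
    for i in 0..points.len() {
        if visited[i] {
            continue;
        }
        visited[i] = true;
        let neighbours = region_query(points, i, eps);
        if neighbours.len() < min_points {
            continue; // noise for now; may later be claimed as a border point
        }
        labels[i] = Some(cluster);
        let mut queue = neighbours;
        while let Some(j) = queue.pop() {
            if labels[j].is_none() {
                labels[j] = Some(cluster);
            }
            if visited[j] {
                continue;
            }
            visited[j] = true;
            let further = region_query(points, j, eps);
            // The fix described above: check the neighbour count for each
            // queued element; only core points keep expanding the cluster.
            if further.len() >= min_points {
                queue.extend(further);
            }
        }
        cluster += 1;
    }
    labels
}

fn main() {
    let points = [
        [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], // one dense cluster
        [10.0, 10.0],                       // an isolated noise point
    ];
    let labels = dbscan(&points, 0.5, 2);
    assert_eq!(labels, vec![Some(0), Some(0), Some(0), None]);
    println!("{:?}", labels);
}
```

Without the core-point check, border points would enqueue their own neighbours and clusters could leak across sparse regions.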

* Rename DBScan to Dbscan because the whole thing is an acronym
@LukeMathWalker LukeMathWalker merged commit 6fc1f69 into rust-ml:master Dec 27, 2019
@LukeMathWalker (Contributor)

Merged - thanks for all your work here @xd009642! 🙏
