-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] Add hierarchical clustering #99
[FEATURE] Add hierarchical clustering #99
Conversation
57d2fde
to
ec10922
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look really great. 👍
There are two hard errors:
std::string const & = "blub"
creates a reference onto a temporary reference...which makes any acces invalid (I believe, not sure why it didn't make any trouble).std::numeric_limits<int>::infinity()
should be replaced by::max()
.- Cool c++ kids don't use new/delete 😂
The rest of my comments can definitely be ignored if you don't like 😃
@file Ending: As far as I understand every line of a text file has to be finished by an '\n', otherwise it is technically not a text file and some programs will fail (I believe it is not even a valid c++ program, but no compiler complains). This actually should be cared for by your text editor automatically. Not sure what went wrong.
32ad3f4
to
6b28ac7
Compare
Hi @SGSSGene, thanks for the very helpful review. I now incorporated most of your suggestions - with the exception of the "uncool" new/delete 😄 The hierarchical clustering functions that I use (as a dependency) expect arrays/pointers as parameters. From The size of the arrays is computed during runtime based on the number of elements in each partition. Currently, the space for the arrays is dynamically allocated using |
Ah yes, I see. Checkout You can use it as follows:
At the end |
6b28ac7
to
8bae8d6
Compare
Great, I replaced new/delete with nice and safe std::vectors now. Is the |
The |
Could you rebase this branch (or merge the origin/master into it)? there seems to some conflict in src/variant_detection/variant_detection.cpp |
8bae8d6
to
7f39c05
Compare
Hi @SGSSGene, I rebased the branch so it's ready for another review :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this work!
I have started to look at some of the details of the changes. I will do a full review after the approved first review.
But thank you very much, it looks great! 🎉
CMakeLists.txt
Outdated
# Dependency: hclust. | ||
add_subdirectory (lib) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you said you weren't sure how to do this, I looked in the code of our other seqan tools and found this (seqan/mars):
https://github.com/seqan/mars/blob/master/CMakeLists.txt
Do you think it makes sense to extend our code in this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's much better to get rid of the git submodule because we just want to use the library instead of accessing the source code. I just pushed a commit that replaces the submodule with CMake's FetchContent
to get the hierarchical clustering dependency. For FetchContent
I needed to increase the CMake version to 3.11.
test/api/CMakeLists.txt
Outdated
@@ -5,3 +5,5 @@ target_use_datasources (junction_detection_test FILES simulated.minimap2.hg19.co | |||
|
|||
add_api_test (junction_detection_methods_test.cpp) | |||
target_use_datasources (junction_detection_methods_test FILES simulated.minimap2.hg19.coordsorted_cutoff.sam) | |||
|
|||
add_api_test (clustering_methods_test.cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still open?
7f39c05
to
4c91a3d
Compare
4c91a3d
to
32f98e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your call against std::tie
is a good call. It would harm performance.
I see that there are still a few for
loops that could use more tidy up, but I think
it is not so important, it doesn't harm readability and too much premature optimizations might lead to waste of development time. So from my sight it gets a green light
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, that was a lot to read and process. 😮
First of all, thank you! Thanks also for the detailed tests on the new method! ❤️
I have a few documentation suggestions and otherwise really only style issues. 💅
Content wise it looks great! I hope I understood everything 😄
CMakeLists.txt
Outdated
# Dependency: hclust. | ||
add_subdirectory (lib) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no?
a1027ef
to
fbcde88
Compare
- remove git submodule for hclust - fetch hclust with FetchContent instead - increase cmake_minimum_required to VERSION 3.11 (for FetchContent) - use CMake 3.11.4 for Github CI
fbcde88
to
9f3c9b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! LGFM now, just two misspellings.
I'll commit These last changes by myself or do you want to cleanup the history anyway?
[DOC] Fix misspelling
This PR adds a second clustering method - agglomerative hierarchical clustering - and closes #54.
The idea
Agglomerative hierarchical clustering uses a "bottom-up" approach: Each junction starts in its own cluster. In every iteration, the two clusters with the lowest distance are merged. This process yields a dendogram (see the example here) that can be cut at a desired distance.
To break the problem of clustering thousands of junctions into multiple smaller problems, we first generate smaller partitions of junctions in close proximity (see
partition_junctions()
insrc/modules/clustering/hierarchical_clustering_method.cpp
).Then, we cluster each partition separately (see
hierarchical_clustering_method()
insrc/modules/clustering/hierarchical_clustering_method.cpp
).Implementation
We reuse an existing implementation of the hierarchical clustering algorithm from this repo which we add to the project as a submodule.
Review
It is probably a good idea to review this PR on a commit-by-commit basis because this makes it easier to follow the changes.
EDIT (irallia, 13.04.2021): As this is the last PR of the epic, it resolves #32 aswell.