[FEATURE] Add hierarchical clustering #99

Merged
merged 11 commits into from
Apr 26, 2021

Conversation

eldariont
Collaborator

@eldariont eldariont commented Apr 8, 2021

This PR adds a second clustering method - agglomerative hierarchical clustering - and closes #54.

The idea

Agglomerative hierarchical clustering uses a "bottom-up" approach: each junction starts in its own cluster. In every iteration, the two clusters with the smallest distance are merged. This process yields a dendrogram (see the example here) that can be cut at a desired distance.
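
As a rough illustration of the bottom-up idea, the following sketch clusters 1-D positions by repeatedly merging the two closest clusters and stops once the smallest distance exceeds a cutoff (the equivalent of cutting the dendrogram). It is not the code from this PR: the function names, the use of plain doubles instead of junctions, and the single-linkage distance are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: single-linkage distance, i.e. the smallest
// distance between any member of a and any member of b.
double cluster_distance(std::vector<double> const & a, std::vector<double> const & b)
{
    double best = std::numeric_limits<double>::max();
    for (double x : a)
        for (double y : b)
            best = std::min(best, std::abs(x - y));
    return best;
}

// Naive agglomerative clustering (illustration only, cubic or worse runtime).
// Every point starts in its own cluster; the two closest clusters are merged
// until the smallest remaining distance is larger than the cutoff.
std::vector<std::vector<double>> agglomerative_clustering(std::vector<double> const & points,
                                                          double cutoff)
{
    std::vector<std::vector<double>> clusters;
    for (double p : points)
        clusters.push_back({p});

    while (clusters.size() > 1)
    {
        std::size_t best_i = 0;
        std::size_t best_j = 1;
        double best_distance = std::numeric_limits<double>::max();

        // Find the pair of clusters with the smallest distance.
        for (std::size_t i = 0; i < clusters.size(); ++i)
        {
            for (std::size_t j = i + 1; j < clusters.size(); ++j)
            {
                double const d = cluster_distance(clusters[i], clusters[j]);
                if (d < best_distance)
                {
                    best_distance = d;
                    best_i = i;
                    best_j = j;
                }
            }
        }

        if (best_distance > cutoff)
            break; // cutting the dendrogram at this height

        // Merge cluster best_j into cluster best_i and remove best_j.
        clusters[best_i].insert(clusters[best_i].end(),
                                clusters[best_j].begin(), clusters[best_j].end());
        clusters.erase(clusters.begin() + static_cast<std::ptrdiff_t>(best_j));
    }
    return clusters;
}
```

For the positions 1, 2, 10, 11 and 50 with a cutoff of 3, this sketch ends with three clusters: {1, 2}, {10, 11} and {50}.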

To break the problem of clustering thousands of junctions into multiple smaller problems, we first generate smaller partitions of junctions in close proximity (see partition_junctions() in src/modules/clustering/hierarchical_clustering_method.cpp).

Then, we cluster each partition separately (see hierarchical_clustering_method() in src/modules/clustering/hierarchical_clustering_method.cpp).
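
The partitioning step could look roughly like the sketch below. It only illustrates the idea of splitting a sorted list wherever neighbours are far apart; the actual criteria used by partition_junctions() are not described here, so partition_by_gap, the int positions, and the max_gap parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: sort the positions and start a new partition whenever
// the gap to the previous position is larger than max_gap.
std::vector<std::vector<int>> partition_by_gap(std::vector<int> positions, int max_gap)
{
    std::sort(positions.begin(), positions.end());

    std::vector<std::vector<int>> partitions;
    for (int position : positions)
    {
        // Open a new partition for the first element or after a large gap.
        if (partitions.empty() || position - partitions.back().back() > max_gap)
            partitions.emplace_back();
        partitions.back().push_back(position);
    }
    return partitions;
}
```

Each resulting partition is small enough to be clustered on its own, which keeps the pairwise distance matrix per partition manageable.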

Implementation

We reuse an existing implementation of the hierarchical clustering algorithm from this repo which we add to the project as a submodule.

Review

It is probably a good idea to review this PR on a commit-by-commit basis because this makes it easier to follow the changes.

EDIT (irallia, 13.04.2021): As this is the last PR of the epic, it resolves #32 as well.

@eldariont eldariont force-pushed the FEATURE/hierarchical_clustering branch from 57d2fde to ec10922 Compare April 8, 2021 14:49
@eldariont eldariont requested a review from SGSSGene April 8, 2021 14:56
@SGSSGene SGSSGene left a comment

Changes look really great. 👍

There are two hard errors:

  1. std::string const & = "blub" creates a reference to a temporary, which makes any access invalid (I believe; I'm not sure why it didn't cause any trouble).
  2. std::numeric_limits<int>::infinity() should be replaced by ::max().
  3. Cool C++ kids don't use new/delete 😂
    The rest of my comments can definitely be ignored if you don't like them 😃
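
To illustrate the second point: int has no representation of infinity, so std::numeric_limits<int>::infinity() returns a value-initialized int, which is 0. The sketch below checks this at compile time and uses ::max() as the starting value of a minimum search; min_distance is a hypothetical helper, not a function from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// infinity() is only meaningful for types where has_infinity is true.
static_assert(!std::numeric_limits<int>::has_infinity, "int cannot represent infinity");
static_assert(std::numeric_limits<int>::infinity() == 0, "for int, infinity() yields 0");

// Hypothetical helper: smallest value in distances, or max() if there is none.
// Starting from infinity() would start from 0 and hide all positive distances.
int min_distance(std::vector<int> const & distances)
{
    int best = std::numeric_limits<int>::max();
    for (int d : distances)
        best = std::min(best, d);
    return best;
}
```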

@file Ending: As far as I understand, every line of a text file has to end with '\n'; otherwise it is technically not a text file and some programs will fail (I believe it is not even a valid C++ program, but no compiler complains). Your text editor should actually take care of this automatically. Not sure what went wrong.

include/structures/cluster.hpp Outdated
include/structures/cluster.hpp Outdated
lib/CMakeLists.txt Outdated
src/modules/clustering/hierarchical_clustering_method.cpp Outdated
src/modules/clustering/hierarchical_clustering_method.cpp Outdated
test/api/clustering_methods_test.cpp Outdated
test/api/clustering_methods_test.cpp Outdated
test/api/clustering_methods_test.cpp Outdated
src/structures/cluster.cpp Outdated
@eldariont eldariont force-pushed the FEATURE/hierarchical_clustering branch 2 times, most recently from 32ad3f4 to 6b28ac7 Compare April 9, 2021 13:38
@eldariont
Collaborator Author

Hi @SGSSGene, thanks for the very helpful review. I have now incorporated most of your suggestions, with the exception of the "uncool" new/delete 😄

The hierarchical clustering functions that I use (as a dependency) expect arrays/pointers as parameters.

From lib/hclust/fastcluster.h:
int hclust_fast(int n, double* distmat, int method, int* merge, double* height);

The size of the arrays is computed at runtime based on the number of elements in each partition. Currently, the space for the arrays is dynamically allocated using new. Do you see an option that would get rid of the new/delete? std::array expects a compile-time constant as its size, and std::vector does not convert to a pointer AFAIK. So which other options are there?

@eldariont eldariont requested a review from SGSSGene April 9, 2021 13:49
@SGSSGene

Ah yes, I see. Check out std::vector::data() https://en.cppreference.com/w/cpp/container/vector/data

You can use it as follows:

std::vector<int> myVector(10); // creates a vector of size 10
assert(myVector.size() == 10);
int* ptr = myVector.data();

In the end, std::vector of course also calls new/delete, but it is much safer because it handles exceptions and early returns correctly. The general concept is called RAII and can be extended from memory to many other resources https://en.cppreference.com/w/cpp/language/raii
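
A slightly fuller sketch of this pattern, passing vector storage to a C-style function that expects raw pointers. scale_values is a hypothetical stand-in written for this example; it is not hclust_fast and does not mimic its behaviour, it only has the same "length plus pointer" calling style.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical C-style API: multiplies n values in place and returns 0 on success.
int scale_values(int n, double * values, double factor)
{
    for (int i = 0; i < n; ++i)
        values[i] *= factor;
    return 0;
}

// The vector owns the memory; data() exposes it as a plain pointer.
// No new/delete is needed, and the memory is released on every exit path.
std::vector<double> doubled(std::size_t n)
{
    std::vector<double> values(n, 1.5); // n elements, each initialized to 1.5
    int const status = scale_values(static_cast<int>(values.size()), values.data(), 2.0);
    assert(status == 0);
    (void) status; // avoid an unused-variable warning when asserts are disabled
    return values;
}
```

The pointer returned by data() stays valid only as long as the vector is alive and not resized, so it should be fetched right before the call rather than stored.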

@eldariont eldariont force-pushed the FEATURE/hierarchical_clustering branch from 6b28ac7 to 8bae8d6 Compare April 12, 2021 09:55
@eldariont
Collaborator Author

Great, I have now replaced new/delete with nice and safe std::vectors. Is the assert(v.size() == n) required after I construct the vector with std::vector<T> v(n)?

@SGSSGene

The assert(v.size() == n) is not required. I just put it there to "prove" that it is guaranteed that v.size() is always equal to n.

@SGSSGene

Could you rebase this branch (or merge origin/master into it)? There seems to be a conflict in src/variant_detection/variant_detection.cpp

@eldariont eldariont force-pushed the FEATURE/hierarchical_clustering branch from 8bae8d6 to 7f39c05 Compare April 14, 2021 09:43
@eldariont
Collaborator Author

Hi @SGSSGene, I rebased the branch so it's ready for another review :)

Collaborator

@Irallia Irallia left a comment

Thank you for this work!
I have started to look at some of the details of the changes. I will do a full review once the first review is approved.
But thank you very much, it looks great! 🎉

CMakeLists.txt Outdated
Comment on lines 33 to 35
# Dependency: hclust.
add_subdirectory (lib)

Collaborator

Because you said you weren't sure how to do this, I looked in the code of our other seqan tools and found this (seqan/mars):
https://github.com/seqan/mars/blob/master/CMakeLists.txt
Do you think it makes sense to extend our code in this way?

Collaborator

no?

Collaborator Author

Yes, it's much better to get rid of the git submodule because we just want to use the library instead of accessing the source code. I just pushed a commit that replaces the submodule with CMake's FetchContent to get the hierarchical clustering dependency. For FetchContent I needed to increase the CMake version to 3.11.

src/CMakeLists.txt Outdated
src/structures/breakend.cpp
src/structures/cluster.cpp
src/structures/junction.cpp
@@ -5,3 +5,5 @@ target_use_datasources (junction_detection_test FILES simulated.minimap2.hg19.co

add_api_test (junction_detection_methods_test.cpp)
target_use_datasources (junction_detection_methods_test FILES simulated.minimap2.hg19.coordsorted_cutoff.sam)

add_api_test (clustering_methods_test.cpp)
Collaborator

still open?

@eldariont eldariont force-pushed the FEATURE/hierarchical_clustering branch from 7f39c05 to 4c91a3d Compare April 19, 2021 12:01
@eldariont eldariont force-pushed the FEATURE/hierarchical_clustering branch from 4c91a3d to 32f98e5 Compare April 19, 2021 12:11
@eldariont
Collaborator Author

Hi @SGSSGene, I rebased the PR another time. As soon as you are satisfied with the changes, @Irallia can do the second review.

@SGSSGene SGSSGene left a comment

I think your decision against std::tie is a good call. It would harm performance.
I see that there are still a few for loops that could use more tidying up, but I think it is not that important: it doesn't harm readability, and too much premature optimization might waste development time. So from my side it gets a green light.

@SGSSGene SGSSGene requested a review from Irallia April 19, 2021 20:30
Collaborator

@Irallia Irallia left a comment

Wow, that was a lot to read and process. 😮
First of all, thank you! Thanks also for the detailed tests on the new method! ❤️
I have a few documentation suggestions and otherwise really only style issues. 💅
Content wise it looks great! I hope I understood everything 😄

src/modules/clustering/hierarchical_clustering_method.cpp Outdated
src/modules/clustering/hierarchical_clustering_method.cpp Outdated
src/modules/clustering/hierarchical_clustering_method.cpp Outdated
test/api/clustering_test.cpp Outdated
test/api/clustering_test.cpp Outdated
test/api/clustering_test.cpp Outdated
test/api/clustering_test.cpp Outdated
@eldariont eldariont force-pushed the FEATURE/hierarchical_clustering branch from a1027ef to fbcde88 Compare April 23, 2021 15:58
- remove git submodule for hclust
- fetch hclust with FetchContent instead
- increase cmake_minimum_required to VERSION 3.11 (for FetchContent)
- use CMake 3.11.4 for Github CI
@eldariont eldariont force-pushed the FEATURE/hierarchical_clustering branch from fbcde88 to 9f3c9b7 Compare April 23, 2021 16:21
@eldariont eldariont requested a review from Irallia April 23, 2021 16:34
Collaborator

@Irallia Irallia left a comment

Thanks for the changes! LGFM now, just two misspellings.
I'll commit these last changes myself, or do you want to clean up the history anyway?

@Irallia Irallia merged commit f109d08 into seqan:master Apr 26, 2021
@eldariont eldariont deleted the FEATURE/hierarchical_clustering branch April 27, 2021 07:04

Successfully merging this pull request may close these issues.

[FEATURE] Cluster junctions by hierarchical clustering
iGenVar - Call Deletions from long reads