Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Constrained Hierarchical Agglomerative Clustering
Constrained HAC is hierarchical agglomerative clustering in which each observation is associated to a position, and the clustering is constrained so as only adjacent clusters are merged. These methods are useful in various application fields, including ecology (Quaternary data) and bioinformatics (e.g. in Genome-Wide Association Studies (GWAS), see https://doi.org/10.1186/s12859-015-0556-6). However, the complexity of constrained HAC is intrinsically quadratic in the number
p of items to cluster. In common situations where
p is of the order of 10^5-10^6, this algorithm is not applicable.
In several applications including GWAS, the similarity between two items is a decreasing function of their physical distance, and it is reasonable to assume that the similarity between two items is close to 0 when these items are distant of more than a constant
h. Under this assumption, it is possible to propose an algorithm whose complexity in
p is quasilinear in time (
O(p(log(p)+h)) and linear in space (
O(ph)). This algorithm is described in this PhD thesis (chapter 5.3) for the Ward criterion. It takes as an input a band similarity matrix of size
ph, and combines pre-calculation of certain partial sums of similarities with an efficient storage of the successive merges in a binary heap. A beta package implementing this algorithm is already available: https://github.com/pneuvial/adjclust, with Ward criterion to produce a dendrogram associated to constrained HAC (the position of the rows in the similarity matrix are the positions used for the adjacency constrains).
This project aims at making this beta package a package which is submittable to CRAN.
The R package rioja implements a similar algorithm (it takes dist objects as an input and uses either Ward criterion or single linkage criterion). This algorithm does not scale well with
p, as it is quadratic in both time and space.
Besides good expertise in R and C++ programming, this project requires strong knowledge about hierarchical clustering and mathematical background on similarity / dissimilarity and Euclidean distances
Month 1: clean package with minimal (but sufficient) features
The goal of the first month of the project is to make a package with minimal features that can be submitted to CRAN
- all user-level functions should be documented. The documentation should follow the CRAN policies and be created using the roxygen2 package. Each documented function should contain an informative example
- in the current version, only a part of the matrix, located in the neighborhood of the diagonal, is passed to the function; however, this format is not standard and an automatic function to create this data is needed. This function should be able to take similarity matrices given as standard matrices or as dist objects. Currently, the function ‘adjClustBand_heap’ (silently) assumes that the similarity of an element with itself, is equal to 1: this problem should be taken care of in the proposed solutions
- the constrained clustering function should return an object of type ‘hclust’.
- small tests should be written (using the testthat package) in order to increase the code coverage of the package. A specific test should be written in order to make sure that the package is able to reproduce the results obtained by rioja for a maximum band size (h=p).
- write a vignette describing the usage of the main functions of the package.
Month 2: I/O improvement: sparse matrices and applications
The goal of the second month of the project is to improve the user interface of the package in specific applications where p is large: Genome-Wide Association studies (GWAS) and Hi-C
2.1 Sparse matrices
The implementation of constrained HAC in the adjclust package is able to handle situations where p is large (i.e. p>10^6), but the similarity matrix is possibly sparse and its signal mostly located near the diagonal. In such situations, the input data is typically a sparse matrix.
- implement a wrapper function that performs constrained HAC from objects of class Matrix::dgCMatrix and Matrix::dsCMatrix
2.2. Genome-Wide Association studies
In GWAS, the objects to be clustered are p single Nucleotide Polymorphisms (SNPs): see the GWAS vignette. The typical input is either a p*p matrix of linkage disequilibrium (LD, see e.g. the matrix ‘ld.ceph’ in the vignette), or a n*p matrix giving the genotypes of p SNPs for n individuals (see e.g. the matrix ‘ceph.1mb’). Tasks:
- implement a wrapper function that performs constrained HAC from a LD matrix of class Matrix::dgCMatrix
- implement a wrapper function that performs constrained HAC from a genotype matrix of class base::matrix of snpStats::SnpMatrix
- document these functions and add an example on how they may be used on the data from the GWAS vignette.
In Hi-C data analyis, the objects to be clustered are p genomic regions or loci, and the input data is in the form of a contact map. A contact map is a p x p similarity matrix, where the similarity between any pair of loci quantifies the intensity of the physical interaction between these loci. As most pairs of loci have no interaction, the contact map is usually very sparse. The typical input for the clustering function is a p x p Matrix::dsCMatrix (see e.g. the Bioconductor package HiCDataHumanIMR90), or a text file with one line per pair of loci for which an interaction has been observed (in the format: locus1 | locus2 | signal). Tasks:
- implement a wrapper function that performs constrained HAC from a Hi-C contact map of class Matrix::dsCMatrix or given as a text file
- document this function and add an example on how it may be used on the data from the HiCDataHumanIMR90 package.
Month 3: Performance and new features
3.1 Performance of ‘adjClustBand_heap’
The current implementation of the ‘adjClustBand_heap’ function uses two internal functions ‘.toMatLeft’ and ‘.toMatRight’ that create p * h matrices which are then used to pre-calculate cumulative sums of similarities which are in turn required by the constrained HAC algorithm. These functions are currently a computational bottleneck for the ‘adjClustBand_heap’ function. Task:
- propose a more efficient implementation of the pre-calculation of the necessary cumulative sums
3.2. Additional merging criteria
The current version of the package only implements the Ward criterion for grouping two adjacent clusters. Tasks:
- implement the single linkage criterion
- implement the complete linkage criterion
The issue addressed by this method is a very standard problem, especially in bioinformatics. In this field, the size of data is usually very large (many thousands) and an efficient constrained HAC package with all linkage options would have many applications (GWAS, Hi-C analysis, …).
Several tests that potential students can do to demonstrate their capabilities for this particular project:
- Easy: use the R package rioja to obtain constrained HAC (with the function chclust) using both implemented “method”s and this dataset (which is a dissimilarity matrix). Provide script, output and explain possible problems (the dissimilarity matrix is not Euclidean).
- Medium: fork our github repository and compile the package. The package includes a vignette describing how to use the low level functions ‘HeapHop’ and ‘adjClustBand_heap’ in the package. The input is a vector which contains a band (width:
h) centered around a similarity matrix. NB: as demonstrated in the vignette, the input is slightly different for the two functions.
- using this dataset (an Euclidean dissimilarity matrix), create the similarity matrix using the last equation in page 6 of this article. Then extract the band centered on the diagonal with width 5 and use the function ‘HeapHop’ or ‘adjClustBand_heap’ to obtain the clustering of this dataset. Provide your entire script (from the similarity construction to the use of the clustering function) and your outputs (RMarkdown or similar preferred)
- compare the results obtained by the ‘adjclust’ package to the ones obtained by ‘rioja’ for this dissimilarity matrix by varying the width in ‘adjclust’.
- implement a basic R function that takes a similarity matrix and a value for
hand call for ‘HeapHop’ or ‘adjClustBand_heap’ to obtain a ‘hclust’ object
- Hard: the package uses C scripts for the implementation of ‘adjClustBand_heap’, and C++ scripts for the implementation of ‘HeapHop’. For this test you may focus on one of the two implementations at your choice: ‘adjClustBand_heap’ or ‘HeapHop’.
- Have a look at the src directory and describe what are the different functions used for and the relations between them (textual description, a single line by function should be enough plus a dependency graph for instance)
- Modify the C or C++ scripts in the src directory to implement merging of clusters by single linkage criterion (Ward criterion is the one already implemented). Post a script with the functions that are modified (replace the current functions by the ones implementing the single linkage criterion. Do not try to pass an option for choosing between the two merging options)
Solutions of tests
Students, please post a link to your test results here.
Yasheel Vyas (https://github.com/yasheel-vyas/HAC-Clustering-GSoC-2017)
Parismita Das (https://github.com/parismita/gsoc-2017)
Shubham Kumar (https://github.com/shubhamkmr47/R-GSoC2017)