####Note
This is a forked version of EBC python implementation. For the final release please refer to this repo.
A python implementation of the Ensemble Biclustering for Classification (EBC) algorithm. Although having "biclustering" in its name, EBC is a co-clustering algorithm that allows you to perform co-clustering on very large sparse N-dimensional matrix. For details and examples of using EBC please reference this paper.
##Files
ebc.py: an implementation of EBC algorithm using sparse matrix.ebc2d.py: a vectorized implementation of EBC algorithm using (2D) numpy dense array.matrix.py: a N-dimensional sparse matrix implementation using python dict that allows basic get/set/sum operations.resources/: this directory contains many example data that could be used to run EBC on.
##Usage
####General Usage
Using ebc is easy. First you need a sparse matrix constructed as a SparseMatrix instance defined in matrix.py. You can use matrix.py's built in read_data() method to construct the sparse matrix.
Once you have the sparse matrix and import the ebc module, you can simply do the following:
ebc = EBC(sparse_matrix, n_clusters=[30, 125], max_iterations=10)
cXY, objective, iter = ebc.run()
The returned cXY contains the clustering assignments along each axis in a list of lists, objective contains the final objective function value, iter contains the iterations that it ran for to converge.
####Efficiency Considerations
In short, EBC is built for highly sparse multi-dimensional matrix, while EBC2D is a vectorized version of EBC, and is built for less sparse 2D matrix. The running time of EBC increases linearly with the number of non-zero elements, while the running time of EBC2D will only increase when matrix size increases.
- If your input matrix is highly sparse or you need support for N-dimensional matrix with N > 2, you should use
EBCclass inebc.py. - If your input matrix is not very sparse (e.g. 5% density), and can fit into memory with
numpy, you can choose to useEBC2Dclass inebc2d.py. Note thatEBC2Donly supports 2D matrix, since large multi-dimensional dense matrix can hardly fit into memory.
To give you more sense of the efficiency, the EBC implementation runs for ~50 seconds on our 14052 x 7272 example matrix (which is >99% sparse) on a MacBook, and in this case it is faster than EBC2D. However, in our testing, for a 5000 x 5000 matrix of 95% sparsity, EBC2D can get a x5 speedup over EBC. And this speedup grows as the sparsity decreases.
####Dependency
To run this implementation of EBC algorithm, you need to have numpy installed.
##References
- Percha, Bethany, and Russ B. Altman. "Learning the Structure of Biomedical Relationships from Unstructured Text." PLoS computational biology 11.7 (2015).
- Dhillon, Inderjit S., Subramanyam Mallela, and Dharmendra S. Modha. "Information-theoretic co-clustering." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.