# [WIP] Implementation of the Cheng and Church algorithm #2172

Open
wants to merge 50 commits into from

### 3 participants

This pull request is ready for review.

TODO:

• indicate which rows were inverted
• examples
• incremental update in Cython
• documentation
• full test coverage
• also do square residue for node addition in Cython (turned out not to be worth it)
• incremental update of MSR (tried it in pure python, but was not much faster.)
• data sample generator
• get base implementation working.
scikit-learn member
commented Jul 25, 2013

Could you rebase this on top of master and remove the duplicated commits?

Done rebasing. Time to start writing more tests.

Okay, I just finished the incremental update of the MSR in Cython. I tested it on arrays of three different sizes. Here are the times and speedups (Cython vs Cython + incremental update):

• 500x500: 2.05 seconds vs 1.45 seconds. 1.4x speedup.
• 1000x1000: 16.4 seconds vs 10.5 seconds. 1.6x speedup.
• 1500x1500: 53.1 seconds vs 34 seconds. 1.6x speedup.

The full row and column MSR calculations for node addition are still in pure Python. However, since the algorithm spends 97% of its runtime in single node deletion and only 2% in node addition, it does not seem worth the extra complexity of doing that in Cython too.
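
A rough sketch of how an incremental MSR update can work, under stated assumptions: the `IncrementalMSR` class quoted later in this thread caches the total, the row sums, and the column sums, so removing a row only has to touch that one row. The hypothetical `RunningMSR` class below (pure Python, not the PR's Cython code) also keeps the sum of squares, which allows the MSR to be refreshed from the identity sum(residue^2) = sum(a^2) - sum(row_sum^2)/n - sum(col_sum^2)/m + total^2/(mn). The PR may well do this differently, and this closed form is more exposed to floating-point cancellation than recomputing the residues.

```python
def direct_msr(a):
    """Recompute the mean squared residue from scratch: O(m * n)."""
    m, n = len(a), len(a[0])
    row_mean = [sum(row) / n for row in a]
    col_mean = [sum(row[j] for row in a) / m for j in range(n)]
    mean = sum(map(sum, a)) / (m * n)
    return sum((a[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
               for i in range(m) for j in range(n)) / (m * n)


class RunningMSR:
    """Hypothetical helper: caches sums so deletions are cheap."""

    def __init__(self, data):
        self.data = data
        self.rows = set(range(len(data)))
        self.cols = set(range(len(data[0])))
        self.row_sum = {i: sum(data[i]) for i in self.rows}
        self.col_sum = {j: sum(row[j] for row in data) for j in self.cols}
        self.sq_sum = sum(x * x for row in data for x in row)

    def remove_row(self, i):
        # O(n): only the entries of row i are visited.
        self.rows.remove(i)
        del self.row_sum[i]
        for j in self.cols:
            x = self.data[i][j]
            self.col_sum[j] -= x
            self.sq_sum -= x * x

    def remove_col(self, j):
        # O(m): only the entries of column j are visited.
        self.cols.remove(j)
        del self.col_sum[j]
        for i in self.rows:
            x = self.data[i][j]
            self.row_sum[i] -= x
            self.sq_sum -= x * x

    def msr(self):
        # O(m + n): uses the cached sums instead of the full submatrix.
        m, n = len(self.rows), len(self.cols)
        total = sum(self.row_sum.values())
        ss = (self.sq_sum
              - sum(s * s for s in self.row_sum.values()) / n
              - sum(s * s for s in self.col_sum.values()) / m
              + total * total / (m * n))
        return ss / (m * n)


data = [[1.0, 2.0, 3.0, 4.0],
        [2.0, 3.0, 4.0, 5.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 0.0, 7.0, -3.0]]

inc = RunningMSR(data)
inc.remove_row(0)
inc.remove_col(0)
incremental = inc.msr()
from_scratch = direct_msr([row[1:] for row in data[1:]])
```

Here `incremental` and `from_scratch` agree up to rounding, which mirrors the check that the PR's own test performs after `remove_row(0)` and `remove_col(0)`.
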

I have made some further improvements to the Cython code. As requested, here are some numbers for naive pure Python vs Cython with incremental MSR updating:

• 500x500 matrix: 15.2 seconds vs 1.11 seconds. 13.7x speedup.
• 1000x1000 matrix: 124 seconds vs 7.93 seconds. 15.6x speedup.
• 1500x1500 matrix: 419 seconds vs 24.7 seconds. 17x speedup.
scikit-learn member
commented Aug 1, 2013
commented on the diff Aug 15, 2013
doc/modules/biclustering.rst

    +.. math::
    +    \frac{1}{mn} \sum \left (a_{ij} - a_{iJ} - a_{Ij} + a_{I J} \right)^2
    +
    +The mean squared residue achieves its minimum of 0 when all the rows
    +and all the columns of a bicluster are shifted versions of each other.
    +Constant biclusters, biclusters with constant rows or columns, or
    +biclusters with identical rows or columns are special cases of this
    +condition, all of which have an MSR of 0. As a corollary, additive
    +scaling of the entire bicluster does not affect the MSR score of any
    +of its biclusters: :math:`A` and :math:`A+c`, where :math:`c` is any
    +constant, have the same MSR.
    +
    +For this reason, Cheng and Church may be useful to find shift-pattern
    +biclusters. It may instead find scale-pattern biclusters after
    +applying a log transformation to the data, which converts
    +multiplicative scaling into additive scaling. In other words, if

scikit-learn member vene added a note Aug 15, 2013:
This is a great point in my opinion. Do you know a certain application or data (maybe artificial) that could show this off?

kemaleren added a note Aug 29, 2013:
It's a pattern that shows up in gene expression data. I could modify the microarray example to show the difference in the kinds of biclusters found when the data is first log-transformed.

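
The properties claimed in the documentation excerpt above can be checked on small artificial data. This is a minimal pure-Python sketch (the helper `msr` and the toy matrices are illustrative, not part of the PR): it evaluates the MSR formula directly, then checks shift invariance and the effect of a log transform on a multiplicative pattern. The log step assumes strictly positive entries.

```python
import math


def msr(a):
    """Mean squared residue of a matrix given as a list of equal-length rows."""
    m, n = len(a), len(a[0])
    row_mean = [sum(row) / n for row in a]
    col_mean = [sum(row[j] for row in a) / m for j in range(n)]
    mean = sum(map(sum, a)) / (m * n)
    return sum((a[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
               for i in range(m) for j in range(n)) / (m * n)


# Shift pattern: every row is [1, 2, 4] plus a constant, so the MSR is 0.
shifted = [[1.0, 2.0, 4.0], [11.0, 12.0, 14.0], [-4.0, -3.0, -1.0]]

# Adding a constant c to every entry leaves the MSR unchanged.
noisy = [[1.0, 2.0, 4.0], [11.0, 15.0, 14.0], [-4.0, -3.0, 2.0]]
noisy_plus_c = [[x + 100.0 for x in row] for row in noisy]

# Scale pattern: outer product u v^T has a nonzero MSR,
# but log(u_i * v_j) = log(u_i) + log(v_j) is a shift pattern.
u = [1.0, 2.0, 5.0]
v = [1.0, 3.0, 10.0]
outer = [[ui * vj for vj in v] for ui in u]
log_outer = [[math.log(x) for x in row] for row in outer]

print(msr(shifted), msr(noisy), msr(noisy_plus_c))
print(msr(outer), msr(log_outer))
```
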
commented on the diff Aug 15, 2013
doc/modules/biclustering.rst

    +of its biclusters: :math:`A` and :math:`A+c`, where :math:`c` is any
    +constant, have the same MSR.
    +
    +For this reason, Cheng and Church may be useful to find shift-pattern
    +biclusters. It may instead find scale-pattern biclusters after
    +applying a log transformation to the data, which converts
    +multiplicative scaling into additive scaling. In other words, if
    +:math:`u` and :math:`v` are any column vectors, the MSR of :math:`u
    +v^\top` may be large, but the MSR of :math:`\log(u v^\top)` is 0.
    +
    +Cheng and Church finds biclusters that are as large as possible, with
    +the constraint that a bicluster's MSR must be less than the threshold
    +:math:`\delta`. The algorithm proceeds in an iterative greedy fashion.
    +It starts with the whole dataset, greedily removes rows and columns
    +until :math:`\text{MSR} < \delta`, then greedily adds rows and columns
    +while maintaining the bicluster's score. Once a bicluster has been

scikit-learn member vene added a note Aug 15, 2013:
You mean maintaining it exactly, or within a tolerance?

kemaleren added a note Aug 17, 2013:
Exactly. I should clarify this section, or maybe add something to the code, however. If inverse_rows is True, the MSR of the bicluster may be greater than delta, if the fact that some rows need to be inverted is not taken into account.

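
For readers unfamiliar with the greedy removal phase described in the excerpt, here is a simplified pure-Python sketch of single node deletion only: repeatedly drop the row or column with the largest mean squared residue until the bicluster's MSR is at most `delta`. The function names and the toy data are hypothetical, and the sketch leaves out the multiple node deletion, node addition, and row inversion that the PR implements.

```python
def squared_residues(a):
    """Element-wise squared residues (a_ij - rowmean_i - colmean_j + mean)^2."""
    m, n = len(a), len(a[0])
    row_mean = [sum(row) / n for row in a]
    col_mean = [sum(row[j] for row in a) / m for j in range(n)]
    mean = sum(map(sum, a)) / (m * n)
    return [[(a[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
             for j in range(n)] for i in range(m)]


def single_node_deletion(data, delta):
    """Greedily delete the worst row or column until MSR <= delta."""
    rows = list(range(len(data)))
    cols = list(range(len(data[0])))
    while True:
        sub = [[data[i][j] for j in cols] for i in rows]
        sq = squared_residues(sub)
        m, n = len(rows), len(cols)
        score = sum(map(sum, sq)) / (m * n)
        # A single row or column always has MSR 0, so stop there too.
        if score <= delta or m == 1 or n == 1:
            return rows, cols, score
        row_score = [sum(sq[i]) / n for i in range(m)]
        col_score = [sum(sq[i][j] for i in range(m)) / m for j in range(n)]
        worst_row = max(range(m), key=row_score.__getitem__)
        worst_col = max(range(n), key=col_score.__getitem__)
        if row_score[worst_row] >= col_score[worst_col]:
            del rows[worst_row]
        else:
            del cols[worst_col]


# Three rows that are shifted copies of each other, plus one outlier row.
data = [[1.0, 2.0, 3.0, 4.0],
        [2.0, 3.0, 4.0, 5.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 0.0, 7.0, -3.0]]

kept_rows, kept_cols, final_msr = single_node_deletion(data, delta=0.01)
```

On this toy input the outlier row has the largest score, so it is the only thing removed and the remaining 3x4 block is a perfect shift pattern.
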
and 1 other commented on an outdated diff Aug 15, 2013
examples/bicluster/cheng_church_microarray.py

    +import numpy as np
    +
    +from sklearn.cluster.bicluster import ChengChurch
    +
    +# get data
    +url = "http://arep.med.harvard.edu/biclustering/lymphoma.matrix"
    +lines = urllib.urlopen(url).read().strip().split('\n')
    +# insert a space before all negative signs
    +lines = list(' -'.join(line.split('-')).split(' ') for line in lines)
    +lines = list(list(int(i) for i in line if i) for line in lines)
    +data = np.array(lines)
    +
    +# replace missing values, just as in the paper
    +generator = np.random.RandomState(0)
    +idx = np.where(data == 999)
    +data[idx] = generator.randint(-800, 801, len(idx[0]))

scikit-learn member vene added a note Aug 15, 2013:
Maybe use the newly merged missing data imputation, if it can do the same thing? ^_^

kemaleren added a note Aug 17, 2013:
It would be nice to use the existing data imputation functionality. Right now it doesn't support generating random values, however.

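
The replacement step under discussion can be sketched without NumPy as follows. This is an illustrative stand-in, not the example's code: `fill_missing` is a hypothetical helper, it uses the standard library's `random.Random`, and it will not reproduce the exact values drawn by `np.random.RandomState(0)`. Note that `random.randint` includes both endpoints, so `randint(-800, 800)` covers the same range as NumPy's `randint(-800, 801)`.

```python
import random

MISSING = 999  # sentinel for missing values, as in the quoted example


def fill_missing(matrix, low=-800, high=800, seed=0):
    """Return a copy of `matrix` with sentinel entries replaced by random ints."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    return [[rng.randint(low, high) if x == MISSING else x for x in row]
            for row in matrix]


matrix = [[10, 999, -5],
          [999, 999, 42]]
filled = fill_missing(matrix)
```
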
commented on an outdated diff Aug 15, 2013
examples/bicluster/cheng_church_microarray.py

    +data[idx] = generator.randint(-800, 801, len(idx[0]))
    +
    +# cluster with same parameters as original paper
    +model = ChengChurch(n_clusters=100, max_msr=1200,
    +                    deletion_threshold=1.2, inverse_rows=True,
    +                    random_state=0)
    +print("Biclustering...")
    +start_time = time()
    +model.fit(data)
    +print("Done in {:.2f}s.".format(time() - start_time))
    +
    +# find smallest msr
    +msr = lambda a: (np.power(a - a.mean(axis=1, keepdims=True) -
    +                          a.mean(axis=0) + a.mean(), 2).mean())
    +min_msr = min(msr(model.get_submatrix(i, data)) for i in range(100))
    +print ("MSR of best bicluster: {:.2f}".format(min_msr))

scikit-learn member vene added a note Aug 15, 2013:
Remove the space before the paren, I think... does this work in py3k?

commented on an outdated diff Aug 15, 2013
sklearn/datasets/samples_generator.py

    @@ -1493,3 +1493,120 @@ def make_checkerboard(shape, n_clusters, noise=0.0, minval=10,
                           for label in range(n_col_clusters))
     
         return result, rows, cols
    +
    +
    +def make_msr(shape, n_clusters, noise=0.0, constant=False,

scikit-learn member vene added a note Aug 15, 2013:
The name of this function isn't clear enough, one might guess it has to do with regression. Maybe add the word biclusters in the name?

commented on an outdated diff Aug 15, 2013
...ics/cluster/bicluster/tests/test_bicluster_metrics.py

    @@ -34,3 +35,10 @@ def test_consensus_score():
         assert_equal(consensus_score((a, a), (a, b)), 0)
         assert_equal(consensus_score((b, b), (a, b)), 0)
         assert_equal(consensus_score((b, b), (b, a)), 0)
    +
    +    # ensure single biclusters get reshaped correctly
    +    rows = [True, False]
    +    cols = [True, False]
    +    assert_equal(consensus_score((rows, cols),
    +                                 (array2d(rows), array2d(cols))),

scikit-learn member vene added a note Aug 15, 2013:
I think the idiom is rows[:, np.newaxis] unless you're trying to accomplish something else that I didn't get. array2d might have a bit of overhead.

and 1 other commented on an outdated diff Aug 15, 2013
sklearn/cluster/bicluster/cheng_church.py

    +
    +from sklearn.base import BaseEstimator, BiclusterMixin
    +from sklearn.externals import six
    +
    +from sklearn.utils.validation import check_arrays
    +from sklearn.utils.validation import check_random_state
    +
    +from .utils import check_array_ndim
    +from ._squared_residue import compute_msr
    +
    +
    +class EmptyBiclusterException(Exception):
    +    pass
    +
    +
    +class IncrementalMSR(object):

scikit-learn member vene added a note Aug 15, 2013:
Should this class be public-facing?

kemaleren added a note Aug 17, 2013:
No. I assume that private classes also use the '_' prefix?

and 1 other commented on an outdated diff Aug 26, 2013
examples/bicluster/cheng_church_microarray.py

    +data[idx] = generator.randint(-800, 801, len(idx[0]))
    +
    +# cluster with same parameters as original paper
    +model = ChengChurch(n_clusters=100, max_msr=1200,
    +                    deletion_threshold=1.2, inverse_rows=True,
    +                    random_state=0)
    +print("Biclustering...")
    +start_time = time()
    +model.fit(data)
    +print("Done in {:.2f}s.".format(time() - start_time))
    +
    +# find smallest msr
    +msr = lambda a: (np.power(a - a.mean(axis=1, keepdims=True) -
    +                          a.mean(axis=0) + a.mean(), 2).mean())
    +min_msr = min(msr(model.get_submatrix(i, data)) for i in range(100))
    +print("MSR of best bicluster: {:.2f}".format(min_msr))

scikit-learn member vene added a note Aug 26, 2013:
The MSR doesn't mean much, at least to me, without a bit of context. Is it possible to either give some context (maybe by also running some baseline method) or to plot something?

kemaleren added a note Aug 29, 2013:
I'd like to plot the parallel coordinates of the bicluster's rows, as I did in my blog post. I used pandas' plotting functionality for that. Since pandas is not available here, maybe I could get the same effect with a line plot.

commented on an outdated diff Aug 26, 2013
examples/bicluster/cheng_church_microarray.py

    @@ -0,0 +1,56 @@
    +"""
    +==================================================
    +Biclustering microarray data with Cheng and Church
    +==================================================
    +
    +This example is a replication of an experiment from the original Cheng
    +and Church paper. The gene microarray data is downloaded from the

scikit-learn member vene added a note Aug 26, 2013:
I guess it would be nice to also write a bit about what the microarray data represents and what application biclustering solves on such data.

commented on an outdated diff Aug 26, 2013
examples/bicluster/plot_cheng_church.py

    +
    +from matplotlib import pyplot as plt
    +
    +from sklearn.datasets import make_msr_biclusters
    +from sklearn.datasets import samples_generator as sg
    +from sklearn.cluster.bicluster import ChengChurch
    +from sklearn.metrics import consensus_score
    +
    +data, rows, columns = make_msr_biclusters(shape=(100, 100),
    +                                          n_clusters=3, noise=10,
    +                                          shuffle=False,
    +                                          random_state=0)
    +
    +plt.matshow(data, cmap=plt.cm.Blues)
    +plt.title("Original dataset")
    +plt.show()

scikit-learn member vene added a note Aug 26, 2013:
plt blocks on this call, which makes the example wait for you to close the window when running it manually. I just thought it was taking a long time, until I realized. I think both calls should be replaced with a single call to plt.show() at the very end of the script.

and 1 other commented on an outdated diff Aug 26, 2013
sklearn/cluster/bicluster/tests/test_cheng_church.py

    +
    +    inc.remove_row(0)
    +    inc.remove_col(0)
    +
    +    arr = data[new_rows][:, new_cols]
    +    sr = arr - arr.mean(axis=1, keepdims=True) - arr.mean(axis=0) + arr.mean()
    +    sr = np.power(sr, 2)
    +
    +    assert_almost_equal(inc.msr, sr.mean())
    +    assert_array_almost_equal(inc.row_msr, sr.mean(axis=1))
    +    assert_array_almost_equal(inc.col_msr, sr.mean(axis=0))
    +
    +
    +def test_cheng_church():
    +    """Test Cheng and Church algorithm on a simple problem."""
    +    for shape in ((150, 150), (50, 50)):

scikit-learn member vene added a note Aug 26, 2013:
Would the shape change anything here? Does it need to be square?

kemaleren added a note Aug 29, 2013:
It need not be square. The reason I tried two different shapes was because the default cutoff for multiple node deletion is 150. But I could just change the cutoff instead.

and 1 other commented on an outdated diff Aug 26, 2013
sklearn/cluster/bicluster/tests/test_cheng_church.py

    +    assert_array_almost_equal(inc.col_msr, sr.mean(axis=0))
    +
    +
    +def test_cheng_church():
    +    """Test Cheng and Church algorithm on a simple problem."""
    +    for shape in ((150, 150), (50, 50)):
    +        for noise in (0, 1):
    +            for deletion_threshold in (1.5, 2):
    +                data, rows, cols = make_msr_biclusters(shape, 3,
    +                                                       noise=noise,
    +                                                       random_state=0)
    +                model = ChengChurch(n_clusters=3, max_msr=10,
    +                                    deletion_threshold=deletion_threshold,
    +                                    random_state=0)
    +                model.fit(data)
    +                assert(consensus_score((rows, cols), model.biclusters_) > 0.7)

scikit-learn member vene added a note Aug 26, 2013:
You should use assert_greater, but also, I think that this test is a bit weak. Maybe also test that consensus_score is better in absence of noise?

kemaleren added a note Aug 29, 2013:
Yes, it is weak. I like the idea of checking that the consensus score improves. I could also make a custom threshold for each set of parameters, which would catch if something causes one of the higher scores to fall.

commented on the diff Aug 26, 2013
sklearn/cluster/bicluster/tests/test_cheng_church.py

    @@ -0,0 +1,235 @@
    +"""Testing for Spectral Biclustering methods"""

scikit-learn member vene added a note Aug 26, 2013:
I think many of the tests in this file can be made faster by making the input arrays smaller. Of course, in cases where it wouldn't change anything. WDYT?

kemaleren added a note Aug 29, 2013:
It's the spectral biclustering test that takes most of the time. I made the arrays twice as small, but that only sped it up by a few tenths of a second. The slowdown is caused by iterating over a parameter grid. I cut the time in half by using only one parameter for n_svd_vecs.

and 1 other commented on an outdated diff Aug 26, 2013
sklearn/cluster/bicluster/tests/test_cheng_church.py

    +    # check that all the new rows are inverted rows
    +    expected_inv_rows = np.zeros(15, dtype=np.bool)
    +    expected_inv_rows[10:15] = True
    +    new_rows = np.logical_and(model.rows_[0],
    +                              np.logical_not(old_rows))
    +    assert not np.any(np.logical_or(expected_inv_rows, new_rows)[:10])
    +    assert np.any(new_rows[10:])
    +
    +
    +def test_empty_biclusters():
    +    """Cheng and Church should always find at least one bicluster.
    +
    +    The MSR of a bicluster with one row or one column is zero.
    +
    +    """
    +    for i in range(10):

scikit-learn member vene added a note Aug 26, 2013:
Rather than try arbitrary random seeds, how about building an input by hand that would pose problems? The way it's done here, couldn't it be the case that we just get lucky for i in 0...9?

kemaleren added a note Aug 29, 2013:
I forgot about this, so I'm glad you caught it. I think it would make more sense to not report biclusters with only one row or column. Any arbitrary vector has a perfect mean squared residue, so it is a meaningless result.

scikit-learn member
commented Aug 26, 2013

Looks good but it diverged from master, could you rebase please?

Sure. I just rebased, and also addressed your other comments.

commented on the diff Sep 12, 2013
doc/modules/biclustering.rst

    @@ -84,6 +87,77 @@ diagonal and checkerboard bicluster structures.
     
     .. currentmodule:: sklearn.cluster.bicluster
     
    +.. _cheng_church:
    +
    +Cheng and Church
    +================
    +
    +:class:`ChengChurch` tries to find biclusters with a low mean squared
    +residue (MSR). For a matrix :math:`A` with shape :math:`m \times n`,

scikit-learn member GaelVaroquaux added a note Sep 12, 2013:
Am I right to think that in this sense it minimizes the same kind of criterion as KMeans?

kemaleren added a note Sep 13, 2013:
It's a similar criterion in that it is a 'mean squared ______'. However, in this case it takes more than just the overall mean of the cluster into account. I don't think they are very similar.

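
To make the difference concrete, here is a small pure-Python sketch comparing the MSR with a k-means-style cost, where rows are treated as points around a single centroid. Both helpers are illustrative stand-ins (in particular, `kmeans_like_cost` is not scikit-learn's KMeans objective, just the mean squared distance to one centroid).

```python
def msr(a):
    """Mean squared residue: subtracts row means and column means."""
    m, n = len(a), len(a[0])
    row_mean = [sum(row) / n for row in a]
    col_mean = [sum(row[j] for row in a) / m for j in range(n)]
    mean = sum(map(sum, a)) / (m * n)
    return sum((a[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
               for i in range(m) for j in range(n)) / (m * n)


def kmeans_like_cost(a):
    """Mean squared deviation of each row from the centroid (column means)."""
    m, n = len(a), len(a[0])
    centroid = [sum(row[j] for row in a) / m for j in range(n)]
    return sum((a[i][j] - centroid[j]) ** 2
               for i in range(m) for j in range(n)) / (m * n)


# Rows are shifted copies of one another: far apart as points,
# yet a perfect bicluster under the MSR.
shifted = [[1.0, 2.0, 4.0], [11.0, 12.0, 14.0], [-4.0, -3.0, -1.0]]

print(msr(shifted), kmeans_like_cost(shifted))
```

The shifted rows score (numerically) zero under the MSR but have a large distance-to-centroid cost, which is one way to read the remark that the MSR accounts for more than the cluster mean.
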
commented on the diff Sep 12, 2013
examples/bicluster/plot_cheng_church.py

    @@ -0,0 +1,67 @@
    +"""

scikit-learn member GaelVaroquaux added a note Sep 12, 2013:
Very nice example!

and 1 other commented on an outdated diff Sep 12, 2013
examples/bicluster/plot_cheng_church_microarray.py

    +residue threshold is lowered to make the bicluster visually simpler.
    +
    +"""
    +from __future__ import print_function
    +
    +print(__doc__)
    +
    +from time import time
    +import urllib
    +
    +import numpy as np
    +from matplotlib import pyplot as plt
    +
    +from sklearn.cluster.bicluster import ChengChurch
    +
    +# get data

scikit-learn member GaelVaroquaux added a note Sep 12, 2013:
Should we be writing a dataset fetcher for this example, with caching, as for the other datasets? I am fortunate enough to have Wifi right now, but that's not always the case.

kemaleren added a note Sep 13, 2013:
Sure, I can do that. Sounds like a good idea. Let me take a look at how the others do it.

commented on the diff Sep 12, 2013
examples/bicluster/plot_cheng_church_microarray.py

    +gene, and each column represents a tissue sample from a patient with
    +lymphoma. The larger the value of data[i, j], the more active gene
    +i in sample j. Biclustering this data with Cheng and Church
    +finds subsets of samples with similar expression profiles in a subset
    +of genes. The goal of this kind of analysis is often to find sets of
    +genes that may be somehow related. For instance, lymphoma may cause
    +some genes that are otherwise unrelated to become highly expressed or
    +supressed.
    +
    +The gene microarray data is downloaded from the paper's supplementary
    +information webpage, parsed into a NumPy array, and clustered with
    +Cheng and Church. The bicluster is then visualized by a parallel
    +coordinate plot of its rows. Biclustering is performed with almost the
    +same parameters as in the original experiment, except the mean squared
    +residue threshold is lowered to make the bicluster visually simpler.
    +

scikit-learn member GaelVaroquaux added a note Sep 12, 2013:
That's a nice example, but is there any chance that you can find us a more visual plot?

and 1 other commented on an outdated diff Sep 12, 2013
sklearn/cluster/bicluster/cheng_church.py

    +        self._row_idxs = None
    +        self._col_idxs = None
    +
    +        subarr = arr[self.row_idxs[:, np.newaxis], self.col_idxs]
    +        self._sum = subarr.sum()
    +        self._row_sum = subarr.sum(axis=1)
    +        self._col_sum = subarr.sum(axis=0)
    +
    +        self._reset()
    +
    +    def _reset(self):
    +        self._msr = None
    +        self._row_msr = None
    +        self._col_msr = None
    +
    +    @property

scikit-learn member GaelVaroquaux added a note Sep 12, 2013:
From a style point of view, I prefer explicit getters (as in a 'get_row_idx' function) rather than properties. It makes it more explicit to someone reading the code that there is computation going on. The same remark holds for the properties below. IMHO, the real use case of properties is impedance matching to adapt to an interface that expects attributes.

kemaleren added a note Sep 13, 2013:
Good point. I can change this easily. I think I just used properties here because row_idxs is shorter to write than get_row_idxs.

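
The stylistic point can be illustrated with a toy class (hypothetical, unrelated to the PR's `IncrementalMSR` beyond borrowing the lazy-cache-plus-`_reset()` pattern visible in the diff). Both access styles share the same cached computation; the only difference is how the call site reads.

```python
class Stats:
    """Toy class with a lazily cached mean, exposed two ways."""

    def __init__(self, values):
        self._values = list(values)
        self._mean = None          # cache, cleared by _reset()
        self.n_computations = 0    # counts how often real work happens

    def _reset(self):
        self._mean = None

    def _compute_mean(self):
        if self._mean is None:
            self.n_computations += 1
            self._mean = sum(self._values) / len(self._values)
        return self._mean

    @property
    def mean(self):
        # Reads like a plain attribute, but may trigger computation.
        return self._compute_mean()

    def get_mean(self):
        # Call syntax signals to the reader that work may happen.
        return self._compute_mean()

    def remove(self, value):
        self._values.remove(value)
        self._reset()              # invalidate the cache


s = Stats([1.0, 2.0, 6.0])
a = s.mean        # computes and caches
b = s.get_mean()  # cache hit
s.remove(6.0)
c = s.get_mean()  # recomputed after the reset
```
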
scikit-learn member

General cosmetic remark: you used the term MSR ('Mean Square Residue') a lot in your code. I am more used to 'MSE', as in mean square error. I wonder if I am the only one. If not, it might be worth changing.

and 1 other commented on an outdated diff Sep 12, 2013
sklearn/cluster/bicluster/cheng_church.py

    +            old_rows = rows.copy()  # save for row inverse
    +            msr = self._msr(rows, cols, X)
    +            row_msr = self._row_msr(rows, cols, X)
    +            rows = np.logical_or(rows, row_msr < msr)
    +
    +            if self.inverse_rows:
    +                row_msr = self._row_msr(old_rows, cols, X,
    +                                        inverse=True)
    +                to_add = row_msr < msr
    +                new_inverse_rows = np.logical_and(to_add, np.logical_not(rows))
    +                inverse_rows = np.logical_or(inverse_rows,
    +                                             new_inverse_rows)
    +                rows = np.logical_or(rows, to_add)
    +
    +            if (n_rows == np.count_nonzero(rows)) and \
    +               (n_cols == np.count_nonzero(cols)):

scikit-learn member GaelVaroquaux added a note Sep 12, 2013:
I have in mind that 'np.count_nonzero' is somewhat of a recent addition to numpy. You may want to check if it is in all the versions of numpy that we support.

kemaleren added a note Sep 13, 2013:
It is indeed not available until NumPy 1.6. I will rewrite this.

commented on an outdated diff Sep 12, 2013
sklearn/cluster/bicluster/cheng_church.py

    +        """Mask a bicluster in the data with random values."""
    +        shape = np.count_nonzero(rows), np.count_nonzero(cols)
    +        mask_vals = generator.uniform(minval, maxval, shape)
    +        r = rows.nonzero()[0][:, np.newaxis]
    +        c = cols.nonzero()[0]
    +        X[r, c] = mask_vals
    +
    +    def fit(self, X):
    +        """Creates a biclustering for X.
    +
    +        Parameters
    +        ----------
    +        X : array-like, shape (n_samples, n_features)
    +
    +        """
    +        X = X.copy()  # need to modify it in-place

scikit-learn member GaelVaroquaux added a note Sep 12, 2013:
You should rather use 'copy=True', in check_arrays, 2 lines below. Indeed, if the dtype is not float64, you will copy the data twice.

commented on an outdated diff Sep 12, 2013
sklearn/cluster/bicluster/_squared_residue.pyx

    +
    +cimport numpy as np
    +cimport cython
    +
    +np.import_array()
    +
    +ctypedef np.float64_t DOUBLE
    +ctypedef np.int64_t LONG
    +
    +
    +def compute_msr(long[:] rows,
    +                long[:] cols,
    +                double[:] row_mean,
    +                double[:] col_mean,
    +                double arr_mean,
    +                double[:, :] X):

scikit-learn member GaelVaroquaux added a note Sep 12, 2013:
I think that this function needs a docstring, even if it is a very short one. Also, I believe that it should be moved to utils, as it can be of general interest and will not be found where it currently is.


@GaelVaroquaux The mean squared residue criterion in this context is different from the mean squared error. The MSR is well known in biclustering, so I believe it would be more, rather than less, confusing to refer to it as MSE instead.

scikit-learn member
added some commits Jul 25, 2013
• kemaleren: re-committing Cheng and Church code as a feature branch - implemented the algorithm - added some simple tests - wrote data generator (3bf832f)
• kemaleren: do single node deletion in main loop (0974438)
• kemaleren: wrote some tests (0027e0c)
• kemaleren: fixed precision in cython (e29bb6a)
• kemaleren: some more tests - test when deletion unnecessary - check that exception gets raised (f6e7077)
• kemaleren: also do other residue calculations in cython (9ed95f0)
• kemaleren: updated docstring, renamed _sr to _square_residue (f265a1b)
• kemaleren: do not calculate same mean twice (48ef505)
• kemaleren: added ChengChurch to documentation (4f6c916)
• kemaleren: corrected calculation of msr for adding rows and columns (8e247bc)
• kemaleren: cython version precision was corrected, so update test (2cb6fa3)
• kemaleren: renamed square to squared (b310d79)
• kemaleren: added msr note to documentation (1d5cf70)
• kemaleren: compute all msr at once (c9425fe)
• kemaleren: documentation updates (f29535f)
• kemaleren: added comment explaining row shape (52ddc45)
• kemaleren: corrected reference (5e1d164)
• kemaleren: incremental update (2e3efc8)
• kemaleren: added docstring for IncrementalMSR (f919d6d)
• kemaleren: use keepdims in docstring (9c7ee7a)
• kemaleren: renamed compute() to compute_msr() (ec162b9)
• kemaleren: cython speedups - use typed memoryviews - use nogil (6db2ff4)
• kemaleren: ensure metrics work for single biclusters (daec7fd)
• kemaleren: wrote tests for inverse rows and columns (95ee24c)
• kemaleren: wrote docstrings (cdcf444)
• kemaleren: make_msr() does not just generate constant biclusters (a2173b8)
• kemaleren: fixed bug in IncrementalMSR (cd40436)
• kemaleren: updated documentation to improve MSR explanation (75087e3)
• kemaleren: renamed variable to be more descriptive (b57ed5c)
• kemaleren: updated description of msr (8f8eabe)
• kemaleren: updated documentation (1203733)
• kemaleren: no inverse columns. use incremental class everywhere possible (e72d40b)
• kemaleren: updated make_msr() (32baa46)
• kemaleren: removed test (d57ee69)
• kemaleren: updated test (4e18080)
• kemaleren: added cheng and church examples (de956c5)
• kemaleren: removed unused code (9b53bd5)
• kemaleren: address code review suggestions (62ff076)
• kemaleren: added the inverted_rows_ attribute (c49742a)
• kemaleren: updated cheng church examples (d40b921)
• kemaleren: do not return trivial biclusters (b7e3dfc)
• kemaleren: tests: ensure cheng and church does better when noise=0 (a5ad29d)
• kemaleren: sped up spectral biclustering tests (24bfdd6)
• kemaleren: wrote caching data fetcher for microarray data (6735b3b)
• kemaleren: added plot of bicluster in example (dd14582)
• kemaleren: replaced properties with getters (ec41954)
• kemaleren: made count_nonzero work for older versions of numpy (3f8aee6)
• kemaleren: replaced unnecessary copy (6b4e1df)
• kemaleren: moved mean squared residue to extmath (2a82e8d)
• kemaleren: added whats new (f47f47b)
scikit-learn member

I had a closer look at the microarray example, and I tried to do a plot of the resulting clustering to get a gut feeling. I did the following:

row_order = np.argsort(model.rows_[0], kind='mergesort')
column_order = np.argsort(model.columns_[0], kind='mergesort')

plt.matshow(data[row_order].T[column_order].T[:300])


It seems to me that this would sort out the clusters.

The resulting image does not really highlight a structure. Am I being dense, or is it just that there is very little structure in this data?

There are two issues here.

First, you would need to reverse the row and column order to put the bicluster in the top left corner. As written, it would be in the lower right corner.

row_order = row_order[::-1]
column_order = column_order[::-1]

Second, the bicluster's range is (-65, 65), whereas the portion you plotted has range (-799, 800). The variation within the bicluster is just not visible. That is why the example heatmap only shows the bicluster itself.
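
The ordering point can be checked on a toy membership mask. The sketch below uses a pure-Python stable sort as a stand-in for `np.argsort(mask, kind='mergesort')`, since both are stable and sort `False` before `True`. The helper `stable_argsort` and the mask are illustrative and do not come from the example script.

```python
def stable_argsort(values):
    """Indices that sort `values`, keeping ties in their original order."""
    return sorted(range(len(values)), key=lambda i: values[i])


# True marks rows that belong to the bicluster.
in_bicluster = [False, True, False, True, True]

order = stable_argsort(in_bicluster)
# Members come last, so the bicluster would sit at the bottom (or right).
members_last = [in_bicluster[i] for i in order]

reversed_order = order[::-1]
# After reversing, members come first: top (or left) of the heatmap.
members_first = [in_bicluster[i] for i in reversed_order]
```

Applying the same reversal to both the row and column orders is what moves the bicluster from the lower right to the top left corner of the plotted matrix.
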

scikit-learn member

Do you think that you could do this in the example? I think that it would make things more explicit.

Also, have you seen the PR that I sent you a little while ago? kemaleren#1
