[WIP] Implementation of the Cheng and Church algorithm #2172

Open
wants to merge 50 commits into from

3 participants

@kemaleren

This pull request is ready for review.

TODO:

  • indicate which rows were inverted
  • examples
  • incremental update in Cython
  • documentation
  • full test coverage
  • also do square residue for node addition in Cython (turned out not to be worth it)
  • incremental update of MSR (tried it in pure python, but was not much faster.)
  • data sample generator
  • get base implementation working.
@vene
scikit-learn member
vene commented Jul 25, 2013

Could you rebase this on top of master and remove duplicated commits.

@kemaleren

Done rebasing. Time to start writing more tests.

@kemaleren

Okay, I just finished the incremental update of the MSR in Cython. I tested it on arrays of three different sizes. Here are the times and speedups (cython vs cython + incremental update):

  • 500x500: 2.05 seconds vs 1.45 seconds. About 1.4x speedup.
  • 1000x1000: 16.4 seconds vs 10.5 seconds. About 1.6x speedup.
  • 1500x1500: 53.1 seconds vs 34 seconds. About 1.6x speedup.

The full row and column MSR calculations for node addition are still in pure Python. However, since the algorithm spends 97% of its runtime in single node deletion and only 2% in node addition, it does not seem worth the extra complexity of doing that in Cython too.
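
To illustrate why an incremental update can help, here is a minimal NumPy sketch. It is not the PR's Cython code: the class name `RunningSums` and its methods are hypothetical. It keeps running column and total sums for the current bicluster, so deleting a row needs one pass over that row instead of re-summing the whole submatrix.

```python
import numpy as np


def msr(sub):
    """Mean squared residue of a 2-D array, computed from scratch."""
    residue = (sub - sub.mean(axis=1, keepdims=True)
               - sub.mean(axis=0) + sub.mean())
    return (residue ** 2).mean()


class RunningSums(object):
    """Hypothetical helper: sums of a bicluster, updated on row removal."""

    def __init__(self, X):
        self.X = X
        self.rows = list(range(X.shape[0]))
        self.col_sum = X.sum(axis=0)
        self.total = X.sum()

    def remove_row(self, i):
        # One pass over row i instead of re-summing the whole submatrix.
        self.rows.remove(i)
        self.col_sum = self.col_sum - self.X[i]
        self.total -= self.X[i].sum()

    def col_mean(self):
        return self.col_sum / len(self.rows)

    def mean(self):
        return self.total / (len(self.rows) * self.X.shape[1])


rng = np.random.RandomState(0)
X = rng.uniform(size=(6, 4))
sums = RunningSums(X)
sums.remove_row(2)

# Row means of the remaining rows are unaffected by deleting another row.
sub = X[sums.rows]
residue = (sub - sub.mean(axis=1, keepdims=True)
           - sums.col_mean() + sums.mean())
incremental_msr = (residue ** 2).mean()
print(incremental_msr, msr(sub))
```

The two printed values should agree up to floating-point error, since the running sums describe the same submatrix as the from-scratch computation.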

@kemaleren

I have made some further improvements to the cython code. As requested, here are some numbers for naive pure python vs cython with incremental MSR updating:

  • 500x500 matrix: 15.2 seconds vs 1.11 seconds. 13.7x speedup
  • 1000x1000 matrix: 124 seconds vs 7.93 seconds. 15.6x speedup
  • 1500x1500 matrix: 419 seconds vs 24.7 seconds. 17x speedup
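
Reading a speedup as baseline time divided by optimized time, the timings in the two comments above can be checked with a few lines of Python. The function name `speedup` is only for illustration.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Return how many times faster the optimized run is."""
    return baseline_seconds / optimized_seconds


# (baseline, optimized) timings in seconds, copied from the comments above.
cython_vs_incremental = [
    ("500x500", 2.05, 1.45),
    ("1000x1000", 16.4, 10.5),
    ("1500x1500", 53.1, 34.0),
]
python_vs_incremental_cython = [
    ("500x500", 15.2, 1.11),
    ("1000x1000", 124.0, 7.93),
    ("1500x1500", 419.0, 24.7),
]

for label, table in (("cython vs cython + incremental", cython_vs_incremental),
                     ("pure python vs cython + incremental",
                      python_vs_incremental_cython)):
    print(label)
    for shape, before, after in table:
        print("  {}: {:.1f}x".format(shape, speedup(before, after)))
```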
@vene
scikit-learn member
vene commented Aug 1, 2013
@vene vene commented on the diff Aug 15, 2013
doc/modules/biclustering.rst
+.. math::
+ \frac{1}{mn} \sum \left (a_{ij} - a_{iJ} - a_{Ij} + a_{I J} \right)^2
+
+The mean squared residue achieves its minimum of 0 when all the rows
+and all the columns of a bicluster are shifted versions of each other.
+Constant biclusters, biclusters with constant rows or columns, or
+biclusters with identical rows or columns are special cases of this
+condition, all of which have an MSR of 0. As a corollary, additive
+scaling of the entire bicluster does not affect the MSR score of any
+of its biclusters: :math:`A` and :math:`A+c`, where :math:`c` is any
+constant, have the same MSR.
+
+For this reason, Cheng and Church may be useful to find shift-pattern
+biclusters. It may instead find scale-pattern biclusters after
+applying a log transformation to the data, which converts
+multiplicative scaling into additive scaling. In other words, if
@vene
scikit-learn member
vene added a note Aug 15, 2013

This is a great point in my opinion. Do you know of a particular application or dataset (maybe artificial) that could show this off?

@kemaleren
kemaleren added a note Aug 29, 2013

It's a pattern that shows up in gene expression data. I could modify the microarray example to show the difference in the kinds of biclusters found when the data is first log-transformed.
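
As a small artificial illustration of the log-transform point, the sketch below builds a scale-pattern matrix from two random positive vectors and compares its MSR before and after taking the log. The helper `msr` mirrors the lambda used in the example script; the vector sizes and value ranges are arbitrary assumptions.

```python
import numpy as np


def msr(a):
    """Mean squared residue of a 2-D array."""
    residue = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0) + a.mean()
    return (residue ** 2).mean()


rng = np.random.RandomState(0)
u = rng.uniform(1, 10, size=(5, 1))  # strictly positive, so the log is defined
v = rng.uniform(1, 10, size=(4, 1))

scale_pattern = u.dot(v.T)             # rows are multiples of each other
shift_pattern = np.log(scale_pattern)  # log(u_i * v_j) = log(u_i) + log(v_j)

print("MSR of u v^T:         ", msr(scale_pattern))
print("MSR of log(u v^T):    ", msr(shift_pattern))
print("MSR of log(u v^T) + 3:", msr(shift_pattern + 3.0))
```

The first value is clearly positive, while the other two are zero up to floating-point error, which matches the claim that adding a constant does not change the MSR.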

@vene vene commented on the diff Aug 15, 2013
doc/modules/biclustering.rst
+of its biclusters: :math:`A` and :math:`A+c`, where :math:`c` is any
+constant, have the same MSR.
+
+For this reason, Cheng and Church may be useful to find shift-pattern
+biclusters. It may instead find scale-pattern biclusters after
+applying a log transformation to the data, which converts
+multiplicative scaling into additive scaling. In other words, if
+:math:`u` and :math:`v` are any column vectors, the MSR of :math:`u
+v^\top` may be large, but the MSR of :math:`\log(u v^\top)` is 0.
+
+Cheng and Church finds biclusters that are as large as possible, with
+the constraint that a bicluster's MSR must be less than the threshold
+:math:`\delta`. The algorithm proceeds in an iterative greedy fashion.
+It starts with the whole dataset, greedily removes rows and columns
+until :math:`\text{MSR} < \delta`, then greedily adds rows and columns
+while maintaining the bicluster's score. Once a bicluster has been
@vene
scikit-learn member
vene added a note Aug 15, 2013

You mean maintaining it exactly, or within a tolerance?

@kemaleren
kemaleren added a note Aug 17, 2013

Exactly. I should clarify this section, or maybe add something to the code, however. If inverse_rows is True, the MSR of the bicluster may be greater than delta, if the fact that some rows need to be inverted is not taken into account.

@vene vene and 1 other commented on an outdated diff Aug 15, 2013
examples/bicluster/cheng_church_microarray.py
+import numpy as np
+
+from sklearn.cluster.bicluster import ChengChurch
+
+# get data
+url = "http://arep.med.harvard.edu/biclustering/lymphoma.matrix"
+lines = urllib.urlopen(url).read().strip().split('\n')
+# insert a space before all negative signs
+lines = list(' -'.join(line.split('-')).split(' ') for line in lines)
+lines = list(list(int(i) for i in line if i) for line in lines)
+data = np.array(lines)
+
+# replace missing values, just as in the paper
+generator = np.random.RandomState(0)
+idx = np.where(data == 999)
+data[idx] = generator.randint(-800, 801, len(idx[0]))
@vene
scikit-learn member
vene added a note Aug 15, 2013

maybe use the newly merged missing data imputation, if it can do the same thing? ^_^

@kemaleren
kemaleren added a note Aug 17, 2013

It would be nice to use the existing data imputation functionality. Right now it doesn't support generating random values, however.

@vene vene commented on an outdated diff Aug 15, 2013
examples/bicluster/cheng_church_microarray.py
+data[idx] = generator.randint(-800, 801, len(idx[0]))
+
+# cluster with same parameters as original paper
+model = ChengChurch(n_clusters=100, max_msr=1200,
+ deletion_threshold=1.2, inverse_rows=True,
+ random_state=0)
+print("Biclustering...")
+start_time = time()
+model.fit(data)
+print("Done in {:.2f}s.".format(time() - start_time))
+
+# find smallest msr
+msr = lambda a: (np.power(a - a.mean(axis=1, keepdims=True) -
+ a.mean(axis=0) + a.mean(), 2).mean())
+min_msr = min(msr(model.get_submatrix(i, data)) for i in range(100))
+print ("MSR of best bicluster: {:.2f}".format(min_msr))
@vene
scikit-learn member
vene added a note Aug 15, 2013

remove space before paren I think... does this work in py3k?

@vene vene commented on an outdated diff Aug 15, 2013
sklearn/datasets/samples_generator.py
@@ -1493,3 +1493,120 @@ def make_checkerboard(shape, n_clusters, noise=0.0, minval=10,
for label in range(n_col_clusters))
return result, rows, cols
+
+
+def make_msr(shape, n_clusters, noise=0.0, constant=False,
@vene
scikit-learn member
vene added a note Aug 15, 2013

The name of this function isn't clear enough, one might guess it has to do with regression. Maybe add the word biclusters in the name?

@vene vene commented on an outdated diff Aug 15, 2013
...ics/cluster/bicluster/tests/test_bicluster_metrics.py
@@ -34,3 +35,10 @@ def test_consensus_score():
assert_equal(consensus_score((a, a), (a, b)), 0)
assert_equal(consensus_score((b, b), (a, b)), 0)
assert_equal(consensus_score((b, b), (b, a)), 0)
+
+ # ensure single biclusters get reshaped correctly
+ rows = [True, False]
+ cols = [True, False]
+ assert_equal(consensus_score((rows, cols),
+ (array2d(rows), array2d(cols))),
@vene
scikit-learn member
vene added a note Aug 15, 2013

I think the idiom is rows[:, np.newaxis] unless you're trying to accomplish something else that I didn't get. array2d might have a bit of overhead.
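
For reference, a short sketch of the shapes involved. Since `rows` in the test is a plain list, it has to be converted to an array before it can be indexed with `np.newaxis`. Here `np.atleast_2d` is used as a stand-in for `array2d`, on the assumption that the two behave similarly for this input; which orientation the test actually needs depends on how `consensus_score` lays out its indicator arrays.

```python
import numpy as np

rows = [True, False]  # a plain list, as in the test above

as_array = np.asarray(rows)              # lists do not support [:, np.newaxis]
row_vector = as_array[np.newaxis, :]     # shape (1, 2)
column_vector = as_array[:, np.newaxis]  # shape (2, 1)
stand_in = np.atleast_2d(rows)           # assumption: similar to array2d

print(row_vector.shape, column_vector.shape, stand_in.shape)
```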

@vene vene and 1 other commented on an outdated diff Aug 15, 2013
sklearn/cluster/bicluster/cheng_church.py
+
+from sklearn.base import BaseEstimator, BiclusterMixin
+from sklearn.externals import six
+
+from sklearn.utils.validation import check_arrays
+from sklearn.utils.validation import check_random_state
+
+from .utils import check_array_ndim
+from ._squared_residue import compute_msr
+
+
+class EmptyBiclusterException(Exception):
+ pass
+
+
+class IncrementalMSR(object):
@vene
scikit-learn member
vene added a note Aug 15, 2013

Should this class be public-facing?

@kemaleren
kemaleren added a note Aug 17, 2013

No. I assume that private classes also use the '_' prefix?

@vene vene and 1 other commented on an outdated diff Aug 26, 2013
examples/bicluster/cheng_church_microarray.py
+data[idx] = generator.randint(-800, 801, len(idx[0]))
+
+# cluster with same parameters as original paper
+model = ChengChurch(n_clusters=100, max_msr=1200,
+ deletion_threshold=1.2, inverse_rows=True,
+ random_state=0)
+print("Biclustering...")
+start_time = time()
+model.fit(data)
+print("Done in {:.2f}s.".format(time() - start_time))
+
+# find smallest msr
+msr = lambda a: (np.power(a - a.mean(axis=1, keepdims=True) -
+ a.mean(axis=0) + a.mean(), 2).mean())
+min_msr = min(msr(model.get_submatrix(i, data)) for i in range(100))
+print("MSR of best bicluster: {:.2f}".format(min_msr))
@vene
scikit-learn member
vene added a note Aug 26, 2013

The MSR doesn't mean much, at least to me, without a bit of context. Is it possible to either give some context (maybe by also running some baseline method) or to plot something?

@kemaleren
kemaleren added a note Aug 29, 2013

I'd like to plot the parallel coordinates of the bicluster's rows, as I did in my blog post. I used pandas' plotting functionality for that. Since pandas is not available here, maybe I could get the same effect with a line plot.

@vene vene commented on an outdated diff Aug 26, 2013
examples/bicluster/cheng_church_microarray.py
@@ -0,0 +1,56 @@
+"""
+==================================================
+Biclustering microarray data with Cheng and Church
+==================================================
+
+This example is a replication of an experiment from the original Cheng
+and Church paper. The gene microarray data is downloaded from the
@vene
scikit-learn member
vene added a note Aug 26, 2013

I guess it would be nice to also write a bit about what the microarray data represents and what is the application that biclustering solves on such data.

@vene vene commented on an outdated diff Aug 26, 2013
examples/bicluster/plot_cheng_church.py
+
+from matplotlib import pyplot as plt
+
+from sklearn.datasets import make_msr_biclusters
+from sklearn.datasets import samples_generator as sg
+from sklearn.cluster.bicluster import ChengChurch
+from sklearn.metrics import consensus_score
+
+data, rows, columns = make_msr_biclusters(shape=(100, 100),
+ n_clusters=3, noise=10,
+ shuffle=False,
+ random_state=0)
+
+plt.matshow(data, cmap=plt.cm.Blues)
+plt.title("Original dataset")
+plt.show()
@vene
scikit-learn member
vene added a note Aug 26, 2013

plt blocks on this call, which makes the example wait for you to close the window when running it manually. I thought it was just taking a long time, until I realized. I think both calls should be replaced with a single call to plt.show() at the very end of the script.

@vene vene and 1 other commented on an outdated diff Aug 26, 2013
sklearn/cluster/bicluster/tests/test_cheng_church.py
+
+ inc.remove_row(0)
+ inc.remove_col(0)
+
+ arr = data[new_rows][:, new_cols]
+ sr = arr - arr.mean(axis=1, keepdims=True) - arr.mean(axis=0) + arr.mean()
+ sr = np.power(sr, 2)
+
+ assert_almost_equal(inc.msr, sr.mean())
+ assert_array_almost_equal(inc.row_msr, sr.mean(axis=1))
+ assert_array_almost_equal(inc.col_msr, sr.mean(axis=0))
+
+
+def test_cheng_church():
+ """Test Cheng and Church algorithm on a simple problem."""
+ for shape in ((150, 150), (50, 50)):
@vene
scikit-learn member
vene added a note Aug 26, 2013

Would the shape change anything here? Does it need to be square?

@kemaleren
kemaleren added a note Aug 29, 2013

It need not be square. The reason I tried two different shapes was because the default cutoff for multiple node deletion is 150. But I could just change the cutoff instead.

@vene vene and 1 other commented on an outdated diff Aug 26, 2013
sklearn/cluster/bicluster/tests/test_cheng_church.py
+ assert_array_almost_equal(inc.col_msr, sr.mean(axis=0))
+
+
+def test_cheng_church():
+ """Test Cheng and Church algorithm on a simple problem."""
+ for shape in ((150, 150), (50, 50)):
+ for noise in (0, 1):
+ for deletion_threshold in (1.5, 2):
+ data, rows, cols = make_msr_biclusters(shape, 3,
+ noise=noise,
+ random_state=0)
+ model = ChengChurch(n_clusters=3, max_msr=10,
+ deletion_threshold=deletion_threshold,
+ random_state=0)
+ model.fit(data)
+ assert(consensus_score((rows, cols), model.biclusters_) > 0.7)
@vene
scikit-learn member
vene added a note Aug 26, 2013

You should use assert_greater, but also, I think that this test is a bit weak. Maybe also test that consensus_score is better in absence of noise?

@kemaleren
kemaleren added a note Aug 29, 2013

Yes, it is weak. I like the idea of checking that the consensus score improves. I could also make a custom threshold for each set of parameters, which would catch if something causes one of the higher scores to fall.

@vene vene commented on the diff Aug 26, 2013
sklearn/cluster/bicluster/tests/test_cheng_church.py
@@ -0,0 +1,235 @@
+"""Testing for Spectral Biclustering methods"""
@vene
scikit-learn member
vene added a note Aug 26, 2013

I think many of the tests in this file can be made faster by making the input arrays smaller. Of course, in cases where it wouldn't change anything. WDYT?

@kemaleren
kemaleren added a note Aug 29, 2013

It's the spectral biclustering test that takes most of the time. I made the arrays half as large, but that only sped it up by a few tenths of a second. The slowdown is caused by iterating over a parameter grid. I cut the time in half by using only one parameter value for n_svd_vecs.

@vene vene and 1 other commented on an outdated diff Aug 26, 2013
sklearn/cluster/bicluster/tests/test_cheng_church.py
+ # check that all the new rows are inverted rows
+ expected_inv_rows = np.zeros(15, dtype=np.bool)
+ expected_inv_rows[10:15] = True
+ new_rows = np.logical_and(model.rows_[0],
+ np.logical_not(old_rows))
+ assert not np.any(np.logical_or(expected_inv_rows, new_rows)[:10])
+ assert np.any(new_rows[10:])
+
+
+def test_empty_biclusters():
+ """Cheng and Church should always find at least one bicluster.
+
+ The MSR of a bicluster with one row or one column is zero.
+
+ """
+ for i in range(10):
@vene
scikit-learn member
vene added a note Aug 26, 2013

Rather than try arbitrary random seeds, how about building an input by hand that would pose problems? The way it's done here, couldn't it be the case that we just get lucky for i in 0...9?

@kemaleren
kemaleren added a note Aug 29, 2013

I forgot about this, so I'm glad you caught it. I think it would make more sense to not report biclusters with only one row or column. Any arbitrary vector has a perfect mean squared residue, so it is a meaningless result.
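
The claim that any single row or column has a perfect score can be checked directly. In the sketch below, `msr` mirrors the lambda from the example script, and the random inputs are arbitrary.

```python
import numpy as np


def msr(a):
    """Mean squared residue of a 2-D array."""
    residue = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0) + a.mean()
    return (residue ** 2).mean()


rng = np.random.RandomState(0)
single_row = rng.uniform(-100, 100, size=(1, 8))
single_col = rng.uniform(-100, 100, size=(8, 1))
block = rng.uniform(-100, 100, size=(8, 8))

# With one row, the column means equal the row itself and the row mean
# equals the overall mean, so every residue cancels (same for one column).
print(msr(single_row), msr(single_col), msr(block))
```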

@vene
scikit-learn member
vene commented Aug 26, 2013

Looks good but it diverged from master, could you rebase please?

@kemaleren

Sure. I just rebased, and also addressed your other comments.

@GaelVaroquaux GaelVaroquaux commented on the diff Sep 12, 2013
doc/modules/biclustering.rst
@@ -84,6 +87,77 @@ diagonal and checkerboard bicluster structures.
.. currentmodule:: sklearn.cluster.bicluster
+.. _cheng_church:
+
+Cheng and Church
+================
+
+:class:`ChengChurch` tries to find biclusters with a low mean squared
+residue (MSR). For a matrix :math:`A` with shape :math:`m \times n`,
@GaelVaroquaux
scikit-learn member

Am I right to think that in this sense it minimizes the same kind of criterion as KMeans?

@kemaleren
kemaleren added a note Sep 13, 2013

It's a similar criterion in that it is a 'mean squared ______'. However, in this case it takes more than just the overall mean of the cluster into account. I don't think they are very similar.
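
One way to see the difference is to score the same shift-pattern matrix with both criteria. The sketch below treats the rows as samples of a single cluster and uses the mean squared distance to the centroid as a KMeans-style score; that function name and the toy data are assumptions for illustration only.

```python
import numpy as np


def msr(a):
    """Mean squared residue of a 2-D array."""
    residue = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0) + a.mean()
    return (residue ** 2).mean()


def mean_squared_distance_to_centroid(a):
    """KMeans-style score for one cluster whose samples are the rows of a."""
    return ((a - a.mean(axis=0)) ** 2).mean()


base = np.array([1.0, 4.0, 2.0, 8.0])
shifts = np.array([[0.0], [10.0], [20.0]])
shifted_rows = base + shifts  # each row is the base profile plus a constant

print(msr(shifted_rows))
print(mean_squared_distance_to_centroid(shifted_rows))
```

The MSR is zero because the rows are shifted copies of each other, whereas the centroid-based score equals the variance of the shifts (200/3 here).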

@GaelVaroquaux GaelVaroquaux commented on the diff Sep 12, 2013
examples/bicluster/plot_cheng_church.py
@@ -0,0 +1,67 @@
+"""
@GaelVaroquaux
scikit-learn member

Very nice example!

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Sep 12, 2013
examples/bicluster/plot_cheng_church_microarray.py
+residue threshold is lowered to make the bicluster visually simpler.
+
+"""
+from __future__ import print_function
+
+print(__doc__)
+
+from time import time
+import urllib
+
+import numpy as np
+from matplotlib import pyplot as plt
+
+from sklearn.cluster.bicluster import ChengChurch
+
+# get data
@GaelVaroquaux
scikit-learn member

Should we write a dataset fetcher with caching for this example, as for the other datasets? I am fortunate enough to have Wifi right now, but that's not always the case.

@kemaleren
kemaleren added a note Sep 13, 2013

Sure, I can do that. Sounds like a good idea. Let me take a look at how the others do it.

@GaelVaroquaux GaelVaroquaux commented on the diff Sep 12, 2013
examples/bicluster/plot_cheng_church_microarray.py
+gene, and each column represents a tissue sample from a patient with
+lymphoma. The larger the value of ``data[i, j]``, the more active gene
+``i`` in sample ``j``. Biclustering this data with Cheng and Church
+finds subsets of samples with similar expression profiles in a subset
+of genes. The goal of this kind of analysis is often to find sets of
+genes that may be somehow related. For instance, lymphoma may cause
+some genes that are otherwise unrelated to become highly expressed or
+supressed.
+
+The gene microarray data is downloaded from the paper's supplementary
+information webpage, parsed into a NumPy array, and clustered with
+Cheng and Church. The bicluster is then visualized by a parallel
+coordinate plot of its rows. Biclustering is performed with almost the
+same parameters as in the original experiment, except the mean squared
+residue threshold is lowered to make the bicluster visually simpler.
+
@GaelVaroquaux
scikit-learn member

That's a nice example, but is there any chance that you can find us a more visual plot?

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Sep 12, 2013
sklearn/cluster/bicluster/cheng_church.py
+ self._row_idxs = None
+ self._col_idxs = None
+
+ subarr = arr[self.row_idxs[:, np.newaxis], self.col_idxs]
+ self._sum = subarr.sum()
+ self._row_sum = subarr.sum(axis=1)
+ self._col_sum = subarr.sum(axis=0)
+
+ self._reset()
+
+ def _reset(self):
+ self._msr = None
+ self._row_msr = None
+ self._col_msr = None
+
+ @property
@GaelVaroquaux
scikit-learn member

From a style point of view, I prefer explicit getters (as in a 'get_row_idx' function) rather than properties. It makes it more explicit, when reading the code, that there is computation going on.

The same remark holds for the properties below.

IMHO, the real use case of properties is impedance matching to adapt to an interface that expects attributes.

@kemaleren
kemaleren added a note Sep 13, 2013

Good point. I can change this easily. I think I just used properties here because row_idxs is shorter to write than get_row_idxs.
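
A minimal sketch of the two styles side by side, using a hypothetical `CachedMean` class rather than the PR's `IncrementalMSR`. Both access paths return the same lazily cached value; the difference is only in how the call site reads.

```python
class CachedMean(object):
    """Hypothetical example: a lazily computed, cached statistic."""

    def __init__(self, values):
        self._values = list(values)
        self._mean = None  # cache, filled on first request

    @property
    def mean(self):
        # Reads like an attribute, which hides the computation.
        return self.get_mean()

    def get_mean(self):
        # Reads like a call, which signals that work may happen.
        if self._mean is None:
            self._mean = sum(self._values) / float(len(self._values))
        return self._mean


stats = CachedMean([1, 2, 3, 6])
print(stats.mean)        # attribute style
print(stats.get_mean())  # explicit getter style
```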

@GaelVaroquaux
scikit-learn member

General cosmetic remark: you used the term MSR 'Mean Square Residue' a lot in your code. I am more used to 'MSE', as in mean square error. I wonder if I am the only one. If not it might be worth changing.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Sep 12, 2013
sklearn/cluster/bicluster/cheng_church.py
+ old_rows = rows.copy() # save for row inverse
+ msr = self._msr(rows, cols, X)
+ row_msr = self._row_msr(rows, cols, X)
+ rows = np.logical_or(rows, row_msr < msr)
+
+ if self.inverse_rows:
+ row_msr = self._row_msr(old_rows, cols, X,
+ inverse=True)
+ to_add = row_msr < msr
+ new_inverse_rows = np.logical_and(to_add, np.logical_not(rows))
+ inverse_rows = np.logical_or(inverse_rows,
+ new_inverse_rows)
+ rows = np.logical_or(rows, to_add)
+
+ if (n_rows == np.count_nonzero(rows)) and \
+ (n_cols == np.count_nonzero(cols)):
@GaelVaroquaux
scikit-learn member

I have in mind that 'np.count_nonzero' is somewhat of a recent addition to numpy. You may want to check if it is in all the versions of numpy that we support.

@kemaleren
kemaleren added a note Sep 13, 2013

It is indeed not available until NumPy 1.6. I will rewrite this.
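
One possible fallback, assuming the masks are boolean arrays, is to sum the mask. The helper name `count_true` is hypothetical and is not necessarily what the PR ended up using.

```python
import numpy as np


def count_true(mask):
    """Count True entries in a boolean mask without np.count_nonzero."""
    return int(np.asarray(mask, dtype=bool).sum())


rows = np.array([True, False, True, True])
print(count_true(rows))
```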

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 12, 2013
sklearn/cluster/bicluster/cheng_church.py
+ """Mask a bicluster in the data with random values."""
+ shape = np.count_nonzero(rows), np.count_nonzero(cols)
+ mask_vals = generator.uniform(minval, maxval, shape)
+ r = rows.nonzero()[0][:, np.newaxis]
+ c = cols.nonzero()[0]
+ X[r, c] = mask_vals
+
+ def fit(self, X):
+ """Creates a biclustering for X.
+
+ Parameters
+ ----------
+ X : array-like, shape (n_samples, n_features)
+
+ """
+ X = X.copy() # need to modify it in-place
@GaelVaroquaux
scikit-learn member

You should instead pass 'copy=True' to check_arrays, 2 lines below. Otherwise, if the dtype is not float64, the data will be copied twice.
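
The same idea can be sketched with plain NumPy, leaving `check_arrays` aside: converting and copying in one step avoids a separate `X.copy()`. The helper name `as_float_copy` is hypothetical.

```python
import numpy as np


def as_float_copy(X):
    """Return a float64 copy of X in a single step."""
    # np.array copies by default, so no separate X.copy() is needed,
    # whether or not the dtype has to change.
    return np.array(X, dtype=np.float64)


original = np.arange(6).reshape(2, 3)  # integer dtype
work = as_float_copy(original)
work[0, 0] = -1.0                      # modify the working copy in place

already_float = np.zeros((2, 2))
work_float = as_float_copy(already_float)

print(work.dtype, original[0, 0])
```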

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 12, 2013
sklearn/cluster/bicluster/_squared_residue.pyx
+
+cimport numpy as np
+cimport cython
+
+np.import_array()
+
+ctypedef np.float64_t DOUBLE
+ctypedef np.int64_t LONG
+
+
+def compute_msr(long[:] rows,
+ long[:] cols,
+ double[:] row_mean,
+ double[:] col_mean,
+ double arr_mean,
+ double[:, :] X):
@GaelVaroquaux
scikit-learn member

I think that this function needs a docstring, even if it is a very short one.

Also, I believe that it should be moved to utils, as it can be of general interest and will not be found where it currently is.

@kemaleren

@GaelVaroquaux The mean squared residue criterion in this context is different from the mean squared error. The MSR is well known in biclustering, so I believe it would be more, rather than less, confusing to refer to it as MSE instead.

@GaelVaroquaux
scikit-learn member
kemaleren added some commits Jul 25, 2013
@kemaleren kemaleren re-committing Cheng and Church code as a feature branch
- implemented the algorithm
- added some simple tests
- wrote data generator
3bf832f
@kemaleren kemaleren do single node deletion in main loop 0974438
@kemaleren kemaleren wrote some tests 0027e0c
@kemaleren kemaleren fixed precision in cython e29bb6a
@kemaleren kemaleren some more tests
- test when deletion unnecessary
- check that exception gets raised
f6e7077
@kemaleren kemaleren also do other residue calculations in cython 9ed95f0
@kemaleren kemaleren updated docstring, renamed _sr to _square_residue f265a1b
@kemaleren kemaleren do not calculate same mean twice 48ef505
@kemaleren kemaleren added ChengChurch to documentation 4f6c916
@kemaleren kemaleren corrected calculation of msr for adding rows and columns 8e247bc
@kemaleren kemaleren cython version precision was corrected, so update test 2cb6fa3
@kemaleren kemaleren renamed square to squared b310d79
@kemaleren kemaleren added msr note to documentation 1d5cf70
@kemaleren kemaleren compute all msr at once c9425fe
@kemaleren kemaleren documentation updates f29535f
@kemaleren kemaleren added comment explaining row shape 52ddc45
@kemaleren kemaleren corrected reference 5e1d164
@kemaleren kemaleren incremental update 2e3efc8
@kemaleren kemaleren added docstring for IncrementalMSR f919d6d
@kemaleren kemaleren use keepdims in docstring 9c7ee7a
@kemaleren kemaleren renamed compute() to compute_msr() ec162b9
@kemaleren kemaleren cython speedups
- use typed memoryviews
- use nogil
6db2ff4
@kemaleren kemaleren ensure metrics work for single biclusters daec7fd
@kemaleren kemaleren wrote tests for inverse rows and columns 95ee24c
@kemaleren kemaleren wrote docstrings cdcf444
@kemaleren kemaleren make_msr() does not just generate constant biclusters a2173b8
@kemaleren kemaleren fixed bug in IncrementalMSR cd40436
@kemaleren kemaleren updated documentation to improve MSR explanation 75087e3
@kemaleren kemaleren renamed variable to be more descriptive b57ed5c
@kemaleren kemaleren updated description of msr 8f8eabe
@kemaleren kemaleren updated documentation 1203733
@kemaleren kemaleren no inverse columns. use incremental class everywhere possible e72d40b
@kemaleren kemaleren updated make_msr() 32baa46
@kemaleren kemaleren removed test d57ee69
@kemaleren kemaleren updated test 4e18080
@kemaleren kemaleren added cheng and church examples de956c5
@kemaleren kemaleren removed unused code 9b53bd5
@kemaleren kemaleren address code review suggestions 62ff076
@kemaleren kemaleren added the inverted_rows_ attribute c49742a
@kemaleren kemaleren updated cheng church examples d40b921
@kemaleren kemaleren do not return trivial biclusters b7e3dfc
@kemaleren kemaleren tests: ensure cheng and church does better when noise=0 a5ad29d
@kemaleren kemaleren sped up spectral biclustering tests 24bfdd6
@kemaleren kemaleren wrote caching data fetcher for microarray data 6735b3b
@kemaleren kemaleren added plot of bicluster in example dd14582
@kemaleren kemaleren replaced properties with getters ec41954
@kemaleren kemaleren made count_nonzero work for older versions of numpy 3f8aee6
@kemaleren kemaleren replaced unnecessary copy 6b4e1df
@kemaleren kemaleren moved mean squared residue to extmath 2a82e8d
@kemaleren kemaleren added whats new f47f47b
@GaelVaroquaux
scikit-learn member

I had a closer look at the microarray example, and I tried to do a plot of the resulting clustering to get a gut feeling. I did the following:

row_order = np.argsort(model.rows_[0], kind='mergesort')
column_order = np.argsort(model.columns_[0], kind='mergesort')

plt.matshow(data[row_order].T[column_order].T[:300])

It seems to me that this would sort out the clusters.

The resulting image does not really highlight a structure. Am I being dense, or is it just that there is very little structure in this data?

@kemaleren

There are two issues here.

First, you would need to reverse the row and column order to put the bicluster in the top left corner. As written, it would be in the lower right corner.

row_order = row_order[::-1]
column_order = column_order[::-1]

Second, the bicluster's range is (-65, 65), whereas the portion you plotted has range (-799, 800). The variation within the bicluster is just not visible. That is why the example heatmap only shows the bicluster itself.
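
The ordering issue can be reproduced on a toy boolean mask. The mask below is made up; the last line shows an alternative (sorting the negated mask) that is not part of the discussion above but also moves members to the front while keeping their original order.

```python
import numpy as np

in_bicluster = np.array([False, True, False, True, True])

# A stable sort of booleans puts False (0) first and True (1) last.
order = np.argsort(in_bicluster, kind='mergesort')
# Reversing moves the members to the front, in reversed index order.
members_first = order[::-1]
# Alternative: sort the negated mask to keep members in original order.
members_first_stable = np.argsort(~in_bicluster, kind='mergesort')

print(order)
print(members_first)
print(members_first_stable)
```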

@GaelVaroquaux
scikit-learn member

Do you think that you could do this in the example? I think that it would make things more explicit.

Also, have you seen the PR that I sent you a little while ago? kemaleren#1
