# [MRG] Bicluster metrics #2452

Open · wants to merge 11 commits
## Conversation

6 participants
### kemaleren commented Sep 17, 2013

Contributor

Improvements and additions to the bicluster metrics module.

TODO:

- refactor Hungarian matching code (let's leave this for later -- Vlad)
- tests: full code coverage
- implement gene match score
- update documentation to explain new functionality
- implement size bias correction
- implement other similarity metrics (Dice and goodness measure)

### kemaleren added some commits Sep 16, 2013

- 12107a1  wrote dice and goodness measures
- be43818  added correction
- d9c7d7d  test corrections
- f8c0f16  Fix a couple issues with metrics: ensure that consensus score is correct if the number of biclusters differs; ensure that empty biclusters have similarity of 0 with everything.
- 27beedb  use thomas's approach; easier to read
- ba10fd0  implemented match score
- 20ddf3e  updated metrics documentation
- 0197644  test for values of similarity argument
- 8b3efbc  updated what's new

### vene commented on the diff Sep 23, 2013

doc/modules/biclustering.rst
 .. math::

-    J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
+    pre = \frac{|a \cap b|}{|b|}

#### vene Sep 23, 2013

Owner

I'm not sure about calling them "pre" and "rec"; should we be consistent here with sklearn.metrics?

#### kemaleren Oct 2, 2013

Contributor

I thought it made the equations look nicer when they are abbreviated. But consistency is probably more important. I'll change them to the full names.

### vene commented on the diff Sep 23, 2013

doc/modules/biclustering.rst
+    rec = \frac{|a \cap b|}{|a|}
+
+The similarity measures implemented in scikit-learn combine
+the precision and recall in various ways.
+
+The Jaccard measure:
+
+.. math::
+    s(a, b) = \frac{|a \cap b|}{|a| + |b| - |a \cap b|} = \frac{pre \cdot rec}{pre + rec - pre \cdot rec}
+
+The Dice measure, which corresponds to the standard F-measure:
+
+.. math::
+    s(a, b) = \frac{2 |a \cap b|}{|a| + |b|} = \frac{2 \cdot pre \cdot rec}{pre + rec}
+
+The Goodness measure, which is the mean of precision and recall:

#### vene Sep 23, 2013

Owner

Just curious, why is this appropriate here, and what would happen if one would take the harmonic mean instead?

#### kemaleren Oct 2, 2013

Contributor

No idea. I've never seen it, and I don't think it's correct to take the arithmetic mean anyway. But the paper I referenced claims it has been used before, so I included it for completeness. Perhaps we should just remove it.
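For intuition about this thread, the three measures can be compared directly as functions of precision and recall. This is a standalone sketch, not the module's code: the Dice measure is the harmonic mean of the two (the F-measure), while the goodness measure is the arithmetic mean, which is far more forgiving when one of the two is near zero.

```python
# Standalone sketch: the three similarity measures discussed above,
# each written in terms of precision and recall.

def jaccard(pre, rec):
    # Jaccard measure: pre*rec / (pre + rec - pre*rec)
    return (pre * rec) / (pre + rec - pre * rec)

def dice(pre, rec):
    # Dice measure: harmonic mean of precision and recall (F-measure)
    return 2 * pre * rec / (pre + rec)

def goodness(pre, rec):
    # Goodness measure: arithmetic mean of precision and recall
    return (pre + rec) / 2

# A bicluster match with high precision but terrible recall:
pre, rec = 0.9, 0.1
print(jaccard(pre, rec))   # ~0.0989
print(dice(pre, rec))      # 0.18
print(goodness(pre, rec))  # 0.5 -- the arithmetic mean hides the poor recall
```

This illustrates vene's question: with the harmonic mean (Dice), a near-zero recall drags the score toward zero, while the arithmetic mean (goodness) still reports 0.5.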


### vene commented on an outdated diff Sep 25, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
 @@ -6,28 +6,91 @@ from sklearn.utils.validation import check_arrays
+def _replace_nan(f):

Owner

### vene commented on an outdated diff Sep 25, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
 @@ -59,10 +150,16 @@ def consensus_score(a, b, similarity="jaccard"):
         Another set of biclusters like a.

     similarity : string or function, optional, default: "jaccard"
-        May be the string "jaccard" to use the Jaccard coefficient, or
+        May be the strings "jaccard", "dice", or "goodness", or

Owner

*one of

### vene and 1 other commented on an outdated diff Sep 25, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
+    Parameters
+    ----------
+    expected : (rows, columns)
+        Tuple of row and column indicators for a set of biclusters.
+
+    found : (rows, columns)
+        Another set of biclusters like a.
+
+    similarity : string or function, optional, default: "jaccard"
+        May be the strings "jaccard", "dice", or "goodness", or
+        any function that takes four arguments, each of which is a 1d
+        indicator vector: (a_rows, a_columns, b_rows, b_columns).
+
+    correction : int or None, optional, default: None
+        If provided, this should be data.size. Used to correct for
+        bicluster size bias, as described in Hanczar, et. al (2013).

#### vene Sep 25, 2013

Owner

Technically the dot should be after al, not after et, as the unabbreviated Latin phrase is "et alii" :P

#### kemaleren Sep 26, 2013

Contributor

Don't know why I keep making this mistake. I even studied Latin!

### vene commented on an outdated diff Sep 25, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
+        Another set of biclusters like a.
+
+    similarity : string or function, optional, default: "jaccard"
+        May be the strings "jaccard", "dice", or "goodness", or
+        any function that takes four arguments, each of which is a 1d
+        indicator vector: (a_rows, a_columns, b_rows, b_columns).
+
+    correction : int or None, optional, default: None
+        If provided, this should be data.size. Used to correct for
+        bicluster size bias, as described in Hanczar, et. al (2013).
+        If this is used, bicluster similarities may be less than 0, to
+        indicate that they are worse than random chance.
+
+    Returns
+    -------
+    (recovery, relevance): tuple

#### vene Sep 25, 2013

Owner

Returning a tuple is the standard Python way of returning multiple values. Therefore we conventionally just list each return value, e.g.:

    recovery : float
        ... description

    relevance : float
        ... description

### kemaleren added some commits Sep 26, 2013

- 5b4bde6  handle division by zero like in sklearn.metrics
- a229c47  corrected docstrings
### arjoly commented Sep 26, 2013

Owner

Why not make the bicluster metrics a sub-module of metrics? Is there a need for metrics.cluster.bicluster.bicluster_metrics to get its own folder?

### GaelVaroquaux commented on the diff Sep 26, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
+    denom = dsize - bsize
+    return _divide(num, denom, 'corrected recall')
+
+
+def _jaccard(precision, recall):
+    """Jaccard coefficient"""
+    num = precision * recall
+    denom = precision + recall - precision * recall
+    return _divide(num, denom, 'jaccard coefficient')
+
+
+def _dice(precision, recall):
+    """Dice measure. Same as the traditional balanced F-score."""
+    num = 2 * precision * recall
+    denom = precision + recall
+    return _divide(num, denom, 'dice measure')

#### GaelVaroquaux Sep 26, 2013

Owner

I believe that jaccard and dice are implemented in scipy: scipy.spatial.distance.pdist. Is there a reason to reimplement them?

One thing to be careful about is that the scipy implementations are the 'distance' versions, i.e. 1 - the index.

#### kemaleren Oct 2, 2013

Contributor

The SciPy versions compute the measures directly, so we would still need to implement the corrected versions ourselves. By computing these measures from precision and recall arguments, we get either the uncorrected or the corrected measure simply by passing in the uncorrected or corrected precision and recall.
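The design kemaleren describes can be sketched in a few lines. This is an illustrative sketch, not the PR's code: `_divide` below mimics the zero-division guard in the quoted diff (returning 0.0 on a vanishing denominator, without the warning machinery), and the "corrected" precision/recall values are made-up numbers standing in for the Hanczar et al. (2013) correction.

```python
# Sketch: measures written over (precision, recall) work unchanged
# whether those inputs are raw or size-bias corrected.

def _divide(num, denom):
    # Minimal stand-in for the _divide helper in the quoted diff:
    # guard against division by zero by returning 0.0.
    return num / denom if denom else 0.0

def jaccard(pre, rec):
    return _divide(pre * rec, pre + rec - pre * rec)

uncorrected = jaccard(0.8, 0.6)
# Hypothetical corrected precision/recall (real values would come from
# the Hanczar et al. correction, not from this arbitrary subtraction):
corrected = jaccard(0.7, 0.4)

print(uncorrected)        # ~0.522
print(corrected)          # ~0.341, lower once chance overlap is removed
print(jaccard(0.0, 0.0))  # degenerate case handled: 0.0
```

The same `jaccard` function serves both variants, which is the point of factoring the measures through precision and recall rather than computing them directly as SciPy does.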

### kemaleren commented Oct 2, 2013

Contributor

 @arjoly I did it to mirror the directory layout for the biclustering algorithms. But you are right that it is not necessary here, since everything is contained in one file.
### mblondel commented Oct 8, 2013

Owner

> I did it to mirror the directory layout for the biclustering algorithms. But you are right that it is not necessary here, since everything is contained in one file.

Actually, I think the directory layout for the biclustering module is way too nested. But I guess I'm late to the party.
### GaelVaroquaux commented Oct 10, 2013

Owner

> Actually, I think the directory layout for the biclustering module is way too nested. But I guess I'm late to the party.

I think that I agree. Too bad that we realized it so late...