[MRG] Bicluster metrics #2452

Open
wants to merge 11 commits into
from

6 participants
Contributor

kemaleren commented Sep 17, 2013

Improvements and additions to the bicluster metrics module.

TODO:

  • refactor Hungarian matching code (Let's leave this for later -- Vlad)
  • tests: full code coverage
  • implement gene match score
  • update documentation to explain new functionality
  • implement size bias correction
  • implement other similarity metrics (Dice and goodness measure)

@vene vene commented on the diff Sep 23, 2013

doc/modules/biclustering.rst
.. math::
- J(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
+ pre = \frac{|a \cap b|}{|b|}
@vene

vene Sep 23, 2013

Owner

I'm not sure about calling them "pre" and "rec". Should we be consistent with sklearn.metrics here?

@kemaleren

kemaleren Oct 2, 2013

Contributor

I thought it made the equations look nicer when they are abbreviated. But consistency is probably more important. I'll change them to the full names.

@vene vene commented on the diff Sep 23, 2013

doc/modules/biclustering.rst
+ rec = \frac{|a \cap b|}{|a|}
+
+The similarity measures implemented in scikit-learn combine
+the precision and recall in various ways.
+
+The Jaccard measure :
+
+.. math::
+ s(a, b) = \frac{|a \cap b|}{|a| + |b| - |a \cap b|} = \frac{pre \cdot rec}{pre + rec - pre \cdot rec}
+
+The Dice measure, which corresponds to the standard F-measure:
+
+.. math::
+ s(a, b) = \frac{2 |a \cap b|}{|a| + |b|} = \frac{2 \cdot pre \cdot rec}{pre + rec}
+
+The Goodness measure, which is the mean of precision and recall:
@vene

vene Sep 23, 2013

Owner

Just curious, why is this appropriate here, and what would happen if one would take the harmonic mean instead?

@kemaleren

kemaleren Oct 2, 2013

Contributor

No idea. I've never seen it, and I don't think it's correct to take the mean anyways. But the paper I referenced claims it's been used before, so I included it for completeness. Perhaps we should just remove it.
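
On the harmonic-mean question: the harmonic mean of precision and recall is algebraically the same as the Dice formula quoted in the diff, so the goodness measure differs from Dice only in using the arithmetic mean, which is never smaller. A minimal sketch, with illustrative function names and made-up values that are not taken from the PR:

```python
def goodness(precision, recall):
    """Arithmetic mean of precision and recall."""
    return (precision + recall) / 2.0


def dice(precision, recall):
    """2 * p * r / (p + r), the Dice formula from the diff."""
    return 2.0 * precision * recall / (precision + recall)


def harmonic_mean(x, y):
    """Harmonic mean of two positive numbers."""
    return 2.0 / (1.0 / x + 1.0 / y)


# Illustrative values only: a match that is precise but has low recall.
precision, recall = 0.9, 0.1

print(goodness(precision, recall))       # arithmetic mean
print(dice(precision, recall))           # Dice
print(harmonic_mean(precision, recall))  # agrees with Dice up to rounding
```

With these values the arithmetic mean is about 0.5 while Dice is about 0.18, so the goodness measure is much more forgiving when precision and recall are unbalanced.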

@vene vene commented on an outdated diff Sep 25, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
@@ -6,28 +6,91 @@
from sklearn.utils.validation import check_arrays
+def _replace_nan(f):

@vene vene commented on an outdated diff Sep 25, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
@@ -59,10 +150,16 @@ def consensus_score(a, b, similarity="jaccard"):
Another set of biclusters like ``a``.
similarity : string or function, optional, default: "jaccard"
- May be the string "jaccard" to use the Jaccard coefficient, or
+ May be the strings "jaccard", "dice", or "goodness", or
@vene

vene Sep 25, 2013

Owner

*one of

@vene vene and 1 other commented on an outdated diff Sep 25, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
+ Parameters
+ ----------
+ expected : (rows, columns)
+ Tuple of row and column indicators for a set of biclusters.
+
+ found : (rows, columns)
+ Another set of biclusters like ``a``.
+
+ similarity : string or function, optional, default: "jaccard"
+ May be the strings "jaccard", "dice", or "goodness", or
+ any function that takes four arguments, each of which is a 1d
+ indicator vector: (a_rows, a_columns, b_rows, b_columns).
+
+ correction : int or None, optional, default: None
+ If provided, this should be data.size. Used to correct for
+ bicluster size bias, as described in Hanczar, et. al (2013).
@vene

vene Sep 25, 2013

Owner

Technically the dot should be after al, not after et, as the unabbreviated Latin phrase is "et alii" :P

@kemaleren

kemaleren Sep 26, 2013

Contributor

Don't know why I keep making this mistake. I even studied Latin!
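
The docstring in this hunk says `similarity` may be any function taking four 1d indicator vectors `(a_rows, a_columns, b_rows, b_columns)`. A hedged sketch of what such a callable could look like, in plain Python: the name `bicluster_jaccard`, the helper `count_set`, and the choice to return 0.0 for two empty biclusters are assumptions made for illustration, not the PR's code.

```python
def count_set(indicator):
    """Number of truthy entries in an indicator vector."""
    return sum(1 for x in indicator if x)


def count_shared(u, v):
    """Number of positions that are set in both indicator vectors."""
    return sum(1 for x, y in zip(u, v) if x and y)


def bicluster_jaccard(a_rows, a_cols, b_rows, b_cols):
    """Jaccard similarity of two biclusters, treated as sets of cells.

    A bicluster covers every (row, column) pair it selects, so its size
    is (number of rows) * (number of columns), and the overlap of two
    biclusters is (shared rows) * (shared columns).
    """
    intersection = count_shared(a_rows, b_rows) * count_shared(a_cols, b_cols)
    a_size = count_set(a_rows) * count_set(a_cols)
    b_size = count_set(b_rows) * count_set(b_cols)
    union = a_size + b_size - intersection
    # Assumption: two empty biclusters are given similarity 0.0.
    return float(intersection) / union if union else 0.0


# Made-up example on a 4 x 3 data matrix.
a_rows, a_cols = [1, 1, 0, 0], [1, 1, 0]
b_rows, b_cols = [0, 1, 1, 0], [1, 1, 1]
score = bicluster_jaccard(a_rows, a_cols, b_rows, b_cols)
print(score)
```

Here the biclusters share one row and two columns, so the overlap is 2 cells out of a union of 4 + 6 - 2 = 8.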

@vene vene commented on an outdated diff Sep 25, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
+ Another set of biclusters like ``a``.
+
+ similarity : string or function, optional, default: "jaccard"
+ May be the strings "jaccard", "dice", or "goodness", or
+ any function that takes four arguments, each of which is a 1d
+ indicator vector: (a_rows, a_columns, b_rows, b_columns).
+
+ correction : int or None, optional, default: None
+ If provided, this should be data.size. Used to correct for
+ bicluster size bias, as described in Hanczar, et. al (2013).
+ If this is used, bicluster similarities may be less than 0, to
+ indicate that they are worse than random chance.
+
+ Returns
+ -------
+ (recovery, relevance): tuple
@vene

vene Sep 25, 2013

Owner

Returning a tuple is the standard Python way of returning multiple values. Therefore we conventionally just list each return value, e.g.:

recovery: float,
    ... description

relevance: float,
    ... description
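
The convention described above, sketched on a hypothetical toy function that is unrelated to the PR's code: each returned value gets its own entry under Returns, and the caller simply unpacks the tuple.

```python
def mean_and_range(values):
    """Summarize a non-empty sequence of numbers.

    Parameters
    ----------
    values : sequence of float
        Input values.

    Returns
    -------
    mean : float
        Arithmetic mean of ``values``.

    value_range : float
        Difference between the largest and the smallest value.
    """
    values = list(values)
    mean = sum(values) / float(len(values))
    value_range = max(values) - min(values)
    return mean, value_range


# The two documented return values are unpacked by the caller.
mean, value_range = mean_and_range([1.0, 2.0, 6.0])
print(mean, value_range)
```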
Owner

arjoly commented Sep 26, 2013

Why not make the bicluster metrics a sub-module of metrics?
Is there a need for metrics.cluster.bicluster.bicluster_metrics to get its own folder?

@GaelVaroquaux GaelVaroquaux commented on the diff Sep 26, 2013

sklearn/metrics/cluster/bicluster/bicluster_metrics.py
+    denom = dsize - bsize
+    return _divide(num, denom, 'corrected recall')
+
+
+def _jaccard(precision, recall):
+    """Jaccard coefficient"""
+    num = precision * recall
+    denom = precision + recall - precision * recall
+    return _divide(num, denom, 'jaccard coefficient')
+
+
+def _dice(precision, recall):
+    """Dice measure. Same as the traditional balanced F-score."""
+    num = 2 * precision * recall
+    denom = precision + recall
+    return _divide(num, denom, 'dice measure')
@GaelVaroquaux

GaelVaroquaux Sep 26, 2013

Owner

I believe that jaccard and dice are implemented in scipy: scipy.spatial.distance.pdist. Is there a reason to reimplement them?

One thing to be careful about is that the SciPy implementations are the 'distance' versions, i.e. 1 - the index.

@kemaleren

kemaleren Oct 2, 2013

Contributor

The SciPy version computes them directly, so we would still need to implement the corrected versions ourselves. By computing these measures from the precision and recall arguments, it is easy to compute either the uncorrected or corrected measures, by using the uncorrected or corrected precision and recall.
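
To make the argument concrete, here is a small sketch (plain Python, illustrative names, made-up data) checking that the precision/recall forms from the diff agree with the direct set formulas, and showing the distance form as 1 minus the index. Any corrected precision and recall, which are not reproduced here, could be passed to the same two functions.

```python
def precision_recall(a, b):
    """Precision |a & b| / |b| and recall |a & b| / |a|, as in the docs diff."""
    overlap = float(len(a & b))
    return overlap / len(b), overlap / len(a)


def jaccard_from_pr(precision, recall):
    """Jaccard index written in terms of precision and recall."""
    return precision * recall / (precision + recall - precision * recall)


def dice_from_pr(precision, recall):
    """Dice index written in terms of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Made-up cell identifiers for an expected (a) and a found (b) bicluster.
a = {1, 2, 3, 4}
b = {3, 4, 5, 6, 7, 8}

precision, recall = precision_recall(a, b)

# Direct set-based definitions, for comparison.
jaccard_direct = len(a & b) / float(len(a | b))
dice_direct = 2.0 * len(a & b) / (len(a) + len(b))

print(jaccard_from_pr(precision, recall), jaccard_direct)
print(dice_from_pr(precision, recall), dice_direct)

# A distance-style implementation reports 1 minus the index.
jaccard_distance = 1.0 - jaccard_direct
print(jaccard_distance)
```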

Contributor

kemaleren commented Oct 2, 2013

@arjoly I did it to mirror the directory layout for the biclustering algorithms. But you are right that it is not necessary here, since everything is contained in one file.

Owner

mblondel commented Oct 8, 2013

> I did it to mirror the directory layout for the biclustering algorithms. But you are right that it is not necessary here, since everything is contained in one file.

Actually, I think the directory layout for the biclustering module is way too nested. But I guess I'm late to the party.

Owner

GaelVaroquaux commented Oct 10, 2013

> Actually, I think the directory layout for the biclustering module is way too nested. But I guess I'm late to the party.

I think that I agree. Too bad that we realized it so late...
