# FIX for issue #884 and improve contingency matrix build efficiency #1147

Closed · wants to merge 20 commits · 5 participants

scikit-learn member

I think (though I am not so sure) that the log operation loses precision for `log(a / b)`: the division/multiplication can lose precision, so `log(a / b)` should be written as `log(a) - log(b)`. This causes issue #884.

For example, in issue #776 @ogrisel reported:

1. `normalized_mutual_info_score([1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6])` will be zero.
2. The following loop will print all zeros:

```
for i in np.logspace(1, 4, 4):
    print normalized_mutual_info_score(np.ones(i, dtype=np.int), np.arange(i, dtype=np.int))
```

Another change is to use `coo_matrix` to accelerate building the contingency matrix. A quick test is shown below.

Using `coo_matrix` (`coo_matrix` is similar to MATLAB's `accumarray`, and I think it is more efficient to calculate :-)):

```
%timeit contingency = contingency_matrix(random_integers(1,10000,(1,1000000))[0], random_integers(1,10000,(1,1000000))[0])
1 loops, best of 3: 604 ms per loop
```

Using the original Python dict:

```
%timeit contingency = contingency_matrix(random_integers(1,10000,(1,1000000))[0], random_integers(1,10000,(1,1000000))[0])
1 loops, best of 3: 1.84 s per loop
```
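As a side note on the `accumarray` analogy: when a `coo_matrix` is materialized, duplicate `(row, col)` pairs are summed, which is exactly the accumulation a contingency table needs. A tiny illustration (my own, not code from this PR):

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 0, 1, 1, 1])
cols = np.array([0, 0, 0, 2, 2])
data = np.ones(5)

# duplicate (row, col) pairs are summed when the matrix is converted,
# so each cell ends up holding a count, accumarray-style
m = coo_matrix((data, (rows, cols)), shape=(2, 3)).toarray()
print(m)  # [[2. 0. 0.]
          #  [1. 0. 2.]]
```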

A further issue is that when using a dict, code like `contingency = contingency_matrix(random_integers(1,10,(1,100)), random_integers(1,10,(1,100)))` fails because an ndarray is not hashable; a check on the input should be added.

One thing remaining: the contingency matrix is usually (if not always) sparse, so using sparse operations should further accelerate things, especially computing `x * log(x)` while handling the `0 log 0` case, since we can extract the non-zero entries at no cost compared to a dense matrix. Currently I call `todense()` on the contingency matrix; I will fix that if we decide to use `coo_matrix` :-)
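For instance, a rough sketch of the `x * log(x)` point (assuming the contingency table is kept in COO form; the variable names are mine): since only nonzero cells are stored in `.data`, the `0 log 0` convention is handled for free:

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 1])
cols = np.array([1, 0, 1])
counts = np.array([2.0, 1.0, 3.0])
c = coo_matrix((counts, (rows, cols)), shape=(2, 2))

# c.data contains only the nonzero counts, so np.log never sees a zero
# and no NaN from 0 * log(0) can appear in the sum
p = c.data / c.data.sum()
neg_entropy = np.sum(p * np.log(p))
```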

added some commits Sep 14, 2012
weilinear `ENH: using coo matrix construction to accelerate calculation of the contingency matrix` (`e2788df`)
weilinear `FIX: numerial issues in NMI` (`f4ab723`)
weilinear `COMIT pep8` (`6a3479c`)
weilinear `ENH add refs to issue #884` (`979fc3a`)
scikit-learn member

As the contingency matrix is likely to be dense, I don't think `coo_matrix` is the best solution.
The easy way to speed this up is to use `np.histogram2d`, as in `metrics.confusion_matrix`.
The slightly harder way (not that hard, though) is to write a Cython function `bincount2d` analogous to `np.bincount` and `np.histogram2d`.

I guess you are right about the loss of precision in the division; I'll try to have a look later.
Do I understand correctly that in the two examples you give, the output of your version differs from the current version (and hopefully fixes #884)?

scikit-learn member

btw it would also be possible to just call `metrics.confusion_matrix` to compute the contingency matrix.

scikit-learn member

Sorry, I didn't fully read your comment before. Why do you think the contingency matrix is usually sparse?
I am not very experienced with these methods, but in my applications it was dense. What is your application?
For evaluating clusterings against ground truth, the number of labels in the ground truth is usually "small" in the data sets I used.

scikit-learn member

Thanks for looking into this by the way, I really didn't have the time :-/

and 1 other commented on an outdated diff Sep 14, 2012
sklearn/metrics/cluster/supervised.py
```diff
  outer = np.outer(pi, pj)
  nnz = contingency != 0.0
- mi = contingency[nnz] * np.log(contingency[nnz] / outer[nnz])
+ mi = ((contingency[nnz] / contingency_sum) *
```

amueller (Sep 14, 2012): Do you think it is possible to split this into several lines to make it a bit more readable? Also, maybe you should try computing `contingency[nnz]` only once; at the moment it is computed 3 times.

amueller (Sep 14, 2012): It looks as if it could be written as `(contingency[nnz] / contingency_sum) * (np.log(contingency[nnz]) - log(contingency_sum) - np.log(outer[nnz]) + log(pi.sum()) + log(pj.sum()))`, maybe split up in three parts:

```
# normalized contingency
contingency_nm = contingency[nnz] / contingency_sum
# log of normalized contingency
log_contingency_nm = np.log(contingency[nnz]) - log(contingency_sum)
# negative log of the outer product of the marginals
log_outer = -np.log(outer[nnz]) + log(pi.sum()) + log(pj.sum())
return np.sum(...)
```

weilinear (Sep 14, 2012): For `contingency[nnz]`, it is a good idea to pre-calculate it. For the second part, splitting it up will make things clearer. I will change it after we have made a decision on the numeric and coo_matrix issues :)
scikit-learn member

@amueller For the numerical issue: in the current version (0.12) on my computer, the result is as follows:

```normalized_mutual_info_score([1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6])
-2.2204460492503135e-06```

And

```
for i in np.logspace(1, 4, 4):
    print normalized_mutual_info_score(np.ones(i, dtype=np.int), np.arange(i, dtype=np.int))
```

will print:

```
-1.11022302463e-06
6.66133814775e-06
6.66133814775e-06
-0.000938138455808
```

All of them should be exactly zero, I think. Could @ogrisel have a look at it? It is the case described in issue #776.

scikit-learn member

For the coo_matrix issue, my reasoning is as follows:
1. The contingency matrix is not actually guaranteed to be sparse, but in my experience, when the number of classes is very large (say > 100), each class is confused mainly with a few other classes, not all of them. I do not know whether others' experience is similar :-).
2. The contingency matrix cannot be too large, as it is at most `n_class x n_label`, so the sparse representation will not lose much speed compared to a dense matrix, but the operations on it matter.
3. `coo_matrix` is a basic sparse matrix type, and I believe scipy has optimized it well. The tests above use a fully dense matrix, and the speedup is still almost 3x. Intuitively, constructing a `coo_matrix` does not cost much overhead.
4. For the `contingency[nnz]` problem, we avoid the indexing entirely: we can take the data directly out of the sparse matrix to compute `x log x`, while still making sure that `0 log 0` is `0` rather than `NaN`.
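To make point 4 concrete, here is a minimal sketch of a coo-based contingency matrix (my own reconstruction of the idea, not necessarily the exact code in this PR):

```python
import numpy as np
from scipy.sparse import coo_matrix

def contingency_matrix_coo(labels_true, labels_pred):
    # map arbitrary label values to 0..n-1 indices
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    # coo_matrix sums duplicated (i, j) pairs, i.e. it counts co-occurrences
    contingency = coo_matrix(
        (np.ones(class_idx.shape[0]), (class_idx, cluster_idx)),
        shape=(classes.shape[0], clusters.shape[0]),
        dtype=np.int64)
    return contingency.toarray()

print(contingency_matrix_coo([0, 0, 1], ['b', 'b', 'a']))
```

The labels do not even need to be integers, since `np.unique(..., return_inverse=True)` handles the re-indexing.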

The above is how I see the problem. Essentially this constructs an array by accumulation, like an `accumarray` for scipy. A quick test of `coo_matrix` against `histogram2d` is as follows:

```
%timeit p = histogram2d(random_integers(1,10000,(1,1000000))[0], random_integers(1,10000,(1,1000000))[0], [10000,10000])
1 loops, best of 3: 7.29 s per loop
```

```
%timeit p = coo_matrix((np.ones(1000000), (random_integers(1,10000,(1,1000000))[0], random_integers(1,10000,(1,1000000))[0])), shape=[1000000,1000000], dtype=np.int)
10 loops, best of 3: 44.2 ms per loop
```

By changing the contingency computation to the following (which passes all the tests):

```
contingency = np.asarray(np.histogram2d(class_idx, cluster_idx, [n_classes, n_clusters])[0], dtype='int')
```

```
%timeit contingency = contingency_matrix(random_integers(1,10000,(1,1000000))[0], random_integers(1,10000,(1,1000000))[0])
1 loops, best of 3: 7.86 s per loop
```

The first comparison is just the histogram without computing the unique labels; the time taken is 7.29 s vs 44.2 ms, and for the full contingency matrix calculation the result is 7.86 s vs 604 ms (from the first post). It is not tested at small scale yet. I am not sure whether I have done the comparison in the right manner or used `histogram2d` correctly; I hope you can review it :)

scikit-learn member

Thanks for the testing. I couldn't do it myself as my CPU is quite busy atm ;)
I find this quite surprising. It is probably because the correct operation would be a bincount, not a histogram, which tries to sort values into bins first. Could you please try again with the bins given explicitly, i.e. `np.arange(n_classes)` and `np.arange(n_clusters)`? That could make a big difference.

Otherwise, as your solution is so much faster, let's use it. Maybe add a little comment on why this is a good thing to do ;)

It would be great if you could change `metrics.confusion_matrix` to use the same construction!

scikit-learn member

@amueller Take your time :). Maybe we should leave it open for a few days and wait for others' comments? I am not so sure about the performance aspect. I will keep working on it and finish this whole file (including docstrings), where other places may also take advantage of the sparse representation if it turns out to be a good idea. I also plan to put @ogrisel's test cases into the test file.

I will look further into `confusion_matrix`. Maybe I can use this trick to accelerate the confusion matrix, and then use the confusion matrix to compute the contingency matrix here.

scikit-learn member

Yes, I think sharing the code would be a good idea. If you have the time, just run the histogram2d tests with the bins given by `arange`, not by the number of bins. If that is still slower than your version, I think no one will argue against it ;)
You would just be making the confusion matrix faster :)

scikit-learn member

Btw, did your fix affect the problems in `v_measure`?
As I said in #776, `v_measure` should be `MI(X,Y)/(H(X) + H(Y))`, which was not the case :-/

scikit-learn member

Seems I have missed the v_measure :(. I will look into that today.

scikit-learn member

Don't worry about the v-measure, we can do this later. This PR is definitely an improvement.

added some commits Sep 17, 2012
weilinear `FIX: ADD test cases for exact 0 case, and nmi equal to v_measure case` (`444faf9`)
weilinear `FIX: accelerate v_measure calculation based on mutual information` (`bc40127`)
weilinear `COSMIT add doc to clearify how nmi is normalized and pep8 fix` (`082a9ed`)
weilinear `COSMIT pep8 fix for test_supervised` (`5182bdf`)
weilinear `FIX: fixes error caused by break line` (`6f1e09f`)
weilinear `Using coo_matrix to accelerate confusion_matrix calculation` (`09c214e`)
scikit-learn member

I have changed `confusion_matrix` as well. I do not think it is a good idea to use the confusion matrix inside the contingency matrix: for the confusion matrix we have an additional label vector, and we may need dictionary operations to rearrange labels_true and labels_pred, which would double the calculation time.

For `confusion_matrix` here is a simple test

`%timeit confusion_matrix(random_integers(1,10000,(1,1000000))[0],random_integers(1,10000,(1,1000000))[0])`

original : 1 loops, best of 3: 2.05 s per loop
new : 1 loops, best of 3: 897 ms per loop

`%timeit a = confusion_matrix(np.arange(20000),np.arange(20000))`

original: 1 loops, best of 3: 461 ms per loop
new : 1 loops, best of 3: 438 ms per loop

`%timeit p= coo_matrix((np.ones(1000000),(np.arange(1000000),np.arange(1000000))),shape=[1000000,1000000])`

100 loops, best of 3: 7.89 ms per loop

`histogram2d(np.arange(1000000),np.arange(1000000))`

10 loops, best of 3: 180 ms per loop

This yields similar performance when using `arange`. It seems that when the data is aligned, the original method using `dict` is quite effective, as the resulting matrix is quite sparse and the dictionary does not need to create many new keys. However, `histogram2d` seems to do too much work, which makes it less effective for our simple usage here.

scikit-learn member

For the v_measure: I think it is not equal to our version of NMI. It equals an NMI normalized by `(H(x) + H(y)) / 2`, while our version normalizes by `sqrt(H(x) * H(y))`. I added a note to nmi's docstring stating our normalization method.

I further simplified the calculation of the homogeneity and completeness scores; actually

`1 - H(C|E) / H(C) = (H(C) - H(C|E)) / H(C) = MI(E,C) / H(C)`
`1 - H(E|C) / H(E) = (H(E) - H(E|C)) / H(E) = MI(E,C) / H(E)`

So it is easy to see that the v_measure (the harmonic mean of the two) equals
`2 * MI(E,C) / (H(C) + H(E))`
which is different from ours.
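As a quick numerical sanity check of that identity (plain NumPy, with my own helper names and an arbitrary example, not the scikit-learn implementation):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a labeling, in nats
    p = np.bincount(labels).astype(float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log(p))

def mutual_info(a, b):
    # mutual information from the joint distribution of two labelings
    n = len(a)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1.0
    p = joint / n
    outer = np.outer(p.sum(axis=1), p.sum(axis=0))
    nz = p > 0
    return np.sum(p[nz] * (np.log(p[nz]) - np.log(outer[nz])))

a = np.array([0, 0, 1, 1, 2])
b = np.array([0, 1, 1, 1, 2])
mi, h_a, h_b = mutual_info(a, b), entropy(a), entropy(b)
hom, com = mi / h_a, mi / h_b             # homogeneity, completeness
v_harmonic = 2 * hom * com / (hom + com)  # harmonic mean definition
v_direct = 2 * mi / (h_a + h_b)           # the identity above
assert np.isclose(v_harmonic, v_direct)
```

The two expressions agree by simple algebra, so the check holds for any non-degenerate pair of labelings.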

weilinear `COSMIT` (`f5f52a7`)
scikit-learn member

Again on the histogram2d: sorry I couldn't do the experiments myself, my box is busy non-stop with experiments atm.
You should really pass histogram2d the bins explicitly, like so:

```
%timeit p = histogram2d(random_integers(1,10000,(1,1000000))[0], random_integers(1,10000,(1,1000000))[0], [np.arange(10000), np.arange(10000)])
```

Sorry if I haven't been clear before.

scikit-learn member

Could you also please test the confusion matrix separately for the two cases that occur, i.e. with few labels and with many?
Sorry, I'm a bit busy atm; I hope to be more helpful from next week on.

scikit-learn member
`%timeit p= histogram2d(random_integers(1,10000,(1,1000000))[0],random_integers(1,10000,(1,1000000))[0],[np.arange(10000),np.arange(10000)])`

1 loops, best of 3: 7.3 s per loop

confusion_matrix with few labels:

```
%timeit confusion_matrix(random_integers(1,20,(1,1000000))[0], random_integers(1,20,(1,1000000))[0])
```

original (using `dict`): 1 loops, best of 3: 1.72 s per loop
new (using `coo_matrix`): 1 loops, best of 3: 717 ms per loop

just take your time :). I will be busy these few days as well :(

scikit-learn member

Thanks for the benchmarks!
Looks good :)

For the v-measure: yes, it is a different kind of normalized mutual information. The question was: if we take our entropy and mutual information implementations and evaluate `2 * MI(E,C) / (H(C) + H(E))`, do we get the v-measure out?
That was not the case before.

added some commits Sep 17, 2012
weilinear `ENH add test for testing v_measure is a variant of nmi` (`0fe80ac`)
weilinear `COSMIT typos in doc strings` (`feeb599`)
scikit-learn member

I added a test function to the test script, and it seems `v_measure` and
`2 * MI(E,C) / (H(C) + H(E))` are equal.

scikit-learn member

Thanks, that is great :)
Awesome! +10 for merge then :)
This bug was bugging me ;)

scikit-learn member

@ogrisel Would you mind having a look?

added some commits Sep 17, 2012
weilinear `FIX let test use random_state(seed)` (`83a79e6`)
weilinear `PEP8..` (`8305a7a`)
scikit-learn member

Seems I always forget to run pep8 before committing; too many unnecessary commits :(

scikit-learn member

Don't worry too much about it ;)
If you want, you can use `rebase -i`; remember you'll have to force-push after a rebase.
Some might think differently about it, but I don't think it is necessary. And no one has complained about my endless list of tiny commits.

and 1 other commented on an outdated diff Sep 18, 2012
sklearn/metrics/cluster/supervised.py
```diff
  n_classes = classes.shape[0]
  n_clusters = clusters.shape[0]
- contingency = np.zeros((n_classes, n_clusters), dtype=np.int)
- for c, k in zip(labels_true, labels_pred):
-     contingency[class_idx[c], cluster_idx[k]] += 1
+ # using coo_matrix to accelerate calculation of contingency matrix
+ # it can accelerate 2d-histogram like construction
```

vene (Sep 18, 2012): It is not clear to me what this comment line means, would you mind explaining?

weilinear (Sep 19, 2012): @vene Actually I mean that these simple histogram-like calculations can be done using `coo_matrix`. Currently its performance is better than `histogram2d`. I will change this line :)
commented on an outdated diff Sep 18, 2012
sklearn/metrics/cluster/tests/test_supervised.py
```diff
 @@ -167,3 +167,25 @@ def test_adjusted_mutual_info_score():
  def test_entropy():
      ent = entropy([0, 0, 42.])
      assert_almost_equal(ent, 0.6365141, 5)
 +
 +
 +def test_exactly_zero_info_score():
 +    """Check numerical stabability when information is exactly zero"""
 +    for i in np.logspace(1, 4, 4):
 +        labels_a, labels_b = np.ones(i, dtype=np.int),\
 +            np.arange(i, dtype=np.int)
 +        assert_equal(normalized_mutual_info_score(labels_a, labels_b), 0.0)
 +        assert_equal(v_measure_score(labels_a, labels_b), 0.0)
 +        assert_equal(adjusted_mutual_info_score(labels_a, labels_b), 0.0)
 +        assert_equal(normalized_mutual_info_score(labels_a, labels_b), 0.0)
 +
 +
 +def test_v_measure_and_mi(seed=36):
 +    """Check relation between v_measure, entropy and and mi"""
```

vene (Sep 18, 2012): typo "and and"; also try to fit "mutual information" instead of "mi" if you can.
and 2 others commented on an outdated diff Sep 18, 2012
sklearn/metrics/cluster/tests/test_supervised.py
```diff
 @@ -167,3 +167,25 @@ def test_adjusted_mutual_info_score():
  def test_entropy():
      ent = entropy([0, 0, 42.])
      assert_almost_equal(ent, 0.6365141, 5)
 +
 +
 +def test_exactly_zero_info_score():
 +    """Check numerical stabability when information is exactly zero"""
 +    for i in np.logspace(1, 4, 4):
 +        labels_a, labels_b = np.ones(i, dtype=np.int),\
 +            np.arange(i, dtype=np.int)
 +        assert_equal(normalized_mutual_info_score(labels_a, labels_b), 0.0)
 +        assert_equal(v_measure_score(labels_a, labels_b), 0.0)
 +        assert_equal(adjusted_mutual_info_score(labels_a, labels_b), 0.0)
 +        assert_equal(normalized_mutual_info_score(labels_a, labels_b), 0.0)
 +
 +
 +def test_v_measure_and_mi(seed=36):
```

vene (Sep 18, 2012): does the seed need to be fixed?

GaelVaroquaux (Sep 18, 2012): > does the seed need to be fixed?
Always, IMHO.

vene (Sep 18, 2012): Then what would be the point of "Seeding RNG with seed 4389128" before running the tests? Something is eluding me...

weilinear (Sep 19, 2012): @vene @GaelVaroquaux I am a little confused as well by "Seeding RNG ...". Another thing: have I done it right here?

GaelVaroquaux (Sep 19, 2012): > Then what would be the point of "Seeding RNG with seed 4389128" before running the tests? Something is eluding me...
In case people forget to seed everything (it has happened before).
weilinear `FIX typos and vague comments` (`3415717`)
and 1 other commented on an outdated diff Sep 19, 2012
sklearn/metrics/cluster/supervised.py
```diff
  label_idx = unique(labels, return_inverse=True)[1]
  pi = np.bincount(label_idx).astype(np.float)
  pi = pi[pi > 0]
- pi /= np.sum(pi)
- return -np.sum(pi * np.log(pi))
+ pi_sum = np.sum(pi)
+ return -np.sum((pi / pi_sum) * (np.log(pi) - log(pi_sum)))
```

GaelVaroquaux (Sep 19, 2012): Out of curiosity, what is the motivation for this change?

weilinear (Sep 19, 2012): @GaelVaroquaux This line caused some exact 0/1 cases to fail, making nmi give values a little larger than 1 or very small negative values. I think `log(a) - log(b)` may be more accurate than `log(a/b)`, as the division may lose precision; a quick test by others is here: http://www.jonkagstrom.com/floating-point-precision-and-sum-of-logs However, I think I need more time to take a deeper look at those issues :)

GaelVaroquaux (Sep 19, 2012): OK. I am a bit surprised that the '0' case works better:

```
In [1]: 0 * np.log(0)
Warning: divide by zero encountered in log
Out[1]: nan
```

I would have suspected that it was necessary to remove the occurrences of pi == 0 manually. I guess that this is all covered by the tests, right? Could you please add a comment line above that line explaining the reasons? Thanks!

weilinear (Sep 19, 2012): @GaelVaroquaux The test cases cover the 0 log 0 cases: I am testing cases where nmi and v_measure are exactly zero, where a single NaN would make the final answer NaN. Comments are added.
scikit-learn member

Looks good to me. 👍 for merge. Thanks a lot @kuantkid

added some commits Sep 19, 2012
weilinear `DOC add comments for log(a) - log(b) precision` (`b6d511b`)
weilinear `COSMIT fails to see the function name use mi rather than mutual information` (`bcf2676`)
scikit-learn member

There are some doctest issues, which I'll fix and then merge.

scikit-learn member

Hm, I forgot how to deal with this:

```
Expected:
0.0..
Got:
3.2034265038149171e-16
```

Use a print? Change the numpy representation precision?

scikit-learn member

Or should we rather take this as a sign that some computation somewhere is less stable now?

scikit-learn member

Hm I should really be doing something else right now, @kuantkid could you maybe investigate? run `nosetests sklearn/metrics/cluster/supervised.py` to reproduce (probably).

scikit-learn member

@amueller I will take care of that :)

BTW, how can I tell whether the doctest is right or not?

scikit-learn member

OK, I see the doctest now. I originally took doctests to be something that just tests the documentation...

scikit-learn member

@amueller Actually this is caused by a very interesting phenomenon.

Suppose we have a vector v = [0.5, 0.5]. Then the following code

```
v = [float(0.5), float(0.5)]
entropy_C = 0
for i in range(2):
    entropy_C -= log(v[i]) * v[i]
```

and the following code

`(np.log(v) * v).sum()`

will give different answers. It seems the original code's precision is not the same as NumPy's. I think the original doctests passed because there is a `MutualInformation / Entropy` operation that happens to round away the troubling part causing the precision problem described as follows:

The test currently fails, maybe because `np.sum()` does not do something like `math.fsum()`: if we sum the same vector in a different order, we may get slightly different answers at machine precision: http://en.wikipedia.org/wiki/Kahan_summation_algorithm

Here is a ticket for numpy: http://projects.scipy.org/numpy/ticket/1855

Here is an example that is quite likely to occur when calculating entropy:

```
numpy.sum([log(2), log(3), -log(2), -log(3)]), numpy.sum([log(2), -log(2), log(3), -log(3)])
fsum([log(2), log(3), -log(2), -log(3)]), fsum([log(2), -log(2), log(3), -log(3)])
```

Output is

```
(-2.2204460492503131e-16, 0.0)
(0.0, 0.0)
```

The answers differ even though the input arrays contain the same values. I tend to think this problem cannot be solved directly in our package, so I suggest we tolerate errors up to machine precision :)
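For reference, the compensated summation behind that Wikipedia link can be sketched in a few lines (a generic illustration, not code from this PR; `math.fsum` uses an even stronger exact scheme):

```python
from math import fsum, log

def kahan_sum(values):
    # Kahan compensated summation: carry the low-order bits that plain
    # floating-point addition would drop at each step
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total

vals = [log(2), log(3), -log(2), -log(3)]
print(fsum(vals))       # 0.0: fsum is exact for float inputs
print(kahan_sum(vals))  # within a few ulps of zero
print(sum(vals))        # may pick up ordering-dependent roundoff
```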

scikit-learn member

Thanks for investigating. Could you explain where the rounding operation was in the previous implementation?

scikit-learn member

Here n_C is a list of floats, and it causes the rounding effect, though it seems this was not done on purpose:

```
n_C = [float(np.sum(labels_true == c)) for c in classes]
n_K = [float(np.sum(labels_pred == k)) for k in clusters]

for i in xrange(len(classes)):
    entropy_C -= n_C[i] / n_samples * log(n_C[i] / n_samples)

for j in xrange(len(clusters)):
    entropy_K -= n_K[j] / n_samples * log(n_K[j] / n_samples)
```
scikit-learn member

@amueller When I run the nosetests, I get a

```
raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x1a06830>.getvalue()))
```

How can I check which line this failure is on?

scikit-learn member

Wow, thanks @GaelVaroquaux, I didn't know about that. Usually I use ipython and import the test.

scikit-learn member

I usually use `nosetests test_file.py -m function_name`...

scikit-learn member

@kuantkid Sorry, I'm a bit slow. I still don't see where the rounding is taking place.

scikit-learn member

@amueller My fault for not making myself clear.

Suppose labels_true = [1, 2]. Then the entropy calculated by the current code (this PR) is

```v = np.array([0.5,0.5], dtype=np.float)
en1 = -np.sum(v * np.log(v))```

`0.69314718055994529`

And in the original implementation, the entropy is calculated as

```v = [float(0.5), float(0.5)]
en2 = 0.0-( v[0] * math.log(v[0]) + v[1] * math.log(v[1]))```

`0.6931471805599453`

It seems the scalar `float` code path rounds differently from the NumPy array path, with a possible loss of precision. This is what I meant by rounding in my previous comments.

scikit-learn member

@amueller @GaelVaroquaux Actually my problem is that I ran `nosetests sklearn/metrics/cluster/supervised.py` and the test fails with

```
raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x1e2a8c0>.getvalue()))
```

I think it may be a failure in parsing the docstrings in supervised.py.

scikit-learn member

Hi @kuantkid, when reporting test failures you need to report a bit more than the exception raised: include the error message that goes with it as well. In your case:

```======================================================================
FAIL: Doctest: sklearn.metrics.cluster.supervised.completeness_score
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.metrics.cluster.supervised.completeness_score
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 329, in completeness_score

----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 381, in sklearn.metrics.cluster.supervised.completeness_score
Failed example:
completeness_score([0, 1, 2, 3], [0, 0, 1, 1])
Expected:
1.0
Got:
0.99999999999999989

>>  raise self.failureException(self.format_failure(.getvalue()))

======================================================================
FAIL: Doctest: sklearn.metrics.cluster.supervised.homogeneity_score
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.metrics.cluster.supervised.homogeneity_score
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 262, in homogeneity_score

----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 312, in sklearn.metrics.cluster.supervised.homogeneity_score
Failed example:
homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2])
Expected:
1.0
Got:
0.99999999999999989
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 314, in sklearn.metrics.cluster.supervised.homogeneity_score
Failed example:
homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3])
Expected:
1.0
Got:
0.99999999999999989
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 322, in sklearn.metrics.cluster.supervised.homogeneity_score
Failed example:
homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0])
Expected:
0.0
Got:
1.6017132519074588e-16

>>  raise self.failureException(self.format_failure(.getvalue()))

======================================================================
FAIL: Doctest: sklearn.metrics.cluster.supervised.v_measure_score
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.metrics.cluster.supervised.v_measure_score
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 396, in v_measure_score

----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 453, in sklearn.metrics.cluster.supervised.v_measure_score
Failed example:
v_measure_score([0, 0, 1, 2], [0, 0, 1, 1])     # doctest: +ELLIPSIS
Expected:
0.8...
Got:
0.79999999999999993
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 462, in sklearn.metrics.cluster.supervised.v_measure_score
Failed example:
v_measure_score([0, 0, 1, 1], [0, 0, 1, 2])     # doctest: +ELLIPSIS
Expected:
0.8...
Got:
0.79999999999999993
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 476, in sklearn.metrics.cluster.supervised.v_measure_score
Failed example:
v_measure_score([0, 0, 1, 1], [0, 0, 0, 0])
Expected:
0.0
Got:
3.2034265038149171e-16

>>  raise self.failureException(self.format_failure(.getvalue()))
```

Quite clearly, the errors you are getting are due to a loss of numerical precision. Now it remains to be seen whether it is important.

scikit-learn member

@GaelVaroquaux Thanks! I failed to see that there is one `self.format_failure` for each failed test case. I thought that at the end of the test there was one doctest written in the wrong style which caused the `format_failure`. My fault.

I tried to analyze the precision loss in the previous few comments :)

scikit-learn member

@GaelVaroquaux @amueller How can I change the doctest string format so that it tolerates machine precision? Currently I can use `>>> print xxx`, as `print` rounds the result so that `0.79999999999999993` becomes `0.8`, but that does not look great :)

scikit-learn member

You can easily change that by using `print("%.2f" % x)`, but I think it would be interesting to investigate the loss of precision.

scikit-learn member

@amueller @GaelVaroquaux Could you have a look at my previous comments about the loss of precision? The loss is mainly due to the summation operation. If you think that is a reasonable loss of precision, then I will change the doctests :)

PS: I do not know how to quote previous comments...

scikit-learn member

Sorry for not getting back. I have been way busy this week. I'd like to have a closer look at the code again before merging. Hopefully on this weekend. I'd really like to get this code in soon as it is a great contribution.

scikit-learn member

@amueller Just take your time :)

scikit-learn member

Ok, I was having a closer look again.
I observed:

```
ipdb> log(8. / (4 * 4))
-0.6931471805599453
ipdb> log(8.) - log(4 * 4)
-0.6931471805599454
```

The correct result is -ln(2), which is actually the top one.
Separating the logs here causes a loss of precision.
Maybe a bit easier to understand with `exp`:

```
ipdb> exp(log(8.) - log(4 * 4))
0.49999999999999994
ipdb> exp(log(8. / (4 * 4)))
0.5
```

Intuitively I would have assumed that you were right to first compute the logs, but it seems that is not always better.

scikit-learn member

Ok I have no idea how that should be handled. @GaelVaroquaux @ogrisel any opinions / ideas?

scikit-learn member

I can not answer either of the questions, but I would hate to see this great PR sit around much longer.

scikit-learn member

@amueller I think you are right for the cases where we do not lose precision in the division/multiplication. The problem with computing the inner part first by multiplication/division is that there may be a loss of precision there; here `8 / (4 * 4) = 0.5` happens to be exact, but what if we are dealing with `log(8 / (3 * 3))`?

But if we break them apart, the problem is that we face a possible loss of precision in every log calculation.

Moreover, the summation itself loses precision, which we cannot avoid.

I originally thought it would be easy to fix, but I find I am short of knowledge :)

scikit-learn member

As this fixes a bug, I guess we should just merge it after fixing the doctests. @kuantkid, could you do that please?

@GaelVaroquaux I think this should be included in 0.12.1 if possible.

scikit-learn member

@amueller I can fix the doctests asap. I will keep an eye on it whether or not it gets merged.

scikit-learn member

thanks :)

weilinear `FIX doctest to check up to 6 digits precision` (`5d533f1`)
scikit-learn member

@amueller I fixed the doctests by checking the precision up to 6 digits.

scikit-learn member

I was always getting

```
E125 continuation line does not distinguish itself from next logical line
```

and tried autopep8, but it cannot fix it :(

scikit-learn member

Don't use the `\` if possible, and put the ellipsis directive on the next line (I think you need to put `...` in front then).
Also: you can usually avoid `\` line continuations by putting parentheses around the expression, like

```
foo = (bar +
       zork)
```
scikit-learn member

I am a little slow to catch up :(

For the line continuations, how can I change this format to satisfy E125? There already seems to be a pair of parentheses.

```
if (classes.shape[0] == clusters.shape[0] == 1
        or classes.shape[0] == clusters.shape[0] == 0):
    return 1.0
```

And for the doctest, how do I change the following without the `\`? If I omit the `\`, there is a `SyntaxError: unexpected EOF while parsing`.

```
>>> print ("%.6f" % homogeneity_score([0, 0, 1, 1],\
        [0, 0, 1, 2])) # doctest: +ELLIPSIS
1.0...
```

Thanks

Thanks

scikit-learn member

I think this should be fine:

```
>>> print ("%.6f" % homogeneity_score([0, 0, 1, 1],
...                                    [0, 0, 1, 2]))  # doctest: +ELLIPSIS
1.0...
```

or better imho:

```
>>> print ("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2]))
...                                                     # doctest: +ELLIPSIS
1.0...
```

For the other one, I have to think.

scikit-learn member

For

```
if (classes.shape[0] == clusters.shape[0] == 1
        or classes.shape[0] == clusters.shape[0] == 0):
    return 1.0
```

I don't get any pep8 errors...
Which tool are you using?
Maybe another pair of parentheses?

```
if ((classes.shape[0] == clusters.shape[0] == 1)
        or (classes.shape[0] == clusters.shape[0] == 0)):
    return 1.0
```
scikit-learn member

Same here, and I think I will just ignore it this time, as I have not touched the lines that produce this complaint... (pep8 1.3.3)

scikit-learn member

Ok. +1 for merge if the tests pass now. Seems I have to upgrade my pep8....

scikit-learn member

Just upgraded to pep8 1.3.3 and still don't see any errors....

weilinear `FIX: eliminate \ for continuation from doctests` (`dc6212b`)
scikit-learn member

@amueller It seems a line containing `# doctest` should begin with three dots if it starts on the next line. I have fixed the doctests by placing the `# doctest` directive on the next line. By googling I found this; it seems to be an issue with pep8 1.3.3: PyCQA/pycodestyle#126

scikit-learn member

I'll merge this by rebase if no-one is opposed...

scikit-learn member

Merged to master. Thanks @kuantkid :)

closed this Oct 4, 2012
scikit-learn member

@amueller Thanks for reviewing this PR multiple times! Just out of curiosity, should I also update the whats_new file each time I send a PR? :)

scikit-learn member

Well thanks for your contribution. Yes, usually it is a good idea to update the whatsnew. I guess I should add your fix that I just merged.

scikit-learn member

I just realized how complex it is for the release manager to remember to update all those things :) Next time I will remember to update it, whether it is "new" or not.

This was referenced Oct 11, 2012