FIX for issue #884 and improve contingency matrix build efficiency #1147

Closed

5 participants

@weilinear
scikit-learn member

I think (though I am not entirely sure) that the log operation breaks for log(a/b): the division/multiplication can lose precision, so log(a/b) should be written as log(a) - log(b). This causes issue #884.

For example, the cases from issue #776 reported by @ogrisel:

1. normalized_mutual_info_score([1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6]) will be zero.
2. The following loop will print all zeros:

for i in np.logspace(1, 4, 4):
    print normalized_mutual_info_score(np.ones(i, dtype=np.int), np.arange(i, dtype=np.int))

Another change is to use coo_matrix to accelerate the construction of the contingency matrix. A quick test is shown below.

Using coo_matrix (coo_matrix is similar to matlab's accumarray, and I think it is more efficient for this calculation :-)):

%timeit contingency = contingency_matrix(random_integers(1,10000,(1,1000000))[0],
                                         random_integers(1,10000,(1,1000000))[0])
1 loops, best of 3: 604 ms per loop

Using the original python dict:

%timeit contingency = contingency_matrix(random_integers(1,10000,(1,1000000))[0],
                                         random_integers(1,10000,(1,1000000))[0])
1 loops, best of 3: 1.84 s per loop

A further issue is that with the dict-based version, code like contingency = contingency_matrix(random_integers(1,10,(1,100)), random_integers(1,10,(1,100))) fails because an ndarray is not hashable, so a check on the input shape should be added.

One thing remaining is that the contingency matrix is usually (if not always) sparse, so using sparse operations should accelerate things further, especially computing x * log(x) while handling the 0*log(0) case, since we can extract the non-zero entries at no extra cost compared to a dense matrix. Currently I call todense() on the contingency matrix; I will fix that if we decide to use coo_matrix :-)
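For reference, here is a minimal sketch of the coo_matrix-based construction described above (assumed names, not the exact diff in this PR); duplicate (row, column) pairs are summed when converting to a dense array, which gives exactly the accumarray-style accumulation:

import numpy as np
from scipy.sparse import coo_matrix

def contingency_matrix_coo(labels_true, labels_pred):
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    n_classes = classes.shape[0]
    n_clusters = clusters.shape[0]
    # coo_matrix sums duplicate (row, col) entries on conversion,
    # i.e. it does the 2d-histogram accumulation for us.
    return coo_matrix((np.ones(class_idx.shape[0], dtype=np.int64),
                       (class_idx, cluster_idx)),
                      shape=(n_classes, n_clusters)).toarray()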

@amueller
scikit-learn member

As the contingency matrix is likely to be dense, I don't think coo_matrix is the best solution.
The easy way to speed this up is to use np.histogram2d as in metrics.confusion_matrix.
The slightly harder way (not that hard though) is to write a Cython bincount2d function analogous to np.bincount and np.histogram2d.

I guess you are right about the loss of precision in the division, I'll try to have a look later.
Do I understand correctly that in the two examples you give, the output of your version is different from the current version (and hopefully fixes #884)?
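For illustration, a pure-numpy stand-in for the bincount2d idea could look like the sketch below (the flat-index encoding here is just an assumption, not code from this PR or from numpy):

import numpy as np

def bincount2d(class_idx, cluster_idx, n_classes, n_clusters):
    # Encode each (class, cluster) pair as one flat index, count the
    # pairs with bincount, then reshape into the contingency table.
    flat = class_idx * n_clusters + cluster_idx
    counts = np.bincount(flat, minlength=n_classes * n_clusters)
    return counts.reshape(n_classes, n_clusters)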

@amueller
scikit-learn member

Btw, it would also be possible to just call metrics.confusion_matrix to compute the contingency matrix.

@amueller
scikit-learn member

Sorry, I didn't read your comment fully before. Why do you think the contingency matrix is usually sparse?
I am not very experienced with these methods, but in my applications it was dense. What is your application?
For evaluating clusterings with ground truth, the number of labels in the ground truth is usually "small" in the data sets I used.

@amueller
scikit-learn member

Thanks for looking into this by the way, I really didn't have the time :-/

@amueller amueller and 1 other commented on an outdated diff Sep 14, 2012
sklearn/metrics/cluster/supervised.py
outer = np.outer(pi, pj)
nnz = contingency != 0.0
- mi = contingency[nnz] * np.log(contingency[nnz] / outer[nnz])
+ mi = ((contingency[nnz] / contingency_sum) *
@amueller
amueller Sep 14, 2012

Do you think it is possible to split this into several lines to make it a bit more readable?
Also, maybe you should try computing contingency[nnz] only once; at the moment it is computed 3 times.

@amueller
amueller Sep 14, 2012

It looks as if it could be written as

(contingency[nnz] / contingency_sum) * (np.log(contingency[nnz]) - log(contingency_sum) - np.log(outer[nnz]) + log(pi.sum()) + log(pj.sum()))

maybe split up into three parts:

# normalized contingency
contingency_nm = contingency[nnz] / contingency_sum
# log of normalized contingency
log_contingency_nm = np.log(contingency[nnz]) - log(contingency_sum)
# negative log of the outer product of the marginals (better name?)
log_outer = -np.log(outer[nnz]) + log(pi.sum()) + log(pj.sum())
return np.sum(...)
@weilinear
weilinear Sep 14, 2012

For contingency[nnz], it is a good idea to pre-compute it. For the second part, splitting it up will make things clearer. I will change it after we have made a decision on the numerical and coo_matrix issues :)
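For concreteness, a self-contained version of that split, with contingency[nnz] computed only once, could look roughly like the sketch below (assumed names and a dense integer contingency matrix; this is not the exact code that ends up in the PR):

import numpy as np
from math import log

def mutual_info_from_contingency(contingency):
    contingency_sum = float(np.sum(contingency))
    pi = np.sum(contingency, axis=1).astype(np.float64)
    pj = np.sum(contingency, axis=0).astype(np.float64)
    outer = np.outer(pi, pj)
    nnz = contingency != 0.0
    contingency_nnz = contingency[nnz]  # computed only once
    # normalized contingency
    contingency_nm = contingency_nnz / contingency_sum
    # log of normalized contingency
    log_contingency_nm = np.log(contingency_nnz) - log(contingency_sum)
    # negative log of the outer product of the marginals
    log_outer = -np.log(outer[nnz]) + log(pi.sum()) + log(pj.sum())
    return np.sum(contingency_nm * (log_contingency_nm + log_outer))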

@weilinear
scikit-learn member

@amueller For the numerical issue, the results with the current version (0.12) on my computer are as follows:

normalized_mutual_info_score([1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6])
-2.2204460492503135e-06

And

for i in np.logspace(1, 4, 4):
    print normalized_mutual_info_score(np.ones(i, dtype=np.int), np.arange(i, dtype=np.int))

-1.11022302463e-06
6.66133814775e-06
6.66133814775e-06
-0.000938138455808

All of them should be exactly zero, I think. Could @ogrisel have a look at it? It is the case from issue #776.

@weilinear
scikit-learn member

For the coo_matrix issue, my reasoning is as follows:
1. The contingency matrix is not actually guaranteed to be sparse, but in my experience, when the number of classes is very large (like >100), each class is confused mainly with a few other classes rather than all of them. I do not know whether others' experience is similar :-).
2. The contingency matrix cannot be too large, since it is at most n_class x n_label, so the sparse representation will not lose much speed compared to a dense matrix, and the operations on it are what matter.
3. coo_matrix is a basic sparse matrix type, and I believe scipy's implementation is well optimized. The tests above all use a full matrix and the speed-up is almost 3x. Intuitively, constructing a coo_matrix does not cost us much overhead.
4. When dealing with the contingency[nnz] problem, we avoid the indexing step, since we can take the data directly out of the sparse matrix to compute x*log(x) while still making sure that 0*log(0) is 0 rather than NaN (see the sketch below).
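A sketch of what point 4 means in practice, assuming the contingency matrix is kept as a scipy sparse matrix (not code from this PR):

import numpy as np

def sum_plogp(contingency, contingency_sum):
    # Converting to CSR sums any duplicate COO entries; .data then holds
    # exactly the non-zero counts, so 0 * log(0) never needs special-casing.
    counts = contingency.tocsr().data.astype(np.float64)
    p = counts / contingency_sum
    return np.sum(p * np.log(p))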

The above is how I see the problem. Essentially it constructs an array by accumulation, like AccumarrayLike in scipy. A quick test comparing coo_matrix with histogram2d is as follows:

%timeit p= histogram2d(random_integers(1,10000,(1,1000000))[0],random_integers(1,10000,(1,1000000))[0],[10000,10000])
1 loops, best of 3: 7.29 s per loop
%timeit p= coo_matrix((np.ones(1000000),(random_integers(1,10000,(1,1000000))[0],random_integers(1,10000,(1,1000000))[0])),shape=[1000000,1000000],dtype=np.int)
10 loops, best of 3: 44.2 ms per loop

By changing the contingency matrix to be calculated as follows (this passes all the tests):

 contingency = np.asarray(np.histogram2d(class_idx, cluster_idx, [n_classes, n_clusters])[0],dtype='int')
%timeit contingency=contingency_matrix(random_integers(1,10000,(1,1000000))[0],random_integers(1,10000,(1,1000000))[0])
1 loops, best of 3: 7.86 s per loop

The first comparison just does the histogram without taking the unique labels: 7.29 s vs 44.2 ms. For the full contingency matrix calculation the result is 7.86 s vs 604 ms (in the first post). It has not been tested at small scale yet. I am not sure whether I have done the comparison in the right manner, or whether I have used histogram2d correctly, so I hope you can review it :)

@amueller
scikit-learn member

Thanks for the testing. I couldn't do it myself as my CPU is quite busy atm ;)
I find this quite surprising. It is probably due to the fact that the correct operation would be a bincount, not a histogram, which tries to sort into bins first. Could you please try again with the bins given explicitly as np.arange(n_classes), np.arange(n_clusters)? That could make a big difference.

Otherwise, as your solution is so much faster, let's use that. Maybe add a little comment on why this is a good thing to do ;)

It would be great if you could change metrics.confusion_matrix to use the same construction!

@weilinear
scikit-learn member

@amueller Take your time :). Maybe we should leave it open for a few days and wait for others' comments? I am not so sure about the performance aspects, actually. I will keep working on it and finish the whole file (including docstrings), so that other places can take advantage of the sparse representation if it turns out to be a good idea eventually. I also plan to put @ogrisel's test cases into the test file.

I will also look into confusion_matrix. Maybe I can use this trick to accelerate the confusion matrix, and then use the confusion matrix to compute the contingency matrix here.

@amueller
scikit-learn member

Yes, I think sharing the code would be a good idea. If you have the time, just do the histogram2d tests with the bins given by arange, not by the number of bins. If that is still slower than your version, I think no-one will argue against it ;)
You would just be making the confusion matrix faster :)

@amueller
scikit-learn member

Btw, did your fix affect the problems in v_measure?
As I said in #776, v_measure should be 2 * MI(X,Y) / (H(X) + H(Y)), which was not the case :-/

@weilinear
scikit-learn member

Seems I have missed the v_measure :(. I will look into that today.

@sth4nth

NB

@amueller
scikit-learn member

Don't worry about the v-measure, we can do this later. This PR is definitely an improvement.

@weilinear
scikit-learn member

I have changed confusion_matrix as well. I do not think it is a good idea to use the confusion matrix inside the contingency matrix computation, since confusion_matrix takes an additional labels vector and we may need dictionary operations to rearrange labels_true and labels_pred, which would double the calculation time.

For confusion_matrix here is a simple test

%timeit confusion_matrix(random_integers(1,10000,(1,1000000))[0],random_integers(1,10000,(1,1000000))[0])

original : 1 loops, best of 3: 2.05 s per loop
new : 1 loops, best of 3: 897 ms per loop

%timeit a = confusion_matrix(np.arange(20000),np.arange(20000))

original: 1 loops, best of 3: 461 ms per loop
new : 1 loops, best of 3: 438 ms per loop

%timeit p= coo_matrix((np.ones(1000000),(np.arange(1000000),np.arange(1000000))),shape=[1000000,1000000])

100 loops, best of 3: 7.89 ms per loop

%timeit p = histogram2d(np.arange(1000000), np.arange(1000000))

10 loops, best of 3: 180 ms per loop

With arange this yields similar performance. It seems that when the data is aligned, the original dict-based method is quite effective, because the resulting matrix is quite sparse and the dictionary does not need to create many new keys. histogram2d, however, seems to be doing too much work, which makes it less effective for our simple use case here.
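For completeness, the kind of coo_matrix-based confusion matrix construction being benchmarked here could look roughly like this sketch (assumed names, not the exact change in the PR):

import numpy as np
from scipy.sparse import coo_matrix

def confusion_matrix_coo(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate((y_true, y_pred)))
    # Map arbitrary label values to contiguous indices 0..n_labels-1.
    true_idx = np.searchsorted(labels, y_true)
    pred_idx = np.searchsorted(labels, y_pred)
    n_labels = labels.shape[0]
    # Duplicate (i, j) pairs are summed when converting to dense.
    return coo_matrix((np.ones(y_true.shape[0], dtype=np.int64),
                       (true_idx, pred_idx)),
                      shape=(n_labels, n_labels)).toarray()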

@weilinear
scikit-learn member

For the v_measure question: I think it is not equal to our version of NMI. It equals an NMI normalized by (H(x) + H(y))/2, whereas our version normalizes by sqrt(H(x) * H(y)). I added a sentence to nmi's docstring stating our normalization method.

I also simplified the calculation of the homogeneity score and completeness score. Actually

1 - H(C|E) / H(C) = (H(C) - H(C|E)) / H(C) = MI(E,C) / H(C)
1 - H(E|C) / H(E) = (H(E) - H(E|C)) / H(E) = MI(E,C) / H(E)

so it is easy to see that v_measure (the harmonic mean of the two) is equal to
2 * MI(E,C) / (H(C) + H(E))
which is different from ours.

@amueller
scikit-learn member

Again for the histogram2d: sorry I couldn't do the experiments myself, my box is busy non-stop with experiments atm.
You should really pass the histogram2d the bins explicitly like so:

%timeit p= histogram2d(random_integers(1,10000,(1,1000000))[0],random_integers(1,10000,(1,1000000))[0],[np.arange(10000),np.arange(10000)])

Sorry if I haven't been clear before.

@amueller
scikit-learn member

Could you also please separately test the confusion matrix for the two cases that are there, i.e. with few labels and with many?
Sorry I'm a bit busy atm, hope I'll be more helpful from next week on.

@weilinear
scikit-learn member
%timeit p= histogram2d(random_integers(1,10000,(1,1000000))[0],random_integers(1,10000,(1,1000000))[0],[np.arange(10000),np.arange(10000)])

1 loops, best of 3: 7.3 s per loop

confusion_matrix with few labels:

%timeit confusion_matrix(random_integers(1,20,(1,1000000))[0],random_integers(1,20,(1,1000000))[0])
1 loops, best of 3: 1.72 s per loop

original(using dict): 1.72s
new(using coo_matrix): 717ms

just take your time :). I will be busy these few days as well :(

@amueller
scikit-learn member

Thanks for the benchmarks!
Looks good :)

For the v-measure: yes, it is a different kind of normalized mutual information. The question was: if we take our entropy implementation and mutual information and evaluate
2 * MI(E,C) / (H(C) + H(E)), do we get the v-measure out?
That was not the case before.

@weilinear
scikit-learn member

I added a test function to the test script, and it seems v_measure is indeed equal to
2 * MI(E,C) / (H(C) + H(E))
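For reference, that relation can be checked numerically with something like the sketch below (assuming the entropy, mutual_info_score and v_measure_score helpers exported from sklearn.metrics.cluster, as used in the tests; this is not the exact test added in this PR):

import numpy as np
from sklearn.metrics.cluster import entropy, mutual_info_score, v_measure_score

labels_a = np.array([0, 0, 1, 1, 2, 2, 2])
labels_b = np.array([0, 1, 1, 1, 2, 2, 0])

mi = mutual_info_score(labels_a, labels_b)
h_a, h_b = entropy(labels_a), entropy(labels_b)
# v-measure should match the harmonic-mean style normalization of MI.
print np.allclose(v_measure_score(labels_a, labels_b), 2 * mi / (h_a + h_b))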

@amueller
scikit-learn member

Thanks, that is great :)
Awesome! +10 for merge then :)
This bug was bugging me ;)

@amueller
scikit-learn member

@ogrisel Would you mind having a look?

@weilinear
scikit-learn member

Seems I always forget to run pep8 before committing, hence too many unnecessary commits :(

@amueller
scikit-learn member

Don't worry too much about it ;)
If you want, you can use rebase -i - remember you'll have to force-push after a rebase.
Some might think differently about it but I don't think it is necessary. And no-one has complained about my endless list of tiny commits.

@vene vene and 1 other commented on an outdated diff Sep 18, 2012
sklearn/metrics/cluster/supervised.py
n_classes = classes.shape[0]
n_clusters = clusters.shape[0]
- contingency = np.zeros((n_classes, n_clusters), dtype=np.int)
- for c, k in zip(labels_true, labels_pred):
- contingency[class_idx[c], cluster_idx[k]] += 1
+ # using coo_matrix to accelerate calculation of contingency matrix
+ # it can accelerate 2d-histogram like construction
@vene
vene Sep 18, 2012

It is not clear to me what this comment line means, would you mind explaining?

@weilinear
weilinear Sep 19, 2012

@vene Actually I mean that this kind of simple histogram-like calculation can be done using coo_matrix, and currently its performance is better than histogram2d's. I will change this line :)

@vene vene commented on an outdated diff Sep 18, 2012
sklearn/metrics/cluster/tests/test_supervised.py
@@ -167,3 +167,25 @@ def test_adjusted_mutual_info_score():
def test_entropy():
ent = entropy([0, 0, 42.])
assert_almost_equal(ent, 0.6365141, 5)
+
+
+def test_exactly_zero_info_score():
+ """Check numerical stabability when information is exactly zero"""
+ for i in np.logspace(1, 4, 4):
+ labels_a, labels_b = np.ones(i, dtype=np.int),\
+ np.arange(i, dtype=np.int)
+ assert_equal(normalized_mutual_info_score(labels_a, labels_b), 0.0)
+ assert_equal(v_measure_score(labels_a, labels_b), 0.0)
+ assert_equal(adjusted_mutual_info_score(labels_a, labels_b), 0.0)
+ assert_equal(normalized_mutual_info_score(labels_a, labels_b), 0.0)
+
+
+def test_v_measure_and_mi(seed=36):
+ """Check relation between v_measure, entropy and and mi"""
@vene
vene Sep 18, 2012

typo "and and", also try to fit "mutual information" instead of "mi" if you can.

@vene vene and 2 others commented on an outdated diff Sep 18, 2012
sklearn/metrics/cluster/tests/test_supervised.py
@@ -167,3 +167,25 @@ def test_adjusted_mutual_info_score():
def test_entropy():
ent = entropy([0, 0, 42.])
assert_almost_equal(ent, 0.6365141, 5)
+
+
+def test_exactly_zero_info_score():
+ """Check numerical stabability when information is exactly zero"""
+ for i in np.logspace(1, 4, 4):
+ labels_a, labels_b = np.ones(i, dtype=np.int),\
+ np.arange(i, dtype=np.int)
+ assert_equal(normalized_mutual_info_score(labels_a, labels_b), 0.0)
+ assert_equal(v_measure_score(labels_a, labels_b), 0.0)
+ assert_equal(adjusted_mutual_info_score(labels_a, labels_b), 0.0)
+ assert_equal(normalized_mutual_info_score(labels_a, labels_b), 0.0)
+
+
+def test_v_measure_and_mi(seed=36):
@vene
vene Sep 18, 2012

does the seed need to be fixed?

@GaelVaroquaux
GaelVaroquaux Sep 18, 2012
@vene
vene Sep 18, 2012

Then what would be the point of "Seeding RNG with seed 4389128" before running the tests? Something is eluding me...

@weilinear
weilinear Sep 19, 2012

@vene @GaelVaroquaux I am a little confused by the "Seeding RNG ..." message as well.

Another thing: have I done it right here?

@GaelVaroquaux
GaelVaroquaux Sep 19, 2012
@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Sep 19, 2012
sklearn/metrics/cluster/supervised.py
label_idx = unique(labels, return_inverse=True)[1]
pi = np.bincount(label_idx).astype(np.float)
pi = pi[pi > 0]
- pi /= np.sum(pi)
- return -np.sum(pi * np.log(pi))
+ pi_sum = np.sum(pi)
+ return -np.sum((pi / pi_sum) * (np.log(pi) - log(pi_sum)))
@GaelVaroquaux
GaelVaroquaux Sep 19, 2012

Out of curiosity, what is the motivation for this change?

@weilinear
weilinear Sep 19, 2012

@GaelVaroquaux This line causes some exact 0/1 cases to fail, making nmi give values a little larger than 1 or very small negative values. I think log(a) - log(b) may be more accurate than log(a/b), since the division may lose precision; a quick test by someone else is here:
ref: http://www.jonkagstrom.com/floating-point-precision-and-sum-of-logs

However, I think I need more time to take a deeper look at those issues :)

@GaelVaroquaux
GaelVaroquaux Sep 19, 2012
@weilinear
weilinear Sep 19, 2012

@GaelVaroquaux The test cases will cover the 0*log(0) cases: I am testing cases where nmi and v_measure are exactly zero, so the 0*log(0) cases are exercised, since a single nan would make the final answer nan.

Comments are added.

@GaelVaroquaux
scikit-learn member

Looks good to me. 👍 for merge. Thanks a lot @kuantkid

@amueller
scikit-learn member

There are some doctest issues, which I'll fix and then merge.

@amueller
scikit-learn member

Hm I forgot how to deal with this:

Expected:
    0.0..
Got:
    3.2034265038149171e-16

Use a print? Change numpy representation precision?

@amueller
scikit-learn member

Or should we rather take this as a sign that some computation somewhere is less stable now?

@amueller
scikit-learn member

Hm I should really be doing something else right now, @kuantkid could you maybe investigate? run nosetests sklearn/metrics/cluster/supervised.py to reproduce (probably).

@weilinear
scikit-learn member

@amueller I'll take care of that :)

BTW, how do I tell whether the doctest is right or not?

@weilinear
scikit-learn member

OK, I see the doctests now. I originally took doctests to be something that only tests the documentation...

@GaelVaroquaux
scikit-learn member
@weilinear
scikit-learn member

@amueller Actually this is caused by a very interesting phenomenon.

Suppose we have a vector v = [0.5, 0.5]. Then the following code

v = [float(0.5), float(0.5)]
entropy_C = 0
for i in range(2):
    entropy_C -= log(v[i]) * v[i]

and the following code

(np.log(v) * v).sum()

give different answers, and it seems the original code's precision is not the same as numpy's. I think the original doctests passed because the MutualInformation/Entropy operation there happens to round away the troublesome part that causes the precision problem described below.

Currently the test fails, maybe because np.sum() does not do compensated summation like math.fsum(), which means that if we sum the same vector in a different order we may get slightly different answers at machine precision: http://en.wikipedia.org/wiki/Kahan_summation_algorithm

Here is a ticket for numpy: http://projects.scipy.org/numpy/ticket/1855

Here is an example that is quite possible in calculating entropy:

from math import log, fsum
import numpy

numpy.sum([log(2), log(3), -log(2), -log(3)]), numpy.sum([log(2), -log(2), log(3), -log(3)])
fsum([log(2), log(3), -log(2), -log(3)]), fsum([log(2), -log(2), log(3), -log(3)])

Output is

(-2.2204460492503131e-16, 0.0)
(0.0, 0.0)

The answers differ even though the input values are the same. I tend to think this problem cannot be directly solved in our package, so I suggest we tolerate errors up to machine precision :)
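For reference, the compensated (Kahan) summation mentioned above looks roughly like the sketch below; this is just to illustrate why math.fsum and a plain np.sum can disagree in the last bits, not something I propose adding to the package:

import numpy as np
from math import log, fsum

def kahan_sum(values):
    # Compensated summation: carry the rounding error of each addition
    # forward so it can be corrected in the next step.
    total = 0.0
    comp = 0.0
    for x in values:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

terms = [log(2), log(3), -log(2), -log(3)]
print np.sum(terms), fsum(terms), kahan_sum(terms)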

@amueller
scikit-learn member

Thanks for investigating. Could you explain where the rounding operation was in the previous implementation?

@weilinear
scikit-learn member

Here n_C is a list of Python floats, which causes the rounding effect, though it does not seem to have been done on purpose:

    n_C = [float(np.sum(labels_true == c)) for c in classes]
    n_K = [float(np.sum(labels_pred == k)) for k in clusters]

    for i in xrange(len(classes)):
        entropy_C -= n_C[i] / n_samples * log(n_C[i] / n_samples)

    for j in xrange(len(clusters)):
        entropy_K -= n_K[j] / n_samples * log(n_K[j] / n_samples)
@weilinear
scikit-learn member

@amueller When I run the nosetests, I get

raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x1a06830>.getvalue()))

How can I check which line this failure comes from?

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

Wow thanks, @GaelVaroquaux, I didn't know about that. Usually I use ipython and import the test.

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

I usually use nosetests test_file.py -m function_name...

@amueller
scikit-learn member

@kuantkid Sorry, I'm a bit slow. I still don't see where the rounding is taking place.

@weilinear
scikit-learn member

@amueller My fault for not making myself clear.

Suppose labels_true = [1, 2]. Then the entropy calculated by the current code (this PR) is

v = np.array([0.5,0.5], dtype=np.float)
en1 = -np.sum(v * np.log(v))

0.69314718055994529

And in the original implementation, the entropy is calculated as

v = [float(0.5), float(0.5)]
en2 = 0.0-( v[0] * math.log(v[0]) + v[1] * math.log(v[1]))

0.6931471805599453

Seems float behaves differently from np.float here, with a possible loss of precision. This is what I meant by rounding in my previous comments.

@weilinear
scikit-learn member

@amueller @GaelVaroquaux Actually my problem is that when I run nosetest2 sklearn/metrics/cluster/supervised.py, the test fails with

raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x1e2a8c0>.getvalue()))

I think it may be a failure in parsing the docstrings in supervised.py.

@GaelVaroquaux
scikit-learn member

Hi @kuantkid, when reporting test failures, you need to report a bit more than just the exception raised; please also include the error message that goes with it. In your case:

======================================================================
FAIL: Doctest: sklearn.metrics.cluster.supervised.completeness_score
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.metrics.cluster.supervised.completeness_score
  File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 329, in completeness_score

----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 381, in sklearn.metrics.cluster.supervised.completeness_score
Failed example:
    completeness_score([0, 1, 2, 3], [0, 0, 1, 1])
Expected:
    1.0
Got:
    0.99999999999999989

>>  raise self.failureException(self.format_failure(.getvalue()))
    

======================================================================
FAIL: Doctest: sklearn.metrics.cluster.supervised.homogeneity_score
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.metrics.cluster.supervised.homogeneity_score
  File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 262, in homogeneity_score

----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 312, in sklearn.metrics.cluster.supervised.homogeneity_score
Failed example:
    homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2])
Expected:
    1.0
Got:
    0.99999999999999989
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 314, in sklearn.metrics.cluster.supervised.homogeneity_score
Failed example:
    homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3])
Expected:
    1.0
Got:
    0.99999999999999989
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 322, in sklearn.metrics.cluster.supervised.homogeneity_score
Failed example:
    homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0])
Expected:
    0.0
Got:
    1.6017132519074588e-16

>>  raise self.failureException(self.format_failure(.getvalue()))
    

======================================================================
FAIL: Doctest: sklearn.metrics.cluster.supervised.v_measure_score
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 2201, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for sklearn.metrics.cluster.supervised.v_measure_score
  File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 396, in v_measure_score

----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 453, in sklearn.metrics.cluster.supervised.v_measure_score
Failed example:
    v_measure_score([0, 0, 1, 2], [0, 0, 1, 1])     # doctest: +ELLIPSIS
Expected:
    0.8...
Got:
    0.79999999999999993
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 462, in sklearn.metrics.cluster.supervised.v_measure_score
Failed example:
    v_measure_score([0, 0, 1, 1], [0, 0, 1, 2])     # doctest: +ELLIPSIS
Expected:
    0.8...
Got:
    0.79999999999999993
----------------------------------------------------------------------
File "/home/varoquau/dev/scikit-learn/sklearn/metrics/cluster/supervised.py", line 476, in sklearn.metrics.cluster.supervised.v_measure_score
Failed example:
    v_measure_score([0, 0, 1, 1], [0, 0, 0, 0])
Expected:
    0.0
Got:
    3.2034265038149171e-16

>>  raise self.failureException(self.format_failure(.getvalue()))

Quite clearly, the errors that you are getting are due to loss of numerical precision. Now it remains to be seen whether it is important.

@weilinear
scikit-learn member

@GaelVaroquaux Thanks! I failed to see that there is one self.format_failure for each of the failed test cases; I thought there was a single doctest at the end of the test written in the wrong style that caused the format_failure. My fault.

I tried to analyze the precision loss in the previous few comments :)

@weilinear
scikit-learn member

@GaelVaroquaux @amueller How can I change the doctest output format so that it tolerates machine precision? Currently I can use >>> print xxx, since print rounds the result so that 0.79999999999999993 becomes 0.8, but that does not look nice :)

@amueller
scikit-learn member

You can easily change that by using print("%.2f" % x), but I think it would be interesting to investigate the loss of precision.

@weilinear
scikit-learn member

@amueller @GaelVaroquaux Could you have a look at my previous comments about the loss of precision? The loss mainly comes from the summation operation. If you think it is a reasonable loss of precision, then I will change the doctests :)

PS: I do not know how to quote previous comments...

@amueller
scikit-learn member

Sorry for not getting back. I have been way busy this week. I'd like to have a closer look at the code again before merging. Hopefully on this weekend. I'd really like to get this code in soon as it is a great contribution.

@weilinear
scikit-learn member

@amueller Just take your time :)

@amueller
scikit-learn member

Ok, I was having a closer look again.
I observed:

ipdb> log(8. / (4 * 4))
-0.6931471805599453
ipdb> log(8.) - log(4 * 4)
-0.6931471805599454

The correct result is -ln(2), which is actually the top one.
Separating the logs here causes a loss of precision.

Maybe a bit easier to understand with exp:

ipdb> exp(log(8.) - log(4 * 4))
0.49999999999999994
ipdb> exp(log(8. / (4 * 4)))
0.5

Intuitively I would have assumed that you were right to take the logs first, but it seems that is not always better.

@amueller
scikit-learn member

Ok I have no idea how that should be handled. @GaelVaroquaux @ogrisel any opinions / ideas?

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

I can not answer either of the questions, but I would hate to see this great PR sit around much longer.

@weilinear
scikit-learn member

@amueller I think you are right for cases where we do not lose precision in the division/multiplication. The problem with evaluating the inner part first by multiplication/division is that it can lose precision; here 8/4/4 = 0.5 is exact, but what if we are dealing with log(8/(3*3))?

If we break the logs apart, the problem is that we face a possible loss of precision in every log call.

Either way, the log itself and the summation lose precision, which we cannot avoid.

I originally thought this would be easy to fix, but I find I am short of the necessary knowledge :)

@amueller
scikit-learn member

As this fixes a bug, I guess we should just merge it after fixing the doctests. @kuantkid could you do that please?

@GaelVaroquaux I think if possible this should be included in 0.12.1.

@weilinear
scikit-learn member

@amueller I can fix the doctests asap. I will keep an eye on this whether it is merged or not.

@amueller
scikit-learn member

thanks :)

@weilinear
scikit-learn member

@amueller I fixed the doctests by checking the precision up to 6 digits.

@weilinear
scikit-learn member

I kept getting

E125 continuation line does not distinguish itself from next logical line

and tried autopep8, but it cannot fix it :(

@amueller
scikit-learn member

Don't use the \ if possible and put the ellipsis command in the next line (I think you need to put .. in front then).
Also: You can usually avoid \ line-continuations by putting parentheses around the expression like

foo = (bar +
           zork)
@weilinear
scikit-learn member

I am a little slow to catch up with you :(

For the line continuations, how can I change this code to satisfy E125? There is already a pair of parentheses:

    if (classes.shape[0] == clusters.shape[0] == 1
        or classes.shape[0] == clusters.shape[0] == 0):
        return 1.0

And for the doctest, how can I change the following without the \? If I omit the \, I get a SyntaxError: unexpected EOF while parsing

      >>> print ("%.6f" % homogeneity_score([0, 0, 1, 1],\
                                            [0, 0, 1, 2])) # doctest: +ELLIPSIS
      1.0...

Thanks

@amueller
scikit-learn member

I think this should be fine:

>>> print ("%.6f" % homogeneity_score([0, 0, 1, 1],
      ..                                       [0, 0, 1, 2])) # doctest: +ELLIPSIS
      1.0...

or better imho:

>>> print ("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 0, 1, 2]))
..                                                                 # doctest: +ELLIPSIS
      1.0...

For the other one, I have to think.

@GaelVaroquaux
scikit-learn member
@amueller
scikit-learn member

For

    if (classes.shape[0] == clusters.shape[0] == 1
        or classes.shape[0] == clusters.shape[0] == 0):
        return 1.0

I don't get any pep8 errors....
which tool are you using?
Maybe another pair of parentheses?

    if ((classes.shape[0] == clusters.shape[0] == 1)
        or (classes.shape[0] == clusters.shape[0] == 0)):
        return 1.0
@vene
scikit-learn member
@weilinear
scikit-learn member

Same here, and I think I will just leave it this time, as I have not touched the lines that contain this complaint... (pep8 1.3.3)

@amueller
scikit-learn member

Ok. +1 for merge if the tests pass now. Seems I have to upgrade my pep8....

@amueller
scikit-learn member

Just upgraded to pep8 1.3.3 and still don't see any errors....

@weilinear
scikit-learn member

@amueller It seems a line containing the # doctest: directive should begin with three dots when it starts on the next line. I have fixed the doctests by placing the # doctest: directive on the next line. Googling around I found this, which seems to be an issue with pep8 1.3.3: PyCQA/pycodestyle#126

@amueller
scikit-learn member

I'll merge this by rebase if no-one is opposed...

@amueller
scikit-learn member

Merged to master. Thanks @kuantkid :)

@amueller amueller closed this Oct 4, 2012
@weilinear
scikit-learn member

@amueller Thanks for reviewing this PR multiple times! Just out of curiosity, should I also update the whats_new file each time I send a PR? :)

@amueller
scikit-learn member

Well thanks for your contribution. Yes, usually it is a good idea to update the whatsnew. I guess I should add your fix that I just merged.

@weilinear
scikit-learn member

I was just thinking how complex it must be for the release manager to remember to update all those things :) Next time I will remember to update it, whether the change counts as "new" or not.
