Proof Ckmeans #124

tmcw · 2015-08-13T13:32:04Z

Get an instrumented version of the R library working
Add this result as a test case (see Begin implementation of Ckmeans 1 dimensional clustering #118 (comment)) and figure out what's going on with it

llimllib · 2015-08-15T02:22:33Z

Here's a much smaller test case:

$ tail -n3 test/ckmeans.test.js
    t.deepEqual(cK([0, 3, 4], 2), [[0], [3,4]]);
    t.end();
});

$ node test/ckmeans.test.js
TAP version 13
# C k-means
ok 1 exports fn
not ok 2 should be equivalent
  ---
    operator: deepEqual
    expected: [ [ 0 ], [ 3, 4 ] ]
    actual:   [ [ 0, 3 ], [ 4 ] ]
    at: Test.<anonymous> (/Users/llimllib/code/simple-statistics/test/ckmeans.test.js:34:7)
  ...

It's clear that [(0), (3,4)] has a much smaller cluster distance than [(0,3), (4)], so something is wrong with our algorithm.
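To put numbers on that comparison, here is a minimal sketch that computes the total within-cluster sum of squares for both clusterings. The helper names `sumOfSquares` and `totalWithinss` are made up for this illustration and are not part of simple-statistics.

```javascript
// Sum of squared deviations from the mean for a single cluster.
function sumOfSquares(cluster) {
    var mean = cluster.reduce(function (a, b) { return a + b; }, 0) / cluster.length;
    return cluster.reduce(function (acc, x) {
        return acc + (x - mean) * (x - mean);
    }, 0);
}

// Total cost of a clustering: the sum of each cluster's sum of squares.
function totalWithinss(clusters) {
    return clusters.reduce(function (acc, c) {
        return acc + sumOfSquares(c);
    }, 0);
}

console.log(totalWithinss([[0], [3, 4]])); // 0.5
console.log(totalWithinss([[0, 3], [4]])); // 4.5
```

The expected clustering costs 0.5 and the one we actually return costs 4.5, so the returned result cannot be optimal.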

To find that test case, I wrote some property-based tests with @DRMacIver's excellent hypothesis library. Although the testing code needs work[1], this seems to be a good counterexample that fails on both our algorithms in the same way.

[1]: right now if you run it and it finds a counterexample, the next time you run it it may yell at you about flaky tests, because of how I'm generating the data. Working on fixing that right now.

llimllib · 2015-08-15T03:23:58Z

OK, I improved example generation and updated the gist. Another small example is: ckmeans([-1, 0, 0], 2) = [[-1, 0], [0]], which should be [[-1], [0,0]].

llimllib · 2015-08-15T03:44:28Z

I also just noticed that the tests in the test_ckmeans.js file are indeed wrong, as the comment in there suggests.

ckmeans([1,2,2,3], 3) should be [[1], [2, 2], [3]], for example
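For small inputs like these, the expected answers can be checked against a brute-force oracle. The sketch below is not the code from the gist and is not how ckmeans works internally; it is a hypothetical reference (`bruteForceClusters` is a name invented here) that tries every way of cutting the sorted data into k contiguous, non-empty groups and keeps the cheapest. It assumes 1 <= k <= data.length.

```javascript
// Sum of squared deviations from the mean for a single cluster.
function sumOfSquares(cluster) {
    var mean = cluster.reduce(function (a, b) { return a + b; }, 0) / cluster.length;
    return cluster.reduce(function (acc, x) {
        return acc + (x - mean) * (x - mean);
    }, 0);
}

// Exhaustively search all contiguous k-partitions of the sorted data.
// Exponential in the worst case, so only suitable as a test oracle on tiny inputs.
function bruteForceClusters(data, k) {
    var sorted = data.slice().sort(function (a, b) { return a - b; });
    var best = null;
    var bestCost = Infinity;

    function recurse(start, remaining, acc, cost) {
        if (remaining === 1) {
            // The last cluster takes everything that is left.
            var last = sorted.slice(start);
            var total = cost + sumOfSquares(last);
            if (total < bestCost) {
                bestCost = total;
                best = acc.concat([last]);
            }
            return;
        }
        // Leave at least one element for each cluster still to be formed.
        for (var end = start + 1; end <= sorted.length - (remaining - 1); end++) {
            var head = sorted.slice(start, end);
            recurse(end, remaining - 1, acc.concat([head]), cost + sumOfSquares(head));
        }
    }

    recurse(0, k, [], 0);
    return best;
}

console.log(JSON.stringify(bruteForceClusters([1, 2, 2, 3], 3))); // [[1],[2,2],[3]]
console.log(JSON.stringify(bruteForceClusters([-1, 0, 0], 2)));   // [[-1],[0,0]]
console.log(JSON.stringify(bruteForceClusters([0, 3, 4], 2)));    // [[0],[3,4]]
```

Restricting the search to contiguous runs of the sorted array is enough in one dimension, because an optimal clustering never interleaves points from two clusters.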

llimllib · 2015-08-15T20:38:53Z

I think we're failing to consider the first element of the array, because this code:

        for (var sortedIdx = Math.max(cluster, 1);
             sortedIdx < sorted.length;
             sortedIdx++) {

never considers sortedIdx 0. Does that seem plausible to you?

llimllib · 2015-08-15T21:06:55Z

(edit: no, not plausible)

DRMacIver · 2015-08-15T21:14:59Z

It may be useful to note that you can ask hypothesis to give you a Random instance, which will remove any flakiness you get from your use of random.sample

llimllib · 2015-08-16T18:09:19Z

I think I fixed it! After this commit hypothesis still finds some errors, but they appear to be related to MAX_INT and/or floating point checking.

The two errors fixed in that commit are:

The withinss was being calculated incorrectly: ((data_idx - 1) / data_idx) * squared_difference ignores the first data point (the factor becomes 0/1) instead of multiplying it by 1/2.
The mean was being calculated incorrectly. On the second data point, where data_idx is 1, first_cluster_mean was being set to the sum of the first two numbers instead of half that sum, and so on for each index. Subtracting one from data_idx does the trick.
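For reference, here is a minimal sketch of the incremental update that both fixes are aiming at, written with a 0-based index `i`. This is not the code from the commit; `runningStats` and its variable names are hypothetical. When the point at index `i` arrives, `i` points have already been seen, so the squared difference from the old mean is weighted by i / (i + 1) and the new mean divides by i + 1.

```javascript
// Running mean and within-cluster sum of squares over one cluster,
// updated one point at a time (0-based index i).
function runningStats(points) {
    var mean = 0;
    var withinss = 0;
    for (var i = 0; i < points.length; i++) {
        var diff = points[i] - mean;
        // i points were seen before this one, so the weight is i / (i + 1):
        // 0 for the first point, 1/2 for the second, 2/3 for the third, ...
        withinss += (i / (i + 1)) * diff * diff;
        // Divide by the number of points seen so far, including this one.
        mean = (i * mean + points[i]) / (i + 1);
    }
    return { mean: mean, withinss: withinss };
}

console.log(runningStats([0, 3])); // { mean: 1.5, withinss: 4.5 }
```

With the weight off by one, the second point contributes 0 instead of half its squared difference, which matches the first bug described above.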

tmcw · 2016-03-19T15:32:31Z

Okay: ckmeans is polished up! Running out a release.

tmcw mentioned this issue Aug 13, 2015

Begin implementation of Ckmeans 1 dimensional clustering #118

Merged

3 tasks

llimllib mentioned this issue Aug 16, 2015

Fix ckmeans #125

Merged

stevage mentioned this issue Nov 15, 2015

New NPM release needed #129

Closed

tmcw closed this as completed Mar 19, 2016

schnerd mentioned this issue Sep 22, 2016

Consider switching to faster Ckmeans implementation #157

Closed
