Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof Ckmeans #124

Closed
tmcw opened this issue Aug 13, 2015 · 8 comments
Closed

Proof Ckmeans #124

tmcw opened this issue Aug 13, 2015 · 8 comments

Comments

@tmcw
Copy link
Member

tmcw commented Aug 13, 2015

@llimllib
Copy link
Contributor

Here's a much smaller test case:

$ tail -n3 test/ckmeans.test.js
    t.deepEqual(cK([0, 3, 4], 2), [[0], [3,4]]);
    t.end();
});

$ node test/ckmeans.test.js
TAP version 13
# C k-means
ok 1 exports fn
not ok 2 should be equivalent
  ---
    operator: deepEqual
    expected: [ [ 0 ], [ 3, 4 ] ]
    actual:   [ [ 0, 3 ], [ 4 ] ]
    at: Test.<anonymous> (/Users/llimllib/code/simple-statistics/test/ckmeans.test.js:34:7)
  ...

It's clear that [(0), (3,4)] has a much smaller cluster distance than [(0,3), (4)], so something is wrong with our algorithm.

To find that test case, I wrote some property-based tests with @DRMacIver's excellent hypothesis library. Although the testing code needs work[1], this seems to be a good counterexample that fails on both our algorithms in the same way.

[1]: right now if you run it and it finds a counterexample, the next time you run it it may yell at you about flaky tests, because of how I'm generating the data. Working on fixing that right now.

@llimllib
Copy link
Contributor

OK, I improved example generation and updated the gist. Another small example is: ckmeans([-1, 0, 0], 2) = [[-1, 0], [0]], which should be [[-1], [0,0]].

@llimllib
Copy link
Contributor

I also just noticed that the tests in the test_ckmeans.js file are indeed wrong, as the comment in there suggests.

ckmeans([1,2,2,3], 3) should be [[1], [2, 2], [3]], for example

@llimllib
Copy link
Contributor

I think we're failing to consider the first element of the array, because this code:

        for (var sortedIdx = Math.max(cluster, 1);
             sortedIdx < sorted.length;
             sortedIdx++) {

never considers sortedIdx 0. Does that seem plausible to you?

@llimllib
Copy link
Contributor

(edit: no, not plausible)

@DRMacIver
Copy link

It may be useful to note that you can ask hypothesis to give you a Random instance, which will remove any flakiness you get from your use of random.sample

@llimllib
Copy link
Contributor

I think I fixed it! After this commit hypothesis still finds some errors, but they appear to be related to MAX_INT and/or floating point checking.

The two errors fixed in that commit are:

  • the withinss was being calculated incorrectly; ((data_idx - 1) / data_idx) * squared_difference ignores the first data point (0/1) instead of multiplying it by 1/2
  • the mean was being calculated incorrectly. On the second data point, where data_idx is 1, first_cluster_mean was being set to the sum of the first two numbers instead of half that sum, and so on for each index. subtracting one from data_idx does the trick.

@tmcw
Copy link
Member Author

tmcw commented Mar 19, 2016

Okay: ckmeans is polished up! Running out a release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants