Upgrade to faster Ckmeans implementation (Ckmeans 3.4.6) #163

schnerd · 2016-10-16T02:06:44Z

See #157 for more details.

Ran some perf tests and the results are impressive:

While the code is functionally correct, I'm not 100% satisfied with the clarity of certain comments and variable/function names in this PR (in particular ssq, sjlowi, etc). The code from the original C++ implementation is unfortunately a bit difficult to follow, and the authors have not yet published a detailed explanation of the new algorithm, so it's difficult to design better names without a deeper understanding.

I'm hoping to get a few more eyeballs on this code that can provide feedback or recommend changes if necessary.

schnerd · 2016-10-16T02:11:28Z

src/ckmeans.js

+    } else {
+        sji = sumsOfSquares[i] - sums[i] * sums[i] / (i + 1);
+    }
+    return Math.max(0, sji);


I originally had this as

if (sji < 0) { return 0; } return sji;

however the test coverage checker complained that the conditional return was not covered by any tests. I unfortunately was unable to craft a test case that hit this branch of the code, so I cheated by turning it into a Math.max. A deeper understanding of the algorithm would be required to figure out when this conditional is needed or if it's even needed at all.

I don't have a deeper understanding of the algorithm, but I do have an automated test case generator, so here's an example that hits this branch:

ckmeans([64.64249127327881, 64.64249127328245, 57.79216426169771], 2) == [[57.79216426169771], [64.64249127327881, 64.64249127328245]]

(If I had to guess I'd suspect it's to do with float math imprecisions?)

@llimllib that's fantastic, can confirm that it works. will add it in with a comment.

schnerd · 2016-10-16T02:12:38Z

src/ckmeans.js

    var matrix = makeMatrix(nClusters, sorted.length),
-        // named 'B' originally
+        // named 'J' originally


These names were changed in the 3.4.6 implementation

schnerd · 2016-10-16T02:15:23Z

src/ckmeans.js

-            }
-        }
-    }
+    fillMatrices(sorted, matrix, backtrackMatrix);


The body of fillMatrices could technically be inlined here. I kept it in a separate function since it's more true to the original implementation, but this can be easily changed if preferred.

Whether this method is inline or not I'm not picky about - invocation overhead here is pretty minimal and it is nice to have shorter method bodies. The thing that doesn't fit that well with simple-statistics or JS is that these methods - fillMatrices and fillMatrixColumn - don't return values but change their input by reference. But reviewing the code, I don't see a good way to change this without reducing performance, so I think we're good with the implementation as-is.

mourner · 2016-10-18T16:28:34Z

This looks wonderful, thanks for taking the time to port this!

mourner · 2016-10-18T16:29:07Z

Can you rebase this branch so that it merges cleanly on the latest master?

tmcw · 2016-10-18T15:25:10Z

src/ckmeans.js

+    var sumsOfSquares = [];
+
+    // Initialize first column in matrix & backtrackMatrix
+    for (var i = 0; i < nValues; ++i) {


Since we never reference unshifted values or the shift variable after this section, maybe it would simplify the code to either generate a shifted values array:

var shiftedData = data.map(function (value) { return value - shift; });

Or in this for loop immediately doing a

var shiftedValue = data[i] - shift;

Since that'd save us the 6 times that the for loop body refers to data[i] - shift (including the data[0] cases)

Good point, definitely makes the code less redundant and easier to follow. will change.

schnerd · 2016-10-20T06:13:04Z

@mourner suggestions taken into account and branch rebased on master

tmcw · 2016-10-20T15:02:13Z

Awesome, thanks @schnerd - merging!

schnerd commented Oct 16, 2016

View reviewed changes

schnerd mentioned this pull request Oct 16, 2016

Consider switching to faster Ckmeans implementation #157

Closed

mourner approved these changes Oct 18, 2016

View reviewed changes

tmcw reviewed Oct 18, 2016

View reviewed changes

schnerd force-pushed the ckmeans-upgrade branch from 405a736 to 3263f7b Compare October 20, 2016 05:52

Upgrade ckmeans algorithm to 3.4.6 for speed improvements

abca80d

schnerd force-pushed the ckmeans-upgrade branch from 3263f7b to abca80d Compare October 20, 2016 06:09

tmcw merged commit 5eb7722 into simple-statistics:master Oct 20, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade to faster Ckmeans implementation (Ckmeans 3.4.6) #163

Upgrade to faster Ckmeans implementation (Ckmeans 3.4.6) #163

schnerd commented Oct 16, 2016

schnerd Oct 16, 2016 •

edited

Loading

llimllib Oct 19, 2016 •

edited

Loading

llimllib Oct 19, 2016

schnerd Oct 20, 2016

schnerd Oct 16, 2016

schnerd Oct 16, 2016

tmcw Oct 18, 2016

mourner commented Oct 18, 2016

mourner commented Oct 18, 2016

tmcw Oct 18, 2016

schnerd Oct 20, 2016

schnerd commented Oct 20, 2016

tmcw commented Oct 20, 2016

Upgrade to faster Ckmeans implementation (Ckmeans 3.4.6) #163

Upgrade to faster Ckmeans implementation (Ckmeans 3.4.6) #163

Conversation

schnerd commented Oct 16, 2016

schnerd Oct 16, 2016 • edited Loading

Choose a reason for hiding this comment

llimllib Oct 19, 2016 • edited Loading

Choose a reason for hiding this comment

llimllib Oct 19, 2016

Choose a reason for hiding this comment

schnerd Oct 20, 2016

Choose a reason for hiding this comment

schnerd Oct 16, 2016

Choose a reason for hiding this comment

schnerd Oct 16, 2016

Choose a reason for hiding this comment

tmcw Oct 18, 2016

Choose a reason for hiding this comment

mourner commented Oct 18, 2016

mourner commented Oct 18, 2016

tmcw Oct 18, 2016

Choose a reason for hiding this comment

schnerd Oct 20, 2016

Choose a reason for hiding this comment

schnerd commented Oct 20, 2016

tmcw commented Oct 20, 2016

schnerd Oct 16, 2016 •

edited

Loading

llimllib Oct 19, 2016 •

edited

Loading