Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to faster Ckmeans implementation (Ckmeans 3.4.6) #163

Merged
merged 1 commit into from
Oct 20, 2016

Conversation

schnerd
Copy link
Member

@schnerd schnerd commented Oct 16, 2016

See #157 for more details.

Ran some perf tests and the results are impressive:
screen shot 2016-10-13 at 10 12 35 am

While the code is functionally correct, I'm not 100% satisfied with the clarity of certain comments and variable/function names in this PR (in particular ssq, sjlowi, etc). The code from the original C++ implementation is unfortunately a bit difficult to follow, and the authors have not yet published a detailed explanation of the new algorithm, so it's difficult to design better names without a deeper understanding.

I'm hoping to get a few more eyeballs on this code that can provide feedback or recommend changes if necessary.

} else {
sji = sumsOfSquares[i] - sums[i] * sums[i] / (i + 1);
}
return Math.max(0, sji);
Copy link
Member Author

@schnerd schnerd Oct 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had this as

if (sji < 0) {
    return 0;
}
return sji;

however the test coverage checker complained that the conditional return was not covered by any tests. I unfortunately was unable to craft a test case that hit this branch of the code, so I cheated by turning it into a Math.max. A deeper understanding of the algorithm would be required to figure out when this conditional is needed or if it's even needed at all.

Copy link
Contributor

@llimllib llimllib Oct 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a deeper understanding of the algorithm, but I do have an automated test case generator, so here's an example that hits this branch:

ckmeans([64.64249127327881, 64.64249127328245, 57.79216426169771], 2) == [[57.79216426169771], [64.64249127327881, 64.64249127328245]]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(If I had to guess I'd suspect it's to do with float math imprecisions?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@llimllib that's fantastic, can confirm that it works. will add it in with a comment.

var matrix = makeMatrix(nClusters, sorted.length),
// named 'B' originally
// named 'J' originally
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names were changed in the 3.4.6 implementation

}
}
}
fillMatrices(sorted, matrix, backtrackMatrix);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The body of fillMatrices could technically be inlined here. I kept it in a separate function since it's more true to the original implementation, but this can be easily changed if preferred.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether this method is inline or not I'm not picky about - invocation overhead here is pretty minimal and it is nice to have shorter method bodies. The thing that doesn't fit that well with simple-statistics or JS is that these methods - fillMatrices and fillMatrixColumn - don't return values but change their input by reference. But reviewing the code, I don't see a good way to change this without reducing performance, so I think we're good with the implementation as-is.

@mourner
Copy link
Contributor

mourner commented Oct 18, 2016

This looks wonderful, thanks for taking the time to port this!

@mourner
Copy link
Contributor

mourner commented Oct 18, 2016

Can you rebase this branch so that it merges cleanly on the latest master?

var sumsOfSquares = [];

// Initialize first column in matrix & backtrackMatrix
for (var i = 0; i < nValues; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we never reference unshifted values or the shift variable after this section, maybe it would simplify the code to either generate a shifted values array:

var shiftedData = data.map(function (value) {
  return value - shift;
});

Or in this for loop immediately doing a

var shiftedValue = data[i] - shift;

Since that'd save us the 6 times that the for loop body refers to data[i] - shift (including the data[0] cases)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, definitely makes the code less redundant and easier to follow. will change.

@schnerd
Copy link
Member Author

schnerd commented Oct 20, 2016

@mourner suggestions taken into account and branch rebased on master

@tmcw
Copy link
Member

tmcw commented Oct 20, 2016

Awesome, thanks @schnerd - merging!

@tmcw tmcw merged commit 5eb7722 into simple-statistics:master Oct 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants