-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade to faster Ckmeans implementation (Ckmeans 3.4.6) #163
Conversation
} else { | ||
sji = sumsOfSquares[i] - sums[i] * sums[i] / (i + 1); | ||
} | ||
return Math.max(0, sji); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had this as
if (sji < 0) {
return 0;
}
return sji;
however the test coverage checker complained that the conditional return was not covered by any tests. I unfortunately was unable to craft a test case that hit this branch of the code, so I cheated by turning it into a Math.max. A deeper understanding of the algorithm would be required to figure out when this conditional is needed or if it's even needed at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a deeper understanding of the algorithm, but I do have an automated test case generator, so here's an example that hits this branch:
ckmeans([64.64249127327881, 64.64249127328245, 57.79216426169771], 2) == [[57.79216426169771], [64.64249127327881, 64.64249127328245]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(If I had to guess I'd suspect it's to do with float math imprecisions?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@llimllib that's fantastic, can confirm that it works. will add it in with a comment.
var matrix = makeMatrix(nClusters, sorted.length), | ||
// named 'B' originally | ||
// named 'J' originally |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These names were changed in the 3.4.6 implementation
} | ||
} | ||
} | ||
fillMatrices(sorted, matrix, backtrackMatrix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The body of fillMatrices
could technically be inlined here. I kept it in a separate function since it's more true to the original implementation, but this can be easily changed if preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether this method is inline or not I'm not picky about - invocation overhead here is pretty minimal and it is nice to have shorter method bodies. The thing that doesn't fit that well with simple-statistics or JS is that these methods - fillMatrices and fillMatrixColumn - don't return values but change their input by reference. But reviewing the code, I don't see a good way to change this without reducing performance, so I think we're good with the implementation as-is.
This looks wonderful, thanks for taking the time to port this! |
Can you rebase this branch so that it merges cleanly on the latest master? |
var sumsOfSquares = []; | ||
|
||
// Initialize first column in matrix & backtrackMatrix | ||
for (var i = 0; i < nValues; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we never reference unshifted values or the shift
variable after this section, maybe it would simplify the code to either generate a shifted values array:
var shiftedData = data.map(function (value) {
return value - shift;
});
Or in this for loop immediately doing a
var shiftedValue = data[i] - shift;
Since that'd save us the 6 times that the for loop body refers to data[i] - shift
(including the data[0]
cases)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, definitely makes the code less redundant and easier to follow. will change.
405a736
to
3263f7b
Compare
3263f7b
to
abca80d
Compare
@mourner suggestions taken into account and branch rebased on master |
Awesome, thanks @schnerd - merging! |
See #157 for more details.
Ran some perf tests and the results are impressive:
![screen shot 2016-10-13 at 10 12 35 am](https://cloud.githubusercontent.com/assets/875591/19414690/0c033a68-930a-11e6-95d7-218e6daf2bba.png)
While the code is functionally correct, I'm not 100% satisfied with the clarity of certain comments and variable/function names in this PR (in particular
ssq
,sjlowi
, etc). The code from the original C++ implementation is unfortunately a bit difficult to follow, and the authors have not yet published a detailed explanation of the new algorithm, so it's difficult to design better names without a deeper understanding.I'm hoping to get a few more eyeballs on this code that can provide feedback or recommend changes if necessary.