-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding AVLTreeDigest function to set a specific random seed #183
Conversation
2cc6ae3
to
8dd0036
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will need more review than this one comment.
Overall, I can see the value of being able to nail down randomness, but I don't like the idea of arbitrary types being deserialized.
Have you created a corresponding issue for this pull request?
ByteArrayInputStream bais = new ByteArrayInputStream(bytes); | ||
try { | ||
ObjectInputStream ois = new ObjectInputStream(bais); | ||
Random r = (Random)ois.readObject(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the object here constrained? I don't like creating objects of unknown type based on user data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bytes
here isn't constrained - it was meant to cleanup the code in fromBytes ()
a bit, but in retrospect, you're right - it doesn't really make sense to expose this function to the rest of the class.
I'll make a note to move this into fromBytes()
if serializing the Random
object ends up being the path forward after some additional discussion
Is there a reason you are using AVLTreeDigest instead of the MergingDigest?
…On Wed, Dec 22, 2021 at 8:31 AM Cedric Hansen ***@***.***> wrote:
The random element in TDigest can cause some unpredictability in certain
use cases.
This commit adds a second constructor to AVLTreeDigest, which allows a
specific random seed to be used.
Tests have been added to verify that this option does not change the
behaviour of the standard AVLTreeDigest constructor
------------------------------
You can view, comment on, or merge this pull request online at:
#183
Commit Summary
- 1569dbe
<1569dbe>
Adding AVLTreeDigest option with a specific seed
File Changes
(8 files <https://github.com/tdunning/t-digest/pull/183/files>)
- *M* benchmark/src/main/java/com/tdunning/Benchmark.java
<https://github.com/tdunning/t-digest/pull/183/files#diff-bd3f3f286a70e5d4221519ea5e93fe6213fa59f1b423d3bb29d30419598b6a83>
(8)
- *M* benchmark/src/main/java/com/tdunning/TDigestBench.java
<https://github.com/tdunning/t-digest/pull/183/files#diff-1e9c46111118e072dc8e1f29e57beafd28935d8093895f6b8a596da3a8742a43>
(13)
- *M* core/src/main/java/com/tdunning/math/stats/AVLTreeDigest.java
<https://github.com/tdunning/t-digest/pull/183/files#diff-d4348809c2cd9b37c07f0997787deb9ea144de479aec02bb8ff83e5c47beef29>
(70)
- *M* core/src/main/java/com/tdunning/math/stats/TDigest.java
<https://github.com/tdunning/t-digest/pull/183/files#diff-c279eac9bfb3616cb9e46d8e5787ad0b014779f66c5901c0fd0ae2596f946121>
(33)
- *M*
core/src/test/java/com/tdunning/math/stats/AlternativeMergeTest.java
<https://github.com/tdunning/t-digest/pull/183/files#diff-2320765f1c49a23f2850c8efa83a6f6d6e8f4c803ed45c0f03b858d848f72698>
(9)
- *M*
core/src/test/java/com/tdunning/math/stats/TDigestSerializationTest.java
<https://github.com/tdunning/t-digest/pull/183/files#diff-1a84e2c9b248d6c88d5c25bc242fb8226102ab1c400ccbcd4131c9a79dcee4b7>
(22)
- *M* core/src/test/java/com/tdunning/math/stats/TDigestTest.java
<https://github.com/tdunning/t-digest/pull/183/files#diff-4621057ea1091de6ac57671a17129a8842f23ecde326296b0e101a6c05d9b589>
(2)
- *M* quality/src/test/java/com/tdunning/tdigest/quality/Util.java
<https://github.com/tdunning/t-digest/pull/183/files#diff-2c38c2449e7c11ad8ef5dc442dad5bf25ed6626329be280d004daeb64b6404b2>
(9)
Patch Links:
- https://github.com/tdunning/t-digest/pull/183.patch
- https://github.com/tdunning/t-digest/pull/183.diff
—
Reply to this email directly, view it on GitHub
<#183>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6VYFSIQ4SSEDNBV7MTUSH4OLANCNFSM5KS5V7YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Thanks for getting a first look on this PR so quickly @tdunning !
Fair enough. I'll try to think of some other solutions myself. For a bit of extra context, what my team and I are aiming to get, is a way to consistently produce the exact same data structure, given the same points, and potentially new data points. At the moment, when we recreate an AVLTreeDigest, the random element in the implementation results in a slightly different data structure, which is an undesirable result for us. Using a specific random object with a specific seed, allows us to recreate the exact same data structure when we create a new tree with old data, at which point, the random object is restored to the same state.
I have not. If you'd like me to outline in greater detail the rationale as to why my team and I are looking to add this functionality, then I'd be happy to create an issue.
My team and I ran some benchmarks with the various TDigest implementations, and chose AVLTreeDigest with a small compression factor (5), since it runs significantly faster than other implementations, estimates the median value within our acceptable margin of error, and takes up a fairly small memory footprint. Speed is arguably the most important performance metric for us, and AVLTreeDigest was the fastest according to our testing - is this consistent with your findings/knowledge? If there is a way to speed up other implementations without sacrificing the accuracy |
Yeah... an issue would be a good thing.
There is a pending bug in AVLTreeDigest you should be aware of. See issue
#169
The speed is interesting. Very much worth a discussion!
…On Wed, Dec 22, 2021 at 6:43 PM Cedric Hansen ***@***.***> wrote:
Thanks for getting a first look on this PR so quickly @tdunning
<https://github.com/tdunning> !
I will need more review than this one comment.
Overall, I can see the value of being able to nail down randomness, but I
don't like the idea of arbitrary types being deserialized.
Fair enough. I'll try to think of some other solutions myself. For a bit
of extra context, what my team and I are aiming to get, is a way to
consistently produce the exact same data structure, given the same points,
and potentially new data points. At the moment, when we recreate an
AVLTreeDigest, the random element in the implementation results in a
slightly different data structure, which is an undesirable result for us.
Using a specific random object with a specific seed, allows us to recreate
the exact same data structure when we create a new tree with old data, at
which point, the random object is restored to the same state.
Have you created a corresponding issue for this pull request?
I have not. If you'd like me to outline in greater detail the rationale as
to why my team and I are looking to add this functionality, then I'd be
happy to create an issue.
Is there a reason you are using AVLTreeDigest instead of the MergingDigest?
My team and I ran some benchmarks with the various TDigest
implementations, and chose AVLTreeDigest with a small compression factor
(5), since it runs significantly faster than other implementations,
estimates the median value within our acceptable margin of error, and takes
up a fairly small memory footprint. Speed is arguably the most important
performance metric for us, and AVLTreeDigest was the fastest according to
our testing - is this consistent with your findings/knowledge? If there is
a way to speed up other implementations without sacrificing the accuracy
tDigest.quantile(0.5), then we would certainly be open to testing that
out to see if it better suits our particular use case
—
Reply to this email directly, view it on GitHub
<#183 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6WQBOUSKRUR4RTRFI3USKEF5ANCNFSM5KS5V7YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks for sharing that issue, I ran into it a few times during my testing, good to know that's a transient failure I just re-ran some experiments comparing I'll talk to my team after the holidays to see if the accuracy of using |
For median estimation, you might try using the K_0 scaling function. That
will avoid wasting information on the tail accuracy.
Also, measure the actual space used. MergingDigest may be more aggressive
about staying within memory bounds which could mean that a higher value of
compression would be warranted (possibly up to 10 or so).
…On Thu, Dec 23, 2021 at 12:24 PM Cedric Hansen ***@***.***> wrote:
Yeah... an issue would be a good thing.
Sounds good, I'll start working on one - likely won't submit until after
the holidays.
Thanks for sharing that issue, I ran into it a few times during my
testing, good to know that's a transient failure
I just re-ran some experiments comparing AVLTreeDigest and MergingDigest,
and they were surprisingly roughly the same speed, however AVLTreeDigest
was significantly more accurate (~10x closer to the median than
MergingDigest). My original experiments were with TDigest3.1, where
MergingDigest was noticeably slower than AVLTreeDigest. In 3.1 the speed
seems to be about the same in both implementations.
I'll talk to my team after the holidays to see if the accuracy of using
MergingDigest(5) is within our margin of error. If so, then we might try
that approach instead of AVLTreeDigest with some of the work/ideas
outlined in this PR
—
Reply to this email directly, view it on GitHub
<#183 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6QU2PZAMZZY47ER74DUSOAOZANCNFSM5KS5V7YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
The random element in TDigest can cause some unpredictability in certain use cases. This commit adds a second constructor to `AVLTreeDigest`, which allows a specific random obj to be used. If this constructor is used, then the Random object will be persisted, such that the random number generation is consistent. Tests have been added to verify that this option does not change the behaviour of the standard `AVLTreeDigest` constructor
17a71f8
to
2b8891f
Compare
This reverts commit 2b8891f.
This work is related to #185 |
928372b
to
454b379
Compare
I've had a chance to re-think the overall approach to tackle issue #185 , and have made the appropriate code changes. If someone wants to change the seed once, they can just call This approach is far simpler, makes trees deterministic (if thats the desirable outcome), and doesn't change any serialization/deserialization (which maintains some backwards compatibility). A few open questions:
Happy to hear your thoughts on this simpler approach @tdunning |
I like the simpler approach. To your questions:
I concur. A setter is enough.
No strong opinion. I prefer simpler, but it is a two-liner so it isn't a big difference. |
Have you considered the option of having just the constructor parameter and not have the setter? That would assure some hypotetical machine verifier that the initialized object will necessarily give the expected result, since changing the seed after construction will be impossible. Not sure if this an important property, but, if the original RNG is a final attribute, maybe it is. |
Sounds good, works for me. I guess with that being said, I'll mark this PR as being ready for final review |
Cool!
…On Tue, Jan 4, 2022 at 11:50 AM Cedric Hansen ***@***.***> wrote:
I like the simpler approach.
To your questions:
Should another add() method be added, which supports a random seed as a
parameter? My gut feeling is no
I concur. A setter is enough.
Thoughts on a constructor taking in a seed, and then calling
setRandomSeed() from there?
No strong opinion. I prefer simpler, but it is a two-liner so it isn't a
big difference.
Sounds good, works for me. I guess with that being said, I'll mark this PR
as being ready for final review
—
Reply to this email directly, view it on GitHub
<#183 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6RQMHOT4JA3YHTEMC3UUNFQTANCNFSM5KS5V7YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
@pedrolamarao Good question, I'll defer that to @tdunning . As for a case where setting the seed each time might be useful, consider the following: |
@tdunning Is there a timeline for this PR to be included in a new release, perhaps |
There is no timeline.
Do you think that there is a viable way for me to delegate release
management for 3.4 to you?
…On Tue, Jan 4, 2022 at 12:04 PM Cedric Hansen ***@***.***> wrote:
@tdunning <https://github.com/tdunning> Is there a timeline for this PR
to be included in a new release, perhapsv3.4?
I see on the readme there are a handful of items listed for the v4.0
release, so not sure if that is the next anticipated release
—
Reply to this email directly, view it on GitHub
<#183 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6XMMNHHJNX6Y6CCYQ3UUNHGXANCNFSM5KS5V7YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I'll check with my company's open source guidelines to see if we have a standard procedure for releasing a third party library, and I'll get back to you. As a side note: I am just noticing I never squashed my commits in this PR.
|
IF you manage the release, it would need to be via github processes, not a
company process.
…On Wed, Jan 5, 2022 at 6:58 AM Cedric Hansen ***@***.***> wrote:
I'll check with my company's open source guidelines to see if we have a
standard procedure for releasing a third party library, and I'll get back
to you.
As a side note: I am just noticing I never squashed my commits in this PR.
caa728d and
addee2b commits can be removed entirely
(one commit simply reverts the other - this is essentially the content of
the *initial* PR where I was messing with serialization). I don't have
write access to main, so I can't remove them myself. Just thought I'd give
you the heads up (and apologize) that the commit history is a bit messy in
main right now.
—
Reply to this email directly, view it on GitHub
<#183 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6X3J7C44YLKKQOWIODUURMCPANCNFSM5KS5V7YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
I hate to do history rewrites on the main branch, so let's let this ride.
…On Wed, Jan 5, 2022 at 6:58 AM Cedric Hansen ***@***.***> wrote:
I'll check with my company's open source guidelines to see if we have a
standard procedure for releasing a third party library, and I'll get back
to you.
As a side note: I am just noticing I never squashed my commits in this PR.
caa728d and
addee2b commits can be removed entirely
(one commit simply reverts the other - this is essentially the content of
the *initial* PR where I was messing with serialization). I don't have
write access to main, so I can't remove them myself. Just thought I'd give
you the heads up (and apologize) that the commit history is a bit messy in
main right now.
—
Reply to this email directly, view it on GitHub
<#183 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6X3J7C44YLKKQOWIODUURMCPANCNFSM5KS5V7YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Sounds good. I have not hear anything back yet about my company's policy surrounding managing the release of an open source project that we don't "own". That being said, to err on the side of caution, my team can wait for whatever the next release will be. We've opted to use some (sigh) reflection as a temporary workaround. If I hear anything back and get the green light to manage the release, I'll reach back out in this issue! |
Cool.
…On Mon, Jan 10, 2022 at 7:35 AM Cedric Hansen ***@***.***> wrote:
IF you manage the release, it would need to be via github processes, not a
company process.
Sounds good. I have not hear anything back yet about my company's policy
surrounding managing the release of an open source project that we don't
"own". That being said, to err on the side of caution, my team can wait for
whatever the next release will be. We've opted to use some (sigh)
reflection as a temporary workaround.
If I hear anything back and get the green light to manage the release,
I'll reach back out in this issue!
—
Reply to this email directly, view it on GitHub
<#183 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6UISYPNVVNIDVVP7QLUVL4FFANCNFSM5KS5V7YA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
The random element in TDigest can cause some unpredictability in certain use cases.
This commit adds a second constructor to
AVLTreeDigest
, which allows a specific random seed to be used.Tests have been added to verify that this option does not change the behaviour of the standard
AVLTreeDigest
constructor