Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
satellite/audit: fix reservoir sampling bias
While researching logs from a large set of audits, I noticed that nearly all of them had streamIDs starting with 0 or 1. This seemed very odd, because streamIDs are supposed to be pretty much entirely random, and every hex digit from 0-f should have been represented with roughly equal frequency. It turned out that our A-Chao implementation of reservoir sampling is flawed. As far as we can tell, so is the Wikipedia implementation. No one has yet reviewed the original 1982 paper by Dr. Chao in enough detail to know where the error originated, but we do know that we have been auditing segments near the beginning of the segment loop (low streamIDs) far more often than segments near the end of the segment loop (high streamIDs). This change uses an algorithm Wikipedia calls "A-Res" instead, and adds a test to check for that sort of bias creeping back in somehow. A-Res will be slightly slower than A-Chao, because of a few extra steps that need to be done, but it does appear to be selecting items uniformly. Change-Id: I45eba4c522bafc729cebe2aab6f3fe65cd6336be
- Loading branch information
Showing
2 changed files
with
78 additions
and
15 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters