Support for "sparse data", i.e. omitted 0-counts in datasets. #64
Merged
Conversation
A bug was found whereby the chi2 and logl objective functions were not computed correctly when there were "blanks" in a DataSet being analyzed, i.e. "sparse data". Chi2 and logl terms corresponding to blank outcomes were not computed (assumed to be zero), which is incorrect: the chi2 term is still N(p-f)^2/p and the logl term is N*f*log(p) - N*p ( != 0 when f == 0). Only the poissonPicture=False logl term ( N*f*log(p) ) is zero when f is. We *have* to use poissonPicture=True in least-squares optimizations because it ensures the positive terms the algorithm needs, so just using poissonPicture=False isn't really an option: we either need to store the zeros in the data set (this is what hotfix-0.9.7.4 does) or somehow account for the blanks. This commit implements the latter - it reverts recordZeroCnt back to `False` in DataSet.add_count_dict and adds logic to the core.py objective and jacobian functions so that blanks, or "omitted data", are accounted for.

It does this in a way slightly different from the way it would be done if we included all the zero counts: we lump all the omitted probability for a circuit, i.e. the quantity 1.0 - sum_of_present_probabilities, together and treat it as a *single* probability term, adding its contribution to the *first* present-outcome probability for that circuit. This is equivalent to accounting for each zero-frequency outcome separately when none of the corresponding probabilities are "clipped": both the logl term and the chi2 term are just `N*p` when f == 0 (linear!). When there *is* clipping (p not in [minp, 1-minp] for chi2, or p < radius for logl), however, the penalty function is not linear, so there will be a difference between the "sparse-data" and "non-sparse-data" cases (this can happen more often than you might think, because zero counts often correspond to small probabilities).
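The lumping described above can be illustrated with a minimal sketch. This is not pyGSTi's actual core.py code; the function name and signature are hypothetical, but the arithmetic follows the formulas in the commit message: each present outcome contributes N*(p-f)^2/p, and the total omitted probability contributes a single linear N*p_omit penalty folded into the first present term.

```python
import numpy as np

def chi2_terms_with_omitted(counts, probs, min_p=1e-4):
    """Per-outcome chi2 terms N*(p-f)^2/p for the outcomes present in a
    sparse data set, with the total omitted probability lumped into the
    first present outcome's term (hypothetical helper, not pyGSTi's API)."""
    N = counts.sum()
    f = counts / N
    p = np.clip(probs, min_p, 1 - min_p)  # clip as with minProbClipForWeighting
    terms = N * (p - f) ** 2 / p
    # Lump 1 - sum(present probs) together: a zero-frequency outcome with
    # probability p_omit contributes N*(p_omit - 0)^2/p_omit = N*p_omit,
    # so add that single linear penalty to the first present term.
    p_omit = max(0.0, 1.0 - probs.sum())
    terms[0] += N * p_omit
    return terms
```

With counts (80, 20) and present probabilities (0.7, 0.2), the omitted probability 0.1 adds a linear N*p_omit = 10 penalty to the first term.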
This approach still seems best, as there's nothing particularly "right" about either approach (both "patch" the objective fn), and we don't want to separately account for all the zero-frequency outcomes in the sparse-data case: there could be a ton of them (e.g. a 20Q dataset with 100 clicks per sequence). We can always make the minProbClipForWeighting and radius parameters really small.

Things we still need to do surrounding this:
- remove the poissonPicture=False case for our objective functions - these are actually useless b/c they're incompatible with the least-squares solver.
- update the functions in chi2fns.py and likelihoodfns.py as the ones in core.py have been - they're still incorrect for sparse data.
- add some "sparse" arguments to functions like generate_fake_data and load_dataset, so we have access to the recordZeroCnts argument of add_count_dict. Maybe even a way to convert a DataSet from sparse <-> dense representation?
- check the speed of the added core.py functions and optimize some of the omitted-probs code (maybe add some Cython fns, and at least short-circuit when there are no omitted probs).

Finally, this commit leaves a bunch of debugging code in core.py - search for TODO REMOVE to get rid of this later.
Transfers calculation of omitted-probability contributions from the core.py functions to those in chi2fns.py and likelihoodfns.py, so functions like logl(...) now work with sparse data. These haven't been tested against the core.py versions and should eventually be consolidated with them. Additionally, this commit implements the logl Hessian computation, which wasn't needed in core.py - this needs to be tested for correctness.
A long time ago we moved to a least-squares solver and, by default, the "poisson picture" of the log-likelihood. We left in the option to use the non-poisson-picture, but shouldn't have, as this doesn't give the strictly positive terms needed by the leastsq solvers. This commit removes (comments out and adds a NotImplementedError) the poissonPicture=False case of do_mlgst. We'll leave the plumbing in for now, in case in the future we allow for use of a different optimization algorithm. The line where we would destructively truncate negative terms in the non-Poisson-picture case is marked with a comment to this effect.
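The positivity issue can be made concrete with a small sketch of the two per-outcome deltaLogL terms. The function names are illustrative, not pyGSTi's API; the formulas follow from logl = N*f*log(p) - N*p (Poisson picture) and logl = N*f*log(p) (non-Poisson), each subtracted from its maximum over p.

```python
import numpy as np

def poisson_logl_term(N, f, p):
    # deltaLogL in the Poisson picture: 2*(N*p - N*f + N*f*log(f/p)).
    # Nonnegative for all p in (0, 1], so taking sqrt() of each term
    # to build a least-squares residual vector is always safe.
    return 2 * (N * p - N * f + (N * f * np.log(f / p) if f > 0 else 0.0))

def nonpoisson_logl_term(N, f, p):
    # Non-Poisson picture: 2*N*f*log(f/p). Negative whenever p > f,
    # which is exactly what breaks the least-squares solvers.
    return 2 * N * f * np.log(f / p) if f > 0 else 0.0
```

Note also that the Poisson-picture term reduces to 2*N*p when f == 0, matching the zero-frequency penalty discussed above.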
Plumbs a connection to DataSet.add_count_dict's "recordZeroCnts" argument to the dataset generating/filtering functions in datasetconstruction.py and the dataset/multidataset loading functions in loaders.py. The default value is *always* set to True for now, as this seems the safest option (and now it's easy to access the alternate "sparse data" functionality).
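The effect of recordZeroCnts=True amounts to storing explicit 0-count entries for outcomes absent from a sparse count dictionary. A tiny illustrative helper (not part of pyGSTi) makes the dense vs. sparse distinction concrete:

```python
def densify_counts(count_dict, all_outcomes):
    """Add explicit 0-count entries for every outcome missing from a
    sparse count dictionary -- roughly what recordZeroCnts=True amounts
    to when counts are stored (illustrative helper, not pyGSTi's API)."""
    return {ol: count_dict.get(ol, 0) for ol in all_outcomes}
```

For example, a sparse dict {'00': 10} over outcomes '00', '01', '10', '11' densifies to three extra zero entries; with recordZeroCnts=False those zeros are simply never stored.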
Adds "if firsts is not None:" conditionals around computational logic in core.py, chi2fns.py and likelihoodfns.py so that the extra steps needed to account for omitted probabilities are not run when there are no such probabilities. This also helps to identify which code is used for dealing with omitted probabilities.
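The short-circuit pattern can be sketched as follows. This is a hypothetical reconstruction, not the actual core.py code: "firsts" holds, per circuit with omitted outcomes, the element index of its first present outcome (where the lumped contribution is added), and it is None when every circuit's probabilities already sum to one.

```python
import numpy as np

def first_indices_if_omitted(elem_slices, probs, tol=1e-12):
    """For each circuit whose present-outcome probabilities sum to less
    than 1, record the index of its first element (where the lumped
    omitted-probability contribution gets added) and the circuit index.
    Returns None when nothing is omitted, so callers can guard the extra
    logic with 'if firsts is not None:' (sketch, not pyGSTi's code)."""
    firsts, indices = [], []
    for i, sl in enumerate(elem_slices):
        if probs[sl].sum() < 1.0 - tol:  # some probability is omitted
            firsts.append(sl.start)
            indices.append(i)
    return (np.array(firsts), np.array(indices)) if firsts else None
```

The None return value is what lets the dense-data path skip the omitted-probability bookkeeping entirely.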
This includes updating reference files used by testCalcMethods1Q.py, as correctly accounting for omitted probabilities alters the results slightly (~1e-3 in model "frobenius distance").
Updates the logL and chi2 functions so that they properly handle the case when probabilities are omitted from the supplied data. Also adds `recordZeroCnts` arguments to data set loading and creation functions such as `load_dataset` and `generate_fake_data` to allow easy toggling of whether 0-counts are stored in created data sets. Default behavior is to include zero counts, i.e. `recordZeroCnts=True`.