
Support for "sparse data", i.e. omitted 0-counts in datasets. #64

Merged
enielse merged 8 commits into develop from feature-sparsedata on Apr 30, 2019

Conversation

enielse
Collaborator

@enielse commented Apr 30, 2019

Updates the logL and chi2 functions so that they properly handle the case where outcomes (and their probabilities) are omitted from the supplied data. Also adds a recordZeroCnts argument to data-set loading and creation functions such as load_dataset and generate_fake_data, to allow easy toggling of whether 0-counts are stored in created data sets. The default behavior is to include zero counts, i.e. recordZeroCnts=True.
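For concreteness, a minimal usage sketch of the new argument (hedged: `model`, `circuit_list` and `"example.dataset"` are placeholders, and other keyword arguments keep their usual defaults):

```python
import pygsti

# Dense data set: zero counts are stored explicitly (the default).
ds_dense = pygsti.construction.generate_fake_data(
    model, circuit_list, nSamples=100, sampleError="multinomial",
    seed=2019, recordZeroCnts=True)

# Sparse data set: outcomes with zero counts are omitted from storage.
ds_sparse = pygsti.construction.generate_fake_data(
    model, circuit_list, nSamples=100, sampleError="multinomial",
    seed=2019, recordZeroCnts=False)

# The same toggle is available when loading a text-format data set.
ds_loaded = pygsti.io.load_dataset("example.dataset", recordZeroCnts=False)
```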

Erik Nielsen added 7 commits April 26, 2019 23:19
A bug was found whereby the chi2 and logl objective functions were
not computed correctly when there were "blanks" in a DataSet being
analyzed, i.e. "sparse data".  This was because chi2 and logl terms
corresponding to blank outcomes were not computed (they were assumed
to be zero), which is incorrect: the chi2 term is still N(p-f)^2/p
and the logl term is N*f*log(p) - N*p ( != 0 when f == 0).  Only the
poissonPicture=False logl term ( N*f*log(p) ) is zero when f is.  We
*have* to use poissonPicture=True in least-squares optimizations
because it ensures the positive terms needed by the algorithm, so
just using poissonPicture=False isn't
really an option: we either need to store the zeros in the data set
(this is what hotfix-0.9.7.4 does) or somehow account for the blanks
in a data set.

This commit implements the latter - it reverts recordZeroCnt back to
`False` in DataSet.add_count_dict and adds a bunch of logic to the
core.py objective and jacobian functions so that blanks, or "omitted
data", are accounted for.  It does this in a way slightly different from
the way it would be done if we included all the zero counts: we lump
all the omitted probability for a circuit, i.e. the quantity
1.0 - sum_of_present_probabilities, together and treat it as though
it were a *single* probability term, adding its contribution to the
*first* present-outcome probability for the given circuit.  This is
the same as accounting separately for each zero-frequency outcome when
the individual probabilities are not "clipped": both the logl and chi2
terms are just `N*p` (up to sign) when f == 0 (linear!).
When there *is* clipping (p not in [minp,1-minp] for chi2 and p < radius
for logl), however, the penalty function is not linear and so there
will be a difference between the "sparse-data" and "non-sparse-data"
cases (this can happen more than you might think, b/c zero counts
often correspond to small probabilities).  This approach still seems
best, as there's nothing particularly "right" about either approach
(both "patch" the objective fn), and we don't want to separately
account for all the zero-frequency outcomes in the sparse-data
case: there could be a ton of them (e.g. a 20Q dataset with 100
clicks per sequence).  We can always make the
minProbClipForWeighting and radius parameters really small.
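To make the lumping concrete, here is a standalone numpy sketch of the unclipped per-circuit terms (illustrative only, not the actual core.py code):

```python
import numpy as np

def sparse_objective_terms(p, f, N):
    """Per-outcome chi2 and (Poisson-picture) logl terms for one circuit.
    `p` and `f` hold only the *present* outcomes' probabilities and
    frequencies; N is the circuit's total count.  No clipping is applied."""
    p, f = np.asarray(p, float), np.asarray(f, float)
    chi2_terms = N * (p - f)**2 / p          # N(p-f)^2/p
    logl_terms = N * f * np.log(p) - N * p   # N*f*log(p) - N*p
    p_omitted = 1.0 - p.sum()                # lumped omitted probability
    if p_omitted > 0:
        # treat the omitted probability as one zero-frequency outcome and
        # fold its (linear in p) contribution into the first present term
        chi2_terms[0] += N * p_omitted       # chi2 adds +N*p
        logl_terms[0] += -N * p_omitted      # logl adds -N*p
    return chi2_terms, logl_terms
```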

Things we still need to do surrounding this:
- we can remove the poissonPicture=False case for our objective
 functions - these are actually useless b/c they're incompatible
 with the least squares solver.
- the functions in chi2fns.py and likelihoodfns.py need to be
 updated as the ones in core.py have -- they're still incorrect
 for sparse data.
- it seems useful to have some "sparse" arguments to functions
 like generate_fake_data and load_dataset -- so we have access to
 the recordZeroCnts argument of add_count_dict.  Maybe even a way
 to convert a DataSet from sparse <-> dense representation?
- check the speed of the added core.py functions and optimize
 some of the omitted-probs code (maybe add some Cython fns,
 and at least short-circuit when there are no omitted probs).

Finally, this commit leaves a bunch of debugging code in core.py
- search for TODO REMOVE to get rid of this later.
Transfers calculation of omitted-probability contributions from
core.py functions to those in chi2fns.py and likelihoodfns.py so
functions like logl(...) now work with sparse data.  These haven't
been tested against, and should eventually be consolidated with, the
functions in core.py.  Additionally, this commit implements the
logl hessian computation which wasn't needed in core.py - this
needs to be tested for correctness.
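For reference, a small sketch of what the lumped omitted-probability term contributes to that hessian in the simple unclipped case (my own derivation with an assumed array name, not the likelihoodfns.py code):

```python
import numpy as np

def omitted_term_hessian(N, hprobs):
    """The lumped term adds -N*(1 - sum_j p_j) to the logl, so its second
    derivative w.r.t. model parameters (a, b) is +N * sum_j d2p_j/(da db).
    `hprobs` has shape (n_present_outcomes, n_params, n_params) and holds
    the second derivatives of the present-outcome probabilities."""
    return N * hprobs.sum(axis=0)
```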
A long time ago we moved to a least-squares solver and, by default,
the "poisson picture" of the log-likelihood.  We left in the option
to use the non-poisson-picture, but shouldn't have, as this doesn't
give the strictly positive terms needed by the leastsq solvers.

This commit removes (comments out and adds a NotImplementedError)
the poissonPicture=False case of do_mlgst.  We'll leave the
plumbing in for now, in case we later allow the use of a different
optimization algorithm.  The line where we would
destructively truncate negative terms in the non-Poisson-picture
case is marked with a comment to this effect.
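A standalone sketch of why the Poisson-picture terms are the ones a leastsq solver can consume (illustrative only; these are not the do_mlgst internals):

```python
import numpy as np
from scipy.special import xlogy  # xlogy(f, x) = f*log(x), defined as 0 when f == 0

def poisson_picture_delta_terms(p, f, N):
    # N*f*log(f/p) + N*(p - f) is >= 0 for every p in (0, 1], including
    # f == 0, so sqrt(term) gives well-defined least-squares residuals.
    return N * xlogy(f, f / p) + N * (p - f)

def non_poisson_delta_terms(p, f, N):
    # N*f*log(f/p) is negative whenever p > f > 0, so these terms would
    # have to be destructively truncated before they could be square-rooted.
    return N * xlogy(f, f / p)
```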
Plumbs DataSet.add_count_dict's "recordZeroCnts" argument through to
the dataset generating/filtering functions in datasetconstruction.py
and the dataset/multidataset loading functions
in loaders.py.  The default value is *always* set to True for now,
as this seems the safest option (and now it's easy to access the
alternate "sparse data" functionality).
Adds "if firsts is not None:" conditionals around computational
logic in core.py, chi2fns.py and likelihoodfns.py so that the extra
steps needed to account for omitted probabilities are not run when
there are no such probabilities.  This also helps to identify which
code is used for dealing with omitted probabilities.
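The pattern looks roughly like the following self-contained sketch (the names are illustrative, not the actual core.py variables):

```python
import numpy as np

def chi2_terms_with_omissions(probs, freqs, counts, firsts, omitted_probs):
    """probs/freqs/counts: flat per-outcome arrays over all present outcomes.
    firsts: indices of the first present outcome of each circuit that has
    omitted outcomes, or None when the data set is dense.
    omitted_probs: the corresponding lumped omitted probability per circuit."""
    terms = counts * (probs - freqs)**2 / probs
    if firsts is not None:  # skip the extra work entirely for dense data
        terms[firsts] += counts[firsts] * omitted_probs
    return terms
```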
This includes updating reference files used by testCalcMethods1Q.py,
as correctly accounting for omitted probabilities alters the
results slightly (~1e-3 in model "frobenius distance").
@enielse self-assigned this Apr 30, 2019
@enielse changed the base branch from master to develop April 30, 2019 16:36
@enielse merged commit 4064007 into develop Apr 30, 2019
@enielse deleted the feature-sparsedata branch April 30, 2019 19:30