
Support for "sparse data", i.e. omitted 0-counts in datasets. #64

Merged
enielse merged 8 commits into develop from feature-sparsedata on Apr 30, 2019

Conversation

enielse
Collaborator

@enielse commented Apr 30, 2019

Updates the logL and chi2 functions so that they properly handle the case where outcomes (and their probabilities) are omitted from the supplied data. Also adds a recordZeroCnts argument to data-set loading and creation functions such as load_dataset and generate_fake_data, to allow easy toggling of whether 0-counts are stored in created data sets. The default behavior is to include zero counts, i.e. recordZeroCnts=True.
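For concreteness, a minimal usage sketch of the new argument (hedged: `model`, `circuit_list` and `"example.dataset"` are placeholders, and other keyword arguments keep their usual defaults):

```python
import pygsti

# Dense data set: zero counts are stored explicitly (the default).
ds_dense = pygsti.construction.generate_fake_data(
    model, circuit_list, nSamples=100, sampleError="multinomial",
    seed=2019, recordZeroCnts=True)

# Sparse data set: outcomes with zero counts are omitted from storage.
ds_sparse = pygsti.construction.generate_fake_data(
    model, circuit_list, nSamples=100, sampleError="multinomial",
    seed=2019, recordZeroCnts=False)

# The same toggle is available when loading a text-format data set.
ds_loaded = pygsti.io.load_dataset("example.dataset", recordZeroCnts=False)
```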

Erik Nielsen added 7 commits April 26, 2019 23:19
A bug was found whereby the chi2 and logl objective functions were
not computed correctly when there were "blanks" in a DataSet being
analyzed, i.e. "sparse data".  This was because chi2 and logl terms
corresponding to blank outcomes were not computed (they were assumed
to be zero), which is incorrect: the chi2 term is still N(p-f)^2/p
and the logl term is N*f*log(p) - N*p ( != 0 when f == 0).  Only the
poissonPicture=False logl term ( N*f*log(p) ) is zero when f is.  We
*have* to use poissonPicture=True in least-squares optimizations
because it ensures the positive terms needed by the algorithm, so
just using poissonPicture=False isn't
really an option: we either need to store the zeros in the data set
(this is what hotfix-0.9.7.4 does) or somehow account for the blanks
in a data set.

This commit implements the latter - it reverts recordZeroCnt back to
`False` in DataSet.add_count_dict and adds a bunch of logic to the
core.py objective and jacobian functions so that blanks, or "omitted
data", are accounted for.  It does this in a way slightly different from
the way it would be done if we included all the zero counts: we lump
all the omitted probability for a circuit, i.e. the quantity
1.0 - sum_of_present_probabilities, together and treat it as though
it were a *single* probability term, adding its contribution to the
*first* present-outcome probability for the given circuit.  This is
the same as accounting separately for each zero-frequency outcome when
the individual probabilities are not "clipped": both the logl and chi2
terms are just `N*p` (up to sign) when f == 0 (linear!).
When there *is* clipping (p not in [minp,1-minp] for chi2 and p < radius
for logl), however, the penalty function is not linear and so there
will be a difference between the "sparse-data" and "non-sparse-data"
cases (this can happen more than you might think, b/c zero counts
often correspond to small probabilities).  This approach still seems
best, as there's nothing particularly "right" about either approach
(both "patch" the objective fn), and we don't want to separately
account for all the zero-frequency outcomes in the sparse-data
case: there could be a ton of them (e.g. a 20Q dataset with 100
clicks per sequence).  We can always make the
minProbClipForWeighting and radius parameters really small.
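To make the lumping concrete, here is a standalone numpy sketch of the unclipped per-circuit terms (illustrative only, not the actual core.py code):

```python
import numpy as np

def sparse_objective_terms(p, f, N):
    """Per-outcome chi2 and (Poisson-picture) logl terms for one circuit.
    `p` and `f` hold only the *present* outcomes' probabilities and
    frequencies; N is the circuit's total count.  No clipping is applied."""
    p, f = np.asarray(p, float), np.asarray(f, float)
    chi2_terms = N * (p - f)**2 / p          # N(p-f)^2/p
    logl_terms = N * f * np.log(p) - N * p   # N*f*log(p) - N*p
    p_omitted = 1.0 - p.sum()                # lumped omitted probability
    if p_omitted > 0:
        # treat the omitted probability as one zero-frequency outcome and
        # fold its (linear in p) contribution into the first present term
        chi2_terms[0] += N * p_omitted       # chi2 adds +N*p
        logl_terms[0] += -N * p_omitted      # logl adds -N*p
    return chi2_terms, logl_terms
```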

Things we still need to do surrounding this:
- we can remove the poissonPicture=False case for our objective
 functions - these are actually useless b/c they're incompatible
 with the least squares solver.
- the functions in chi2fns.py and likelihoodfns.py need to be
 updated as the ones in core.py have -- they're still incorrect
 for sparse data.
- it seems useful to have some "sparse" arguments to functions
 like generate_fake_data and load_dataset -- so we have access to
 the recordZeroCnts argument of add_count_dict.  Maybe even a way
 to convert a DataSet from sparse <-> dense representation?
- check the speed of the added core.py functions and optimize
 some of the omitted-probs code (maybe add some Cython fns,
 and at least short-circuit when there are no omitted probs).

Finally, this commit leaves a bunch of debugging code in core.py
- search for TODO REMOVE to get rid of this later.
Transfers calculation of omitted-probability contributions from
core.py functions to those in chi2fns.py and likelihoodfns.py so
functions like logl(...) now work with sparse data.  These haven't
been tested against, and should eventually be consolidated with, the
functions in core.py.  Additionally, this commit implements the
logl hessian computation which wasn't needed in core.py - this
needs to be tested for correctness.
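For reference, a small sketch of what the lumped omitted-probability term contributes to that hessian in the simple unclipped case (my own derivation with an assumed array name, not the likelihoodfns.py code):

```python
import numpy as np

def omitted_term_hessian(N, hprobs):
    """The lumped term adds -N*(1 - sum_j p_j) to the logl, so its second
    derivative w.r.t. model parameters (a, b) is +N * sum_j d2p_j/(da db).
    `hprobs` has shape (n_present_outcomes, n_params, n_params) and holds
    the second derivatives of the present-outcome probabilities."""
    return N * hprobs.sum(axis=0)
```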
A long time ago we moved to a least-squares solver and, by default,
the "poisson picture" of the log-likelihood.  We left in the option
to use the non-poisson-picture, but shouldn't have, as this doesn't
give the strictly positive terms needed by the leastsq solvers.

This commit removes (comments out and adds a NotImplementedError)
the poissonPicture=False case of do_mlgst.  We'll leave the
plumbing in for now, in case we later allow the use of a different
optimization algorithm.  The line where we would
destructively truncate negative terms in the non-Poisson-picture
case is marked with a comment to this effect.
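A standalone sketch of why the Poisson-picture terms are the ones a leastsq solver can consume (illustrative only; these are not the do_mlgst internals):

```python
import numpy as np
from scipy.special import xlogy  # xlogy(f, x) = f*log(x), defined as 0 when f == 0

def poisson_picture_delta_terms(p, f, N):
    # N*f*log(f/p) + N*(p - f) is >= 0 for every p in (0, 1], including
    # f == 0, so sqrt(term) gives well-defined least-squares residuals.
    return N * xlogy(f, f / p) + N * (p - f)

def non_poisson_delta_terms(p, f, N):
    # N*f*log(f/p) is negative whenever p > f > 0, so these terms would
    # have to be destructively truncated before they could be square-rooted.
    return N * xlogy(f, f / p)
```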
Plumbs DataSet.add_count_dict's "recordZeroCnts" argument through to
the dataset generating/filtering functions in datasetconstruction.py
and the dataset/multidataset loading functions
in loaders.py.  The default value is *always* set to True for now,
as this seems the safest option (and now it's easy to access the
alternate "sparse data" functionality).
Adds "if firsts is not None:" conditionals around computational
logic in core.py, chi2fns.py and likelihoodfns.py so that the extra
steps needed to account for omitted probabilities are not run when
there are no such probabilities.  This also helps to identify which
code is used for dealing with omitted probabilities.
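The pattern looks roughly like the following self-contained sketch (the names are illustrative, not the actual core.py variables):

```python
import numpy as np

def chi2_terms_with_omissions(probs, freqs, counts, firsts, omitted_probs):
    """probs/freqs/counts: flat per-outcome arrays over all present outcomes.
    firsts: indices of the first present outcome of each circuit that has
    omitted outcomes, or None when the data set is dense.
    omitted_probs: the corresponding lumped omitted probability per circuit."""
    terms = counts * (probs - freqs)**2 / probs
    if firsts is not None:  # skip the extra work entirely for dense data
        terms[firsts] += counts[firsts] * omitted_probs
    return terms
```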
This includes updating reference files used by testCalcMethods1Q.py,
as correctly accounting for omitted probabilities alters the
results slightly (~1e-3 in model "frobenius distance").
@enielse self-assigned this Apr 30, 2019
@enielse changed the base branch from master to develop April 30, 2019 16:36
@enielse merged commit 4064007 into develop Apr 30, 2019
@enielse deleted the feature-sparsedata branch April 30, 2019 19:30