Fix cit.chisq big memory bug #37
Merged
Updated functions:
`chisq_or_gsq_test` in `utils/cit.py`.

What is improved:
Why this update is needed:
Consider a chi-squared CI test over discrete variables `X` and `Y` given a conditioning set `S`. We need to count the joint probability table over each occurring configuration of `S`. In pull request #6, we parallelized this cell-counting process and gained a huge speedup. However, that implementation assumes that variable cardinalities are usually small (e.g. < 5) and that the size of `S` is usually small (e.g. < 5). Yet sometimes the conditioning set contains many variables, each with a large cardinality. Consider a case where `S` contains 7 variables, each of cardinality 20: then `cardS = np.prod(cardSXY[:-2])` would be 1,280,000,000, i.e., there are 1,280,000,000 different possible configurations of `S`, so the `SxyJointCounts` array would be of size `1280000000 * cardX * cardY * np.int64`, i.e., ~3.73 TB of memory (supposing `cardX` and `cardY` are also 20)!
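As a quick sanity check of those numbers, a minimal sketch (the variable names mirror the ones above; the snippet is illustrative, not code from `cit.py`):

```python
import numpy as np

# Cardinalities of [S..., X, Y]: 7 conditioning variables plus X and Y,
# all with cardinality 20 (the hypothetical worst case described above).
cardSXY = np.array([20] * 7 + [20, 20])
cardS = int(np.prod(cardSXY[:-2]))           # 20**7 = 1_280_000_000
cardX, cardY = int(cardSXY[-2]), int(cardSXY[-1])

# Size of a dense SxyJointCounts-style array of int64 counts.
naive_bytes = cardS * cardX * cardY * np.dtype(np.int64).itemsize
print(cardS)                                 # 1280000000
print(naive_bytes / 2**40)                   # ~3.73 (TiB)
```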
However, the sample size is usually on the 1k–100k scale, which is far less than `cardS`. Not all configurations of `S` appear in the data (actually only a very small portion do), i.e., `SMarginalCountsNonZero` is a very sparse array.
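To see just how sparse the occurring configurations are, a small illustration (names like `dataS` are hypothetical, not the ones in `cit.py`):

```python
import numpy as np

# 10k samples over the 7 conditioning variables of S, each with 20 levels.
rng = np.random.default_rng(0)
dataS = rng.integers(0, 20, size=(10_000, 7))

# Distinct configurations of S that actually occur, plus each sample's
# index into them: at most 10_000 out of 20**7 possibilities.
uniqueS, sIndex = np.unique(dataS, axis=0, return_inverse=True)
print(len(uniqueS))   # <= 10_000, a vanishingly small fraction of 20**7
```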
Hence, when `cardSXY` is large, we first re-index `S` (skipping the absent configurations) and then count the joint XY table for each occurring configuration. Specifically, two functions, `_Fill3DCountTable_by_bincount` and `_Fill3DCountTable_by_unique`, are used at different scales of `cardSXY`; a sketch of the two ideas follows below.
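A minimal sketch of the two counting strategies, assuming `sxy` is an `(n_samples, 3)` integer array holding each sample's (S-configuration index, x value, y value); the helper names and signatures are illustrative, not the actual implementations in `utils/cit.py`:

```python
import numpy as np

def fill_by_bincount(sxy, cardS, cardX, cardY):
    """Dense counting: flatten each (s, x, y) triple into one integer and
    count with np.bincount. Allocates cardS*cardX*cardY counters, so it is
    only viable when cardSXY is small."""
    flat = (sxy[:, 0] * cardX + sxy[:, 1]) * cardY + sxy[:, 2]
    return np.bincount(flat, minlength=cardS * cardX * cardY) \
             .reshape(cardS, cardX, cardY)

def fill_by_unique(sxy, cardX, cardY):
    """Sparse counting: re-index S to only the configurations that occur,
    then accumulate. Memory is (#occurring configs) * cardX * cardY
    instead of cardS * cardX * cardY."""
    uniqueS, sIndex = np.unique(sxy[:, 0], return_inverse=True)
    counts = np.zeros((len(uniqueS), cardX, cardY), dtype=np.int64)
    np.add.at(counts, (sIndex, sxy[:, 1], sxy[:, 2]), 1)
    return counts
```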
Testing:

Empirical threshold: how to choose between `np.bincount` and `np.unique`? Refer here:
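Since the referenced material is not reproduced here, a rough way to probe the crossover on one's own machine might look like the following (illustrative only; the PR's empirical threshold comes from the authors' own measurements):

```python
import timeit
import numpy as np

# Time counting over index arrays of fixed sample size but growing
# cardinality: bincount allocates O(card) memory, unique does not.
rng = np.random.default_rng(0)
n_samples = 10_000

for card in (10, 1_000, 100_000, 10_000_000):
    idx = rng.integers(0, card, size=n_samples)
    t_bincount = timeit.timeit(
        lambda: np.bincount(idx, minlength=card), number=20)
    t_unique = timeit.timeit(
        lambda: np.unique(idx, return_counts=True), number=20)
    print(f"card={card:>10}: bincount {t_bincount:.4f}s, unique {t_unique:.4f}s")
```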