Some inconsistency in zscore when all values are the same:
```julia
@show zscore(zeros(5) .+ log(1e-5))[1]   # NaN
@show zscore(zeros(8) .+ log(1e-5))[1]   # -0.9354143466934853
@show zscore(zeros(10) .+ log(1e-5))[1]  # -0.9486832980505138
```

I think 0 would have been a better answer in all of those cases, but NaN is certainly better than -0.935.
andreasnoack commented on Jul 26, 2016
I don't really see how we could do better here. Scaling a degenerate distribution to unit variance is not really possible. In numerical terms that just means you get maximum errors in the result.
ararslan commented on Jul 26, 2016
@andreasnoack We could try something like `allunique(x) ? Inf : zscore!(...)`, where `Inf` could also be `NaN` or whatever. Dunno if that's the best course of action, just an idea.

maximsch2 commented on Jul 26, 2016
Even non-unique items can lead to issues. I've run into the same issue before (on the same dataset :() in scikit-learn, which led to scikit-learn/scikit-learn#4436 and scikit-learn/scikit-learn#3725.

I think scikit-learn tries to fix the std from 0 to 1, and also does a bit more to ensure stability.
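For reference, a minimal Julia sketch of that idea (an assumption about what the std=0 → 1 replacement looks like; this is not what StatsBase does):

```julia
using Statistics

# Sketch of the scikit-learn-style behavior: treat a zero standard deviation
# as 1 so that a constant column standardizes to all zeros instead of NaN.
# (Only helps when σ is exactly zero; the rounding issue discussed below still applies.)
function standardize_sklearn_style(x)
    μ = mean(x)
    σ = std(x)
    σ = iszero(σ) ? one(σ) : σ   # replace std = 0 with 1
    return (x .- μ) ./ σ
end
```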
andreasnoack commented on Jul 26, 2016
That's a pretty significant discontinuity to introduce at zero, and the result definitely doesn't have unit variance. @maximsch2 can you explain what kind of computations you make after the `zscore` that would be erroneous for the present version of `zscore`?

The `allunique` check should be very cheap, so that might be a good way to capture a constant vector. It seems to me that the constant case is the most likely problematic case.

simonbyrne commented on Jul 27, 2016
We've hit this issue with `mean` and `var` before (I think there's an issue somewhere, though I can't find it now). One solution would be to compute the mean using a superaccumulator algorithm.

maximsch2 commented on Jul 27, 2016
@andreasnoack I'm processing microarray gene expression data, and the values for one probe happen to be all equal across samples. Standardizing the data (zero mean, variance one) is a standard pre-processing step before linear modeling and, from that point of view, replacing a vector that has all values the same with zeros is OK. I guess you can think of it as just skipping the unit-variance transform when the variance is exactly zero.

Regardless of your opinion on magically converting std=0 to std=1 (which I agree is potentially troublesome in edge cases), I think zscore should return the same output for all of the arrays in my original post. Maybe NaN is the correct value (although I would've preferred 0 for practical purposes), but -0.93 certainly doesn't make much sense.
ararslan commented on Jul 27, 2016
From a purely mathematical standpoint, a z-score of 0 wouldn't make sense for a constant vector since you're dividing by a standard deviation of 0.
Btw, for what it's worth, R gives `NaN` in all the cases listed here, both when calculating the z-scores manually and when using `scale`.

simonbyrne commented on Jul 27, 2016
I think NaN is the most appropriate result.
ararslan commented on Jul 27, 2016
In terms of implementation, would it make more sense to catch a constant vector using `allunique` or something like `isapprox(std(x), 0)`?

simonbyrne commented on Jul 27, 2016
I don't think we want `allunique` (as `allunique([1,1,2]) == false`), but we could just check if all entries are the same as the first.

Computing the mean via a superaccumulator should also fix the problem (as this will give `std(x) == 0`).
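A minimal sketch of such a check (hypothetical wrapper, not part of StatsBase; returning NaNs for the constant case as discussed above):

```julia
using StatsBase

# Hypothetical guard around zscore: if every entry equals the first one,
# return a NaN-filled vector instead of relying on the computed mean/std.
function zscore_guarded(x::AbstractVector)
    all(==(first(x)), x) && return fill(NaN, length(x))
    return zscore(x)
end
```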
ararslan commented on Jul 27, 2016
Seems like adjusting the `mean` computation would be a good candidate for an enhancement in Base rather than having a competing `mean` implementation here.

andreasnoack commented on Jul 27, 2016
@simonbyrne Wouldn't a fancy mean be much slower? Checking that all the values in a vector are identical (which indeed is not what `allunique` does) should be really cheap.

ararslan commented on Jul 27, 2016
Regarding speed, I was looking at this paper and it indeed seems that precision-preserving summation is much slower.
simonbyrne commented on Jul 30, 2016
In that case, why don't we add a pass for `mean` which checks if all entries are the same, and if so returns that value (we could also do it in the same pass as the summation, if that's faster)? This should fix all the downstream problems as well.
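A rough sketch of what such a pass could look like (illustrative only; an actual change would live in `mean` itself, with proper promotion of the return type):

```julia
using Statistics

# Illustrative only: return the first entry when the vector is constant, so the
# mean is exact for constant input and the downstream std comes out as 0.
# (Promotion to the usual mean return type is glossed over here.)
function mean_with_constant_pass(x::AbstractVector)
    v = first(x)
    return all(==(v), x) ? v : mean(x)
end
```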
3 remaining items

simonbyrne commented on Jun 25, 2019
probably not
wsshin commented on Mar 27, 2022
I don't think checking the sameness of the entries inside `Statistics.mean` is realistic, because it will slow down `mean(x)` significantly when all entries of `x` are identical.

As an alternative, I propose subtracting the mean of `x` before calculating the z-score. Because the z-score is shift-invariant, subtracting a constant value (the mean of `x` in this case) does not change the result in exact arithmetic. Plus, the extra calculation of the mean and its subtraction from `x` is much faster than checking the sameness of the entries of `x`, which was demonstrated above to be about 9X slower than `mean(x)` itself.
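The original REPL snippets are not reproduced here; a minimal sketch of the comparison being described, with `x` and `y` as assumed names:

```julia
using Statistics, StatsBase

x = fill(log(1e-5), 10)   # all-identical entries, as in the OP
y = x .- mean(x)          # shift by the (possibly inexact) mean first

zscore(x)   # per the OP, not NaN-filled: the inexact mean leaves a nonzero numerator
zscore(y)   # NaN-filled, as claimed below: mean(y) recovers the common value exactly
```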
Importantly, the proposal avoids the problem of `zscore` reported in this issue in most cases. For example, in the above code, `zscore(y)` returns a vector correctly filled with `NaN`, whereas `zscore(x)` doesn't.

The reason the proposal works is as follows. As pointed out in the earlier comments, the problem of `zscore` reported in this issue occurs due to the inaccuracy of `mean(x)` when `x` has all-identical entries. The inaccuracy comes from the finite precision of floating-point arithmetic: when `sum(x)` is calculated inside `μ = mean(x)`, rounding error occurs due to the finite precision, and when `sum(x)` is divided by `length(x)`, the result is no longer the same as the all-identical entries of `x`. When such a `μ` is used in calculating the z-score `(x .- μ) ./ σ`, the numerator is no longer a zero vector, and therefore the result is not a `NaN`-filled vector; note that `NaN` is generated only when both the numerator and the denominator are 0.

So, even if `x` has all-identical entries, the problem should not occur if the entries have so few significant digits that `sum(x)` does not introduce rounding error. Note that for `x` with all-identical entries, `mean(x)` differs from `x[1]` only by rounding errors, and therefore the difference is on the order of the machine epsilon `eps()`.

Therefore, for `x` with all-identical entries, we can sum `x .- mean(x)` without rounding errors unless `length(x)` is very large, on the order of `1 / eps() ≈ 5e15`. Except for such an extreme case, the proposal should produce the correct `NaN`-filled vector for `x` with all-identical entries, with only minor performance degradation.

wsshin commented on Mar 28, 2022
The implementation of the above proposal should look something like this:
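(The original snippet is not shown here; the following is a reconstruction inferred from the `zscore3` variant quoted further below.)

```julia
using Statistics, StatsBase

# Reconstructed sketch of zscore2: shift by the (possibly inexact) mean first,
# then standardize the shifted data using a freshly computed mean and std.
function zscore2(X)
    μ = mean(X)
    Z = similar(X)
    Z .= X .- μ
    μ′, σ = mean_and_std(Z)   # recompute the mean of the shifted data
    return zscore!(Z, μ′, σ)
end
```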
Here are some test results. First, verify that `zscore2` fixes the present issue when `x` has all-identical entries; then check that `zscore2` produces nearly the same result as `zscore` for a usual `x`; finally, the speed comparison:
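A sketch of the kind of checks and benchmark being described (using BenchmarkTools; the actual numbers are not reproduced):

```julia
using Statistics, StatsBase, BenchmarkTools

x_const = fill(log(1e-5), 10)
zscore2(x_const)              # expected: NaN-filled vector

x = randn(1000)
zscore(x) ≈ zscore2(x)        # expected: true, up to rounding

@btime zscore($x)
@btime zscore2($x)
```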
`zscore2` uses the same number of allocations as `zscore`, which is great. However, `zscore2` is nearly 2X slower than `zscore`. As mentioned in my previous comment, I expected `zscore2` to take only about 30% longer than `zscore`, so this result is disappointing.

I tried to speed up `zscore2`, but without success. Any suggestions? Or maybe the 2X slowdown is acceptable, compared to the 9X slowdown expected from a more accurate implementation of `mean` as reported in my previous comment.

nalimilan commented on Apr 19, 2022
Have you tried calling `std(Z, mean=0)` instead of `mean_and_std(Z)`? Though I'm not sure to what extent assuming that the mean is exactly zero can be a problem.
Note that there are more efficient ways to do this. First, `x[1]` shouldn't be called for each entry. Second, `all` short-circuits, so its cost is negligible when all entries are not equal. The overhead when all entries are equal may be less of a problem, given that a slow but correct result is better than an incorrect one. Another solution would be to use `sum` or `count`, which are faster as they don't short-circuit, so they use SIMD. Their downside is that you have to pay the cost even when all entries are not equal. One intermediate solution could be to check whether the first two entries are equal: if that's not the case, we can skip the check; if it is, call `sum` to check that the other entries are also equal.
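A sketch of that intermediate check (hypothetical helper, only to illustrate the idea):

```julia
# Hypothetical helper: cheap early exit on the first two entries, and a
# non-short-circuiting (SIMD-friendly) count only when they happen to match.
function allequal_fast(x::AbstractVector)
    length(x) < 2 && return true
    v = first(x)
    x[2] == v || return false            # cheap check; usually exits here
    return count(==(v), x) == length(x)  # full pass, vectorizes well
end
```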
wsshin commented on Apr 21, 2022

Is this what you are suggesting?
```julia
function zscore3(X)
    μ = mean(X)
    Z = similar(X)
    Z .= X .- μ
    σ = std(Z, mean=0)   # replace mean_and_std() of zscore2()
    zscore!(Z, μ, σ)
end
```

Unfortunately, this does not solve the issue raised by the OP.
Remember that `NaN` is generated only by `0/0`. (I have updated #196 (comment) to emphasize this.) My proposal does not work if we don't recalculate the accurate mean for the `X` shifted by the incorrect `μ`.

wsshin commented on Apr 21, 2022
I like the "intermediate solution" you mentioned in the above quote, but checking the sameness of the first two entries, or of the first
nentries for sufficiently smallnmay not be ideal, because then we would end up applying the more costly algorithm even ifxhas identical entries only in the first part.How about checking the sameness of a few randomly selected entries of
x? Any suggestions as to how many entries we should check?nalimilan commentedon Apr 21, 2022
I guess we could check e.g. the two first entries and then the last two entries, just in case the data is sorted. Picking values at random positions probably isn't worth it as it will be relatively slow due to cache misses. If we really want to limit the chances of running the full check, better check more consecutive values at the beginning and at the end.
(Ideally one day `all` will be smart enough to use SIMD by blocks and short-circuit at the end of each block. We could implement this manually, but it may not be worth it here.)

wsshin commented on Apr 27, 2022
I'm trying to create a PR implementing the suggested solution, but there is a problem in checking the last two entries.
`x` can be an iterator in general (e.g., generated by `skipmissing(x)`), for which we cannot access the last two entries directly without going through all the earlier entries.

As a workaround, I think we should just check the first few entries. What do you think?
wsshin commented on Apr 29, 2022
See the PR #787. I noticed that the z-score is calculated not only by `zscore()`, but also by `ZScoreTransform`. So, instead of fixing `zscore()`, I have fixed `mean_and_std()` such that it calculates the mean more accurately using the technique discussed earlier in this thread.

nalimilan commented on May 2, 2022
Thanks, but I don't think `mean_and_std` should give different results from calling just `std` or `m = mean(x); std(x, mean=m)`. Also, the PR is very complex. As suggested by others above, it would be better to change `mean` in Statistics to check whether the first entries are equal, and if so check whether all entries are equal, in which case the first entry can be returned (with adequate promotion, which may be a bit tricky for heterogeneous collections).

wsshin commented on May 4, 2022
The PR has gotten complex in order to handle the cases with multi-dimensional arrays. These cases pass `dims` as a keyword argument, and you have to determine whether or not to calculate the more accurate mean for each vector along the `dims` direction. The basic functionality for vectors is very simple.

I thought the proposal to check whether all entries are equal was rejected because it was too slow. Such a significant performance regression will never be accepted in Base. But we can open an issue at least and see how the community is willing to approach this problem. I will open an issue about `mean`: JuliaLang/julia#45186.

wsshin commented on May 6, 2022
A comment in the above Julia issue, JuliaLang/julia#45186 (comment), suggested using AccurateArithmetic.jl for an accurate sum. For the users of `zscore()` who are concerned about the inaccuracy discussed here, it might be enough to just update the docstrings to recommend calculating the mean using AccurateArithmetic.jl outside `zscore()`.
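A sketch of the kind of usage such a docstring could recommend (assuming AccurateArithmetic.jl's compensated `sum_kbn`; whether the mean comes out exact for a given constant input still depends on the final division):

```julia
using Statistics, StatsBase, AccurateArithmetic

x = fill(log(1e-5), 10)
μ = sum_kbn(x) / length(x)   # compensated sum: much less rounding error than sum(x)
σ = std(x, mean = μ)
zscore(x, μ, σ)              # NaN-filled when μ recovers the common value exactly
```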
Currently the two accurate sums implemented in AccurateArithmetic.jl lack the capability to calculate the sum along a certain dimension, which we need for `zscore()` taking an array. Moreover, the package does not have a `mean()` utilizing the accurate sums. For the convenience of the users, maybe we should introduce these features in AccurateArithmetic.jl.

Does this sound reasonable?
nalimilan commented on Jun 10, 2022
Yes, we could point users to other, more accurate methods. But given the low performance cost, it could still be interesting to correctly handle the case where all values are equal (which is the one mentioned in the OP).