Some inconsistency in zscore when all values are the same:
```julia
@show zscore(zeros(5) .+ log(1e-5))[1]   # NaN
@show zscore(zeros(8) .+ log(1e-5))[1]   # -0.9354143466934853
@show zscore(zeros(10) .+ log(1e-5))[1]  # -0.9486832980505138
```

I think 0 would have been a better answer in all of those cases, but NaN is certainly better than -0.935.
andreasnoack commented on Jul 26, 2016
I don't really see how we could do better here. Scaling a degenerate distribution to unit variance is not really possible. In numerical terms that just means you get maximum errors in the result.
ararslan commented on Jul 26, 2016
@andreasnoack We could try something like `allunique(x) ? Inf : zscore!(...)`, where `Inf` could also be `NaN` or whatever. Dunno if that's the best course of action, just an idea.

maximsch2 commented on Jul 26, 2016
Even non-unique items can lead to issues. I've run into the same issue before (on the same dataset :() in scikit-learn, which led to scikit-learn/scikit-learn#4436 and scikit-learn/scikit-learn#3725.

I think scikit-learn tries to fix the std from 0 to 1, and also does a bit more to ensure stability.
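For reference, a minimal Julia sketch of that idea (an assumption about what the std=0 → 1 replacement looks like; this is not what StatsBase does):

```julia
using Statistics

# Sketch of the scikit-learn-style behavior: treat a zero standard deviation
# as 1 so that a constant column standardizes to all zeros instead of NaN.
# (Only helps when σ is exactly zero; the rounding issue discussed below still applies.)
function standardize_sklearn_style(x)
    μ = mean(x)
    σ = std(x)
    σ = iszero(σ) ? one(σ) : σ   # replace std = 0 with 1
    return (x .- μ) ./ σ
end
```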
andreasnoack commented on Jul 26, 2016
That's a pretty significant discontinuity to introduce at zero, and the result definitely doesn't have unit variance. @maximsch2 can you explain what kind of computations you make after the `zscore` that would be erroneous for the present version of `zscore`?

The `allunique` check should be very cheap, so that might be a good way to capture a constant vector. It seems to me that the constant case is the most likely problematic case.

simonbyrne commented on Jul 27, 2016
We've hit this issue with `mean` and `var` before (I think there's an issue somewhere, though I can't find it now). One solution would be to compute the mean using a superaccumulator algorithm.

maximsch2 commented on Jul 27, 2016
@andreasnoack I'm processing microarray gene expression data, and the values for one probe happen to be all equal across samples. Standardizing the data (zero mean, variance one) is a standard pre-processing step before linear modeling and, from that point of view, replacing a vector that has all values the same with zeros is OK. I guess you can think of it as just skipping the unit-variance transform when the variance is exactly zero.

Regardless of your opinion on magically converting std=0 to std=1 (which I agree is potentially troublesome in edge cases), I think zscore should return the same output for all of the arrays in my original post. Maybe NaN is the correct value (although I would've preferred 0 for practical purposes), but -0.93 certainly doesn't make much sense.
ararslan commented on Jul 27, 2016
From a purely mathematical standpoint, a z-score of 0 wouldn't make sense for a constant vector since you're dividing by a standard deviation of 0.
Btw, for what it's worth, R gives `NaN` in all the cases listed here, both when calculating the z-scores manually and when using `scale`.

simonbyrne commented on Jul 27, 2016
I think NaN is the most appropriate result.
ararslan commented on Jul 27, 2016
In terms of implementation, would it make more sense to catch a constant vector using `allunique` or something like `isapprox(std(x), 0)`?

simonbyrne commented on Jul 27, 2016
I don't think we want `allunique` (as `allunique([1,1,2]) == false`), but we could just check if all entries are the same as the first.

Computing the mean via a superaccumulator should also fix the problem (as this will give `std(x) == 0`).
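A minimal sketch of such a check (hypothetical wrapper, not part of StatsBase; returning NaNs for the constant case as discussed above):

```julia
using StatsBase

# Hypothetical guard around zscore: if every entry equals the first one,
# return a NaN-filled vector instead of relying on the computed mean/std.
function zscore_guarded(x::AbstractVector)
    all(==(first(x)), x) && return fill(NaN, length(x))
    return zscore(x)
end
```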
ararslan commented on Jul 27, 2016
Seems like adjusting the `mean` computation would be a good candidate for an enhancement in Base rather than having a competing `mean` implementation here.

andreasnoack commented on Jul 27, 2016
@simonbyrne Wouldn't a fancy mean be much slower? Checking that all the values in a vector are identical (which indeed is not what `allunique` does) should be really cheap.

ararslan commented on Jul 27, 2016
Regarding speed, I was looking at this paper and it indeed seems that precision-preserving summation is much slower.
simonbyrne commented on Jul 30, 2016
In that case, why don't we add a pass for `mean` which checks if all entries are the same, and if so returns that value (we could also do it in the same pass as the summation, if that's faster)? This should fix all the downstream problems as well.
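A rough sketch of what such a pass could look like (illustrative only; an actual change would live in `mean` itself, with proper promotion of the return type):

```julia
using Statistics

# Illustrative only: return the first entry when the vector is constant, so the
# mean is exact for constant input and the downstream std comes out as 0.
# (Promotion to the usual mean return type is glossed over here.)
function mean_with_constant_pass(x::AbstractVector)
    v = first(x)
    return all(==(v), x) ? v : mean(x)
end
```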
3 remaining items

simonbyrne commented on Jun 25, 2019
probably not
wsshin commented on Mar 27, 2022
I don't think checking the sameness of the entries inside `Statistics.mean` is realistic, because it will slow down `mean(x)` significantly when all entries of `x` are identical.

As an alternative, I propose subtracting the mean of `x` before calculating the z-score. Because the z-score is shift-invariant, subtracting a constant value (the mean of `x` in this case) does not change the result in exact arithmetic. Plus, the extra calculation of the mean and its subtraction from `x` is much faster than checking the sameness of the entries of `x`, which was demonstrated above to be about 9X slower than `mean(x)` itself.
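The original REPL snippets are not reproduced here; a minimal sketch of the comparison being described, with `x` and `y` as assumed names:

```julia
using Statistics, StatsBase

x = fill(log(1e-5), 10)   # all-identical entries, as in the OP
y = x .- mean(x)          # shift by the (possibly inexact) mean first

zscore(x)   # per the OP, not NaN-filled: the inexact mean leaves a nonzero numerator
zscore(y)   # NaN-filled, as claimed below: mean(y) recovers the common value exactly
```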
Importantly, the proposal avoids the problem of `zscore` reported in this issue in most cases. For example, in the above code, `zscore(y)` returns a vector correctly filled with `NaN`, whereas `zscore(x)` doesn't.

The reason the proposal works is as follows. As pointed out in the earlier comments, the problem of `zscore` reported in this issue occurs due to the inaccuracy of `mean(x)` when `x` has all-identical entries. The inaccuracy comes from the finite precision of floating-point arithmetic: when `sum(x)` is calculated inside `μ = mean(x)`, rounding error occurs due to the finite precision, and when `sum(x)` is divided by `length(x)`, the result is no longer the same as the all-identical entries of `x`. When such a `μ` is used in calculating the z-score `(x .- μ) ./ σ`, the numerator is no longer a zero vector, and therefore the result is not a `NaN`-filled vector; note that `NaN` is generated only when both the numerator and the denominator are 0.

So, even if `x` has all-identical entries, the problem should not occur if the entries have so few significant digits that `sum(x)` does not introduce rounding error. Note that for `x` with all-identical entries, `mean(x)` differs from `x[1]` only by rounding errors, and therefore the difference is on the order of the machine epsilon `eps()`.

Therefore, for `x` with all-identical entries, we can sum `x .- mean(x)` without rounding errors unless `length(x)` is very large, on the order of `1 / eps() ≈ 5e15`. Except for such an extreme case, the proposal should produce the correct `NaN`-filled vector for `x` with all-identical entries, with only minor performance degradation.

wsshin commented on Mar 28, 2022
The implementation of the above proposal should look something like this:
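(The original snippet is not shown here; the following is a reconstruction inferred from the `zscore3` variant quoted further below.)

```julia
using Statistics, StatsBase

# Reconstructed sketch of zscore2: shift by the (possibly inexact) mean first,
# then standardize the shifted data using a freshly computed mean and std.
function zscore2(X)
    μ = mean(X)
    Z = similar(X)
    Z .= X .- μ
    μ′, σ = mean_and_std(Z)   # recompute the mean of the shifted data
    return zscore!(Z, μ′, σ)
end
```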
Here are some test results. First, verify that `zscore2` fixes the present issue when `x` has all-identical entries; then check that `zscore2` produces nearly the same result as `zscore` for a usual `x`; finally, the speed comparison:
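A sketch of the kind of checks and benchmark being described (using BenchmarkTools; the actual numbers are not reproduced):

```julia
using Statistics, StatsBase, BenchmarkTools

x_const = fill(log(1e-5), 10)
zscore2(x_const)              # expected: NaN-filled vector

x = randn(1000)
zscore(x) ≈ zscore2(x)        # expected: true, up to rounding

@btime zscore($x)
@btime zscore2($x)
```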
`zscore2` uses the same number of allocations as `zscore`, which is great. However, `zscore2` is nearly 2X slower than `zscore`. As mentioned in my previous comment, I expected `zscore2` to take only about 30% longer than `zscore`, so this result is disappointing.

I tried to speed up `zscore2`, but without success. Any suggestions? Or maybe the 2X slowdown is acceptable, compared to the 9X slowdown expected from a more accurate implementation of `mean` as reported in my previous comment.

nalimilan commented on Apr 19, 2022
Have you tried calling `std(Z, mean=0)` instead of `mean_and_std(Z)`? Though I'm not sure to what extent assuming that the mean is exactly zero can be a problem.
Note that there are more efficient ways to do this. First, `x[1]` shouldn't be called for each entry. Second, `all` short-circuits, so its cost is negligible when all entries are not equal. The overhead when all entries are equal may be less of a problem, given that a slow but correct result is better than an incorrect one. Another solution would be to use `sum` or `count`, which are faster as they don't short-circuit, so they use SIMD. Their downside is that you have to pay the cost even when all entries are not equal. One intermediate solution could be to check whether the first two entries are equal: if that's not the case, we can skip the check; if it is, call `sum` to check that the other entries are also equal.
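A sketch of that intermediate check (hypothetical helper, only to illustrate the idea):

```julia
# Hypothetical helper: cheap early exit on the first two entries, and a
# non-short-circuiting (SIMD-friendly) count only when they happen to match.
function allequal_fast(x::AbstractVector)
    length(x) < 2 && return true
    v = first(x)
    x[2] == v || return false            # cheap check; usually exits here
    return count(==(v), x) == length(x)  # full pass, vectorizes well
end
```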
wsshin commented on Apr 21, 2022

Is this what you are suggesting?
```julia
function zscore3(X)
    μ = mean(X)
    Z = similar(X)
    Z .= X .- μ
    σ = std(Z, mean=0)   # replace mean_and_std() of zscore2()
    zscore!(Z, μ, σ)
end
```

Unfortunately, this does not solve the issue raised by the OP.
Remember that `NaN` is generated only by `0/0`. (I have updated #196 (comment) to emphasize this.) My proposal does not work if we don't recalculate the accurate mean for the `X` shifted by the incorrect `μ`.

wsshin commented on Apr 21, 2022
I like the "intermediate solution" you mentioned in the above quote, but checking the sameness of the first two entries, or of the first
nentries for sufficiently smallnmay not be ideal, because then we would end up applying the more costly algorithm even ifxhas identical entries only in the first part.How about checking the sameness of a few randomly selected entries of
x? Any suggestions as to how many entries we should check?nalimilan commentedon Apr 21, 2022
I guess we could check e.g. the two first entries and then the last two entries, just in case the data is sorted. Picking values at random positions probably isn't worth it as it will be relatively slow due to cache misses. If we really want to limit the chances of running the full check, better check more consecutive values at the beginning and at the end.
(Ideally one day `all` will be smart enough to use SIMD by blocks and short-circuit at the end of each block. We could implement this manually, but it may not be worth it here.)

wsshin commented on Apr 27, 2022
I'm trying to create a PR implementing the suggested solution, but there is a problem in checking the last two entries.
`x` can be an iterator in general (e.g., generated by `skipmissing(x)`), for which we cannot access the last two entries directly without going through all the earlier entries.

As a workaround, I think we should just check the first few entries. What do you think?
wsshin commented on Apr 29, 2022
See the PR #787. I noticed that the z-score is calculated not only by `zscore()`, but also by `ZScoreTransform`. So, instead of fixing `zscore()`, I have fixed `mean_and_std()` such that it calculates the mean more accurately using the technique discussed earlier in this thread.

nalimilan commented on May 2, 2022
Thanks, but I don't think `mean_and_std` should give different results from calling just `std` or `m = mean(x); std(x, mean=m)`. Also, the PR is very complex. As suggested by others above, it would be better to change `mean` in Statistics to check whether the first entries are equal, and if so check whether all entries are equal, in which case the first entry can be returned (with adequate promotion, which may be a bit tricky for heterogeneous collections).

wsshin commented on May 4, 2022
The PR has gotten complex in order to handle the cases with multi-dimensional arrays. These cases pass `dims` as a keyword argument, and you have to determine whether or not to calculate the more accurate mean for each vector along the `dims` direction. The basic functionality for vectors is very simple.

I thought the proposal to check whether all entries are equal was rejected because it was too slow. Such a significant performance regression will never be accepted in Base. But we can open an issue at least and see how the community is willing to approach this problem. I will open an issue about `mean`: JuliaLang/julia#45186.

wsshin commented on May 6, 2022
A comment in the above Julia issue, JuliaLang/julia#45186 (comment), suggested using AccurateArithmetic.jl for an accurate sum. For the users of `zscore()` who are concerned about the inaccuracy discussed here, it might be enough to just update the docstrings to recommend calculating the mean using AccurateArithmetic.jl outside `zscore()`.
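A sketch of the kind of usage such a docstring could recommend (assuming AccurateArithmetic.jl's compensated `sum_kbn`; whether the mean comes out exact for a given constant input still depends on the final division):

```julia
using Statistics, StatsBase, AccurateArithmetic

x = fill(log(1e-5), 10)
μ = sum_kbn(x) / length(x)   # compensated sum: much less rounding error than sum(x)
σ = std(x, mean = μ)
zscore(x, μ, σ)              # NaN-filled when μ recovers the common value exactly
```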
Currently the two accurate sums implemented in AccurateArithmetic.jl lack the capability to calculate the sum along a certain dimension, which we need for `zscore()` taking an array. Moreover, the package does not have a `mean()` utilizing the accurate sums. For the convenience of the users, maybe we should introduce these features in AccurateArithmetic.jl.

Does this sound reasonable?
nalimilan commented on Jun 10, 2022
Yes, we could point users to other, more accurate methods. But given the low performance cost, it could still be interesting to correctly handle the case where all values are equal (which is the one mentioned in the OP).