
Conversation

@nvaytet (Member) commented Nov 20, 2025

With the latest release of tof, the nightly tests started failing.
It turns out that when computing the wavelengths of neutrons, there is a small chance (less than 1 in 10,000) of producing outliers where the relative error exceeds 2%.

Here is a figure of the error for each neutron in the test, using 100K neutrons; we see 5 outliers.
[figure: relative error per neutron for 100K neutrons, with 5 outliers visible]

I checked, and even with the previous version of tof, if we use more neutrons than the 10K used in the test, outliers also start to appear. So it is just a statistical fluctuation, and we were simply lucky not to see it before.

To fix this, we relax the percentile threshold by a very small amount (100 -> 99.9) to make the test more robust to future changes.
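
For illustration, a minimal sketch of what the relaxed check could look like (the function and variable names are hypothetical, not the actual test code):

```python
import numpy as np

# Instead of requiring that the maximum relative error (the 100th percentile)
# stays below the tolerance, require it of the 99.9th percentile, so a handful
# of outliers out of ~100K neutrons no longer fails the test.
def assert_wavelengths_close(computed, reference, tolerance=0.02):
    relative_error = np.abs(computed - reference) / reference
    assert np.percentile(relative_error, 99.9) < tolerance
```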

@nvaytet nvaytet requested a review from jokasimr November 20, 2025 09:30
@jokasimr (Contributor) commented Nov 20, 2025

I'm not sure if I'm misunderstanding what you are saying, but it sounds to me like you're saying the failures were flukes and they don't have anything to do with the recent release of tof.

The nightlies run three times every night (once on main, once on the latest release, and once with the oldest allowed dependencies), and this particular failure has not happened before.

But last night it happened twice, on both main and the latest release. Meanwhile it didn't happen in the nightly that uses the old version of tof.

That strongly indicates there's some change in the last tof release that made this much more likely than before.
Do you know what that might be?

Edit: The outliers do look very harmless though. I'm not saying they're necessarily a problem, just that it'd be good to know why they happen more frequently now.

@nvaytet (Member, Author) commented Nov 20, 2025

That strongly indicates there's some change in the last tof release that made this much more likely than before.

I don't think so.
It didn't happen before because we were using 10K neutrons in the tests and we were always using the same seed.
If you use the old version of tof and either change the seed or increase the number of neutrons, outliers can appear.

Something changed in tof for sure: how the pulse is generated.

Old: to select neutrons, do a 1D random.choice on a distribution of birth times, and another random.choice on a distribution of wavelengths.

New: we have a 2D probability distribution that depends on both birth time and wavelength. We flatten the distribution and do a single random.choice on the flattened array; this gives us an index from which we recover a birth time and a wavelength.
The 2D distribution is created by broadcasting the old 1D distributions together.

Overall, it should give statistically the same result, but the individual random draws differ, so the fluctuations do too.
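
A minimal numpy sketch of the two sampling schemes, using uniform placeholder distributions and made-up grids (the real distributions come from the tof pulse model):

```python
import numpy as np

rng = np.random.default_rng(88)
n = 100_000

# Placeholder 1D probability distributions over birth time and wavelength.
times = np.linspace(0.0, 3.0e-3, 200)
wavelengths = np.linspace(0.5, 5.0, 300)
p_time = np.full(times.size, 1.0 / times.size)
p_wav = np.full(wavelengths.size, 1.0 / wavelengths.size)

# Old scheme: two independent 1D draws.
birth_time = times[rng.choice(times.size, size=n, p=p_time)]
wavelength = wavelengths[rng.choice(wavelengths.size, size=n, p=p_wav)]

# New scheme: broadcast the 1D distributions into a 2D one, flatten it,
# draw a single index per neutron, and unravel it into (time, wavelength).
p2d = p_time[:, None] * p_wav[None, :]
flat_index = rng.choice(p2d.size, size=n, p=p2d.ravel())
it, iw = np.unravel_index(flat_index, p2d.shape)
birth_time_new, wavelength_new = times[it], wavelengths[iw]
```

Since the 2D distribution is just the outer product of the two 1D ones, the joint draw is statistically equivalent to the independent draws, but the random streams differ, so individual outliers can land in different places.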

@nvaytet (Member, Author) commented Nov 20, 2025

You can try it yourself: changing the seed from 88 to 345 causes the test to fail (using the old tof).

@jokasimr (Contributor) commented Nov 20, 2025

Do we use the same seed every run? In that case that explains it a bit better. I thought we used different seeds every run.

Although that still doesn't explain why it didn't fail in the nightly test with old dependencies. But like you say, that could just be a fluke.

@nvaytet (Member, Author) commented Nov 20, 2025

Do we use the same seed every run? In that case that explains it a bit better. I thought we used different seeds every run.

We have 3 cases: dream, dream_with_time_overlap, v20.
Each of these cases uses a different seed; however, each case is parametrized into 24 different tests, and all 24 use the same seed. So while we are running 72 tests, only 3 different seeds are exercised, giving quite a high chance that the outliers were missed.
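
Something like the following hypothetical layout (the case names follow the list above, but the seeds, the extra parameter, and the test body are made up):

```python
import pytest

# Three cases, each with its own fixed seed (values here are illustrative).
CASES = [("dream", 88), ("dream_with_time_overlap", 123), ("v20", 456)]

# Each case fans out into 24 parametrized variants, but all 24 reuse the same
# seed, so the 72 collected tests only ever exercise 3 seeds.
@pytest.mark.parametrize("case, seed", CASES)
@pytest.mark.parametrize("variant", range(24))
def test_wavelengths(case, seed, variant):
    ...  # build the pulse with `seed` for this case/variant and check the errors
```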

I still don't know where the outliers come from, and it would probably be good to find out...
Maybe we can trace it down using the ids of the neutrons?

Although that still doesn't explain why it didn't fail in the nightly test with old dependencies.

Why does it not explain that?

@jokasimr (Contributor) commented Nov 20, 2025

Okay, I now see that in the failing nightly, only one of the three cases you mentioned (test_dream_wfm_with_subframe_time_overlap) is failing.
So it could very well just be noise like you say.

Although that still doesn't explain why it didn't fail in the nightly test with old dependencies.

Why does it not explain that?

I misunderstood, I thought the seed had changed, but of course the only thing that changed is the package.

If you're confident there are no issues with this, then I'm fine with it; adjusting the threshold seems like a reasonable fix.
If you want to investigate further, one quick thing we could try is rerunning the failing tests with a few different seeds and seeing whether there is a significant difference in the failure rate between the old and the new version.
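
Roughly like this (run_case is a hypothetical stand-in for whatever builds the pulse and returns the per-neutron relative errors; it is not an existing function):

```python
import numpy as np

def failure_rate(run_case, seeds, tolerance=0.02):
    # Fraction of seeds for which the maximum relative error exceeds the tolerance.
    failures = sum(np.max(run_case(seed=seed)) > tolerance for seed in seeds)
    return failures / len(seeds)

# Compare, e.g., failure_rate(run_case_old_tof, range(100)) against
# failure_rate(run_case_new_tof, range(100)).
```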

@nvaytet (Member, Author) commented Nov 20, 2025

I think I understand where the outliers come from.

It's not a coincidence that they only appear in the test case where we have a time overlap in the 2 WFM frames.
In this case, when we try to compute the mean wavelength for a given time of arrival, in the central region where the overlap occurs, we get a large variance/uncertainty. Remember the figure in the docs: https://scipp.github.io/essreduce/user-guide/tof/dream.html#Handling-time-overlap-between-subframes
[screenshot: the time-overlap figure from the essreduce docs linked above]

That region subsequently gets masked out, depending on how high the variance is.

The outliers are all located around that central region between the frames, where uncertainty is highest (I plotted the outliers in black on the lookup table)
[figure: the lookup table with the outliers overlaid in black, clustered around the overlap region]

In the present table, we have masked out the region with large uncertainty, but that was based on an arbitrary threshold we chose. So essentially, close to that region, the uncertainties grow and exceed the limit we used in the test.

If we had either

  1. masked out a larger region, or
  2. asserted that the max difference is smaller than the error threshold that was applied when creating the LUT,

then the test would probably have passed?
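
For reference, a minimal sketch of the masking in option 1, assuming the lookup table exposes a mean and a variance per bin (the names are illustrative; the real table in essreduce is a scipp data structure):

```python
import numpy as np

def mask_uncertain_bins(mean_tof, var_tof, relative_error_threshold=0.02):
    # Mask (set to NaN) bins whose relative standard deviation exceeds the
    # chosen threshold; lowering the threshold masks a larger region.
    relative_error = np.sqrt(var_tof) / mean_tof
    return np.where(relative_error > relative_error_threshold, np.nan, mean_tof)
```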

@nvaytet nvaytet merged commit eb39675 into main Nov 20, 2025
4 checks passed
@nvaytet nvaytet deleted the fix-wfm-tests branch November 20, 2025 22:17
@nvaytet (Member, Author) commented Nov 21, 2025

Maybe the name LookupTableRelativeErrorThreshold wasn't the best, as it has more to do with variances than errors...

@jokasimr (Contributor) commented Nov 21, 2025

I'm not really sure I understand the distinction in this context.

LookupTableRelativeErrorThreshold is

$$ e := \frac{\sqrt{Var[t_{of}]}}{E[t_{of}]} $$

right?

Personally I prefer that threshold quantity, for two reasons:

  1. I think people typically care more about standard deviations than about variances. Variance is useful in calculations, but it's not really how people think about uncertainty.
  2. It's often useful to think about a deviation/error in relative terms; I think scientists are more used to talking about wavelength resolution as "1% error in wavelength" rather than an absolute "0.05 angstrom error in wavelength". But depending on the chopper setup, both may occur. We could consider adding an absolute tolerance term to allow for both.

@jokasimr (Contributor) commented Nov 21, 2025

We could consider adding an absolute tolerance term to allow for both.

For example, we could make the condition for masking the lookup table be:

$$ a + e E[t_{of}] \le \sqrt{Var[t_{of}]} $$

where a is the absolute tolerance, and e the relative tolerance (LookupTableRelativeErrorThreshold).
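
A minimal sketch of that combined criterion (the array and parameter names are illustrative):

```python
import numpy as np

def should_mask(mean_tof, var_tof, a=0.0, e=0.02):
    # Mask a bin when its standard deviation exceeds the absolute tolerance `a`
    # plus the relative tolerance `e` times the mean time-of-flight.
    return a + e * mean_tof <= np.sqrt(var_tof)
```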

@nvaytet (Member, Author) commented Nov 21, 2025

I think I meant to say that the term "error" can be interpreted in different ways.
'Standard deviation' is clearer to me. And I agree that using the standard deviation is better than the variance, as it makes more intuitive sense.
I think 'error' suggests that the result is wrong in those regions, while standard deviation communicates that the result is uncertain.
But maybe that's just me.

@jokasimr (Contributor) commented Nov 21, 2025

Makes sense 👍 I'm fine with calling it something like uncertainty threshold or standard deviation threshold too.
