
Conversation

@nvaytet (Member) commented Nov 20, 2025

With the latest release of tof, the nightly tests started failing.
It turns out that when computing the wavelengths of neutrons, there is a small chance (less than 1 in 10,000) of producing outliers where the relative error exceeds 2%.

Here is a figure of the error for each neutron in the test, using 100K neutrons; we see 5 outliers.
[figure: relative error per neutron for 100K neutrons, with 5 outliers visible]

I checked, and even with the previous version of tof, if we use more neutrons than the 10K used in the test, outliers also start to appear. So it is just a statistical fluctuation, and we were simply lucky not to see it before.

To fix this, we relax the percentile threshold by a very small amount (100 -> 99.9) to make the test more robust to future changes.
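
For illustration, a minimal sketch of what the relaxed check could look like (the function and variable names are hypothetical, not the actual test code):

```python
import numpy as np

# Instead of requiring that the maximum relative error (the 100th percentile)
# stays below the tolerance, require it of the 99.9th percentile, so a handful
# of outliers out of ~100K neutrons no longer fails the test.
def assert_wavelengths_close(computed, reference, tolerance=0.02):
    relative_error = np.abs(computed - reference) / reference
    assert np.percentile(relative_error, 99.9) < tolerance
```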

@nvaytet nvaytet requested a review from jokasimr November 20, 2025 09:30
@jokasimr (Contributor) commented Nov 20, 2025

I'm not sure if I'm misunderstanding what you are saying, but it sounds to me like you're saying the failures were flukes and they don't have anything to do with the recent release of tof.

The nightlies run three times every night (once on main, once on the latest release, and once with the oldest allowed dependencies), and this particular failure has not happened before.

But last night it happened twice, on both main and the latest release. Meanwhile it didn't happen in the nightly that uses the old version of tof.

That strongly indicates there's some change in the last tof release that made this much more likely than before.
Do you know what that might be?

Edit: The outliers do look very harmless though. I'm not saying they're necessarily a problem, just that it'd be good to know why they happen more frequently now.

@nvaytet (Member, Author) commented Nov 20, 2025

That strongly indicates there's some change in the last tof release that made this much more likely than before.

I don't think so.
It didn't happen before because we were using 10K neutrons in the tests and we were always using the same seed.
If you use the old version of tof and either change the seed or increase the number of neutrons, outliers can appear.

Something changed in tof for sure: how the pulse is generated.

Old: to select neutrons, do a 1D random.choice on a distribution of birth times, and another random.choice on a distribution of wavelengths.

New: we have a 2D probability distribution that depends on both birth time and wavelength. We flatten the distribution and do a single random.choice on the flattened array; this gives us an index from which we recover a birth time and a wavelength.
The 2D distribution is created by broadcasting the old 1D distributions together.

Overall, it should give statistically the same result, but the individual random draws differ, so the fluctuations do too.
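
A minimal numpy sketch of the two sampling schemes, using uniform placeholder distributions and made-up grids (the real distributions come from the tof pulse model):

```python
import numpy as np

rng = np.random.default_rng(88)
n = 100_000

# Placeholder 1D probability distributions over birth time and wavelength.
times = np.linspace(0.0, 3.0e-3, 200)
wavelengths = np.linspace(0.5, 5.0, 300)
p_time = np.full(times.size, 1.0 / times.size)
p_wav = np.full(wavelengths.size, 1.0 / wavelengths.size)

# Old scheme: two independent 1D draws.
birth_time = times[rng.choice(times.size, size=n, p=p_time)]
wavelength = wavelengths[rng.choice(wavelengths.size, size=n, p=p_wav)]

# New scheme: broadcast the 1D distributions into a 2D one, flatten it,
# draw a single index per neutron, and unravel it into (time, wavelength).
p2d = p_time[:, None] * p_wav[None, :]
flat_index = rng.choice(p2d.size, size=n, p=p2d.ravel())
it, iw = np.unravel_index(flat_index, p2d.shape)
birth_time_new, wavelength_new = times[it], wavelengths[iw]
```

Since the 2D distribution is just the outer product of the two 1D ones, the joint draw is statistically equivalent to the independent draws, but the random streams differ, so individual outliers can land in different places.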

@nvaytet (Member, Author) commented Nov 20, 2025

You can try it yourself: changing the seed from 88 to 345 causes the test to fail (using the old tof).

@jokasimr (Contributor) commented Nov 20, 2025

Do we use the same seed every run? In that case that explains it a bit better. I thought we used different seeds every run.

Although that still doesn't explain why it didn't fail in the nightly test with old dependencies. But like you say, that could just be a fluke.

@nvaytet (Member, Author) commented Nov 20, 2025

Do we use the same seed every run? In that case that explains it a bit better. I thought we used different seeds every run.

We have 3 cases: dream, dream_with_time_overlap, v20.
Each of these cases uses a different seed; however, each case is parametrized into 24 different tests, and all 24 use the same seed. So while we are running 72 tests, only 3 different seeds are exercised, giving quite a high chance that the outliers were missed.
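
Something like the following hypothetical layout (the case names follow the list above, but the seeds, the extra parameter, and the test body are made up):

```python
import pytest

# Three cases, each with its own fixed seed (values here are illustrative).
CASES = [("dream", 88), ("dream_with_time_overlap", 123), ("v20", 456)]

# Each case fans out into 24 parametrized variants, but all 24 reuse the same
# seed, so the 72 collected tests only ever exercise 3 seeds.
@pytest.mark.parametrize("case, seed", CASES)
@pytest.mark.parametrize("variant", range(24))
def test_wavelengths(case, seed, variant):
    ...  # build the pulse with `seed` for this case/variant and check the errors
```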

I still don't know where the outliers come from, and it would probably be good to find out...
Maybe we can trace it down using the ids of the neutrons?

Although that still doesn't explain why it didn't fail in the nightly test with old dependencies.

Why does it not explain that?

@jokasimr (Contributor) commented Nov 20, 2025

Okay, I now see that in the failing nightly, only one of the three cases you mentioned (test_dream_wfm_with_subframe_time_overlap) is failing.
So it could very well just be noise like you say.

Although that still doesn't explain why it didn't fail in the nightly test with old dependencies.

Why does it not explain that?

I misunderstood, I thought the seed had changed, but of course the only thing that changed is the package.

If you're confident there are no issues with this, then I'm fine with it; adjusting the threshold seems like a reasonable fix.
If you want to investigate further, one quick thing we could try is rerunning the failing tests with a few different seeds and seeing whether there is a significant difference in the failure rate between the old and the new version.
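
Roughly like this (run_case is a hypothetical stand-in for whatever builds the pulse and returns the per-neutron relative errors; it is not an existing function):

```python
import numpy as np

def failure_rate(run_case, seeds, tolerance=0.02):
    # Fraction of seeds for which the maximum relative error exceeds the tolerance.
    failures = sum(np.max(run_case(seed=seed)) > tolerance for seed in seeds)
    return failures / len(seeds)

# Compare, e.g., failure_rate(run_case_old_tof, range(100)) against
# failure_rate(run_case_new_tof, range(100)).
```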

@nvaytet (Member, Author) commented Nov 20, 2025

I think I understand where the outliers come from.

It's not a coincidence that they only appear in the test case where we have a time overlap in the 2 WFM frames.
In this case, when we try to compute the mean wavelength for a given time of arrival, in the central region where the overlap occurs, we get a large variance/uncertainty. Remember the figure in the docs: https://scipp.github.io/essreduce/user-guide/tof/dream.html#Handling-time-overlap-between-subframes
[screenshot: the time-overlap figure from the essreduce docs linked above]

That region subsequently gets masked out, depending on how high the variance is.

The outliers are all located around that central region between the frames, where uncertainty is highest (I plotted the outliers in black on the lookup table)
[figure: the lookup table with the outliers overlaid in black, clustered around the overlap region]

In the present table, we have masked out the region with large uncertainty, but that was based on an arbitrary threshold we chose. So essentially, close to that region, the uncertainties grow and exceed the limit we used in the test.

If we had either

  1. masked out a larger region, or
  2. asserted that the max difference is smaller than the error threshold that was applied when creating the LUT,

then the test would probably have passed?
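
For reference, a minimal sketch of the masking in option 1, assuming the lookup table exposes a mean and a variance per bin (the names are illustrative; the real table in essreduce is a scipp data structure):

```python
import numpy as np

def mask_uncertain_bins(mean_tof, var_tof, relative_error_threshold=0.02):
    # Mask (set to NaN) bins whose relative standard deviation exceeds the
    # chosen threshold; lowering the threshold masks a larger region.
    relative_error = np.sqrt(var_tof) / mean_tof
    return np.where(relative_error > relative_error_threshold, np.nan, mean_tof)
```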

@nvaytet nvaytet merged commit eb39675 into main Nov 20, 2025
4 checks passed
@nvaytet nvaytet deleted the fix-wfm-tests branch November 20, 2025 22:17
@nvaytet (Member, Author) commented Nov 21, 2025

Maybe the name LookupTableRelativeErrorThreshold wasn't the best, as it has more to do with variances than errors...

@jokasimr (Contributor) commented Nov 21, 2025

I'm not really sure I understand the distinction in this context.

LookupTableRelativeErrorThreshold is

$$ e := \frac{\sqrt{Var[t_{of}]}}{E[t_{of}]} $$

right?

Personally I prefer that threshold quantity, for two reasons:

  1. I think people typically care more about standard deviations than about variances. Variance is useful in calculations, but it's not really how people think about uncertainty.
  2. It's often useful to think about a deviation/error in relative terms; I think scientists are more used to talking about wavelength resolution as "1% error in wavelength" rather than an absolute "0.05 angstrom error in wavelength". But depending on the chopper setup, both may occur. We could consider adding an absolute tolerance term to allow for both.

@jokasimr (Contributor) commented Nov 21, 2025

We could consider adding an absolute tolerance term to allow for both.

For example, we could make the condition for masking the lookup table be:

$$ a + e E[t_{of}] \le \sqrt{Var[t_{of}]} $$

where a is the absolute tolerance, and e the relative tolerance (LookupTableRelativeErrorThreshold).
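
A minimal sketch of that combined criterion (the array and parameter names are illustrative):

```python
import numpy as np

def should_mask(mean_tof, var_tof, a=0.0, e=0.02):
    # Mask a bin when its standard deviation exceeds the absolute tolerance `a`
    # plus the relative tolerance `e` times the mean time-of-flight.
    return a + e * mean_tof <= np.sqrt(var_tof)
```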

@nvaytet (Member, Author) commented Nov 21, 2025

I think I meant to say that the term "error" can be interpreted in different ways.
'Standard deviation' is clearer to me. And I agree that using the standard deviation is better than the variance, as it makes more intuitive sense.
I think 'error' suggests that the result is wrong in those regions, while standard deviation communicates that the result is uncertain.
But maybe that's just me.

@jokasimr (Contributor) commented Nov 21, 2025

Makes sense 👍 I'm fine with calling it something like uncertainty threshold or standard deviation threshold too.
