
[ENH] Improve test framework for v1 models #1908


Draft · wants to merge 13 commits into main

Conversation

phoeenniixx
Contributor

@phoeenniixx phoeenniixx commented Jul 1, 2025

Reference Issues/PRs

Fixes #1906

What does this implement/fix? Explain your changes.

The PR makes the changes suggested in #1906:

  • Adds the tag "info:compatible_loss" to model_pkg classes, to keep track of the losses a model is compatible with.
  • Adds "loss classes":
    • "distr" - distribution losses like NormalDistributionLoss
    • "quantile" - QuantileLoss
    • "point" - point prediction losses like MAE
  • Adds tests that loop over models and each loss the model is compatible with, testing against common vignettes (see the sketch below this list).
  • Adds tests for MultiLoss containing only losses compatible with a given model.
  • Adds tests for output format contracts.
  • Adds the remaining models to the test framework.
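
A minimal sketch of the intended pattern: the tag name comes from the PR, but LOSSES_BY_CLASS, the DeepAR_pkg stub as written here, and get_compatible_losses are illustrative assumptions, not the PR's actual code.

import pytest

# Illustrative mapping from the "loss class" strings above to concrete losses.
LOSSES_BY_CLASS = {
    "distr": ["NormalDistributionLoss", "LogNormalDistributionLoss"],
    "quantile": ["QuantileLoss"],
    "point": ["MAE"],
}

class DeepAR_pkg:
    """Hypothetical model_pkg class carrying the new tag."""
    _tags = {"info:compatible_loss": ["distr"]}

def get_compatible_losses(pkg):
    """Resolve the tag into concrete loss names."""
    losses = []
    for loss_class in pkg._tags.get("info:compatible_loss", []):
        losses.extend(LOSSES_BY_CLASS[loss_class])
    return losses

@pytest.mark.parametrize("loss_name", get_compatible_losses(DeepAR_pkg))
def test_integration(loss_name):
    ...  # train and predict against the common vignettes with this combination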

@phoeenniixx
Contributor Author

For now, only the DeepAR tests are added (not the MultiLoss tests). I will add the rest of the models in subsequent commits.


    all_train_kwargs.append(final_params)
    train_kwargs_names.append(f"base_params_{i}_{loss_name}")
else:
Contributor Author

@phoeenniixx phoeenniixx Jul 1, 2025


This if-else block is only added so that the model_pkg classes that do not yet have the tag info:compatible_loss can also pass the tests; I will remove it once all the model_pkg classes have this tag. A sketch of the fallback pattern is below.
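
A hedged sketch of such a temporary fallback; reading the tag from a _tags dict and the "point" default are assumptions for illustration, not the PR's actual code.

def losses_for_pkg(pkg_cls):
    """Use the tag when present, else fall back to a default, so that
    model_pkg classes without the tag still pass the tests."""
    tags = getattr(pkg_cls, "_tags", {})
    compatible = tags.get("info:compatible_loss", None)
    if compatible is not None:
        return compatible  # e.g. ["distr", "point"]
    return ["point"]       # assumed default until all classes carry the tag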


codecov bot commented Jul 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@18f5f95).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1908   +/-   ##
=======================================
  Coverage        ?   86.61%           
=======================================
  Files           ?       97           
  Lines           ?     7865           
  Branches        ?        0           
=======================================
  Hits            ?     6812           
  Misses          ?     1053           
  Partials        ?        0           
Flag     Coverage Δ
cpu      86.61% <100.00%> (?)
pytest   86.61% <100.00%> (?)

Flags with carried forward coverage won't be shown.

final_params = deepcopy(base_params)
for key, value in loss_params.items():
    if (
        isinstance(value, dict)
Collaborator


5 or more tabs is a code smell - can we avoid this?

Contributor Author

@phoeenniixx phoeenniixx Jul 2, 2025


Sure, I moved the code into separate functions. One way such a refactor can look is sketched below.
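
A sketch under assumptions: the helper name _merge_loss_params and the exact merge semantics are inferred from the deepcopy/isinstance snippet above, not taken from the PR's actual code.

from copy import deepcopy

def _merge_loss_params(base_params, loss_params):
    """Merge loss-specific params into a copy of the base params,
    recursing one level into dict-valued entries so the call site
    stays flat instead of deeply nested."""
    final_params = deepcopy(base_params)
    for key, value in loss_params.items():
        if isinstance(value, dict) and isinstance(final_params.get(key), dict):
            final_params[key].update(value)
        else:
            final_params[key] = value
    return final_params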

@phoeenniixx
Contributor Author

This is strange: the tests pass on macOS and Ubuntu but fail on Windows. The error:

self = LogNormal(), value = tensor([[1., 1., 1.],
        [1., 1., 0.]])

    def _validate_sample(self, value: Tensor) -> None:
        """
        Argument validation for distribution methods such as `log_prob`,
        `cdf` and `icdf`. The rightmost dimensions of a value to be
        scored via these methods must agree with the distribution's batch
        and event shapes.
    
        Args:
            value (Tensor): the tensor whose log probability is to be
                computed by the `log_prob` method.
        Raises
            ValueError: when the rightmost dimensions of `value` do not match the
                distribution's batch and event shapes.
        """
        if not isinstance(value, torch.Tensor):
            raise ValueError("The value argument to log_prob must be a Tensor")
    
        event_dim_start = len(value.size()) - len(self._event_shape)
        if value.size()[event_dim_start:] != self._event_shape:
            raise ValueError(
                f"The right-most size of value must match event_shape: {value.size()} vs {self._event_shape}."
            )
    
        actual_shape = value.size()
        expected_shape = self._batch_shape + self._event_shape
        for i, j in zip(reversed(actual_shape), reversed(expected_shape)):
            if i != 1 and j != 1 and i != j:
                raise ValueError(
                    f"Value is not broadcastable with batch_shape+event_shape: {actual_shape} vs {expected_shape}."
                )
        try:
            support = self.support
        except NotImplementedError:
            warnings.warn(
                f"{self.__class__} does not define `support` to enable "
                + "sample validation. Please initialize the distribution with "
                + "`validate_args=False` to turn off validation."
            )
            return
        assert support is not None
        valid = support.check(value)
        if not torch._is_all_true(valid):
>           raise ValueError(
                "Expected value argument "
                f"({type(value).__name__} of shape {tuple(value.shape)}) "
                f"to be within the support ({repr(support)}) "
                f"of the distribution {repr(self)}, "
                f"but found invalid values:\n{value}"
            )
E           ValueError: Expected value argument (Tensor of shape (2, 3)) to be within the support (GreaterThan(lower_bound=0.0)) of the distribution LogNormal(), but found invalid values:
E           tensor([[1., 1., 1.],
E                   [1., 1., 0.]])

C:\hostedtoolcache\windows\Python\3.12.10\x64\Lib\site-packages\torch\distributions\distribution.py:317: ValueError

When I run it locally, this does not happen. Although I am on Ubuntu, so I cannot say much; maybe someone on Windows could test what the issue is?

@fkiraly
Collaborator

fkiraly commented Jul 2, 2025

Is it always the same test, test_integration[DecoderMLP-base_params-1-LogNormalDistributionLoss]?

I suspect there is something happening with numerical accuracy, e.g., values that are close to zero but not quite zero.

@phoeenniixx
Contributor Author

Is it always the same test, test_integration[DecoderMLP-base_params-1-LogNormalDistributionLoss]?

Yes

I suspect there is something happening with numerical accuracy, e.g., values that are close to zero but not quite zero.

I'll then switch to Windows and see locally what exactly is happening.

@fkiraly
Collaborator

fkiraly commented Jul 3, 2025

One trick is also to introduce prints or artificial raises in the code that trigger only when you expect the error - although in this specific case I fear that is not going to be easy, so trying to reproduce locally might be your best bet. A sketch of the trick is below.
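
As a minimal sketch of that trick: the helper name and the y_actual variable are assumptions for illustration, not pointers into the actual code.

import torch

def assert_positive(y_actual, where):
    """Artificial raise: fail at the first non-positive value, before it
    reaches LogNormal's support validation deep inside torch."""
    if (y_actual <= 0).any():
        raise RuntimeError(
            f"non-positive values reached {where}: "
            f"{y_actual[y_actual <= 0].tolist()}"
        )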

@phoeenniixx
Contributor Author

phoeenniixx commented Jul 3, 2025

From what I could see in dwc["target"], there are values like 0.0010 that might be causing the issue?
See here for the whole dwc["target"].
Whole dwc here.

@fkiraly
Collaborator

fkiraly commented Jul 3, 2025

Strange.

  • can you explain which object at which stage of the test this corresponds to?
  • do you have a hypothesis for how the 0.0010 values cause an issue?
  • do the 0.0010 values occur only on Windows? If yes, why?
  • if no, why are they a problem only on Windows?

@phoeenniixx
Contributor Author

phoeenniixx commented Jul 6, 2025

Hi, so dwc is the dataframe output of data_with_covariates() called inside _get_test_dataloaders_from of DecoderMLP_pkg. At first, I thought the issue was that dwc["target"] has some values like 0.0010 and that these are being taken as 0 somewhere, but I didn't find any concrete evidence for this.

Then maybe the issue was different float types, as you said on Discord, but as we know (and I checked) dwc["target"] is numpy.float64 on both Ubuntu and Windows. So this was also a dead end.

Then I found that in LogNormal of torch.distributions, scale is bounded to be positive, and it's the same bound that is failing. So I thought maybe this could be the issue? But rescale_parameters of LogNormalDistributionLoss [source] already adds F.softplus; still, to be sure, I added F.softplus to the scale in map_x_to_distribution (see here):

def map_x_to_distribution(self, x: torch.Tensor) -> distributions.LogNormal:
    scale = F.softplus(x[..., 1])
    return self.distribution_class(loc=x[..., 0], scale=scale)

And obviously this also didn't work.

After some time, I found that somehow y_actual (from here) is getting the value 0.0 in tensors. This is because y in training_step is getting this value 0 from the batch. So my current hypothesis is that the transformation of y to tensors is the actual issue. I still need some time to diagnose.

From this experience I learned at least one thing: debugging should be "targeted" and not just random guesses. I think my random guesses have made this process so complex.

Edit: just to make it clearer and add more context on how y ends up with 0, this is a value I copied from the debugger for y in training_step:

y = (tensor([[1., 1., 1.],
        [1., 0., 0.]]), None)

and in some cases (when I would rerun the debugger), the value is something like:

y = (tensor([[1., 1., 1.],
        [1., 1., 0.]]), None)

or

y = (tensor([[1., 1., 1.],
        [1., 1., 1.]]), None)

In the last case (when there is no 0), y would get a 0 in some other batch, so eventually it will fail again.
One thing I noticed, though, is that the 0 appears only in the second row; the first row of the tensor is always 1.
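
The failure mode itself reproduces in isolation; a standalone sketch (independent of the PR's code) showing that a single 0 in the scored tensor trips LogNormal's support check:

import torch
from torch import distributions

# LogNormal's support is GreaterThan(0.0), so one 0 in the value tensor
# raises the same ValueError as in the CI log above.
d = distributions.LogNormal(loc=torch.zeros(2, 3), scale=torch.ones(2, 3))
value = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])
d.log_prob(value)  # ValueError: Expected value argument ... to be within the support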

@phoeenniixx
Contributor Author

Then maybe the issue was different float types, as you said on Discord, but as we know (and I checked) dwc["target"] is numpy.float64 on both Ubuntu and Windows

To see if the float type changes after the transformation to tensors, I first need to find the place where this transformation is done; there I can add an assert and check the float types. But I will need some time, as where exactly this transformation happens and how the data moves across TimeSeriesDataset is still a mystery to me in many ways :)
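
A minimal sketch of such an assert, with a stand-in for the data (the real dwc["target"] comes from data_with_covariates(); the conversion site still has to be located, as said above):

import numpy as np
import pandas as pd
import torch

# Stand-in for dwc["target"], which was checked to be numpy.float64 above.
target = pd.Series(np.array([1.0, 0.0010, 2.5], dtype=np.float64))

t = torch.as_tensor(target.to_numpy())  # torch.as_tensor preserves the numpy dtype
assert t.dtype == torch.float64, f"dtype changed during conversion: {t.dtype}"
assert (t > 0).all(), "non-positive values appeared during conversion"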

@fkiraly
Collaborator

fkiraly commented Jul 7, 2025

To quickly summarize our discussion in the tech meet today: the LogNormal issue looks like a cursed failure that might be hard to diagnose (and possibly even a problem in torch itself).

It also shows that the test framework works well! It seems to be a genuine issue rather than something to do with the framework - and the framework is intended to increase coverage so that exactly these combination cases get picked up.

The approach I would recommend is to park detected failures that are not immediately clear how to fix, and move them to issues - after all, the failures were there before, just not detected. Add some skipping mechanism to skip precisely the failure, open an issue with a description, and then we can deal with it later - and turn the test back on in the same PR as the fix (this pointer should be in the issue). A sketch of such a skip is below.
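
A minimal sketch of such a skipping mechanism; the registry, its contents, and the issue link are illustrative placeholders, not the PR's actual code.

import sys
import pytest

# Failures parked as issues, keyed by (model, loss); each entry is removed
# again in the same PR that fixes it, as described above.
KNOWN_FAILURES = {
    ("DecoderMLP", "LogNormalDistributionLoss"): "<link to tracking issue>",
}

def maybe_skip(model_name, loss_name):
    issue = KNOWN_FAILURES.get((model_name, loss_name))
    if issue is not None and sys.platform == "win32":
        pytest.skip(f"known failure on Windows, tracked in {issue}")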

This approach is a corollary of "test-driven development", where you start with tests that fail and then gradually change your code to make the tests pass - in separate PRs, step-wise. If you write new code, all tests fail initially; but here we are improving coverage and consistency of an existing code base, so only some tests will fail.
