[ENH] inspectable set-valued domains for distributions #292

VascoSch92 · 2024-05-03T06:43:00Z

Reference Issues/PRs

First version fro implementation of #244

What does this implement/fix? Explain your changes.

The PR implements a first version for set-valued domains for distributions.

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

There are no tests for the changes. This is a point in the PR checklist. For instance, this is just a. draft PR to discuss a first approach distributions domains.

Any other comments?

As said, this is a draft PR to start a discussion about domains for distributions.

For instance, the most important domains for distributions that skpro needs are

interval of the real line (class Interval)
finite sets of the real line (class Finite)
direct product of the above domains (class Product)

Each of these class implements the methods:

__contains__ which can be used to elegantly check if an element is in the domain of an interval
boundary which can be used for integration

I found no other method that can be useful, for now. Perhaps, a couple of example could clarify the situation.

About Infinite sets of point. They can be of two kinds:

discrete
continuous (if they have an accumulation point, e.g., $ \{ \frac{1}{n} : n \in \mathbb{N} \}$

Which of them do we need? There are examples of distributions which needs these sets?

Finally, I did not implement tags, because of two reasons:

I'm not 100% sure how tags works (the are tags of subclasses override tags of parent classes or are added?)
I didn't know which tags could be useful in this context.

PR checklist

I've added myself to the list of contributors with any new badges I've earned :-)
How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
See here for full badge reference
The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
Implement tests for the new classes
Document the various classes
Add _tags

skpro/domains/domains.py

fkiraly

Thanks a lot!

I have now had time to review and think about this design.
I think it's great, of course still work in progress.

I want to add three things to think about - these are not change requests but thinking points.

How do we use this from BaseDistribution? I was thinking about two attributes - domain and support. The support is the locus of non-zero mass/density, domain is the full range. For now, domain is perhaps redundant, but it will get useful for categorical rv.
- a technical point, how would this look like? I would use a property and return a clone each time.
an open question that I have no good answer for - how do we deal with the a dataframe/array case of distribution? Suppose we have the Uniform distribution, it has lower and upper interval. But this can be different for each index. So what is support in this case? Some kind of array set? You have Product, but I was wondering whether there should be a more direct support of the array case - building it into the core API rather than by composition, like with BaseDistribution.
I was also thinking about the Lebesgue decomposition, i.e., continuous part plus discrete part. We need to express both domains, somehow, and make them queriable.
finally, this is more of a crazy idea: should we decorate all functions with their range and domain? You could think of getattr(my_distr, "pdf") returning not a method but an object, like loc and iloc in pandas or BaseDistribution. This could have attributes, so doing sth like my_dist.pdf.domain and my_dist.pdf.range will return a set object. What do you think of it? Maybe it is overengineered, I would not do it right away anyway.

VascoSch92 · 2024-05-07T10:33:56Z

How do we use this from BaseDistribution? I was thinking about two attributes - domain and support. The support is the locus of non-zero mass/density, domain is the full range. For now, domain is perhaps redundant, but it will get useful for categorical rv.
a technical point, how would this look like? I would use a property and return a clone each time.

I think using domain and support is a good idea, and cloning each time is clean.

an open question that I have no good answer for - how do we deal with the a dataframe/array case of distribution? Suppose we have the Uniform distribution, it has lower and upper interval. But this can be different for each index. So what is support in this case? Some kind of array set? You have Product, but I was wondering whether there should be a more direct support of the array case - building it into the core API rather than by composition, like with BaseDistribution.

I don't have an answer for that. What can I suggest is that we start to introducing the domain of a distribution for easy distribution (Normal for example) and then we progress with more challenging one. In this way we can understand what we have to change to serve at best our final scope.

finally, this is more of a crazy idea: should we decorate all functions with their range and domain? You could think of getattr(my_distr, "pdf") returning not a method but an object, like loc and iloc in pandas or BaseDistribution. This could have attributes, so doing sth like my_dist.pdf.domain and my_dist.pdf.range will return a set object. What do you think of it? Maybe it is overengineered, I would not do it right away anyway.

It is an interesting Idea. I'm thinking, probability density function, logarithmic probability function and cumulative density function have aways the same domain right? Therefore, this decorator it is not helpful to define domains for function associate to distributions (perhaps I'm wrong and there is some case where it can be used for that). On the other side, this could be a nice API for the final user :-)

Should I finalise the class and then we can work on a branch where we add the domain/support for distribution already coded? What do you think?

fkiraly · 2024-05-08T17:22:41Z

I don't have an answer for that. What can I suggest is that we start to introducing the domain of a distribution for easy distribution (Normal for example) and then we progress with more challenging one. In this way we can understand what we have to change to serve at best our final scope.

Hm, ok. I suppose that works, incremental design is better than overdesign.

It is an interesting Idea. I'm thinking, probability density function, logarithmic probability function and cumulative density function have aways the same domain right?

For the full domain, yes (the reals), but supports for the discrete part would be most interesting here, since that is not directly inspectable right now. I agree though that starting simple would be best.

Should I finalise the class and then we can work on a branch where we add the domain/support for distribution already coded? What do you think?

Agreed, let's see whether this works or whether we hit limitations.

VascoSch92 · 2024-05-13T15:05:34Z

Hey @fkiraly

I worked a little on the PR request and I think it is ready for a review.

Let me know what do you think.

Some remarks:

the way I coded the unit tests can do better.
what do you think about merging the module not in main but in another branch, such that we can experimenting? Once we have a stable API, we can merge in main.
For instance, just finite, interval and product sets are implemented.

fkiraly · 2024-05-14T14:27:59Z

Yes, I agree with developing this off main to a point where the functionality is integrated with the BaseDistribution class sufficiently.

I would suggest, let's make a branch on sktime called sets, I will base it off this branch, and we continue making PR to that branch?

Also, it would be nice to have a short write-up of your plans for the next steps, perhaps in the issue #244. I'm happy to halp with that or set up a meeting.

fkiraly · 2024-05-14T14:29:11Z

done, there is now a branch called sets on skpro.

VascoSch92 · 2024-05-15T12:13:58Z

Perfect!

Should I close this PR and checkout the new branch?

I will take sometime to write in #244 what are the plans for the future, such that we can discuss ;-)

fkiraly · 2024-05-15T13:18:08Z

Should I close this PR and checkout the new branch?

I would say, let's leave this branch and PR as is, as it "freezes" the current state. Perhaps turn it into a draft.
The new PR will point to a branch that will change, so even if they are now identical they are not redundant in function.

Further, I think the tests are failing due to lack of get_test_params - perhaps you want to fix that to ensure BaseObject API compliance before we move on.

VascoSch92 · 2024-05-15T18:05:52Z

Should I close this PR and checkout the new branch?

I would say, let's leave this branch and PR as is, as it "freezes" the current state. Perhaps turn it into a draft. The new PR will point to a branch that will change, so even if they are now identical they are not redundant in function.

Further, I think the tests are failing due to lack of get_test_params - perhaps you want to fix that to ensure BaseObject API compliance before we move on.

ok perfect. This PR is draft now and I will open another PR into the other branch to fix the tests.

The tests were passing locally. I should give a look on how you implement the tests in the other modules.

fkiraly · 2024-05-15T19:02:53Z

The tests were passing locally. I should give a look on how you implement the tests in the other modules.

What were you trying for this? Have you tried check_estimator on the new objects?

VascoSch92 · 2024-05-15T20:55:37Z

What were you trying for this? Have you tried check_estimator on the new objects?

I implemented just unit tests. I was just running them using pytest directly. Should I use the machinery developed for the BaseObject?

fkiraly · 2024-05-15T22:29:34Z

Should I use the machinery developed for the BaseObject?

It's automatically applied by the CI, and you can mimick that by using check_estimator on the objects. For the specific classes, you'd need implemented get_test_params to pass - or, defaults.

VascoSch92 · 2024-05-16T12:07:54Z

Should I use the machinery developed for the BaseObject?

It's automatically applied by the CI, and you can mimick that by using check_estimator on the objects. For the specific classes, you'd need implemented get_test_params to pass - or, defaults.

Sorry i really don't get 100% how it works. Do you have an example? I was trying to understand the tests for distribuitions but with not so much success.

fkiraly · 2024-05-16T14:54:08Z

Sure! here is some explanation on the test classes:

https://www.sktime.net/en/latest/developer_guide/testing_framework.html#an-illustrative-example

Instances of objects are constructed from the parameter settings returned in get_test_params. If the latter contains no settings, and there are non-default params in __init__, the framework will not know how to construct an instance.

VascoSch92 · 2024-05-16T15:45:10Z

Is it mandatory to use this topology of testing if we are extending from BaseObject ?

fkiraly · 2024-05-16T16:18:49Z

Is it mandatory to use this topology of testing if we are extending from BaseObject?

Yes, in the sense that BaseObject descendants should be BaseObject API compliant, otherwise you lose the composability etc.
The requirement weight from this is typically minimal, such as specifying test parameters and not overriding methods.

VascoSch92 · 2024-05-18T19:44:43Z

Is it mandatory to use this topology of testing if we are extending from BaseObject?

Yes, in the sense that BaseObject descendants should be BaseObject API compliant, otherwise you lose the composability etc.

The requirement weight from this is typically minimal, such as specifying test parameters and not overriding methods.

Last question: there is a way to trigger the tests in the exact same way they are triggered in the workflow? Such that i can check that everything works before opening a Pr

fkiraly · 2024-05-19T02:31:47Z

Last question: there is a way to trigger the tests in the exact same way they are triggered in the workflow? Such that i can check that everything works before opening a Pr

Yes, running pytest on your local repository with no soft dependencies installed should trigger run-tests-no-extras.

Installing skpro[all_extras] and then triggering the tests should have the same effect as run-tests-all-extras.

VascoSch92 added 2 commits May 3, 2024 08:22

[ENH] inspectable set-valued domains for distributions

f46fd81

delete comments

109eef9

fkiraly reviewed May 4, 2024

View reviewed changes

skpro/domains/domains.py Outdated Show resolved Hide resolved

fkiraly reviewed May 5, 2024

View reviewed changes

update api finite sets

f8e2b0e

update set-valued domains objects

53d4171

VascoSch92 marked this pull request as ready for review May 13, 2024 15:05

VascoSch92 marked this pull request as draft May 15, 2024 18:06

Merge branch 'sktime:main' into main

3c619d2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] inspectable set-valued domains for distributions #292

[ENH] inspectable set-valued domains for distributions #292

VascoSch92 commented May 3, 2024

fkiraly left a comment

VascoSch92 commented May 7, 2024

fkiraly commented May 8, 2024 •

edited

Loading

VascoSch92 commented May 13, 2024

fkiraly commented May 14, 2024

fkiraly commented May 14, 2024

VascoSch92 commented May 15, 2024

fkiraly commented May 15, 2024

VascoSch92 commented May 15, 2024

fkiraly commented May 15, 2024

VascoSch92 commented May 15, 2024

fkiraly commented May 15, 2024

VascoSch92 commented May 16, 2024

fkiraly commented May 16, 2024

VascoSch92 commented May 16, 2024

fkiraly commented May 16, 2024

VascoSch92 commented May 18, 2024

fkiraly commented May 19, 2024

[ENH] inspectable set-valued domains for distributions #292

Are you sure you want to change the base?

[ENH] inspectable set-valued domains for distributions #292

Conversation

VascoSch92 commented May 3, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

fkiraly left a comment

Choose a reason for hiding this comment

VascoSch92 commented May 7, 2024

fkiraly commented May 8, 2024 • edited Loading

VascoSch92 commented May 13, 2024

fkiraly commented May 14, 2024

fkiraly commented May 14, 2024

VascoSch92 commented May 15, 2024

fkiraly commented May 15, 2024

VascoSch92 commented May 15, 2024

fkiraly commented May 15, 2024

VascoSch92 commented May 15, 2024

fkiraly commented May 15, 2024

VascoSch92 commented May 16, 2024

fkiraly commented May 16, 2024

VascoSch92 commented May 16, 2024

fkiraly commented May 16, 2024

VascoSch92 commented May 18, 2024

fkiraly commented May 19, 2024

fkiraly commented May 8, 2024 •

edited

Loading