Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typehints and keyword-only args for binning #2012

Merged
merged 23 commits into from
Jul 8, 2021
Merged

Typehints and keyword-only args for binning #2012

merged 23 commits into from
Jul 8, 2021

Conversation

jl-wynen
Copy link
Member

No description provided.

@jl-wynen jl-wynen mentioned this pull request Jun 30, 2021
8 tasks
@@ -774,8 +774,8 @@
" 'y':sc.Variable(dims=['time'], unit='m', values=np.random.rand(N)),\n",
" 'time':sc.Variable(dims=['time'], values=(10000*np.random.rand(N)).astype('datetime64[s]')),\n",
" })\n",
"binned = sc.bin(data, edges=[sc.linspace(dim='x', unit='m', start=0.0, stop=1.0, num=5),\n",
" sc.linspace(dim='y', unit='m', start=0.0, stop=1.0, num=5)])\n",
"binned = sc.bin(data, bins=[sc.linspace(dim='x', unit='m', start=0.0, stop=1.0, num=5),\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the rename? That should be a separate discussion, see #1618.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it consistent with histogram.
I thought while I am breaking the API I might as well get this one out of the way. Would you rather postpone it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is: we should have a discussion first before making API changes.

What if we actually prefer to call it edges every rather than bins? We don't want to break things twice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@@ -297,7 +297,7 @@
"outputs": [],
"source": [
"sc.show(a['x',2].value)\n",
"a.bins.concatenate(a, out=a)\n",
"a.bins.concatenate(other=a, out=a)\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in 95%+ uses we only have other, should we support that without keyword?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am generally more in favour of not requiring keywords in most places, so yes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think requiring it for all but the first (non-optional) arg is good, e.g., out in this case should always be keyword only?

"""Sum of each bin.

:return: The sum of each of the input bins.
:seealso: :py:func:`scipp.sum` for summing non-bin data
"""
return _call_cpp_func(_cpp.buckets.sum, self._obj)

def size(self):
def size(self) -> _cpp.DataArray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data array or variable?

"""Set data of the bins"""
_cpp._bins_view(self._data()).data = data

@property
def constituents(self):
def constituents(self) -> Dict[str, Union[_cpp.Variable, _cpp.DataArray]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also str in the union.

other: Optional[Union[_cpp.Variable,
_cpp.DataArray]] = None,
dim: Optional[str] = None,
out: Optional[_cpp.DataArray] = None) -> _cpp.DataArray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If self and other are variable, so is the return type?

@@ -49,3 +49,6 @@ def has_numeric_type(obj: _std_typing.Any) -> bool:

#: Any object that behaves like a scipp.Dataset in most operations.
DatasetLike = _std_typing.Union[DataArrayLike, sc.Dataset]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel this is very misleading, since most users typically use data arrays, which do not behave like datasets for a number of aspects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unify DatasetLike and DataArrayLike into LabelledArrayLike?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@nvaytet nvaytet Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure LabeledArray is the best option here, as to me it would suggest that there is only one array, so it would be for a DataArray. It feels like it doesn't fit Dataset so well...
But I don't have any good alternatives to suggest :-(

Comment on lines 71 to 72
da.groupby(group='x').sum('x'),
sc.groupby(da, group='x').sum('x'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting: What is the best style for functions that typically take only a single mandatory argument? Is enforcing keyword args really the right solution there? I agree that it is the correct way in the second line (since there is more than one argument), but what about the first line? The keyword arg is just repeating the function name. da.groupby('x') should be obvious?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

da.groupby('x') should be obvious?

Not sure I agree. Since groupby has this additional function of binning, it is not super clear which argument is which unless you name them.

But the main problem is, can the generic binding function decide which arguments should be keyword only and which positional? It would be hard to take subtleties like arguments with the same type into account. But of course we can bind groupby manually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't the first (group) arg always required, even if bins are given?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But that was never part of the discussion on requiring keywords. E.g. array requires dims and values but they are still keyword-only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that it is unclear. Personally I feel that any function that has exactly 1 required argument is ok without keywords. For example:

  • sc.scalar works without value=.
  • sc.sum(da) works
  • da.groupby('x') feels natural

Isn't one of the main reasons for keyword args to avoid mixing up arg order? In none of the above cases there is any risk for that, I believe?

It gets more blurry for things such as sc.sum(da, 'x'), which has 2 args. I still feel that is fine, but maybe that is just because I am used to it? Maybe because da.sum('x') does the same and is also clear?

Copy link
Member

@SimonHeybrock SimonHeybrock Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this guideline (exceptions may be considered individually):

  • Functions with a single arg are ok (sc.sin(da)).
  • Functions with two args are ok provided that one of these is fulfilled:
    1. Both args are scipp objects (sc.less(da1, da2)). Exception: sc.atan2 which has a potentially confusing arg order
    2. Second args is a dim (sc.sum(var, 'x')). Maybe an alternative formulation could be: If the func is also bounds as a method with a single arg, it is ok (because bascically it is a variant of the topmost bullet) (var.sum('x') is ok, therefore sc.sum(var, 'x') is ok).
  • Any further optional args must be keyword args.
  • Any functions that are not in the categories above should generally require keyword args.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly agree. But what about to_unit?
Maybe we can change ii. to functions where the first arg is a scipp object. Those can be seen analogous to methods in a Universal Function Call Syntax kind of way. In this case it would always be ok to have the first and potential second arg positional. All additional args would be keyword-only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion above sounds reasonable to me. Seems to balance enforcing keyword arguments where we really should avoid ambiguity and unnecessary typing where there is none. The atan2 example is maybe one to bear in mind as we implement new things, I can't think of any strong similar examples at present (maybe cross product if we had that), but there may be other exceptions like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree that you should allow non-keyword arguments where you only have one argument, or when using free functions like sc.sum().
I think users would be rapidly annoyed if they have to use kwargs everywhere.

'DataArrayLike': 'DataArrayLike',
'DatasetLike': 'DatasetLike',
'LabeledArray': 'LabeledArray',
'MetaDataMap': 'MetaDataMap',
Copy link
Member

@SimonHeybrock SimonHeybrock Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter refers to Coord or Masks? Would Dict instead of Map be more understandable to a Python user?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. The corresponding class in typing is MutableMapping, see definition of MetaDataMap.



def less(x: DataArrayLike, y: DataArrayLike) -> DataArrayLike:
def less(x: LabeledArray, y: LabeledArray) -> LabeledArray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we doing ourselves a favor by introducing a 4th term, LabeledArray? It feels like a synonym for what a DataArray is, but we also mean variables and datasets? What do we gain? Isn't this adding more confusion than necessary? Or is this never displayed to the user, and instead lists the union of variable, data array, and dataset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the aliases set up in conf.py, 'LabeledArray' is shown in the docs. I could take that out, then Sphinx expands it and shows the Union. But it would still show up in the source and help messages.

@@ -70,8 +69,7 @@ def reciprocal(x: DataArrayLike,
return _call_cpp_func(_cpp.reciprocal, x, out=out)


def sqrt(x: DataArrayLike,
out: Optional[DataArrayLike] = None) -> DataArrayLike:
def sqrt(x: LabeledArray, out: Optional[LabeledArray] = None) -> LabeledArray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should out args always require a keyword arg? I think so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

DataArrayLike = _std_typing.Union[sc.Variable, sc.DataArray]
#: Any object that behaves like a scipp.Variable,
#: that is an array with labeled dimensions.
LabeledArray = _std_typing.Union[sc.Variable, sc.DataArray, sc.Dataset]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, now I understand, "labeled" refers to the dims, not the coords (which would be labeling the elements). Then the name does make sense, but is still ambiguous. :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions are welcome!

We could just use VariableLike?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea, I am giving up for now, I think :(

@SimonHeybrock SimonHeybrock merged commit 87de0c3 into main Jul 8, 2021
@SimonHeybrock SimonHeybrock deleted the typehints branch July 8, 2021 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants