Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset aliases #1075

Closed
wants to merge 14 commits into from

Conversation

Projects
None yet
3 participants
@philippjfr
Copy link
Contributor

philippjfr commented Jan 20, 2017

Addresses #1001. Needs more checking and a lot of unit tests.

@philippjfr philippjfr added the WIP label Jan 20, 2017

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 20, 2017

Looks good though as you say, needs a lot more testing.

The unit tests on travis suggest something went wrong in groupby:

File "/home/travis/build/ioam/holoviews/holoviews/core/data/pandas.py", line 124, in groupby
    group_by = [d.alias for d in dimensions]
AttributeError: 'str' object has no attribute 'alias'

I suspect dimensions is expected to be a list of dimension objects and not strings. The corresponding change is here. It seems to be a line added instead of a line replaced (like the rest of them) so I'm hoping this is just a mistake.

Once the unit tests pass, we'll get the notebooks passing before adding more unit tests...

@philippjfr philippjfr force-pushed the dataset_aliases branch from f65f3b5 to 3970188 Jan 20, 2017

@philippjfr philippjfr force-pushed the dataset_aliases branch from 3970188 to acbbfc9 Jan 20, 2017

@philippjfr

This comment has been minimized.

Copy link
Contributor Author

philippjfr commented Jan 22, 2017

Ready for initial review, need to add to the dimensions method docstring. @jlstevens You can also start working on some aliasing documentation, a standalone tutorial will be helpful but a short example in the Columnar_Data tutorial would also help.

Philipp Rudiger Philipp Rudiger
@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 23, 2017

By far the most important implications of this PR are regarding the changes toDimension. This is something I discussed with @philippjfr at length and we came to the following conclusions:

  • Having a way to always access the 'short name' is the right thing to do. If we were to redesign HoloViews from scratch, we might have kept name as always referring to the short name and added description (or similar) for the long name. It is much too late to make such a change - it would be a big disruption for our users.
  • This PR shows that the idea of .alias is sound. In addition, these changes can be made to work without disrupting the rest of the code.
  • The problem is the name as alias does not sound like a fundamental core attribute of Dimension. It is feels off to see core code working with dimension referencing alias everywhere instead of name, especially as the 'alias system' has almost been a hidden feature till now. To my mind, alias sound optional, and not a fundamental concept in the way name is.

As is often the case it comes down to naming. I quite liked the idea of label until Philipp pointed that label is already used in at least two different ways already. The other alternatives were identifier (bit too long) and id which is short but clashes with the built-in name and sounds like a number (and we already have an internal concept of anid). We didn't particularly like ident as a shortened version of identifier. Another suggestion we considered was varname.

We decided that the current best suggestion is key (hence Philipp's latest commit) due to these advantages:

  • Sounds like an fundamental core concept: a dimension key sounds like a more important thing than a dimension alias or even a dimension name.
  • It seems compatible with the existing notion of keys both in terms of methods operating on dimensions (e.g __getattr__) and in terms of how HoloMap keys are expressed by kdims.
  • Even shorter than name.

One possible disadvantage is that all dimensions will have key whether or not they are key or value dimensions. That said, dimensions already have an optional values attribute and we didn't worry that this would cause confusion with value dimensions.

@@ -677,7 +677,7 @@ def dimensions(self, selection='all', label=False):
else:
raise KeyError("Invalid selection %r, valid selections include"
"'all', 'value' and 'key' dimensions" % repr(selection))
return [(dim.name if label == 'long' else dim.alias)
return [(dim.name if label == 'long' else dim.key)

This comment has been minimized.

Copy link
@jlstevens

jlstevens Jan 23, 2017

Contributor

Now that we have label='key' I am wondering if the alternative isn't label='long' but label='name'?

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 23, 2017

Note that if we make key a parameter, we'll have the following ways of declaring a dimension:

1 - Dimension('Spectrum'): Both name and key are 'Spectrum'. Most of HoloViews currently use this type of declaration:

If key a parameter, we now get option 4 for specifying a name of 'Frequence Spectrum' and a key of 'Spectrum':

2 - Dimension(('Spectrum', 'Frequency spectrum')) : Tuple format.
3 - Dimension( hv.util.Aliases(Spectrum='Frequency spectrum').Spectrum) : Using the helper.
4 - Dimension('Frequency Spectrum', key='Spectrum') : If key is a parameter.

I don't think having multiple approaches is necessarily an issue: the key (if specified) is always the short version of the name. Compared to option 1, option 4 simply makes it easier to use a longer name by explicitly specifying a short key.

Option 2. is probably the most problematic as you need to remember the order of the tuple is (<short>, <long>) which is why I would probably always recommend option 3 over option 2. Having option 3. is still useful as it lets you define your aliases once instead of having to specify key everywhere (see the example in #312).

This leads me to the following recommendation:

1 - Dimension('Spectrum'): Use this if you are happy for the short name to match the long name.
2 - Dimension( al.Spectrum) : Use the Aliases object to allow easy repeated declaration of key.
3 - Dimension('Frequency Spectrum', key='Spectrum') : Explicit declaration of the shorter key.

Of course if we don't want to support option 3. in this revised list, we can leave key as an attribute. We might want to do this to encourage the use of the Aliases helper - if you don't keep the associations between names and keys consistent, things are likely to get messy.

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 23, 2017

One annoyance is that the order between options 2 and 4 is switched:

2 - Dimension(('Spectrum', 'Frequency spectrum')) : Tuple format.
4 - Dimension('Frequency Spectrum', key='Spectrum') : If key is a parameter.

I now wish I had reversed the tuple so that these would be equivalent:

2 - Dimension(('Frequency spectrum', 'Spectrum')) : Tuple format.
4 - Dimension('Frequency Spectrum', key='Spectrum') : If key is a parameter.

I can still switch the tuple ordering around but it would break backwards compatibility. At least if I made a matching reversal in hv.util.Aliases, then option 3. above would keep working.

I am tempted by this change as it would look more consistent and I never really recommended the tuple format precisely because you have to always keep those declarations consistent (and remember the right order).

One other approach would be to have Aliases not return tuples but something similar but distinguishable (named tuples?). Then only named tuples would be allowed - support for regular tuples could be restored through a backwards compatibility switch. This would help encourage use of the Aliases helper function while also making things more explicit for anyone wanting to use their own named tuples:

  1. As before using Aliases (unchanged, recommended):
Dimension( hv.util.Aliases(Spectrum='Frequency spectrum').Spectrum)
  1. The pointless way (shown to define the named tuple):
from collections import namedtuple
alias = namedtuple('alias', ['name','key']) # Might as well be in HoloViews!
Dimension(alias('Frequency spectrum', 'Spectrum')) # Explicit and consistently ordered.
  1. Alternatively to do a single alias without using Aliases (which is still recommended!):
Dimension(hv.util.alias('Frequency spectrum', 'Spectrum'))
Dimension(hv.util.alias(name='Frequency spectrum', key='Spectrum')) # Equivalent

I hope you can follow what I'm proposing - I think I like it!

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 23, 2017

As one last note, you can disable implicit named tuples (mapping based on order) by defining hv.util.alias as follows:

def alias(**kwargs):
   if set(kwargs.keys()) != set(['name', 'key']):
      raise KeyError('Alias requires a name and a key')
   nt = namedtuple('alias', ['name','key'])
   return nt(name=kwargs['name'], key=kwargs['key'])

This disables the top version of option 3. above so you would have to be explicit and use:

Dimension(hv.util.alias(name='Frequency spectrum', key='Spectrum'))
@philippjfr

This comment has been minimized.

Copy link
Contributor Author

philippjfr commented Jan 23, 2017

I think it's perfectly natural to declare the name of the data in your datasource first and the label second so I'm hesitant about making the proposed change. The real issue as you point out above that ideally we'd have declared a name and a description or label. Aliases were designed to give long names short aliases for easier access, however imo the more common use case is for the user to want to give their shortened variable name a longer more readable label.

Overall I feel explaining that you supply an alias for the name of your data in the datasource as a tuple is much more intuitive than having to grasp aliases and the difference between key and name, particularly because the actual distinction is inconsequential because the API is agnostic about which one you use.

In a simple example I feel this:

df = pd.DataFrame({'x': range(10), 'y': np.random.rand(10)})
hv.Curve(df, kdims=[('x', 'X-Label')], vdims=[('y', 'Y-label')])

is much clearer than this:

df = pd.DataFrame({'x': range(10), 'y': np.random.rand(10)})
hv.Curve(df, kdims=[hv.util.alias(name='X-label', key='x')], vdims=[hv.util.alias(name='Y-label', key='y')])

If you really do think this is unclear I will investigate whether we can't have .name and .label/description instead. It may not be as bad as I thought because the API generally doesn't care and the axis labels are easily fixed.

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 23, 2017

The recommended thing would be:

al = hv.util.Aliases(x='X-label', y='Y-label')
df = pd.DataFrame({'x': range(10), 'y': np.random.rand(10)})
hv.Curve(df, kdims=[al.x], vdims=[al.y])

I only showed the underlying named-tuple approach to illustrate what would be happening behind the scenes.

@philippjfr

This comment has been minimized.

Copy link
Contributor Author

philippjfr commented Jan 23, 2017

Opened a separate branch to check whether we can't just switch them. Overall it doesn't seem too bad: https://travis-ci.org/ioam/holoviews/builds/194476686 so perhaps it's doable with a bit more work.

@jbednar

This comment has been minimized.

Copy link
Contributor

jbednar commented Jan 23, 2017

imo the more common use case is for the user to want to give their shortened variable name a longer more readable label.

I agree -- people usually want long/complex/typeset names for the viewable labels in figures, and rarely want long/complex/encoded names as Python attributes or Pandas column names. So the curent aliases support seems backwards, and if we're ever to have a chance to fix that, it seems like now is the time.

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 23, 2017

I'm also in agreement. My main concerns are:

  • Breaking the existing alias system - probably not too bad as it certainly hasn't been used in the official documentation yet.
  • How bad it is to switch around - hopefully Philipp's initial reaction is correct and it won't be too bad!
  • Usual naming issues - as we discussed label is used in a fair few places already and description is a bit long...

Just to make sure we are on the same page - the idea would be that name is always the short version which means it is up to the plotting code to use the longer, less programmer friendly but more descriptive string. Is that right?

@philippjfr

This comment has been minimized.

Copy link
Contributor Author

philippjfr commented Jan 23, 2017

Breaking the existing alias system - probably not too bad as it certainly hasn't been used in the official documentation yet.

Hopefully we can maintain compatibility here, but there's a definitely a few issues to figure out here. Currently the dimension_sanitizer is used both to support actual aliases but also for sanitizing dimension names containing spaces.

Usual naming issues - as we discussed label is used in a fair few places already and description is a bit long...

Agreed, not hugely attached to either, but I don't think description is quite right because it's mostly used as a readable label while description sounds like something longer. @jbednar Any other suggestions?

@jbednar

This comment has been minimized.

Copy link
Contributor

jbednar commented Jan 23, 2017

Seems like a label to me.

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 23, 2017

Seems like a label to me.

I like label as long as it doesn't get confused with group.label (i.e the label on dimensioned objects) the label option when getting dimensions.

@jlstevens

This comment has been minimized.

Copy link
Contributor

jlstevens commented Jan 26, 2017

I've discussed this again with Philipp and I think this is might come together quite nicely! In particular, I think it would be good to use util.Aliases to define all the other dimension parameters. The result would be a system that can be used instead of dimension presets (another mostly undocumented mechanism).

You could use the following to define the dimension name/label:

al = hv.util.Aliases(x='X-label', y='Y-label')

Or dictionaries for the kwargs e.g:

al = hv.util.Aliases(x=dict(label='X-label', range=(0,1))

Now when you refer to al.x, you get a Dimension object you can use. This can work alongside the tuple format and can be made backwards compatible with how util.Aliases worked before if we allow dimensions to be passed into the Dimension constructor (resulting in an equivalent definition).

All this would need to be documented in a new aliases tutorial (that should probably discuss redim as well). I think the old mechanisms should still be compatible but maybe we should add deprecation warnings so there aren't too many ways of doing the same thing.

One of the goals of the new tutorial would be to lay out the recommended approach for different cases e.g library authors, users trying to define things once quickly and users who expect to be reusing dimension definitions over and over again.

@philippjfr

This comment has been minimized.

Copy link
Contributor Author

philippjfr commented Jan 27, 2017

Closing in favor of #1083, turns out it was fairly easy to switch the name and the alias.

@philippjfr philippjfr closed this Jan 27, 2017

@philippjfr philippjfr deleted the dataset_aliases branch Feb 10, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.