Don't enforce names on input datasets (source, target, value) #56

ELC · 2018-08-02T12:36:55Z

When working with pre-processed datasets it may be the case where there is no column name source or target or value and even worse, they may be present with a completely different meaning!

I suggest to add optional parameters to set the name of the columns corresponding to this function, this way one shouldn't be "renaming" columns just to fit the library specs.

The text was updated successfully, but these errors were encountered:

ricklupton · 2018-09-12T20:12:11Z

Hi, thanks for your comment and sorry for the slow response. I like the suggestion in principle, but I don't know if would actually be worth the extra complexity? Currently you can do something like this:

dataset = floweaver.Dataset(my_data_frame.rename(columns={
    'my_source': 'source',
    'my_target': 'target',
}))

while what you're suggesting I guess would look like this?

dataset = floweaver.Dataset(my_data_frame,
                            source_column='my_source',
                            target_column='my_target')

which is probably worthwhile though not all that much neater. Is that what you mean?

Note that the value column name is already completely flexible, see the measures argument to weave() and this example.

ELC · 2018-09-13T01:33:26Z

I didn't know about the measures argument! That solves the problem with the value column.

I believe it is worth it because of the following situations:

It is much clearner and safer to make fewer assumptions about the dataset we are expecting so a default value for the columns could be source and target with a Warning showed to the user when they weren't explicitly set up. It makes everything seems less "You have to do this because it's the framworks way".
A dataset may have already a column named source or target that are indeed use in some part of the sankey so in this case more than 2 columns should be renamed and those renamings should be remember from that point.
A dataset might be used in several ways but with the same structure, so one dataset could be use to represent a flow but maybe the reversal flow is also required so in this case the source and target has to be swapped and that can lead to lots of confussions because constant renaming.
A dataset could be used to represent flows with increasing level of detail. Imaging a flow with a source, a target and 10 waypoints in between, now, one is asked to represent the same flow but instead of using the same source one should start with the 1st waypoint, and this goes on until one finally make a flow with the 10th waypoint as a source. Renaming the columns on each step just to follow the framwork spec is a nightmare! (I had to do exactly this kind of work at University). Leaving partitioning, ordering and nodes aside, one could easily manage this situation if the source and target (in this case just the source) could be set via a parameter and not a column name

ricklupton · 2018-09-13T20:16:23Z

A dataset could be used to represent flows with increasing level of detail. Imaging a flow with a source, a target and 10 waypoints in between, now, one is asked to represent the same flow but instead of using the same source one should start with the 1st waypoint, and this goes on until one finally make a flow with the 10th waypoint as a source. Renaming the columns on each step just to follow the framwork spec is a nightmare! (I had to do exactly this kind of work at University). Leaving partitioning, ordering and nodes aside, one could easily manage this situation if the source and target (in this case just the source) could be set via a parameter and not a column name

There's a difference between the flows in the dataset, and the way they're visualised. In this case why don't you change the partition of the source?

nodes = {
   'x': ProcessGroup(['a', 'b', 'c'], partition1),
    'y': Waypoint(partition2),
    'z': ProcessGroup(['end'], partition3),
}

bundles = [
    Bundle('x', 'z', waypoints=['y'])
]

To make it look like it starts with the first waypoint (partition2) it would become

nodes = {
   'x': ProcessGroup(['a', 'b', 'c'], partition2),
    'z': ProcessGroup(['end'], partition3),
}

bundles = [
    Bundle('x', 'z', waypoints=[])
]

You don't have to change the source because it's the /same data/, just grouped differently.

ricklupton · 2018-09-13T20:17:49Z

But generally, ok, I think it'd be reasonable to add this. But it's not trivial as source and target are hard-coded in lots of places. Would you like to have a go?

ELC · 2018-09-14T00:13:52Z

I can try but not in the short term, I really like the framework and I think it's worth doing. Is there any way to quickly test the code? I don't want to introduce new errors

ricklupton · 2018-09-14T14:58:14Z

Thanks, see my reply in #55

ricklupton added enhancement more-info-needed labels Sep 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't enforce names on input datasets (source, target, value) #56

Don't enforce names on input datasets (source, target, value) #56

ELC commented Aug 2, 2018

ricklupton commented Sep 12, 2018

ELC commented Sep 13, 2018 •

edited

Loading

ricklupton commented Sep 13, 2018

ricklupton commented Sep 13, 2018

ELC commented Sep 14, 2018 •

edited

Loading

ricklupton commented Sep 14, 2018

Don't enforce names on input datasets (source, target, value) #56

Don't enforce names on input datasets (source, target, value) #56

Comments

ELC commented Aug 2, 2018

ricklupton commented Sep 12, 2018

ELC commented Sep 13, 2018 • edited Loading

ricklupton commented Sep 13, 2018

ricklupton commented Sep 13, 2018

ELC commented Sep 14, 2018 • edited Loading

ricklupton commented Sep 14, 2018

ELC commented Sep 13, 2018 •

edited

Loading

ELC commented Sep 14, 2018 •

edited

Loading