Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't enforce names on input datasets (source, target, value) #56

Open
ELC opened this issue Aug 2, 2018 · 6 comments
Open

Don't enforce names on input datasets (source, target, value) #56

ELC opened this issue Aug 2, 2018 · 6 comments

Comments

@ELC
Copy link

ELC commented Aug 2, 2018

When working with pre-processed datasets it may be the case where there is no column name source or target or value and even worse, they may be present with a completely different meaning!

I suggest to add optional parameters to set the name of the columns corresponding to this function, this way one shouldn't be "renaming" columns just to fit the library specs.

@ricklupton
Copy link
Owner

Hi, thanks for your comment and sorry for the slow response. I like the suggestion in principle, but I don't know if would actually be worth the extra complexity? Currently you can do something like this:

dataset = floweaver.Dataset(my_data_frame.rename(columns={
    'my_source': 'source',
    'my_target': 'target',
}))

while what you're suggesting I guess would look like this?

dataset = floweaver.Dataset(my_data_frame,
                            source_column='my_source',
                            target_column='my_target')

which is probably worthwhile though not all that much neater. Is that what you mean?

Note that the value column name is already completely flexible, see the measures argument to weave() and this example.

@ELC
Copy link
Author

ELC commented Sep 13, 2018

I didn't know about the measures argument! That solves the problem with the value column.

I believe it is worth it because of the following situations:

  • It is much clearner and safer to make fewer assumptions about the dataset we are expecting so a default value for the columns could be source and target with a Warning showed to the user when they weren't explicitly set up. It makes everything seems less "You have to do this because it's the framworks way".
  • A dataset may have already a column named source or target that are indeed use in some part of the sankey so in this case more than 2 columns should be renamed and those renamings should be remember from that point.
  • A dataset might be used in several ways but with the same structure, so one dataset could be use to represent a flow but maybe the reversal flow is also required so in this case the source and target has to be swapped and that can lead to lots of confussions because constant renaming.
  • A dataset could be used to represent flows with increasing level of detail. Imaging a flow with a source, a target and 10 waypoints in between, now, one is asked to represent the same flow but instead of using the same source one should start with the 1st waypoint, and this goes on until one finally make a flow with the 10th waypoint as a source. Renaming the columns on each step just to follow the framwork spec is a nightmare! (I had to do exactly this kind of work at University). Leaving partitioning, ordering and nodes aside, one could easily manage this situation if the source and target (in this case just the source) could be set via a parameter and not a column name

@ricklupton
Copy link
Owner

A dataset could be used to represent flows with increasing level of detail. Imaging a flow with a source, a target and 10 waypoints in between, now, one is asked to represent the same flow but instead of using the same source one should start with the 1st waypoint, and this goes on until one finally make a flow with the 10th waypoint as a source. Renaming the columns on each step just to follow the framwork spec is a nightmare! (I had to do exactly this kind of work at University). Leaving partitioning, ordering and nodes aside, one could easily manage this situation if the source and target (in this case just the source) could be set via a parameter and not a column name

There's a difference between the flows in the dataset, and the way they're visualised. In this case why don't you change the partition of the source?

nodes = {
   'x': ProcessGroup(['a', 'b', 'c'], partition1),
    'y': Waypoint(partition2),
    'z': ProcessGroup(['end'], partition3),
}

bundles = [
    Bundle('x', 'z', waypoints=['y'])
]

To make it look like it starts with the first waypoint (partition2) it would become

nodes = {
   'x': ProcessGroup(['a', 'b', 'c'], partition2),
    'z': ProcessGroup(['end'], partition3),
}

bundles = [
    Bundle('x', 'z', waypoints=[])
]

You don't have to change the source because it's the /same data/, just grouped differently.

@ricklupton
Copy link
Owner

But generally, ok, I think it'd be reasonable to add this. But it's not trivial as source and target are hard-coded in lots of places. Would you like to have a go?

@ELC
Copy link
Author

ELC commented Sep 14, 2018

I can try but not in the short term, I really like the framework and I think it's worth doing. Is there any way to quickly test the code? I don't want to introduce new errors

@ricklupton
Copy link
Owner

Thanks, see my reply in #55

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants