Prevent accidental column re-ordering #37

Closed
schuderer opened this issue Jul 29, 2019 · 1 comment · Fixed by #87
Labels
enhancement New feature or request needs discussion Several opinions should be heard and considered

Comments

@schuderer
Owner

Description

Right now, columns can arrive in a different order through the API (args_dict) than in the data sources. The user has to be aware of this and make sure that the inputs are not jumbled, because otherwise the wrong weights can be applied to the variables, depending on the model (scikit-learn, for example, does not automatically make sure that re-ordered columns are processed correctly).

Unfortunately, we cannot simply sort the columns of the data sources and the API silently, because we don't know whether data sources are combined, processed, etc. before they are used. In other words, we don't know the definitive set of column names beforehand.
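To make the failure mode concrete, here is a minimal sketch. It uses a toy positional model as a stand-in (it is not scikit-learn itself), and the column names, weights, and values are made up for illustration:

```python
# Toy stand-in for a model that identifies features by position only.
# WEIGHTS is assumed to have been learned for columns in training order:
# ["age", "income"]. All names and numbers here are illustrative.
WEIGHTS = [2.0, -1.0]


def predict(row_values):
    """Weighted sum of the values, matched to the weights purely by position."""
    return sum(w * v for w, v in zip(WEIGHTS, row_values))


# The training data source delivered the columns as age, income.
datasource_row = {"age": 30, "income": 5}
# The API caller sent the same values, but with the keys in a different order.
args_dict = {"income": 5, "age": 30}

from_datasource = predict(list(datasource_row.values()))  # 2*30 - 1*5
from_api = predict(list(args_dict.values()))              # 2*5 - 1*30

# Same inputs, different results, and no error is raised anywhere.
print(from_datasource, from_api)
```

The two calls receive the same information but disagree, which is exactly the silent mix-up described above.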

Suggestion

There are several ways to handle this:

  1. Let the user deal with it and put a warning in the docs.
  2. Provide a helper function, e.g. with_columns_sorted, which the user can call on a dict or DataFrame and which returns an object of the same type, but with the columns sorted deterministically (e.g. alphanumerically).
  3. Assume that datasource columns = API keys and provide an option in the model/API config, like sort_columns_of_datasource: <datasourcename>, to sort the columns of a specific datasource as well as those of the args_dict. Pro: makes the happy case simple. Con: creates confusion for non-trivial models.
  4. Similar to option 3, but instead of sorting, keep the datasource's columns as they are, save their order internally with the model metadata, and apply that order to the args_dict.
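Options 2 and 4 could look roughly like the sketch below. Both functions are hypothetical: with_columns_sorted is only a name proposed in this issue, apply_column_order and the saved_columns list are invented here, and the non-dict branch assumes a pandas-style object with a .columns attribute that supports selection by a list of names.

```python
from collections import OrderedDict
from collections.abc import Mapping


def with_columns_sorted(data):
    """Option 2: return a copy of `data` with its columns in sorted order.

    Hypothetical helper. Plain dicts and OrderedDicts are rebuilt with the
    same type; anything else is assumed to be DataFrame-like.
    """
    if isinstance(data, Mapping):
        return type(data)((key, data[key]) for key in sorted(data))
    return data[sorted(data.columns)]


def apply_column_order(args_dict, saved_columns):
    """Option 4: reorder `args_dict` to match an order saved with the model.

    `saved_columns` stands for the column list that would be stored in the
    model metadata at training time (an assumption of this sketch).
    """
    missing = [c for c in saved_columns if c not in args_dict]
    extra = [k for k in args_dict if k not in saved_columns]
    if missing or extra:
        raise ValueError(
            f"column mismatch: missing={missing}, unexpected={extra}"
        )
    return OrderedDict((col, args_dict[col]) for col in saved_columns)


# Usage: both calls yield the keys in the order age, income.
sorted_args = with_columns_sorted({"income": 5, "age": 30})
ordered_args = apply_column_order({"income": 5, "age": 30}, ["age", "income"])
```

Raising on a mismatch in apply_column_order is a design choice of this sketch: a missing or unexpected key is more likely a caller error than something to paper over.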

Current Workaround

The user has to deal with this manually in the data preparation.

@schuderer schuderer added the enhancement New feature or request label Jul 29, 2019
@schuderer schuderer added this to To do in Release 1.0.0 tracker via automation Jul 29, 2019
@schuderer schuderer changed the title Enforce column ordering Prevent column re-ordering Jul 29, 2019
@schuderer schuderer changed the title Prevent column re-ordering Prevent accidental column re-ordering Jul 29, 2019
@schuderer
Owner Author

schuderer commented Dec 12, 2019

  1. Add an option “sort_columns: true/false” to the datasources’ and the model’s config (default false) and make args_dict an OrderedDict. The docs should of course explain how to deal with this when data sources need to be joined, point out that mllaunchpad’s own sort_columns function can be used, etc.
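As a rough sketch of how such a flag might be honored on the API side: prepare_args and the flat config dict below are hypothetical and not mllaunchpad's actual interface, and the key is assumed to be spelled sort_columns.

```python
from collections import OrderedDict


def prepare_args(args_dict, config):
    """Return args_dict as an OrderedDict, sorted only if the config asks for it.

    Hypothetical helper: `config` is assumed to be a plain dict in which the
    flag lives under the key "sort_columns" and defaults to False.
    """
    sort = config.get("sort_columns", False)
    keys = sorted(args_dict) if sort else list(args_dict)
    return OrderedDict((key, args_dict[key]) for key in keys)


# With the flag on, keys come out sorted; with it off, caller order is kept.
sorted_on = prepare_args({"income": 5, "age": 30}, {"sort_columns": True})
sorted_off = prepare_args({"income": 5, "age": 30}, {})
```

Defaulting to false keeps existing models unaffected, which matches the "default false" in the comment above.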

@schuderer schuderer moved this from To do to In progress in Release 1.0.0 tracker Jan 15, 2020
@schuderer schuderer self-assigned this Feb 21, 2020
@schuderer schuderer added the needs discussion Several opinions should be heard and considered label Mar 12, 2020
Release 1.0.0 tracker automation moved this from In progress to Done Mar 20, 2020