Prevent accidental column re-ordering #37

schuderer · 2019-07-29T10:19:10Z

Description

Right now, columns may arrive in a different order via the API (args_dict) than in the data sources. The user has to make sure that the inputs are not jumbled, because otherwise the wrong weights can be applied to the variables, depending on the model (scikit-learn, for example, does not automatically make sure that re-ordered columns are processed correctly).
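
To illustrate the failure mode, here is a minimal sketch in plain Python. The column names, weights, and the `predict` function are hypothetical stand-ins for any model that consumes its inputs by position rather than by name:

```python
# Hypothetical weights, learned with the training columns in this order.
TRAIN_COLUMNS = ["age", "income"]
WEIGHTS = [2.0, 0.5]


def predict(values):
    """Positional linear model: it has no notion of column names."""
    return sum(w * v for w, v in zip(WEIGHTS, values))


# args_dict as it might arrive via the API, with keys in a different order.
args_dict = {"income": 50.0, "age": 30.0}

# Passing values in arrival order applies each weight to the wrong variable.
naive = predict(list(args_dict.values()))

# Re-aligning by name restores the order the model was trained with.
aligned = predict([args_dict[c] for c in TRAIN_COLUMNS])

print(naive, aligned)  # 115.0 85.0
```

Both calls succeed without any error, which is why the problem is easy to miss.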

Unfortunately, we cannot simply sort the columns of the data sources and the API silently,
because we don't know whether data sources are combined, processed, etc.
before they are used. In other words, we don't know the definitive set of column names
beforehand.

Suggestion

There are several ways to handle this:

1. Let the user deal with it and put a warning in the docs.
2. Provide a helper function, e.g. with_columns_sorted, which the user can call on a dict or DataFrame and which returns an object of the same type, but with the columns sorted deterministically (e.g. alphanumerically).
3. Assume that datasource columns = API keys and provide an option in the model/API config like sort_columns_of_datasource: <datasourcename> to sort the columns of a specific datasource as well as those of the args_dict. Pro: makes the happy case simple. Con: creates confusion for non-trivial models.
4. Similar to option 3, but instead of sorting, keep the datasource's columns as they are, internally save their order with the model metadata, and apply that order to the args_dict.
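
The helper idea could be sketched roughly as follows. This is only an illustration under assumptions, not an existing implementation: the body of `with_columns_sorted` is made up here, it sorts by the string form of the column names, and it treats anything that is not a dict as DataFrame-like (having `.columns` and supporting selection by a list of column names, as a pandas DataFrame does):

```python
from collections import OrderedDict


def with_columns_sorted(obj):
    """Return an object of the same kind with its columns in sorted order."""
    if isinstance(obj, dict):
        # Rebuild with the same class, so dict and OrderedDict are preserved.
        # Plain dicts keep insertion order on Python 3.7+.
        return type(obj)((key, obj[key]) for key in sorted(obj, key=str))
    # Assumption: DataFrame-like object; selecting with a list of names
    # returns a new object with the columns in that order.
    return obj[sorted(obj.columns, key=str)]


args_dict = {"income": 50.0, "age": 30.0}
sorted_args = with_columns_sorted(args_dict)
print(list(sorted_args))  # ['age', 'income']

ordered = with_columns_sorted(OrderedDict([("b", 1), ("a", 2)]))
```

Applying the same helper to both the training DataFrame and the args_dict would give both the same column order, provided they contain the same set of column names.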

Current Workaround

User has to deal with this manually in the data preparation.
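
In practice, the manual workaround might look something like the sketch below: the user records the column order at training time and re-applies it to the incoming args_dict before prediction. The function name `align_to_columns` and the example column names are hypothetical and only serve as illustration:

```python
def align_to_columns(args_dict, column_order):
    """Return a new dict with keys in column_order.

    Raises KeyError if a required column is missing. Keys that are not
    listed in column_order are dropped.
    """
    missing = [c for c in column_order if c not in args_dict]
    if missing:
        raise KeyError(f"args_dict is missing columns: {missing}")
    return {c: args_dict[c] for c in column_order}


# Column order recorded by the user during training (assumed example).
train_columns = ["age", "income"]

# Incoming API arguments in a different order, plus one unused key.
incoming = {"income": 50.0, "request_id": 7, "age": 30.0}

aligned_args = align_to_columns(incoming, train_columns)
print(list(aligned_args.items()))  # [('age', 30.0), ('income', 50.0)]
```

Failing loudly on missing columns is a design choice in this sketch. Silently skipping them would bring back the same kind of hidden misalignment the issue is about.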


schuderer · 2019-12-12T07:44:00Z

Add an option “sort_columns: true/false” to the datasources’ and the model’s config (default: false), and make args_dict an OrderedDict. Of course, also document how to deal with this in case data sources need to be joined, point out that mllaunchpad’s own sort_columns function can be used, etc.

schuderer added the enhancement New feature or request label Jul 29, 2019

schuderer added this to To do in Release 1.0.0 tracker via automation Jul 29, 2019

schuderer changed the title ~~Enforce column ordering~~ Prevent column re-ordering Jul 29, 2019

schuderer changed the title ~~Prevent column re-ordering~~ Prevent accidental column re-ordering Jul 29, 2019

schuderer moved this from To do to In progress in Release 1.0.0 tracker Jan 15, 2020

schuderer self-assigned this Feb 21, 2020

schuderer added the needs discussion Several opinions should be heard and considered label Mar 12, 2020

schuderer mentioned this issue Mar 18, 2020

Column ordering helper function #87

Merged

schuderer closed this as completed in #87 Mar 20, 2020

Release 1.0.0 tracker automation moved this from In progress to Done Mar 20, 2020
