Right now, columns might arrive in a different order via the API (`args_dict`) than in the data sources. The user has to make sure that the inputs are not jumbled, because otherwise the wrong weights can be applied to the variables, depending on the model (scikit-learn, for example, does not automatically make sure that re-ordered columns are processed correctly).
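To make the failure mode concrete, here is a minimal sketch in plain Python. The toy linear model, the column names, and the weights are all made up for illustration; they only stand in for any model that consumes features by position rather than by name.

```python
# Hypothetical weights, learned with the columns in the order ["age", "income"].
weights = [2.0, 0.5]

def predict(values):
    """Toy positional model: weights are matched to inputs by index only."""
    return sum(w * v for w, v in zip(weights, values))

# Same data, but the keys arrive in a different order.
row_from_datasource = {"age": 30, "income": 50000}
args_dict = {"income": 50000, "age": 30}

correct = predict(list(row_from_datasource.values()))  # 2.0*30 + 0.5*50000
jumbled = predict(list(args_dict.values()))            # 2.0*50000 + 0.5*30

print(correct, jumbled)  # the two results differ although the data is identical
```

No error is raised in the jumbled case, which is why the problem is easy to miss.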
Unfortunately, we cannot simply sort the columns of the data sources and the API silently, because we don't know whether data sources are combined or otherwise processed before they are used. In other words, we don't know the definitive set of column names beforehand.
Suggestion
There are several ways to handle this:
1. Let the user deal with it and put a warning in the docs.
2. Provide a helper function, e.g. `with_columns_sorted`, which the user can call on a `dict` or `DataFrame` and which returns an object of the same type, but with the columns sorted deterministically (e.g. alphanumerically).
3. Assume that datasource columns = API keys and provide an option in the model/API config like `sort_columns_of_datasource: <datasourcename>` to sort the columns of a specific datasource as well as those of the `args_dict`. Pro: makes the happy case simple. Con: creates confusion for non-trivial models.
4. Similar to option 3, but instead of sorting, keep the datasource's columns as they are, save the column order internally with the model metadata, and apply it to the `args_dict`.
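The helper from the second option could be sketched roughly as follows. This is only an illustration under assumptions, not an existing mllaunchpad function: it assumes string keys or column names, and it treats any object with a `columns` attribute as a pandas `DataFrame`.

```python
def with_columns_sorted(data):
    """Return a copy of `data` with its keys (or columns) sorted by name."""
    if isinstance(data, dict):
        # copy() keeps the concrete type for dict and OrderedDict.
        result = data.copy()
        result.clear()
        for key in sorted(data):
            result[key] = data[key]
        return result
    if hasattr(data, "columns"):
        # Assumption: a pandas DataFrame. Selecting a list of column names
        # returns a new DataFrame with the columns in that order.
        return data[sorted(data.columns)]
    raise TypeError("expected a dict or a DataFrame, got %r" % type(data))

args_dict = {"income": 50000, "age": 30}
sorted_args = with_columns_sorted(args_dict)
print(list(sorted_args))  # keys are now in alphabetical order
```

The input is left untouched, so the user decides explicitly where in the data preparation the sorted version is used.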
Current Workaround
User has to deal with this manually in the data preparation.
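One way such a manual workaround might look: keep the column order used during training and reorder the incoming `args_dict` before handing the values to the model. The function name `order_like_training` and the example columns are hypothetical.

```python
def order_like_training(args_dict, training_columns):
    """Return the values of `args_dict` in the column order used for training."""
    missing = [c for c in training_columns if c not in args_dict]
    if missing:
        # Fail loudly instead of silently feeding misaligned inputs to the model.
        raise KeyError("args_dict is missing columns: %s" % missing)
    return [args_dict[c] for c in training_columns]

# Column order as seen during training (hypothetical example).
training_columns = ["age", "income"]
features = order_like_training({"income": 50000, "age": 30}, training_columns)
print(features)
```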
Option `sort_columns: true/false` in the datasources' and the model's config (default `false`), and make `args_dict` an `OrderedDict`. Of course, document how to deal with this when data sources need to be joined, point out that mllaunchpad's own `sort_columns` function can be used, etc.