Output features property for Transformers #553
Inspired by @maartenbreddels, this PR adds a `.features_` property to the vaex Transformers, which is a list of output feature names. The main idea is to simplify the feature combining process during ML pipeline prototyping.

- `.features_` property on the base Transformer class
- `_get_output_features()` for populating the `.features_` list
- `_get_output_features()` for the `PCA` and `OneHotEncoder` Transformers, since their functionality is different compared to the majority

This change brings some level of "awkwardness" in the implementation of (some of) the transformers:
- Currently, the `._get_output_features()` method is called during the `.fit()` method of each transformer. One idea is to introduce a `.fit()` method in the base Transformer class where `_get_output_features` will be called prior to calling a `._fit()` method. Thus we should rename the `.fit()` method of each transformer to `._fit()`. This will help to reduce code duplication. I am not sure how such a change would impact readability and maintainability of the code. This is similar to what `scikit-learn` does, but is this the right path for us @maartenbreddels? Also, if we do this, the docstrings of all `.fit()` methods will be identical (maybe we can get away with this?), unless we re-define `.fit`, which defeats the point of this strategy. I am fine leaving things as they are, but I thought to mention this just in case.

- PCA: our implementation of `_get_output_features` is tricky here, since we are not overwriting output columns but just shifting the component identifier (see the `.transform` method of PCA). So, do we want the PCA implementation to change so that an exception is raised if columns with those names already exist? How often is one expected to re-calculate the PCA on the same features without any other changes (@xdssio)? Right now, `.features_` lists the "naive" output, i.e. the features that should be there without overwriting during `.transform` time.

- There is some duplication/redundancy when determining the feature names. During `.fit` right now we get the list of output features (`features_`). Then, during `.transform` we still determine the output feature names just before calculating the expressions, in more or less the same way. Do we need to spend time reducing this redundancy, or somehow refactoring (the way was not obvious to me)? Maybe keeping things as they are is fine for now; it looked a bit weird to me, so I thought to bring it up (@maartenbreddels).
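
To make the base-class `.fit()` idea from the first point more concrete, here is a minimal sketch. It is not the vaex implementation: the class names `MeanCenterer` and `ToyPCA`, the `prefix` argument, and the use of a plain dict in place of a vaex DataFrame are all assumptions made for illustration. Only `.fit()`, `._fit()`, `_get_output_features()` and `.features_` come from the discussion above.

```python
class Transformer:
    """Hypothetical base class, for illustration only (not vaex code)."""

    def __init__(self, features=None, prefix=""):
        self.features = list(features or [])
        self.prefix = prefix
        self.features_ = []

    def fit(self, df):
        # Shared entry point: populate the output names first,
        # then delegate the actual fitting to the subclass hook.
        self.features_ = self._get_output_features()
        self._fit(df)
        return self

    def _get_output_features(self):
        # Default behaviour: one output column per input feature.
        return [self.prefix + name for name in self.features]

    def _fit(self, df):
        raise NotImplementedError


class MeanCenterer(Transformer):
    """Toy transformer that relies on the default output names."""

    def __init__(self, features=None, prefix="centered_"):
        super().__init__(features, prefix)
        self.means_ = {}

    def _fit(self, df):
        # `df` is a plain dict of column name -> list of numbers (an assumption).
        self.means_ = {f: sum(df[f]) / len(df[f]) for f in self.features}


class ToyPCA(Transformer):
    """Toy stand-in whose output names do not map one-to-one to the inputs."""

    def __init__(self, features=None, n_components=2, prefix="PCA_"):
        super().__init__(features, prefix)
        self.n_components = n_components

    def _get_output_features(self):
        # "Naive" names: component index appended to the prefix.
        return [f"{self.prefix}{i}" for i in range(self.n_components)]

    def _fit(self, df):
        pass  # the actual decomposition is omitted in this sketch


df = {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
centerer = MeanCenterer(features=["x", "y"]).fit(df)
pca = ToyPCA(features=["x", "y"], n_components=2).fit(df)
print(centerer.features_)  # ['centered_x', 'centered_y']
print(pca.features_)       # ['PCA_0', 'PCA_1']
```

In this layout only the base class defines `.fit()`, so its docstring is shared by all transformers, which is the trade-off raised in the first point. Each subclass can still document its own `._fit()`.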