Output features property for Transformers #553

JovanVeljanoski · 2020-01-15T23:27:56Z

Inspired by @maartenbreddels, this PR adds a .features_ property to the vaex Transformers, which is a list of output feature names. The main idea is to simplify combining features during ML pipeline prototyping.

Add a .features_ property to the base Transformer class
Implement a general private method, _get_output_features(), that populates the .features_ list
Implement a custom _get_output_features() for the PCA and OneHotEncoder transformers, since their output naming differs from that of most other transformers
Update tests so that they test the new element (check test for pca!)
Update the Changelog
Review: Discuss and agree on implementation details, issues and changes (see text below)
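
The checklist above can be illustrated with a minimal sketch. It is not the vaex implementation: the classes `Transformer`, `Scaler` and `OneHot`, the `prefix` attribute and the use of a plain dict as the "dataframe" are all assumptions made for illustration. Only the names `features_` and `_get_output_features()` come from this PR.

```python
class Transformer:
    """Hypothetical stand-in for the base Transformer class (not vaex code)."""

    prefix = ""

    def __init__(self, features):
        self.features = list(features)
        self.features_ = []  # populated during fit()

    def _get_output_features(self):
        # Default rule: one output column per input feature, prefix + input name.
        return [self.prefix + name for name in self.features]

    def fit(self, df):
        self.features_ = self._get_output_features()
        return self


class Scaler(Transformer):
    prefix = "scaled_"


class OneHot(Transformer):
    """Custom naming: one output column per (feature, category) pair."""

    def fit(self, df):
        # The output names depend on the categories seen while fitting.
        self.uniques_ = {name: sorted(set(df[name])) for name in self.features}
        self.features_ = self._get_output_features()
        return self

    def _get_output_features(self):
        return [
            f"{name}_{value}"
            for name in self.features
            for value in self.uniques_[name]
        ]


# Toy "dataframe": a dict mapping column names to lists of values.
df = {"x": [1.0, 2.0], "color": ["red", "blue"]}
scaler = Scaler(["x"]).fit(df)
onehot = OneHot(["color"]).fit(df)

# Combining features for a downstream model becomes list concatenation.
model_features = scaler.features_ + onehot.features_
print(model_features)  # ['scaled_x', 'color_blue', 'color_red']
```

With a property like this, the inputs for the next pipeline step can be collected without retyping the generated column names.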

This change introduces some "awkwardness" into the implementation of (some of) the transformers:

Currently, the ._get_output_features() method is called inside the .fit() method of each transformer. One idea is to introduce a .fit() method in the base Transformer class that calls ._get_output_features() before calling a ._fit() method. The .fit() method of each transformer would then be renamed to ._fit().
This would reduce code duplication, but I am not sure how such a change would affect the readability and maintainability of the code. It is similar to what scikit-learn does, but is this the right path for us, @maartenbreddels? Also, if we do this, the docstrings of all .fit() methods will be identical (maybe we can get away with this?), unless we redefine .fit(), which defeats the point of this strategy. I am fine leaving things as they are, but I thought I would mention it just in case.
PCA: our implementation of _get_output_features is tricky here, since we do not overwrite existing output columns but instead shift the component identifier (see the .transform method of PCA). So, do we want to change the PCA implementation so that an exception is raised if columns with those names already exist?
How often is one expected to recalculate the PCA on the same features without any other changes, @xdssio? Right now, .features_ lists the "naive" output, i.e. the feature names that would be produced if no existing columns got in the way at .transform time.
There is some duplication/redundancy when determining the feature names. During .fit we now build the list of output features (features_). Then, during .transform, we determine the output feature names again, just before calculating the expressions, in more or less the same way.
Do we need to spend time reducing this redundancy, or refactoring somehow (the way to do it was not obvious to me)? Maybe keeping things as they are is fine for now. It looked a bit weird to me, so I thought I would bring it up, @maartenbreddels.
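
For discussion, the base-class .fit() / ._fit() idea and the name-reuse idea might look roughly like the sketch below. Everything except the names `features_`, `_get_output_features()`, `fit()` and `_fit()` is an assumption: the constructor signature, the `MinMaxScaler` example, the prefix value and the dict used as a toy dataframe are hypothetical and not taken from vaex.

```python
class Transformer:
    """Hypothetical base class; names and signatures are assumptions."""

    def __init__(self, features, prefix=""):
        self.features = list(features)
        self.prefix = prefix
        self.features_ = []

    def _get_output_features(self):
        return [self.prefix + name for name in self.features]

    def fit(self, df):
        """Fit the transformer (one docstring shared by every subclass)."""
        self.features_ = self._get_output_features()
        self._fit(df)
        return self

    def _fit(self, df):
        raise NotImplementedError


class MinMaxScaler(Transformer):
    def _fit(self, df):
        # Only the transformer-specific work lives here.
        self.min_ = {name: min(df[name]) for name in self.features}
        self.max_ = {name: max(df[name]) for name in self.features}

    def transform(self, df):
        out = dict(df)
        # Reuse the names computed at fit time instead of rebuilding them.
        for name, out_name in zip(self.features, self.features_):
            lo, hi = self.min_[name], self.max_[name]
            # Assumes hi != lo; a constant column would divide by zero.
            out[out_name] = [(value - lo) / (hi - lo) for value in df[name]]
        return out


df = {"x": [0.0, 5.0, 10.0]}
scaler = MinMaxScaler(["x"], prefix="minmax_scaled_").fit(df)
result = scaler.transform(df)
print(result["minmax_scaled_x"])  # [0.0, 0.5, 1.0]
```

One caveat with calling `_get_output_features()` before `_fit()`: a transformer whose output names depend on fitted state (for example, categories discovered during fitting) would need the names to be computed after `_fit()` instead, or would still need its own override.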


JovanVeljanoski · 2020-01-16T12:10:56Z

Another option for the PCA issue would be to not modify the PCA at all, but also simply not support this features_ property in this case.
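
To make the PCA trade-off concrete, here is a small sketch of the naming behaviours under discussion. The helper `pca_output_names`, its parameters, the `PCA_` prefix and the exact shifting rule are assumptions for illustration only. The real logic lives in the .transform method of PCA and may differ.

```python
def pca_output_names(existing, n_components, prefix="PCA_", on_conflict="shift"):
    """Return output column names for a hypothetical PCA transform.

    existing     -- set of column names already present in the dataframe
    on_conflict  -- "shift" moves the component index past existing names,
                    "raise" refuses to continue when a name is already taken
    """
    naive = [f"{prefix}{i}" for i in range(n_components)]
    clashes = [name for name in naive if name in existing]
    if not clashes:
        return naive
    if on_conflict == "raise":
        raise ValueError(f"output columns already exist: {clashes}")
    # "shift": find the first start index for which no name collides.
    start = 0
    while any(f"{prefix}{start + i}" in existing for i in range(n_components)):
        start += 1
    return [f"{prefix}{start + i}" for i in range(n_components)]


columns = {"x", "y"}
first = pca_output_names(columns, 2)
second = pca_output_names(columns | set(first), 2)

print(first)   # ['PCA_0', 'PCA_1']  (matches the "naive" features_)
print(second)  # ['PCA_2', 'PCA_3']  (no longer matches the "naive" features_)
```

With shifting, a features_ list computed at fit time can disagree with the columns actually created on a repeated transform. Raising on conflict keeps the two consistent, and leaving features_ unsupported for PCA avoids the question entirely.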

JovanVeljanoski added 2 commits January 15, 2020 23:59

feat(ml) Add the features_ property to the transformers, which is a l…

0de9590

…ist of output features.

test(ml) Update the tests to check the features_ property

1c81c9b

JovanVeljanoski requested review from maartenbreddels and xdssio January 15, 2020 23:27

chore: Update the CHANGELOG

f2bb6e1

maartenbreddels force-pushed the master branch from 49aa65c to 9289005 Compare October 24, 2020 12:30

maartenbreddels force-pushed the master branch from c144d6e to 6f60dd0 Compare June 4, 2021 14:31

maartenbreddels force-pushed the master branch from 369423b to 6927b35 Compare November 25, 2021 15:41
