Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output features property for Transformers #553

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

JovanVeljanoski
Copy link
Member

Inspired by @maartenbreddels , this PR adds .features_ property to the vaex Transformers, which is list of output feature names. The main idea is to simplify the feature combining process during ML pipeline prototyping.

  • Implement a .features_ property to the base Transformer class
  • Implement a general function and private function _get_output_features() for populating the .features_ list
  • Implement custom _get_output_features() the PCA and OneHotEncoder Transformers since their functionality is different compared to the majority
  • Update tests so that they test the new element (check test for pca!)
  • Update the Changelog
  • Review: Discuss and agree on implementation details, issues and changes (see text below)

This change brings some level of "awkwardness" in the implementation of (some of) the transformers:

  1. Currently, ._get_output_features() method is called during the .fit() method of each transformer. One idea is to introduce a .fit() method in the base Transformer class where _get_output_features will be called prior to calling a ._fit() method. Thus we should rename the .fit() method of each transformer to ._fit().
    This will help to reduce code duplication. I am not sure how such a change would impact readability and maintainability of the code. This is similar to what scikit-learn does, but is this the right path for us @maartenbreddels ? Also, if we do this, the docstrings of all .fit() methods will be identical (maybe we can get away with this?), unless we re-define .fit which is defeating the point of this strategy. I am fine leaving things as they are, but i thought to mention this just in case.

  2. PCA: our implementation of _get_output_features is tricky here, since we are not overwriting output columns but just shifting the component identifier (see .transform method of PCA). So, do we want the PCA implementation to change in a way that, if columns of those names already exist, an exception should be raised?
    How often is one expected to re-calculate the PCA on the same features without any other changes (@xdssio). Right now, the .features_ lists the "naive" output, i.e. the features that should be there without overwriting during .transform time.

  3. There is some duplication/redundancy when determining the feature names. During .fit right now we get the list of output features (features_). Then, during .transform we still determine the output feature names just before calculating the expressions, in more or less the same way.
    Do we need to spend time in reducing this redundancy, or somehow re-factoring (the way was not obvious to me). Maybe keeping things as they are is fine for now, it looked a bit weird to me, so I thought to bring it up (@maartenbreddels).

@JovanVeljanoski
Copy link
Member Author

Another option for the PCA issue would be to not modify the PCA at all, but also simply not support this features_ property in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant