Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Tensorflow Integration #622

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

JovanVeljanoski
Copy link
Member

@JovanVeljanoski JovanVeljanoski commented Mar 1, 2020

This PR realises #616
The idea is enable vaex DataFrames to be used as input data sources for tensorflow (tf) data sources.

This is made possible in large part thanks to tensorflow-io via arrow.

  • A method to convert vaex DataFrames to tf datasets, to be used as input for tensorflow Estimators. This is now possible via df.ml.tensorflow.to_dataset()
  • A convenience function meant as a direct data source for tf Estimators. Note that tf Estimator consume an input function that returns a dataset object, rather than the dataset object itself.
  • Enabling the to_dataset() method do make the dataset compatible with Keras, so that one can use the dataset as an input for the Keras model .fit() method.
  • A high level wrapper for the tf Estimators, which will make them easily consumable on the vaex side, while also turning them into vaex transformers, and also serialise them, and make them compatible with the state transfer system.
  • A high level wrapper that will turn a pre-defined keras model into a vaex transformer.
  • Unit tests (providing enough coverage & passing)
  • All new methods have helpful docstrings
  • Update the CHANGELOG when/if this PR is considered to be accepted
  • Code review approval

Important notes:

  • It is not readily obvious what the API of the high level wrapper for the tf Estimators should be. Typically, such wrappers need a features argument, used to select which features/columns of the vaex DataFrame should be used. To instantiate tf Estimators one need to provide one or more (in the case of DNNLinearCombinedEstimator) lists of feature_columns which give more detail to tf about the features, i.e. options on whether the features and numerical, categorical, providing various options of handing categorical variables, as well as options for engineering new features from the primary ones. In addition, some tf Estimators require various kwargs to be instantiated. Thus, over the next few days I will add some example code as comments here, from the user side, so we can decide on the type of API we should go with.
  • The test currently named test_to_dataset_keras_model_train that tests the successful passing of a vaex DataFrame to keras model via a tf dataset fails with a segmentation fault. At this point, I do not know whether this arises due to problems in vaex, tensorflow, tensorflow-io, or the code committed in this PR. However, my independent testing have also found other segmentation faults happening when making an input function or a dataset to be used for tf Estimators. Thus it is important that we investigate this issue.
  • At this point in time, we are using tensorflow-io.arrow to pass vaex data to keras, but this is not strictly necessary. It might be useful to investigate performance when using the tensorflow-io.arrow connection vs simply using vaex iterators to pass numpy arrays to the keras models .fit method, since it can consume generators that return numpy arrays,

Copy link
Member

@maartenbreddels maartenbreddels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some first comments, exciting work!

packages/vaex-ml/setup.py Outdated Show resolved Hide resolved
packages/vaex-ml/vaex/ml/tensorflow.py Outdated Show resolved Hide resolved
packages/vaex-ml/vaex/ml/tensorflow.py Outdated Show resolved Hide resolved
tests/ml/tensorflow_test.py Outdated Show resolved Hide resolved
tests/ml/tensorflow_test.py Outdated Show resolved Hide resolved
@JovanVeljanoski
Copy link
Member Author

Substantial update to this PR.
Now we have support for multi-class classification and multi-target regression. One can pass to_dataset() datasets to both tf estimators and keras models. For keras models, one can construct the dataset as dicts (similar as in the case of the tf estimators), or as tensors.

For passing data to Keras models, now we can also just use vaex as the main generator, no need to go through tensorflow-io. (I have not checked which one is faster).

For comprehensive examples on how all this works, please see the unit-tests. There are now more detailed unit tests, which test both the creation of the data sources, and how they are passed on to models as well.

Copy link
Member

@maartenbreddels maartenbreddels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, really good. I'm not sure of the features argument. Maybe it is good in @xdssio reviews it

packages/vaex-ml/vaex/ml/tensorflow.py Outdated Show resolved Hide resolved
packages/vaex-ml/vaex/ml/tensorflow.py Outdated Show resolved Hide resolved
"""Create a tensorflow Dataset object from a DataFrame, via Arrow.

:param features: A list of column names.
:param target: The dependent or target column, if any.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The I think if target is given, it should exclude that column for the rest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And when you only give target, I expect features to be all columns but the target

packages/vaex-ml/vaex/ml/tensorflow.py Outdated Show resolved Hide resolved
@solalatus
Copy link

This would be such a killer feature! Any news on this?

@JovanVeljanoski
Copy link
Member Author

Hi @solalatus

We are happy you think so!

Yeah, it is ready. Perhaps some cleanup is needed. We are doing some internal testing on big-ish data together with an involved (vaex based) pre-processing pipeline to make sure everything will actually work as intended.

If you are comfortable with a dev install you can try this branch. Otherwise, if all goes well I estimate 1-2 weeks before it is merged and released.

@solalatus
Copy link

Marvelous news! I will be talking about data iterators soon in a teaching context, so I am now more confident to hint at Vaex!
Keep pushing guys, it is cool stuff!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants