Feature: Tensorflow Integration #622
Some first comments, exciting work!
Substantial update to this PR. For passing data to Keras models, we can now also use vaex itself as the generator, with no need to go through tensorflow-io (I have not checked which one is faster). For comprehensive examples of how all this works, please see the unit tests. They are now more detailed and test both the creation of the data sources and how they are passed on to models.
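To illustrate the generator route mentioned above, here is a minimal sketch, not the code in this PR. The name batch_generator and the dict-of-columns layout are hypothetical stand-ins; with real data each column would be a numpy array (for example, one produced by vaex), and plain lists are used here only to keep the example dependency-free. It relies on Keras model.fit accepting a Python generator that yields (inputs, targets) tuples.

```python
def batch_generator(columns, features, target, batch_size):
    """Yield (inputs, targets) batches forever (hypothetical helper).

    `columns` maps column names to sliceable sequences of equal length.
    """
    n_rows = len(columns[target])
    while True:  # the consumer decides how many batches make up an epoch
        for start in range(0, n_rows, batch_size):
            stop = start + batch_size
            # One entry per feature column, keyed by column name.
            inputs = {name: columns[name][start:stop] for name in features}
            yield inputs, columns[target][start:stop]


data = {"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0], "label": [0, 1, 0]}
gen = batch_generator(data, features=["x", "y"], target="label", batch_size=2)
first_inputs, first_targets = next(gen)
print(first_inputs, first_targets)  # {'x': [1.0, 2.0], 'y': [4.0, 5.0]} [0, 1]
```

Because the generator never terminates, a consumer such as model.fit would need to be told how many batches form one epoch (steps_per_epoch).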
Looks good, really good. I'm not sure about the features argument. Maybe it would be good if @xdssio reviews it.
"""Create a tensorflow Dataset object from a DataFrame, via Arrow. | ||

:param features: A list of column names.
:param target: The dependent or target column, if any.
I think that if target is given, that column should be excluded from the rest.
And when you only give target, I expect features to be all columns but the target.
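The behaviour suggested in these two review comments could look roughly like the sketch below. The helper resolve_features is hypothetical and is not part of the PR or of vaex; it only illustrates the proposed defaulting rules for features and target.

```python
def resolve_features(column_names, features=None, target=None):
    """Pick the feature columns following the review suggestions (hypothetical)."""
    if features is None:
        # No explicit features: start from every column of the DataFrame.
        features = list(column_names)
    # The target column is never used as a feature.
    return [name for name in features if name != target]


columns = ["x", "y", "z", "label"]
print(resolve_features(columns, target="label"))
# ['x', 'y', 'z']
print(resolve_features(columns, features=["x", "label"], target="label"))
# ['x']
```

Filtering the target out even when features is given explicitly is one possible reading of the first comment; raising an error in that case would be an equally defensible design.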
This would be such a killer feature! Any news on this?
Hi @solalatus, we are happy you think so! Yes, it is ready, though perhaps some cleanup is needed. We are doing some internal testing on big-ish data together with an involved (vaex-based) pre-processing pipeline to make sure everything actually works as intended. If you are comfortable with a dev install, you can try this branch. Otherwise, if all goes well, I estimate 1-2 weeks before it is merged and released.
Marvelous news! I will be talking about data iterators soon in a teaching context, so I am now more confident to hint at Vaex!
This PR realises #616.
The idea is to enable vaex DataFrames to be used as input data sources for tensorflow (tf). This is made possible in large part thanks to tensorflow-io, via arrow.
df.ml.tensorflow.to_dataset() creates a tf Dataset from a DataFrame. The to_dataset() method can also make the dataset compatible with Keras, so that one can use it as input for the Keras model.fit() method.
Important notes:
The features argument is used to select which features/columns of the vaex DataFrame should be used. To instantiate tf Estimators one needs to provide one or more (in the case of DNNLinearCombinedEstimator) lists of feature_columns, which give tf more detail about the features: whether they are numerical or categorical, how categorical variables are handled, and how new features are engineered from the primary ones. In addition, some tf Estimators require various kwargs to be instantiated. Thus, over the next few days I will add some example code as comments here, from the user side, so we can decide on the type of API we should go with.
The test test_to_dataset_keras_model_train, which tests the successful passing of a vaex DataFrame to a keras model via a tf dataset, fails with a segmentation fault. At this point, I do not know whether this arises due to problems in vaex, tensorflow, tensorflow-io, or the code committed in this PR. However, my independent testing has also found other segmentation faults when making an input function or a dataset to be used for tf Estimators. Thus it is important that we investigate this issue.
tensorflow-io.arrow is used to pass vaex data to keras, but this is not strictly necessary. It might be useful to investigate performance when using the tensorflow-io.arrow connection vs simply using vaex iterators to pass numpy arrays to the keras model.fit method, since it can consume generators that return numpy arrays,