Cannot transform unknown length data/chunked data with a Pipeline (no ChunkedTransformer/ChunkedEstimator/ChunkedClassifier) #15749

Open
GregoryMorse opened this issue Dec 1, 2019 · 7 comments
Labels
API, Needs Decision

Comments

@GregoryMorse
Contributor

GregoryMorse commented Dec 1, 2019

Description

Suppose you have more data than will fit in memory, stored across separate files on disk. There are frameworks like dask that could be used to assemble it, but that means a huge amount of data copying for little benefit. The total data length could be calculated, but there is no real need to do that either. There is currently no way, with a Pipeline, to perform feature selection in chunks via partial fitting, followed by transforming and concatenating the chunks, and then fitting on the reasonably sized post-transformation data.

Furthermore, each transformer may be used for certain features only. So 10000 features become 10 transformers with 1000 features each, and the best 5 features of each transformer are used, resulting in 50 features. This division of features among the transformers is also important due to memory constraints. Of course, ColumnTransformer already solves the task of dividing columns among transformers and choosing the best among them.

The assumption is that enough memory is available for 50 features over all the data, but 1000 features over all the data will not fit in memory. I see no way a Pipeline can support chunked, unknown-length, or load-on-demand data. Basically, you want to partial_fit on the data, then transform and concatenate it, then move on to the next stage of the pipeline, which could have more transformers and then an estimator. Only the very first transformer stage of the pipeline should need this special treatment, although in some cases the intermediate stages may need to be cached and reprocessed with the same strategy.

Steps/Code to Reproduce

Say there are 1000 files, each with a varying number of rows (1000-10000) and 50000 features. Feature selection will reduce these 50000 features down to a mere 50 by choosing the best ones. It is assumed that the data fits into memory after feature selection, but not before, so that a full fit can be done on the reduced data.

What you want to do:
Stage 1: partial_fit over and over on the transformers for each chunk of data
Stage 2: transform for each chunk of data with all the transformers which will do the feature selection
Stage 3: concatenate the results
Stage 4: fit on the estimator
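
For concreteness, a minimal sketch of those four stages done by hand today, outside of any Pipeline. Here load_chunks() is a hypothetical loader yielding (X, y) arrays one file at a time, and StandardScaler merely stands in for a partial_fit-capable feature-selecting transformer (it does not actually reduce the feature count):

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler

    def load_chunks():
        # hypothetical loader: yields (X_chunk, y_chunk) arrays, one file at a time
        for i in range(1000):
            data = np.load("file_%04d.npz" % i)  # placeholder file names
            yield data["X"], data["y"]

    transformer = StandardScaler()  # stand-in for a partial_fit-capable feature selector
    estimator = SGDClassifier()

    # Stage 1: partial_fit the transformer chunk by chunk
    for X_chunk, _ in load_chunks():
        transformer.partial_fit(X_chunk)

    # Stages 2-3: transform each chunk and concatenate the results
    # (with a real feature selector the concatenated result is much smaller)
    X_parts, y_parts = [], []
    for X_chunk, y_chunk in load_chunks():
        X_parts.append(transformer.transform(X_chunk))
        y_parts.append(y_chunk)
    X_reduced, y = np.vstack(X_parts), np.concatenate(y_parts)

    # Stage 4: fit the estimator on the reduced data
    estimator.fit(X_reduced, y)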

This seems like an incredibly common situation: data sets that are too large to fit in memory, but which feature selection can reduce down to a reasonable size.

Solutions

Add ChunkedTransformer to scikit-learn. It would perform partial_fit in chunks, then transform in chunks and assemble the result. I suppose ChunkedEstimator and ChunkedClassifier could also provide wrappers wherever partial_fit is available, or for predicting on overly large data. The input X format would need to be slightly different, an iterable of arrays rather than a single array, but the output would be the same. Caching in a Pipeline is already supported, so intermediate caching and loading is at least possible. These are simple classes to implement: they merely wrap an underlying Transformer/Estimator/Classifier, feed partial_fit, transform, or predict from an iterable, and assemble the results. Unfortunately, there are a good number of places where the X parameter is assumed not to be a generic iterable, such as GridSearchCV, and those might require some refactoring.
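
A rough sketch of what such a wrapper could look like, assuming X is passed as an iterable of chunk arrays (the class name and behaviour below are part of this proposal, not an existing scikit-learn API):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class ChunkedTransformer(BaseEstimator, TransformerMixin):
        # Proposed wrapper: fit the inner transformer with partial_fit over chunks,
        # then transform chunk by chunk and reassemble the results.
        def __init__(self, transformer, recombine=True):
            self.transformer = transformer
            self.recombine = recombine  # False: yield chunks for the next chunked stage

        def fit(self, X_chunks, y=None):
            # X_chunks: iterable of (n_rows_i, n_features) arrays
            for X_chunk in X_chunks:
                self.transformer.partial_fit(X_chunk)
            return self

        def transform(self, X_chunks):
            parts = (self.transformer.transform(X_chunk) for X_chunk in X_chunks)
            if self.recombine:
                return np.vstack(list(parts))
            return parts  # hand the chunk iterator to the next chunked stage

Note that if X_chunks is a one-shot generator it is exhausted after fit, so fit and transform each need a fresh iterable; that is where the idea of passing re-creatable chunk loaders (discussed further below) comes in.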

Another idea is for sklearn to support callback functions for X in models, which could load the data in chunks via iteration. But that is probably too massive a change, and fit_transform would still need to enumerate the data twice. Perhaps a class overriding the various __getitem__ methods could provide a virtualized view of the data; nonetheless, the length is not known, so the data would still have to be fetched in chunks.
Another idea is a partial_fit method that can be called repeatedly on the pipeline, together with a transform_fit method that would automatically concatenate the results of the transform and fit the estimator. These 2 ideas are basically dead ends, since they do not provide the flexibility needed for multiple stages of transformers in the pipeline. The first idea of adding Chunked wrappers is the only real solution.

Is there something I am missing, or is this seemingly obvious task simply not possible with a Pipeline? I would like to discretize my data in the transformers, and also have the transformers use a scoring algorithm to choose the best features once discretized. Then, finally, the fitting can occur, and I would like to do all of this on a good deal of data. Of course, another solution without a pipeline is:

  1. partial_fit in a loop until the data is exhausted.
  2. transform in a loop with concatenation until the data is exhausted.
  3. pass the concatenated result to the estimator with fit.

But then the benefit of using something like GridSearchCV to tune the feature selection and the model together is lost.

@GregoryMorse GregoryMorse changed the title Cannot transforming unknown length data/chunked data with a Pipeline (no partial_fit/transform_fit) Cannot transform unknown length data/chunked data with a Pipeline (no partial_fit/transform_fit) Dec 1, 2019
@GregoryMorse GregoryMorse changed the title Cannot transform unknown length data/chunked data with a Pipeline (no partial_fit/transform_fit) Cannot transform unknown length data/chunked data with a Pipeline (no ChunkedTransformer or partial_fit/transform_fit) Dec 1, 2019
@GregoryMorse GregoryMorse changed the title Cannot transform unknown length data/chunked data with a Pipeline (no ChunkedTransformer or partial_fit/transform_fit) Cannot transform unknown length data/chunked data with a Pipeline (no ChunkedTransformer/ChunkedEstimator/ChunkedClassifier) Dec 1, 2019
@rth
Member

rth commented Dec 1, 2019

dask and dask-ml seem to be exactly what this use case calls for.

There are frameworks like dask which you could use to compile them, but that's a huge amount of data copying for little benefit.

Why do you think it's going to make more data copies? It shouldn't copy anything on disk unless you ask it to, and copies in memory are negligible with respect to disk I/O (and scikit-learn itself does plenty of those anyway).

Even assuming you have estimators that all support partial_fit, in a pipeline the issue is that in general you would need several passes over the data, e.g. in
make_pipeline(Transformer1(), FeatureSelector1(), Classifier1()):

  • a first pass where you load the data in chunks to fit Transformer1() with partial_fit
  • a second pass over the data to fit FeatureSelector1(), etc.
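
Concretely, something along these lines would be needed, with one full pass over the chunks per step (a sketch only: load_chunks() is a hypothetical chunk loader, and every step is assumed to support partial_fit):

    # first pass: fit the first step on the raw chunks
    for X_chunk, y_chunk in load_chunks():
        transformer1.partial_fit(X_chunk)

    # second pass: push each chunk through the fitted first step
    # in order to fit the second step, and so on for every later step
    for X_chunk, y_chunk in load_chunks():
        feature_selector1.partial_fit(transformer1.transform(X_chunk), y_chunk)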

I'm not sure that this logic is something that would work with the current scikit-learn API.

I think building a computational graph, as dask does, to load each chunk of the data when it is needed would make the most sense here.

If you want to stay with scikit-learn, I would either do simple feature selection one column at a time (assuming you can load columns efficiently) over the full dataset, or alternatively subsample the full dataset and run feature selection on the subsample, then pick the features that are robust to changing the random seed used for subsampling.

@GregoryMorse
Contributor Author

GregoryMorse commented Dec 1, 2019

@rth thank you for your reply. Although I agree that dask-ml tries to solve this problem, it somewhat falls short when you need to send many features through a transformer's partial_fit efficiently and then consolidate them with partial transforms. Yes, the partial transforms could be chained. In fact, in my case the estimator will not work on partial data, so I am only doing this for a single transform stage. I am now learning a bit more about how dask-ml really works; if the rows are not all loaded into memory at the same time, I suspect it is highly inefficient, since some algorithms, like the one I am implementing, traverse the features rather than the rows.

  1. dask would at minimum have to scan all the data to determine its length, which is an extra, expensive, and unnecessary iteration through the data. The raw data generates the features but is not necessarily the features themselves (yes, I suppose that is also an implied transform, but one so formulaic it need not be associated with the actual model anywhere).
  2. If dask caches the data while doing this, it is not a copy, but it is a lot of data that is cheaper to regenerate on the fly than to kill the disk with storage activity.
  3. dask cannot know the shape of the data, so it cannot avoid caching or an unnecessary iteration through all of the data, which in my opinion makes it simply unsuitable for this optimization. It will probably work, but I am guessing it will write to my disk more than I care for. With my proposal, the features only need to be read/generated twice: once in partial_fit and once in transform.

In the meantime, I figure I might as well stick with sklearn, as I like the straightforwardness and simplicity, yet elegance, of the library. What I can do is simply implement my own Pipeline which takes an iterable and marks which entries in the Pipeline require iterable adapters. And I can see whether I can fool GridSearchCV into not inspecting or touching the X field. Of course, all of this discussion theoretically applies to y too.

I think it would be an excellent enhancement to support batching, or chunks, or partial data, or whatever terminology it ends up being called. Not all algorithms are row-based; some are feature-based and partially fittable, and in that context changes to sklearn would shine. Again, I admit dask-ml may well be a total solution to this problem, but I would have to understand more about how it works, as my features are actually generated on demand from data on disk and are accessed as entire columns, while entire rows are never accessed together. I can come up with many real-world problems that would benefit from such a structure.

@GregoryMorse
Contributor Author

GregoryMorse commented Dec 1, 2019

Is the addition of iterator generators for partial_fit, score, transform, and predict, for both X and y, really unachievable? Obviously X cannot be a plain iterator, as GridSearchCV needs to be able to construct the iterator more than once. I think GridSearchCV could be made to accept an array of function pointers instead. Then ChunkedTransformer could merely do the work, and my proposal could be added to sklearn through the implementation of just these 3 additional wrapper classes, which adapt arrays of function pointers into iterables that the inner transformer/estimator understands need to be iterated.
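
A sketch of that idea, where X becomes an indexable sequence of zero-argument loader callables so the chunk stream can be re-created as many times as needed (the file names and the iter_chunks helper are purely illustrative):

    import numpy as np

    # X is a list of chunk loaders rather than a single array; it can be indexed
    # cheaply, and each loader can be called repeatedly.
    X_loaders = [lambda i=i: np.load("chunk_%04d.npy" % i) for i in range(1000)]

    def iter_chunks(loaders):
        # re-create a fresh stream of chunk arrays on every call
        for load in loaders:
            yield load()

    # e.g. inside the proposed ChunkedTransformer.fit:
    #     for X_chunk in iter_chunks(X_loaders):
    #         self.transformer.partial_fit(X_chunk)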

With deeper thought and consideration, I don't even think this change is major or requires a SLEP; it is merely 3 very clever classes plus some documentation of exactly how they can be used. The API remains exactly the same, and the change is completely transparent to it.

@GregoryMorse
Contributor Author

The Chunked... classes could have a parameter to either:

  1. recombine the chunked transforms back into the normal format, or
  2. have transform (or other functions) yield another iterator generator, to be consumed by the next stage's partial_fit or other functions.

I am almost certain now that the 3 classes are enough to do the job, although this does introduce a new notion of how data is passed around.
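
A usage sketch of the two modes, reusing the hypothetical ChunkedTransformer, iter_chunks, and X_loaders from the sketches above (inner_transformer is a placeholder for any partial_fit-capable transformer):

    # 1. recombine into a single array for an ordinary downstream estimator
    chunked = ChunkedTransformer(inner_transformer, recombine=True)
    X_small = chunked.fit(iter_chunks(X_loaders)).transform(iter_chunks(X_loaders))

    # 2. keep the data chunked for the next Chunked* stage's partial_fit
    chunked = ChunkedTransformer(inner_transformer, recombine=False)
    chunk_stream = chunked.fit(iter_chunks(X_loaders)).transform(iter_chunks(X_loaders))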

@adrinjalali
Member

I'd be happy to see a PR implementing your idea, to see concretely how it works.

Better support for partial_fit has been on our roadmap, but from our perspective it's a very challenging one and touches a lot of places. Still, I'm happy to see a prototype of your idea.

@adrinjalali added the API and Needs Decision labels Dec 4, 2019
@GregoryMorse
Contributor Author

I will provide it in a few days, as I need to develop my underlying Transformer a bit first. It also looks like I will need not just loading rows in chunks but also selective column loading. I will have to see how many adapter generalizations are needed to get it all working, but there will be concrete code for these proposed classes.

@GregoryMorse
Contributor Author

dask delayed does seem to solve a good deal of what I have alluded to. Chunked data is not really needed in the known-length case, as dask handles that. The problem is mainly when computing the length requires evaluation, which undoes some of the benefit of using dask (or requires a huge amount of caching). The number of features should usually be determinable without computation rather than unknown (though theoretically it could be). And since nearly all models need all features simultaneously rather than going through them in batches, and dask solves the memory constraint there, I will keep this as a rows-only issue, as that is the practical use case.

But as for unknown row counts, dask cannot address the fact that computing the length might be undesirable.

The issue here is that the model itself needs to understand that partial fitting as one phase and partial transforms as another phase eliminate the need to compute all the data up front.

In this case, the computation is done only once for all the data, in the fitting phase. Then, in the transform phase, only a single feature will be computed, as the others have been eliminated. The extra computation for that single feature should be cheap enough that keeping all the data cached while dask builds its precomputation graph is simply not worth it.

If one does not care about writing large amounts of pre-computed data to the disk, then certainly dask is the way to go.

I will go ahead and write ChunkedTransformer and submit a PR for it so better discussion can ensue, and I will probably still use dask to handle scalability across the features.
