Estimator data source object #16393
Comments
Consider the analogy of CountVectorizer(input='filename'). If your MongoDataSource
has a transform method which translates each id into an observation, and if
you pass y separately for the same IDs, it should work.
I think the uncomfortable part here is not being able to get y from the
same source at the same time.
|
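To make the suggestion above concrete, here is a minimal sketch of a transformer that turns a list of IDs into observations. It is only an illustration under assumptions: `IdLookupTransformer` and `FAKE_COLLECTION` are hypothetical names, and an in-memory dict stands in for the MongoDB collection so the snippet runs without a database.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for a MongoDB collection: record ID -> feature values.
FAKE_COLLECTION = {
    "a": [0.0, 1.0],
    "b": [1.0, 0.0],
    "c": [0.1, 0.9],
    "d": [0.9, 0.2],
}


class IdLookupTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: maps a sequence of IDs to a feature array."""

    def __init__(self, collection=None):
        self.collection = collection

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the IDs themselves.
        return self

    def transform(self, X):
        # One output row per input ID, so the number of samples is preserved.
        return np.array([self.collection[record_id] for record_id in X])


ids = ["a", "b", "c", "d"]  # X is just the list of IDs
y = [0, 1, 0, 1]            # y is passed separately, aligned with the IDs

pipe = Pipeline([
    ("load", IdLookupTransformer(collection=FAKE_COLLECTION)),
    ("clf", LogisticRegression()),
])
pipe.fit(ids, y)
predictions = pipe.predict(["a", "b"])
```

In a real setup the dict lookup would presumably be replaced by a database query; the point is only that X can be a list of IDs while y travels alongside it.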
I do not think it can be done through a transform method, because according to your convention that method expects an X argument, and the length of the output array should be the same as that of the X input. Moreover, in my example the fit() method does not require any arguments, because the MongoDataSource object supplies the data. Regarding not getting the y value: this is the kind of logic that could be implemented inside the MongoDataSource object (I don't know how yet, maybe by expecting a certain naming in the query). I know that this changes a little how pipelines are currently built in sklearn; however, I think it might see great adoption from the community, given the analogy with other frameworks (as I told you before). I could also code a first version of this object and some usage rules, if it helps. Thank you! |
Does the |
FunctionTransformer won't help. But I don't see what's wrong with
transforming a list of IDs into a DataFrame or array. Doing so enables it
to work with other parts of the library such as cross validation, no matter
what the other frameworks you mentioned do.
|
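As a rough illustration of the cross-validation point, the sketch below passes an array of IDs as X to `cross_val_score`, with the loading done inside the pipeline. `RecordLoader`, `RECORDS`, and `LABELS` are hypothetical names, and the dicts are an assumed stand-in for a real data store.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for a database table: integer ID -> feature values.
RECORDS = {i: [float(i % 2), float(i % 5)] for i in range(20)}
LABELS = {i: i % 2 for i in range(20)}


class RecordLoader(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that fetches one feature row per ID."""

    def __init__(self, records=None):
        self.records = records

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the IDs.
        return self

    def transform(self, X):
        # int() converts NumPy integers back to plain dict keys.
        return np.array([self.records[int(record_id)] for record_id in X])


ids = np.arange(20)                     # X is only the IDs
y = np.array([LABELS[i] for i in ids])  # y is aligned with the IDs

model = make_pipeline(RecordLoader(records=RECORDS), LogisticRegression())

# The splitter slices the ID array; each fold loads only the rows it needs.
scores = cross_val_score(model, ids, y, cv=4)
```

Because the splitter only needs something indexable of length n_samples, the ID array is sliced per fold and the loader fetches rows for just those IDs.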
As far as I know, when you both talk about transformers and FunctionTransformer, you are just talking about classes that define a "transform" method. According to your documentation, the convention for fitting in the Pipeline framework is that this method should receive a matrix X with shape (n_samples, n_features) and return a matrix with the same dimensions. With the DataSource object, I would expect it to have a method that just receives a configuration object (such as a dictionary), or to take the configuration at instantiation (as in the example above), and to return X and y (if required) when the Pipeline's fit() and predict() methods are called. I do not know if this explanation helps; otherwise I could code a first version to show how I expect it to work. Thank you! |
That documentation simplifies things a bit. It needs to be an iterable (usually an array or sequence) of length n_samples. See the glossary on One certainly does not need to return a matrix with the same dimensions, but
This will not work with our cross validation routines. However, |
I see your point: you can work around it with a transformer. But don't you think it could be interesting to solve this with a new feature or object? The Pipeline object is very similar (in terms of process and syntax) to functional programming frameworks, and this feature is common in every one of them: https://spark.apache.org/docs/latest/ml-datasource Besides, I keep talking about a "source", but it could also be used as a "sink". That would be really convenient for stream processing. |
Hi! The Pipeline class is very interesting because it uses a syntax very similar to functional programming frameworks such as pyspark and Apache Beam. However, I miss a "DataSource" object that deals with the logic required to load the data, beyond passing it as a list of values. For instance, it could connect to a database to load the data for fitting and for predicting.
I have tried to build one of these objects on my own using an Estimator structure; however, I saw that it does not fit your conventions (https://scikit-learn.org/stable/developers/develop.html). That is why I think a new object is needed.
What do you think? Thank you!