
Estimator data source object #16393

Open
DelgadoPanadero opened this issue Feb 5, 2020 · 7 comments

Comments

@DelgadoPanadero

DelgadoPanadero commented Feb 5, 2020

Hi! The Pipeline instance is very interesting because it uses a syntax very similar to functional programming frameworks such as PySpark and Apache Beam. However, I miss a "DataSource" object that deals with the logic required to load the data, beyond passing it as a list of values. For instance, it could connect to a database to load the data for fitting and for predicting.

Pipeline([MongoDataSource("127.0.0.1:8087", query),
          Normalizer(),
          SVC()]).fit()

I have tried to build one of these objects on my own using an estimator structure; however, I saw that it does not fit your conventions (https://scikit-learn.org/stable/developers/develop.html). That is why I think a new object is needed.

What do you think? Thank you!
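To make the proposal concrete, here is a minimal sketch of what such a DataSource object might look like. Nothing here is existing scikit-learn API: the class name, the `load()` method, and the in-memory dict standing in for a real database are all assumptions for illustration.

```python
# Hypothetical sketch: a DataSource-like object that produces X (and y)
# itself instead of receiving them. The backing "database" is faked
# with an in-memory dict so the example runs standalone.

class DictDataSource:
    """Loads training data from a backing store when asked."""

    def __init__(self, store, query):
        self.store = store    # stand-in for a real database connection
        self.query = query    # stand-in for a real query

    def load(self):
        # A real implementation would run self.query against the
        # database and split the result into features and target.
        rows = self.store[self.query]
        X = [row[:-1] for row in rows]
        y = [row[-1] for row in rows]
        return X, y


# Usage: the source yields X and y itself, which is why, in the
# proposal above, Pipeline.fit() could be called with no arguments.
fake_db = {"SELECT * FROM t": [[1.0, 2.0, 0], [3.0, 4.0, 1]]}
source = DictDataSource(fake_db, "SELECT * FROM t")
X, y = source.load()
```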

@jnothman
Member

jnothman commented Feb 6, 2020 via email

@DelgadoPanadero
Author

I do not really think that it can be done through a transform method, because according to your conventions this method expects the X argument, and the length of the output array should be the same as that of the X input. Moreover, in my example the fit() method does not require any argument, because of the presence of the MongoDataSource object.

Regarding the part about not getting the y value: this is the kind of logic that could be implemented inside the MongoDataSource object from the example (I don't know how yet; maybe by expecting a certain naming from the query).

I know that it changes a little how pipelines are currently built in scikit-learn; however, I think it might see great adoption from the community, given the analogy with other frameworks (as I told you before). I could also code a first version of this object and some usage rules, if it helps. Thank you!

@glemaitre
Member

Does the FunctionTransformer alleviate the issue:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
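For instance, FunctionTransformer can wrap an arbitrary loading function so that the pipeline's X is just a list of sample IDs. The lookup table and `fetch_features` function below are illustrative stand-ins for an external store, not anything from the issue:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative stand-in for an external store: maps sample IDs to rows.
LOOKUP = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}

def fetch_features(ids):
    """Resolve a list of IDs into a feature matrix."""
    return np.array([LOOKUP[i] for i in ids])

# FunctionTransformer turns the plain function into a transformer that
# can sit inside a Pipeline; X passed to the pipeline is a list of IDs.
loader = FunctionTransformer(fetch_features)
X = loader.transform([0, 2])
```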

@jnothman
Member

jnothman commented Feb 15, 2020 via email

@DelgadoPanadero
Author

As far as I know, when you both talk about transformers and FunctionTransformer, you are just talking about classes with a "transform" method defined as an object method. According to your documentation, the convention for fitting in the Pipeline framework is that this method should receive a matrix X with shape (n_samples, n_features) and return a matrix with the same dimensions.

With the DataSource object, I would expect it to have a method that just receives a configuration object (such as a dictionary), or even to receive it at instantiation (as in the example above), and it would return the X and y (if required) when the Pipeline's fit() and predict() methods are called.

I do not really know if this explanation helps; otherwise I could code a first version of it to show how I expect it to work. Thank you!

@jnothman
Member

According to your documentation, the convention for fitting in the Pipeline framework is that this method should receive a matrix X with shape (n_samples, n_features) and return a matrix with the same dimensions.

That documentation simplifies things a bit. It needs to be an iterable (usually an array or sequence) of length n_samples. See the glossary on X.

One certainly does not need to return a matrix with the same dimensions, but transform should return an object of equal length (on the first axis).

With the DataSource object, I would expect it to have a method that just receives a configuration object (such as a dictionary), or even to receive it at instantiation (as in the example above), and it would return the X and y (if required) when the Pipeline's fit() and predict() methods are called.

This will not work with our cross validation routines. However, transform can take as input a list of IDs and return a list or array or dataframe of representations of those IDs.
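A sketch of that ID-to-representation idea, assuming a simple in-memory store (the `IDFeatureLoader` class, the store contents, and the choice of classifier are all hypothetical, not part of the discussion):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

class IDFeatureLoader(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: resolves sample IDs into feature rows.

    Because transform returns one row per input ID, the length on the
    first axis is preserved, so cross-validation can split the IDs.
    """

    def __init__(self, store):
        self.store = store  # stand-in for a database or feature store

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([self.store[i] for i in X])


# Illustrative store and usage inside a Pipeline: X is a list of IDs.
store = {i: [float(i), float(i) % 2] for i in range(10)}
ids = np.arange(10)
y = ids % 2

pipe = Pipeline([("load", IDFeatureLoader(store)),
                 ("clf", LogisticRegression())])
pipe.fit(ids, y)
pred = pipe.predict(ids)
```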

@DelgadoPanadero
Author

I see your point; you can make a workaround with a transformer. But don't you think it could be interesting to solve this with a new feature or object? The Pipeline object is very similar (in terms of process and syntax) to functional programming frameworks, and this feature is really common in every one of them:

https://spark.apache.org/docs/latest/ml-datasource
https://beam.apache.org/documentation/programming-guide/#pipeline-io

Besides, I am always talking about a "source", but it may also be used as a "sink". This would be really convenient for streaming processing.
