
Estimator data source object #16393

Open
DelgadoPanadero opened this issue Feb 5, 2020 · 7 comments

Comments

@DelgadoPanadero

DelgadoPanadero commented Feb 5, 2020

Hi! The Pipeline instance is very interesting because it uses a syntax very similar to functional programming frameworks such as PySpark and Apache Beam. However, I miss a "DataSource" object that deals with the logic required to load the data, beyond passing it as a list of values. For instance, it could connect to a database to load the data for fitting and for predicting.

Pipeline([MongoDataSource("127.0.0.1:8087", query),
          Normalizer(),
          SVC()]).fit()

I have tried to build one of these objects on my own using an estimator structure; however, I saw that it does not fit your conventions (https://scikit-learn.org/stable/developers/develop.html). That is why I think a new object is needed.

What do you think? Thank you!
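To make the proposal concrete, here is a minimal sketch of what such a DataSource object might look like. Nothing here is existing scikit-learn API: the class name, the `load()` method, and the in-memory dict standing in for a real database are all assumptions for illustration.

```python
# Hypothetical sketch: a DataSource-like object that produces X (and y)
# itself instead of receiving them. The backing "database" is faked
# with an in-memory dict so the example runs standalone.

class DictDataSource:
    """Loads training data from a backing store when asked."""

    def __init__(self, store, query):
        self.store = store    # stand-in for a real database connection
        self.query = query    # stand-in for a real query

    def load(self):
        # A real implementation would run self.query against the
        # database and split the result into features and target.
        rows = self.store[self.query]
        X = [row[:-1] for row in rows]
        y = [row[-1] for row in rows]
        return X, y


# Usage: the source yields X and y itself, which is why, in the
# proposal above, Pipeline.fit() could be called with no arguments.
fake_db = {"SELECT * FROM t": [[1.0, 2.0, 0], [3.0, 4.0, 1]]}
source = DictDataSource(fake_db, "SELECT * FROM t")
X, y = source.load()
```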

@jnothman
Member

jnothman commented Feb 6, 2020 via email

@DelgadoPanadero
Author

I do not really think that it can be done through a transform method, because according to your conventions this method expects the X argument, and the length of the output array should be the same as that of the X input. Moreover, in my example the fit() method does not require any argument, because of the presence of the MongoDataSource object.

Regarding the part about not getting the y value: this is the kind of logic that could be implemented inside the MongoDataSource object from the example (I don't know how yet; maybe by expecting a certain naming from the query).

I know that it changes a little how pipelines are currently built in scikit-learn; however, I think it might see great adoption from the community, given the analogy with other frameworks (as I told you before). I could also code a first version of this object and some usage rules, if it helps. Thank you!

@glemaitre
Member

Does the FunctionTransformer alleviate the issue:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
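For instance, FunctionTransformer can wrap an arbitrary loading function so that the pipeline's X is just a list of sample IDs. The lookup table and `fetch_features` function below are illustrative stand-ins for an external store, not anything from the issue:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative stand-in for an external store: maps sample IDs to rows.
LOOKUP = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}

def fetch_features(ids):
    """Resolve a list of IDs into a feature matrix."""
    return np.array([LOOKUP[i] for i in ids])

# FunctionTransformer turns the plain function into a transformer that
# can sit inside a Pipeline; X passed to the pipeline is a list of IDs.
loader = FunctionTransformer(fetch_features)
X = loader.transform([0, 2])
```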

@jnothman
Member

jnothman commented Feb 15, 2020 via email

@DelgadoPanadero
Author

As far as I know, when you both talk about transformers and FunctionTransformer, you are just talking about classes with a "transform" method defined as an object method. According to your documentation, the convention for fitting in the Pipeline framework is that this method should receive a matrix X with shape (n_samples, n_features) and return a matrix with the same dimensions.

With the DataSource object, I would expect it to have a method that just receives a configuration object (such as a dictionary), or even to receive it at instantiation (as in the example above), and it would return the X and y (if required) when the Pipeline's fit() and predict() methods are called.

I do not really know if this explanation helps; otherwise I could code a first version of it to show how I expect it to work. Thank you!

@jnothman
Member

According to your documentation, the convention for fitting in the Pipeline framework is that this method should receive a matrix X with shape (n_samples, n_features) and return a matrix with the same dimensions.

That documentation simplifies things a bit. It needs to be an iterable (usually an array or sequence) of length n_samples. See the glossary on X.

One certainly does not need to return a matrix with the same dimensions, but transform should return an object of equal length (on the first axis).

With the DataSource object, I would expect it to have a method that just receives a configuration object (such as a dictionary), or even to receive it at instantiation (as in the example above), and it would return the X and y (if required) when the Pipeline's fit() and predict() methods are called.

This will not work with our cross validation routines. However, transform can take as input a list of IDs and return a list or array or dataframe of representations of those IDs.
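A sketch of that ID-to-representation idea, assuming a simple in-memory store (the `IDFeatureLoader` class, the store contents, and the choice of classifier are all hypothetical, not part of the discussion):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

class IDFeatureLoader(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: resolves sample IDs into feature rows.

    Because transform returns one row per input ID, the length on the
    first axis is preserved, so cross-validation can split the IDs.
    """

    def __init__(self, store):
        self.store = store  # stand-in for a database or feature store

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([self.store[i] for i in X])


# Illustrative store and usage inside a Pipeline: X is a list of IDs.
store = {i: [float(i), float(i) % 2] for i in range(10)}
ids = np.arange(10)
y = ids % 2

pipe = Pipeline([("load", IDFeatureLoader(store)),
                 ("clf", LogisticRegression())])
pipe.fit(ids, y)
pred = pipe.predict(ids)
```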

@DelgadoPanadero
Author

I see your point; you can make a workaround with a transformer. But don't you think it could be interesting to solve this with a new feature or object? The Pipeline object is very similar (in terms of process and syntax) to functional programming frameworks, and this feature is really common in every one of them:

https://spark.apache.org/docs/latest/ml-datasource
https://beam.apache.org/documentation/programming-guide/#pipeline-io

Besides, I am always talking about a "source", but it may also be used as a "sink". This would be really convenient for streaming processing.
