@jeromedockes
Member

@jeromedockes jeromedockes commented Jun 4, 2024

closes #924

also maybe enough to close #871 and #866?

TODO:

  • decide on name

@jeromedockes jeromedockes marked this pull request as draft June 4, 2024 15:08
@jeromedockes jeromedockes changed the title [WIP] add get_learner function [WIP] add make_tabular_pipeline function Jun 5, 2024
@GaelVaroquaux
Member

Name suggestion: "make_skrub_learner"?

We could have "make_skrub_pipeline", but I like to put the emphasis on the fact that this gives a supervised learner

@jeromedockes
Member Author

I don't have a strong opinion on the name. I think the advantage of having "pipeline" or something similar in it is that it makes a clearer distinction between the learner provided by the user and the learner returned to them.

@jeromedockes jeromedockes marked this pull request as ready for review June 7, 2024 13:56
@jeromedockes jeromedockes changed the title [WIP] add make_tabular_pipeline function Add make_tabular_pipeline function Jun 7, 2024
@jeromedockes
Member Author

Let's vote to pick a name in #938; don't hesitate to add options!

Contributor

@TheooJ TheooJ left a comment

LGTM! We just need a name :)

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 17, 2024 via email

Member

@GaelVaroquaux GaelVaroquaux left a comment

This looks great. A few suggestions.

Can you also add it to the "See Also" section of the TableVectorizer docstring, as well as to the encoding narrative doc (first section)?

Thanks!!

do not accept heterogeneous dataframes containing complex data such as
datetimes or strings. Moreover, they do not always accept the input to
contain missing values. Therefore, some preprocessing must be applied to
dataframes before they are passed to an estimator.
Member

The description on top of the docstring is too long. It should be moved to a "Notes" section, leaving only a quick read before the parameters section.
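
To illustrate the point made in the docstring excerpt under review, here is a minimal sketch using plain scikit-learn and pandas. It is not the pipeline built by this PR: the toy dataframe, the column names, and the choice of OneHotEncoder, mean imputation, and Ridge are all assumptions picked for illustration. It only shows that an estimator rejects a raw dataframe with strings and missing values, and that a preprocessing step makes the fit succeed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one string column, one numeric column with a gap.
df = pd.DataFrame(
    {
        "city": ["Paris", "Lyon", "Paris", "Nice"],
        "age": [31.0, np.nan, 45.0, 28.0],
    }
)
y = np.array([1.0, 2.0, 3.0, 4.0])

# Fitting directly on the raw dataframe fails because of the string column.
try:
    Ridge().fit(df, y)
    raw_fit_ok = True
except (ValueError, TypeError):
    raw_fit_ok = False

# Assumed preprocessing: one-hot encode strings, impute numeric gaps.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ("num", SimpleImputer(strategy="mean", add_indicator=True), ["age"]),
    ]
)
model = make_pipeline(preprocess, Ridge())
model.fit(df, y)
predictions = model.predict(df)

print("raw fit succeeded:", raw_fit_ok)
print("predictions shape:", predictions.shape)
```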

@jeromedockes
Member Author

@GaelVaroquaux IIRC some experimentation you're working on reveals that adding a missing value indicator when imputing usually helps. Should we set add_indicator=True on the SimpleImputer here?
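
For reference, a minimal sketch of what add_indicator=True changes on scikit-learn's SimpleImputer. The toy array and the "mean" strategy are assumptions for illustration, not necessarily what this PR uses. With the flag set, the imputer appends one binary column per feature that had missing values at fit time, so the downstream learner can see where values were imputed.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: only the first column contains a missing value.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

# Plain imputation replaces the NaN with the column mean.
plain = SimpleImputer(strategy="mean").fit_transform(X)

# With add_indicator=True, a missing-value indicator column is appended
# for each feature that had missing values during fit.
with_indicator = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

print(plain.shape)
print(with_indicator.shape)
print(with_indicator)
```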

@glemaitre
Member

Should we set add_indicator=True on the SimpleImputer here?

Oh yes ;)

@glemaitre glemaitre changed the title Add make_tabular_pipeline function Add tabular_learner function Jun 18, 2024
@glemaitre glemaitre changed the title Add tabular_learner function FEA Add tabular_learner factory function Jun 18, 2024
@glemaitre glemaitre self-requested a review June 18, 2024 09:16
@GaelVaroquaux
Member

GaelVaroquaux commented Jun 18, 2024 via email

@TheooJ TheooJ merged commit b0ad899 into skrub-data:main Jun 18, 2024
@TheooJ
Contributor

TheooJ commented Jun 18, 2024

Thanks @jeromedockes !

@glemaitre
Member

Oops, I was not fast enough to give my review. I'll open a PR with a few small changes to the docs :)

@tomMoral

great feature! thx :)

@jeromedockes jeromedockes deleted the add-get-learner branch June 18, 2024 18:26

Development

Successfully merging this pull request may close these issues.

  • adding make_learner to create a default pipeline for a given predictor
  • Handle numerical missing values in TableVectorizer

5 participants