
Add "anti-pattern" examples #14081

Open
NicolasHug opened this issue Jun 13, 2019 · 16 comments


@NicolasHug NicolasHug commented Jun 13, 2019

As suggested by @ogrisel in a tweet, it'd be nice to have some examples showing what not to do.

A good first one is the one in the twitter thread (@jmschrei maybe?), but there are many other scenarios (typically any preprocessing applied to the whole dataset before splits, and not using pipelines)

We should make it very clear that these examples are meant to illustrate bad practices, not good ones (with emphasis in titles, warnings, etc.)
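To make the discussion concrete, here is a minimal sketch of the kind of anti-pattern being proposed (the dataset, scaler, and estimator are arbitrary illustrative choices, not from the thread): preprocessing fit on the full dataset before splitting leaks test-set statistics into training, whereas a Pipeline keeps the scaler inside the training data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# BAD: the scaler is fit on the whole dataset, so test-set statistics
# leak into the training data.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, random_state=0)
leaky_score = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# GOOD: split first, then let a Pipeline fit the scaler on training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
clean_score = model.fit(X_train, y_train).score(X_test, y_test)
```

With standardization the difference is often small in practice, which is exactly why an explicit example with warnings would be useful.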


@qinhanmin2014 qinhanmin2014 commented Jun 13, 2019

I think we already have some similar things, e.g. about cross_val_predict (https://scikit-learn.org/dev/modules/cross_validation.html#obtaining-predictions-by-cross-validation), but I agree that we need more (pipeline is a good example IMO).
Do we want to put these things together in one place (e.g. "common mistakes when using scikit-learn"), or leave them in separate sections?
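A hedged sketch of the cross_val_predict caveat referenced above (dataset and estimator are arbitrary): the predictions are fine for inspection, e.g. a confusion matrix, but a single metric pooled over them is not a proper generalization estimate; per-fold scores from cross_val_score are preferred for that.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# OK: obtaining predictions by cross-validation for visualisation.
y_pred = cross_val_predict(clf, X, y, cv=5)

# Questionable: one score pooled across folds hides per-fold variance.
pooled = accuracy_score(y, y_pred)

# Preferred: per-fold scores, then report mean and spread.
scores = cross_val_score(clf, X, y, cv=5)
```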


@glemaitre glemaitre commented Jun 13, 2019

I was aware of the hyper-parameter search: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py

I can think of a similar example for RF feature importance.

We could make a "Pitfalls" section. Having the examples dispatched across different sections would make them harder to discover.
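The nested cross-validation example linked above boils down to something like this sketch (parameter grid and estimator are illustrative): reporting GridSearchCV's best_score_ directly is optimistically biased, because the same data selected the parameters; wrapping the search in an outer cross_val_score is not.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

# Non-nested: best_score_ was used to pick C, so it is biased upward.
search.fit(X, y)
non_nested = search.best_score_

# Nested: the outer loop scores on data the inner search never saw.
nested = cross_val_score(search, X, y, cv=3).mean()
```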


@NicolasHug NicolasHug commented Jun 13, 2019

Agreed, a dedicated section will provide more visibility.


@jnothman jnothman commented Jun 14, 2019


@amueller amueller commented Jul 12, 2019

@jnothman those two suggestions seem to be slightly at odds, as Pipeline uses fit_transform internally, right?
Why should fit_transform be used with caution?

@thomasjpfan thomasjpfan added this to To do in Sprint Scipy 2019 Jul 13, 2019

@ManishAradwad ManishAradwad commented Aug 5, 2019

Is this issue assigned to anyone? I'd like to work on it.


@jnothman jnothman commented Aug 5, 2019

Why should fit_transform be used with caution?

Because it's easy for users to ignorantly apply it to their whole dataset, and then do model selection/evaluation.

those two suggestions seem to be slightly at odds, as Pipeline uses fit_transform internally, right?

I don't have a problem with fit_transform being called, only with it being called in user code without caution
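A minimal sketch of the point above (scaler and dataset are arbitrary): in user code, fit_transform should be applied to the training split only, with plain transform reusing the training statistics on the test split — which is exactly what a Pipeline does internally on the training folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data only
X_test_s = scaler.transform(X_test)        # reuse training statistics
```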


@jnothman jnothman commented Aug 5, 2019

@ManishAradwad there are no pull requests referencing this issue; I don't think it's been attempted.


@ManishAradwad ManishAradwad commented Aug 5, 2019

Thanks @jnothman! I'll start working on this one then.
Could you please suggest how I should approach it?


@NicolasHug NicolasHug commented Aug 5, 2019

@ManishAradwad there are a few suggestions in this thread already. They require some minimal experience with scikit-learn and ML in general.


@ManishAradwad ManishAradwad commented Aug 5, 2019

As far as I understand, we're supposed to create a "Pitfalls" section (as suggested by @glemaitre) showing practices that users SHOULD NOT follow, right?

If that's OK, then we need such examples for the following topics:
Pitfalls:

  • SelectKBest
  • cross_val_predict
  • pipeline
  • RF

Please add more topics if I missed anything.
If everything's fine, I'll start with SelectKBest.
Thanks for the help!
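For the SelectKBest item on the list above, a hedged sketch of the pitfall (pure-noise data makes the leak visible; all names here are illustrative): selecting features on the full dataset before cross-validation leaks target information, while doing the selection inside a Pipeline keeps it in each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(100, 1000)        # pure noise features
y = rng.randint(0, 2, 100)      # random labels

# BAD: selection sees all labels, so CV scores typically look far
# better than chance even though the features are noise.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean()

# GOOD: selection is refit inside each training fold via a Pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()
```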


@NicolasHug NicolasHug commented Aug 5, 2019

You got it right.

I'd start with a very simple data leak example, e.g. scaling the whole data and then doing cross validation instead of using a pipeline.
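The simple example suggested here might look like this sketch (dataset and estimator are arbitrary stand-ins): scaling the full dataset before cross-validation leaks each fold's test statistics into its training data, whereas a Pipeline refits the scaler on the training portion of every fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# BAD: every CV fold is scaled with statistics that include its test samples.
X_scaled = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_scaled, y, cv=5)

# GOOD: the scaler is refit on the training portion of each fold.
pipe = make_pipeline(StandardScaler(), SVC())
clean = cross_val_score(pipe, X, y, cv=5)
```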


@ManishAradwad ManishAradwad commented Aug 11, 2019

@NicolasHug I have installed all the required dependencies and used make to generate all the .html files.
Now I'm a little bit confused about how to add the new section. Is it OK to create a new pitfalls.html file and modify documentation.html, or is there an easier way to do this?


@jnothman jnothman commented Aug 12, 2019


@jnothman jnothman commented Aug 14, 2019

Maybe this should be a more general "practical usage advice" section. For example, I'm not really sure that it would be right to frame "choosing an evaluation metric and cross validation procedure is of utmost importance" as a pitfall.

Another topic that might fall under this rubric is "don't blindly use Euclidean distance in clustering" aka "unsupervised models need bias".
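A hedged illustration of that clustering point (the synthetic data is invented for this sketch): with unscaled features, Euclidean distance — and hence KMeans — is dominated by whichever feature has the largest range, so a noise feature spanning a large range can swamp an informative one unless the data is rescaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Feature 0 separates two groups but spans ~1 unit; feature 1 is pure
# noise spanning ~1000s of units, so it dominates raw Euclidean distances.
X = np.c_[np.r_[rng.randn(50), rng.randn(50) + 3.0] * 0.3,
          rng.randn(100) * 1000]

raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
```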


@NicolasHug NicolasHug commented Aug 14, 2019

"Recommended Practices"?
