Add "anti-pattern" examples #14081

Closed
NicolasHug opened this issue Jun 13, 2019 · 21 comments
Closed

Add "anti-pattern" examples #14081

NicolasHug opened this issue Jun 13, 2019 · 21 comments

Comments

@NicolasHug (Member) commented Jun 13, 2019

As suggested by @ogrisel in a tweet, it'd be nice to have some examples showing what not to do.

A good first one is the one from the Twitter thread (@jmschrei maybe?), but there are many other scenarios (typically, any preprocessing applied to the whole dataset before splitting, and not using pipelines).

We should make it very clear that these examples are meant to illustrate bad practices, not good ones (with emphasis in the titles, warnings, etc.).

@qinhanmin2014 (Member) commented Jun 13, 2019

I think we already have some similar things, e.g., about cross_val_predict (https://scikit-learn.org/dev/modules/cross_validation.html#obtaining-predictions-by-cross-validation), but I agree that we need more (pipeline is a good example IMO).
Do we want to put these things together (e.g., a "common mistakes when using scikit-learn" section), or leave them in separate sections?
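A minimal sketch of the cross_val_predict misuse referenced above (synthetic data and an arbitrary classifier, purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Pitfall: pooling the out-of-fold predictions and scoring them as if
# this were a proper cross-validation estimate of generalization error.
pooled_pred = cross_val_predict(clf, X, y, cv=5)
print("pooled score:", accuracy_score(y, pooled_pred))

# Preferred: score each fold separately and report the mean.
print("per-fold mean:", cross_val_score(clf, X, y, cv=5).mean())
```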

@glemaitre (Contributor) commented Jun 13, 2019

I was already aware of the hyper-parameter search example: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py

I can think of a similar one for RF feature importances.

We could make a "Pitfalls" section. Having the examples scattered across different sections might make them more difficult to discover.
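A rough sketch of the hyper-parameter search pitfall from the linked example, on synthetic data (the grid and estimator are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=5)

# Pitfall: reporting the inner search's best score as the model's
# generalization performance (optimistically biased).
search.fit(X, y)
print("non-nested:", search.best_score_)

# Better: wrap the whole search in an outer cross-validation loop.
print("nested:", cross_val_score(search, X, y, cv=5).mean())
```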

@NicolasHug (Member, Author) commented Jun 13, 2019

Agreed, a dedicated section will provide more visibility.

@jnothman (Member) commented Jun 14, 2019

@amueller (Member) commented Jul 12, 2019

@jnothman those two suggestions seem to be slightly at odds, as Pipeline uses fit_transform internally, right?
Why should fit_transform be used with caution?

@thomasjpfan added this to To do in Sprint Scipy 2019 on Jul 13, 2019
@ManishAradwad (Contributor) commented Aug 5, 2019

Is this issue assigned to anyone? I'd like to work on it.

@jnothman (Member) commented Aug 5, 2019

Why should fit_transform be used with caution?

Because it's easy for users to ignorantly apply it to their whole dataset, and then do model selection/evaluation.

those two suggestions seem to be slightly at odds, as Pipeline uses fit_transform internally, right?

I don't have a problem with fit_transform being called, only with it being called in user code without caution
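A minimal sketch of that distinction, with a hypothetical train/test split (the variable names are illustrative only):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pitfall: calling fit_transform on the full dataset leaks test-set
# statistics into the training data.
# X_leaky = StandardScaler().fit_transform(X)

# Safer: fit on the training split only, then transform the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```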

@jnothman (Member) commented Aug 5, 2019

@ManishAradwad there are no pull requests referencing this issue; I don't think it's been attempted.

@ManishAradwad (Contributor) commented Aug 5, 2019

Thanks @jnothman! I'll start working on this one then.
Could you please suggest how I should approach it?

@NicolasHug (Member, Author) commented Aug 5, 2019

@ManishAradwad there are a few suggestions in this thread already. They require some minimal experience with scikit-learn and ML in general.

@ManishAradwad (Contributor) commented Aug 5, 2019

As far as I understand, we're supposed to create a "Pitfalls" section (as suggested by @glemaitre) that shows practices which SHOULD NOT be followed by users, right?

If that's ok, then we need such examples for the following topics:
Pitfalls:

  • SelectKBest
  • cross_val_predict
  • pipeline
  • RF

Please add more topics if I missed anything.
If everything's fine, I'll start with SelectKBest.
Thanks for the help!
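A rough sketch of the SelectKBest item, using purely random data so that any apparent signal is leakage (the array sizes and k are arbitrary):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(100, 5000)           # pure noise features
y = rng.randint(0, 2, size=100)    # random labels

# Pitfall: selecting features on the full dataset before cross-validation
# yields optimistic scores even though X carries no information about y.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
print(cross_val_score(LogisticRegression(), X_selected, y, cv=5).mean())

# Correct: do the selection inside cross-validation via a pipeline,
# so it is refit on each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())
```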

@NicolasHug (Member, Author) commented Aug 5, 2019

You got it right.

I'd start with a very simple data-leak example, e.g. scaling the whole dataset and then doing cross-validation, instead of using a pipeline.
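A minimal sketch of that data-leak example (synthetic data; the scaler and classifier are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

# Pitfall: the scaler sees the whole dataset, so every CV test fold has
# already influenced the preprocessing of its training folds.
X_scaled = StandardScaler().fit_transform(X)
print(cross_val_score(KNeighborsClassifier(), X_scaled, y, cv=5).mean())

# Correct: put the scaler in a pipeline so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
print(cross_val_score(pipe, X, y, cv=5).mean())
```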

@ManishAradwad (Contributor) commented Aug 11, 2019

@NicolasHug I have installed all the required dependencies and used make to generate all the .html files.
Now I'm a little confused about how to add the new section. Is it ok to create another pitfalls.html file and modify documentation.html, or is there an easier way to do this?

@jnothman (Member) commented Aug 12, 2019

@jnothman (Member) commented Aug 14, 2019

Maybe this should be a more general "practical usage advice" section. For example, I'm not really sure that it would be right to frame "choosing an evaluation metric and cross validation procedure is of utmost importance" as a pitfall.

Another topic that might fall under this rubric is "don't blindly use Euclidean distance in clustering" aka "unsupervised models need bias".
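A rough sketch of the "don't blindly use Euclidean distance" point, with hypothetical two-feature data where one feature's scale dominates the distance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Feature 0 ranges over roughly [0, 1]; feature 1 over roughly [0, 10000].
X = np.column_stack([rng.rand(200), rng.rand(200) * 10_000])

# Pitfall: with raw Euclidean distance, the large-scale feature decides
# the clustering almost entirely.
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# One mitigation: rescale the features (or choose a metric/algorithm that
# reflects what "similar" actually means for the problem at hand).
X_scaled = StandardScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```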

@NicolasHug (Member, Author) commented Aug 14, 2019

"Recommended Practices"?

@ogrisel (Member) commented Feb 7, 2020

From a marketing point of view, "Recommended Practices" is boring and readers will skip it :) I really like the negative phrasing better (common pitfalls, methodological traps, anti-patterns, failure modes...). It's catchier, and I think it will increase the likelihood that our users pay attention.

@ManishAradwad (Contributor) commented Feb 8, 2020

Due to some issues with my laptop, I lost my fork, so I had to create another one. I've made the changes suggested in the previous PR. Please review this one.

@skeller88 (Contributor) commented Apr 10, 2020

This is great. I see some overlap with the issue I raised here. But this issue is more specific to scikit-learn, which makes sense.

Here are some pitfalls I have in mind to contribute. Thoughts? If these pitfalls sound like a good fit for documentation, I'll start working on them.

  • Using a different preprocessing flow on the training and test data. For example, scaling the training set but not the test set.
  • Not setting a random state when a project uses shuffle operations and needs to be reproducible.
  • Using loops for processing data when a vectorized operation is available.
  • The dummy variable trap when one-hot encoding categorical variables
  • Using a model or statistical test without testing its assumptions. Examples are linear regression on data that isn't known to be normally distributed, using Pearson's correlation on data that isn't known to be normally distributed, and using a tree-based model for inference on data that has multicollinearity.
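Two of these items sketched briefly, the dummy-variable trap and unset random states (toy data, illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X_cat = np.array([["red"], ["green"], ["blue"], ["red"]])

# Pitfall: full one-hot encoding makes the dummy columns sum to 1, which
# is perfectly collinear with the intercept of a linear model.
X_full = OneHotEncoder().fit_transform(X_cat).toarray()

# One mitigation: drop one level per feature.
X_dropped = OneHotEncoder(drop="first").fit_transform(X_cat).toarray()

# Pitfall: leaving random_state unset makes shuffle-based results
# (splits, bootstraps, initializations) irreproducible across runs.
train, test = train_test_split(np.arange(10), shuffle=True, random_state=0)
```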

@lorentzenchr (Member) commented Aug 5, 2020

Just as info: "Pitfalls to Avoid when Interpreting Machine Learning Models" is a good reference on pitfalls for model inspection.

@NicolasHug (Member, Author) commented Oct 26, 2020

Now that we have this new chapter in the docs, I think we can close this issue. Thanks @ManishAradwad, @lucyleeow, and @skeller88 for your PRs.

Side note: #18619 (comment) is a good candidate for a new subsection.

@NicolasHug closed this on Oct 26, 2020