Replace references to deprecated Boston housing prices dataset with another dataset #2322

Closed
skylerwharton opened this issue Dec 14, 2021 · 3 comments
Labels
documentation Relating to readthedocs, notebooks, and exposition in docstrings good first issue This is a fix that might be easier for someone to do as a first contribution

@skylerwharton

As noted in sklearn's documentation, the Boston housing dataset is being deprecated due to a significant ethical concern:

Warning: The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable “B” assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore the goal of the research that led to the creation of this dataset was to study the impact of air quality but it did not give adequate demonstration of the validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.

Could uses of this dataset within the shap documentation (for example, on the "An introduction to explainable AI with Shapley values" page) be replaced with another dataset? For example, sklearn suggests that either the California housing dataset or the Ames housing dataset is a reasonable alternative [source].

Thank you!

@shanewalsh1123

I can have a look at this.

@thatlittleboy thatlittleboy added this to the 0.43.0 milestone Jul 23, 2023
@thatlittleboy thatlittleboy added the documentation Relating to readthedocs, notebooks, and exposition in docstrings label Jul 23, 2023
@connortann connortann added the good first issue This is a fix that might be easier for someone to do as a first contribution label Jul 27, 2023
@znacer
Contributor

znacer commented Aug 27, 2023

To help with this issue, I listed the files where "shap.datasets.boston" occurs:

  • notebooks:
    • notebooks/tabular_examples/model_agnostic/Simple Boston Demo.ipynb
    • notebooks/tabular_examples/tree_based_models/tree_shap_paper/Tree SHAP in Python.ipynb
    • notebooks/tabular_examples/tree_based_models/Catboost tutorial.ipynb
    • notebooks/tabular_examples/tree_based_models/Example of loading a custom tree model into SHAP.ipynb
    • notebooks/tabular_examples/tree_based_models/Python Version of Tree SHAP.ipynb
    • notebooks/tabular_examples/tree_based_models/Force Plot Colors.ipynb
    • notebooks/tabular_examples/tree_based_models/Front page example (XGBoost).ipynb
    • notebooks/benchmarks/tabular/Tabular Prediction Benchmark Demo.ipynb
    • notebooks/benchmarks/tabular/Benchmark XGBoost explanations.ipynb
  • documentation:
    • generated/shap.datasets.boston.rst
    • api.rst
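
A list like the one above could be regenerated with a short script. The following is only a sketch, not how the list was actually produced: the function name `find_references` and the set of file suffixes searched are assumptions made for illustration, and it relies on the string appearing literally in the notebook JSON and the `.rst` sources.

```python
from pathlib import Path

# The string the list above was built around.
NEEDLE = "shap.datasets.boston"

# Assumption: only notebooks, reST docs, and Python sources are worth scanning.
SUFFIXES = {".ipynb", ".rst", ".py"}


def find_references(root, needle=NEEDLE, suffixes=SUFFIXES):
    """Return sorted, root-relative paths of files whose text contains `needle`."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        # Skip directories and file types we did not ask for.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Unreadable or non-UTF-8 files are skipped rather than failing the scan.
            continue
        if needle in text:
            hits.append(path.relative_to(root).as_posix())
    return sorted(hits)


if __name__ == "__main__":
    # Run from the repository root to print one matching path per line.
    for hit in find_references("."):
        print(hit)
```

Rerunning it after each replacement gives a quick check of which files still reference the dataset.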

@connortann
Collaborator

I believe this is now closed 👌

Let us know if there's anything we've missed!
