
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town #16155

Closed · jrsykes opened this issue Jan 18, 2020 · 40 comments · Fixed by #20729

Comments

@jrsykes commented Jan 18, 2020

While using the boston_housing dataset, hosted by the Scikit-learn package and used to demo house-price prediction models, I came across a feature titled 'B'. This struck me as odd because all other features have descriptive names such as 'AGE' or 'TAX'. It turns out that B = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town. Since these data are hosted by a prestigious package, I naively assumed they were included because they offer significant explanatory value, which would point to a strongly pervasive racist mentality in the population at the time.

However, after reading the blog post linked below, it appears that the data in the B feature of the Boston housing dataset were manufactured in an attempt to encourage segregation of the races. If true, this would be strong evidence of systemic institutional racism, and by continuing to use this fraudulent data we would be perpetuating the effect its author desired. I hope you will agree that we would do the scientific literature a service by investigating this issue further and ultimately consigning this data to historical reference archives, rather than encouraging its use in modern research by hosting it.

I look forward to your response,

Jamie R. Sykes

https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
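
[Editor's note: for reference, a minimal sketch of the transformation described above; the function name is mine, and the values are just the formula evaluated at the endpoints.]

import numpy as np

# B = 1000 * (Bk - 0.63)**2, where Bk is the proportion of Black
# residents by town, per the dataset's own description.
def boston_b(bk):
    return 1000.0 * (bk - 0.63) ** 2

# The transform is zero at Bk = 0.63 and peaks at Bk = 0 with
# B = 396.9, the well-known maximum value of the column.
print(boston_b(np.array([0.0, 0.63, 1.0])))  # [396.9, 0.0, 136.9]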

@rth (Member) commented Jan 18, 2020

Thanks for raising this issue and for providing the reference with additional details.

+1 for finding a different housing dataset that doesn't contain such variables or assumptions and substituting it in the examples. I haven't yet looked for datasets we could use as a replacement.

@martinacantaro

Hi, I wrote an article on this dataset and found the same problem with the language. I didn't address the assumptions behind the existence of the column because I'm not as familiar with the dynamics of race and gentrification in the US, but I can see how it can be problematic as well. Can we at least change "blacks" to "black people"?

@martinacantaro

One option is removing it. Another, proposed by the article, is going back to the original census and reconstructing the data. Following the article's logic, reconstructing 36 data points should suffice, and the author found 20. I don't know how feasible it is to find the other 16.
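
[Editor's note: a quick sketch of why reconstruction from B alone is ambiguous, assuming the published transformation; the helper below is illustrative, not code from the article.]

import numpy as np

def candidate_bk(b):
    # Invert B = 1000 * (Bk - 0.63)**2: two branches, keep those in [0, 1].
    root = np.sqrt(b / 1000.0)
    return [bk for bk in (0.63 - root, 0.63 + root) if 0.0 <= bk <= 1.0]

# For B below 136.9, both candidate proportions fall in [0, 1], so the
# original census data is needed to pick the right one.
print(candidate_bk(90.0))   # [0.33, 0.93] -- ambiguous
print(candidate_bk(200.0))  # [0.1828...]  -- unambiguous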

@mcarlisle

Hi, I'm the author of the post @martinacantaro and @jrsykes are referring to above. While I second @martinacantaro on removing the column, I also suggest reconstructing the original tract data; that process should be documented and submitted together with the newly constructed dataset and the original paper.

@ogrisel (Member) commented Jan 22, 2020

Distributing this dataset as such is indeed a problem, because one could assume that we think casually making such assumptions about the segregationist propensity of house buyers is fine.

Here are some possible ways to deal with this:

a- keep the data as such, but add a warning in the documentation stating that this variable casually makes problematic assumptions, and that using this dataset without questioning those assumptions will likely be considered implicit endorsement of a racist worldview;
b- try to reconstruct the original Bk variable from the Census data, though apparently this is not easy (or even impossible, if the Boston dataset authors made errors when building the dataset, as @mcarlisle suggests in the blog post);
c- drop the B variable;
d- drop the full dataset and point our users to alternatives such as the California Housing dataset;
e- add an option to load_boston such that the B variable is not included by default but can be added back if users really want it.

The problem with c and d is that we will break tutorials and educational resources written by others, including tutorials that aim at educating machine learning practitioners on fairness-related issues. For instance, https://scikit-lego.readthedocs.io/en/latest/fairness.html uses scikit-learn's load_boston loader to illustrate the impact of the B variable on a "fairness proxy". I am not familiar enough with the literature to say whether the analysis and method proposed in that particular tutorial are valid, but we should probably not prevent others from studying those issues.

I think I would be in favor of a mix of proposals e and a, along with a FutureWarning deprecation cycle.

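[Editor's note: to make option (e) concrete, a rough sketch of how the filtering could work. The parameter name follows the discussion below, but nothing here is an agreed API, and the demo values are made up.]

import warnings
import numpy as np

def filter_b(data, feature_names, include_racial_segregation_variable=False):
    # Drop the 'B' column by default; warn if the caller opts back in.
    if include_racial_segregation_variable:
        warnings.warn(
            "The 'B' variable encodes a racially problematic assumption; "
            "see the dataset documentation before using it.",
            FutureWarning,
        )
        return data, feature_names
    b_idx = feature_names.index("B")
    return np.delete(data, b_idx, axis=1), [n for n in feature_names if n != "B"]

# Tiny demo with made-up values (not real Boston rows):
X = np.array([[6.5, 396.9, 4.98], [6.4, 392.8, 9.14]])
X_clean, names_clean = filter_b(X, ["RM", "B", "LSTAT"])
print(names_clean)  # ['RM', 'LSTAT']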

@ogrisel (Member) commented Jan 24, 2020

OK, but I would prefer not to completely remove the variable; instead, don't load it by default unless the user really needs it (e.g. to study the ethical implications of this variable), in which case they can pass a specific param to load_boston, e.g. load_boston(include_racial_segregation_variable=True).

> HumanityWarning? PastWarning?

Those naming suggestions feel weird to me.

+0 for EthicalWarning, or even an EthicalFutureWarning that would subclass both a new EthicalWarning class and the built-in FutureWarning class. This warning (EthicalFutureWarning) would stop being raised when calling sklearn.datasets.load_boston in scikit-learn version 0.(x + 2).

WDYT?
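
[Editor's note: a rough sketch of what that class hierarchy could look like; the names follow the suggestion above, and nothing here is an existing scikit-learn API.]

import warnings

class EthicalWarning(UserWarning):
    """Base class for warnings about ethically problematic data or code."""

class EthicalFutureWarning(EthicalWarning, FutureWarning):
    """Ethical concern plus deprecation semantics: existing warning
    filters on FutureWarning still catch it, and it stays visible by
    default like any other FutureWarning."""

warnings.warn(
    "load_boston includes the ethically problematic 'B' variable and "
    "is deprecated; see the documentation for alternatives.",
    EthicalFutureWarning,
)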

@mcarlisle

Personally and professionally, I'm enamored with the notion of introducing something like EthicalWarning, although I understand that this would cause many in tech who labor under the fallacy that "technology and mathematics are ethics-neutral" to get up in arms (present company not considered in this statement; just a general observation).

It is clear (to me, at least) that at this point in technological history, with the ubiquity of large common-use datasets collected under statistically under-controlled, possibly politically charged, and certainly financially motivated methodologies, those using said datasets in the present and future should be aware of such datasets' origins and potential concerns before using them. The forever questions are:

  1. Who feels the responsibility to offer mechanisms for such education, and
  2. Who would agree that those who say "yes" to (1) are the right ones to do so?

Does scikit-learn, as an enterprise, feel the need to dip its toes into these data ethics questions, and if so, to what extent? Or does the project wish to remain under the impression that the tools it provides and maintains have status EthicsNeutral == True? I have a hunch (not at the moment backed up with evidence) that this is not the only questionable dataset in the collection.

@amueller (Member)

I think people have generally used the Ames housing data instead.

@amueller (Member) commented Jan 27, 2020

I would favor removing the dataset, potentially with a longer deprecation cycle and replacing it entirely. I don't think we need to introduce a new warning type but we can be explicit about the cause of the removal.
Edit: I'm fine with adding to the warning a way to load from OpenML as long as we raise the issue in the warning as well.
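
[Editor's note: a sketch of what the pointer in the warning could look like. The OpenML dataset name and version here are my assumption and would need to be verified.]

from sklearn.datasets import fetch_openml

# The original Boston data is mirrored on OpenML, so users who still
# need it can fetch it there instead of from scikit-learn itself.
# (name/version are assumptions, to be checked against OpenML.)
boston = fetch_openml(name="boston", version=1, as_frame=True)
X, y = boston.data, boston.target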

@adrinjalali (Member)

> I would favor removing the dataset, potentially with a longer deprecation cycle and replacing it entirely. I don't think we need to introduce a new warning type but we can be explicit about the cause of the removal.

+1

@amueller (Member)

One note: Ames is not really a good 1:1 replacement, as it includes missing values and lots of categorical features. It's a good replacement in terms of semantics and an interesting dataset to play with, but it depends a bit on what we want from the dataset.

@maikia (Contributor) commented Jan 29, 2020

I would like to work on this issue.

@glemaitre (Member)

It seems that we will go for removal, so we can start deprecating the function. We can also mention fetch_openml. I am not sure yet what the alternative would be, though.

@rth (Member) commented Jan 29, 2020

As far as I understand, there was a decision for removal during the last dev meeting.

We need to decide which datasets we use for replacement in:

  • docstrings: we need a small example dataset. I think it doesn't need to be about housing, but it might be better to use a real dataset rather than a synthetic one for the replacement.
  • examples: two candidates are California and Ames housing. The latter needs to be added to OpenML. California has 40x more samples than Boston and is probably not suitable for a small and quick example (unless we make a subset; see the sketch after this list).
  • tests: I suppose the domain of the datasets won't matter, so we can take another small classical regression dataset.
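
[Editor's note: a sketch of the subsetting idea for California housing; the subset size is arbitrary, while fetch_california_housing and resample are existing scikit-learn utilities.]

from sklearn.datasets import fetch_california_housing
from sklearn.utils import resample

# California housing has ~20,640 samples; a small random subset keeps
# examples fast while still using a real dataset.
X, y = fetch_california_housing(return_X_y=True)
X_small, y_small = resample(X, y, n_samples=500, replace=False, random_state=0)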

@ogrisel (Member) commented Jan 29, 2020

For information, the list of examples that need updating:

$ git grep load_boston examples | cut -d ":" -f 1 | uniq
examples/applications/plot_model_complexity_influence.py
examples/applications/plot_outlier_detection_housing.py
examples/compose/plot_transformed_target.py
examples/ensemble/plot_gradient_boosting_regression.py
examples/ensemble/plot_stack_predictors.py
examples/ensemble/plot_voting_regressor.py
examples/feature_selection/plot_select_from_model_boston.py
examples/impute/plot_missing_values.py
examples/model_selection/plot_cv_predict.py
examples/plot_partial_dependence_visualization_api.py

@ogrisel (Member) commented Jan 29, 2020

An alternative embedded dataset for non-synthetic toy regression in scikit-learn is the Linnerud dataset from load_linnerud (20 samples, 3 features).

This dataset is probably more than enough for illustration purposes in the docstrings of regression estimators.
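
[Editor's note: for illustration, a minimal docstring-style usage; note Linnerud has three targets, so this is a multi-output regression.]

from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

# 20 samples, 3 exercise features, 3 physiological targets.
X, y = load_linnerud(return_X_y=True)
reg = LinearRegression().fit(X, y)
print(reg.predict(X[:2]))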

@thomasjpfan (Member)

A slightly bigger regression dataset, load_diabetes (442 samples, 10 features), can be used as well.
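
[Editor's note: the same idea with the diabetes data, which gives a single-target regression example.]

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

# 442 samples, 10 standardized features, one regression target.
X, y = load_diabetes(return_X_y=True)
print(Ridge(alpha=1.0).fit(X, y).score(X, y))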

@cmarmo (Member) commented Mar 8, 2021

This issue is a good candidate for finalization in 1.0. I will try to summarize here what is still missing, parsing the corresponding pull requests.

@ogrisel (Member) commented Mar 8, 2021

Thanks for the summary. For the last point, I would also be in favor of adding a permanent note in our documentation, as discussed in #18594, even if/after we deprecate load_boston, to document the problem, suggest alternative datasets, and educate our users about this issue.

@adrinjalali (Member)

I'm in favor of all your suggestions, @cmarmo.
