
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town #16155

Closed · jrsykes opened this issue Jan 18, 2020 · 40 comments · Fixed by #20729

Comments

@jrsykes commented Jan 18, 2020

While using the boston_housing dataset, hosted by the Scikit-learn package and used to demo house-price prediction models, I came across a feature titled 'B'. This struck me as odd because all other features have descriptive names such as 'AGE' or 'TAX'. It turns out that B = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town. Since these data are hosted by a prestigious package, I naively assumed they were included because they offer significant explanatory value, which would point to a strongly pervasive racist mentality in the population at the time.

However, after reading the blog post linked below, it appears that the data in the B feature of the Boston housing dataset were manufactured in an attempt to encourage segregation of the races. If true, this would be strong evidence of systemic institutional racism, and by continuing to use this fraudulent data we would be perpetuating the effect its author desired. I hope you will agree that we would do the scientific literature a service by investigating this issue further and ultimately consigning this data to historical reference archives, rather than encouraging its use in modern research by hosting it.

I look forward to your response,

Jamie R. Sykes

https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
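
[Editor's note: for reference, a minimal sketch of the transformation described above; the function name is mine, and the values are just the formula evaluated at the endpoints.]

import numpy as np

# B = 1000 * (Bk - 0.63)**2, where Bk is the proportion of Black
# residents by town, per the dataset's own description.
def boston_b(bk):
    return 1000.0 * (bk - 0.63) ** 2

# The transform is zero at Bk = 0.63 and peaks at Bk = 0 with
# B = 396.9, the well-known maximum value of the column.
print(boston_b(np.array([0.0, 0.63, 1.0])))  # [396.9, 0.0, 136.9]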

@rth (Member) commented Jan 18, 2020

Thanks for raising this issue and for providing the reference with additional details.

+1 for finding a different housing dataset that doesn't contain such variables or assumptions and substituting it in the examples. I haven't yet looked for datasets we could use as a replacement.

@martinacantaro

Hi, I wrote an article on this dataset and found the same problem with the language. I didn't address the assumptions behind the existence of the column because I'm not as familiar with the dynamics of race and gentrification in the US, but I can see how it can be problematic as well. Can we at least change "blacks" to "black people"?

@martinacantaro

One option is removing it. Another, proposed by the article, is going back to the original census and reconstructing the data. Following the article's logic, reconstructing 36 data points should suffice, and the author found 20. I don't know how feasible it is to find the other 16.
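
[Editor's note: a quick sketch of why reconstruction from B alone is ambiguous, assuming the published transformation; the helper below is illustrative, not code from the article.]

import numpy as np

def candidate_bk(b):
    # Invert B = 1000 * (Bk - 0.63)**2: two branches, keep those in [0, 1].
    root = np.sqrt(b / 1000.0)
    return [bk for bk in (0.63 - root, 0.63 + root) if 0.0 <= bk <= 1.0]

# For B below 136.9, both candidate proportions fall in [0, 1], so the
# original census data is needed to pick the right one.
print(candidate_bk(90.0))   # [0.33, 0.93] -- ambiguous
print(candidate_bk(200.0))  # [0.1828...]  -- unambiguous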

@mcarlisle

Hi, I'm the author of the post @martinacantaro and @jrsykes are referring to above. While I second @martinacantaro on removing the column, I also suggest reconstructing the original tract data; that process should be documented and submitted together with the newly constructed dataset and the original paper.

@ogrisel (Member) commented Jan 22, 2020

Distributing this dataset as such is indeed a problem, because one could assume that we think casually making such assumptions about the segregationist propensity of house buyers is fine.

Here are some possible ways to deal with this:

a- keep the data as such, but add a warning in the documentation stating that this variable casually makes problematic assumptions, and that using this dataset without questioning those assumptions will likely be considered implicit endorsement of a racist worldview;
b- try to reconstruct the original Bk variable from the Census data, though apparently this is not easy (or even impossible, if the Boston dataset authors made errors when building the dataset, as @mcarlisle suggests in the blog post);
c- drop the B variable;
d- drop the full dataset and point our users to alternatives such as the California Housing dataset;
e- add an option to load_boston such that the B variable is not included by default but can be added back if users really want it.

The problem with c and d is that we will break tutorials and educational resources written by others, including tutorials that aim at educating machine learning practitioners on fairness-related issues. For instance, https://scikit-lego.readthedocs.io/en/latest/fairness.html uses scikit-learn's load_boston loader to illustrate the impact of the B variable on a "fairness proxy". I am not familiar enough with the literature to say whether the analysis and method proposed in that particular tutorial are valid, but we should probably not prevent others from studying those issues.

I think I would be in favor of a mix of proposals e and a, along with a FutureWarning deprecation cycle.

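[Editor's note: to make option (e) concrete, a rough sketch of how the filtering could work. The parameter name follows the discussion below, but nothing here is an agreed API, and the demo values are made up.]

import warnings
import numpy as np

def filter_b(data, feature_names, include_racial_segregation_variable=False):
    # Drop the 'B' column by default; warn if the caller opts back in.
    if include_racial_segregation_variable:
        warnings.warn(
            "The 'B' variable encodes a racially problematic assumption; "
            "see the dataset documentation before using it.",
            FutureWarning,
        )
        return data, feature_names
    b_idx = feature_names.index("B")
    return np.delete(data, b_idx, axis=1), [n for n in feature_names if n != "B"]

# Tiny demo with made-up values (not real Boston rows):
X = np.array([[6.5, 396.9, 4.98], [6.4, 392.8, 9.14]])
X_clean, names_clean = filter_b(X, ["RM", "B", "LSTAT"])
print(names_clean)  # ['RM', 'LSTAT']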

@ogrisel (Member) commented Jan 24, 2020

OK, but I would prefer not to completely remove the variable; instead, don't load it by default unless the user really needs it (e.g. to study the ethical implications of this variable), in which case they can pass a specific param to load_boston, e.g. load_boston(include_racial_segregation_variable=True).

> HumanityWarning? PastWarning?

Those naming suggestions feel weird to me.

+0 for EthicalWarning, or even an EthicalFutureWarning that would subclass both a new EthicalWarning class and the built-in FutureWarning class. This warning (EthicalFutureWarning) would stop being raised when calling sklearn.datasets.load_boston in scikit-learn version 0.(x + 2).

WDYT?
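
[Editor's note: a rough sketch of what that class hierarchy could look like; the names follow the suggestion above, and nothing here is an existing scikit-learn API.]

import warnings

class EthicalWarning(UserWarning):
    """Base class for warnings about ethically problematic data or code."""

class EthicalFutureWarning(EthicalWarning, FutureWarning):
    """Ethical concern plus deprecation semantics: existing warning
    filters on FutureWarning still catch it, and it stays visible by
    default like any other FutureWarning."""

warnings.warn(
    "load_boston includes the ethically problematic 'B' variable and "
    "is deprecated; see the documentation for alternatives.",
    EthicalFutureWarning,
)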

@mcarlisle

Personally and professionally, I'm enamored with the notion of introducing something like EthicalWarning, although I understand that this would cause many in tech who labor under the fallacy that "technology and mathematics are ethics-neutral" to get up in arms (present company not considered in this statement; just a general observation).

It is clear (to me, at least) that at this point in technological history, with the ubiquity of large common-use datasets collected under statistically under-controlled, possibly politically charged, and certainly financially motivated methodologies, those using said datasets in the present and future should be aware of such datasets' origins and potential concerns before using them. The forever questions are:

  1. Who feels the responsibility to offer mechanisms for such education, and
  2. Who would agree that those who say "yes" to (1) are the right ones to do so?

Does scikit-learn, as an enterprise, feel the need to dip its toes into these data ethics questions, and if so, to what extent? Or does the project wish to remain under the impression that the tools it provides and maintains have status EthicsNeutral == True? I have a hunch (not at the moment backed up with evidence) that this is not the only questionable dataset in the collection.

@amueller (Member)

I think people have generally used the Ames housing data instead.

@amueller (Member) commented Jan 27, 2020

I would favor removing the dataset, potentially with a longer deprecation cycle and replacing it entirely. I don't think we need to introduce a new warning type but we can be explicit about the cause of the removal.
Edit: I'm fine with adding to the warning a way to load from OpenML as long as we raise the issue in the warning as well.
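
[Editor's note: a sketch of what the pointer in the warning could look like. The OpenML dataset name and version here are my assumption and would need to be verified.]

from sklearn.datasets import fetch_openml

# The original Boston data is mirrored on OpenML, so users who still
# need it can fetch it there instead of from scikit-learn itself.
# (name/version are assumptions, to be checked against OpenML.)
boston = fetch_openml(name="boston", version=1, as_frame=True)
X, y = boston.data, boston.target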

@adrinjalali (Member)

> I would favor removing the dataset, potentially with a longer deprecation cycle and replacing it entirely. I don't think we need to introduce a new warning type but we can be explicit about the cause of the removal.

+1

@amueller (Member)

One note: Ames is not really a good 1:1 replacement, as it includes missing values and lots of categorical features. It's a good replacement in terms of semantics and an interesting dataset to play with, but it depends a bit on what we want from the dataset.

@maikia (Contributor) commented Jan 29, 2020

I would like to work on this issue.

@glemaitre (Member)

It seems that we will go for removal, so we can start deprecating the function. We can also mention fetch_openml. I am not sure yet what the alternative would be, though.

@rth (Member) commented Jan 29, 2020

As far as I understand, there was a decision for removal during the last dev meeting.

We need to decide which datasets we use for replacement in:

  • docstrings: we need a small example dataset. I think it doesn't need to be about housing, but it might be better to use a real dataset rather than a synthetic one for the replacement.
  • examples: two candidates are California and Ames housing. The latter needs to be added to OpenML. California has 40x more samples than Boston and is probably not suitable for a small and quick example (unless we make a subset; see the sketch after this list).
  • tests: I suppose the domain of the datasets won't matter, so we can take another small classical regression dataset.
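
[Editor's note: a sketch of the subsetting idea for California housing; the subset size is arbitrary, while fetch_california_housing and resample are existing scikit-learn utilities.]

from sklearn.datasets import fetch_california_housing
from sklearn.utils import resample

# California housing has ~20,640 samples; a small random subset keeps
# examples fast while still using a real dataset.
X, y = fetch_california_housing(return_X_y=True)
X_small, y_small = resample(X, y, n_samples=500, replace=False, random_state=0)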

@ogrisel (Member) commented Jan 29, 2020

For information, the list of examples that need updating:

$ git grep load_boston examples | cut -d ":" -f 1 | uniq
examples/applications/plot_model_complexity_influence.py
examples/applications/plot_outlier_detection_housing.py
examples/compose/plot_transformed_target.py
examples/ensemble/plot_gradient_boosting_regression.py
examples/ensemble/plot_stack_predictors.py
examples/ensemble/plot_voting_regressor.py
examples/feature_selection/plot_select_from_model_boston.py
examples/impute/plot_missing_values.py
examples/model_selection/plot_cv_predict.py
examples/plot_partial_dependence_visualization_api.py

@ogrisel (Member) commented Jan 29, 2020

An alternative embedded dataset for non-synthetic toy regression in scikit-learn is the Linnerud dataset from load_linnerud (20 samples, 3 features).

This dataset is probably more than enough for illustration purposes in the docstrings of regression estimators.
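
[Editor's note: for illustration, a minimal docstring-style usage; note Linnerud has three targets, so this is a multi-output regression.]

from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

# 20 samples, 3 exercise features, 3 physiological targets.
X, y = load_linnerud(return_X_y=True)
reg = LinearRegression().fit(X, y)
print(reg.predict(X[:2]))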

@thomasjpfan (Member)

A slightly bigger regression dataset, load_diabetes (442 samples, 10 features), can be used as well.
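
[Editor's note: the same idea with the diabetes data, which gives a single-target regression example.]

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

# 442 samples, 10 standardized features, one regression target.
X, y = load_diabetes(return_X_y=True)
print(Ridge(alpha=1.0).fit(X, y).score(X, y))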

@cmarmo (Member) commented Mar 8, 2021

This issue is a good candidate for finalization in 1.0. I will try to summarize here what is still missing, parsing the corresponding pull requests.

@ogrisel (Member) commented Mar 8, 2021

Thanks for the summary. For the last point, I would also be in favor of adding a permanent note in our documentation, as discussed in #18594, even if/after we deprecate load_boston, to document the problem, suggest alternative datasets, and educate our users about this issue.

@adrinjalali (Member)

I'm in favor of all your suggestions, @cmarmo.
