B = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town #16155
Thanks for raising this issue and for providing the reference with additional details. +1 to find a different housing dataset that wouldn't contain such variables or assumptions and substitute it in examples. Haven't looked for datasets we could use as a replacement yet. |
Yes, let's
|
Hi, I wrote an article on this dataset and found the same problem with the language. I didn't address the assumptions behind the existence of the column because I'm not as intimate with the dynamics of race and gentrification in the US, but I can see how it can be problematic as well. Can we at least change "blacks" to "black people"? |
The problem noted on that article on Medium is that the column doesn't
indicate proportion of black people. It indicates that the proportion of
black people is either low or high - in other words, how far from some
ideal proportion it is - representing prejudices about the effect of race
on pricing. Renaming it is insufficient. Removing it may be acceptable.
|
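The non-monotonicity described above is easy to verify directly. A minimal sketch in plain Python (the function name is mine, not from any library): proportions equidistant from the 0.63 "target" collapse to nearly the same value, so the raw proportion cannot be recovered from B alone.

```python
import math

def b_transform(bk: float) -> float:
    """B = 1000 * (Bk - 0.63)^2, the published transform for the column."""
    return 1000 * (bk - 0.63) ** 2

# A low and a high proportion, equidistant from 0.63, map to ~the same B.
low, high = 0.33, 0.93
assert math.isclose(b_transform(low), b_transform(high))  # both ~90.0
```

This is why renaming the column cannot fix it: the encoding itself bakes in a "distance from an ideal proportion" assumption.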
Hi, I'm the author of the post @martinacantaro and @jrsykes are referring to above. While I second @martinacantaro on removing the column, I also suggest reconstructing the original tract data; this process should be documented and submitted with the newly constructed dataset and the original paper. |
The fact that we distribute this dataset as such is indeed a problem, because one could assume we think that casually making such assumptions about the segregationist propensity of house buyers is fine. Here are some possible ways to deal with this:
a- keep the data as is but add a warning in the documentation stating that this variable makes problematic assumptions, and that using this dataset without questioning those assumptions will likely be considered some kind of implicit endorsement of a racist worldview.
The problem with c and d is that we will break tutorials and educational resources written by others, including tutorials that aim at educating machine learning practitioners on fairness-related issues. For instance, https://scikit-lego.readthedocs.io/en/latest/fairness.html uses scikit-learn's load_boston loader to illustrate the impact of the B variable on a "fairness proxy". I am not familiar enough with the literature to say whether the analysis and method proposed in this particular tutorial are valid, but we should probably not prevent others from studying those issues.
I think I would be in favor of a mix of proposals e and a, along with a
Edit: s/not dropped/not included/ |
I agree that the common use of this dataset is a mild reason to change
slowly, but then it deserves a bit more visibility than the change to
documentation.
Trying to think of the right warning class: EthicalWarning?
HumanityWarning? PastWarning?
"This dataset embeds the highly problematic assumption that racial
segregation is valuable. It will be removed in a future version and should
be used with care. See the documentation for more detail."
|
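One way to surface the proposed message without inventing a new warning class is the standard FutureWarning. A hypothetical sketch, assuming nothing about scikit-learn's actual internals (the function name and wiring are illustrative only):

```python
import warnings

# The message text proposed in the comment above.
_BOSTON_MSG = (
    "This dataset embeds the highly problematic assumption that racial "
    "segregation is valuable. It will be removed in a future version and "
    "should be used with care. See the documentation for more detail."
)

def load_boston_with_warning():
    # Emit a standard FutureWarning rather than a custom class such as
    # EthicalWarning; stacklevel=2 points the warning at the caller.
    warnings.warn(_BOSTON_MSG, FutureWarning, stacklevel=2)
    # ... the actual data loading would follow here ...
```

Using a built-in category keeps the warning filterable with the usual `warnings` machinery that users already know.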
OK, but I would prefer not to completely remove the variable, but instead not load it by default unless the user really needs it, in which case they can pass a specific param to
Those naming suggestions feel weird to me. +0 for WDYT? |
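The opt-in idea could look something like the following hypothetical sketch. The loader name, parameter name, and abbreviated feature list are all illustrative, not real scikit-learn API:

```python
# Hypothetical sketch: drop the column by default and return it only on
# explicit request, so using it is a deliberate, documented choice.
def load_housing(include_racial_segregation_variable: bool = False):
    feature_names = ["CRIM", "ZN", "AGE", "TAX"]  # abbreviated feature list
    if include_racial_segregation_variable:
        # The verbose parameter name doubles as documentation of the concern.
        feature_names.append("B")
    return feature_names
```

With an explicit name like this, the call site itself records that the user opted in, which arguably replaces the need for a runtime warning.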
I was only joking with those names to be honest. Maybe it's not something
to joke about. It's just not what you expect to have to warn about when
developing software.
|
With a name like include_racial_segregation_variable, yes, I suppose the
warning can disappear.
|
Personally and professionally, I'm enamored with the notion of introducing something like It is clear (to me, at least) that, at this point in technological history, with the ubiquity of large common-use datasets collected under statistically under-controlled, possibly politically charged, and certainly financially motivated methodologies, those using said datasets in the present and future should be aware of such datasets' origins and potential concerns before using them. The forever questions are,
Does scikit-learn, as an enterprise, feel the need to dip their toes into these data ethics questions, and if so, to what extent? Or, does the project wish to remain under the impression that the tools they provide and maintain have status |
I think people have generally used the Ames housing data instead. |
I would favor removing the dataset, potentially with a longer deprecation cycle and replacing it entirely. I don't think we need to introduce a new warning type but we can be explicit about the cause of the removal. |
If there is a standard alternative dataset, let's go with that.
|
+1 |
One note: it's not really a good 1:1 replacement as it includes missing values and lots of categorical features. It's a good replacement in terms of semantics and an interesting dataset to play with, but it depends a bit what we want from the dataset. |
As an introductory regression dataset, missing values may be adding
unwanted complexity...
|
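For reference, the Ames data discussed here can be pulled from OpenML via scikit-learn's `fetch_openml`. A hedged sketch, assuming the OpenML dataset name "house_prices" (the wrapper function and its name are mine, not scikit-learn API):

```python
from sklearn.datasets import fetch_openml

def load_ames(as_frame: bool = True):
    """Fetch the Ames housing data from OpenML (downloads on first call).

    As noted above, the data has missing values and many categorical
    columns, so it is richer (and messier) than the Boston data it
    would replace.
    """
    return fetch_openml(name="house_prices", as_frame=as_frame)
```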
I would like to work on that issue |
So it seems that we will go for removal, and we can start deprecating the function. We can also mention |
As far as I understand there was a decision for removal during the last dev meeting. We need to decide which datasets we use for replacement in,
|
For information, the list of examples that need updating:
|
An alternative embedded dataset for non-synthetic toy regression in scikit-learn is the Linnerud dataset from This dataset is probably more than enough for illustration purposes in the docstrings of regression estimators. |
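A quick sanity check of the Linnerud loader, which ships inside scikit-learn (no download required):

```python
from sklearn.datasets import load_linnerud

# Bundled dataset: 20 samples, 3 exercise features, 3 physiological targets.
linnerud = load_linnerud()
print(linnerud.data.shape, linnerud.target.shape)  # (20, 3) (20, 3)
```

Twenty samples is tiny, but sufficient for docstring examples that only need to demonstrate an estimator's API.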
A slightly bigger dataset for regression, |
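The comment above is truncated and does not say which larger dataset was meant. One bundled candidate (my suggestion, not necessarily the author's) is the diabetes dataset, which also needs no download:

```python
from sklearn.datasets import load_diabetes

# Bundled regression dataset: 442 samples, 10 standardized numeric features.
X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)  # (442, 10) (442,)
```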
This issue is a good candidate for finalization in 1.0. I will try to summarize here what is still missing, parsing the corresponding pull requests.
|
Thanks for the summary. For the last point, I would also be in favor of adding a permanent note in our documentation as discussed #18594 even if/after we deprecate |
I'm in favor of all your suggestions @cmarmo |
While using the boston_housing data set, a data set hosted by the Scikit-learn package and used to demo models on house price prediction, I came across a feature titled 'B'. This struck me as odd because all other features had been given descriptive names such as 'AGE' or 'TAX'. It turns out that B = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
I naively assumed, as this data was being hosted by a prestigious package, that these data were in the data set because they offer significant explanatory value, which would point to a strongly pervasive racist mentality in the population at the time. However, after reading the blog post attached below, it appears as though the data in the B feature of the Boston housing data set were manufactured in an attempt to encourage segregation of the races. If true, this would be strong evidence of systemic institutional racism, and by continuing to use this fraudulent data we would be perpetuating the effect desired by the author.
I hope you will agree that we would be doing the scientific literature a service by investigating this issue further and ultimately consigning this data to historic reference archives, rather than encouraging its use in modern research by hosting it.
I look forward to your response,
Jamie R. Sykes
https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8