# Case Study - The Boston Housing Dataset

## Introduction
The Boston Housing dataset was collected for the paper [Hedonic housing prices and the demand for clean air](https://github.com/learn-co-curriculum/dsc-ethics-cs-boston-housing/blob/ad8c14841b5dddc3750e559e84bbfb777864e0c5/Hedonic-Housing-Prices-and-the-Demand-for-Clean-Air.pdf) (Harrison and Rubinfeld, 1976). According to the abstract, the paper "investigates the methodological problems associated with the use of housing market data to measure the willingness to pay for clean air." However, due to the mishandling of sensitive features, the data has been linked to reinforcing harmful biases.

In this case study, you will explore in detail why the Boston Housing dataset is problematic so that you can spot __sensitive features__ and handle them accordingly. You will also touch on __data integrity issues__, a topic that we will explore in further detail in future lessons.

## Learning Objectives
You will be able to:

* Download and read the Boston Housing Dataset from GitHub
* Identify sensitive features related to protected characteristics in the Boston Housing Dataset
* Identify data integrity issues in the Boston Housing Dataset
* Discuss the reasons why the Boston Housing Dataset is not an appropriate sample dataset for students

## Load the Boston Housing Data Set
Due to the issues you will learn about in this lab, the Boston Housing data set is no longer available in scikit-learn as a sample data set. However, you can still access it on GitHub. Run the code cells below to import the Boston Housing data set from GitHub.

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')

In [2]:
df

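| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |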


## Sensitive Features: "B"
The feature that has received the most scrutiny in the Boston Housing data set is the `B` feature (stored as the lowercase `b` column in the CSV loaded above). So why is the "B" feature such an issue?

### Sociological Issues
The feature's description uses dated language, referring to Black people as "blacks". It is also a questionable feature to include given that 1970s Boston had relatively few Black people, and it has an unusual quadratic transformation applied to it.

#### Implicit Bias 
The intention of the study's designers was to control for factors that might influence the price of a home in order to isolate the effect of clean air. **However, by including a measurement of the concentration of *exclusively Black people*, the study implicitly suggests that the *presence of Black people, and not racism,* is what negatively impacts home values.**

#### Discriminatory Proxies
Using race as a proxy *can* have a benefit in the proper context. For example, if you wanted to improve equity in education, it might be helpful to understand the relationship between different racial populations and the quality of the education offered in their area. However, without proper context, some features can be interpreted in discriminatory or harmful ways.

### Data Integrity Issues
Beyond the potential negative social impacts of using a compromised dataset for analysis, there are numerous data integrity issues that arise which make the data unreliable and unfit for use.

#### Destructive Transformation Methods
When conducting research, it is important to retain the original data when making transformations to avoid data loss by __destructive transformation methods__.

The "B" field is rendered uninterpretable because it was produced by a quadratic transformation that cannot be reversed, and the original data was not retained. We can see how this transformation was defined from the original description of this feature in the metadata:

```
'1000(Bk - 0.63)² where Bk is the proportion of blacks(sic) by town'
```

Data scientist M Carlisle investigated that quadratic transformation in [this Medium post](https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8) and found that it destroyed some data (because it was a non-invertible transformation) and that it appears to be based on a since-disproven racist hypothesis of housing prices based on "self-segregation". This investigation led to the [deprecation](https://github.com/scikit-learn/scikit-learn/issues/16155) and eventual removal of the dataset from scikit-learn.
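
To make the non-invertibility concrete, here is a minimal sketch based only on the formula quoted in the metadata. The helper names `transform_b` and `candidate_proportions` are hypothetical and introduced purely for illustration; they are not part of any library or of the original study.

```python
import math

def transform_b(bk):
    """Apply the documented transformation 1000 * (Bk - 0.63)^2."""
    return 1000 * (bk - 0.63) ** 2

def candidate_proportions(b, tol=1e-9):
    """Return every proportion in [0, 1] that could have produced the value b."""
    root = math.sqrt(b / 1000)
    # A square has two roots, one on each side of 0.63.
    candidates = {0.63 - root, 0.63 + root}
    # Keep only roots that are valid proportions, allowing for float rounding.
    valid = [min(max(c, 0.0), 1.0) for c in candidates if -tol <= c <= 1 + tol]
    return sorted(valid)

# Two different proportions collapse onto (almost exactly) the same B value.
print(transform_b(0.53), transform_b(0.73))

# Reversing the transformation therefore yields two candidates, not one.
print(candidate_proportions(10.0))
```

Under this formula, a transformed value greater than 0 and up to roughly 136.9 (that is, 1000 × 0.37²) is consistent with two different proportions, so the original proportion cannot be recovered from `B` alone.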


#### Data Limitations
Sometimes a dataset is unusable because __the data itself is outdated__ or there were __limitations in the collection methods__. 

Data scientist [Martina Cantaro](https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8) has noted that the "B" feature is not the only problem with the Boston Housing dataset. The prices and geographic regions are outdated, there are [major mistakes in the target column](https://spatial-statistics.com/pace_manuscripts/jeem_ms_dir/pdf/fin_jeem.pdf), and some values are artificially capped at &#36;50k. The original paper's analysis has also not stood up to recent [revalidation efforts](https://openjournals.wu.ac.at/region/paper_107/107.html).
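
The capping described above can be checked directly. The sketch below assumes that `medv` is recorded in thousands of dollars, so the cap appears as a value of 50.0, and that a cap shows up as an unusually large share of rows sitting exactly at the column maximum. `share_at_maximum` is a hypothetical helper written for this lesson; it accepts any sequence of numbers, including a pandas column such as `df['medv']`.

```python
def share_at_maximum(values):
    """Return (maximum, fraction of values exactly equal to the maximum)."""
    values = list(values)
    top = max(values)
    at_top = sum(1 for v in values if v == top)
    return top, at_top / len(values)

# Tiny made-up sample (not real rows) in which three of six values sit at the cap.
sample = [24.0, 21.6, 50.0, 50.0, 33.4, 50.0]
top, share = share_at_maximum(sample)
print(f"maximum={top}, share at maximum={share:.0%}")
```

With the DataFrame from this lab, the call would be `share_at_maximum(df['medv'])`. A noticeable cluster at the maximum suggests censored values rather than true prices.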

For all of these reasons, data professionals and educators such as [Colleen Crangle](https://www.linkedin.com/pulse/its-time-retire-boston-housing-dataset-colleen-e-crangle/) have said "It’s time to retire the Boston Housing data set". Nowadays the preferred housing prices dataset for educational purposes is the [Ames Housing dataset](https://www.kaggle.com/datasets/prevek18/ames-housing-dataset), which was published with the paper [*Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project*](http://jse.amstat.org/v19n3/decock.pdf). This dataset is larger and messier than the Boston Housing dataset and also avoids some of its thorny ethical issues.

As Crangle wrote:

> So this outdated data set, while not meaning to racially profile neighborhoods, leads to racist interpretations of the data — especially when the data set is not put into its proper historical context in data science courses.

An analysis that uses the Boston Housing dataset, even in an educational context, might thus lead to unintentionally discriminatory outcomes, or in other words, _disparate impact_. Therefore, even while you are still learning the basics of data analysis, it is important to recognize these kinds of potential impacts.


### Your Turn: Identifying Potential Discriminatory Proxies
1. Run the code cell below to view the columns in the Boston Housing data set. 

In [3]:
df.columns

Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'medv'],
      dtype='object')

2. Next, take a look at the image below, which is an extract from the 1976 paper that used the Boston Housing data set. Can you identify any features that, juxtaposed with race, might create a fallacious argument for racist ideas? Type your response below the image.

<div>
    <center>
<table><tr><td>
<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/images/ethics/lab1/boston-1.png" alt="Extract from the 1976 paper that used the Boston Housing data set, describing its features." style="width: 700px;"/>
</td></tr></table>
    </center>
</div>

### Potentially Discriminatory Proxies
* B (neighborhood variable): implies that an increase in the proportion of Black people in a neighborhood's population negatively affects housing values.
* LSTAT: proportion of the population that is lower status, based on the proportion of adults without a high school education and the proportion of male workers classified as laborers; race can be linked to education levels and socioeconomic status.
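
One way to act on this list is to record the flagged columns explicitly and keep them out of the default modeling features until they have been reviewed. The sketch below is an assumption about how such a review step might look, not a prescribed method: the `FLAGGED` dictionary and the `split_columns` helper are hypothetical, and the reasons simply restate the bullets above. Setting a column aside does not remove other proxies for it that may remain in the data.

```python
# Columns flagged in this lesson, with the reason each one needs review.
FLAGGED = {
    "b": "transformed proportion of Black residents by town",
    "lstat": "lower-status share of the population; can act as a proxy for race",
}

def split_columns(columns, flagged=FLAGGED):
    """Partition column names into (needs_review, remaining), preserving order."""
    needs_review = [c for c in columns if c.lower() in flagged]
    remaining = [c for c in columns if c.lower() not in flagged]
    return needs_review, remaining

# Column names as printed by df.columns earlier in this lab.
columns = ["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis",
           "rad", "tax", "ptratio", "b", "lstat", "medv"]
needs_review, remaining = split_columns(columns)
print("Needs review:", needs_review)
print("Remaining:", remaining)
```

With pandas, `df[remaining]` would then select only the unflagged columns, while `needs_review` documents what was held back and why.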

## Summary
While the intentions of the study designers may not have been racially motivated, they were not sensitive to the potential detrimental consequences of the features they selected for their study. They failed to use caution when including the "B" feature, which led to the misappropriation of the data set over the course of its life as an educational tool. They did not consider how the set of features, without the appropriate context, could reinforce negative stereotypes about Black communities.