# Data Science Ethics - Sensitive Features Case Study

## Introduction
In this lab, you will explore the nitty-gritty about why the Boston Housing data set is so problematic so you can spot __sensitive features__ and handle them accordingly. You will also touch on __data integrity issues__, a topic that we will explore in further detail in future lessons. 

## Learning Objectives
You will be able to:

* Download and read the Boston Housing data set from GitHub
* Identify and describe common data integrity issues
* Identify and describe sensitive features and possible negative impacts
* Discuss the reasons why the Boston Housing Data set is not an appropriate sample data set

## Load the Boston Housing Data Set
Because of the issues you will learn about in this lab, the Boston Housing data set is no longer available in scikit-learn as a sample data set (`load_boston` was removed in version 1.2). However, you can still access the data set on GitHub. Run the code cells below to import the Boston Housing data set from GitHub.

```python
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
```

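```python
df
```

| Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |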
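
Before interpreting any feature, it can help to run a few quick integrity checks on the loaded data. The sketch below is one possible starting point, not a complete audit. It uses only the first five rows shown above so that it runs without a network connection, and the `integrity_report` helper and the `SAMPLE_CSV` constant are hypothetical names introduced for illustration.

```python
import io

import pandas as pd

# The first five rows as displayed above (assumption: copied by hand for an offline demo).
SAMPLE_CSV = """Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
"""


def integrity_report(frame: pd.DataFrame) -> dict:
    """Summarize a few basic integrity signals for a DataFrame (hypothetical helper)."""
    return {
        "rows": len(frame),
        # Total number of missing cells across all columns.
        "missing_values": int(frame.isna().sum().sum()),
        # Rows that are exact copies of an earlier row.
        "duplicate_rows": int(frame.duplicated().sum()),
        # Columns named "Unnamed: N" often mean an index was saved into the CSV.
        "unnamed_columns": [c for c in frame.columns if str(c).startswith("Unnamed")],
    }


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
report = integrity_report(sample)
print(report)
```

To check the full data set, you could call `integrity_report(df)` on the DataFrame loaded from GitHub instead of `sample`.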


## Sensitive Features: "B"
The feature that has received the most scrutiny with respect to the Boston Housing Data set is the `B` feature. So why is the "B" feature such an issue?

One clue as to the reason why this feature is so problematic can be gleaned from the original description of this feature in the meta-data:
```
'1000(Bk - 0.63)² where Bk is the proportion of blacks (sic) by town'
```
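
To make the formula concrete, here is a minimal sketch that applies it to a few example proportions. The function name `b_feature` and the input values are assumptions chosen for illustration; only the formula itself comes from the metadata quoted above. Because the expression is a parabola centered at 0.63, two different proportions can map to (nearly) the same `B` value, so the original proportion cannot always be recovered from `B` alone.

```python
import math


def b_feature(bk: float) -> float:
    """Apply the documented transform 1000 * (Bk - 0.63)^2 to a proportion Bk."""
    return 1000 * (bk - 0.63) ** 2


# Two illustrative proportions, equally far from 0.63 on either side.
b_low = b_feature(0.53)
b_high = b_feature(0.73)

# A proportion of zero gives the largest value the formula can produce for Bk <= 0.63.
b_zero = b_feature(0.0)

print(f"Bk=0.53 -> B={b_low:.2f}")
print(f"Bk=0.73 -> B={b_high:.2f}")
print(f"Bk=0.00 -> B={b_zero:.2f}")
print("Same B for different Bk:", math.isclose(b_low, b_high))
```

Note that the value for a proportion of zero matches the `396.90` that appears repeatedly in the `b` column of the data shown earlier.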

### Sociological Issues
Beyond the cringe-worthy and dated language, from a sociological perspective there are several problems.

#### Implicit Bias 
The intention of the study's designers was to control for factors that might influence the price of a home in order to isolate the effect of clean air. **However, by including a measurement of the concentration of *exclusively black people*, the study implicitly suggests that the *presence of black people and not racism* is what negatively impacts home values.**

#### Discriminatory Proxies
Using race as a proxy *can* have a benefit in the proper context. For example, if you wanted to improve equity in education, it might be helpful to understand the relationship between different racial populations and the quality of the education offered in their area. However, without proper context, some features can be interpreted in discriminatory or harmful ways.

### Statistical Issues
In addition to the harmful implications of the manner 

### Your Turn: Identifying Potential Discriminatory Proxies
1. Run the code cell below to view the columns in the Boston Housing data set. 

```python
df.columns
```

```
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'medv'],
      dtype='object')
```

2. Next, take a look at the image below, which is an extract from the 1976 paper that used the Boston Housing data set. Can you identify any features that, when juxtaposed with race, might create a fallacious argument for racist ideas? Type your response below the image.

<div>
    <center>
<table><tr><td>
<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/images/ethics/lab1/boston-1.png" alt="Extract from the paper that used the Boston Housing data set." style="width: 700px;"/>
</td></tr></table>
    </center>
</div>
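
When looking for proxies, it can help to keep a plain-language description next to each column name. The sketch below builds such a lookup as a pandas Series. The name `COLUMN_DESCRIPTIONS` is hypothetical, and the descriptions are paraphrased from commonly circulated documentation of this data set rather than taken from this lab, so treat them as assumptions and verify them against the paper extract.

```python
import pandas as pd

# Assumed descriptions, keyed by the lowercase column names returned by df.columns.
COLUMN_DESCRIPTIONS = {
    "crim": "per capita crime rate by town",
    "zn": "proportion of residential land zoned for lots over 25,000 sq. ft.",
    "indus": "proportion of non-retail business acres per town",
    "chas": "Charles River dummy variable (1 if tract bounds river, else 0)",
    "nox": "nitric oxides concentration (parts per 10 million)",
    "rm": "average number of rooms per dwelling",
    "age": "proportion of owner-occupied units built prior to 1940",
    "dis": "weighted distances to five Boston employment centres",
    "rad": "index of accessibility to radial highways",
    "tax": "full-value property-tax rate per $10,000",
    "ptratio": "pupil-teacher ratio by town",
    "b": "1000(Bk - 0.63)^2, where Bk is the proportion of black residents by town",
    "lstat": "percentage of the population classed as lower status",
    "medv": "median value of owner-occupied homes in $1000s",
}

# A Series keeps the insertion order, so it lines up with the column order.
descriptions = pd.Series(COLUMN_DESCRIPTIONS, name="description")
print(descriptions.to_string())
```

Reading the descriptions side by side makes it easier to ask, for each feature, what a reader might wrongly infer if it were presented next to `b` without context.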

### Potentially Discriminatory Proxies
* Proxy 1
* Proxy 2

## Summary
While the intentions of the study designers may not have been racially motivated, they were not sensitive to the potential detrimental consequences of the features they selected for their study. They failed to use caution when including the "B" feature, which led to the misuse of the data set over the course of its life as an educational tool. They did not consider how the set of features, without the appropriate context, could reinforce negative stereotypes about black communities.