# Fairness

Sometimes we create machine learning models to see if an image is of a dog
or models to score the cuteness of said dog.
The predictions that these models produce are largely benign and have limited impact.
But that will not always be true.
As you progress as a machine learning professional,
the time may come when you create models that can have serious impact on people's lives
(and not always in expected ways).

In this exercise, we start to explore the concept of fairness and bias in machine learning.
We will see an example of real-world data that is biased,
an example of synthetic data (inspired by real data) that is more subtly biased,
and the complexity of a real-world machine learning controversy.

This example will touch on the following topics:
 - Cleaning Real-World Data
 - Inspecting a Classifier's Most Relevant Features
 - Columns that show a Direct Bias (Protected Attributes)
 - Columns that show an Indirect Bias
 - Columns that show a Distant Bias (Proxy Attributes)
 - Fairness Metrics

## Census Data

In this example, we will be using real-world data collected from the 1994 US Census database.
It is called the [UCI Adult](https://archive.ics.uci.edu/ml/datasets/Adult) (or UCI Census) dataset,
and is one of the most well-known and widely used datasets in the machine learning fairness community.
It contains information about people and whether or not they make more than $50K a year (in 1994 dollars).
The data has already been cleaned by the original authors, but still has some issues we will need to address.
(So note that the data may look all clean in this nice neat notebook,
but remember that there was someone (me) spending a few hours experimenting on (and cursing) the data to make sure it is clean (enough)).

For the purposes of this example, a single column (`zipcode`) has been added to the dataset.
This column has been synthetically generated (i.e. does not represent the actual zip code of the people represented in the data),
but has been generated based on a real-life pattern that will be discussed later in this example.
No other modifications have been made to this dataset.

Technically, the classification target of this data is predicting if the person earns more than $50K a year.
However, you can easily see how this can be used as a proxy for other tasks.
Historically, some related tasks of interest are approval for loans (business, home, and automobile) and approval for credit cards.
All of these financial tasks can have real impact for a person and the result of a classifier can seriously change a person's life (like getting a loan for a business or house).
It would not be a stretch for a future boss/manager to come to you (the resident machine learning expert) and say something like:
"Here we have this public census data,
see if you can find a way to predict if someone will earn enough in the future to be a safe bet for a loan."
So as we go through this example, imagine that our actual prediction target is approval for a loan.

Let's start as we often do in machine learning, with loading our data.

In [None]:
import math
import os

import matplotlib.pyplot
import pandas
import sklearn.linear_model
import sklearn.metrics
import sklearn.preprocessing

# Note that our data is actually zipped up, but pandas
# (as well as Python and most languages/libraries) can easily handle that.
DATA_DIR = 'uci-census'
TRAIN_PATH = os.path.join(DATA_DIR, 'adult.data.gz')
TEST_PATH = os.path.join(DATA_DIR, 'adult.test.gz')

# The column names are not in the actual data files,
# but instead are listed in the description of the data.
COLUMN_NAMES = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'zipcode',
    # This column doesn't actually have a name, we just had to make one up.
    'label',
]

# Load the training data.
raw_train_data = pandas.read_csv(
    TRAIN_PATH,
    # Note that the column separator is not a comma (which it should be),
    # instead it is a comma and a space (', ').
    # Setting `sep` to more than one character also means we should set `engine`,
    # or we will get a warning (try it out).
    sep = ', ', engine = 'python',
    # The column names are not in the data file.
    header = None, names = COLUMN_NAMES,
)

# Load the test data.
raw_test_data = pandas.read_csv(
    TEST_PATH,
    sep = ', ', engine = 'python',
    header = None, names = COLUMN_NAMES,
    # The first row in the test file is a header (but not column names).
    skiprows = 1,
)

# For some stupid (and really frustrating) reason, the test data rows all end with a period.
raw_test_data['label'] = raw_test_data['label'].str.rstrip('.')

# Make sure that the zip code is treated as a string and not a number.
raw_train_data['zipcode'] = raw_train_data['zipcode'].astype(str)
raw_test_data['zipcode'] = raw_test_data['zipcode'].astype(str)

raw_train_data.info()

We loaded our data, and now we need to do some general cleaning and encoding.
We will do this in a function so we can invoke it multiple times later in this example.
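
One reason to clean the train and test data together is that one-hot encoding each split on its own can produce different columns when a category appears in only one split.
Here is a minimal sketch of that problem using made-up toy frames
(the names `toy_train` and `toy_test` and their values are illustrative assumptions, not part of the census data):

```python
import pandas

# Hypothetical toy data: 'Never-worked' only shows up in the test split.
toy_train = pandas.DataFrame({'workclass': ['Private', 'State-gov', 'Private']})
toy_test = pandas.DataFrame({'workclass': ['Private', 'Never-worked']})

# Encoding each split separately gives mismatched columns.
separate_train = pandas.get_dummies(toy_train)
separate_test = pandas.get_dummies(toy_test)
print(sorted(separate_train.columns))
print(sorted(separate_test.columns))

# Encoding the combined frame and then re-splitting keeps the columns identical.
combined = pandas.get_dummies(pandas.concat([toy_train, toy_test], ignore_index = True))
combined_train = combined[:len(toy_train)]
combined_test = combined[len(toy_train):]
print(list(combined_train.columns) == list(combined_test.columns))
```

A classifier trained on the separately encoded training frame would not know what to do with the extra test column, which is why the cleaning function works on the combined frame.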

In [None]:
# Create a function for standard cleaning on the census data.
# We will need to invoke this multiple times later.
def clean_census_data(train_data, test_data):
    # Temporarily combine the train and test data to ensure all cleaning and encoding is consistent.
    frame = pandas.concat([train_data, test_data], ignore_index = True)
    
    # Make the label a boolean.
    frame['label'] = frame['label'] == '>50K'

    # Remove the label from the features.
    labels = frame.pop('label')
    
    # Encode categorical columns.
    # By default, pandas does not encode numeric columns.
    frame = pandas.get_dummies(frame)
    
    # Scale numerical columns.
    transformer = sklearn.preprocessing.StandardScaler()
    
    # We already one-hot encoded, but further normalizing the columns is a trick to help converge.
    columns = list(frame.columns)
    frame[columns] = transformer.fit_transform(frame[columns])
    
    # Split the data back into train/test.
    # Note that we have been careful not to change the order of the data,
    # so re-splitting is easy.
    
    train_features = frame[:len(train_data)]
    train_labels = labels[:len(train_data)]
    
    test_features = frame[len(train_data):]
    test_labels = labels[len(train_data):]

    return train_features, train_labels, test_features, test_labels

train_features, train_labels, test_features, test_labels = clean_census_data(raw_train_data,
                                                                             raw_test_data)

train_features.describe()

Now that we have some clean and split data, let's see how a basic classifier will perform without any tweaks.

In [None]:
classifier = sklearn.linear_model.LogisticRegression(random_state = 4)
classifier.fit(train_features, train_labels)
score = classifier.score(test_features, test_labels)

print("Score: ", score)

85% is a pretty good score to start with.
We could surely do better with some tuning,
but let's start with something else.
What are the important attributes/features for our classifier?

We specifically chose a linear classifier (logistic regression in this case),
so that it is very simple to see what the most important features are.
We just have to look for the features with the highest and lowest weights
(because we standardized every column, the weights are on a comparable scale).

Let's make a function for checking the important features
(because we will probably be doing this a lot).

In [None]:
def show_important_features(classifier,
                            train_features, train_labels, test_features, test_labels,
                            count = 10):
    classifier.fit(train_features, train_labels)
    score = classifier.score(test_features, test_labels)

    print("Score: %5.4f" % (score))

    # Get the weights for each feature.
    # Note that we are putting the weight and feature name in a tuple
    # (with the weight first) so that we can sort them by weight easily.
    feature_scores = []
    for i in range(classifier.n_features_in_):
        feature_scores.append((classifier.coef_[0][i], classifier.feature_names_in_[i]))
    
    feature_scores = list(sorted(feature_scores, reverse = True))
    print("Top Positive Features")
    for i in range(count):
        print("    % 4.2f -- %s" % feature_scores[i])
    
    feature_scores = list(sorted(feature_scores, reverse = False))
    print("Top Negative Features")
    for i in range(count):
        print("    % 4.2f -- %s" % feature_scores[i])

Now let's see what our important features actually are.

In [None]:
classifier = sklearn.linear_model.LogisticRegression(random_state = 4)
show_important_features(classifier, train_features, train_labels, test_features, test_labels)

Okay, we can see some stuff that makes sense (and some stuff that seems really troublesome).

For the top three positive features, we can see some stuff that really makes sense:
 - capital gains -- I feel like someone who is making capital gains on stocks will probably be making good money.
 - Married -- Sure. I think I remember hearing somewhere that married people tend to make more (especially if the income is actually household income).
 - Education -- Okay, people with higher levels of education tend to make more money.

Some of the top negative features make sense:
 - Not Married -- Maybe the opposite of the married we saw in the positive features.
 - Preschool Education -- People who have not had an opportunity to go to school probably have fewer opportunities to make money.
 - Zip Code -- Maybe that is a poorer area?

But then there are also ones that seem REALLY bad to have in our model,
first among them being "sex_Female".
It would not just be morally wrong to build a model that predicts lower income (and rejects loan applications) because an applicant is female,
it is also illegal in the US (and you can bet this would be super illegal under EU regulations).
The [Equal Credit Opportunity Act (ECOA)](https://www.ftc.gov/legal-library/browse/statutes/equal-credit-opportunity-act)
> ... prohibits discrimination on the basis of race, color, religion, national origin, sex, marital status, age, receipt of public assistance, or good faith exercise of any rights under the Consumer Credit Protection Act.

Also note that the Consumer Financial Protection Bureau (CFPB) has
[clarified that sexual orientation and gender identity are also covered by the ECOA](https://www.consumerfinance.gov/about-us/newsroom/cfpb-clarifies-discrimination-by-lenders-on-basis-of-sexual-orientation-and-gender-identity-is-illegal/).
Keep in mind that this data is from 1994, so it uses outdated concepts like sex as a proxy for gender.

When we have features in a model that we cannot use morally or legally,
we call these features "protected attributes".
Let's go ahead and remove all those protected attributes from our data.

In [None]:
train_data = raw_train_data.copy()
test_data = raw_test_data.copy()

REMOVE_COLUMNS = [
    'age',
    'marital-status',
    'race',
    'sex',
    'native-country',
]

train_data = train_data.drop(columns = REMOVE_COLUMNS)
test_data = test_data.drop(columns = REMOVE_COLUMNS)

train_features, train_labels, test_features, test_labels = clean_census_data(train_data, test_data)

train_features.describe()

Now how does our model perform (and what are our new important features) now that we have removed the protected attributes?

In [None]:
classifier = sklearn.linear_model.LogisticRegression(random_state = 4)
show_important_features(classifier, train_features, train_labels, test_features, test_labels)

Looks like our model performs about the same without the biased columns.
Note that the accuracy is slightly lower, but we are also not doing any cross-validation so we will treat the results a little looser.

In some machine learning scenarios (usually not the ones you will see in this class),
we will even sometimes see better results for models that are more fair.
This is because enforcing fairness in a model can make a model more robust,
and robust models tend to generalize better and perform better on unseen data.

Of course, we no longer see the protected attributes that we removed.
However, we do see columns like "relationship_Husband" and "relationship_Wife".
Of course, this violates the marital status provision in the ECOA, but these columns also indirectly violate the sex provision.
Let's look more closely at the values in this column.

In [None]:
train_data['relationship'].unique()

Seems like this is a column that should be removed.
But **just for teaching purposes**, let's do something else instead.
Let's pretend that marital status is not a protected attribute.
If marital status is not protected, then the problem with the "relationship" column is that it exposes information about a protected attribute (sex).
In this way, we are leaking protected information into our model.
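
One rough way to check whether a column leaks a protected attribute is to cross-tabulate the two.
The sketch below uses a small made-up frame (the rows are hypothetical, not real census records) and relies on `pandas.crosstab` with `normalize = 'index'`:

```python
import pandas

# Hypothetical toy records (not the real census rows).
toy = pandas.DataFrame({
    'relationship': ['Husband', 'Wife', 'Husband', 'Own-child', 'Wife', 'Own-child'],
    'sex': ['Male', 'Female', 'Male', 'Female', 'Female', 'Male'],
})

# Each row shows how the sexes are distributed within one relationship value.
# A row that is all (or nearly all) one sex means that value reveals sex.
leak_table = pandas.crosstab(toy['relationship'], toy['sex'], normalize = 'index')
print(leak_table)
```

In this toy data, the 'Husband' and 'Wife' rows are entirely one sex, while 'Own-child' is evenly split and tells us nothing about sex.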

Usually, we just get rid of anything that has or leaks protected information.
But sometimes, if we are lucky, we may be able to remove the protected information without removing all the information in the column.
We call this "debiasing" our data.

What if we merge the 'Husband' and 'Wife' values into a single 'Spouse' value?
Then we can remove the sex information without losing the other information from this attribute.

In [None]:
# Replace 'Husband' and 'Wife' in the 'relationship' column with 'Spouse'.
train_data.loc[train_data['relationship'].isin(['Husband', 'Wife']), 'relationship'] = 'Spouse'
test_data.loc[test_data['relationship'].isin(['Husband', 'Wife']), 'relationship'] = 'Spouse'

train_features, train_labels, test_features, test_labels = clean_census_data(train_data, test_data)

# Look at the data (note the updated relationship column).
train_data

Now how does our classifier perform with the new debiased "relationship" column?

In [None]:
classifier = sklearn.linear_model.LogisticRegression(random_state = 4)
show_important_features(classifier, train_features, train_labels, test_features, test_labels)

Great, we merged the offending values and still do about the same.
We even still see the debiased column in our top positive features.
This means that the column is useful even when sex information is removed
(e.g., the biased/protected information is not necessary for the model, and the debiased information is useful).

So, are we now done?
Did we remove all the bias in our dataset and make it super fair?
You can make some valid arguments about the fairness of using some of the information we still have (like the "relationship" column, which still reflects marital status),
but there is something else that really stands out.

If you look back to our most recent important features,
there is one zip code entry that is the top negative feature.
Note that this feature does not cover all zip codes. Since we one-hot encoded them,
it is just a single zip code (30940).
It's the only zip code we see in either the positive or negative features.
But this could make sense.
We know that there are affluent and impoverished areas.
It makes sense that a zip code could help predict how much someone earns
(e.g., you have to make more money to live in a rich area).

But is that all there is to this specific zip code?
Unfortunately, no.
There is something much more depressing going on here.
Let's look closer at the people in this zip code.

In [None]:
train_data[raw_train_data['zipcode'] == '30940']

We can see that they mostly have the negative label,
but we already know that the zip code is a good predictor of this label.
Aside from that, it's hard to see any other specific patterns.

But here we are looking at the cleaned data.
What if we go all the way back to the raw data?

In [None]:
raw_train_data[raw_train_data['zipcode'] == '30940']

It looks like all (or almost all) the people in this zip code are black.
Let's look closer.

In [None]:
raw_train_data[raw_train_data['zipcode'] == '30940']['race'].sort_values().hist()

Clearly, this zip code is predominantly black,
but this could just match the overall distribution by race for the entire dataset.
What if the census data was taken in a predominantly black city?

Let's look at the breakdown by race for the entire dataset.

In [None]:
raw_train_data['race'].sort_values().hist()

Wow.
This dataset is heavily weighted towards white participants.
But even with that weighting, the zip code we were looking at is still predominantly black.
And not only is the zip code predominantly black,
but it is also a strong indicator for lower income.

What about the other zip codes?
Do they also follow a similar pattern?
Let's plot a histogram by race for all our zip codes.

In [None]:
figure = matplotlib.pyplot.figure(figsize = [15, 15])
zipcodes = sorted(raw_train_data['zipcode'].unique())

# Loop through each zip code.
for i in range(len(zipcodes)):
    axis = figure.add_subplot(math.ceil(len(zipcodes) / 3), 3, i + 1)
    
    raw_train_data[raw_train_data['zipcode'] == zipcodes[i]]['race'].sort_values().hist(ax = axis)
    axis.set_title("zipcode = %s" % (zipcodes[i]))
                   
matplotlib.pyplot.show()

It looks like it is only that one zip code (30940) specifically that is biased.
All the others look about the same as the overall dataset.

What we discovered here is an attribute which has certain values that can be used as a proxy for a protected attribute.
Therefore, we have discovered that our model is still unfair because it is getting a signal from a protected attribute.
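
The eyeball check we did with histograms can also be sketched numerically:
compare the race distribution inside each zip code against the overall distribution and flag the zip codes with the largest gap.
This is only a sketch on made-up toy data (the zip codes, counts, and the choice of "largest absolute gap" as the score are all assumptions for illustration):

```python
import pandas

# Hypothetical toy records; the zip codes and counts are made up.
toy = pandas.DataFrame({
    'zipcode': ['00001'] * 4 + ['00002'] * 4 + ['00003'] * 4,
    'race': ['White', 'White', 'White', 'Black',
             'White', 'White', 'Black', 'White',
             'Black', 'Black', 'Black', 'White'],
})

# Share of each race over the whole (toy) dataset.
overall = toy['race'].value_counts(normalize = True)

# Share of each race within each zip code (each row sums to 1).
by_zip = pandas.crosstab(toy['zipcode'], toy['race'], normalize = 'index')

# Largest absolute gap between a zip code's distribution and the overall one.
gaps = by_zip.sub(overall, axis = 'columns').abs().max(axis = 'columns')
print(gaps.sort_values(ascending = False))
```

A zip code with a large gap is a candidate proxy value and deserves a closer look, just like 30940 did in our data.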

But why are we seeing this?
How is zip code predicting race?

What we see here is actually something we see all over the United States.
In certain places, specific zip codes can be a strong predictor for race because of a practice called [redlining](https://en.wikipedia.org/wiki/Redlining).
Redlining is a discriminatory practice that has been (and still is) used to keep people considered "undesirable" (a thinly coded word usually targeted towards black homeowners) out of areas with predominantly white residents.
This forced black potential homeowners into areas that were already predominantly black.
These areas were then starved of investment which affords less opportunity to the residents and less access to financial and social mobility.
Consider the cycle of redlining:
 1) You want to buy a house.
 2) You come from a redlined zip code, so the bank will not give you good rates for a home loan in a non-redlined area and realtors may not cooperate with you in non-redlined areas.
 3) You are forced to buy a house in a redlined zip code that is considered "low income".
 4) Because the area is "low income", people with money do not move there.
 5) Because of lack of investment, low property values (and therefore property taxes), and lack of outsiders moving in, the area has poor infrastructure like schools, grocery stores, and public transit.
 6) You want to start a business, but because you lived in a redlined zip code the bank denies your loan (maybe it was using the biased model we just built!).
 7) Your kids don't have the educational resources or community investment growing up and therefore have low financial and social mobility.
 8) Your kids now start back at #1.

Redlining is a form of quasi-legal segregation that was heavily practiced during Reconstruction and through the Civil Rights era,
and is still more subtly practiced today.

Because of systematic racism, our "zipcode" column is biased in a very subtle way.
And because we are using a biased column, our model is now biased and unfair.
Machine learning can be hard and it can affect real people.
We always need to do our best to find places that are unfair and work to make them more fair.

## COMPAS

In the US justice system, before you are sentenced, paroled, or the terms of your bail are set
(all steps that may result in incarceration)
a judge or committee is supposed to assess your risk for future crime (nonviolent or violent).
If you are considered low risk you may get favorable treatment (a lower sentence, early parole, a low bail),
but if you are considered high risk then you get the opposite treatment.
COMPAS is software that was developed to do this risk assessment instead of humans.

As a promising machine learner,
you should already be very cautious about a system that decides whether people will spend more time in jail.
(Especially since we are discussing it during a fairness example.)

The nonprofit organization [ProPublica](https://en.wikipedia.org/wiki/ProPublica)
did a [study on COMPAS](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) and found it to be racist.
ProPublica found that COMPAS was more likely to incorrectly label black defendants as high risk,
and more likely to incorrectly label white defendants as low risk.
In more precise terms, if COMPAS' predictive task is a binary classification where high risk is the positive label,
black defendants have a higher false positive rate (FPR)
while white defendants have a higher false negative rate (FNR).
Northpointe, the company that makes COMPAS, responded that the tool satisfies predictive parity: a high-risk label is about equally likely to be correct for both groups.
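
To see how both sides can be right at once, here is a minimal sketch with made-up outcomes for two hypothetical groups.
It reads predictive parity as "equal precision across groups" and uses a hypothetical helper named `group_rates`.
None of these numbers come from the actual COMPAS data.

```python
def group_rates(labels, predictions):
    """Return (FPR, FNR, precision) for one group, with True as the positive label."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for label, prediction in pairs if label and prediction)
    fp = sum(1 for label, prediction in pairs if not label and prediction)
    fn = sum(1 for label, prediction in pairs if label and not prediction)
    tn = sum(1 for label, prediction in pairs if not label and not prediction)
    return fp / (fp + tn), fn / (fn + tp), tp / (tp + fp)

# Hypothetical group A: 6 of 10 people have the positive label.
a_labels = [True] * 6 + [False] * 4
a_predictions = [True] * 4 + [False] * 2 + [True] * 2 + [False] * 2

# Hypothetical group B: 4 of 10 people have the positive label.
b_labels = [True] * 4 + [False] * 6
b_predictions = [True] * 2 + [False] * 2 + [True] * 1 + [False] * 5

a_fpr, a_fnr, a_precision = group_rates(a_labels, a_predictions)
b_fpr, b_fnr, b_precision = group_rates(b_labels, b_predictions)

print("Group A -- FPR: %5.4f, FNR: %5.4f, Precision: %5.4f" % (a_fpr, a_fnr, a_precision))
print("Group B -- FPR: %5.4f, FNR: %5.4f, Precision: %5.4f" % (b_fpr, b_fnr, b_precision))
```

In this toy setup the precision is identical for both groups, yet group A has the higher FPR and group B has the higher FNR.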

What we see here is not so much an argument about the data (or even the model),
but really an argument about what metrics are most appropriate for this situation.
We have already talked about evaluation metrics and how there are [dozens of them](https://scikit-learn.org/stable/modules/model_evaluation.html).
But here we are talking about [fairness metrics](https://en.wikipedia.org/wiki/Fairness_(machine_learning)#Mathematical_formulation_of_group_fairness_definitions)
(measures of how fair a machine learning model is).
Like evaluation metrics, there are many different fairness metrics.
In fact, if we consider fairness to be defined by a metric, then there are at least 14 different definitions of fairness!
To make it even more difficult, it is mathematically impossible to guarantee that all fairness metrics can be satisfied at the same time.
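
One concrete instance of this tension can be sketched with a little arithmetic.
From the definitions of the confusion matrix entries, FPR = base rate / (1 - base rate) * (1 - precision) / precision * (1 - FNR).
So if two groups have different base rates but the same precision and FNR, their FPRs have to differ.
The function name `implied_fpr` and all of the numbers below are made up for illustration.

```python
def implied_fpr(base_rate, precision, fnr):
    """Compute the FPR that follows from a group's base rate, precision, and FNR."""
    return base_rate / (1 - base_rate) * (1 - precision) / precision * (1 - fnr)

# Two hypothetical groups with the same precision and FNR, but different base rates.
fpr_low_base_rate = implied_fpr(base_rate = 0.3, precision = 0.7, fnr = 0.4)
fpr_high_base_rate = implied_fpr(base_rate = 0.5, precision = 0.7, fnr = 0.4)

print("FPR (base rate 0.3): %5.4f" % (fpr_low_base_rate))
print("FPR (base rate 0.5): %5.4f" % (fpr_high_base_rate))
```

The group with the higher base rate ends up with the higher FPR, even though the model is "equally precise" and "equally forgiving" for both groups.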

As with many things in machine learning, this situation comes down to trade-offs and expertise.
There is not an automatic way to just say: this is the right fairness metric.
It is up to you as the expert to decide what is the most appropriate set of fairness metrics to use in a specific scenario.
The "right" answer is usually to pick a set of fairness metrics that best fit your scenario
while always keeping in mind how an unfair model can change people's lives.

Here are some resources if you want to read more about the COMPAS controversy:
 - [Wikipedia Entry for the COMPAS Software](https://en.wikipedia.org/wiki/COMPAS_(software))
 - [Lecture Slides that Discuss COMPAS and Fairness](https://web.stanford.edu/class/cs329t/slides/fairness-Week1.pdf)
 - [Initial ProPublica Article](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing)
   - [More Information on ProPublica Methodology](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm)

Now that we have talked about fairness metrics,
let's see what fairness metrics will say about our census data.

Let's start by making a function that outputs a confusion matrix and stats for our classifier.

In [None]:
def show_confusion_matrix(train_features, train_labels, test_features, test_labels, title = None):
    classifier = sklearn.linear_model.LogisticRegression(random_state = 4)
    classifier.fit(train_features, train_labels)

    test_predictions = classifier.predict(test_features)
    sklearn.metrics.ConfusionMatrixDisplay.from_estimator(classifier, test_features, test_labels,
                                                          labels = [False, True],
                                                          display_labels = ['Low Income',
                                                                            'High Income'],
                                                          cmap = 'Blues', colorbar = False)
    matplotlib.pyplot.title(title)
    matplotlib.pyplot.show()

    # Output all sorts of stats.
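    # Pass the labels explicitly so we always get a 2x2 matrix,
    # even if a subset of the data happens to contain only one class.
    tn, fp, fn, tp = sklearn.metrics.confusion_matrix(test_labels, test_predictions,
                                                      labels = [False, True]).ravel()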
    p = tp + fn
    n = tn + fp
    count = p + n
    
    print("Accuracy         : %5.4f" % ((tp + tn) / count))
    print()
    print("Recall (TPR)     : %5.4f" % (tp / p))
    print("Miss Rate (FNR)  : %5.4f" % (fn / p))
    print()
    print("Fall-Out (FPR)   : %5.4f" % (fp / n))
    print("Selectivity (TNR): %5.4f" % (tn / n))
    print()
    print("Precision        : %5.4f" % (tp / (tp + fp)))
    print()
    print("Positive Rate    : %5.4f" % ((tp + fp) / count))
    print("Negative Rate    : %5.4f" % ((tn + fn) / count))

    print('---------------------------------------------')

Let's run these stats over our full test data to get a baseline for what to expect.

In [None]:
show_confusion_matrix(train_features, train_labels, test_features, test_labels, title = 'All Data')

When discussing redlining and COMPAS, we were talking about race.
But I know (from exploring the data) that this dataset has another huge bias.
Let's look at the metrics broken up by sex.

In [None]:
# Pull out test records by sex.
male_test_features = test_features[(raw_test_data['sex'] == 'Male').values]
male_test_labels = test_labels[(raw_test_data['sex'] == 'Male').values]

female_test_features = test_features[(raw_test_data['sex'] == 'Female').values]
female_test_labels = test_labels[(raw_test_data['sex'] == 'Female').values]

show_confusion_matrix(train_features, train_labels, male_test_features, male_test_labels,
                      title = 'Male')
show_confusion_matrix(train_features, train_labels, female_test_features, female_test_labels,
                      title = 'Female')

Pay particular attention to the values for False Positive Rate (FPR) and False Negative Rate (FNR)
(these metrics were called out in our COMPAS discussion).
In this case the FPR means that the system erred in the participant's favor and predicted they have a higher income (and approved them for a loan),
while the FNR means that the system erred against the participant and incorrectly predicted they had a low income (and denied their loan).

The FPR for male participants is 0.0959, while the FPR for female participants is 0.0149.
The FNR for male participants is 0.3953, while the FNR for female participants is 0.5424.

|        | FPR    | FNR    |
|--------|--------|--------|
| Male   | 0.0959 | 0.3953 |
| Female | 0.0149 | 0.5424 |

This means the model is about six times more likely to make a mistake in a man's favor (FPR) than in a woman's,
and about 37% more likely to incorrectly decide against a woman than a man (0.5424 / 0.3953 ≈ 1.37).
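
As a quick sanity check, the two ratios can be recomputed directly from the table above
(the dictionary layout is just a convenient choice for this sketch):

```python
# FPR and FNR values copied from the table above.
rates = {
    'Male': {'FPR': 0.0959, 'FNR': 0.3953},
    'Female': {'FPR': 0.0149, 'FNR': 0.5424},
}

# How much more often the model errs in a man's favor.
fpr_ratio = rates['Male']['FPR'] / rates['Female']['FPR']

# How much more often the model errs against a woman.
fnr_ratio = rates['Female']['FNR'] / rates['Male']['FNR']

print("FPR ratio (male / female): %4.2f" % (fpr_ratio))
print("FNR ratio (female / male): %4.2f" % (fnr_ratio))
```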

Here we can see that it is important to not just consider how well our model can do,
but also what happens when our classifier messes up.
Our classifier may have a higher accuracy for female participants,
but when it messes up on women it disadvantages them at higher rates than men.

A more formal way we can see this disparity is by using a fairness metric.
In this case, we will use a fairly popular one: [demographic parity](https://en.wikipedia.org/wiki/Fairness_(machine_learning)#Definitions_based_on_predicted_outcome) (also called statistical parity).
Demographic parity looks at whether the protected and unprotected groups have an equal probability of getting a positive label.

For our task, we can use demographic parity to measure disparity using:
$$
    \frac{ P( Income > 50K | Sex = Female ) }{ P( Income > 50K | Sex = Male ) }
$$

Thankfully, we already computed these values above (the positive rate for the respective sexes, i.e., the fraction of each group that the model predicts as high income).
$$
    \frac{ P( Income > 50K | Sex = Female ) }{ P( Income > 50K | Sex = Male ) } =
    \frac{ 0.0631 }{ 0.2484 } =
    0.2540 
$$

1.0 would be considered perfectly fair (and a 20% margin is often given to be "fair enough").
But we are far from that at 0.25.
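
The same check can be sketched in a few lines of code.
The function names below are hypothetical, and treating the 20% margin as a symmetric band (0.8 up to 1 / 0.8) is an assumption of this sketch, not a universal rule.
The two positive rates are the ones we computed above.

```python
def demographic_parity_ratio(protected_positive_rate, reference_positive_rate):
    """Ratio of positive rates between the protected group and the reference group."""
    return protected_positive_rate / reference_positive_rate

def is_fair_enough(ratio, margin = 0.20):
    """Assumed rule of thumb: the ratio must stay within the margin of 1.0 in either direction."""
    lower_bound = 1.0 - margin
    return lower_bound <= ratio <= (1.0 / lower_bound)

# Positive rates from the female and male confusion matrix stats above.
ratio = demographic_parity_ratio(0.0631, 0.2484)

print("Demographic Parity Ratio: %5.4f" % (ratio))
print("Fair Enough: %s" % (is_fair_enough(ratio)))
```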

Let's compare this result against two values that should be fair,
two of the unbiased zip codes.

In [None]:
# Pull out test records by zip.
zip1_test_features = test_features[(raw_test_data['zipcode'] == '11810').values]
zip1_test_labels = test_labels[(raw_test_data['zipcode'] == '11810').values]

zip2_test_features = test_features[(raw_test_data['zipcode'] == '13523').values]
zip2_test_labels = test_labels[(raw_test_data['zipcode'] == '13523').values]

show_confusion_matrix(train_features, train_labels, zip1_test_features, zip1_test_labels,
                      title = '11810')
show_confusion_matrix(train_features, train_labels, zip2_test_features, zip2_test_labels,
                      title = '13523')

$$
    \frac{ P( Income > 50K | zipcode = 11810 ) }{ P( Income > 50K | zipcode = 13523 ) } =
    \frac{ 0.1858 }{ 0.1843 } =
    1.0081
$$

As expected, for two fair values (two arbitrary zip codes),
we can see almost no disparity.

As with pretty much everything we have seen in machine learning,
fairness is complicated.
There are no clear answers, and it requires careful attention from the machine learning expert (you) to ensure that your models are fair and free of biased data.
Choosing the right data and fairness metrics can change how a model is viewed and how it predicts.

To conclude, I will leave you with a quote [of disputed origins](https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics).
Consider this quote in the context of the data and fairness metrics you can choose for your models:
> There are three kinds of lies: lies, damned lies, and statistics.