# GerryFair Tutorial

## Required Data Format

To train a model and audit it for bias, we require three dataframes. The first two are the standard `X` and `y`, which hold the samples and their labels respectively. These should be one-hot encoded. The third required dataframe (named `X_prime` in the code in this tutorial) holds the protected attributes: it contains the values of the protected attributes for each row in the sample. These are the attributes that we audit for bias. Please note that we do not promise to protect against bias with respect to attributes that are not included in this dataframe.

#### Cleaning Data
If your data is not in that format, it needs to be cleaned. We provide a method, `clean_dataset` in *clean.py*, that you can use to convert your data into the accepted format.

The variable `dataset` should hold the path to the file containing the dataset. The variable `attributes` should hold the path to the file describing the protected attributes. This file should contain a single row in which each column is 2 if it is the label, 1 if it is protected, and 0 otherwise. Set `centered` to `True` if you want the data to be centered.

In [3]:
import gerryfair
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
dataset = "./data/preprocessed.csv"
attributes = "./data/protected.csv"
centered = True
X, X_prime, y = gerryfair.clean.clean_dataset(dataset, attributes, centered)

label feature: ['completed']
sensitive features: ['country_cd_US', 'is_female', 'bachelor_obtained', 'white']


TypeError: drop() takes from 1 to 2 positional arguments but 3 were given

In [None]:
print(X)
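
For reference, the attributes file described above can be sketched with the standard library alone. This is only an illustration: the header row of column names and the unprotected `age` column are assumptions, while the remaining column names are taken from the output shown earlier. The helpers `write_attributes_file` and `read_roles` are hypothetical and are not part of `gerryfair`.

```python
import csv
import os
import tempfile

# Hypothetical column layout: "age" is an invented unprotected feature;
# the other names come from the output printed by the cleaning step.
column_roles = {
    "age": 0,                # ordinary feature
    "country_cd_US": 1,      # protected
    "is_female": 1,          # protected
    "bachelor_obtained": 1,  # protected
    "white": 1,              # protected
    "completed": 2,          # label
}


def write_attributes_file(path, roles):
    """Write a header row of column names, then one row of role codes."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(list(roles.keys()))
        writer.writerow(list(roles.values()))


def read_roles(path):
    """Read the file back and return (label columns, protected columns)."""
    with open(path, newline="") as f:
        header, codes = list(csv.reader(f))[:2]
    labels = [name for name, code in zip(header, codes) if int(code) == 2]
    protected = [name for name, code in zip(header, codes) if int(code) == 1]
    return labels, protected


attributes_path = os.path.join(tempfile.mkdtemp(), "protected.csv")
write_attributes_file(attributes_path, column_roles)
label_cols, protected_cols = read_roles(attributes_path)
print("label feature:", label_cols)
print("sensitive features:", protected_cols)
```

Reading the file back is a cheap sanity check that exactly one column is marked as the label before the file is handed to the cleaning step.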

## Using tools to train a model

Now we can use the `Model` class to train a new model. When instantiating the object, you may provide any options that you want to use when training the classifier. If you wish to change the options later, you may use the `set_options` method. Both are shown below.

In [None]:
C = 15
printflag = True
gamma = .01
fair_model = gerryfair.model.Model(C=C, printflag=printflag, gamma=gamma)
max_iters = 50
fair_model.set_options(max_iters=max_iters)

Now that we are happy with the options, we can use the `train` method to train a classifier using the Fictitious Play algorithm described in [the original paper](https://arxiv.org/abs/1711.05144v3). Training requires our three dataframes from earlier. The method returns the final errors and `fp_difference` from training.

We first split `X`, `X_prime`, and `y` into a training set and a testing set, holding out the last 50 rows for testing.

In [None]:
# Train Set
X_train = X.iloc[:X.shape[0]-50]
X_prime_train = X_prime.iloc[:X_prime.shape[0]-50]
y_train = y.iloc[:y.shape[0]-50]
# Test Set
X_test = X.iloc[-50:].reset_index(drop=True)
X_prime_test = X_prime.iloc[-50:].reset_index(drop=True)
y_test = y.iloc[-50:].reset_index(drop=True)

# Train the model
[errors, fp_difference] = fair_model.train(X_train, X_prime_train, y_train)

We can now use our model to make out-of-sample predictions. This is done using the `predict` method of the object.

In [None]:
predictions = fair_model.predict(X_test)

## Using tools to evaluate a generic model

Once we have a model, whether it is a fictitious play model or any generic model, we can use our tools to evaluate its fairness in several ways.

#### Auditing Predictions

You can audit your predictions for subgroup fairness using the `Auditor` object. These predictions can come from any arbitrary model. Auditing the predictions returns the group that failed the audit and the gamma unfairness of the predictions on that group. We will be using our predictions from the previous part.

In [None]:
auditor = gerryfair.model.Auditor()
[group, gamma_unfairness] = auditor.audit(predictions, X_prime_test, y_test)
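
To make the returned quantity concrete, here is a minimal sketch of false-positive unfairness for a single, fixed subgroup, following the definition in the paper linked above as we understand it: the group's weight `alpha` (the fraction of all samples that are in the group and have label 0) multiplied by `beta` (the gap between the overall and the group false positive rate). The helper names `fp_rate` and `fp_unfairness` and the toy data are hypothetical and are not part of `gerryfair`; the library's auditor searches over subgroups and may compute this differently.

```python
def fp_rate(predictions, labels, members=None):
    """False positive rate among true negatives, optionally within a group."""
    if members is None:
        members = [True] * len(labels)
    # Predictions on samples that are in the group and have label 0.
    negatives = [p for p, y, m in zip(predictions, labels, members)
                 if m and y == 0]
    return sum(negatives) / len(negatives) if negatives else 0.0


def fp_unfairness(predictions, labels, members):
    """Return alpha * beta for one subgroup (assumed definition, see above)."""
    n = len(labels)
    # alpha: fraction of all samples that are in the group with label 0.
    alpha = sum(1 for y, m in zip(labels, members) if m and y == 0) / n
    # beta: absolute gap between the overall and the group FP rate.
    beta = abs(fp_rate(predictions, labels)
               - fp_rate(predictions, labels, members))
    return alpha * beta


# Toy data (invented): binary labels, binary predictions, group membership.
toy_labels = [0, 0, 0, 0, 1, 1, 0, 0]
toy_predictions = [1, 1, 0, 0, 1, 0, 0, 0]
toy_in_group = [True, True, False, False, True, False, False, False]

toy_unfairness = fp_unfairness(toy_predictions, toy_labels, toy_in_group)
print(toy_unfairness)
```

Weighting the rate gap by `alpha` means that a large gap on a very small subgroup contributes little, which is why the unfairness is compared against the `gamma` threshold rather than the raw rate difference.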

#### Plotting Errors
You can also plot the errors of the model during training using the `plot_single` function in *fairness_plots.py*. Please note that these errors are returned by our fictitious play algorithm, so this plot is specifically for analyzing the effectiveness of a model trained with it.

In [None]:
gerryfair.fairness_plots.plot_single(errors, fp_difference, max_iters, gamma, C)

In [None]:
fair_model._fictitious_play(X_test, X_prime_test, y_test)