# Classical Implementation
In this notebook I train a classical (and likely biased) machine learning model on the original dataset.

Useful information for reference:

number of instances: 1000 (190 young and 810 aged)

labels: 1 is good, 2 is bad

A13 == 0 means the individual is young (aged 25 or under) and A13 == 1 means aged, once the encoding in section 0.3 has been applied

## 0.1 Read in the data and export it as a CSV

In [1]:
import re

with open('../german-credit-dataset/german.data-numeric', 'r') as infile:
    data_contents = infile.read()

# Collapse each run of spaces into a single comma
data_contents = re.sub(r'[ ]+', ",", data_contents)
# Drop the comma left at the very start of the file and at the start of every later line
data_contents = re.sub(r'^,', "", data_contents)
data_contents = re.sub(r'\n,', "\n", data_contents)
# Drop any comma left at the end of a line
data_contents = re.sub(r',\n', "\n", data_contents)

with open('../german-credit-dataset/german-numeric.csv', 'w') as outfile:
    outfile.write(data_contents)
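
To see what the four substitutions do, here is a minimal sketch that applies them to two made-up lines in a space-padded layout. The helper name `spaces_to_csv` and the sample values are assumptions for illustration and are not taken from the dataset.

```python
import re

def spaces_to_csv(text):
    """Apply the same four substitutions as the cell above to a string."""
    text = re.sub(r'[ ]+', ",", text)   # runs of spaces become commas
    text = re.sub(r'^,', "", text)      # leading comma on the first line
    text = re.sub(r'\n,', "\n", text)   # leading comma on every later line
    text = re.sub(r',\n', "\n", text)   # trailing comma before a newline
    return text

# Two hypothetical rows in a space-padded layout
sample = "   1   6   4\n   2  48   2 \n"
converted = spaces_to_csv(sample)
print(converted)
```

pandas can also read whitespace-delimited files directly through `read_csv` with `sep=r'\s+'`, which may make the intermediate CSV optional.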

## 0.2 Create a pandas dataframe holding the dataset

The two most important data structures in this notebook are the original dataset, created below, and the modified dataset that's created in Task 4 (Fair implementation).

The original dataset is split into training and testing data. The training data is later used to create a modified (fair) dataset, whilst the testing data is used only to evaluate models trained on either the original training data or the fair training data.

In [2]:
import pandas as pd

# data = pd.read_csv('../german-credit-dataset/german.csv')
data = pd.read_csv('../german-credit-dataset/german-numeric.csv', header=None)
data.columns = [
    'A1',
    'A2',
    'A3',
    'A5*',
    'A6',
    'A7',
    'A9',
    'A11',
    'A12',
    'A13',
    'A14',
    'A16',
    'A18',
    'A19',
    'A20',
    'A4????',
    'A8',
    'A10a',
    'A10b',
    'A15a',
    'A15b',
    'A17a',
    'A17b',
    'A17c',
    'label'
]


print('data read in and column names applied')

data read in and column names applied


## 0.3 Encode the age data as Young (0) and Aged (1)

In [3]:
data.loc[data.A13 <= 25, "A13"] = 0
data.loc[data.A13 > 25, "A13"] = 1
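
The second assignment does not undo the first: after the first line the young rows hold 0, which is not greater than 25, so only the remaining (aged) rows are set to 1. As a sketch, the same rule can be written as a single function. The name `encode_age` and the `threshold` parameter are assumptions for illustration.

```python
def encode_age(age, threshold=25):
    """Return 0 (young) for ages up to the threshold and 1 (aged) above it."""
    return 0 if age <= threshold else 1

ages = [19, 25, 26, 67]  # hypothetical ages
encoded = [encode_age(a) for a in ages]
print(encoded)
```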

## 3.3.2 Split the data into features and labels, then into training and testing sets

In [4]:
from sklearn.model_selection import train_test_split

features = data.iloc[:, :24] # the 24 feature columns (positions 0 to 23)
labels = data.iloc[:, 24] # the label column (position 24, i.e. the 25th column)

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.5, random_state=0) # This also shuffles the data

# print(features.head)
# print(labels.head)

## 3.3.3a Train a Naive Bayes model

In [5]:
# Import and fit a naive bayes model
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

## 3.3.3b Evaluate the model

In [6]:
# We also want to measure the disparate impact of a model
def disparate_impact(trained_model, sensitive_value, sensitive_column, desired_label, X_test, y_test):
    # y_test is kept in the signature for consistency; only the model's predictions are used
    is_unprivileged = X_test[sensitive_column] == sensitive_value

    # measure the rate of good outcomes among the unprivileged applicants
    predictions_up = trained_model.predict(X_test[is_unprivileged])
    good_up = (predictions_up == desired_label).sum()/len(predictions_up)

    # measure the rate of good outcomes among the privileged applicants
    predictions_p = trained_model.predict(X_test[~is_unprivileged])
    good_p = (predictions_p == desired_label).sum()/len(predictions_p)

    return good_up/good_p
    

# disparate_impact(nb_classifier, 0, 'A13', 1, X_test, y_test)
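
As a worked illustration of the ratio, the sketch below computes disparate impact from plain lists instead of a trained model. The function name `disparate_impact_from_lists` and the eight predictions are made up for this example. A value of 1 means both groups receive the desired label at the same rate, and a value below 1 means the unprivileged group receives it less often.

```python
def disparate_impact_from_lists(predictions, groups, sensitive_value, desired_label):
    """Ratio of desired-label rates: unprivileged group over privileged group."""
    unprivileged = [p for p, g in zip(predictions, groups) if g == sensitive_value]
    privileged = [p for p, g in zip(predictions, groups) if g != sensitive_value]
    rate_up = sum(p == desired_label for p in unprivileged) / len(unprivileged)
    rate_p = sum(p == desired_label for p in privileged) / len(privileged)
    return rate_up / rate_p

# Hypothetical predictions (1 = good, 2 = bad) and A13 values (0 = young, 1 = aged)
predictions = [1, 2, 2, 2, 1, 1, 1, 2]
groups      = [0, 0, 0, 0, 1, 1, 1, 1]

# 1 of 4 young rows and 3 of 4 aged rows are predicted good
di = disparate_impact_from_lists(predictions, groups, sensitive_value=0, desired_label=1)
print(di)
```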

In [7]:
from sklearn import metrics

def evaluate(trained_model, X_test, y_test):
    predictions = trained_model.predict(X_test)
    print(f'Accuracy: {metrics.accuracy_score(y_test, predictions)}')
    print(f'Disparate impact of classifier: {disparate_impact(trained_model, 0, "A13", 1, X_test, y_test)}')
    print('Classification report:')
    print(metrics.classification_report(y_test, predictions, target_names=['Good','Bad']))
    print('Confusion matrix:')
    print(metrics.confusion_matrix(y_test, predictions))

In [8]:
evaluate(nb_classifier, X_test, y_test)

Accuracy: 0.686
Disparate impact of classifier: 0.4015925480769231
Classification report:
              precision    recall  f1-score   support

        Good       0.84      0.68      0.75       350
         Bad       0.48      0.70      0.57       150

   micro avg       0.69      0.69      0.69       500
   macro avg       0.66      0.69      0.66       500
weighted avg       0.73      0.69      0.70       500

Confusion matrix:
[[238 112]
 [ 45 105]]
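
The figures in the report can be recovered from the confusion matrix by hand. The sketch below does so, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, both in the order (Good, Bad).

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
matrix = [[238, 112],
          [45, 105]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Precision: correct predictions of a class / all predictions of that class
precision_good = matrix[0][0] / (matrix[0][0] + matrix[1][0])
precision_bad = matrix[1][1] / (matrix[0][1] + matrix[1][1])

# Recall: correct predictions of a class / all true members of that class
recall_good = matrix[0][0] / sum(matrix[0])
recall_bad = matrix[1][1] / sum(matrix[1])

print(round(accuracy, 3))
print(round(precision_good, 2), round(recall_good, 2))
print(round(precision_bad, 2), round(recall_bad, 2))
```

The rounded values agree with the accuracy and the Good and Bad rows of the classification report.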


## 3.3.4 Subsample a new dataset and retrain the model

In [9]:
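# the aged group has 810 entries, 590 of which have the positive class
# the young group has 190 entries, 110 of which have the positive class

# each of the four groups (young +, young -, aged +, aged -) is sampled down to 80 rows,
# the size of the smallest group (young -, with 190 - 110 = 80 entries)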

def resample(dataset):
    young_group = dataset[dataset['A13'] == 0]

    young_pos_group = young_group[young_group['label'] == 1]
    young_pos_sample = young_pos_group.sample(n=80, random_state=0)

    young_neg_group = young_group[young_group['label'] == 2]
    young_neg_sample = young_neg_group.sample(n=80, random_state=0)

    aged_group = dataset[dataset['A13'] == 1]

    aged_pos_group = aged_group[aged_group['label'] == 1]
    aged_pos_sample = aged_pos_group.sample(n=80, random_state=0)

    aged_neg_group = aged_group[aged_group['label'] == 2]
    aged_neg_sample = aged_neg_group.sample(n=80, random_state=0)

    data_resampled = pd.concat([young_pos_sample, young_neg_sample, aged_pos_sample, aged_neg_sample])
    return data_resampled

# Resample the dataset
data_resampled = resample(data)

# Split the resampled dataset into training and testing data
features_resampled = data_resampled.iloc[:, :24]
labels_resampled = data_resampled.iloc[:, 24]
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(features_resampled, labels_resampled, test_size=0.5, random_state=0) # This also shuffles the data

# Train a Naive Bayes classifier on the resampled dataset
nb_resampled = GaussianNB()
nb_resampled.fit(X_train_resampled, y_train_resampled)
evaluate(nb_resampled, X_test_resampled, y_test_resampled)

# Measure the disparate impact of the classifier trained on the resampled dataset
# disparate_impact(nb_resampled, 0, 'A13', 1, X_test_resampled, y_test_resampled)

Accuracy: 0.675
Disparate impact of classifier: 1.0368532955350815
Classification report:
              precision    recall  f1-score   support

        Good       0.72      0.62      0.67        84
         Bad       0.64      0.74      0.68        76

   micro avg       0.68      0.68      0.68       160
   macro avg       0.68      0.68      0.67       160
weighted avg       0.68      0.68      0.67       160

Confusion matrix:
[[52 32]
 [20 56]]


# Fairness adjustment

## 3.4.0 Discrimination Measure
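We use the KCDM measure to quantify the discrimination present in a dataset: the rate of the desired label in the privileged (aged) group minus its rate in the unprivileged (young) group.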

In [10]:
def test_discrimination(dataset, sensitive_value, sensitive_column, desired_class):
    young_group = dataset[dataset[sensitive_column] == sensitive_value]
    young_pos_group = young_group[young_group['label'] == desired_class]
    aged_group = dataset[dataset[sensitive_column] != sensitive_value]
    aged_pos_group = aged_group[aged_group['label'] == desired_class]

    # print(young_group.shape[0])
    # print(young_pos_group.shape[0])
    # print(aged_group.shape[0])
    # print(aged_pos_group.shape[0])

    discrimination = aged_pos_group.shape[0] / aged_group.shape[0] - young_pos_group.shape[0] / young_group.shape[0]
    return discrimination

print(test_discrimination(data, 0, 'A13', 1))
print(test_discrimination(data_resampled, 0, 'A13', 1))


0.14944769330734242
0.0
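
Both printed values can be checked by hand from the group counts quoted earlier in the notebook (810 aged with 590 good, 190 young with 110 good, and 80 rows per group after resampling). The helper name `discrimination` is an assumption for this sketch.

```python
def discrimination(privileged_pos, privileged_total, unprivileged_pos, unprivileged_total):
    """Desired-label rate of the privileged group minus that of the unprivileged group."""
    return privileged_pos / privileged_total - unprivileged_pos / unprivileged_total

# Original dataset: 590 of 810 aged and 110 of 190 young applicants are labelled good
original = discrimination(590, 810, 110, 190)

# Resampled dataset: 80 good and 80 bad rows in each age group
resampled = discrimination(80, 160, 80, 160)

print(round(original, 4), resampled)
```

The first value is roughly 0.1494, in line with the output above, and the second is exactly zero because both groups have the same share of good labels after resampling.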


## Apply the CND algorithm

The first step is to create a new dataset by concatenating X_train and y_train. This dataset will then be modified to reduce its bias.

In [11]:
training_data = pd.concat((X_train, y_train), axis=1)

test_discrimination(training_data, 0, 'A13', 1)

# print(new_dataset)

# a = new_dataset[new_dataset['label'] == 1]
# b = new_dataset[new_dataset['A13'] == 1]
# c = a[a['A13'] == 1]

# a.shape, b.shape, c.shape

0.12919896640826867

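We want to add a column of positive-class probabilities to this dataset. The `rank` function fits a fresh Naive Bayes classifier on the dataset it is given and uses it to score every row.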

In [12]:
def rank(dataset, sensitive_value, sensitive_column, desired_label):
    # Train a classifier using all the data available
    features = dataset.iloc[:, :24] # the 24 feature columns (positions 0 to 23)
    labels = dataset.iloc[:, 24] # the label column (position 24)

    nb_classifier2 = GaussianNB()
    nb_classifier2.fit(features, labels)

    # Calculate the probabilities R[x] for x in D and store them in a new column.
    # classes_ is sorted, so look up which probability column belongs to desired_label.
    desired_index = list(nb_classifier2.classes_).index(desired_label)
    dataset['rank_score'] = nb_classifier2.predict_proba(features)[:, desired_index]

    # Combine the conditions into single boolean masks instead of chaining two lookups
    is_sensitive = dataset[sensitive_column] == sensitive_value
    is_desired = dataset['label'] == desired_label

    # Unprivileged rows without the desired label, most likely positive first
    candidates_for_promotion = dataset[is_sensitive & ~is_desired].sort_values('rank_score', ascending=False)

    # Privileged rows with the desired label, least likely positive first
    candidates_for_demotion = dataset[~is_sensitive & is_desired].sort_values('rank_score', ascending=True)

    # All remaining rows keep their labels
    rest_of_dataset = pd.concat([dataset[is_sensitive & is_desired], dataset[~is_sensitive & ~is_desired]])

    return candidates_for_promotion, candidates_for_demotion, rest_of_dataset

# rank(new_dataset, 0, 'A13', 1)
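
The ordering that `rank` produces can be illustrated without pandas. In the sketch below the six rows and their scores are made up, and `rank_score` stands for an assumed probability of the desired label.

```python
# Hypothetical rows: (row id, A13 group, label, rank_score)
rows = [
    ("a", 0, 2, 0.35), ("b", 0, 2, 0.80), ("c", 0, 1, 0.90),
    ("d", 1, 1, 0.55), ("e", 1, 1, 0.95), ("f", 1, 2, 0.20),
]
sensitive_value, desired_label = 0, 1

# Young rows labelled bad, highest score first: the best candidates for promotion
promotion = sorted(
    (r for r in rows if r[1] == sensitive_value and r[2] != desired_label),
    key=lambda r: r[3], reverse=True)

# Aged rows labelled good, lowest score first: the best candidates for demotion
demotion = sorted(
    (r for r in rows if r[1] != sensitive_value and r[2] == desired_label),
    key=lambda r: r[3])

promotion_ids = [r[0] for r in promotion]
demotion_ids = [r[0] for r in demotion]
print(promotion_ids, demotion_ids)
```

Rows `c` and `f` fall into neither list, which corresponds to the part of the dataset that keeps its labels.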

In [13]:
def cnd(dataset, sensitive_value, sensitive_column, desired_label):

    candidates_for_promotion, candidates_for_demotion, rest_of_dataset = rank(dataset, sensitive_value, sensitive_column, desired_label)
    # Work on copies so that relabelling never writes into a slice of the input
    candidates_for_promotion = candidates_for_promotion.copy()
    candidates_for_demotion = candidates_for_demotion.copy()

    # Calculate how many swaps we need
    young_group = dataset[dataset[sensitive_column] == sensitive_value]
    s = len(young_group)
    young_pos_group = young_group[young_group['label'] == desired_label]
    s_pos = len(young_pos_group)

    aged_group = dataset[dataset[sensitive_column] != sensitive_value]
    s_hat = len(aged_group)
    aged_pos_group = aged_group[aged_group['label'] == desired_label]
    s_hat_pos = len(aged_pos_group)
    # A negative result would mean the unprivileged group is already favoured, so no swaps are made
    swaps_required = max(0, round(( (s * s_hat_pos) - (s_hat * s_pos) ) / (s + s_hat)))

    # Promote the top-ranked young candidates and demote the bottom-ranked aged candidates
    label_position = candidates_for_promotion.columns.get_loc('label')
    candidates_for_promotion.iloc[:swaps_required, label_position] = desired_label
    candidates_for_demotion.iloc[:swaps_required, label_position] = 2 # 2 is the "bad" label in this dataset

    print(f'{swaps_required} swaps were required to reduce the bias in the dataset')

    new_dataset = pd.concat([rest_of_dataset, candidates_for_promotion, candidates_for_demotion])
    return new_dataset

data_cnd = cnd(training_data, 0, 'A13', 1)

9 swaps were required to reduce the bias in the dataset
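
The swap count comes from requiring both groups to end up with the same share of desired labels. The sketch below checks that property using the counts quoted earlier for the full dataset, not the training split that produced the 9 swaps above, whose counts are not printed. The helper name `swaps_needed` is an assumption.

```python
def swaps_needed(s, s_pos, s_hat, s_hat_pos):
    """Number of promotion/demotion pairs, using the same formula as cnd."""
    return round((s * s_hat_pos - s_hat * s_pos) / (s + s_hat))

# Counts for the full dataset
s, s_pos = 190, 110          # young applicants, young labelled good
s_hat, s_hat_pos = 810, 590  # aged applicants, aged labelled good

m = swaps_needed(s, s_pos, s_hat, s_hat_pos)

# Each swap adds one good label to the young group and removes one from the aged group
young_rate = (s_pos + m) / s
aged_rate = (s_hat_pos - m) / s_hat
print(m, young_rate, aged_rate)
```

On these counts the formula gives a whole number, so the two rates coincide. When it has to be rounded, as is likely on the training split, a small residual discrimination can remain.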


In [14]:
X_train_cnd = data_cnd.iloc[:, :24] # the 24 feature columns (positions 0 to 23)
Y_train_cnd = data_cnd.iloc[:, 24] # the label column (position 24); rank_score at position 25 is left out

print(test_discrimination(data_cnd, 0, 'A13', 1))

0.002808673182788435


Train a new model on the relabelled (less biased) training data

In [15]:
nb_cnd = GaussianNB()
nb_cnd.fit(X_train_cnd, Y_train_cnd)

GaussianNB(priors=None, var_smoothing=1e-09)

# Evaluation and comparison

In [16]:
print('Evaluating the original (biased) classifier:')
evaluate(nb_classifier, X_test, y_test)

print('\n\nEvaluating the CND-trained classifier:')
evaluate(nb_cnd, X_test, y_test)

Evaluating the original (biased) classifier:
Accuracy: 0.686
Disparate impact of classifier: 0.4015925480769231
Classification report:
              precision    recall  f1-score   support

        Good       0.84      0.68      0.75       350
         Bad       0.48      0.70      0.57       150

   micro avg       0.69      0.69      0.69       500
   macro avg       0.66      0.69      0.66       500
weighted avg       0.73      0.69      0.70       500

Confusion matrix:
[[238 112]
 [ 45 105]]


Evaluating the CND-trained classifier:
Accuracy: 0.69
Disparate impact of classifier: 0.7761834319526627
Classification report:
              precision    recall  f1-score   support

        Good       0.81      0.73      0.77       350
         Bad       0.49      0.61      0.54       150

   micro avg       0.69      0.69      0.69       500
   macro avg       0.65      0.67      0.65       500
weighted avg       0.71      0.69      0.70       500

Confusion matrix:
[[254  96]
 [ 59  91]]