# Classical Implementation
In this notebook I have trained a classical (and likely biased) machine learning model on the original dataset

Useful information for reference:

number of instances: 1000 (190 young and 810 aged)

labels: 1 is good, 2 is bad

A13 == 0 means the individual is young

## 0.1 Read in the data and export it as a CSV

In [94]:
import re

with open('../german-credit-dataset/german.data-numeric', 'r') as infile:
    data_contents = infile.read()    
    data_contents = re.sub(r'[ ]+', ",", data_contents)
    data_contents = re.sub(r'^,', "", data_contents)
    data_contents = re.sub(r'\n,', "\n", data_contents)
    data_contents = re.sub(r',\n', "\n", data_contents)
    # data_contents = re.sub(r'^,|\n,|,\n', "\n", data_contents)

    with open('../german-credit-dataset/german-numeric.csv', 'w') as outfile:
        outfile.write(data_contents)

## 0.2 Create a pandas dataframe holding the dataset

The two most important data structures in this notebook are the original dataset, created below, and the modified dataset that's created in Task 4 (Fair implementation).

The original dataset will be split into training and testing data, the training data will be used to create a modified (fair) dataset later whilst the testing data will not be used except for evaluating models trained on either the original training data or the fair training data.

In [95]:
import pandas as pd

# data = pd.read_csv('../german-credit-dataset/german.csv')
data = pd.read_csv('../german-credit-dataset/german-numeric.csv', header=None)
data.columns = [
    'A1',
    'A2',
    'A3',
    'A5*',
    'A6',
    'A7',
    'A9',
    'A11',
    'A12',
    'A13',
    'A14',
    'A16',
    'A18',
    'A19',
    'A20',
    'A4????',
    'A8',
    'A10a',
    'A10b',
    'A15a',
    'A15b',
    'A17a',
    'A17b',
    'A17c',
    'label'
]


print('data read in and column names applied')

data read in and column names applied


## 0.3 Encode the age data as Young (0) and Aged (1)

In [96]:
data.loc[data.A13 <= 25, "A13"] = 0
data.loc[data.A13 > 25, "A13"] = 1

## 3.3.2 Split the data into Features and labels and into training and testing

In [97]:
from sklearn.model_selection import train_test_split

features = data.iloc[:, :24] # columns 0 to 24
labels = data.iloc[:, 24] # column 25

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=0) # This also shuffles the data

# print(features.head)

# print(labels.head)

## 3.3.3a Train a Naive Bayes model

In [98]:
# Import and fit a naive bayes model
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

## 3.3.3b Evaluate the model

In [99]:
from sklearn import metrics

def evaluate(trained_model, X_test, y_test):
    predictions = trained_model.predict(X_test)
    print(f'Accuracy: {metrics.accuracy_score(y_test, predictions)}\n')
    print('Classification report:')
    print(metrics.classification_report(y_test, predictions, target_names=['Good','Bad'])[:166])
    print('Confusion matrix:')
    print(metrics.confusion_matrix(y_test, predictions))

In [100]:
evaluate(nb_classifier, X_test, y_test)

Accuracy: 0.7266666666666667

Classification report:
              precision    recall  f1-score   support

        Good       0.85      0.75      0.80       214
         Bad       0.52      0.67      0.59        86

  
Confusion matrix:
[[160  54]
 [ 28  58]]


## 3.3.4 Subsample a new dataset and retrain the model

In [101]:
# the aged group has 810 entries, 590 have the positive class
# the young group has 190 entries, 110 have the positive class

# sampled data should have 95 young +, young -, old +, old -?

young_group = data[data['A13'] <= 25]
young_pos_group = young_group[young_group['label'] == 1]
young_neg_group = young_group[young_group['label'] == 2]

aged_group = data[data['A13'] > 25]
aged_pos_group = aged_group[aged_group['label'] == 1]
aged_neg_group = aged_group[aged_group['label'] == 2]

data_resampled = pd.concat([young_pos_group, young_neg_group, aged_pos_group, aged_neg_group])

# TODO check that the index gets redone

features_new = data_resampled.iloc[:, :24] # columns 0 to 19
labels_new = data_resampled.iloc[:, 24] # column 20

# print(features_new.head)
# print(labels_new.head)


X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(features_new, labels_new, test_size=0.3, random_state=0) # This also shuffles the data

nb_classifier_resampled_dataset = GaussianNB()
nb_classifier_resampled_dataset.fit(X_train_new, y_train_new)
evaluate(nb_classifier_resampled_dataset, X_test_new, y_test_new)

 # The accuracy is now terrible.

Accuracy: 0.38666666666666666

Classification report:
              precision    recall  f1-score   support

        Good       0.77      0.13      0.23       203
         Bad       0.34      0.92      0.49        97

  
Confusion matrix:
[[ 27 176]
 [  8  89]]


# Fairness adjustment

## 3.4.0 Discrimination Measure
We use the KCDM measure to test the Discrimination level present within the dataset.

In [102]:
def test_discrimination(data):
    young_group = data[data['A13'] == 0]
    young_pos_group = young_group[young_group['label'] == 1]
    aged_group = data[data['A13'] == 1]
    aged_pos_group = aged_group[aged_group['label'] == 1]

    # print(young_group.shape[0])
    # print(young_pos_group.shape[0])
    # print(aged_group.shape[0])
    # print(aged_pos_group.shape[0])

    discrimination = aged_pos_group.shape[0] / aged_group.shape[0] - young_pos_group.shape[0] / young_group.shape[0]
    return discrimination

print(test_discrimination(data))


0.14944769330734242


## Apply the CND algorithm

The first step is to create a new dataset which is a concatenation of X_train and y_train, this will be modified to become unbiased

In [103]:
new_dataset = pd.concat((X_train, y_train), axis=1)

print(new_dataset)

     A1  A2  A3  A5*  A6  A7  A9  A11  A12  A13  ...  A4????  A8  A10a  A10b  \
105   2  24   4  119   1   3   3    3    3    1  ...       0   0     0     1   
68    4  36   2   18   1   3   3    4    4    1  ...       0   0     1     0   
479   1  15   4   15   1   5   3    4    3    1  ...       0   0     1     0   
399   4  24   4   15   4   3   2    1    1    1  ...       0   0     1     0   
434   1   9   2   21   1   3   3    2    1    0  ...       0   0     1     0   
..   ..  ..  ..  ...  ..  ..  ..  ...  ...  ...  ...     ...  ..   ...   ...   
835   1  12   0   11   1   3   3    4    3    1  ...       1   0     1     0   
192   2  27   2   39   1   3   3    2    3    1  ...       0   0     1     0   
629   4   9   2   38   5   5   3    4    1    1  ...       0   0     1     0   
559   2  18   4   19   1   2   3    2    1    1  ...       0   0     1     0   
684   2  36   3   99   2   4   3    3    2    1  ...       0   0     1     0   

     A15a  A15b  A17a  A17b  A17c  labe

We want to add a list of label probabilities to this using the pre-built classifier

In [107]:
def rank(dataset, sensitive_value, sensitive_column, desired_label):
    features = dataset.iloc[:, :24] # columns 0 to 24
    labels = dataset.iloc[:, 24] # column 25

    nb_classifier2 = GaussianNB()
    nb_classifier2.fit(features, labels)

    def nb_predict(row):
        '''
        INPUT: A row from the feature data
        RETURNS: The probability of that row belonging to the positive class
        '''
        a = row.values
        a = a.reshape(1,-1)
        ps = nb_classifier2.predict_proba(a)
        return ps[0][0]

    # Calculate the probabilities R[x] for x in D and store them in a new column
    new_dataset['rank_score'] = features.apply(nb_predict, axis=1, result_type='expand')
    new_dataset['label'] = labels
    # We also add indices for reference
    # new_dataset['new_index'] = range(len(new_dataset))

    candidates_for_promotion = new_dataset[new_dataset[sensitive_column] == sensitive_value][new_dataset['label'] != desired_label]
    print(candidates_for_promotion.shape)
    candidates_for_promotion.sort_values('rank_score', inplace=True, ascending=False) # FIXME if it doesn't work swap the ascending value
    
    candidates_for_demotion = new_dataset[new_dataset[sensitive_column] != sensitive_value][new_dataset['label'] == desired_label]
    print(candidates_for_demotion.shape)
    candidates_for_demotion.sort_values('rank_score', inplace=True, ascending=True) # FIXME if it doesn't work swap the ascending value

    return candidates_for_promotion, candidates_for_demotion

In [108]:
rank(new_dataset, 0, 'A13', '1')

(131, 26)
(0, 26)


(     A1  A2  A3  A5*  A6  A7  A9  A11  A12  A13  ...  A8  A10a  A10b  A15a  \
 613   1  24   1   36   1   3   2    4    3    0  ...   1     0     0     1   
 930   1  24   2   17   1   2   3    1    2    0  ...   0     0     1     0   
 280   4  15   4   34   4   5   3    4    4    0  ...   1     1     0     1   
 258   4  15   2   38   2   2   2    4    3    0  ...   1     1     0     0   
 476   4  39   2   26   3   3   3    4    3    0  ...   1     1     0     0   
 ..   ..  ..  ..  ...  ..  ..  ..  ...  ...  ...  ...  ..   ...   ...   ...   
 63    2  48   0  144   1   3   3    2    3    0  ...   0     1     0     0   
 887   2  48   2  157   1   3   3    2    3    0  ...   0     1     0     0   
 633   4   9   2   20   1   2   2    2    3    0  ...   0     0     1     1   
 618   2  30   2   34   2   3   2    4    3    0  ...   0     0     1     1   
 59    1  36   4   62   1   2   2    4    4    0  ...   0     0     1     1   
 
      A15b  A17a  A17b  A17c  label    rank_score 

Now we want to identify two groups, CP and CD
We then want to swap the labels of the corresponding rows in the new_dataset

In [106]:



print(len(candidates_for_promotion))
print(len(candidates_for_demotion))

# Sort the groups by their rank_score
candidates_for_promotion.sort_values('rank_score', inplace=True)
candidates_for_promotion.sort_values('rank_score', inplace=True)


NameError: name 'candidates_for_promotion' is not defined

In [12]:
# Calculate how many swaps we need
young_group = data[data['A13'] == 0]
young_group_good = young_group[young_group['Score'] == 1]
aged_group = data[data['A13'] == 1]
aged_group_good = aged_group[aged_group['Score'] == 1]

swaps_required = ( (young_group.shape[0] * aged_group_good.shape[0]) - (aged_group.shape[0] * young_group_good.shape[0]) ) / (young_group.shape[0] + aged_group.shape[0])

cp_indices = list(candidates_for_promotion['new_index'])
cd_indices = list(candidates_for_demotion['new_index'])

# Make that many swaps
for i in range(1, int(swaps_required)):
    i_cp = cp_indices[i]
    new_dataset.iloc[i_cp].Score = 1
    i_cd = cd_indices[i]
    new_dataset.iloc[i_cd].Score = 2

# Drop the columns we created
new_dataset.drop(columns=['new_index', 'rank_score'])
X_train = new_dataset.iloc[:, :24] # columns 0 to 24
y_train = new_dataset.iloc[:, 24] # last column

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [13]:
# Retest the discrimination of this dataset
print(test_discrimination(new_dataset))

131
79
569
407
0.11223654731080368


Train a new, unbiased model

In [14]:
# Import and fit a naive bayes model
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [15]:
nb_predictions = nb_classifier.predict(X_test)

In [16]:
print(f'Accuracy of the Naive Bayes classifier {metrics.accuracy_score(y_test, nb_predictions)}\n')

print(f'Classification report for Naive Bayes (0:died, 1:recovered):')
print(metrics.classification_report(y_test, nb_predictions, target_names=['Good','Bad'])[:166])

Accuracy of the Naive Bayes classifier 0.7266666666666667

Classification report for Naive Bayes (0:died, 1:recovered):
              precision    recall  f1-score   support

        Good       0.85      0.75      0.80       214
         Bad       0.52      0.67      0.59        86

  


# Evaluation and comparison